
Identification of Fake News Articles

Rohun Kulkarni Stephanie Moon


Dept. of Mechanical Engineering Dept. of Mechanical Engineering
Stanford University Stanford University
rohun@stanford.edu sjmoon@stanford.edu

Abstract

Fake news is a type of yellow journalism that deliberately misinforms the audience through disinformation
or hoaxes. With the rise of social media, the amount of fake news on the web has grown rapidly. To
distinguish such fake news from accurate news, we build three supervised binary classifiers: a support
vector machine, a neural network, and a logistic regression model. We preprocess the text and construct
features from n-grams. The SVM and neural network perform well, with test accuracies of 87.83% and
86.88%, respectively, while logistic regression performs poorly at 53.51% accuracy. Future work includes
more detailed text preprocessing and feature generation to obtain more informative indicators of fake news.

1 Introduction
The diffusion of inaccurate and outright false news stories has been a recurring issue since the prolif-
eration of social media. Manually checking and corroborating news articles is a time-intensive process.
Recently, there has been significant interest in automating the detection of fake news. Several papers
have shown that fake news can be distinguished from accurate news media through analysis of the text
corpus.
We have developed classifiers using a support vector machine, a neural network, and logistic regression,
based on existing datasets that contain a large collection of recent news articles. We preprocess the text
and choose features as the top n-grams, or sequences of consecutive words. For each example, we weight
the occurrences of the features using a Term Frequency-Inverse Document Frequency (TF-IDF) metric.
Thus, the input to our classifiers is a matrix whose rows are examples and whose columns are TF-IDF
weights, and the output is a predicted label of fake (positive) or real (negative).

2 Related Work
A diverse array of techniques exists for analyzing a text corpus with natural language processing methods.
Feature extraction via tokenization of text is well documented and commonly seen in spam filters; it
typically involves removing punctuation, capitalization, and stop words, followed by lemmatization
(Gupta et al.). At a higher level, sentiment analysis is gaining traction as a method for identifying an
author's tone and bias toward the content. Horne et al. used this method to identify features such as bias
and morality.
While our approach focuses exclusively on the text corpus of the article, researchers have found success
in combining textual analysis with higher-level analysis. Baly et al. analyzed articles based on text
structure and sentiment as well as the reliability of the source, combining this with textual analysis of
features from the Horne feature set; the underlying notion is that a source known for fake news will most
likely continue to deliver fake news. Previous CS 229 projects have examined the problem of fake news
identification as well. Choudary et al. employed similar textual tokenization for feature extraction and
applied an averaging hypothesis method to several classifiers (SVM, Naive Bayes, and neural networks)
to improve their predictions. Our approach differs in its simplicity while demonstrating improved
accuracy, likely due to more detailed textual processing.
3 Dataset and Features
Our real news dataset, gathered by Chan on Kaggle, consists of articles from 2016 published by news
organizations rated mixed to very high for factual reporting on the Media Bias/Fact Check website (Media
Bias). Sources include the New York Times, Washington Post, CNN, and Fox News. Our fake news dataset,
gathered by Risdal on Kaggle, consists of 2016 articles from websites tagged as unreliable ("bullshit")
by the BS Detector Chrome extension. Each article contains roughly 2,000-4,000 words, and we have
27,985 articles in total, about 12,000 real and 15,000 fake (some entries were blank or corrupted). Each
example is labeled with a ground-truth value of fake or real.
Our data required significant preprocessing. Using the NLTK library, we preprocessed the text in the
following steps: (1) remove whitespace, (2) remove punctuation, (3) remove numbers, (4) remove stop
words such as prepositions and linking verbs, and (5) tokenize and lemmatize the text. The tokenizer
uses an unsupervised learning algorithm pre-trained on a large corpus to determine word relations. We
then randomized and split the data into 60:20:20 training, cross-validation, and test sets.
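A rough sketch of this pipeline using NLTK follows; the exact cleaning order, the lowercasing step, and the specific NLTK resources named here are our assumptions for illustration rather than a record of the exact implementation.

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the NLTK resources used below.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text):
    """Clean one article: strip numbers and punctuation, drop stop words, lemmatize."""
    text = text.lower()                                                 # normalize case
    text = re.sub(r"\d+", " ", text)                                    # (3) remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))   # (2) remove punctuation
    tokens = nltk.word_tokenize(text)                                   # (1)/(5) split on whitespace, tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]                 # (4) remove stop words
    return " ".join(LEMMATIZER.lemmatize(t) for t in tokens)            # (5) lemmatize
```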
To extract features, we generated n-grams, sequences of consecutive words, across all examples. We
found that using n = 1, 2, 3 (unigrams, bigrams, and trigrams) gave the best results. A basic way of
constructing the input data would be to count the number of occurrences of each feature in each example.
However, some features occur frequently across all examples. To normalize for this, we used a TF-IDF
library in Python to identify and select features from the entire text corpus. The Term Frequency-Inverse
Document Frequency weight increases with how often a feature appears in a document and is offset by
the number of documents containing it, which ensures that words used frequently across all documents
in the corpus receive less weight. The weights were computed by:
$$\omega_{i,j} = \mathrm{tf}_{i,j} \times \log\left(\frac{N}{\mathrm{df}_i}\right)$$
where $\mathrm{tf}_{i,j}$ is the number of occurrences of term $i$ in document $j$, $\mathrm{df}_i$ is the
number of documents containing term $i$, and $N$ is the total number of documents.

The 1,000 n-grams with the highest occurrences were used as features. Our input data was a matrix with
one row per example and one column per chosen feature, containing the TF-IDF weights. Examples of
features are shown in Table 1.
n-gram          TF-IDF weight     n-gram          TF-IDF weight
trump           2517.19           like            1061.43
said            2093.68           new york        1053.46
clinton         1476.73           elect           1030.61
donald trump    1397.60           new             1005.54
peopl           1236.11           look like        488.62
one             1184.72           social media     439.41
unit state      1182.80           prime minist     394.00

Table 1: Examples of unigrams, bigrams, and trigrams used as features with their respective Term
Frequency-Inverse Document Frequency (TF-IDF) weights for an example article. The TF-IDF weights
are calculated by evaluating a feature's occurrences in a document relative to all documents.
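The report only states that a TF-IDF library in Python was used; as one possible sketch, the feature matrix and the 60:20:20 split can be built with scikit-learn's TfidfVectorizer (note that scikit-learn's IDF term is smoothed slightly differently from the formula above, so the exact weights would differ).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# texts: list of preprocessed article strings; labels: 1 = fake, 0 = real.
vectorizer = TfidfVectorizer(ngram_range=(1, 3),   # unigrams, bigrams, and trigrams
                             max_features=1000)    # keep the 1,000 most frequent n-grams
X = vectorizer.fit_transform(texts)                # rows = examples, columns = TF-IDF weights

# 60:20:20 training / cross-validation / test split.
X_train, X_rest, y_train, y_rest = train_test_split(X, labels, test_size=0.4, random_state=0)
X_cv, X_test, y_cv, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
```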

4 Methods
4.1 Support Vector Machine
Here, we used a linear SVM with a hinge loss function. A support vector machine maps feature vectors to
points in a high-dimensional space and separates the classes with a maximum-margin hyperplane.
The cost function is as follows:
$$J(\Theta) = C \sum_{i=1}^{m} \left[ y^{(i)}\,\mathrm{cost}_1(\Theta^T x^{(i)}) + (1 - y^{(i)})\,\mathrm{cost}_0(\Theta^T x^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^{n} \Theta_j^2$$

where $\mathrm{cost}_0(z) = \max(0,\, k(1+z))$, $\mathrm{cost}_1(z) = \max(0,\, k(1-z))$,
and $C = \frac{1}{\lambda}$ acts as the regularization term.

Unlike logistic regression, the hypothesis is a discriminant function: $h_\Theta(x) = 1$ if $\Theta^T x \geq 0$
and $h_\Theta(x) = 0$ otherwise. Although various nonlinear kernels could be used, such as polynomial,
Gaussian, and sigmoid kernels, we used a linear kernel due to computing resource limitations.
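A minimal sketch of such a linear hinge-loss SVM using scikit-learn's LinearSVC follows; the particular implementation is an assumption, C = 0.1 is the value settled on in Section 5.1, and the train/test splits come from the feature-extraction sketch above.

```python
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Linear SVM with hinge loss; C = 1/lambda controls the regularization strength.
svm = LinearSVC(loss="hinge", C=0.1)
svm.fit(X_train, y_train)

print("train accuracy:", accuracy_score(y_train, svm.predict(X_train)))
print("test accuracy:", accuracy_score(y_test, svm.predict(X_test)))
```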

4.2 Neural Network


Neural networks are capable of producing highly complex decision boundaries and are effective for
nonlinear classification. They loosely model the human brain through layers of interconnected nodes.
The input consists of the set of features, in our case the TF-IDF scores for the highest-scoring n-grams.
Linear combinations of the inputs are passed through a sigmoid activation in the hidden layer, and linear
combinations of the hidden-layer outputs are passed through a sigmoid output unit to produce the
hypothesis, which is thresholded to obtain a prediction. We implemented a one-hidden-layer neural
network with the sigmoid activation function given by:

$$g(z) = \frac{1}{1 + e^{-z}}$$
A feature size of 1,000 features with 1-, 2-, and 3-grams was used, based on the positive results in the
support vector machine experiments. The cost function is:

$$J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log\big((h_\Theta(x^{(i)}))_k\big) + (1 - y_k^{(i)}) \log\big(1 - (h_\Theta(x^{(i)}))_k\big) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big(\Theta_{j,i}^{(l)}\big)^2$$

The network weights were learned by minimizing the cost function via backpropagation. The weights were
randomly initialized to values in the range $[-\epsilon, \epsilon]$, where
$$\epsilon = \frac{\sqrt{6}}{\sqrt{L_{\mathrm{input}} + L_{\mathrm{output}}}}$$
and $L_{\mathrm{input}}$ and $L_{\mathrm{output}}$ are the sizes of the layers on either side of the weight
matrix. Random initialization is used to break symmetry in the neural network. The accuracy of the
network was measured after adjusting the hidden layer size and regularization parameter.
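The following is a minimal NumPy sketch of the network described above, combining the random initialization, forward pass, and backpropagation of the regularized cross-entropy cost. It assumes dense inputs and plain batch gradient descent; the learning rate alpha is a hypothetical value, and our actual implementation details may differ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_weights(l_in, l_out):
    # Break symmetry: uniform values in [-eps, eps], eps = sqrt(6) / sqrt(l_in + l_out).
    eps = np.sqrt(6.0) / np.sqrt(l_in + l_out)
    return np.random.uniform(-eps, eps, size=(l_out, l_in + 1))  # +1 for the bias column

def train_nn(X, y, hidden=10, lam=1.0, alpha=0.5, iters=3000):
    """One-hidden-layer sigmoid network trained by batch gradient descent / backprop.

    X is a dense (m, n) array of TF-IDF features; y holds labels in {0, 1}.
    The learning rate alpha is a hypothetical value, not taken from the report.
    """
    m, n = X.shape
    Theta1 = init_weights(n, hidden)               # (hidden, n + 1)
    Theta2 = init_weights(hidden, 1)               # (1, hidden + 1)
    Xb = np.hstack([np.ones((m, 1)), X])           # prepend bias term
    y = np.asarray(y).reshape(-1, 1)
    for _ in range(iters):
        # Forward pass
        a1 = sigmoid(Xb @ Theta1.T)                # hidden activations, (m, hidden)
        a1b = np.hstack([np.ones((m, 1)), a1])
        h = sigmoid(a1b @ Theta2.T)                # hypothesis, (m, 1)
        # Backpropagation of the regularized cross-entropy cost above
        d2 = h - y
        d1 = (d2 @ Theta2[:, 1:]) * a1 * (1.0 - a1)
        grad2 = d2.T @ a1b / m
        grad1 = d1.T @ Xb / m
        grad2[:, 1:] += (lam / m) * Theta2[:, 1:]  # regularize non-bias weights
        grad1[:, 1:] += (lam / m) * Theta1[:, 1:]
        Theta2 -= alpha * grad2
        Theta1 -= alpha * grad1
    return Theta1, Theta2

def predict_nn(Theta1, Theta2, X):
    m = X.shape[0]
    a1 = sigmoid(np.hstack([np.ones((m, 1)), X]) @ Theta1.T)
    h = sigmoid(np.hstack([np.ones((m, 1)), a1]) @ Theta2.T)
    return (h >= 0.5).astype(int).ravel()
```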

4.3 Logistic Regression


For our third model, we used logistic regression. Logistic regression computes class probabilities using a
hypothesis model; the class is then determined by the most likely estimate ($y = 1$ if $h_\theta(x) \geq 0.5$
and $y = 0$ otherwise). We used the sigmoid function $g(z)$ for the hypothesis and ran gradient descent
on the cost function $J(\theta)$ to optimize the weights $\theta$.

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$$
where every step of gradient descent updates
$$\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}$$
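A minimal NumPy sketch of this classifier, using batch gradient descent with the α = 15 and 2,000 iterations reported in Section 5.3 (no intercept term, dense inputs assumed), could look like:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, alpha=15.0, num_iters=2000):
    """Batch gradient descent on the cross-entropy cost J(theta) above.

    X: (m, n) feature matrix of TF-IDF weights; y: (m,) labels in {0, 1}.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)
        grad = (X.T @ (h - y)) / m   # dJ/dtheta for the cost above
        theta -= alpha * grad
    return theta

def predict(theta, X):
    # Predict fake (1) when h_theta(x) >= 0.5, real (0) otherwise.
    return (sigmoid(X @ theta) >= 0.5).astype(int)
```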
5 Experiments, Results, Discussion
The overall performance accuracy for the three models can be found in Table 2.

                        Training Accuracy    Test Accuracy
SVM                     88.77%               87.83%
Neural Network          88.04%               86.88%
Logistic Regression     53.71%               53.51%

Table 2: Summary of results for the three models.

5.1 SVM
The SVM achieved 88.77% accuracy on the training set and 87.83% accuracy on the test set. We plotted
learning curves to tune the parameters. We found that using 5,000 features, 1-, 2-, and 3-grams, and
C = 1 resulted in high variance (overfitting), as shown in Figure 1a. We reduced the number of features
and decreased C (i.e., increased regularization) to reduce overfitting. We also found that 1-, 2-, and
3-grams performed slightly better than 1- and 2-grams as feature types. Our final model therefore used
1,000 features, 1-, 2-, and 3-grams, and C = 0.1 (Figure 1b). The confusion matrix for the final model is
shown in Table 3.

Figure 1: Learning curves for SVM parameter tuning: (a) high-variance SVM model, (b) optimized SVM
model. We tuned the model from the overfitted configuration in (a) to the optimized one in (b).

                Predicted Fake    Predicted Real
Actual Fake          2867              275
Actual Real           406             2049

Table 3: Confusion matrix for the SVM with optimized parameters.
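For reference, a short sketch of how the confusion matrix and the learning curves used for tuning can be produced with scikit-learn; it assumes the trained svm model and the train/test splits from the earlier sketches.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import learning_curve
from sklearn.svm import LinearSVC

# Confusion matrix on the held-out test set (labels: 1 = fake, 0 = real).
y_pred = svm.predict(X_test)
print(confusion_matrix(y_test, y_pred, labels=[1, 0]))

# Learning curve used to diagnose high variance: training vs. cross-validation
# accuracy as a function of the number of training examples.
sizes, train_scores, cv_scores = learning_curve(
    LinearSVC(loss="hinge", C=0.1), X_train, y_train,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)
print(train_scores.mean(axis=1), cv_scores.mean(axis=1))
```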

5.2 Neural Network


The neural network achieved 88.04% accuracy on the training set and 86.88% on the test set. The
hyperparameters for the network were tuned using learning curves, comparing training accuracy with
cross-validation accuracy as shown in Figure 2. These learning curves indicated that the model had high
variance. The best-performing parameters were a hidden layer of size 10, λ = 1.0, and 3,000 iterations.
The neural network's ability to produce a complex decision boundary made it well suited for the
classification task. Additionally, the hidden layer being much smaller than the input feature size
prevented specific features in the training set from dominating the model.
Figure 2: Learning curve for 1-layer neural network with 1000 features, λ = 0

5.3 Logistic Regression


The logistic regression classifier performed poorly on both the training and test sets. The highest
accuracy achieved on the training set was 53.71%, without regularization. Given that the training dataset
was composed of 56.1% real articles, it is clear that the logistic regression classifier was not performing
suitably. Here, we tuned the hyperparameters to α = 15 over 2,000 iterations. In comparison, both the
support vector machine and the neural network are capable of producing higher-order decision boundaries.
Moreover, logistic regression is influenced by every training example, whereas the SVM's large-margin
hinge loss means that only the support vectors near the hyperplane determine the decision boundary,
which makes it more robust on this dataset. Because neural networks effectively stack layers of logistic
units, they are better suited to complex decision boundaries.

Figure 3: Logistic regression cost function with increasing iteration.

6 Future Work and Conclusion


This paper has demonstrated the effectiveness of simple machine learning techniques in classifying fake
news articles. When using metrics like TF-IDF, classifiers capable of producing complex decision
boundaries excelled. Both the SVM and the neural network produced high test accuracies (87.83% and
86.88%, respectively) but were prone to high variance. Logistic regression proved incapable of producing
an adequate decision boundary, performing worse on this dataset than a classifier that labels every article
as real.
Further work on the dataset and feature extraction could yield improvements in the classifier. Roughly
5% of the articles in the dataset were in languages other than English, including Russian, Arabic, and
about a dozen others. Textual features beyond n-grams, such as capitalization and punctuation, could
also yield better markers (a short sketch of such features follows this paragraph); they could be
incorporated into the classifier in a straightforward manner by analyzing the text prior to tokenization.
Several articles that could be classified as satirical were included in the fake news dataset. It is important
to distinguish between these and more insidious articles that seek to spread false information. Capturing
these more complex nuances of the dataset requires higher-order analysis.
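As an illustration of the kind of pre-tokenization markers meant above (the feature names and choices are hypothetical and not part of our pipeline):

```python
import string

def stylistic_features(raw_text):
    """Simple pre-tokenization markers: share of capitalized characters and punctuation."""
    n = max(len(raw_text), 1)
    return {
        "caps_ratio": sum(c.isupper() for c in raw_text) / n,
        "punct_ratio": sum(c in string.punctuation for c in raw_text) / n,
        "exclaim_count": raw_text.count("!"),
    }
```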
Our model's effectiveness is limited by the timeliness of its content. Because of the political nature of
today's fake news, the approach would struggle with future news stories that do not contain the same
token words. Sentiment and contextual analysis have proven to be effective tools for identifying sources
of fake news. By incorporating features related to bias, tone, and source reliability, a classifier could
generalize better, though these tools would require significantly improved labeled datasets.
As information is distributed at an ever-faster rate, the need to verify its accuracy grows increasingly
important. Complex NLP techniques are used for the ambitious goals of determining sentiment and
intention. These techniques rely on features that require labeling of tone, bias, and morality, and they
demand significant processing. Although such features could improve performance, our work shows that
relatively simple machine learning methods can be employed with a minimal amount of labeling to
produce high-accuracy classifiers.

7 Contributions
Both authors contributed equally to data gathering, data preprocessing, and writing of the report. While
both contributed to all algorithms, Rohun focused more on the neural network algorithm while Stephanie
focused more on the SVM and logistic regression algorithms.

References
Baly, Ramy, et al. "Predicting Factuality of Reporting and Bias of News Media Sources." arXiv preprint
arXiv:1810.01765 (2018).

Castillo, Carlos, Marcelo Mendoza, and Barbara Poblete. "Information credibility on Twitter." Proceedings
of the 20th International Conference on World Wide Web. ACM, 2011.

Chan, A. Kaggle, October 2018. https://www.kaggle.com/anthonyc1/gathering-real-news-for-oct-dec-2016

Choudary, D., Mone, S., and Singhania, A. "Fake News Identification." CS 229 project report, 2017.

Gupta, Vishal, and Gurpreet S. Lehal. "A survey of text mining techniques and applications." Journal of
Emerging Technologies in Web Intelligence 1.1 (2009): 60-76.

Media Bias/Fact Check, LLC. 2018. https://mediabiasfactcheck.com/

Risdal, M. Kaggle, 2016. https://www.kaggle.com/mrisdal/fake-news
