Abstract
Fake news is a type of yellow journalism that deliberately misinforms its audience through
disinformation or hoaxes. With the rise of social media, the amount of fake news on the web has
increased. To distinguish fake news from accurate news, we build three supervised binary
classifiers: a support vector machine (SVM), a neural network (NN), and logistic regression.
We preprocess the text and create features from n-grams. The SVM and NN reach accuracies of 87.83%
and 86.88%, respectively, while logistic regression performs poorly at 53.51% accuracy. Future
work includes more detailed text preprocessing and feature generation to obtain more informative
indicators of fake news.
1 Introduction
The diffusion of inaccurate and outright false news stories has been a recurring issue since the
proliferation of social media. Manually checking and corroborating news articles is a time-intensive
process, so there has recently been significant interest in automating the detection of fake news.
Several papers have shown that fake news can be distinguished from accurate news through analysis
of the text corpus.
We have developed classifiers using a support vector machine, a neural network, and logistic
regression, trained on existing datasets that feature a large collection of recent news articles.
We preprocess the text and choose as features the top n-grams, or sequences of consecutive words.
For each example, we rate the occurrence of the features using a Term Frequency-Inverse Document
Frequency (TF-IDF) metric. Thus, the input to our classifiers is a matrix whose rows are examples
and whose columns are TF-IDF weights. The output is a predicted classification label of fake
(positive) or real (negative).
2 Related Work
A diverse array of techniques exists for analyzing a text corpus with Natural Language Processing
methods. Feature extraction from text via tokenization is well documented and commonly seen in
spam filters. It involves removing punctuation, capitalization, and stopwords, followed by
lemmatization (Gupta et al.). At a higher level, sentiment analysis is gaining traction as a method
for identifying an author's tone and bias towards the content. Horne et al. used this method to
identify features such as bias and morality.
While our approach focuses exclusively on the text corpus of the article, researchers have found
success in combining it with higher-level analysis. Baly et al. analyzed articles based on text
structure, sentiment, and the reliability of the source, and combined this with textual analysis
using the Horne feature set. Their approach rests on the notion that a source known for fake news
will most likely continue to deliver fake news. Previous projects for CS 229 have examined the
problem of fake news identification as well. Choudari et al. employed similar textual tokenization
for feature extraction and applied an averaging hypothesis method to several classifiers (SVM,
Naive Bayes, and neural networks) to improve their predictions. Our approach differs in its
simplicity while demonstrating improved accuracy, likely due to more detailed textual processing.
3 Dataset and Features
Our real news dataset, gathered by Chan on Kaggle, consists of articles from 2016 published by news
companies with mixed to very high factual reporting ratings on the Media Bias/Fact Check website
(Media Bias). It includes sources such as the New York Times, Washington Post, CNN, and Fox News.
Our fake news dataset, gathered by Risdal on Kaggle, consists of news articles from 2016 taken from
various websites and tagged as bullshit by the BS Detector Chrome Extension. Each article contained
about 2,000-4,000 words, and we had a total of 27,985 articles, about 12,000 from real news and
15,000 from fake news (some entries were blank or corrupted). Each example was labeled with a
ground truth value of fake or real.
Our data required significant preprocessing. We preprocessed the text with the NLTK NLP library in
the following steps: (1) remove whitespace, (2) remove punctuation, (3) remove numbers, (4) remove
stop words such as prepositions and linking verbs, (5) tokenize the text (lemmatization). The
tokenization library uses an unsupervised learning algorithm pre-trained on a large corpus of words
to determine word relations. We then shuffled the data and split it 60:20:20 into training,
cross-validation, and test sets.
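The preprocessing and splitting steps above can be sketched with the Python standard library alone. This is only an illustration under stated assumptions: the stop-word list is a tiny hypothetical stand-in for NLTK's resources, lowercasing is added as an assumption, lemmatization (step 5) is left out, and the function names `preprocess` and `split_60_20_20` are ours rather than the project's.

```python
import random
import re
import string

# Tiny illustrative stop-word list (an assumption); the project relies on NLTK instead.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "is", "are", "was", "were", "and"}


def preprocess(text):
    """Apply steps (1)-(4) and split the text into tokens; lemmatization is omitted."""
    text = text.lower()  # assumption: case folding is not listed among the steps
    text = text.translate(str.maketrans("", "", string.punctuation))  # (2) punctuation
    text = re.sub(r"\d+", " ", text)  # (3) numbers
    tokens = text.split()  # (1) collapses whitespace and yields word tokens
    return [t for t in tokens if t not in STOP_WORDS]  # (4) stop words


def split_60_20_20(examples, seed=0):
    """Shuffle the examples and split them into training, cross-validation, and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = (6 * len(shuffled)) // 10
    n_cv = (2 * len(shuffled)) // 10
    train = shuffled[:n_train]
    cv = shuffled[n_train:n_train + n_cv]
    test = shuffled[n_train + n_cv:]
    return train, cv, test


tokens = preprocess("The  2 cats, in 2016, were on the mat!")
train, cv, test = split_60_20_20(range(10))
```

Fixing the seed makes the split reproducible, which helps when comparing classifiers on the same partitions.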
To extract features, we generated n-grams, sequences of n consecutive words, across all examples.
We found that using n = 1, 2, 3 (unigrams, bigrams, and trigrams) led to the best results.
A basic way of constructing the input data would have been to count the number of occurrences of
each feature in each example. However, some features occur more frequently across all the examples.
To normalize, we used a TF-IDF library in Python to identify and select features in the entire text
corpus. The Term Frequency-Inverse Document Frequency weight increases proportionally as a feature
is used more frequently in a document, and is offset by the number of documents containing that
feature. This ensures that words used frequently across all documents in the corpus receive less
weight. The weights were computed by:
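$$\omega_{i,j} = \mathrm{tf}_{i,j} \times \log\left(\frac{N}{\mathrm{df}_i}\right)$$
where $\mathrm{tf}_{i,j}$ is the number of occurrences of $i$ in $j$, $\mathrm{df}_i$ is the number
of documents containing $i$, and $N$ is the total number of documents.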
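As a worked illustration of the weight formula, the sketch below generates 1-, 2-, and 3-grams and applies the formula directly. It is not the library the project used: the helper names and the three toy documents are hypothetical, and library implementations often add smoothing and normalization that this plain version omits.

```python
import math
from collections import Counter


def ngrams(tokens, n_values=(1, 2, 3)):
    """Return every n-gram, as a space-joined string, for each n in n_values."""
    grams = []
    for n in n_values:
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams


def tfidf_weights(documents):
    """Compute w[i, j] = tf[i, j] * log(N / df[i]) for each n-gram i in document j."""
    term_counts = [Counter(ngrams(doc)) for doc in documents]
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())  # each document counts once per n-gram
    n_docs = len(documents)
    return [
        {gram: tf * math.log(n_docs / doc_freq[gram]) for gram, tf in counts.items()}
        for counts in term_counts
    ]


# Toy, already-preprocessed documents (hypothetical data).
docs = [
    ["trump", "said", "new", "york"],
    ["clinton", "said", "elect"],
    ["new", "york", "social", "media"],
]
weights = tfidf_weights(docs)
```

Here "trump" appears in one of three documents and receives weight log(3), while "said" and "new york" each appear in two documents and receive the smaller weight log(3/2).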
The 1000 n-grams with the highest number of occurrences were used as features. Our input data was a
matrix whose rows are examples and whose columns are the TF-IDF weights of the chosen features.
Examples of features are shown in Table 1.
n-gram          TF-IDF weight    n-gram          TF-IDF weight
trump           2517.19          like            1061.43
said            2093.68          new york        1053.46
clinton         1476.73          elect           1030.61
donald trump    1397.60          new             1005.54
peopl           1236.11          look like        488.62
one             1184.72          social media     439.41
unit state      1182.80          prime minist     394.00
Table 1: Examples of unigrams, bigrams, and trigrams used as features with their respective Term
Frequency - Inverse Document Frequency (TF-IDF) weights for an example article. The TF-IDF weights
are calculated by evaluating the features’ occurrences in a document relative to all documents.
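The feature selection and input matrix described in this section could be assembled roughly as follows. This is a minimal sketch: the function names, the toy n-gram lists, and the toy weights are hypothetical, and the project itself selected 1000 features rather than the 2 used here.

```python
from collections import Counter


def select_top_ngrams(doc_ngrams, k):
    """Return the k n-grams with the highest total occurrence count over all documents."""
    totals = Counter()
    for grams in doc_ngrams:
        totals.update(grams)
    return [gram for gram, _ in totals.most_common(k)]


def build_feature_matrix(doc_weights, features):
    """One row per document and one column per feature; missing n-grams get weight 0."""
    return [[weights.get(f, 0.0) for f in features] for weights in doc_weights]


# Hypothetical n-gram lists and TF-IDF weights for three documents.
doc_ngrams = [
    ["trump", "said", "trump"],
    ["said", "clinton"],
    ["trump", "said", "trump"],
]
doc_weights = [
    {"trump": 2.0, "said": 0.5},
    {"said": 0.5, "clinton": 1.1},
    {"trump": 2.0, "said": 0.5},
]
features = select_top_ngrams(doc_ngrams, k=2)
matrix = build_feature_matrix(doc_weights, features)
```

Each row of `matrix` is the input vector for one article, with columns in the order given by `features`.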
4 Methods
4.1 Support Vector Machine
Here, we used a linear SVM with a "hinge" loss function. A support vector machine maps features to
points in a high-dimensional space and separates them into classes using a hyperplane with a large
margin. The cost function is as follows:
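$$J(\theta) = C \sum_{i=1}^{m} \left[ y^{(i)} \, \mathrm{cost}_1(\theta^T x^{(i)}) + (1 - y^{(i)}) \, \mathrm{cost}_0(\theta^T x^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^{n} \theta_j^2$$
where $\mathrm{cost}_0(z) = \max(0, k(1 + z))$, $\mathrm{cost}_1(z) = \max(0, k(1 - z))$,
and $C = \frac{1}{\lambda}$ is the regularization term.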
The hypothesis, unlike that of logistic regression, is a discriminant function ($h_\theta(x) = 1$ if
$\theta^T x \geq 0$, and $h_\theta(x) = 0$ otherwise). Although various nonlinear kernels could be
used, such as polynomial, Gaussian, and sigmoid kernels, we used a linear one due to limited
computing resources.
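To make the hinge costs and the discriminant hypothesis concrete, the following sketch evaluates the linear SVM cost for a given weight vector. It assumes k = 1 and that C multiplies the data term, as in the standard formulation; it has no intercept term and does not perform training. All function names are illustrative.

```python
def cost1(z, k=1.0):
    """Hinge cost for positive examples (y = 1); zero once z >= 1."""
    return max(0.0, k * (1.0 - z))


def cost0(z, k=1.0):
    """Hinge cost for negative examples (y = 0); zero once z <= -1."""
    return max(0.0, k * (1.0 + z))


def dot(theta, x):
    """Inner product theta^T x."""
    return sum(t * xi for t, xi in zip(theta, x))


def svm_cost(theta, X, y, C=1.0):
    """C times the summed hinge costs plus half the squared norm of theta."""
    data_term = sum(
        yi * cost1(dot(theta, xi)) + (1 - yi) * cost0(dot(theta, xi))
        for xi, yi in zip(X, y)
    )
    reg_term = 0.5 * sum(t * t for t in theta)
    return C * data_term + reg_term


def predict(theta, x):
    """Discriminant hypothesis: 1 (fake) if theta^T x >= 0, otherwise 0 (real)."""
    return 1 if dot(theta, x) >= 0 else 0


X = [[2.0, 0.0], [0.0, 2.0]]
y = [1, 0]
separating_cost = svm_cost([1.0, -1.0], X, y)  # both examples lie outside the margin
zero_cost = svm_cost([0.0, 0.0], X, y, C=0.1)  # both examples incur a hinge cost of 1
```

A smaller C shrinks the penalty on margin violations relative to the weight penalty, which corresponds to stronger regularization.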
4.2 Neural Network
The sigmoid function $g$ is defined as:
$$g(z) = \frac{1}{1 + e^{-z}}$$
We used 1000 features drawn from 1-, 2-, and 3-grams, based on the positive results of this
configuration in the support vector machine experiments. The cost function was calculated with the
formula:
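$$J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log\left((h_\Theta(x^{(i)}))_k\right) + (1 - y_k^{(i)}) \log\left(1 - (h_\Theta(x^{(i)}))_k\right) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left(\Theta_{j,i}^{(l)}\right)^2$$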
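The regularized cross-entropy cost can be evaluated as in the sketch below, given the network outputs for each example. The function name and argument layout are assumptions made for illustration, and the weight matrices are assumed to have their bias weights already excluded from the penalty.

```python
import math


def nn_cost(predictions, labels, weight_matrices, lam):
    """Regularized cross-entropy cost of a neural network.

    predictions[i][k] is the k-th network output for example i, strictly between 0 and 1.
    labels[i][k] is the corresponding 0/1 target.
    weight_matrices holds one matrix (list of rows) per layer, without bias weights.
    """
    m = len(predictions)
    data_term = 0.0
    for h, y in zip(predictions, labels):
        for hk, yk in zip(h, y):
            data_term += yk * math.log(hk) + (1 - yk) * math.log(1 - hk)
    reg_term = sum(w * w for W in weight_matrices for row in W for w in row)
    return -data_term / m + lam / (2 * m) * reg_term


# Two examples, one output unit, and a single 1x2 weight matrix (hypothetical values).
predictions = [[0.5], [0.5]]
labels = [[1], [0]]
weight_matrices = [[[1.0, 2.0]]]
unregularized = nn_cost(predictions, labels, weight_matrices, lam=0.0)
regularized = nn_cost(predictions, labels, weight_matrices, lam=2.0)
```

With outputs of 0.5 the unregularized cost equals log(2), and the penalty with lam = 2 adds (2 / 4) * (1 + 4) = 2.5.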
The weights for the features were calculated by minimizing the cost function via backpropagation.
The weights were initialized to random values in the range $[-\epsilon, \epsilon]$, where
$$\epsilon = \frac{\sqrt{6}}{\sqrt{L_{output} + L_{input}}}$$
and $L$ is the size of a layer. Random initialization is used to break symmetry in the neural
network. The accuracy of the network was measured after adjusting the hidden layer size and
regularization parameters.
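The initialization rule can be sketched as follows. The function name, the layer sizes in the example, and the omission of a bias column are assumptions made for illustration only.

```python
import math
import random


def init_weights(l_in, l_out, rng=None):
    """Return an l_out x l_in weight matrix with entries drawn uniformly from [-eps, eps]."""
    rng = rng or random.Random()
    eps = math.sqrt(6) / math.sqrt(l_in + l_out)
    return [[rng.uniform(-eps, eps) for _ in range(l_in)] for _ in range(l_out)]


# Example: 1000 input features feeding a hypothetical hidden layer of 25 units.
W = init_weights(1000, 25, rng=random.Random(0))
eps = math.sqrt(6) / math.sqrt(1000 + 25)
```

Because every entry is drawn independently, hidden units start with different weights and therefore receive different gradients during backpropagation.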
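4.3 Logistic Regression
The logistic regression hypothesis applies the sigmoid function to a linear combination of the
features:
$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$
The cost function is
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$$
where every step of gradient descent, with learning rate $\alpha$, updates:
$$\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}$$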
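The hypothesis, cost, and update rule can be put together in a short batch gradient descent loop. The sketch below is an illustration rather than the project's implementation: the toy data, the learning rate of 0.5, and the 200 iterations are arbitrary assumptions, and the first column of X is a constant 1 acting as an intercept.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def cost(theta, X, y):
    """Average cross-entropy cost J(theta)."""
    total = 0.0
    for xi, yi in zip(X, y):
        h = predict_proba(theta, xi)
        total += yi * math.log(h) + (1 - yi) * math.log(1 - h)
    return -total / len(X)


def gradient_step(theta, X, y, alpha):
    """One update theta_j := theta_j - alpha * dJ/dtheta_j, for all j simultaneously."""
    m = len(X)
    errors = [predict_proba(theta, xi) - yi for xi, yi in zip(X, y)]
    return [
        t - alpha * sum(e * xi[j] for e, xi in zip(errors, X)) / m
        for j, t in enumerate(theta)
    ]


# Toy data: label 1 when the second feature is at least 2.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 0, 1, 1]
theta = [0.0, 0.0]
initial_cost = cost(theta, X, y)
for _ in range(200):
    theta = gradient_step(theta, X, y, alpha=0.5)
final_cost = cost(theta, X, y)
```

The partial derivative used in `gradient_step` is the average of (h - y) times the j-th feature, which follows from differentiating the cost above.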
5 Experiments, Results, Discussion
The overall accuracy of the three models can be found in Table 2.
5.1 SVM
The SVM achieved 88.77% accuracy on the training set and 87.83% accuracy on the test set. We
plotted learning curves to tune the parameters. With 5000 features, 1,2,3-grams, and C = 1, the
model showed high variance (overfitting), as shown in Figure 1a. To reduce overfitting, we
decreased the number of features and lowered C, which strengthens regularization. We also found
that 1,2,3-grams performed slightly better than 1,2-grams as feature types. Thus, our final model
used 1000 features, 1,2,3-grams, and C = 0.1 (Figure 1b). The confusion matrix for the final model
is shown in Table 3.
Figure 1: Learning curves for SVM parameter tuning. We tuned from an overfitted model in (a) to an
optimized model in (b).
               Predicted Fake    Predicted Real
Actual Fake    2867              275
Actual Real    406               2049
Table 3: Confusion matrix for the final SVM model.
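The confusion matrix can be summarized with a few standard metrics, treating fake as the positive class. The function name below is illustrative, and precision and recall are not reported in the text; they are derived here only from the four counts above.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Accuracy, precision, and recall from the four cells of a binary confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Fake is the positive class: 2867 true positives, 275 false negatives,
# 406 false positives, and 2049 true negatives.
metrics = confusion_metrics(tp=2867, fn=275, fp=406, tn=2049)
accuracy_percent = round(100 * metrics["accuracy"], 2)
```

The accuracy computed this way is 4916 / 5597, which rounds to the 87.83% test accuracy reported for the SVM.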
7 Contributions
Both authors contributed equally to data gathering, data preprocessing, and writing of the report. While
both contributed to all algorithms, Rohun focused more on the neural network algorithm while Stephanie
focused more on the SVM and logistic regression algorithms.
References
Baly, Ramy, et al. "Predicting Factuality of Reporting and Bias of News Media Sources." arXiv preprint
arXiv:1810.01765 (2018).
Castillo, Carlos, Marcelo Mendoza, and Barbara Poblete. "Information credibility on twitter." Proceedings of the
20th international conference on World wide web. ACM, 2011.