
Automatic Fake News Detection System

Waleed Ahmed 2014410

Saif Ali Khan 2014310

Haris Jamil 2014112

Mansoor Naseer 2014162

Advisor: Dr. Masroor Hussain

Co-Advisor: Dr. Fawad Hussain

Faculty of Computer Sciences and Engineering

Ghulam Ishaq Khan Institute of Engineering Sciences and Technology

Certificate of Approval

It is certified that the work presented in this report was performed by Waleed
Ahmed, Saif Ali Khan, Haris Jamil and Mansoor Naseer under the
supervision of Dr. Masroor Hussain and Dr. Fawad Hussain. The work is
adequate and lies within the scope of the BS degree in Computer Engineering
at Ghulam Ishaq Khan Institute of Engineering Sciences and Technology.

---------------------

Dr. Masroor Hussain

(Advisor)

-------------------

Dr. Fawad Hussain

(Co-Advisor)

-------------------

Dr. Khalid Siddiqui

(Dean)

ABSTRACT
Fake news is a growing problem in the modern world; it aims at swaying the
opinion of the vast majority of people who use social media on a day-to-day
basis. This project aims to address the problem of fake news on the internet.
The project is a web-based application that determines whether a news article
is fake or credible, using different machine learning models trained on a
large dataset. The web application takes a URL as input from the user,
extracts the relevant text from the page using a web crawler, and then
extracts feature vectors from the text using NLP. Machine learning models are
then applied to the feature vectors to classify the news article as fake or
credible.

ACKNOWLEDGEMENTS
We would like to express our profound gratitude to our supervisor, Dr.
Masroor Hussain, for sharing his perspective and experience on this subject.
We put great effort into this project, but success would have been nearly
impossible without his guidance and motivation; he was always available to
help and kept a close check on our progress.

Furthermore, we are thankful to our co-advisor, Dr. Fawad Hussain, and the
Dean, Dr. Khalid Siddiqui. Their discussions were invaluable, especially in
the research and development phase of the project, and we are grateful for
the quality time they spared for us in every situation.

Thanks a lot for your motivation and encouragement!

TABLE OF CONTENTS

CHAPTER I
1. Introduction
1.1 Purpose
1.2 Product Scope
CHAPTER II
2. Literature Review
2.1 Literature Survey
2.2 Approach
2.3 Previous Work
2.4 Data Collection
CHAPTER III
3. Design
3.1 Overview
3.2 Product Functions
3.3 User Characteristics
3.4 Constraints
3.5 User Requirements
3.6 Performance Requirements
3.7 Use Case Diagrams
3.8 UML Diagrams
CHAPTER IV
4. Proposed Solution
4.1 Methodology
4.2 Training
4.3 Server-side Implementation
4.4 Database Design
4.5 Schedule
4.6 Technological Aspects
CHAPTER V
5. Results and Discussion
5.1 Take a Valid News Article URL (FR-01)
5.2 Extract Relevant Text from URL (FR-02)
5.3 Extracting Features from Relevant Text (FR-03)
5.4 Applying Machine Learning Algorithms for Classification (FR-04)
5.5 Store Classification Result in Database (FR-05)
5.6 User Login and Sign-up (FR-06)
5.7 User Feedback (FR-07)
5.8 Verifying Results (FR-08)
5.9 Retraining of Machine Learning Models (FR-09)
5.10 Non-Functional Requirements Achieved
CHAPTER VI
6. Conclusion and Future Work
6.1 Conclusion
6.2 Future Work
GLOSSARY
REFERENCES
APPENDIX

LIST OF FIGURES

CHAPTER II
Figure 2.3-1 Accuracy Comparison with Research Papers
CHAPTER III
Figure 3.1-1 Layered Architecture
Figure 3.7-1 Use Case Diagram 1
Figure 3.7-2 Use Case Diagram 2
Figure 3.7-3 Use Case Diagram 3
Figure 3.8-1 Component Diagram
Figure 3.8-2 ER Diagram
Figure 3.8-3 Data Flow Diagram
CHAPTER IV
Figure 4.1-1 Gantt Chart
Figure 4.1-2 Workflow Diagram
Figure 4.2-1 Accuracy vs Number of Features
Figure 4.2-2 Accuracy vs SVM Kernel
Figure 4.2-3 Accuracy vs Depth of Random Forest and Decision Tree
Figure 4.2-4 Accuracy vs Train/Test Split
Figure 4.2-5 Feature Reduction (Graph)

LIST OF TABLES

Table 1.2-1 Terms used in this document and their description
Table 2.1-1 Naive Bayes Result
Table 2.2-1 News Sources Credibility List
Table 2.4-1 Dataset Sources
Table 3.6-1 Performance Requirements
Table 4.1-1 Features with Importance
Table 4.2-1 Feature Reduction (Tabulated)
Table 4.2-2 Decision Tree
Table 4.2-3 Random Forest
Table 4.2-4 SVM
Table 4.2-5 Accuracy vs Training Algorithms
Table 4.6-1 Programming Languages
Table 4.6-2 Libraries and Frameworks
Table 4.6-3 Miscellaneous
Table 5.10-1 Performance Requirements
Table 5.10-2 Security Requirements
Table 5.10-3 Usability Requirements
Table 6.2-1 Work Breakdown Structure

CHAPTER I

1. Introduction
1.1 Purpose
Analyzing and detecting fake news on the internet is one of the hardest
problems to solve. Recently, fake news has become an important topic among
the general public and researchers because of online media outlets such as
social media feeds, blogs and online newspapers. According to a BBC survey
of more than 16,000 adults, conducted by GlobeScan, 79 percent of people are
worried about telling what is fake from what is real online. GlobeScan's
chairman Doug Miller said: "These poll findings suggest that the era of
'fake news' may be as significant in reducing the credibility of on-line
information as Edward Snowden's 2013 National Security Agency (NSA)
surveillance revelations were in reducing people's comfort in expressing
their opinions online". In another incident, Apple's stock took a temporary
10-point hit after a false report surfaced on CNN's iReport that Steve Jobs
had had a heart attack.

In light of such incidents, it is clear that fake news can have drastic
effects, even on a country's economy. To minimize these effects, news must
be verified; the purpose of our project is to detect fake news automatically.

1.2 Product Scope
The scope of our product is to detect fake news in online articles using
machine learning. Our fake news detector uses purely linguistic features to
detect fake news in content, and we combine different machine learning
models for better accuracy. Our project is particularly relevant to social
media platforms such as Facebook and Twitter, because a major portion of the
world's population has access to them, and fake news influences those
people's decision making, which can lead to serious mistakes.

Table 1.2-1 Terms used in this document and their description

Name Description
NLP Natural Language Processing
URL Uniform Resource Locator
ML Machine Learning

CHAPTER II

2. Literature Review

2.1 Literature Survey


Fake news is a burning issue in today's world. The variety of media
available for spreading news in today's high-tech world has made it very
easy to spread misleading information. Most misleading information is spread
through social media, but it sometimes starts circulating in the mainstream
media as well. An analysis by BuzzFeed found that the top 20 fake news
stories about the 2016 U.S. presidential election received more engagement
on Facebook than the top 20 election stories from 19 major media outlets
(Chang, Lefferman, Pedersen and Martz, "When Fake News Stories Make Real
News Headlines", Nightline, ABC News, November 29, 2016).

Much research has been done on detecting deception and falsehood by machine.
Most of this work is concerned with classifying online reviews and publicly
available social media posts. Detecting 'fake news' has been a hot topic
since the American presidential election and has drawn the attention of the
public and researchers since then. Much of this research uses simple
classification models such as Naive Bayes, which show very promising results
in classifying fake news but have the drawback that they classify only on
the basis of the words occurring in a news article. Some research is based
on the Argument Interchange Format, which models an argument as a network of
connected nodes of information (claims and data, which we model as premises
and evidence) and schemes (warrants or rules of inference, which we model as
a particular conclusion or stance); this graph-theoretic approach also keeps
track of provenance in argumentation schemes. The results of the Naive Bayes
approach are given below.

Table 2.1-1 Naive Bayes Result

2.2 Approach
In this project we classify fake news based on purely linguistic features.
There has been work that simply does fact checking to classify fake news:
some sources are known to spread fake news, and lists of reliable and
unreliable sources are maintained at OpenSources and FakeNewsWatch.

Table 2.2-1 News Sources Credibility List

The most difficult task was to collect labeled data of classified news.
Fortunately, we were able to download labeled data from DataCamp, but
insufficiency of data has always remained the main issue for our project.

2.3 Previous work
We followed multiple research papers from different universities as
references. These papers worked only on the title of the news article, with
a maximum of 12 features; we use 38 features, extracted from both the title
and the text. These papers helped us find appropriate features: for example,
we got the idea of using a text difficulty index (Gunning Fog) from a
University of Michigan paper (Verónica Pérez-Rosas et al., 2017). These
studies were done at universities such as Stanford and Michigan. We tried to
gather the best points from each paper and apply them to our project, which
is why we achieve much higher accuracy than the research papers we followed.

Figure 2.3-1 Accuracy Comparison with Research Papers (Brian Edmonds,
Xiaojing Ji, Shiyi Li, Xingyu Liu, 2017; Verónica Pérez-Rosas, Bennett
Kleinberg, Alexandra Lefevre, Rada Mihalcea, 2017; Victoria L. Rubin,
Niall J. Conroy, Yimin Chen, 2015; our project)

We split our dataset of 6,300 news articles into two sets: an 80% training
set and a 20% test set. We trained on the training set and then measured
accuracy (how accurate the predictions are) on the test set. Currently we
get a maximum accuracy of 85.7%. We also tested the system by feeding it
URLs of fake and authentic news; the system produced satisfactory results.

2.4 Data Collection


It was difficult to gather a large amount of data for news classification.
We obtained a CSV file from DataCamp containing labelled data with title and
text separated. In addition, we collected fake and authentic news manually
from multiple websites, for a total dataset of 6,700 news articles.

Following are the sources from which we collected our data.

Table 2.4-1 Dataset Sources

https://www.datacamp.com/community/podcast/data-science-fake-
news
https://github.com/docketrun/Detecting-Fake-News-with-Scikit-Learn
http://dailyheadlines.net
https://www.snopes.com/
https://tribune.com.pk/fake-news/
https://www.scoopwhoop.com
http://abcnews.go.com/alerts/fact-or-fake

CHAPTER III

3. Design
3.1 Overview
The system works on already trained machine learning algorithms: multiple
algorithms have been trained on a dataset of both fake and authentic news.
A summary of the overall procedure follows.

i. The user enters a URL.
ii. The input is verified to be in URL format; a web crawler then extracts
the relevant text from that news URL.
iii. NLP is applied to the extracted text.
iv. The features extracted via NLP are fed to the ML algorithms.
v. A voting mechanism among the ML algorithms predicts whether the news is
fake or authentic.
vi. Each classified article is stored in the database.
vii. A user can log in to give feedback if a previously classified news
article was classified incorrectly.

A Python sketch tying these steps together is given below, after Figure 3.1-1.

Figure 3.1-1 Layered Architecture
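
The following is a minimal, runnable Python sketch of how these steps could fit together. It is illustrative only: the crawler, feature extractor, trained models and voting function are passed in as stand-ins for the components described in Chapter IV, and a plain dict stands in for the database.

from urllib.parse import urlparse

def is_valid_url(url):                      # step ii: URL format check
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

def handle_query(url, db, scrape, extract_features, models, vote):
    if not is_valid_url(url):
        raise ValueError("input is not a valid URL")
    cached = db.get(url)                    # reuse a stored classification
    if cached is not None:
        return cached
    title, text = scrape(url)               # step ii: web crawler
    features = extract_features(title, text)   # step iii: NLP features
    preds = [m.predict([features])[0] for m in models]  # step iv
    verdict = vote(preds)                   # step v: voting among models
    db[url] = verdict                       # step vi: store in database
    return verdict                          # feedback (step vii) comes later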

3.2 Product Functions
 A URL of a news article must be entered.
 NLP is performed on the text extracted from the URL, and relevant
features are extracted.
 News articles are classified as fake or authentic from the extracted
features.
 Classified news articles are stored in the database to maintain a list of
URLs with the predicted output (fake/authentic); every user can view that
list.
 A user can vote on an item in the list if that specific news article was
not classified correctly.

3.3 User Characteristics


Moderator: The moderator monitors the ratings submitted by users, to
maintain the credibility of the ratings.

Administrator: The administrator maintains the overall web application and
is responsible for giving users appropriate roles and authority.

User: The main actor, who uses the web application to analyze URLs.

3.4 Constraints
i. Our software can never fully assure the authenticity of a result. For
this, we need user feedback.
ii. Our software is only available in English, and news articles provided
to the software must also be in English.
iii. We do not have access to a huge amount of data for training the
machine learning models.
iv. The software will not work without an internet connection.
v. Our software does not perform well when an article's body is plain,
short and emotionless.

3.5 User Requirements


The following user requirements describe what the user expects the software
to do.

3.5.1 External Interface Requirements


The user interface is web based, provided to the user through a web browser.
The screen consists of a login form; upon logging in, the user is presented
with a dashboard consisting of a header, a sidebar menu and a body. At the
top right is a menu for managing user preferences. The body contains a
dialogue box used to get input from the user and a button to submit the
query entered in the dialogue box. Below these, a list of previously
processed URLs with their user ratings is displayed; against each list item,
the user can rate the corresponding processed URL's result as either good or
bad.

 Numpy: a scientific computing package providing N-dimensional array
objects. Several of our machine learning models use Numpy as the data
container; our decision tree and random forest implementations also depend
on it.
 Scikit-learn: a Python library built on Numpy. This project uses it
mainly for data classification.
 NLTK: a Python library for natural language processing (NLP). We use
NLTK for feature extraction from the news article.
 Angular: Angular 4 is used to implement the web-based interface and
client side of the application.
 Scrapy: a Python library for scraping websites. We use Scrapy to fetch
the text and header of the news article from the URL provided by the user.

3.5.2 Functional Requirements


1. Take a valid news article URL from the user.
2. Extract the relevant text from the URL, provided by the user, using
Scrapy.
3. Extract relevant features from the text using NLP (Natural Language
Processing).
4. Correctly classify the news article as fake or credible using different
machine learning models (SVM and Random Forest).
5. Store the classification results in the database to maintain a list of
URLs that have already been processed and classified.
6. Users can sign up and log in.
7. Each user can view all recently processed and classified news articles
and verify the correctness of the classification by voting (sign-in
required).
8. After a predefined time limit and number of votes, we can verify whether
the software classified a given news article correctly.
9. We can then modify our classification if needed and add the news article
to the training set to improve the accuracy of future predictions.

3.5.2.1 Functional Requirements with Traceability information

3.5.2.1.1 Takes a news article as an input

Requirement ID: 1 | Requirement Type: Functional | Use Case #: –
Status: New (Agreed-to: –, Baselined: –, Rejected: –)
Parent Requirement #: –
Description: Take a news article URL from the user, which is to be analyzed
and classified. It must be a valid news URL.
Rationale: The system must take a valid URL from the user to extract text
from.
Source: – | Source Document: –
Acceptance/Fit Criteria: –
Dependencies: –
Priority: Essential (Conditional: –, Optional: –)
Change History: –

3.5.2.1.2 Extract the title and article using Scrapy

Requirement ID: 2 | Requirement Type: Functional | Use Case #: –
Status: New (Agreed-to: –, Baselined: –, Rejected: –)
Parent Requirement #: –
Description: Extract the relevant text from the provided URL using Scrapy.
Rationale: The system has to extract only the title and body of the article,
which are then fed to the classification system for feature extraction and
classification.
Source: – | Source Document: –
Acceptance/Fit Criteria: –
Dependencies: –
Priority: Essential (Conditional: –, Optional: –)
Change History: –

3.5.2.1.3 NLP is applied on the text extracted using Scrapy

Requirement ID: 3 | Requirement Type: Functional | Use Case #: –
Status: New (Agreed-to: –, Baselined: –, Rejected: –)
Parent Requirement #: –
Description: The text extracted by the web crawler is used for feature
extraction using NLP.
Rationale: Features must be extracted so they can be used by the machine
learning algorithms for classification.
Source: – | Source Document: –
Acceptance/Fit Criteria: –
Dependencies: –
Priority: Essential (Conditional: –, Optional: –)
Change History: –

3.5.2.1.4 Apply machine learning algorithms on the data

Requirement ID: 4 | Requirement Type: Functional | Use Case #: –
Status: New (Agreed-to: –, Baselined: –, Rejected: –)
Parent Requirement #: –
Description: Apply machine learning algorithms to the feature vectors to
classify news as fake or credible.
Rationale: This requirement is the backbone of the system.
Source: – | Source Document: –
Acceptance/Fit Criteria: –
Dependencies: –
Priority: Essential (Conditional: –, Optional: –)
Change History: –

3.5.2.1.5 Store the results in the database

Requirement ID: 5 | Requirement Type: Functional | Use Case #: –
Status: New (Agreed-to: –, Baselined: –, Rejected: –)
Parent Requirement #: –
Description: Store the classification results in the database.
Rationale: If another user enters the same URL, the system does not have to
process it again and can simply return the stored result.
Source: – | Source Document: –
Acceptance/Fit Criteria: –
Dependencies: –
Priority: Essential (Conditional: –, Optional: –)
Change History: –

3.5.2.1.6 User can sign up using email and password

Requirement ID: 6 | Requirement Type: Functional | Use Case #: –
Status: New (Agreed-to: –, Baselined: –, Rejected: –)
Parent Requirement #: –
Description: The user can sign up using an email address and log in.
Rationale: This is required for feedback.
Source: – | Source Document: –
Acceptance/Fit Criteria: –
Dependencies: –
Priority: Essential (Conditional: –, Optional: –)
Change History: –

3.5.2.1.7 User can view the results of news stored in the database

Requirement ID: 7 | Requirement Type: Functional | Use Case #: –
Status: New (Agreed-to: –, Baselined: –, Rejected: –)
Parent Requirement #: –
Description: The user can view all recently processed and classified news
articles and vote on the accuracy of the classification.
Rationale: This helps the developers improve the system and obtain feedback
on the accuracy of the classification system.
Source: – | Source Document: –
Acceptance/Fit Criteria: –
Dependencies: –
Priority: Essential (Conditional: –, Optional: –)
Change History: –

3.5.2.1.8 User feedback for the classified news

Requirement ID: 8 | Requirement Type: Functional | Use Case #: –
Status: New (Agreed-to: –, Baselined: –, Rejected: –)
Parent Requirement #: –
Description: After a predefined time limit and number of votes, the system
verifies the classification.
Rationale: Verification of the classification is very important for gaining
users' trust and for system improvement.
Source: – | Source Document: –
Acceptance/Fit Criteria: –
Dependencies: –
Priority: Essential (Conditional: –, Optional: –)
Change History: –

3.5.2.1.9 Using the stored classified data in the database for training

Requirement ID: 9 | Requirement Type: Functional | Use Case #: –
Status: New (Agreed-to: –, Baselined: –, Rejected: –)
Parent Requirement #: –
Description: Add the features of verified classifications to the training
set.
Rationale: This improves the accuracy of future classifications.
Source: – | Source Document: –
Acceptance/Fit Criteria: –
Dependencies: –
Priority: Essential (Conditional: –, Optional: –)
Change History: –

3.6 Performance Requirements

Table 3.6-1 Performance Requirements

ID | Performance Requirement
1 | Feature extraction must be done in reasonable time.
2 | Time taken by the machine learning algorithms should be in milliseconds.
3 | The system should be able to handle multiple simultaneous requests.

3.7 Use Case Diagrams


Following are the use case diagrams for our system that describe a set of
actions (use cases) that the system should or can perform in collaboration with
one or more external users of the system (actors).

3.7.1 Use Case Diagram 1

Figure 3.7-1 Use Case Diagram 1

The classification system is the backbone of the entire software. Figure
3.7-1 shows the use cases related to the classification system: it extracts
text from the news URL and uses NLP to extract the required features; then
different machine learning algorithms are applied to the features, and the
results are displayed to the user and stored in the database.

3.7.2 Use Case Diagram 2

Figure 3.7-2 Use Case Diagram 2

The use cases related to user feedback are shown in Figure 3.7-2. In order
to give feedback on the accuracy of a classification, a user must sign up.
The system displays all recently processed/classified URLs to the user; if
the user is logged in, they can vote on any classification result. After
some time (e.g., one week), the system checks the votes for the
classification and, based on them, verifies whether the classification was
correct. If the classification is verified, the system adds the features of
the correct classification to the training set.

3.7.3 Use Case Diagram 3

Figure 3.7-3 Use Case Diagram 3

Figure 3.7-3 shows the use cases for basic use of the software. The user
enters a news URL; the system verifies the URL, extracts the relevant text
using a web crawler, and then classifies the news article as fake or
credible using the machine learning algorithms. After the result is
computed, the user can view it.

3.8 UML Diagrams
Following are the Unified Modelling Language (UML) diagrams that are
intended to provide a standard way to visualize the design of our system.

3.8.1 Component Diagrams


The following component diagram describes the components that make up the
functionality of the system.

Figure 3.8-1 Component Diagram

Figure 3.8-1 shows the overall view of the system: all the different
components and the information that flows between them. The user interface
is the view through which the user interacts with the system; in our case it
is a web application. The user inputs a news URL, which is passed to the web
crawler. The web crawler crawls the URL, extracts the relevant text and
passes it to the classification system, which extracts the required features
from the text, applies the machine learning algorithms to the feature
vector, and stores the results in the database.

3.8.2 Database ER Diagram

Figure 3.8-2 ER Diagram

Figure 3.8-2 shows the entity-relationship diagram of our system. There is a
many-to-many voting relationship between User and Classified News, with
partial participation: not every user votes on every classified news item,
and vice versa. Classified News and Domain are related by a many-to-one
relationship with total participation.

3.8.3 Data Flow Diagram

Figure 3.8-3 Data Flow Diagram

Figure 3.8-3 shows the flow of data. First, the user sends a URL; an error
is displayed if the entered text is not in URL format. Otherwise, the URL is
looked up in the database's already-classified list. If the URL is found,
the previous result is simply displayed; otherwise the crawler crawls the
website and scrapes the relevant text. NLP is applied to the text, and the
resulting features are processed by the ML algorithms. Each algorithm gives
a result, all results are sent to the voting algorithm, and the final result
is displayed and stored in the database.

CHAPTER IV

4. Proposed Solution
The solution to the problem defined in the earlier sections was to design
and implement a web-based application that takes a news URL as input and
returns a verdict on its authenticity with high accuracy. Achieving high
accuracy was difficult because of our limited dataset; even so, we achieve
85.7% test accuracy, which is higher than the research papers we followed.
To tackle the dataset issue, we implemented a mechanism whereby processed
URLs are stored in the database and later fed back to the training
algorithms. In this way our system keeps getting smarter with time.

4.1 Methodology
Developing an automatic fake news detector was a challenging problem. To
make sure that we accomplished the task efficiently, without major problems
that would have forced redesigns and re-engineering of the software
architecture in a time- and cost-constrained project environment, we started
by developing an SRS (Software Requirement Specification) and a detailed
design of the system. A Gantt chart and work breakdown structure were
created in that phase to monitor the project and determine when each phase
should start and end.

Figure 4.1-1 Gantt Chart

After that, we began gathering a dataset for training and were able to
collect about 6,500 labeled news articles from multiple sources. We then
researched which machine learning algorithms to apply and what kind of NLP
to use, and settled on SVM and Random Forest, which gave us an accuracy of
85.7%.

The overall process is as follows.

 A labeled dataset of about 6,500 news articles is gathered, containing
the text and title of each news item.
 NLP is applied to each news article to extract relevant features, e.g.,
punctuation count, text difficulty index, etc.
 In total, 38 features are extracted.
 Training is done with SVM (linear kernel) and Random Forest.
 When a URL is entered, the text and title of the news at that URL are
scraped using a web crawler.
 The same NLP is applied to the extracted text and title, and the 38
features are fed to the machine learning algorithms.
 We combine the strong points of both algorithms, which increases our
accuracy: SVM is better at detecting fake news, while Random Forest is
better at detecting authentic news.
 When a user enters a URL and checks the authenticity of a news item, the
result is stored in the database.
 The system maintains a list of already processed URLs, which users can
see.
 A user can also give feedback on any already processed news article via a
dislike button if the news was predicted wrongly by our algorithm.
 Predicted news items with low user ratings are then manually reviewed.
 After some time, these already processed news articles are fed back to
the machine learning algorithms for training.
 The size of our dataset thus keeps increasing, and the system keeps
getting smarter with time.


Figure 4.1-2 Workflow Diagram

4.1.1 Feature Selection


We used 38 features in total, extracted from both the title and the body of
the news article. Previous research on this topic used only the title of the
news for training; we could not reach our desired accuracy using the title
alone.

The following table lists the features selected for the text, with the
weight/importance of each feature as calculated by the machine learning
algorithm. The same features were selected for the title but are not listed
in the table.

Table 4.1-1 Features with importance

Feature Importance
Word Count 0.03223736
Character Count 0.11497973
Punctuation Count 0.0979961
Uppercase Count 0.07135418
Gunning Fog 0.0166595
Automated Readability Index 0.03313012
Linsear Write Formula 0.01666274
Difficult Words 0.0262762
Dale-chall Readability Score 0.01767803
Punctuation Count / Character Count 0.21654589
Count of numbers 0.01909209
Count of brackets 0.00145834
Count of Asterisk (offensive words) 0.01956875

The table above shows which features are most important for news
classification by assigning each a weight or score. For example, the ratio
of punctuation count to character count has the highest score (0.2165),
meaning this feature carries 21.65% of the importance and contributes most
to classifying the news. Bracket count has the least importance, meaning it
helps least in classifying a news article as fake or authentic.
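
As a sketch of how such features can be computed and scored, the snippet below derives a subset of the Table 4.1-1 features with Python's standard library and the textstat package, then reads the importance scores off a trained random forest. The variables `articles` and `labels` are assumed to hold the loaded dataset; this is illustrative, not the project's exact code.

import string
import textstat
from sklearn.ensemble import RandomForestClassifier

def extract_features(text):
    # A subset of the Table 4.1-1 feature vector for one article body.
    words = text.split()
    return [
        len(words),                                  # word count
        len(text),                                   # character count
        sum(c in string.punctuation for c in text),  # punctuation count
        sum(c.isupper() for c in text),              # uppercase count
        textstat.gunning_fog(text),                  # Gunning Fog index
        textstat.automated_readability_index(text),
        textstat.difficult_words(text),
        textstat.dale_chall_readability_score(text),
    ]

# `articles` (list of strings) and `labels` (1 = fake, 0 = authentic)
# are assumed to be loaded from the dataset described in Section 2.4.
X = [extract_features(text) for text in articles]
forest = RandomForestClassifier(n_estimators=100).fit(X, labels)
print(forest.feature_importances_)                   # one score per feature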

4.1.2 Normalization
We used min-max normalization, rescaling the feature values to [0, 1]. There
was an obvious increase in accuracy after applying this normalization
method.

The formula is x' = (x − min(x)) / (max(x) − min(x)), where x is an original
value and x' is the normalized value.

For example, if the punctuation count ranges over [10, 200], x' is
calculated by subtracting 10 from each article's punctuation count and
dividing by 190.
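
A minimal sketch of this rescaling, using scikit-learn's MinMaxScaler, which applies the same formula to each feature column:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [105.0], [200.0]])    # punctuation counts in [10, 200]
X_scaled = MinMaxScaler().fit_transform(X)  # (x - 10) / 190 -> 0.0, 0.5, 1.0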

4.2 Training
After cleaning and normalizing the data, we moved on to training. We tried
multiple algorithms and techniques and selected the two (Random Forest and
SVM) that gave the highest accuracy. Training took most of the project's
development time, because there were endless combinations and possibilities
to try in pursuit of the highest accuracy with a limited dataset: we varied
the normalization method, the training algorithm, the number of iterations,
the SVM kernel and the number of features.

4.2.1 Number of Features
Following is the graph of Accuracy vs Number of Features.

Figure 4.2-1 Accuracy vs Number of Features (Random Forest, Decision Tree,
SVM with linear kernel; 6 to 28 features)

The graph clearly shows over- and underfitting. At 19 features we get the
highest accuracy (85.7%), with the SVM linear kernel; beyond that point, the
model starts to overfit the data and test accuracy declines.

Note that 19 features are used for the title and the text separately; in
total, 38 features are used.

4.2.2 SVM Kernels
Following graph shows the difference in accuracy with different SVM kernels.

Figure 4.2-2 Accuracy vs SVM Kernel (default vs linear)

The graph shows that the linear kernel gives the highest accuracy (85.7%).
Most textual data is linearly separable, and a linear kernel works very well
when the data is linearly separable or has a high number of features,
because mapping the data to a higher-dimensional space does not really
improve performance (L. Arras, F. Horn et al., 2017).
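
A small sketch of this comparison, with synthetic data standing in for our real 38-feature matrix (scikit-learn's default SVC kernel is RBF):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=38, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
for kernel in ("rbf", "linear"):            # "rbf" is SVC's default kernel
    acc = SVC(kernel=kernel).fit(X_train, y_train).score(X_test, y_test)
    print(kernel, acc)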

4.2.3 Random Forest and Decision Tree
Following is graph of Accuracy vs Maximum Depth of Random Forest and
Decision Tree.

Figure 4.2-3 Accuracy vs Depth of Random Forest and Decision Tree (maximum
depths 5, 8, 10, 14)

The maximum accuracy, 83.8%, is reached by the Random Forest at depth 10. It
can also be observed that the Decision Tree never surpasses the accuracy of
the Random Forest.
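
A sketch of the depth sweep behind this figure, again with synthetic stand-in data:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=38, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
for depth in (5, 8, 10, 14):
    rf = RandomForestClassifier(max_depth=depth, random_state=0)
    dt = DecisionTreeClassifier(max_depth=depth, random_state=0)
    print(depth,
          rf.fit(X_train, y_train).score(X_test, y_test),
          dt.fit(X_train, y_train).score(X_test, y_test))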

4.2.4 Train/Test Split
Right now we’re splitting the data into 80/20, with 80 being training set and
20 being the test set. Following is the graph that shows Accuracy vs Machine
Learning models with different splits.

Figure 4.2-4 Accuracy vs Train/Test Split (90/10, 80/20, 70/30, 60/40; SVM
linear, Random Forest, Decision Tree)

The graph shows that the highest accuracy is achieved with an 80/20 split,
with 20% as the test set. Over- and underfitting can be observed in this
graph as well.
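
A sketch of this split sweep, with synthetic stand-in data:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=38, random_state=0)
for test_size in (0.1, 0.2, 0.3, 0.4):      # 90/10, 80/20, 70/30, 60/40
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size,
                                              random_state=0)
    acc = SVC(kernel="linear").fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{100 - int(test_size * 100)}/{int(test_size * 100)}: {acc:.3f}")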

4.2.5 Feature Reduction
We used PCA and LDA for feature reduction. The following graph shows
accuracy with PCA, with LDA, and without feature reduction, against the
number of features.

Note: feature reduction is applied to the Random Forest, and Random Forest
accuracy is reported. The algorithm was trained multiple times, and in each
run the accuracy of the plain Random Forest was compared with its accuracy
after PCA and after LDA.

Figure 4.2-5 Feature Reduction (Graph): accuracy without reduction, with
PCA and with LDA, for 10 to 25 features

The data from the graph above is tabulated below.

Table 4.2-1 Feature Reduction (Tabulated)

Features | Without Feature Reduction | PCA | LDA
10 | 82.39 | 83.5 | 80.2
15 | 82.767 | 85 | 79.63
20 | 82.7 | 84.9 | 79.9

It can be clearly seen that the PCA accuracy is consistently higher than
that of the Random Forest trained without feature reduction.
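
A sketch of the PCA variant of this experiment, with synthetic stand-in data; only the PCA side is sketched here, since scikit-learn's LDA projects to at most n_classes − 1 components (one, for a binary problem):

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=38, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
for n in (10, 15, 20):
    pca = PCA(n_components=n).fit(X_train)  # fit on the training data only
    clf = RandomForestClassifier(max_depth=10, random_state=0)
    clf.fit(pca.transform(X_train), y_train)
    print(n, clf.score(pca.transform(X_test), y_test))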

4.2.6 Summary of Training
As the preceding graphs show, we experimented extensively with the data, the
features and the machine learning algorithms to achieve the desired
accuracy. We also implemented neural networks, but they gave very low
accuracy (53%) due to the insufficient dataset size, so we decided not to
include them in this work; we will add them in the future once we have a
sufficiently large dataset. We expect that with a large number of news
articles, deep learning will greatly increase the accuracy of our system.

The following is the overall summary of what has been discussed above
regarding training.

Decision Tree:

Table 4.2-2 Decision Tree

Features Depth Accuracy %


6 5 35
- 8 38
- 10 41.12
- 14 39
13 5 53.5
- 8 55
- 10 58.78
- 14 54.2
19 5 68.12
- 8 69.5
- 10 77
- 14 74.2

Random Forest:

Table 4.2-3 Random Forest

Features Depth Accuracy %


6 5 37
- 8 39.75
- 10 43
- 14 41.2
13 5 54.5
- 8 59
- 10 61.25
- 14 58
19 5 79.54
- 8 84
- 10 82.3
- 14 78

SVM:

Table 4.2-4 SVM

Kernel Features Accuracy %


Default 6 39.25
- 13 51.8
- 19 56
- 25 58.7
Linear 6 68.12
- 13 82.35
- 19 85.7
- 25 84.2

Over-All:

This is the overall summary; Table 4.2-5 is constructed using the following
values:

 No. of Features = 19 (For title and text separate, total 38)


 Maximum depth for Random Forest and Decision Tree = 10
 SVM Kernel = Linear
 Train/Test Split = 80/20

Table 4.2-5 Accuracy vs Training Algorithms

Training Algorithm Accuracy %


Random Forest 84
SVM 85.7
Decision Tree 77
ANN (2 hidden Layers) 51
ANN (5 hidden Layers) 57
ANN (10 hidden Layers) 53

Here SVM gives the highest accuracy among the machine learning algorithms,
for the reason described previously: textual data is almost always linearly
separable, and SVM is a good choice for linearly separable data.

4.3 Server-side Implementation
The main part of our server is the machine learning pipeline. The
classification and web back-end parts of the project are implemented in
Python: Django is used for the back end, and the scikit-learn library is
used for training. We started the project with a decision tree algorithm on
19 features and got 53% accuracy after splitting the dataset 80/20 into
training and test sets. After going through research papers and taking the
strong points from each, we reached 85.7% accuracy by combining Random
Forest and SVM (linear kernel). We had wanted to use deep learning and hoped
for much higher accuracy with it, but failed due to the small dataset. For
the NLP part, we used the NLTK and Textstat Python APIs for complex feature
extraction such as adverb count and text difficulty.

One of our main hurdles was scraping HTML pages properly. Online news
articles are not written in a standard form; e.g., news on Facebook uses a
different HTML format than news on bbc.com. We could not tackle this
generality with a hand-written scraper, so we used Python's Newspaper3k
library, which is made specifically for scraping news articles.
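
A minimal sketch of the Newspaper3k extraction step (the URL is illustrative only):

from newspaper import Article

article = Article("https://www.bbc.com/news/world")  # hypothetical URL
article.download()
article.parse()
print(article.title)   # goes to the title feature extractor
print(article.text)    # goes to the body feature extractor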

4.4 Database Design


SQLite was chosen for our database. SQLite is a self-contained,
high-reliability, embedded, full-featured, public-domain SQL database
engine. There are two main tables, Users and URLs. The User table keeps
records of usernames, passwords, etc., so that users can log in to the
system, while the URL table keeps records of already processed news
articles, so that if a new user enters the same URL again, the system does
not have to go through all the processing and can simply return the result
from the database. A Voting table is also maintained, which records the
votes given to each URL.
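
As a sketch, the URL and Voting tables could look like the following Django models backed by SQLite; the field names are illustrative, not the project's actual code, and the Users table is Django's built-in auth user model.

from django.conf import settings
from django.db import models

class ProcessedURL(models.Model):
    url = models.URLField(unique=True)
    title = models.TextField()
    text = models.TextField()
    is_fake = models.BooleanField()            # stored classification result

class Vote(models.Model):
    user = models.ForeignKey(settings.AUTH_USER_MODEL,
                             on_delete=models.CASCADE)
    url = models.ForeignKey(ProcessedURL, on_delete=models.CASCADE)
    agree = models.BooleanField()              # like / dislike of the result

    class Meta:
        unique_together = ("user", "url")      # one vote per user per URL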

4.5 Schedule
The four core modules of the system were divided among the group members to
be designed, developed and deployed in isolation, and then integrated with
the system to achieve the overall functionality.

See Appendix – A for Work Breakdown Structure, (WBS)

Serial # Activity Day(s)


1.0.0 Software Requirement
Specifications
1.1.0 Identifying User Classes 1
1.1.2 Determining User 2–3
Requirements
1.1.3 Qualification of Users 4–6
1.2.1 Determining Use Cases 7
1.2.2 Inputs 8–9
1.2.3 Processes 10 – 13
1.3.0 Outputs 14 – 17
1.3.1 Determining Modules 19
1.3.2 Determining Product 21 – 27
Functionalities
1.3.3 Determining Functional 28 – 30
Requirements

1.4.0 Determining Non- 31-34
Functional
Requirements
1.5.0 Identifying Security 35 – 37
Measures
1.6.0 Communication and 38 – 42
user interface
requirements
1.6.1 Determining System 43
Dependencies
1.7.0 Constraints 44 – 46
1.8.0 Other Interfaces 49
1.8.1 Criticality of 64
Application
1.8.2 Logical and Database 65 – 70
Requirements
1.9.0 Functional Hierarchy 72

Serial # Activity Day(s)
2.0.0 Designing System
Architecture
2.1.0 Identifying sub System 73
2.2.0 How sub Systems 74 – 75
Would Interact
2.3.0 Knowledge of Server, 76 – 79
Memory Processing
Capabilities
2.4.0 User-Server 80
Communication
2.5.0 Dependencies of sub 81 – 82
Systems
2.6.0 Limitation of User 83
hardware

Serial # Activity Day(s)
3.0.0 Prototype
Development
3.1.0 Developing User 84
Application
3.1.1 Designing Interactive
User Interfaces
3.2.0 Determine Functional 85 – 87
Requirement of
System
3.3.0 General Prototype 88 – 90
3.4.0 Deploy over Servers 92
3.5.0 Testing 93
3.5.1 Debug 94 – 100
3.5.2 Initial Launch 101
3.6.0 Improving and 102 – 107
Finalizing User
Interface
3.7.0 Testing 110
3.7.1 Debug 111 – 120
3.7.2 Final Launch 121

Serial # Activity Day(s)
4.0.0 Beta Launch
4.1.0 Deploying on Server 130
4.2.0 Testing 131 – 132
4.3.0 Debug 133 - 134

Serial # Activity Day(s)


5.0.0 Commercialization
and Marketing
5.1.0 Identifying Potential 140
Customers
5.2.0 Flyers and Pamphlets 141 – 150
5.3.0 Live Demonstration 152 – 160
5.4.0 Advertisement via 170
Social Media

4.6 Technological Aspects

Table 4.6-1 Programming Languages

Programming Languages
Python
SQL
JavaScript

Table 4.6-2 Libraries and Framework

Technologies | Libraries and Frameworks
JavaScript | jQuery
Web | HTML5, CSS
Back-end framework | Django
Asynchronous requests | Ajax

Table 4.6-3 Miscellaneous

Other
IDE Pycharm
Versioning Control Git/GitHub
Database SQLite
Networking Protocols HTTP/HTTPs

CHAPTER V

5. Results and Discussion


We integrated all the system components successfully, and our system's
accuracy was quite good: it correctly classified news articles with 85.7%
accuracy. Our main goal was to develop a user-friendly web application that
classifies a news article as fake or credible simply by taking its URL from
the user. We achieved this goal by fulfilling all the user requirements that
were crucial to the success of our project.

There were also requirements related to performance. We constantly improved
our system to achieve maximum performance, and the results were quite
satisfactory; the response time of our system was adequately fast. We
consistently applied software engineering processes to keep track of all the
functional and non-functional requirements.

5.1 Take a Valid News Article URL, (FR-01)


This functional requirement was critical to our system. For all system
components to work flawlessly, the system must receive a valid news article
URL from the user, from which it extracts text; if it does not get a news
article URL, the web crawler raises an exception. To fulfil this requirement
we used a form input of URL type, so that only a URL is accepted as input,
and we used exception handling to catch the exception raised when the
provided URL is not of a news article.
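
A sketch of this validation and exception handling, assuming Django's URLValidator and Newspaper3k's ArticleException:

from django.core.exceptions import ValidationError
from django.core.validators import URLValidator
from newspaper import Article, ArticleException

def fetch_article(raw_input):
    try:
        URLValidator()(raw_input)          # reject anything not URL-shaped
        article = Article(raw_input)
        article.download()                 # may raise ArticleException
        article.parse()
        return article.title, article.text
    except (ValidationError, ArticleException):
        return None                        # signal: not a valid news URL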

5.2 Extract Relevant Text from URL, (FR-02)
This was a very challenging problem in our project. In order to classify a
news article as fake or credible, we need only the relevant text from the
page source, on which our system applies natural language processing to
build feature vectors. This was particularly hard, as we had to make a
generic scraper that works for every news website. We used the Newspaper3k
API to solve this problem, which made it easy to extract only the news
article's title and text (body).

5.3 Extracting Features from Relevant Text, (FR-03)


The system uses NLTK to apply NLP to the news article's title and text to
build the feature vectors that are fed to the machine learning algorithms.
We used 38-dimensional feature vectors. This is a necessary step, as it
converts the text into a numeric form that the machine learning algorithms
can readily use.

5.4 Applying Machine Learning Algorithms for Classification, (FR-04)


This requirement is the backbone of our system: its success depended on how
accurately our machine learning models predicted whether a news article is
fake. To achieve maximum accuracy with finite resources, we trained our
machine learning models on a labelled dataset of 7,000 news articles. We
used two different machine learning models, SVM and Random Forest, and
combined the results of both, achieving a maximum of 86% test accuracy.
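
The report does not spell out the exact combination rule, so the following is only one plausible reading of "SVM is stronger on fake news, Random Forest on authentic news" (labels: 1 = fake, 0 = authentic); it is an assumption, not the project's verbatim logic.

def combine(svm_pred, rf_pred):
    if svm_pred == rf_pred:        # models agree: accept the shared verdict
        return svm_pred
    if svm_pred == 1:              # SVM is the stronger fake detector
        return 1
    return 0                       # RF's fake calls are its weaker side:
                                   # default to authentic on disagreement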

5.5 Store Classification Result in Database, (FR-05)
We store the result of every URL processed by our system in our database,
alongside its title and text. This requirement improved the performance of
our system by eliminating redundancy: if two users enter the same URL, the
system processes it only once and serves the stored classification result
for subsequent queries.

5.6 User Login and Sign up, (FR-06)


We used the Django user model to implement this requirement. It was
necessary because users must log in to give feedback on the classification
results.

5.7 User Feedback, (FR-07)


Once logged in, a user can give feedback on the classification results of
all processed URLs. We implemented this as a voting system in which a user
can like or dislike a URL's classification result. We also created a voting
table in the database, associated with both the user model and the URL
model, to ensure that a user can vote only once for a particular URL.

5.8 Verifying Results, (FR-08)
A month after processing a URL, our system automatically checks the rating
given to it by users. If the rating is above 50%, the system retains the
classification result; if it is below 50%, the classification result is
flipped, since a poor rating indicates an incorrect classification by the
system.
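
A minimal sketch of this verification rule:

def verify(stored_is_fake, likes, dislikes):
    # Keep the stored result when at least half of the voters agree with
    # it; otherwise flip the classification.
    total = likes + dislikes
    if total > 0 and likes / total < 0.5:  # majority disagrees with result
        return not stored_is_fake
    return stored_is_fake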

5.9 Retraining of Machine Learning Models, (FR-09)


After a month all the URL’s which are verified our added to our dataset along
with their classification result. All the machine learning models our trained
and saved again. This ensures that our system improves with time as more and
more data is available for training. This will help our system evolve
continuously and our accuracy will get better and better.

5.10 Non-Functional Requirements Achieved

Table 5.10-1 Performance Requirements

Performance Requirement | Achieved
The system should respond to a user query and return a result in less than 5 seconds. | ✓
Web crawling should be done quickly. | ✓
Feature extraction must be done in milliseconds. | ✓
Time taken by ML algorithms should be in milliseconds. | ✓
The system should be able to handle multiple simultaneous requests. | ✓

Table 5.10-2 Security Requirements

Security Requirement | Achieved
Users should be able to log in securely. | ✓
User passwords should be encrypted; they are stored in the database in encrypted form. | ✓
User passwords should be long and contain special characters. | ✓

Table 5.10-3 Usability Requirements

Usability Requirement | Achieved
The system should be user friendly and easy to use. | ✓
The system should not need an extra instruction manual. | ✓
The user should be able to learn the system in less than 5 minutes. | ✓

CHAPTER VI

6. Conclusion and Future Work

6.1 Conclusion
Fake news leaves people confused about whom to trust; some even say that
Donald Trump became president because of fake news on Twitter. To tackle
this problem, we work purely on linguistic features: our scraper extracts
the title and text of the news, we extract 38 features using natural
language processing (NLP), and we apply a support vector machine (SVM) and a
Random Forest to detect whether the news is authentic or fake.

This web application addresses a very important problem on social media
platforms such as Facebook and Twitter, to which nearly everyone has easy
access. News on social media has a very large impact on people's thinking,
and our web application gives people an easy way to determine the
credibility of any news article. An accuracy of 86.7% shows that our
application can be very useful in the real world. Because there is still a
chance that our web application predicts a news item wrongly, a user
feedback mechanism has been added so that users can vote on whether a news
item was predicted correctly. After a month or two, user votes are checked
manually, and if a prediction was wrong, its result is corrected manually.
These processed news articles can then be used to train the machine learning
models and increase the efficiency and accuracy of the application. With
time and user feedback, we can improve our software in terms of both
accuracy and user experience.

6.2 Future Work
We have combined two machine learning algorithms (SVM and Random Forest) in
such a way that the strong points of both are used to predict the
credibility of a news article. Our main focus is to improve the software as
much as we can. The larger the dataset the machine learning models are
trained on, the better they are likely to work, so we will use a large-scale
dataset to train them in the future.

The efficiency of the system is also increased by the feedback mechanism:
news articles that have already been processed are fed back into the
training set, so the dataset keeps growing. Because there is a chance that
our software gives wrong results, we provide users with a feedback mechanism
through which they can vote on whether a predicted news item is fake or
authentic. We then manually check whether the users' voting is right or
wrong, correct the result accordingly, and use that data to retrain the
machine learning models.

GLOSSARY

Name Description
NLP Natural Language Processing
SVM Support Vector Machine
URL Uniform Resource Locator
ML Machine Learning
API Application Programming Interface
SRS Software Requirement Specification
HTTPS Hypertext Transfer Protocol Secure
HTML Hypertext Markup Language

REFERENCES

Chris Reed, D. Walton, and F. Macagno. "Argument diagramming in logic, law
and artificial intelligence." The Knowledge Engineering Review,
22(01):87–109, 2007.

Douglas Walton, Christopher Reed, and Fabrizio Macagno. Argumentation
Schemes. Cambridge University Press, 2008.

Alice Toniolo, Federico Cerutti, Nir Oren, T. J. Norman, and Katia Sycara.
"Making Informed Decisions with Provenance and Argumentation Schemes." In
11th International Workshop on Argumentation in Multi-Agent Systems, pages
1–20. aamas2014.lip6.fr, 2014.

Alice Toniolo, Timothy Dropps, Robin Wentao, and John A. Allen.
"Argumentation-based collaborative intelligence analysis in CISpaces." In
COMMA, pages 6–7, 2014.

Conroy, Niall J., Victoria L. Rubin, and Yimin Chen. "Automatic deception
detection: Methods for finding fake news." Proceedings of the Association
for Information Science and Technology 52.1 (2015): 1–4.

Kolari, Pranam, et al. "Detecting spam blogs: A machine learning approach."
AAAI. Vol. 6. 2006.

Wang, William Yang. "'Liar, Liar Pants on Fire': A New Benchmark Dataset for
Fake News Detection." arXiv preprint arXiv:1705.00648 (2017).

Kolari, Pranam, Tim Finin, and Anupam Joshi. "SVMs for the Blogosphere: Blog
Identification and Splog Detection." AAAI Spring Symposium: Computational
Approaches to Analyzing Weblogs. 2006.

Jin, Zhiwei, et al. "News credibility evaluation on microblog with a
hierarchical propagation model." Data Mining (ICDM), 2014 IEEE International
Conference on. IEEE, 2014.

Rubin, Victoria, et al. "Fake news or truth? Using satirical cues to detect
potentially misleading news." Proceedings of the Second Workshop on
Computational Approaches to Deception Detection. 2016.

Volkova, Svitlana, et al. "Separating facts from fiction: Linguistic models
to classify suspicious and trusted news posts on Twitter." Proceedings of
the 55th Annual Meeting of the Association for Computational Linguistics
(Volume 2: Short Papers). Vol. 2. 2017.

Verónica Pérez-Rosas, Bennett Kleinberg, Alexandra Lefevre, and Rada
Mihalcea. "Automatic Detection of Fake News." University of Michigan, 2017.

Brian Edmonds, Xiaojing Ji, Shiyi Li, and Xingyu Liu. "Fake News Detection
Final Report." Chicago University, 2017.

Davis, Wynne. "Fake Or Real? How To Self-Check The News And Get The Facts."
NPR, 05 Dec. 2016. Web. 22 Apr. 2017.
<http://www.npr.org/sections/alltechconsidered/2016/12/05/503581220/fake-
or-real-how-to-self-check-the-news-and-get-the-facts>.

Samir Bajaj. "'The Pope Has a New Baby!' Fake News Detection Using Deep
Learning." Stanford University, 2017.

L. Arras, F. Horn, G. Montavon, K.-R. Müller, and W. Samek. PLoS ONE, 2017.
journals.plos.org.

APPENDIX

Appendix - A

Table 6.2-1 Work Breakdown Structure

