FAKE NEWS DETECTION SYSTEM USING MACHINE LEARNING AND NLP
S B Akshaya*1, M Priyanga*2, S M Raghul*3, S Saran Kumar*4, P Prakash*5, Dr. T. Ramraj*6
*1,2,3,4,5 Student, Department Of Computer Science And Engineering, Coimbatore Institute Of Technology, India.
*6 Assistant Professor, Department Of Computer Science And Engineering, Coimbatore Institute Of Technology, India.
ABSTRACT
As technology evolves, more people depend on online sources for news rather than conventional newspapers. This shift has both positive and negative consequences, and among the latter, the propagation of fake news cannot be ignored. Its spread has many adverse effects, one of which is the creation of biased opinions that influence election outcomes for the benefit of certain candidates. This can lead to misinformation and problems in society. Hence, in this paper, a classification technique is proposed to identify which news is real and which is fake. The model presented uses various Machine Learning and Natural Language Processing (NLP) techniques to achieve maximum accuracy. In the proposed system, five different machine learning algorithms are trained and compared on performance metrics such as recall and F1 score, and the best trained model is used for classification. The observed accuracy score of this system is 93.6%.
Keywords: Fake news, Machine Learning, NLP, Feature Extraction, Logistic Regression, Decision Tree, Random Forest, Passive Aggressive Classifier, Gradient Boosting Classifier.
I. INTRODUCTION
Fake news is not new; however, with growing technologies, detecting it has become more challenging. As social media continues to dominate everyday life, it accelerates the spread of fake news. Studies have observed that "It takes the truth about 6 times as long as falsehood to reach 1500 people". Online platforms are used to spread such fake news for financial or political gain. Hence, it is important to use technology to our advantage to detect fake news and prevent it from spreading.
The objective of this paper is to use a classification technique to identify fake news. Fake news detection is carried out using Machine Learning techniques, in particular Supervised Learning. A dataset of fake and real news is used to train a Machine Learning model with the Scikit-learn library in Python. Features are extracted from the dataset using Term Frequency – Inverse Document Frequency (TF-IDF), one of the text representation models. The Passive-Aggressive Classifier was observed to work best with the TF-IDF model for content classification, with a resulting accuracy of about 93.6%. It is recommended to employ this research in real time to actively prevent the spread of fake information. This research should prove beneficial in distinguishing fake news from real news with improved accuracy.
II. METHODOLOGY
SYSTEM ARCHITECTURE :
TEXT COLLECTION :
The text is collected from datasets available on “Kaggle”. The focus of this project is to train an ML model that classifies news as real or fake based on the news title or news content, without knowing the source of the news. As a result, the trained model does not become over-dependent on news sources, which can help the model generalize. The Kaggle dataset consists of around 6335 real and fake news items, of which 80% is used for training the model and the remaining 20% is used for testing.
TEXT PREPROCESSING :
After the data set gets imported, pre-processing is carried out, which includes the following steps:
Lowercasing
Punctuation and Stop word removal
Stemming
Tokenization
Removal of Stop words:
Stop words such as “and”, “or”, “but”, “of”, “in”, “from”, “to”, “a”, “an”, “the” etc. can be filtered out of the content because they are very common, carry little significant information, and consume valuable processing time. Hence, removing such stop words is a crucial task.
Stemming and Lemmatization :
Stemming detaches suffixes or prefixes from a word, e.g. “running” -> “run”. Lemmatization reduces a given word to its dictionary base form, e.g. “better” -> “good”. The resulting base word is called a stem in the stemming process and a lemma in the lemmatization process.
Tokenization:
Tokenization breaks the raw text into small chunks, such as words or sentences, called tokens. These tokens help in understanding the context and in developing the NLP model.
Tokenization helps in interpreting the meaning of the text by analyzing the sequence of the words. For example, the text “It is raining” can be tokenized into ‘It’, ‘is’, ‘raining’.
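The preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not the implementation used in the system: the stop-word set contains only the examples listed in the text, `toy_stem` is a hypothetical suffix-stripping rule standing in for a real stemmer, and tokenization is done before stop-word removal so that words can be filtered individually.

```python
import string

# Stop words taken from the examples in the text; a real list would be longer.
STOP_WORDS = {"and", "or", "but", "of", "in", "from", "to", "a", "an", "the"}


def toy_stem(word):
    """Strip a few common suffixes (hypothetical toy rules, not a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


def preprocess(text):
    """Lowercase, strip punctuation, tokenize, drop stop words, then stem."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()  # whitespace tokenization into words
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [toy_stem(t) for t in tokens]


print(preprocess("It is raining, and the dogs are running!"))
# ['it', 'is', 'rain', 'dog', 'are', 'run']
```

Words such as “it” and “is” survive here only because they are not in the short example list; a fuller stop-word list would remove them as well.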
FEATURE EXTRACTION :
The pre-processed text needs to be parsed into words, and those words must be encoded as integers or floating-point values before being given to the machine learning algorithm. In the proposed system, the vectorization method used is the TF-IDF (Term Frequency – Inverse Document Frequency) vectorizer, which generates vectors for the pre-processed text.
Term Frequency (TF)
TF is the number of times a word appears in a document divided by the total number of words in the document. Every document has its own term frequency.
Inverse Document Frequency (IDF)
IDF is the log of the total number of documents divided by the number of documents that contain the word w. Inverse document frequency determines the weight of rare words across all documents in the dataset.
Finally, the TF-IDF is simply the TF multiplied by IDF.
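The definitions above translate directly into a few lines of Python. The sketch below assumes the natural logarithm and documents that are already tokenized; the two example documents are made up for illustration. Scikit-learn's `TfidfVectorizer` applies a smoothed IDF and normalizes each vector by default, so its numbers will differ from this textbook form.

```python
import math


def term_frequency(term, doc):
    """Occurrences of `term` in the document divided by the document length."""
    return doc.count(term) / len(doc)


def inverse_document_frequency(term, docs):
    """log(N / number of documents containing `term`).

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)


def tf_idf(term, doc, docs):
    """TF multiplied by IDF."""
    return term_frequency(term, doc) * inverse_document_frequency(term, docs)


# Hypothetical two-document corpus, already tokenized.
docs = [
    ["fake", "news", "spreads", "fast"],
    ["real", "news", "is", "verified"],
]
print(tf_idf("fake", docs[0], docs))  # rare word: positive weight
print(tf_idf("news", docs[0], docs))  # appears in every document: weight 0.0
```

A word that appears in every document gets an IDF of log(1) = 0, which is why common words contribute nothing to the vector.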
PERFORMANCE METRICS (CONFUSION MATRIX) :
A confusion matrix is a tabular representation of a classification model's performance on the test set. It consists of four parameters: true positive, false positive, true negative, and false negative.
Table 1. Basic Representation of Confusion Matrix
                 Predicted: Yes         Predicted: No
Actual: Yes      True Positive (TP)     False Negative (FN)
Actual: No       False Positive (FP)    True Negative (TN)
true positives (TP) : When the predicted value is yes and the actual value is also yes.
true negatives (TN) : When both the predicted and the actual value are no.
false positives (FP) : When the predicted value is yes but the actual value is no.
false negatives (FN) : When the predicted value is no but the actual value is yes.
CLASSIFIERS (For Model Training and Evaluation) :
In the proposed system, the following machine learning algorithms are used to compute the prediction:
Passive-Aggressive Classifier
Logistic Regression
Decision Tree Classifier
Gradient Boosting Classifier
Random Forest Classifier
All of the algorithms listed above are trained, and the best-performing model is then selected for fake news recognition.
PASSIVE-AGGRESSIVE CLASSIFIER :
It is a supervised learning algorithm that remains passive for correct predictions, leaving the model unchanged, and responds aggressively to incorrect predictions, updating the model to correct them. Because it learns from one example at a time, it is very useful when there is a huge amount of data and training on the entire dataset at once is computationally infeasible.
Important parameters:
C : This is the regularization parameter, and denotes the penalization the model will make on an incorrect prediction.
max_iter : The maximum number of iterations the model makes over the training data.
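To make the passive and aggressive behaviour concrete, here is a minimal sketch of one common variant of the update rule (often called PA-I), written in plain Python rather than taken from the system's code. Labels are assumed to be +1 and -1, feature vectors are assumed to be non-zero, and the toy data is made up. `C` caps the size of each correction and `max_iter` is the number of passes over the data, mirroring the two parameters above.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def pa_update(w, x, y, C=1.0):
    """One PA-I step for a sample x with label y in {+1, -1}."""
    loss = max(0.0, 1.0 - y * dot(w, x))  # hinge loss
    if loss == 0.0:
        return list(w)  # passive: correct with enough margin, no change
    tau = min(C, loss / dot(x, x))  # aggressive: step size capped by C
    return [wi + tau * y * xi for wi, xi in zip(w, x)]


def fit(X, Y, C=1.0, max_iter=5):
    """Run max_iter passes over the training data, one sample at a time."""
    w = [0.0] * len(X[0])
    for _ in range(max_iter):
        for x, y in zip(X, Y):
            w = pa_update(w, x, y, C)
    return w


def predict(w, x):
    return 1 if dot(w, x) >= 0 else -1


# Hypothetical toy data: two features, linearly separable.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.2], [0.1, 1.0]]
Y = [1, -1, 1, -1]
w = fit(X, Y)
print([predict(w, x) for x in X])  # [1, -1, 1, -1]
```

Note that in this formulation the model stays passive only when a sample is classified correctly with a margin of at least 1; a correct but low-margin prediction still triggers a small update.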
LOGISTIC REGRESSION :
It is one of the most common classification techniques when the output is binary, i.e. each sample belongs to one of two classes, labeled 0 or 1. It uses the sigmoid function, an S-shaped curve that takes any real-valued number and maps it to a value between 0 and 1.
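The S-shaped mapping can be written as sigmoid(x) = 1 / (1 + e^(-x)). The short sketch below evaluates it at a few points; the two-branch form is only a precaution against floating-point overflow for large negative inputs and does not change the function.

```python
import math


def sigmoid(x):
    """Map any real number to a value between 0 and 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # avoids overflow of exp(-x) when x is very negative
    return z / (1.0 + z)


for x in (-10, -1, 0, 1, 10):
    print(x, sigmoid(x))
```

In logistic regression, x is a weighted sum of the input features, and the output is read as the probability of class 1; a threshold such as 0.5 then turns it into a class label.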
Fig 2 : Graph for Sigmoid Function
Inference from the above graph :
y tends towards 1 as x -> infinity
y tends towards 0 as x -> (-infinity)
y is always bounded between 0 and 1
GRADIENT BOOSTING CLASSIFIER :
In this classifier, each predictor tries to improve on the previous one by reducing its errors. The initial prediction is the log(odds): the logarithm of the ratio of the number of true values (1) to the number of false values (0). This value is then converted into a probability using the logistic function $p = e^{\log(\text{odds})} / (1 + e^{\log(\text{odds})})$ to make predictions.
Then residuals are calculated for each instance in the training set (Residuals = Observed value – Predicted value) with which a new decision tree is formed. After that process, the new residuals of the tree are calculated and a new tree is created to fit the new residuals. Again, the process is repeated until a certain predefined threshold is reached, or the residuals are negligible.
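The first boosting step can be worked through numerically. The sketch below covers only the initial prediction and the first set of residuals on a made-up label list; fitting the trees to those residuals is left out. It assumes both classes are present, since the log(odds) is undefined otherwise.

```python
import math


def initial_log_odds(labels):
    """log(number of 1s / number of 0s); assumes both classes are present."""
    positives = sum(labels)
    negatives = len(labels) - positives
    return math.log(positives / negatives)


def to_probability(log_odds):
    """Logistic function: e^log(odds) / (1 + e^log(odds))."""
    return math.exp(log_odds) / (1.0 + math.exp(log_odds))


def residuals(labels, probability):
    """Residual = observed value - predicted value, per training instance."""
    return [y - probability for y in labels]


labels = [1, 1, 1, 0]  # hypothetical training labels
log_odds = initial_log_odds(labels)  # log(3 / 1)
p = to_probability(log_odds)  # approximately 0.75
print(log_odds, p, residuals(labels, p))
```

The next tree is then fitted to these residuals (about 0.25 for each positive instance and -0.75 for the negative one), and the cycle repeats.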
DECISION TREE CLASSIFIER :
In a decision tree, to predict the class of a given sample, the tree is built starting from a root node that contains the complete dataset. Using an Attribute Selection Measure (ASM), the best attribute in the dataset is chosen, and the root node, i.e. the dataset, is divided into subsets according to the possible values of the chosen attribute. A decision tree node containing the best attribute is generated. Using those subsets, decision trees are built recursively until a stage is reached where the nodes cannot be split further; such final nodes are called leaf nodes.
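One widely used Attribute Selection Measure is information gain, the reduction in entropy obtained by splitting on an attribute. The sketch below assumes that measure (the Gini index is another common choice); the attribute values and labels are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy before the split minus the weighted entropy of the subsets."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


labels = ["fake", "fake", "real", "real"]
informative = ["yes", "yes", "no", "no"]    # separates the classes perfectly
uninformative = ["yes", "no", "yes", "no"]  # tells nothing about the class
print(information_gain(informative, labels))    # gain of 1 bit
print(information_gain(uninformative, labels))  # gain of 0 bits
```

At each node the attribute with the highest gain is chosen for the split, and the procedure recurses on the resulting subsets.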
RANDOM FOREST CLASSIFIER :
This classifier can be used for both classification and regression problems. It is a collection of random trees, each built on a randomly sampled subset of the dataset. The individual decision trees are generated using the information gain values of each attribute, and each tree depends on an independent random sample. A prediction is obtained from each decision tree, and voting is performed over the predicted results. Finally, the result with the most votes is selected as the final prediction.
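The two ingredients described above, random sampling and majority voting, can be sketched as follows. Sampling with replacement (bootstrap sampling) is assumed here as the way each tree gets its random sample, and the per-tree predictions are hard-coded placeholders rather than outputs of real trees.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement; each tree gets its own sample."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = list(range(10))  # stand-in for ten training rows
sample = bootstrap_sample(rows, rng)
print(len(sample), len(set(sample)))  # same size, usually fewer distinct rows

# Hypothetical predictions from three trees for one news item.
print(majority_vote(["fake", "real", "fake"]))  # fake
```

Using an odd number of trees avoids ties in a two-class problem; with `Counter`, a tie is resolved in favour of the label seen first.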
III. MODELING AND ANALYSIS
The proposed system tries to fill the gaps in existing systems by providing a website with a highly usable interface that offers both fact checking and live news streams from various popular news channels. Users can see the live news of various channels at once, as the streams are auto-played when that module is visible. The system also includes various news categories from which a user can select the category the news belongs to, so that the corresponding dataset is imported and passed through the feature extraction and machine learning steps to get the expected results.
The modules used in the system are:
Sign up / Login Page
News Authenticator module
Live News module
SIGN UP / LOGIN PAGE :
The system allows a non-registered user to create a secure account. This module requires basic details of the user, such as name and email-id, which are stored in the database. A registered user is redirected to the login page, where the system requires a valid username and password and verifies them against the database. Even a non-registered user, however, can watch the live news of the channels added to the website.
NEWS AUTHENTICATOR MODULE :
The news authenticator module follows the steps defined above to check whether a news item is real or fake. In the front-end, the user enters the news to be checked into the input field. The module compares the news given by the user with the dataset of news obtained from different news sources, after both are converted into vectors using the TF-IDF vectorizer. The resulting vector is then fed to the final classifier model, which shows whether the given news is real or fake. This helps users avoid falling for fake news, which these days spreads very fast because of social media and the internet.
LIVE NEWS MODULE :
In this module, live news from various popular news channels such as “NDTV”, “CNA”, “CNN”, “India Today” and “Republic World” is auto-played at once. When a user wishes to watch the live news of a particular channel, he/she can click the stream and is redirected to the YouTube page of the channel for a detailed view. As future work, additional news channels can be added to the list as per the users’ wishes.
IV. RESULT
A user is a person with authorized access to the system. The user registers with the system and can then use it by entering news content in the given input field. When the input is submitted, the system converts it into a vector and passes the resulting vector to the classifier for evaluation, where the input vector is compared with the vectors of the news in the dataset. Finally, based on its similarity to the existing real or fake news, the system displays the label of the input news as real or fake. The system also enables the user to watch live news on the channel of his/her choice.
The accuracy and confusion matrix results of each classifier used in this system are given in the table below :
Table 2. Comparison of all Classifiers used in the system
SN.  Classifier                        Accuracy (%)  True Negative  True Positive  False Positive  False Negative
1    Passive – Aggressive Classifier   93.60         571            613            44              39
2    Logistic Regression               91.55         570            590            45              62
3    Random Forest Classifier          90.92         552            600            63              52
4    Gradient Boosting Classifier      90.69         569            580            46              72
5    Decision Tree Classifier          83.74         498            563            117             89
As the Passive – Aggressive Classifier tops the table with the highest accuracy of 93.6%, it is used to identify the category (real or fake) to which the news given as input to the system belongs.
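The other metrics mentioned in the abstract, such as recall and F1 score, can be derived from the counts in Table 2. The sketch below uses the standard definitions and takes the counts exactly as reported; which class the table treats as "positive" is not stated, so precision and recall here simply follow the table's own labels. Each row sums to 1267 samples, which matches 20% of the 6335-item dataset.

```python
# (TN, TP, FP, FN) for each classifier, copied from Table 2.
RESULTS = {
    "Passive-Aggressive": (571, 613, 44, 39),
    "Logistic Regression": (570, 590, 45, 62),
    "Random Forest": (552, 600, 63, 52),
    "Gradient Boosting": (569, 580, 46, 72),
    "Decision Tree": (498, 563, 117, 89),
}


def metrics(tn, tp, fp, fn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


for name, counts in RESULTS.items():
    acc, prec, rec, f1 = metrics(*counts)
    print(f"{name}: accuracy={acc:.4f} precision={prec:.4f} "
          f"recall={rec:.4f} f1={f1:.4f}")
```

Recomputed this way, the accuracies of four classifiers match the table to two decimals, while the Passive-Aggressive counts give about 93.45%, marginally below the listed 93.60; the ranking of the classifiers is unchanged.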
V. CONCLUSION AND FUTURE WORK
Fake news is a widespread problem that is meant to deceive and mislead readers. The problems related to fake news have increased considerably in the last few years, particularly in the political sector. However, classifying news manually requires in-depth knowledge of the domain and the expertise to identify anomalies in the text. In this research, we have discussed the problem of classifying fake news articles using machine learning models. The ultimate aim of this project, “Fake News Detection System”, is to let users know whether the news they are looking at is real or not. This project implements an efficient approach using a vectorizer and a classifier that are more effective than others, i.e. the TF-IDF vectorizer and the Passive-Aggressive Classifier. As a result, the model reached its highest accuracy score of 93.6%. Although there is evident success in detecting fake news using various ML approaches, the ever-changing characteristics and features of fake news in social media networks pose a challenge to its categorization. Hence, deep learning methods, which might give higher accuracy, can be considered as future work. Likewise, real-time fake news identification in videos is another possible future direction. In addition, a news suggestion / recommendation module, which suggests news related to the news the user has submitted for authentication, can be embedded in the website. This can be achieved by processing the keywords present in the news to be authenticated.