Trustworthy Media News Content Retrieval from Web Using Truth Content
Discovery Algorithm
PII: S1389-0417(18)30751-4
DOI: https://doi.org/10.1016/j.cogsys.2019.01.002
Reference: COGSYS 810
Please cite this article as: Solainayagi, P., Ponnusamy, R., Trustworthy Media News Content Retrieval from Web Using Truth Content Discovery Algorithm, Cognitive Systems Research (2019), doi: https://doi.org/10.1016/j.cogsys.2019.01.002
P. Solainayagi 1*, R. Ponnusamy 2
1 Research Scholar, Sathyabama Institute of Science and Technology, Chennai, Tamilnadu, India
2 Professor, CVR College of Engineering, Hyderabad, Telangana, India
* Corresponding Author's E-mail: solaisbu.phd@gmail.com
Abstract: Nowadays, the internet plays a major role in online information retrieval. The web provides information on all fields, such as national news, lifestyle, movies, shopping, spirituality, sports, and entertainment. However, information retrieved from the web cannot be assumed to be believable or trustworthy, because multiple conflicting answers exist for the same query. This paper studies many research articles related to truth information discovery and finds that no prior research adequately determines the trustworthiness of news content extracted from multiple information sources with minimal misclassification error and retrieval time. To alleviate these issues, the truth content discovery (TCD) algorithm is proposed to produce trustworthy, multi-domain news information in minimal time. The system provides reliable information from various news providers with minimal classification error. The proposed method ranks the news content index based on query matches in the extracted information, minimizing both query retrieval time and classification error. In the experimental evaluation, the proposed TCD+J48 algorithm produced the best results compared to existing approaches: it reduces the query retrieval time (QRT) by 7.54 milliseconds and the error rate (ER) by 5%, and achieves a Mean Normalized Absolute Distance (MNAD) of 0.98.
Keywords: news content, information sources, truth content discovery, multiple conflicting sources, information retrieval.
1. INTRODUCTION
In the present globalized scenario, the internet plays a major role in online information retrieval. The internet provides information on all fields, such as national news, lifestyle, movies, shopping, spirituality, sports, and entertainment. Artificial Intelligence is a growing research area that brings new technologies into everyday human life, and research in this domain considers context awareness, interaction, and ubiquitous computing services in relation to trustworthiness. For example, a user who wants to retrieve specific news such as "central government work performance of the last two years" from a major national or international media channel can easily do so through www.ask.com, www.google.com, www.bing.com, etc. However, it cannot be assumed that whatever information is retrieved from a web source is trustworthy. In many incidents, different and contradictory information is available for the same query, and it is a practically impossible and time-consuming task to analyse the collected data in order to identify the most trustworthy information.
Based on a Rasmussen Reports case study [27] of February 2013, 56% of people in the USA said that they get most of their news from TV news, 32% from cable, and 24% from traditional network news. The same national telephone survey noted that 25% of USA respondents get their news from the internet, 10% trust the hardcopy newspaper, and 7% rely on the radio. Here, 56% of voters stated that media-reported news is at least somewhat trustworthy; however, just 6% said it is very trustworthy, while 42% do not trust the media and 12% believe that news is just a kind of report. The Gaussian Truth Model (GTM) introduced a Bayesian probabilistic mechanism for truth content discovery. However, GTM can only be used on continuous data for training-model predictions, and it produces very poor results on truth content discovery. Conflict Resolution on Heterogeneous Data (CRH) was proposed to infer truth information from multiple sources of conflicting information spanning various data types. The method solves an optimization problem that reduces the overall weighted deviation from the identified truth content. However, CRH does not consider source reliability evaluations and skips the unique characteristics of each data type. The TruthFinder method was proposed to solve the veracity problem by establishing the relationships between websites and their content. TruthFinder extracts facts from conflicting information and detects reliable web sources better than popular search engines; a similarity function adjusts the vote value by considering the influences among facts. However, the method is time-expensive when retrieving information from multiple web sources. Voting is a procedure that aggregates individual preferences into collective decisions. It assumes that all necessary URL resources are equally available, an assumption that does not hold in all cases. For example, "Mount Everest's height is 29,029 feet" is considered true information because it is the answer collected from the majority of users. The major challenge for this method is that information source reliability is usually unknown and has to be estimated from the multiple sources themselves. The Feedforward and Feedback Memory Network (FFMN) was introduced to evaluate statement credibility across various types of information with different weights; every data weight is calculated automatically by model learning. The method uses vectors, which express statement and source reliability with better visualization capability than natural numbers. However, the technique consumes more time to retrieve information, and its prediction process is complex. The two-dimensional model of Fogg and Tseng, 1999 [28] reduces the credibility factors to two ranges, namely trustworthiness and expertise. It is the method closest to surface credibility assessment and can be treated as a way of experiencing credibility composed of precise trustworthiness and expertise; the one major difference between the two models is how they treat visual attractiveness. However, this method also does not consider conflicts among multiple sources of media news information. The enhanced trustworthiness method designed by Ramachandran et al., 2009 [29] retrieves trustworthy information from internet sources, filtering truth content by provenance, authority, age, popularity, and related URL links. However, the estimated trustworthiness is stored only for static websites, so the method is not flexible for media news channel content and URLs, where its scale is unable to predict trustworthy information. It also does little to identify multiple sources of conflicting information.
To overcome the above problems, the truth content discovery algorithm [30-31] is introduced to extract reliable and intelligent information from multiple information sources with minimal retrieval time. The system provides reliable, trustable news content extracted from various conflicting information sources. The proposed method uses the word "fact" to mean a piece of obtained information that is either true or false; a fact is studied based on the properties of an object in the content or on the relationship between two objects. The trustworthiness of an information source provider depends on the interdependency between facts and websites, so the method selects an iterative computational approach to identify the trustworthiness of information. In each iteration, the probabilities of facts being true and the trustworthiness of websites are updated against their intelligent information retrieval. The technique avoids the conflicting issues that arise when information is collected from various media biases. The method retrieves the most relevant information using a re-ranking function, which clusters the most relevant information from numerous similar news media URL contents. Here, it is clear that the trustworthiness of information can be evaluated neither by adding up the weights of facts nor by the probabilities of facts alone being true. The rest of the paper addresses the following objectives:
To design the truth discovery algorithm to extract trustworthy news content from multiple conflicting sources.
To solve the conflicting-information issues, news content sparsity, and scalability during truth content discovery in social media applications and media news channel URL links.
To define the content reliability level of facts and the weight of each fact, so that reliable or genuine information is displayed using iterative methods.
To provide trustworthy information from news content collected from various media channels in web services, by studying historical information contributions and content.
To apply a re-ranking function for retrieving news media content with the best accuracy and minimal retrieval time.
To visualize reliable media news content with good, noise-free information prediction and accuracy.
To minimize query retrieval time (QRT), error rate (ER), and Mean Normalized Absolute Distance (MNAD) compared with existing methods.
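The iterative computation described above, in which facts gain confidence from the sources that assert them and sources gain trust from the confidence of their facts, can be illustrated with a minimal self-contained sketch. All class and method names, the initial trust value, and the probabilistic combination rule are illustrative assumptions, not the paper's implementation:

```java
import java.util.*;

/** Minimal sketch of iterative truth discovery: source trust and claim
 *  confidence are updated in turn for a fixed number of iterations. */
public class IterativeTruthDiscovery {

    /** sourceClaims.get(s) = set of claim ids asserted by source s. */
    public static Map<Integer, Double> claimConfidences(
            List<Set<Integer>> sourceClaims, int numClaims, int iterations) {
        int numSources = sourceClaims.size();
        double[] trust = new double[numSources];
        Arrays.fill(trust, 0.8);                  // assumed initial source trust
        double[] conf = new double[numClaims];

        for (int it = 0; it < iterations; it++) {
            // Claim confidence: combine the trust of supporting sources,
            // treating (1 - trust) as the chance that a source is wrong.
            for (int c = 0; c < numClaims; c++) {
                double probAllWrong = 1.0;
                for (int s = 0; s < numSources; s++) {
                    if (sourceClaims.get(s).contains(c)) {
                        probAllWrong *= (1.0 - trust[s]);
                    }
                }
                conf[c] = 1.0 - probAllWrong;
            }
            // Source trust: average confidence of the claims it makes.
            for (int s = 0; s < numSources; s++) {
                Set<Integer> claims = sourceClaims.get(s);
                if (claims.isEmpty()) continue;
                double sum = 0.0;
                for (int c : claims) sum += conf[c];
                trust[s] = sum / claims.size();
            }
        }
        Map<Integer, Double> result = new LinkedHashMap<>();
        for (int c = 0; c < numClaims; c++) result.put(c, conf[c]);
        return result;
    }
}
```

On a toy input where claim 0 is backed by two sources and claim 1 by only one, claim 0 ends the iterations with the higher confidence, which matches the intuition that agreement across sources raises trustworthiness.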
The rest of the paper is organized as follows. Section 2 covers the literature related to the proposed work. Section 3 explores the system methodology and its implementation details. Section 4 discusses the final results after execution of the design and evaluates its performance. Finally, Section 5 summarizes the overall work with future enhancements.
2. LITERATURE STUDY
This section provides a literature review of related research work, analytical studies, and case studies on truth content identification from various conflicting information sources. Ma et al., 2013 [1] designed FaitCrowd, a fine-grained truth discovery model to aggregate contradictory data from multiple information sources. Gupta et al., 2011 [2] derived a model to evaluate the trustworthiness of objects and information providers; it groups sets of similar objects to provide useful facts with better accuracy. However, these methods did not focus on multiple sources of conflicting information. Li et al., 2014 [3] tried to resolve conflicts among multiple sources of heterogeneous data by minimizing the overall weighted deviation between the truths and the multi-source information providers. Li et al., 2014 [4] developed the confidence-aware truth discovery (CATD) method to detect truths from multiple conflicting data with the long-tail phenomenon; it also estimated source reliability at various levels of participation. Yu et al., 2014 [5] combined information extraction and truth finding into an unsupervised multi-dimensional truth-finding framework. However, these methods are unable to detect trustworthy information from various sources of media news because they rely on traditional parameters to detect truth information.
Zhao et al., 2011 [6] implemented a new truth-finding method for handling numerical data, based on Bayesian probabilistic models. Wang et al., 2012 [7] addressed the challenge of truth discovery from noisy social sensing data, working to predict the noisy data from social media. Li et al., 2012 [8] studied the truthfulness of deep web data, using data fusion methods to resolve conflicts and find true information. Pasternack et al., 2011 [9] introduced a generalized fact-finding framework that incorporates additional information into the investigative process to assess conflicting claims. Vydiswaran et al., 2011 [10] designed a content-based trust propagation structure to ascertain the veracity of free-text claims and measure the trustworthiness of their information providers. These methods concentrate on factual information; however, they do not consider source reliability and the truth of claimed information.
Moustafa et al., 2014 [11] expressed a stigmergy-based approach for providing decentralized service interactions and handling service composition in highly dynamic open environments. Lu et al., 2013 [12] introduced a method to infer user search goals by analyzing search engine query logs. Robin et al., 2016 [13] implemented a new method that tackles the truth discovery problem through principled probabilistic modeling. Dong et al., 2013 [14] introduced a streaming approach to solving the truth estimation problem in crowdsourcing applications; the method also solves an expectation-maximization (EM) problem to determine the odds of correctness of different observations. Angadi et al., 2013 [15] expressed a fact-finder framework that finds facts among conflicting information and identifies trustworthy information sources better than popular search engines. Nevertheless, these methods consume more time for query execution, and there is a chance that they produce information errors.
Bhuvaneswaran et al. [16] implemented a combination of a ranking model and the truth-finder algorithm for evaluating the similarity between two or more websites. Ansari et al. [17] introduced a method to validate the trustworthiness of URL links; the system filters trustworthy URLs based on five major areas, namely authority, related resources, popularity, age, and recommendation. Sharma et al. [18] designed parallel and streamed truth discovery algorithms for quantitative crowdsourcing applications. Gao et al. [19] developed a crowdsourcing aggregation approach to resolve conflicts and achieve high-quality data. Ba et al. [20] demonstrated the VERA framework, which supports data mining from web-based textual data and micro-texts from Twitter. These techniques work on similarity detection and source reliability; however, they do not focus on evaluating the trustworthiness of media news, where identifying the information source of content is a challenging task.
Xiao et al. [21] implemented a truth discovery approach with a randomized Gaussian mixture model (RGMM) to represent multi-source data. Yin et al. [22] developed a semi-supervised approach that finds true values with the help of ground truth data. Dong et al. [23] demonstrated the SOLOMON system to detect copied data from various internet sources. Yong-Xin et al. [24] implemented a two-stage data conflict resolution protocol based on Markov Logic Networks; the technique divides attributes according to their conflict degree and then resolves the data conflicts. Li et al. [25] designed an incremental truth discovery framework to update truth objects and information source weights dynamically based on incoming data. Li et al. [26] introduced the Feedforward and Feedback Memory Network (FFMN) for estimating statement credibility across various types of information with different weights; the learning model calculates every data weight automatically and uses vectors, which express statement and source reliability with better visualization capability than natural numbers. However, the method consumes more time to retrieve the information.
3. SYSTEM METHODOLOGY
The system design explains the proposed truth content discovery framework, which extracts trustworthy news content from various information sources with minimal error and reduced query retrieval time. Figure 1 elaborates the working principle and implementation procedure of the proposed framework. The application is straightforward and user-friendly: multiple domains of news can be viewed in a single place. The proposed design relates various components, such as ubiquitous computing, context awareness, and socially intelligent communication, to trustworthiness. The framework is divided into the following modules: User GUI Design, Content Preprocessing, and Retrieved Result Visualization.
[Figure 1: User information retrieval from multiple sources]
The module designs the graphical user interface (GUI) through which a user retrieves news information from the web; the user can give any news-related query as input. After collecting the user's query, the system extracts all related content from the database. Next, it applies the Truth Content Discovery algorithm to retrieve the trustworthy content index from the various information sources, measuring the reliability level and weight of the extracted news content. The method then calculates the facts of the material to evaluate the trustworthiness level of the news content.
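As a rough illustration of the content-extraction step, matching a user query against stored news items, the following sketch counts shared terms and orders the matches. The Map-based store and the term-count scoring are illustrative assumptions, not the paper's actual index:

```java
import java.util.*;

/** Sketch: match a user query against stored news items by counting how many
 *  query terms each item contains, then order items by that count. */
public class QueryMatcher {

    /** Returns ids of stored items containing at least one query term,
     *  ordered by descending number of matching terms. */
    public static List<String> matchingItems(String query, Map<String, String> store) {
        Set<String> terms = new HashSet<>(Arrays.asList(query.toLowerCase().split("\\s+")));
        List<Map.Entry<String, Integer>> scored = new ArrayList<>();
        for (Map.Entry<String, String> e : store.entrySet()) {
            int hits = 0;
            for (String w : e.getValue().toLowerCase().split("\\s+")) {
                if (terms.contains(w)) hits++;      // count query-term occurrences
            }
            if (hits > 0) scored.add(new AbstractMap.SimpleEntry<>(e.getKey(), hits));
        }
        scored.sort((a, b) -> b.getValue() - a.getValue()); // best match first
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, Integer> e : scored) ids.add(e.getKey());
        return ids;
    }
}
```

A real system would use an inverted index and stemming; the point here is only the shape of the query-to-content matching that precedes the trustworthiness scoring.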
The proposed algorithm extracts information from multiple conflicting sources. The media news channel application has G groups of news sources S = (S1, S2, ..., SG) and a group of K claims C = (C1, C2, ..., CK), where Si indicates the i-th source and Cj indicates the j-th claim. Q(Si, j, t) represents the report generated by source Si about claim Cj at time t. Consider a media news channel as an example, with a variety of news content collected from various sources: news auditors have to pay attention to identifying each event, object, or trend together with its source details (where the news was collected). Some of the information has minimal comment and trend, yet carries true facts; the news media itself treats it as an official news announcement. The method writes Cj = T and Cj = F to indicate that claimed information may be true or false, and each claim is correlated with a basic trustworthiness variable {x*j} such that xj = 1 means Cj holds true information and xj = 0 means Cj holds false information. The objective of the method is to retrieve the most trustworthy and relevant information from multiple conflicting web sources without prior knowledge of information source reliability; the details are given below:
Step 1: Trustworthiness Claim:
The method defines Dj as the degree of trustworthiness with which claim Cj is held to be true; the larger the value of Dj, the more likely Cj is true. The technique evaluates Dj as in equation (1):

Dj = Pr(Cj = T)    (1)

where Pr(Cj = T) is the probability that claim Cj is true.
TIVi,jT = (1 − UVi,jT) × IVi,jT    (3)

where TIVi,jT is the T-th trustworthy information value of the information report from source Is on claim Cj. The system assumes that the true information value depends on a set of semantic values, namely the uncertainty value UVi,jT and the independence value IVi,jT, in order to obtain the most reliable and trustworthy report, where a) the information report either satisfies or dissatisfies claim Cj, and b) the source of information Is either copied or forwarded the report from another source or made an individual observation regarding claim Cj.
Here Ir represents the reliability of information source Is, and TIVi,jT represents the previous information credibility value of a record generated by Is at time T on claim Cj. sgn(TIVi,jT) expresses the sign of TIVi,jT, and S indicates the length of the TIVi,jT sequence, where the sign depends on the most recent information report. The proposed method prioritizes self-improvement behavior by considering only recent information reports: it uses Ir^((S−1)/S) as a "damping scale" to allocate the maximum weights to recent information reports. The proposed method thereby minimizes the spamming effect in media news content: if a source repeats similar information over time, its earlier reports have only a limited effect on the source's global information contribution value, and the most recent reports receive the highest weight. All of these factors combine to recognize trustworthy information among the widely spread conflicting information from multiple sources. The pseudocode of the proposed Truth Content Discovery algorithm is given below in detail:
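The recency-damping step described above can be sketched as follows. The damping form (S − 1)/S and the normalization are assumptions read from the surrounding text, not the paper's exact pseudocode:

```java
/** Hedged sketch of the recency-weighted contribution step: older reports are
 *  discounted by a damping factor so that recent reports dominate the
 *  source's contribution value. */
public class RecencyWeighting {

    /** reports[k] = credibility value of the k-th report, oldest first.
     *  Returns the damped, normalized contribution of the sequence. */
    public static double weightedContribution(double[] reports) {
        int s = reports.length;
        if (s == 0) return 0.0;
        double damping = (s - 1) / (double) s;   // assumed "damping scale"
        double weighted = 0.0, norm = 0.0;
        for (int k = 0; k < s; k++) {
            // The newest report (k = s - 1) gets weight damping^0 = 1;
            // each older report is discounted by one more damping factor.
            double w = Math.pow(damping, s - 1 - k);
            weighted += w * reports[k];
            norm += w;
        }
        return weighted / norm;                   // normalize to the weight mass
    }
}
```

With this weighting, a source whose latest report is strong scores above its plain average, which matches the text's intent that spam-like repetition of old reports cannot dominate the contribution value.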
The proposed algorithm is implemented on a laptop with an Intel Dual Core processor (1.836 GHz), 4 GB memory, and a Windows 7 system. The truth discovery framework is implemented in the Java programming language using NetBeans 8.0.2, Apache Tomcat 8.0.3, and a Microsoft SQL Server database. The proposed method uses the Weka (https://www.cs.waikato.ac.nz/ml/weka/) library to predict the accuracy of information. WEKA (Waikato Environment for Knowledge Analysis) is a collection of machine learning techniques for data mining tasks and is used here to evaluate the proposed algorithm's performance on the respective datasets. It supports data pre-processing, classification, regression, clustering, association rules, and visualization; the library's implementations of classification, numeric prediction, and meta-scheme techniques are used in this work.
4.1.1 Datasets
4.2 Result
This section explains the mathematical evaluation parameters used to assess the efficiency of the proposed TCD algorithm: the trustworthiness accuracy of the content and the retrieval time of a given user query or term. The proposed TCD also works to predict truthful information with minimal classification error and retrieval time. The method evaluates the error rate (ER), Mean Normalized Absolute Distance (MNAD), and query retrieval time (QRT) for the stock and flight datasets respectively.
Mean Normalized Absolute Distance (MNAD) calculates the distance between the retrieved trustworthiness outputs and the ground truth. Different types of data entries have different measurement scales, so MNAD uses an individual variance to calculate their mean. The proposed method describes a mathematical model to estimate the differences between the values predicted by a model and the values observed from the environment being modeled; these individual differences are also called residuals, as in equation (6).
MNAD = (1/n) Σᵢ |ŷᵢ − yᵢ|    (6)

where ŷᵢ is a vector of n predictions, and yᵢ is the vector of observed values corresponding to the inputs of the function that generates the predictions.
The proposed framework also describes the mathematical model for query retrieval time in equation (7). The method calculates the processing time as the product of the candidate dataset size and the average query processing time. Query retrieval time (QRT) is calculated as:

QRT = TCD × TAR    (7)

where TCD is the total number of candidate data sets and TAR is the average retrieval time of a data set.
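Equations (6) and (7) can be checked with a short sketch. The absolute-value form of MNAD and the plain product for QRT follow the definitions above; both are reconstructions from the text, so treat the code as illustrative:

```java
/** Illustrative implementations of the evaluation measures in equations (6)
 *  and (7); the absolute-value residual form of MNAD is an assumption. */
public class EvaluationMetrics {

    /** MNAD = (1/n) * sum_i |ŷi − yi| over n prediction/observation pairs. */
    public static double mnad(double[] predicted, double[] observed) {
        double sum = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            sum += Math.abs(predicted[i] - observed[i]); // residual |ŷi − yi|
        }
        return sum / predicted.length;
    }

    /** QRT = TCD * TAR: candidate-set size times average retrieval time. */
    public static double qrt(int candidateCount, double avgRetrievalTimeMs) {
        return candidateCount * avgRetrievalTimeMs;
    }
}
```

For example, two predictions {1.0, 2.0} against observations {1.0, 4.0} give residuals 0 and 2, so MNAD is 1.0, and 10 candidates at an average 0.5 ms each give a QRT of 5 ms.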
Table 1 displays the error rate (ER), mean normalized absolute distance (MNAD), and query retrieval time (QRT) for the stock and flight datasets. The proposed method is evaluated against existing conventional techniques, namely the Gaussian truth model (GTM) [26], conflict resolution on heterogeneous data (CRH) [26], TruthFinder [26], Voting [26], and the Feedforward and Feedback Memory Network (FFMN) [26]. It can be noticed that FFMN [26] is the approach closest to the proposed TCD method among the existing approaches; however, the proposed TCD algorithm's results are far better than FFMN [26]. The proposed TCD algorithm is integrated with the J48 classifier to reduce the prediction error and improve the media content accuracy, offering the most trustworthy media news content. The classifier implements a procedure for classifying categorical data based on their attributes; it is also effective for processing large amounts of data and is therefore often used in data mining applications. Based on the results in Table 1, the proposed TCD performs well on every respective dataset (Flight and Stock) with respect to ER, MNAD, and QRT.
Table 1: Error Rate(ER), Mean Normalized Absolute Distance (MNAD) and Query
Retrieval time (QRT) for Stock and Flight Datasets
[Figure 2: Error Rate (ER) for Flight and Stock Datasets]
Figure 3: Mean Normalized Absolute Distance (MNAD) for Flight and Stock Dataset
Figure 4: Query retrieval time (QRT) in Milliseconds for Flight and Stock Datasets.
Figure 5: Retrieved pages and query retrieval time performance for general search, index-based search, and the proposed Truth Content Discovery algorithm.
Conclusion
Funding:
The authors declare that they sponsored the publication fee and research work themselves; there was no external funding for this work.
Acknowledgments
The authors would like to thank the Research and Development Division (Academic) of Sathyabama Institute of Science and Technology, Chennai, for the facilities. We also thank the Dean of Research and the Doctoral committee members for their valuable suggestions to improve the quality of the work.
Authors’ contributions
This article is part of academic research work on finding trustworthy information from multiple conflicting information sources. The crucial issue is determining which information is reliable or trustworthy when people receive similar information from various sources. Both authors worked at ground level to identify the problem statement, carry out the research objectives, find solutions to the above problems with result hypotheses, and deploy the implemented results with comparative analysis. Both authors fully read and finalized the article for publication.
Initially, the system collects a large volume of information based on the object (term or query) and stores it in a centralized database, from which the user can retrieve answers based on their requirements. After collecting the user query, the proposed algorithm obtains the related information based on the relationship between the query and the information available in the centralized database. The proposed approach collects a set of HTML-embedded media content and web content and then cleans the unwanted information from the media bias; hence, the method clusters out irrelevant media news content from the collected media bias. Next, the re-ranking function is applied to rank the most relevant information among various similar pieces of information. Finally, the most relevant, reliable media information is displayed with its trustworthiness score, with minimal content retrieval time and the best accuracy. The method clusters the most relevant news content from numerous similar media contents using the re-ranking function, which visualizes the most relevant information based on user preferences and the available news content media bias.
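The re-ranking step described above can be sketched as a combined score of query relevance and source trustworthiness. The equal 50/50 weighting and all names are illustrative assumptions, not the paper's actual re-ranking function:

```java
import java.util.*;

/** Sketch of the re-ranking step: candidate items already matched to the
 *  query are re-ordered by a combined relevance/trust score. */
public class ReRanker {

    /** Returns item ids ordered by 0.5*relevance + 0.5*trust, best first. */
    public static List<String> reRank(Map<String, Double> relevance,
                                      Map<String, Double> trust) {
        List<String> ids = new ArrayList<>(relevance.keySet());
        ids.sort((a, b) -> Double.compare(
                0.5 * relevance.get(b) + 0.5 * trust.getOrDefault(b, 0.0),
                0.5 * relevance.get(a) + 0.5 * trust.getOrDefault(a, 0.0)));
        return ids;
    }
}
```

Note the design choice: a slightly less relevant item from a highly trusted source can outrank a more relevant item from an untrusted one, which is exactly the behavior the trustworthiness-aware retrieval above calls for.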
[Flowchart: Collect Query → Process Query → Validate Query in DB → Apply TCD Algorithm]
Table 1. MAE, RMSE and QRT for Weather, Flight and Stock Datasets
Fig. 2 Mean Absolute Error (MAE) for Weather, Flight and Stock Datasets.
Fig. 3 Root Mean Squared Error (RMSE) for Weather, Flight and Stock Datasets.
[Chart data for Figs. 3–4: values for CATD, CRH, GTM, DynaTD, and TCD+J48 on the Weather, Flight and Stock datasets]
Fig. 4 Query retrieval time (QRT) for Weather, Flight and Stock Datasets.
Fig. 5 Retrieved pages and query retrieval time with current approaches and the proposed approach.