
Accepted Manuscript

Trustworthy Media News Content Retrieval from Web Using Truth Content
Discovery Algorithm

Palaiyah Solainayagi, Ramalingam Ponnusamy

PII: S1389-0417(18)30751-4
DOI: https://doi.org/10.1016/j.cogsys.2019.01.002
Reference: COGSYS 810

To appear in: Cognitive Systems Research

Received Date: 29 September 2018
Revised Date: 8 November 2018
Accepted Date: 6 January 2019

Please cite this article as: Solainayagi, P., Ponnusamy, R., Trustworthy Media News Content Retrieval from Web Using Truth Content Discovery Algorithm, Cognitive Systems Research (2019), doi: https://doi.org/10.1016/j.cogsys.2019.01.002

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers
we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and
review of the resulting proof before it is published in its final form. Please note that during the production process
errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Trustworthy Media News Content Retrieval from Web Using Truth Content Discovery
Algorithm

Palaiyah Solainayagi 1*, Ramalingam Ponnusamy 2

1 Research Scholar, Sathyabama Institute of Science and Technology, Chennai, Tamilnadu, India
2 Professor, CVR College of Engineering, Hyderabad, Telangana, India
1* Corresponding author's e-mail: solaisbu.phd@gmail.com

Abstract: The internet now plays a major role in online information retrieval. The web provides information on nearly every field, including national affairs, lifestyle, movies, shopping, spirituality, sports, and entertainment. However, web-retrieved information cannot be assumed to be believable or trustworthy, because the same query often returns multiple conflicting answers. This paper reviews research on truth information discovery and finds that no existing work adequately assesses the trustworthiness of news content extracted from multiple information sources while keeping misclassification error and retrieval time low. To address these issues, a truth content discovery (TCD) algorithm is proposed that produces trustworthy information across multiple news domains in minimal time. The system provides reliable information from various news providers with minimum classification error. The proposed method ranks the news content index according to how well the extracted information matches the query, which reduces both query retrieval time and classification error. In the experimental evaluation, the proposed TCD+J48 algorithm produced the best results among the compared approaches: it reduces the query retrieval time (QRT) by 7.54 milliseconds, the error rate (ER) by 5%, and the Mean Normalized Absolute Distance (MNAD) by 0.98.
Keywords: news content, information sources, truth content discovery, multiple conflicting, information retrieval.

1. INTRODUCTION
In the present globalized scenario, the internet plays a major role in online information retrieval. As is well known, the internet provides information on nearly every field, including national affairs, lifestyle, movies, shopping, spirituality, sports, and entertainment. Artificial intelligence is an active research area that brings new technologies into daily human life. This research domain considers context awareness, interaction, and ubiquitous computing services in support of trustworthiness. For example, if someone wants to retrieve specific news such as "central government work performance of the last two years" from a major national or international media channel, it can easily be found through www.ask.com, www.google.com, www.bings.com, etc. However, it cannot be assumed that information retrieved from a web source is trustworthy. In many cases, different and contradictory information is available for the same query, and manually analysing the collected data to identify the most trustworthy information is impractical and time-consuming.
According to a Rasmussen Reports case study from February 2013 [27], 56% of people in the USA said they get most of their news from TV news (32% from cable and 24% from traditional network news). The same national telephone survey found that, of the rest, 25% rely on the internet, 10% on printed newspapers, and 7% on the radio. In the survey, 56% of voters stated that news reported by the media is trustworthy to some extent. However, just 6% say that it is very trustworthy, while 42% do not trust the media and 12% believe that news is just a kind of report.
The Gaussian Truth Model (GTM) introduced a Bayesian probabilistic mechanism for truth content discovery. However, GTM can only be used with continuous data when training the prediction model, and it produces very poor results on truth content discovery. Conflict Resolution on Heterogeneous Data (CRH) infers truth information from multiple conflicting sources that include various data types. The method solves an optimization problem that reduces the overall weighted deviation of the identified truths from the source information. However, CRH does not evaluate source reliability and skips the unique characteristics of each data type. The TruthFinder method addresses the veracity problem by establishing relationships between websites and their content. TruthFinder identifies facts among conflicting information and detects reliable web sources better than well-known search engines. A similarity function adjusts the vote value by taking the influences among facts into account. However, the method is time-expensive when retrieving information from multiple web sources. Voting is a procedure that aggregates individual preferences into a collective decision. It assumes that all necessary URL resources are equally available. However, this assumption does not hold in all cases. For example, the claim that the height of Mount Everest is "29,029 feet" is considered true because it is the value reported by the majority of users. The major challenge for the method is that information source reliability is usually unknown and has to be gathered from the multiple sources. The Feedforward and Feedback Memory Network (FFMN) was introduced to evaluate statement credibility using various types of information with different weights. Here, the weight of every data item is calculated automatically by model learning.
The method uses vectors, which have better visualization capability than natural numbers, to express statement and source reliability. However, the technique takes more time to retrieve information and its prediction process is complex. The two-dimensional model of Fogg and Tseng, 1999 [28] reduces credibility to two factors, namely trustworthiness and expertise. It is the closest method to surface credibility assessment, which can be treated as a way of experiencing credibility composed of trustworthiness and expertise. The one major difference between the two models is how they treat visual attractiveness. This method also does not consider conflicts among multiple sources of media news information. The enhanced trustworthiness method of Ramachandran et al., 2009 [29] was designed for trustworthy information retrieval from internet sources. It filters truth content by provenance, authority, age, popularity, and related URL links to retrieve trustworthy information from websites. The estimated trustworthiness is stored only for static websites. However, the method is not flexible enough for media news channel content and URLs, where its scale is unable to predict the trustworthy information, and it does little to identify conflicting information from multiple sources.
To overcome the above problems, the truth content discovery algorithm [30-31] is introduced to extract reliable and intelligent information from multiple information sources with minimal retrieval time. The system provides reliable, trustworthy news content extracted from various conflicting information sources. The proposed method uses the word "fact" for a piece of obtained information that is presumed to be either true or false. A fact is studied in terms of the properties of an object in the content or the relationship between two objects. The trustworthiness of an information source provider depends on the interdependency between facts and websites, so an iterative computational approach is used to identify the trustworthiness of information. In each iteration, the probability of each fact being true and the trustworthiness of the websites are re-estimated. The technique avoids conflicting issues in content collected from various media bias. The method retrieves the most relevant information using a re-ranking function, which clusters the most relevant information from numerous similar news media URL content. It is clear that the trustworthiness of information can be evaluated neither by simply adding the weights of facts nor solely from the probabilities of facts being true. The rest of the paper addresses the following objectives.
• To design the truth discovery algorithm to extract trustworthy news content from multiple conflicting sources.
• To solve the issues of conflicting information, news content sparsity, and scalability during truth content discovery in social media applications and media news channel URL links.
• To define the content reliability level and the weight of facts so that reliable or genuine information can be displayed using iterative methods.
• To provide trustworthy information from news content collected from various media channel sources in web services by studying historical information contribution and content.
• To apply a re-ranking function for retrieving news media content with the best accuracy and minimal retrieval time.
• To visualize reliable media news content with good, noise-free information prediction and accuracy.
• To minimize query retrieval time (QRT), error rate (ER), and Mean Normalized Absolute Distance (MNAD) compared with existing methods.
Section 2 reviews the literature related to the proposed work. Section 3 describes the system methodology and its implementation details. Section 4 discusses the results obtained after executing the design and evaluates its performance. Finally, Section 5 summarizes the overall work and outlines future enhancements.

2. LITERATURE STUDY

This section reviews related research, analytical studies, and case studies on identifying truth content from various conflicting information sources. Ma et al., 2013 [1] designed Fair Crowd, a fine-grained truth discovery model that aggregates contradictory data from multiple information sources. Gupta et al., 2011 [2] derived a model to evaluate the trustworthiness of objects and information providers. It groups sets of objects with similar content to provide useful facts and better accuracy. However, these methods did not focus on multiple sources of conflicting information. Li et al., 2014 [3] tried to resolve conflicts among multiple sources of heterogeneous data. The method minimizes the overall weighted deviation between the truths and the multi-information source providers. Li et al., 2014 [4] developed the confidence-aware truth discovery (CATD) method to detect truths from multiple conflicting data sources that exhibit the long-tail phenomenon. It also estimates source reliability under various levels of participation. Yu et al., 2014 [5] combined information extraction and truth finding into an unsupervised multi-dimensional truth-finding framework. However, these methods are unable to detect trustworthy information from various media news sources because they rely on traditional parameters to detect truth information.
Zhao et al., 2011 [6] implemented a new truth-finding method for handling numerical data, based on Bayesian probabilistic models. Wang et al., 2012 [7] addressed the challenge of truth discovery from noisy social sensing data, working to predict noisy data from social media. Li et al., 2012 [8] studied the truthfulness of deep web data. The system uses data fusion methods to resolve conflicts and find true information. Pasternack et al., 2011 [9] introduced a generalized fact-finding framework that incorporates additional information into the investigative process for resolving conflicting claims. Vydiswaran et al., 2011 [10] designed a content-based trust propagation structure to ascertain the veracity of free-text claims and measure the trustworthiness of their information providers. These methods concentrate on fact information. However, they do not consider source reliability and truth-claimed information.
Moustafa et al., 2014 [11] presented a stigmergic-based approach for providing decentralized service interactions and handling service composition in highly dynamic open environments. Lu et al., 2013 [12] introduced a method to infer user search goals by analyzing search engine query logs. Robin et al., 2016 [13] implemented a new method to tackle the truth discovery problem through principled probabilistic modeling. Dong et al., 2013 [14] introduced a streaming approach to solve the truth estimation problem in crowdsourcing applications. The method also solves an expectation maximization (EM) problem to determine the odds of correctness of different observations. Angadi et al., 2013 [15] presented a fact-finder framework that finds facts among conflicting information and identifies trustworthy information sources better than popular search engines. Nevertheless, these methods take more time to execute queries and may produce erroneous information.
Bhuvaneswaran et al. [16] combined a ranking model with the truth-finder algorithm to evaluate the similarity between two or more websites. Ansari et al. [17] introduced a method to validate the trustworthiness of a URL link. The system filters trustworthy URLs based on five major areas, namely authority, related resources, popularity, age, and recommendation. Sharma et al. [18] designed parallel and streamed truth discovery algorithms for quantitative crowdsourcing applications. Gao et al. [19] developed a crowdsourcing aggregation approach to resolve conflicts and achieve high-quality data. Ba et al. [20] demonstrated the VERA framework, which supports data mining from web-based textual data and micro-texts from Twitter. These techniques work on similarity detection and source reliability. However, they did not focus on evaluating the trustworthiness of media news, where identifying the information source of the content is a challenging task.
Xiao et al. [21] implemented a truth discovery approach with a randomized Gaussian mixture model (RGMM) to represent multi-source data. Yin et al. [22] developed a semi-supervised approach that finds true values with the help of ground truth data. Dong XL et al. [23] demonstrated the SOLOMON system to detect copied data among various internet sources. Yong-Xin et al. [24] implemented a two-stage data conflict resolution protocol based on Markov Logic Networks. The technique divides attributes according to their conflict degree and then resolves the data conflicts. Li et al. [25] designed an incremental truth discovery framework that dynamically updates object truths and information source weights as new data arrives. Li et al. [26] introduced the Feedforward and Feedback Memory Network (FFMN) to estimate statement credibility using various types of information with different weights. Here, the learning model calculates the weight of every data item automatically. The method uses vectors, which have better visualization capability than natural numbers, to express statement and source reliability. However, the method takes more time to retrieve information.

3. SYSTEM METHODOLOGY

This section explains the proposed truth content discovery framework, which extracts trustworthy news content from various information sources with minimum error and reduced query retrieval time. Figure 1 illustrates the working principle and implementation procedure of the proposed framework. The application is straightforward and user-friendly, allowing news from multiple domains to be viewed in a single place. The proposed design includes components such as ubiquitous computing, context awareness, and socially intelligent communication to support trustworthiness. The framework is divided into the following modules: User GUI Design, Content Preprocessing, and Retrieved Result Visualization.
[Figure 1 flowchart: User → Enter the query → Collect the query → Apply TCD algorithm → Process the query → Information retrieval from multiple sources (conflicting source information: Database1, Database2, Database3) → Extract the media news information → Verify information source and reliability → Retrieve trustworthy news information → Calculate trustworthy information value → Display most relevant truthful result]

Figure 1: Workflow of proposed truth content discovery algorithm

3.1 User GUI Design

This module provides the graphical user interface (GUI) through which a user retrieves news information from the web. The user can enter any news-related query as input.

3.2 Content Preprocessing

After collecting the user's query, the module extracts all related content from the database. Next, it applies the Truth Content Discovery algorithm to retrieve the trustworthy content index from various information sources. It measures the reliability level and weight of the extracted news content, and from these it calculates the facts of the material to evaluate the trustworthiness level of the news content.

3.3 Truth Content Discovery Algorithm


Initially, the system collects a large volume of news media and channel information based on the object (term or query) and stores it in a centralized database. Users can then query the database according to their requirements. After collecting the user query, the proposed algorithm obtains the related information based on the relationship between the query and the information available in the centralized database. The proposed approach collects a set of HTML-embedded media content and web content, and cleans unwanted information from the media bias; that is, it filters inappropriate media news content out of the collected media bias. The proposed method then verifies the trustworthiness of media news content based on information source reliability and the trustworthiness of the claimed information Cj. The process also confirms the claimed trustworthy information using the source reliability, uncertainty value, independent value, and information contribution to find the trustworthy information value for each extracted media news item. The method avoids conflicting information about media content from different databases. The design externalizes the awareness function component and constructs value-based trust. Hence, it verifies the trustworthiness of retrieved content based on the facts and weight of trustworthy information. The proposed method integrates the J48 classifier to improve information accuracy and reduce the prediction error on conflicting information. Next, the re-ranking function ranks the most relevant information among the various similar items collected, based on user preferences and the available news content. Finally, the most relevant reliable media information is displayed with its trustworthiness score, minimal content retrieval time, and best accuracy. After producing the result, the proposed algorithm counts the results of every query.

The proposed algorithm extracts information from multiple conflicting sources. The media news channel application has G news sources S = (S1, S2, ..., SG) and K claims C = (C1, C2, ..., CK), where Si denotes the ith source and Cj denotes the jth claim. $Q^{t}_{S_i,j}$ represents the report made by source Si on claim Cj at time t. Consider a media news channel as an example: it carries a variety of news content collected from various sources, and news auditors have to pay attention to identify the event, object, or trend together with the source details (where the news was collected). Some information attracts few comments and little trending activity, yet it is true and factual, for example when the news medium itself is regarded as the source of official news or announcements. The method uses Cj = T and Cj = F to indicate that the claimed information is true or false, respectively. Each claim is associated with a basic trustworthiness model $\{x_j^{*}\}$ such that xj = 1 means Cj carries true information and xj = 0 means Cj carries false information. The objective of the method is to retrieve the most trustworthy and relevant information from multiple conflicting web sources with known information source reliability. The details are given below:
Steps 1: Trustworthiness Claim:
The method defines the trustworthiness Dj of claim Cj as the likelihood that Cj is true. The larger the value of Dj, the more likely claim Cj is to be true. Formally, Dj is expressed in equation (1).
$$D_j = \Pr(C_j = T) \qquad (1)$$
where Pr(Cj = T) is the probability that claim Cj is trustworthy (true) at time T.

Steps 2: Information Source Reliability:

The information source reliability Ir of an information source Is indicates how trustworthy and reliable that source is. The larger the value of Ir, the more reliable and trustworthy the information offered by Is is assumed to be. The method expresses the information source reliability Ir in equation (2).
$$I_r = \Pr\left(\frac{C_j = T}{I_s C_j = T}\right) \qquad (2)$$
where Pr(Cj = T) is the probability that claim Cj is trustworthy at time T, and Pr(Is Cj = T) is the probability that the information source Is claims Cj to be trustworthy at time T. Some information sources in news media often conflict and do not always offer reliable and trustworthy claims. The proposed system works to find truthful and reliable information sources. The method also obtains the credibility value of the reliable information source Is on claim Cj at time T.
Steps 3: Uncertainty Value ($UV^{T}_{i,j}$)
The uncertainty value indicates how much trustworthiness the information from Is on claim Cj carries; it is measured in the range 0 to 1 for each information report. A high uncertainty value indicates that the information report does not have good reliability or truthful information sources.
Steps 4: Independent Value ($IV^{T}_{i,j}$)
The independent value expresses whether the claimed information was created by an individual person or was copied or duplicated from some other information source Is; its value lies in the range 0 to 1. A high independent value indicates that the obtained information was not copied from other sources and is likely to be more trustworthy and reliable. The system calculates the trustworthy information value for the multiple sources of conflicting information Is on claim Cj at time T. The trustworthy information value is expressed in equation (3).

$$TIV^{T}_{i,j} = \left(1 - UV^{T}_{i,j}\right) \times IV^{T}_{i,j} \qquad (3)$$

$TIV^{T}_{i,j}$ is the Tth trustworthy information value of the information report from Is on claim Cj. The system assumes that the trustworthy information value depends on a set of semantic values, namely the uncertainty and independent values, to obtain the most reliable and trustworthy report: a) whether the information report agrees or disagrees with claim Cj; and b) whether the information from source Is on claim Cj was copied or forwarded from elsewhere or is an individual observation.
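
As a minimal sketch, equation (3) can be read as TIV = (1 − UV) × IV with both inputs in the range 0 to 1. The function name and the two example reports below are hypothetical and only illustrate how a high uncertainty or a low independence pulls the value down.

```python
def trustworthy_information_value(uncertainty: float, independence: float) -> float:
    """Equation (3) as read here: TIV = (1 - UV) * IV, with UV and IV in [0, 1]."""
    for name, value in (("uncertainty", uncertainty), ("independence", independence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return (1.0 - uncertainty) * independence


# Hypothetical reports: a confident original report vs. a hedged, copied one.
confident_original = trustworthy_information_value(uncertainty=0.1, independence=0.9)
hedged_copy = trustworthy_information_value(uncertainty=0.6, independence=0.2)
print(confident_original, hedged_copy)
```

Under this reading, a report scores highest when it is both certain and independently produced, and scores zero if it is either fully uncertain or fully copied.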

Steps 5: Information Contribution Value (ICVi,j)

The aggregated information contribution of the information source Is to claim Cj is a function of the information source reliability and the trustworthy information values of all previous information records from that source. More specifically, the information contribution value of the information source Is on claim Cj is denoted ICVi,j and is evaluated in equation (4).
$$ICV_{i,j} = \operatorname{sgn}\left(TIV^{T}_{i,j}\right) \sum_{t=1}^{T} I_r^{\,S+1-t}\, TIV^{t}_{i,j} \qquad (4)$$
where Ir represents the reliability of the information source Is, $TIV^{t}_{i,j}$ represents the credibility value of the information record that Is generated at time t on claim Cj, sgn($TIV^{T}_{i,j}$) is the sign of the most recent value, and S is the length of the TIV sequence. The sign depends only on the most recent information report, so the proposed method gives priority to self-improvement behavior. The method uses $I_r^{S+1-t}$ as a "damping scale" that allocates the largest weights to recent information reports. This reduces the effect of spamming in media news content: if a source repeats similar information over time, its earlier information reports have a reduced effect on the overall information contribution value of that source. Together, these factors are used to recognize trustworthy information among multiple sources of widely spread conflicting information. The pseudocode of the proposed Truth Content Discovery algorithm is given below.
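
The damped aggregation of Step 5 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the exponent of the damping scale is assumed to be S + 1 − t (so the newest report has exponent 1), the TIV values are assumed to be non-negative as equation (3) implies, and the function name and sample values are hypothetical.

```python
import math


def information_contribution_value(reliability: float, tiv_sequence: list[float]) -> float:
    """Damped aggregate of a source's TIV history for one claim.

    Assumption: the report at position t (1-based, oldest first) is weighted
    by reliability ** (S + 1 - t), where S = len(tiv_sequence), so the most
    recent report gets the largest weight when 0 < reliability < 1.
    """
    if not tiv_sequence:
        return 0.0
    size = len(tiv_sequence)
    damped_sum = sum(
        reliability ** (size + 1 - t) * tiv
        for t, tiv in enumerate(tiv_sequence, start=1)
    )
    # The sign is taken from the most recent report only.
    latest = tiv_sequence[-1]
    sign = 0.0 if latest == 0 else math.copysign(1.0, latest)
    return sign * damped_sum


# Hypothetical history of one source on one claim, oldest report first.
print(information_contribution_value(0.8, [0.2, 0.5, 0.9]))
```

With a reliability below 1, swapping a strong report from the end of the history to the start lowers the result, which is the behavior the damping scale is meant to produce.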

Algorithm: Truth Content Discovery Algorithm

Input: The set of information sources IS, information reliability Ir, the set of facts F, and the links among them.
Output: Trustworthy information retrieved from multiple media news bias sources and facts, with source of information Is, information reliability Ir, and trustworthiness information value, at minimal content retrieval time and with the best accuracy A.
Procedure:
Steps1: Collect the user-given query;
Steps2: Validate the collected user query (spelling mistakes, etc.) against the DB;
Steps3: Apply TCD algorithm;
Steps4: Pre-process the query;
Steps5: Extract query-related information;
Steps6: Calculate the matrices A and B of the objects;
Steps7: For each Ir ∈ IS: /* initialization steps */
Steps8: t(Ir) ← t0;
Steps9: T(Ir) ← −ln(1 − t(Ir));
Steps10: Repeat the process: σ* ← B · T; /* iterative evaluation */
Steps11: Evaluate s from σ*; t' ← t; /* make a copy of t */
Steps12: Process t ← A · s;
Steps13: Evaluate T from t;
Steps14: Identify the information source Is on claim Cj;
Steps15: Recognize the information reliability Ir on claim Cj;
Steps16: Evaluate the uncertainty value, independent value, and information contribution value regarding Is and Ir on claim Cj at time T;
Steps17: Continue until the cosine similarity of t and t' is greater than 1 − δ; count the query;
Steps18: Retrieve the most relevant and truthful information;
Steps19: Evaluate the trustworthy information value for each information source Is;
Steps20: Prioritize the retrieved relevant, trustworthy information based on query similarity;
Steps21: Visualize the most relevant retrieved trustworthy information from multiple media bias sources with minimal content retrieval time, error rate, and best accuracy A;
Steps22: End;
where t denotes true information, T denotes trustworthiness content, and t' denotes the multiple sources of conflicting information.
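
The iterative core of the pseudocode (Steps 7 to 13 and Step 17) resembles a TruthFinder-style loop, and the sketch below illustrates only that part. It is an assumption-laden simplification: the `provides` input layout, the function names, the choice s = 1 − exp(−σ*), and the default values of `t0` and `delta` are hypothetical, and the uncertainty, independence, contribution, J48, and re-ranking steps are left out.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def discover_truth(provides, t0=0.9, delta=1e-6, max_iterations=100):
    """Iterate source trust and fact confidence until trust stabilises.

    provides[i][j] is True when source i asserts fact j (hypothetical input
    layout). Returns (source_trust, fact_confidence).
    """
    n_sources, n_facts = len(provides), len(provides[0])
    trust = [t0] * n_sources  # Steps 7-8: every source starts at t0
    confidence = [0.0] * n_facts
    for _ in range(max_iterations):
        # Steps 9 and 13: trust score T = -ln(1 - t), clamped to stay finite.
        score = [-math.log(1.0 - min(t, 1.0 - 1e-12)) for t in trust]
        # Step 10: sigma* = B . T, summing the scores of sources behind a fact.
        sigma = [
            sum(score[i] for i in range(n_sources) if provides[i][j])
            for j in range(n_facts)
        ]
        # Step 11: fact confidence s from sigma*, then keep a copy of t.
        confidence = [1.0 - math.exp(-x) for x in sigma]
        previous = trust[:]
        # Step 12: t = A . s, the mean confidence of each source's facts.
        trust = []
        for i in range(n_sources):
            own = [confidence[j] for j in range(n_facts) if provides[i][j]]
            trust.append(sum(own) / len(own) if own else 0.0)
        # Step 17: stop once t and its previous copy are nearly parallel.
        if cosine_similarity(trust, previous) > 1.0 - delta:
            break
    return trust, confidence


# Hypothetical example: sources 0 and 1 agree on fact 0, source 2 asserts fact 1.
provides = [[True, False], [True, False], [False, True]]
source_trust, fact_confidence = discover_truth(provides)
print(source_trust, fact_confidence)
```

In this toy input the fact backed by two sources ends with a higher confidence than the fact backed by one, and the two agreeing sources end with a higher trust than the lone source.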

3.4 Retrieved Result Visualization


This module displays the content by index, ordered by the trustworthiness level of the content. It shows only the trustworthy content relevant to the user query.

4. RESULTS AND DISCUSSIONS

4.1 Implementation Setup

The proposed algorithm is implemented on a laptop with an Intel Dual Core processor (1.836 GHz), 4 GB of memory, and Windows 7. The truth discovery framework is written in the Java programming language using NetBeans 8.0.2, Apache Tomcat 8.0.3, and a Microsoft SQL Server database. The proposed method uses the Weka library (https://www.cs.waikato.ac.nz/ml/weka/) to predict the accuracy of information. WEKA (Waikato Environment for Knowledge Analysis) is a collection of machine learning techniques for data mining tasks, covering data pre-processing, classification, regression, clustering, association rules, and visualization. Here it is used to evaluate the performance of the proposed algorithm on the respective datasets, through its implementations of classification, numeric prediction, and meta-scheme techniques.

4.1.1 Datasets

The technique uses two datasets (http://lunadong.com/fusion-DataSets.htm), namely flight and stock related news content. The stock dataset was collected from 55 information sources in July 2011 and contains market and property related news information [26]. It has 1000 stock symbols with 16 properties. The ground truth covers the NASDAQ100 stocks and a further 100 randomly selected stocks. This information was gathered from the following primary information sources: Nasdaq.com, Yahoo Finance, Google Finance, Bloomberg, and MSN Finance. The flight dataset contains flight information from 38 major sources in December 2011 [26], with 1200 flights and 6 types of information properties. The departure gate and arrival gate are treated as categorical data, and the other properties as continuous data. Arrival and departure times are converted into minutes, and 100 ground truth records are selected randomly for the flight dataset.

4.2 Result

This section describes the evaluation measures used to assess the efficiency of the proposed TCD algorithm. It evaluates the accuracy of the trustworthy content and the retrieval time for the query or terms given by the user. The proposed TCD also aims to predict truthful information with minimal classification error and retrieval time. The method is evaluated using the error rate (ER), Mean Normalized Absolute Distance (MNAD), and query retrieval time (QRT) on the stock and flight datasets.

4.2.1 Error Rate(ER):


The error rate (ER) of the proposed TCD is computed from the trustworthy outputs that differ from the ground truth information. The mathematical model of the error is given in equation (5) and measures how close the predicted trustworthy information is to the eventual outcomes.

$$ER = \frac{Predicted_{Trustinf} - Eventual_{Trustinf}}{Eventual_{Trustinf}} \times 100 \qquad (5)$$

where $Predicted_{Trustinf}$ is the ratio of probability to get trustworthy information, and $Eventual_{Trustinf}$ is the trustworthy information actually obtained.
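
A minimal sketch of equation (5), read as the relative difference between the predicted and eventual values scaled by 100. The function name and the sample numbers are hypothetical, and the signed form is an assumption: the extracted equation does not show whether an absolute value is taken.

```python
def error_rate(predicted: float, eventual: float) -> float:
    """Equation (5) as read here: (predicted - eventual) / eventual * 100.

    A positive result means the prediction overshoots the eventual value,
    a negative result means it undershoots.
    """
    if eventual == 0:
        raise ValueError("eventual value must be non-zero")
    return (predicted - eventual) / eventual * 100.0


# Hypothetical values: the prediction overshoots the eventual outcome by 5%.
print(error_rate(predicted=105.0, eventual=100.0))
```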

4.2.2 Mean Normalized Absolute Distance (MNAD)

Mean Normalized Absolute Distance (MNAD) measures the distance between the retrieved trustworthiness outputs and the ground truth. Different types of data entries are measured on different scales, so MNAD uses each entry's own variance when calculating the mean. The proposed method uses a mathematical model to estimate the differences between the values predicted by a model and the values observed from the environment being modeled. These individual differences are also called residuals and are aggregated in equation (6).

$$MNAD = \frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right)^2 \qquad (6)$$

where $\hat{y}_i$ is the ith entry of the vector of n predictions, and $y_i$ is the corresponding entry of the vector of observed values for the inputs to the function that generates the predictions.
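
Equation (6), as it appears here, averages the squared residuals between predictions and observations. The sketch below follows that reading only; it does not add any per-entry variance normalization, and the function name and sample vectors are hypothetical.

```python
def mean_squared_residual(predicted: list[float], observed: list[float]) -> float:
    """Equation (6) as read here: (1 / n) * sum((y_hat_i - y_i) ** 2)."""
    if not predicted or len(predicted) != len(observed):
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [p - o for p, o in zip(predicted, observed)]
    return sum(r * r for r in residuals) / len(residuals)


# Hypothetical predictions and observations; only the last entry differs.
print(mean_squared_residual([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```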

4.2.3 Query Retrieval Time (QRT)

Equation (7) gives the mathematical model for query retrieval time. The proposed method
calculates the processing time as the product of the number of candidate datasets and the
average retrieval time per dataset:

$$ QRT = T_{CD} \times T_{AR} \tag{7} $$

Where T_{CD} is the total number of candidate datasets and T_{AR} is the average retrieval
time of a dataset.
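
Equation (7) can be written as a one-line function. The sketch below is illustrative only: the function name, the parameter names, and the millisecond unit are assumptions, and the numbers in the example are made up rather than taken from the experiments.

```python
def query_retrieval_time(num_candidate_datasets: int, avg_retrieval_time_ms: float) -> float:
    """QRT as the number of candidate datasets times the average retrieval time."""
    if num_candidate_datasets < 0 or avg_retrieval_time_ms < 0:
        raise ValueError("inputs must be non-negative")
    return num_candidate_datasets * avg_retrieval_time_ms


# Hypothetical example: 4 candidate datasets at 2.5 ms each
print(query_retrieval_time(4, 2.5))
```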

Table 1 displays the error rate (ER), mean normalized absolute distance (MNAD), and query
retrieval time (QRT) for the stock and flight datasets. The proposed method is compared with
existing conventional techniques, namely the Gaussian truth model (GTM) [26], conflict
resolution on heterogeneous data (CRH) [26], TRUTHFINDER [26], Voting [26], and the
Feedforward and Feedback Memory Network (FFMN) [26]. FFMN [26] is the closest of the
existing approaches to the proposed TCD method. However, the proposed TCD algorithm
achieves lower ER, MNAD, and QRT than FFMN [26] on both datasets. The proposed TCD
algorithm is integrated with the J48 classifier to reduce the prediction error and improve the
media content accuracy in order to offer the most trustworthy media news content. The classifier
provides a procedure for classifying categorical data based on their attributes. It is also
effective for processing large amounts of data and is therefore often used in data mining
applications. Based on the results in Table 1, the proposed TCD performs best on both
datasets (Flight and Stock) with respect to ER, MNAD, and QRT.

Table 1: Error Rate (ER), Mean Normalized Absolute Distance (MNAD) and Query
Retrieval Time (QRT) for Stock and Flight Datasets

Learning        Flight                         Stock
Algorithms      ER       MNAD     QRT          ER       MNAD     QRT
GTM             XX       7.6703   30.6503      XX       3.4253   6.6506
CRH             0.0823   4.8613   38.9449      0.0700   2.6445   9.1807
TRUTHFINDER     0.0950   8.1351   34.5500      0.1194   2.7140   7.8530
Voting          0.0859   XX       53.0601      0.0817   XX       14.0154
FFMN            0.0008   1.2600   27.6165      0.0207   1.5105   8.5210
TCD+J48         0.0004   0.5244   16.5500      0.0125   0.2834   4.5010

[Bar chart; y-axis: Error Rate; series: Flight and Stock]

Figure 2: Error Rate (ER) for Stock and Flight Datasets

[Bar chart; y-axis: Mean Normalized Absolute Distance (MNAD); series: Flight and Stock]

Figure 3: Mean Normalized Absolute Distance (MNAD) for Flight and Stock Datasets

[Bar chart; y-axis: Query Retrieval Time; series: Flight and Stock]

Figure 4: Query retrieval time (QRT) in milliseconds for Flight and Stock Datasets.

[Bar chart; series: Total Pages, Relevant Pages, QRT; categories: General Search,
Indexed Based Search, Truth Content Discovery]

Figure 5: Retrieved pages (total and relevant) and query retrieval time for the general
approaches and the proposed Truth Content Discovery algorithm.

According to Figures 2 to 5, the proposed TCD+J48 approach offers the best
retrieved trustworthy results and classification accuracy compared to the
existing approaches. The proposed TCD+J48 system is evaluated against the
conventional methods Gaussian truth model (GTM) [26], conflict resolution on
heterogeneous data (CRH) [26], TRUTHFINDER [26], Voting [26], and
Feedforward and Feedback Memory Network (FFMN) [26] on the Stock and Flight
datasets. FFMN [26] is the closest approach to the proposed TCD+J48 in terms of
ER, MNAD, and QRT on both datasets. FFMN [26] addresses the evaluation of
statement credibility and handles various types of information with different
weights. Here, every data weight is calculated automatically on the basis of model
learning. However, the method consumes more time to retrieve the information, and
it gives no prediction accuracy for trustworthy information. GTM [26] is a
probabilistic Bayesian method for truth information discovery. However, GTM can
only be used with continuous data for model training and prediction, and it gives
very poor results on truth information discovery. CRH [26] is designed to infer the
truth information from multiple conflicting sources that include various data
types. However, the CRH technique does not consider source reliability
evaluations and skips the unique characteristics of each data type. TruthFinder [26]
was introduced to obtain facts among conflicting information and to detect
reliable web sources better than popular search engines. TruthFinder is designed
to solve veracity issues and establishes the relationships between websites and
their content. However, the method is time-expensive when retrieving information
from multiple web sources. Voting [26] techniques aggregate individual
preferences to reach collective decisions. The system assumes that all required
URL resources are equally available. However, this assumption does not hold in
all cases.
The ER and MNAD of the proposed TCD+J48 algorithm are evaluated against the
existing methods on the Flight and Stock datasets; the details are given in
Figures 2 and 3. Here, the accuracy on the flight dataset is comparatively weak.
The proposed TCD+J48 provides the best result, with minimal classification error
of trustworthy content during content retrieval over both datasets. Figure 4 shows
the query retrieval time of the proposed TCD+J48 algorithm along with the existing
algorithms. It measures the query retrieval time (relevant index extraction and
relevant result indexing) over all datasets. Here, it is observed that the proposed
system displays the trustworthy result on the respective dataset. The experimental
work shows that TCD+J48 outperforms the other techniques on ER, MNAD, and
QRT. Figure 5 shows the performance of the truth content discovery algorithm in
terms of query search. Here, the proposed TCD+J48 algorithm is implemented to
retrieve trustworthy news content. It is organized in three sections, namely general
search, index-based search, and truth content discovery.
Moreover, every search is performed to retrieve news content. General search
works as a normal search, and index-based search retrieves results based on the
popularity of the index or the most visited index. Truth content discovery retrieves
trustworthy content with minimal classification error and time. It avoids
unnecessary browsing of data that produces irrelevant information. The technique
always displays information to the user based on the content dependency level and
the weight of objects. It reduces the query retrieval time (QRT) by 7.54
milliseconds, the error rate (ER) by 5%, and the Mean Normalized Absolute
Distance (MNAD) by 0.98. Finally, the paper claims that the proposed TCD+J48 is
the best approach for retrieving trustworthy news content from various information
sources on the flight and stock news-related fusion datasets with respect to these
parameters.
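
The text does not state how the 7.54 ms and 0.98 figures were derived. One plausible reading, offered here only as an assumption, is that they are the average reduction relative to FFMN over the two datasets in Table 1. The sketch below checks that reading; the dictionary layout and the function name `mean_reduction` are illustrative.

```python
from statistics import mean

# Values copied from the FFMN and TCD+J48 rows of Table 1.
TABLE_1 = {
    "Flight": {
        "FFMN": {"ER": 0.0008, "MNAD": 1.2600, "QRT": 27.6165},
        "TCD+J48": {"ER": 0.0004, "MNAD": 0.5244, "QRT": 16.5500},
    },
    "Stock": {
        "FFMN": {"ER": 0.0207, "MNAD": 1.5105, "QRT": 8.5210},
        "TCD+J48": {"ER": 0.0125, "MNAD": 0.2834, "QRT": 4.5010},
    },
}


def mean_reduction(metric: str, baseline: str = "FFMN", proposed: str = "TCD+J48") -> float:
    """Average over datasets of (baseline value - proposed value) for one metric."""
    return mean(rows[baseline][metric] - rows[proposed][metric] for rows in TABLE_1.values())


for metric in ("QRT", "MNAD", "ER"):
    print(metric, round(mean_reduction(metric), 4))
```

Under this assumption the averages come out at about 7.54 for QRT and 0.98 for MNAD, which agrees with the reported reductions. The same averaging gives about 0.0043 for ER, so the reported 5% is presumably derived in a different way that the text does not specify.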

Conclusion

The article has presented a truth content discovery algorithm intended to
produce trustworthy information from multiple conflicting sources of media bias.
The method avoids conflicting media news information collected from various
information sources. The re-ranking function is applied to cluster the most
relevant information from numerous similar media news URL contents. The
proposed method verifies the trustworthiness of media news content based on
information source reliability and the trustworthiness of the claimed information
CJ. The method also verifies the claimed trustworthy information based on source
reliability, uncertainty value, independent value, and information contribution in
order to find a trustworthy information value for every extracted media news
content. The proposed TCD+J48 approach is integrated to offer the most reliable
trustworthy information with minimal classification error and a good
trustworthiness score. The proposed algorithm minimizes the query processing
time and provides the most relevant indexes compared to other existing
approaches. The proposed TCD+J48 algorithm reduces the query retrieval time
(QRT) by 7.54 milliseconds, the error rate (ER) by 5%, and the Mean Normalized
Absolute Distance (MNAD) by 0.98. Finally, the paper states that the proposed
TCD+J48 approach is the best approach for retrieving trustworthy news content
from various information sources on the flight and stock news-related fusion
datasets with respect to these parameters.

In the future, the work can be extended to extract trustworthy information
for live multimedia video streaming news in cloud environments, where the
security and reliability of content is a very challenging task.

Funding:
The authors declare that they themselves sponsored the publication fee and the
research work, since there was no external funding for the work.

Acknowledgments
The authors would like to thank the Research and Development Division
(Academic) of Sathyabama Institute of Science and Technology, Chennai, for the
facilities. We would also like to thank the Dean (Research) and the Doctoral
Committee members for their valuable suggestions to improve the quality of the work.

Authors' contributions
This article is part of academic research work on finding trustworthy
information from multiple conflicting information sources. When people get
similar information from various sources, it is a crucial issue to determine which
information is reliable or trustworthy. Both authors worked at ground level to
identify the problem statement, define the research objectives, find a solution to
the above problems with a result hypothesis, and present the implemented results
with a comparative analysis. Both authors read and approved the final article for
publication.

Authors' information (optional)

Palaiyah Solainayagi received her MCA (Master of Computer Application) from
the University of Madras, Chennai, India in 2004. She completed her ME (Master
of Engineering) in Computer Science and Engineering at Vinayaka Missions
University, Chennai, Tamil Nadu (India) in 2007. She is pursuing her Doctorate
degree at Sathyabama University, Chennai, Tamil Nadu (India).
Currently, she is serving as Associate Professor in the Computer Science &
Engineering Department at Madha Engineering College, Chennai, Tamil Nadu,
India. She has participated in many national conferences, international
conferences, workshops, and seminars at various institutions in India. She has
published 5 papers in national journals, international journals, Scopus-indexed
journals, and conferences. Her research areas are Data Mining, Database
Management Systems, Object Oriented Programming, Artificial Intelligence,
Digital Electronics, Cryptography and Network Security, Web Technology, and
Information Retrieval. She holds the following memberships: ISTE & CSI.
Ramalingam Ponnusamy received his Masters degree in Computer Science &
Engineering from Pondicherry University, Pondicherry, India in 2000 and
obtained his PhD in Computer Science from Anna University, Chennai, India in 2008.
He has 17 years of teaching experience in various institutions in India. He is
recognized as a research supervisor by many universities in Tamil Nadu, India. He
has successfully led many projects at UG and PG level. He has also served as
Lecturer, Assistant Professor, Associate Professor, Professor, and Principal in
various institutions. Currently, he is a Professor at CVR College of
Engineering, Hyderabad, Telangana, India. He is a member of IETE, ACM,
AAAI (Association for the Advancement of Artificial Intelligence), CSI, ISTE,
The Internet Technical Group, Institute of Engineers-India (IEI), UNESCO
Observatory on the Information Society, the Community of Science, and the
International Research Alliance for Science and Technology (IRAST). He has also
served as a programme committee member of the IEEE-CIS-RAM, IIWAS 2009,
ICEIS, and ICAART conferences and as Program Chair of IEEE IAMA
2009 and CACS 2010. He has published and presented 45 research articles in
various journals and international and national conferences in his area. He
was awarded the Young Scientist Fellowship by the Tamil Nadu State Council for
Science & Technology in 2005 and a Visiting Fellowship at the Education
Research Center, Indian Institute of Science, Bangalore, India in 2005-2006.
He organized and edited the proceedings of the IEEE International
Conference on Intelligent Agent & Multi-Agent Systems 2009 and is an Editorial
Board Member of the Journal of the World University Forum, World Academy of
Science, Engineering and Technology, and the International Journal of Computer
Applications. His fields of interest and specialization are Distributed Artificial
Intelligence, Soft-Computing, E-Governance, Information Retrieval, and
Human-Computer.

References

[1]. Ma F, Li Y, Li Q, Qiu M, Gao J, Zhi S, Han J. FaitCrowd: Fine-grained
truth discovery for crowdsourced data aggregation. In Proceedings of the
21st ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining 2015; 745-754.
[2]. Gupta M, Sun Y, Han J. Trust analysis with clustering. In Proceedings of
the 20th international conference companion on World Wide Web 2011;
53-54.
[3]. Li Q, Li Y, Gao J, Zhao B, Fan W, Han J. Resolving conflicts in
heterogeneous data by truth discovery and source reliability estimation.
In Proceedings of the 2014 ACM SIGMOD international conference on
Management of data 2014; 1187-1198.
[4]. Li Q, Li Y, Gao J, Su L, Zhao B, Demirbas M, Han J. A confidence-
aware approach for truth discovery on long-tail data. In Proceedings of the
VLDB Endowment 2014; 8(4): 425-436.
[5]. Yu D, Huang H, Cassidy T, Ji H, Wang C, Zhi S, Zhi J, Han C, Voss M,
Magdon I. The wisdom of minority: Unsupervised slot filling validation
based on multi-dimensional truth-finding. In Proc. of COLING 2014;
1567-1578.
[6]. Zhao B, Han J. A probabilistic model for estimating real-valued truth
from conflicting sources. In Proc. of QDB 2012; 1-7.
[7]. Wang D, Kaplan L, Le H, Abdelzaher T. On truth discovery in social
sensing: A maximum likelihood estimation approach. In Proceedings of
the 11th International Conference on Information Processing in Sensor
Networks 2012; 233-244.
[8]. Li X, Dong XL, Lyons K, Meng W, Srivastava D. Truth finding on the
deep web: is the problem solved. In Proceedings of the VLDB Endowment
2012; 6(2): 97-108.
[9]. Pasternack J, Roth D. Making better informed trust decisions with
generalized fact-finding. In IJCAI Proceedings-International Joint
Conference on Artificial Intelligence 2011; 22(3): 2325-2329.
[10]. Vydiswaran VG, Zhai C, Roth D, Content-driven trust propagation
framework. In Proceedings of the 17th ACM SIGKDD international
conference on Knowledge discovery and data mining 2011; 974-982.
[11]. Moustafa MZ, Bai Q. Trustworthy Stigmergic Service Composition and
Adaptation in Decentralized Environments. IEEE Transactions on Services
Computing 2014; 9(2): 317-329.
[12]. Lu Z, Zha H, Yang X, Lin W, Zheng Z. A new algorithm for inferring
user search goals with feedback sessions. IEEE Transactions on
Knowledge and Data Engineering 2013;25(3): 502-513.
[13]. Robin WO, Srivastava M, Alice T, Timothy JN. Truth discovery in crowd
sourced detection of spatial events. In IEEE Transactions on Knowledge
and Data Engineering 2016; 28(4): 1047-1060.
[14]. Dong W, Tarek A, Lance K, Aggarwal C. Recursive fact-finding: A
streaming approach to truth estimation in crowd sourcing applications. In
ICDCS 2013; 530–539.
[15]. Angadi KG, Desai P. Extracting Accurate Data from Multiple Conflicting
Information on Web Sources. In International Journal of Communication
Network Security 2013; 2(2): 70-75.
[16]. Bhuvaneswaran R, Sarojini K. Margin and Slack Rescaling Technique
with Real Truth Finder Algorithm for a Specific Domain Search. In
International Journal of Advanced Research in Computer Science and
Software Engineering 2014; 4(2): 480-485.
[17]. Ansari S, Gadge J. Architecture for checking the trustworthiness of
websites. In International Journal of Computer Applications 2012; 44(14):
22-26.
[18]. Sharma PR, Patil ME. Extracting Trustworthy Data from Multiple
Conflicting Information using Semi-Supervised Approach. In International
Journal of advancement in electronics and computer engineering 2014;
2(11): 271-275.
[19]. Gao J, Li Q, Zhao B, Fan W, Han J. Truth discovery and crowd sourcing
aggregation: A unified perspective. In Proceedings of the VLDB
Endowment 2015; 8(12): 2048-2049.
[20]. Ba ML, Berti-Equille L, Shah K. Hammady HM. VERA: A Platform for
Veracity Estimation over Web Data. In Proceedings of the 25th
International Conference Companion on World Wide Web 2016; 159-162.
[21]. Xiao H, Gao J, Wang Z, Wang S, Su L, Liu H. A Truth Discovery
Approach with Theoretical Guarantee. KDD 2016 San Francisco,
California 2016; 1-10.
[22]. Yin X, Tan W, Semi-supervised truth discovery. In Proceedings of the
20th International Conference on World Wide Web 2011; 217-226.
[23]. Dong XL, Berti-Equille L, Hu Y, Srivastava D. Solomon: Seeking the
truth via copying detection. In Proceedings of the VLDB Endowment
2010; 3(1):1617-1620.
[24]. Yong XZ, Qing ZL, Zhao HP. A novel method for data conflict resolution
using multiple rules. Computer Science Information Systems 2013; 10(1):
215-235.
[25]. Li Y, Li Q, Gao J, Su L, Zhao B, Fan W, Han J. On the discovery of
evolving truth. In Proceedings of the 21st ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining 2015: 675-684.
[26]. Li L, Qin B, Ren W, Liu T. Truth discovery with memory
network. Tsinghua Science and Technology 2017; 22(6): 609-618.
[27]. www.rasmussenreports.com.
[28]. Fogg, B. J. & Tseng, H. The elements of computer credibility. Paper
presented at the CHI 99,1999.
[29]. Ramachandran, S., Paulraj, S., Joseph, S. and Ramaraj, V. Enhanced
trustworthy and high-quality information retrieval system for web search
engines. arXiv preprint arXiv:0911.0914 2009;5:38-42.
[30]. A. Karthikeyan, P. Senthil Kumar, Randomly prioritized buffer-less
routing architecture for 3D Network on Chip, Computers & Electrical
Engineering, Volume 59, April 2017, Pages 39-50, ISSN 0045-7906,
https://doi.org/10.1016/j.compeleceng.2017.03.006.
[31]. Karthikeyan, A. & Kumar, P.S. Cluster Comput (2017).
https://doi.org/10.1007/s10586-017-0979-0

Research Objectives

• To design a truth discovery framework to extract trustworthy news content from
  multiple sources.
• To solve the issues of conflicting information, news content sparsity, and
  scalability during truth content discovery in social media applications.
• To define the content reliability level of facts and the weight of facts in order
  to display reliable or genuine information using iterative methods.
• To provide trustworthy information from news content collected from various
  media channel sources in web services, by studying the historical information
  contribution and the content.
• To apply a re-ranking function for retrieving news media content with the best
  accuracy and minimal retrieval time.
• To visualize reliable media news content with good and accurate information
  prediction.
• To minimize query retrieval time (QRT), mean absolute error (MAE), and root
  mean squared error (RMSE), and to compare with existing methods.

Truth Content Discovery Algorithm

Initially, the system collects a large volume of information based on the
object (term or query) and stores it in the centralized database. The user can then
query the database according to their requirements. After receiving the user
query, the proposed algorithm obtains the related information based on the
relationship between the query and the information available in the centralized
database. The proposed approach collects a set of HTML-embedded media content
and web content and then cleans the unwanted information from the media bias.
Hence, the method clusters the irrelevant media news content in the collected
media bias. Next, the re-ranking function is applied to rank the most relevant
information among various similar items. Finally, the most relevant and reliable
media information is displayed with its trustworthiness score, with minimal
content retrieval time and best accuracy. The method uses the re-ranking function
to cluster the most relevant news content from numerous similar media contents
and to present the most relevant information based on user preferences and the
available news content.

The method avoids conflicting media content information from different
databases. Ambient intelligence helps to establish trust among the involved
users. The design externalizes the awareness function component and constructs
value-based trust. Hence, it verifies the trustworthiness of the retrieved content
based on the facts and the weight of the trustworthy information. The proposed
method integrates the J48 classifier to improve the information accuracy and
reduce the prediction error caused by conflicting information. The method provides
the trustworthiness content accuracy based on the retrieved information. After
producing the result, the proposed algorithm calculates the count of every query
result.

Algorithm: Truth Content Discovery Algorithm
Input: The set of information sources IS, the set of facts F, and the communication
(links) established between them.
Output: Trustworthy information retrieved from multiple media bias sources and
fact confidence C, with minimal content retrieval time and best accuracy A.
Procedure:
Step 1: Read the user-given query;
Step 2: Validate the query against the DB;
Step 3: Pre-process the query;
Step 4: Apply the TCD algorithm;
Step 5: Extract the query-related information;
Step 6: Calculate the matrices A and B;
Step 7: For each C ∈ IS: /* initialization */
Step 8:     t(C) ← t0;
Step 9:     T(C) ← −ln(1 − t(C));
Step 10: Repeat: σ* ← B T; /* iterative evaluation */
Step 11:     Evaluate s from σ*; t' ← t; /* make a copy of t */
Step 12:     t ← A s;
Step 13:     Evaluate T from t;
Step 14: Continue until the cosine similarity of t and t' is
         greater than 1 − δ; count the query;
Step 15: Retrieve the relevant trustworthy information;
Step 16: Prioritize the retrieved relevant trustworthy information based
         on query similarity;
Step 17: Display the retrieved trustworthy information from multiple media
         bias sources with minimal content retrieval time and best accuracy A;
End;
Where t is the true information, T is the trustworthiness content, and t' is the
conflict information (the copy of t kept from the previous iteration).
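
The iterative part of the procedure (Steps 6 to 14) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the matrices A and B are represented implicitly (B sums the scores of the sources asserting a fact, A averages the confidence of the facts a source asserts), the confidence s is obtained from the score with a logistic function and a dampening factor `gamma`, and all names and default values (`discover_truth`, `t0`, `gamma`, `delta`, `max_iter`) are hypothetical.

```python
import math
from typing import Dict, Set, Tuple


def cosine_similarity(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity of two vectors stored as dicts with identical keys."""
    dot = sum(u[k] * v[k] for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def discover_truth(
    claims: Dict[str, Set[str]],
    t0: float = 0.9,
    gamma: float = 0.3,
    delta: float = 1e-6,
    max_iter: int = 100,
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (source trust t, fact confidence s).

    `claims` maps each source to the non-empty set of facts it asserts.
    """
    sources = sorted(claims)
    facts = sorted(set().union(*claims.values()))
    t = {c: t0 for c in sources}  # Steps 7-8: initial trust for every source
    s: Dict[str, float] = {}
    for _ in range(max_iter):
        # Steps 9 and 13: trust score T(C) = -ln(1 - t(C))
        T = {c: -math.log(1.0 - t[c]) for c in sources}
        # Step 10: sigma* = B T (sum of scores of the sources asserting each fact)
        sigma = {f: sum(T[c] for c in sources if f in claims[c]) for f in facts}
        # Step 11: confidence s from sigma* (logistic dampening is an assumption)
        s = {f: 1.0 / (1.0 + math.exp(-gamma * sigma[f])) for f in facts}
        t_prev = dict(t)  # Step 11: keep a copy of t
        # Step 12: t = A s (average confidence of the facts each source asserts)
        t = {c: sum(s[f] for f in claims[c]) / len(claims[c]) for c in sources}
        # Step 14: stop once t and its previous copy are nearly parallel
        if cosine_similarity(t, t_prev) > 1.0 - delta:
            break
    return t, s


# Hypothetical example: two sources agree on fact "x", one source asserts "y"
trust, confidence = discover_truth({"site_a": {"x"}, "site_b": {"x"}, "site_c": {"y"}})
print(confidence["x"] > confidence["y"])
```

In this toy example the fact supported by two sources ends with a higher confidence than the fact supported by one, and the sources asserting it end with higher trust.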

[Flowchart: User → Enter the query → Apply TCD algorithm → Collect query →
Process query → Validate query in DB → (validated) Database1, Database2,
Database3 → Verify the similarity of content → Retrieve the fact and weight of
content → Discover the truth content → Calculate count of each query →
Display trustworthy result]

Fig. 1 Work design of the truth content discovery framework.


Table 1 presents the MAE, RMSE, and query retrieval time for the Weather,
Flight, and Stock datasets with the existing learning approaches, namely CATD [4],
CRH [3], GTM [6], and DynaTD [26]. DynaTD [26] gives the result closest to the
proposed system among the existing approaches. However, the proposed TCD
algorithm achieves lower MAE, RMSE, and QRT than DynaTD [26] on all three
datasets. The proposed TCD algorithm is integrated with the J48 classifier
to reduce the prediction error and improve the media content accuracy in order to
offer the most trustworthy media news content.

Table 1. MAE, RMSE and QRT for Weather, Flight and Stock Datasets

Learning     Weather                     Flight                        Stock
Algorithms   MAE     RMSE    QRT         MAE     RMSE     QRT          MAE     RMSE    QRT
CATD         4.6310  6.0178  3.1769      8.6453  53.0601  81.1288      0.8952  2.5527  14.0154
CRH          3.9493  5.1038  0.6371      8.6980  58.1676  38.9449      0.8398  2.6234  9.1807
GTM          4.7463  6.1749  1.1480      7.6506  51.6956  30.6503      0.8863  2.5365  6.6506
DynaTD       3.7093  4.7857  0.2849      6.2309  45.8221  7.6165       0.1481  0.7845  1.5210
TCD+J48      1.4193  1.8009  0.06        4.6118  33.7244  6.55         0.0942  0.2286  1.01

[Bar chart; categories: CATD, CRH, GTM, DynaTD, TCD+J48; series: Weather, Flight, Stock]

Fig. 2 Mean Absolute Error (MAE) for Weather, Flight and Stock Datasets.

[Bar chart; categories: CATD, CRH, GTM, DynaTD, TCD+J48; series: Weather, Flight, Stock]

Fig. 3 Root Mean Squared Error (RMSE) for Weather, Flight and Stock Datasets.

[Bar chart; categories: CATD, CRH, GTM, DynaTD, TCD+J48; series: Weather, Flight, Stock]

Fig. 4 Query retrieval time (QRT) for Weather, Flight and Stock Datasets.

[Bar chart; series: Total Pages, Relevant Pages, QRT; categories: General Search,
Indexed Based Search, Truth Content Detector]

Fig. 5 Retrieved pages and query retrieval time with current approaches
and the proposed approach.

According to Figures 2 to 5, the proposed TCD+J48 approach offers the best
retrieved results and classification accuracy compared to the existing approaches.
The proposed TCD+J48 system is evaluated against all the existing methods in
terms of MAE and RMSE on the respective datasets; the details are given in
Figures 2 and 3. Here, the accuracy on the flight dataset is comparatively weak.
The proposed TCD+J48 provides the best result, with minimal classification error of
trustworthy content during content retrieval over all datasets. Figure 4 shows
the query retrieval time of the proposed TCD+J48 algorithm along with the existing
algorithms. It measures the query retrieval time (relevant index extraction and
relevant result indexing) over all datasets. Here, it is observed that the proposed
system displays the trustworthy result on the respective dataset. The experimental
work shows that TCD+J48 outperforms the other techniques on MAE, RMSE, and
QRT.

Figure 5 shows the performance of the truth content discovery framework in
terms of query search. Here, the proposed TCD+J48 algorithm is implemented to
retrieve trustworthy news content. It is organized in three sections, namely
general search, index-based search, and truth content detector. Moreover, every
search is performed to retrieve news content. General search works as a normal
search, and index-based search retrieves results based on the popularity of the
index or the most visited index. The truth content detector retrieves trustworthy
content with minimal classification error and time. It avoids unnecessary browsing
of data that produces irrelevant information. The technique always displays
information to the user based on the content dependency level and the weight of
objects. It reduces the query retrieval time (QRT) by 0.9275 seconds, the mean
absolute error (MAE) by 1.22, and the root mean squared error (RMSE) by 5.14.
Finally, the paper claims that the proposed TCD+J48 is the best approach for
retrieving trustworthy news content from various information sources.
