
CA652A

Semantic Web Based Sentiment Engine


A system to determine online sentiment on current affairs for the purpose of analysis and prediction

11210889 52595354

ABSTRACT
Sentiment analysis involves classifying opinions expressed in text as "positive", "negative" or "neutral". Its purpose and benefit is to assist in extracting valuable information and insight from copious amounts of unstructured data. The proposed system will be able to determine online sentiment on current affairs for the purpose of analysis and prediction. For the sentiment analysis a clustering-based approach, a recent advancement in this area, is recommended. Various APIs will assist in extracting other data such as location and time. To evaluate the system, the Pang et al movie review data sets are recommended for validating basic functionality, and real-life data from the 2008 US presidential race for evaluating the full functionality of the system. Multiple industries, from marketing companies to hotels, are identified as potential users of this system, which adds to its commercialisation potential.


A report submitted to Dublin City University, School of Computing for module

CA652: Information Access, 2011/2012.


We hereby certify that the work presented and the material contained herein is our own, except where explicit references to other material are made.

Student Numbers
52595354 11210889


TABLE OF CONTENTS
Abstract
Introduction
Concept Overview
Constraints and Limitations
Functional Description
Sentiment Search Functions
Techniques
Time Parameter Based Search
Geographical Extraction Based
Social Sentiment Extraction Based Data
Graphical Data Generation Tools
Pros & Cons of Proposed System
Evaluation Plan
Stage One Testing - Validation
Stage Two Testing - Functionality Testing
Stage Three Testing - Real Life Data
Commercialisation Potential
Conclusion and Further Research Opportunities
References


TABLE OF FIGURES
Figure 1 - Sentiment Analysis Framework
Figure 2 - Cluster Method Accuracy/Efficiency
Figure 3 - Graphical Representation of Content
Figure 4 - Basic Validation Testing Results
Figure 5 - Two Topic Validation Testing
Figure 6 - Sample Test Output (Obama)
Figure 7 - Sample Test Data (McCain)


INTRODUCTION
The media as we now conceptualise it has changed dramatically. With the internet, people have an opportunity to weigh in on events by providing their opinions and feedback in real time through blogs, forums, social networks and commenting systems on news websites. There is a growing interest in measuring sentiment, which can be attributed to the dramatic increase in the volume of digitised information. An increasing number of studies in political communication focus on the sentiment or tone of news content, political speeches, or advertisements (Young, L, & Soroka, S 2012).

This report discusses the concept of developing a Semantic Web based sentiment engine that will be able to analyse public sentiment on current issues, from politics to reality TV shows. By tracking popular opinion through social media channels and leveraging research in the area of sentiment analysis, accurate predictions could be made possible on events from presidential elections to the X-Factor competition.

CONCEPT OVERVIEW
This proposed system is not a standard sentiment engine that returns static data; it offers increased functionality to assist with data interpretation. It allows end users to customise their search, filter the returned data by multiple parameters, and view graphical representations of the results to facilitate interpretation.

CONSTRAINTS AND LIMITATIONS


The limitations of this concept are not due to technological constraints but are simply down to the volatility of public opinion, which is something that cannot be remedied or corrected by technology. Another limitation is the scope of the opinion being captured. Users of social media and participants in online forums are statistically of a younger age group. The lack of inclusion of the opinion of older age groups could greatly affect the accuracy of the data, as it would not be entirely representative. This imbalance would particularly affect politics, with older groups statistically more likely to vote.

FUNCTIONAL DESCRIPTION
SENTIMENT SEARCH FUNCTIONS
Users can enter multiple search terms for the purpose of data comparison. Other features would be used to improve the analysis returned.

Multiple Search Parameters
o Time Frame Defined Search - Data retrieved can be limited to a specific time frame.
o Geographical Location Based Search - Data retrieved can be filtered by the location of users.
o Narrow Search Scope - Select websites to exclude, or restrict the search to a small number of websites.

Graphical representations of the data are generated.

TECHNIQUES
Sentiment Analysis Techniques

There is much research in the area of sentiment analysis, the primary objective being to find a technique where there is no trade-off between speed and accuracy. Several new and emerging techniques have been researched as part of identifying the best fit for this system.

Proximity-Based Approach (Hasan, S, & Adjeroh, D 2011)
o This proposed method uses proximity-based features to determine sentiment: proximity distribution, mutual information between proximity types, and proximity patterns.

Based on Annotation (Shukla, A 2011)
o This proposed method counts all the annotations present and calculates sentiment scores for all annotations, including comments, to determine sentiment.

Sentence-level Lexical Based Semantic Orientation (Khan, A et al, 2011)
o This proposed method uses SentiWordNet to calculate the semantic score of sentences it has classified as subjective from reviews and blog comments.

Machine Learning Approach to Contextual Information (Yang, C et al, 2008)
o This proposed method differentiates itself from others by taking context into account when determining the sentiment category. Its primary focus and test data sets have been blog posts. Figure 1 below shows the framework employed.

FIGURE 1 - SENTIMENT ANALYSIS FRAMEWORK

Clustering-Based Sentiment Analysis Approach (Li, G, & Liu, F 2012)

The method deemed most appropriate for this proposed system is based on an article from the Journal Of Information Science in April 2012, which outlined the Clustering-Based Sentiment Analysis approach. It proposed that by applying a TF-IDF weighting method, a voting mechanism and importing term scores, an acceptable and stable clustering result can be obtained (Li, G, & Liu, F 2012). The evaluation results were the most impressive of all techniques reviewed as part of this research. The approach appears to have performed well in terms of both accuracy and efficiency with no need for human participation, as can be seen from Figure 2.

FIGURE 2 - CLUSTER METHOD ACCURACY/EFFICIENCY
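The three ingredients named above (TF-IDF weighting, a voting mechanism and imported term scores) can be illustrated with a minimal sketch. It is not the implementation of Li and Liu: the two-cluster k-means, the tiny `TERM_SCORES` lexicon, the function names and the number of runs are all assumptions made for illustration only.

```python
import math
import random
from collections import Counter

# Hypothetical term scores, invented for this sketch (not from Li & Liu).
TERM_SCORES = {"good": 1.0, "great": 1.0, "love": 1.0,
               "bad": -1.0, "awful": -1.0, "hate": -1.0}


def tfidf(docs):
    """Return one L2-normalised {term: weight} vector per document."""
    tokenised = [doc.lower().split() for doc in docs]
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Term frequency times a smoothed inverse document frequency.
        vec = {term: (count / len(tokens))
               * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
               for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors


def dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())


def kmeans_two(vectors, rng, iterations=10):
    """Split the vectors into two clusters; returns a list of 0/1 labels."""
    centroids = rng.sample(vectors, 2)
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        assignment = [0 if dot(v, centroids[0]) >= dot(v, centroids[1]) else 1
                      for v in vectors]
        for k in (0, 1):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                total = Counter()
                for v in members:
                    total.update(v)
                centroids[k] = {t: w / len(members) for t, w in total.items()}
    return assignment


def lexicon_score(docs, assignment, k):
    """Sum the term scores of all documents assigned to cluster k."""
    return sum(TERM_SCORES.get(token, 0.0)
               for doc, a in zip(docs, assignment) if a == k
               for token in doc.lower().split())


def cluster_sentiment(docs, runs=5, seed=0):
    """Cluster several times and let the runs vote on each document."""
    rng = random.Random(seed)
    vectors = tfidf(docs)
    votes = [0] * len(docs)
    for _ in range(runs):
        assignment = kmeans_two(vectors, rng)
        # The term scores decide which of the two clusters is the positive one.
        positive = 0 if (lexicon_score(docs, assignment, 0)
                         >= lexicon_score(docs, assignment, 1)) else 1
        for i, a in enumerate(assignment):
            votes[i] += 1 if a == positive else -1
    return ["positive" if v > 0 else "negative" if v < 0 else "neutral"
            for v in votes]
```

Calling `cluster_sentiment(list_of_texts)` returns one label per text. On very small inputs the clusters can easily split along topic rather than sentiment, which is one reason repeated runs and voting are used to stabilise the result.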

Apart from its accuracy and efficiency, this technique was deemed the most suitable as it can be applied universally to any data set. Other techniques researched have been developed for particular data types, such as customer reviews or blogs, and their evaluation appraisals appear to suggest they do not perform as well outside of these data types.

TIME PARAMETER BASED SEARCH
This sentiment engine would make use of the adaptable Librato API libraries to allow sentiment returns to be time sensitive. This would allow a user to evaluate how sentiment is changing over time or what sentiment was during specific time periods.

GEOGRAPHICAL EXTRACTION BASED
Adding a geographical element would be a unique feature allowing for mapping of sentiment results. Preferred location content will be pulled from the Twitter API as it gives access to the Twitter profile location. Comment systems used by news websites, such as that of the Irish Times, request a location before a comment is posted. The Facebook API allows access to the location of a user if the privacy setting is turned on. OAuth would be used to allow the users of the sentiment engine to explore the opinions of their friends and networked associates and where these fit on the sentiment scales. Other free-use location APIs may also be needed.
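The time-frame and location filters described above can be sketched as a simple aggregation over already-classified posts. The `Post` record, its field names and the `sentiment_by_day` function are hypothetical and are not taken from the Librato, Twitter or Facebook APIs; the sketch only assumes that each post carries a timestamp, an optional location and a sentiment label.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional


@dataclass
class Post:
    """Hypothetical record for one classified piece of content."""
    text: str
    created: datetime
    location: Optional[str]  # None when the source exposes no location
    sentiment: str           # "positive", "negative" or "neutral"


def sentiment_by_day(posts, start, end, location=None):
    """Count sentiment labels per calendar day for start <= created < end.

    When a location is given, posts from other locations (and posts with
    no known location) are left out.
    """
    buckets = defaultdict(Counter)
    for post in posts:
        if not (start <= post.created < end):
            continue
        if location is not None and post.location != location:
            continue
        buckets[post.created.date()][post.sentiment] += 1
    return dict(sorted(buckets.items()))
```

The per-day counts returned here could be fed directly into a chart of sentiment over time, or grouped by location instead of by day to support a map view.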


SOCIAL SENTIMENT EXTRACTION BASED DATA
The content used to create a matrix of information within which sentiment would be evaluated via FLP would likely include, but not be limited to, the following: Twitter; Disqus; Livefyre; IntenseDebate; Drupal comments; WordPress comments; other blog posts; scraped open Facebook and fan page comments; the Facebook comment system; text comments; G+ posts; Slideshare.net; Pinterest pins; Google News articles; comments on various bookmarking sites such as fark.com and reddit; and other language-relevant wire news services.

GRAPHICAL DATA GENERATION TOOLS
Graphical representations of the data are generated. For results to be useful on mobile devices and tablets, they could be rendered as web-based Flash objects or in a way that is compliant with the evolving HTML5 standards and with iOS 5, given the animosity Apple has with Adobe over Flash. These reports would be exportable to Crystal Reports.
[Bar chart: number of Positive, Neutral and Negative items for Candidate A and Candidate B; vertical axis 0 to 1600]

FIGURE 3 - GRAPHICAL REPRESENTATION OF CONTENT

PROS & CONS OF PROPOSED SYSTEM


The primary argument for why sentiment engines built on the Semantic Web and linked data are useful is based upon the new information and insight that can be gleaned from them. The ability to know relative and positional sentiment can be useful in many analytical or informational arbitrage situations.

In terms of the cons, the primary concern would be data quality. Problems with data quality are a huge issue and can skew any resulting analysis. The extent of the data quality problem has often been discovered by information activists working in the open data movement. Secondly, privacy concerns, and staying within the spirit and letter of the relevant data privacy laws of the regulatory regime you operate under, may at times be an issue. This can be tricky given the interconnected nature of the web. Lastly, inaccuracies in the data, and its being organised in short sets rather than deeper data, may create false sentiments. Is there enough data being looked at to create a realistic positive or negative sentiment? Some analysis may need additional parsing to tease out, for example, initial heated emotional responses from the rational morning-after response.

EVALUATION PLAN
STAGE ONE TESTING - VALIDATION
The evaluation plan would begin with simple software validation. The first test case would consist of validating the fundamental functionality of the system: its ability to differentiate between sentiments. The data set to be used is the movie review data from the Pang et al experiments (Pang et al, 2002; see footnote). Movie review data is widely regarded as the most challenging data for sentiment engines to analyse. This can be attributed to the fact that a positive review may contain descriptions of gory or violent scenes, and equally a negative review could contain descriptions of light-hearted, pleasant scenes. For additional testing, other data sets could be used for each iteration of this dynamic testing stage.

1 Pang B, Lee L, Vaithyanathan S. Thumbs up? Sentiment classification using machine learning techniques. In: Conference on Empirical Methods in Natural Language Processing (EMNLP). Philadelphia, Pennsylvania, USA, 2002, p. 79.


[Pie chart: Neutral, Positive and Negative shares (values shown: 20%, 39%, 41%)]
FIGURE 4 - BASIC VALIDATION TESTING RESULTS
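Stage one validation amounts to comparing the engine's labels against the known labels of a reference data set. A minimal sketch of that comparison follows; the function names are hypothetical and the metric choice (plain accuracy plus a confusion count) is an assumption, as the report does not prescribe a specific measure.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Fraction of items whose predicted label matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty data set")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def confusion(gold, predicted):
    """Count (gold label, predicted label) pairs to show where errors fall."""
    return Counter(zip(gold, predicted))
```

The confusion counts are useful for the movie review case in particular, since they show whether errors are concentrated in positive reviews read as negative or the reverse.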

STAGE TWO TESTING - FUNCTIONALITY TESTING


The second stage of testing would be the validation of the multiple input functionality, to ensure that data can be retrieved for two or more search terms and that the results for each term can be accurately differentiated. The test case for this would build on the first stage of testing, with added content regarding a second movie.

[Two pie charts, titled Schindler's List and The Usual Suspects, each split into Neutral, Positive and Negative (values shown: 39%, 20%, 41% and 20%, 21%, 59%)]

FIGURE 5 - TWO TOPIC VALIDATION TESTING

STAGE THREE TESTING - REAL LIFE DATA


The final stage of the evaluation plan would be to perform testing using previous high-profile events as the test cases, such as the US Presidential Election of 2008 and the X-Factor competition from previous years. This validation is more complex as it will span the entire internet, not just the staging website. The testing would be performed over different time intervals: days, weeks, months, and the entire duration of the event. In the case of political elections these time periods could be chosen to coincide with official opinion polls, for example Gallup and Rasmussen stateside or RedC for Irish-based events. The geographical sentiment analysis function would also be tested to gauge the accuracy of the location results. In the case of the US Presidential Election, the final voting percentages for each candidate per state would give an accurate basis for comparison.

SAMPLE EVALUATION TEST CASE
The test case takes the ten states where each candidate won by the largest percentage majority and graphs the percentage of votes each candidate received, together with the percentage of positive, negative and neutral data regarding that candidate. What one would expect in a fully evaluated system is a close correlation between positive data and the candidate's percentage of votes, and also a correlation between the negative or neutral data and the other candidate's percentage of votes, as per the sample charts below for Obama and McCain respectively.
[Chart: Obama's Data. Series: Obama's Percentage of Votes, McCain's Percentage of Votes, Positive %, Negative %, Neutral %. Vertical axis: 0 to 90.]

FIGURE 6 - SAMPLE TEST OUTPUT (OBAMA)


[Chart: McCain's Data. Series: McCain's Percentage of Votes, Obama's Percentage of Votes, Positive %, Negative %, Neutral %. Vertical axis: 0 to 70.]

FIGURE 7 - SAMPLE TEST DATA (MCCAIN)
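The "close correlation" expected in the sample test case could be quantified with a Pearson correlation coefficient computed across states. This is a sketch under that assumption; the report does not name a specific statistic, and the `pearson` function and its inputs are illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric series.

    Here xs would be the per-state share of positive data for a candidate
    and ys the per-state vote share for the same candidate.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 for positive data against the candidate's own vote share, and for negative data against the opponent's vote share, would support the expectation stated in the test case. With only ten states per candidate the coefficient should be read as indicative rather than conclusive.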

COMMERCIALISATION POTENTIAL
In an era where both businesses and individuals are attempting to move further and further towards data-driven decisions, sentiment engine products have a range of commercial potential. Some companies have already begun commercialising Semantic Web applications; for example, IBM licensed its WebFountain Internet analytical engine to Factiva and Thomson Reuters in 2003 for those interested in corporate reputational data. There is also a market among those who cannot afford Enterprise Resource Planning (ERP) add-ons like SAP Business Objects, SAS, or LexisNexis Analytics, and for whom the currently available crop of free semantic sentiment engine tools is insufficient, too niche, or unscalable (Basu, 2010). Semantic Web products are becoming important in internal and external Business Informatics. However, information arbitrage is not merely for professional market traders. This system would likely be offered as software as a service (SaaS) on the web; it could be sold on a freemium basis, by monthly subscription or by yearly licence, depending on the implementation.


Primary clients would depend on the sentiments needing to be parsed and the proprietary and public data sets being used within the sentiment engine. Examples include: corporate media; the content publishing industry; PR firms; polling; market research firms; trading platforms; political parties; elections; government agencies; security services; and bookmakers deciding odds on novelty bets (reality TV shows, politics etc.).

CONCLUSION AND FURTHER RESEARCH OPPORTUNITIES


Where does the Semantic Web lead to exactly? We don't really know, but opening up the segregated data silos and making sense of deeper, dark big data, in pursuit of the benefits of a deeper-rooted hyperdata, would be a nice path. The road will be long, but it may improve our day-to-day lives immensely.

"Many applications and services claim to be "semantic" in one manner or another, but that does not mean they are "Semantic Web." Semantic applications include any applications that can make sense of meaning, particularly in language such as unstructured text, or structured data in some cases. By this definition, all search engines today are somewhat "semantic" but few would qualify as "Semantic Web" apps." (Spivack, 2007)

How we get from the early steps of Web 3.0 to this deeper data web will be a long process. It will provide countless benefits, many of which we may not even perceive today. However, sentiment engines are merely one way to get the public and the developer community interested in and excited about all the other benefits that this open data future could hold. For that reason sentiment engines will remain an important component in the near-term future, as big data holds much of the future promise of the web of things and of making sense and use of it.


REFERENCES
Abbasi, A, Hsinchun, C, & Salem, A 2008, 'Sentiment Analysis in Multiple Languages: Feature Selection for Opinion Classification in Web Forums', ACM Transactions On Information Systems, 26, 3, pp. 1-34, Computers & Applied Sciences Complete, viewed 4 May 2012.

Basu, Saikat 2010. 10 Web Tools To Try Out Sentiment Search & Feel the Pulse. Make Use Of [Online] 30 April. http://www.makeuseof.com/tag/10-web-tools-sentimentsearch-feel-pulse/ [Accessed 1 May 2012]

Bergman, Mike 2010. I Have Yet to Metadata I Didn't Like. AI3 [Online] 16 August. http://www.mkbergman.com/902/i-have-yet-to-metadata-i-didnt-like/ [Accessed 1 May 2012]

Bollen, J, Mao, Huina, & Zeng, Xiao-Jun, March 2011. Twitter mood predicts the stock market. Journal of Computational Science, 2(1), pp. 1-8. Available from: http://arxiv.org/abs/1010.3003

Cai, K, Spangler, S, Ying, C, & Li, Z 2010, 'Leveraging sentiment analysis for topic detection', Web Intelligence & Agent Systems, 8, 3, pp. 291-302, Academic Search Complete, viewed 20 April 2012.

Dalton, Jeff 2007. Caffè Java Open Source NLP and Text Mining tools. Jeff's Search Engine Caffè [Online] 16 March. http://www.searchenginecaffe.com/2007/03/javaopen-source-text-mining-and.html [Accessed 1 May 2012]

Hamouda, A, Marei, M, & Rohaim, M 2011, 'Building Machine Learning Based Sentiword Lexicon for Sentiment Analysis', Journal Of Advances In Information Technology, 2, 4, pp. 199-203, Library, Information Science & Technology Abstracts with Full Text, viewed 1 May 2012.

Hasan, S, & Adjeroh, D 2011, 'Detecting Human Sentiment from Text using a Proximity-Based Approach', Journal Of Digital Information Management, 9, 5, pp. 206-212, Library, Information Science & Technology Abstracts with Full Text, viewed 7 May 2012.

Kang, H, Yoo, S, & Han, D 2012, 'Senti-lexicon and improved Naïve Bayes algorithms for sentiment analysis of restaurant reviews', Expert Systems With Applications, 39, 5, pp. 6000-6010, Academic Search Complete, viewed 10 April 2012.

Lévy, Pierre CRC, FRSC 2007. Elements of Semantic Engineering. I3 workshop / WWW Consortium Conference / Banff 2007. Available from: http://www.ieml.org/text/semantic_space.pdf

Li, G, & Liu, F 2012, 'Application of a clustering method on sentiment analysis', Journal Of Information Science, 38, 2, pp. 127-139, Business Source Complete, viewed 21 April 2012.

Pang B, Lee L, Vaithyanathan S. Thumbs up? Sentiment classification using machine learning techniques. In: Conference on Empirical Methods in Natural Language Processing (EMNLP). Philadelphia, Pennsylvania, USA, 2002, p. 79.

Shukla, A 2011, 'Sentiment Analysis of Document Based on Annotation', International Journal Of Web & Semantic Technology, 2, 4, pp. 91-103, Computers & Applied Sciences Complete, viewed 6 May 2012.

Spivack, Nova 2007. The Semantic Web, Collective Intelligence and Hyperdata. novaspivack.typepad.com [Online] 18 September. http://novaspivack.typepad.com/nova_spivacks_weblog/2007/09/hyperdata.html [Accessed 1 May 2012]

Vishwanath, J, & Aishwarya, S 2011, 'User Suggestions Extraction from customer Reviews: A Sentiment Analysis approach', International Journal On Computer Science & Engineering, 3, 3, pp. 1203-1206, Academic Search Complete, viewed 1 May 2012.

Yang, C, Lin, K, & Chen, H 2008, 'Sentiment Analysis in Weblog Using Contextual Information: A Machine Learning Approach', International Journal Of Computer Processing Of Languages, 21, 4, pp. 331-345, Academic Search Complete, viewed 27 April 2012.

Young, L, & Soroka, S 2012, 'Affective News: The Automated Coding of Sentiment in Political Texts', Political Communication, 29, 2, pp. 205-231, Academic Search Complete, viewed 10 May 2012.
