
A PROJECT PROGRESS REPORT

ON

SENTIMENT ANALYSIS &


INFORMATION EXTRACTION
IN
PARTIAL FULFILLMENT OF THE REQUIREMENT FOR THE AWARD
OF THE DEGREE
OF
BACHELOR OF TECHNOLOGY
SESSION 2010-2014

GUIDED BY
Ms. PARUL YADAV

SUBMITTED BY
DIKSHA MAHAJAN (25011503110)

CERTIFICATE
This is to certify that the project entitled SENTIMENT ANALYSIS &
INFORMATION EXTRACTION is the original work carried out by Diksha Mahajan
(25011503110), a student of B.Tech (IT), BVCOE, affiliated to GGSIPU, during the year 2014, in
partial fulfillment of the requirements for the award of the degree of Bachelor of Technology in
Information Technology, and that the project has not previously formed the basis for the award of
any degree, diploma, associateship, fellowship or any other similar title.

Signature of the Guide

Ms. PARUL YADAV


IT Dept, BVCOE

1. Objective

Abstract:
The project aims to provide a sentiment analysis system through a web interface that enables
web users, analysts and product managers to gain insight into public sentiment on particular
products and services. The project makes extensive use of product and service review sites and
forums such as IMDB, as well as microblogging sites such as Twitter. The system aims to apply
efficient information retrieval algorithms and to perform the complex task of feature extraction,
so that a more drilled-down analysis can be carried out in the most efficient way.

Introduction
What is Part of Speech Tagging and how did we implement it?
In linguistics, Part of Speech tagging is also called grammatical tagging or
word category disambiguation. It assigns each word to its category; in English,
for example, words are divided into nouns, verbs, prepositions etc. Part of
Speech tagging is now performed in the context of computational linguistics
using algorithms built on Hidden Markov Models, decision tables, dynamic
programming models, unsupervised taggers etc. It is part of Natural Language
Processing, and many successful contributions have been made on this topic.
A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in
some language and assigns a part of speech, such as noun, verb or adjective, to
each word (and other token). Computational applications generally use more
fine-grained POS tags such as 'noun-plural'. We used the Stanford POS tagger,
a Java implementation of the log-linear part-of-speech taggers developed by
engineers and researchers at Stanford.
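
To illustrate what tagged text looks like, the following minimal Python sketch parses a hand-written sample in the word_TAG style commonly used for tagged text, with Penn Treebank tags (NN noun, NNS plural noun, NNP proper noun), and groups the nouns. The sample sentence and the function names parse_tagged and words_by_tag are assumptions made for illustration only; the Java-based Stanford POS tagger used in the project is not invoked here.

```python
from collections import defaultdict


def parse_tagged(tagged_text, separator="_"):
    """Split whitespace-separated 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in tagged_text.split():
        # rpartition splits on the LAST separator, so words that themselves
        # contain the separator keep it.
        word, sep, tag = token.rpartition(separator)
        if sep:  # skip tokens that carry no tag at all
            pairs.append((word, tag))
    return pairs


def words_by_tag(pairs, wanted_tags):
    """Group words under each tag in wanted_tags, preserving input order."""
    grouped = defaultdict(list)
    for word, tag in pairs:
        if tag in wanted_tags:
            grouped[tag].append(word)
    return dict(grouped)


# Hand-written sample (hypothetical), not real tagger output.
sample = "Leonardo_NNP gave_VBD a_DT brilliant_JJ performance_NN in_IN two_CD movies_NNS ._."
pairs = parse_tagged(sample)
nouns = words_by_tag(pairs, {"NN", "NNS", "NNP", "NNPS"})
print(nouns)
```

Grouping by tag in this way is one simple route to the candidate keywords (names, nouns) that later steps of the project filter further.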


Sentiment analysis: introduction and how we are going to implement it

Sentiment analysis
Sentiment Classification, a subtopic of Sentiment Analysis, is the study of computationally
determining whether a given piece of text is positive or negative. Machine learning techniques
are usually applied to sentiment classification: a classifier is trained on a labeled training
set, which is called supervised learning. However, owing to the nature of tweets and the sheer
number of them that can be collected, manually labeling a training set of such magnitude is a
challenging task.
Algorithm used:
Naive Bayes classifier
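
As a rough illustration of how a Naive Bayes classifier can be trained on labeled text, here is a minimal Python sketch of a multinomial Naive Bayes model with add-one (Laplace) smoothing. The class name NaiveBayesSentiment, the whitespace tokenizer and the four training sentences are all assumptions invented for this example; they are not the project's training data or implementation.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSentiment:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    @staticmethod
    def tokenize(text):
        # Deliberately simple: lowercase and split on whitespace.
        return text.lower().split()

    def train(self, labeled_texts):
        for text, label in labeled_texts:
            self.doc_counts[label] += 1
            for word in self.tokenize(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in self.tokenize(text):
                if word not in self.vocabulary:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Tiny hypothetical training set, for illustration only.
training = [
    ("brilliant acting and a great story", "positive"),
    ("great direction and wonderful music", "positive"),
    ("boring plot and terrible acting", "negative"),
    ("terrible script and a dull story", "negative"),
]
clf = NaiveBayesSentiment()
clf.train(training)
print(clf.classify("great acting"))
print(clf.classify("dull and boring"))
```

Working in log space avoids numeric underflow when many word probabilities are multiplied, and the smoothing keeps a word unseen for one label from zeroing out that label's score.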

Tools to use:
Weka-Parallel
Algorithm followed:
1. Generate the IMDB movie review URL for the movie.
2. Download all the review web pages from IMDB.
3. Apply POS tagging to the downloaded movie reviews to get the nouns and proper nouns, such as
"leonardo", "acting", "direction", "oscars" etc.
4. Identify all the actors, actresses, directors and movie names present in the list generated
in step 3.
5. Extract all the sentences that contain the keywords identified in step 4.
6. Apply sentiment analysis to the sentences extracted in step 5.
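
The sentence-extraction step of the algorithm above (finding the sentences that mention a keyword) can be sketched as follows. This is a minimal Python illustration under stated assumptions: the sentence splitter is a naive regular expression, keywords are single words matched case-insensitively, and the review text and the function names split_sentences and sentences_with_keywords are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sentences_with_keywords(text, keywords):
    """Map each keyword to the sentences that mention it as a whole word."""
    matches = {kw: [] for kw in keywords}
    for sentence in split_sentences(text):
        # Compare against whole lowercase words, not substrings.
        words = set(re.findall(r"[a-z0-9']+", sentence.lower()))
        for kw in keywords:
            if kw.lower() in words:
                matches[kw].append(sentence)
    return matches


# Hypothetical review text, for illustration only.
review = ("Leonardo was outstanding in this film. The direction felt uneven at times! "
          "I would still watch it again.")
result = sentences_with_keywords(review, ["leonardo", "direction"])
print(result)
```

Each extracted sentence can then be passed to the sentiment classifier, so that an opinion is attached to a specific actor, director or movie rather than to the review as a whole. Multi-word names would need phrase matching, which this sketch does not attempt.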
IMDBCrawler:
We built an IMDB review extractor because IMDB does not provide any API for extracting reviews.
We used an available API that returns the IMDB id for a given movie; we then download the
corresponding review web page and store the results. We used the Jsoup Java library to download
web content and to apply pattern matching to the downloaded text.
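
The two crawler steps, building the review URL from an IMDB id and pulling review text out of the downloaded page, can be sketched roughly as below. This is a Python standard-library illustration, not the project's Jsoup-based Java code. The URL pattern, the CSS class name "review-text" and the sample HTML are assumptions; the real page markup would have to be inspected, and no network request is made here.

```python
from html.parser import HTMLParser


def review_url(imdb_id):
    """Build a reviews-page URL for a title id in the 'tt' + digits format."""
    if not (imdb_id.startswith("tt") and imdb_id[2:].isdigit()):
        raise ValueError(f"not an IMDB title id: {imdb_id!r}")
    # Assumed URL scheme; verify against the live site before relying on it.
    return f"https://www.imdb.com/title/{imdb_id}/reviews"


class ReviewTextParser(HTMLParser):
    """Collect the text inside <div> elements carrying a given CSS class.

    The default class name 'review-text' is hypothetical.
    """

    def __init__(self, css_class="review-text"):
        super().__init__()
        self.css_class = css_class
        self.depth = 0        # > 0 while inside a matching div
        self.reviews = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1   # nested div inside a review
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Collapse runs of whitespace and store the finished review.
                self.reviews.append(" ".join("".join(self._buffer).split()))
                self._buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self._buffer.append(data)


# Inline sample page (hypothetical markup) so the sketch runs offline.
sample_html = """
<html><body>
  <div class="review-text">Great <b>acting</b> by the lead.</div>
  <div class="sidebar">Unrelated</div>
  <div class="review-text">The plot was <div>quite</div> dull.</div>
</body></html>
"""
parser = ReviewTextParser()
parser.feed(sample_html)
print(review_url("tt1375666"))
print(parser.reviews)
```

Tracking the nesting depth keeps text from inner elements inside the review while ignoring unrelated blocks such as the sidebar.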

Handouts:

Progress:

S.NO   TASKS                  ATTEMPTED   STATUS
1      Feature Extraction
1.1    Actors                 Yes         Completed
1.2    Actresses              Yes         Completed
1.3    Directors              Yes         Completed
1.4    Movies                 Yes         Completed
2      Crawler
2.1    IMDB                   Yes         Completed
2.2    Rotten Tomatoes        No
2.3    GSM Arena              No
3      Algorithm
3.1    POS Integration        Yes         Completed
3.2    Sentiment Analysis     No
3.3    Entity Recognition     No
4      User Interface
4.1    Main Module            Yes         In Progress
4.2    Contribution Module    No
4.3    Project Wiki           No

References:

[1] Toutanova, K., Klein, D., Manning, C.D., Singer, Y.: Feature-rich part-of-speech tagging
with a cyclic dependency network. In: NAACL 3. (2003) 252–259
[2] Manning, C.D.: Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics?
In: Computational Linguistics and Intelligent Text Processing, 12th International Conference,
CICLing 2011. (2011)
[3] Shen, L., Satta, G., Joshi, A.: Guided learning for bidirectional sequence classification. In:
ACL 2007. (2007)
[4] Spoustová, D.j., Hajič, J., Raab, J., Spousta, M.: Semi-supervised training for the averaged
perceptron POS tagger. In: Proceedings of the 12th Conference of the European Chapter of the
ACL (EACL 2009). (2009) 763–771
[5] Søgaard, A.: Simple semi-supervised training of part-of-speech taggers. In: Proceedings of
the ACL 2010 Conference Short Papers. (2010)
[6] Pang, B., Lee, L.: Opinion mining and sentiment analysis. In: Foundations and Trends in
Information Retrieval. (2008)
[7] Yang, C., Lin, K.H.-Y., Chen, H.-H.: Building emotion lexicon from weblog corpora. In:
Proceedings of the ACL 2007 Interactive Poster and Demonstration Sessions. (2007)
[8] Go, A., Huang, L., Bhayani, R.: Twitter sentiment analysis. Final Projects from CS224N for
Spring 2008/2009 at The Stanford Natural Language Processing Group. (2009)
