
Update ‐ 26 Jan 2009

Product Review Summarization

Ly Duy Khang
Recall:
Summarization system
1. A set of product reviews
2. Product facet identification
3. Facet‐oriented sentence clustering
4. In‐focused snippet summarization
Recall:
Product facet identification (1/2)
• Automatic approach:
– Statistical measurement (ARM) to extract frequent nouns/noun phrases
– Noise + facet coverage + no implicit facets
– These issues affect the sentence clustering later
Recall:
Product facet identification (2/2)
• Manual approach:
– Identify facets manually
– Associate each facet with a list of words that would trigger that facet
• Affordability -> {price, money, value, expensive … }
– Noise + facet coverage + no implicit facets
Recall:
Facet‐oriented sentence clustering
• For each sentence:
– The facet is labeled based on keyword occurrence (a sketch follows below)
– Multiple facet labels are possible
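
A minimal Python sketch of this labeling step, assuming a hand-built facet-to-trigger-word map like the one recalled above; the facet names and trigger lists here are illustrative placeholders, not the actual lexicon:

# Keyword-triggered facet labeling; trigger lists are illustrative placeholders.
FACET_TRIGGERS = {
    "affordability": {"price", "money", "value", "expensive"},
    "battery": {"battery", "charge", "recharge"},
}

def label_facets(sentence):
    """Return every facet whose trigger words occur in the sentence."""
    tokens = set(sentence.lower().split())
    return [facet for facet, triggers in FACET_TRIGGERS.items() if tokens & triggers]

print(label_facets("the battery died fast and it was too expensive ."))
# -> ['affordability', 'battery'] (a sentence can receive multiple labels)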
In‐focused snippet summarization ‐
Problem formulation (1/2)
• Input:
– Clusters of sentences belonging to each facet
• Output:
– Sentence ranking + grouping
– Sentence polarity
– In‐focused snippet representation
In‐focused snippet summarization ‐
Problem formulation (2/2)

Example: the facet's sentences sen1 … senN are ranked, grouped, and marked with polarity:

Battery:
+ sen1 (and 5 similarities)
- sen3 (and 0 similarities)
+ sen5 (and 1 similarity)
- sen6 (and 2 similarities)
+ sen2 (and 2 similarities)

Ungrouped:
+ sen7
- sen10

Snippet representation:
the drawbacks were that they were not user friendly for the casual photographer ,
the lcd screen is a little too small .
-> … the lcd screen is too small .
In‐focused snippet summarization:
Methodology
• Sentiment analysis (manual)
• Editing
• Clustering
• Ranking
Methodology:
Editing (1/13)
• Related works (1/2):
– Jing, H. (2000). "Sentence reduction for automatic text summarization"
– Machine learning technique
– Learns from pairs of extracts and abstracts
– Features:
• Syntactic tree
• Grammar checking (integrated lexicon)
• Context information
Methodology:
Editing (2/13)
• Related works (2/2):
– Knight, K., & Marcu, D. (2000). "Statistics‐based summarization ‐ step one: Sentence compression"
– Noisy channel model
– Source (short sentence) and target (long sentence)
– The model is learned by machine learning
– Features:
• Syntactic tree
• Bigram
Methodology:
Editing (3/13)
• Problem:
– General-purpose compression
– No control over retaining/removing a specific targeted part of the sentence
– Cannot reduce to a phrase (an incomplete sentence)
Methodology:
Editing (4/13)
• Proposed approach:
– Initialize a focused part of the sentence
– Insert words until the resulting snippet is minimal yet meaningful
Methodology:
Editing (5/13)

[Demo: the sentence below is shown four times, with more words selected at each step of the editing]

the drawbacks were that they were not user friendly for the casual photographer ,
the lcd screen is a little too small .

Resulting snippet:
… the lcd screen is too small .
Methodology:
Editing (6/13)
• Problem modeling (1):
– Consider that a sentence can be represented as a sequence of words in the following form:
"NNNNNN00111000NNN"
• 1: We want to keep the word at this position
• 0: We want to remove the word at this position
• N: We haven't decided about the word at this position
– At each step, we decide whether to keep an undecided word or not
– We terminate when all undecided words have been decided.
Methodology:
Editing (7/13)

[Search-tree diagram: from the initial state N1NN at T=0, each step decides one undecided position, giving states such as 01NN, N10N, N11N, N1N0, N1N1 at T=1 and N100, N101, 010N, 110N, 11N1 at T=2]
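
To make the state encoding above concrete, a small Python sketch, assuming each state is a string over {N, 0, 1} and exactly one undecided position is resolved per step:

# N/0/1 state encoding: each step decides one undecided ('N') position.
def successors(state):
    """All states reachable by deciding exactly one 'N' position."""
    out = []
    for i, ch in enumerate(state):
        if ch == "N":
            out.append(state[:i] + "1" + state[i + 1:])  # keep this word
            out.append(state[:i] + "0" + state[i + 1:])  # remove this word
    return out

def is_terminal(state):
    return "N" not in state  # every word has been decided

print(successors("N1NN"))
# -> ['11NN', '01NN', 'N11N', 'N10N', 'N1N1', 'N1N0']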
Methodology:
Editing (8/13)
• Questions:
– How large is the search space?
– How can we evaluate the quality of an insertion?
Methodology:
Editing (9/13)
• The search space:
– Factorial
– Heuristic: dependency links

– Ex: The battery life of the camera is impressive .

• det(life-3, The-1)
• nn(life-3, battery-2)
• nsubj(impressive-8, life-3)
• det(camera-6, the-5)
• prep_of(life-3, camera-6)
• cop(impressive-8, is-7)
Methodology:
Editing (10/13)
• The search space:
– Only words that have at least one link to the already-selected words are considered, as sketched below
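
A Python sketch of this restriction; the dependency triples from the earlier example are hard-coded as 0-based (head, dependent) index pairs in place of a real parser:

# Dependencies for "The battery life of the camera is impressive ." (0-based),
# hard-coded here instead of calling an actual dependency parser.
DEPS = [(2, 0),   # det(life-3, The-1)
        (2, 1),   # nn(life-3, battery-2)
        (7, 2),   # nsubj(impressive-8, life-3)
        (5, 4),   # det(camera-6, the-5)
        (2, 5),   # prep_of(life-3, camera-6)
        (7, 6)]   # cop(impressive-8, is-7)

def candidates(selected):
    """Undecided word indices with at least one link into the selected set."""
    linked = set()
    for head, dep in DEPS:
        if head in selected:
            linked.add(dep)
        if dep in selected:
            linked.add(head)
    return linked - selected

print(candidates({2}))  # starting from "life": -> {0, 1, 5, 7}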
Methodology:
Editing (11/13)
• The quality of the insertion:
– We want to compute P(S_{t+1} | S_t), where S_t is a random variable that can take on all possible sequences at time t
– The Viterbi algorithm finds the maximum-probability sequence
– We use the following formula:
Methodology:
Editing (12/13)
• Let X be the word we are going to decide on in order to move the state from t to t+1.
• Let D(X) = 1 if X is kept, otherwise D(X) = 0
• Let E be the set of words that have been decided at time t, and R(X), a subset of E, be the set of words that have a dependency with X.

P(S_{t+1} | S_t) = P(D(X) | E)
                 = α × C(X, E) + β × N-gram(X, R(X)) + (1 − α − β) × Stat(X, R(X))
where 0 ≤ α, β ≤ 1
Methodology:
Editing (13/13)

C(X, E) = P(Re(X, E) | Re(E))
where Re(E) represents the existing set of dependencies
and Re(X, E) represents the expanded set of dependencies

N-gram(X, R(X)) can be modeled with proximity search + WebPMI

Stat(X, R(X)): review corpus evidence
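
A Python sketch of the interpolated score; the weights and the three component functions below are illustrative placeholders for the dependency-consistency, proximity-search/WebPMI, and review-corpus estimates, included only to make the sketch runnable:

# Interpolated transition score P(S_{t+1} | S_t) = P(D(X) | E).
def transition_score(x, decided, linked,
                     c_score, ngram_score, stat_score,
                     alpha=0.4, beta=0.3):
    """Linear interpolation with weights alpha, beta, and 1 - alpha - beta."""
    assert 0 <= alpha <= 1 and 0 <= beta <= 1 and alpha + beta <= 1
    return (alpha * c_score(x, decided)
            + beta * ngram_score(x, linked)
            + (1 - alpha - beta) * stat_score(x, linked))

# Uniform placeholder components, standing in for the estimates above.
score = transition_score("battery", decided={"life"}, linked={"life"},
                         c_score=lambda x, e: 0.5,
                         ngram_score=lambda x, r: 0.5,
                         stat_score=lambda x, r: 0.5)
print(score)  # -> 0.5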
Clustering + Ranking
• We already have the sentence similarity measurement adapted from baseline‐1
• Running MMR on the edited sentences with a threshold will tell us how many clusters of sentences there are; a sketch follows below.
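
A minimal MMR sketch; sim() is a placeholder token-overlap measure standing in for the baseline‐1 metric, and the relevance scores and threshold are illustrative:

# Greedy MMR: pick sentences that are relevant but not redundant; stop when
# the best marginal score falls below the threshold, so the number of picks
# approximates the number of clusters.
def sim(a, b):
    """Placeholder token-overlap similarity (stands in for baseline-1)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def mmr(sentences, relevance, lam=0.7, threshold=0.1):
    selected, rest = [], list(sentences)
    while rest:
        def marginal(s):
            redundancy = max((sim(s, t) for t in selected), default=0.0)
            return lam * relevance[s] - (1 - lam) * redundancy
        best = max(rest, key=marginal)
        if marginal(best) < threshold:
            break
        selected.append(best)
        rest.remove(best)
    return selected

sents = ["the battery life is great",
         "battery life is impressive",
         "the lcd screen is too small"]
print(mmr(sents, {s: 1.0 for s in sents}))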
Todos
• Finish the whole system, with a simple formulation of the editing part by considering text contingency.
• Next meeting:
– Wednesday 03 Feb 1:00am Singapore time
