Step 0: Nouns/NPs and contexts extraction

• The dataset is split into sentences. After lemmatizing all words and tokenizing all sentences, all dependency pairs are extracted from the sentences. Let N = {n_1, n_2, ...} and C = {c_1, c_2, ...} denote the sets of nouns/NPs and context terms, respectively. The context of n_i ∈ N is then maintained as

  C_{n_i} = { c_i ∈ C | (n_i, c_i) is a dependency pair },

e.g. C_{coffee} = {delicious, have, drink} in figure 2.

Step 1: Seed word selection

We adopted the MI (mutual information) metric to compute the weight score of a feature term in a category. The weight score expresses the strength of the connection between the term and the category; for example, in figure 2a, because "delicious" is more specific to category food than "good" is, it receives a much higher weight. Let N_k = {n_1, n_2, n_3, ..., n_l} denote the current member set of category k; the weight score of a feature term f_i ∈ F_k in k at this point is computed as

  coef_k(f_i) = (1 / F_i^{cat}) · Σ_{n_j ∈ N_k} P(f_i, n_j) · log( P(f_i, n_j) / (P(f_i) P(n_j)) )
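The weighting above can be sketched in a few lines of Python. The toy dependency pairs and the helper name `coef` are illustrative assumptions (not the paper's corpus or code), and F_i^{cat} is taken here as the number of categories containing the feature:

```python
import math
from collections import Counter

# Toy (noun, context-term) dependency pairs as produced by Step 0
# (illustrative data only, not the paper's corpus).
pairs = [("coffee", "delicious"), ("coffee", "drink"), ("bread", "delicious"),
         ("bread", "eat"), ("room", "large"), ("room", "clean"), ("bed", "large")]

total = len(pairs)
pair_count = Counter(pairs)
feat_count = Counter(c for _, c in pairs)   # counts behind P(f_i)
noun_count = Counter(n for n, _ in pairs)   # counts behind P(n_j)

def coef(feature, members, n_cats_with_feature=1):
    """Weight of `feature` for a category with current member set `members`:
    the sum over members of P(f,n) * log(P(f,n) / (P(f)P(n))), divided by
    F_i^cat (assumed here to be the number of categories containing f)."""
    score = 0.0
    for n in members:
        p_fn = pair_count[(n, feature)] / total
        if p_fn > 0:
            score += p_fn * math.log(p_fn / ((feat_count[feature] / total)
                                             * (noun_count[n] / total)))
    return score / n_cats_with_feature

food = {"coffee", "bread"}
# "delicious" attaches only to food members, so it outweighs the generic "large".
assert coef("delicious", food) > coef("large", food)
```

Terms that co-occur only with a category's members accumulate positive PMI mass, which is the behavior the "delicious" vs. "good" example describes.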
[Figure 2: running example; As(tea) = 0.8 vs. As(tea) = 0.1]

4 Experiments
… every aspect candidate is two.

Table 1. Seed words (translated to English)
Food: breakfast(1636), meal(1575), food(745), dinner(499), bread(375), restaurant(197), taste(193), coffee(168)
Room: room(6127), noise(439), bed(337), toilet(268), smell(233), window(186), refrigerator(177), shower(172)
Furo(Bath): furo(1932), hot spring(521), bath(483), open-air bath(246), hot water(214)
Service: support(2055), service(1270), reception(913), staff(379), employee(362), smiling face(248)
Location: station(682), location(686), place(459), convenience store(261), supermarket(32)
Facility/Amenity: facility(497), parking place(390), amenity(243), air purifier(63), humidifier(52)
Price/Cost: price(794), cost(569), charge(394), cost-performance(134)
Trash (a part): hotel(1846), spirit(1217), sense(767), chance(743), person(568), care(532), accommodation(519), lodging(519), feeling(380), other(352), children(307), ...

Table 2. Manually labeled data
Category | Total | Percentage
Food | 185 | 9.2%
Room | 267 | 13.3%
Furo(Bath) | 93 | 4.6%
Service | 208 | 10.3%
Location | 117 | 5.8%
Facility/Amenity | 107 | 5.3%
Price/Cost | 53 | 2.6%
Other | 986 | 48.9%

[Figure 3. Effectiveness of feature weighting — left: top-N-precision graphs; right: top 100 terms, weighted vs. unweighted]

4.2 Experimental results

4.2.1 Effectiveness of feature weighting

This experiment aims at verifying the effectiveness of feature weighting on three criteria: first, whether feature weighting improves classification performance compared with the case when feature terms are not weighted; second, whether category-specific feature terms are assigned higher scores.

Figure 3 shows the results for two categories: food, as the category with the narrowest meaning, and service, as the one with the broadest. On the left side, the upper and lower graphs are the top-N-precision graphs of food and service, respectively. For a point (x, y) on a top-N-precision graph, y is the precision computed only over the top x terms, where terms are sorted in decreasing order of membership score. A number such as 110/248 indicates that there are 110 proper terms (terms classified into the correct category) among the 248 collected terms of category food. For a more visual display, the right side presents the case N = 100 for the two categories; terms marked in gray are the proper terms. The 1st and 2nd columns show the top 100 terms of food: the left one is the result when feature terms were weighted, the right one when they were not. In the same way, the 3rd and 4th columns are the weighted and unweighted cases for category service.

We observed that feature weighting improved performance for both categories. For category food, overall precision increased from 0.2 to 0.45 and recall increased slightly from 104/185 to 110/185, while the density of proper terms in the top ranks was significantly improved. Category service showed an interesting change: it not only raised precision and recall by increasing this density but also collected many more new member terms in the
Table 3. Top feature terms
Food: delicious, breakfast, eat, have, service, smorgasbord, dinner, good, food, bread, served, satisfied, drink, coffee
Room: room, large, tidy, see, beautiful, nice, furo, narrow, big, hear, use, view, enter, stinking, twin, noise, cleanness
Furo(Bath): enter, large, good, relaxed, room, toilet, hot, cold, leave, weak, shower, go, furo, warm, food, narrow
Service: support, good, front desk, employee, staff, careful, kind, everyone, support, rent, correspond, smiling face, satisfied
Location: near, good, convenient, far, walk, station, separated, subway, room, hotel, think, shopping center, face
Facility/Amenity: enhanced, prepared, need, old, room, far, large, complete, new, beautiful, service, be separated, narrow
Price/Cost: cheap, takai (高い), reasonable, discount, good, excellent, room, great, vending machine, satisfied, cheap, satisfied
Trash: good, go, stay, several, say, bad, hotel, other, price, wait, use, pleased, go, room, less, be able to, old

Table 4. Low-frequency terms in the top 50
Food: rice(3), set meal(2), beef sashimi(3), kamameshi(2), ocean dinner(2), soba(2), fish dish(2), soup(4), welcome drink(3), plum wine(2), free breakfast(4), coffee(4), variety(2), salt(2)
Furo(Bath): crack(4), low temperature(2), bowl(4), ice(4), basket(2), bag(4), bath for family(2), toilet(4)
Service: hotel-man(3), front desk(3), employee(4), gentlemen(2), front desk man(2), official(2), hotel employee(2), manual(4), receipt(4), rental car(4), modem(2), extension cord(2)
Location: access to work(3), Oodori park(3), bar city(2), city hall(2), bar(2), curb market(2), mountain(4), walk(3), town areas(4), shopping(2), nearest station(2), access(2)
Price/Cost: lodging fee(3), parking fee(2), total(4), cost of a meal(2), greatly-cheap price(2)
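The terms in Table 4 all occur at most four times in the corpus, so the conventional frequency-pruning step used in several prior works would discard them before classification ever sees them. A minimal sketch of that kind of pruning, with a hypothetical threshold and counts:

```python
# Hypothetical frequency-threshold pruning, as used in several prior works:
# candidate terms occurring fewer than `min_freq` times are dropped up front,
# losing exactly the kind of low-frequency terms listed in Table 4.
def prune_by_frequency(term_counts, min_freq=5):
    return {t: c for t, c in term_counts.items() if c >= min_freq}

counts = {"breakfast": 1636, "soba": 2, "welcome drink": 3, "coffee": 168}
kept = prune_by_frequency(counts)
assert "soba" not in kept and "breakfast" in kept
```

Feature weighting lets Bautext classify such terms correctly instead of cutting them off.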
weighted case. It is clear that such in-order, high-density results benefit the manual review process in real applications.

A. Top feature terms explain the result

Table 3 shows the top feature terms produced by Bautext. These feature terms are clearly specific to each category: for example, category food is characterized by {delicious, eat, have, drink, breakfast, dinner, etc.} and room by {large, narrow, tidy, clean, beautiful, noise, hear, etc.}. They must therefore have contributed strongly to extracting proper member terms.

We found one interesting result: the top-scored features can explain some incorrect classifications. For example, the Japanese word takai (高い), besides meaning expensive, also means tall or high. Therefore, although takai itself is specific to category price/cost, it also causes ambiguous terms such as level (in "high level") or evaluation (in "high evaluation") to be classified into this category.

B. Low-frequency terms in the top ranks

Table 4 shows properly classified terms with frequency no greater than four that appear in the top 50 of each category. Almost all of these terms were misclassified when the feature terms were not weighted. This result demonstrates the effectiveness of feature weighting in classifying low-frequency terms.

4.2.2 Performance and effectiveness of category trash

In this experiment, we examine the performance of Bautext and the trade-off between precision and recall (the trade-off caused by category trash).

Figure 4 shows the top-N-precision graphs of the seven target categories, each containing graphs for two cases: with and without category trash. The graph at the bottom right presents the result for trash alone. Note that the Y axes have different ranges, since some categories are more prolific than others; the number at the end of each line indicates the recall score. Except for price/cost and facility/amenity, the precision scores of the remaining categories improved when trash was included in the category setting. As with feature weighting, we observed a significant improvement in the density of proper terms in the top ranks of the prolific categories (food, room, furo, service). Most interestingly, while precision increased, recall was preserved (price/cost) or even slightly increased in almost every category; service was the only exception, with recall decreasing slightly by 0.03.

The results produced by Bautext are summarized in the 2nd, 3rd and 4th columns of table 5. Food, room, location, price/cost, service and furo(bath) achieved reasonable performance, with F1 scores around 0.4. The low performance of facility/amenity can be explained as follows: while the other target categories captured category-specific terms well, the top-scored feature terms of this category were disconnected. This might be caused by less suitable seed words or by the confusing setting of the categories.

4.2.3 Comparison to Basilisk

To compare with Basilisk, we first ran Basilisk with the common parameters used in the paper [7] (P = 20, N = 5) and let it run for 100 iterations. We conjectured that
Table 5. Empirical results (Re / Pr / F1)
Category | Bautext | Basilisk | Adaboost (seeds) | Adaboost (50%)
Food | 0.58 / 0.49 / 0.53 | 0.45 / 0.50 / 0.47 | 0.03 / 0.71 / 0.05 | 0.41 / 0.75 / 0.53
Room | 0.41 / 0.40 / 0.41 | 0.27 / 0.21 / 0.24 | 0.03 / 0.62 / 0.06 | 0.19 / 0.50 / 0.28
Furo(Bath) | 0.29 / 0.21 / 0.25 | 0.35 / 0.07 / 0.12 | 0 / 0 / 0 | 0.09 / 0.41 / 0.15
Service | 0.41 / 0.36 / 0.38 | 0.02 / 0.50 / 0.04 | 0.08 / 0.77 / 0.15 | 0.18 / 0.64 / 0.28
Location | 0.33 / 0.53 / 0.41 | 0.53 / 0.18 / 0.26 | 0.01 / 0.33 / 0.02 | 0.25 / 0.75 / 0.37
Facility/Amenity | 0.04 / 0.40 / 0.07 | 0.35 / 0.08 / 0.13 | 0 / 0 / 0 | 0.02 / 0.14 / 0.04
Price/Cost | 0.38 / 0.43 / 0.40 | 0.34 / 0.07 / 0.12 | 0 / 0 / 0 | 0.22 / 0.64 / 0.31
Trash | 0.25 / 0.68 / 0.36 | 0.27 / 0.53 / 0.36 | 0.23 / 0.51 / 0.32 | 0.66 / 0.63 / 0.64
Macro average | 0.34 / 0.44 / 0.35 | 0.32 / 0.27 / 0.22 | 0.05 / 0.37 / 0.08 | 0.25 / 0.56 / 0.33
Macro average without trash | 0.35 / 0.40 / 0.35 | 0.33 / 0.23 / 0.20 | 0.02 / 0.35 / 0.04 | 0.19 / 0.55 / 0.28
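The F1 and macro-average entries in Table 5 follow from the per-category recall/precision pairs; a quick check in Python (the `f1` helper and dictionary layout are ours, and small rounding differences against the published values are expected):

```python
# Recomputing Table 5's Bautext column from its recall/precision pairs.
def f1(recall, precision):
    # Harmonic mean of recall and precision; 0 when both are 0.
    return 0.0 if recall + precision == 0 else 2 * recall * precision / (recall + precision)

bautext = {"Food": (0.58, 0.49), "Room": (0.41, 0.40), "Furo(Bath)": (0.29, 0.21),
           "Service": (0.41, 0.36), "Location": (0.33, 0.53),
           "Facility/Amenity": (0.04, 0.40), "Price/Cost": (0.38, 0.43),
           "Trash": (0.25, 0.68)}

macro_f1 = sum(f1(r, p) for r, p in bautext.values()) / len(bautext)
assert abs(f1(0.58, 0.49) - 0.53) < 0.01   # Food F1 in Table 5
assert abs(macro_f1 - 0.35) < 0.01         # macro-average F1 in Table 5
```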
category trash would also improve the classification performance of Basilisk; we therefore ran Basilisk on the category setting including trash as well.

As shown in table 5, except for the category facility/amenity, Bautext outperforms Basilisk in recall, precision and F1 for all remaining categories. For facility/amenity, Basilisk delivers extremely low precision while Bautext, in contrast, delivers extremely low recall. Bautext substantially outperforms Basilisk for service (0.38 vs. 0.04) and price/cost (0.40 vs. 0.12), thanks to its balance of precision and recall.

Another aspect of concern is running time. To process 10,000 reviews, Bautext took about 20 minutes while Basilisk took about 160 minutes; Bautext is thus about 8 times faster than Basilisk while delivering better performance in almost every category. One can imagine a Bautext-human collaboration that takes advantage of this speed in practice: after every run of Bautext, a human reviews some top-n good terms and adds them to the seed words of each category, and this process can be repeated a number of times.

As described in section 2, bootstrapping techniques (including Basilisk) re-evaluate all patterns in a common pattern pool in order to select some good patterns at each iteration. However, this causes useless repeated re-evaluation of a large number of unrelated patterns. Bautext lessens these useless re-evaluations by creating a separate pattern pool (the feature set) for each category; the 8-times-faster running time demonstrates its effectiveness in doing so.

4.2.4 Comparison to Adaboost

We adopted AdaBoost.MH with decision-tree weak classifiers [12] for the comparison. Adaboost was run on the same eight categories as Bautext. Because context terms were used to represent a candidate term, we adopted binary bag-of-term feature vectors for Adaboost. Only context terms were used to create the feature vectors; note that context terms also include several nouns/NPs.

We first ran Adaboost using the seed words defined in table 1 as training data, and the result was very poor (shown in table 5).

In fact, this experiment aims at comparing Bautext with Adaboost to explore how much training data Adaboost requires to deliver similar performance. We repeated the experiment with randomly selected training data, tested on the remaining data, and computed the average F1 score over the eight categories. Experiments for the same amount of training data were repeated three times, and the final F1 was computed as the average of the three F1 scores. We started from 10% of the data as training data and increased the training fraction by 10% each time until the final F1 approached the F1 delivered by Bautext (F1 = 0.35). The experiment stopped when Adaboost used 50% of the data as training data, producing a final F1 = 0.33.

The detailed results are shown in table 5. Compared with Bautext, Adaboost generally delivers higher precision but lower recall, and thus lower F1. It is important to note that the number of test terms is halved in this case, so the low recall scores imply a low number of properly classified terms. Besides, trash is the one exception where Adaboost outperforms Bautext. If we consider the macro-average F1 of only the target categories, the F1 produced by Adaboost (50%) further decreases from 0.33 to 0.28 (the macro-average F1 of Bautext is 0.35 both with and without trash).

[Figure 4. Results of eight categories — top-N-precision graphs for Food, Room, Furo, Service, Location, Facility/Amenity, Price/Cost (with vs. without category trash) and for Trash alone; the number at the end of each line is that case's recall]

5 Conclusion

We have proposed Bautext, a new minimally supervised technique for learning member terms for multiple semantic categories, and applied it to the task of learning ratable aspects discussed in the Japanese review domain. Bautext features a unique classification mechanism that learns the member terms and the category-specific terms of each category at the same time. The experimental results demonstrated that 1) Bautext significantly outperforms Basilisk in both accuracy and speed; 2) feature weighting is effective in identifying category-specific features and in properly classifying low-frequency aspects, which are usually cut off by a frequency threshold in the pruning step of several prior works; 3) category trash collects non-purpose aspects and thus leads to a significant improvement in precision while constraining the trade-off of decreasing recall; and 4) Adaboost requires about 50% of randomly selected training data to deliver classification performance similar to what Bautext achieves with fewer than ten selected seed words per category. Since Bautext relies on syntactic dependency relations to extract aspect candidates, we believe our technique is in essence language-free and domain-independent. In future work, we intend to verify Bautext on other languages and other text domains.

Acknowledgements

This work was supported by the Rakuten Institute of Technology.

References

[1] E. Riloff and J. Shepherd, A Corpus-Based Bootstrapping Algorithm for Semi-Automated Semantic Lexicon Construction, Journal of Natural Language Engineering, 1999.

[2] W. Phillips and E. Riloff, Exploiting Strong Syntactic Heuristics and Co-Training to Learn Semantic Lexicons, In Proc. EMNLP, 2002.

[3] B. Roark and E. Charniak, Noun-phrase Cooccurrence Statistics for Semi-automatic Semantic Lexicon Construction, In Proc. ACL, 1998.

[4] R. Jones, A. McCallum, K. Nigam and E. Riloff, Bootstrapping for Text Learning Tasks, In Proc. Workshop on Text Mining Foundations, Techniques, and Applications, 1999.

[5] E. Riloff and R. Jones, Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping, In Proc. AAAI, 1999.

[6] J. R. Curran, T. Murphy and B. Scholz, Minimising Semantic Drift with Mutual Exclusion Bootstrapping, In Proc. PACLING, 2007.

[7] M. Thelen and E. Riloff, A Bootstrapping Method for Learning Semantic Lexicons using Extraction Pattern Contexts, In Proc. EMNLP, 2002.

[8] H. Avancini, A. Lavelli, B. Magnini, F. Sebastiani and R. Zanoli, Expanding Domain-specific Lexicons by Term Categorization, In Proc. 18th ACM Symposium on Applied Computing (SAC), 2003.

[9] M. Hu and B. Liu, Mining and Summarizing Customer Reviews, In Proc. ACM SIGKDD, 2004.

[10] M. Hu and B. Liu, Mining Opinion Features in Customer Reviews, In Proc. AAAI, 2004.

[11] A. Popescu and O. Etzioni, Extracting Product Features and Opinions from Reviews, In Proc. EMNLP, 2005.

[12] R. E. Schapire and Y. Singer, Improved Boosting Algorithms Using Confidence-rated Predictions, In Proc. of the Eleventh Annual Conference on Computational Learning Theory, 1999.