A New Minimally Supervised Learning Method for Semantic Term Classification

- Experimental Results on Classifying Ratable Aspects discussed in Customer Reviews -


Thao Pham Thanh Nguyen (University of Electro-Communications)
Takahiro Hayashi (Niigata University)
Rikio Onai (University of Electro-Communications)
Yuhei Nishioka, Takamasa Takenaka, Masaya Mori (Rakuten Institute of Technology)

Abstract

We present Bautext, a new minimally supervised approach for automatically extracting ratable aspects from customer reviews and classifying them into previously defined categories. Bautext requires only a small number of seed words as supervised data and uses a bootstrapping mechanism to progressively collect new members for each category. Its unique, distinguishing classification mechanism is that it learns the new members and the category-specific terms of each category at the same time. Category-specific terms are terms that play an important role in properly extracting new category members. Furthermore, we propose an additional Trash category to filter out non-purpose aspects; this leads to a significant improvement in precision while limiting the trade-off of decreased recall. Experimental results on a Japanese hotel review dataset show that Bautext outperforms the alternative techniques in precision and recall, and significantly in running time. In a further comparison with Adaboost (the state-of-the-art machine learning technique for the semantic term classification task), we found that Adaboost requires about 50% of the training data to deliver performance similar to what Bautext achieves with fewer than ten selected seed words per category.

1 Introduction

Online customer reviews are increasingly becoming a standard for measuring the quality of products or services. However, the large volume of reviews makes it difficult for readers to capture the meaningful information. As a result, opinion summarization, which aims at automatically summarizing a large volume of reviews into an easy-to-process display, has become an important research subject in the field of data mining.

Prior studies mainly concentrated on the product domain, e.g. digital electronics, and summarized reviews in an aspect-based (also called feature-based) style, as in the following illustration [9, 10, 11].

• Camera 1
  – Aspect 1: picture quality
    ∗ Positive: 253
      (sentences containing detailed opinions)
    ∗ Negative: 12
      (sentences containing detailed opinions)
  – Aspect 2: size

Ratable aspects 1 are usually listed in decreasing order of appearance frequency. Each is accompanied by a sentiment (positive or negative) and by sentences containing detailed opinions that explain why the sentiment was assigned. However, we think that this style of summarization is not effective for the service domain, e.g. hotel or restaurant reviews, because ratable aspects in hotel/restaurant reviews are much more varied than those in product reviews. It is therefore preferable that aspects be organized into topics: for example, "sushi", "soup" and "steak" should be grouped into a food topic rather than listed as three distinct aspects.

1 Ratable aspects are entities on which reviewers have commented.

Considering each topic as a category, we call this style category-based summarization. Category-based summarization is effective for opinion summarization because 1) organizing aspects into categories helps readers browse and easily find the aspects they are interested in, and 2) such categories can be recognized as standard evaluation criteria for the target
products or services; for example, room, location, service, etc. are popular evaluation criteria for hotel reviews.

The most studied problem in the opinion mining domain is the task of identifying the sentiment expressed about each of the aspects. In this paper, however, we do not consider that task but focus only on extracting ratable aspects from reviews and classifying them into previously defined categories. Machine learning techniques are reasonable solutions for this classification task, but they require significant human effort in labeling training data. Considering machine learning techniques as fully supervised classification techniques, we conversely aim to propose a minimally supervised classification technique.

Starting from a small amount of supervised data (called seed words), we use a bootstrapping mechanism to automatically collect new category members based on a set of category-specific terms, called feature terms. Feature terms play an important role in properly extracting new category members in every bootstrapping iteration. Our technique is distinguished by its unique classification mechanism, which automatically and mutually extends both the set of category members and the set of feature terms during the classification process. That is why we name it Bautext (Both AUtomaTic EXTension).

Experimental results show that Bautext greatly outperforms alternative bootstrapping techniques in precision, recall and running time. In a further comparison with Adaboost (the state-of-the-art machine learning technique for the semantic term classification task [8]), we found that Adaboost requires about 50% of the training data to deliver performance similar to what Bautext achieves with fewer than ten selected seed words per category.

The rest of the paper is structured as follows. We summarize some bootstrapping techniques in section 2 and describe Bautext in detail in section 3. Experimental results are in section 4 and the conclusion is in section 5.

2 Related works

Though this is the first time a minimally supervised technique has been applied to review data, in the NLP 2 community there exist some studies closely related to Bautext in terms of bootstrapping, corpus-based processing and the use of a small amount of supervised data.

2 NLP: Natural Language Processing

Riloff et al. [1, 2] and Roark et al. [3] pioneered the bootstrapping technique and simply used noun-noun co-occurrence to extract candidate terms for categories. Some top-scoring candidates are then chosen based on a TFIDF score. Though these techniques could capture some correct member words, the co-occurrence relation between nouns does not accurately indicate correct category membership; thus, these techniques deliver a very low recall score.

Jones et al. [4, 5] improved the idea by using extraction patterns, instead of noun-noun co-occurrence, to extract candidate terms. A multi-level bootstrapping mechanism was proposed to select some top-scoring patterns in each iteration in order to capture new terms for the categories. The main advantage of the mechanism comes from the re-evaluation of the extraction patterns after every bootstrapping pass. Unfortunately, adding a term with multiple senses can quickly introduce errors. This is the basic drawback of bootstrapping techniques, also called semantic drift.

To combat semantic drift, Curran et al. [6] introduced the Mutual Exclusion Bootstrapping algorithm. The algorithm excludes both generic extraction patterns and candidate terms that occur frequently with multiple categories, which reduces the chance of assigning ambiguous terms to their less dominant sense. However, in a small-scale domain, where the pattern space is relatively limited, this principle is not effective.

To reduce semantic drift, Thelen et al. [7] proposed Basilisk, which explores the idea of bootstrapping multiple semantic categories simultaneously. The idea aims to constrain the ability of a category to overstep its bounds. With the success of this idea, Basilisk was considered the best-performing bootstrapping method for the task at the time.

3 Bautext

3.1 Overview of Bautext

Bautext was motivated by the observation that each category can derive its new members from some characteristic terms which constrain the meaning of the category. For example, words modified by "delicious" or "tasty" should have a high probability of being members of the category food. We call these characteristic terms category-specific terms or feature terms.

One category in our technique consists of two subsets: the member set, containing category member terms, and the feature set, containing feature terms. For each category, Bautext initializes the member set with a small number of manually defined seed words, and the feature set with the contexts of those seed words. In each iteration of the classification process, Bautext classifies one candidate ratable-aspect term into the category where the candidate maximizes the mutual information 3 between the member set and the feature set of that category. When the candidate term is added to a category, it may help extend the feature set by contributing some new terms from its context. Overall, Bautext progressively extends both the member set and the feature set of every category during the classification process.

3 Mutual information is a measure of the amount of information between two sets of terms.
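The two-set category structure and the mutual-extension loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: a simple feature-overlap count stands in for the weighted mutual-information score defined in section 3.4, and the seed data mirror the paper's food/room running example.

```python
# Sketch of Bautext's mutual extension of member and feature sets.
# The real membership score is the weighted MI of section 3.4; here a
# feature-overlap count is used as a stand-in for readability.

def bootstrap(candidates, contexts, categories):
    """candidates: list of nouns/NPs; contexts: noun -> set of context terms;
    categories: name -> {"members": set, "features": set} (mutated in place)."""
    for cand in candidates:
        # Score the candidate against every category's feature set.
        scores = {name: len(contexts.get(cand, set()) & cat["features"])
                  for name, cat in categories.items()}
        best = max(scores, key=scores.get)
        # Extend both the member set and the feature set of the winner.
        categories[best]["members"].add(cand)
        categories[best]["features"] |= contexts.get(cand, set())
    return categories

# Seed setup mirroring the paper's running example (figure 2).
cats = {
    "food": {"members": {"dinner", "coffee"},
             "features": {"delicious", "have", "drink", "good"}},
    "room": {"members": {"room", "bed"},
             "features": {"large", "clean", "tidy", "good", "narrow"}},
}
ctx = {"tea": {"drink", "good", "fragrant"}}
result = bootstrap(["tea"], ctx, cats)
```

With these toy counts, "tea" overlaps the food feature set more than the room one, so it joins food and contributes "fragrant" as a new feature term, just as in the paper's illustration.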
Because nouns/NPs 4 are more relevant as ratable aspects than other syntactic types, in our experiments so far we have considered only nouns/NPs as aspect candidates. Syntactic dependency relations at the sentence level were adopted to extract the contexts of nouns/NPs. Although many true dependencies can be captured by dependency parsing over the sentence structure, it has limited ability to extract longer-distance dependencies. In particular, only verbs/verb phrases, adjectives/adjective phrases and nouns/NPs were used as contexts. Figure 1 gives an example in which the arrows, underlined words and italic words indicate the dependency relations, nouns/NPs and context terms respectively. Each noun/NP and its context term are linked together as a dependency pair, shown below the sentence. The numbers are the appearance frequencies of the corresponding pairs.

4 NP: noun phrase

Windows and balcony are very large and have terrific ocean view.

(window, balcony): 1
(balcony, window): 1
(balcony, large): 1
(balcony, have): 1
(ocean view, terrific): 1

Figure 1. Example of dependency pairs

3.2 Algorithm

Here we describe the algorithm of Bautext, with an illustration in figure 2. The example in figure 2 shows the case of collecting terms for two categories, food and room. Terms on the left side are category members and terms on the right side are category feature terms. Edges denote the dependency relations.

Step 0 : Noun/NP and context extraction

• The dataset is split into sentences. After lemmatizing all words and tokenizing all sentences, all dependency pairs are extracted from the sentences. Let N = {n1, n2, . . .} and C = {c1, c2, . . .} denote the sets of nouns/NPs and context terms. The context of ni ∈ N is then managed in Cni = {ci ∈ C | (ni, ci) is a dependency pair}, e.g. Ccoffee = {delicious, have, drink} in figure 2.

Step 1 : Seed word selection

• Let Si = {n1, n2, . . . , nl} denote the set of seed words for category i. Seed words for each category are manually defined, e.g. Sfood = {dinner, coffee} and Sroom = {room, bed} in figure 2a.

Step 2 : Category initialization

• Let Nk and Fk denote the member set and feature set of category k. The member set of each category is initialized with the corresponding seed words (Nk = Sk), and the feature set is initialized with the context terms of the seed words (Fk = {ci ∈ Cn | ∀n ∈ Nk}), e.g. Nfood = {dinner, coffee} and Ffood = {delicious, have, drink, good} in figure 2a.

• Feature terms are then weighted depending on the category they belong to, e.g. "good" has different weights in different categories in figure 2a.

Step 3 : Classification process

• Obtain an unclassified noun/NP ni ∈ N and compute the membership score of ni against each available category, e.g. the membership scores of "tea" against food and room are 0.8 and 0.1 respectively.

• Let k denote the category where ni earns the biggest membership score; ni is then classified to k, e.g. "tea" is classified to food.

• If there exist context terms of ni not yet included in Fk, these terms are added to Fk (F'k = Fk ∪ {ci ∈ Cni | ci ∉ Fk}), e.g. "fragrant" is added to Ffood in figure 2c.

• Then the weights of the feature terms in Fk which have a connection to ni are updated, e.g. "drink", "good" and "fragrant" are updated in figure 2c.

Step 4 : If there still exists an unclassified noun/NP, return to Step 3; otherwise stop.

3.3 Weight score for feature terms

We adopted the MI 5 metric to compute the weight score of a feature term in a category. The weight score expresses the strength of the connection between the term and the category. For example, in figure 2a, because "delicious" is more specific to category food than "good" is, it gets a much higher weight. Let Nk = {n1, n2, n3, ..., nl} denote the current member set of category k; the weight score of feature term fi ∈ Fk in k is then computed as follows.

5 MI: mutual information

coefk(fi) = (1 / Ficat) Σ_{nj ∈ Nk} P(fi, nj) log( P(fi, nj) / (P(fi) P(nj)) )

P(fi, nj) = Fr(fi, nj) / Fpair,  P(fi) = Fr(fi) / Fword,  P(nj) = Fr(nj) / Fword

Fr(w): frequency of term w.
Fr(fi, nj): frequency of the dependency pair (nj, fi).
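The weight score above can be sketched in code as follows. This is an illustrative reading of the formula, not the authors' implementation; the totals Fpair and Fword and the per-feature category count Ficat (whose definitions continue after figure 2) are passed in as plain numbers, and all counts in the usage below are made up.

```python
import math

# Sketch of the feature-weight score coef_k(f_i) from section 3.3
# (illustrative only; variable names are ours, not the authors').
def coef(feature, members, pair_freq, term_freq, f_pair, f_word, n_cat):
    """pair_freq: (noun, feature) -> Fr(f_i, n_j); term_freq: term -> Fr(w);
    f_pair: total pair frequency F_pair; f_word: total term frequency F_word;
    n_cat: F_icat, the number of categories containing `feature`."""
    p_f = term_freq[feature] / f_word            # P(f_i)
    total = 0.0
    for n in members:                            # sum over n_j in N_k
        fr = pair_freq.get((n, feature), 0)
        if fr == 0:
            continue                             # unseen pair contributes nothing
        p_fn = fr / f_pair                       # P(f_i, n_j)
        p_n = term_freq[n] / f_word              # P(n_j)
        total += p_fn * math.log(p_fn / (p_f * p_n))
    return total / n_cat

# Made-up counts: "delicious" depends on the single member "dinner".
w = coef("delicious", {"dinner"},
         {("dinner", "delicious"): 2},
         {"delicious": 2, "dinner": 4},
         f_pair=10, f_word=100, n_cat=1)
```

The 1/Ficat factor damps generic features shared by several categories, matching the discussion of "good" in figure 2.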
[Figure 2 omitted. It illustrates three stages of the algorithm for the categories food and room: (a) initial feature weights, e.g. Ffood = {delicious (0.09), have (0.08), drink (0.07), good (0.01)} and Froom = {large (0.08), clean (0.08), tidy (0.07), good (0.05), narrow (0.05)}; (b) scoring the candidate "tea" via its contexts fragrant, drink and good, giving Asfood(tea) = 0.8 and Asroom(tea) = 0.1; (c) after "tea" is classified to food, "fragrant" is added to Ffood and the weights of drink, good and fragrant are updated.]

Figure 2. Algorithm illustration

Fword: total frequency of all terms.
Fpair: total frequency of all dependency pairs.
Ficat: number of categories which contain fi as a feature term.

Because generic terms (e.g. "good" in figure 2) can be feature terms of multiple categories, Ficat was added to decrease their impact in identifying new members for category k.

3.4 Membership score for nouns/NPs

Let Fk = {f1, f2, f3, ..., fh} denote the current feature set of category k. The membership score of a candidate term ni against category k is computed as the weighted mutual information between ni and Fk.

Ask(ni) = Σ_{fj ∈ Fk} coefk(fj) P(fj, ni) log( P(fj, ni) / (P(fj) P(ni)) )

It is worth noting that the mutual information between the member set and the feature set of a category is equal to the sum of the membership scores of all member terms. The membership score therefore expresses the contribution a term makes toward making the category more informative.

3.5 Category trash

Besides the target categories, we define an additional category trash in order to collect non-purpose aspects. Non-purpose aspects are ratable aspects which are not appropriate to classify into any target category. Like the target categories, trash is manually given seed words, and both its member set and its feature set are extended during the classification process. We expect trash to constrain the territory of the target categories during the classification process. As a result, trash can help improve the precision of the target categories; however, a trade-off of decreased recall is expected.

4 Experiments

4.1 Experiment settings

We applied Bautext to a Japanese dataset of 10,000 hotel reviews. This dataset consists of 40,007 sentences and was crawled from the review site "Rakuten Travel" 6. The target categories are food, room, furo(bath), service, location, and facilities and amenities, which are set up on the site as its six evaluation criteria. In addition, we added the category price/cost to confirm whether Bautext can deal with relatively narrow-meaning categories. It is worth noting that the classification result is sensitive to the category setting. The category setting here is confusing because category service could be thought of as including facilities and amenities, or even room and furo(bath); however, this is an unavoidable practical problem. Furthermore, in order to increase the performance on the above target categories, we set up a filtering category trash, as described in section 3.5.

6 bbs top.html

Table 1 shows the seed words selected for the eight categories, all translated into English. The numbers in brackets are the frequencies of the corresponding words. As with the category setting above, classification performance is sensitive to the set of seed words. Since high-frequency nouns/NPs provide broader coverage of context terms, our policy is to select seed words among the most frequent nouns/NPs. The number of seeds for each target category is fewer than ten, but we let trash have 30 seeds. Before running the experiments, we had a group of (non-expert) people label every noun/NP in the dataset. A summary of the manually labeled data is shown in Table 2.

As described in section 2, Basilisk is the best-performing bootstrapping technique so far, so we used Basilisk as the baseline for comparison with Bautext. Besides, machine learning techniques are popular solutions for classification tasks, and it is of interest to explore how much training data such a technique requires to deliver performance similar to what Bautext achieves with a small amount of supervised data. Since Adaboost was reported to be the state-of-the-art machine learning technique for the task of semantic term classification [8], we chose Adaboost as the representative
for this comparison. In particular, we performed four experiments. The first two verify the classification performance and the effectiveness of Bautext; the remaining experiments compare Bautext with Basilisk and with Adaboost. Note that in the experiments we consider only dependency pairs with frequency greater than two in the classification process; thus the smallest frequency for every aspect candidate is two.

Table 1. Seed words (translated to English)
Food : breakfast(1636), meal(1575), food(745), dinner(499), bread(375), restaurant(197), taste(193), coffee(168)
Room : room(6127), noise(439), bed(337), toilet(268), smell(233), window(186), refrigerator(177), shower(172)
Furo(Bath) : Furo(1932), hot spring(521), bath(483), open-air bath(246), hot water(214)
Service : support(2055), service(1270), reception(913), staff(379), employee(362), smiling face(248)
Location : station(682), location(686), place(459), convenience store(261), supermarket(32)
Facility/Amenity : facility(497), parking place(390), amenity(243), air purifier(63), humidifier(52)
Price/Cost : price(794), cost(569), charge(394), cost-
Trash (a part) : hotel(1846), spirit(1217), sense(767), chance(743), person(568), care(532), accommodation(519), lodging(519), feeling(380), other(352), children(307), ...

Table 2. Manually labeled data
Category           Total   Percentage
Food               185     9.2%
Room               267     13.3%
Furo(Bath)         93      4.6%
Service            208     10.3%
Location           117     5.8%
Facility/Amenity   107     5.3%
Price/Cost         53      2.6%
Other              986     48.9%

4.2 Experimental results

4.2.1 Effectiveness of feature weighting

This experiment verifies the effectiveness of feature weighting against three criteria: first, whether feature weighting improves classification performance compared with the case where feature terms are not weighted; second, whether category-specific feature terms are assigned higher scores than other generic terms; and third, whether low-frequency candidate terms can be properly classified.

[Figure 3 omitted: top-N-precision graphs for food and service (left) and their top-100 term lists in the weighted and unweighted settings (right).]

Figure 3. Effectiveness of feature weighting

Figure 3 shows the results for two categories: food, as the most narrow-meaning, and service, as the most broad-meaning category. On the left, the upper and lower graphs are the top-N-precision graphs of food and service respectively. For a point (x, y) on a top-N-precision graph, y is the precision score computed only over the top x terms (terms are sorted in decreasing order of membership score). A number such as 110/248 indicates that there are 110 proper terms 7 among the 248 collected terms of category food. For a more visual display, we present the case of N = 100 for the two categories on the right side, where terms marked in gray are the proper terms. The 1st and 2nd columns show the top 100 terms of food: the left column is the result when feature terms were weighted, the right when they were not. In the same way, the 3rd and 4th columns show the weighted and unweighted cases for category service.

7 A proper term is a term that was classified properly.

We observed that feature weighting improved the performance of both categories. For category food, the overall precision score increased from 0.2 to 0.45 and the recall score slightly increased from 104/185 to 110/185, while the density of proper terms in the top ranks was significantly improved. Category service showed an interesting change: feature weighting not only raised precision and recall by increasing this density, but also collected many more new member terms in the weighted case.
It is clear that such in-order, high-density results benefit a manual review process in a real application.

Table 3. Top feature terms
Food : delicious, breakfast, eat, have, service, smorgasbord, dinner, good, food, bread, served, satisfied, drink, coffee
Room : room, large, tidy, see, beautiful, nice, furo, narrow, big, hear, use, view, enter, stinking, twin, noise, cleanness
Furo(Bath) : enter, large, good, relaxed, room, toilet, hot, cold, leave, weak, shower, go, furo, warm, food, narrow
Service : support, good, front desk, employee, staff, careful, kind, everyone, support, rent, correspond, smiling face, satisfied
Location : near, good, convenient, far, walk, station, separated, subway, room, hotel, think, shopping center, face
Facility/Amenity : enhanced, prepared, need, old, room, far, large, complete, new, beautiful, service, be separated, narrow
Price/Cost : cheap, takai, reasonable, discount, good, excellent, room, great, vending machine, satisfied, cheap, satisfied
Trash : good, go, stay, several, say, bad, hotel, other, price, wait, use, pleased, go, room, less, be able to, old

Table 4. Low-frequency terms in the top 50
Food : rice(3), set meal(2), beef sashimi(3), kamameshi(2), dinner(2), soba(2), fish dish(2), soup(4), welcome drink(3), plum wine(2), free breakfast(4), coffee(4), variety(2), salt(2)
Furo(Bath) : crack(4), low temperature(2), bowl(4), ice(4), basket(2), bag(4), bath for family(2), toilet(4)
Service : hotel-man(3), front desk(3), employee(4), gentlemen(2), front desk man(2), official(2), hotel employee(2), manual(4), receipt(4), rental car(4), modem(2), extension cord(2)
Location : access to work(3), Oodori park(3), bar city(2), city hall(2), bar(2), curb market(2), mountain(4), walk(3), town areas(4), shopping(2), nearest station(2), access(2)
Price/Cost : lodging fee(3), parking fee(2), total(4), cost of a meal(2), greatly-cheap price(2)

A. Top feature terms explain the result

Table 3 shows the top feature terms produced by Bautext. These feature terms are clearly specific to each category; for example, category food is characterized by {delicious, eat, have, drink, breakfast, dinner, etc.} and room by {large, narrow, tidy, clean, beautiful, noise, hear, etc.}. They must therefore have contributed strongly to properly extracting member terms.

One interesting finding is that the top-scored features can explain some incorrect classifications. For example, the Japanese word takai means expensive but also tall or high. Therefore, though takai itself is specific to category price/cost, it also causes some ambiguous terms like level (in "high level") or evaluation (in "high evaluation") to be classified to this category.

B. Low-frequency terms in the top ranks

Table 4 shows properly classified terms with frequency less than four which appear in the top 50 of each category. Almost all of these terms were misclassified when the feature terms were not weighted. This result demonstrates the effectiveness of feature weighting in classifying low-frequency terms.

4.2.2 Performance and effectiveness of category trash

In this experiment, we examine the performance of Bautext and the trade-off between precision and recall caused by category trash.

Figure 4 shows the top-N-precision graphs of the seven target categories, each containing graphs for two cases: with and without category trash. The graph at the bottom right presents only the result of trash. Note that the Y axes have different ranges, since some categories are more prolific than others. Numbers at the end of each line indicate the recall scores. Except for price/cost and facility/amenity, the precision scores of the other categories improved when trash was in the category setting. As with feature weighting, we observed a significant improvement in the density of proper terms in the top ranks of the prolific categories (food, room, furo, service). Most interestingly, while precision scores increased, the recall scores were preserved (in the case of price/cost) or even slightly increased in almost every category. Service was the only exception, with its recall score slightly decreased by 0.03.

The results produced by Bautext are summarized in the 2nd, 3rd and 4th columns of table 5. Food, room, location, price/cost, service and furo(bath) had reasonable performance, with F1 scores around 0.4. The low performance of facility/amenity can be explained as follows: while the other target categories captured category-specific terms well, the top-scored feature terms of this category were disconnected. This might be caused by not very good seed words, or by the confusing category setting.

4.2.3 Comparison to Basilisk

To compare with Basilisk, we first ran Basilisk with the common parameters used in the paper [7] (P=20, N=5) and let Basilisk run for 100 iterations. We conjectured that the
Table 5. Empirical results
Bautext Basilisk Adaboost (seeds) Adaboost (50%)
Re Pr F1 Re Pr F1 Re Pr F1 Re Pr F1
Food 0.58 0.49 0.53 0.45 0.50 0.47 0.03 0.71 0.05 0.41 0.75 0.53
Room 0.41 0.40 0.41 0.27 0.21 0.24 0.03 0.62 0.06 0.19 0.50 0.28
Furo(Bath) 0.29 0.21 0.25 0.35 0.07 0.12 0 0 0 0.09 0.41 0.15
Service 0.41 0.36 0.38 0.02 0.50 0.04 0.08 0.77 0.15 0.18 0.64 0.28
Location 0.33 0.53 0.41 0.53 0.18 0.26 0.01 0.33 0.02 0.25 0.75 0.37
Facility/Amenity 0.04 0.40 0.07 0.35 0.08 0.13 0 0 0 0.02 0.14 0.04
Price/Cost 0.38 0.43 0.40 0.34 0.07 0.12 0 0 0 0.22 0.64 0.31
Trash 0.25 0.68 0.36 0.27 0.53 0.36 0.23 0.51 0.32 0.66 0.63 0.64
Macro Average 0.34 0.44 0.35 0.32 0.27 0.22 0.05 0.37 0.08 0.25 0.56 0.33
Macro Average without trash 0.35 0.40 0.35 0.33 0.23 0.20 0.02 0.35 0.04 0.19 0.55 0.28
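As a quick sanity check on Table 5: the macro-average rows are unweighted means over the per-category scores. The following sketch (with the Bautext F1 column transcribed from the table) reproduces the reported 0.35 figure; it is an illustration of how the rows are computed, not the authors' evaluation code.

```python
# Bautext per-category F1 scores, transcribed from Table 5.
bautext_f1 = {
    "Food": 0.53, "Room": 0.41, "Furo(Bath)": 0.25, "Service": 0.38,
    "Location": 0.41, "Facility/Amenity": 0.07, "Price/Cost": 0.40,
    "Trash": 0.36,
}

def macro_average(scores):
    """Unweighted mean of per-category scores (the 'Macro Average' rows)."""
    return sum(scores.values()) / len(scores)

with_trash = round(macro_average(bautext_f1), 2)        # 0.35
no_trash = {k: v for k, v in bautext_f1.items() if k != "Trash"}
without_trash = round(macro_average(no_trash), 2)       # 0.35
```

Both values come out to 0.35, matching the table's "Macro Average" and "Macro Average without trash" rows for Bautext.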

category trash would also improve the classification performance of Basilisk, so we ran Basilisk on the category setting including trash.

As shown in table 5, except for the category facility/amenity, Bautext outperforms Basilisk in recall, precision and F1 for all remaining categories. For facility/amenity, Basilisk delivers extremely low precision while Bautext, in contrast, delivers extremely low recall. Conversely, Bautext substantially outperforms Basilisk for service (0.38 vs. 0.04) and price/cost (0.40 vs. 0.12) thanks to its balance of precision and recall.

Another aspect of concern is running time. To process the 10,000 reviews, Bautext took about 20 minutes while Basilisk took about 160 minutes. Thus, Bautext is about 8 times faster than Basilisk while delivering better performance in almost every category. This fast mechanism enables a practical Bautext-human collaboration: after every run of Bautext, a human reviews some top n good terms and adds them to the seed words of each category, and this process can be repeated a number of times.

As described in section 2, bootstrapping techniques (including Basilisk) benefit from re-evaluating all patterns in a common pattern pool in order to select some good patterns in each iteration. However, this idea causes useless repeated re-evaluation of a large number of unrelated patterns. Bautext lessens these useless re-evaluations by creating a separate pattern pool (the feature set) for each category. The 8-times-faster running time demonstrates the effectiveness of Bautext in reducing such useless re-evaluations.

4.2.4 Comparison to Adaboost

We adopted AdaBoost.MH with decision-tree weak learners [12] for the comparison. Adaboost was run on the same eight categories as Bautext. Because context terms were used to represent a candidate term, we adopted binary bag-of-terms feature vectors for Adaboost. Only context terms were used to create the feature vectors; note that the context terms also include several nouns/NPs.

We first ran Adaboost using the seed words defined in table 1 as training data, and the result was very poor (shown in table 5).

In fact, this experiment aims at comparing Bautext with Adaboost to explore how much training data Adaboost requires to deliver a similar performance. We repeated the experiment with some randomly selected training data, then tested on the remaining data and computed the average F1 score over the eight categories. Experiments with the same amount of training data were repeated three times and the final F1 was computed as the average of the three F1 scores. We started from 10% of the training data, then increased the percentage by 10% each time until the final F1 score approached the F1 score delivered by Bautext (F1 = 0.35). The experiment stopped when Adaboost used 50% of the training data to produce a final F1 = 0.33.

The detailed result is shown in table 5. Compared to Bautext, Adaboost in general delivers higher precision scores but lower recall scores, thus leading to lower F1 scores. It is important to mention that the number of testing terms is halved in this case, so the low recall scores imply a low number of properly classified terms. Besides, we can see that trash is the exception where Adaboost outperforms Bautext. If we consider the macro-average F1 score of only the target categories, the F1 score produced by Adaboost (50%) further decreases from 0.33 to 0.28 (the macro-average F1 of Bautext is 0.35 in both cases, with and without trash).

[Figure 4 omitted: top-N-precision graphs of the eight categories (Food, Room, Furo, Service, Location, Facility/Amenity, Price/Cost and Trash), each comparing the settings with and without category trash; the number at the end of each line is the recall score.]

Figure 4. Results of eight categories

5 Conclusion

We have proposed Bautext, a new minimally supervised technique for learning member terms for multiple semantic categories. Bautext has been applied to the task of learning ratable aspects discussed in the Japanese review domain. Bautext is distinguished by its unique classification mechanism, which learns the member terms and the category-specific terms of each category at the same time. Experimental results demonstrated that 1) Bautext significantly outperforms Basilisk in both accuracy and speed; 2) feature weighting is effective in identifying category-specific features and in properly classifying low-frequency aspects, which are usually cut off by a frequency threshold in the pruning processes of several prior works; 3) category trash collects non-purpose aspects, leading to a significant improvement in precision scores while limiting the trade-off of decreased recall; and 4) Adaboost requires about 50% of randomly selected training data to deliver classification performance similar to what Bautext achieves with fewer than ten selected seed words per category. Since Bautext relies on syntactic dependency relations to extract aspect candidates, we believe our technique is in essence language- and domain-independent. In future work, we intend to verify Bautext on other languages and other text domains.

Acknowledgements

This work was supported by the Rakuten Institute of Technology.

References

[1] E. Riloff and J. Shepherd, A Corpus-Based Bootstrapping Algorithm for Semi-Automated Semantic Lexicon Construction, Journal of Natural Language Engineering, 1999.
[2] W. Phillips and E. Riloff, Exploiting Strong Syntactic Heuristics and Co-Training to Learn Semantic Lexicons, In Proc. EMNLP, 2002.
[3] B. Roark and E. Charniak, Noun-phrase Co-occurrence Statistics for Semi-automatic Semantic Lexicon Construction, In Proc. ACL, 1998.
[4] R. Jones, A. McCallum, K. Nigam and E. Riloff, Bootstrapping for Text Learning Tasks, In Proc. Workshop on Text Mining Foundations, Techniques, and Applications, 1999.
[5] E. Riloff and R. Jones, Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping, In Proc. AAAI, 1999.
[6] J. R. Curran, T. Murphy and B. Scholz, Minimising Semantic Drift with Mutual Exclusion Bootstrapping, In Proc. PACLING, 2007.
[7] M. Thelen and E. Riloff, A Bootstrapping Method for Learning Semantic Lexicons using Extraction Pattern Contexts, In Proc. EMNLP, 2002.
[8] H. Avancini, A. Lavelli, B. Magnini, F. Sebastiani and R. Zanoli, Expanding Domain-specific Lexicons by Term Categorization, In Proc. 18th ACM Symposium on Applied Computing (SAC), 2003.
[9] M. Hu and B. Liu, Mining and Summarizing Customer Reviews, In Proc. ACM SIGKDD, 2004.
[10] M. Hu and B. Liu, Mining Opinion Features in Customer Reviews, In Proc. AAAI, 2004.
[11] A. Popescu and O. Etzioni, Extracting Product Features and Opinions from Reviews, In Proc. EMNLP.
[12] R. E. Schapire and Y. Singer, Improved Boosting Algorithms Using Confidence-rated Predictions, In Proc. of the Eleventh Annual Conference on Computational Learning Theory, 1999.