http://eprints.qut.edu.au/
Lu, Chengye, Xu, Yue, and Geva, Shlomo (2007) A Bottom-Up Term Extraction Approach for Web-Based Translation in Chinese-English IR Systems. In Spink, Amanda, Turpin, Andrew, and Wu, Mingfang, Eds., Proceedings of the Twelfth Australasian Document Computing Symposium (ADCS2007), pp. 64-71, Melbourne, Victoria.
MI(x, y) = log2 [ p(x, y) / (p(x) p(y)) ] = log2 [ N f(x, y) / (f(x) f(y)) ]   (1)

The mutual information measurement quantifies the distance between the joint distribution of terms X and Y and the product of their marginal distributions [1]. When using mutual information in Chinese segmentation, x and y are two Chinese characters; f(x), f(y), and f(x, y) are the frequencies with which x appears, y appears, and x and y appear together, respectively; N is the size of the corpus. A string XY is judged to be a term if its MI value is greater than a predefined threshold.

S is the input string, w1..wi is a substring of S, and LC() and RC() are functions that calculate the number of unique left (right) adjacent characters of S. A string is judged to be a Chinese term if its SCPCD value is greater than or equal to the SCPCD values of all the substrings of S.

3. PROPOSED APPROACH
The term extraction approaches listed above have been used on large corpora. However, in our experiments, the performance of those approaches is not always satisfactory in search-engine-based
OOV term translation approaches. As described in the introduction, web search engine result pages are used for search-engine-based OOV term translation. In most cases, only a few hundred of the top results from the result pages are used for translation extraction. As a result, the corpus size for search-engine-based approaches is quite small. In a small collection, the frequencies of strings are very often too low to be used in the approaches reviewed in Section 2. Moreover, the search engine results are usually parts of sentences, which makes traditional Chinese word segmentation hard to apply in this situation. That is why many researchers [4; 5; 7; 9; 13] try to apply statistics-based approaches to search-engine-based translation for term extraction.

In this section, we describe a term extraction approach specifically designed for search-engine-based translation extraction. It uses term frequency change as an indicator to determine term boundaries, and it uses the similarity comparison between individual character frequencies instead of term frequencies to reduce the impact of low term frequency in small collections. Together with the term extraction approach, we also describe a bottom-up term extraction strategy that can help to increase the extraction quality.

3.1 Frequency Change Measurement
The approaches mentioned in Section 2 use a top-down strategy that starts by examining the whole sentence and then examines substrings of the sentence to extract MLUs, until the substring becomes empty. We propose instead a bottom-up strategy that starts by examining the first character and then examines successively longer superstrings. Our approach is based on the following observations for small document collections:

Observation 1: In a small collection of Chinese text, such as a collection of search engine result pages, the frequencies of the characters in an MLU are similar. This is because in a small collection of text there are only a small number of MLUs, and the characters appearing in one MLU may not appear in other MLUs. On the other hand, MLUs with similar meanings will sometimes share similar characters, and those characters are unlikely to be used in other, unrelated MLUs. For example, 戰機 (Fighter Aircraft) and 戰鬥機 have the same meaning in Chinese, and they share similar Chinese characters. Therefore, although such a term's frequency is low, the individual characters of the term might still have relatively high and also similar frequencies. The high character frequency can help in term extraction.

Observation 2: When a correct Chinese term is extended with an additional character, the frequency of the new term very often drops significantly.

According to Observation 1, the frequencies of a term and of each character in the term should be similar. We propose to use the root mean square error (RMS error) given in Equation (5) to measure the similarity between the character frequencies:

σ = sqrt( (1/n) Σ_{i=1}^{n} (x_i − x̄)² )   (5)

For a given Chinese character sequence, x_i is the frequency of each character in the sequence, and x̄ is the average frequency of all the characters in the sequence. Although the frequency of a string is low in a small corpus, the frequencies of the Chinese characters still have relatively high values. According to Observation 1, if a given sequence is an MLU, the characters in the sequence should have similar frequencies; in other words, σ should be small. If the frequencies of all the characters in a Chinese sequence are equal, then σ = 0. Because σ represents the average frequency error of the individual characters in the sequence, according to Observation 1, the longer a substring of an MLU is, the smaller its average frequency error will be.

According to Observation 1, an MLU can thus be identified by Equation 5. However, as Equation 5 only measures the frequency similarity between individual characters, any character combination may be identified as an MLU if the character frequencies are similar, even when the characters do not occur together. To avoid this problem, we introduce the sequence frequency f(S) into the formula. If the characters do not occur together, they will not be considered a sequence, and therefore f(S) = 0. Thus a character combination can only be identified if the characters appear together as a sequence in the corpus.

Finally, we combine the sequence frequency and the RMSE measurement. We designed the following equation to measure the possibility of S being a term:

R(S) = f(S) / (σ + 1) = f(S) / ( sqrt( (1/n) Σ_{i=1}^{n} (x_i − x̄)² ) + 1 )   (6)

where S is a Chinese sequence and f(S) is the frequency of S in the corpus. We use σ + 1 as the denominator instead of σ to avoid zero denominators.

Let S be a Chinese sequence with n characters, S = a1a2...an, and let S' be a substring of S with length n−1, S' = a1a2...an−1. According to Observation 1, we should have:

- If S is an MLU, then f(S) ≈ f(S').
- If S is an MLU, then the longer S is, the smaller its average mean square error is.

Therefore, in the case where S' is a substring of an MLU S with length n−1, we would have σ < σ', and as a result R(S) > R(S'). In the other case, where S' is a substring of S and S' is an MLU while S is not, S consists of an MLU plus an additional character; in this case we will have f(S) < f(S'), and the frequency of the additional character makes the RMSE value larger, so σ > σ'. Therefore, R(S) < R(S').

In summary, for a string S and its substring S', the one with the higher R value is most likely to be an MLU. Table 1 gives the R value of each possible term in a Chinese sentence chosen from a small collection of summaries returned from a search engine: "隱形戰機/是/一種/靈活度/極差/的/戰機" ("/" indicates the lexicon boundaries given by a human).

Table 1. Chinese strings and R(S)
String S      R(S)
隱形          26.00
隱形戰        0.94
靈活度極      1.07
極差          0.8
極差的        0.07
戰機          2.89

This example clearly shows that if a Chinese MLU is given an additional character, its R value becomes significantly smaller than the R value of the MLU. For example, R(一種) = 0.44 > R(一種靈) = 0.21, and R(靈活) = R(靈活度) = 2.00 > R(靈活度極) = 1.07. It is therefore reasonable to segment the Chinese sentence at the positions where the string's R value drops greatly. The example sentence would be segmented as "隱形/戰機/是/一種/靈活度/極差/的/戰機" by the proposed method. The only difference between the human-segmented sentence and the automatically segmented sentence is that "隱形戰機" (Stealth Fighter) is segmented into the two words "隱形" (Stealth) and "戰機" (Fighter) by the proposed method. However, this is still an acceptable segmentation because those two words are meaningful.

3.2 A Bottom-up Term Extraction Strategy
As mentioned in Section 3.1, the top-down strategy first checks whether the whole sentence is an MLU, then reduces the sentence size by one and recursively checks the subsequences. It is reported that over 90% of meaningful Chinese terms consist of fewer than four characters [9], and on average the number of characters in a sentence is much larger than four. A whole sentence is therefore unlikely to be an MLU, and checking the whole sentence for an MLU is unnecessary. In this section, we describe a bottom-up strategy that extracts terms starting from the first character in the sentence. The basic idea is to determine the boundary of a term in a sentence by examining the frequency change (i.e., the change of the R value defined in Equation (6)) as the size of the term increases. If the R value of a term of size n+1 drops compared with that of its largest subterm of size n, the subterm of size n is extracted as an MLU. For example, in Table 1 there is a big drop between the R value of the term "靈活度" (2.00) and that of its superterm "靈活度極" (1.07). Therefore, "靈活度" is considered an MLU.

The following algorithm describes the bottom-up term extraction strategy:

Algorithm BUTE(s)
Input: s = a1a2...an, a Chinese sentence with n Chinese characters
Output: M, a set of MLUs
1. Check each character in s; if it is a stop character such as 是, 了, 的, ..., remove it from s. After removing all stop characters, s becomes a1a2...am, m ≤ n.
2. Let b = 2, e = 2, and M = φ.
3. Let t1 = ab...ae and t2 = ab...ae+1. If R(t1) > R(t2), then M := M ∪ {t1} and b := e+1.
4. e := e+1; if e+1 > m, return M; otherwise go to step 3.

The algorithm makes a subsequence uncheckable once it has been identified as an MLU (i.e., b = e+1 in step 3 ensures that the next valid checkable sequence does not contain t1, which was just extracted as an MLU). However, when using the bottom-up strategy described above, some longer terms might be missed, since a longer term can contain several shorter terms. As shown in our example, "隱形戰機" (Stealth Fighter) consists of the two terms "隱形" and "戰機". Using the bottom-up strategy, "隱形戰機" would not be extracted, because the composite term has already been segmented into two terms. To avoid this problem, we set a fixed number ω that specifies the maximum number of characters to be examined before reducing the size of the checkable sequence. The modified algorithm is given below:

Algorithm BUTE-M(s)
Input: s = a1a2...an, a Chinese sentence with n Chinese characters
Output: M, a set of MLUs
1. Check each character in s; if it is a stop character such as 是, 了, 的, ..., remove it from s. After removing all stop characters, s becomes a1a2...am, m ≤ n.
2. Let b = 2, e = 2, First-term = true, and M = φ.
3. Let t1 = ab...ae and t2 = ab...ae+1.
   If R(t1) > R(t2),
   then M := M ∪ {t1}
        If First-term = true, then first-position := e and First-term := false.
        If e−b+1 ≥ ω, then e := first-position, b := e+1, and First-term := true.
4. e := e+1; if e+1 > m, return M; otherwise go to step 3.

In algorithm BUTE-M, the variable first-position gives the ending position of the first identified MLU. Only when ω characters have been examined is the first identified MLU removed from the next valid checkable sequence; otherwise the current sequence is still checked for a possible MLU even if it contains an extracted MLU. Therefore, not only the terms "隱形" and "戰機" but also the longer term "隱形戰機" (Stealth Fighter) will be extracted.

3.3 Translation selection
Translation selection is relatively simple compared with term extraction. The translation of a word in a source language is typically determined according to the ranking of the extracted terms. Each of the terms is assigned a rank, usually calculated
based on term frequency and term length [13]. The term with the highest rank in the extracted term list is selected as the translation of the English term.

As we have described in other papers [9; 10], the traditional translation selection approaches select the translation on the basis of word frequency and word length [4; 13]. We have suggested an approach to finding the most appropriate translation from the extracted word list regardless of term frequency. In our scheme, even a low-frequency word has a chance to be selected. Our experiments in that paper show that in some cases the most appropriate translation is a low-frequency word. In this paper, we give only a brief description of our translation selection technique; the reader is referred to [9] for a more complete discussion.

The idea of our approach is to use translation disambiguation technology to select the translation from the extracted term list. As the extracted terms come from the result set returned by the web search engine, it is reasonable to assume that those terms are relevant to the English query term that was submitted to the web search engine. If we assume all those terms are translations of the English term, we can apply a translation disambiguation technique to select the most appropriate term as the translation of the English term. We also introduced a filtering technique into our approach to minimize the length of the extracted term list.

In our approach, the correct translation is selected using a simple translation disambiguation technique based on co-occurrence statistics. We use the total correlation, one of several generalizations of mutual information, to calculate the relationship between the query words. Our modified total correlation equation is defined as

C(x1 x2 x3 ... xn) = log2 [ (f(x1 x2 x3 ... xn) + 1) / ((f(x1) + 1)(f(x2) + 1)...(f(xn) + 1)) ]   (7)

Here, the xi are query words, f(xi) is the frequency with which the query word xi appears in the corpus, and f(x1 x2 x3 ... xn) is the frequency with which all the query words appear together in the corpus. We add 1 to each word frequency to avoid zeros appearing in the equation when a word's frequency is 0. The frequency information required by Equation 7 can easily be collected from the local corpus.

4. EVALUATION
We have conducted experiments to evaluate our proposed query translation approach. Using the algorithm described in Section 3.2, we first extract Chinese terms from the summaries returned by a search engine given an English term as the query. These Chinese terms are considered the candidate translations of the English term. We then use the method described in Section 3.3 to select the most appropriate translation. The web search engine used in the experiments is Google, and the top 300 summaries returned were used for later processing. The English queries entered into Google were enclosed in double quotation marks to ensure that Google only returns results containing the exact phrases. We also specified that the result pages must be written in Chinese.

4.1 Test set
140 English queries from the NTCIR6 CLIR task were used. Query terms were first translated using Yahoo's online dictionary (http://tw.dictionary.yahoo.com/). The remaining OOV terms, which could not be translated, were used to evaluate the performance of our web-based query translation approach described in Sections 3.2 and 3.3 by comparison with other approaches. There are 69 OOV terms altogether. The correct translation of each OOV term is given by NTCIR.

4.2 System setup
In our experiments, we evaluated the effectiveness of the term extraction approaches in OOV translation. All the approaches described in Section 2.1 were used in the experiment. The abbreviations are: MI for mutual information, SE for the approach introduced by Chien, SCP for the Local Maxima approach introduced by Silva and Lopes, and SCPCD for the approach introduced by Jenq-Haur Wang et al. Our extraction approach with the BUTE-M extraction strategy is abbreviated as SQUT.

Each OOV term is translated via the following steps:
1. From the result pages downloaded from Google, use the 5 different term extraction approaches to produce 5 Chinese term lists.
2. For each term list, remove a term if it can be translated to English by Yahoo's online dictionary. This leaves only OOV terms.
3. From each remaining term list, which contains only OOV terms, select the top 20 terms as translation candidates, and select the final translation from the candidate list using our translation selection approach described in Section 3.3.

Finally, we have 5 sets of OOV translations, as shown in the Appendix.

4.3 Results and discussion
For the 69 OOV terms, using the 5 different term extraction approaches, we obtained the translation results shown in Table 2. Details of the translations are shown in the Appendix.

As we were using the same corpus and the same translation selection approach, any difference in translation accuracy is the result of the different term extraction approaches. Thus we can claim that the approach with the higher translation accuracy has the higher extraction accuracy.

As we can see from Table 2, SQUT has the highest translation accuracy. SCP and SCPCD provide similar performance. The approaches based on mutual information provide the lowest performance.

Table 2. OOV translation accuracy
          Correct   Accuracy (%)
MI        30        43.5
SE        41        59.4
SCP       53        76.8
SCPCD     52        75.4
SQUT      59        85.5
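To make the extraction method of Sections 3.1 and 3.2 concrete, the R measurement of Equation (6) and the BUTE-M scan can be sketched in Python. This is a minimal reconstruction from the paper's description, not the authors' implementation: the function names, the 0-based indexing, the illustrative stop-character list, and the choice ω = 4 are our own assumptions.

```python
from collections import Counter
from math import sqrt

def r_score(seq, char_freq, seq_freq):
    """Equation (6): R(S) = f(S) / (sigma + 1), where sigma is the RMS
    error of the individual character frequencies in the sequence."""
    freqs = [char_freq[c] for c in seq]
    mean = sum(freqs) / len(freqs)
    sigma = sqrt(sum((x - mean) ** 2 for x in freqs) / len(freqs))
    return seq_freq(seq) / (sigma + 1.0)

def bute_m(sentence, corpus, stop_chars=frozenset("的是了"), omega=4):
    """Sketch of Algorithm BUTE-M: scan the sentence left to right and
    extract t1 as an MLU whenever extending it by one character makes
    the R value drop; the omega window lets a longer composite term
    still be found even after its first subterm has been extracted."""
    char_freq = Counter("".join(corpus))
    seq_freq = lambda t: sum(doc.count(t) for doc in corpus)
    s = "".join(c for c in sentence if c not in stop_chars)
    mlus, n = set(), len(s)
    b, e = 0, 0                      # 0-based; t1 = s[b:e+1]
    first_pos, first_term = 0, True
    while e + 1 < n:
        t1, t2 = s[b:e + 1], s[b:e + 2]
        if r_score(t1, char_freq, seq_freq) > r_score(t2, char_freq, seq_freq):
            mlus.add(t1)             # R drops when t1 is extended: t1 is an MLU
            if first_term:
                first_pos, first_term = e, False
            if e - b + 1 >= omega:   # omega characters examined: restart
                e, b, first_term = first_pos, first_pos + 1, True
        e += 1
    return mlus
```

For instance, with the toy corpus ["戰機戰機戰機", "戰機好"], in which 戰 and 機 are frequent and co-occurring while 好 is rare, bute_m("戰機好", ...) extracts {"戰機"}, because R(戰機) stays high while R(戰機好) collapses.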
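The selection step of Section 3.3 can likewise be sketched around Equation (7). The paper does not fix how the individual and joint frequencies are counted, so the snippet below makes one plausible choice (document counts); the function names are ours, and this is an illustration rather than the authors' code.

```python
from math import log2

def freq(term, corpus):
    """Number of corpus documents containing `term` (one plausible
    counting choice; the paper leaves this to the implementation)."""
    return sum(1 for doc in corpus if term in doc)

def total_correlation(words, corpus):
    """Modified total correlation, Equation (7), with add-one smoothing
    so zero frequencies cannot produce log2(0) or division by zero."""
    joint = sum(1 for doc in corpus if all(w in doc for w in words))
    denom = 1.0
    for w in words:
        denom *= freq(w, corpus) + 1
    return log2((joint + 1) / denom)

def select_translation(candidates, other_query_words, corpus):
    """Pick the candidate translation whose total correlation with the
    remaining query words is highest (the Section 3.3 selection rule)."""
    return max(candidates,
               key=lambda c: total_correlation([c] + list(other_query_words), corpus))
```

Because of the add-one smoothing, a candidate that never co-occurs with the other query words still receives a finite (but low) score, so low-frequency candidates are ranked rather than discarded.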
results given by NTCIR. There are only 11 OOV terms in those 50 queries. As a result, the number of OOV terms is not enough to distinguish the different approaches. While many researchers [4; 5; 7; 9; 13] have shown that better query translation quality should improve CLIR performance, it might be necessary to test the actual benefits of high translation accuracy in CLIR in the future, when we have appropriate testing data.

6. REFERENCES
[1] Mutual information, Wikipedia.
[2] The List of Common Use Chinese Characters, Ministry of Education of the People's Republic of China.
[3] A. Chen, H. Jiang, and F. Gey, Combining multiple sources for short query translation in Chinese-English cross-language information retrieval, Proceedings of the Fifth International Workshop on Information Retrieval with Asian Languages, ACM Press, Hong Kong, China, 2000.
[4] A. Chen and F. Gey, Experiments on cross-language and patent retrieval at the NTCIR3 Workshop, Proceedings of the 3rd NTCIR Workshop, Japan, 2003.
[5] P.-J. Cheng, J.-W. Teng, R.-C. Chen, J.-H. Wang, W.-H. Lu, and L.-F. Chien, Translating unknown queries with web corpora for cross-language information retrieval, Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, Sheffield, United Kingdom, 2004.
[6] L.-F. Chien, PAT-tree-based keyword extraction for Chinese information retrieval, Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, Philadelphia, Pennsylvania, United States, 1997.
[7] J. Gao, J.-Y. Nie, E. Xun, J. Zhang, M. Zhou, and C. Huang, Improving query translation for cross-language information retrieval using statistical models, Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, New Orleans, Louisiana, United States, 2001.
[8] M.-G. Jang, S.H. Myaeng, and S.Y. Park, Using mutual information to resolve query translation ambiguities and query term weighting, Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, College Park, Maryland, 1999.
[9] C. Lu, Y. Xu, and S. Geva, Translation disambiguation in web-based translation extraction for English-Chinese CLIR, Proceedings of the 22nd Annual ACM Symposium on Applied Computing, 2007.
[10] C. Lu, Y. Xu, and S. Geva, Improving translation accuracy in web-based translation extraction, Proceedings of the 6th NTCIR Workshop Meeting on Evaluation of Information Access Technologies: Information Retrieval, Question Answering and Cross-Lingual Information Access, Japan, 2007.
[11] M. Salovesh, How many words in an "average" person's vocabulary?, http://unauthorised.org/anthropology/anthro-l/august-1996/0436.html, 1996.
[12] J.F.d. Silva and G.P. Lopes, A Local Maxima method and a Fair Dispersion Normalization for extracting multiword units, International Conference on Mathematics of Language, 1999.
[13] Y. Zhang and P. Vines, Using the web for automated translation extraction in cross-language information retrieval, Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, Sheffield, United Kingdom, 2004.

7. Appendix. Sample of translated terms
OOV term   SQUT   SCP   SCPCD   SE   MI