ciated with more than one cluster of terms during "Documents mapping", the multi-topic property is obtained naturally. However, some topics may be redundant because they can cover the same set of documents. Thus, the final topic hierarchy is constructed by combining the topics of the same level of the hierarchy according to the equation |Ti ∩ Tj| / |Ti ∪ Tj| ≤ α, where |Ti| is the number of documents in the topic Ti and α is a value that defines the level of overlap.

After performing these steps, a topic hierarchy is available for users and applications. The Torch uses an XML format for its output, since it is easy to understand and can be used in various applications.

It is important to note that the requirements discussed above have been met: the topic hierarchies are built incrementally and, moreover, documents can belong to more than one topic.

3. EXPERIMENT: BUILDING TOPIC HIERARCHIES FOR DIGITAL LIBRARIES

In order to illustrate the main functionalities of the Torch tool, we carried out a simple controlled experiment with the aim of supporting the construction of digital libraries. In our experiments we used eight different real text collections obtained from ACM's digital library. Each text collection contains about 500 computer science articles and is organized into 40 predefined topics catalogued according to the ACM library. The aim of the experiment is to obtain a topic hierarchy from the Torch tool and compare it with the predefined topics. During this process, we inserted one article at a time to assess the incremental effect and to simulate a scenario involving growing text collections. The results were evaluated by the FScore index, which compares the topic hierarchy and the predefined topics using precision and recall measures.

Figure 3: FScore values obtained in the experiment (y-axis: FScore Index; x-axis: Text Collections ACM1 to ACM8)

Figure 3 shows the graph with the FScore values obtained from each text collection. The highest FScore value obtained is 0.79 while the lowest FScore is 0.63, indicating good agreement between the topic hierarchy and the predefined topics.

WebIntelligence web, page, semant, ontolog, concept, relat, document
Figure 4: Examples of topics obtained using the Torch

The results indicate that the Torch can effectively help users to build topic hierarchies from growing text collections. For example, the topic hierarchy obtained can be used as a first structure to be improved by the user or can also be used as metadata in information retrieval systems.

4. CONCLUDING REMARKS

In this paper we presented an overview of the Torch tool, developed with the goal of building topic hierarchies from growing text collections. We described the architecture of the tool and its main functionalities, and discussed how the tool meets the requirements: incremental clustering, understandable cluster labels, and overlapping topics.

A simple controlled experiment was carried out to illustrate the functionalities of the tool in the construction of digital libraries. In this experiment, we identified and evaluated the main topics of eight text collections.

The Torch tool and text collections are available online at: http://labic.icmc.usp.br/softwares-and-application-tools/.

In the future, we plan to improve the Torch tool by including new similarity measures between terms. Furthermore, we intend to investigate visualization techniques for topic hierarchies to facilitate the analysis of results.

5. REFERENCES

[1] M. J. A. Berry and J. Kogan. Text Mining: Applications and Theory. Wiley, West Sussex, 2010.
[2] B. C. M. Fung, K. Wang, and M. Ester. Encyclopedia of data warehousing and mining, volume 1, chapter Hierarchical Document Clustering, pages 555–560. Information Science Reference - IGI Publishing, 2008.
[3] R. M. Marcacini, M. F. Moura, and S. O. Rezende. IFM’s Digital Library: an application for organizing information using hierarchical clustering (In Portuguese). In III Workshop on Digital Libraries (WDL), XIII Webmedia, pages 1–8, 2007.
[4] R. M. Marcacini and S. O. Rezende. Incremental construction of topic hierarchies using hierarchical term clustering. In Proceedings of the 22nd International Conference on Software Engineering and Knowledge Engineering (SEKE), pages 553–558. KSI - Knowledge Systems Institute, 2010.
[5] M. F. Moura, R. M. Marcacini, B. M. Nogueira, M. S. Conrado, and S. O. Rezende. A proposal for building domain topic taxonomies. In Proceedings of 1st International Workshop on Web and Text Intelligence (WTI), XIX SBIA, pages 83–84, 2008.
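
As an illustration of the overlap ratio |Ti ∩ Tj| / |Ti ∪ Tj| used when combining topics of the same hierarchy level, the following Python sketch computes the ratio for every pair of topics. It is not taken from the Torch implementation: the function names, topic labels, and document identifiers are hypothetical, and topics are assumed to be represented as sets of document identifiers.

```python
from itertools import combinations


def overlap(ti, tj):
    """Return |Ti ∩ Tj| / |Ti ∪ Tj| for two topics given as sets of document ids."""
    union = ti | tj
    if not union:
        # Two empty topics share no documents; treat their overlap as 0.
        return 0.0
    return len(ti & tj) / len(union)


def pairwise_overlaps(topics):
    """Map each pair of topic labels (same hierarchy level) to its overlap ratio."""
    return {
        (a, b): overlap(topics[a], topics[b])
        for a, b in combinations(sorted(topics), 2)
    }


# Hypothetical topics at one level of a hierarchy (assumption, not Torch output).
level = {
    "web": {"d1", "d2", "d3"},
    "semantic": {"d2", "d3", "d4"},
    "network": {"d5"},
}

for pair, ratio in pairwise_overlaps(level).items():
    print(pair, round(ratio, 2))
```

Each ratio would then be compared with the threshold α to decide which topics of the level are combined.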