Business intelligence and knowledge management

by W. F. Cody, J. T. Kreulen, V. Krishna, and W. S. Spangler

IBM Systems Journal, Vol. 41, No. 4, 2002
ing is a systematic approach to collecting relevant business data into a single repository, where it is organized and validated so that it can be analyzed and presented in a form that is useful for business decision-making.¹ The various sources for the relevant business data are referred to as the operational data stores (ODS). The data are extracted, transformed, and loaded (ETL) from the ODS systems into a data mart. An important part of this process is data cleansing, in which variations on schemas and data values from disparate ODS systems are resolved. In the data mart, the data are modeled as an OLAP cube (multidimensional model), which supports flexible drill-down and roll-up analyses. Tools from various vendors (e.g., Hyperion, Brio, Cognos) provide the end user with a query and analysis front end to the data mart. Large data warehouses currently hold tens of terabytes of data, whereas smaller, problem-specific data marts are typically in the 10 to 100 gigabyte range.

Knowledge management definitions span organizational behavioral science, collaboration, content management, and other technologies. In this context, we are using it to address technologies used for the management and analysis of unstructured information, particularly text documents. It is conjectured that there is as much business knowledge to be gleaned from the mass of unstructured information available as there is from classical business data. We believe this to be true and assert that unstructured information will become commonly used to provide deeper insights and explanations into events discovered in the business data. The ability to provide insights into observed events (e.g., trends, anomalies) in the data will clearly have applications in business, market, competitive, customer, and partner intelligence, as well as in many domains such as manufacturing, consumer goods, finance, and life sciences.

The variety of textual information sources is extremely large, including business documents, e-mail, news and press articles, technical journals, patents, conference proceedings, business contracts, government reports, regulatory filings, discussion groups, problem report databases, sales and support notes, and, of course, the Web. Knowledge and content management technologies are used to search, organize, and extract value from all of these information sources and are a focus of significant research and development.²,³ These technologies include clustering, taxonomy building, classification, information extraction, and summarization. An increasing number of applications, such as expertise location,⁴,⁵ knowledge portals, customer relationship management (CRM), and bioinformatics, require merging these unstructured information technologies with structured business data analysis.

It is our belief that over time techniques from both BI and KM will blend. Today's disparate systems will use techniques from each and will, in turn, inspire new techniques that will seamlessly span the analysis of both data and text. With this in mind, we describe our contributions in this direction. First, we briefly describe some business problems that motivate this integration and some of the technical challenges that they pose. Then we describe eClassifier, a comprehensive text analysis tool that provides a framework for integrating advanced text analytics. Next, we present an example that motivates our particular approach toward integrating data and text analysis and describe our architecture for a combined data and document warehouse and associated tooling. Finally, we discuss some current research directions in extracting information from documents that can increase the value of a data cube.

Motivation for BIKM

The desire to extend the capabilities of business intelligence applications to include textual information has existed for quite some time. The major inhibitors have included the separation of the data on different data management systems, typically across different organizations, and the immaturity of automated text analysis techniques for deriving business value from large amounts of text. The current focus on information integration in many enterprises is rapidly diminishing the first inhibitor, and advances in machine learning, information retrieval, and statistical natural language processing are eroding the second.

Examples of BIKM problems. To understand the importance of BIKM, it is useful to look at some real business problems and to determine how this technology can provide a return on the investment (ROI). The ROI can be achieved, in general, in one of two ways: (1) through cost reductions and identification of inefficiencies (improved productivity), and (2) through identification of revenue opportunities and growth. Here are some typical scenarios in which our customers believe their business analyses would benefit substantially from data and text integration:

1. Understanding sales effectiveness. A telemarketing revenue data cube can help identify products that
current feature space. The category centroid is central to the computation of the summary information in this view.

The category label is generated using a term-coverage algorithm that identifies dominant terms in the feature space. If a single term occurs in 90 percent or more of the documents in a category, the category is labeled with that term. If more than one term occurs with 90 percent frequency, then all of these terms (up to four) will be included in the name, with the "&" character between each term. If no single term covers 90 percent of the documents, then the most frequent term becomes the first entry in the name. The second entry is the one that occurs most frequently in all documents that do not contain the first term of the name. If the set of documents containing either of these two words is now 90 percent of the documents in the category, these two words combined become the name (separated by a comma). If not, this process is repeated. If none of the top four terms is contained in 10 percent or more of the documents, the category is called "Miscellaneous." We have experimented with various other algorithms for labeling categories, including finding the most frequently occurring phrases. Although these sometimes appear to be more meaningful, we have found them to be often misleading and to mischaracterize the category as a whole. Although this algorithm is effective for quickly summarizing a category, we also allow the user to assign a different label at any time.

In addition to a label, three metrics are computed for each category by default. The size column displays a count of the number of documents in the category and its percentage of the total collection. Cohesion is a measure of the variance of the documents within a category. The cohesion is calculated based on the average cosine distance from the centroid. We have found that this provides a good measure of the similarity within a category, and categories with low cohesion are good candidates for splitting. Distinctness is a measure of variance of the documents between categories. The distinctness is calculated based on the cosine distance from a category's centroid to its nearest neighboring category's centroid. We have found this to provide a good measure of similarity between categories, so categories with low distinctness are similar to a neighboring category and would be candidates for potential merging.

Category evaluation. In addition to understanding a given taxonomy at a macro level, it is important to be able to precisely understand what core concept a category represents. To address this need, eClassifier provides a special view that shows statistics about term frequency, induced classification rules, and document examples, as shown in Figure 2. eClassifier has a bar graph representation of the category centroid. For each term in the feature space, it shows the frequency of occurrence within the given class (red bar) and the frequency of occurrence within the total document collection (blue bar). The terms are sorted in decreasing order of red minus blue bar height in order to focus attention on the most relevant terms for the class. The class components panel visualizes the effect of certain terms (inclusion = +, exclusion = -) when used as a decision tree classifier. Nodes in the tree are selected based on minimizing the entropy of in-category vs. out-of-category documents. If certain rules are particularly meaningful, a user can click on the node and create a new class from the identified set of documents. Finally, this view provides example documents from each category. Several sorting techniques are available. Ordering by "most typical" is calculated based on proximity to the centroid. This is an effective technique: examining a few documents close to the centroid helps a user to understand the essence of the category's concept. Ordering by "least typical," by showing example documents farthest from the centroid, helps the user to evaluate uniformity within the category. Examples that the user identifies as not really belonging to the category can easily be moved to other categories or to a newly created category. With each modification, all relevant statistics are dynamically updated.

Taxonomy visualization. Visualization is an important technique to convey information. eClassifier uses visualization to help a user to explore the relationships between categories of documents within a taxonomy. We show an example of eClassifier's visualization in Figure 3. In the visualization each dot represents a document, which, when clicked on, will be displayed. The color of the dot denotes its membership within a corresponding category. A large dot represents a category centroid, which is the average feature vector for the documents in that category. For each rendering we select the centroids of three categories to form a plane. All the documents are then projected onto this plane.

This visualization is useful for exploring the relationship between various categories. We can quickly see which categories are close in proximity, and we can find specific documents that may span these categories by selecting documents that lie on their respective borders.
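The category mechanics described in this section can be sketched compactly. The following is a minimal, illustrative Python version, not eClassifier's actual implementation: the function names, the whitespace tokenizer, and the alphabetical tie-breaking are our own assumptions; only the 90 percent/10 percent thresholds, the cosine-based cohesion and distinctness measures, and the three-centroid plane projection come from the text above.

```python
import math

def label_category(docs, coverage=0.9, floor=0.1, max_terms=4):
    # Term-coverage labeling with the thresholds described above.
    n = len(docs)
    doc_terms = [set(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*doc_terms))

    def freq(term, subset):
        return sum(term in t for t in subset)

    # One or more single terms each covering >= 90% of the documents.
    dominant = [t for t in vocab if freq(t, doc_terms) >= coverage * n]
    if dominant:
        return " & ".join(dominant[:max_terms])

    # Otherwise grow the name greedily: each new entry is the term most
    # frequent among documents not containing any entry chosen so far.
    label, uncovered = [], doc_terms
    for _ in range(max_terms):
        best = max((t for t in vocab if t not in label),
                   key=lambda t: freq(t, uncovered))
        label.append(best)
        uncovered = [t for t in uncovered if not (t & set(label))]
        if n - len(uncovered) >= coverage * n:
            return ", ".join(label)
    if all(freq(t, doc_terms) < floor * n for t in label):
        return "Miscellaneous"
    return ", ".join(label)

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cohesion(vectors):
    # Mean cosine similarity of members to their centroid;
    # low values mark candidates for splitting.
    c = centroid(vectors)
    return sum(cos_sim(v, c) for v in vectors) / len(vectors)

def distinctness(cat_centroid, other_centroids):
    # Cosine distance to the nearest neighboring centroid;
    # low values mark candidates for merging.
    return min(1.0 - cos_sim(cat_centroid, o) for o in other_centroids)

def project_to_plane(vectors, c1, c2, c3):
    # 2-D coordinates of documents on the plane spanned by three
    # category centroids (Gram-Schmidt on c2 - c1 and c3 - c1).
    sub = lambda a, b: [x - y for x, y in zip(a, b)]
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    norm = lambda a: [x / math.sqrt(dot(a, a)) for x in a]
    e1 = norm(sub(c2, c1))
    u = sub(c3, c1)
    e2 = norm(sub(u, [dot(u, e1) * x for x in e1]))
    return [(dot(sub(v, c1), e1), dot(sub(v, c1), e2)) for v in vectors]
```

With these pieces, a category of documents all mentioning "battery" is labeled "battery"; a mixed category falls through to the greedy two-term name, and the same cosine machinery serves both the metrics column and the scatter-plot view.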
achieve a tighter integration of the text information with the associated data. One way to do this is to use an OLAP multidimensional data model¹ as the integrating mechanism. The typical dimensional data model for an OLAP system uses a star schema as the model for a data cube. A basic star schema consists of a fact table at its center and a corresponding set of dimension tables, as shown in Figure 4. A fact table is a normalized table that consists of a set of measures or facts and a set of attributes represented by foreign keys into a set of dimension tables. The measures are typically numeric and additive (or at least partially additive). Because fact tables can have a very large number of rows, great effort is made to keep the columns as concise as possible. A dimension table is highly denormalized and contains the descriptive attributes of each fact table entry. These attributes can consist of multiple hierarchies as well as simple attributes. Because the dimension tables typically consist of less than 1 percent of the overall storage requirement, it is quite acceptable to repeat information to improve system query performance. The level at which the dimensions and measures encapsulate the data is referred to as the "fact grain." An example of a low-level grain is at the transactional level, where the dimensions are the product, geography, and date of the transaction, and the measures are the dollar revenue and units sold.

In the example in Figure 4 we have three dimension tables: product, geography, and date. The product dimension has an associated hierarchy: group → type → product. The geography dimension has an associated hierarchy: country → state → city. The date dimension has an associated hierarchy: year → quarter → day. These three dimensions represent the attributes that we can use to analyze our facts. In this example, we have a revenue fact table. Each row in the fact table represents the aggregate transactions at the lowest level in each of the dimensions. In this case, each fact is aggregated at product, city, and day. The measures that are aggregated are revenue and units sold. This model allows us to explore the effect of product, geography, and date, at all levels in each hierarchy, on revenue and units sold and on other measures computed from these values, such as total revenue, average revenue, total units sold, and average units sold. Typically an analyst would use an application to analyze these various measures by looking at trends over time, or by finding weaknesses or strengths in products or geographies.

Integrating text information into this analysis requires progress in several areas of text analytics in which we are currently working. The first is the use of text classification technology either to find attributes in the documents that can be used to link them to the data, or to find attributes in the documents that can be used as additional dimensions to deepen the understanding of the data. Second, we are researching information extraction technologies to process the text and to compute quantitative values from the documents. The quantitative values can then be used as measures in a document fact table (Table 1).

Table 1. Schema of the problem-ticket document fact table:

Product key | Geography key | Date key | Customer key | Problem type | Days open | Severity of the complaint | Ticket identifier

The combined data can not only be "sliced and diced" in the traditional OLAP paradigm of data analysis, but also the related documents can be explored in various ways that exploit their structure to make their content more useful. This interaction model and its underlying information model is an area for our current research.

Consider again the example in Figure 4. The facts have keys for the dimensions of product, geography, and date. Now we also have a database of problem tickets resulting from service calls. The problem tickets have meta-data recorded along with a transcript of the problem description. If we run a set of analyses over this collection of documents we can hope to accomplish several things. First, by using a classification process we can divide the problem tickets by problem type, thereby creating a new dimension, in addition to the existing meta-data dimensions, into which problem tickets can be categorized. Second, by running experimental text analyses over the text of the problem tickets we can attempt to quantify the severity of the problem in the ticket. Upon doing this, the problem ticket documents can be organized into their own fact table with the schema shown in Table 1.

The first four columns, which are foreign keys into dimension tables, are derived from the meta-data associated with the tickets in the problem ticket database. The fifth column is now a dimension associated with the problem that was created by automatically classifying the tickets. The sixth column is a measure associated with the problem that can be calculated from the meta-data. The seventh column is a measure of the severity of the problem as calculated by a text analysis of the transcription of the call. This may be a scoring of the frustration or anger felt by the caller. Finally, the last column ties this document fact back to the original document in the ticket database to facilitate movement from the OLAP environment of these facts into a document analysis environment.

Given these fact table schemas, if we roll up the first fact table (Figure 4) along the product, geography, and date dimensions, while computing the average dollar sales and average units sold, and if we roll up the second table along the product, geography, date, and problem dimensions while computing the number of customer keys, the average days open, and the average complaint severity, then the join will give us a picture of the revenue as well as the problem costs for each product, per location, per time period, associated with a problem type (e.g., installation, missing CDs, etc.). Then, with an integrated tooling environment we can perform this type of quantitative dimensional OLAP analysis and then seamlessly move into a document analysis to understand the complaints in more depth. A discussion of such an experimental tooling environment that has been built at the Almaden Research Center follows.

Integrated BIKM tools (Sapient & eClassifier). In the previous section we describe our text analysis system, eClassifier. In this section we describe the tooling we have built to apply the OLAP data model to text documents, creating a document warehouse. We then describe how we link the data model for the data and the documents through shared dimensions, and how this enhances our analytical capabilities. Finally, we describe how text analytics can be used to dynamically enhance this data model with what we call dynamic dimensions.

The tool we have developed allows us to explore data cubes with a star schema and consists of a report view and navigational controls. The report view provides a view of the results of data queries on a data cube. The reports can be summary tables (Figure 5), trend line graphs (Figure 6), or pie charts. An important part of the navigational controls are the dimension and metrics selection boxes. The dimension selection box allows the user to select and drill down on each dimension. This includes drilling down a dimension hierarchy or cross drilling from one dimension to another. The metric selection box allows the user to select metrics that are computable for the given data cube. Additional navigation buttons allow forward and backward navigation to view previous reports. Other navigation controls are discussed later.
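The star schemas and the dual roll-up just described can be exercised with a few lines of SQL. In the sketch below, every table and column name is an illustrative stand-in for the schemas of Figure 4 and Table 1, and the tiny in-memory data set is invented for the example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Denormalized dimension tables, one hierarchy each.
CREATE TABLE product  (prod_id INTEGER PRIMARY KEY, prod_group TEXT,
                       prod_type TEXT, product TEXT);
CREATE TABLE geography(geo_id INTEGER PRIMARY KEY, country TEXT,
                       state TEXT, city TEXT);
CREATE TABLE date_dim (date_id INTEGER PRIMARY KEY, year INTEGER,
                       quarter INTEGER, day TEXT);
-- Revenue fact at the product x city x day grain (after Figure 4).
CREATE TABLE revenue_fact (prod_id INT, geo_id INT, date_id INT,
                           revenue REAL, units_sold INT);
-- Document fact for problem tickets (after Table 1).
CREATE TABLE ticket_fact (prod_id INT, geo_id INT, date_id INT, cust_id INT,
                          problem_type TEXT, days_open INT, severity REAL,
                          ticket_id TEXT);
""")
con.execute("INSERT INTO product VALUES (1, 'Mobile', 'Laptop', 'T20')")
con.execute("INSERT INTO geography VALUES (1, 'US', 'CA', 'San Jose')")
con.execute("INSERT INTO date_dim VALUES (1, 2002, 1, '2002-01-15')")
con.executemany("INSERT INTO revenue_fact VALUES (?,?,?,?,?)",
                [(1, 1, 1, 3000.0, 2), (1, 1, 1, 1500.0, 1)])
con.executemany("INSERT INTO ticket_fact VALUES (?,?,?,?,?,?,?,?)",
                [(1, 1, 1, 7, 'installation', 3, 0.8, 'T-001'),
                 (1, 1, 1, 9, 'installation', 5, 0.4, 'T-002')])

# Roll up each fact table to the shared (product, geography, date) grain and
# join: average revenue next to ticket counts, days open, and severity.
rows = con.execute("""
    WITH rev AS (SELECT prod_id, geo_id, date_id,
                        AVG(revenue) AS avg_rev, AVG(units_sold) AS avg_units
                 FROM revenue_fact GROUP BY prod_id, geo_id, date_id),
         prb AS (SELECT prod_id, geo_id, date_id, problem_type,
                        COUNT(cust_id) AS n_cust, AVG(days_open) AS avg_open,
                        AVG(severity) AS avg_sev
                 FROM ticket_fact GROUP BY prod_id, geo_id, date_id, problem_type)
    SELECT p.product, prb.problem_type, rev.avg_rev,
           prb.n_cust, prb.avg_open, prb.avg_sev
    FROM rev JOIN prb ON rev.prod_id = prb.prod_id
                     AND rev.geo_id = prb.geo_id AND rev.date_id = prb.date_id
    JOIN product p ON p.prod_id = rev.prod_id
""").fetchall()
```

Each result row places revenue measures side by side with problem-cost measures for one product, location, time period, and problem type, which is exactly the joined picture the roll-up is meant to produce.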
Document warehousing. We extend the techniques used on data in business intelligence to documents by using a dimensional model where the fact table granularity is a document, and the dimension tables hold the attributes of the document. Without additional processing this representation is a "factless" fact table, because there are, as yet, no associated measures. The process of populating the document warehouse has some complexities beyond typical ETL processing. In many cases the source of the documents is not an operational data store. Typically documents are automatically and incrementally put into the document warehouse based on either a subscription (push model) or a scheduled retrieval process (pull model). Additionally, we need a method to filter the documents, because not all documents will be relevant to the purpose of the document cube.

Depending on the source, most documents have some associated meta-data that can naturally be used to populate some dimensions, such as author, date of publication, and document source. However, there are dimensions of potential interest that may not be included in the meta-data. If the dimension is known, classification techniques can be used to populate it. Using this model, all of the techniques previously described that are available to data cubes are now available to document cubes.

Shared dimensions. Thus far we have shown how star schemas can be used to organize and analyze both data and document cubes. Although each on its own can provide very useful information, providing a mechanism to link them will allow deeper analysis and thereby provide greater value. As an example, we revisit our product-geography-date revenue cube from Figure 4. If we have a collection of documents that are relevant to the given products, in the given geographies, over the given times, the information they contain and its relationship to the business data analysis can greatly improve decision making. Some documents that could provide insight in this example would be sales logs, customer support logs, news and press articles, marketing material, and discussion groups. All of these could provide unique insights into why a product is selling well or poorly in a given geography during a given time frame. The key to achieving these insights is to directly link the data to the documents through shared dimensions. An example data model of data and document cubes with shared dimensions is illustrated in Figure 7.

Dynamic dimensions. At this point we have data and document cubes that are linked through shared dimensions. All of the analytical techniques used on data cubes can be used on the document cubes. Given the linkage created by shared dimensions, we can use the constraints used to identify a subset of data to then identify the corresponding set of documents, and then make inferences from those documents about the data. For example, if the data show a drop in revenue for a product in certain geographies during a given time period, we can use these constraints on the document cube to identify the documents that might best explain the drop in revenue. We can then use standard OLAP techniques to investigate the relationship to any additional (nonshared) dimensions available for the documents. However, sometimes the existing dimensions and their taxonomies may be insufficient to fully explain the data. The documents can then be further analyzed using a deeper text analytical system such as eClassifier. We have provided this in our BIKM system by augmenting the document warehouse with an additional table (i.e., the token table) that has the document identifier, token identifier, and token offset for every token in every document (shown in Figure 7). The token table allows us to dynamically select (extract) and initiate eClassifier on an arbitrary subset of the documents from the document warehouse. Once we have invoked eClassifier on the documents we can perform all of the analytical capabilities outlined previously.

Furthermore, eClassifier can be used to create a new taxonomy over this selected set of documents. This new taxonomy is effectively a new (hierarchical) dimension that adds value to the existing data and document cubes. For example, problem tickets can be classified into problem types. This dimension provides a finer granularity for understanding the problems that are contributing to the costs associated with products in a given region and time period.
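The token table described above can be sketched in a few lines. This is an illustrative in-memory version, assuming whitespace tokenization; the function names are our own, and the real warehouse keeps these structures as relational tables:

```python
def build_token_table(docs):
    # One (doc_id, token_id, offset) row per token, plus a dictionary
    # that interns each distinct token string as an integer id.
    token_ids, token_table = {}, []
    for doc_id, text in docs.items():
        for offset, tok in enumerate(text.lower().split()):
            tid = token_ids.setdefault(tok, len(token_ids))
            token_table.append((doc_id, tid, offset))
    return token_ids, token_table

def extract_subset(doc_ids, token_ids, token_table):
    # Dynamically re-materialize the token streams for an arbitrary
    # subset of documents, e.g. to hand them to a taxonomy tool.
    id_to_tok = {v: k for k, v in token_ids.items()}
    out = {d: [] for d in doc_ids}
    for doc_id, tid, offset in token_table:
        if doc_id in out:
            out[doc_id].append((offset, id_to_tok[tid]))
    return {d: " ".join(t for _, t in sorted(toks)) for d, toks in out.items()}
```

Because the offsets are stored, the original token order of any selected document can be reconstructed exactly, which is what makes on-demand extraction for deeper text analysis possible.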
[Figure 7. Data and document cubes with shared dimensions. A DATA FACT table (metrics: transaction counts, total revenue, average revenue, total units, average units) and a DOCUMENT FACT table (metrics: document counts) share the PRODUCT (Prod_ID, Prod_Line, Prod_Group, Product), GEOGRAPHY (Geo_ID, Site, Location), and DATE (Date_ID, Date, Month, Year) dimension tables. The document fact also references the document-only dimensions CAUSE (Cause_ID, Cause), SUBJECT (Subj_ID, Subject), SOLUTION (Sol_ID, Solution), and DOCUMENT (Doc_ID, Title, Abstract), as well as the TOKEN table (Doc_ID, Token_ID, Offset).]
The new taxonomy can be made available to the document warehouse by creating a corresponding dimension table to represent the taxonomy and then populating an added column in the fact table, associating all known documents with the newly published dimension. This new dimension is now available to all of the analytical and reporting capabilities in the OLAP environment. Additional processing can be performed to classify all of the documents that were not in the extracted set of documents into the new dimension.

For example, we selected the "ThinkPad* T20" product (see Figure 5) and extracted into eClassifier the 2858 documents associated with this product. We used eClassifier to produce the new taxonomy shown in Figure 8. We then saved this taxonomy for the document warehouse by publishing it as the "new thinkpad taxonomy" dimension and updating the document fact table appropriately. This allows us to drill from within the data warehouse, and the results are shown in Figure 9.

Summary and future research

The previous sections discuss our current integration model for data and text analysis and the tooling we have built to experiment with it. The missing, and somewhat open-ended, portion of this integration is the text analytics that will be used to create the quantitative metrics that populate the document cube and augment the data cube metrics. There is significant work going on in the IBM research community, especially within the unstructured information management area, to perform information extraction from documents. These efforts include: (1) extracting quantitative facts from documents (e.g., the financial terms of a contract); (2) deducing relationships between entities in a document (e.g., new product A competes with product B); and (3) measuring the level of subjective values such as severity or sentiment in documents (e.g., a customer letter reflects extreme displeasure with a company's service). Currently we are exploring techniques to accomplish these tasks based on statistical machine-learning approaches. We hope to report on these in a future paper.

Another area of future research that we believe is promising is the integration of ontologies into the taxonomy generation and dimension publishing portions of our BIKM architecture. Ontologies provide a level of semantics that we do not currently address, allowing improved taxonomies and reasoning about the data and text. Furthermore, emerging ontological technologies such as the semantic Web can provide a vehicle to integrate the text and data under study with a far larger body of text and data, thereby expanding the potential insights.
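To make the first and third of these efforts concrete, the toy sketch below pulls dollar amounts out of free text with a regular expression and scores severity against a small hand-made word list. Both the pattern and the lexicon are invented for illustration and are far simpler than the statistical machine-learning approaches under study:

```python
import re

# Matches figures such as "$300" or "$1,200.50".
DOLLAR = re.compile(r"\$\s?([\d,]+(?:\.\d+)?)")

def extract_amounts(text):
    # (1) Quantitative facts: dollar figures mentioned in the document.
    return [float(m.replace(",", "")) for m in DOLLAR.findall(text)]

# Hypothetical weighted lexicon for customer frustration or anger.
ANGER_WORDS = {"angry": 2, "furious": 3, "unacceptable": 2,
               "disappointed": 1, "displeasure": 2}

def severity_score(text):
    # (3) Subjective values: lexicon hits, normalized by document length,
    # yielding a per-document measure for the fact table.
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(ANGER_WORDS.get(t, 0) for t in tokens) / max(len(tokens), 1)
```

The point of the sketch is the output shape, not the technique: each function reduces an unstructured transcript to a numeric value that can sit in a measure column of a document fact table.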
In this paper we show that text integrated with business data can provide valuable insights for improving the quality of business decisions. We describe a text analysis framework and how to integrate it into a business intelligence data warehouse by introducing a document warehouse and linking the two through shared dimensions. We believe that this provides a platform on which to build and research new algorithms to find the currently hidden business value in the vast amount of text related to business data. Technologies in the areas of information extraction and integrated text and data mining will build on this framework, allowing it to achieve its full business potential.

Acknowledgments

The authors gratefully acknowledge the contributions of Dharmendra Modha, Ray Strong, Justin Lessler, Thomas Brant, Iris Eiron, Hamid Pirahesh, Shivakumar Vaithyanathan, and Anant Jhingran for their contributions to eClassifier, Sapient, and the underlying ideas of BIKM.

*Trademark or registered trademark of International Business Machines Corporation.

Cited references

1. R. Kimball, The Data Warehouse Toolkit, John Wiley & Sons, Inc., New York (1996).
2. D. Sullivan, Document Warehousing and Text Mining, John Wiley & Sons, Inc., New York (2001).
3. T. Nasukawa and T. Nagano, "Text Analysis and Knowledge Mining System," IBM Systems Journal 40, No. 4, 967–984 (2001).
4. W. Pohs, Practical Knowledge Management, IBM Press, Double Oak, TX (2001).
5. W. Pohs, G. Pinder, C. Dougherty, and M. White, "The Lotus Knowledge Discovery System: Tools and Experiences," IBM Systems Journal 40, No. 4, 956–966 (2001).
6. See http://www-4.ibm.com/software/data/bi/banking/ezmart.htm.
7. M. Hernandez, R. J. Miller, and L. Haas, "Clio: A Semi-Automatic Tool for Schema Mapping," Proceedings, Special Interest Group on Management of Data, Santa Barbara, CA (May 21–24, 2001).
8. See http://www.itl.nist.gov/iad/894.02/related_projects/muc/index.html.
9. S. Sarawagi, R. Agrawal, and N. Megiddo, "Discovery-Driven Exploration of OLAP Data Cubes," Proceedings, 6th International Conference on Extending Database Technology, Valencia, Spain (March 23–27, 1998), pp. 168–182.
10. C. Kwok, O. Etzioni, and D. S. Weld, "Scaling Question Answering to the Web," Proceedings, 10th International World Wide Web Conference, Hong Kong (May 1–5, 2001), available at http://www10.org/cdrom/papers/120/.
11. G. Salton and M. J. McGill, Introduction to Modern Retrieval, McGraw-Hill Publishing, New York (1983).
12. G. Salton and C. Buckley, "Term-Weighting Approaches in Automatic Text Retrieval," Information Processing and Management 4, No. 5, 512–523 (1988).
13. R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, John Wiley & Sons, Inc., New York (1973).
14. J. A. Hartigan, Clustering Algorithms, John Wiley & Sons, Inc., New York (1975).
15. E. Rasmussen, "Clustering Algorithms," W. B. Frakes and R. Baeza-Yates, Editors, Information Retrieval: Data Structures and Algorithms, Prentice Hall, Englewood Cliffs, New Jersey (1992), pp. 419–442.
16. S. Vaithyanathan and B. Dom, "Model-Based Hierarchical Clustering," available at http://www.almaden.ibm.com/cs/people/dom/papers/uai2k.ps.
17. I. Dhillon, D. Modha, and S. Spangler, "Visualizing Class Structures of Multi-Dimensional Data," Proceedings, 30th Conference on Interface, Computer Science and Statistics, May 1998.

Accepted for publication July 12, 2002.

William F. Cody IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose, California 95120 (electronic mail: wcody@almaden.ibm.com). Dr. Cody is a senior manager of the Knowledge Middleware and Technology group at IBM's Almaden Research Center. He received his Ph.D. degree in mathematics in 1979 and has held various product development, research, and management positions with IBM since joining the company in 1974. He has published papers on database applications, database technology, software engineering, and group theory.

Jeffrey T. Kreulen IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose, California 95120 (electronic mail: kreulen@almaden.ibm.com). Dr. Kreulen is a manager at the IBM Almaden Research Center. He holds a B.S. degree in applied mathematics (computer science) from Carnegie Mellon University and an M.S. degree in electrical engineering and a Ph.D. degree in computer engineering, both from Pennsylvania State University. Since joining IBM in 1992, he has worked on multiprocessor systems design and verification, operating systems, systems management, Web-based service delivery, and integrated text and data analysis.

Vikas Krishna IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose, California 95120 (electronic mail: vikas@almaden.ibm.com). Mr. Krishna is a software engineer at the IBM Almaden Research Center. He holds a B.Tech. degree in naval architecture from IIT Madras, an M.E. degree in computational fluid dynamics from Memorial University, Newfoundland, Canada, and an M.S. degree in computer engineering from Syracuse University, New York. Since joining IBM in 1997, he has developed systems for Web-based service delivery, business-to-business information exchange, and the integrated analysis of text and data.

W. Scott Spangler IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose, California 95120 (electronic mail: