
The integration of business intelligence and knowledge management

by W. F. Cody, J. T. Kreulen, V. Krishna, and W. S. Spangler

Enterprise executives understand that timely, accurate knowledge can mean improved business performance. Two technologies have been central in improving the quantitative and qualitative value of the knowledge available to decision makers: business intelligence and knowledge management. Business intelligence has applied the functionality, scalability, and reliability of modern database management systems to build ever-larger data warehouses, and to utilize data mining techniques to extract business advantage from the vast amount of available enterprise data. Knowledge management technologies, while less mature than business intelligence technologies, are now capable of combining today's content management systems and the Web with vastly improved searching and text mining capabilities to derive more value from the explosion of textual information. We believe that these systems will blend over time, borrowing techniques from each other and inspiring new approaches that can analyze data and text together, seamlessly. We call this blended technology BIKM. In this paper, we describe some of the current business problems that require analysis of both text and data, and some of the technical challenges posed by these problems. We describe a particular approach based on an OLAP (on-line analytical processing) model enhanced with text analysis, and describe two tools that we have developed to explore this approach—eClassifier performs text analysis, and Sapient integrates data and text through an OLAP-style interaction model. Finally, we discuss some new research that we are pursuing to enhance this approach.

A critical component for the success of the modern enterprise is its ability to take advantage of all available information. This challenge becomes more difficult with the constantly increasing volume of information, both internal and external to an enterprise. It is further exacerbated because many enterprises are becoming increasingly "knowledge-centric," and therefore a larger number of employees need access to a greater variety of information to be effective. The explosive growth of the World Wide Web clearly compounds this problem.

Enterprises have been investing in technology in an effort to manage the information glut and to glean knowledge that can be leveraged for a competitive edge. Two technologies in particular have shown good return on investment in some applications and are benefiting from a large concentration of research and development. The technologies are business intelligence (BI) and knowledge management (KM).

Business intelligence technology has coalesced in the last decade around the use of data warehousing and on-line analytical processing (OLAP). Data warehousing is a systematic approach to collecting relevant business data into a single repository, where it is organized and validated so that it can be analyzed and presented in a form that is useful for business decision-making.1 The various sources for the relevant business data are referred to as the operational data stores (ODS). The data are extracted, transformed, and loaded (ETL) from the ODS systems into a data mart. An important part of this process is data cleansing, in which variations on schemas and data values from disparate ODS systems are resolved. In the data mart, the data are modeled as an OLAP cube (multidimensional model), which supports flexible drill-down and roll-up analyses. Tools from various vendors (e.g., Hyperion, Brio, Cognos) provide the end user with a query and analysis front end to the data mart. Large data warehouses currently hold tens of terabytes of data, whereas smaller, problem-specific data marts are typically in the 10 to 100 gigabytes range.

Knowledge management definitions span organizational behavioral science, collaboration, content management, and other technologies. In this context, we are using it to address technologies used for the management and analysis of unstructured information, particularly text documents. It is conjectured that there is as much business knowledge to be gleaned from the mass of unstructured information available as there is from classical business data. We believe this to be true and assert that unstructured information will become commonly used to provide deeper insights and explanations into events discovered in the business data. The ability to provide insights into observed events (e.g., trends, anomalies) in the data will clearly have applications in business, market, competitive, customer, and partner intelligence as well as in many domains such as manufacturing, consumer goods, finance, and life sciences.

The variety of textual information sources is extremely large, including business documents, e-mail, news and press articles, technical journals, patents, conference proceedings, business contracts, government reports, regulatory filings, discussion groups, problem report databases, sales and support notes, and, of course, the Web. Knowledge and content management technologies are used to search, organize, and extract value from all of these information sources and are a focus of significant research and development.2,3 These technologies include clustering, taxonomy building, classification, information extraction, and summarization. An increasing number of applications, such as expertise location,4,5 knowledge portals, customer relationship management (CRM), and bioinformatics, require merging these unstructured information technologies with structured business data analysis.

It is our belief that over time techniques from both BI and KM will blend. Today's disparate systems will use techniques from each and will, in turn, inspire new techniques that will seamlessly span the analysis of both data and text. With this in mind, we describe our contributions in this direction. First, we briefly describe some business problems that motivate this integration and some of the technical challenges that they pose. Then we describe eClassifier, a comprehensive text analysis tool that provides a framework for integrating advanced text analytics. Next, we present an example that motivates our particular approach toward integrating data and text analysis and describe our architecture for a combined data and document warehouse and associated tooling. Finally, we discuss some current research directions in extracting information from documents that can increase the value of a data cube.

Motivation for BIKM

The desire to extend the capabilities of business intelligence applications to include textual information has existed for quite some time. The major inhibitors have included the separation of the data on different data management systems, typically across different organizations, and the immaturity of automated text analysis techniques for deriving business value from large amounts of text. The current focus on information integration in many enterprises is rapidly diminishing the first inhibitor, and advances in machine learning, information retrieval, and statistical natural language processing are eroding the second.

Examples of BIKM problems. To understand the importance of BIKM, it is useful to look at some real business problems and to determine how this technology can provide a return on the investment (ROI). The ROI can be achieved, in general, in one of two ways: (1) through cost reductions and identification of inefficiencies (improved productivity), and (2) through identification of revenue opportunities and growth. Here are some typical scenarios in which our customers believe their business analyses would benefit substantially from data and text integration:


1. Understanding sales effectiveness. A telemarketing revenue data cube can help identify products that are most successfully sold over the phone, sales representatives who generate the most sales, and customers who are the most receptive to this sales approach. Unfortunately, the particular sales techniques used by these successful sales representatives in various situations are not captured by quantitative measures in the OLAP cube. However, these sales conversations are now frequently recorded and converted to text. The text of conversations associated with high-revenue sales representatives and high-yield customers can be analyzed by various language processing or pattern detection techniques to find patterns in the use of phrases or phrase sequences.

2. Improving support and warranty analysis. Frequently in business applications, short text descriptions, from, for example, customer complaints, are recorded in a database but are then encoded into short classification codes by a person. The code fields then become the basis for any business analysis of the set of customer complaints. Variations in the assignment of codes by different people can cause emerging trends or problem situations to be overlooked. The application of modern linguistic and machine-learning techniques (i.e., classification) to the text could provide a more consistent encoding, or at least a validation of the human encoding, as the basis for the business analysis.

3. Relating CRM to profitability. Data cubes for understanding revenues achieved over a set of customers frequently omit the costs associated with individual customers. In some industries these costs can substantially reduce the profit from a customer. The costs can include the number of calls the customer made into the business for problem resolution, complaint handling, or just "hand-holding." Extracting measures of these costs (e.g., time spent on the phone with the customer) and measures of the customer's loyalty for continued business (e.g., sentiment analysis of the customer interaction) from a customer relationship management (CRM) system and merging these measures into the revenue cube would provide a more complete picture of the profitability derived from a customer.6

Environmental issues. We have briefly presented some typical customer scenarios in which bringing text analysis together with classical data analysis can provide increased business value. Naturally, there are environments of varying complexity in which these scenarios occur, and consequently there are a variety of technologies and tools that may be needed in these different environments. In this section, we distinguish three general environments based on the degree of integration of the text and the data sources.

The simplest scenario occurs when the text information sits inside the same database as the business data and is unambiguously associated with the related business data. The text may simply be in character fields in the business data records or in separate tables linked with the data records through common join attributes. In this situation, text analysis techniques can be used to extract value from the text in the form of additional attributes, relationships, and facts that can then be directly related to the business data. As integrated database systems that bring text (e.g., XML [Extensible Markup Language]) together with data in a single database become more common, the ability to use text analysis to enrich the directly related data will also increase.

Currently, most textual information is in systems distinct from the ODS systems used to populate business intelligence data marts. In the simplest case the text system has meta-data that logically correspond to fields in the business data, for example, customer name or product name, which can be used to link the text with the data. However, the text system may use different forms for the meta-data values than those in the database, and this necessitates a data mapping transform to determine the correct association of text to data, for example, associating "DB2" with IBM DB2 Universal Database*, Enterprise-Extended Edition, or "J. Smith" with John W. Smith. These problems are common and difficult when integrating data from different source systems, but for this discussion we assume that enough data cleansing and transformation tools exist to at least somewhat automate this mapping.7
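
The kind of value-mapping transform just described can be sketched very simply. The following illustrative Python fragment is not the Clio tool cited in Reference 7; the alias table and function name are hypothetical, and a production mapping step would also handle customer names, fuzzy matching, and conflict resolution:

    # Illustrative only: canonicalize raw meta-data values from a text source
    # to the dimension values used in the warehouse. Aliases here are made up.
    PRODUCT_ALIASES = {
        "db2": "IBM DB2 Universal Database, Enterprise-Extended Edition",
        "mq": "MQ Series",
        "thinkpad t20": "Thinkpad T20",
    }

    def canonical_product(raw):
        """Map a raw product mention to its dimension value, if an alias matches."""
        key = raw.strip().lower()
        if key in PRODUCT_ALIASES:
            return PRODUCT_ALIASES[key]
        for alias, canonical in PRODUCT_ALIASES.items():
            if alias in key:                  # crude substring fallback
                return canonical
        return raw                            # leave unmapped values for manual cleansing

    print(canonical_product("DB2"))
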


In the absence of adequate meta-data to relate the text to the data, classification technology can be used to categorize the text documents. The classes might correspond to the values in a data attribute—for example, the members in a dimension of an OLAP cube. The assignment of a document to a particular class for a data attribute (e.g., product name) could have a confidence measure associated with it and the document might be assigned to several classes. This classification process may require training, and it should make use of any relevant meta-data available in the database. Once the text has been appropriately related to the business data, it can be processed by the text analysis techniques to extract the desired business information, such as additional attributes, relationships, or facts.8

A more complicated situation arises when, unlike in the previous examples, the sources of text to relate to a business data analysis are not known a priori. The relevant text sources can depend on the type of data analysis being performed, and the number of possibilities for such sources may be very large. In this case, a discovery process is needed to identify the appropriate text sources for the business analysis, and then an association technology is needed to relate the text to the data records. Finally, the appropriate text analysis can be used to extract the business value. As a brief example, consider a business analyst exploring a revenue cube and detecting a downward movement in revenues for a software product in some part of the United States.9 The data cube shows the phenomenon but does not provide any explanation for it. Because the issue is the revenue decline of a software product in a certain region, there is a natural set of questions that might be asked to understand the revenue decline and a substantial number of text sources one might wish to review to find the answers. In general, the questions to be asked depend on the issue under investigation and the characteristics of the data, for example, its schema, its meta-data, its application context. In this example the text sources could include:

1. Enterprise-specific information, such as service call logs about the product and competitive intelligence reports about other companies' products
2. Purchased text information, from sources such as Hoovers and Dun & Bradstreet, on general software market conditions
3. Public documents in Web forums that contain discussions about products, such as ePinions.com

Current work on meta-data to represent the information content published in data sources and work on question-answering systems to match questions to information sources will help to automate the discovery process.10 In all of these cases, the interactive analysis of data and text may ultimately require the use of a modern text-analysis tool to explore the text documents themselves. In the next section we describe such a tool.

eClassifier

Research and development investment in knowledge-management technologies has made significant progress. However, there still exists a need for an approach that integrates complementary and best-of-breed algorithms with guidance from domain expertise. eClassifier was designed to fill this need by incorporating multiple algorithms into an architecture that supports the integration of additional algorithms as they become proven. eClassifier is an application that can quickly analyze a large collection of documents and utilize multiple algorithms, visualizations, and metrics to create and to maintain a taxonomy. The taxonomies that eClassifier helps to create can be arbitrarily complex hierarchical categorizations. The algorithms and representation must be robust in order to apply across many diverse domains. In our research, we very quickly encountered environments where the documents to be analyzed were ungrammatical and contained misspellings, esoteric terms, and abbreviations. Help-desk problem tickets or discussion groups are examples of such environments.

eClassifier is a comprehensive text-analysis application that allows a knowledge worker to learn from large collections of unstructured documents. It was designed to employ a "mixed-initiative" approach that applies domain expertise, through interactions with state-of-the-art text analysis algorithms and visualization, to provide a global understanding of a document collection. Most of the complexities inherent in text mining are hidden by using default behaviors, which can be modified as a user gains experience. The tool can be used to automatically categorize a large collection of text documents and then provide to a knowledge worker a broad spectrum of controls to refine the building of an arbitrarily complex hierarchical taxonomy. eClassifier has implemented numerous analytical, graphical, and reporting algorithms to allow a deep understanding of the concepts contained within a document collection. The tool has been optimized to analyze over a million documents. Additionally, after a given taxonomy has been generated, a classification object can be published and used within another application, through the eClassifier run-time API (application programming interface), to dynamically retrieve information about the documents as well as to incrementally process new documents. Advanced visualization techniques allow the concept space to be explored from many different perspectives. Multiple taxonomies can be generated and explored to discover new relationships or important cross sections. Text sorting and extraction techniques provide valuable concept summarizations for each category. Many views are provided, including spreadsheets, bar graphs, plots, trees, and summary reports.


We have used eClassifier extensively in conjunction with Lotus Discovery Server* and IBM Global Services on both internal and customer information sources. Based on our application of eClassifier in various domains, with many users, we have reached the conclusion that it is very difficult to automatically produce a satisfactory taxonomy for a diverse set of users without allowing human intervention. The power of eClassifier is that it explicitly provides for the incorporation of human judgment at all appropriate phases of the taxonomy generation process. It provides the necessary tools for understanding the taxonomy, for efficiently editing it, and for validating that the taxonomy is learnable by a classifier.

Document representation. The applications for which eClassifier has typically been applied are characterized by documents about a single concept. Such application domains with documents that are relatively short include help-desk problem tickets and e-mail. In domains with longer, multitopic documents, some preprocessing is needed to break the documents down into conceptual chunks. Typically this is done using document structures such as chapters, sections, or paragraphs.

eClassifier represents each document with a vector of weighted frequencies from a feature space of terms and phrases.11,12 The feature space is obtained by counting the occurrence of terms and phrases in each document and the vector is normalized to have unit Euclidean norm (the sum of the squares of the elements is one). To reduce the feature space representation for efficiency of computation and scalability, while maintaining maximum information, we utilize several techniques. We use stop-word lists to eliminate words bearing no content. We utilize synonym lists to collapse semantically similar words and stem variants to their base form. We use stock phrase lists to eliminate structural or no-content repetitive phrases. Stock phrases can also be automatically detected by the system through the use of statistical counting techniques. We use "include word" lists to identify semantically important terms that must remain in the feature space. Finally, we heuristically reduce the feature space by removing terms with the highest and lowest frequency of occurrences.

Once the feature space is determined, eClassifier uses a dictionary tool that provides a convenient method for dynamically inspecting and modifying the feature space. This tool provides a frequency measure and a relevance measure for each term and phrase. The frequency measure is the percentage of documents in which the term occurs, and the relevance measure is the maximum frequency with which a term occurs in any category, effectively measuring the term's influence on the taxonomy. Terms or phrases with high values for either of these measures should be considered carefully, because they heavily influence the document representation and therefore the resulting taxonomy. We have found this combination of techniques to be important and effective across a broad range of document sources.

Taxonomy generation. The first step in the analysis of the document collection is to create an initial categorization or taxonomy, which can be automated by applying clustering algorithms. In eClassifier we have implemented an optimized variant of the k-means algorithm13,14 using a cosine similarity metric15 to automatically partition the documents into k disjoint clusters. K-means can then be applied to each cluster to create a hierarchical taxonomy. In addition to k-means we have implemented an EM (expectation maximization) clustering algorithm and EM with MHAC (modified hierarchical agglomerative clustering), which is a variant that generates hierarchical taxonomies.16 In practice we have found automatic clustering algorithms to be very effective in creating initial taxonomies, which are used to get a sense of the concepts contained in the document collection. However, clustering does not always partition the documents in ways that are meaningful to a human. To partially address this, we have developed some additional methods for creating taxonomies, one of which is an interactive, query-based clustering that seeds categories based on a set of keywords, tests out the queries, and then refines the clusters based on the observed results. The query-based clusters can then be further subclassed using unsupervised clustering techniques. Finally, we have also found that it is sometimes useful to start with an initial classification based upon meta-data provided with the documents.
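
A minimal sketch of the representation and clustering just described is given below: term-frequency vectors normalized to unit Euclidean norm, and a k-means loop that uses cosine similarity (a dot product on unit vectors). This is not eClassifier's optimized implementation; the function names and the tiny example counts are ours:

    import numpy as np

    def unit_tf_vectors(doc_term_counts):
        """Term-frequency vectors normalized to unit Euclidean norm."""
        x = np.asarray(doc_term_counts, dtype=float)
        norms = np.linalg.norm(x, axis=1, keepdims=True)
        norms[norms == 0] = 1.0
        return x / norms

    def cosine_kmeans(docs, k, iters=20, seed=0):
        """Plain k-means with cosine similarity; on unit vectors cosine = dot product."""
        rng = np.random.default_rng(seed)
        centroids = docs[rng.choice(len(docs), size=k, replace=False)]
        for _ in range(iters):
            labels = (docs @ centroids.T).argmax(axis=1)       # most similar centroid
            for j in range(k):
                members = docs[labels == j]
                if len(members):
                    c = members.mean(axis=0)
                    centroids[j] = c / np.linalg.norm(c)       # normalized centroid
        return labels, centroids

    counts = [[3, 0, 1, 0], [2, 1, 0, 0], [0, 0, 4, 2], [0, 1, 3, 1]]
    labels, centroids = cosine_kmeans(unit_tf_vectors(counts), k=2)
    print(labels)

Applying the same routine recursively to each resulting cluster yields a simple hierarchical taxonomy, in the spirit of the approach described above.
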


Figure 1 eClassifier class table view

Taxonomy evaluation. Once we have an initial taxonomy of the documents, eClassifier provides the means to understand and to evaluate it. This allows us to address the unexpected results that do not meet human expectations. Figure 1 is an eClassifier screenshot showing summary information and statistics for a set of categories (note this could be at any depth in a hierarchical taxonomy). This view provides category label, size, cohesion, and distinctness measures. The vector-space model lends itself to computation of a normalized centroid for each cluster, which represents the average document in the cluster for the current feature space. The category centroid is central to the computation of the summary information in this view.

The category label is generated using a term-coverage algorithm that identifies dominant terms in the feature space. If a single term occurs in 90 percent or more of the documents in a category, the category is labeled with that term. If more than one term occurs with 90 percent frequency, then all of these terms (up to four) will be included in the name, with the "&" character between each term. If no single term covers 90 percent of the documents, then the most frequent term becomes the first entry in the name. The second entry is the one that occurs most frequently in all documents that do not contain the first term of the name. If the set of documents containing either of these two words is now 90 percent of the documents in the category, these two words combined become the name (separated by a comma). If not, this process is repeated. If none of the top four terms is contained in 10 percent or more of the documents, the category is called "Miscellaneous." We have experimented with various other algorithms for labeling categories, including finding the most frequently occurring phrases. Although these sometimes appear to be more meaningful, we have found them to be often misleading and to mischaracterize the category as a whole. Although this algorithm is effective for quickly summarizing a category, we also allow the user to assign a different label at any time.
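
The term-coverage labeling heuristic can be read as the following sketch. The 90 percent and 10 percent thresholds come from the description above, but the tie-breaking details are our simplification, not eClassifier's exact code:

    from collections import Counter

    def label_category(doc_terms, max_terms=4):
        """doc_terms: one set of feature-space terms per document in the category."""
        n = len(doc_terms)
        freq = Counter(t for terms in doc_terms for t in terms)
        ranked = [t for t, _ in freq.most_common()]

        # Terms that individually cover at least 90 percent of the documents.
        dominant = [t for t in ranked if freq[t] >= 0.9 * n]
        if dominant:
            return " & ".join(dominant[:max_terms])

        # None of the top terms is frequent enough: call the category Miscellaneous.
        if all(freq[t] < 0.1 * n for t in ranked[:max_terms]):
            return "Miscellaneous"

        # Greedy coverage: add the term most frequent among still-uncovered documents.
        label, uncovered = [], list(doc_terms)
        while uncovered and len(label) < max_terms:
            best = Counter(t for terms in uncovered for t in terms).most_common(1)[0][0]
            label.append(best)
            uncovered = [terms for terms in uncovered if best not in terms]
            if n - len(uncovered) >= 0.9 * n:   # combined coverage reached 90 percent
                break
        return ", ".join(label)

    print(label_category([{"printer", "driver"}, {"printer"}, {"driver", "install"}]))
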


Figure 2 eClassifier class view

In addition to a label, three metrics are computed for each category by default. The size column displays a count of the number of documents in the category and its percentage of the total collection. Cohesion is a measure of the variance of the documents within a category. The cohesion is calculated based on the average cosine distance from the centroid. We have found that this provides a good measure of the similarity within a category, and categories with low cohesion are good candidates for splitting. Distinctness is a measure of variance of the documents between categories. The distinctness is calculated based on the cosine distance from a category's centroid to its nearest neighboring category's centroid. We have found this to provide a good measure of similarity between categories, so categories with low distinctness are similar to a neighboring category and would be candidates for potential merging.
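
One plausible reading of the two measures is sketched below (the exact formulas used by eClassifier are not given in the text): cohesion here is the average cosine similarity of a category's documents to its normalized centroid, so low values flag loose categories, and distinctness is the cosine distance from that centroid to its nearest neighboring centroid, so low values flag near-duplicates. The toy vectors are invented:

    import numpy as np

    def normalized_centroid(docs):
        c = np.asarray(docs, dtype=float).mean(axis=0)
        return c / np.linalg.norm(c)

    def cohesion(docs, centroid):
        """Average cosine similarity of a category's documents to its centroid
        (low values mean a spread-out category, a candidate for splitting)."""
        docs = np.asarray(docs, dtype=float)
        docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
        return float(np.mean(docs @ centroid))

    def distinctness(centroid, other_centroids):
        """Cosine distance from a centroid to its nearest neighboring centroid
        (low values mean overlap with a neighbor, a candidate for merging)."""
        sims = np.asarray(other_centroids) @ centroid
        return float(1.0 - sims.max())

    categories = {
        "printer": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
        "network": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
    }
    centroids = {name: normalized_centroid(d) for name, d in categories.items()}
    for name, docs in categories.items():
        others = [c for n, c in centroids.items() if n != name]
        print(name, cohesion(docs, centroids[name]), distinctness(centroids[name], others))
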


Figure 3 eClassifier visualization

Category evaluation. In addition to understanding a given taxonomy at a macro level, it is important to be able to precisely understand what core concept a category represents. To address this need, eClassifier provides a special view that shows statistics about term frequency, induced classification rules, and document examples as shown in Figure 2. eClassifier has a bar graph representation of the category centroid. For each term in the feature space, it shows the frequency of occurrence within the given class (red bar) and the frequency of occurrence within the total document collection (blue bar). The terms are sorted in decreasing order of red minus blue bar height in order to focus attention on the most relevant terms for the class. The class components panel visualizes the effect of certain terms (inclusion = +, exclusion = -) when used as a decision tree classifier. Nodes in the tree are selected based on minimizing the entropy of in-category vs out-of-category documents. If certain rules are particularly meaningful, a user can click on the node and create a new class from the identified set of documents. Finally, this view provides example documents from each category. Several sorting techniques are available. Ordering by "most typical" is calculated based on proximity to the centroid. This is an effective technique—examining a few documents close to the centroid helps a user to understand the essence of the category's concept. Ordering by "least typical," by showing example documents farthest from the centroid, helps the user to evaluate uniformity within the category. Examples that the user identifies as not really belonging to the category can easily be moved to other categories or to a newly created category. With each modification, all relevant statistics are dynamically updated.

Taxonomy visualization. Visualization is an important technique to convey information. eClassifier uses visualization to help a user to explore the relationships between categories of documents within a taxonomy. We show an example of eClassifier's visualization in Figure 3. In the visualization each dot represents a document, which, when clicked on, will be displayed. The color of the dot denotes its membership within a corresponding category. A large dot represents a category centroid, which is the average feature vector for the documents in that category. For each rendering we select the centroids of three categories to form a plane. All the documents are then projected onto this plane.

This visualization is useful for exploring the relationship between various categories. We can quickly see which categories are close in proximity and we can find specific documents that may span these categories by selecting documents that lie on their respective borders.
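
One way to realize the rendering just described is sketched below: build an orthonormal basis for the plane through the three selected centroids (Gram-Schmidt on the difference vectors) and project every document into that two-dimensional coordinate system. This is our reading of the projection; the authors' visualization method is the one cited in Reference 17:

    import numpy as np

    def plane_coordinates(docs, c1, c2, c3):
        """Project document vectors onto the plane spanned by three category centroids."""
        # Orthonormal basis for the plane through c1, c2, c3 (Gram-Schmidt).
        u = c2 - c1
        u = u / np.linalg.norm(u)
        v = c3 - c1
        v = v - (v @ u) * u
        v = v / np.linalg.norm(v)
        rel = np.asarray(docs) - c1                     # documents relative to c1
        return np.column_stack((rel @ u, rel @ v))      # 2-D coordinates for plotting

    c1, c2, c3 = np.eye(3)                              # three toy centroids
    docs = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])
    print(plane_coordinates(docs, c1, c2, c3))

Re-rendering with different triples of centroids gives the multiple views discussed next.
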


The visualization gives multiple views of the data by allowing the user to select different planes on which to project. This can be done for all possible selections of three centroids to show many different views of the data, in procession. This process is known as touring.17 The visualization also has a "navigator mode," which displays only closely related categories and allows the user to navigate by clicking on encircled centroids to show that category's most closely related categories.

Classification. Once a taxonomy is created for a document collection, it is often useful to assign additional documents to the taxonomy as they become available. To do this, eClassifier creates a batch classifier to process the additional documents. We have found that no single classification algorithm is superior under all circumstances, so we have implemented four algorithms and a methodology for evaluating which is best for a given document collection. For a given taxonomy, half of the documents are selected as the training set and half are left as the test set. A classification model is generated for each of the four implemented algorithms (nearest centroid, naive Bayes multivariate, naive Bayes multinomial, and decision tree) based on the training set. The best algorithm is then selected by determining classification accuracy performance on the test set. At each level of the taxonomy hierarchy a different classifier may be selected, based on which approach is most accurate at classifying the documents at that level. Additionally, as was the case during the clustering process, we allow complete control over the selection of the classification approach. Based on the (lack of) classification accuracy of the model selected, the user may choose to make adjustments to the taxonomy to improve the accuracy of the classifier. The classification accuracy for various classification algorithms can also be visualized in the class view (see Figure 1).
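
The model-selection step just described can be sketched with scikit-learn stand-ins for the four algorithms named above (nearest centroid, multivariate and multinomial naive Bayes, and a decision tree). eClassifier's own implementations, feature handling, and accuracy bookkeeping will differ, and the toy documents are invented:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import BernoulliNB, MultinomialNB
    from sklearn.neighbors import NearestCentroid
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    docs = ["printer jams on startup", "printer driver install fails",
            "cannot connect to network", "network password reset",
            "printer out of toner", "vpn network drops connection"] * 5
    labels = ["printer", "printer", "network", "network", "printer", "network"] * 5

    X = CountVectorizer().fit_transform(docs)
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.5, random_state=0)   # half train, half test, as above

    candidates = {
        "nearest centroid": NearestCentroid(),
        "naive Bayes (multivariate)": BernoulliNB(),
        "naive Bayes (multinomial)": MultinomialNB(),
        "decision tree": DecisionTreeClassifier(random_state=0),
    }
    scores = {}
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))

    best = max(scores, key=scores.get)
    print(scores, "-> selected:", best)
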
Analysis and reporting. In addition to the techniques described for taxonomy generation and maintenance, eClassifier provides several techniques for deeper analysis of the text, for example discovery of correlations of the text with corresponding data and for comparing document collections. The first technique we call "FAQ analysis" because it has commonly been applied to find frequently asked questions in help-desk data sets, although it can, in general, find frequently occurring topics in any document collection. Discovery of correlations is useful when analyzing a given taxonomy with respect to time (trend analysis) or against other associated meta-data. eClassifier will run a chi-squared test to find statistical anomalies for a given category in relation to other categorical attributes associated with documents. Continuous variables, such as time, are made discrete before analysis. Analyzing an attribute vs time in this way can lead to the discovery of spikes or other interesting trends. This technique can also be applied to any categorical data associated with the document. For example, assume we generated a technology-based taxonomy of patents using eClassifier. We could then analyze the patents to see which technologies are receiving the most patents over time. Once we know which technologies are "hot," we could then analyze the patents with their associated corporate assignees to see which corporations are active in the hot technologies.

Another useful analysis is to use a generated taxonomy to compare document collections. For a given taxonomy and collection of documents, we can analyze a second collection of documents to discover which areas are poorly covered within the taxonomy. We have applied this technique to help-desk problem tickets and the associated self-help knowledge bases to identify knowledge gaps, for example, problems that are not well covered in the knowledge base. This can also be used to compare a collection of requirements documents against a collection of capability documents to discover knowledge-gap deficiencies.
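
The correlation check described above amounts to a chi-squared test on a contingency table of category membership against a discretized attribute such as time. The sketch below uses scipy; the counts are invented, and the significance thresholds eClassifier applies are not specified in the text:

    import numpy as np
    from scipy.stats import chi2_contingency

    # Rows: documents in / not in the category of interest.
    # Columns: discretized time periods. Counts are made up to illustrate the test.
    periods = ["2001Q3", "2001Q4", "2002Q1", "2002Q2"]
    in_category = np.array([12, 15, 14, 48])            # a spike in the last quarter
    out_of_category = np.array([300, 310, 305, 298])

    table = np.vstack([in_category, out_of_category])
    chi2, p_value, dof, expected = chi2_contingency(table)
    print(f"chi2={chi2:.1f}, p={p_value:.3g}")           # a small p flags an anomaly

A small p-value for a category-by-period table is what flags the kind of spike discussed above.
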

An integration paradigm

In each of the environments discussed earlier, text is ultimately associated with business data records to enhance the understanding of the data. In some analysis-oriented environments, just bringing the associated text "near" the data with a flexible, interactive browsing and analysis tool such as eClassifier is sufficient to provide the user with some explanation for the business phenomenon. In the "discovery" environment this may be the natural and only realizable paradigm. Therefore, in addition to search capability, mechanisms to discover patterns, attributes, and schema in the documents, allowing them to be readily associated with the data, and tooling to provide an interactive analysis environment for both data and text will be helpful here. Though a valuable step, this approach has scalability problems if the number of sources or the number of associated documents is large.


Figure 4 Example star schema data model

PRODUCT DIMENSION
  PKey  Group     Type       Product
  01    Software  Database   DB2
  02    Software  Messaging  MQ Series
  03    Hardware  Server     S/390
  04    Hardware  PC         Thinkpad T20
  ...

GEOGRAPHY DIMENSION
  GKey  Country  State   City
  01    USA      CA      San Jose
  02    USA      NY      NY
  03    USA      IL      Chicago
  04    Canada   Quebec  Toronto
  ...

REVENUE FACTS
  PKey  GKey  TKey  Revenue  Units
  01    01    02    1000     1
  01    02    02    2000     2
  03    03    03    25000    5
  04    04    04    20000    2
  ...

DATE DIMENSION
  DKey  Year  Quarter  Day
  01    2002  Q1       Jan 1
  02    2002  Q2       Apr 15
  03    2001  Q4       Dec 10
  04    2001  Q3       Aug 20
  ...

In the more narrowly constrained first and second environments discussed earlier, we might strive to achieve a tighter integration of the text information with the associated data. One way to do this is to use an OLAP multidimensional data model1 as the integrating mechanism. The typical dimensional data model for an OLAP system uses a star schema as the model for a data cube. A basic star schema consists of a fact table at its center and a corresponding set of dimension tables, as shown in Figure 4. A fact table is a normalized table that consists of a set of measures or facts and a set of attributes represented by foreign keys into a set of dimension tables. The measures are typically numeric and additive (or at least partially additive). Because fact tables can have a very large number of rows, great effort is made to keep the columns as concise as possible. A dimension table is highly denormalized and contains the descriptive attributes of each fact table entry. These attributes can consist of multiple hierarchies as well as simple attributes. Because the dimension tables typically consist of less than 1 percent of the overall storage requirement, it is quite acceptable to repeat information to improve system query performance. The level at which the dimensions and measures encapsulate the data is referred to as the "fact grain." An example of a low-level grain is at the transactional level, where the dimensions are the product, geography, and date of the transaction, and the measures are the dollar revenue and units sold.

In the example in Figure 4 we have three dimension tables: product, geography, and date. The product dimension has an associated hierarchy: group → type → product. The geography dimension has an associated hierarchy: country → state → city. The date dimension has an associated hierarchy: year → quarter → day. These three dimensions represent the attributes that we can use to analyze our facts. In this example, we have a revenue fact table. Each row in the fact table represents the aggregate transactions at the lowest level in each of the dimensions. In this case, each fact is aggregated at product, city, and day. The measures that are aggregated are revenue and units sold. This model allows us to explore the effect of product, geography, and date, at all levels in each hierarchy, on revenue and units sold and other measures computed from these values such as total revenue, average revenue, total units sold, and average units sold. Typically an analyst would use an application to analyze these various measures by looking at trends over time, or by finding weaknesses or strengths in products or geographies.
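
For concreteness, the following sketch performs one such roll-up on the example rows of Figure 4 using pandas: the fact table is joined to its product and date dimensions, and total revenue and units are computed by product group and quarter. The choice of aggregation is ours; an analyst tool would generate the equivalent query against the warehouse:

    import pandas as pd

    product = pd.DataFrame({"PKey": ["01", "02", "03", "04"],
                            "Group": ["Software", "Software", "Hardware", "Hardware"],
                            "Type": ["Database", "Messaging", "Server", "PC"],
                            "Product": ["DB2", "MQ Series", "S/390", "Thinkpad T20"]})
    date = pd.DataFrame({"DKey": ["01", "02", "03", "04"],
                         "Year": [2002, 2002, 2001, 2001],
                         "Quarter": ["Q1", "Q2", "Q4", "Q3"]})
    facts = pd.DataFrame({"PKey": ["01", "01", "03", "04"],
                          "GKey": ["01", "02", "03", "04"],
                          "TKey": ["02", "02", "03", "04"],
                          "Revenue": [1000, 2000, 25000, 20000],
                          "Units": [1, 2, 5, 2]})

    # Join the fact table to its dimensions, then roll up along the hierarchies.
    cube = (facts.merge(product, on="PKey")
                 .merge(date, left_on="TKey", right_on="DKey"))
    rollup = (cube.groupby(["Group", "Year", "Quarter"])[["Revenue", "Units"]]
                  .sum()
                  .reset_index())
    print(rollup)
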


Integrating text information into this analysis requires progress in several areas of text analytics in which we are currently working. The first is the use of text classification technology either to find attributes in the documents that can be used to link them to the data, or to find attributes in the documents that can be used as additional dimensions to deepen the understanding of the data. Second, we are researching information extraction technologies to process the text and to compute quantitative values from the documents. The quantitative values can then be used as measures in a document fact table (Table 1). The combined data can not only be "sliced and diced" in the traditional OLAP paradigm of data analysis, but also the related documents can be explored in various ways that exploit their structure to make their content more useful. This interaction model and its underlying information model is an area for our current research.

Consider again the example in Figure 4. The facts have keys for the dimensions of product, geography, and date. Now we also have a database of problem tickets resulting from service calls. The problem tickets have meta-data recorded along with a transcript of the problem description. If we run a set of analyses over this collection of documents we can hope to accomplish several things. First, by using a classification process we can divide the problem tickets by problem type, thereby creating a new dimension, in addition to the existing meta-data dimensions, into which problem tickets can be categorized. Second, by running experimental text analyses over the text of the problem tickets we can attempt to quantify the severity of the problem in the ticket. Upon doing this, the problem ticket documents can be organized into their own fact table with the schema shown in Table 1.

Table 1 Schema for problem ticket documents
  Product key | Geography key | Date key | Customer key | Problem type key | Days open | Severity of the complaint | Ticket identifier

The first four columns, which are foreign keys into dimension tables, are derived from the meta-data associated with the tickets in the problem ticket database. The fifth column is now a dimension associated with the problem that was created by automatically classifying the tickets. The sixth column is a measure associated with the problem that can be calculated from the meta-data. The seventh column is a measure of the severity of the problem as calculated by a text analysis of the transcription of the call. This may be a scoring of the frustration or anger felt by the caller. Finally, the last column ties this document fact back to the original document in the ticket database to facilitate movement from the OLAP environment of these facts into a document analysis environment.

Given these fact table schemas, if we roll up the first fact table (Figure 4) along the product, geography, and date dimensions, while computing the average dollar sales and average units sold, and if we roll up the second table along the product, geography, date, and problem dimensions while computing the number of customer keys, the average days open, and the average complaint severity, then the join will give us a picture of the revenue as well as the problem costs for each product, per location, per time period associated with a problem type (e.g., installation, missing CDs, etc.). Then, with an integrated tooling environment we can perform this type of quantitative dimensional OLAP analysis and then seamlessly move into a document analysis to understand the complaints in more depth. A discussion of such an experimental tooling environment that has been built at the Almaden Research Center follows.
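
The join described in the previous paragraph can be sketched as follows. The two small frames stand in for the revenue fact table and the problem-ticket fact table after each has been rolled up to a common grain (product, city, quarter); the rows and any column names beyond those of Figure 4 and Table 1 are invented:

    import pandas as pd

    # Revenue fact table rolled up to (product, city, quarter), averaging the measures.
    revenue = pd.DataFrame({
        "Product": ["Thinkpad T20", "Thinkpad T20", "DB2"],
        "City": ["San Jose", "Chicago", "San Jose"],
        "Quarter": ["2002Q1", "2002Q1", "2002Q1"],
        "AvgRevenue": [1800.0, 1750.0, 950.0],
        "AvgUnits": [1.2, 1.1, 1.0]})

    # Problem-ticket fact table rolled up to the same grain plus problem type.
    tickets = pd.DataFrame({
        "Product": ["Thinkpad T20", "Thinkpad T20", "DB2"],
        "City": ["San Jose", "Chicago", "San Jose"],
        "Quarter": ["2002Q1", "2002Q1", "2002Q1"],
        "ProblemType": ["installation", "missing CDs", "installation"],
        "Customers": [42, 17, 9],
        "AvgDaysOpen": [3.5, 1.2, 2.0],
        "AvgSeverity": [0.62, 0.30, 0.41]})

    # Join on the shared dimensions: revenue next to problem costs per problem type.
    picture = revenue.merge(tickets, on=["Product", "City", "Quarter"])
    print(picture)
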

Integrated BIKM tools (Sapient & eClassifier). In the previous section we describe our text analysis system, eClassifier. In this section we describe the tooling we have built to apply the OLAP data model to text documents, creating a document warehouse. We then describe how we link the data model for the data and the documents through shared dimensions, and how this enhances our analytical capabilities. Finally, we describe how text analytics can be used to dynamically enhance this data model with what we call dynamic dimensions.

The tool we have developed allows us to explore data cubes with a star schema and consists of a report view and navigational controls. The report view provides a view of the results of data queries on a data cube. The reports can be summary tables (Figure 5), trend line graphs (Figure 6), or pie charts. An important part of the navigational controls consists of the dimensions and metrics selection boxes. The dimension selection box allows the user to select and drill down on each dimension. This includes drilling down a dimension hierarchy or cross drilling from one dimension to another. The metric selection box allows the user to select metrics that are computable for the given data cube. Additional navigation buttons allow forward and backward navigation to view previous reports. Other navigation controls are discussed later.


Figure 5 Document counts for products

Document warehousing. We extend the techniques used on data in business intelligence to documents by using a dimensional model where the fact table granularity is a document, and the dimension tables hold the attributes of the document. Without additional processing this representation is a "factless" fact table, because there are, as yet, no associated measures. The process of populating the document warehouse has some complexities beyond typical ETL processing. In many cases the source of the documents is not an operational data store. Typically documents are automatically and incrementally put into the document warehouse based on either a subscription (push model) or a scheduled retrieval process (pull model). Additionally, we need a method to filter the documents because not all documents will be relevant to the purpose of the document cube.

Depending on the source, most documents have some associated meta-data that can naturally be used to populate some dimensions, such as author, date of publication, and document source. However, there are dimensions of potential interest that may not be included in the meta-data. If the dimension is known, classification techniques can be used to populate it. Using this model, all of the techniques previously described that are available to data cubes are now available to document cubes.

Shared dimensions. Thus far we have shown how star schemas can be used to organize and analyze both data and document cubes. Although each on their own can provide very useful information, providing a mechanism to link them will allow deeper analysis and thereby provide greater value. As an example, we revisit our product-geography-date revenue cube from Figure 4. If we have a collection of documents that are relevant to the given products, in the given geographies, over the given times, the information they contain and its relationship to the business data analysis can greatly improve decision making. Some documents that could provide insight in this example would be sales logs, customer support logs, news and press articles, marketing material, and discussion groups. All of these could provide unique insights into why a product is selling well or poorly in a given geography during a given time frame. The key to achieving these insights is to directly link the data to the documents through shared dimensions. An example data model of data and document cubes with shared dimensions is illustrated in Figure 7.
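
A minimal schema sketch of the shared-dimension model of Figure 7 is shown below, written as SQL and run through Python's built-in sqlite3 for concreteness. The column names follow the figure; the types, key declarations, and the renaming of the DATE dimension to date_dim are our assumptions:

    import sqlite3

    ddl = """
    CREATE TABLE product   (Prod_ID INTEGER PRIMARY KEY, Prod_Line TEXT, Prod_Group TEXT, Product TEXT);
    CREATE TABLE geography (Geo_ID  INTEGER PRIMARY KEY, Site TEXT, Location TEXT);
    CREATE TABLE date_dim  (Date_ID INTEGER PRIMARY KEY, Date TEXT, Month TEXT, Year INTEGER);

    -- The data fact and the document fact reference the same (shared) dimension tables.
    CREATE TABLE data_fact (
        Prod_ID INTEGER REFERENCES product,
        Geo_ID  INTEGER REFERENCES geography,
        Date_ID INTEGER REFERENCES date_dim,
        Revenue REAL, Units INTEGER);

    CREATE TABLE document_fact (
        Doc_ID  INTEGER PRIMARY KEY,
        Prod_ID INTEGER REFERENCES product,
        Geo_ID  INTEGER REFERENCES geography,
        Date_ID INTEGER REFERENCES date_dim,
        Cause_ID INTEGER, Subj_ID INTEGER, Sol_ID INTEGER);
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(ddl)
    print([r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")])

Because both fact tables carry the same dimension keys, any slice expressed over the data cube can be reused verbatim to select the corresponding documents.
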


Figure 6 Time trend chart for products

Dynamic dimensions. At this point we have data and document cubes that are linked through shared dimensions. All of the analytical techniques used on data cubes can be used on the document cubes. Given the linkage created by shared dimensions, we can use the constraints used to identify a subset of data to then identify the corresponding set of documents and then make inferences from those documents about the data. For example, if the data show a drop in revenue for a product in certain geographies during a given time period, we can use these constraints on the document cube to identify the documents that might best explain the drop in revenue. We can then use standard OLAP techniques to investigate the relationship to any additional (non-shared) dimensions available for the documents. However, sometimes the existing dimensions and their taxonomies may be insufficient to fully explain the data. The documents can then be further analyzed using a deeper text analytical system such as eClassifier. We have provided this in our BIKM system by augmenting the document warehouse with an additional table (i.e., the token table) that has the document identifier, token identifier, and token offset for every token in every document (shown in Figure 7). The token table allows us to dynamically select (extract) and initiate eClassifier on an arbitrary subset of the documents from the document warehouse. Once we have invoked eClassifier on the documents we can perform all of the analytical capabilities outlined previously.

Furthermore, eClassifier can be used to create a new taxonomy over this selected set of documents. This new taxonomy is effectively a new (hierarchical) dimension that adds value to the existing data and document cubes. For example, problem tickets can be classified into problem types. This dimension provides a finer granularity for understanding the problems that are contributing to the costs associated with products in a given region and time period.
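
The constraint-driven selection described above can be sketched as a filter on the document fact table using the same shared dimension keys that isolate the revenue drop in the data cube; the resulting document identifiers are what would be handed to eClassifier. The keys and rows below are invented:

    import pandas as pd

    # A document fact table keyed by the same shared dimensions as the data fact table.
    doc_fact = pd.DataFrame({
        "Doc_ID":  [101, 102, 103, 104, 105],
        "Prod_ID": [4, 4, 4, 1, 4],
        "Geo_ID":  [1, 1, 3, 1, 1],
        "Date_ID": [2, 2, 2, 2, 3]})

    # Constraints taken from the slice of the data cube that shows the revenue drop.
    slice_keys = {"Prod_ID": 4, "Geo_ID": 1, "Date_ID": 2}

    mask = pd.Series(True, index=doc_fact.index)
    for column, key in slice_keys.items():
        mask &= doc_fact[column] == key

    candidate_docs = doc_fact.loc[mask, "Doc_ID"].tolist()
    print(candidate_docs)          # the document subset to hand to eClassifier
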


Figure 7 Shared dimension data model
[The figure shows a data fact table and a document fact table that share the PRODUCT (Prod_ID, Prod_Line, Prod_Group, Product), GEOGRAPHY (Geo_ID, Site, Location), and DATE (Date_ID, Date, Month, Year) dimensions. The document fact table has additional dimensions CAUSE (Cause_ID, Cause), SUBJECT (Subj_ID, Subject), SOLUTION (Sol_ID, Solution), and DOCUMENT (Doc_ID, Title, Abstract), and is linked to a TOKEN table (Doc_ID, Token_ID, Offset). Data fact metrics: transaction counts, total revenue, average revenue, total units, average units. Document fact metrics: document counts.]

The new taxonomy can be made available to the document warehouse by creating a corresponding dimension table to represent the taxonomy and then populating an added column in the fact table, associating all known documents with the newly published dimension. This new dimension is now available to all of the analytical and reporting capabilities in the OLAP environment. Additional processing can be performed to classify all of the documents that were not in the extracted set of documents into the new dimension.

For example, we selected the "ThinkPad* T20" product (see Figure 5) and extracted into eClassifier the 2858 documents associated with this product. We used eClassifier to produce the new taxonomy shown in Figure 8. We then saved this taxonomy for the document warehouse by publishing it as the "new thinkpad taxonomy" dimension and updating the document fact table appropriately. This allows us to drill from within the data warehouse, and the results are shown in Figure 9.

Summary and future research

The previous sections discuss our current integration model for data and text analysis and the tooling we have built to experiment with it. The missing, and somewhat open-ended, portion of this integration is the text analytics that will be used to create the quantitative metrics that populate the document cube and augment the data cube metrics. There is significant work going on in the IBM research community, especially within the unstructured information management area, to perform information extraction from documents. These efforts include: (1) extracting quantitative facts from documents (e.g., the financial terms of a contract); (2) deducing relationships between entities in a document (e.g., new product A competes with product B); and (3) measuring the level of subjective values such as severity or sentiment in documents (e.g., a customer letter reflects extreme displeasure with a company's service). Currently we are exploring techniques to accomplish these tasks based on statistical machine-learning approaches. We hope to report on these in a future paper.

Another area of future research that we believe is promising is the integration of ontologies into the taxonomy generation and dimension publishing portions of our BIKM architecture. Ontologies provide a level of semantics that we do not currently address, allowing improved taxonomies and reasoning about the data and text. Furthermore, emerging ontological technologies such as the semantic Web can provide a vehicle to integrate the text and data under study with a far larger body of text and data, thereby expanding the potential insights.


Figure 8 eClassifier taxonomy for ThinkPad T20 documents

In this paper we show that text integrated with business data can provide valuable insights for improving the quality of business decisions. We describe a text analysis framework and how to integrate it into a business intelligence data warehouse by introducing a document warehouse and linking the two through shared dimensions. We believe that this provides a platform on which to build and research new algorithms to find the currently hidden business value in the vast amount of text related to business data. Technologies in the areas of information extraction and integrated text and data mining will build on this framework, allowing it to achieve its full business potential.

Acknowledgments

The authors gratefully acknowledge the contributions of Dharmendra Modha, Ray Strong, Justin Lessler, Thomas Brant, Iris Eiron, Hamid Pirahesh, Shivakumar Vaithyanathan, and Anant Jhingran for their contributions to eClassifier, Sapient, and the underlying ideas of BIKM.

*Trademark or registered trademark of International Business Machines Corporation.

Cited references

1. R. Kimball, The Data Warehouse Toolkit, John Wiley & Sons, Inc., New York (1996).
2. D. Sullivan, Document Warehousing and Text Mining, John Wiley & Sons, Inc., New York (2001).
3. T. Nasukawa and T. Nagano, "Text Analysis and Knowledge Mining System," IBM Systems Journal 40, No. 4, 967–984 (2001).
4. W. Pohs, Practical Knowledge Management, IBM Press, Double Oak, TX (2001).
5. W. Pohs, G. Pinder, C. Dougherty, and M. White, "The Lotus Knowledge Discovery System: Tools and Experiences," IBM Systems Journal 40, No. 4, 956–966 (2001).
6. See http://www-4.ibm.com/software/data/bi/banking/ezmart.htm.
7. M. Hernandez, R. J. Miller, and L. Haas, "Clio: A Semi-Automatic Tool for Schema Mapping," Proceedings, Special Interest Group on Management of Data, Santa Barbara, CA (May 21–24, 2001).
8. See http://www.itl.nist.gov/iad/894.02/related_projects/muc/index.html.
9. S. Sarawagi, R. Agrawal, and N. Megiddo, "Discovery-Driven Exploration of OLAP Data Cubes," Proceedings, 6th International Conference on Extending Database Technology, Valencia, Spain (March 23–27, 1998), pp. 168–182.
10. C. Kwok, O. Etzioni, and D. S. Weld, "Scaling Question Answering to the Web," Proceedings, 10th International World Wide Web Conference, Hong Kong (May 1–5, 2001), available at http://www10.org/cdrom/papers/120/.
11. G. Salton and M. J. McGill, Introduction to Modern Retrieval, McGraw-Hill Publishing, New York (1983).
12. G. Salton and C. Buckley, "Term-Weighting Approaches in Automatic Text Retrieval," Information Processing and Management 4, No. 5, 512–523 (1988).
13. R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, John Wiley & Sons, Inc., New York (1973).


14. J. A. Hartigan, Clustering Algorithms, John Wiley & Sons, Inc., New York (1975).
15. E. Rasmussen, "Clustering Algorithms," W. B. Frakes and R. Baeza-Yates, Editors, Information Retrieval: Data Structures and Algorithms, Prentice Hall, Englewood Cliffs, New Jersey (1992), pp. 419–442.
16. S. Vaithyanathan and B. Dom, "Model-Based Hierarchical Clustering," available at http://www.almaden.ibm.com/cs/people/dom/papers/uai2k.ps.
17. I. Dhillon, D. Modha, and S. Spangler, "Visualizing Class Structures of Multi-Dimensional Data," Proceedings, 30th Conference on Interface, Computer Science and Statistics, May 1998.

Accepted for publication July 12, 2002.

Figure 9 Dynamic dimension results

William F. Cody IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose, California 95120 (electronic mail: wcody@almaden.ibm.com). Dr. Cody is a senior manager of the Knowledge Middleware and Technology group at IBM's Almaden Research Center. He received his Ph.D. degree in mathematics in 1979 and has held various product development, research, and management positions with IBM since joining the company in 1974. He has published papers on database applications, database technology, software engineering, and group theory.

Jeffrey T. Kreulen IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose, California 95120 (electronic mail: kreulen@almaden.ibm.com). Dr. Kreulen is a manager at the IBM Almaden Research Center. He holds a B.S. degree in applied mathematics (computer science) from Carnegie Mellon University and an M.S. degree in electrical engineering and a Ph.D. degree in computer engineering, both from Pennsylvania State University. Since joining IBM in 1992, he has worked on multiprocessor systems design and verification, operating systems, systems management, Web-based service delivery, and integrated text and data analysis.

Vikas Krishna IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose, California 95120 (electronic mail: vikas@almaden.ibm.com). Mr. Krishna is a software engineer at the IBM Almaden Research Center. He holds a B.Tech. degree in naval architecture from IIT Madras, an M.E. degree in computational fluid dynamics from Memorial University, Newfoundland, Canada, and an M.S. degree in computer engineering from Syracuse University, New York. Since joining IBM in 1997, he has developed systems for Web-based service delivery, business-to-business information exchange, and the integrated analysis of text and data.


W. Scott Spangler IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose, California 95120 (electronic mail: spangles@almaden.ibm.com). Mr. Spangler has been doing knowledge base and data mining research for the past 15 years—lately at IBM and previously at the General Motors Technical Center, where he won the prestigious "Boss" Kettering award (1992) for technical achievement. Since coming to IBM in 1996, he has developed software components, available through the Lotus Discovery Server product and IBM alphaWorks®, for data visualization and text mining. He holds a B.S. degree in mathematics from the Massachusetts Institute of Technology and an M.S. degree in computer science from the University of Texas.
