004

SOME APPLICATIONS OF ANGOSS KNOWLEDGE-SEEKER IV AND
SAS ENTERPRISE MINER FOR COURSE INSTRUCTION ON DATA MINING
OF DATA WAREHOUSES
Richard S. Segall, Ph.D.
Arkansas State University, College of Business,
Department of Economics and Decision Sciences,
State University, AR 72467-0239 USA
Telephone: (870)972-3416
E-mail: rsegall@mail.astate.edu and rssegall@aol.com
Abstract: This paper first provides background on the concepts and development of data mining and data warehousing
that students and educators need to know. It then discusses some of the currently available software tools for
this relatively new field of data mining. Two of these tools, KnowledgeSEEKER IV by Angoss and SAS
Enterprise Miner, are applied to selected databases. The novelty of this paper for instructional
implementations lies in its brief summary of currently available software tools for data mining,
some of which are available as trial version downloads with sample data sets, and in its illustration and discussion of the results
of applying KnowledgeSEEKER IV and Enterprise Miner to databases that would be useful in teaching a course in data
mining. Conclusions and future directions of the research are also discussed.
1. Introduction
1.1 Data Mining
Data mining is sometimes called data or knowledge discovery and is the process of automating information discovery. Data
mining is the process of analyzing data from different perspectives and summarizing it into useful information. Although data
mining is a relatively new term, the technology is not. Companies have used powerful computers to sift through volumes of
supermarket scanner data and analyze market research reports for years.
The potential applications of data mining include database analysis and decision support for market analysis and
management, such as target marketing and customer relationship management, and for risk analysis and management, such as
forecasting and competitive analysis. Other applications include text mining of documents and e-mail, and
web mining of information such as stock quotes, restaurant information, and car prices. Fundamentals of data
mining have been presented in recent texts by Bhandari and Colet[1999], Groth[1998,2000], Han and Kamber[2001],
Kennedy et al.[1997], Maralas[1999], Pyle[1999], and Westphal and Blaxton[1998], the last of which also includes a CD-ROM of more
than ten demonstration versions of the data mining tools described within its text.
Neural networks, one of the key technologies used for data mining, are the subject of an entire text by
Bigus[1996], and an entire text on solving data mining problems through pattern
recognition has been recently published by members of Unica Technology, Inc.[1998].
1.2 Data Warehouses
A data warehouse, as the name implies, is a data store for a large amount of corporate data. Data warehousing opens new
possibilities in terms of decision support systems. Analysts cannot make good decisions unless they have all of the available
data. A good corporate data warehouse makes that data readily available. In addition, it makes possible a whole new class of
computing applications, described above and now known as data mining.
Fundamentals of data warehouses have been discussed in recent texts by Agosta[2000], Jarke et al.[1998], Bischoff[1997],
and Simon[1997]. Westerman[2000] wrote an entire text on data warehousing using the Wal-Mart Model with the
intent of informing the reader of the general principles and specific techniques one needs to understand to be a valuable part
of an organization’s own data warehouse project. Simon[2001] presented an entire text on the relationship between data
warehousing and business intelligence for electronic commerce.
Russell[1998] shared secrets from the successful implementers of very large data warehouses. The companies illustrated
by Russell[1998] included the catalog giant Spiegel in Downers Grove, Illinois for developing their first data warehouse for
the company's credit department in 1994, Quest Informatics, a unit of the nation’s largest clinical testing laboratory, and PCS
Health Systems which developed its data warehouse in 1994.
Whiting[1999] debated the best approach to designing and building a data warehouse system by discussing the creation of
data warehouses based on a hybrid architecture, which some call "federated" or "hub-and-spoke" systems, that incorporates
aspects of centralized data warehouses and distributed data marts. Whiting[1999] also emphasized that the problem with having
multiple data marts is that each may have its own way of defining and organizing information. The inconsistency inherent in
data marts makes it nearly impossible to integrate them into a single, centralized system. Data marts also tend to be focused
on individual products or product lines. But more businesses today are trying to become customer-focused, which requires
understanding how an enterprise is engaged with a customer across all its business units and products.
Data warehouses pertaining to distributed databases over a geographical domain have been discussed in an entire text by
Rigaux et al.[2002] on spatial databases with applications to GIS (geographical information systems). Data warehouses with
time-varying data have been discussed in an entire text by Snodgrass[1999] on developing time-oriented database applications
in Structured Query Language (SQL) on database systems such as Oracle 8i and IBM DB2.
1.3 Data Mining and Model Building
Data mining and knowledge discovery involve looking in the data for such factors as associations, sequences, clusters,
forecasting including model fitting, and patterns that could be represented according to classification rules or trees. The
specific models that have been used for data mining include statistical analysis of data, neural networks, expert systems,
fuzzy logic, multidimensional analysis, data visualization, and decision trees.
Data mining is a confluence of multiple disciplines including database systems, data warehousing, on-line analytical
processing (OLAP), statistics, machine learning, visualization, information science, neural networks and mathematical
modeling. Figure 2 presents a view of the multi-tiered architecture for data warehouses, showing the relationship between
data sources, the data warehouse, data marts, OLAP, and tools used on a data warehouse such as analysis, query, reports, and
data mining. A data mart as shown in Figure 2 is a specialized system that brings together the data needed for a department or
set of related applications. The data mining functions for algorithms used in previous papers by the author
Segall[1984,1988,2002] for medical databases include modeling via linear and nonlinear regression, curve fitting to models,
and others. Data mining functions have also been used by the author Segall[1995,1996,2001] for databases obtained from
applications to models for learning rules of neural networks.
An introduction to the applications of data mining and data warehousing to these databases was presented in
Segall[2001]. The novelty of this paper lies in the illustration of results that can be obtained using two of the current software
tools available for data mining of selected data warehouses, combined with a list of forty-five (45) currently available software
tools in data mining as compiled by the author, along with a brief background on data mining and data warehouses. It is hoped
that readers will become interested and motivated to investigate the application of the software tools discussed for their
individual teaching and research needs.
In addition to the management of the data, a major concern in data mining is the quality of the data. Most databases
contain incomplete and inaccurate data. When limited data is available, values must be estimated for the missing data. The
most common technique is to set the missing data fields equal to the mean or median value for numeric data, and to the mode for
non-numeric data. As noted by Simoudis et al.[1995], outliers in the data can also severely impact the performance of a neural
network model. Cortes et al.[1995] also investigated the effects of bad data on learning rules for neural networks.
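The imputation rule described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `impute`, the use of `None` to mark a missing value, and the sample values are all assumptions made for the example and do not come from any of the data sets discussed in this paper.

```python
from statistics import mean, median, mode

def impute(column, numeric=True, use_median=False):
    """Fill missing entries (None) with the mean or median for numeric
    data, or with the mode for non-numeric data."""
    observed = [v for v in column if v is not None]
    if numeric:
        fill = median(observed) if use_median else mean(observed)
    else:
        fill = mode(observed)
    return [fill if v is None else v for v in column]

# Hypothetical columns with missing values marked as None.
ages = [34, None, 58, 66, None, 42]
smoker = ["never", "regular", None, "never"]

print(impute(ages))                    # missing ages replaced by the mean
print(impute(ages, use_median=True))   # missing ages replaced by the median
print(impute(smoker, numeric=False))   # missing category replaced by the mode
```

The median is often preferred over the mean when the column contains outliers, since a few extreme values can pull the mean far from the typical case.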
2. Objectives of Paper
One of the purposes of this paper is to present illustrations of some of the graphical means of presenting the results of data mining
using databases made available by the software vendors. The novelty of this paper is the illustration of the selection of variables
and attributes by the author, which yielded a set of concise tutorials for the two software packages that were selected for more detailed
discussion and presentation. These tutorials may be useful as examples of locating free/cheap collections of data for use in
classroom teaching of data mining, which is both a very new and rapidly expanding area of information systems. The packages
discussed in more depth are KnowledgeSEEKER, which utilizes CART (Classification and Regression Trees), and
SAS Enterprise Miner, which is described in depth in SAS Course Notes by Wielenga et al.[1999], in SAS Enterprise
Miner Course Notes by Georges[2001] for performing predictive modeling, and in similar SAS Course Notes by
Truxillo[2001] for statistical analysis.
3. Description of KnowledgeSEEKER IV and CART (Classification and Regression Trees)
Angoss Software Corporation has a decision-tree-based analysis program named KnowledgeSEEKER that utilizes two
decision-tree algorithms: CHAID (Chi-Square Automatic Interaction Detection) and CART (Classification and
Regression Trees). CHAID is used to study categorical data, such as gender or the states in a country.
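As a rough illustration of the chi-square test that gives CHAID its name, the following Python sketch computes the Pearson chi-square statistic for a contingency table of a categorical predictor against a categorical target. The function name `chi_square` and the counts are hypothetical and chosen only for illustration; this is not the KnowledgeSEEKER implementation, which is not described in this paper. The sketch assumes every row and column of the table has a nonzero total.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table,
    given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if predictor and target were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows are two categories of a predictor,
# columns are two levels of the target.
counts = [[10, 20],
          [20, 10]]
print(round(chi_square(counts), 2))
```

A larger statistic indicates a stronger association between the predictor and the target, which is the kind of evidence a chi-square-based tree method can use when comparing candidate splitting variables.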
According to the Salford Systems White Paper Series[2000], the CART methodology is technically known as
binary recursive partitioning. The process is binary because parent nodes are always split into exactly two child nodes. The process
is recursive because the process can be repeated by treating each child node as a parent. The novelty of the example shown in
this paper is that multiple child nodes can be split from a parent node using the CART software.
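The idea of binary recursive partitioning can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the Salford Systems or Angoss implementation: it handles a single numeric predictor, uses the Gini impurity as the splitting criterion, and the function names, the `max_depth` parameter, and the sample (age, blood pressure level) rows are all hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows):
    """rows is a list of (x, label) pairs. Return (score, threshold) for the
    threshold with the lowest weighted Gini impurity, or None if no split exists."""
    best = None
    values = sorted({x for x, _ in rows})
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # candidate threshold midway between adjacent values
        left = [y for x, y in rows if x <= t]
        right = [y for x, y in rows if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if best is None or score < best[0]:
            best = (score, t)
    return best

def grow(rows, depth=0, max_depth=3):
    """Split a parent node into exactly two children (binary), then treat
    each child as a parent in turn (recursive)."""
    labels = [y for _, y in rows]
    split = best_split(rows)
    if depth >= max_depth or gini(labels) == 0.0 or split is None:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    _, t = split
    left = [r for r in rows if r[0] <= t]
    right = [r for r in rows if r[0] > t]
    return {"threshold": t,
            "left": grow(left, depth + 1, max_depth),
            "right": grow(right, depth + 1, max_depth)}

# Hypothetical (age, blood pressure level) rows, for illustration only.
rows = [(35, "low"), (40, "low"), (55, "normal"),
        (60, "normal"), (66, "high"), (70, "high")]
print(grow(rows))
```

Each call to `grow` produces at most two children, so a three-way separation such as low, normal, and high is reached through two successive binary splits rather than one multi-way split.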
CART is an automatic high-speed data analysis tool capable of discovering complex relationships in huge databases
and is marketed by Salford Systems. According to Dan Steinberg, the president of Salford Systems, “The most important data
mining business applications, such as classification and predictive modeling, can be accomplished using just CART.”
The results of CART are displayed in flowcharts that are easy for non-specialists to understand. CART is notably
successful in quality control applications such as predicting assembly-time failures and conditions under which the defects
are most likely to occur. In addition, CART is widely used in business intelligence applications such as direct mail targeting,
managing credit risk, developing customer retention and acquisition strategies, and detecting telecommunications and credit
card fraud. CART has been successfully applied to data warehouses representing pharmaceutical industries.
The CART software uses historical data to discover patterns, trends and relationships, and it automatically generates
high-performance predictive models that can be applied to new data. This information facilitates better business decisions
and increases profitability.
4. Case Applications for Classroom Illustrations of KnowledgeSEEKER IV and CART
Figure 3 shows the application of CART to a blood pressure database of 360 patients from a community health survey. This
data set, which Angoss includes on the demonstration CD-ROM in Groth[1998,2000], records levels of low,
normal, and high blood pressure for three age groups of individuals: 32-50, 51-62, and 63-72. Figure 4 shows a 3-dimensional
plot of the percentage of each of these three age groups who had each of the three blood pressure levels of low, normal,
and high. As can be seen from Figure 4 and from the statistics presented in Figure 3, the maximum percentage of this
sample of 360 patients occurred for those who had normal blood pressure in the 51-62 age group.
The application of KnowledgeSEEKER IV to the blood pressure database provided by Angoss on the CD-ROM of
Groth[1998,2000] indicated that patients in the age group of 32-50 had the highest percentage (35.1%) of low blood
pressure, while the age group of 63-72 had the lowest percentage (5.4%) of low blood pressure. Similarly, the age
group of 32-50 had the lowest percentage (8.8%) of high blood pressure, and the age group of 63-72 had the highest percentage
(48.9%) of high blood pressure.
Applying KnowledgeSEEKER IV to the blood pressure database provided in Groth[1998,2000] however is a difficult
process to illustrate as there are twenty-six variable fields to select from, as well as additional steps that need to be taken to
prepare this database to be mined. One of the purposes of this paper is to simplify this illustration of application and outcome
by selecting well known health hazards of drinking and smoking with or without factors of salt in food and cheese servings
last week respectively, with data on a healthy diet as indicated by number of servings of fish last week.
KnowledgeSEEKER IV provided results indicating that patients in the age group of 63-72 who ate
between two and seven servings of fish last week, independent of data on smoking and drinking behavior, had a significantly
lower percentage of high blood pressure and a greater percentage (19.2%) of low blood pressure.
Data categorized by smoking and drinking patterns were then investigated in relation to a selected variable related to
diet, namely the amount of salt in food consumed, from the data set on the CD-ROM in Groth[1998,2000]. The result is
shown in Figure 5, where it can be seen that patients in the age group of 32-50 who had a regular drinking pattern and
used a lot or a moderate amount of salt in their diets had a greater percentage (19%) of high blood pressure.
Figure 5
The variable of amount of cheese in diet of previous week was then selected as an additional variable for regular or
occasional smokers and is indicated in the lower level of Figure 5 for this smoker category. This additional variable selection
indicated that the patients in the age group of 51-62 who were regular or occasional smokers and ate zero or one servings of
cheese last week had a higher percentage (69.2%) of high blood pressure, which is significantly higher than that for
former or never smokers who used any level of salt in their diets.
The CART software is also available separately from the vendor Salford Systems as a free 30-day trial version,
downloadable from the web site www.salford-systems.com, that allows more flexibility in the tree diagrams obtained from data
mining. Splitters and tree and node details are options that can be created and printed. As stated earlier, one of the purposes of this
paper is to be a collective resource of free data available for demonstration purposes of data mining techniques that can be
incorporated into the teaching of a course by a faculty member. One of the free databases available on the 30-day free
downloadable version of CART is that of tax information for the city of Boston, Massachusetts.
Figures 6(a) and 6(b) show how CART splits the main tree of variables obtained using the splitters option with the
database of tax information for the city of Boston. Figure 6(a) shows the skeleton of the splitters and Figure 6(b) shows the
same figure with all the details as provided by CART.
The application of CART to the Boston, Massachusetts tax data illustrated the usefulness of data mining for separating
the data into smaller, more specific subgroups by selected age, industry type, and tax groups. Further data mining could
be performed on this database to determine the tax contributions of each of these subgroups toward their selected health
plans.
5. Description of SAS Enterprise Miner
SAS Enterprise Miner claims to be the first and only data mining solution that addresses the entire data mining process using
an entirely intuitive point-and-click graphical user interface (GUI). Combined with SAS data warehousing and OLAP
technologies, Enterprise Miner creates a synchronized end-to-end solution that addresses the full spectrum of knowledge
discovery. SAS Enterprise Miner uses a “Sample, Explore, Modify, Model, Assess” (SEMMA) approach to data mining in a
Windows-based, object-oriented environment that is operated by pointing and clicking. Beginning with a statistically
representative sample of your data, the SEMMA approach used by SAS Enterprise Miner makes it easy to apply exploratory
statistical and visualization techniques, select and transform the most significant predictive variables, model the variables to
predict outcomes, and confirm a model’s accuracy.
SAS Enterprise Miner combines powerful statistical analysis with graphical ease of use. Enterprise Miner delivers a
broad range of predictive models that the user can apply, test, and compare to determine the best fit for the available data. In
Enterprise Miner much of the work is done in a GUI by manipulating icons in a Windows environment: the
user connects nodes in the graphical workspace, adjusts settings, and runs the workflow.
6. Case Applications for Classroom Illustrations of SAS Enterprise Miner
This part of the paper describes the type of output that can be obtained from Enterprise Miner, as provided during
a several-day Pilot Training Session administered by SAS professionals as part of a $250,000 grant awarded to my
university due to my efforts. Universities that are awarded SAS grants also have the option of receiving additional sample
data sets from the SAS Educational Division, installed with Enterprise Miner, for illustration in the classroom.
Figure 7 is SAS Enterprise Miner output for a class demonstration example that was provided in the recent Pilot Training
Session at my university, ASU. The data set is a demonstration set of data for approximately two thousand (1,966)
customers and the frequency counts for 49 different items they purchased. The “Enterprise Miner Workspace” shown in Figure
7 includes, for the given database of customer determination data, sample nodes for an input data source and data partition,
modify nodes for transform variables and data set attributes, model nodes for tree construction and regression, and a final
reporter node.
Figure 7
The tree node enables the user to perform multi-way splitting of the given database based on nominal, ordinal, and
continuous variables. The construction of trees in SAS Enterprise Miner is superior to that in KnowledgeSEEKER because it
represents a hybrid of the best of CHAID, CART and other algorithms. Any node in the tree in Enterprise Miner supports both
automatic and interactive training. When the tree is run in automatic mode, it automatically ranks the input variables based on
the strength of their contributions to the tree. This ranking may be used to select variables for use in subsequent modeling. The
user may override any automatic step with the option to define a splitting rule and prune explicit nodes or subtrees. Interactive
training enables the user to explore and evaluate a large set of trees as the user develops them.
The output of Figure 7 indicates that the proportion of correctly classified items increases progressively as the number of
leaves increases. The fit statistics of Figure 7 yielded a root average squared error (RASE) of 0.46, an average squared error of
0.22, total degrees of freedom (DOF) of 1180, and a misclassification rate of 0.34. The splitting criterion used for the tree setting
is the chi-square test with a significance level of 0.2, with a maximum tree depth of 6, a maximum number of branches from
a node of 2, and a maximum number of tries in an exhaustive split search of 5000.
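For classroom use, the relationship between these fit statistics can be shown with a short Python sketch. The definitions below are simplified assumptions for a binary target (average squared error as the mean squared difference between the 0/1 outcome and the predicted probability, and a 0.5 classification cutoff); they are not taken from the SAS documentation, and the sample outcomes and probabilities are hypothetical.

```python
import math

def fit_statistics(actual, prob, cutoff=0.5):
    """Return the average squared error (ASE), its square root (RASE), and
    the misclassification rate for 0/1 outcomes and predicted probabilities."""
    n = len(actual)
    ase = sum((y - p) ** 2 for y, p in zip(actual, prob)) / n
    predicted = [1 if p >= cutoff else 0 for p in prob]
    wrong = sum(1 for y, y_hat in zip(actual, predicted) if y != y_hat)
    return {"ase": ase, "rase": math.sqrt(ase), "misclassification": wrong / n}

# Hypothetical outcomes and predicted probabilities, for illustration only.
actual = [1, 0, 1, 1, 0]
prob = [0.8, 0.3, 0.4, 0.9, 0.6]
print(fit_statistics(actual, prob))

# RASE is the square root of ASE, so the two reported values can be compared.
print(round(math.sqrt(0.22), 3))
```

The square root of the reported average squared error of 0.22 is approximately 0.469, which agrees with the reported RASE of 0.46 up to the rounding of the two published figures.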
As shown by the model assessment plots for both the training and validation data in Figure 7, the percent response,
cumulative percent response, profit, and cumulative profit were all substantially reduced in value as the 100th percentile mark was
reached. Figure 7 also provides parameter estimates for the selected items such as blankets, his/her apparel, towels, and outdoor
wear. Figure 7 indicates a total profit for dining_bin items of $561 (in thousands of dollars).
Finally, the tree of Figure 7 indicates a split of the data by gender of 47.5% male and 52.5% female shoppers, or
frequencies of 561 male and 619 female shoppers respectively. The branches of the tree plotted are for number of towels purchased (3 or fewer
versus 4 or more), his/her apparel (none versus more than one), and blankets purchased (one or fewer versus more than one).
This tree figure of Enterprise Miner indicates that women purchased more towels in either branch split, men purchased no
his/her apparel items more often and women purchased more than one item of his/her apparel more often, and that men
purchased one blanket more frequently than women who more frequently purchased more than two blankets.
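The percentages reported for the gender split can be checked against the reported frequencies with a few lines of Python. The counts are those quoted above; the interpretation in the final comment is an assumption, since the sizes of the data partitions are not stated in this paper.

```python
male, female = 561, 619      # frequencies reported for the gender split
total = male + female        # number of customers at the root of the tree

male_pct = round(100 * male / total, 1)
female_pct = round(100 * female / total, 1)
print(total, male_pct, female_pct)

customers = 1966             # size of the full demonstration data set
share = total / customers
# A share close to 0.6 would be consistent with the tree having been grown
# on a 60% training partition, but this is an assumption, not a reported fact.
print(round(share, 2))
```

The total of 1180 also matches the total degrees of freedom reported in the fit statistics of Figure 7.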
7. Summary and Conclusions
This paper has presented an overview of the theory of data mining and data warehousing, as well as applications of
KnowledgeSEEKER IV and SAS Enterprise Miner to several data warehouses available from sources indicated as
free/cheap sample data sets. These examples, which are useful for teaching courses in data mining, illustrate that the
techniques of data mining and warehousing are powerful tools for the analysis and visual representation of
characteristics of databases, especially with respect to buying patterns in the Enterprise Miner dataset.
This paper has discussed the relationship of data mining and data warehousing to model building. Enterprise Miner and
DataMind are able to perform model building in a Windows-type environment using node icons. A summary is presented
of the currently available data mining software, as well as a listing of forty-five additional current software tools and their
manufacturers' web site addresses in Table 1.
This paper illustrates the different types of opportunities to apply data mining to data warehouses and some of the
decisions that must be made by the user as to which software would present the desired type of output for further
analysis. That is, should data mining be performed using modeling, decision trees such as those produced by CART,
histograms, or other visual means?
The KnowledgeSEEKER IV software has been applied in a simplified way to a twenty-six (26) variable blood pressure
database provided in Groth[1998,2000]. The relatively simple example tree structure output that is presented illustrates the
usefulness of data mining in determining the effects that drinking, smoking, and diet had on the frequencies of the three blood
pressure levels. As expected, each of these had an adverse effect on blood pressure levels.
SAS Enterprise Miner has bridged the gap between traditional statistics packages and data mining tools by combining
them coherently in one software package. KnowledgeSEEKER performs data mining using decision tree
techniques only. If one were not to use Enterprise Miner, one would need to use supplemental software such as NeuralWorks
Predict by NeuralWare Inc. for neural networks and DataMind by DataMind Inc. and Red Brick Systems for model building
and prediction. This is summarized in Table 2. A downloadable demo version of KnowledgeSEEKER is currently available
for $10.00 from Hearne Scientific Software at the web site www.hearne.com.au, which also sells an academic site license for a fee
disclosed only upon contacting the company.
Table 2: Comparison of Data Mining Software with KnowledgeSEEKER and Enterprise Miner

Technique          SAS Enterprise Miner   KnowledgeSEEKER   DataMind   NeuralWorks
Decision Trees     Yes                    Yes               No         No
Clustering         Yes                    No                No         No
Neural Networks    Yes                    No                No         Yes
Prediction         Yes                    No                Yes        No
Model Building     Yes                    No                Yes        No

The conclusions of this research include illustration of the usefulness of the KnowledgeSEEKER and SAS Enterprise
Miner in applying data mining techniques to visualize tree structures as well as different dimensions and relationships of the
attributes of the data than would ordinarily be obtained, and a simplistic tutorial in using these data mining tools for use in the
classroom with free/cheap downloads of their software and sample data sets.
This paper also illustrates the important relationship of data mining to the topic of information management. Data
mining is a powerful tool that utilizes many types of techniques to extract data according to the user's specific needs and the attributes
selected. Hence the users of data mining can manage the information within the data warehouses to suit their needs. The latter is
achieved by first carefully selecting the appropriate information technology software.
8. Acknowledgements
Figure 5 is output that was generated using the software Angoss KnowledgeSEEKER IV using data sets provided by Angoss
Software Corporation, and is published with prior written consent of the CEO of Angoss Software Corporation for
educational purposes. KnowledgeSEEKER is a trademark of Angoss Software Corporation.
All Figures, Tables and References Available Upon Request from Author.