jects applied to databases containing
STATISTICA, data in external databases can be processed by the STATISTICA Association Rules module in-place, so the program is prepared to efficiently handle extremely large analysis tasks. The results can be displayed in tables, and also in unique 2D and 3D graphs where strong associations are highlighted by thick lines connecting the respective items.

INTERACTIVE DRILL-DOWN EXPLORER

A first step of many data mining projects is to explore the data interactively, to gain a first “impression” of the types of variables in the analyses and their possible relationships. The purpose of the Interactive Drill-Down Explorer is to provide a combined graphical, exploratory data analysis, and tabulation tool that will allow you to quickly review the distributions of variables in the analyses and their relationships to other variables, and to identify the actual observations belonging to specific subgroups in the data.

How the Drill-Down Explorer Works. The “drill-down” metaphor within the data mining context summarizes the basic operation of this analytic process quite well: the program allows you to select observations from larger data sets by selecting subgroups based on specific values or ranges of values of particular variables of interest (e.g., Gender and Average Purchase in the example above); in a sense you can expose the “deeper layers” or “strata” in the data by reviewing smaller and smaller subsets of observations selected by increasingly complex logical selection conditions.

Drilling “up.” The interactive nature of the Drill-Down Explorer allows you not only to drill down into the data or database (select groups of observations with increasingly specific logical selection conditions), but also to “drill up”: at any time, you can select one of the previously specified variable (category) groups and de-select it from the list of drill-down conditions; while processing the data, the program will then select only those observations that fit the remaining logical (case) selection conditions, and update the results accordingly.

Applications of the Interactive Drill-Down Explorer. The example shown earlier is very simple, exposing only the basic functionality of the program. The real power of the STATISTICA Interactive Drill-Down Explorer lies in the various auxiliary results that can automatically be updated during the interactive drill-down/up exploration: you

● Box-and-whiskers plots summarizing the distributions of continuous
STATISTICA by extracting the observations belonging to the current subset;

For example, you could review the types of purchases made by customers with different demographic characteristics, study the effectiveness of certain drugs within different treatment groups, ages, etc., or extract likely customers for a new product from a database of previous customers based on careful study of apparent (market) segments exposed by the drill-down analysis.

Interactive Drill-Down Explorer and OLAP (On-Line Analytic Processing). On the surface, the operation of the simplest aspect of the Interactive Drill-Down Explorer (exploration of multidimensional tables) is very similar to the functionality offered by designated OLAP tools (such as those offered in the optional OLAP add-on module for STATISTICA Data Miner). OLAP tools allow users to quickly query a database to extract observations and summary information about those observations, taking advantage of the optimized OLAP Server facilities offered for a specific database platform (e.g., Oracle or MS SQL Server), and often providing significant performance advantages over traditional (non-OLAP driven) query tools. However, the main advantages that the STATISTICA Interactive Drill-Down Explorer has over OLAP are:

(a) its tight integration with STATISTICA’s flexible categorization tools and exploratory environment (the analytic capabilities provided in the STATISTICA Interactive Drill-Down Explorer are much more comprehensive and more general than those of typical OLAP tools, supporting flexible “drill up” operations, and allowing you to quickly review custom, complex summary graphs, detailed descriptive statistics, etc.), and

(b) the fact that the STATISTICA Interactive Drill-Down Explorer is not limited to any particular database platform and does not require a designated OLAP Server to be present (e.g., it can operate directly on STATISTICA data files). At the same time, by connecting a (remote) database to the STATISTICA application for in-place processing, you can efficiently perform drill-down operations on any data source, regardless of whether or not designated OLAP tools are available on the server.
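The drill-down and drill-up operations described in this section can be sketched in a few lines of Python. This is only a minimal illustration of the idea (a list of active selection conditions that are combined with a logical AND), not the way STATISTICA implements it. The `Condition` class, the `select` helper, and the small customer table are hypothetical; only the variable names Gender and Average Purchase are taken from the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass(frozen=True)
class Condition:
    """One named drill-down condition (hypothetical structure)."""
    name: str
    test: Callable[[Row], bool]


def select(rows: List[Row], conditions: List[Condition]) -> List[Row]:
    """Return the rows that satisfy every active condition (logical AND)."""
    return [r for r in rows if all(c.test(r) for c in conditions)]


# Made-up example data; not taken from the brochure.
customers: List[Row] = [
    {"Gender": "F", "AveragePurchase": 120.0},
    {"Gender": "M", "AveragePurchase": 45.0},
    {"Gender": "F", "AveragePurchase": 30.0},
    {"Gender": "M", "AveragePurchase": 150.0},
]

is_female = Condition("Gender = F", lambda r: r["Gender"] == "F")
big_spender = Condition("AveragePurchase >= 100",
                        lambda r: r["AveragePurchase"] >= 100)

active: List[Condition] = []

# Drill down: each added condition narrows the current subset.
active.append(is_female)
level1 = select(customers, active)

active.append(big_spender)
level2 = select(customers, active)

# Drill up: de-select an earlier condition; the remaining ones still apply.
active.remove(is_female)
level3 = select(customers, active)

for label, subset in (("Gender", level1),
                      ("Gender + purchase", level2),
                      ("purchase only", level3)):
    print(label, len(subset))
```

Keeping the conditions in a list, rather than filtering the data destructively, is what makes drilling up cheap in this sketch: removing a condition simply means re-running the selection with the conditions that remain.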
GENERALIZED EM & K-MEANS CLUSTER ANALYSIS

The STATISTICA Generalized EM (Expectation Maximization) and k-Means Clustering module is an extension of the techniques available in the general STATISTICA Cluster Analysis options, specifically designed to handle large data sets, to allow clustering of continuous and/or categorical variables, and to provide the functionality for complete unsupervised learning (clustering) for pattern recognition, with all deployment options for predictive clustering. Various cross-validation options are provided (including modified v-fold cross-validation options) that will automatically choose and evaluate a best final solution for the clustering problem; you do not need to specify the number of clusters before an analysis; instead the program will use automatic (cross-validation based) methods to choose a best cluster solution (number of clusters) for you!

The advanced EM clustering technique available in this module is sometimes referred to as probability-based clustering or statistical clustering. The program will cluster observations based on continuous and categorical variables, assuming different distributions for the variables in the analyses (as specified by the user). Various cross-validation options are provided to allow you to choose and evaluate a best final solution for the clustering problem. Detailed output summaries and graphs (e.g., distribution plots for EM clustering) are provided, and detailed classification statistics are computed for each observation. These methods are optimized to handle very large data sets, and various results are provided to facilitate subsequent analyses using the assignment of observations to clusters. Options for deploying cluster solutions (in C, C++, C#, Visual Basic, or XML-syntax-based PMML), for classifying new observations, are also included.

GTREES

The Classification and Regression Trees module is a comprehensive implementation of the methods described as CART by Breiman, Friedman, Olshen, and Stone (1984). However, the GTrees module contains various extensions and options that are typically not found in implementations of this algorithm, and that are particularly useful for data mining applications.

User interface; specifying “models.” In addition to standard analyses (as described by Breiman, et al.), the implementation of these methods in STATISTICA allows you to specify ANOVA/ANCOVA-like designs with continuous and/or categorical predictor variables, and their interactions. Three alternative user interfaces are provided to allow you to specify such designs; these are analogous to the methods provided in GLM (General Linear Models), GLZ (Generalized Linear Models), GRM (General Regression Models), GDA (General Discriminant Analysis Models), and PLS (General Partial Least Squares Models), and are described in detail in the respective sections. In short, ANOVA/ANCOVA-like predictor designs can be specified via dialogs, Wizards, or (design) command syntax; moreover, the command syntax is compatible across modules, so you can quickly apply identical designs to very different analyses (e.g., compare the quality of classification using GDA vs. GTrees).
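The idea of letting v-fold cross-validation pick the number of clusters, as described for the Generalized EM and k-Means module, can be illustrated with a small, self-contained Python sketch. Everything in it is an assumption made for illustration: the plain k-means with a deterministic farthest-point start, the cost measure (mean squared distance of held-out points to the nearest centroid), and the stopping rule (take the smallest k after which one more cluster lowers the cost by less than 5% of the single-cluster cost). It is not StatSoft's algorithm, and the toy data are made up.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def kmeans(points: List[Point], k: int, iters: int = 50) -> List[Point]:
    """Plain k-means (Lloyd iterations) with a deterministic farthest-point start."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    for _ in range(iters):
        groups: List[List[Point]] = [[] for _ in centroids]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            groups[nearest].append(p)
        # An empty group keeps its previous centroid.
        updated = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
            if g else c
            for g, c in zip(groups, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def cv_cost(points: List[Point], k: int, v: int = 3) -> float:
    """Mean squared distance of held-out points to their nearest centroid."""
    total, count = 0.0, 0
    for fold in range(v):
        train = [p for i, p in enumerate(points) if i % v != fold]
        test = [p for i, p in enumerate(points) if i % v == fold]
        centroids = kmeans(train, k)
        total += sum(min(sq_dist(p, c) for c in centroids) for p in test)
        count += len(test)
    return total / count


def choose_k(points: List[Point], max_k: int = 5, min_drop: float = 0.05,
             v: int = 3) -> Tuple[int, Dict[int, float]]:
    """Smallest k after which an extra cluster gains less than min_drop
    (measured as a fraction of the single-cluster cost). Assumed rule."""
    costs = {k: cv_cost(points, k, v) for k in range(1, max_k + 1)}
    for k in range(1, max_k):
        if (costs[k] - costs[k + 1]) / costs[1] < min_drop:
            return k, costs
    return max_k, costs


# Made-up data: three well-separated groups of six points each.
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
offsets = [(-0.5, -0.5), (0.5, -0.5), (-0.5, 0.5),
           (0.5, 0.5), (0.0, 0.0), (0.3, -0.2)]
data = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]

best_k, costs = choose_k(data)
print("chosen number of clusters:", best_k)
```

On this toy data the rule should settle on three clusters, because the held-out cost drops sharply up to k = 3 and hardly changes afterwards. With real data the threshold and the number of folds would need tuning, and an EM-based (probability-based) variant would replace the distance cost with a held-out log-likelihood.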
1. tomer credit scoring, predicting specific aspects of customer behavior or providing answers to specific CRM questions, managing the risk of an equipment failure using a model based on the mining of a very complex set of historical data). For these customers, StatSoft offers a complete installation and deployment of data mining solutions that will draw data from an existing corporate database or data warehouse and generate predictions or ratings using a specific model that StatSoft consultants will deploy on-site (services to develop a data warehouse solution or restructure the existing one are also available). These specialized data mining solutions can later be modified (by StatSoft or other consultants) as the needs of the company change. Modifying such already deployed systems is very easy (because all STATISTICA solutions are stored in the form of industry standard VB scripts), and they can readily be deployed as industry standard C++ code.

2. Customers who need a general, powerful data mining solution development system, to be used to design and deploy custom systems (in-house) by the corporate analysts and IS/IT personnel. These customers will license the same set of tools, following the same price structure as the customers from the previous category (see above), except that they will not order the deployment and consulting services.
2300 E. 14th St. • Tulsa, OK 74104 • USA • (918) 749-1119 • Fax: (918) 749-2217 • info@statsoft.com • www.statsoft.com
Australia: StatSoft Pacific Pty Ltd. • Germany: StatSoft GmbH • Japan: StatSoft Japan Inc. • Portugal: StatSoft Iberica Ltda. • Spain: StatSoft Espana
Brazil: StatSoft Brazil Ltda. • Hungary: StatSoft Hungary Ltd. • Korea: StatSoft Korea • Russia: StatSoft Russia • Sweden: StatSoft Scandinavia AB
Czech Republic: StatSoft Czech Rep. s.r.o. • Israel: StatSoft Israel Ltd. • Netherlands: StatSoft Benelux BV • Singapore: StatSoft Singapore • Taiwan: StatSoft Taiwan
France: StatSoft France • Italy: StatSoft Italia srl • Poland: StatSoft Polska Sp. z o. o. • S. Africa: StatSoft S. Africa (Pty) Ltd. • UK: StatSoft Ltd.
STATISTICA and StatSoft are trademarks of StatSoft, Inc. © Copyright StatSoft, Inc. 1984 - 2002