StatSoft Staff-Statistica Data Miner-O'Reilly Media

Data Miner enterprise system

✔ Uncover hidden trends
✔ Explain known patterns
✔ Predict the future
STATISTICA has received the highest rating in EVERY comparative review of statistics software in which it has been featured since its first release in 1993.

www.statsoft.com
2 Data Miner
Table of Contents

A General Overview
Advanced Software Technology
Using Data Miner with Large Data Sets
Unique Features
Data Mining Tools
Specialized Data Mining Modules
The Client-Server Version/WebSTATISTICA

A General Overview

The most comprehensive and effective system of user-friendly tools for the entire data mining process - from querying databases to generating final reports.
■ To the best of our knowledge, STATISTICA Data Miner contains the most comprehensive selection of data mining methods available on the market (e.g., by far the most comprehensive selection of clustering techniques, neural network architectures, classification/regression trees, multivariate modeling (including MAR Splines), and many other predictive techniques; the largest selection of graphics and visualization procedures of any competing products);
■ A selection of comprehensive, complete data mining projects, ready to run, and set up to competitively evaluate alternative models [using bagging (voting, averaging), boosting, stacking, meta-learning, etc.], and to produce presentation-quality summary reports;
■ An extremely easy to use, drag-and-drop based user interface that can be used even by novices, but is still highly flexible, customizable, and provides one-click access to the underlying scripts;
■ Powerful, interactive data exploration (drilling, slicing, dicing) tools, including the most comprehensive selection of interactive, exploratory graphics-visualization tools available in any product;
■ Ability to handle/process simultaneously multiple data streams;
■ Optimized for processing extremely large data sets (including options to pre-screen even over a million variables, and/or draw stratified or simple random samples of records using DIEHARD-certified random sampling procedures);
■ Highly optimized read (and write) access to large databases, including the IDP (In-Place Database Processing) technology that reads data asynchronously directly from remote database servers (using distributed processing if supported by the server), and bypassing the need to “import” data and create a local copy;
■ Flexible deployment engine, integrated with custom development environment allowing you to manage optimized analytic objects (nodes) for data mining using quick, industry standard, Visual Basic scripts (VB is built into the system);
■ Extremely fast and efficient deployment via portable, XML syntax based PMML (Predictive Model Markup Language) files for prediction, predictive classification, or predictive clustering of large data files; trained models can be shared between desktop and WebSTATISTICA Data Miner (Client-Server) installations (see below);
■ Options to write predicted values, classifications, classification probabilities, prediction residuals, and so on directly into external databases for subsequent analyses, selection, etc.; by using efficient IDP (In-Place Database Processing) technology for reading and writing information from and to external databases, datasets of extremely large sizes can be analyzed and scored (i.e., used to update predicted values, classification probabilities, and so on in the database);
■ Open, COM-based architecture, unlimited automation options, and support for custom extensions (using industry standard VB (built in), Java, or C/C++/C#);
■ Desktop or Client-Server options;
■ Multithreading and distributed processing architecture delivers unmatched performance (offered in the Client-Server version) including super-computer-like parallel processing technology that optionally scales to multiple server computers that can work in parallel to rapidly process computationally intensive projects;
■ Complete Web-enablement options (via WebSTATISTICA offering support for all data mining operations, including the interactive model building, via Internet browser using any computer connected to the Web); this ultimate enterprise data analysis/mining system allows you to manage projects over the Web and work collaboratively “across the hall or across continents.”

STATISTICA Data Miner is a truly unique application in terms of its sheer comprehensiveness, power, technology, and the flexibility of the available user interfaces:
■ Choose from the largest selection of algorithms on the market (based on the STATISTICA technology) for classification, prediction, clustering, and modeling;
■ Access and process huge data sets in remote databases in-place; offload time-consuming database queries to the server;
■ Write predicted values, classifications, classification probabilities, etc. computed from trained models directly to an external database; score very large databases using one or more deployed models;
■ Access huge data files on your local (desktop) Windows computer; as specialized queries into custom data warehouses are sometimes expensive (requiring the services of designated consultants), it can be more cost effective to download even huge databases to your local machine; such data files can then be processed with unmatched speed by STATISTICA Data Miner routines;
■ Data mining project templates can be selected from menus; with only a few clicks of the mouse, you can apply even advanced methods such as meta-learning techniques (voting, bagging, etc.) to your specific analysis problems;
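To make the “voting” and “averaging” terms used above concrete, here is a minimal sketch in Python of how the predictions of several already-trained models might be combined. It is an illustration only and not STATISTICA Data Miner code: the function names (`vote`, `average`) and the example predictions are hypothetical, and bagging proper (which also resamples the training data before fitting each model) is not shown.

```python
from collections import Counter
from statistics import mean


def vote(labels):
    """Majority vote: return the label predicted by the most models.

    Ties are resolved in favor of the label seen first (Counter keeps
    insertion order); a real system would need an explicit tie rule.
    """
    return Counter(labels).most_common(1)[0][0]


def average(values):
    """Averaging: return the mean of the models' numeric predictions."""
    return mean(values)


# Hypothetical class predictions from three models for four cases.
class_predictions = [
    ["good", "bad", "good", "bad"],   # model A
    ["good", "good", "good", "bad"],  # model B
    ["bad", "bad", "good", "bad"],    # model C
]

# Hypothetical numeric predictions from three models for two cases.
numeric_predictions = [
    [100.0, 200.0],  # model A
    [110.0, 205.0],  # model B
    [120.0, 210.0],  # model C
]

# zip(*...) regroups the per-model lists into per-case tuples.
voted = [vote(case) for case in zip(*class_predictions)]
averaged = [average(case) for case in zip(*numeric_predictions)]

print(voted)     # ['good', 'bad', 'good', 'bad']
print(averaged)  # [110.0, 205.0]
```

Voting applies to classification problems (categorical predictions), averaging to regression problems (continuous predictions).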
■ Integrate diverse methods and technologies into the data mining project, from quality control charting and process capability analysis, Weibull analysis, power analysis, or linear and nonlinear models, to advanced automated searches for neural network architectures; all STATISTICA procedures can be selected as nodes for data mining projects, and no programming or custom-development work is required to use these procedures;
■ Graphical/visual data mining: All of STATISTICA’s unique and unmatched graphical capabilities are available for data mining; choose from hundreds of graph types to visualize data after cleaning, slicing, or drilling down;
■ Intuitive user interface and full integration with STATISTICA’s award winning solutions: you will be up-and-running in minutes;
■ Complete integration into StatSoft’s desktop (STATISTICA) and Web (WebSTATISTICA) applications; interactively explore, drill down on, chart, etc., all intermediate results;
■ Organize results in reports, spreadsheets, graphs, etc., or publish results on the Web;
■ Access to STATISTICA’s comprehensive library of analytic facilities;
■ Update analyses and results automatically when the data change;
■ Open architecture design. Fully integrate your own proprietary algorithms and methods or third-party algorithms;
■ Fully programmable and customizable system (using the industry standard languages such as the built-in Visual Basic, C++, C#, Java, etc.). Develop highly customized data mining systems specifically tailored to your needs;
■ Automatically deploy solutions in seconds using built-in tools, or add automatically generated computer code for deployment (e.g., in C++, PMML) to your own programs.
Data Miner in the WebSTATISTICA Client-Server installation. The desktop version of STATISTICA Data Miner is designed for the Windows environment. The Client-Server version of STATISTICA Data Miner is platform independent on the client side and features an Internet browser-based user interface; the server side works with all major Web server operating systems (e.g., UNIX Apache) and Wintel server computers.
■ Seamless integration of desktop and WebSTATISTICA data mining tools: design models on one platform (desktop or WebSTATISTICA server), execute on the other; train models on one platform (desktop or WebSTATISTICA server) and deploy to the other platform;
■ Distributed processing and multi-threaded evaluation of projects: The program will automatically take advantage of multi-processor and/or multiple-server architectures, to evaluate models via multiple simultaneous processes (multithreading, distributed processing); hence the ability of the WebSTATISTICA Data Miner installations to take full advantage of such architectures provides tremendous flexibility for scaling the system to mine even extremely large databases.
■ Full flexibility of WebSTATISTICA: analyze data in batch mode, receive notification by email when the results are ready; share results in designated folders (repositories) with other stakeholders in the data mining projects; etc.
■ Integrate input data, stakeholders, analysts, and users of results of data mining projects from any location around the world; WebSTATISTICA enables you to connect to data on one server (over the Internet), share analyses with other data mining professionals worldwide, and deploy solutions and results to users in even the most remote locations (e.g., to branch managers in small rural areas, engineers on remote drilling platforms, ships en-route across oceans, etc.); as long as even slow Internet access is available, you can include individuals in those locations in your data mining project;
■ Ideal for teaching data mining: provide participants (students) with the option to analyze data from home or their office, wherever there is access to the Internet; allow professionals to complete assignments at the time and place that most conveniently fits their schedules. WebSTATISTICA allows all course or training participants hands-on experience with the most advanced data mining tools available today!

STATISTICA Data Miner is based on a technology that offers both (a) the full advantages of the interactive, “point and click” user interface and (b) complete programmability and customizability.

Advanced Software Technology = Efficient and Elegant User Interface

STATISTICA analysis “objects” and nodes. At the heart of STATISTICA Data Miner is a set of over 300 highly optimized, efficient, and extremely fast STATISTICA procedures embedded in user-selectable nodes, which are used to specify the relations between the procedures (objects) and control the logic of the project (and the “flow” of data). This flexible, customizable architecture delivers the full functionality of all statistical and analytic procedures to the data mining environment as self-contained analysis objects. Behind each node, and accessible to advanced users of the STATISTICA Data Miner system, are simple scripts (analysis objects encoded in industry-standard Visual Basic) that serve as “wrappers” or glue for defining the flow of data through the project, while the actual numerical analyses are performed via the extremely fast analytic procedures of STATISTICA. These objects, which can be used as the nodes for data cleaning and/or filtering, and for analyzing the data, are organized in the Node Browser.
The nodes available in the node browser (and, hence, available to the data mining project) are:
■ Nodes for data input and data acquisition. Here you can create and store the scripts necessary to connect to remote (protected) data sources on a server. Of course, you can also analyze STATISTICA data files or place holders for in-place processing of remote databases, in which case no special nodes (scripts) have to be created.
■ Nodes for data filtering, cleaning, verification, feature selection,
and sub-sampling. These options are essential to data mining, to detect and correct erroneous information that may bias final conclusions. The sub-sampling facilities are useful for analyzing very large data sets, to extract random or stratified random samples for further analyses. The feature selection options allow you to automatically select informative variables (predictors) from among, for example, hundreds of thousands of possible predictors.
■ Nodes for data analyses. These nodes contain the full functionality of all STATISTICA analyses and graphics capabilities; hundreds of procedures are available to address essentially all analytic needs that can possibly arise in your data mining project.
Creating the data mining project. These nodes can simply be connected in the data mining workspace.
The data mining workspace is a structured, highly efficient, user-friendly data analysis environment, where you can move around and interconnect data, analyses, and results by simply dragging icons and connecting arrows. You can simultaneously open, modify, and run as many data mining workspaces as you like and drag nodes (objects) between workspaces and node browsers. The workspace area is pre-divided to make room for:
■ Data acquisition. Here is where the data sources can be specified (e.g., STATISTICA data files, place-holders for in-place processing of data on remote servers, programs that generate data programmatically, for use in advanced modeling).
■ Data preparation, cleaning, transformation. The nodes in this area will accept one or more data sources for input, and create one or more (filtered, cleaned, transformed) data sources for further “downstream” analyses.
■ Data analysis, modeling, classification, forecasting. The nodes in this area will perform the numeric analyses.
■ Reports. This area will show the results of the analyses.
Creating a Data Mining project is easy: first select a data source; second, apply any data preparation, cleaning, or transformation; third, connect the desired analyses to the cleaned data; and, fourth, review and/or publish the results. Many users of STATISTICA Data Miner will never need to go beyond this simple interactive, “point and click” user interface.
Specifying complex models. The simple user interface -- based on point-and-click selections from menus and browsers -- will allow you to apply even very advanced methods. Several comprehensive and flexible project “templates” can be selected to address common data mining tasks. For example, in order to find a good model for predicting credit risk of new clients based on historical data that includes various potentially useful predictors, you could simply select the template for the Advanced Comprehensive Regression Models project.
All you need to do next is connect your historical data, specify the variables of interest, and “train” the project; thus, in just a few seconds (select data file, select variables, select the arrow tool to connect the data), the program will automatically:
■ Create two samples for training and for cross-validation, to avoid overfitting;
■ Apply best subset linear regression, standard regression tree algorithms, CHAID and exhaustive CHAID, a 3-layer multilayer perceptron neural network, and a radial basis function neural network to find a good model for predicting credit risk;
■ Combine all responses into a meta-learner that picks the best model, or combines the predictions from multiple models.
After applying these cutting-edge techniques for modeling linear, nonlinear, or even chaotic relationships, you are ready for deployment: Simply connect the data source for the new data (new customers) to the Compute Best Prediction From All Models node, and the program will automatically apply the fully trained models to derive the best prediction possible.
Speed. The analysis nodes (objects) contain the full functionality of STATISTICA, encapsulated into nodes that can further be customized using standard Visual Basic syntax. The actual analyses are performed via the highly optimized STATISTICA analysis modules, which have been refined for almost two decades to deliver maximum speed, capacity, and accuracy.
Large data sets. STATISTICA Data Miner uses a number of technologies specifically developed to optimize the processing of large data sets, and it is designed to handle even the largest scale computational problems and process very large databases. For example, data sets with over one million variables can be processed and screened automatically (using a wide selection of methods) to search for best predictors or most relevant variables. Please visit www.statsoft.com for benchmarks illustrating the unmatched speed of STATISTICA data processing.
Customizing analyses. The analyses or data cleaning/filtering operations implemented by the nodes of STATISTICA Data Miner can further be customized by simply double-clicking on the respective icons: every icon contains the options to fully customize the respective operations; for example, clicking on a neural network node will bring up a dialog (and dialog help) for customizing the specific analysis (to change the number of iterations, number of layers in the network, the detail of reported results, etc.).
Saving the project. The entire project (workspace) can be saved, along with all customization, intermediate data sources, comments, etc. Routine analyses (e.g., for regular updating of a trained complex set of models for voted classification based on various methods) can be saved
and later applied by clicking on a single button (“update”).
Technical Note: STATISTICA Data Miner Node Scripts. STATISTICA Data Miner’s computational routines are extremely fast and highly optimized. For example, in the WebSTATISTICA Client-Server environment, the program will automatically take advantage of multi-processor and/or multiple-server architectures (with proper hardware support), to evaluate models via multiple simultaneous processes (multithreading, distributed processing). Moreover, the highly optimized routines for processing data will outperform other software in head-to-head comparisons (see the benchmarks at www.statsoft.com for details). Yet, advanced users will find it very easy to customize the system: each node in STATISTICA Data Miner consists of a standardized STATISTICA Visual Basic script (that calls the respective STATISTICA procedures), with access to additional functions to provide the user interface to further customize analyses. It may never be necessary to modify or customize these scripts; however, if your in-house IT department or consultants want to insert proprietary algorithms into STATISTICA Data Miner, this can very easily be accomplished. Any number of proprietary or highly customized numeric operations could be performed inside the script, to change practically all aspects of the data, or to apply any of the thousands of analytic functions available in the form of simple function calls that can be made from C++ or STATISTICA Visual Basic. This general open architecture of STATISTICA Data Miner provides numerous unique (to data mining software) advantages (also further elaborated in the section on Unique Features).
■ Each node can handle multiple data sources on input, and multiple data sources on output; identical operations can be applied to multiple data sources via a single node.
■ A data source can be mapped into a database that does not need to actually (physically) reside on the machine running STATISTICA Data Miner, nor does it have to be copied; this is extremely important for the processing of large data sets, as they commonly occur in data mining.
■ You can perform operations within and between data sources; for example, you could merge data in different remote databases into a single data file, for further processing with STATISTICA Data Miner analytic nodes.
■ Visual Basic itself is a simple, object-oriented language, available for most industry-standard application programs; there is a virtually limitless supply of programming resources, talented and experienced programmers, and ready-to-use third-party applications that can be integrated with STATISTICA Data Miner. Likewise, STATISTICA Data Miner can be integrated with other applications; for example, to automatically deliver results to the Web or email, or to export results into other applications. Also, a fully Web-based version of STATISTICA Data Miner, powered by WebSTATISTICA, is available.
■ STATISTICA’s macro recording facilities will automatically record interactive analyses; these recordings can easily be converted into scripts for custom nodes.
■ Where applicable, STATISTICA’s analyses contain options for generating STATISTICA Visual Basic code for deployment (e.g., of trained neural networks); those scripts can be directly used in scripts for custom deployment nodes.
Deploying solutions. The results of analyses via STATISTICA Data Miner can be deployed (applied to new data or inside other automated data processing systems) in several ways.
■ Automatic deployment of models. Data mining templates with deployment for standard types of analyses can be chosen as options from pull-down menus: select a template, connect training data to estimate models, and you are ready to apply the best solution (average solution, voted solution, etc.) to new data; the end user only needs to connect new data to the deployment node to compute predictions, classifications, forecasts, etc.
■ PMML-based rapid deployment of predictive models. The Rapid Deployment of Predictive Models options provide the fastest, most efficient methods for computing predictions from fully trained models; in fact, it is very difficult to “beat” the performance (speed of computations) of this tool, even if you were to write your own compiled C++ code, based on the (C, C++, or C#) deployment code generated by the respective models. The Rapid Deployment of Predictive Models options allow you to load one or more PMML files with deployment information, and to compute very quickly (in a single pass through the data) predictions for large numbers of observations (for one or more models). PMML (Predictive Model Markup Language) files can be generated from practically all analytic procedures for predictive data mining (as well as the Generalized EM & k-Means Cluster Analysis options). PMML is an XML-based (Extensible Markup Language) industry standard set of syntax conventions that is particularly well suited to allow sharing of deployment information in a Client-Server architecture (e.g., via WebSTATISTICA).
■ C, C++, C#, Visual Basic code generator options. Code-generator options are also available for regression (prediction of continuous variables), classification (prediction of categorical variables), and clustering types of problems; for example, you can save C++ code or Visual Basic code that implements the prediction from tree-classification algorithms, linear discriminant function analysis, generalized linear models, neural networks, MAR Splines (multivariate adaptive regression splines), k-means or EM clustering solutions (unsupervised learning), etc. The code generated by these options can quickly be integrated into custom programs for deployment. For example, the Visual Basic code generated from STATISTICA analysis modules will seamlessly integrate into the STATISTICA Data Miner architecture; based on the Visual Basic code generated by STATISTICA, custom deployment nodes can be programmed in minutes, even by inexperienced programmers.

Using Data Miner with Large Data Sets

The entire STATISTICA family of products and STATISTICA Data Miner in particular are specifically optimized to efficiently process extremely large data sets, with millions of observations (records) and millions of variables (fields). Please refer also to the speed benchmarks detailed at the StatSoft Web site (www.statsoft.com).
Processing databases that are larger than the local storage device. STATISTICA Data Miner (and optionally other STATISTICA products) can process data in (remote) databases “in-place” via its highly optimized In-place Database Processing (IDP) technology, which combines the processing resources of the database server and the local computer to (a) perform the queries (using the database server CPU) while simultaneously (b) processing the fetched records “on-the-fly” on the local machine (using the local computer (client) CPU). This way, databases that are larger than what could fit on the local machine can be processed, and significant performance gains can be achieved by saving the time that would normally be required to first import the data to the local device and only then process them locally. Practically all common database formats are supported, and powerful tools are provided for defining the database connection (query).
Processing databases with extremely large numbers of variables
(fields): The unique feature selection and variable screening facilities. When the number of variables in the input data file is extremely large, STATISTICA Data Miner can automatically select subsets of variables from among even over a million variables (candidates) for predictive data mining. The extremely fast and efficient algorithm will select variables (features) that are likely to be the most relevant predictors in the current data set, without introducing biases into subsequent model building for predictive data mining.
Processing data files with extremely large numbers of cases (records): Flexible and efficient random sampling. STATISTICA products (including STATISTICA Data Miner) can process data files with practically unlimited numbers of cases (records), and STATISTICA’s data access procedures are highly optimized. However, including all records in the analyses when the number of records is extremely large is (a) entirely unnecessary, (b) time consuming, and (c) often impractical or impossible (in extreme cases it could take hours merely to read all records). In order to speed up the analytic process, STATISTICA Data Miner includes sophisticated tools for drawing random or stratified random samples from huge data sets (databases). The user can quickly extract simple or systematic random samples of appropriate sizes, with or without replacement, from huge data sets (e.g., with many millions of records) for further analyses with sophisticated modeling tools that may require multiple passes through the data (e.g., neural networks, generalized linear models, etc.). The random sub-sampling can be based on STATISTICA’s validated random number generator. Note that STATISTICA is one of only a few commercially available software products that have passed the most advanced and most recognized tests for randomness (the DIEHARD suite of tests).
Distributed processing and multi-threaded evaluation of projects in the Client-Server environment. The WebSTATISTICA Client-Server installation of STATISTICA Data Miner offers additional advantages for processing very large datasets. The program will automatically take advantage of multi-processor and/or multiple-server architectures (with proper hardware support), to evaluate models via multiple simultaneous processes (multithreading, distributed processing). Hence, considering the decreasing costs for advanced server hardware (with multiple processors, or for multiple-server installations), the ability of WebSTATISTICA Data Miner installations to take full advantage of such architectures provides tremendous flexibility for scaling the system to mine even extremely large databases.

Unique Features

STATISTICA Data Miner contains a large number of fully integrated advanced techniques for analyzing data. In addition, the architecture of the program allows this software to offer features that are absolutely unique in this type of application, and can be crucial for the success of data mining projects in the real world.
The most comprehensive selection of data mining techniques. To the best of our knowledge, STATISTICA Data Miner contains the most comprehensive selection of data mining methods available on the market (e.g., by far the most comprehensive selection of clustering techniques, neural network architectures, classification/regression trees, multivariate modeling (including MAR Splines), and many other predictive techniques; the largest selection of graphics and visualization procedures of any competing products).
A fully integrated STATISTICA application. STATISTICA Data Miner is fully integrated into the STATISTICA line of desktop and Web-based analytic software (WebSTATISTICA): everything works together seamlessly as a single, comprehensive system.
Seamless integration of a vast range of techniques. The seamless integration of STATISTICA Data Miner with all other analytic and graphics options available in STATISTICA provides unmatched flexibility: for example, no other software will allow you to quickly integrate into a single data mining project quality control charting and Six Sigma methods, trained ensembles of multiple-architecture neural networks providing weighted average predictions, and categorized icon charts summarizing multiple features of interest for each observation. In STATISTICA Data Miner, all of these can be connected by dragging the respective analysis nodes into the data mining workspace.
Every result can further be reviewed, analyzed, saved. All results of STATISTICA Data Miner can be displayed in the same manner as the results from other STATISTICA analyses. Hence, intermediate results can be saved or immediately used to perform additional interactive analyses using the standard STATISTICA interactive user interface; there are no files to import or export. For example, just display the spreadsheet with predictions and instantly use that spreadsheet to review graphically whether any outliers might have influenced the results.
[Callout, truncated in the source: “STATISTICA Data … most comprehensive s… solutions on the market, wi… easy-to-use user interface. Use… STATISTICA’s analytic routines, hu… graphs, specialized routines for data … to include specialized third-party or … methods. Data Miner is fully prog… can be tailored to respond to yo… requirements, and is o… deployment and … serv…”]
Analysis nodes will handle multiple data streams. Because of STATISTICA Data Miner’s unique architecture, multiple data streams can be channeled through a single node: for example, you can specify a single node for clustering, and send 20 data sets with different variable selections through that node, applying identical specifications such as the type of distance measure to use, etc. This allows for efficient processing of lists of data sources (e.g., automatically create identical analyses and reports for data collected from different data processing centers).
In-place processing of large data sets on remote servers. STATISTICA includes advanced options for defining connections to databases in practically all formats on remote servers. To the STATISTICA application, these data sources appear just like another data file that can be processed without the need to make a copy of or “import” the database to the local machine. Because STATISTICA Data Miner is just another seamlessly integrated STATISTICA application, those data sources can be connected like any other data source, i.e., by simply selecting them from a list of available input data. STATISTICA Data Miner also includes special options for selecting subsets of variables from among huge numbers of input variables (feature selection, variable filtering). For example, you can scan over a million input variables for candidate variables for further predictive classification analyses.
Open architecture: Add your own custom nodes. Because all nodes (including any new, custom nodes) in STATISTICA Data Miner can be modified via Visual Basic programs, it is very easy to customize the system to include analysis (or other) nodes (a) that contain your own proprietary algorithms, (b) developed and implemented in any language that can generate functions that can be called from industry-standard Visual Basic, (c) with a complete user interface for accepting from the user parameters, choices of options, etc.; these nodes can be added permanently to the selection of available nodes, and identified with an icon containing your custom logo.
Same user interface: Data mining on your local machine or via WebSTATISTICA. The same user interface and options available in the STATISTICA Data Miner desktop application are available in the WebSTATISTICA Data Miner application. To reiterate, STATISTICA Data Miner is fully integrated into the STATISTICA family of products; it is not a “foreign” application developed by another company and “forced” into the STATISTICA framework. Data mining over the Web (via WebSTATISTICA) is as efficient and convenient as (or more so than) it is within the STATISTICA desktop application. Note that the WebSTATISTICA Client-Server installation of STATISTICA Data Miner offers additional advantages
ic income group, to explore (e.g., create graphical summaries for) selected variables, for females in the selected income group only. A unique feature of STATISTICA Drill-Down Explorer is the ability to select and deselect drill-down variables and categories in any order; so you could next deselect variable Gender and thus display selected graphs and statistics for the selected Income group, but now for both males and females. Another unique feature of the Drill-Down Explorer is its variety of categorization (“slicing”) methods. Hence, the STATISTICA Drill-Down Explorer offers tremendous flexibility for “slicing-and-dicing” the data. The STATISTICA Drill-Down Explorer can be applied to raw data, database connections for in-place processing of data in remote databases, or to any intermediate result computed in a STATISTICA Data Miner project. (A fully inte-
for processing very large datasets: the program will automatically take grated OLAP application is also available (as an optional add-on module
advantage of multi-processor and/or multiple-server for enterprise installations); please contact StatSoft for details.)
architectures (with proper hardware sup- General Classifier. STATISTICA Data Miner offers the widest
a Miner offers the port), to evaluate models via multiple selection of tools to perform data mining classification techniques
selection of data mining simultaneous processes (multi- (and build related deployable models) available on the market, including
ith an icon-based, extremely threading, distributed process- generalized linear models (for binomial and multinomial responses),
ing). classification trees, general classification and regression tree modeling
ers can access the full power of (GTrees), general CHAID models, cluster analysis techniques (including
STATISTICA Data Miner
undreds of analytic and descriptive is itself accessible as a “large capacity” implementations of tree-clustering as well as k-means
mining, and can customize the system COM object. The func- and EM clustering methods with v-fold crossvalidation options to deter-
r in-house proprietary algorithms and tions of STATISTICA Data mine automatically the best number of clusters), and general discriminant
grammable, can work over the Web, Miner are also fully inte- analysis models (including best-subset selection of predictors). Also, the
grated and accessible via the numerous advanced neural network classifiers available in STATISTICA
our specific data and data mining Neural Networks are available in STATISTICA Data Miner, and can be
STATISTICA COM object
offered optionally with model, and they can be called used in conjunction or competition with other classification techniques.
d on-site training from other applications or used in ■ Deployment. Where applicable, the program includes options for
vices. analysis macros (e.g., create predictions generating C, C++, C#, STATISTICA Visual Basic, or (XML-syntax)
from a sophisticated trained multi-architecture PMML code for deployment of final solutions in your custom pro-
model by clicking on a toolbar button). IT departments will be grams. Models are also automatically available for deployment after
able to create very simple STATISTICA - based applications that can be training, so all you need to do is connect new data to the special
used by “operators” (e.g., loan officers reviewing credit applications for deployment node to compute predicted classifications.
fraudulent information) who simply click on predefined buttons; yet the General Modeler/Multivariate Explorer. STATISTICA Data
system may utilize the “wisdom” extracted from testing dozens or even Miner offers the widest selection of tools to build deployable data
hundreds of different methods for prediction. mining models, based on linear, nonlinear, or neural network techniques
and tools to explore data; the user can also build predictive models based
Data Mining Tools on general multivariate techniques. In summary, STATISTICA offers the
full range of techniques, from linear and nonlinear regression models,
STATISTICA Data Miner offers the most comprehensive selection of statis- advanced generalized linear and generalized additive models, regression
tical, exploratory, and visualization techniques available on the market, trees and CHAID, to advanced neural network methods and multivariate
including leading edge and highly efficient neural network/machine learn- adaptive regression splines (MAR Splines). STATISTICA Data Miner also
ing and classification procedures. Also, the complete analytic functionality includes techniques that are not usually found in data mining software,
of STATISTICA is available for data mining, encapsulated in over 300 such as partial least squares methods (for feature selection, to reduce
nodes that can be selected in a structured and customizable Node large numbers of variables), survival analysis (for analyzing data contain-
Browser and dragged into the data mining workspace. ing censored observations; e.g. for medical research data and data from
The specialized tools for data mining are optimized for speed and efficien- industrial reliability and quality control studies), structural equation mod-
cy and can be classified into the following five general “areas” (each com- eling techniques (to build and evaluate confirmatory linear models), cor-
prising of a set of STATISTICA modules, some of them offered only in the respondence analysis (for analyzing the structure of complex tables), fac-
STATISTICA Data Miner environment): tor analysis and multidimensional scaling (for exploring structure in large
numbers of variables), and many others.
General Slicer/Dicer and Drill-Down Explorer. A large
■ Deployment. Where applicable, the program includes options for
number of analysis nodes are available for creating exploratory
graphs, to compute descriptive statistics, tabulations, etc. These nodes generating C, C++, C#, STATISTICA Visual Basic, or (XML-syntax)
can be connected to input data sources, or to all intermediate results. A PMML code for deployment of final solutions in your custom pro-
specialized STATISTICA application module is available (STATISTICA Drill- grams; models are also automatically available for deployment after
Down Explorer) for interactively exploring your data by drilling down on training, so all you need to do is connect new data to the special
selected variables, and categories or ranges of values in those variables. deployment node, to compute predicted values.
For example, you can drill-down on Gender, to display the distribution for General Forecaster. STATISTICA Data Miner includes a broad
a variable Income for females only; next you could drill down on a specif- selection of traditional (i.e., non-neural networks-based) forecast-
8 Data Miner
ing techniques (including ARIMA, exponential smoothing with seasonal VA; Survival/Failure Time Analysis; General Nonlinear Estimation with
components, Fourier spectral decomposition, seasonal decomposition, Logit and Probit Regression; Log-Linear Analysis of Frequency Tables;
regression- and polynomial lags analysis, etc.), as well as neural network and Time Series Analysis/Forecasting; Structural Equation
methods for time series data. Modeling/Path Analysis (SEPATH).
■ Cluster Analysis Techniques; Factor Analysis; Principal Components &
■ Deployment. Forecasts can automatically be computed for multiple
models in data mining project, and plotted in a single graph for com- Classification Analysis; Canonical Correlation Analysis; Reliability/Item
parative evaluation. For example, you can compute and compare pre- Analysis; Classification Trees; Correspondence Analysis;
dictions from multiple ARIMA models, different methods for seasonal Multidimensional Scaling; Discriminant Analysis; and General
and non-seasonal exponential smoothing, and the best time-series Discriminant Analysis Models (GDA).
■ Quality Control Charts techniques, Process Analysis, and Experimental
neural network architectures (after searching over 100 different archi-
tectures). Design (DOE) procedures.
General Neural Networks Explorer. This tool contains the However, several modules include selections of highly specialized data
most comprehensive selection of neural network methods available mining and data mining modeling techniques that are offered only as part
on the market. This powerful component of STATISTICA Data Miner of STATISTICA Data Miner. The following sections include technical infor-
offers tools to approach virtually any data mining problem (including clas- mation about these modules.
sification, hidden structure detection, and powerful forecasting). One of
the unique features of the NN Explorer is the selection of intelligent prob- FEATURE SELECTION & VARIABLE FILTERING
lem solvers and automatic wizards that use Artificial Intelligence methods This module will automatically select subsets of variables from extremely
to help you solve the most demanding problems involved in advanced NN large data files or databases connected for in-place processing (IDP).
analysis (such as selecting the best network architecture and the best sub- The module can handle a practically unlimited number of variables: over
set of variables). The Explorer offers the widest selection of cutting-edge a million (!) of input variables can be scanned to select predictors for
NN architectures and procedures and highly optimized algorithms that regression or classification. Specifically, the program includes several
include: multilayer perceptrons, radial basis function networks, proba- options for selecting variables (“features”) that are likely to be useful or
bilistic neural networks, generalized regression neural networks, self- informative in specific subsequent analyses. The unique algorithms imple-
organizing feature maps, linear models, principal components network, mented in the Feature Selection and Variable Filtering module will
and cluster networks. Network ensembles of these architectures can also select continuous and categorical predictor variables which show a rela-
be evaluated. Estimation methods include back propagation, conjugate tionship to the continuous or categorical dependent variables of interest,
gradient decent, quasi-Newton, Levenberg-Marquardt, quick propagation, regardless of whether that relationship is simple (e.g., linear) or complex
delta-bar-delta, LVQ, pruning algorithms, and more; options are available (nonlinear, non-monotone). Hence, the program does not bias the selec-
for cross validation, bootstrapping, subsampling, sensitivity analysis, etc. tion in favor of any particular model that you may use to find a final best
■ Deployment. STATISTICA Neural Networks includes code generator rule, equation, etc. for prediction or classification. Various advanced fea-
options to produce C, C++, C#, and STATISTICA Visual Basic code for ture selection options are also available. This module is particularly use-
one or more trained networks as well as ensembles of networks. This ful in conjunction with the in-place processing of databases (without the
code can be quickly incorporated into your own custom deployment need to copy or import the input data to the local machine), when it can
programs. In addition, fully trained neural networks and ensembles of be used to scan huge lists of input variables, select likely candidates that
neural networks can be saved, to be applied later for computing pre- contain information relevant to the analyses of interest, and automatically
dicted responses or classifications for new data. A deployment node select those variables for further analyses with other nodes in the data
can be dragged into the data miner workspace to perform prediction miner project. Subsets of variables based on an initial scan via this mod-
and predictive classification based on trained neural networks auto- ule can also be submitted to further (post-) feature selection methods
matically; all you have to do (after the participating network architec- based on neural networks, MAR Splines, linear regression or classifiers,
tures are trained) is connect the data for deployment. or CHAID. These options allow STATISTICA Data Miner to handle data
sets in the multiple giga- and terabyte range.
Specialized Data Mining Modules ASSOCIATION RULES
A large portion of analytic functionality used by STATISTICA Data Miner is This module contains a complete imple-
driven by the computational engines of modules that are included in vari- mentation of the so-called A-priori algo-
ous other STATISTICA products: rithm for detecting (“mining for”) associa-
■ Neural Networks techniques (the largest selection of architectures tion rules such as “customers who order
available, automatic problem solver tools, and advanced feature selec- product A, often also order product B or C”
tion techniques). or “employees who said positive things
■ All STATISTICA Graphics Tools and interactive exploration/visualization about initiative X, also frequently complain
tools; Descriptive statistics, breakdowns, and exploratory data analysis; about issue Y but are happy with issue Z”
Frequency Tables, Crosstabulations, Tables and Stub-and-Banner (see Agrawal and Swami, 1993; Agrawal and
Tables, Multiple Response Analysis; Nonparametric Statistics; Srikant, 1994; Han and Lakshmanan, 2001;
Distribution Fitting; and Power Analysis Techniques. see also Witten and Frank, 2000). The
■ General Linear Models (GLM); General Regression Models (GRM); Association Rules module allows you to
Generalized Linear Models (GLZ); General Partial Least Squares process rapidly huge data sets for associa-
Models (PLS); Variance Components and Mixed Model ANOVA/ANCO- tions (relationships), based on pre-defined
“threshold” values for detection.
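
The A-priori idea named above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names `apriori` and `rules`, the `min_support` and `min_confidence` parameters (standing in for the pre-defined “threshold” values), and the toy transactions are hypothetical and are not part of the STATISTICA interface. The sketch shows level-wise candidate generation with pruning on small in-memory data; it makes no claim about how the module itself is implemented.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support.

    Hypothetical helper for illustration only (not a STATISTICA function).
    """
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items.
    items = {item for b in baskets for item in b}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        frequent.update({s: support(s) for s in current})
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


def rules(frequent, min_confidence):
    """List (antecedent, consequent, support, confidence) for confident rules."""
    found = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                antecedent = frozenset(antecedent)
                # Confidence = support(A and B) / support(A).
                confidence = supp / frequent[antecedent]
                if confidence >= min_confidence:
                    found.append((antecedent, itemset - antecedent, supp, confidence))
    return found


if __name__ == "__main__":
    orders = [{"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B", "C"}, {"A", "B"}]
    for ante, cons, supp, conf in rules(apriori(orders, 0.5), 0.7):
        print(sorted(ante), "->", sorted(cons), round(supp, 2), round(conf, 2))
```

Raising either threshold shrinks the set of reported rules, which is the trade-off the “threshold” values control.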
Specifically, the program will detect relationships or associations between specific values of categorical variables in large data sets. This is a common task in many data mining projects applied to databases containing records of customer transactions (e.g., items purchased by each customer), and also in the area of text mining. As in all modules of STATISTICA, data in external databases can be processed in-place by the STATISTICA Association Rules module, so the program is prepared to handle extremely large analysis tasks efficiently. The results can be displayed in tables, and also in unique 2D and 3D graphs where strong associations are highlighted by thick lines connecting the respective items.

INTERACTIVE DRILL-DOWN EXPLORER

A first step of many data mining projects is to explore the data interactively, to gain a first “impression” of the types of variables in the analyses, and their possible relationships. The purpose of the Interactive Drill-Down Explorer is to provide a combined graphical, exploratory data analysis, and tabulation tool that will allow you to quickly review the distributions of variables in the analyses and their relationships to other variables, and to identify the actual observations belonging to specific subgroups in the data.

How the Drill-Down Explorer Works. The “drill-down” metaphor within the data mining context summarizes the basic operation of this analytic process quite well: the program allows you to select observations from larger data sets by selecting subgroups based on specific values or ranges of values of particular variables of interest (e.g., Gender and Average Purchase in the example above); in a sense you can expose the “deeper layers” or “strata” in the data by reviewing smaller and smaller subsets of observations selected by increasingly complex logical selection conditions.

Drilling “up.” The interactive nature of the Drill-Down Explorer allows you not only to drill down into the data or database (select groups of observations with increasingly specific logical selection conditions), but also to “drill up”: at any time, you can select one of the previously specified variable (category) groups and de-select it from the list of drill-down conditions; while processing the data the program will then only select those observations that fit the remaining logical (case) selection conditions, and update the results accordingly.

Applications of the Interactive Drill-Down Explorer. The example shown earlier is very simple, exposing only the basic functionality of the program. The real power of the STATISTICA Interactive Drill-Down Explorer lies in the various auxiliary results which can automatically be updated during the interactive drill-down/up exploration: you can select a list of variables for review, and compute for the selected cases:

● Descriptive statistics and frequency tables;
● Box-and-whiskers plots summarizing the distributions of continuous variables;
● Scatterplot matrices summarizing the relationships between continuous variables;
● All of the other statistical and graphical analyses available in STATISTICA by extracting the observations belonging to the current subset.

For example, you could review the types of purchases made by customers with different demographic characteristics, study the effectiveness of certain drugs within different treatment groups, ages, etc., or extract likely customers for a new product from a database of previous customers based on careful study of apparent (market) segments exposed by the drill-down analysis.

Interactive Drill-Down Explorer and OLAP (On-Line Analytic Processing). On the surface, the operation of the simplest aspect of the Interactive Drill-Down Explorer (exploration of multidimensional tables) is very similar to the functionality offered by designated OLAP tools (such as those offered in the optional OLAP add-on module for STATISTICA Data Miner). OLAP tools allow users to quickly query a database to extract observations and summary information about those observations, taking advantage of the optimized OLAP Server facilities offered for a specific database platform (e.g., Oracle or MS SQL Server), and often providing significant performance advantages over traditional (non-OLAP driven) query tools. However, the main advantages STATISTICA Interactive Drill-Down Explorer has over OLAP are:

(a) its tight integration with STATISTICA’s flexible categorization tools and exploratory environment (the analytic capabilities provided in the STATISTICA Interactive Drill-Down Explorer are much more comprehensive and more general than those of typical OLAP tools, supporting flexible “drill up” operations, and allowing you to quickly review custom, complex summary graphs, detailed descriptive statistics, etc.), and

(b) the fact that the STATISTICA Interactive Drill-Down Explorer is not limited to any particular database platform and does not require a designated OLAP Server to be present (e.g., it can operate directly on STATISTICA data files). At the same time, by connecting a (remote) database to the STATISTICA application for in-place processing, you can efficiently perform drill-down operations on any data source, regardless of whether or not designated OLAP tools are available on the server.
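
The drill-down and drill-up operations described in this section amount to maintaining a set of per-variable selection conditions and recomputing results for the cases that satisfy all of them. The Python sketch below illustrates that idea only; the class name `DrillDownExplorer`, its methods, and the sample records are hypothetical and do not reflect the actual STATISTICA interface or its in-place database processing.

```python
class DrillDownExplorer:
    """Illustrative sketch: per-variable selection conditions over records."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.conditions = {}  # variable name -> predicate on that variable's value

    def drill_down(self, variable, predicate):
        """Add (or replace) the selection condition for one variable."""
        self.conditions[variable] = predicate
        return self.selected()

    def drill_up(self, variable):
        """Remove one variable's condition, regardless of the order it was added."""
        self.conditions.pop(variable, None)
        return self.selected()

    def selected(self):
        """Rows that satisfy every remaining selection condition."""
        return [
            row for row in self.rows
            if all(pred(row[var]) for var, pred in self.conditions.items())
        ]


def mean(rows, variable):
    """Mean of one variable over the selected rows (None if nothing is selected)."""
    values = [row[variable] for row in rows]
    return sum(values) / len(values) if values else None


if __name__ == "__main__":
    # Hypothetical customer records; Income is in thousands.
    customers = [
        {"Gender": "F", "Income": 30, "Purchase": 10.0},
        {"Gender": "F", "Income": 60, "Purchase": 20.0},
        {"Gender": "M", "Income": 65, "Purchase": 30.0},
        {"Gender": "M", "Income": 20, "Purchase": 5.0},
        {"Gender": "F", "Income": 70, "Purchase": 40.0},
    ]
    explorer = DrillDownExplorer(customers)
    explorer.drill_down("Gender", lambda g: g == "F")          # females only
    explorer.drill_down("Income", lambda income: income >= 50)  # one income group
    both = explorer.drill_up("Gender")  # same income group, males and females
    print(len(both), mean(both, "Purchase"))
```

Because conditions are keyed by variable rather than stacked, any condition can be removed at any time, which mirrors the select and deselect "in any order" behavior described for the Explorer.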
GENERALIZED EM & K-MEANS CLUSTER ANALYSIS

The STATISTICA Generalized EM (Expectation Maximization) and k-Means Clustering module is an extension of the techniques available in the general STATISTICA Cluster Analysis options, specifically designed to handle large data sets, to allow clustering of continuous and/or categorical variables, and to provide the functionality for complete unsupervised learning (clustering) for pattern recognition, with all deployment options for predictive clustering. Various cross-validation options are provided (including modified v-fold cross-validation options) that will automatically choose and evaluate a best final solution for the clustering problem; you do not need to specify the number of clusters before an analysis; instead the program will use automatic (cross-validation based) methods to choose a best cluster solution (number of clusters) for you! The advanced EM clustering technique available in this module is sometimes referred to as probability-based clustering or statistical clustering. The program will cluster observations based on continuous and categorical variables, assuming different distributions for the variables in the analyses (as specified by the user). Various cross-validation options are provided to allow you to choose and evaluate a best final solution for the clustering problem. Detailed output summaries and graphs (e.g., distribution plots for EM clustering), and detailed classification statistics are computed for each observation. These methods are optimized to handle very large data sets, and various results are provided to facilitate subsequent analyses using the assignment of observations to clusters. Options for deploying cluster solutions (in C, C++, C#, Visual Basic, or XML syntax based PMML), for classifying new observations, are also included.

GENERALIZED ADDITIVE MODELS (GAM)

The STATISTICA Generalized Additive Models facilities are an implementation of methods developed and popularized by Hastie and Tibshirani (1990); additional detailed discussion of these methods can also be found in Schimek (2000). The program will handle continuous and categorical predictor variables. Note that STATISTICA includes a comprehensive selection of methods for fitting non-linear models to data, such as the Nonlinear Estimation module, Generalized Linear Models, General Classification and Regression Trees, etc.

Distributions and link functions. The program allows the user to choose from a wide variety of distributions for the dependent variable, and link functions for the effects of the predictor variables on the dependent variable. For Normal, Gamma, and Poisson distributions: log link, f(z) = log(z); inverse link, f(z) = 1/z; identity link, f(z) = z. For the Binomial distribution: logit link, f(z) = log(z/(1-z)).

Scatterplot smoother. The program uses the cubic spline smoother with user-defined degrees of freedom to find an optimum transformation (function) of the predictor variables.

Results statistics. The program will report a comprehensive set of results statistics to aid in the evaluation of the model adequacy, model fit, and interpretation of results. Specifically, results include: the iteration history for the model fitting computations, summary statistics including the overall R-square value (computed from the deviance statistic) and model degrees of freedom, and detailed observational statistics pertaining to the predicted response, residuals, and the smoothing of the predictor variables. Results graphs include plots of observed responses vs. residual responses, predicted values vs. residuals, histograms of observed and residual values, normal probability plots of residual values, and partial residual plots for each predictor, indicating the cubic spline smoothing fit for the final solution; for binary responses (e.g., logit-models) lift charts can also be computed.

GTREES

The Classification and Regression Trees module is a comprehensive implementation of the methods described as CART® by Breiman, Friedman, Olshen, and Stone (1984). However, the GTrees module contains various extensions and options that are typically not found in implementations of this algorithm, and that are particularly useful for data mining applications.

User interface; specifying “models.” In addition to standard analyses (as described by Breiman, et al.), the implementation of these methods in STATISTICA allows you to specify ANOVA/ANCOVA-like designs with continuous and/or categorical predictor variables, and their interactions. Three alternative user interfaces are provided to allow you to specify such designs; these are analogous to the methods provided in GLM (General Linear Models), GLZ (Generalized Linear Models), GRM (General Regression Models), GDA (General Discriminant Analysis Models), and PLS (General Partial Least Squares Models), and are described in detail in the respective sections. In short, ANOVA/ANCOVA-like predictor designs can be specified via dialogs, Wizards, or (design) command syntax; moreover, the command syntax is compatible across modules, so you can quickly apply identical designs to very different analyses (e.g., compare the quality of classification using GDA vs. GTrees).

Tree pruning, selection, validation. The program provides a large number of options for controlling the building of the tree(s), the pruning of the tree(s), and the selection of the best-fitting solution. For continuous dependent (criterion) variables, pruning of the tree can be based on the variance, or on FACT-style pruning. For categorical dependent (criterion) variables, pruning of the tree can be based on misclassification errors, variance, or FACT-style pruning. You can specify the maximum number of nodes for the tree or the minimum n per node. Options are provided for validating the best decision tree, using V-fold cross-validation, or by applying the decision tree to new observations in a validation sample. For categorical dependent (criterion) variables, i.e., for classification problems, various measures can be chosen to modify the algorithm and to evaluate the quality of the final classification tree: options are provided to specify user-defined prior classification probabilities and misclassification costs; goodness-of-fit measures include the Gini measure, Chi-square, and G-Square.

Missing data and surrogate splits. Missing data values in the predictors can be handled by allowing the program to determine splits for surrogate variables, i.e., variables that are similar to the respective variable used for a particular split (node).
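
To make the Gini measure and the minimum n per node option mentioned for classification trees more concrete, here is a minimal Python sketch of how a single binary split on one continuous predictor could be scored. It is an illustration under stated assumptions, not the GTrees implementation: the function names `gini`, `split_gain`, and `best_split` and the `min_n` parameter are hypothetical, and prior probabilities, misclassification costs, pruning, and surrogate splits are all left out.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gain(values, labels, threshold):
    """Impurity reduction from splitting the node at `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


def best_split(values, labels, min_n=1):
    """Return (threshold, gain) of the best split, or None if none is allowed.

    Candidate thresholds are midpoints between adjacent distinct values;
    splits leaving fewer than `min_n` cases in a child node are skipped.
    """
    distinct = sorted(set(values))
    best = None
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        n_left = sum(x <= threshold for x in values)
        if n_left < min_n or len(values) - n_left < min_n:
            continue
        gain = split_gain(values, labels, threshold)
        if best is None or gain > best[1]:
            best = (threshold, gain)
    return best


if __name__ == "__main__":
    income = [1, 2, 3, 4, 5, 6]
    group = ["a", "a", "a", "b", "b", "b"]
    print(best_split(income, group, min_n=2))
```

A tree builder would apply such a search to every predictor at every node and stop when no allowed split remains, for example because the minimum n per node can no longer be met.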
ANOVA/ANCOVA-like designs. In addition to the traditional CART®-style analysis, you can combine categorical and continuous predictor variables into ANOVA/ANCOVA-like designs and perform the analysis using a design matrix for the predictors. This allows you to evaluate and compare complex predictor models, and their efficacy for prediction and classification using various analytic techniques (e.g., General Linear Models, Generalized Linear Models, General Discriminant Analysis Models, etc.).

Tree browser. In addition to simple summary tree graphs, you can display the results trees in intuitive interactive tree browsers that allow you to collapse or expand the nodes of the tree, and to quickly review the most salient information regarding the respective tree node or classification. For example, you can highlight (click on) a particular node in the browser panel and immediately see the classification and misclassification rates for that particular node. The tree browser provides a very efficient and intuitive facility for reviewing complex tree structures, using methods that are commonly used in Windows-based computer applications to review hierarchically structured information. Multiple tree browsers can be displayed simultaneously, containing the final tree and different sub-trees pruned from the larger tree, and by placing multiple browsers side-by-side it is easy to compare different tree structures and sub-trees. The STATISTICA Tree Browser is an important innovation to aid with the interpretation of complex decision trees.

Interactive trees. Options are also provided to review trees interactively, either by using STATISTICA Graphics brushing tools or by placing large tree graphs into scrollable graphics windows where large graphs can be inspected “behind” a smaller (scrollable) window.

Results statistics. The STATISTICA GTrees module provides a very large number of results options. Summary results for each node are accessible, and detailed statistics are computed pertaining to classification, classification costs, gain, and so on. Unique graphical summaries are also available, including histograms (for classification problems) for each node, detailed summary plots for continuous dependent variables (e.g., normal probability plots, scatterplots), and parallel coordinate plots for each node, providing an efficient summary of patterns of responses for large classification problems. As in all statistical procedures of STATISTICA, all numerical results can be used as input for further analyses, allowing you to quickly explore and further analyze observations classified into particular nodes (e.g., you could use the GTrees module to produce an initial classification of cases, and then use best-subset selection of variables in GDA to find additional variables that may aid in the further classification).

C, C++, STATISTICA Visual Basic, SQL Code generators. The information contained in the final tree can be quickly incorporated into your own custom programs or database queries via the optional C, C++, STATISTICA Visual Basic, or SQL query code generator options. The STATISTICA Visual Basic code will be generated in a form that is particularly well suited for inclusion in custom nodes for STATISTICA Data Miner.

GENERAL CHAID MODELS

Like the implementation of General Classification and Regression Trees (GTrees) in STATISTICA, the General Chi-square Automatic Interaction Detection module not only provides a comprehensive implementation of the original technique, but also extends these methods to the analysis of ANOVA/ANCOVA-like designs.

Standard CHAID. The CHAID analysis can be performed for both continuous and categorical dependent (criterion) variables. Numerous options are available to control the construction of hierarchical trees: the user has control over the minimum n per node, maximum number of nodes, and probabilities for splitting and for merging categories; the user can also request exhaustive searches for the best solution (Exhaustive CHAID); V-fold validation statistics can be computed to evaluate the stability of the final solution; for classification problems, user-defined misclassification costs can be specified.

ANOVA/ANCOVA-like designs. In addition to the traditional CHAID analysis, you can combine categorical and continuous predictor variables into ANOVA/ANCOVA-like designs and perform the analysis using a design matrix for the predictors. This allows you to evaluate and compare complex predictor models, and their efficacy for prediction and classification using various analytic techniques (e.g., General Linear Models, Generalized Linear Models, General Discriminant Analysis Models, General Classification and Regression Tree Models, etc.). Refer also to the description of GLM (General Linear Models) and General Classification and Regression Trees (GTrees), above, for details.

Tree browser. Like the binary results tree used to summarize binary classification and regression trees, the results of the CHAID analysis can be reviewed in the STATISTICA Tree Browser. This unique tree browser provides a very efficient and intuitive facility for reviewing complex tree structures and for comparing multiple tree solutions side-by-side (in multiple tree browsers), using methods that are commonly used in Windows-based computer applications to review hierarchically structured information. The STATISTICA Tree Browser is an important innovation to aid with the interpretation of complex decision trees. For additional details, see also the description of the tree browser in the context of the General Classification and Regression Trees (GTrees).

Results statistics. The STATISTICA General CHAID Models module provides a very large number of results options. Summary results for each node are accessible, and detailed statistics are computed pertaining to classification, classification costs, and so on. Unique graphical summaries are also available, including histograms (for classification problems) for each node, detailed summary plots for continuous dependent variables (e.g., normal probability plots, scatterplots), and parallel coordinate plots for each node, providing an efficient summary of patterns of responses for
large classification problems. As in all statistical procedures of STATISTICA, all numerical results can be used as input for further analyses, allowing you to quickly explore and further analyze observations classified into particular nodes (e.g., you could use the GTrees module to produce an initial classification of cases, and then use best-subset selection of variables in GDA to find additional variables that may aid in the further classification).
INTERACTIVE CLASSIFICATION AND REGRESSION TREES
In addition to the modules for automatic tree building (e.g., General Classification and Regression Trees, General CHAID models), STATISTICA Data Miner also includes designated tools for building such trees interactively. You can choose either the (binary) General Classification and Regression Trees method or the CHAID method for building the (decision) tree, and at each step grow the tree either interactively (by choosing the splitting variable and splitting criterion) or automatically. When growing trees interactively, you have full control over all aspects of how to select and evaluate candidates for each split, how to categorize the range of values in predictors, etc. The highly interactive tools available for this module allow you to grow and prune back trees to quickly evaluate the quality of the tree for classification or regression prediction and to compute all auxiliary statistics at each stage to fully explore the nature of each solution. This tool is extremely useful for predictive data mining as well as for exploratory data analysis (EDA), and includes the complete set of options for automatic deployment, for the prediction or predicted classification of new observations (see also the description of these options in the context of CHAID and the General Classification and Regression Trees modules).
BOOSTED TREES
The most recent research on statistical and machine learning algorithms suggests that for some “difficult” estimation and prediction (predicted classification) tasks, using successively boosted simple trees can yield more accurate predictions than neural network architectures or complex single trees alone. STATISTICA Data Miner includes an advanced Boosted Trees module for applying this technique to predictive data mining tasks. You have control over all aspects of the estimation procedure, and detailed summaries of each stage of the estimation procedure are provided so that the progress over successive steps can be monitored and evaluated. The results include most of the standard summary statistics for classification and regression computed by the General Classification and Regression Trees module. Automatic methods for deployment of the final boosted tree solution for classification or regression prediction are also provided.
MULTIVARIATE ADAPTIVE REGRESSION SPLINES (MAR Splines)
The STATISTICA MAR Splines (Multivariate Adaptive Regression Splines) module is based on a complete implementation of this technique, as originally proposed by Friedman (1991; Multivariate Adaptive Regression Splines, Annals of Statistics, 19, 1-141); in STATISTICA Data Miner, the MAR Splines options have further been enhanced to accommodate regression and classification problems, with continuous and categorical predictors. The program, which in terms of its functionality can be considered a generalization and modification of stepwise Multiple Regression and Classification and Regression Trees (GC&RT), is specifically designed (optimized) for processing very large data sets. A large number of results options and extended diagnostics are available to allow you to evaluate numerically and graphically the quality of the MAR Splines solution.
C/C++, C#, STATISTICA Visual Basic, XML syntax based PMML code generators. The information contained in the model can be quickly incorporated into your own custom programs via the optional C/C++/C#, STATISTICA Visual Basic, or (XML-syntax based) PMML code generator options. STATISTICA Visual Basic will be generated in a form that is particularly well suited for inclusion in custom nodes for STATISTICA Data Miner. PMML (Predictive Model Markup Language) files with deployment information can be used with the Rapid Deployment of Predictive Models options to compute predictions for large numbers of cases very efficiently; PMML files are fully portable, and deployment information generated via the desktop version of STATISTICA Data Miner can be used in WebSTATISTICA Data Miner (i.e., on the server side of Client-Server installations), and vice versa.
GOODNESS OF FIT COMPUTATIONS
The STATISTICA Goodness of Fit module will compute various goodness of fit statistics for continuous and categorical response variables (for regression and classification problems). This module is specifically designed for data mining applications to be included in “competitive evaluation of models” projects as a tool to choose the best solution. The program uses as input the predicted values or classifications as computed from any of the STATISTICA modules for regression and classification, and computes a wide selection of fit statistics as well as graphical summaries for each fitted response or classification. Goodness of fit statistics for continuous responses include least squares deviation (LSD), average deviation, relative squared error, relative absolute error, and the correlation coefficient. For classification problems (for categorical response variables), the program will compute Chi-square, G-square (maximum likelihood chi-square), percent disagreement (misclassification rate), quadratic loss, and information loss statistics.
RAPID DEPLOYMENT OF PREDICTIVE MODELS
The Rapid Deployment of Predictive Models module allows you to load one or more PMML (Predictive Model Markup Language) files with deployment information, and to compute very quickly (in a single pass through the data) predictions for large numbers of observations (for one or more models). PMML files can be generated from practically all modules for predictive data mining (as well as the Generalized EM & k-Means Cluster Analysis options). PMML is an XML-based (Extensible Markup Language) industry standard set of syntax conventions that is particularly well suited to allow sharing of deployment information in a Client-Server architecture (e.g., via WebSTATISTICA).
The Rapid Deployment of Predictive Models options provide the fastest, most efficient methods for computing predictions from fully trained models. All models are pre-programmed in generic form in a highly optimized compiled program; the PMML code only supplies the parameter estimates etc. for the fully trained models, to allow the Rapid Deployment
of Predictive Models program to compute predictions or predicted classifications (or cluster assignments) in a single pass through the data. In fact, it is very difficult to “beat” the performance (speed of computations) of this tool, even if you were to write your own compiled C++ code, based on the (C, C++, or C#) deployment code generated by the respective models.
Note that the Rapid Deployment of Predictive Models module will also automatically compute summary statistics for each model, and if observed values or classifications are available, the program will automatically compute goodness-of-fit indices for participating models, including Gains and Lift charts for one or more models (overlaid lift and gain charts), for binary or multinomial (multi-category) classification problems.
The Client-Server Version of Data Miner and Data Mining Via WebSTATISTICA
In the desktop version of STATISTICA Data Miner, all computations are performed on the local computer, and resources of other computers are used only when the In-Place Database Processing (IDP) interface to external databases is established. IDP is a technology that reads data asynchronously directly from remote database servers (using distributed processing if supported by the server), bypassing the need to “import” data and create a local copy of the data set. Records of data are retrieved and sent to the STATISTICA computer asynchronously by the CPU of the database server, while STATISTICA simultaneously processes them using the CPU of the local computer.
The Client-Server Architecture. When a Client-Server version of STATISTICA Data Miner is used, the local computer drives only the user interface of Data Miner, and all calculations are performed on the server. The Client-Server architecture, which uses advanced multithreading and distributed processing technology (see below) and optionally scales to multiple servers that can work in parallel, offers obvious advantages when your data mining projects are large (e.g., computationally intensive or involving processing of extremely large data sets) and can thus be offloaded to the servers, freeing your local computer to perform other jobs.
Multithreading, Distributed Processing Technology. Many additional advantages are offered by the specific implementation of the Client-Server architecture in STATISTICA Data Miner, which is based on the WebSTATISTICA technology. The WebSTATISTICA platform is built on advanced distributed processing and multithreading technology to support optimal management of large computational loads. This technology enables rapid processing of even very large and computationally intensive projects, taking full advantage of the multiple CPUs on the server, or even multiple servers working in parallel. The illustration below shows a project running on a quad processor server, along with the server performance monitor demonstrating the full utilization of the resources of all four CPUs executing in the multithreading mode a single, computationally intensive STATISTICA Data Miner project.
Ultimate scalability (parallel processing technology). One of the unique features of the STATISTICA distributed processing technology is that it flexibly scales not only to take advantage of all CPUs on the current server computer (to support both multiple jobs/users and also individual, computationally intensive projects), but it also scales to multiple server computers. That unique feature is important, since it delivers significant performance gains. STATISTICA uses the parallel processing technology across separate hardware units (like some supercomputers do), and therefore, if you have, for example, three servers with 4 processors each, STATISTICA can run even an individual project on all 12 processors (provided the scale of that project warrants that mode of processing).
In addition, the WebSTATISTICA architecture delivers a platform-independent, Web browser-based user interface, and provides an ultimate, large enterprise-level ability to manage projects or groups of users “across the hall or across continents.”
WebSTATISTICA Data Miner User Interface. The WebSTATISTICA implementation of STATISTICA Data Miner allows users to design, modify, and edit data mining projects on a client machine in a Web browser interface that is essentially identical to that available for the desktop installation.
Therefore, the client side of the application (the “front end”) can be run on any computer (even a laptop) as long as it is connected to the Internet. However, the actual computations and other operations performed on the data will remain on the (remote) server with its usually more powerful processors and storage resources (and they will be managed using the optimized, multithreading and distributed processing architecture of the system for maximum performance).
In essence, the user interface aspects of STATISTICA Data Miner can be run by one or multiple users from any computer in the world (as long as they are connected to the Internet, even by a slow connection), while the server performs all computations and data operations, enforcing the proper security and access privileges applicable to the respective projects and classes of users, as designed by the network administrator.
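The overlap of reading and processing described for In-Place Database Processing (IDP) can be pictured as a producer-consumer pipeline. The Python sketch below is only an illustration under stated assumptions, not STATISTICA code: the function names (fetch_batches, running_mean, mean_without_local_copy) are hypothetical, an in-memory list stands in for the remote database table, and a running mean stands in for whatever statistic or prediction is computed in a single pass.

```python
import queue
import threading


def fetch_batches(rows, batch_size, out_queue):
    """Stand-in for the database server: push batches of records, then a sentinel."""
    for start in range(0, len(rows), batch_size):
        out_queue.put(rows[start:start + batch_size])
    out_queue.put(None)  # sentinel: no more data


def running_mean(in_queue):
    """Consume batches as they arrive, keeping only running totals (single pass)."""
    count, total = 0, 0.0
    while True:
        batch = in_queue.get()
        if batch is None:
            break
        count += len(batch)
        total += sum(batch)
    return total / count if count else float("nan")


def mean_without_local_copy(rows, batch_size=1000, max_pending=4):
    """Overlap reading and processing; at most max_pending batches are buffered."""
    batches = queue.Queue(maxsize=max_pending)
    reader = threading.Thread(target=fetch_batches, args=(rows, batch_size, batches))
    reader.start()              # the reader runs while the caller processes
    result = running_mean(batches)
    reader.join()
    return result
```

The bounded queue is the design point here: the consumer never holds more than a few batches at once, which mirrors the idea of processing records as they arrive instead of importing the whole data set first.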
STATISTICA Data Miner is designed for two general categories of users:
1. Customers who need a complete, deployed, and ready to use solution, designed to solve a specific type of problem (e.g., customer credit scoring, predicting specific aspects of customer behavior or providing answers to specific CRM questions, managing the risk of an equipment failure using a model based on the mining of a very complex set of historical data). For these customers, StatSoft offers a complete installation and deployment of data mining solutions that will draw data from an existing corporate database or data warehouse and generate predictions or ratings using a specific model that StatSoft consultants will deploy on-site (services to develop a data warehouse solution or restructure the existing one are also available). These specialized data mining solutions can later be modified (by StatSoft or other consultants) as the needs of the company change. Modifying such already deployed systems is very easy because all STATISTICA solutions are stored in the form of industry standard VB scripts, and they can readily be deployed in industry standard C++ code.
2. Customers who need a general, powerful data mining solution development system, to be used to design and deploy custom systems (in-house) by the corporate analysts and IS/IT personnel. These customers will license the same set of tools, following the same price structure as the customers from the previous category (see above), except that they will not order the deployment and consulting services.
Common System Features
■ Fully customizable user interfaces
■ Flexible output management
■ Presentation-quality reporting
■ Full Web-enablement options
■ Optimized for large data sets
■ Interactive database query tools
■ Wide set of import/export facilities
■ Multiple input files, instances, & multitasking
■ Highest quality, interactive graphics
■ Complete set of automation options
■ Fully integrated Visual Basic
■ Distributed processing, Client-Server options
■ Optimized Query Interface to databases
■ Optional tools for collaborative work
■ Specialized database options
A comprehensive array of analytical tools for virtually any application
STATISTICA Enterprise Systems. In addition to the common features listed above, STATISTICA Enterprise Systems offer a wide selection of tools for collaborative work, a web browser based user interface (using the optional WebSTATISTICA Server - see below), specialized databases, and a highly optimized interface to enterprise-wide data repositories, including options to rapidly process large data sets from remote servers in-place, without creating local copies. Each product is offered optionally with deployment and on-site training services.
STATISTICA Data Miner - the most comprehensive selection of data mining solutions on the market, with an icon-based, extremely easy-to-use user interface (optionally Web browser based via WebSTATISTICA, see below) and a deployment engine. It features a selection of completely integrated and automated, ready to deploy "as is" (but also easily customizable) specific data mining solutions for a wide variety of business applications. A designated SPC version is also available (see QC Miner below).
STATISTICA Enterprise-wide Data Analysis System (SEDAS) - an integrated, multi-user software system designed for general purpose data analysis and business intelligence applications in research, marketing, finance, and other industries. SEDAS can optionally offer the statistical functionality available in any STATISTICA product.
STATISTICA Enterprise-wide SPC System (SEWSS) - based on state-of-the-art connectivity, multitasking, distributed processing technologies, designed for local and global enterprise quality control/improvement applications, including Six Sigma; it offers real-time monitoring and alarm notification for the production floor, a comprehensive set of analytical tools for engineers, sophisticated reporting features for management, Six Sigma Reporting options, and much more.
STATISTICA QC Miner - a powerful software solution designed to monitor processes, identify and anticipate problems related to quality control and improvement with unmatched sensitivity and effectiveness. STATISTICA QC Miner integrates all quality control charts, process capability analyses, experimental design procedures, and Six Sigma methods with a comprehensive library of cutting-edge techniques for exploratory and predictive data mining.
WebSTATISTICA Server - a highly scalable, enterprise-level, Web-based database gateway application system, built on distributed processing technology and fully supporting multi-tier Client-Server architecture configurations. WebSTATISTICA Server is the ultimate enterprise system that offers full Web enablement, including the ability to run STATISTICA interactively or in batch from a Web browser on any computer (including Linux, UNIX), offload time consuming tasks to the servers, manage projects over the Web, and collaborate “across the hall or across continents.”
Use Data Miner in conjunction with other STATISTICA Enterprise Systems
2300 E. 14th St. • Tulsa, OK 74104 • USA • (918) 749-1119 • Fax: (918) 749-2217 • info@statsoft.com • www.statsoft.com
Australia: StatSoft Pacific Pty Ltd. Germany: StatSoft GmbH Japan: StatSoft Japan Inc. Portugal: StatSoft Iberica Ltda. Spain: StatSoft Espana
Brazil: StatSoft Brazil Ltda. Hungary: StatSoft Hungary Ltd. Korea: StatSoft Korea Russia: StatSoft Russia Sweden: StatSoft Scandinavia AB
Czech Republic: StatSoft Czech Rep. s.r.o. Israel: StatSoft Israel Ltd. Netherlands: StatSoft Benelux BV Singapore: StatSoft Singapore Taiwan: StatSoft Taiwan
France: StatSoft France Italy: StatSoft Italia srl Poland: StatSoft Polska Sp. z o. o. S. Africa: StatSoft S. Africa (Pty) Ltd. UK: StatSoft Ltd.
STATISTICA and StatSoft are trademarks of StatSoft, Inc. © Copyright StatSoft, Inc. 1984 - 2002
