
4. Data Analysis and Display


Introduction
• Data handling is the collection, recording and
presentation of data, which helps us to
organise our experiences.
Components of a data analysis plan
• Purpose of the evaluation
• Questions
• What you hope to learn from
the question
• Analysis technique
• How data will be presented
Available Tools and Methods
• Spreadsheets and graphical packages
• Statistical software and related tools (e.g.
AMDAS from http://www.environ.org/amdas).
• Simulation software (e.g. ARENA, VENSIM,
Stella, Easy Java Simulation, etc.)
• Data validation and data display tool (e.g.
VOCDat from
ftp://ftp.sonomatech.com/public/vocdat/).
This tool is described in more detail in the
Data Validation section.
Interpreting Data
Data Classification
• Classification of Data is the process of
arranging data into classes or categories
according to some common characteristics
present in the data.
• There are four important bases for
classification:
1. Qualitative – by attributes e.g. Gender
2. Quantitative – by quantitative characteristic e.g. Age
3. Geographical – by region or location e.g. district
4. Chronological – by time of occurrence (aka temporal
or time series)
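As a rough illustration of the four bases, the sketch below groups a handful of records by each basis in turn. The records, the field names and the 20-year age bands are assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical survey records; all values are invented for illustration.
records = [
    {"gender": "F", "age": 23, "district": "North", "year": 2019},
    {"gender": "M", "age": 41, "district": "South", "year": 2020},
    {"gender": "F", "age": 37, "district": "North", "year": 2020},
    {"gender": "M", "age": 19, "district": "East", "year": 2019},
]


def classify(rows, key):
    """Group rows into classes, using key(row) as the class label."""
    classes = defaultdict(list)
    for row in rows:
        classes[key(row)].append(row)
    return dict(classes)


def age_band(row):
    """Quantitative basis: place an age into a 20-year class interval."""
    low = (row["age"] // 20) * 20
    return f"{low}-{low + 19}"


by_gender = classify(records, lambda r: r["gender"])      # 1. qualitative
by_age = classify(records, age_band)                      # 2. quantitative
by_district = classify(records, lambda r: r["district"])  # 3. geographical
by_year = classify(records, lambda r: r["year"])          # 4. chronological

for label, group in sorted(by_age.items()):
    print(label, len(group))
```

The same `classify` helper serves all four bases; only the key function changes, which is the "common characteristic" the definition refers to.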
Analysis of geological structure
cross section

Which geological features are the oldest?


Chronological Classification of
geological structures
Data mining
• Data mining is the process of discovering
patterns in data.
• It is about solving problems by analyzing data
already present in databases.
• The data is stored electronically in a data
warehouse, and the search is automated by
computer algorithms.
Business Intelligence is the end result
of the data mining processes
Describing structural patterns

• If tear production rate = reduced then recommendation = none
• Otherwise, if age = young and astigmatic = no then recommendation = soft
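A structural pattern of this kind maps directly onto an ordered list of tests. The sketch below encodes only the two rules shown; the function name and the `None` fallback for cases the rules do not cover are assumptions.

```python
def recommend(tear_production_rate, age, astigmatic):
    """Apply the two rules in order; return None if neither rule fires."""
    if tear_production_rate == "reduced":
        return "none"
    if age == "young" and astigmatic == "no":
        return "soft"
    # The slide gives no rule for the remaining cases (assumed fallback).
    return None


print(recommend("reduced", "young", "no"))  # first rule fires
print(recommend("normal", "young", "no"))   # second rule fires
```

Rule order matters: a young, non-astigmatic patient with reduced tear production is caught by the first rule and never reaches the second.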
Machine learning
• Machines learn when they change their
behaviour in a way that makes them perform
better in the future.
• We require techniques for finding and
describing structural patterns in data as a tool
for helping to explain that data and make
predictions from it.
Machine learning and statistics
• Machine learning uses standard statistical
methods:
- visualization of data,
- selection of attributes,
- discarding outliers
• Statistical tests are used to validate machine
learning models and to evaluate machine
learning algorithms.
Successful Data Mining
• Six steps:
1. Define what is to be predicted
2. Decide on the appropriate model
3. Prepare data sources
4. Build the model
5. Interpret
6. Deploy
Data Warehouse: A Multi-Tiered Architecture

[Figure: multi-tiered data warehouse architecture]
• Data Sources: operational DBs and other sources
• Data Storage: extract, transform, load and refresh into the
data warehouse and data marts, with metadata and a
monitor & integrator
• OLAP Engine: OLAP server
• Front-End Tools: analysis, query, reports, data mining
Step 1. Define what is to be predicted
• Create a clear definition of the prediction and
associated requirements, the data that is
needed, the data that is available, the
reason(s) the prediction is needed, and the
way the prediction will be used.
Step 2. Decide on the appropriate
modelling type
• Choose a modelling type, which can be one of
four:
1. Classification,
2. Clustering/Segmentation,
3. Regression or
4. Forecasting/Trending.
Step 3. Prepare data sources
• The most time-consuming of all the steps is the
data preparation step, also known as extract,
transform and load (ETL). It involves:
• pulling data from resident systems
• transforming the data to a format appropriate for
the data mining platform
• identifying data variables necessary for the data
mining effort
• data cleanup (standardization and/or
normalization of the data), in order to avoid
“garbage in, garbage out”.
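The preparation step can be sketched as three small functions. This is a minimal illustration under stated assumptions: the CSV export, its column names and the in-memory list standing in for the warehouse are all hypothetical, and min-max scaling is just one possible normalization.

```python
import csv
import io

# Hypothetical raw export from a source system (invented values).
RAW = """customer,region,spend
 alice ,north,120
BOB,North,80
carol,SOUTH,
dave,south,200
"""


def extract(text):
    """Extract: read rows from the source export."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: standardize text fields, drop incomplete rows, normalize."""
    clean = []
    for row in rows:
        if not row["spend"].strip():
            continue  # cleanup: skip records with a missing spend value
        clean.append({
            "customer": row["customer"].strip().lower(),
            "region": row["region"].strip().lower(),
            "spend": float(row["spend"]),
        })
    # Min-max normalization of spend to the range [0, 1].
    low = min(r["spend"] for r in clean)
    high = max(r["spend"] for r in clean)
    for r in clean:
        r["spend_norm"] = (r["spend"] - low) / (high - low) if high > low else 0.0
    return clean


def load(rows, target):
    """Load: append the prepared rows to the target store (a list here)."""
    target.extend(rows)
    return target


warehouse = load(transform(extract(RAW)), [])
for row in warehouse:
    print(row)
```

Dropping the incomplete row and unifying the spelling of `region` are the kind of cleanup that keeps inconsistent input from distorting the model later.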
Step 4. Build the model
• Two of the industry-accepted data mining processes
are SEMMA (the SAS Institute approach: Sample, Explore,
Modify, Model, Assess) and CRISP-DM
(the Cross Industry Standard Process for Data Mining).
• Sample: representative and statistically valid sampling is
required.
• Explore: perform an exploratory data analysis (central
tendencies, population characteristics, dispersion and
distribution, frequencies, outliers and anomalies), then
Modify the data as needed.
• Model: modelling techniques include neural networks,
decision trees, linear and logistic regression,
discriminant analysis and rule-based methods. Finally,
Assess the resulting model.
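The exploratory measures named above can be computed with the Python standard library. The sample values are invented, and the 1.5 × IQR fence used to flag outliers is one common convention, chosen here as an assumption.

```python
import statistics
from collections import Counter

# Invented sample; the last value is deliberately extreme.
values = [12, 15, 14, 13, 15, 16, 14, 15, 48]

# Central tendencies
mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)

# Dispersion and frequencies
stdev = statistics.stdev(values)
frequencies = Counter(values)

# Outliers: values outside 1.5 * IQR beyond the quartiles (assumed convention)
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

print("mean", mean, "median", median, "mode", mode)
print("stdev", round(stdev, 2))
print("outliers", outliers)
```

The single extreme value pulls the mean well above the median, which is why outliers are inspected before a model is fitted.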
Step 5. Interpret
• This step involves a subject matter expert
(SME) to interpret the prediction of the model
as well as to translate the results to a form
appropriate for deployment to an end user.
Step 6. Deploy
• This is the process of making the model
available to the end user.
• The model, in combination with a scoring
engine, produces the prediction for a given
dataset. This process depends on the chosen
data warehouse (DW), application infrastructure
and data mining (DM) platform.
Data Mining Vendors
Association Rule Mining
Itemset
• A set of items is referred to as an itemset.
• An itemset containing k items is called a k-itemset.
• An itemset can also be seen as a conjunction
of items (or a predicate).
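A short sketch of these definitions: itemsets are modelled as frozensets, and the conjunction reading becomes a subset test against a transaction. The item names and helper names are hypothetical.

```python
from itertools import combinations

items = ["bread", "milk", "butter"]  # hypothetical item universe


def k_itemsets(items, k):
    """Return every itemset containing exactly k of the given items."""
    return [frozenset(c) for c in combinations(sorted(items), k)]


def contains(transaction, itemset):
    """Itemset as a conjunction: true only if every item is present."""
    return itemset <= set(transaction)


print(k_itemsets(items, 2))
print(contains(["bread", "milk", "eggs"], frozenset({"bread", "milk"})))
```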
Rule Measures: Support and Confidence
Strong Rules
Mining Association Rules
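The slide bodies for these three headings did not survive, so the sketch below relies on the standard definitions: the support of a rule A ⇒ B is the fraction of transactions containing both A and B, its confidence is support(A ∪ B) / support(A), and a rule is strong when it meets a minimum support and a minimum confidence. The transactions and thresholds are invented, and the brute-force search is only an illustration, not an efficient algorithm such as Apriori.

```python
from itertools import combinations

# Invented market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(A and B) / support(A); 0.0 if A never occurs."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    return support(antecedent | consequent, transactions) / base


def strong_rules(transactions, min_support, min_confidence):
    """Enumerate all rules A => B meeting both thresholds (brute force)."""
    items = sorted(set().union(*transactions))
    rules = []
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            sup = support(itemset, transactions)
            if sup < min_support:
                continue  # itemset is not frequent
            for r in range(1, size):
                for lhs in combinations(combo, r):
                    lhs = frozenset(lhs)
                    rhs = itemset - lhs
                    conf = confidence(lhs, rhs, transactions)
                    if conf >= min_confidence:
                        rules.append((lhs, rhs, sup, conf))
    return rules


for lhs, rhs, sup, conf in strong_rules(transactions, 0.5, 0.9):
    print(set(lhs), "=>", set(rhs), "support", sup, "confidence", conf)
```

With these thresholds only butter ⇒ bread qualifies: both items appear together in half of the transactions, and every transaction with butter also contains bread.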
