Data: Pre-processing and Exploring in Data Mining
BIF 515
Neeru S Redhu
Data
Data set: a collection of data objects
Data object: described by a number of attributes
Also called record, entry, entity, point, vector, pattern, event, case, sample or observation
Attribute: a property or characteristic of an object that may vary from one object to another or from one time to another
Also called variable, characteristic, field, feature or dimension
Types of Attributes

Categorical (Qualitative)

Nominal
Description: The values of a nominal attribute are just different names, i.e. nominal values provide only enough information to distinguish one object from another (=, ≠)
Examples: zip codes, eye color, gender
Operations: mode, entropy, contingency correlation, χ² test

Ordinal
Description: The values provide enough information to order objects
Examples: hardness of a material, grades
Operations: median, percentile, rank correlation, sign test

Numeric (Quantitative)

Interval
Description: Differences between values are meaningful, i.e. a unit of measurement exists
Examples: calendar dates, temperature
Operations: mean, standard deviation, Pearson's correlation, t and F tests

Ratio
Description: Both differences and ratios are meaningful
Examples: age, mass, length
Operations: geometric mean, harmonic mean, percent variation
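
As a rough illustration of the operations listed for each attribute type, the sketch below applies one representative statistic per type using Python's standard `statistics` module. All sample values and the grade ordering are made up for the example; they are not from the slides.

```python
from statistics import mode, median_low, mean, geometric_mean

# Hypothetical sample data, one list per attribute type.
eye_colors = ["brown", "blue", "brown", "green"]   # nominal
grade_order = ["D", "C", "B", "A"]                 # assumed ordering, worst to best
grades = ["B", "A", "C", "B", "D"]                 # ordinal
temperatures_c = [20.0, 22.0, 24.0]                # interval
masses_kg = [2.0, 8.0]                             # ratio

# Nominal: only equality is meaningful, so the mode is a valid summary.
nominal_mode = mode(eye_colors)

# Ordinal: order is meaningful, so take the median of the ranks
# and map it back to a grade label.
ranks = [grade_order.index(g) for g in grades]
ordinal_median = grade_order[median_low(ranks)]

# Interval: differences are meaningful, so the arithmetic mean applies.
interval_mean = mean(temperatures_c)

# Ratio: ratios are meaningful, so the geometric mean applies.
ratio_geometric_mean = geometric_mean(masses_kg)

print(nominal_mode, ordinal_median, interval_mean, ratio_geometric_mean)
```

Each statistic is also valid for the types below it in the list (a mean cannot be taken of eye colors, but a mode can be taken of masses).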
Attributes by Number of values
Discrete
Binary
Continuous

Asymmetric Attributes
Asymmetric Binary
General Characteristics of Data sets
Dimensionality
Number of attributes that the objects in the data set possess
Sparsity
When most attributes of an object have the value 0. This is an advantage, because only the non-zero values need to be stored and processed, which gives significant savings in computation time and storage
Resolution
The properties of the data differ at different resolutions, e.g. the surface of the Earth, weather forecasting
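
The saving that sparsity brings can be sketched as follows: each row keeps only its non-zero entries, and a dot product touches only those entries. This is a minimal illustration with made-up rows and hypothetical helper names (`to_sparse`, `sparse_dot`), not a reference to any particular library.

```python
def to_sparse(row):
    """Keep only the non-zero entries of a row as {column_index: value}."""
    return {i: v for i, v in enumerate(row) if v != 0}


def sparse_dot(a, b):
    """Dot product of two sparse rows; only shared non-zero columns contribute."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the row with fewer stored entries
    return sum(v * b[i] for i, v in a.items() if i in b)


# Hypothetical rows in which most attributes are 0.
dense_a = [0, 0, 3, 0, 0, 0, 1, 0]
dense_b = [0, 2, 4, 0, 0, 0, 0, 0]

sparse_a = to_sparse(dense_a)
sparse_b = to_sparse(dense_b)

# The sparse result matches the dense computation.
dense_result = sum(x * y for x, y in zip(dense_a, dense_b))
sparse_result = sparse_dot(sparse_a, sparse_b)
print(sparse_a, sparse_b, sparse_result)
```

Here eight stored values per row shrink to two, and the dot product inspects two entries instead of eight.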
Types of Data Sets
Record Data
Transaction or market basket Data
Data matrix
Sparse data matrix
Graph Based Data
Relation among data objects
Data objects that are graph
Ordered Data
Sequential Data
Sequence Data
Time series Data
Spatial Data
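
To connect two of the record data types listed above, the sketch below turns transaction (market basket) data into a binary data matrix with one column per item. The transactions are invented for the example, and this is only one possible encoding.

```python
# Hypothetical market basket data: one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers"},
]

# One column per distinct item, in a fixed (alphabetical) order.
items = sorted(set().union(*transactions))

# One row per transaction; 1 if the item was bought, 0 otherwise.
matrix = [[1 if item in t else 0 for item in items] for t in transactions]

print(items)
for row in matrix:
    print(row)
```

With a realistic number of items, most entries in each row would be 0, which is why such data is usually treated as a sparse data matrix.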
Data Quality
Data mining is often applied to data that was collected for an unspecified purpose or application
Data mining generally focuses on
Data cleaning: detection and correction of data quality problems
Use of algorithms that can tolerate poor data quality
Measurement and Data collection issues
Measurement and Data collection errors
Noise and Artifacts
Precision, Bias and Accuracy
Outliers
Missing values
Eliminate data objects/ attributes
Estimate missing values
Ignore the missing value during analysis
Inconsistent values
Duplicate data
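
The three ways of handling missing values listed above can be sketched on a small example. The records are hypothetical (age, weight) pairs with `None` marking a missing value, and mean imputation is just one assumed way of estimating a missing value.

```python
from statistics import mean

# Hypothetical (age, weight) records; None marks a missing value.
records = [(23, 50.0), (None, 60.0), (31, None), (27, 70.0)]


def column(rows, j):
    """Observed (non-missing) values of attribute j."""
    return [r[j] for r in rows if r[j] is not None]


# 1. Eliminate data objects that have any missing value.
complete = [r for r in records if None not in r]

# 2. Estimate missing values, here with the mean of the observed values
#    of the same attribute (an assumption; other estimators are possible).
col_means = [mean(column(records, j)) for j in range(2)]
imputed = [
    tuple(v if v is not None else col_means[j] for j, v in enumerate(r))
    for r in records
]

# 3. Ignore missing values during analysis: use whatever values are present
#    for each attribute, without dropping or changing any record.
age_mean_available = mean(column(records, 0))
age_mean_complete = mean(column(complete, 0))

print(complete, imputed, age_mean_available, age_mean_complete)
```

Eliminating records discards the observed age of 31 along with the missing weight, so the mean age over complete records differs from the mean over all available ages.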
Data Pre-processing
Aggregation
Sampling
Dimensionality reduction
Feature subset selection
Feature creation
Discretization and binarization
Variable transformation
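
As an illustration of discretization and binarization from the list above, the sketch below splits a continuous attribute into equal-width intervals and then encodes each interval as a set of binary attributes. Equal-width binning is only one assumed discretization scheme, the helper names are hypothetical, and the ages are made up.

```python
def equal_width_bins(values, k):
    """Discretize values into k equal-width intervals, returning bin indices.

    Assumes max(values) > min(values), so the interval width is non-zero.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it to the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def binarize(categories, k):
    """Encode each category index as k binary attributes (one per category)."""
    return [[1 if c == j else 0 for j in range(k)] for c in categories]


# Hypothetical continuous attribute.
ages = [5, 17, 30, 42, 65]

bins = equal_width_bins(ages, 3)   # intervals of width 20 starting at 5
binary = binarize(bins, 3)

print(bins)
for row in binary:
    print(row)
```

One binary attribute per interval produces asymmetric binary attributes, where only the presence of a 1 is informative.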