AIM: 2 Tutorials
THEORY:
DATA EXPLORATION:
Data exploration helps a data consumer focus an information search on the pertinent
aspects of the relevant data before true analysis can begin. In large data sets, data is
often not gathered or controlled in a focused manner. Even in smaller data sets, data
gathered without a rigid and specific technique can end up disorganized and scattered
across a myriad of subsets.
In most cases, narrowing an information search without a set of techniques may cause
several problems, because one may lose important perspectives on the relevant data
among the myriad sets of unrelated data.
There are generally two methodologies for retrieving relevant data from huge data
sources or sets: manual and automatic. The automatic techniques are more commonly
known as data mining, and the manual techniques as data exploration. Although they are
categorized this way, these terms are not rigorously defined in IT practice.
Data mining, along with its near relative, data prospecting, is used in a wide variety
of ways and is considered by many to be a much-abused term in everyday usage. Some
people treat it as synonymous with data analysis, although many believe that the two
are technically different.
Data mining is a methodology commonly used on very large data sets. In fact, it is
applied to entire databases backing a data warehouse. A common definition of data mining
is "the nontrivial extraction of implicit, previously unknown, and potentially useful
information from data" or "the science of extracting useful information from large data
sets or databases". Although data mining is guided by a human being who specifies some
parameters, an automated algorithm carries out the search.
On the other hand, data exploration is a methodology that uses manual techniques to
help the data user find a way through large bulks of data and bring the important and
relevant data into focus for analysis. The methodology may apply to data of any type or
size, but because of its manual nature, many opt to use data exploration for smaller
data sets.
Data preprocessing
Why preprocessing?
Data cleaning
o Use the attribute mean (or majority nominal value) to fill in the missing
value.
o Use the attribute mean (or majority nominal value) for all samples
belonging to the same class.
o Binning
Sort the attribute values and partition them into bins (see
"Unsupervised discretization" below);
o Clustering: group values in clusters and then detect and remove outliers
(automatic or manual)
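The missing-value and binning steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: missing values are represented as None, the attribute is numeric, and bins are formed with equal frequency (the text only says to sort and partition). The function names fill_with_mean, fill_with_class_mean and equal_frequency_bins are hypothetical and do not come from any particular library.

```python
from statistics import mean


def fill_with_mean(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    overall_mean = mean(known)
    return [overall_mean if v is None else v for v in values]


def fill_with_class_mean(values, labels):
    """Replace None entries with the mean of the known values in the same class.

    Assumes every class has at least one known value; otherwise a KeyError is raised.
    """
    known_by_class = {}
    for v, c in zip(values, labels):
        if v is not None:
            known_by_class.setdefault(c, []).append(v)
    class_means = {c: mean(vs) for c, vs in known_by_class.items()}
    return [class_means[c] if v is None else v for v, c in zip(values, labels)]


def equal_frequency_bins(values, n_bins):
    """Sort the values and partition them into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value each.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


if __name__ == "__main__":
    print(fill_with_mean([1.0, None, 3.0]))
    print(fill_with_class_mean([1.0, None, 5.0, None, 7.0], ["a", "a", "b", "b", "b"]))
    print(equal_frequency_bins([9, 1, 4, 7, 3, 8], 3))
```

For a nominal attribute, the same pattern would use the most frequent value instead of the mean.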
Data transformation
1. Normalization:
o Scaling attribute values to fall within a specified range.
o Scaling by using mean and standard deviation (useful when min and max
are unknown or when there are outliers): V'=(V-Mean)/StDev
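As a rough illustration of the two normalization options above, the sketch below scales values into a target range (min-max scaling) and applies V'=(V-Mean)/StDev. The function names and the default range [0, 1] are assumptions made for the example, and the population standard deviation is used; the text does not say whether the population or sample form is intended.

```python
from statistics import mean, pstdev


def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Scale values linearly so they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them to the lower bound.
        return [new_min for _ in values]
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]


def z_score(values):
    """Apply V' = (V - Mean) / StDev using the population standard deviation."""
    m = mean(values)
    s = pstdev(values)  # assumption: population form; raises ZeroDivisionError below if s == 0
    return [(v - m) / s for v in values]


if __name__ == "__main__":
    print(min_max_scale([10, 20, 30]))
    print(z_score([2, 4, 4, 4, 5, 5, 7, 9]))
```

Min-max scaling needs the minimum and maximum to be known and is sensitive to outliers, which is why the mean and standard deviation form is suggested for those cases.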
Data reduction
o Aggregation or generalization
o Sampling
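The two reduction ideas above can be illustrated with a minimal sketch. It assumes rows are tuples, that aggregation means summing a numeric column per group, and that sampling means simple random sampling without replacement; the text does not specify any of these. The names simple_random_sample and aggregate_sum are hypothetical.

```python
import random
from collections import defaultdict


def simple_random_sample(rows, k, seed=None):
    """Return k rows chosen at random without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    return rng.sample(rows, k)


def aggregate_sum(rows, key_index, value_index):
    """Sum the value column for each distinct key, reducing many rows to one per key."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_index]] += row[value_index]
    return dict(totals)


if __name__ == "__main__":
    sales = [("2023", 10.0), ("2023", 5.0), ("2024", 7.0)]
    print(aggregate_sum(sales, 0, 1))
    print(simple_random_sample(sales, 2, seed=0))
```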
Sort values.
That is, the first 3 belong to "+" tuples and the last 2 - to "-"
tuples
Method:
CONCLUSION: Thus we have learnt about the Data Exploration & Data
Preprocessing techniques.