Chapter 6. Data Collection, Preprocessing and Implementation
6.1 Introduction
Data collection is often a loosely controlled process of gathering data. As a
result, the collected data frequently contain out-of-range values, impossible
data combinations, missing values, noise, and other defects. Data that have
not been properly screened can produce misleading results. To obtain quality
results that are useful for information generation and decision making, the
raw data need to be preprocessed.
Data preprocessing is the data mining technique that transforms raw data into
an appropriate and understandable form for further processing. Real-world
data are most often incomplete, uncertain, and inconsistent, and contain many
errors. The phrase "Garbage In, Garbage Out" is particularly applicable to
machine learning and data mining. Data preprocessing is therefore required to
produce quality data for further processing and decision making.
6.2 Data Collection
The student admission data set shown in Table 6.2 has been considered, and it
has been processed on two different sites, S1 and S2. For simplicity of
calculation, site S1 has 179 instances and site S2 has 142 instances to
process.
The data set is given below.
Table 6.2: Student admission data set collected from Parul University Web Portal
6.3 Data Pre-Processing
Table 6.3: Student performance data set collected from Departments of PIT
College
For the data mining process, the quantity of data is as important as its
relevance. The quantity of data can be judged along three dimensions:
1. Number of instances (records, objects). Rule of thumb: 5,000 or more are
desired; with fewer, results are less reliable and special methods
(boosting, ...) should be used.
2. Number of attributes (fields). Rule of thumb: 10 or more instances for each
attribute; if there are more fields, use feature reduction and selection.
3. Number of targets. Rule of thumb: more than 100 instances for each class;
if the classes are very unbalanced, use stratified sampling.
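The rules of thumb above can be turned into a simple check. The following is a minimal sketch in Python, used here only for illustration (the implementation reported in this chapter was written in C#). The function name, its parameters, and the example attribute count and class split are assumptions and do not come from the source; only the total of 321 instances (179 + 142) is taken from Section 6.2.

```python
from collections import Counter

def check_data_quantity(n_instances, n_attributes, class_labels,
                        min_instances=5000, instances_per_attribute=10,
                        min_per_class=100):
    """Return warnings for the rules of thumb that a data set does not meet."""
    warnings = []
    # Rule 1: 5,000 or more instances are desired.
    if n_instances < min_instances:
        warnings.append("few instances: results may be less reliable")
    # Rule 2: at least 10 instances for each attribute.
    if n_instances < instances_per_attribute * n_attributes:
        warnings.append("too many attributes: consider feature reduction or selection")
    # Rule 3: more than 100 instances for each class.
    counts = Counter(class_labels)
    small = sorted(str(c) for c, n in counts.items() if n <= min_per_class)
    if small:
        warnings.append("small classes: " + ", ".join(small))
    return warnings

# Hypothetical example: 321 instances, 8 attributes, two classes.
labels = ["yes"] * 200 + ["no"] * 121
for message in check_data_quantity(321, 8, labels):
    print(message)
```

With these hypothetical inputs only the first rule is violated, which matches the remark that small data sets give less reliable results.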
• Data cleaning: fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies.
• Data reduction: reduce the volume of data while producing the same or
similar analytical results.
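Two of the cleaning steps listed above can be sketched as follows. This is an illustrative Python sketch, not the procedure used in this work: filling missing values with the mean and detecting outliers with the interquartile range (IQR) are common choices assumed here, and the function names and sample marks are hypothetical.

```python
from statistics import mean, quantiles

def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]

def remove_outliers_iqr(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

# Hypothetical marks with one missing value and one impossible value.
marks = [55, 60, None, 65, 68, 70, 72, 500]
filled = fill_missing_with_mean(marks)
cleaned = remove_outliers_iqr(filled)
print(cleaned)
```

Whether an outlier should be removed automatically or reviewed manually depends on the attribute, so the threshold `k` is left as a parameter.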
• Binning
  • Sort the attribute values and partition them into bins (see
    "Unsupervised discretization" below);
  • Then smooth by bin means, bin medians, or bin boundaries.
• Clustering: group values into clusters and then detect and remove
  outliers (automatic or manual).
• Regression: smooth by fitting the data to regression functions.
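The binning step above can be illustrated with a short Python sketch. It assumes equal-frequency bins, which is one possible partitioning and is not confirmed by the source; the function names and the sample values are illustrative only.

```python
def equal_frequency_bins(values, n_bins):
    """Sort values and split them into n_bins bins of (nearly) equal size.

    Assumes len(values) >= n_bins so that no bin is empty.
    """
    data = sorted(values)
    size, extra = divmod(len(data), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(data[start:end])
        start = end
    return bins

def smooth_by_means(bins):
    """Replace every value in a bin with the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

# Illustrative values, not taken from the data sets of this chapter.
bins = equal_frequency_bins([4, 8, 15, 21, 21, 24, 25, 28, 34], 3)
print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

Smoothing by means removes more local variation than smoothing by boundaries, which keeps the extreme values of each bin unchanged.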
1. Normalization:
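The normalization method used in this work is not preserved in the text. As an illustration only, the sketch below shows min-max normalization, one widely used form, in Python. The function name, the target range, and the sample values are assumptions.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so that they fall in [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them to the lower bound.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]

# Hypothetical attribute values.
print(min_max_normalize([50, 75, 100]))  # [0.0, 0.5, 1.0]
```

Other forms, such as z-score normalization, follow the same pattern with a different scaling rule.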
• Sampling
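Sampling reduces the number of instances to process. The sketch below shows stratified sampling, which keeps the class proportions of the original data; it is an illustrative Python example under assumed names (`stratified_sample`, `label_of`, `fraction`) and does not describe the sampling procedure actually used in this work.

```python
import random
from collections import defaultdict

def stratified_sample(records, label_of, fraction, seed=0):
    """Draw roughly `fraction` of the records from every class."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    groups = defaultdict(list)
    for record in records:
        groups[label_of(record)].append(record)
    sample = []
    for members in groups.values():
        # Keep at least one record from each class.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical unbalanced data: 90 records of class "a", 10 of class "b".
records = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]
sample = stratified_sample(records, lambda r: r[0], 0.2)
print(len(sample))
```

Because each class is sampled separately, a rare class is not lost, which can happen with simple random sampling on very unbalanced data.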
6.4 Test Data Set
6.5 Implementation
The research work has been carried out on different numbers of sites with the
following hardware and software configurations:
Software
3. Language: C#
Hardware
2. RAM: 4 GB
6.6 Summary
In this chapter, the importance of data collection and preprocessing has been
discussed, along with the different ways in which a data set may lack
sufficient quality for processing. Such data sets need to be preprocessed. In
this research work, two different data sets have been used. The local training
models have been generated and merged using the proposed approach. The
accuracy of the resulting global models has been checked on test data sets;
they classify the test data sets with an accuracy of more than 98%.