
Chapter 6

Data Collection,
Preprocessing and
Implementation

6.1 Introduction
Data collection is often a loosely controlled process, and the gathered data frequently contain out-of-range values, impossible attribute combinations, missing values, noise, and other defects. Data that have not been properly screened can produce misleading results. To acquire quality results that support information generation and decision making, the raw data need to be preprocessed.

Data preprocessing is the data mining technique that transforms raw data into an appropriate and understandable form for further processing. In the real world, data are most often incomplete, uncertain, inconsistent, and riddled with errors. The phrase "Garbage In, Garbage Out" is particularly applicable to machine learning and data mining, so data preprocessing is required to produce quality data for decision making.

In this chapter, the different data sets used for implementation, their preprocessing, and the implementation itself are discussed in detail.


6.2 Data Collection


Data collection is the process of gathering and measuring information on vari-
ables of interest, in an established systematic fashion that enables one to an-
swer stated research questions, test hypotheses, and evaluate outcomes. There
are numerous data collection methods available, but in this research work a real student admission data set from Parul University (gathered from the PU Web portal) and the Zoo data set have been used.

6.2.1 Zoo Data Set


This data set has been downloaded from the UCI repository. It is a simple database containing an animal name (unique for each instance), 15 Boolean-valued attributes, a numeric legs attribute, and a numeric class attribute (type). There are 101 instances in total, with no missing values. The attribute information is given in table 6.1.

6.2.2 Student Admission Data Set


This research has been carried out on a real data set of Parul University for the prediction of student admission in different fields/branches of different colleges. These data have been collected from the Parul University Web Portal. More than 100,000 records in total have been used for training. The data set has more than 10 attributes, but an attribute selection method has been applied to keep only the relevant ones. Preprocessing techniques have also been applied to turn them into quality data for the further data mining process.

The student admission data set shown in table 6.2 has been considered, and the records have been processed on two different sites, S1 and S2. For simplicity of calculation, site S1 holds 179 instances and site S2 holds 142 instances. The attributes of the data set are listed in table 6.2.


Sr. No.  Attribute Name  Data Type  Value (Range)    Remarks

1        Animal Name     Nominal                     Unique for each instance
2        Hair            Boolean
3        Feathers        Boolean
4        Eggs            Boolean
5        Milk            Boolean
6        Airborne        Boolean
7        Aquatic         Boolean
8        Predator        Boolean
9        Toothed         Boolean
10       Backbone        Boolean
11       Breathes        Boolean
12       Venomous        Boolean
13       Fins            Boolean
14       Legs            Numeric    {0,2,4,5,6,8}
15       Tail            Boolean
16       Domestic        Boolean
17       Catsize         Boolean
18       Type            Numeric    [1,7]            Class attribute

Table 6.1: Zoo data set

Sr. No.  Attribute Name  Data Type  Value (Range)               Remarks

1        Institute       Nominal    {PIET1, PIET2, PIT1, PIT2}
2        Admtype         Nominal    {State, Management, TFWS}
3        Category        Nominal    {SC, ST, SEBC, OPEN}
4        ACPCRank        Nominal
5        SSC             Numeric    [0,100]                     Percentage
6        HSC             Numeric    [0,100]                     Percentage
7        Degree          Nominal
8        City            Nominal
9        Name            Nominal

Table 6.2: Student admission data set collected from the Parul University Web Portal


Sr. No.  Attribute Name  Data Type  Value (Range)  Remarks

1        Attendance      Numeric    [0,100]        Percentage
2        Midsem result   Boolean    {YES, NO}
3        Pre Bklg        Boolean    {YES, NO}      Previous backlog
4        Assignment      Nominal
5        Pre result      Boolean    {YES, NO}
6        Branch          Nominal
7        Pass            Boolean    {YES, NO}

Table 6.3: Student performance data set collected from the departments of PIT College

6.2.3 Student Performance Data Set


In this research, the student performance data set has been collected from different departments of Parul Institute of Technology, a college of Parul University. The data set contains many attributes, but with the attribute selection method only the 7 attributes shown in table 6.3 have been retained for further processing in data mining. This data set contains more than 50,000 instances.

6.3 Data Pre-Processing


For the data mining process, the data first need to be preprocessed into quality data, so that the analysis, the derived information, and the resulting decisions are of high quality. Therefore, before mining, the database user should be clear about some of the most relevant questions: 1) What data is available for the task? 2) Is this data relevant? 3) Is additional relevant data available? 4) How much historical data is available? 5) Who is the data expert?

For the data mining process, the quantity of the data plays as important a role as its relevance. Common rules of thumb are summarized below; a small sketch that codifies them follows the list.

1. Number of instances (records, objects): 5,000 or more are desired; with fewer, the results are less reliable and special methods (boosting, ...) should be used.

2. Number of attributes (fields): 10 or more instances per attribute; with more fields, use feature reduction and selection.

3. Number of targets: more than 100 instances per class; if the classes are very unbalanced, use stratified sampling.
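These thresholds can be checked mechanically before mining. The following C# sketch only codifies the rules of thumb above; it is an illustration, not part of the thesis implementation, and the helper name and data layout are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class DataQuantityCheck
{
    // Report whether a data set meets the rules of thumb stated above:
    // >= 5,000 instances, >= 10 instances per attribute, > 100 per class.
    public static void Report(int instances, int attributes,
                              IDictionary<string, int> classCounts)
    {
        if (instances < 5000)
            Console.WriteLine("Fewer than 5,000 instances: results may be " +
                              "less reliable; consider boosting or similar.");

        if (instances < 10 * attributes)
            Console.WriteLine("Fewer than 10 instances per attribute: " +
                              "consider feature reduction and selection.");

        foreach (var c in classCounts.Where(c => c.Value <= 100))
            Console.WriteLine($"Class '{c.Key}' has only {c.Value} instances: " +
                              "consider stratified sampling.");
    }
}
```

For example, calling `Report(321, 9, counts)` on the two-site admission subset (179 + 142 instances) would flag the first rule.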


Figure 6.1: Forms of Data Preprocessing

Preprocessing is required before the data mining task because real-world data are generally incomplete (lacking attribute values, lacking certain attributes of interest, or containing only aggregate data), noisy (containing errors or outliers), and inconsistent (containing discrepancies in codes or names). The data preprocessing tasks are explained below and shown in figure 6.1.

• Data cleaning: fill in missing values, smooth noisy data, identify or re-
move outliers, and resolve inconsistencies.

• Data integration: using multiple databases, data cubes, or files.

• Data transformation: normalization and aggregation.

• Data reduction: reducing the volume but producing the same or similar
analytical results.

93
Chapter 6. Data Collection, Preprocessing and Implementation
6.3. Data Pre-Processing

• Data discretization: part of data reduction, replacing numerical attributes with nominal ones.

Data cleaning: This is the first preprocessing operation. It consists of various ways to clean the data; a minimal code sketch follows the list.

1. Fill in missing values (attribute or class value):

• Ignore the tuple: usually done when class label is missing.


• Use the attribute mean (or majority nominal value) to fill in the
missing value.
• Use the attribute mean (or majority nominal value) for all samples
belonging to the same class.
• Predict the missing value by using a learning algorithm: consider
the attribute with the missing value as a dependent (class) variable
and run a learning algorithm (usually Bayes or decision tree) to pre-
dict the missing value.

2. Identify outliers and smooth out noisy data:

• Binning: sort the attribute values and partition them into bins, then smooth by bin means, bin medians, or bin boundaries.
• Clustering: group values in clusters and then detect and remove
outliers (automatic or manual)
• Regression: smooth by fitting the data into regression functions.

3. Correct inconsistent data: use domain knowledge or expert decision.
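As an illustration of the first two operations, the following C# sketch fills missing numeric values with the attribute mean and smooths a numeric attribute by equal-width bin means. It is a minimal sketch of the general techniques described above, not the thesis implementation; representing a missing value as null in a double?[] is an assumption.

```csharp
using System;
using System.Linq;

static class DataCleaning
{
    // Fill missing values (null) with the mean of the observed values.
    public static double[] FillWithMean(double?[] column)
    {
        double mean = column.Where(v => v.HasValue).Average(v => v.Value);
        return column.Select(v => v ?? mean).ToArray();
    }

    // Smooth noisy data by bin means: partition the value range into
    // equal-width bins and replace each value by the mean of its bin.
    public static double[] SmoothByBinMeans(double[] values, int bins)
    {
        double min = values.Min(), max = values.Max();
        double width = (max - min) / bins;
        if (width == 0) return (double[])values.Clone(); // constant attribute

        return values.Select(v =>
        {
            int b = Math.Min((int)((v - min) / width), bins - 1);
            double lo = min + b * width, hi = lo + width;
            return values.Where(x => x >= lo && (x < hi || b == bins - 1))
                         .Average();
        }).ToArray();
    }
}
```

Outlier removal by clustering and smoothing by regression follow the same pattern but require a clustering or regression routine.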

Data transformation: Data transformation is the process of converting data or information from one format to another, usually from the format of a source system into the required format of a new destination system. Some of the data transformation techniques are discussed below:

1. Normalization:

94
Chapter 6. Data Collection, Preprocessing and Implementation
6.3. Data Pre-Processing

• Scaling attribute values to fall within a specified range. Example: to transform V in [min, max] to V' in [0,1], apply V' = (V - min)/(max - min).

• Scaling by using the mean and standard deviation (useful when min and max are unknown or when there are outliers): V' = (V - mean)/stddev.

2. Aggregation: moving up in the concept hierarchy on numeric attributes.

3. Generalization: moving up in the concept hierarchy on nominal attributes.

4. Attribute construction: replacing or adding new attributes inferred from existing attributes. A sketch of the two normalization methods is given below.
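As a worked example of min-max scaling, an SSC percentage of V = 75 with min = 50 and max = 100 maps to V' = (75 - 50)/(100 - 50) = 0.5. Below is a minimal C# sketch of both scaling methods, assuming the attribute is held in a double[]; the class and method names are illustrative only.

```csharp
using System;
using System.Linq;

static class Normalization
{
    // Min-max normalization: V' = (V - min) / (max - min), mapping V into [0,1].
    public static double[] MinMax(double[] v)
    {
        double min = v.Min(), max = v.Max();
        return v.Select(x => (x - min) / (max - min)).ToArray();
    }

    // Z-score normalization: V' = (V - mean) / stddev, useful when min and
    // max are unknown or when outliers are present.
    public static double[] ZScore(double[] v)
    {
        double mean = v.Average();
        double std = Math.Sqrt(v.Average(x => (x - mean) * (x - mean)));
        return v.Select(x => (x - mean) / std).ToArray();
    }
}
```

For instance, `Normalization.MinMax(new[] { 50.0, 75.0, 100.0 })` yields { 0, 0.5, 1 }.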

Data reduction: Data reduction is the transformation of numerical or alphabetical digital information, derived empirically or experimentally, into a corrected, ordered, and simplified form.

1. Reducing the number of attributes

• Data cube aggregation: applying roll-up, slice or dice operations.


• Removing irrelevant attributes: attribute selection (filtering and wrapper methods) and searching the attribute space.

• Principal component analysis (numeric attributes only): searching for a lower-dimensional space that can best represent the data.

2. Reducing the number of attribute values

• Binning (histograms): reducing the number of attribute values by grouping them into intervals (bins).
• Clustering: grouping values in clusters.
• Aggregation or generalization

3. Reducing the number of tuples

• Sampling: selecting a representative subset of the tuples; a sketch of one simple sampling scheme is given below.
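A common single-pass scheme for reducing the number of tuples is reservoir sampling, which draws a uniform random sample without knowing the stream length in advance. The following C# sketch is a generic illustration, not the sampling method used in this research:

```csharp
using System;
using System.Collections.Generic;

static class Sampling
{
    // Reservoir sampling: draw a uniform random sample of n tuples
    // from a stream of unknown length in a single pass.
    public static List<T> Reservoir<T>(IEnumerable<T> tuples, int n, Random rng)
    {
        var sample = new List<T>(n);
        int seen = 0;
        foreach (var t in tuples)
        {
            seen++;
            if (sample.Count < n)
                sample.Add(t);                 // fill the reservoir first
            else
            {
                int j = rng.Next(seen);        // uniform in [0, seen)
                if (j < n) sample[j] = t;      // keep t with probability n/seen
            }
        }
        return sample;
    }
}
```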


6.4 Test Data Set


The data set has been equally partitioned into subsets, one per site. The experiments have been performed on 10k, 20k, 50k and 100k records (where k denotes thousand) at 2, 5 and 10 sites. The local training models have been generated and merged using the proposed approach, and the accuracy of the resulting global models has been checked on test data sets; it exceeds 98%. The results of the basic comparison clearly show that accuracy, training time, communication overhead and other parameters have been optimized. The student admission data sets for the years 2013-14 and 2014-15 have been used to train the model, and this model has then been applied to the student admission data set for the year 2015-16, where it gives more than 98.03% accuracy for the prediction. These experimental results have also been verified using 10-fold cross-validation. A sketch of the equal partitioning step is given below.
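The equal partitioning of the records across sites can be expressed compactly. The following C# sketch uses round-robin assignment; it is one plausible realization, and the actual partitioning in the implementation may differ:

```csharp
using System.Collections.Generic;
using System.Linq;

static class SitePartitioner
{
    // Split the records into `sites` subsets of (near-)equal size,
    // e.g. 100k records over 10 sites -> 10k records per site.
    public static List<List<T>> Partition<T>(IList<T> records, int sites)
    {
        return Enumerable.Range(0, sites)
            .Select(s => records.Where((r, i) => i % sites == s).ToList())
            .ToList();
    }
}
```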

6.5 Implementation
The research work has been carried out on different numbers of sites with the following hardware and software configurations:
Software

1. Database: Microsoft SQL Server 2008 R2

2. Tool: Visual Studio 2010 for .NET

3. Language: C#

4. Apache Hadoop Framework

Hardware

1. Processor: AMD E1-2500, 1.4 GHz

2. RAM: 4 GB

3. System: 64-bit OS / Ubuntu Linux OS (For Apache Hadoop Framework)

4. Hard disk: 400 GB

The screenshots captured during the implementation are shown below:


Figure 6.2: Site Selection

Figure 6.3: Running the J48 algorithm at each site


Figure 6.4: Load/Save the training model

Figure 6.5: Decision Tree and Decision Table at each site


Figure 6.6: Combined Decision Tree and Decision Table

Figure 6.7: Branch wise decision rules


6.6 Summary
In this chapter, the importance of data collection and preprocessing has been discussed, together with the different ways in which a data set may lack the quality needed for processing; such data sets need to be preprocessed. In this research work, different data sets have been used. The local training models have been generated and merged using the proposed approach, and the accuracy of these global models has been checked on test data sets; the accuracy in classifying the test data set is more than 98%.

