
Data Preprocessing

An Overview:
Data preprocessing improves data quality
Major tasks in data preprocessing

Data Cleaning
Data Integration
Data Reduction
Data Transformation
• Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  - Integration of multiple databases, data cubes, or files
• Data reduction
  - Dimensionality reduction
  - Numerosity reduction
  - Data compression
• Data transformation and data discretization
  - Normalization
  - Standardization
Importing the necessary libraries and reading the .csv file
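A minimal sketch of this step, assuming pandas is used. The file name in the comment and the sample rows are assumptions for illustration only; a small in-memory CSV stands in for the real file so the snippet runs on its own.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file; the rows below are made up.
SAMPLE_CSV = io.StringIO(
    "average_monthly_hours,left,department,salary\n"
    "157,1,sales,low\n"
    "262,1,sales,medium\n"
    "210,0,technical,high\n"
)

# With the real data this would be something like (file name assumed):
# df = pd.read_csv("Human_Resources_Employee_Attrition.csv")
df = pd.read_csv(SAMPLE_CSV)

print(df.head())   # first rows of the dataset (at most five)
print(df.dtypes)   # data type of each column; salary and department hold text
```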
Understanding the dataset:
• We have one dataset, titled “Human_Resources_Employee_Attrition”.
In human resources terminology, attrition refers to employees leaving the company. Attrition in a company is usually measured with a metric called the attrition rate, which measures the number of employees leaving the company, typically relative to the total workforce over a given period.
First five rows of the dataset: {df.head()}
• Dataset information:

In this dataset, the salary and department columns have the object data type.
Identifying the target variable and the independent variables
We take the column “left” as the target (output/dependent) variable.
In the “left” column, 0 denotes an employee who is still working in the organization and 1 denotes an employee who has left the organization.
Next we need to identify the predictor (input/independent) variables, that is, the columns whose values affect the dependent variable (“left”).
The “department” column does not affect the target (output) variable, so we drop it.
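Dropping the column could be sketched as follows, assuming a pandas DataFrame named df. The sample rows are invented for illustration; only the column names "left", "department", "salary" and "average_monthly_hours" come from the text.

```python
import pandas as pd

# Made-up sample rows; the real dataset has many more rows and columns.
df = pd.DataFrame({
    "average_monthly_hours": [157, 262, 210],
    "left": [1, 1, 0],
    "department": ["sales", "sales", "technical"],
    "salary": ["low", "medium", "high"],
})

# Remove the column that is not used as a predictor.
df = df.drop(columns=["department"])

# Count how many employees stayed (0) and how many left (1).
left_counts = df["left"].value_counts()
print(left_counts)
```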
Finding null values

There are no null values in the dataset.
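One common way to check for null values, assuming pandas, is to count them per column. The sample frame below is made up and simply mirrors the situation described, with no missing entries.

```python
import pandas as pd

# Made-up sample rows without any missing values.
df = pd.DataFrame({
    "average_monthly_hours": [157, 262, 210],
    "left": [1, 1, 0],
    "salary": ["low", "medium", "high"],
})

# isnull() marks each missing cell as True; sum() counts them per column.
null_counts = df.isnull().sum()
print(null_counts)

# A single True/False answer for the whole dataset.
has_nulls = bool(df.isnull().values.any())
print(has_nulls)
```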


Showing how each variable is distributed, using histograms before normalizing the data
Finding outliers using a box plot

Many points appear as outliers here because the ‘average_monthly_hours’ column is not on a similar scale to the other columns. We therefore have to normalize the data after splitting it into dependent and independent variables.
Finding outliers using a box plot

Here we take only four columns for detecting outliers, because these four are on the same scale of values.
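A box plot conventionally draws as outliers the points lying more than 1.5 times the interquartile range (IQR) beyond the first or third quartile. Assuming that default, the same check can be sketched numerically; the helper name count_boxplot_outliers and the sample values are hypothetical.

```python
import pandas as pd

def count_boxplot_outliers(series: pd.Series, whisker: float = 1.5) -> int:
    """Count values outside [Q1 - whisker*IQR, Q3 + whisker*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whisker * iqr
    upper = q3 + whisker * iqr
    return int(((series < lower) | (series > upper)).sum())

# Made-up monthly hours with one clearly extreme value.
hours = pd.Series([150, 155, 160, 158, 162, 157, 310])
print(count_boxplot_outliers(hours))
```

Applying the helper to each column gives a per-column count that can be compared before and after any rescaling step.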
Splitting the dataset into dependent and independent variables

• x holds the independent variables

• y is the dependent variable
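A sketch of the split, assuming the data is held in a pandas DataFrame from which "department" has already been dropped. The sample rows are invented for illustration.

```python
import pandas as pd

# Made-up sample rows; "left" is the target column named in the text.
df = pd.DataFrame({
    "average_monthly_hours": [157, 262, 210],
    "left": [1, 1, 0],
    "salary": ["low", "medium", "high"],
})

# x: every column except the target.
x = df.drop(columns=["left"])

# y: the target column only.
y = df["left"]

print(x.columns.tolist())
print(y.tolist())
```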

Here the last column (‘salary’) is non-numerical, and it also affects the ‘left’ column, so we have to convert it to numerical data using OneHotEncoder, because it contains three types of values (‘low’, ‘medium’, ‘high’).
Converting character values to numerical values
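A minimal sketch of one-hot encoding with scikit-learn's OneHotEncoder. The sample values are made up, and the encoder is applied to a standalone salary column rather than the full dataset. By default the encoder returns a sparse matrix, which is converted to a dense array here for printing.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up salary values, shaped as one column with four rows.
salary = np.array([["low"], ["medium"], ["high"], ["low"]], dtype=object)

encoder = OneHotEncoder()
encoded = encoder.fit_transform(salary).toarray()

# Categories are stored in sorted order: high, low, medium.
print(encoder.categories_[0])
# Each row has a single 1 in the position of its category.
print(encoded)
```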
Using StandardScaler to bring all the values onto a similar scale
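A sketch of this step with scikit-learn's StandardScaler, which rescales each column to zero mean and unit variance. The two-column sample array is made up; the first column is a hypothetical small-valued feature and the second stands in for monthly hours.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up numeric features on very different scales.
x_numeric = np.array([
    [0.4, 150.0],
    [0.8, 250.0],
    [0.6, 200.0],
])

scaler = StandardScaler()
x_scaled = scaler.fit_transform(x_numeric)

# After scaling, each column has mean 0 and standard deviation 1.
print(x_scaled.mean(axis=0))
print(x_scaled.std(axis=0))
```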
Finding outliers after converting the values to a similar scale

Some outliers remain here; we then reduce them using Normalizer.
Using Normalizer to reduce outliers
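A sketch of this step with scikit-learn's Normalizer. Note that Normalizer works row by row: with its default settings it divides each sample by its Euclidean (L2) norm, so every row ends up with length 1. The sample array is made up.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Made-up, already standardized feature rows.
x_scaled = np.array([
    [3.0, 4.0],
    [1.0, 0.0],
    [-6.0, 8.0],
])

# Default norm is "l2": each row is divided by its Euclidean length.
x_normalized = Normalizer().fit_transform(x_scaled)

print(x_normalized)
print(np.linalg.norm(x_normalized, axis=1))  # length of each row
```

Because every row is scaled to unit length, all values fall within [-1, 1], which compresses extreme values in a box plot.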

After applying Normalizer, the box plot looks like this:


• A small number of outliers remain in the data after applying Normalizer, so we use MinMaxScaler to reduce the remaining outliers.
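A sketch of this step with scikit-learn's MinMaxScaler, which by default rescales each column linearly to the range [0, 1]. The sample array is made up. Since the rescaling is linear within each column, it changes the scale of the values but not their relative spacing, so it is worth re-checking the box plot on the actual data afterwards.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up feature columns on different ranges.
x_values = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 40.0],
])

# Default feature_range is (0, 1), applied per column.
x_minmax = MinMaxScaler().fit_transform(x_values)

print(x_minmax)
print(x_minmax.min(axis=0), x_minmax.max(axis=0))
```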
Checking again for outliers after applying MinMaxScaler

• The box plot looks like this:

• Finally, we have reduced all the outliers in the data.


Thank you
