
Data Science | Steps to approach a Machine Learning Problem

The following steps lead to a good machine learning solution.

1. Data collection
2. Data preprocessing
   1) Data cleaning
   2) Feature creation and feature selection
   3) Feature scaling and normalization
   4) Divide the data into training and testing sets (you can also create
      a cross-validation set)
3. Build a model on the training data.
4. Evaluate the model on the test data.
5. If the performance is satisfactory, deploy to the real system.
6. If the performance is not good, check for overfitting and underfitting.
7. Regularize your algorithm and go back to step 3.
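
The split, build, and evaluate steps above can be sketched in a few lines of
plain Python. This is only an illustration under stated assumptions: the data
set is made up, the helper names (train_test_split, fit_threshold, accuracy)
are hypothetical, and the "model" is a deliberately trivial one-feature
threshold rule standing in for a real learning algorithm.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]

def fit_threshold(train):
    """Toy 'model': the midpoint between the mean x of each class."""
    zeros = [x for x, label in train if label == 0]
    ones = [x for x, label in train if label == 1]
    if not zeros or not ones:
        raise ValueError("training set must contain both classes")
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2

def accuracy(threshold, rows):
    """Fraction of rows whose label matches the rule 'x > threshold'."""
    hits = sum(1 for x, label in rows if int(x > threshold) == label)
    return hits / len(rows)

# Made-up data: one numeric feature x, label 1 when x >= 10.
data = [(x, int(x >= 10)) for x in range(20)]

train, test = train_test_split(data)          # step 2.4: split
threshold = fit_threshold(train)              # step 3: build on training data
train_acc = accuracy(threshold, train)
test_acc = accuracy(threshold, test)          # step 4: evaluate on test data

# Step 6: a training score far above the test score hints at overfitting;
# two low scores hint at underfitting.
print(f"train accuracy={train_acc:.2f}, test accuracy={test_acc:.2f}")
```

The model is fitted only on the training rows, so the test score is an
estimate of how the rule behaves on data it has not seen.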

This process is iterative, and you can add more steps in between depending
on the situation. Let's understand each step:

1. Data Collection:
At this stage we collect data from the available sources. To analyze user
click behavior, you would collect web log data. To predict whether an email
is spam, you would collect emails. To predict the sentiment of Twitter
messages, you would collect data from Twitter.

2. Data Preprocessing:
The data that you receive from a source may not be in a readily usable
form. You may need to preprocess it so that your algorithm can make the
best use of the collected information.
The following are the things you may want to do as part of preprocessing.

1) Data Cleaning: You may end up collecting data that has wrong or null
values for some of the records. The wrong or missing values may be very

2) Viewing data: You may want to plot the data to see which parameters
affect the output of your records. Plots also show whether your data is
skewed or follows a normal distribution. Viewing data as plots and
histograms may completely surprise you. If you have data about Facebook
users, you can make a plot to see whether male or female users have more
friends. If you plot each age against the number of people of that age,
you get a very clear picture of which age group is most active on
Facebook.
3) Data Transformation: Depending on the data you have, you may want to
convert some features to another form. For example, if age is one of the
features of your data, you may want only 4 groups: minor (0-18),
young (19-45), old (46-65), senior citizen (66- __). That is, you
transform the age feature into a categorical variable. In some complex
scenarios you may also convert low-dimensional data to high dimensions
(e.g. the SVM algorithm using kernels; we will discuss these things later
in a separate post) or high dimensions to low dimensions (e.g. PCA, for
dimensionality
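
The age grouping described above, combined with the per-group counts
suggested under "Viewing data", might look like the following sketch. The
group boundaries come from the text; the function name age_group and the
sample ages are made up for illustration, and the open upper bound of the
last group is treated as "66 and above", which is an assumption.

```python
from collections import Counter

# Inclusive upper bound of each group, taken from the ranges in the text.
AGE_GROUPS = [(18, "minor"), (45, "young"), (65, "old")]

def age_group(age):
    """Map a numeric age to one of the four categories."""
    if age < 0:
        raise ValueError("age must be non-negative")
    for upper, label in AGE_GROUPS:
        if age <= upper:
            return label
    return "senior citizen"  # assumption: everything from 66 upward

# Made-up sample ages, not real user data.
ages = [5, 17, 18, 19, 30, 45, 46, 65, 66, 80]
groups = [age_group(a) for a in ages]

# Counting records per group gives a crude text histogram.
counts = Counter(groups)
for label in ["minor", "young", "old", "senior citizen"]:
    print(f"{label:15s} {'#' * counts[label]}")
```

Once the feature is categorical, the count per group shows which age band
dominates the data set.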

In general we work with both numerical and categorical data.

Numerical data consists of actual numbers, while categorical data takes a
few discrete values. Examples of categorical data include marital status,
month of birth, employment type, or gender. A categorical variable can be
a number, but adding two values of a categorical variable has no meaning,
e.g. a zip code. Categorical data may or may not have an order.
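
One common way to handle this is sketched below, as an illustration rather
than a prescribed method: unordered categories such as zip codes get one
indicator column per category (one-hot encoding), while ordered categories
can be mapped to their rank. The helper one_hot, the sample zip codes, and
the AGE_ORDER mapping are hypothetical.

```python
def one_hot(values):
    """Return (categories, rows) with one 0/1 column per distinct category."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows

# Zip codes look numeric but are labels, so they are kept as strings.
zip_codes = ["10001", "94105", "10001"]
categories, encoded = one_hot(zip_codes)
print(categories)  # ['10001', '94105']
print(encoded)     # [[1, 0], [0, 1], [1, 0]]

# Ordered categories can instead be mapped to their rank.
AGE_ORDER = {"minor": 0, "young": 1, "old": 2, "senior citizen": 3}
ranks = [AGE_ORDER[g] for g in ["young", "minor", "senior citizen"]]
print(ranks)       # [1, 0, 3]
```

The one-hot columns carry no implied order or magnitude, so a model cannot
mistake zip code 94105 for something "larger" than 10001.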
