
Machine Learning

Machine Learning Definition


An algorithm that can learn from data without
relying on explicitly programmed rules and
standard practices like object-oriented design.
The Purpose of Machine Learning
Push workload to self-sufficient machines.
Pattern Recognition.
Analysis of data.
Example: Trends in house pricing
Machine Learning:
Topics and Terminology
Supervised and Unsupervised Learning
Statistics and model evaluation
Popular Algorithms:
o K-Nearest Neighbors
o Decision Tree
o K-Means Clustering, and so on.
Supervised Learning
Supervised learning is where you have input
variables (x) and an output variable (Y) and
you use an algorithm to learn the mapping
function from the input to the output.
Y = f(X)
The goal is to approximate the mapping
function so well that when you have new
input data (X) you can predict the output
variable (Y) for that data.
Supervised Learning is further grouped into:
Classification: A classification problem is when
the output variable is a category, such as red
or blue or disease and no disease.
Regression: A regression problem is when the
output variable is a real value, such as
dollars or weight.
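As a sketch of the regression case, a least-squares line fit approximates the mapping Y = f(X); the house-size/price pairs below are made up for illustration.

```python
# A minimal sketch of supervised regression: learning Y = f(X)
# from labeled examples with an ordinary least-squares line fit.
# The data (size in 100s of sq ft vs. price in $1000s) is hypothetical.

def fit_line(xs, ys):
    """Fit y = a*x + b by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

sizes = [10, 15, 20, 25, 30]        # input variable (X)
prices = [100, 150, 200, 250, 300]  # output variable (Y)
a, b = fit_line(sizes, prices)

# The learned mapping f can now predict the output for new input data.
print(a * 22 + b)  # predicted price for a new house of size 22
```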
Unsupervised Learning
Unsupervised learning is where you only have
input data (X) and no corresponding output
variables.
The goal for unsupervised learning is to model
the underlying structure or distribution in the
data in order to learn more about the data.
Unsupervised Learning is further grouped into:
Clustering: A clustering problem is where you
want to discover the inherent groupings in the
data, such as grouping customers by
purchasing behavior.
Association: An association rule learning
problem is where you want to discover rules
that describe large portions of your data, such
as people that buy X also tend to buy Y.
Supervised Vs Unsupervised Learning

Supervised Learning:
o Deals with labeled data
o Algorithms for classification and regression
o Classification is the organization of labeled data
o Regression is the prediction of trends in labeled data to determine future outcomes

Unsupervised Learning:
o Deals with unlabeled data
o Algorithms for clustering and association
o Clustering is the analysis of patterns and grouping of unlabeled data
Classification
Categorizing data is based on training with a
set of data so that the machine can learn the
boundaries that separate categories of data.
New data input into the model can then be
categorized based on where the point lies.
for example:

Iris Data Set
Continue..
Now the model can easily classify a new point
from an out-of-sample data set.
K-Nearest Neighbors: A Classification Model
In K-Nearest Neighbors, data points are
categorized, and when determining the
category of a new data point, the K nearest
points are used.
for example: (K = 5)
for example: (K = 14)
Unsupervised Learning (video, 1:41)
Supervised Learning Algorithms
K-Nearest Neighbors Algorithm
1. Pick a value for K.
2. Search for the K observations in the training
data that are nearest to the measurements
of the unknown data.
3. Predict the response of the unknown data
point using the most popular response value
from the K-Nearest Neighbors.
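The three steps above can be sketched in plain Python; the toy flower measurements and labels below are hypothetical stand-ins for a real training set such as Iris.

```python
from collections import Counter
import math

def knn_predict(training, unknown, k):
    """training: list of (features, label) pairs; unknown: a feature tuple."""
    # Step 2: search for the K observations nearest to the unknown point.
    by_distance = sorted(training,
                         key=lambda item: math.dist(item[0], unknown))
    nearest = by_distance[:k]
    # Step 3: predict using the most popular label among those K neighbors.
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]

# Hypothetical training data: two well-separated classes.
training = [
    ((1.0, 1.1), "red"), ((1.2, 0.9), "red"), ((0.8, 1.0), "red"),
    ((3.0, 3.2), "blue"), ((3.1, 2.9), "blue"), ((2.8, 3.0), "blue"),
]

# Step 1: pick a value for K, here K = 3.
print(knn_predict(training, (1.1, 1.0), k=3))  # prints "red"
```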
Continue
For classification, the output of the K-NN algorithm is
the classification of an unknown data point based on
the k 'nearest' neighbors in the training data.
For regression, the output is an average of the values
of a target variable based on the k 'nearest' neighbors
in the training data.
A low K value gives noisy predictions and
over-fits the data set.
A very high K value over-generalizes the model.
A moderate K value gives good predictions.
Continue
KNN for regression
Decision Tree
Decision Trees are built by splitting the
training set into distinct nodes, where one
node contains all of, or most of, one category
of the data. These categories can be called
subsets.
A decision tree may not be optimal, since a
greedy algorithm is used to build it.
Some Decision Tree Terminology
Node: A test on the values (data) of a certain
attribute.
Leaf: A terminal node that predicts the outcome.
Root: The beginning node that contains the
entire dataset.
Entropy: It is the amount of information disorder,
or the amount of randomness in the data.
Information Gain: Information collected that can
increase the level of certainty in a particular
prediction.
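The two definitions above can be sketched in Python; the "yes"/"no" labels and the two candidate splits below are made up to mirror the good and bad information-gain examples.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (disorder/randomness) of a list of labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent node minus the weighted entropy of its children."""
    n = len(parent)
    weighted = ((len(left) / n) * entropy(left) +
                (len(right) / n) * entropy(right))
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5        # maximum disorder: entropy = 1.0 bit
good_split = (["yes"] * 5, ["no"] * 5)   # pure children: gain = 1.0
bad_split = (["yes"] * 3 + ["no"] * 2,   # still mixed: gain near 0
             ["yes"] * 2 + ["no"] * 3)

print(information_gain(parent, *good_split))  # 1.0
print(information_gain(parent, *bad_split))   # close to 0
```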
Information Gain example: Good
Information Gain example: Bad
Random Forest Algorithm
1. For b = 1 to B:
   a) Draw a bootstrap sample Z* of size N from the training data.
   b) Grow a random-forest tree Tb on the bootstrapped data by
      recursively repeating the following steps for each terminal
      node of the tree, until the minimum node size nmin is reached:
      i) Select m variables at random from the p features.
      ii) Pick the best variable/split-point among them.
      iii) Split the node into two daughter nodes.
2. Output the ensemble of trees {Tb}, b = 1 to B.
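The steps above can be sketched in Python, using one-level trees (decision stumps) in place of fully grown trees so the example stays short; the toy data set and the parameters (B = 5 trees, m = 1 of p = 2 features) are made up for illustration.

```python
import random
from collections import Counter

def best_stump(data, feature_indices):
    """Step 1b(ii): pick the best single-threshold split among the
    candidate features (a 1-level tree stands in for a full tree)."""
    best = None
    for f in feature_indices:
        for x, _ in data:
            t = x[f]
            left = [lbl for xx, lbl in data if xx[f] <= t]
            right = [lbl for xx, lbl in data if xx[f] > t]
            if not left or not right:
                continue
            # Count points not matching the majority label on their side.
            errors = (len(left) - Counter(left).most_common(1)[0][1] +
                      len(right) - Counter(right).most_common(1)[0][1])
            if best is None or errors < best[0]:
                best = (errors, f, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:  # no valid split: always predict the majority label
        lbl = Counter(lbl for _, lbl in data).most_common(1)[0][0]
        return (feature_indices[0], float("inf"), lbl, lbl)
    return best[1:]   # (feature, threshold, left_label, right_label)

def fit_forest(data, n_trees, m):
    forest = []
    for _ in range(n_trees):
        # Step 1a: draw a bootstrap sample of size N from the training data.
        sample = [random.choice(data) for _ in data]
        # Step 1b(i): select m variables at random from the p features.
        feats = random.sample(range(len(data[0][0])), m)
        forest.append(best_stump(sample, feats))
    return forest

def predict(forest, x):
    """Step 2: majority vote over the ensemble {Tb}."""
    votes = [(ll if x[f] <= t else rl) for f, t, ll, rl in forest]
    return Counter(votes).most_common(1)[0][0]

random.seed(0)
data = [((1.0, 5.0), "a"), ((1.2, 4.8), "a"), ((0.9, 5.2), "a"),
        ((3.0, 1.0), "b"), ((3.2, 1.2), "b"), ((2.9, 0.7), "b")]
forest = fit_forest(data, n_trees=5, m=1)
print(predict(forest, (0.8, 5.5)))
```

The bootstrap sampling in step 1a and the random feature subset in step 1b(i) are what make the trees in the ensemble differ from one another, which is the source of the random forest's reliability over a single decision tree.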
Advantages over Decision Tree
Faster
Reliable
