
Introduction

Machine learning is one of many approaches to analytics at the disposal of data scientists. Many
of the tools we use every day rely on machine learning, although we may not realize it.
Let me provide an example. A short while back I was in Palo Alto, California, on business. Since
I don't know the roads in Palo Alto very well, I was using the GPS navigation system that is
provided in my Android smartphone. The navigator plotted a path to my destination and
indicated the path with a blue line. In some areas the blue line turned yellow, and in a few others
it turned red. The yellow sections indicated that traffic was slower than posted speeds, and the
red sections indicated that traffic had slowed to nearly a stop. This color coding of
blue, yellow, and red is an example of machine learning in practice.
In another example, I logged into my Netflix.com account to watch a movie. As I logged in, I
was presented with a number of movies that were selected or recommended for me. Each of
the selected films is very much like other films that I had watched and enjoyed. It was like
someone had picked out these movies just for me. The Netflix system that recommends movies
is another example of machine learning.
In a final example, I recently installed an application on my Android phone called SoundHound.
Have you ever heard a song being played on the radio and just
couldn't remember the name of the song or the artist who performed it? SoundHound
listens to the song for just a few seconds and then identifies the name of the song
and the artist who performed it. It will even show the lyrics of the song as
they are sung!
Each of the preceding is an example of a machine learning algorithm at work. Mitchell (2006) defines
machine learning as a way to answer the following question:
"How can we build computer systems that automatically improve with experience, and what are
the fundamental laws that govern all learning processes?"
If we consider Mitchell's question, there are a couple of key points. The first is the idea of a
computer system that can improve with experience. Consider the Netflix example: no one
programmed the computer with a set of preferences for movies that I would enjoy. Instead,
Netflix has a machine learning algorithm that looks at the movie selections of
many different people and uses this data to group those people into different
categories. Once I have been placed into one (or more) of these categories, the system simply suggests
that the movies that others in the same category are watching will be good recommendations for me
as well.

What Netflix has done is to develop an algorithm that can categorize people into different groups
and use these categories to recommend what movies they might enjoy watching. Is the Netflix
algorithm learning? What is learning after all?
There is considerable academic debate about the nature of learning. On one hand we can point to
the acquisition of facts or information as learning. In other contexts, the ability to gain
knowledge, skills, or the ability to solve particular sets of problems is offered as evidence of
learning. Finally, there is an argument that learning involves awareness and that learning is only
present if there is awareness of learned processes. In other words, there must be an
understanding of what has been learned; learning must be accompanied by both understanding
and meaning.
The concept of learning is obviously a complex one, but we are only interested in
learning as it applies to a computer system. Fortunately, machine learning is easier to define than
learning in general. Machine learning is a process that uses data to fit a model that can be used
to solve a particular set of problems and whose performance on those problems improves over time
as more data and experience are accumulated.

Data Mining and Machine Learning


Although data mining and machine learning have some similarities and share many of the same
procedures and algorithms, they are different. The key difference between data mining and
machine learning is in their intent and application.
Data mining is used to gain insight or knowledge from existing data. The objective of data
mining is to get more value (or new value) out of a data set. To this end there is a wide
spectrum of techniques, both statistical and visual, that are used to provide new views of or insight
into a data set. Some of these approaches include:[i]
Description
Reporting
Visualization
Prediction
Classification
Estimation
Clustering
Machine learning, on the other hand, is designed to develop automated engines that can facilitate
decision making by training algorithms to perform a variety of decision tasks. The goal of
machine learning is to create an algorithm that has the capability to improve its performance over
time. Programs in general are sets of instructions that the programmer has created to react to
specific data or inputs. The way that a conventional program responds to a given input is always the
same, because the responses are defined by the individual who created the program. In machine
learning the goal is to develop an algorithm that has the ability to alter its response to some input
data. The machine learning algorithm adjusts its response based upon the data it has seen, which makes
it capable of adjusting how it interacts with its environment. This ability to self-adjust can be
defined as a form of learning.
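The contrast between a fixed program and a self-adjusting one can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the room-lighting scenario, the light-level numbers, and the names fixed_rule and LearnedRule are all hypothetical, and the "learning" here is nothing more than recomputing a threshold from labeled examples.

```python
# Hypothetical scenario: deciding whether a room is "too dark".

def fixed_rule(light_level):
    """A conventional program: the threshold is hard-coded by the programmer."""
    return light_level < 50  # the same input always gives the same response


class LearnedRule:
    """A rule whose threshold is adjusted from the examples it has seen."""

    def __init__(self):
        self.dark_levels = []    # readings a user labeled "too dark"
        self.bright_levels = []  # readings a user labeled "fine"

    def observe(self, light_level, too_dark):
        """Store one labeled example."""
        if too_dark:
            self.dark_levels.append(light_level)
        else:
            self.bright_levels.append(light_level)

    def threshold(self):
        """Midpoint between the average 'dark' and average 'fine' reading."""
        dark_mean = sum(self.dark_levels) / len(self.dark_levels)
        bright_mean = sum(self.bright_levels) / len(self.bright_levels)
        return (dark_mean + bright_mean) / 2

    def predict(self, light_level):
        return light_level < self.threshold()


rule = LearnedRule()
for level, too_dark in [(20, True), (30, True), (70, False), (80, False)]:
    rule.observe(level, too_dark)
print(rule.predict(55))  # False: the threshold is currently 50.0

# New feedback shifts the threshold, so the response to the same input changes.
rule.observe(60, True)
rule.observe(64, True)
print(rule.predict(55))  # True: the threshold has moved to 59.25
```

The fixed rule will answer the same way forever, while the learned rule gives a different answer for the same reading once it has seen more examples.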
Machine learning algorithms have been found to be very effective for a number of different tasks.
Some examples include the following:
select the preferred lighting of a room,
classify objects,
recognize specific patterns in (streams of) images,
identify the words in handwritten text,
understand a spoken language,
control systems based on sensor data,
predict risks in safety-critical systems,
detect errors in a network,
diagnose abnormal situations in a system,
prescribe actions or repairs, and
discover useful common information in distributed data.

Machine learning algorithms work by evaluating a set of training data in order to build a model.
As more training data becomes available, the model can be improved. What the learning algorithm
attempts to do is to develop a set of procedures that fit the training data.
If you have a smartphone, then you have probably experienced a machine learning algorithm
firsthand. Most current smartphones have voice recognition capabilities. Voice recognition
works by collecting a number of samples of spoken words and associating each spoken word
with its written form. For example, we might have samples of eight or ten different
people speaking the word "dog", and with each of these audio files
we would associate the text form of the word "dog".
Each person will say the word "dog" slightly differently. There will be differences in intonation,
pitch, speed, and other factors. The learning algorithm builds a model of what the spoken word
"dog" is like and then compares a newly spoken word with this model. Of course, it needs
to compare the spoken word not only with the model for "dog" but with the models for pretty much
all of the words in the language. Using the training data as a base, the machine learning algorithm
determines which word is most similar to the one that was spoken. In this way, even though the
system doesn't have a specific example of you speaking the word "dog", it can typically identify when
you have spoken it, because it compares your speech with all of the training examples that it was
provided. As the algorithm gets more information, it builds better models and becomes more accurate
over time. We refer to this ability to become more accurate as more data is evaluated as learning.
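The idea of building one model per word and then picking the most similar model can be sketched as follows. This is a deliberately simplified illustration, not how any real speech recognizer is implemented: each spoken sample is assumed to have been reduced to two made-up numeric features, the model for a word is simply the average of its samples, and the names build_models and recognize are hypothetical.

```python
import math

# Hypothetical training data: each spoken sample is reduced to two made-up
# numeric features. Real systems use far richer acoustic features.
training_samples = {
    "dog": [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2)],
    "cat": [(4.0, 0.5), (4.2, 0.7), (3.8, 0.3)],
}


def build_models(samples):
    """Build one model per word: the average of that word's training samples."""
    models = {}
    for word, vectors in samples.items():
        count = len(vectors)
        # zip(*vectors) groups the values feature by feature.
        models[word] = tuple(sum(feature) / count for feature in zip(*vectors))
    return models


def recognize(models, spoken):
    """Return the word whose model is closest to the spoken sample."""
    return min(models, key=lambda word: math.dist(models[word], spoken))


models = build_models(training_samples)
# A new speaker whose sample matches no training example exactly.
print(recognize(models, (1.1, 2.1)))  # dog
```

Adding more samples per word changes the averages, which is the sense in which more training data yields a better model in this sketch.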
A key feature of machine learning is that the algorithm must be able to learn over time: as it
is exposed to new information, it incorporates that information to improve its own
performance. [ii]
Types of Machine Learning
The discipline of machine learning has developed to include many different algorithms and
techniques. These techniques, however, can be grouped into three categories of machine
learning: supervised learning, unsupervised learning, and reinforcement learning.

Supervised Learning
Although the term "supervised learning" conjures up the notion of something that provides
oversight, this is not really the case. Supervised learning simply refers to
learning where the output for each of a collection of training examples is known, and both the input
attributes and the output value are used to train the learning algorithm.
Consider this example, taken from the UCI Machine Learning Repository. In this example
we see a series of 9 attributes of cells that can be identified from biopsies of breast tissue. The
10th item in the list is a classification. If the value of this 10th item is 2, the cell is benign. If the
value is 4, the cell is malignant (cancerous). [iii]
#     Attribute
1.    Clump Thickness
2.    Uniformity of Cell Size
3.    Uniformity of Cell Shape
4.    Marginal Adhesion
5.    Single Epithelial Cell Size
6.    Bare Nuclei
7.    Bland Chromatin
8.    Normal Nucleoli
9.    Mitoses
10.   Class: 2 for benign, 4 for malignant

This data set is an example of training data for supervised learning, as it contains both attributes
AND classifications. In supervised learning, the goal is to develop a model that accounts,
as accurately as possible, for all of the training data. Once trained, the classifier can be used to
classify a new set of attributes as either malignant or benign. The following is a sample of the
training data for this example.
5,1,1,1,2,1,3,1,1,2
5,4,4,5,7,10,3,2,1,2
3,1,1,1,2,2,3,1,1,2
6,8,8,1,3,4,3,7,1,2
4,1,1,3,2,1,3,1,1,2
8,10,10,8,7,10,9,7,1,4
1,1,1,1,2,10,3,1,1,2
2,1,2,1,2,1,3,1,1,2
This example has a number of attributes, or dimensions, that must be a part of the model, and as
such it will take a large number of training examples to train a classifier for this problem.
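To make the idea of training on attributes plus a known class concrete, here is a minimal sketch that classifies a new set of attributes by finding the most similar training row (a one-nearest-neighbor rule). The eight rows are the ones listed above; the choice of nearest neighbor, the function name classify, and the two new cases are assumptions made for illustration. Eight rows are far too few for a usable classifier, and nothing here should be read as a diagnostic method.

```python
import math

# The eight training rows shown above: nine attributes followed by the class
# (2 = benign, 4 = malignant).
rows = [
    (5, 1, 1, 1, 2, 1, 3, 1, 1, 2),
    (5, 4, 4, 5, 7, 10, 3, 2, 1, 2),
    (3, 1, 1, 1, 2, 2, 3, 1, 1, 2),
    (6, 8, 8, 1, 3, 4, 3, 7, 1, 2),
    (4, 1, 1, 3, 2, 1, 3, 1, 1, 2),
    (8, 10, 10, 8, 7, 10, 9, 7, 1, 4),
    (1, 1, 1, 1, 2, 10, 3, 1, 1, 2),
    (2, 1, 2, 1, 2, 1, 3, 1, 1, 2),
]


def classify(training_rows, attributes):
    """Return the class of the training row closest to the new attributes."""
    nearest = min(training_rows, key=lambda row: math.dist(row[:9], attributes))
    return nearest[9]


# Two hypothetical new cases (not taken from the data set).
print(classify(rows, (9, 10, 9, 8, 6, 10, 8, 7, 1)))  # 4
print(classify(rows, (4, 1, 1, 1, 2, 1, 3, 1, 1)))    # 2
```

Because the known class of each training row drives the answer, this is supervised learning in the sense described above.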

Unsupervised Learning
The most important difference in unsupervised learning is that there is no output or target
variable to train against. In unsupervised learning, the goal is not to classify or predict anything,
but rather to segment the cases into groups based upon the attributes in the data set. The
primary technique employed in unsupervised learning is called clustering, which is a
process that evaluates all of the attributes and assigns each instance to a group based upon the
similarity of the attributes. This is very different from classification in supervised learning, in
that the groupings in supervised learning have a known meaning. When we speak of bias in
supervised learning, we are referring to the fact that we know in advance the groups into which we
want to segment the data. In unsupervised learning and clustering, we do not know the groups or
what they mean. The objective is to identify the groups; once they have been discovered, we
can examine them to determine what they mean or how they can be used.
It might be helpful to consider an example. Imagine that we have a customer loyalty program
and we want to get a better understanding of the customers in this program. Imagine that we
use all of the products purchased by these customers as input data for a clustering algorithm. The
objective would be to find groups of customers who are similar to each other in terms of the
kinds of products that they purchase. An unsupervised learning algorithm such as k-means
clustering could identify the groups of customers. Once the groups are identified, we could examine
each group to discover what makes its members similar and then target them with the products they
are likely to respond to.
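The customer example can be sketched with a bare-bones version of k-means. Everything specific here is an assumption made for illustration: each customer is reduced to two hypothetical spending figures (groceries, electronics), the starting centroids are simply the first k points rather than a random choice, and the function name k_means is hypothetical. A production analysis would normally use an established library implementation.

```python
import math


def k_means(points, k, iterations=10):
    """A bare-bones k-means: returns (centroids, cluster index for each point)."""
    centroids = list(points[:k])  # simplistic start: the first k points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(point, centroids[c]))
            for point in points
        ]
        # Update step: move each centroid to the mean of its member points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Hypothetical customers: (grocery spend, electronics spend).
customers = [(90, 5), (10, 80), (85, 10), (15, 95), (95, 8), (5, 85)]
centroids, groups = k_means(customers, k=2)
print(groups)  # [0, 1, 0, 1, 0, 1]
```

Note that the algorithm only reports group 0 and group 1. Deciding that one group looks like "grocery shoppers" and the other like "electronics shoppers" is left to the analyst.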
An important concept in unsupervised learning that differentiates it from supervised learning is
the point at which outcomes are determined. In supervised learning, we are using data to train a
model to make the right decision, that is, to place an instance into the right class. In order to do
this we need to know what those classes are in advance, and we need training data that includes both
the training attributes and the outcome that should result.
In unsupervised learning, we don't have this knowledge of the correct outcomes in
advance. Unsupervised learning helps us to identify the different outputs or groups, which the
machine learning practitioner must then evaluate to determine what they mean and whether and how
they are relevant.

Reinforcement Learning
Reinforcement learning differs from supervised learning in a number of important ways.
Supervised learning, as we have seen, uses a set of training data to
develop a classification model. The data includes both the training attributes and the
desired output for each instance. In supervised learning, the process of learning is
one of developing a model that fits the available attributes to the known outcomes. The model
can improve over time as more data becomes available with which to train
it. In supervised learning there are two elements that must be known a priori. The first is
the set of attributes that will fuel the model training; the second is the desired output of the model.
In a classification problem, the desired output is the appropriate category or class in which to place
an instance for any set of attributes that is presented.
In the breast cancer example that we discussed, the objective of learning is to develop a model
that can accurately predict whether cells are cancerous or benign based upon the values of a
set of nine attributes. In this example, the desired output is known: we want to know whether the
cells are cancerous or not.
In contrast, unsupervised learning develops models to group instances together based upon the
similarity of the attributes they have with other instances. The unsupervised approach is
distinctly different from supervised learning in that the desired output is NOT known. In
supervised learning, we know exactly what the output should be. In the breast cancer example
we wanted to identify every instance as either cancerous or benign.
Reinforcement learning has elements of both supervised and unsupervised learning. In
supervised learning the goal is to create a classifier that can make the right decision. For
example, in our breast cancer example, the goal is to correctly determine if an instance is cancer
or not by making a decision to place the instance into either the cancer category or the benign
category.
In unsupervised learning the goal is to identify the instances that a particular instance is most
like. In unsupervised learning the data leads where it may, and clusters appear as they will
based upon the data. It is true that as machine learning practitioners we will often need to specify
the number of groups that we expect to exist; however, unsupervised learning doesn't have the
same concept of rightness of decision that we see in supervised learning.
Reinforcement learning takes yet another direction. In reinforcement learning the learning
problem becomes very important. In reinforcement learning the goal is to make the most
correct decision that is possible right now and to learn from the outcome. Imagine for a moment
that you are playing chess and you select a move that, based upon your knowledge of the game,
seems to be a good move. After you make the move, however, your opponent is able to gain an
advantage over you. In this case, the learning that occurs is that
we have learned that the move we made wasn't so good. We will want to revise our
assumptions so that the next time we see the same situation we will know that the move
was not a very good one.
This example provides a good description of the way that reinforcement learning operates.
Imagine that we have a machine-learning algorithm that is playing chess. Based upon the
attributes of the board, the algorithm will attempt to determine the best play. This play results in
the opponent gaining an advantage so we would need to adjust the machine-learning model such
that in the future the algorithm will not make the same mistake again.
What is important in this description is the idea of feedback. The machine-learning algorithm
obviously didn't have enough training examples to allow it to make the right decision for every
move that it would need to make in the chess game. As in unsupervised learning, the algorithm
may not know all of the correct outputs.
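The feedback loop in the chess description can be sketched as a small table of estimated move values that is corrected after each outcome. This is a toy illustration under stated assumptions: the two move names, the starting estimates, the learning rate, and the outcome score of -1.0 are all hypothetical, and real chess programs are far more elaborate than a two-entry table.

```python
# Hypothetical estimates of how good each move is in one board position.
move_values = {"advance pawn": 0.6, "develop knight": 0.5}
LEARNING_RATE = 0.5  # how strongly one outcome changes an estimate (assumed)


def choose_move(values):
    """Pick the move that currently looks best."""
    return max(values, key=values.get)


def learn(values, move, outcome):
    """Nudge the estimate for the chosen move toward the observed outcome."""
    values[move] += LEARNING_RATE * (outcome - values[move])


first_choice = choose_move(move_values)
print(first_choice)  # advance pawn

# Feedback: the opponent gained the advantage, scored here as -1.0.
learn(move_values, first_choice, outcome=-1.0)

second_choice = choose_move(move_values)
print(second_choice)  # develop knight
```

No one told the algorithm the correct move in advance. It made its best guess, observed the result, and revised its estimate so that the same mistake is less likely next time.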

Uses of Machine Learning

There are many examples of how machine-learning techniques can be used to solve real-world
problems. We have already looked at one of them, which is the use of a machine-learning
algorithm to determine whether a cell is cancerous or not based upon a set of attributes.
Most of these use cases, however, fall into four categories based upon the output of the
machine-learning algorithm. These categories are prediction, classification, clustering, and
estimation.
Prediction is the ability to foretell an output based upon the inputs. One specific example of
prediction might be a machine-learning algorithm that attempts to predict movement in the
stock market. Another example might be the ability to predict whether an individual will
develop diabetes, heart disease, or cancer based upon a set of predictive risk factors. A key
technique that is employed in machine learning algorithms for prediction is regression.
Classification is one of the most important and, quite frankly, one of the most widely employed
machine learning techniques. In classification, the objective is to place an instance into one of a
set of known, predefined categories based upon a set of input attributes. Examples of classification
would include the ability of Google mail to determine whether a particular email is a spam email, a
social media email, or a regular email message. In the email example there are three categories, and
based upon the input data each email message will be assigned to one of these categories. In
classification the machine learning algorithm learns the criteria used to determine which
category of email to place any particular message into.
Clustering is another important machine learning technique. In classification, the objective of
machine learning is to learn the model that will be used to classify a new instance of data into a
particular category. The process of learning requires that the machine learning algorithm have a
number of training cases, where each of these cases includes the attributes that are collected
(which are the same attributes that will be used to classify a new instance once the model has
been learned) and the proper category to place the case into. Because both the attributes
AND the proper category have been provided, classification is called supervised learning:
some elements of the learning have been provided in advance and are used as
assumptions.
These assumptions (which are called bias) have meaning to the machine learning practitioner. In
clustering there are no such assumptions provided in advance. The goal of clustering is to
identify patterns in the data based upon similarities that are discovered by the learning
algorithm between the different cases. It is the job of the machine learning practitioner (data
scientist) to determine what these patterns mean. This form of learning is called unsupervised
because there are no assumptions provided in advance (with the exception that the data scientist
will often need to specify the number of groups they are looking for). The job of the machine
learning algorithm is to discover what, if any, patterns exist.

Output of learning
The output of the learning algorithm is typically a decision. In the case of regression, the
algorithm will decide the output value based upon the input value. In logistic regression the
output is often a two-category value such as 0 or 1, true or false, yes or no. In clustering the
decision is the cluster that a particular instance should belong to, and in classification the decision
is the category into which the instance belongs.
Each of these outputs is expressed using a variable. In the case of linear or curvilinear regression
the output is typically a continuous variable that can take on any value of output based upon the
input.
In the case of logistic regression, and of classification problems that have only 2 categories, the
output is typically a Boolean variable, which can take on the values of 0 and 1 and can be defined
to mean true or false, yes or no, off or on, or more specific things such as spam mail or not spam
mail.
In the case of both clustering and classification, a categorical variable is used,
which can take on a finite number of discrete values. For example, if a clustering algorithm is
defined to find 4 clusters within the data, then the output for any instance will indicate which of
the 4 clusters it belongs to, perhaps using the designators 1, 2, 3, and 4.
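The three kinds of output variable described here (continuous, Boolean, and categorical) can be put side by side in a short sketch. The coefficients, the 0.5 cutoff, and the four cluster centers below are made-up values chosen only to show the type of each output; none of them come from a trained model.

```python
import math


def linear_output(x, slope=2.0, intercept=1.0):
    """Linear regression: the output is a continuous value."""
    return slope * x + intercept


def logistic_output(x, weight=1.5, bias=-3.0):
    """Logistic regression: a probability turned into a 0/1 decision."""
    probability = 1 / (1 + math.exp(-(weight * x + bias)))
    return 1 if probability >= 0.5 else 0


def cluster_output(x, centers=(1.0, 4.0, 7.0, 10.0)):
    """Clustering: the designator (1 to 4) of the nearest cluster center."""
    distances = [abs(x - center) for center in centers]
    return distances.index(min(distances)) + 1


print(linear_output(3.0))    # 7.0  (continuous)
print(logistic_output(3.0))  # 1    (Boolean: yes/no, spam/not spam)
print(cluster_output(3.0))   # 2    (categorical: one of 4 clusters)
```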
Of course, one of the things that the analytics practitioner needs to do is to take these output
variables and translate them into information that decision makers can act on. For example,
if we used logistic regression in an algorithm to predict which customers are likely to buy within
the next 30 days, then the data analytics practitioner would prepare a list of those customers in
the "buy" group and provide this to marketing and sales so that they can target these customers
with offers and advertising.
Prediction involves determining a future value based upon historical data. A forecast is an
example of prediction. Imagine that you are a buyer working for Walmart. Your job is to
determine how many basketballs to keep in stock at the Walmart store in Anchorage, Alaska. How
would you do this? One thing you would do is to consult how many basketballs were sold during
the same period of time in previous years. You may want to be a bit more sophisticated and
attempt to factor in expected weather, using the assumption that if the weather is colder or rainier
than normal, then people will not be playing basketball as much and the demand for new
basketballs might be lower. Prediction often involves the use of statistical procedures such as
regression analysis.
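A forecast of this kind can be sketched with an ordinary least-squares line fitted to past sales. The sales figures below are invented purely for illustration (they are not real Walmart data), the function name fit_line is hypothetical, and the sketch ignores the weather factor mentioned above.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical history: basketballs sold in the same period of years 1 to 4.
years = [1, 2, 3, 4]
units_sold = [100, 110, 120, 130]

slope, intercept = fit_line(years, units_sold)
forecast_year_5 = slope * 5 + intercept
print(forecast_year_5)  # 140.0
```

Adding a second input such as expected temperature would turn this into multiple regression, which follows the same idea with more coefficients.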
Classification is an important machine learning capability. The objective of classification is to
take an item or event and automatically determine what it is based upon its characteristics. A
good example that we can all be familiar with is credit approvals. Have you ever been at your
local department store when the clerk asked if you wanted to fill out an application and get
instant approval for the store's credit card? If so, you will recall that you had to fill in an
application with a few pieces of information, which the clerk either keyed into a system or called
in over the phone, and the clerk then received either an approval to extend credit, along with the
amount, or a denial. What is actually happening behind the scenes is that the credit
company is using a classification algorithm that makes a decision about whether you are a good
or poor credit risk. This algorithm has learned to identify good credit risks and poor credit risks.
It was able to learn this by reviewing a large number of cases of individuals who are good credit
risks or poor credit risks. These cases that are reviewed by the algorithm are called the training
data. What makes classification a form of machine learning is the fact that the training data is used
by the algorithm to learn how to identify good credit risks as opposed to poor credit risks. As
additional training data becomes available, the algorithm continues to refine its ability
to identify good as opposed to poor credit risks.
Clustering is an example of unsupervised learning. Unsupervised learning is simply learning that
has little or no bias. If you consider our description of classification, the goal was to develop an
algorithm that could learn a pattern such that when a new instance was received it could
determine which of a set of predefined categories to place the instance into. In the credit
example, the objective would be to place a particular person's credit application into a category.
One category might be to deny credit, another might be to give minimal credit, and still others
might be to extend very large credit lines.
In the classification example we know in advance what the categories are. We can define the
characteristics of the category and the role of the machine learning algorithm is to determine into
which category a new instance (a new credit application for example) should be placed. This a
priori knowledge of the categories is called bias because it makes the assumption that these
categories exist within the data.
In clustering we have no knowledge of the existence of any specific categories within the data or
what such categories might mean. The role of clustering is to find patterns in the data and report
those patterns. The job of making sense of what those groupings of instances or patterns in the
data mean or how they can be used is the job of the data scientist.
A simple example of clustering can be seen in the Netflix movie service. As a
particular user begins to select, watch, and rate movies, Netflix begins to collect
characteristics about that individual. They of course couple this data with any other data they have
on the user and perform a cluster analysis. Essentially they are trying to find people who have
similar interests based upon the movies they have watched and the ratings they have given those
films. From this cluster analysis, groups within their subscriber community emerge. Netflix
then makes the assumption that people who fall into the same group are likely to have the same
tastes in movies, and Netflix will provide as recommendations those movies that others within the
same group have watched and enjoyed but that the user has not yet had the opportunity to view.
Netflix doesn't know in advance what groups will emerge or even how many different groups
might exist. They simply look for patterns in the data and use the groupings that emerge to build
recommendations.
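The last step of this description, recommending what others in the same group have watched, can be sketched as follows. This is not Netflix's actual system: the users, film titles, and group labels are hypothetical, the groups are assumed to have already been produced by a clustering step, and the function name recommend is made up for the example.

```python
from collections import Counter

# Hypothetical viewing histories.
watched = {
    "ann": {"film_a", "film_b", "film_c"},
    "bob": {"film_a", "film_c", "film_d"},
    "cal": {"film_b", "film_c", "film_d"},
    "dee": {"film_x", "film_y"},
}

# Hypothetical result of an earlier clustering step: a group label per user.
group_of = {"ann": 1, "bob": 1, "cal": 1, "dee": 2}


def recommend(user):
    """Films watched by others in the user's group that the user has not seen."""
    counts = Counter()
    for other, films in watched.items():
        if other != user and group_of[other] == group_of[user]:
            counts.update(films - watched[user])
    # Most widely watched films within the group come first.
    return [film for film, _ in counts.most_common()]


print(recommend("ann"))  # ['film_d']
```

The group labels carry no meaning of their own. They only record which users ended up together, which is all the recommendation step needs.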
As we proceed through the course, we will examine some of these outcomes of learning in detail
and examine the workings of the algorithms that produce these results.

[i] Deshpande, S. P., & Thakare, V. M. (2010). Data mining system and applications: a review.
International Journal of Distributed and Parallel systems (IJDPS), 1(1), 32-44.

[ii] Larose, D. T. (2005). Discovering Knowledge in Data: An Introduction to Data Mining.
Hoboken, New Jersey: John Wiley & Sons.

[iii] Bache, K., & Lichman, M. (2013). UCI Machine Learning Repository
[http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and
Computer Science.
