
11. Machine Learning System Design

11.1 Prioritizing What to Work On


In the next few lessons I'd like to talk about machine learning system design. These lessons will touch on the main issues that you may face when designing a complex machine learning system. And even though this material may seem somewhat less mathematical, I think it can turn out to be very useful, and potentially a huge time saver when you're building big machine learning systems.

Concretely, I'd like to begin with the issue of prioritizing your time: deciding what to work on. I'll start with an example on spam classification. Let's say you want to build a spam classifier, and consider a couple of examples of obvious spam and non-spam emails.

Notice how spammers will deliberately misspell words, for example writing a 1 in place of an i, or writing 'm0rtgages' with a 0 instead of an o. Let's say we have a labeled training set of some number of spam emails (labeled y = 1) and some non-spam emails (labeled y = 0). How do we build a classifier using supervised learning to distinguish between spam and non-spam?

In order to apply supervised learning, the first decision we must make is how to represent x, that is, the features of the email. Given the features x and the labels y in our training set, we can then train a classifier, for example using logistic regression. Here's one way to choose a set of features for our emails: we could come up with a list of maybe a hundred words that we think are indicative of whether an email is spam or non-spam, for example words like 'deal', 'buy', or 'discount'.

So, if a piece of email contains the word 'deal' or 'buy' or 'discount', maybe it's more likely to be spam, whereas if a piece of email contains your name, it is more likely to be non-spam.

Given the word features, we can then take a piece of email and encode it into a feature vector as follows. For each email, I'm going to check whether or not each of the feature words appears in it, and then define a feature vector x whose j-th entry is 1 if the j-th word appears and 0 otherwise. In general, I'm not going to count how many times each word occurs, just whether it occurs.
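As a sketch of this encoding (the word list and email text here are invented examples, not from any real dataset):

```python
import re

# Hypothetical hand-picked list of indicator words.
feature_words = ["deal", "buy", "discount", "andrew", "now"]

def email_to_feature_vector(email_text, words):
    """Encode an email as a binary vector: entry j is 1 if words[j]
    appears anywhere in the email, 0 otherwise (presence, not counts)."""
    tokens = set(re.findall(r"[a-z0-9]+", email_text.lower()))
    return [1 if w in tokens else 0 for w in words]

# "Deal of the week: buy now!" contains 'deal', 'buy', and 'now'.
print(email_to_feature_vector("Deal of the week: buy now!", feature_words))
# -> [1, 1, 0, 0, 1]
```

Note the use of a set of tokens: this is what makes the vector record presence rather than counts, as described above.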

So, that gives me a feature representation of a piece of email. By the way, even though I've described this process as manually picking a hundred words, in practice what's most commonly done is to look through a training set and pick the n most frequently occurring words, where n is usually between ten thousand and fifty thousand, and use those as your features.
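A minimal sketch of this vocabulary-selection step, under the simplifying assumption that simple regex tokenization is good enough (a real system would handle tokenization more carefully):

```python
from collections import Counter
import re

def build_vocabulary(training_emails, n):
    """Return the n most frequently occurring words across the
    training set, to be used as the feature words."""
    counts = Counter()
    for email in training_emails:
        counts.update(re.findall(r"[a-z0-9]+", email.lower()))
    return [word for word, _ in counts.most_common(n)]

# Tiny invented training set; real systems use n around 10,000-50,000.
emails = ["buy now, buy cheap!", "meeting at noon", "cheap deal, buy today"]
print(build_vocabulary(emails, 3))  # 'buy' (3 occurrences) comes first
```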

Now, if you're building a spam classifier, one question you may face is: what's the best use of your time in order to make your spam classifier have higher accuracy, or lower error? One natural inclination is to go out and collect lots of data, because there's a tendency to think that more data gives a better algorithm. In fact, in the email spam domain there are fairly serious efforts called honeypot projects, which create fake email addresses to try to collect tons of spam email and use it to train learning algorithms.

The problem is that collecting more data is not always helpful. For most machine learning problems, there are a lot of other things you could imagine doing to improve performance. For spam, one thing you might think of is to develop more sophisticated features based on the email routing information, which is contained in the email header. When spammers send email, they will very often try to obscure its origins, perhaps using fake email headers or sending the email through unusual sets of computers and unusual routes in order to get the spam to you. Some of this information is reflected in the email header, so by looking at the headers it's possible to develop more sophisticated features to identify spam. Something else you might consider is to look at the email message body, that is, the email text, and develop more sophisticated features there. For example, should the words 'discount' and 'discounts' be treated as the same word? Should 'deal' and 'dealer' be treated as the same word? We can also develop more complex features based on punctuation, because maybe spam uses exclamation marks and misspelled words a lot more. To summarize, these are some of the options for improving the spam classifier.
The point is that in machine learning problems there are usually many options available. What often happens is that a research group or product group will somewhat randomly fixate on one of them, and sometimes that turns out not to be the most fruitful way to spend your time. Hence, what I'd like to tell you about in the next lesson is the concept of error analysis: a more systematic way to choose among the many different things you might work on, and therefore be more likely to select what is actually a good use of your time in machine learning system design.

11.2 Error Analysis


In the last lesson, I said that when you are faced with a machine learning problem, there are often lots of different ideas for how to improve the algorithm. In this lesson, let's talk about the concept of error analysis, which will help you be more systematic when making some of these decisions. If you're starting work on a machine learning product or application, it is often considered very good practice to start not by building a very complicated system with lots of complex features, but instead by building a very simple algorithm that you can implement quickly.

And when I start on a learning problem, what I usually do is spend at most one day, literally at most 24 hours, to get something quick and dirty running, not at all a sophisticated system, and then test it on my cross validation data. Once you've done that, you can plot learning curves to figure out whether your learning algorithm is suffering from high bias, high variance, or something else, and use that to decide whether more data, more features, and so on are likely to help.

The reason this is a good approach is that when you're just starting out on a learning problem, there's often no way to tell in advance whether you need more complex features, more data, or something else. It's often by implementing even a very quick and dirty version and plotting learning curves that you can make these decisions. You can think of this as a way of avoiding what's sometimes called premature optimization in computer programming.
In addition to plotting learning curves, one other thing that's often very useful to do is error analysis. What I mean by that is that when building, say, a spam classifier, I will often look at my cross validation set and manually examine the emails that my algorithm is making errors on. So, look at the spam emails and non-spam emails the algorithm is misclassifying, and see if you can spot any systematic patterns in the types of examples it gets wrong. Often this is the process that inspires you to design new features.

Note: error analysis is usually done on the cross validation set rather than on the test set.

Summarizing

Concretely, here's a specific example. Let's say you've built a spam classifier and you have 500 examples in your cross validation set, of which the algorithm misclassifies 100.

(i) The idea is to manually categorize those 100 misclassified emails, for example by what type of email they are.

And by counting up the number of emails in these different categories, you might discover, for example, that the algorithm is doing particularly poorly on emails trying to steal passwords. That may suggest it's worth your effort to look more carefully at that type of email and see if you can come up with better features to classify them correctly.
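The counting step above can be sketched in a few lines (the category labels and counts below are invented for illustration):

```python
from collections import Counter

# Hypothetical manual categorization of 100 misclassified CV emails.
misclassified_types = (
    ["pharma"] * 12
    + ["replica/fake goods"] * 4
    + ["steal passwords (phishing)"] * 53
    + ["other"] * 31
)

category_counts = Counter(misclassified_types)
for category, count in category_counts.most_common():
    print(f"{category}: {count}")
# The largest category (phishing in this made-up example) is the most
# promising place to look for new features.
```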

(ii) I might also look at what features would have helped the algorithm classify the emails correctly. So let's say we count how often certain phenomena occur among the misclassified emails. If deliberate misspelling turns out to be a sufficiently rare phenomenon on your cross validation set, then maybe it's not worth your time trying to write algorithms to detect it. But if you find that a lot of spammers are using unusual punctuation, then that's a strong sign it might be worth your while to develop more sophisticated features based on that.

So, this sort of error analysis, which is really the process of manually examining the mistakes the algorithm makes, can often guide you to the most fruitful avenues to pursue. This also explains why I often recommend a quick and dirty implementation of an algorithm: what we really want is to figure out which examples are the most difficult for an algorithm to classify. Very often different learning algorithms find similar categories of examples difficult, and a quick and dirty implementation is often the fastest way to identify those hard examples so you can focus your efforts on them.

Lastly, when developing learning algorithms, one other useful tip is to make sure you have a way of numerically evaluating your learning algorithm. What I mean is that it is often incredibly helpful to have an evaluation that gives you back a single real number, maybe accuracy, maybe error, that tells you how well your learning algorithm is doing. I'll talk more about this concept in later lessons, but here's a specific example.

Let's say we are trying to decide whether or not we should treat words like 'discount', 'discounts', 'discounter', and 'discounting' as the same word. One way to do that is to just look at the first few characters of a word. In natural language processing, this is done using a type of software called a stemmer (for example, the Porter stemmer). But using stemming software that basically looks only at the first several letters of a word can also hurt, because some information is lost. So if you're trying to decide whether or not to use stemming for a spam classifier, error analysis alone may not actually be helpful. Instead, the best way to figure out whether stemming helps your classifier is to numerically evaluate your algorithm. Concretely, the most natural thing to do is to compare the cross validation error of the algorithm with and without stemming.

If, say, the error is lower with stemming, then using stemming looks like a good idea. For this particular problem there's a very natural single real number evaluation metric, namely the cross validation error. We'll later see examples where coming up with a single number evaluation metric takes a little more work, but in cases like this it can be used directly. For example, we can use the same sort of evaluation to decide whether we should treat upper-case and lower-case words differently.
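Here's a sketch of such a with/without comparison, using a toy suffix-stripping function as a stand-in for a real stemmer (such as NLTK's Porter stemmer) and invented classifier outputs:

```python
def naive_stem(word):
    """Toy stand-in for a real stemmer: strip a few common suffixes so
    'discount', 'discounts', 'discounting' map to the same token."""
    for suffix in ("ing", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def cv_error(predictions, labels):
    """Fraction of cross validation examples classified incorrectly."""
    wrong = sum(p != y for p, y in zip(predictions, labels))
    return wrong / len(labels)

# Hypothetical predictions from the same classifier trained with and
# without stemming, evaluated against the same CV labels.
labels            = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
preds_no_stemming = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]   # 3 mistakes
preds_stemming    = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0]   # 1 mistake

print(cv_error(preds_no_stemming, labels))  # -> 0.3
print(cv_error(preds_stemming, labels))     # -> 0.1
# Lower CV error with stemming: the single-number metric says keep it.
```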

So when you're developing a learning algorithm, very often you'll be trying out lots of new ideas and lots of new versions. By having a single real number evaluation metric, you can just look and see whether a new idea has improved or worsened the performance of the learning algorithm, and this will often let you make much faster progress.

11.3 Error Metrics for Skewed Classes


In the previous lesson, I talked about error analysis and the importance of error metrics: having a single real number evaluation metric for your learning algorithm to tell how well it's doing. In the context of evaluation and error metrics, there is one important case where it's particularly tricky to come up with a good evaluation metric: the case of what are called skewed classes.

Consider the problem of cancer classification, where we have features of medical patients and we want to decide whether or not they have cancer. This is like the malignant versus benign tumor classification example we had earlier. Suppose we train a classifier and find we get 1% error on the test set; that is, we're making correct diagnoses 99% of the time. That seems like a really impressive result. But now, let's say we find out that only 0.5% of the patients in our training and test sets actually have cancer. In this case, the 1% error no longer looks so impressive. In particular, consider a piece of non-learning code that takes the input features x and simply ignores them.

It always predicts that nobody has cancer, and this "algorithm" would actually get 0.5% error. So this is even better than the 1% error we were getting, and it is a non-learning algorithm that just predicts y = 0 all the time.
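The non-learning baseline described above is just a couple of lines; the toy labels below are invented to match the 0.5% positive rate:

```python
def predict_cancer(x):
    # Ignore the input features x entirely and always predict y = 0.
    return 0

# Toy test set: 5 of 1000 patients (0.5%) actually have cancer.
labels = [1] * 5 + [0] * 995
predictions = [predict_cancer(None) for _ in labels]

error = sum(p != y for p, y in zip(predictions, labels)) / len(labels)
print(error)  # -> 0.005, i.e. 0.5% error without learning anything
```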
This setting, where the ratio of positive to negative examples is very close to one of the two extremes (in this case the number of positive examples is much smaller than the number of negative examples, because y = 1 is very rare), is what we call the case of skewed classes. We just have a lot more examples from one class than from the other, and by always predicting the most frequent class an algorithm can do pretty well on classification error.

So, with skewed classes, classification error alone cannot tell you whether the algorithm is doing something genuinely good or just defaulting to the most frequent class. A pair of evaluation metrics that handles this problem is called precision/recall.

Let's say we are evaluating a binary classifier on the test set, and assume we have skewed classes where y = 1 denotes the rarer class (a widely used convention in precision/recall evaluation). Given this, we can draw the following table of outcomes.

True positive: the algorithm classified a positive (rare) example correctly.

True negative: the algorithm classified a negative (common) example correctly.

False positive: the algorithm classified a negative example as positive. It saw a common example as rare; for example, it classified a patient without cancer as having cancer.

False negative: the algorithm classified a positive example as negative. It saw a rare example as common; for example, it classified a patient with cancer as healthy.

Given these definitions, here's a different way of evaluating the performance of our algorithm. We're going to compute two numbers. The first is called precision: of all the patients we predicted to have cancer, what fraction actually has cancer?

precision = true positives / (true positives + false positives)

The second number is called recall: of all the patients that actually have cancer, what fraction did we correctly detect?

recall = true positives / (true positives + false negatives)

So by computing precision and recall, we usually get a better sense of how well our classifier is doing. In particular, if we have a learning algorithm that predicts y = 0 (the most frequent class) all the time, then this classifier will have precision and recall equal to zero, because there won't be any true positives. That's a quick way for us to recognize that it isn't a very good classifier (taking y = 1 as the rare class).
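These definitions translate directly into code; the tiny label and prediction lists below are invented:

```python
def precision_recall(predictions, labels):
    """Compute (precision, recall), treating y = 1 as the rare class.
    Returns 0.0 for an undefined ratio (e.g. no positive predictions)."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)  # true positives
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)  # false positives
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

labels = [1, 0, 0, 1, 0, 0, 0, 0]
# Always predicting y = 0 gives no true positives at all:
print(precision_recall([0] * 8, labels))                   # -> (0.0, 0.0)
print(precision_recall([1, 0, 0, 0, 1, 0, 0, 0], labels))  # -> (0.5, 0.5)
```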

For the case when we predict y = 0 all the time we have:

No true positives (the algorithm never makes positive predictions);

No false positives (for the same reason);

A few false negatives (the few examples that were actually positive and that we predicted as negative);

All the remaining examples are true negatives.

And by using precision and recall, we find that it's not possible for an algorithm to "cheat" by predicting y = 0 or y = 1 all the time and still get both high precision and high recall. In particular, if a classifier achieves high precision and high recall, then we can be confident that the algorithm really is doing well, even with very skewed classes. So for problems with skewed classes, precision and recall give us more direct insight into how the learning algorithm is doing, and this is often a much better way to evaluate our learning algorithms than looking just at classification error or classification accuracy.

11.4 Trading Off Precision and Recall


In the last lesson, we talked about precision and recall as an evaluation metric for classification problems with skewed classes. For many applications, we'll want to somehow control the trade-off between precision and recall. Let me tell you how to do that, and also show you some even more effective ways to use precision and recall as an evaluation metric for learning algorithms.

As a reminder, here are the definitions of precision and recall from the previous lesson

And let's continue our cancer classification example, where y = 1 if the patient has cancer and y = 0 otherwise. Let's say we've trained a logistic regression classifier, which outputs probabilities between zero and one. So, as usual, we predict 1 if h(x) >= 0.5 and predict 0 if h(x) < 0.5, and this classifier will give us some value for precision and some value for recall.

But now, suppose we want to predict that a patient has cancer only if we're very confident that they really do. Because if you go to a patient and tell them they have cancer, it's going to be a huge shock, since this is seriously bad news, and they may end up going through a pretty painful treatment process. So maybe we want to tell someone we think they have cancer only if we're very confident. One way to do this would be to modify the algorithm so that instead of setting the threshold at 0.5, we predict y = 1 only if h(x) >= 0.7.

If you do this, then you're predicting cancer only when you're more confident, and you end up with a classifier that has higher precision: a higher fraction of the patients you predict to have cancer will actually turn out to have cancer, because by raising the threshold you restricted y = 1 predictions to the most confident cases. In contrast, this classifier will have lower recall, because you now predict y = 1 on a smaller number of patients and the number of false negatives increases.

We could even take this further. Instead of setting the threshold at 0.7, we can set it to 0.9, so we predict y = 1 only if we are more than 90% certain that the patient has cancer. Then a large fraction of the patients classified as y = 1 will turn out to have cancer, so this is a high precision classifier. However, it will have low recall, because with such a high threshold we are more likely to miss patients who really do have cancer (for example, a patient with cancer for whom the algorithm outputs a 65% probability would wrongly be classified as not having cancer).

Now consider a different example. Suppose we want to avoid missing too many actual cases of cancer, that is, avoid false negatives. In particular, if a patient actually has cancer but we fail to tell them, that can be really bad, because we deny them necessary treatment. In this case, rather than setting a higher probability threshold, we might instead set the threshold to a lower value.

By doing so, we are being more conservative and telling patients that they may have cancer, so they can seek treatment if necessary. In this case, what we get is a higher recall classifier, because we correctly flag a higher fraction of all the patients who actually do have cancer. But we end up with lower precision, because a higher fraction of the patients we said have cancer will turn out not to have cancer after all.

And so, in general, for most classifiers there is going to be a trade-off between precision and recall. As you vary the value of the threshold, you can actually plot a curve that trades off precision against recall, tracing out the range of different values you can get for them. By the way, the precision-recall curve can look like many different shapes, depending on the details of the classifier.

So is there a way to choose this threshold automatically? Or, more generally, if we have a few different algorithms or a few different ideas for algorithms, how do we compare their precision and recall numbers? For example, suppose we have three different learning algorithms (or actually the same algorithm with different threshold values); how do we decide which of them is best?

One of the things we talked about earlier is the importance of a single real number evaluation metric: a number that just tells you how well your classifier is doing. By switching to the precision/recall pair, we've actually lost that. Whereas if we had a single real number evaluation metric, it would help us much more quickly decide which algorithm to go with, and much more quickly evaluate different changes we may be contemplating for an algorithm. So, how can we get a single real number evaluation metric?

One natural thing you might try is to look at the average of precision and recall, (P + R)/2, and pick the classifier with the highest average value. But this turns out not to be such a good solution, because a classifier that predicts y = 1 all the time (algorithm 3) can get the highest average value while being a poor classifier, and that's wrong.

In contrast, there is a different way of combining precision and recall. It is called the F-score, and it uses the formula

F = 2 * (P * R) / (P + R)
So, for this example we have

In this case Algorithm 1 has the highest F-score, Algorithm 2 the second highest, and Algorithm 3 the lowest. The F-score, also called the F1 score, combines precision and recall, and for the F-score to be large, both precision and recall have to be reasonably large; neither can be small.
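A small sketch contrasting the plain average with the F-score; the three (precision, recall) pairs are hypothetical:

```python
def f_score(p, r):
    """F1 score: 2PR / (P + R); defined as zero when both are zero."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

# Hypothetical classifiers; "algo 3" acts like predicting y = 1 all
# the time (perfect recall, terrible precision).
classifiers = {"algo 1": (0.5, 0.4), "algo 2": (0.7, 0.1), "algo 3": (0.02, 1.0)}

for name, (p, r) in classifiers.items():
    print(name, "average:", (p + r) / 2, "F1:", round(f_score(p, r), 4))
# The average ranks algo 3 highest (0.51), but its F1 (~0.0392) is the
# lowest, while algo 1's F1 (~0.4444) is the highest.
```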

I should say that there are many possible formulas for combining precision and recall. This F-score formula is really just one out of a much larger number of possibilities, but historically or traditionally it is what people in machine learning use. It usually gives the effect you want: if either precision or recall is zero, the F-score is very low, whereas a perfect classifier with precision equal to one and recall equal to one gets an F-score of one. And it usually gives a reasonable rank ordering of different classifiers.

So, in this lesson we talked about the notion of trading off between precision and recall, and how we can vary the threshold used to decide whether to predict y = 1 or y = 0. If your goal is to set that threshold automatically, one pretty reasonable way is to try a range of different threshold values, evaluate them on your cross validation set, and then pick whichever threshold gives you the highest F-score on your cross validation set.
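That threshold-selection procedure can be sketched as follows (the classifier outputs and labels are invented):

```python
def f1(predictions, labels):
    """F1 score of binary predictions against true labels."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def best_threshold(probabilities, labels, thresholds):
    """Pick the threshold whose predictions maximize F1 on the CV set."""
    def score(th):
        return f1([1 if p >= th else 0 for p in probabilities], labels)
    return max(thresholds, key=score)

# Hypothetical h(x) outputs on a tiny cross validation set.
probs  = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   0,   1,   0,   0]
print(best_threshold(probs, labels, [0.3, 0.5, 0.7, 0.9]))  # -> 0.3
```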

11.5 Data for Machine Learning


In this lesson, I'd like to switch tracks a bit and touch on another important aspect of machine learning
system design, that is the issue of how much data to train on.

Now, in some earlier lessons I cautioned against blindly going out and spending lots of time collecting lots of data, because it only sometimes helps. But it turns out that under certain conditions, which I'll describe in this lesson, getting a lot of data and training a certain type of learning algorithm on it can be a very effective way to get very good performance.

For example, consider the problem of classifying between confusable words: in a given sentence, should the blank be filled with 'to', 'two', or 'too'? This is one example of a set of confusable words. Researchers studied this problem by taking a few different learning algorithms that were considered state of the art at the time. The exact details of these algorithms aren't important; think of it as just picking four different classification algorithms. What they did was vary the training set size, try out these learning algorithms on a range of training set sizes, and look at the resulting accuracies.

The trend was that these algorithms had remarkably similar performance, and the performance of all of them increased pretty much monotonically with training set size (the horizontal axis is the training set size, in millions). Notably, a so-called "inferior" algorithm, if given a lot more data, could beat the accuracy of a "superior" algorithm trained on less data. So in this setting, more data really does help.

Let's try to lay out a set of assumptions under which having a massive training set will help. First, let's assume that the features x contain sufficient information to predict y accurately. For example, in the confusable words problem, the features could capture the words surrounding the blank space we're trying to fill in. That is usually enough information to determine the correct word: just by looking at the surrounding words, I can pretty unambiguously decide which label is correct for a given set of confusable words. So that's an example where the features have sufficient information to specify y.

For a counterexample, consider the problem of predicting the price of a house from only the size of the house, with no other features. There are so many other factors affecting the price of a house besides its size that it's actually very difficult to predict the price accurately. So that would be a counterexample to the assumption that the features have sufficient information to predict the price to the desired level of accuracy.

So, one useful test of whether the features are enough is this: given the input features x, given the available information, could a human expert in this domain confidently predict the value of y? For the first example, an expert English speaker could easily resolve a confusable-words problem (like the 'to', 'two', 'too' problem). In contrast, if we go to an expert in the housing market and tell them only the size of a house, they probably could not give us an accurate prediction of its price.

So, suppose the features x have enough information to predict the value of y, and suppose we use a learning algorithm with a large number of parameters (for example, a neural network with many hidden units). This is a powerful learning algorithm with many parameters that can fit very complex functions; I'm going to call it a low-bias algorithm. Chances are, if we run an algorithm like this, we will fit the training set well and get a small training error.

Now let's say we also use a massive training set, where the number of training examples m is much larger than the number of parameters. Then the hypothesis is unlikely to overfit, and therefore the test error should be close to the training error, which is small. So we get very good performance: in this case, a lot of data helps a highly complex hypothesis avoid overfitting.

Another way to think about this is that in order to have a high performance learning algorithm, we want it to have neither high bias nor high variance. We address the bias problem by using a learning algorithm with many parameters, one expressive enough that, given features with sufficient information (the human-expert test above), it can represent the target function well. And by using a very large training set, we ensure that we don't have a variance problem.