Introduction
This will serve as an introduction to natural language processing. I adapted it from slides for
a recent talk at Boston Python.
The examples in this post are written in R, but are easily translatable to other languages. You
can get the source of the post from github.
Let's say we run a survey asking people what they want to learn about, and we get these free-text responses:
```
1 I like solving interesting problems.
2 What is machine learning?
3 I’m not sure.
4 Machien lerning predicts eveyrthing.
```
Let’s say that the survey also asks people to rate their interest on a scale of 0 to 2.
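To make the later sketches concrete, here is that data as R vectors. The responses are the four above; the interest scores are placeholder values chosen only for illustration, and the variable names (`responses`, `scores`) are just for these sketches.

```
# The four survey responses from above.
responses <- c(
  "I like solving interesting problems.",
  "What is machine learning?",
  "I'm not sure.",
  "Machien lerning predicts eveyrthing."
)

# Interest ratings on the 0-2 scale (placeholder values, for illustration only).
scores <- c(2, 1, 0, 1)
```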
First steps
• Computers can’t directly understand text like humans can.
o Humans automatically break down sentences into units of meaning.
• In this case, we have to first explicitly show the computer how to do this, in a process
called tokenization.
• After tokenization, we can convert the tokens into a matrix (bag of words model).
• Once we have a matrix, we can use a machine learning algorithm to train a model and predict scores.
Tokenization
Let’s tokenize the first survey response:
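A minimal sketch of this in base R, splitting on whitespace (using the `responses` vector from the sketch above):

```
# Split the first response into whitespace-delimited tokens.
tokens <- strsplit(responses[1], " ", fixed = TRUE)[[1]]
tokens
# "I" "like" "solving" "interesting" "problems."
```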
In this very simple case, we have just made each word a token (similar to Python's string.split(' ')).
Tokenization that extracts n-grams is also useful. N-grams are sequences of words, so a 2-gram is
two adjacent words. This allows the bag of words model to retain some information about word
ordering.
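As an illustration, bigrams can be built by pasting each token to the one that follows it (a sketch, not necessarily how the original analysis did it):

```
# Extract 2-grams (bigrams): pairs of adjacent tokens.
tokens <- strsplit("I like solving interesting problems.", " ", fixed = TRUE)[[1]]
bigrams <- paste(head(tokens, -1), tail(tokens, -1))
bigrams
# "I like" "like solving" "solving interesting" "interesting problems."
```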
Orthogonality
• As we saw in the previous slide, we want to generate as much new information as
possible while preserving existing information.
• This means we will generate multiple feature sets. All of the feature sets will eventually
be collapsed into one matrix and fed into the algorithm.
o It's recommended to keep one feature set built from the original input text.
• We can measure orthogonality by taking the vector distance or vector similarity between
each pair of document vectors (see the sketch below).
o The document vectors need to be reformatted so that they all contain the same terms.
Cosine similarities:
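One way the pairwise similarities could be computed is to build a simple lowercased term-count matrix over the responses and normalize each row. This is a sketch: the variable names are illustrative, and the numbers below come from the original analysis rather than from this snippet.

```
# Bag-of-words (term-count) matrix: one row per response, one column per term.
docs <- tolower(gsub("[[:punct:]]", "", responses))
token_list <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(token_list)))
dtm <- t(sapply(token_list, function(toks) table(factor(toks, levels = vocab))))

# Cosine similarity between every pair of document vectors:
# scale each row to unit length, then take the matrix product.
normalized <- dtm / sqrt(rowSums(dtm^2))
similarities <- normalized %*% t(normalized)

similarities         # pairwise cosine similarities
mean(similarities)   # mean similarity
```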
Mean similarity:
```
[1] 0.7292
```
Meta-features
• We may also wish to extract higher-level features, such as number of spelling errors,
number of grammar errors, etc.
• Can add meta-features to the bag of words matrix.
• Meta-features preserve information.
o If we are lowercasing everything, a “number of spelling errors” feature will
capture some of the original information.
• Can also extract and condense information.
o Several columns containing capitalized words will hold a lot of word-specific
information (including whether or not each word is capitalized), but a
"number of capitalizations" feature condenses all of that into a single column
(see the sketch after this list).
o If one of the criteria for a good essay is whether the student uses a synonym
for "sun", a meta-feature could look for all possible synonyms and condense
them into a single count.
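For example, a "number of capitalized words" meta-feature can be computed per response and bound onto the bag of words matrix. This sketch reuses `responses` and `dtm` from the earlier sketches:

```
# Meta-feature: count of capitalized words in each response.
capital_counts <- sapply(responses, function(text) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  sum(grepl("^[A-Z]", words))
})

# Append the meta-feature as an extra column alongside the term counts.
features <- cbind(dtm, capital_count = capital_counts)
```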
Relevance of Information
• Just like with a human, too much information will swamp an algorithm with irrelevant
inputs.
• Similarly, information that is too broad will not help much.
o For example, say a latent trait that gives a student a 2/2 on an essay vs a 0/2 is
the presence of a synonym for the word “sun” in the response
o Broad information would be several columns in our matrix that contain
synonyms for the word “sun”
o Specific information would be a feature that counts all of the synonyms up
• Our goal is to give the computer as much relevant information as possible. If a feature
is relevant, more specific is better; but the less specific a feature is, the more likely it
is to be at least somewhat relevant.
Linear regression
• A simple linear equation is $y = mx + b$, where y is the target value (score), m is a
coefficient, and b is a constant.
• In linear regression, we would do something like
$y = m_{1}x_{1} + m_{2}x_{2} + \dots + m_{n}x_{n} + b$.
o Each column in the matrix (feature) has a coefficient.
o When we train the model, we calculate the coefficients.
o Once we have the coefficients, we can predict how future text would score.
Coefficients:
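A sketch of how the model can be fit and its coefficients inspected with base R's `lm()`, assuming the `features` matrix and `scores` vector from the earlier sketches (with only four responses the fit is rank-deficient, so most terms come back with no usable coefficient):

```
# One column per feature; lm() estimates a coefficient for each, plus the intercept b.
train_frame <- data.frame(score = scores, features)
model <- lm(score ~ ., data = train_frame)

# Intercept plus per-feature coefficients (NA where a term added no information).
coef(model)
```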
Words that are not shown do not have a coefficient (i.e., they did not carry any useful
information for scoring).
Predicting scores
• Now that we have our coefficients and our intercept term, we can construct our
equation and predict scores for new text.
• Any new text has to go through the exact same process that we passed our training
text through.
o In this case, text will go through the bag of words model. We will skip
additional processing to keep it simple.
Let's take some new "test" text and predict a score for it.
Predictions (second fold):
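A sketch of how that prediction can be done, reusing `vocab`, `model`, and the preprocessing from the earlier sketches; the sentence below is purely illustrative and not the test text used for the original predictions.

```
# New text goes through the same preprocessing, using the *training* vocabulary.
new_text <- "I want to learn machine learning."   # illustrative example sentence
toks <- strsplit(tolower(gsub("[[:punct:]]", "", new_text)), " ", fixed = TRUE)[[1]]

# One-row feature set: term counts over the training vocabulary plus the meta-feature.
counts <- sapply(vocab, function(term) sum(toks == term))
new_row <- as.data.frame(t(counts))
new_row$capital_count <- sum(grepl("^[A-Z]", strsplit(new_text, " ", fixed = TRUE)[[1]]))

# Predict the 0-2 interest score for the new text
# (R will warn that the fit is rank-deficient with so little training data).
predict(model, newdata = new_row)
```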
Quantify error
• Quantify accuracy through one of several methods
o Kappa correlation
o Mean absolute error
o Root mean squared error
o All of them turn error into a single number
• Important to set random seeds when doing most machine learning methods so that the
error is comparable from run to run.
• $RMSE = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}(\hat{Y}_{i} - Y_{i})^{2}}$
• Our RMSE is 0.9354
• If we tried another method, and the RMSE improved, we would have a reasonable
expectation that the method was better.
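A sketch of that calculation in R, with `predicted` and `actual` standing in for the predicted and true scores:

```
# Root mean squared error: square the residuals, average them, take the square root.
rmse <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}

rmse(c(1.8, 0.4, 1.1), c(2, 0, 1))   # small made-up example
```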