
KNN and Ensemble for stock price prediction
2014-01-26 Joseph Ellsworth 4 Comments
Tags: ensemble, ensemble stock, ensemble stock prediction, ensemble stock price, knn, knn ensemble, knn neighbor, knn optimization, knn optimizer, knn prediction, knn price prediction, knn process, knn stock price, knn stock selection, nearest neighbor, Percent Higher High, predict stock prices, similar patterns, stock price prediction, trading algorithm, trading prediction

Applying KNN to stock price prediction
© Jan-25-2014 Joseph Ellsworth All Rights Reserved http://bayesanalytic.com/

How to apply KNN (K-nearest neighbors) algorithms to predict future stock price movements. How to combine KNN, score normalization and ensemble techniques for trading predictions.

History never repeats identically but similar patterns do occur and they tend to produce similar results.

I wrote this article in response to a question about how to apply KNN (K-nearest neighbors)

algorithms to predict future stock price movements. Please send feedback on how to

make it more clear. I wrote this at the level of abstraction where I prefer to receive

knowledge. If there is strong demand for sample code then I may be willing to provide it in

a second article.
Basic KNN process for trading prediction
There are many KNN introductions available but at the core it is a statistical technique used

to predict or classify data based on similarity to the nearest neighbor.

It requires a few extra steps to apply KNN to predict stock prices. The first step of this process is figuring out which indicators you want to measure, and the second is specifying a goal, such as the stock price rising 1% within 10 days. For this example I will use K=6, which means using the scores from the 6 neighbors with the closest scores.

Maury Osborne introduced the concept that two consecutive higher ticks tend to have a higher probability of the next tick going higher. I extended this to an indicator that measures the percentage of bars or ticks with higher highs over a specified period of time. The percentage of higher highs over 20 bars is abbreviated PHH(20). I use PHH in this article because it is easy to calculate.

For this example I will use a simple goal of the closing price rising by at least 1% in the next 40 bars. You can consider the 1% gain the magnitude component and the 40 bars the duration component of the goal. Some implementations require both magnitude and duration but my favored version only requires duration.

Let’s say that your measurement for the most recent bar is that 16 out of the previous 20 bars had a high that was higher than the high of the prior bar. This would yield a raw PHH(20) indicator score of 16/20 = 0.80.
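To make this concrete, here is a minimal Python sketch of the PHH(n) calculation as I read it from the description above. The function name and the data layout (a plain list of bar highs, oldest to newest) are my own illustrative assumptions, not code from the original system.

    def phh(highs, n=20):
        """Percentage of Higher Highs: fraction of the last n bars whose high
        exceeds the high of the bar immediately before it.
        `highs` is assumed to be a list of bar highs ordered oldest to newest."""
        if len(highs) < n + 1:
            raise ValueError("need at least n+1 bars to compute PHH(n)")
        window = highs[-(n + 1):]                  # n comparisons need n+1 highs
        higher = sum(1 for prev, cur in zip(window, window[1:]) if cur > prev)
        return higher / n

    # Matching the text: if 16 of the previous 20 bars made a higher high,
    # phh(highs, 20) returns 16 / 20 = 0.80.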

The raw indicator score is used to select the K historical scores closest to your computed measurement. With K=6 you would take the 3 scores that are immediately above and the 3 immediately below our raw score of 0.80. If 5 of those 6 neighbors had met the goal then you have an effective KNN score of 5/6 ≈ 0.83.


KNN purists may say you had a score of 1 (Majority consensus) where each measurement

was 1 if the goal was met and 0 if the goal was not met. I personally like the decimal

number as it can be used as a level of confidence.

Using the 3 closest scores above and below is an arbitrary choice for distance computation.

I could also have used the 3 mathematically closest.
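Here is a minimal sketch of that neighbor selection and confidence score, assuming the historical bars have already been scored with the raw indicator and tagged with whether they later met the goal. The "3 above / 3 below" rule is implemented literally; the names and data shapes are assumptions for illustration only.

    import bisect

    def knn_confidence(history, raw_score, k=6):
        """history: list of (indicator_score, goal_met) tuples for past bars.
        Returns the fraction of the k nearest neighbors (k/2 just below and
        k/2 at or above the raw score) that met the goal -- the decimal
        confidence described in the text."""
        ordered = sorted(history, key=lambda rec: rec[0])
        scores = [rec[0] for rec in ordered]
        pos = bisect.bisect_left(scores, raw_score)
        half = k // 2
        neighbors = ordered[max(0, pos - half):pos] + ordered[pos:pos + half]
        if not neighbors:
            return None
        met = sum(1 for _, goal_met in neighbors if goal_met)
        return met / len(neighbors)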

It is possible for our raw indicator score of 0.80 to match more than K neighbors. When this occurs we use the neighbors which are closest in time as the secondary selection filter. It is also reasonable to expand K when more exact matches are available. Since we are using arrays with numeric indexes where bars and trades are known to be sorted by time, we can compute a distance in time from any bar as the absolute value of its index minus the index of the matching candidate.

Some KNN implementations always use a time component as part of the selection criteria for the neighbors, and in that case a more complex selection algorithm is needed. You could use the time measured in bars, minutes, hours or days as the X axis and the raw PHH(20) score as the Y axis. The net distance is computed by finding the time difference and the feature difference between the candidate bar and all other bars; these are then combined by applying the Pythagorean theorem (hypotenuse formula). Once these are computed, a separate selection process is used to find the bars with the shortest hypotenuse. The more difficult part when including time as a selection point is finding a way to normalize the time contribution against the raw indicator output scale so you can properly balance the input of both. I tend to prefer restricting the total compare set and using only the raw indicator score for selection of the neighbors, but there are valid arguments either way. A larger system is likely to use both approaches.
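A small sketch of the combined time-plus-score distance described above. The time_scale normalization constant is an assumption I added to illustrate the balancing problem mentioned in the text; in practice it would have to be tuned against the data.

    import math

    def nearest_by_time_and_score(bars, idx, k=6, time_scale=100.0):
        """bars: list of raw indicator scores indexed by bar number (time order).
        Returns the k bar indices whose combined time/score distance from bar
        `idx` is smallest, using the Pythagorean (hypotenuse) combination."""
        target = bars[idx]
        candidates = []
        for j, score in enumerate(bars):
            if j == idx:
                continue
            dt = abs(j - idx) / time_scale      # time component, roughly normalized
            ds = score - target                 # feature (indicator) component
            candidates.append((math.hypot(dt, ds), j))
        candidates.sort()
        return [j for _, j in candidates[:k]]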


Extending for Normalized Magnitude score retention
I like to extend the native KNN output to retain the magnitude of gain within the next X bars, where X is the duration portion of the goal. With K=6 you may have percentage gains of [1.3%, 2.8%, -2%, 1%, 0.5%, 0.4%]. These can be used to compute a score that is the average % change for the selected neighbors.

I like this input because it allows the magnitude of the change to be represented, both positive and negative, as a component of the output numbers without requiring an explicit statement in the goal. For this sample the raw score would be the average of the six gains, 4.0% / 6 ≈ 0.667%.

Normalize this score against the entire set, which allows easy combination of disparate scores in the ensemble described below (think of it as a poor man’s gauge normalization). If the entire set had an average score of 0.183% and the average score from our neighbors is 0.667%, the normalized score is computed as 0.667% / 0.183% = 3.64.

This can be interpreted as: this bar’s neighbors had an average gain 3.64 times higher than the average gain for the entire set. Since this bar is similar, it is likely to produce a similar result.
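A minimal sketch of this normalization step, assuming you already have the forward percentage gains for the selected neighbors and for every usable bar in the set. The function name is mine.

    def normalized_magnitude(neighbor_gains, all_gains):
        """neighbor_gains: forward % gains for the selected neighbors.
        all_gains: forward % gains for every usable bar in the set.
        Returns the neighbor average divided by the set average, e.g.
        0.667 / 0.183 = 3.64 in the example above. Assumes the set
        average is non-zero."""
        neighbor_avg = sum(neighbor_gains) / len(neighbor_gains)
        set_avg = sum(all_gains) / len(all_gains)
        return neighbor_avg / set_avg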

It is reasonable to compute both the level of confidence and the normalized magnitude of change. These can be used as separate inputs to the optimizer described below, as they both provide interesting information: one predicts the amount of movement and the other predicts how likely that movement is to occur.

Performance Note: If you pre-compute the indicator to produce an array containing the output metric and an index into the trade or bar array, and then sort by the output metric, you can binary search for the closest matching scores very fast, or you can use a tree structure which is fast to look up and fast to update. I prefer a tree structure, which makes it easy to add new measurements without incurring the sort overhead. A linear search also works but is much slower.
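A minimal sketch of the pre-sorted-index approach using Python's standard bisect module; the helper names are mine, and the tree alternative mentioned above is only noted in a comment.

    import bisect

    def build_index(indicator):
        """Pre-compute once: (indicator_value, bar_index) pairs sorted by value.
        `indicator` is assumed to be a list aligned with the bar array."""
        return sorted((value, i) for i, value in enumerate(indicator))

    def add_bar(index, value, bar_index):
        """Keep the index sorted as new bars arrive. bisect.insort is O(n) per
        insert; a balanced tree structure avoids that cost, which is why the
        article prefers a tree for frequent updates."""
        bisect.insort(index, (value, bar_index))

    def nearest_indices(index, raw_score, k=6):
        """Binary-search the sorted index and return the bar indices of the k
        entries whose values bracket raw_score."""
        pos = bisect.bisect_left(index, (raw_score,))
        half = k // 2
        return [i for _, i in index[max(0, pos - half):pos + half]]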
The scores for any bars too new to compute success or to provide the forward looking score

must be suppressed from the historical set when choosing neighbors.


Extending with basic Ensemble
Once you have processed the data set as described above for a single indicator, you have a

good single KNN measurement that is probably useful but is unlikely to be highly accurate

for predicting future price movements.

Ensembles combine the output from multiple models with the assertion that the combined

output is likely to yield more accurate results than the prediction from any single model.

Ensemble enthusiasts also claim that ensemble outputs are less likely to be negatively impacted by special cases and are less prone to overfitting (over-learning).

The next step is to repeat the process for several additional measures and different time frames. For example, you might choose something like the following:

• Percentage of higher highs (10 bars)

• Percentage of higher highs (20 bars)

• Percentage of lower lows (10 bars)

• Percentage SMA(5) is above or below SMA(10)

• Distribution rank of this bar's maximum compared to the last 100 maximum scores

• The sky’s the limit


I recommend that you keep your individual indicators easy to code so you can spend more time on the core KNN, ensemble and optimizer algorithms; you can then add more indicators as needed.
Once you have the full set of measures and the KNN score for the most recent bar based on each of these measure sets, you can average those scores to produce the ensemble score, which hopefully will be more accurate than the score for any single indicator.

The score magnitude normalization step above is critical to allow easy combination via averaging because all the scores are already relative to the entire set. One extension that can yield substantial benefits is to extend the ensemble averaging method so it can apply different weights for each measure set. An optimizer can be used to find the set of weights which yields the highest success rate for the entire set. You can think of this as choosing which expert you trust the most and giving their input the greatest weight.
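A minimal sketch of the weighted ensemble average, assuming the per-feature normalized scores for the most recent bar are already computed. The dictionary layout and function name are illustrative assumptions; the weights are the quantities an optimizer would tune.

    def ensemble_score(feature_scores, weights=None):
        """feature_scores: dict mapping feature name -> normalized KNN score
        for the most recent bar. weights: optional dict of per-feature weights;
        defaults to equal weighting (a plain average)."""
        if weights is None:
            weights = {name: 1.0 for name in feature_scores}
        total_weight = sum(weights[name] for name in feature_scores)
        return sum(feature_scores[name] * weights[name]
                   for name in feature_scores) / total_weight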

Rejecting some neighbors to maximize accuracy
One weakness of the average-based ensemble combination is that a single score from a single neighbor for a single feature can drive too much change in the overall score. I have seen instances where 5 out of 6 neighbors had an average gain of 0.6% over the next 40 bars while the sixth neighbor had a loss of 4%. Statistically, the outlier changed the net score more than it should have, which made the output less accurate than desired. This is only an issue when using the normalized score mechanism mentioned above; if you are using the pure consensus rules, the statistical outlier is automatically eliminated.

Maximize statistical impact of the most similar neighbors
One way to adjust for this is to keep the K neighbors for each feature. Those neighbors which are selected the most times by multiple features are those which have exhibited the greatest similarity. Since they are the most similar, you can increase their weighted contribution to the ensemble score relative to those which matched on fewer facets. This allows you to accept some input from all the neighbors but give the greatest impact to those neighbors which are most similar. This requires that you defer the calculation of the feature average until after you have gathered the sets of neighbors for all of the active features.
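One possible reading of this weighting scheme in code, assuming you have already collected, per feature, the list of selected neighbor bar indices plus each bar's forward gain. The simple "weight equals selection count" rule is my own illustrative choice.

    from collections import Counter

    def similarity_weighted_gain(neighbors_by_feature, forward_gain):
        """neighbors_by_feature: dict mapping feature name -> list of neighbor
        bar indices selected for that feature.
        forward_gain: dict mapping bar index -> forward % gain.
        Neighbors picked by more features receive proportionally more weight
        in the combined average."""
        counts = Counter(idx for picks in neighbors_by_feature.values()
                         for idx in picks)
        total = sum(counts.values())
        return sum(forward_gain[idx] * count
                   for idx, count in counts.items()) / total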


Selecting a larger set and suppressing statistical outliers
Another way to avoid this is to use a higher K value and then reject the subset of neighbors whose future gains are furthest from the mean for the set. Think of this as throwing out the statistical outliers from a Cauchy or Lévy distribution. I am generally not a fan of throwing away data, but in this instance our sole goal is to increase prediction accuracy, and in this special case it can be beneficial. Benoît B. Mandelbrot introduced the idea that while stocks normally behave according to standard distribution rules, they occasionally exhibit wild moves outside of statistical expectations. We normally protect against these wild moves with hedges and stop losses, but they have a secondary effect of reducing prediction accuracy. We can extend Mandelbrot’s original assertion to say that “some conditions produce wildly different effects that are unlikely to repeat in a useful period of time.” If you can collect these wild effects into a cluster and use them, they can be aids in prediction as a kind of reverse neighbor look-up. In most conditions they represent a small number of wild outlier readings in an otherwise rational set that can skew statistical accuracy and need to be suppressed. Our ultimate goal is to maximize the number of times we correctly predict the price movement magnitude or direction, and to meet this goal suppressing the impact of wild inputs may help.

If more than a small number of neighbors are showing wild forward-looking effects outside the statistical norm, then it is prudent to pay attention, as it can be a warning of severe risk. In this case we may actually keep the wild inputs and either tune down or suppress the input from the normal results. A high number of wild results in any set of neighbors can also be used to dramatically reduce the confidence in the prediction or possibly eliminate the prediction altogether. This can be used when modeling to find and cluster wild effects, which can be a component of an early warning system for major market moves. This kind of warning signal can be beneficial to the portfolio risk manager to suppress purchases, adjust hedge levels and cause early sale of long positions. Using a system for wild behavior identification is a perfect example of when more years of data and a larger number of neighbors can yield beneficial results, by providing more wild events to learn from.

An easy way of filtering wild input is to reject any neighbors that produce predictor outputs that are more than X standard deviations away from the mean of the K neighbors. A faster selection mechanism is to throw away the upper and lower 10% of the readings, keeping only the 80% in the center. The nice part of this approach is that it allows the normalized score to be computed earlier, which simplifies some data management. These approaches can also be combined. I tend to prefer maximizing the contribution from neighbors that have been judged the most similar, but it could be beneficial to combine this with suppressing wild readings from less similar neighbors.
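A minimal sketch of the two rejection rules just described, under the assumption that `gains` holds the forward gains of the selected neighbors. The function name and parameters are mine.

    import statistics

    def trim_outliers(gains, method="percentile", x=2.0, trim=0.10):
        """method="stdev":      drop gains more than x standard deviations
                                 from the mean of the neighbors.
        method="percentile":    drop the top and bottom `trim` fraction of
                                 readings, keeping the center."""
        if method == "stdev":
            mean = statistics.mean(gains)
            sd = statistics.pstdev(gains)
            return [g for g in gains if abs(g - mean) <= x * sd]
        ordered = sorted(gains)
        cut = int(len(ordered) * trim)
        return ordered[cut:len(ordered) - cut] if cut else ordered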

The ability to reject or adjust a neighbor’s contribution relatively inexpensively is a benefit of KNN over some other classification systems, where it is more difficult to identify and adjust for the outliers.

Calculating the future price


I am not a huge fan of attempting to predict a specific future price, simply because I have never seen it deliver pinpoint accuracy and it sends the wrong message to the users who inevitably try to use it that way. On the other hand, I have seen it deliver high accuracy when predicting which stocks will rise in price versus which will drop. It is also beneficial when comparing purchase opportunities, where the greatest profit may come from the purchase with the larger predicted price increase. If you use it as a directional indicator rather than an absolute prediction, I think you will see greater trading success.

If you applied the score magnitude normalization above then you can take the output score from the ensemble and multiply to obtain the future price prediction. The formula for the price change is: Last Bar Closing Price * Average % Movement during the duration period * Normalized Score output.


Example:
Bar close price = 184.20

Average price move during goal period = 0.183% = 0.00183

Ensemble or KNN normalized magnitude score = 3.64

Target price movement = 184.20 * 0.00183 * 3.64 = 1.2269

Target price = 184.20 + 1.2269 = 185.43
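For completeness, a tiny Python sketch of the same arithmetic; the function name is mine and the inputs are simply the example values above.

    def target_price(close, avg_move_pct, normalized_score):
        """Reproduces the worked example: 184.20 * 0.00183 * 3.64 ≈ 1.23,
        giving a target of roughly 185.43."""
        move = close * avg_move_pct * normalized_score
        return close + move, move

    # target_price(184.20, 0.00183, 3.64) -> (≈185.43, ≈1.23)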

I have not tested this, but I suspect the probability of accurately predicting the exact price movement is relatively low. It is more useful as a +-20% indicator, so the actual price movement would be fairly likely to fall between $0.98 and $1.47. One way to measure the accuracy of the system is to compute the average % deviation between the predicted and actual moves. I suspect that the most accurate predictions will be for the top 85% to 95% scores and the bottom 5% to 15% scores, but this would need to be proven with real data tests.

Scoring and measuring


Scoring and measuring serve two purposes. The most important is to give you a good idea of how much you should trust the output of the algorithm to influence live trading. The second is that scoring is an essential component of optimization. It is important to keep these two aspects distinct in the mind of the engineer because, for optimization, the scoring must be fast to allow many passes, whereas scoring for trust is best done with a separate scoring mechanism that can be somewhat more expensive.

When optimizing you typically score against the training data set, and when scoring for trust you score against the portion of data set aside for testing, which the training and optimization algorithms have never seen.

Simple Ranked scoring


You apply the algorithm to every bar in your historical data set to obtain its computed output score. Then sort these bars to find the top 10% of bars with the best score. If you have done everything right and had some luck in choosing the features (indicators), then you should find that the bars in the top 10th percentile of combined scores have a success rate substantially higher than the average for the total set. You must suppress any bars from the earliest part of your history where you do not have enough data to accurately compute the indicators you have chosen.

You can measure success based on maximizing the number of successful bars in the top 10th percentile of your computed scores. A more accurate measure for the optimizer is obtained by running a quick back-test after each change; consider the change good if it results in increased net profit from the back-test.

Make sure you include transaction fees in the back-test or you will get inaccurate results.
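A minimal sketch of the ranked-scoring measure, assuming each historical bar already carries its ensemble score and a success flag for the goal. The names are mine.

    def top_decile_lift(scored_bars):
        """scored_bars: list of (ensemble_score, goal_met) pairs for every
        usable historical bar. Returns (top-10% success rate, overall success
        rate); the ratio between them is the lift described above."""
        ordered = sorted(scored_bars, key=lambda rec: rec[0], reverse=True)
        top = ordered[:max(1, len(ordered) // 10)]
        top_rate = sum(1 for _, met in top if met) / len(top)
        overall = sum(1 for _, met in scored_bars if met) / len(scored_bars)
        return top_rate, overall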

Prediction Accuracy scoring scaled


1. Compute the target price move for every bar using the algorithm described above.

2. Choose a target zone such as the predicted price move +- 20%.

3. Classify each item as a success if the price fell inside the predicted price range or a failure if it fell outside this range.

4. Rank order all items based on the absolute magnitude score.

5. Iterate through the items, keeping a count of total items reviewed so far versus successful items.

6. Current success rate = # of successes / # of items reviewed.

7. The item's success rate = min(last item's success rate, current item's success rate).

8. Stop when the current success rate drops below a threshold such as 90%.

9. The classification score for the pass is the # of successes before the threshold was encountered divided by the total set size. This score can be easily used by the optimizer. (A minimal code sketch of this scan appears after the variant list below.)

• A similar approach can be used, but rather than choosing a range, use this logic: if the predicted change was positive and the move within the duration target was positive, count a success; if the predicted change was negative and the move was negative, count a success; otherwise count a failure.

• A similar approach can be used, but rather than choosing an absolute range and threshold, compute the percentage difference between the predicted and actual moves and keep a running average of the absolute value of this difference. When this average rises above a threshold, exit. The number of records selected before reaching the threshold represents a classification accuracy that can be used.

• In trading we are generally most interested in the records at both ends of the extreme movement scores because that is where trades represent the greatest potential profit. You would run this scoring mechanism from both directions, high to low and low to high, and it is quite probable that you will find a higher selection efficiency at one end of the spectrum.
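The sketch referenced in step 9 above, written as a plain Python scan. It assumes you already have, per bar, the absolute magnitude score and a success flag for whether the realized move landed inside the target zone; the threshold-stop behavior follows my reading of steps 5 through 9.

    def scaled_accuracy_score(predictions, threshold=0.90):
        """predictions: (abs_magnitude_score, success) pairs.
        Walks the items from the largest magnitude score down, tracking the
        running success rate, and stops once that rate drops below
        `threshold`. Returns the number of successes seen before the stop
        divided by the total set size -- the classification score the
        optimizer can use."""
        ordered = sorted(predictions, key=lambda rec: rec[0], reverse=True)
        successes = reviewed = 0
        for _, success in ordered:
            reviewed += 1
            if success:
                successes += 1
            if successes / reviewed < threshold:
                break
        return successes / len(ordered)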

Prediction Accuracy scoring zoned


1. Compute the target price move for every bar using the algorithm described above. Create a target zone such as the predicted price move +- 20%.

2. Classify each item as a success if the price fell inside the predicted price range or a failure if it fell outside this range.

3. Take the absolute range of magnitude scores and divide by a target number of buckets to obtain a target bucket size. For example, if the move magnitude scores range from -6.83 to 5.2 and you want 100 buckets, then the bucket size will be 12.03 / 100 = 0.1203. (A minimal sketch of this bucketing appears after the list.)

4. Group each historical bar into its appropriate bucket.

5. Score each bucket based on the # of successful price predictions / total # of items in the bucket.

6. Select the buckets which have the highest probability of success.

7. Give priority to the buckets with the highest success score, but only when they have a sufficient number of items in them. Buckets with too few items should be considered statistical outliers and rejected. If you have too many buckets with too few items then reduce the number of buckets or use an alternative approach.
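The bucketing sketch referenced in step 3, under the assumption that each item carries its magnitude score and a success flag. Names and the equal-width bucket scheme follow my reading of the steps above.

    def bucket_success_rates(items, num_buckets=100):
        """items: (magnitude_score, success) pairs. Divides the score range
        into equal-width buckets (12.03 / 100 = 0.1203 in the example above)
        and returns (bucket_start, success_rate, item_count) per bucket so
        thinly populated buckets can be rejected as statistical outliers."""
        scores = [s for s, _ in items]
        low, high = min(scores), max(scores)
        width = (high - low) / num_buckets or 1.0   # guard against a flat score range
        buckets = [[0, 0] for _ in range(num_buckets)]   # [successes, total]
        for score, success in items:
            idx = min(int((score - low) / width), num_buckets - 1)
            if success:
                buckets[idx][0] += 1
            buckets[idx][1] += 1
        return [(low + i * width, (s / t) if t else 0.0, t)
                for i, (s, t) in enumerate(buckets)]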

Back Test scoring


Back-test scoring is exactly what it sounds like. You essentially emulate buying and selling stocks or options based on the predicted price. Back-test results which yield the greatest profit are considered the most successful.

In this system you may choose to place long purchases on any stock/bar combination where the last bar produced a score in the top 20% of the ranked scores, and sell when the last bar produces a score lower than the top 70% of the scores. You may also include a limit which requires the predicted profit to be at least 300% of commission and fee costs. Force all transactions still open at the end of the test to exit as if sold at market at that time. You should also include a maximum long value at risk so you can accurately predict the effect on a portfolio. Use of a stop loss should be tested as an optional component. Finally, test for bankruptcy and abort the test when you reach the minimum allowed account balance.

I suspect that, if using the absolute normalized scoring above, the top 5% and bottom 5% of scores will be less accurate than those between 5% and 25% and those between 75% and 95%. I also expect those in the 40% to 60% range to be less accurate when used to predict directional movement. If you use the accuracy ratio for predicting the two-part goal shown above, then the most accurate should be those at the very top of the score range. I have been surprised many times, so this is really a wild guess; the only real way to know is with extensive testing.
Optimizing
Once you have the first set of scores for the ensemble you may find that it is very accurate for some symbols and much less accurate for others. I have seen this many times and it tends to result from different stocks exhibiting different trading personalities. A given indicator such as SMA(5) as a percent of SMA(50) may be very valuable for SPY and produce negative results for AIG. I believe this is driven by the internal processes and personalities of the institutional traders who are actively trading those symbols, and these influences do change over time.

Optimizing is essentially the process we use to find the combined set of factors which produce the best result for a given subset of stocks during a given time-frame. The essential process is:

1. Change one or more parameter values,

2. Recompute the scores,

3. Determine if the change resulted in a positive result,

4. Keep the change if positive, reverse the change if negative.

5. Repeat for a specified number of passes or until the system has reached a semi-stable state.
The optimizer can be used to change many factors, such as:

• The contribution weight for various indicators or measure sets of the ensemble

• The durations of the indicators chosen

• Adding or removing some indicators or indicator durations

• K (the number of neighbors), either for the whole set or for individual indicators

• The number of days in the input data set

• Switching from limit, magnitude or directional success criteria to one of the others

• The sky is the limit


The set of combinations grows rapidly, so I generally suggest optimizing one axis at a time. It also saves a lot of time if you load the key values that can be changed by the optimizer from a data file so you can save the results after the run. I always save the intermediate results after each major pass so we can use those combinations as input to more advanced genetic optimizers.

Designing optimizers is an art as much as a science, and there is a risk of the optimizer finding micro cases which look good but actually produce negative trading results. The easiest optimizer is generally to randomly choose a feature to change, randomly choose a new value within a range of +-50% of the current value, and test that single change. More advanced genetic optimizers have been used, but be forewarned that it is easy to sink hundreds of hours into testing and tweaking optimizers.
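A minimal sketch of that single-axis random optimizer. The parameter dictionary and scoring callback are assumptions: score_fn(params) stands in for whatever recomputes the ensemble scores and returns a single figure of merit, such as back-test net profit.

    import random

    def optimize(params, score_fn, passes=200, spread=0.5):
        """Pick one parameter at random, perturb it within +/- spread (50%)
        of its current value, keep the change only if the score improves,
        otherwise reverse it. Repeats for `passes` iterations."""
        best = score_fn(params)
        for _ in range(passes):
            name = random.choice(list(params))
            old = params[name]
            params[name] = old * (1.0 + random.uniform(-spread, spread))
            trial = score_fn(params)
            if trial > best:
                best = trial                 # keep the improvement
            else:
                params[name] = old           # reverse the change
        return params, best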

Many optimizers have a tendency to go down the wrong path, which can cause them to never discover better optimization opportunities. One way to work around this is to periodically clear the entire optimization set to a default state and then run new passes. If you keep the results from these clean-slate passes as data files, it is relatively easy to use them as the input genome for future genetic optimizer passes.

Early in my ML (machine learning) work I found that genetic optimizers depending on purely randomly generated genome values very often yielded negative results for the entire set and never recovered. I invested weeks testing various ways to add random perturbation to the genome combination process and found several that worked, but none of them seemed to work as well or were as efficient as independently generating the starting genome settings from known good result sets that were produced by a less sophisticated random sequence optimizer.

Save some data for testing


Ensure you train against one data set and reserve some data from the same series to measure how well the system will perform. The test data must be completely isolated from the training data or you will not obtain an accurate measurement of your effective lift. The ultimate measure of success is how many of the bars from your reserved test data set which have a score in the top of the set (filter range, e.g. 90%) succeed.

When testing new algorithms I like to reserve 20% of my total time series for use as testing data. You should expect your classification success rate in the test data set to be lower than when applying it to the training data set. One measure of brittleness is how much difference there is between the two scores.
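A small sketch of the split itself, assuming the bars are held in time order; splitting chronologically rather than at random keeps future information out of training.

    def chronological_split(bars, test_fraction=0.20):
        """Reserve the most recent `test_fraction` of the series for testing so
        the optimizer never sees it."""
        cut = int(len(bars) * (1.0 - test_fraction))
        return bars[:cut], bars[cut:]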

It would be pretty cool if your top 10% of scores had 3 times the average win rate and your bottom 10% of scores had 3 times the average loss rate.

Thanks Joe Ellsworth

CTO Bayes Analytic

Feel free to contact me privately for any questions.

Contact us

I like http://www.youtube.com/watch?v=4ObVzTuFivY as a quick intro to the KNN algorithm.

Required Disclaimer
Forex, futures, stock, and options trading is not appropriate for everyone. There is a

substantial risk of loss associated with trading these markets. Losses can and will occur. No

system or methodology has ever been developed that can guarantee profits or ensure

freedom from losses. No representation or implication is being made that using this

methodology or system or the information in this letter will generate profits or ensure

freedom from losses. Forex and Option trading can result in losses that exceed the original

principal balance.

Hypothetical or simulated performance results have certain limitations. Unlike an actual

performance record, simulated results do not represent actual trading. Also, since the

trades have not been executed, the results may have under- or over-compensated for the

impact, if any, of certain market factors, such as lack of liquidity. Simulated trading
programs in general are also subject to the fact that they are designed with the benefit of

hindsight. No representation is being made that any account will or is likely to achieve profit

or losses similar to those shown.

Even results from live cash trading can be subject to specific market conditions that may not repeat in the future and, as such, future trading is unlikely to duplicate past results. Changing the dollar amount traded can cause different behavior in live trading markets, especially when trading large positions that can exceed the liquidity available in the market and cause changes in pricing behavior.

Bayes Analytic LLC provides software that can produce trading signals. The customer is

responsible for choosing a configuration and parameters for the software that meets their

own goals. The customer is responsible for conducting their own tests and only the

customer can activate the software to start trading. The software runs in an account the

customer has logged into and then activated the software. Bayes Analytic has no control

of, influence over, or visibility into the signals specific to a given user because we have no visibility into the configuration parameters the user has chosen to operate with. The Bayes Analytic software is provided without warranty on an As-Is, Where-Is basis. It is the customer's responsibility to test the software to ensure it meets their trading requirements.

Every time Bayes Analytic releases a new version of the software the customer should

conduct new tests to validate the new version continues to meet their requirements

because every software change could have unexpected side effects that may not be obvious

until the customer has tested it in their environment with their configurations. The Bayes Analytic software may run as a script inside of other software packages or talk to APIs that Bayes Analytic has no control of or influence over, so the customer should test the entire ecosystem to ensure it meets their trading requirements. Bayes Analytic may provide the software in source form since that is required by some trading systems, but it remains the exclusive copyrighted property of Bayes Analytic and may not be reverse engineered or redistributed. The customer is responsible for choosing their own broker and installing the Bayes Analytic software so it can trade using the desired account. Bayes Analytic has no control over or influence on the broker, and many brokers have different ways of quoting spreads, charging commissions, routing orders and delivering information, with different latencies. As such, a strategy and software that performs well at one broker may, and probably will, require changes to perform well at other brokers. It is the customer's responsibility to test the software with their selected broker to ensure it meets their trading requirements.
