
First place in the Stay Alert! competition

inference (inference0@gmail.com)
March 2011

Introduction

The Stay Alert! competition from Ford [1] challenged competitors to predict whether a car driver was not alert, based on various measured features. The training data was broken into 500 trials; each trial consisted of a sequence of approximately 1200 measurements spaced by 0.1 seconds. Each measurement consisted of 30 features, presented in three sets: physiological (P1...P8), environmental (E1...E11) and vehicular (V1...V11). Each feature was presented as a real number. For each measurement we were also told whether the driver was alert or not at that time (a boolean label called IsAlert). No more information on the features was available. The test data consisted of 100 similar trials but with the IsAlert label hidden. 30% of this set was used for the leaderboard during the competition and 70% was reserved for the final leaderboard. Competitors were invited to submit a real-number prediction for each hidden IsAlert label; this prediction should be convertible to a boolean decision by comparison with a threshold. The assessment criterion used was the area under the curve (AUC) [2], where the curve is the receiver-operating characteristic (ROC) curve [2] in which the true-positive rate is plotted against the false-positive rate as this threshold is varied. An AUC value will typically vary between 0.5 (random guessing) and 1 (perfect prediction).
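For concreteness, the AUC can be computed directly from the ranks of the predictions (the Mann-Whitney formulation). Below is a minimal sketch in base R; the function and variable names are ours, not part of the competition material.

    # Minimal sketch: AUC via the rank (Mann-Whitney) formulation.
    # 'pred' holds real-valued predictions, 'label' the boolean IsAlert values.
    auc <- function(pred, label) {
      r <- rank(pred)                    # mid-ranks handle tied predictions
      n_pos <- sum(label)                # number of positive cases
      n_neg <- sum(!label)               # number of negative cases
      (sum(r[label]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    }

    # Random predictions score near 0.5; a perfect ordering scores 1.
    set.seed(1)
    auc(runif(1000), runif(1000) > 0.5)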

Observations on the data

As with any machine learning exercise, the first step is exploratory data analysis. We will not present all our observations but instead concentrate on the two key observations that went towards our solution.

Figure 1: A histogram of the mean alertness in each trial

Our first observation is that trials are not homogeneous. In particular, in most trials the driver is either mainly alert or mainly not alert (as shown in figure 1). This suggests that features should be aggregated over the duration of the trial; we considered the mean and standard deviation of each feature during the trial. We use the convention of prefixing a feature with m for the trial mean and sd for the trial standard deviation.

Our second observation is that a prediction technique based on all the features typically achieves a high AUC on the training set (or a held-back portion of the training set) but a poor AUC on the test set. We observed this with various models, and other competitors reported similar experiences on the competition forum. This observation suggests that we are working in the world of extrapolation: i.e. the training and test sets differ in some manner. If we are extrapolating then a simple model is usually required. It is also necessary to remember that extrapolation cannot be done by all machine learning algorithms; for example, random forests (whether doing classification or regression) will saturate beyond their training range. We will therefore work with linear models.
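The per-trial aggregation suggested by the first observation can be sketched in base R as follows. The file name fordTrain.csv and the column names TrialID, ObsNum and IsAlert are assumptions about the competition data layout.

    # Sketch of the per-trial aggregation (assumed column layout).
    train <- read.csv("fordTrain.csv")
    feats <- setdiff(names(train), c("TrialID", "ObsNum", "IsAlert"))

    trial_mean <- aggregate(train[feats], by = list(TrialID = train$TrialID), FUN = mean)
    trial_sd   <- aggregate(train[feats], by = list(TrialID = train$TrialID), FUN = sd)
    names(trial_mean)[-1] <- paste0("m",  feats)   # e.g. mE8, mV5
    names(trial_sd)[-1]   <- paste0("sd", feats)   # e.g. sdE5, sdV6

    trial_feats <- merge(trial_mean, trial_sd, by = "TrialID")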

We also note that the use of AUC as the assessment criterion makes extrapolation easier. When assessing AUC the only thing that matters is the ordering given to the test cases; in particular, we can predict values outside our training range. We also do not need to choose a threshold a priori.
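To illustrate the point, any strictly increasing transform of the predictions preserves their ordering and hence the AUC (using the auc sketch above):

    # A strictly increasing transform leaves the ranks, and so the AUC, unchanged.
    set.seed(2)
    p <- rnorm(100)
    lab <- runif(100) > 0.5
    auc(p, lab) == auc(exp(p), lab)   # TRUE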

Models

We are going to consider linear models. First, we investigated the per-trial mean and standard deviation features. We considered a linear model [3] predicting mean(IsAlert) as a linear function of the mean and standard deviation of the other features (this model has one case per trial). This model revealed a few strong features (mE8, mV5, sdE5 and sdV6) by inspection of the z-values in the linear model diagnostics. However, these features are not strong enough alone to produce a powerful prediction algorithm (for example, a submission based on these features alone achieved an AUC of 0.706812).

We therefore drilled down to consider each measurement, extending the features by including the above strong per-trial features. We considered logistic regression [4] of IsAlert as a linear function of this extended feature set. We conducted feature selection based on the diagnostics of the logistic regression to come down to three strong features (sdE5, V11 and E9).

We will admit to some luck at this point. We were training our models with 20% of the training trials and assessing each model based on the AUC obtained on the held-back portion of the training set. In figure 2 we show our two elements of luck. Firstly, there is little correlation between held-back training and test set AUCs (a 95% confidence interval on Pearson's product-moment correlation coefficient gives r = [-0.38, 0.24]); by chance, we obtained a model that gave a high AUC on the test set. Secondly, for a model with these features the AUC on the test set is higher than on the training set. This lucky model (our 10th submission) achieved an AUC of 0.861151. The model predictions were given by 410.6073 · sdE5 + 0.1494 · V11 + 4.4185 · E9.

In figure 2 we also show the AUC from two other ways of computing our logistic regression model with these features, based on the whole training set. We show the model obtained by a generalised-linear-model training approach and also the model obtained by numerically optimising the training-set AUC over the model parameters. It is perhaps interesting to note that this non-standard approach of numerical optimisation achieves a better AUC on both the training and test sets (but at the cost of a slower training phase and the loss of any probabilistic interpretation of our predictions). Our lucky submission still slightly beats this optimised-AUC approach.
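The modelling steps above can be sketched in base R with the functions the author names later (lm, glm, optim). The data frame names trial_data (one row per trial) and train2 (one row per measurement, extended with the strong per-trial features) are our own placeholders.

    # Step 1: trial-level linear model; strong features show large z/t-values.
    # 'trial_data' is assumed to hold mIsAlert plus the m*/sd* features.
    fit_lm <- lm(mIsAlert ~ ., data = trial_data)
    summary(fit_lm)                    # inspect the coefficient diagnostics

    # Step 2: per-measurement logistic regression on the selected features.
    fit_glm <- glm(IsAlert ~ sdE5 + V11 + E9, data = train2, family = binomial)

    # Step 3: numerical optimisation of the training-set AUC over the three
    # coefficients ('auc' as sketched earlier; the intercept is dropped since
    # it does not change the ordering of the predictions). Note the AUC is
    # piecewise constant in beta, so this optimisation is rough.
    X <- as.matrix(train2[, c("sdE5", "V11", "E9")])
    neg_auc <- function(beta) -auc(drop(X %*% beta), train2$IsAlert == 1)
    opt <- optim(coef(fit_glm)[-1], neg_auc)
    opt$par                            # coefficients maximising training AUC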

Figure 2: Comparing the AUC on the training set and the test set for linear models using sdE5, V11 and E9. The circles show a sample of models trained on a random 20% of training trials and validated on the remaining trials. The triangle shows our best submission. The crosses show two models trained and validated on the entire training set.

Figure 3: The ROC curve for our best submission and its online variant. A positive is declaring the driver is not alert.

In figure 3 we show the test-set ROC curve achieved by the best model. The steep vertical region is a good feature of the classifier: it can detect many periods when the driver is not alert while maintaining a low false-alarm rate. A low false-alarm rate feels like an important aspect of such a classifier, as otherwise a car driver may end up disabling the system due to the annoyance of false alarms. We are surprised that the competition organisers did not use an assessment criterion focusing on this region rather than the general assessment given by AUC. We also observe that these models all satisfy the initial desire of the competition organiser not to include any of the physiological features.
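For reference, the ROC curve itself can be traced in base R along these lines; pred holds the real-valued predictions and pos is TRUE where the driver is not alert, matching the convention of figure 3.

    # Sketch: ROC curve points by sweeping the decision threshold downwards.
    roc_points <- function(pred, pos) {
      o <- order(pred, decreasing = TRUE)
      tpr <- cumsum(pos[o]) / sum(pos)    # true-positive rate at each cut
      fpr <- cumsum(!pos[o]) / sum(!pos)  # false-positive rate at each cut
      data.frame(fpr = c(0, fpr), tpr = c(0, tpr))
    }

    # plot(roc_points(pred, pos), type = "l",
    #      xlab = "False-positive rate", ylab = "True-positive rate")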

Online model

Mid-way through the competition the organiser expressed a desire for the model to be usable in an online context, and so not to use any future observations. Our per-trial features use future observations in the computation of the mean and standard deviation. We therefore adjusted our per-trial features to use a running mean and standard deviation. Our best online model was given by 392.4317 · sdE5 + 0.2209 · V11 + 3.6544 · E9. The ROC curve of this model is also shown in figure 3. There is a slight drop in the AUC achieved (0.849245 versus 0.861151 for the previous model); however, the ROC curve shows that the online algorithm is not uniformly less powerful than the offline algorithm. We suggest that future competitions which require an online solution announce this requirement at the beginning and also structure their test dataset to prevent the use of future observations. We also note that AUC is an odd assessment criterion for an online algorithm, as the threshold is not chosen a priori.
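A minimal sketch of the running statistics, computed causally so that the value at each time step uses only the observations seen so far within the trial:

    # Running mean and standard deviation from cumulative sums.
    running_stats <- function(x) {
      n <- seq_along(x)
      m <- cumsum(x) / n                               # running mean
      v <- (cumsum(x^2) - n * m^2) / pmax(n - 1, 1)    # running sample variance
      list(mean = m, sd = sqrt(pmax(v, 0)))            # clamp rounding below zero
    }

    # e.g. a causal sdE5, assuming rows are time-ordered within each trial:
    # train$sdE5_online <- ave(train$E5, train$TrialID,
    #                          FUN = function(x) running_stats(x)$sd)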

Conclusion

We have shown that a simple application of logistic regression leads to good accuracy in predicting when a driver is not alert. Furthermore, only three features (two environmental and one vehicular) need to be measured to achieve this accuracy.

About the author

inference started in machine learning at an early age, including writing a review paper on neural networks while at school and applying neural networks during a summer job. inference obtained a PhD on Bayesian inference and is now doing machine learning for a large organisation, but occasionally plays with machine learning problems in their own time. inference is based in the UK. For this work R was used exclusively (mainly on Mac OS X, but also on Amazon EC2 when experimenting with some more memory-intensive models). All the models described here were trained with the functions lm, glm and optim.

References
[1] http://www.kaggle.com/stayalert
[2] http://en.wikipedia.org/wiki/Receiver_operating_characteristic
[3] http://en.wikipedia.org/wiki/Linear_model#Linear_regression_models
[4] http://en.wikipedia.org/wiki/Logistic_regression
