
Predicting road cycling ride duration: a statistical approach

by Konstantin Varbenov
2019

Abstract
The goal of this paper is to establish a simple yet statistically sound framework for predicting the
duration of road cycling activities. The motivation behind it is that far more detailed calculations
than those currently available online have become feasible, not least because of the huge amounts of
data gathered by sports tracking.
Several platforms currently offer such ride duration estimates, and their merits and limits are
discussed. The main body of this piece analyses a set of my own data, uncovering
physical mechanisms not covered by available algorithms, for example the effects of a constantly
changing gradient on real roads and the uneven power distribution during a ride.
By plotting speed against elevation data from real, recorded activities and fitting polynomials,
a sketch of a model is devised. The model takes gpx files, such as planned routes, and
calculates the time required between every two adjacent data points. The sum of these is the
predicted ride duration. The model reaches an accuracy of approximately 95%.
Finally, I discuss the limitations of this model and suggest directions for further improvement.

Introduction
A brief overview of available tools

Calculating the exact duration of a road cycling ride is very tricky, as far more variables are at
play than can reasonably be considered in a calculation. Nevertheless, several calculators are
available online, probably the most popular being the aptly named Bikecalculator[1]. It lets the
user input many variables, including basic ones such as distance, elevation and weight, but also more
subtle factors such as tire type or rider position (to account for a rough aerodynamic
approximation). One can then predict what power (in Watts) is required for a given speed under the
defined circumstances. When used correctly, this type of calculation can provide very accurate
results. A good example of such a use would be calculating ride duration on a climb, which I
have found1 to be very usable. However, it has several crucial limitations when it comes to
predicting duration over a long course, especially one featuring undulating terrain. The nature of
this calculation is a “simple” equilibrium of forces: the rolling, climbing and aerodynamic
resistances are iterated to equal the pedaling force. This implies that the riding is in a steady
state, i.e. the conditions don’t change for a sufficiently long period of time. On climbs, this is a
reasonable approximation, but not on whole rides. I will return to this later.
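The equilibrium-of-forces calculation described above can be sketched in a few lines: pedaling power is balanced against rolling, climbing and aerodynamic resistance, and the steady-state speed is found by bisection. All rider parameters below (mass, rolling resistance, CdA, air density) are illustrative assumptions of mine, not values taken from Bikecalculator.

```python
import math

def steady_state_speed(power_w, grade, mass_kg=85.0, crr=0.005,
                       cda=0.32, rho=1.226, g=9.81):
    """Find the speed (kph) at which pedaling power balances rolling,
    climbing and aerodynamic resistance. grade is a fraction (0.08 = 8%).
    Parameter defaults are illustrative guesses."""
    angle = math.atan(grade)
    # speed-independent force: rolling resistance plus gravity component
    f_const = (crr * math.cos(angle) + math.sin(angle)) * mass_kg * g

    def power_demand(v):  # v in m/s
        return f_const * v + 0.5 * rho * cda * v ** 3

    lo, hi = 0.0, 40.0  # bracket in m/s; demand(40) exceeds any rider's power
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if power_demand(mid) < power_w:
            lo = mid
        else:
            hi = mid
    return lo * 3.6  # convert m/s to km/h
```

Note how, for a moderate power input, such a model happily predicts very high descending speeds, which is exactly the steady-state assumption this paper argues against.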
Another (obvious) limitation is one already mentioned above: courses with complex profiles
cannot be practically modeled. Even if the whole route is replaced by its average distance,
grade etc., the result is inaccurate, especially when predicting times for waypoints arbitrarily within the
route: imagine a route where you climb one hill and then coast back, effectively a route with a positive
average gradient half as steep as the climb’s, with obvious implications if time is modeled
linearly.

Strava[2] is possibly the most widely used tool among cyclists, but it bases its estimates only on the
average speed a given athlete has sustained over the past 4 weeks, which is obviously grossly
inaccurate.

Another frequently used service, Komoot[3], has an algorithm that adjusts for local gradient. It
gives you the option to select one of several fitness levels, including for example “In good shape”
and even “Pro”. However, I have found it to be fairly conservative: on many occasions even
the fastest option predicts slower times than my actual rides. Still, it hints that customization is possible,
meaning that predictions can be tailored to fit every user’s fitness.

The only software I have seen that does a piecewise calculation based on gradient and user
fitness data is a German platform called Quäldich2 Tourenplaner[4]. It is an offline tool, and as far
as one can tell from the website, support for the program has been discontinued, probably due to the
rise of online platforms. The clever thing the program does is let its user input three values:
flat speed, maximum descending speed, and maximum climbing speed (in meters per hour, also
known as VAM). A curve (shown below) is then generated and used to calculate ride duration.

1 When comparing to data from a power meter. On climbs in the range of 30 minutes, accuracy is usually within 10 Watts.
2 Literally “torture yourself”

figure 1: Quäldich Tourenplaner’s speed/gradient curve

This approach seems reasonable; however, it has some inaccuracies, as discussed in the following
section. Also note how the input of 1200 m/h climbing speed isn’t actually reflected in the graph:
a quick way to check is to multiply speed × gradient × 10 to obtain VAM.
Neither the website nor the program’s documentation explicitly states how the calculation is
done, but I suspect it splits the course into sections, calculates the duration of each based on the curve,
and then sums them up. This is also the way I later address the problem.
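The VAM sanity check mentioned above is a one-liner, since speed in km/h times gradient in percent times 10 yields vertical meters per hour:

```python
def vam_m_per_h(speed_kmh, gradient_pct):
    """Vertical ascent rate: (km/h) * (%/100) * 1000 m/km = speed * gradient * 10."""
    return speed_kmh * gradient_pct * 10.0
```

For example, riding 12 kph up a 10% slope corresponds to a VAM of 1200 m/h, which is the value that should appear on the curve but doesn't.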

Methods
Building a model from data
The natural question one might ask is whether, and how well, these predictions actually fit real data. It is
somewhat puzzling that, given how ubiquitous ride data is and the relatively straightforward
structure of an approximative model, a personalized ride duration calculation is not an available
option on platforms such as Strava.
To get at least an idea of whether a simple speed/gradient curve such as the one shown in
figure 1 actually represents reality and can also be constructed from rider data, I analyzed two of
my road cycling rides. It is important to note at this point why exactly these two rides were
selected from all those I have recorded over the past few months. As any road cyclist knows, there
are many types of rides, each with quite specific characteristics: riding alone vs. in a group,
riding for a fast average speed vs. specific training (e.g., on a climb), etc. These have a huge
impact on duration calculation. To ensure the generated curve is as general as possible, I chose
solo rides of about 100 km in length, with no specific training intervals, ridden at a moderate
average speed. Accordingly, predicting group rides with this model will yield very conservative results.

Let me present the raw data in all its glory, with two remarks:

i) blue dots represent a 15s rolling average speed, whereas orange ones stand for the
current speed at any second
ii) the gradient is a 10s rolling average in both cases. Using gradient for periods of time
shorter than that results in unacceptable noise. This also has the added benefit that
speed at any given moment reflects the gradient prior to it, an aspect which will be
discussed later.

figure 2: speed against gradient for a 96km ride at an average speed of 30kph and 165 Watts 3

3 If you have any idea why the blue dots cluster in this hyperbola-like fashion, please contact me

It is immediately obvious that the predicted S-curve is not realistic. One might be tempted to
use linear regression given that image; however, this is problematic because there are many
more data points around the 0% line (riding on flats) than at the extreme ends of the
gradient spectrum, so a linear fit is likely to be skewed.

I decided it makes more sense to split the data into bins, each one percent wide, i.e. every
data point between -0.5 and 0.5% falls into the “0%” category. This levels the playing field,
meaning each gradient is weighted equally, even though the data at the extreme ends is
relatively scarce. The data then looks like this:

figure 3: grouped data; speed values represent “current speed”, i.e. the orange dots from figure 2
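The binning step can be sketched as follows: a minimal version assuming gradients in percent and numpy arrays (the half-open bin edges are an implementation choice of mine):

```python
import numpy as np

def bin_by_gradient(gradients_pct, speeds_kph, lo=-20, hi=20):
    """Group speed samples into 1%-wide gradient bins centred on integer
    gradients: everything in [-0.5, 0.5) lands in the '0%' bin, etc.
    Empty bins yield NaN so scarce extremes stay visible."""
    centres = np.arange(lo, hi + 1)
    means = []
    for c in centres:
        mask = (gradients_pct >= c - 0.5) & (gradients_pct < c + 0.5)
        means.append(speeds_kph[mask].mean() if mask.any() else np.nan)
    return centres, np.array(means)
```

Averaging per bin rather than fitting the raw cloud is what gives each gradient equal weight, regardless of how much time was spent there.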

Now the speed/gradient points could be plotted and a suitable model devised. I want once
again to present the graph first and explain what the chosen model looks like, before I
discuss its merits in the next section.

figure 4: linear and cubic fits on the data from figure 3

figure 5: linear regression based on the data points from figure 3

It is obvious that the linear fit, despite an R2 value of 0.971, doesn’t quite capture the
nature of the curve. Another method is thus required for a more realistic fit.
Across multiple datasets, I observed a characteristic S-shape in the region between -5 and 5%.
Note that this is not the same S as in figure 1; it curves in the other direction.
I therefore decided to model that section as a third-order polynomial.

figure 6: cubic regression [5] on the middle range of the data
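Such a cubic fit is a one-call affair in, e.g., Python. The gradient/speed pairs below are illustrative stand-ins for the binned values of figure 3, not my actual data:

```python
import numpy as np

# Illustrative gradient (%) / speed (kph) pairs standing in for the
# binned middle range of figure 3 -- the real model uses recorded data.
grad = np.arange(-5.0, 6.0)          # -5, -4, ..., 5
speed = np.array([39.0, 37.5, 35.5, 33.5, 31.5, 30.0,
                  28.5, 26.5, 24.0, 21.5, 19.0])

coeffs = np.polyfit(grad, speed, 3)  # third-order least-squares fit
cubic = np.poly1d(coeffs)            # callable polynomial
```

The least-squares fit smooths the bin means while preserving the gentle S-shape around 0%.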

This fit (see figure 4) is much more appealing and it also makes more sense, as I will point
out in the next section.

To model the “sides”, i.e. the regions from -20% to -5% and 5% to 20%4, I used two separate
quadratic polynomials with the following boundary conditions: the values at the intersection points
(-5% and 5%, respectively) must match, the speeds at -20% and 20% are set to 60 kph and
6 kph5, and the derivatives at these points are set to zero [6].
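Solving for such a quadratic amounts to a small linear system in its three coefficients. A sketch, with an assumed join speed of 19 kph at 5% (in practice this value would come from the cubic):

```python
import numpy as np

def side_quadratic(x_join, v_join, x_end, v_end):
    """Coefficients (a, b, c) of q(x) = a*x^2 + b*x + c subject to
    q(x_join) = v_join, q(x_end) = v_end and q'(x_end) = 0."""
    A = np.array([[x_join**2, x_join, 1.0],
                  [x_end**2,  x_end,  1.0],
                  [2.0*x_end, 1.0,    0.0]])
    return np.linalg.solve(A, np.array([v_join, v_end, 0.0]))

# climbing side: assumed 19 kph at the 5% join, 6 kph with zero slope at 20%
a, b, c = side_quadratic(5.0, 19.0, 20.0, 6.0)
```

The zero-derivative condition at ±20% is what makes the curve level off rather than keep falling (or rising) past the modeled range.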

At this point, I should note that there are surely better ways to fit these regions. Possible
options would be higher-order polynomials6 or even exponential functions, as speed starts to
saturate towards very steep gradients. However, I stuck to simple fitting techniques, as this is
more of a feasibility study than a large-scale data-fitting effort aiming at exact models.

4 The value of 20% is somewhat arbitrary. The model will still work beyond that range, albeit with reduced accuracy.
5 These values are also not derived from data.
6 Then a smooth transition between regions would be possible, if f’ and f’’ were required to match at the intersections.

This is the result, built up of three separate functions:

figure 7: piecewise modelling of the data

Of course, one could suspect some overfitting, but for the purpose of the experiment, I
consider the model general enough.
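The three-piece construction could look like this in code. The middle cubic is fitted to illustrative points rather than the real binned data, and the side quadratics reuse the boundary conditions from the previous section:

```python
import numpy as np

def make_speed_model():
    """Assemble a three-piece speed(gradient) curve in the spirit of
    figure 7: a cubic between -5% and 5%, flanked by two quadratics
    with matching values at the joins, fixed speeds at -20%/20% and
    zero slope there. The cubic's anchor points are illustrative."""
    cubic = np.poly1d(np.polyfit([-5.0, -2.0, 0.0, 2.0, 5.0],
                                 [39.0, 34.0, 30.0, 26.0, 19.0], 3))

    def side_quad(x_join, v_join, x_end, v_end):
        # a*x^2 + b*x + c with q(x_join)=v_join, q(x_end)=v_end, q'(x_end)=0
        A = np.array([[x_join**2, x_join, 1.0],
                      [x_end**2,  x_end,  1.0],
                      [2.0*x_end, 1.0,    0.0]])
        return np.poly1d(np.linalg.solve(A, np.array([v_join, v_end, 0.0])))

    left = side_quad(-5.0, float(cubic(-5.0)), -20.0, 60.0)   # descents
    right = side_quad(5.0, float(cubic(5.0)), 20.0, 6.0)      # climbs

    def speed_kph(gradient_pct):
        if gradient_pct < -5.0:
            return float(left(gradient_pct))
        if gradient_pct > 5.0:
            return float(right(gradient_pct))
        return float(cubic(gradient_pct))

    return speed_kph
```

Continuity at the ±5% joins holds by construction, since each quadratic is solved to match the cubic's value there.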

Discussion
Model verification, merits and limitations
In the previous sections, I stressed how current calculations don’t capture some important
phenomena, which render their results inaccurate. Let me go into some more detail and show
how some of these issues can be avoided.

Issue 1: steady state and the S-curve between -5% and 5%


One of the first observations I made when plotting the data from figure 2 was how relatively
flat the trend in the vicinity of 0% was. Simulations would suggest a much more dramatic
difference between, say, 0% and -2%. However, as I already stated above, such predictions are
based on the assumption that at any point, the rider is in a steady-state equilibrium. This
would imply that he/she has ridden for long enough on the current gradient to reach a sort of
terminal velocity. The data from figure 4 suggests this is statistically not the case.
Two reasons come to mind that could explain the phenomenon. Firstly, on real roads,
gradient changes all the time, so momentum plays a huge role when determining current
speed. Imagine hitting a 1% gradient after a fast, flat approach. If it is short enough, your
speed will remain almost unaffected. Conversely, if the road dips briefly, you won’t be able to
accelerate quickly enough to use the maximal potential of the downhill section. This ensures a
certain stability around the 0% mark.

Also, the data seems to suggest that long stretches of low gradients (1% to 3%) are rare, so most
of the time a rider spends on such inclines occurs when transitioning from
flats to steeper climbs or vice versa, which further increases the role of momentum.

The cubic fit in the central section of the graph shown in figure 4 accounts very well for these
phenomena, with a maximum error of about 0.5 kph.

Issue 2: steady state and steep descents


Something you might find puzzling is how subjectively low the measured speed on steep
descents (steeper than -5%) seems, especially compared to the values predicted by an
equilibrium of forces. This effect arises because at high speeds many more impeding
factors affect the rider: primarily curves, but also road surface, traffic, aerodynamic position
and the time required to get back up to speed after every braking phase. Of course, many riders,
especially under racing conditions, will descend much faster, but the model is individual anyway,
so this can be accounted for.

Issue 3: Assumption for equal power

Another possible issue when using conventional calculators is knowing what power to
assume for any given ride section. Consider the curve that the already mentioned Bikecalculator
predicts when given the same ride’s average power as input.

figure 8: predicted speed at 165 watts

We notice the same sigmoid-like behavior, which does not represent reality. This is partly due
to the reasons listed above, but another important role is played by the decidedly non-uniform
power distribution over gradients, as seen below:

figure 9: power plotted over gradient

Not only is most of the time on (steep) descents spent coasting, but there is also a tendency
towards higher power output on steep climbs, which has been the subject of much research [7], [8].
For riders with dominant slow-twitch fibers, the distribution would resemble the graph
above. Some athletes, however, produce their maximum power on the flats.

The beauty of a statistically derived curve is that it automatically accounts for the physiology of any
given rider.

Model verification
Of course, any model must be tested on new data. I used a ride with very similar average
power (within 2 Watts) and average speed (within 0.5 kph) to ensure the model is appropriate, for the
reasons discussed in the previous section. I also ran the model on the training data itself to
check for consistency as well as overfitting.

figure 10: model validation

The data was structured as shown above. Columns E and F refer to the simple linear model,
whereas columns G and H refer to the more accurate, combined one. Each row of predicted time
takes the distance covered since the last row and divides it by the predicted speed,
giving a predicted time between the two points. The sum of these segments is then
an estimate of the total duration.
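The per-segment summation described above can be sketched as follows, assuming cumulative distances in meters and any speed(gradient) function, such as the piecewise one:

```python
def predict_duration_s(cum_dist_m, gradients_pct, speed_model_kph):
    """Piecewise duration estimate: for every pair of adjacent track
    points, divide the distance covered by the speed the model predicts
    for the local gradient, then sum the segment times (seconds)."""
    total_s = 0.0
    for i in range(1, len(cum_dist_m)):
        segment_m = cum_dist_m[i] - cum_dist_m[i - 1]
        v_ms = speed_model_kph(gradients_pct[i]) / 3.6  # kph -> m/s
        total_s += segment_m / v_ms
    return total_s
```

Because each segment is timed against its local gradient, intermediate waypoint estimates fall out for free: the running total at point i is the predicted time to reach that point.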

                 Actual time   Linear model    Combined cubic model
                               prediction      prediction (accuracy)
Training data    3h 14’        3h 33’          3h 20’ (97%)
Testing data     2h 21’        2h 13’          2h 28’ (95%)

figure 11: model accuracy

As you can see, the absolute accuracy is not significantly better than that of the linear model. However,
the combined model is much more realistic. For example, at very steep gradients7, the linear one
predicts negative speeds and breaks the local time calculation.

7 Using a rounded version of the equation, 30 − 2x, anything over 15% will lead to negative speeds.

Conclusion
Strengths and weaknesses
The proposed model has the following strengths compared to available calculators:

i) it accounts for individual fitness and physiology
ii) it can be trained on different scenarios, thus covering a broad range of riding situations
iii) being statistically (as opposed to analytically) derived, it implicitly accounts for power
unevenness and quick gradient changes
iv) any error accumulates slowly, so estimates at any point during the ride remain sufficiently
accurate

Limitations remain:

i) no exact physical framework, thus potentially inaccurate
ii) dependence on the ride scenario

The effects of both limitations can be reduced, if not entirely overcome, by some of these
suggested methods:

i) using a more sophisticated curve-fitting tool to ensure more plausible physical modelling
ii) analyzing much more data, covering all relevant scenarios a rider typically encounters, and
asking the rider before each ride how he/she plans to ride, i.e. in a group or not, at maximum
speed or not, etc.

Appendix
Preparing gpx files
For anyone interested in carrying out similar experiments, here is a brief overview of how standard
Strava gpx data was converted into a workable Excel table.

First, I exported the gpx file from the activity’s page on Strava. Of course, any other platform will
do, and many offer different formats which Excel can open. However, some might not include speed
and/or distance data, which is crucial for any analysis.

figure 12: native Strava gpx files

To add the much-needed fields, I used a free program called GPSBabel [9], [10]. It allows you to
manipulate gpx files and add this information.

figure 13: adding speed data

After this manipulation, the data looks like this:

figure 14: speed (in m/s, last column) added to the data

From this point on, standard Excel operations are possible.
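For readers who prefer to skip Excel, the core trackpoint fields can also be read directly with Python's standard library. A sketch for GPX 1.1 files (extension fields for speed vary by exporter and are deliberately not handled here):

```python
import xml.etree.ElementTree as ET

GPX_NS = "{http://www.topografix.com/GPX/1/1}"

def read_trackpoints(path):
    """Extract (lat, lon, ele) tuples from a GPX 1.1 file. Speed and
    distance extension fields differ between exporters and are ignored;
    elevation is None when a point carries no <ele> element."""
    root = ET.parse(path).getroot()
    points = []
    for tp in root.iter(GPX_NS + "trkpt"):
        ele_el = tp.find(GPX_NS + "ele")
        points.append((float(tp.get("lat")),
                       float(tp.get("lon")),
                       float(ele_el.text) if ele_el is not None else None))
    return points
```

From latitude/longitude pairs, distances (and thus gradients) can then be computed with any great-circle formula.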

Deriving an analytical curve with Bikecalculator


This is as simple as visiting the website and entering your information, then iterating through
every gradient and plotting the data, again in Excel.

figure 15: fetching data from Bikecalculator

References
[1] http://bikecalculator.com/

[2] https://www.strava.com/

[3] https://www.komoot.de/

[4] http://tourenplaner.quaeldich.de/

[5] http://www.xuru.org/rt/PR.asp#CopyPaste

[6] https://www.wolframalpha.com/

[7] https://cyclingtips.com/2013/09/climbing-and-time-trialling-how-power-outputs-are-
affected/

[8] https://www.youtube.com/watch?v=I096TxTosL4

[9] https://www.gpsbabel.org/index.html

[10] https://gis.stackexchange.com/questions/202455/how-to-extract-the-speed-from-a-gpx-file

For any questions, ideas, or suggestions, don’t hesitate to contact me at
konstantin.varbenov@gmail.com

Let’s hope we figure something out together.

Konstantin Varbenov

Sofia,

1.7.2019
