Split testing is a way of comparing multiple versions of a web page to find out which one best
achieves the site owner's goals.
When a visitor arrives on the web page, they are shown one of the versions being tested, and the
performance of the page is tracked. Whichever version achieves the best performance by
the end of the test can then be used across the site, or be the starting version for a further
test.
Statistical Significance
When analysing the result of a split test, it is important to take into account that any
difference might just be down to chance.
            Visitors   Orders
Version A   100        5
Version B   100        4
.... might lead one to conclude that Version A was the best performer, but clearly the
difference between the two is small, and this could be down to chance alone.
Whereas:
            Visitors   Orders
Version A   100        17
Version B   100        5
... shows a much larger difference in performance, and the possibility of this being due to
chance alone is very small.
These are extreme examples, in which the role of chance is easy to estimate.
However, most results are far more complicated, and the role of chance far more difficult to
estimate without an adequate analysis tool.
This is provided by a formula which calculates the 'statistical significance' of the result. It is
a fairly complicated formula, taking into account the overall sample size and the respective
success rates, to give a 'confidence level'.
This formula is used in scientific and medical testing in order to give a 'confidence level' that
the result obtained was not just down to chance.
Typically, for a test to be classed as 'statistically significant', researchers expect a 'confidence
level' of 95%, i.e. that there is only a 5% probability that the result was down to chance
alone.
We needed to decide whether a 95% confidence level was appropriate for split testing, and
also to incorporate code to calculate whether this had been achieved.
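As an illustration of the calculation involved, the sketch below uses a standard two-proportion z-test. Python is used purely for illustration here; the project's own implementation was in PHP, and its exact formula may differ while being equivalent in purpose.

```python
from math import erf, sqrt

def confidence_level(visitors_a, orders_a, visitors_b, orders_b):
    """Confidence (0-1) that the difference between two versions
    is not down to chance alone (two-sided z-test)."""
    p_a = orders_a / visitors_a
    p_b = orders_b / visitors_b
    # Pooled success rate across both versions.
    pooled = (orders_a + orders_b) / (visitors_a + visitors_b)
    se = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    if se == 0:
        return 0.0
    z = abs(p_a - p_b) / se
    # P(|Z| < z) for a standard normal variable.
    return erf(z / sqrt(2))

# The two example tables from the text:
print(confidence_level(100, 5, 100, 4))   # well below 0.95: could be chance
print(confidence_level(100, 17, 100, 5))  # above 0.95: statistically significant
```

Note that this form of the formula already copes with unequal visitor counts between the two versions.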
A/B testing
This tests one page (usually the current page) against an alternative page.
The main advantage of a simple A/B test is speed, as there are only two options. We found
that a statistically relevant result was achieved far faster than with the other types of test.
A/B/C/D/?
The same as A/B, but testing more than two versions. As a result, we found these tests
took much longer to achieve a relevant result, and we observed no benefit over splitting the
same variables across multiple sequential A/B tests.
Multivariate Testing
For example, a test might vary the heading, sub-heading, main image and colour scheme.
Different versions of the main heading might perform better or worse in combination with
different versions of the subheading/image/colour scheme.
In most cases, this is the slowest of all the tests, as to reach a relevant result takes far
longer due to the increase in possible combinations.
We decided to work with the basic 'A/B testing', as it will give results sooner, and other
variations can then be tested by subsequent tests which are refined by the lessons learned
from the previous tests.
Obstacles
Speed
The methodology used by most split test software takes far too long for our purposes.
It needs far too many visitors to achieve statistical relevance, and this incurs high costs, both
in actual financial expenditure and in time lost.
Although the maths are fairly complicated, and the exact figures vary depending on many
other parameters, most tests need between 30 to 90 ‘successes’ before ‘statistical
relevance’ is achieved. To illustrate the problem, we will use the average figure of 60.
Whilst varying depending on the product or service being promoted, the percentage of
‘successes’ is often less than 1% of visitors to the site.
We also need to consider that we cannot expect to get 6,000 visitors in one day.
There are a finite number of people searching for the phrase, or phrases, chosen for each
test. For most of our clients, we might expect between 10 and 60 visitors a day. For the
purposes of demonstration, we will use the average of 35 visitors per working day (testing
does not usually take place at weekends, as most clients are B2B, and weekend searches are
more relevant to the B2C market, which would skew results).
60 ‘successes’ at a conversion rate of 1% will need 6,000 visitors to the site.
6,000 visitors at a cost of £0.50 each would mean a total cost for the test of £3,000.
6,000 visitors arriving at the rate of 35 per working day would take 171 working days.
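The figures above are straightforward to reproduce (Python used purely for illustration):

```python
# Working through the example figures from the text.
successes_needed = 60    # average 'successes' for statistical relevance
conversion_rate = 0.01   # 1% of visitors become 'successes'
cost_per_visitor = 0.50  # £ per paid visitor (illustrative figure from the text)
visitors_per_day = 35    # average arrivals per working day

visitors_needed = successes_needed / conversion_rate
total_cost = visitors_needed * cost_per_visitor
working_days = visitors_needed / visitors_per_day

print(visitors_needed)       # 6000.0
print(total_cost)            # 3000.0
print(round(working_days))   # 171
```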
NB. The outlay is not all ‘lost’, as the client might expect to convert some of these visitors
into revenue. It will be sub-optimal though, as poor conversion is usually the reason for the
test. In addition, during the split test we often need to pay a far higher price for the visitors
than would be profitable in the long term.
This example is based on a simple A/B split test; any of the other split test methods would
take considerably longer.
For reasons explained elsewhere, we need the tests to be quicker, and involve less outlay.
Testing
As one of the reasons for the project is the problem of site visitors being ‘expensive’ and
‘slow to obtain’, we need to ensure that we are able to test some of the statistical algorithms
without being hindered by the same issues.
This needs a ‘test rig’ able to simulate the arrival of real visitors.
In order to test the algorithm, we needed to write a program that could create a fake set of
1,000 site visitors with a specified bias towards one or other of the pages, and then feed the
data to the split test installation.

The split test software itself needed a detailed test log file routine that would record each
visitor in sequence, detail which page it decided to show, the criteria and reasoning it used
to reach that decision, and the data it then wrote to the permanent log file used when
analysing subsequent visitors.

We could then vary the order of the visitors in the fake data set and feed it to the split test
installation again. By repeating this many times and analysing the data in the test log files,
we could verify that the split test software was accurately predicting the best performing
page and reacting accordingly. Any errors observed were dealt with by fine tuning the
algorithm until we had consistently accurate results across a wide range of fake visitor sets.
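A minimal sketch of such a test rig is shown below. All names, the success rates, and the naive 50/50 decision rule are our own illustration standing in for the real split-test routine (the original tooling was written in PHP):

```python
import random

# Underlying 'true' success rates: version A is deliberately biased to be
# the better page, so a correct split test should identify it.
PAGE_RATES = {"A": 0.09, "B": 0.01}

def run_simulation(seed, visitors=1000):
    """Replay one fake visitor set against a naive 50/50 split-test."""
    random.seed(seed)
    stats = {page: {"shown": 0, "orders": 0} for page in PAGE_RATES}
    for _ in range(visitors):
        # A real installation would also record which page it chose, and the
        # reasoning behind that choice, in the detailed test log file.
        page = random.choice(["A", "B"])
        stats[page]["shown"] += 1
        if random.random() < PAGE_RATES[page]:
            stats[page]["orders"] += 1
    return stats

# Vary the 'order of visitors' by re-seeding, and check that the biased
# page comes out on top in (almost) every run.
wins = 0
for seed in range(50):
    stats = run_simulation(seed)
    if stats["A"]["orders"] > stats["B"]["orders"]:
        wins += 1
print(wins, "runs out of 50 correctly favoured page A")
```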
Selecting the Sample Group
In a scientific test, the sample group can be carefully selected in order to reduce the
influence of other variables, e.g. for a medical test, the target group can be selected by
general health, presence or absence of certain medical conditions, age, ethnicity etc.
For website visitors, this is almost impossible, but we needed to take what steps we could.
General visitors arriving on the site can come from different sources (e.g. search engine
results, links clicked on other sites, bookmarks from repeat visitors, links seen in adverts
(online and offline), and 'type-ins' of the web address from those who already know it).
Some of these would be inappropriate for our test, and would skew the data.
In order to gain some level of control over the sample group, we decided to only display the
test pages to visitors arriving from Google's paid adverts (AdWords). This system allows us to
only show the adverts to visitors who have searched for a precise phrase, thus achieving
more accurate targeting of the sample group that is likely to have most impact on profitability
for our clients.
This also adds to the cost of the test, and increases the need to achieve accurate results from
the smallest possible number of visitors.
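As one illustration of how such a restriction might be enforced: clicks on Google's adverts normally append a 'gclid' tracking parameter to the landing-page URL, so checking for it is one simple (if imperfect) filter. The function below is our own sketch, not the project's actual code:

```python
from urllib.parse import urlparse, parse_qs

def in_sample_group(landing_url):
    """True if the visitor appears to have arrived via a paid advert
    (i.e. the landing URL carries Google's 'gclid' tracking parameter)."""
    query = parse_qs(urlparse(landing_url).query)
    return "gclid" in query

print(in_sample_group("https://example.com/page?gclid=abc123"))  # True
print(in_sample_group("https://example.com/page"))               # False
```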
As mentioned above, the sample group will be visitors from Google paid adverts, and this
increases the time/cost as each visitor will cost an incremental sum. There are a finite
number of people searching for our target phrases each day, which increases the need to
get significant results from few visitors.
Longer tests can be less accurate.
Although a test that involves more visitors is usually more accurate, the passing of time
introduces more variables. For example, 'Seasonality' affects site visitor behaviour. A visitor
arriving on the site after searching for 'Christmas Presents' in September might be at a
different stage of the purchasing process than a similar visitor in early December.
Once we have decided on the target 'confidence level', we will also need to devise a method
to speed up the process.
We realised that if we could find, and measure, a more frequent site visitor activity that was a
reliable precursor to a ‘success’, we could speed up the process dramatically.
It is widely believed, although not officially confirmed, that Google uses ‘Bounce Rate’ as
one of the many factors used to decide on how well a web site should rank for a specific
search phrase. ‘Bounce Rate’ is the percentage of visitors who leave the site after viewing
only one page. We analysed the performance logs for our clients, and realised that,
whilst useful as an indicator of the likelihood of a ‘success’, using ‘Bounce Rate’ alone was
too crude a tool.
By analysing the server log files we could see that ‘successes’ were usually accompanied by
the site visitor spending longer reading the page at which they first arrived.
If we could combine ‘Bounce Rate’ with ‘time spent viewing the page’ we would have far
more relevant and ‘fine grained’ data for analysis.
Unfortunately, extracting this data from the server log files is a slow process that can only
be completed with any degree of accuracy if done by hand. We therefore needed a script to
record each visitor’s ‘time spent viewing the page’ automatically.
This script would then save the data to a plain text file for analysis by the main split-test
script.
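Once the script has written its plain text file, the combined analysis is straightforward. The log format below (visitor id, seconds on first page, bounced flag) is our own assumption for illustration; the project's own analysis was written in PHP:

```python
def summarise(log_lines):
    """Combine 'Bounce Rate' with average 'time spent viewing the page'
    from a simple comma-separated plain text log."""
    records = []
    for line in log_lines:
        visitor_id, seconds, bounced = line.strip().split(",")
        records.append((int(seconds), bounced == "1"))
    bounce_rate = sum(1 for _, b in records if b) / len(records)
    avg_time = sum(s for s, _ in records) / len(records)
    return bounce_rate, avg_time

# Hypothetical log: two visitors bounced quickly, two stayed and read.
log = ["v1,5,1", "v2,120,0", "v3,8,1", "v4,95,0"]
bounce_rate, avg_time = summarise(log)
print(bounce_rate)  # 0.5
print(avg_time)     # 57.0
```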
The ‘AIDA’ Principle
We needed a methodology for analysing the data and presenting it in a manner which could
be easily understood, and which would allow a decision to be made regarding the success of
a split-test project.
We decided to use the ‘AIDA’ principle, as this is widely used in marketing, and so would be
familiar to our clients.
The acronym ‘AIDA’ stands for ‘Attention’, ‘Interest’, ‘Desire’ and ‘Action’.
In short, these are the stages of customer acquisition - first get their ‘attention’, next attract
their ‘interest’, then provoke ‘desire’, and finally push them to ‘action’.
Although not a strictly accurate description of what is happening during the split-test, they
are useful and easily understood labels for our test results.
For example:
As each client site offers different products/services and has a different visitor demographic,
the ‘timings’ can be customised to match previously observed visitor behaviour for the site in
question.
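One possible way to express those per-site ‘timings’ is a set of time-on-page thresholds mapped to the AIDA stages. The thresholds and names below are purely illustrative and would be tuned from each client's observed visitor behaviour:

```python
# Illustrative per-site 'timings': time-on-page thresholds (in seconds)
# mapped to AIDA stages. These values are hypothetical examples only.
AIDA_TIMINGS = [
    (5,  "Attention"),  # stayed past 5 seconds: the page got their attention
    (30, "Interest"),   # still reading after 30 seconds
    (90, "Desire"),     # engaged long enough to be weighing up the offer
]

def aida_stage(seconds_on_page, converted=False):
    """Label a visit with the furthest AIDA stage it reached."""
    if converted:
        return "Action"  # an actual order/enquiry trumps any timing
    stage = "None"
    for threshold, label in AIDA_TIMINGS:
        if seconds_on_page >= threshold:
            stage = label
    return stage

print(aida_stage(45))                  # Interest
print(aida_stage(10, converted=True))  # Action
```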
Should the page performing less well on ‘time on page’ start to outperform on actual actions
(e.g. orders received), the display ratio is reset to 50:50.
This feature speeds up the test process considerably in instances where there is an
immediate and noticeable difference between the two versions. It also reduces any losses
that might be experienced during the test, as the poorer performing page is displayed less
often.
It also ‘self corrects’ rapidly should the perceived best performing page turn out to be a
chance based statistical error.
The ‘base level’, and speed of the ‘ramp up’ can be set in the config file on a test by test
basis.
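The ‘ramp up’ behaviour described above might be sketched as follows. The parameter names (‘base’, ‘step’, ‘cap’) are our own illustration of the configurable base level and ramp speed, not the project's actual config keys:

```python
def display_ratio(leader_score, trailer_score,
                  leader_orders, trailer_orders,
                  base=0.5, step=0.05, cap=0.9):
    """Fraction of visitors shown the currently leading page.

    'Scores' stand in for the early-indicator metric (e.g. time on page);
    'orders' are actual actions. All parameters are illustrative.
    """
    # Self-correct: if the trailing page wins on real actions, reset to 50:50.
    if trailer_orders > leader_orders:
        return 0.5
    # Ramp up by 'step' per unit of score advantage, from the base level,
    # capped so the weaker page is never starved of visitors entirely.
    advantage = max(0, leader_score - trailer_score)
    return min(cap, base + step * advantage)

print(display_ratio(8, 3, leader_orders=4, trailer_orders=1))  # 0.75
print(display_ratio(8, 3, leader_orders=1, trailer_orders=4))  # 0.5 (reset)
```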
In our case, the sample group size is potentially infinite (as we can leave the test running
indefinitely), and the number of visitors for each of the test pages will NOT be identical, due
to the ‘ramp up’ feature.
This required rewriting the normal relevance formula, and translating the new formula into
PHP code.
Software Interface
Key: