
What is Split Testing

Split testing is a way of comparing multiple versions of a web page to find out which one best
achieves the site owner's goals.

When a visitor arrives on the web page, they are shown one of the versions being tested, and the
performance of the page is tracked. Whichever version achieves the best performance by
the end of the test can then be used across the site, or be the starting version for a further
test.

Statistical Significance
When analysing the result of a split test, it is important to take into account that any
difference might just be down to chance.

For example, a test such as the following:

            Visitors   Orders
Version A   100        5
Version B   100        4

... might lead one to conclude that Version A was the better performer, but clearly the
difference between the two is small, and this could be down to chance alone.

Whereas:
            Visitors   Orders
Version A   100        17
Version B   100        5

... shows a much larger difference in performance, and the possibility of this being due to
chance alone is very small.

These two examples are extreme, and in both the role of chance is easy to estimate.
However, most results are far more complicated, and the role of chance far more difficult to
estimate without an adequate analysis tool.

This is provided by a formula which calculates the 'statistical significance' of the result. It
is a fairly complicated formula which takes into account the overall sample size and the
respective success rates to give us a 'confidence level'.

This formula is used in scientific and medical testing in order to give a 'confidence level' that
the result obtained was not just down to chance.
Typically, for a test to be classed as 'statistically significant', researchers would expect a
'confidence level' of 95%, i.e. that there was only a 5% probability that the result was down
to chance alone.

We needed to decide whether 95% confidence level was appropriate for split testing and
also incorporate code to calculate whether this was achieved.
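
As an illustration of how such a 'confidence level' can be calculated, the sketch below applies a pooled two-proportion z-test to the two example tables above. This is a standard textbook test and is an assumption on our part: it is not necessarily the formula used in the project, and the function name `confidence_level` is hypothetical. The normal approximation it relies on is rough for order counts as small as 4 or 5, so the output should be read as indicative only.

```python
import math


def confidence_level(visitors_a, orders_a, visitors_b, orders_b):
    """Two-sided confidence that the two conversion rates really differ.

    Uses a pooled two-proportion z-test (an assumed, textbook method).
    The two versions do not need the same number of visitors.
    """
    rate_a = orders_a / visitors_a
    rate_b = orders_b / visitors_b
    pooled = (orders_a + orders_b) / (visitors_a + visitors_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    if std_err == 0:
        return 0.0  # no orders at all (or every visitor ordered): nothing to compare
    z = abs(rate_a - rate_b) / std_err
    # For a standard normal variable Z, P(|Z| <= z) = erf(z / sqrt(2))
    return math.erf(z / math.sqrt(2))


# First example table: 5 orders vs 4 orders from 100 visitors each
print(confidence_level(100, 5, 100, 4))
# Second example table: 17 orders vs 5 orders from 100 visitors each
print(confidence_level(100, 17, 100, 5))
```

The first result falls far short of the 95% threshold, while the second exceeds it, which matches the informal reading of the two tables.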

Which Type of Split Testing to Use?


There are a few different types of split testing. We tested the following methods.

A/B testing
This tests one page (usually the current page) against an alternative page.
The main advantage of a simple A/B test is speed, as there are only two options. We found
that a statistically significant result was achieved far faster than with the other types of test.

A/B/C/D/?
The same as A/B, but this tests more than two versions. As a result, we found these tests
took much longer to achieve a relevant result. We observed no benefit from testing the
variables in a single A/B/C/D/? test rather than splitting them over multiple A/B tests.

Multivariate Testing (and the slightly more complicated 'Taguchi' testing)


This allows individual elements of a web page to be varied. This is a far more complicated
system and requires a sophisticated analysis algorithm to keep track of the interactions
between the different elements.

For example, a test might vary the heading, sub-heading, main image and colour scheme.
Different versions of the main heading might perform better or worse in combination with
different versions of the subheading/image/colour scheme.

In most cases, this is the slowest of all the tests, as reaching a relevant result takes far
longer due to the increase in possible combinations.
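
To give a feel for how quickly the combinations multiply, here is a minimal sketch. The number of versions per element is purely hypothetical (the original gives no figures), and it assumes every combination is tested, i.e. a full-factorial design.

```python
import math

# Hypothetical number of versions of each page element (illustrative only)
variants = {"heading": 3, "sub-heading": 3, "main image": 2, "colour scheme": 2}


def combination_count(variants_per_element):
    """Number of distinct pages if every version of every element is combined."""
    return math.prod(variants_per_element.values())


print(combination_count(variants))  # 3 * 3 * 2 * 2 = 36
```

Each of those 36 pages would need its own share of visitors, compared with just two pages in an A/B test.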

We decided to work with the basic 'A/B testing', as it will give results sooner, and other
variations can then be tested by subsequent tests which are refined by the lessons learned
from the previous tests.

Obstacles

Speed
The methodology used by most split test software takes far too long for our purposes.
It needs far too many visitors to achieve statistical relevance, and this incurs high costs both
in actual financial expenditure and in time lost.

Why Do Split Tests Take So Long?


In any split test analysis, the speed with which one reaches a statistically relevant result is
primarily determined by the number of ‘successes’. For most web sites, a ‘success’ is an
‘order placed’ or an ‘enquiry received’.

Although the maths is fairly complicated, and the exact figures vary depending on many
other parameters, most tests need between 30 and 90 ‘successes’ before ‘statistical
relevance’ is achieved. To illustrate the problem, we will use the average figure of 60.

Whilst varying depending on the product or service being promoted, the percentage of
‘successes’ is often less than 1% of visitors to the site.

As we need to use Google Adwords ‘pay-per-click’ (for reasons referenced elsewhere),
each visitor to the site has a cost, which starts at £0.05 but, depending on the
competitiveness of the specific market, can exceed £15.00. For the purposes of illustration,
we will use a conservative cost of £0.50 per visitor.

We also need to consider the fact that we cannot expect to get 6000 visitors in one day.
There is a finite number of people searching for the phrase, or phrases, chosen for each
test. For most of our clients, we might expect between 10 and 60 visitors a day. For the
purposes of demonstration, we will use the average of 35 visitors per working day (testing
does not usually take place at weekends as most clients are B2B, and weekend searches are
more relevant to the B2C market, which would skew results).

60 ‘successes’ at a conversion rate of 1% will need 6000 visitors to the site.
6000 visitors at a cost of £0.50 each would mean a total cost for the test of £3000.
6000 visitors arriving at the rate of 35 per working day would take 171 working days.

To summarise, this example test would take over 6 months and involve an outlay of £3000.
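
The arithmetic behind this example can be reproduced with a few lines of code, which also makes it easy to try other figures. This is only a sketch: the function name `estimate_split_test` and its parameter names are our own, and the inputs are the illustrative values used in the text (60 ‘successes’, 1% conversion, £0.50 per visitor, 35 visitors per working day).

```python
import math


def estimate_split_test(successes_needed, conversion_rate_percent,
                        cost_per_visitor, visitors_per_working_day):
    """Return (visitors needed, total cost, working days) for one split test."""
    visitors_needed = math.ceil(successes_needed * 100 / conversion_rate_percent)
    total_cost = visitors_needed * cost_per_visitor
    working_days = visitors_needed / visitors_per_working_day
    return visitors_needed, total_cost, working_days


visitors, cost, days = estimate_split_test(60, 1, 0.50, 35)
print(visitors)         # 6000 visitors
print(cost)             # 3000.0 (pounds)
print(math.floor(days)) # 171 full working days, plus part of another
```

At five working days a week, 171 working days is a little over 34 weeks.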

NB. The outlay is not all ‘lost’, as the client might expect to convert some of these visitors
into revenue. It will be sub-optimal though, as poor conversion is usually the reason for the
test. In addition, during the split test we often need to pay a far higher price for the visitors
than would be profitable in the long term.

This example is based on a simple A/B split test; any of the other split test methods would
take considerably longer.
For reasons explained elsewhere, we need the tests to be quicker, and involve less outlay.

Testing
As one of the reasons for the project is the problem of site visitors being ‘expensive’ and
‘slow to obtain’, we need to ensure that we are able to test some of the statistical algorithms
without being hindered by the same issues.

This needs a ‘test rig’ able to simulate the arrival of real visitors.

In order to test the algorithm we needed to write a program that could create a fake
set of 1000 site visitors with a specified bias for one or other of the pages, and then feed the
data to the split test installation. The split test software itself needed to have a detailed test
log file routine written that would record each visitor in sequence and detail which page it
decided to show, the criteria and reasoning that it used to reach that decision, and the data
which it then wrote to the permanent log file that would be used when analysing subsequent
visitors. We could then vary the order of the visitors in the fake data set and feed it to the
split test installation again. By repeating this many times and then analysing the data in the
test log files, we could verify that the split test software was accurately predicting the best
performing page and reacting accordingly. Any errors observed were dealt with by fine
tuning the algorithm until we had consistently accurate results across a wide range of fake
visitor sets.
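
A much simplified sketch of such a test rig is shown below. It is not the program described above: the function names are hypothetical, the fake visitor format is an assumption, and a plain 50/50 random split stands in for the real split test installation (which would also write the detailed test log). It does show the core idea of building a biased fake visitor set and replaying it in different orders.

```python
import random


def make_fake_visitors(count, success_rate_a, success_rate_b, seed):
    """Build fake visitors; each records whether they would 'succeed' on page A or B."""
    rng = random.Random(seed)
    return [
        {"A": rng.random() < success_rate_a, "B": rng.random() < success_rate_b}
        for _ in range(count)
    ]


def run_simple_split(visitors, seed):
    """Stand-in for the split test software: show A or B at random and tally results."""
    rng = random.Random(seed)
    tally = {"A": [0, 0], "B": [0, 0]}  # [visitors shown, successes]
    for visitor in visitors:
        version = rng.choice("AB")
        tally[version][0] += 1
        tally[version][1] += visitor[version]
    return tally


def count_b_wins(visitors, runs, seed):
    """Replay the same visitors in shuffled orders; count runs where B comes out ahead."""
    rng = random.Random(seed)
    b_wins = 0
    for run in range(runs):
        order = visitors[:]
        rng.shuffle(order)
        tally = run_simple_split(order, seed=run)
        rate = {v: (s / n if n else 0.0) for v, (n, s) in tally.items()}
        b_wins += rate["B"] > rate["A"]
    return b_wins


# 1000 fake visitors with a strong bias towards page B
fake_visitors = make_fake_visitors(1000, success_rate_a=0.02, success_rate_b=0.30, seed=1)
print(count_b_wins(fake_visitors, runs=20, seed=3), "of 20 runs favoured page B")
```

With a bias this strong, page B should come out ahead in every replay; a smaller bias makes the outcome depend far more on the order and on the split, which is what the real test log was designed to expose.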

Selecting the Sample Group
In a scientific test, the sample group can be carefully selected in order to reduce the
influence of other variables, e.g. for a medical test, the target group can be selected by
general health, presence or absence of certain medical conditions, age, ethnicity etc.

For website visitors, this is almost impossible, but we needed to take what steps we could.

General visitors arriving on the site can come from different sources, e.g. search engine
results, links clicked from other sites, bookmarks from repeat visitors, links seen in adverts
(online and offline) and 'type-ins' of the web address from those who already know it.

Some of those would be inappropriate for our test, and would skew the data.

In order to gain some level of control over the sample group, we decided to only display the
test pages to visitors arriving from Google's paid adverts (Adwords). This system allows us to
only show the adverts to visitors who have searched for a precise phrase, thus achieving
more accurate targeting of the sample group that is likely to have most impact on profitability
for our clients.

It also adds to the cost of the test, and increases the need to achieve accurate results from
the smallest possible number of visitors.

Getting Accurate Results from Fewer Visitors

Longer tests cost more money.


Typically a test will compare the current version of a page with a proposed alternative. If the
proposed alternative performs worse than the original page, then site revenue will be lower
during the test. If the new version performs better, then any delay in introducing it results in
lower than optimum revenue.

As mentioned above, the sample group will be visitors from Google paid adverts, and this
increases the time/cost as each visitor will cost an incremental sum. There are a finite
number of people searching for our target phrases each day, which increases the need to
get significant results from few visitors.

Longer tests can be less accurate.
Although a test that involves more visitors is usually more accurate, the passing of time
introduces more variables. For example, 'Seasonality' affects site visitor behaviour. A visitor
arriving on the site after searching for 'Christmas Presents' in September might be at a
different stage of the purchasing process than a similar visitor in early December.

Once we have decided on the target 'confidence level', we will also need to devise a method
to speed up the process.

Speeding up the Test


As mentioned previously, the speed of a test is dependent on the number of ‘successes’ (i.e.
orders received, enquiry forms completed, etc.), and as this is typically a very low percentage
of the site visitors, a test can take months before it yields satisfactory results.

We realised that if we could find, and measure, a more frequent site visitor activity that was a
reliable precursor to a ‘success’, we could speed up the process dramatically.

It is widely believed, although not officially confirmed, that Google uses ‘Bounce Rate’ as
one of the many factors used to decide on how well a web site should rank for a specific
search phrase. ‘Bounce Rate’ is the percentage of visitors who leave the site after viewing
only one page. We analysed the performance logs for our clients, and realised that,
whilst useful as an indicator of the likelihood of a ‘success’, ‘Bounce Rate’ alone was
too crude a tool.

By analysing the server log files we could see that ‘successes’ were usually accompanied by
the site visitor spending longer reading the page at which they first arrived.

If we could combine ‘Bounce Rate’ with ‘time spent viewing the page’ we would have far
more relevant and ‘fine grained’ data for analysis.

Unfortunately, extracting this data from the server log files is a slow process that can only
be completed with any degree of accuracy if done by hand.

As we needed this to be an automatic process, we needed to write JavaScript code which
would start a ‘timer’ and detect when the visitor left the page they arrived on by any method
other than a ‘click’ to another page on the client site.

This script would then save the data to a plain text file for analysis by the main split-test
script.

The ‘AIDA’ Principle
We needed a methodology for analysing the data and presenting it in a manner which could
be easily understood, and which would allow a decision to be made regarding the success of
a split-test project.

We decided to use the ‘AIDA’ principle, as this is widely used in marketing, and so would be
familiar to our clients.

The acronym ‘AIDA’ stands for ‘Attention’, ‘Interest’, ‘Desire’ and ‘Action’.

In short, these are the stages of customer acquisition - first get their ‘attention’, next attract
their ‘interest’, then provoke ‘desire’, and finally push them to ‘action’.

Although not a strictly accurate description of what is happening during the split-test, they
are useful and easily understood labels for our test results.

For example:

Attention = Visitors who stayed on the page longer than 5 seconds
Interest = Visitors who stayed on the page longer than 15 seconds
Desire = Visitors who stayed on the page longer than 30 seconds (or clicked to another
page on the client site)
Action = Visitors who placed an order or sent an enquiry (i.e. a ‘success’)

As each client site offers different products/services and has a different visitor demographic,
the ‘timings’ can be customised to match previously observed visitor behaviour for the site in
question.
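
The labelling rules in the example above can be sketched in code as follows. This is an illustration only: the class and function names are hypothetical, the thresholds default to the example timings of 5, 15 and 30 seconds, and each label is applied by its own rule exactly as listed, so that a visitor can carry several labels at once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Seconds on the landing page needed for each label (customisable per client site)."""
    attention: float = 5.0
    interest: float = 15.0
    desire: float = 30.0


def aida_labels(seconds_on_page, clicked_to_client_page=False, success=False,
                thresholds=Thresholds()):
    """Return the set of AIDA labels that apply to one visitor."""
    labels = set()
    if seconds_on_page > thresholds.attention:
        labels.add("Attention")
    if seconds_on_page > thresholds.interest:
        labels.add("Interest")
    # A click through to another page on the client site also counts as 'Desire'
    if seconds_on_page > thresholds.desire or clicked_to_client_page:
        labels.add("Desire")
    if success:
        labels.add("Action")
    return labels


print(sorted(aida_labels(20)))                # ['Attention', 'Interest']
print(sorted(aida_labels(40, success=True)))  # ['Action', 'Attention', 'Desire', 'Interest']
```

Passing a different `Thresholds` instance is one way the timings could be tuned to a particular site.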

‘Ramp Up’ Feature


The optional ‘ramp up’ feature is triggered when a pre-set base level of visitors has been
reached. It starts to show the best performing page (based on ‘time spent on page’) more
frequently than the poorer performing page. As the number of visitors increases, and
confidence in the accuracy increases, the ratio is increased exponentially up to a maximum
display ratio of 95% to 5% in favour of the better performing page.

Should the page performing less well on ‘time on page’ start to outperform on actual actions
(e.g. orders received), the display ratio is reset to 50% to 50%.

This feature speeds up the test process considerably in instances where there is an
immediate and noticeable difference between the two versions. It also reduces any losses
that might be experienced during the test as the poorer performing page is displayed less
often.

It also ‘self corrects’ rapidly should the perceived best performing page turn out to be a
chance based statistical error.

The ‘base level’, and speed of the ‘ramp up’ can be set in the config file on a test by test
basis.
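
The behaviour described in this section can be sketched roughly as follows. The exact growth curve is not given in the text, so the exponential formula here is an assumption, as are the function name `leader_share` and the parameter names `base_level` and `ramp_speed` (standing in for the two config file settings).

```python
import math


def leader_share(total_visitors, base_level, ramp_speed,
                 trailing_page_leads_on_orders=False, max_share=0.95):
    """Share of visitors to be shown the page leading on 'time spent on page'."""
    # Before the base level is reached, or after a reset, both pages are shown equally
    if total_visitors < base_level or trailing_page_leads_on_orders:
        return 0.5
    # Assumed curve: the 50% share grows exponentially with visitors past the base level
    share = 0.5 * math.exp(ramp_speed * (total_visitors - base_level))
    return min(share, max_share)  # never more than 95% to 5%


for n in (50, 100, 150, 1000):
    print(n, leader_share(n, base_level=100, ramp_speed=0.01))
# The reset case: the trailing page has started to win on actual orders
print(leader_share(1000, base_level=100, ramp_speed=0.01, trailing_page_leads_on_orders=True))
```

Each new visitor would then be shown the leading page with this probability and the other page otherwise.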

The Statistical Relevance Problem


The statistical relevance analysis is most commonly applied to a test with a sample group of
fixed size, split into two or more identically sized groups.

In our case, the sample group size is potentially infinite (as we can leave the test running
indefinitely), and the number of visitors for each of the test pages will NOT be identical due
to the ‘ramp up’ feature.

This required rewriting the normal relevance formula and translating the new formula
into PHP code.

Software Interface
Key:

a = timestamp        i = biasB       r = interestB
b = ip address       j = randcalc    s = desireA
c = version shown    k = totvisA     t = desireB
d = fakev            l = totvisB     u = ordersA
e = prebiasA         m = clicksA     v = ordersB
f = prebiasB                         w = cookie found
g = confidencemod                    x = testmode
h = biasA                            y = actualbrowser
