c.dubarry@criteo.com
Criteo
32, rue Blanche
Paris, France
Abstract
AB-testing is a very popular technique in web companies since it makes it possible to accurately measure the impact of a modification with the simplicity of a random split across users. One of the critical aspects of an AB-test is its duration, and it is important to reliably compute the confidence intervals associated with the metric of interest to know when to stop the test. In this paper, we define a clean mathematical framework to model the AB-test process. We then propose three algorithms, based on bootstrapping and on the central limit theorem, to compute reliable confidence intervals which extend to other metrics than the common probabilities of success. They apply to both absolute and relative increments of the most used comparison metrics, including the number of occurrences of a particular event and a click-through rate, which involves a ratio.
Keywords: AB-test, Confidence interval, Central limit theorem, Ratio of normal variables, Bootstrapping
1. Introduction
Evaluating complex web systems and their impact on user behavior is a challenge of growing
importance. Data-driven tools have become very popular in the last decades to help in deciding which algorithm, which website home page, which user interface, etc., provides the best
results in terms of some relevant criteria such as the generated revenue, the click-through
rate (CTR), the number of visits, or any other business metric. A detailed description of
the general data-driven paradigm is available in Darema (2004).
Different experimentation methods are available (see Kaushik, 2006, for a primer), and AB-testing, aka split or bucket testing, is widespread. For examples and best practices, we
refer the reader to Crook et al. (2009); Kohavi et al. (2009, 2012) and references therein.
This method compares two versions, A and B, of a system by splitting the users randomly
into two independent populations to which systems A and B are respectively applied. We
use the word system in a broad sense here as it can range from being the design of a web
page (Swanson, 2011) to more complex algorithms such as a bidder on a real time bidding
ad server (Zhang et al., 2014). Relevant metrics are then computed on each population and
compared to decide which system performs better.
Such comparisons rely on statistical tests to evaluate their significance, see for example Crocker and Algina (1986); Keppel (1991), among which Z-tests assess whether the null hypothesis can be rejected or not at a fixed level of certainty. The simplest example is the one measuring a click-through rate, or any other rate that can only take binary values.
© 2015 Cyrille Dubarry.
The click-through rate can be written as the empirical average of Bernoulli random variables equal to 1 if the user has clicked and to 0 otherwise. Then, the central limit theorem provides confidence intervals for both the click-through rate in each population and its absolute increment between the two populations (see Amazon, 2010, for an example). In this case, the asymptotic variance is directly derived from the estimated click-through rate $p$ as $p(1-p)/n$ where $n$ is the number of users.
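Under the Bernoulli model, this interval takes a few lines to compute. A minimal sketch (the click and user counts are illustrative, not taken from any dataset in the paper):

```python
import math

def wald_ci(clicks, n, z=1.96):
    """Confidence interval for a Bernoulli rate, using the asymptotic
    variance p * (1 - p) / n from the central limit theorem."""
    p = clicks / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

lo, hi = wald_ci(clicks=450, n=10_000)  # estimated rate of 4.5%
```

As the next paragraph explains, this formula is only valid when each user contributes a single 0/1 outcome.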
In practice, a user might click several times. Then the random variables that are averaged no longer follow a Bernoulli law and the asymptotic variance cannot be computed in the same way. We show that using such an approximation can even
be dangerous through a numerical application to CTR. As stated in Kohavi et al. (2009),
we need to use the variance of the number of clicks per user. They also provide confidence
intervals for their relative increment using an approximation for the ratio adapted from
Willan and Briggs (2006) but estimators for the involved variances are not provided for
non Bernoulli random variables. Furthermore, these confidence intervals do not take into
account the randomness of the number of displays made to users.
The literature lacks a formal modeling of the AB-test process. Previous works such as
Crook et al. (2009); Kohavi et al. (2009, 2012) mainly focus on applications of this method
and do not provide a well-defined statistical framework for the results analysis. Most
available sources for the practitioner are online calculators only dedicated to the Bernoulli
case. A primer of the underlying theory applied to AB-test analysis is only given in online
references such as Amazon (2010) but they do not go deeply into the statistical modeling
and do not cover more general metrics than simple sums of independent Bernoulli random
variables. In this paper, we introduce a formal framework for the AB-test process modeling
only involving assumptions consistent with the data-driven paradigm. It allows us to prove
some statistical properties of the involved estimators, including those based on ratios, and
to get numerical methods to approximate the variances involved in the related central limit
theorems. We also go beyond that by justifying the use of the bootstrap algorithm (Efron
and Tibshirani, 1993) to compute confidence intervals for absolute and relative increments.
The mathematical formalization of the AB-test framework is given in Section 2. In
Section 3, we provide exact asymptotic confidence intervals for any kind of metric that is
obtained by summing quantities over the users, and for any metric computed as the ratio
of such sums. We also get exact asymptotic confidence intervals for both their absolute
and relative increments under a few assumptions, most of them directly related to the AB-test process. Explicit estimators for the related asymptotic variances are provided. We additionally show how to use bootstrapping to get confidence intervals when the data cannot
be grouped by user, as is commonly the case in the big-data field. Section 4 numerically
validates our assumptions and the proposed algorithms, while Appendices A and B give
formal proofs of the technical results of Section 3.
More precisely, let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and $\mathbb{E}[\cdot]$ the expectation operator under $\mathbb{P}$. We define a sequence of random vectors on $\mathbb{R}^4 \times \{0,1\}^2$:
$$(X_i^A, Y_i^A, X_i^B, Y_i^B, \varepsilon_i^A, \varepsilon_i^B)_{i \ge 1} .$$
For each user $i \ge 1$, $\varepsilon_i^A$ and $\varepsilon_i^B$ indicate the population that has been selected for this user: $\varepsilon_i^A = 1$ (resp. $\varepsilon_i^B = 1$) if and only if the user $i$ is in population $A$ of size ratio $\alpha_A \in [0,1]$ (resp. $B$ of size ratio $\alpha_B \in [0,1]$). Note that in general we will have $\alpha_A + \alpha_B = 1$, but this is not required and our analysis also applies to tests involving more than two populations. The other variables model metrics of interest for the AB-tester. $X_i^A$ and $X_i^B$ are the same metric generated by the user $i$ if systems $A$ and $B$ respectively were applied to him. The same stands for $Y_i^A$ and $Y_i^B$, which model another metric.
Example 1 (Comparison of revenue) When the AB-tester wants to compare the revenue generated by algorithms $A$ and $B$, he compares the total revenue of each population, normalized by its ratio. These quantities can be written
$$\frac{1}{\alpha_A} \sum_{i \,|\, \varepsilon_i^A = 1} X_i^A \quad \text{and} \quad \frac{1}{\alpha_B} \sum_{i \,|\, \varepsilon_i^B = 1} X_i^B ,$$
if $X_i^A$ and $X_i^B$ are the revenues generated by user $i$ under systems $A$ and $B$ respectively. Note that, in practice, we can also normalize the total revenues by the real population sizes instead of their ratios, and the quantities to compare become
$$\frac{\sum_{i \,|\, \varepsilon_i^A = 1} X_i^A}{\sum_{i \,|\, \varepsilon_i^A = 1} 1} \quad \text{and} \quad \frac{\sum_{i \,|\, \varepsilon_i^B = 1} X_i^B}{\sum_{i \,|\, \varepsilon_i^B = 1} 1} .$$
Example 2 (Comparison of CTR) When the AB-tester wants to compare the CTR generated by algorithms $A$ and $B$, he compares the CTR of each population. These quantities can be written
$$\frac{\sum_{i \,|\, \varepsilon_i^A = 1} X_i^A}{\sum_{i \,|\, \varepsilon_i^A = 1} Y_i^A} \quad \text{and} \quad \frac{\sum_{i \,|\, \varepsilon_i^B = 1} X_i^B}{\sum_{i \,|\, \varepsilon_i^B = 1} Y_i^B} ,$$
if $X_i^A$ and $X_i^B$ are the clicks generated by user $i$, and $Y_i^A$ and $Y_i^B$ the number of displays shown to the same user under systems $A$ and $B$ respectively.
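These two CTR ratios can be computed directly from a random split. A simulation sketch, where the 4% click probability, the display distribution and the even split are assumptions made for illustration:

```python
import random

random.seed(0)

alpha_a = 0.5   # size ratio of population A (an even split is assumed)
n = 10_000      # number of users

sum_x_a = sum_y_a = sum_x_b = sum_y_b = 0
for _ in range(n):
    displays = 1 + random.randrange(5)                             # Y_i
    clicks = sum(random.random() < 0.04 for _ in range(displays))  # X_i
    if random.random() < alpha_a:   # epsilon_i^A = 1
        sum_x_a += clicks
        sum_y_a += displays
    else:                           # epsilon_i^B = 1
        sum_x_b += clicks
        sum_y_b += displays

ctr_a = sum_x_a / sum_y_a
ctr_b = sum_x_b / sum_y_b
```

Both ratios estimate the same 4% rate here, since the two systems are identical (a blank AB-test).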
We introduce the following assumptions, which are easily satisfied in an AB-test setting.

A1 The random vectors $(X_i^A, Y_i^A, X_i^B, Y_i^B, \varepsilon_i^A, \varepsilon_i^B)_{i \ge 1}$ are independent and identically distributed.

A2 The random vectors $(X_1^A, Y_1^A, X_1^B, Y_1^B)$ and $(\varepsilon_1^A, \varepsilon_1^B)$ are independent.

A3 The random variables $(X_1^A, Y_1^A, X_1^B, Y_1^B)$ are $L^2$-integrable and we define
$$m_{X^A} \stackrel{\mathrm{def}}{=} \mathbb{E}\left[X_1^A\right], \quad m_{Y^A} \stackrel{\mathrm{def}}{=} \mathbb{E}\left[Y_1^A\right], \quad m_{X^B} \stackrel{\mathrm{def}}{=} \mathbb{E}\left[X_1^B\right], \quad m_{Y^B} \stackrel{\mathrm{def}}{=} \mathbb{E}\left[Y_1^B\right], \tag{1}$$
$$\sigma_{X^A}^2 \stackrel{\mathrm{def}}{=} \mathrm{Var}\left(X_1^A\right), \quad \sigma_{Y^A}^2 \stackrel{\mathrm{def}}{=} \mathrm{Var}\left(Y_1^A\right), \quad \sigma_{X^B}^2 \stackrel{\mathrm{def}}{=} \mathrm{Var}\left(X_1^B\right), \quad \sigma_{Y^B}^2 \stackrel{\mathrm{def}}{=} \mathrm{Var}\left(Y_1^B\right), \tag{2}$$
$$\rho_{X^A Y^A} \stackrel{\mathrm{def}}{=} \mathrm{Cov}\left(X_1^A, Y_1^A\right), \quad \rho_{X^B Y^B} \stackrel{\mathrm{def}}{=} \mathrm{Cov}\left(X_1^B, Y_1^B\right). \tag{3}$$

A4 The random variables $(X_1^A, Y_1^A, X_1^B, Y_1^B)$ are almost surely non-negative and not almost surely zero, that is
$$\mathbb{P}\left\{X_1^A < 0\right\} = 0, \quad \mathbb{P}\left\{Y_1^A < 0\right\} = 0, \quad \mathbb{P}\left\{X_1^B < 0\right\} = 0, \quad \mathbb{P}\left\{Y_1^B < 0\right\} = 0,$$
$$\mathbb{P}\left\{X_1^A > 0\right\} > 0, \quad \mathbb{P}\left\{Y_1^A > 0\right\} > 0, \quad \mathbb{P}\left\{X_1^B > 0\right\} > 0, \quad \mathbb{P}\left\{Y_1^B > 0\right\} > 0.$$

A5 The random variables $\varepsilon_1^A, \varepsilon_1^B$ satisfy:
1. $\varepsilon_1^A$ and $\varepsilon_1^B$ follow Bernoulli laws of respective parameters $\alpha_A$ and $\alpha_B$.
2. $\varepsilon_1^A \varepsilon_1^B = 0$.
A user can only be assigned to one population, which is ensured by Assumption A5-2. Assumption A5-1 sets the ratios of populations A and B to be respectively $\alpha_A$ and $\alpha_B$. Assumption A2 reflects the fact that the population attribution process does not affect the user reaction to the applied system, while Assumption A3 is purely technical. It is the only assumption that is not implied by the AB-test process, but it guarantees the convergence of the estimators. Assumption A4 is consistent with the metrics that we are studying. They will typically be zero with a high probability and positive otherwise (for example, the number of clicks).

Finally, Assumption A1 models the un-identifiability of the users. They are all independent and, without prior knowledge, identically distributed. The whole AB-test process relies on this assumption by randomly splitting the users into two populations.
It is worthwhile to note that the metrics of interest (XiA , YiA , XiB , YiB )i 1 are defined for
each user and for each system, independently of the population split. The AB-test process
will give access to only XiA or XiB for a given user i, but they can still both be defined
even when they are not observed. This is the main interest of this modeling that allows us
to write those variables independently of the population. Furthermore, we circumvent the
issue of having hidden variables by introducing a new set of variables that will always be
observed. To that purpose, we simply set XiA to 0 when it is not observed, i.e. when the
user i is not in population A. This is formalized in the following definition.
Definition 1 For each user $i \ge 1$, we define
$$\widetilde{X}_i^A \stackrel{\mathrm{def}}{=} \frac{\varepsilon_i^A X_i^A}{\alpha_A}, \quad \widetilde{Y}_i^A \stackrel{\mathrm{def}}{=} \frac{\varepsilon_i^A Y_i^A}{\alpha_A}, \quad \widetilde{X}_i^B \stackrel{\mathrm{def}}{=} \frac{\varepsilon_i^B X_i^B}{\alpha_B}, \quad \widetilde{Y}_i^B \stackrel{\mathrm{def}}{=} \frac{\varepsilon_i^B Y_i^B}{\alpha_B}.$$

Remark 2 We trivially obtain from Assumption A1 that the random vectors $(\widetilde{X}_i^A, \widetilde{Y}_i^A, \widetilde{X}_i^B, \widetilde{Y}_i^B)_{i \ge 1}$ are independent and identically distributed.
All the estimators of interest can then be rewritten using sums of the $\widetilde{X}_i^A$, where the random variables $(\widetilde{X}_i^A)_{i \ge 1}$ are summed over all the users independently of their population, which leads to the following sum definitions for any number of users $n \in \mathbb{N}$:
$$S_n^{\widetilde{X}^A} \stackrel{\mathrm{def}}{=} \frac{1}{n} \sum_{i=1}^{n} \widetilde{X}_i^A, \quad S_n^{\widetilde{Y}^A} \stackrel{\mathrm{def}}{=} \frac{1}{n} \sum_{i=1}^{n} \widetilde{Y}_i^A, \quad S_n^{\widetilde{X}^B} \stackrel{\mathrm{def}}{=} \frac{1}{n} \sum_{i=1}^{n} \widetilde{X}_i^B, \quad S_n^{\widetilde{Y}^B} \stackrel{\mathrm{def}}{=} \frac{1}{n} \sum_{i=1}^{n} \widetilde{Y}_i^B. \tag{4}$$
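The rescaling by $\alpha_A$ and $\alpha_B$ in Definition 1 makes each sum an unbiased estimate of the corresponding mean, whatever the split ratios. A simulation sketch, assuming exponential revenues with mean 2 and an uneven 30/70 split (both choices are illustrative):

```python
import random

random.seed(1)

alpha_a, alpha_b = 0.3, 0.7   # unequal population size ratios
n = 50_000
mean_x = 2.0                  # common mean of X_i^A and X_i^B (blank test)

s_xa = s_xb = 0.0
for _ in range(n):
    x = random.expovariate(1 / mean_x)   # same metric under both systems
    if random.random() < alpha_a:
        s_xa += x / alpha_a              # tilde X_i^A = eps_i^A X_i^A / alpha_A
    else:
        s_xb += x / alpha_b              # tilde X_i^B = eps_i^B X_i^B / alpha_B
s_xa /= n   # S_n for population A
s_xb /= n   # S_n for population B
```

Despite the 30/70 split, both sums land close to the true mean 2.0; only their variances differ, as quantified by Theorem 4.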
In the case of Example 1, we will have to compare either two sums over the same indices, $S_n^{\widetilde{X}^A}$ and $S_n^{\widetilde{X}^B}$, when the normalization is done by the population ratios; or two ratios of sums over the same indices, $S_n^{\widetilde{X}^A}/S_n^{\widetilde{Y}^A}$ and $S_n^{\widetilde{X}^B}/S_n^{\widetilde{Y}^B}$, where $Y^A \equiv 1$ and $Y^B \equiv 1$, when the normalization is done by the real population sizes. In the case of Example 2, the ratios to compare become similarly $S_n^{\widetilde{X}^A}/S_n^{\widetilde{Y}^A}$ and $S_n^{\widetilde{X}^B}/S_n^{\widetilde{Y}^B}$.
Writing the estimators this way validates the use of the bootstrap technique (Efron and Tibshirani, 1993) to get confidence intervals. For the relative increments of the metrics of interest, this can be done through the study of the ratios
$$\frac{S_n^{\widetilde{X}^B}}{S_n^{\widetilde{X}^A}} \quad \text{and} \quad \frac{S_n^{\widetilde{X}^B}/S_n^{\widetilde{Y}^B}}{S_n^{\widetilde{X}^A}/S_n^{\widetilde{Y}^A}} . \tag{5}$$
Three algorithms will be derived in the following section to get confidence intervals on such quantities.
Table 1: The estimators of interest, the associated function $f$, the estimated quantity $F(D)$, and the corresponding central limit theorem.

Estimator  |  $f(x, y, x', y')$  |  $F(D)$  |  CLT
$S_n^{\widetilde{X}^B} - S_n^{\widetilde{X}^A}$  |  $x' - x$  |  $m_{X^B} - m_{X^A}$  |  Prop. 5
$S_n^{\widetilde{X}^B} / S_n^{\widetilde{X}^A}$  |  $x'/x$  |  $m_{X^B} / m_{X^A}$  |  Prop. 6
$S_n^{\widetilde{X}^B}/S_n^{\widetilde{Y}^B} - S_n^{\widetilde{X}^A}/S_n^{\widetilde{Y}^A}$  |  $x'/y' - x/y$  |  $m_{X^B}/m_{Y^B} - m_{X^A}/m_{Y^A}$  |  Prop. 8
$\dfrac{S_n^{\widetilde{X}^B}/S_n^{\widetilde{Y}^B}}{S_n^{\widetilde{X}^A}/S_n^{\widetilde{Y}^A}}$  |  $\dfrac{x'/y'}{x/y}$  |  $\dfrac{m_{X^B}/m_{Y^B}}{m_{X^A}/m_{Y^A}}$  |  Prop. 9
Definition 3 We define the function $\varphi$ by
$$\varphi(x) \stackrel{\mathrm{def}}{=} \begin{cases} 1, & \text{if } x = 0, \\ x, & \text{if } x \neq 0. \end{cases}$$
We will apply $\varphi$ to all the denominators in the following theorems; by the positiveness ensured by Assumption A4, the ratios are then continuous functions of the non-zero sums. It is only a technical point, as in practice we would not define the ratio for a null denominator. In theoretical applications, Lemma 10 in Appendix A allows us to replace the sums by their non-zero versions obtained by applying the operator $\varphi$, but for the sake of simplicity we will not use it when describing the bootstrap.
If we denote by $D$ the distribution of $(\widetilde{X}_1^A, \widetilde{Y}_1^A, \widetilde{X}_1^B, \widetilde{Y}_1^B)$, then all the quantities that we want to estimate can be written as $F(D)$ for some functional $F$ of this distribution.
Bootstrapping In this specific framework, bootstrapping can be used by randomly selecting $n$ users (possibly picking the same user several times) and computing the estimator on this random set of users. Repeating this $M$ times provides an empirical distribution of the estimator of $F(D)$. The $M$ estimator values can be computed with only one pass over the dataset using an online version of bootstrapping described in Oza and Russell (2001); Oza (2005).

For each user $i$, a Poisson random variable $Z_i$ is simulated and the current user is included $Z_i$ times. The full procedure is detailed in Algorithm 1 and works well even if the dataset is not grouped by user. In this case, each line $l$ of the dataset is associated to a user $i = I_l$ and contains a vector $(\varepsilon_{I_l}^A x_l^A, \varepsilon_{I_l}^A y_l^A, \varepsilon_{I_l}^B x_l^B, \varepsilon_{I_l}^B y_l^B)$ such that, for any $i$,
$$X_i^A = \sum_{l=1 \,|\, I_l = i}^{L} x_l^A, \quad Y_i^A = \sum_{l=1 \,|\, I_l = i}^{L} y_l^A, \quad X_i^B = \sum_{l=1 \,|\, I_l = i}^{L} x_l^B, \quad Y_i^B = \sum_{l=1 \,|\, I_l = i}^{L} y_l^B .$$
It relies on a pseudo-random generator that is able to generate $M$ Poisson variables $(Z_i^m)_{1 \le m \le M}$ for each user $i$.

Algorithm 1
⋮
16: Set $\widehat{F}_m \stackrel{\mathrm{def}}{=} f\left(\frac{\Sigma_{1,m}}{n_m}, \frac{\Sigma_{2,m}}{n_m}, \frac{\Sigma_{3,m}}{n_m}, \frac{\Sigma_{4,m}}{n_m}\right)$.
17: end for
18: Outputs: $\left(\widehat{F}_m\right)_{m=1}^{M}$.
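The one-pass requirement is what makes this pseudo-random generator important: seeding a private generator with the user id reproduces the same $M$ Poisson weights every time a line of that user is encountered, so the dataset never has to be grouped. A sketch of the mechanism on a hypothetical line-level click log:

```python
import random

M = 32  # number of bootstrap replicates (an illustrative choice)

def poisson_weights(user_id, m_replicates=M):
    """Poisson(1) weights Z_i^m, identical for every line of user i:
    the generator is seeded with the user id (Knuth's sampling method)."""
    rng = random.Random(user_id)
    weights = []
    for _ in range(m_replicates):
        k, p = 0, rng.random()
        while p > 0.36787944117144233:  # exp(-1)
            k += 1
            p *= rng.random()
        weights.append(k)
    return weights

# hypothetical line-level log: (user_id, clicks, displays)
lines = [(1, 0, 3), (2, 1, 6), (1, 0, 1), (3, 0, 1)]

num = [0.0] * M   # bootstrapped click sums
den = [0.0] * M   # bootstrapped display sums
for user_id, clicks, displays in lines:
    w = poisson_weights(user_id)
    for m in range(M):
        num[m] += w[m] * clicks
        den[m] += w[m] * displays

ctr_replicates = [c / d for c, d in zip(num, den) if d > 0]
```

The list of replicates is an empirical distribution of the CTR estimator, in the spirit of the output of Algorithm 1.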
Confidence interval algorithms The $M$ estimators $(\widehat{F}_m)_{m=1}^{M}$ obtained in Algorithm 1 can then be used to derive empirical quantiles and obtain confidence intervals with Algorithm 2. However, quantile approximation for accurate confidence intervals requires $M$ to be big enough, and Algorithm 2 is thus only feasible if the number of users $n$ is small enough. Another way of computing confidence intervals is to use one of the central limit theorems stated in Section 3.2, on the condition that the implied variances can be easily estimated from the data. The resulting algorithm is given in Algorithm 3, where we use the normal cumulative density function $N$ defined by
$$\forall x \in \mathbb{R}, \quad N(x) \stackrel{\mathrm{def}}{=} \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2} \, dt . \tag{6}$$
Algorithm 2
1: Inputs: the bootstrapped estimators $(\widehat{F}_m)_{m=1}^{M}$ of Algorithm 1 and a confidence level $q$.
2: Sort the estimators into $\widehat{F}_{(1)} \le \dots \le \widehat{F}_{(M)}$.
3: Set $k_- = \lfloor M(1-q)/2 \rfloor$ and $k_+ = \lceil M(1+q)/2 \rceil$.
4: Outputs: $\left[ \widehat{F}_{(k_-)},\, \widehat{F}_{(k_+)} \right]$.

Algorithm 3
1: Inputs: the dataset aggregated by user and a confidence level $q$.
2: Set $s \stackrel{\mathrm{def}}{=} N^{-1}\left(\frac{1+q}{2}\right)$.
3: Estimate the asymptotic variance $\widehat{\sigma}_n^2$ using the relevant Proposition (see Table 1).
4: Outputs: $\left[ f\left(S_n^{\widetilde{X}^A}, S_n^{\widetilde{Y}^A}, S_n^{\widetilde{X}^B}, S_n^{\widetilde{Y}^B}\right) - s\,\frac{\widehat{\sigma}_n}{\sqrt{n}},\; f\left(S_n^{\widetilde{X}^A}, S_n^{\widetilde{Y}^A}, S_n^{\widetilde{X}^B}, S_n^{\widetilde{Y}^B}\right) + s\,\frac{\widehat{\sigma}_n}{\sqrt{n}} \right]$.
In practice, the data is not aggregated by user and we have to do so as a first step, in order to get the vectors $(\widetilde{X}_i^A, \widetilde{Y}_i^A, \widetilde{X}_i^B, \widetilde{Y}_i^B)_{i \ge 1}$ and estimate the related variances and covariances. This can be quite costly, as it requires more than one reading of the dataset if a user can be found on several lines. In the case where each user appears only once, this will be the quickest algorithm as it does not need any simulation.

We can take advantage of both Algorithms 2 and 3 by using bootstrapping to approximate the estimator variance and the asymptotic normality to derive confidence intervals, as described in Algorithm 4. The variance estimation only requires a small number of bootstraps $M$, and the dataset is read only once. This algorithm will be shown in Section 4 to perform better than Algorithm 2 for a given computational cost. However, it relies on an asymptotic regime and is relevant only when the number of users $n$ is large enough. Otherwise, pure bootstrapping may be a better alternative as it works for any value of $n$.
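The principle of Algorithm 4 (estimate the standard deviation from a handful of bootstrap replicates, then apply the normal quantile) can be sketched as follows. The replicates below are synthetic stand-ins for the output of Algorithm 1:

```python
import random
import statistics

random.seed(2)

# synthetic bootstrap replicates of an estimator, centered on its point estimate
point_estimate = 0.044
replicates = [point_estimate + random.gauss(0, 0.002) for _ in range(50)]

q = 0.95
s = statistics.NormalDist().inv_cdf((1 + q) / 2)  # N^{-1}((1+q)/2), about 1.96

sigma_f = statistics.stdev(replicates)   # bootstrap estimate of the std dev
ci = (point_estimate - s * sigma_f, point_estimate + s * sigma_f)
```

Only the spread of the replicates is used; the interval endpoints come from the normal approximation, which is why a small $M$ suffices.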
Algorithm 4
1: Run Algorithm 1 to get the bootstrapped estimators $(\widehat{F}_m)_{m=1}^{M}$.
2: Set $s \stackrel{\mathrm{def}}{=} N^{-1}\left(\frac{1+q}{2}\right)$.
3: Set $\widehat{\sigma}_F^2 \stackrel{\mathrm{def}}{=} \frac{1}{M-1} \sum_{m=1}^{M} \left( \widehat{F}_m - \frac{1}{M} \sum_{p=1}^{M} \widehat{F}_p \right)^2$.
4: Outputs: $\left[ f\left(S_n^{\widetilde{X}^A}, S_n^{\widetilde{Y}^A}, S_n^{\widetilde{X}^B}, S_n^{\widetilde{Y}^B}\right) - s\,\widehat{\sigma}_F,\; f\left(S_n^{\widetilde{X}^A}, S_n^{\widetilde{Y}^A}, S_n^{\widetilde{X}^B}, S_n^{\widetilde{Y}^B}\right) + s\,\widehat{\sigma}_F \right]$.
Theorem 4 (Central limit theorem) Under Assumptions A1-5, the vector $(S_n^{\widetilde{X}^A}, S_n^{\widetilde{Y}^A}, S_n^{\widetilde{X}^B}, S_n^{\widetilde{Y}^B})$ satisfies the following central limit theorem:
$$\sqrt{n}\left[ \begin{pmatrix} S_n^{\widetilde{X}^A} \\ S_n^{\widetilde{Y}^A} \\ S_n^{\widetilde{X}^B} \\ S_n^{\widetilde{Y}^B} \end{pmatrix} - \begin{pmatrix} m_{X^A} \\ m_{Y^A} \\ m_{X^B} \\ m_{Y^B} \end{pmatrix} \right] \xrightarrow{\mathcal{D}} \mathcal{N}\left( \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}, \Sigma_{\widetilde{X}_1^A, \widetilde{Y}_1^A, \widetilde{X}_1^B, \widetilde{Y}_1^B} \right),$$
where $\Sigma_{\widetilde{X}_1^A, \widetilde{Y}_1^A, \widetilde{X}_1^B, \widetilde{Y}_1^B}$ is the covariance matrix of $(\widetilde{X}_1^A, \widetilde{Y}_1^A, \widetilde{X}_1^B, \widetilde{Y}_1^B)$, defined by the variances
$$\widetilde{\sigma}_{X^A}^2 \stackrel{\mathrm{def}}{=} \mathrm{Var}\left(\widetilde{X}_1^A\right) = \frac{1}{\alpha_A}\,\sigma_{X^A}^2 + \frac{1-\alpha_A}{\alpha_A}\, m_{X^A}^2 , \tag{7}$$
$$\widetilde{\sigma}_{Y^A}^2 \stackrel{\mathrm{def}}{=} \mathrm{Var}\left(\widetilde{Y}_1^A\right) = \frac{1}{\alpha_A}\,\sigma_{Y^A}^2 + \frac{1-\alpha_A}{\alpha_A}\, m_{Y^A}^2 , \tag{8}$$
$$\widetilde{\sigma}_{X^B}^2 \stackrel{\mathrm{def}}{=} \mathrm{Var}\left(\widetilde{X}_1^B\right) = \frac{1}{\alpha_B}\,\sigma_{X^B}^2 + \frac{1-\alpha_B}{\alpha_B}\, m_{X^B}^2 , \tag{9}$$
$$\widetilde{\sigma}_{Y^B}^2 \stackrel{\mathrm{def}}{=} \mathrm{Var}\left(\widetilde{Y}_1^B\right) = \frac{1}{\alpha_B}\,\sigma_{Y^B}^2 + \frac{1-\alpha_B}{\alpha_B}\, m_{Y^B}^2 , \tag{10}$$
the covariances inside each population
$$\widetilde{\rho}_A \stackrel{\mathrm{def}}{=} \mathrm{Cov}\left(\widetilde{X}_1^A, \widetilde{Y}_1^A\right) = \frac{1}{\alpha_A}\,\rho_{X^A Y^A} + \frac{1-\alpha_A}{\alpha_A}\, m_{X^A} m_{Y^A} , \tag{11}$$
$$\widetilde{\rho}_B \stackrel{\mathrm{def}}{=} \mathrm{Cov}\left(\widetilde{X}_1^B, \widetilde{Y}_1^B\right) = \frac{1}{\alpha_B}\,\rho_{X^B Y^B} + \frac{1-\alpha_B}{\alpha_B}\, m_{X^B} m_{Y^B} , \tag{12}$$
and the covariances across populations
$$\mathrm{Cov}\left(\widetilde{X}_1^A, \widetilde{X}_1^B\right) = -m_{X^A} m_{X^B} , \quad \mathrm{Cov}\left(\widetilde{X}_1^A, \widetilde{Y}_1^B\right) = -m_{X^A} m_{Y^B} ,$$
$$\mathrm{Cov}\left(\widetilde{Y}_1^A, \widetilde{X}_1^B\right) = -m_{Y^A} m_{X^B} , \quad \mathrm{Cov}\left(\widetilde{Y}_1^A, \widetilde{Y}_1^B\right) = -m_{Y^A} m_{Y^B} .$$
The convergence occurs at rate $\sqrt{n}$, where $n$ is the total number of users, not the number of users in a population. However, the variance of each estimator decreases with its relative population size, thanks to the factors $\alpha_A$ and $\alpha_B$ found in the denominators of the four variances.
Furthermore, these variances are composed of two terms: one that comes purely from the variance of the metric of interest (e.g. $\sigma_{X^A}^2$) and another one added by the AB-test process, which randomly attributes each user to a population (e.g. $\frac{1-\alpha_A}{\alpha_A} m_{X^A}^2$). They can be understood by looking at extreme cases. When population $A$ includes all the users, i.e. $\alpha_A = 1$, the randomness of the AB-test process disappears and we simply get $\mathrm{Var}(\widetilde{X}_1^A) = \sigma_{X^A}^2$. On the other hand, if the metric of interest $X^A$ is purely deterministic, let's say $X^A \equiv 1$, in which case we are interested in the number of users per population, then the variance becomes $\frac{1-\alpha_A}{\alpha_A}$, which is the variance of $\varepsilon_1^A/\alpha_A$. However, in practice, we often have $\sigma_{X^A} \gg m_{X^A}$ and the second term becomes almost negligible.
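The decomposition of $\mathrm{Var}(\widetilde{X}_1^A)$ into these two terms can be checked by simulation. A sketch assuming an exponential metric with mean 1.5 (hence variance 2.25) and $\alpha_A = 0.4$:

```python
import random
import statistics

random.seed(3)

alpha = 0.4
mean_x, var_x = 1.5, 2.25   # exponential: variance = mean^2

samples = []
for _ in range(100_000):
    x = random.expovariate(1 / mean_x)
    eps = 1.0 if random.random() < alpha else 0.0
    samples.append(eps * x / alpha)          # tilde X = eps * X / alpha

empirical = statistics.variance(samples)
# first term: metric variance; second term: AB-test attribution noise
theoretical = var_x / alpha + (1 - alpha) / alpha * mean_x ** 2
```

Both values agree up to Monte Carlo error; here the attribution term contributes 3.375 of the total 9.0.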
Another fact shown by Theorem 4 is that the metrics of the two populations are not independent. This was actually intuitive: when a user is associated to one population, and thus included in the corresponding sum, the other population loses this user. If $m_{X^A}$ and $m_{X^B}$ are positive, then the correlation is negative, following the previous intuition.

Finally, Theorem 4 provides the asymptotic joint law of the four empirical averages we are interested in to compare the two populations. Simple linear combinations such as $S_n^{\widetilde{X}^B} - S_n^{\widetilde{X}^A}$ remain asymptotically normal, and confidence intervals can easily be derived, as stated in Proposition 5. This allows for comparing, for example, the absolute increment of the number of displays per user generated by the two algorithms $A$ and $B$.
Proposition 5 (CLT for $f(x, y, x', y') = x' - x$) Under Assumptions A1-5, the increment $S_n^{\widetilde{X}^B} - S_n^{\widetilde{X}^A}$ satisfies the following central limit theorem:
$$\sqrt{n}\left[ S_n^{\widetilde{X}^B} - S_n^{\widetilde{X}^A} - (m_{X^B} - m_{X^A}) \right] \xrightarrow{\mathcal{D}} \mathcal{N}\left(0,\; \widetilde{\sigma}_{X^A}^2 + \widetilde{\sigma}_{X^B}^2 + 2\, m_{X^A} m_{X^B}\right),$$
where $\widetilde{\sigma}_{X^A}^2$ and $\widetilde{\sigma}_{X^B}^2$ are defined in (7) and (9).
When coming to confidence intervals for relative increments such as $S_n^{\widetilde{X}^B}/S_n^{\widetilde{X}^A}$, or for ratio metrics such as $S_n^{\widetilde{X}^A}/S_n^{\widetilde{Y}^A}$, without further steps one would need to compute quantiles of the ratio of two correlated normal random variables. This problem is known to be difficult and has been discussed for decades; see Marsaglia (2006) and references therein.

However, such ratios can themselves be shown to be asymptotically normal in our setup, as stated in Propositions 6 and 7.
Proposition 6 (CLT for $f(x, y, x', y') = x'/x$) Under Assumptions A1-5, the ratio $S_n^{\widetilde{X}^B}/\varphi\left(S_n^{\widetilde{X}^A}\right)$ satisfies the following central limit theorem:
$$\sqrt{n}\left[ \frac{S_n^{\widetilde{X}^B}}{\varphi\left(S_n^{\widetilde{X}^A}\right)} - \frac{m_{X^B}}{m_{X^A}} \right] \xrightarrow{\mathcal{D}} \mathcal{N}\left(0,\; \left(\frac{m_{X^B}}{m_{X^A}}\right)^2 \left[ \frac{\widetilde{\sigma}_{X^A}^2}{m_{X^A}^2} + \frac{\widetilde{\sigma}_{X^B}^2}{m_{X^B}^2} + 2 \right] \right),$$
where $\widetilde{\sigma}_{X^A}^2$ and $\widetilde{\sigma}_{X^B}^2$ are defined in (7) and (9).
Following similar steps, we can now derive central limit theorems for ratios of the form $S_n^{\widetilde{X}^A}/S_n^{\widetilde{Y}^A}$, which allows us to get confidence intervals for metrics such as the CTR of Example 2.
Proposition 7 (CLT for $f(x, y, x', y') = x/y$ and $x'/y'$) Under Assumptions A1-5, the ratios $S_n^{\widetilde{X}^A}/\varphi\left(S_n^{\widetilde{Y}^A}\right)$ and $S_n^{\widetilde{X}^B}/\varphi\left(S_n^{\widetilde{Y}^B}\right)$ satisfy the following central limit theorem:
$$\sqrt{n}\begin{pmatrix} \dfrac{S_n^{\widetilde{X}^A}}{\varphi\left(S_n^{\widetilde{Y}^A}\right)} - \dfrac{m_{X^A}}{m_{Y^A}} \\[2ex] \dfrac{S_n^{\widetilde{X}^B}}{\varphi\left(S_n^{\widetilde{Y}^B}\right)} - \dfrac{m_{X^B}}{m_{Y^B}} \end{pmatrix} \xrightarrow{\mathcal{D}} \mathcal{N}\left( 0, \begin{pmatrix} V_A & 0 \\ 0 & V_B \end{pmatrix} \right),$$
with
$$V_A \stackrel{\mathrm{def}}{=} \left(\frac{m_{X^A}}{m_{Y^A}}\right)^2 \left[ \frac{\widetilde{\sigma}_{X^A}^2}{m_{X^A}^2} + \frac{\widetilde{\sigma}_{Y^A}^2}{m_{Y^A}^2} - \frac{2\,\widetilde{\rho}_A}{m_{X^A} m_{Y^A}} \right], \tag{13}$$
$$V_B \stackrel{\mathrm{def}}{=} \left(\frac{m_{X^B}}{m_{Y^B}}\right)^2 \left[ \frac{\widetilde{\sigma}_{X^B}^2}{m_{X^B}^2} + \frac{\widetilde{\sigma}_{Y^B}^2}{m_{Y^B}^2} - \frac{2\,\widetilde{\rho}_B}{m_{X^B} m_{Y^B}} \right], \tag{14}$$
where $\widetilde{\sigma}_{X^A}^2$, $\widetilde{\sigma}_{Y^A}^2$, $\widetilde{\sigma}_{X^B}^2$, $\widetilde{\sigma}_{Y^B}^2$, $\widetilde{\rho}_A$ and $\widetilde{\rho}_B$ are respectively defined in (7), (8), (9), (10), (11) and (12).
A
g
B
g
g
g
g
g
g
A
B
A
A
B
B
Y
Y
X
Y
X
Y
Sn and Sn , the ratio Sn /' Sn
and Sn /' Sn
are not. This can be explained
by recalling that the correlation of the non-ratio metrics is due to the fact that adding a
user to one sum, excludes him from the other one, resulting in a negative correlation. On
the contrary, ratios inside each population are independent of the scale of the individual
sums, and their correlation vanishes asymptotically.
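Both effects (the negative correlation of the sums and the asymptotic decorrelation of the ratios) can be observed by simulating many blank AB-tests. A sketch under an assumed per-user click model:

```python
import random

random.seed(5)

def corr(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / n
    su = (sum((a - mu) ** 2 for a in u) / n) ** 0.5
    sv = (sum((b - mv) ** 2 for b in v) / n) ** 0.5
    return cov / (su * sv)

sums_a, sums_b, ctrs_a, ctrs_b = [], [], [], []
for _ in range(2_000):           # repeated blank AB-tests
    xa = ya = xb = yb = 0
    for _ in range(200):         # users per test
        displays = 1 + random.randrange(3)
        clicks = sum(random.random() < 0.2 for _ in range(displays))
        if random.random() < 0.5:
            xa += clicks; ya += displays
        else:
            xb += clicks; yb += displays
    sums_a.append(xa); sums_b.append(xb)
    ctrs_a.append(xa / max(ya, 1)); ctrs_b.append(xb / max(yb, 1))

sum_corr = corr(sums_a, sums_b)   # negative: a user given to A is lost to B
ctr_corr = corr(ctrs_a, ctrs_b)   # near zero: ratios forget the scale
```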
We can now derive central limit theorems for both the absolute and relative differences of ratios. This is done in Propositions 8 and 9 respectively.
Proposition 8 (CLT for $f(x, y, x', y') = x'/y' - x/y$) Under Assumptions A1-5, the ratios $S_n^{\widetilde{X}^A}/\varphi\left(S_n^{\widetilde{Y}^A}\right)$ and $S_n^{\widetilde{X}^B}/\varphi\left(S_n^{\widetilde{Y}^B}\right)$ satisfy the following central limit theorem:
$$\sqrt{n}\left[ \left( \frac{S_n^{\widetilde{X}^B}}{\varphi\left(S_n^{\widetilde{Y}^B}\right)} - \frac{S_n^{\widetilde{X}^A}}{\varphi\left(S_n^{\widetilde{Y}^A}\right)} \right) - \left( \frac{m_{X^B}}{m_{Y^B}} - \frac{m_{X^A}}{m_{Y^A}} \right) \right] \xrightarrow{\mathcal{D}} \mathcal{N}\left(0,\; V_A + V_B\right),$$
where $V_A$ and $V_B$ are defined respectively in (13) and (14).
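The variance $V_A + V_B$ can be compared to the observed spread of the CTR difference over repeated blank AB-tests. A simulation sketch; the per-user model (displays uniform on {1, 2}, clicks binomial with rate 0.3, even split) is an assumption made for illustration:

```python
import math
import random

random.seed(7)

# per-user model: displays uniform on {1, 2}, clicks ~ Binomial(displays, p);
# both systems identical, so the true CTR difference is 0 (blank AB-test)
p, alpha = 0.3, 0.5
m_y, var_y = 1.5, 0.25
m_x = p * m_y                                # mean clicks per user
var_x = m_y * p * (1 - p) + p * p * var_y    # variance of clicks per user
rho = p * var_y                              # Cov(clicks, displays)

# tilde moments of Theorem 4, then V_A = V_B from (13)-(14)
t_var_x = var_x / alpha + (1 - alpha) / alpha * m_x ** 2
t_var_y = var_y / alpha + (1 - alpha) / alpha * m_y ** 2
t_rho = rho / alpha + (1 - alpha) / alpha * m_x * m_y
v = (m_x / m_y) ** 2 * (t_var_x / m_x ** 2 + t_var_y / m_y ** 2
                        - 2 * t_rho / (m_x * m_y))

n, tests = 1_000, 800
diffs = []
for _ in range(tests):
    xa = ya = xb = yb = 0
    for _ in range(n):
        d = random.choice((1, 2))
        c = sum(random.random() < p for _ in range(d))
        if random.random() < alpha:
            xa += c; ya += d
        else:
            xb += c; yb += d
    diffs.append(xb / yb - xa / ya)

mean_diff = sum(diffs) / tests
empirical_sd = (sum((t - mean_diff) ** 2 for t in diffs) / tests) ** 0.5
predicted_sd = math.sqrt(2 * v / n)   # Proposition 8: variance V_A + V_B
```

The empirical standard deviation of the differences matches the asymptotic prediction up to Monte Carlo error.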
Proposition 9 (CLT for $f(x, y, x', y') = \frac{x'/y'}{x/y}$) Under Assumptions A1-5, we have the following central limit theorem:
$$\sqrt{n}\left[ \frac{S_n^{\widetilde{X}^B}/\varphi\left(S_n^{\widetilde{Y}^B}\right)}{\varphi\left(S_n^{\widetilde{X}^A}/\varphi\left(S_n^{\widetilde{Y}^A}\right)\right)} - \frac{m_{X^B}/m_{Y^B}}{m_{X^A}/m_{Y^A}} \right] \xrightarrow{\mathcal{D}} \mathcal{N}\left(0,\; \left(\frac{m_{X^B}/m_{Y^B}}{m_{X^A}/m_{Y^A}}\right)^2 \left[ \left(\frac{\sqrt{V_A}}{m_{X^A}/m_{Y^A}}\right)^2 + \left(\frac{\sqrt{V_B}}{m_{X^B}/m_{Y^B}}\right)^2 \right] \right),$$
where $V_A$ and $V_B$ are defined respectively in (13) and (14).
The means are naturally estimated by the corresponding empirical sums:
$$\widehat{m}_{X^A}[n] \stackrel{\mathrm{def}}{=} S_n^{\widetilde{X}^A}, \quad \widehat{m}_{Y^A}[n] \stackrel{\mathrm{def}}{=} S_n^{\widetilde{Y}^A}, \quad \widehat{m}_{X^B}[n] \stackrel{\mathrm{def}}{=} S_n^{\widetilde{X}^B}, \quad \widehat{m}_{Y^B}[n] \stackrel{\mathrm{def}}{=} S_n^{\widetilde{Y}^B}.$$
The variance estimators can be computed directly from the random variables $(\widetilde{X}_i^A, \widetilde{Y}_i^A, \widetilde{X}_i^B, \widetilde{Y}_i^B)_{i \ge 1}$, without estimating $\sigma_{X^A}$, $\sigma_{Y^A}$, $\sigma_{X^B}$, $\sigma_{Y^B}$ in a first step:
$$\widehat{\widetilde{\sigma}}_{X^A}^2[n] \stackrel{\mathrm{def}}{=} \frac{1}{n-1} \sum_{i=1}^{n} \left( \widetilde{X}_i^A - \widehat{m}_{X^A}[n] \right)^2 , \quad \widehat{\widetilde{\sigma}}_{Y^A}^2[n] \stackrel{\mathrm{def}}{=} \frac{1}{n-1} \sum_{i=1}^{n} \left( \widetilde{Y}_i^A - \widehat{m}_{Y^A}[n] \right)^2 ,$$
$$\widehat{\widetilde{\sigma}}_{X^B}^2[n] \stackrel{\mathrm{def}}{=} \frac{1}{n-1} \sum_{i=1}^{n} \left( \widetilde{X}_i^B - \widehat{m}_{X^B}[n] \right)^2 , \quad \widehat{\widetilde{\sigma}}_{Y^B}^2[n] \stackrel{\mathrm{def}}{=} \frac{1}{n-1} \sum_{i=1}^{n} \left( \widetilde{Y}_i^B - \widehat{m}_{Y^B}[n] \right)^2 ,$$
$$\widehat{\widetilde{\rho}}_A[n] \stackrel{\mathrm{def}}{=} \frac{1}{n-1} \sum_{i=1}^{n} \left( \widetilde{X}_i^A - \widehat{m}_{X^A}[n] \right)\left( \widetilde{Y}_i^A - \widehat{m}_{Y^A}[n] \right),$$
$$\widehat{\widetilde{\rho}}_B[n] \stackrel{\mathrm{def}}{=} \frac{1}{n-1} \sum_{i=1}^{n} \left( \widetilde{X}_i^B - \widehat{m}_{X^B}[n] \right)\left( \widetilde{Y}_i^B - \widehat{m}_{Y^B}[n] \right).$$
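These plug-in estimators translate directly into code. A sketch computing the empirical mean, variance and covariance from simulated tilde variables, under an assumed per-user model (displays uniform on {1, ..., 4}, click rate 0.1, even split):

```python
import random

random.seed(6)

alpha_a = 0.5
n = 20_000

xa_t, ya_t = [], []   # tilde X_i^A and tilde Y_i^A
for _ in range(n):
    displays = 1 + random.randrange(4)
    clicks = sum(random.random() < 0.1 for _ in range(displays))
    eps_a = 1.0 if random.random() < alpha_a else 0.0
    xa_t.append(eps_a * clicks / alpha_a)
    ya_t.append(eps_a * displays / alpha_a)

m_xa = sum(xa_t) / n   # hat m_{X^A}[n] = S_n
m_ya = sum(ya_t) / n   # hat m_{Y^A}[n]
var_xa = sum((x - m_xa) ** 2 for x in xa_t) / (n - 1)
cov_a = sum((x - m_xa) * (y - m_ya)
            for x, y in zip(xa_t, ya_t)) / (n - 1)   # hat rho_A[n]
```

The same few lines, repeated for population B, provide every quantity needed by the interval of Algorithm 3.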
NbDisplays  1  3  1  6  1  1  1  3  1
NbClicks    0  1  0  0  0  0  0  2  0
In the next sections, blank AB-tests will be simulated from this dataset to compare the CTR (click-through rate) of each population, defined as the average number of clicks per display:
$$\mathrm{CTR} \stackrel{\mathrm{def}}{=} \frac{\mathrm{NbClicks}}{\mathrm{NbDisplays}} .$$
Number of users     1000000
Number of displays  4332627
Number of clicks    191892
CTR                 4.4%
Displays per user   4.3
Clicks per user     0.19
$$\mathrm{CTR}_A = \frac{S_n^{\widetilde{X}^A}}{S_n^{\widetilde{Y}^A}} \quad \text{and} \quad \mathrm{CTR}_B = \frac{S_n^{\widetilde{X}^B}}{S_n^{\widetilde{Y}^B}} .$$
under-estimating the AB-test noise, and increments appear significant much more often than they should. For example, a 95%-confidence interval includes the true value in only 59% of AB-tests, which contradicts the definition of a confidence interval. On the contrary, the assumption of user independence leads to the expected conclusion of having almost 95% of 95%-confidence intervals including 0, and this remains true for all other tested levels.

This under-estimation is explicitly illustrated in Figure 3, where the empirical distribution of the number of clicks (obtained by bootstrapping) is compared to the binomial distribution implied by the display independence assumption. It shows that the empirical standard deviation is much higher than the binomial one (twice as big in this example).
4.3 Comparison of Bootstrap Algorithms

The assumption of independence by user having been validated, we can now focus on the comparison of the proposed algorithms. The method using only the central limit theorem will be given as a reference but is not of practical interest here, as the dataset is not grouped by user (see Section 3.1). We are thus more interested in comparing Algorithms 2 and 4, as they can be implemented in such a way that the dataset is read only once. Each algorithm uses bootstrapping, having a computational cost linear in the number of bootstraps $M$. Similarly to Section 4.2, 500 blank AB-tests were simulated from the dataset described above, and we study the relative increment estimator
$$\frac{S_n^{\widetilde{X}^B}/S_n^{\widetilde{Y}^B}}{S_n^{\widetilde{X}^A}/S_n^{\widetilde{Y}^A}} - 1 ,$$
where $(X_i^A, Y_i^A, X_i^B, Y_i^B)_{i \ge 1}$ are defined in Section 4.2. According to Proposition 9, this estimator is asymptotically normal and its average should be 0 for a blank AB-test. The frequency of confidence intervals including the true value 0 is displayed in Figure 4 for different levels of confidence, for both the pure bootstrap technique with $M = 10$ (Algorithm 2) and the technique using the bootstrap variance in the CLT (Algorithm 4), again with $M = 10$. As expected, for a small number of bootstraps $M = 10$, the pure bootstrap algorithm performs poorly and obtains acceptable confidence intervals for only a few confidence levels, while the algorithm using both the CLT and bootstrapping shows good results for all confidence levels at the same computational cost. In Figure 5, we show the influence of the number of bootstraps $M$ on the ability of each algorithm to compute reliable 95% confidence intervals. The pure bootstrap algorithm converges more slowly to the target 95% value and requires twice the computational cost of the mixed algorithm.
5. Conclusion

We have translated the AB-test process into a statistical framework, providing three algorithms for the computation of confidence intervals. Each of them is useful in different practical cases:
1. if the number of users n is small, pure bootstrapping is the best choice (see Algorithm
2), and a large number of bootstraps M is tractable;
2. if the number of users n is large and the dataset is grouped by user, then one should
use one of the relevant central limit theorems (see Algorithm 3);
3. if the number of users n is large and the dataset is not grouped by user, the algorithm
using the bootstrap variance in the central limit theorem will result in the smallest
computational cost (see Algorithm 4).
Numerical experiments allowed us to check that our assumptions were valid. We focused on the CTR computation but, as stated in the theoretical parts, the proposed algorithms apply to any metric that can be written as a sum or a ratio of sums, e.g., to the sales amount spent per user as well as the revenue generated per user. Similar numerical results allowed us to validate the algorithms.
It is worthwhile to note that the provided algorithms lead to results valid only during the AB-test and do not extend to the future. This is known as the long term effect, as discussed in Kohavi et al. (2009). Addressing this issue would require additional assumptions on the metrics of interest, such as time series modeling, and is outside the scope of this paper.
Acknowledgments
The author would like to thank Olivier Chapelle, Alexandre Gilotte, Andrew Kwok and Nicolas Le Roux for their valuable ideas, comments and feedback.
Appendix A

$$\left| \varphi(Y_n) - y \right| \le \left| \varphi(Y_n) - Y_n \right| + \left| Y_n - y \right| = \mathbb{1}_{Y_n = 0} + \left| Y_n - y \right| ,$$
$$\left\{ \left| \varphi(Y_n) - y \right| > \varepsilon \right\} \subset \left\{ Y_n = 0 \right\} \cup \left\{ \left| Y_n - y \right| > \varepsilon \right\} ,$$
$$\left\{ Y_n = 0 \right\} \subset \left\{ \left| Y_n - y \right| > |y|/2 \right\} ,$$
$$\sqrt{n}\left( X_n^1 - x_1, \dots, X_n^d - x_d, Y_n - y \right) \xrightarrow{\mathcal{D}} V .$$
then assertions 1 and 3 are satisfied with $\varphi(Y_n)$, where $\varphi$ is defined in Definition 3.

According to Assumption 3, the second term of the right hand side of (15) converges to 0. The first term is handled using the Lipschitz property of $f$: there exists a constant $K$ such that for all $(a, b) \in (\mathbb{R}^{d+1})^2$, $|f(a) - f(b)| \le K \left\| a - b \right\|_{L^1}$, so that
$$\mathbb{E}\left[ \left| f\left(\sqrt{n}\left(X_n^1 - x_1, \dots, X_n^d - x_d, \varphi(Y_n) - y\right)\right) - f\left(\sqrt{n}\left(X_n^1 - x_1, \dots, X_n^d - x_d, Y_n - y\right)\right) \right| \right]$$
$$\le K \sqrt{n}\, \mathbb{E}\left[ \left\| \left(X_n^1 - x_1, \dots, X_n^d - x_d, \varphi(Y_n) - y\right) - \left(X_n^1 - x_1, \dots, X_n^d - x_d, Y_n - y\right) \right\|_{L^1} \right]$$
$$= K \sqrt{n}\, \mathbb{E}\left[ \left| \varphi(Y_n) - Y_n \right| \right] = K \sqrt{n}\, \mathbb{E}\left[ \mathbb{1}_{Y_n = 0} \right] = K \sqrt{n}\, \mathbb{P}\left\{ Y_n = 0 \right\} \le K \sqrt{n}\, c^n , \text{ according to Assumption 2,}$$
which shows that the first term of the right hand side of (15) converges to 0 and that
$$\sqrt{n}\left( X_n^1 - x_1, \dots, X_n^d - x_d, \varphi(Y_n) - y \right) \xrightarrow{\mathcal{D}} V .$$
Proposition 12 Let $(X_n, Y_n, X_n', Y_n')_{n \ge 1}$ be a sequence of random variables in $\mathbb{R}^4$, $(x, y, x', y') \in \mathbb{R}^4$ and $\Sigma$ a $4 \times 4$ covariance matrix such that
1. $y \neq 0$ and $y' \neq 0$,
2. $Y_n \xrightarrow{\mathbb{P}} y$ and $Y_n' \xrightarrow{\mathbb{P}} y'$,
3. there exists $c \in [0, 1)$ such that $\mathbb{P}\{Y_n = 0\} \le c^n$ and $\mathbb{P}\{Y_n' = 0\} \le c^n$,
4. $\sqrt{n}\,(X_n - x, Y_n - y, X_n' - x', Y_n' - y') \xrightarrow{\mathcal{D}} \mathcal{N}(0, \Sigma)$.
Then the ratio sequence $(X_n/\varphi(Y_n), X_n'/\varphi(Y_n'))_{n \ge 1}$ satisfies the following central limit theorem:
$$\sqrt{n}\begin{pmatrix} \dfrac{X_n}{\varphi(Y_n)} - \dfrac{x}{y} \\[2ex] \dfrac{X_n'}{\varphi(Y_n')} - \dfrac{x'}{y'} \end{pmatrix} \xrightarrow{\mathcal{D}} \mathcal{N}\left(0,\; P^T \Sigma P\right),$$
where
$$P \stackrel{\mathrm{def}}{=} \begin{pmatrix} \frac{1}{y} & 0 \\ -\frac{x}{y^2} & 0 \\ 0 & \frac{1}{y'} \\ 0 & -\frac{x'}{(y')^2} \end{pmatrix} .$$
Proof We decompose $\frac{X_n}{\varphi(Y_n)} - \frac{x}{y}$ as
$$\frac{X_n}{\varphi(Y_n)} - \frac{x}{y} = \frac{X_n - x}{\varphi(Y_n)} + \frac{x}{\varphi(Y_n)} - \frac{x}{y} = \frac{y}{\varphi(Y_n)} \frac{1}{y} \left(X_n - x\right) - \frac{y}{\varphi(Y_n)} \frac{x}{y^2} \left(\varphi(Y_n) - y\right),$$
and similarly
$$\frac{X_n'}{\varphi(Y_n')} - \frac{x'}{y'} = \frac{y'}{\varphi(Y_n')} \frac{1}{y'} \left(X_n' - x'\right) - \frac{y'}{\varphi(Y_n')} \frac{x'}{(y')^2} \left(\varphi(Y_n') - y'\right),$$
so that
$$\sqrt{n}\begin{pmatrix} \dfrac{X_n}{\varphi(Y_n)} - \dfrac{x}{y} \\[2ex] \dfrac{X_n'}{\varphi(Y_n')} - \dfrac{x'}{y'} \end{pmatrix} = P_n^T\, \sqrt{n} \begin{pmatrix} X_n - x \\ \varphi(Y_n) - y \\ X_n' - x' \\ \varphi(Y_n') - y' \end{pmatrix},$$
where
$$P_n \stackrel{\mathrm{def}}{=} \begin{pmatrix} \frac{y}{\varphi(Y_n)} \frac{1}{y} & 0 \\ -\frac{y}{\varphi(Y_n)} \frac{x}{y^2} & 0 \\ 0 & \frac{y'}{\varphi(Y_n')} \frac{1}{y'} \\ 0 & -\frac{y'}{\varphi(Y_n')} \frac{x'}{(y')^2} \end{pmatrix} .$$
By applying Lemma 10, $\varphi(Y_n) \xrightarrow{\mathbb{P}} y$ and $\varphi(Y_n') \xrightarrow{\mathbb{P}} y'$, so that $P_n \xrightarrow{\mathbb{P}} P$. Furthermore, using Lemma 11 twice, we successively get that $(X_n, \varphi(Y_n), X_n', Y_n')_{n \ge 1}$ and then $(X_n, \varphi(Y_n), X_n', \varphi(Y_n'))_{n \ge 1}$ satisfy the CLT stated in Assumption 4. We then only need to apply the Slutsky lemma to conclude.
Corollary 13 Let $(X_n, Y_n)_{n \ge 1}$ be a sequence of random variables in $\mathbb{R}^2$, $(x, y) \in \mathbb{R}^2$ and $\Sigma$ a $2 \times 2$ covariance matrix such that
1. $y \neq 0$,
2. $Y_n \xrightarrow{\mathbb{P}} y$,
3. there exists $c \in [0, 1)$ such that $\mathbb{P}\{Y_n = 0\} \le c^n$,
4. $\sqrt{n}\,(X_n - x, Y_n - y) \xrightarrow{\mathcal{D}} \mathcal{N}\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \Sigma\right)$.
Then
$$\sqrt{n}\left( \frac{X_n}{\varphi(Y_n)} - \frac{x}{y} \right) \xrightarrow{\mathcal{D}} \mathcal{N}\left(0,\; P^T \Sigma P\right), \quad \text{where } P \stackrel{\mathrm{def}}{=} \begin{pmatrix} \frac{1}{y} \\ -\frac{x}{y^2} \end{pmatrix} .$$
Proof This is a direct consequence of Proposition 12 by keeping only the first marginal of the ratio couple.
Appendix B

Proof [Proof of Theorem 4] The vector $(S_n^{\widetilde{X}^A}, S_n^{\widetilde{Y}^A}, S_n^{\widetilde{X}^B}, S_n^{\widetilde{Y}^B})$ is made of empirical means of the random variables $(\widetilde{X}_i^A, \widetilde{Y}_i^A, \widetilde{X}_i^B, \widetilde{Y}_i^B)_{i \ge 1}$. According to Definition 1 and to Assumption A1, these variables are i.i.d. and by the same definition they are centered on $(m_{X^A}, m_{Y^A}, m_{X^B}, m_{Y^B})$. Furthermore, one directly sees that
$$\widetilde{X}_1^A \le \frac{1}{\alpha_A} X_1^A, \quad \widetilde{Y}_1^A \le \frac{1}{\alpha_A} Y_1^A, \quad \widetilde{X}_1^B \le \frac{1}{\alpha_B} X_1^B, \quad \widetilde{Y}_1^B \le \frac{1}{\alpha_B} Y_1^B,$$
which, combined with Assumption A3, shows that $(\widetilde{X}_1^A, \widetilde{Y}_1^A, \widetilde{X}_1^B, \widetilde{Y}_1^B)$ is $L^2$-integrable. We can then apply a multi-dimensional version of the central limit theorem to get the announced convergence in distribution result.
It now only remains to calculate the related variances and covariances. By Definition 1, we have
$$\mathrm{Var}\left(\widetilde{X}_1^A\right) = \mathrm{Var}\left( \frac{\varepsilon_1^A X_1^A}{\alpha_A} \right)$$
$$= \frac{1}{\alpha_A^2} \left\{ \mathbb{E}\left[ (\varepsilon_1^A)^2 (X_1^A)^2 \right] - \mathbb{E}\left[ \varepsilon_1^A X_1^A \right]^2 \right\}$$
$$= \frac{1}{\alpha_A^2} \left\{ \mathbb{E}\left[ \varepsilon_1^A \right] \mathbb{E}\left[ (X_1^A)^2 \right] - \mathbb{E}\left[ \varepsilon_1^A \right]^2 \mathbb{E}\left[ X_1^A \right]^2 \right\}, \text{ by Assumption A2,}$$
$$= \frac{1}{\alpha_A}\, \mathbb{E}\left[ (X_1^A)^2 \right] - m_{X^A}^2$$
$$= \frac{1}{\alpha_A}\left( \sigma_{X^A}^2 + m_{X^A}^2 \right) - m_{X^A}^2 , \text{ according to Assumption A3,}$$
$$= \frac{1}{\alpha_A}\, \sigma_{X^A}^2 + \frac{1-\alpha_A}{\alpha_A}\, m_{X^A}^2 .$$
The same stands for $\mathrm{Var}(\widetilde{Y}_1^A)$, $\mathrm{Var}(\widetilde{X}_1^B)$ and $\mathrm{Var}(\widetilde{Y}_1^B)$, and very similar steps allow us to get the values of $\mathrm{Cov}(\widetilde{X}_1^A, \widetilde{Y}_1^A)$ and $\mathrm{Cov}(\widetilde{X}_1^B, \widetilde{Y}_1^B)$. For the covariances across populations,
$$\mathrm{Cov}\left(\widetilde{X}_1^A, \widetilde{X}_1^B\right) = \mathbb{E}\left[ \frac{\varepsilon_1^A X_1^A}{\alpha_A} \frac{\varepsilon_1^B X_1^B}{\alpha_B} \right] - \mathbb{E}\left[\widetilde{X}_1^A\right] \mathbb{E}\left[\widetilde{X}_1^B\right] = -m_{X^A} m_{X^B} , \quad \text{as } \varepsilon_1^A \varepsilon_1^B = 0,$$
and the same formula can be derived for $\mathrm{Cov}(\widetilde{X}_1^A, \widetilde{Y}_1^B)$, $\mathrm{Cov}(\widetilde{Y}_1^A, \widetilde{X}_1^B)$ and $\mathrm{Cov}(\widetilde{Y}_1^A, \widetilde{Y}_1^B)$.
Proof [Proof of Proposition 5] Define a continuous function $g$ from $\mathbb{R}^4$ to $\mathbb{R}$ by
$$\forall (x_A, y_A, x_B, y_B) \in \mathbb{R}^4, \quad g\left( (x_A, y_A, x_B, y_B)^T \right) \stackrel{\mathrm{def}}{=} x_B - x_A = (-1, 0, 1, 0)\, (x_A, y_A, x_B, y_B)^T ,$$
so that
$$\sqrt{n}\left[ S_n^{\widetilde{X}^B} - S_n^{\widetilde{X}^A} - (m_{X^B} - m_{X^A}) \right] = g\left( \sqrt{n} \begin{pmatrix} S_n^{\widetilde{X}^A} - m_{X^A} \\ S_n^{\widetilde{Y}^A} - m_{Y^A} \\ S_n^{\widetilde{X}^B} - m_{X^B} \\ S_n^{\widetilde{Y}^B} - m_{Y^B} \end{pmatrix} \right).$$
Then, by the continuous mapping theorem and Theorem 4, $\sqrt{n}\left[ S_n^{\widetilde{X}^B} - S_n^{\widetilde{X}^A} - (m_{X^B} - m_{X^A}) \right]$ converges in distribution to a normal random variable of mean 0 and variance
$$(-1, 0, 1, 0)\; \Sigma_{\widetilde{X}_1^A, \widetilde{Y}_1^A, \widetilde{X}_1^B, \widetilde{Y}_1^B}\; (-1, 0, 1, 0)^T .$$
Before moving to the proofs of the ratio CLTs, we need two intermediary lemmas.

Lemma 14 Under Assumptions A3-4, we have
$$m_{X^A} > 0 , \quad m_{Y^A} > 0 , \quad m_{X^B} > 0 , \quad m_{Y^B} > 0 .$$
Proof According to Assumption A4, $X_1^A \ge 0$ almost surely, which implies that $m_{X^A} = \mathbb{E}[X_1^A] \ge 0$. Furthermore, by the Markov inequality, for any $n \ge 1$ we have
$$\mathbb{P}\left\{ X_1^A \ge 1/n \right\} \le n\, m_{X^A} .$$
Since $X_1^A$ is not almost surely zero, the left hand side is positive for $n$ large enough, so that $m_{X^A} > 0$. The same reasoning applies to $Y_1^A$, $X_1^B$ and $Y_1^B$.
Lemma 15 Under Assumptions A1-5, there exists a constant $c \in [0, 1)$ such that
$$\mathbb{P}\left\{ S_n^{\widetilde{X}^A} = 0 \right\} \le c^n , \quad \mathbb{P}\left\{ S_n^{\widetilde{Y}^A} = 0 \right\} \le c^n , \quad \mathbb{P}\left\{ S_n^{\widetilde{X}^B} = 0 \right\} \le c^n , \quad \mathbb{P}\left\{ S_n^{\widetilde{Y}^B} = 0 \right\} \le c^n .$$
Proof We have
$$\mathbb{P}\left\{ S_n^{\widetilde{X}^A} = 0 \right\} = \mathbb{P}\left\{ \widetilde{X}_1^A = 0 \right\}^n , \text{ by Assumptions A1 and A4,}$$
$$= \mathbb{P}\left\{ \varepsilon_1^A X_1^A = 0 \right\}^n , \text{ by Definition 1,}$$
$$= \left( 1 - \mathbb{P}\left\{ \varepsilon_1^A X_1^A > 0 \right\} \right)^n$$
$$= \left( 1 - \mathbb{P}\left\{ \varepsilon_1^A > 0,\; X_1^A > 0 \right\} \right)^n$$
$$= \left( 1 - \mathbb{P}\left\{ \varepsilon_1^A > 0 \right\} \mathbb{P}\left\{ X_1^A > 0 \right\} \right)^n , \text{ by Assumption A2,}$$
$$= \left( 1 - \alpha_A\, \mathbb{P}\left\{ X_1^A > 0 \right\} \right)^n , \text{ by Assumption A5,}$$
where $1 - \alpha_A\, \mathbb{P}\{X_1^A > 0\} \in [0, 1)$ by Assumption A4. The same steps applied to $S_n^{\widetilde{Y}^A}$, $S_n^{\widetilde{X}^B}$ and $S_n^{\widetilde{Y}^B}$ conclude the proof, taking for $c$ the largest of the four constants.
Proof [Proof of Proposition 6] We apply Corollary 13 in Appendix A with $X_n = S_n^{\widetilde{X}^B}$ and $Y_n = S_n^{\widetilde{X}^A}$. Its assumptions are all satisfied:
1. $m_{X^A} \neq 0$ by Lemma 14,
2. $S_n^{\widetilde{X}^A} \xrightarrow{\mathbb{P}} m_{X^A}$ by the law of large numbers,
3. according to Lemma 15, $\mathbb{P}\left\{ S_n^{\widetilde{X}^A} = 0 \right\} \le c^n$,
4. according to Theorem 4,
$$\sqrt{n}\begin{pmatrix} S_n^{\widetilde{X}^B} - m_{X^B} \\ S_n^{\widetilde{X}^A} - m_{X^A} \end{pmatrix} \xrightarrow{\mathcal{D}} \mathcal{N}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \widetilde{\sigma}_{X^B}^2 & -m_{X^A} m_{X^B} \\ -m_{X^A} m_{X^B} & \widetilde{\sigma}_{X^A}^2 \end{pmatrix} \right).$$
The corollary then gives convergence to $\mathcal{N}(0, P^T \Sigma P)$ with
$$P = \left( \frac{1}{m_{X^A}} ,\; -\frac{m_{X^B}}{m_{X^A}^2} \right)^T ,$$
and a direct computation of $P^T \Sigma P$ gives the announced variance.
Proof [Proof of Proposition 7] We apply Proposition 12 in Appendix A with $X_n = S_n^{\widetilde{X}^A}$, $Y_n = S_n^{\widetilde{Y}^A}$, $X_n' = S_n^{\widetilde{X}^B}$, and $Y_n' = S_n^{\widetilde{Y}^B}$. Its assumptions are all satisfied:
1. $m_{Y^A} \neq 0$ and $m_{Y^B} \neq 0$ by Lemma 14,
2. $S_n^{\widetilde{Y}^A} \xrightarrow{\mathbb{P}} m_{Y^A}$ and $S_n^{\widetilde{Y}^B} \xrightarrow{\mathbb{P}} m_{Y^B}$,
3. according to Lemma 15, $\mathbb{P}\left\{ S_n^{\widetilde{Y}^A} = 0 \right\} \le c^n$ and $\mathbb{P}\left\{ S_n^{\widetilde{Y}^B} = 0 \right\} \le c^n$,
4. according to Theorem 4,
$$\sqrt{n}\begin{pmatrix} S_n^{\widetilde{X}^A} - m_{X^A} \\ S_n^{\widetilde{Y}^A} - m_{Y^A} \\ S_n^{\widetilde{X}^B} - m_{X^B} \\ S_n^{\widetilde{Y}^B} - m_{Y^B} \end{pmatrix} \xrightarrow{\mathcal{D}} \mathcal{N}\left( 0,\; \Sigma_{\widetilde{X}_1^A, \widetilde{Y}_1^A, \widetilde{X}_1^B, \widetilde{Y}_1^B} \right),$$
where $\Sigma_{\widetilde{X}_1^A, \widetilde{Y}_1^A, \widetilde{X}_1^B, \widetilde{Y}_1^B}$ is defined in Theorem 4 and
$$P \stackrel{\mathrm{def}}{=} \begin{pmatrix} \frac{1}{m_{Y^A}} & 0 \\ -\frac{m_{X^A}}{m_{Y^A}^2} & 0 \\ 0 & \frac{1}{m_{Y^B}} \\ 0 & -\frac{m_{X^B}}{m_{Y^B}^2} \end{pmatrix} .$$
A direct computation of $P^T \Sigma P$ gives the diagonal matrix with entries $V_A$ and $V_B$, the cross terms cancelling out.
Proof [Proof of Proposition 8] The proof follows the same steps as the one of Proposition 5.

Proof [Proof of Proposition 9] This is an application of Corollary 13 in Appendix A with $X_n = S_n^{\widetilde{X}^B}/\varphi\left(S_n^{\widetilde{Y}^B}\right)$ and $Y_n = S_n^{\widetilde{X}^A}/\varphi\left(S_n^{\widetilde{Y}^A}\right)$. Its assumptions are all satisfied:
1. $m_{X^A}/m_{Y^A} \neq 0$ by Lemma 14,
2. by Lemma 10, $\varphi\left(S_n^{\widetilde{Y}^A}\right) \xrightarrow{\mathbb{P}} m_{Y^A}$ and we can apply the continuous mapping theorem to get $S_n^{\widetilde{X}^A}/\varphi\left(S_n^{\widetilde{Y}^A}\right) \xrightarrow{\mathbb{P}} m_{X^A}/m_{Y^A}$,
3. according to Lemma 15, we have $\mathbb{P}\left\{ S_n^{\widetilde{X}^A}/\varphi\left(S_n^{\widetilde{Y}^A}\right) = 0 \right\} = \mathbb{P}\left\{ S_n^{\widetilde{X}^A} = 0 \right\} \le c^n$,
4. the central limit theorem is stated in Proposition 7.
References

Amazon. The Math Behind A/B Testing. https://developer.amazon.com/sdk/ab-testing/reference/ab-math.html, 2010. [Online].

Linda Crocker and James Algina. Introduction to classical and modern test theory. ERIC, 1986.

Thomas Crook, Brian Frasca, Ron Kohavi, and Roger Longbotham. Seven pitfalls to avoid when running controlled experiments on the web. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1105-1114. ACM, 2009.

Frederica Darema. Dynamic data driven applications systems: A new paradigm for application simulations and measurements. In Marian Bubak, GeertDick van Albada, Peter M.A. Sloot, and Jack Dongarra, editors, Computational Science - ICCS 2004, volume 3038 of Lecture Notes in Computer Science, pages 662-669. Springer Berlin Heidelberg, 2004. ISBN 978-3-540-22116-6. doi: 10.1007/978-3-540-24688-6_86. URL http://dx.doi.org/10.1007/978-3-540-24688-6_86.

Bradley Efron and Robert J Tibshirani. Confidence intervals based on bootstrap percentiles. In An introduction to the bootstrap, pages 168-177. Springer, 1993.

Avinash Kaushik. Experimentation and Testing: A Primer. http://www.kaushik.net/avinash/experimentation-and-testing-a-primer/, 2006. [Online].

KDD. 2012 KDD Cup. https://www.kddcup2012.org/c/kddcup2012-track2, 2012. [Online].

Geoffrey Keppel. Design and analysis: A researcher's handbook. Prentice-Hall, Inc, 1991.

Ron Kohavi, Roger Longbotham, Dan Sommerfield, and Randal M Henne. Controlled experiments on the web: survey and practical guide. Data Mining and Knowledge Discovery, 18(1):140-181, 2009.

Ron Kohavi, Alex Deng, Brian Frasca, Roger Longbotham, Toby Walker, and Ya Xu. Trustworthy online controlled experiments: Five puzzling outcomes explained. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 786-794. ACM, 2012.

George Marsaglia. Ratios of normal variables. Journal of Statistical Software, 16(4):1-10, 2006.

Nikunj C Oza. Online bagging and boosting. In Systems, man and cybernetics, 2005 IEEE international conference on, volume 3, pages 2340-2345. IEEE, 2005.

Nikunj C Oza and Stuart Russell. Experimental comparisons of online and batch versions of bagging and boosting. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 359-364. ACM, 2001.

Laura Swanson. A Primer on A/B Testing. http://alistapart.com/article/a-primer-on-a-b-testing/, 2011. [Online].

Andrew R. Willan and Andrew H. Briggs. Statistical Analysis of Cost-Effectiveness Data. John Wiley & Sons, 2006.

Weinan Zhang, Shuai Yuan, and Jun Wang. Optimal real-time bidding for display advertising. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014.