
CS109/Stat121/AC209/E-109

Data Science
Bias and Sampling
Hanspeter Pfister & Joe Blitzstein
pfister@seas.harvard.edu / blitzstein@stat.harvard.edu
This Week
HW1 due tonight at 11:59 pm

HW2 will be posted by tonight; start soon!

Friday lab 10-11:30 am in MD G115


Pandas with Rahul, Brandon, and Steffen
Some Forms of Bias
selection bias
publication bias (file drawer problem)
censoring bias
length bias
sampling bias
Longevity Study

Profession        Average Longevity
chocolate maker   73.6
professors        66.6
clocksmiths       55.3
locksmiths        47.2
students          20.2

Sources: Lombard (1835), Wainer (1999), Stigler (2002)


Figure 2: Brand Keyword Click Substitution
(a) MSN Test (x-axis: 01jan2012 to 24jun2012) (b) Google Test (x-axis: 01jun2012 to 14sep2012)
Legend: MSN Paid, MSN Natural, Goog Natural, Goog Paid, Goog Natural

MSN and Google click traffic is shown for two events where paid search was suspended (Left)
and suspended and resumed (Right).

To quantify this substitution, Table 1 shows estimates from a simple pre-post comparison
as well as a simple difference-in-differences across search platforms. In the pre-post analysis
we regress the log of total daily clicks from MSN to eBay on an indicator for whether days

Result from Blake-Nosko-Tadelis (2013)
http://conference.nber.org/confer/2013/EoDs13/Tadelis.pdf
(Challenger Disaster) Wainer (2000), Visual Revelations
Why sample from a population?
often the only feasible option
but it's useful to think about the question:
What would you do if you had all the data?
also often important for computational reasons
There are many sampling schemes...
simple random sampling
stratified sampling
cluster sampling
snowball sampling
Absolute vs. relative
In simple random sampling, which matters more: the relative
sample size, or the absolute sample size?

For example, how much bigger a sample should you collect
in China vs. in the US, to get the same standard error?
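
A minimal sketch of the comparison, assuming simple random sampling without replacement and the usual finite population correction, SE = sqrt(sigma^2 / n * (N - n) / (N - 1)). The population sizes, the standard deviation, and the target standard error below are rough illustrative assumptions, not figures from the lecture.

```python
import math

def srs_standard_error(sigma, n, N):
    """Standard error of the sample mean under simple random sampling
    without replacement from a population of size N."""
    fpc = (N - n) / (N - 1)  # finite population correction
    return math.sqrt(sigma**2 / n * fpc)

def required_sample_size(sigma, target_se, N):
    """Sample size n that achieves target_se (solve the SE formula for n)."""
    return sigma**2 * N / (target_se**2 * (N - 1) + sigma**2)

# Rough, assumed population sizes (illustration only).
N_CHINA = 1_400_000_000
N_US = 330_000_000

n_china = required_sample_size(sigma=1.0, target_se=0.01, N=N_CHINA)
n_us = required_sample_size(sigma=1.0, target_se=0.01, N=N_US)
print(f"China: n = {n_china:.1f}, US: n = {n_us:.1f}")
```

Under these assumptions both answers are essentially sigma^2 / SE^2: the absolute sample size drives the standard error, and the population size barely matters once N is much larger than n.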
Snowball Sampling (Link-Tracing)

[Figure: two network diagrams, (a) Stage 1 and (b) Stage 2, with nodes labelled 0, 1, or 2]

Figure 2: Two successive stages of k = 3 snowball sampling. Nodes have been labelled with the
stage number of when they first appeared in the sample. Node 0 was the original node, acquired
via Bernoulli sampling.
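
One way to sketch the link-tracing idea in code, assuming each sampled node names up to k of its neighbours uniformly at random and that the seed node is simply given (the Bernoulli seed selection is not modelled). The function name `snowball_sample` and the toy ring graph are hypothetical, for illustration only.

```python
import random

def snowball_sample(neighbors, seeds, k, stages, rng):
    """Return {node: stage at which the node first entered the sample}."""
    stage_of = {s: 0 for s in seeds}
    frontier = list(seeds)
    for stage in range(1, stages + 1):
        next_frontier = []
        for node in frontier:
            candidates = neighbors[node]
            # Each node in the current wave names up to k neighbours.
            named = rng.sample(candidates, min(k, len(candidates)))
            for other in named:
                if other not in stage_of:  # record first appearance only
                    stage_of[other] = stage
                    next_frontier.append(other)
        frontier = next_frontier
    return stage_of

# Toy graph (assumed): 10 nodes on a ring, each linked to the 2 nearest on each side.
neighbors = {i: [(i + d) % 10 for d in (-2, -1, 1, 2)] for i in range(10)}
stage_of = snowball_sample(neighbors, seeds=[0], k=3, stages=2, rng=random.Random(0))
print(stage_of)
```

Nodes with many links are more likely to be named, so the resulting sample is not a simple random sample of the population.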
Bias of an Estimator
The bias of an estimator is how far off it is on average:

$\operatorname{bias}(\hat{\theta}) = E(\hat{\theta}) - \theta$

So why not just subtract off the bias?


Bias-Variance Tradeoff
one form: $\operatorname{MSE}(\hat{\theta}) = \operatorname{Var}(\hat{\theta}) + \operatorname{bias}^2(\hat{\theta})$
often a little bit of bias can make it
possible to have much lower MSE

http://scott.fortmann-roe.com/docs/BiasVariance.html
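
A standard illustration of the tradeoff (not taken from the slides): estimating a variance from n i.i.d. normal observations by dividing the sum of squared deviations by n - 1, n, or n + 1. The sketch below uses the exact formulas that follow from the sum of squares being sigma^2 times a chi-square variable with n - 1 degrees of freedom; the function name is hypothetical.

```python
def variance_estimator_mse(n, divisor, sigma2=1.0):
    """Exact bias, variance, and MSE of SS / divisor, where SS is the sum of
    squared deviations from the sample mean of n i.i.d. normal observations."""
    mean = (n - 1) * sigma2 / divisor          # E[SS] = (n - 1) * sigma^2
    bias = mean - sigma2
    var = 2 * (n - 1) * sigma2**2 / divisor**2  # Var[SS] = 2 (n - 1) sigma^4
    return bias, var, var + bias**2             # MSE = Var + bias^2

n = 10
for divisor in (n - 1, n, n + 1):
    bias, var, mse = variance_estimator_mse(n, divisor)
    print(f"divisor={divisor:2d}  bias={bias:+.4f}  var={var:.4f}  mse={mse:.4f}")
```

Dividing by n - 1 is unbiased, but the biased divisor n + 1 has the smallest MSE of the three in this normal setting.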
Unbiased Estimation: Poisson Example
$X \sim \operatorname{Pois}(\lambda)$

Goal: estimate $e^{-2\lambda}$

$(-1)^X$ is the best (and only) unbiased estimator of $e^{-2\lambda}$

sensible?
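
A quick numerical check of the unbiasedness claim, as a sketch: sum (-1)^k times the Poisson probability of k and compare the result with exp(-2 * lambda). The value lambda = 1.5 and the function name are arbitrary choices for illustration.

```python
import math

def expected_sign_estimator(lam, terms=200):
    """E[(-1)^X] for X ~ Pois(lam), via a truncated sum over the pmf."""
    total = 0.0
    pmf = math.exp(-lam)  # P(X = 0)
    for k in range(terms):
        total += (-1) ** k * pmf
        pmf *= lam / (k + 1)  # P(X = k + 1) from P(X = k)
    return total

lam = 1.5
expectation = expected_sign_estimator(lam)
target = math.exp(-2 * lam)
print(expectation, target)
```

The expectation matches the target, yet the estimator only ever takes the values +1 and -1, while the quantity being estimated lies strictly between 0 and 1.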
Basu's Elephant

Estimate the total weight of 50 elephants.


Horvitz-Thompson Estimator
Estimate the total of some variable for a finite population:

$\hat{T}_y = \sum_{i \in S} \frac{y_i}{\pi_i}$

where $S$ is the sample and $\pi_i > 0$ is the probability of $i$ being in the sample

Unbiased! But what about the variance?
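
A small sketch of both points, assuming a design in which each unit is included independently with its own probability (often called Poisson sampling). The four y values and the two sets of inclusion probabilities are made up for illustration; the code enumerates every possible sample to get the exact mean and variance of the estimator.

```python
from itertools import product

def horvitz_thompson_total(y, pi, sample):
    """Sum of y_i / pi_i over the sampled units."""
    return sum(y[i] / pi[i] for i in sample)

def exact_mean_and_variance(y, pi):
    """Exact mean and variance of the estimator when unit i is included
    independently with probability pi[i] (all 2^N samples enumerated)."""
    units = range(len(y))
    mean = 0.0
    second_moment = 0.0
    for flags in product([0, 1], repeat=len(y)):
        prob = 1.0
        for i in units:
            prob *= pi[i] if flags[i] else 1 - pi[i]
        sample = [i for i in units if flags[i]]
        estimate = horvitz_thompson_total(y, pi, sample)
        mean += prob * estimate
        second_moment += prob * estimate**2
    return mean, second_moment - mean**2

y = [10.0, 20.0, 30.0, 400.0]  # hypothetical values, true total 460
mean_equal, var_equal = exact_mean_and_variance(y, [0.5, 0.5, 0.5, 0.5])
mean_sized, var_sized = exact_mean_and_variance(y, [0.1, 0.2, 0.3, 0.95])
print(mean_equal, var_equal)
print(mean_sized, var_sized)
```

Both designs give an expected value equal to the true total, but the variance differs by an order of magnitude: the estimator is unbiased for any positive inclusion probabilities, while its variance depends heavily on how well they match the y values.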


Fisher Weighting
How should we combine independent, unbiased
estimators for a parameter into one estimator?

$\hat{\theta} = \sum_{i=1}^{k} w_i \hat{\theta}_i$

The weights should sum to 1, but how should they be chosen?

$w_i \propto \frac{1}{\operatorname{Var}(\hat{\theta}_i)}$

(Inversely proportional to variance; why not SD?)
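
A sketch that compares three weighting schemes for independent, unbiased estimators. The combined variance is the sum of w_i^2 * Var_i, and minimizing it subject to the weights summing to 1 gives weights proportional to 1 / Var_i. The three variances below are made-up values for illustration.

```python
def normalized(raw_weights):
    """Rescale weights so they sum to 1."""
    total = sum(raw_weights)
    return [w / total for w in raw_weights]

def combined_variance(weights, variances):
    """Variance of sum(w_i * estimator_i) for independent estimators."""
    return sum(w * w * v for w, v in zip(weights, variances))

variances = [1.0, 4.0, 9.0]  # hypothetical variances of three estimators

equal = normalized([1.0 for _ in variances])
inverse_sd = normalized([1.0 / v**0.5 for v in variances])
inverse_var = normalized([1.0 / v for v in variances])

var_equal = combined_variance(equal, variances)
var_inverse_sd = combined_variance(inverse_sd, variances)
var_inverse_var = combined_variance(inverse_var, variances)
print(var_equal, var_inverse_sd, var_inverse_var)
```

With these values the inverse-variance weights give the smallest combined variance, equal to 1 / sum(1 / Var_i); inverse-SD weights improve on equal weights but do not reach that minimum.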
