
Business Statistics 41000-03/81 Cheat Sheet for final exam
Fall 2009, Hedibert F. Lopes

Exploratory data analysis

Sample: $x_1, \ldots, x_n$ and $y_1, \ldots, y_n$. Mean: $\bar{x} = \sum_{i=1}^n x_i / n$.
Variance and standard deviation: $s_x^2 = \sum_{i=1}^n (x_i - \bar{x})^2/(n-1)$ and $s_x = \sqrt{s_x^2}$.

Skewness $= \frac{n}{(n-1)(n-2)} \sum_{i=1}^n \left((x_i - \bar{x})/s_x\right)^3$.

Negative skewness means a longer left tail, while positive skewness means a longer right tail.

Excess kurtosis $= \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^n \left((x_i - \bar{x})/s_x\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$.

High kurtosis results in exceptional values that are called fat tails or heavy tails. Fat/heavy tails indicate a higher percentage of very low and very high returns than would be expected under a bell-shaped (normal) distribution.

Covariance and correlation: $s_{xy} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})/(n-1)$ and $r_{xy} = s_{xy}/\sqrt{s_x^2 s_y^2}$.
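For a numerical check, here is a minimal Python sketch (not part of the original cheat sheet) that evaluates the formulas above with NumPy; the arrays x and y are made-up data for illustration.

```python
import numpy as np

x = np.array([1.2, 0.5, 2.3, 1.8, 0.9, 3.1, 1.4, 2.0])   # made-up sample
y = np.array([0.8, 0.4, 2.1, 1.5, 1.1, 2.8, 1.0, 2.2])   # made-up sample
n = len(x)

xbar = x.sum() / n                                   # sample mean
s2_x = ((x - xbar) ** 2).sum() / (n - 1)             # sample variance
s_x = np.sqrt(s2_x)                                  # sample standard deviation

z = (x - xbar) / s_x
skew = n / ((n - 1) * (n - 2)) * (z ** 3).sum()      # sample skewness
exkurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * (z ** 4).sum()
          - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))  # sample excess kurtosis

ybar = y.mean()
s_xy = ((x - xbar) * (y - ybar)).sum() / (n - 1)     # sample covariance
r_xy = s_xy / (s_x * y.std(ddof=1))                  # sample correlation

print(xbar, s_x, skew, exkurt, s_xy, r_xy)
```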

The closer $r_{xy}$ is to 1, the stronger the linear relationship with a positive slope: when one variable goes up, the other tends to go up. The closer $r_{xy}$ is to -1, the stronger the linear relationship with a negative slope: when one goes up, the other tends to go down.

If $p = ax + by + cz$, then $s_p^2 = a^2 s_x^2 + b^2 s_y^2 + c^2 s_z^2 + 2(ab\,s_{xy} + ac\,s_{xz} + bc\,s_{yz})$ and $\bar{p} = a\bar{x} + b\bar{y} + c\bar{z}$. An example of the above linear combination is portfolio allocation, where x, y and z are the firms (assets) and a, b and c are their weights, with $a + b + c = 1$ (see the sketch after the boxplot below).

Boxplot: Quartiles split the whole data into four quarters. Q1 = median of the first half of the data. Q2 = median of the whole data. Q3 = median of the second half of the data.

[Boxplot: box from Q1 to Q3 with the median Q2 inside; whiskers at Q1 - 1.5*(Q3 - Q1) and Q3 + 1.5*(Q3 - Q1).]
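The portfolio variance formula and the boxplot quantities can both be checked numerically. Below is a minimal sketch, assuming three hypothetical return series x, y, z and arbitrary weights a = 0.5, b = 0.3, c = 0.2; it compares the formula with the variance of the combined series p and prints the quartiles and whisker limits.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y, z = rng.normal(size=(3, 500))          # made-up return series
a, b, c = 0.5, 0.3, 0.2                      # weights, a + b + c = 1

p = a * x + b * y + c * z
cov = np.cov(np.vstack([x, y, z]))           # sample variances and covariances

var_formula = (a**2 * cov[0, 0] + b**2 * cov[1, 1] + c**2 * cov[2, 2]
               + 2 * (a * b * cov[0, 1] + a * c * cov[0, 2] + b * c * cov[1, 2]))
print(var_formula, p.var(ddof=1))            # the two numbers agree

# Quartiles and boxplot whisker limits
q1, q2, q3 = np.percentile(p, [25, 50, 75])
iqr = q3 - q1
print(q1 - 1.5 * iqr, q1, q2, q3, q3 + 1.5 * iqr)
```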

Basic probability
Let $p(x, y)$ denote the joint distribution of discrete random variables X and Y. Then the marginal distribution of X can be obtained from the joint as
$$p(x) = \sum_y p(x, y),$$
while the conditional distribution of X given that Y = y is
$$p(x \mid y) = \frac{p(x, y)}{p(y)} = \frac{\text{joint}}{\text{marginal}}.$$

The table below is one example: the probability that Y = happy is equal to 0.46, the probability that X = 17.5 (high salary) is equal to 0.24, while the probability that Y = happy AND X = 17.5 is equal to 0.14. Mathematically, Pr(Y = happy) = 0.46, Pr(X = 17.5) = 0.24 and Pr(Y = happy, X = 17.5) = 0.14.

                      Happiness (Y)
    Salary (X)    unhappy     ok    happy    P(X)
       2.5          0.03     0.12    0.07    0.22
       7.5          0.02     0.13    0.11    0.26
      12.5          0.01     0.13    0.14    0.28
      17.5          0.01     0.09    0.14    0.24
      P(Y)          0.07     0.47    0.46    1.00

Conditional on X = 17.5, i.e. conditional on the subset whose salary is 17.5, we can completely ignore the top three rows (corresponding to salaries of 2.5, 7.5 and 12.5) from the joint distribution.

    Y                  unhappy    ok    happy
    P(Y | X = 17.5)      0.01    0.09    0.14

However, the probabilities in the new table do not add up to 1.0 and should be normalized by 0.24, their total. Therefore, the new table with the conditional probability distribution of Y conditional on X = 17.5 is given by

    Y                  unhappy     ok     happy
    P(Y | X = 17.5)     0.0417   0.3750   0.5833
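To make the normalization step concrete, the following minimal Python sketch (not from the cheat sheet) recomputes the marginals and the conditional distribution P(Y | X = 17.5) from the joint table above.

```python
import numpy as np

salary = [2.5, 7.5, 12.5, 17.5]
happiness = ["unhappy", "ok", "happy"]
joint = np.array([[0.03, 0.12, 0.07],
                  [0.02, 0.13, 0.11],
                  [0.01, 0.13, 0.14],
                  [0.01, 0.09, 0.14]])   # rows: salary, columns: happiness

p_x = joint.sum(axis=1)                  # marginal of X: 0.22, 0.26, 0.28, 0.24
p_y = joint.sum(axis=0)                  # marginal of Y: 0.07, 0.47, 0.46

row = salary.index(17.5)
p_y_given_x = joint[row] / p_x[row]      # conditional P(Y | X = 17.5)
for h, pr in zip(happiness, p_y_given_x):
    print(h, round(float(pr), 4))        # unhappy 0.0417, ok 0.375, happy 0.5833
```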

Bernoulli, binomial and normal distributions


$X \sim \text{Bernoulli}(p) \Leftrightarrow X \in \{0, 1\}$ and $\Pr(X = 1) = p$.

$X \sim \text{Binomial}(n, p) \Leftrightarrow X \in \{0, 1, \ldots, n\}$ and
$$\Pr(X = x) = \frac{n!}{x!(n-x)!}\, p^x (1-p)^{n-x}, \qquad n! = n \cdot (n-1) \cdots 2 \cdot 1.$$

If $X_1, \ldots, X_n$ are independent and identically distributed (i.i.d.) Bernoulli(p), then $X = \sum_{i=1}^n X_i \sim \text{Binomial}(n, p)$.

Standard normal distribution: $Z \sim N(0, 1)$. The Z-table is the standard normal table. Also, $\Pr(Z \in (-1, 1)) \approx 0.68$ and $\Pr(Z \in (-2, 2)) \approx 0.95$.

Other normal distributions: $X \sim N(\mu, \sigma^2)$. Here, $\Pr(X \in (\mu - \sigma, \mu + \sigma)) \approx 0.68$ and $\Pr(X \in (\mu - 2\sigma, \mu + 2\sigma)) \approx 0.95$.

Cumulative distribution: $F(x) = \Pr(X \leq x)$; so $\Pr(a < X \leq b) = F(b) - F(a)$.
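A minimal sketch of these distributions, using only the standard library; the function names binom_pmf and normal_cdf and the chosen parameter values are illustrative, not part of the cheat sheet.

```python
from math import comb, erf, sqrt

def binom_pmf(x, n, p):
    """Pr(X = x) for X ~ Binomial(n, p): n!/(x!(n-x)!) p^x (1-p)^(n-x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def normal_cdf(x, mu=0.0, sigma=1.0):
    """F(x) = Pr(X <= x) for X ~ N(mu, sigma^2), via the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

print(binom_pmf(3, 10, 0.4))                       # Pr(X = 3) for X ~ Binomial(10, 0.4)
print(normal_cdf(1) - normal_cdf(-1))              # ~0.68
print(normal_cdf(2) - normal_cdf(-2))              # ~0.95

mu, sigma = 5.0, 2.0                               # X ~ N(5, 4)
print(normal_cdf(mu + 2*sigma, mu, sigma) - normal_cdf(mu - 2*sigma, mu, sigma))  # ~0.95
```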

Mean, variance, covariance and correlation of discrete random variables


Mean and variance: $\mu = \sum_i x_i\, p(x_i)$ and $\sigma^2 = \sum_i (x_i - \mu)^2\, p(x_i)$.

Covariance and correlation: $\sigma_{xy} = \sum_i (x_i - \mu_x)(y_i - \mu_y)\, p(x_i, y_i)$ and $\rho_{xy} = \sigma_{xy}/(\sigma_x \sigma_y)$.

If X and Y are independent, then $\sigma_{xy} = \rho_{xy} = 0$. If $\sigma_{xy} = \rho_{xy} = 0$, then X and Y are not necessarily independent.
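As an illustration of the discrete-variable formulas, here is a minimal sketch with a small made-up joint pmf p(x, y); summing over the joint pmf gives the same marginal means and variances as summing over p(x) or p(y) alone.

```python
# Made-up joint pmf p(x, y) over numeric values of X and Y (probabilities sum to 1).
p = {(0, 0): 0.10, (0, 1): 0.20,
     (1, 0): 0.25, (1, 1): 0.15,
     (2, 0): 0.05, (2, 1): 0.25}

mu_x = sum(x * pr for (x, y), pr in p.items())                 # mean of X
mu_y = sum(y * pr for (x, y), pr in p.items())                 # mean of Y
var_x = sum((x - mu_x) ** 2 * pr for (x, y), pr in p.items())  # variance of X
var_y = sum((y - mu_y) ** 2 * pr for (x, y), pr in p.items())  # variance of Y
cov_xy = sum((x - mu_x) * (y - mu_y) * pr for (x, y), pr in p.items())
rho_xy = cov_xy / (var_x ** 0.5 * var_y ** 0.5)                # correlation

print(mu_x, var_x, cov_xy, rho_xy)
```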

Central limit theorem and approximate confidence intervals


If $X_1, X_2, \ldots, X_n$ are i.i.d. observations from ANY population with mean $\mu$ and variance $\sigma^2$, then for large values of n the sample mean $\bar{x}$ behaves like a $N(\mu, \sigma^2/n)$. In particular, $\hat{p} \sim N(p, p(1-p)/n)$ approximately.

Approximate 95% C.I. for the population mean $\mu$: $\bar{x} \pm 2\,\text{se}(\bar{x})$, where $\text{se}(\bar{x}) = \sqrt{s^2/n}$ and $s^2$ is the sample variance.

Approximate 95% C.I. for the population proportion p: $\hat{p} \pm 2\,\text{se}(\hat{p})$, where $\hat{p}$ is the sample proportion of successes and $\text{se}(\hat{p}) = \sqrt{\hat{p}(1-\hat{p})/n}$.
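A minimal sketch of both intervals; the sample x and the success count are hypothetical.

```python
import numpy as np

x = np.array([10.2, 9.7, 11.1, 10.5, 9.9, 10.8, 10.1, 10.4, 9.6, 10.9])
n = len(x)
xbar, s2 = x.mean(), x.var(ddof=1)
se_mean = np.sqrt(s2 / n)
print(xbar - 2 * se_mean, xbar + 2 * se_mean)     # approx. 95% C.I. for the mean

successes, m = 37, 100                            # 37 successes in 100 trials
phat = successes / m
se_prop = np.sqrt(phat * (1 - phat) / m)
print(phat - 2 * se_prop, phat + 2 * se_prop)     # approx. 95% C.I. for the proportion
```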

Hypothesis testing for population mean $\mu$ and proportion p


$H_0: \mu = \mu_0$ versus $H_a: \mu \neq \mu_0$ at the 5% level: Reject $H_0$ if $|\text{test statistic}| = \dfrac{|\bar{x} - \mu_0|}{\text{se}(\bar{x})} > 2$.

$H_0: p = p_0$ versus $H_a: p \neq p_0$ at the 5% level: Reject $H_0$ if $|\text{test statistic}| = \dfrac{|\hat{p} - p_0|}{\text{se}(\hat{p})} > 2$.

P-value: For a point null hypothesis (as above), the P-value is twice the probability of obtaining a test statistic (or $\bar{x}$ or $\hat{p}$) at least as extreme as the one observed.
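A minimal sketch of the two-sided tests and their P-values; the sample summaries (xbar, s, n) and the null values are hypothetical.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Test H0: mu = 10 from a hypothetical sample summary (xbar, s, n)
xbar, s, n, mu0 = 10.6, 1.2, 50, 10.0
t_mu = (xbar - mu0) / (s / sqrt(n))               # test statistic
print(abs(t_mu) > 2)                              # reject at the 5% level?
print(2 * (1 - normal_cdf(abs(t_mu))))            # two-sided P-value

# Test H0: p = 0.5 with 62 successes in 100 trials
phat, p0, m = 0.62, 0.5, 100
t_p = (phat - p0) / sqrt(phat * (1 - phat) / m)   # test statistic
print(abs(t_p) > 2, 2 * (1 - normal_cdf(abs(t_p))))
```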

Simple linear regression
