
26/05/12

Linear and Logistic Regression Analysis
By:
Nurita Andayani

Introduction
Difference between chi-square and regression: the chi-square test of
independence determines whether a statistical relationship exists
between two variables. It tells us whether there is such a
relationship, but not what that relationship is. Regression and
correlation analyses show how to determine both the nature and the
strength of a relationship between two variables.
Regression analysis is a body of statistical methods dealing with the
formulation of mathematical models that depict relationships among
variables, and the use of these modeled relationships for prediction
and other statistical inferences.
The word regression was first used in its present technical context
by Sir Francis Galton, who analyzed the heights of sons and the
average heights of their parents.


Models

The independent or controlled variable is also called the predictor variable
and is denoted by x. The effect or response variable is denoted by y.
If the relation between y and x is exactly a straight line, then the variables
are connected by the formula:
$y = \alpha + \beta x$
where $\alpha$ indicates the intercept of the line with the y axis and $\beta$
represents the slope of the line, or the change in y per unit change in x.
[Figure: a straight line in the (x, y) plane; at $x_i$ the line has height $y_i = \alpha + \beta x_i$.]

Statistical Model
$Y_i = \alpha + \beta x_i + e_i, \quad i = 1, \ldots, n$
where:
a) $x_1, x_2, \ldots, x_n$ are the set values of the controlled variable x
that the experimenter has selected for the study.
b) $e_1, e_2, \ldots, e_n$ are the unknown error components that are
superimposed on the true linear relation. These are
unobservable random variables, which we assume are
independently and normally distributed with a mean of
zero and an unknown variance of $\sigma^2$.
c) The parameters $\alpha$ and $\beta$, which together locate the
straight line, are unknown.
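
To make the statistical model concrete, the following minimal Python sketch generates observations of the form $Y_i = \alpha + \beta x_i + e_i$ with independent normal errors. The function name `simulate_line`, the x values and the parameter values (alpha = 2.0, beta = 0.5, sigma = 1.0) are illustrative assumptions, not values from the slides.

```python
import random

def simulate_line(xs, alpha, beta, sigma, seed=0):
    """Draw Y_i = alpha + beta * x_i + e_i with independent e_i ~ N(0, sigma^2)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [alpha + beta * x + rng.gauss(0.0, sigma) for x in xs]

# Hypothetical settings, chosen only for illustration.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = simulate_line(xs, alpha=2.0, beta=0.5, sigma=1.0)

for x, y in zip(xs, ys):
    # The gap between y and the true line is the unobservable error e_i.
    print(x, round(y, 3), "true line:", 2.0 + 0.5 * x)
```

In a real study only the pairs (x, y) are observed; alpha, beta and sigma are unknown and have to be estimated from the data.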


Basic Notations

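$\bar{x} = \frac{1}{n} \sum x_i, \qquad \bar{y} = \frac{1}{n} \sum y_i$

$S_x^2 = \sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2$

$S_y^2 = \sum (y_i - \bar{y})^2 = \sum y_i^2 - n\bar{y}^2$

$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n\bar{x}\bar{y}$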

Example
Zippy Cola is studying the effect of its
latest advertising campaign. People
chosen at random were called and asked
how many cans of Zippy Cola they had
bought in the past week and how many
Zippy Cola advertisements they had either
read or seen in the past week.
X (number of ads)     3   7   4   2   0   4   1   2
Y (cans purchased)   11  18   9   4   7   6   3   8
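
As a quick check of the basic notation on this example, the sketch below computes $\bar{x}$, $\bar{y}$, $S_x^2$, $S_y^2$ and $S_{xy}$ for the Zippy Cola sample, once from the deviation form (for example $S_x^2 = \sum (x_i - \bar{x})^2$) and once from the shortcut form ($\sum x_i^2 - n\bar{x}^2$). It assumes the table is read pairwise (first X with first Y, and so on); the variable names are chosen for this sketch.

```python
# Zippy Cola sample, read pairwise from the example table.
x = [3, 7, 4, 2, 0, 4, 1, 2]      # number of ads
y = [11, 18, 9, 4, 7, 6, 3, 8]    # cans purchased
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Deviation (definitional) forms.
Sx2 = sum((xi - x_bar) ** 2 for xi in x)
Sy2 = sum((yi - y_bar) ** 2 for yi in y)
Sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Shortcut forms; these should agree with the values above.
Sx2_short = sum(xi * xi for xi in x) - n * x_bar ** 2
Sy2_short = sum(yi * yi for yi in y) - n * y_bar ** 2
Sxy_short = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar

print("x_bar =", x_bar, " y_bar =", y_bar)
print("Sx2 =", Sx2, " Sy2 =", Sy2, " Sxy =", Sxy)
```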


Least Squares Regression Line


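Least squares regression line:
$\hat{y} = \hat{\alpha} + \hat{\beta} x$

Least squares estimate of $\alpha$:
$\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$

Least squares estimate of $\beta$:
$\hat{\beta} = \frac{S_{xy}}{S_x^2}$

The residual sum of squares, or the sum of squares due to error, is:
$SSE = \sum_{i=1}^{n} (y_i - \hat{\alpha} - \hat{\beta} x_i)^2 = S_y^2 - \hat{\beta}^2 S_x^2$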
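
The least squares formulas can be sketched in a few lines of Python. The helper name `fit_line` is an assumption made for this example, and the data are the Zippy Cola sample read pairwise from the earlier table. SSE is computed directly from the residuals, so it can be compared with the shortcut $S_y^2 - \hat{\beta}^2 S_x^2$.

```python
def fit_line(x, y):
    """Return (alpha_hat, beta_hat, sse) for the least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    Sx2 = sum((xi - x_bar) ** 2 for xi in x)
    Sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_hat = Sxy / Sx2                   # slope estimate
    alpha_hat = y_bar - beta_hat * x_bar   # intercept estimate
    # SSE from the residuals y_i - alpha_hat - beta_hat * x_i.
    sse = sum((yi - alpha_hat - beta_hat * xi) ** 2 for xi, yi in zip(x, y))
    return alpha_hat, beta_hat, sse

x = [3, 7, 4, 2, 0, 4, 1, 2]      # number of ads
y = [11, 18, 9, 4, 7, 6, 3, 8]    # cans purchased
alpha_hat, beta_hat, sse = fit_line(x, y)
print(f"y_hat = {alpha_hat:.3f} + {beta_hat:.3f} x,  SSE = {sse:.3f}")
```

The fitted slope is read as the estimated change in cans purchased per additional advertisement seen.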

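Properties of the Least Squares Estimators
a) The least squares estimators are unbiased; that is,
$E(\hat{\alpha}) = \alpha$ and $E(\hat{\beta}) = \beta$
b) $\mathrm{Var}(\hat{\alpha}) = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_x^2} \right)$ and $\mathrm{Var}(\hat{\beta}) = \frac{\sigma^2}{S_x^2}$
c) The distributions of $\hat{\alpha}$ and $\hat{\beta}$ are normal with means of $\alpha$ and
$\beta$, respectively; the standard deviations are the square roots
of the variances given in b).
d) $s^2 = SSE/(n-2)$ is an unbiased estimator of $\sigma^2$. Also, $(n-2)s^2/\sigma^2$
is distributed as $\chi^2$ with d.f. = n - 2, and it is independent of
$\hat{\alpha}$ and $\hat{\beta}$.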


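e) Replacing $\sigma^2$ in b) with its sample estimate $s^2$ and
considering the square roots of the variances, we
obtain the estimated standard errors of $\hat{\alpha}$ and $\hat{\beta}$:
estimated standard error of $\hat{\alpha}$ = $s \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_x^2}}$
estimated standard error of $\hat{\beta}$ = $\frac{s}{S_x}$, where $S_x = \sqrt{S_x^2}$

f) $\frac{S_x (\hat{\beta} - \beta)}{s}$ has a t distribution with d.f. = n - 2

$\frac{\hat{\alpha} - \alpha}{s \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_x^2}}}$ has a t distribution with d.f. = n - 2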

Inference Concerning the Slope


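$H_0: \beta = \beta_0$ vs $H_1: \beta \neq \beta_0$ is based on
$t = \frac{S_x (\hat{\beta} - \beta_0)}{s}$, d.f. = n - 2
p% confidence interval for $\beta$:
$\hat{\beta} \pm t_{(1-CI)/2} \, \frac{s}{S_x}$
where CI = p/100 is the confidence level and $t_{(1-CI)/2}$ is the upper
$(1-CI)/2$ point of the t distribution with n - 2 d.f.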


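Inference about $\alpha$
$H_0: \alpha = \alpha_0$ vs $H_1: \alpha \neq \alpha_0$ is based on
$t = \frac{\hat{\alpha} - \alpha_0}{s \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_x^2}}}$, d.f. = n - 2

p% confidence interval for $\alpha$:
$\hat{\alpha} \pm t_{(1-CI)/2} \cdot s \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_x^2}}$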
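
The t statistics and confidence intervals for the slope and the intercept can be sketched as follows. The function name `slope_intercept_inference` is an assumption for this example, and the critical value is passed in by hand rather than computed: 2.447 is the upper 2.5% point of the t distribution with 6 d.f. from a standard t table, which gives 95% intervals for the eight Zippy Cola observations (read pairwise from the earlier table).

```python
import math

def slope_intercept_inference(x, y, t_crit, beta_0=0.0, alpha_0=0.0):
    """t statistics and confidence intervals for beta and alpha (d.f. = n - 2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    Sx2 = sum((xi - x_bar) ** 2 for xi in x)
    Sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_hat = Sxy / Sx2
    alpha_hat = y_bar - beta_hat * x_bar
    sse = sum((yi - alpha_hat - beta_hat * xi) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))                        # s^2 = SSE / (n - 2)
    se_beta = s / math.sqrt(Sx2)                        # s / S_x
    se_alpha = s * math.sqrt(1 / n + x_bar ** 2 / Sx2)  # s * sqrt(1/n + x_bar^2 / S_x^2)
    return {
        "t_beta": (beta_hat - beta_0) / se_beta,
        "ci_beta": (beta_hat - t_crit * se_beta, beta_hat + t_crit * se_beta),
        "t_alpha": (alpha_hat - alpha_0) / se_alpha,
        "ci_alpha": (alpha_hat - t_crit * se_alpha, alpha_hat + t_crit * se_alpha),
    }

x = [3, 7, 4, 2, 0, 4, 1, 2]      # number of ads
y = [11, 18, 9, 4, 7, 6, 3, 8]    # cans purchased
T_CRIT = 2.447                    # table value: upper 2.5% point of t with 6 d.f.
result = slope_intercept_inference(x, y, T_CRIT)
for name, value in result.items():
    print(name, value)
```

A hypothesis is rejected at the 5% level when the absolute t statistic exceeds the same critical value, which is equivalent to the hypothesized value falling outside the 95% interval.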

Checks on The Straight Line Model


$y_i = (\hat{\alpha} + \hat{\beta} x_i) + (y_i - \hat{\alpha} - \hat{\beta} x_i)$
observed y value = explained by linear relation + residual, or deviation from linear relation

$S_y^2 = \hat{\beta}^2 S_x^2 + SSE$
total SS of y = SS explained by linear relation + residual SS (unexplained)


Anova for checking regression model


Source       Sum of Squares   d.f.    Mean Squares        F
Regression   SSR              1       MSR = SSR/1         MSR/MSE
Error        SSE              n - 2   MSE = SSE/(n - 2)
Total        SST              n - 1

Inference for regression model


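$H_0: \beta = 0$ vs $H_1: \beta \neq 0$
Rejection region (with significance level $\alpha$; here $\alpha$ is the significance level, not the intercept):
$R: F > F_{\alpha}(1, n-2)$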
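
The ANOVA quantities can be computed directly from the basic sums of squares, using SST = $S_y^2$ and SSR = $\hat{\beta}^2 S_x^2$. The sketch below does this for the Zippy Cola sample (read pairwise from the earlier table). The function name `anova_simple_regression` is an assumption for this example, and the critical value 5.99 is the upper 5% point of F(1, 6) taken from a standard F table rather than computed.

```python
def anova_simple_regression(x, y):
    """Sums of squares, mean squares and F for a straight-line fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    Sx2 = sum((xi - x_bar) ** 2 for xi in x)
    Sy2 = sum((yi - y_bar) ** 2 for yi in y)
    Sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_hat = Sxy / Sx2
    sst = Sy2                    # total SS of y
    ssr = beta_hat ** 2 * Sx2    # SS explained by the linear relation
    sse = sst - ssr              # residual SS
    msr = ssr / 1
    mse = sse / (n - 2)
    return {"SSR": ssr, "SSE": sse, "SST": sst, "MSR": msr, "MSE": mse, "F": msr / mse}

x = [3, 7, 4, 2, 0, 4, 1, 2]      # number of ads
y = [11, 18, 9, 4, 7, 6, 3, 8]    # cans purchased
table = anova_simple_regression(x, y)
F_CRIT = 5.99                     # table value: upper 5% point of F(1, 6)
reject_h0 = table["F"] > F_CRIT

for name, value in table.items():
    print(f"{name:>3} = {value:.4f}")
print("reject H0 (beta = 0) at the 5% level:", reject_h0)
```

With a single predictor, F equals the square of the t statistic for testing $\beta = 0$, so the two tests always agree.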


The coefficient of determination


The sample coefficient of determination is
developed from the relationship between two kinds of
variation: the variation of the Y values in a data set around
   the fitted regression line
   their own mean

$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$

$0 \le R^2 \le 1$, or $0\% \le R^2 \le 100\%$
$R^2 = 1$: perfectly fitted regression line
$R^2 = 0$: unfitted regression model (the line explains none of the variation in Y)

The coefficient of correlation

The coefficient of correlation (r) indicates the direction of
the relationship between the two variables X and Y.
If an inverse relationship exists (that is, if Y decreases
as X increases), then r will fall between 0 and -1.
If there is a direct relationship (if Y increases as X
increases), then r will be a value within the range 0
to 1.

$r = \frac{S_{xy}}{\sqrt{S_x^2 \cdot S_y^2}}$
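
For a straight-line fit the two coefficients are linked: $R^2 = SSR/SST$ equals $r^2$, and r carries the sign of $S_{xy}$. The sketch below checks this on the Zippy Cola sample (read pairwise from the earlier table); the variable names are chosen for this example.

```python
import math

x = [3, 7, 4, 2, 0, 4, 1, 2]      # number of ads
y = [11, 18, 9, 4, 7, 6, 3, 8]    # cans purchased
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

Sx2 = sum((xi - x_bar) ** 2 for xi in x)
Sy2 = sum((yi - y_bar) ** 2 for yi in y)
Sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

r = Sxy / math.sqrt(Sx2 * Sy2)     # coefficient of correlation
ssr = Sxy ** 2 / Sx2               # same as beta_hat^2 * S_x^2
r_squared = ssr / Sy2              # coefficient of determination, SSR / SST

print(f"r = {r:.4f}, R^2 = {r_squared:.4f}")
```

A positive r here indicates a direct relationship: people who saw more advertisements tended to buy more cans.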


Exercise
PUSKESMAS PANCORAN MAS (a community health center) wants to
determine the relationship between age and blood pressure
in its patients. A sample of 10 patients gave the
following results:

Age              38   36   72   42   68   63   49   56   60   55
Blood pressure  115  118  160  140  152  149  145  147  155  150

a) Build the regression model.
b) If a patient's age is 40, predict the blood pressure.
c) Test the regression model you have built.
d) Test whether the parameters $\alpha = 0$ and $\beta = 0$.
e) Construct 90% confidence intervals for $\alpha$ and $\beta$.
f) Compute the coefficients of determination and correlation, and explain what they mean.

What is Logistic Regression?


A form of regression that allows the prediction
of discrete variables from a mix of continuous
and discrete predictors.
It addresses the same questions that
discriminant function analysis and multiple
regression do, but with no distributional
assumptions on the predictors (the
predictors do not have to be normally
distributed, linearly related, or have equal
variance in each group).


What is Logistic Regression?

Logistic regression is often used because
the relationship between a discrete
variable and a predictor is non-linear.

Example from the text: the probability of heart disease
changes very little with a ten-point difference among
people with low blood pressure, but a ten-point change
can mean a drastic change in the probability of heart
disease in people with high blood pressure.

Assumptions

Absence of multicollinearity
No outliers
Independence of errors: this assumes a
between-subjects design. Other forms of
the model exist for within-subjects
designs.


Background

Odds: like probability. Odds are usually
written in a form such as "4 to 1 against", which is equivalent to
1 out of five, or .20 probability, or 20% chance,
etc.

The problem with probabilities is that they are
non-linear:
going from .10 to .20 doubles the probability, but
going from .80 to .90 barely increases the
probability.

Background

Odds ratio: the ratio of the probability over 1
minus the probability, P/(1 - P); that is, the probability of winning
over the probability of losing. A .20 probability (4 to 1 against)
equates to an odds ratio of .20/.80 = .25.
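
A small sketch of the quantity P/(1 - P) shows why it is a more convenient scale than probability. The function name `prob_to_odds` is an assumption for this example.

```python
def prob_to_odds(p):
    """Return P / (1 - P) for a probability 0 <= p < 1."""
    return p / (1.0 - p)

for p in (0.10, 0.20, 0.50, 0.80, 0.90):
    print(f"P = {p:.2f} -> P/(1-P) = {prob_to_odds(p):.3f}")
```

On this scale the step from .80 to .90 (from 4 to 9) is at least as pronounced as the step from .10 to .20 (from about 0.11 to 0.25), whereas on the probability scale both are the same 10-point change.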


Background

Logit: this is the natural log of an odds
ratio; often called a log odds even though
it really is a log odds ratio. The logit
scale is linear and functions much like a
z-score scale.

Background
LOGITS ARE CONTINUOUS, LIKE Z
SCORES
p = 0.50, then logit = 0
p = 0.70, then logit = 0.85
p = 0.30, then logit = -0.85
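
The three logit values can be reproduced with the definition logit = ln(P/(1 - P)). This is a minimal sketch; the function name `logit` is chosen for this example.

```python
import math

def logit(p):
    """Natural log of P / (1 - P), for 0 < p < 1."""
    return math.log(p / (1.0 - p))

for p in (0.30, 0.50, 0.70):
    print(f"p = {p:.2f} -> logit = {logit(p):+.4f}")
```

The values are symmetric around p = 0.50, which is what makes the logit scale behave much like a z-score scale.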


Plain old regression

Y = A BINARY RESPONSE (DV)
   1: POSITIVE RESPONSE (success), with probability P
   0: NEGATIVE RESPONSE (failure), with probability Q = (1 - P)

MEAN(Y) = P, the observed proportion of successes
VAR(Y) = PQ, maximized when P = .50; the
variance depends on the mean (P)
XJ = ANY TYPE OF PREDICTOR:
continuous, dichotomous, polytomous

Plain old regression

$\hat{Y} \mid X = B_0 + B_1 X_1$

and it is assumed that errors are
normally distributed, with mean = 0 and
constant variance (i.e., homogeneity of
variance)


Plain old regression

$E(Y \mid X) = B_0 + B_1 X_1$
an expected value is a mean, so

$\hat{Y} = \hat{\pi} = P(Y = 1 \mid X)$

The predicted value equals the proportion of
observations for which Y|X = 1; P is the
probability of Y = 1 (a success) given X, and
Q = 1 - P (a failure) given X.

An alternative: the ogive function

An ogive function is a curved, S-shaped
function; the most common is the
logistic function, which looks like:


The logistic function

$\hat{Y}_i = \frac{e^u}{1 + e^u}$

where $\hat{Y}_i$ is the estimated probability
that the ith case is in a category and u is
the regular linear regression equation:

$u = A + B_1 X_1 + B_2 X_2 + \cdots + B_K X_K$


The logistic function

$\hat{\pi}_i = \frac{e^{b_0 + b_1 X_1}}{1 + e^{b_0 + b_1 X_1}}$

The logistic function

Change in probability is not constant
(linear) with constant changes in X.
This means that the probability of a
success (Y = 1) given the predictor
variable (X) is a non-linear function,
specifically a logistic function.
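
The non-constant change in probability is easy to see numerically. The sketch below evaluates the logistic function $e^{b_0 + b_1 X} / (1 + e^{b_0 + b_1 X})$ at equally spaced X values. The coefficients b0 = -4.0 and b1 = 0.05, the X grid and the function name `logistic` are illustrative assumptions for this example.

```python
import math

def logistic(x, b0, b1):
    """P(Y = 1 | X = x) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    u = b0 + b1 * x
    return math.exp(u) / (1.0 + math.exp(u))

b0, b1 = -4.0, 0.05                       # illustrative coefficients
xs = [30, 40, 50, 60, 70, 80, 90, 100]
probs = [logistic(x, b0, b1) for x in xs]

# Equal 10-unit steps in X give unequal steps in probability.
steps = [p2 - p1 for p1, p2 in zip(probs, probs[1:])]

for x, p in zip(xs, probs):
    print(f"X = {x:3d} -> P = {p:.3f}")
print("change per 10 units of X:", [round(d, 3) for d in steps])
```

The steps are largest near the point where the probability is .50 and shrink toward either end of the curve.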


The logistic function

It is not obvious how the regression
coefficients for X are related to changes
in the dependent variable (Y) when the
model is written this way.
The change in Y (in probability units) given X
depends on the value of X. Look at the S-shaped function.

The logistic function

The values in the regression equation, b0
and b1, take on slightly different
meanings.

b0: the regression constant (moves the curve
left and right)
b1: the regression slope (steepness of the
curve)
-b0/b1: the threshold, where the probability of
success = .50


Logistic Function
Constant regression constant, different slopes
v2: b0 = -4.00, b1 = 0.05 (middle)
v3: b0 = -4.00, b1 = 0.15 (top)
v4: b0 = -4.00, b1 = 0.025 (bottom)
[Figure: the three logistic curves V2, V3 and V4 plotted against V1; vertical axis from 0.0 to 1.0, horizontal axis from 30 to 100.]

Logistic Function
Constant slopes with different regression constants
v2: b0 = -3.00, b1 = 0.05 (top)
v3: b0 = -4.00, b1 = 0.05 (middle)
v4: b0 = -5.00, b1 = 0.05 (bottom)
[Figure: the three logistic curves V2, V3 and V4 plotted against V1; vertical axis from 0.0 to 1.0, horizontal axis from 30 to 100.]
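
The two sets of curves can be tabulated without plotting. The sketch below takes the (b0, b1) pairs listed on the two figure slides and reports, for each curve, the X value where b0 + b1 X = 0 (so the probability is .50) and the probability at X = 60. The function names and the choice of X = 60 as a comparison point are assumptions for this example.

```python
import math

def logistic(x, b0, b1):
    """P(Y = 1 | X = x) for a single-predictor logistic model."""
    u = b0 + b1 * x
    return math.exp(u) / (1.0 + math.exp(u))

def threshold(b0, b1):
    """X at which b0 + b1 * X = 0, i.e. where the probability is .50."""
    return -b0 / b1

# (b0, b1) pairs as listed on the two figure slides.
different_slopes = {"v2": (-4.0, 0.05), "v3": (-4.0, 0.15), "v4": (-4.0, 0.025)}
different_constants = {"v2": (-3.0, 0.05), "v3": (-4.0, 0.05), "v4": (-5.0, 0.05)}

for title, curves in (("different slopes", different_slopes),
                      ("different constants", different_constants)):
    print(title)
    for name, (b0, b1) in curves.items():
        print(f"  {name}: P = .50 at X = {threshold(b0, b1):.1f}, "
              f"P(X = 60) = {logistic(60, b0, b1):.3f}")
```

Changing b1 with b0 fixed changes how steeply the curve rises, while changing b0 with b1 fixed slides the same curve left or right along the X axis.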


The Logit

By algebraic manipulation, the logistic
regression equation can be written in
terms of an odds ratio for success:

$\frac{P(Y = 1 \mid X_i)}{1 - P(Y = 1 \mid X_i)} = \frac{\hat{\pi}}{1 - \hat{\pi}} = \exp(b_0 + b_1 X_{1i})$

The Logit

Odds ratios range from 0 to positive
infinity.
Odds ratio: P/Q is an odds ratio; less
than 1 means less than .50 probability, greater
than 1 means greater than .50 probability.


The Logit

Finally, taking the natural log of both
sides, we can write the equation in
terms of logits (log-odds):

$\ln\left[\frac{P(Y = 1 \mid X)}{1 - P(Y = 1 \mid X)}\right] = \ln\left[\frac{\hat{\pi}}{1 - \hat{\pi}}\right] = b_0 + b_1 X_1$
For a single predictor

The Logit


$\ln\left[\frac{\hat{\pi}}{1 - \hat{\pi}}\right] = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k$
For multiple predictors


The Logit

Log-odds are a linear function of the
predictors.
The regression coefficients go back to
their old interpretation (kind of):

b0: the expected value of the logit (log-odds) when X = 0

b1: called a logit difference; the amount
the logit (log-odds) changes with a one-unit
change in X, i.e. the amount the logit
changes in going from X to X + 1
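
This interpretation can be checked numerically: on the logit scale the model is a straight line, so the logit at X = 0 is b0 and every one-unit step in X adds b1, wherever the step is taken. The coefficients b0 = -4.0 and b1 = 0.05 and the function names are illustrative assumptions for this sketch.

```python
import math

def logistic(x, b0, b1):
    """P(Y = 1 | X = x) for a single-predictor logistic model."""
    u = b0 + b1 * x
    return math.exp(u) / (1.0 + math.exp(u))

def logit(p):
    """Natural log of P / (1 - P)."""
    return math.log(p / (1.0 - p))

b0, b1 = -4.0, 0.05   # illustrative coefficients

logit_at_zero = logit(logistic(0, b0, b1))
print(f"logit at X = 0: {logit_at_zero:.4f} (b0 = {b0})")

diffs = []
for x in (0, 40, 80):
    diff = logit(logistic(x + 1, b0, b1)) - logit(logistic(x, b0, b1))
    diffs.append(diff)
    print(f"X = {x:2d} -> X + 1: logit changes by {diff:.4f} (b1 = {b1})")
```

The change in probability for the same one-unit step differs from one X to another; only the change in the logit is constant.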

Conversion

EXP(logit) or $e^{\text{logit}}$ = odds ratio

Probability = odds ratio / (1 + odds ratio)
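
The two conversion steps chain together as follows. This is a minimal sketch; the function name `logit_to_prob` and the sample logit values are chosen for this example.

```python
import math

def logit_to_prob(logit_value):
    """Convert a logit to a probability via EXP(logit) / (1 + EXP(logit))."""
    odds = math.exp(logit_value)   # EXP(logit), what the slide calls the odds ratio
    return odds / (1.0 + odds)

for value in (-0.847, 0.0, 0.847):
    print(f"logit = {value:+.3f} -> probability = {logit_to_prob(value):.3f}")
```

Applied to a fitted model, the same function turns a predicted logit b0 + b1 X into a predicted probability of success.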


THANK YOU
GOOD LUCK
