
CE 459 Statistics

Assistant Prof. Muhammet Vefa AKPINAR

[Histogram of vehicle speeds: No of obs vs. upper boundaries (x <= boundary), 50 to 100 in steps of 5, with the expected normal curve overlaid.]

Lecture Notes
What is Statistics
Frequency Distribution
Descriptive Statistics
Normal Probability Distribution
Sampling Distribution of the Mean
Simple Linear Regression & Correlation
Multiple Regression & Correlation


INTRODUCTION
Criticism
There is a general perception that statistical knowledge is all too frequently intentionally misused, by finding ways to interpret the data that are favorable to the presenter. (A famous quote, variously attributed, but thought to be from Benjamin Disraeli, is: "There are three types of lies: lies, damned lies, and statistics.") Indeed, the well-known book How to Lie with Statistics by Darrell Huff discusses many cases of deceptive uses of statistics, focusing on misleading graphs. By choosing (or rejecting, or modifying) a certain sample, results can be manipulated; throwing out outliers is one means of doing so. This may be the result of outright fraud or of subtle and unintentional bias on the part of the researcher.


WHAT IS STATISTICS?
Definition

Statistics is a group of methods used to collect, analyze, present, and interpret data and to make decisions.

What is Statistics ?

The American Heritage Dictionary defines statistics as: "The mathematics of the collection, organization, and interpretation of numerical data, especially the analysis of population characteristics by inference from sampling."

Merriam-Webster's Collegiate Dictionary definition is: "A branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data." The word statistics is also the plural of statistic (singular), which refers to the result of applying a statistical algorithm to a set of data, as in employment statistics, accident statistics, etc.


In applying statistics to a scientific, industrial, or societal problem, one begins with a process or population to be studied. This might be a population of people in a country, of crystal grains in a rock, or of goods manufactured by a particular factory during a given period. For practical reasons, rather than compiling data about an entire population, one usually instead studies a chosen subset of the population, called a sample. Data are collected about the sample in an observational or experimental setting. The data are then subjected to statistical analysis, which serves two related purposes: description and inference.


Descriptive statistics and Inferential statistics.

Statistical data analysis can be subdivided into descriptive statistics and inferential statistics. Descriptive statistics is concerned with exploring, visualizing, and summarizing data, but without fitting the data to any models. This kind of analysis is used to explore the data in the initial stages of data analysis. Since no models are involved, it cannot be used to test hypotheses or to make testable predictions. Nevertheless, it is a very important part of analysis that can reveal many interesting features in the data. Descriptive statistics can be used to summarize the data, either numerically or graphically, to describe the sample. Basic examples of numerical descriptors include the mean and standard deviation. Graphical summarizations include various kinds of charts and graphs.

Inferential statistics is the next stage in data analysis and involves the identification of a suitable model. The data is then fit to the model to obtain an optimal estimation of the model's parameters. The model then undergoes validation by testing either predictions or hypotheses of the model. Models based on a unique sample of data can be used to infer generalities about features of the whole population. Inferential statistics is used to model patterns in the data, accounting for randomness and drawing inferences about the larger population. These inferences may take the form of answers to yes/no questions (hypothesis testing), estimates of numerical characteristics (estimation), forecasting of future observations, descriptions of association (correlation), or modeling of relationships (regression). Other modeling techniques include ANOVA, time series, and data mining.

Population and sample.

A population consists of all elements (individuals, items, or objects) whose characteristics are being studied. The population that is being studied is also called the target population.

A portion of the population selected for study is referred to as a sample.

Measures of Central Tendency


The central tendency of a dataset, i.e. the centre of a frequency distribution, is most commonly measured by the 3 Ms:

Mean (arithmetic mean, average): the sum of all measurements divided by the number of measurements.
Median: a number such that at most half of the measurements are below it and at most half of the measurements are above it.
Mode: the most frequent measurement in the data.

Mean
The sample mean, ȳ, is the arithmetic average of a data set. It is used to estimate the population mean, μ. It is calculated by taking the sum of the observed values (yi) divided by the number of observations (n):

ȳ = (y1 + y2 + ... + yn) / n = Σ yi / n

Example: the historical transmogrifier average unit production costs ($K) for ten systems were 22.2, 17.3, 11.8, 9.6, 8.8, 7.6, 6.8, 3.2, 1.7, 1.6, so

ȳ = (22.2 + 17.3 + ... + 1.6) / 10 = $9.06K

The Mode

The mode, symbolized by Mo, is the most frequently occurring score value. If the scores for a given sample distribution are:
32 32 35 36 37 38 38 39 39 39 40 40 42 45

then the mode would be 39, because a score of 39 occurs 3 times, more than any other score.


A distribution may have more than one mode if the two most frequently occurring scores occur the same number of times. For example, if the earlier score distribution were modified as follows:
32 32 32 36 37 38 38 39 39 39 40 40 42 45

then there would be two modes, 32 and 39. Such distributions are called bimodal. The frequency polygon of a bimodal distribution is presented below.


Example of Mode
Measurements (x): 3 5 1 1 4 7 3 8 3

Mode: 3

Notice that it is possible for a data set not to have any mode.

Mode

The Mode is the value of the data set that occurs most frequently.
Example: 1, 2, 4, 5, 5, 6, 8. Here the Mode is 5, since 5 occurred twice and no other value occurred more than once.
Data sets can have more than one mode, while the mean and median have one unique value.
Data sets can also have NO mode, for example: 1, 3, 5, 6, 7, 8, 9. Here, no value occurs more frequently than any other, therefore no mode exists. (You could also argue that this data set contains 7 modes, since each value occurs as frequently as every other.)

Example of Mode
Measurements (x): 3 5 5 1 7 2 6 7 0 4

In this case the data have two modes, 5 and 7; both measurements are repeated twice.

Median

Computation of Median When there is an odd number of numbers, the median is simply the middle number. For example, the median of 2, 4, and 7 is 4.
When there is an even number of numbers, the median is the mean of the two middle numbers. Thus, the median of the numbers 2, 4, 7, 12 is (4+7)/2 = 5.5.


Example of Median
Measurements (x): 3 5 5 1 7 2 6 7 0 4   (sum = 40)
Ranked:           0 1 2 3 4 5 5 6 7 7

Median: (4 + 5)/2 = 4.5

Notice that only the two central values are used in the computation. The median is not sensitive to extreme values.

median

Example: ordered rim diameters (cm) from two excavation units; the median is the middle of each ordered column.

unit 1: 9.7 11.5 11.6 12.1 12.4 12.6 12.9 13.2 13.8 14.0 15.5 15.6 16.2 16.4
unit 2: 9.0 11.2 11.3 11.7 12.2 12.5 13.1 13.5 13.6 14.8 16.3 26.9

Median

The Median is the middle observation of an ordered (from low to high) data set.
Examples:
1, 2, 4, 5, 5, 6, 8: the middle observation is 5, so the median is 5.
1, 3, 4, 4, 5, 7, 8, 8: there is no single middle observation, so we take the average of the two observations at the center: (4 + 5)/2 = 4.5.

[Figure: in a symmetric distribution, Mode = Median = Mean; in a skewed distribution, the mode, median, and mean separate.]

Dispersion Statistics

The Mean, Median and Mode by themselves are not sufficient descriptors of a data set.
Example:
Data Set 1: 48, 49, 50, 51, 52
Data Set 2: 5, 15, 50, 80, 100
Note that the Mean and Median for both data sets are identical, but the data sets are glaringly different! The difference is in the dispersion of the data points.
The dispersion statistics we will discuss are: Range, Variance, Standard Deviation.

Range

The Range is simply the difference between the largest and smallest observation in a data set.
Example:
Data Set 1: 48, 49, 50, 51, 52
Data Set 2: 5, 15, 50, 80, 100
The Range of data set 1 is 52 - 48 = 4; the Range of data set 2 is 100 - 5 = 95. So, while both data sets have the same mean and median, the dispersion of the data, as depicted by the range, is much smaller in data set 1.

deviation score

A deviation score is a measure of by how much each point in a frequency distribution lies above or below the mean for the entire dataset:

deviation score = X - X̄

where X is the raw score and X̄ is the mean. Note that if you add all the deviation scores for a dataset together, the sum is always zero.


Variance

The Variance, s2, represents the amount of variability of the data relative to their mean As shown below, the variance is the average of the squared deviations of the observations about their mean

s² = Σ(yi - ȳ)² / (n - 1)        σ² = Σ(yi - μ)² / N

The Variance s² is the sample variance, and is used to estimate the actual population variance, σ².

Standard Deviation

The Variance is not a common sense statistic because it describes the data in terms of squared units The Standard Deviation, s, is simply the square root of the variance

s = sqrt( Σ(yi - ȳ)² / (n - 1) )        σ = sqrt( Σ(yi - μ)² / N )

The Standard Deviation, s, is the sample standard deviation, and is used to estimate the actual population standard deviation, σ. The sample standard deviation is measured in the same units as the data from which it is calculated.

Standard Deviation
System    yi (FY97$K)   yi - ȳ   (yi - ȳ)²
1         22.2          13.1     172.7
2         17.3          8.2      67.9
3         11.8          2.7      7.5
4         9.6           0.5      0.3
5         8.8           -0.3     0.1
6         7.6           -1.5     2.1
7         6.8           -2.3     5.1
8         3.2           -5.9     34.3
9         1.7           -7.4     54.2
10        1.6           -7.5     55.7
Average   9.06

s² = Σ(yi - ȳ)² / (n - 1) = (172.7 + 67.9 + ... + 55.7) / (10 - 1) = 399.8/9 = 44.4 ($K²)

s = sqrt(44.4 $K²) = 6.67 ($K)

This number, $6.67K, represents the average estimating error for predicting subsequent observations. In other words: on average, when estimating the cost of transmogrifiers that belong to the same population as the ten systems above, we would expect to be off by $6.67K.

Variance and the closely-related standard deviation

The variance and the closely-related standard deviation are measures of how spread out a distribution is. In other words, they are measures of variability.
In order to define the amount of deviation of a dataset from the mean, calculate the mean of all the squared deviation scores, i.e. the variance. The variance is computed as the average squared deviation of each number from its mean. For example, for the numbers 1, 2, and 3, the mean is 2 and the variance is ((1-2)² + (2-2)² + (3-2)²)/3 = 2/3 ≈ 0.667.


The variance in a population is:

σ² = Σ(X - μ)² / N

The variance in a sample is:

s² = Σ(X - X̄)² / (n - 1)

where μ (or X̄) is the mean and N (or n) is the number of scores.


The standard deviation is the square root of the variance.


Variance and Standard Deviation


Example of Mean
Measurements (x):       3  5  5  1  7  2  6  7  0  4   (sum = 40)
Deviations (x - mean): -1  1  1 -3  3 -2  2  3 -4  0   (sum = 0)

MEAN = 40/10 = 4

Notice that the sum of the deviations is 0, and that every single observation enters into the computation of the mean.

Example of Variance
Measurements (x):       3  5  5  1  7  2  6  7  0  4   (sum = 40)
Deviations (x - mean): -1  1  1 -3  3 -2  2  3 -4  0   (sum = 0)
Squared deviations:     1  1  1  9  9  4  4  9 16  0   (sum = 54)

Variance = 54/9 = 6

The variance is a measure of spread. Notice that the larger the deviations (positive or negative), the larger the variance.

The standard deviation

The standard deviation is defined as the square root of the variance. In the previous example, Variance = 6, so Standard deviation = sqrt(6) = 2.45.
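A minimal Python sketch (an addition to these notes, standard library only) that reproduces the mean, variance, and standard deviation computed in the two examples above:

```python
# Sketch (added): the mean/variance/SD example above.
data = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]

n = len(data)
mean = sum(data) / n                                   # 40/10 = 4
deviations = [x - mean for x in data]                  # these sum to 0
variance = sum(d ** 2 for d in deviations) / (n - 1)   # 54/9 = 6
std_dev = variance ** 0.5                              # sqrt(6) ≈ 2.45

print(mean, sum(deviations), variance, round(std_dev, 2))
```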

Observed Vehicle Velocity

Velocities (km/h) of 50 observed vehicles:

67 73 81 72 76 75 85 77 68 84
76 93 73 79 88 73 60 93 71 59
74 62 95 78 63 72 66 78 82 75
96 70 89 61 75 95 66 79 83 71
76 65 71 75 65 80 73 57 88 78


Mean, Median, Standard Deviation

Valid N   Range   Mean    Median   Minimum   Maximum   Variance   Std. Dev.
50        39      75.62   75       57        96        96.362     9.816


Frequency Table
Class   Interval        Midpoint   Frequency   Relative freq. %   Cumulative freq.   Relative cum. freq. %
1       50 < x <= 55    52.5       0           0                  0                  0
2       55 < x <= 60    57.5       3           6                  3                  6
3       60 < x <= 65    62.5       5           10                 8                  16
4       65 < x <= 70    67.5       5           10                 13                 26
5       70 < x <= 75    72.5       14          28                 27                 54
6       75 < x <= 80    77.5       10          20                 37                 74
7       80 < x <= 85    82.5       5           10                 42                 84
8       85 < x <= 90    87.5       3           6                  45                 90
9       90 < x <= 95    92.5       4           8                  49                 98
10      95 < x <= 100   97.5       1           2                  50                 100


Frequency Table

A cumulative frequency distribution is a plot of the number of observations falling in or below an interval. A frequency table is constructed by dividing the scores into intervals and counting the number of scores in each interval. The actual number of scores, as well as the percentage of scores in each interval, are displayed; cumulative frequencies are also usually displayed. Here the X-axis shows the various intervals of vehicle speed.

Selecting the Interval Size

In order to find a starting interval size, the first step is to find the range of the data by subtracting the smallest score from the largest. In the case of the example data, the range is 96 - 57 = 39. The range is then divided by the number of desired intervals, with a suggested starting number of intervals being ten (10). In the example, 39/10 = 3.9; the nearest odd integer value, 5, is used as the starting point for the selection of the interval size.
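A small Python sketch (an addition to these notes) that builds the frequency table of the previous slide from the 50 observed velocities, using the same 5 km/h intervals:

```python
# Sketch (added): bin the 50 speeds into the intervals 50 < x <= 55, ..., 95 < x <= 100.
speeds = [67, 73, 81, 72, 76, 75, 85, 77, 68, 84, 76, 93, 73, 79, 88,
          73, 60, 93, 71, 59, 74, 62, 95, 78, 63, 72, 66, 78, 82, 75,
          96, 70, 89, 61, 75, 95, 66, 79, 83, 71, 76, 65, 71, 75, 65,
          80, 73, 57, 88, 78]

n = len(speeds)
cumulative = 0
for lower in range(50, 100, 5):
    upper = lower + 5
    freq = sum(1 for x in speeds if lower < x <= upper)
    cumulative += freq
    print(f"{lower} < x <= {upper}: f={freq:2d}  rel={100*freq/n:3.0f}%  "
          f"cum={cumulative:2d}  rel.cum={100*cumulative/n:4.0f}%")
```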


Histogram

A histogram is constructed from a frequency table. The intervals are shown on the X-axis and the number of scores in each interval is represented by the height of a rectangle located above the interval. A histogram of the vehicle speed from the dataset is shown below. The shapes of histograms will vary depending on the choice of the size of the intervals.
[Histogram of vehicle speed: No of obs on the Y-axis vs. upper boundaries (x <= boundary) from 50 to 100 km/h in steps of 5 on the X-axis, with the expected normal curve overlaid.]

There are many different-shaped frequency distributions:


A frequency polygon is a graphical display of a frequency table. The intervals are shown on the X-axis and the number of scores in each interval is represented by the height of a point located above the middle of the interval. The points are connected so that together with the X-axis they form a polygon.


Spread, Dispersion, Variability A variable's spread is the degree to which scores on the variable differ from each other. If every score on the variable were about equal, the variable would have very little spread. There are many measures of spread. The distributions shown below have the same mean but differ in spread: The distribution on the bottom is more spread out. Variability and dispersion are synonyms for spread.


Skew


Further Notes
When the Mean is greater than the Median, the data distribution is skewed to the Right.
When the Median is greater than the Mean, the data distribution is skewed to the Left.
When the Mean and Median are very close to each other, the data distribution is approximately symmetric.

The Effect of Skew on the Mean and Median

The distribution shown below has a positive skew. The mean is larger than the median.

For example, if a test was very difficult and almost everyone in the class did very poorly on it, the resulting distribution would most likely be positively skewed.

The distribution shown below has a negative skew. The mean is smaller than the median.


Probability

Likelihood or chance of occurrence. The probability of an event is the theoretical relative frequency of the event in a model of the population.


Normal Distribution or Normal Curve

The normal distribution is probably the most important and most widely used continuous distribution. A random variable following it is known as a normal random variable, and its probability distribution is called a normal distribution. The normal distribution is a theoretical function commonly used in inferential statistics as an approximation to sampling distributions. In general, the normal distribution provides a good model for many random variables.


In a normal distribution:

68% of values fall within ±1 SD of the mean
95% of values fall within ±2 SD (more precisely, ±1.96 SD)
99.7% of values fall within ±3 SD


The normal distribution function

The normal distribution function is determined by the following formula:

f(x) = (1 / (σ sqrt(2π))) e^( -(x - μ)² / (2σ²) )

where μ is the mean, σ is the standard deviation, e is Euler's number (2.718...), and π is the constant pi (3.141...).
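A one-function Python sketch of this formula (an addition to these notes); the example values reuse the vehicle-speed mean and standard deviation from earlier:

```python
import math

def normal_pdf(x, mu, sigma):
    """f(x) = 1/(sigma*sqrt(2*pi)) * exp(-(x - mu)**2 / (2*sigma**2))"""
    coef = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coef * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Peak of the vehicle-speed curve (mean 75.62, sd 9.82 from earlier):
print(normal_pdf(75.62, 75.62, 9.82))   # ≈ 0.0406
```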


Characteristics of the Normal Distribution:

It is bell shaped and is symmetrical about its mean, with scores more concentrated in the middle than in the tails.
It is asymptotic to the horizontal axis, i.e., it extends indefinitely in either direction from the mean.
It is a family of curves: every unique pair of mean and standard deviation defines a different normal distribution. Thus, the normal distribution is completely described by two parameters: mean and standard deviation.
There is a strong tendency for the variable to take a central value; it is unimodal, i.e., values mound up only in the center of the curve.
The frequency of deviations falls off rapidly as the deviations become larger.


The total area under the curve sums to 1; the area of the distribution on each side of the mean is 0.5. The area under the curve between any two scores is a PROBABILITY: the probability that a random variable will have a value between any two points is equal to the area under the curve between those points. Positive and negative deviations from the central value are equally likely.


Examples of normal distributions

Notice that they differ in how spread out they are; the area under each curve is the same. The height of a normal distribution can be specified mathematically in terms of two parameters: the mean (μ) and the standard deviation (σ). The two parameters, μ and σ, each change the shape of the distribution in a different manner.


Changes in μ without changes in σ

Changes in μ, without changes in σ, result in moving the distribution to the right or left, depending upon whether the new value of μ is larger or smaller than the previous value, but do not change the shape of the distribution.


Changes in the value of σ

Changes in the value of σ change the shape of the distribution without affecting the midpoint, because σ affects the spread, or dispersion, of scores. The larger the value of σ, the more dispersed the scores; the smaller the value, the less dispersed. The figure below demonstrates the effect of increasing the value of σ.


THE STANDARD NORMAL CURVE

The standard normal curve is a member of the family of normal curves with μ = 0.0 and σ = 1.0.

Note that integral calculus is used to find the area under the normal distribution curve. However, this can be avoided by transforming any normal distribution to fit the standard normal distribution. This conversion is done by rescaling the normal distribution axis from its true units (time, weight, dollars, ...) to a standard measure called a Z score or Z value.


Standard Scores (z Scores)

A Z score is the number of standard deviations that a value, X, is away from the mean. Standard scores are therefore useful for comparing data points in different distributions. If the value of X is greater than the mean, the Z score is positive; if the value of X is less than the mean, the Z score is negative. The Z score equation is:

Z = (X - μ) / σ

where Z is the z-score for the value of X.
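A short Python sketch (an addition to these notes) of the Z transformation; the standard normal CDF Φ(z) is expressed through the standard library's error function, so the table values below can be reproduced without a statistics package. The speed example is illustrative:

```python
import math

def z_score(x, mu, sigma):
    """Number of standard deviations x lies from the mean."""
    return (x - mu) / sigma

def phi(z):
    """Standard normal CDF, P(Z <= z), via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Example: a vehicle speed of 90 km/h in the earlier dataset
# (mean 75.62, standard deviation 9.82):
z = z_score(90, 75.62, 9.82)      # ≈ 1.46
print(z, phi(z))                  # P(X <= 90) ≈ 0.93
print(phi(z) - 0.5)               # area between 0 and z, as tabulated: ≈ 0.43
```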


Table of the Standard Normal (z) Distribution


Each entry gives the area under the standard normal curve between 0 and z. The whole-number and tenths portion of z appears in the first column; the hundredths portion appears across the top.

z     0.00   0.01   0.02   0.03   0.04   0.05   0.06   0.07   0.08   0.09
0.0  0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1  0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2  0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3  0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4  0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5  0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6  0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7  0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8  0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9  0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0  0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1  0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2  0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3  0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4  0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5  0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6  0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7  0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8  0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9  0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0  0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1  0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857


Three areas on a standard normal curve

[Figure: three shaded areas on a standard normal curve: (1) from -infinity to Z; (2) from -Z to +Z; (3) the two tails, -infinity to -Z plus +Z to +infinity.]

Z     -inf to Z   Z to +inf   -Z to +Z   both tails   both tails (PPM)
0.0   0.50000     0.50000     0.00000    1.00000      1,000,000
0.1   0.53983     0.46017     0.07966    0.92034      920,344
0.2   0.57926     0.42074     0.15852    0.84148      841,481
0.3   0.61791     0.38209     0.23582    0.76418      764,177
0.4   0.65542     0.34458     0.31084    0.68916      689,157
0.5   0.69146     0.30854     0.38292    0.61708      617,075
0.6   0.72575     0.27425     0.45149    0.54851      548,506
0.7   0.75804     0.24196     0.51607    0.48393      483,927
0.8   0.78814     0.21186     0.57629    0.42371      423,711
0.9   0.81594     0.18406     0.63188    0.36812      368,120
1.0   0.84134     0.15866     0.68269    0.31731      317,311
1.1   0.86433     0.13567     0.72867    0.27133      271,332
1.2   0.88493     0.11507     0.76986    0.23014      230,139
1.3   0.90320     0.09680     0.80640    0.19360      193,601
1.4   0.91924     0.08076     0.83849    0.16151      161,513
1.5   0.93319     0.06681     0.86639    0.13361      133,614

The area between Z-scores of -1.00 and +1.00 is .68, or 68%. The area between Z-scores of -2.00 and +2.00 is .95, or 95%.


Exercise 1

An industrial sewing machine uses ball bearings that are targeted to have a diameter of 0.75 inch. The specification limits under which the ball bearing can operate are 0.74 inch (lower) and 0.76 inch (upper). Past experience has indicated that the actual diameter of the ball bearings is approximately normally distributed with a mean of 0.753 inch and a standard deviation of 0.004 inch.
For this problem, note that "Target" = .75, and "Actual mean" = .753.


What is the probability that a ball bearing will be between the target and the actual mean?

Z = (0.75 - 0.753)/0.004 = -0.75, so P(-0.75 < Z < 0) = .2734


What is the probability that a ball bearing will be between the lower specification limit and the target?

Z = (0.74 - 0.753)/0.004 = -3.25, so P(-3.25 < Z < -0.75) = .49942 - .2734 = .22602


What is the probability that a ball bearing will be above the upper specification limit?

Z = (0.76 - 0.753)/0.004 = 1.75, so P(Z > 1.75) = .5 - .4599 = .0401


What is the probability that a ball bearing will be below the lower specification limit?

P(Z < -3.25) = .5 - .49942 = .00058


Above which value in diameter will 93% of the ball bearings be? The value asked for here is the 7th percentile, since 93% of the ball bearings will have diameters above it. So we look up .4300 in the Z-table in a "backwards" manner. The closest area is .4306, which corresponds to a Z-value of 1.48; since the value lies below the mean, Z = -1.48.

Zσ = -1.48 x 0.004 = -0.00592 = X - 0.753

X = 0.74708

So 0.74708 in. is the value that 93% of the diameters are above.
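The four ball-bearing probabilities can be verified numerically; a Python sketch (an addition to these notes):

```python
import math

def phi(z):
    """P(Z <= z) for the standard normal, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma = 0.753, 0.004            # actual mean and SD of the diameters
lsl, target, usl = 0.74, 0.75, 0.76

def z(x):
    return (x - mu) / sigma

print(phi(0) - phi(z(target)))       # target to mean:  ≈ 0.2734
print(phi(z(target)) - phi(z(lsl)))  # LSL to target:   ≈ 0.2260
print(1 - phi(z(usl)))               # above USL:       ≈ 0.0401
print(phi(z(lsl)))                   # below LSL:       ≈ 0.00058
```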


Exercise 2

Graduate Management Aptitude Test (GMAT) scores are widely used by graduate schools of business as an entrance requirement. Suppose that in one particular year, the mean score for the GMAT was 476, with a standard deviation of 107. Assuming that the GMAT scores are normally distributed, answer the following questions:


Question 1

What is the probability that a randomly selected score from this GMAT falls between 476 and 650 (476 <= x <= 650)? The following figure shows a graphic representation of this problem.

Answer: Z = (650 - 476)/107 = 1.62. The Z value of 1.62 indicates that the GMAT score of 650 is 1.62 standard deviations above the mean. The standard normal table gives the probability of a value falling between 650 and the mean. The whole number and tenths portion of the Z score appears in the first column of the table; across the top of the table are the values of the hundredths portion of the Z score. Thus the answer is that 0.4474, or 44.74%, of the scores on the GMAT fall between a score of 650 and 476.


Question 2.

What is the probability of receiving a score greater than 750 on a GMAT test that has a mean of 476 and a standard deviation of 107, i.e., P(X >= 750)?
Answer: This problem asks for the area of the upper tail of the distribution. The Z score is Z = (750 - 476)/107 = 2.56; from the table, P(0 < Z < 2.56) = 0.4948. This is the probability of a GMAT score between 476 and 750, so P(X >= 750) = 0.5 - 0.4948 = 0.0052, or 0.52%. Note that P(X >= 750) is the same as P(X > 750) because, in a continuous distribution, the area under an exact number such as X = 750 is zero.


Question 3

What is the probability of receiving a score of 540 or less on a GMAT test that has a mean of 476 and a standard deviation of 107, i.e., P(X <= 540)? We are asked to determine the area under the curve for all values less than or equal to 540. The z score is (540 - 476)/107 = 0.6; from the table, P(0 < Z < 0.6) = 0.2257, which is the probability of getting a score between the mean 476 and 540. The answer to this problem is: 0.5 + 0.2257 = 0.73, or 73%.

Graphic representation of this problem.


Question 4

What is the probability of receiving a score between 330 and 440 on a GMAT test that has a mean of 476 and a standard deviation of 107, i.e., P(330 <= X <= 440)?

The two values fall on the same side of the mean. The Z scores are Z1 = (330 - 476)/107 = -1.36 and Z2 = (440 - 476)/107 = -0.34. The probability associated with Z = -1.36 is 0.4131; the probability associated with Z = -0.34 is 0.1331. The answer to this problem is: 0.4131 - 0.1331 = 0.28, or 28%.


Standard Error (SE)

Any statistic can have a standard error. Each sampling distribution has a standard error.
Standard errors are important because they reflect how much sampling fluctuation a statistic will show, i.e. how good an estimate of the population the sample statistic is How good an estimate is the mean of a population? One way to determine this is to repeat the experiment many times and to determine the mean of the means. However, this is tedious and frequently impossible. SE refers to the variability of the sample statistic, a measure of spread for random variables The inferential statistics involved in the construction of confidence intervals (CI) and significance testing are based on standard errors.


Standard Error of the Mean, SEM, σM

The standard deviation of the sampling distribution of the mean is called the standard error of the mean:

σM = σ / sqrt(n)

The size of the standard error of the mean is inversely proportional to the square root of the sample size.


The standard error of any statistic depends on the sample size - in general, the larger the sample size the smaller the standard error. Note that the spread of the sampling distribution of the mean decreases as the sample size increases.

Notice that the mean of the distribution is not affected by sample size.


Comparing the Averages of Two Independent Samples

Is there "grade inflation" in KTU? How does the average GPA of KTU students today compare with, say 10, years ago? Suppose a random sample of 100 student records from 10 years ago yields a sample average GPA of 2.90 with a standard deviation of .40. A random sample of 100 current students today yields a sample average of 2.98 with a standard deviation of .45. The difference between the two sample means is 2.98-2.90 = .08. Is this proof that GPA's are higher today than 10 years ago?


First we need to account for the fact that 2.98 and 2.90 are not the true averages, but are computed from random samples. Therefore, .08 is not the true difference, but simply an estimate of the true difference. Can this estimate miss by much? Fortunately, statistics has a way of measuring the expected size of the "miss" (or error of estimation). For our example, it is .06 (we show how to calculate this later). Therefore, we can state the bottom line of the study as follows: "The average GPA of KTU students today is .08 higher than 10 years ago, give or take .06 or so."


Overview of Confidence Intervals

Once the population is specified, the next step is to take a random sample from it. In this example, let's say that a sample of 10 students was drawn and each student's memory tested. The way to estimate the mean of all high school students would be to compute the mean of the 10 students in the sample. Indeed, the sample mean is an unbiased estimate of μ, the population mean. Clearly, if you already knew the population mean, there would be no need for a confidence interval.


We are interested in the mean weight of 10-year old kids living in Turkey. Since it would have been impractical to weigh all the 10-year old kids in Turkey, you took a sample of 16 and found that the mean weight was 90 pounds. This sample mean of 90 is a point estimate of the population mean.
A point estimate by itself is of limited usefulness because it does not reveal the uncertainty associated with the estimate; you do not have a good sense of how far this sample mean may be from the population mean. For example, can you be confident that the population mean is within 5 pounds of 90? You simply do not know.


Confidence intervals provide more information than point estimates. An example of a 95% confidence interval is shown below:

72.85 < μ < 107.15

There is good reason to believe that the population mean lies between these two bounds of 72.85 and 107.15, since 95% of the time confidence intervals contain the true mean.
If repeated samples were taken and the 95% confidence interval computed for each sample, 95% of the intervals would contain the population mean. Naturally, 5% of the intervals would not contain the population mean.


It is natural to interpret a 95% confidence interval as an interval with a 0.95 probability of containing the population mean.
The wider the interval, the more confident you are that it contains the parameter; the 99% confidence interval is therefore wider than the 95% confidence interval, and extends from 4.19 to 7.61.


Example

Assume that the weights of 10-year-old children are normally distributed with a mean of 90 and a standard deviation of 36. What is the sampling distribution of the mean for a sample size of 9? It is normal, with a mean of 90 and a standard deviation of 36/sqrt(9) = 36/3 = 12. Note that the standard deviation of a sampling distribution is its standard error.

90 - (1.96)(12) = 66.48
90 + (1.96)(12) = 113.52

The value of 1.96 is based on the fact that 95% of the area of a normal distribution is within 1.96 standard deviations of the mean; 12 is the standard error of the mean.


The figure shows that 95% of the means are no more than 23.52 units (1.96 x 12) from the mean of 90. Now consider the probability that a sample mean computed from a random sample is within 23.52 units of the population mean of 90. Since 95% of the distribution is within 23.52 of 90, the probability that the mean from any given sample will be within 23.52 of 90 is 0.95. This means that if we repeatedly compute the mean (M) from a sample and create an interval ranging from M - 23.52 to M + 23.52, this interval will contain the population mean 95% of the time.


Notice that you need to know the standard deviation (σ) in order to compute this confidence interval for the mean. This may sound unrealistic, and it is. However, computing a confidence interval when σ is known is easier than when σ has to be estimated, and serves a pedagogical purpose. Suppose the following five values were sampled from a normal distribution with a standard deviation of 2.5: 2, 3, 5, 6, and 9. To compute the 95% confidence interval, start by computing the mean and standard error:

M = (2 + 3 + 5 + 6 + 9)/5 = 5
σM = σ/sqrt(n) = 2.5/sqrt(5) = 1.118


Z.95 (the Z value for 95% confidence) is 1.96.


If you had wanted to compute the 99% confidence interval, you would have set the shaded area to 0.99, and the result would have been 2.58.
The confidence interval can then be computed as follows:
Lower limit = 5 - (1.96)(1.118) = 2.81
Upper limit = 5 + (1.96)(1.118) = 7.19
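A Python sketch of this computation (an addition to these notes):

```python
import math

# Sketch (added): the 95% CI for the sample 2, 3, 5, 6, 9 with sigma known.
data = [2, 3, 5, 6, 9]
sigma = 2.5                            # known population SD (from the text)
n = len(data)

mean = sum(data) / n                   # M = 5.0
se = sigma / math.sqrt(n)              # sigma_M = 2.5/sqrt(5) ≈ 1.118
z = 1.96                               # Z.95; use 2.58 for a 99% interval
print(mean - z * se, mean + z * se)    # ≈ (2.81, 7.19)
```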


Estimating the Population Mean Using Intervals

Estimate the average GPA of the population of approximately 23000 KTU undergraduates from n = 25 randomly selected students, whose sample average is 3.05. Chances are the true average μ is not exactly 3.05. The true KTU average GPA is certainly between 1.00 and 4.00, and with high confidence between (2.50, 3.50); but what level of confidence do we have that it is between, say, (2.75, 3.25) or (2.95, 3.15)? Even better, can we find an interval (a, b) which will contain μ with 95% certainty?


Example:

Given the following GPA for 6 students: 2.80, 3.20, 3.75, 3.10, 2.95, 3.40
Calculate a 95% confidence interval for the population mean GPA.


Determining Sample Size for Estimating the Mean

We want to estimate the average GPA of KTU undergraduates this school year. Historically, the SD σ of student GPA is known. If a random sample of size n = 25 yields a sample mean of 3.05, then the population mean is estimated as lying within the interval 3.05 ± .12

with 95% confidence. The plus-or-minus quantity .12 is called the margin of error of the sample mean associated with a 95% confidence level. It is also correct to say "we are 95% confident that μ is within .12 of the sample mean 3.05".


Confidence Interval for μ, Standard Deviation Estimated

It is very rare for a researcher wishing to estimate the mean of a population to already know its standard deviation. Therefore, the construction of a confidence interval almost always involves the estimation of both μ and σ.

When σ is known, M - z σM <= μ <= M + z σM is used for the confidence interval.

When σ is not known, i.e., whenever the standard deviation is estimated, the t rather than the normal (z) distribution should be used. The confidence interval for μ when σ is estimated is:

M - t sM <= μ <= M + t sM

where M is the sample mean, sM is an estimate of σM (the standard error), and t depends on the degrees of freedom and the level of confidence.

confidence interval on the mean:


More generally, the formula for the 95% confidence interval on the mean is:
Lower limit = M - (t)(sM)
Upper limit = M + (t)(sM)
where M is the sample mean, t is the t value for the confidence level desired (0.95 in the above example), and sM is the estimated standard error of the mean.

A comparison of the t and normal distribution


A comparison of the t distribution with 4 df (in blue) and the standard normal distribution (in red).

Finding t-values
Find the t-value such that the area under the t distribution to the right of the t-value is 0.2 assuming 10 degrees of freedom. That is, find t0.20 with 10 degrees of freedom.

Upper tail probability p (area under the right tail); the top row gives the corresponding two-sided confidence level.

conf.  50%    60%    70%    80%    90%    95%     96%     98%     99%     99.5%   99.8%   99.9%
p      0.25   0.2    0.15   0.1    0.05   0.025   0.02    0.01    0.005   0.0025  0.001   0.0005
df
1     1.000  1.376  1.963  3.078  6.314  12.706  15.895  31.821  63.657  127.32  318.30  636.61
2     0.817  1.061  1.386  1.886  2.920   4.303   4.849   6.965   9.925  14.089  22.327  31.599
3     0.765  0.979  1.250  1.638  2.353   3.182   3.482   4.541   5.841   7.453  10.215  12.924
4     0.741  0.941  1.190  1.533  2.132   2.776   2.999   3.747   4.604   5.598   7.173   8.610
5     0.727  0.920  1.156  1.476  2.015   2.571   2.757   3.365   4.032   4.773   5.893   6.869
6     0.718  0.906  1.134  1.440  1.943   2.447   2.612   3.143   3.707   4.317   5.208   5.959
7     0.711  0.896  1.119  1.415  1.895   2.365   2.517   2.998   3.499   4.029   4.785   5.408
8     0.706  0.889  1.108  1.397  1.860   2.306   2.449   2.896   3.355   3.833   4.501   5.041
9     0.703  0.883  1.100  1.383  1.833   2.262   2.398   2.821   3.250   3.690   4.297   4.781
10    0.700  0.879  1.093  1.372  1.812   2.228   2.359   2.764   3.169   3.581   4.144   4.587
11    0.697  0.876  1.088  1.363  1.796   2.201   2.328   2.718   3.106   3.497   4.025   4.437
12    0.696  0.873  1.083  1.356  1.782   2.179   2.303   2.681   3.055   3.428   3.930   4.318
13    0.694  0.870  1.079  1.350  1.771   2.160   2.282   2.650   3.012   3.372   3.852   4.221
14    0.692  0.868  1.076  1.345  1.761   2.145   2.264   2.624   2.977   3.326   3.787   4.140

Example: P[t(2) > 2.92] = 0.05, so P[-2.92 < t(2) < 2.92] = 0.90.

For the exercise above: from the p = 0.20 column, t0.20 with 10 degrees of freedom is 0.879.

Abbreviated t table

df     0.95    0.99
2      4.303   9.925
3      3.182   5.841
4      2.776   4.604
5      2.571   4.032
8      2.306   3.355
10     2.228   3.169
20     2.086   2.845
50     2.009   2.678
100    1.984   2.626

Example

Assume that the following five numbers are sampled from a normal distribution: 2, 3, 5, 6, and 9, and that the standard deviation is not known. The first steps are to compute the sample mean and variance:
M = 5
s² = 7.5
Standard error: sM = sqrt(s²/n) = sqrt(7.5/5) = 1.225
df = N - 1 = 4
From the t table, the value for the 95% interval is 2.776.
Lower limit = 5 - (2.776)(1.225) = 1.60
Upper limit = 5 + (2.776)(1.225) = 8.40

Example
Suppose a researcher were interested in estimating the mean reading speed (number of words per minute) of high-school graduates and computing the 95% confidence interval. A sample of 6 graduates was taken, and the reading speeds were: 200, 240, 300, 410, 450, and 600. For these data:
M = 366.6667
sM = 60.9736
df = 6 - 1 = 5
t = 2.571
Lower limit: M - (t)(sM) = 209.904
Upper limit: M + (t)(sM) = 523.430
95% confidence interval: 209.904 <= μ <= 523.430
Thus, the researcher can conclude, based on the rounded-off 95% confidence interval, that the mean reading speed of high-school graduates is between 210 and 523 words per minute.
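A Python sketch of the same computation (an addition to these notes), with the t value taken from the table above:

```python
import math

# Sketch (added): 95% CI with sigma estimated, using the reading-speed data.
data = [200, 240, 300, 410, 450, 600]
n = len(data)

mean = sum(data) / n                                 # ≈ 366.67
s2 = sum((x - mean) ** 2 for x in data) / (n - 1)    # sample variance
sm = math.sqrt(s2 / n)                               # standard error ≈ 60.97
t = 2.571                                            # t for 95%, df = 5
print(mean - t * sm, mean + t * sm)                  # ≈ (209.9, 523.4)
```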

Homework 1

The mean time difference for all 47 subjects is 16.362 seconds and the standard deviation is 7.470 seconds. The standard error of the mean is 1.090.
A t table shows that the critical value of t for 47 - 1 = 46 degrees of freedom is 2.013 (for a 95% confidence interval). The confidence interval is computed as follows:
Lower limit = 16.362 - (2.013)(1.090) = 14.17
Upper limit = 16.362 + (2.013)(1.090) = 18.56
Therefore, the interference effect (difference) for the whole population is likely to be between 14.17 and 18.56 seconds.

Homework 2

The pasteurization process reduces the amount of bacteria found in dairy products, such as milk. The following data represent the counts of bacteria in pasteurized milk (in CFU/mL) for a random sample of 12 pasteurized glasses of milk. Construct a 95% confidence interval for the bacteria count.

NOTE: Each observation is in tens of thousands; so 9.06 represents 9.06 x 10^4.

Prediction with Regression Analysis


The relationship(s) between values of the response variable and corresponding values of the predictor variable(s) is (are) not deterministic. Thus the value of y is estimated given the value of x. The estimated value of the dependent variable is denoted ŷ, and the population slope and intercept are usually denoted β1 and β0.

Linear Regression

The idea is to fit a straight line through the data points. Linear regression models the relationship between the dependent variable and the independent variable(s), and can be extended to multiple dimensions.

Correlation analysis is applied to independent factors: if X increases, what will Y do (increase, decrease, or perhaps not change at all)? In regression analysis a unilateral response is assumed: changes in X result in changes in Y, but changes in Y do not result in changes in X.

Regression Plot

[Scatter plot of m1 vs. vwmkt with fitted line: m1 = 0.0095937 + 0.880436 vwmkt; S = 0.0590370, R-Sq = 31.3%, R-Sq(adj) = 30.8%]

Linear regression means a regression that is linear in the parameters; a linear regression can be non-linear in the variables. Example: Y = β0 + β1X².
Some non-linear regression models can be transformed into a linear regression model (e.g., Y = aX^b Z^c can be transformed into ln Y = ln a + b ln X + c ln Z).

Example

Given one variable X (years of experience), the goal is to predict Y (salary).

X (years)   Y (salary, $1,000)
3           30
8           57
9           64
13          72
3           36
6           43
11          59
21          90
1           20

Questions: Given years of experience, predict salary. When X = 10, what is Y? When X = 25, what is Y? This is known as regression.

For the example data, a = 23.2 and b = 3.5, so the fitted line is ŷ = 23.2 + 3.5x.

For x = 10 years, the prediction of y (salary) is 23.2 + 35 = 58.2 K dollars/year.

Linear Regression Example

[Scatter plot of salary vs. years with the fitted line Y = 3.5X + 23.2]

The least-squares slope and intercept are:

b = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²
a = ȳ - b x̄
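A Python sketch (an addition to these notes) computing the least-squares estimates for the salary data; the results (a ≈ 23.5, b ≈ 3.46) are close to the rounded values a = 23.2, b = 3.5 used above:

```python
# Sketch (added): least-squares slope and intercept for the salary example.
x = [3, 8, 9, 13, 3, 6, 11, 21, 1]          # years of experience
y = [30, 57, 64, 72, 36, 43, 59, 90, 20]    # salary ($1,000)

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
    / sum((xi - x_bar) ** 2 for xi in x)    # slope ≈ 3.46
a = y_bar - b * x_bar                       # intercept ≈ 23.5
print(a, b)
print(a + b * 10)                           # prediction at x = 10: ≈ 58.1
```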

Regression Error

We can also write a regression equation slightly differently, with an explicit error term:

y = a + bx + e

Also called the residual, e is the difference between our estimate ŷ of the dependent variable and the actual value of the dependent variable y.

Unless we have perfect prediction, many of the y values will fall off of the line. The added e in the equation refers to this fact. It would be incorrect to write the equation without the e, because it would suggest that the y scores are completely accounted for by just knowing the slope, x values, and the intercept. Almost always, that is not true. There is some error in prediction, so we need to add an e for error variation into the equation. The actual values of y can be accounted for by the regression line equation (y=a+bx) plus some degree of error in our prediction (the e's).

r correlation coefficient

The correlation between X and Y is expressed by the correlation coefficient r:

r = Σ(xi - x̄)(yi - ȳ) / sqrt( Σ(xi - x̄)² Σ(yi - ȳ)² )

xi = data X, x̄ = mean of data X
yi = data Y, ȳ = mean of data Y


1 >= r >= -1
r = 1: perfect positive linear correlation between two variables
r = 0: no linear correlation (maybe some other correlation)
r = -1: perfect negative linear correlation
Notice that for a perfect correlation, there is a perfect line of points; they do not deviate from that line.

least squares

The principle is to establish a statistical linear relationship between two sets of corresponding data by fitting the data to a straight line by means of the "least squares" technique. The resulting line takes the general form:

y = bx + a

a = intercept of the line with the y-axis b = slope (tangent)


a = 0, b = 1: perfect positive correlation without bias
a ≠ 0: systematic discrepancy (bias, error) between X and Y
b ≠ 1: proportional response or difference between X and Y

Example

Each point represents one student with a certain grade on the exam (x) and time spent on the exam (y). The scatter plot reveals that, in general, longer times on the exam tend to be associated with higher grades (r = 0.64).

ID   Grade (x)   Time (y)   x - x̄    y - ȳ     (x - x̄)(y - ȳ)   (x - x̄)²
1    88          60         8.6      18.55     159.53           73.96
2    96          53         16.6     11.55     191.73           275.56
3    72          22         -7.4     -19.45    143.93           54.76
4    78          44         -1.4     2.55      -3.57            1.96
5    65          34         -14.4    -7.45     107.28           207.36
6    80          47         0.6      5.55      3.33             0.36
7    77          38         -2.4     -3.45     8.28             5.76
8    83          50         3.6      8.55      30.78            12.96
9    79          51         -0.4     9.55      -3.82            0.16
10   68          35         -11.4    -6.45     73.53            129.96
11   84          46         4.6      4.55      20.93            21.16
12   76          36         -3.4     -5.45     18.53            11.56
13   92          48         12.6     6.55      82.53            158.76

r correlation

The Pearson r can be positive or negative, ranging from -1.0 to 1.0. If the correlation is 1.0, the longer the amount of time spent on the exam, the higher the grade will be, without any exceptions. An r value of -1.0 indicates a perfect negative correlation: without an exception, the longer one spends on the exam, the poorer the grade. If r = 0, there is absolutely no relationship between the two variables: on average, longer time spent on the exam does not result in any higher or lower grade. Most often, r is somewhere in between -1.0 and +1.0.

ID    x      x²       y     y²      xy
1     88     7744     60    3600    5280
2     96     9216     53    2809    5088
3     72     5184     22    484     1584
4     78     6084     44    1936    3432
5     65     4225     34    1156    2210
6     80     6400     47    2209    3760
7     77     5929     38    1444    2926
8     83     6889     50    2500    4150
9     79     6241     51    2601    4029
10    68     4624     35    1225    2380
11    84     7056     46    2116    3864
12    76     5776     36    1296    2736
13    92     8464     48    2304    4416
14    80     6400     43    1849    3440
15    67     4489     40    1600    2680
16    78     6084     32    1024    2496
17    74     5476     27    729     1998
18    73     5329     41    1681    2993
19    88     7744     39    1521    3432
20    90     8100     43    1849    3870
Sum   1588   127454   829   35933   66764

ID      Grade (x)  Time (y)  x - x̄    y - ȳ     (x - x̄)(y - ȳ)  (x - x̄)²  (y - ȳ)²
1       88         60        8.6      18.55     159.53          73.96     344.1025
2       96         53        16.6     11.55     191.73          275.56    133.4025
3       72         22        -7.4     -19.45    143.93          54.76     378.3025
4       78         44        -1.4     2.55      -3.57           1.96      6.5025
5       65         34        -14.4    -7.45     107.28          207.36    55.5025
6       80         47        0.6      5.55      3.33            0.36      30.8025
7       77         38        -2.4     -3.45     8.28            5.76      11.9025
8       83         50        3.6      8.55      30.78           12.96     73.1025
9       79         51        -0.4     9.55      -3.82           0.16      91.2025
10      68         35        -11.4    -6.45     73.53           129.96    41.6025
11      84         46        4.6      4.55      20.93           21.16     20.7025
12      76         36        -3.4     -5.45     18.53           11.56     29.7025
13      92         48        12.6     6.55      82.53           158.76    42.9025
14      80         43        0.6      1.55      0.93            0.36      2.4025
15      67         40        -12.4    -1.45     17.98           153.76    2.1025
16      78         32        -1.4     -9.45     13.23           1.96      89.3025
17      74         27        -5.4     -14.45    78.03           29.16     208.8025
18      73         41        -6.4     -0.45     2.88            40.96     0.2025
19      88         39        8.6      -2.45     -21.07          73.96     6.0025
20      90         43        10.6     1.55      16.43           112.36    2.4025
Total   1588       829                          941.4           1366.8    1570.95
Average 79.4       41.45

r = 941.4 / sqrt(1366.8 x 1570.95) = 0.6424
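A Python sketch (an addition to these notes) reproducing r = 0.6424 from the raw data; the three sums match the column totals in the table above:

```python
import math

# Sketch (added): Pearson r for the exam data.
x = [88, 96, 72, 78, 65, 80, 77, 83, 79, 68,
     84, 76, 92, 80, 67, 78, 74, 73, 88, 90]   # grade on exam
y = [60, 53, 22, 44, 34, 47, 38, 50, 51, 35,
     46, 36, 48, 43, 40, 32, 27, 41, 39, 43]   # time on exam

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))  # 941.4
sxx = sum((a - x_bar) ** 2 for a in x)                      # 1366.8
syy = sum((b - y_bar) ** 2 for b in y)                      # 1570.95
print(sxy / math.sqrt(sxx * syy))                           # ≈ 0.6424
```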

r², the square of the correlation coefficient

r² is the proportion of the sum of squares explained in one-variable regression; R² is the proportion of the sum of squares explained in multiple regression.

Is an R-Square < 1.00 Good or bad?

This is both a statistical and a philosophical question. It is quite rare, especially in the social sciences, to get an R-square that is really high (e.g., 98%). The goal is NOT to get the highest R-square per se. Instead, the goal is to develop a model that is both statistically and theoretically sound, creating the best fit with existing data. Do you want just the best fit, or a model that theoretically/conceptually makes sense? Yes, you might get a good fit with nonsensical explanatory variables. But this opens you to spurious/intervening relationships, and it is therefore hard to use the model for explanation.

Why might an R-Square be less than 1.00?


An underdetermined model (need more variables).
Nonlinear relationships.
Measurement error.
Sampling error.
The outcome is not fully predictable/explainable even with all data available; there is a certain amount of unexplainable chaos/static/randomness in the universe (which may be reassuring).
The unit of analysis is too aggregated (e.g., you are predicting mean housing values for a city; you might get better results predicting individual housing prices, or neighborhood housing prices).

Adjusted R2 (R-square)
What is an "Adjusted" R-Square? The Adjusted R-Square takes into account not only how much of the variation is explained, but also the impact of the degrees of freedom. It "adjusts" for the number of variables use. That is, look at the adjusted R- Square to see how adding another variable to the model both increases the explained variance but also lowers the degrees of freedom. Adjusted R2 = 1- (1 - R2 )((n - 1)/(n - k - 1)). As the number of variables in the model increases, the gap between the Rsquare and the adjusted R-square will increase. This serves as a disincentive to simply throwing in a huge number of variables into the model to increase the R-square.

This adjusted value for R-square will be equal to or smaller than the regular R-square. The adjusted R-square adjusts for a bias in R-square: R-square tends to overestimate the variance accounted for compared to an estimate that would be obtained from the population. There are two reasons for the overestimate: a large number of predictors and a small sample size. So, with a large sample and with few predictors, adjusted R-square should be very similar to the R-square value. Researchers and statisticians differ on whether to use the adjusted R-square. It is probably a good idea to look at it to see how much your R-square might be inflated, especially with a small sample and many predictors.
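A tiny Python sketch of this formula (an addition to these notes), illustrating how the adjustment grows with many predictors and a small sample:

```python
# Sketch (added): the adjusted R-square formula from the text.
def adjusted_r2(r2, n, k):
    """r2: R-square; n: sample size; k: number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.90, n=15, k=10))    # 0.65  (big penalty: small n, many k)
print(adjusted_r2(0.90, n=1000, k=10))  # ≈ 0.899 (large sample: tiny penalty)
```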

Example
Suppose we have collected the following sample of 6 observations on age and income:
Find the estimated regression line for the sample of six observations we have collected on age and income: Which is the independent variable and which is the dependent variable for this problem?

Cautions About Simple Linear Regression


Correlation and regression describe only linear relations.
Correlation and the least-squares regression line are not resistant to outliers.
Predictions outside the range of observed data are often inaccurate.
Correlation and regression are powerful tools for describing the relationship between two variables, but be aware of their limitations.

Multiple Prediction

Regression analysis allows us to use more than one independent variable to predict values of y. Take the fat intake and blood cholesterol level study as an example. If we want to predict cholesterol as accurately as possible, we need to know more about diet than just how much fat intake there is. On the island of Crete, they consume a lot of olive oil, so their fat intake is high. This, however, seems to have no dramatic effect on cholesterol (at least the bad cholesterol, the LDLs). They also consume very little cholesterol in their diet, which consists more of fish than high-cholesterol foods like cheese and beef (hopefully this won't be considered libelous in Texas). So, to improve our prediction of blood cholesterol levels, it would be helpful to add another predictor, dietary cholesterol.

From Bivariate to Multiple regression: what changes?

Potentially more explanatory power with more variables.
The ability to control for other variables, and one sees the interaction of the various explanatory variables.
Partial correlations and multicollinearity.
Harder to visualize: drawing a line through three-or-more-dimensional space.
The R² is no longer simply the square of the correlation statistic r.

From Two to Three Dimensions

With simple regression (one predictor) we had only the x-axis and the y-axis. Now we need an axis for x1, for x2, and for y:

Y' = A + b1X1 + b2X2 + ...

where Y' is the predicted score, X1 is the score on the first predictor variable, X2 is the score on the second, etc. The Y intercept is A. The regression coefficients (b1, b2, etc.) are analogous to the slope in simple regression. If we want to predict these points, we now need a regression plane rather than just a regression line.

[Figure: a regression plane fitted through points in (x1, x2, y) space.]

More than one prediction attribute


X1, X2. For example: X1 = years of experience, X2 = age, Y = salary.

With two predictors the model is yi = β0 + β1xi1 + β2xi2 + εi, and the expected response E(yi) = β0 + β1xi1 + β2xi2 forms a response surface over the (x1, x2) space.

[Figure: response surface E(yi) above the point (xi1, xi2) in the x1-x2 plane.]

The parameters β0, β1, β2, ..., βk are called partial regression coefficients. β1 represents the change in y corresponding to a unit increase in x1, holding all the other predictors constant. A similar interpretation can be made for β2, β3, ..., βk.

Regression Statistics: Multiple R = 0.995, R Square = 0.990, Adjusted R Square = 0.989, Standard Error = 0.008, Observations = 30

ANOVA: Regression df = 4, SS = 0.164, MS = 0.041, F = 628.372, Significance F = 0.000; Residual df = 25, SS = 0.002; Total df = 29, SS = 0.165

Variable                                     Coefficient   Std. Error   t Stat    P-value
Intercept                                    0.500         0.008        60.294    0.000
Percent of Gross Hhd Income Spent on Rent   -0.399         0.016       -24.610    0.000
Percent 2-parent families                   -0.288         0.015       -19.422    0.000
Police Anti-Drug Program?                   -0.004         0.004        -1.238    0.227
Active Tenants Group? (1 = yes; 0 = no)     -0.102         0.004       -28.827    0.000

Controlling also for this new variable, the police anti-drug program is no longer statistically significant, and instead the presence of the active tenants group makes the dramatic difference (and look at that great R square!). However, we are not quite done.

SUMMARY OUTPUT

Regression Statistics: Multiple R = 0.928, R Square = 0.861, Adjusted R Square = 0.850, Standard Error = 0.030, Observations = 30

ANOVA: Regression df = 2, SS = 0.149, MS = 0.074, F = 83.484, Significance F = 0.000; Residual df = 27, SS = 0.024; Total df = 29, SS = 0.173

Variable                                  Coefficient   Std. Error   t Stat    P-value   BETA
Intercept                                 0.36582       0.017        20.908    0.000
Percent 2-parent families                -0.2565        0.051        -5.017    0.000     -0.362
Active Tenants Group? (1 = yes; 0 = no)  -0.1246        0.011       -11.347    0.000     -0.821

Since the police variable now has a statistically insignificant t-score, we remove it from the model. (We also remove the income variable, since it also becomes insignificant after we remove the police variable.) We are left with two independent variables: percent of 2-parent families and active tenants group.

Stepwise Regression Algorithms

Backward Elimination
Forward Selection
Stepwise Selection

Backward Elimination
1. Fit the model containing all (remaining) predictors.
2. Test each predictor variable, one at a time, for a significant relationship with y.
3. Identify the variable with the largest p-value. If p > α, remove this variable from the model and return to (1).
4. Otherwise, stop and use the existing model.

Forward Selection
1. Fit all models with one (more) predictor.
2. Test each of these predictor variables for a significant relationship with y.
3. Identify the variable with the smallest p-value. If p < α, add this variable to the model and return to (1).
4. Otherwise, stop and use the existing model.

Stepwise Selection The Stepwise Selection method is basically Forward Selection with Backward Elimination added in at every step.

Stepwise Selection
1. Fit all models with one (more) predictor.
2. Test each of these predictor variables for a significant relationship with y.
3. Identify the variable with the smallest p-value. If p < α, add this variable to the model and return to (1).
4. Now, for the model being considered, test each predictor variable, one at a time, for a significant relationship with y.
5. Identify the variable with the largest p-value. If p > α, remove this variable from the model and return to (1).
6. Otherwise, stop and use the existing model.
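A minimal Python sketch of Backward Elimination (an addition to these notes). It assumes a pandas DataFrame X of candidate predictors and a response y, and uses the statsmodels package for the OLS fit and p-values; the function name backward_elimination is illustrative:

```python
import statsmodels.api as sm

def backward_elimination(X, y, alpha=0.05):
    """Sketch: X is a DataFrame of candidate predictors, y the response."""
    X = sm.add_constant(X)                   # add the intercept column
    while True:
        model = sm.OLS(y, X).fit()           # step 1: fit remaining predictors
        pvals = model.pvalues.drop("const")  # step 2: p-value per predictor
        if pvals.empty:
            return model
        worst = pvals.idxmax()               # step 3: largest p-value
        if pvals[worst] > alpha:
            X = X.drop(columns=worst)        # remove it and refit
        else:
            return model                     # step 4: stop with current model
```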

Linear regression

Review

Multiple Regression Models

Chapter Topics

The Multiple Regression Model
Contribution of Individual Independent Variables
Coefficient of Determination
Categorical Explanatory Variables
Transformation of Variables
Violations of Assumptions
Qualitative Dependent Variables

Multiple Regression Models


Multiple Regression Models Linear

NonLinear

Linear

Dummy Variable

Interaction

PolyNomial

Square Root

Log

Reciprocal

Exponential

Linear Multiple Regression Model

Additional Assumption for Multiple Regression

No exact linear relationship exists among any subset of the explanatory variables (i.e., no perfect multicollinearity).

The Multiple Regression Model

The relationship between one dependent variable and two or more independent variables is a linear function.

Population model:

Yi = β0 + β1X1i + β2X2i + … + βpXpi + εi

where β0 is the population Y-intercept, β1, …, βp are the population slopes, and εi is the random error.

Sample model:

Yi = b0 + b1X1i + b2X2i + … + bpXpi + ei

where Yi is the dependent (response) variable and X1i, …, Xpi are the independent (explanatory) variables for the sample model.

Population Multiple Regression Model

Bivariate model:

Yi = β0 + β1X1i + β2X2i + εi

[Diagram: the observed Yi values scatter about the response plane β0 + β1X1 + β2X2 over the (X1, X2) plane]

Sample Multiple Regression Model

Bivariate model:

Yi = b0 + b1X1i + b2X2i + ei

Fitted values:

Ŷi = b0 + b1X1i + b2X2i

[Diagram: the observed Yi differs from the fitted response plane at (X1i, X2i) by the residual ei]
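The sample coefficients b0, b1 and b2 are obtained by least squares. A minimal numpy sketch on hypothetical data (the numbers below are illustrative only, not from these notes):

import numpy as np

# hypothetical observations of X1, X2 and Y
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
Y  = np.array([3.1, 3.9, 7.2, 7.8, 11.0])

A = np.column_stack([np.ones_like(X1), X1, X2])   # design matrix [1, X1, X2]
b, *_ = np.linalg.lstsq(A, Y, rcond=None)         # b = (b0, b1, b2)
e = Y - A @ b                                     # residuals e_i
print(b, e)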

Parameter Estimation
Linear Multiple Regression Model

Multiple Regression Model: Example

Develop a model for estimating heating oil used for a single-family home in the month of January, based on average temperature and amount of insulation in inches.

Oil (Gal)   Temp (°F)   Insulation (in)
275.30          40            3
363.80          27            3
164.30          40           10
 40.80          73            6
 94.30          64            6
230.90          34            6
366.70           9            6
300.60           8           10
237.80          23           10
121.40          63            3
 31.40          65           10
203.50          41            6
441.10          21            3
323.00          38            3
 52.50          58           10

Interpretation of Estimated Coefficients

Slope (bp): the estimated Y changes by bp for each 1-unit increase in Xp, holding all other variables constant (ceteris paribus). Example: if b1 = -2, then fuel oil usage (Y) is expected to decrease by 2 gallons for each 1-degree increase in temperature (X1), given the inches of insulation (X2).

Y-intercept (b0): the average value of Y when all Xp = 0.

Sample Regression Model: Example

                 Coefficients
Intercept         562.1510092
X Variable 1       -5.436580588
X Variable 2      -20.01232067

Ŷi = 562.151 - 5.437 X1i - 20.012 X2i

For each degree increase in temperature, the average amount of heating oil used decreases by 5.437 gallons, holding insulation constant.

For each one-inch increase in insulation, the use of heating oil decreases by 20.012 gallons, holding temperature constant.
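As a check, the fitted equation above can be reproduced from the 15 observations in the data table (a sketch assuming statsmodels is installed; the printed coefficients should come out close to 562.151, -5.437 and -20.012):

import numpy as np
import statsmodels.api as sm

oil = np.array([275.30, 363.80, 164.30, 40.80, 94.30, 230.90, 366.70,
                300.60, 237.80, 121.40, 31.40, 203.50, 441.10, 323.00, 52.50])
temp = np.array([40, 27, 40, 73, 64, 34, 9, 8, 23, 63, 65, 41, 21, 38, 58])
insulation = np.array([3, 3, 10, 6, 6, 6, 6, 10, 10, 3, 10, 6, 3, 3, 10])

X = sm.add_constant(np.column_stack([temp, insulation]))
fit = sm.OLS(oil, X).fit()
print(fit.params)   # intercept, temperature slope, insulation slope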

Evaluating the Model

Evaluating Multiple Regression Model Steps

Examine variation measures.
Test parameter significance:
- overall model
- portions of the model
- individual coefficients

Variation Measures

Coefficient of multiple determination:

r²Y.12…p = Explained variation / Total variation = SSR / SST

r² = 0 means the X variables taken together do not explain any of the variation in Y.

Adjusted Coefficient of Multiple Determination

NOT simply the proportion of variation in Y explained by all the X variables taken together; it also reflects the sample size and the number of independent variables. It is smaller than r²Y.12…p and is sometimes used to compare models. (A sketch of both computations follows.)
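A small sketch of both measures, using the SSR and SST from the heating-oil ANOVA that appears later in these notes (n = 15 observations, p = 2 predictors):

def r_squared(ssr, sst):
    return ssr / sst

def adjusted_r_squared(r2, n, p):
    # penalizes extra predictors: 1 - (1 - r^2)(n - 1)/(n - p - 1)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

r2 = r_squared(228014.6263, 236135.2293)   # heating-oil SSR and SST
print(r2, adjusted_r_squared(r2, n=15, p=2))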

Simple and Multiple Regression Compared:Example

Two simple regressions:
ABSENCES = α + β1·AUTONOMY
ABSENCES = α + β2·SKILLVARIETY

Multiple regression:
ABSENCES = α + β1·AUTONOMY + β2·SKILLVARIETY

Overlap in Explanation

SIMPLE REGRESSION: AUTONOMY
Multiple R 0.169171 | R Square 0.028619 | Adjusted R Square 0.027709 | Standard Error 12.443 | Observations 1069
ANOVA:  Regression df 1, SS 4867.198, MS 4867.198, F 31.43612, Significance F 2.62E-08
        Residual   df 1067, SS 165201.7, MS 154.8282
        Total      df 1068, SS 170068.9

SIMPLE REGRESSION: SKILL VARIETY
Multiple R 0.193838 | R Square 0.037573 | Adjusted R Square 0.036671 | Standard Error 12.38552 | Observations 1069
ANOVA:  Regression df 1, SS 6390.011, MS 6390.011, F 41.6556, Significance F 1.65E-10
        Residual   df 1067, SS 163678.9, MS 153.401
        Total      df 1068, SS 170068.9

MULTIPLE REGRESSION
Multiple R 0.231298 | R Square 0.053499 | Adjusted R Square 0.051723 | Standard Error 12.28837 | Observations 1069
ANOVA:  Regression df 2, SS 9098.483, MS 4549.242, F 30.1266
        Residual   df 1066, SS 160970.4, MS 151.0041
        Total      df 1068, SS 170068.9

0.06619206   sum of simple R²
0.05349881   multiple R²
0.01269325   overlap attributed to both

11257.2098   sum of regression sums of squares
 9098.4831   multiple-regression sum of squares
 2158.7267   overlap
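The overlap arithmetic above, in a few lines (values taken from the outputs):

# When predictors are correlated, the simple r-squares sum to more than the
# multiple r-square; the excess is the variation attributable to both.
r2_autonomy, r2_skill, r2_multiple = 0.028619, 0.037573, 0.053499
overlap = (r2_autonomy + r2_skill) - r2_multiple
print(overlap)   # about 0.0127, matching the figure above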

Testing Parameters

Test for Overall Significance: Example Solution

H0: β1 = β2 = … = βp = 0
H1: at least one βi ≠ 0
α = .05, df = 2 and 12
Critical value: F = 3.89

Test statistic: F = 168.47

Decision: reject H0 at α = 0.05.
Conclusion: there is evidence that at least one independent variable affects Y.
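The decision can be checked with scipy (assumed available): compute the critical value for α = 0.05 with df = (2, 12) and compare it with the test statistic:

from scipy.stats import f

f_crit = f.ppf(0.95, dfn=2, dfd=12)
print(f_crit)            # about 3.89
print(168.47 > f_crit)   # True -> reject H0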

Test for Significance: Individual Variables

Shows whether there is a linear relationship between the variable Xi and Y. Uses the t test statistic.

Hypotheses:
H0: βi = 0 (no linear relationship)
H1: βi ≠ 0 (linear relationship between Xi and Y)

t Test Statistic Excel Output: Example

               Coefficients   Standard Error    t Stat
Intercept       562.1510092    21.09310433      26.65094
X Variable 1     -5.4365806     0.336216167    -16.1699    <- t statistic for X1 (temperature)
X Variable 2    -20.012321      2.342505227     -8.54313   <- t statistic for X2 (insulation)

t = bi / S(bi)

t Test: Example Solution

Does temperature have a significant effect on monthly consumption of heating oil? Test at α = 0.05.

H0: β1 = 0
H1: β1 ≠ 0
df = 12; critical values ±2.1788 (.025 in each tail)

Test statistic: t = -16.1699
Decision: reject H0 at α = 0.05.
Conclusion: there is evidence of a significant effect of temperature on oil consumption.
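The two-tailed critical value used above, again via scipy (assumed available):

from scipy.stats import t

t_crit = t.ppf(1 - 0.05 / 2, df=12)
print(t_crit)                    # about 2.1788
print(abs(-16.1699) > t_crit)    # True -> reject H0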

Example: Analysis of job earnings


What is the impact of employer tenure (ERTEN), unemployment (UNEM) and education (EDU) on job earnings (JEARN)?

Example: Analysis of job earnings

[Output omitted: the correlation matrix, ANOVA table, and coefficient estimates for the JEARN regression]

Testing Model Portions

Examines the contribution of a set of X variables to the relationship with Y.

Null hypothesis: the variables in the set do not significantly improve the model when all other variables are included.
Alternative hypothesis: at least one variable in the set is significant.

Testing Model Portions

Only a one-tail test. Requires comparison of two regressions:
- one regression includes everything;
- one regression includes everything except the portion to be tested.

Testing Model Portions: Test Statistic

Test H0: β1 = β2 = 0 in a 3-variable model:

F = { [SSR(X1, X2, X3) - SSR(X3)] / k } / MSE(X1, X2, X3)

SSR(X1, X2, X3) and MSE(X1, X2, X3) come from the ANOVA section of the regression of Yi = b0 + b1X1i + b2X2i + b3X3i; SSR(X3) comes from the ANOVA section of the regression of Yi = b0 + b3X3i.

Testing Portions of Model: SSR

Contribution of X1 and X2 given that X3 has been included:

SSR(X1 and X2 | X3) = SSR(X1, X2, X3) - SSR(X3)

again taking SSR(X1, X2, X3) from the regression of Yi = b0 + b1X1i + b2X2i + b3X3i and SSR(X3) from the regression of Yi = b0 + b3X3i.

Partial F Test for the Contribution of a Set of X Variables

Hypotheses:
H0: the variables Xi, … do not significantly improve the model, given that all the others are included.
H1: the variables Xi, … significantly improve the model, given that all the others are included.

Test statistic, with df = k and (n - p - 1):

F = [ SSR(Xi, … | all others) / k ] / MSE

where k = the number of variables tested.

Testing Portions of Model: Example

Test at the α = .05 level to determine whether the variable of average temperature significantly improves the model, given that insulation is included.

H0: X1 does not improve the model (X2 included)
H1: X1 does improve the model
α = .05, df = 1 and 12; critical value = 4.75

ANOVA (for X1 and X2):
              SS             MS
Regression    228014.6263    114007.313
Residual        8120.603016     676.716918
Total         236135.2293

ANOVA (for X2 only):
              SS
Regression     51076.47
Residual      185058.8
Total         236135.2

F = [SSR(X1, X2) - SSR(X2)] / MSE = (228,015 - 51,076) / 676.717 = 261.47

Conclusion: reject H0; X1 does improve the model.
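The same computation, spelled out from the two ANOVA tables:

ssr_full = 228014.6263      # SSR for the model with X1 and X2
ssr_reduced = 51076.47      # SSR for the model with X2 only
mse_full = 676.716918       # MSE for the full model
k = 1                       # number of variables being tested (X1)

F = (ssr_full - ssr_reduced) / k / mse_full
print(F)   # about 261.47, well above the critical value of 4.75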

Do I need to do this for one variable?

The partial F test for the inclusion of a single variable, after all other variables are included in the model, is IDENTICAL to the t test of the slope for that variable (F = t²). The only reason to do a partial F test is to test several variables together.

Example: Collinear Variables

20,000 executives in 439 corporations; dependent variable = base pay + bonus.

                           Individual simple    Multiple-regression
                           regression R²        contribution to R²
Company dummies                .33                  .08
Occupational dummies           .52                  .022
Position in hierarchy          .69                  .104
Human capital variables        .28                  .032
Shared                          —                   .632
TOTAL                           —                   .87

Backup slides

Multiple Regression

The value of the outcome variable depends on several explanatory variables.

F-test: judges whether the explanatory variables in the model adequately describe the outcome variable.
t-test: applies to each individual explanatory variable; a significant t indicates that the explanatory variable has an effect on the outcome variable while controlling for the other Xs.
t-ratio: used to judge the relative importance of an explanatory variable.

Problem of Multicollinearity
When explanatory variables are correlated, it is difficult to interpret the effect of each explanatory variable on the outcome. Check by:
- the correlation coefficient matrix (see next slide);
- an F-test that is significant while the individual t-tests are not;
- large changes in the regression coefficients when variables are added or deleted (variance inflation). A variance inflation factor (VIF) greater than 4 or 5 indicates multicollinearity; a code sketch for computing VIFs follows.
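A minimal VIF sketch using statsmodels (assumed available); X is a hypothetical DataFrame holding the explanatory variables:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.Series:
    Xc = sm.add_constant(X)
    # column 0 is the constant; report one VIF per explanatory variable
    return pd.Series(
        [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
        index=X.columns,
    )

Variables with a VIF above 4 or 5 are candidates for removal before interpreting the coefficients.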

Example of a Matrix Plot

This matrix plot comprises several scatter plots and provides visual information as to whether variables are correlated. The arrow points at a scatter plot where two explanatory variables are correlated.
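One way to draw such a matrix plot in Python (pandas and matplotlib assumed available); the data frame below is synthetic, with x1 and x2 deliberately correlated:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 + 0.1 * rng.normal(size=100),   # strongly correlated with x1
    "x3": rng.normal(size=100),              # independent of the others
})

scatter_matrix(df, figsize=(8, 8), diagonal="hist")
plt.show()   # the x1-x2 panel shows the tell-tale diagonal band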

Selecting the most Economic Model


The purpose is to find the smallest number of explanatory variables that make the maximum contribution to explaining the outcome. After excluding variables that may be causing multicollinearity, examine the table of t-ratios in the full model; the variables with a significant t are included in the subset. In the Analysis of Variance table, examine the column headed SEQ SS and check that the candidate variables are indeed making a sizable contribution to the regression sum of squares.

Stepwise Regression Analysis

Stepwise starts by finding the explanatory variable with the highest R². It then checks each of the remaining variables until the pair of variables with the highest R² is found, then repeats the process until the three variables with the highest R² are found, and so on. The overall R² gets larger as more variables are added. Stepwise may be useful in the early, exploratory stage of data analysis, but it should not be relied upon for the confirmatory stage.

Is the Model Adequate?


Judged by the following:
- The R² value: an increase in R² on adding another variable gives a useful hint; adjusted R² is a more sensitive measure.
- The smallest value of s (the standard deviation of the residuals).
- The Cp statistic: the model with the smallest Cp is used, such that the Cp value is closest to the number of parameters in the model.

Confidence Interval Estimate for the Slope

Provide the 95% confidence interval for the population slope β1 (the effect of temperature on oil consumption):

b1 ± t(n-p-1) · S(b1)

               Coefficients    Lower 95%      Upper 95%
Intercept       562.151009      516.193084     608.108935
X Variable 1     -5.4365806      -6.169133      -4.704029
X Variable 2    -20.012321      -25.116201     -14.908440

-6.169 ≤ β1 ≤ -4.704

The average consumption of oil is reduced by between 4.70 and 6.17 gallons for each 1 °F increase in temperature, in houses with the same insulation.
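Reproducing the interval for the temperature slope from the output above (n = 15, p = 2, so n - p - 1 = 12 degrees of freedom; scipy assumed):

from scipy.stats import t

b1, s_b1 = -5.4365806, 0.336216167      # slope and standard error from the output
t_crit = t.ppf(0.975, df=12)
print(b1 - t_crit * s_b1, b1 + t_crit * s_b1)   # about (-6.169, -4.704)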
