
CE 459 Statistics

Assistant Prof. Muhammet Vefa AKPINAR

[Histogram of vehicle speeds: No of obs vs. upper boundaries (x <= boundary), 50 to 100 in steps of 5, with the expected normal curve overlaid.]

Lecture Notes
What is Statistics
Frequency Distribution
Descriptive Statistics
Normal Probability Distribution
Sampling Distribution of the Mean
Simple Linear Regression & Correlation
Multiple Regression & Correlation


INTRODUCTION
Criticism
There is a general perception that statistical knowledge is all too frequently intentionally misused, by finding ways to interpret the data that are favorable to the presenter. (A famous quote, variously attributed, but thought to be from Benjamin Disraeli, is: "There are three types of lies: lies, damned lies, and statistics.") Indeed, the well-known book How to Lie with Statistics by Darrell Huff discusses many cases of deceptive uses of statistics, focusing on misleading graphs. By choosing (or rejecting, or modifying) a certain sample, results can be manipulated; throwing out outliers is one means of doing so. This may be the result of outright fraud or of subtle and unintentional bias on the part of the researcher.


WHAT IS STATISTICS?
Definition

Statistics is a group of methods used to collect, analyze, present, and interpret data and to make decisions.

What is Statistics ?

The American Heritage Dictionary defines statistics as: "The mathematics of the collection, organization, and interpretation of numerical data, especially the analysis of population characteristics by inference from sampling."

Merriam-Webster's Collegiate Dictionary definition is: "A branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data." The word statistics is also the plural of statistic (singular), which refers to the result of applying a statistical algorithm to a set of data, as in employment statistics, accident statistics, etc.


In applying statistics to a scientific, industrial, or societal problem, one begins with a process or population to be studied. This might be a population of people in a country, of crystal grains in a rock, or of goods manufactured by a particular factory during a given period. For practical reasons, rather than compiling data about an entire population, one usually instead studies a chosen subset of the population, called a sample. Data are collected about the sample in an observational or experimental setting. The data are then subjected to statistical analysis, which serves two related purposes: description and inference.


Descriptive statistics and Inferential statistics.

Statistical data analysis can be subdivided into descriptive statistics and inferential statistics. Descriptive statistics is concerned with exploring, visualizing, and summarizing data, but without fitting the data to any models. This kind of analysis is used to explore the data in the initial stages of data analysis. Since no models are involved, it cannot be used to test hypotheses or to make testable predictions. Nevertheless, it is a very important part of analysis that can reveal many interesting features in the data. Descriptive statistics can be used to summarize the data, either numerically or graphically, to describe the sample. Basic examples of numerical descriptors include the mean and standard deviation. Graphical summarizations include various kinds of charts and graphs.

Inferential statistics is the next stage in data analysis and involves the identification of a suitable model. The data is then fit to the model to obtain an optimal estimation of the model's parameters. The model then undergoes validation by testing either predictions or hypotheses of the model. Models based on a unique sample of data can be used to infer generalities about features of the whole population. Inferential statistics is used to model patterns in the data, accounting for randomness and drawing inferences about the larger population. These inferences may take the form of answers to yes/no questions (hypothesis testing), estimates of numerical characteristics (estimation), forecasting of future observations, descriptions of association (correlation), or modeling of relationships (regression). Other modeling techniques include ANOVA, time series, and data mining.

Population and sample.

A population consists of all elements (individuals, items, or objects) whose characteristics are being studied. The population that is being studied is also called the target population.

A portion of the population selected for study is referred to as a sample.

Measures of Central Tendency


The central tendency of a dataset, i.e. the centre of a frequency distribution, is most commonly measured by the 3 Ms:

Mean (arithmetic mean, average): the sum of all measurements divided by the number of measurements.
Median: a number such that at most half of the measurements are below it and at most half of the measurements are above it.
Mode: the most frequent measurement in the data.

Mean
The sample mean, ȳ, is the arithmetic average of a data set. It is used to estimate the population mean, μ. It is calculated by taking the sum of the observed values (yi) divided by the number of observations (n):

ȳ = (y1 + y2 + ... + yn) / n = Σ yi / n

Example: the historical transmogrifier average unit production costs ($K) for ten systems were 22.2, 17.3, 11.8, 9.6, 8.8, 7.6, 6.8, 3.2, 1.7, 1.6, so

ȳ = (22.2 + 17.3 + ... + 1.6) / 10 = $9.06K

The Mode

The mode, symbolized by Mo, is the most frequently occurring score value. If the scores for a given sample distribution are:
32 32 35 36 37 38 38 39 39 39 40 40 42 45

then the mode would be 39, because a score of 39 occurs 3 times, more than any other score.


A distribution may have more than one mode if the two most frequently occurring scores occur the same number of times. For example, if the earlier score distribution were modified as follows:
32 32 32 36 37 38 38 39 39 39 40 40 42 45

then there would be two modes, 32 and 39. Such distributions are called bimodal. The frequency polygon of a bimodal distribution is presented below.


Example of Mode
Measurements (x): 3 5 1 1 4 7 3 8 3

Mode: 3

Notice that it is possible for a data set not to have any mode.

Mode

The Mode is the value of the data set that occurs most frequently.
Example: 1, 2, 4, 5, 5, 6, 8. Here the Mode is 5, since 5 occurred twice and no other value occurred more than once.
Data sets can have more than one mode, while the mean and median have one unique value.
Data sets can also have NO mode, for example: 1, 3, 5, 6, 7, 8, 9. Here, no value occurs more frequently than any other, therefore no mode exists. (You could also argue that this data set contains 7 modes, since each value occurs as frequently as every other.)

Example of Mode
Measurements (x): 3 5 5 1 7 2 6 7 0 4

In this case the data have two modes, 5 and 7; both measurements are repeated twice.

Median

Computation of Median When there is an odd number of numbers, the median is simply the middle number. For example, the median of 2, 4, and 7 is 4.
When there is an even number of numbers, the median is the mean of the two middle numbers. Thus, the median of the numbers 2, 4, 7, 12 is (4+7)/2 = 5.5.


Example of Median
Measurements (x): 3 5 5 1 7 2 6 7 0 4   (sum = 40)
Ranked:           0 1 2 3 4 5 5 6 7 7

Median: (4 + 5)/2 = 4.5

Notice that only the two central values are used in the computation. The median is not sensitive to extreme values.

median

Example: ordered rim diameters (cm) from two excavation units; the median is the middle of each ordered column.

unit 1: 9.7 11.5 11.6 12.1 12.4 12.6 12.9 13.2 13.8 14.0 15.5 15.6 16.2 16.4
unit 2: 9.0 11.2 11.3 11.7 12.2 12.5 13.1 13.5 13.6 14.8 16.3 26.9

Median

The Median is the middle observation of an ordered (from low to high) data set.
Examples:
1, 2, 4, 5, 5, 6, 8: the middle observation is 5, so the median is 5.
1, 3, 4, 4, 5, 7, 8, 8: there is no single middle observation, so we take the average of the two observations at the center: (4 + 5)/2 = 4.5.

[Figure: in a symmetric distribution, Mode = Median = Mean; in a skewed distribution, the mode, median, and mean separate.]

Dispersion Statistics

The Mean, Median and Mode by themselves are not sufficient descriptors of a data set.
Example:
Data Set 1: 48, 49, 50, 51, 52
Data Set 2: 5, 15, 50, 80, 100
Note that the Mean and Median for both data sets are identical, but the data sets are glaringly different! The difference is in the dispersion of the data points.
The dispersion statistics we will discuss are: Range, Variance, Standard Deviation.

Range

The Range is simply the difference between the largest and smallest observation in a data set.
Example:
Data Set 1: 48, 49, 50, 51, 52
Data Set 2: 5, 15, 50, 80, 100
The Range of data set 1 is 52 - 48 = 4; the Range of data set 2 is 100 - 5 = 95. So, while both data sets have the same mean and median, the dispersion of the data, as depicted by the range, is much smaller in data set 1.

deviation score

A deviation score is a measure of by how much each point in a frequency distribution lies above or below the mean for the entire dataset:

deviation score = X - X̄

where X is the raw score and X̄ is the mean. Note that if you add all the deviation scores for a dataset together, the sum is always zero.


Variance

The Variance, s2, represents the amount of variability of the data relative to their mean As shown below, the variance is the average of the squared deviations of the observations about their mean

s² = Σ(yi - ȳ)² / (n - 1)        σ² = Σ(yi - μ)² / N

The Variance s² is the sample variance, and is used to estimate the actual population variance, σ².

Standard Deviation

The Variance is not a common sense statistic because it describes the data in terms of squared units The Standard Deviation, s, is simply the square root of the variance

s = sqrt( Σ(yi - ȳ)² / (n - 1) )        σ = sqrt( Σ(yi - μ)² / N )

The Standard Deviation, s, is the sample standard deviation, and is used to estimate the actual population standard deviation, σ. The sample standard deviation is measured in the same units as the data from which it is calculated.

Standard Deviation
System    yi (FY97$K)   yi - ȳ   (yi - ȳ)²
1         22.2          13.1     172.7
2         17.3          8.2      67.9
3         11.8          2.7      7.5
4         9.6           0.5      0.3
5         8.8           -0.3     0.1
6         7.6           -1.5     2.1
7         6.8           -2.3     5.1
8         3.2           -5.9     34.3
9         1.7           -7.4     54.2
10        1.6           -7.5     55.7
Average   9.06

s² = Σ(yi - ȳ)² / (n - 1) = (172.7 + 67.9 + ... + 55.7) / (10 - 1) = 399.8/9 = 44.4 ($K²)

s = sqrt(44.4 $K²) = 6.67 ($K)

This number, $6.67K, represents the average estimating error for predicting subsequent observations. In other words: on average, when estimating the cost of transmogrifiers that belong to the same population as the ten systems above, we would expect to be off by $6.67K.

Variance and the closely-related standard deviation

The variance and the closely-related standard deviation are measures of how spread out a distribution is. In other words, they are measures of variability.
In order to define the amount of deviation of a dataset from the mean, calculate the mean of all the squared deviation scores, i.e. the variance. The variance is computed as the average squared deviation of each number from its mean. For example, for the numbers 1, 2, and 3, the mean is 2 and the variance is ((1-2)² + (2-2)² + (3-2)²)/3 = 2/3 ≈ 0.667.


The variance in a population is:

σ² = Σ(X - μ)² / N

The variance in a sample is:

s² = Σ(X - X̄)² / (n - 1)

where μ (or X̄) is the mean and N (or n) is the number of scores.


The standard deviation is the square root of the variance.


Variance and Standard Deviation


Example of Mean
Measurements (x):       3  5  5  1  7  2  6  7  0  4   (sum = 40)
Deviations (x - mean): -1  1  1 -3  3 -2  2  3 -4  0   (sum = 0)

MEAN = 40/10 = 4

Notice that the sum of the deviations is 0, and that every single observation enters into the computation of the mean.

Example of Variance
Measurements (x):       3  5  5  1  7  2  6  7  0  4   (sum = 40)
Deviations (x - mean): -1  1  1 -3  3 -2  2  3 -4  0   (sum = 0)
Squared deviations:     1  1  1  9  9  4  4  9 16  0   (sum = 54)

Variance = 54/9 = 6

The variance is a measure of spread. Notice that the larger the deviations (positive or negative), the larger the variance.

The standard deviation

The standard deviation is defined as the square root of the variance. In the previous example, Variance = 6, so Standard deviation = sqrt(6) = 2.45.
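A minimal Python sketch (an addition to these notes, standard library only) that reproduces the mean, variance, and standard deviation computed in the two examples above:

```python
# Sketch (added): the mean/variance/SD example above.
data = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]

n = len(data)
mean = sum(data) / n                                   # 40/10 = 4
deviations = [x - mean for x in data]                  # these sum to 0
variance = sum(d ** 2 for d in deviations) / (n - 1)   # 54/9 = 6
std_dev = variance ** 0.5                              # sqrt(6) ≈ 2.45

print(mean, sum(deviations), variance, round(std_dev, 2))
```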

Observed Vehicle Velocity

Velocities (km/h) of 50 observed vehicles:

67 73 81 72 76 75 85 77 68 84
76 93 73 79 88 73 60 93 71 59
74 62 95 78 63 72 66 78 82 75
96 70 89 61 75 95 66 79 83 71
76 65 71 75 65 80 73 57 88 78


Mean, Median, Standard Deviation

Valid N   Range   Mean    Median   Minimum   Maximum   Variance   Std. Dev.
50        39      75.62   75       57        96        96.362     9.816


Frequency Table
Class   Interval        Midpoint   Frequency   Relative freq. %   Cumulative freq.   Relative cum. freq. %
1       50 < x <= 55    52.5       0           0                  0                  0
2       55 < x <= 60    57.5       3           6                  3                  6
3       60 < x <= 65    62.5       5           10                 8                  16
4       65 < x <= 70    67.5       5           10                 13                 26
5       70 < x <= 75    72.5       14          28                 27                 54
6       75 < x <= 80    77.5       10          20                 37                 74
7       80 < x <= 85    82.5       5           10                 42                 84
8       85 < x <= 90    87.5       3           6                  45                 90
9       90 < x <= 95    92.5       4           8                  49                 98
10      95 < x <= 100   97.5       1           2                  50                 100


Frequency Table

A cumulative frequency distribution is a plot of the number of observations falling in or below an interval. A frequency table is constructed by dividing the scores into intervals and counting the number of scores in each interval. The actual number of scores, as well as the percentage of scores in each interval, are displayed; cumulative frequencies are also usually displayed. Here the X-axis shows the various intervals of vehicle speed.

Selecting the Interval Size

In order to find a starting interval size, the first step is to find the range of the data by subtracting the smallest score from the largest. In the case of the example data, the range is 96 - 57 = 39. The range is then divided by the number of desired intervals, with a suggested starting number of intervals being ten (10). In the example, 39/10 = 3.9; the nearest odd integer value, 5, is used as the starting point for the selection of the interval size.
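A small Python sketch (an addition to these notes) that builds the frequency table of the previous slide from the 50 observed velocities, using the same 5 km/h intervals:

```python
# Sketch (added): bin the 50 speeds into the intervals 50 < x <= 55, ..., 95 < x <= 100.
speeds = [67, 73, 81, 72, 76, 75, 85, 77, 68, 84, 76, 93, 73, 79, 88,
          73, 60, 93, 71, 59, 74, 62, 95, 78, 63, 72, 66, 78, 82, 75,
          96, 70, 89, 61, 75, 95, 66, 79, 83, 71, 76, 65, 71, 75, 65,
          80, 73, 57, 88, 78]

n = len(speeds)
cumulative = 0
for lower in range(50, 100, 5):
    upper = lower + 5
    freq = sum(1 for x in speeds if lower < x <= upper)
    cumulative += freq
    print(f"{lower} < x <= {upper}: f={freq:2d}  rel={100*freq/n:3.0f}%  "
          f"cum={cumulative:2d}  rel.cum={100*cumulative/n:4.0f}%")
```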


Histogram

A histogram is constructed from a frequency table. The intervals are shown on the X-axis and the number of scores in each interval is represented by the height of a rectangle located above the interval. A histogram of the vehicle speed from the dataset is shown below. The shapes of histograms will vary depending on the choice of the size of the intervals.
[Histogram of vehicle speed: No of obs on the Y-axis vs. upper boundaries (x <= boundary) from 50 to 100 km/h in steps of 5 on the X-axis, with the expected normal curve overlaid.]

There are many different-shaped frequency distributions:


A frequency polygon is a graphical display of a frequency table. The intervals are shown on the X-axis and the number of scores in each interval is represented by the height of a point located above the middle of the interval. The points are connected so that together with the X-axis they form a polygon.


Spread, Dispersion, Variability A variable's spread is the degree to which scores on the variable differ from each other. If every score on the variable were about equal, the variable would have very little spread. There are many measures of spread. The distributions shown below have the same mean but differ in spread: The distribution on the bottom is more spread out. Variability and dispersion are synonyms for spread.


Skew


Further Notes
When the Mean is greater than the Median, the data distribution is skewed to the Right.
When the Median is greater than the Mean, the data distribution is skewed to the Left.
When the Mean and Median are very close to each other, the data distribution is approximately symmetric.

The Effect of Skew on the Mean and Median

The distribution shown below has a positive skew. The mean is larger than the median.

For example, if a test was very difficult and almost everyone in the class did very poorly on it, the resulting distribution would most likely be positively skewed.

The distribution shown below has a negative skew. The mean is smaller than the median.


Probability

Likelihood or chance of occurrence. The probability of an event is the theoretical relative frequency of the event in a model of the population.


Normal Distribution or Normal Curve

The normal distribution is probably the most important and most widely used continuous distribution. A random variable following it is known as a normal random variable, and its probability distribution is called a normal distribution. The normal distribution is a theoretical function commonly used in inferential statistics as an approximation to sampling distributions. In general, the normal distribution provides a good model for many random variables.


In a normal distribution:

68% of values fall within ±1 SD of the mean
95% of values fall within ±2 SD (more precisely, ±1.96 SD)
99.7% of values fall within ±3 SD


The normal distribution function

The normal distribution function is determined by the following formula:

f(x) = (1 / (σ sqrt(2π))) e^( -(x - μ)² / (2σ²) )

where μ is the mean, σ is the standard deviation, e is Euler's number (2.718...), and π is the constant pi (3.141...).
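A one-function Python sketch of this formula (an addition to these notes); the example values reuse the vehicle-speed mean and standard deviation from earlier:

```python
import math

def normal_pdf(x, mu, sigma):
    """f(x) = 1/(sigma*sqrt(2*pi)) * exp(-(x - mu)**2 / (2*sigma**2))"""
    coef = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coef * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Peak of the vehicle-speed curve (mean 75.62, sd 9.82 from earlier):
print(normal_pdf(75.62, 75.62, 9.82))   # ≈ 0.0406
```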


Characteristics of the Normal Distribution:

It is bell shaped and is symmetrical about its mean, with scores more concentrated in the middle than in the tails.
It is asymptotic to the horizontal axis, i.e., it extends indefinitely in either direction from the mean.
It is a family of curves: every unique pair of mean and standard deviation defines a different normal distribution. Thus, the normal distribution is completely described by two parameters: mean and standard deviation.
There is a strong tendency for the variable to take a central value; it is unimodal, i.e., values mound up only in the center of the curve.
The frequency of deviations falls off rapidly as the deviations become larger.


The total area under the curve sums to 1; the area of the distribution on each side of the mean is 0.5. The area under the curve between any two scores is a PROBABILITY: the probability that a random variable will have a value between any two points is equal to the area under the curve between those points. Positive and negative deviations from the central value are equally likely.


Examples of normal distributions

Notice that they differ in how spread out they are; the area under each curve is the same. The height of a normal distribution can be specified mathematically in terms of two parameters: the mean (μ) and the standard deviation (σ). The two parameters, μ and σ, each change the shape of the distribution in a different manner.


Changes in μ without changes in σ

Changes in μ, without changes in σ, result in moving the distribution to the right or left, depending upon whether the new value of μ is larger or smaller than the previous value, but do not change the shape of the distribution.


Changes in the value of σ

Changes in the value of σ change the shape of the distribution without affecting the midpoint, because σ affects the spread, or dispersion, of scores. The larger the value of σ, the more dispersed the scores; the smaller the value, the less dispersed. The figure below demonstrates the effect of increasing the value of σ.


THE STANDARD NORMAL CURVE

The standard normal curve is a member of the family of normal curves with μ = 0.0 and σ = 1.0.

Note that integral calculus is used to find the area under the normal distribution curve. However, this can be avoided by transforming any normal distribution to fit the standard normal distribution. This conversion is done by rescaling the normal distribution axis from its true units (time, weight, dollars, ...) to a standard measure called a Z score or Z value.


Standard Scores (z Scores)

A Z score is the number of standard deviations that a value, X, is away from the mean. Standard scores are therefore useful for comparing data points in different distributions. If the value of X is greater than the mean, the Z score is positive; if the value of X is less than the mean, the Z score is negative. The Z score equation is:

Z = (X - μ) / σ

where Z is the z-score for the value of X.
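A short Python sketch (an addition to these notes) of the Z transformation; the standard normal CDF Φ(z) is expressed through the standard library's error function, so the table values below can be reproduced without a statistics package. The speed example is illustrative:

```python
import math

def z_score(x, mu, sigma):
    """Number of standard deviations x lies from the mean."""
    return (x - mu) / sigma

def phi(z):
    """Standard normal CDF, P(Z <= z), via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Example: a vehicle speed of 90 km/h in the earlier dataset
# (mean 75.62, standard deviation 9.82):
z = z_score(90, 75.62, 9.82)      # ≈ 1.46
print(z, phi(z))                  # P(X <= 90) ≈ 0.93
print(phi(z) - 0.5)               # area between 0 and z, as tabulated: ≈ 0.43
```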


Table of the Standard Normal (z) Distribution


Each entry gives the area under the standard normal curve between 0 and z. The whole-number and tenths portion of z appears in the first column; the hundredths portion appears across the top.

z     0.00   0.01   0.02   0.03   0.04   0.05   0.06   0.07   0.08   0.09
0.0  0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1  0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2  0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3  0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4  0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5  0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6  0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7  0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8  0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9  0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0  0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1  0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2  0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3  0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4  0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5  0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6  0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7  0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8  0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9  0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0  0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1  0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857


Three areas on a standard normal curve

[Figure: three shaded areas on a standard normal curve: (1) from -infinity to Z; (2) from -Z to +Z; (3) the two tails, -infinity to -Z plus +Z to +infinity.]

Z     -inf to Z   Z to +inf   -Z to +Z   both tails   both tails (PPM)
0.0   0.50000     0.50000     0.00000    1.00000      1,000,000
0.1   0.53983     0.46017     0.07966    0.92034      920,344
0.2   0.57926     0.42074     0.15852    0.84148      841,481
0.3   0.61791     0.38209     0.23582    0.76418      764,177
0.4   0.65542     0.34458     0.31084    0.68916      689,157
0.5   0.69146     0.30854     0.38292    0.61708      617,075
0.6   0.72575     0.27425     0.45149    0.54851      548,506
0.7   0.75804     0.24196     0.51607    0.48393      483,927
0.8   0.78814     0.21186     0.57629    0.42371      423,711
0.9   0.81594     0.18406     0.63188    0.36812      368,120
1.0   0.84134     0.15866     0.68269    0.31731      317,311
1.1   0.86433     0.13567     0.72867    0.27133      271,332
1.2   0.88493     0.11507     0.76986    0.23014      230,139
1.3   0.90320     0.09680     0.80640    0.19360      193,601
1.4   0.91924     0.08076     0.83849    0.16151      161,513
1.5   0.93319     0.06681     0.86639    0.13361      133,614

The area between Z-scores of -1.00 and +1.00 is .68, or 68%. The area between Z-scores of -2.00 and +2.00 is .95, or 95%.


Exercise 1

An industrial sewing machine uses ball bearings that are targeted to have a diameter of 0.75 inch. The specification limits under which the ball bearing can operate are 0.74 inch (lower) and 0.76 inch (upper). Past experience has indicated that the actual diameter of the ball bearings is approximately normally distributed with a mean of 0.753 inch and a standard deviation of 0.004 inch.
For this problem, note that "Target" = .75, and "Actual mean" = .753.


What is the probability that a ball bearing will be between the target and the actual mean?

Z = (0.75 - 0.753)/0.004 = -0.75, so P(-0.75 < Z < 0) = .2734


What is the probability that a ball bearing will be between the lower specification limit and the target?

Z = (0.74 - 0.753)/0.004 = -3.25, so P(-3.25 < Z < -0.75) = .49942 - .2734 = .22602


What is the probability that a ball bearing will be above the upper specification limit?

Z = (0.76 - 0.753)/0.004 = 1.75, so P(Z > 1.75) = .5 - .4599 = .0401


What is the probability that a ball bearing will be below the lower specification limit?

P(Z < -3.25) = .5 - .49942 = .00058


Above which value in diameter will 93% of the ball bearings be? The value asked for here is the 7th percentile, since 93% of the ball bearings will have diameters above it. So we look up .4300 in the Z-table in a "backwards" manner. The closest area is .4306, which corresponds to a Z-value of 1.48; since the value lies below the mean, Z = -1.48.

Zσ = -1.48 x 0.004 = -0.00592 = X - 0.753

X = 0.74708

So 0.74708 in. is the value that 93% of the diameters are above.
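The four ball-bearing probabilities can be verified numerically; a Python sketch (an addition to these notes):

```python
import math

def phi(z):
    """P(Z <= z) for the standard normal, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma = 0.753, 0.004            # actual mean and SD of the diameters
lsl, target, usl = 0.74, 0.75, 0.76

def z(x):
    return (x - mu) / sigma

print(phi(0) - phi(z(target)))       # target to mean:  ≈ 0.2734
print(phi(z(target)) - phi(z(lsl)))  # LSL to target:   ≈ 0.2260
print(1 - phi(z(usl)))               # above USL:       ≈ 0.0401
print(phi(z(lsl)))                   # below LSL:       ≈ 0.00058
```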


Exercise 2

Graduate Management Aptitude Test (GMAT) scores are widely used by graduate schools of business as an entrance requirement. Suppose that in one particular year, the mean score for the GMAT was 476, with a standard deviation of 107. Assuming that the GMAT scores are normally distributed, answer the following questions:


Question 1

What is the probability that a randomly selected score from this GMAT falls between 476 and 650 (476 <= x <= 650)? The following figure shows a graphic representation of this problem.

Answer: Z = (650 - 476)/107 = 1.62. The Z value of 1.62 indicates that the GMAT score of 650 is 1.62 standard deviations above the mean. The standard normal table gives the probability of a value falling between 650 and the mean. The whole number and tenths portion of the Z score appears in the first column of the table; across the top of the table are the values of the hundredths portion of the Z score. Thus the answer is that 0.4474, or 44.74%, of the scores on the GMAT fall between a score of 650 and 476.


Question 2.

What is the probability of receiving a score greater than 750 on a GMAT test that has a mean of 476 and a standard deviation of 107, i.e., P(X >= 750)?
Answer: This problem asks for the area of the upper tail of the distribution. The Z score is Z = (750 - 476)/107 = 2.56; from the table, P(0 < Z < 2.56) = 0.4948. This is the probability of a GMAT score between 476 and 750, so P(X >= 750) = 0.5 - 0.4948 = 0.0052, or 0.52%. Note that P(X >= 750) is the same as P(X > 750) because, in a continuous distribution, the area under an exact number such as X = 750 is zero.


Question 3

What is the probability of receiving a score of 540 or less on a GMAT test that has a mean of 476 and a standard deviation of 107, i.e., P(X <= 540)? We are asked to determine the area under the curve for all values less than or equal to 540. The z score is (540 - 476)/107 = 0.6; from the table, P(0 < Z < 0.6) = 0.2257, which is the probability of getting a score between the mean 476 and 540. The answer to this problem is: 0.5 + 0.2257 = 0.73, or 73%.

Graphic representation of this problem.


Question 4

What is the probability of receiving a score between 330 and 440 on a GMAT test that has a mean of 476 and a standard deviation of 107, i.e., P(330 <= X <= 440)?

The two values fall on the same side of the mean. The Z scores are Z1 = (330 - 476)/107 = -1.36 and Z2 = (440 - 476)/107 = -0.34. The probability associated with Z = -1.36 is 0.4131; the probability associated with Z = -0.34 is 0.1331. The answer to this problem is: 0.4131 - 0.1331 = 0.28, or 28%.


Standard Error (SE)

Any statistic can have a standard error. Each sampling distribution has a standard error.
Standard errors are important because they reflect how much sampling fluctuation a statistic will show, i.e. how good an estimate of the population the sample statistic is How good an estimate is the mean of a population? One way to determine this is to repeat the experiment many times and to determine the mean of the means. However, this is tedious and frequently impossible. SE refers to the variability of the sample statistic, a measure of spread for random variables The inferential statistics involved in the construction of confidence intervals (CI) and significance testing are based on standard errors.


Standard Error of the Mean, SEM, σM

The standard deviation of the sampling distribution of the mean is called the standard error of the mean:

σM = σ / sqrt(n)

The size of the standard error of the mean is inversely proportional to the square root of the sample size.


The standard error of any statistic depends on the sample size - in general, the larger the sample size the smaller the standard error. Note that the spread of the sampling distribution of the mean decreases as the sample size increases.

Notice that the mean of the distribution is not affected by sample size.


Comparing the Averages of Two Independent Samples

Is there "grade inflation" in KTU? How does the average GPA of KTU students today compare with, say 10, years ago? Suppose a random sample of 100 student records from 10 years ago yields a sample average GPA of 2.90 with a standard deviation of .40. A random sample of 100 current students today yields a sample average of 2.98 with a standard deviation of .45. The difference between the two sample means is 2.98-2.90 = .08. Is this proof that GPA's are higher today than 10 years ago?


First we need to account for the fact that 2.98 and 2.90 are not the true averages, but are computed from random samples. Therefore, .08 is not the true difference, but simply an estimate of the true difference. Can this estimate miss by much? Fortunately, statistics has a way of measuring the expected size of the "miss" (or error of estimation). For our example, it is .06 (we show how to calculate this later). Therefore, we can state the bottom line of the study as follows: "The average GPA of KTU students today is .08 higher than 10 years ago, give or take .06 or so."


Overview of Confidence Intervals

Once the population is specified, the next step is to take a random sample from it. In this example, let's say that a sample of 10 students was drawn and each student's memory tested. The way to estimate the mean of all high school students would be to compute the mean of the 10 students in the sample. Indeed, the sample mean is an unbiased estimate of μ, the population mean. Clearly, if you already knew the population mean, there would be no need for a confidence interval.


We are interested in the mean weight of 10-year old kids living in Turkey. Since it would have been impractical to weigh all the 10-year old kids in Turkey, you took a sample of 16 and found that the mean weight was 90 pounds. This sample mean of 90 is a point estimate of the population mean.
A point estimate by itself is of limited usefulness because it does not reveal the uncertainty associated with the estimate; you do not have a good sense of how far this sample mean may be from the population mean. For example, can you be confident that the population mean is within 5 pounds of 90? You simply do not know.


Confidence intervals provide more information than point estimates. An example of a 95% confidence interval is shown below:

72.85 < μ < 107.15

There is good reason to believe that the population mean lies between these two bounds of 72.85 and 107.15, since 95% of the time confidence intervals contain the true mean.
If repeated samples were taken and the 95% confidence interval computed for each sample, 95% of the intervals would contain the population mean. Naturally, 5% of the intervals would not contain the population mean.


It is natural to interpret a 95% confidence interval as an interval with a 0.95 probability of containing the population mean.
The wider the interval, the more confident you are that it contains the parameter; the 99% confidence interval is therefore wider than the 95% confidence interval, and extends from 4.19 to 7.61.


Example

Assume that the weights of 10-year-old children are normally distributed with a mean of 90 and a standard deviation of 36. What is the sampling distribution of the mean for a sample size of 9? It is normal, with a mean of 90 and a standard deviation of 36/sqrt(9) = 36/3 = 12. Note that the standard deviation of a sampling distribution is its standard error.

90 - (1.96)(12) = 66.48
90 + (1.96)(12) = 113.52

The value of 1.96 is based on the fact that 95% of the area of a normal distribution is within 1.96 standard deviations of the mean; 12 is the standard error of the mean.


The figure shows that 95% of the means are no more than 23.52 units (1.96 x 12) from the mean of 90. Now consider the probability that a sample mean computed from a random sample is within 23.52 units of the population mean of 90. Since 95% of the distribution is within 23.52 of 90, the probability that the mean from any given sample will be within 23.52 of 90 is 0.95. This means that if we repeatedly compute the mean (M) from a sample and create an interval ranging from M - 23.52 to M + 23.52, this interval will contain the population mean 95% of the time.


Notice that you need to know the standard deviation (σ) in order to compute this confidence interval for the mean. This may sound unrealistic, and it is. However, computing a confidence interval when σ is known is easier than when σ has to be estimated, and serves a pedagogical purpose. Suppose the following five values were sampled from a normal distribution with a standard deviation of 2.5: 2, 3, 5, 6, and 9. To compute the 95% confidence interval, start by computing the mean and standard error:

M = (2 + 3 + 5 + 6 + 9)/5 = 5
σM = σ/sqrt(n) = 2.5/sqrt(5) = 1.118


Z.95 (the Z value for 95% confidence) is 1.96.


If you had wanted to compute the 99% confidence interval, you would have set the shaded area to 0.99, and the result would have been 2.58.
The confidence interval can then be computed as follows:
Lower limit = 5 - (1.96)(1.118) = 2.81
Upper limit = 5 + (1.96)(1.118) = 7.19
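A Python sketch of this computation (an addition to these notes):

```python
import math

# Sketch (added): the 95% CI for the sample 2, 3, 5, 6, 9 with sigma known.
data = [2, 3, 5, 6, 9]
sigma = 2.5                            # known population SD (from the text)
n = len(data)

mean = sum(data) / n                   # M = 5.0
se = sigma / math.sqrt(n)              # sigma_M = 2.5/sqrt(5) ≈ 1.118
z = 1.96                               # Z.95; use 2.58 for a 99% interval
print(mean - z * se, mean + z * se)    # ≈ (2.81, 7.19)
```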


Estimating the Population Mean Using Intervals

Estimate the average GPA of the population of approximately 23000 KTU undergraduates from n = 25 randomly selected students, whose sample average is 3.05. Chances are the true average μ is not exactly 3.05. The true KTU average GPA is certainly between 1.00 and 4.00, and with high confidence between (2.50, 3.50); but what level of confidence do we have that it is between, say, (2.75, 3.25) or (2.95, 3.15)? Even better, can we find an interval (a, b) which will contain μ with 95% certainty?


Example:

Given the following GPA for 6 students: 2.80, 3.20, 3.75, 3.10, 2.95, 3.40
Calculate a 95% confidence interval for the population mean GPA.


Determining Sample Size for Estimating the Mean

We want to estimate the average GPA of KTU undergraduates this school year. Historically, the SD σ of student GPA is known. If a random sample of size n = 25 yields a sample mean of 3.05, then the population mean is estimated as lying within the interval 3.05 ± .12

with 95% confidence. The plus-or-minus quantity .12 is called the margin of error of the sample mean associated with a 95% confidence level. It is also correct to say "we are 95% confident that μ is within .12 of the sample mean 3.05".


Confidence Interval for μ, Standard Deviation Estimated

It is very rare for a researcher wishing to estimate the mean of a population to already know its standard deviation. Therefore, the construction of a confidence interval almost always involves the estimation of both μ and σ.

When σ is known, M - z σM <= μ <= M + z σM is used for the confidence interval.

When σ is not known, i.e., whenever the standard deviation is estimated, the t rather than the normal (z) distribution should be used. The confidence interval for μ when σ is estimated is:

M - t sM <= μ <= M + t sM

where M is the sample mean, sM is an estimate of σM (the standard error), and t depends on the degrees of freedom and the level of confidence.

confidence interval on the mean:


More generally, the formula for the 95% confidence interval on the mean is:
Lower limit = M - (t)(sM)
Upper limit = M + (t)(sM)
where M is the sample mean, t is the t value for the confidence level desired (0.95 in the above example), and sM is the estimated standard error of the mean.

A comparison of the t and normal distribution


A comparison of the t distribution with 4 df (in blue) and the standard normal distribution (in red).

Finding t-values
Find the t-value such that the area under the t distribution to the right of the t-value is 0.2 assuming 10 degrees of freedom. That is, find t0.20 with 10 degrees of freedom.

Upper tail probability p (area under the right tail); the top row gives the corresponding two-sided confidence level.

conf.  50%    60%    70%    80%    90%    95%     96%     98%     99%     99.5%   99.8%   99.9%
p      0.25   0.2    0.15   0.1    0.05   0.025   0.02    0.01    0.005   0.0025  0.001   0.0005
df
1     1.000  1.376  1.963  3.078  6.314  12.706  15.895  31.821  63.657  127.32  318.30  636.61
2     0.817  1.061  1.386  1.886  2.920   4.303   4.849   6.965   9.925  14.089  22.327  31.599
3     0.765  0.979  1.250  1.638  2.353   3.182   3.482   4.541   5.841   7.453  10.215  12.924
4     0.741  0.941  1.190  1.533  2.132   2.776   2.999   3.747   4.604   5.598   7.173   8.610
5     0.727  0.920  1.156  1.476  2.015   2.571   2.757   3.365   4.032   4.773   5.893   6.869
6     0.718  0.906  1.134  1.440  1.943   2.447   2.612   3.143   3.707   4.317   5.208   5.959
7     0.711  0.896  1.119  1.415  1.895   2.365   2.517   2.998   3.499   4.029   4.785   5.408
8     0.706  0.889  1.108  1.397  1.860   2.306   2.449   2.896   3.355   3.833   4.501   5.041
9     0.703  0.883  1.100  1.383  1.833   2.262   2.398   2.821   3.250   3.690   4.297   4.781
10    0.700  0.879  1.093  1.372  1.812   2.228   2.359   2.764   3.169   3.581   4.144   4.587
11    0.697  0.876  1.088  1.363  1.796   2.201   2.328   2.718   3.106   3.497   4.025   4.437
12    0.696  0.873  1.083  1.356  1.782   2.179   2.303   2.681   3.055   3.428   3.930   4.318
13    0.694  0.870  1.079  1.350  1.771   2.160   2.282   2.650   3.012   3.372   3.852   4.221
14    0.692  0.868  1.076  1.345  1.761   2.145   2.264   2.624   2.977   3.326   3.787   4.140

Example: P[t(2) > 2.92] = 0.05, so P[-2.92 < t(2) < 2.92] = 0.90.

For the exercise above: from the p = 0.20 column, t0.20 with 10 degrees of freedom is 0.879.

Abbreviated t table

df     0.95    0.99
2      4.303   9.925
3      3.182   5.841
4      2.776   4.604
5      2.571   4.032
8      2.306   3.355
10     2.228   3.169
20     2.086   2.845
50     2.009   2.678
100    1.984   2.626

Example

Assume that the following five numbers are sampled from a normal distribution: 2, 3, 5, 6, and 9, and that the standard deviation is not known. The first steps are to compute the sample mean and variance:
M = 5
s² = 7.5
Standard error: sM = sqrt(s²/n) = sqrt(7.5/5) = 1.225
df = N - 1 = 4
From the t table, the value for the 95% interval is 2.776.
Lower limit = 5 - (2.776)(1.225) = 1.60
Upper limit = 5 + (2.776)(1.225) = 8.40

Example
Suppose a researcher were interested in estimating the mean reading speed (number of words per minute) of high-school graduates and computing the 95% confidence interval. A sample of 6 graduates was taken, and the reading speeds were: 200, 240, 300, 410, 450, and 600. For these data:
M = 366.6667
sM = 60.9736
df = 6 - 1 = 5
t = 2.571
Lower limit: M - (t)(sM) = 209.904
Upper limit: M + (t)(sM) = 523.430
95% confidence interval: 209.904 <= μ <= 523.430
Thus, the researcher can conclude, based on the rounded-off 95% confidence interval, that the mean reading speed of high-school graduates is between 210 and 523 words per minute.
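A Python sketch of the same computation (an addition to these notes), with the t value taken from the table above:

```python
import math

# Sketch (added): 95% CI with sigma estimated, using the reading-speed data.
data = [200, 240, 300, 410, 450, 600]
n = len(data)

mean = sum(data) / n                                 # ≈ 366.67
s2 = sum((x - mean) ** 2 for x in data) / (n - 1)    # sample variance
sm = math.sqrt(s2 / n)                               # standard error ≈ 60.97
t = 2.571                                            # t for 95%, df = 5
print(mean - t * sm, mean + t * sm)                  # ≈ (209.9, 523.4)
```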

Homework 1

The mean time difference for all 47 subjects is 16.362 seconds and the standard deviation is 7.470 seconds. The standard error of the mean is 1.090.
A t table shows that the critical value of t for 47 - 1 = 46 degrees of freedom is 2.013 (for a 95% confidence interval). The confidence interval is computed as follows:
Lower limit = 16.362 - (2.013)(1.090) = 14.17
Upper limit = 16.362 + (2.013)(1.090) = 18.56
Therefore, the interference effect (difference) for the whole population is likely to be between 14.17 and 18.56 seconds.

Homework 2

The pasteurization process reduces the amount of bacteria found in dairy products, such as milk. The following data represent the counts of bacteria in pasteurized milk (in CFU/mL) for a random sample of 12 pasteurized glasses of milk. Construct a 95% confidence interval for the bacteria count.

NOTE: Each observation is in tens of thousands; so 9.06 represents 9.06 x 10^4.

Prediction with Regression Analysis


The relationship(s) between values of the response variable and corresponding values of the predictor variable(s) is (are) not deterministic. Thus the value of y is estimated given the value of x. The estimated value of the dependent variable is denoted ŷ, and the population slope and intercept are usually denoted β1 and β0.

Linear Regression

The idea is to fit a straight line through the data points. Linear regression models the relationship between the dependent variable and the independent variable(s), and can be extended to multiple dimensions.

Correlation analysis is applied to independent factors: if X increases, what will Y do (increase, decrease, or perhaps not change at all)? In regression analysis a unilateral response is assumed: changes in X result in changes in Y, but changes in Y do not result in changes in X.

Regression Plot

[Scatter plot of m1 vs. vwmkt with fitted line: m1 = 0.0095937 + 0.880436 vwmkt; S = 0.0590370, R-Sq = 31.3%, R-Sq(adj) = 30.8%]

Linear regression means a regression that is linear in the parameters; a linear regression can be non-linear in the variables. Example: Y = β0 + β1X².
Some non-linear regression models can be transformed into a linear regression model (e.g., Y = aX^b Z^c can be transformed into ln Y = ln a + b ln X + c ln Z).

Example

Given one variable X (years of experience), the goal is to predict Y (salary).

X (years)   Y (salary, $1,000)
3           30
8           57
9           64
13          72
3           36
6           43
11          59
21          90
1           20

Questions: Given years of experience, predict salary. When X = 10, what is Y? When X = 25, what is Y? This is known as regression.

For the example data, a = 23.2 and b = 3.5, so the fitted line is ŷ = 23.2 + 3.5x.

For x = 10 years, the prediction of y (salary) is 23.2 + 35 = 58.2 K dollars/year.

Linear Regression Example

[Scatter plot of salary vs. years with the fitted line Y = 3.5X + 23.2]

The least-squares slope and intercept are:

b = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²
a = ȳ - b x̄
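A Python sketch (an addition to these notes) computing the least-squares estimates for the salary data; the results (a ≈ 23.5, b ≈ 3.46) are close to the rounded values a = 23.2, b = 3.5 used above:

```python
# Sketch (added): least-squares slope and intercept for the salary example.
x = [3, 8, 9, 13, 3, 6, 11, 21, 1]          # years of experience
y = [30, 57, 64, 72, 36, 43, 59, 90, 20]    # salary ($1,000)

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
    / sum((xi - x_bar) ** 2 for xi in x)    # slope ≈ 3.46
a = y_bar - b * x_bar                       # intercept ≈ 23.5
print(a, b)
print(a + b * 10)                           # prediction at x = 10: ≈ 58.1
```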

Regression Error

We can also write a regression equation slightly differently, with an explicit error term:

y = a + bx + e

Also called the residual, e is the difference between our estimate ŷ of the dependent variable and the actual value of the dependent variable y.

Unless we have perfect prediction, many of the y values will fall off of the line. The added e in the equation refers to this fact. It would be incorrect to write the equation without the e, because it would suggest that the y scores are completely accounted for by just knowing the slope, x values, and the intercept. Almost always, that is not true. There is some error in prediction, so we need to add an e for error variation into the equation. The actual values of y can be accounted for by the regression line equation (y=a+bx) plus some degree of error in our prediction (the e's).

r correlation coefficient

The correlation between X and Y is expressed by the correlation coefficient r:

r = Σ(xi - x̄)(yi - ȳ) / sqrt( Σ(xi - x̄)² Σ(yi - ȳ)² )

xi = data X, x̄ = mean of data X
yi = data Y, ȳ = mean of data Y


1 >= r >= -1
r = 1: perfect positive linear correlation between two variables
r = 0: no linear correlation (maybe some other correlation)
r = -1: perfect negative linear correlation
Notice that for a perfect correlation, there is a perfect line of points; they do not deviate from that line.

least squares

The principle is to establish a statistical linear relationship between two sets of corresponding data by fitting the data to a straight line by means of the "least squares" technique. The resulting line takes the general form:

y = bx + a

a = intercept of the line with the y-axis b = slope (tangent)


a = 0, b = 1: perfect positive correlation without bias
a ≠ 0: systematic discrepancy (bias, error) between X and Y
b ≠ 1: proportional response or difference between X and Y

Example

Each point represents one student with a certain grade on the exam (x) and time spent on the exam (y). The scatter plot reveals that, in general, longer times on the exam tend to be associated with higher grades (r = 0.64).

ID   Grade (x)   Time (y)   x - x̄    y - ȳ     (x - x̄)(y - ȳ)   (x - x̄)²
1    88          60         8.6      18.55     159.53           73.96
2    96          53         16.6     11.55     191.73           275.56
3    72          22         -7.4     -19.45    143.93           54.76
4    78          44         -1.4     2.55      -3.57            1.96
5    65          34         -14.4    -7.45     107.28           207.36
6    80          47         0.6      5.55      3.33             0.36
7    77          38         -2.4     -3.45     8.28             5.76
8    83          50         3.6      8.55      30.78            12.96
9    79          51         -0.4     9.55      -3.82            0.16
10   68          35         -11.4    -6.45     73.53            129.96
11   84          46         4.6      4.55      20.93            21.16
12   76          36         -3.4     -5.45     18.53            11.56
13   92          48         12.6     6.55      82.53            158.76

r correlation

The Pearson r can be positive or negative, ranging from -1.0 to 1.0. If the correlation is 1.0, the longer the amount of time spent on the exam, the higher the grade will be, without any exceptions. An r value of -1.0 indicates a perfect negative correlation: without an exception, the longer one spends on the exam, the poorer the grade. If r = 0, there is absolutely no relationship between the two variables: on average, longer time spent on the exam does not result in any higher or lower grade. Most often, r is somewhere in between -1.0 and +1.0.

ID    x      x²       y     y²      xy
1     88     7744     60    3600    5280
2     96     9216     53    2809    5088
3     72     5184     22    484     1584
4     78     6084     44    1936    3432
5     65     4225     34    1156    2210
6     80     6400     47    2209    3760
7     77     5929     38    1444    2926
8     83     6889     50    2500    4150
9     79     6241     51    2601    4029
10    68     4624     35    1225    2380
11    84     7056     46    2116    3864
12    76     5776     36    1296    2736
13    92     8464     48    2304    4416
14    80     6400     43    1849    3440
15    67     4489     40    1600    2680
16    78     6084     32    1024    2496
17    74     5476     27    729     1998
18    73     5329     41    1681    2993
19    88     7744     39    1521    3432
20    90     8100     43    1849    3870
Sum   1588   127454   829   35933   66764

ID      Grade (x)  Time (y)  x - x̄    y - ȳ     (x - x̄)(y - ȳ)  (x - x̄)²  (y - ȳ)²
1       88         60        8.6      18.55     159.53          73.96     344.1025
2       96         53        16.6     11.55     191.73          275.56    133.4025
3       72         22        -7.4     -19.45    143.93          54.76     378.3025
4       78         44        -1.4     2.55      -3.57           1.96      6.5025
5       65         34        -14.4    -7.45     107.28          207.36    55.5025
6       80         47        0.6      5.55      3.33            0.36      30.8025
7       77         38        -2.4     -3.45     8.28            5.76      11.9025
8       83         50        3.6      8.55      30.78           12.96     73.1025
9       79         51        -0.4     9.55      -3.82           0.16      91.2025
10      68         35        -11.4    -6.45     73.53           129.96    41.6025
11      84         46        4.6      4.55      20.93           21.16     20.7025
12      76         36        -3.4     -5.45     18.53           11.56     29.7025
13      92         48        12.6     6.55      82.53           158.76    42.9025
14      80         43        0.6      1.55      0.93            0.36      2.4025
15      67         40        -12.4    -1.45     17.98           153.76    2.1025
16      78         32        -1.4     -9.45     13.23           1.96      89.3025
17      74         27        -5.4     -14.45    78.03           29.16     208.8025
18      73         41        -6.4     -0.45     2.88            40.96     0.2025
19      88         39        8.6      -2.45     -21.07          73.96     6.0025
20      90         43        10.6     1.55      16.43           112.36    2.4025
Total   1588       829                          941.4           1366.8    1570.95
Average 79.4       41.45

r = 941.4 / sqrt(1366.8 x 1570.95) = 0.6424
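A Python sketch (an addition to these notes) reproducing r = 0.6424 from the raw data; the three sums match the column totals in the table above:

```python
import math

# Sketch (added): Pearson r for the exam data.
x = [88, 96, 72, 78, 65, 80, 77, 83, 79, 68,
     84, 76, 92, 80, 67, 78, 74, 73, 88, 90]   # grade on exam
y = [60, 53, 22, 44, 34, 47, 38, 50, 51, 35,
     46, 36, 48, 43, 40, 32, 27, 41, 39, 43]   # time on exam

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))  # 941.4
sxx = sum((a - x_bar) ** 2 for a in x)                      # 1366.8
syy = sum((b - y_bar) ** 2 for b in y)                      # 1570.95
print(sxy / math.sqrt(sxx * syy))                           # ≈ 0.6424
```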

r², the square of the correlation coefficient

r² is the proportion of the sum of squares explained in one-variable regression; R² is the proportion of the sum of squares explained in multiple regression.

Is an R-Square < 1.00 Good or bad?

This is both a statistical and a philosophical question. It is quite rare, especially in the social sciences, to get an R-square that is really high (e.g., 98%). The goal is NOT to get the highest R-square per se. Instead, the goal is to develop a model that is both statistically and theoretically sound, creating the best fit with existing data. Do you want just the best fit, or a model that theoretically/conceptually makes sense? Yes, you might get a good fit with nonsensical explanatory variables. But this opens you to spurious/intervening relationships, and it is therefore hard to use the model for explanation.

Why might an R-Square be less than 1.00?


An underdetermined model (need more variables).
Nonlinear relationships.
Measurement error.
Sampling error.
The outcome is not fully predictable/explainable even with all data available; there is a certain amount of unexplainable chaos/static/randomness in the universe (which may be reassuring).
The unit of analysis is too aggregated (e.g., you are predicting mean housing values for a city; you might get better results predicting individual housing prices, or neighborhood housing prices).

Adjusted R2 (R-square)
What is an "Adjusted" R-Square? The Adjusted R-Square takes into account not only how much of the variation is explained, but also the impact of the degrees of freedom. It "adjusts" for the number of variables use. That is, look at the adjusted R- Square to see how adding another variable to the model both increases the explained variance but also lowers the degrees of freedom. Adjusted R2 = 1- (1 - R2 )((n - 1)/(n - k - 1)). As the number of variables in the model increases, the gap between the Rsquare and the adjusted R-square will increase. This serves as a disincentive to simply throwing in a huge number of variables into the model to increase the R-square.

This adjusted value for R-square will be equal to or smaller than the regular R-square. The adjusted R-square adjusts for a bias in R-square: R-square tends to overestimate the variance accounted for compared to an estimate that would be obtained from the population. There are two reasons for the overestimate: a large number of predictors and a small sample size. So, with a large sample and with few predictors, adjusted R-square should be very similar to the R-square value. Researchers and statisticians differ on whether to use the adjusted R-square. It is probably a good idea to look at it to see how much your R-square might be inflated, especially with a small sample and many predictors.
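A tiny Python sketch of this formula (an addition to these notes), illustrating how the adjustment grows with many predictors and a small sample:

```python
# Sketch (added): the adjusted R-square formula from the text.
def adjusted_r2(r2, n, k):
    """r2: R-square; n: sample size; k: number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.90, n=15, k=10))    # 0.65  (big penalty: small n, many k)
print(adjusted_r2(0.90, n=1000, k=10))  # ≈ 0.899 (large sample: tiny penalty)
```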

Example
Suppose we have collected the following sample of 6 observations on age and income:
Find the estimated regression line for the sample of six observations we have collected on age and income: Which is the independent variable and which is the dependent variable for this problem?

Cautions About Simple Linear Regression


Correlation and regression describe only linear relations.
Correlation and the least-squares regression line are not resistant to outliers.
Predictions outside the range of observed data are often inaccurate.
Correlation and regression are powerful tools for describing the relationship between two variables, but be aware of their limitations.

Multiple Prediction

Regression analysis allows us to use more than one independent variable to predict values of y. Take the fat intake and blood cholesterol level study as an example. If we want to predict cholesterol as accurately as possible, we need to know more about diet than just how much fat intake there is. On the island of Crete, they consume a lot of olive oil, so their fat intake is high. This, however, seems to have no dramatic effect on cholesterol (at least the bad cholesterol, the LDLs). They also consume very little cholesterol in their diet, which consists more of fish than high-cholesterol foods like cheese and beef (hopefully this won't be considered libelous in Texas). So, to improve our prediction of blood cholesterol levels, it would be helpful to add another predictor, dietary cholesterol.

From Bivariate to Multiple regression: what changes?

Potentially more explanatory power with more variables.
The ability to control for other variables, and one sees the interaction of the various explanatory variables.
Partial correlations and multicollinearity.
Harder to visualize: drawing a line through three-or-more-dimensional space.
The R² is no longer simply the square of the correlation statistic r.

From Two to Three Dimensions

With simple regression (one predictor) we had only the x-axis and the y-axis. Now we need an axis for x1, for x2, and for y:

Y' = A + b1X1 + b2X2 + ...

where Y' is the predicted score, X1 is the score on the first predictor variable, X2 is the score on the second, etc. The Y intercept is A. The regression coefficients (b1, b2, etc.) are analogous to the slope in simple regression. If we want to predict these points, we now need a regression plane rather than just a regression line.

[Figure: a regression plane fitted through points in (x1, x2, y) space.]

More than one prediction attribute


X1, X2. For example: X1 = years of experience, X2 = age, Y = salary.

With two predictors the model is yi = β0 + β1xi1 + β2xi2 + εi, and the expected response E(yi) = β0 + β1xi1 + β2xi2 forms a response surface over the (x1, x2) space.

[Figure: response surface E(yi) above the point (xi1, xi2) in the x1-x2 plane.]

The parameters β0, β1, β2, ..., βk are called partial regression coefficients. β1 represents the change in y corresponding to a unit increase in x1, holding all the other predictors constant. A similar interpretation can be made for β2, β3, ..., βk.

Regression Statistics: Multiple R = 0.995, R Square = 0.990, Adjusted R Square = 0.989, Standard Error = 0.008, Observations = 30

ANOVA: Regression df = 4, SS = 0.164, MS = 0.041, F = 628.372, Significance F = 0.000; Residual df = 25, SS = 0.002; Total df = 29, SS = 0.165

Variable                                     Coefficient   Std. Error   t Stat    P-value
Intercept                                    0.500         0.008        60.294    0.000
Percent of Gross Hhd Income Spent on Rent   -0.399         0.016       -24.610    0.000
Percent 2-parent families                   -0.288         0.015       -19.422    0.000
Police Anti-Drug Program?                   -0.004         0.004        -1.238    0.227
Active Tenants Group? (1 = yes; 0 = no)     -0.102         0.004       -28.827    0.000

Controlling also for this new variable, the police anti-drug program is no longer statistically significant, and instead the presence of the active tenants group makes the dramatic difference (and look at that great R square!). However, we are not quite done.

SUMMARY OUTPUT

Regression Statistics: Multiple R = 0.928, R Square = 0.861, Adjusted R Square = 0.850, Standard Error = 0.030, Observations = 30

ANOVA: Regression df = 2, SS = 0.149, MS = 0.074, F = 83.484, Significance F = 0.000; Residual df = 27, SS = 0.024; Total df = 29, SS = 0.173

Variable                                  Coefficient   Std. Error   t Stat    P-value   BETA
Intercept                                 0.36582       0.017        20.908    0.000
Percent 2-parent families                -0.2565        0.051        -5.017    0.000     -0.362
Active Tenants Group? (1 = yes; 0 = no)  -0.1246        0.011       -11.347    0.000     -0.821

Since the police variable now has a statistically insignificant t-score, we remove it from the model. (We also remove the income variable, since it also becomes insignificant after we remove the police variable.) We are left with two independent variables: percent of 2-parent families and active tenants group.

Stepwise Regression Algorithms

Backward Elimination
Forward Selection
Stepwise Selection

Backward Elimination
1. Fit the model containing all (remaining) predictors.
2. Test each predictor variable, one at a time, for a significant relationship with y.
3. Identify the variable with the largest p-value. If p > α, remove this variable from the model and return to (1).
4. Otherwise, stop and use the existing model.

Forward Selection
1. Fit all models with one (more) predictor.
2. Test each of these predictor variables for a significant relationship with y.
3. Identify the variable with the smallest p-value. If p < α, add this variable to the model and return to (1).
4. Otherwise, stop and use the existing model.

Stepwise Selection The Stepwise Selection method is basically Forward Selection with Backward Elimination added in at every step.

Stepwise Selection
1. Fit all models with one (more) predictor.
2. Test each of these predictor variables for a significant relationship with y.
3. Identify the variable with the smallest p-value. If p < α, add this variable to the model and return to (1).
4. Now, for the model being considered, test each predictor variable, one at a time, for a significant relationship with y.
5. Identify the variable with the largest p-value. If p > α, remove this variable from the model and return to (1).
6. Otherwise, stop and use the existing model.
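A minimal Python sketch of Backward Elimination (an addition to these notes). It assumes a pandas DataFrame X of candidate predictors and a response y, and uses the statsmodels package for the OLS fit and p-values; the function name backward_elimination is illustrative:

```python
import statsmodels.api as sm

def backward_elimination(X, y, alpha=0.05):
    """Sketch: X is a DataFrame of candidate predictors, y the response."""
    X = sm.add_constant(X)                   # add the intercept column
    while True:
        model = sm.OLS(y, X).fit()           # step 1: fit remaining predictors
        pvals = model.pvalues.drop("const")  # step 2: p-value per predictor
        if pvals.empty:
            return model
        worst = pvals.idxmax()               # step 3: largest p-value
        if pvals[worst] > alpha:
            X = X.drop(columns=worst)        # remove it and refit
        else:
            return model                     # step 4: stop with current model
```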

Linear regression

Review

Multiple Regression Models

Chapter Topics

The Multiple Regression Model
Contribution of Individual Independent Variables
Coefficient of Determination
Categorical Explanatory Variables
Transformation of Variables
Violations of Assumptions
Qualitative Dependent Variables

Multiple Regression Models


Multiple Regression Models Linear

NonLinear

Linear

Dummy Variable

Interaction

PolyNomial

Square Root

Log

Reciprocal

Exponential

Linear Multiple Regression Model

Additional Assumption for Multiple Regression

No exact linear relationship exists among any subset of the explanatory variables (i.e., no perfect multicollinearity).

The Multiple Regression Model

The relationship between one dependent variable and two or more independent variables is a linear function.

Population model:

Yi = β0 + β1X1i + β2X2i + … + βpXpi + εi

where β0 is the population Y-intercept, β1, …, βp are the population slopes, and εi is the random error.

Sample model:

Yi = b0 + b1X1i + b2X2i + … + bpXpi + ei

where Yi is the dependent (response) variable and X1i, …, Xpi are the independent (explanatory) variables for the sample model.

Population Multiple Regression Model

Bivariate model:

Yi = β0 + β1X1i + β2X2i + εi

[Diagram: the observed Yi values scatter about the response plane β0 + β1X1 + β2X2 over the (X1, X2) plane]

Sample Multiple Regression Model

Bivariate model:

Yi = b0 + b1X1i + b2X2i + ei

Fitted values:

Ŷi = b0 + b1X1i + b2X2i

[Diagram: the observed Yi differs from the fitted response plane at (X1i, X2i) by the residual ei]
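The sample coefficients b0, b1 and b2 are obtained by least squares. A minimal numpy sketch on hypothetical data (the numbers below are illustrative only, not from these notes):

import numpy as np

# hypothetical observations of X1, X2 and Y
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
Y  = np.array([3.1, 3.9, 7.2, 7.8, 11.0])

A = np.column_stack([np.ones_like(X1), X1, X2])   # design matrix [1, X1, X2]
b, *_ = np.linalg.lstsq(A, Y, rcond=None)         # b = (b0, b1, b2)
e = Y - A @ b                                     # residuals e_i
print(b, e)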

Parameter Estimation
Linear Multiple Regression Model

Multiple Regression Model: Example

Develop a model for estimating heating oil used for a single-family home in the month of January, based on average temperature and amount of insulation in inches.

Oil (Gal)   Temp (°F)   Insulation (in)
275.30          40            3
363.80          27            3
164.30          40           10
 40.80          73            6
 94.30          64            6
230.90          34            6
366.70           9            6
300.60           8           10
237.80          23           10
121.40          63            3
 31.40          65           10
203.50          41            6
441.10          21            3
323.00          38            3
 52.50          58           10

Interpretation of Estimated Coefficients

Slope (bp): the estimated Y changes by bp for each 1-unit increase in Xp, holding all other variables constant (ceteris paribus). Example: if b1 = -2, then fuel oil usage (Y) is expected to decrease by 2 gallons for each 1-degree increase in temperature (X1), given the inches of insulation (X2).

Y-intercept (b0): the average value of Y when all Xp = 0.

Sample Regression Model: Example

                 Coefficients
Intercept         562.1510092
X Variable 1       -5.436580588
X Variable 2      -20.01232067

Ŷi = 562.151 - 5.437 X1i - 20.012 X2i

For each degree increase in temperature, the average amount of heating oil used decreases by 5.437 gallons, holding insulation constant.

For each one-inch increase in insulation, the use of heating oil decreases by 20.012 gallons, holding temperature constant.
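As a check, the fitted equation above can be reproduced from the 15 observations in the data table (a sketch assuming statsmodels is installed; the printed coefficients should come out close to 562.151, -5.437 and -20.012):

import numpy as np
import statsmodels.api as sm

oil = np.array([275.30, 363.80, 164.30, 40.80, 94.30, 230.90, 366.70,
                300.60, 237.80, 121.40, 31.40, 203.50, 441.10, 323.00, 52.50])
temp = np.array([40, 27, 40, 73, 64, 34, 9, 8, 23, 63, 65, 41, 21, 38, 58])
insulation = np.array([3, 3, 10, 6, 6, 6, 6, 10, 10, 3, 10, 6, 3, 3, 10])

X = sm.add_constant(np.column_stack([temp, insulation]))
fit = sm.OLS(oil, X).fit()
print(fit.params)   # intercept, temperature slope, insulation slope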

Evaluating the Model

Evaluating Multiple Regression Model Steps

Examine variation measures.
Test parameter significance:
- overall model
- portions of the model
- individual coefficients

Variation Measures

Coefficient of multiple determination:

r²Y.12…p = Explained variation / Total variation = SSR / SST

r² = 0 means the X variables taken together do not explain any of the variation in Y.

Adjusted Coefficient of Multiple Determination

NOT simply the proportion of variation in Y explained by all the X variables taken together; it also reflects the sample size and the number of independent variables. It is smaller than r²Y.12…p and is sometimes used to compare models. (A sketch of both computations follows.)
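A small sketch of both measures, using the SSR and SST from the heating-oil ANOVA that appears later in these notes (n = 15 observations, p = 2 predictors):

def r_squared(ssr, sst):
    return ssr / sst

def adjusted_r_squared(r2, n, p):
    # penalizes extra predictors: 1 - (1 - r^2)(n - 1)/(n - p - 1)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

r2 = r_squared(228014.6263, 236135.2293)   # heating-oil SSR and SST
print(r2, adjusted_r_squared(r2, n=15, p=2))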

Simple and Multiple Regression Compared:Example

Two simple regressions:
ABSENCES = α + β1·AUTONOMY
ABSENCES = α + β2·SKILLVARIETY

Multiple regression:
ABSENCES = α + β1·AUTONOMY + β2·SKILLVARIETY

Overlap in Explanation

SIMPLE REGRESSION: AUTONOMY
Multiple R 0.169171 | R Square 0.028619 | Adjusted R Square 0.027709 | Standard Error 12.443 | Observations 1069
ANOVA:  Regression df 1, SS 4867.198, MS 4867.198, F 31.43612, Significance F 2.62E-08
        Residual   df 1067, SS 165201.7, MS 154.8282
        Total      df 1068, SS 170068.9

SIMPLE REGRESSION: SKILL VARIETY
Multiple R 0.193838 | R Square 0.037573 | Adjusted R Square 0.036671 | Standard Error 12.38552 | Observations 1069
ANOVA:  Regression df 1, SS 6390.011, MS 6390.011, F 41.6556, Significance F 1.65E-10
        Residual   df 1067, SS 163678.9, MS 153.401
        Total      df 1068, SS 170068.9

MULTIPLE REGRESSION
Multiple R 0.231298 | R Square 0.053499 | Adjusted R Square 0.051723 | Standard Error 12.28837 | Observations 1069
ANOVA:  Regression df 2, SS 9098.483, MS 4549.242, F 30.1266
        Residual   df 1066, SS 160970.4, MS 151.0041
        Total      df 1068, SS 170068.9

0.06619206   sum of simple R²
0.05349881   multiple R²
0.01269325   overlap attributed to both

11257.2098   sum of regression sums of squares
 9098.4831   multiple-regression sum of squares
 2158.7267   overlap
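The overlap arithmetic above, in a few lines (values taken from the outputs):

# When predictors are correlated, the simple r-squares sum to more than the
# multiple r-square; the excess is the variation attributable to both.
r2_autonomy, r2_skill, r2_multiple = 0.028619, 0.037573, 0.053499
overlap = (r2_autonomy + r2_skill) - r2_multiple
print(overlap)   # about 0.0127, matching the figure above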

Testing Parameters

Test for Overall Significance: Example Solution

H0: β1 = β2 = … = βp = 0
H1: at least one βi ≠ 0
α = .05, df = 2 and 12
Critical value: F = 3.89

Test statistic: F = 168.47

Decision: reject H0 at α = 0.05.
Conclusion: there is evidence that at least one independent variable affects Y.
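The decision can be checked with scipy (assumed available): compute the critical value for α = 0.05 with df = (2, 12) and compare it with the test statistic:

from scipy.stats import f

f_crit = f.ppf(0.95, dfn=2, dfd=12)
print(f_crit)            # about 3.89
print(168.47 > f_crit)   # True -> reject H0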

Test for Significance: Individual Variables

Shows whether there is a linear relationship between the variable Xi and Y. Uses the t test statistic.

Hypotheses:
H0: βi = 0 (no linear relationship)
H1: βi ≠ 0 (linear relationship between Xi and Y)

t Test Statistic Excel Output: Example

               Coefficients   Standard Error    t Stat
Intercept       562.1510092    21.09310433      26.65094
X Variable 1     -5.4365806     0.336216167    -16.1699    <- t statistic for X1 (temperature)
X Variable 2    -20.012321      2.342505227     -8.54313   <- t statistic for X2 (insulation)

t = bi / S(bi)

t Test: Example Solution

Does temperature have a significant effect on monthly consumption of heating oil? Test at α = 0.05.

H0: β1 = 0
H1: β1 ≠ 0
df = 12; critical values ±2.1788 (.025 in each tail)

Test statistic: t = -16.1699
Decision: reject H0 at α = 0.05.
Conclusion: there is evidence of a significant effect of temperature on oil consumption.
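The two-tailed critical value used above, again via scipy (assumed available):

from scipy.stats import t

t_crit = t.ppf(1 - 0.05 / 2, df=12)
print(t_crit)                    # about 2.1788
print(abs(-16.1699) > t_crit)    # True -> reject H0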

Example: Analysis of job earnings


What is the impact of employer tenure (ERTEN), unemployment (UNEM) and education (EDU) on job earnings (JEARN)?

Example: Analysis of job earnings

[Output omitted: the correlation matrix, ANOVA table, and coefficient estimates for the JEARN regression]

Testing Model Portions

Examines the contribution of a set of X variables to the relationship with Y.

Null hypothesis: the variables in the set do not significantly improve the model when all other variables are included.
Alternative hypothesis: at least one variable in the set is significant.

Testing Model Portions

Only a one-tail test. Requires comparison of two regressions:
- one regression includes everything;
- one regression includes everything except the portion to be tested.

Testing Model Portions: Test Statistic

Test H0: β1 = β2 = 0 in a 3-variable model:

F = { [SSR(X1, X2, X3) - SSR(X3)] / k } / MSE(X1, X2, X3)

SSR(X1, X2, X3) and MSE(X1, X2, X3) come from the ANOVA section of the regression of Yi = b0 + b1X1i + b2X2i + b3X3i; SSR(X3) comes from the ANOVA section of the regression of Yi = b0 + b3X3i.

Testing Portions of Model: SSR

Contribution of X1 and X2 given that X3 has been included:

SSR(X1 and X2 | X3) = SSR(X1, X2, X3) - SSR(X3)

again taking SSR(X1, X2, X3) from the regression of Yi = b0 + b1X1i + b2X2i + b3X3i and SSR(X3) from the regression of Yi = b0 + b3X3i.

Partial F Test for the Contribution of a Set of X Variables

Hypotheses:
H0: the variables Xi, … do not significantly improve the model, given that all the others are included.
H1: the variables Xi, … significantly improve the model, given that all the others are included.

Test statistic, with df = k and (n - p - 1):

F = [ SSR(Xi, … | all others) / k ] / MSE

where k = the number of variables tested.

Testing Portions of Model: Example

Test at the α = .05 level to determine whether the variable of average temperature significantly improves the model, given that insulation is included.

H0: X1 does not improve the model (X2 included)
H1: X1 does improve the model
α = .05, df = 1 and 12; critical value = 4.75

ANOVA (for X1 and X2):
              SS             MS
Regression    228014.6263    114007.313
Residual        8120.603016     676.716918
Total         236135.2293

ANOVA (for X2 only):
              SS
Regression     51076.47
Residual      185058.8
Total         236135.2

F = [SSR(X1, X2) - SSR(X2)] / MSE = (228,015 - 51,076) / 676.717 = 261.47

Conclusion: reject H0; X1 does improve the model.
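The same computation, spelled out from the two ANOVA tables:

ssr_full = 228014.6263      # SSR for the model with X1 and X2
ssr_reduced = 51076.47      # SSR for the model with X2 only
mse_full = 676.716918       # MSE for the full model
k = 1                       # number of variables being tested (X1)

F = (ssr_full - ssr_reduced) / k / mse_full
print(F)   # about 261.47, well above the critical value of 4.75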

Do I need to do this for one variable?

The partial F test for the inclusion of a single variable, after all other variables are included in the model, is IDENTICAL to the t test of the slope for that variable (F = t²). The only reason to do a partial F test is to test several variables together.

Example: Collinear Variables

20,000 executives in 439 corporations; dependent variable = base pay + bonus.

                           Individual simple    Multiple-regression
                           regression R²        contribution to R²
Company dummies                .33                  .08
Occupational dummies           .52                  .022
Position in hierarchy          .69                  .104
Human capital variables        .28                  .032
Shared                          —                   .632
TOTAL                           —                   .87

Backup slides

Multiple Regression

The value of the outcome variable depends on several explanatory variables.

F-test: judges whether the explanatory variables in the model adequately describe the outcome variable.
t-test: applies to each individual explanatory variable; a significant t indicates that the explanatory variable has an effect on the outcome variable while controlling for the other Xs.
t-ratio: used to judge the relative importance of an explanatory variable.

Problem of Multicollinearity
When explanatory variables are correlated, it is difficult to interpret the effect of each explanatory variable on the outcome. Check by:
- the correlation coefficient matrix (see next slide);
- an F-test that is significant while the individual t-tests are not;
- large changes in the regression coefficients when variables are added or deleted (variance inflation). A variance inflation factor (VIF) greater than 4 or 5 indicates multicollinearity; a code sketch for computing VIFs follows.
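A minimal VIF sketch using statsmodels (assumed available); X is a hypothetical DataFrame holding the explanatory variables:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.Series:
    Xc = sm.add_constant(X)
    # column 0 is the constant; report one VIF per explanatory variable
    return pd.Series(
        [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
        index=X.columns,
    )

Variables with a VIF above 4 or 5 are candidates for removal before interpreting the coefficients.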

Example of a Matrix Plot

This matrix plot comprises several scatter plots and provides visual information as to whether variables are correlated. The arrow points at a scatter plot where two explanatory variables are correlated.
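One way to draw such a matrix plot in Python (pandas and matplotlib assumed available); the data frame below is synthetic, with x1 and x2 deliberately correlated:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 + 0.1 * rng.normal(size=100),   # strongly correlated with x1
    "x3": rng.normal(size=100),              # independent of the others
})

scatter_matrix(df, figsize=(8, 8), diagonal="hist")
plt.show()   # the x1-x2 panel shows the tell-tale diagonal band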

Selecting the most Economic Model


The purpose is to find the smallest number of explanatory variables that make the maximum contribution to explaining the outcome. After excluding variables that may be causing multicollinearity, examine the table of t-ratios in the full model; the variables with a significant t are included in the subset. In the Analysis of Variance table, examine the column headed SEQ SS and check that the candidate variables are indeed making a sizable contribution to the regression sum of squares.

Stepwise Regression Analysis

Stepwise starts by finding the explanatory variable with the highest R². It then checks each of the remaining variables until the pair of variables with the highest R² is found, then repeats the process until the three variables with the highest R² are found, and so on. The overall R² gets larger as more variables are added. Stepwise may be useful in the early, exploratory stage of data analysis, but it should not be relied upon for the confirmatory stage.

Is the Model Adequate?


Judged by the following:
- The R² value: an increase in R² on adding another variable gives a useful hint; adjusted R² is a more sensitive measure.
- The smallest value of s (the standard deviation of the residuals).
- The Cp statistic: the model with the smallest Cp is used, such that the Cp value is closest to the number of parameters in the model.

Confidence Interval Estimate for the Slope

Provide the 95% confidence interval for the population slope β1 (the effect of temperature on oil consumption):

b1 ± t(n-p-1) · S(b1)

               Coefficients    Lower 95%      Upper 95%
Intercept       562.151009      516.193084     608.108935
X Variable 1     -5.4365806      -6.169133      -4.704029
X Variable 2    -20.012321      -25.116201     -14.908440

-6.169 ≤ β1 ≤ -4.704

The average consumption of oil is reduced by between 4.70 and 6.17 gallons for each 1 °F increase in temperature, in houses with the same insulation.
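Reproducing the interval for the temperature slope from the output above (n = 15, p = 2, so n - p - 1 = 12 degrees of freedom; scipy assumed):

from scipy.stats import t

b1, s_b1 = -5.4365806, 0.336216167      # slope and standard error from the output
t_crit = t.ppf(0.975, df=12)
print(b1 - t_crit * s_b1, b1 + t_crit * s_b1)   # about (-6.169, -4.704)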
