
Département TECHNIQUES DE COMMERCIALISATION

MATHEMATICS - 2nd Semester

Bivariate descriptive statistics and prediction

Online lessons: on the ENT, section "outils pédagogiques", platform Claroline, category TC, MATHS2. Corrections of exercises and results of tests will be uploaded to the same place.

SUMMARY
Lessons and tutorials
I Introduction, vocabulary
I-1 Aims
I-2 Formatting
II Parameters of bivariate series
II-1 Central tendency
II-2 Dispersion
III Scatter plot and fitting
III-1 Scatter plot
III-2 Purpose of linear fitting
III-3 Mayer's method
III-4 Least square method
III-5 Linear correlation coefficient
IV Non-linear fitting: variable change
V Statistical prediction
V-1 Single prediction
V-2 Prediction by confidence interval
VI Contingency tables
Exercises
Annals
Form

IUT Saint-Etienne Department TC - J.F.Ferraris - Mathematics - S2 - Stat2var - LessonEx - Rev 2012 - page 2 on 23

I Introduction, vocabulary
I-1 Aims
Two quantitative characters are recorded together for each individual of a population of size n. They yield two data lists, i.e. two quantitative variables X and Y.
Aims:
* study the relationship between these two characters: their correlation;
* model this correlation by a mathematical function: regression;
* use this model to make a prediction, with an associated confidence level.

I-2 Formatting
Two cases may occur in a study:
1. One observation (# i) is written as a couple of values (x_i ; y_i).
e.g.: relation between the quantity of fertilizer spread and the harvested production; the series is represented by a "list table":

plot #                              1     2     3     4     5
fertilizer quantity X (kg.ha⁻¹)   150    80   120   220   100
harvest Y (q.ha⁻¹)                 46    37    46    51    43
n = 5

Type of case studied in parts II to V.

2. Different case: each couple (x_i ; y_j) is reached by several individuals; their number is the frequency "n_ij".
e.g.: relation between age and size (measures taken from 100 children); the series is represented by a "contingency table":

size Y (cm) \ age X (years)    [3 ; 5[    [5 ; 7[    [7 ; 9[
[95 ; 105[                        15          8          2
[105 ; 125[                       10         32         13
[125 ; 135[                        0          5         15
n = 100

Type of case exclusively studied in part VI.

Comment: In most cases, there is a cause-and-effect relationship between the two characters. The causal variable is then called the explanatory variable (usually X), and the other one the explained variable (usually Y).


II

Parameters of bivariate series


II-1 Central tendency
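The means of X and Y are:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad\text{and}\qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

def: The particular point $G(\bar{x}, \bar{y})$ is called the mean point of the series.

II-2 Dispersion
The variances of X and Y are:

$$V(X) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad\text{and}\qquad V(Y) = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2$$

and their standard deviations are:

$$\sigma(X) = \sqrt{V(X)} \qquad\text{and}\qquad \sigma(Y) = \sqrt{V(Y)}$$

The covariance of the couple (X, Y) is the number:

$$\mathrm{Cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

Koenig's theorem:

$$V(X) = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2 \;;\qquad V(Y) = \frac{1}{n}\sum_{i=1}^{n} y_i^2 - \bar{y}^2 \qquad\text{and}\qquad \mathrm{Cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y}$$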
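The parameters of this section can be checked numerically. The following is a minimal sketch, assuming the data of example 1 (fertilizer / harvest) and the population conventions used here (division by n, Koenig's form); the helper names `mean` and `covariance` are illustrative and not part of the lesson.

```python
from math import sqrt

# Example 1 data: fertilizer X (kg/ha) and harvest Y (q/ha)
x = [150, 80, 120, 220, 100]
y = [46, 37, 46, 51, 43]


def mean(values):
    """Arithmetic mean of a list."""
    return sum(values) / len(values)


def covariance(u, v):
    """Population covariance, Koenig's form: mean of products minus product of means."""
    return sum(ui * vi for ui, vi in zip(u, v)) / len(u) - mean(u) * mean(v)


x_bar, y_bar = mean(x), mean(y)          # coordinates of the mean point G
var_x = covariance(x, x)                 # V(X) is the covariance of X with itself
var_y = covariance(y, y)
sigma_x, sigma_y = sqrt(var_x), sqrt(var_y)
cov_xy = covariance(x, y)

print("G =", (x_bar, y_bar))
print("V(X) =", var_x, " V(Y) =", var_y)
print("sigma(X) =", sigma_x, " sigma(Y) =", sigma_y)
print("Cov(X, Y) =", cov_xy)
```

Computing the variance as the covariance of a variable with itself is only a shortcut to avoid writing Koenig's formula twice.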

T1 : parameters of a bivariate series

1- Manual computations
Taking example 1 (fertilizer/harvest), compute the parameters defined earlier. (You may save time and energy by using the lists of your calculator, first entering the data into lists 1 and 2.)


2- Use the statistics mode of your calculator
Taking the same example as before, ask your calculator for the statistical results from lists 1 and 2, and write them down (stats in "2var" mode, of course). Finally, compute the parameters that are still not given. You may also take note of the steps you had to perform in the stat mode of your calculator.


III

Scatter plot and fitting


III-1 Scatter plot
In an orthogonal frame, the values of X are placed on the x-axis (abscissas) and those of Y on the y-axis (ordinates). Each couple (x_i ; y_i) gives a point M_i.
Example 3: here are the advertising expenses (k€) of a food company:

X: year       2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012
Y: expense      41    60    55    66    87    61    90    95    82   120   125   118

[Figure: scatter plot of example 3; x-axis: year (1 = 2001, ..., 12 = 2012); y-axis: expense (k€)]

III-2 Purpose of linear fitting


A scatter plot may reveal a link between the two variables when its points do not appear to be randomly distributed. In some cases, the cloud of points has an elongated, more or less thin shape whose "axis" looks straight, showing a trend. Could we determine a straight line that follows the cloud of points "as closely as possible"?
Let's say this line is already drawn: (D): y = ax + b.
With a given value x_i are associated the value y_i (ordinate of the point M_i) and the value $\hat{y}_i = a x_i + b$ on the line (written y'_i in part V).

[Figure: points M_1, M_2, M_3, ..., M_i scattered around the line (D), in a frame with axes x and y]

definition: we call residue the number $e_i = y_i - \hat{y}_i = y_i - (a x_i + b)$.

Vocabulary: The "best" line is called the trendline or regression line. The technique that consists in modeling a cloud of points by a straight line is called linear regression or linear fitting.

III-3 Mayer's method


Some residues are positive, others are negative. Mayer's first idea was to define the "best line" as the one that leads to a zero sum of the residues (negative residues exactly balance positive ones).
definition: Mayer's principle consists in finding a regression line such that:

$$\sum_{i=1}^{n} e_i = 0$$

mathematical study:

$$\sum e_i = \sum (y_i - a x_i - b) = \sum y_i - a \sum x_i - nb$$

This sum equals zero

$$\Leftrightarrow\; \frac{1}{n}\sum y_i - a\,\frac{1}{n}\sum x_i - \frac{1}{n}\,nb = 0 \;\Leftrightarrow\; \bar{y} - a\bar{x} - b = 0$$

property: Mayer's line, whose expression is y = ax + b, passes through the mean point of the cloud, $G(\bar{x}, \bar{y})$.
comment: this property is not sufficient in itself to define a unique straight line, since it only gives one point (G). There is therefore an infinite number of lines that bring the sum of residues to zero! One of them is the line through the points G1 and G2, the mean points of two "half clouds".
Mayer's method:
1. Split the cloud into two equal parts:
* both subclouds contain the same number (n/2) of points if n is even; if n is odd, one contains (n+1)/2 points and the other (n-1)/2 points;
* the abscissas of the points of the first subcloud are less than those of the second.
2. Compute the coordinates of G1 and G2, the mean points of the two subclouds;
3. Draw Mayer's line (G1G2), and get its expression.
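The three steps of Mayer's method can be sketched in a few lines. This is an illustration under stated assumptions, not the official correction of T2: the data are those of example 1, and when n is odd the sketch arbitrarily gives the extra point to the first subcloud (the lesson does not say which half receives it). The names `mean_point` and `mayer_line` are hypothetical.

```python
def mean_point(points):
    """Mean point (x_bar, y_bar) of a list of (x, y) couples."""
    xs, ys = zip(*points)
    return sum(xs) / len(xs), sum(ys) / len(ys)


def mayer_line(points):
    """Return (a, b, G1, G2) for Mayer's line y = a*x + b."""
    ordered = sorted(points)                 # step 1: order the points by abscissa
    half = (len(ordered) + 1) // 2           # assumption: extra point goes to the first half
    g1 = mean_point(ordered[:half])          # step 2: mean points of both subclouds
    g2 = mean_point(ordered[half:])
    a = (g2[1] - g1[1]) / (g2[0] - g1[0])    # step 3: slope of the line (G1G2)
    b = g1[1] - a * g1[0]                    # intercept, so that the line contains G1
    return a, b, g1, g2


# Example 1: (fertilizer in kg/ha, harvest in q/ha)
points = [(150, 46), (80, 37), (120, 46), (220, 51), (100, 43)]
a, b, g1, g2 = mayer_line(points)
print("G1 =", g1, " G2 =", g2)
print("Mayer's line: y = %.4f x + %.3f" % (a, b))
```

Whatever the split, the line (G1G2) contains the overall mean point G, because G is a weighted mean of G1 and G2; this is consistent with the property stated above.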

T2 : Mayer's line of a series


Using example 1:

plot #                      1     2     3     4     5
fertilizer X (kg.ha⁻¹)    150    80   120   220   100
harvest Y (q.ha⁻¹)         46    37    46    51    43

On the graph below, draw the scatter plot and the Mayer's line (G1G2).

coordinates of both mean points :

Expression of Mayer's trendline :


T3 : Smoothing out data from a time series, moving means


Below are the variations of the turnover of a company, quarter by quarter (Q1 to Q4), in M€:

year      Q1    Q2    Q3    Q4
2009      28    45    49    36
2010      30    44    48    40
2011      28    46    52    37
2012      31    42    54    39

Besides short-term fluctuations (a seasonal activity due to its industry), could we get a clearer view of how the turnover evolves, by discovering some trend?

[Figure: empty graph to complete; x-axis: quarters of 2009, 2010, 2011 and 2012; y-axis: turnover (M€)]

1- Mayer's line
A first, rough answer consists in cutting the period in half, computing both means, then plotting the corresponding mean points and finally connecting them by Mayer's line.

2- Moving means
Aim: determine new points to plot a more regular line than the original one, showing a trend.
Each new point is a mean point of a subset of the initial series. Subsets must have the same size and contain consecutive values; from one subset to the next, you shift forward by one value.
e.g.: let's get the moving means with 5-point subsets:

mean of values 1 to 5:
mean of values 2 to 6:
mean of values 3 to 7:
etc.
mean of values 12 to 16:

explanations :
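The moving-means procedure described in T3 can be sketched as follows. This is only an illustration, assuming the sixteen quarterly turnovers given in the statement and 5-value subsets; the function name `moving_means` is hypothetical.

```python
# Quarterly turnover (M euros), 2009 Q1 to 2012 Q4
turnover = [28, 45, 49, 36,
            30, 44, 48, 40,
            28, 46, 52, 37,
            31, 42, 54, 39]


def moving_means(values, size):
    """Means of every run of `size` consecutive values, shifting by one value each time."""
    return [sum(values[i:i + size]) / size
            for i in range(len(values) - size + 1)]


means = moving_means(turnover, 5)
for start, m in enumerate(means, start=1):
    # each mean summarises the values numbered start .. start + 4
    print("values %2d to %2d: %.1f" % (start, start + 4, m))
```

With 16 values and subsets of 5, the sketch produces 12 means (values 1 to 5 up to values 12 to 16), matching the table to be completed.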


III-4 Least square method


The idea here is to square the residues. The "best" regression line is the one that gives the smallest sum of these squares, among the infinite number of possible lines in the plane.
definition: The least square method consists in finding a regression line such that

$$\sum_{i=1}^{n} e_i^2 \;\text{ is minimum}$$

mathematical study:
This sum is:

$$P(a, b) = \sum (y_i - a x_i - b)^2$$

a polynomial of two variables a and b. There are (at least) two different ways to expand this polynomial:

$$P(a, b) = \sum \big((y_i - a x_i) - b\big)^2 = n b^2 - 2b \sum (y_i - a x_i) + \sum (y_i - a x_i)^2 \qquad (1)$$

a 2nd degree polynomial, with variable b;

$$P(a, b) = \sum \big((y_i - b) - a x_i\big)^2 = a^2 \sum x_i^2 - 2a \Big(\sum x_i y_i - b \sum x_i\Big) + \sum (y_i - b)^2 \qquad (2)$$

a 2nd degree polynomial, with variable a.
Then, we can follow this scheme:
* Given a constant a and a variable b, P(a, b) in form (1) is minimum when its derivative (with respect to b) is zero (its leading coefficient, n, is positive), which implies $b = \bar{y} - a\bar{x}$;
* Giving this value to b and taking a as the variable, P(a, b) in form (2) is minimum when its derivative (with respect to a) is zero (its leading coefficient is positive), which implies

$$a = \frac{\frac{1}{n}\sum x_i y_i - \bar{x}\,\bar{y}}{\frac{1}{n}\sum x_i^2 - \bar{x}^2} = \frac{\mathrm{Cov}(X, Y)}{V(X)}$$

Calculus enthusiasts may try to rederive these results!
comments:
* such a value of b implies that the regression line passes through the mean point of the cloud, G.
* this regression line is unique.
least square results:
Compute the coefficients

$$a = \frac{\mathrm{Cov}(X, Y)}{V(X)}, \quad\text{then}\quad b = \bar{y} - a\bar{x}$$

Write down the expression of the Y on X regression line, D_{Y/X}: y = ax + b
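For readers who want to check the two minimisations, here is one possible way to carry them out. It is a sketch that follows the scheme of the mathematical study (first b, then a); the partial-derivative notation is an addition of this note and is not used elsewhere in the lesson.

```latex
% Step 1: a fixed, derivative of form (1) with respect to b
\frac{\partial P}{\partial b} = 2nb - 2\sum (y_i - a x_i) = 0
\;\Longrightarrow\;
b = \frac{1}{n}\sum y_i - a\,\frac{1}{n}\sum x_i = \bar{y} - a\bar{x}

% Step 2: derivative of form (2) with respect to a
\frac{\partial P}{\partial a} = 2a\sum x_i^2 - 2\Big(\sum x_i y_i - b\sum x_i\Big) = 0

% Substitute b = \bar{y} - a\bar{x} and \sum x_i = n\bar{x}
a\sum x_i^2 = \sum x_i y_i - n\bar{x}\,\bar{y} + a\,n\bar{x}^2
\;\Longrightarrow\;
a\Big(\frac{1}{n}\sum x_i^2 - \bar{x}^2\Big) = \frac{1}{n}\sum x_i y_i - \bar{x}\,\bar{y}

% Koenig's theorem identifies both sides
a\,V(X) = \mathrm{Cov}(X, Y)
\;\Longrightarrow\;
a = \frac{\mathrm{Cov}(X, Y)}{V(X)}
```

In both steps the coefficient of the squared variable (n, then the sum of the x_i squared) is positive, so the zero of the derivative is indeed a minimum.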

T4 : Regression line by least square method


Go back to example 3 and carry out the linear fitting by the least square method. Draw this line on the existing graph and notice that it passes through G. Computations and use of calculator:
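As a cross-check of the calculator results, the least square formulas a = Cov(X, Y) / V(X) and b = ȳ - a x̄ can be applied directly. This is a minimal sketch, assuming the data of example 3 with years numbered 1 (2001) to 12 (2012); the function name `least_squares` is illustrative.

```python
def least_squares(x, y):
    """Coefficients (a, b) of the Y on X regression line y = a*x + b."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cov = sum(xi * yi for xi, yi in zip(x, y)) / n - x_bar * y_bar   # Koenig's form
    var_x = sum(xi * xi for xi in x) / n - x_bar ** 2
    a = cov / var_x
    b = y_bar - a * x_bar            # forces the line through the mean point G
    return a, b


# Example 3: advertising expenses (k euros), year 1 = 2001 ... year 12 = 2012
years = list(range(1, 13))
expense = [41, 60, 55, 66, 87, 61, 90, 95, 82, 120, 125, 118]

a, b = least_squares(years, expense)
print("D(Y/X): y = %.3f x + %.3f" % (a, b))

# The line passes through G(x_bar, y_bar)
x_bar, y_bar = sum(years) / 12, sum(expense) / 12
print("value of the line at x_bar:", a * x_bar + b, " y_bar:", y_bar)
```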


III-5 Linear correlation coefficient


A scatter plot shows a more or less strong relationship between two variables X and Y. When the cloud is thin and seems to follow a straight axis, we talk about linear correlation. The value of the linear correlation coefficient reflects the strength of this trend.
def: linear correlation coefficient (Pearson's correlation coefficient)

$$\rho = \frac{\mathrm{Cov}(X, Y)}{\sigma(X)\,\sigma(Y)}$$

It can be proved that, whatever the statistical data set:

$$-1 \le \rho \le 1$$

(the letter R or r is also often used for this coefficient)

calculator: This coefficient is generally noted r. Warning: some models give its square (determination coefficient), writing it r²! So, we'll always choose to compute this coefficient ourselves.
Interpretation of its value: The stronger the linear correlation (cloud looking like a straight line), the closer |ρ| is to 1.
"positive correlation": ρ is positive when Y overall increases with X
"negative correlation": ρ is negative when Y overall decreases with an increasing X
0 ≤ |ρ| ≤ 0.5: weak linear correlation, inappropriate linear model.
0.5 ≤ |ρ| ≤ 0.75: medium linear correlation, linear model not appropriate.
0.75 ≤ |ρ| ≤ 0.95: tolerable linear correlation, the linear model may not be the best one.
0.95 ≤ |ρ| ≤ 1: strong linear correlation, the linear model is one of the most appropriate.
Comments:
* are X and Y really linked? If ρ is close to 1 (or -1), the points of the cloud are close to collinear. Nevertheless, that doesn't always mean that X and Y are concretely linked. E.g.: in France, from 1974 to 1981, the wedding rate decreased and in the meantime the GDP (French: PIB) increased, so that the scatter plot using both sets of data was quasi-linear (fourth graph below). The linear correlation is mathematically very strong, but facts and studies show there is no cause-and-effect relationship between the two variables! (After 1981, the following points are no longer collinear with the previous ones at all.)
* linear correlation only shows a linear link. The correlation between X and Y may be very strong, but not linear (so: curved). In that case, ρ is far from 1 and -1, and the study has to go further (see IV).
Some examples:

[Figure: four example scatter plots. Income (€) vs duration (weeks): R = 0.8449. Success rate vs % of disadvantaged SPC: R = -0.7457. Unit margin (€/u) vs quantity (ku): R = 0.6438. Wedding rate vs GDP (PIB): R = -0.9875.]

T5 : Linear correlation
Compute the linear correlation coefficients from examples 1 and 3.
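Since some calculators return r² instead of r, it can help to recompute the coefficient from its definition. The following is a hedged sketch, assuming the data of examples 1 and 3 and population formulas (division by n); the function name `correlation` is illustrative.

```python
from math import sqrt


def correlation(x, y):
    """Pearson's linear correlation coefficient: Cov(X, Y) / (sigma(X) * sigma(Y))."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    cov = sum(xi * yi for xi, yi in zip(x, y)) / n - x_bar * y_bar
    var_x = sum(xi * xi for xi in x) / n - x_bar ** 2
    var_y = sum(yi * yi for yi in y) / n - y_bar ** 2
    return cov / sqrt(var_x * var_y)


# Example 1: fertilizer (kg/ha) and harvest (q/ha)
rho1 = correlation([150, 80, 120, 220, 100], [46, 37, 46, 51, 43])

# Example 3: year number (1 = 2001) and advertising expense (k euros)
rho3 = correlation(list(range(1, 13)),
                   [41, 60, 55, 66, 87, 61, 90, 95, 82, 120, 125, 118])

print("example 1: rho = %.4f" % rho1)
print("example 3: rho = %.4f" % rho3)
```

Both values fall in the 0.75 to 0.95 band of the interpretation scale, i.e. a tolerable linear correlation.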


IV

Non-linear fitting: variable change


A variable change may be applied when the cloud of points seems to follow the curve of a function. The right function to be used will always be mentioned in the statement. It can be for instance:
* a logarithm or an exponential
* a 2nd degree (or higher) polynomial
* a trigonometric function

T6 : Variable change
example 4: the following series is provided (see table). While testing a motorcycle, speed X (km.h⁻¹) and fuel efficiency Y (L / 100 km) were recorded:

X     10     20     30     40     50     60     70     80     90
Y   15.2   11.6    9.3    7.8      7    6.6    6.9      8    9.6

The plotted points seem to follow a parabola whose vertex abscissa is 60. Let's set the variable T = (X - 60)² and analyse the linear correlation between variables T and Y.
Your work:
Variable change: replacement of X by T.

T
Y   15.2   11.6    9.3    7.8      7    6.6    6.9      8    9.6

[Figure: empty graph to complete; x-axis: X from 0 to 100; y-axis: Y from 0 to 16]

Determine the linear correlation coefficient of the couple (T , Y ), comment.

Determine the expression of Y on T regression line, by least square method.

Deduce the relationship (expression) between Y and X . ("curve regression")

On the graph, draw the curve of the function Y (X ).
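The whole T6 procedure (variable change, correlation, regression line, return to X) can be sketched as below. Two assumptions are made loudly: the variable change is read as T = (X - 60)², which matches a parabola with vertex abscissa 60, and the helper `fit` is an illustrative name, not something defined in the lesson.

```python
from math import sqrt

# Example 4: speed X (km/h) and fuel efficiency Y (L / 100 km)
speed = [10, 20, 30, 40, 50, 60, 70, 80, 90]
fuel = [15.2, 11.6, 9.3, 7.8, 7, 6.6, 6.9, 8, 9.6]

# Variable change (assumption: T = (X - 60)^2)
t = [(x - 60) ** 2 for x in speed]


def fit(u, v):
    """Least square line v = a*u + b and correlation coefficient of (u, v)."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    cov = sum(ui * vi for ui, vi in zip(u, v)) / n - u_bar * v_bar
    var_u = sum(ui * ui for ui in u) / n - u_bar ** 2
    var_v = sum(vi * vi for vi in v) / n - v_bar ** 2
    a = cov / var_u
    return a, v_bar - a * u_bar, cov / sqrt(var_u * var_v)


a, b, rho = fit(t, fuel)
print("rho(T, Y) = %.4f" % rho)
print("Y on T line: y = %.5f t + %.3f" % (a, b))


def curve(x):
    """Curve regression of Y on X, obtained by replacing t with (x - 60)^2."""
    return a * (x - 60) ** 2 + b


print("modelled consumption at 60 km/h: %.2f L/100 km" % curve(60))
```

The regression itself is still linear (between T and Y); only the return to X turns the line into a parabola.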


V Statistical prediction
V-1 Single prediction
The expression of the regression line (with or without variable change) makes it possible to estimate a value of one variable, given a new, unexplored value of the other (generally a value higher than those of the initial data set). In particular, if X is a time variable, the method lets us make a prediction for the future.

T7 : Single prediction
1- Taking example 3, estimate the advertising expense to anticipate for 2013.

2- Taking example 1, estimate the quantity of fertilizer necessary for a harvest of 60 q/ha.

3- Taking example 4, estimate fuel efficiency when speed is 100 km/h.
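A single prediction only requires evaluating the regression line, or solving it for x when the unknown is the explanatory variable. The sketch below illustrates both directions on examples 3 and 1, assuming the least square line of Y on X is used in both cases; `fit_line` is a hypothetical helper name.

```python
def fit_line(x, y):
    """Least square coefficients (a, b) of the Y on X regression line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    cov = sum(xi * yi for xi, yi in zip(x, y)) / n - x_bar * y_bar
    var_x = sum(xi * xi for xi in x) / n - x_bar ** 2
    a = cov / var_x
    return a, y_bar - a * x_bar


# Example 3: predict y for a new x (2013 is year number 13)
a3, b3 = fit_line(list(range(1, 13)),
                  [41, 60, 55, 66, 87, 61, 90, 95, 82, 120, 125, 118])
expense_2013 = a3 * 13 + b3
print("expected advertising expense in 2013: %.1f k euros" % expense_2013)

# Example 1: find the x that gives a target y, by solving y = a*x + b for x
a1, b1 = fit_line([150, 80, 120, 220, 100], [46, 37, 46, 51, 43])
fertilizer_needed = (60 - b1) / a1
print("fertilizer for 60 q/ha: %.0f kg/ha" % fertilizer_needed)
```

Both predictions are extrapolations beyond the observed range of x, so they should be read with the caution discussed in V-2.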


V-2 Prediction by confidence interval


We have to be careful with the single predicted value: depending on the strength of the linear correlation (i.e. the width of the dispersion in the cloud of points), we may trust it more or less. Here, the aim is to give a range (interval) of the most likely values instead of a single one, together with the probability that the real, unknown value belongs to this interval.
Rates method (used only with linear correlation, in order to estimate y, knowing x):
1. For each value x_i of the initial data set:
* compute the values y'_i according to the expression of the regression line
* compute the rates z_i = y_i / y'_i, thus building a new variable: Z
* get the mean $\bar{z}$ and the standard deviation $\sigma_z$ of variable Z
By hypothesis, often really close to reality, the values of Z follow a normal distribution (a mathematical framework in which the distribution of values around their mean is well known). This leads to (for instance):
95 % of the values of Z theoretically belong to the interval $[\,\bar{z} - 1.96\,\sigma_z \;;\; \bar{z} + 1.96\,\sigma_z\,]$
99 % of the values of Z theoretically belong to the interval $[\,\bar{z} - 2.58\,\sigma_z \;;\; \bar{z} + 2.58\,\sigma_z\,]$

2. Compute the single prediction y'_0 associated with the new desired value x_0 (with the regression line). The real value y_0, unknown, is then estimated as follows:
There is a 95% chance that y_0 belongs to $[\,y'_0(\bar{z} - 1.96\,\sigma_z) \;;\; y'_0(\bar{z} + 1.96\,\sigma_z)\,]$
There is a 99% chance that y_0 belongs to $[\,y'_0(\bar{z} - 2.58\,\sigma_z) \;;\; y'_0(\bar{z} + 2.58\,\sigma_z)\,]$

comments:
* this method is only valid when ρ > 0 (positive correlation)
* the rate (95%, 99%, etc.) is called the confidence level of the estimation.
* the size of such an interval (and so the level of uncertainty of our answer) increases when:
. the desired confidence level increases,
. |ρ| decreases,
. x_0 is far from the values x_i of the initial data set.
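The rates method can be sketched step by step as follows. This is an illustration under stated assumptions: the data are those of example 3, the new value is x_0 = 13 (year 2013), the standard deviation of Z is the population one (division by n, as elsewhere in this lesson), and the names `fit_line` and `rates_interval` are hypothetical.

```python
from statistics import mean, pstdev


def fit_line(x, y):
    """Least square coefficients (a, b) of the Y on X regression line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    cov = sum(xi * yi for xi, yi in zip(x, y)) / n - x_bar * y_bar
    var_x = sum(xi * xi for xi in x) / n - x_bar ** 2
    a = cov / var_x
    return a, y_bar - a * x_bar


def rates_interval(x, y, x0, u=1.96):
    """Return (y'_0, lower, upper) following the rates method; u = 1.96 or 2.58."""
    a, b = fit_line(x, y)
    # step 1: rates z_i = y_i / y'_i, then their mean and population standard deviation
    rates = [yi / (a * xi + b) for xi, yi in zip(x, y)]
    z_bar, sigma_z = mean(rates), pstdev(rates)
    # step 2: single prediction, then scaling by the bounds of the interval of Z
    y0 = a * x0 + b
    return y0, y0 * (z_bar - u * sigma_z), y0 * (z_bar + u * sigma_z)


years = list(range(1, 13))
expense = [41, 60, 55, 66, 87, 61, 90, 95, 82, 120, 125, 118]

y0, low95, high95 = rates_interval(years, expense, 13, u=1.96)
_, low99, high99 = rates_interval(years, expense, 13, u=2.58)
print("single prediction: %.1f" % y0)
print("95%% interval: [%.1f ; %.1f]" % (low95, high95))
print("99%% interval: [%.1f ; %.1f]" % (low99, high99))
```

Raising the confidence level from 95% to 99% widens the interval, which illustrates the first of the comments above.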


T8 : Estimation by confidence interval


1- Taking example 3, estimate the advertising expense to anticipate, by a 95% confidence interval.

2- Taking example 1, estimate the harvest obtained with 300 kg/ha of fertilizer, by a 99% confidence interval.


VI

Particularity of contingency tables


Here, a given couple (x, y) may have been observed on more than one individual. In this case, the data set is represented in a contingency table (i.e. with frequencies):
* its columns (e.g.) contain the different values of X, the x_i;
* its rows are then for the different values of Y, the y_j, not necessarily as numerous as the values of X;
* its content is reserved for frequencies (the frequency of the couple (x_i ; y_j) is written n_ij).
case: relationship between visual acuity and age

Y: acuity \ X: age     20     40     50     60
3/10                    1      5     10     20
6/10                    8     12     25     18
9/10                   55     26     14      6

* Columns here show the centers of the former age classes, values x_1, x_2, x_3, x_4.
* Rows here show the centers of the former acuity classes, values y_1, y_2, y_3 (so: 0.3, 0.6, 0.9).
* The content gives, for each couple (x ; y), the corresponding number of persons. e.g.: 25 persons in the age class centered on 50 have an acuity of 6/10.

All the items introduced earlier in this chapter (fitting, estimation, ...) are relevant in the case of contingency tables. The only difference is the existence of frequencies different from 1 (see form).

T9 : Parameters, fitting and confidence interval


Let's go back to the example above.

Y: acuity \ X: age        20       40       50       60      n_.j    n_.j y_j    n_.j y_j²    Σ_i n_ij x_i y_j
3/10                       1        5       10       20        36      10.8         3.24          576
6/10                       8       12       25       18        63      37.8        22.68         1782
9/10                      55       26       14        6       101      90.9        81.81         2880
n_i.                      64       43       49       44       200     139.5       107.73         5238
n_i. x_i                1280     1720     2450     2640      8090
n_i. x_i²              25600    68800   122500   158400    375300
Σ_j n_ij x_i y_j        1092     1284     1530     1332      5238

On your calculator, in stat mode, enter the contingency table. Then, write down the results given back to you by the 2var option. Check the ones that match the results in the grey cells (above table).
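The margins and totals of the T9 table can be recomputed from the frequencies alone. The sketch below assumes the frequencies printed in the T9 table (in particular 26 for the couple (40 ; 9/10), which is the value consistent with the margins 43 and 101) and the weighted formulas of the form; the variable names are illustrative.

```python
ages = [20, 40, 50, 60]          # x_i: centers of the age classes
acuity = [0.3, 0.6, 0.9]         # y_j: centers of the acuity classes

# counts[j][i] = n_ij, number of persons with acuity y_j and age x_i
counts = [
    [1, 5, 10, 20],
    [8, 12, 25, 18],
    [55, 26, 14, 6],
]

N = sum(sum(row) for row in counts)
col_totals = [sum(row[i] for row in counts) for i in range(len(ages))]   # n_i.
row_totals = [sum(row) for row in counts]                                # n_.j

# Weighted means and variances (Koenig's form, divided by N)
x_bar = sum(n * x for n, x in zip(col_totals, ages)) / N
y_bar = sum(n * y for n, y in zip(row_totals, acuity)) / N
var_x = sum(n * x * x for n, x in zip(col_totals, ages)) / N - x_bar ** 2
var_y = sum(n * y * y for n, y in zip(row_totals, acuity)) / N - y_bar ** 2

# Double sum of n_ij * x_i * y_j, then covariance
sum_nxy = sum(counts[j][i] * ages[i] * acuity[j]
              for j in range(len(acuity)) for i in range(len(ages)))
cov = sum_nxy / N - x_bar * y_bar

print("N =", N, " n_i. =", col_totals, " n_.j =", row_totals)
print("x_bar = %.2f  y_bar = %.4f" % (x_bar, y_bar))
print("V(X) = %.4f  V(Y) = %.6f" % (var_x, var_y))
print("sum n_ij x_i y_j = %.0f  Cov(X, Y) = %.6f" % (sum_nxy, cov))
```

The only change with respect to the list-table case is that every term is weighted by its frequency and the division is by N instead of n.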


Compute the parameters of this statistical series (covariance, ...) and give the expression of the regression line (least squares).

Give the 99% confidence interval for the predictable acuity of an 80-year-old person.


EXERCISES
Exercise 1 linear fitting
The table below indicates the sales prices (€) of a machine and the number of units sold (one column = one year).

ranking of the year          1        2        3        4
sales price X              300      210      270      375
number of sales Y          198      240      222      160
turnover X × Y (€)       59400    50400    59940    60000

1) Graph the scatter plot with points M_i (x_i , y_i) in an orthogonal frame. Frame's origin: point (210, 160); scales: 1 cm for 15 € on the x-axis and 1 cm for 10 units on the y-axis. Look at the plot to tell whether a linear fitting seems appropriate.
2) Determine the coordinates of the cloud's mean point G. Plot it on the graph.
3) a) Give an expression of D, the regression line obtained with the least squares method (round coefficients to 10⁻³).
b) Draw this regression line on your graph.
4) Which year shows the highest turnover? How much is it?
going further:
5) It is now stated that each year, the number of sold units y and the sales price x are related as follows: y = -0.498x + 349. Let S(x) be the turnover made by selling y machines (x euros each).
a) Express S(x) in terms of x.
b) Analyse the variations of the function S defined for x ∈ [210 ; 375].
c) Deduce the sales price to set for a machine, for the fifth year, if we want to optimize the turnover S(x). How many units will be sold (round your result to one unit)? For what turnover?

Exercise 2 2nd degree fitting
A company recorded its profit Y against its production X:

X (tons)     2     3     5     7    11
Y (k€)      38    55    72    69    24
T

1) Using your calculator, compute the linear correlation coefficient between X and Y. Comment.
2) A new variable is introduced: T = -(X - 6)².
a. Complete the table.
b. Compute the linear correlation coefficient of the couple (T, Y). Is a linear fitting relevant here?
c. Give the expression of the Y on T regression line, by the least squares method.
d. Deduce an expression of a Y on X regression.
3) Graph the cloud of points (x_i, y_i) and the curve whose expression was found earlier (2d.).

Exercise 3 logarithm fitting
Here, all results are to be rounded to the nearest 10⁻³.
The lifetime of a set of identical pieces of office equipment has been studied. In the following table, t_i represents the duration of use expressed in thousands of hours and R(t_i) the rate of equipment still in use at time t_i. (e.g.: we can read that after 1,000 hours, 90 % of the equipment is still in use, R(t_i) = 0.90).

t_i        1      2      3      4      5      6      7      8      9
R(t_i)   0.9   0.66   0.53    0.4   0.32   0.25   0.19   0.14    0.1

1) We set y_i = ln[R(t_i)] where ln is the natural logarithm. Fill in the following table, then graph the cloud of points M_i (t_i , y_i) in a plane with an orthogonal frame.

t_i
y_i

2) Would a linear fitting be relevant for the previous cloud of points? Compute the linear correlation coefficient between T and Y.

3) By the least square method, determine an expression of the Y on T regression line. Deduce from this expression that there are two positive real numbers k and λ such that: $R(t) = k\,e^{-\lambda t}$

4) In this question, we'll take k = 1.174 and λ = 0.266.
a. Determine the rate of equipment still in use after 10,000 hours.
b. Determine the time t_0 when 50 % of the equipment is still in use.
5) In this question, we're looking for a confidence interval of the rate of equipment still in use after 10,000 hours of use.
a. Compute the values y'_i corresponding to the Y on T linear fitting.
b. Compute the values z_i = y_i / y'_i, then get the mean and standard deviation of Z.
c. Find a 95% confidence interval of y for t = 10.
d. Then, what is the 95% confidence interval of R?

Exercise 4 contingency table
Let's go back to an example introduced at the beginning of the document: 100 children distributed according to their age and size.

size Y (cm) \ age X (years)    [3 ; 5[    [5 ; 7[    [7 ; 9[
[95 ; 105[                        15          8          2
[105 ; 125[                       10         32         13
[125 ; 135[                        0          5         15

1) Enter this table in your calculator.
2) Give the means and standard deviations of X and Y, and compute their covariance.
3) Compute their linear correlation coefficient; comment.
4) Nevertheless, does the table allow us to see some trend?
5) Assuming that the relationship between age and size is linear until the age of 12, give the 95% confidence interval of the size of a 12-year-old child.


ANNALS

Promotion 2009-2011 :
Exercise 1 (15 points)
The company Colis-Bris transports all kinds of packets. Its turnovers have been broken down according to the unit volume of the carried packets:

volume (liters)* X    [0 ; 2[   [2 ; 10[   [10 ; 50[   [50 ; 200[   [200 ; 500[   [500 ; 1000[
turnover (k€) Y          19        25          27           32            36             38

* 1 m³ = 1,000 liters

Part 1 (6 points)
1) a. Compute the linear correlation coefficient of variables X and Y (which implies using intermediate results given by your calculator, taking the class centers as X values).
b. Comment on its value.
2) By a linear regression of Y on X following the least squares method, estimate the turnover the company could reach carrying packets of 1.5 m³ mean volume.
3) a. What is a residue?
b. Describe the principle of the least squares method.

Part 2 (9 points)
We set T = ln(X), where the x_i are still the centers of the classes.
1) Complete the appendix 1 table.
2) a. Compute the linear correlation coefficient of variables T and Y.
b. Comment on its value.
3) a. Give the expression of the Y on T regression line (least squares).
b. Deduce an expression of Y in terms of X.
c. Thanks to this latest expression, re-estimate the possible turnover with 1.5 m³ packets.
d. Assuming this latest estimation is right, what was the error percentage of the first estimation done in part one?
4) On appendix 2, graph the cloud of points (x_i, y_i) and plot the curve whose expression was found in question 3b. Scales: 1 cm for 100 liters, 3 cm for 10 k€.
5) a. Working with T and Y: give the 95% confidence interval of the turnover y for t = ln(1500).
b. Plot this interval on the appendix 2 graph.

Exercise 2 (5 points)
In a country, the index of purchasing power of inhabitants has been compared to the turnover of its automotive industry:

purchasing power (index) X    3.26   3.85   3.44   3.08    3.6
turnover auto (G€) Y           9.3   9.56   9.36   9.24   9.47

1) Give an expression of the Y on X regression line, using Mayer's method.
2) By means of a single prediction, tell what the purchasing power index should be so that the automotive industry would reach a 10 G€ turnover (10 billion euros).
3) Is a good linear correlation between two variables the sign of a strong cause-and-effect relationship?

Appendix 1 (exercise 1 part 2): T and Y values table

T = ln(X)
turnover (k€) Y    19    25    27    32    36    38


Promotion 2010-2012 :
Exercise 1 (8 points)
A market study has been performed on a new kind of product. The table below gives, for several considered sales prices, the number of clients ready to buy it at this price.

unit price (€) X     2     3     4     5     6     7
# of answers Y      66    47    34    25    18    14

1) Compute the covariance of variables X and Y, then comment on its sign. (1.5)
2) We set T = X (X - 20).
a. Compute the linear correlation coefficient between variables T and Y. (1)
b. Comment on its value. (0.5)
c. Give the expression of the Y on T regression line (least square method). (0.5)
d. Deduce an expanded expression of Y in terms of X. (1)
3) Here, we examine the expected turnover (unit sales price × number of units sold), assuming that the numbers of answers are numbers of units sold.
a. Compute all the turnovers obtained from the above table. (1)
b. Compute, taking the same values of X as in the table, the expected turnovers according to the formula found in question 2d. (1)
c. What unit sales price would you set, to get the highest possible turnover? (1.5)

Exercise 2 (4 points)
A commercial agent analyzes his activity and efficiency. For each visited prospect, he noted the time spent (X, minutes) presenting his product, and the quantity (Y) sold. The table content is the number of visits corresponding to each couple (X, Y).

Y \ X    [0 ; 10[    [10 ; 20[    [20 ; 30[
0            3           0            1
1            2           4            5
2            2           8           12
3            0           7            3

1) What is the meaning of the frequency "8" found in the table? (1)
2) Compute manually the mean time spent for each visit. (2)
3) Compute the covariance of the couple (X, Y). (1)

Exercise 3 (8 points)
The monthly sales revenues of a commercial website are listed below, from January to December 2010:
in k€: 3 5 4 8 10 9 13 12 17 18 18 21
1) In a few words, describe the least square method.
2) According to the general trend shown by the evolution of the monthly sales revenue, and using the least square method, give the 95% confidence interval of the expected sales revenue for December 2011. (Number the months from 1 for January 2010.)
3) What is the probability that in December 2011 the sales revenue might be less than 29.23 k€?
4) On the appendix, plot the cloud of points (scale: 2 cm per month), the regression line, and finally the confidence interval.

Promotion 2011-2013 :
Exercise 1 (8 points)

city     A     B     C     D     E     F     G     H
X      850   623   587   360   312   275   262   244
Y       58    37    38    20    16    15    12    12

This table gathers eight big cities of a country. Variable X gives, in thousands, the city's population; variable Y gives, in thousands, the number of students in this city.
1) On the appendix, plot the cloud of points of the series. (1)
2) Give the coordinates of G, the mean point of the cloud. (0.5)
3) a. Applying Mayer's method, determine (manually) the expression of the regression line. (2)
b. Draw this line. Is G part of it? (1)
c. State Mayer's principle. (0.5)
4) We'll use here another regression line, whose expression is: y' = 0.07x - 6.
a. With this line, give the 95% confidence interval of the number of students in a town of two million inhabitants. (2.5)
b. Comment on the probability that the number of students would be more than 155,000. (0.5)

Exercise 2 (7 points)
A perfume shop, analyzing its turnover, compares the number (Y) of sales of different brands and models of perfume with the price (X) of the bottles.

X, price (€)      15    25    30    40    45    60    75    90
Y, # of sales    202   117   107    82    78    60    55    48

If a question asks "compute", you'll use intermediate results given by your calculator.
1) a. Compute the covariance of variables X and Y; comment on its sign. (1.5)
b. Compute the linear correlation coefficient between X and Y; comment on its value. (1)
2) Looking for a sharper study of the relationship between X and Y, a variable change has been chosen: $T = \frac{850}{X}$
a. After storing all the values of T in a third list of your calculator, examine the linear correlation between T and Y. (0.5)
b. Give the expression of the Y on T regression line, with the least square method. (0.5)
c. State the least square criterion. (1)
d. Deduce from question 2b. a model expression of Y in terms of X. (1.5)
e. According to this model, how many bottles priced at 150 € is the shop supposed to sell? (1)

Exercise 3 (5 points)
500 persons who passed their driving licence test are listed in the table below. They are classified according to the number X of times they took the test before passing it and to the number Y of hours of driving lessons taken before their first try.

Y \ X           1     2     3     4
[0 ; 15[       23    77    42    12
[15 ; 25[      92    84    35     6
[25 ; 40[      80    33    13     3

1) What is a marginal frequency? Give an example from the table.
2) Briefly describe the method used to enter all these data into your calculator.
3) Compute the covariance of the couple (X, Y) and comment concretely on its value.
4) Among the persons who took between 15 and 25 h of driving lessons, what is the rate of those who passed the test at the third attempt?
5) Among the persons who passed their test at the third attempt, what is the rate of those who took between 15 and 25 h of driving lessons?


FORM
IUT - TC Mathematics - Form for the first test of semester 2

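2-variable statistics:

Without contingency
* mean, variance and standard deviation

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \;;\qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

$$V(X) = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2 \;;\qquad V(Y) = \frac{1}{n}\sum_{i=1}^{n} y_i^2 - \bar{y}^2$$

$$\sigma(X) = \sigma_x = \sqrt{V(X)} \;;\qquad \sigma(Y) = \sigma_y = \sqrt{V(Y)}$$

* covariance and linear correlation coefficient

$$\mathrm{Cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y} \;;\qquad \rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_x\,\sigma_y}$$

* parameters in the expression of the Y on X regression line (least square method): y = ax + b with

$$a = \frac{\mathrm{Cov}(X, Y)}{V(X)} \qquad\text{and}\qquad b = \bar{y} - a\bar{x}$$

With contingency
* mean and variance

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{p} n_{i.}\,x_i \;;\qquad \bar{y} = \frac{1}{N}\sum_{j=1}^{q} n_{.j}\,y_j$$

$$V(X) = \frac{1}{N}\sum_{i=1}^{p} n_{i.}\,x_i^2 - \bar{x}^2 \;;\qquad V(Y) = \frac{1}{N}\sum_{j=1}^{q} n_{.j}\,y_j^2 - \bar{y}^2$$

* covariance

$$\mathrm{Cov}(X, Y) = \frac{1}{N}\sum_{i=1}^{p}\sum_{j=1}^{q} n_{ij}\,x_i\,y_j - \bar{x}\,\bar{y}$$

* other formulas are identical to those given above

Confidence interval of y for a given value x_0:

$$\big[\, y'_0\,(\bar{z} - u\,\sigma_z) \;;\; y'_0\,(\bar{z} + u\,\sigma_z) \,\big]$$

where
* $y'_0 = a x_0 + b$
* $y'_i = a x_i + b$ and $z_i = \dfrac{y_i}{y'_i}$
* u = 1.96 (confidence level: 95%) or 2.58 (confidence level: 99%)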
