What is Statistics ?
Measures of Spread
Shape Measures
Introduction to Statistics
Descriptive Statistics
Measuring the Central Tendency
Mean, Weighted Mean, Geometric Mean, Median, Mode
Measure of Spread
Percentiles, Quartiles and Deciles
Range, Interquartile, variation
Standard Deviation, Coefficient of Variation
Shape Measures
Skewness and Kurtosis
Statistics – What is it ?
It is a science that deals with the collection, classification, analysis, and interpretation of numerical facts or
data, and possibly with using them to predict future events
Nominal / Ordinal: Frequencies and proportions
Interval / Ratio: Mean, Median, Standard Deviation
Data
– Categorical: Gender (Female, Male), Customer feedback, Color, defective, Job title
– Categorical coded as numbers: Gender (Female (1), Male (2)), Customer feedback (1, 2, 3)
– Quantitative, Discrete: No of Customers
– Quantitative, Continuous: Income
Claim Num   Total claim Amount
AQI01       25000
AQI08       18000

Case   Name    Age   Income     Job       Gender
1      Raj     45    1,25,000   Sr. Mgr   Male
2      Ashok   50    2,56,500   A.Dir     Male
3      Mala    26    12,000     Sw eng    Female

Case   Exp   Salary
1      5     4000
2      8     7500
Raw data (excerpt): Diet Coke, Diet Coke, Coke Classic, Coke Classic, Dr. Pepper
We created a frequency table by counting the number of times each value appears in the dataset.
Soft Drink     Frequency   Percentage
Coke Classic   19          38%
Diet Coke      8           16%
Sprite         5           10%
[Bar chart: Frequency by soft drink (Coke Classic, Diet Coke, Dr. Pepper, Pepsi-Cola, Sprite)]
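The counting step behind a frequency table can be sketched in a few lines of Python. This is only an illustration: the `purchases` list is a hypothetical five-item sample, not the slide's full dataset, and `frequency_table` is a name chosen here.

```python
from collections import Counter

# Hypothetical sample of purchases; the slide's full dataset is not reproduced here.
purchases = ["Diet Coke", "Diet Coke", "Coke Classic", "Coke Classic", "Dr. Pepper"]

def frequency_table(values):
    """Return (value, count, percentage) rows, most frequent value first."""
    counts = Counter(values)
    total = len(values)
    return [(value, count, 100 * count / total)
            for value, count in counts.most_common()]

for value, count, pct in frequency_table(purchases):
    print(f"{value:<14}{count:>4}{pct:>7.0f}%")
```

Each percentage is simply the count divided by the total number of observations, so the percentages always add up to 100.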
[Table: frequency, relative frequency and cumulative relative frequency for a sample of 15 observations; the relative frequencies sum to 1.00]
• Cumulative frequency is the running total of frequencies through the classes of a frequency distribution
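A running total of this kind can be sketched with `itertools.accumulate`. The example assumes only the three soft drink classes whose frequencies are visible in the table above (19, 8 and 5); the remaining classes are not filled in.

```python
from itertools import accumulate

# Frequencies of the three classes visible in the soft drink table.
classes = ["Coke Classic", "Diet Coke", "Sprite"]
frequencies = [19, 8, 5]

# Each entry is the sum of all frequencies up to and including that class.
cumulative = list(accumulate(frequencies))

for name, freq, cum in zip(classes, frequencies, cumulative):
    print(f"{name:<14}{freq:>4}{cum:>6}")
```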
[Figure: data values between 17 and 65; caption not recoverable]
1   2000
2   2100
3   2200
4   2100
5   2300
6   2400
[Figure: the same six values drawn as two line charts, one with a vertical axis starting at 0 and one with a vertical axis from 2000 to 2450]
[Chart: Walmart, Tesco, Costco, Walgreens]
Bar charts allow you to compare relative sizes, but at a higher degree of precision
[Bar chart: Baby Care, Beauty Care, Gifts, Apparel, Computer, Electronics; horizontal axis 0 to 12]
[Histograms: frequency by value (114 to 129) and frequency by Age group (1-9, 10-19, ..., 90-99)]
Measures of Dispersion
Min and max
Range
Standard Deviation
Others
Skewness
Kurtosis
Distribution Plot
58350   $48670
63120   $57320
44640   $38150
56380   $41290
72250   $53160
        $500,000
Average Salary of the first five $ values ≈ 48000 (47,718 before rounding)
X bar with the $500,000 value included = $ 123,098
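The effect of the $500,000 value on the mean can be checked with a short Python sketch. It assumes the five dollar-denominated salaries listed above are the data and treats $500,000 as the outlier; the variable names are illustrative.

```python
from statistics import mean

# The five $ salaries from the slide, and the extreme value added afterwards.
salaries = [48670, 57320, 38150, 41290, 53160]
outlier = 500000

mean_without = mean(salaries)            # average of the five typical salaries
mean_with = mean(salaries + [outlier])   # average once the outlier is included

print(f"Mean without outlier: {mean_without:,.0f}")
print(f"Mean with outlier:    {mean_with:,.0f}")
```

A single extreme value more than doubles the mean, which is why the mean is a poor summary for heavily skewed data.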
1    250
2    300
3    280
4    270
5    320
6    290
7    260
8    280
9    240
10   260
No. of Observations = 10
SUM = 2750
MEAN = (2750/10) = 275
• Works well when data is not heavily skewed
• Easy to compute
– Many samples from the same population will have similar means
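10 20 22 24 32 35 51 31 11
Sorted: 10 11 20 22 24 31 32 35 51, so the median is the middle (5th) value, 24
10 20 22 24 32 32 51 31 11 33
Sorted: 10 11 20 22 24 31 32 32 33 51, so the median is the average of the two middle values: (24 + 31) / 2 = 27.5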
Median
What is the median value here ?
58350   $48670
63120   $57320
44640   $38150
56380   $41290
72250   $53160
        $500000
Mean is 123098
Median is 50915
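The median values can be reproduced with Python's `statistics.median`. The sketch assumes the five dollar salaries from the slide plus the $500,000 outlier; the names are illustrative.

```python
from statistics import median

salaries = [48670, 57320, 38150, 41290, 53160]
with_outlier = salaries + [500000]

# median() sorts the data internally; for an even count it averages
# the two middle values.
median_without = median(salaries)
median_with = median(with_outlier)

print(f"Median without outlier: {median_without:,.0f}")
print(f"Median with outlier:    {median_with:,.0f}")
```

The median moves only from 48,670 to 50,915 when the outlier is added, while the mean jumps to 123,098.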
Median
58350   $48670
63120   $57320
44640   $38150
56380   $41290
72250   $53160
Median with Outlier
Mode
Customer Id   Amount Spent (in $)
1             240
2             280
3             270
4             300
5             277
6             267
7             292

Amount Spent (bucket) / Mins Bucket   No. of Subscribers
< 300                                 8
300-500                               1
> 500                                 1
MODE = “<300”
Mode
Quiz 1 - Mode for a negatively skewed distribution
What is the mode here ?
[Histogram: negatively skewed distribution over categories 1 to 15, vertical axis 0 to 90]
Right Skewed data
Which one holds true ?
– mean < median < mode
– median < mode < mean
– mode < median < mean
– mode < mean < median
[Histogram: right-skewed data with values 62 56 53 46 40 36 25 22 22 18 13 10 5, vertical axis 0 to 70]
Positive and Negative skewed data
Quiz
Which one is applicable here ?
[Histogram: values 114 to 129]
Quiz 2 – Mode on a Uniform distribution
[Chart: monthly values, Jan to Dec]
Quiz 3 – Mode on a bimodal distribution
[Histogram: bimodal distribution, values 114 to 129]
Quiz 4 –Mode of distribution
[Two histograms: Number of employees by Income (categories 1 to 15), and frequency by value (114 to 129)]
Quiz 5 : Mode on Categorical data
What is the mode ?
[Bar chart: Male 34100, Female 16500]
Quiz 6: Mode
• The mode can be used to describe data that is either categorical or numerical (TRUE/FALSE)
• The expenditures in the data set below affect the mode (TRUE/FALSE)
– 32, 45, 32, 25, 28, 32
• The mode remains the same across multiple samples drawn from the same population (TRUE/FALSE)
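For the expenditure values listed in the quiz, the mode can be found with Python's `statistics` module. This is a minimal sketch; `multimode` is shown as well because a dataset may have more than one most frequent value.

```python
from statistics import mode, multimode

expenditures = [32, 45, 32, 25, 28, 32]

# mode() returns the single most frequent value.
most_common_value = mode(expenditures)

# multimode() returns every value tied for the highest frequency,
# which is useful for bimodal data.
all_modes = multimode(expenditures)

print(most_common_value, all_modes)
```

The same two functions also accept non-numeric values such as category labels.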
Geometric Mean
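In calculating the GM, the numbers must all be positive. When the values are rates of change that can be zero or
negative (for example, returns expressed as fractions), add 1 to each value before calculating the GM and subtract 1
from the answer.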
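The add-1, subtract-1 rule can be sketched as follows. The return figures are hypothetical and `geometric_mean_return` is an illustrative helper, not a library function; it assumes the values are fractional rates (0.10 means +10%).

```python
import math

def geometric_mean_return(returns):
    """Geometric mean of fractional rates: add 1, take the GM, subtract 1."""
    growth_factors = [1 + r for r in returns]
    if any(g <= 0 for g in growth_factors):
        raise ValueError("each 1 + rate must be positive")
    n = len(growth_factors)
    return math.prod(growth_factors) ** (1 / n) - 1

# Hypothetical yearly returns: +10%, -10%, +20%.
returns = [0.10, -0.10, 0.20]
print(f"Geometric mean return: {geometric_mean_return(returns):.4f}")
```

Note that a gain of 10% followed by a loss of 10% gives a slightly negative geometric mean, whereas the arithmetic mean of the two rates would be exactly zero.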
Measures of Central Tendency
Descriptive Statistics – Measures of Dispersion
What are Measures of Dispersion?
Range
Standard Deviation, Interquartile Range
Coefficient of Variation
What are Measures of Dispersion / Spread ?
What is the difference between measures of central tendency and measures of dispersion ?
Measures of central tendency describe the centre of the data but tell us nothing about its spread. We need
measures of dispersion to show whether the spread of the distribution is small or large.
Closer spread
Wider spread
Measures of Dispersion
Range
Range is computed by taking the difference between the maximum value and the minimum value.
Both datasets have a similar range, but the values are distributed differently.
Range shows the width of the data but not how the values are dispersed between the bounds, so it is not the
best way of measuring how the data is distributed within the range.
Drawback: the range depends only on the two extreme values, so a single outlier can change it substantially.
[Figure: two datasets of seven values each, both running from 5 to 20, plotted side by side]
Example with an outlier: 40, 89, 91, 93, 95, 100
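The sensitivity of the range to one extreme value can be seen with the six values listed above. This is a small sketch; treating 40 as the outlier is an interpretation of the slide, and `value_range` is an illustrative name.

```python
def value_range(values):
    """Range = maximum value minus minimum value."""
    return max(values) - min(values)

data = [40, 89, 91, 93, 95, 100]
without_low_value = [x for x in data if x != 40]

print(value_range(data))               # range with the value 40 included
print(value_range(without_low_value))  # range of the remaining five values
```

Five of the six values lie within 11 units of each other, yet the range of the full dataset is 60 because of the single value 40.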
Sasken Training, Adyar , Chennai – 600 020
Application of Range- Case Study
Another reason for examining the dispersion in a set of data is to compare the spread in two or more distributions.
For instance, suppose Sony assembles TVs in North and South Chennai, and the arithmetic mean hourly output at both the
North and South plants is 40.
From this we might conclude that the distributions of hourly output are identical, but that is not
a correct conclusion.
Standard Deviation
Standard deviation is a measure of how close to or far from the mean the observations are
Cust id   Amount Spent
1         250    This point is 0 units away from the mean (250-250)
2         345    This point is 95 units away from the mean (345-250)
3         280
4         290
5         175
6         200
7         255
8         150
9         375
10        180
Mean (µ) = 250
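The standard deviation of the ten amounts can be computed step by step. The sketch below shows both the population form (divide by n) and the sample form (divide by n - 1), since the slide does not say which one is intended; that choice is an assumption left open here.

```python
from statistics import mean, pstdev, stdev

amounts = [250, 345, 280, 290, 175, 200, 255, 150, 375, 180]

mu = mean(amounts)                                  # 250
deviations = [x - mu for x in amounts]              # distance of each point from the mean
sum_of_squares = sum(d ** 2 for d in deviations)    # squared distances added up

population_sd = pstdev(amounts)   # sqrt(sum_of_squares / n)
sample_sd = stdev(amounts)        # sqrt(sum_of_squares / (n - 1))

print(mu, deviations[:2], sum_of_squares)
print(f"population SD = {population_sd:.2f}, sample SD = {sample_sd:.2f}")
```

The first two deviations are 0 and 95, matching the annotations in the table.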
Let's say the 1000 employees of a company are organized into a histogram based on their number of
years of experience with the company. The mean is 4.9 years, but the data spreads from
6 months to 16.8 years.
The mean of 4.9 years is not very representative of all the employees
Application of Standard Deviation - Case Study 2
Another reason for examining the dispersion in a set of data is to compare the spread in two or more distributions.
For instance, suppose Sony assembles TVs in North and South Chennai, and the arithmetic mean hourly output at both the
North and South plants is 40.
From this we might conclude that the distributions of hourly output are identical, but that is not
a correct conclusion.
Application of Standard Deviation - Case Study 3
[Figure: profits against the state of the economy, one panel each for Wipro and Infosys]
Note that the relationship between the state of the economy and profits is much tighter (i.e. less dispersed)
for Wipro than for Infosys. Therefore, Wipro is less risky than Infosys.
Percentiles
Percentiles are descriptive statistics that give us reference points in our data.
A percentile is the value of a variable below which a certain percentage of observations fall.
The most commonly reported percentiles are quartiles, which break the data up into quarters.
We have an even number of data points, which means that when we calculate the quartiles we take the
two values around each quartile and average them
50th Percentile = ?
The 50th percentile is also referred to as the median
If it is 83.5, it means that 50% of the data values fall at or below 83.5
What is a Boxplot ?
It’s a visual representation that helps us understand how spread out the data is and detect outliers. To
construct one, we need the min, Q1, median, Q3 and max values. It is used to determine central
tendency, spread, skewness, and the existence of outliers.
[Boxplot diagram: Min, Lower Quartile (or 25th percentile), Median, Upper Quartile (or 75th percentile), Max; the IQR spans Q1 to Q3]
19 19 20 21 22 22 22 23 23 24 25 (Q1 and Q3 marked)
After getting the update from a sales person about his sales
accomplishment over phone , the manager noted down the sales made
as 25
[Axis: 70 75 80 85 90 95 100 105]
Age                19   20   21   22   23   24   25
No. of customers   2    1    1    3    2    1    1
110 110 130 130 140 150 160 160 170 170
Quartile 1 = 130
Quartile 2 = 145
Quartile 3 = 160
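The three quartiles above can be reproduced by splitting the sorted data at the median and taking the median of each half. This is one of several quartile conventions in use, and it is an assumption that the slide uses it; `quartiles` is an illustrative helper.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, the middle value is left out of both halves.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)

data = [110, 110, 130, 130, 140, 150, 160, 160, 170, 170]
q1, q2, q3 = quartiles(data)
print(q1, q2, q3, "IQR =", q3 - q1)
```

Other conventions, such as interpolating between neighbouring values, can give slightly different quartiles for the same data.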
Sample 2
33219, 36254, 38801, 46335, 46840, 47596, 55130, 56863, 78070, 88830
Q1, Q2 and Q3 split the sorted data into four groups of 25% each
IQR = Q3 – Q1
It also shows that the IQR is very resistant to outliers (and to some degree skew) while
the SD is not
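The resistance of the IQR to outliers can be illustrated with the ten Sample 2 values. In this sketch the largest value is replaced by a hypothetical extreme value of 500,000 (not in the source), and the quartiles use the median-of-halves convention, which is an assumption.

```python
from statistics import median, stdev

def iqr(values):
    """Interquartile range (Q3 - Q1) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(upper) - median(lower)

sample = [33219, 36254, 38801, 46335, 46840, 47596, 55130, 56863, 78070, 88830]
# Hypothetical variant: the largest value replaced by an extreme outlier.
with_outlier = sample[:-1] + [500000]

print("IQR:", iqr(sample), "->", iqr(with_outlier))
print(f"SD:  {stdev(sample):,.0f} -> {stdev(with_outlier):,.0f}")
```

The IQR is identical for both versions because the middle 50% of the data is untouched, while the standard deviation grows many times over.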
Shape Measures
Symmetry
How does data fall on either side of the set mean ? Is it symmetric or skewed ?
Is it possible to have symmetric distributions with more than one peak ?
Positive/Right Skew
[Histogram: Number of employees by Income band (less than 31, 31-40, 41-50, ..., 141-150), with the median marked]
When the skewness statistic is +ve, the data is right-skewed.
What can we say about this distribution ?
– Most employees have an income greater than 131k ?
– Most common income is less than 60,000
– No employee has an income greater than 70,000 ?
– It’s a symmetrical distribution
[Histogram: Number of employees by Income band (less than 31, 31-40, 41-50, ..., 141-150)]
Only a few employees are making more than 71,000 dollars; the yearly income is +vely skewed
Positive Skew
[Histogram: positively skewed distribution by Income band (less than 31, 31-40, 41-50, ..., 141-150)]
Negative skew or Left-skew - example
When the skewness statistic is -ve, the data is left-skewed.
A left skewed distribution tells us that the mean is less than the median.
[Histogram: left-skewed distribution over categories 1 to 15, vertical axis 0 to 90]
Quiz
• What is true about this distribution ?
– mode < mean < median
– mean < median < mode
[Histogram: Number of employees by Income band (less than 31, 31-40, 41-50, ..., 141-150), from Least to Highest]
Normally Distributed data - example
Incomes have a smooth distribution
[Histogram: values 114 to 129]
Kurtosis
Measuring the peakedness of data, reflecting the shape of the peak relative to the rest
of the distribution
Bi modal distributions
Summary
If the mean and median coincide and the skewness and kurtosis statistics
are close to 0, it is likely that your data comes from a normal distribution
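The skewness and kurtosis statistics mentioned here can be sketched from central moments. The formulas below are the simple moment-based versions, with kurtosis reported as excess kurtosis so that a normal distribution scores about 0; statistical packages may apply small-sample corrections, so their output can differ slightly. Both example datasets are hypothetical.

```python
from statistics import fmean

def central_moment(values, k):
    """Average of (x - mean) ** k over the data."""
    m = fmean(values)
    return fmean([(x - m) ** k for x in values])

def skewness(values):
    """Moment-based skewness: positive means right-skewed, negative left-skewed."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution gives about 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]            # hypothetical symmetric data
right_skewed = [1, 1, 2, 2, 3, 10]     # hypothetical data with a long right tail

print(f"{skewness(symmetric):.2f} {excess_kurtosis(symmetric):.2f}")
print(f"{skewness(right_skewed):.2f}")
```

A skewness of 0 alone does not imply normality: the symmetric example has zero skewness but a clearly negative excess kurtosis because it is flat rather than bell-shaped.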
Descriptive Statistics – Case Study
Now we can see how to use a retail data set to apply
descriptive statistics and gain insights
Descriptive Statistics