Topics
Measures of Location
Mean
Median
Mode
Geometric Mean
Measures of Variability
Range
Variance
Standard Deviation
Coefficient of Variation
Z-score: Number of Standard Deviations
No:
Addresses
313 173rd Blvd, Kent, WA 981215
316 66th Blvd, Kent, WA 981244
4358 23rd St, Kent, WA 981225
965 151st St, Kent, WA 981162
7900 173rd Lane, Kent, WA 981266
4047 15th Ave, Kent, WA 981228
4907 13th Ave, Kent, WA 981232
3789 4th Blvd, Seattle, WA 981152
2977 66th Lane, Seattle, WA 981171
3392 23rd St, Seattle, WA 981131
Yes:
Address           City      State   Zip
313 173rd Blvd    Kent      WA      981215
316 66th Blvd     Kent      WA      981244
4358 23rd St      Kent      WA      981225
965 151st St      Kent      WA      981162
7900 173rd Lane   Kent      WA      981266
4047 15th Ave     Kent      WA      981228
4907 13th Ave     Kent      WA      981232
3789 4th Blvd     Seattle   WA      981152
2977 66th Lane    Seattle   WA      981171
3392 23rd St      Seattle   WA      981131
Why?
Because it is easier to analyze data when it is stored in its smallest parts
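As a rough illustration of storing data in its smallest parts, the sketch below splits a one-cell address into Address, City, State and Zip. The function name `split_address` is hypothetical, and the code assumes every address follows the "street, city, ST zip" pattern of the example list.

```python
# Split a single-cell address into its smallest parts.
# Assumption: every address looks like "street, city, ST zip".

def split_address(full_address):
    """Return a dict with Address, City, State and Zip as separate fields."""
    street, city, state_zip = [part.strip() for part in full_address.split(",")]
    state, zip_code = state_zip.split()
    return {"Address": street, "City": city, "State": state, "Zip": zip_code}

record = split_address("313 173rd Blvd, Kent, WA 981215")
print(record)
```

Once the parts are separate fields, filtering or counting by City or Zip no longer requires parsing text.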
Data:
Date        Sales     SalesRep
12/1/2014   $19,161   Jo
12/1/2014   $15,027   Gigi
12/2/2014   $12,953   Chin
12/2/2014   $12,670   Jo
12/2/2014   $8,893    Gigi
12/3/2014   $4,667    Chin
12/3/2014   $20,272   Jo
12/3/2014   $20,204   Gigi
12/3/2014   $17,223   Chin
Empty Cells: Not really a Data Type, but a "thing" in Excel that can sometimes cause problems.
**Refer to Empty Cells as "Empty Cells", not "blanks".
Why Default Alignment? Because Left means Excel thinks it is Text and Right means Excel thinks it is a Number. This is important when dealing with data because some systems will mistakenly import numbers as text. Numbers stored as text do not always behave as you expect (for example, they are not added by the SUM function). The Default Alignment is a visual cue that tells us how Excel sees the data.
Transaction Number   Date        Sales     SalesRep
12568                12/1/2014   $19,161   Jo
12569                12/1/2014   $15,027   Gigi
12570                12/2/2014   $12,953   Chin
12571                12/2/2014   $12,670   Jo
12572                12/2/2014   $8,893    Gigi
12573                12/3/2014   $4,667    Chin
12574                12/3/2014   $20,272   Jo
12575                12/3/2014   $20,204   Gigi
12576                12/3/2014   $17,223   Chin
Variables
Element = Entities on which data are collected. We are collecting data for each Transaction Number. Transaction Number is the Element.
Each row is a Record / Observation
Variable
A characteristic or quantity of interest that can take on different values
A Variable is also known as a Field or Column Header in Database terminology
Example: Street address, City, State, Zip for a customer
Element
Entities on which data are collected
Like collecting data for an Employee or Invoice Number
Primary Key
When the first column in a Proper Data Set contains a Unique List of Elements, it is called a Primary Key.
Primary Key, Unique List of Elements, List of Unique Identifiers, and Distinct List are all synonyms.
The Primary Key ensures that data collected for a given element is stored in one and only one place.
Observation or Record
The set of values collected for one element; each row in a Proper Data Set is one Record (Observation)
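The Primary Key requirement described above (a unique list of elements) can be checked mechanically. A minimal sketch, using the Transaction Numbers from the Sales data set; the helper name `is_primary_key` is hypothetical.

```python
# Check whether a column qualifies as a Primary Key (no repeated values).

transaction_numbers = [12568, 12569, 12570, 12571, 12572,
                       12573, 12574, 12575, 12576]

def is_primary_key(values):
    """Return True when every value appears exactly once."""
    return len(values) == len(set(values))

print(is_primary_key(transaction_numbers))            # True
print(is_primary_key(transaction_numbers + [12568]))  # False: 12568 repeats
```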
Variation
Population
All elements of interest
Sample
Subset of the population
Random sampling
Quantitative Data
Number Data on which numeric and arithmetic operations, such as
addition, subtraction, multiplication, and division, can be performed.
Discrete Quantitative Data: There are gaps between numbers, like
counting: 1, 2, 3
Continuous Quantitative Data: There are no gaps between numbers,
like weight, time, money. The number depends on the measurement
instrument.
Categorical Data
Not Number Data, like Product Names or Yes/No Data, on which arithmetic operations cannot be performed.
Data Terminology
Cross-sectional Data
[Table: key statistics for GOOG, YHOO, FB, and Industry. Rows: Market Cap, Employees, Qtrly Rev Growth (yoy), Revenue (ttm), Gross Margin (ttm), EBITDA (ttm), Operating Margin (ttm), Net Income (ttm), EPS (ttm), P/E (ttm), PEG (5 yr expected), P/S (ttm)]
Sources of Data
Experimental study
Customer Lists
Sales or Expense Lists
Census Data
Weather Data
Government sources (data.gov)
Purchase data from companies such as: Bloomberg, Dow Jones
Filter
PivotTables
From Field List, drag field name (Criteria for calculations) to Row Header or Column Header
From Field List, drag the field you want to make a calculation upon to the Values area
Formatting:
Design, Report Layout, Show in Tabular or Outline Form
Right-Click: Number Formatting (so format follows the field if you Pivot)
If you want more than one calculation, drop the field into the Values area more than once and then change the calculation.
To Group: after dragging the field to the Row area, Right-click, Group.
When Grouping in a PivotTable, Numbers with Decimals create ambiguous labels.
When Grouping in a PivotTable, Numbers with NO Decimals create unambiguous labels.
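Conceptually, dragging one field to the Row area and another to the Values area is a group-and-aggregate operation. The sketch below mimics that with plain Python on the nine Sales rows from the data set earlier in these notes; `pivot_sum` is a hypothetical helper, not an Excel feature.

```python
from collections import defaultdict

# (Date, Sales, SalesRep) rows copied from the Sales data set.
rows = [
    ("12/1/2014", 19161, "Jo"),
    ("12/1/2014", 15027, "Gigi"),
    ("12/2/2014", 12953, "Chin"),
    ("12/2/2014", 12670, "Jo"),
    ("12/2/2014", 8893, "Gigi"),
    ("12/3/2014", 4667, "Chin"),
    ("12/3/2014", 20272, "Jo"),
    ("12/3/2014", 20204, "Gigi"),
    ("12/3/2014", 17223, "Chin"),
]

def pivot_sum(rows, key_index, value_index):
    """Sum the value field for each distinct item in the key (row) field."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[value_index]
    return dict(totals)

sales_by_rep = pivot_sum(rows, key_index=2, value_index=1)
sales_by_date = pivot_sum(rows, key_index=0, value_index=1)
print(sales_by_rep)   # {'Jo': 52103, 'Gigi': 44124, 'Chin': 34843}
print(sales_by_date)  # {'12/1/2014': 34188, '12/2/2014': 34516, '12/3/2014': 62366}
```

Changing `key_index` plays the role of dragging a different field to the Row area.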
Column/Bar Chart:
Used to show Frequency Distribution or Relative/Percent Frequency Distribution for Categorical Data
Counts across categories. Height of columns conveys count. Order of categories conveys no info
There are "gaps" between columns to indicate that the data is categorical or a discrete quantitative variable (not a continuous quantitative variable). Columns do not touch
COUNTIFS function:
Web Site                 Frequency   % Frequency
amazon.com               11436       43.12%
coloradoboomerangs.com   6380        24.05%
ebay.com                 5810        21.90%
gel-boomerang.com        2898        10.93%
Grand Total              26524       100.00%
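The Frequency and % Frequency columns above can be reproduced with a short sketch. The counts are taken from the table; the three-item `raw_sites` list is a hypothetical stand-in for the raw data, included only to show the counting step that COUNTIFS performs.

```python
from collections import Counter

# Counting step (what COUNTIFS does per category), on hypothetical raw rows.
raw_sites = ["ebay.com", "amazon.com", "ebay.com"]
raw_counts = Counter(raw_sites)  # Counter({'ebay.com': 2, 'amazon.com': 1})

# Frequency column from the table above.
frequency = {
    "amazon.com": 11436,
    "coloradoboomerangs.com": 6380,
    "ebay.com": 5810,
    "gel-boomerang.com": 2898,
}
total = sum(frequency.values())

# % Frequency = category count / total count.
percent_frequency = {site: count / total for site, count in frequency.items()}

for site, count in frequency.items():
    print(f"{site:<24}{count:>8}{percent_frequency[site]:>10.2%}")
print(f"{'Total':<24}{total:>8}{1.0:>10.2%}")
```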
1. The goal is to reveal the natural distribution or shape or variation of the data. This is the "art side of statistics". It takes practice to get the hang of it.
2. Determine the number of nonoverlapping classes. Goal is to have enough to show the natural shape of the data. One general guideline is: 2^k > n, where n = count and k = number of classes.
3. Determine the width of each class with something like: Approx. width = (Max-Min)/(Number of classes).
4. Determine the class limits: the key is to not create classes where you would double count.
   1. If you have a discrete variable (or a continuous variable that is shown as a whole number) it is just a matter of getting the lower and upper limit, like: 0-9, 10-19...
   2. If you have a continuous variable and you choose to use the upper limit from the previous class as the lower limit for the current class, be sure to include the equal sign on either the lower or upper, but not both. Create classes like: 0 <= Sales < 20, 20 <= Sales < 40... or 0 up to 20, 20 up to 40...
   3. When we create a set of classes, we create a type of category for our continuous quantitative variable
   4. Making the classes all the same width helps to create tables & charts that are more easily interpreted
   5. Sometimes if there are a few large values or small values, it may be efficient to create an open ended class
5. Class midpoint is calculated as the halfway mark between the lower and upper limit
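The guideline 2^k > n, the approximate width formula and the class midpoint can be sketched as below. The count n = 26524 is the Grand Total used in these notes; the Min of 0 and Max of 3000 are assumptions made for illustration, and all function names are hypothetical.

```python
def number_of_classes(n):
    """Smallest k such that 2**k > n."""
    k = 1
    while 2 ** k <= n:
        k += 1
    return k

def approx_class_width(minimum, maximum, k):
    """Approx. width = (Max - Min) / (Number of classes)."""
    return (maximum - minimum) / k

def class_midpoint(lower, upper):
    """Halfway mark between the lower and upper limit."""
    return (lower + upper) / 2

n = 26524                                # count of observations
k = number_of_classes(n)                 # 2**14 = 16384 <= n < 2**15 = 32768
width = approx_class_width(0, 3000, k)   # assumed Min = 0, Max = 3000
print(k, width, class_midpoint(0, 200))  # 15 200.0 100.0
```

The result is only a starting point; the notes stress that picking classes is partly a judgment call.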
Histograms
Used to show the frequency distribution of continuous quantitative data over a set of class intervals (lower and upper limit for each category)
Column or Bar Charts where columns are touching to indicate that the variable is continuous
Columns touch to indicate that no numbers can fit between classes: "No numbers can fit between columns - no gaps"
Height of columns conveys count
Order of classes is important to help reveal the shape, or distribution, of the data.
Cumulative Distributions
Example of Frequency Distribution & Cumulative Percent Frequency Distribution
Because you have control over the comparison operators, you can create any type of Upper and Lower Limit.
This is different from the PivotTable Grouping feature and the FREQUENCY Array Function.
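To make the point about comparison operators concrete, here is a small sketch that counts values into classes where the lower limit is included and the upper limit is not (lower <= value < upper). The sales values and the helper name `count_in_class` are hypothetical.

```python
def count_in_class(values, lower, upper):
    """Count values with lower <= value < upper (upper limit NOT included)."""
    return sum(1 for v in values if lower <= v < upper)

# Hypothetical sales values, including values that sit exactly on a class limit.
sales = [0, 5.5, 19.99, 20, 20, 39.99, 40]
classes = [(0, 20), (20, 40), (40, 60)]

frequency = [count_in_class(sales, lo, hi) for lo, hi in classes]
print(frequency)  # [3, 3, 1]
```

Because the equal sign appears on only one side, a value such as 20 lands in exactly one class and nothing is double counted.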
You must add this feature in: File tab, Options, Add-ins, Manage: Excel Add-ins, Click Go, Check box for Analysis ToolPak, Click OK
This feature will create the Frequency Table (just like the FREQUENCY Array Function), a Histogram and a Cumulative Distribution. If Gap Width in the Chart is not zero, you must change it to zero so that the columns touch.
The FREQUENCY Array Function and the Data Analysis Histogram tool yield the same answer.
Sales Data
[Histogram: Frequency by hour of day, 12 AM through 11 PM; total Frequency = 26524]
Frequency Distribution and Histogram for Revenue with PivotTable:
Revenue (Upper NOT Included)   Frequency   % Cumulative Frequency
0-200                          16,431      61.95%
200-400                        2,570       71.64%
400-600                        1,501       77.30%
600-800                        2,021       84.92%
800-1000                       1,432       90.31%
1000-1200                      935         93.84%
1200-1400                      707         96.51%
1400-1600                      405         98.03%
1600-1800                      253         98.99%
1800-2000                      89          99.32%
2000-2200                      56          99.53%
2200-2400                      54          99.74%
2400-2600                      49          99.92%
2600-2800                      18          99.99%
2800-3000                      3           100.00%
Grand Total                    26,524
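The % Cumulative Frequency column is a running total of Frequency divided by the Grand Total. A minimal sketch using the Revenue frequencies from the table above:

```python
from itertools import accumulate

# Frequency per Revenue class, 0-200 through 2800-3000.
frequency = [16431, 2570, 1501, 2021, 1432, 935, 707, 405,
             253, 89, 56, 54, 49, 18, 3]

total = sum(frequency)                     # Grand Total
running_total = list(accumulate(frequency))
cumulative_percent = [count / total for count in running_total]

for freq, pct in zip(frequency, cumulative_percent):
    print(f"{freq:>7,}{pct:>10.2%}")
```

The last class always reaches 100%, which is a quick check that no observation was dropped or double counted.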
Frequency Distribution and Histogram for Revenue with FREQUENCY Array Function:
Frequency Distribution and Histogram for Revenue with Data Analysis Histogram Feature:
Skew of Histograms
Measures of Location:
Average = Typical Value = Measure of central location
"Typical Values" calculated so that we have one value that can
represent all the data points.
Examples:
Mean
Median
Mode
Geometric Mean
Mean
Arithmetic Mean: Add them up and divide by the count
Good for quantitative data when there are not extreme values - extreme values can make the mean
look too big or too small (Median more representative of a typical value in that case)
Use AVERAGE function
Median
Sort, then take the one in the middle. If count odd, take one in middle, if even, average middle two.
Marks the point in the sorted list (an actual number) where 50% of the numbers are above and 50%
of the numbers are below
Good for quantitative data when there are extreme values (like house prices and salaries)
Use MEDIAN function
Mode
One that occurs most frequently (can be bimodal, multimodal)
Good for Categorical Data (Nominal and Ordinal)
Use MODE.SNGL for quantitative data and COUNTIF or PivotTable for Categorical or quantitative data.
MODE.SNGL will only show 1 mode if the data set is bi-modal or multi-modal. MODE.MULT can be
used for multiple modes.
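The three measures can be compared on a small data set with one extreme value. The sketch uses Python's standard `statistics` module as a stand-in for AVERAGE, MEDIAN, MODE.SNGL and MODE.MULT; the salary figures are hypothetical.

```python
import statistics

# Hypothetical salaries in $ thousands; 400 is an extreme value.
salaries = [40, 45, 45, 50, 55, 60, 400]

mean = statistics.mean(salaries)      # pulled upward by the extreme value
median = statistics.median(salaries)  # middle of the sorted list
mode = statistics.mode(salaries)      # most frequent value
print(mean, median, mode)             # mean is about 99.29, median 50, mode 45

# multimode returns every mode, similar in spirit to MODE.MULT.
modes = statistics.multimode([1, 1, 2, 2, 3])  # [1, 2]

# Mode also works for categorical data.
top_rep = statistics.mode(["Jo", "Gigi", "Jo", "Chin"])  # 'Jo'
```

Here the mean is almost double the median, which is why the median is the more representative "typical value" when extreme values are present.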
Geometric Mean
Use Geometric Mean when you have "Growth Rates" or "Rates of Change" and you want:
True "Average" Compounding Rate per Period
You have a Begin Value and you want to calculate the End Value after a number of periods, like in Finance
Arithmetic Mean overestimates the average compounding rate when the growth rates vary from period to period
Arithmetic Mean is for additive processes; Geometric Mean is for multiplicative processes
Arithmetic Mean is still used in some situations, like for Standard Deviation, Correlation, and other calculations that do not require the True "Average" Compounding Rate per Period.
"Growth Rates" or "Rates of Change" = % change from one period to the next
Growth Factor = Growth Rate + 1
Growth Factor is the value that you use when calculating End Value from Begin Value.
Like: BeginValue*(1+GeometricMean)^NumberOfPeriods = EndValue, where GeometricMean is the geometric mean of the Growth Factors minus 1
In Finance: PV*(1+PeriodRate)^NumberOfPeriods = FV
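A short sketch of the Begin Value to End Value calculation, comparing the geometric and arithmetic means. The three growth rates and the begin value of 1000 are hypothetical; `statistics.geometric_mean` requires Python 3.8 or later.

```python
import statistics

growth_rates = [0.10, -0.05, 0.20]               # hypothetical % changes per period
growth_factors = [1 + r for r in growth_rates]   # Growth Factor = Growth Rate + 1

# Average compounding rate per period = geometric mean of the factors, minus 1.
geometric_mean_rate = statistics.geometric_mean(growth_factors) - 1
arithmetic_mean_rate = statistics.mean(growth_rates)

begin_value = 1000.0
periods = len(growth_rates)

# Actual End Value: apply each period's growth factor in turn.
end_value_actual = begin_value
for factor in growth_factors:
    end_value_actual *= factor

end_value_geometric = begin_value * (1 + geometric_mean_rate) ** periods
end_value_arithmetic = begin_value * (1 + arithmetic_mean_rate) ** periods

print(round(end_value_actual, 2))      # 1254.0
print(round(end_value_geometric, 2))   # 1254.0
print(round(end_value_arithmetic, 2))  # larger than the actual End Value
```

Compounding at the geometric mean rate recovers the actual End Value, while the arithmetic mean rate overshoots it.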
Variability
Measures of Variability
Range
Variance
Standard Deviation
Coefficient of Variation
Z-score
Range
Max - Min
Simple to calculate. Sensitive to extreme values
Interquartile Range
Quartile 3 - Quartile 1
The range for the middle 50% of the data. It overcomes
the sensitivity to extreme values
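The difference in sensitivity can be seen by swapping one value for an extreme one. In this sketch the data are hypothetical, the helper names are made up, and quartiles are computed with `statistics.quantiles` using its "inclusive" method (an assumption; other quartile conventions give slightly different numbers).

```python
import statistics

def value_range(data):
    """Range = Max - Min."""
    return max(data) - min(data)

def interquartile_range(data):
    """IQR = Quartile 3 - Quartile 1 (inclusive quartile method assumed)."""
    q1, _q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return q3 - q1

prices = [2, 4, 4, 5, 7, 9, 10, 12]   # hypothetical data
with_outlier = prices[:-1] + [120]    # largest value replaced by an extreme one

print(value_range(prices), value_range(with_outlier))                  # 10 118
print(interquartile_range(prices), interquartile_range(with_outlier))  # 5.25 5.25
```

The Range jumps from 10 to 118, while the Interquartile Range does not move because it only looks at the middle 50% of the data.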
Proof that two formulas for Sample Standard Deviation are equal
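The derivation itself is not reproduced in the text. Assuming the two formulas meant are the definitional form and the computational (shortcut) form, a sketch of why they agree follows from expanding the square and using $\sum x_i = n\bar{x}$:

```latex
\begin{aligned}
\sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - n\bar{x}^2
\end{aligned}
\qquad\Longrightarrow\qquad
s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}
  = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}}
```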
Variance
A Numerical Measure that says how much variability there is in the data points
Variance uses all the data points, not just some like Range and Interquartile
Range
Variance has squared units, which makes interpreting it difficult.
Although Variance has squared units, it has many uses in statistics, especially
with Regression Analysis (chapter 4) and Hypothesis Testing
Standard Deviation undoes the squared units and is thus easier to interpret.
Use VAR.P function for population data
Use VAR.S function for sample data.
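The only difference between the population and sample calculations is the divisor (n versus n - 1). The sketch below uses Python's `statistics.pvariance` and `statistics.variance` as stand-ins for VAR.P and VAR.S; the data values are hypothetical.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data, mean = 5

mean = sum(data) / len(data)
squared_deviations = [(x - mean) ** 2 for x in data]   # sums to 32

# Population variance: divide by n (VAR.P analog).
population_variance = statistics.pvariance(data)
# Sample variance: divide by n - 1 (VAR.S analog).
sample_variance = statistics.variance(data)

print(population_variance)  # 4, that is 32 / 8
print(sample_variance)      # 32 / 7, about 4.571
```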
Standard Deviation = SD
Standard Deviation uses all the data points, not just some like Range and Interquartile Range
Standard Deviation does not have squared units (like Variance) and is thus easier to interpret
Standard deviation has the same units as the data!!
The sample standard deviation is a point estimator of the population standard deviation
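$s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$, where $\bar{x}$ is the sample mean and $n$ is the sample size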
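Standard Deviation is the square root of Variance, which is what brings it back to the units of the data. A minimal sketch on hypothetical data, with `statistics.stdev` (sample) and `statistics.pstdev` (population):

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data, e.g. minutes

sample_sd = statistics.stdev(data)        # sqrt of sample variance (n - 1)
population_sd = statistics.pstdev(data)   # sqrt of population variance (n)

print(population_sd)        # 2.0
print(round(sample_sd, 4))  # 2.1381

# Squaring the SD returns the variance (squared units).
variance_from_sd = sample_sd ** 2
```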
Standard Deviation: How Fairly Does Mean Represent Its Data Points?
Coefficient of Variation
Formula = SD/Mean
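Because SD is divided by the Mean, the Coefficient of Variation does not depend on the scale of the data. The sketch below assumes the sample standard deviation is used (the notes do not say which); the data and the helper name are hypothetical.

```python
import statistics

def coefficient_of_variation(data):
    """CV = SD / Mean, using the sample SD (an assumption)."""
    return statistics.stdev(data) / statistics.mean(data)

small_scale = [2, 4, 4, 4, 5, 5, 7, 9]         # hypothetical data
large_scale = [x * 100 for x in small_scale]   # same data in units 100x smaller

cv_small = coefficient_of_variation(small_scale)
cv_large = coefficient_of_variation(large_scale)
print(round(cv_small, 4), round(cv_large, 4))  # both about 0.4276
```

The SD of the second list is 100 times larger, yet the CV is unchanged, which makes CV useful for comparing variability across data sets with different means or units.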
Uses of z-score:
Empirical Rule
Excel Functions:
PERCENTILE.EXC
PERCENTILE.INC
Excel Functions:
QUARTILE.EXC
QUARTILE.INC
.INC = Inclusive: 0 = Min, 1 = Quartile 1,
2 = Quartile 2, 3 = Quartile 3, 4 = Max
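The inclusive and exclusive conventions can give different quartiles for the same data. Python's `statistics.quantiles` offers `method="inclusive"` and `method="exclusive"`, which should correspond to the .INC and .EXC variants; treat that mapping as an assumption and verify it against Excel on your own data. The data values are hypothetical.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 10, 12]   # hypothetical data

# Each call returns [Quartile 1, Quartile 2, Quartile 3].
q_inclusive = statistics.quantiles(data, n=4, method="inclusive")
q_exclusive = statistics.quantiles(data, n=4, method="exclusive")

print(q_inclusive)  # [4.0, 6.0, 9.25]
print(q_exclusive)  # [4.0, 6.0, 9.75]
```

Quartile 2 is the median under both conventions; only the outer quartiles shift, and the gap shrinks as the data set grows.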
Box Plots
Don't Forget: