
## Unit I & II Student Notes

by
Vipul Mehta
Data & Variable
What is data?
Facts, statistics or items of information
that can be used further to obtain some knowledge

What is a variable?
Data is expressed in terms of variables.
A variable is any characteristic that varies from one member of a
population to another.
For example, the height of students in this classroom varies from
one individual to another.
Types of Variables
There are two types of variables:
Numerical variables (quantitative), which are either discrete or continuous
Categorical variables (qualitative)
Dataset and Data Table
Dataset: Data of a group of variables for a collection of
people
Data table: A dataset organized into a table, with one
column for each variable and one row for each individual
Data Table

| Name | Test 1 (out of 15) | Test 2 (out of 15) |
|---|---|---|
| Abhimanyu Kr Singh | 12 | 10 |
| Abhishek Gupta | 14 | 11 |
| Adarsh Mishra | 10 | 12 |
| Aditi Shrivastava | 9 | 11 |
| Akanksha | 7 | 9 |
| Amit Malik | 9 | 3 |
| Amlesh Kumar Singh | 3 | 8 |
| Anish Kumar | 9 | 4 |
| Ankit Gupta | 11 | 10 |
| Total | 90 | 75 |

How many students' performance is better in Test 2 than in Test 1?
What is the %age reduction in total marks of students from Test 1 to Test 2?
% improvement in marks?
If a student had to get a minimum of 50% to qualify, what % of students have qualified in Test 1? What % in Test 2?
Data Table
Expenditures of a Company (in Millions Rupees) per Annum Over the given Years.

| Year | Salary | Fuel and Transport | Bonus | Interest on Loans | Taxes | Total |
|---|---|---|---|---|---|---|
| 1998 | 288 | 98 | 3 | 23.4 | 83 | 495 |
| 1999 | 342 | 112 | 2.52 | 32.5 | 108 | 597 |
| 2000 | 324 | 101 | 3.84 | 41.6 | 74 | 544 |
| 2001 | 336 | 133 | 3.68 | 36.4 | 88 | 597 |
| 2002 | 420 | 142 | 3.96 | 49.4 | 98 | 713 |
| Total | 1710 | 586 | 17 | 183 | 451 | 2947 |

What is the average amount of interest per year which the company had to pay during this period?
The total amount of bonus paid by the company during the given period is approximately what percent of the total amount of salary paid during this period?
Total expenditure on all these items in 1998 was approximately what percent of the total expenditure in 2002?
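The three questions reduce to column sums and ratios over the expenditure table. A minimal Python sketch, assuming the table is typed in by hand (the dictionary layout and the helper name `column_total` are illustrative, not part of the notes). Yearly totals are recomputed from the individual items, so they differ slightly from the rounded Total column:

```python
# Expenditure per year in millions of rupees, copied from the table.
# Tuple order (an illustrative layout): salary, fuel and transport, bonus,
# interest on loans, taxes.
expenditure = {
    1998: (288, 98, 3.00, 23.4, 83),
    1999: (342, 112, 2.52, 32.5, 108),
    2000: (324, 101, 3.84, 41.6, 74),
    2001: (336, 133, 3.68, 36.4, 88),
    2002: (420, 142, 3.96, 49.4, 98),
}
SALARY, FUEL, BONUS, INTEREST, TAXES = range(5)


def column_total(column):
    """Sum one expenditure item over all years."""
    return sum(row[column] for row in expenditure.values())


# Average interest paid per year.
avg_interest = column_total(INTEREST) / len(expenditure)

# Total bonus as a percentage of total salary.
bonus_pct_of_salary = 100 * column_total(BONUS) / column_total(SALARY)

# Total 1998 expenditure as a percentage of total 2002 expenditure.
pct_1998_of_2002 = 100 * sum(expenditure[1998]) / sum(expenditure[2002])

print(f"Average interest per year: {avg_interest:.2f}")
print(f"Bonus as % of salary: {bonus_pct_of_salary:.2f}%")
print(f"1998 total as % of 2002 total: {pct_1998_of_2002:.2f}%")
```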
Types of Data
Primary data
Collected by you for the purpose of your analysis
Secondary data
Taken from some existing source such as trade journals, data
directories, government research bodies, etc.
Population v. Sample
Population
The entire group of individuals is called the population.

Sample
A sample is selected to represent the population in a research
study.
Statistics
Types of statistics
Descriptive statistics
Inferential statistics
Measures of Central Tendency
What is
Mean
Average of numbers
Median
Central number when all are put in increasing order.
What happens when the set has an even number of values?
Mode
Number that appears most often in the set
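As a quick illustration, the three measures can be computed with Python's standard `statistics` module. The sketch uses the nine Test 1 marks from the data table earlier in these notes; the variable names are illustrative. For an even number of values, the usual convention (and the one `statistics.median` follows) is to take the mean of the two middle values:

```python
import statistics

# Test 1 marks of the nine students listed in the data table.
test1 = [12, 14, 10, 9, 7, 9, 3, 9, 11]

mean_mark = statistics.mean(test1)      # sum of the values / number of values
median_mark = statistics.median(test1)  # middle value of the sorted list
mode_mark = statistics.mode(test1)      # most frequent value

# With an even number of values there is no single middle value, so the
# median is the mean of the two middle values (here 7 and 9).
even_median = statistics.median([3, 7, 9, 10])

print(mean_mark, median_mark, mode_mark, even_median)
```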
Other Measures of Central Tendency
Percentile
Quartile
Q1
Q2
Q3
Percentile
To calculate the pth percentile for a set of data, the following process is used:
1. Arrange the data in increasing order (smallest to largest)
2. Compute an index i given as
$$i = \frac{p}{100}\, n$$
   where p is the percentile of interest and n is the number of observations
3. A. If i is not an integer, round up to the next integer; the pth percentile is the value in that position
   B. If i is an integer, the pth percentile is the average of the values in positions i and i + 1
Let's say we find the heights of 10 students of this class (in cm) and get the following
results:
180 165 158 160 176 163 179 161 152 159
Find the 75th, 85th & 50th percentile.

Ans: 75th: 176, 85th: 179, 50th: 162
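The index rule can be sketched in Python as follows. This is a minimal illustration, assuming p is a whole number strictly between 0 and 100 (the notes do not say what to do at p = 0 or p = 100); the function name `percentile` is an illustrative choice. Integer arithmetic is used for the index so that the "is i an integer?" test is exact:

```python
def percentile(data, p):
    """pth percentile using the index rule i = (p / 100) * n."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(data)              # step 1: arrange in increasing order
    n = len(ordered)
    # step 2: i = p * n / 100, split into whole part and remainder
    whole, remainder = divmod(p * n, 100)
    if remainder:
        # step 3A: i is not an integer, round up to position whole + 1
        # (list index `whole`, because Python lists start at 0)
        return ordered[whole]
    # step 3B: i is an integer, average the values in positions i and i + 1
    return (ordered[whole - 1] + ordered[whole]) / 2


heights = [180, 165, 158, 160, 176, 163, 179, 161, 152, 159]
for p in (75, 85, 50):
    print(p, percentile(heights, p))
```

Run on the ten heights, this reproduces the answers given in the notes (176, 179 and 162).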

Quartile
Quartiles divide the ordered data into four equal parts; each quartile is a particular percentile
We define quartiles as:
Q1 = first quartile, or 25th percentile
Q2 = second quartile, or 50th percentile
Q3 = third quartile, or 75th percentile
Find out Q1, Q2 and Q3 for the above question. The set of numbers is
180 165 158 160 176 163 179 161 152 159
Ans: Q1 = 159, Q2 = 162, Q3 = 176
Measures of Variation
Thus, key measures of variation are:
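Range = Largest Value − Smallest Value
Interquartile Range, IQR = Q3 − Q1

Variance
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$$

Standard Deviation
$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2}$$

Coefficient of Variation
$$C_v = \frac{\sigma}{\mu} \times 100\%$$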
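A short Python sketch of these measures, applied to the ten heights used in the percentile example. It treats the ten values as the whole population (dividing by n) and takes Q1 = 159 and Q3 = 176 from the quartile answers in these notes; the variable names are illustrative:

```python
import math

heights = [180, 165, 158, 160, 176, 163, 179, 161, 152, 159]
n = len(heights)
mu = sum(heights) / n                     # mean of the values

data_range = max(heights) - min(heights)  # largest value - smallest value

q1, q3 = 159, 176                         # quartiles taken from the notes
iqr = q3 - q1                             # interquartile range

# Variance: average squared deviation from the mean (divides by n).
variance = sum((x - mu) ** 2 for x in heights) / n
std_dev = math.sqrt(variance)             # standard deviation
cv = std_dev / mu * 100                   # coefficient of variation, in %

print(f"Range = {data_range}, IQR = {iqr}")
print(f"Variance = {variance:.2f}, SD = {std_dev:.2f}, CV = {cv:.2f}%")
```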

Exploratory Data Analysis - Five Number Summary
In Five Number Summary, the following five numbers are used
to summarize the data:
1. Smallest value
2. First quartile (Q1)
3. Median (Q2)
4. Third quartile (Q3)
5. Largest value
For our set of numbers discussed in the previous class,
142 158 159 160 161 163 165 176 200 240
Perform the Five Number Summary
Hence draw the Box Plot of the data

Smallest value =
Q1 =
Median =
Q3 =
Largest Value =

[Box plot marking Smallest, Q1, Median, Q3 and Largest]
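The five numbers can be computed with a small Python helper. This is a sketch that assumes the quartiles are found with the same index rule as in the percentile section, i = (p / 100) n; the function names are illustrative:

```python
def percentile(data, p):
    """pth percentile by the index rule i = (p / 100) * n (whole p, 0 < p < 100)."""
    ordered = sorted(data)
    whole, remainder = divmod(p * len(ordered), 100)
    if remainder:
        return ordered[whole]                         # round i up
    return (ordered[whole - 1] + ordered[whole]) / 2  # average positions i, i + 1


def five_number_summary(data):
    """Return (smallest, Q1, median, Q3, largest)."""
    return (
        min(data),
        percentile(data, 25),
        percentile(data, 50),
        percentile(data, 75),
        max(data),
    )


values = [142, 158, 159, 160, 161, 163, 165, 176, 200, 240]
smallest, q1, median, q3, largest = five_number_summary(values)
print(smallest, q1, median, q3, largest)
```

In the box plot, the box runs from Q1 to Q3 with a line at the median, and the smallest and largest values mark the ends.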
Exercise
The following are the marks of 10 students in this class (out of 20)
4 4 5 5 6 7 8 8 9 14
a) Perform the five number summary and draw the box plot.
b) Do you observe any outliers in the data set?
Bessel's Correction
Multiply the variance obtained with the population formula (dividing by n) by n/(n − 1):
$$s^2 = \left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right]\frac{n}{n-1}$$
This is done to increase the sample variance to make it closer to the
population variance
Thus the unbiased sample variance becomes
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
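A small numerical check of the correction, sketched in Python. It treats the ten heights from the percentile example as a sample (an assumption made only for illustration) and shows that scaling the divide-by-n variance by n/(n − 1) gives the same number as dividing by n − 1 directly:

```python
sample = [180, 165, 158, 160, 176, 163, 179, 161, 152, 159]
n = len(sample)
x_bar = sum(sample) / n

# Sum of squared deviations from the sample mean.
squared_deviations = sum((x - x_bar) ** 2 for x in sample)

variance_divide_by_n = squared_deviations / n       # population-style formula
corrected = variance_divide_by_n * n / (n - 1)      # Bessel's correction
unbiased = squared_deviations / (n - 1)             # direct formula

print(f"Divide by n:     {variance_divide_by_n:.4f}")
print(f"After n/(n - 1): {corrected:.4f}")
print(f"Divide by n - 1: {unbiased:.4f}")
```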
Sample v. Population

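| Parameter | Population | Sample |
|---|---|---|
| Mean | $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$ | $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ |
| Variance | $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$ | $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ |
| Standard Deviation | $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$ | $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$ |

Note that population size is denoted by capital N.
Also note the population mean denoted by μ.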
Basic Statistics Using Excel
Mean: AVERAGE(number1, number2, ...)
Median: MEDIAN(number1, number2, ...)
Variance of Population: VARP(number1, number2, ...)
Variance of Sample: VAR(number1, number2, ...)
Standard Deviation of Population: STDEVP(number1, number2, ...)
Standard Deviation of Sample: STDEV(number1, number2, ...)

Note the difference between population and sample estimates
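For readers working outside Excel, Python's standard `statistics` module offers functions that play the same roles. The pairing below is a rough correspondence offered as a sketch, not an official mapping, and the marks are the Test 1 values from the data table earlier in these notes:

```python
import statistics

marks = [12, 14, 10, 9, 7, 9, 3, 9, 11]

# Keys are the Excel function names from the notes; values are computed
# with the statistics function that plays the same role.
summary = {
    "AVERAGE": statistics.mean(marks),
    "MEDIAN": statistics.median(marks),
    "VARP": statistics.pvariance(marks),   # population variance, divides by N
    "VAR": statistics.variance(marks),     # sample variance, divides by n - 1
    "STDEVP": statistics.pstdev(marks),    # population standard deviation
    "STDEV": statistics.stdev(marks),      # sample standard deviation
}

for name, value in summary.items():
    print(f"{name:8s} {value:.4f}")
```

The sample estimates come out larger than the population ones because they divide by n − 1 instead of N.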

Measures of Association Between Two Variables

You are store manager at Big Bazaar Faridabad. You want to
understand the effect of TV commercials that you have given
on the local television channel every weekend for the past 10
weeks. You want to analyze whether the TV commercials you have
put out are affecting your sales and whether you should continue
or discontinue them.
You have the following data:

| Week | Number of Commercials, x | Sales Volume (\$100s), y |
|---|---|---|
| 1 | 2 | 50 |
| 2 | 5 | 57 |
| 3 | 1 | 41 |
| 4 | 3 | 54 |
| 5 | 4 | 54 |
| 6 | 1 | 38 |
| 7 | 5 | 63 |
| 8 | 3 | 48 |
| 9 | 4 | 59 |
| 10 | 2 | 46 |

Source: Statistics for Business and Economics by Anderson, Sweeney, Williams

Covariance

$$s_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
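The sample covariance for the commercials data can be worked out step by step. A minimal Python sketch, with illustrative variable names:

```python
# Number of commercials (x) and sales volume in $100s (y) for the 10 weeks.
commercials = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
sales = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]

n = len(commercials)
x_bar = sum(commercials) / n   # mean number of commercials
y_bar = sum(sales) / n         # mean sales volume

# Sum of the products of deviations from the two means.
cross_products = sum(
    (x - x_bar) * (y - y_bar) for x, y in zip(commercials, sales)
)

s_xy = cross_products / (n - 1)   # sample covariance
print(x_bar, y_bar, cross_products, s_xy)
```

The positive value (99 / 9 = 11) says that weeks with more commercials than average tend to have higher sales than average.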
Scatter Diagram
[Scatter diagram: Sales (\$100s), from 35 to 65, plotted against Number of Commercials, from 0 to 6, with the plot area divided into quadrants I to IV]
What do we see?
We observe that sales increase as the number of commercials
increases
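Covariance Shortcoming
The sign of $s_{xy}$ gives insight about the direction of the linear relationship
The value of $s_{xy}$ depends on the units in which x and y are measured, so it
does not give us insight about the strength of the linear relationship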
Pearson's Correlation Coefficient
To overcome the shortcoming with Covariance, we use a
Correlation Coefficient
Pearson's Correlation Coefficient is defined as:
$$r_{xy} = \frac{s_{xy}}{s_x s_y}$$

Where,
$r_{xy}$ is the sample correlation coefficient
$s_{xy}$ is the sample covariance
$s_x$ is the sample standard deviation of x
$s_y$ is the sample standard deviation of y
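Applied to the commercials data, the definition can be sketched in Python as below. The helper name `sample_std` is illustrative; both standard deviations use the n − 1 divisor, matching the sample covariance:

```python
import math

commercials = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
sales = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]
n = len(commercials)


def sample_std(values):
    """Sample standard deviation (divides by n - 1)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))


x_bar = sum(commercials) / n
y_bar = sum(sales) / n

# Sample covariance of x and y.
s_xy = sum(
    (x - x_bar) * (y - y_bar) for x, y in zip(commercials, sales)
) / (n - 1)

# Pearson's correlation coefficient: covariance scaled by both deviations.
r_xy = s_xy / (sample_std(commercials) * sample_std(sales))
print(f"r_xy = {r_xy:.2f}")
```

The result is about 0.93. Because the units cancel, the value always lies between −1 and +1, and a value this close to +1 indicates a strong positive linear relationship.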
Spearman's Rank Coefficient
Spearman's Rank Coefficient is defined as:
$$r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$

where,
$r_s$ is Spearman's Rank Coefficient
$d_i$ is the difference between the ranks of the two variables for observation i
n is the sample size
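A minimal Python sketch of the formula. The two lists are hypothetical values invented for illustration, and the ranking helper assumes there are no tied values (the formula in this form does not handle ties):

```python
def ranks(values):
    """Rank each value from 1 (smallest) to n; assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman(x, y):
    """Spearman's rank coefficient: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical data, not taken from the notes.
x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]

print(ranks(x), ranks(y))
print(spearman(x, y))
```

Here the rank differences are 0, −1, 1, −1, 1, so the sum of squared differences is 4 and the coefficient is 1 − 24/120 = 0.8.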
Exercise
You are the marketing manager in a furniture making unit.
You have been assigned the task of figuring out the optimum
price of the newly developed table-chair set your company
has designed. You do a survey on a sample of people to see
their preference for the following set of prices. For example,
50 people said they prefer a price of Rs 400 for the set.
Similarly you get other observations.

| Price (00s) | 4 | 6 | 11 | 3 | 16 | 14 |
|---|---|---|---|---|---|---|
| No. of responses in favor | 50 | 45 | 40 | 60 | 30 | 35 |
On the above data, analyze the following:
a) If you have to draw a scatter diagram, which one would be the
independent variable?
b) Draw a scatter diagram of the data. What does it indicate
about the relationship between the two?
c) Compute and interpret the sample covariance.
d) Compute and interpret Pearson's correlation coefficient
e) Compute and interpret Spearman's Rank coefficient
Weighted Mean
What is the weighted average price of the following set of
data?

| Price (00s) | 4 | 6 | 11 | 3 | 16 | 14 |
|---|---|---|---|---|---|---|
| No. of responses in favor | 50 | 45 | 40 | 60 | 30 | 35 |

$$\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$$

Where $w_i$ are the weights of the individual numbers
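For the price survey, the number of responses in favor acts as the weight on each price. A short Python sketch, with illustrative variable names:

```python
prices = [4, 6, 11, 3, 16, 14]        # price in hundreds of rupees
responses = [50, 45, 40, 60, 30, 35]  # weights: responses in favor

# Weighted mean: sum of weight * value divided by the sum of the weights.
weighted_sum = sum(w * x for w, x in zip(responses, prices))
total_weight = sum(responses)
weighted_mean = weighted_sum / total_weight

print(weighted_sum, total_weight)
print(f"Weighted average price = {weighted_mean:.2f} (in 00s)")
```

The weighted average is 2060 / 260, roughly 7.92, which is about Rs 792. The plain average of the six prices is 9, so the popular low prices pull the weighted figure down.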

Weighted Average Cost of Capital
Assume newly formed Corporation ABC needs to raise \$1 million
in capital so it can buy office buildings and the equipment needed
to conduct its business. The company issues and sells 6,000
shares of stock at \$100 each to raise the first \$600,000.
Shareholders expect a return of 15% on their investment, thus the
cost of equity is 15%.
Corporation ABC then sells 400 bonds for \$1,000 each to raise
the other \$400,000 in capital. The people who bought those bonds
expect an after-tax return of 8%, so ABC's cost of debt is 8%.
Corporation ABC's total market value is now (\$600,000 equity +
\$400,000 debt) = \$1 million. Calculate corporation ABC's
weighted average cost of capital (WACC).
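WACC is a weighted mean of the cost of equity and the cost of debt, with each source of capital weighted by its share of total market value. A minimal Python sketch for Corporation ABC, assuming the 8% cost of debt is already an after-tax figure (as the problem states), so no further tax adjustment is applied; the variable names are illustrative:

```python
equity = 600_000        # 6,000 shares at $100 each
debt = 400_000          # 400 bonds at $1,000 each
cost_of_equity = 0.15   # return expected by shareholders
cost_of_debt = 0.08     # after-tax return expected by bondholders

total_value = equity + debt

# Each cost is weighted by its share of the total capital raised.
wacc = (equity / total_value) * cost_of_equity + (debt / total_value) * cost_of_debt

print(f"WACC = {wacc:.1%}")
```

This gives 0.6 × 15% + 0.4 × 8% = 9% + 3.2% = 12.2%.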
Some Basic Statistical Graphs
Bar Graph
Pie Chart
Frequency Distribution
Scatter Diagram
Bar Graph & Pie Chart
Let's say you did a survey of 50 people asking their
preferences for soft drink and you got the following
responses:

| Soft Drink | Frequency |
|---|---|
| Coke | 19 |
| Diet Coke | 8 |
| Dr. Pepper | 5 |
| Pepsi | 13 |
| Sprite | 5 |
| Total | 50 |

Draw a bar graph and pie chart of the above
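Before drawing, it helps to work out what each chart needs: the bar heights are the frequencies themselves, and each pie slice is the drink's share of the 50 responses, scaled to 360 degrees. A small Python sketch using only the standard library (the text bars are a stand-in for a real chart; the variable names are illustrative):

```python
frequency = {
    "Coke": 19,
    "Diet Coke": 8,
    "Dr. Pepper": 5,
    "Pepsi": 13,
    "Sprite": 5,
}
total = sum(frequency.values())

# Share of all responses (in %) and the matching pie-slice angle in degrees.
share = {drink: 100 * count / total for drink, count in frequency.items()}
angle = {drink: 360 * count / total for drink, count in frequency.items()}

for drink, count in frequency.items():
    bar = "#" * count   # one '#' per response, a rough text bar graph
    print(f"{drink:<11}{bar:<20}{share[drink]:5.1f}%  {angle[drink]:6.1f} deg")
```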

Frequency Distribution
Let's say you are a manager at an audit firm. You have 20 clients
and your boss has asked you to give him a summary of the days
remaining for audit for each client. After checking the records,
you have the following data:
AUDIT TIME REMAINING (IN DAYS)
12 14 19 18
15 15 18 17
20 27 22 23
22 21 33 28
14 18 16 13

How will you show him the data?

Ans: Frequency Distribution
The method to draw a frequency distribution graph is:
1. Determine the number of non-overlapping classes
2. Determine the width of each class
3. Determine the class limits
4. Find the frequency of each class
5. Draw the frequency distribution
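The steps above can be sketched in Python for the audit data. The choice of 5 classes of width 5 starting at 10 days is an assumption made for this illustration; the notes leave the number and width of classes to the reader:

```python
audit_days = [
    12, 14, 19, 18,
    15, 15, 18, 17,
    20, 27, 22, 23,
    22, 21, 33, 28,
    14, 18, 16, 13,
]

# Steps 1-3: number of classes, class width and class limits (assumed values).
num_classes, width, lowest_limit = 5, 5, 10
classes = [
    (lowest_limit + k * width, lowest_limit + (k + 1) * width - 1)
    for k in range(num_classes)
]

# Step 4: count how many observations fall inside each class.
frequency = {
    (low, high): sum(1 for d in audit_days if low <= d <= high)
    for low, high in classes
}

# Step 5: show the distribution as a simple text histogram.
for (low, high), count in frequency.items():
    print(f"{low}-{high}: {'#' * count} ({count})")
```

The classes do not overlap and together cover every observation, so the frequencies add up to the 20 clients.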
Exercise
Try this at home:
Question: You are a plant manager at a handloom making firm.
Your production employee has produced the following units
of particular handloom in the last 20 days:
160 170 181 156 176
148 198 179 162 150
162 156 179 178 151
157 154 179 148 156
You want to find out how he has performed in last 20 days.
Draw a frequency distribution graph (Histogram) of this data to
find out his performance.