
Overview of Concepts - Statistics

Statistics is broadly divided into 2 categories ->

Descriptive Statistics
Inferential Statistics

To understand the simple difference between descriptive and inferential statistics, all you need
to remember is that descriptive statistics summarize your current dataset and inferential
statistics aim to draw conclusions about an additional population outside of your dataset.

Descriptive Statistics

It is the form of statistics in which we show, summarize and describe data in a
meaningful way so that patterns might emerge from that data.

Descriptive statistics do not allow us to draw conclusions beyond the data we have
analysed, or to reach any conclusion regarding a hypothesis.

This is simply a way to describe our data.

This means describing a big chunk of data with summary charts and tables.

Descriptive statistics enable us to present the data in a more meaningful way, which allows
simpler interpretation of the data. They are useful for summarizing a group of data using a
combination of tabulated description (tables), graphical description (graphs and charts) and
statistical commentary (discussion of results).

This is used to describe an immediate group of data. -> A group of data that
includes all the data we are interested in is called a population. A population can be as
large or as small as we like, as long as it includes all the data we are interested in.
Descriptive statistics are applied to populations, and the properties of a population, like the mean
or standard deviation, are called parameters, as they represent the whole population we
are interested in.

Descriptive statistics usually involve measures of central tendency (mean, median, mode) and
measures of dispersion (variance, standard deviation)
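
As a quick illustration, here is a minimal sketch of these descriptive measures using Python's
standard statistics module; the data list below is made up purely for the example.

```python
# A minimal sketch of descriptive statistics with Python's standard library.
# The example data is invented for illustration.
import statistics

data = [4, 8, 6, 5, 3, 8, 9, 5, 8]

print("mean:", statistics.mean(data))           # central tendency
print("median:", statistics.median(data))
print("mode:", statistics.mode(data))
print("variance:", statistics.pvariance(data))  # population variance (divide by N)
print("std dev:", statistics.pstdev(data))      # population standard deviation
```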

Inferential Statistics

In inferential statistics we test hypotheses and draw conclusions about a population from samples.


Unlike descriptive stats, where we are only concerned with the immediate data, here we may not
have access to the whole population's data but rather only a limited amount of data.
So we measure a sample of data, which is then used to make statements about the entire population.

Properties of a sample, such as its standard deviation, mean etc., are not called parameters but
statistics.

Inferential statistics allows us to use these samples to make generalizations about the
populations from which samples were drawn.

It is therefore important that samples accurately represent the population.

This process is called sampling.

Methods of inferential statistics are ->


- Estimation of parameters
    - We estimate population values based on sample data
    - Using techniques such as confidence intervals, we can provide a
range of values within which the parameter is likely to lie
- Testing of statistical hypotheses
    - Drawing conclusions about a population parameter
    - Tests used in hypothesis testing - t-tests, chi-square etc.
    - Tests are used to determine whether hypotheses about a parameter
are true or not, helping us reach a conclusion about the population based on
sample data

Terms and symbols used in statistics ->

Sample              Population        Description

statistic           parameter

n                   N                 number of members of sample or population

x-bar               mu                mean

M or Med            (none)            median

s (TIs say Sx)      sigma             standard deviation (for variance, apply a squared
                                      symbol: s^2 or sigma^2)

r                   rho               coefficient of linear correlation

p-hat               p                 proportion

z, t                (n/a)             calculated test statistic


Intro to research methods

There are various research methods for measuring various things.

Measuring Constructs

Constructs -> something that is difficult to measure directly


E.g. -> memory, effort, intelligence

Based on the data we have, we can draw some conclusions about the constructs we are measuring.

After doing some study, we can arrive at an operational definition.

Once we have an operational definition, we can measure constructs in real life.

Data is the most important part of statistics; without data we cannot form any operational
definitions or draw any conclusions

Randomness

Randomness is chance
When taking a sample at random, each member has an equal chance of being selected

When taking a sample at random, it has a better chance of being a true representative of the
population

When we visualize a sample of data ->

On the X-axis - independent variable - predictor variable

Y- Axis - Dependent variable - outcome

Lurking variables -> Variables other than the variables being considered in the visualization

Lurking variables can play an important part in influencing the outcome.


For example, in this representation of hours slept and temporal memory,

we can see a relationship between hours slept and temporal memory, but we cannot
necessarily be confident that sleep causes better memory

If we want to show relationships, i.e. -> see patterns and plot them on a scatter plot, => we can
do observational studies and surveys

But

If we want to show causation, eg -> that more sleep leads to better memory => we need to do a
controlled experiment.

Causation means that one particular variable causes a change in another.

Causation always requires a controlled experiment

Benefits of Surveys

- Easy way to get info on a population


- Relatively inexpensive
- Conducted remotely
- Anyone can access and analyse survey results

Downsides of surveys

- Untruthful responses
- Biased responses
- Respondents not understanding the questions
- Respondents refusing to answer

A controlled experiment is another method for doing research


When doing an experiment, we can control multiple factors so that we can arrive at
interpretable results

Randomization makes the 2 groups in an experiment (e.g. a placebo and a non-placebo group)
similar, so that both groups have similar base parameters (in the case of people, the same
number of m/f, similar ages, sleep patterns etc.)

Randomization works best with larger groups.

Visualizing Data

Frequency table
A frequency table shows how many times each value of a variable occurs in a data set; this makes a data set easier to analyse

Relative frequency
Relative frequency is the occurrence of a particular value relative to the total -
the proportion of each value

Range of proportions

All proportions are greater than or equal to 0 and less than or equal to 1

Sum of relative frequencies

For any frequency table, the relative frequencies should always add to 1

Showing relative frequency with percentages

Multiply proportions by 100

Percentages range from 0 to 100 %
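
A small sketch of building a frequency table and relative frequencies in Python; the responses
list is invented for illustration.

```python
# Sketch of a frequency table, relative frequencies and percentages.
from collections import Counter

responses = ["dog", "cat", "dog", "fish", "dog", "cat", "cat", "dog"]

freq = Counter(responses)                           # absolute frequency of each value
total = sum(freq.values())
rel_freq = {k: v / total for k, v in freq.items()}  # proportions

for value, count in freq.items():
    print(f"{value}: count={count}, proportion={rel_freq[value]:.2f}")

# Proportions lie between 0 and 1 and sum to 1; multiply by 100 for percentages.
print("sum of proportions:", sum(rel_freq.values()))
```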


Similarly, in the case of a sample of student ages, whenever we have a large amount of data
we can't just group the data with 1 row for each value.

Here we start grouping the data into bins

0-19, 20-39, 40-59, 60-79


^^ These are the bins of data
Visualizing this data in a histogram

Frequency is always on the y-axis and the variable is on the x-axis

To draw histograms ->

http://www.shodor.org/interactivate/activities/Histogram/

You need to have an ideal bin size to visualize the data correctly; it cannot be too small or too large,
so that we can clearly see the distribution
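
A minimal sketch of binning ages into the bins mentioned above and drawing a histogram,
assuming numpy and matplotlib are available; the ages list is made up for illustration.

```python
# Sketch of binning data and plotting a histogram.
import numpy as np
import matplotlib.pyplot as plt

ages = [18, 22, 25, 31, 19, 45, 52, 38, 61, 70, 23, 29, 35, 41, 58]
bins = [0, 20, 40, 60, 80]          # bin edges for 0-19, 20-39, 40-59, 60-79

counts, edges = np.histogram(ages, bins=bins)
print("counts per bin:", counts)

plt.hist(ages, bins=bins, edgecolor="black")
plt.xlabel("age (the variable, on the x-axis)")
plt.ylabel("frequency (always on the y-axis)")
plt.show()
```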

Plotting student - continent data in a bar graph


The basic difference between a histogram and a bar graph is that in a bar graph each of the points on the x-
axis is a completely different category.

And we can't mix them

You can change bin size in histogram


However, in a bar graph each bar is a separate category, so bin size does not apply

The order of bars in a bar graph is not meaningful


While in a histogram the order is fixed by the numeric values

Shape of histogram is very important.


Whereas the shape of a bar graph is arbitrary,
depending upon how we choose to order the categories

Interpreting histograms ->


^^ This data is what we call normally distributed
That is, if we fold it in the middle it will be somewhat symmetrical.

Normally distributed data has 1 peak in the middle, called the mode.


^^shape of histogram in normal distribution

Skewed distributions ->

^^ This is a positively skewed distribution


^^ This is a negatively skewed distribution

Summary

Central Tendency

We want to quickly summarize all of our data with 1 number.

Measures of central tendency are ->

Mode, Median, Mean.


Mode -> The number that occurs the most in a dataset or in a distribution

In a Uniform distribution ->

^^There is no mode

In the following distribution ->

There are 2 modes


The mode can also be used for categorical data

Here mode will be plain

There is no equation for mode

If we take a lot of samples from the same population, the mode can differ between samples, since
samples are drawn at random.

Not all scores in a data set affect the mode.

Mean ->

mean = sum(dataset) / count of data set


All values in the distribution affect the mean

Mean can be described as a formula

Many samples from the same population will have similar means

Mean of sample can be used to make inference about the population it came from.

Outlier ->

An outlier is a data point that is extremely different (either much higher or much lower) from the
other points in the data set.

Mode is not influenced by an outlier, but the mean is influenced a lot by the outlier.

Median ->

Middle of the ordered data.

Median splits data in the half

The median, even for data that has an outlier, can still be used to accurately describe the data
^Median with outlier.

Median is more robust than Mean

The median is the best measure of central tendency when dealing with a highly skewed distribution

In a positively skewed distribution ->

mode<median<mean
In a symmetric (non-skewed) distribution ->

Mean = median = mode

In Negatively skewed distribution

mean < median < mode
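
A quick sketch (with invented numbers) of how an outlier pulls the mean much more than the
median or the mode.

```python
# Outlier sensitivity of mean vs median vs mode; salary values are made up.
import statistics

salaries = [40, 42, 45, 45, 48, 50]        # in thousands, invented
with_outlier = salaries + [500]            # one extreme salary added

for label, data in [("without outlier", salaries), ("with outlier", with_outlier)]:
    print(label,
          "mean:", round(statistics.mean(data), 1),
          "median:", statistics.median(data),
          "mode:", statistics.mode(data))
```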

Variability

In some data sets the mean, median and mode are the same, but there is still a difference between the
data sets
E.g. ->
^^ Here what we can observe is that the range is different

The range gives a picture of how spread out the data is.

However, the range is also affected by outliers.


For example, if someone in the data set has an extreme salary then that will change the outcome
of the range calculation

^^ To take care of this,

we can cut off the tails

Statisticians cut off the upper 25% and bottom 25%

So all we care about are the values in the middle.

We call these cut points quartiles:
Q1, Q2, Q3

The first quartile (Q1) is the point below which 25% of the distribution lies
and above which 75% lies.

Q3 - Q1 is the new range after we cut off the tails

^^IQR (Interquartile range)

Defining outliers

A value is considered to be an outlier if ->

value < Q1 - 1.5 (IQR)

OR

value > Q3 + 1.5 (IQR)


We visualize quartiles and outliers using box plots.

^^ Box plot showing min, Q1, Q2, Q3, max and outliers
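
A sketch of computing the quartiles, IQR and the 1.5 * IQR outlier fences with numpy; the data
values here are made up for illustration.

```python
# Quartiles, IQR and outlier fences (1.5 * IQR rule).
import numpy as np

data = np.array([12, 15, 17, 18, 19, 21, 22, 23, 25, 80])   # 80 is an extreme value

q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print("Q1:", q1, "Q2 (median):", q2, "Q3:", q3, "IQR:", iqr)
outliers = data[(data < lower_fence) | (data > upper_fence)]
print("outliers:", outliers)

# A box plot visualizes min, Q1, Q2, Q3, max and flags the outliers:
# import matplotlib.pyplot as plt; plt.boxplot(data); plt.show()
```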

The mean can be outside the IQR,


but generally it is within the IQR

The IQR doesn't tell us as much as we would like to know about the data set.


To measure true variability we can

find the average distance (deviation) between each data value and the mean
- Calculate the mean
- Find each value's deviation from the mean
- Find the average deviation

Average Deviation

^^ This can't be a measure of variability, as here the negatives will cancel the positives and we might end
up with 0

Average of squared deviations (variance) -> the mean of the squared deviations


Or
the sum of squared deviations divided by n

The variance is in squared units


We need to take a square root
The square root of the variance is called the standard deviation

Standard Deviation (SD) is most common measure of spread

Denoted by a small sigma

Each data point can be described in terms of how many standard deviations it lies from the mean

In general, samples underestimate the amount of variability in a population,

because samples tend to be values from the middle of the population

To correct this
we have Bessel's correction

n-1 is only used when we are using a sample's standard deviation to approximate that of a larger population
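
A minimal sketch contrasting the population formula (divide by n) with Bessel's correction
(divide by n-1); the sample values are invented.

```python
# Variance and standard deviation, with and without Bessel's correction.
import math

sample = [10, 12, 13, 15, 20]               # made-up sample
n = len(sample)
mean = sum(sample) / n

sq_devs = [(x - mean) ** 2 for x in sample]
var_pop = sum(sq_devs) / n                  # population variance (divide by n)
var_sample = sum(sq_devs) / (n - 1)         # Bessel-corrected sample variance

print("population SD:", math.sqrt(var_pop))
print("sample SD (n-1):", math.sqrt(var_sample))
```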
Standardizing

If we know a single point in a data set


we cannot gather much from it
until we know the shape of the distribution

We need to know how many data points are less/greater than that data point to judge its
significance.

This is why we should use relative frequencies and convert all absolute frequencies to
proportions.

Coming back to histograms

We need to have small bin sizes to get the most accurate picture


But too many bins may result in noisy, irrelevant detail
Hence we fit a continuous curve over the histogram
to find the value at any point.

Area under histogram graph is 1 or 100%


We can get pretty good approximations from these distributions

The locations on the x-axis are usually described in terms of standard deviations

We can calculate z

Z = number of standard deviation away from the mean.


By calculating z we can see approximately what percentage of the distribution a given value
is greater than

The number of standard deviations away from the mean is the way to look for patterns

Standardizing the distributions

When we have multiple distributions and we need to draw conclusions about both of them,

the only common reference is the standard deviation; we can combine these 2 using 0 as a reference
point
^^standardizing

When we standardize any score on x-axis, we call it z-score

Z - score is basically the number of standard deviations any value is from the mean

The mean of a standardised distribution will be 0


and the standard deviation after standardizing will be 1

Here the mean is 0 and the standard deviation marks are at 1 and -1


Generally, most of the data lies in the range from -2 SD to +2 SD
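
A small sketch of standardizing a made-up data set into z-scores, confirming the standardized
mean is 0 and the standard deviation is 1.

```python
# Standardizing: subtract the mean, divide by the SD, giving z-scores.
import numpy as np

data = np.array([55.0, 60.0, 65.0, 70.0, 75.0, 80.0, 85.0])
z_scores = (data - data.mean()) / data.std()

print("z-scores:", np.round(z_scores, 2))
print("mean of z-scores:", round(z_scores.mean(), 10))   # ~0
print("SD of z-scores:", round(z_scores.std(), 10))      # 1
```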

Normal Distributions
PDF -> Probability density function

The area under this curve is always 1


As when we are modelling the data,
The absolute frequency converts into relative frequency

Equation for pdf

http://mathworld.wolfram.com/ProbabilityDensityFunction.html
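
For reference, the equation at the link above is the normal probability density function, with
mean mu and standard deviation sigma:

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}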
The curves never touch the x-axis; they extend like this to -infinity and +infinity.
The area under the curve from -infinity up to a certain point is the probability of observing a value at or below that point

Area under the relative distribution is always 1


And for normal distributions, approx 68% is within 1 sd (-1 to +1)
And 95% is within 2sd (-2 to +2)

Now if we want to calculate probabilities for arbitrary values such as 240,

We are gonna use z-table.

https://drive.google.com/file/d/0B58c_aSA_l2ueTgxUXRZaFIxLUE/view?usp=sharing
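
Instead of reading the z-table, here is a sketch using scipy.stats.norm; the mean and SD below
are hypothetical stand-ins for whatever population the value 240 belongs to.

```python
# z-score plus cumulative probability, as an alternative to the z-table.
from scipy.stats import norm

mu, sigma = 200.0, 25.0    # assumed population parameters (illustrative only)
x = 240.0

z = (x - mu) / sigma       # number of SDs away from the mean
p_below = norm.cdf(z)      # area under the curve to the left of x

print("z:", z)
print("proportion below x:", round(p_below, 4))
```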

Sample Distributions

If we want to compare 2 sample distributions we can compare their individual means and arrive
at results

In a population, the mean can be referred to as the expected value

Mean of the sample means = M = the population mean

^^ this is called the central limit theorem (the distribution of sample means is approximately
normal, with mean equal to the population mean)

Standard error = std deviation/sqrt(sample size)
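
A small simulation sketch of these ideas, assuming numpy; the population is generated
artificially (and deliberately non-normal) just for illustration.

```python
# The mean of many sample means is close to the population mean, and their
# spread is close to sigma/sqrt(n), the standard error.
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=10.0, size=100_000)   # deliberately non-normal

n = 30
sample_means = [rng.choice(population, size=n).mean() for _ in range(2000)]

print("population mean:", round(population.mean(), 2))
print("mean of sample means:", round(np.mean(sample_means), 2))
print("theoretical SE:", round(population.std() / np.sqrt(n), 3))
print("observed SD of sample means:", round(np.std(sample_means), 3))
```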


Estimation

As we saw, approximately 95% of data points lie within +-2 sigma

Similarly, when we are looking at the distribution of the sample means,

^ this (about 2 standard errors on either side of the mean) is called the margin of error

Interval estimate

Where x-bar-t is the mean after some intervention

The greater our sample size, the smaller the confidence interval
and the better we can estimate the population parameter.
The value 1.96 is the critical value of z for 95% confidence

This is found from the z-table based on the percentages
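
A minimal sketch of the 95% interval estimate described above; every number here (sample
mean, sigma, n) is hypothetical.

```python
# 95% confidence interval for a mean, assuming the population SD is known.
import math

sample_mean = 172.0     # x-bar after some intervention (hypothetical)
sigma = 8.0             # assumed population SD
n = 36
z_critical = 1.96       # critical value of z for 95% confidence

se = sigma / math.sqrt(n)             # standard error
margin_of_error = z_critical * se

ci = (sample_mean - margin_of_error, sample_mean + margin_of_error)
print("95% CI:", ci)
```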

Hypothesis Testing

In hypothesis testing we check whether something is likely or unlikely.

Alpha levels -> used to decide whether something is unlikely.

The critical region is the part of the distribution whose probability equals the alpha level

The z value at the demarcation of the critical region is called z* or the z-critical value


But here we are considering that unlikely values can only exist on one side of the distribution
But if we analyse further we will see that unlikely values can exist on either side of the distribution

^ this is called a two-tailed test: if we split the alpha level across both tails, we get critical values of the
same magnitude, positive and negative
Alpha = 0.05 -> 2 outcomes, that is, whether the sample mean will lie inside or outside the critical
region; this gives the
null hypothesis (sample mean lies outside the critical region, in the white space ^)

Null hypothesis -> assumes there is no significant difference between the current population
parameters and the new population parameters after the intervention

Alternative hypothesis -> ( sample mean can lie in critical region)


There will be a difference
> or < or !=

We can't prove that the null hypothesis is true; we can only obtain evidence to reject the null
hypothesis
Our decision is either to reject the null hypothesis or to fail to reject the null hypothesis

We generally choose alpha level as .05

We will find the z critical values


And then we'll find the z-score of the sample mean and see if it falls in the critical region

Based on this we will decide whether to reject or fail to reject the null hypothesis

Rejecting the null hypothesis ->


- Sample mean falls within the critical region
- Z-score of our sample mean is greater than the z-critical value
- Probability of obtaining the sample mean is less than the alpha level
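
A sketch of this decision rule as code; the population parameters, sample mean and sample
size below are all hypothetical.

```python
# Compare the z-score of the sample mean with the two-tailed z-critical value.
import math

mu, sigma = 100.0, 15.0    # assumed population parameters
sample_mean, n = 106.0, 40
z_critical = 1.96          # two-tailed critical value for alpha = 0.05

se = sigma / math.sqrt(n)
z = (sample_mean - mu) / se

if abs(z) > z_critical:
    print(f"z = {z:.2f} falls in the critical region -> reject the null hypothesis")
else:
    print(f"z = {z:.2f} is not in the critical region -> fail to reject the null")
```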
T - Distributions

Z-tests work when we know the values of the population SD and mean (sigma and mu)

In a sample, the SD uses Bessel's correction, with n-1 in the denominator

In a sampling distribution, the SD is sigma/sqrt(n)

The t-statistic depends on the sample.
Now we have a new distribution called the t-distribution;
it is more prone to error than the normal distribution

T-distributions are defined by their degrees of freedom


Degrees of freedom -> after how many choices the rest of the choices are forced
For n numbers, n-1 is the degrees of freedom

For an n x n table
the degrees of freedom are (n-1) x (n-1)

Degrees of freedom are the number of pieces of information that can be freely varied without
violating any given restrictions.

We use independent pieces of information to estimate another piece of information.


Only n-1 values are independent

As the degrees of freedom increase, the t-distribution better approximates the normal
distribution.

In case of normal distribution, we use z- table


In case of t distribution, we use t-table

T- table :

https://s3.amazonaws.com/udacity-hosted-downloads/t-table.jpg

Reading t-table (almost same as z-table)


Here we find the t-statistic and check whether it is greater than or less than the t-critical value

Similar to the z-score:
if the t-statistic is far from the mean, we reject the null

(s is the sample standard deviation)

After getting the t-value


we can find the probability of getting this t-value
^^ this is called the p-value
^ the probability of getting this t-statistic

If the probability is very small,


then we can reject the null hypothesis,
as there might be something else going on

Cohen's d:
A standardized mean difference that measures the distance between means in standardized units.
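
A sketch of a one-sample t-test and Cohen's d using scipy; the sample values and the
hypothesized population mean are invented for illustration.

```python
# One-sample t-test (t-statistic and two-tailed p-value) plus Cohen's d.
import numpy as np
from scipy import stats

sample = np.array([23.0, 25.5, 21.0, 27.0, 24.5, 26.0, 22.5, 28.0])
mu_null = 22.0   # hypothesized population mean (assumption for this example)

t_stat, p_value = stats.ttest_1samp(sample, popmean=mu_null)
print("t:", round(t_stat, 3), "p-value:", round(p_value, 4))

# Cohen's d: the difference between means in standard-deviation units.
s = sample.std(ddof=1)                     # Bessel-corrected sample SD
cohens_d = (sample.mean() - mu_null) / s
print("Cohen's d:", round(cohens_d, 3))
```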