
# Overview of Concepts - Statistics

## Statistics is broadly divided into two categories ->

- Descriptive Statistics
- Inferential Statistics

To understand the simple difference between descriptive and inferential statistics, all you need
to remember is that descriptive statistics summarize your current dataset, while inferential
statistics use that dataset to draw conclusions about a larger population.

Descriptive Statistics

Descriptive statistics show, summarize and describe data in a meaningful way, such that
patterns might emerge from the data.

Descriptive statistics do not allow us to draw conclusions beyond the data we have
analysed, or to reach any conclusion regarding a hypothesis.

Descriptive statistics describe a big chunk of data in summary charts and tables.

Descriptive statistics enable us to present data in a more meaningful way, which allows
simpler interpretation. It is useful to summarize a group of data using a combination of
tabulated description (tables), graphical description (graphs and charts) and statistical
commentary (discussion of results).

This is used to provide statistics about an immediate group of data. -> A group that
includes all the data we are interested in is called a population. A population can be as
large or as small as needed, as long as it includes all the data we are interested in.
Descriptive statistics are applied to populations, and the properties of a population, like the mean
or standard deviation, are called parameters, as they represent the whole population we
are interested in.

Descriptive statistics usually involve measures of central tendency (mean, median, mode) and
measures of dispersion (variance, standard deviation).
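As a quick sketch, Python's standard `statistics` module can compute all of these summary measures; the dataset below is made up purely for illustration:

```python
import statistics

# Hypothetical dataset of exam scores (made up for illustration)
scores = [70, 75, 75, 80, 85, 90, 95]

# Measures of central tendency
mean = statistics.mean(scores)      # sum of values / count
median = statistics.median(scores)  # middle value when sorted
mode = statistics.mode(scores)      # most frequent value

# Measures of dispersion (treating the data as a whole population)
variance = statistics.pvariance(scores)
std_dev = statistics.pstdev(scores)

print(mean, median, mode)
print(variance, std_dev)
```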

Inferential Statistics

In inferential statistics we test hypotheses and draw conclusions about a population from samples.

Unlike in descriptive statistics, where we are concerned only with the data we have, here we
often do not have access to the whole population, but only to a limited amount of data.
So we measure a sample and use it to draw conclusions about the entire population.

Properties of a sample, such as the standard deviation or mean, are not called parameters but
statistics.

Inferential statistics allows us to use these samples to make generalizations about the
populations from which samples were drawn.

## Methods of inferential statistics are ->

- Estimation of parameters
    - We estimate population values based on sample data
    - Using techniques such as confidence intervals, we can provide a range of values likely to contain the parameter
- Testing of statistical hypotheses
    - Drawing conclusions about a population parameter
    - Tests used in hypothesis testing: t-tests, chi-square, etc.
    - Tests are used to decide whether a hypothesis about a parameter holds, helping us reach a conclusion about the population based on sample data
Terms and symbols used in statistics ->

| Sample (statistic) | Population (parameter) | Description |
|---|---|---|
| n | N | number of members of sample or population |
| x̄ (x-bar) | μ (mu) | mean |
| r | ρ (rho) | coefficient of linear correlation |
| p̂ (p-hat) | p | proportion |
| z, t | (n/a) | calculated test statistic |

Intro to research methods

There are various research methods for measuring different things.

Measuring Constructs

Constructs -> things that are difficult to measure directly

E.g. -> memory, effort, intelligence

Based on data we can draw conclusions about constructs.

Once we have an operational definition, we can measure constructs in real life.

Data is the most important part of statistics; without data we cannot form any operational
definitions or draw any conclusions.

Randomness

Randomness is chance.
When taking a sample at random, each member of the population has an equal chance of being selected.

A sample taken at random has a better chance of being a true representative of the
population.

## When we visualize a sample of data ->

On the
X-axis -> independent variable (predictor variable)
Y-axis -> dependent variable (outcome)

Lurking variables -> variables other than the ones being considered in the visualization

Lurking variables can play an important part in influencing the outcome.

For example, in a plot of hours slept against temporal memory, we can see a relationship
between the two, but we cannot necessarily be confident that sleep causes better memory.

If we want to show relationships, i.e. -> see patterns and plot them on a scatter plot, => we can
do observational studies and surveys.

But

If we want to show causation, eg -> that more sleep leads to better memory => we need to do a
controlled experiment.

## Causation always requires a controlled experiment

Benefits of Surveys

- Easy way to get info on a population

- Relatively inexpensive
- Conducted remotely
- Anyone can access and analyse survey results

Downsides of surveys

- Untruthful responses
- Biased responses
- Respondents not understanding the questions

A controlled experiment is another method for doing research.

When doing an experiment, we can control multiple factors so that we can arrive at
interpretable results.

Randomization makes the two groups in an experiment (e.g. the placebo and non-placebo groups)
similar, so that both groups have similar base parameters (in the case of people: the same
proportion of m/f, similar ages, sleep patterns, etc.).

## Randomization works best with larger groups.

Visualizing Data

Frequency table
How many times each value occurs in a data set; this makes the data set easier to analyse.

Relative frequency
The relative frequency is the occurrence of a particular value relative to the total,
i.e. the proportion of each value.

Range of proportions

All proportions are greater than or equal to 0 and less than or equal to 1.

## Sum of relative frequencies

For any frequency table, the relative frequencies should always add to 1

## Percentages range from 0 to 100 %
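A minimal sketch of building a frequency table and its relative frequencies with Python's `collections.Counter`; the survey responses are made up for illustration:

```python
from collections import Counter

# Hypothetical survey responses (made up for illustration)
responses = ["yes", "no", "yes", "yes", "no", "maybe"]

freq = Counter(responses)                       # absolute frequencies
total = sum(freq.values())
rel = {k: v / total for k, v in freq.items()}   # relative frequencies (proportions)

print(rel)                # each proportion is between 0 and 1
print(sum(rel.values()))  # proportions add to 1 (up to float rounding)
```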

Similarly, in the case of a sample of student ages, whenever we have a large amount of data
we can't just group the data with one row for each value. Instead we group the values into bins:

0-19, 20-39, 40-59, 60-79

We can then visualize this binned data in a histogram.

## To draw histograms ->

http://www.shodor.org/interactivate/activities/Histogram/

You need an ideal bin size to visualize the data correctly; it cannot be too small or too large,
so that we can clearly see the distribution.

## Plotting student - continent data in a bar graph

The basic difference between a histogram and a bar graph is that in a bar graph each of the
points on the x-axis is a completely different category.

You can change the bin size in a histogram; in a bar graph each bar is categorically
distinct, so there is no bin size to change.

The order of bars in a bar graph is not meaningful, while in a histogram the order is always
fixed by the numeric axis.

The shape of a histogram is very important, whereas the shape of a bar graph is arbitrary,
depending on how we choose to order the categories.

## Interpreting histograms ->

A distribution with this bell shape is what we call normally distributed: if we fold it in the
middle, it will be somewhat symmetrical.

Normally distributed data has one peak in the middle, called the mode.

A distribution whose tail stretches to the right is positively skewed; one whose tail
stretches to the left is negatively skewed.

Summary

Central Tendency

## Mode, Median, Mean.

Mode -> the value that occurs the most in a dataset or distribution.

In a uniform distribution there is no mode; in a distribution with two peaks there are two
modes (it is bimodal).

The mode can be used with categorical data, and there is no equation for the mode.

If we take many samples from the same population, the mode can differ between samples,
since samples are drawn at random.

Mean ->

Mean = sum of all values / count of values

All values in the distribution affect the mean.

Many samples from the same population will have similar means.

The mean of a sample can be used to make inferences about the population it came from.

Outlier ->

An outlier is a data point that is extremely different (either higher or lower) from the other
points in the data set.

The mode is not influenced by an outlier, but the mean is influenced a lot by it.

Median ->

The median splits the data in half.

Even with an outlier in the data, the median can still be used to accurately describe the data.

## In a positively skewed distribution ->

mode < median < mean

In a negatively skewed distribution ->

mean < median < mode

Variability

In some data sets the mean, median and mode are the same, yet the data sets still differ.
E.g. -> here what we can observe is that the range is different.

However, the range is also affected by outliers: for example, if someone in the data set has
an extreme salary, that will change the outcome of the range calculation.

So instead we focus on the values in the middle.
We call these quartiles: Q1, Q2, Q3.

The first quartile (Q1) is the point below which 25% of the distribution lies,
and above which 75% lies.

IQR (interquartile range) = Q3 - Q1

Defining outliers:

Outlier < Q1 - 1.5 (IQR)
OR
Outlier > Q3 + 1.5 (IQR)

We visualize quartiles and outliers using box plots.
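The 1.5 × IQR rule can be sketched with `statistics.quantiles`; the data values below are made up, and note that quartile conventions vary slightly between tools:

```python
import statistics

# Hypothetical data set with one extreme value (made up for illustration)
data = [2, 4, 4, 5, 5, 6, 7, 9, 50]

q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles Q1, Q2 (median), Q3
iqr = q3 - q1                                 # interquartile range

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q3, iqr)
print(outliers)   # only the extreme value falls outside the fences
```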

The mean can be outside the IQR, but generally it is within it.

The IQR doesn't tell us as much as we would like to know about the data set.


## To measure true variability we can

Find the average distance (deviation) between each data value and the mean:
- Calculate the mean
- Find each value's deviation from the mean
- Find the average deviation

Average Deviation

The plain average deviation can't be a measure of variability, because the negative deviations
cancel the positive ones and we might end up with 0.

The average of the squared deviations is called the variance:

Variance = sum of squared deviations / n

Variance is in squared units, so we take the square root; the square root of the variance is
called the standard deviation.

Because samples tend to contain values from the middle of the population, the sample standard
deviation underestimates that of the population.

To correct this we have Bessel's correction: divide by n - 1 instead of n.

n - 1 is only used when we are using the sample standard deviation to approximate that of a
larger population.
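A small sketch contrasting division by n with Bessel's correction (division by n - 1); the sample values are made up:

```python
import statistics

sample = [4, 8, 6, 5, 3]   # hypothetical sample from a larger population

n = len(sample)
mean = sum(sample) / n
squared_deviations = [(x - mean) ** 2 for x in sample]

population_variance = sum(squared_deviations) / n    # describes this data only
sample_variance = sum(squared_deviations) / (n - 1)  # Bessel's correction

# The standard library makes the same distinction:
print(population_variance, statistics.pvariance(sample))  # divide by n
print(sample_variance, statistics.variance(sample))       # divide by n - 1
```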
Standardizing

If we know a single point in a data set, we cannot gather much from it until we know the
shape of the distribution.

We need to know how many data points are less than or greater than that point to judge its
significance.

This is why we should use relative frequencies and convert all absolute frequencies to
proportions.

We need small bin sizes to get the most accurate picture, but too many bins may leave the
shape hard to read. Hence we fit a continuous curve over the histogram to find the values at
any point.

The area under the histogram is 1, or 100%.

We can get pretty good approximations from this distribution.

Locations on the x-axis are usually described in terms of standard deviations.

We can calculate z:

z = number of standard deviations away from the mean

By calculating z we can see approximately what percentage of the distribution a value is
greater than.

The number of standard deviations away from the mean is the way to look for patterns.

## Standardizing distributions

When we have multiple distributions and need to draw conclusions across them, the only
common scale is the standard deviation, so we can put them on one scale using 0 as a
reference point. This is standardizing.

When we standardize a score on the x-axis, we call it a z-score: the number of standard
deviations a value is from the mean.

The mean of a standardized distribution is 0, and its standard deviation is 1.

Most of the data generally lies within -2 sd to +2 sd of the mean.
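The z-score computation itself is one line; a sketch using a made-up distribution with mean 100 and sd 15:

```python
def z_score(x, mean, sd):
    """Number of standard deviations x lies from the mean."""
    return (x - mean) / sd

# Hypothetical score distribution: mean 100, standard deviation 15
print(z_score(130, 100, 15))  # 2.0  -> two sd above the mean
print(z_score(85, 100, 15))   # -1.0 -> one sd below the mean
```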

Normal Distributions
PDF -> Probability density function

The area under this curve is always 1, since when we model the data, absolute frequencies
convert into relative frequencies.

## Equation for pdf

http://mathworld.wolfram.com/ProbabilityDensityFunction.html
The curves never touch the x-axis; they extend to -infinity and +infinity.
The area under the curve from -infinity up to a certain point is the probability of observing a
value at or below that point.

## Area under the relative distribution is always 1

And for normal distributions, approx 68% is within 1 sd (-1 to +1)
And 95% is within 2sd (-2 to +2)

For these probabilities we use the z-table.
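These percentages can be checked numerically from the standard normal CDF, which Python's `math.erf` gives us without a z-table (a sketch, not part of the original notes):

```python
import math

def phi(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

within_1_sd = phi(1) - phi(-1)
within_2_sd = phi(2) - phi(-2)

print(round(within_1_sd, 4))  # 0.6827 -> about 68% within 1 sd
print(round(within_2_sd, 4))  # 0.9545 -> about 95% within 2 sd
```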

Sample Distributions

If we want to compare two sample distributions, we can compare their individual means and
arrive at results.

Estimation

The term added to and subtracted from the sample mean is called the margin of error.

Interval estimate

where x̄ is the mean after some intervention.

The greater our sample size, the smaller the confidence interval,
and the better we can estimate the population parameter.
The value 1.96 is the critical value of z for 95% confidence; it is found in the z-table
based on the percentages.
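Putting the pieces together, a sketch of a 95% confidence interval for a mean; all the numbers here are made up:

```python
import math

# Hypothetical sample: mean 25.0, known population sd 4.0, sample size 64
x_bar, sigma, n = 25.0, 4.0, 64
z_star = 1.96   # z critical value for 95% confidence, from the z-table

margin_of_error = z_star * sigma / math.sqrt(n)
ci = (x_bar - margin_of_error, x_bar + margin_of_error)

print(margin_of_error)  # 0.98
print(ci)               # roughly (24.02, 25.98)
```

Note how the margin shrinks as n grows: quadrupling the sample size halves the interval width.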

Hypothesis Testing

The z value marking the boundary of the critical region is called z* or the z critical value.

But here we are assuming that unlikely values can exist only on one side of the distribution;
if we analyse further, we will see that unlikely values can exist on either side.

This is called a two-tailed test: when we split the alpha level across the tails, we get
matching critical values on the positive and negative sides.
With alpha = 0.05 there are 2 outcomes: the sample mean lies inside or outside the critical
region.

Null hypothesis -> the sample mean lies outside the critical region (in the white space);
it assumes there is no significant difference between the current population parameters and
the population parameters after the intervention.

Alternative hypothesis -> the sample mean can lie in the critical region; there will be a
difference (>, <, or !=).

We can't prove that the null hypothesis is true; we can only obtain evidence to reject it.
Our decision is either to reject the null hypothesis or to fail to reject it.

We will find the z critical values, then find the z-score of the sample mean and see whether
it falls in the critical region.

Based on this we will decide whether to reject or fail to reject the null hypothesis.

## Rejecting the null hypothesis ->

- Sample mean falls within the critical region
- Z-score of our sample mean is greater than the z-critical value
- Probability of obtaining the sample mean is less than the alpha level
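The rejection logic above can be sketched in a few lines; the sample numbers and the one-tailed critical value 1.645 are illustrative assumptions, not from the notes:

```python
import math

# Known population parameters (hypothetical)
mu, sigma = 100.0, 15.0
# Hypothetical sample mean after an intervention
x_bar, n = 107.0, 36
z_critical = 1.645   # one-tailed z* at alpha = 0.05, from the z-table

standard_error = sigma / math.sqrt(n)   # sd of the sampling distribution
z = (x_bar - mu) / standard_error       # z-score of the sample mean

reject_null = z > z_critical            # does it fall in the critical region?
print(z, reject_null)                   # 2.8 True -> reject the null
```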
T - Distributions

Z-tests work when we know the population standard deviation and mean (sigma and mu).

In a sampling distribution, the sd (the standard error) is sigma / √n.

The t-statistic depends only on the sample. When sigma is unknown, we use a new
distribution called the t-distribution; it is more prone to error.

T-distributions are defined by their degrees of freedom.

Degrees of freedom -> after how many free choices the remaining choices are forced.
For n numbers with a fixed mean, the degrees of freedom are n - 1.

For an n x n table, the degrees of freedom are (n - 1) x (n - 1).

Degrees of freedom are the number of pieces of information that can be freely varied without
violating any given restrictions.

They are the independent pieces of information used to estimate another piece of information:
only n - 1 values are independent.
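A tiny sketch of why only n - 1 deviations are free: deviations from the mean must sum to 0, so the last one is forced by the others (data made up):

```python
data = [3, 7, 8, 10, 12]   # hypothetical sample
mean = sum(data) / len(data)
deviations = [x - mean for x in data]

# The deviations always sum to 0 (up to float rounding) ...
print(sum(deviations))

# ... so the last deviation is fully determined by the first n - 1:
forced = -sum(deviations[:-1])
print(forced, deviations[-1])   # both 4.0
```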

As the degrees of freedom increase, the t-distribution better approximates the normal
distribution.

In the case of a normal distribution, we use the z-table;
in the case of a t-distribution, we use the t-table.

T-table:

Reading the t-table is almost the same as reading the z-table.

Here we find the t-statistic and check whether it is greater than or less than the t-critical
value.

Similar to the z-score, if the t-statistic is far from the mean, we reject the null.

After getting the t-value, we can find the probability of obtaining this t-statistic (or one
even more extreme) under the null hypothesis. This is called the p-value.

If this probability is very small, we can reject the null hypothesis, as there might be
something else going on.

Cohen's d:
A standardized mean difference that measures the distance between means in standardized units.
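One common one-sample form of Cohen's d divides the mean difference by the sample standard deviation; a sketch with made-up numbers:

```python
import statistics

def cohens_d(sample, mu):
    """Standardized mean difference between a sample and a reference mean."""
    return (statistics.mean(sample) - mu) / statistics.stdev(sample)

# Hypothetical sample against a reference population mean of 50
sample = [52, 55, 49, 58, 56]
d = cohens_d(sample, 50)
print(d)   # about 1.13 -> the means are over one sd apart
```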