Uploaded by Sushant Gandhi


Descriptive Statistics

Inferential Statistics

To understand the difference between descriptive and inferential statistics, all you need to remember is that descriptive statistics summarize your current dataset, while inferential statistics aim to draw conclusions about a larger population beyond your dataset.

Descriptive Statistics

Descriptive statistics show, summarize, and describe data in a meaningful way so that patterns can emerge from the data.

Descriptive statistics do not allow us to draw conclusions beyond the data we have analysed, or to reach any conclusion regarding a hypothesis.

Descriptive statistics enable us to present data in a more meaningful way, which allows simpler interpretation. It is useful to summarize a group of data using a combination of tabulated description (tables), graphical description (graphs and charts), and statistical commentary (discussion of results).

Descriptive statistics describe the immediate group of data. -> A group that includes all the data we are interested in is called a population. A population can be as large or as small as needed, as long as it includes all the data of interest.

When descriptive statistics are applied to a population, properties of the population such as the mean or standard deviation are called parameters, since they represent the whole population we are interested in.

Descriptive statistics usually involve measures of central tendency (mean, median, mode) and measures of dispersion (variance, standard deviation).
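As a minimal sketch of these measures, using Python's standard `statistics` module on a made-up dataset:

```python
import statistics

data = [4, 8, 6, 5, 3, 2, 8, 9, 2, 5, 8]  # hypothetical sample

mean = statistics.mean(data)            # sum of values / count
median = statistics.median(data)        # middle value when sorted
mode = statistics.mode(data)            # most frequent value
variance = statistics.pvariance(data)   # population variance (divide by n)
stdev = statistics.pstdev(data)         # population standard deviation

print(mean, median, mode, stdev)
```

Here `pvariance`/`pstdev` treat the data as a full population; the sample versions (`variance`/`stdev`) appear later with Bessel's correction.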

Inferential Statistics

Unlike descriptive statistics, where we work with the full population of interest, in inferential statistics we often do not have access to the whole population, only a limited amount of data.

So we measure a sample of the data, which is used to estimate properties of the entire population.

Properties of a sample, such as the mean and standard deviation, are not called parameters but statistics.

Inferential statistics allow us to use these samples to make generalizations about the populations from which the samples were drawn.

- Estimation of parameters
  - We estimate population values based on sample data
  - Using techniques such as confidence intervals, we can provide a range of values within which the parameter is likely to lie
- Testing of statistical hypotheses
  - Drawing conclusions about a population parameter
  - Tests used in hypothesis testing: t-tests, chi-square, etc.
  - Tests are used to decide whether a hypothesis about a parameter holds, helping us reach a conclusion about the population based on sample data

Terms and symbols used in statistics ->

Measure       Statistic (sample)    Parameter (population)
mean          x-bar                 mu
proportion    p-hat                 p

Intro to research methods

Measuring Constructs

E.g. -> memory, effort, intelligence

Based on data we can draw conclusions about the constructs being measured.

Data is the most important part of statistics; without data we cannot form operational definitions or draw any conclusions.

Randomness

Randomness is chance.

When sampling at random, each element has an equal chance of being selected.

A sample taken at random has a better chance of being a true representative of the population.

On the x-axis we place the independent variable, also called the predictor variable.

Lurking variables -> variables other than the ones being considered in the visualization.

For example, in this plot of hours slept against temporal memory, we can see a relationship between the two, but we cannot necessarily be confident that sleep causes better memory.

If we want to show relationships, i.e. see patterns and plot them on a scatter plot, we can do observational studies and surveys.

But if we want to show causation, e.g. that more sleep leads to better memory, we need to do a controlled experiment.

Benefits of Surveys

- Relatively inexpensive

- Conducted remotely

- Anyone can access and analyse survey results

Downsides of surveys

- Untruthful responses

- Biased responses

- Respondents not understanding the questions

- Respondents refusing to answer

When doing an experiment, we can control multiple factors so that we arrive at interpretable results.

Randomization creates two groups in an experiment, e.g. a placebo and a non-placebo group, that are similar, so that both groups have similar base parameters (in the case of people: similar numbers of m/f, age, sleep patterns, etc.).

Visualizing Data

Frequency table

A frequency table records how many times each value occurs in a dataset; this makes a dataset easier to analyse.

Relative frequency

Relative frequency is the occurrence of a particular value relative to the total, i.e. the proportion of each value.

Range of proportions: all proportions are greater than or equal to 0 and less than or equal to 1.

For any frequency table, the relative frequencies should always add up to 1.

Similarly, in the case of a sample of student ages, whenever we have a large amount of data we can't just list the data with one row for each value — we group the data instead.

^^ These are the bins of data
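A minimal sketch of a frequency table and its relative frequencies, using `collections.Counter` on hypothetical student ages:

```python
from collections import Counter

ages = [19, 20, 19, 21, 22, 20, 19, 23, 21, 19]  # hypothetical sample

freq = Counter(ages)                      # absolute frequency of each value
total = sum(freq.values())
rel_freq = {age: count / total for age, count in freq.items()}

# proportions all lie in [0, 1] and sum to 1
print(freq[19], rel_freq[19])
```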

Visualizing this data in a histogram:

http://www.shodor.org/interactivate/activities/Histogram/

You need an appropriate bin size to visualize the data correctly — it cannot be too small or too large — so that we can clearly see the distribution.

The basic difference between a histogram and a bar graph is that in a bar graph each point on the x-axis is a completely different category, while in a histogram the x-axis is a continuous numeric scale divided into bins of equal width.

Because of this, the shape of a bar graph is arbitrary, depending on how we choose to order the categories, whereas the shape of a histogram is fixed.

^^ This data is what we call normally distributed: if we fold the histogram in the middle, it is roughly symmetrical.

^^ Shape of a histogram for a normal distribution

^^ This is a negatively skewed distribution

Summary

Central Tendency

Mode -> the value that occurs most often in a dataset or distribution.

^^ There is no mode here.

The mode can also be used for categorical data.

If we take many samples from the same population, the mode can differ between samples, since samples are drawn at random.

Mean ->

All values in the distribution affect the mean.

The mean of a sample can be used to make inferences about the population it came from.

Outlier ->

An outlier is a data point that is extremely different (either much higher or much lower) from the other points in the dataset.

The mode is not influenced by an outlier, but the mean is influenced heavily.

Median ->

The median of data that contains an outlier can still be used to accurately describe the data.

^ Median with outlier.

In a positively skewed distribution: mode < median < mean.
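A small sketch of how an outlier pulls the mean but barely moves the median, using hypothetical salary figures:

```python
import statistics

salaries = [40, 45, 50, 52, 55]       # hypothetical salaries, in thousands
with_outlier = salaries + [400]       # add one extreme salary

# the mean shifts dramatically, the median barely moves
print(statistics.mean(salaries), statistics.mean(with_outlier))
print(statistics.median(salaries), statistics.median(with_outlier))
```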

Variability

In some datasets the mean, median, and mode are the same, yet the datasets still differ.

E.g. ->

^^ Here what we can observe is that the range is different.

The range is sensitive to extreme values: for example, if someone in the dataset has an extreme salary, that will change the outcome of the range calculation.

To reduce this sensitivity we divide the data into quartiles: Q1, Q2, Q3.

The first quartile (Q1) is the point below which 25% of the distribution lies, and above which 75% lies.

Defining outliers: a common rule of thumb is that a point is an outlier if it lies below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR.

We visualize quartiles and outliers using box plots.

Most of the data generally falls within the IQR, but the IQR doesn't tell us as much as we would like to know about the dataset.
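A minimal sketch of quartiles, the IQR, and the 1.5 × IQR outlier fences, assuming a made-up dataset (Python 3.8+ for `statistics.quantiles`):

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]  # hypothetical data

q1, q2, q3 = statistics.quantiles(data, n=4)  # quartile cut points
iqr = q3 - q1

# common rule of thumb: flag points beyond 1.5 * IQR from the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data + [30] if x < lower_fence or x > upper_fence]
print(q1, q3, iqr, outliers)
```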

e.g. find the average distance (deviation) between each data value and the mean:

- Calculate the mean
- Find each value's deviation from the mean
- Find the average deviation

Average Deviation

^^ This can't be a measure of variability, because here the negative deviations cancel the positive ones and we might end up with 0.

Instead, we square the deviations: the sum of squared deviations divided by n is called the variance.

To return to the original units we take a square root: the square root of the variance is called the standard deviation.

When we use a sample's standard deviation to approximate that of a larger population, we apply Bessel's correction: divide by n − 1 instead of n.
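The n vs. n − 1 distinction can be sketched directly, using a hypothetical sample:

```python
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
n = len(sample)
mean = sum(sample) / n

# population variance: average of squared deviations (divide by n)
pop_var = sum((x - mean) ** 2 for x in sample) / n
# sample variance with Bessel's correction (divide by n - 1),
# used when the sample stands in for a larger population
sample_var = sum((x - mean) ** 2 for x in sample) / (n - 1)

print(pop_var, sample_var, sample_var ** 0.5)
```

These match `statistics.pvariance` and `statistics.variance` respectively.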

Standardizing

A single value tells us very little until we know the shape of the distribution.

We need to know how many data points are less than or greater than a given data point to judge its significance.

This is why we should use relative frequencies and convert all absolute frequencies to proportions.

But too many bins may result in noisy, irrelevant detail, hence we fit a continuous curve over the histogram to read off approximate values at any point. We can get pretty good approximations from this distribution.

We can calculate z: by calculating z we can see approximately what percentage of the distribution a value is greater than.

The number of standard deviations away from the mean is the way to look for patterns.

When we have multiple distributions and need to draw conclusions across them, the only common reference is the standard deviation; we can combine the two distributions using 0 as a reference point.

^^ standardizing

The z-score is the number of standard deviations a value is from the mean.

After standardizing, the mean is 0 and the standard deviation is 1.

Generally most values fall within the range −2 SD to +2 SD.
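A minimal sketch of standardizing, assuming hypothetical test scores: each value is converted to a z-score, after which the distribution has mean 0 and standard deviation 1.

```python
import statistics

scores = [70, 75, 80, 85, 90]       # hypothetical scores
mu = statistics.mean(scores)
sigma = statistics.pstdev(scores)

# z-score: number of standard deviations a value lies from the mean
z = [(x - mu) / sigma for x in scores]

# after standardizing: mean 0, standard deviation 1
print(round(statistics.mean(z), 10), round(statistics.pstdev(z), 10))
```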

Normal Distributions

PDF -> probability density function

When we model the data, absolute frequency converts into relative frequency.

http://mathworld.wolfram.com/ProbabilityDensityFunction.html

The curve never touches the x-axis; it extends towards −infinity and +infinity.

The area under the curve from −infinity up to a certain point is the probability of observing a value at or below that point.

For normal distributions, approximately 68% of values lie within 1 SD (−1 to +1), and approximately 95% within 2 SD (−2 to +2).

https://drive.google.com/file/d/0B58c_aSA_l2ueTgxUXRZaFIxLUE/view?usp=sharing
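The 68%/95% figures can be checked with the standard normal CDF, which `math.erf` gives in closed form — a sketch:

```python
import math

def normal_cdf(z):
    # area under the standard normal curve from -infinity to z
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

within_1sd = normal_cdf(1) - normal_cdf(-1)
within_2sd = normal_cdf(2) - normal_cdf(-2)
print(round(within_1sd, 4), round(within_2sd, 4))  # ~0.6827 and ~0.9545
```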

Sample Distributions

If we want to compare two sample distributions, we can compare their individual means and arrive at results.

Estimation

Interval estimate

The greater our sample size, the smaller the confidence interval, and the better we can estimate the population parameter.

1.96 is the critical value of z for 95% confidence.
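A minimal sketch of a 95% confidence interval for the mean, assuming a hypothetical sample; the margin of error shrinks as n grows:

```python
import math
import statistics

sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]  # hypothetical data
n = len(sample)
mean = statistics.mean(sample)
sd = statistics.stdev(sample)          # sample SD (Bessel-corrected)

z_crit = 1.96                          # critical z for 95% confidence
margin = z_crit * sd / math.sqrt(n)    # margin of error
ci = (mean - margin, mean + margin)
print(ci)
```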

Hypothesis Testing

Above we assumed that unlikely values can only exist on one side of the distribution, but unlikely values can exist on either side.

^ This is called a two-tailed test: the critical region is split into equal positive and negative tails.

With alpha = 0.05 there are two outcomes: the sample mean lies inside or outside the critical region.

Under the null hypothesis, the sample mean lies outside the critical region (in the whitespace ^).

Null hypothesis -> assumes there is no significant difference between the current population parameters and the population parameters after the intervention.

The alternative hypothesis says there will be a difference: >, <, or !=.

We can't prove that the null hypothesis is true; we can only obtain evidence to reject it.

Our decision is either to reject the null hypothesis or to fail to reject it.

We find the z-score of the sample mean and see whether it falls in the critical region; based on this we decide whether to reject or fail to reject the null hypothesis. We reject the null when:

- The sample mean falls within the critical region
- The z-score of the sample mean is greater than the z-critical value
- The probability of obtaining the sample mean is less than the alpha level
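The decision rule above can be sketched with hypothetical numbers (population mu = 100, sigma = 15 known; a sample of n = 36 with mean 106):

```python
import math

mu, sigma, n, sample_mean = 100, 15, 36, 106  # hypothetical values

# z-score of the sample mean (standard error = sigma / sqrt(n))
z = (sample_mean - mu) / (sigma / math.sqrt(n))
z_critical = 1.96                              # two-tailed, alpha = 0.05

# reject the null when |z| exceeds the critical value
reject_null = abs(z) > z_critical
print(z, reject_null)
```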

T - Distributions

A z-test works when we know the population standard deviation and mean (sigma and mu).

The t-test depends only on the sample. For this we use a new distribution called the t-distribution; it is more spread out, so it is more prone to error.

Degrees of freedom -> after how many choices the rest of the choices are forced.

For n numbers, the degrees of freedom are n − 1; for an n × n table, they are (n − 1) × (n − 1).

Degrees of freedom are the number of pieces of information that can be freely varied without violating any given restrictions — only n − 1 values are independent.

As the degrees of freedom increase, the t-distribution better approximates the normal distribution.

In the case of the t-distribution, we use a t-table:

https://s3.amazonaws.com/udacity-hosted-downloads/t-table.jpg

Here we find the t-statistic and check whether it is greater than or less than the t-critical value, similar to a z-score. If the t-statistic is far from the mean, we reject the null.

We can find the probability of obtaining this t-statistic.

^^ This is called the p-value.

If the p-value is smaller than alpha, we can reject the null hypothesis, as there might be something else going on.

Cohen's d:

The standardized mean difference, which measures the distance between means in standardized units.
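A sketch of a one-sample t-statistic and Cohen's d, assuming a hypothetical sample tested against a hypothesized population mean of 50; the t-critical value is taken from a t-table (two-tailed, alpha = 0.05, df = 7):

```python
import math
import statistics

sample = [52, 55, 48, 53, 57, 51, 54, 56]  # hypothetical sample
mu_0 = 50                                   # hypothesized population mean
n = len(sample)
mean = statistics.mean(sample)
sd = statistics.stdev(sample)               # divides by n - 1 (Bessel)

t = (mean - mu_0) / (sd / math.sqrt(n))     # t-statistic
cohens_d = (mean - mu_0) / sd               # effect size in SD units

t_critical = 2.365   # from a t-table: two-tailed, alpha = 0.05, df = 7
print(t > t_critical, round(cohens_d, 3))
```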
