Paola Grosso
SNE research group
Today with Jeroen van der Ham as special guest
Instructions for use
I do talk fast:
ask me to repeat if something is not clear.
I made an effort to keep it interesting,
but you are the guinea pigs: feedback is welcome!
We want to avoid hearing this from you.
Collecting data
Presenting data
Descriptive statistics
A real-life example (Jeroen)
Terminology
Sampling
Data types
Basic terminology
Estimate the height?
Non-probability sampling:
some elements of the population have no chance of selection, or
the probability of selection cannot be accurately determined.
Accidental (or convenience) Sampling;
Quota Sampling;
Purposive Sampling.
Probability sampling:
every unit in the population has a chance (greater than zero) of being
selected in the sample, and this probability can be accurately
determined.
Simple random sample
Systematic random sample
Stratified random sample
Cluster sample
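The four probability sampling schemes can be sketched in R as follows. This is a minimal illustration, not part of the original slides: the population of 100 units, the four groups A to D, and the sample size of 20 are all hypothetical.

```r
set.seed(42)  # reproducible draws

# Hypothetical population: 100 units in four equally sized groups
population <- data.frame(
  id    = 1:100,
  group = rep(c("A", "B", "C", "D"), each = 25)
)
n <- 20  # desired sample size (assumption)

# Simple random sample: every unit has the same chance n / N
simple_idx <- sample(population$id, n)

# Systematic random sample: random start, then every k-th unit
k <- nrow(population) / n          # sampling interval, 5 here
start <- sample(k, 1)              # random start in 1..k
systematic_idx <- seq(start, by = k, length.out = n)

# Stratified random sample: the same number of units from each stratum
stratified_idx <- unlist(lapply(
  split(population$id, population$group),
  function(ids) sample(ids, n / 4)
))

# Cluster sample: pick whole groups at random and keep all of their units
chosen_clusters <- sample(unique(population$group), 2)
cluster_idx <- population$id[population$group %in% chosen_clusters]
```

In each case the selection probability of a unit is known in advance, which is what separates these schemes from the non-probability ones.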
Discrete data
values are distinct and separate, i.e. they can be counted
Categorical data
values can be sorted according to category.
Nominal data
values can be assigned a code in the form of a number, where the
numbers are simply labels
Ordinal data
values can be ranked or have a rating scale attached
Continuous data
values may take any value within a finite or infinite interval
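As a rough illustration of how these data types map onto R objects (the variable names and values are made up for the example):

```r
# Discrete data: counts, stored as integers
friends <- c(23L, 44L, 156L)

# Continuous data: any value within an interval
height_m <- c(1.62, 1.80, 1.755)

# Nominal data: categories whose codes are only labels
eye_colour <- factor(c("brown", "blue", "brown"))

# Ordinal data: categories with a meaningful ranking
rating <- factor(c("low", "high", "medium"),
                 levels = c("low", "medium", "high"), ordered = TRUE)

is.integer(friends)      # TRUE
is.ordered(eye_colour)   # FALSE: there is no ranking among colours
rating[1] < rating[2]    # TRUE: "low" ranks below "high"
as.integer(eye_colour)   # the underlying numeric codes, which are labels only
```

Comparisons such as `<` are only defined for the ordered factor; R refuses them for a nominal one, which matches the distinction made above.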
Discrete or continuous?
Tables
Charts
Graphs
How many friends do you have on Facebook?
Frequency tables
23,44,156,246,37,79,156,123,267,12,
145,88,95,156,32,287,167,55,256,47
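One way to turn the twenty answers into a frequency table in R is sketched here. The class width of 50 is an assumption chosen for illustration; the slides do not say which classes were used.

```r
friends <- c(23, 44, 156, 246, 37, 79, 156, 123, 267, 12,
             145, 88, 95, 156, 32, 287, 167, 55, 256, 47)

# Bin the answers into classes of width 50 (assumed class width)
breaks  <- seq(0, 300, by = 50)
classes <- cut(friends, breaks = breaks)  # intervals are (lower, upper]

freq     <- table(classes)       # absolute frequencies
rel_freq <- prop.table(freq)     # relative frequencies
cum_freq <- cumsum(freq)         # cumulative frequencies

# Collect everything in one table
freq_table <- data.frame(
  class      = names(freq),
  frequency  = as.vector(freq),
  relative   = as.vector(rel_freq),
  cumulative = as.vector(cum_freq)
)
freq_table
```

A different class width changes the shape of the table, so it is worth trying a few values before settling on one.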
Add values
Add title
(or caption in document)
Caution:
You should never use a pie chart to
show historical data over time.
Also do not use one for the data in a
frequency distribution.
and more
Shmoo plot
Bode plot
Stemplot
Arrhenius plot
Ternary plot
Recurrence plot
Nichols plot
Nyquist plot
Lineweaver-Burk plot
Violin plot
Q-Q plot
The median is the middle number in the ordered data set; below
and above the median there is an equal number of observations.
The mode is the most frequently occurring value in the data set.
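A short sketch of these measures in R, using the Facebook answers from earlier in the lecture. R has no built-in function for the statistical mode, so `stat_mode` is a hypothetical helper written for this example; `mean()` is included only for comparison.

```r
friends <- c(23, 44, 156, 246, 37, 79, 156, 123, 267, 12,
             145, 88, 95, 156, 32, 287, 167, 55, 256, 47)

# Median: middle of the ordered data; with 20 values it is the
# average of the 10th and 11th ordered observations
median(friends)   # 109

# Mode: most frequently occurring value(s); hypothetical helper
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}
stat_mode(friends)   # 156

# Arithmetic mean, for comparison
mean(friends)        # 123.55
```

The helper returns every value that reaches the maximum count, so a data set with ties yields more than one mode.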
What did you learn?
Grad 9 1900
Grad 10 1750
Grad 11 2100
Grad 12 2050
Sep. 06 2010 - Slide 30
Outliers
Causes:
measurement error
the population has a heavy-tailed distribution
$$V(x) = \frac{1}{N}\sum_i (x_i - \bar{x})^2 = \overline{x^2} - \bar{x}^2$$
$$\sigma_x = \sqrt{V(x)}$$
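Data Sample: $\sigma_{data}$
Parent Distribution (from which data sample was drawn): $\sigma_{parent}$
$$\sigma_{data} = \sqrt{\frac{1}{N}\sum_i (x_i - \bar{x})^2}$$
$$\sigma_{parent} = \sqrt{\frac{1}{N-1}\sum_i (x_i - \bar{x})^2} = \sqrt{\frac{N}{N-1}}\,\sigma_{data}$$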
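The difference between the two standard deviations can be checked numerically. The sketch below assumes the usual definitions (divide by N for the spread of the sample itself, by N - 1 for the estimate of the parent distribution) and uses a small made-up sample.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # hypothetical sample
N <- length(x)

# Variance of the data sample: mean squared deviation from the mean
var_data <- mean((x - mean(x))^2)
# Equivalent form: mean of the squares minus the square of the mean
var_alt  <- mean(x^2) - mean(x)^2
sd_data  <- sqrt(var_data)       # 2 for this sample

# Estimate for the parent distribution: divide by N - 1 instead of N
sd_parent <- sqrt(sum((x - mean(x))^2) / (N - 1))

# R's built-in sd() uses the N - 1 form
all.equal(sd_parent, sd(x))                          # TRUE
all.equal(sd_parent, sd_data * sqrt(N / (N - 1)))    # TRUE
```

For large N the factor sqrt(N / (N - 1)) approaches 1, so the distinction matters mostly for small samples.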
Quartiles and percentiles
Quartiles:
Q1, Q2 and Q3 divide the sample of observations into four groups:
25% of data points ≤ Q1;
50% of data points ≤ Q2 (Q2 is the median);
75% of data points ≤ Q3.
The semi-inter-quartile range (SIQR), or quartile deviation, is:
$$SIQR = \frac{Q_3 - Q_1}{2}$$
The 5-number summary: (min_value, Q1, Q2, Q3, max_value)
Percentiles:
The values that divide the data sample into 100 equal parts.
The box plot uses the 5-number summary.
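Quartiles, the SIQR, the 5-number summary and percentiles can be computed in R along these lines. The data are the Facebook answers from earlier; note that several conventions exist for computing quartiles, and this sketch simply uses R's defaults.

```r
friends <- c(23, 44, 156, 246, 37, 79, 156, 123, 267, 12,
             145, 88, 95, 156, 32, 287, 167, 55, 256, 47)

# Q1, Q2 and Q3 with R's default quantile method (type 7)
q <- quantile(friends, probs = c(0.25, 0.5, 0.75))

# Semi-inter-quartile range: half the distance between Q3 and Q1
siqr <- unname((q[3] - q[1]) / 2)

# 5-number summary built from the quartiles above
five <- c(min = min(friends), q, max = max(friends))
five

# Tukey's five-number summary; its hinges can differ slightly
# from the quantile() quartiles
fivenum(friends)

# Any percentile, for example the 90th
quantile(friends, probs = 0.90)
```

`boxplot(friends)` draws the corresponding box-and-whisker plot from the same kind of summary.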
Quick test
Create a CSV file with the frequency data below (the commands that follow assume the columns are named Weight and Food_consumption):
84,32
93,33
81,33
61,24
95,39
86,32
90,34
78,28
85,33
72,27
65,26
75,29
Read the file into the R memory in variable obesity.
Run the following commands:
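# Make the columns of obesity available by name
attach(obesity)
# Scatter plot of food consumption against weight
plot(Weight,Food_consumption)
# Pearson correlation coefficient between the two variables
cor(Weight,Food_consumption)
# Correlation matrix for all columns of the data frame
cor(obesity)
# Test whether the correlation differs significantly from zero
cor.test(Weight,Food_consumption)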
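To see what `cor()` computes, the Pearson coefficient can also be worked out by hand. This sketch builds the data frame directly instead of reading a CSV file, and it assumes that the first number of each pair is the weight and the second the food consumption.

```r
# Quick-test data; the column assignment is an assumption
obesity <- data.frame(
  Weight           = c(84, 93, 81, 61, 95, 86, 90, 78, 85, 72, 65, 75),
  Food_consumption = c(32, 33, 33, 24, 39, 32, 34, 28, 33, 27, 26, 29)
)
x <- obesity$Weight
y <- obesity$Food_consumption

# Pearson r: co-variation of x and y, scaled by the spread of each
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

all.equal(r_manual, cor(x, y))   # TRUE: same value as the built-in

# cor.test() additionally reports a p-value and a confidence interval
result <- cor.test(x, y)
result$estimate
result$p.value
```

Using `obesity$Weight` rather than `attach()` keeps the example free of side effects on the search path.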
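$$y_i = a + b x_i + \varepsilon_i$$
such that the sum of the squared vertical distances $\varepsilon_i$ (the errors on $y_i$) is minimized.
$$\varepsilon_i = y_i - \hat{y}_i$$
The resulting equation and coefficients are:
$$\hat{y} = a + bx$$
$$b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{\mathrm{cov}(x, y)}{\sigma_x^2} = r\,\frac{\sigma_y}{\sigma_x}$$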
> pairs(obesity)
> fit <- lm(Food_consumption~Weight)
> fit
> summary(fit)
> plot(Weight,Food_consumption,pch=16)
> abline(lm(Food_consumption~Weight),col='red')
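The coefficients reported by `lm()` can be cross-checked against the least-squares slope formula. The sketch below is self-contained and again assumes the first number of each quick-test pair is the weight. The intercept formula a = mean(y) - b * mean(x) is the standard least-squares result and is not shown on the slides.

```r
# Quick-test data; the column assignment is an assumption
obesity <- data.frame(
  Weight           = c(84, 93, 81, 61, 95, 86, 90, 78, 85, 72, 65, 75),
  Food_consumption = c(32, 33, 33, 24, 39, 32, 34, 28, 33, 27, 26, 29)
)
x <- obesity$Weight
y <- obesity$Food_consumption

# Slope from the covariance form of the formula
b <- cov(x, y) / var(x)
# The same slope from the correlation form
b_from_r <- cor(x, y) * sd(y) / sd(x)
# Intercept: the fitted line passes through the point of means
a <- mean(y) - b * mean(x)

# Fit with lm(), passing the data frame explicitly instead of attach()
fit <- lm(Food_consumption ~ Weight, data = obesity)

all.equal(unname(coef(fit)), c(a, b))   # TRUE
all.equal(b, b_from_r)                  # TRUE

# Least-squares residuals sum to zero (up to rounding)
sum(residuals(fit))
```

`cov()`, `var()` and `sd()` all divide by N - 1 in R, so the factors cancel in both ratios and the slope is the same as with the divide-by-N definitions.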