
# APPLIED STATISTICS SOFTWARE

www.davidgl.eu
david@davidgl.eu
April 2016


STATISTICS BASIC CONCEPTS
1. QUALITATIVE AND QUANTITATIVE RESEARCH
2. DESCRIPTIVE STATISTICS AND INFERENTIAL STATISTICS
3. POPULATION AND SAMPLE
4. SAMPLING METHODS
5. SAMPLE SIZE
6. TYPES OF VARIABLES

When you decide how to study a certain issue, problem or phenomenon, you can choose either a qualitative or a quantitative methodology.

Qualitative research vs. quantitative research: different methods, tools and procedures to analyse information.

## QUALITATIVE RESEARCH (I)

Objective: understanding the deeply hidden nature of phenomena.
Obtaining knowledge about emotions, sensitivity thresholds, barriers, attitudes, evaluations, desires and needs of a target group.
Qualitative research is inductive (used to start the research process).
What matters is what was said, not how many times: processes and meanings are rigorously examined, but not measured in terms of quantity, amount or frequency.

## QUANTITATIVE RESEARCH

Objective: determining the relationship between an independent variable and a dependent one.
It allows measuring the extent of phenomena.
Quantitative research is deductive (hypotheses are identified before research begins).
Quantitative research often requires recruiting hundreds of participants (to reduce the likelihood of bias).

Complementary approaches:
Qualitative research: research subject definition; hypotheses definition.
Quantitative research: research hypotheses tests; generalizable conclusions.
Quantitative enumerates, and qualitative explains.

"Measure what can be measured, and make measurable what cannot be measured." (Galileo Galilei)

Choose a qualitative method when most of these conditions apply:
You have no existing research data on this topic.
You are exploring the reasons why people do or believe something.
The most appropriate unit of measurement is not certain (Individuals? Households? Organizations?).
The concept is assessed with no clear demarcation points.

Choose a quantitative method when most of these conditions apply:
The research is confirmatory rather than exploratory (i.e. this is a frequently researched topic, and numerical data from earlier research is available).
You are trying to measure a trend.
There is no ambiguity about the concepts being measured, and only one way to measure each concept.

Statistics is more than just a collection of mathematical techniques; it is not only putting numbers into formulas or computers.
Statistics is concerned with the collection, organization and description of a dataset, and the use of probability theory to make predictions that are useful for making decisions under uncertainty.
In short, statistics is about learning from data.

Descriptive Statistics describes, shows or summarizes the basic features of the data in a study.
Tables (Frequency Distribution)
Graphs
Statistics (Calculations)
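As a minimal sketch of the table and calculation tools listed above, the following Python snippet builds a frequency distribution and a few summary statistics. The dataset is hypothetical and purely illustrative; it does not come from the course material.

```python
from collections import Counter
import statistics

# Hypothetical data: number of children in ten households (illustrative only).
children = [0, 1, 2, 2, 1, 0, 3, 2, 1, 2]

# Table: frequency distribution (value -> absolute frequency), sorted by value.
frequencies = dict(sorted(Counter(children).items()))

# Statistics: a few summary calculations.
summary = {
    "mean": statistics.mean(children),
    "median": statistics.median(children),
    "mode": statistics.mode(children),
}

print(frequencies)
print(summary)
```

Only the standard library is used here; a graph would need a plotting package, which is left out of this sketch.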

Inferential Statistics deduces (infers) the properties of a population from the analysis of the properties of a data sample drawn from it.
"Inference: using facts you have to learn about facts you don't have." (Gary King)

## 3. POPULATION AND SAMPLE

Population (universe): the entire set of all individuals, items, or subjects whose characteristics are being studied. The size of the population is referred to as N.
Parameter: measurable characteristic of a population. For example, the mean of a population is denoted by the symbol μ.
Sample: subset of items drawn from a population, and used to test hypotheses about that population. The size of the sample is referred to as n.
Statistic: measurable characteristic of a sample. Statistics vary from sample to sample. For example, the mean of a sample is denoted by the symbol x̄.
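The difference between a parameter and a statistic can be illustrated with a short simulation. This is only a sketch: the population below is generated artificially (a normal distribution with mean 50 and standard deviation 10 is an arbitrary assumption), and the sizes N = 1000 and n = 30 are chosen for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population of N = 1000 values (illustrative only).
population = [random.gauss(50, 10) for _ in range(1000)]
mu = statistics.mean(population)  # parameter: the population mean

# Draw three samples of size n = 30; each one yields its own statistic.
sample_means = [
    statistics.mean(random.sample(population, 30)) for _ in range(3)
]

print(f"population mean (parameter): {mu:.2f}")
for i, x_bar in enumerate(sample_means, start=1):
    print(f"sample {i} mean (statistic): {x_bar:.2f}")
```

The parameter is a single fixed number, while each sample produces a slightly different estimate of it.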

Sometimes an entire population is analyzed (elections, a study of all members of a small association, ...), and then there is no sampling error.
But researchers often rely on samples!
Why choose a sample instead of the entire population?
Budget or time restrictions (e.g. unemployed people).
Impossibility of identifying and accessing all population members (e.g. people who may suffer from insomnia).
Sometimes analyzing an item means destroying it (e.g. bulbs produced by a certain factory).


WHICH individuals constitute the sample? Sampling methods.
HOW MANY individuals constitute the sample? Sample size.

## 4. SAMPLING METHODS

How are the individuals in the sample selected?
Objective: obtain a sample that is representative of the population, so that our findings can be generalized to the whole group.

Probability sampling: Every member of the population has a known non-zero probability of being selected. The sampling error can be calculated, and inference can be undertaken.
Non-probability sampling: Some elements of the population have no chance of selection, or the probability of selection can't be accurately calculated. The selection is based on assumptions regarding the population. Hence, this kind of sampling does not allow the estimation of sampling errors, and inference cannot be undertaken.

Probability sampling: Simple Random Sampling, Systematic Sampling, Stratified Sampling, Cluster Sampling.
Non-probability sampling: Convenience Sampling, Judgement Sampling, Quota Sampling.

Probability Sampling

Simple Random Sampling (S.R.S.): Each member of the population has an equal and known probability of being selected.
Each member is assigned a number, and the sample is determined by generating random numbers.
Applicable when the population is small and homogeneous.
Estimates are easy to calculate.
It requires a complete and accurate record of the population.
It can only be done with small populations where all individuals are identified.
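A simple random sample can be sketched in a few lines of Python. The numbered sampling frame of 200 members, the sample size of 20 and the helper name `simple_random_sample` are all assumptions made for illustration.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct members; every member is equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(population, n)

# Hypothetical sampling frame: members numbered 1..200 (illustrative only).
frame = list(range(1, 201))
sample = simple_random_sample(frame, 20, seed=1)
print(sorted(sample))
```

`random.sample` draws without replacement, so no member can appear twice in the sample.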


Systematic Sampling: The population is arranged according to some ordering scheme, a random start is chosen, and then elements are selected at regular intervals (every kth element from then onwards) through that ordered list.
Easier to conduct than a simple random sample (gains in time, effort and cost).
It requires a complete and ordered record of the population.
It can produce biased findings if the population data presents any hidden order, periodicity or pattern.

Example:
A simple example would be to select every 10th name from the
telephone directory (an 'every 10th' sample, also referred to as 'sampling
with a skip of 10').
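The 'every 10th' idea can be sketched as follows, assuming a hypothetical ordered directory of 100 names. The helper name `systematic_sample` and the data are illustrative, not part of the original material.

```python
import random

def systematic_sample(ordered_population, k, seed=None):
    """Pick a random start among the first k elements, then take every kth one."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # random start index in 0..k-1
    return ordered_population[start::k]

# Hypothetical ordered list standing in for a telephone directory.
directory = [f"name_{i:03d}" for i in range(100)]
sample = systematic_sample(directory, 10, seed=7)
print(sample)
```

If the list had a hidden period of 10 (for example, every 10th entry sharing some trait), this sample would inherit that pattern, which is the bias risk mentioned above.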


Stratified Sampling: It involves dividing the population into non-overlapping groups (strata) that differ from one another but contain fairly homogeneous individuals, e.g. age groups or genders.
Then each stratum is sampled as an independent sub-population, within which individual elements are randomly selected and have the same chance of being selected.
It allows comparison between strata, and estimates of the population parameters for each stratum.
It can be difficult to identify the appropriate strata, particularly with little knowledge of the population characteristics.

Example:
A staff of N = 180 people is divided into four categories, and a sample of n = 40 individuals is needed, using stratified proportional sampling according to those categories.

| Stratum | Ni | Weight (%) | ni |
|---|---|---|---|
| Male, full time | 90 | (90/180) x 100 = 50 | 20 |
| Male, part time | 18 | (18/180) x 100 = 10 | 4 |
| Female, full time | 9 | (9/180) x 100 = 5 | 2 |
| Female, part time | 63 | (63/180) x 100 = 35 | 14 |

The first step is to calculate the weight of each group in the total staff: 50% of the sample individuals should be male full time (20 people), 10% should be male part time (4 people), 5% should be female full time (2 people), and 35% should be female part time (14 people).
Then an SRS is conducted within each stratum.
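The proportional allocation for this staff of 180 people (90 male full time, 18 male part time, 9 female full time, 63 female part time) and n = 40 can be reproduced with a short sketch. The function names and the staff identifiers are hypothetical; only the stratum sizes and the sample size come from the example.

```python
import random

def proportional_allocation(strata_sizes, n):
    """Split a total sample of n across strata in proportion to stratum size."""
    total = sum(strata_sizes.values())
    # round() may miss n by a unit in general; here the shares are exact.
    return {name: round(n * size / total) for name, size in strata_sizes.items()}

def stratified_sample(strata, n, seed=None):
    """Run a simple random sample inside each stratum."""
    rng = random.Random(seed)
    allocation = proportional_allocation({k: len(v) for k, v in strata.items()}, n)
    return {k: rng.sample(members, allocation[k]) for k, members in strata.items()}

sizes = {
    "male full time": 90,
    "male part time": 18,
    "female full time": 9,
    "female part time": 63,
}
allocation = proportional_allocation(sizes, 40)
print(allocation)

# Hypothetical staff identifiers, used only to demonstrate the draw.
staff = {name: [f"{name} #{i}" for i in range(size)] for name, size in sizes.items()}
sample = stratified_sample(staff, 40, seed=0)
```

With other stratum sizes the rounded allocations may not add up to n exactly, so a real study would need a rule for distributing the remainder.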


Cluster Sampling involves a two-step procedure:
First, the population is subdivided into groups (clusters) that are expected to be homogeneous among each other but heterogeneous internally, so that each of them is as representative of the population as possible.
In a second step, a random sample of these clusters is selected, and either all observations in the selected clusters are included in the sample (one-stage clustering), or a random sample of elements is selected within each of these clusters (two-stage clustering).


Can reduce time, effort or administrative costs.
Simple when the population shows a natural arrangement (e.g. geographical).
Actual clusters are not completely homogeneous with one another, so the sample may not be representative.

Example:
A chain of hardware stores wants to know the buying profile of its customers.
Since it may not be possible to list all of the customers of the chain, a subset of stores can be randomly selected (stage 1 of cluster sampling), and then a random sample of the customers who visit those stores can be interviewed (stage 2 of cluster sampling).
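The two stages of the hardware store example can be sketched as follows. The store and customer identifiers, the number of stores (8), the customers per store (50) and the function name are all hypothetical.

```python
import random

def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: pick clusters at random. Stage 2: pick elements within each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {
        name: rng.sample(clusters[name], min(n_per_cluster, len(clusters[name])))
        for name in chosen
    }

# Hypothetical chain: 8 stores, each with 50 visiting customers.
stores = {
    f"store_{s}": [f"store_{s}_customer_{c}" for c in range(50)]
    for s in range(8)
}
sample = two_stage_cluster_sample(stores, n_clusters=3, n_per_cluster=5, seed=3)
for store, customers in sample.items():
    print(store, customers)
```

Including every customer of the chosen stores instead of sampling within them would turn this into one-stage clustering.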

Non-Probability Sampling

Convenience Sampling: Individuals are chosen for convenience or ease: they are readily available or at hand to the researcher.
Very popular in practice, due to its simplicity.
Elements are selected arbitrarily from the population, so the sample is not representative of the population.
There is no randomness and the likelihood of bias is high, so it is only adequate for subjective assessments or pilot studies.

Example: An interviewer has to conduct a survey at a shopping center. If she goes early in the morning on a given day, the people she can interview are limited to those who are there at that time. Their views would not represent those of other members of society in the area, as they would if the survey were conducted at different times of day and several times per week.

Judgment or Purposive Sampling: Sample selection is based on the researcher's belief that the selected individuals are appropriate for the study.
Often used in political polling: some districts are chosen because their pattern has in the past provided a good idea of outcomes for the whole electorate.
Used very often, since it involves low cost and little time.
Elements are selected arbitrarily from the population, so the sample is not representative of the population.
There is no randomness and the likelihood of bias is high, so it is only adequate for subjective assessments or pilot studies.


Quota Sampling involves a two-step procedure:
First, the population is segmented into mutually exclusive subgroups (just as in stratified sampling), following one or more criteria such as age, income, frequency of purchase, or usage patterns.
Then, in a second step, the convenience or judgment of the researcher is used to select individuals within each group (the sample size from each category is proportional to its weight in the whole population).
It is this second step which makes the technique one of non-probability sampling.
The structure or characteristics of the population have to be known ex ante in order to obtain a similar structure for the sample.
As a non-probability technique, inference cannot be undertaken.

## Which is the appropriate sampling technique?

It depends on:

Research objectives
Need for statistical analysis and degree of accuracy required.
Available resources (time and funds)
Knowledge regarding the target population

## 5. SAMPLE SIZE

How many individuals comprise the sample?

Is sample size so important?
[Figure: examples of claims labelled "Tested on 26 women & men", "Tested on 23 women", "Tested on 18 women" and "Tested on 20 men".]

Sample information is not as accurate and truthful as population information.
So, the bigger the sample is, the more precise the information it offers.
But, on the other hand, the bigger the sample is, the more expensive the sampling process is.

So, which is the optimal sample size?
Insufficient size: no scientific results.
Excessive size: waste of resources.
Trade-off between quality and cost of our research.

Factors that determine the sample size:
Degree of variability of the measured variable (population homogeneity).
Population size, N.
Confidence level: the percentage of all possible samples that can be expected to include the true population parameter (it tells you how sure you can be).
Sampling error (precision required): maximum expected difference between the population parameter and its sample estimate.
Sampling method.
Statistical technique.
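To show how variability, population size, confidence level and sampling error interact, here is a sketch based on a common textbook formula for estimating a proportion under simple random sampling, with a finite population correction. This formula is an assumption chosen for illustration; the slides do not state which formula the course uses, and other sampling methods or statistical techniques call for different calculations.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(confidence=0.95, margin_of_error=0.05,
                               p=0.5, population_size=None):
    """Sample size for estimating a proportion under simple random sampling.

    p is the assumed proportion; p = 0.5 is the most variable (worst) case.
    """
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = z ** 2 * p * (1 - p) / margin_of_error ** 2
    if population_size is not None:
        # Finite population correction: small populations need fewer units.
        n0 = n0 / (1 + (n0 - 1) / population_size)
    return math.ceil(n0)

print(sample_size_for_proportion())                        # very large population
print(sample_size_for_proportion(population_size=10_000))  # N = 10,000
print(sample_size_for_proportion(margin_of_error=0.03))    # more precision required
```

Raising the confidence level or tightening the margin of error increases the required size, while a smaller or more homogeneous population reduces it.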

G*Power 3.1
PS Power and Sample Size
http://biostat.mc.vanderbilt.edu/wiki/Main/PowerSampleSize
ST Plan
eSoftware.aspx?Software_Id=41

Websites for calculating sample size online:

http://stat.ubc.ca/~rollin/stats/ssize/
http://www.stat.uiowa.edu/~rlenth/Power/index.html
http://www.raosoft.com/samplesize.html
http://epitools.ausvet.com.au/content.php?page=SampleSize
http://statpages.org/index.html#Power

## 6. TYPES OF VARIABLES

Variable: any characteristic or attribute that differs for different subjects.
Variables are classified according to their nature or measurement scale.

According to their nature, variables are qualitative or quantitative.

QUALITATIVE or CATEGORICAL
Represent characteristics (or categories) that cannot be measured or quantified. Such a characteristic is not a number and, if it is coded as a number, it cannot be used for calculations.
Dichotomous (binary, dummy): only two categories are defined.
Gender (male/female), consumer (yes/no)
Polytomous: more than two categories are defined.
Marital status, religious group, ZIP code

QUANTITATIVE or NUMERICAL
Represent characteristics that can be measured or quantified.
Discrete: variable whose values are countable.
Number of children in a household, times a place has been visited
Continuous: variable that can assume any numerical value over one or several intervals.
Weight, temperature, salary

Codification: assigning a certain number to each category of the qualitative variable, i.e. using numbers to describe the outcomes.
Gender: a) Male, b) Female
Those numbers do not have any meaning, so they cannot be used for calculations.
Discretisation: converting a quantitative variable into a qualitative variable, according to whether or not the quantitative variable exceeds certain critical thresholds.
For the variable monthly income, we can consider the following categories:
Monthly income >= 5,000: high income
2,000 <= monthly income < 5,000: medium-high income
1,000 <= monthly income < 2,000: medium-low income
Monthly income < 1,000: low income
Discretisation implies a loss of information.
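Discretising monthly income can be sketched as below. The sketch assumes brackets with no gaps between them: below 1,000 is low, from 1,000 up to (but excluding) 2,000 is medium-low, from 2,000 up to (but excluding) 5,000 is medium-high, and 5,000 or more is high. The function and constant names are hypothetical.

```python
from bisect import bisect_right

# Lower bounds of the second, third and fourth categories (assumed gap-free).
THRESHOLDS = [1000, 2000, 5000]
LABELS = ["low income", "medium-low income", "medium-high income", "high income"]

def income_category(monthly_income):
    """Map a numeric income to its category; the exact amount is discarded."""
    return LABELS[bisect_right(THRESHOLDS, monthly_income)]

for income in (999, 1000, 1999.99, 2000, 4999, 5000):
    print(income, "->", income_category(income))
```

Two people earning 2,000 and 4,999 end up in the same category, which is the loss of information that discretisation brings.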

According to their measurement scale, variables are nominal or ordinal (if qualitative), or interval or ratio (if quantitative).

NOMINAL
Numbers serve only as labels for identifying individuals, but they are randomly assigned. Categories cannot be rank ordered.
Gender, marital status

ORDINAL
Categories can be ordered in a hierarchical fashion, but values cannot provide relative distance.
Ranking of sportsmen, socioeconomic status, opinion

INTERVAL
It provides distance properties, i.e., it allows comparison between different individuals. Origin (zero point) is arbitrary.
Time, temperature (°C)

RATIO
It provides assignment, order, distance and origin properties. Origin (zero) has a meaning of absence.
Sales, age, number of children in a household

Measurement scale determines which statistical techniques can be applied.
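One way to picture this is to pick a measure of central tendency according to the scale. The mapping below follows a widely used textbook convention (mode for nominal, median for ordinal, mean for interval and ratio); it is an assumption for illustration, not a rule stated in the slides, and the data are hypothetical.

```python
import statistics

# Conventional choice of central tendency per measurement scale (assumption).
CENTRAL_TENDENCY = {
    "nominal": statistics.mode,         # only counting categories makes sense
    "ordinal": statistics.median_low,   # order matters, distances do not
    "interval": statistics.mean,        # distances are meaningful
    "ratio": statistics.mean,           # distances and a true zero
}

def summarize(values, scale):
    """Return the central value considered appropriate for the given scale."""
    return CENTRAL_TENDENCY[scale](values)

print(summarize(["single", "married", "married"], "nominal"))
print(summarize([1, 2, 2, 3, 5], "ordinal"))      # ranks coded as integers
print(summarize([20.0, 22.0, 24.0], "interval"))  # temperatures in °C
```

`median_low` is used for the ordinal case because it always returns one of the observed ranks rather than an average of two.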

In Statistics (probability/inference), we will usually work with RANDOM VARIABLES.
Differences between a variable and a random variable?
Realisations of a random variable hinge on probability.
A sample/dataset is a collection of realised random variables.
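A minimal sketch of this idea, assuming a fair six-sided die as the random variable (the die and the dataset size of 10 are illustrative choices):

```python
import random

def roll_die(rng):
    """One realisation of the random variable 'result of a fair die roll'."""
    return rng.randint(1, 6)

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# A dataset is a collection of realised values of the random variable.
dataset = [roll_die(rng) for _ in range(10)]
print(dataset)
```

Before rolling, only the probabilities of the outcomes are known; after rolling, the dataset holds fixed numbers.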
