
Descriptive Methods

Descriptive Statistics
• Statistical procedures used to summarise, organise, and simplify data. This process should be carried out in such a way that it reflects the overall findings.
  ◦ Raw data is made more manageable
  ◦ Raw data is presented in a logical form
  ◦ Patterns can be seen from organised data
• Tables
• Graphical techniques
• Measures of central tendency
• Measures of dispersion (variability)
• Coefficient of correlation
11 September 2015 2
Describing data with tables

1) Frequency table
2) Relative and cumulative frequency
3) Grouped frequency
4) Open-ended groups
5) Cross-tabulation
6) Tables that are not contingency tables

1) Frequency table

A picture of the frequency distribution: the first column lists the values of the variable and the last column gives the frequency of each.

Mortality (%)   Tally                        No. of ICUs
11.2-15.1       1, 1, 1, 1, 1, 1, 1, 1, 1    9
15.2-20.1       1, 1, 1, 1, 1, 1, 1, 1       8
20.2-25.1       1, 1, 1, 1, 1                5
25.2-30.1       1, 1, 1                      3
30.2-35.1       1, 1                         2

2) Relative frequency, cumulative frequency

• Relative frequency: the percentage of the total that falls in each category
• Cumulative frequency: the running total of the relative frequencies up to and including each category

Parity   No. of women   Percentage             Cumulative
                        (relative frequency)   percentage
0        5              12.5                   12.5
1        6              15                     27.5
2        14             35                     62.5
3        10             25                     87.5
4        3              7.5                    95
7        1              2.5                    97.5
8        1              2.5                    100
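
As a rough sketch of how the percentage columns are obtained, the snippet below recomputes them from the counts in the parity table. The variable names are illustrative only, and the last relative frequency is taken as 1/40 = 2.5%, which is what makes the cumulative column reach 100.

```python
from itertools import accumulate

# Counts copied from the parity table: parity -> number of women.
counts = {0: 5, 1: 6, 2: 14, 3: 10, 4: 3, 7: 1, 8: 1}

total = sum(counts.values())                             # 40 women in all
relative = [100 * c / total for c in counts.values()]    # percentage per category
cumulative = list(accumulate(relative))                  # running total of the percentages

for parity, count, rel, cum in zip(counts, counts.values(), relative, cumulative):
    print(f"{parity:>6} {count:>5} {rel:>6.1f} {cum:>6.1f}")
```

The cumulative column is simply a running sum, so its last entry should always be 100 when the relative frequencies are correct.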

3) Grouped frequency
• Grouped frequency tables work for continuous metric data. In each group the first number is the class lower limit and the second is the class upper limit; here the group width is 300 g.

Birth weight (g)   No. of infants
2700-2999          2
3000-3299          3
3300-3599          9
3600-3899          9
3900-4199          4
4200-4499          3

Note

• Frequency table: ordinal and discrete metric data
• Grouped frequency table: continuous metric data

5) Cross-tabulation
• Two variables within a single group of individuals

Breast lump    2 or fewer children                Totals
diagnosis      Yes              No
Benign         21 (84%) (66)    11 (73%) (34)     32 (100)
Malignant      4 (16%) (50)     4 (27%) (50)      8 (100)
Totals         25 (100%)        15 (100%)         40

In each cell the first percentage is the column percentage and the second is the row percentage.

1. Women with 2 or fewer children account for more of the benign breast lumps (21 of 32, 66%).
2. The malignant lumps are not influenced by parity (4 in each group, 50% each).
3. Lumps in women with 2 or fewer children tend to be benign more often than lumps in women with more than 2 children (84% versus 73%).
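
The two sets of percentages in the cross-tabulation can be reproduced from the four cell counts. The sketch below assumes that the first bracketed figure in each cell is a column percentage and the second a row percentage; the dictionary layout and names are illustrative.

```python
# Cell counts from the cross-tabulation.
# Rows: diagnosis; columns: "2 or fewer children" yes / no.
table = {
    "Benign": {"yes": 21, "no": 11},
    "Malignant": {"yes": 4, "no": 4},
}

col_totals = {c: sum(row[c] for row in table.values()) for c in ("yes", "no")}
row_totals = {d: sum(row.values()) for d, row in table.items()}

# Column percentage: share of each parity group with a given diagnosis.
col_pct = {d: {c: 100 * n / col_totals[c] for c, n in row.items()}
           for d, row in table.items()}
# Row percentage: share of each diagnosis falling in a given parity group.
row_pct = {d: {c: 100 * n / row_totals[d] for c, n in row.items()}
           for d, row in table.items()}

for d in table:
    for c in ("yes", "no"):
        print(f"{d:<10} {c:<3} n={table[d][c]:>2} "
              f"column%={col_pct[d][c]:.0f} row%={row_pct[d][c]:.0f}")
```

Column percentages answer "within each parity group, how many lumps are benign?", while row percentages answer "within each diagnosis, how is parity split?".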

6) Not contingency tables

                       Outcome
Lifetime use of        Group I (n=106)     Group II (n=226)
oral contraceptive     (breast cancer)     (no breast cancer)
Yes                    40 (38%)            138 (61%)
No                     66 (62%)            88 (39%)
Totals                 106                 226

• This is not a contingency table because two quite separate groups of individuals are involved.

Describing data with charts

1) Histogram & Frequency polygon
2) Pie chart
3) Bar chart
4) Dot plot
5) Scatter plot

Histogram & Frequency Polygon

The pie chart
• 4-7 categories
• One variable
• Start at 0° in the same order as the table

The simple bar chart

• Bars have the same width, with equal spaces between bars

The clustered bar chart

The stacked bar chart

The dot plot

• This is particularly useful with ordinal variables if the number of categories is too large for a bar chart

Scatter-plot

• Displays the relationship between two continuous variables
• Useful in the early stage of analysis when exploring data and determining whether a linear regression analysis is appropriate
• May show outliers in your data


Example 1: Age versus Systolic Blood
Pressure in a Clinical Trial

Example of a Scatter-plot matrix
(multiple pair-wise plots)

Describing data with numeric summary values

1. Numbers, percentages and proportions.
2. Summary measures of location.
3. Summary measures of dispersion.
4. Inter-quartile range.
5. Coefficient of variation.
6. Coefficient of correlation.

1- numbers, percentages and proportions

• Numbers: the numerical summaries of data
• A percentage is a proportion multiplied by 100 (categorical data).
• Prevalence: the number of existing cases in some population at a given time.
• Incidence (inception): the number of new cases occurring per 100, or per 1000, of the population during some period of time.

Measures of central tendency
• Also called measures of location
• Each gives one number which is representative of all the data
• They are the:
  ◦ Mean
  ◦ Median
  ◦ Mode

Sample Mean

• Also called the sample average or arithmetic mean
• Why is it called the sample mean? To distinguish it from the population mean

Measures of Location - Mean
Given a data set of size n: x_1, x_2, x_3, ..., x_n, the mean of the x's will be denoted by

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

How many hours of television do you watch in a week?
Example: {5, 7, 3, 38, 7} in hours, n = 5

\[ \sum_{i=1}^{5} x_i = 60 \quad \text{(sum of the data points)} \]

leading to:

\[ \bar{x} = \frac{60}{5} = 12 \text{ hours} \]

Summation Sign "Σ"
In the formula to find the mean, we use the "summation sign" Σ. This is just mathematical shorthand for "add up all of the observations":

\[ \sum_{i=1}^{n} X_i = X_1 + X_2 + X_3 + \dots + X_n \]
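
The television example can be checked in a couple of lines. This is only a minimal sketch; the variable names are not from the slides.

```python
from statistics import mean

hours = [5, 7, 3, 38, 7]      # weekly TV hours from the example

total = sum(hours)            # the "summation sign": add up all observations
x_bar = total / len(hours)    # divide by n to get the sample mean

print(total, x_bar)
print(mean(hours))            # the standard library gives the same mean
```

Note how the single value of 38 pulls the mean well above the other four observations.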

Geometric Mean Example

x       ln(x)
8       2.08
5       1.61
4       1.39
12      2.48
15      2.71
7       1.95
28      3.33
Sum: 79 15.55

The mean using the raw data is:

\[ \bar{x} = \frac{79}{7} = 11.3 \]

While on the log scale:

\[ \frac{\sum \ln x}{n} = \frac{15.55}{7} = 2.22 \]

leading to a geometric mean of: 9.22

Mean from a Positively Skewed Distribution

• When the data are positively skewed, analyses are commonly done on the log scale.
• This is done to minimize the effect of extreme observations.
• Method of obtaining the mean:
  ◦ Take the log of each data value
  ◦ Calculate the mean on the log scale
  ◦ Take the antilog of the mean to return to the original scale of measurement.
• This is called the "GEOMETRIC" mean.
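
The three steps (log, mean, antilog) can be sketched as follows, using the seven values from the geometric mean example. Natural logs are assumed, as in that example; any log base gives the same final answer provided the matching antilog is used.

```python
from math import exp, log

values = [8, 5, 4, 12, 15, 7, 28]

logs = [log(v) for v in values]          # step 1: take the log of each value
mean_log = sum(logs) / len(logs)         # step 2: mean on the log scale
geometric_mean = exp(mean_log)           # step 3: antilog back to the original scale

arithmetic_mean = sum(values) / len(values)
print(round(arithmetic_mean, 1), round(mean_log, 2), round(geometric_mean, 2))
```

The geometric mean comes out below the arithmetic mean because the large value 28 has less influence on the log scale. Python 3.8+ also offers `statistics.geometric_mean` for the same calculation.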

Mean

Advantages
• Simple and easy
• Most widely used
• Can be used for further statistical tests
• All values are included
• Does not need arrangement of data

Disadvantages
• Affected by extreme values
• Sometimes looks ridiculous, e.g. average number of children = 2.7

Median

• Value which divides the data into two equal parts after arrangement of the data into ascending or descending order

Measure of Location - Median

If the number of observations in the dataset is
• odd, the median will be the ½(n+1)th observation
• even, the median is defined as the average of the (½n)th and the (½n+1)th observations.
E.g. for {8,5,4,12,15,7,28} the median is 8.
First put the observations in order: 4,5,7,8,12,15,28
Find the ½(n+1)th (which is the 4th) observation.

Another Example of the Median

• First arrange the data in order from smallest to largest.
• If the number of data points is ODD:
  3 5 7 7 38
  The median is the value in the middle: 7
• If the number of data points is EVEN:
  3 5 7 7
  The median is the average of the two values around the middle: (5+7)/2 = 6
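
The odd/even rule can be written out directly. The helper name `median_by_position` is hypothetical; it is shown only to make the position arithmetic explicit, and it is checked against the standard library's `statistics.median`.

```python
from statistics import median

def median_by_position(values):
    """Median via the ordered-position rule: middle value, or mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the (n+1)/2-th observation (1-based), i.e. index (n+1)//2 - 1.
        return ordered[(n + 1) // 2 - 1]
    # Even n: average of the (n/2)-th and (n/2 + 1)-th observations.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

for sample in ([8, 5, 4, 12, 15, 7, 28], [3, 5, 7, 7, 38], [3, 5, 7, 7]):
    print(sample, median_by_position(sample), median(sample))
```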

Median

Advantages
• Not affected by extreme values
• Used for growth curves and income
• Can be determined graphically

Disadvantages
• Needs arrangement of data
• Difficult to calculate from large amounts of data
• Not all values are represented

Final Measure of Location - Mode
• It is the most common value found in the dataset (the "fashionable" value)
  Hb level of 5 pregnant women: 12, 12.5, 11, 13, 12.5; Mode = 12.5
  Hb level of 6 pregnant women: 12, 12.5, 11, 13, 12.5, 8; Mode = 12.5
• More than one mode may occur (bimodal, trimodal)
• Sometimes there is no mode.
• The mode is not used widely in analytical statistics because of the ambiguity in its definition.
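
A small sketch of how the mode might be found, including the bimodal and "no mode" cases. The function name `modes` is hypothetical, and treating a dataset in which every value occurs once as having no mode is a convention assumed here, not something the slides define.

```python
from collections import Counter

def modes(values):
    """Return every most-frequent value; an empty list if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []                      # assumed convention: no repeated value, no mode
    return [v for v, c in counts.items() if c == top]

print(modes([12, 12.5, 11, 13, 12.5]))      # Hb level of 5 pregnant women
print(modes([12, 12.5, 11, 13, 12.5, 8]))   # Hb level of 6 pregnant women
print(modes([1, 1, 2, 2, 3]))               # two values tie: bimodal
print(modes([1, 2, 3]))                     # nothing repeats
```

That a convention has to be chosen at all illustrates the ambiguity mentioned above.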

Mode

Advantages
• Not affected by extreme values

Disadvantages
• Not all values are represented

Distribution Characteristics

• Mode: Peak(s)
• Median: Equal areas point
• Mean: Balancing point

[Figure: distribution curve with the mode, median, and mean marked.]

Shapes of Distributions

• Right skewed (positively skewed)
  ◦ Long right tail
  ◦ Mean > Median

[Figure: right-skewed curve with the mode, median, and mean marked.]

Shapes of Distributions

• Left skewed (negatively skewed)
  ◦ Long left tail
  ◦ Mean < Median

[Figure: left-skewed curve with, from left to right, the mean, median, and mode marked.]

Shapes of Distributions
• Symmetric (right and left sides are mirror images)
  ◦ Left tail looks like right tail
  ◦ Mean = Median = Mode

[Figure: symmetric curve with the mean, median, and mode at the same point.]

Choosing the most appropriate measure

Type of variable     Mode   Median                       Mean
Nominal              yes    no                           no
Ordinal              yes    yes                          no
Metric discrete      yes    yes, when markedly skewed    yes
Metric continuous    yes    yes, when markedly skewed    yes

Measures of Dispersion
• After we know the mean of a set of measurements, it is often of interest to measure the degree of variation or dispersion around the mean.
• The measurement of dispersion (or variation) plays an important role in the methods of statistical inference.
• We will discuss:
  ◦ Range
  ◦ Variance
  ◦ Standard deviation

Range
• Difference between the highest and lowest value.
  Range = largest value - smallest value
• E.g. Hb level of 5 pregnant women: 12, 12.5, 11, 13, 12.5
  Range = 13 - 11 = 2
• E.g. Hb level of 6 pregnant women: 12, 12.5, 11, 13, 12.5, 8
  Range = 13 - 8 = 5

Range

Advantages
• Easy to calculate

Disadvantages
• It is affected by extreme values.
• The value of the range is determined by only two values.
• The interpretation of the range is difficult.
• It does not provide information about other values and how dispersed they are.

Variance and Standard Deviation

• Uses deviations from the mean to measure the variation in the dataset.
• The variance is obtained by squaring these deviations and dividing their sum by one less than n.

\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}{n - 1} \]

Variance Example
• Consider the dataset {8,5,4,12,15,5,7}
  ◦ Use the VARIANCE function in SPSS to calculate, OR
  ◦ Use the data from the table

x      x - mean    (x - mean)^2
8      0           0
5      -3          9
4      -4          16
12     4           16
15     7           49
5      -3          9
7      -1          1

\[ \bar{x} = 8, \qquad \sum_{i=1}^{n} (x_i - \bar{x})^2 = 100, \qquad s^2 = 100 / 6 = 16.67 \]
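
The worked example can be reproduced with both the definitional form (sum of squared deviations) and the equivalent computational form (sum of squares minus the squared sum over n). This is a minimal sketch in Python rather than SPSS; `statistics.variance` is used only as a cross-check.

```python
from math import sqrt
from statistics import variance

data = [8, 5, 4, 12, 15, 5, 7]
n = len(data)
x_bar = sum(data) / n

# Definitional form: squared deviations from the mean, divided by n - 1.
sum_sq_dev = sum((x - x_bar) ** 2 for x in data)
s2_definitional = sum_sq_dev / (n - 1)

# Computational form: needs only the sum and the sum of squares of the data.
s2_computational = (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)

sd = sqrt(s2_definitional)   # standard deviation, in the same units as the data

print(x_bar, sum_sq_dev, round(s2_definitional, 2), round(sd, 2))
print(round(variance(data), 2))   # library cross-check
```

Dividing by n - 1 rather than n gives the sample variance, which is the version used throughout these slides.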

Standard Deviation
• Roughly the average deviation of values around the mean (the square root of the variance)

\[ SD = s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]

• s has the advantage of being in the same units as the original variable x
• From the previous example, sd = √16.67 = 4.08

Standard deviation

Hb      Mean    Deviation from mean (x - mean)    (x - mean)^2
12      11.5    0.5                               0.25
12.5    11.5    1                                 1
11      11.5    -0.5                              0.25
13      11.5    1.5                               2.25
12.5    11.5    1                                 1
8       11.5    -3.5                              12.25

Standard deviation

• The squared deviations sum to 17, so variance = 17/(6-1) = 3.4
• SD = √variance = √3.4 = 1.84
• Hb level of 6 pregnant women: 12, 12.5, 11, 13, 12.5, 8
  Mean = 11.5
  SD = 1.84
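
The Hb calculation can be traced row by row. In this sketch each deviation is computed as x - mean, so values above the mean are positive and the single low value of 8 is negative; the signs do not affect the squared column.

```python
from statistics import mean, stdev

hb = [12, 12.5, 11, 13, 12.5, 8]   # Hb level of 6 pregnant women

m = mean(hb)
deviations = [x - m for x in hb]          # x - mean for each woman
squared = [d ** 2 for d in deviations]    # (x - mean)^2

var = sum(squared) / (len(hb) - 1)        # divide by n - 1
sd = var ** 0.5

for x, d, s in zip(hb, deviations, squared):
    print(f"{x:>5} {m:>5} {d:>5} {s:>6}")
print(sum(squared), var, round(sd, 2))
```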

Standard deviation

• If the mean Hb of 10 women is 11.5 and the SD is 3, what does this tell you about the dispersion of these values around the mean as compared to the previous example?

E.g. Systolic blood pressure
• Smoking males: 120 130 120 150 130 170 180 160 170 150
  Mean SBP = 148
  Range = 180 - 120 = 60
  SD = 22
• Non-smoking males: 110 130 120 140 130 150 160 130 130 150
  Mean SBP = 135
  Range = 160 - 110 = 50
  SD = 15.1
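
The summary figures for the two groups can be recomputed as a check. The split of the readings into two groups of ten is an assumption (the first four numbers in each full row of the slide are taken as smokers, and the first two in the last row); it reproduces the stated means, ranges, and SDs. The helper `summarise` is hypothetical.

```python
from statistics import mean, stdev

# Assumed split of the slide's readings into the two groups.
smokers = [120, 130, 120, 150, 130, 170, 180, 160, 170, 150]
non_smokers = [110, 130, 120, 140, 130, 150, 160, 130, 130, 150]

def summarise(values):
    """Mean, range, and sample standard deviation of a list of readings."""
    return {
        "mean": mean(values),
        "range": max(values) - min(values),
        "sd": stdev(values),
    }

s = summarise(smokers)
ns = summarise(non_smokers)
print(s)
print(ns)
```

Both the range and the SD are larger for the smokers, so their readings are more spread out around a higher mean.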

Inter-quartile range
• The median divides a distribution into two halves.
• The first and third quartiles (denoted Q1 and Q3) are defined as follows:
  ◦ 25% of the data lie below Q1 (and 75% lie above Q1),
  ◦ 25% of the data lie above Q3 (and 75% lie below Q3)
• The inter-quartile range (IQR) is the difference between the first and third quartiles, i.e.
  IQR = Q3 - Q1

Example
• The ordered blood pressure data is:
  113 124 124 132 146 151 170
  with Q1 = 124 and Q3 = 151.
• Inter-quartile range (IQR) is 151 - 124 = 27
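
A short sketch of the same calculation with Python's standard library. Quartile definitions differ between packages; `statistics.quantiles` with its default `"exclusive"` method places Q1 at position (n+1)/4 and Q3 at position 3(n+1)/4, which for these seven values picks the 2nd and 6th observations and so agrees with the example. Other software may report slightly different quartiles.

```python
from statistics import quantiles

bp = [113, 124, 124, 132, 146, 151, 170]   # ordered blood pressure data

q1, q2, q3 = quantiles(bp, n=4)   # three cut points: Q1, median, Q3
iqr = q3 - q1

print(q1, q2, q3, iqr)
```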

Coefficient of Variation

• The coefficient of variation (CV) is the sample standard deviation expressed as a percentage of the mean, i.e.

\[ CV = \frac{s}{\bar{x}} \times 100\% \]

• A measure of spread that is independent of the units of measurement of the variables.
• Consequently, it is a useful way of comparing the dispersion of variables measured on different scales.
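
To illustrate comparing spread across different scales, the sketch below applies the CV to two datasets used earlier in these slides: the Hb levels of 6 pregnant women and the systolic blood pressures of the smoking males (ten readings, grouped as assumed from the slide layout). The function name `cv` is hypothetical.

```python
from statistics import mean, stdev

def cv(values):
    """Coefficient of variation: sample SD as a percentage of the mean."""
    return 100 * stdev(values) / mean(values)

hb = [12, 12.5, 11, 13, 12.5, 8]                                   # Hb levels
sbp_smokers = [120, 130, 120, 150, 130, 170, 180, 160, 170, 150]   # SBP readings

print(round(cv(hb), 1), round(cv(sbp_smokers), 1))
```

Although the SBP readings have a much larger SD in absolute terms, their CV is slightly smaller than that of the Hb values, because the CV scales each SD by its own mean.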

Coefficient of Correlation
• Measure of linear association/relationship between two continuous variables.
• Setting:
  ◦ Two measurements are made for each observation.
  ◦ The sample consists of pairs of values and you want to determine the association between the variables.

Association Examples
• Example 1: Association between a mother's weight and the birth weight of her child
  ◦ 2 measurements: mother's weight and baby's weight
  ◦ Both continuous measures
• Example 2: Association between a risk factor and a disease
  ◦ 2 measurements: disease status and risk factor status
  ◦ Both dichotomous measurements

Birth Weight Data

x - birth weight in ounces
y - increase in weight between the 70th and 100th days of life, expressed as a percentage of birth weight

x (oz)   y (%)
112      63
111      66
107      72
119      52
92       75
80       118
81       120
84       114
118      42
106      72
103      90
94       91

Pearson Correlation Coefficient

[Figure: scatter plot titled "Birth Weight Data", with birth weight (in ounces) on the horizontal axis and increase in birth weight (%) on the vertical axis.]

Pearson Correlation Results

          x (oz)      y (%)
x (oz)    1
y (%)     -0.94629    1

Pearson Correlation Coefficient = -0.946

Interpretation:
• values near 1 indicate a strong positive linear relationship
• values near -1 indicate a strong negative linear relationship
• values near 0 indicate a weak linear association
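
The coefficient can be recomputed from the twelve (x, y) pairs of the birth weight data. This is a minimal sketch of the usual Pearson formula, r = S_xy / sqrt(S_xx · S_yy); the function name `pearson_r` is hypothetical. Python 3.10+ also provides `statistics.correlation`, which should give the same value.

```python
from math import sqrt

# Birth weight data: x in ounces, y = % increase in weight between days 70 and 100.
x = [112, 111, 107, 119, 92, 80, 81, 84, 118, 106, 103, 94]
y = [63, 66, 72, 52, 75, 118, 120, 114, 42, 72, 90, 91]

def pearson_r(xs, ys):
    """Pearson correlation from sums of squares and cross-products about the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)

r = pearson_r(x, y)
print(round(r, 5))
```

The strongly negative value says that, in this sample, babies with a lower birth weight tended to show a larger percentage weight gain.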

CAUTION!!!!
• Interpreting the correlation coefficient should be done cautiously!
• A result of 0 does not mean there is NO relationship. It means there is no linear association.
• There may be a perfect non-linear association.
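
A made-up illustration of this caution (the data are invented for the purpose, not taken from the slides): y is determined exactly by x through y = x², yet the Pearson coefficient is zero because the relationship is symmetric rather than linear. The function `pearson_r` is a hypothetical helper.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation from sums of squares and cross-products about the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)

x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]        # perfect, but non-linear, dependence on x

r = pearson_r(x, y)
print(r)
```

This is why a scatter plot should be inspected before a correlation coefficient is interpreted.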

Uses of Statistics
• Data presentation
• Simplifies large numbers of figures and reduces the volume of data
• Enables comparisons across different groups
• Helps us to form and test hypotheses
• Helps in prediction, planning and administration
• Helps form suitable policies
• Helps measure standard of health

Thank you
