Descriptive Methods

Descriptive Statistics
• Statistical procedures used to summarise, organise, and simplify data. The summary should be produced in a way that reflects the overall findings.
• Raw data is made more manageable
• Raw data is presented in a logical form
• Patterns can be seen from organised data
The main descriptive tools are:
• Tables
• Graphical techniques
• Measures of central tendency
• Measures of dispersion (variability)
• Coefficient of correlation
11 September 2015 2
Describing data with tables
1) Frequency table
2) Relative and cumulative frequency
3) Grouped frequency
4) Open-ended groups
5) Cross-tabulation
6) Tables that are not contingency tables
1) Frequency table
A picture of the frequency distribution: each value (or class) of the variable is listed with its frequency.

Mortality (%)    Tally        No. of ICUs
11.2-15.1        |||||||||    9
15.2-20.1        ||||||||     8
20.2-25.1        |||||        5
25.2-30.1        |||          3
30.2-35.1        ||           2
2) Relative frequency, cumulative frequency
• Relative frequency: the frequency of a category expressed as a percentage of the total
• Cumulative frequency: the running total of the relative frequencies up to and including a category

Parity    No. of women    Percentage (relative frequency)    Cumulative percentage
0         5               12.5                               12.5
1         6               15                                 27.5
2         14              35                                 62.5
3         10              25                                 87.5
4         3               7.5                                95
7         1               2.5                                97.5
8         1               2.5                                100
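The percentages in the parity table can be recomputed from the counts alone. The following is a minimal Python sketch; the names `parity_counts` and `relative_and_cumulative` are illustrative and do not come from the slides.

```python
# Counts of women by parity, taken from the table above.
parity_counts = {0: 5, 1: 6, 2: 14, 3: 10, 4: 3, 7: 1, 8: 1}


def relative_and_cumulative(counts):
    """Return rows of (category, count, percentage, cumulative percentage)."""
    total = sum(counts.values())
    rows = []
    running = 0  # running total of the counts seen so far
    for category in sorted(counts):
        running += counts[category]
        rows.append((
            category,
            counts[category],
            100 * counts[category] / total,  # relative frequency in %
            100 * running / total,           # cumulative frequency in %
        ))
    return rows


rows = relative_and_cumulative(parity_counts)
for category, count, pct, cum_pct in rows:
    print(f"{category:>6} {count:>6} {pct:>8.1f} {cum_pct:>8.1f}")
```

With 40 women in total, each of the single women at parity 7 and parity 8 contributes 2.5%, and the cumulative percentage reaches 100 at the last row.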
3) Grouped frequency
• Grouped frequency: works for continuous metric data

Birth weight (g)    No. of infants
2700-2999           2
3000-3299           3
3300-3599           9
3600-3899           9
3900-4199           4
4200-4499           3

Each group has a width of 300 g. The smaller value bounding a group is the class lower limit and the larger value is the class upper limit.
Note
• Frequency table: ordinal and discrete metric data
• Grouped frequency table: continuous metric data
5) Cross-tabulation
• Two variables measured within a single group of individuals

Breast lump     2 or fewer children
diagnosis       Yes                No                 Totals
Benign          21 (84%) (66%)     11 (73%) (34%)     32 (100%)
Malignant       4 (16%) (50%)      4 (27%) (50%)      8 (100%)
Totals          25 (100%)          15 (100%)          40

In each cell the first percentage is of the column total and the second is of the row total.

1. Women with 2 or fewer children have more benign breast lumps.
2. The malignant lumps are not influenced by parity.
3. Lumps in women with 2 or fewer children tend to be benign more often than lumps in women with more than 2 children.
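The two sets of percentages in the cross-tabulation (of the column total and of the row total) can be reproduced from the four cell counts. This is a small Python sketch; the dictionary layout and the helper names `column_percent` and `row_percent` are assumptions made for illustration.

```python
# Cell counts from the cross-tabulation above:
# rows = diagnosis, columns = "2 or fewer children" (Yes / No).
table = {
    "Benign": {"Yes": 21, "No": 11},
    "Malignant": {"Yes": 4, "No": 4},
}

# Marginal totals for rows and columns.
row_totals = {diag: sum(cells.values()) for diag, cells in table.items()}
col_totals = {}
for cells in table.values():
    for col, count in cells.items():
        col_totals[col] = col_totals.get(col, 0) + count


def column_percent(diag, col):
    """Cell count as a percentage of its column total."""
    return 100 * table[diag][col] / col_totals[col]


def row_percent(diag, col):
    """Cell count as a percentage of its row total."""
    return 100 * table[diag][col] / row_totals[diag]


for diag in table:
    for col in ("Yes", "No"):
        print(diag, col,
              round(column_percent(diag, col)),
              round(row_percent(diag, col)))
```

Reading down a column compares diagnoses within a parity group (84% benign versus 73% benign), while reading across a row shows how each diagnosis is split between the parity groups.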
6) Not contingency tables

                       Outcome
Lifetime use of        Group I (n=106)      Group II (n=226)
oral contraceptive     (breast cancer)      (no breast cancer)
Yes                    40 (38%)             138 (61%)
No                     66 (62%)             88 (39%)
Totals                 106                  226

• This is not a contingency table because two quite separate groups of individuals are involved.
Describing data with charts
1) Histogram & frequency polygon
2) Pie chart
3) Bar chart
4) Dot plot
5) Scatter plot
Histogram & Frequency Polygon
The pie chart
• Best with 4-7 categories
• Shows one variable
• Start at 0° and keep the categories in the same order as the table
The simple bar chart
• Bars have the same width, with equal spaces between them
The clustered bar chart
The stacked bar chart
The dot plot
• Particularly useful with ordinal variables if the number of categories is too large for a bar chart
Scatter plot
• Displays the relationship between two continuous variables
• Useful in the early stages of analysis, when exploring data and determining whether a linear regression analysis is appropriate
• May show outliers in your data
[Figure: Example 1, age versus systolic blood pressure in a clinical trial]
[Figure: Example of a scatter-plot matrix (multiple pair-wise plots)]
Describing data with numeric summary values
1. Numbers, percentages and proportions
2. Summary measures of location
3. Summary measures of dispersion
4. Inter-quartile range
5. Coefficient of variation
6. Coefficient of correlation
1- Numbers, percentages and proportions
• Numbers: the numerical summaries of data
• A percentage is a proportion multiplied by 100 (categorical data)
• Prevalence: the number of existing cases in some population at a given time
• Incidence (inception): the number of new cases occurring per 100, or per 1000, of the population during some period of time
Measures of central tendency
• Also called measures of location
• Give one number that is representative of all the data
• They are the:
  ◦ Mean
  ◦ Median
  ◦ Mode
Sample Mean
• Also called the sample average or arithmetic mean
• Why is it called the sample mean? To distinguish it from the population mean
Measures of Location - Mean
Given a data set of size $n$: $x_1, x_2, x_3, \ldots, x_n$, the mean of the $x$'s is denoted by
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
How many hours of television do you watch in a week?
Example: {5, 7, 3, 38, 7} in hours, $n = 5$
$$\sum_{i=1}^{5} x_i = 60 \quad \text{(sum of the data points)}$$
leading to: $\bar{x} = \frac{60}{5} = 12$ hours
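The television example can be checked with a few lines of Python. This is only a sketch; `sample_mean` is an illustrative name, not something defined in the slides.

```python
# Weekly hours of television from the example above.
hours = [5, 7, 3, 38, 7]


def sample_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


mean_hours = sample_mean(hours)
print(mean_hours)  # 12.0
```

Note how the single value of 38 pulls the mean to 12 hours, well above four of the five observations.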
Summation Sign "Σ"
In the formula to find the mean, we use the "summation sign", $\sum$. This is just mathematical shorthand for "add up all of the observations":
$$\sum_{i=1}^{n} X_i = X_1 + X_2 + X_3 + \cdots + X_n$$
Geometric Mean Example

x        ln(x)
8        2.08
5        1.61
4        1.39
12       2.48
15       2.71
7        1.95
28       3.33
Total
79       15.55

The mean using the raw data is: $\bar{x} = \frac{79}{7} = 11.3$
While on the log scale: $\frac{\sum \ln x}{n} = \frac{15.55}{7} = 2.22$
leading to a geometric mean of 9.22
Mean from a Positively Skewed Distribution
• When the data are positively skewed, analyses are commonly done on the log scale.
• This is done to minimize the effect of extreme observations.
• Method of obtaining the mean:
  ◦ Take the log of each data value
  ◦ Calculate the mean on the log scale
  ◦ Take the antilog of the mean to return to the original scale of measurement.
This is called the "GEOMETRIC" mean.
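The three steps (log, mean, antilog) can be sketched in Python as follows, using the seven values from the geometric mean example. The function name `geometric_mean` is chosen for illustration, and natural logarithms are assumed, as in the ln(x) column of the example.

```python
import math

# Data from the geometric mean example.
data = [8, 5, 4, 12, 15, 7, 28]


def geometric_mean(values):
    """Antilog of the mean of the natural logs (values must be positive)."""
    logs = [math.log(v) for v in values]  # step 1: take the log of each value
    mean_log = sum(logs) / len(logs)      # step 2: mean on the log scale
    return math.exp(mean_log)             # step 3: antilog back to original scale


arithmetic = sum(data) / len(data)
geometric = geometric_mean(data)
print(round(arithmetic, 1), round(geometric, 2))
```

The geometric mean (about 9.2) sits below the arithmetic mean (about 11.3) because the log transform reduces the influence of the largest value, 28.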
Mean
Advantages
• Simple and easy
• Most widely used
• Can be used for further statistical tests
• All values are included
• Does not need arrangement of data
Disadvantages
• Affected by extreme values
• Sometimes looks ridiculous, e.g. average number of children = 2.7
Median
• The value that divides the data into two equal parts after the data are arranged in ascending or descending order
Measure of Location - Median
If the number of observations in the dataset is
• odd, the median is the ½(n+1)th observation
• even, the median is defined as the average of the (½n)th and the (½n+1)th observations.
E.g. for {8,5,4,12,15,7,28} the median is 8.
First put the observations in order: 4,5,7,8,12,15,28
Then find the ½(n+1)th (which is the 4th) observation.
Another Example of the Median
• First arrange the data in order from smallest to largest.
• If the number of data points is ODD:
3 5 7 7 38
The median is the value in the middle: 7
• If the number of data points is EVEN:
3 5 7 7
The median is the average of the two values around the middle: (5+7)/2 = 6
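The odd and even cases can be written as one short function. This is a minimal Python sketch; `median_of` is an illustrative name, and the positions in the comments are the 1-based positions used in the rule for odd and even n.

```python
def median_of(values):
    """Median of a non-empty list, following the odd/even rule."""
    ordered = sorted(values)  # arrange the data first
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)th observation, index n // 2 when 0-based.
        return ordered[middle]
    # Even n: average of the (n / 2)th and (n / 2 + 1)th observations.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median_of([8, 5, 4, 12, 15, 7, 28]))  # 8
print(median_of([3, 5, 7, 7, 38]))          # 7
print(median_of([3, 5, 7, 7]))              # 6.0
```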
Median
Advantages
• Not affected by extreme values
• Used for growth curves and income
• Can be determined graphically
Disadvantages
• Needs arrangement of data
• Difficult to calculate from large amounts of data
• Not all values are represented
Final Measure of Location - Mode
• It is the most common value found in the dataset (the "fashionable" value)
Hb level of 5 pregnant women
12, 12.5, 11, 13, 12.5    Mode = 12.5
Hb level of 6 pregnant women
12, 12.5, 11, 13, 12.5, 8    Mode = 12.5
• More than one mode may occur (bimodal, trimodal)
• Sometimes there is no mode.
• The mode is not used widely in analytical statistics because of the ambiguity in its definition.
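The cases above (one mode, several modes, no mode) can be illustrated with a small Python sketch. The function name `modes` is illustrative, and treating a dataset in which every value occurs exactly once as having "no mode" is an assumption made here; other conventions exist, which is part of the ambiguity mentioned above.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s) as a sorted list.

    Assumption: if no value repeats, the data are treated as having
    no mode and an empty list is returned.
    """
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([12, 12.5, 11, 13, 12.5]))     # [12.5]
print(modes([12, 12.5, 11, 13, 12.5, 8]))  # [12.5]
print(modes([1, 1, 2, 2, 3]))              # [1, 2]  (bimodal)
print(modes([1, 2, 3]))                    # []      (no mode)
```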
Mode
Advantages
• Not affected by extreme values
Disadvantages
• Not all values are represented
Distribution Characteristics
• Mode: peak(s)
• Median: equal-areas point
• Mean: balancing point
[Figure: distribution curve with the mode, median and mean marked]
Shapes of Distributions
• Right skewed (positively skewed)
  ◦ Long right tail
  ◦ Mean > Median
[Figure: right-skewed curve with the mode, median and mean marked]
• Left skewed (negatively skewed)
  ◦ Long left tail
  ◦ Mean < Median
[Figure: left-skewed curve with the mean, median and mode marked]
• Symmetric (right and left sides are mirror images)
  ◦ Left tail looks like right tail
  ◦ Mean = Median = Mode
[Figure: symmetric curve with the mean, median and mode marked]
Choosing the most appropriate measure

                     Mode    Median                       Mean
Nominal              yes     no                           no
Ordinal              yes     yes                          no
Metric discrete      yes     yes, when markedly skewed    yes
Metric continuous    yes     yes, when markedly skewed    yes
Measures of Dispersion
• After we know the mean of a set of measurements, it is often of interest to measure the degree of variation or dispersion around the mean.
• The measurement of dispersion (or variation) plays an important role in the methods of statistical inference.
• We will discuss:
  ◦ Range
  ◦ Variance
  ◦ Standard deviation
Range
• Difference between the highest and lowest values.
Range = largest value - smallest value
• E.g. Hb level of 5 pregnant women: 12, 12.5, 11, 13, 12.5
  Range = 13 - 11 = 2
• E.g. Hb level of 6 pregnant women: 12, 12.5, 11, 13, 12.5, 8
  Range = 13 - 8 = 5
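A short Python sketch of the two Hb examples shows how one extra low value changes the range. The name `value_range` is illustrative.

```python
def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


hb_five = [12, 12.5, 11, 13, 12.5]
hb_six = [12, 12.5, 11, 13, 12.5, 8]

print(value_range(hb_five))  # 2
print(value_range(hb_six))   # 5
```

Adding the single observation of 8 more than doubles the range, even though the other five values are unchanged.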
Range
• Easy to calculate
• It is affected by extreme values.
• The value of the range is determined by only two values.
• The interpretation of the range is difficult.
• It does not provide information about the other values and how dispersed they are.
Variance and Standard Deviation
• Uses deviations from the mean to measure the variation in the dataset.
• The variance is obtained by squaring these deviations and dividing their sum by one less than n.
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}$$
Variance Example
• Consider the dataset {8,5,4,12,15,5,7}
  ◦ Use the VARIANCE function in SPSS to calculate, OR
  ◦ Use the data from the table

x_i     x_i - x̄    (x_i - x̄)²
8       0           0
5       -3          9
4       -4          16
12      4           16
15      7           49
5       -3          9
7       -1          1

$\bar{x} = 8$
$\sum_{i=1}^{n} (x_i - \bar{x})^2 = 100$
$s^2 = 100/6 = 16.67$
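Both forms of the sample variance (from the deviations, and from the sums of x and x squared) can be checked on the example dataset. This is a Python sketch with illustrative function names; it is not the SPSS routine mentioned above.

```python
# Dataset from the variance example.
data = [8, 5, 4, 12, 15, 5, 7]


def variance_from_deviations(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def variance_shortcut(values):
    """Equivalent form: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(values)
    sum_x = sum(values)
    sum_x_squared = sum(x * x for x in values)
    return (sum_x_squared - sum_x ** 2 / n) / (n - 1)


print(round(variance_from_deviations(data), 2))  # 16.67
print(round(variance_shortcut(data), 2))         # 16.67
```

Here the sum of the values is 56 and the sum of their squares is 548, so the shortcut form gives (548 - 448) / 6, the same 100 / 6 as the table of deviations.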
Standard Deviation
• A measure of the average deviation of values around the mean (the square root of the variance)
$$SD = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$
• s has the advantage of being in the same units as the original variable x
• From the previous example, SD = $\sqrt{16.67}$ = 4.08
Standard deviation

Hb      Mean    Deviation from mean (x - mean)    (x - mean)²
12      11.5    0.5                               0.25
12.5    11.5    1                                 1
11      11.5    -0.5                              0.25
13      11.5    1.5                               2.25
12.5    11.5    1                                 1
8       11.5    -3.5                              12.25

• Sum of squared deviations = 17
• Variance = 17/(6-1) = 3.4
• SD = √variance = √3.4 = 1.84
• Hb level of 6 pregnant women: 12, 12.5, 11, 13, 12.5, 8
  Mean = 11.5
  SD = 1.84
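The Hb calculation can be reproduced with Python's standard `statistics` module, whose `variance` and `stdev` functions use the n - 1 divisor for a sample. The variable names are illustrative.

```python
import statistics

# Hb levels of the 6 pregnant women.
hb = [12, 12.5, 11, 13, 12.5, 8]

hb_mean = statistics.mean(hb)
# Deviations are taken as x - mean, so values below the mean are negative.
deviations = [x - hb_mean for x in hb]
sum_of_squares = sum(d ** 2 for d in deviations)

hb_variance = statistics.variance(hb)  # sample variance, divisor n - 1
hb_sd = statistics.stdev(hb)           # square root of the sample variance

print(hb_mean, sum_of_squares, round(hb_variance, 2), round(hb_sd, 2))
```

The deviations sum to zero, which is a quick check that the mean and the signs of the deviations are right.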
Standard deviation
• If the mean Hb of 10 women is 11.5 and the SD is 3, what does this tell you about the dispersion of these values around the mean as compared to the previous example?
E.g. Systolic blood pressure (SBP)
• Smoking males: 120, 130, 120, 150, 130, 170, 180, 160, 170, 150
  Mean SBP = 148
  Range = 180 - 120 = 60
  SD = 22
• Non-smoking males: 110, 130, 120, 140, 130, 150, 160, 130, 130, 150
  Mean SBP = 135
  Range = 160 - 110 = 50
  SD = 15.1
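The summary values for the two groups can be recomputed as a check. In this Python sketch the assignment of readings to groups follows the two-column layout of the slide (the first four values of each row of numbers are the smokers); that reading is an assumption, supported by the fact that it reproduces the stated means, ranges and SDs. The helper name `summarise` is illustrative.

```python
import statistics

smokers = [120, 130, 120, 150, 130, 170, 180, 160, 170, 150]
non_smokers = [110, 130, 120, 140, 130, 150, 160, 130, 130, 150]


def summarise(values):
    """Mean, range and sample standard deviation of a list of readings."""
    return {
        "mean": statistics.mean(values),
        "range": max(values) - min(values),
        "sd": statistics.stdev(values),
    }


for label, group in (("Smoking", smokers), ("Non-smoking", non_smokers)):
    s = summarise(group)
    print(label, s["mean"], s["range"], round(s["sd"], 1))
```

Both the range and the SD show that SBP is more dispersed among the smokers than among the non-smokers.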
Inter-quartile range
• The median divides a distribution into two halves.
• The first and third quartiles (denoted Q1 and Q3) are defined as follows:
  ◦ 25% of the data lie below Q1 (and 75% lie above Q1)
  ◦ 25% of the data lie above Q3 (and 75% lie below Q3)
• The inter-quartile range (IQR) is the difference between the third and first quartiles, i.e. IQR = Q3 - Q1
Example
• The ordered blood pressure data are:
113 124 124 132 146 151 170
(Q1 = 124, Q3 = 151)
• The inter-quartile range (IQR) is 151 - 124 = 27
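The quartiles in this example can be obtained with `statistics.quantiles` from the Python standard library (Python 3.8 or later). The sketch below uses its "exclusive" method, which places Q1 and Q3 at the (n+1)/4th and 3(n+1)/4th ordered observations; for these seven values that is the 2nd and 6th observations. The choice of method is an assumption: other quartile conventions exist and can give slightly different values for the same data.

```python
import statistics

# Ordered blood pressure data from the example.
blood_pressure = [113, 124, 124, 132, 146, 151, 170]

# n=4 splits the data into quarters and returns the three cut points.
q1, q2, q3 = statistics.quantiles(blood_pressure, n=4, method="exclusive")
iqr = q3 - q1

print(q1, q2, q3, iqr)
```

The middle cut point, `q2`, is the median (132), and the IQR of 27 describes the spread of the middle half of the data without being affected by the extreme value of 170.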
Coefficient of Variation
• The coefficient of variation (CV) is the sample standard deviation expressed as a percentage of the mean, i.e.
$$CV = \frac{s}{\bar{x}} \times 100\%$$
• A measure of spread that is independent of the units of measurement of the variable.
• Consequently, a useful way of comparing the dispersion of variables measured on different scales
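As a rough illustration of why the CV does not depend on the units, the Python sketch below computes it for the Hb values of the six pregnant women used earlier, and again after multiplying every value by 10 (as a change of units would). The function name is illustrative.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return 100 * statistics.stdev(values) / statistics.mean(values)


hb = [12, 12.5, 11, 13, 12.5, 8]
hb_rescaled = [10 * x for x in hb]  # same data on a scale 10 times larger

cv = coefficient_of_variation(hb)
cv_rescaled = coefficient_of_variation(hb_rescaled)

print(round(cv, 1), round(cv_rescaled, 1))
```

Rescaling multiplies both the SD and the mean by 10, so their ratio, and therefore the CV of about 16%, is unchanged.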
Coefficient of Correlation
• Measure of linear association/relationship between two continuous variables.
• Setting:
  ◦ Two measurements are made for each observation.
  ◦ The sample consists of pairs of values and you want to determine the association between the variables.
Association Examples
• Example 1: Association between a mother's weight and the birth weight of her child
  ◦ 2 measurements: mother's weight and baby's weight
  ◦ Both continuous measures
• Example 2: Association between a risk factor and a disease
  ◦ 2 measurements: disease status and risk factor status
  ◦ Both dichotomous measurements
Birth Weight Data
x: birth weight in ounces
y: increase in weight between the 70th and 100th days of life, expressed as a percentage of birth weight

x (oz)    y (%)
112       63
111       66
107       72
119       52
92        75
80        118
81        120
84        114
118       42
106       72
103       90
94        91
Pearson Correlation Coefficient
[Figure: scatter plot of the birth weight data, with birth weight (in ounces) on the horizontal axis and increase in birth weight (%) on the vertical axis]
Pearson Correlation Results

          x (oz)      y (%)
x (oz)    1
y (%)     -0.94629    1

Pearson correlation coefficient = -0.946
Interpretation:
• values near 1 indicate a strong positive linear relationship
• values near -1 indicate a strong negative linear relationship
• values near 0 indicate a weak linear association
CAUTION!!!!
• Interpreting the correlation coefficient should be done cautiously!
• A result of 0 does not mean there is NO relationship. It means there is no linear association.
• There may be a perfect non-linear association.
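Both the birth weight result and the caution above can be checked with a short Python sketch. The function `pearson_r` is an illustrative implementation of the usual product-moment formula, and the second dataset (y equal to x squared on values symmetric about zero) is a made-up example, not data from the slides.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Birth weight data: x in ounces, y = percentage increase in weight.
birth_weight = [112, 111, 107, 119, 92, 80, 81, 84, 118, 106, 103, 94]
weight_gain = [63, 66, 72, 52, 75, 118, 120, 114, 42, 72, 90, 91]
r_birth = pearson_r(birth_weight, weight_gain)

# Hypothetical data with a perfect but non-linear (quadratic) relationship.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]
r_quadratic = pearson_r(xs, ys)

print(round(r_birth, 3), r_quadratic)
```

The first coefficient matches the -0.946 reported for the birth weight data. The second is 0 even though y is completely determined by x, which is exactly the situation the caution describes.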
Uses of Statistics
• Data presentation
• Simplifies large numbers of figures and reduces the volume of data
• Enables comparisons across different groups
• Helps us to form and test hypotheses
• Helps in prediction, planning and administration
• Helps form suitable policies
• Helps measure standards of health
Thank you
