
Define Statistics

Discuss the significance of Statistics for Physicians

Suggest study strategies for learning Statistics

Present the role of Statistics in the scientific process

Review basic concepts of Statistics

Introduce methods of Exploratory Data Analysis

Amazingly, it is widely considered acceptable for medical researchers to be ignorant of statistics. Many are not ashamed (and some seem proud) to admit that they don't know anything about Statistics.¹
Doctors may not be expected to be experts in statistics, but they should be capable of understanding basic statistical methodology.²
Medical students may not like statistics, but as good doctors they will have to understand it.³


1. Altman DG. The scandal of poor medical research. BMJ. 1994; 308: 283-4.
2. Singh G. Medical Science without Statistics. The Internet Journal of Healthcare Administration. 2006; 4(2).
3. Chen J. Lecture: Advice to GCRC & Surgery Fellows and Residents, SBU, 2004.

Statistics: the theory & methodology for the collection, organization, analysis, interpretation & presentation of data.


DESCRIPTIVE INFERENTIAL

Descriptive Statistics: the discipline of quantitatively describing the features of data.

o Collecting
o Organizing
o Summarizing
o Presenting Data

Inferential Statistics: deals with drawing conclusions from data.

o Inferences
o Hypothesis Testing
o Relationships
o Predictions

Ability to understand:

o the value of published Medical Research
o the role of Statistics in the Medical Business

In the past the USE of Statistics was its most significant aspect.

[Graphic: PHARMA / MEDIA]
Today, the MISUSE of Statistics in Research has become a concern.

Statistics is an essential aspect of modern science. Before Statistics, science was perceived as the process of developing absolute knowledge through observations.

In contrast, Statistics is based on the notion that scientific knowledge is not absolute.


Hence, uncertainty & error are part of science
The only real things in science are distributions of numbers

Probability theory is used to interpret those distributions


Statistics reflects acts of interpretation - not irrefutable facts

the Wilcoxon rank sum test, Poisson regression models, the Bayesian estimates, Wald χ² statistics, Cox proportional hazards, compared using t tests, repeated-measures ANOVA, adjusted hazard ratios, 2-stage statistical model, 95% confidence interval, the degrees of freedom, odds ratio

Kaplan-Meier method, Pearson χ², Fisher exact test, to have 90% power, Mann-Whitney test, a 2-tailed α level of less than .05, the log-rank test, a 2-factor analysis of variance, χ² tests, the Z test, logistic regression models, stratified Mantel-Haenszel analysis

Source: JAMA Vol. 292 (19): Six Original Contribution Papers

In developed countries, much of what laymen know about medicine is gleaned from the media.

Unfortunately, the more frightening an event is, the more newsworthy it is.

The Statistical Analysis of Research Studies is complex. Regrettably, it tends to be oversimplified & sensationalized.

Data dredging (data fishing, data snooping, equation fitting) is the inappropriate use of statistics to uncover misleading relationships.

The SIMPLEST FORM OF STATISTICS will suffice in most well-designed studies.

Therefore, a revision of the study design should occur before resorting to more sophisticated analysis.

Similarly, any study that uses overly complex statistical methods should be approached with caution.

Source: Bhattacharya K. Introduction to Statistics for Medical Students. University of Oxford, 2004.

As opposed to the past, we now live in the QUANTITATIVE ERA.

In the clinical environment everything is measured.

All aspects of physicians' work are being statistically analyzed & compared to benchmarks such as evidence-based guidelines.

Any physician who does not understand this WILL BE CRUSHED.

Computerization of the medical business has facilitated:

o Automated surreptitious data gathering
o Data Mining
o Physician Performance analyses
o Outcome & Cost-Effectiveness analyses
o Practitioner vs. Peer-Group comparison analyses
o More accurate Actuarial analyses

Statistics is challenging for everybody. Physicians may find it especially challenging, as Statistics is:

Math-based. It has many rigorous quantitative aspects rooted in mathematics. Most physicians are not used to studying math-based subjects.

Time-consuming. It is a tedious subject requiring a tremendous time commitment.

Spuriously non-essential. It appears not to be an everyday-use topic ("I can get away w/o studying it").

Statistics is not a spectator sport.

Get Motivated by understanding why you need Statistics.

Learn Actively: it cannot be passively crammed:
o Use pen & paper for solving problems & reflecting on ideas
o Make your own scenarios

Study Deliberately, as:
o a few words & symbols can mean a lot in statistics
o it may be necessary to read a topic many times

Study Incrementally:
o Statistics is based on a small number of principles
o Those must be memorized & understood first
o It is futile to look up an advanced test (e.g. used in a research paper) w/o knowing those essentials

Assemble Resources:
o There is no single best statistical manual
o It pays to prepare a set of personalized references

Source: University of Oxford & LISA: Laboratory for Interdisciplinary Statistical Analysis at Virginia Tech

Population: all elements to be studied


o Parameter: characteristic of the Population (e.g. Mean, Standard Deviation)

Sample: a subset of the Population.


o Statistic: characteristic of the Sample (not to be confused with Statistics)

VARIABLE: any measurable attribute that differs.

Quantitative = Numerical
o Continuous: can take any value within an interval
  E.g.: Time
o Discrete: can take only certain (countable) values
  E.g.: Number of children in a family

Qualitative = Categorical
o Ordinal: can be ordered (ranked)
  E.g.: Clothing Size: S, M, L, XL
o Nominal: cannot be ordered
  E.g.: Colors

DATA: values that variables can assume


DATA

Univariate: analysis of one variable

Bivariate: analysis of two variables


Multivariate: analysis of many variables

Phases of Statistical Analysis:

SSDC: SAMPLE SELECTION & DATA COLLECTION

IDM: INITIAL DATA MANIPULATION
o Data Formatting
o Data Quality Control

EDA: EXPLORATORY DATA ANALYSIS
o Tabular, Numerical, Graphical data summaries
o Choosing ways of Definitive Analysis

DDA: DEFINITIVE DATA ANALYSIS
o Final Inferential Data Analysis

PoC: PRESENTATION of CONCLUSIONS
o Concise graphical & tabular summaries
o Statement of conclusions

Understanding the phases of Statistical Analysis (SSDC, IDM, EDA, DDA, PoC) is important not only for performing research.

It is essential for the critical appraisal of published studies.

This truth is frequently overlooked.

GOALS:

Descriptive INFERENCE (DI): describe a population, using information from a sample.

Analytical INFERENCE (AI): describe relationships between variables, using a sample, assuming that it can be generalized to a population.

SAMPLING:
o Simple Random
o Stratified
o Cluster
o Multistage

SIMPLE RANDOM Sample

o a subset of individuals chosen RANDOMLY from a population
o each individual has the same probability of being chosen

STRATIFIED Sample

o STRATA: homogeneous, nonoverlapping subgroups
o STRATIFICATION: dividing the population into strata
o A STRATIFIED Sample is obtained by simple random sampling from each stratum

CLUSTER Sample

o CLUSTERS: natural, heterogeneous subgroups representative of the population
o CLUSTERING: identifying clusters in the population
o A CLUSTER Sample is obtained by simple random sampling within each cluster

MULTISTAGE Sample: a form of cluster sampling used when including all the elements in all the clusters is impracticable; instead, the researcher randomly selects elements from within the clusters.
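The following short Python sketch (not part of the original lecture) illustrates the difference between a simple random sample and a stratified sample; the patient population, clinic strata and sample sizes are invented for illustration only.

```python
import random

random.seed(42)

# Hypothetical population: 1000 patients, each belonging to one clinic ("stratum").
population = [{"id": i, "clinic": random.choice(["A", "B", "C"])} for i in range(1000)]

# Simple random sample: every patient has the same probability of being chosen.
srs = random.sample(population, k=50)

# Stratified sample: simple random sampling within each clinic (stratum).
stratified = []
for clinic in ["A", "B", "C"]:
    stratum = [p for p in population if p["clinic"] == clinic]
    stratified.extend(random.sample(stratum, k=20))

print(len(srs), len(stratified))  # 50 60
```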

Putting a Data Set in order, making it usable:
o Data Formatting
o Checking the Quality of: Data (outliers?); Implementation of Design

Basic Characteristics of data

OUTLIERS: data points that deviate markedly from the majority of the sample.

DISTRIBUTION: the pattern of occurrence of the various values of a variable.

o POPULATION Distribution: distribution of values for all units in the population.
o EMPIRICAL Distribution: distribution of values for the units in a sample.

It is assumed that the Empirical Distribution is a good representation of the Population Distribution.

A distribution is a listing or function showing all the possible values of the data and how often they occur.

Distribution of categorical data shows the number & percentage of individuals in each group.

Distribution of numerical data is typically presented using graphs & charts to examine:
o the shape,
o center,
o amount of variability in the data.

NORMAL Distribution

A PROBABILITY DISTRIBUTION assigns a probability to each measurable subset of the possible outcomes of a procedure.

The Normal (Gaussian) distribution is a very common continuous probability distribution.

A continuous probability distribution is a probability distribution that has a pdf.

pdf: the Probability Density Function (density) of a continuous random variable is a function that describes the relative likelihood for this variable to take on a given value.

There are myriad probability distributions. Most are related to each other, and ultimately to the Normal.
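As a hedged illustration (not from the slides), the Normal pdf can be written out directly from its standard formula; the mean and SD values below are arbitrary.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Relative likelihood of a Normal(mu, sigma) variable taking the value x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

print(normal_pdf(0.0))    # about 0.399: the peak of the standard normal curve
print(normal_pdf(1.96))   # about 0.058: density far out in the tail
```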

GOAL: to reduce the information contained in a data set to a few key indicators.

APPROACH: summarization of the data with visual methods to reveal trends & patterns.

METHODS: depend on the type of data.

TABULAR / NUMERICAL / GRAPHICAL

Numerical summary example: Q1 = 64; Q2 = 71; Q3 = 74; IQR = 10; x̄ = 45; s² = 16; s = 4; CV ≈ 0.09

Numerical measures: Quantiles & Quartiles; Median; Mean; Mode; Spread or Dispersion; Interquartile Range; Standard Deviation; Coefficient of Variation

The EDA methods presented in this section are important not just for researchers.

Any reader of scientific literature or business statistical analyses will encounter the methods discussed here.

Familiarity with them is essential for one's ability to critically appraise any statistics-based document.

FREQUENCY DISTRIBUTION: an organization of the raw data in tabular form using classes & frequencies.

o Frequency: the number of times a value occurs in a data set
o Relative Frequency: frequency counts expressed as percentages of the total observations
o Cumulative Frequency: the sum of the frequencies for all values at or below the given value
o Cumulative Relative Frequency: the sum of the relative frequencies for all values at or below the given value
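A minimal sketch, using invented observations, of the four frequency measures defined above:

```python
from collections import Counter

data = [1, 2, 2, 3, 3, 3, 4, 4, 5]   # hypothetical raw observations
n = len(data)

freq = Counter(data)                  # frequency of each value
running = 0
for value in sorted(freq):
    rel = freq[value] / n             # relative frequency
    running += freq[value]            # cumulative frequency
    cum_rel = running / n             # cumulative relative frequency
    print(value, freq[value], round(rel, 2), running, round(cum_rel, 2))
```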

Useful for categorical data. It presents the distribution of values by showing their frequencies.

A contingency table (cross tab) is used to analyze the relationship between two or more categorical variables.
Example: 100 individuals are randomly sampled from a population as part of a study of sex differences in handedness.

Quantiles & Quartiles

Location
o median
o mean
o mode

Spread or Dispersion
o Range
o Interquartile range
o Variance
o Standard deviation
o Coefficient of variation

Skewness
o Coefficient of Skewness

Kurtosis
o Coefficient of Kurtosis

Covariance

Correlation
o Correlation Coefficients: Pearson's CC, Spearman's rank CC

Simple Definition: QUANTILES: points taken at regular intervals that divide the data set into equal-sized subsets.

Example of Formal Definition: the q-th sample quantile, denoted Q(q), is the smallest value such that (100·q)% of the observations for the variable take values which are less than or equal to Q(q).

Quantiles are the data values (cut-off POINTS) marking the boundaries between subsets. Examples of specific quantiles:

o 2-quantile: median
o 4-quantiles: quartiles
o 5-quantiles: quintiles
o 100-quantiles: percentiles

Common misconception: the use of the name of quantiles to denote the subsets they mark. These subsets should be called thirds, quarters, fifths, etc.

QUARTILES: three POINTS that divide the data set into four equal groups, each comprising a quarter of the data. A quartile is a type of quantile.

o Q1: First = lower quartile = 25th percentile: splits off the lowest 25% of data
o Q2: Second = median = 50th percentile: cuts the data set in half
o Q3: Third = upper quartile = 75th percentile: splits off the highest 25% of data

Interquartile Range (IQR): the difference between the upper and lower quartiles.

IQR = Q3 - Q1
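A minimal NumPy sketch of quartiles and the IQR; the data are invented so that the quartiles match the example values quoted earlier (Q1 = 64, Q2 = 71, Q3 = 74). Note that several quartile conventions exist, so hand calculations may differ slightly.

```python
import numpy as np

data = np.array([61, 63, 64, 67, 71, 72, 74, 75, 78])   # illustrative values only

q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
print(q1, q2, q3, iqr)   # 64.0 71.0 74.0 10.0
```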

MEASURES OF LOCATION: finding the value that best characterizes the data set.

Median (x̃): the value separating the higher half of a data set from the lower half.
o The median of {2, 3, 5, 8, 9} is 5

Mean (x̄): the sum of the n numbers divided by n.
o The mean of {6, 4, 7, 10, 4} is 6.2 = (6 + 4 + 7 + 10 + 4) / 5

Mode (Mo): the most frequent value in the data set.
o The mode of {1, 3, 6, 4, 3, 5, 3} is 3
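A minimal sketch reproducing the three location measures with the examples from this slide, using Python's standard library:

```python
import statistics

print(statistics.median([2, 3, 5, 8, 9]))      # 5
print(statistics.mean([6, 4, 7, 10, 4]))       # 6.2
print(statistics.mode([1, 3, 6, 4, 3, 5, 3]))  # 3
```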

The Mean is affected by outliers; the Median is not. The Median exhibits robustness against outliers.

Robustness: the ability to resist the influence of outliers.

Robust statistics: statistics with good performance for data drawn from a wide range of probability distributions & not unduly affected by outliers.

SPREAD (DISPERSION): measures the degree to which the observed values are concentrated around a location measure.

Smaller spread: values are tightly clustered around the center.

Measures of Spread:
o Range
o Interquartile range
o Variance
o Standard deviation
o Coefficient of variation

RANGE: the difference between the sample Maximum & Minimum.
o The simplest measure of dispersion
o Very sensitive to outliers

INTERQUARTILE RANGE (IQR): the difference between the upper and lower quartiles.
o Less sensitive to outliers

VARIANCE (s²): a measure of how far a set of numbers is spread out: how far the numbers are located from the mean.

s² = Σ(Xi − x̄)² / (n − 1)

n = number of data values; Xi = each of the values of the data; x̄ = mean

o s² is always non-negative
o s² = 0: no variation
o s² small: data close to x̄
o s² high: data far from x̄

Since Variance is expressed in squared units, it is difficult to interpret intuitively.

Standard Deviation (SD, s): the square root of the Variance. It shows the extent of variation from the mean.
o s small: data close to x̄
o s high: data far from x̄

Both s² & s depend on the units in which a variable is measured. This can be misleading when comparing variables measured in different units.

COEFFICIENT: from Latin co (together) + efficere (to effect).

In Mathematics: a number or other known factor (symbol) by which another number or factor is multiplied.
E.g.: in the equation ax² + bx + c = 0
o a is the coefficient of x²
o b is the coefficient of x

In Statistics: a measure of a specified characteristic of a phenomenon.

Coefficient of Variation (CV): the ratio of the SD to the Mean.

CV = s / x̄     (s = Standard Deviation; x̄ = Mean)

Relative SD (RSD): CV expressed as a percentage.

o CV = 0: no variation (s = 0)
o CV < 1: low variance
o CV > 1: high variance

CV has no units. It can be used for comparing dispersions of variables measured in different units.
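A minimal sketch showing why the unit-free CV permits comparison across different units; the weight and height figures are invented.

```python
import statistics

weights_kg = [70, 82, 65, 90, 75]
heights_cm = [170, 182, 165, 190, 175]

def cv(values):
    return statistics.stdev(values) / statistics.mean(values)

# The SDs are in different units (kg vs cm), but the CVs are unit-free and comparable.
print(round(cv(weights_kg), 3), round(cv(heights_cm), 3))
```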

Skewness: deviations from symmetry with respect to a location measure. It is unit-free.

Coefficient of Skewness: b1 = [Σ(Xi − x̄)³ / n] / s³

s = Standard Deviation; x̄ = Mean; n = number of data values; Xi = the data values

o b1 = 0: values distributed symmetrically around x̄ (tails are symmetric)
o b1 > 0: positively (right-) skewed: longer tail for values > x̄
o b1 < 0: negatively (left-) skewed: longer tail for values < x̄

Kurtosis: the degree of peakedness of the distribution, as compared to a Normal (Gaussian) Distribution. It is unit-free.

Coefficient of Kurtosis: b2 = [Σ(Xi − x̄)⁴ / n] / s⁴

s = Standard Deviation; x̄ = Mean; n = number of data values; Xi = each of the data values

o b2 > 3: Leptokurtic (more peaked than Normal)
o b2 = 3: Mesokurtic (peaked as Normal)
o b2 < 3: Platykurtic (less peaked than Normal)

Covariance is a measure of how much two random variables change together.

Dependence is any statistical relationship between two random variables.

Correlation refers to statistical relationships involving dependence.

Correlation does not imply causation!

COVARIANCE: measures association between two numerical variables.

cov(X,Y) = Σ(Xi − X̄)(Yi − Ȳ) / (n − 1)

X, Y: variables; Xi, Yi: observations for unit i; X̄, Ȳ: means of the variables; n: number of observations

o cov(X,Y) = 0: X & Y are uncorrelated (no linear association)
o cov(X,Y) > 0: X & Y POSITIVELY associated: greater values of X correspond w/ greater values of Y
o cov(X,Y) < 0: X & Y NEGATIVELY associated: greater values of X correspond w/ smaller values of Y

The sign (+/-) of cov shows the type of linear relationship between X & Y.

The magnitude of the cov is hard to interpret, hence a normalized cov is used.
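A minimal sketch of the sample covariance from the formula above; the blood-pressure and age values are invented and the n − 1 denominator is assumed.

```python
def covariance(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

systolic_bp = [120, 130, 125, 140, 150]   # hypothetical X
age = [35, 45, 40, 55, 60]                # hypothetical Y
print(covariance(systolic_bp, age))        # positive: greater X with greater Y
```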

NORMALIZATION: the creation of scaled versions of statistics to allow comparisons by eliminating scale effects.

Correlation Coefficients (CC): normalized versions of covariance.

CC measure the degree of correlation.

Commonly used CC:
o Pearson Correlation Coefficient
o Spearman's Rank Correlation Coefficient

PEARSON Correlation Coefficient (r): a measure of the linear correlation between variables X & Y.

A linear X,Y relationship is modeled best by a straight line.

r = cov(X,Y) / (sx · sy)

X, Y: variables; cov(X,Y): covariance; sx, sy: Standard Deviations of X and Y

o r = -1: total NEGATIVE correlation
o r = 0: NO correlation
o r = +1: total POSITIVE correlation

r removes the dependence on the units by scaling the cov by the product of the SDs of X & Y.

r is not robust to: outliers, unequal variances, non-normality, & non-linearity.
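A minimal sketch of Pearson's r as the covariance scaled by the product of the two SDs, reusing the hypothetical data from the covariance sketch:

```python
import statistics

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    return cov / (statistics.stdev(x) * statistics.stdev(y))

systolic_bp = [120, 130, 125, 140, 150]
age = [35, 45, 40, 55, 60]
print(round(pearson_r(systolic_bp, age), 3))   # close to +1: strong positive linear correlation
```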

RANK: relative position in a graded group.

RANKING: a transformation of data in which values are replaced by their rank.

Ranking of the numerical dataset { 3.4, 5.1, 7.3, 2.6, 8.9 }:

Value: 8.9  7.3  5.1  3.4  2.6
Rank:   5    4    3    2    1

SPEARMAN'S Rank Correlation Coefficient (ρ): a measure of monotonic dependence between variables X & Y.

In a monotonic X,Y relationship, Y moves in one direction (up or down) as X moves, but the relationship is not necessarily linear.

ρ is calculated by applying the Pearson CC formula to the ranks of the data, not to the values. For a sample of size n, the n raw scores Xi, Yi are converted to ranks xi, yi.

ρ reflects the Monotone Trend (M.T.) between X & Y:

o ρ = +1: perfect increasing M.T.
o +1 > ρ > 0: increasing M.T. (Y increases when X increases)
o ρ = 0: no M.T.
o -1 < ρ < 0: decreasing M.T. (Y decreases when X increases)
o ρ = -1: perfect decreasing M.T.

ρ is robust to outliers, unequal variances, non-normality, & non-linearity.

ρ is non-parametric, as its exact sampling distribution can be obtained w/o knowing the parameters of the joint probability distribution of X & Y.
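A minimal sketch of Spearman's ρ: rank the data, then apply the Pearson formula to the ranks. The X values reuse the ranking example above; the Y values are invented and monotonically related, and ties are ignored for simplicity.

```python
def ranks(values):
    # Rank 1 = smallest value (no ties handled in this simple sketch).
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def pearson_of(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

x = [3.4, 5.1, 7.3, 2.6, 8.9]
y = [10, 22, 31, 8, 45]                        # hypothetical, increases whenever x increases
print(pearson_of(ranks(x), ranks(y)))          # 1.0: perfect increasing monotone trend
```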

SSDC → IDM → EDA → DDA → PoC

Statistics is an essential component of both the Science & Business of Medicine.

Despite this fact, statistical illiteracy & innumeracy are prevalent among physicians. This situation should be remedied.

The study of Statistics poses many challenges, but those are well worth overcoming.

Statistics is based on a definite number of principles. It is best studied in an incremental fashion.

SSDC → IDM → EDA → DDA → PoC

Statistics reflects acts of interpretation, not irrefutable facts.

Statistics can be misused & abused.

Statistical Analyses are the result of a multiphase process that:
o starts at Sample Selection,
o ends with Presentation of Conclusions.

Appraisal of Statistical Analyses requires familiarity with all phases.

Understanding the Tabular, Numerical and Graphical Methods of EDA is critical for assessing the quality of a Statistical Analysis.

To be continued

Author wishes to thank: Stephen DeCherney, MD, MPH for his valuable comments.

Nothing to disclose: there are no known conflicts of interest associated with this presentation. Specifically, neither the author nor his family has any potential conflicts of interest, financial or otherwise, regarding any of the products and/or services discussed here.
