
Define Statistics

Discuss the significance of Statistics for Physicians

Suggest study strategies for learning Statistics

Present the role of Statistics in the scientific process

Review basic concepts of Statistics

Introduce methods of Exploratory Data Analysis

Amazingly, it is widely considered acceptable for medical researchers to be ignorant of statistics. Many are not ashamed (and some seem proud) to admit that they don't know anything about Statistics.¹
Doctors may not be expected to be experts in statistics, but they should be capable of understanding basic statistical methodology.²
Medical students may not like statistics, but as good doctors they will have to understand it.³


1. Altman DG. The scandal of poor medical research. BMJ. 1994; 308: 283-4.
2. Singh G. Medical Science without Statistics. The Internet Journal of Healthcare Administration. 2006; 4(2).
3. Chen J. Lecture: Advice to GCRC & Surgery Fellows and Residents, SBU, 2004.

Statistics: the theory & methodology for the collection, organization, analysis, interpretation & presentation of data.


DESCRIPTIVE INFERENTIAL

Descriptive Statistics: the discipline of quantitatively describing the features of data.

o Collecting
o Organizing
o Summarizing
o Presenting Data

Inferential Statistics: deals with drawing conclusions from data.

o Inferences
o Hypothesis Testing
o Relationships
o Predictions

Ability to understand:

o the value of published Medical Research
o the role of Statistics in the Medical Business

In the past the USE of Statistics was its most significant aspect.

[Graphic: PHARMA / MEDIA]
Today, the MISUSE of Statistics in Research has become a concern.

Statistics is an essential aspect of modern science. Before Statistics, science was perceived as the process of developing absolute knowledge through observations.

In contrast, Statistics is based on the notion that scientific knowledge is not absolute.


Hence, uncertainty & error are part of science
The only real things in science are distributions of numbers

Probability theory is used to interpret those distributions


Statistics reflects acts of interpretation - not irrefutable facts

the Wilcoxon rank sum test, Poisson regression models, the Bayesian estimates, Wald χ² statistics, Cox proportional hazards, compared using t tests, repeated-measures ANOVA, adjusted hazard ratios, 2-stage statistical model, 95% confidence interval, the degrees of freedom, odds ratio

Kaplan-Meier method, Pearson χ², Fisher exact test, to have 90% power, Mann-Whitney test, a 2-tailed α level of less than .05, the log-rank test, a 2-factor analysis of variance, χ² tests, the Z test, logistic regression models, stratified Mantel-Haenszel analysis

Source: JAMA Vol. 292 (19): Six Original Contribution Papers

In developed countries, much of what laymen know about medicine is gleaned from the media.

Unfortunately, the more frightening an event is, the more newsworthy it is.

The Statistical Analysis of Research Studies is complex. Regrettably, it tends to be oversimplified & sensationalized.

Data dredging (data fishing, data snooping, equation fitting) is the inappropriate use of statistics to uncover misleading relationships.

The SIMPLEST FORM OF STATISTICS will suffice in most well-designed studies.

Therefore, a revision of the study design should occur before resorting to more sophisticated analysis.

Similarly, any study that uses overly complex statistical methods should be approached with caution.

Source: Bhattacharya K. Introduction to Statistics for Medical Students. University of Oxford, 2004.

As opposed to the past, we now live in the QUANTITATIVE ERA.

In the clinical environment everything is measured.

All aspects of physicians' work are being statistically analyzed & compared to benchmarks such as evidence-based guidelines.

Any physician who does not understand this WILL BE CRUSHED.

Computerization of the medical business has facilitated:

o Automated surreptitious data gathering
o Data Mining
o Physician Performance analyses
o Outcome & Cost-Effectiveness analyses
o Practitioner vs. Peer-Group comparison analyses
o More accurate Actuarial analyses

Statistics is challenging for everybody. Physicians may find it especially challenging, as Statistics is:

Math-based. It has many rigorous quantitative aspects rooted in mathematics. Most physicians are not used to studying math-based subjects.

Time-consuming. It is a tedious subject requiring a tremendous time commitment.

Spuriously non-essential. It appears not to be an everyday-use topic ("I can get away w/o studying it").

Statistics is not a spectator sport.

Get Motivated by understanding why you need Statistics.

Learn Actively: it cannot be passively crammed:
o Use pen & paper for solving problems & reflecting on ideas
o Make your own scenarios

Study Deliberately, as:
o a few words & symbols can mean a lot in statistics
o it may be necessary to read a topic many times

Study Incrementally:
o Statistics is based on a small number of principles
o Those must be memorized & understood first
o It is futile to look up an advanced test (e.g. used in a research paper) w/o knowing those essentials

Assemble Resources:
o There is no single best statistical manual
o It pays to prepare a set of personalized references

Source: University of Oxford & LISA: Laboratory for Interdisciplinary Statistical Analysis at Virginia Tech

Population: all elements to be studied


o Parameter: characteristic of the Population (e.g. Mean, Standard Deviation)

Sample: a subset of the Population.


o Statistic: characteristic of the Sample (not to be confused with Statistics)

VARIABLE: any measurable attribute that differs.

Quantitative = Numerical
o Continuous: can take any value within an interval
  E.g.: Time
o Discrete: can take only certain (countable) values
  E.g.: Number of children in a family

Qualitative = Categorical
o Ordinal: can be ordered (ranked)
  E.g.: Clothing Size: S, M, L, XL
o Nominal: cannot be ordered
  E.g.: Colors

DATA: values that variables can assume


DATA

Univariate: analysis of one variable

Bivariate: analysis of two variables


Multivariate: analysis of many variables

Phases of Statistical Analysis:

SSDC: SAMPLE SELECTION & DATA COLLECTION

IDM: INITIAL DATA MANIPULATION
o Data Formatting
o Data Quality Control

EDA: EXPLORATORY DATA ANALYSIS
o Tabular, Numerical, Graphical data summaries
o Choosing ways of Definitive Analysis

DDA: DEFINITIVE DATA ANALYSIS
o Final Inferential Data Analysis

PoC: PRESENTATION of CONCLUSIONS
o Concise graphical & tabular summaries
o Statement of conclusions

Understanding the phases of Statistical Analysis (SSDC, IDM, EDA, DDA, PoC) is important not only for performing research.

It is essential for the critical appraisal of published studies.

This truth is frequently overlooked.

GOALS:

Descriptive INFERENCE (DI): describe a population, using information from a sample.

Analytical INFERENCE (AI): describe relationships between variables, using a sample, assuming that it can be generalized to a population.

SAMPLING:
o Simple Random
o Stratified
o Cluster
o Multistage

SIMPLE RANDOM Sample

o a subset of individuals chosen RANDOMLY from a population
o each individual has the same probability of being chosen

STRATIFIED Sample

o STRATA: homogeneous, nonoverlapping subgroups
o STRATIFICATION: dividing the population into strata
o A STRATIFIED Sample is obtained by simple random sampling from each stratum

CLUSTER Sample

o CLUSTERS: natural, heterogeneous subgroups representative of the population
o CLUSTERING: identifying clusters in the population
o A CLUSTER Sample is obtained by simple random sampling within each cluster

MULTISTAGE Sample: a form of cluster sampling used when including all the elements in all the clusters is impracticable; instead, the researcher randomly selects elements from within the clusters.
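The following short Python sketch (not part of the original lecture) illustrates the difference between a simple random sample and a stratified sample; the patient population, clinic strata and sample sizes are invented for illustration only.

```python
import random

random.seed(42)

# Hypothetical population: 1000 patients, each belonging to one clinic ("stratum").
population = [{"id": i, "clinic": random.choice(["A", "B", "C"])} for i in range(1000)]

# Simple random sample: every patient has the same probability of being chosen.
srs = random.sample(population, k=50)

# Stratified sample: simple random sampling within each clinic (stratum).
stratified = []
for clinic in ["A", "B", "C"]:
    stratum = [p for p in population if p["clinic"] == clinic]
    stratified.extend(random.sample(stratum, k=20))

print(len(srs), len(stratified))  # 50 60
```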

Putting a Data Set in order, making it usable:
o Data Formatting
o Checking the Quality of: Data (outliers?); Implementation of Design

Basic Characteristics of data

OUTLIERS: data points that deviate markedly from the majority of the sample.

DISTRIBUTION: the pattern of occurrence of the various values of a variable.

o POPULATION Distribution: distribution of values for all units in the population.
o EMPIRICAL Distribution: distribution of values for the units in a sample.

It is assumed that the Empirical Distribution is a good representation of the Population Distribution.

A distribution is a listing or function showing all the possible values of the data and how often they occur.

Distribution of categorical data shows the number & percentage of individuals in each group.

Distribution of numerical data is typically presented using graphs & charts to examine:
o the shape,
o center,
o amount of variability in the data.

NORMAL Distribution

A PROBABILITY DISTRIBUTION assigns a probability to each measurable subset of the possible outcomes of a procedure.

The Normal (Gaussian) distribution is a very common continuous probability distribution.

A continuous probability distribution is a probability distribution that has a pdf.

pdf: the Probability Density Function (density) of a continuous random variable is a function that describes the relative likelihood for this variable to take on a given value.

There are myriad probability distributions. Most are related to each other, and ultimately to the Normal.
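As a hedged illustration (not from the slides), the Normal pdf can be written out directly from its standard formula; the mean and SD values below are arbitrary.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Relative likelihood of a Normal(mu, sigma) variable taking the value x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

print(normal_pdf(0.0))    # about 0.399: the peak of the standard normal curve
print(normal_pdf(1.96))   # about 0.058: density far out in the tail
```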

GOAL: to reduce the information contained in a data set to a few key indicators.

APPROACH: summarization of the data with visual methods to reveal trends & patterns.

METHODS: depend on the type of data.

TABULAR / NUMERICAL / GRAPHICAL

Numerical summary example: Q1 = 64; Q2 = 71; Q3 = 74; IQR = 10; x̄ = 45; s² = 16; s = 4; CV ≈ 0.09

Numerical measures: Quantiles & Quartiles; Median; Mean; Mode; Spread or Dispersion; Interquartile Range; Standard Deviation; Coefficient of Variation

The EDA methods presented in this section are important not just for researchers.

Any reader of scientific literature or business statistical analyses will encounter the methods discussed here.

Familiarity with them is essential for one's ability to critically appraise any statistics-based document.

FREQUENCY DISTRIBUTION: an organization of the raw data in tabular form using classes & frequencies.

o Frequency: the number of times a value occurs in a data set
o Relative Frequency: frequency counts expressed as percentages of the total observations
o Cumulative Frequency: the sum of the frequencies for all values at or below the given value
o Cumulative Relative Frequency: the sum of the relative frequencies for all values at or below the given value
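A minimal sketch, using invented observations, of the four frequency measures defined above:

```python
from collections import Counter

data = [1, 2, 2, 3, 3, 3, 4, 4, 5]   # hypothetical raw observations
n = len(data)

freq = Counter(data)                  # frequency of each value
running = 0
for value in sorted(freq):
    rel = freq[value] / n             # relative frequency
    running += freq[value]            # cumulative frequency
    cum_rel = running / n             # cumulative relative frequency
    print(value, freq[value], round(rel, 2), running, round(cum_rel, 2))
```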

Useful for categorical data. It presents the distribution of values by showing their frequencies.

A contingency table (cross tab) is used to analyze the relationship between two or more categorical variables.
Example: 100 individuals are randomly sampled from a population as part of a study of sex differences in handedness.

Quantiles & Quartiles

Location
o median
o mean
o mode

Spread or Dispersion
o Range
o Interquartile range
o Variance
o Standard deviation
o Coefficient of variation

Skewness
o Coefficient of Skewness

Kurtosis
o Coefficient of Kurtosis

Covariance

Correlation
o Correlation Coefficients: Pearson's CC, Spearman's rank CC

Simple Definition: QUANTILES: points taken at regular intervals that divide the data set into equal-sized subsets.

Example of Formal Definition: the q-th sample quantile, denoted Q(q), is the smallest value such that (100·q)% of the observations for the variable take values which are less than or equal to Q(q).

Quantiles are the data values (cut-off POINTS) marking the boundaries between subsets. Examples of specific quantiles:

o 2-quantile: median
o 4-quantiles: quartiles
o 5-quantiles: quintiles
o 100-quantiles: percentiles

Common misconception: the use of the name of quantiles to denote the subsets they mark. These subsets should be called thirds, quarters, fifths, etc.

QUARTILES: three POINTS that divide the data set into four equal groups, each comprising a quarter of the data. A quartile is a type of quantile.

o Q1: First = lower quartile = 25th percentile: splits off the lowest 25% of data
o Q2: Second = median = 50th percentile: cuts the data set in half
o Q3: Third = upper quartile = 75th percentile: splits off the highest 25% of data

Interquartile Range (IQR): the difference between the upper and lower quartiles.

IQR = Q3 - Q1
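A minimal NumPy sketch of quartiles and the IQR; the data are invented so that the quartiles match the example values quoted earlier (Q1 = 64, Q2 = 71, Q3 = 74). Note that several quartile conventions exist, so hand calculations may differ slightly.

```python
import numpy as np

data = np.array([61, 63, 64, 67, 71, 72, 74, 75, 78])   # illustrative values only

q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
print(q1, q2, q3, iqr)   # 64.0 71.0 74.0 10.0
```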

MEASURES OF LOCATION: finding the value that best characterizes the data set.

Median (x̃): the value separating the higher half of a data set from the lower half.
o The median of {2, 3, 5, 8, 9} is 5

Mean (x̄): the sum of the n numbers divided by n.
o The mean of {6, 4, 7, 10, 4} is 6.2 = (6 + 4 + 7 + 10 + 4) / 5

Mode (Mo): the most frequent value in the data set.
o The mode of {1, 3, 6, 4, 3, 5, 3} is 3
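A minimal sketch reproducing the three location measures with the examples from this slide, using Python's standard library:

```python
import statistics

print(statistics.median([2, 3, 5, 8, 9]))      # 5
print(statistics.mean([6, 4, 7, 10, 4]))       # 6.2
print(statistics.mode([1, 3, 6, 4, 3, 5, 3]))  # 3
```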

The Mean is affected by outliers; the Median is not. The Median exhibits robustness against outliers.

Robustness: the ability to resist the influence of outliers.

Robust statistics: statistics with good performance for data drawn from a wide range of probability distributions & not unduly affected by outliers.

SPREAD (DISPERSION): measures the degree to which the observed values are concentrated around a location measure.

Smaller spread: values are tightly clustered around the center.

Measures of Spread:
o Range
o Interquartile range
o Variance
o Standard deviation
o Coefficient of variation

RANGE: the difference between the sample Maximum & Minimum.
o The simplest measure of dispersion
o Very sensitive to outliers

INTERQUARTILE RANGE (IQR): the difference between the upper and lower quartiles.
o Less sensitive to outliers

VARIANCE (s²): a measure of how far a set of numbers is spread out: how far the numbers are located from the mean.

s² = Σ(Xi − x̄)² / (n − 1)

n = number of data values; Xi = each of the values of the data; x̄ = mean

o s² is always non-negative
o s² = 0: no variation
o s² small: data close to x̄
o s² high: data far from x̄

Since Variance is expressed in squared units, it is difficult to interpret intuitively.

Standard Deviation (SD, s): the square root of the Variance. It shows the extent of variation from the mean.
o s small: data close to x̄
o s high: data far from x̄

Both s² & s depend on the units in which a variable is measured. This can be misleading when comparing variables measured in different units.

COEFFICIENT: from Latin co (together) + efficere (to effect).

In Mathematics: a number or other known factor (symbol) by which another number or factor is multiplied.
E.g.: in the equation ax² + bx + c = 0
o a is the coefficient of x²
o b is the coefficient of x

In Statistics: a measure of a specified characteristic of a phenomenon.

Coefficient of Variation (CV): the ratio of the SD to the Mean.

CV = s / x̄     (s = Standard Deviation; x̄ = Mean)

Relative SD (RSD): CV expressed as a percentage.

o CV = 0: no variation (s = 0)
o CV < 1: low variance
o CV > 1: high variance

CV has no units. It can be used for comparing dispersions of variables measured in different units.
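A minimal sketch showing why the unit-free CV permits comparison across different units; the weight and height figures are invented.

```python
import statistics

weights_kg = [70, 82, 65, 90, 75]
heights_cm = [170, 182, 165, 190, 175]

def cv(values):
    return statistics.stdev(values) / statistics.mean(values)

# The SDs are in different units (kg vs cm), but the CVs are unit-free and comparable.
print(round(cv(weights_kg), 3), round(cv(heights_cm), 3))
```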

Skewness: deviations from symmetry with respect to a location measure. It is unit-free.

Coefficient of Skewness: b1 = [Σ(Xi − x̄)³ / n] / s³

s = Standard Deviation; x̄ = Mean; n = number of data values; Xi = the data values

o b1 = 0: values distributed symmetrically around x̄ (tails are symmetric)
o b1 > 0: positively (right-) skewed: longer tail for values > x̄
o b1 < 0: negatively (left-) skewed: longer tail for values < x̄

Kurtosis: the degree of peakedness of the distribution, as compared to a Normal (Gaussian) Distribution. It is unit-free.

Coefficient of Kurtosis: b2 = [Σ(Xi − x̄)⁴ / n] / s⁴

s = Standard Deviation; x̄ = Mean; n = number of data values; Xi = each of the data values

o b2 > 3: Leptokurtic (more peaked than Normal)
o b2 = 3: Mesokurtic (peaked as Normal)
o b2 < 3: Platykurtic (less peaked than Normal)

Covariance is a measure of how much two random variables change together.

Dependence is any statistical relationship between two random variables.

Correlation refers to statistical relationships involving dependence.

Correlation does not imply causation!

COVARIANCE: measures association between two numerical variables.

cov(X,Y) = Σ(Xi − X̄)(Yi − Ȳ) / (n − 1)

X, Y: variables; Xi, Yi: observations for unit i; X̄, Ȳ: means of the variables; n: number of observations

o cov(X,Y) = 0: X & Y are uncorrelated (no linear association)
o cov(X,Y) > 0: X & Y POSITIVELY associated: greater values of X correspond w/ greater values of Y
o cov(X,Y) < 0: X & Y NEGATIVELY associated: greater values of X correspond w/ smaller values of Y

The sign (+/-) of cov shows the type of linear relationship between X & Y.

The magnitude of the cov is hard to interpret, hence a normalized cov is used.
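A minimal sketch of the sample covariance from the formula above; the blood-pressure and age values are invented and the n − 1 denominator is assumed.

```python
def covariance(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

systolic_bp = [120, 130, 125, 140, 150]   # hypothetical X
age = [35, 45, 40, 55, 60]                # hypothetical Y
print(covariance(systolic_bp, age))        # positive: greater X with greater Y
```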

NORMALIZATION: the creation of scaled versions of statistics to allow comparisons by eliminating scale effects.

Correlation Coefficients (CC): normalized versions of covariance.

CC measure the degree of correlation.

Commonly used CC:
o Pearson Correlation Coefficient
o Spearman's Rank Correlation Coefficient

PEARSON Correlation Coefficient (r): a measure of the linear correlation between variables X & Y.

A linear X,Y relationship is modeled best by a straight line.

r = cov(X,Y) / (sx · sy)

X, Y: variables; cov(X,Y): covariance; sx, sy: Standard Deviations of X and Y

o r = -1: total NEGATIVE correlation
o r = 0: NO correlation
o r = +1: total POSITIVE correlation

r removes the dependence on the units by scaling the cov by the product of the SDs of X & Y.

r is not robust to: outliers, unequal variances, non-normality, & non-linearity.
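A minimal sketch of Pearson's r as the covariance scaled by the product of the two SDs, reusing the hypothetical data from the covariance sketch:

```python
import statistics

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    return cov / (statistics.stdev(x) * statistics.stdev(y))

systolic_bp = [120, 130, 125, 140, 150]
age = [35, 45, 40, 55, 60]
print(round(pearson_r(systolic_bp, age), 3))   # close to +1: strong positive linear correlation
```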

RANK: relative position in a graded group.

RANKING: a transformation of data in which values are replaced by their rank.

Ranking of the numerical dataset { 3.4, 5.1, 7.3, 2.6, 8.9 }:

Value: 8.9  7.3  5.1  3.4  2.6
Rank:   5    4    3    2    1

SPEARMAN'S Rank Correlation Coefficient (ρ): a measure of monotonic dependence between variables X & Y.

In a monotonic X,Y relationship, Y moves in one direction (up or down) as X moves, but the relationship is not necessarily linear.

ρ is calculated by applying the Pearson CC formula to the ranks of the data, not to the values. For a sample of size n, the n raw scores Xi, Yi are converted to ranks xi, yi.

ρ reflects the Monotone Trend (M.T.) between X & Y:

o ρ = +1: perfect increasing M.T.
o +1 > ρ > 0: increasing M.T. (Y increases when X increases)
o ρ = 0: no M.T.
o -1 < ρ < 0: decreasing M.T. (Y decreases when X increases)
o ρ = -1: perfect decreasing M.T.

ρ is robust to outliers, unequal variances, non-normality, & non-linearity.

ρ is non-parametric, as its exact sampling distribution can be obtained w/o knowing the parameters of the joint probability distribution of X & Y.
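A minimal sketch of Spearman's ρ: rank the data, then apply the Pearson formula to the ranks. The X values reuse the ranking example above; the Y values are invented and monotonically related, and ties are ignored for simplicity.

```python
def ranks(values):
    # Rank 1 = smallest value (no ties handled in this simple sketch).
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def pearson_of(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

x = [3.4, 5.1, 7.3, 2.6, 8.9]
y = [10, 22, 31, 8, 45]                        # hypothetical, increases whenever x increases
print(pearson_of(ranks(x), ranks(y)))          # 1.0: perfect increasing monotone trend
```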

SSDC → IDM → EDA → DDA → PoC

Statistics is an essential component of both the Science & Business of Medicine.

Despite this fact, statistical illiteracy & innumeracy are prevalent among physicians. This situation should be remedied.

The study of Statistics poses many challenges, but those are well worth overcoming.

Statistics is based on a definite number of principles. It is best studied in an incremental fashion.

SSDC → IDM → EDA → DDA → PoC

Statistics reflects acts of interpretation, not irrefutable facts.

Statistics can be misused & abused.

Statistical Analyses are the result of a multiphase process that:
o starts at Sample Selection,
o ends with Presentation of Conclusions.

Appraisal of Statistical Analyses requires familiarity with all phases.

Understanding the Tabular, Numerical and Graphical Methods of EDA is critical for assessing the quality of a Statistical Analysis.

To be continued

Author wishes to thank: Stephen DeCherney, MD, MPH for his valuable comments.

Nothing to disclose: there are no known conflicts of interest associated with this presentation. Specifically, neither the author nor his family has any potential conflicts of interest, financial or otherwise, regarding any of the products and/or services discussed here.
