Discuss the significance of Statistics for Physicians
Suggest the study strategies for learning Statistics
researchers to be ignorant of statistics. Many are not ashamed (and some seem proud) to admit that they don't know anything about Statistics.1
Doctors may not be expected to be experts in statistics, but they should be capable of understanding basic statistical methodology. 2
Medical students may not like statistics, but as good
Ability to understand:
the value of published
Medical Research.
the role of Statistics in
Medical Business.
In the past the USE of Statistics was its most significant aspect.
Statistics is an essential aspect of modern science. Before Statistics, the science was perceived as the process of
the Wilcoxon rank sum test; Poisson regression models; the Bayesian estimates; Wald χ2 statistics; Cox proportional hazards; compared using t tests; repeated-measures ANOVA; adjusted hazard ratios; 2-stage statistical model; 95% confidence interval; the degrees of freedom; odds ratio
Kaplan-Meier method; Pearson χ2; Fisher exact test; to have 90% power; Mann-Whitney test; a 2-tailed α level of less than .05; the log-rank test; a 2-factor analysis of variance; χ2 tests; the Z test; logistic regression models; stratified Mantel-Haenszel analysis
Data dredging (data fishing, data snooping, equation fitting) is the inappropriate use of statistics to uncover misleading relationships.
design should occur before the use of more sophisticated analysis. Complex statistical methods should be approached with caution.
QUANTITATIVE ERA.
In clinical environment everything is measured.
All aspects of physicians' work are being
WILL BE CRUSHED.
Data Mining
o Physician Performance analyses
o Outcome & Cost-Effectiveness analyses
o Practitioner vs. Peer-Group comparison analyses
o More accurate Actuarial analyses
Statistics is challenging for everybody. Physicians may find it especially challenging, as Statistics is:
Math-based. It has many rigorous quantitative aspects rooted in mathematics. Most physicians are not used to studying math-based subjects.
Time consuming. It is a tedious subject requiring a
Get Motivated by understanding why you need Statistics.
Learn Actively: it cannot be passively crammed:
o Use pen & paper: for solving problems & reflecting on ideas
o Make your own scenarios
Study Deliberately as:
o few words & symbols can mean a lot in statistics
o it may be necessary to read a topic many times
Study Incrementally:
o Statistics is based on a small number of principles
o Those must be memorized & understood first
o It is futile to look up an advanced test (e.g. one used in a research paper) w/o knowing those essentials
Assemble Resources:
o There is no single best statistical manual
o It pays to prepare a set of personalized references
Source: University of Oxford & LISA: Laboratory for Interdisciplinary Statistical Analysis at Virginia Tech
SAMPLE SELECTION & DATA COLLECTION (SSDC)
INITIAL DATA MANIPULATION (IDM)
o Data Formatting
o Data Quality Control
EXPLORATORY DATA ANALYSIS (EDA)
o Tabular, Numerical, Graphical data summaries
o Choosing ways of Definitive Analysis
DEFINITIVE DATA ANALYSIS (DDA)
o Final Inferential Data Analysis
PRESENTATION of CONCLUSIONS (PoC)
o Concise graphical & tabular summaries
o Statement of conclusions
published studies.
This truth is frequently overlooked.
GOALS:
DI
AI
SAMPLING:
Simple Random
Stratified
Cluster
Multistage
STRATIFIED Sample
STRATA: homogeneous
nonoverlapping subgroups
STRATIFICATION: dividing
CLUSTER Sample
CLUSTERS: natural heterogeneous
in population
CLUSTER Sample is obtained by
MULTISTAGE Sample: a form of cluster sampling used when including all the sample elements in all the selected clusters is not feasible. Instead, the researcher randomly selects elements from within the clusters.
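The four sampling schemes listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 100 numbered patients, the two strata, and the five "clinic" clusters are all made up for the example, and the function names are hypothetical.

```python
import random

def simple_random_sample(population, k, rng):
    """Every unit has the same chance of selection."""
    return rng.sample(population, k)

def stratified_sample(strata, k_per_stratum, rng):
    """Draw a simple random sample inside every stratum."""
    return {name: rng.sample(units, k_per_stratum)
            for name, units in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random and keep every unit in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

def multistage_sample(clusters, n_clusters, k_per_cluster, rng):
    """Pick clusters at random, then sample units inside each one."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], k_per_cluster)
            for name in chosen}

# Hypothetical population: 100 patient IDs (illustrative only).
rng = random.Random(0)
patients = list(range(1, 101))
strata = {"group_a": patients[:50], "group_b": patients[50:]}
clinics = {f"clinic_{i}": patients[i * 20:(i + 1) * 20] for i in range(5)}

srs = simple_random_sample(patients, 10, rng)
strat = stratified_sample(strata, 5, rng)
clus = cluster_sample(clinics, 2, rng)
multi = multistage_sample(clinics, 2, 4, rng)

print("simple random:", sorted(srs))
print("stratified   :", {k: sorted(v) for k, v in strat.items()})
print("cluster      :", {k: len(v) for k, v in clus.items()})
print("multistage   :", {k: sorted(v) for k, v in multi.items()})
```

Note the difference between the last two: the cluster sample keeps all 20 units of each chosen clinic, while the multistage sample keeps only 4 randomly selected units per chosen clinic.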
Putting a Data Set in order, making it usable:
Data Formatting
Checking Quality of:
o Data (outliers?)
o Implementation of Design
OUTLIERS: data points that deviate remarkably from the majority of the sample.
units in a sample.
It is assumed that the Empirical Distribution is a good representation of the Population Distribution
is a listing or function showing all the possible values of the data and how often they occur.
Distribution of categorical data shows the number
NORMAL Distribution
A PROBABILITY DISTRIBUTION: assigns a probability to each measurable subset of the possible outcomes of a procedure. Normal (Gaussian) distribution is a very common continuous probability distribution
Continuous probability distribution is a
continuous random variable, is a function that describes the relative likelihood for this variable to take on a given value.
NORMAL (GAUSSIAN) DISTRIBUTION
There are myriad probability distributions. Most are related to each other, and ultimately to the Normal
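As a small illustration of the Normal (Gaussian) distribution, the sketch below evaluates its density and a few probabilities with Python's standard library. The helper `normal_pdf` and the choice of a standard Normal (mean 0, SD 1) are assumptions made for the example.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the Normal distribution written out explicitly."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

std_normal = NormalDist(mu=0.0, sigma=1.0)

# The density is a relative likelihood, not a probability.
density_at_mean = normal_pdf(0.0)

# Probabilities come from areas under the curve (differences of the CDF).
within_1sd = std_normal.cdf(1) - std_normal.cdf(-1)
within_2sd = std_normal.cdf(2) - std_normal.cdf(-2)

print(f"density at the mean : {density_at_mean:.4f}")
print(f"P(within 1 SD)      : {within_1sd:.4f}")
print(f"P(within 2 SD)      : {within_2sd:.4f}")
```

The two probabilities are the familiar "about 68%" and "about 95%" figures for one and two standard deviations around the mean.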
TABULAR:
NUMERICAL: GRAPHICAL:
Quantiles & Quartiles; Median; Mean; Mode; Spread or Dispersion; Interquartile Range; Standard Deviation; Coefficient of Variation
FREQUENCY DISTRIBUTION: is an organization of the raw data in the tabular form using classes & frequencies
Frequency: the number of times a value occurs in a data set
Relative Frequency: frequency counts expressed as percentages
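A frequency distribution for categorical data can be tabulated directly. The sketch below is a minimal example; the ten blood-group values are invented for illustration and `frequency_table` is a hypothetical helper name.

```python
from collections import Counter

def frequency_table(values):
    """Map each distinct value to (frequency, relative frequency in %)."""
    counts = Counter(values)
    total = len(values)
    return {value: (count, 100.0 * count / total)
            for value, count in sorted(counts.items())}

# Hypothetical raw data (illustrative only).
blood_groups = ["A", "O", "B", "O", "A", "O", "AB", "O", "A", "B"]
table = frequency_table(blood_groups)

print("Class  Freq  Rel.freq")
for value, (freq, rel) in table.items():
    print(f"{value:<6} {freq:>4}  {rel:6.1f}%")
```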
Useful for
categorical data.
It presents the
Contingency table (cross tab) is used to analyze the relationship between two or more categorical variables.
Example: 100 individuals are randomly sampled from a population as part of a study of sex differences in handedness.
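A cross tabulation for an example of this kind can be built as sketched below. The cell counts are assumed for illustration only (the slide text does not give them), and `contingency_table` is a hypothetical helper.

```python
from collections import Counter

def contingency_table(pairs):
    """Cross-tabulate (row category, column category) pairs."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    table = {r: {c: counts[(r, c)] for c in cols} for r in rows}
    row_totals = {r: sum(table[r].values()) for r in rows}
    col_totals = {c: sum(table[r][c] for r in rows) for c in cols}
    return table, row_totals, col_totals

# Assumed counts for 100 sampled individuals (illustrative only).
records = ([("male", "right")] * 43 + [("male", "left")] * 9 +
           [("female", "right")] * 44 + [("female", "left")] * 4)

table, row_totals, col_totals = contingency_table(records)

print(f"{'':8}{'left':>6}{'right':>6}{'total':>6}")
for sex, cells in table.items():
    print(f"{sex:8}{cells['left']:>6}{cells['right']:>6}{row_totals[sex]:>6}")
print(f"{'total':8}{col_totals['left']:>6}{col_totals['right']:>6}{len(records):>6}")
```

The row and column totals (the margins) give the distribution of each variable on its own, while the cells show how the two variables occur together.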
Quantiles & Quartiles
Location
o median
o mean
o mode
Spread or Dispersion
o Range
o Interquartile range
o Variance
o Standard deviation
o Coefficient of variation
Skewness
o Coefficient of Skewness
Kurtosis
o Coefficient of Kurtosis
Covariance
Correlation
o Correlation Coefficients: Pearson's CC, Spearman's rank CC
Simple Definition: QUANTILES: Points taken at regular intervals, that divide the data set into equal subsets.
Example of Formal Definition: The p-th sample quantile, denoted q(p), is the smallest value such that (100p)% of the observations for the variable take values which are less than or equal to q(p).
to denote the subsets they mark. These subsets should be called thirds, quarters, fifths, etc.
three POINTS that divide the data set into four equal groups, each comprising a quarter of the data. A quartile is a type of quantile.
Q1: First: lower = 25th percentile
o splits lowest 25% of data
Q2: Second: median = 50th percentile
Q2=5
Q2=5.5
Interquartile Range (IQR): the difference between upper and lower quartiles.
Q2
IQR= Q3-Q1
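Quartiles and the IQR can be computed as sketched below. Two assumptions are made: the data sets are simply the integers 1 to 9 and 1 to 10 (chosen so that the medians are 5 and 5.5), and the "exclusive" interpolation rule of Python's `statistics.quantiles` is used. Other software may use a different quartile convention and return slightly different Q1 and Q3.

```python
from statistics import quantiles

def quartiles_and_iqr(data):
    """Return (Q1, Q2, Q3, IQR) using the 'exclusive' quantile method."""
    q1, q2, q3 = quantiles(data, n=4, method="exclusive")
    return q1, q2, q3, q3 - q1

odd_sized = list(range(1, 10))    # 1..9, assumed example data
even_sized = list(range(1, 11))   # 1..10, assumed example data

for name, data in [("1..9", odd_sized), ("1..10", even_sized)]:
    q1, q2, q3, iqr = quartiles_and_iqr(data)
    print(f"{name:6} Q1={q1} Q2={q2} Q3={q3} IQR={iqr}")
```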
Finding the position of the value in the data set that best characterizes it.
Median: value separating the higher half of a data set from the lower half
Mean = (6 + 4 + 7 + 10 + 4) / 5 = 6.2
Mean is affected by outliers, median is not.
Median exhibits robustness against outliers.
Robustness: the ability to resist.
for data drawn from a wide range of probability distributions & not affected by outliers
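The robustness of the median can be checked numerically. The sketch below uses the five values 6, 4, 7, 10, 4 and then replaces one of them with an extreme value; the outlier of 100 is an assumption made only for the demonstration.

```python
from statistics import mean, median

values = [6, 4, 7, 10, 4]
with_outlier = [6, 4, 7, 100, 4]   # 10 replaced by an assumed outlier

print(f"original    : mean={mean(values):.1f}  median={median(values)}")
print(f"with outlier: mean={mean(with_outlier):.1f}  median={median(with_outlier)}")
```

The mean moves from 6.2 to 24.2, while the median stays at 6.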
Measures of Spread:
Range
Interquartile range
Variance Standard deviation Coefficient of variation
Measure of how far a set of numbers is spread out: how far the numbers are located from the mean.
s2 is never negative
s2=0: no variation
n = Number of values; Xi = Each of the values of the data; X̄ = Mean
Standard Deviation (SD): square root of the Variance. It shows the extent of variation from the mean.
s Small: data close to
which a variable is measured. It can be misleading when comparing variables using different units.
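A minimal sketch of the variance and SD calculation follows. It assumes the sample form of the variance, s2 = sum of (Xi - mean)^2 divided by (n - 1); the slides may have used a divisor of n instead. The data are the five values 6, 4, 7, 10, 4 from the mean example.

```python
import math
from statistics import stdev, variance

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)

values = [6, 4, 7, 10, 4]
s2 = sample_variance(values)
s = math.sqrt(s2)

print(f"variance s2 = {s2:.2f}")
print(f"SD       s  = {s:.3f}")

# A data set with no variation has s2 = 0.
print("constant data:", sample_variance([5, 5, 5, 5]))
```

The SD is expressed in the same units as the data, whereas the variance is in squared units.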
COEFFICIENT
In Mathematics: Number or other known factor (symbol) by which another number or factor is multiplied.
E.g.: in the equation ax^2 + bx + c = 0
CV has no units.
It can be used for comparing dispersions of variables measured in different units.
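The unit-free property of the CV can be illustrated as below, taking CV as the SD divided by the mean (it is often also multiplied by 100 and reported as a percentage). The height and weight values are invented for the example.

```python
from statistics import mean, stdev

def coefficient_of_variation(xs):
    """SD divided by the mean; dimensionless."""
    return stdev(xs) / mean(xs)

# Hypothetical measurements (illustrative only).
heights_cm = [160.0, 165.0, 170.0, 175.0, 180.0]
heights_m = [h / 100 for h in heights_cm]
weights_kg = [55.0, 62.0, 70.0, 81.0, 95.0]

cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)
cv_kg = coefficient_of_variation(weights_kg)

print(f"SD of height: {stdev(heights_cm):.3f} cm vs {stdev(heights_m):.5f} m")
print(f"CV of height: {cv_cm:.4f} (cm) vs {cv_m:.4f} (m)")
print(f"CV of weight: {cv_kg:.4f}")
```

The SD of height changes with the unit of measurement, but the CV does not, so the CVs of height and weight can be compared directly.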
The degree of peakedness of the distribution - as compared to a Normal (Gaussian) Distribution. It is unit-free.
b2>3: Leptokurtic
o Peaked > Normal
b2=3: Mesokurtic
o Peaked as Normal
b2<3: Platykurtic
o Peaked < Normal
s = Standard Deviation; n = Number of values; Xi = Each of the data values
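The classification above can be tried out with a moment-based coefficient of kurtosis. This sketch assumes b2 = m4 / m2^2, where m2 and m4 are the second and fourth central moments computed with a divisor of n; the exact formula on the original slide may differ in its divisor. Both data sets are invented.

```python
def kurtosis_b2(xs):
    """Moment coefficient of kurtosis: m4 / m2**2 (divisor n)."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    return m4 / m2 ** 2

def classify(b2, tol=1e-9):
    if abs(b2 - 3) <= tol:
        return "mesokurtic"
    return "leptokurtic" if b2 > 3 else "platykurtic"

flat = [1, 2, 3, 4, 5]                       # evenly spread values
peaked = [0, 0, 0, 0, 0, 0, 0, 0, -5, 5]     # many central values, two extremes

for name, data in [("flat", flat), ("peaked", peaked)]:
    b2 = kurtosis_b2(data)
    print(f"{name:7} b2={b2:.2f} -> {classify(b2)}")
```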
random variables change together. Dependence is any statistical relationship between two random variables. Correlation refers to statistical relationships involving dependence.
n: number of paired observations
Sign (+/-) of cov shows the type of linear relationship between X&Y.
The magnitude of the cov is hard to interpret, hence normalized cov is used.
r = 0: NO correlation
r=+1: total POSITIVE correlation
r removes the dependence on the units by scaling the cov by the product of the SDs of X and Y.
r is not robust to: outliers, unequal variances, non-normality, & non-linearity
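The relationship between the covariance and Pearson's r can be sketched as follows. The sample covariance with divisor n - 1 is assumed, and the small data sets are invented for illustration.

```python
import math

def covariance(xs, ys):
    """Sample covariance of paired observations (divisor n - 1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def pearson_r(xs, ys):
    """Covariance scaled by the product of the two SDs."""
    sx = math.sqrt(covariance(xs, xs))
    sy = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sx * sy)

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]            # rises with x
y_down = [10, 8, 6, 4, 2]          # falls as x rises
y_up_rescaled = [v * 1000 for v in y_up]   # same data in smaller units

print("cov(x, y_up)          =", covariance(x, y_up))
print("cov(x, y_up_rescaled) =", covariance(x, y_up_rescaled))
print("r(x, y_up)            =", round(pearson_r(x, y_up), 6))
print("r(x, y_up_rescaled)   =", round(pearson_r(x, y_up_rescaled), 6))
print("r(x, y_down)          =", round(pearson_r(x, y_down), 6))
```

Changing the units multiplies the covariance by 1000 but leaves r unchanged, which is why r is easier to interpret than the raw covariance.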
ρ=0: no M.T.
o -1<ρ<0: decreasing M.T. (Y decreases when X increases)
Spearman's rank CC is robust to outliers, unequal variances, non-normality, & non-linearity.
It is non-parametric, as the exact sampling distribution can be obtained w/o knowing
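Spearman's rank correlation can be computed as Pearson's correlation applied to the ranks of the data. The sketch below assumes that definition, with tied values given the average of their ranks; the data sets are invented and the helper names are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(xs, ys):
    return pearson_r(average_ranks(xs), average_ranks(ys))

x = [1, 2, 3, 4, 5]
y_cubed = [v ** 3 for v in x]          # monotonic but non-linear
y_outlier = [1, 2, 3, 4, 1000]         # one extreme value

for name, y in [("cubed", y_cubed), ("outlier", y_outlier)]:
    print(f"{name:8} Pearson r={pearson_r(x, y):.3f}  "
          f"Spearman rho={spearman_rho(x, y):.3f}")
```

In both cases Y always increases with X, so the rank correlation is 1, while Pearson's r is pulled below 1 by the non-linearity or by the outlier.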
Despite this fact statistical illiteracy & innumeracy This situation should be remedied. Study of Statistics poses many challenges, but
Statistics reflects acts of interpretation, not irrefutable facts.
Statistics can be misused & abused.
Statistical Analyses are the result of a multiphasic process that:
To be continued
Author wishes to thank: Stephen DeCherney, MD, MPH for his valuable comments.
Nothing to disclose: there are no known conflicts of interest associated with this presentation. Specifically, neither the author nor his family has any potential conflicts of interest, financial or otherwise, regarding any of the products and/or services discussed here.