
Sample exploratory data analysis Info Sys 271

Why are we doing this?


Part of doing an exploratory data analysis is calculating lots of statistics and doing lots
of graphs that might not make it into your final write-up. Graphs and statistics that
aren’t directly used in the report can be included in appendices to your report. Part of
the reason we do exploratory data analysis is so we can justify the methods we use in
further analysis. Some statistical methods you will encounter require that the data have
certain characteristics, such as being approximately normally distributed (e.g. not too
skewed and not bi-modal).
Exploratory data analysis is also an opportunity to look for interesting features of the
data that enable us to form hypotheses (e.g. relationships which might allow us to
predict one variable from another). Exploratory data analysis is not a tool for making
conclusions (supporting or rejecting hypotheses) because it doesn’t tell us whether the
trends we think we can see are actually statistically significant. An exploratory data
analysis write-up should be short!

What are we doing?


Numerical methods
Calculate means and standard deviations for data. Sometimes you might want to
calculate medians and modes if you suspect the data is not normally distributed. The
mode should be used for nominal level data. The median should be used for ordinal
level data. Calculate values for skewness and kurtosis. Where we have nominal
level variables we might be particularly interested in calculating means and standard
deviations for each value of the nominal data. This will help us work out whether
there is a difference between the categories. For example, we should calculate these
statistics separately for each value of the nominal variable recording who picked the
stocks (the values are pros, darts, djia).
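
As a rough sketch of the per-category calculation described above, the following Python computes the same kinds of statistics for each group. The numbers in `returns` are made-up illustrative values, not the data set summarised in this handout, and the helper names (`sample_skewness`, `sample_excess_kurtosis`, `describe`) are hypothetical. The skewness and kurtosis use the adjusted sample formulas that many statistics packages report; whether the table in this handout was produced with exactly these formulas is an assumption.

```python
from statistics import mean, median, stdev

# Hypothetical illustrative returns (percent); NOT the handout's data.
returns = {
    "pros":  [12.0, -5.0, 30.0, 8.0, 1.0, 20.0],
    "darts": [3.0, -10.0, 25.0, 0.0, -4.0, 16.0],
    "djia":  [6.0, 2.0, 9.0, 5.0, 3.0, 11.0],
}

def sample_skewness(values):
    """Adjusted sample skewness (needs at least 3 values)."""
    n = len(values)
    m, s = mean(values), stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)

def sample_excess_kurtosis(values):
    """Adjusted sample excess kurtosis (needs at least 4 values)."""
    n = len(values)
    m, s = mean(values), stdev(values)
    fourth = sum(((x - m) / s) ** 4 for x in values)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

def describe(values):
    """Collect the descriptive statistics for one category."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "sd": stdev(values),  # sample standard deviation (n - 1 denominator)
        "skewness": sample_skewness(values),
        "kurtosis": sample_excess_kurtosis(values),
    }

# One summary per value of the nominal variable.
summary = {group: describe(values) for group, values in returns.items()}
for group, stats in summary.items():
    print(group, {name: round(value, 3) for name, value in stats.items()})
```

Putting the groups side by side like this makes it easy to spot differences in means or spreads that are worth commenting on.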

Descriptive Statistics

          N  Minimum  Maximum     Mean  SE (Mean)  Std. Dev.  Skewness  SE (Skew.)  Kurtosis  SE (Kurt.)
PROS    100   -37.80    75.00  10.9470     2.2247    22.2466      .479        .241      .317        .478
DARTS   100   -43.00    72.90   4.5210     1.9388    19.3883      .770        .241     2.237        .478
DJIA    100   -13.10    22.50   6.7930      .8031     8.0315     -.280        .241     -.215        .478

Valid N (listwise): 100

Commenting on the numerical data
Usually we won’t want to comment on all the numerical data, just the interesting
differences. Differences in means tell us that we might be able to find a statistically
significant difference later. Differences in standard deviations between categories can
violate the assumptions behind some parametric statistical tests and invalidate any
results we get.
The Pros had the highest mean increase in stock values (10.9) followed by the Dow
Jones Index (6.8) and the Darts (4.5). The Pros and Darts groups had similar standard
deviations (22.2 and 19.4 respectively) while the DJIA had a much lower standard
deviation (8.0).
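
The Descriptive Statistics table reports a standard error next to the mean, skewness and kurtosis, and these are easy to confuse with the standard deviation. The sketch below uses the conventional formulas (standard error of the mean = standard deviation / sqrt(N), plus the usual large-sample standard errors for skewness and kurtosis, which depend only on N). That the table was produced with these formulas is an assumption, but with N = 100 they give values that agree with the table to the precision shown. The function names are hypothetical.

```python
from math import sqrt

def se_mean(sd, n):
    """Standard error of the mean from the sample standard deviation."""
    return sd / sqrt(n)

def se_skewness(n):
    """Conventional standard error of sample skewness; depends only on n."""
    return sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

def se_kurtosis(n):
    """Conventional standard error of sample excess kurtosis."""
    return 2 * se_skewness(n) * sqrt((n * n - 1) / ((n - 3) * (n + 5)))

n = 100
# Standard deviations copied from the Descriptive Statistics table.
std_devs = {"PROS": 22.2466, "DARTS": 19.3883, "DJIA": 8.0315}

for group, sd in std_devs.items():
    print(f"{group}: SD = {sd:.4f}, SE of mean = {se_mean(sd, n):.4f}")
print(f"SE of skewness = {se_skewness(n):.3f}")
print(f"SE of kurtosis = {se_kurtosis(n):.3f}")
```

The standard deviation describes the spread of the individual values; the standard error describes how precisely the mean (or skewness, or kurtosis) has been estimated, so it shrinks as N grows.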

Graphical data analysis


Graphical data analysis should provide a quick visual summary. From looking at a
graph for 10 seconds you should be able to infer general trends in the data. Graphical
data analysis is essential to check for normal distributions. A histogram or scatterplot
of interval level data will reveal any quirks in the data such as being bi-modal, skewed
or having a truncated tail. Nominal or ordinal data should be represented as a bar
graph. Box and whisker plots can be used to find differences between nominal
categories on an interval variable (e.g. between stock values for Darts, Pros and the
DJIA). To facilitate quick visual inspection there are a few tips you should follow:
always use the same axis ranges and labels where you want different graphs to be
comparable; don’t put more than 7 labels on an axis; and don’t put more than about
five trend lines on a graph (fewer if the trend lines are hard to distinguish).
Some graphs are designed so that it is easy to extract numeric information from them
(often histograms or carefully labelled pie charts enable numeric analysis). Pie charts
should only be used to show proportions of a whole (or to demonstrate the capabilities
of a colour printer).

Distribution of the percentage price change on investments for the Pros


The distribution of percentage price change on investments for the professionals is
approximately normal, with some positive skew.

Correlational analysis
Correlational analysis can be performed by numerical analysis only but if you don’t
do a scatterplot it is easy to miss important trends such as non-linear relationships.
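
A small sketch of why the numbers alone can mislead: the Pearson correlation measures only linear association, so a perfectly regular curved relationship can produce a correlation of zero. The data below are constructed for illustration, and `pearson_r` is a hypothetical helper written out by hand so the calculation is visible.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Constructed illustrative data.
xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]  # straight-line relationship
curved = [x * x for x in xs]      # U-shaped relationship

r_linear = pearson_r(xs, linear)
r_curved = pearson_r(xs, curved)
print(f"linear: r = {r_linear:.3f}")
print(f"curved: r = {r_curved:.3f}")
```

In the curved case y is completely determined by x, yet the correlation gives no hint of it. A scatterplot would show the U shape immediately.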

If you look carefully at these scatterplots you will see that there are only three unique
scatterplots but each is duplicated (for example, by correlating Pros with Darts as well
as Darts with Pros). Confusing! But it does allow you to inspect a lot of data to find
where correlations exist with a quick glance. This is what we want for exploratory
data analysis especially if we have a lot of variables to explore. Plots like this
shouldn’t be used where anyone might want to extract numeric data from the plot.
By looking at the numerical summaries of the correlations we can assess the
magnitude of a correlation. The square of a correlation tells us what proportion of the
variance in one variable is shared with (explained by) the other variable.
CORRELATION IS NOT CAUSATION so all we can say is that the scores vary
together. Sometimes we have a third factor that is causing both of our correlated
variables to change. For example, an increase in the value of stocks of Pros does not
cause the value of the DJIA to increase. A rise in the stockmarket causes both to
change. Positive or negative correlations refer to the direction of the correlation.

Correlations (Pearson Correlation)

          PROS       DARTS      DJIA
PROS      1.000      .324(**)   .538(**)
DARTS     .324(**)   1.000      .428(**)
DJIA      .538(**)   .428(**)   1.000

** Correlation is significant at the 0.01 level (2-tailed).

We can say:
The performance of the Pros’ stocks is moderately strongly positively correlated with
the performance of the DJIA stocks (a Pearson correlation of 0.538). The
performance of the Darts’ stocks is weakly positively correlated with the performance
of the Pros’ stocks (a Pearson correlation of 0.324).
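
To put the magnitudes in context, the proportion of variance two variables share is the square of their Pearson correlation. The sketch below squares the three correlations reported in the Correlations table; the variable names are illustrative.

```python
# Pearson correlations copied from the Correlations table.
correlations = {
    ("PROS", "DJIA"): 0.538,
    ("DARTS", "DJIA"): 0.428,
    ("PROS", "DARTS"): 0.324,
}

# Proportion of variance shared by each pair: r squared.
shared_variance = {pair: r ** 2 for pair, r in correlations.items()}

for (first, second), r in correlations.items():
    r_squared = shared_variance[(first, second)]
    print(f"{first} vs {second}: r = {r:.3f}, r^2 = {r_squared:.3f}")
```

Even the strongest of these correlations leaves most of the variance unaccounted for, which is worth keeping in mind when describing a correlation as moderate or strong.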

See the scatterplot at http://research.ed.asu.edu/siip/briefs/images/rainrdyscities.gif
for an example of a single scatterplot, showing rainfall vs the number of rainy days.
