
Simple Plot in R

This section will begin a very gentle introduction to plotting in R. The goal is to show how one
can draw the plot of a function using R and annotate the resulting plot. The tutorial is easy to
follow and it gives you a nice overview of the plotting capabilities of R.

R as a Calculator

Start R on your system. This will open R's interactive shell. Let's start by performing a basic
calculation in the shell. Enter the expression 2+2, then press the Enter key. The result follows.

> 2+2
[1] 4

Sweet! R can add two and two! Let's try something a bit harder, such as sin(π/2).

> sin(pi/2)
[1] 1

Good start! Let's produce a list of numbers, storing the result in the vector x.

> x=0:5

We can see what's stored in the variable x by typing its name and hitting the Enter key.

> x
[1] 0 1 2 3 4 5

Let's add 3 to each entry in x.

> x+3
[1] 3 4 5 6 7 8

Let's square each entry in x.

> x^2
[1] 0 1 4 9 16 25

These calculations are pretty typical of the way in which R deals with lists of numbers. Whatever
you do to the list is applied to each element in the list.

Let's do something a bit more substantial. First, let's create a list of numbers using R's seq
command.

> x=seq(0,2*pi,by=pi/2)
> x
[1] 0.000000 1.570796 3.141593 4.712389 6.283185

Note the syntax: seq(a,b,by=increment) produces a list of numbers that starts at a, increments
by increment, and finishes at b. Hence the command x=seq(0,2*pi,by=pi/2) produces a list of
numbers that starts at zero, increments by π/2, and stops at 2π. Now, let's take the sine of each
number in the list stored in x.

> sin(x)
[1] 0.000000e+00 1.000000e+00 1.224647e-16 -1.000000e+00 -2.449294e-16

A little strange, but note the scientific notation. This result is approximately 0, 1, 0, -1, 0, which
is correct. The small errors are due to round-off error.
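
If the tiny round-off terms are distracting, one option is to round them away (a quick cosmetic
fix, not a change to the underlying values):

> round(sin(x), 10)
[1] 0 1 0 -1 0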

Let's apply what we've learned to sketch the graph of a sinusoid.

Plotting a Sinusoid

In this first activity, let's try to draw the graph of a sinusoid with the following equation.

y = 2 sin(2πx - π/2)

In a trigonometry class, the first step would be to identify the amplitude, period, and phase shift.
To that end, we factor out 2π.

y = 2 sin 2π (x - 1/4)

Comparing this equation with the more general form, y = A sin B (x - φ), we see that A = 2, B =
2π, and φ = 1/4. Thus, we have the following facts:

 The amplitude of the sinusoid is |A| = |2| = 2.
 The phase shift of the sinusoid is φ = 1/4.
 The period of the sinusoid is T = 2π/B = 2π/2π = 1.

Thus, when we draw the graph of the sinusoid, we should see it complete one full cycle every 1
unit along the x-axis, it should bounce up and down between 2 and -2, and it should be shifted
1/4 units to the right. With these thoughts in mind, let's sketch 2 periods of the sinusoid, which
means that we should sketch the graph of the sinusoid over the domain [0, 2].

Coding

Let's begin by creating a list of domain values and storing them in x. The following command
starts at 0, increments by 0.01, and stops at 2, giving us a large number of points in the domain
[0,2].

> x=seq(0,2,by=0.01)

We now need to evaluate the sinusoid y = 2 sin 2π (x - 1/4) at each point of the vector x. This is a
simple matter in R.

> y=2*sin(2*pi*(x-1/4))

The above command evaluates the sinusoid at each point in the vector x and stores the result in
the vector y. All that remains is to plot the points in the vector y versus the points in the vector x.

> plot(x,y)

The result of this last command on our system is shown in Figure 1. Results may differ on
different systems depending on the windowing system used by your version of R.

Figure 1. A plot of y = 2 sin 2π (x - 1/4) on the interval [0, 2].

Note that R's default behavior is to plot the individual points. We can change the "type" of the
plot as follows.

> plot(x,y,type='l')

The result is shown in Figure 2. Note that the points are no longer marked and we have a
"smooth curve" instead. Actually, what's going on behind the scenes is each consecutive pair of
points is being joined with a small line segment. Because we've plotted a "lot of points," the
result has the appearance of a "smooth curve."
Figure 2. Changing the type, giving the appearance of a "smooth curve."

You can add axis labels and a title with the commands that follow. After typing the first line,
namely plot(x,y,type='l', (that's a lowercase "L"), press the Enter key. The plus sign (+) on the
next line is R's "line continuation" character and is supplied automatically by R's interactive
shell. Without the "line continuation" character, the lines would be too long to include in this
document.

> plot(x,y,type='l',
+ xlab='x-axis',
+ ylab='y-axis',
+ main='A plot of a sinusoid.')

It's somewhat difficult to type a number of consecutive lines correctly, so if you make a mistake
you may need to start over. However, you can use the "up-arrow" on your keyboard to replay
lines you've already typed. Pressing the "up-arrow" a number of times takes you back through the
"history" of lines you've entered in R's interactive shell. When you reach the line you want, press
Enter, or edit the line and then press Enter.

The result of this code is shown in Figure 3. Note the addition of axis labels and a "main" title.
Figure 3. Adding axis labels and a title to the figure.

Enjoy

We hope you've enjoyed this small introduction to plotting in R. To find out more about R's plot
command, type the following command in R's interactive shell.

> ?plot

This will open the help file for the plot command, where you can read more about its many options.
Sampling from a Discrete Distribution
A colleague of mine asked how he could sample from a discrete distribution. Imagine, he said, a
six-sided die that is "loaded." That is, when thrown, the sides do not show with equal
probabilities. Examine the discrete distribution delineated in the following table. The first
column contains the number showing on the face of the die, while the second column states the
probability of observing that face when the thrown die comes to rest.

A Six-Sided Die
x p
1 0.1
2 0.1
3 0.1
4 0.1
5 0.2
6 0.4

Thus, for example, the probability of throwing a 1 is 0.1, a 10% chance. The probability of
throwing a 6 is 0.4, a 40% chance.

Drawing a Random Sample from the Discrete Distribution

We will now draw from the set (1, 2, 3, 4, 5, 6) with replacement. This means that we draw a
number from the set (1, 2, 3, 4, 5, 6), record the result, then place it back in the set. This means
that it will be available on our next draw from the set.

However, we must also assign numbers (0.1, 0.1, 0.1, 0.1, 0.2, 0.4) representing the probabilities
of selecting a number in the corresponding position of the list (1, 2, 3, 4, 5, 6).

This task is easily accomplished in R. We will use the sample command, which has the
following general syntax.

> sample(x, size, replace = FALSE, prob = NULL)

Here is a description of the arguments to the sample function:

 x - a vector containing the elements from which to sample
 size - the number of elements we wish to choose from the set x
 replace - if TRUE, we return the number to the set x before making another draw
 prob - a vector of probabilities that correspond to each entry in the vector x

Readers can obtain additional help on the sample command by entering the following command
at the R prompt:

> ?sample

Let's begin by drawing 100 numbers from this "discrete" distribution. After you enter the first of
the following code lines, press the Enter key, which results in a new line that starts with a "plus
sign" (+). This is R's line continuation character and is provided by the system when you press
Enter to start a new line. Continue typing the remaining lines of code. On the last line, the
concluding parenthesis (the second one) completes the command; R then executes it and stores
the result in the variable v.

> v=sample(c(1, 2, 3, 4, 5, 6),100,replace=TRUE,
+ prob=c(0.1, 0.1, 0.1, 0.1, 0.2, 0.4))

Well, that was easy! And fast! The variable v now contains 100 random numbers drawn from the
discrete distribution with the assigned probabilities. You can view these numbers by typing v at
the R prompt. The result is shown below.

> v
[1] 1 6 4 6 1 1 3 2 6 3 6 3 6 5 2 5 4 6 6 4 6 5 6 1 5 6 6 6 6 2 6 5 6
[34] 5 6 1 2 6 6 1 2 5 6 5 4 5 5 3 6 1 5 6 5 3 4 6 6 6 5 4 4 5 6 3 6 3
[67] 6 6 2 6 5 4 6 2 6 2 5 5 5 6 4 6 6 3 6 5 6 6 5 4 4 5 5 3 3 5 3 3 2
[100] 5

You can test the length of this vector with the length command.

> length(v)
[1] 100

Using R's Table Command

Now that we have 100 numbers drawn from our discrete distribution, how do we go about
visualizing the result? One idea would be to sort the numbers, then count the number of times
each number occurs. This is most easily accomplished with R's table command.

> table(v)
v
1 2 3 4 5 6
7 9 12 11 24 37

Note that the number 1 was selected 7 times, the number 2 was selected 9 times, and so on.

Crafting a Barplot of the Count

A barplot is one of the best methods for visualizing the results of our sample. The following
command is used to create the barplot shown in Figure 1.
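
A minimal sketch of that command, assuming we feed the table of counts computed above
directly to barplot:

> barplot(table(v))

This draws one bar per face of the die, with the height of each bar given by the corresponding
count from table(v).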

Pie Charts in R
There are a number of important types of plots that are used in descriptive statistics. A common
plot that is frequently used in newspapers and magazines is the pie chart, where the size of a
"wedge of pie" helps the reader visualize the percentage of data falling in a particular category.
Take, for example, the responses of children who were asked their favorite color of M and M
candies.

Favorite Color
Red 40
Blue 30
Green 20
Brown 10

Note that the data in the table are percentages, summing to 100%.

A Simple Pie Chart

We'll first enter the data from the table above, storing the list in the variable mypie.

> mypie=c(40,30,20,10)

You can obtain help on R's combine command by typing ?c at the prompt. This command is used
to concatenate elements into a single list, particularly useful when entering data manually. You
can see the contents of the variable by typing mypie and hitting the Enter key.

> mypie
[1] 40 30 20 10

It is a simple task to create a pie chart using the data stored in mypie.

> pie(mypie)

The command pie(mypie) produces the pie chart shown in Figure 1.


Figure 1. A simple pie chart representing the percentages of children favoring red, blue, green
and brown candies.

In the upcoming exercises, we'll explore further the capabilities of R's pie command. You can
read more about this command by entering the command ?pie and reading the resulting help file.

Adding Labels to the Pie Chart

We can add "names" to the percentages of colored candies favored by the children.

> names(mypie)=c("Red","Blue","Green","Brown")

The names command attaches a "name" to each piece of data. We can see the result of this
command by typing the variable name (mypie) and hitting the Enter key.

> mypie
Red Blue Green Brown
40 30 20 10

Note how each piece of data now has a "name" or "header." We can access individual elements
of the list with R's indexing capability. For example, to access the first element of the list (R
starts indexing at one), enter:

> mypie[1]
Red
40

However, since the data have "names" attached, we can access each data item through its
"name."

> mypie['Red']
Red
40

The pie command knows how to apply these "names" to the pie chart.

> pie(mypie)

When "names" are added to each piece of data (as we did to the variable mypie), R's pie
command automatically adds these names as labels to the corresponding "slice of pie," as shown
in Figure 2.

Figure 2. Adding labels to each "slice of pie."

Using Custom Colors

It won't do to have the "Red" slice colored white. Let's see how we can use customized colors
for each "slice of pie." First, we store the corresponding colors for each slice in the variable
mycolors.

> mycolors=c("red","blue","green","brown")

Next, we pass the colors in mycolors to the col argument of the function pie.

> pie(mypie,col=mycolors)

This last command produces the pie chart shown in Figure 3.

Figure 3. Adding custom colors to each "slice of pie."

Enjoy

We hope you enjoyed this very short introduction to pie charts using R. To find out more about
what can be accomplished with R's pie command, remember to type ?pie, hit Enter, and read the
resulting help file.
Bar Charts in R
There are a number of important types of plots that are used in descriptive statistics. A common
plot that is frequently used in newspapers and magazines is the bar chart, where the height of
each bar helps the reader visualize the amount of data falling in a particular category. Take, for
example, the responses of children who were asked their favorite color of M and M candies.

Favorite Color
Red 40
Blue 30
Green 20
Brown 10

Note that the data in the table are percentages, summing to 100%.

A Simple Bar Chart

We'll first enter the data from the table above, storing the list in the variable x.

> x=c(40,30,20,10)

You can obtain help on R's combine command by typing ?c at the prompt. This command is used
to concatenate elements into a single list, particularly useful when entering data manually. You
can see the contents of the variable by typing x and hitting the Enter key.

> x
[1] 40 30 20 10

It is a simple task to create a bar chart using the data stored in x.

> barplot(x)

The command barplot(x) produces the bar chart shown in Figure 1.


Figure 1. A simple bar chart representing the percentages of children favoring red, blue, green
and brown candies.

In the upcoming exercises, we'll explore further the capabilities of R's barplot command. You
can read more about this command by entering the command ?barplot and reading the resulting
help file.

Adding Labels to the Bar Chart

We can add "names" to the percentages of colored candies favored by the children.

> names(x)=c("Red","Blue","Green","Brown")

The names command attaches a "name" to each piece of data. We can see the result of this
command by typing the variable name (x) and hitting the Enter key.

> x
Red Blue Green Brown
40 30 20 10

Note how each piece of data now has a "name" or "header." We can access individual elements
of the list with R's indexing capability. For example, to access the first element of the list (R
starts indexing at one), enter:

> x[1]
Red
40

However, since the data have "names" attached, we can access each data item through its
"name."

> x['Red']
Red
40

The barplot command knows how to apply these "names" to the bar chart.

> barplot(x)

When "names" are added to each piece of data (as we did to the variable x), R's barplot
command automatically adds these names as labels to the corresponding "bars," as shown in
Figure 2.

Figure 2. Adding labels to each bar.


Using Custom Colors

It won't do to have the bars drawn in the default color. Let's see how we can use customized
colors for each "bar." First, we store the corresponding colors for each bar in the variable
mycolors.

> mycolors=c("red","blue","green","brown")

Next, we pass the colors in mycolors to the col argument of the function barplot.

> barplot(x,col=mycolors)

This last command produces the bar chart shown in Figure 3.

Figure 3. Adding custom colors to each "bar."

Enjoy

We hope you enjoyed this very short introduction to bar charts using R. To find out more about
what can be accomplished with R's barplot command, remember to type ?barplot, hit Enter,
and read the resulting help file.
Histograms in R
There are a number of important types of plots that are used in descriptive statistics. A common
plot that is frequently used in newspapers and magazines is the histogram, a sequence of vertical
bars, where the height of each bar represents a count of the data values falling in a "bin." The
"count", "bins", and "bars" will be explained in the images that follow.

Creating a Histogram of the Standard Normal Distribution

The standard normal distribution is the famous "bell-shaped" curve of statistics, known to have a
mean value of zero and a standard deviation of one. If you are not familiar with the standard
normal distribution, read on. This distribution will be made clear in the images that follow.

We first ask for help on rnorm.

> ?rnorm

The help file describes the use of several commands relating to the normal distribution. The one
of interest for this activity is R's rnorm command, with syntax rnorm(n,mean,sd) and the
following arguments:

 n - the number of random numbers requested
 mean - the mean of the normal distribution of requested numbers
 sd - the standard deviation of the normal distribution of requested numbers

Let's begin by drawing 1000 numbers from this "standard normal" distribution.

> x=rnorm(1000,mean=0,sd=1)

Well, that was easy! And fast! The variable x now contains 1000 random numbers drawn from
the standard normal distribution. You can view these numbers by typing x at the R prompt. The
first few numbers are shown below.

> x
[1] -1.123987512 0.865229526 -1.325374408 0.679182289 -1.184965803
[6] 1.755767521 0.064290993 -1.733885165 0.470695523 0.303721954
[11] 0.496295681 -0.431201657 1.378353239 1.729874427 0.445363031
...

You can test the length of this vector with the length command.

> length(x)
[1] 1000

We have our 1000 random numbers, so just how do we go about creating a histogram of these
numbers? If we were doing the problem by hand, we would first define some categories called
"bins", and then sort the random numbers into these "bins". The final count of occurrences in
each bin might look like the following.

Counting Occurrences in Each Bin

(-3.5,-3] 1
(-3,-2.5] 6
(-2.5,-2] 21
(-2,-1.5] 46
(-1.5,-1] 82
(-1,-0.5] 147
(-0.5,0] 163
(0,0.5] 207
(0.5,1] 171
(1,1.5] 103
(1.5,2] 33
(2,2.5] 13
(2.5,3] 5
(3,3.5] 2

The table shows that there was only one number falling between -3.5 and -3, there were 6
numbers falling between -3 and -2.5, and so on. Proceeding by hand, we would next set up axes
on graph paper, partition the horizontal axis into "bins", then scale the vertical axis to
accommodate the counts in our table. Over each bin we would draw a rectangle having a vertical
height that corresponds to the number of occurrences in that particular bin.
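
Incidentally, R can produce such a count directly. Here is a minimal sketch that combines the
cut command (which sorts numbers into intervals) with the table command we met earlier; your
counts will differ, since the draw is random (and any values beyond ±3.5, if present, are simply
not counted):

> table(cut(x, breaks=seq(-3.5,3.5,by=0.5)))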

Performing this task by hand is a painstaking procedure. R can automate the process for us,
which will allow us to spend more time interpreting the results and less time dealing with the
tedium of sorting 1000 numbers into bins and counting them by hand.

What follows is the simplest way to get a quick histogram of the data in the variable x.

> hist(x)

The command hist(x) responds by spewing some output to the terminal window (more on this
later) and creating the nice-looking histogram shown in Figure 1.
Figure 1. A histogram of the standard normal distribution data in the variable x.

R offers fine-grain control over the appearance and form of its histograms. We can learn more
about this command through the interactive shell of the R environment.

> ?hist

The help system responds with a wealth of information on R's hist command, a snippet of which
follows.

Usage
hist(x, ...)

## Default S3 method:
hist(x, breaks = "Sturges",
freq = NULL, probability = !freq,
include.lowest = TRUE, right = TRUE,
density = NULL, angle = 45, col = NULL, border = NULL,
main = paste("Histogram of" , xname),
xlim = range(breaks), ylim = NULL,
xlab = xname, ylab,
axes = TRUE, plot = TRUE, labels = FALSE,
nclass = NULL, ...)

Let's first focus on the "breaks" argument to the hist command. This argument can be used in a
number of ways.
1. You can "suggest" the number of breaks ("bins") you want. For example, if we wanted
fewer bins, we'd pass this request to the hist command as follows.

> hist(x,breaks=5)

Note that there are not exactly 5 bins in Figure 2, but there are fewer.

Figure 2. The command hist(x,breaks=5) only "suggests" 5 bins.

2. There is a second way to proceed, and that entails directing the hist command to use a
specific set of bins. Let's say that we want the histogram to use the bins described in our
tabular work. Use the seq command to produce this set of bins.

> bins=seq(-3.5,3.5,by=0.5)

You can see the result of this command by typing bins at the R prompt.

> bins
[1] -3.5 -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5
[14] 3.0 3.5

These are precisely the bins in our table. We can now request that R use these bins with
the following command.

> hist(x,breaks=bins)

The result is shown in Figure 3.


Figure 3. The command hist(x,breaks=bins) uses the bins stored in the variable bins.

Annotations and Color

Finally, we'll add a bit of color and our own personal annotations to the axes and title. In the
code that follows, the plus (+) sign is the line continuation character. After entering hist(x,, hit
the Enter key. The plus sign is added by R's shell automatically. Continue entering the code as
shown, hitting Enter at the end of each line. When you close the parentheses on the last line and
hit Enter, the entire command is executed.

> hist(x,
+ breaks=bins,
+ col="lightblue",
+ xlab="x-values",
+ ylab="count",
+ main="Random Numbers from the Standard Normal Distribution")
The result of this command is shown in Figure 4.

Figure 4. Adding custom axes annotations, a custom title, and color.

Things to Note in Figure 4:

1. Note how the histogram is approximately "balanced" about its mean, which appears to be
located approximately at zero. This is precisely what we would anticipate, given that the
numbers in the variable x were randomly drawn from the standard normal distribution,
which has mean zero.
2. Because the numbers in the variable x were drawn from the standard normal distribution,
which has a standard deviation of one, note that almost all of the data occurs within 3
standard deviations of the mean, either way. That is, note that almost all of the data in the
variable x occurs between -3 and 3, which represents 3 standard deviations of 1 on either
side of the mean 0.

Enjoy!

We hope you enjoyed this introduction to the R system, which provides a powerful interactive
environment for exploration in statistics.
We encourage you to explore further. Use the command ?hist to learn more about what you can
do with the hist command.

Boxplots in R
In this activity we show our readers how to create a boxplot in R. In preparation for this activity,
we must first explore what statisticians call "measures of central tendency," specifically the
mean and median of a data set.

Measures of Central Tendency

We first create a set of data that we will use throughout this activity. Although our data set is
somewhat artificial, all of what we explain in this activity (as it relates to our data set) can also
be applied to any set of data chosen by our readers. With this thought in mind, we enter our data
set at the R prompt.

> x=c(0, 4, 15, 1, 6, 3, 20, 5, 8, 1, 3)

We can examine the contents of the variable x with the following command:

> x
[1] 0 4 15 1 6 3 20 5 8 1 3

The Mean

One of the most important measures of central tendency in statistics is the mean, which is found
by summing the elements of the data set, then dividing by the number of elements in the set. We
can sum the elements in the data set contained in the variable x with the R-command sum.

> sum(x)
[1] 66

Readers should convince themselves (get out the pencil and paper) that the elements in the
variable x do indeed sum to 66. Add up the individual elements in the list stored in x and show
that the sum is 66.

To find the number of elements in the list stored in the variable x, we can get out our abacus and
count them, or we can use R's length command.

> length(x)
[1] 11

Readers should check that the list stored in x does indeed have 11 elements. Count them!

To find the average of the list stored in x, we divide the sum by the number of elements in the
list.

> sum(x)/length(x)
[1] 6

Recall that the sum of the elements in x was 66, the length was 11, so the average (or mean) is
66/11=6, as verified by our R-command sum(x)/length(x).

However, because finding the mean is such a common requirement in most statistical analyses, it
should come as no surprise that R has a command for finding the mean of a data set.

> mean(x)
[1] 6

The Median

The mean of a data set can be strongly influenced by "outliers" in the data. Consider anew the
data stored in the variable x.

> x
[1] 0 4 15 1 6 3 20 5 8 1 3

Let's sketch a quick histogram of the data stored in x. The following command produces the
histogram shown in Figure 1.

> hist(x)
Figure 1. A histogram of the data stored in the variable x. Note that the data is badly skewed to
the right.

Note the long tail to the right in Figure 1. Statisticians say that the data is "skewed to the right."

Imagine that the bars of the histograms represent masses of equal density. If we were to place a
fulcrum or "knife-edge" located at the mean (at x = 6), the masses would balance. The outliers
can greatly affect the placement of the mean. It's like an old-fashioned "teeter-totter." A child
seated at greater distance from the fulcrum is able to balance a much heavier child seated closer
to the fulcrum.

To pursue this line of reasoning a bit further, imagine that the numbers contained in the variable
x represent speakers' fees in thousands of dollars. Let's sort the data in ascending order.

> sort(x)
[1] 0 1 1 3 3 4 5 6 8 15 20

It is probably unfair to say that the "average speaking fee is $6,000." Although statistically
correct, the average (mean) speaking fee ($6,000) does not reflect a common speaking charge for
this collection of speakers. Indeed, the two outliers (the speakers charging $15,000 and $20,000)
unduly influence the mean.
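
To see just how much, drop the two largest fees and recompute the mean of the sorted data:

> mean(sort(x)[1:9])
[1] 3.444444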

A second measure of central tendency, a statistic called the median, will be seen to more
closely resemble what a group might be charged should they hire one of the speakers represented
in the data set stored in x. The median is defined to be the data item that is precisely in the
middle of the sorted data set; that is, half (50%) of the data occurs to the left of the median, and
half (50%) occurs to the right of the median.

In the case that the data set has an odd number of elements, it is a simple matter to spot the data
item that lies precisely in the middle. The data set stored in the variable x has 11 elements.
Hence, the sixth element of the sorted data lies exactly in the middle. Thus, the median is 4. Note
that this number represents a speaking fee of $4,000, which is probably more representative of a
"middling fee" that a group might expect should they use one of the speakers represented by the
data stored in x.
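
We can pick out that middle element directly:

> sort(x)[6]
[1] 4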

Of course, R finds the median with ease.

> median(x)
[1] 4

If a data set has an even number of elements, the median is found by averaging the two "middle
elements." For example, the following data set has six elements.

> y=1:6
> y
[1] 1 2 3 4 5 6

The median is found by averaging the third and fourth elements; that is, the median is (3 + 4)/2 =
3.5. R is completely aware of the even case.

> median(y)
[1] 3.5

Quantiles

The median of a data set is located so that 50% of the data occurs to the left of the median (and
50% of the data occurs to the right of the median). There is no reason to restrict our attention to
the 50% level. For example, we can find a point where 25% of the data occurs on its left (and
75% to its right). This point is known as the first "quartile" and is found with the following R
command:

> sort(x)
[1] 0 1 1 3 3 4 5 6 8 15 20
> quantile(x,0.25)
25%
2

To help explain, we've listed the data set in ascending order. R provides nine different algorithms
for computing the 25% quantile which can be viewed by typing the command ?quantile. The
default technique is to use linear interpolation to find the entry in the position given by the
formula 1 + p(n -1), where p is the required percentage and n is the length of the data set. In this
particular case, p = 0.25 and n = 11, so 1 + p(n -1) = 3.5. Thus, R will interpolate (linearly) a
number that is exactly halfway between the third and fourth entries, arriving at 1 + 0.5(3 - 1) =
2.
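
We can carry out this interpolation by hand in R. Here is a minimal sketch of the default
formula (the helper names s, pos, and lo are ours, introduced only for illustration):

> s=sort(x)
> pos=1+0.25*(length(x)-1)        # the formula 1 + p(n-1) gives position 3.5
> lo=floor(pos)
> s[lo]+(pos-lo)*(s[lo+1]-s[lo])  # interpolate between the 3rd and 4th entries
[1] 2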

In similar fashion, R will approximate the 75% quantile with the following command:

> sort(x)
[1] 0 1 1 3 3 4 5 6 8 15 20
> quantile(x,0.75)
75%
7

Note that 1 + p(n - 1) = 1 + 0.75(11 - 1) = 8.5, so R reports the number that is exactly halfway
between the eighth and ninth entries, namely 6 + 0.5(8 - 6) = 7.

In general, the p% quantile will be a number that finds p% of the data to its left. For the
remainder of this activity, the most important statistics are the minimum, first quartile, median,
third quartile, and the maximum. We can use the quantile command to compute all of these at
once.

> quantile(x,c(0,0.25,0.5,0.75,1))
0% 25% 50% 75% 100%
0 2 4 7 20

However, R's summary command will report each of these quantiles with descriptive headers,
and throw in the mean for good measure.

> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 2 4 6 7 20

The Spread of the Data Set

Two pairs of numbers in the summary for our data set give the user a sense of the "spread" of the
data involved. The first is the range of the data set.

> range(x)
[1] 0 20

Note that R's range command reports the minimum and maximum entries in the data set. In
addition, R's IQR command gives the interquartile range.

> IQR(x)
[1] 5

The interquartile range is the difference between the 75% quantile and the 25% quantile. In this
case, IQR = 7 - 2 = 5.
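
Equivalently, we can compute it from the quantiles directly (unname simply strips the "75%"
label from the result):

> unname(quantile(x,0.75)-quantile(x,0.25))
[1] 5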

The Standard Boxplot

It is easier to explain the boxplot if we first have a picture to which we can refer in the
discussion. So, without any further ado, here is how R produces a boxplot for the data in the
variable x.

> boxplot(x,range=0)

The above command was used to produce the boxplot shown in Figure 2.

Figure 2. The minimum, quartiles, median, and maximum are used to construct a "box and
whisker plot."

So, how is this boxplot constructed? First, recall the summary data for the data in the variable x.

> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 2 4 6 7 20

Here are the steps for creating the standard box and whiskers plot.

1. Draw a thick, dark, horizontal segment at the median, that is, at 4. See Figure 2.
2. Second, draw horizontal lines at the first and third quartiles, that is, at 2 and at 7. Use
these to draw the "box." See Figure 2.
3. From the bottom edge of the box, draw a "whisker" that extends to the minimum data
value, namely 0. See Figure 2.
4. From the top edge of the box, draw a "whisker" that extends to the maximum data value,
namely 20. See Figure 2.

There are several important points worth making in regard to our box and whisker plot.

1. 50% of the data occurs between the lower and upper edges of the box, namely, between
the first and third quartiles located at 2 and 7, respectively.
2. The lower 50% of the data occurs below the median, the dark horizontal line in the box in
Figure 2. Likewise, the upper 50% of the data occurs above the median line in the box.
3. The lower 25% of the data occurs between the bottom edge of the box and the bottom
end of the lower whisker. Likewise, the upper 25% of the data occurs between the top edge
of the box and the top end of the upper whisker.

The Modified Box Plot

The Standard Box Plot does not pay special attention to outliers that might be present. The
Modified Box Plot is constructed so as to highlight outliers. As in the Standard Boxplot
described above, let's begin with a picture. Note that the Modified Boxplot is the default in R,
and requires no special parameters.

> boxplot(x)

The above command was used to produce the modified "box and whiskers" plot shown in Figure
3.

Figure 3. A modified boxplot marks outliers for "special attention."

Before explaining the construction, let's repeat the sorted data and the summary information.

> sort(x)
[1] 0 1 1 3 3 4 5 6 8 15 20
> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 2 4 6 7 20

Here are the steps required to construct a modified box and whiskers plot:

1. The median and the quartiles are used to construct the box in exactly the same manner
used to construct the standard boxplot.
2. Multiply the IQR by 1.5. So, in this case, 1.5 x IQR = 1.5 (5) = 7.5. Let's call this result
the STEP. That is, STEP = 7.5.
3. Add the STEP to the third quartile, obtaining 3rd Quartile + STEP = 7 + 7.5 = 14.5. Use
this to perform two tasks:
a. Any data beyond 14.5 is plotted using an empty circle. This explains the two
circles at 15 and 20 in Figure 3.
b. Locate the largest data point below 14.5. This is the number 8. This is where the
end of the upper whisker is drawn.
4. Subtract the STEP from the first quartile, obtaining 1st Quartile - STEP = 2 - 7.5 = -5.5.
Use this to perform two tasks:
a. Any data below -5.5 is plotted using an empty circle. There are no such data
points in Figure 3.
b. Locate the smallest data point that occurs above -5.5. This is the number 0. This is
where the end of the lower whisker is drawn.
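
We can let R do the fence arithmetic from the steps above. A minimal sketch (the variable step
is ours, introduced for illustration; note that R keeps the "75%" and "25%" labels on the results):

> step=1.5*IQR(x)
> quantile(x,0.75)+step   # upper fence: data beyond this is marked with a circle
 75%
14.5
> quantile(x,0.25)-step   # lower fence
 25%
-5.5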

Enjoy!

We hope you enjoyed this introduction to the R system, which provides a powerful interactive
environment for exploration in statistics.

We encourage you to explore further. Use the command ?boxplot to learn more about what you
can do with the boxplot command.
Discrete Distributions in R
The discrete distributions of statistics are not continuous. Usually, they are built from a finite
number of possible values for the random variable, and each possibility is assigned a probability
of occurrence.

The Bernoulli Distribution

One of the simplest discrete distributions is called the Bernoulli Distribution. This is a
distribution with only two possible values. For example, consider the Bernoulli distribution in
the table that follows:

A Bernoulli Distribution

x p
0 0.3
1 0.7

In this case, there are only two possible values of the random variable, x = 0 or x = 1. When
drawing numbers from this distribution, the probability of selecting x = 0 is 0.3, while the
probability of selecting x = 1 is 0.7.
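
We can draw from this distribution with the sample command met earlier in this tutorial; a
minimal sketch (your draws will differ, since they are random):

> sample(c(0,1), 10, replace=TRUE, prob=c(0.3,0.7))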

There are many applications that can be represented by Bernoulli Distributions. For example, x
= 0 and x = 1 might represent the states of a lamp, with zero representing the state that the lamp
is switched "off" and one representing the state that the lamp is switched "on." Or x = 0 and x =
1 might represent the states of a biased coin, with zero representing "heads" and one representing
"tails."

Because the values of this distribution are discrete, the probability density function will not be a
continuous curve as it was for the Standard Normal Distribution or the Normal Distribution.
The usual way to visualize a discrete distribution is with a sequence of "spikes." Use the
following code to produce the image in Figure 3.

> x=c(0,1)
> y=c(0.3,0.7)
> plot(x,y,type="h",xlim=c(-1,2),ylim=c(0,1),lwd=2,col="blue",ylab="p")
> points(x,y,pch=16,cex=2,col="dark red")

There are a few new constructs introduced in the above code:

1. We use xlim and ylim to place "limits" on the horizontal and vertical axes, respectively.
2. In the points command, the pch argument determines the "plotting character" that is used.
Experiment with the numbers 1 through 16 to see a variety of different choices for the plotting
character. Try ?points to learn more.
3. In the points command, the cex argument determines the "character expansion." We've used it
here to "scale" the size of the plotting character. Try values of 1 or 3 and note the difference.

Figure 3. The discrete Bernoulli probability density function.

Note the "discrete" nature of the visualization.

1. On the x-axis, there are only two possibilities, x = 0 or x = 1.
2. On the vertical axis, the heights of the "spikes" represent probabilities. Note that the height of
the spike at x = 0 is 0.3, the probability of selecting x = 0. The height of the spike at x = 1 is 0.7,
the probability of selecting x = 1.

Key Idea: It is important to note that the sum of the probabilities equals 1, much as it did in the
Standard Normal Distribution and the Normal Distribution. Only this time, it is not the area
under a curve that represents the total probability, but the sum of the probabilities represented by
the heights of the "spikes."

The Binomial Distribution


The Binomial Distribution is best introduced with a simple example. Suppose that we have a
"fair" coin, one in which there is an equal probability of flipping "heads" or "tails." Now,
suppose that we toss the coin three times. Further, suppose that we define a random variable X
that equals the number of heads obtained in three tosses of the fair coin. The outcomes and the
value of the random variable are listed in the table that follows.

Counting the Number of Heads

Outcome X = # Heads
HHH 3
HHT 2
HTH 2
HTT 1
THH 2
THT 1
TTH 1
TTT 0

Because the coin is "fair," each of the outcomes is equally likely, with each having a 1 in 8
chance of occurring. Thus, the probability of tossing three heads (HHH) is 1/8. Next, there are
three ways of obtaining two heads (HHT, HTH, and THH), each with a 1 in 8 chance of
occurring, so the probability of obtaining two heads in three tosses is 3/8. The remaining values
of the random variable and their probabilities are listed in the table that follows.

A Binomial Distribution

x p
3 1/8 = 0.125
2 3/8 = 0.375
1 3/8 = 0.375
0 1/8 = 0.125
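
As a check, R's choose command counts the number of ways to obtain each number of heads;
dividing by the 8 equally likely outcomes reproduces the probabilities in the table:

> choose(3, 0:3)/8
[1] 0.125 0.375 0.375 0.125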

Using dbinom

It is readily apparent that this simple approach will rapidly become quite tedious if we increase
the number of times that we toss the coin. Fortunately, this "binomial distribution" is easily
calculated in R. To calculate the probability of obtaining three heads in three tosses of a "fair"
coin, enter the following code:

> dbinom(3,size=3,prob=1/2)
[1] 0.125

Note that this code gives a result that is identical to the first row in the table above.

The general syntax dbinom(x, size= , prob = ) is fairly straightforward.

1. The argument x is the value of the random variable, the number of heads in the current
example.
2. The size is the "number of trials," or the number of tosses in the current example.
3. The prob is the probability of success, or the probability of obtaining "heads" in the current
example.

We can produce the entire table with the following code:

> p=dbinom(0:3,size=3,prob=1/2)
> p
[1] 0.125 0.375 0.375 0.125

In this case, the input for x is the vector 0:3, which produces upon expansion 0, 1, 2, and 3. We
can create the "spikes" shown in Figure 4 with the following code:

> n=3
> p=1/2
> x=0:3
> p=dbinom(x,size=n,prob=p)
> plot(x,p,type="h",xlim=c(-1,4),ylim=c(0,1),lwd=2,col="blue",ylab="p")
> points(x,p,pch=16,cex=2,col="dark red")
Figure 4. The discrete binomial distribution describing the number of heads in three tosses of a "fair" coin.

These commands become increasingly effective as we increase the number of trials, in this case
the number of "tosses" of the fair coin.

> n=10
> p=1/2
> x=0:10
> p=dbinom(x,size=n,prob=p)
> plot(x,p,type="h",xlim=c(-1,11),ylim=c(0,0.5),lwd=2,col="blue",ylab="p")
> points(x,p,pch=16,cex=2,col="dark red")
Figure 5. The discrete binomial distribution describing the number of heads in ten tosses of a "fair" coin.

Using pbinom

Now, suppose that we toss a "fair" coin ten times (the distribution shown in Figure 5) and we ask
the following question: "What is the probability of obtaining 4 or fewer heads?" One approach is
to realize that this probability is the sum of the individual probabilities. That is, to obtain the
probability of obtaining 4 or fewer heads, we would add the probabilities of obtaining 0, 1, 2, 3,
or 4 heads.

> sum(dbinom(0:4,size=10,prob=1/2))
[1] 0.3769531

Alternatively, this sounds suspiciously similar to the questions we asked in the activities The
Standard Normal Distribution and The Normal Distribution. In those activities we used the
pnorm command. In similar fashion, we can use the pbinom command in the current situation.

> pbinom(4,size=10,prob=1/2)
[1] 0.3769531

Note again that it's always the probability to the left of and including the number of successes.

Let's ask another question. Suppose that we want to know the probability of obtaining between 5
and 8 heads, inclusive. We could sum the probabilities of obtaining 5, 6, 7, or 8 heads.

> sum(dbinom(5:8,size=10,prob=1/2))
[1] 0.6123047

Warning: The following computation is an incorrect application of the pbinom command for this
example.

> pbinom(8,size=10,prob=1/2)-pbinom(5,size=10,prob=1/2)
[1] 0.3662109

The error in this calculation is due to our failure to note that the distribution is discrete. In
subtracting the probability of getting 5 or fewer heads from the probability of getting 8 or fewer
heads, we're left with the probability of getting 6, 7, or 8 heads, which is not the goal. Here is
the correct calculation:

> pbinom(8,size=10,prob=1/2)-pbinom(4,size=10,prob=1/2)
[1] 0.6123047

Using the binomial distribution in Figure 5, let's look at one final question and compute the
probability that X is greater than 7. One obvious way to go is to compute the sum of the
probabilities that X is 8, 9, or 10.

> sum(dbinom(8:10,size=10,prob=1/2))
[1] 0.0546875

A second approach uses the pbinom command, which must always sum the probabilities to the
left of or equal to a given number. In this case, the probability that X is greater than or equal to 8
(8, 9, or 10) is equal to 1 minus the probability that X is less than or equal to 7 (0, 1, 2, 3, 4, 5, 6,
or 7).

> 1-pbinom(7,size=10,prob=1/2)
[1] 0.0546875

Enjoy!

We hope you enjoyed this introduction to the discrete distributions in general, binomial
distributions in particular, using the R system. We encourage you to explore further.
Continuous Distributions in R
A continuous function in mathematics is one whose graph can be drawn in one continuous
motion without ever lifting pen from paper. We've already seen examples of continuous
probability density functions. For example, the probability density function

f(x) = (1/√(2π)) e^(−x²/2)

from The Standard Normal Distribution was an example of a continuous function, having the
continuous graph shown in Figure 1.

Figure 1. The standard normal distribution is continuous.

In the activities The Standard Normal Distribution and The Normal Distribution, we saw that
dnorm, pnorm, and qnorm provided values of the density function, cumulative probabilities,
and quantiles, respectively. In this activity, we will explore several continuous probability
density functions and we will see that each has variants of the d, p, and q commands.
The Uniform Distribution

The Uniform Distribution is defined on an interval [a, b]. The idea is that any number selected
from the interval [a, b] has an equal chance of being selected. Therefore, the probability density
function must be a constant function. Because the total area under the probability density curve
must equal 1 over the interval [a, b], the probability density function must be defined as follows:

The Uniform Probability Density Function on [a, b]

f(x) = 1/(b − a) for a ≤ x ≤ b, and f(x) = 0 otherwise

Figure 2. The probability density function for the uniform distribution on [a, b].

For example, the uniform probability density function on the interval [1,5] would be defined by
f(x) = 1/(5-1), or equivalently, f(x) = 1/4. The following code sketches this uniform probability
density function over the interval [1,5]. The resulting plot is shown in Figure 3.

> x=seq(1,5,length=200)
> y=rep(1/4,200)
> plot(x,y,type="l",xlim=c(0,6),ylim=c(0,0.5),lwd=2,col="red",ylab="p")
> polygon(c(1,x,5),c(0,y,0),col="lightgray",border=NA)
> lines(x,y,type="l",lwd=2,col="red")

Some comments on this code are in order:

1. The rep command takes the value 1/4 and duplicates it 200 times. This stores a vector in
the variable y having length 200, each entry of which is 1/4. This is necessary because the
plot(x,y) command requires that vectors x and y have equal length.
2. We've seen the polygon command before, but setting the argument border=NA ("not
applicable") prevents the borders of the shaded region from being drawn.
3. The last lines command redraws the horizontal uniform density function so that it appears
atop the shaded polygon.

Key Idea: Note that the area of the shaded region is 1, as it should be for all probability density
functions.
Figure 3. The shaded area under the uniform density function is 1.

Alternatively, we could have used dunif, which will produce density values for the uniform
distribution. The following code will also produce the image shown in Figure 3.

> x=seq(1,5,length=200)
> y=dunif(x,min=1,max=5)
> plot(x,y,type="l",xlim=c(0,6),ylim=c(0,0.5),lwd=2,col="red",ylab="p")
> polygon(c(1,x,5),c(0,y,0),col="lightgray",border=NA)
> lines(x,y,type="l",lwd=2,col="red")

Note that the arguments min=1 and max=5 provide the endpoints of the interval [1,5] on which
the uniform probability density function is defined.

Using punif

Suppose that we would like to find the probability that the random variable X is less than or
equal to 2. To calculate this probability, we would shade the region under the density function to
the left of and including 2, then calculate its area.

> x=seq(1,5,length=200)
> y=dunif(x,min=1,max=5)
> plot(x,y,type="l",xlim=c(0,6),ylim=c(0,0.5),lwd=2,col="red",ylab="p")
> x=seq(1,2,length=100)
> y=dunif(x,min=1,max=5)
> polygon(c(1,x,2),c(0,y,0),col="lightgray",border=NA)
> x=seq(1,5,length=200)
> y=dunif(x,min=1,max=5)
> lines(x,y,type="l",lwd=2,col="red")

The above code produces the image shown in Figure 4.

Figure 4. The shaded area under the uniform probability density function represents the probability that x ≤ 2.

In Figure 4, note that the width of the shaded area is 1, the height is 1/4, so the area of the shaded
region is 1/4. Thus, the probability that x ≤ 2 is 1/4.

We can use the punif command to compute the probability that x ≤ 2.

> punif(2,min=1,max=5)
[1] 0.25

Note that the punif command works in precisely the same manner as did the pnorm command in
the activity The Normal Distribution. It calculates the area to the left of a given number under the
uniform probability density function.

Let's look at another example. Suppose that we wanted to find the probability that x lies between
2 and 4. We could draw the uniform distribution, then shade the area under the curve between 2
and 4.

> x=seq(1,5,length=200)
> y=dunif(x,min=1,max=5)
> plot(x,y,type="l",xlim=c(0,6),ylim=c(0,0.5),lwd=2,col="red",ylab="p")
> x=seq(2,4,length=100)
> y=dunif(x,min=1,max=5)
> polygon(c(2,x,4),c(0,y,0),col="lightgray",border=NA)
> x=seq(1,5,length=200)
> y=dunif(x,min=1,max=5)
> lines(x,y,type="l",lwd=2,col="red")

The above code produces the image shown in Figure 5.

Figure 5. The shaded area under the uniform probability density function represents the probability that 2 < x < 4.

Note that the width of the shaded area is 2, the height is 1/4, so the area is 1/2. That is, the
probability that 2 < x < 4 is 1/2.

We can use punif to arrive at the same conclusion. To find the area between 2 and 4, we must
subtract the area to the left of 2 from the area to the left of 4.

> punif(4,min=1,max=5)-punif(2,min=1,max=5)
[1] 0.5

Finally, if we need to find the area to the right of a given number, simply subtract the area to the
left of the given number from the total area; i.e., subtract the area to the left of the given number
from 1. So, the following calculation will find the probability that x > 4.

> 1-punif(4,min=1,max=5)
[1] 0.25

Using qunif

The command qunif will find quantiles for the uniform distribution in the same way as we saw
qnorm find quantiles in the activity The Normal Distribution. Thus, to find the 25th
percentile for the uniform distribution on the interval [1,5], we execute the following code.

> qunif(0.25,min=1,max=5)
[1] 2

Note that this is in total agreement with the result shown in Figure 4, where we saw that the area
of the shaded rectangle in Figure 4 was 0.25. Hence, the number 2 delimits the point at which
25% of the total area under the density function is attained.

The Exponential Distribution

The exponential probability density function is defined on the interval [0, ∞). It has the following
definition.

The Exponential Probability Density Function

f(x) = λe^(−λx) for x ≥ 0

Figure 6. The exponential probability density function is continuous on [0, ∞).

The exponential distribution is known to have mean μ = 1/λ and standard deviation σ = 1/λ.

Suppose that we set λ = 1. Then the mean of the distribution should be μ = 1 and the standard
deviation should be σ = 1 as well. We could use the formula in Figure 6 to produce values of the
probability density function, but as we've seen before, it will be more efficient to use the dexp
command.

> x=seq(0,4,length=200)
> y=dexp(x,rate=1)
> plot(x,y,type="l",lwd=2,col="red",ylab="p")

Note that the argument rate expects us to respond with the value of λ. The code above produces
the exponential distribution shown in Figure 7.
Figure 7. The exponential probability density function is shown on the interval [0,4].

There are a number of applications where the exponential function is a reasonable model. For
example, waiting times in queues are often modeled with an exponential distribution.

The exponential probability density function is shown on the interval [0,4] in Figure 7. However,
remember that the full domain is on [0,∞), so we've shown only part of the full picture. In
determining on what domain to draw the function, we extended the interval three standard
deviations to the right of the mean.

Unlike the normal and uniform distributions, the exponential distribution is not symmetric about
its mean. However, in Figure 7 there is reasonable evidence that the distribution will "balance"
about the mean at μ = 1.
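
As a quick sanity check of the mean and standard deviation, we can draw a large sample and
summarize it; your numbers will vary slightly, since the draws are random:

> z=rexp(100000, rate=1)   # 100,000 draws from this exponential distribution
> mean(z)                  # should be close to 1
> sd(z)                    # should also be close to 1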

Using pexp

Suppose that we want to find the probability that x ≤ 1. We would shade the area under the
exponential probability density function to the left of 1, as shown in Figure 8.

> x=seq(0,4,length=200)
> y=dexp(x,rate=1)
> plot(x,y,type="l",lwd=2,col="red",ylab="p")
> x=seq(0,1,length=200)
> y=dexp(x,rate=1)
> polygon(c(0,x,1),c(0,y,0),col="lightgray")

The code above produces the shaded region under the exponential distribution shown in Figure 8.

Figure 8. Shading the region under the density curve to the left of x = 1

Now, just as we did with the uniform distribution above, we will use the pexp command to
compute the area of the shaded region in Figure 8.

> pexp(1,rate=1)
[1] 0.6321206

You may find it surprising that the answer was not 50%! However, the mean at x = 1 is not the
median! The graph of the exponential is "skewed to the right" and the extreme outliers at the
right strongly influence the mean, pushing it to the right of the median.

As a second example, suppose that we want to find the probability that x lies between 1 and 2.

> x=seq(0,4,length=200)
> y=dexp(x,rate=1)
> plot(x,y,type="l",lwd=2,col="red",ylab="p")
> x=seq(1,2,length=200)
> y=dexp(x,rate=1)
> polygon(c(1,x,2),c(0,y,0),col="lightgray")

The above code produces the shaded region under the exponential density curve in Figure 9.
Figure 9. Calculating the probability that x lies between 1 and 2.

Again, to find the shaded area in Figure 9, we must subtract the area to the left of x = 1 from the
area to the left of x = 2.

> pexp(2,rate=1)-pexp(1,rate=1)
[1] 0.2325442

Thus, the probability of selecting a number between 1 and 2 from this distribution is
approximately 0.2325442.

Finally, if we need to find the area to the right of a given number, simply subtract the area to the
left of the given number from the total area. For example, to find the probability that x > 3,
subtract the probability that x ≤ 3 from 1.

> 1-pexp(3,rate=1)
[1] 0.04978707

Using qexp

The command qexp will find quantiles for the exponential distribution in the same way as we
saw qunif find quantiles for the uniform distribution. Thus, to find the 50th percentile for the
exponential distribution, we execute the following code.

> qexp(0.50,rate=1)
[1] 0.6931472

This result is in keeping with the fact that the distribution is skewed badly to the right. The
outliers at the right end greatly influence the mean, pushing it to the right. With the mean at x =
1, the current result for the median makes good sense. Fifty percent of the data lies to the left of
x = 0.6931472 and fifty percent of the data lies to the right of x = 0.6931472.

Regarding d, p, and q

As we've seen in this and previous activities, the letters d, p, and q have special meanings:

 "d" is for "density." It is used to find values of the probability density function.
 "p" is for "probability." It is used to find the probability that the random variable lies to
the left of a given number.
 "q" is for "quantile." It is used to find the quantiles of a given distribution.

In connection with the normal distribution, dnorm calculates values of the normal probability
density function. Similarly, dbinom, dunif, and dexp calculate values of the binomial, uniform,
and exponential probability density functions, respectively.

In connection with the normal distribution, pnorm calculates area under the normal probability
density function to the left of a given number. Similarly, pbinom, punif, and pexp calculate area
under the binomial, uniform, and exponential probability density functions to the left of a given
number, respectively.

In connection with the normal distribution, qnorm calculates quantiles for the normal probability
density function. Similarly, qbinom, qunif, and qexp calculate quantiles for the binomial,
uniform, and exponential probability density functions, respectively.

This common use of d, p, and q is by design. There are a host of statistical distributions that
we've yet to introduce. For example, there is a distribution called the Beta Distribution. The
commands dbeta, pbeta, and qbeta would be used to calculate values of the beta probability
density function, to calculate the area to the left of a given number underneath the beta
probability density function, and to calculate quantiles for the beta distribution.
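
For instance, here is the d, p, q trio at work on the standard normal distribution; note how
qnorm undoes pnorm:

> dnorm(0)      # density at x = 0
[1] 0.3989423
> pnorm(0)      # area to the left of x = 0
[1] 0.5
> qnorm(0.5)    # the value with 50% of the area to its left
[1] 0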

Enjoy!

We hope you enjoyed this introduction to the uniform and exponential distributions in R. We
encourage you to explore further. Try typing ?dunif or ?dexp and reading the resulting help
files.
The Standard Normal Distribution in R
One of the most fundamental distributions in all of statistics is the Normal Distribution or the
Gaussian Distribution. According to Wikipedia, "Carl Friedrich Gauss became associated with
this set of distributions when he analyzed astronomical data using them, and defined the equation
of its probability density function. It is often called the bell curve because the graph of its
probability density resembles a bell."

The Probability Density Function

The probability density function for the normal distribution having mean μ and standard
deviation σ is given by the function in Figure 1.

The Normal Probability Density Function

f(x) = (1/(σ√(2π))) e^(−(x−μ)²/(2σ²))

Figure 1. The probability density function for the normal distribution.

If we let the mean μ = 0 and the standard deviation σ = 1 in the probability density function in
Figure 1, we get the probability density function for the standard normal distribution in Figure 2.

The Standard Normal Probability Density Function

f(x) = (1/√(2π)) e^(−x²/2)

Figure 2. The probability density function for the standard normal distribution has mean μ = 0 and standard
deviation σ = 1.

It is a simple matter to produce a plot of the probability density function for the standard normal
distribution.

> x=seq(-4,4,length=200)
> y=1/sqrt(2*pi)*exp(-x^2/2)
> plot(x,y,type="l",lwd=2,col="red")

If you'd like a more detailed introduction to plotting in R, we refer you to the activity Simple
Plot in R. However, these commands are simply explained.

1. The command x=seq(-4,4,length=200) produces 200 equally spaced values between -4 and 4
and stores the result in a vector assigned to the variable x.
2. The command y=1/sqrt(2*pi)*exp(-x^2/2) evaluates the probability density function of Figure 2
at each entry of the vector x and stores the result in a vector assigned to the variable y.
3. The command plot(x,y,type="l",lwd=2,col="red") plots y versus x, using:
o a solid line type (type="l") --- that's an "el", not an I (eye) or a 1 (one),
o a line width of 2 points (lwd=2), and
o uses the color red (col="red").

The result is the "bell-shaped" curve shown in Figure 3.

Figure 3. The bell-shaped curve of the standard normal distribution.

An Alternate Approach

The command dnorm can be used to produce the same result as the probability density function
of Figure 2. Indeed, the "d" in dnorm stands for "density." Thus, the command dnorm is
designed to provide values of the probability density function for the normal distribution.

> x=seq(-4,4,length=200)
> y=dnorm(x,mean=0,sd=1)
> plot(x,y,type="l",lwd=2,col="red")

These commands produce the plot shown in Figure 4. Note that the result is identical to the plot
in Figure 3.
Figure 4. The bell-shaped curve of the standard normal distribution.

The Area Under the Probability Density Function

Like all probability density functions, the standard normal curves in Figures 3 and 4 possess three very important properties:

1. The graph of the probability density function lies entirely above the x-axis. That is, f(x) ≥ 0 for all
x.
2. The area under the curve (and above the x-axis) on its full domain is equal to 1 (checked numerically just after this list).
3. The probability of selecting a number between x = a and x = b is equal to the area under the
curve from x = a to x = b.
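
Property 2 can be checked with the pnorm command introduced just below; pnorm reports the area under the curve to the left of its argument, so feeding it Inf captures the full domain.

> pnorm(Inf, mean=0, sd=1)
[1] 1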

As an example, consider the area under the standard normal curve shown in Figure 5.
Figure 5. The area under the standard normal curve to the left of x = 0.

If the total area under the curve equals 1, then by symmetry one would expect that the area under
the curve to the left of x = 0 would equal 0.5. R has a command called pnorm (the "p" is for
"probability") which is designed to capture this probability (area under the curve).

> pnorm(0, mean=0, sd=1)
[1] 0.5

Note that the syntax is strikingly similar to the syntax for the density function. The command
pnorm(x, mean = , sd = ) will find the area under the normal curve to the left of the number x.
Note that we use mean=0 and sd=1, the mean and standard deviation of the standard normal distribution.

As a second example, suppose that we wish to find the area under the standard normal curve
( mean = 0, standard deviation = 1) that is to the left of the value of x that is one standard
deviation to the right of the mean, pictured for the reader in Figure 6.
Figure 6. The area under the curve to the left of x = 1.

In this case, we wish to find the area under the curve to the left of x = 1.

> pnorm(1, mean=0, sd=1)
[1] 0.8413447

The interpretation of area as a probability is all-important. This result indicates that if we draw a
number at random from the standard normal distribution, the probability that we draw a number
that is less than or equal to 1 is 0.8413447.

For interested readers, the following code was used to produce the image in Figure 6.

> x=seq(-4,4,length=200)
> y=dnorm(x)
> plot(x,y,type="l", lwd=2, col="blue")
> x=seq(-4,1,length=200)
> y=dnorm(x)
> polygon(c(-4,x,1),c(0,y,0),col="gray")

For help on the polygon command enter ?polygon and read the resulting help file. However, the
basic idea is pretty simple. In the syntax polygon(x,y), the argument x contains the x-coordinates
of the vertices of the polygon you wish to draw. Similarly, the argument y contains the y-
coordinates of the vertices of the desired polygon.
68%-95%-99.7% Rule

The 68% - 95% - 99.7% rule is a rule of thumb that allows practitioners of statistics to estimate the probability that a randomly selected number from the standard normal distribution occurs within 1, 2, or 3 standard deviations of the mean, which is zero.
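
Because R's commands are vectorized, all three probabilities can be computed at once; a quick sketch follows (the individual computations are developed pictorially below). Note that pnorm assumes mean=0 and sd=1 when those arguments are omitted.

> k=1:3
> pnorm(k)-pnorm(-k)
[1] 0.6826895 0.9544997 0.9973002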

Let's first examine the probability that a randomly selected number from the standard normal
distribution occurs within one standard deviation of the mean. This probability is represented by
the area under the standard normal curve between x = -1 and x = 1, pictured in Figure 7.

> x=seq(-4,4,length=200)
> y=dnorm(x)
> plot(x,y,type="l", lwd=2, col="blue")
> x=seq(-1,1,length=100)
> y=dnorm(x)
> polygon(c(-1,x,1),c(0,y,0),col="gray")

The code above was used to produce the shaded region shown in Figure 7.

Figure 7. The shaded area represents the probability of drawing a number from the standard normal distribution
that falls within one standard deviation of the mean.

Remember that R's pnorm command finds the area to the left of a given value of x. Thus, to find
the area between x = -1 and x = 1, we must subtract the area to the left of x = -1 from the area to
the left of x = 1.
> pnorm(1,mean=0,sd=1)-pnorm(-1,mean=0,sd=1)
[1] 0.6826895

There's the promised 68%!

In similar fashion, we can get the area within two and three standard deviations.

> x=seq(-4,4,length=200)
> y=dnorm(x)
> plot(x,y,type="l", lwd=2, col="blue")
> x=seq(-2,2,length=200)
> y=dnorm(x)
> polygon(c(-2,x,2),c(0,y,0),col="gray")

Figure 8. The shaded area represents the probability of drawing a number from the standard normal distribution
that falls within two standard deviations of the mean.

To find the area between x = -2 and x = 2, we must subtract the area to the left of x = -2 from the
area to the left of x = 2.

> pnorm(2,mean=0,sd=1)-pnorm(-2,mean=0,sd=1)
[1] 0.9544997
There is a 95% chance that the number drawn falls within two standard deviations of the mean.

> x=seq(-4,4,length=200)
> y=dnorm(x)
> plot(x,y,type="l",lwd=2,col="blue")
> x=seq(-3,3,length=200)
> y=dnorm(x)
> polygon(c(-3,x,3),c(0,y,0),col="gray")

Figure 9. The shaded area represents the probability of drawing a number from the standard normal distribution
that falls within three standard deviations of the mean.

To find the area between x = -3 and x = 3, we must subtract the area to the left of x = -3 from the
area to the left of x = 3.

> pnorm(3,mean=0,sd=1)-pnorm(-3,mean=0,sd=1)
[1] 0.9973002

Therefore, the chance that a number drawn randomly from the standard normal distribution falls
within three standard deviations of the mean is 99.7%!

Important Result: We conclude that virtually all numbers from the standard normal distribution
occur within three standard deviations of the mean.
Quantiles

Sometimes the opposite question is asked. That is, suppose that the area under the curve to the
left of some unknown number is known. What is the unknown number?

For example, suppose that the area under the curve to the left of some unknown x-value is 0.95, as shown in Figure 10.

Figure 10. The area under the curve to the left of some unknown x-value is 0.95.

To find the unknown value of x we use R's qnorm command (the "q" is for "quantile").

> qnorm(0.95,mean=0,sd=1)
[1] 1.644854

Hence, there is a 95% probability that a number chosen at random from the standard normal distribution is less than or equal to 1.644854.

In a sense, R's pnorm and qnorm commands play the roles of inverse functions. On one hand,
the command pnorm is fed a number and asked to find the probability that a random selection
from the standard normal distribution falls to the left of this number. On the other hand, the
command qnorm is given the probability and asked to find a limiting number so that the area
under the curve to the left of that number equals the given probability.
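
We can verify this inverse relationship directly; each command below undoes the other (a quick check, relying on the default mean=0 and sd=1).

> qnorm(pnorm(1))
[1] 1
> pnorm(qnorm(0.95))
[1] 0.95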
We must emphasize that the area under the curve to the left is used when applying the commands
pnorm and qnorm. If you are given an area to the right, then you must make a simple
adjustment before applying the qnorm command. Suppose, as shown in Figure 11, that the area
to the right of an unknown number is 0.80.

Figure 11. The area under the curve to the right of some unknown x-value is 0.80.

In this case, we must subtract the area to the right from the number 1 to obtain 1 - 0.80 = 0.20,
which is the area to the left of the unknown value of x shown in Figure 12.
Figure 12. The area under the curve to the left of some unknown x-value is 1 - 0.80 = 0.20.

We can now use the qnorm command to find the unknown value of x.

> qnorm(0.20,mean=0,sd=1)
[1] -0.8416212

We now know that the probability of selecting a number from the standard normal distribution
that is greater than or equal to -0.8416212 is 0.80.
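
As an aside, R also provides a shortcut for working with areas to the right: both pnorm and qnorm accept a lower.tail argument. Setting lower.tail=FALSE tells qnorm that the supplied probability is the area to the right, making the subtraction above unnecessary.

> qnorm(0.80, mean=0, sd=1, lower.tail=FALSE)
[1] -0.8416212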

Enjoy!

We hope you enjoyed this introduction to the standard normal distribution in the R system. We encourage you to explore further.
The Normal Distribution in R
One of the most fundamental distributions in all of statistics is the Normal Distribution or the
Gaussian Distribution. According to Wikipedia, "Carl Friedrich Gauss became associated with
this set of distributions when he analyzed astronomical data using them, and defined the equation
of its probability density function. It is often called the bell curve because the graph of its
probability density resembles a bell."

The Probability Density Function

The probability density function for the normal distribution having mean μ and standard
deviation σ is given by the function in Figure 1.

f(x) = 1/(σ√(2π)) e^(−(x − μ)²/(2σ²))

Figure 1. The probability density function for the normal distribution.

If we let the mean μ = 0 and the standard deviation σ = 1 in the probability density function in
Figure 1, we get the probability density function for the standard normal distribution in Figure 2.

f(x) = 1/√(2π) e^(−x²/2)

Figure 2. The probability density function for the standard normal distribution has mean μ = 0 and standard deviation σ = 1.

In the activity The Standard Normal Distribution, we examined the normal distribution having
mean and standard deviation 0 and 1, respectively. You might want to work your way through
that activity before continuing.

It is a simple matter to produce a plot of the probability density function for the standard normal
distribution.
> x=seq(-4,4,length=200)
> y=1/sqrt(2*pi)*exp(-x^2/2)
> plot(x,y,type="l",lwd=2,col="red")

If you'd like a more detailed introduction to plotting in R, we refer you to the activity Simple Plotting in R. However, these commands are easily explained.

1. The command x=seq(-4,4,length=200) produces 200 equally spaced values between -4 and 4
and stores the result in a vector assigned to the variable x.
2. The command y=1/sqrt(2*pi)*exp(-x^2/2) evaluates the probability density function of Figure 2
at each entry of the vector x and stores the result in a vector assigned to the variable y.
3. The command plot(x,y,type="l",lwd=2,col="red") plots y versus x, using:
o a solid line type (type="l") --- that's an "el", not an I (eye) or a 1 (one),
o a line width of 2 points (lwd=2), and
o the color red (col="red").

The result is the "bell-shaped" curve shown in Figure 3.

Figure 3. The bell-shaped curve of the standard normal distribution.


An Alternate Approach

The command dnorm can be used to produce the same result as the probability density function
of Figure 2. Indeed, the "d" in dnorm stands for "density." Thus, the command dnorm is
designed to provide values of the probability density function for the normal distribution.

> x=seq(-4,4,length=200)
> y=dnorm(x,mean=0,sd=1)
> plot(x,y,type="l",lwd=2,col="red")

These commands produce the plot shown in Figure 4. Note that the result is identical to the plot
in Figure 3.

Figure 4. The bell-shaped curve of the standard normal distribution.


The Standard Deviation

The standard deviation represents the "spread" in the distribution. With "spread" as the
interpretation, we would expect a normal distribution with a standard deviation of 2 to be "more
spread out" than a normal distribution with a standard deviation of 1. Let's simulate this idea in
R.

> x=seq(-8,8,length=500)
> y1=dnorm(x,mean=0,sd=1)
> plot(x,y1,type="l",lwd=2,col="red")
> y2=dnorm(x,mean=0,sd=2)
> lines(x,y2,type="l",lwd=2,col="blue")

The above sequence of commands produces the image shown in Figure 5.

Figure 5. A normal distribution with standard deviation 2 is "wider" than the standard normal distribution having
standard deviation 1.

Key Idea: The key idea of importance is to note that the normal curve having standard deviation
equal to 2 is "twice as spread out" as the "standard" normal curve having standard deviation 1.

In similar fashion, a normal curve with standard deviation 1/2 would be "half as wide" as the
standard normal curve.
> x=seq(-8,8,length=500)
> y3=dnorm(x,mean=0,sd=1/2)
> plot(x,y3,type="l",lwd=2,col="green")
> y2=dnorm(x,mean=0,sd=2)
> lines(x,y2,type="l",lwd=2,col="blue")
> y1=dnorm(x,mean=0,sd=1)
> lines(x,y1,type="l",lwd=2,col="red")
> legend("topright",c("sigma=1/2","sigma=2","sigma=1"),
+ lty=c(1,1,1),col=c("green","blue","red"))

The above sequence of commands produces the image shown in Figure 6. Note: Remember that
the "plus" symbol is R's line continuation character. R will provide this character when you hit
the Enter key to end the previous line.

Figure 6. The standard deviation determines the spread of the distribution.

Key Idea: The key idea of importance is to note that the normal curve having standard deviation
equal to 1/2 is "half as spread out" as the "standard" normal curve having standard deviation 1.
The Mean

In Figures 3 through 6, note that each of the normal curves is "centered" about a mean equal to zero. If the mean were different, we would expect the normal curve to be "centered" about the new mean. In the following example, we use R to create a normal curve with mean equal to 10 and standard deviation equal to 2.

> x=seq(4,16,length=200)
> y=dnorm(x,mean=10,sd=2)
> plot(x,y,type="l",lwd=2,col="red")

The above sequence of commands produces the image shown in Figure 7.

Figure 7. A normal distribution with mean equal to 10 and standard deviation equal to 2.

The choice of domain bears some explanation. In the activity The Standard Normal Distribution,
we learned that essentially all of the data occurs within three standard deviations of the mean.
Because the standard deviation is 2, three standard deviations to the left of the mean takes us to
4, while three standard deviations to the right of the mean takes us to 16.
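
This choice of plotting window amounts to a small calculation, which R can perform for us.

> mu=10; sigma=2
> c(mu-3*sigma, mu+3*sigma)
[1] 4 16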

Key Idea: The key idea of importance is to note that this particular normal curve is "centered" or
"balanced" about its mean 10.

As a second example, the following code draws the normal curve with mean equal to 100 with
standard deviation equal to 10.
> x=seq(70,130,length=200)
> y=dnorm(x,mean=100,sd=10)
> plot(x,y,type="l",lwd=2,col="red")

The above sequence of commands produces the image shown in Figure 8.

Figure 8. A normal distribution with mean equal to 100 and standard deviation equal to 10.

Again, the mean is 100 and the standard deviation is 10, so going three standard deviations to either side of the mean leads us to sketch this normal curve on the interval (70, 130).

Key Idea: As in the previous example, the most important thing to note is that the distribution
shown in Figure 8 is "centered" or "balanced" about its mean 100.
The Area Under the Probability Density Function

As in the activity The Standard Normal Distribution, the command pnorm will compute the area
to the left of a given value of x.

> x=seq(70,130,length=200)
> y=dnorm(x,mean=100,sd=10)
> plot(x,y,type="l",lwd=2,col="red")
> x=seq(70,90,length=100)
> y=dnorm(x,mean=100,sd=10)
> polygon(c(70,x,90),c(0,y,0),col="gray")

The above commands produce the image shown in Figure 9.

Figure 9. The area to the left of 90 represents the probability of selecting a number less than 90 from a normal
distribution with mean 100 and standard deviation 10.

The following command will calculate the shaded area in Figure 9.

> pnorm(90,mean=100,sd=10)
[1] 0.1586553
Hence, the probability of selecting a random number less than 90 from a normal distribution
having mean 100 and standard deviation 10 is 0.1586553.

As a second example, the following code shades the area between 90 and 100 shown in Figure
10.

> x=seq(70,130,length=200)
> y=dnorm(x,mean=100,sd=10)
> plot(x,y,type="l",lwd=2,col="red")
> x=seq(90,100,length=200)
> y=dnorm(x,mean=100,sd=10)
> polygon(c(90,x,100),c(0,y,0),col="gray")

Figure 10. The area to the right of 90 and to the left of 100 represents the probability of selecting a number
between 90 and 100 from a normal distribution with mean 100 and standard deviation 10.

The following command will calculate the shaded area in Figure 10.

> pnorm(100,mean=100,sd=10)-pnorm(90,mean=100,sd=10)
[1] 0.3413447
68%-95%-99.7% Rule

In the activity The Standard Normal Distribution, we introduced the 68% - 95% - 99.7% rule in conjunction with the standard normal distribution. The rule works just as well as
a rule of thumb even when the mean and standard deviation change. For example, the following
code shades the region within one standard deviation of the mean in Figure 11.

> x=seq(70,130,length=200)
> y=dnorm(x,mean=100,sd=10)
> plot(x,y,type="l",lwd=2,col="red")
> x=seq(90,110,length=200)
> y=dnorm(x,mean=100,sd=10)
> polygon(c(90,x,110),c(0,y,0),col="gray")

Figure 11. The shaded area represents the probability of drawing a number from the normal distribution (mean =
100, standard deviation = 10) that falls within one standard deviation of the mean.

Remember that R's pnorm command finds the area to the left of a given value of x. Thus, to find
the area between x = 90 and x = 110, we must subtract the area to the left of x = 90 from the area
to the left of x = 110.

> pnorm(110, mean=100, sd=10)-pnorm(90, mean=100, sd=10)
[1] 0.6826895
There's that promised 68% again!

In similar fashion, we can get the area within two and three standard deviations.

> x=seq(70,130,length=200)
> y=dnorm(x,mean=100,sd=10)
> plot(x,y,type="l",lwd=2,col="red")
> x=seq(80,120,length=200)
> y=dnorm(x,mean=100,sd=10)
> polygon(c(80,x,120),c(0,y,0),col="gray")

Figure 12. The shaded area represents the probability of drawing a number from the normal distribution (mean =
100, standard deviation = 10) that falls within two standard deviations of the mean.

To find the area between x = 80 and x = 120, we must subtract the area to the left of x = 80 from
the area to the left of x = 120.

> pnorm(120, mean=100, sd=10)-pnorm(80, mean=100, sd=10)
[1] 0.9544997

Note again that there is a 95% chance that the number drawn falls within two standard deviations
of the mean.

> x=seq(70,130,length=200)
> y=dnorm(x,mean=100,sd=10)
> plot(x,y,type="l",lwd=2,col="red")
> polygon(c(70,x,130),c(0,y,0),col="gray")
Figure 13. The shaded area represents the probability of drawing a number from the normal distribution (mean =
100, standard deviation = 10) that falls within three standard deviations of the mean.

To find the area between x = 70 and x = 130, we must subtract the area to the left of x = 70 from
the area to the left of x = 130.

> pnorm(130, mean=100, sd=10)-pnorm(70, mean=100, sd=10)
[1] 0.9973002

Therefore, the chance that a number drawn randomly from the normal distribution falls within
three standard deviations of the mean is 99.7%!

Important Result: We conclude that virtually all numbers from any normal distribution occur
within three standard deviations of the mean.
Quantiles

In the activity The Standard Normal Distribution, the command qnorm was introduced to
calculate quantiles. This command is fed the area under the curve to the left of some unknown
number, then calculates the unknown number.

For example, suppose that the area under the curve to the left of some unknown x-value is 0.95,
as shown in Figure 14.

Figure 14. The area under the curve to the left of some unknown x-value is 0.95.

To find the unknown value of x we use R's qnorm command (the "q" is for "quantile").

> qnorm(0.95,mean=100,sd=10)
[1] 116.4485

Hence, there is a 95% probability that a number chosen at random from the normal distribution with mean 100 and standard deviation 10 is less than or equal to 116.4485.
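
It is worth noting how this quantile relates to the one computed in the activity The Standard Normal Distribution, where qnorm(0.95,mean=0,sd=1) returned 1.644854. For any normal distribution, the quantile is the mean plus the standard normal quantile scaled by the standard deviation, which we can check directly.

> 100+10*qnorm(0.95,mean=0,sd=1)
[1] 116.4485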

In a sense, R's pnorm and qnorm commands play the roles of inverse functions. On one hand,
the command pnorm is fed a number and asked to find the probability that a random selection from the given normal distribution falls to the left of this number. On the other hand, the
command qnorm is given the probability and asked to find a limiting number so that the area
under the curve to the left of that number equals the given probability.
We must emphasize that the area under the curve to the left is used when applying the commands
pnorm and qnorm. If you are given an area to the right, then you must make a simple
adjustment before applying the qnorm command. Suppose, as shown in Figure 15, that the area
to the right of an unknown number is 0.80.

Figure 15. The area under the curve to the right of some unknown x-value is 0.80.

In this case, we must subtract the area to the right from the number 1 to obtain 1 - 0.80 = 0.20,
which is the area to the left of the unknown value of x shown in Figure 16.
Figure 16. The area under the curve to the left of some unknown x-value is 1 - 0.80 = 0.20.

We can now use the qnorm command to find the unknown value of x.

> qnorm(0.20,mean=100,sd=10)
[1] 91.58379

We now know that the probability of selecting a number from the normal distribution with mean 100 and standard deviation 10 that is greater than or equal to 91.58379 is 0.80.

Enjoy!

We hope you enjoyed this introduction to the normal distribution in the R system. We encourage you to explore further.
Linear Regression in R
In this activity we will explore the relationship between a pair of variables. We will first learn to
create a scatter plot for the given data, then we will learn how to craft a "Line of Best Fit" for our
plot.

Scatterplots

The data set in the table that follows is taken from The Data and Story Library. Researchers
measured the heights of 161 children in Kalama, a village in Egypt. The heights were averaged
and recorded each month, with the study lasting several years. The data is presented in the table
that follows.

Mean Height versus Age

Age in Months    Average Height in Centimeters
18               76.1
19               77.0
20               78.1
21               78.2
22               78.8
23               79.7
24               79.9
25               81.1
26               81.2
27               81.8
28               82.8
29               83.5

To build our scatter plot, we must first enter the data in R. This is a fairly straightforward task. Because the ages are incremented in months, we can use the command age=18:29, which uses R's start:finish syntax to begin a vector at the number 18, then increment by 1 until the number 29 is reached.

> age=18:29

We can view the result by typing age at R's prompt.

> age
[1] 18 19 20 21 22 23 24 25 26 27 28 29

Entering the average heights is a bit more tedious, but straightforward.

> height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)

Again, we can view the result by entering height at R's prompt.

> height
[1] 76.1 77.0 78.1 78.2 78.8 79.7 79.9 81.1 81.2 81.8 82.8 83.5

We can check that age and height have the same number of elements with R's length command.

> length(age)
[1] 12
> length(height)
[1] 12

It is now a simple matter to produce a scatterplot of height versus age.

> plot(age,height)

The result of this command is shown in Figure 1.


Figure 1. A scatterplot of height versus age data.

The Line of Best Fit

Note that the data in Figure 1 is approximately linear. As the age increases, the average height
increases at an approximately constant rate. One could not fit a single line through each and
every data point, but one could imagine a line that is fairly close to each data point, with some of
the data points appearing above the line, others below for balance. In this next activity, we will
calculate and plot the Line of Best Fit, or the Least Squares Regression Line.

We will use R's lm command to compute a "linear model" that fits the data in Figure 1. The command lm is a very sophisticated command with a host of options (type ?lm to view a full description), but in its simplest form, it is quite easy to use. The syntax height~age is called a model equation and is a very sophisticated R construct. We are using its simplest form here. The symbol separating "height" and "age" in the syntax height~age is a "tilde." It is located on the key to the immediate left of the #1 key on your keyboard. You must use the Shift Key to access the "tilde."

> res=lm(height~age)
Let's examine what is returned in the variable res.

> res

Call:
lm(formula = height ~ age)

Coefficients:
(Intercept) age
64.928 0.635

Note the "Coefficients" part of the contents of res. These coefficients are the intercept and slope
of the line of best fit. Essentially, we are being told that the equation of the line of best fit is:

height = 0.635 age + 64.928.

Note that this result has the form y = m x + b, where m is the slope and b is the intercept of the
line.
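
If you need the slope and intercept for further computation, base R's coef function extracts them from the object returned by lm; the rounding below is only for display.

> round(coef(res),3)
(Intercept)         age
     64.928       0.635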

It is a simple matter to superimpose the "Line of Best Fit" provided by the contents of the
variable res. The command abline will use the data in res to draw the "Line of Best Fit."

> abline(res)

The result of this command is the "Line of Best Fit" shown in Figure 2.
Figure 2. The abline command superimposes the "Line of Best Fit" on our previous scatterplot.

The command abline is a versatile tool. You can learn more about this command by entering ?abline and reading the resulting help file.

Prediction

Now that we have the equation of the line of best fit, we can use the equation to make
predictions. Suppose, for example, that we wished to estimate the average height of a child at age
27.5 months. One technique would be to enter this value into the equation of the line of best fit.

height = 0.635 age + 64.928 = 0.635(27.5) + 64.928 = 82.3905.

We can use R as a simple calculator to perform this calculation.

> 0.635*27.5+64.928
[1] 82.3905

Thus, the average height at age 27.5 months is 82.3905 centimeters.

Enjoy!

We hope you enjoyed this introduction to the principles of Linear Regression in the R system.
We encourage you to explore further. Use the commands ?plot, ?lm, and ?abline to learn more
about producing scatterplots and performing linear regression.

Data Frames in R
In this activity we will introduce R's dataframes. Technically, a dataframe in R is a type of object. Less formally, a dataframe is a type of table where the typical use employs the rows as observations and the columns as variables. Take, for example, the Mean Height versus Age data
from the activity Linear Regression in R.

Mean Height versus Age

Age in Months    Average Height in Centimeters
18               76.1
19               77.0
20               78.1
21               78.2
22               78.8
23               79.7
24               79.9
25               81.1
26               81.2
27               81.8
28               82.8
29               83.5

In the activity Linear Regression in R, we entered the age in a vector called age.

> age=18:29
> age
[1] 18 19 20 21 22 23 24 25 26 27 28 29

In similar fashion, we entered the average heights in a vector called height.

> height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
> height
[1] 76.1 77.0 78.1 78.2 78.8 79.7 79.9 81.1 81.2 81.8 82.8 83.5

We will now use R's data.frame command to create our first dataframe and store the results in
the variable village.

> village=data.frame(age=age,height=height)

Let's examine the result of this command.

> village
age height
1 18 76.1
2 19 77.0
3 20 78.1
4 21 78.2
5 22 78.8
6 23 79.7
7 24 79.9
8 25 81.1
9 26 81.2
10 27 81.8
11 28 82.8
12 29 83.5

The result is fairly clear. The contents of the vector age are stored in the first column of the
dataframe under the heading "age" and the contents of the vector height are stored in the second
column of the dataframe under the heading "height."

Examining R's Workspace

Let's examine the contents of our workspace with R's ls() command (which stands for "list the contents of the workspace"). The contents of your workspace might vary, depending on whether or not you have created variables other than those introduced in this activity.

> ls()
[1] "age" "height" "village"

Note that there are three objects in our workspace. We can examine the "class" of each object in our workspace, which tells us what kind of object is contained in each variable.

> class(age)
[1] "integer"
> class(height)
[1] "numeric"
> class(village)
[1] "data.frame"

Thus we see that the vector age has class "integer." This should be clear, as it was created with the command age=18:29, and each of these numbers is an integer. On the other hand, the average heights were numbers such as 76.1. These are not integers, but floating point numbers, so it is not surprising that the class of the object height is "numeric."

It is interesting, though not surprising --- after all, village was created with the data.frame
command --- that the class of the object village is "data.frame." This is a new type of object that
we will find quite useful.
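
Another convenient command for inspecting a dataframe is str, which summarizes its structure: the number of observations and variables, plus the name, class, and first few values of each column. Expect output similar to the following.

> str(village)
'data.frame': 12 obs. of 2 variables:
 $ age   : int 18 19 20 21 22 23 24 25 26 27 28 29
 $ height: num 76.1 77 78.1 78.2 78.8 79.7 79.9 81.1 81.2 81.8 ...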

Before we begin to explore dataframes in more depth, let's clean up our workspace a bit. Let's
"remove" the age and height objects from our workspace.

> remove(age,height)
If you now try to print the contents of age and/or height, you will see that they are gone.

> age
Error: object "age" not found
> height
Error: object "height" not found

Indeed, if you "list" the contents of your workspace, you will see that the age and height objects
are gone. However, the dataframe village remains.

> ls()
[1] "village"

Accessing the Variables in the Dataframe

We will now explain how to access the variables contained in our dataframe. In this case, we know that each column contains a variable.

> village
age height
1 18 76.1
2 19 77.0
3 20 78.1
4 21 78.2
5 22 78.8
6 23 79.7
7 24 79.9
8 25 81.1
9 26 81.2
10 27 81.8
11 28 82.8
12 29 83.5
The first column contains age and the second height. But we can quickly ascertain the names of the
columns with R's names command. This would be useful if we were using someone else's dataframe and
we were not sure of the column headings.

> names(village)
[1] "age" "height"
OK, as expected. But how do we access the data in each column? One way is to state the variable
containing the dataframe, followed by a dollar sign, then the name of the column we wish to
access. For example, if we wanted to access the data in the "age" column, we would do the
following:

> village$age
[1] 18 19 20 21 22 23 24 25 26 27 28 29

In a similar fashion, we can access the values in the "height" column.

> village$height
[1] 76.1 77.0 78.1 78.2 78.8 79.7 79.9 81.1 81.2 81.8 82.8
[12] 83.5
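
As an aside, the dollar sign is not the only route to a column. Square-bracket indexing works as well; either of the following returns the same height vector.

> village[["height"]]
> village[,"height"]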

However, the additional typing required by the "dollar sign" notation can quickly become
tiresome, so R provides the ability to "attach" the variables in the dataframe to our workspace.

> attach(village)

Let's re-examine our workspace.

> ls()
[1] "village"

No evidence of the variables in the workspace. However, R has made copies of the variables in
the columns of the dataframe, and most importantly, we can access them without the "dollar
notation."

> age
[1] 18 19 20 21 22 23 24 25 26 27 28 29
> height
[1] 76.1 77.0 78.1 78.2 78.8 79.7 79.9 81.1 81.2 81.8 82.8
[12] 83.5

It is important to understand that these are "copies" of the columns of the data frame. Suppose
that we make a change to one of the entries of age.

> age[1]=12
> age
[1] 12 19 20 21 22 23 24 25 26 27 28 29

Note, however, that the data in the dataframe village remains unchanged.

> village
age height
1 18 76.1
2 19 77.0
3 20 78.1
4 21 78.2
5 22 78.8
6 23 79.7
7 24 79.9
8 25 81.1
9 26 81.2
10 27 81.8
11 28 82.8
12 29 83.5

Revisiting the Line of Best Fit

Because we've "attached" the dataframe village, we can execute the plot command as we did in
the Linear Regression activity.

> plot(age,height)

This command produces the plot shown in Figure 1.


Figure 1. A scatterplot of height versus age data.

And we can once again use the linear model command to produce the line of best fit.

> res=lm(height~age)
> res

Call:
lm(formula = height ~ age)

Coefficients:
(Intercept) age
64.928 0.635

Again, the coefficients provide the intercept and slope of the line of best fit.

height = 0.635 age + 64.928.

The command abline uses the data in res to draw the "Line of Best Fit."

> abline(res)

The result of this command is the "Line of Best Fit" shown in Figure 2.
Figure 2. The abline command superimposes the "Line of Best Fit" on our previous scatterplot.

Prediction

We can use the equation of the line of best fit to make predictions. Suppose, for example, that we
wished to estimate the average height of a child at age 27.5 months. One technique would be to
enter this value into the equation of the line of best fit.

height = 0.635 age + 64.928 = 0.635(27.5) + 64.928 = 82.3905.

We can use R as a simple calculator to perform this calculation.

> 0.635*27.5+64.928
[1] 82.3905

However, it is more efficient to use dataframes and R's predict command to predict the average
height when the age is 27.5 months.

> predict(res,data.frame(age=27.5))
[1] 82.38986

Thus, the average height at age 27.5 months is approximately 82.39 centimeters. The slight difference from our hand calculation above is due to rounding the coefficients to three decimal places.

Detach
When you no longer need the data in the dataframe village, it is good practice to "detach" the
dataframe.

> detach(village)

Note that "age" and "height" are no longer available for use.

> age
Error: object "age" not found
> height
Error: object "height" not found

Advanced Use of Dataframes

Many R commands are able to handle dataframes without attaching them. For example, if you
give the plot command a dataframe with two columns, by default it will plot the second column
versus the first. The following command will produce the image shown in Figure 1.

> plot(village)

We can get the linear model in this case without attaching the dataframe village, simply by
passing the dataframe village as an argument to the lm command.

> res=lm(height~age,data=village)
> res

Call:
lm(formula = height ~ age, data = village)

Coefficients:
(Intercept) age
64.928 0.635

The abline command will produce the line of best fit shown in Figure 2.

> abline(res)

The single command abline(lm(height~age,data=village)) will also produce the line of best fit shown in Figure 2. Thus, if you are in a hurry, two commands are enough to produce the scatterplot and line of best fit.

> plot(village)
> abline(lm(height~age,data=village))

Enjoy!

We hope you enjoyed this introduction to the use of dataframes in the R system. We encourage
you to explore their use further. We will certainly do so in upcoming activities.
Importing Data in R
In this activity we will learn how to import external data files into R. This activity builds upon
the lessons learned in Dataframes in R. If you are not familiar with dataframes in R, we
encourage you to first work through the activity Dataframes in R.

In the table that follows, we list average heights of children versus their ages. This data set was
collected over a period of time in an Egyptian village and first explored in Linear Regression in
R, then again in Dataframes in R.

Mean Height versus Age

Age in Months    Average Height in Centimeters
18               76.1
19               77.0
20               78.1
21               78.2
22               78.8
23               79.7
24               79.9
25               81.1
26               81.2
27               81.8
28               82.8
29               83.5
In the activity Dataframes in R, we created vectors age and height.

> age=18:29
> height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)

We then used R's data.frame command to create a dataframe and stored the result in the variable
village.

> village=data.frame(age=age,height=height)
> village
age height
1 18 76.1
2 19 77.0
3 20 78.1
4 21 78.2
5 22 78.8
6 23 79.7
7 24 79.9
8 25 81.1
9 26 81.2
10 27 81.8
11 28 82.8
12 29 83.5

Managing the Workspace

We can examine the contents of our workspace with the command ls().

> ls()
[1] "age" "height" "village"

And we can delete the age and height vectors with the remove command.

> remove(age,height)
> ls()
[1] "village"

We can "attach" the dataframe stored in village, which allows easy access to its columns.

> attach(village)
> ls()
[1] "village"
> age
[1] 18 19 20 21 22 23 24 25 26 27 28 29
> height
[1] 76.1 77.0 78.1 78.2 78.8 79.7 79.9 81.1 81.2 81.8 82.8 83.5

When we are through working with the dataframe, it is a good practice to "detach" the dataframe.

> detach(village)
> age
Error: object "age" not found
> height
Error: object "height" not found
> ls()
[1] "village"

Finally, before learning how to import data in electronic format, delete the dataframe stored in
village.

> remove(village)
> ls()
character(0)

Creating an External File

Open your favorite text editor on your system. For example, you might use Notepad on Windows
or Textedit on the Mac. Enter the data exactly as follows.

age height
18 76.1
19 77.0
20 78.1
21 78.2
22 78.8
23 79.7
24 79.9
25 81.1
26 81.2
27 81.8
28 82.8
29 83.5

Save the file as "village.txt". Be sure to make note of the folder or directory where you save your
file. In our case, we saved the file in:

/Users/darnold/Documents/MathDept/trunk/MathDept/html/R/village.txt

Caution: If you use Microsoft Word, you must save the file in text format. Do not use Word's
proprietary doc format.

The Current Working Directory

An important question arises. How will R find the file "village.txt"? Before we provide specific instructions, we must first address the concept of R's working directory. The current working directory is found with the getwd() command. The response will differ on your system.

> getwd()
[1] "/Users/darnold"
The response tells us that the current working directory is /Users/darnold.

We can change the current working directory with the command setwd(). Simply enter the
desired destination as a string (delimited with quotes).

> setwd("/Users/darnold/Documents")

You can check the result of this command with the getwd() command.

> getwd()
[1] "/Users/darnold/Documents"

Alternative -- Using the GUI

Alternately, we can use the menus provided by R's GUI interfaces on Windows and the Mac
operating systems. For example, on the Mac there is a Misc menu that contains submenus for
setting and getting the current working directory:

 The Get Working Directory submenu will, when selected, inform the user of the current working directory. It performs the same task as the getwd() command.
 The Change Working Directory submenu will, when selected, open the operating system's dialog for browsing the directory structure. Simply select the directory or folder in the manner you are accustomed to on your operating system and click the Open button. You can check the result with the getwd() command or the Get Working Directory submenu item.

Importing Data from an External File

It's time to import the data in the file "village.txt". You will recall that we stored the file in:

/Users/darnold/Documents/MathDept/trunk/MathDept/html/R/village.txt

And the file had this format.

age height
18 76.1
19 77.0
20 78.1
21 78.2
22 78.8
23 79.7
24 79.9
25 81.1
26 81.2
27 81.8
28 82.8
29 83.5

Note that the first line of the file contains "headers", that is, "names" for each of the columns.

The R command read.table is used to read data from external sources and place the result in a
data frame. By default, read.table reads data from a file where the data is separated by white
space (one or more spaces, tabs, newlines, or carriage returns).

> village=read.table(file="/Users/darnold/Documents/MathDept/trunk/MathDept/
html/R/village.txt",header=TRUE)

Note that this command was successful.

> village
age height
1 18 76.1
2 19 77.0
3 20 78.1
4 21 78.2
5 22 78.8
6 23 79.7
7 24 79.9
8 25 81.1
9 26 81.2
10 27 81.8
11 28 82.8
12 29 83.5

Trying to remember the path to your data file ("village.txt") can be frustrating, so there is an
alternative approach, one that allows the user to browse their directory structure in the usual
manner.

> village=read.table(file=file.choose())

The option file.choose() will pop open a dialog that allows the user to browse through their
directory structure in an accustomed manner. Locate the file in your directory structure, then
click the Open button. The result is identical to the first use of read.table above.

> village
age height
1 18 76.1
2 19 77.0
3 20 78.1
4 21 78.2
5 22 78.8
6 23 79.7
7 24 79.9
8 25 81.1
9 26 81.2
10 27 81.8
11 28 82.8
12 29 83.5

Another approach involves changing the current working directory. Using the setwd() command
or the Change Working Directory submenu of the Misc menu, change the working directory to
where you stored the file "village.txt".

> setwd("/Users/darnold/Documents/MathDept/trunk/MathDept/html/R")
> getwd()
[1] "/Users/darnold/Documents/MathDept/trunk/MathDept/html/R"

The dir() command will reveal the files stored in this directory.

> dir()
[1] "ImportingData.php" "SampleDiscrete.php" "barplot.php"
[4] "boxplot.php" "code" "dataframe.php"
[7] "graphics" "hist.php" "index.php"
[10] "pie.php" "regression.php" "simple.php"
[13] "village.txt"

Note that "village.txt" is among the files stored in this directory. Because we've changed the
current directory to the folder containing the file "village.txt", we can greatly simplify the use of
read.table to read the file and store the result in a dataframe.

> village=read.table(file="village.txt",header=TRUE)

Because the file now resides in the current working directory, the file argument of read.table no longer requires the long pathname we used above. Note that the result is identical to the previous results above.

> village
age height
1 18 76.1
2 19 77.0
3 20 78.1
4 21 78.2
5 22 78.8
6 23 79.7
7 24 79.9
8 25 81.1
9 26 81.2
10 27 81.8
11 28 82.8
12 29 83.5
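
As an aside, data is often distributed in comma-separated (CSV) format rather than whitespace-separated columns. Base R's read.csv command, a thin wrapper around read.table that presets sep="," and header=TRUE, handles that case. A hypothetical example, assuming a comma-separated file named "village.csv" in the current working directory:

> village=read.csv(file="village.csv")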

Importing Data from the Internet

Data can be stored in files in many different formats. This short tutorial does not cover all of the
possibilities, but if the file is stored on the internet in the same form as we used to store data in
the file "village.txt", then it is a simple matter to import that data into R.

For practice, we've stored the file "village.txt" at the following URL:

http://online.redwoods.edu/instruct/darnold/Math15/RData/village.txt

We read this data file into a dataframe as follows.

> village=read.table(file="http://online.redwoods.edu/instruct/darnold/
Math15/RData/village.txt",header=TRUE)

The result is identical to previous readings of the file "village.txt".

> village
age height
1 18 76.1
2 19 77.0
3 20 78.1
4 21 78.2
5 22 78.8
6 23 79.7
7 24 79.9
8 25 81.1
9 26 81.2
10 27 81.8
11 28 82.8
12 29 83.5

It is comforting that the syntax for importing the data from the internet is identical to the syntax
for importing a file from the local computer system.

One can now use the dataframe to perform a variety of analyses. For example, one might want a
scatterplot of height versus age and a superimposed line of best fit.

> plot(village)
> abline(lm(height~age,data=village))

These commands were introduced in the activities Linear Regression in R and Dataframes in R. The resulting scatterplot and line of best fit are shown in Figure 1.
Figure 1. Superimposing the "Line of Best Fit" on a scatterplot.

Enjoy!

We hope you enjoyed this introduction to importing data into the R system. We encourage you to
explore further. Use the command ?read.table to learn more about using this important
command.

The Central Limit Theorem (Part 1)


One of the most important theorems in all of statistics is called the Central Limit Theorem. (It is related to, but distinct from, the Law of Large Numbers.) The introduction of the Central Limit Theorem requires examining a
number of new concepts as well as introducing a number of new commands in the R
programming language. Consequently, we will break our introduction of the Central Limit
Theorem into several parts.

In this first part of the introduction to the Central Limit Theorem, we will show how to draw and
visualize a sample of random numbers from a distribution. From there we will examine the mean
and standard deviation of the sample, then examine the distribution of the sample means.

We begin by learning how to draw random numbers from a distribution.

The Letter r — Drawing Random Numbers

In previous activities (e.g., The Normal Distribution and Continuous Distributions), we introduced the use of the letters d, p, and q in relation to the various distributions (e.g., normal, uniform, and exponential). A reminder of their use follows:

 "d" is for "density." It is used to find values of the probability density function.
 "p" is for "probability." It is used to find the probability that the random variable lies to the left
of a given number.
 "q" is for "quantile." It is used to find the quantiles of a given distribution.

There is a fourth letter, namely "r", that is used to draw random numbers from a distribution. So,
for example, runif and rexp would be used to draw random numbers from the uniform and
exponential distributions, respectively.

Let's use the rnorm command to draw 500 numbers at random from a normal distribution having
mean 100 and standard deviation 10.

> x=rnorm(500,mean=100,sd=10)

We can view the result, some of which are shown below.

> x
[1] 110.67263 102.07696 114.41904 98.52447 95.31791 103.32522
[7] 96.85134 105.37060 98.60348 103.72672 101.82439 107.65795
[13] 96.91467 89.42021 106.06962 111.24015 90.51377 108.22921

...
...

[493] 110.61727 87.24973 113.95993 89.80688 106.44881 109.89394
[499] 87.57305 90.88494

When you examine the numbers stored in the variable x, there is a sense that you are pulling
random numbers that are clumped about a mean of 100. However, a histogram of this selection
provides a better understanding of the data stored in x.

> hist(x,prob=TRUE)
The above command produces the histogram shown in Figure 1.

Figure 1. A histogram of 500 random numbers drawn from a normal distribution with mean 100 and standard
deviation 10.

Several comments are in order regarding the histogram in Figure 1:

1. The histogram is approximately normal in shape.
2. The "balance point" of the histogram appears to be located near 100, suggesting that the random numbers were drawn from a distribution having mean 100.
3. It appears that almost all of the values occur within 3 increments of 10 (that is, within three standard deviations) of the mean, suggesting that the random numbers were drawn from a distribution having standard deviation 10.

Let's try the experiment again, drawing a new set of 500 random numbers from the normal
distribution having mean 100 and standard deviation 10.

> x=rnorm(500,mean=100,sd=10)
> hist(x,prob=TRUE,ylim=c(0,0.04))

These commands produce the plot shown in Figure 2.


Figure 2. A second drawing of 500 random numbers from a normal distribution having mean 100 and standard
deviation 10.

The histogram in Figure 2 is different from the histogram shown in Figure 1, owing to the
"random" selection of numbers. However, it does share some common traits with the histogram
shown in Figure 1: (1) it appears "normal" in shape, (2) it appears to be "balanced" or "centered"
about 100, and (3) all data appears to occur within 3 increments of 10 of the mean. This is strong
evidence that the random numbers have been drawn from a normal distribution having mean 100
and standard deviation 10. We can provide further evidence of this claim by superimposing a
normal probability density function with mean 100 and standard deviation 10.

> curve(dnorm(x,mean=100,sd=10),70,130,add=TRUE,lwd=2,col="red")

The above command superimposes the normal curve shown in Figure 3.


Figure 3. Superimpose a normal curve having mean 100 and standard deviation 10.

The curve command is new. Some comments on its use are in order:

1. In its simplest form, the syntax curve(f(x),from=,to=) draws the "function" defined by f(x) on the
interval (from,to). Our function is dnorm(x,mean=100,sd=10). The curve command sketches this
function of x on the interval (from,to).
2. The notation "from=" and "to=" may be omitted if the arguments are submitted in the proper
order to the curve command, function first, value of "from" second, then value of "to" third.
That is what we've done, substituting 70 for "from" and 130 for "to".
3. If the argument "add" is set to TRUE, as we have done, then the curve is "added" to the existing
figure. If this argument is omitted, or if it is set to FALSE, then a new plot is drawn, erasing any
previous graphics drawn.

The Distribution of Sample Means


In our previous examples, we drew 500 random numbers from a normal distribution with mean
100 and standard deviation 10. This is called "drawing a sample of size 500" from the normal
distribution with mean 100 and standard deviation 10. This leads to a sample of 500 random
numbers. One immediate question we can ask is "what is the mean of our sample?"

> mean(x)
[1] 99.75439

Thus, the mean of this sample is 99.75439.

Of course, if we take another sample of 500 random numbers from the normal distribution with
mean 100 and standard deviation 10, we get a new sample that has a different mean.

> x=rnorm(500,mean=100,sd=10)
> mean(x)
[1] 99.91978

In this case, we have a new sample of 500 randomly selected numbers, and the mean of this
sample provides a different result, namely 99.91978. The next question to ask is "what happens
if we do this repeatedly?"

Producing a Vector of Sample Means

In the next activity, we will repeatedly sample from the normal distribution. Each sample will
select five random numbers from the normal distribution having mean 100 and standard
deviation 10. We will then find the mean of the five numbers in our sample. We will repeat this
experiment 500 times, collecting the sample means in a vector xbar as we go.

We begin by declaring the mean and standard deviation of the distribution from which we will
draw random numbers. Then we declare the sample size (the number of random numbers drawn).

> mu=100; sigma=10
> n=5

Each time we draw a sample of size n = 5 from the normal distribution having mean μ = 100 and
standard deviation σ = 10, we need someplace to store the mean of the sample. Because we
intend to collect the means of 500 samples, we initialize a vector xbar to initially contain 500
zeros.

> xbar=rep(0,500)

The rep command "repeats" the entry zero 500 times. As a result, the vector xbar now contains
500 entries, each of which is zero.
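
For instance, on a smaller scale:

> rep(0,5)
[1] 0 0 0 0 0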

It is easy to draw a sample of size n = 5 from the normal distribution having mean μ = 100 and
standard deviation σ = 10. We simply issue the command rnorm(n,mean=mu,sd=sigma). To
find the mean of this result, we simply add the adjustment
mean(rnorm(n,mean=mu,sd=sigma)). The final step is to store this result in the vector xbar.
Then we must repeat this same process an additional 499 times for a total of 500 sample means.
This requires the use of a for loop.

> for (i in 1:500) { xbar[i]=mean(rnorm(n,mean=mu,sd=sigma)) }

The for construct used by R is similar to the "for loops" used in many programming languages.

 The i in for (i in 1:500) is called the index of the "for loop."
 The index i is first set equal to 1, then the "body" of the "for loop" (the part between the curly
braces) is executed. On the next iteration, i is set equal to 2 and the body of the loop is executed
again. The loop continues in this manner, incrementing the index i by 1, finally setting the index i
to 500, upon which the body of the loop executes one last time. Then the "for loop" is
terminated.
 In the body of the "for loop" we have xbar[i]=mean(rnorm(n,mean=mu,sd=sigma)). This draws
a sample of size n = 5 from the normal distribution, calculates the mean of the sample, and
stores the result in xbar[i], the ith entry of xbar.
 When the "for loop" completes 500 iterations, the vector xbar contains the means of 500
samples of size n = 5 drawn from the normal distribution having mean μ = 100 and standard
deviation σ = 10.
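
As an aside, R offers a more compact idiom for this sort of repetition. The replicate command evaluates an expression a given number of times and collects the results in a vector, so the loop above could be written in one line. This is an equivalent sketch, not a different experiment.

> xbar=replicate(500, mean(rnorm(n,mean=mu,sd=sigma)))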

It is a simple task to sketch the histogram of the sample means contained in the vector xbar.

> hist(xbar,prob=TRUE,breaks=12,xlim=c(70,130),ylim=c(0,0.1))

The above command produces the histogram shown in Figure 4.


Figure 4. The histogram of the sample means.

There are a number of important observations to be made about the histogram of sample means
in Figure 4, particularly when it is compared with the histograms of Figures 2 and 3.

1. It is essential to note the labels on the horizontal axis. In Figures 2 and 3, the label is x. This is
because the histograms in Figures 2 and 3 are simply describing the shape of 500 random
numbers selected from the normal distribution with mean μ = 100 and standard deviation σ =
10. On the other hand, the histogram of Figure 4 is describing the distribution of 500 sample
means, each of which was found by selecting n = 5 numbers from the normal distribution with
mean μ = 100 and standard deviation σ = 10, then computing their mean (average). The
horizontal axis in Figure 4 emphasizes this fact with the label xbar.
2. It is important to note that the distribution of xbar in Figure 4 appears "normal" in shape. This is
so even though the sample size is relatively small (n = 5).
3. It appears that the "balance point" or "center" of the distribution in Figure 4 occurs near 100. This can be checked with the following command:

> mean(xbar)
[1] 100.3104

This calculation provides the "mean of the sample means," so to speak. It is important to note that the mean of the sample means appears to be identical to the mean of the distribution from which the samples were drawn.

4. The distribution of sample means in Figure 4 appears to be "less spread out" or "narrower" than the distributions in Figures 2 and 3.
Increasing the Sample Size

Let's repeat the last experiment, but this time let's draw samples of size n = 10 from the same
"parent population," the normal distribution having mean μ = 100 and standard deviation σ = 10.

> mu=100; sigma=10
> n=10
> xbar=rep(0,500)
> for (i in 1:500) {xbar[i]=mean(rnorm(n,mean=mu,sd=sigma))}
> hist(xbar,prob=TRUE,breaks=12,xlim=c(70,130),ylim=c(0,0.1))

The above commands produce the histogram in Figure 5.

Figure 5. Increasing the sample size decreases the spread.

The image in Figure 5 sheds light on three key ideas:

Key Idea: When we select samples from a normal distribution, then the distribution of sample
means is also "normal" in shape.

Key Idea: The mean of the distribution of sample means appears to be the same as the mean of
the "parent population" from which we selected our samples (see the "balance point" or "center"
in Figure 5). This is easily checked.

> mean(xbar)
[1] 100.0866
Key Idea: By increasing the size of our samples (n = 10), the histogram of the sample means becomes "less spread out" or "narrower," as is clearly seen when contrasting the spread of the histograms in Figures 4 and 5. This behavior (increasing the sample size decreases the spread) seems quite reasonable. We would expect a more accurate estimate of the mean of the parent population if we take the mean of a larger sample. For example, you have a better chance of estimating the average height of the student population at a school if you ask ten students their heights and average them than if you ask only five students.
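
One way to quantify this narrowing is to compare the standard deviation of the sample means with that of the parent population (making the comparison precise is among the refinements promised below). The exact value printed will vary from run to run, because the samples are random, but it will be noticeably smaller than σ = 10.

> sd(xbar)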

The Central Limit Theorem

We finish with a statement of the Central Limit Theorem.

1. If you draw samples from a normal distribution, then the distribution of sample means is also
normal.
2. The mean of the distribution of sample means is identical to the mean of the "parent
population," the population from which the samples are drawn.
3. The larger the sample size that is drawn, the "narrower" will be the spread of the distribution of sample means.

This statement of the Central Limit Theorem is not complete. We will add refinements to this
statement in later activities.

Enjoy!

We hope you enjoyed this introduction to the Central Limit Theorem in the R system. We encourage you
to explore further. You might try repeating the experiments provided in this activity with sample
sizes n = 15, 20, and 25.

The Central Limit Theorem (Part 2)


In the activity The Central Limit Theorem (Part 1), we concluded with the following
observations on the Central Limit Theorem.
1. If you draw samples from a normal distribution, then the distribution of sample means is also
normal.
2. The mean of the distribution of sample means is identical to the mean of the "parent
population," the population from which the samples are drawn.
3. The larger the sample size that is drawn, the "narrower" will be the spread of the distribution of sample means.

We stated that we would refine this statement of the Central Limit Theorem in further activities.
Let's proceed to do that now.

What if the Parent Distribution is not Normal?

In the activity The Central Limit Theorem (Part 1), we drew our random samples from a "parent"
population whose distribution was "normal." In this activity, we'll choose parent populations that
are not normal, then see if the conclusions of the Central Limit Theorem still hold.

In the activity Continuous Distributions, we introduced the Exponential Distribution defined by the following probability density function.

f(x) = λ e^(−λx), for x ≥ 0

Figure 1. The exponential probability density function has both mean and standard deviation equal to 1/λ.

We can easily plot the distribution for λ = 1.

> curve(dexp(x,rate=1),0,4,lwd=2,col="red",ylab="p")

Some comments are in order for the command curve.

 The syntax curve(expr,from=,to=) sketches the graph of expr on the interval (from,to).
 In this example, from = 0 and to = 4.

The curve command above produces the probability density curve for the exponential
distribution shown in Figure 2.
Figure 2. Sketching the probability density function for the exponential distribution (λ = 1).

Note that the exponential distribution shown in Figure 2 is not normal.

The Letter r — Drawing Random Numbers

In previous activities (e.g., The Normal Distribution and Continuous Distributions), we introduced the use of the letters d, p, and q in relation to the various distributions (e.g., normal, uniform, and exponential). A reminder of their use follows:

 "d" is for "density." It is used to find values of the probability density function.
 "p" is for "probability." It is used to find the probability that the random variable lies to the left
of a given number.
 "q" is for "quantile." It is used to find the quantiles of a given distribution.
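
For example, for the exponential distribution with λ = 1, the three letters behave as follows (the
probability passed to qexp below is just the value returned by pexp):

> dexp(1,rate=1)          # density at x = 1
[1] 0.3678794
> pexp(1,rate=1)          # probability that X lies to the left of 1
[1] 0.6321206
> qexp(0.6321206,rate=1)  # the quantile recovers x = 1
[1] 1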

There is a fourth letter, namely "r", that is used to draw random numbers from a distribution.
Let's use the rexp command to draw 500 numbers at random from the exponential distribution
having mean 1 and standard deviation 1.

> x=rexp(500,rate=1)

We can view the result, some of which are shown below.


> x
[1] 0.279198724 2.974863732 1.784980038 1.089242851 1.654580539
[6] 6.817406399 0.521311578 0.620673515 0.012819642 2.024541829
[11] 1.489907151 0.316853154 0.756492146 1.915256215 0.087262080

...
...

[496] 0.609009484 0.504856527 3.498884801 0.871710847 1.009950972

When you examine the numbers stored in the variable x, it is difficult to get a sense of the
distribution of numbers. However, a histogram of this selection provides a better understanding
of the data stored in x.

> hist(x,prob=TRUE)

The above command produces the histogram shown in Figure 3.

Figure 3. A histogram of 500 random numbers drawn from the exponential distribution with λ = 1

Several comments are in order regarding the histogram in Figure 3:


1. The histogram is not normal. Indeed, the distribution is decidedly skewed to the right.
2. Visually, it is not unreasonable to estimate that the "balance point" or "center" of the
distribution is near 1. Indeed, a quick calculation provides convincing evidence that the mean
is close to 1:

> mean(x)
[1] 1.006549

The Distribution of Sample Means

In our previous examples, we drew 500 random numbers from an exponential distribution with
mean and standard deviation equal to 1. This is called "drawing a sample of size 500" from the
exponential distribution with mean and standard deviation equal to 1. One immediate question
we can ask is "what is the mean of our sample?"

> mean(x)
[1] 1.006549

Thus, the mean of this sample is 1.006549.

Of course, if we take another sample of 500 random numbers from the exponential distribution
with mean and standard deviation equal to 1, we get a new sample that has a different mean.

> x=rexp(500,rate=1)
> mean(x)
[1] 0.9780556

In this case, we have a new sample of 500 randomly selected numbers, and the mean of this
sample provides a different result, namely 0.9780556. The next question to ask is "what happens
if we do this repeatedly?"

Producing a Vector of Sample Means

In the next experiment, we will repeatedly sample from the exponential distribution. Each sample
will select five random numbers from the exponential distribution having mean and standard
deviation equal to 1. We will then find the mean of the five numbers in our sample. We will
repeat this experiment 500 times, collecting the sample means in a vector xbar as we go.

We begin by declaring the rate of the exponential distribution from which we will draw random
numbers. Then we declare the sample size (the number of random numbers drawn).

> lambda=1
> n=5

Each time we draw a sample of size n = 5 from the exponential distribution having mean μ = 1
and standard deviation σ = 1, we need someplace to store the mean of the sample. Because we
intend to collect the means of 500 samples, we initialize a vector xbar to initially contain 500
zeros.

> xbar=rep(0,500)

The rep command "repeats" the entry zero 500 times. As a result, the vector xbar now contains
500 entries, each of which is zero.
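
A quick check with a smaller count shows how rep works:

> rep(0,5)
[1] 0 0 0 0 0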

It is easy to draw a sample of size n = 5 from the exponential distribution having mean μ = 1 and
standard deviation σ = 1. We simply issue the command rexp(n,rate=lambda). To find the mean
of this result, we wrap the call in mean, as in mean(rexp(n,rate=lambda)). The final step is to
store this result in the vector xbar. Then we must repeat this same process an additional 499
times for a total of 500 sample means. This requires the use of a for loop.

> for (i in 1:500) { xbar[i]=mean(rexp(n,rate=lambda)) }

The for construct used by R is similar to the "for loops" used in many programming languages.

 The i in for (i in 1:500) is called the index of the "for loop."


 The index i is first set equal to 1, then the "body" of the "for loop" (the part between the curly
braces) is executed. On the next iteration, i is set equal to 2 and the body of the loop is executed
again. The loop continues in this manner, incrementing the index i by 1, finally setting the index i
to 500, upon which the body of the loop executes one last time. Then the "for loop" is
terminated.
 In the body of the "for loop" we have xbar[i]=mean(rexp(n,rate=lambda)). This draws a sample
of size n = 5 from the exponential distribution, calculates the mean of the sample, and stores the
result in xbar[i], the ith entry of xbar.
 When the "for loop" completes 500 iterations, the vector xbar contains the means of 500
samples of size n = 5 drawn from the exponential distribution having mean μ = 1 and standard
deviation σ = 1.
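
As an aside, R offers a more compact idiom for exactly this pattern. The following one-liner (a
sketch, equivalent to the loop above) fills xbar without explicit indexing:

> xbar=replicate(500, mean(rexp(n,rate=lambda)))

We'll stick with the explicit for loop in these activities, as it makes each step visible.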

It is a simple task to sketch the histogram of the sample means contained in the vector xbar.

> hist(xbar,prob=TRUE,breaks=12)

The above command produces the histogram shown in Figure 4.


Figure 4. The histogram of the sample means.

There are a number of important observations to be made about the histogram of sample means
in Figure 4, particularly when it is compared with the histograms of Figures 2 and 3.

1. It is essential to note the labels on the horizontal axis. In Figures 2 and 3, the label is x. This is
because Figures 2 and 3 simply describe the parent distribution and a sample drawn from it:
Figure 2 shows the theoretical density, and Figure 3 a histogram of 500 random numbers
selected from the exponential distribution with mean μ = 1 and standard deviation σ =
1. On the other hand, the histogram of Figure 4 is describing the distribution of 500 sample
means, each of which was found by selecting n = 5 numbers from the exponential distribution
with mean μ = 1 and standard deviation σ = 1, then computing their mean (average). The
horizontal axis in Figure 4 emphasizes this fact with the label xbar.
2. It is important to note that the distribution of xbar in Figure 4 is not normal in shape. Indeed, the
distribution is decidedly skewed to the right.

Increasing the Sample Size


Let's repeat the last experiment, but this time let's draw samples of size n = 10 from the same
"parent population," the exponential distribution having mean μ = 1 and standard deviation σ =
1.

> lambda=1
> n=10
> xbar=rep(0,500)
> for (i in 1:500) { xbar[i]=mean(rexp(n,rate=1))}
> hist(xbar,prob=TRUE,breaks=12)

The above commands produce the histogram in Figure 5.

Figure 5. Increasing the sample size to n = 10.

The histogram in Figure 5 is still not normal in shape. Again, it is definitely skewed to the right,
though perhaps not as much as the histogram in Figure 4 that was produced with a smaller
sample size.

Let's increase the sample size to n = 20 and repeat the experiment.


> lambda=1
> n=20
> xbar=rep(0,500)
> for (i in 1:500) { xbar[i]=mean(rexp(n,rate=1))}
> hist(xbar,prob=TRUE,breaks=12)

The code above will produce the image in Figure 6.

Figure 6. Increasing the sample size to n = 20.

Aha! The histogram in Figure 6 has the appearance of a normal distribution. The "right-
skewness" is beginning to disappear, when compared with the histograms in Figures 4 and 5.

Let's increase the sample size to n = 30 and repeat the experiment.

> lambda=1
> n=30
> xbar=rep(0,500)
> for (i in 1:500) { xbar[i]=mean(rexp(n,rate=1))}
> hist(xbar,prob=TRUE,breaks=12)

The code above will produce the image in Figure 7.


Figure 7. Increasing the sample size to n = 30.

The histogram in Figure 7 has the symmetric bell-shape of the normal distribution.

Key Observation: It would appear that the distribution of the sample means will be normal in
shape, regardless of the shape of the "parent" population, provided the sample size is large
enough. In Figure 7, a sample size of n = 30 seemed to be enough to guarantee that the
distribution of sample means is normal in shape, even though the samples were drawn from the
exponential distribution, a distribution that is highly skewed to the right.

There are two more important observations to be made about the distribution shown in Figure 7:

1. Key Observation: The histogram in Figure 7 appears to be "balanced" or "centered" about xbar =
1. Indeed:

> mean(xbar)
[1] 0.9978525

This mean is the same as the mean of the "parent" population (the exponential distribution with
mean μ = 1) from which our samples were drawn.

2. Key Observation: The standard deviation of the distribution in Figure 7 appears to be much
smaller than the standard deviation of the parent population (the exponential distribution had σ
= 1). Indeed, if we recall that nearly all data in the normal distribution falls within about three
standard deviations of the mean, the histogram in Figure 7 would indicate a standard deviation
of about 0.2. We will have more to say about the standard deviation of the sample means in a
later activity.
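
You can check this visual estimate against your own simulation. The value reported will vary a
little from run to run, but it should land near 0.2:

> sd(xbar)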

Sampling from a Discrete Population

Let's repeat the experiment again (increasing the sample size), only this time let's use a "parent"
population that is discrete and skewed to the left. Specifically:

A Discrete Distribution

x   p
1   0.1
2   0.1
3   0.1
4   0.1
5   0.2
6   0.4

We can "load" the values of the random variable and their probabilities in R as follows:

> x=c(1,2,3,4,5,6)
> p=c(0.1,0.1,0.1,0.1,0.2,0.4)

We can provide a "stick" plot of this discrete distribution with the following code:

x=c(1,2,3,4,5,6)
p=c(0.1,0.1,0.1,0.1,0.2,0.4)
plot(x,p,type="h",lwd=2,col="red",ylim=c(0,0.5))
points(x,p,pch=16,cex=2,col="black")

The code above will produce the discrete distribution shown in Figure 8.

Figure 8. A discrete distribution that is badly skewed to the left.

Determine the Mean of the Discrete Distribution

Here is a simple formula for computing the mean of a discrete distribution.

A Formula for the Mean of a Discrete Distribution

μ = Σ x p(x)

Figure 9. The mean is found by summing the product of the values of the random variable and their associated
probabilities.

Thus, we can find the mean of our discrete distribution with the following calculation:

μ = 1(0.1) + 2(0.1) + 3(0.1) + 4(0.1) + 5(0.2) + 6(0.4) = 4.4

Making this calculation in R greatly simplifies the task. First find the product of the vectors x and
p. Note: When you take the product of two vectors, R will produce a third vector, each entry of
which is the product of the corresponding entries in the vectors being multiplied.
> x*p
[1] 0.1 0.2 0.3 0.4 1.0 2.4

Sum these numbers to find the mean.

> sum(x*p)
[1] 4.4

Thus, the mean of the discrete distribution is μ = 4.4.

Key Observation: Look again at the "spike" plot in Figure 8. Imagine a "knife-edge" at 4.4 and
set the distribution of Figure 8 atop the "knife-edge." Will the distribution balance? Keep in mind
the "principle of the lever" or the "teeter-totter effect." The outliers, such as x = 1, positioned
farther from the mean but carrying lower probability, can balance values with more
"massive" probabilities that are clumped closer to the mean. With these thoughts in mind, the
mean value μ = 4.4 seems reasonable.

Sampling from the Discrete Distribution

In the activity Sampling a Discrete Population, the sample command was used to draw a sample.
The syntax sample(x, size, replace=, prob=) indicates that we must provide the following
arguments:

 x: a vector containing the values of the random variable


 size: the size of the sample you wish to draw
 replace: a value of TRUE causes the value of the discrete variable to be replaced before another
draw is made.
 prob: a vector containing the associated probabilities for each value of the random variable
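
One practical aside (not used in this activity): random draws change from run to run, so if you
want reproducible results you can fix R's random seed before sampling. The seed value below is
an arbitrary choice.

> set.seed(42)
> sample(x,size=10,replace=TRUE,prob=p)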

Let's draw a sample of size 1000 and create a histogram of the resulting sample.

n=1000
xs=sample(x,size=n,replace=TRUE,prob=p)

As in the activity Sampling a Discrete Distribution, the table command provides a nice summary
of the sample.

> table(xs)
xs
1 2 3 4 5 6
97 116 91 102 189 405

Even better is a visualization provided by a barplot.

> barplot(table(xs)/length(xs),xlab="x",ylab="frequency")

This last command produces the barplot shown in Figure 10.

Figure 10. Plotting a sample of size 1000 selected from the discrete distribution.

The barplot in Figure 10 shows a random sample drawn from the discrete distribution in Figure 8,
whose theoretical mean is 4.4. Let's see what the mean of our sample is.

> mean(xs)
[1] 4.385

Of course, if we draw another random sample, we will get a different sample mean.

> xs=sample(x,size=n,replace=TRUE,prob=p)
> mean(xs)
[1] 4.479

The Distribution of Sample Means


As in the last example, let's start with a sample size n = 5. We'll draw 500 samples, then plot the
distribution of sample means using a histogram.

n=5
xbar=rep(0,500)
for (i in 1:500) {
xbar[i]=mean(sample(x,size=n,replace=TRUE,prob=p))
}
hist(xbar,prob=TRUE,breaks=12)

This last command produces the histogram shown in Figure 11.

Figure 11. Plotting samples of size n = 5 selected from the discrete distribution.

Note that the distribution of sample means in Figure 11 is not normal. Indeed, the distribution is
skewed to the left. This is to be expected as the "parent" population is highly skewed to the left
and the sample size we are using is quite small. Let's increase the sample size and see what
happens.

n=10
xbar=rep(0,500)
for (i in 1:500) {
xbar[i]=mean(sample(x,size=n,replace=TRUE,prob=p))
}
hist(xbar,prob=TRUE,breaks=12)

This last command produces the histogram shown in Figure 12.


Figure 12. Plotting samples of size n = 10 selected from the discrete distribution.

There's a bit of an improvement (starting to look somewhat normal) but the distribution of
sample means in Figure 12 is still skewed to the left. Let's increase the sample size and see what
happens.

n=20
xbar=rep(0,500)
for (i in 1:500) {
xbar[i]=mean(sample(x,size=n,replace=TRUE,prob=p))
}
hist(xbar,prob=TRUE,breaks=12)

This last command produces the histogram shown in Figure 13.


Figure 13. Plotting samples of size n = 20 selected from the discrete distribution.

The distribution is now taking on the symmetry and "bell-shape" of the normal distribution. Let's
increase the sample size one more time and see what happens.

n=30
xbar=rep(0,500)
for (i in 1:500) {
xbar[i]=mean(sample(x,size=n,replace=TRUE,prob=p))
}
hist(xbar,prob=TRUE,breaks=12)

This last command produces the histogram shown in Figure 14.


Figure 14. Plotting samples of size n = 30 selected from the discrete distribution.

The histogram in Figure 14 has the symmetric bell-shape of the normal distribution.

Key Observation: It would appear that the distribution of the sample means will be normal in
shape, regardless of the shape of the "parent" population, provided the sample size is large
enough. In Figure 14, a sample size of n = 30 seemed to be enough to guarantee that the
distribution of sample means is normal in shape, even though the samples were drawn from a
discrete distribution, a distribution that is highly skewed to the left.

Key Observation: The histogram in Figure 14 appears to be "balanced" or "centered" about xbar
= 4.4. Indeed:

> mean(xbar)
[1] 4.396933

This mean is the same as the mean of the "parent" population (the discrete distribution of Figure
8 with mean μ = 4.4) from which our samples were drawn.

The Central Limit Theorem


We are now in a position to refine our statement of the Central Limit Theorem.

1. If you draw samples from a distribution, then the distribution of sample means is also normal,
provided a large enough sample size is used. It appears that the magic number for a sufficient
sample size is n = 30. Note: This is why the Central Limit Theorem is oftentimes referred
to as the "Law of Large Numbers."
2. The mean of the distribution of sample means is identical to the mean of the "parent
population," the population from which the samples are drawn.
3. The larger the sample size that is drawn, the "narrower" will be the spread of the distribution of
sample means.

This statement of the Central Limit Theorem is still not complete. We still need to discuss just
how the standard deviation of the sample means varies with the sample size. We will attack this
question in a later activity.

Enjoy!

We hope you enjoyed this second activity on the Central Limit Theorem in the R system. We encourage
you to explore further. You might try repeating the experiments in this activity with different
"parent" distributions.

Transforming Data in R
In the activity Linear Regression in R, we showed how to calculate and plot the "line of best fit"
for a set of data. As a quick reminder, consider the normal average January minimum
temperatures in 56 American cities, presented at the following URL:

http://lib.stat.cmu.edu/DASL/Datafiles/USTemperatures.html

This file is one of many data sets stored at the Data and Story Library. We've adapted the file to
make it easier to import into R. You can download the adapted file at
http://msenux.redwoods.edu/mathdept/R/USTemperatures.txt. Take note of the folder name
where you save this file as USTemperatures.txt.

In the activity Importing Data in R, we showed one of the simplest methods to import a datafile
into a dataframe in R.

> USTemps=read.table(file=file.choose(),header=TRUE)

The option file.choose() will pop open a dialog that allows you to browse your
directory structure in the usual manner. Locate the file USTemperatures.txt in your
directory structure, then click the Open button.

> USTemps=read.table(file=file.choose(),header=TRUE)
> USTemps
Temp Lat Long
1 44 31.2 88.5
2 38 32.9 86.8
3 35 33.6 112.5
4 31 35.4 92.8
5 47 34.3 118.7
...
...
...
56 14 41.2 104.9

The first variable Temp contains the average January low temperature for a particular US city
having latitude and longitude stored in the second and third variables, Lat and Long,
respectively.

We now calculate the equation of the line of best fit.

> res=lm(Temp~Lat,data=USTemps)
> res

Call:
lm(formula = Temp ~ Lat, data = USTemps)

Coefficients:
(Intercept) Lat
108.728 -2.110

Hence, the equation of the line of best fit is given by:


Temp = -2.110 Lat + 108.728.

It's a simple matter to produce a scatterplot and the line of best fit.

> plot(Temp~Lat,data=USTemps)
> abline(res)

The above commands produce the scatterplot and the line of best fit in Figure 1.

Figure 1. Plot of temperature versus latitude and the line of best fit.

In Figure 1, we see a general downward linear trend (which explains the negative slope). As the
latitude increases, we move northward from the equator (which has latitude zero) and the
temperature gets colder.

The Power Function

A power function has the following form.

The Power Function

y = a x^b

Figure 2. Note the location of the variable x in the power function.

It's important to note the location of the variable x in the power function in Figure 2. Later in this
activity, we will contrast the power function with the exponential function, where the variable x
is an exponent, rather than the base as in the equation in Figure 2.

Let's look at an example, choosing the power function y = 3x^2. First, let's produce a plot.

> x=seq(1,5,0.5)
> y=3*x^2
> plot(x,y)

The above sequence produces the plot shown in Figure 3.

Figure 3. An example of a power function.

It could be argued that the data is somewhat linear, showing a general upward trend. However, it
is more likely (based on Figure 3) that the function is nonlinear, due to the general bend in the
curve shown in Figure 3.

Transforming the Data


Let's take the logarithm of both sides of the power function y = 3x^2. The base of the logarithm is
irrelevant; however, log x is understood to be the natural logarithm (base e) in R. That is, log x =
log_e x in R.

log y = log 3x^2

The log of a product is the sum of the logs, so we can write the following.

log y = log 3 + log x^2

Another property of logs allows us to move the exponent down.

log y = log 3 + 2 log x

This last form shows that if we plot the log of y versus the log of x, the graph will be linear with
slope 2 and intercept log 3.

> plot(log(x),log(y))

The above command produces the plot shown in Figure 4.

Figure 4. Plotting the log of y versus the log of x produces a line.

Note that the plot of log y versus log x is linear!


It's interesting to find the line of best fit for the transformed data.

> res=lm(log(y)~log(x))
> res

Call:
lm(formula = log(y) ~ log(x))

Coefficients:
(Intercept) log(x)
1.099 2.000

The slope is 2, which is the slope indicated in log y = log 3 + 2 log x. The intercept
should be log 3.

> log(3)
[1] 1.098612

This result agrees with the intercept found using R's lm command as reported in the variable res.

An important lesson has been learned.

Important Result: With any power function, the graph of the logarithm of the dependent variable versus
the logarithm of the independent variable will be a line. Thus, if the graph of the logarithm of the
response variable versus the logarithm of the independent variable is a line, then we should suspect that
the relationship between the original variables is that of a power function.
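
In practice, then, the recipe for fitting a power function y = a x^b to paired vectors x and y takes
only a few lines. Here is a sketch (coef extracts the fitted intercept and slope from the lm result):

> res=lm(log(y)~log(x))
> a=exp(coef(res)[1])   # the intercept estimates log a, so we exponentiate
> b=coef(res)[2]        # the slope estimates the power b

For the example above, a should come out near 3 and b near 2.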
Resting Metabolic Rate

Let's apply what we've learned about power functions to a concrete example. In the table that
follows, the resting metabolic rates of several classs of primates are presented along with the
mass of the primate class. The resting metabolic rate (RMR) is defined to be the amount of
energy required by the body when the body is doing nothing. This data is taken from Leonard
and Robinson (1997, AJPA) and repeated in and article by Marcus Hamilton, Model-Fitting with
Linear Regression: Power Functions.

Resting Metabolic Rate in Primates

Species           Weight    RMR

A. palliata         8.5     363
A. palliata         6.4     293
A. trivirgatus      0.85     46
A. geoffroyi        8.41    346
C. molloch          0.7      54
C. apella           2.6     143
C. albifrons        2.4     135
S. imperator        0.4      35
S. fusicollis       0.3      28
S. sciureus         0.8      66
C. albigena         7.9     327
C. guereza          7       265
M. fascicularis     5.5     331
P. anubis          29.3     956
P. anubis          13       520
H. lar              6       292
P. troglodytes     39.5    1036
P. troglodytes     29.8     839
P. pygmaeus        83.6    1948
P. pygmaeus        37.8    1074
S. syndactylus     10.5     408
!Kung              46      1383
!Kung              41      1099
Ache               59.6    1591
Ache               51.8    1394

We've massaged the data and stored it in the file RMR.txt. Download the file and save it as
RMR.txt. Take note of the folder in which you save the file. Read the data into R as follows:

> primates=read.table(file=file.choose(),header=TRUE)
> primates
Weight RMR
1 8.50 363
2 6.40 293
3 0.85 46
4 8.41 346
5 0.70 54
6 2.60 143
7 2.40 135
8 0.40 35
9 0.30 28
10 0.80 66
11 7.90 327
12 7.00 265
13 5.50 331
14 29.30 956
15 13.00 520
16 6.00 292
17 39.50 1036
18 29.80 839
19 83.60 1948
20 37.80 1074
21 10.50 408
22 46.00 1383
23 41.00 1099
24 59.60 1591
25 51.80 1394

Plot the data.

> plot(RMR~Weight,data=primates)

The above command produces the plot shown in Figure 5.


Figure 5. Plotting RMR versus Weight.

One could argue that the data has a general linear appearance. However, there is evidence in
Figure 5 of a slightly concave down bend to the data, suggesting that we might use a power
function (perhaps the square root function) to fit the data. This encourages us to plot the
logarithm of the RMR versus the logarithm of the Weight as follows.

> plot(log(RMR)~log(Weight),data=primates)

The above command produces the plot shown in Figure 6.


Figure 6. Plotting the logarithm of the RMR versus the logarithm of the Weight.

Aha! The plot in Figure 6 shows a definite linear trend. Therefore, our suspicion that a power
function might fit the original data set is probably valid (there are actual statistical tests for this
assumption which we will explore in later activities). Let's find the line of best fit for the
transformed data in Figure 6.

> res=lm(log(RMR)~log(Weight),data=primates)
> res

Call:
lm(formula = log(RMR) ~ log(Weight), data = primates)

Coefficients:
(Intercept) log(Weight)
4.2409 0.7599

This result tells us that we can fit a linear model to the transformed data as follows:

log RMR = 4.2409 + 0.7599 log Weight

Exponentiate both sides of the above equation.

e^(log RMR) = e^(4.2409 + 0.7599 log Weight)


On the left, the exponential and the logarithm are inverses. On the right, we use a property of
exponents.

RMR = e^4.2409 e^(0.7599 log Weight)

We use a property of logarithms to move the 0.7599 into the exponent.

RMR = e^4.2409 e^(log Weight^0.7599)

We evaluate e^4.2409.

> exp(4.2409)
[1] 69.47035

This further simplifies our result.

RMR = 69.47035 e^(log Weight^0.7599)

Finally, we again use the fact that the logarithm and exponential are inverses.

RMR = 69.47035 Weight^0.7599

Note that this final function has the form y = a x^b, a power function that should "fit" the original
data set.

Let's provide some visual verification of our model. First, find the range (min and max) of the
Weight data.

> range(primates$Weight)
[1] 0.3 83.6

Plot the original data, then superimpose a plot of the power function that we think will "fit" the
data.

> plot(RMR~Weight,data=primates)
> x=seq(0.3,83.6,0.1)
> y=69.47035*x^0.7599
> lines(x,y,type="l",lwd=2,col="red")
These commands produce the scatterplot and the power function that "fits" the data shown in
Figure 7.

Figure 7. The power function "fits" the original data.

The Exponential Function

An exponential function has the following form.

The Exponential Function

y = a b^x

Figure 8. In the exponential function, the independent variable is the exponent.

In the power function, the independent variable was the base, as in y = a x^b. In the exponential
function, the independent variable is now an exponent, as in y = a b^x. This is a subtle but
important difference.

Let's look at an example of an exponential function, choosing the example y = 3(2^x). First, let's
produce a plot.

> x=seq(1,5,0.5)
> y=3*2^x
> plot(x,y)

The above sequence produces the plot shown in Figure 9.

Figure 9. An example of an exponential function.

The function in Figure 9 is clearly nonlinear.


Transforming the Data

Let's take the logarithm of both sides of the exponential function y = 3(2^x). The base of the
logarithm is irrelevant; however, log x is understood to be the natural logarithm (base e) in R.
That is, log x = log_e x in R.

log y = log 3(2^x)

The log of a product is the sum of the logs, so we can write the following.

log y = log 3 + log 2^x

Another property of logs allows us to move the exponent down.

log y = log 3 + x log 2

This last form shows that if we plot the log of y versus x, the graph will be linear with slope log
2 and intercept log 3.

> plot(x,log(y))

The above command produces the plot shown in Figure 10.


Figure 10. Plotting the log of y versus x produces a line.

Note that the plot of log y versus x is linear!

It's interesting to find the line of best fit for the transformed data.

> res=lm(log(y)~x)
> res

Call:
lm(formula = log(y) ~ x)

Coefficients:
(Intercept) x
1.0986 0.6931

The slope is 0.6931, which agrees with the slope indicated in log y = log 3 + x log 2; that is, the
slope is log 2.

> log(2)
[1] 0.6931472

The intercept is 1.0986, which agrees with the intercept indicated in log y = log 3 + x log 2; that
is, the intercept is log 3.

> log(3)
[1] 1.098612

An important lesson has been learned.

Important Result: With any exponential function, the graph of the logarithm of the dependent variable
versus the independent variable will be a line. Thus, if the graph of the logarithm of the response
variable versus the independent variable is a line, then we should suspect that the relationship between
the original variables is that of an exponential function.
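
As with the power function, the fitting recipe for an exponential function y = a b^x can be
summarized in a short sketch for paired vectors x and y:

> res=lm(log(y)~x)
> a=exp(coef(res)[1])   # the intercept estimates log a
> b=exp(coef(res)[2])   # the slope estimates log b

For the example above, a should come out near 3 and b near 2.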
Cell Phone Subscribers

Let's apply what we've learned to a concrete example. In the table that follows, the number of
cell phone subscriptions (in millions) is presented as a function of the number of years that have
passed since 1987.

Cell Phone Subscribers

t (years since 1987)   subscribers (in millions)

 1     1.6
 2     2.7
 3     4.4
 4     6.4
 5     8.9
 6    13.1
 7    19.3
 8    28.2
 9    38.2
10    48.7

Enter and plot the data.

> t=seq(1,10)
> subscribers=c(1.6,2.7,4.4,6.4,8.9,13.1,19.3,28.2,38.2,48.7)
> plot(t,subscribers)
These commands produce the scatterplot shown in Figure 11.

Figure 11. Cell phone subscribers versus years since 1987.

We suspect exponential growth. This leads us to plot the logarithm of the subscriber data versus
the year since 1987.

> plot(t,log(subscribers))
> res=lm(log(subscribers)~t)
> abline(res)
These commands produce the scatterplot and the line of best fit shown in Figure 12.

Figure 12. Plotting log of subscribers versus the years since 1987.

The data of Figure 12 certainly appear to be linear. Let's examine the model of the line of best fit.

> res

Call:
lm(formula = log(subscribers) ~ t)

Coefficients:
(Intercept) t
0.2630 0.3774

This leads us to the following model.

log subscribers = 0.2630 + 0.3774t

Exponentiate both sides of the last equation.

e^(log subscribers) = e^(0.2630 + 0.3774t)


On the left, the exponential and logarithm are inverses. On the right, the exponential of a sum is
the product of the exponentials.

subscribers = e^0.2630 e^(0.3774t)

Finally, we can evaluate e^0.2630.

> exp(0.2630)
[1] 1.300827

This leads to the final form of the exponential model.

subscribers = 1.300827 e^(0.3774t)
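
As an aside (not part of the original fit), an exponential model makes it easy to estimate a
doubling time: divide log 2 by the growth rate.

> log(2)/0.3774
[1] 1.836638

That is, the model suggests subscriptions were doubling roughly every 1.8 years during this period.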

Let's provide some visual verification of our model.

> plot(t,subscribers)
> t=seq(1,10,0.1)
> y=1.300827*exp(0.3774*t)
> lines(t,y,type="l",lwd=2,col="red")

The above commands will plot the original data, then superimpose a plot of the exponential
function that we found to "fit" the data.

Figure 13. The exponential function "fits" the original data.

Not bad!

Enjoy!

We hope you enjoyed this introduction to the principles of transforming data in the R system. In
upcoming activities, we will discuss further what is meant by a "good fit."

The Central Limit Theorem (Part 3)


In the activity The Central Limit Theorem (Part 1), we concluded with the following
observations on the Central Limit Theorem.

1. If you draw samples from a normal distribution, then the distribution of sample means is also
normal.
2. The mean of the distribution of sample means is identical to the mean of the "parent
population," the population from which the samples are drawn.
3. The larger the sample size that is drawn, the "narrower" will be the spread of the distribution of
sample means.

To provide visual evidence of the narrowing spread, the following code will produce the image
shown in Figure 1. The four examples pictured use sample sizes of 5, 10, 20, and 30. (Note: It is
very difficult to type the following commands without error. In a moment, we'll provide a more
efficient way to enter the commands. In the meantime, give the commands a try, if only to
appreciate how difficult it is to get them correctly entered in sequence.)

> par(mfrow=c(2,2))
>
> n=5
> xbar_5=rep(0,500)
> for (i in 1:500){ xbar_5[i]=mean(rnorm(n,mean=100,sd=10))}
> hist(xbar_5,prob=TRUE,breaks=12,xlim=c(70,130))
>
> n=10
> xbar_10=rep(0,500)
> for (i in 1:500) { xbar_10[i]=mean(rnorm(n,mean=100,sd=10))}
> hist(xbar_10,prob=TRUE,breaks=12,xlim=c(70,130))
>
> n=20
> xbar_20=rep(0,500)
> for (i in 1:500) { xbar_20[i]=mean(rnorm(n,mean=100,sd=10))}
> hist(xbar_20,prob=TRUE,breaks=12,xlim=c(70,130))
>
> n=30
> xbar_30=rep(0,500)
> for (i in 1:500) { xbar_30[i]=mean(rnorm(n,mean=100,sd=10))}
> hist(xbar_30,prob=TRUE,breaks=12,xlim=c(70,130))

Some comments are in order:

1. There is a new command, par(mfrow=c(2,2)), which divides the graphics window into two rows
and two columns (see Figure 1). The par stands for "parameter," and there are a whole host of
parameters that can alter the figure window (type ?par for more information).
2. Think of mfrow as meaning "multiple figures entered by rows." Because there are two rows and
two columns, the first histogram command creates a plot in row 1, column 1, the second in row
1, column 2, the third in row 2, column 1, and the fourth in row 2 column 2. There is also a
command mfcol that enters the figures by columns.
3. The for loops are explained in the activity The Central Limit Theorem (Part 1).
4. Note that xlim=c(70,130) creates a common axis so that a proper comparison of the spreads can
be made.
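
One more note on par: the layout setting persists for subsequent plots, so when you are finished
with the grid you may want to restore the default single-plot layout.

> par(mfrow=c(1,1))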

In Figure 1, note that the spread narrows as the sample size increases from n = 5 to n = 10, n =
20, and n = 30.

Figure 1. A visual comparison of the distribution of sample means as the sample size increases.
Note that the mean of each distribution of sample means is 100 (each of the histograms in Figure
1 balances about 100), the same as the mean of the parent population from which the samples are
drawn.

Sourcing Files

If you were successful in typing the lines that produced Figure 1, then you are a far better typist
than I am. When commands must be correctly entered in sequence, such as the lines of code that
produced Figure 1, it can be exasperating to make a mistake, knowing that you must retype the
entire sequence anew.

A far more efficient way to proceed is to use R's editor to enter the lines of code, save the file,
then execute the code in the file. This way, if you make a simple typing mistake, you make a
change to the file, save, then execute the file anew. R calls these files of code "source files," and
executing the file is called "sourcing the file."

On the Mac operating system (Windows and Linux should be similar), you open the editor by
selecting New Document from the File menu (there is also an icon for new documents on the
console toolbar). When the editor opens, type the following lines.

par(mfrow=c(2,2))

n=5
xbar_5=rep(0,500)
for (i in 1:500){ xbar_5[i]=mean(rnorm(n,mean=100,sd=10))}
hist(xbar_5,prob=TRUE,breaks=12,xlim=c(70,130))

n=10
xbar_10=rep(0,500)
for (i in 1:500) { xbar_10[i]=mean(rnorm(n,mean=100,sd=10))}
hist(xbar_10,prob=TRUE,breaks=12,xlim=c(70,130))

n=20
xbar_20=rep(0,500)
for (i in 1:500) { xbar_20[i]=mean(rnorm(n,mean=100,sd=10))}
hist(xbar_20,prob=TRUE,breaks=12,xlim=c(70,130))

n=30
xbar_30=rep(0,500)
for (i in 1:500) { xbar_30[i]=mean(rnorm(n,mean=100,sd=10))}
hist(xbar_30,prob=TRUE,breaks=12,xlim=c(70,130))

Save the file as central3_1.R. The .R extension reminds us that the file contains R source code.

To execute the code in the file central3_1.R, select Source File from the File menu (there is
also an icon for "sourcing" files on the console toolbar). This opens a dialog where you can
browse your system in the usual manner and select the saved source file (central3_1.R in our
case). Clicking the Open button will execute the source code in the file and produce the image in
Figure 1. If you have errors, or wish to make adaptions, correct your source code, save, then
"source" the file again.
On our system, "sourcing" the file central3_1.R sends the following command to the console:

>
source("/Users/darnold/Documents/MathDept/trunk/MathDept/html/R/code/central3_
1.R")

This line shows both the path and filename of the file containing our source code. We've found
that should errors occur, or should we wish to add further code or make changes, we can save our
changes, then hit the "up-arrow" on our keyboard to replay this line in our console. Hitting enter
executes the line and "sources" our saved changes, a most efficient way to work.
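
If you would rather pick the file from a dialog instead of typing a path, one equivalent approach
(a sketch) combines source with file.choose; the echo=TRUE option prints each command to the
console as it runs.

> source(file.choose(), echo=TRUE)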

Examining the Workspace

Once you've "sourced" your code, the workspace will reflect variables used in your code. (Note:
Your workspace may differ.)

> ls()
[1] "i" "n" "xbar_10" "xbar_20" "xbar_30" "xbar_5"

The variable i should contain the last value of the loop index.

> i
[1] 500

The variable n should contain the last sample size used in your code.

> n
[1] 30

The variables xbar_5, xbar_10, xbar_20, and xbar_30 should contain 500 sample means with
the indicated sample size. For example, xbar_5 contains the means of 500 samples of size n = 5
drawn from a normal population having mean μ = 100 and standard deviation σ = 10.

> xbar_5
[1] 94.47642 98.71575 106.02464 104.54210 102.46225 95.55821
[7] 106.01677 103.29282 101.07612 93.36717 97.08930 99.53715
[13] 98.72121 91.59586 100.12131 100.20003 99.66180 97.42841
[19] 102.15585 93.75008 98.89875 103.13431 101.45754 97.65430
[25] 103.02266 97.22582 95.37828 96.91194 98.07255 96.93296

...
...

[493] 102.79047 92.16152 103.85065 95.74779 92.83653 97.43172


[499] 99.70777 106.23438

The histograms in Figure 1 indicate that the spread of the distributions of sample means narrows
with increasing sample size. Let's examine the standard deviations of xbar_5, xbar_10,
xbar_20, and xbar_30. The sd command is used to find the standard deviation of any collection
of data.
> sd(xbar_5)
[1] 4.52229
> sd(xbar_10)
[1] 3.179132
> sd(xbar_20)
[1] 2.165316
> sd(xbar_30)
[1] 1.820735

Sure enough, as the sample size increases, the standard deviation, which is a measure of the
spread of the distribution, clearly decreases.

Adding Data Points

Let's add a couple more data points. Save the following lines in the file central3_2.R.

n=5
xbar_5=rep(0,500)
for (i in 1:500){ xbar_5[i]=mean(rnorm(n,mean=100,sd=10))}

n=10
xbar_10=rep(0,500)
for (i in 1:500) { xbar_10[i]=mean(rnorm(n,mean=100,sd=10))}

n=20
xbar_20=rep(0,500)
for (i in 1:500){ xbar_20[i]=mean(rnorm(n,mean=100,sd=10))}

n=30
xbar_30=rep(0,500)
for (i in 1:500) { xbar_30[i]=mean(rnorm(n,mean=100,sd=10))}

n=40
xbar_40=rep(0,500)
for (i in 1:500) { xbar_40[i]=mean(rnorm(n,mean=100,sd=10))}

n=50
xbar_50=rep(0,500)
for (i in 1:500) { xbar_50[i]=mean(rnorm(n,mean=100,sd=10))}

Now, collect the sample sizes in a vector ssizes and standard deviations of the distributions of
sample means in a vector sdevs, then plot the standard deviations versus the sample sizes. Add
these lines to the code in the file central3_2.R.

ssizes=c(5,10,20,30,40,50)
sdevs=c(sd(xbar_5),sd(xbar_10),sd(xbar_20),sd(xbar_30),sd(xbar_40),sd(xbar_50))

plot(ssizes,sdevs)
Save the file central3_2.R then "source" the file to produce the scatterplot shown in Figure 2.

Figure 2. Plotting the standard deviations of the distributions of sample means versus the sample size.

Modeling Standard Deviation versus Sample Size

Let's examine the current workspace in R. Note that this next command is executed at the
command line.

> ls()
[1] "i" "n" "sdevs" "ssizes" "xbar_10" "xbar_20"
[7] "xbar_30" "xbar_40" "xbar_5" "xbar_50"

Good! All the data we need exists in the current workspace.

We will now try to find the specific relationship between the standard deviations of the
distributions of sample means and the corresponding sample size. In the activity Transforming
Data in R, we examined two models: power functions and exponential functions. Let's plot the
logarithm of sdevs versus ssizes.

> plot(log(ssizes),log(sdevs))
The above command produces the scatterplot shown in Figure 3.

Figure 3. Plotting the logarithm of sdevs versus the logarithm of ssizes.

The fact that the plot of the logarithm of sdevs versus the logarithm of ssizes in Figure 3 is linear
indicates that a power function would be a good model for the relationship between sdevs and ssizes.

As we learned in the activity Transforming Data in R, we first fit a line to the logarithm of sdevs
versus the logarithm of ssizes in Figure 3.

> res=lm(log(sdevs)~log(ssizes))
> res

Call:
lm(formula = log(sdevs) ~ log(ssizes))

Coefficients:
(Intercept) log(ssizes)
2.2962 -0.4949

Let's add the line of best fit to the data in Figure 3.

> abline(res)
This produces the line of best fit in Figure 4.

Figure 4. Adding the line of best fit.

The fact that the line fits the data in Figure 4 so well is a strong indication that a power function
is the relationship we seek between the standard deviations of the distribution of sample means
and the sample sizes.

The output in the variable res tells us that the line of best fit in Figure 4 has the following
equation.

log sdevs = 2.2962 - 0.4949 log ssizes

We exponentiate both sides of the last equation.

e^(log sdevs) = e^(2.2962 - 0.4949 log ssizes)

On the left, the exponential and the logarithm are inverses. On the right, the exponential of a sum
is the product of the exponentials.

sdevs = e^2.2962 e^(-0.4949 log ssizes)


We compute e^2.2962.

> exp(2.2962)
[1] 9.936352

Thus, our last equation becomes:

sdevs = 9.936352 e^(-0.4949 log ssizes)

We use a property of logarithms to move - 0.4949 into the exponent.

sdevs = 9.936352 e^(log ssizes^(-0.4949))

Again, the logarithm and exponential are inverses, which leads to the following result.

sdevs = 9.936352 ssizes^(-0.4949)

Conclusions

Suppose that we round the numbers in our last result to obtain:

sdevs = 10 ssizes^(-0.5)

Note the number 10. It is not a coincidence that this number is the standard deviation of the
parent population. Remember, we drew our samples from a normal distribution with mean μ =
100 and standard deviation σ = 10.

Indeed, here is the final conclusion of the Central Limit Theorem.

Central Limit Theorem. Let X be a random variable having a normal distribution with mean μ and
standard deviation σ. Let X̄ be the random variable representing the mean of a sample of size n
drawn from the distribution of X. Then:

1. The distribution of X̄ will be normal.
2. The mean of the distribution of X̄ is identical to the mean of the distribution of X. In symbols, we
write:

μ_X̄ = μ_X

3. The standard deviation of the distribution of X̄ is related to the standard deviation of the
distribution of X by the following equation:

σ_X̄ = σ_X n^(-1/2)

Note that the result in item #3 is identical to our experimental result, repeated here for
convenience.

sdevs = 10 ssizes^(-0.5)

The comparison is better seen if we use a little algebra to change the form of the result in item
#3.

σ_X̄ = σ_X / √n

Thus, to find the standard deviation of the distribution of sample means, take the standard
deviation of the parent population and divide it by the square root of the sample size.
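
As a sanity check, we can compare these theoretical values, σ/√n with σ = 10, against the
simulated standard deviations already in our workspace (your sdevs will vary slightly from run
to run):

> 10/sqrt(ssizes)
[1] 4.472136 3.162278 2.236068 1.825742 1.581139 1.414214
> sdevs   # the simulated values should land close to these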

Sampling from Distributions that are not Normal

Here is the key fact: The hypotheses and conclusions of the Central Limit Theorem as stated
above hold regardless of the distribution of the parent population, provided the sample size is
large enough. We now will examine this statement through examples.

Drawing from a Uniform Distribution

Consider the uniform distribution on the interval [0,10].

> x=seq(0,10,0.1)
> y=dunif(x,min=0,max=10)
> plot(x,y,type="l",lwd=2,col="red")

These commands produce the uniform density function on [0,10] in Figure 5.

Figure 5. The uniform density function on [0,10].

The mean of the uniform distribution on the interval [a, b] is given by the formula:

μ = (a + b)/2

Thus, the mean of the uniform distribution on [0,10] is:

μ = (0 + 10)/2 = 5

You need to use a little calculus to determine the standard deviation of the uniform distribution
on the interval [a, b], but the result is:

σ = (b - a)/√12

Thus, the standard deviation of the uniform distribution on the interval [0,10] is:

σ = (10 - 0)/√12 ≈ 2.886751
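
We can confirm the arithmetic at the console:

> (0+10)/2
[1] 5
> (10-0)/sqrt(12)
[1] 2.886751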

Now, let's draw 500 samples of size n=100 from the uniform distribution on [0,10].

> n=100
> xbar=rep(0,500)
> for (i in 1:500) { xbar[i]=mean(runif(n,min=0,max=10))}
> hist(xbar,prob=TRUE)

These commands produce the histogram in Figure 6.

Figure 6. Histogram of means of samples of size n = 100 drawn from the uniform distribution on [0,10].
Now, let's check the conclusions of the Central Limit Theorem.

1. The first conclusion of the Central Limit Theorem is the fact that the distribution of sample
means is normal. This is evident in Figure 6.
2. The second conclusion of the Central Limit Theorem states that the mean of the sample means
is the same as the mean of the distribution from which the samples are drawn. In Figure 6, the
mean of the distribution of sample means appears to be about 5 (estimate the balance point in
Figure 6), which is the same as the mean of the uniform distribution on [0,10]. However, let's do
one more check:

> mean(xbar)
[1] 5.002197

Looks good!

3. The third conclusion of the Central Limit Theorem states that the standard deviation of the
distribution of sample means is calculated with:

σ_X̄ = σ_X/√n = 2.886751/√100 ≈ 0.2887

Let's check this prediction.

> sd(xbar)
[1] 0.2936821

That's pretty good agreement!

Drawing from an Exponential Distribution

In the activity Continuous Distributions in R, we examined the exponential distribution having the
following probability density function:

The Exponential Probability Density Function

p(x) = λ e^(-λx), for x ≥ 0 (and p(x) = 0 for x < 0)

The mean and standard deviation of this exponential distribution are μ = 1/λ and σ = 1/λ,
respectively.

Let's consider the exponential distribution with λ = 1.

> lambda=1
> x=seq(0,10,0.1)
> y=dexp(x,rate=lambda)
> plot(x,y,type="l",lwd=2,col="red")

These commands produce the density curve in Figure 7.

Figure 7. The exponential density function for λ = 1.

The mean and standard deviation of this distribution are:

μ_X = 1/λ = 1/1 = 1 and σ_X = 1/λ = 1/1 = 1

Now, let's draw 500 samples of size n=100 from the exponential distribution with λ = 1.

> n=100
> xbar=rep(0,500)
> for (i in 1:500) { xbar[i]=mean(rexp(n,rate=lambda))}
> hist(xbar,prob=TRUE)
These commands produce the histogram of the sample means shown in Figure 8.

Figure 8. Distribution of sample means of size n = 100 from the exponential distribution with λ = 1.

Now, let's check the conclusions of the Central Limit Theorem.

1. The first conclusion of the Central Limit Theorem is the fact that the distribution of sample
means is normal. This is evident in Figure 8.
2. The second conclusion of the Central Limit Theorem states that the mean of the sample means
is the same as the mean of the distribution from which the samples are drawn. In Figure 8, the
mean of the distribution of sample means appears to be about 1 (estimate the balance point
in Figure 8), which is the same as the mean of the exponential distribution with λ = 1.
However, let's do one more check:

> mean(xbar)
[1] 0.9997962

Looks good!

3. The third conclusion of the Central Limit Theorem states that the standard deviation of the
distribution of sample means is calculated with:

σ_X̄ = σ_X/√n = 1/√100 = 0.1

Let's check this prediction.

> sd(xbar)
[1] 0.09948667

That's pretty good agreement!

Enjoy!

We hope you enjoyed this third activity on the Central Limit Theorem in the R system. We
encourage you to explore further. You might try repeating the experiments in this activity
with different "parent" distributions.
