This section will begin a very gentle introduction to plotting in R. The goal is to show how one
can draw the plot of a function using R and annotate the resulting plot. The tutorial is easy to
follow and it gives you a nice overview of the plotting capabilities of R.
R as a Calculator
Start R on your system. This will open R's interactive shell. Let's start by performing a basic
calculation in the shell. Enter the expression 2+2, then press the Enter key. The result follows.
> 2+2
[1] 4
Sweet! R can add two and two! Let's try something a bit harder, such as sin(π/2).
> sin(pi/2)
[1] 1
Good start! Let's produce a list of numbers, storing the result in the vector x.
> x=0:5
We can see what's stored in the variable x by typing its name and hitting the Enter key.
> x
[1] 0 1 2 3 4 5
> x+3
[1] 3 4 5 6 7 8
> x^2
[1] 0 1 4 9 16 25
These calculations are pretty typical of the way in which R deals with lists of numbers. Whatever
you do to the list is applied to each element in the list.
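This element-wise behavior also extends to operations between two vectors of the same length, a point worth a quick sketch of our own (the vector y below is our illustration, not part of the tutorial's session):

```r
x = 0:5
y = c(10, 20, 30, 40, 50, 60)

# Corresponding elements are paired off: x[1]*y[1], x[2]*y[2], and so on
x * y   # 0 20 60 120 200 300
x + y   # 10 21 32 43 54 65
```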
Let's do something a bit more substantial. First, let's create a list of numbers using R's seq
command.
> x=seq(0,2*pi,by=pi/2)
> x
[1] 0.000000 1.570796 3.141593 4.712389 6.283185
Note the syntax: seq(a,b,by=increment) produces a list of numbers that starts at a, increments
by increment, and finishes at b. Hence the command x=seq(0,2*pi,by=pi/2) produces a list of
numbers that starts at zero, increments by π/2, and stops at 2π. Now, let's take the sine of each
number in the list stored in x.
> sin(x)
[1] 0.000000e+00 1.000000e+00 1.224647e-16 -1.000000e+00 -2.449294e-16
A little strange, but note the scientific notation. This result is approximately 0, 1, 0, -1, 0, which
is correct. The small errors are due to round-off error.
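If the scientific notation is distracting, one option (our suggestion, not part of the original session) is to round the results with R's round command, which suppresses the tiny round-off errors:

```r
x = seq(0, 2*pi, by = pi/2)

# Rounding to 10 decimal places turns the near-zero values into exact zeros
round(sin(x), 10)   # 0 1 0 -1 0
```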
Plotting a Sinusoid
In this first activity, let's try to draw the graph of a sinusoid with the following equation.
y = 2 sin(2πx - π/2)
In a trigonometry class, the first step would be to identify the amplitude, period, and phase shift.
To that end, we factor out a 2π.
y = 2 sin 2π (x - 1/4)
Comparing this equation with the more general form, y = A sin B (x - φ), we see that A = 2, B =
2π, and φ = 1/4. Thus, we have the following facts:
1. The amplitude is A = 2.
2. The period is 2π/B = 2π/(2π) = 1.
3. The phase shift is φ = 1/4.
Thus, when we draw the graph of the sinusoid, we should see it complete one full cycle every 1
second, it should bounce up and down between 2 and -2, and it should be shifted 1/4 units to the
right. With these thoughts in mind, let's sketch 2 periods of the sinusoid, which means that we
should sketch the graph of the sinusoid over the domain [0, 2].
Coding
We begin. We first create a list of domain values and store them in x. The following command
starts at 0, increments by 0.01, and stops at 2, giving us a large number of points in the domain
[0,2].
> x=seq(0,2,by=0.01)
We now need to evaluate the sinusoid y = 2 sin 2π (x - 1/4) at each point of the vector x. This is a
simple matter in R.
> y=2*sin(2*pi*(x-1/4))
The above command evaluates the sinusoid at each point in the vector x and stores the result in
the vector y. All that remains is to plot the points in the vector y versus the points in the vector x.
> plot(x,y)
The result of this last command on our system is shown in Figure 1. Results may differ on
different systems depending on the windowing system used by your version of R.
Note that R's default behavior is to plot the individual points. We can change the "type" of the
plot as follows.
> plot(x,y,type='l')
The result is shown in Figure 2. Note that the points are no longer marked and we have a
"smooth curve" instead. Actually, what's going on behind the scenes is each consecutive pair of
points is being joined with a small line segment. Because we've plotted a "lot of points," the
result has the appearance of a "smooth curve."
Figure 2. Changing the type, giving the appearance of a "smooth curve."
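To see the line segments at work, try the same plot with far fewer points (a side experiment of ours, not part of the tutorial's session). With only nine points, the "curve" is visibly jagged:

```r
# A coarse domain: only 9 points instead of 201
xc = seq(0, 2, by = 0.25)
yc = 2*sin(2*pi*(xc - 1/4))

# The same type='l' plot now reveals its underlying line segments
plot(xc, yc, type = 'l')
```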
You can add axis labels and a title with the commands that follow below. After typing the first
line, namely plot(x,y,type='l', (that's a lower case "L") press the Enter key. The plus sign (+) on
the next line is R's "line continuation" character and is automatically supplied by R's interactive
shell. Without the "line continuation" character, lines would be too long to include in this
document.
> plot(x,y,type='l',
+ xlab='x-axis',
+ ylab='y-axis',
+ main='A plot of a sinusoid.')
It's somewhat difficult to type a number of consecutive lines correctly, so if you make a mistake
you will need to start over. However, you can use the "up-arrow" on your keyboard to replay
lines you've already typed. Pressing the "up-arrow" a number of times takes you back through the
"history" of lines you've entered in R's interactive shell. When you get the line you want, press
Enter, or edit the line and then press Enter.
The result of this code is shown in Figure 3. Note the addition of axis labels and a "main" title.
Figure 3. Adding axis labels and a title to the figure.
Enjoy
We hope you've enjoyed this small introduction to plotting in R. To find out more about R's plot
command, type the following command in R's interactive shell.
> ?plot
This will open a help file for the plot command, where you can read more about this powerful command.
Sampling from a Discrete Distribution
A colleague of mine asked how he could sample from a discrete distribution. Imagine, he said, a
six-sided die that is "loaded." That is, when thrown, the sides do not show with equal
probabilities. Examine the discrete distribution delineated in the following table. The first
column contains the number showing on the face of the die, while the second column states the
probability of observing that face when the thrown die comes to rest.
A Six-Sided Die
x p
1 0.1
2 0.1
3 0.1
4 0.1
5 0.2
6 0.4
Thus, for example, the probability of throwing a 1 is 0.1, a 10% chance. The probability of
throwing a 6 is 0.4, a 40% chance.
We will now draw from the set (1, 2, 3, 4, 5, 6) with replacement. That is, each time we draw a
number from the set, we record the result, then place the number back in the set, so that it is
available on our next draw.
However, we must also assign numbers (0.1, 0.1, 0.1, 0.1, 0.2, 0.4) representing the probabilities
of selecting a number in the corresponding position of the list (1, 2, 3, 4, 5, 6).
This task is easily accomplished in R. We will use the sample command, which has the
following general syntax.
sample(x, size, replace = FALSE, prob = NULL)
Readers can obtain additional help on the sample command by entering the following command
at the R prompt:
> ?sample
Let's begin by drawing 100 numbers from this "discrete" distribution. After you enter the first of
the following code lines, press the Enter key, which results in a new line that starts with a "plus
sign" (+). This is R's line continuation character and is provided by the system when you press
Enter to start a new line. Continue typing the remaining lines of code. On the last line, the
concluding parenthesis delimits a complete command; R will then execute it and store the result
in the variable v.
> v=sample(c(1,2,3,4,5,6),
+ 100,
+ replace=TRUE,
+ prob=c(0.1,0.1,0.1,0.1,0.2,0.4))
Well, that was easy! And fast! The variable v now contains 100 random numbers drawn from the
discrete distribution with the assigned probabilities. You can view these numbers by typing v at
the R prompt. The result is shown below.
> v
[1] 1 6 4 6 1 1 3 2 6 3 6 3 6 5 2 5 4 6 6 4 6 5 6 1 5 6 6 6 6 2 6 5 6
[34] 5 6 1 2 6 6 1 2 5 6 5 4 5 5 3 6 1 5 6 5 3 4 6 6 6 5 4 4 5 6 3 6 3
[67] 6 6 2 6 5 4 6 2 6 2 5 5 5 6 4 6 6 3 6 5 6 6 5 4 4 5 5 3 3 5 3 3 2
[100] 5
You can test the length of this vector with the length command.
> length(v)
[1] 100
Now that we have 100 numbers drawn from our discrete distribution, how do we go about
visualizing the result? One idea would be to sort the numbers, then count the number of times
each number occurs. This is most easily accomplished with R's table command.
> table(v)
v
1 2 3 4 5 6
7 9 12 11 24 37
Note that the number 1 was selected 7 times, the number 2 was selected 9 times, and so on.
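As a quick sanity check of our own (the set.seed call is only there to make this sketch reproducible), we can divide the table counts by the sample size and compare the empirical proportions against the assigned probabilities 0.1, 0.1, 0.1, 0.1, 0.2, 0.4:

```r
set.seed(1)   # hypothetical seed, only for reproducibility
v = sample(c(1,2,3,4,5,6), 100, replace = TRUE,
           prob = c(0.1,0.1,0.1,0.1,0.2,0.4))

# Empirical proportions; these should be roughly 0.1 ... 0.4
table(v) / length(v)
```

With only 100 draws the agreement is rough; increasing the sample size brings the proportions closer to the assigned probabilities.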
A barplot is one of the best methods for visualizing the results of our sample. The following
command creates the barplot shown in Figure 1.
> barplot(table(v))
Pie Charts in R
There are a number of important types of plots that are used in descriptive statistics. A common
plot that is frequently used in newspapers and magazines is the pie chart, where the size of a
"wedge of pie" helps the reader visualize the percentage of data falling in a particular category.
Take, for example, the responses of children who were asked their favorite color of M and M
candies.
Favorite Color
Red 40
Blue 30
Green 20
Brown 10
Note that the data in the table are percentages, summing to 100%.
We'll first enter the data from the table above, storing the list in the variable mypie.
> mypie=c(40,30,20,10)
You can obtain help on R's combine command by typing ?c at the prompt. This command is used
to concatenate elements into a single list, particularly useful when entering data manually. You
can see the contents of the variable by typing mypie and hitting the Enter key.
> mypie
[1] 40 30 20 10
It is a simple task to create a pie chart using the data stored in mypie.
> pie(mypie)
In the upcoming exercises, we'll explore further the capabilities of R's pie command. You can
read more about this command by entering the command ?pie and reading the resulting help file.
We can add "names" to the percentages of colored candies favored by the children.
> names(mypie)=c("Red","Blue","Green","Brown")
The names command attaches a "name" to each piece of data. We can see the result of this
command by typing the variable name (mypie) and hitting the Enter key.
> mypie
Red Blue Green Brown
40 30 20 10
Note how each piece of data now has a "name" or "header." We can access individual elements
of the list with R's indexing capability. For example, to access the first element of the list (R
starts indexing at one), enter:
> mypie[1]
Red
40
However, since the data have "names" attached, we can access each data item through its
"name."
> mypie['Red']
Red
40
The pie command knows how to apply these "names" to the pie chart.
> pie(mypie)
When "names" are added to each piece of data (as we did to the variable mypie), R's pie
command automatically adds these names as labels to the corresponding "slice of pie," as shown
in Figure 2.
It won't do to have the "Red" slice colored white. Let's see how we can use customized colors
for each "slice of pie." First, we store the corresponding colors for each slice in the variable
mycolors.
> mycolors=c("red","blue","green","brown")
Next, we pass the colors in mycolors to the col argument of the function pie.
> pie(mypie,col=mycolors)
Enjoy
We hope you enjoyed this very short introduction to pie charts using R. To find out more about
what can be accomplished with R's pie command, remember to type ?pie, hit Enter, and read the
resulting help file.
Bar Charts in R
There are a number of important types of plots that are used in descriptive statistics. A common
plot that is frequently used in newspapers and magazines is the bar chart, where the height of
each bar helps the reader visualize the amount of data falling in a particular category. Take, for
example, the responses of children who were asked their favorite color of M and M candies.
Favorite Color
Red 40
Blue 30
Green 20
Brown 10
Note that the data in the table are percentages, summing to 100%.
We'll first enter the data from the table above, storing the list in the variable x.
> x=c(40,30,20,10)
You can obtain help on R's combine command by typing ?c at the prompt. This command is used
to concatenate elements into a single list, particularly useful when entering data manually. You
can see the contents of the variable by typing x and hitting the Enter key.
> x
[1] 40 30 20 10
It is a simple task to create a bar chart using the data stored in x.
> barplot(x)
In the upcoming exercises, we'll explore further the capabilities of R's barplot command. You
can read more about this command by entering the command ?barplot and reading the resulting
help file.
We can add "names" to the percentages of colored candies favored by the children.
> names(x)=c("Red","Blue","Green","Brown")
The names command attaches a "name" to each piece of data. We can see the result of this
command by typing the variable name (x) and hitting the Enter key.
> x
Red Blue Green Brown
40 30 20 10
Note how each piece of data now has a "name" or "header." We can access individual elements
of the list with R's indexing capability. For example, to access the first element of the list (R
starts indexing at one), enter:
> x[1]
Red
40
However, since the data have "names" attached, we can access each data item through its
"name."
> x['Red']
Red
40
The barplot command knows how to apply these "names" to the bar chart.
> barplot(x)
When "names" are added to each piece of data (as we did to the variable x), R's barplot
command automatically adds these names as labels to the corresponding "bars," as shown in
Figure 2.
It won't do to have the "Red" bar colored white. Let's see how we can use customized colors
for each "bar." First, we store the corresponding colors for each bar in the variable mycolors.
> mycolors=c("red","blue","green","brown")
Next, we pass the colors in mycolors to the col argument of the function barplot.
> barplot(x,col=mycolors)
Enjoy
We hope you enjoyed this very short introduction to bar charts using R. To find out more about
what can be accomplished with R's barplot command, remember to type ?barplot, hit Enter,
and read the resulting help file.
Histograms in R
There are a number of important types of plots that are used in descriptive statistics. A common
plot that is frequently used in newspapers and magazines is the histogram, a sequence of vertical
bars, where the height of each bar represents a count of the data values falling in a "bin." The
"count", "bins", and "bars" will be explained in the images that follow.
The standard normal distribution is the famous "bell-shaped" curve of statistics, known to have a
mean value of zero and a standard deviation of one. If you are not familiar with the standard
normal distribution, read on. This distribution will be made clear in the images that follow. To
learn about R's commands for the normal distribution, enter the following at the R prompt.
> ?rnorm
The help file response describes the use of several commands relating to the normal distribution.
The one of interest for this activity is R's rnorm command, with syntax rnorm(n,mean,sd),
where the arguments are used as follows:
1. n is the number of random numbers to draw.
2. mean is the mean of the normal distribution.
3. sd is the standard deviation of the normal distribution.
Let's begin by drawing 1000 numbers from this "standard normal" distribution.
> x=rnorm(1000,mean=0,sd=1)
Well, that was easy! And fast! The variable x now contains 1000 random numbers drawn from
the standard normal distribution. You can view these numbers by typing x at the R prompt. The
first few numbers are shown below.
> x
[1] -1.123987512 0.865229526 -1.325374408 0.679182289 -1.184965803
[6] 1.755767521 0.064290993 -1.733885165 0.470695523 0.303721954
[11] 0.496295681 -0.431201657 1.378353239 1.729874427 0.445363031
...
You can test the length of this vector with the length command.
> length(x)
[1] 1000
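As a quick check of our own (not part of the tutorial's session), the sample mean and sample standard deviation of these 1000 numbers should land close to 0 and 1, respectively:

```r
x = rnorm(1000, mean = 0, sd = 1)

# For 1000 draws these are typically within a few hundredths of 0 and 1;
# exact values vary from run to run
mean(x)
sd(x)
```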
We have our 1000 random numbers, so just how do we go about creating a histogram of these
numbers? If we were doing the problem by hand, we would first define some categories called
"bins", and then sort the random numbers into these "bins". The final count of occurrences in
each bin might look like the following.
The table shows that there was only one number falling between -3.5 and -3, there were 6
numbers falling between -3 and -2.5, etc. Proceeding by hand, we would next set up axes on
graph paper, partition the horizontal axis into "bins", then scale the vertical axis to accommodate
the counts in our table. Over each bin we would create a rectangle having a vertical height that
corresponds to the number of occurrences in that particular bin.
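R can also carry out the hand procedure directly. The following sketch (ours, not the tutorial's) uses R's cut command to sort the numbers into half-unit bins and the table command to count the occupants of each bin:

```r
x = rnorm(1000, mean = 0, sd = 1)
bins = seq(-3.5, 3.5, by = 0.5)

# cut() assigns each number to a bin; table() counts each bin's occupants.
# The rare value beyond -3.5 or 3.5 would fall outside these bins.
table(cut(x, breaks = bins))
```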
Performing this task by hand is a painstaking procedure. R can automate the process for us,
which will allow us to spend more time interpreting the results and less time having to deal
with the tedium of sorting 1000 numbers into bins and counting them by hand.
What follows is the simplest way to get a quick histogram of the data in the variable x.
> hist(x)
The command hist(x) responds by spewing some output to the terminal window (more on this
later) and creating the nice looking histogram shown in Figure 1.
Figure 1. A histogram of the standard normal distribution data in the variable x.
R offers fine-grain control over the appearance and form of its histograms. We can learn more
about this command through the interactive shell of the R environment.
> ?hist
The help system responds with a wealth of information on R's hist command, a snippet of which
follows.
Usage
hist(x, ...)
## Default S3 method:
hist(x, breaks = "Sturges",
freq = NULL, probability = !freq,
include.lowest = TRUE, right = TRUE,
density = NULL, angle = 45, col = NULL, border = NULL,
main = paste("Histogram of" , xname),
xlim = range(breaks), ylim = NULL,
xlab = xname, ylab,
axes = TRUE, plot = TRUE, labels = FALSE,
nclass = NULL, ...)
Let's first focus on the "breaks" argument to the hist command. This argument can be used in a
number of ways.
1. You can "suggest" the number of breaks ("bins") you want. For example, if we wanted
fewer bins, we'd pass this request to the hist command as follows.
2. > hist(x,breaks=5)
Note that there are not exactly 5 bins in Figure 2, but there are fewer.
3. There is a second way to proceed, and that entails directing the hist command to use a
specific set of bins. Let's say that we want the histogram to use the bins described in our
tabular work. Use the seq command to produce this set of bins.
4. > bins=seq(-3.5,3.5,by=0.5)
You can see the result of this command by typing bins at the R prompt.
> bins
[1] -3.5 -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5
[14] 3.0 3.5
These are precisely the bins in our table. We can now request that R use these bins with
the following command.
> hist(x,breaks=bins)
Finally, we'll add a bit of color and our own personal annotations to the axes and title. In the
code that follows, the plus (+) sign is the line continuation character. After entering hist(x,, hit
the Enter key. The plus sign is added by R's shell automatically. Continue entering the code as
shown, hitting Enter at the end of each line. When you close the parentheses on the last line and
hit Enter, the entire command is executed.
> hist(x,
+ breaks=bins,
+ col="lightblue",
+ xlab="x-values",
+ ylab="count",
+ main="Random Numbers from the Standard Normal Distribution")
The result of this command is shown in Figure 4.
1. Note how the histogram is approximately "balanced" about its mean, which appears to be
located approximately at zero. This is precisely what we would anticipate, given the fact
that the numbers in the variable x were randomly drawn from the standard normal
distribution, which has mean zero.
2. Because the numbers in the variable x were drawn from the standard normal distribution,
which has a standard deviation of one, note that almost all of the data occurs within 3
standard deviations of the mean on either side. That is, almost all of the data in the
variable x occurs between -3 and 3, which represents 3 standard deviations on either side
of the mean 0.
Enjoy!
We hope you enjoyed this introduction to the R system. This interactive system provides a strong
interactive interface for exploration in statistics.
We encourage you to explore further. Use the command ?hist to learn more about what you can
do with the hist command.
Boxplots in R
In this activity we show our readers how to create a boxplot in R. In preparation for this activity,
we must first explore what statisticians call "measures of central tendency," specifically the
mean and median of a data set.
We first create a set of data that we will use throughout this activity. Although our data set is
somewhat artificial, all of what we explain in this activity (as it relates to our data set) can also
be applied to any set of data chosen by our readers. With this thought in mind, we enter our data
set at the R prompt.
> x=c(0,4,15,1,6,3,20,5,8,1,3)
We can examine the contents of the variable x with the following command:
> x
[1] 0 4 15 1 6 3 20 5 8 1 3
The Mean
One of the most important measures of central tendency in statistics is the mean, which is found
by summing the elements of the data set, then dividing by the number of elements in the set. We
can sum the elements in the data set contained in the variable x with the R-command sum.
> sum(x)
[1] 66
Readers should convince themselves (get out the pencil and paper) that the elements in the
variable x do indeed sum to 66. Add up the individual elements in the list stored in x and show
that the sum is 66.
To find the number of elements in the list stored in the variable x, we can get out our abacus and
count them, or we can use R's length command.
> length(x)
[1] 11
Readers should check that the list stored in x does indeed have 11 elements. Count them!
To find the average of the list stored in x, we divide the sum by the number of elements in the
list.
> sum(x)/length(x)
[1] 6
Recall that the sum of the elements in x was 66, the length was 11, so the average (or mean) is
66/11=6, as verified by our R-command sum(x)/length(x).
However, because finding the mean is such a common requirement in most statistical analysis, it
should come as no surprise that R has a command for finding the mean of a data set.
> mean(x)
[1] 6
The Median
The mean of a data set can be strongly influenced by "outliers" in the data. Consider anew the
data stored in the variable x.
> x
[1] 0 4 15 1 6 3 20 5 8 1 3
Let's sketch a quick histogram of the data stored in x. The following command produces the
histogram shown in Figure 1.
> hist(x)
Figure 1. A histogram of the data stored in the variable x. Note that the data is badly skewed to
the right.
Note the long tail to the right in Figure 1. Statisticians say that the data is "skewed to the right."
Imagine that the bars of the histograms represent masses of equal density. If we were to place a
fulcrum or "knife-edge" located at the mean (at x = 6), the masses would balance. The outliers
can greatly affect the placement of the mean. It's like an old-fashioned "teeter-totter." A child
seated at greater distance from the fulcrum is able to balance a much heavier child seated closer
to the fulcrum.
To pursue this line of reasoning a bit further, imagine that the numbers contained in the variable
x represent speakers' fees in thousands of dollars. Let's sort the data in ascending order.
> sort(x)
[1] 0 1 1 3 3 4 5 6 8 15 20
It is probably unfair to say that the "average speaking fee is $6,000." Although statistically
correct, the average (mean) speaking fee ($6,000) does not reflect a common speaking charge for
this collection of speakers. Indeed, the two outliers (the speakers charging $15,000 and $20,000)
unduly influence the mean.
A second measure of central tendency, a statistic called the median, will be seen to more
closely resemble what a group might be charged should they hire one of the speakers represented
in the data set stored in x. The median is defined to be the data item that is precisely in the
middle of the sorted data set; that is, half (50%) of the data occurs to the left of the median, and
half (50%) occurs to the right of the median.
In the case that the data set has an odd number of elements, it is a simple matter to spot the data
item that lies precisely in the middle of the sorted data. The data set stored in the variable x has
11 elements. Hence, the sixth element of the sorted data lies exactly in the middle, so the median
is 4. Note
that this number represents a speaking fee of $4,000, which is probably more representative of a
"middling fee" that a group might expect should they use one of the speakers represented by the
data stored in x.
> median(x)
[1] 4
If a data set has an even number of elements, the median is found by averaging the two "middle
elements." For example, the following data set has six elements.
> y=1:6
> y
[1] 1 2 3 4 5 6
The median is found by averaging the third and fourth elements; that is, the median is (3 + 4)/2 =
3.5. R is completely aware of the even case.
> median(y)
[1] 3.5
Quantiles
The median of a data set is located so that 50% of the data occurs to the left of the median (and
50% of the data occurs to the right of the median). There is no reason to restrict our attention to
the 50% level. For example, we can find a point where 25% of the data occurs on its left (and
75% to its right). This point is known as the first "quartile" and is found with the following R
command:
> sort(x)
[1] 0 1 1 3 3 4 5 6 8 15 20
> quantile(x,0.25)
25%
2
To help explain, we've listed the data set in ascending order. R provides nine different algorithms
for computing the 25% quantile which can be viewed by typing the command ?quantile. The
default technique is to use linear interpolation to find the entry in the position given by the
formula 1 + p(n -1), where p is the required percentage and n is the length of the data set. In this
particular case, p = 0.25 and n = 11, so 1 + p(n -1) = 3.5. Thus, R will interpolate (linearly) a
number that is exactly halfway between the third and fourth entries, arriving at 1 + 0.5(3 - 1) =
2.
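We can verify this interpolation by hand in R (a verification sketch of ours, not part of the original session):

```r
x = c(0, 4, 15, 1, 6, 3, 20, 5, 8, 1, 3)
s = sort(x)

# Position given by 1 + p(n - 1) with p = 0.25 and n = 11
pos = 1 + 0.25 * (length(x) - 1)   # 3.5

# Halfway between the third and fourth sorted entries
s[3] + 0.5 * (s[4] - s[3])         # 2, matching quantile(x, 0.25)
```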
In similar fashion, R will approximate the 75% quantile with the following command:
> sort(x)
[1] 0 1 1 3 3 4 5 6 8 15 20
> quantile(x,0.75)
75%
7
Note that 1 + p(n - 1) = 1 + 0.75(11 - 1) = 8.5, so R reports the number that is exactly halfway
between the eighth and ninth entries, namely 6 + 0.5(8 - 6) = 7.
In general, the p% quantile will be a number that finds p% of the data to its left. For the
remainder of this activity, the most important statistics are the minimum, first quartile, median,
third quartile, and the maximum. We can use the quantile command to compute all of these at
once.
> quantile(x,c(0,0.25,0.5,0.75,1))
0% 25% 50% 75% 100%
0 2 4 7 20
However, R's summary command will report each of these quantiles with descriptive headers,
and throw in the mean for good measure.
> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 2 4 6 7 20
Two pairs of numbers in the summary for our data set give the user a sense of the "spread" of the
data involved. The first is the range of the data set.
> range(x)
[1] 0 20
Note that R's range command reports the minimum and maximum entries in the data set. In
addition, R's IQR command gives the interquartile range.
> IQR(x)
[1] 5
The interquartile range is the difference between the 75% quantile and the 25% quantile. In
this case, IQR = 7 - 2 = 5.
It is easier to explain the boxplot if we first have a picture to which we can refer in the
discussion. So, without any further ado, here is how R produces a boxplot for the data in the
variable x.
> boxplot(x,range=0)
The above command was used to produce the boxplot shown in Figure 2.
Figure 2. The minimum, quartiles, median, and maximum are used to construct a "box and
whisker plot."
So, how is this boxplot constructed? First, recall the summary data for the data in the variable x.
> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 2 4 6 7 20
Here are the steps for creating the standard box and whiskers plot.
1. Draw a thick, dark, horizontal segment at the median, that is, at 4. See Figure 2.
2. Second, draw horizontal lines at the first and third quartiles, that is, at 2 and at 7. Use
these to draw the "box." See Figure 2.
3. From the bottom edge of the box, draw a "whisker" that extends to the minimum data
value, namely 0. See Figure 2.
4. From the top edge of the box, draw a "whisker" that extends to the maximum data value,
namely 20. See Figure 2.
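To see how the five summary statistics line up with the plot, we can overlay dotted reference lines at each of them (this overlay is our own addition, not part of the tutorial's session):

```r
x = c(0, 4, 15, 1, 6, 3, 20, 5, 8, 1, 3)

boxplot(x, range = 0)

# Dotted lines at the minimum, the quartiles, the median, and the maximum
abline(h = c(0, 2, 4, 7, 20), lty = 3)
```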
There are several important points that need making in regard to our box and whisker plot.
1. 50% of the data occurs between the lower and upper edges of the box, namely, between
the first and third quartiles located at 2 and 7, respectively.
2. The lower 50% of the data occurs below the median, the dark horizontal line in the box in
Figure 2. Likewise, the upper 50% of the data occurs above the median line in the box.
3. The lower 25% of the data occurs between the bottom edge of the box and the end of the
lower whisker. Likewise, the upper 25% of the data occurs between the top edge of the
box and the end of the upper whisker.
The Standard Box Plot does not pay special attention to outliers that might be present. The
Modified Box Plot is constructed so as to highlight outliers. As in the Standard Boxplot
described above, let's begin with a picture. Note that the Modified Boxplot is the default in R,
and requires no special parameters.
> boxplot(x)
The above command was used to produce the modified "box and whiskers" plot shown in Figure
3.
Before explaining the construction, let's repeat the sorted data and the summary information.
> sort(x)
[1] 0 1 1 3 3 4 5 6 8 15 20
> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 2 4 6 7 20
Here are the steps required to construct a modified box and whiskers plot:
1. The median and the quartiles are used to construct the box in exactly the same manner
used to construct the standard boxplot.
2. Multiply the IQR by 1.5. So, in this case, 1.5 x IQR = 1.5 (5) = 7.5. Let's call this result
the STEP. That is, STEP = 7.5.
3. Add the STEP to the third quartile, obtaining 3rd Quartile + STEP = 7 + 7.5 = 14.5. Use
this to perform two tasks:
a. Any data beyond 14.5 is plotted using an empty circle. This explains the two
circles at 15 and 20 in Figure 3.
b. Locate the largest data point below 14.5. This is the number 8. This is where the
end of the upper whisker is drawn.
4. Subtract the STEP from the first quartile, obtaining 1st Quartile - STEP = 2 - 7.5 = -5.5.
Use this to perform two tasks:
a. Any data below -5.5 is plotted using an empty circle. There are no such data
points in Figure 3.
b. Locate the smallest data point that occurs above -5.5. This is the number 0. This is
where the end of the lower whisker is drawn.
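The fence arithmetic in the steps above can be checked directly in R (a sketch of ours; the variable names q and step are our own):

```r
x = c(0, 4, 15, 1, 6, 3, 20, 5, 8, 1, 3)

q    = quantile(x, c(0.25, 0.75))  # first and third quartiles: 2 and 7
step = 1.5 * IQR(x)                # the STEP: 1.5 * 5 = 7.5

q[2] + step          # upper fence: 14.5
q[1] - step          # lower fence: -5.5

x[x > q[2] + step]   # data beyond the upper fence: 15 and 20
```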
Enjoy!
We hope you enjoyed this introduction to the R system. This interactive system provides a strong
interactive interface for exploration in statistics.
We encourage you to explore further. Use the command ?boxplot to learn more about what you
can do with the boxplot command.
Discrete Distributions in R
The discrete distributions of statistics are not continuous. Usually, they consist of a finite
number of possible values for the random variable, and each possibility is assigned a probability
of occurrence.
One of the simplest discrete distributions is called the Bernoulli Distribution. This is a
distribution with only two possible values. For example, consider the Bernoulli distribution in
the table that follows:
A Bernoulli Distribution
x p
0 0.3
1 0.7
In this case, there are only two possible values of the random variable, x = 0 or x = 1. When
drawing numbers from this distribution, the probability of selecting x = 0 is 0.3, while the
probability of selecting x = 1 is 0.7.
There are many applications that can be represented by Bernoulli Distributions. For example, x
= 0 and x = 1 might represent the states of a lamp, with zero representing the state that the lamp
is switched "off" and one representing the state that the lamp is switched "on." Or x = 0 and x =
1 might represent the states of a biased coin, with zero representing "heads" and one representing
"tails."
Because the values of this distribution are discrete, the probability density function will
not be a continuous curve as in the Standard Normal Distribution or the Normal Distribution.
The usual way to visualize a discrete distribution is with a sequence of "spikes." Use the
following code to produce the image in Figure 3.
> x=c(0,1)
> y=c(0.3,0.7)
> plot(x,y,type="h",xlim=c(-1,2),ylim=c(0,1),lwd=2,col="blue",ylab="p")
> points(x,y,pch=16,cex=2,col="dark red")
Key Idea: It is important to note that the sum of the probabilities equals 1, much as it did in the
Standard Normal Distribution and the Normal Distribution. Only this time, it is not the area
under a curve that represents the total probability, but the sum of the probabilities represented by
the heights of the "spikes."
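The Key Idea can be checked directly in R by summing the spike heights:

```r
# The spike heights from the Bernoulli table above.
p <- c(0.3, 0.7)
sum(p)    # the total probability is 1
```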
The Binomial Distribution
Suppose that we toss a "fair" coin three times and let the random variable X count the number of
heads that appear. The eight possible outcomes are listed in the table that follows.
Outcome X = # Heads
HHH 3
HHT 2
HTH 2
HTT 1
THH 2
THT 1
TTH 1
TTT 0
Because the coin is "fair," each of the outcomes is equally likely, with each having a 1 in 8
chance of occurring. Thus, the probability of tossing three heads (HHH) is 1/8. Next, there are
three ways of obtaining two heads (HHT, HTH, and THH), each with a 1 in 8 chance of
occurring, so the probability of obtaining two heads in three tosses is 3/8. The remaining values of
the random variable and their probabilities are listed in the table that follows.
A Binomial Distribution
x p
3 1/8 = 0.125
2 3/8 = 0.375
1 3/8 = 0.375
0 1/8 = 0.125
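This table can be verified by brute force in R. The sketch below is our own code, not part of the activity; it uses expand.grid to enumerate all eight equally likely outcomes:

```r
# Enumerate every outcome of three coin tosses and count heads in each.
outcomes <- expand.grid(toss1 = c("H", "T"),
                        toss2 = c("H", "T"),
                        toss3 = c("H", "T"))
heads <- rowSums(outcomes == "H")     # number of heads per outcome
table(heads) / nrow(outcomes)         # probabilities for 0, 1, 2, 3 heads
```

The resulting probabilities, 0.125, 0.375, 0.375, and 0.125, match the table above.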
Using dbinom
It is readily apparent that this simple approach will rapidly become quite tedious if we increase
the number of times that we toss the coin. Fortunately, this "binomial distribution" is easily
calculated in R. To calculate the probability of obtaining three heads in three tosses of a "fair"
coin, enter the following code:
> dbinom(3,size=3,prob=1/2)
[1] 0.125
Note that this code gives a result that is identical to the first row in the table above. The dbinom
command takes three arguments.
1. The argument x is the value of the random variable, the number of heads in the current
example.
2. The size is the "number of trials," or the number of tosses in the current example.
3. The prob is the probability of success, or the probability of obtaining "heads" in the current
example.
> p=dbinom(0:3,size=3,prob=1/2)
> p
[1] 0.125 0.375 0.375 0.125
In this case, the input for x is the vector 0:3, which expands to 0, 1, 2, and 3. We
can create the "spikes" shown in Figure 4 with the following code:
> n=3
> p=1/2
> x=0:3
> p=dbinom(x,size=n,prob=p)
> plot(x,p,type="h",xlim=c(-1,4),ylim=c(0,1),lwd=2,col="blue",ylab="p")
> points(x,p,pch=16,cex=2,col="dark red")
Figure 4. The discrete binomial distribution describing the number of heads in three tosses of a "fair" coin.
These commands become increasingly effective as we increase the number of trials, in this case
the number of "tosses" of the fair coin.
> n=10
> p=1/2
> x=0:10
> p=dbinom(x,size=n,prob=p)
> plot(x,p,type="h",xlim=c(-1,11),ylim=c(0,0.5),lwd=2,col="blue",ylab="p")
> points(x,p,pch=16,cex=2,col="dark red")
Figure 5. The discrete binomial distribution describing the number of heads in ten tosses of a "fair" coin.
Using pbinom
Now, suppose that we toss a "fair" coin ten times (the ten-toss distribution pictured above) and
we ask the following question: "What is the probability of obtaining 4 or fewer heads?" One approach is
to realize that this probability is the sum of the individual probabilities. That is, to obtain the
probability of obtaining 4 or fewer heads, we would add the probabilities of obtaining 0, 1, 2, 3,
or 4 heads.
> sum(dbinom(0:4,size=10,prob=1/2))
[1] 0.3769531
Alternatively, this sounds suspiciously similar to the questions we asked in the activities The
Standard Normal Distribution and The Normal Distribution. In those activities we used the
pnorm command. In similar fashion, we can use the pbinom command in the current situation.
> pbinom(4,size=10,prob=1/2)
[1] 0.3769531
Note again that it's always the probability to the left of and including the number of successes.
Let's ask another question. Suppose that we want to know the probability of obtaining between 5
and 8 heads, inclusive. We could sum the probabilities of obtaining 5, 6, 7, or 8 heads.
> sum(dbinom(5:8,size=10,prob=1/2))
[1] 0.6123047
Warning: The following computation is an incorrect application of the pbinom command for this
example.
> pbinom(8,size=10,prob=1/2)-pbinom(5,size=10,prob=1/2)
[1] 0.3662109
The error in this calculation is due to our failure to note that the distribution is discrete. In
subtracting the probability of getting 5 or fewer heads from the probability of getting 8 or fewer
heads, we're left with the probability of getting 6, 7, or 8 heads, which is not the goal. Here is
the correct calculation:
> pbinom(8,size=10,prob=1/2)-pbinom(4,size=10,prob=1/2)
[1] 0.6123047
Using the binomial distribution in Figure 4, let's look at one final question and compute the
probability that X is greater than 7. One obvious way to go is to sum the
probabilities that X is 8, 9, or 10.
> sum(dbinom(8:10,size=10,prob=1/2))
[1] 0.0546875
A second approach uses the pbinom command, which must always sum the probabilities to the
left of or equal to a given number. In this case, the probability that X is greater than or equal to 8
(8, 9, or 10) is equal to 1 minus the probability that X is less than or equal to 7 (0, 1, 2, 3, 4, 5, 6,
or 7).
> 1-pbinom(7,size=10,prob=1/2)
[1] 0.0546875
Enjoy!
We hope you enjoyed this introduction to the discrete distributions in general, binomial
distributions in particular, using the R system. We encourage you to explore further.
Continuous Distributions in R
A continuous function in mathematics is one whose graph can be drawn in one continuous
motion without ever lifting pen from paper. We've already seen examples of continuous
probability density functions. For example, the probability density function
from The Standard Normal Distribution was an example of a continuous function, having the
continuous graph shown in Figure 1.
In the activities The Standard Normal Distribution and The Normal Distribution, we saw that
dnorm, pnorm, and qnorm provided values of the density function, cumulative probabilities,
and quantiles, respectively. In this activity, we will explore several continuous probability
density functions and we will see that each has variants of the d, p, and q commands.
The Uniform Distribution
The Uniform Distribution is defined on an interval [a, b]. The idea is that any number selected
from the interval [a, b] has an equal chance of being selected. Therefore, the probability density
function must be a constant function. Because the total area under the probability density curve
must equal 1 over the interval [a, b], it must be the case that the probability density function is
defined as follows:
Figure 2. The probability density function for the uniform distribution on [a, b].
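The definition in Figure 2 is f(x) = 1/(b − a) for a ≤ x ≤ b, and f(x) = 0 elsewhere. As a quick check, here is that definition written as a small R function (dunif_manual is our own name, not built into R), compared against the value of f(x) = 1/4 on [1,5]:

```r
# Hand-rolled uniform density: 1/(b - a) inside [a, b], zero outside.
dunif_manual <- function(x, a, b) ifelse(x >= a & x <= b, 1/(b - a), 0)
dunif_manual(3, a = 1, b = 5)    # 0.25, i.e., f(x) = 1/4 on [1, 5]
dunif_manual(6, a = 1, b = 5)    # 0, because 6 lies outside [1, 5]
```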
For example, the uniform probability density function on the interval [1,5] would be defined by
f(x) = 1/(5-1), or equivalently, f(x) = 1/4. The following code sketches this uniform probability
density function over the interval [1,5]. The resulting plot is shown in Figure 3.
> x=seq(1,5,length=200)
> y=rep(1/4,200)
> plot(x,y,type="l",xlim=c(0,6),ylim=c(0,0.5),lwd=2,col="red",ylab="p")
> polygon(c(1,x,5),c(0,y,0),col="lightgray",border=NA)
> lines(x,y,type="l",lwd=2,col="red")
1. The rep command takes the value 1/4 and duplicates it 200 times. This stores a vector in
the variable y having length 200, each entry of which is 1/4. This is necessary because the
plot(x,y) command requires that vectors x and y have equal length.
2. We've seen the polygon command before, but setting the argument border=NA (NA is R's
"not available" value) prevents the borders of the shaded region from being drawn.
3. The last lines command redraws the horizontal uniform density function so that it appears
atop the shaded polygon.
Key Idea: Note that the area of the shaded region is 1, as it should be for all probability density
functions.
Figure 3. The shaded area under the uniform density function is 1.
Alternatively, we could have used dunif, which will produce density values for the uniform
distribution. The following code will also produce the image shown in Figure 3.
> x=seq(1,5,length=200)
> y=dunif(x,min=1,max=5)
> plot(x,y,type="l",xlim=c(0,6),ylim=c(0,0.5),lwd=2,col="red",ylab="p")
> polygon(c(1,x,5),c(0,y,0),col="lightgray",border=NA)
> lines(x,y,type="l",lwd=2,col="red")
Note that the arguments min=1 and max=5 provide the endpoints of the interval [1,5] on which
the uniform probability density function is defined.
Using punif
Suppose that we would like to find the probability that the random variable X is less than or
equal to 2. To calculate this probability, we would shade the region under the density function to
the left of and including 2, then calculate its area.
> x=seq(1,5,length=200)
> y=dunif(x,min=1,max=5)
> plot(x,y,type="l",xlim=c(0,6),ylim=c(0,0.5),lwd=2,col="red",ylab="p")
> x=seq(1,2,length=100)
> y=dunif(x,min=1,max=5)
> polygon(c(1,x,2),c(0,y,0),col="lightgray",border=NA)
> x=seq(1,5,length=200)
> y=dunif(x,min=1,max=5)
> lines(x,y,type="l",lwd=2,col="red")
Figure 4. The shaded area under the uniform probability density function represents the probability that x ≤ 2.
In Figure 4, note that the width of the shaded area is 1, the height is 1/4, so the area of the shaded
region is 1/4. Thus, the probability that x ≤ 2 is 1/4.
> punif(2,min=1,max=5)
[1] 0.25
Note that the punif command works in precisely the same manner as did the pnorm command in
the activity The Normal Distribution. It calculates the area to the left of a given number under the
uniform probability density function.
Let's look at another example. Suppose that we wanted to find the probability that x lies between
2 and 4. We could draw the uniform distribution, then shade the area under the curve between 2
and 4.
x=seq(1,5,length=200)
y=dunif(x,min=1,max=5)
plot(x,y,type="l",xlim=c(0,6),ylim=c(0,0.5),lwd=2,col="red",ylab="p")
x=seq(2,4,length=100)
y=dunif(x,min=1,max=5)
polygon(c(2,x,4),c(0,y,0),col="lightgray",border=NA)
x=seq(1,5,length=200)
y=dunif(x,min=1,max=5)
lines(x,y,type="l",lwd=2,col="red")
Figure 5. The shaded area under the uniform probability density function represents the probability that 2 < x < 4.
Note that the width of the shaded area is 2, the height is 1/4, so the area is 1/2. That is, the
probability that 2 < x < 4 is 1/2.
We can use punif to arrive at the same conclusion. To find the area between 2 and 4, we must
subtract the area to the left of 2 from the area to the left of 4.
> punif(4,min=1,max=5)-punif(2,min=1,max=5)
[1] 0.5
Finally, if we need to find the area to the right of a given number, simply subtract the area to the
left of the given number from the total area; i.e., subtract the area from the left of a given number
from 1. So, the following calculation will find the probability that x > 4.
> 1-punif(4,min=1,max=5)
[1] 0.25
Using qunif
The command qunif will find quantiles for the uniform distribution in the same way that qnorm
found quantiles in the activity The Normal Distribution. Thus, to find the 25th
percentile for the uniform distribution on the interval [1,5], we execute the following code.
> qunif(0.25,min=1,max=5)
[1] 2
Note that this is in total agreement with the result shown in Figure 4, where we saw that the area
of the shaded rectangle in Figure 4 was 0.25. Hence, the number 2 delimits the point at which
25% of the total area under the density function is attained.
The exponential probability density function is defined on the interval [0, ∞). Its definition,
pictured in Figure 6, is f(x) = λe^(−λx) for x ≥ 0.
The exponential distribution is known to have mean μ = 1/λ and standard deviation σ = 1/λ.
Suppose that we set λ = 1. Then the mean of the distribution should be μ = 1 and the standard
deviation should be σ = 1 as well. We could use the formula in Figure 6 to produce values of the
probability density function, but as we've seen before, it will be more efficient to use the dexp
command.
x=seq(0,4,length=200)
y=dexp(x,rate=1)
plot(x,y,type="l",lwd=2,col="red",ylab="p")
Note that the argument rate expects us to respond with the value of λ. The code above produces
the exponential distribution shown in Figure 7.
Figure 7. The exponential probability density function is shown on the interval [0,4].
There are a number of applications where the exponential function is a reasonable model. For
example, waiting times in queues are commonly modeled with an exponential distribution.
The exponential probability density function is shown on the interval [0,4] in Figure 7. However,
remember that the full domain is [0, ∞), so we've shown only part of the full picture. In
choosing the domain on which to draw the function, we extended the interval three standard
deviations to the right of the mean.
Unlike the normal and uniform distributions, the exponential distribution is not symmetric about
its mean. However, in Figure 7 there is reasonable evidence that the distribution will "balance"
about the mean at μ = 1.
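One way to check this "balance" claim, not done in the activity itself, is simulation with R's rexp command:

```r
# Simulate a large sample from the exponential distribution with rate 1
# and check that it balances near the theoretical mean mu = 1/lambda = 1.
set.seed(1)
draws <- rexp(100000, rate = 1)
mean(draws)    # close to 1
sd(draws)      # close to 1 as well, since sigma = 1/lambda
```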
Using pexp
Suppose that we want to find the probability that x ≤ 1. We would shade the area under the
exponential probability density function to the left of 1, as shown in Figure 8.
> x=seq(0,4,length=200)
> y=dexp(x,rate=1)
> plot(x,y,type="l",lwd=2,col="red",ylab="p")
> x=seq(0,1,length=200)
> y=dexp(x,rate=1)
> polygon(c(0,x,1),c(0,y,0),col="lightgray")
The code above produces the shaded region under the exponential distribution shown in Figure 8.
Figure 8. Shading the region under the density curve to the left of x = 1
Now, just as we did with the uniform distribution above, we will use the pexp command to
compute the area of the shaded region in Figure 8.
> pexp(1,rate=1)
[1] 0.6321206
You may find it surprising that the answer was not 50%! However, the mean at x = 1 is not the
median! The graph of the exponential is "skewed to the right" and the extreme outliers at the
right strongly influence the mean, pushing it to the right of the median.
As a second example, suppose that we want to find the probability that x lies between 1 and 2.
> x=seq(0,4,length=200)
> y=dexp(x,rate=1)
> plot(x,y,type="l",lwd=2,col="red",ylab="p")
> x=seq(1,2,length=200)
> y=dexp(x,rate=1)
> polygon(c(1,x,2),c(0,y,0),col="lightgray")
The above code produces the shaded region under the exponential density curve in Figure 9.
Figure 9. Calculating the probability that x lies between 1 and 2.
Again, to find the shaded area in Figure 9, we must subtract the area to the left of x = 1 from the
area to the left of x = 2.
> pexp(2,rate=1)-pexp(1,rate=1)
[1] 0.2325442
Thus, the probability of selecting a number between 1 and 2 from this distribution is
approximately 0.2325442.
Finally, if we need to find the area to the right of a given number, simply subtract the area to the
left of the given number from the total area. For example, to find the probability that x > 3,
subtract the probability that x ≤ 3 from 1.
> 1-pexp(3,rate=1)
[1] 0.04978707
Using qexp
The command qexp will find quantiles for the exponential distribution in the same way that
qunif found quantiles for the uniform distribution. Thus, to find the 50th percentile of the
exponential distribution with rate λ = 1, we execute the following code.
> qexp(0.50,rate=1)
[1] 0.6931472
This result is in keeping with the fact that the distribution is skewed badly to the right. The
outliers at the right end greatly influence the mean, pushing it to the right. With the mean at x =
1, the current result for the median makes good sense. Fifty percent of the data lies to the left of
x = 0.6931472 and fifty percent of the data lies to the right of x = 0.6931472.
Regarding d, p, and q
As we've seen in this and previous activities, the letters d, p, and q have special meanings:
"d" is for "density." It is used to find values of the probability density function.
"p" is for "probability." It is used to find the probability that the random variable lies to
the left of a given number.
"q" is for "quantile." It is used to find the quantiles of a given distribution.
In connection with the normal distribution, dnorm calculates values of the normal probability
density function. Similarly, dbinom, dunif, and dexp calculate values of the binomial, uniform,
and exponential probability density functions, respectively.
In connection with the normal distribution, pnorm calculates area under the normal probability
density function to the left of a given number. Similarly, pbinom, punif, and pexp calculate area
under the binomial, uniform, and exponential probability density functions to the left of a given
number, respectively.
In connection with the normal distribution, qnorm calculates quantiles for the normal probability
density function. Similarly, qbinom, qunif, and qexp calculate quantiles for the binomial,
uniform, and exponential probability density functions, respectively.
This common use of d, p, and q is by design. There are a host of statistical distributions that
we've yet to introduce. For example, there is a distribution called the Beta Distribution. The
commands dbeta, pbeta, and qbeta would be used to calculate values of the beta probability
density function, to calculate the area to the left of a given number underneath the beta
probability density function, and to calculate quantiles for the beta distribution.
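As a sketch of this naming pattern, here are the three Beta commands in action (the shape parameters 2 and 5 are chosen arbitrarily for illustration):

```r
# The d/p/q pattern applied to the Beta distribution.
dbeta(0.3, shape1 = 2, shape2 = 5)    # density at x = 0.3
pbeta(0.3, shape1 = 2, shape2 = 5)    # area to the left of x = 0.3
qbeta(0.5, shape1 = 2, shape2 = 5)    # the median of this distribution
# As with pnorm and qnorm, p and q undo one another:
qbeta(pbeta(0.3, 2, 5), 2, 5)         # recovers 0.3
```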
Enjoy!
We hope you enjoyed this introduction to the uniform and exponential distributions in R. We
encourage you to explore further. Try typing ?dunif or ?dexp and reading the resulting help
files.
The Standard Normal Distribution in R
One of the most fundamental distributions in all of statistics is the Normal Distribution or the
Gaussian Distribution. According to Wikipedia, "Carl Friedrich Gauss became associated with
this set of distributions when he analyzed astronomical data using them, and defined the equation
of its probability density function. It is often called the bell curve because the graph of its
probability density resembles a bell."
The probability density function for the normal distribution having mean μ and standard
deviation σ is given by the function in Figure 1.
If we let the mean μ = 0 and the standard deviation σ = 1 in the probability density function in
Figure 1, we get the probability density function for the standard normal distribution in Figure 2.
Figure 2. The probability density function for the standard normal distribution has mean μ = 0 and standard
deviation σ = 1.
It is a simple matter to produce a plot of the probability density function for the standard normal
distribution.
> x=seq(-4,4,length=200)
> y=1/sqrt(2*pi)*exp(-x^2/2)
> plot(x,y,type="l",lwd=2,col="red")
If you'd like a more detailed introduction to plotting in R, we refer you to the activity Simple
Plotting in R. However, these commands are easily explained.
1. The command x=seq(-4,4,length=200) produces 200 equally spaced values between -4 and 4
and stores the result in a vector assigned to the variable x.
2. The command y=1/sqrt(2*pi)*exp(-x^2/2) evaluates the probability density function of Figure 2
at each entry of the vector x and stores the result in a vector assigned to the variable y.
3. The command plot(x,y,type="l",lwd=2,col="red") plots y versus x, using:
o a solid line type (type="l") --- that's an "el", not an I (eye) or a 1 (one),
o a line width of 2 points (lwd=2), and
o the color red (col="red").
An Alternate Approach
The command dnorm can be used to produce the same result as the probability density function
of Figure 2. Indeed, the "d" in dnorm stands for "density." Thus, the command dnorm is
designed to provide values of the probability density function for the normal distribution.
> x=seq(-4,4,length=200)
> y=dnorm(x,mean=0,sd=1)
> plot(x,y,type="l",lwd=2,col="red")
These commands produce the plot shown in Figure 4. Note that the result is identical to the plot
in Figure 3.
Figure 4. The bell-shaped curve of the standard normal distribution.
Like all probability density functions, the standard normal curves in Figures 3 and 4 possess
three very important properties:
1. The graph of the probability density function lies entirely above the x-axis. That is, f(x) ≥ 0 for all
x.
2. The area under the curve (and above the x-axis) on its full domain is equal to 1.
3. The probability of selecting a number between x = a and x = b is equal to the area under the
curve from x = a to x = b.
As an example, consider the area under the standard normal curve shown in Figure 5.
Figure 5. The area under the standard normal curve to the left of x = 0.
If the total area under the curve equals 1, then by symmetry one would expect that the area under
the curve to the left of x = 0 would equal 0.5. R has a command called pnorm (the "p" is for
"probability") which is designed to capture this probability (area under the curve).
Note that the syntax is strikingly similar to the syntax for the density function. The command
pnorm(x, mean = , sd = ) will find the area under the normal curve to the left of the number x.
Note that we use mean=0 and sd=1, the mean and standard deviation of the standard normal distribution.
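By symmetry, the area in Figure 5 should be exactly one half, and pnorm confirms this:

```r
# Area under the standard normal curve to the left of x = 0.
pnorm(0, mean = 0, sd = 1)    # [1] 0.5
```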
As a second example, suppose that we wish to find the area under the standard normal curve
( mean = 0, standard deviation = 1) that is to the left of the value of x that is one standard
deviation to the right of the mean, pictured for the reader in Figure 6.
Figure 6. The area under the curve to the left of x = 1.
In this case, we wish to find the area under the curve to the left of x = 1.
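The pnorm command computes this area:

```r
# Area under the standard normal curve to the left of x = 1.
pnorm(1, mean = 0, sd = 1)    # [1] 0.8413447
```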
The interpretation of area as a probability is all-important. This result indicates that if we draw a
number at random from the standard normal distribution, the probability that we draw a number
that is less than or equal to 1 is 0.8413447.
If readers are interested, to produce the image in Figure 6, we used the following code.
> x=seq(-4,4,length=200)
> y=dnorm(x)
> plot(x,y,type="l", lwd=2, col="blue")
> x=seq(-4,1,length=200)
> y=dnorm(x)
> polygon(c(-4,x,1),c(0,y,0),col="gray")
For help on the polygon command enter ?polygon and read the resulting help file. However, the
basic idea is pretty simple. In the syntax polygon(x,y), the argument x contains the x-coordinates
of the vertices of the polygon you wish to draw. Similarly, the argument y contains the y-
coordinates of the vertices of the desired polygon.
68%-95%-99.7% Rule
The 68% - 95% - 99.7% rule is a rule of thumb that allows practitioners of statistics to estimate the
probability that a randomly selected number from the standard normal distribution occurs within
1, 2, and 3 standard deviations of the mean at zero.
Let's first examine the probability that a randomly selected number from the standard normal
distribution occurs within one standard deviation of the mean. This probability is represented by
the area under the standard normal curve between x = -1 and x = 1, pictured in Figure 7.
> x=seq(-4,4,length=200)
> y=dnorm(x)
> plot(x,y,type="l", lwd=2, col="blue")
> x=seq(-1,1,length=100)
> y=dnorm(x)
> polygon(c(-1,x,1),c(0,y,0),col="gray")
The code above was used to produce the shaded region shown in Figure 7.
Figure 7. The shaded area represents the probability of drawing a number from the standard normal distribution
that falls within one standard deviation of the mean.
Remember that R's pnorm command finds the area to the left of a given value of x. Thus, to find
the area between x = -1 and x = 1, we must subtract the area to the left of x = -1 from the area to
the left of x = 1.
> pnorm(1,mean=0,sd=1)-pnorm(-1,mean=0,sd=1)
[1] 0.6826895
In similar fashion, we can get the area within two and three standard deviations.
> x=seq(-4,4,length=200)
> y=dnorm(x)
> plot(x,y,type="l", lwd=2, col="blue")
> x=seq(-2,2,length=200)
> y=dnorm(x)
> polygon(c(-2,x,2),c(0,y,0),col="gray")
Figure 8. The shaded area represents the probability of drawing a number from the standard normal distribution
that falls within two standard deviations of the mean.
To find the area between x = -2 and x = 2, we must subtract the area to the left of x = -2 from the
area to the left of x = 2.
> pnorm(2,mean=0,sd=1)-pnorm(-2,mean=0,sd=1)
[1] 0.9544997
Thus, there is roughly a 95% chance that the number drawn falls within two standard deviations of the mean.
> x=seq(-4,4,length=200)
> y=dnorm(x)
> plot(x,y,type="l",lwd=2,col="blue")
> x=seq(-3,3,length=200)
> y=dnorm(x)
> polygon(c(-3,x,3),c(0,y,0),col="gray")
Figure 9. The shaded area represents the probability of drawing a number from the standard normal distribution
that falls within three standard deviations of the mean.
To find the area between x = -3 and x = 3, we must subtract the area to the left of x = -3 from the
area to the left of x = 3.
> pnorm(3,mean=0,sd=1)-pnorm(-3,mean=0,sd=1)
[1] 0.9973002
Therefore, the chance that a number drawn randomly from the standard normal distribution falls
within three standard deviations of the mean is 99.7%!
Important Result: We conclude that virtually all numbers from the standard normal distribution
occur within three standard deviations of the mean.
Quantiles
Sometimes the opposite question is asked. That is, suppose that the area under the curve to the
left of some unknown number is known. What is the unknown number?
For example, suppose that the area under the curve to the left of some unknown x-value is 0.95,
as shown in Figure 10.
Figure 10. The area under the curve to the left of some unknown x-value is 0.95.
To find the unknown value of x we use R's qnorm command (the "q" is for "quantile").
> qnorm(0.95,mean=0,sd=1)
[1] 1.644854
Hence, there is a 95% probability that a number chosen at random from the standard normal
distribution is less than or equal to 1.644854.
In a sense, R's pnorm and qnorm commands play the roles of inverse functions. On one hand,
the command pnorm is fed a number and asked to find the probability that a random selection
from the standard normal distribution falls to the left of this number. On the other hand, the
command qnorm is given the probability and asked to find a limiting number so that the area
under the curve to the left of that number equals the given probability.
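This inverse relationship is easy to demonstrate (the particular inputs 1.5 and 0.95 are arbitrary illustrations):

```r
# pnorm and qnorm undo one another on the standard normal distribution.
qnorm(pnorm(1.5))     # recovers 1.5
pnorm(qnorm(0.95))    # recovers 0.95
```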
We must emphasize that the area under the curve to the left is used when applying the commands
pnorm and qnorm. If you are given an area to the right, then you must make a simple
adjustment before applying the qnorm command. Suppose, as shown in Figure 11, that the area
to the right of an unknown number is 0.80.
Figure 11. The area under the curve to the right of some unknown x-value is 0.80.
In this case, we must subtract the area to the right from the number 1 to obtain 1 - 0.80 = 0.20,
which is the area to the left of the unknown value of x shown in Figure 12.
Figure 12. The area under the curve to the left of some unknown x-value is 1 - 0.80 = 0.20.
We can now use the qnorm command to find the unknown value of x.
> qnorm(0.20,mean=0,sd=1)
[1] -0.8416212
We now know that the probability of selecting a number from the standard normal distribution
that is greater than or equal to -0.8416212 is 0.80.
Enjoy!
We hope you enjoyed this introduction to the standard normal distribution R system. We
encourage you to explore further.
The Normal Distribution in R
One of the most fundamental distributions in all of statistics is the Normal Distribution or the
Gaussian Distribution. According to Wikipedia, "Carl Friedrich Gauss became associated with
this set of distributions when he analyzed astronomical data using them, and defined the equation
of its probability density function. It is often called the bell curve because the graph of its
probability density resembles a bell."
The probability density function for the normal distribution having mean μ and standard
deviation σ is given by the function in Figure 1.
If we let the mean μ = 0 and the standard deviation σ = 1 in the probability density function in
Figure 1, we get the probability density function for the standard normal distribution in Figure 2.
Figure 2. The probability density function for the standard normal distribution has mean μ = 0 and standard
deviation σ = 1.
In the activity The Standard Normal Distribution, we examined the normal distribution having
mean and standard deviation 0 and 1, respectively. You might want to work your way through
that activity before continuing.
It is a simple matter to produce a plot of the probability density function for the standard normal
distribution.
> x=seq(-4,4,length=200)
> y=1/sqrt(2*pi)*exp(-x^2/2)
> plot(x,y,type="l",lwd=2,col="red")
If you'd like a more detailed introduction to plotting in R, we refer you to the activity Simple
Plotting in R. However, these commands are easily explained.
1. The command x=seq(-4,4,length=200) produces 200 equally spaced values between -4 and 4
and stores the result in a vector assigned to the variable x.
2. The command y=1/sqrt(2*pi)*exp(-x^2/2) evaluates the probability density function of Figure 2
at each entry of the vector x and stores the result in a vector assigned to the variable y.
3. The command plot(x,y,type="l",lwd=2,col="red") plots y versus x, using:
o a solid line type (type="l") --- that's an "el", not an I (eye) or a 1 (one),
o a line width of 2 points (lwd=2), and
o the color red (col="red").
The command dnorm can be used to produce the same result as the probability density function
of Figure 2. Indeed, the "d" in dnorm stands for "density." Thus, the command dnorm is
designed to provide values of the probability density function for the normal distribution.
> x=seq(-4,4,length=200)
> y=dnorm(x,mean=0,sd=1)
> plot(x,y,type="l",lwd=2,col="red")
These commands produce the plot shown in Figure 4. Note that the result is identical to the plot
in Figure 3.
The standard deviation represents the "spread" in the distribution. With "spread" as the
interpretation, we would expect a normal distribution with a standard deviation of 2 to be "more
spread out" than a normal distribution with a standard deviation of 1. Let's illustrate this idea in
R.
> x=seq(-8,8,length=500)
> y1=dnorm(x,mean=0,sd=1)
> plot(x,y1,type="l",lwd=2,col="red")
> y2=dnorm(x,mean=0,sd=2)
> lines(x,y2,type="l",lwd=2,col="blue")
Figure 5. A normal distribution with standard deviation 2 is "wider" than the standard normal distribution having
standard deviation 1.
Key Idea: The key idea of importance is to note that the normal curve having standard deviation
equal to 2 is "twice as spread out" as the "standard" normal curve having standard deviation 1.
In similar fashion, a normal curve with standard deviation 1/2 would be "half as wide" as the
standard normal curve.
> x=seq(-8,8,length=500)
> y3=dnorm(x,mean=0,sd=1/2)
> plot(x,y3,type="l",lwd=2,col="green")
> y2=dnorm(x,mean=0,sd=2)
> lines(x,y2,type="l",lwd=2,col="blue")
> y1=dnorm(x,mean=0,sd=1)
> lines(x,y1,type="l",lwd=2,col="red")
> legend("topright",c("sigma=1/2","sigma=2","sigma=1"),
+ lty=c(1,1,1),col=c("green","blue","red"))
The above sequence of commands produces the image shown in Figure 6. Note: Remember that
the "plus" symbol is R's line continuation character. R will provide this character when you hit
the Enter key to end the previous line.
Key Idea: The key idea of importance is to note that the normal curve having standard deviation
equal to 1/2 is "half as spread out" as the "standard" normal curve having standard deviation 1.
The Mean
In Figures 3, 4, and 5, note that each of the normal curves is "centered" about a mean equal to
zero. If the mean were different, we would expect the normal curve to be "centered" about new
mean. In the following example, we use coding to create a normal curve with mean equal to 10
and standard deviation equal to 2.
> x=seq(4,16,length=200)
> y=dnorm(x,mean=10,sd=2)
> plot(x,y,type="l",lwd=2,col="red")
Figure 7. A normal distribution with mean equal to 10 and standard deviation equal to 2.
The choice of domain bears some explanation. In the activity The Standard Normal Distribution,
we learned that essentially all of the data occurs within three standard deviations of the mean.
Because the standard deviation is 2, three standard deviations to the left of the mean takes us to
4, while three standard deviations to the right of the mean takes us to 16.
Key Idea: The key idea of importance is to note that this particular normal curve is "centered" or
"balanced" about its mean 10.
As a second example, the following code draws the normal curve with mean equal to 100 with
standard deviation equal to 10.
> x=seq(70,130,length=200)
> y=dnorm(x,mean=100,sd=10)
> plot(x,y,type="l",lwd=2,col="red")
Figure 8. A normal distribution with mean equal to 100 and standard deviation equal to 10.
Again, the mean is 100 and the standard deviation is 10, so three standard deviations each side of
the mean sees us sketching this normal curve on the interval (70,130).
Key Idea: As in the previous example, the most important thing to note is that the distribution
shown in Figure 8 is "centered" or "balanced" about its mean 100.
The Area Under the Probability Density Function
As in the activity The Standard Normal Distribution, the command pnorm will compute the area
to the left of a given value of x.
> x=seq(70,130,length=200)
> y=dnorm(x,mean=100,sd=10)
> plot(x,y,type="l",lwd=2,col="red")
> x=seq(70,90,length=100)
> y=dnorm(x,mean=100,sd=10)
> polygon(c(70,x,90),c(0,y,0),col="gray")
Figure 9. The area to the left of 90 represents the probability of selecting a number less than 90 from a normal
distribution with mean 100 and standard deviation 10.
> pnorm(90,mean=100,sd=10)
[1] 0.1586553
Hence, the probability of selecting a random number less than 90 from a normal distribution
having mean 100 and standard deviation 10 is 0.1586553.
As a second example, the following code shades the area between 90 and 100 shown in Figure
10.
> x=seq(70,130,length=200)
> y=dnorm(x,mean=100,sd=10)
> plot(x,y,type="l",lwd=2,col="red")
> x=seq(90,100,length=200)
> y=dnorm(x,mean=100,sd=10)
> polygon(c(90,x,100),c(0,y,0),col="gray")
Figure 10. The area to the right of 90 and to the left of 100 represents the probability of selecting a number
between 90 and 100 from a normal distribution with mean 100 and standard deviation 10.
The following command will calculate the shaded area in Figure 10.
> pnorm(100,mean=100,sd=10)-pnorm(90,mean=100,sd=10)
[1] 0.3413447
68%-95%-99.7% Rule
In the activity The Standard Normal Distribution, we introduced the 68% - 95% - 99.7% rule in
conjunction with the standard normal distribution. The 68% - 95% - 99.7% rule works just as
well as a rule of thumb even when the mean and standard deviation change. For example, the following
code shades the region within one standard deviation of the mean in Figure 11.
> x=seq(70,130,length=200)
> y=dnorm(x,mean=100,sd=10)
> plot(x,y,type="l",lwd=2,col="red")
> x=seq(90,110,length=200)
> y=dnorm(x,mean=100,sd=10)
> polygon(c(90,x,110),c(0,y,0),col="gray")
Figure 11. The shaded area represents the probability of drawing a number from the normal distribution (mean =
100, standard deviation = 10) that falls within one standard deviation of the mean.
Remember that R's pnorm command finds the area to the left of a given value of x. Thus, to find
the area between x = 90 and x = 110, we must subtract the area to the left of x = 90 from the area
to the left of x = 110.
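Although the transcript above does not show that calculation, it is a one-line sketch with pnorm:

```r
# Area within one standard deviation of the mean (between 90 and 110)
# for a normal distribution with mean 100 and standard deviation 10.
pnorm(110, mean = 100, sd = 10) - pnorm(90, mean = 100, sd = 10)
# approximately 0.6827, in line with the 68% rule of thumb
```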
In similar fashion, we can get the area within two and three standard deviations.
> x=seq(70,130,length=200)
> y=dnorm(x,mean=100,sd=10)
> plot(x,y,type="l",lwd=2,col="red")
> x=seq(80,120,length=200)
> y=dnorm(x,mean=100,sd=10)
> polygon(c(80,x,120),c(0,y,0),col="gray")
Figure 12. The shaded area represents the probability of drawing a number from the normal distribution (mean =
100, standard deviation = 10) that falls within two standard deviations of the mean.
To find the area between x = 80 and x = 120, we must subtract the area to the left of x = 80 from
the area to the left of x = 120.
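As a sketch of that subtraction:

```r
# Area within two standard deviations of the mean (between 80 and 120).
pnorm(120, mean = 100, sd = 10) - pnorm(80, mean = 100, sd = 10)
# approximately 0.9545
```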
Note again that there is a 95% chance that the number drawn falls within two standard deviations
of the mean.
> x=seq(70,130,length=200)
> y=dnorm(x,mean=100,sd=10)
> plot(x,y,type="l",lwd=2,col="red")
> polygon(c(70,x,130),c(0,y,0),col="gray")
Figure 13. The shaded area represents the probability of drawing a number from the normal distribution (mean =
100, standard deviation = 10) that falls within three standard deviations of the mean.
To find the area between x = 70 and x = 130, we must subtract the area to the left of x = 70 from
the area to the left of x = 130.
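Sketched with pnorm:

```r
# Area within three standard deviations of the mean (between 70 and 130).
pnorm(130, mean = 100, sd = 10) - pnorm(70, mean = 100, sd = 10)
# approximately 0.9973
```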
Therefore, the chance that a number drawn randomly from the normal distribution falls within
three standard deviations of the mean is 99.7%!
Important Result: We conclude that virtually all numbers from any normal distribution occur
within three standard deviations of the mean.
Quantiles
In the activity The Standard Normal Distribution, the command qnorm was introduced to
calculate quantiles. This command is fed the area under the curve to the left of some unknown
number, then calculates the unknown number.
For example, suppose that the area under the curve to the left of some unknown x-value is 0.95,
as shown in Figure 14.
Figure 14. The area under the curve to the left of some unknown x-value is 0.95.
To find the unknown value of x we use R's qnorm command (the "q" is for "quantile").
> qnorm(0.95,mean=100,sd=10)
[1] 116.4485
Hence, there is a 95% probability that a number chosen at random from the normal distribution
having mean 100 and standard deviation 10 is less than or equal to 116.4485.
In a sense, R's pnorm and qnorm commands play the roles of inverse functions. On one hand,
the command pnorm is fed a number and asked to find the probability that a random selection
from the given normal distribution falls to the left of this number. On the other hand, the
command qnorm is given the probability and asked to find a limiting number so that the area
under the curve to the left of that number equals the given probability.
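A quick sketch makes this inverse relationship concrete: feeding the output of pnorm back into qnorm recovers the original number.

```r
# pnorm and qnorm undo one another.
p <- pnorm(90, mean = 100, sd = 10)  # area to the left of 90
qnorm(p, mean = 100, sd = 10)        # recovers 90
```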
We must emphasize that the area under the curve to the left is used when applying the commands
pnorm and qnorm. If you are given an area to the right, then you must make a simple
adjustment before applying the qnorm command. Suppose, as shown in Figure 15, that the area
to the right of an unknown number is 0.80.
Figure 15. The area under the curve to the right of some unknown x-value is 0.80.
In this case, we must subtract the area to the right from the number 1 to obtain 1 - 0.80 = 0.20,
which is the area to the left of the unknown value of x shown in Figure 16.
Figure 16. The area under the curve to the left of some unknown x-value is 1 - 0.80 = 0.20.
We can now use the qnorm command to find the unknown value of x.
> qnorm(0.20,mean=100,sd=10)
[1] 91.58379
We now know that the probability of selecting a number greater than or equal to 91.58379 from
the normal distribution having mean 100 and standard deviation 10 is 0.80.
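As an aside, pnorm and qnorm also accept a lower.tail argument; setting lower.tail=FALSE lets R work with the area to the right directly, so the subtraction from 1 can be skipped. A minimal sketch:

```r
# Two equivalent ways to find the x-value with area 0.80 to its right.
qnorm(0.20, mean = 100, sd = 10)                      # 1 - 0.80 to the left
qnorm(0.80, mean = 100, sd = 10, lower.tail = FALSE)  # 0.80 to the right
# both give 91.58379
```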
Enjoy!
We hope you enjoyed this introduction to the normal distribution in the R system. We encourage
you to explore further.
Linear Regression in R
In this activity we will explore the relationship between a pair of variables. We will first learn to
create a scatter plot for the given data, then we will learn how to craft a "Line of Best Fit" for our
plot.
Scatterplots
The data set in the table that follows is taken from The Data and Story Library. Researchers
measured the heights of 161 children in Kalama, a village in Egypt. The heights were averaged
and recorded each month, with the study lasting several years. The data is presented in the table
that follows.
Age in Months   Average Height in Centimeters
18 76.1
19 77
20 78.1
21 78.2
22 78.8
23 79.7
24 79.9
25 81.1
26 81.2
27 81.8
28 82.8
29 83.5
To build our scatter plot, we must first enter the data in R. This is a fairly straightforward task.
Because the ages are incremented in months, we can use the command age=18:29, which uses
R's start:finish syntax to begin a vector at the number 18, then increment by 1 until the number
29 is reached.
> age=18:29
> age
[1] 18 19 20 21 22 23 24 25 26 27 28 29
> height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
> height
[1] 76.1 77.0 78.1 78.2 78.8 79.7 79.9 81.1 81.2 81.8 82.8 83.5
We can check that age and height have the same number of elements with R's length command.
> length(age)
[1] 12
> length(height)
[1] 12
The plot command produces the scatterplot shown in Figure 1.
> plot(age,height)
Note that the data in Figure 1 is approximately linear. As the age increases, the average height
increases at an approximately constant rate. One could not fit a single line through each and
every data point, but one could imagine a line that is fairly close to each data point, with some of
the data points appearing above the line, others below for balance. In this next activity, we will
calculate and plot the Line of Best Fit, or the Least Squares Regression Line.
We will use R's lm command to compute a "linear model" that fits the data in Figure 1. The
command lm is a very sophisticated command with a host of options (type ?lm to view a full
description), but in its simplest form it is quite easy to use. The syntax height~age is called a
model equation and is a very sophisticated R construct; we are using its simplest form here.
The symbol separating "height" and "age" in the syntax height~age is a "tilde." On most US
keyboards it is located on the key immediately to the left of the 1 key, and you must hold the
Shift key to type it.
> res=lm(height~age)
Let's examine what is returned in the variable res.
> res
Call:
lm(formula = height ~ age)
Coefficients:
(Intercept) age
64.928 0.635
Note the "Coefficients" part of the contents of res. These coefficients are the intercept and slope
of the line of best fit. Essentially, we are being told that the equation of the line of best fit is:
height = 0.635*age + 64.928
Note that this result has the form y = mx + b, where m = 0.635 is the slope and b = 64.928 is the
intercept of the line.
It is a simple matter to superimpose the "Line of Best Fit" provided by the contents of the
variable res. The command abline will use the data in res to draw the "Line of Best Fit."
> abline(res)
The result of this command is the "Line of Best Fit" shown in Figure 2.
Figure 2. The abline command superimposes the "Line of Best Fit" on our previous scatterplot.
The command abline is a versatile tool. You can learn more about this command by entering ?
abline and reading the resulting help file.
Prediction
Now that we have the equation of the line of best fit, we can use the equation to make
predictions. Suppose, for example, that we wished to estimate the average height of a child at age
27.5 months. One technique would be to enter this value into the equation of the line of best fit.
> 0.635*27.5+64.928
[1] 82.3905
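Retyping rounded coefficients invites small errors, so an alternative sketch pulls the unrounded intercept and slope from the fitted model with R's coef command (the data entry repeats the table above):

```r
age <- 18:29
height <- c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
res <- lm(height ~ age)
b <- coef(res)       # unrounded intercept and slope of the line of best fit
b[1] + b[2] * 27.5   # predicted average height at 27.5 months
```

The result differs slightly from 82.3905 because the hand calculation above used rounded coefficients.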
Enjoy!
We hope you enjoyed this introduction to the principles of Linear Regression in the R system.
We encourage you to explore further. Use the commands ?plot, ?lm, and ?abline to learn more
about producing scatterplots and performing linear regression.
Data Frames in R
In this activity we will introduce R's dataframes. Technically, a dataframe in R is a type of object.
Less formally, a dataframe is a type of table, where typical use employs the rows as
observations and the columns as variables. Take, for example, the Mean Height versus Age data
from the activity Linear Regression in R.
Age in Months   Average Height in Centimeters
18 76.1
19 77
20 78.1
21 78.2
22 78.8
23 79.7
24 79.9
25 81.1
26 81.2
27 81.8
28 82.8
29 83.5
In the activity Linear Regression in R, we entered the age in a vector called age.
> age=18:29
> age
[1] 18 19 20 21 22 23 24 25 26 27 28 29
> height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
> height
[1] 76.1 77.0 78.1 78.2 78.8 79.7 79.9 81.1 81.2 81.8 82.8 83.5
We will now use R's data.frame command to create our first dataframe and store the results in
the variable village.
> village=data.frame(age=age,height=height)
> village
age height
1 18 76.1
2 19 77.0
3 20 78.1
4 21 78.2
5 22 78.8
6 23 79.7
7 24 79.9
8 25 81.1
9 26 81.2
10 27 81.8
11 28 82.8
12 29 83.5
The result is fairly clear. The contents of the vector age are stored in the first column of the
dataframe under the heading "age" and the contents of the vector height are stored in the second
column of the dataframe under the heading "height."
Let's examine the contents of our workspace with R's ls() command (which stands for "list the
contents of the workspace"). The contents of your workspace might vary, depending on whether
or not you have created variables other than those introduced in this activity.
> ls()
[1] "age" "height" "village"
Note that there are three objects in our workspace. We can examine the "class" of each object in
our workspace, which tells us what kind of object is contained in each variable.
> class(age)
[1] "integer"
> class(height)
[1] "numeric"
> class(village)
[1] "data.frame"
Thus we see that the vector age has class "integer." This should be clear, as it was created with
the command age=18:29, and each of these numbers is an integer. On the other hand, the average
heights are numbers such as 76.1. These are not integers but floating point numbers, so it
is not surprising that the class of the object height is "numeric."
It is interesting, though not surprising --- after all, village was created with the data.frame
command --- that the class of the object village is "data.frame." This is a new type of object that
we will find quite useful.
Before we begin to explore dataframes in more depth, let's clean up our workspace a bit. Let's
"remove" the age and height objects from our workspace.
> remove(age,height)
If you now try to print the contents of age and/or height, you will see that they are gone.
> age
Error: object "age" not found
> height
Error: object "height" not found
Indeed, if you "list" the contents of your workspace, you will see that the age and height objects
are gone. However, the dataframe village remains.
> ls()
[1] "village"
We will now explain how to access the variables contained in our dataframe. In this case, we
know that each column contains a variable.
> village
age height
1 18 76.1
2 19 77.0
3 20 78.1
4 21 78.2
5 22 78.8
6 23 79.7
7 24 79.9
8 25 81.1
9 26 81.2
10 27 81.8
11 28 82.8
12 29 83.5
The first column contains age and the second height. But we can quickly ascertain the names of the
columns with R's names command. This would be useful if we were using someone else's dataframe and
we were not sure of the column headings.
> names(village)
[1] "age" "height"
OK, as expected. But how do we access the data in each column? One way is to state the variable
containing the dataframe, followed by a dollar sign, then the name of the column we wish to
access. For example, if we wanted to access the data in the "age" column, we would do the
following:
> village$age
[1] 18 19 20 21 22 23 24 25 26 27 28 29
> village$height
[1] 76.1 77.0 78.1 78.2 78.8 79.7 79.9 81.1 81.2 81.8 82.8
[12] 83.5
However, the additional typing required by the "dollar sign" notation can quickly become
tiresome, so R provides the ability to "attach" the variables in the dataframe to our workspace.
> attach(village)
> ls()
[1] "village"
Note that no new variables appear in the workspace. However, R has made copies of the variables
in the columns of the dataframe, and most importantly, we can access them without the "dollar
notation."
> age
[1] 18 19 20 21 22 23 24 25 26 27 28 29
> height
[1] 76.1 77.0 78.1 78.2 78.8 79.7 79.9 81.1 81.2 81.8 82.8
[12] 83.5
It is important to understand that these are "copies" of the columns of the data frame. Suppose
that we make a change to one of the entries of age.
> age[1]=12
> age
[1] 12 19 20 21 22 23 24 25 26 27 28 29
Note, however, that the data in the dataframe village remains unchanged.
> village
age height
1 18 76.1
2 19 77.0
3 20 78.1
4 21 78.2
5 22 78.8
6 23 79.7
7 24 79.9
8 25 81.1
9 26 81.2
10 27 81.8
11 28 82.8
12 29 83.5
Because we've "attached" the dataframe village, we can execute the plot command as we did in
the Linear Regression activity.
> plot(age,height)
And we can once again use the linear model command to produce the line of best fit.
> res=lm(height~age)
> res
Call:
lm(formula = height ~ age)
Coefficients:
(Intercept) age
64.928 0.635
Again, the coefficients provide the intercept and slope of the line of best fit.
The command abline uses the data in res to draw the "Line of Best Fit."
> abline(res)
The result of this command is the "Line of Best Fit" shown in Figure 2.
Figure 2. The abline command superimposes the "Line of Best Fit" on our previous scatterplot.
Prediction
We can use the equation of the line of best fit to make predictions. Suppose, for example, that we
wished to estimate the average height of a child at age 27.5 months. One technique would be to
enter this value into the equation of the line of best fit.
> 0.635*27.5+64.928
[1] 82.3905
However, it is more efficient to use dataframes and R's predict command to predict the average
height when the age is 27.5 months.
> predict(res,data.frame(age=27.5))
[1] 82.38986
Detach
When you no longer need the data in the dataframe village, it is good practice to "detach" the
dataframe.
> detach(village)
Note that "age" and "height" are no longer available for use.
> age
Error: object "age" not found
> height
Error: object "height" not found
Many R commands are able to handle dataframes without attaching them. For example, if you
give the plot command a dataframe with two columns, by default it will plot the second column
versus the first. The following command will produce the image shown in Figure 1.
> plot(village)
We can get the linear model in this case without attaching the dataframe village, simply by
passing the dataframe village as an argument to the lm command.
> res=lm(height~age,data=village)
> res
Call:
lm(formula = height ~ age, data = village)
Coefficients:
(Intercept) age
64.928 0.635
The abline command will produce the line of best fit shown in Figure 2.
> abline(res)
The single command abline(lm(height~age,data=village)) will also produce the line of best fit
shown in Figure 2. Thus, if you are in a hurry, two commands are enough to produce the
scatterplot and line of best fit.
> plot(village)
> abline(lm(height~age,data=village))
Enjoy!
We hope you enjoyed this introduction to the use of dataframes in the R system. We encourage
you to explore their use further. We will certainly do so in upcoming activities.
Importing Data in R
In this activity we will learn how to import external data files into R. This activity builds upon
the lessons learned in Dataframes in R. If you are not familiar with dataframes in R, we
encourage you to first work through the activity Dataframes in R.
In the table that follows, we list average heights of children versus their ages. This data set was
collected over a period of time in an Egyptian village and first explored in Linear Regression in
R, then again in Dataframes in R.
Age in Months   Average Height in Centimeters
18 76.1
19 77
20 78.1
21 78.2
22 78.8
23 79.7
24 79.9
25 81.1
26 81.2
27 81.8
28 82.8
29 83.5
In the activity Dataframes in R, we created vectors age and height.
> age=18:29
> height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
We then used R's data.frame command to create a dataframe and stored the result in the variable
village.
> village=data.frame(age=age,height=height)
> village
age height
1 18 76.1
2 19 77.0
3 20 78.1
4 21 78.2
5 22 78.8
6 23 79.7
7 24 79.9
8 25 81.1
9 26 81.2
10 27 81.8
11 28 82.8
12 29 83.5
We can examine the contents of our workspace with the command ls().
> ls()
[1] "age" "height" "village"
And we can delete the age and height vectors with the remove command.
> remove(age,height)
> ls()
[1] "village"
We can "attach" the dataframe stored in village, which allows easy access to its columns.
> attach(village)
> ls()
[1] "village"
> age
[1] 18 19 20 21 22 23 24 25 26 27 28 29
> height
[1] 76.1 77.0 78.1 78.2 78.8 79.7 79.9 81.1 81.2 81.8 82.8 83.5
When we are through working with the dataframe, it is a good practice to "detach" the dataframe.
> detach(village)
> age
Error: object "age" not found
> height
Error: object "height" not found
> ls()
[1] "village"
Finally, before learning how to import data in electronic format, delete the dataframe stored in
village.
> remove(village)
> ls()
character(0)
Open your favorite text editor on your system. For example, you might use Notepad on Windows
or TextEdit on the Mac. Enter the data exactly as follows.
age height
18 76.1
19 77.0
20 78.1
21 78.2
22 78.8
23 79.7
24 79.9
25 81.1
26 81.2
27 81.8
28 82.8
29 83.5
Save the file as "village.txt". Be sure to make note of the folder or directory where you save your
file. In our case, we saved the file in:
/Users/darnold/Documents/MathDept/trunk/MathDept/html/R/village.txt
Caution: If you use Microsoft Word, you must save the file in text format. Do not use Word's
proprietary doc format.
An important question arises: how will R find the file "village.txt"? Before we provide specific
instructions, we must first address the concept of R's working directory. The current working
directory is found with the getwd() command. The response will differ on your system.
> getwd()
[1] "/Users/darnold"
The response tells us that the current working directory is /Users/darnold.
We can change the current working directory with the command setwd(). Simply enter the
desired destination as a string (delimited with quotes).
> setwd("/Users/darnold/Documents")
You can check the result of this command with the getwd() command.
> getwd()
[1] "/Users/darnold/Documents"
Alternately, we can use the menus provided by R's GUI interfaces on Windows and the Mac
operating systems. For example, on the Mac there is a Misc menu that contains submenus for
setting and getting the current working directory:
The Get Working Directory submenu will, when selected, inform the user of the current working
directory. It performs the same task as the getwd() command.
The Change Working Directory submenu will, when selected, open the operating system's
dialog for browsing the directory structure. Simply select the directory or folder in the manner
you are accustomed to on your operating system and click the Open button. You can check the
result with the getwd() command or the Get Working Directory submenu item.
It's time to import the data in the file "village.txt". You will recall that we stored the file in:
/Users/darnold/Documents/MathDept/trunk/MathDept/html/R/village.txt
age height
18 76.1
19 77.0
20 78.1
21 78.2
22 78.8
23 79.7
24 79.9
25 81.1
26 81.2
27 81.8
28 82.8
29 83.5
Note that the first line of the file contains "headers", that is, "names" for each of the columns.
The R command read.table is used to read data from external sources and place the result in a
data frame. By default, read.table reads data from a file where the data is separated by white
space (one or more spaces, tabs, newlines, or carriage returns).
> village=read.table(file="/Users/darnold/Documents/MathDept/trunk/MathDept/html/R/village.txt",header=TRUE)
> village
age height
1 18 76.1
2 19 77.0
3 20 78.1
4 21 78.2
5 22 78.8
6 23 79.7
7 24 79.9
8 25 81.1
9 26 81.2
10 27 81.8
11 28 82.8
12 29 83.5
Trying to remember the path to your data file ("village.txt") can be frustrating, so there is an
alternative approach, one that allows the user to browse their directory structure in the usual
manner.
> village=read.table(file=file.choose())
The option file.choose() will pop open a dialog that allows the user to browse through their
directory structure in an accustomed manner. Locate the file in your directory structure, then
click the Open button. The result is identical to the first use of read.table above.
> village
age height
1 18 76.1
2 19 77.0
3 20 78.1
4 21 78.2
5 22 78.8
6 23 79.7
7 24 79.9
8 25 81.1
9 26 81.2
10 27 81.8
11 28 82.8
12 29 83.5
Another approach involves changing the current working directory. Using the setwd() command
or the Change Working Directory submenu of the Misc menu, change the working directory to
where you stored the file "village.txt".
> setwd("/Users/darnold/Documents/MathDept/trunk/MathDept/html/R")
> getwd()
[1] "/Users/darnold/Documents/MathDept/trunk/MathDept/html/R"
The dir() command will reveal the files stored in this directory.
> dir()
[1] "ImportingData.php" "SampleDiscrete.php" "barplot.php"
[4] "boxplot.php" "code" "dataframe.php"
[7] "graphics" "hist.php" "index.php"
[10] "pie.php" "regression.php" "simple.php"
[13] "village.txt"
Note that "village.txt" is among the files stored in this directory. Because we've changed the
current directory to the folder containing the file "village.txt", we can greatly simplify the use of
read.table to read the file and store the result in a dataframe.
> village=read.table(file="village.txt",header=TRUE)
Note that the file argument of read.table no longer requires the long pathname we used above,
and the result is identical to the previous results.
> village
age height
1 18 76.1
2 19 77.0
3 20 78.1
4 21 78.2
5 22 78.8
6 23 79.7
7 24 79.9
8 25 81.1
9 26 81.2
10 27 81.8
11 28 82.8
12 29 83.5
Data can be stored in files in many different formats. This short tutorial does not cover all of the
possibilities, but if the file is stored on the internet in the same form as we used to store data in
the file "village.txt", then it is a simple matter to import that data into R.
For practice, we've stored the file "village.txt" at the following URL:
http://online.redwoods.edu/instruct/darnold/Math15/RData/village.txt
> village=read.table(file="http://online.redwoods.edu/instruct/darnold/Math15/RData/village.txt",header=TRUE)
> village
age height
1 18 76.1
2 19 77.0
3 20 78.1
4 21 78.2
5 22 78.8
6 23 79.7
7 24 79.9
8 25 81.1
9 26 81.2
10 27 81.8
11 28 82.8
12 29 83.5
It is comforting that the syntax for importing the data from the internet is identical to the syntax
for importing a file from the local computer system.
One can now use the dataframe to perform a variety of analyses. For example, one might want a
scatterplot of height versus age and a superimposed line of best fit.
> plot(village)
> abline(lm(height~age,data=village))
These commands were introduced in the activity Linear Regression in R and Dataframes in R.
The resulting scatterplot and line of best fit are shown in Figure 1.
Figure 1. Superimposing the "Line of Best Fit" on a scatterplot.
Enjoy!
We hope you enjoyed this introduction to importing data into the R system. We encourage you to
explore further. Use the command ?read.table to learn more about using this important
command.
The Central Limit Theorem
In this first part of the introduction to the Central Limit Theorem, we will show how to draw and
visualize a sample of random numbers from a distribution. From there we will examine the mean
and standard deviation of the sample, then examine the distribution of the sample means.
Recall the naming convention used by R's distribution commands:
"d" is for "density." It is used to find values of the probability density function.
"p" is for "probability." It is used to find the probability that the random variable lies to the left
of a given number.
"q" is for "quantile." It is used to find the quantiles of a given distribution.
There is a fourth letter, namely "r", that is used to draw random numbers from a distribution. So,
for example, runif and rexp would be used to draw random numbers from the uniform and
exponential distributions, respectively.
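A brief sketch applies each prefix to the normal distribution with mean 100 and standard deviation 10:

```r
dnorm(100, mean = 100, sd = 10)  # "d": density at x = 100
pnorm(100, mean = 100, sd = 10)  # "p": area to the left of 100 (0.5)
qnorm(0.5, mean = 100, sd = 10)  # "q": x-value with half the area to its left (100)
rnorm(3, mean = 100, sd = 10)    # "r": three random draws
```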
Let's use the rnorm command to draw 500 numbers at random from a normal distribution having
mean 100 and standard deviation 10.
> x=rnorm(500,mean=100,sd=10)
> x
[1] 110.67263 102.07696 114.41904 98.52447 95.31791 103.32522
[7] 96.85134 105.37060 98.60348 103.72672 101.82439 107.65795
[13] 96.91467 89.42021 106.06962 111.24015 90.51377 108.22921
...
...
When you examine the numbers stored in the variable x, there is a sense that you are pulling
random numbers that are clumped about a mean of 100. However, a histogram of this selection
provides a better understanding of the data stored in x.
> hist(x,prob=TRUE)
The above command produces the histogram shown in Figure 1.
Figure 1. A histogram of 500 random numbers drawn from a normal distribution with mean 100 and standard
deviation 10.
Let's try the experiment again, drawing a new set of 500 random numbers from the normal
distribution having mean 100 and standard deviation 10.
> x=rnorm(500,mean=100,sd=10)
> hist(x,prob=TRUE,ylim=c(0,0.04))
The histogram in Figure 2 is different from the histogram shown in Figure 1, owing to the
"random" selection of numbers. However, it does share some common traits with the histogram
shown in Figure 1: (1) it appears "normal" in shape, (2) it appears to be "balanced" or "centered"
about 100, and (3) essentially all of the data occurs within three standard deviations (30 units) of
the mean. This is strong
evidence that the random numbers have been drawn from a normal distribution having mean 100
and standard deviation 10. We can provide further evidence of this claim by superimposing a
normal probability density function with mean 100 and standard deviation 10.
> curve(dnorm(x,mean=100,sd=10),70,130,add=TRUE,lwd=2,col="red")
The curve command is new. Some comments on its use are in order:
1. In its simplest form, the syntax curve(f(x),from=,to=) draws the "function" defined by f(x) on the
interval (from,to). Our function is dnorm(x,mean=100,sd=10). The curve command sketches this
function of x on the interval (from,to).
2. The notation "from=" and "to=" may be omitted if the arguments are submitted in the proper
order to the curve command, function first, value of "from" second, then value of "to" third.
That is what we've done, substituting 70 for "from" and 130 for "to".
3. If the argument "add" is set to TRUE, as we have done, then the curve is "added" to the existing
figure. If this argument is omitted, or if it is set to FALSE, then a new plot is drawn, erasing any
previous graphics drawn.
We can compute the mean of this sample with R's mean command. Note that it is close to, but
not exactly, 100.
> mean(x)
[1] 99.75439
Of course, if we take another sample of 500 random numbers from the normal distribution with
mean 100 and standard deviation 10, we get a new sample that has a different mean.
> x=rnorm(500,mean=100,sd=10)
> mean(x)
[1] 99.91978
In this case, we have a new sample of 500 randomly selected numbers, and the mean of this
sample provides a different result, namely 99.91978. The next question to ask is "what happens
if we do this repeatedly?"
In the next activity, we will repeatedly sample from the normal distribution. Each sample will
select five random numbers from the normal distribution having mean 100 and standard
deviation 10. We will then find the mean of the five numbers in our sample. We will repeat this
experiment 500 times, collecting the sample means in a vector xbar as we go.
We begin by declaring the mean and standard deviation of the distribution from which we will
draw random numbers. Then we declare the sample size (the number of random numbers drawn).
Each time we draw a sample of size n = 5 from the normal distribution having mean μ = 100 and
standard deviation σ = 10, we need someplace to store the mean of the sample. Because we
intend to collect the means of 500 samples, we initialize a vector xbar to initially contain 500
zeros.
> xbar=rep(0,500)
The rep command "repeats" the entry zero 500 times. As a result, the vector xbar now contains
500 entries, each of which is zero.
It is easy to draw a sample of size n = 5 from the normal distribution having mean μ = 100 and
standard deviation σ = 10. We simply issue the command rnorm(n,mean=mu,sd=sigma). To
find the mean of this result, we wrap the command as mean(rnorm(n,mean=mu,sd=sigma)).
The final step is to store this result in the vector xbar. Then we must repeat this same process
an additional 499 times for a total of 500 sample means. This requires the use of a for loop.
The for construct used by R is similar to the "for loops" used in many programming languages.
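The loop described above might be sketched as follows; the names mu, sigma, and n match the declarations mentioned in the text.

```r
mu <- 100            # mean of the parent population
sigma <- 10          # standard deviation of the parent population
n <- 5               # sample size
xbar <- rep(0, 500)  # storage for the 500 sample means
for (i in 1:500) {
  # draw a sample of size n, then store its mean
  xbar[i] <- mean(rnorm(n, mean = mu, sd = sigma))
}
```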
It is a simple task to sketch the histogram of the sample means contained in the vector xbar.
> hist(xbar,prob=TRUE,breaks=12,xlim=c(70,130),ylim=c(0,0.1))
There are a number of important observations to be made about the histogram of sample means
in Figure 4, particularly when it is compared with the histograms of Figures 2 and 3.
1. It is essential to note the labels on the horizontal axis. In Figures 2 and 3, the label is x. This is
because the histograms in Figures 2 and 3 are simply describing the shape of 500 random
numbers selected from the normal distribution with mean μ = 100 and standard deviation σ =
10. On the other hand, the histogram of Figure 4 is describing the distribution of 500 sample
means, each of which was found by selecting n = 5 numbers from the normal distribution with
mean μ = 100 and standard deviation σ = 10, then computing their mean (average). The
horizontal axis in Figure 4 emphasizes this fact with the label xbar.
2. It is important to note that the distribution of xbar in Figure 4 appears "normal" in shape. This is
so even though the sample size is relatively small (n = 5).
3. It appears that the "balance point" or "center" of the distribution in Figure 4 occurs near 100.
This can be checked with the following command:
> mean(xbar)
[1] 100.3104
This calculation provides the "mean of the sample means," so to speak. It is important to note
that the mean of the sample means appears to be essentially the same as the mean of the
distribution from which the samples were drawn.
4. The distribution of sample means in Figure 4 appears to be "less spread out" or "narrower" than
the distributions in Figures 2 and 3.
Increasing the Sample Size
Let's repeat the last experiment, but this time let's draw samples of size n = 10 from the same
"parent population," the normal distribution having mean μ = 100 and standard deviation σ = 10.
Key Idea: When we select samples from a normal distribution, then the distribution of sample
means is also "normal" in shape.
Key Idea: The mean of the distribution of sample means appears to be the same as the mean of
the "parent population" from which we selected our samples (see the "balance point" or "center"
in Figure 5). This is easily checked.
> mean(xbar)
[1] 100.0866
Key Idea: By increasing the size of our samples (n = 10), the histogram of the sample means
becomes "less spread out" or "narrower," as is clearly seen when contrasting the spread of the
histograms in Figures 4 and 5. This behavior (increasing the sample size decreases the spread)
seems quite reasonable. We would expect a more accurate estimate of the mean of the parent
population if we take the mean of a larger sample. For example, you have a better chance of
estimating the average height of the student population at a school if you ask ten students their
height and average the results than if you ask only five students.
1. If you draw samples from a normal distribution, then the distribution of sample means is also
normal.
2. The mean of the distribution of sample means is identical to the mean of the "parent
population," the population from which the samples are drawn.
3. The higher the sample size that is drawn, the "narrower" will be the spread of the distribution of
sample means.
This statement of the Central Limit Theorem is not complete. We will add refinements to this
statement in later activities.
Enjoy!
We hope you enjoyed this introduction to the Central Limit Theorem system. We encourage you
to explore further. You might try repeating the experiments provided in this activity with sample
sizes n = 15, 20, and 25.
We stated that we would refine this statement of the Central Limit Theorem in further activities.
Let's proceed to do that now.
In the activity The Central Limit Theorem (Part 1), we drew our random samples from a "parent"
population whose distribution was "normal." In this activity, we'll choose parent populations that
are not normal, then see if the conclusions of the Central Limit Theorem still hold.
Figure 1. The exponential probability density function has both mean and standard deviation equal to 1/λ.
> curve(dexp(x,rate=1),0,4,lwd=2,col="red",ylab="p")
The syntax curve(expr,from=,to=) sketches the graph of expr on the interval (from,to).
In this example, from = 0 and to = 4.
The curve command above produces the probability density curve for the exponential
distribution shown in Figure 2.
Figure 2. Sketching the probability density function for the exponential distribution (λ = 1).
"d" is for "density." It is used to find values of the probability density function.
"p" is for "probability." It is used to find the probability that the random variable lies to the left
of a given number.
"q" is for "quantile." It is used to find the quantiles of a given distribution.
There is a fourth letter, namely "r", that is used to draw random numbers from a distribution.
Let's use the rexp command to draw 500 numbers at random from the exponential distribution
having mean 1 and standard deviation 1.
> x=rexp(500,rate=1)
...
...
When you examine the numbers stored in the variable x, it is difficult to get a sense of the
distribution of numbers. However, a histogram of this selection provides a better understanding
of the data stored in x.
> hist(x,prob=TRUE)
Figure 3. A histogram of 500 random numbers drawn from the exponential distribution with λ = 1
1. Visually, it is not unreasonable to estimate that the "balance point" or "center" of the
distribution is near 1. A quick calculation provides convincing evidence that the mean
is approximately 1.
> mean(x)
[1] 1.006549
In our previous examples, we drew 500 random numbers from an exponential distribution with
mean and standard deviation equal to 1. This is called "drawing a sample of size 500" from the
exponential distribution with mean and standard deviation equal to 1. This leads to a sample of
500 random numbers. One immediate question we can ask is "what is the mean of our sample?"
> mean(x)
[1] 1.006549
Of course, if we take another sample of 500 random numbers from the exponential distribution
with mean and standard deviation equal to 1, we get a new sample that has a different mean.
> x=rexp(500,rate=1)
> mean(x)
[1] 0.9780556
In this case, we have a new sample of 500 randomly selected numbers, and the mean of this
sample provides a different result, namely 0.9780556. The next question to ask is "what happens
if we do this repeatedly?"
In the next activity, we will repeatedly sample from the exponential distribution. Each sample
will select five random numbers from the exponential distribution having mean and standard
deviation equal to 1. We will then find the mean of the five numbers in our sample. We will
repeat this experiment 500 times, collecting the sample means in a vector xbar as we go.
We begin by declaring the rate of the exponential distribution from which we will draw random
numbers. Then we declare the sample size (the number of random numbers drawn).
> lambda=1
> n=5
Each time we draw a sample of size n = 5 from the exponential distribution having mean μ = 1
and standard deviation σ = 1, we need someplace to store the mean of the sample. Because we
intend to collect the means of 500 samples, we initialize a vector xbar to initially contain 500
zeros.
> xbar=rep(0,500)
The rep command "repeats" the entry zero 500 times. As a result, the vector xbar now contains
500 entries, each of which is zero.
It is easy to draw a sample of size n = 5 from the exponential distribution having mean μ = 1 and
standard deviation σ = 1. We simply issue the command rexp(n,rate=lambda). To find the mean
of this result, we simply wrap this command in the mean function: mean(rexp(n,rate=lambda)).
The final step is to store this result in the vector xbar. Then we must repeat this same process an
additional 499 times for a total of 500 sample means. This requires the use of a for loop.
The for construct used by R is similar to the "for loops" used in many programming languages.
It is a simple task to sketch the histogram of the sample means contained in the vector xbar.
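The loop and histogram commands are omitted at this point in the text. Following the pattern of the first activity and the description above (λ = 1, n = 5, 500 repetitions), they would presumably be:

```r
# Exponential parent population with rate lambda = 1 (mean 1, sd 1).
lambda=1
n=5
xbar=rep(0,500)
# Each pass: draw n exponential random numbers and record their mean.
for (i in 1:500) { xbar[i]=mean(rexp(n,rate=lambda)) }
hist(xbar,prob=TRUE,breaks=12)
```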
There are a number of important observations to be made about the histogram of sample means
in Figure 4, particularly when it is compared with the histograms of Figures 2 and 3.
1. It is essential to note the labels on the horizontal axis. In Figures 2 and 3, the label is x. This is
because the histograms in Figures 2 and 3 are simply describing the shape of 500 random
numbers selected from the exponential distribution with mean μ = 1 and standard deviation σ =
1. On the other hand, the histogram of Figure 4 is describing the distribution of 500 sample
means, each of which was found by selecting n = 5 numbers from the exponential distribution
with mean μ = 1 and standard deviation σ = 1, then computing their mean (average). The
horizontal axis in Figure 4 emphasizes this fact with the label xbar.
2. It is important to note that the distribution of xbar in Figure 4 is not normal in shape. Indeed, the
distribution is decidedly skewed to the right.
> lambda=1
> n=10
> xbar=rep(0,500)
> for (i in 1:500) { xbar[i]=mean(rexp(n,rate=1))}
> hist(xbar,prob=TRUE,breaks=12)
The histogram in Figure 5 is still not normal in shape. Again, it is definitely skewed to the right,
though perhaps not as much as the histogram in Figure 4 that was produced with a smaller
sample size.
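The commands that produced Figure 6 are not shown in the text. Judging from the surrounding experiments (n = 10 for Figure 5, n = 30 for Figure 7), Figure 6 presumably used an intermediate sample size; n = 20 is an assumption here:

```r
lambda=1
n=20    # assumed sample size for Figure 6 (not stated in the text)
xbar=rep(0,500)
for (i in 1:500) { xbar[i]=mean(rexp(n,rate=lambda)) }
hist(xbar,prob=TRUE,breaks=12)
```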
Aha! The histogram in Figure 6 has the appearance of a normal distribution. The "right-
skewness" is beginning to disappear, when compared with the histograms in Figures 4 and 5.
> lambda=1
> n=30
> xbar=rep(0,500)
> for (i in 1:500) { xbar[i]=mean(rexp(n,rate=1))}
> hist(xbar,prob=TRUE,breaks=12)
The histogram in Figure 7 has the symmetric bell-shape of the normal distribution.
Key Observation: It would appear that the distribution of the sample means will be normal in
shape, regardless of the shape of the "parent" population, provided the sample size is large
enough. In Figure 7, a sample size of n = 30 seemed to be enough to guarantee that the
distribution of sample means is normal in shape, even though the samples were drawn from the
exponential distribution, a distribution that is highly skewed to the right.
There are two more important observations to be made about the distribution shown in Figure 7:
1. Key Observation: The histogram in Figure 7 appears to be "balanced" or "centered" about xbar =
1. Indeed:
> mean(xbar)
[1] 0.9978525
This mean is the same as the mean of the "parent" population (the exponential distribution with
mean μ = 1) from which our samples were drawn.
2. Key Observation: The standard deviation of the distribution in Figure 7 appears to be much
smaller than the standard deviation of the parent population (the exponential distribution had σ
= 1). Indeed, if we recall that all data in the normal distribution falls within about three standard
deviations of the mean, the histogram in Figure 7 would indicate a standard deviation of about
0.2. We will have more to say about the standard deviation of the sample means in a later
activity.
Let's repeat the experiment again (increasing the sample size), only this time let's use a "parent"
population that is discrete and skewed to the left. Specifically:
A Discrete Distribution
x p
1 0.1
2 0.1
3 0.1
4 0.1
5 0.2
6 0.4
We can "load" the values of the random variable and their probabilities in R as follows:
> x=c(1,2,3,4,5,6)
> p=c(0.1,0.1,0.1,0.1,0.2,0.4)
We can provide a "stick" plot of this discrete distribution with the following code:
x=c(1,2,3,4,5,6)
p=c(0.1,0.1,0.1,0.1,0.2,0.4)
plot(x,p,type="h",lwd=2,col="red",ylim=c(0,0.5))
points(x,p,pch=16,cex=2,col="black")
The code above will produce the discrete distribution shown in Figure 7.
Figure 7. A discrete distribution that is badly skewed to the left.
The mean is found by summing the product of the values of the random variable and their
associated probabilities. Thus, we can find the mean of our discrete distribution with the
following calculation:
μ = 1(0.1) + 2(0.1) + 3(0.1) + 4(0.1) + 5(0.2) + 6(0.4) = 4.4
Making this calculation in R greatly simplifies the task. First find the product of the vectors x and
p. Note: When you take the product of two vectors, R will produce a third vector, each entry of
which is the product of the corresponding entries in the vectors being multiplied.
> x*p
[1] 0.1 0.2 0.3 0.4 1.0 2.4
> sum(x*p)
[1] 4.4
Key Observation: Look again at the "spike" plot in Figure 7. Imagine a "knife-edge" at 4.4 and
set the distribution of Figure 7 atop the "knife-edge." Will the distribution balance? Keep in mind
the "principle of the lever" or the "teeter-totter effect." The outliers, such as x = 1, positioned
farther from the mean but carrying lower probability, can balance values with more
"massive" probabilities that are clumped closer to the mean. With these thoughts in mind, the
mean value μ = 4.4 seems reasonable.
In the activity Sampling a Discrete Population, the sample command was used to draw a sample.
The syntax sample(x, size, replace=, prob=) indicates that we must provide the following
arguments: the vector x of values to sample from, the size of the sample, whether we sample
with replacement (replace=TRUE or FALSE), and the vector prob of probabilities attached to
the values in x.
Let's draw a sample of size 1000 and create a histogram of the resulting sample.
n=1000
xs=sample(x,size=n,replace=TRUE,prob=p)
As in the activity Sampling a Discrete Distribution, the table command provides a nice summary
of the sample.
> table(xs)
xs
1 2 3 4 5 6
97 116 91 102 189 405
> barplot(table(xs)/length(xs),xlab="x",ylab="frequency")
This last command produces the barplot shown in Figure 8.
Figure 8.Plotting a sample of size 1000 selected from the discrete distribution.
The barplot in Figure 8 is a random sample from the discrete distribution shown in Figure 7,
whose theoretical mean is 4.4. Let's see what the mean of our sample is.
> mean(xs)
[1] 4.385
Of course, if we draw another random sample, we will get a different sample mean.
> xs=sample(x,size=n,replace=TRUE,prob=p)
> mean(xs)
[1] 4.479
n=5
xbar=rep(0,500)
for (i in 1:500) {
xbar[i]=mean(sample(x,size=n,replace=TRUE,prob=p))
}
hist(xbar,prob=TRUE,breaks=12)
Note that the distribution of sample means in Figure 9 is not normal. Indeed, the distribution is
skewed to the left. This is to be expected as the "parent" population is highly skewed to the left
and the sample size we are using is quite small. Let's increase the sample size and see what
happens.
n=10
xbar=rep(0,500)
for (i in 1:500) {
xbar[i]=mean(sample(x,size=n,replace=TRUE,prob=p))
}
hist(xbar,prob=TRUE,breaks=12)
There's a bit of an improvement (starting to look somewhat normal) but the distribution of
sample means in Figure 10 is still skewed to the left. Let's increase the sample size and see what
happens.
n=20
xbar=rep(0,500)
for (i in 1:500) {
xbar[i]=mean(sample(x,size=n,replace=TRUE,prob=p))
}
hist(xbar,prob=TRUE,breaks=12)
The distribution is now taking on the symmetry and "bell-shape" of the normal distribution. Let's
increase the sample size one more time and see what happens.
n=30
xbar=rep(0,500)
for (i in 1:500) {
xbar[i]=mean(sample(x,size=n,replace=TRUE,prob=p))
}
hist(xbar,prob=TRUE,breaks=12)
The histogram in Figure 12 has the symmetric bell-shape of the normal distribution.
Key Observation: It would appear that the distribution of the sample means will be normal in
shape, regardless of the shape of the "parent" population, provided the sample size is large
enough. In Figure 12, a sample size of n = 30 seemed to be enough to guarantee that the
distribution of sample means is normal in shape, even though the samples were drawn from a
discrete distribution, a distribution that is highly skewed to the left.
Key Observation: The histogram in Figure 12 appears to be "balanced" or "centered" about xbar
= 4.4. Indeed:
> mean(xbar)
[1] 4.396933
This mean is the same as the mean of the "parent" population (the discrete distribution of Figure
7 with mean μ = 4.4) from which our samples were drawn.
1. If you draw samples from a distribution, then the distribution of sample means is also normal,
provided a large enough sample size is used. It appears that the Magic Number for a
sufficient sample size is n = 30. Note: This is why the Central Limit Theorem is oftentimes referred
to as the "Law of Large Numbers."
2. The mean of the distribution of sample means is identical to the mean of the "parent
population," the population from which the samples are drawn.
3. The higher the sample size that is drawn, the "narrower" will be the spread of the distribution of
sample means.
This statement of the Central Limit Theorem is still not complete. We still need to discuss just
how the standard deviation of the sample means varies with the sample size. We will attack this
question in a later activity.
Enjoy!
We hope you enjoyed this second activity on the Central Limit Theorem system. We encourage
you to explore further. You might try repeating the experiments in this activity with different
"parent" distributions.
Transforming Data in R
In the activity Linear Regression in R, we showed how to calculate and plot the "line of best fit"
for a set of data. As a quick reminder, consider the normal average January minimum
temperatures in 56 American cities, presented at the following URL:
http://lib.stat.cmu.edu/DASL/Datafiles/USTemperatures.html
This file is one of many data sets stored at the Data and Story Library. We've adapted the file to
make it easier to import into R. You can download the adapted file at
http://msenux.redwoods.edu/mathdept/R/USTemperatures.txt. Take note of the folder name
where you save this file as USTemperatures.txt.
In the activity Importing Data in R, we showed one of the simplest methods to import a datafile
into a dataframe in R.
> USTemps=read.table(file=file.choose(),header=TRUE)
The option file.choose() will pop open a dialog that allows the user to browse through their
directory structure in an accustomed manner. Locate the file USTemperatures.txt in your
directory structure, then click the Open button.
> USTemps=read.table(file=file.choose(),header=TRUE)
> USTemps
Temp Lat Long
1 44 31.2 88.5
2 38 32.9 86.8
3 35 33.6 112.5
4 31 35.4 92.8
5 47 34.3 118.7
...
...
...
56 14 41.2 104.9
The first variable Temp contains the average January low temperature for a particular US city
having latitude and longitude stored in the second and third variables, Lat and Long,
respectively.
> res=lm(Temp~Lat,data=USTemps)
> res
Call:
lm(formula = Temp ~ Lat, data = USTemps)
Coefficients:
(Intercept) Lat
108.728 -2.110
It's a simple matter to produce a scatterplot and the line of best fit.
> plot(Temp~Lat,data=USTemps)
> abline(res)
The above commands produce the scatterplot and the line of best fit in Figure 1.
Figure 1. Plot of temperature versus latitude and the line of best fit.
In Figure 1, we see a general downward linear trend (which explains the negative slope). As the
latitude increases, we move northward from the equator (which has latitude zero) and the
temperature gets colder.
It's important to note the location of the variable x in the power function in Figure 2. Later in this
activity, we will contrast the power function with the exponential function, where the variable x
is an exponent, rather than the base as in the equation in Figure 2.
Let's look at an example, choosing the power function y = 3x^2. First, let's produce a plot.
> x=seq(1,5,0.5)
> y=3*x^2
> plot(x,y)
It could be argued that the data is somewhat linear, showing a general upward trend. However, it
is more likely (based on Figure 3) that the function is nonlinear, due to the general bend in the
curve shown in Figure 3.
Let's take the logarithm of both sides of y = 3x^2. The log of a product is the sum of the logs, so
we can write the following.
log y = log 3 + log x^2 = log 3 + 2 log x
This last form shows that if we plot the log of y versus the log of x, the graph will be linear with
slope 2 and intercept log 3.
> plot(log(x),log(y))
> res=lm(log(y)~log(x))
> res
Call:
lm(formula = log(y) ~ log(x))
Coefficients:
(Intercept) log(x)
1.099 2.000
The slope is 2, which is the slope indicated in log y = log 3 + 2 log x. According to this equation,
the intercept should be log 3.
> log(3)
[1] 1.098612
This result agrees with the intercept found using R's lm command as reported in the variable res.
Important Result: With any power function, the graph of the logarithm of the dependent variable versus
the logarithm of the independent variable will be a line. Thus, if the graph of the logarithm of the
response variable versus the logarithm of the independent variable is a line, then we should suspect that
the relationship between the original variables is that of a power function.
Resting Metabolic Rate
Let's apply what we've learned about power functions to a concrete example. In the table that
follows, the resting metabolic rates of several classes of primates are presented along with the
mass of the primate class. The resting metabolic rate (RMR) is defined to be the amount of
energy required by the body when the body is doing nothing. This data is taken from Leonard
and Robinson (1997, AJPA) and repeated in an article by Marcus Hamilton, Model-Fitting with
Linear Regression: Power Functions.
Species Weight RMR
A. trivirgatus 0.85 46
C. molloch 0.7 54
S. imperator 0.4 35
S. fusicollis 0.3 28
S. sciureus 0.8 66
C. guereza 7 265
P. anubis 13 520
H. lar 6 292
!Kung 46 1383
!Kung 41 1099
We've massaged the data and stored it in the file RMR.txt. Download the file and save it as
RMR.txt. Take note of the folder in which you save the file. Read the data into R as follows:
> primates=read.table(file=file.choose(),header=TRUE)
> primates
Weight RMR
1 8.50 363
2 6.40 293
3 0.85 46
4 8.41 346
5 0.70 54
6 2.60 143
7 2.40 135
8 0.40 35
9 0.30 28
10 0.80 66
11 7.90 327
12 7.00 265
13 5.50 331
14 29.30 956
15 13.00 520
16 6.00 292
17 39.50 1036
18 29.80 839
19 83.60 1948
20 37.80 1074
21 10.50 408
22 46.00 1383
23 41.00 1099
24 59.60 1591
25 51.80 1394
> plot(RMR~Weight,data=primates)
One could argue that the data has a general linear appearance. However, there is evidence in
Figure 5 of a slightly concave down bend to the data, suggesting that we might use a power
function (perhaps the square root function) to fit the data. This encourages us to plot the
logarithm of the RMR versus the logarithm of the Weight as follows.
> plot(log(RMR)~log(Weight),data=primates)
Aha! The plot in Figure 6 shows a definite linear trend. Therefore, our suspicion that a power
function might fit the original data set is probably valid (there are actual statistical tests for this
assumption which we will explore in later activities). Let's find the line of best fit for the
transformed data in Figure 6.
> res=lm(log(RMR)~log(Weight),data=primates)
> res
Call:
lm(formula = log(RMR) ~ log(Weight), data = primates)
Coefficients:
(Intercept) log(Weight)
4.2409 0.7599
This result tells us that we can fit a linear model to the transformed data as follows:
log(RMR) = 4.2409 + 0.7599 log(Weight)
We evaluate e^4.2409.
> exp(4.2409)
[1] 69.47035
Finally, we again use the fact that the logarithm and exponential are inverses.
RMR = e^4.2409 Weight^0.7599 = 69.47035 Weight^0.7599
Note that this final function has the form y = a x^b, a power function that should "fit" the original
data set.
Let's provide some visual verification of our model. First, find the range (min and max) of the
Weight data.
> range(primates$Weight)
[1] 0.3 83.6
Plot the original data, then superimpose a plot of the power function that we think will "fit" the
data.
> plot(RMR~Weight,data=primates)
> x=seq(0.3,83.6,0.1)
> y=69.47035*x^0.7599
> lines(x,y,type="l",lwd=2,col="red")
These commands produce the scatterplot and the power function that "fits" the data shown in
Figure 7.
In the power function, the independent variable was the base, as in y = a x^b. In the exponential
function, the independent variable is now an exponent, as in y = a b^x. This is a subtle but
important difference.
Let's look at an example of an exponential function, choosing the example y = 3(2^x). First, let's
produce a plot.
> x=seq(1,5,0.5)
> y=3*2^x
> plot(x,y)
Let's take the logarithm of both sides of the exponential function y = 3(2^x). The base of the
logarithm is irrelevant; however, log x is understood to be the natural logarithm (base e) in R.
That is, log x = loge x in R.
The log of a product is the sum of the logs, so we can write the following.
log y = log 3 + log 2^x = log 3 + x log 2
This last form shows that if we plot the log of y versus x, the graph will be linear with slope log
2 and intercept log 3.
> plot(x,log(y))
It's interesting to find the line of best fit for the transformed data.
> res=lm(log(y)~x)
> res
Call:
lm(formula = log(y) ~ x)
Coefficients:
(Intercept) x
1.0986 0.6931
The slope is 0.6931, which agrees with the slope indicated in log y = log 3 + x log 2; that is, the
slope is log 2.
> log(2)
[1] 0.6931472
The intercept is 1.098, which agrees with the intercept indicated in log y = log 3 + x log 2; that
is, the intercept is log 3.
> log(3)
[1] 1.098612
This result agrees with the intercept found using R's lm command as reported in the variable res.
Important Result: With any exponential function, the graph of the logarithm of the dependent variable
versus the independent variable will be a line. Thus, if the graph of the logarithm of the response
variable versus the independent variable is a line, then we should suspect that the relationship between
the original variables is that of a exponential function.
Cell Phone Subscribers
Let's apply what we've learned to a concrete example. In the table that follows, the number of
cell phone subscriptions (in millions) is presented as a function of the number of years that have
passed since 1987.
t (years since 1987) subscribers (in millions)
1 1.6
2 2.7
3 4.4
4 6.4
5 8.9
6 13.1
7 19.3
8 28.2
9 38.2
10 48.7
> t=seq(1,10)
> subscribers=c(1.6,2.7,4.4,6.4,8.9,13.1,19.3,28.2,38.2,48.7)
> plot(t,subscribers)
These commands produce the scatterplot shown in Figure 11.
We suspect exponential growth. This leads us to plot the logarithm of the subscriber data versus
the year since 1987.
> plot(t,log(subscribers))
> res=lm(log(subscribers)~t)
> abline(res)
These commands produce the scatterplot and the line of best fit shown in Figure 12.
Figure 12. Plotting log of subscribers versus the years since 1987.
The data of Figure 12 certainly appear to be linear. Let's examine the model of the line of best fit.
> res
Call:
lm(formula = log(subscribers) ~ t)
Coefficients:
(Intercept) t
0.2630 0.3774
> exp(0.2630)
[1] 1.300827
> plot(t,subscribers)
> t=seq(1,10,0.1)
> y=1.300827*exp(0.3774*t)
> lines(t,y,type="l",lwd=2,col="red")
The above commands will plot the original data, then superimpose a plot of the exponential
function that we found to "fit" the data.
Figure 13. The exponential function 'fits' the original data.
Not bad!
Enjoy!
We hope you enjoyed this introduction to the principles of transforming data in the R system. In
upcoming activities, we will discuss further what is meant by a "good fit."
1. If you draw samples from a normal distribution, then the distribution of sample means is also
normal.
2. The mean of the distribution of sample means is identical to the mean of the "parent
population," the population from which the samples are drawn.
3. The higher the sample size that is drawn, the "narrower" will be the spread of the distribution of
sample means.
To provide visual evidence of the narrowing spread, the following code will produce the image
shown in Figure 1. The four examples pictured use sample sizes of 5, 10, 20, and 30. (Note: It is
very difficult to type the following commands without error. In a moment, we'll provide a more
efficient way to enter the commands. In the meantime, give the commands a try, if only to
appreciate how difficult it is to get them correctly entered in sequence.)
> par(mfrow=c(2,2))
>
> n=5
> xbar_5=rep(0,500)
> for (i in 1:500){ xbar_5[i]=mean(rnorm(n,mean=100,sd=10))}
> hist(xbar_5,prob=TRUE,breaks=12,xlim=c(70,130))
>
> n=10
> xbar_10=rep(0,500)
> for (i in 1:500) { xbar_10[i]=mean(rnorm(n,mean=100,sd=10))}
> hist(xbar_10,prob=TRUE,breaks=12,xlim=c(70,130))
>
> n=20
> xbar_20=rep(0,500)
> for (i in 1:500) { xbar_20[i]=mean(rnorm(n,mean=100,sd=10))}
> hist(xbar_20,prob=TRUE,breaks=12,xlim=c(70,130))
>
> n=30
> xbar_30=rep(0,500)
> for (i in 1:500) { xbar_30[i]=mean(rnorm(n,mean=100,sd=10))}
> hist(xbar_30,prob=TRUE,breaks=12,xlim=c(70,130))
1. There is a new command, par(mfrow=c(2,2)), which divides the graphics window into two rows
and two columns (see Figure 1). The par stands for "parameter," and there are a whole host of
parameters that can alter the figure window (type ?par for more information).
2. Think of mfrow as meaning "multiple figures entered by rows." Because there are two rows and
two columns, the first histogram command creates a plot in row 1, column 1, the second in row
1, column 2, the third in row 2, column 1, and the fourth in row 2 column 2. There is also a
command mfcol that enters the figures by columns.
3. The for loops are explained in the activity The Central Limit Theorem (Part 1).
4. Note that xlim=c(70,130) creates a common axis so that a proper comparison of the spreads can
be made.
In Figure 1, note that the spread narrows as the sample size increases from n = 5 to n = 10, n =
20, and n = 30.
Figure 1. A visual comparison of the distribution of sample means as the sample size increases.
Note that the mean of each distribution of sample means is 100 (each of the histograms in Figure
1 balances about 100), the same as the mean of the parent population from which the samples are
drawn.
Sourcing Files
If you were successful in typing the lines that produced Figure 1, then you are a far better typist
than I. When commands must be correctly entered in sequence, such as the lines of code that
produced Figure 1, it can be exasperating to make a mistake, knowing that you must retype the
entire sequence anew.
A far more efficient way to proceed is to use R's editor to enter the lines of code, save the file,
then execute the code in the file. This way, if you make a simple typing mistake, you make a
change to the file, save, then execute the file anew. R calls these files of code "source files," and
executing the file is called "sourcing the file."
On the Mac operating system (Windows and Linux should be similar), you open the editor by
selecting New Document from the File menu (there is also an icon for new documents on the
console toolbar). When the editor opens, type the following lines.
par(mfrow=c(2,2))
n=5
xbar_5=rep(0,500)
for (i in 1:500){ xbar_5[i]=mean(rnorm(n,mean=100,sd=10))}
hist(xbar_5,prob=TRUE,breaks=12,xlim=c(70,130))
n=10
xbar_10=rep(0,500)
for (i in 1:500) { xbar_10[i]=mean(rnorm(n,mean=100,sd=10))}
hist(xbar_10,prob=TRUE,breaks=12,xlim=c(70,130))
n=20
xbar_20=rep(0,500)
for (i in 1:500) { xbar_20[i]=mean(rnorm(n,mean=100,sd=10))}
hist(xbar_20,prob=TRUE,breaks=12,xlim=c(70,130))
n=30
xbar_30=rep(0,500)
for (i in 1:500) { xbar_30[i]=mean(rnorm(n,mean=100,sd=10))}
hist(xbar_30,prob=TRUE,breaks=12,xlim=c(70,130))
Save the file as central3_1.R. The .R extension reminds us that the file contains R source code.
To execute the code in the file central3_1.R, select Source File from the File menu (there is
also an icon for "sourcing" files on the console toolbar). This opens a dialog where you can
browse your system in the usual manner and select the saved source file (central3_1.R in our
case). Clicking the Open button will execute the source code in the file and produce the image in
Figure 1. If you have errors, or wish to make adaptations, correct your source code, save, then
"source" the file again.
On our system, "sourcing" the file central3_1.R sends the following command to the console:
> source("/Users/darnold/Documents/MathDept/trunk/MathDept/html/R/code/central3_1.R")
This line shows both the path and filename of the file containing our source code. We've found
that should errors occur, or should we wish to add further code or make changes, we can save our
changes, then hit the "up-arrow" on our keyboard to replay this line in our console. Hitting enter
executes the line and "sources" our saved changes, a most efficient way to work.
Once you've "sourced" your code, the workspace will reflect variables used in your code. (Note:
Your workspace may differ.)
> ls()
[1] "i" "n" "xbar_10" "xbar_20" "xbar_30" "xbar_5"
> i
[1] 500
The variable n should contain the last sample size used in your code.
> n
[1] 30
The variables xbar_5, xbar_10, xbar_20, and xbar_30 should contain 500 sample means with
the indicated sample size. For example, xbar_5 contains the means of 500 samples of size n = 5
drawn from a normal population having mean μ = 100 and standard deviation σ = 10.
> xbar_5
[1] 94.47642 98.71575 106.02464 104.54210 102.46225 95.55821
[7] 106.01677 103.29282 101.07612 93.36717 97.08930 99.53715
[13] 98.72121 91.59586 100.12131 100.20003 99.66180 97.42841
[19] 102.15585 93.75008 98.89875 103.13431 101.45754 97.65430
[25] 103.02266 97.22582 95.37828 96.91194 98.07255 96.93296
...
...
The histograms in Figure 1 indicate that the spread of the distributions of sample means narrows
with increasing sample size. Let's examine the standard deviations of xbar_5, xbar_10,
xbar_20, and xbar_30. The sd command is used to find the standard deviation of any collection
of data.
> sd(xbar_5)
[1] 4.52229
> sd(xbar_10)
[1] 3.179132
> sd(xbar_20)
[1] 2.165316
> sd(xbar_30)
[1] 1.820735
Sure enough, as the sample size increases, the standard deviation, which is a measure of the
spread of the distribution, clearly decreases.
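The same narrowing can be cross-checked outside R. The following is a minimal sketch in Python's standard library (our own cross-check, not part of the activity; the seed and the choice to reuse 500 samples are arbitrary), which repeats the experiment and prints the standard deviation of the sample means for each sample size:

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the run is reproducible

# For each sample size n, draw 500 samples from a normal population with
# mean 100 and standard deviation 10, then record the standard deviation
# of the 500 sample means.
sds = {}
for n in (5, 10, 20, 30):
    xbar = [statistics.mean(random.gauss(100, 10) for _ in range(n))
            for _ in range(500)]
    sds[n] = statistics.stdev(xbar)
    print(n, round(sds[n], 3))
```

As in the R run above, each printed standard deviation is smaller than the one before it.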
Let's add a couple more data points. Save the following lines in the file central3_2.R.
n=5
xbar_5=rep(0,500)
for (i in 1:500){ xbar_5[i]=mean(rnorm(n,mean=100,sd=10))}
n=10
xbar_10=rep(0,500)
for (i in 1:500) { xbar_10[i]=mean(rnorm(n,mean=100,sd=10))}
n=20
xbar_20=rep(0,500)
for (i in 1:500){ xbar_20[i]=mean(rnorm(n,mean=100,sd=10))}
n=30
xbar_30=rep(0,500)
for (i in 1:500) { xbar_30[i]=mean(rnorm(n,mean=100,sd=10))}
n=40
xbar_40=rep(0,500)
for (i in 1:500) { xbar_40[i]=mean(rnorm(n,mean=100,sd=10))}
n=50
xbar_50=rep(0,500)
for (i in 1:500) { xbar_50[i]=mean(rnorm(n,mean=100,sd=10))}
Now, collect the sample sizes in a vector ssizes and standard deviations of the distributions of
sample means in a vector sdevs, then plot the standard deviations versus the sample sizes. Add
these lines to the code in the file central3_2.R.
ssizes=c(5,10,20,30,40,50)
sdevs=c(sd(xbar_5),sd(xbar_10),sd(xbar_20),sd(xbar_30),sd(xbar_40),sd(xbar_50))
plot(ssizes,sdevs)
Save the file central3_2.R then "source" the file to produce the scatterplot shown in Figure 2.
Figure 2. Plotting the standard deviations of the distributions of sample means versus the sample size.
Let's examine the current workspace in R. Note that this next command is executed at the
command line.
> ls()
[1] "i" "n" "sdevs" "ssizes" "xbar_10" "xbar_20"
[7] "xbar_30" "xbar_40" "xbar_5" "xbar_50"
We will now try to find the specific relationship between the standard deviations of the
distributions of sample means and the corresponding sample size. In the activity Transforming
Data in R, we examined two models: power functions and exponential functions. Let's plot the
logarithm of sdevs versus the logarithm of ssizes.
> plot(log(ssizes),log(sdevs))
The above command produces the scatterplot shown in Figure 3.
The fact that the plot of the logarithm of sdevs versus the logarithm of ssizes in Figure 3 is
linear indicates that a power function would be a good model for the relationship between sdevs
and ssizes.
As we learned in the activity Transforming Data in R, we first fit a line to the logarithm of sdevs
versus the logarithm of ssizes shown in Figure 3.
> res=lm(log(sdevs)~log(ssizes))
> res
Call:
lm(formula = log(sdevs) ~ log(ssizes))
Coefficients:
(Intercept) log(ssizes)
2.2962 -0.4949
> abline(res)
This produces the line of best fit in Figure 4.
The fact that the line fits the data in Figure 4 so well is a strong indication that a power function
is the relationship we seek between the standard deviations of the distribution of sample means
and the sample sizes.
The output in the variable res tells us that the line of best fit in Figure 4 has the following
equation:
ln(sdevs) = 2.2962 - 0.4949 ln(ssizes)
Exponentiating both sides gives:
e^(ln(sdevs)) = e^(2.2962 - 0.4949 ln(ssizes)) = e^(2.2962) e^(-0.4949 ln(ssizes))
On the left, the exponential and the logarithm are inverses. On the right, the exponential of a sum
is the product of the exponentials.
> exp(2.2962)
[1] 9.936352
Again, the logarithm and exponential are inverses, so e^(-0.4949 ln(ssizes)) = ssizes^(-0.4949),
which leads to the following result:
sdevs ≈ 9.936 ssizes^(-0.4949) ≈ 10 ssizes^(-1/2) = 10/√ssizes
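The back-transformation is easy to verify numerically. The sketch below (Python, with the two coefficients copied from the output of res above) evaluates the fitted power model at each of our sample sizes and compares it with 10/√n:

```python
import math

# Coefficients reported by lm(log(sdevs) ~ log(ssizes)) above
intercept, slope = 2.2962, -0.4949

const = math.exp(intercept)
print(round(const, 6))  # about 9.9364, matching exp(2.2962) in the R output

# The fitted power model const * n^slope stays close to 10/sqrt(n).
for n in (5, 10, 20, 30, 40, 50):
    print(n, round(const * n ** slope, 4), round(10 / math.sqrt(n), 4))
```

At every sample size the fitted value and 10/√n agree to within a few hundredths.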
Conclusions
Note the number 10. It is not a coincidence that this number is the standard deviation of the
parent population. Remember, we drew our samples from a normal distribution with mean μ =
100 and standard deviation σ = 10.
Central Limit Theorem. Let X be a random variable having a normal distribution with mean μ and
standard deviation σ. Let X̄ be the random variable representing the mean of a sample of size n
drawn from the distribution of X. Then:
1. The distribution of X̄ is normal.
2. The mean of the distribution of X̄ equals μ, the mean of the distribution of X.
3. The standard deviation of the distribution of X̄ is related to the standard deviation of the
distribution of X by the following equation:
σ_X̄ = σ/√n
Note that the result in item #3 is essentially identical to our experimental result, repeated here
for convenience:
sdevs ≈ 10/√ssizes
The comparison is better seen if we use a little algebra to change the form of the result in item
#3. With σ = 10, the equation σ_X̄ = σ/√n becomes:
σ_X̄ = 10/√n
Thus, to find the standard deviation of the distribution of sample means, take the standard
deviation of the parent population and divide it by the square root of the sample size.
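For example, with σ = 10 this rule predicts the following values, which line up well with the standard deviations we measured earlier (4.52, 3.18, 2.17, and 1.82):

```python
import math

sigma = 10  # standard deviation of the parent population
for n in (5, 10, 20, 30):
    # prints 4.4721, 3.1623, 2.2361, 1.8257 respectively
    print(n, round(sigma / math.sqrt(n), 4))
```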
Here is the key fact: The hypotheses and conclusions of the Central Limit Theorem as stated
above hold regardless of the distribution of the parent population, provided the sample size is
large enough. We will now examine this statement through examples.
> x=seq(0,10,0.1)
> y=dunif(x,min=0,max=10)
> plot(x,y,type="l",lwd=2,col="red")
These commands produce the graph of the uniform density function on [0,10] in Figure 5.
The mean of the uniform distribution on the interval [a, b] is given by the formula:
μ = (a + b)/2
You need to use a little calculus to determine the standard deviation of the uniform distribution
on the interval [a, b], but the result is:
σ = (b - a)/√12
Thus, the standard deviation of the uniform distribution on the interval [0,10] is:
σ = (10 - 0)/√12 = 10/√12 ≈ 2.886751
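These values, together with the prediction σ/√n for samples of size n = 100, can be checked with a few lines of arithmetic (a Python sketch of our own, not part of the R activity):

```python
import math

a, b = 0, 10
mu = (a + b) / 2                 # mean of the uniform distribution on [a, b]
sigma = (b - a) / math.sqrt(12)  # its standard deviation

print(mu)               # 5.0
print(round(sigma, 6))  # 2.886751

# CLT prediction for the spread of means of samples of size n = 100
print(round(sigma / math.sqrt(100), 6))  # 0.288675
```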
Now, let's draw 500 samples of size n=100 from the uniform distribution on [0,10].
> n=100
> xbar=rep(0,500)
> for (i in 1:500) { xbar[i]=mean(runif(n,min=0,max=10))}
> hist(xbar,prob=TRUE)
Figure 6. Histogram of means of samples of size n = 100 drawn from the uniform distribution on [0,10].
Now, let's check the conclusions of the Central Limit Theorem.
1. The first conclusion of the Central Limit Theorem is that the distribution of sample
means is normal. This is evident in Figure 6.
2. The second conclusion of the Central Limit Theorem states that the mean of the sample means
is the same as the mean of the distribution from which the samples are drawn. In Figure 6, the
mean of the distribution of sample means appears to be about 5 (estimate the balance point in
Figure 6), which is the same as the mean of the uniform distribution on [0,10]. However, let's do
one more check:
> mean(xbar)
[1] 5.002197
Looks good!
3. The third conclusion of the Central Limit Theorem states that the standard deviation of the
distribution of sample means is calculated with:
σ_X̄ = σ/√n = (10/√12)/√100 ≈ 0.2887
Let's check this prediction.
> sd(xbar)
[1] 0.2936821
This is quite close to the predicted value of 0.2887.
In the activity Continuous Distributions in R, we examined the exponential distribution having the
following probability density function:
f(x) = λe^(-λx), for x ≥ 0
The mean and standard deviation of this exponential distribution are μ = 1/λ and σ = 1/λ,
respectively.
> lambda=1
> y=dexp(x,rate=lambda)
> plot(x,y,type="l",lwd=2,col="red")
These commands produce the graph of the exponential density function shown in Figure 7.
Now, let's draw 500 samples of size n=100 from the exponential distribution with λ = 1.
> n=100
> xbar=rep(0,500)
> for (i in 1:500) { xbar[i]=mean(rexp(n,rate=lambda))}
> hist(xbar,prob=TRUE)
These commands produce the histogram of the sample means shown in Figure 8.
Figure 8. Distribution of sample means of size n = 100 from the exponential distribution with λ = 1.
1. The first conclusion of the Central Limit Theorem is that the distribution of sample
means is normal. This is evident in Figure 8.
2. The second conclusion of the Central Limit Theorem states that the mean of the sample means
is the same as the mean of the distribution from which the samples are drawn. In Figure 8, the
mean of the distribution of sample means appears to be about 1 (estimate the balance point
in Figure 8), which is the same as the mean of the exponential distribution with λ = 1.
However, let's do one more check:
> mean(xbar)
[1] 0.9997962
Looks good!
3. The third conclusion of the Central Limit Theorem states that the standard deviation of the
distribution of sample means is calculated with:
σ_X̄ = σ/√n = (1/λ)/√n = 1/√100 = 0.1
Let's check this prediction.
> sd(xbar)
[1] 0.09948667
This is quite close to the predicted value of 0.1.
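The exponential experiment is also easy to reproduce outside R. Here is a short Python sketch (our own cross-check, with an arbitrary seed) that draws the same 500 samples of size 100 and summarizes the sample means:

```python
import math
import random
import statistics

random.seed(2)  # fixed seed for reproducibility
lam, n = 1.0, 100

# Means of 500 samples of size n = 100 drawn from the exponential
# distribution with rate lambda = 1.
xbar = [statistics.mean(random.expovariate(lam) for _ in range(n))
        for _ in range(500)]

print(round(statistics.mean(xbar), 3))   # close to mu = 1/lambda = 1
print(round(statistics.stdev(xbar), 3))  # close to sigma/sqrt(n) = 0.1
```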
Enjoy!
We hope you enjoyed this third activity on the Central Limit Theorem. We encourage you to
explore further. You might try repeating the experiments in this activity with different
"parent" distributions.