
R Tutorial

Input

Assignment

The most straightforward way to store a list of numbers is through an assignment using the c command. (c stands for "combine.") The idea is that a list of numbers is stored under a given name, and the name is used to refer to the data. A list is specified with the c command, and assignment is specified with the "<-" symbols. Another term used to describe the list of numbers is to call it a "vector." The numbers within the c command are separated by commas. As an example, we can create a new variable, called "bubba," which will contain the numbers 3, 5, 7, and 9:

> bubba <- c(3,5,7,9)
>

When you enter this command you should not see any output except a new command line. The command creates a list of numbers called "bubba." To see what numbers are included in bubba, type "bubba" and press the enter key:

> bubba
[1] 3 5 7 9
>

If you wish to work with one of the numbers you can get access to it using the variable and then square brackets indicating which number:

> bubba[2]
[1] 5
> bubba[1]
[1] 3
> bubba[0]
numeric(0)
> bubba[3]
[1] 7
> bubba[4]
[1] 9
>

Notice that the first entry is referred to as the number 1 entry, and the zero entry can be used to indicate how the computer will treat the data. You can store strings using both single and double quotes, and you can store real numbers.

You now have a list of numbers and are ready to explore. In the chapters that follow we will examine the basic operations in R that will allow you to do some of the analyses required in class.

Reading a CSV file

Unfortunately, it is rare to have just a few data points that you do not mind typing in at the prompt. It is much more common to have a lot of data points with complicated relationships. Here we will examine how to read a data set from a file using the read.csv function, but first discuss the format of a data file.

We assume that the data file is in the format called "comma separated values" (csv). That is, each line contains a row of values which can be numbers or letters, and each value is separated by a comma. We also assume that the very first row contains a list of labels. The idea is that the labels in the top row are used to refer to the different columns of values.

First we read a very short, somewhat silly, data file. The data file is called simple.csv and has three columns of data and six rows. The three columns are labeled "trial," "mass," and "velocity." We can pretend that each row comes from an observation during one of two trials labeled "A" and "B." A copy of the data file is shown below and is created in defiance of Werner Heisenberg:

"trial","mass","velocity"
"A",10,12
"A",11,14
"B",5,8
"B",6,10
"A",10.5,13
"B",7,11

The command to read the data file is read.csv. We have to give the command at least one argument, but we will give three different arguments to indicate how the command can be used in different situations. The first argument is the name of the file. The second argument indicates whether or not the first row is a set of labels. The third argument indicates that there is a comma between each number of each line. The following command will read in the data and assign it to a variable called "heisenberg:"

> heisenberg <- read.csv(file="simple.csv",head=TRUE,sep=",")
> heisenberg
  trial mass velocity
1     A 10.0       12
2     A 11.0       14
3     B  5.0        8
4     B  6.0       10
5     A 10.5       13
6     B  7.0       11
> summary(heisenberg)
 trial      mass          velocity    
 A:3   Min.   : 5.00   Min.   : 8.00  
 B:3   1st Qu.: 6.25   1st Qu.:10.25  
       Median : 8.50   Median :11.50  
       Mean   : 8.25   Mean   :11.33  
       3rd Qu.:10.38   3rd Qu.:12.75  
       Max.   :11.00   Max.   :14.00  
>

(Note that if you are using a Microsoft system the file naming convention is different from what we use here. If you want to use a backslash it needs to be escaped, i.e. use two backslashes together, "\\". Also you can specify what folder to use by clicking on the "File" option in the main menu and choosing the option to specify your working directory.)

The variable "heisenberg" contains the three columns of data. Each column is assigned a name based on the header (the first line in the file). You can now access each individual column using a "$" to separate the two names:

> heisenberg$trial
[1] A A B B A B
Levels: A B
> heisenberg$mass
[1] 10.0 11.0  5.0  6.0 10.5  7.0
> heisenberg$velocity
[1] 12 14  8 10 13 11
>

If you are not sure what columns are contained in the variable you can use the names command:

> names(heisenberg)
[1] "trial"    "mass"     "velocity"

We will look at another example which is used throughout this tutorial: the data found in a spreadsheet located at http://cdiac.ornl.gov/ftp/ndp061a/trees91.wk1. A description of the data file is located at http://cdiac.ornl.gov/ftp/ndp061a/ndp061a.txt. The original data is given in an excel spreadsheet. It has been converted into a csv file, trees91.csv, by deleting the top set of rows and saving it as a "csv" file. This is an option to save within excel. (You should save the file on your computer.) It is a good idea to open this file in a spreadsheet and look at it. This will help you make sense of how R stores the data.

The data is used to indicate an estimate of biomass of ponderosa pine in a study performed by Dale W. Johnson, J. Timothy Ball, and Roger F. Walker, who are associated with the Biological Sciences Center, Desert Research Institute, P.O. Box 60220, Reno, NV 89506 and the Environmental and Resource Sciences College of Agriculture, University of Nevada, Reno, NV 89512. The data consists of 54 lines, and each line represents an observation. Each observation includes measurements and markers for 28 different measurements of a given tree. For example, the first number in each row is a number, either 1, 2, 3, or 4, which signifies a different level of exposure to carbon dioxide. The sixth number in every row is an estimate of the biomass of the stems of a tree. Note that the very first line in the file is a list of labels used for the different columns of data.

The data can be read into a variable called "tree" using the read.csv command:

> tree <- read.csv(file="trees91.csv",header=TRUE,sep=",")

This will create a new variable called "tree." If you type in "tree" at the prompt and hit enter, all of the numbers stored in the variable will be printed out. Try this, and you should see that it is difficult to make any sense out of the numbers.

There are many different ways to keep track of data in R. When you use the read.csv command R uses a specific kind of variable called a "data frame." All of the data are stored within the data frame as separate columns. If you are not sure what kind of variable you have then you can use the attributes command. This will list all of the things that R uses to describe the variable:

> attributes(tree)
$names
 [1] "C"      "N"      "CHBR"   "REP"    "LFBM"   "STBM"   "RTBM"   "LFNCC" 
 [9] "STNCC"  "RTNCC"  "LFBCC"  "STBCC"  "RTBCC"  "LFCACC" "STCACC" "RTCACC"
[17] "LFKCC"  "STKCC"  "RTKCC"  "LFMGCC" "STMGCC" "RTMGCC" "LFPCC"  "STPCC" 
[25] "RTPCC"  "LFSCC"  "STSCC"  "RTSCC" 

$class
[1] "data.frame"

$row.names
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
[16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
[31] "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44" "45"
[46] "46" "47" "48" "49" "50" "51" "52" "53" "54"

>

The first thing that R stores is a list of names which refer to each column of the data. For example, the first column is called "C" and the second column is called "N." Tree is of type data.frame. Finally, the rows are numbered consecutively from 1 to 54. Each column has 54 numbers in it.

If you know that a variable is a data frame but are not sure what labels are used to refer to the different columns you can use the names command:

> names(tree)
 [1] "C"      "N"      "CHBR"   "REP"    "LFBM"   "STBM"   "RTBM"   "LFNCC" 
 [9] "STNCC"  "RTNCC"  "LFBCC"  "STBCC"  "RTBCC"  "LFCACC" "STCACC" "RTCACC"
[17] "LFKCC"  "STKCC"  "RTKCC"  "LFMGCC" "STMGCC" "RTMGCC" "LFPCC"  "STPCC" 
[25] "RTPCC"  "LFSCC"  "STSCC"  "RTSCC" 
>

If you want to work with the data in one of the columns you give the name of the data frame, a "$" sign, and the label assigned to the column. For example, the first column in tree can be called using "tree$C:"

> tree$C
 [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3
[39] 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4
>

Brief Note on Fixed Width Files

There are many ways to read data using R. We only give two examples: direct assignment and reading csv files. However, another way deserves a brief mention. It is common to come across data that is organized in flat files and delimited at preset locations on each line. This is often called a "fixed width file." The command to deal with these kinds of files is read.fwf. Examples of how to use this command are not given here, but if you would like more information on how to use this command enter the following command:

> help(read.fwf)

Basic Data Types

Numbers

The most basic way to store a number is to make an assignment of a single number:

> a <- 3
>

The "<-" tells R to take the number to the right of the symbol and store it in a variable whose name is given on the left. You can also use the "=" symbol. When you make an assignment R does not print out any information. If you want to see what value a variable has, just type the name of the variable on a line and press the enter key:

> a
[1] 3
>

This allows you to do all sorts of basic operations and save the numbers:

> b <- sqrt(a*a+3)
> b
[1] 3.464102
>

If you want to get a list of the variables that you have defined in a particular session you can list them all using the ls command:

> ls()
[1] "a" "b"
>

You are not limited to just saving a single number. You can create a list (also called a "vector") using the c command:

> a <- c(1,2,3,4,5)
> a
[1] 1 2 3 4 5
> a+1
[1] 2 3 4 5 6
> mean(a)
[1] 3
> var(a)
[1] 2.5
>

You can get access to particular entries in the vector in the following manner:

> a <- c(1,2,3,4,5)
> a[1]
[1] 1
> a[2]
[1] 2
> a[0]
numeric(0)
> a[5]
[1] 5
> a[6]

[1] NA
>

Note that the zero entry is used to indicate how the data is stored. The first entry in the vector is the first number, and if you try to get a number past the last number you get "NA."

Strings

You are not limited to just storing numbers. You can also store strings. A string is specified by using quotes. Both single and double quotes will work:

> a <- "hello"
> a
[1] "hello"
> b <- c("hello","there")
> b
[1] "hello" "there"
> b[1]
[1] "hello"
>

Factors

Another important way R can store data is as a factor. Often times an experiment includes trials for different levels of some explanatory variable. For example, when looking at the impact of carbon dioxide on the growth rate of a tree you might try to observe how different trees grow when exposed to different preset concentrations of carbon dioxide. The different levels are also called factors. Assuming you know how to read in a file, we will look at the data file given in the first chapter. Several of the variables in the file are factors:

> summary(tree$CHBR)
 A1  A2  A3  A4  A5  A6  A7  B1  B2  B3  B4  B5  B6  B7  C1  C2  C3  C4  C5  C6 
  3   1   1   3   1   3   1   1   3   3   3   3   3   3   1   3   1   3   1   1 
 C7 CL6 CL7  D1  D2  D3  D4  D5  D6  D7 
  1   1   1   1   1   3   1   1   1   1 
>

Because the set of options given in the data file corresponding to the "CHBR" column are not all numbers, R automatically assumes that it is a factor. When you use summary on a factor it does not print out the five point summary; rather it prints out the possible values and the frequency with which they occur.

In this data set several of the columns are factors, but the researchers used numbers to indicate the different levels. For example, the first column, labeled "C," is a factor. Each tree was grown in an environment with one of four different possible levels of carbon dioxide. The researchers quite sensibly labeled these four environments as 1, 2, 3, and 4. Unfortunately, R cannot determine that these are factors and must assume that they are regular numbers. This is a common problem and there is a way to tell R to treat the "C" column as a set of factors. You specify that a variable is a factor using the factor command. In the following example we convert tree$C into a factor:

> tree$C
 [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3
[39] 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4
> summary(tree$C)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   2.000   2.000   2.519   3.000   4.000 
> tree$C <- factor(tree$C)
> tree$C
 [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3
[39] 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4
Levels: 1 2 3 4
> summary(tree$C)
 1  2  3  4 
 8 23 10 13 
> levels(tree$C)
[1] "1" "2" "3" "4"
>

Once a vector is converted into a set of factors R treats it in a different manner than when it is a set of numbers. A set of factors has a discrete set of possible values, and it does not make sense to try to find averages or other numerical descriptions. One thing that is important is the number of times that each factor appears, called their "frequencies," which is printed using the summary command.

Data Frames

Another way that information is stored is in data frames. This is a way to take many vectors of different types and store them in the same variable. The vectors can be of all different types. For example, a data frame may contain many lists, and each list might be a list of factors, strings, or numbers. There are different ways to create and manipulate data frames. Most are beyond the scope of this introduction. They are only mentioned here to offer a more complete description. Please see the first chapter for more information on data frames.

One example of how to create a data frame is given below:

> a <- c(1,2,3,4)
> b <- c(2,4,6,8)
> levels <- factor(c("A","B","A","B"))
> bubba <- data.frame(first=a, second=b, f=levels)
> bubba
  first second f
1     1      2 A
2     2      4 B
3     3      6 A
4     4      8 B
> summary(bubba)
     first          second     f    
 Min.   :1.00   Min.   :2.0   A:2  
 1st Qu.:1.75   1st Qu.:3.5   B:2  
 Median :2.50   Median :5.0        
 Mean   :2.50   Mean   :5.0        
 3rd Qu.:3.25   3rd Qu.:6.5        
 Max.   :4.00   Max.   :8.0        
> bubba$first
[1] 1 2 3 4
> bubba$second
[1] 2 4 6 8
> bubba$f
[1] A B A B
Levels: A B
>

Tables

Another common way to store information is in a table. Here we look at how to define both one way and two way tables. We only look at how to create and define tables; the functions used in the analysis of proportions are examined in another chapter.

One Way Tables

The first example is for a one way table. One way tables are not the most interesting example, but they are a good place to start. One way to create a table is using the table command. The argument it takes is a vector of factors, and it calculates the frequency that each factor occurs. Here is an example of how to create a one way table:

> a <- factor(c("A","A","B","A","B","B","C","A","C"))
> results <- table(a)
> results
a
A B C 
4 3 2 
> attributes(results)
$dim
[1] 3

$dimnames
$dimnames$a
[1] "A" "B" "C"

$class
[1] "table"

> summary(results)
Number of cases in table: 9 
Number of factors: 1 
>

If you know the number of occurrences for each factor then it is possible to create the table directly, but the process is, unfortunately, a bit more convoluted. There is an easier way to define one way tables (a table with one row), but it does not extend easily to two way tables (tables with more than one row). You must first create a matrix of numbers. A matrix is like a vector in that it is a list of numbers, but it is different in that you can have both rows and columns of numbers. For example, in our example above the number of occurrences of "A" is 4, the number of occurrences of "B" is 3, and the number of occurrences of "C" is 2. We will create one row of numbers. The first column contains a 4, the second column contains a 3, and the third column contains a 2:

> occur <- matrix(c(4,3,2),ncol=3,byrow=TRUE)
> occur
     [,1] [,2] [,3]
[1,]    4    3    2

At this point the variable "occur" is a matrix with one row and three columns of numbers. To dress it up and use it as a table we would like to give it labels for each column just like in the previous example. Once that is done we convert the matrix to a table using the as.table command:

> colnames(occur) <- c("A","B","C")
> occur
     A B C
[1,] 4 3 2
> occur <- as.table(occur)
> occur
  A B C
A 4 3 2
> attributes(occur)
$dim
[1] 1 3

$dimnames
$dimnames[[1]]
[1] "A"

$dimnames[[2]]
[1] "A" "B" "C"

$class
[1] "table"

>

Two Way Tables

If you want to add rows to your table just add another vector to the argument of the table command. In the example below we have two questions. In the first question the responses are labeled "Never," "Sometimes," or "Always." In the second question the responses are labeled "Yes," "No," or "Maybe." The set of vectors "a" and "b" contain the response for each measurement. The third item in "a" is how the third person responded to the first question, and the third item in "b" is how the third person responded to the second question.

> a <- c("Sometimes","Sometimes","Never","Always","Always","Sometimes","Sometimes","Never")
> b <- c("Maybe","Maybe","Yes","Maybe","Maybe","No","Yes","No")
> results <- table(a,b)
> results
           b
a           Maybe No Yes
  Always        2  0   0
  Never         0  1   1
  Sometimes     2  1   1
>

The table command allows us to do a very quick calculation, and we can immediately see that two people who said "Maybe" to the first question also said "Sometimes" to the second question.

Just as in the case with one way tables it is possible to manually enter two way tables. The procedure is exactly the same as above except that we now have more than one row. We give a brief example below to demonstrate how to enter a two way table that includes a breakdown of a group of people by both their gender and whether or not they smoke. You enter all of the data as one long list but tell R to break it up into some number of columns:

> sexsmoke <- matrix(c(70,120,65,140),ncol=2,byrow=TRUE)
> rownames(sexsmoke) <- c("male","female")
> colnames(sexsmoke) <- c("smoke","nosmoke")
> sexsmoke <- as.table(sexsmoke)
> sexsmoke
       smoke nosmoke
male      70     120
female    65     140
>

Basic Operations and Numerical Descriptions

Basic Operations

Once you have a vector (or a list of numbers) in memory most basic operations are available. Most of the basic operations will act on a whole vector and can be used to quickly perform a large number of calculations with a single command. There is one thing to note: if you perform an operation on more than one vector it is often necessary that the vectors all contain the same number of entries.

Here we first define a vector which we will call "a" and will look at how to add and subtract constant numbers from all of the numbers in the vector. First, the vector will contain the numbers 1, 2, 3, and 4. We then see how to add 5 to each of the numbers, subtract 10 from each of the numbers, multiply each number by 4, and divide each number by 5.

> a <- c(1,2,3,4)
> a
[1] 1 2 3 4
> a + 5
[1] 6 7 8 9
> a - 10
[1] -9 -8 -7 -6
> a*4
[1]  4  8 12 16
> a/5
[1] 0.2 0.4 0.6 0.8
>

We can save the results in another vector called "b:"

> b <- a - 10
> b
[1] -9 -8 -7 -6
>

If you want to take the square root, find e raised to each number, the logarithm, etc., then the usual commands can be used:

> sqrt(a)
[1] 1.000000 1.414214 1.732051 2.000000
> exp(a)
[1]  2.718282  7.389056 20.085537 54.598150
> log(a)
[1] 0.0000000 0.6931472 1.0986123 1.3862944
> exp(log(a))
[1] 1 2 3 4
>

By combining operations and using parentheses you can make more complicated expressions:

> c <- (a + sqrt(a))/(exp(2)+1)
> c
[1] 0.2384058 0.4069842 0.5640743 0.7152175
>

Note that you can do the same operations with vector arguments. For example, to add the elements in vector a to the elements in vector b use the following command:

> a + b
[1] -8 -6 -4 -2
>

The operation is performed on an element by element basis. Note this is true for almost all of the basic functions. So you can bring together all kinds of complicated expressions:

> a*b
[1]  -9 -16 -21 -24
> a/b
[1] -0.1111111 -0.2500000 -0.4285714 -0.6666667
> (a+3)/(sqrt(1-b)*2-1)
[1] 0.7512364 1.0000000 1.2884234 1.6311303
>

You need to be careful of one thing. When you do operations on vectors they are performed on an element by element basis. One ramification of this is that all of the vectors in an expression must be the same length. If the lengths of the vectors differ then you may get an error message, or worse, a warning message and unpredictable results:

> a <- c(1,2,3)
> b <- c(10,11,12,13)
> a+b
[1] 11 13 15 14
Warning message:
longer object length
        is not a multiple of shorter object length in: a + b
>

As you work in R and create new vectors it can be easy to lose track of what variables you have defined. To get a list of all of the variables that have been defined use the ls() command:

> ls()
[1] "a"            "b"            "bubba"        "c"            "last.warning"
[6] "tree"         "trees"
>

Finally, you should keep in mind that the basic operations almost always work on an element by element basis. There are rare exceptions to this general rule. For example, if you look at the minimum of two vectors using the min command you will get the minimum of all of the numbers. There is a special command, called pmin, that may be the command you want in some circumstances:

> a <- c(1,-2,3,-4)
> b <- c(-1,2,-3,4)
> min(a,b)
[1] -4
> pmin(a,b)
[1] -1 -2 -3 -4
>

Basic Numerical Descriptions

Given a vector of numbers there are some basic commands to make it easier to get some of the basic numerical descriptions of a set of numbers. Here we assume that you can read in the tree data that was discussed in a previous chapter. It is assumed that it is stored in a variable called "tree:"

> tree <- read.csv(file="trees91.csv",header=TRUE,sep=",")
> names(tree)
 [1] "C"      "N"      "CHBR"   "REP"    "LFBM"   "STBM"   "RTBM"   "LFNCC" 
 [9] "STNCC"  "RTNCC"  "LFBCC"  "STBCC"  "RTBCC"  "LFCACC" "STCACC" "RTCACC"
[17] "LFKCC"  "STKCC"  "RTKCC"  "LFMGCC" "STMGCC" "RTMGCC" "LFPCC"  "STPCC" 
[25] "RTPCC"  "LFSCC"  "STSCC"  "RTSCC" 
>

Each column in the data frame can be accessed as a vector. For example, the numbers associated with the leaf biomass (LFBM) can be found using "tree$LFBM:"

> tree$LFBM
 [1] 0.430 0.400 0.450 0.820 0.520 1.320 0.900 1.180 0.480 0.210 0.270 0.310
[13] 0.650 0.180 0.520 0.300 0.580 0.480 0.580 0.580 0.410 0.480 1.760 1.210
[25] 1.180 0.830 1.220 0.770 1.020 0.130 0.680 0.610 0.700 0.820 0.760 0.770
[37] 1.690 1.480 0.740 1.240 1.120 0.750 0.390 0.870 0.410 0.560 0.550 0.670
[49] 1.260 0.965 0.840 0.970 1.070 1.220
>

The following commands can be used to get the mean, median, quantiles, minimum, maximum, variance, and standard deviation of a set of numbers:

> mean(tree$LFBM)
[1] 0.7649074
> median(tree$LFBM)
[1] 0.72
> quantile(tree$LFBM)
    0%    25%    50%    75%   100% 
0.1300 0.4800 0.7200 1.0075 1.7600 
> min(tree$LFBM)
[1] 0.13
> max(tree$LFBM)
[1] 1.76
> var(tree$LFBM)
[1] 0.1429382
> sd(tree$LFBM)
[1] 0.3780717
>

Finally, there is one command that will print out the min, max, mean, median, and quantiles:

> summary(tree$LFBM)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1300  0.4800  0.7200  0.7649  1.0080  1.7600 
>
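If you need percentiles other than the default quartiles, the quantile command also accepts a probs option giving the probabilities you want. As a brief sketch, the following asks for the 10th and 90th percentiles of the leaf biomass:

> quantile(tree$LFBM,probs=c(0.1,0.9))   # 10th and 90th percentiles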

The summary command is especially nice because if you give it a data frame it will print out the summary for every vector in the data frame:

> summary(tree)
       C               N            CHBR          REP        
 Min.   :1.000   Min.   :1.000   A1     : 3   Min.   : 1.00  
 1st Qu.:2.000   1st Qu.:1.000   A4     : 3   1st Qu.: 9.00  
 Median :2.000   Median :2.000   A6     : 3   Median :14.00  
 Mean   :2.519   Mean   :1.926   B2     : 3   Mean   :13.05  
 3rd Qu.:3.000   3rd Qu.:3.000   B3     : 3   3rd Qu.:20.00  
 Max.   :4.000   Max.   :3.000   B4     : 3   Max.   :20.00  
                                 (Other):36   NA's   :11.00  
      LFBM             STBM             RTBM            LFNCC      
 Min.   :0.1300   Min.   :0.0300   Min.   :0.1200   Min.   :0.880  
 1st Qu.:0.4800   1st Qu.:0.1900   1st Qu.:0.2825   1st Qu.:1.312  
 Median :0.7200   Median :0.2450   Median :0.4450   Median :1.550  
 Mean   :0.7649   Mean   :0.2883   Mean   :0.4662   Mean   :1.560  
 3rd Qu.:1.0075   3rd Qu.:0.3800   3rd Qu.:0.5500   3rd Qu.:1.788  
 Max.   :1.7600   Max.   :0.7200   Max.   :1.5100   Max.   :2.760  
      STNCC            RTNCC            LFBCC           STBCC      
 Min.   :0.3700   Min.   :0.4700   Min.   :25.00   Min.   :14.00  
 1st Qu.:0.6400   1st Qu.:0.6000   1st Qu.:34.00   1st Qu.:17.00  
 Median :0.7850   Median :0.7500   Median :37.00   Median :18.00  
 Mean   :0.7872   Mean   :0.7394   Mean   :36.96   Mean   :18.80  
 3rd Qu.:0.9350   3rd Qu.:0.8100   3rd Qu.:41.00   3rd Qu.:20.00  
 Max.   :1.2900   Max.   :1.5500   Max.   :48.00   Max.   :27.00  
      RTBCC           LFCACC           STCACC           RTCACC      
 Min.   :15.00   Min.   :0.2100   Min.   :0.1300   Min.   :0.1100  
 1st Qu.:19.00   1st Qu.:0.2600   1st Qu.:0.1600   1st Qu.:0.1600  
 Median :20.00   Median :0.2900   Median :0.1700   Median :0.1650  
 Mean   :21.43   Mean   :0.2869   Mean   :0.1774   Mean   :0.1654  
 3rd Qu.:23.00   3rd Qu.:0.3100   3rd Qu.:0.1875   3rd Qu.:0.1700  
 Max.   :41.00   Max.   :0.3600   Max.   :0.2400   Max.   :0.2400  
      LFKCC            STKCC           RTKCC           LFMGCC      
 Min.   :0.6500   Min.   :0.870   Min.   :0.330   Min.   :0.0700  
 1st Qu.:0.8100   1st Qu.:0.940   1st Qu.:0.400   1st Qu.:0.1000  
 Median :0.9000   Median :1.055   Median :0.475   Median :0.1200  
 Mean   :0.9053   Mean   :1.105   Mean   :0.473   Mean   :0.1109  
 3rd Qu.:0.9900   3rd Qu.:1.210   3rd Qu.:0.520   3rd Qu.:0.1300  
 Max.   :1.1800   Max.   :1.520   Max.   :0.640   Max.   :0.1400  
 NA's   :1.0000                                                   
      STMGCC           RTMGCC            LFPCC            STPCC      
 Min.   :0.100   Min.   :0.04000   Min.   :0.1500   Min.   :0.1500  
 1st Qu.:0.110   1st Qu.:0.06000   1st Qu.:0.2000   1st Qu.:0.2200  
 Median :0.130   Median :0.07000   Median :0.2400   Median :0.2800  
 Mean   :0.135   Mean   :0.06648   Mean   :0.2381   Mean   :0.2707  
 3rd Qu.:0.150   3rd Qu.:0.07000   3rd Qu.:0.2700   3rd Qu.:0.3175  
 Max.   :0.190   Max.   :0.09000   Max.   :0.3100   Max.   :0.4100  
      RTPCC            LFSCC            STSCC            RTSCC      
 Min.   :0.1000   Min.   :0.0900   Min.   :0.1400   Min.   :0.0900  
 1st Qu.:0.1300   1st Qu.:0.1325   1st Qu.:0.1600   1st Qu.:0.1200  
 Median :0.1450   Median :0.1600   Median :0.1800   Median :0.1300  
 Mean   :0.1465   Mean   :0.1661   Mean   :0.1817   Mean   :0.1298  
 3rd Qu.:0.1600   3rd Qu.:0.1875   3rd Qu.:0.2000   3rd Qu.:0.1475  
 Max.   :0.2100   Max.   :0.2600   Max.   :0.2800   Max.   :0.1700  
>

Basic Probability Distributions

We look at some of the basic operations associated with probability distributions. There are a large number of probability distributions available, but we only look at a few. If you would like to know what distributions are available you can do a search using the command help.search("distribution"). Here we give details about the commands associated with the normal distribution and briefly mention the commands for other distributions. The functions for the different distributions are very similar, and the differences are noted below.

The Normal Distribution

There are four functions that can be used to generate the values associated with the normal distribution. You can get a full list of them and their options using the help command:

> help(Normal)

The first function we look at is dnorm. Given a set of values it returns the height of the probability distribution at each point. If you only give the points it assumes you want to use a mean of zero and a

standard deviation of one. There are options to use different values for the mean and standard deviation, though:

> dnorm(0)
[1] 0.3989423
> dnorm(0)*sqrt(2*pi)
[1] 1
> dnorm(0,mean=4)
[1] 0.0001338302
> dnorm(0,mean=4,sd=10)
[1] 0.03682701
> v <- c(0,1,2)
> dnorm(v)
[1] 0.39894228 0.24197072 0.05399097
> x <- seq(-20,20,by=.1)
> y <- dnorm(x)
> plot(x,y)
> y <- dnorm(x,mean=2.5,sd=0.1)
> plot(x,y)

The second function we examine is pnorm. Given a number or a list it computes the probability that a normally distributed random number will be less than that number. This function also goes by the rather ominous title of the "Cumulative Distribution Function." It accepts the same options as dnorm:

> pnorm(0)
[1] 0.5
> pnorm(1)
[1] 0.8413447
> pnorm(0,mean=2)
[1] 0.02275013
> pnorm(0,mean=2,sd=3)
[1] 0.2524925
> v <- c(0,1,2)
> pnorm(v)
[1] 0.5000000 0.8413447 0.9772499
> x <- seq(-20,20,by=.1)
> y <- pnorm(x)
> plot(x,y)
> y <- pnorm(x,mean=3,sd=4)
> plot(x,y)

The next function we look at is qnorm, which is the inverse of pnorm. The idea behind qnorm is that you give it a probability, and it returns the number whose cumulative distribution matches the probability. For example, if you have a normally distributed random variable with mean zero and standard deviation one, then if you give the function a probability it returns the associated Z score:

> qnorm(0.5)
[1] 0
> qnorm(0.5,mean=1)
[1] 1
> qnorm(0.5,mean=1,sd=2)
[1] 1
> qnorm(0.5,mean=2,sd=2)
[1] 2
> qnorm(0.5,mean=2,sd=4)

[1] 2
> qnorm(0.25,mean=2,sd=2)
[1] 0.6510205
> qnorm(0.333)
[1] -0.4316442
> qnorm(0.333,sd=3)
[1] -1.294933
> qnorm(0.75,mean=5,sd=2)
[1] 6.34898
> v <- c(0.1,0.3,0.75)
> qnorm(v)
[1] -1.2815516 -0.5244005  0.6744898
> x <- seq(0,1,by=.05)
> y <- qnorm(x)
> plot(x,y)
> y <- qnorm(x,mean=3,sd=2)
> plot(x,y)
> y <- qnorm(x,mean=3,sd=0.1)
> plot(x,y)

The last function we examine is the rnorm function, which can generate random numbers whose distribution is normal. The argument that you give it is the number of random numbers that you want, and it has optional arguments to specify the mean and standard deviation:

> rnorm(4)
[1]  1.2387271 -0.2323259 -1.2003081 -1.6718483
> rnorm(4,mean=3)
[1] 2.633080 3.617486 2.038861 2.601933
> rnorm(4,mean=3,sd=3)
[1] 4.580556 2.974903 4.756097 6.395894
> rnorm(4,mean=3,sd=3)
[1]  3.000852  3.714180 10.032021  3.295667
> y <- rnorm(200)
> hist(y)
> y <- rnorm(200,mean=-2)
> hist(y)
> y <- rnorm(200,mean=-2,sd=4)
> hist(y)
> qqnorm(y)
> qqline(y)

The t Distribution

There are four functions that can be used to generate the values associated with the t distribution. You can get a full list of them and their options using the help command:

> help(TDist)

These commands work just like the commands for the normal distribution. One difference is that the commands assume that the values are normalized to mean zero and standard deviation one, so you have to use a little algebra to use these functions in practice. The other difference is that you have to specify the number of degrees of freedom. The commands follow the same kind of naming convention, and the names of the commands are dt, pt, qt, and rt.

A few examples are given below to show how to use the different commands. First we have the distribution function, dt:

> x <- seq(-20,20,by=.5)
> y <- dt(x,df=10)
> plot(x,y)
> y <- dt(x,df=50)
> plot(x,y)

Next we have the cumulative probability distribution function:

> pt(-3,df=10)
[1] 0.006671828
> pt(3,df=10)
[1] 0.9933282
> 1-pt(3,df=10)
[1] 0.006671828
> pt(3,df=20)
[1] 0.996462
> x <- c(-3,-4,-2,-1)
> pt((mean(x)-2)/sd(x),df=20)
[1] 0.001165548
> pt((mean(x)-2)/sd(x),df=40)
[1] 0.000603064

Next we have the inverse cumulative probability distribution function:

> qt(0.05,df=10)
[1] -1.812461
> qt(0.95,df=10)
[1] 1.812461
> qt(0.05,df=20)
[1] -1.724718
> qt(0.95,df=20)
[1] 1.724718
> v <- c(0.005,.025,.05)
> qt(v,df=253)
[1] -2.595401 -1.969385 -1.650899
> qt(v,df=25)
[1] -2.787436 -2.059539 -1.708141
>

Finally, random numbers can be generated according to the t distribution:

> rt(3,df=10)
[1] 0.9440930 2.1734365 0.6785262
> rt(3,df=20)
[1]  0.1043300 -1.4682198  0.0715013
> rt(3,df=20)
[1]  0.8023832 -0.4759780 -1.0546125

The Binomial Distribution

There are four functions that can be used to generate the values associated with the binomial distribution. You can get a full list of them and their options using the help command:

> help(Binomial)

These commands work just like the commands for the normal distribution. The binomial distribution requires two extra parameters, the number of trials and the probability of success for a single trial. The commands follow the same kind of naming convention, and the names of the commands are dbinom, pbinom, qbinom, and rbinom.

A few examples are given below to show how to use the different commands. First we have the distribution function, dbinom:

> x <- seq(0,50,by=1)
> y <- dbinom(x,50,0.2)
> plot(x,y)
> y <- dbinom(x,50,0.6)
> plot(x,y)
> x <- seq(0,100,by=1)
> y <- dbinom(x,100,0.6)
> plot(x,y)

Next we have the cumulative probability distribution function:

> pbinom(24,50,0.5)
[1] 0.4438624
> pbinom(25,50,0.5)
[1] 0.5561376
> pbinom(25,51,0.5)
[1] 0.5
> pbinom(26,51,0.5)
[1] 0.610116
> pbinom(25,50,0.5)
[1] 0.5561376
> pbinom(25,50,0.25)
[1] 0.999962
> pbinom(25,500,0.25)
[1] 4.955658e-33

Next we have the inverse cumulative probability distribution function:

> qbinom(0.5,51,1/2)
[1] 25
> qbinom(0.25,51,1/2)
[1] 23
> pbinom(23,51,1/2)
[1] 0.2879247
> pbinom(22,51,1/2)
[1] 0.200531

Finally, random numbers can be generated according to the binomial distribution:

> rbinom(5,100,.2)
[1] 30 23 21 19 18

> rbinom(5,100,.7)
[1] 66 66 58 68 63
>

The Chi-Squared Distribution

There are four functions that can be used to generate the values associated with the Chi-Squared distribution. You can get a full list of them and their options using the help command:

> help(Chisquare)

These commands work just like the commands for the normal distribution. The first difference is that it is assumed that you have normalized the value so no mean can be specified. The other difference is that you have to specify the number of degrees of freedom. The commands follow the same kind of naming convention, and the names of the commands are dchisq, pchisq, qchisq, and rchisq.

A few examples are given below to show how to use the different commands. First we have the distribution function, dchisq:

> x <- seq(-20,20,by=.5)
> y <- dchisq(x,df=10)
> plot(x,y)
> y <- dchisq(x,df=12)
> plot(x,y)

Next we have the cumulative probability distribution function:

> pchisq(2,df=10)
[1] 0.003659847
> pchisq(3,df=10)
[1] 0.01857594
> 1-pchisq(3,df=10)
[1] 0.981424
> pchisq(3,df=20)
[1] 4.097501e-06
> x <- c(2,4,5,6)
> pchisq(x,df=20)
[1] 1.114255e-07 4.649808e-05 2.773521e-04 1.102488e-03

Next we have the inverse cumulative probability distribution function:

> qchisq(0.05,df=10)
[1] 3.940299
> qchisq(0.95,df=10)
[1] 18.30704
> qchisq(0.05,df=20)
[1] 10.85081
> qchisq(0.95,df=20)
[1] 31.41043
> v <- c(0.005,.025,.05)
> qchisq(v,df=253)
[1] 198.8161 210.8355 217.1713
> qchisq(v,df=25)
[1] 10.51965 13.11972 14.61141

Finally, random numbers can be generated according to the Chi-Squared distribution:

> rchisq(3,df=10)
[1] 16.80075 20.28412 12.39099
> rchisq(3,df=20)
[1] 17.838878  8.591936 17.486372
> rchisq(3,df=20)
[1] 11.19279 23.86907 24.81251

Basic Plots

We look at some of the ways R can display information graphically. This is a basic introduction to some of the basic plotting commands. In each of the topics that follow it is assumed that two different data sets, w1.dat and trees91.csv, have been read and defined using the same variables as in the first chapter. Both of these data sets come from the study discussed on the web site given in the first chapter. We assume that they are read using "read.csv" into variables "w1" and "tree:"

> w1 <- read.csv(file="w1.dat",sep=",",head=TRUE)
> names(w1)
[1] "vals"
> tree <- read.csv(file="trees91.csv",sep=",",head=TRUE)
> names(tree)
 [1] "C"      "N"      "CHBR"   "REP"    "LFBM"   "STBM"   "RTBM"   "LFNCC" 
 [9] "STNCC"  "RTNCC"  "LFBCC"  "STBCC"  "RTBCC"  "LFCACC" "STCACC" "RTCACC"
[17] "LFKCC"  "STKCC"  "RTKCC"  "LFMGCC" "STMGCC" "RTMGCC" "LFPCC"  "STPCC" 
[25] "RTPCC"  "LFSCC"  "STSCC"  "RTSCC" 
>

Strip Charts

A strip chart is the most basic type of plot available. It plots the data in order along a line with each data point represented as a box. Here we provide examples using the "w1" data frame mentioned at the top of this page, and the one column of the data is "w1$vals."

To create a strip chart of this data use the stripchart command:

> stripchart(w1$vals)

As you can see this is about as bare bones as you can get. There is no title nor axes labels. It only shows how the data looks if you were to put it all along one line and mark out a box at each point. If you would prefer to see which points are repeated you can specify that repeated points be stacked:

> stripchart(w1$vals,method="stack")

A variation on this is to have the boxes moved up and down so that there is more separation between them:

> stripchart(w1$vals,method="jitter")

If you do not want the boxes plotted in the horizontal direction you can plot them in the vertical direction:

> stripchart(w1$vals,vertical=TRUE)
> stripchart(w1$vals,vertical=TRUE,method="jitter")

Since you should always annotate your plots there are many different ways to add titles and labels. One way is within the stripchart command itself:

> stripchart(w1$vals,method="stack",
             main='Leaf BioMass in High CO2 Environment',
             xlab='BioMass of Leaves')

If you have a plot already and want to add a title, you can use the title command:

> title('Leaf BioMass in High CO2 Environment',xlab='BioMass of Leaves')

Note that this simply adds the title and labels and will write over the top of any titles or labels you already have.

Histograms

A histogram is a very common plot. It plots the frequencies with which data appears within certain ranges. Here we provide examples using the "w1" data frame mentioned at the top of this page, and the one column of data is "w1$vals."

To plot a histogram of the data use the "hist" command:

> hist(w1$vals)

As you can see R will automatically calculate the intervals to use. There are many options to determine how to break up the intervals. Here we look at just one way, varying the domain size and number of breaks. If you would like to know more about the other options check out the help page:

> help(hist)

You can specify the number of breaks to use using the breaks option. Here we look at the histogram for various numbers of breaks:

> hist(w1$vals,breaks=2)
> hist(w1$vals,breaks=4)
> hist(w1$vals,breaks=6)
> hist(w1$vals,breaks=8)
> hist(w1$vals,breaks=12)
>

You can also vary the size of the domain using the xlim option. This option takes a vector with two entries in it, the left value and the right value:

> hist(w1$vals,breaks=12,xlim=c(0,10))
> hist(w1$vals,breaks=12,xlim=c(-1,2))

> hist(w1$vals,breaks=12,xlim=c(0,2))
> hist(w1$vals,breaks=12,xlim=c(1,1.3))
> hist(w1$vals,breaks=12,xlim=c(0.9,1.3))
>

The options for adding titles and labels are exactly the same as for strip charts. You should always annotate your plots and there are many different ways to add titles and labels. One way is within the hist command itself:

> hist(w1$vals,
       main='Leaf BioMass in High CO2 Environment',
       xlab='BioMass of Leaves')

If you have a plot already and want to change or add a title, you can use the title command:

> title('Leaf BioMass in High CO2 Environment',xlab='BioMass of Leaves')

Note that this simply adds the title and labels and will write over the top of any titles or labels you already have.

It is not uncommon to add other kinds of plots to a histogram. For example, one of the options to the stripchart command is to add it to a plot that has already been drawn. For example, you might want to have a histogram with the strip chart drawn across the top. The addition of the strip chart might give you a better idea of the density of the data:

> hist(w1$vals,main='Leaf BioMass in High CO2 Environment',xlab='BioMass of Leaves',ylim=c(0,16))
> stripchart(w1$vals,add=TRUE,at=15.5)

Boxplots

A boxplot provides a graphical view of the median, quartiles, maximum, and minimum of a data set. Here we provide examples using two different data sets. The first is the "w1" data frame mentioned at the top of this page, and the one column of data is "w1$vals." The second is the "tree" data frame from the "trees91.csv" data file, which is also mentioned at the top of the page.

We first use the "w1" data set and look at the boxplot of this data set:

> boxplot(w1$vals)

Again, this is a very plain graph, and the title and labels can be specified in exactly the same way as in the stripchart and hist commands:

> boxplot(w1$vals,
          main='Leaf BioMass in High CO2 Environment',
          ylab='BioMass of Leaves')

Note that the default orientation is to plot the boxplot vertically. Because of this we used the ylab option to specify the axis label. There are a large number of options for this command. To see more of the options see the help page:

> help(boxplot)

As an example you can specify that the boxplot be plotted horizontally by specifying the horizontal option:

> boxplot(w1$vals,
          main='Leaf BioMass in High CO2 Environment',
          xlab='BioMass of Leaves',
          horizontal=TRUE)

The option to plot the box plot horizontally can be put to good use to display a box plot on the same image as a histogram. You need to specify the add option, specify where to put the box plot using the at option, and turn off the addition of axes using the axes option:

> hist(w1$vals,main='Leaf BioMass in High CO2 Environment',xlab='BioMass of Leaves',ylim=c(0,16))
> boxplot(w1$vals,horizontal=TRUE,at=15.5,add=TRUE,axes=FALSE)

If you are feeling really crazy you can take a histogram and add a box plot and a strip chart:

> hist(w1$vals,main='Leaf BioMass in High CO2 Environment',xlab='BioMass of Leaves',ylim=c(0,16))
> boxplot(w1$vals,horizontal=TRUE,at=16,add=TRUE,axes=FALSE)
> stripchart(w1$vals,add=TRUE,at=15)

Some people shell out good money to have this much fun.

For the second part on boxplots we will look at the second data frame, "tree," which comes from the "trees91.csv" file. To reiterate the discussion at the top of this page and the discussion in the data types chapter, we need to specify which columns are factors:

> tree <- read.csv(file="trees91.csv",sep=",",head=TRUE)
> tree$C <- factor(tree$C)
> tree$N <- factor(tree$N)

We can look at the boxplot of just the data for the stem biomass:

> boxplot(tree$STBM,
          main='Stem BioMass in Different CO2 Environments',
          ylab='BioMass of Stems')

That plot does not tell the whole story. It is for all of the trees, but the trees were grown in different kinds of environments. The boxplot command can be used to plot a separate box plot for each level. In this case the data is held in "tree$STBM," and the different levels are stored as factors in "tree$C." The command to create different boxplots is the following:

> boxplot(tree$STBM~tree$C)

Note that for the level called "2" there are four outliers which are plotted as little circles. There are many options to annotate your plot including different labels for each level. Please use the help(boxplot) command for more information.

Scatter Plots

A scatter plot provides a graphical view of the relationship between two sets of numbers. Here we provide examples using the "tree" data frame from the "trees91.csv" data file, which is mentioned at the top of the

page. In particular we look at the relationship between the stem biomass ("tree$STBM") and the leaf biomass ("tree$LFBM").

The command to plot each pair of points as an x coordinate and a y coordinate is "plot:"

> plot(tree$STBM,tree$LFBM)

It appears that there is a strong positive association between the biomass in the stems of a tree and the leaves of the tree. It appears to be a linear relationship. In fact, the correlation between these two sets of observations is quite high:

> cor(tree$STBM,tree$LFBM)
[1] 0.911595
>

Getting back to the plot, you should always annotate your graphs. The title and labels can be specified in exactly the same way as with the other plotting commands:

> plot(tree$STBM,tree$LFBM,
       main="Relationship Between Stem and Leaf Biomass",
       xlab="Stem Biomass",
       ylab="Leaf Biomass")

Normal QQ Plots

The final type of plot that we look at is the normal quantile plot. This plot is used to determine if your data is close to being normally distributed. You cannot be sure that the data is normally distributed, but you can rule it out if it is not normally distributed. Here we provide examples using the "w1" data frame mentioned at the top of this page, and the one column of data is "w1$vals."

The command to generate a normal quantile plot is qqnorm. You can give it one argument, the univariate data set of interest:

> qqnorm(w1$vals)

You can annotate the plot in exactly the same way as all of the other plotting commands given here:

> qqnorm(w1$vals,
         main="Normal Q-Q Plot of the Leaf Biomass",
         xlab="Theoretical Quantiles of the Leaf Biomass",
         ylab="Sample Quantiles of the Leaf Biomass")

After you create the normal quantile plot you can also add the theoretical line that the data should fall on if they were normally distributed:

> qqline(w1$vals)

In this example you should see that the data is not quite normally distributed. There are a few outliers, and it does not match up at the tails of the distribution.

Linear Least Squares Regression

Here we look at the most basic linear least squares regression. The main purpose is to provide an example of the basic commands. It is assumed that you know how to enter data or read data files, which is covered in the first chapter, and it is assumed that you are familiar with the different data types.

We will examine the interest rate for four year car loans, and the data that we use comes from the U.S. Federal Reserve's mean rates. We are looking at and plotting means. This, of course, is a very bad thing because it removes a lot of the variance and is misleading. The only reason that we are working with the data in this way is to provide an example of linear regression that does not use too many data points. Do not try this without a professional near you, and if a professional is not near you do not tell anybody you did this. They will laugh at you. People are mean, especially professionals.

The first thing to do is to specify the data. Here there are only five pairs of numbers so we can enter them in manually. Each of the five pairs consists of a year and the mean interest rate:

> year <- c(2000, 2001, 2002, 2003, 2004)
> rate <- c(9.34, 8.50, 7.62, 6.93, 6.60)

The next thing we do is take a look at the data. We first plot the data using a scatter plot and notice that it looks linear. To confirm our suspicions we then find the correlation between the year and the mean interest rates:

> plot(year,rate,
       main="Commercial Banks Interest Rate for 4 Year Car Loan",
       sub="http://www.federalreserve.gov/releases/g19/20050805/")
> cor(year,rate)
[1] -0.9880813

At this point we should be excited because associations that strong never happen in the real world unless you cook the books or work with averaged data. The next question is what straight line comes "closest" to the data? In this case we will use least squares regression as one way to determine the line.

Before we can find the least squares regression line we have to make some decisions. First we have to decide which is the explanatory and which is the response variable. Here, we arbitrarily pick the explanatory variable to be the year, and the response variable is the interest rate. This was chosen because it seems like the interest rate might change in time rather than time changing as the interest rate changes. (We could be wrong; finance is very confusing.)

The command to perform the least squares regression is the lm command. The command has many options, but we will keep it simple and not explore them here. If you are interested use the help(lm) command to learn more. Instead the only option we examine is the one necessary argument which specifies the relationship.

Since we specified that the interest rate is the response variable and the year is the explanatory variable this means that the regression line can be written in slope-intercept form:

rate = (slope) * year + (intercept)

The way that this relationship is defined in the lm command is that you write the vector containing the response variable, a tilde ("~"), and a vector containing the explanatory variable:

> fit <- lm(rate ~ year)
> fit

Call:

lm(formula = rate ~ year)

Coefficients:
(Intercept)         year  
   1419.208       -0.705  

When you make the call to lm it returns a variable with a lot of information in it. If you are just learning about least squares regression you are probably only interested in two things at this point, the slope and the y intercept. If you just type the name of the variable returned by lm it will print out this minimal information to the screen. (See above.)

If you would like to know what else is stored in the variable you can use the attributes command:

> attributes(fit)
$names
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "xlevels"       "call"          "terms"         "model"        

$class
[1] "lm"

One of the things you should notice is the coefficients variable within fit. You can print out the y intercept and slope by accessing this part of the variable:

> fit$coefficients[1]
(Intercept) 
   1419.208 
> fit$coefficients[[1]]
[1] 1419.208
> fit$coefficients[2]
  year 
-0.705 
> fit$coefficients[[2]]
[1] -0.705

Note that if you just want to get the number you should use two square braces. So if you want to get an estimate of the interest rate in the year 2015 you can use the formula for a line:

> fit$coefficients[[2]]*2015+fit$coefficients[[1]]
[1] -1.367

So if you just wait long enough, the banks will pay you to take a car!

A better use for this formula would be to calculate the residuals and plot them:

> res <- rate - (fit$coefficients[[2]]*year+fit$coefficients[[1]])
> res
[1]  0.132 -0.003 -0.178 -0.163  0.212
> plot(year,res)

That is a bit messy, but fortunately there is an easier way to get the residuals:

> residuals(fit)
     1      2      3      4      5 
 0.132 -0.003 -0.178 -0.163  0.212 
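As an aside, the estimate for the year 2015 computed above can also be found without typing the formula for the line yourself, by handing the fitted variable to the predict command together with a data frame holding the new year. This is a sketch using the same hypothetical year as above:

> predict(fit,data.frame(year=2015))   # same estimate as the formula above
     1 
-1.367 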

If you want to plot the regression line on the same plot as your scatter plot you can use the abline function along with your variable fit:

> plot(year,rate,
       main="Commercial Banks Interest Rate for 4 Year Car Loan",
       sub="http://www.federalreserve.gov/releases/g19/20050805/")
> abline(fit)

Finally, as a teaser for the kinds of analyses you might see later, you can get the results of an F test by asking R for a summary of the fit variable:

> summary(fit)

Call:
lm(formula = rate ~ year)

Residuals:
     1      2      3      4      5 
 0.132 -0.003 -0.178 -0.163  0.212 

Coefficients:

              Estimate Std. Error t value Pr(>|t|)   
(Intercept) 1419.20800  126.94957   11.18  0.00153 **
year          -0.70500    0.06341  -11.12  0.00156 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2005 on 3 degrees of freedom
Multiple R-Squared: 0.9763,     Adjusted R-squared: 0.9684 
F-statistic: 123.6 on 1 and 3 DF,  p-value: 0.001559

Calculating Confidence Intervals

Here we look at some examples of calculating confidence intervals. The examples are for both normal and t distributions. We assume that you can enter data and know the commands associated with basic probability.

1. Calculating a Confidence Interval From a Normal Distribution
2. Calculating a Confidence Interval From a t Distribution
3. Calculating Many Confidence Intervals From a t Distribution

Calculating a Confidence Interval From a Normal Distribution

Here we will look at a fictitious example. We will make some assumptions for what we might find in an experiment and find the resulting confidence interval using a normal distribution. Here we assume that the sample mean is 5, the standard deviation is 2, and the sample size is 20. In the example below we will use a 95% confidence level and wish to find the confidence interval. The commands to find the confidence interval in R are the following:

> a <- 5
> s <- 2
> n <- 20
> error <- qnorm(0.975)*s/sqrt(n)
> left <- a-error
> right <- a+error
> left
[1] 4.123477
> right
[1] 5.876523
>

The true mean has a probability of 95% of being in the interval between 4.12 and 5.88.

Calculating a Confidence Interval From a t Distribution

Calculating the confidence interval when using a t test is similar to using a normal distribution. The only difference is that we use the command associated with the t distribution rather than the normal distribution. Here we repeat the procedures above, but we will assume that we are working with a sample standard deviation rather than an exact standard deviation.

Again we assume that the sample mean is 5, the sample standard deviation is 2, and the sample size is 20. We use a 95% confidence level and wish to find the confidence interval. The commands to find the confidence interval in R are the following:

> a <- 5
> s <- 2
> n <- 20
> error <- qt(0.975,df=n-1)*s/sqrt(n)
> left <- a-error
> right <- a+error
> left
[1] 4.063971
> right
[1] 5.936029
>

The true mean has a probability of 95% of being in the interval between 4.06 and 5.94.
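Note that the t interval is slightly wider than the normal interval found in the previous section. The reason is that the t quantile is larger than the corresponding normal quantile, which you can check directly:

> qnorm(0.975)
[1] 1.959964
> qt(0.975,df=19)
[1] 2.093024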

We now look at an example where we have a univariate data set and want to find the 95% confidence interval for the mean. In this example we use one of the data sets given in the data input chapter. We use the w1 data set:

> w1 <- read.csv(file="w1.dat",sep=",",head=TRUE)
> summary(w1)
      vals      
 Min.   :0.130  
 1st Qu.:0.480  
 Median :0.720  
 Mean   :0.765  
 3rd Qu.:1.008  
 Max.   :1.760  
> length(w1$vals)
[1] 54
> mean(w1$vals)
[1] 0.765
> sd(w1$vals)
[1] 0.3781222

We can now calculate an error for the mean:

> error <- qt(0.975,df=length(w1$vals)-1)*sd(w1$vals)/sqrt(length(w1$vals))
> error
[1] 0.1032075

The confidence interval is found by adding and subtracting the error from the mean:

> left <- mean(w1$vals)-error
> right <- mean(w1$vals)+error
> left
[1] 0.6617925
> right
[1] 0.8682075
>

There is a 95% probability that the true mean is between 0.66 and 0.87.

Calculating Many Confidence Intervals From a t Distribution

Suppose that you want to find the confidence intervals for many tests. This is a common task and most software packages will allow you to do this. We have three different sets of results:

Comparison 1
           Mean   Std. Dev.   Number (pop.)
Group I    10     3           300
Group II   10.5   2.5         230

Comparison 2
           Mean   Std. Dev.   Number (pop.)
Group I    12     4           210
Group II   13     5.3         340

Comparison 3
           Mean   Std. Dev.   Number (pop.)
Group I    30     4.5         420
Group II   28.5   3           400

For each of these comparisons we want to calculate the associated confidence interval for the difference of the means. For each comparison there are two groups. We will refer to group one as the group whose results are in the first row of each comparison above. We will refer to group two as the group whose results are in the second row of each comparison above. Before we can do that we must first compute a standard error and a t score. We will find general formulae, which is necessary in order to do all three calculations at once.

We assume that the means for the first group are defined in a variable called m1. The means for the second group are defined in a variable called m2. The standard deviations for the first group are in a variable called sd1. The standard deviations for the second group are in a variable called sd2. The number of samples for the first group are in a variable called num1. Finally, the number of samples for the second group are in a variable called num2. With these definitions the standard error is the square root of (sd1^2)/num1 + (sd2^2)/num2. The R commands to do this can be found below:

> m1 <- c(10,12,30)
> m2 <- c(10.5,13,28.5)
> sd1 <- c(3,4,4.5)
> sd2 <- c(2.5,5.3,3)
> num1 <- c(300,210,420)
> num2 <- c(230,340,400)
> se <- sqrt(sd1*sd1/num1+sd2*sd2/num2)
> error <- qt(0.975,df=pmin(num1,num2)-1)*se

To see the values just type in the variable name on a line alone:

> m1
[1] 10 12 30
> m2
[1] 10.5 13.0 28.5
> sd1
[1] 3.0 4.0 4.5
> sd2
[1] 2.5 5.3 3.0
> num1
[1] 300 210 420
> num2
[1] 230 340 400
> se
[1] 0.2391107 0.3985074 0.2659216
> error
[1] 0.4711382 0.7856092 0.5227825

Now we need to define the confidence interval around the assumed differences. Just as in the case of finding the p values in the previous chapter we have to use the pmin command to get the number of degrees of freedom. In this case the null hypotheses are for a difference of zero, and we use a 95% confidence interval:

> left <- (m1-m2)-error
> right <- (m1-m2)+error
> left
[1] -0.9711382 -1.7856092  0.9772175
> right
[1] -0.02886177 -0.21439076  2.02278249
>

This gives the confidence intervals for each of the three tests. For example, in the first experiment the 95% confidence interval is between -0.97 and -0.03.

Calculating p Values

Here we look at some examples of calculating p values. The examples are for both normal and t distributions. We assume that you can enter data and know the commands associated with basic probability.

1. Calculating a Single p Value From a Normal Distribution
2. Calculating a Single p Value From a t Distribution
3. Calculating Many p Values From a t Distribution

Calculating a Single p Value From a Normal Distribution

We look at the steps necessary to calculate the p value for a particular test. In the interest of simplicity we only look at a two sided test, and we focus on one example. Here we want to show that the mean is not close to a fixed value, a:

H0: μx = a,
Ha: μx ≠ a.

The p value is calculated for a particular sample mean. Here we assume that we obtained a sample mean, x, and want to find its p value. It is the probability that we would obtain a given sample mean that is greater than the absolute value of its Z score or less than the negative of the absolute value of its Z score.

For the special case of a normal distribution we also need the standard deviation. We will assume that we are given the standard deviation and call it s. The calculation for the p value can be done in several ways. We will look at two ways here. The first way is to convert the sample mean to its associated Z score. The other way is to simply specify the standard deviation and let the computer do the conversion. At first glance it may seem like a no brainer, and we should just use the second method. Unfortunately, when using the t distribution we need to convert to the t score, so it is a good idea to know both ways.

We first look at how to calculate the p value using the Z score. The Z score is found by assuming that the null hypothesis is true, subtracting the assumed mean, and dividing by the theoretical standard deviation. Once the Z score is found, the probability that the value could be less than the Z score is found using the pnorm command.
This is not enough to get the p value. If the Z score that is found is positive then we need to take one minus the associated probability. Also, for a two sided test we need to multiply the result by two. Here we avoid these issues and insure that the Z score is negative by taking the negative of the absolute value.
We now look at a specific example. In the example below we will use a value of a of 5, a standard deviation of 2, and a sample size of 20. We then find the p value for a sample mean of 7:
> a <- 5
> s <- 2
> n <- 20
> xbar <- 7
> z <- (xbar-a)/(s/sqrt(n))
> z
[1] 4.472136
> 2*pnorm(-abs(z))
[1] 7.744216e-06
>
We now look at the same problem only specifying the mean and standard deviation within the pnorm command. Note that for this case we cannot so easily force the use of the left tail. Since the sample mean is more than the assumed mean we have to take two times one minus the probability:
> a <- 5
> s <- 2
> n <- 20
> xbar <- 7
> 2*(1-pnorm(xbar,mean=a,sd=s/sqrt(20)))
[1] 7.744216e-06
>

Calculating a Single p Value From a t Distribution
Finding the p value using a t distribution is very similar to using the Z score as demonstrated above. The only difference is that you have to specify the number of degrees of freedom. Here we look at the same example as above but use the t distribution instead:
> a <- 5
> s <- 2
> n <- 20
> xbar <- 7
> t <- (xbar-a)/(s/sqrt(n))
> t
[1] 4.472136
> 2*pt(-abs(t),df=n-1)
[1] 0.0002611934
>
We now look at an example where we have a univariate data set and want to find the p value. In this example we use one of the data sets given in the data input chapter. We use the w1 data set:

> w1 <- read.csv(file="w1.dat",sep=",",head=TRUE)
> summary(w1)
      vals
 Min.   :0.130
 1st Qu.:0.480
 Median :0.720
 Mean   :0.765
 3rd Qu.:1.008
 Max.   :1.760
> length(w1$vals)
[1] 54
Here we use a two sided hypothesis test,
H0: mu1 = 0.7,
Ha: mu1 not = 0.7,
So we calculate the sample mean and sample standard deviation in order to calculate the p value:
> t <- (mean(w1$vals)-0.7)/(sd(w1$vals)/sqrt(length(w1$vals)))
> t
[1] 1.263217
> 2*pt(-abs(t),df=length(w1$vals)-1)
[1] 0.21204

Calculating Many p Values From a t Distribution
Suppose that you want to find the p values for many tests. This is a common task and most software packages will allow you to do this. Here we see how it can be done in R.
Here we assume that we want to do a one sided hypothesis test for a number of comparisons. In particular we will look at three hypothesis tests. All are of the following form:
H0: mu1 - mu2 = 0,
Ha: mu1 - mu2 not = 0,
We have three different sets of comparisons to make:

Comparison 1
          Mean   Std. Dev.   Number (pop.)
Group I   10     3           300
Group II  10.5   2.5         230

Comparison 2
          Mean   Std. Dev.   Number (pop.)
Group I   12     4           210
Group II  13     5.3         340

;omparison " 4ean Froup ) Froup )) "? @B.# :td. 8ev. C.# " +umber (pop.) C@? C??

For each of these comparisons we want to calculate a p value. For each comparison there are two groups. We will refer to group one as the group whose results are in the first row of each comparison above. We will refer to group two as the group whose results are in the second row of each comparison above. Before we can do that we must first compute a standard error and a t score. We will find general formulas, which is necessary in order to do all three calculations at once.
We assume that the means for the first group are defined in a variable called m1. The means for the second group are defined in a variable called m2. The standard deviations for the first group are in a variable called sd1. The standard deviations for the second group are in a variable called sd2. The number of samples for the first group are in a variable called num1. Finally, the number of samples for the second group are in a variable called num2. With these definitions the standard error is the square root of (sd1^2)/num1+(sd2^2)/num2. The associated t score is m1 minus m2 all divided by the standard error. The R commands to do this can be found below:
> m1 <- c(10,12,30)
> m2 <- c(10.5,13,28.5)
> sd1 <- c(3,4,4.5)
> sd2 <- c(2.5,5.3,3)
> num1 <- c(300,210,420)
> num2 <- c(230,340,400)
> se <- sqrt(sd1*sd1/num1+sd2*sd2/num2)
> t <- (m1-m2)/se

To see the values just type in the variable name on a line alone:
> m1

[1] 10 12 30
> m2
[1] 10.5 13.0 28.5
> sd1
[1] 3.0 4.0 4.5
> sd2
[1] 2.5 5.3 3.0
> num1
[1] 300 210 420
> num2
[1] 230 340 400
> se
[1] 0.2391107 0.3985074 0.2659216
> t
[1] -2.091082 -2.509364  5.640761
To use the pt command we need to specify the number of degrees of freedom. This can be done using the pmin command. Note that there is also a command called min, but it does not work the same way. You need to use pmin to get the correct results. The numbers of degrees of freedom are pmin(num1,num2)-1. So the p values can be found using the following R command:
> pt(t,df=pmin(num1,num2)-1)
[1] 0.01881168 0.00642689 0.99999998
If you enter all of these commands into R you should have noticed that the last p value is not correct. The pt command gives the probability that a score is less than the specified t. The t score for the last entry is positive, and we want the probability that a t score is bigger. One way around this is to make sure that all of the t scores are negative. You can do this by taking the negative of the absolute value of the t scores:
> pt(-abs(t),df=pmin(num1,num2)-1)
[1] 1.881168e-02 6.426890e-03 1.605968e-08
The results from the command above should give you the p values for a one sided test. It is left as an exercise how to find the p values for a two sided test.
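For reference, one answer to that exercise mirrors the single-test examples earlier in this chapter: the two sided p values are just double the one tailed probabilities. A minimal sketch, reusing the same variables:
> 2*pt(-abs(t),df=pmin(num1,num2)-1)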

Calculating The Power Of A Test
Here we look at some examples of calculating the power of a test. The examples are for both normal and t distributions. We assume that you can enter data and know the commands associated with basic probability. All of the examples here are for a two sided test, and you can adjust them accordingly for a one sided test.
1. Calculating The Power Using a Normal Distribution
2. Calculating The Power Using a t Distribution
3. Calculating Many Powers From a t Distribution

Calculating The Power Using a Normal Distribution
Here we calculate the power of a test for a normal distribution for a specific example. Suppose that our hypothesis test is the following:
H0: mu = a,
Ha: mu not = a.

The power of a test is the probability that we can reject the null hypothesis at a given mean that is away from the one specified in the null hypothesis. We calculate this probability by first calculating the probability that we accept the null hypothesis when we should not. This is the probability of making a type II error. The power is the probability that we do not make a type II error, so we then take one minus the result to get the power.
We can fail to reject the null hypothesis if the sample happens to be within the confidence interval we find when we assume that the null hypothesis is true. To get the confidence interval we find the margin of error and then add and subtract it to the proposed mean, a, to get the confidence interval. We then turn around and assume instead that the true mean is at a different, explicitly specified level, and then find the probability a sample could be found within the original confidence interval.
In the example below the hypothesis test is for
H0: mu = 5,
Ha: mu not = 5.
We will assume that the standard deviation is 2, and the sample size is 20. In the example below we will use a 95% confidence level and wish to find the power to detect a true mean that differs from 5 by an amount of 1.5. (All of these numbers are made up solely for this example.) The commands to find the confidence interval in R are the following:
> a <- 5
> s <- 2
> n <- 20
> error <- qnorm(0.975)*s/sqrt(n)
> left <- a-error
> right <- a+error
> left
[1] 4.123477
> right
[1] 5.876523
>
Next we find the Z scores for the left and right values assuming that the true mean is 5+1.5=6.5:
> assumed <- a + 1.5
> zLeft <- (left-assumed)/(s/sqrt(n))
> zRight <- (right-assumed)/(s/sqrt(n))
> p <- pnorm(zRight)-pnorm(zLeft)
> p
[1] 0.08163792
The probability that we make a type II error if the true mean is 6.5 is approximately 8.1%. So the power of the test is 1-p:
> 1-p
[1] 0.918362
In this example, the power of the test is approximately 91.8%. If the true mean differs from 5 by 1.5 then the probability that we will reject the null hypothesis is approximately 91.8%.
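Since this same sequence of steps comes up repeatedly, it can be convenient to collect it into a single function. The following is only a sketch of one way to do that; the name powerNormal and its arguments are our own:
> powerNormal <- function(a,s,n,shift,conf=0.95) {
    error <- qnorm(1-(1-conf)/2)*s/sqrt(n)      # margin of error
    zLeft <- (a-error-(a+shift))/(s/sqrt(n))    # Z score of the left cutoff
    zRight <- (a+error-(a+shift))/(s/sqrt(n))   # Z score of the right cutoff
    1-(pnorm(zRight)-pnorm(zLeft))              # power = 1 - P(type II error)
  }
> powerNormal(5,2,20,1.5)   # reproduces the power found above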

Calculating The Power Using a t Distribution
Calculating the power when using a t test is similar to using a normal distribution. One difference is that we use the command associated with the t distribution rather than the normal distribution. Here we repeat the test above, but we will assume that we are working with a sample standard deviation rather than an exact standard deviation.
We will explore three different ways to calculate the power of a test. The first method makes use of the scheme many books recommend if you do not have the non-central distribution available. The second does make use of the non-central distribution, and the third makes use of a single command that will do a lot of the work for us.
In the example the hypothesis test is the same as above,
H0: mu = 5,
Ha: mu not = 5.
Again we assume that the sample standard deviation is 2, and the sample size is 20. We use a 95% confidence level and wish to find the power to detect a true mean that differs from 5 by an amount of 1.5. The commands to find the confidence interval in R are the following:
> a <- 5
> s <- 2
> n <- 20
> error <- qt(0.975,df=n-1)*s/sqrt(n)
> left <- a-error
> right <- a+error
> left
[1] 4.063971
> right
[1] 5.936029
>
The number of observations is large enough that the results are quite close to those in the example using the normal distribution. Next we find the t scores for the left and right values assuming that the true mean is 5+1.5=6.5:
> assumed <- a + 1.5
> tleft <- (left-assumed)/(s/sqrt(n))
> tright <- (right-assumed)/(s/sqrt(n))
> p <- pt(tright,df=n-1)-pt(tleft,df=n-1)
> p
[1] 0.1112583
The probability that we make a type II error if the true mean is 6.5 is approximately 11.1%. So the power of the test is 1-p:
> 1-p
[1] 0.8887417
In this example, the power of the test is approximately 88.9%. If the true mean differs from 5 by 1.5 then the probability that we will reject the null hypothesis is approximately 88.9%. Note that the power calculated for a normal distribution is slightly higher than for this one calculated with the t distribution.
Another way to approximate the power is to make use of the non-centrality parameter. The idea is that you give it the critical t scores and the amount that the mean would be shifted if the alternate mean were the true mean. This is the method that most books recommend.

> ncp <- 1.5/(s/sqrt(n))
> t <- qt(0.975,df=n-1)
> pt(t,df=n-1,ncp=ncp)-pt(-t,df=n-1,ncp=ncp)
[1] 0.1111522
> 1-(pt(t,df=n-1,ncp=ncp)-pt(-t,df=n-1,ncp=ncp))
[1] 0.8888478
Again, we see that the probability of making a type II error is approximately 11.1%, and the power is approximately 88.9%. Note that this is slightly different than the previous calculation but is still close.
Finally, there is one more command that we explore. This command allows us to do the same power calculation as above but with a single command.
> power.t.test(n=n,delta=1.5,sd=s,sig.level=0.05,
               type="one.sample",alternative="two.sided",strict=TRUE)

     One-sample t test power calculation

              n = 20
          delta = 1.5
             sd = 2
      sig.level = 0.05
          power = 0.8888478
    alternative = two.sided

This is a powerful command that can do much more than just calculate the power of a test. For example it can also be used to calculate the number of observations necessary to achieve a given power. For more information check out the help page, help(power.t.test).

Calculating Many Powers From a t Distribution
Suppose that you want to find the powers for many tests. This is a common task and most software packages will allow you to do this. Here we see how it can be done in R. We use the exact same cases as in the previous chapter.
Here we assume that we want to do a two sided hypothesis test for a number of comparisons and want to find the power of the tests to detect a 1 point difference in the means. In particular we will look at three hypothesis tests. All are of the following form:
H0: mu1 - mu2 = 0,
Ha: mu1 - mu2 not = 0,
We have three different sets of comparisons to make:

Comparison 1
          Mean   Std. Dev.   Number (pop.)
Group I   10     3           300
Group II  10.5   2.5         230


Comparison 2
          Mean   Std. Dev.   Number (pop.)
Group I   12     4           210
Group II  13     5.3         340

;omparison " 4ean Froup ) Froup )) "? @B.# :td. 8ev. C.# " +umber (pop.) C@? C??

For each of these comparisons we want to calculate the power of the test. For each comparison there are two groups. We will refer to group one as the group whose results are in the first row of each comparison above. We will refer to group two as the group whose results are in the second row of each comparison above. Before we can do that we must first compute a standard error and a t score. We will find general formulas, which is necessary in order to do all three calculations at once.
We assume that the means for the first group are defined in a variable called m1. The means for the second group are defined in a variable called m2. The standard deviations for the first group are in a variable called sd1. The standard deviations for the second group are in a variable called sd2. The number of samples for the first group are in a variable called num1. Finally, the number of samples for the second group are in a variable called num2. With these definitions the standard error is the square root of (sd1^2)/num1+(sd2^2)/num2. The R commands to do this can be found below:
> m1 <- c(10,12,30)
> m2 <- c(10.5,13,28.5)
> sd1 <- c(3,4,4.5)
> sd2 <- c(2.5,5.3,3)
> num1 <- c(300,210,420)
> num2 <- c(230,340,400)
> se <- sqrt(sd1*sd1/num1+sd2*sd2/num2)

To see the values just type in the variable name on a line alone:
> m1
[1] 10 12 30
> m2
[1] 10.5 13.0 28.5
> sd1
[1] 3.0 4.0 4.5
> sd2
[1] 2.5 5.3 3.0
> num1
[1] 300 210 420
> num2
[1] 230 340 400
> se
[1] 0.2391107 0.3985074 0.2659216
Now we need to define the confidence interval around the assumed differences. Just as in the case of finding the p values in the previous chapter we have to use the pmin command to get the number of degrees of freedom. In this case the null hypotheses are for a difference of zero, and we use a 95% confidence interval:
> left <- qt(0.025,df=pmin(num1,num2)-1)*se
> right <- -left
> left
[1] -0.4711382 -0.7856092 -0.5227825
> right
[1] 0.4711382 0.7856092 0.5227825
We can now calculate the power of the test. Assuming a true difference of 1 we can calculate the t scores associated with both the left and right variables:
> tl <- (left-1)/se
> tr <- (right-1)/se
> tl
[1] -6.152541 -4.480743 -5.726434
> tr
[1] -2.2117865 -0.5379844 -1.7945799
> probII <- pt(tr,df=pmin(num1,num2)-1) - pt(tl,df=pmin(num1,num2)-1)
> probII
[1] 0.01398479 0.29557399 0.03673874
> power <- 1-probII
> power
[1] 0.9860152 0.7044260 0.9632613
The results from the command above should give you the powers for a two sided test. It is left as an exercise how to find the powers for a one sided test.
Just as was found above there is more than one way to calculate the power. We also include the method using the non-centrality parameter, which is recommended over the previous method:
> t <- qt(0.975,df=pmin(num1,num2)-1)
> t
[1] 1.970377 1.971379 1.965927
> ncp <- (1)/se

> pt(t,df=pmin(num1,num2)-1,ncp=ncp)-pt(-t,df=pmin(num1,num2)-1,ncp=ncp)
[1] 0.01374112 0.29533455 0.03660842
> 1-(pt(t,df=pmin(num1,num2)-1,ncp=ncp)-pt(-t,df=pmin(num1,num2)-1,ncp=ncp))
[1] 0.9862589 0.7046655 0.9633916
>
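As for the one sided exercise mentioned above, only one cutoff is needed. A minimal sketch of one possible answer, reusing the variables defined above:
> tcut <- qt(0.95,df=pmin(num1,num2)-1)        # single cutoff for a one sided test
> 1-pt(tcut,df=pmin(num1,num2)-1,ncp=1/se)     # power to detect a difference of 1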

Two Way Tables
Here we look at some examples of how to work with two way tables. We assume that you can enter data and understand the different data types.
1. Creating a Table from Data
2. Creating a Table Directly
3. Tools For Working With Tables

Creating a Table from Data
We first look at how to create a table from raw data. Here we use a fictitious data set, smoker.csv. This data set was created only to be used as an example, and the numbers were created to match an example from a text book, p. 629 of the 4th edition of Moore and McCabe's Introduction to the Practice of Statistics. You should look at the data set in a spreadsheet to see how it is entered.
The information is ordered in a way to make it easier to figure out what information is in the data. The idea is that 356 people have been polled on their smoking status (Smoke) and their socioeconomic status (SES). For each person it was determined whether or not they are current smokers, former smokers, or have never smoked. Also, for each person their socioeconomic status was determined (low, middle, or high). The data file contains only two columns, and when read, R interprets them both as factors:
> smokerData <- read.csv(file='smoker.csv',sep=',',header=T)
> summary(smokerData)
     Smoke          SES
 current:116   High  :211
 former :141   Low   : 93
 never  : 99   Middle: 52
>
You can create a two way table of occurrences using the table command and the two columns in the data frame:
> smoke <- table(smokerData$Smoke,smokerData$SES)
> smoke
          High Low Middle
  current   51  43     22
  former    92  28     21
  never     68  22      9
>
In this example, there are 51 people who are current smokers and are in the high SES. Note that it is assumed that the two lists given in the table command are both factors. (More information on this is available in the chapter on data types.)
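If you prefer a formula interface, the same table can also be built with the base R xtabs command. A sketch, using the column names read in above:
> smoke <- xtabs(~ Smoke + SES, data=smokerData)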

Creating a Table Directly
Sometimes you are given data in the form of a table and would like to create a table. Here we examine how to create the table directly. Unfortunately, this is not as direct a method as might be desired. Here we create an array of numbers, specify the row and column names, and then convert it to a table.
In the example below we will create a table identical to the one given above. In that example we have 3 columns, and the numbers are specified by going across each row from top to bottom. We need to specify the data and the number of rows:
> smoke <- matrix(c(51,43,22,92,28,21,68,22,9),ncol=3,byrow=TRUE)
> colnames(smoke) <- c("High","Low","Middle")
> rownames(smoke) <- c("current","former","never")
> smoke <- as.table(smoke)
> smoke
        High Low Middle
current   51  43     22
former    92  28     21
never     68  22      9

Tools For Working With Tables
Here we look at some of the commands available to help look at the information in a table in different ways. We assume that the data has been entered using one of the methods above, and the table is called "smoke."
First, there are a couple of ways to get graphical views of the data:
> barplot(smoke,legend=T,beside=T,main='Smoking Status by SES')
> plot(smoke,main="Smoking Status By Socioeconomic Status")
There are a number of ways to get the marginal distributions using the margin.table command. If you just give the command the table it calculates the total number of observations. You can also calculate the marginal distributions across the rows or columns based on the one optional argument:
> margin.table(smoke)
[1] 356
> margin.table(smoke,1)
current  former   never
    116     141      99
> margin.table(smoke,2)
  High    Low Middle
   211     93     52
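A related convenience is the addmargins command in the stats package, which appends the marginal sums directly to the table; a sketch:
> addmargins(smoke)   # the original table with a Sum row and column added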

Combining these commands you can get the proportions:
> smoke/margin.table(smoke)
               High        Low     Middle
  current 0.14325843 0.12078652 0.06179775
  former  0.25842697 0.07865169 0.05898876
  never   0.19101124 0.06179775 0.02528090
> margin.table(smoke,1)/margin.table(smoke)
  current    former     never
0.3258427 0.3960674 0.2780899

> margin.table(smoke,2)/margin.table(smoke)
     High       Low    Middle
0.5926966 0.2612360 0.1460674
That is a little obtuse, so fortunately, there is a better way to get the proportions using the prop.table command. You can specify the proportions with respect to the different marginal distributions using the optional argument:
> prop.table(smoke)
               High        Low     Middle
  current 0.14325843 0.12078652 0.06179775
  former  0.25842697 0.07865169 0.05898876
  never   0.19101124 0.06179775 0.02528090
> prop.table(smoke,1)
               High       Low    Middle
  current 0.4396552 0.3706897 0.1896552
  former  0.6524823 0.1985816 0.1489362
  never   0.6868687 0.2222222 0.0909091
> prop.table(smoke,2)
               High       Low    Middle
  current 0.2417062 0.4623656 0.4230769
  former  0.4360190 0.3010753 0.4038462
  never   0.3222749 0.2365591 0.1730769
If you want to do a chi-squared test to determine if the proportions are different, there is an easy way to do this. If we want to test at the 95% confidence level we need only look at a summary of the table:
> summary(smoke)
Number of cases in table: 356
Number of factors: 2
Test for independence of all factors:
        Chisq = 18.51, df = 4, p-value = 0.0009808
Since the p value is less than 5% we can reject the null hypothesis at the 95% confidence level and can say that the proportions vary.
Of course, there is a hard way to do this. This is not for the faint of heart and involves some linear algebra which we will not describe. If you wish to calculate the table of expected values then you need to multiply the vectors of the margins and divide by the total number of observations:
> expected <- as.array(margin.table(smoke,1)) %*%
              t(as.array(margin.table(smoke,2))) / margin.table(smoke)
> expected
              High      Low   Middle
  current 68.75281 30.30337 16.94382
  former  83.57022 36.83427 20.59551
  never   58.67697 25.86236 14.46067
(The "t" function takes the transpose of the array.)

The result is in this array and can be directly compared to the existing table. We need the square of the difference between the two tables divided by the expected values. The sum of all these values is the chi-squared statistic:
> chi <- sum((expected - as.array(smoke))^2/expected)
> chi
[1] 18.50974
We can then get the p value for this statistic:
> 1-pchisq(chi,df=4)
[1] 0.0009808236
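The whole calculation can also be checked against the built-in chisq.test command, which carries out the same independence test in a single step:
> chisq.test(smoke)   # same statistic and p value as found above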

Case Stud*: 'or"ing )hrough a 0' -ro+lem 'e loo( at a sample homewor( problem and the R commands necessary to e!plore the problem. )t is assumed that you are familiar will all of the commands discussed throughout this tutorial. ,. @. ". C. #. -ro+lem Statement )ransforming the (ata )he Confidence Interval )est of Significance )he -o!er of the test

Problem Statement
This problem comes from the 5th edition of Moore and McCabe's Introduction to the Practice of Statistics and can be found on pp. 466-467. The data consists of the emissions of three different pollutants from 46 different engines. A copy of the data we use here is available. The problem examined here is different from that given in the book but is motivated by the discussion in the book.
In the following examples we will look at the carbon monoxide data, which is one of the columns of this data set. First we will transform the data so that it is close to being normally distributed. We will then find the confidence interval for the mean and then perform a significance test to evaluate whether or not the data is away from a fixed standard. Finally, we will find the power of the test to detect a fixed difference from that standard. We will assume that a confidence level of 95% is used throughout.

Transforming the Data
We first begin a basic examination of the data. The first step is to read in the file and get a summary of the center and spread of the data. In this instance we will focus only on the carbon monoxide data.
> engine <- read.csv(file="table_7_3.csv",sep=",",head=TRUE)
> names(engine)
[1] "en"  "hc"  "co"  "nox"
> summary(engine)
       en              hc               co              nox
 Min.   : 1.00   Min.   :0.3400   Min.   : 1.850   Min.   :0.490
 1st Qu.:12.75   1st Qu.:0.4375   1st Qu.: 4.388   1st Qu.:1.110
 Median :24.50   Median :0.5100   Median : 5.905   Median :1.315
 Mean   :24.00   Mean   :0.5502   Mean   : 7.879   Mean   :1.340
 3rd Qu.:35.25   3rd Qu.:0.6025   3rd Qu.:10.015   3rd Qu.:1.495
 Max.   :46.00   Max.   :1.1000   Max.   :23.530   Max.   :2.940
>

At first glance the carbon monoxide data appears to be skewed. The spread between the third quartile and the max is five times the spread between the min and the first quartile. A boxplot is shown in Figure 1, showing that the data appears to be skewed. This is further confirmed in the histogram which is shown in Figure 2. Finally, a normal qq plot is given in Figure 3, and the data does not appear to be normal.
> boxplot(engine$co,main="Carbon Monoxide")
> hist(engine$co,main="Carbon Monoxide")
> qqnorm(engine$co,main="Carbon Monoxide")
> qqline(engine$co)
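If you would rather see all three views at once, the plotting window can be split first; this is standard base graphics and not specific to this example:
> par(mfrow=c(1,3))   # one row, three panels
> boxplot(engine$co,main="Carbon Monoxide")
> hist(engine$co,main="Carbon Monoxide")
> qqnorm(engine$co,main="Carbon Monoxide")
> qqline(engine$co)
> par(mfrow=c(1,1))   # restore the default layout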

Figure 1. Boxplot of the Carbon Monoxide Data.

Figure 2. Histogram of the Carbon Monoxide Data.

2igure ". +ormal QQ <lot of the ;arbon 4ono!ide 8ata. 'e ne!t see if the data can be transformed to something that is closer to being normally distributed. 'e e!amine the logarithm of the data. 2irst, the bo!plot of the log of the data appears to be more evenly distributed as shown in 2igure C. Also, the histogram appears to be centered and closer to normal in 2igure #. 2inally, in 2igure >. the normal ** plot shows that the data is more consistent with what we would e!pect from normal data. > > > > > > len(ine <- l ((en(ine5c ) b 4,l t(len(ine,main+"8arb n 2 n 4i)e") 'ist(len(ine,main+"8arb n 2 n 4i)e") AAn rm(len(ine,main+"33 ;l t * r t'e 6 ( AAline(len(ine)

* t'e 8arb n 2 n 4i)e")

Figure 4. Boxplot of the Logarithm of the Carbon Monoxide Data.
Figure 5. Histogram of the Logarithm of the Carbon Monoxide Data.

Figure 6. Normal QQ Plot of the Logarithm of the Carbon Monoxide Data.
There is strong evidence that the logarithm of the carbon monoxide data more closely resembles a normal distribution than does the raw carbon monoxide data. For that reason all of the analysis that follows will be for the logarithm of the data and will make use of the new list "lengine."

The Confidence Interval
We now find the confidence interval for the carbon monoxide data. As stated above, we will work with the logarithm of the data because it appears to be closer to a normal distribution. This data is stored in the list called "lengine."

called "lengine." :ince we do not (now the true standard deviation we will use the sample standard deviation and will use a t distribution. 'e first find the sample mean, the sample standard deviation, and the number of observations& > m > s > n > m [1] > s [1] > n [1] > <- mean(len(ine) <- s)(len(ine) <- len(t'(len(ine) 1&$$3%7$ 0&59$3$51 4$

Now we find the standard error:
> se <- s/sqrt(n)
> se
[1] 0.08636945
Finally, the margin of error is found based on a 95% confidence level, which can then be used to define the confidence interval:
> error <- se*qt(0.975,df=n-1)
> error
[1] 0.1737529
> left <- m-error
> right <- m+error
> left
[1] 1.709925
> right
[1] 2.057431
>
The 95% confidence interval is between 1.71 and 2.06. Keep in mind that this is for the logarithm, so the 95% confidence interval for the original data can be found by "undoing" the logarithm:
> exp(left)
[1] 5.528548
> exp(right)
[1] 7.82584
>
So the 95% confidence interval for the carbon monoxide is between 5.53 and 7.83.

Test of Significance
We now perform a test of significance. Here we suppose that ideally the engines should have a mean level of 5.4 and do a two sided hypothesis test. Here we assume that the true mean is labeled "mu" and state the hypothesis test:

H0: mu = 5.4,
Ha: mu not = 5.4,
To perform the hypothesis test we first assume that the null hypothesis is true and find the confidence interval around the assumed mean. Fortunately, we can use the values from the previous step:
> lNull <- log(5.4) - error
> rNull <- log(5.4) + error
> lNull
[1] 1.512646
> rNull
[1] 1.860152
> m
[1] 1.883678
>
The sample mean lies outside of the assumed confidence interval so we can reject the null hypothesis. There is a low probability that we would have obtained our sample mean if the true mean really were 5.4.
Another way to approach the problem would be to calculate the actual p value for the sample mean that was found. Since the sample mean is greater than 5.4 it can be found with the following code:
> 2*(1-pt((m-log(5.4))/se,df=n-1))
[1] 0.02692539
Since the p value is 2.7%, which is less than 5%, we can reject the null hypothesis.
Note that there is yet another way to do this. The function t.test will do a lot of this work for us.
> t.test(lengine,mu = log(5.4),alternative = "two.sided")

        One Sample t-test

data:  lengine
t = 2.2841, df = 47, p-value = 0.02693
alternative hypothesis: true mean is not equal to 1.686399
95 percent confidence interval:
 1.709925 2.057431
sample estimates:
mean of x
 1.883678
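The object that t.test returns can also be saved so that its components can be used directly; for example, to pull out just the p value. A sketch, where the variable name result is our own:
> result <- t.test(lengine,mu = log(5.4),alternative = "two.sided")
> result$p.value
[1] 0.02692539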

More information and a more complete list of the options for this command can be found using the help command:
> help(t.test)

The Power of the Test
We now find the power of the test. To find the power we need to set a level for the mean and then find the probability that we would accept the null hypothesis if the mean is really at the prescribed level. Here we will find the power to detect a difference if the level were 7. Three different methods are examined.

The first is a method that some books advise to use if you do not have a non-central t test available. The second does make use of the non-central t test. Finally, the third method makes use of a customized R command.
We first find the probability of accepting the null hypothesis if the level really were 7. We assume that the true mean is 7 and then find the probability that a sample mean would fall within the confidence interval if the null hypothesis were true. Keep in mind that we have to transform the level of 7 by taking its logarithm. Also keep in mind that this is a two sided test:
> tLeft <- (lNull-log(7))/(s/sqrt(n))
> tRight <- (rNull-log(7))/(s/sqrt(n))
> p <- pt(tRight,df=n-1) - pt(tLeft,df=n-1)
> p
[1] 0.1629119
> 1-p
[1] 0.8370881
>
So the probability of making a type II error is approximately 16.3%, and the probability of detecting a difference if the level really is 7 is approximately 83.7%.
Another way to find the power is to use a non-centrality parameter. This is the method that many books advise over the previous method. The idea is that you give it the critical t values associated with your test and also provide a parameter that indicates how the mean is shifted.
> t <- qt(0.975,df=n-1)
> shift <- (log(5.4)-log(7))/(s/sqrt(n))
> pt(t,df=n-1,ncp=shift)-pt(-t,df=n-1,ncp=shift)
[1] 0.1628579
> 1-(pt(t,df=n-1,ncp=shift)-pt(-t,df=n-1,ncp=shift))
[1] 0.8371421
>
Again, we see that the power of the test is approximately 83.7%. Note that this result is slightly off from the previous answer. This approach is often recommended over the previous approach.
The final approach we examine allows us to do all the calculations in one step. It makes use of the non-centrality parameter as in the previous example, but all of the commands are done for us.
> power.t.test(n=n,delta=log(7)-log(5.4),sd=s,sig.level=0.05,
               type="one.sample",alternative="two.sided",strict=TRUE)

     One-sample t test power calculation

              n = 48
          delta = 0.2595112
             sd = 0.5983851
      sig.level = 0.05
          power = 0.8371421
    alternative = two.sided

This is a powerful command that can do much more than just calculate the power of a test. For example it can also be used to calculate the number of observations necessary to achieve a given power. For more information check out the help page, help(power.t.test).
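As an illustration of that last point, leaving n out of the call and specifying the desired power instead asks power.t.test to solve for the sample size. A sketch, with a target power of 90% chosen arbitrarily for the example:
> power.t.test(power=0.90,delta=log(7)-log(5.4),sd=s,sig.level=0.05,
               type="one.sample",alternative="two.sided")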

Case Study II: A JAMA Paper on Cholesterol
We look at a paper that appeared in the Journal of the American Medical Association and explore how to use R to confirm the results. It is assumed that you are familiar with all of the commands discussed throughout this tutorial.
1. Overview of the Paper
2. The Tables
3. Confirming the p-values in Table 3
4. Confirming the p-values in Table 4
5. Finding the Power of the Test in Table 3
6. Differences by Race in Table 2
7. Summary

Overview of the Paper
The paper we examine is Trends in Serum Lipids and Lipoproteins of Adults, 1960-2002, Margaret D. Carroll, MSPH, David A. Lacher, MD, Paul D. Sorlie, PhD, James I. Cleeman, MD, David J. Gordon, MD, PhD, Michael Wolz, MS, Scott M. Grundy, MD, PhD, Clifford L. Johnson, MSPH, Journal of the American Medical Association, October 12, 2005, Vol 294, No. 14, pp. 1773-1781.
The goal is to confirm the results and explore some of the other results not explicitly addressed in the paper. This paper received a great deal of attention in the media. A partial list of some of the articles is the following:
FOX News
www.medpagetoday.com
Argus Leader
The Globe and Mail

The authors examine the trends of several studies of cholesterol levels of Americans. The studies were conducted in 1960-1962, 1971-1973, 1976-1980, 1988-1994, and 1999-2002. The studies previous to 1999 indicated that overall cholesterol levels were declining. The authors of this paper focus on the changes between the two latest studies, 1988-1994 and 1999-2002. They concluded that between certain populations cholesterol levels have decreased over this time. One of the things that received a great deal of attention is the linkage the authors drew between lowered cholesterol levels and increased use of new drugs to lower cholesterol. Here is a quote from their conclusions:
The increase in the proportion of adults using lipid-lowering medication, particularly in older age groups, likely contributed to the decreases in total and LDL cholesterol levels observed.
Here we focus on confirming the results listed in Tables 3 and 4 of the paper. We confirm the p values given in the paper and then calculate the power of the test to detect a prescribed difference in cholesterol levels.

The Tables
Links to the tables in the paper are given below. Links are given to verbatim copies of the tables. For each table there are two links. The first is to a text file displaying the table. The second is to a csv file to be loaded into R. It is assumed that you have downloaded each of the csv files and made them available.
Links to the Tables in the paper.

Table 1: text, csv
Table 2: text, csv
Table 3: text, csv
Table 4: text, csv
Table 5: text, csv
Table 6: text, csv

Confirming the p-values in Table 3
The first thing we do is confirm the p values. The paper does not explicitly state the hypothesis test, but they use a two sided test as we shall soon see. We will explicitly define the hypothesis test that the authors are using but first need to define some terms. We need the means for the 1988-1994 and the 1999-2002 studies and will denote them M88 and M99 respectively. We also need the standard errors and will denote them SE88 and SE99 respectively.
In this situation we are trying to compare the means of two experiments and do not have matched pairs. With this in mind we can define our hypothesis test:
H0: M88 - M99 = 0,
Ha: M88 - M99 not = 0,
When we assume that the null hypothesis is true we calculate the p values using the following values:
Sample Mean = M88 - M99,
SE = sqrt(SE88^2 + SE99^2).
Note that the standard errors are given in the data, and we do not have to use the number of observations to calculate the standard error. However, we do need the number of observations in calculating the p value. The authors used a t test. There are complicated formulas used to calculate the degrees of freedom for the comparison of two means, but here we will simply find the minimum of the set of observations and subtract one.
We first need to read in the data from "table3.csv" and will call the variable "t3." Note that we use a new option, row.names="group". This option tells R to use the entries in the "group" column as the row names. Once the table has been read we will need to make use of the means in the 1988-1994 study ("t3$M.88") and the means in the 1999-2002 study ("t3$M.99"). We will also have to make use of the corresponding standard errors ("t3$SE.88" and "t3$SE.99") and the number of observations ("t3$N.88" and "t3$N.99").
> t3 <- read.csv(file="table3.csv",header=TRUE,sep=",",row.names="group")
> row.names(t3)
 [1] "all"    "g20"    "men"    "mg20"   "m20-29" "m30-39" "m40-49" "m50-59"
 [9] "m60-74" "m75"    "women"  "wg20"   "w20-29" "w30-39" "w40-49" "w50-59"
[17] "w60-74" "w75"
> names(t3)
 [1] "N.60"  "M.60"  "SE.60" "N.71"  "M.71"  "SE.71" "N.76"  "M.76"  "SE.76"
[10] "N.88"  "M.88"  "SE.88" "N.99"  "M.99"  "SE.99" "p"
> t3$M.88
 [1] 204 206 204 204 180 201 211 216 214 205 205 207 183 189 204 228 235 231

> t3$M.99
 [1] 203 203 203 202 183 200 212 215 204 195 202 204 183 194 203 216 223 217
> diff <- t3$M.88-t3$M.99
> diff
 [1]  1  3  1  2 -3  1 -1  1 10 10  3  3  0 -5  1 12 12 14
> se <- sqrt(t3$SE.88^2+t3$SE.99^2)
> se
 [1] 1.140175 1.063015 1.500000 1.500000 2.195450 2.193171 3.361547 3.041381
 [9] 2.193171 3.328663 1.131371 1.063015 2.140093 1.984943 2.126029 2.483948
[17] 2.126029 2.860070
> deg <- pmin(t3$N.88,t3$N.99)-1
> deg
 [1] 7739 8808 3648 4164  673  672  759  570  970  515 4090 4643  960  860  753
[16]  568  945  552
We can now calculate the t statistic. From the null hypothesis, the assumed mean of the difference is zero. We can then use the pt command to get the p values.
> t <- diff/se
> t
 [1]  0.8770580  2.8221626  0.6666667  1.3333333 -1.3664626  0.4559608
 [7] -0.2974821  0.3287980  4.5596075  3.0042088  2.6516504  2.8221626
[13]  0.0000000 -2.5189636  0.4703604  4.8310181  5.6443252  4.8949852
> pt(t,df=deg)
 [1] 0.809758825 0.997609607 0.747486382 0.908752313 0.086125089 0.675717245
 [7] 0.383089952 0.628785421 0.999997110 0.998603837 0.995979577 0.997604809
[13] 0.500000000 0.005975203 0.680883135 0.999999125 0.999999989 0.999999354
There are two problems with the calculation above. First, some of the t values are positive, and for positive values we need the area under the curve to the right. There are a couple of ways to fix this, and here we will insure that the t scores are negative by taking the negative of the absolute value. The second problem is that this is a two sided test, and we have to multiply the probability by two:
> pt(-abs(t),df=deg)
 [1] 1.902412e-01 2.390393e-03 2.525136e-01 9.124769e-02 8.612509e-02
 [6] 3.242828e-01 3.830900e-01 3.712146e-01 2.889894e-06 1.396163e-03
[11] 4.020423e-03 2.395191e-03 5.000000e-01 5.975203e-03 3.191169e-01
[16] 8.748656e-07 1.095966e-08 6.462814e-07
> 2*pt(-abs(t),df=deg)
 [1] 3.804823e-01 4.780786e-03 5.050272e-01 1.824954e-01 1.722502e-01
 [6] 6.485655e-01 7.661799e-01 7.424292e-01 5.779788e-06 2.792326e-03
[11] 8.040845e-03 4.790382e-03 1.000000e+00 1.195041e-02 6.382337e-01
[16] 1.749731e-06 2.191933e-08 1.292563e-06
>

These numbers are a close match to the values given in the paper, but the output above is hard to read. We introduce a new command to loop through and print out the results in a format that is easier to read.

The for loop allows you to repeat a command a specified number of times. Here we want to go from 1, 2, 3, ..., to the end of the list of p values and print out the group and associated p value:
> p <- 2*pt(-abs(t),df=deg)
> for (j in 1:length(p)) {
    cat("p-value for ",row.names(t3)[j]," ",p[j],"\n");
  }
p-value for  all   0.3804823
p-value for  g20   0.004780786
p-value for  men   0.5050272
p-value for  mg20   0.1824954
p-value for  m20-29   0.1722502
p-value for  m30-39   0.6485655
p-value for  m40-49   0.7661799
p-value for  m50-59   0.7424292
p-value for  m60-74   5.779788e-06
p-value for  m75   0.002792326
p-value for  women   0.008040845
p-value for  wg20   0.004790382
p-value for  w20-29   1
p-value for  w30-39   0.01195041
p-value for  w40-49   0.6382337
p-value for  w50-59   1.749731e-06
p-value for  w60-74   2.191933e-08
p-value for  w75   1.292563e-06
>
We can now compare this to Table 3 (given in the link above) and see that we have good agreement. The differences come from round off errors from using the truncated data in the article, as well as from using a different method to calculate the degrees of freedom. Note that for p values close to zero the percent errors are very large.
It is interesting to note that among the categories (rows) given in the table, only a small number of the differences have a p value small enough to reject the null hypothesis at the 95% level. The differences with a p value less than 5% are the group of all people older than 20, men from 60 to 74, men greater than 74, women older than 20, all women, and women from the age groups of 30-39, 50-59, 60-74, and greater than 74. The p values for nine out of the eighteen categories are low enough to allow us to reject the associated null hypothesis. One of those categories is for all people in the study, but very few of the male categories have significant differences at the 95% level. The majority of the differences are in the female categories, especially the older age brackets.
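A loop is not the only way to get readable output. Putting the p values into a data frame with the group names as row names produces a similar listing without an explicit loop; a sketch:
> data.frame(p.value=p, row.names=row.names(t3))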

Confirming the p-values in Table 4
We now confirm the p values given in Table 4. The level of detail of the previous section is not given; rather, the commands are briefly given below:
> t4 <- read.csv(file="table4.csv",header=TRUE,sep=",",row.names="group")
> names(t4)
[1] "S88N"  "S88M"  "S88SE" "S99N"  "S99M"  "S99SE" "p"
> diff <- t4$S88M - t4$S99M
> se <- sqrt(t4$S88SE^2+t4$S99SE^2)
> deg <- pmin(t4$S88N,t4$S99N)-1
> t <- diff/se
> p <- 2*pt(-abs(t),df=deg)
> for (j in 1:length(p)) {
    cat("p-values for ",row.names(t4)[j]," ",p[j],"\n");
  }
p-values for  MA   0.07724362
p-values for  MAM   0.6592499
p-values for  MAW   0.002497728
p-values for  NHW   0.1184228
p-values for  NHWM   0.2673851
p-values for  NHWW   0.02585374
p-values for  NHB   0.001963195
p-values for  NHBM   0.003442551
p-values for  NHBW   0.007932079
>
Again, the p values are close to those given in Table 4. The numbers are off due to truncation errors from the true data as well as a simplified calculation of the degrees of freedom. As in the previous section the p values that are close to zero have the greatest percent errors.

Finding the Power of the Test in Table 3
We now will find the power of the test to detect a difference. Here we arbitrarily choose to find the power to detect a difference of 4 points and then do the same for a difference of 6 points. The first step is to assume that the null hypothesis is true and find the 95% confidence interval around a difference of zero:
> t3 <- read.csv(file="table3.csv",header=TRUE,sep=",",row.names="group")
> se <- sqrt(t3$SE.88^2+t3$SE.99^2)
> deg <- pmin(t3$N.88,t3$N.99)-1
> tcut <- qt(0.975,df=deg)
> tcut
 [1] 1.960271 1.960233 1.960614 1.960534 1.963495 1.963500 1.963094 1.964135
 [9] 1.962413 1.964581 1.960544 1.960475 1.962438 1.962726 1.963119 1.964149
[17] 1.962477 1.964271
Now that the cutoff t scores for the 95% confidence interval have been established we want to find the probability of making a type II error. We find the probability of making a type II error if the difference is a positive 4.
> typeII <- pt(tcut,df=deg,ncp=4/se)-pt(-tcut,df=deg,ncp=4/se)
> typeII
 [1] 0.06083127 0.03573266 0.24009202 0.24006497 0.55583392 0.55508927
 [7] 0.77898598 0.74064782 0.55477307 0.77573784 0.05765826 0.03576160
[13] 0.53688674 0.47884787 0.53218625 0.63753092 0.53199248 0.71316969
> 1-typeII
 [1] 0.9391687 0.9642673 0.7599080 0.7599350 0.4441661 0.4449107 0.2210140
 [8] 0.2593522 0.4452269 0.2242622 0.9423417 0.9642384 0.4631133 0.5211521
[15] 0.4678138 0.3624691 0.4680075 0.2868303
>

It looks like there is a mix here. Some of the tests have a very high power while others are poor. Six of the categories have very high power and four have powers less than 30%. One problem is that this is hard to read. We now show how to use the for loop to create a nicer output:
> power <- 1-typeII
> for (j in 1:length(power)) {
    cat("power for ",row.names(t3)[j]," ",power[j],"\n");
  }
power for  all   0.9391687
power for  g20   0.9642673
power for  men   0.759908
power for  mg20   0.759935
power for  m20-29   0.4441661
power for  m30-39   0.4449107
power for  m40-49   0.221014
power for  m50-59   0.2593522
power for  m60-74   0.4452269
power for  m75   0.2242622
power for  women   0.9423417
power for  wg20   0.9642384
power for  w20-29   0.4631133
power for  w30-39   0.5211521
power for  w40-49   0.4678138
power for  w50-59   0.3624691
power for  w60-74   0.4680075
power for  w75   0.2868303
>
We see that the composite groups, groups made up of larger age groups, have much higher powers than the age stratified groups. It also appears that the groups composed of women seem to have higher powers as well.
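The power to detect the 6 point difference mentioned above comes from exactly the same commands with the non-centrality parameter changed accordingly; a sketch, with the outputs omitted here:
> typeII6 <- pt(tcut,df=deg,ncp=6/se)-pt(-tcut,df=deg,ncp=6/se)
> power6 <- 1-typeII6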

"9:H-2 2091 2090 924 9%5 "9:H-< 2247 2247 1014 1032 "9:# 1%02 1%02 %70 %74 "9:#-2 749 749 309 312 "9:#-H $53 $53 3%1 3%2 220-29 %74 %74 304 311 230-39 %73 %73 31% 323 240-49 7%0 7%0 31$ 342 250-59 571 571 245 2%2 2%0-%9 %71 %70 2$7 301 270 $1% $1% 345 354 H20-29 9%1 9%1 415 419 H30-39 $%1 $%1 374 377 H40-49 754 755 347 352 H50-59 5%9 5%9 25% 2%3 H%0-%9 %72 %71 315 324 H70 $27 $27 345 354 > t2 <- rea)&csv(*ile+Ltable2&csvL,'ea)+-,se,+L,L,r @&name+"(r u,") > l)l2 <- c(t256D6[1],t256D6[4],t256D6[7],t256D6[10]) > l)l=0 <- c(t256D6=0[1],t256D6=0[4],t256D6=0[7],t256D6=0[10]) > l)l9 <- c(t156D6[1],t156D6[4],t156D6[7],t156D6[10]) > l)l9ames <c(r @&names(t1)[1],r @&names(t1)[4],r @&names(t1) [7],r @&names(t1)[10]) > l)l2 [1] 123 121 124 121 > l)l=0 [1] 1&0 1&3 1&2 1&% > l)l9 [1] 3$%7 950 193$ %70 > l)l9ames [1] ""ll?20" ""2"" ""9:H" ""9:#" 'e can now find the appro!imate p values. This is not the same as the previous e!amples because the means are not being compared across matching values of different lists but down the rows of a single list. 'e will ma(e use of two for loops. The idea is that we will loop though each row e!cept the last row. Then for each of these rows we ma(e a comparison for every row beneath& > * r (I in 11(len(t'(l)l2)-1)) T * r (F in (IC1)1len(t'(l)l2)) T )i** <- l)l2[I]-l)l2[F] se <- sArt(l)l=0[I]R2Cl)l=0[F]R2) t <- )i**Gse n <- min(l)l9[I],l)l9[F])-1 , <- 2B,t(-abs(t),)*+n) cat("(",I,",",F,") - ",l)l9ames[I]," vs ",l)l9ames[F]," ,1 ",,,"Un") V V ( 1 , 2 ) - "ll?20 vs "2" ,1 0&2229$72 ( 1 , 3 ) - "ll?20 vs "9:H ,1 0&52212$4 ( 1 , 4 ) - "ll?20 vs "9:# ,1 0&2$952$1 ( 2 , 3 ) - "2" vs "9:H ,1 0&090270%% ( 2 , 4 ) - "2" vs "9:# ,1 1 ( 3 , 4 ) - "9:H vs "9:# ,1 0&1340$%2

We cannot reject the null hypothesis at the 95% level for any of the differences. We cannot say that there is a significant difference between any of the groups in this comparison.

Summary
We examined some of the results stated in the paper Trends in Serum Lipids and Lipoproteins of Adults, 1960-2002 and confirmed the stated p values for Tables 3 and 4. We also found the p values for differences by some of the race categories in Table 2.
When checking the p values for Table 3 we found that nine of the eighteen differences are significant. When looking at the boxplot (below) for the means from Table 3 we see that two of those (the means for the two oldest categories for women) are outliers.
> boxplot(t3$M.60,t3$M.71,t3$M.76,t3$M.88,t3$M.99,
          names=c("60-62","71-73","76-80","88-94","99-02"),
          xlab="Studies",ylab="Serum Total Cholesterol",
          main="Comparison of Serum Total Cholesterol By Study")

The boxplot helps demonstrate what was found in the comparisons of the previous studies. There have been long term declines in cholesterol levels over the whole term of these studies. The new wrinkle in this study is the association of the use of statins to reduce blood serum cholesterol levels. The results given in Table 6 show that every category of people has shown a significant increase in the use of statins.

One question for discussion is whether or not the association demonstrates strong enough evidence to claim a causative link between the two. The news articles that were published in response to the article imply causation between the use of statins and lower cholesterol levels. However, the study only demonstrates a positive association, and the decline in cholesterol levels may be a long term phenomenon with interactions between a wide range of factors.
