
Linear Statistical Analysis I

Fall 2013
() Linear Statistical Analysis I Fall 2013 1 / 101
Outline
Introduction to R statistical software
Introduction to regression and model building
R language
R is a popular statistical software package especially suitable for
data analysis and graphical representation.
It is free, open-source software.
The R package can be downloaded from
http://cran.us.r-project.org/ where you can also find useful
information about R. Note that there are different versions for
different operating systems such as Windows, Linux, and MacOS.
You can find online tutorials by searching for the key words "R language tutorial".
Open the R package.
Type commands after the prompt symbol >.
For example, to compute 3/(9*4-5) and the square root of 9:
> 3/(9*4-5)
[1] 0.09677419
> sqrt(9)
[1] 3
If you want to comment on a command, use the symbol
#. The text after this symbol will not be executed.
> 3/(9*4-5) # This is a comment.
[1] 0.09677419
> sqrt(9) # compute the square root of 9.
[1] 3
workspace and working directory
The workspace is the place in your computer where R reads or
saves data or R objects.
The directory of the workspace is the working directory.
Find the current working directory
> getwd()
[1] "F:/Teaching/2013_spring_Computing"
change the current working directory
> setwd("F:/Teaching")
> getwd()
[1] "F:/Teaching"
or you can use the File menu in the Console window of R.
If there is an error when you input data, first check whether the
data file is in the working directory.
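The check suggested above can be sketched in code. This is only an illustration: the file name is taken from a later example in these slides and may not exist on your computer, and the variable names fname and in_wd are hypothetical.

```r
# File name used only for illustration; it may not exist on your machine.
fname <- "ch.1.ex.1.dat"
cat("Working directory:", getwd(), "\n")

# file.exists() looks for the file relative to the working directory.
in_wd <- file.exists(fname)
if (in_wd) {
  cat(fname, "was found in the working directory.\n")
} else {
  cat(fname, "was not found; check getwd() or give the full path.\n")
}
```

If the file is somewhere else, passing its full path to the reading function avoids changing the working directory.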
arithmetic expressions and variables
arithmetic expressions: type expressions after the prompt symbol.
R will evaluate their values.
> 2+3
[1] 5
> 3-4
[1] -1
> 3*5
[1] 15
> 3/5
[1] 0.6
> 13%%5 # compute the remainder of 13 divided by 5
[1] 3
> 13%/%5 #the integer quotient of 13 divided by 5
[1] 2
> 3^2
[1] 9
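As a small check of how %/% and %% fit together, the sketch below verifies that the integer quotient times the divisor plus the remainder gives back the original number. The variable names are chosen only for illustration.

```r
a <- 13
b <- 5
q <- a %/% b              # integer quotient
r <- a %% b               # remainder
check <- (q * b + r == a) # the two operators are consistent with each other
print(c(q = q, r = r))
print(check)
```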
> sqrt(9)
[1] 3
> log(3)
[1] 1.098612
> log2(2)
[1] 1
> log10(100)
[1] 2
> exp(3)
[1] 20.08554
> factorial(3) #compute 3!
[1] 6
> log((1+exp(5)))*3^(-2)
[1] 0.5563017
variable: the name is case sensitive, that is, upper and lower case
letters are distinct.
assignment operator <-
> x<-13+exp(2) # assign the value of the
# expression to the variable x
> x # check the value of x
[1] 20.38906
equivalently, you can use
> x=13+exp(2)
> x
[1] 20.38906
You can include variables in expressions; however, the values of
the variables must be assigned before the expressions are evaluated.
> x=2+3
> (2+3)*4
[1] 20
> x*4
[1] 20
> y*4
Error: object 'y' not found
> y=x^2
> y
[1] 25
Relational and Logical Operators
logical values: TRUE (can be denoted by T) and FALSE
(can be denoted by F).
We can apply the arithmetic operators to logical values; in this
case, TRUE is treated as 1 and FALSE is treated as 0.
> TRUE+TRUE
[1] 2
> TRUE+FALSE
[1] 1
> TRUE*FALSE
[1] 0
> x=TRUE
> x
[1] TRUE
> y=FALSE
> y-x
[1] -1
The relational operators are >, <, <=, >=, == and !=.
> 1>2 # logical expression with two possible
[1] FALSE
> 1<2 #logical values TRUE or FALSE
[1] TRUE
> 1<=2
[1] TRUE
> 1>=2
[1] FALSE
> (1+3)==4
[1] TRUE
> 3*2!=6
[1] FALSE
> x=(3*2!=6)
> x
[1] FALSE
logical operators: can be used to connect two or more logical
expressions
> (1>2)&((1+3)==4) # "&" means "and". The combined
# expression is true only if
# both the two expressions are true.
[1] FALSE
> (1<2)&((1+3)==4)
[1] TRUE
> TRUE&FALSE
[1] FALSE
> (1>2)|((1+3)==4) # "|" means "or". The combined
# expression is true if and only if
# at least one of the two
# expressions is true.
[1] TRUE
> (1<2)|((1+3)==4)
[1] TRUE
> (1>2)|((1+3)==5)
[1] FALSE
> TRUE|FALSE
[1] TRUE
> ((1+2)==1)
[1] FALSE
> !((1+2)==1) #"!" means "NOT".
[1] TRUE
> !TRUE
[1] FALSE
> !FALSE
[1] TRUE
basic data types in R
vector: concatenate several numbers or vectors into a vector by
using the function c()
> c(1,3,2)
[1] 1 3 2
> c(c(1,2),c(3,5),c(3,0,0))
[1] 1 2 3 5 3 0 0
> x=c(3,4,9)
> x
[1] 3 4 9
> y=c(x,x,c(0,1,2))
> y
[1] 3 4 9 3 4 9 0 1 2
Special vectors:
> 1:9
[1] 1 2 3 4 5 6 7 8 9
> 5:1
[1] 5 4 3 2 1
> rep(0.5,6) # repeat 0.5 six times
[1] 0.5 0.5 0.5 0.5 0.5 0.5
> rep(-3,8)
[1] -3 -3 -3 -3 -3 -3 -3 -3
> seq(0,1, length.out=11) # get 11 numbers from
#0 to 1 with equal space.
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
> seq(1,2,length.out=21)
[1] 1.00 1.05 1.10 1.15 1.20 1.25 1.30 1.35
[9] 1.40 1.45 1.50 1.55 1.60 1.65 1.70 1.75
[17] 1.80 1.85 1.90 1.95 2.00
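The functions seq and rep accept a few other arguments that are often convenient. The sketch below is an illustration with made-up values: by fixes the step size instead of the length, and times and each control how rep repeats a vector.

```r
s1 <- seq(0, 1, by = 0.25)      # fixed step instead of fixed length
s2 <- rep(c(1, 2), times = 3)   # repeat the whole vector three times
s3 <- rep(c(1, 2), each = 3)    # repeat each element three times
print(s1)
print(s2)
print(s3)
```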
logical vector: consists of logical values
> x=c(T,T,F,F) #TRUE can be denoted by T
> x
[1] TRUE TRUE FALSE FALSE
> x=1:5
> x==1 # each component of x is compared with 1
[1] TRUE FALSE FALSE FALSE FALSE
> y=(x!=1)
> y
[1] FALSE TRUE TRUE TRUE TRUE
extract or modify subsets of a vector by using index vector
> x=c(3,2,5,5,7,8)
> x[c(1,3,5)]# extract the 1,3,5th numbers in x
[1] 3 5 7
> y=x[c(3,1,5)]
> y
[1] 5 3 7
> z=x[-c(2,4)]# remove the 2,4th numbers in x
> z
[1] 3 5 7 8
> x[1:3]
[1] 3 2 5
> x[1:3]=c(1,2,3)
> x
[1] 1 2 3 5 7 8
find all distinct values in a vector
> x=c(3,2,5,5,7,8,7,8,9)
> unique(x)
[1] 3 2 5 7 8 9
modify a subset of a vector by using an index vector
> x=c(1,2,3,5,7,8)
> x[c(1,3,5)]=0
> x
[1] 0 2 0 5 0 8
extract or modify subsets of a vector by using a logical vector
> x=c(3,2,5,5,7,8,3,6,9) # we will extract the
# numbers larger than 4
> index=(x>4)
> index
[1] FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
> x[index] # only values corresponding to TRUE
# in "index" will be selected
[1] 5 5 7 8 6 9
> x=1:10 #extract numbers less or equal
# to 6 and larger than 3
> x[(x<=6)&(x>3)]
[1] 4 5 6
> x=c(3,2,5,5,7,8,3,6,9)
> y=c(1,3,5,7,6,4,2,3,5)
> #we extract the numbers in
> #the positions corresponding to y>4
> x[y>4]
[1] 5 5 7 9
extract the indices of the elements of a vector that satisfy
specific conditions.
> x=c(T,T,F,T,F,F,T,T,T,F)
> which(x)
[1] 1 2 4 7 8 9
> x=c(3,2,5,5,7,8,3,6,9)
> which(x>5)
[1] 5 6 8 9
> x[which(x>5)]
[1] 7 8 6 9
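Because TRUE counts as 1 in arithmetic, logical vectors can also be summarised directly. The following sketch, using the same vector x as above, shows a few base R functions that complement which; the variable names are only illustrative.

```r
x <- c(3, 2, 5, 5, 7, 8, 3, 6, 9)
n_large   <- sum(x > 5)     # TRUE counts as 1, so this counts the matches
any_large <- any(x > 8)     # is at least one element larger than 8?
all_pos   <- all(x > 0)     # are all elements positive?
pos_max   <- which.max(x)   # index of the (first) maximum
print(c(n_large = n_large, pos_max = pos_max))
```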
operations on vectors
The elementary arithmetic operators +, -, *, / and ^ (for raising to a
power), and functions such as log, exp, sin, cos, tan and sqrt, can be applied to
vectors element by element.
> x=1:5
> x
[1] 1 2 3 4 5
> y=10:15
> y
[1] 10 11 12 13 14 15
> y-x
[1] 9 9 9 9 9 14
Warning message:
In y - x : longer object length is not
a multiple of shorter object length
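The warning above comes from R's recycling rule: the shorter vector is repeated until it matches the length of the longer one. A minimal sketch with made-up vectors, where the longer length is a multiple of the shorter one so that no warning is issued:

```r
x <- 1:6
y <- c(10, 20)   # length 2 divides length 6, so no warning is issued
z <- x + y       # y is recycled as 10 20 10 20 10 20
print(z)

# An explicit length check before combining two vectors.
same_length <- length(x) == length(y)
print(same_length)
```

Checking the lengths first is a cheap way to avoid recycling that was not intended.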
> x
[1] 1 2 3 4 5
> y=11:15 #x and y should have the same lengths.
> y-x
[1] 10 10 10 10 10
> 2*x
[1] 2 4 6 8 10
> x^2
[1] 1 4 9 16 25
> y/x
[1] 11.000000 6.000000 4.333333 3.500000 3.000000
> x+1
[1] 2 3 4 5 6
> x*y
[1] 11 24 39 56 75
get the length, minimum, maximum, mean, variance of a vector
> z=c(43, 45, 5, 44, 767, 57, 68,33,111)
> length(z)
[1] 9
> min(z)
[1] 5
> max(z)
[1] 767
> which(z==min(z))
[1] 3
> which(z==max(z))
[1] 5
> mean(z)
[1] 130.3333
> var(z)
[1] 57815.75
> sd(z)
[1] 240.4491
sort the numbers in a vector in an increasing or decreasing order
> z=c(43, 45, 5, 44, 767, 57, 68,33,111)
> sort(z)
[1] 5 33 43 44 45 57 68 111 767
> sort(z,decreasing = T)
[1] 767 111 68 57 45 44 43 33 5
> y=sort(z,index.return = T)
> y
$x
[1] 5 33 43 44 45 57 68 111 767
$ix
[1] 3 8 1 4 2 6 7 9 5
> y$x
[1] 5 33 43 44 45 57 68 111 767
> z[y$ix]
[1] 5 33 43 44 45 57 68 111 767
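The sorting indices can also be obtained with order, which is convenient when a second vector has to be rearranged in the same way. In the sketch below the label vector w is hypothetical and only serves to show the idea.

```r
z <- c(43, 45, 5, 44, 767, 57, 68, 33, 111)
w <- c("a", "b", "c", "d", "e", "f", "g", "h", "i")  # hypothetical labels

idx <- order(z)      # same indices as sort(z, index.return = TRUE)$ix
z_sorted <- z[idx]
w_sorted <- w[idx]   # reorder the labels in the same way as z
print(idx)
print(w_sorted)
```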
matrix
we start with some simple matrices
> matrix(1.5, nrow=2,ncol=3)
[,1] [,2] [,3]
[1,] 1.5 1.5 1.5
[2,] 1.5 1.5 1.5
> diag(3.3,3)
[,1] [,2] [,3]
[1,] 3.3 0.0 0.0
[2,] 0.0 3.3 0.0
[3,] 0.0 0.0 3.3
> y=diag(-2,3,4)
> y
[,1] [,2] [,3] [,4]
[1,] -2 0 0 0
[2,] 0 -2 0 0
[3,] 0 0 -2 0
form a matrix from a vector
> x=matrix(c(1,2,3,4,5,6), 2,3)
> x
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
> matrix(c(1,2,3,4,5,6), 3,2)
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
> matrix(c(1,2,3,4,5,6), 3,2,byrow=T)
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
form a matrix by combining vectors
> x=c(1,2,3)
> y=c(4,5,6)
> z=c(7,8,9)
> cbind(x,y,z)#combine three vectors as columns
x y z
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
> rbind(x,y,z)#combine three vectors as rows
[,1] [,2] [,3]
x 1 2 3
y 4 5 6
z 7 8 9
form a matrix by combining matrices
> w=cbind(x,y)
> w
x y
[1,] 1 4
[2,] 2 5
[3,] 3 6
> v=cbind(z,y)
> v
z y
[1,] 7 4
[2,] 8 5
[3,] 9 6
> cbind(w,v)
x y z y
[1,] 1 4 7 4
[2,] 2 5 8 5
[3,] 3 6 9 6
> rbind(w,v)
x y
[1,] 1 4
[2,] 2 5
[3,] 3 6
[4,] 7 4
[5,] 8 5
[6,] 9 6
extract or modify subsets of a matrix
> x=matrix(1:15,3,5)
> x
[,1] [,2] [,3] [,4] [,5]
[1,] 1 4 7 10 13
[2,] 2 5 8 11 14
[3,] 3 6 9 12 15
> x[2,4]# the number in the second row and the fourth column.
[1] 11
> y=x[1,]
># extract the first row
># which is converted to a vector.
> y
[1] 1 4 7 10 13
> str(y)
int [1:5] 1 4 7 10 13
> z=x[,2]
># extract the second column
> z
[1] 4 5 6
> str(z)
int [1:3] 4 5 6
> w=x[1:2,1:3]
> w
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
> v=x[c(1,3),]
> v
[,1] [,2] [,3] [,4] [,5]
[1,] 1 4 7 10 13
[2,] 3 6 9 12 15
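Extracting a single row or column converts the result to a vector. If a matrix is needed instead, the argument drop = FALSE keeps the dimensions. A minimal sketch with the same matrix x; the names r1, r2, dim_r1 and dim_r2 are illustrative.

```r
x <- matrix(1:15, 3, 5)
r1 <- x[1, ]                 # a plain vector of length 5
r2 <- x[1, , drop = FALSE]   # still a matrix, with 1 row and 5 columns
dim_r1 <- dim(r1)            # NULL, because r1 has no dimensions
dim_r2 <- dim(r2)
print(dim_r2)
```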
extract irregularly distributed subsets of a matrix
> x=matrix(1:15,3,5)
> x
[,1] [,2] [,3] [,4] [,5]
[1,] 1 4 7 10 13
[2,] 2 5 8 11 14
[3,] 3 6 9 12 15
> index=rbind(c(2,2),c(2,4),c(3,5)) # each row gives a (row, column) position
> index
[,1] [,2]
[1,] 2 2
[2,] 2 4
[3,] 3 5
> x[index]
[1] 5 11 15
> x[index]=0
> x
[,1] [,2] [,3] [,4] [,5]
[1,] 1 4 7 10 13
[2,] 2 0 8 0 14
[3,] 3 6 9 12 0
> which(x==0, arr.ind = T)
row col
[1,] 2 2
[2,] 2 4
[3,] 3 5
> x[which(x==0, arr.ind = T)]
[1] 0 0 0
operations on matrices
The elementary arithmetic operators +, -, *, / and ^ (for raising to a
power), and functions such as log, exp, sin, cos, tan and sqrt, can be applied to
matrices element by element.
> x=matrix(1:6,2,3)
> x
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
> sqrt(x)
[,1] [,2] [,3]
[1,] 1.000000 1.732051 2.236068
[2,] 1.414214 2.000000 2.449490
> y=matrix(11:16,2,3)
> y
[,1] [,2] [,3]
[1,] 11 13 15
[2,] 12 14 16
> x+y
[,1] [,2] [,3]
[1,] 12 16 20
[2,] 14 18 22
transpose of a matrix
> y=matrix(11:16,2,3)
> y
[,1] [,2] [,3]
[1,] 11 13 15
[2,] 12 14 16
> t(y)
[,1] [,2]
[1,] 11 12
[2,] 13 14
[3,] 15 16
> y=1:3 # vector: a special matrix with one column.
> t(y)
[,1] [,2] [,3]
[1,] 1 2 3
convert a matrix to a vector
> x=matrix(1:6,2,3)
> x
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
> as.vector(x)
[1] 1 2 3 4 5 6
> as.vector(t(x))
[1] 1 3 5 2 4 6
matrix multiplication
> A=matrix(1:6,3,2)
> A
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
> B=matrix(11:16,2,3)
> B
[,1] [,2] [,3]
[1,] 11 13 15
[2,] 12 14 16
> A%*%B # different from A*B
[,1] [,2] [,3]
[1,] 59 69 79
[2,] 82 96 110
[3,] 105 123 141
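To make the difference between the matrix product and the element-by-element product concrete, the sketch below computes both for the matrices A and B above. Since * requires equal dimensions, B is transposed first; crossprod is a base R shortcut for t(A) %*% A.

```r
A <- matrix(1:6, 3, 2)
B <- matrix(11:16, 2, 3)

P <- A %*% B       # matrix product, a 3 x 3 matrix
E <- A * t(B)      # element-by-element product of two 3 x 2 matrices
C <- crossprod(A)  # the same as t(A) %*% A, a 2 x 2 matrix
print(dim(P))
print(E)
```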
Linear equations and inversion of a matrix
For example, to solve the following equations
2x+y+3z=1
3x-y+z=4
x+y+z=2
> A=cbind(c(2,3,1),c(1,-1,1),c(3,1,1))
> A # the matrix of coefficients
[,1] [,2] [,3]
[1,] 2 1 3
[2,] 3 -1 1
[3,] 1 1 1
> a=c(1,4,2)
> solve(A,a)
[1] 2.333333 1.333333 -1.666667
> v=solve(A,a)
> A%*%v
[,1]
[1,] 1
[2,] 4
[3,] 2
> A%*%v-a # the differences are zero up to rounding error
[,1]
[1,] 0.000000e+00
[2,] -4.440892e-16
[3,] 4.440892e-16
> solve(A) # the inverse of A
[,1] [,2] [,3]
[1,] -0.3333333 0.3333333 0.6666667
[2,] -0.3333333 -0.1666667 1.1666667
[3,] 0.6666667 -0.1666667 -0.8333333
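Because of rounding error, exact comparisons with == are unreliable for results such as these. One possible way to check the inverse and the solution is all.equal, which compares numbers up to a small tolerance. A sketch with the same A and a; the names Ainv, v1, v2, ok_inverse and ok_solution are illustrative.

```r
A <- cbind(c(2, 3, 1), c(1, -1, 1), c(3, 1, 1))
a <- c(1, 4, 2)

Ainv <- solve(A)
v1 <- solve(A, a)             # solves the system directly
v2 <- as.vector(Ainv %*% a)   # the same solution through the inverse

# all.equal() tolerates tiny floating-point differences.
ok_inverse  <- isTRUE(all.equal(A %*% Ainv, diag(3)))
ok_solution <- isTRUE(all.equal(v1, v2))
print(c(ok_inverse, ok_solution))
```

Solving with solve(A, a) is usually preferable to forming the inverse explicitly when only the solution is needed.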
character vector and factor
A character vector is a vector whose components are character strings
instead of numbers. For example, we construct vectors of the names and
genders of five persons in one office.
> names=c("Emily","John", "Lily","Grace","William")
> str(names)
chr [1:5] "Emily" "John" "Lily" "Grace" "William"
> gender=c("F","M","F","F","M")
> str(gender)
chr [1:5] "F" "M" "F" "F" "M"
In the vector gender, there are two categories, male and female,
labeled by "M" and "F". Sometimes we are interested in
information such as how many elements are in each category. In this case,
we can convert the character vector to a factor.
> gender=as.factor(gender)
> gender
[1] F M F F M
Levels: F M
> levels(gender)
[1] "F" "M"
Different levels denote different categories.
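Once the vector is a factor, the number of elements in each category can be counted with table. A minimal sketch using the same gender data; the variable names counts, n_levels and codes are illustrative.

```r
gender <- factor(c("F", "M", "F", "F", "M"))
counts   <- table(gender)       # number of elements in each level
n_levels <- nlevels(gender)     # number of categories
codes    <- as.integer(gender)  # internal integer codes: F = 1, M = 2
print(counts)
```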
data frame
The data frame is one of the most important data types in R.
It is similar to a matrix: its elements are organized in rows and
columns.
However, in a numeric matrix, all elements are numbers.
In a data frame, it is often the case that some
columns are numbers and the others are not (for example, character strings or factors).
When R reads data from an external data file with read.table or read.csv, R saves the
data as a data frame.
It is not efficient to type the data directly in R if the data size is
large.
We can read the data from a file where the data is saved.
It is important that the file is in the working directory. Otherwise, R
cannot find the file, unless you tell R where the file is by giving its full path.
input data from a file
The data has been stored in a file called ch.1.ex.1.dat.
use the following command to input the data
> data=read.table("ch.1.ex.1.dat")
> str(data)
'data.frame': 4 obs. of 3 variables:
$ V1: int 1 2 4 5
$ V2: int 2 4 5 6
$ V3: int 3 7 7 8
> data
V1 V2 V3
1 1 2 3
2 2 4 7
3 4 5 7
4 5 6 8
The data has been stored in a file called ch.1.ex.2.dat. The
numbers are separated by commas.
> data=read.table("ch.1.ex.2.dat",sep=",")
> str(data)
'data.frame': 4 obs. of 3 variables:
$ V1: int 1 2 4 5
$ V2: int 2 4 5 6
$ V3: int 3 7 7 8
> data
V1 V2 V3
1 1 2 3
2 2 4 7
3 4 5 7
4 5 6 8
> data=read.table("ch.1.ex.3.dat",header=T)
> str(data)
'data.frame': 4 obs. of 3 variables:
$ A: int 1 2 4 5
$ B: int 2 4 5 6
$ C: int 3 7 7 8
> data
A B C
1 1 2 3
2 2 4 7
3 4 5 7
4 5 6 8
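The effect of the header argument can be tried without any of the course files. The sketch below writes a small temporary file with a header line, assumed to resemble ch.1.ex.3.dat, and reads it twice; the names tmp, d1 and d2 are hypothetical.

```r
# Temporary file used as a stand-in for a data file with a header line.
tmp <- tempfile(fileext = ".dat")
writeLines(c("A B C", "1 2 3", "2 4 7", "4 5 7", "5 6 8"), tmp)

d1 <- read.table(tmp, header = TRUE)    # first line gives the variable names
d2 <- read.table(tmp, header = FALSE)   # first line is read as data
unlink(tmp)                             # remove the temporary file

str(d1)
print(nrow(d2))
```

Without header = TRUE the names become an extra data row, so the columns are no longer read as integers.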
input data from an Excel file
The data has been stored in a CSV file called ch.1.ex.4.csv.
> data=read.csv("ch.1.ex.4.csv")
> str(data)
'data.frame': 4 obs. of 3 variables:
$ A: int 1 5 4 5
$ B: int 2 4 5 8
$ C: int 3 4 6 7
> data
A B C
1 1 2 3
2 5 4 4
3 4 5 6
4 5 8 7
The data has been stored in a CSV file called ch.1.ex.5.csv.
> data=read.csv("ch.1.ex.5.csv",header=F)
> data
V1 V2 V3
1 1 2 3
2 5 4 4
3 4 5 6
4 5 8 7
data sets in R
R itself contains some data sets which can be loaded directly. You
can use
> data()
to check available data sets in R.
For example, let us consider a data set named sleep. We can
load the data set by using
> data(sleep)
> sleep # check the data
extra group ID
1 0.7 1 1
2 -1.6 1 2
3 -0.2 1 3
...........
...........
20 3.4 2 10
> str(sleep)
'data.frame': 20 obs. of 3 variables:
$ extra: num 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
$ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ ID : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
The data show the effect of two drugs on 10 patients.
There are 20 observations and three variables.
The first variable is the increase in hours of sleep compared to control.
The second variable represents the type of drug given.
The third is the patient's ID.
extract or modify subsets of a data frame
> # two different ways to extract the first variable (column)
> sleep$extra
[1] 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0.0 2.0 1.9 0.8 1.1 0.1 -0.1
[16] 4.4 5.5 1.6 4.6 3.4
> sleep[,1]
[1] 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0.0 2.0 1.9 0.8 1.1 0.1 -0.1
[16] 4.4 5.5 1.6 4.6 3.4
> sleep[1,1]
[1] 0.7
> sleep[1,]
extra group ID
1 0.7 1 1
> # extract the measurements for the second drug.
> sleep[sleep$group=="2",1]
[1] 1.9 0.8 1.1 0.1 -0.1 4.4
[7] 5.5 1.6 4.6 3.4
> # extract the measurements for the fifth patient.
> sleep[sleep$ID=="5",1]
[1] -0.1 -0.1
> # extract the measurements for the third
> # patient when he took the second type of drug.
> sleep[(sleep$ID=="3")&(sleep$group=="2"),1]
[1] 1.1
> names(sleep) # the names of variables
[1] "extra" "group" "ID"
> names(sleep)=c("hours","Drug","Patient")
> sleep
hours Drug Patient
1 0.7 1 1
2 -1.6 1 2
3 -0.2 1 3
4 -1.2 1 4
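Logical indexing is often combined with summaries per group. As a sketch, the code below reloads the built-in sleep data with its original variable names and computes the mean increase in sleep for each drug with tapply; subset is an alternative way to write the row selection. The names m_by_group and g2 are illustrative.

```r
data(sleep)   # reload the built-in data set with its original names

# Mean of extra within each level of group.
m_by_group <- tapply(sleep$extra, sleep$group, mean)
print(m_by_group)

# subset() selects rows by a condition and columns by name.
g2 <- subset(sleep, group == "2", select = extra)
print(nrow(g2))
```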
If all elements of a data frame are numbers, the data frame can be
converted to a matrix.
> data=read.csv("ch.1.ex.5.csv",header=F)
> str(data)
'data.frame': 4 obs. of 3 variables:
$ V1: int 1 5 4 5
$ V2: int 2 4 5 8
$ V3: int 3 4 6 7
> X=as.matrix(data)
> str(X)
int [1:4, 1:3] 1 5 4 5 2 4 5 8 3 4 ...
> X
V1 V2 V3
[1,] 1 2 3
[2,] 5 4 4
[3,] 4 5 6
[4,] 5 8 7
Conversely, a matrix can be converted to a data frame.
> X=matrix(1:15,5,3)
> X
[,1] [,2] [,3]
[1,] 1 6 11
[2,] 2 7 12
[3,] 3 8 13
[4,] 4 9 14
[5,] 5 10 15
> str(X)
int [1:5, 1:3] 1 2 3 4 5 6 7 8 9 10 ...
> Y=data.frame(X)
> Y
X1 X2 X3
1 1 6 11
2 2 7 12
3 3 8 13
4 4 9 14
5 5 10 15
> str(Y)
'data.frame': 5 obs. of 3 variables:
$ X1: int 1 2 3 4 5
$ X2: int 6 7 8 9 10
$ X3: int 11 12 13 14 15
control flow
In addition to typing the commands after the prompt symbol, you
can write a group of commands in a text editor such as
Notepad, MS Word and so on, then copy and paste these
commands after the prompt symbol in R.
For example, I write the following commands in Notepad,
Then I copy and paste them to R
> data=read.csv("ch.1.ex.5.csv",header=F)
> X=as.matrix(data)
> colnames(X)=c("x1","x2","x3") # colnames(), not names(), sets the column names of a matrix
> print(X)
x1 x2 x3
[1,] 1 2 3
[2,] 5 4 4
[3,] 4 5 6
[4,] 5 8 7
Control flow specifies the order in which computations are
performed.
Some commonly used control-flow constructs: the if-else statement and
loops (including for and while statements).
If-else statement: Example: compute the absolute value of a. I
write the following commands in Notepad,
Then I copy and paste them to R
> a=1
> if(a>0)
+ {absolute=a
+ print(absolute)
+ }else
+ {absolute=-a
+ print(absolute)
+ }
[1] 1
> a=-8
> if(a>0)
+ {absolute=a
+ print(absolute)
+ }else
+ {absolute=-a
+ print(absolute)
+ }
[1] 8
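The if statement works on a single value. For a whole vector, a vectorised alternative is ifelse, and for this particular task base R already provides abs. A small sketch with made-up values:

```r
a <- c(1, -8, 0)
abs1 <- abs(a)                 # built-in absolute value
abs2 <- ifelse(a > 0, a, -a)   # vectorised form of the if-else statement
print(abs1)
print(abs2)
```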
Example 2
> x=-3
> if((x<0)|(x>2))
+ {f=0
+ print(f)
+ }else
+ {if(x<1)
+ {f=x
+ print(f)
+ }else
+ {f=2-x
+ print(f)
+ }
+ }
[1] 0
> x=8
> if((x<0)|(x>2))
+ {f=0
+ print(f)
+ }else
+ {if(x<1)
+ {f=x
+ print(f)
+ }else
+ {f=2-x
+ print(f)
+ }
+ }
[1] 0
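Instead of pasting the same block for every value of x, the nested if-else statement can be wrapped in a function. This is one possible sketch; the function name f and the test values are chosen only for illustration.

```r
# f(x) is 0 outside [0, 2], x on [0, 1) and 2 - x on [1, 2].
f <- function(x) {
  if ((x < 0) | (x > 2)) {
    0
  } else if (x < 1) {
    x
  } else {
    2 - x
  }
}

# sapply() applies f to each value in turn.
vals <- sapply(c(-3, 0.5, 1.5, 8), f)
print(vals)
```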
the for loop
> sum=0 #set the initial values of the sum
> for (i in 1:6)
+ {sum=sum+i
+ print(c("i=",i, "sum=",sum))
+ }
[1] "i=" "1" "sum=" "1"
[1] "i=" "2" "sum=" "3"
[1] "i=" "3" "sum=" "6"
[1] "i=" "4" "sum=" "10"
[1] "i=" "5" "sum=" "15"
[1] "i=" "6" "sum=" "21"
> sum=0 #set the initial values of the sum
> for (i in c(2,4,6,8,10))
+ {sum=sum+i
+ print(c("i=",i, "sum=",sum))
+ }
[1] "i=" "2" "sum=" "2"
[1] "i=" "4" "sum=" "6"
[1] "i=" "6" "sum=" "12"
[1] "i=" "8" "sum=" "20"
[1] "i=" "10" "sum=" "30"
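For simple accumulations like these, R offers vectorised functions that give the same results without an explicit loop. In the sketch below, total, running and msg are illustrative names; paste builds a single string, whereas c() in the loops above converts the numbers to a character vector.

```r
total   <- sum(1:6)                      # final value of the first loop
running <- cumsum(c(2, 4, 6, 8, 10))     # running sums of the second loop
msg     <- paste("i =", 6, "sum =", total)
print(total)
print(running)
print(msg)
```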
Data management
Data input has already been introduced. I will now introduce topics
including data output, missing data, data manipulation, and merging, combining,
and subsetting data sets.
save an R object: for example,
> X=matrix(1:15,3,5)
> X
[,1] [,2] [,3] [,4] [,5]
[1,] 1 4 7 10 13
[2,] 2 5 8 11 14
[3,] 3 6 9 12 15
> save(X,file="ex.RData")
> X=0
> X
[1] 0
> load("ex.RData")
> X
[,1] [,2] [,3] [,4] [,5]
[1,] 1 4 7 10 13
[2,] 2 5 8 11 14
[3,] 3 6 9 12 15
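Note that load restores the object under its original name and silently overwrites any existing object called X. An alternative, sketched below with a temporary file, is the pair saveRDS and readRDS, which stores a single object and lets you choose the name when reading it back. The names f and X2 are illustrative.

```r
X <- matrix(1:15, 3, 5)
f <- tempfile(fileext = ".rds")   # temporary file, used only for illustration

saveRDS(X, f)      # stores the object without its name
X2 <- readRDS(f)   # the caller chooses the new name
unlink(f)          # remove the temporary file

print(identical(X, X2))
```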
output data to external files
> data(sleep)
> str(sleep)
'data.frame': 20 obs. of 3 variables:
$ extra: num 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
$ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ ID : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
> sleep.1=sleep[sleep$group=="1",c(1,3)]
> sleep.1
extra ID
1 0.7 1
2 -1.6 2
3 -0.2 3
............
> write.table(sleep.1,file="sleep.1.dat")
Both the row names and the column names are written in the file.
If we do not want the row names, use row.names=F (similarly, col.names=F drops the column names):
> write.table(sleep.1,file="sleep.1.dat",row.names=F)
If we want the file to be saved as a CSV file, use
> write.csv(sleep.1,file="sleep.1.csv")
missing data
It is very common in practice that there are missing values in a
data set. For example, there are missing values denoted by *.
> X=read.table("ch.1.ex.6.dat",na.strings = "*")
> X
V1 V2 V3 V4 V5
1 1 2 3 4 NA
2 1 3 5 7 8
3 7 NA 9 0 1
In R, all missing values are denoted by NA, which can be
identified by the following command,
> x=X[,2]
> x
[1] 2 3 NA
> is.na(x)
[1] FALSE FALSE TRUE
> is.na(X)
V1 V2 V3 V4 V5
[1,] FALSE FALSE FALSE FALSE TRUE
[2,] FALSE FALSE FALSE FALSE FALSE
[3,] FALSE TRUE FALSE FALSE FALSE
> sum(is.na(X)) # the number of missing values
[1] 2
> mean(X[,1])
[1] 3
> mean(X[,2])
[1] NA
> # calculate the mean with the missing values excluded
> mean(X[,2],na.rm=TRUE)
[1] 2.5
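The same ideas extend to a whole data frame. The sketch below rebuilds the small data set shown above by hand (so it does not need the file ch.1.ex.6.dat) and uses base R functions to count missing values per column, compute column means, and keep only complete rows. The names n_missing, col_means and complete are illustrative.

```r
X <- data.frame(V1 = c(1, 1, 7), V2 = c(2, 3, NA), V3 = c(3, 5, 9),
                V4 = c(4, 7, 0), V5 = c(NA, 8, 1))

n_missing <- colSums(is.na(X))           # missing values in each column
col_means <- colMeans(X, na.rm = TRUE)   # column means ignoring NA
complete  <- X[complete.cases(X), ]      # rows without any NA
print(n_missing)
print(col_means)
print(complete)
```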
Example: write a function to replace the missing values in a data frame by the column mean, computed with the missing values excluded, of
the corresponding column. If a whole column is missing, we remove the column.
> missing.replace=function(X)
+ {
+ temp=NULL
+ for (i in 1:dim(X)[2])
+ {if(sum(is.na(X[,i]))==0)
+ {temp=cbind(temp,X[,i])
+ }else
+ {if(sum(!is.na(X[,i]))>0)
+ {y=X[,i]
+ y[is.na(X[,i])]=mean(y,na.rm=TRUE)
+ temp=cbind(temp,y)
+ }
+ }
+ }
+ temp=data.frame(temp)
+ temp
+ }
> X
V1 V2 V3 V4 V5
1 1 2 3 4 NA
2 1 3 5 7 8
3 7 NA 9 0 1
> missing.replace(X)
V1 y V3 V4 y.1
1 1 2.0 3 4 4.5
2 1 3.0 5 7 8.0
3 7 2.5 9 0 1.0
> Y=X
> Y[,2]=NA
> Y
V1 V2 V3 V4 V5
1 1 NA 3 4 NA
2 1 NA 5 7 8
3 7 NA 9 0 1
> missing.replace(Y)
V1 V2 V3 y
1 1 3 4 4.5
2 1 5 7 8.0
3 7 9 0 1.0
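In the output above, the original column names are partly lost because the result is rebuilt with cbind. A possible variant that modifies the data frame in place and therefore keeps the names is sketched below. The function name missing.replace2 and the small test data frame are hypothetical, and this is an alternative formulation rather than the function defined in the slides.

```r
missing.replace2 <- function(X) {
  # Keep only the columns with at least one observed value.
  keep <- colSums(!is.na(X)) > 0
  X <- X[, keep, drop = FALSE]
  # Replace the missing entries of each remaining column by its mean.
  for (j in seq_along(X)) {
    miss <- is.na(X[[j]])
    if (any(miss)) {
      X[[j]][miss] <- mean(X[[j]], na.rm = TRUE)
    }
  }
  X
}

X <- data.frame(V1 = c(1, 1, 7), V2 = c(2, 3, NA),
                V5 = c(NA, 8, 1), V6 = c(NA, NA, NA))
out <- missing.replace2(X)
print(out)
```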
merging data
the first data set
id year female inc
1 80 0 5000
1 81 0 5500
1 82 0 6000
2 80 1 2000
2 81 1 2200
2 82 1 3300
3 80 0 3000
3 81 0 2000
3 82 0 1000
the second given below.
id maxval
2 2400
1 1800
4 1900
The desired merged dataset would look like:
id year female inc maxval
1 81 0 5500 1800
1 80 0 5000 1800
1 82 0 6000 1800
2 82 1 3300 2400
2 80 1 2000 2400
2 81 1 2200 2400
3 82 0 1000 NA
3 80 0 3000 NA
3 81 0 2000 NA
4 NA NA NA 1900
Suppose that X and Y are the data frames for the two data sets, respectively.
> merge(X,Y,by="id")
id year female inc maxval
1 1 81 0 5500 1800
2 1 80 0 5000 1800
3 1 82 0 6000 1800
4 2 82 1 3300 2400
5 2 80 1 2000 2400
6 2 81 1 2200 2400
> merge(X,Y,by="id",all=T)
id year female inc maxval
1 1 81 0 5500 1800
2 1 80 0 5000 1800
3 1 82 0 6000 1800
4 2 82 1 3300 2400
5 2 80 1 2000 2400
6 2 81 1 2200 2400
7 3 82 0 1000 NA
8 3 80 0 3000 NA
9 3 81 0 2000 NA
10 4 NA NA NA 1900
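To try the merge without external files, the two data sets can be typed in directly. The sketch below rebuilds them from the tables above and also shows all.x = TRUE, which keeps every row of the first data frame only. The names inner, left and full are illustrative, and the row order of the result may differ from the listing above.

```r
X <- data.frame(id = rep(1:3, each = 3),
                year = rep(80:82, times = 3),
                female = rep(c(0, 1, 0), each = 3),
                inc = c(5000, 5500, 6000, 2000, 2200, 3300, 3000, 2000, 1000))
Y <- data.frame(id = c(2, 1, 4), maxval = c(2400, 1800, 1900))

inner <- merge(X, Y, by = "id")                 # ids present in both
left  <- merge(X, Y, by = "id", all.x = TRUE)   # keep every row of X
full  <- merge(X, Y, by = "id", all = TRUE)     # keep every row of both
print(c(nrow(inner), nrow(left), nrow(full)))
```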
graphical exploration of data
Example: This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length
and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa,
versicolor, and virginica.
> X=read.table("cell.1.dat")
> str(X)
'data.frame': 1460 obs. of 7 variables:
$ V1: int 0 0 0 0 0 1 1 1 1 1 ...
$ V2: int 100 100 100 100 100 100 100 100 100 100 ...
$ V3: int 0 0 0 0 0 0 0 0 0 0 ...
$ V4: int 0 0 0 0 0 0 0 0 0 0 ...
$ V5: int 0 0 0 0 0 0 0 0 0 0 ...
$ V6: int 0 0 0 0 0 10 6 12 16 8 ...
$ V7: int 1600 1600 1604 1590 1581 1568 1569 1577 1570 1577 ...
> Y=read.table("cell.2.dat")
> Z=read.table("cell.3.dat")
scatter plot
> plot(iris$Sepal.Length,iris$Petal.Length)
[Figure: scatter plot with iris$Sepal.Length on the x axis and iris$Petal.Length on the y axis]
If we want to change the labels of the x axis and the y axis and the title of the plot:
> plot(iris$Sepal.Length,iris$Petal.Length,xlab="Sepal Length"
+ ,ylab="Petal Length",main="The Scatter plot")
[Figure: scatter plot titled "The Scatter plot", with x axis "Sepal Length" and y axis "Petal Length"]
Matrix of scatterplots
If we want to draw the scatterplots of all pairs of the variables
> pairs(iris)
[Figure: matrix of pairwise scatterplots of Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species]
histogram
> hist(iris$Sepal.Length,nclass=20)
[Figure: histogram of iris$Sepal.Length, with Frequency on the y axis]
boxplot
> boxplot(iris$Sepal.Length[iris$Species=="setosa"],iris$Sepal.Length[iris$Species=="versicolor"],
+ iris$Sepal.Length[iris$Species=="virginica"],names=c("setosa", "versicolor", "virginica"),
+ main="Sepal Length of three species")
[Figure: box plots of sepal length for setosa, versicolor and virginica, titled "Sepal Length of three species"]
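The same kind of plot can be written more compactly with the formula interface of boxplot, which splits a variable by the levels of a factor. This is a sketch using the built-in iris data; the object b, which stores the statistics returned by boxplot, is an illustrative name.

```r
# Sepal.Length ~ Species draws one box per species.
b <- boxplot(Sepal.Length ~ Species, data = iris,
             main = "Sepal Length of three species")

# boxplot() also returns the group names and the five-number summaries.
print(b$names)
print(b$stats)
```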
Regression
Regression is the study of relationships or dependence between two
sets of variables, based on the data or observations made on
these variables.
Based on the relationships, we can predict one set of
variables from the values of the other set.
Example 1: Predict the price of a stock 6 months from now, on
the basis of company performance measures and economic data.
Example 2: Estimate the amount of glucose in the blood of a
diabetic person from the infrared absorption spectrum of that
person's blood.
Example 3: Examine the correlation between the level of
prostate-specific antigen and a number of clinical measures such
as cancer volume, prostate weight, age, and so on.
variables
Variables are quantitative or qualitative measurements of
characteristics of objects under our study. Typically, the values of
variables vary for different objects.
They are considered as random variables with some probability
distributions.
We will use the uppercase letters, such as X, Y, to denote
variables and the lowercase letters, x and y, to denote their
particular values.
In this course, for each regression model, the variables will be
classified into two categories:
independent variables, or inputs, or features, or predictors, or
regressors, or explanatory variables; typically denoted by $X_1, X_2, \ldots$.
dependent variables, or outputs, or responses, or outcome variables;
typically denoted by $Y_1, Y_2, \ldots$. In this course, we only consider
one response.
The partition into dependent and independent variables is obvious
in some data sets but may vary according to the purposes of the
study in others.
For example, suppose that we collected the temperature and
precipitation data of a city over a period. If we want to construct a
model to predict the temperature from the precipitation, then the
response is temperature and the predictor is precipitation.
However, if precipitation has to be predicted from temperature,
then the response is precipitation and the predictor is temperature.
variable
Example 1: $Y$ = price of the stock in 6 months from now,
$X_1$ = company performance measures and $X_2$ = economic data.
Example 2: $Y$ = the amount of glucose in the blood of a diabetic
person, $X$ = the infrared absorption spectrum of that person's
blood.
Example 3: $Y$ = level of prostate-specific antigen and $X_1$ = cancer
volume, $X_2$ = prostate weight, $X_3$ = age, and so on.
regression models
Goal: to construct a model from which, given any values of the
predictors $X_1, X_2, \ldots$, we can give a good guess of the value of
$Y$ such that the difference between the guess and the true value is
as small as possible. The guess is called the predicted value, denoted
by $\hat{Y}$.
deterministic system: given any values of $X_1, X_2, \ldots$, the value
of $Y$ is determined. That is, for any two observations
$x_i^{(1)}$ and $x_i^{(2)}$, $i = 1, 2, \ldots$, of $X_1, X_2, \ldots$, with true
values $y^{(1)}$ and $y^{(2)}$ of $Y$, if $x_i^{(1)} = x_i^{(2)}$ for all
$i = 1, 2, \ldots$, then $y^{(1)} = y^{(2)}$.
Mathematically, $Y$ can be considered as a function of
$X_1, X_2, \ldots$, that is, $Y = f(X_1, X_2, \ldots)$.
regression models
No randomness is involved in a deterministic system.
However, in practice, there are many factors affecting the
response, and we cannot find and measure all of them.
In statistics, we do not consider deterministic models. Instead, we
consider models with randomness.
These models are more flexible and more appropriate.
regression models
Specifically, consider
$Y = f(X_1, X_2, \ldots) + \varepsilon$,
where $f$ is an (unknown) function of the variables $X_1, X_2, \ldots$,
which are of interest and easy to observe or measure.
The term $\varepsilon$ is a random variable that accounts for the
randomness due to unobserved factors or variables, noise, or
errors.
A standard assumption is that $\varepsilon$ is independent of
$X_1, X_2, \ldots$ and that its expectation is 0.
Then we have $E(Y \mid X_1, X_2, \ldots) = f(X_1, X_2, \ldots)$.
More assumptions will be imposed in the following classes.
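The identity for the conditional expectation follows in a few steps from the two assumptions on the error term. A short derivation, writing $X$ as shorthand (introduced here, not used in the slides) for $(X_1, X_2, \ldots)$ and $\varepsilon$ for the error term:

```latex
\begin{align*}
E(Y \mid X) &= E\bigl(f(X) + \varepsilon \mid X\bigr) \\
            &= f(X) + E(\varepsilon \mid X) && \text{($f(X)$ is fixed given $X$)} \\
            &= f(X) + E(\varepsilon)        && \text{($\varepsilon$ is independent of $X$)} \\
            &= f(X)                         && \text{($E(\varepsilon) = 0$)}
\end{align*}
```

So $f$ describes the average value of the response at each setting of the predictors.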
linear regression models
The function $f(X_1, X_2, \ldots)$ is typically unknown and does not
have a simple form.
In this course, we will assume the function is linear, that is, if we
have $p$ explanatory variables, then
$f(X_1, X_2, \ldots, X_p) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p$,
where $\beta_0, \beta_1, \ldots, \beta_p$ are coefficients, which are
typically unknown parameters. Estimating them from data is a main
topic of the course.
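As a preview of estimation, the following sketch simulates data from a linear model with p = 2 predictors and estimates the coefficients with lm(). The true coefficient values (1, 2, -3), the sample size and the error standard deviation are assumptions chosen only for illustration.

```r
# Simulate from Y = 1 + 2*X1 - 3*X2 + error (coefficients chosen for illustration)
set.seed(2013)
n   <- 200
x1  <- rnorm(n)
x2  <- rnorm(n)
eps <- rnorm(n, mean = 0, sd = 0.5)   # error term with expectation 0
y   <- 1 + 2 * x1 - 3 * x2 + eps

# Estimate the coefficients from the simulated data
fit      <- lm(y ~ x1 + x2)
beta_hat <- coef(fit)
beta_hat   # estimates of the intercept and the two slopes
```

The estimates should be close to, but not exactly equal to, the values used in the simulation, because they are computed from a finite random sample.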
Why do we assume f is a linear function?
linear regression models
The reasons we assume $f$ is a linear function:
Since $f$ can be a very general function, it is hopeless to find the
exact form of $f$ from a limited sample.
Hence, an approximation to $f$ is desirable. Actually, the linear
function is the first-order approximation to any smooth function over
a range of $X_1, X_2, \ldots$ that is not too large.
Although there are other, better approximations, computation is
a very important consideration, especially in the pre-computer age
of statistics. It is much easier to perform analysis with linear
models than with other models.
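The first-order approximation can be written out explicitly. This is a sketch assuming $f$ is differentiable at a point $x_0 = (x_{0,1}, \ldots, x_{0,p})$; the expansion point $x_0$ is introduced here for illustration and does not appear in the slides.

```latex
\begin{align*}
f(x_1, \ldots, x_p)
  &\approx f(x_0) + \sum_{j=1}^{p} \frac{\partial f}{\partial x_j}(x_0)\,(x_j - x_{0,j}) \\
  &= \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p,
\end{align*}
where
\[
\beta_j = \frac{\partial f}{\partial x_j}(x_0), \quad j = 1, \ldots, p,
\qquad
\beta_0 = f(x_0) - \sum_{j=1}^{p} \beta_j x_{0,j}.
\]
```

The approximation error shrinks as the predictors approach $x_0$, which is why the range of the predictors should not be too large.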
linear regression models
Even in today's computer era there are still good reasons to study
and use them. They are simple and often provide an adequate
and interpretable description of how the inputs affect the output.
For prediction purposes they can sometimes outperform fancier
nonlinear models, especially in situations with small numbers of
training cases, low signal-to-noise ratio or sparse data.
Finally, linear methods can be applied to transformations of the
inputs and this considerably expands their scope.
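The point about transformations can be illustrated with a small simulation. In this sketch the response depends on log(x) rather than on x itself, yet the model is still linear in its coefficients and can be fitted with lm(). The data-generating model and all numbers are assumptions made for illustration.

```r
# Simulated data where the response is linear in log(x), not in x
set.seed(42)
x <- runif(100, min = 1, max = 10)
y <- 2 + 3 * log(x) + rnorm(100, sd = 0.2)

fit_raw <- lm(y ~ x)        # linear in the original input
fit_log <- lm(y ~ log(x))   # linear in the transformed input

summary(fit_raw)$r.squared  # fit quality using x
summary(fit_log)$r.squared  # fit quality using log(x)
coef(fit_log)               # estimated intercept and slope on log(x)
```

Both fits are linear regressions; only the input passed to the model changes.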
linear regression models
simple linear regression model: there is only one predictor $X$.
multiple linear regression model: there is more than one
predictor. Matrices will be involved. Review the basics
of matrices: definition, operations, solving linear
equations, and so on.
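For the matrix review, the basic operations are available directly in R. A minimal sketch, using a small matrix A and vector b chosen here only as an example:

```r
# Define a 2 x 2 matrix (filled row by row) and a right-hand side vector
A <- matrix(c(2, 1,
              1, 3), nrow = 2, byrow = TRUE)
b <- c(3, 5)

t(A)        # transpose
A %*% A     # matrix product (note: A * A would multiply elementwise)
solve(A)    # inverse of A

# Solve the linear system A x = b
x <- solve(A, b)
x
A %*% x     # check: reproduces b
```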