Linear Regression Chp1

Linear Statistical Analysis I
Fall 2013
Outline
Introduction to R statistical software
Introduction to regression and model building
R language
R is a popular statistical software package especially suitable for
data analysis and graphical representation.
It is free, open-source software.
R can be downloaded from
http://cran.us.r-project.org/ where you can also find useful
information about R. Note that there are different versions for
different operating systems such as Windows, Linux, and MacOS.
You can find online tutorials by googling the keywords "R language tutorial".
R language
Open R.
Type commands after the prompt symbol >.
For example, to compute 3/(9*4-5) and the square root of 9:
> 3/(9*4-5)
[1] 0.09677419
> sqrt(9)
[1] 3
If you want to make comments on the commands, use the symbol
#. The expression after this symbol will not be executed.
> 3/(9*4-5) # This is a comment.
[1] 0.09677419
> sqrt(9) # compute the square root of 9.
[1] 3
workspace and working directory
The workspace is the place in your computer where R reads or
saves data or R objects.
The directory of the workspace is the working directory.
Find the current working directory
> getwd()
[1] "F:/Teaching/2013_spring_Computing"
change the current working directory
> setwd("F:/Teaching")
> getwd()
[1] "F:/Teaching"
Alternatively, you can use the File menu in the Console window of R.
If there is an error when you input data, first check whether the
data file is in the working directory.
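The check described above can also be done from inside R. The following is a minimal sketch; the file name is only a placeholder for the file you are trying to read, and file.exists() and list.files() are base R functions.

```r
# Hypothetical file name; replace it with the file you want to read.
fname <- "ch.1.ex.1.dat"

# Full path R looks at when only the bare file name is given.
full.path <- file.path(getwd(), fname)
print(full.path)

# TRUE if the file is in the working directory, FALSE otherwise.
found <- file.exists(fname)
print(found)

# List the files that are actually in the working directory.
print(list.files(getwd()))
```

If found is FALSE, either move the file or change the working directory with setwd().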
arithmetic expressions and variables
arithmetic expressions: type expressions after the prompt symbol.
R will evaluate their values.
> 2+3
[1] 5
> 3-4
[1] -1
> 3*5
[1] 15
> 3/5
[1] 0.6
> 13%%5 # compute the remainder of 13 divided by 5
[1] 3
> 13%/%5 #the integer quotient of 13 divided by 5
[1] 2
> 3^2
[1] 9
> sqrt(9)
[1] 3
> log(3)
[1] 1.098612
> log2(2)
[1] 1
> log10(100)
[1] 2
> exp(3)
[1] 20.08554
> factorial(3) #compute 3!
[1] 6
> log((1+exp(5)))*3^(-2)
[1] 0.5563017
variable: the name is case sensitive, that is, upper and lower case
letters are distinct.
assignment operator <-
> x<-13+exp(2) # assign the value of the
# expression to the variable x
> x # check the value of x
[1] 20.38906
equivalently, you can use
> x=13+exp(2)
> x
[1] 20.38906
You can include variables in expressions; however, the values of
the variables must be assigned before the expressions are evaluated.
> x=2+3
> (2+3)*4
[1] 20
> x*4
[1] 20
> y*4
Error: object 'y' not found
> y=x^2
> y
[1] 25
Relational and Logical Operators
logical values: TRUE (can be denoted by T) and FALSE
(denoted by F).
We can apply the arithmetic operators to the logical values. In this
case, TRUE is identified with 1 and FALSE is identified with 0.
> TRUE+TRUE
[1] 2
> TRUE+FALSE
[1] 1
> TRUE*FALSE
[1] 0
> x=TRUE
> x
[1] TRUE
> y=FALSE
> y-x
[1] -1
The relational operators are >, <, >=, <=, == (equal to) and != (not equal to).
A logical expression has two possible values, TRUE or FALSE.
> 1>2
[1] FALSE
> 1<2
[1] TRUE
> 1<=2
[1] TRUE
> 1>=2
[1] FALSE
> (1+3)==4
[1] TRUE
> 3*2!=6
[1] FALSE
> x=(3*2!=6)
> x
[1] FALSE
logical operators: can be used to connect two or more logical
expressions
> (1>2)&((1+3)==4) # "&" means "and". The combined
# expression is true only if
# both the two expressions are true.
[1] FALSE
> (1<2)&((1+3)==4)
[1] TRUE
> TRUE&FALSE
[1] FALSE
> (1>2)|((1+3)==4) # "|" means "or". The combined
# expression is true if and only if
# at least one of the two
# expressions is true.
[1] TRUE
> (1<2)|((1+3)==4)
[1] TRUE
> (1>2)|((1+3)==5)
[1] FALSE
> TRUE|FALSE
[1] TRUE
> ((1+2)==1)
[1] FALSE
> !((1+2)==1) #"!" means "NOT".
[1] TRUE
> !TRUE
[1] FALSE
> !FALSE
[1] TRUE
basic data types in R
vector: concatenate several numbers or vectors into a vector by
using the function c()
> c(1,3,2)
[1] 1 3 2
> c(c(1,2),c(3,5),c(3,0,0))
[1] 1 2 3 5 3 0 0
> x=c(3,4,9)
> x
[1] 3 4 9
> y=c(x,x,c(0,1,2))
> y
[1] 3 4 9 3 4 9 0 1 2
Special vectors:
> 1:9
[1] 1 2 3 4 5 6 7 8 9
> 5:1
[1] 5 4 3 2 1
> rep(0.5,6) # repeat 0.5 six times
[1] 0.5 0.5 0.5 0.5 0.5 0.5
> rep(-3,8)
[1] -3 -3 -3 -3 -3 -3 -3 -3
> seq(0,1, length.out=11) # get 11 numbers from
#0 to 1 with equal space.
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
> seq(1,2,length.out=21)
[1] 1.00 1.05 1.10 1.15 1.20 1.25 1.30 1.35
[9] 1.40 1.45 1.50 1.55 1.60 1.65 1.70 1.75
[17] 1.80 1.85 1.90 1.95 2.00
logical vector: consists of logical values
> x=c(T,T,F,F) #TRUE can be denoted by T
> x
[1] TRUE TRUE FALSE FALSE
> x=1:5
> x==1 # each component of x is compared with 1
[1] TRUE FALSE FALSE FALSE FALSE
> y=(x!=1)
> y
[1] FALSE TRUE TRUE TRUE TRUE
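Logical vectors can be combined and summarized as well. The sketch below uses a made-up vector x; any(), all() and sum() are base R functions, and sum() counts the TRUE values because TRUE is treated as 1.

```r
x <- c(3, 2, 5, 5, 7, 8)

# "&" works element by element on logical vectors.
in.range <- (x > 2) & (x < 7)
print(in.range)

# Number of components that satisfy the condition.
print(sum(in.range))

# any() is TRUE if at least one component is TRUE;
# all() is TRUE only if every component is TRUE.
print(any(x > 7))
print(all(x > 1))
```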
extract or modify subsets of a vector by using index vector
> x=c(3,2,5,5,7,8)
> x[c(1,3,5)]# extract the 1,3,5th numbers in x
[1] 3 5 7
> y=x[c(3,1,5)]
> y
[1] 5 3 7
> z=x[-c(2,4)]# remove the 2,4th numbers in x
> z
[1] 3 5 7 8
> x[1:3]
[1] 3 2 5
> x[1:3]=c(1,2,3)
> x
[1] 1 2 3 5 7 8
> x[c(1,3,5)]=0 # set the 1st, 3rd and 5th numbers in x to 0
> x
[1] 0 2 0 5 0 8
find all distinct values in a vector
> x=c(3,2,5,5,7,8,7,8,9)
> unique(x)
[1] 3 2 5 7 8 9
extract or modify subsets of a vector by using a logical vector
> x=c(3,2,5,5,7,8,3,6,9) # we will extract the
# numbers larger than 4
> index=(x>4)
> index
[1] FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
> x[index] # only values corresponding to TRUE
# in "index" will be selected
[1] 5 5 7 8 6 9
> x=1:10 # extract numbers less than or equal
# to 6 and larger than 3
> x[(x<=6)&(x>3)]
[1] 4 5 6
> x=c(3,2,5,5,7,8,3,6,9)
> y=c(1,3,5,7,6,4,2,3,5)
> #we extract the numbers in
> #the positions corresponding to y>4
> x[y>4]
[1] 5 5 7 9
extract the indices of the numbers in a vector that satisfy
specific conditions.
> x=c(T,T,F,T,F,F,T,T,T,F)
> which(x)
[1] 1 2 4 7 8 9
> x=c(3,2,5,5,7,8,3,6,9)
> which(x>5)
[1] 5 6 8 9
> x[which(x>5)]
[1] 7 8 6 9
operations on vectors
The elementary arithmetic operators +, -, *, /, ^ (for raising to a
power), and functions such as log, exp, sin, cos, tan, sqrt and so on, can be applied to
vectors in an element-by-element sense.
> x=1:5
> x
[1] 1 2 3 4 5
> y=10:15
> y
[1] 10 11 12 13 14 15
> y-x
[1] 9 9 9 9 9 14
Warning message:
In y - x : longer object length is not
a multiple of shorter object length
> x
[1] 1 2 3 4 5
> y=11:15 #x and y should have the same lengths.
> y-x
[1] 10 10 10 10 10
> 2*x
[1] 2 4 6 8 10
> x^2
[1] 1 4 9 16 25
> y/x
[1] 11.000000 6.000000 4.333333 3.500000 3.000000
> x+1
[1] 2 3 4 5 6
> x*y
[1] 11 24 39 56 75
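The warning seen earlier for y-x comes from R's recycling rule: when two vectors differ in length, the shorter one is repeated until it matches the longer one. The sketch below uses made-up vectors whose lengths divide evenly, in which case R recycles without a warning.

```r
long  <- 1:6
short <- c(10, 20)

# short is recycled to 10 20 10 20 10 20 before the addition.
recycled.sum <- long + short
print(recycled.sum)

# The same result, with the recycling written out explicitly.
explicit.sum <- long + rep(short, times = 3)
print(explicit.sum)

# A single number is a vector of length 1, so long + 1 is also recycling.
print(long + 1)
```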
get the length, minimum, maximum, mean, variance of a vector
> z=c(43, 45, 5, 44, 767, 57, 68,33,111)
> length(z)
[1] 9
> min(z)
[1] 5
> max(z)
[1] 767
> which(z==min(z))
[1] 3
> which(z==max(z))
[1] 5
> mean(z)
[1] 130.3333
> var(z)
[1] 57815.75
> sd(z)
[1] 240.4491
sort the numbers in a vector in an increasing or decreasing order
> z=c(43, 45, 5, 44, 767, 57, 68,33,111)
> sort(z)
[1] 5 33 43 44 45 57 68 111 767
> sort(z,decreasing = T)
[1] 767 111 68 57 45 44 43 33 5
> y=sort(z,index.return = T)
> y
$x
[1] 5 33 43 44 45 57 68 111 767
$ix
[1] 3 8 1 4 2 6 7 9 5
> y$x
[1] 5 33 43 44 45 57 68 111 767
> z[y$ix]
[1] 5 33 43 44 45 57 68 111 767
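The index vector y$ix can also be obtained directly with the base R function order(). A short sketch follows; the vector label is made up to show how the ordering of one vector can rearrange another.

```r
z <- c(43, 45, 5, 44, 767, 57, 68, 33, 111)

# order() returns the indices that put z in increasing order,
# the same indices as sort(z, index.return = TRUE)$ix.
idx <- order(z)
print(idx)
print(z[idx])

# Use the ordering of z to rearrange another vector of the same length.
label <- c("a", "b", "c", "d", "e", "f", "g", "h", "i")
print(label[idx])

# Indices for decreasing order.
print(order(z, decreasing = TRUE))
```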
matrix
we start with some simple matrices
> matrix(1.5, nrow=2,ncol=3)
     [,1] [,2] [,3]
[1,]  1.5  1.5  1.5
[2,]  1.5  1.5  1.5
> diag(3.3,3)
[,1] [,2] [,3]
[1,] 3.3 0.0 0.0
[2,] 0.0 3.3 0.0
[3,] 0.0 0.0 3.3
> y=diag(-2,3,4)
> y
[,1] [,2] [,3] [,4]
[1,] -2 0 0 0
[2,] 0 -2 0 0
[3,] 0 0 -2 0
form a matrix from a vector
> x=matrix(c(1,2,3,4,5,6), 2,3)
> x
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
> matrix(c(1,2,3,4,5,6), 3,2)
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
> matrix(c(1,2,3,4,5,6), 3,2,byrow=T)
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
form a matrix by combining vectors
> x=c(1,2,3)
> y=c(4,5,6)
> z=c(7,8,9)
> cbind(x,y,z)#combine three vectors as columns
x y z
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
> rbind(x,y,z)#combine three vectors as rows
[,1] [,2] [,3]
x 1 2 3
y 4 5 6
z 7 8 9
form a matrix by combining matrices
> w=cbind(x,y)
> w
x y
[1,] 1 4
[2,] 2 5
[3,] 3 6
> v=cbind(z,y)
> v
z y
[1,] 7 4
[2,] 8 5
[3,] 9 6
> cbind(w,v)
x y z y
[1,] 1 4 7 4
[2,] 2 5 8 5
[3,] 3 6 9 6
> rbind(w,v)
x y
[1,] 1 4
[2,] 2 5
[3,] 3 6
[4,] 7 4
[5,] 8 5
[6,] 9 6
extract or modify subsets of a matrix
> x=matrix(1:15,3,5)
> x
[,1] [,2] [,3] [,4] [,5]
[1,] 1 4 7 10 13
[2,] 2 5 8 11 14
[3,] 3 6 9 12 15
> x[2,4]# the number in the second row and the fourth column.
[1] 11
> y=x[1,]
># extract the first row
># which is converted to a vector.
> y
[1] 1 4 7 10 13
> str(y)
int [1:5] 1 4 7 10 13
> z=x[,2]
># extract the second column
> z
[1] 4 5 6
> str(z)
int [1:3] 4 5 6
> w=x[1:2,1:3]
> w
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
> v=x[c(1,3),]
> v
[,1] [,2] [,3] [,4] [,5]
[1,] 1 4 7 10 13
[2,] 3 6 9 12 15
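As noted above, a single row or column is converted to a vector when it is extracted. If the result should stay a matrix, the argument drop = FALSE of the indexing operator can be used. A minimal sketch:

```r
x <- matrix(1:15, 3, 5)

# Default: a single row is returned as a plain vector.
r1 <- x[1, ]
print(dim(r1))          # a vector has no dim attribute

# With drop = FALSE the result stays a 1 x 5 matrix.
r1.mat <- x[1, , drop = FALSE]
print(dim(r1.mat))

# The same holds for a single column: a 3 x 1 matrix.
c2.mat <- x[, 2, drop = FALSE]
print(dim(c2.mat))
```

This matters later when a one-row or one-column extract is used in matrix multiplication.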
extract irregularly distributed subsets of a matrix
> x=matrix(1:15,3,5)
> x
[,1] [,2] [,3] [,4] [,5]
[1,] 1 4 7 10 13
[2,] 2 5 8 11 14
[3,] 3 6 9 12 15
> index=rbind(c(2,2),c(2,4),c(3,5))
> index
     [,1] [,2]
[1,]    2    2
[2,]    2    4
[3,]    3    5
> x[index] # each row of index gives the (row, column) position of one element
[1] 5 11 15
> x[index]=0
> x
[,1] [,2] [,3] [,4] [,5]
[1,] 1 4 7 10 13
[2,] 2 0 8 0 14
[3,] 3 6 9 12 0
> which(x==0, arr.ind = T)
row col
[1,] 2 2
[2,] 2 4
[3,] 3 5
> x[which(x==0, arr.ind = T)]
[1] 0 0 0
operations on matrices
The elementary arithmetic operators +, -, *, /, ^ (for raising to a
power), and functions such as log, exp, sin, cos, tan, sqrt and so on, can be applied to
matrices in an element-by-element sense.
> x=matrix(1:6,2,3)
> x
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
> sqrt(x)
[,1] [,2] [,3]
[1,] 1.000000 1.732051 2.236068
[2,] 1.414214 2.000000 2.449490
> y=matrix(11:16,2,3)
> y
[,1] [,2] [,3]
[1,] 11 13 15
[2,] 12 14 16
> x+y
[,1] [,2] [,3]
[1,] 12 16 20
[2,] 14 18 22
transpose of a matrix
> y=matrix(11:16,2,3)
> y
[,1] [,2] [,3]
[1,] 11 13 15
[2,] 12 14 16
> t(y)
[,1] [,2]
[1,] 11 12
[2,] 13 14
[3,] 15 16
> y=1:3 # vector: a special matrix with one column.
> t(y)
[,1] [,2] [,3]
[1,] 1 2 3
convert a matrix to a vector
> x=matrix(1:6,2,3)
> x
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
> as.vector(x)
[1] 1 2 3 4 5 6
> as.vector(t(x))
[1] 1 3 5 2 4 6
matrix multiplication
> A=matrix(1:6,3,2)
> A
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
> B=matrix(11:16,2,3)
> B
[,1] [,2] [,3]
[1,] 11 13 15
[2,] 12 14 16
> A%*%B # different from A*B
[,1] [,2] [,3]
[1,] 59 69 79
[2,] 82 96 110
[3,] 105 123 141
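To see how %*% differs from *, the following sketch reuses the matrices A and B from above. The names AB, elementwise and result are introduced only for this illustration.

```r
A <- matrix(1:6, 3, 2)
B <- matrix(11:16, 2, 3)

# Matrix product: (3 x 2) times (2 x 3) gives a 3 x 3 matrix.
AB <- A %*% B
print(dim(AB))

# The (1,1) entry is the first row of A times the first column of B.
print(sum(A[1, ] * B[, 1]))

# The element-wise product needs equal dimensions, so use t(B), which is 3 x 2.
elementwise <- A * t(B)
print(elementwise)

# A * B itself is an error because the dimensions differ.
result <- try(A * B, silent = TRUE)
print(inherits(result, "try-error"))
```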
Linear equations and inversion of a matrix
For example, to solve the following equations
2x+y+3z=1
3x-y+z=4
x+y+z=2
> A=cbind(c(2,3,1),c(1,-1,1),c(3,1,1))
> A # the matrix of coefficients
[,1] [,2] [,3]
[1,] 2 1 3
[2,] 3 -1 1
[3,] 1 1 1
> a=c(1,4,2)
> solve(A,a)
[1] 2.333333 1.333333 -1.666667
> v=solve(A,a)
> A%*%v
[,1]
[1,] 1
[2,] 4
[3,] 2
> A%*%v-a # the difference is zero up to rounding error
[,1]
[1,] 0.000000e+00
[2,] -4.440892e-16
[3,] 4.440892e-16
> solve(A) # the inverse of A
[,1] [,2] [,3]
[1,] -0.3333333 0.3333333 0.6666667
[2,] -0.3333333 -0.1666667 1.1666667
[3,] 0.6666667 -0.1666667 -0.8333333
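The two uses of solve() above can be checked against each other. The sketch below relies only on the base R functions all.equal(), diag() and det(); all.equal() compares numbers up to a small tolerance, which is appropriate given the rounding errors shown above.

```r
A <- cbind(c(2, 3, 1), c(1, -1, 1), c(3, 1, 1))
a <- c(1, 4, 2)

A.inv <- solve(A)

# Multiplying by the inverse gives the same solution as solve(A, a).
v1 <- solve(A, a)
v2 <- as.vector(A.inv %*% a)
print(all.equal(v1, v2))

# A times its inverse is the identity matrix, up to rounding error.
print(all.equal(A %*% A.inv, diag(3)))

# The determinant is nonzero, which is why the inverse exists.
print(det(A))
```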
character vector and factor
A character vector is a vector whose components are character strings
instead of numbers. For example, we construct vectors of names and
genders of five persons in one office.
> names=c("Emily","John", "Lily","Grace","William")
> str(names)
chr [1:5] "Emily" "John" "Lily" "Grace" "William"
> gender=c("F","M","F","F","M")
> str(gender)
chr [1:5] "F" "M" "F" "F" "M"
In the vector gender, there are two categories: male and female,
labeled by "M" and "F". Sometimes, we are interested in
information such as how many elements are in each category. In this case,
we can convert the character vector to a factor.
> gender=as.factor(gender)
> gender
[1] F M F F M
Levels: F M
> levels(gender)
[1] "F" "M"
Different levels denote different categories.
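Once gender is a factor, the number of elements in each category can be counted with the base R function table(). A minimal sketch using the same five values:

```r
gender <- as.factor(c("F", "M", "F", "F", "M"))

# Number of elements in each category.
counts <- table(gender)
print(counts)

# Proportion of elements in each category.
print(counts / length(gender))

# Internally each level is stored as an integer code: F is 1, M is 2.
print(as.integer(gender))
```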
data frame
The data frame is one of the most important data types in R.
It is similar to a matrix: its elements are organized in rows and
columns.
However, in a matrix, all elements must be of the same type, for example all numbers.
In a data frame, it is often the case that some
columns are numbers and the others are not (for example, character strings or factors).
When R reads data from an external data file with read.table or read.csv, R saves the
data as a data frame.
It is not efficient to type the data directly into R if the data size is
large.
We can read the data from a file where the data is saved.
It is important that the file is in the working directory. Otherwise, R
cannot find the file, unless you tell R where the file is by giving its full path.
input data from a file
The data has been stored in a file called ch.1.ex.1.dat.
Use the following command to input the data:
> data=read.table("ch.1.ex.1.dat")
> str(data)
'data.frame': 4 obs. of 3 variables:
$ V1: int 1 2 4 5
$ V2: int 2 4 5 6
$ V3: int 3 7 7 8
> data
V1 V2 V3
1 1 2 3
2 2 4 7
3 4 5 7
4 5 6 8
The data has been stored in a file called ch.1.ex.2.dat. The
numbers are separated by commas.
> data=read.table("ch.1.ex.2.dat",sep=",")
> str(data)
$ V1: int 1 2 4 5
$ V2: int 2 4 5 6
$ V3: int 3 7 7 8
> data
V1 V2 V3
1 1 2 3
2 2 4 7
3 4 5 7
4 5 6 8
If the first line of the file contains the names of the variables, use header=T.
> data=read.table("ch.1.ex.3.dat",header=T)
> str(data)
$ A: int 1 2 4 5
$ B: int 2 4 5 6
$ C: int 3 7 7 8
> data
A B C
1 1 2 3
2 2 4 7
3 4 5 7
4 5 6 8
input data from a csv file
The data has been stored in a csv file (a format that Excel can save) called ch.1.ex.4.csv.
> data=read.csv("ch.1.ex.4.csv")
> str(data)
$ A: int 1 5 4 5
$ B: int 2 4 5 8
$ C: int 3 4 6 7
> data
A B C
1 1 2 3
2 5 4 4
3 4 5 6
4 5 8 7
The data has been stored in a csv file called ch.1.ex.5.csv, which has no header line.
> data=read.csv("ch.1.ex.5.csv",header=F)
> data
V1 V2 V3
1 1 2 3
2 5 4 4
3 4 5 6
4 5 8 7
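The files used above are not included here, so the sketch below first writes a small csv file to a temporary location and then reads it back. The file contents mirror ch.1.ex.4.csv as printed above; tempfile(), writeLines() and unlink() are base R functions.

```r
# Write a small comma-separated file with a header line to a temporary
# location, so the example does not depend on any existing file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("A,B,C", "1,2,3", "5,4,4", "4,5,6", "5,8,7"), tmp)

# read.csv() assumes a header line and commas by default.
d1 <- read.csv(tmp)
str(d1)

# read.table() reads the same data once header and sep are set.
d2 <- read.table(tmp, header = TRUE, sep = ",")
print(all.equal(d1, d2))

unlink(tmp)  # remove the temporary file
```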
data sets in R
R itself contains some data sets which can be loaded directly. You
can use
> data()
to check available data sets in R.
For example, let us consider a data set named sleep. We can
load the data set by using
> data(sleep)
> sleep # check the data
extra group ID
1 0.7 1 1
2 -1.6 1 2
3 -0.2 1 3
...........
...........
20 3.4 2 10
> str(sleep)
$ extra: num 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
$ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ ID : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
The data show the effect of two drugs on 10 patients.
There are 20 observations and three variables.
The first variable is the increase in hours of sleep compared to control.
The second variable represents the type of drug given.
The third is the patient's ID.
extract or modify subsets of a data frame
> # two different ways to extract the first variable (column)
> sleep$extra
[1] 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0.0 2.0 1.9 0.8 1.1 0.1 -0.1
[16] 4.4 5.5 1.6 4.6 3.4
> sleep[,1]
[1] 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0.0 2.0 1.9 0.8 1.1 0.1 -0.1
[16] 4.4 5.5 1.6 4.6 3.4
> sleep[1,1]
[1] 0.7
> sleep[1,]
extra group ID
1 0.7 1 1
> # extract the measurements for the second drug.
> sleep[sleep$group=="2",1]
[1] 1.9 0.8 1.1 0.1 -0.1 4.4
[7] 5.5 1.6 4.6 3.4
> # extract the measurements for the fifth patient.
> sleep[sleep$ID=="5",1]
[1] -0.1 -0.1
> # extract the measurements for the third
> # patient when he took the second type of drug.
> sleep[(sleep$ID=="3")&(sleep$group=="2"),1]
[1] 1.1
> names(sleep) # the names of variables
[1] "extra" "group" "ID"
> names(sleep)=c("hours","Drug","Patient")
> sleep
hours Drug Patient
1 0.7 1 1
2 -1.6 1 2
3 -0.2 1 3
4 -1.2 1 4
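The same extractions can be written with the base R function subset(), which some find easier to read. This sketch reloads sleep so that the original variable names are used; the names drug2 and drug2.b are introduced only for illustration.

```r
data(sleep)  # reload the data set with its original variable names

# Rows for the second drug, keeping only the extra and ID columns.
drug2 <- subset(sleep, group == "2", select = c(extra, ID))
print(drug2)

# The same extraction with logical indexing.
drug2.b <- sleep[sleep$group == "2", c("extra", "ID")]
print(all.equal(drug2, drug2.b))

# Mean increase in hours of sleep for each drug.
print(tapply(sleep$extra, sleep$group, mean))
```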
If all elements of a data frame are numbers, the data frame can be
converted to a matrix.
> str(data)
$ V1: int 1 5 4 5
$ V2: int 2 4 5 8
$ V3: int 3 4 6 7
> X=as.matrix(data)
> str(X)
int [1:4, 1:3] 1 5 4 5 2 4 5 8 3 4 ...
> X
V1 V2 V3
[1,] 1 2 3
[2,] 5 4 4
[3,] 4 5 6
[4,] 5 8 7
Conversely, a matrix can be converted to a data frame.
> X=matrix(1:15,5,3)
> X
[,1] [,2] [,3]
[1,] 1 6 11
[2,] 2 7 12
[3,] 3 8 13
[4,] 4 9 14
[5,] 5 10 15
> str(X)
int [1:5, 1:3] 1 2 3 4 5 6 7 8 9 10 ...
> Y=data.frame(X)
> Y
X1 X2 X3
1 1 6 11
2 2 7 12
3 3 8 13
4 4 9 14
5 5 10 15
> str(Y)
$ X1: int 1 2 3 4 5
$ X2: int 6 7 8 9 10
$ X3: int 11 12 13 14 15
control flow
In addition to typing the commands after the prompt symbol, you
can write a group of commands in a text editor such as
Notepad or MS Word, then copy and paste these
commands after the prompt symbol in R.
For example, I write the following commands in Notepad,
then I copy and paste them into R:
> X=as.matrix(data)
> colnames(X)=c("x1","x2","x3") # rename the columns of the matrix
> print(X)
x1 x2 x3
[1,] 1 2 3
[2,] 5 4 4
[3,] 4 5 6
[4,] 5 8 7
Control flow specifies the order in which computations are
performed.
Some commonly used control-flow constructs: the if-else statement and
loops (including for and while statements).
If-else statement. Example: compute the absolute value of a. I
write the following commands in Notepad,
then I copy and paste them into R:
> a=1
> if(a>0)
+ {absolute=a
+ print(absolute)
+ }else
+ {absolute=-a
+ print(absolute)
+ }
[1] 1
> a=-8
> if(a>0)
+ {absolute=a
+ print(absolute)
+ }else
+ {absolute=-a
+ print(absolute)
+ }
[1] 8
Example 2: compute f(x), where f(x)=0 if x<0 or x>2, f(x)=x if 0<=x<1, and f(x)=2-x if 1<=x<=2.
> x=-3
> if((x<0)|(x>2))
+ {f=0
+ print(f)
+ }else
+ {if(x<1)
+ {f=x
+ print(f)
+ }else
+ {f=2-x
+ print(f)
+ }
+ }
[1] 0
Example 2
> x=8
> if((x<0)|(x>2))
+ {f=0
+ print(f)
+ }else
+ {if(x<1)
+ {f=x
+ print(f)
+ }else
+ {f=2-x
+ print(f)
+ }
+ }
[1] 0
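Retyping the whole if-else block for each new value of x is tedious. One option, sketched below, is to wrap the block in a function; the function name f and the test values are chosen only for illustration.

```r
# The nested if-else from Example 2, wrapped in a function.
f <- function(x) {
  if ((x < 0) | (x > 2)) {
    0
  } else if (x < 1) {
    x
  } else {
    2 - x
  }
}

print(f(-3))
print(f(0.5))
print(f(1.5))

# if() looks at a single value; sapply() applies f to each component.
xs <- c(-3, 0.5, 1.5, 8)
print(sapply(xs, f))
```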
control flow
the for loop
> sum=0 #set the initial values of the sum
> for (i in 1:6)
+ {sum=sum+i
+ print(c("i=",i, "sum=",sum))
+ }
[1] "i=" "1" "sum=" "1"
[1] "i=" "2" "sum=" "3"
[1] "i=" "3" "sum=" "6"
[1] "i=" "4" "sum=" "10"
[1] "i=" "5" "sum=" "15"
[1] "i=" "6" "sum=" "21"
> sum=0 #set the initial values of the sum
> for (i in c(2,4,6,8,10))
+ {sum=sum+i
+ print(c("i=",i, "sum=",sum))
+ }
[1] "i=" "2" "sum=" "2"
[1] "i=" "4" "sum=" "6"
[1] "i=" "6" "sum=" "12"
[1] "i=" "8" "sum=" "20"
[1] "i=" "10" "sum=" "30"
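The while statement was mentioned above but not illustrated. A minimal sketch: a while loop repeats its body as long as the condition is TRUE, so the number of iterations need not be known in advance. The variable names are made up for this example.

```r
# Add 1, 2, 3, ... until the running total reaches at least 21.
total <- 0
i <- 0
while (total < 21) {
  i <- i + 1
  total <- total + i
  cat("i =", i, " total =", total, "\n")
}
```

The loop stops at the same point as the first for loop above, but here the stopping point is determined by the condition on total rather than by a fixed range for i.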
Data management
Data input has been introduced above. I will now introduce topics
including data output, missing data, data manipulation, and merging, combining,
and subsetting data sets.
save an R object
> X=matrix(1:15,3,5)
> X
[,1] [,2] [,3] [,4] [,5]
[1,] 1 4 7 10 13
[2,] 2 5 8 11 14
[3,] 3 6 9 12 15
> save(X,file="ex.RData")
> X=0
> X
[1] 0
> load("ex.RData")
> X
[,1] [,2] [,3] [,4] [,5]
[1,] 1 4 7 10 13
[2,] 2 5 8 11 14
[3,] 3 6 9 12 15
output data to external files
> data(sleep)
> str(sleep)
$ extra: num 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
$ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ ID : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
> sleep.1=sleep[sleep$group=="1",c(1,3)]
> sleep.1
extra ID
1 0.7 1
2 -1.6 2
3 -0.2 3
............
> write.table(sleep.1,file="sleep.1.dat")
Both the row names and the column names are written to the file.
If we do not want the row names, use
> write.table(sleep.1,file="sleep.1.dat",row.names=F)
If we want the file to be saved as a csv file, use
> write.csv(sleep.1,file="sleep.1.csv")
missing data
It is very common in practice that there are missing values in a
data set. For example, suppose the missing values in a file are denoted by *.
> X=read.table("ch.1.ex.6.dat",na.strings = "*")
> X
V1 V2 V3 V4 V5
1 1 2 3 4 NA
2 1 3 5 7 8
3 7 NA 9 0 1
In R, all missing values are denoted by NA, which can be
identified by the following command,
> x=X[,2]
> x
[1] 2 3 NA
> is.na(x)
[1] FALSE FALSE TRUE
missing data
> is.na(X)
V1 V2 V3 V4 V5
[1,] FALSE FALSE FALSE FALSE TRUE
[2,] FALSE FALSE FALSE FALSE FALSE
[3,] FALSE TRUE FALSE FALSE FALSE
> sum(is.na(X)) # the number of missing values
[1] 2
> mean(X[,1])
[1] 3
> mean(X[,2])
[1] NA
> #calculate the mean with the missing values excluded
> mean(X[,2],na.rm=TRUE)
[1] 2.5
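Missing values can also be summarized column by column or row by row. The sketch below types in the same values as the data frame X above, since the file ch.1.ex.6.dat is not available here; colSums(), colMeans() and complete.cases() are base R functions.

```r
# The same values as in the example above, typed in directly.
X <- data.frame(V1 = c(1, 1, 7), V2 = c(2, 3, NA), V3 = c(3, 5, 9),
                V4 = c(4, 7, 0), V5 = c(NA, 8, 1))

# Number of missing values in each column.
print(colSums(is.na(X)))

# TRUE for the rows with no missing value at all.
print(complete.cases(X))
print(X[complete.cases(X), ])

# Column means with the missing values excluded.
print(colMeans(X, na.rm = TRUE))
```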
missing data
Example: write a function to replace the missing values in a data frame by the mean of the corresponding column, computed with the
missing values excluded. If a whole column is missing, we remove the column.
> missing.replace=function(X)
+ {
+ temp=NULL
+ for (i in 1:dim(X)[2])
+ {if(sum(is.na(X[,i]))==0)
+ {temp=cbind(temp,X[,i])
+ }else
+ {if(sum(!is.na(X[,i]))>0)
+ {y=X[,i]
+ y[is.na(X[,i])]=mean(y,na.rm=TRUE)
+ temp=cbind(temp,y)
+ }
+ }
+ }
+ temp=data.frame(temp)
+ temp
+ }
missing data
> X
V1 V2 V3 V4 V5
1 1 2 3 4 NA
2 1 3 5 7 8
3 7 NA 9 0 1
> missing.replace(X)
V1 y V3 V4 y.1
1 1 2.0 3 4 4.5
2 1 3.0 5 7 8.0
3 7 2.5 9 0 1.0
> Y=X
> Y[,2]=NA
> Y
V1 V2 V3 V4 V5
1 1 NA 3 4 NA
2 1 NA 5 7 8
3 7 NA 9 0 1
> missing.replace(Y)
V1 V2 V3 y
1 1 3 4 4.5
2 1 5 7 8.0
3 7 9 0 1.0
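In the output above, the filled-in columns are renamed y and y.1. A possible variant that keeps the original column names is sketched below. The function name missing.replace2 is made up, and the data frame X is typed in directly with the same values as above.

```r
# A variant of missing.replace() that keeps the original column names.
missing.replace2 <- function(X) {
  # Drop the columns in which every value is missing.
  keep <- colSums(!is.na(X)) > 0
  X <- X[, keep, drop = FALSE]
  # Fill the remaining gaps with the column mean of the observed values.
  for (j in seq_len(ncol(X))) {
    gaps <- is.na(X[, j])
    if (any(gaps)) {
      X[gaps, j] <- mean(X[, j], na.rm = TRUE)
    }
  }
  X
}

X <- data.frame(V1 = c(1, 1, 7), V2 = c(2, 3, NA), V3 = c(3, 5, 9),
                V4 = c(4, 7, 0), V5 = c(NA, 8, 1))
print(missing.replace2(X))

Y <- X
Y[, 2] <- NA   # make the whole second column missing
print(missing.replace2(Y))
```

Modifying X in place, instead of rebuilding it with cbind(), is what preserves the names.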
merging data
the first data set
id year female inc
1 80 0 5000
1 81 0 5500
1 82 0 6000
2 80 1 2000
2 81 1 2200
2 82 1 3300
3 80 0 3000
3 81 0 2000
3 82 0 1000
The second data set is given below.
id maxval
2 2400
1 1800
4 1900
The desired merged dataset would look like:
id year female inc maxval
1 81 0 5500 1800
1 80 0 5000 1800
1 82 0 6000 1800
2 82 1 3300 2400
2 80 1 2000 2400
2 81 1 2200 2400
3 82 0 1000 NA
3 80 0 3000 NA
3 81 0 2000 NA
4 NA NA NA 1900
merging data
Suppose that X and Y are the data frames for the two data sets, respectively.
> merge(X,Y,by="id")
1 1 81 0 5500 1800
2 1 80 0 5000 1800
3 1 82 0 6000 1800
4 2 82 1 3300 2400
5 2 80 1 2000 2400
6 2 81 1 2200 2400
> merge(X,Y,by="id",all=T)
1 1 81 0 5500 1800
2 1 80 0 5000 1800
3 1 82 0 6000 1800
4 2 82 1 3300 2400
5 2 80 1 2000 2400
6 2 81 1 2200 2400
7 3 82 0 1000 NA
8 3 80 0 3000 NA
9 3 81 0 2000 NA
10 4 NA NA NA 1900
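To try merge() without the original files, the two data sets can be typed in as data frames. The sketch below does this and also shows the argument all.x, which keeps every row of the first data set only. The names inner, full and left are introduced for illustration.

```r
X <- data.frame(id     = rep(1:3, each = 3),
                year   = rep(80:82, times = 3),
                female = rep(c(0, 1, 0), each = 3),
                inc    = c(5000, 5500, 6000, 2000, 2200, 3300,
                           3000, 2000, 1000))
Y <- data.frame(id = c(2, 1, 4), maxval = c(2400, 1800, 1900))

# Only the ids found in both data sets (1 and 2).
inner <- merge(X, Y, by = "id")
print(inner)

# Every id from either data set; unmatched entries are filled with NA.
full <- merge(X, Y, by = "id", all = TRUE)
print(full)

# Keep every row of X, but do not add ids that appear only in Y.
left <- merge(X, Y, by = "id", all.x = TRUE)
print(left)
```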
graphical exploration of data
Example: This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length
and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa,
versicolor, and virginica.
> X=read.table("cell.1.dat")
> str(X)
$ V1: int 0 0 0 0 0 1 1 1 1 1 ...
$ V2: int 100 100 100 100 100 100 100 100 100 100 ...
$ V3: int 0 0 0 0 0 0 0 0 0 0 ...
$ V4: int 0 0 0 0 0 0 0 0 0 0 ...
$ V5: int 0 0 0 0 0 0 0 0 0 0 ...
$ V6: int 0 0 0 0 0 10 6 12 16 8 ...
$ V7: int 1600 1600 1604 1590 1581 1568 1569 1577 1570 1577 ...
> Y=read.table("cell.2.dat")
> Z=read.table("cell.3.dat")
scatter plot
> plot(iris$Sepal.Length,iris$Petal.Length)
[Figure: scatter plot of iris$Petal.Length (y axis) against iris$Sepal.Length (x axis).]
scatter plot
To change the labels of the x axis and the y axis and the title of the plot:
> plot(iris$Sepal.Length,iris$Petal.Length,xlab="Sepal Length"
+ ,ylab="Petal Length",main="The Scatter plot")
[Figure: the same scatter plot with title "The Scatter plot", x axis "Sepal Length" and y axis "Petal Length".]
Matrix of scatterplots
To draw the scatterplots of all pairs of the variables:
> pairs(iris)
[Figure: matrix of scatterplots for all pairs of Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species.]
histogram
> hist(iris$Sepal.Length,nclass=20)
[Figure: "Histogram of iris$Sepal.Length", with x axis iris$Sepal.Length and y axis Frequency.]
boxplot
> boxplot(iris$Sepal.Length[iris$Species=="setosa"],iris$Sepal.Length[iris$Species=="versicolor"],
+ iris$Sepal.Length[iris$Species=="virginica"],names=c("setosa", "versicolor", "virginica"),
+ main="Sepal Length of three species")
[Figure: box plots of sepal length for setosa, versicolor and virginica, with title "Sepal Length of three species".]
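The same box plots can be requested more compactly with the formula interface of boxplot(), where Sepal.Length ~ Species means "Sepal.Length split by Species". This is a sketch of an alternative call, not a replacement for the command above; the name med is made up.

```r
# Formula interface: one box of Sepal.Length for each level of Species.
boxplot(Sepal.Length ~ Species, data = iris,
        main = "Sepal Length of three species")

# The medians, which are drawn as the thick lines inside the boxes.
med <- tapply(iris$Sepal.Length, iris$Species, median)
print(med)
```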
Regression
Regression is the study of relationships (dependence) between two
sets of variables, based on the data or observations made on
these variables.
Based on the relationships, we can predict one set of
variables from the values of the other set.
Example 1: Predict the price of a stock 6 months from now, on
the basis of company performance measures and economic data.
Example 2: Estimate the amount of glucose in the blood of a
diabetic person from the infrared absorption spectrum of that
person's blood.
Example 3: Examine the correlation between the level of
prostate-specific antigen and a number of clinical measures such
as cancer volume, prostate weight, age, and so on.
variables
Variables are quantitative or qualitative measurements of
characteristics of objects under our study. Typically, the values of
variables vary for different objects.
They are considered as random variables with some probability
distributions.
We will use the uppercase letters, such as X, Y, to denote
variables and the lowercase letters, x and y, to denote their
particular values.
In this course, for each regression model, the variables will be
classified into two categories:
independent variables, also called inputs, features, predictors,
regressors, or explanatory variables; typically denoted by $X_1, X_2, \ldots$.
dependent variables, also called outputs, responses, or outcome variables;
typically denoted by $Y_1, Y_2, \ldots$. In this course, we only consider
one response.
The partition into dependent and independent variables is obvious
in some data sets but may vary according to the purposes of the
study in others.
For example, suppose that we collected the temperature and
precipitation data of a city over a period. If we want to construct a
model to predict the temperature from the precipitation, then the
response is temperature and the predictor is precipitation.
However, if precipitation has to be predicted from temperature,
then the response is precipitation and the predictor is temperature.
variable
Example 1: $Y$ = price of the stock 6 months from now,
$X_1$ = company performance measures and $X_2$ = economic data.
Example 2: $Y$ = the amount of glucose in the blood of a diabetic
person, $X$ = the infrared absorption spectrum of that person's
blood.
Example 3: $Y$ = level of prostate-specific antigen, $X_1$ = cancer
volume, $X_2$ = prostate weight, $X_3$ = age, and so on.
regression models
The goal is to construct a model from which, given any values of the
predictors $X_1, X_2, \ldots$, we can give a good guess of the value of
$Y$ such that the difference between the guess and the true value is
as small as possible. The guess is called the predicted value, denoted
by $\hat{Y}$.
deterministic system: given any values of $X_1, X_2, \ldots$, the value of
$Y$ is deterministic. That is, if for two observations of $X_1, X_2, \ldots$,
$x_i^{(1)}$ and $x_i^{(2)}$, $i = 1, 2, \ldots$, the true values of $Y$ are $y^{(1)}$ and $y^{(2)}$,
then $x_i^{(1)} = x_i^{(2)}$ for all $i = 1, 2, \ldots$ implies $y^{(1)} = y^{(2)}$.
Mathematically, $Y$ can be considered as a function of $X_1, X_2, \ldots$,
that is, $Y = f(X_1, X_2, \ldots)$.
regression models
No randomness is involved in a deterministic system.
However, in practice, there are many factors affecting the
responses, and we cannot find and measure all of them.
In statistics, we do not consider deterministic models. Instead, we
consider models with randomness.
These models are more flexible and more appropriate.
regression models
Specifically, consider
$Y = f(X_1, X_2, \ldots) + \varepsilon$, where
$f$ is an (unknown) function of the variables $X_1, X_2, \ldots$, which are
of interest and easy to observe or measure.
The term $\varepsilon$ is a random variable which is used to account for the
randomness due to the unobserved factors or variables, noise or
errors.
A standard assumption is that $\varepsilon$ is independent of $X_1, X_2, \ldots$ and
its expectation is 0.
Then we have $E(Y \mid X_1, X_2, \ldots) = f(X_1, X_2, \ldots)$.
More assumptions will be imposed in the following classes.
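The statement that the conditional expectation of the response equals f can be illustrated by simulation. In the sketch below, the function f, the value x0 and the normal distribution of the error are all made up for illustration; the slides do not specify them.

```r
set.seed(1)                      # make the simulation reproducible

f <- function(x) 2 + 3 * x       # a made-up f, for illustration only
x0 <- 1.5                        # a fixed value of the predictor

# Many responses observed at the same x0: Y = f(x0) + error,
# where the error has expectation 0.
eps <- rnorm(10000, mean = 0, sd = 1)
y <- f(x0) + eps

# Individual responses differ, but their average is close to f(x0).
print(head(y))
print(mean(y))
print(f(x0))
```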
linear regression models
The function $f(X_1, X_2, \ldots)$ is typically unknown and does not have
a simple form.
In this course, we will assume the function is linear, that is, if we
have $p$ explanatory variables, then
$f(X_1, X_2, \ldots, X_p) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p$,
where $\beta_0, \beta_1, \ldots, \beta_p$ are coefficients, which are typically unknown
parameters. Their estimation based on data is a main
topic of the course.
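As a preview of what estimating the coefficients from data looks like in R, the sketch below simulates data from a linear model with two predictors and fits it with the base R function lm(), which computes least squares estimates. The sample size, the coefficient values and the error standard deviation are made up; the estimation method itself is developed later in the course.

```r
set.seed(2)

# Simulated data from a linear model with p = 2 predictors.
# The coefficients 1, 2 and -0.5 are made up for this sketch.
n  <- 200
x1 <- runif(n)
x2 <- runif(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 0.1)

# Least squares estimates of the three coefficients.
fit <- lm(y ~ x1 + x2)
print(coef(fit))
```

The printed estimates should be close to, but not exactly equal to, the values used in the simulation, because of the random error term.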
Why do we assume f is a linear function?
Since f can be a very general function, it is hopeless to find the
exact form of f based on a limited sample.
Hence, an approximation to f is desirable. Actually, the linear
function is the first-order approximation to any smooth function in
a range of $X_1, X_2, \ldots$ which is not too large.
Although there are other, better approximations, computation was
a very important consideration, especially in the pre-computer age
of statistics. It is much easier to perform analysis with linear
models than with other models.
Even in today's computer era there are still good reasons to study
and use them. They are simple and often provide an adequate
and interpretable description of how the inputs affect the output.
For prediction purposes they can sometimes outperform fancier
nonlinear models, especially in situations with small numbers of
training cases, low signal-to-noise ratio or sparse data.
Finally, linear methods can be applied to transformations of the
inputs and this considerably expands their scope.
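The first-order approximation mentioned in the reasons above can be written out with a Taylor expansion. This is a sketch under the assumption that f is differentiable at a point $(a_1, \ldots, a_p)$; the point and the resulting expressions for the coefficients are notation introduced here, not taken from the slides.

```latex
f(X_1,\ldots,X_p) \approx f(a_1,\ldots,a_p)
  + \sum_{j=1}^{p} \frac{\partial f}{\partial X_j}(a_1,\ldots,a_p)\,(X_j - a_j)
  = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p,
\qquad
\beta_j = \frac{\partial f}{\partial X_j}(a_1,\ldots,a_p),
\quad
\beta_0 = f(a_1,\ldots,a_p) - \sum_{j=1}^{p} \beta_j a_j .
```

The approximation is good only when each $X_j$ stays close to $a_j$, which is why the range of the predictors should not be too large.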
simple linear regression model: there is only one predictor X.
multiple linear regression model: there is more than one
predictor. Matrices will be involved. Review the basics
of matrices: definition, operations, solving linear
equations, and so on.
