
Spatial Statistics With R: Getting Started

Introduction
In the last practical, you saw how to handle geographical data in R, and how to carry out
some basic, and more advanced statistical analysis on the data. However, even the
more advanced Poisson modelling carried out did not take into consideration any spatial
dependencies in the data. The breach of peace counts in each of the census blocks were
modelled as independent Poisson counts, and the number of counts in each block was
considered only in terms of other properties of that block, ignoring anything happening in
surrounding blocks. However, there is a large area of statistical analysis devoted to
processes in which events in nearby areas are related. In this practical you will learn how
to use R libraries devoted to this kind of analysis - in particular spdep.

The spdep Package


The name of this package is a shortened form of 'spatial dependencies'. It contains a
number of statistical routines for testing for spatial dependencies in random variables, as
well as other routines for allowing for such dependencies when fitting models. To begin
this practical, start up R by opening your working folder and clicking on the pract.RData
file, and then load the GISTools and spdep packages. Enter
library(GISTools)
library(spdep)
to load these - and then enter
data(newhaven)
which will make the New Haven data visible again. Note that when you loaded the spdep
package the printout shows that a number of 'helper' libraries were also loaded. You will
see something like:
Loading required package: tripack
Loading required package: sp
Loading required package: maptools
Loading required package: foreign
Loading required package: boot
Loading required package: spam


The topology of a spatial data set is the term usually used to describe the spatial arrangement of
geographical items within it - in particular, for a polygon data set the topology is a list of
polygons that touch one another. Here, touching can mean the sharing of a common
edge, and in some cases it can also mean the sharing of a common single point (for
example when two census block areas are joined only at their corners).
spdep has a function to extract the topology information from a polygon object - called
poly2nb. The nb here stands for neighbours - since it is basically a list of which polygons
neighbour which other ones. Enter
blocks.nb = poly2nb(blocks)
Spatial Statistics with R: Page 1 of 12

to store this information in a variable called blocks.nb. It is possible to plot this information
as a kind of network. The nodes on the network are the so-called label points for the
polygon file. Each polygon has a label point - a point somewhere inside the polygon
where any text used to label the polygon may be placed. They are useful as node
points on a network representing polygon neighbours. To extract the label points, as a
point object, enter
blocks.labs = poly.labels(blocks)
and then it is possible to plot the neighbour information. Here, this is done on a backdrop
of the census block polygons:
plot(blocks,col='grey')
plot(blocks.nb,coordinates(blocks.labs),col='red',add=TRUE)
The default for poly2nb is to define neighbours as having points, as well as edges, in
common. This is sometimes called queen's case topology because connection at edges
and corners corresponds to the legal moves of the queen in chess. It is also possible to
extract rook's case topology - where only common edges define neighbours. This
corresponds to legal moves of the rook in chess. To extract rook's case topology, add the
argument queen=FALSE to the poly2nb function:
blocks.nb = poly2nb(blocks,queen=FALSE)
plot(blocks,col='grey')
plot(blocks.nb,coordinates(blocks.labs),col='red',add=TRUE)
This repeats the network map from before, but now only polygons with common edges
are connected. In this case, as few polygon pairs are connected only at the corners, the
result is fairly similar.
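The difference between the two cases is easiest to see on a regular grid. The following is a minimal base R sketch, assuming a hypothetical 3 x 3 grid of square cells in place of the census blocks; it does not use poly2nb, and the object names are invented for illustration.

```r
# Hypothetical 3 x 3 grid of square cells standing in for census blocks
cells = expand.grid(row = 1:3, col = 1:3)

# Absolute row and column offsets between every pair of cells
dr = abs(outer(cells$row, cells$row, "-"))
dc = abs(outer(cells$col, cells$col, "-"))

# Rook's case: cells share an edge (one step along a row or a column)
rook = (dr + dc) == 1
# Queen's case: cells share an edge or a corner
queen = pmax(dr, dc) == 1

# Number of neighbours of each cell under the two definitions
rook.counts = rowSums(rook)
queen.counts = rowSums(queen)
print(rbind(rook = rook.counts, queen = queen.counts))
```

The centre cell (the fifth one) has 4 neighbours in the rook's case and 8 in the queen's case, while a corner cell has 2 and 3 respectively.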
An alternative definition of topology (based on nearness of polygons rather than contiguity)
is to define two polygons as neighbours if their label points are within some distance d of
one another. R can define these kinds of neighbours using the dnearneigh function. For
example, to define census blocks as being neighbours if they are within 1.2 miles of one another,
enter the following:
blocks.nb2 = dnearneigh(poly.labels(blocks),0,miles2ft(1.2))
It is then possible to plot the neighbour network under this definition:
plot(blocks,col='grey')
plot(blocks.nb2,coordinates(blocks.labs),col='red',add=TRUE)
Note that this demonstrates that under different definitions of neighbour, quite different
patterns of network can occur.
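The distance-based rule can also be sketched by hand in base R. The four label points below are hypothetical (coordinates in feet), and the 1.2 mile threshold is converted using 5280 feet per mile rather than the miles2ft helper; this illustrates the rule itself, not how dnearneigh is implemented.

```r
# Hypothetical label point coordinates, in feet
pts = cbind(x = c(0, 5000, 10000, 0), y = c(0, 0, 0, 7000))

# 1.2 miles expressed in feet
threshold = 1.2 * 5280

# Matrix of distances between every pair of points
d = as.matrix(dist(pts))

# Two points are neighbours if they are distinct and no further apart than the threshold
near = d > 0 & threshold >= d

# Number of neighbours of each point
n.near = unname(rowSums(near))
print(n.near)
```

Here the fourth point has no neighbours at all, which is something that cannot happen to an interior polygon under the contiguity definitions.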

Computing and Testing Moran's I


Having defined contiguity for this census block example, it is now possible to investigate
the degree of spatial dependency in the attribute data. A typical way of doing this
is to compute the Moran's I coefficient. Moran's I is defined as

$$I = \frac{N}{\sum_i \sum_j w_{ij}} \cdot \frac{\sum_i \sum_j w_{ij} (X_i - \bar{X})(X_j - \bar{X})}{\sum_i (X_i - \bar{X})^2}$$

Where:

$X_i$ is the attribute attached to polygon i
$N$ is the number of polygons
$w_{ij}$ indicates whether polygons i and j are neighbours
$\bar{X}$ is the average polygon attribute value

The formula may seem complex, but essentially it measures the degree to which similar-valued
attributes occur near to each other. If above-average valued attributes tend to be
near other above-average attributes, this gives a positive value of Moran's I. If, on the
other hand, above-average values tend to occur near to below-average values - in a
checker-board pattern - this gives a negative Moran's I. Moran's I is typically between -1
and 1, and in some ways is similar to a correlation coefficient. A value of zero suggests
no spatial dependency. It is sometimes referred to as a measure of autocorrelation as it
measures the variable X's correlation to itself, in a geographical sense. To illustrate this,
choropleth maps corresponding to four values of Moran's I are given below:
[Figure: four choropleth maps, with panel titles I = 0.904, I = 0.126, I = 0.074 and I = 0.435]
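To make the definition concrete, Moran's I can be computed directly from the formula in a few lines of base R. This is a minimal sketch, assuming a hypothetical 4 x 4 grid with rook's case neighbours and binary weights (w is 1 for neighbours and 0 otherwise); the function name moran.by.hand is invented for illustration.

```r
# Moran's I computed directly from its definition
moran.by.hand = function(x, w) {
  n = length(x)
  dev = x - mean(x)                       # deviations from the mean
  (n / sum(w)) * sum(w * outer(dev, dev)) / sum(dev^2)
}

# Hypothetical 4 x 4 grid with rook's case binary weights
cells = expand.grid(row = 1:4, col = 1:4)
dr = abs(outer(cells$row, cells$row, "-"))
dc = abs(outer(cells$col, cells$col, "-"))
w = ((dr + dc) == 1) * 1

# Checker-board pattern: every neighbour has the opposite value
checker = (cells$row + cells$col) %% 2
# Clustered pattern: high values in the two left-hand columns only
clustered = as.numeric(2 >= cells$col)

I.checker = moran.by.hand(checker, w)
I.clustered = moran.by.hand(clustered, w)
print(c(checker = I.checker, clustered = I.clustered))
```

The checker-board pattern gives a Moran's I of exactly -1, while the clustered pattern gives a clearly positive value (about 0.667).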


R can compute Moran's I. To do this, it needs to convert a neighbourhood list to a w-list. This
is really just another way of storing the polygon adjacency data. The conversion is done
with the nb2listw function. Enter
blocks.lw = nb2listw(blocks.nb)
which stores the w-list in blocks.lw. Having done this, it is possible to investigate spatial
dependency of some of the New Haven data. To test whether the percent vacant
properties variable P_VACANT exhibits spatial dependency, we first attach the data frame
from the blocks object:
attach(data.frame(blocks))
To compute the Moran's I statistic, now enter:
moran.test(P_VACANT,blocks.lw)
which produces the following output:

Moran's I test under randomisation

data: P_VACANT
weights: blocks.lw
Moran I statistic standard deviate = 2.7789, p-value = 0.002727
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance
      0.143721934      -0.007812500       0.002973471
This needs some explanation. The first number of the last line printed gives the Moran's I
statistic itself - about 0.144. The other information relates to a statistical test of
whether Moran's I is equal to zero. If this is the case, then the theoretical
expected value of Moran's I and its sample variance are given by the following
formulae:
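$$E(I) = \frac{-1}{N - 1}$$

$$Var(I) = \frac{N D - 6 E C^2}{(N + 1)(N - 1) C^2}$$

where:

$$A = \frac{1}{2} \sum_i \sum_j (w_{ij} + w_{ji})^2, \quad i \neq j$$

$$B = \sum_k \left( \sum_j w_{jk} + \sum_i w_{ik} \right)^2$$

$$C = \sum_i \sum_j w_{ij}, \quad i \neq j$$

$$D = (N^2 - 3N + 3) A - N B + 3 C^2$$

$$E = \frac{\sum_i (X_i - \bar{X})^4 / N}{\left[ \sum_i (X_i - \bar{X})^2 / N \right]^2}$$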

These (very) complicated formulae can be used to create a test statistic

$$z = \frac{I - E(I)}{\{Var(I)\}^{1/2}}$$

which is approximately Normally distributed, and can be converted to a p-value. The
last line of the printout from moran.test tells you that the value of E(I) for P_VACANT is
about -0.008 (labelled Expectation) and that for Var(I) is about 0.0030 (labelled
Variance). These can be used to compute z in the formula above, which is then used to
test the hypothesis that I=0. In the printout from moran.test this is labelled as Moran I
statistic standard deviate and takes the value of around 2.78. Finally the p-value for the
statistic is computed, and shown in the printout to be about 0.0027. Recall that the p-value
is the probability of obtaining a value at least as extreme as the one observed from
the data, given that the null hypothesis is true. Thus, the lower the value, the more
evidence against the null hypothesis. Here the smallness of the p-value suggests strong
evidence against the null hypothesis - i.e. we should reject the hypothesis that I=0, implying
that some degree of spatial dependency is present.
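As a check on the arithmetic, the standard deviate and p-value can be recomputed by hand from the three numbers on the last line of the moran.test printout. This is only a sketch: the value N = 129 is not stated in the printout but is inferred here from the printed expectation, since -1/(N-1) = -0.0078125, and the one-sided p-value matches the 'greater' alternative shown in the output.

```r
# Values copied from the moran.test printout
I.obs = 0.143721934
Var.I = 0.002973471

# Number of polygons, inferred from the printed expectation (an assumption)
N = 129
E.I = -1 / (N - 1)

# Standard deviate and one-sided p-value (alternative hypothesis: greater)
z = (I.obs - E.I) / sqrt(Var.I)
p.value = pnorm(z, lower.tail = FALSE)

print(c(E.I = E.I, z = z, p.value = p.value))
```

The results should agree with the printout to within rounding of the printed figures.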
We can now do the same test in terms of density of breach of peace events - firstly
compute the density values in events per square mile:
density = poly.counts(breach,blocks)/
  ft2miles(ft2miles(poly.areas(blocks)))
and then carry out the Moran's I test:
moran.test(density,blocks.lw)
This gives a print-out similar to that before. In this case the Moran's I statistic is 0.235. As
a self test you should be able to find the p-value for this and decide whether Moran's I
differs significantly from zero.
Simulation-Based Tests
The basis for the significance tests in the last section was to compute the expected value
and variance of the Moran's I statistic under the assumption that there is no spatial
dependency in the attribute X. Here, this is done by assuming that if there were no spatial
dependency, then any of the observed X-values could have occurred with equal chance at
any of the polygons. In other words, any permutation of polygon attributes to the
polygons is equally likely. The formulae for E(I) and Var(I) were theoretically derived given

this assumption. However, the assumption that Moran's I is normally distributed in this
case is only approximate.
In times when computers were a lot slower than they are now, this approach was probably
the most appropriate but now there is an alternative approach. This is simply to permute
the attributes randomly amongst the polygons a large number of times, and note the
values of Moran's I each time. By comparing the actual Moran's I against these, we can
see how extreme the true value is compared to those generated under the assumption
that any permutation is equally likely. If there are n simulations, and m of these have a
larger value than the true Moran's I, then the experimental p-value is m/(n+1). The
theoretical approach of the previous section is relatively easy to compute (although seven
formulae may seem complex to a human, they can be calculated in a fraction of a second
by a computer) but it is only approximate. The simulation approach - also called the
Monte-Carlo approach - outlined here requires more computer time (usually n should be
around 10,000) but the simulations are of the true model. R can carry out simulation-based
tests with the moran.mc function:
moran.mc(P_VACANT,blocks.lw,nsim=10000)
The extra argument nsim tells the function how many simulations to carry out - that is, the
number n mentioned above. The result will be something like:

Monte-Carlo simulation of Moran's I

data: P_VACANT
weights: blocks.lw
number of simulations + 1: 10001
statistic = 0.1256, observed rank = 9909, p-value = 0.0092
alternative hypothesis: greater
Note that the p-value here - although slightly different from that obtained from moran.test -
still suggests that the hypothesis of zero Moran's I should be rejected. Also note that
when you run moran.mc you may well obtain slightly different results, as this approach is
based on random simulation, and so no two runs of the function will have identical
outcomes. As another self-test, try running moran.mc on the density variable.
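The permutation idea described in this section can be sketched in base R without moran.mc. The example below assumes a hypothetical 6 x 6 grid with rook's case binary weights and a strongly clustered attribute, and uses 999 permutations to keep it quick; the function moran.by.hand is invented for illustration, and the p-value follows the m/(n+1) rule given in the text.

```r
set.seed(1)   # fix the random permutations so the run is repeatable

# Moran's I computed directly from its definition
moran.by.hand = function(x, w) {
  n = length(x)
  dev = x - mean(x)
  (n / sum(w)) * sum(w * outer(dev, dev)) / sum(dev^2)
}

# Hypothetical 6 x 6 grid with rook's case binary weights
cells = expand.grid(row = 1:6, col = 1:6)
dr = abs(outer(cells$row, cells$row, "-"))
dc = abs(outer(cells$col, cells$col, "-"))
w = ((dr + dc) == 1) * 1

# Strongly clustered attribute: high values in the left half of the grid
x = as.numeric(3 >= cells$col)
I.obs = moran.by.hand(x, w)

# Permute the attribute amongst the cells and recompute Moran's I each time
nsim = 999
I.sim = replicate(nsim, moran.by.hand(sample(x), w))

# m simulated values exceed the observed one; experimental p-value is m/(n+1)
m = sum(I.sim > I.obs)
p.value = m / (nsim + 1)
print(c(I.obs = I.obs, m = m, p.value = p.value))
```

With a pattern this strongly clustered, few if any of the permuted values should exceed the observed one, so the experimental p-value is very small.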

Regression Models with Spatial Autoregression


In this section the idea of spatial dependency is taken a step further, by considering its
effect when calibrating regression models. A standard bivariate regression model has the
form

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$$
where the Y variable is to be predicted by the X variable. The $\beta_0$ and $\beta_1$ values are the
regression coefficients (intercept and slope respectively) and the final $\epsilon_i$ term is
an error term. In a standard model it is assumed that these are normally distributed, with
a mean of zero. It is also assumed that all errors have the same standard deviation, and
that they are independent. However, in many geographical situations, the last assumption
is dubious. The error term in a model is essentially related to factors influencing the Y

variable that are not reflected in the predictor variable X. If such factors relate to a
geographical phenomenon, it is possible that their effects might spill over, so that error
terms in adjacent regions will depend on one another. In this case, the model above will
be inappropriate, and models allowing for dependency in the epsilons should be
considered instead.
To consider this kind of model, we will look at two new New Haven crime variables related
to residential burglaries. These are both point objects, called burgres.f and burgres.n.
burgres.f is a list of burglaries occurring between 1st August 2007 and 31st January 2008
where entry was forced into the property, and burgres.n is a list of burglaries from the
same time period where no entry was forced. In the case of non-forced entry, this
suggests that the property was left insecure, perhaps by leaving a door or window open.
One interesting question is whether both kinds of
residential burglary occur in the same places - that is, if a place is a high risk area for
non-forced entry, does it imply that it is also a high risk for forced entry? To investigate this,
we will use a bivariate regression model that attempts to predict the density of forced
burglaries from the density of non-forced ones.
The indicators needed for this are the rates of burglary given the number of properties at
risk. Here we use the variable OCCUPIED from the data frame in the census blocks
object to estimate the number of properties at risk. If we were to compute rates per 1,000
households, this would be
1000*(number of burglaries in block)/OCCUPIED
and since this is over a six-month period, doubling this quantity gives the number of
burglaries per 1,000 households per year. However, typing in OCCUPIED shows that
some blocks have no occupied housing, so the above quantity is not defined. To
overcome this problem we select a subset of the blocks object consisting only of blocks
with greater than zero occupied dwellings. For polygon spatial objects, each individual
polygon can be treated like a row in a data frame for purposes of subset selection. Thus
to select only the blocks where the variable OCCUPIED is greater than zero, enter
blocks2 = blocks[OCCUPIED > 0,]
to store the subset census block data in the object blocks2. We can now compute the
burglary rates for forced and non-forced entries by first counting the burglaries in each
block in blocks2 (with the poly.counts function), dividing these numbers by the OCCUPIED
counts and then multiplying by 2,000 (to get yearly rates per 1,000 households). However,
before we do this, remember that we need the OCCUPIED column from blocks2 and not
blocks - but at the moment the one from blocks is attached. To sort this out, firstly detach
the data frame associated with blocks and then attach the one associated with blocks2:
detach(data.frame(blocks))
attach(data.frame(blocks2))
Now the two rate variables can be calculated:
forced.rate = 2000*poly.counts(burgres.f,blocks2)/OCCUPIED
notforced.rate = 2000*poly.counts(burgres.n,blocks2)/OCCUPIED
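The rate calculation itself is simple arithmetic, and can be tried out on made-up numbers. The counts below are hypothetical (they are not the New Haven data); the sketch shows the subsetting of zero-occupancy blocks followed by the conversion of six-month counts to yearly rates per 1,000 households.

```r
# Hypothetical six-month burglary counts and occupied dwellings for four blocks
burglaries = c(3, 0, 5, 2)
occupied = c(150, 0, 400, 80)

# Keep only the blocks with at least one occupied dwelling
keep = occupied > 0

# Yearly rate per 1,000 households: 1000 * count / occupied, doubled
rate = 2000 * burglaries[keep] / occupied[keep]
print(rate)
```

The second block is dropped because its rate would involve division by zero, leaving rates of 40, 25 and 50 burglaries per 1,000 households per year for the remaining three.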


So we now have the two rates stored in forced.rate and notforced.rate. A first attempt at
modelling the relationship between the two rates could be via simple bivariate regression -
ignoring any spatial dependencies in the error term. This is done using the lm function,
which creates a simple regression model object.
model1 = lm(forced.rate~notforced.rate)
This stores the basic model in model1 - to see the regression coefficients, enter
summary(model1)
which produces the following output:
Call:
lm(formula = forced.rate ~ notforced.rate)

Residuals:
    Min      1Q  Median      3Q     Max
-11.209  -5.467  -1.434   3.002  30.926

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)      5.4667     0.8059   6.784 4.18e-10 ***
notforced.rate   0.3790     0.1627   2.329   0.0215 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.557 on 125 degrees of freedom
Multiple R-squared: 0.04159, Adjusted R-squared: 0.03392
F-statistic: 5.424 on 1 and 125 DF, p-value: 0.02147
The key things to note here are that the forced rate is related to the not-forced rate by the
formula
expected(forced rate) = 5.47 + 0.379*(not forced rate)
and that the coefficient for the not forced rate is significantly different from zero
(p = 0.0215) - so there is
evidence that the two rates are related. One possible explanation is that if a burglar is
active in an area, they will only use force to enter dwellings when it is necessary, making
use of an insecure window or door if they spot the opportunity. Thus in areas where
burglars are active, both kinds of burglary could potentially occur. However, in areas
where they are less active it is less likely for either kind of burglary to occur.
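The t values and p-values in the coefficient table can be reproduced from the printed estimates and standard errors, which may help in reading the output. This is a sketch using the rounded figures from the printout, so the results agree only to within rounding; the not-forced rate of 10 used for the prediction is a hypothetical value chosen for illustration.

```r
# Estimates and standard errors copied from the summary(model1) printout
est = c(intercept = 5.4667, slope = 0.3790)
se = c(intercept = 0.8059, slope = 0.1627)
df.resid = 125   # residual degrees of freedom from the printout

# t value is the estimate divided by its standard error
t.value = est / se

# Two-sided p-value from the t distribution
p.value = 2 * pt(abs(t.value), df = df.resid, lower.tail = FALSE)
print(rbind(t.value, p.value))

# Expected forced rate in a block with a (hypothetical) not-forced rate of 10
predicted = unname(est["intercept"] + est["slope"] * 10)
print(predicted)
```

For a block with 10 non-forced burglaries per 1,000 households per year, the fitted model predicts about 9.26 forced burglaries per 1,000 households per year.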
However, this regression model could possibly be improved if, instead of assuming that
the error terms are independent, we assume a spatial dependency. This can be done in a
number of ways, but the approach we will use here is the spatially autocorrelated
regression (SAR) model:

$$y_i = \lambda \sum_j w_{ij} y_j + \beta_0 + \beta_1 x_i + \epsilon_i$$

The difference between this and the standard model is the first term on the right hand side.
Here, $w_{ij}$ is equal to 1 if polygons i and j are neighbours and zero otherwise. The
$\lambda$ coefficient controls the degree of spatial dependency. Effectively the variable y for a
given polygon is predicted not just by x but also by the y-variables of polygons
neighbouring it. Calibrating a SAR model involves estimating the regression coefficients,
as before, but also involves estimating $\lambda$. In R, SAR models can be calibrated using the
function spautolm. This works in a similar way to lm, but also needs the contiguity
information in listw form. Since we are now working with blocks2 rather than blocks we
need to extract the information for the newer object:
blocks2.nb = poly2nb(blocks2)
blocks2.lw = nb2listw(blocks2.nb)
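Before fitting the model, it may help to see what the SAR equation given above implies for the data. Because y appears on both sides, values satisfying it have to be found by solving a set of simultaneous equations. The following base R sketch does this for a hypothetical 3 x 3 grid with rook's case binary weights; the coefficient values are invented for illustration, and the sketch follows the equation as written in this practical rather than the internals of spautolm.

```r
set.seed(42)

# Hypothetical 3 x 3 grid with rook's case binary weights
cells = expand.grid(row = 1:3, col = 1:3)
dr = abs(outer(cells$row, cells$row, "-"))
dc = abs(outer(cells$col, cells$col, "-"))
w = ((dr + dc) == 1) * 1
n = nrow(cells)

# Invented coefficients: dependency, intercept and slope
lambda = 0.1
beta0 = 5
beta1 = 0.4

# Invented predictor values and independent error terms
x = runif(n, 0, 20)
eps = rnorm(n, sd = 2)

# y = lambda*W*y + beta0 + beta1*x + eps  rearranges to
# (I - lambda*W) y = beta0 + beta1*x + eps, a linear system in y
y = solve(diag(n) - lambda * w, beta0 + beta1 * x + eps)

# Check: plugging y back into the right hand side returns y itself
rhs = lambda * as.vector(w %*% y) + beta0 + beta1 * x + eps
print(max(abs(y - rhs)))
```

The printed discrepancy should be zero apart from floating point error, confirming that each y value depends on its neighbours' y values as well as on its own x.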
Now it is possible to fit the SAR model -
model2 = spautolm(forced.rate~notforced.rate,listw=blocks2.lw)
This stores the result in model2 - more information can be found by entering
summary(model2)
giving the following output:

Call: spautolm(formula = forced.rate ~ notforced.rate, listw = blocks2.lw)

Residuals:
     Min       1Q   Median       3Q      Max
-10.9220  -5.2990  -1.4167   3.0909  31.0006

Coefficients:
               Estimate Std. Error z value  Pr(>|z|)
(Intercept)     5.47698    0.88334  6.2003 5.636e-10
notforced.rate  0.36239    0.16098  2.2511   0.02438

Lambda: 0.13856 LR test value: 0.99413 p-value: 0.31873

Log likelihood: -435.558
ML residual variance (sigma squared): 55.579, (sigma: 7.4551)
Number of observations: 127
Number of parameters estimated: 4
AIC: 879.12
This shows that calibrating in this way gives the model
expected(forced rate) = 5.48 + 0.362*(not forced rate)
which differs only very slightly from the model obtained with a standard regression model.
The section marked 'Lambda' in the output shows that the estimated value of the
dependency coefficient is 0.139, but the test of a null hypothesis of zero dependency has

a p-value of around 0.319 - so we fail to reject the null hypothesis. This suggests that, in
this case, one does not need to allow for spatial dependency of the error term.
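The p-value for the dependency coefficient and the AIC can both be recovered from other numbers in the printout. The sketch below assumes that the likelihood ratio statistic is compared against a chi-squared distribution with one degree of freedom (one extra parameter, lambda), and that the AIC is minus twice the log likelihood plus twice the number of parameters; both are standard definitions, but they are stated here as assumptions since the printout does not spell them out.

```r
# Values copied from the summary(model2) printout
lr.stat = 0.99413     # LR test value
log.lik = -435.558    # Log likelihood
n.par = 4             # Number of parameters estimated

# LR test p-value, assuming a chi-squared reference distribution with 1 df
p.value = pchisq(lr.stat, df = 1, lower.tail = FALSE)

# AIC = -2 * log likelihood + 2 * number of parameters
aic = -2 * log.lik + 2 * n.par

print(c(p.value = p.value, aic = aic))
```

Both values should match the printout to within rounding.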

The Modifiable Areal Unit Problem


In the previous section, data was summarised and then analysed at the US Census block
level. One important issue with spatial analytical models of this kind is their dependency
on the set of areal units used. For example, if we were to work with US Census tracts
instead of blocks, would we obtain similar results? With the data set here, it is possible to
test this. Firstly, included in the library is an object called tracts which consists of the
polygon outlines of the US Census tracts for New Haven. To see the relationship between
the tracts and the blocks, enter:
plot(blocks,border='red')
plot(tracts,lwd=2,add=TRUE)
The parameter lwd controls the line width being drawn. The Census blocks are nested
within the tracts. Next, compute the burglary rates for the tracts; first detach the data
frame associated with blocks2 and then attach the one for tracts:
detach(data.frame(blocks2))
attach(data.frame(tracts))
Now compute burglary rates for the tracts:
forced.rate.t = 2000*poly.counts(burgres.f,tracts)/OCCUPIED
notforced.rate.t = 2000*poly.counts(burgres.n,tracts)/OCCUPIED
and run a basic model:
model1.t=lm(forced.rate.t~notforced.rate.t)
summary(model1.t)
You should now be familiar with the format of the output - working with data based on
census tracts, we obtain the model
expected(forced rate) = 5.24 + 0.413*(not forced rate)
Notice that the difference in calibrating the model brought about by altering the areal units
used for the analysis is notably larger than the difference made by the inclusion of a spatial
dependency term in the error model. This is referred to as the Modifiable Areal Unit
Problem - first identified in the 1930s, and extensively researched by Stan Openshaw in the
1970s and beyond. This variability in results is often the case, and illustrates the
importance of the Modifiable Areal Unit Problem as an issue in spatial analysis.
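The effect of changing the areal units can be demonstrated with a deliberately contrived example. The data below are entirely hypothetical (four blocks nested in two tracts, chosen so that the effect is extreme); the sketch computes the rates and fits the same regression at both levels using only base R.

```r
# Hypothetical block-level data: four blocks nested within two tracts
blocks.df = data.frame(
  tract = c("A", "A", "B", "B"),
  occupied = c(100, 100, 100, 100),
  forced = c(4, 0, 6, 2),
  notforced = c(0, 4, 2, 6)
)

# Rates and regression slope at the block level
block.forced = 2000 * blocks.df$forced / blocks.df$occupied
block.notforced = 2000 * blocks.df$notforced / blocks.df$occupied
slope.block = unname(coef(lm(block.forced ~ block.notforced))[2])

# Sum counts and households within each tract, then repeat the analysis
tracts.df = aggregate(cbind(occupied, forced, notforced) ~ tract,
                      data = blocks.df, FUN = sum)
tract.forced = 2000 * tracts.df$forced / tracts.df$occupied
tract.notforced = 2000 * tracts.df$notforced / tracts.df$occupied
slope.tract = unname(coef(lm(tract.forced ~ tract.notforced))[2])

print(c(block = slope.block, tract = slope.tract))
```

In this contrived case the slope is -0.6 at the block level but +1 at the tract level, so even the sign of the relationship depends on the units used. Real data are rarely this extreme, but the same mechanism is at work.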
A Zone-Free Approach
An alternative approach to mapping these crime patterns is to use kernel density
estimation. Here we model the relative density of the points as a density surface -
essentially a function of location (x,y) representing the relative likelihood of occurrence of
an event at that point. If we think of locations in space as a very fine pixel grid, then
summing the pixels making up an arbitrary region on the map gives the probability that an
event occurs in that area.

For the more mathematically-minded, if f(x,y) is the density function, then the probability
that an event occurs in an area A is:

$$\iint_{(x,y) \in A} f(x, y) \, dy \, dx$$

Kernel density estimators operate by averaging a small 'bump' (a probability distribution in
2D, in fact) centred on each observed point. Thus, the approximation to f is given by:

$$\hat{f}(x, y) = \frac{1}{n h_1 h_2} \sum_i k\left(\frac{x - x_i}{h_1}, \frac{y - y_i}{h_2}\right)$$

in mathematical terms, where n is the number of observed points. The function k is the kernel function - that is, the 'bump'
described earlier. The h parameters control the smoothness of the estimate. Very small
values give rise to very 'spiky' surfaces, and large values to very flat ones. Typically,
they are chosen automatically, from the distribution of the points. Here, the function to
compute a kernel density estimate is kde.points. This estimates the value of the density
over a grid of points, and returns the result as a grid object. It can take two arguments -
the set of points to use, and another geographical object, whose bounding box will be the
extent of the grid object to be created. The points object breach will be used to produce a
kernel surface:
breach.dens = kde.points(breach,lims=tracts)
This stores the kernel density estimate of breach of peace in a grid object called
breach.dens. A quick way of drawing the density is to use the level.plot function:
level.plot(breach.dens)
This draws a shaded contour plot of the density function. One thing to notice is that this
covers a rectangular area - but to give context it would be helpful to add a map of New
Haven. For example, to add the Census tracts, type
plot(tracts,add=TRUE)
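For those curious about what a kernel density estimator does internally, the formula given earlier can be evaluated by hand in base R. This is a minimal sketch under several assumptions: the event locations are randomly generated rather than the breach data, the kernel k is taken to be a product of two standard Normal densities, and the bandwidths h1 and h2 are fixed by hand rather than chosen automatically. It is not a description of how kde.points itself works.

```r
set.seed(1)

# Hypothetical event locations in the unit square
n = 50
px = runif(n)
py = runif(n)

# Bandwidths, fixed by hand for this sketch
h1 = 0.1
h2 = 0.1

# Grid of pixel centres, extending well beyond the points
step = 0.02
gx = seq(-1, 2, by = step)
gy = seq(-1, 2, by = step)

# Add one Gaussian 'bump' per observed point, then average and rescale
dens = matrix(0, length(gx), length(gy))
for (i in seq_len(n)) {
  dens = dens + outer(dnorm((gx - px[i]) / h1), dnorm((gy - py[i]) / h2))
}
dens = dens / (n * h1 * h2)

# Summing pixels (times pixel area) approximates the integral of the density
cell.area = step * step
total.prob = sum(dens) * cell.area

# Probability of an event in the region A = [0, 0.5] x [0, 0.5]
in.A = outer(gx >= 0 & 0.5 >= gx, gy >= 0 & 0.5 >= gy, "&")
prob.A = sum(dens[in.A]) * cell.area

print(c(total = total.prob, region.A = prob.A))
```

The total over the whole grid should be very close to 1, as expected of a probability density, and the sum over the pixels in region A approximates the probability of an event occurring there.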
Another approach might be to mask out the information outside of the study area. The
kde.points function always computes values on a rectangular grid, but part of the grid lies
outside of the New Haven area. To overcome this, it is possible to create a mask polygon
object. This is simply a normal polygon object, shaped like the rectangle that kde.points
produces, but with a hole in it the shape of the study area. In this case the hole is shaped
like New Haven. If the mask polygon is plotted over the level plot of the grid data, with
both its edges and fill colour being white, the effect is to erase the parts of the density
surface lying outside of the study area. This can be achieved using the poly.outer function:
masker = poly.outer(breach.dens,tracts,extend=100)
The first two parameters give the outer rectangle and the hole shape, respectively. The
third parameter causes the outer rectangle to extend by a small amount in each
direction - sometimes this is useful, since occasionally there is a very slight mismatch
between the coordinates of the outer edge of the grid, and the outer edge of the mask
polygon. The erasing technique set out above might then fail to erase a small amount of
information on the edge of the grid. The extend parameter avoids this by making the mask
polygon's outer edges slightly exceed those of the grid. Here, we extend the edges by
100 feet. Now that we have a masking polygon, called masker, we can plot this on the map.
The quickest way to do this is to use the add.masking command - this is more or less the
same as the plot command, but defaults to drawing white filled polygons with white
boundaries. Enter
add.masking(masker)
This erases the part of the density map outside of New Haven. However it has also partly
erased the external boundaries of the census tracts. It would probably have been more
sensible to draw the tracts after the mask polygon was drawn. A better map can be
achieved by entering the commands in this order:
level.plot(breach.dens)
add.masking(masker)
plot(tracts,add=TRUE)
Finally, it is also possible to use shading schemes (as seen in practical 2) to draw level
plots with different intervals or colours. To do this, the auto.shading function is used as
before. The variable to define the shading scheme is the kernel density estimate of the
breach.dens object - accessed by breach.dens$kde. The following gives a level plot with 7
levels, drawn as shades of green:
breach.dens.shades = auto.shading(breach.dens$kde,
  n=7,cols=brewer.pal(7,"Greens"),cutter=range.cuts)
level.plot(breach.dens,shades=breach.dens.shades)
add.masking(masker)
plot(tracts,add=TRUE)
Note the first command is split over two lines.

End of Practical
At this stage, the practical has finished. To exit R, enter
save.image(file='rpract.RData')
detach(data.frame(tracts))
q()
which will save your current variables into a file in your working folder, undo the most
recent attach command (the one for tracts), and quit R.
