
Introduction to Spatial Statistics

Budhi Setiawan, PhD

Types of Spatial Data
Continuous Random Field

Lattice Data

Point Pattern Data

Note: Each type of data is analyzed
differently
Geostatistics
Geostatistical analysis is distinct from
other spatial models in the statistics
literature in that it assumes the region
of study is continuous
Observations could
be taken at any point
within the study area

Interpolation at
points in between
observed locations
makes sense
[Figure: 3-D plot with axes X, Y, and Z]
Spatial Autocorrelation
Spatial modeling is based on the
assumption that observations close in
space tend to co-vary more strongly
than those far from each other
Positively co-vary: nearby values tend to be
similar
E.g. elevation (or depth) tends to be similar for
locations close together
Negatively co-vary: values tend to be
opposite in value
E.g. density of an organism that is highly
spatially clustered, where observations in
between clusters are low and values within
clusters are high
Covariance
Definition: two variables are said to co-vary
if their correlation coefficient is not zero

$$\operatorname{cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = \rho_{X,Y}\,\sigma_X \sigma_Y$$

where $\rho_{X,Y}$ is the correlation coefficient
between X and Y, $\mu_X$ ($\mu_Y$) is the mean of X (Y), and
$\sigma_X$ ($\sigma_Y$) is the standard deviation of X (Y)

Consider this in the context of a single
variable
E.g. do nearest neighbors have non-zero
covariance?
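To make the nearest-neighbor question concrete, here is a minimal Python sketch. It assumes a hypothetical 1-D transect of equally spaced values; the data and the helper name `lag_covariance` are illustrative and not taken from the slides. It pairs each value with its neighbor and computes the sample covariance and correlation of those pairs.

```python
from statistics import mean

def lag_covariance(values, lag=1):
    """Sample covariance and correlation between values `lag` steps apart."""
    x = values[:-lag]          # each value ...
    y = values[lag:]           # ... paired with its neighbor `lag` steps away
    mx, my = mean(x), mean(y)
    n = len(x)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sx = (sum((a - mx) ** 2 for a in x) / (n - 1)) ** 0.5
    sy = (sum((b - my) ** 2 for b in y) / (n - 1)) ** 0.5
    return cov, cov / (sx * sy)

# Hypothetical smooth transect: neighbors are similar in value.
transect = [1.0, 1.2, 1.5, 1.9, 2.0, 2.4, 2.3, 2.8, 3.1, 3.0]
cov, rho = lag_covariance(transect)
print(f"lag-1 covariance = {cov:.3f}, correlation = {rho:.3f}")
```

A positive result suggests that neighbors co-vary positively; a series that alternates high and low values would give a negative result.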
Continuous Data Geostatistics
Notation

$Z(s)$ is the random process at location $s = (x, y)$
$z(s)$ is the observed value of the process at
location $s = (x, y)$
$D$ is the study region
The sample is the set $\{z(s) : s \in D\}$. We say
that it is a partial realization of the random
spatial process $\{Z(s) : s \in D\}$

Conceptual Model
$$Z(s) = \mu(s) + W(s) + \eta(s) + \varepsilon(s)$$

where
$\mu(s)$ is the mean structure; called large-scale non-spatial trend

$W(s)$ is a zero-mean, stationary process whose autocorrelation
range is larger than $\min\{\|s_i - s_j\| : i \neq j;\ i, j = 1, 2, \dots, n\}$; called smooth
small-scale variation

$\eta(s)$ is a zero-mean, stationary process whose autocorrelation
range is smaller than $\min\{\|s_i - s_j\| : i \neq j;\ i, j = 1, 2, \dots, n\}$ and which is
independent of $W(s)$; called micro-scale variation or measurement
error

$\varepsilon(s)$ is the random noise term with zero mean and constant
variance and which is independent of $W(s)$ and $\eta(s)$
Simpler Conceptual Model
$$Z(s) = \mu(s) + W(s) + \eta(s) + \varepsilon(s)$$

$$Z(s) = \mu(s) + \delta(s) + \varepsilon(s)$$

where
$\mu(s)$ is the mean structure; called large-scale non-spatial
trend

$\delta(s) = W(s) + \eta(s)$ is a zero-mean, stationary
process with autocorrelation which combines the
smooth small-scale and micro-scale variation

$\varepsilon(s)$ is the random noise term with zero mean and
constant variance which is independent of $W(s)$ and
$\eta(s)$
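As a rough illustration of this decomposition, the Python sketch below simulates values along a 1-D transect as trend plus autocorrelated small-scale variation plus noise. Everything in it is an assumption made for illustration: the function name `simulate_transect`, the linear form of the trend, and the use of a first-order autoregressive series (started at zero) as a simple stand-in for the autocorrelated term.

```python
import random

def simulate_transect(n, intercept, slope, phi, sd_delta, sd_eps, seed=0):
    """Simulate z = trend + autocorrelated term + noise at x = 0, 1, ..., n-1."""
    rng = random.Random(seed)
    delta = 0.0
    z = []
    for x in range(n):
        mu = intercept + slope * x                      # large-scale trend (assumed linear)
        delta = phi * delta + rng.gauss(0.0, sd_delta)  # AR(1) stand-in for small-scale variation
        eps = rng.gauss(0.0, sd_eps)                    # independent noise
        z.append(mu + delta + eps)
    return z

# Hypothetical parameter values.
z = simulate_transect(n=35, intercept=0.0, slope=1.0, phi=0.8, sd_delta=1.0, sd_eps=0.5)
print([round(v, 2) for v in z[:5]])
```

Setting both standard deviations to zero leaves only the trend, which is a quick way to check that each component enters the sum as intended.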
Graphical Concept with Trend
[Figure: Bivariate Fit of Z By X, with a linear fit and a fit through each value]
Red line indicates large-scale
trend

Green line shows how the
data are arranged around the
trend

Note that there is a pattern
to the points around the red
line. The pattern implies
possible positive
autocorrelation in Z(x).

Finally, there is white noise.
Graphical Concept without Trend
Red line indicates a
constant mean, i.e. no large-
scale trend

Green line shows how the
data are arranged around the
trend

Again, the pattern of the
green line implies possible
positive autocorrelation in
RZ(x)
[Figure: Bivariate Fit of RZ By X, with a linear fit and a fit through each value]
Important Point
The model indicates that Z can be
decomposed into large-scale
variation, small + micro-scale
variation, and noise
The reality is that any estimated
decomposition is not unique
E.g. in the graph just shown, we could
have instead added a sinusoidal aspect to
the large-scale trend and hence captured
much of the apparent autocorrelation
Example
Red line indicates large-scale
trend captured by a
sinusoidal + linear trend

Green line shows how the
data are arranged around the
trend

Note that now there is no
obvious pattern and so the
remaining unexplained
variation is likely white noise
in Z(x).
[Figure: Bivariate Fit of Z By X, with a smoothing spline fit (lambda = 1) and a fit through each value]
Modeling
Ultimately we want to do modeling of
Z using the geostatistical model

$$Z(s) = \mu(s) + \delta(s) + \varepsilon(s)$$

Requires estimates of the model
components
the mean $\mu(s)$
the small-scale variation $\delta(s)$ and the
covariances among Z values at different
locations
Any leftovers, i.e. the unexplained or
residual variability $\varepsilon(s)$
Important Point
The choice of approach (detailed fit of a
trend vs. large-scale trend + autocorrelation)
to estimating/predicting Z depends strongly
on the reason for and uses of the model
E.g. if you are interested in predicting Z at
unsampled locations within the study area, then
any model that uses covariates to estimate large-
scale trend must also have the covariates known
for the unsampled locations
E.g. if you are interested in understanding the
reasons for the spatial distribution of Z then you
may or may not want to incorporate a spatial
correlation component
Correlation Structure
(Semivariogram)
Now, to assess spatial autocorrelation we look at
the behavior of the following:

$$\gamma_{ij} = \frac{[Z(s_i) - Z(s_j)]^2}{2}$$

for every possible pair of locations in the dataset (N
locations yields N(N-1)/2 pairs).
Correlated: we would expect $Z(s_i)$ to be similar in
value to $Z(s_j)$ and hence the squared difference to
be small.
Independent: we would expect the squared
difference to be relatively large since the two
numbers would vary according to the population
variability.
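The pairwise quantity described above can be sketched in Python as follows. This is a minimal illustration on made-up data; the function name `variogram_cloud` and the sample coordinates are assumptions, not part of the original slides. It returns one (distance, half squared difference) point per unordered pair, which is what a variogram cloud plots.

```python
from itertools import combinations
from math import hypot

def variogram_cloud(coords, values):
    """Return (distance, gamma_ij) for every unordered pair of locations."""
    cloud = []
    for i, j in combinations(range(len(coords)), 2):
        (xi, yi), (xj, yj) = coords[i], coords[j]
        dist = hypot(xi - xj, yi - yj)                 # Euclidean separation
        gamma_ij = (values[i] - values[j]) ** 2 / 2    # half the squared difference
        cloud.append((dist, gamma_ij))
    return cloud

# Hypothetical data: 4 locations, so 4 * 3 / 2 = 6 pairs.
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 4.0)]
values = [1.0, 1.5, 0.5, 4.0]
cloud = variogram_cloud(coords, values)
print(len(cloud))
for dist, g in cloud:
    print(f"{dist:.3f}  {g:.3f}")
```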
Plot (Variogram Cloud)
[Figure: variogram cloud, gamma plotted against distance]
Looking for a pattern, i.e. is there a trend in $\gamma_{ij}$ with respect to
the distance between two locations?
Variogram cloud for a dataset of 400 observations
Empirical Variogram
The variogram cloud is usually very
uninformative
Difficult to discern trend or pattern
More pertinent is to calculate the average
values of $\gamma$ for different distances
Problem is we don't usually have discrete
distances between locations (happens only
when data are on a perfect grid).
A common method for averaging at specific
distances is to bin the distances into intervals
(called lag distances), i.e. use all pairs of points
whose separation falls within some bin width
around a given distance value
Continuous Data Geostatistics
Because we do not usually have lots of values at
discrete distances, a common method for averaging
the values at discrete distances is to use all points
within some bin width around a given distance value.

So we choose several levels of h (distances) and
calculate the empirical variogram:

$$2\hat\gamma(h) = \frac{1}{|N(h)|} \sum_{N(h)} [Z(s) - Z(t)]^2$$

where N(h) is the set of all pairs of locations that are a distance
of h apart within a tolerance region around h, i.e.

$$N(h) = \{(s, t) : \|s - t\| = h \ \text{or}\ \|s - t\| \in \operatorname{tol}(h)\}$$

and |N(h)| is the number of pairs in N(h).
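The binned estimator described above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `empirical_semivariogram`, the symmetric tolerance `tol` around each lag, and the toy transect are all assumptions.

```python
from itertools import combinations
from math import hypot

def empirical_semivariogram(coords, values, lags, tol):
    """Classical estimator: for each lag h, average [Z(s) - Z(t)]^2 / 2
    over the pairs whose separation lies within `tol` of h."""
    sums = [0.0] * len(lags)
    counts = [0] * len(lags)
    for i, j in combinations(range(len(coords)), 2):
        dist = hypot(coords[i][0] - coords[j][0], coords[i][1] - coords[j][1])
        sq_diff = (values[i] - values[j]) ** 2
        for k, h in enumerate(lags):
            if abs(dist - h) <= tol:
                sums[k] += sq_diff
                counts[k] += 1
    # gamma_hat(h) = sum / (2 |N(h)|); None where a bin holds no pairs
    gamma = [s / (2 * c) if c else None for s, c in zip(sums, counts)]
    return gamma, counts

# Hypothetical transect along x whose values rise by 1 per unit distance.
coords = [(float(x), 0.0) for x in range(6)]
values = [float(x) for x in range(6)]
gamma, counts = empirical_semivariogram(coords, values, lags=[1, 2, 3], tol=0.4)
print(gamma, counts)
```

Because the toy values follow a pure linear trend, the estimates keep growing with distance rather than leveling off.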
Empirical Semivariogram
[Figure: empirical semivariogram, gamma plotted against distance]
This plot is called an
omnidirectional classical
empirical semivariogram

Omnidirectional because the
direction between the pairs of
locations was ignored,

Classical because of the
equation used to estimate the
mean (alternatives exist that
are robust to outliers or to
failure of assumptions of the
model)

Semi because of the division
by 2 in the equation used
Graph based on a set of 20 distance lags
Important Points
The constantly increasing semi-variogram
indicates that there is a problem with this
dataset
Ideally, it should at some distance level off at the
variance of the process implying that at some
distance the relationship between 2 locations is
the same regardless of the distance between
them (i.e. observations are independent at large
distances)
This graph indicates that
The data imply correlation exists at all distances (and
therefore the study region is small relative to the range
of autocorrelation) or
The data have a large-scale trend which may account
for most of the seeming autocorrelation (small-scale
trend)
Semivariogram
[Figure: empirical semivariogram, gamma plotted against distance]
Note the rise and
then leveling off
of the $\gamma(h)$ values
as distance
increases
We'll cover shapes
for variograms in
more detail later
Empirical semivariogram for a different dataset in which
there was no large-scale trend but definite autocorrelation
Semivariogram
Note that the $\gamma(h)$
values are more-
or-less the same
regardless of
distance
Empirical semivariogram for a different dataset in which
there was no large-scale trend and no autocorrelation
[Figure: empirical semivariogram, gamma plotted against distance]
Important Points
If the empirical semivariogram increases with
distance between locations, then the
correlation between points is decreasing as
distance increases
The point at which it flattens to a constant
value is the distance beyond which any two
points are independent.
The value of $\gamma(h)$ at that plateau is the variance of the spatial
process
At this point in our analyses, the number of lag
distances you use is not that critical but when
we try to fit a curve to the empirical
semivariogram later the number of lags
becomes very important
Important Point About Directionality
Another point to consider is whether the
pattern of autocorrelation, i.e. the shape of
the curve describing the semivariogram, is
the same in every direction.
Can't tell from the omnidirectional plot.
Need to check if there is a directional effect
Directional Semivariograms
To check directionality in the
covariance, plot $\gamma(h)$ for each h for
different directions
Modify the sets of locations over
which the averaging occurs
Typically done using a set of binned
directions (wedges of the compass)
Requires that you modify the definition
of neighborhood

$$N(h, \angle) = \{(s, t) : \|s - t\| \in \operatorname{tol}(h),\ \angle \in \operatorname{tol}(\text{angle})\}$$

where $\angle$ is the direction of the vector between s and t
Directional Semivariograms
EXAMPLE: calculate mean variability for the angles 0°, 22.5°, 45°, 67.5°,
90°, and 112.5° with a tolerance of 11.25° on each side.
[Figure: directional semivariograms, gamma plotted against distance, one panel per angle: 0, 22.5, 45, 67.5, 90, 112.5]
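The directional neighborhood can be sketched in Python by adding an angle test to the distance test. This is a minimal sketch under stated assumptions: the function name `directional_semivariogram` is made up, angles are measured in degrees counterclockwise from the x-axis (a compass azimuth convention would need a different conversion), and directions are treated as axial, so an angle and the same angle plus 180° count as the same direction.

```python
from itertools import combinations
from math import atan2, degrees, hypot

def directional_semivariogram(coords, values, h, h_tol, angle, angle_tol):
    """Average [Z(s) - Z(t)]^2 / 2 over pairs whose separation is within
    h_tol of h and whose direction is within angle_tol degrees of `angle`."""
    total, count = 0.0, 0
    for i, j in combinations(range(len(coords)), 2):
        dx = coords[j][0] - coords[i][0]
        dy = coords[j][1] - coords[i][1]
        theta = degrees(atan2(dy, dx)) % 180.0     # axial direction in [0, 180)
        diff = abs(theta - angle % 180.0)
        diff = min(diff, 180.0 - diff)             # wrap around at 0/180
        if abs(hypot(dx, dy) - h) <= h_tol and diff <= angle_tol:
            total += (values[i] - values[j]) ** 2
            count += 1
    return (total / (2 * count) if count else None), count

# Hypothetical 3 x 3 grid whose values change only in the x direction.
coords = [(float(x), float(y)) for x in range(3) for y in range(3)]
values = [x for x, _ in coords]
g0, n0 = directional_semivariogram(coords, values, 1.0, 0.1, 0.0, 11.25)
g90, n90 = directional_semivariogram(coords, values, 1.0, 0.1, 90.0, 11.25)
print(g0, n0, g90, n90)
```

In this toy grid the two directions give different estimates at the same lag, which is the kind of directional effect the panels are meant to reveal.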
Need for Assumptions in Order to
Proceed Beyond This Point
The data that are collected are a
partial observation of the spatial
surface (e.g. map) that we are
interested in
In addition, it is usually assumed that
there is some super process that
created the particular surface for
which we have this partial view
To estimate the spatial autocorrelation we
need to make some assumptions.
Otherwise, we don't have sufficient
information to make any inferences.
Two Assumptions
Stationarity, specifically second-order
stationarity
Isotropy
Stationarity
The mean of the process is constant, i.e. no trend

$$\mu(s) = \mu \quad \text{for all } s \in D \qquad (1)$$

The covariance between any pair of points
depends only on the distance (and possibly
direction) between the points, NOT the location of the
points in space:

$$\operatorname{cov}(Z(s_i), Z(s_j)) = C(s_i - s_j), \quad s_i, s_j \in D$$

where C(.) is the covariance function
This implies that the variance of Z is constant everywhere
If both conditions are met then the spatial process we
are studying is said to be second-order
stationary.
Relationship between Semivariogram and Correlation
Assuming intrinsic stationarity, we have

$$E[Z(s + h) - Z(s)] = 0$$

$$\operatorname{Var}[Z(s + h) - Z(s)] = 2\gamma(h)$$

Now, assuming that $\operatorname{Var}[Z(s_1)] = \operatorname{Var}[Z(s_2)] = \sigma^2$,
we have

$$\operatorname{Var}[Z(s_1) - Z(s_2)] = \operatorname{Var}[Z(s_1)] + \operatorname{Var}[Z(s_2)] - 2\operatorname{Cov}[Z(s_1), Z(s_2)] = 2\sigma^2 - 2C(h)$$

where $h = s_1 - s_2$. Thus,

$$\gamma(h) = \sigma^2 - C(h)$$
Isotropy
The covariance between any pair of
points does not depend on direction
but only distance

$$\operatorname{cov}(Z(s_i), Z(s_j)) = C(\|s_i - s_j\|) = C(h)$$

If this holds
then the spatial
process is said
to be isotropic
Non-Constant Mean
Two ways to handle a trend when it does
exist:
Detrend the data using regression (or similar) with
covariates and then use the residuals from the
trend analysis for the spatial autocorrelation
analysis
E.g. disease rates as a function of population density
Universal kriging (UK) which allows for estimating
the trend as a global polynomial in s = (x, y) and
estimating the spatial autocorrelation
simultaneously
UK ignores other explanatory covariates which can be
advantageous or not depending on the purpose of your
study
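The first option (detrend, then analyze the residuals) can be sketched in Python as follows. This is a minimal sketch assuming a single covariate and a straight-line trend fitted by ordinary least squares; the function name `detrend` and the data are hypothetical.

```python
from statistics import mean

def detrend(x, z):
    """Fit z = a + b*x by ordinary least squares; return (a, b, residuals)."""
    mx, mz = mean(x), mean(z)
    sxz = sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxz / sxx                  # slope
    a = mz - b * mx                # intercept
    residuals = [zi - (a + b * xi) for xi, zi in zip(x, z)]
    return a, b, residuals

# Hypothetical covariate values and responses.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
z = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9]
a, b, residuals = detrend(x, z)
print(f"intercept = {a:.3f}, slope = {b:.3f}")
print([round(r, 3) for r in residuals])
```

The residuals, rather than the raw values, would then be passed to the semivariogram calculation.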
Non-Constant Variance
To account for heterogeneity (non-
constant variance),
estimate variability in smaller subregions of
the study area
Need to make decisions about the size and extent of
the subregions
Need sufficient numbers of observations within each
subregion
Transform or standardize your data so that the
variability of the transformed values is constant
over the region
Anisotropy
Two types of anisotropy
Geometric
the range over which correlation is non-zero depends
on direction
The variance is constant over all directions
This type can be adjusted for in geostatistical analyses
Zonal
Anything not geometric anisotropy
Anisotropy implies that the spatial process
evolves differentially throughout the study
region
Variography
Fitting a valid semivariogram function
to the empirical semivariogram
Now we are interested in describing
the variogram as an equation in which
variance is a function of the distance.
We shall assume that the spatial
process is second-order stationary
and isotropic in the following.
Semivariogram
We have already seen how to obtain the empirical
estimate of the variogram

$$2\gamma(h) = \operatorname{var}(Z(s) - Z(t))$$

$\gamma(h)$ is the semivariogram and is the primary
quantity of interest because

$$\gamma(h) = C(0) - C(h)$$

Now we are interested in describing the
semivariogram as a function of the distance.

We shall assume that the spatial process is second-
order stationary and isotropic in the following.
Semivariogram
Semivariogram Models have the following
properties:

1) Many are not linear in their parameters

2) Must be conditionally negative-definite, i.e. the
function must satisfy

$$\sum_{s} \sum_{t} a_s a_t \, 2\gamma(s - t) \le 0$$

for any real numbers $\{a_i\}$ satisfying $\sum_i a_i = 0$

3) If $\gamma(h) \to c_0 > 0$ as $h \to 0$, there is microscale
variation which is assumed to be due to
measurement error (ME) or a process occurring at
the microscale. ME is measurable only if we have
replicate values at each location in the sample.
Semivariogram
Semivariogram Models have the following
properties:

If $\gamma(h)$ is constant for every h except h = 0 where
$\gamma(0) = 0$, then Z(s) and Z(t) are uncorrelated for
any pair of locations s and t

$2\gamma(h) / \|h\|^2 \to 0$ as $\|h\| \to \infty$, i.e. $\|h\|^2$ is
increasing faster than $\gamma(h)$ as h increases
A Typical Semivariogram
[Figure: semivariogram plotted against distance, annotated with the nugget, sill, and range]
Characteristics of the Semivariogram
It is 0 when the separation distance is 0 ($\gamma(0) = 0$).
Nugget effect: variation in two points very close
together.
May be measurement error
May be indicative of erratic process (gold ore).
The sill corresponds to the overall variance of the
data.
Data separated by distances less than the range
are spatially autocorrelated (Less variation
between close observations than between far
observations.)
Estimating the Semivariogram
Take all pairwise differences in the data:
$Z(s_i) - Z(s_j)$, where $s = (x, y)$ is a point in the 2-D plane.
Compute the Euclidean distance between the
spatial locations:

$$\|s_i - s_j\| = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$

Average pairs that have the same distance
class;
Binning: like a 2-D histogram.
End Result: Empirical Semivariogram
Modeling the Semivariogram
The semivariogram measures variation among
units h units apart.
Note: We do not want negative standard errors.
So, we model the semivariogram with selected
parametric functions ensuring all standard errors
are nonnegative.
We estimate the nugget, sill, and range
parameters of the model that best fit the empirical
semivariogram (nonlinear least squares problem).
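As a rough sketch of this fitting step, the Python code below estimates the nugget, sill, and range by minimizing the sum of squared differences between a model curve and the empirical values. Two things are assumptions made for illustration: the exponential form used for the model (one common parameterization), and the coarse grid search, which is swapped in here for a proper nonlinear least-squares solver so the example needs only the standard library. The names `exp_model` and `fit_by_grid` and all numbers are hypothetical.

```python
from itertools import product
from math import exp

def exp_model(h, nugget, sill, range_):
    """Exponential semivariogram for h > 0 (an assumed parameterization)."""
    return nugget + (sill - nugget) * (1.0 - exp(-h / range_))

def fit_by_grid(lags, gamma_hat, nuggets, sills, ranges):
    """Return the grid point (nugget, sill, range) with the smallest
    sum of squared errors, together with that error."""
    best, best_sse = None, float("inf")
    for nugget, sill, range_ in product(nuggets, sills, ranges):
        if sill < nugget:
            continue  # the sill cannot lie below the nugget
        sse = sum((g - exp_model(h, nugget, sill, range_)) ** 2
                  for h, g in zip(lags, gamma_hat))
        if sse < best_sse:
            best, best_sse = (nugget, sill, range_), sse
    return best, best_sse

# Hypothetical empirical semivariogram generated from known parameters.
lags = [1.0, 2.0, 4.0, 6.0, 8.0, 10.0]
gamma_hat = [exp_model(h, 0.2, 1.0, 3.0) for h in lags]
grid = [i / 10 for i in range(21)]             # candidate nuggets and sills
best, sse = fit_by_grid(lags, gamma_hat, grid, grid, [1.0, 2.0, 3.0, 4.0, 5.0])
print(best, sse)
```

A grid search only finds the best candidate on the grid; in practice an iterative least-squares routine would refine the estimates between grid points.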
Selected semivariogram models
Covariogram Models
Power Model is simply a
reparameterization of the
exponential model.
[Figure: panels showing the Spherical Model, Exponential Model, and Gaussian Model]
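For reference, the three named models can be written as short Python functions. These follow one common textbook parameterization in terms of nugget, sill, and range; other sources scale the range differently (for example using a "practical range"), so the exact forms here are an assumption rather than the definitions used on the slide. The function names and parameter values are hypothetical.

```python
from math import exp

def spherical(h, nugget, sill, range_):
    """Rises from the nugget and reaches the sill exactly at h = range_."""
    if h == 0:
        return 0.0
    if h >= range_:
        return sill
    r = h / range_
    return nugget + (sill - nugget) * (1.5 * r - 0.5 * r ** 3)

def exponential(h, nugget, sill, range_):
    """Approaches the sill asymptotically."""
    if h == 0:
        return 0.0
    return nugget + (sill - nugget) * (1.0 - exp(-h / range_))

def gaussian(h, nugget, sill, range_):
    """Approaches the sill asymptotically, with a flat start near h = 0."""
    if h == 0:
        return 0.0
    return nugget + (sill - nugget) * (1.0 - exp(-(h / range_) ** 2))

# Hypothetical parameter values.
for h in [0.0, 2.5, 5.0, 10.0]:
    print(h,
          round(spherical(h, 0.1, 1.1, 5.0), 4),
          round(exponential(h, 0.1, 1.1, 5.0), 4),
          round(gaussian(h, 0.1, 1.1, 5.0), 4))
```

All three return 0 at zero separation and jump to the nugget just beyond it, matching the nugget effect described for the typical semivariogram.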
Covariogram vs. Semivariogram
The covariogram and semivariogram are related:
$$\gamma(h) = C(0) - C(h)$$
The fitted semivariogram model
Estimates: nugget=0.084, sill=0.269, range=110.3 miles
Common methods for fitting these functions to a set of empirical
semivariogram means:

1) choose the most likely candidate model

2) Methods for estimating the parameters of the model:

non-linear least squares: allows for the estimation of parameters
that enter the equation non-linearly but ignores any dependences among the
empirical variogram values

non-linear weighted least squares: generalized least squares in which the
variance-covariance of the variogram data points is accounted for in the
estimation procedure

maximum likelihood: assumes the data are Normally distributed, but the
estimators are likely to be highly biased, especially in small samples (the
usual remedy is jackknifing)

restricted maximum likelihood: maximizes a slightly altered likelihood function
which reduces the bias of the MLEs
Properties of Variogram
Models
If $\gamma(h) \to c_0 > 0$ as $h \to 0$ then there is microscale
variation
Usually assumed to be due to measurement
error (ME)
ME is measurable only if we have replicate
values at each location in the sample
When fitting a variogram function, you may estimate
a non-zero value for $c_0$ even when you do not
have replicate observations at sites. This is
called the nugget.

If $\gamma(h)$ is constant for every h except h = 0
where $\gamma(0) = 0$, then $Z(s_i)$ and $Z(s_j)$ are
uncorrelated for any pair of locations $s_i$ and $s_j$
Choosing a Best Model
Need to choose the variogram model that
best fits the data
Best = minimum unexplained variation after
fitting
Look at a measure of deviance

$$\sum_i [\hat\gamma(h_i) - \gamma(h_i; \hat\theta)]^2$$

where $\hat\gamma(h_i)$ is the empirical semivariogram for the $i$th
lag and $\gamma(h_i; \hat\theta)$ is the value predicted by the fitted
semivariogram model
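The comparison can be sketched in Python as a sum of squared differences between the empirical values and each candidate model's predictions at the same lags. The numbers and the labels "A" and "B" are hypothetical; in practice the predictions would come from the fitted models.

```python
def deviance(gamma_hat, gamma_model):
    """Sum of squared differences between empirical and fitted values."""
    return sum((e - m) ** 2 for e, m in zip(gamma_hat, gamma_model))

# Hypothetical empirical values and predictions from two candidate models.
gamma_hat = [0.30, 0.55, 0.80, 0.95, 1.00]
model_a = [0.32, 0.56, 0.78, 0.93, 1.01]
model_b = [0.20, 0.40, 0.60, 0.80, 1.00]

scores = {"A": deviance(gamma_hat, model_a), "B": deviance(gamma_hat, model_b)}
best = min(scores, key=scores.get)   # the model with the smaller deviance
print(scores, best)
```

An unweighted sum treats every lag equally; weighting by the number of pairs per lag is a common variation but is not shown here.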
Choosing a Best Model
In the absence of comparing deviance
(or similar) measures to determine if
the model seems appropriate
Compare fits visually
Use prior knowledge from other studies to
determine

Next Steps
Using the results of the variography to
do statistical modeling of the spatial
process
kriging
