
An introduction to applied geostatistics

Part 4 – Model validation; simulation

Overheads
D G Rossiter
Department of Earth Systems Analysis
International Institute for Geo-information Science & Earth Observation (ITC)
<http://www.itc.nl/personal/rossiter>

July 9, 2005

Model validation
With any predictive method, we would like to know how good it is. This is model
validation.

• cf. model calibration, when we are building the model

The basic idea is to compare model predictions with reality. Two main
methods:

1. Separate validation dataset

2. Cross-validation using calibration dataset


Independent validation
Simple measures of validity:

• Root mean squared error (RMSE) of the residuals, i.e. the differences between
the estimates (from the model) and the actual values in the validation dataset;
lower is better:

$$\mathrm{RMSE} = \left[ \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 \right]^{1/2}$$

• Bias or mean error (ME) of estimated vs. actual mean of the validation
dataset; should be zero (0):

$$\mathrm{ME} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)$$
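
The two measures can be sketched in a few lines of base R. This is only an illustration: the function names `rmse` and `me` and the five-point validation set are hypothetical, not part of the course data.

```r
# Root mean squared error and mean error (bias) of predictions.
# `predicted` and `actual` are numeric vectors of equal length.
rmse <- function(predicted, actual) sqrt(mean((predicted - actual)^2))
me   <- function(predicted, actual) mean(predicted - actual)

# Hypothetical validation set: five actual values and the model estimates
actual    <- c(2.0, 1.5, 3.0, 2.5, 1.0)
predicted <- c(2.5, 1.0, 3.5, 2.0, 1.5)

rmse_val <- rmse(predicted, actual)   # every error is +/- 0.5, so RMSE is 0.5
me_val   <- me(predicted, actual)     # errors nearly cancel, so ME is 0.1
```

Here every prediction is wrong by the same amount, yet the mean error is small because positive and negative errors cancel; this is why both measures are reported.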


Cross-validation
If we don’t have an independent data set to evaluate a model, we can use the
same sample points that were used to estimate the model to validate that same
model.

This seems a bit dubious, but with enough points, the effect of the removed point
on the model (which was estimated using that point) is minor.

N.b. this is not legitimate for non-geostatistical models, because there is no
theory of spatial correlation.


Cross–validation procedure

1. Compute experimental variogram with all sample points; model it

2. For each point

(a) Remove the point from the sample set
(b) Predict at that point using the other points and the modelled variogram

3. Summarize the deviations of the predictions from the actual values

Models can then be compared by their summary statistics, and also by looking at
individual predictions of interest.
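
The loop in steps 2 and 3 can be sketched in base R without gstat. Everything here is an assumption made for illustration: the data are synthetic, the helper names `cov_exp` and `ok_predict` are hypothetical, and the spatial model is a fixed exponential covariance with made-up parameters (step 1, fitting the variogram, is skipped). Prediction is by ordinary kriging written in its covariance form.

```r
# Exponential covariance model; parameters are assumed, not fitted
cov_exp <- function(h, sill = 1, range = 300, nugget = 0.1) {
  ifelse(h == 0, sill + nugget, sill * exp(-h / range))
}

# Ordinary kriging prediction and variance at one target location
ok_predict <- function(xy, z, target, covfun) {
  n <- nrow(xy)
  C <- covfun(as.matrix(dist(xy)))                  # data-to-data covariances
  c0 <- covfun(sqrt(colSums((t(xy) - target)^2)))   # data-to-target covariances
  A <- rbind(cbind(C, 1), c(rep(1, n), 0))          # add the unbiasedness constraint
  sol <- as.vector(solve(A, c(c0, 1)))
  lambda <- sol[1:n]                                # kriging weights (sum to 1)
  mu <- sol[n + 1]                                  # Lagrange multiplier
  c(pred = sum(lambda * z),
    var = covfun(0) - sum(lambda * c0) - mu)
}

# Synthetic sample: 40 points in a 1000 x 1000 area
set.seed(42)
n <- 40
xy <- cbind(x = runif(n, 0, 1000), y = runif(n, 0, 1000))
z <- sin(xy[, "x"] / 200) + cos(xy[, "y"] / 300) + rnorm(n, sd = 0.2)

# Step 2: remove each point in turn and predict it from the others
cv <- t(sapply(seq_len(n), function(i) {
  ok_predict(xy[-i, , drop = FALSE], z[-i], xy[i, ], cov_exp)
}))

# Step 3: summarize the deviations (residual = observed - predicted)
residual <- z - cv[, "pred"]
cv_stats <- c(ME = mean(residual),
              RMSE = sqrt(mean(residual^2)),
              MSDR = mean(residual^2 / cv[, "var"]))
```

Note that the same covariance model is used for every left-out point; it is not refitted inside the loop, which matches the procedure above.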


Summary statistics for cross–validation

• Root Mean Square Error (RMSE): lower is better; computed as for
independent validation

• Bias or mean error (ME): should be 0; computed as for independent validation

• Mean Squared Deviation Ratio (MSDR) of residuals with kriging variance:
should be 1

$$\mathrm{MSDR} = \frac{1}{N} \sum_{i=1}^{N} \frac{\{z(\vec{x}_i) - \hat{z}(\vec{x}_i)\}^2}{\hat{\sigma}^2(\vec{x}_i)}$$
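
A small worked example of the MSDR, as a sketch only: the function name `msdr` and the four observations, predictions and kriging variances are hypothetical.

```r
# Mean squared deviation ratio: squared cross-validation residuals
# divided by the kriging variance at the same locations, then averaged.
msdr <- function(observed, predicted, kriging_var) {
  mean((observed - predicted)^2 / kriging_var)
}

# Hypothetical cross-validation results at four points
observed    <- c(1.0, 2.0, 3.0, 4.0)
predicted   <- c(1.5, 1.5, 3.5, 3.0)
kriging_var <- c(0.25, 0.5, 0.25, 1.0)

msdr_val <- msdr(observed, predicted, kriging_var)  # ratios 1, 0.5, 1, 1 give 0.875
```

A value below 1, as here, indicates that the kriging variances are on average larger than the squared errors actually observed; a value above 1 indicates that the model understates the prediction variance.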


Cross-validation in gstat
> # leave-one-out cross-validation
> kcv<-krige.cv(log(cadmium)~1, ~x+y, meuse, model=m2); str(kcv)
'data.frame': 155 obs. of 8 variables:
$ x : num 181072 181025 181165 181298 181307 ...
$ y : num 333611 333558 333537 333484 333330 ...
$ var1.pred: num 1.482 1.649 1.452 1.259 0.952 ...
$ var1.var : num 0.796 0.781 0.768 0.811 0.761 ...
$ observed : Named num 2.460 2.152 1.872 0.956 1.030 ...
..- attr(*, "names")= chr "1" "2" "3" "4" ...
$ residual : num 0.9774 0.5031 0.4195 -0.3033 0.0773 ...
$ zscore : num 1.0953 0.5692 0.4786 -0.3368 0.0886 ...
$ fold : int 1 2 3 4 5 6 7 8 9 10 ...
> library(MASS)  # provides truehist()
> truehist(kcv$residual)
> # some residuals are very large, show their locations
> bubble(kcv, z="residual", fill=FALSE)
> # measures of goodness: ME = 0, MSE low, MSDR = 1
> mean(kcv$residual); mean(kcv$residual^2)
[1] 7.996923e-05
[1] 0.8644106
> mean((kcv$residual)^2/kcv$var1.var)
[1] 1.129047


Residuals from cross-validation and their location

[Figure: left panel, histogram of kcv$residual (horizontal axis from −3 to 2); right panel, bubble plot of the residuals at their (x, y) locations, with legend values −2.787, −0.384, 0.057, 0.565, 2.424.]


Spatial simulation
Simulation is the process or result of representing what reality might look like,
given a model.

In geostatistics, this reality is usually a spatial distribution (map).


What is stochastic simulation?

• “Simulation” is a general term for studying a system without physically
implementing it.

• “Stochastic” simulation means that there is a random component to the
simulation model: quantified uncertainty is included so that each simulation is
different.

• Non-spatial example: planning the number and timing of clerks in a new
branch bank; customer behaviour (arrival times, transaction length) is
stochastic and represented by probability distributions.

• Reference for spatial simulation:
Goovaerts, P., 1997. Geostatistics for natural resources evaluation. Applied
Geostatistics Series. Oxford University Press, New York; Chapter 8.


Why spatial simulation?

• Recall: the theory of regionalized variables assumes that the values we
observe come from some random process; in the simplest case, with one
expected value (first-order stationarity) with a spatially-correlated error that
is the same over the whole area (second-order stationarity).

• So we’d like to see “alternative realities”; that is, spatial patterns that, by this
theory, could have occurred in some “parallel universe”.

• In addition, kriging maps are unrealistically smooth, especially in areas
with low sampling density.

* Even if there is a high nugget effect in the variogram, this variability is not
reflected in adjacent prediction points, since they are estimated from almost
the same data.


When must simulation be used?


Goovaerts: “Smooth interpolated maps should not be used for applications
sensitive to the presence of extreme values and their patterns of continuity.”
(p. 370)

Example: ground water travel time depends on sequences of large or small
values (“critical paths”), not just on individual values.


Local uncertainty vs. spatial uncertainty

• Recall: kriging prediction also provides a prediction error; this is the BLUP
and its error for each prediction location separately.

• So, at each prediction location we obtain a probability distribution of the
prediction, a measure of its uncertainty. This is fine for evaluating each
prediction individually.

• But this is not valid for evaluating the set of predictions as a whole! Errors
are by definition spatially-correlated (as shown by the fitted variogram model),
so we can’t simulate the error in a field by simulating the error at each point
separately.

• Spatial uncertainty is a representation of the error over the entire field of
prediction locations at the same time.


Practical applications of spatial simulation

• If the distribution of the target variable(s) over the study area is to be used as
input to a model, then the uncertainty is represented by a number of
simulations.

• Procedure:
1. Simulate a “large” number of realizations of the spatial field
2. Run the model on each simulation
3. Summarize the output of the different model runs

• The statistics of the output give a direct measure of the uncertainty of the
model in the light of the sample and the model of spatial variability.
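
The three-step procedure can be sketched in base R as follows. All specifics are assumptions for illustration: the realizations are unconditional Gaussian fields drawn by Cholesky factorization of an assumed exponential covariance (in practice they would be conditional simulations from the fitted model), and `exceed_fraction` is a hypothetical stand-in for whatever model consumes the simulated field.

```r
set.seed(1)

# 1. Simulate a "large" number of realizations of the spatial field
grid <- expand.grid(x = seq(0, 900, by = 100), y = seq(0, 900, by = 100))
h <- as.matrix(dist(grid))             # distances between all grid nodes
C <- 1.0 * exp(-h / 300)               # assumed covariance: sill 1, range parameter 300
L <- t(chol(C))                        # lower-triangular factor, C = L %*% t(L)
nsim <- 200
field_mean <- 0.5                      # assumed field mean
sims <- field_mean + L %*% matrix(rnorm(nrow(grid) * nsim), ncol = nsim)
# each column of `sims` is one realization over the 100 grid nodes

# 2. Run the model on each simulation
# (hypothetical model: fraction of the area exceeding a threshold)
exceed_fraction <- function(field, threshold = 1.5) mean(field > threshold)
out <- apply(sims, 2, exceed_fraction)

# 3. Summarize the output of the different model runs
summary_stats <- c(mean = mean(out), quantile(out, c(0.05, 0.95)))
```

The spread between the 5% and 95% quantiles of `out` is the kind of direct uncertainty measure described above: it reflects how much the model output varies across equally plausible fields.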


Unconditional simulation
In unconditional simulation, we simulate the field with no reference to the actual
sample, i.e. the data we have. (The sample is only one realisation, no more valid
than any other.)

This is mainly to visualise a random field as modelled by a variogram, not for
prediction.


What is preserved in unconditional simulation?

1. Mean over field

2. Covariance structure

Data points are not predicted exactly.


Unconditional simulation in gstat


The krige function allows a number of simulations (nsim) to be specified. For
unconditional simulation, specify no data (data=NULL), and instead use the
dummy=TRUE option.

Since there is no data with which to estimate the mean, it must be specified as
the beta parameter.
> x <- krige(log(cadmium) ~ 1, ~ x + y, data = NULL, newdata = meuse.grid,
+ model = m2, nmax = 20, nsim = 5, beta=mean(log(cadmium)), dummy = TRUE)
[using unconditional gaussian simulation]
> levelplot(z ~ x + y | name, map.to.lev(x, z=c(3:7)), aspect = mapasp(x),
+ main = "five unconditional realisations of a correlated Gaussian field")


Conditional simulation
This simulates the field, while respecting the sample. So the simulated maps look
more like the best (kriging) prediction, but usually much more spatially-variable
(depending on the magnitude of the nugget).

These are inputs into spatially-explicit models, e.g. hydrology.


What is preserved in conditional simulation?

1. Mean over field

2. Covariance structure

3. Observed data (points are predicted exactly)
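
That the observed data are reproduced exactly can be checked with a small base-R sketch. It uses "conditioning by kriging" (an unconditional realization corrected by the kriged difference at the data locations), which is named here as an illustrative technique and is not necessarily the algorithm gstat uses. The 1-D grid, the four observations, the covariance function and the known mean are all hypothetical, and there is no nugget.

```r
set.seed(7)
covfun <- function(h) exp(-h / 20)              # assumed covariance: sill 1, no nugget
grid_x <- 0:100                                 # 1-D prediction grid
obs_x  <- c(10, 35, 60, 90)                     # hypothetical sample locations (on the grid)
obs_z  <- c(1.2, -0.4, 0.8, 0.1)                # hypothetical observed values
mu     <- 0                                     # assumed known mean (simple kriging)

C_dd <- covfun(abs(outer(obs_x, obs_x, "-")))   # data-to-data covariances
C_gd <- covfun(abs(outer(grid_x, obs_x, "-")))  # grid-to-data covariances
W <- C_gd %*% solve(C_dd)                       # simple kriging weights, one row per grid node
sk <- function(z_at_obs) as.vector(mu + W %*% (z_at_obs - mu))

# Unconditional realization on the grid (Cholesky factor of the grid covariance)
L <- t(chol(covfun(abs(outer(grid_x, grid_x, "-")))))
uncond <- as.vector(mu + L %*% rnorm(length(grid_x)))

# Condition: kriging of the data plus the simulated kriging error
idx  <- match(obs_x, grid_x)                    # grid nodes that coincide with the data
cond <- sk(obs_z) + (uncond - sk(uncond[idx]))

max_abs_diff <- max(abs(cond[idx] - obs_z))     # numerically zero: data are honoured
```

Away from the data the realization keeps the variability of the unconditional field, while at the four sample locations it passes through the observed values.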


Conditional simulation in gstat


Here the data must be named, so the dummy=TRUE option is not used. The beta
parameter may be given (usually as estimated from the data); if not it is estimated
by GLS.
> mean(log(cadmium))
[1] 0.5610659
> sims <- krige(log(cadmium) ~ 1, ~ x + y, meuse, meuse.grid,
+ model = m2, nmax = 64, nsim = 6, beta=mean(log(cadmium)))
[using conditional gaussian simulation]
> levelplot(z ~ x + y | name, map.to.lev(sims, z=c(3:8)), aspect = mapasp(sims),
+ main = "six conditional realisations of a correlated Gaussian field")


Indicator simulation
(See notes “Indicator Kriging”)

Indicator variables can also be simulated. Here the result is a 0/1 variable:
indicator false/true. This is unlike IK where the result is a probability of a 1
(indicator is true).
In gstat, the target value must be an indicator, and the indicator=TRUE argument
must be included. The mean is estimated by the proportion of true indicators in
the sample.
> threshold <- 4
> indicator <- (cadmium >= threshold)
> vi <- variogram(indicator ~ 1, ~ x + y, meuse)
> mi.f <- fit.variogram(vi, vgm(0.09, "Gau", 500, 0.11))
> sims <- krige(indicator ~ 1, ~ x + y, meuse, meuse.grid,
+ model = mi.f,
+ nsim = 6, indicator = TRUE,
+ nmax = 64, beta = sum(indicator)/length(indicator))
> levelplot(z ~ x + y | name,
+ map.to.lev(sims, z=c(3:8)), aspect = mapasp(sims),
+ main = "Six conditional realisations of an indicator variable")
