Overheads
D G Rossiter
Department of Earth Systems Analysis
International Institute for Geo-information Science & Earth Observation (ITC)
<http://www.itc.nl/personal/rossiter>
July 9, 2005
AN INTRODUCTION TO APPLIED GEOSTATISTICS 1
Model validation
With any predictive method, we would like to know how good it is. This is model
validation.
The basic idea is to compare model predictions with reality. Two main
methods: independent validation and cross-validation.
D G ROSSITER
Independent validation
Simple measures of validity:
• Root mean squared error (RMSE) of the residuals, i.e. the differences between
the estimate (from the model) and the actual value in the validation dataset;
lower is better:
$$\mathrm{RMSE} = \left[ \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 \right]^{1/2}$$
• Bias or mean error (ME) of estimated vs. actual mean of the validation
dataset; should be zero (0)
$$\mathrm{ME} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)$$
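As a minimal sketch of these two measures, the following base R code computes RMSE and ME for a small validation set. The vectors `observed` and `predicted` are hypothetical values invented for illustration; they are not from the Meuse data.

```r
# Hypothetical validation set: actual values and model estimates
observed  <- c(2.1, 1.8, 0.9, 1.2, 2.6)
predicted <- c(1.9, 2.0, 1.1, 1.0, 2.2)

# Differences as in the formulas above: estimate minus actual
res <- predicted - observed

rmse <- sqrt(mean(res^2))  # root mean squared error; lower is better
me   <- mean(res)          # mean error (bias); ideally close to zero

print(c(RMSE = rmse, ME = me))
```

A negative ME here would mean that the model under-predicts on average; RMSE is in the same units as the target variable.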
Cross-validation
If we don’t have an independent data set to evaluate a model, we can use the
same sample points that were used to estimate the model to validate that same
model.
This seems a bit dubious, but with enough points, the effect of the removed point
on the model (which was estimated using that point) is minor.
Cross–validation procedure
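The leave-one-out idea can be sketched in base R as follows. This is an illustration under stated assumptions, not the gstat implementation: inverse-distance weighting (power 2) stands in for kriging so that no extra packages are needed, the data are synthetic, and the helper `idw_predict` is a hypothetical name.

```r
# Leave-one-out cross-validation with a stand-in predictor.
# ASSUMPTION: inverse-distance weighting replaces kriging; data are synthetic.
set.seed(42)
n   <- 50
pts <- data.frame(x = runif(n), y = runif(n))
pts$z <- sin(3 * pts$x) + cos(3 * pts$y) + rnorm(n, sd = 0.1)

# Predict at location (x0, y0) from the data frame `d` by IDW (hypothetical helper)
idw_predict <- function(d, x0, y0, power = 2) {
  dist <- sqrt((d$x - x0)^2 + (d$y - y0)^2)
  w <- 1 / dist^power
  sum(w * d$z) / sum(w)
}

# For each point: remove it, predict it from the remaining points, keep the prediction
pred <- vapply(seq_len(n), function(i) {
  idw_predict(pts[-i, ], pts$x[i], pts$y[i])
}, numeric(1))

# Observed minus predicted, the sign convention of the krige.cv output
residual <- pts$z - pred
cv <- c(ME = mean(residual), RMSE = sqrt(mean(residual^2)))
print(cv)
```

Each point is predicted only from the other points, so the residuals are honest about prediction at unsampled locations, to the extent that removing one point changes the model little.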
Cross-validation in gstat
> # leave-one-out cross-validation
> kcv <- krige.cv(log(cadmium) ~ 1, ~ x + y, meuse, model = m2)
> str(kcv)
'data.frame': 155 obs. of 8 variables:
$ x : num 181072 181025 181165 181298 181307 ...
$ y : num 333611 333558 333537 333484 333330 ...
$ var1.pred: num 1.482 1.649 1.452 1.259 0.952 ...
$ var1.var : num 0.796 0.781 0.768 0.811 0.761 ...
$ observed : Named num 2.460 2.152 1.872 0.956 1.030 ...
..- attr(*, "names")= chr "1" "2" "3" "4" ...
$ residual : num 0.9774 0.5031 0.4195 -0.3033 0.0773 ...
$ zscore : num 1.0953 0.5692 0.4786 -0.3368 0.0886 ...
$ fold : int 1 2 3 4 5 6 7 8 9 10 ...
> library(MASS)  # provides truehist()
> truehist(kcv$residual)
> # some residuals are very large; show their locations
> bubble(kcv, z = "residual", fill = FALSE)
> # measures of goodness: ME near 0, MSE low,
> # MSDR (mean squared deviation ratio) near 1
> mean(kcv$residual)
[1] 7.996923e-05
> mean(kcv$residual^2)
[1] 0.8644106
> mean(kcv$residual^2 / kcv$var1.var)
[1] 1.129047
[Figure: histogram of the cross-validation residuals (range about −3 to 2) and bubble plot of the residuals by location (vertical axis y, 330000 to 333000); bubble legend "residual": −2.787, −0.384, 0.057, 0.565, 2.424]
Spatial simulation
Simulation is the process or result of representing what reality might look like,
given a model.
• So we’d like to see “alternative realities”; that is, spatial patterns that, by this
theory, could have occurred in some “parallel universe”.
* Even if there is a high nugget effect in the variogram, this variability is not
reflected in adjacent prediction points, since they are estimated from almost
the same data.
• Recall: kriging prediction also provides a prediction error; this is the BLUP
and its error for each prediction location separately.
• But these per-location errors are not valid for evaluating the set of
predictions as a whole! Errors are by definition spatially correlated (as shown
by the fitted variogram model), so we can’t simulate the error in a field by
simulating the error in each point separately.
• If the distribution of the target variable(s) over the study area is to be used as
input to a model, then the uncertainty is represented by a number of
simulations.
• Procedure:
1. Simulate a “large” number of realizations of the spatial field
2. Run the model on each simulation
3. Summarize the output of the different model runs
• The statistics of the output give a direct measure of the uncertainty of the
model in the light of the sample and the model of spatial variability.
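The three-step procedure above can be sketched in base R on a small synthetic transect. Everything specific here is an assumption made for illustration: the exponential covariance (sill 1, range parameter 10), the known mean of 0, simulation by Cholesky factorization of the covariance matrix, and the toy model `toy_model` (the fraction of the transect above a threshold).

```r
# Uncertainty propagation through simulation on a synthetic 1-D transect.
# ASSUMPTIONS: exponential covariance (sill 1, range parameter 10), mean 0,
# and a toy "model" that returns the fraction of locations above a threshold.
set.seed(1)
s     <- 1:50                        # locations along the transect
h     <- abs(outer(s, s, "-"))       # pairwise separation distances
Sigma <- 1.0 * exp(-h / 10)          # covariance matrix of the field
L     <- t(chol(Sigma))              # lower-triangular Cholesky factor

# 1. Simulate many realizations: each column is one spatially correlated field
nsim <- 500
sims <- L %*% matrix(rnorm(length(s) * nsim), nrow = length(s))

# 2. Run the model on each realization
toy_model <- function(field, threshold = 1) mean(field > threshold)
out <- apply(sims, 2, toy_model)

# 3. Summarize the spread of the model output
summary_stats <- c(mean = mean(out), sd = sd(out),
                   quantile(out, c(0.05, 0.95)))
print(summary_stats)
```

The spread of `out` across realizations is the uncertainty measure; simulating each location independently would ignore the spatial correlation and would typically give a misleadingly narrow spread for an areal summary such as this one.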
Unconditional simulation
In unconditional simulation, we simulate the field with no reference to the actual
sample, i.e. the data we have. (It’s only one realisation, no more valid than any
other.)
2. Covariance structure
Since there are no data with which to estimate the mean, it must be specified as
the beta parameter.
> x <- krige(log(cadmium) ~ 1, ~ x + y, data = NULL, newdata = meuse.grid,
+ model = m2, nmax = 20, nsim = 5, beta = mean(log(meuse$cadmium)), dummy = TRUE)
[using unconditional gaussian simulation]
> levelplot(z ~ x + y | name, map.to.lev(x, z = c(3:7)), aspect = mapasp(x),
+ main = "five unconditional realisations of a correlated Gaussian field")
Conditional simulation
This simulates the field, while respecting the sample. So the simulated maps look
more like the best (kriging) prediction, but usually much more spatially-variable
(depending on the magnitude of the nugget).
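One common way to make a simulation respect the sample is "conditioning by kriging": start from an unconditional realization and add the kriged difference between the real and the simulated sample values. The base R sketch below illustrates this on a 1-D transect. It is not the algorithm gstat uses, and it rests on assumptions chosen for brevity: simple kriging with a known mean of 0, an exponential covariance with no nugget, and five invented sample values.

```r
# Conditional simulation by "conditioning by kriging" on a 1-D transect.
# ASSUMPTIONS: simple kriging with known mean 0, exponential covariance with
# no nugget (sill 1, range parameter 10), and a synthetic sample of 5 points.
set.seed(7)
cov_fun <- function(h) exp(-h / 10)

grid  <- 1:60                                  # prediction locations
obs_i <- c(5, 18, 30, 44, 57)                  # sample locations (grid indices)
obs_z <- c(0.8, -0.3, 1.1, -0.9, 0.2)          # sample values (invented)

C_all <- cov_fun(abs(outer(grid, grid, "-")))  # covariance among all locations
C_dd  <- C_all[obs_i, obs_i]                   # data-to-data covariance
C_gd  <- C_all[, obs_i]                        # grid-to-data covariance
W     <- C_gd %*% solve(C_dd)                  # simple-kriging weights, one row per node

# Kriging prediction from the real sample (mean 0)
z_krig <- as.vector(W %*% obs_z)

# One unconditional realization, then condition it on the sample
L   <- t(chol(C_all))
z_u <- as.vector(L %*% rnorm(length(grid)))
z_c <- z_u + as.vector(W %*% (obs_z - z_u[obs_i]))

# With no nugget, the conditional realization reproduces the sample values
print(round(cbind(sample = obs_z, simulated = z_c[obs_i]), 6))
# Compare the variability of the kriging prediction and of the realization
print(c(var_kriging = var(z_krig), var_conditional = var(z_c)))
```

At the sample locations the kriging weights reduce to the identity, so `z_c` equals the sample there; away from the sample it keeps the short-range variability of the unconditional field, which the smooth kriging prediction lacks.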
2. Covariance structure
Indicator simulation
(See notes “Indicator Kriging”)
Indicator variables can also be simulated. Here the result is a 0/1 variable:
indicator false/true. This is unlike IK where the result is a probability of a 1
(indicator is true).
In gstat, the target value must be an indicator, and the indicator=TRUE argument
must be included. The mean is estimated by the proportion of true indicators in
the sample.
threshold <- 4
# store the indicator in the data frame so that the formula can find it
meuse$indicator <- (meuse$cadmium >= threshold)
vi <- variogram(indicator ~ 1, ~ x + y, meuse)
mi.f <- fit.variogram(vi, vgm(0.09, "Gau", 500, 0.11))
# simulate with the fitted indicator variogram model mi.f
sims <- krige(indicator ~ 1, ~ x + y, meuse, meuse.grid,
    model = mi.f,
    nsim = 6, indicator = TRUE,
    nmax = 64, beta = mean(meuse$indicator))
levelplot(z ~ x + y | name,
    map.to.lev(sims, z = c(3:8)), aspect = mapasp(sims),
    main = "Six conditional realisations of an indicator variable")
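The indicator transform and the proportion supplied as the mean can be checked on a small vector. This is only an illustrative sketch: `cd` holds invented concentrations, not the Meuse cadmium values.

```r
# Indicator transform of a continuous variable at a threshold.
# ASSUMPTION: `cd` is a hypothetical vector of concentrations.
cd        <- c(5.2, 0.9, 4.0, 2.3, 7.1, 1.5, 3.9, 6.4, 0.4, 2.0)
threshold <- 4
ind       <- cd >= threshold   # TRUE where the threshold is reached or exceeded

# Proportion of TRUE indicators: mean() of a logical vector gives the same
# quantity as sum(ind) / length(ind)
p <- mean(ind)
print(p)
```

Note that a value exactly at the threshold (4.0 here) counts as TRUE because the comparison is `>=`.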