
What is Geostatistics?

“Geostatistics: study of phenomena

that vary in space and/or time”
(Deutsch, 2002)

“Geostatistics can be regarded as a

collection of numerical
techniques that deal with the
characterization of spatial attributes,
employing primarily random models
in a manner similar to the
way in which time series analysis
characterizes temporal data.”
(Olea, 1999)

“Geostatistics offers a way of

describing the spatial continuity of
natural phenomena and provides
adaptations of classical regression
techniques to take advantage of this
continuity.” (Isaaks and Srivastava, 1989)

Geostatistics deals with
spatially autocorrelated data.

Autocorrelation: correlation
between elements of a series
and others from the same
series separated from them
by a given interval. (Oxford
American Dictionary)

Some spatially autocorrelated parameters of
interest to reservoir engineers: facies,
reservoir thickness, porosity, and permeability.

A plot showing 100 random numbers with a
"hidden" sine function, and an autocorrelation
(correlogram) of the series on the bottom.

When computed, the resulting number can range

from +1 to -1. An autocorrelation of +1 represents
perfect positive correlation (i.e. an increase seen in
one time series will lead to a proportionate increase
in the other time series), while a value of -1
represents perfect negative correlation (i.e. an
increase seen in one time series results in a
proportionate decrease in the other time series).

This value can be useful in securities analysis.
For example, if you know a stock historically has a
high positive autocorrelation value and you have
witnessed the stock making solid gains over the
past several days, you might reasonably expect the
movements over the upcoming several days (the
leading time series) to match those of the lagging
time series and to move upwards.

Visual comparison of convolution, cross-correlation
and autocorrelation.
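The lag-k autocorrelation described above can be computed directly. In the sketch below, the noisy-sine series is an assumption standing in for the "100 random numbers with a hidden sine function" plot:

```python
import numpy as np

def autocorrelation(series, lag):
    """Lag-k autocorrelation: correlation between a series and a copy
    of itself shifted by `lag` positions (biased ACF estimator)."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    num = np.sum(x[:-lag] * x[lag:]) if lag > 0 else np.sum(x * x)
    return num / np.sum(x * x)

# a noisy sine: random-looking values with hidden structure
t = np.linspace(0.0, 4.0 * np.pi, 100)
rng = np.random.default_rng(0)
y = np.sin(t) + 0.3 * rng.standard_normal(100)
r1 = autocorrelation(y, 1)  # strongly positive at short lags
```

By construction the result always lies in [-1, +1], and a series with hidden periodic structure shows large positive values at short lags.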
Basic Components of Geostatistics

(Semi)variogram analysis – characterization

of spatial correlation

Kriging – optimal interpolation; generates

best linear unbiased estimate at each
location; employs semivariogram model

Stochastic simulation – generation of

multiple equiprobable images of the
variable; also employs semivariogram model

Variography: The process

of estimating the theoretical
semivariogram. Steps: (1)
exploratory data analysis, (2)
check for global trend, (3)
computation of the empirical
semivariogram, (4) binning
and fitting a semivariogram
model, (5) computation of
directional variograms to
identify anisotropy.
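Step (3), computing the empirical semivariogram, can be sketched as follows. This is a brute-force version for small datasets; the synthetic "porosity" data are an assumption for illustration:

```python
import numpy as np

def empirical_semivariogram(coords, values, lag_edges):
    """For each lag bin [lo, hi), average 0.5*(z(u) - z(u+h))**2 over
    all data pairs whose separation distance falls in the bin."""
    coords = np.asarray(coords, float)
    values = np.asarray(values, float)
    n = len(values)
    d, sq = [], []
    for i in range(n):
        for j in range(i + 1, n):
            d.append(np.linalg.norm(coords[i] - coords[j]))
            sq.append(0.5 * (values[i] - values[j]) ** 2)
    d, sq = np.array(d), np.array(sq)
    gamma, counts = [], []
    for lo, hi in zip(lag_edges[:-1], lag_edges[1:]):
        mask = (d >= lo) & (d < hi)
        counts.append(int(mask.sum()))
        gamma.append(sq[mask].mean() if mask.any() else np.nan)
    return np.array(gamma), np.array(counts)

# synthetic data: a smooth east-west trend plus noise on a 20 x 20 km area
rng = np.random.default_rng(1)
xy = rng.uniform(0.0, 20.0, size=(40, 2))          # well coordinates, km
poro = 0.1 * xy[:, 0] + rng.normal(0.0, 0.1, 40)   # toy porosity values
gam, cnt = empirical_semivariogram(xy, poro, [0.0, 5.0, 10.0, 15.0])
```

For spatially continuous data the estimated semivariance grows with lag, as it does here because of the built-in trend.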

Example of one-dimensional data interpolation by kriging, with confidence

intervals. Squares indicate the location of the data. The kriging interpolation
is in red. The confidence intervals are represented by gray areas.
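A minimal one-dimensional ordinary-kriging sketch in the spirit of this figure. The exponential covariance model and its parameters are assumptions for illustration, not taken from the figure:

```python
import numpy as np

def ordinary_kriging(x_data, z_data, x_pred, sill=1.0, a=2.0):
    """Ordinary kriging in 1D with an exponential covariance model
    C(h) = sill * exp(-3|h|/a).  Returns the best linear unbiased
    estimate and the kriging variance at each prediction point."""
    x_data = np.asarray(x_data, float)
    z_data = np.asarray(z_data, float)
    cov = lambda h: sill * np.exp(-3.0 * np.abs(h) / a)
    n = len(x_data)
    # kriging system with a Lagrange multiplier enforcing that the
    # weights sum to one (unbiasedness)
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = cov(x_data[:, None] - x_data[None, :])
    A[n, n] = 0.0
    preds, kvars = [], []
    for x0 in np.atleast_1d(x_pred):
        b = np.ones(n + 1)
        b[:n] = cov(x_data - x0)
        sol = np.linalg.solve(A, b)
        w, mu = sol[:n], sol[n]
        preds.append(w @ z_data)
        # kriging variance: C(0) minus weighted covariances minus mu
        kvars.append(sill - w @ b[:n] - mu)
    return np.array(preds), np.array(kvars)

# exact interpolation at the data points, growing variance away from them
x = np.array([0.0, 1.0, 2.5, 4.0])
z = np.array([1.0, 2.0, 1.5, 0.5])
p, v = ordinary_kriging(x, z, x)
p_far, v_far = ordinary_kriging(x, z, [10.0])
```

At the data locations the estimate honors the data exactly and the kriging variance is zero; far from the data the variance grows, which is what the gray confidence bands in the figure express.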
Stochastic simulation

A statistical process involving a number of

random variables depending on a variable
parameter (which is usually time)
Exploratory Analysis of Example Data

Our example data consist of

vertically averaged porosity
values, in percent, in Zone A of
the Big Bean Field (fictitious, but
based on data from a real field).
Porosity values are available from
85 wells distributed throughout
the field, which is approximately
20 km in east-west extent and 16
km north-south. The porosities
range from 12% to 17%. Here are
the data values posted at the well locations.

Geostatistical methods are optimal
when data are :

- normally distributed and

- stationary (mean and variance do not
vary significantly in space)
Significant deviations from normality
and stationarity can cause problems, so
it is always best to begin by looking at
a histogram or
similar plot to check for normality and
a posting of the data values in space to
check for significant trends. The
posting above shows some hint of a
SW-NE trend, which we will check.

Looking at the histogram (with a normal density superimposed) and
a normal quantile-quantile plot shows that the porosity distribution
does not deviate too severely from normality.
Spatial Covariance,
Correlation and Semivariance

You have already learned that covariance

and correlation are measures of the
similarity between two different variables.
To extend these to measures of spatial
similarity, consider a scatterplot where the
data pairs represent measurements of the
same variable made some distance apart
from each other. The separation
distance is usually referred to as “lag”, as
used in time series analysis. We’ll refer to
the values plotted on the vertical axis as
the lagged variable, although the decision
as to which axis represents the lagged
values is somewhat arbitrary. Here is a
scatterplot of porosity values at wells
separated by a nominal lag of 1000 m:
Spatial Covariance, Correlation
and Semivariance (cont'd)

Because of the irregular

distribution of wells, we cannot
expect to find many pairs of data
values separated by exactly 1000
m, if we find any at all. Here we
have introduced a “lag tolerance”
of 500 m, pooling the data pairs
with separation distances between
500 and 1500 m in order to get a
reasonable number of pairs for
computing statistics. The actual
lags for the data pairs shown in the
crossplot range from 566 m to
1456 m, with a mean lag of
1129 m.
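The pooling step can be sketched as follows (brute force over all pairs; the well coordinates in metres are hypothetical):

```python
import numpy as np

def pairs_at_lag(coords, nominal_lag, tol):
    """Collect index pairs whose separation distance falls within
    nominal_lag +/- tol, together with the actual distances."""
    coords = np.asarray(coords, float)
    n = len(coords)
    idx, dist = [], []
    for i in range(n):
        for j in range(i + 1, n):
            d = float(np.linalg.norm(coords[i] - coords[j]))
            if nominal_lag - tol <= d <= nominal_lag + tol:
                idx.append((i, j))
                dist.append(d)
    return idx, np.array(dist)

# four wells on a 1000 m spaced line: nominal lag 1000 m, tolerance 500 m
wells = [(0.0, 0.0), (1000.0, 0.0), (2000.0, 0.0), (3000.0, 0.0)]
pairs, dists = pairs_at_lag(wells, 1000.0, 500.0)
```

With irregularly spaced wells the collected distances scatter across the whole tolerance band, which is why the mean actual lag (1129 m in the example) differs from the nominal lag.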
Spatial Covariance, Correlation
and Semivariance (cont'd)

The three statistics shown on the
crossplot are the covariance, correlation,
and semivariance between the porosity
values on the
horizontal axis and the lagged porosity
values on the vertical axis.
To formalize the definition of these
statistics, we need to introduce some
notation. Following standard
geostatistical practice, we’ll use:
u: vector of spatial coordinates (with
components x, y or “easting” and
“northing” for our 2D example)
z(u): variable under consideration as a
function of spatial location (porosity in
this example)
h: lag vector representing separation
between two spatial locations
z(u+h): lagged version of variable under
consideration

Sometimes z(u) will be referred to as the
“tail” variable and z(u+h) will be referred to
as the “head” variable, since we can think of
them as being located at the tail and head of
the lag vector, h. The scatterplot of tail
versus head values for a certain lag, h, is
often called an h-scattergram.

Now, with N(h) representing the number of
pairs separated by lag h (plus or minus the
lag tolerance), we can compute the statistics
for lag h as

C(h) = (1/N(h)) Σ z(u) · z(u+h) − m_tail · m_head
ρ(h) = C(h) / (σ_tail · σ_head)
γ(h) = (1/(2N(h))) Σ [z(u) − z(u+h)]²

where each sum runs over the N(h) data pairs, m_tail and
m_head are the means of the tail and head values, and
σ_tail and σ_head are their standard deviations.
The semivariance is the moment of
inertia or spread of the h-scattergram
about the 45° (1 to 1) line shown on
the plot.
Covariance and correlation are both
measures of the similarity of
the head and tail values. Semivariance
is a measure of the dissimilarity.
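A sketch of the three lag statistics, computed from the pooled tail and head values of one lag class (the sample porosity values below are made up for illustration):

```python
import numpy as np

def lag_statistics(tail, head):
    """Covariance, correlation and semivariance between the tail
    values z(u) and the head values z(u+h) of one lag class."""
    tail = np.asarray(tail, float)
    head = np.asarray(head, float)
    cov = np.mean(tail * head) - tail.mean() * head.mean()
    corr = cov / (tail.std() * head.std())
    semivar = np.mean((tail - head) ** 2) / 2.0
    return cov, corr, semivar

# hypothetical porosity pairs at one lag class, in percent
tail = np.array([14.2, 15.1, 13.8, 16.0])
head = np.array([14.5, 15.3, 14.0, 15.6])
cov, corr, semivar = lag_statistics(tail, head)
```

At lag zero, head equals tail, so the semivariance (dissimilarity) is 0, the correlation (similarity) is 1, and the covariance equals the variance.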
Here are the h-scatterplots for
nominal lags of 2000 m and
3000 m.
Note that the covariance and
correlation decrease and the
semivariance increases with
increasing separation distance.
The plot above shows all three
statistics versus actual mean lag for
the contributing data pairs at each
lag. The shortest lag shown (the
nominally “zero” lag) includes six
data pairs with a mean lag of 351 m.
The correlation versus lag is referred
to as the correlogram and the
semivariance versus lag is the
semivariogram. The covariance
versus lag is generally just referred
to as the covariance function.
The empirical functions that we have plotted – computed
from the sample data – are of course just estimators of the
theoretical functions C(h), ρ(h), and γ(h), which can be
thought of as population parameters. Estimating these
functions based on irregularly distributed data (the usual
case) can be very tricky due to the need to pool data pairs
into lag bins.

Larger lag spacings and tolerances allow more data pairs for
estimation but reduce the amount of detail in the
semivariogram (or covariance or correlogram). The problem
is particularly difficult for the shorter lags, which tend to
have very few pairs (six in this example). This is
unfortunate, since the behavior near the origin is the most
important to characterize.
Under the condition of second-order
stationarity (spatially constant mean and
variance), the covariance function,
correlogram, and semivariogram obey the
following relationships:

C(0) = Cov(Z(u), Z(u)) = Var(Z(u))

ρ(h) = C(h) / C(0)
γ(h) = C(0) − C(h)

In words, the lag-zero covariance should be equal to the global variance of

the variable under consideration, the correlogram should look like the
covariance function scaled by the variance, and the semivariogram should
look like the covariance function turned upside down:
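These relationships can be checked numerically for any valid covariance model; the exponential model below is an illustrative choice:

```python
import numpy as np

# C(h) = sill * exp(-3h/a): an illustrative covariance model
sill, a = 2.0, 10.0
h = np.linspace(0.0, 30.0, 7)
C = sill * np.exp(-3.0 * h / a)

rho = C / C[0]      # correlogram: covariance scaled by the variance C(0)
gamma = C[0] - C    # semivariogram: covariance turned upside down
```

As expected, the correlogram starts at 1, the semivariogram starts at 0 and rises monotonically toward the variance C(0).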

γ(h) = (1/(2N)) Σ [Z(x) − Z(x+h)]²

where Z(x) is the data value at location x
and N is the number of pairs of data points
separated by lag distance h.
The sill is the amount of semivariance
achieved at the plateau of the curve, and is
equivalent to the variance of the data. The
range is the lag distance at which data are no
longer correlated. Data within the range are
correlated and can be used for making
predictions. These two values can be
calculated by fitting a model to the
semivariogram. Different models will yield
different values for the sill and range. The
nugget is the semivariance at h = 0, and is a
measure of the inherent variability in the data
or the noise of the data.
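Fitting a model to obtain the nugget, sill, and range can be sketched as follows, using an exponential model and a simple grid search over the range; the "empirical" semivariogram values are synthetic:

```python
import numpy as np

def fit_exponential_model(lags, gammas, candidate_ranges):
    """Fit gamma(h) = nugget + c*(1 - exp(-3h/a)) by grid search over
    the practical range a; for each candidate a, the nugget and the
    partial sill c are found by linear least squares."""
    lags = np.asarray(lags, float)
    gammas = np.asarray(gammas, float)
    best = None
    for a in candidate_ranges:
        basis = 1.0 - np.exp(-3.0 * lags / a)
        A = np.column_stack([np.ones_like(lags), basis])
        coef, _, _, _ = np.linalg.lstsq(A, gammas, rcond=None)
        sse = float(np.sum((A @ coef - gammas) ** 2))
        if best is None or sse < best[0]:
            best = (sse, coef[0], coef[1], a)
    _, nugget, psill, a = best
    return nugget, psill, a  # nugget, partial sill, practical range

# synthetic empirical semivariogram generated from a known model
lags = np.arange(1.0, 21.0)
gammas = 0.5 + 2.0 * (1.0 - np.exp(-3.0 * lags / 10.0))
nugget, psill, a = fit_exponential_model(lags, gammas, [5.0, 10.0, 15.0, 20.0])
```

As the text notes, different models would yield different sill and range values; here the fit recovers the parameters of the generating model.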
The range and sill
When you look at the model of a
semivariogram, you'll notice that
at a certain distance, the model
levels out. The distance where the
model first flattens out is known
as the range. Sample locations
separated by distances closer than
the range are spatially
autocorrelated, whereas locations
farther apart than the range are not.

The value that the semivariogram
model attains at the range (the
value on the y-axis) is called the
sill. The partial sill is the sill
minus the nugget.
The nugget

Theoretically, at zero separation distance
(lag = 0), the semivariogram value is 0.
However, at an infinitesimally small
separation distance, the semivariogram
often exhibits a nugget effect, which is
some value greater than 0. For example, if
the semivariogram model intercepts the y-
axis at 2, then the nugget is 2.
The nugget effect can be attributed to
measurement errors or spatial sources of
variation at distances smaller than the
sampling interval or both. Measurement
error occurs because of the error inherent
in measuring devices. Natural phenomena
can vary spatially over a range of scales.
Variation at microscales smaller than the
sampling distances will appear as part of
the nugget effect. Before collecting data, it
is important to gain some understanding of
the scales of spatial variation.
Modeling the Semivariogram
Using h to represent lag distance, a to represent (practical) range,
and c to represent sill, the five most frequently used models are:
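Three of the standard models, with h the lag distance, a the (practical) range, and c the sill, can be sketched as:

```python
import numpy as np

def spherical(h, a, c):
    """Spherical model: rises as a cubic polynomial and reaches the
    sill c exactly at the range a."""
    h = np.asarray(h, float)
    g = c * (1.5 * h / a - 0.5 * (h / a) ** 3)
    return np.where(h < a, g, c)

def exponential(h, a, c):
    """Exponential model: approaches the sill asymptotically; a is the
    practical range, where gamma reaches about 95% of c."""
    h = np.asarray(h, float)
    return c * (1.0 - np.exp(-3.0 * h / a))

def gaussian(h, a, c):
    """Gaussian model: parabolic near the origin, very smooth."""
    h = np.asarray(h, float)
    return c * (1.0 - np.exp(-3.0 * (h / a) ** 2))
```

All three start at zero for h = 0 (any nugget would be added as a constant term) and level off at the sill c near the range a.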
Above are semi-variograms of effective porosity and permeability logs
produced with a core calibrated multi-mineral petrophysical analysis.

Semi-variogram modeling is usually done before spatial modeling.

Short ranges usually imply a high degree of heterogeneity, while long
ranges tend to imply larger structures and less heterogeneity.
Using the vertical semi-variogram one can obtain a rough idea of what
the horizontal semi-variogram may be based on ranges for different
lithological and depositional facies.

Having more wells in the near vicinity would help establish the true
horizontal semi-variogram, but a wildcat well may be the only one
within miles, so the modeling geologist must be able to make a rough
estimate of the ratio of vertical to horizontal ranges.
Stochastic or geostatistical
modeling is a method of
determining heterogeneity and
uncertainty in a spatial
distribution such as a petroleum
reservoir. Before drilling,
optimum placement of the well is
key to maximize profits while
minimizing uncertainty. Multiple
realizations give many "what if"
type scenarios providing a general
overview of the inherent
uncertainty inevitable with sparse
well control.
A deterministic model of a population of bacteria considers
continuous concentrations of molecules. However, in a single bacterium, the
quantity of each protein is of the order of 100, so the concentrations
take discrete values. These values depend on events (production,
degradation) which are hard to predict, and must therefore be approached in
terms of probabilities of occurrence (propensities).


A stochastic model incorporates random variations to predict future

conditions and to see what they might be like.
To introduce that randomness we use a new function: the propensity.
As an example, consider a system with 4 possible reactions. Each
reaction has a probability of happening in the next time step.

We randomly choose the next reaction
according to the propensities.
When we run a stochastic simulation once, we get a trajectory. This graph represents
the random evolution of the variables. Because of the randomness, if we run the
simulation again we will get a different trajectory. That is why, to be able to
interpret the results, we have to run it hundreds or thousands of times.

Instead of describing a process which can only evolve in one way, in a stochastic
or random process there is some indeterminacy: even if the initial condition is
known, there are several directions in which the process may evolve.
Traditional continuous and deterministic biochemical rate equations do not
accurately predict cellular reactions since they rely on bulk reactions that
require the interactions of millions of molecules. In contrast, the Gillespie
algorithm allows a discrete and stochastic simulation of a system with few
reactants because every molecule is explicitly simulated. When simulated, a
Gillespie realization represents a random walk of the entire system.
We will use the following notation:
each reaction Rj is characterized mathematically by two
quantities: its state-change vector vj (how the molecule
counts change when Rj fires) and its propensity function
aj(x) (the probability per unit time that Rj fires in state x).
Once we have presented the theory behind the stochastic
approach, let us have a look at the algorithm.
The algorithm comprises 5 steps.
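A minimal sketch of the Gillespie algorithm, with its steps marked as comments. The birth-death system (constant production, first-order degradation) is an illustrative assumption, not a system from the example above:

```python
import numpy as np

def gillespie_birth_death(k_prod, k_deg, x0, t_max, rng):
    """Gillespie SSA for a minimal birth-death system:
    R1: 0 -> X with propensity a1 = k_prod
    R2: X -> 0 with propensity a2 = k_deg * x"""
    # Step 1: initialize time and state
    t, x = 0.0, x0
    times, counts = [t], [x]
    while t < t_max:
        # Step 2: evaluate the propensities in the current state
        a1, a2 = k_prod, k_deg * x
        a0 = a1 + a2
        if a0 == 0.0:
            break
        # Step 3: draw the waiting time to the next reaction
        # (exponentially distributed with rate a0)
        t += rng.exponential(1.0 / a0)
        # Step 4: choose which reaction fires, with probability
        # proportional to its propensity
        if rng.uniform() * a0 < a1:
            x += 1   # production
        else:
            x -= 1   # degradation
        # Step 5: record the new state and repeat
        times.append(t)
        counts.append(x)
    return np.array(times), np.array(counts)

rng = np.random.default_rng(0)
times, counts = gillespie_birth_death(10.0, 1.0, 0, 200.0, rng)
```

Each run is one random walk of the system; for this model the molecule count fluctuates around the steady-state mean k_prod / k_deg.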

Simulation is broadly defined as the process of

replicating reality using a model. In
geostatistics, simulation is the realization of a
random function (surface) that has the same
statistical features as the sample data used to
generate it (measured by the mean, variance,
and semivariogram). Gaussian geostatistical
simulation (GGS), more specifically, is suitable
for continuous data and assumes that the data,
or a transformation of the data, has a normal
(Gaussian) distribution. The main assumption
behind GGS is that the data is stationary—the
mean, variance, and spatial structure
(semivariogram) do not change over the spatial
domain of the data.
Another key assumption of GGS is that the
random function being modeled is a
multivariate Gaussian random function.
GGS offers an advantage over kriging.
Because kriging is based on a local average
of the data, it produces smoothed output.
GGS, on the other hand, produces better
representations of the local variability
because it adds the local variability that is
lost in kriging back into the surfaces it
generates. The variability that GGS
realizations add to the predicted value at a
particular location has a mean of zero, so that
the average of many GGS realizations tends
toward the kriging prediction. This concept
is illustrated in the figure below. Different
realizations are represented as a stack of
output layers, and the distribution of values
at a particular coordinate is Gaussian, with a
mean equal to the kriged estimate for that
location and a spread that is given by the
kriging variance at that location.
There are many applications in
which spatially dependent
variables are used as input for
models (for example, flow
simulation in petroleum
engineering). In these cases,
uncertainty in the model's results
is evaluated by producing a
number of simulations using the
following procedure:
1. A large number of equally
probable realizations are
simulated for the variable.
2. The model (generally termed
a transfer function) is run using
the simulated variable as input.
3. The model runs are
summarized to evaluate
variability in the model's output.
Realization of effective porosity produced with the sequential Gaussian
simulation (sGs) algorithm and the semi-variograms shown above.
Realization of permeability produced with the sequential Gaussian
simulation (sGs) algorithm and the semi-variograms shown above.
How many realizations should be generated?

Results from simulation studies should

not depend on the number of
realizations that were generated. One
way to determine how many
realizations to generate is to compare
the statistics for different numbers of
realizations in a small portion of the
data domain (a subset is used to save
time). The statistics tend toward a
fixed value as the number of
realizations increases. The statistics
examined in the example below are the
first and third quartiles, which were
calculated for a small region (subset)
of simulated elevation surfaces (in feet
above sea level) for the state of
Wisconsin, USA.
The top graph shows fluctuations in
elevation for the first 100 realizations.
The lower graph shows results for
1,000 realizations.
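The convergence check described above can be sketched as follows. The summary statistics per realization are hypothetical draws, not the Wisconsin elevation data:

```python
import numpy as np

# stand-in for one summary statistic per realization: draw 1,000
# hypothetical values and watch the first quartile settle down as
# more realizations are included
rng = np.random.default_rng(7)
stat_per_realization = rng.normal(100.0, 10.0, size=1000)
q1 = [np.percentile(stat_per_realization[:n], 25) for n in (10, 100, 1000)]
```

Early estimates of the quartile fluctuate; as the number of realizations grows, the estimate tends toward a fixed value, which is the signal that enough realizations have been generated.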
Some Geostatistics Textbooks
C.V. Deutsch, 2002, Geostatistical Reservoir Modeling, Oxford University Press, 376
pages.
o Focuses specifically on modeling of facies, porosity, and permeability for
reservoir simulation.
C.V. Deutsch and A.G. Journel, 1998, GSLIB: Geostatistical Software Library and
User's Guide, Second Edition, Oxford University Press, 369 pages.
o Owner's manual for the GSLIB software library; serves as a standard
reference for concepts and terminology.
P. Goovaerts, 1997, Geostatistics for Natural Resources Evaluation, Oxford
University Press, 483 pages.
o A nice introduction with examples focused on an environmental chemistry
dataset; includes more advanced topics like factorial kriging.
E.H. Isaaks and R.M. Srivastava, 1989, An Introduction to Applied Geostatistics,
Oxford University Press, 561 pages.
o Probably the best introductory geostatistics textbook; intuitive
development of concepts from first principles with clear examples at every
step.
P.K. Kitanidis, 1997, Introduction to Geostatistics: Applications in Hydrogeology,
Cambridge University Press, 249 pages.
o A somewhat different take, with a focus on generalized covariance
functions; includes discussion of geostatistical inversion of (groundwater)
flow models.
M. Kelkar and G. Perez, 2002, Applied Geostatistics for Reservoir Characterization,
Society of Petroleum Engineers Inc., 264 pages.
o Covers much the same territory as Deutsch's 2002 book; jam-packed with
figures illustrating concepts.
R.A. Olea, 1999, Geostatistics for Engineers and Earth Scientists, Kluwer Academic
Publishers, 303 pages.
o Step by step mathematical development of key concepts, with clearly
documented numerical examples.