function of the other registered variables was to use a regional approach. Then one expression was fitted for every single location. Most of the research centres that work on climate change modelling do not provide estimations of precipitation, but they do offer predictions for other variables such as temperature, atmospheric pressure and geopotential height, which are easier to predict. Taking into consideration the variables that are available for future scenarios, a multiple regression approach was used to predict precipitation for every station.
Keywords: Principal Component Analysis, multiple regression, precipitation estimation, climate change scenario, Mediterranean areas.
In order to get familiar with the PCA technique, this project uses data registered by 19 weather stations in Spain to find the relationship among the registered variables that provides a good estimator of the rainfall. If a common relationship between the variables is observed in all the stations, it will be used to characterize the precipitation over Spain. However, due to the large climatic differences between the regions of the country, each meteorological variable is expected to play a different role in each climatic area.

1.1 Available data and objectives:

Thanks to the Spanish Meteorological Agency (AEMET), monthly data registered in 19 weather stations located in different provinces is available for this project. In most locations the records run from January 1920 until August 2012. Some of the variables registered at the weather stations are: month, year, average temperature, maximum temperature, minimum temperature, total precipitation, maximum daily precipitation, rainy days, snowy days, hail days, atmospheric pressure and average insolation. Given the large number of variables, PCA becomes necessary to identify which of them are closely related to the rainfall and to find a way of predicting the monthly rainfall from a combination of the variables given here. With PCA it is possible not only to identify which variables carry the most weight, but also to obtain new variables that are uncorrelated with each other. The ones that explain the highest % of the variance will then be chosen to predict the monthly rainfall. PCA also reduces the complexity of the problem, because a smaller number of variables is involved in the estimation of the rainfall.
1.2 Procedure:

The PCA will be performed for every weather station as well as for the whole sample. The first goal is to identify whether the vectors of the PCA basis are the same (or involve the same variables in the same way) for each station. According to the results, it will then be interesting to find out whether the same PCA basis can be used to understand the problem all over Spain or whether, on the other hand, there are climatic regions inside the country that follow different patterns (some variables may be strongly correlated with the rainfall in some areas but not in others).
2 Method of analysis

2.1 Principal Component Analysis (PCA)
As mentioned before, the statistical technique chosen to analyze the available data set is Principal Component Analysis (PCA). PCA is the most common form of factor analysis. With this technique it is possible to obtain new variables (also known as dimensions or principal components) that are linear combinations of the original variables registered by each weather station. By definition of PCA, these new variables are uncorrelated with one another, and they capture as much of the original variance in the data as possible. As an output of the PCA, the first principal component shows the direction of greatest variability (variance) in the data, and the second principal component is the next orthogonal (uncorrelated) direction of greatest variability. This procedure is repeated until all the principal components are found.
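The definition above can be illustrated with a minimal Python sketch (it is not the R workflow used in this study). It builds the correlation matrix of three variables and extracts the first principal component by power iteration. The monthly values and all function names are hypothetical and serve only as an illustration.

```python
import math

def standardize(columns):
    """Center each variable and scale it to unit (population) variance."""
    result = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / n)
        result.append([(x - mean) / sd for x in col])
    return result

def correlation_matrix(columns):
    """Correlation matrix of the variables (one list of observations per variable)."""
    z = standardize(columns)
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]

def first_component(matrix, iterations=500):
    """Power iteration: dominant eigenvalue and unit eigenvector of a symmetric matrix."""
    p = len(matrix)
    # Non-uniform start, so it is unlikely to be orthogonal to the dominant direction.
    v = [float(i + 1) for i in range(p)]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(a * b for a, b in zip(v, mv))  # Rayleigh quotient
    return eigenvalue, v

# Hypothetical monthly values for one station (illustration only, not AEMET data).
temperature = [9.0, 10.5, 12.0, 14.0, 17.5, 21.0, 24.0, 24.5, 21.5, 17.5, 13.0, 10.0]
rainy_days = [5, 4, 5, 6, 5, 4, 2, 4, 5, 6, 5, 5]
precipitation = [40, 30, 42, 50, 48, 30, 20, 60, 80, 90, 60, 45]

corr = correlation_matrix([temperature, rainy_days, precipitation])
eigenvalue, component = first_component(corr)
print("share of variance explained by the first component:", eigenvalue / len(corr))
print("weights of the linear combination:", component)
```

The returned vector holds the weights of the linear combination that defines the first component, and the eigenvalue divided by the number of variables is the share of variance that this component captures.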
The number of principal components that can be obtained is equal to the number of variables in the data set (in this study this number is 10 because, apart from the atmospheric variables registered, the month to which the data corresponds is also included). However, PCA is not only a way of transforming a set of correlated variables into a set of uncorrelated ones, but also a method to reduce the number of parameters of the problem. In other words, if the initial set of variables is highly correlated, it is possible to work with only the first few principal components, because a small number of variables is able to explain almost the same variability of the data. The possibility of reducing the number of dimensions of the problem is one of the most widely used properties of the PCA technique.
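Why high correlation makes dimension reduction possible can be seen in the simplest case of two standardized variables, whose correlation matrix [[1, r], [r, 1]] has eigenvalues 1 + |r| and 1 - |r|. The short Python sketch below evaluates the share of variance captured by the first component for a few values of r; the function name is hypothetical and the values of r are chosen only for illustration.

```python
def explained_by_first_component(r):
    """Share of total variance captured by the first principal component
    of two standardized variables whose correlation is r."""
    # Eigenvalues of [[1, r], [r, 1]] are 1 + |r| and 1 - |r|.
    largest = 1.0 + abs(r)
    # Two standardized variables have a total variance of 2.
    return largest / 2.0

for r in (0.0, 0.2, 0.79, 0.95):
    share = 100 * explained_by_first_component(r)
    print(f"r = {r:4.2f} -> first component explains {share:.1f} % of the variance")
```

With uncorrelated variables the first component explains only half of the variance, so nothing is gained by dropping the second one; the closer |r| is to 1, the more of the variance a single component retains.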
It is not easy to decide how many principal components should be used in the analysis. Depending on the needs of the research, it may be acceptable to lose some information in order to work with a smaller number of variables. Understanding the physical meaning of the principal components, although it can be very interesting, is not always possible. For this reason, when the problem cannot be reduced to a small number of variables, it can be better to work with the initial set of variables: even though the complexity of the problem is greater, the result can be easier to understand if all the variables have a clear physical meaning.
2.2 Application of PCA to the data set:

In this study the PCA technique is applied to the data registered in 19 different meteorological stations located in the main cities of Spain. Unlike the usual situations where PCA is used, the variables studied here do not have a high correlation with each other, as can be seen in Table 1. It may therefore seem that applying PCA in this situation is the wrong decision. However, another interesting feature of PCA is that it also helps to identify insights and hidden relations between the variables that cannot be seen just by performing a simple analysis of the registered values.
                     Atm.P   Av.T    Frost   Height  Insol.  Max.DR  Month   Rainy   Tot.P   W.dir   W.vel
Atmos. Pressure       1.00   -0.43   0.338   -0.03   -0.03   -0.13   -0.12   -0.2    -0.2    -0.05   -0.02
Av. Temp             -0.43    1.00  -0.59    -0.3    -0.08   -0.14   -0.24   -0.38   -0.24    0.04   -0.05
Frost                 0.338  -0.59   1.00     0.33    0.12   -0.1    -0.13   -0.03   -0.08   -0.04   -0.13
Height               -0.03   -0.3    0.33     1.00    0.45   -0.13    …       0      -0.11   -0.11   -0.38
Insolation %         -0.03   -0.08   0.12     0.45    1.00   -0.06    0      -0.04   -0.06   -0.03   -0.53
Max. Daily Rainfall  -0.13   -0.14  -0.1     -0.13   -0.06    1.00    0.1     0.46    0.79   -0.03    0.16
Month                -0.12   -0.24  -0.13     …       0       0.1     1.00    0.01    0.09   -0.01   -0.05
Rainy Days           -0.2    -0.38  -0.03     0      -0.04    0.46    0.01    1.00    0.68    0.01    0.107
Total Precip.        -0.2    -0.24  -0.08    -0.11   -0.06    0.79    0.09    0.68    1.00   -0.04    0.15
Wind dir             -0.05    0.04  -0.04    -0.11   -0.03   -0.03   -0.01    0.01   -0.04    1.00    0.01
Wind vel             -0.02   -0.05  -0.13    -0.38   -0.53    0.16   -0.05    0.107   0.15    0.01    1.00
Table 1: Correlation matrix of the variables registered in the weather station in Barcelona.
Using the R software and the RCommander package, the results of the PCA for the data registered in Barcelona are shown in the following figure and table:
Figure 1: Scree plot showing the % of explained variance.
Component                     1      2      3      4      5      6      7      8      9      10
Eigenvalue                 2.99   1.83   1.11   0.96   0.92   0.77   0.67   0.35   0.29   0.11
% of variance             29.90  18.35  11.10   9.61   9.20   7.67   6.73   3.52   2.90   1.02
Cumulative % of variance  29.90  48.25  59.35  68.96  78.16  85.83  92.56  96.08  98.98  100
Table 2: Eigenvalues and variance explained by every principal component.
As Figure 1 shows, there are 10 principal components that together represent the whole variance contained in the Barcelona data set. If the main objective of the research were to reduce the complexity of the problem, it would be possible to use the first 6 principal components and reproduce around 86 % of the variability, as can be seen in Table 2. It is also interesting to analyze the contribution of each registered variable to the principal components obtained. Figure 2 shows the role that each variable plays in the first 2 principal components (these two components alone represent almost 50 % of the variance of the data set).
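The percentages quoted above follow directly from the eigenvalues in Table 2. As a minimal Python sketch, the cumulative share of variance can be recomputed and the number of components needed for a given threshold read off; the function names and the 85 % threshold are assumptions made for illustration, not a criterion used in this study.

```python
# Eigenvalues of the 10 principal components for Barcelona, copied from Table 2.
eigenvalues = [2.99, 1.83, 1.11, 0.96, 0.92, 0.77, 0.67, 0.35, 0.29, 0.11]

def cumulative_share(values):
    """Cumulative fraction of the total variance explained by the leading components."""
    total = sum(values)
    running = 0.0
    shares = []
    for value in values:
        running += value
        shares.append(running / total)
    return shares

def components_needed(values, threshold):
    """Smallest number of leading components whose cumulative share reaches the threshold."""
    for k, share in enumerate(cumulative_share(values), start=1):
        if share >= threshold:
            return k
    return len(values)

for k, share in enumerate(cumulative_share(eigenvalues), start=1):
    print(f"first {k:2d} components: {100 * share:5.1f} % of the variance")

# Hypothetical threshold of 85 % explained variance.
print("components needed for 85 %:", components_needed(eigenvalues, 0.85))
```

The threshold is a modelling choice: a lower value keeps fewer components at the cost of discarding more of the original variability.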
Figure 2: Variables factor map for the first 2 principal components for the data registered in Barcelona.
It is difficult to draw any conclusion from Figure 2, so guessing the physical meaning of the principal components is not possible, or at least not trivial, in this case. Nevertheless, it is important to keep in mind that the main objective of this analysis is to find the variables that have a stronger relation with precipitation so as to be able to estimate it for future scenarios. The same PCA is performed for the remaining 18 stations and similar results are obtained. It is important to point out that for all the stations the principal components explain practically the same % of variance, and the loadings of the registered variables are also practically the same. Another example can be seen in Figure 3 for the variables registered in Guadalajara.
Figure 2: Variables factor map for the first 2 principal components for the data registered in Guadalajara.
From a visual analysis of these variables factor maps alone, one may be tempted to think that the data set does not depend on the station where it was registered, so that all the meteorological variables play a similar role around the country.

2.3 Discussion of the results obtained after performing the PCA:

After performing the PCA for each meteorological station, a decision had to be taken between finding a single relation to estimate the precipitation for the whole country and finding an individual relation for each station. The first idea was to find a single relationship, but after analyzing together the contribution of each variable to the first components some interesting points were discovered.
Figure 3: Contribution of each of the registered variables to the first principal component for each station.
As can be seen in Figure 3, there is a main trend that most of the stations follow. However, there are some singularities in some of the variables that may play a key role when trying to estimate the precipitation. Looking at the average temperature, 3 different behaviors can be seen: in most of the cities the coefficient is negative; it gets close to zero for the stations in Barcelona, Girona, Tarragona and Castellón (4 cities along the Mediterranean coast, in the north-east of Spain); and it becomes positive in Zaragoza and Palencia (Zaragoza is a city 300 km to the west of Barcelona, and Palencia is another city that is close to the border with Portugal). Figure 4 shows a climatic map of Spain where the position of these cities is shown. As can be seen, the cities distributed along the Mediterranean coast, as well as Zaragoza and Palencia, have particular climatic conditions that make them different from the rest of the cities used in this study, whose climate could be defined as continental. More differences can be seen when taking a detailed look at the other variables. For most of the variables, the cities mentioned before show coefficients different from those of the majority of the cities. However, there are some variables, like wind velocity, where cities such as Cadiz can also be distinguished from the main trend. It is interesting to note that Cadiz is very close to the Strait of Gibraltar and is one of the most important places in Europe to practice surf. It seems, then, that the singularities of every city can also be observed by taking a detailed look at the PCA results.
Something similar happens with Jaen if one focuses on the insolation %. Jaen is one of the Spanish provinces where most of the olive oil is produced, and the quality of the olives is closely related to the insolation they receive. Even though these conclusions are just qualitative and seem hard to prove, it is very interesting to see how the PCA method points out insights that would be very difficult to see from a simple observation of the registered values.
Figure 4: Climate in Spain and position of the cities where the coefficient of average temperature in the first principal component is different from the one obtained for most of the stations.
As a conclusion of the PCA for every station, it seems logical that the estimation of the precipitation will perform better if one works at a regional scale rather than trying to deal with the problem for the whole country.
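The station-by-station comparison discussed in this section can be sketched in Python as follows. Because an eigenvector is only defined up to its sign, the loadings of each station are first aligned with a reference station, and the stations are then grouped by the sign of the coefficient of one variable. The station labels, the loadings and the tolerance of 0.1 are hypothetical; they are not the values behind Figure 3.

```python
def align_sign(loading, reference):
    """Flip a loading vector if it points against the reference vector."""
    dot = sum(a * b for a, b in zip(loading, reference))
    return [-x for x in loading] if dot < 0 else list(loading)

def classify(coefficient, tolerance=0.1):
    """Label a coefficient as negative, near zero or positive (hypothetical tolerance)."""
    if coefficient > tolerance:
        return "positive"
    if coefficient < -tolerance:
        return "negative"
    return "near zero"

variables = ["av_temp", "rainy_days", "total_precip"]

# Hypothetical first-component loadings per station (illustration only).
loadings = {
    "station_A": [-0.45, 0.60, 0.66],
    "station_B": [0.44, -0.61, -0.66],  # same pattern as station_A with the sign flipped
    "station_C": [0.03, 0.68, 0.73],
    "station_D": [0.35, 0.62, 0.70],
}

reference = loadings["station_A"]
index = variables.index("av_temp")
groups = {}
for station, vector in loadings.items():
    aligned = align_sign(vector, reference)
    groups.setdefault(classify(aligned[index]), []).append(station)

for label, stations in groups.items():
    print(f"av_temp coefficient {label}: {', '.join(stations)}")
```

Without the alignment step, station_A and station_B would appear to behave in opposite ways even though they describe the same direction in the data.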
3 Analysis and Results

Thanks to the PCA it was possible to decide that the best way to predict the precipitation as a function of the other registered variables was to use a regional approach. Then one expression was fitted for every single location. Most of the research centres that work on climate change modelling are not able to provide reliable estimations of rainfall because rainfall is a very random phenomenon. However, these institutions provide estimations of other variables, such as temperature, atmospheric pressure and geopotential height, that are easier to predict. Taking into consideration the variables that are available for future scenarios, the following multiple regression approach was used to predict precipitation for every station:

Even though the approach suggested here to obtain an estimate of the precipitation is very simple (according to the correlation matrix shown in the previous section) and is not able to offer a good prediction of precipitation, an interesting result can be seen in Figure 5.
Figure 5: Correlation obtained after fitting the multivariate regression model suggested to estimate precipitation.
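A per-station multiple regression of this kind can be sketched in Python as below. This is only an illustration under stated assumptions: it is an ordinary least squares fit on three predictors (temperature, atmospheric pressure and geopotential height, expressed as departures from a long-term mean), which is not necessarily the exact expression used in this study. All numbers and function names are hypothetical.

```python
import math

def solve_linear_system(a, b):
    """Gaussian elimination with partial pivoting for a small dense system a x = b."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_multiple_regression(predictors, response):
    """Least squares via the normal equations; returns [intercept, b1, b2, ...]."""
    rows = [[1.0] + list(p) for p in predictors]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, response)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def predict(coefficients, p):
    """Evaluate the fitted linear expression for one set of predictor values."""
    return coefficients[0] + sum(c * x for c, x in zip(coefficients[1:], p))

def correlation(xs, ys):
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical monthly anomalies: (temperature in C, pressure in hPa, geopotential height in m).
predictors = [(-5.0, 4.0, -30.0), (-3.0, -2.0, -10.0), (0.0, 1.0, 5.0), (2.0, -5.0, 20.0),
              (4.0, 3.0, 35.0), (6.0, -1.0, 40.0), (1.0, 6.0, -15.0), (-2.0, -6.0, 0.0)]
# Hypothetical monthly precipitation in mm for the same months.
precipitation = [62.0, 71.0, 48.0, 55.0, 30.0, 35.0, 38.0, 75.0]

coefficients = fit_multiple_regression(predictors, precipitation)
fitted = [predict(coefficients, p) for p in predictors]
r = correlation(fitted, precipitation)
print("coefficients:", coefficients)
print("correlation between fitted and observed precipitation:", r)
```

The correlation between fitted and observed values is the kind of per-station score that allows stations to be compared with each other; using anomalies rather than raw values keeps the normal equations well conditioned in this small example.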
It is interesting to see that the stations closer to the Mediterranean are not able to predict precipitation at all. It is clear, then, that at least in these areas the problem should be approached using different variables. The areas affected by the Atlantic and continental climates offer a better result, even though the model suggested here is not valid there either. Nevertheless, this result is very logical: the rainfall events that take place in the Atlantic areas are easier to predict because they originate mainly from oceanic storms, which are large-scale phenomena that always develop under certain atmospheric pressure and wind conditions. The Mediterranean climate, however, is known to suffer from convective rainfall events that happen at a smaller scale and are much more difficult to predict.
4 Conclusions

As described in the previous sections, PCA has proved to be a useful technique to study the problem posed here. In this case PCA has not been used to reduce the dimension of the problem, because the variables of the data set showed a very low correlation with each other. However, PCA has been useful to discover relations between the variables and the location that could not have been directly observed. It is very important to analyze the output of the PCA from different points of view.

Thanks to the output of the PCA, it has been possible to identify the most suitable way of dealing with the problem: using a simple approach to estimate precipitation at each station instead of doing it for the whole country. The linear model used to estimate the precipitation does not offer a good result, but it can be helpful as a first approach for further research. Even though the linear regression model offers a very low performance, another interesting and consistent result was that precipitation in the Mediterranean areas is harder to predict, while in Atlantic climate areas the correlation between the atmospheric variables is closer.