
A Sales Forecast Model for the German Automobile Market
Based on Time Series Analysis and Data Mining Methods
Abstract
In this contribution, various sales forecast models for the German
automobile market are developed and tested. Our most important criteria for
the assessment of these models are the quality of the prediction as well as an
easy explicability. Yearly, quarterly and monthly data for newly registered
automobiles from 1992 to 2007 serve as the basis for the tests of these models.
The time series model used consists of additive components: trend, seasonal,
calendar and error component. The three latter components are estimated univariately
while the trend component is estimated multivariately by Multiple
Linear Regression as well as by a Support Vector Machine. Possible influences
which are considered include macro-economic and market-specific factors.
These influences are analysed by a feature selection. We found the
non-linear model to be superior. Furthermore, the quarterly data provided the
most accurate results.
Keywords: Sales Forecast, Time Series Analysis, Data Mining, Automobile
Industry.
1 Introduction
Successful corporate management depends on efficient strategic and operative
planning.
Errors in planning often lead to enormous costs and in some cases also to a loss
of reputation. Reliable forecasts make an important contribution to efficient planning.
As the automobile industry is one of the most important sectors of the German economy,
its development is of utmost interest.
The introduction and the development of mathematical algorithms, combined with
the utilization of computers, have increased the reliability of forecasts enormously.
Enhanced methods, e.g. Data Mining, and advanced technology allowing the storage

and evaluation of large empirical data sets generate the means of producing more
reliable forecasts than ever before. At the same time, the methods have become more
complex. However, the explicability of a forecast model is as important as its reliability. Therefore, the main objective of our work is to present a model for sales forecasts
which is highly accurate and at the same time easily explicable.
Although Lewandowski [1] [2] investigated sales forecast problems in general, and those of the automobile industry in particular, in the 1970s, few studies have focused on forecasts for the German automobile industry thereafter. Recent publications have only been presented by Dudenhöffer and Borscheid [3] [4], who applied time series methods to their forecasts. Another approach has been chosen by Bäck
et al. [5], who used evolutionary algorithms for their forecasts.
The present contribution pursues new routes by including Data Mining methods.
But there are also differences in the time series methods which are applied. A detailed
analysis of the past is needed for a reliable forecast of the future. Because of that, one
focus of our work is the broad collection and analysis of relevant data. Another focus
is the construction and testing of the model used. The database of our models consists of the main time series (registrations of new automobiles) and the secondary time
series, also called exogenous parameters, which should influence the trend of our
main time series. To eliminate parameters with insignificant influence on the main
time series, a feature selection method is used. These tasks are solved for yearly,
monthly and quarterly data. Then the results are analyzed and compared in order to
answer the following main questions of this contribution:
1. Is it possible to create a model which is easy to explain and which at the
same time provides reliable forecasts?
2. Which exogenous parameters influence the sales market of the German
automobile industry?
3. Which collection of data points, yearly, monthly or quarterly data, is the
most suitable one?

2 Data
The main time series comprises the number of registrations of new automobiles in
Germany for every time period. Hence, the market sales are represented by the number of registrations of new automobiles, provided by the Federal Motor Transport
Authority.
The automobile market in Germany grew extraordinarily as a result of the reunification of
the two German states in 1990. This can only be treated as a massive shock event
which caused all data prior to 1992 to be discarded. Therefore, we use the yearly,
monthly and quarterly registrations of the years 1992 to 2007. The sales figures of
these data are shown in Figures 1-3 and the seasonal pattern of these time series is
clearly recognizable in the last two figures.

Our choice of the exogenous parameters fits the reference model for the automobile
market given by Lewandowski [2]. In this model the following properties are considered:
a) Variables of the global (national) economy
b) Specific variables of the automobile market
c) Variables of the consumer behavior w.r.t. the changing economic cycle
d) Variables that characterize the influences of credit restrictions or other fiscal
measures concerning the demand behaviour in the automobile industry.
Based on this model, the following ten market-influencing factors, shown in Table 1,
are chosen as exogenous parameters [6].

Table 1 shows that not all exogenous parameters used are published on a
monthly, quarterly, and yearly basis. In cases in which the necessary values are not given
directly, the following values are taken:
Yearly data analysis: The averages of the Unemployment and Interest Rate of each
year are used.
Quarterly data analysis: The average of the Unemployment and Interest Rate of
each quarter is used. For the parameters Consumer Price Index and Petrol Charge the
values of the first months of each quarter are taken.

Monthly data analysis: In the case of the quarterly published parameters, a linear
interpolation between the values of two sequential quarters is used.

3 Methodology
3.1 Time Series
Time Series Model
In this contribution, an additive model is applied to mimic the time series. It consists of a trend, a seasonal, a calendar and an error component:

x_t = m_t + s_t + p_t + e_t,   t = 1, ..., T,

where m_t denotes the trend, s_t the seasonal, p_t the calendar and e_t the error component.
For the estimation of the seasonal component there are many standard methods like
exponential smoothing [7], the ASA-II method [8], the Census X-11 method [9], or
the method of Box and Jenkins [10]. In this contribution, the Phase Average method
[11] is used because it is quite easy to interpret. To get accurate results with this
method, the time series must have a constant seasonal pattern over time and it has to
be trendless. A constant seasonal pattern is given in our time series. To guarantee the
trend freedom, a trend component is estimated univariately and subtracted before the
seasonal component is estimated. The latter is done by using a method which is close
to the moving average method [12]. Because of the small given data set, differing
from the standard method, the following formula is used to compute the mean m of a
period:
Although a univariate trend estimation would be easier to explain, this route is not
followed in this contribution because the assumption that the registrations of new
automobiles in Germany are not influenced by any other parameter is not justified.
Hence, the most important component, the trend, is estimated multivariately.
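The detrending-plus-averaging procedure just described can be sketched in a few lines. This is an illustrative sketch only: the simple least-squares line below stands in for the univariate trend estimate, and all names are our own rather than taken from the paper.

```python
def linear_trend(series):
    """Least-squares straight line through the series (stand-in for the
    univariate trend estimate used before the seasonal estimation)."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    slope = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series)) / \
            sum((t - t_mean) ** 2 for t in range(n))
    intercept = y_mean - slope * t_mean
    return [intercept + slope * t for t in range(n)]

def phase_average_seasonal(series, period):
    """Phase Average idea: subtract the trend, then average the detrended
    values belonging to the same phase (e.g. the same quarter)."""
    trend = linear_trend(series)
    detrended = [y - m for y, m in zip(series, trend)]
    seasonal = []
    for phase in range(period):
        values = detrended[phase::period]
        seasonal.append(sum(values) / len(values))
    # centre the seasonal figures so they sum to zero over one period
    mean_s = sum(seasonal) / period
    return [v - mean_s for v in seasonal]

# two identical seasonal cycles on top of a linear trend
data = [10 + t + s for t, s in zip(range(8), [4, -2, -3, 1] * 2)]
print(phase_average_seasonal(data, 4))
```

On this toy series the recovered seasonal figures approximate the true pattern [4, -2, -3, 1]; the small deviation comes from the least-squares line not matching the true trend exactly.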

For the linear trend estimation the Multiple Linear Regression (MLR) [12] is used.
The Support Vector Machine (SVM) with ε-Regression and Gaussian kernel [13]
[14] [15] is chosen as a representative of a non-linear estimation because the SVM
has proven to provide suitable results in other industrial projects [16] [17]. However,
this choice might be altered in future publications.
Calendar Component
The calendar component considers the number of working days within a single period. For the estimation of the calendar component p_t, the number of working days is related to the total number of days of the period t, from which the absolute values p_t are derived.
Error Component
The error component is estimated with an Autoregressive Moving Average (ARMA) process
of order two [8]. A condition for using this method is the stationarity of the error component. This condition is tested by the Kwiatkowski-Phillips-Schmidt-Shin test (KPSS test) [18]. In the case of non-stationarity, the error component is set to zero.
3.2 Data Pre-processing
Time lag
In reality, external influencing factors do not always have a direct effect on a time
series, but rather this influence is delayed. The method used to assign the time lag is
based on a correlation analysis.
Time lag estimation
If the value y_t of a time series Y has its influence on the time series X in t + s, the
time lag of the time series Y is given by the value s. First, the correlation between the
main time series X with its values x_1, ..., x_T and each of the k secondary time series Y^i,
i = 1, ..., k, with its values y^i_1, ..., y^i_T, is computed. Afterwards the secondary time series is
shifted by one time unit, i.e. the value y^i_t becomes the value y^i_{t+1}, and the correlation
between the time series x_2, ..., x_T and y^i_1, ..., y^i_{T-1} is computed. This shifting is repeated up
to a pre-defined limit. The number of shifts that yields the highest correlation between the
main and a secondary time series is the value of the time lag of this secondary time
series.
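The shift-and-correlate procedure can be sketched as follows; the function and variable names are illustrative, not from the paper.

```python
def pearson(a, b):
    """Pearson correlation coefficient of two equally long, non-constant series."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sd_a = sum((x - ma) ** 2 for x in a) ** 0.5
    sd_b = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sd_a * sd_b)

def estimate_time_lag(main, secondary, limit):
    """Shift the secondary series 0..limit steps against the main series
    and return the shift with the highest correlation."""
    best_shift, best_corr = 0, float("-inf")
    for s in range(limit + 1):
        # after shifting by s, compare x_{s+1..T} with y_{1..T-s}
        corr = pearson(main[s:], secondary[:len(secondary) - s])
        if corr > best_corr:
            best_shift, best_corr = s, corr
    return best_shift

# a secondary series that leads the main series by two periods
y = [1, 2, 3, 4, 5, 6, 7, 8, 7, 6, 5, 4]
x = [0, 0] + y[:-2]          # the main series follows y with lag 2
print(estimate_time_lag(x, y, limit=4))   # → 2
```

At shift 2 the two series coincide, so the correlation peaks there and the estimated time lag is 2.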
Smoothing the exogenous parameters by using the time lag
It is assumed that every value y_t of an exogenous parameter Y is influenced by its
past values. The time lag indicates how many past data points influence the current
value. This results in the following method: let s be the time lag of Y = y_1, ..., y_T;
then the current value ỹ_t is calculated by the weighted sum
152 B. Brühl et al.

ỹ_t = Σ_{j=1}^{s} α (1 − α)^{j−1} y_{t−j+1},   t = s+1, ..., T,
ỹ_t = y_t,                                     t = 1, ..., s,

where α ∈ (0, 1) is the weighting factor.
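A minimal sketch of smoothing an exogenous series with its time lag s is given below. The geometrically decaying weights α(1 − α)^(j−1) over the s most recent values are an assumption of this sketch, as is the normalisation of the weights; all names are illustrative.

```python
def smooth_with_lag(series, s, alpha=0.5):
    """Smooth an exogenous series with its time lag s: each value from
    position s onward becomes a weighted sum of the s most recent values.
    The geometric weights alpha * (1 - alpha) ** (j - 1) are an assumption;
    the first s values are left unchanged."""
    smoothed = list(series[:s])
    weights = [alpha * (1 - alpha) ** (j - 1) for j in range(1, s + 1)]
    total = sum(weights)
    for t in range(s, len(series)):
        # j = 1 picks the current value, j = s the oldest of the window
        value = sum(w * series[t - j + 1] for j, w in enumerate(weights, start=1))
        smoothed.append(value / total)  # normalise the weights to sum to one
    return smoothed

print(smooth_with_lag([5.0] * 6, 3))   # a constant series stays constant
```

Because the weights are normalised, a constant series is reproduced unchanged, and a step change is spread over the following s periods.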
Normalisation
To achieve comparability between factors which are not weighted in the same way,
these factors have to be normalized to a similar range. With that step, numerical errors
can be tremendously reduced. As normalization method, the z-Transformation is
applied. It refines the mean value to zero and the standard deviation to one: let v_t be
any factor at a particular time t, t ≤ T; then the z-Transformation is calculated by

v_{t,normalized} = (v_t − μ(v)) / σ(v),

where μ(v) is the mean and σ(v) the standard deviation of v.

Feature Selection
Methods which are typically used for Feature Selection are the correlation analysis,
the Principal Component Analysis (PCA) [19], the Wrapper Approach [20], and the
Filter Approach [21]. Here, the Wrapper Approach with two different regression

methods - the Multiple Linear Regression and the Support Vector Machine - is chosen
for dimension reduction. Compared with other methods, this method provides more
explicable results even for small data sets. Additionally, forecasts with the PCA are
calculated as a reference model for our results. The PCA results are not easily explicable, as the PCA-transformed parameters cannot be traced back to the original ones.
Therefore, the results have not been considered for the final solution.
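The Wrapper Approach with exhaustive search and leave-one-out cross-validation can be sketched as follows. This is a minimal sketch, not the paper's implementation: an ordinary least-squares regression serves as the evaluation model (the SVM variant would replace `ols_fit`), and all names are illustrative.

```python
from itertools import combinations

def ols_fit(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(x) for x in X]
    k = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    for col in range(k):                 # Gaussian elimination, partial pivoting
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv], b[col], b[piv] = A[piv], A[col], b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            A[r] = [a - f * c for a, c in zip(A[r], A[col])]
            b[r] -= f * b[col]
    coef = [0.0] * k
    for i in reversed(range(k)):
        coef[i] = (b[i] - sum(A[i][j] * coef[j] for j in range(i + 1, k))) / A[i][i]
    return coef

def loo_error(X, y, features):
    """Leave-one-out (T-fold) mean absolute error for one feature subset."""
    total = 0.0
    for hold in range(len(y)):
        Xtr = [[row[f] for f in features] for i, row in enumerate(X) if i != hold]
        ytr = [t for i, t in enumerate(y) if i != hold]
        coef = ols_fit(Xtr, ytr)
        pred = coef[0] + sum(c * X[hold][f] for c, f in zip(coef[1:], features))
        total += abs(pred - y[hold])
    return total / len(y)

def wrapper_select(X, y, n_features):
    """Exhaustive search: evaluate every non-empty feature subset and
    return the one with the smallest leave-one-out error."""
    subsets = [s for size in range(1, n_features + 1)
               for s in combinations(range(n_features), size)]
    return min(subsets, key=lambda s: loo_error(X, y, s))
```

On a toy data set in which the target depends only on the first feature, the search keeps that feature and discards the irrelevant ones; with ten parameters, as in this contribution, the exhaustive search still only has to evaluate 2^10 − 1 subsets.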
4 Evaluation Workflow
The data, i.e. the main time series and the exogenous parameters, are divided into
training and test data. The methods introduced in Chapter 3 are used to generate a
model on the training data, which is evaluated by applying it to the test data. The
complete evaluation workflow is shown in Figure 4.
Step 1: Data Integration: The bundling of all input information to one data source is
the first step in the workflow. Thereby, the yearly, quarterly or monthly data ranges
from 1992 to 2007. The initial data is assumed to have the following form:
Main time series: x_t = m_t + s_t + p_t + e_t,  t = 1, ..., T
Secondary time series: y^i_t,  t = 1, ..., T and i = 1, ..., k
Step 2: Data Pre-processing: Before the actual analysis, an internal data preprocessing is performed, wherein special effects contaminating the main time series

are eliminated. For example, the increase of the German sales tax in 2007 from 16%
to 19% led to an expert-estimated sales increase of approximately 100,000 automobiles in 2006. Hence, this number was subtracted in 2006 and added in 2007. Furthermore, the exogenous parameters were normalized by the z-Transformation.

Fig. 4. Evaluation Workflow: First, the data is collected and bundled. After a data pre-processing, it is split into training and test set. The model is built on the training set and the training error is calculated. Then the model is applied to the test data. Thereby, the new registrations for the test time period are predicted and compared with the real values, and based on this the test error is calculated.
The normalized data are passed on to an external data pre-processing procedure.
The method used is the Wrapper Approach with an exhaustive search. Since we use a
T-fold cross-validation (leave-one-out) to select the best feature set, it should be noted
that we implicitly assume independence of the parameters. As regression method for
the feature evaluation, a Linear Regression is applied in the case of linear trend estimation and a Support Vector Machine in the case of non-linear trend estimation.
The elimination of the special effects in monthly data is not applicable because the
monthly influences can be disregarded.
Step 3: Seasonal Component: The estimation of the seasonal component is done by
the Phase Average method. In this contribution, the seasonal component is estimated
before the trend component. The reason is that the Support Vector Machine absorbs a
part of the seasonal variation, leading to faulty results. In order to remove the influence of trends on the data, the trend is estimated univariately by using the method
presented in section 3.1. The univariate trend component is subtracted and the seasonal component can be estimated on the revised time series.
For the analysis of yearly data, the seasonal component cannot be computed because it measures the seasonal variation within one year.
Step 4: Time Lag and trend component: To analyse monthly and quarterly data, the
time lag of each exogenous parameter is calculated by a correlation analysis. The
exogenous parameters are smoothed using the estimated time lag. In this case, it is
advantageous to limit the time lag because the influence of the exogenous parameters on
the main time series is temporally bounded. The limit chosen was one year for
monthly and quarterly data. Hence, the time lag is set to 0 if it exceeds the limit.
Afterwards, the trend component of the training set is estimated multivariately by
Linear Regression (linear case) or by a Support Vector Machine (non-linear case). To
optimize some specific parameters for the Support Vector Machine, a Grid Search
algorithm is applied. Thereby, a bootstrapping with ten replications is performed for
the evaluation.
Step 5: Calendar component: The calendar component of the training set is estimated univariately by the method presented in section 3.1.
The calendar component is computed only for monthly data because the variation
of the working days in the case of yearly and quarterly data can be disregarded.
Step 6: ARMA Model: In the stationary case the error component is estimated by a
second-order ARMA process. Otherwise it is set to zero (cf. 3.1).
Step 7: Training Error: The absolute training error is derived from the difference
between the original values of the main time series and the values estimated by the

model. These values are given by the sum of the values of the trend, seasonal, and
calendar component. In the case of a stationary error component, the values estimated
by the ARMA model are added. The mean ratio between the absolute training errors
and the original values gives the Mean Absolute Percentage Error (MAPE).
Let x_i, i = 1, ..., T, be the original time series after the elimination of special effects
and z_i, i = 1, ..., T, the estimated values. Then the error functions considered are represented by the following formulas:

Mean Absolute Error:
E_MAE = (1/T) Σ_{i=1}^{T} |x_i − z_i|

Mean Absolute Percentage Error:
E_MAPE = (1/T) Σ_{i=1}^{T} |x_i − z_i| / x_i
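The two error functions translate directly into code; the sample values below are made up for illustration.

```python
def mae(actual, estimated):
    """Mean Absolute Error E_MAE."""
    return sum(abs(x - z) for x, z in zip(actual, estimated)) / len(actual)

def mape(actual, estimated):
    """Mean Absolute Percentage Error E_MAPE (actual values must be non-zero)."""
    return sum(abs(x - z) / x for x, z in zip(actual, estimated)) / len(actual)

registrations = [100.0, 200.0, 400.0]   # illustrative "actual" values
estimates = [110.0, 190.0, 380.0]       # illustrative model output
print(mae(registrations, estimates))    # → 13.333...
print(mape(registrations, estimates))   # → 0.0666... (about 6.7 %)
```

Note that the MAPE weights each error relative to its actual value, so the 20-unit error on 400 counts the same as the 10-unit error on 200.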
Step 8: Forecast: The predictions for the test time period are obtained by summing
up the corresponding seasonal component, the trend component based on the exogenous parameters of the new time period and the respective multivariate regression
method, and the calendar component. In the case of a stationary error component, the
values predicted by the ARMA process are added, too.
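The assembly of a prediction in Step 8 is a plain sum of the component forecasts; the sketch below uses made-up component values and illustrative names.

```python
def forecast(trend, seasonal, calendar, arma=None):
    """Step 8: per-period prediction as the sum of the component forecasts.
    The ARMA values are only added when the error component is stationary."""
    arma = arma if arma is not None else [0.0] * len(trend)
    return [m + s + p + e for m, s, p, e in zip(trend, seasonal, calendar, arma)]

# two illustrative test periods (all component values are made up)
print(forecast([300.0, 310.0], [20.0, -20.0], [1.0, -1.0]))   # → [321.0, 289.0]
```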
Step 9: Test error: The differences of the predictions and the original values of the
test set lead to the test errors. Its computation conforms exactly to the computation of
the training errors.

5 Results
The results are obtained from the execution of the workflow presented in Chapter 4.
In a first step, all ten exogenous parameters are used in the model. Secondly, a feature selection is performed by the Wrapper Approach and the same workflow is executed with only the selected parameters. The training period consists either of 14 or
15 years leading to a test set of two years or one year, respectively. In each case, an
MLR as well as an SVM is used for the multivariate trend estimation. This leads to
eight different evaluation workflows for each type of data collection.
In order to be able to assess the error rates resulting from the evaluation workflows, upper bounds for the training and test errors are required as reference values
for the evaluation. Therefore, a Principal Component Analysis (PCA) is applied to the
original exogenous parameters, and the same evaluation workflow is performed with
the PCA-transformed parameters. The PCA results are an indicator of what can be
expected from the multivariate model. However, in this contribution, the transformed
parameters cannot be used, because they are not explicable and the influences of the
original parameters cannot be reproduced anymore.

The results of the yearly, monthly, and quarterly data which generate the smallest
errors using the PCA-transformed exogenous parameters are shown in Table 2. A
training period of 14 years is used. In all cases the SVM gives much better results
than the MLR [22]. To optimize the parameters of the SVM, the Grid search algorithm is applied.

5.1 Yearly Model


The Feature Selection for the linear trend approximation showed that the Gross Domestic Product, Unemployment Rate, Price Index, Private Consumption, and Industrial Investment Demand were the only parameters which significantly influenced the
main time series. For the non-linear trend approximation, the respective parameters
were the same, except for the Price Index being exchanged with the Latent Replacement Demand.
Table 3 shows the results of the various models. By comparison with the PCA
analysis for yearly data (see Table 2), one can clearly see that the quality of the linear
trend models is inferior. Considering the fact that we face high test errors although we
start from low training errors, one can assume that the training set was too small for
this specific problem and the model was overfitted. In particular, data points which are
not in close proximity to the training set are hard to predict correctly.

In contrast, the results for the non-linear models shown in Table 3 have a better
quality compared with the PCA analysis. That originates from the saturation effect
generated by the regression in conjunction with the Support Vector Machine. It leads
to the fact that data points far off can still be reasonably predicted. Another advantage
(and consequence) of this approach is the fact that a parameter reduction does not
severely lower the quality of the predictions down to a threshold value of five parameters. Models with such a low number of parameters offer the chance to easily
explain the predictions, which appeals to us.
A general problem, however, is the very limited amount of information, which leads
to a prediction of, again, limited use (only annual predictions, no details for short-term planning). Therefore, an obvious next step is to test the model with the best statistics available, i.e. with monthly data.
5.2 Monthly Model
In this case, the Feature Selection resulted in the following: For the linear trend
model, only the parameters Model Policy and Latent Replacement Demand were
relevant, while in the non-linear model new car registrations were significantly influenced by the Gross Domestic Product, Disposable Personal Income, Interest Rate, Model
Policy, Latent Replacement Demand, Private Consumption, and Industrial Investment
Demand, i.e. a superset of the parameters of the linear case.
The results given in Table 4 are again first compared to the PCA analysis, cf.
Table 2. As for the yearly data, the non-linear models are superior to both the results
for the PCA analysis and for the linear model. Most accurate predictions can be
achieved for the non-linear model with all parameters. However, the deviations are

deemed too high and are therefore unacceptable for accurate predictions in practice.
One reason for this originates from the fact that most parameters are not collected
and given monthly, but need to be estimated from their quarterly values. Additionally, the time lag of the parameters can only be roughly estimated and is assumed to
be a constant value for reasons of feasibility.

5.3 Quarterly Model
The results for the training and test errors for the models based on quarterly data
are given in Table 5. Again, the linear model is inferior compared to the PCA (cf.
Table 2) and compared to the non-linear model. The difference between training and
test errors for the linear model with all parameters is still severe. Furthermore, the
total error of the linear model with reduced parameters might look small. However, a closer
look reveals that this originates only from error cancellation [22]. Altogether

this indicates that the training set is again too small to successfully apply this model
for practical use.
The results for the non-linear model, in turn, are very satisfying. They are better
than the results from the PCA analysis and provide the best absolute test errors of all
investigated models. This also indicates that all parameters are meaningful contributions for this kind of macro-economic problem. In a previous work, we have shown
that a reduction to the six most relevant parameters would more than double the test
error [22].
5.4 Summary
It can be clearly stated that the SVM provides superior predictions (smaller test errors)
compared to the MLR. This illustrates that the mutual influence of the parameters is
essential to achieve accurate forecasts. In order to identify the overall best models, the
errors of the training and test sets are accumulated to annual values. The results for
the best model based on yearly, monthly, and quarterly data are visualized in Figure 5.

Fig. 5. Graphical illustration of the absolute errors of the best models for a 15 years training
period, cumulated to years: On yearly and monthly data the non-linear model with reduced
parameters, on quarterly data the non-linear model with all parameters
During the training period, the best model for the monthly data is significantly

worse compared to both other models. The same holds for the test period. Here, the
best quarterly model is significantly superior to the best yearly model, with roughly
half of the test error. It can be observed that the quarterly model not only delivers a
smaller test error, but at the same time provides higher information content than the
yearly model. Both models generate very low errors during the training period, showing again that the set of parameters is well adapted to our problem. The only drawback
of the best quarterly model is the fact that all exogenous parameters are necessary,
making the model less explicable.
6 Discussion and Conclusion
Based on the results of Chapter 5, the three questions mentioned at the beginning of
this contribution can now be answered.
1.
Is it possible to create a model which is easy to interpret and which at the
same time provides reliable forecasts?
To answer this question, a more differentiated approach must be taken. Considering
only the additive model used, the answer is yes, because the additive model has
given better results than the Principal Component Analysis.
By looking at the different methods used in our model, answering the question becomes more difficult. Simple and easily explicable univariate estimations are used for
the seasonal, calendar and error components, but a more complex multivariate method
for the largest and most important component, the trend. Thereby, the results given
by the more easily explicable Multiple Linear Regression are less favorable than the
results given by the less explicable Support Vector Machine. But in general, the chosen model is relatively simple and gives satisfying results in consideration of the quality of the forecast.
2.
Which exogenous parameters influence the sales market of the German
automobile industry?
Here, it has to be differentiated between the yearly, monthly and quarterly data. In
the yearly model, only a few exogenous parameters are needed to get satisfying results. But it is not possible to generalize the results, because of the very small data set.
In the monthly model, too, fewer exogenous parameters could suffice. But most of
the exogenous parameters are not published monthly, so that the exact values of these
parameters are not given, leading to inadequate results. However, in the quarterly

model, where the highest number of exogenous parameters is explicitly given, a reduction of the exogenous parameters in our tested model is not possible without decreasing the quality of the results.
3.
Which collection of data points, yearly, monthly or quarterly data, is the
most suitable one?
Yearly, monthly, and quarterly data are regarded. The problems in the yearly
model are the very small data set and the small information content of the forecast.
The problems presented by the monthly model include training and test errors which
are much higher than in the yearly and quarterly models. Probable causes for the
weakness of the monthly model are the inexact nature of the monthly data since most
of the exogenous parameters are not collected monthly. The problems of the yearly
model as well as the problems of the monthly model can be solved by using the quarterly model. Therefore, the quarterly model is the superior method, even though no
reduction of exogenous parameters is possible in this model.
To conclude, it should be pointed out that forecasts are always plagued by uncertainty. There can be occurrences (special effects) in the future which are not predictable or whose effects cannot be assessed. The current financial crisis, which led to
lower sales in the year 2008, is an example of such an occurrence. Because of this
fact, forecasts can only be considered as an auxiliary means for corporate management and have to be interpreted with care [23].
