[Figure: query cost (elapsed time in sec.) on Oracle 8.0]
Type                Statistics for Frequently-Changing Environmental Factors
CPU statistics      number of running processes; number of stopped processes;
                    number of sleeping processes; number of zombie processes;
                    percentage of user time; percentage of system time;
                    percentage of idle time;
                    load averages for the past 1, 5, and 15 minutes, respectively
Memory statistics   available memory; shared memory; used memory; buffer memory;
                    available swap; used swap; free swap; cached swap;
                    amount of memory swapped in; amount of memory swapped out
I/O statistics      number of reads per sec.; number of writes per sec.;
                    percentage of disk utilization
Other statistics    number of current users; number of interrupts per sec.;
                    number of context switches per sec.; number of system calls per sec.

Table 1. System Stats for Frequently-Changing Factors in Unix

size, physical data distribution, and index clustering ratio may change quite frequently. However, they may not have an immediate significant impact on query cost until such changes accumulate to a certain degree. Thus we also consider these factors as occasionally-changing factors. The changes of occasionally-changing factors can be found via checking the local database catalog and/or system configuration files.

Steady factors. These factors rarely change. Examples of such factors are local DBMS type (e.g., relational or object-oriented), local database location (e.g., local or remote), and local CPU speed (e.g., 300MHz). Although these factors may have an impact on a cost model, the chance for them to change is very small.

Clearly, the steady factors usually do not cause a problem for a query cost model. If significant changes for such factors occur at a local site, they can be handled in a similar way as described below for the occasionally-changing factors.

For the occasionally-changing factors, a simple and effective approach to capturing them in a cost model is to invoke the static query sampling method periodically or whenever a significant change for the factors occurs. Since these factors do not change very often, rebuilding cost models from time to time to capture them is acceptable. However, this approach cannot be used for the frequently-changing factors because frequent invocations of the static query sampling method would significantly increase the system load and the cost model maintenance overhead. On the other hand, if a cost model cannot capture the dramatic changes in a system environment, poor query cost estimates may be used by the query optimizer, resulting in inefficient query execution plans.

Theoretically speaking, to capture the frequently-changing factors in a cost model, one approach is to include all explanatory variables that reflect such factors in the cost model. However, this approach encounters several difficulties. First, the ways in which these factors affect a query cost are not clear. As a result, the appropriate format of a cost model that directly includes the relevant variables is hard to determine. Second, the large number of such factors (see Table 1) makes a cost model too complicated to derive and maintain even if the previous difficulty could be overcome. In the rest of this paper, we introduce a feasible method to capture the frequently-changing factors in a cost model.

3. Regression with qualitative variable

As mentioned before, the key idea of our method is to determine a number of contention states for a dynamic environment and use a qualitative variable to indicate the states. A cost model with the qualitative variable can be used to estimate the cost of a query in different contention states. The issues of how to include a qualitative variable in a cost model and how to determine an appropriate set of system contention states are discussed in this section.

3.1. Qualitative variable

To simplify the problem, we consider the combined effect of all the frequently-changing factors on a query cost together rather than individually. Although these dynamic factors may change differently in terms of the changing frequency and degree, they all contribute to the contention level of the underlying system environment. The cost of a query increases with the contention level. The system contention level can be divided into a number of discrete states (categories) such as “ ” ( ), “ ” ( ), “ ” ( ), and “ ” ( ). A qualitative variable is used to indicate the contention states. This qualitative variable, therefore, reflects the combined effect of the foregoing frequently-changing environmental factors. A cost model incorporating such a qualitative variable can capture the dynamic environmental factors to a certain degree.

As shown in [17, 19], a statistical relationship between query costs and their affecting factors such as operand and result table sizes can be established by multiple regression. The established relationship can then be used as a cost model to estimate query costs.

Usually, only quantitative variables are considered in a regression model. These variables, such as operand table size, take values on a well-defined scale. However, many variables of interest may not be quantitative but qualitative. Qualitative variables only have several discrete categories (states). For example, the foregoing qualitative variable indicating system contention states may have four states. Such a qualitative variable can also be incorporated into a regression model.

A qualitative variable can be represented by a set of indicator variables. For example, the above contention state variable with four states can be represented by three indicator variables, each of which takes the value 1 when the system is in its associated state and 0 otherwise.
Clearly, when all three indicator variables are 0, the system is in the remaining fourth state. Note that no more than one indicator variable can be 1 simultaneously (i.e., the qualitative variable can only take one state at a time). In general, a qualitative variable that has $m$ categories (states) needs $m - 1$ indicator variables to represent it.

3.2. General regression model

Let $Y$ and $X_1, X_2, \ldots, X_n$ be the response variable and (quantitative) explanatory variables in a regression model, respectively. Let a qualitative variable with $m$ states (categories) be represented by indicator variables $Z_1, Z_2, \ldots, Z_{m-1}$. The qualitative variable can influence the regression model in the following four different ways (see Table 2):

Type          Regression Equation
Coincident:   $Y = \beta_0 + \sum_{i=1}^{n} \beta_i X_i$
Parallel:     $Y = \beta_0 + \sum_{j=1}^{m-1} \beta_{0j} Z_j + \sum_{i=1}^{n} \beta_i X_i$
Concurrent:   $Y = \beta_0 + \sum_{i=1}^{n} \bigl(\beta_i + \sum_{j=1}^{m-1} \beta_{ij} Z_j\bigr) X_i$
General:      $Y = \beta_0 + \sum_{j=1}^{m-1} \beta_{0j} Z_j + \sum_{i=1}^{n} \bigl(\beta_i + \sum_{j=1}^{m-1} \beta_{ij} Z_j\bigr) X_i$

Table 2. Qualitative Regression Equation Forms

Coincident. The relationship between the response and explanatory variables stays the same for all states of the qualitative variable. In other words, the equations for all states are coincident. This in fact is the situation for a static system environment assumed by the static query sampling method.

Parallel. The relationship between the response and explanatory variables may differ in the intercept term but not the slope terms for different states of the qualitative variable. The relevant equation in Table 2 shows that the intercept term for the $j$th state of the qualitative variable is $\beta_0 + \beta_{0j}$ ($1 \le j \le m$; and $\beta_{0m} = 0$). Since the slope terms remain the same for all states, the equations for different states are parallel.

Concurrent. The relationship between the response and explanatory variables may differ in the slope terms but not the intercept term for different states of the qualitative variable. The relevant equation in Table 2 shows that the $i$th slope term ($1 \le i \le n$) for the $j$th state of the qualitative variable is $\beta_i + \beta_{ij}$ ($1 \le j \le m$; and $\beta_{im} = 0$). The equations for different states have the same intercept term. They are said to be concurrent.

General. The relationship between the response and explanatory variables may differ in both the intercept term and the slope terms for different states of the qualitative variable. This is the most general case.

Note that the cost of a query usually consists of (1) initialization cost such as moving a disk head to the right position; (2) I/O cost such as fetching a tuple from an operand table; and (3) CPU cost such as evaluating the qualification condition for a given tuple. A typical cost model for a unary query class may look like:

    $Y = \beta_0 + \beta_1 N_U + \beta_2 RN_U$    (1)

where $N_U$ and $RN_U$ are the cardinalities of the operand table and result table, respectively; $\beta_0$, $\beta_1$ and $\beta_2$ are the parameters representing the initialization cost, the cost of retrieving a tuple from the operand table, and the cost of processing a tuple in the result table, respectively. Both $\beta_1$ and $\beta_2$ may reflect I/O as well as CPU costs. Therefore, the initialization cost affects the intercept term in a cost model, while the I/O and CPU costs affect the slope terms in the cost model. Clearly, the contention level of a system can significantly affect not only the initialization cost but also the I/O and CPU costs of a query because resources like the disk, I/O bandwidth and CPU are shared by multiple processes. As a result, both the intercept and slope terms in a query cost model may change when the system contention level changes. Therefore, to incorporate a qualitative variable representing the system contention states into a query cost model, the general qualitative regression model is more appropriate.

3.3. Determining system contention states

Combining multiple dynamic environmental factors into a composite qualitative variable with a number of discrete contention states greatly simplifies the development of a cost model for a dynamic environment. The question now is how to determine an appropriate set of system contention states for a dynamic environment.

Two extremes

There are two extremes in determining a set of contention states. One extreme is to consider only one contention state for the system environment. A cost model developed in such a case is useful if the system environment is static. This, in fact, was the case that the static query sampling method assumed. However, as pointed out before, a real system environment may change dynamically over time. Using one contention state is obviously insufficient to describe the dynamic environment. For a dynamic environment, usually, the more contention states are considered, the better the cost model. In principle, as long as we consider a sufficient number of contention states for the environment, we can get a satisfactory cost model. Another extreme is to consider an infinite number of contention states. However, the more contention states are considered, the more indicator variables are used in the cost model.
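To make the growth in model size concrete, here is a minimal sketch of how the general form in Table 2 can be laid out as one design-matrix row per observation. The function names (`indicators`, `general_row`, `predict`), the 1-based state numbering, and the choice of the last state as the all-zero baseline are illustrative assumptions, not taken from the paper.

```python
def indicators(state, m):
    """Encode a state in 1..m with m-1 indicator variables Z_1..Z_{m-1}.

    Assumption: state m is the baseline, for which every indicator is 0.
    """
    if not 1 <= state <= m:
        raise ValueError("state must lie in 1..m")
    return [1.0 if state == j else 0.0 for j in range(1, m)]


def general_row(x, state, m):
    """Design-matrix row for the general model.

    Layout: [1, X_1..X_n] followed, for each indicator Z_j, by
    Z_j * [1, X_1..X_n]. The row therefore has (n + 1) * m entries,
    one per regression coefficient.
    """
    base = [1.0] + [float(v) for v in x]
    row = list(base)
    for z in indicators(state, m):
        row.extend(z * b for b in base)
    return row


def predict(coefficients, x, state, m):
    """Evaluate the general model for explanatory values x in a given state."""
    row = general_row(x, state, m)
    if len(coefficients) != len(row):
        raise ValueError("expected %d coefficients" % len(row))
    return sum(c * r for c, r in zip(coefficients, row))


if __name__ == "__main__":
    x = [1000, 50]  # hypothetical operand and result table sizes
    for m in (1, 3, 6):
        print(m, "states ->", len(general_row(x, 1, m)), "coefficients")
```

With two explanatory variables, moving from three to six states doubles the number of coefficients, which is why the number of states has to be kept small in practice.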
The number of coefficients that need to be determined in a cost model therefore increases. Hence, if too many contention states are considered, the cost model can be very complicated, which is not good for either the development or maintenance of the cost model. In practice, as we will see in Section 5, a small number of contention states (three to six) is usually sufficient to yield a good cost model.

Determining states via iterative uniform partition

Notice that, for a given query, its cost increases as the system contention level increases (see Figure 1). Based on this observation, we can use the cost of a probing query to gauge the system contention level². The range of probing costs (therefore, the contention level) is divided into subranges, each of which represents a contention state for the dynamic environment.

Let the cost of the probing query fall in the range $[C_{min}, C_{max}]$ in a dynamic environment. A simple way to determine the system contention states is to partition range $[C_{min}, C_{max}]$ into subranges with an equal size. In other words, to determine $m$ contention states³, we divide range $[C_{min}, C_{max}]$ into $m$ subranges $I_m, I_{m-1}, \ldots, I_1$ of equal width $(C_{max} - C_{min})/m$, with $I_m$ starting at $C_{min}$ and $I_1$ ending at $C_{max}$. The system environment is said to be in the $j$th contention state if the probing query cost falls in $I_j$ ($1 \le j \le m$). To obtain more system contention states, we can simply increase $m$. Hence, $\{I_m, I_{m-1}, \ldots, I_1\}$ yields a set of the system contention states for the dynamic environment.

Using this partition, it is easy to determine the system contention state in which a query is executed. Consider a set of sample queries which are performed in a dynamic environment and whose observed data (costs, result table sizes, etc.) are to be used to derive a regression cost model for a query class. To determine the system contention state in which a sample query is executed, the cost of the probing query in the same environment is measured; the sample query is taken to be executed in the $j$th state if this probing cost falls in $I_j$. We call the costs of a probing query associated with the sample queries the sampled probing query costs.

One basic question is how to determine a proper $m$. Another question is how to eliminate some unnecessary separations of subranges. Clearly, if the performance behaviors of queries in two neighboring contention states are similar, separating the corresponding subranges is unnecessary. The determination of system contention states should balance the accuracy and simplicity (hence low maintenance overhead) of a derived cost model.

To solve these two problems, the following algorithm is used to improve the above straightforward uniform partition:

ALGORITHM 3.1: Contention States Determination via Iterative Uniform Partition with Merging Adjustment (IUPMA)
Input: Observed data of sample queries and their associated probing query costs
Output: A set of system contention states⁴
Method:
 1. begin
 2.   Derive a qualitative regression model with one contention state using the sample query data;
 3.   Let   be the coefficient of total determination of the current regression model;
 4.   Let   be the standard error of estimation of the current regression model;
 5.   ;
 6.   do
 7.     ;
 8.     ;
 9.     Obtain a set of contention states for the system environment via the straightforward uniform partition;
10.     Derive a qualitative regression model with   contention states using sample query data;
11.     Let   be the coefficient of total determination for the current regression model;
12.     Let   be the standard error of estimation of the current regression model;
13.   until (  and  ) are sufficiently small or   is too large;
14.   ;
15.   Let   (  ) represent the current contention states in  ;
16.   Let   (  ) be the adjusted coefficient of  th variable for state   in the general model in Table 2, where   is a dummy variable for the intercept term;
17.   for   down to   do
18.     
19.     if   is too small then
20.       tag that states   and   should be merged;
21.   end for
22.   if some states are tagged to be merged then
23.     Derive a qualitative regression model with new merged states using sample query data;
24.     goto step 15;
25.   end if;
26.   return the current set of contention states;
27. end.

There are two phases in Algorithm 3.1. The first phase is to determine a set of contention states via the uniform partition. The algorithm iteratively checks each qualitative regression model with an incremental number of contention states until (1) the model cannot be significantly improved in terms of the coefficient of total determination⁵ and the standard error of estimation⁶; or (2) too many contention states have been determined. Condition (2) is used here to prevent a derived cost model from becoming too complicated (in terms of the number of variables involved). The contention states obtained from the first phase are based on a uniform partition of the probing query cost range.

² Our experiments showed that most queries, except the ones with extremely small cost (e.g., several hundredths of a second), can well serve as a probing query to gauge the system contention level.
³ A decreasing index is used here to simplify the descriptions of the algorithms and derived cost models.
⁴ In fact, the algorithm integrates the contention states determination procedure with the cost model development procedure (to be discussed in the next section). As a result, a cost model is also produced as an output of the algorithm.
⁵ The coefficient of total determination measures the proportion of variability in the response variable explained by the explanatory variables in a regression model [12]. The higher, the better.
⁶ The standard error of estimation is an indication of the accuracy of estimation given by the model [12]. The smaller, the better.
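The uniform partition used in the first phase can be sketched in a few lines. This is only an illustration of the partition step (Line 9 of Algorithm 3.1), not of the whole algorithm; the function names, the clamping of out-of-range costs, and the example numbers are assumptions. The decreasing index follows Figure 2: the highest index covers the cheapest probing costs.

```python
def uniform_partition(c_min, c_max, m):
    """Split [c_min, c_max] into m equal-width subranges.

    Returns (low, high) bounds ordered from the cheapest subrange to the
    most expensive one.
    """
    if m < 1 or c_max <= c_min:
        raise ValueError("need m >= 1 and c_max > c_min")
    width = (c_max - c_min) / m
    return [(c_min + i * width, c_min + (i + 1) * width) for i in range(m)]


def contention_state(probe_cost, c_min, c_max, m):
    """Map a probing query cost to a state index in m..1.

    Decreasing index: m is the lowest contention level (costs near c_min),
    1 is the highest (costs near c_max).
    """
    if m < 1 or c_max <= c_min:
        raise ValueError("need m >= 1 and c_max > c_min")
    width = (c_max - c_min) / m
    i = int((probe_cost - c_min) // width)  # 0-based subrange counted from c_min
    i = min(max(i, 0), m - 1)               # clamp costs at or beyond the range ends
    return m - i


if __name__ == "__main__":
    # Hypothetical probing costs in seconds, partitioned into 5 states.
    for cost in (0.5, 3.9, 10.0):
        print(cost, "-> state", contention_state(cost, 0.0, 10.0, 5))
```

Increasing `m` refines the partition without changing the lookup, which is what lets the first phase try an incremental number of states.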
The uniform partition of the probing query cost range (see Figure 2) does not consider whether two states actually have significantly different effects on the cost model. It is possible that some neighboring states have only slightly different effects on the cost model. If so, the states should be merged into one to simplify the cost model. Such a merging adjustment is done during the second phase of the algorithm. If the maximum of the relative errors of the corresponding pairs of adjusted coefficients for two states is too small, these two states are considered not to have significantly different effects on the cost model. The subranges in the final adjusted partition of the probing query cost range may not have an equal size.

[Figure 2. Contention States Determination via IUPMA. The first phase partitions the probing query cost range from Cmin to Cmax uniformly into subranges Im, Im-1, ..., I2, I1; the second phase yields the adjusted partition I'k, I'k-1, ..., I'1.]

Determining states via data clustering

To capture the effect of every contention level on query costs for a dynamic environment in a cost model, we can let each contention level point have an equal chance to be chosen for running a given sample query. In other words, the probing query costs associated with the sample queries to indicate the sampled contention level points follow the uniform distribution within their range. A cost model derived by using such sample data can be used to estimate the cost of a query executed at any contention level. However, in a real dynamic application environment, the contention level may occur more often in some subranges than in others. To better capture the performance behavior of a dynamic environment, we can choose the contention level points for running sample queries based on the actual distribution of the contention level in the dynamic environment. As a result, the associated probing query costs may not follow the uniform distribution in their range. More often they are grouped into clusters.

Although Algorithm 3.1 is designed for uniformly distributed probing query costs, it usually can also handle clustered probing query costs well due to its iterating and adjusting mechanisms. However, the resulting partition of the probing query cost range for the clustered cases may not be the best since the boundaries considered at each iteration in the algorithm are fixed, regardless of the distribution of the system contention level. To overcome the problem, a data mining algorithm for data clustering can be incorporated into the contention states determination procedure here.

An agglomerative hierarchical algorithm is often used for data clustering [6]. The main idea of the algorithm is to place each data object in its own cluster initially and then gradually merge clusters into larger and larger clusters until a desired number of clusters have been found. The criterion used to merge two clusters is to make their distance minimized. One widely used distance measure is the distance between the centroids or means of two clusters.

Let   be the maximum allowed number of system contention states. The above clustering algorithm can be used to obtain clusterings (  ;  ’s are clusters such that   for  ) for sampled probing query costs. Let subranges   and  , where   and  ; here   and   are the minimum and maximum probing query costs in cluster  . Clearly,   gives a set of the system contention states for the dynamic environment, which reflects the distribution information of probing query costs (the contention level). If we use such   in Line 9 in Algorithm 3.1, we get a new algorithm, termed the Contention States Determination via Iterative Clustering with Merging Adjustment (ICMA).

Note that, for clustered probing query costs, it is possible that a cluster may not have a sufficient number of sampled data points to meet the minimum requirement for regression analysis. In such a case, we draw additional sample data points (via executing more sample queries) to make the cluster meet the minimum requirement rather than simply treat the data points in the cluster as outliers and ignore them. Although this may change the distribution of the contention level slightly, no useful contention level points are ignored in the derived cost model.

Probing costs estimation

To minimize the overhead for determining a system contention state, a query with a small cost is preferred as a probing query. To further reduce the overhead, estimated costs (rather than observed costs) of the probing query can be used to determine the contention states of a dynamic environment. The idea is to first develop a regression equation between the probing query cost and some major system contention parameters⁷ (such as CPU load, I/O utilization, and size of used memory space for a dynamic environment in Table 1), i.e.,

    (2)

⁷ A standard statistical procedure can be used to determine the significant parameters for a system environment.
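A minimal sketch of this estimation idea follows, assuming that the regression equation is linear in the system contention parameters; the paper's exact form of (2) is not reproduced here. The function names, the normal-equations solver, and all sample values are hypothetical.

```python
def fit_linear(rows, costs):
    """Least-squares fit of cost = b0 + b1*v1 + ... + bq*vq.

    rows holds one tuple of system parameters per observation, costs the
    observed probing query costs. Solves the normal equations by
    Gauss-Jordan elimination with partial pivoting.
    """
    A = [[1.0] + [float(v) for v in r] for r in rows]
    k = len(A[0])
    # Augmented matrix [A^T A | A^T y].
    M = [[sum(a[i] * a[j] for a in A) for j in range(k)]
         + [sum(a[i] * y for a, y in zip(A, costs))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("parameters are linearly dependent")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(k):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][k] / M[i][i] for i in range(k)]


def estimate_probe_cost(coeffs, params):
    """Estimated probing query cost for the current system parameters."""
    return coeffs[0] + sum(b * v for b, v in zip(coeffs[1:], params))


if __name__ == "__main__":
    # Hypothetical observations: (CPU load, I/O utilization %, used memory MB).
    rows = [(1, 10, 100), (2, 30, 150), (4, 20, 300), (3, 50, 200), (5, 40, 250)]
    costs = [0.95, 1.40, 2.10, 1.95, 2.30]  # observed probing costs in seconds
    coeffs = fit_linear(rows, costs)
    print(estimate_probe_cost(coeffs, (3, 25, 220)))
```

Once fitted, only the system statistics need to be read at run time; the estimated cost is then looked up in the subranges exactly as an observed probing cost would be.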
In (2), ( ) are the regression coefficients. Afterwards, every time we want to determine the system contention state in which a query is executed, we only need to check which subrange the estimated cost of the probing query lies in by using (2), without actually executing the probing query. Since obtaining the parameter values ( ) in (2) usually requires less overhead than executing a probing query, using the estimated costs of a probing query to determine system contention states is usually more efficient. However, estimation errors may introduce certain inaccuracy.

4. Development of cost models

As mentioned before, we extend the query sampling method for a static environment in [17] so as to develop cost models for a dynamic environment via introducing a qualitative variable. Such extensions are discussed in this section.

4.1. Query classification and sampling

Similar to the static query sampling method, we group local queries on a local database system into classes based on their potential access methods to be employed. The previous classification rules and procedures in [17] can be utilized. For example, one class consists of unary queries that are most likely performed by using a clustered-index scan access method in a DBMS. Hence a similar performance behavior is shared among queries in the class and can be described by a common cost model.

A sample of queries is then drawn from each query class in a similar way as before. However, since more parameters associated with the indicator variables are included in a cost model, more sample queries need to be drawn in order to meet the commonly-used rule for sampling in statistics, i.e., sample at least 10 observations for every parameter to be estimated [12]. The following proposition gives a guideline on the minimum number of sample queries needed for regression analysis.

PROPOSITION 4.1 For the general qualitative regression cost model in Table 2 with $n$ quantitative explanatory variables and one qualitative variable for $m$ states, at least $10\,[(n+1)\,m + 1]$ observations need to be sampled.

PROOF. Notice that there are $n + 1$ groups of regression coefficients in the cost model, one for each independent quantitative variable plus the intercept term. Each group has $m$ coefficients, one for each state of the qualitative variable. In addition, the variance of error terms also needs to be estimated. Hence there are $(n+1)\,m + 1$ parameters in total, and the rule of 10 observations per parameter gives the bound.

Sample queries drawn from a query class are performed in a dynamic environment. Their observed data as well as their associated probing query costs are recorded and used to derive a regression cost model for the query class. A load builder, which is part of the MDBS agent for each local DBS [2], is used to simulate a dynamic application environment at a local site in an MDBS during the query sampling procedure. The MDBS agent may also have an environment monitor which collects system statistics used for estimating the probing query costs when the estimation approach in Section 3.3 is employed.

4.2. Regression cost models

A qualitative regression cost model contains a set of quantitative explanatory variables and a set of indicator variables for a qualitative variable indicating system contention states. Similar to the static query sampling method, we divide the cost model into two parts: a basic model and a secondary part. The basic model represents the essential part of the model, while the secondary part is used to further improve the model. The qualitative variable (i.e., the indicator variables) is included in both parts of the cost model to capture the dynamic environmental factors. The set of quantitative explanatory variables is split into two subsets: one contains the basic (quantitative) explanatory variables in the basic model, while the other contains the secondary (quantitative) explanatory variables in the secondary part. Table 3 lists potential explanatory variables in each of the subsets for a unary query class and a join query class. If all variables (including indicator variables) are included, we obtain the full cost model. However, usually, not all variables are necessary for a given cost model.

To determine the variables to be included in a regression cost model for a query class, a mixed backward and forward procedure described below is adopted. We start with the full basic model, which includes all basic explanatory variables, and use a backward procedure to eliminate insignificant basic explanatory variables one by one. Note that, in our algorithm, if an explanatory variable is removed from the
Class               Basic Explanatory Variables              Secondary Explanatory Variables
Unary Query Class   size (cardinality) of operand table      tuple length of operand table
                    size of intermediate table               tuple length of result table
                    size of result table                     operand table length
                                                             result table length
Join Query Class    size of 1st operand table                tuple length of 1st operand table
                    size of 2nd operand table                tuple length of 2nd operand table
                    size of 1st intermediate table           tuple length of result table
                    size of 2nd intermediate table           1st operand table length
                    size of result table                     2nd operand table length
                    size of Cartesian product of             result table length
                    intermediate tables

Table 3. Potential Explanatory Variables for Cost Models

model, its coefficients for all contention states (determined by the indicator variables) are removed. We then use a forward selection procedure to add more significant secondary explanatory variables into the cost model. This procedure tries to further improve the cost model. Similar to the backward procedure, if a secondary variable is added into the model, its coefficients for all contention states are included.

Since it is expected that most basic variables are important to a cost model and only a few secondary explanatory variables are important, both the backward elimination and the forward selection procedures most likely terminate soon after they start.

Assume that we have $n_j$ sampling observations in contention state $j$ ($1 \le j \le m$), with $\sum_{j=1}^{m} n_j$ observations in total. Consider the simple correlation coefficient between variables $V$ and $W$ in contention state $j$:

    $r_j(V, W) = \dfrac{\sum_{t=1}^{n_j} (v_{jt} - \bar{v}_j)(w_{jt} - \bar{w}_j)}{\sqrt{\sum_{t=1}^{n_j} (v_{jt} - \bar{v}_j)^2 \, \sum_{t=1}^{n_j} (w_{jt} - \bar{w}_j)^2}}$

where $v_{jt}$ and $w_{jt}$ are the values from the $t$th sampling observation ($1 \le t \le n_j$) in state $j$, and $\bar{v}_j$ and $\bar{w}_j$ are their means in that state. For any explanatory variable, if its maximum simple correlation coefficient with the response variable over all states is too small, it has little linear relationship with the response variable in any state. Such explanatory variables should be removed from consideration.

In the backward elimination procedure, the next variable to be removed from the current model is the one which satisfies two conditions: (i) its average simple correlation coefficient with the response variable for all contention states is the smallest among all explanatory variables in the current model; (ii) removing it either does not increase the standard error of estimation or increases it only marginally, as bounded by a given small positive constant. The standard error of estimation for the reduced model (i.e., with the variable removed) is given by:

    (3)

in terms of the observed query cost, the estimated query cost given by the reduced model, and the number of explanatory variables in the model; the standard error of estimation for the original model is given by a formula similar to (3).

Since the average simple correlation coefficient indicates the degree of linear relationship between an explanatory variable and the response variable on average in all states, the foregoing condition (i) selects an explanatory variable that contributes the least (on average in all states) in explaining the response variable. Since the standard error of estimation is an indication of estimation accuracy, the foregoing condition (ii) ensures that removing the variable from the model improves the estimation accuracy or affects the model very little. Removing a variable that has little effect on the model can reduce the complexity and maintenance overhead of the model.

In the forward selection procedure, the next variable to be added into the current model is the one that satisfies two conditions: (i) its average simple correlation coefficient with the residuals of the current model for all states is the largest among all explanatory variables in the model; i.e., it can explain the most (on average for all states) about the variations that the current model cannot explain; and (ii) it significantly improves the estimation accuracy, i.e., the standard error of estimation for the augmented model (i.e., with the variable included) is smaller than that for the original model by a margin governed by a given small positive constant.

Note that the exact number of explanatory variables in a cost model is determined after the above mixed backward and forward procedure is done. However, we need such information to determine the query sample size from Proposition 4.1 at the beginning of the cost model development. Since it is expected that most basic explanatory variables are selected and only a few secondary explanatory variables are used for a cost model, we expect the number of explanatory variables in a cost model usually not to exceed  . Based on experiments, the maximum number of contention states for a dynamic environment in practice can also be estimated. Hence, a reasonable query sample size is:

    (4)

from Proposition 4.1.

4.3. Measures for developing useful models

Multicollinearity occurs when explanatory variables are highly correlated among themselves. In such a case, the estimated regression coefficients tend to have large sampling variability. It is better to avoid multicollinearity.

The presence of multicollinearity is detected by means of the variance inflation factor (VIF) [11]. When an explanatory variable has a strong linear relationship with the other explanatory variables, its VIF is large. In a dynamic environment with multiple contention states, let   (  ) be the variance inflation factor of explanatory variable
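As a rough illustration of the variance inflation factor, the sketch below uses its textbook definition, 1 / (1 - R²), where R² comes from regressing one explanatory variable on the others. Computing it separately within each contention state is an assumption about how the per-state factor would be obtained; all function names and sample values are hypothetical.

```python
def _solve(M):
    """Gauss-Jordan elimination on an augmented matrix; returns the solution."""
    k = len(M)
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("explanatory variables are perfectly collinear")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(k):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [M[i][k] / M[i][i] for i in range(k)]


def r_squared(y, X):
    """R^2 of the least-squares regression of y on the columns of X (with intercept)."""
    A = [[1.0] + list(row) for row in X]
    k = len(A[0])
    M = [[sum(a[i] * a[j] for a in A) for j in range(k)]
         + [sum(a[i] * v for a, v in zip(A, y))] for i in range(k)]
    b = _solve(M)
    fitted = [sum(c * v for c, v in zip(b, a)) for a in A]
    mean = sum(y) / len(y)
    ss_res = sum((v - f) ** 2 for v, f in zip(y, fitted))
    ss_tot = sum((v - mean) ** 2 for v in y)
    return 1.0 - ss_res / ss_tot


def vif(X, i):
    """Variance inflation factor of column i of X: 1 / (1 - R_i^2)."""
    y = [row[i] for row in X]
    others = [[v for j, v in enumerate(row) if j != i] for row in X]
    return 1.0 / (1.0 - r_squared(y, others))


def vif_by_state(X, states, i):
    """VIF of column i computed separately within each contention state."""
    result = {}
    for s in sorted(set(states)):
        rows = [row for row, t in zip(X, states) if t == s]
        result[s] = vif(rows, i)
    return result
```

A value near 1 means the variable is nearly uncorrelated with the others in that state; large values signal multicollinearity.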
[Table: columns "Query Class" and "Cost Estimation Model with Qualitative Variable (i.e., Multi-States Cost Models)"; only the exponents of the coefficient entries survive.]
cost models were derived by applying the static query sampling method to sampling data obtained from a dynamic environment (Static Approach 2). This in effect restricts the multi-states query sampling method to consider only one contention state.

cost model type     coeff. of total    std. error of    average     very good    good
                    determination      estimation       cost        estimates    estimates

multi-states (3)    0.972              0.157e+2         0.528e+2    55%          78%
one-state           0.798              0.363e+2         0.511e+2    30%          58%
static              0.972              0.672e+0         0.290e+1     3%           5%

multi-states (6)    0.994              0.997e+1         0.620e+2    60%          76%
one-state           0.779              0.620e+2         0.690e+2    24%          48%
static              0.986              0.733e+0         0.359e+1     7%          14%

multi-states (3)    0.996              0.230e+3         0.735e+3    37%          62%
one-state           0.910              0.254e+3         0.431e+3    27%          45%
static              0.992              0.116e+2         0.381e+2     9%          13%

multi-states (3)    0.982              0.160e+2         0.680e+2    69%          81%
one-state           0.876              0.576e+2         0.865e+2    35%          60%
static              0.999              0.917e-1         0.402e+1     3%           6%

multi-states (6)    0.993              0.143e+2         0.873e+2    63%          74%
one-state           0.901              0.672e+2         0.108e+3    35%          62%
static              0.999              0.301e+0         0.493e+1     4%           8%

multi-states (4)    0.999              0.148e+3         0.998e+3    51%          67%
one-state           0.951              0.507e+3         0.882e+3    22%          44%
static              0.999              0.503e+1         0.492e+2     0%           1%

Table 5. Statistics for Cost Models (each group of three rows is one query class)

From the experimental results, we can make the following observations:

- The multi-states query sampling method presented in this paper can derive good cost models in a dynamic environment. The coefficients of total determination in Table 5 indicate that the derived models capture 98.9% of the variation in query cost on average. The standard errors of estimation are acceptable compared with the magnitude of the average cost of the relevant sample queries (only 22% of the average costs on average). Statistical F-tests at significance level [...] were also conducted, which showed that all cost models are useful for estimating query costs in a dynamic environment.

- The (static) cost models derived by the static query sampling method for a static environment (i.e., Static Approach 1) are not suitable for estimating query costs in a dynamic environment. Although such cost models may have good coefficients of total determination (99.1% on average in Table 5) for the sampling data in a static environment, they can hardly give good cost estimates in a dynamic environment (only 7.8% good cost estimates on average in Table 5 for the test queries in our experiments).

- The (multi-states) cost models derived by using the multi-states query sampling method for a dynamic environment significantly improve on the (one-state) cost models derived by applying the static query sampling method to the dynamic environment (i.e., Static Approach 2). In fact, compared with the one-state cost models, the multi-states cost models increase the number of very good cost estimates (i.e., with relative errors within 0.3) and the number of good cost estimates (i.e., within one time range) by 27.0 and 20.2 percentage points (on average), respectively, for the test queries. Figures 6 5 show comparisons among the observed costs, the costs estimated by the multi-states cost models, and the costs estimated by the one-state cost models for the test queries in a dynamic environment.

- The more contention states are considered, the better the derived cost model usually is. For example, the coefficients of total determination for the cost models for query class [...] with 1 to 6 contention states are 0.7788, 0.9636, 0.9674, 0.9899, 0.9922, respectively. However, the improvement may be very small after the number of contention states reaches a certain point. Table 5 shows that considering 3 to 6 contention states for a dynamic environment is usually sufficient to obtain a good cost model.

- Like static techniques [3, 17], the multi-states query sampling method usually gives worse cost estimates for small-cost queries than for large-cost queries. The main reason for this is that even a small momentary change in the system environment may have a significant impact on the cost of a small-cost query. It is not easy to capture all such small environmental changes in a cost model. Fortunately, estimating the costs of small-cost queries is not as important as estimating the costs of large-cost queries, because it is more important to identify large-cost queries so that inefficient execution plans can be avoided.

- The contention states determination algorithm IUPMA works well for both uniformly-distributed and clustered probing query costs, while algorithm ICMA can determine an even better set of system contention states for the clustered cases. Note that the sampled probing query costs were drawn by following the distribution of the contention level in a dynamic environment. In fact, the experimental results shown in Tables 4 to 5 and Figures 4 to 9 were obtained for the uniform case. Extensive experiments were also conducted for clustered cases. The experimental results showed that, for a given query class, the cost model derived in the clustered cases is usually better than the one derived for the uniform case even if IUPMA is used. This is because the cost models for the clustered cases only need to capture the performance behavior of queries in more focused and narrower subrange(s) of the contention level. Table 6 shows some typical experimental results for a query class in a dynamic environment with clustered contention levels (see Figure 10 for the relevant frequency distribution of the contention level).
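The averages quoted in these observations can be re-derived from the Table 5 entries. The sketch below is only an arithmetic check: it assumes that the first two numeric columns of Table 5 are the coefficient of total determination and the standard error of estimation (their header symbols did not survive extraction), and that "on average" means a plain mean over the six query classes.

```python
# Rows transcribed from Table 5, one per query class:
# (coeff. of total determination, standard error of estimation,
#  average cost, % very good estimates, % good estimates)
multi_states = [
    (0.972, 0.157e+2, 0.528e+2, 55, 78),
    (0.994, 0.997e+1, 0.620e+2, 60, 76),
    (0.996, 0.230e+3, 0.735e+3, 37, 62),
    (0.982, 0.160e+2, 0.680e+2, 69, 81),
    (0.993, 0.143e+2, 0.873e+2, 63, 74),
    (0.999, 0.148e+3, 0.998e+3, 51, 67),
]
one_state = [
    (0.798, 0.363e+2, 0.511e+2, 30, 58),
    (0.779, 0.620e+2, 0.690e+2, 24, 48),
    (0.910, 0.254e+3, 0.431e+3, 27, 45),
    (0.876, 0.576e+2, 0.865e+2, 35, 60),
    (0.901, 0.672e+2, 0.108e+3, 35, 62),
    (0.951, 0.507e+3, 0.882e+3, 22, 44),
]
static = [
    (0.972, 0.672e+0, 0.290e+1, 3, 5),
    (0.986, 0.733e+0, 0.359e+1, 7, 14),
    (0.992, 0.116e+2, 0.381e+2, 9, 13),
    (0.999, 0.917e-1, 0.402e+1, 3, 6),
    (0.999, 0.301e+0, 0.493e+1, 4, 8),
    (0.999, 0.503e+1, 0.492e+2, 0, 1),
]

def mean(values):
    """Plain arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)

# Average share of cost variation captured by each family of models.
avg_r2_multi = mean(r[0] for r in multi_states)
avg_r2_static = mean(r[0] for r in static)

# Standard error of estimation relative to the average cost (multi-states).
avg_se_ratio = mean(r[1] / r[2] for r in multi_states)

# Gain of multi-states over one-state models, in percentage points.
gain_very_good = mean(r[3] for r in multi_states) - mean(r[3] for r in one_state)
gain_good = mean(r[4] for r in multi_states) - mean(r[4] for r in one_state)

# Share of good estimates given by the static models.
avg_good_static = mean(r[4] for r in static)
```

Under these assumptions the computed means agree with the figures quoted in the text (98.9%, 99.1%, 22%, 27.0, 20.2 and 7.8%), which also supports the column assignment used here.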
[Figures 4 to 9: query cost (elapsed time in sec.) versus number of result tuples (x 10^5). Solid line: observed cost; dashed line (o): estimated cost by the qualitative approach (multi-states); dotted line (+): estimated cost by the static approach (one-state).]

Figure 4. Costs for Test Queries in [...] on DB2 5.0
Figure 5. Costs for Test Queries in [...] on Oracle 8.0
Figure 6. Costs for Test Queries in [...] on DB2 5.0
Figure 7. Costs for Test Queries in [...] on Oracle 8.0
Figure 8. Costs for Test Queries in [...] on DB2 5.0
Figure 9. Costs for Test Queries in [...] on Oracle 8.0
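The kind of comparison plotted in Figures 4 to 9 can be mimicked on synthetic data. The sketch below is a toy under loudly stated assumptions: the cost function, the contention level in [0, 1), and the model form (an intercept and a slope that both shift with an indicator variable per contention state) are hypothetical, and the states come from a plain uniform partition of the contention level, not from the paper's IUPMA or ICMA algorithms.

```python
import numpy as np

def design_matrix(x, state, n_states):
    """Intercept and slope for x, each allowed to differ per contention state."""
    cols = [np.ones_like(x), x]
    for s in range(1, n_states):           # state 0 serves as the baseline
        z = (state == s).astype(float)     # indicator variable for state s
        cols += [z, z * x]
    return np.column_stack(cols)

def fit_r2(A, y):
    """Least-squares fit; returns the coefficient of total determination."""
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())

def states_from_level(level, n_states):
    """Uniform partition of [0, 1) into n_states equal subranges."""
    return np.minimum((level * n_states).astype(int), n_states - 1)

# Hypothetical workload: cost grows with result size, faster under contention.
rng = np.random.default_rng(1)
n = 300
x = rng.uniform(0.0, 2.0, size=n)          # result tuples, in units of 10^5
level = rng.uniform(0.0, 1.0, size=n)      # assumed contention level
y = (5 + 40 * level) + (100 + 400 * level) * x + rng.normal(0, 5, size=n)

# One-state model (m = 1) versus multi-states models (m = 3 and m = 6).
r2_by_states = {
    m: fit_r2(design_matrix(x, states_from_level(level, m), m), y)
    for m in (1, 3, 6)
}
```

Because the six-state partition refines the three-state one, the fits are nested and the coefficient of total determination cannot decrease as states are added; in this toy setting most of the gain already appears at three states, in line with the trend reported for the measured models.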
states           # of      coeff. of total    std. error of    average     very good    good
determination    states    determination      estimation       cost        estimates    estimates

IUPMA            3         0.978              0.128e+2         0.488e+2    58%          82%
ICMA             3         0.991              0.740e+1         0.465e+2    82%          95%

Table 6. Statistics for Cost Models in a Clustered Case

[Figure; vertical axis: Frequency]

environment. Usually, considering a small number of contention states is sufficient to yield a good cost model.

Although dynamic environmental factors have significant effects on query cost, they were ignored in most existing cost models for MDBSs or other database systems due to lack of appropriate techniques. This paper introduces a promising approach to tackling the problem. However, further research needs to be done in order to fully solve all relevant issues.

References