Abstract: In this paper, a model and a solution procedure are developed for the File Allocation and Join
Site Selection Problem with 2-way Join [FAJSP-2], defined on a telecommunications network. This
problem attempts to integrate the file allocation and query optimization aspects of a distributed computing
system. By allowing for queries that require processing up to two file types, this problem is designed
to determine simultaneously the number and location of file types and the location of join operations. This
problem is modeled as a mixed-integer linear program, for which a fast dual-ascent approximation
procedure is developed. Extensive computational results are presented which demonstrate that our
dual-ascent procedure is able to solve even large-scale problems to near optimality quickly. © 1999 John
Wiley & Sons, Inc. Networks 33: 109–124, 1999
110 MURTHY AND SEO
Fig. 2. Relations.
tioned database management systems. Gavish and Segev [14] defined a special case of the distributed query optimization problem that consists of queries involving set operations (e.g., set difference and set intersection) between sets of tuples that are geographically dispersed. They presented a mixed-integer programming model with three heuristic procedures and a plant location-based lower-bounding procedure to solve the model.

It is evident from the preceding discussion that while FAP and query optimization have been extensively researched, they have been treated largely as separate problems. In query optimization, while determining the sequence of join operations and their geographic sites, the locations of database files or permanent relations are assumed to be given. Similarly, research on FAP has largely ignored the issues associated with the query access plan. It is either assumed that all queries require access to a copy of a single database file or that when queries require several files the copies are simply sent to the user site where all operations are executed. However, the fact is that FAP and query optimization are highly interdependent. Surely, the location of the files can strongly impact the right sequence and location of join operations. Conversely, the selection of the join site (in query optimization) has a major impact on the query communication costs.

To illustrate this interdependency, consider the example similar to that found in [1]. Figure 2 shows the nature of the three relations (or file types): WINE_USA, WINE_FRANCE, and WEATHER. The first two relations contain tuples that represent a wine for which the grapes were grown in a certain area, picked in a certain year, and bottled by a certain producer. The relation WEATHER contains attributes YEAR, AREA, COUNTRY, and SUN, where SUN represents the hours of sun in a given area and year. Figure 3 shows the DCS network where copies of these relations are to reside.

Let the sizes of relations WINE_USA, WINE_FRANCE, and WEATHER be 12,000, 15,000, and 18,000 megabytes,
tion 6, we discuss the connections that [FAJSP-2] has with these other problems and conclude this paper.

2. MODEL FORMULATION

A distributed database system typically consists of many database files/fragments and hundreds of different queries originating from various sites in the distributed environment. As seen in Figure 1, processing a query may sometimes require more than two file copies. Although ideally one would want a mathematical model for this most general case, the ensuing model (in all probability) will be far too complex. Consequently, in [FAJSP-2], we concentrate on the allocation of files and the selection of join sites in systems that require the joining of two files only.

At first sight, this may appear to be rather limiting. We, however, believe that the study of [FAJSP-2] is of considerable importance for two reasons: First, as observed by Ram and Narasimhan [24] and Yu et al. [31], most queries require processing of at most two relations. More importantly, we think that the study of [FAJSP-2] ought to be viewed as a first step toward solving the most general case (i.e., one with queries requiring more than two files). Observe that even in the general case queries are processed by joining two relations at a time in sequence. This suggests several possibilities of obtaining good solutions to the general problem by suitably decomposing it into several [FAJSP-2] problems. In short, we believe that the study of [FAJSP-2] will provide valuable insights into the solution of the more general problem.

In our model, it is assumed that individual nodes in the distributed system are sufficiently large to accommodate all processing, storage, and communications requirements. Capacity restrictions therefore are not considered here. Many difficulties arise when one tries to seamlessly integrate different types of processors into a single, logically integrated system. Low-level interfacing problems limit system flexibility and hamper the efficiency of operations. Homogeneous computer systems (i.e., systems supported by a set of identical computers) have been suggested as a solution to these problems [21]. In this research, we assume such a system.

A copy of a program file, one that performs the query/update operations, is assumed to reside at each node, since such files are relatively small and updates on them are quite infrequent. The following notations are used to develop the model for [FAJSP-2]:

\(M = (I \times Q)\)    index set of all user location and query-type combinations. Henceforth, an index \(m \in M\) will be referred to as simply user-query
\(D\)    index set of all file types
\(f_m\)    frequency of user-query m
\(a_m\)    the first file type used to process user-query m
\(b_m\)    the second file type used to process user-query m
\(S_d\)    size of file d (megabytes)
\(S_m\)    size of the result relation/file of user-query m (megabytes)
\(U_i^d\)    amount of updates from node i to file type d (megabytes)
\(s_j\)    unit storage cost at node j ($/megabyte)
\(r_d\)    reduction factor associated with file type d
\(c_{jk}\)    unit communication cost between nodes j and k ($/megabyte)
\(M_d \subseteq M\)    index set of all user-queries m that need to access file type d, i.e., for each \(m \in M_d\), either \(d = a_m\) or \(d = b_m\)

There are two cost components in [FAJSP-2]: the query communication cost and the update/storage cost. Consider a user-query m that is associated with a query q originating from node i. This query requires copies of two file types, \(a_m\) and \(b_m\), which are processed by a simple join operation. Suppose further that user-query m uses a copy of file type \(a_m\) that resides at node \(j_1\) and a copy of file type \(b_m\) that resides at node \(j_2\). Also, let the join operation take place at some node k. The communication cost for this query is computed as follows: First, reduced copies of \(a_m\) and \(b_m\) are sent from locations \(j_1\) and \(j_2\), respectively, to join site k. Using reducer programs consisting of unary operations and semijoins, the files are appropriately reduced by a factor r. Hence, the cost of transmitting data from nodes \(j_1\) and \(j_2\) to node k is

\[ \left[ S_{a_m} \cdot r_{a_m} \cdot c_{j_1 k} + S_{b_m} \cdot r_{b_m} \cdot c_{j_2 k} \right]. \]

At node k, a relation of size \(S_m\) is obtained after the join operation. This result is then transmitted to the user location i at a cost of \(S_m \cdot c_{ki}\). Assuming that the frequency of user-query m is \(f_m\), the optimal communication cost and the join site for file type \(a_m\) from location \(j_1\) and file type \(b_m\) from location \(j_2\) can be determined in linear time as
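The minimization referred to here, equation (1), scans the candidate join sites k once, which is why it is linear in the number of nodes. The Python sketch below illustrates that scan. It assumes that (1) is the frequency-weighted minimum over k of the three transmission terms just described; the function name `cheapest_join_site`, the 0-based node indices, and the zero cost for a transfer within one node are illustrative assumptions, not part of the paper. The data are the figures quoted for user-query m = 2 in the example of Section 4.

```python
from fractions import Fraction


def cheapest_join_site(f_m, user, j1, j2, reduced_a, reduced_b, result_size, c):
    """Return (cost, k) for one user-query, scanning every candidate join site k once.

    reduced_a and reduced_b are the already reduced file sizes (S * r), and
    c[j][k] is the unit communication cost from node j to node k.
    """
    def cost_at(k):
        ship_inputs = reduced_a * c[j1][k] + reduced_b * c[j2][k]
        ship_result = result_size * c[k][user]
        return ship_inputs + ship_result

    k_best = min(range(len(c)), key=cost_at)  # ties go to the lowest-numbered node
    return f_m * cost_at(k_best), k_best


# Four nodes with a flat tariff of $0.04/Mb between distinct nodes and
# (an assumption) zero cost when sender and receiver are the same node.
tariff = Fraction(4, 100)
c = [[0 if j == k else tariff for k in range(4)] for j in range(4)]

# User-query m = 2: user at node 2, file type 1 at node 2, file type 3 at node 3.
# Nodes are 0-indexed here, so node n in the text is index n - 1.
cost, k = cheapest_join_site(
    f_m=35, user=1, j1=1, j2=2,
    reduced_a=10200, reduced_b=13500, result_size=2844, c=c,
)
print(float(cost), k + 1)
```

Exact fractions are used so that the result can be compared with the tabulated dollar figures without floating-point noise; truncating the returned cost to whole dollars gives the $18,261 quoted in Section 4, with join site 3.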
Associated with each copy of file d at location j is an update communication and storage cost. Given that the amount of updates from a user location i to file type d is \(U_i^d\) and the unit storage cost at node j is \(s_j\), the update and storage cost is

\[ F_j^d = \sum_{i \in I} \left( U_i^d \cdot c_{ij} \right) + s_j \cdot S_d. \tag{2} \]

The following decision variables describe the decisions made in [FAJSP-2]:

\(x_{m j_1 j_2}\) = 1, if file \(a_m\) at \(j_1\) and file \(b_m\) at \(j_2\) are used to satisfy user-query m,
        = 0, otherwise.
\(y_j^d\) = 1, if a copy of file type d is stored at node j,
        = 0, otherwise.

We now present the following integer programming model for [FAJSP-2]:

(P) Minimize

\[ \sum_{m \in M} \sum_{j_1 \in N} \sum_{j_2 \in N} C_{m j_1 j_2} \cdot x_{m j_1 j_2} + \sum_{d \in D} \sum_{j \in N} F_j^d \cdot y_j^d \]

Subject to:

\[ \sum_{j_1} \sum_{j_2} x_{m j_1 j_2} = 1 \quad \text{for all } m \in M \tag{3} \]

\[ \sum_{j_1} x_{m j_1 j_2} \le y_{j_2}^d \quad \text{for all } m \in M,\ j_2 \in N,\ d = b_m \tag{4} \]

\[ \sum_{j_2} x_{m j_1 j_2} \le y_{j_1}^d \quad \text{for all } m \in M,\ j_1 \in N,\ d = a_m \tag{5} \]

\[ x_{m j_1 j_2} \in \{0, 1\}, \text{ and } y_j^d \in \{0, 1\} \quad \text{for all } m \in M,\ j_1 \in N,\ j_2 \in N,\ d \in D. \tag{6} \]

In the objective function, the first term represents the query communication cost, while the second term represents the composite update and storage costs. Constraint set (3) describes the requirement that each user-query m needs to access the first file type \(a_m\) from some location \(j_1\) and the second file type \(b_m\) from some location \(j_2\). Constraint set (4) describes the condition that if file d is not stored at location \(j_2\), no user-query m for which file type d is \(b_m\) can access it from location \(j_2\). Similarly, constraint set (5)

3. THE DUAL-ASCENT PROCEDURE

In this section, we discuss in detail the development of a dual-ascent procedure for [FAJSP-2]. A dual-ascent procedure is a structured approach for solving the LP dual or the Lagrangian dual. It solves the dual in a manner that ensures a monotonic improvement in the dual value. This is in contrast to the more popular subgradient procedure, which can take a longer time to converge, since the dual value often regresses during the course of its execution. Also, in the dual-ascent procedure, good primal solutions can often be easily identified using complementary slackness conditions.

A number of highly successful dual-ascent schemes have been developed in the past for solving large-scale integer programming problems [3, 8, 10, 30]. It is interesting to note that in all these problems the formulation consists of a set of constraints that are totally unimodular and a set of variable upper bound (VUB) constraints. It was found that their LP relaxations frequently yielded natural integer solutions or provided bounds close to the optimal. The formulation (P) bears a striking resemblance to these problems. Consequently, it is expected that good lower bounds may be obtained by solving the LP relaxation of (P).

Our dual-ascent procedure consists of three iterative steps: (1) a dual-ascent step, (2) a primal step, and (3) a dual-adjustment step. In the dual-ascent step, the dual of the LP relaxation of (P) is rapidly solved to near optimality by taking advantage of its special structure. In the primal step, a feasible solution to (P) is obtained from the dual solution by ensuring that certain complementary slackness conditions are satisfied. During this process, if all complementary slackness conditions are satisfied, we stop with the optimal solution. If not, using the complementary slackness violations as a guide, certain dual values are adjusted so as to enable a further increase in the dual value. This is the dual-adjustment step. These three steps are repeated until no further improvement in the primal or dual solution occurs. Finally, if a gap still exists between the primal and dual values, then a fast add-drop heuristic is executed to improve the primal solution. This add-drop heuristic is very similar in approach to that described in Balakrishnan et al. [3] or Fisher et al. [10]. Hence, we have not provided details here, which can be found in Seo [27].

3.1. The Dual-ascent Step

Consider the following LP dual of (P):
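(D) Maximize

\[ \sum_{m \in M} \nu_m \]

Subject to:

\[ \nu_m \le C_{m j_1 j_2} + v_{m j_1}^{a_m} + v_{m j_2}^{b_m} \quad \text{for all } m \in M,\ j_1 \in N,\ j_2 \in N \tag{7} \]

\[ \sum_{m \in M_d} v_{m j}^d \le F_j^d \quad \text{for all } d \in D,\ j \in N \tag{8} \]

\[ v_{m j}^d \ge 0 \quad \text{for all } d \in D,\ m \in M,\ j \in N \tag{9} \]

STEP 0. [INITIALIZATION]
Set \(S_j^d \leftarrow F_j^d\) for all \(d \in D\), \(j \in N\).
Set \(v_{m j}^d \leftarrow 0\), \(\nu_m \leftarrow 0\) for all \(d \in D\), \(j \in N\), and \(m \in M\).
For each \(m \in M\), do the following steps:
STEP 1. [INCREASE DUAL \(\nu_m\)]
1.1. Set \(v_{m j_1}^{a_m} \leftarrow v_{m j_1}^{a_m} + S_{j_1}^{a_m}\) for all \(j_1 \in N\).
1.2. Set \(v_{m j_2}^{b_m} \leftarrow v_{m j_2}^{b_m} + S_{j_2}^{b_m}\) for all \(j_2 \in N\).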
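Step 0 takes the fixed costs \(F_j^d\) of equation (2) as its starting slacks. As a hedged illustration, the Python sketch below evaluates (2) for the four-node example of Section 4 and then performs the Step 0 initialization. It assumes that the update volume \(U_i^d\) equals 30% of the query frequency times the full file size (which is how the worked arithmetic in Section 4 reads), that a transfer within one node costs nothing, and that all names (`fixed_cost`, `QUERIES`, and so on) are hypothetical.

```python
from fractions import Fraction

NODES = [1, 2, 3, 4]
FILE_SIZE = {1: 12000, 2: 15000, 3: 18000}          # S_d in Mb
# user-query m -> (user node i, frequency f_m, file types (a_m, b_m))
QUERIES = {1: (1, 13, (1, 3)), 2: (2, 35, (1, 3)),
           3: (3, 97, (2, 3)), 4: (4, 90, (2, 3))}
TARIFF = Fraction(4, 100)        # c_ij for i != j ($/Mb); c_ii = 0 is assumed
STORAGE = Fraction(1, 10)        # s_j ($/Mb), the same at every node
UPDATE_SHARE = Fraction(3, 10)   # update frequency as a share of query frequency


def fixed_cost(d, j):
    """Equation (2): update traffic shipped to node j plus storage of file d."""
    update = sum(
        UPDATE_SHARE * f * FILE_SIZE[d] * TARIFF
        for (i, f, files) in QUERIES.values()
        if d in files and i != j          # updates issued at node j stay local
    )
    return update + STORAGE * FILE_SIZE[d]


F = {d: [int(fixed_cost(d, j)) for j in NODES] for d in FILE_SIZE}

# Step 0: every slack starts at its fixed cost; all dual variables start at zero.
slack = {d: list(F[d]) for d in F}
v = {(m, d): [0] * len(NODES)
     for m, (_, _, files) in QUERIES.items() for d in files}
nu = {m: 0 for m in QUERIES}
print(F)
```

Under these assumptions the computed `F` agrees with the update and storage costs listed in part (b) of Table I, for example 31,608 for file type 3 at node 3.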
is, \(L_d^+\) defines the set of all locations where a copy of file type d is stored. Let \(L_d^*\) be defined as

\[ L_d^* = \{ j \in N \mid S_j^d = 0 \} \quad \text{for all } d \in D. \tag{16} \]

Once the set \(L_d^+\) is constructed, the x values are obtained as follows: For each m, we set that \(x_{m j_1' j_2'} = 1\) such that

\[ C_{m j_1' j_2'} = \min_{j_1 \in L_{a_m}^+,\ j_2 \in L_{b_m}^+} \{ C_{m j_1 j_2} \mid C_{m j_1 j_2} = \nu_m \}. \tag{18} \]
\(j_1'\) and \(j_2'\) whose corresponding slacks \(S_{j_1'}^{a_m}\) and \(S_{j_2'}^{b_m}\) have been increased, respectively. It is worth noting that \(L_{a_m}^S \subseteq L_{a_m}^*\) and \(L_{b_m}^S \subseteq L_{b_m}^*\). We now obtain a restricted set \(M_1 \subset M\), defined as the set of all user-queries \(m' \ne m\), such that either \(L_{a_{m'}}^* \subset L_{a_m}^S\) or \(L_{b_{m'}}^* \subset L_{b_m}^S\), or both. The dual-ascent step is first executed over the restricted set \(M_1\). Then, the set \(M_1\) is enlarged to include m and the dual-ascent step is executed over this enlarged set. This marks the end of the dual-adjustment step. A formal statement of the dual-adjustment step is presented in the Appendix.

The idea behind executing the dual-ascent step over the restricted set \(M_1\) is best illustrated by the following simple example: Let \(v_{m j_1}^{a_m} = \Delta\) violate complementary slackness condition (14). Further, upon setting \(v_{m j_1}^{a_m} = 0\), we get \(L_{a_m}^S = \{ j_1, j_1' \}\), with \(S_{j_1}^{a_m}\) and \(S_{j_1'}^{a_m}\) each increased by \(\Delta\), while \(L_{b_m}^S = \{\emptyset\}\). Of course, \(\nu_m\) also decreases by \(\Delta\). Now suppose there exists a user-query \(m_1\) such that \(L_{a_{m_1}}^* = \{ j_1 \}\) with \(a_{m_1} = a_m\) and another user-query \(m_2\) such that \(L_{a_{m_2}}^* = \{ j_1' \}\) with \(a_{m_2} = a_m\). Clearly, both \(m_1\) and \(m_2\) qualify to belong to the set \(M_1\). Also, due to Lemma 1, \(\min_{j_2} \{ C_{m j_1 j_2} \} = \nu_{m_1}\) and \(\min_{j_2} \{ C_{m j_1 j_2} \} = \nu_{m_2}\). For the sake of simplicity, let us assume that (a) \(S_j^{a_{m_1}} \ge \Delta\) for all \(j \ne j_1\) and (b) \(S_j^{a_{m_2}} \ge \Delta\) for all \(j \ne j_1'\). Now, if the dual-ascent step is executed simply over \(m_1\) and \(m_2\), then both \(\nu_{m_1}\) and \(\nu_{m_2}\) will increase by \(\Delta\). Thus, even though \(\nu_m\) decreases by \(\Delta\), the net increase in the dual value is \(\Delta\). Of course, if \(S_j^{a_{m_1}} < \Delta\) for some \(j \ne j_1\) and \(S_j^{a_{m_2}} < \Delta\) for some \(j \ne j_1'\), then both \(\nu_{m_1}\) and \(\nu_{m_2}\) will increase by an amount less than \(\Delta\). However, this will result in \(S_{j_1}^{a_m}\) and \(S_{j_1'}^{a_m}\) both being positive after the initial dual-ascent step over \(M_1\). By including m into \(M_1\), the dual variable \(\nu_m\) also increases. Thus, the inclusion of m into \(M_1\) ensures that the dual at least reaches its original value.

4. AN EXAMPLE

We now illustrate the workings of the dual-ascent procedure by solving an example based on the data and network shown in Figures 2 and 3, respectively. To describe the problem further, let the nodes Chicago, San Francisco, New York, and Paris be indexed as 1, 2, 3, and 4, respectively. Similarly, the file types WINE_USA, WINE_FRANCE, and WEATHER are indexed as 1, 2, and 3, with respective file sizes of 12,000, 15,000, and 18,000 Mb. We now consider four user-queries: The users for user-queries m = 1, 2, 3, and 4 are, respectively, located at nodes i = 1, 2, 3, and 4. For m = 1 and m = 2, the queries require the join of file types 1 and 3, that is, \(a_1 = 1\), \(b_1 = 3\), \(a_2 = 1\), and \(b_2 = 3\). The queries associated with m = 3 and m = 4 require the join of file types 2 and 3, that is, \(a_3 = 2\), \(b_3 = 3\), \(a_4 = 2\), and \(b_4 = 3\). The query frequencies of the four user-queries are \(f_m\) = [13, 35, 97, 90]. Further, as discussed in Section 2, before performing a join operation, the relevant files are reduced by a factor r before shipping them to the join site. For m = 1 and 2, \(r_1 = 0.85\) and \(r_3 = 0.75\) (file type 1 is reduced by 15% and file type 3 is reduced by 25%), while for m = 3 and 4, \(r_2 = 0.35\) and \(r_3 = 0.35\). Similarly, the size of the result relation that is obtained after the join operation also undergoes a reduction operation. For m = 1 and 2, the reduction factor is 0.12 (the result relation is 12% of the sum of the reduced files 1 and 3), and for m = 3 and 4, the reduction factor is 0.28. Finally, the network is assumed to be packet-switched. Thus, the communication tariffs are independent of the distance between communicating sites. Here, all unit communication costs are assumed to be $0.04/Mb.

Table I presents the query costs associated with each user-query m and the location of file types \(a_m\) and \(b_m\). The query costs are calculated using (1). The table also presents, in parentheses, the optimal join site k associated with each of the query costs. It is instructive to note here that since the unit communication costs are the same for all links the optimal join site simply depends on the relative reduced file sizes, \((r_{a_m} \cdot S_{a_m})\), \((r_{b_m} \cdot S_{b_m})\), and \(S_m\). In particular, it is easy to see that the optimal join site k will either be at \(j_1\), \(j_2\) or at the user location i. Consider, for instance, the case where m = 2, \(j_1 = 2\), \(j_2 = 3\). This query originates from node 2 and involves the join of file types 1 and 3. Hence, \(r_1 \cdot S_1\) = 10,200 Mb, \(r_3 \cdot S_3\) = 13,500 Mb, and the result size is \(S_2\) = 0.12(10,200 + 13,500) = 2844 Mb. Here, since \((r_1 \cdot S_1 + S_2) < r_3 \cdot S_3\), the optimal join site is k = 3, that is, the cheapest option is to send reduced file type 1 from node 2 to node 3, execute the join with file type 3 there, and then send the result relation back to node 2. The resulting query communication cost is computed to be 0.04 · 35 · (10,200 + 2844) = $18,261. On the contrary, in the case of m = 3, \(j_1 = 3\), \(j_2 = 1\), the optimal join site is k = 3. To satisfy this query originating from node 3, the reduced file type 3 is sent from node 1 to node 3. At node 3, the join of file types 2 and 3 takes place, and therefore the query is satisfied locally.

Table I also presents the sum of update and storage costs associated with each file type and location. To compute the update costs, the frequency of updates on a file type from a user location is assumed to be 30% of the frequency of queries using that file. The storage cost is assumed to be $0.1/Mb. Thus, for example, the update and storage cost associated with file type 3 at location 3 is 0.3 · 0.04 · (13 + 35 + 90) · 18,000 + 0.1 · 18,000 = $31,608. Here, file type 3 is used for user-queries 1, 2, 3, and 4. However, since this file resides at node 3, to determine the update communication cost, we only consider the frequencies associated with user-queries 1, 2, and 4.

In the dual-ascent procedure, we first initialize all slacks \(S_j^d\) to equal the corresponding update and storage costs \(F_j^d\) and the dual variables \(v_{m j}^d\) and \(\nu_m\) to zero. The next three steps are executed iteratively for each \(m \in M\): For m = 1, in Step 1, \((v_{1 j_1}^1)\) = [6240, 3072, 8112, 8112] and \((v_{1 j_2}^3)\)
= [49,752, 45,000, 31,608, 33,120], while the slacks \((S_{j_1}^1)\) and \((S_{j_2}^3)\) are set to zero. The following revised cost matrix is obtained (rows indexed by \(j_2\), columns by \(j_1\)):

\[ (C_{1 j_1 j_2}) = \begin{bmatrix} 55{,}992 & 58{,}128 & 63{,}168 & 63{,}168 \\ 58{,}022 & 49{,}550 & 59{,}894 & 59{,}894 \\ 44{,}630 & 41{,}462 & 41{,}198 & 46{,}502 \\ 46{,}142 & 42{,}974 & 48{,}014 & 42{,}710 \end{bmatrix}. \]

Minimizing over the \((C_{1 j_1 j_2})\) matrix above, we get \(\nu_1\) = 41,198. Next, in Step 2, we increase the slacks \((S_{j_1}^1)\) without decreasing \(\nu_1\) as follows: First, C* = [44,630, 41,462, 41,198, 42,710] is obtained by minimizing over each column (or \(j_1\)) of the \((C_{1 j_1 j_2})\) matrix. Next, as per Step 2.2, \(\Delta_1 = C^* - \nu_1\) = [3432, 264, 0, 1512] and \(\Delta = \min\{\Delta_1, (v_{1 j_1}^1)\}\) = [3432, 264, 0, 1512] are obtained. It should now be apparent that \(\Delta\) represents the maximum amount by which each column in \((C_{1 j_1 j_2})\) can be decreased without decreasing \(\nu_1\). Thus, as per Step 2.3, this \(\Delta\) amount is transferred from \((v_{1 j_1}^1)\) to \((S_{j_1}^1)\), resulting in \((v_{1 j_1}^1)\) = [2808, 2808, 8112, 6600] and \((S_{j_1}^1)\) = [3432, 264, 0, 1512]. Similarly, in Step 3, the slacks \((S_{j_2}^3)\) are increased without decreasing \(\nu_1\). The \((C_{1 j_1 j_2})\) matrix used is

\[ (C_{1 j_1 j_2}) = \begin{bmatrix} 52{,}560 & 57{,}864 & 63{,}168 & 61{,}656 \\ 54{,}590 & 49{,}286 & 59{,}894 & 58{,}382 \\ 41{,}198 & 41{,}198 & 41{,}198 & 44{,}990 \\ 42{,}710 & 42{,}710 & 48{,}014 & 41{,}198 \end{bmatrix}. \]

To obtain C*, we now determine the minimum value in each row of \((C_{1 j_1 j_2})\). Thus, C* = [52,560, 49,286, 41,198, 41,198]. As per Step 3.2, \(\Delta_1\) = [11,362, 8088, 0, 0] and \(\Delta\) = [11,362, 8088, 0, 0]. As a result, \((v_{1 j_2}^3)\) = [38,390, 36,912, 31,608, 33,120] and \((S_{j_2}^3)\) = [11,362, 8088, 0, 0].

The next iterative step with m = 2 begins by setting \((v_{2 j_1}^1)\) = [3432, 264, 0, 1512] and \((v_{2 j_2}^3)\) = [11,362, 8088, 0, 0], while \((S_{j_1}^1)\) and \((S_{j_2}^3)\) are set to zero. The resulting revised cost matrix obtained is

\[ (C_{2 j_1 j_2}) = \begin{bmatrix} 18{,}775 & 29{,}887 & 29{,}623 & 31{,}135 \\ 25{,}800 & 8352 & 22{,}368 & 23{,}880 \\ 21{,}693 & 18{,}525 & 3981 & 19{,}773 \\ 21{,}693 & 18{,}525 & 18{,}261 & 5493 \end{bmatrix}. \]

From Step 1.4, \(\nu_2\) = 3981. The reader can now verify that upon executing Steps 2 and 3, \((v_{2 j_1}^1)\) = [0, 0, 0, 0], \((S_{j_1}^1)\) = [3432, 264, 0, 1512], \((v_{2 j_2}^3)\) = [0, 3981, 0, 0], and \((S_{j_2}^3)\) = [11,362, 4107, 0, 0]. For m = 3, we begin by setting \((v_{3 j_1}^2)\) = [35,160, 35,160, 17,700, 18,960] and \((v_{3 j_2}^3)\) = [11,362, 4107, 0, 0], while \((S_{j_1}^2)\) and \((S_{j_2}^3)\) are set to zero. Then, by minimizing over the resulting \((C_{3 j_1 j_2})\) matrix, we obtain \(\nu_3\) = 17,700. Next, we try to increase the slacks \((S_{j_1}^2)\) and \((S_{j_2}^3)\) to \((S_{j_1}^2)\) = [35,160, 34,114, 0, 13,807] and \((S_{j_2}^3)\) = [6209, 0, 0, 0], while setting \((v_{3 j_1}^2)\) = [0, 1046, 17,700, 5153] and \((v_{3 j_2}^3)\) = [5153, 4107, 0, 0]. Finally, with m = 4, we begin by setting \((v_{4 j_1}^2)\) = [35,160, 34,114, 0, 13,807], \((v_{4 j_2}^3)\) = [6209, 0, 0, 0] and \((S_{j_1}^2) = (S_{j_2}^3)\) = [0, . . , 0]. Now, by minimizing \((C_{4 j_1 j_2})\) over \(j_1\) and \(j_2\), we obtain \(\nu_4\) = 11,642. With the execution of Steps 2 and 3, we obtain \((v_{4 j_1}^2)\) = [0, 0, 0, 11,642], \((S_{j_1}^2)\) = [35,160, 34,114, 0, 2165], \((v_{4 j_2}^3)\) = [0, 0, 0, 0] and \((S_{j_2}^3)\) = [6209, 0, 0, 0]. This marks the

TABLE I. Query communication costs and update communication and storage costs

(a) Query Communication Costs ($), with the optimal join site k in parentheses

j2\j1            1             2             3             4

Query Costs ($) for m = 1
1             0(1)       5304(1)       5304(1)       5304(1)
2          6782(2)       1478(2)       6782(2)       6782(2)
3          6782(3)       6782(3)       1478(3)       6782(3)
4          6782(4)       6782(4)       6782(4)       1478(4)

Query Costs ($) for m = 2
1          3981(1)     18,261(1)     18,261(1)     18,261(1)
2        14,280(2)          0(2)     14,280(2)     14,280(2)
3        18,261(3)     18,261(3)       3981(3)     18,261(3)
4        18,261(4)     18,261(4)     18,261(4)       3981(4)

Query Costs ($) for m = 3
1        12,547(1)     32,917(1)     24,444(3)     32,917(1)
2        32,917(2)     12,547(2)     24,444(3)     32,917(2)
3        20,370(3)     20,370(3)          0(3)     20,370(3)
4        32,917(4)     32,917(4)     24,444(3)     12,547(4)

Query Costs ($) for m = 4
1        11,642(1)     30,542(1)     30,542(1)     22,680(4)
2        30,542(2)     11,642(2)     30,542(2)     22,680(4)
3        30,542(3)     30,542(3)     11,642(3)     22,680(4)
4        18,900(4)     18,900(4)     18,900(4)          0(4)

(b) Update Communication and Storage Costs ($)

                       Location (i)
File Type (d)        1          2          3          4
1                 6240       3072       8112       8112
2               35,160     35,160     17,700     18,960
3               49,752     45,000     31,608     33,120
DUAL-ASCENT PROCEDURE FOR FILE-ALLOCATION PROBLEM 119
TABLE II. The set of \((c_{m j_1 j_2} - \nu_m)\) values

= 4, it is a valid location. Hence, \(L_2^+ = \{3\}\) (or \(y_3^2 = 1\)).
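The dual-ascent pass walked through in Section 4 can be replayed with a short script. The Python sketch below is an illustration only: the formal Steps 1.3 through 3.3 are not reproduced here, so Steps 2 and 3 are written as the Section 4 narrative describes them (take the column or row minimum of the revised cost, subtract \(\nu_m\), cap by the current dual variable, and move that amount back to the slack). The names `dual_ascent_pass`, `COST`, and `FILES` are hypothetical; the data are copied from Table I with 0-based indices.

```python
# Table I(a): COST[m][j2][j1], rows indexed by j2 and columns by j1 (0-based).
COST = {
    1: [[0, 5304, 5304, 5304], [6782, 1478, 6782, 6782],
        [6782, 6782, 1478, 6782], [6782, 6782, 6782, 1478]],
    2: [[3981, 18261, 18261, 18261], [14280, 0, 14280, 14280],
        [18261, 18261, 3981, 18261], [18261, 18261, 18261, 3981]],
    3: [[12547, 32917, 24444, 32917], [32917, 12547, 24444, 32917],
        [20370, 20370, 0, 20370], [32917, 32917, 24444, 12547]],
    4: [[11642, 30542, 30542, 22680], [30542, 11642, 30542, 22680],
        [30542, 30542, 11642, 22680], [18900, 18900, 18900, 0]],
}
# Table I(b): fixed costs F[d][j], and the file pair (a_m, b_m) of each user-query.
F = {1: [6240, 3072, 8112, 8112], 2: [35160, 35160, 17700, 18960],
     3: [49752, 45000, 31608, 33120]}
FILES = {1: (1, 3), 2: (1, 3), 3: (2, 3), 4: (2, 3)}
N = range(4)


def dual_ascent_pass(cost, fixed, files):
    """One sweep over all user-queries; returns (nu, v, slack)."""
    slack = {d: list(fixed[d]) for d in fixed}            # Step 0
    nu, v = {}, {}
    for m, (a, b) in files.items():
        # Step 1: hand all remaining slack to this query's dual variables.
        va, vb = slack[a], slack[b]
        slack[a], slack[b] = [0] * len(N), [0] * len(N)

        def revised(j1, j2):
            return cost[m][j2][j1] + va[j1] + vb[j2]

        nu[m] = min(revised(j1, j2) for j1 in N for j2 in N)
        # Step 2: give back to S^a whatever each column j1 can spare.
        for j1 in N:
            delta = min(min(revised(j1, j2) for j2 in N) - nu[m], va[j1])
            va[j1] -= delta
            slack[a][j1] += delta
        # Step 3: the same for S^b, row by row, using the updated va.
        for j2 in N:
            delta = min(min(revised(j1, j2) for j1 in N) - nu[m], vb[j2])
            vb[j2] -= delta
            slack[b][j2] += delta
        v[m] = (va, vb)
    return nu, v, slack


nu, v, slack = dual_ascent_pass(COST, F, FILES)
# Zero-slack locations per file type, as in equation (16), reported 1-based.
zero_slack = {d: [j + 1 for j in N if slack[d][j] == 0] for d in slack}
print(nu, slack, zero_slack)
```

Run on the Table I data, the sketch returns the dual values and final slacks derived by hand in Section 4, and node 3 is the only zero-slack location for file type 2, in line with the statement above.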
[Table columns: Problem No. | No. Nodes | No. Queries | No. y Variables | No. x Variables | No. Constraints]

[Table columns: Problem No. | No. Nodes | No. Files | No. Queries | No. y Variables | No. x Variables | No. Constraints]

frequency of usage, which is randomly generated in the range 10–100. Since the result of joining multiple files is usually less than its sum, the total volume of the query was multiplied by a reduction factor ranging from 0.01 to 0.5. The test problems also assume that the networks are packet-switched. Hence, all unit query communication costs were
set to equal $0.04/kilobits. The update cost, \(CU_j^d\), is the result of transmitting updates from all user locations i to file location j, if file d is kept there. The volume of updates was assumed to be 10, 20, 30, 40, and 50% of the volume of the query that requires file d.

The file sizes (megabytes) and the unit storage costs ($/megabyte) were randomly generated in the ranges [6000–12,000] and [0.04–0.1], respectively. The storage costs, \(CS_j^d\), are the product of the file size and the unit storage cost. The fixed cost \(F_j^d\) is now obtained as the sum of the update and storage costs. For each problem type listed in Tables III and IV, five test cases were generated. These cases differed from each other in the ratio of the query to update traffic. These ratios were 10, 20, 30, 40, and 50%. Tables V and VI present results on the quality of the solutions obtained after the application of the dual-ascent procedure. Also presented are the computational times taken in terms of CPU seconds. Here, solution quality is measured as the percentage duality gap between the best primal value \(Z_P^*\) and the best dual value \(Z_D^*\), that is, % GAP = \((Z_P^* - Z_D^*) \cdot 100 / Z_D^*\).

TABLE V. Solution quality and CPU times: two-files scenario

Problem No.    Average Percentage Duality Gap    Average CPU Time (seconds)
1              0.000                             0.015
2              0.021                             0.231
3              0.124                             0.813
4              0.001                             1.648
5              0.123                             1.582
6              0.000                             4.439

TABLE VI. Solution quality and CPU times: multifiles scenario

Problem No.    Average Gap (%)    Average CPU Time (seconds)
7              0.011              0.026
8              0.041              0.337
9              0.021              0.133
10             0.655              7.14
11             0.435              0.21
12             0.475              8.67
13             0.631              0.11
14             0.014              0.78
15             0.428              0.08
16             0.02               3.56
17             0.023              1.02
18             0.075              24.96
19             0.09               1.15
20             0.9                41.12
21             0.253              1.23
22             0.021              36.46
23             0.093              0.29
24             0.249              32.37
25             0.044              0.38
26             0.25               17.31
27             0.488              5.09
28             0.679              11.94
29             0.152              8.03
30             0.001              8.65
31             0.111              1.06
32             0.012              52.20
33             0.049              3.37
34             0.051              8.15
35             0.053              0.72
36             0.027              36.29

The average percentage gap ranged from 0 to 0.68%. Clearly, these gaps are extremely low, indicating that the dual-ascent procedure is successful in obtaining very close to optimal solutions. From these results, there seems to be no discernible pattern as to how the percentage gap varies with the number of nodes, the number of files, or the number of queries. Equally important, the results indicate that near optimal primal and dual solutions are obtained very swiftly. The average time taken to execute the dual-ascent procedure and the add-drop heuristic ranged from 0.015 to 52.204 seconds. These results suggest that the dual-ascent procedure is indeed very useful as a fast heuristic, which, in addition, also provides performance guarantees, especially for very large sized problems. If an optimal solution is desired, then the dual-ascent procedure can be easily embedded into a branch-and-bound procedure. Since the gaps obtained were uniformly so small, it is conjectured that an optimal solution can be found and verified by enumerating a relatively small number of branch-and-bound nodes. Hence, an optimal solution can be found in a reasonable amount of time.

The performance of the dual-ascent procedure was also compared to MPSX/370 V2. MPSX/370 V2 has one of the most efficient implementations of the simplex method for large-sized LP problems. While comparing the dual-ascent procedure to MPSX/370 V2, the simplex routine OPTIMIZE was used to solve the LP relaxation of (P). This was done to make the comparison a fair one, as both solve the LP relaxation of (P). The efficient implementation features of MPSX/370 V2 are (1) the OPTIMIZE procedure, which uses the LU factorization method, which is most useful when the problem is very large and sparse and which also takes advantage of dynamic alteration of pricing and cycling routines during the optimization process, and (2) the Vector Facility Support feature, which incorporates the use of vectorization in the pricing-out step to speed up the overall run
time. The performance comparison of the dual-ascent procedure to MPSX/370 V2 is presented in Table VII. For illustration purposes, only the problems with two file types were tested. For all the problems tested, the dual-ascent procedure was found to be more than an order of magnitude faster than MPSX. The dual-ascent procedure outperformed MPSX/370 V2 by factors ranging from about 30 to more than 100. More significantly, on close examination, it was found that as the problem size increased the factor by which the dual-ascent procedure outperformed MPSX/370 V2 also increased. This indicates that the time taken by the dual-ascent procedure proposed in this research increases at a slower rate with the problem size than MPSX/370 V2 does.

TABLE VII. A performance comparison of the dual-ascent procedure and MPSX: two-files scenario

                              Average CPU Time in Seconds
No. Nodes    No. Queries    Dual-ascent Procedure    MPSX/370 V2
10           10             0.003                    1.2
10           100            0.015                    7.8
20           20             0.161                    4.8
20           100            0.818                    53.7
30           30             0.265                    17.7
30           100            1.582                    186.9

6. CONCLUSIONS

In this paper, a methodology was proposed that attempts to unify the two aspects of distributed database systems: the file allocation and the query optimization problems. We assume that each query requires the processing of at most two file types. In our methodology, we developed an integer linear program that simultaneously determines the locations where data files are to be stored and the sites where the joins of two files are to occur for each query. We presented a dual-ascent procedure that solves the integer program approximately. Our computational results show that the dual-ascent procedure is able to quickly solve even large-scale problems to near optimality.

While the primary purpose of this paper was to address the specific problem, File Allocation and Join Site Selection Problem with 2-way joins [FAJSP-2], it is indeed interesting to note that this problem is closely related to certain other well-known problems in Distributed Computing Systems and Facility Location. We think that it is important to know these connections since the solution procedure developed for [FAJSP-2] can be suitably modified to solve these other problems also.

A problem in Distributed Computing Systems that [FAJSP-2] is closely related to is the File Allocation and Report Assignment problem introduced by Ramesh and Ryan [25]. In several large corporations, generating and transmitting reports is an important function of the information system department. These reports typically require the pooling and processing of data from several databases that are located at various computer sites. Clearly, the report-generating function has an impact on where the files are placed due to the cost of transmitting the data from the file sites to the location where the report is to be prepared and the cost of transmitting the report to the end user. Given a set of file types needed to prepare a report, the site where the report is prepared is analogous to the join site in [FAJSP-2]. If the number of files needed to prepare a report is limited to two, then the resulting problem is equivalent to [FAJSP-2].

It is well known that File Allocation and Facility Location problems are closely related. For instance, the Simple File Allocation Problem is analogous to the Uncapacitated Facility Location Problem [28]. Another type of location problem, the Two Echelon Distribution Facility Location Problem [TEDFLP], is closely related to [FAJSP-2]. In [TEDFLP], one level of facilities represents distribution centers where multiple products are consolidated. The next level of facilities are market-oriented warehouses that serve retailers or final consumer demands. The objective here is to determine the location of facilities at each echelon that minimizes total cost. This problem was well described by Gao and Robinson [11], who assumed that all facilities are homogeneous, in that all the products flow in a simple route through the two echelons to the final customer. However, oftentimes, the second level of facilities, that is, the warehouses that serve retailers or final customers, may be used to store not just a single product type but several product types. Even though these product types may be similar, they may come from different types of facilities at the higher echelon. This would be so especially if the higher-echelon facilities consisted of manufacturing/assembly units. Such a problem requiring two facility types at the higher echelon is analogous to [FAJSP-2].

A worthwhile future research endeavor is to develop suitable heuristics for the more general File Allocation and Join Site Selection Problem, in which several queries may require the joining of more than two file types, as illustrated in Figure 1. We are currently investigating suitable ways of decomposing the general problem into several [FAJSP-2] or 2-way join problems. One approach under investigation is a top-down approach, where, for example, assuming temporary relations \(T_1\) and \(T_2\) to be result relations, the file placements of \(R_1\), \(R_2\), \(R_3\), and \(R_4\) are obtained for each location. Then, assuming \(T_3\) to be a result relation, the locations for \(T_1\) and \(T_2\) are determined. In a similar way, the problem may also be decomposed in a bottom-up fashion.
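As a rough illustration of the decomposition idea, the Python sketch below lists the 2-way join subproblems of a multi-way query in the order just described. The join tree is an assumption made for the example (\(T_1\) joins \(R_1\) and \(R_2\), \(T_2\) joins \(R_3\) and \(R_4\), and \(T_3\) joins \(T_1\) and \(T_2\)), and `two_way_subproblems` is a hypothetical helper, not a procedure from the paper.

```python
def two_way_subproblems(tree):
    """Return (result name, list of (left, right, result) triples), innermost joins first.

    A tree is either a base-relation name (str) or a (left, right, name) tuple.
    """
    if isinstance(tree, str):
        return tree, []
    left, right, name = tree
    left_name, left_jobs = two_way_subproblems(left)
    right_name, right_jobs = two_way_subproblems(right)
    return name, left_jobs + right_jobs + [(left_name, right_name, name)]


# Assumed join tree for a four-relation query.
tree = (("R1", "R2", "T1"), ("R3", "R4", "T2"), "T3")
_, jobs = two_way_subproblems(tree)
print(jobs)
```

Each triple names the two input files and the result relation of one [FAJSP-2]-sized subproblem; reversing `jobs` gives the opposite processing order.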
[13] B. Gavish and H. Pirkul, Computer and database location in distributed computer systems, IEEE Trans Comput C-35 (1987), 583–590.
[14] B. Gavish and A. Segev, Set query optimization in distributed database systems, ACM Trans Database Syst 11 (1986), 265–293.
[15] D. Ghosh, I. Murthy, and A. Moffett, File allocation problem: comparison of models with worst case and average communication delays, Oper Res 40 (1992), 1074–1085.
[16] A. Hevner and S. Yao, Query processing in distributed database systems, IEEE Trans Software Eng SE-5 (1979), 177–187.
[17] H. Katzan, Jr., Distributed information systems, Petrocelli, New York, 1979.
[18] L.J. Laning and M.S. Leonard, File allocations in a distributed computer communications network, IEEE Trans Comput C-32 (1983), 232–244.
[19] K.D. Levin and H.D. Morgan, Optimal program and data locations in computer networks, Commun ACM 20 (1976), 315–321.
[20] S. Mahmoud and J.S. Riordan, Optimal allocation of resources in distributed information networks, ACM Trans Database Syst 1 (1976), 66–78.
[21] S. Manning and R.W. Peebles, A homogeneous network for data sharing communications, Comput Network (1977), 211–224.
[22] I. Murthy and D. Ghosh, File allocation involving worst case response times and link capacities: model and solution procedure, Eur J Oper Res 67 (1993), 418–427.
[23] I. Murthy and P. Seo, A nested dual ascent procedure for the dynamic single file allocation problem, INFOR 31 (1993), 186–204.
[24] S. Ram and P. Narasimhan, Database allocation in a distributed environment: incorporating a concurrency control mechanism and queuing costs, Mgmt Sci 40 (1994), 969–983.
[25] R. Ramesh and B. Ryan, Optimal file allocation and report assignment in distributed information networks, Naval Res Log 37 (1990), 165–181.
[26] A. Segev, Optimization of join operations in horizontally partitioned database systems, ACM Trans Database Syst 11 (1986), 48–80.
[27] P. Seo, File allocation and join site selection problem in distributed database systems, unpublished Ph.D. dissertation, Louisiana State University, 1994.
[28] B. Wah, File placement on distributed computer systems, IEEE Trans Comput 17 (1984), 23–30.
[29] E. Wong, Retrieving dispersed data from SDD-1: A system for distributed databases, Second International Berkeley Workshop on Distributed Data Management and Computer Networks, 1977.
[30] R. Wong, A dual ascent approach for Steiner tree problem on a directed graph, Math Program 28 (1984), 271–287.
[31] P.S. Yu, H.V. Heiss, and S. Lee, Workload characterization of relation database environments, IBM research report no. RC 4675, IBM Research Division, T.J. Watson Research Center, Yorktown Heights, NY 10598, 1989.