File Allocation and Join Site Selection Problem

A Dual-Ascent Procedure for the File Allocation and Join Site Selection Problem on a Telecommunications Network
Ishwar Murthy,1 Phil K. Seo2

1 Department of Information Systems and Decision Sciences, Louisiana State University, Baton Rouge, Louisiana 70803
2 Myung Jee University, Seoul, South Korea
Received 18 September 1997; accepted 14 August 1998
Abstract: In this paper, a model and a solution procedure are developed for the File Allocation and Join Site Selection Problem with 2-way Join [FAJSP-2], defined on a telecommunications network. This problem attempts to integrate the file allocation and query optimization aspects of a distributed computing system. By allowing for queries that require processing of up to two file types, the problem is designed to determine simultaneously the number and location of copies of each file type and the location of join operations. The problem is modeled as a mixed-integer linear program, for which a fast dual-ascent approximation procedure is developed. Extensive computational results are presented which demonstrate that our dual-ascent procedure is able to solve even large-scale problems to near optimality quickly. © 1999 John Wiley & Sons, Inc. Networks 33: 109–124, 1999
Keywords: telecommunications network; distributed computing systems; integer programming; dual ascent
Correspondence to: I. Murthy; e-mail: imurthy@unix1.sncc.lsu.edu
© 1999 John Wiley & Sons, Inc. CCC 0028-3045/99/020109-16

1. INTRODUCTION

In this paper, we introduce a model that attempts to integrate two important problems in Distributed Computing Systems (DCS): the File Allocation Problem and the Query Optimization Problem. At a physical level, a DCS can be thought of as a set of computers interconnected by telecommunications facilities, that is, a telecommunications network where each node represents either a stand-alone computer or a cluster of computers and each link represents a connection between two computers that enables them to communicate with each other. The primary function of a DCS is to provide a logically integrated information system for organizations that are geographically dispersed. Accordingly, it attempts to disperse data and the data-processing function to more closely mirror the workings of such an organization. Clearly, the benefits of such a system are lower communication cost, quicker access to data, higher system reliability, and ease of incremental growth [17].

A natural consequence of a DCS is to distribute the databases across the network. By this, we mean that a database is broken up into several database fragments (or files), with each fragment placed at various locations of the DCS network. Assuming the existence of (i) computers at each location and (ii) a telecommunications network, there are two important issues associated with the design of distributed databases. The first problem, known as the file allocation problem (FAP), is concerned with the placement of files at various nodes in the network. The second problem, query optimization, deals with the efficient processing of queries that involve multiple, usually geographically dispersed, file types. Although these two problems are closely related, review of the current literature reveals that most investigations have concentrated on a single aspect in isolation.

1.1. File Allocation Problem

The primary issue in file allocation is to determine the number of file copies that must be maintained in the distributed system and the location of each file copy. Here, each file type refers to a database fragment. The FAP was initially studied by Chu [5]. Chu proposed a model where multiple file copies are allocated such that operating costs are minimized. Chu's model also took into account the memory-size restrictions of individual processors and an average delay constraint for queries. Levin and Morgan [19] proposed a comprehensive model that explicitly took into account the dependencies between program and data files. However, with this model, only relatively small problems can be solved. The FAP has also been studied in conjunction with the network-design problem as evidenced in [4, 13, 15, 18, 20, 22]. The overall architectural design of a distributed system that includes the network design, the distributed database design, and the location of report generation sites was studied in [13, 25]. In contrast to the static FAP, the dynamic file allocation problem (DFAP) involves the determination of file allocation policies over time. An excellent discussion on DFAP and file migration models can be found in [12]. A special case of DFAP was addressed in Murthy and Seo [23] where the changes in query requirements were seasonal.

1.2. Query Optimization

The critical issues in query optimization include (i) determining the physical copies of the files/fragments upon which to execute the query, given a query expression over fragments,* and (ii) selecting the order of execution of these operations. The latter typically involves the determination of a good sequence of join, semijoin, and union operations. Operations such as selections and projections, which usually do not impact interprocessor communications, are generally ignored.

* The term materialization is typically used in the literature to denote a nonredundant copy of the entire distributed database upon which the query is executed.

Fig. 1. Processing of a query with five relations.

Figure 1 provides a good illustration of how a typical query is processed. This query requires the processing of five data fragments (file types): R1, R2, R3, R4, and R5. These files are also referred to as permanent relations. To process this query, permanent relations R1 and R2 are joined to create a temporary relation T1. Similarly, relations R3 and R4 are joined to create another temporary relation T2. Then, T1 and T2 are joined to create relation T3. Finally, T3 is joined with permanent relation R5 to create the result relation T4, which is sent to the query site. It is worth noting here that this is not the only way to obtain result T4 from the five relations. Consequently, it should be evident that obtaining the best sequence of join operations is an important issue in query optimization.

The pioneering work in the area of distributed query processing was done by Wong [29]. Wong's technique translated a query into a sequence of moves of relations and local processing actions. Chu and Hurley [6] proposed procedures for obtaining a query processing policy based on an operating-cost model that incorporates communication and processing costs. Hevner and Yao [16] investigated the decomposition of the query optimization problem into two parts: the independent optimization of simple queries that can be solved easily and optimally, and the integration of them within schedules for relations. They presented an optimization algorithm (PARALLEL) which is shown to derive a minimal response time distribution strategy for any given simple query. The algorithm for the general query is a heuristic that uses an improved exhaustive search to find efficient distribution strategies. Building on [16], Apers et al. [2] presented a procedure that is capable of obtaining the optimal solution when delay minimization is the objective. Ceri and Pelagatti [4] considered selection, projection, and join operations that are required by a transaction on a distributed database. The authors show that for the allocation problem only the join operations are relevant. They proposed an integer programming model and discussed heuristic solution methods. Segev [26] investigated the problem of optimizing 2-way joins in horizontally partitioned database management systems. Gavish and Segev [14] defined a special case of the distributed query optimization problem that consists of queries involving set operations (e.g., set difference and set intersection) between sets of tuples that are geographically dispersed. They presented a mixed-integer programming model with three heuristic procedures and a plant location-based lower-bounding procedure to solve the model.

It is evident from the preceding discussion that while FAP and query optimization have been extensively researched, they have been treated largely as separate problems. In query optimization, while determining the sequence of join operations and their geographic sites, the locations of database files or permanent relations are assumed to be given. Similarly, research on FAP has largely ignored the issues associated with the query access plan. It is either assumed that all queries require access to a copy of a single database file or that when queries require several files the copies are simply sent to the user site where all operations are executed. However, the fact is that FAP and query optimization are highly interdependent. Surely, the location of the files can strongly impact the right sequence and location of join operations. Conversely, the selection of the join site (in query optimization) has a major impact on the query communication costs.

To illustrate this interdependency, consider the example similar to that found in [1]. Figure 2 shows the nature of the three relations (or file types): WINE_USA, WINE_FRANCE, and WEATHER. The first two relations contain tuples that represent a wine for which the grapes were grown in a certain area, picked in a certain year, and bottled by a certain producer. The relation WEATHER contains attributes YEAR, AREA, COUNTRY, and SUN, where SUN represents the hours of sun in a given area and year. Figure 3 shows the DCS network where copies of these relations are to reside.

Fig. 2. Relations.

Let the sizes of relations WINE_USA, WINE_FRANCE, and WEATHER be 12,000, 15,000, and 18,000 megabytes,
respectively. Let's also suppose that a copy of WINE_USA resides in San Francisco; a copy of WINE_FRANCE, in Paris; and a copy of WEATHER, in New York.

Fig. 3. The DCS network.

Suppose that a frequent query from a user in Chicago is "Provide the name, year, and hours of sun of all wines from Napa Valley, USA." There are several ways to execute this query. One option is to transmit a copy of WINE_USA from San Francisco to New York. At New York, joins based on YEAR and AREA are computed between WINE_USA and WEATHER. The size of the result relation is 500 megabytes, which is then transmitted to Chicago. The total megabytes transmitted in this option is 12,500. Another option could be to transmit WEATHER to San Francisco, execute the join there, and then send the result relation to Chicago. In this option, a total of 18,500 megabytes are transmitted. Yet another option could be to transmit both WINE_USA and WEATHER to Chicago and perform the join operation there. In this case, a total of 30,000 megabytes (12,000 + 18,000) are transmitted. If the network is packet-switched, then the query communication cost is simply proportional to the number of megabytes transmitted, in which case the first option is the cheapest of the three. This illustrates the impact of the join site on the query communication cost. One way to reduce the query communication cost is to maintain an additional copy of WEATHER in San Francisco. Now since San Francisco has a copy of WEATHER and WINE_USA, the join operation can take place there and the result relation transmitted to Chicago. In this case, only 500 megabytes are transmitted. Of course, due to the additional copy, update and storage costs increase. This illustrates the effect the file locations have on the optimal join site.

To the best of our knowledge, with the exception of Apers [1] and Cornell and Yu [7], there has been no study that develops an integrated strategy of simultaneously determining the location where database files are to be stored and where queries are to be processed. Apers [1] generalized the FAP into data and operation allocation problems given queries, updates, and frequency of usage. Apers' model first determines the fragments to be allocated. Subsequently, it determines the nodes where these fragments and the operations on them will be allocated. The objective of the model is to minimize total transmission cost. Cornell and Yu [7] also considered an integrated strategy for choosing the locations where relations are to be stored, while simultaneously determining the sites where the join operations are to take place. In their formulation, the objective was to minimize communication costs based on a given set of transactions (queries) and their associated input rates. An integer linear programming formulation was developed, but no solution procedure was provided. Furthermore, since the database was assumed to be nonreplicated, update costs were ignored in their formulation.

1.3. Objective of This Research

As in Apers [1] and Cornell and Yu [7], the objective of this paper is to integrate the FAP and query optimization problems. To that end, we introduce the File Allocation and Join Site Selection Problem with a 2-way join [FAJSP-2]. In this problem, we allow for queries that require processing of up to two file types, as shown in Figure 4. Queries that require two file types are processed by a simple join operation. This join operation can occur at any location, and, subsequently, the result relation is transmitted to the (user) site where the query originated. Given many such queries, the problem is to (i) determine the number of copies of each file type and their respective node sites, and (ii) for each query, determine which copies of the two files to access and the location where the join is to occur. These decisions are made so as to minimize the total query communication and file update/storage costs.

Fig. 4. A simple 2-way join.

In the next section, we discuss the assumptions made in [FAJSP-2]. Then, we develop an integer programming model for [FAJSP-2]. In Section 3, we present a novel dual-ascent procedure that solves the model approximately but provides tight bounds. In Section 4, we illustrate the workings of the dual-ascent procedure with an example. The subsequent section presents the results of numerical experiments. These results demonstrate the ability of our procedure to quickly obtain very good solutions even for large-scale problems. This is the principal contribution of this paper. While the problem addressed in this paper is a specific one, it is closely related to some other important problems in Facility Location and File Allocation. In Section 6, we discuss the connections that [FAJSP-2] has with these other problems and conclude this paper.

2. MODEL FORMULATION

A distributed database system typically consists of many database files/fragments and hundreds of different queries originating from various sites in the distributed environment. As seen in Figure 1, processing a query may sometimes require more than two file copies. Although ideally one would want a mathematical model for this most general case, the ensuing model (in all probability) will be far too complex. Consequently, in [FAJSP-2], we concentrate on the allocation of files and the selection of join sites in systems that require the joining of two files only.

At first sight, this may appear to be rather limiting. We, however, believe that the study of [FAJSP-2] is of considerable importance for two reasons: First, as observed by Ram and Narasimhan [24] and Yu et al. [31], most queries require processing of at most two relations. More importantly, we think that the study of [FAJSP-2] ought to be viewed as a first step toward solving the most general case (i.e., one with queries requiring more than two files). Observe that even in the general case queries are processed by joining two relations at a time in sequence. This suggests several possibilities of obtaining good solutions to the general problem by suitably decomposing it into several [FAJSP-2] problems. In short, we believe that the study of [FAJSP-2] will provide valuable insights into the solution of the more general problem.

In our model, it is assumed that individual nodes in the distribution system are sufficiently large to accommodate all processing, storage, and communications requirements. Capacity restrictions therefore are not considered here. Many difficulties arise when one tries to seamlessly integrate different types of processors into a single, logically integrated system. Low-level interfacing problems limit system flexibility and hamper the efficiency of operations. Homogeneous computer systems (i.e., systems supported by a set of identical computers) have been suggested as a solution to these problems [21]. In this research, we assume such a system.

A copy of a program file, one that performs the query/update operations, is assumed to reside at each node since such files are relatively small and updates on them are quite infrequent. The following notations are used to develop the model for [FAJSP-2]:

$N$  index set of all locations
$I$  index set of all user locations
$Q$  index set of all query types
$M = (I \times Q)$  index set of all user location query-type combinations. Henceforth, an index $m \in M$ will be referred to simply as a user-query
$D$  index set of all file types
$f_m$  frequency of user-query $m$
$a_m$  the first file type used to process user-query $m$
$b_m$  the second file type used to process user-query $m$
$S_d$  size of file $d$ (megabytes)
$S_m$  size of the result relation/file of user-query $m$ (megabytes)
$U_i^d$  amount of updates from node $i$ to file type $d$ (megabytes)
$s_j$  unit storage cost at node $j$ (\$/megabyte)
$r_d$  reduction factor associated with file type $d$
$c_{jk}$  unit communication cost between nodes $j$ and $k$ (\$/megabyte)
$M_d \subseteq M$  index set of all user-queries $m$ that need to access file type $d$, i.e., for each $m \in M_d$, either $d = a_m$ or $d = b_m$

There are two cost components in [FAJSP-2]: the query communication cost and the update/storage cost. Consider a user-query $m$ that is associated with a query $q$ originating from node $i$. This query requires copies of two file types, $a_m$ and $b_m$, which are processed by a simple join operation. Suppose further that user-query $m$ uses a copy of file type $a_m$ that resides at node $j_1$ and a copy of file type $b_m$ that resides at node $j_2$. Also, let the join operation take place at some node $k$. The communication cost for this query is computed as follows: First, reduced copies of $a_m$ and $b_m$ are sent from locations $j_1$ and $j_2$, respectively, to join site $k$. Using reducer programs consisting of unary operations and semijoins, the files are appropriately reduced by a factor $r$. Hence, the cost of transmitting data from nodes $j_1$ and $j_2$ to node $k$ is

$$\left[ S_{a_m} \cdot r_{a_m} \cdot c_{j_1 k} + S_{b_m} \cdot r_{b_m} \cdot c_{j_2 k} \right].$$

At node $k$, a relation of size $S_m$ is obtained after the join operation. This result is then transmitted to the user location $i$ at a cost of $S_m \cdot c_{ki}$. Assuming that the frequency of user-query $m$ is $f_m$, the optimal communication cost and the join site for file type $a_m$ from location $j_1$ and file type $b_m$ from location $j_2$ can be determined in linear time as

$$C_{m j_1 j_2} = \min_{k \in N} \left[ f_m \cdot S_{a_m} \cdot r_{a_m} \cdot c_{j_1 k} + f_m \cdot S_{b_m} \cdot r_{b_m} \cdot c_{j_2 k} + f_m \cdot S_m \cdot c_{k i} \right]. \tag{1}$$
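
The minimization in (1) amounts to a single scan over the candidate join sites. The following Python sketch illustrates this scan. The function name `join_site_cost`, the four-node cost matrix, and the choice of unit frequency, unit reduction factors, and a unit cost of 1 between distinct nodes are illustrative assumptions, chosen so that the cost equals the number of megabytes transmitted as in the wine example of the Introduction. They are not data from the paper's experiments.

```python
def join_site_cost(f_m, size_a, r_a, size_b, r_b, size_result, c, j1, j2, i):
    """Evaluate equation (1): return (minimum cost, best join site k).

    c[j][k] is the unit communication cost between nodes j and k,
    j1 and j2 hold the copies of the two files, and i is the user location.
    """
    best_cost, best_site = None, None
    for k in range(len(c)):  # one pass over all candidate join sites
        cost = f_m * (size_a * r_a * c[j1][k]
                      + size_b * r_b * c[j2][k]
                      + size_result * c[k][i])
        if best_cost is None or cost < best_cost:
            best_cost, best_site = cost, k
    return best_cost, best_site


# Assumed node numbering: 0 San Francisco, 1 New York, 2 Chicago, 3 Paris.
# Assumed costs: 1 per megabyte between distinct nodes, 0 within a node.
c = [[0 if j == k else 1 for k in range(4)] for j in range(4)]

# WINE_USA (12,000 MB) at node 0, WEATHER (18,000 MB) at node 1,
# result relation of 500 MB, user at node 2, no reduction (r = 1).
cost, site = join_site_cost(1, 12000, 1.0, 18000, 1.0, 500, c, j1=0, j2=1, i=2)
print(cost, site)
```

Under these assumptions the scan selects New York (node 1) as the join site at a cost of 12,500, which matches the first option discussed in the Introduction.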
Associated with each copy of file $d$ at location $j$ is an update communication and storage cost. Given that the amount of updates from a user location $i$ to file type $d$ is $U_i^d$ and the unit storage cost at node $j$ is $s_j$, the update and storage cost is

$$F_j^d = \sum_{i \in I} \left( U_i^d \cdot c_{ij} \right) + s_j \cdot S_d. \tag{2}$$

The following decision variables describe the decisions made in [FAJSP-2]:

$x_{m j_1 j_2} = 1$ if file $a_m$ at $j_1$ and file $b_m$ at $j_2$ are used to satisfy user-query $m$, and $0$ otherwise.
$y_j^d = 1$ if a copy of file type $d$ is stored at node $j$, and $0$ otherwise.

We now present the following integer programming model for [FAJSP-2]:

$$\text{(P)} \quad \text{Minimize} \quad \sum_{m \in M} \sum_{j_1 \in N} \sum_{j_2 \in N} C_{m j_1 j_2} \cdot x_{m j_1 j_2} + \sum_{d \in D} \sum_{j \in N} F_j^d \cdot y_j^d$$

subject to

$$\sum_{j_1 \in N} \sum_{j_2 \in N} x_{m j_1 j_2} = 1 \quad \text{for all } m \in M, \tag{3}$$

$$\sum_{j_1 \in N} x_{m j_1 j_2} \le y_{j_2}^{d} \quad \text{for all } m \in M,\ j_2 \in N,\ d = b_m, \tag{4}$$

$$\sum_{j_2 \in N} x_{m j_1 j_2} \le y_{j_1}^{d} \quad \text{for all } m \in M,\ j_1 \in N,\ d = a_m, \tag{5}$$

$$x_{m j_1 j_2} \in \{0, 1\} \text{ and } y_j^d \in \{0, 1\} \quad \text{for all } m \in M,\ j_1 \in N,\ j_2 \in N,\ j \in N,\ d \in D. \tag{6}$$

In the objective function, the first term represents the query communication cost, while the second term represents the composite update and storage costs. Constraint set (3) describes the requirement that each user-query $m$ needs to access the first file type $a_m$ from some location $j_1$ and the second file type $b_m$ from some location $j_2$. Constraint set (4) describes the condition that if file type $d$ is not stored at location $j_2$, no user-query $m$ for which $b_m = d$ can access it from location $j_2$. Similarly, constraint set (5) enforces the requirement that if file type $d$ is not stored at location $j_1$, no user-query $m$ for which $a_m = d$ can access it from location $j_1$.

3. THE DUAL-ASCENT PROCEDURE

In this section, we discuss in detail the development of a dual-ascent procedure for [FAJSP-2]. A dual-ascent procedure is a structured approach for solving the LP dual or the Lagrangian dual. It solves the dual in a manner that ensures a monotonic improvement in the dual value. This is in contrast to the more popular subgradient procedure, which can take a longer time to converge, since the dual value often regresses during the course of its execution. Also, in the dual-ascent procedure, good primal solutions can often be easily identified using complementary slackness conditions.

A number of highly successful dual-ascent schemes have been developed in the past for solving large-scale integer programming problems [3, 8, 10, 30]. It is interesting to note that in all these problems the formulation consists of a set of constraints that are totally unimodular and a set of variable upper bound (VUB) constraints. It was found that their LP relaxations frequently yielded natural integer solutions or provided bounds close to the optimal. The formulation (P) bears a striking resemblance to these problems. Consequently, it is expected that good lower bounds may be obtained by solving the LP relaxation of (P).

Our dual-ascent procedure consists of three iterative steps: (1) a dual-ascent step, (2) a primal step, and (3) a dual-adjustment step. In the dual-ascent step, the dual of the LP relaxation of (P) is rapidly solved to near optimality by taking advantage of its special structure. In the primal step, a feasible solution to (P) is obtained from the dual solution by ensuring that certain complementary slackness conditions are satisfied. During this process, if all complementary slackness conditions are satisfied, we stop with the optimal solution. If not, using the complementary slackness violations as a guide, certain dual values are adjusted so as to enable a further increase in the dual value. This is the dual-adjustment step. These three steps are repeated until no further improvement in the primal or dual solution occurs. Finally, if a gap still exists between the primal and dual values, then a fast add-drop heuristic is executed to improve the primal solution. This add-drop heuristic is very similar in approach to that described in Balakrishnan et al. [3] or Fisher et al. [10]. Hence, we have not provided details here; they can be found in Seo [27].

3.1. The Dual-ascent Step

Consider the following LP dual of (P):

$$\text{(D)} \quad \text{Maximize} \quad \sum_{m \in M} n_m$$

subject to

$$n_m \le C_{m j_1 j_2} + v_{m j_1}^{a_m} + v_{m j_2}^{b_m} \quad \text{for all } m \in M,\ j_1 \in N,\ j_2 \in N, \tag{7}$$

$$\sum_{m \in M_d} v_{m j}^{d} \le F_j^d \quad \text{for all } d \in D,\ j \in N, \tag{8}$$

$$v_{m j}^{d} \ge 0 \quad \text{for all } d \in D,\ m \in M,\ j \in N. \tag{9}$$

In (D), the dual variables $n_m$ are associated with the constraints in (3), while $v_{m j_1}^{a_m}$ and $v_{m j_2}^{b_m}$ are associated with constraints (5) and (4), respectively. Dual constraints (7) and (8) are associated with the variables $x_{m j_1 j_2}$ and $y_j^d$, respectively.

The dual formulation suggests that the dual value can be increased by increasing each dual variable $n_m$. Let

$$\bar{C}_{m j_1 j_2} = C_{m j_1 j_2} + v_{m j_1}^{a_m} + v_{m j_2}^{b_m} \tag{10}$$

represent the right-hand-side values of each constraint in (7). Given a feasible set of dual values $\{v_{mj}^d\}$, that is, one that satisfies (8) and (9), $\bar{C}_{m j_1 j_2}$ can be interpreted as the revised cost of accessing file type $a_m$ from node $j_1$ and file type $b_m$ from node $j_2$, in order to satisfy user-query $m$. It is clear from (7) that having fixed the dual values $\{v_{mj}^d\}$ the optimal value of $n_m$ can be obtained simply as

$$n_m = \min_{j_1 \in N,\ j_2 \in N} \left\{ \bar{C}_{m j_1 j_2} \right\}. \tag{11}$$

The dual-ascent step is stated formally as follows, where $S_j^d$ denotes the slack in constraint (8):

STEP 0. [INITIALIZATION]
Set $S_j^d \leftarrow F_j^d$ for all $d \in D$, $j \in N$.
Set $v_{mj}^d \leftarrow 0$, $n_m \leftarrow 0$ for all $d \in D$, $j \in N$, and $m \in M$.
For each $m \in M$, do the following steps:

STEP 1. [INCREASE DUAL $n_m$]
1.1. Set $v_{m j_1}^{a_m} \leftarrow v_{m j_1}^{a_m} + S_{j_1}^{a_m}$ for all $j_1 \in N$.
1.2. Set $v_{m j_2}^{b_m} \leftarrow v_{m j_2}^{b_m} + S_{j_2}^{b_m}$ for all $j_2 \in N$.
1.3. Set $S_{j_1}^{a_m} \leftarrow 0$, $S_{j_2}^{b_m} \leftarrow 0$ for all $j_1 \in N$, $j_2 \in N$.
1.4. Set $n_m \leftarrow \min_{j_1 \in N,\ j_2 \in N} \{ C_{m j_1 j_2} + v_{m j_1}^{a_m} + v_{m j_2}^{b_m} \}$.

STEP 2. [INCREASE SLACK $S_{j_1}^{a_m}$]
For each $j_1 \in N$, do the following:
2.1. Set $C^* \leftarrow \min_{j_2 \in N} \{ \bar{C}_{m j_1 j_2} \}$.
2.2. Set $\Delta_1 \leftarrow C^* - n_m$, and $\Delta \leftarrow \min\{ \Delta_1, v_{m j_1}^{a_m} \}$.
2.3. Set $v_{m j_1}^{a_m} \leftarrow v_{m j_1}^{a_m} - \Delta$, $S_{j_1}^{a_m} \leftarrow S_{j_1}^{a_m} + \Delta$.

STEP 3. [INCREASE SLACK $S_{j_2}^{b_m}$]
For each $j_2 \in N$, do the following:
3.1. Set $C^* \leftarrow \min_{j_1 \in N} \{ \bar{C}_{m j_1 j_2} \}$.
3.2. Set $\Delta_1 \leftarrow C^* - n_m$, and $\Delta \leftarrow \min\{ \Delta_1, v_{m j_2}^{b_m} \}$.
3.3. Set $v_{m j_2}^{b_m} \leftarrow v_{m j_2}^{b_m} - \Delta$, $S_{j_2}^{b_m} \leftarrow S_{j_2}^{b_m} + \Delta$.

In Step 0 above, we start with a dual-feasible solution by setting all variables to zero. In Steps 2.2 and 3.2, the manner in which $\Delta$ is obtained ensures that $n_m$ is not decreased and constraints (9) are satisfied.
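
A minimal Python sketch of Steps 0 through 3 is given below. The function name `dual_ascent`, the dictionary and list layout of the inputs, and the two-node, two-query instance are assumptions made for illustration only. The sketch also assumes that the two file types of each user-query are distinct. It is not the authors' implementation.

```python
def dual_ascent(C, F, a, b):
    """One pass of the dual-ascent step (Steps 0-3) over all user-queries.

    C[m][j1][j2]: query cost from equation (1); F[d][j]: cost from equation (2);
    a[m], b[m]: first and second file type of user-query m (assumed distinct).
    Returns the duals n, v_a, v_b and the slacks S of constraint (8).
    """
    num_queries = len(C)
    num_nodes = len(C[0])
    nodes = range(num_nodes)

    # Step 0: all duals start at zero, so every slack equals F.
    S = {d: [float(cost) for cost in F[d]] for d in F}
    v_a = [[0.0] * num_nodes for _ in range(num_queries)]
    v_b = [[0.0] * num_nodes for _ in range(num_queries)]
    n = [0.0] * num_queries

    def revised(m, j1, j2):
        # Revised cost of equation (10).
        return C[m][j1][j2] + v_a[m][j1] + v_b[m][j2]

    for m in range(num_queries):
        am, bm = a[m], b[m]

        # Step 1: hand all available slack to query m, then set n_m by (11).
        for j in nodes:
            v_a[m][j] += S[am][j]
            S[am][j] = 0.0
            v_b[m][j] += S[bm][j]
            S[bm][j] = 0.0
        n[m] = min(revised(m, j1, j2) for j1 in nodes for j2 in nodes)

        # Step 2: take back slack for file a_m wherever n_m is unaffected.
        for j1 in nodes:
            c_star = min(revised(m, j1, j2) for j2 in nodes)
            delta = min(c_star - n[m], v_a[m][j1])
            v_a[m][j1] -= delta
            S[am][j1] += delta

        # Step 3: the same for file b_m.
        for j2 in nodes:
            c_star = min(revised(m, j1, j2) for j1 in nodes)
            delta = min(c_star - n[m], v_b[m][j2])
            v_b[m][j2] -= delta
            S[bm][j2] += delta

    return n, v_a, v_b, S


# Hypothetical instance: two nodes, file types "A" and "B", two user-queries.
F = {"A": [3, 5], "B": [2, 4]}
C = [[[10, 12], [11, 9]],
     [[4, 6], [5, 3]]]
n, v_a, v_b, S = dual_ascent(C, F, a=["A", "A"], b=["B", "B"])
print(n, S)
```

On this assumed instance the dual value is 15 + 4 = 19. Storing both files at node 0 costs 3 + 2 = 5, and serving the two queries from there costs 10 + 4 = 14, so the primal cost is also 19 and the bound is tight for this small case.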

The rationale for the dual-ascent step is as follows. From (10) and (11), it is evident that each $n_m$ can be increased as much as possible by increasing each dual pair $(v_{m j_1}^{a_m}, v_{m j_2}^{b_m})$, for all $j_1 \in N$, $j_2 \in N$. In turn, the amount that each dual pair $(v_{m j_1}^{a_m}, v_{m j_2}^{b_m})$ can increase is limited by the resource-type constraint (8). Let $S_j^d$, the slack associated with constraint (8), represent the amount of unutilized resource. Thus, the most a dual variable $v_{mj}^d$ can increase by is $S_j^d$.

The dual-ascent step consists of two basic steps for each $m \in M$: (i) increase the dual $n_m$, and (ii) increase slacks $S_j^d$ for all $j \in N$, $d = a_m$, and $d = b_m$. Given a feasible dual solution, a variable $n_m$ is increased in the following manner. First, each dual pair $(v_{m j_1}^{a_m}, v_{m j_2}^{b_m})$ is increased by $(S_{j_1}^{a_m}, S_{j_2}^{b_m})$, for all $j_1 \in N$, $j_2 \in N$. Then, $n_m$ is updated as per (11). It is worth noting from the manner in which $n_m$ is determined in (11) that not all increases in $v_{m j_1}^{a_m}$ and $v_{m j_2}^{b_m}$ contribute equally toward increasing $n_m$. Consequently, in the second step, we attempt to increase back the slacks $S_{j_1}^{a_m}$ and $S_{j_2}^{b_m}$ for all $j_1 \in N$ and $j_2 \in N$, without decreasing $n_m$. The idea here is to simply take back those resources that do not contribute to increasing $n_m$ and, hence, provide opportunities for increasing other $n_m$ variables. This is the procedure formalized in Steps 0 through 3 of the dual-ascent step.

3.2. The Primal Step

After the dual-ascent step, a feasible dual solution $\{n^1, v^1\}$ is obtained. From this dual solution, a feasible solution to (P) is obtained using complementary slackness relationships. The complementary slackness conditions that the optimal primal-dual LP pair has to satisfy are

$$S_j^d \cdot y_j^d = 0 \quad \text{for all } d \in D,\ j \in N, \tag{12}$$

$$\left( C_{m j_1 j_2} + v_{m j_1}^{a_m} + v_{m j_2}^{b_m} - n_m \right) \cdot x_{m j_1 j_2} = 0 \quad \text{for all } m \in M,\ j_1 \in N,\ j_2 \in N, \tag{13}$$

$$v_{m j_1}^{a_m} \cdot \left( y_{j_1}^{a_m} - \sum_{j_2 \in N} x_{m j_1 j_2} \right) = 0 \quad \text{for all } m \in M,\ j_1 \in N, \tag{14}$$

$$v_{m j_2}^{b_m} \cdot \left( y_{j_2}^{b_m} - \sum_{j_1 \in N} x_{m j_1 j_2} \right) = 0 \quad \text{for all } m \in M,\ j_2 \in N. \tag{15}$$

In the primal step, a primal feasible solution $\{y^1, x^1\}$ is constructed that satisfies all the complementary slackness conditions (12) and (13). In addition, the amount of violation of conditions (14) and (15) is kept at a minimum. Constructing $\{y^1, x^1\}$ involves, in part, developing the set $L_d^1$ for each $d \in D$, where $L_d^1 = \{ j \in N \mid y_j^d = 1 \}$, that
is, $L_d^1$ defines the set of all locations where a copy of file type $d$ is stored. Let $L_d^*$ be defined as

$$L_d^* = \{ j \in N \mid S_j^d = 0 \} \quad \text{for all } d \in D. \tag{16}$$

The feasible solution $\{y^1, x^1\}$ is such that $L_d^1 \subseteq L_d^*$. This naturally ensures that (12) is satisfied. Before describing the procedure for obtaining $L_d^1$, the following property helps to explain how conditions (13) are also satisfied by constructing $L_d^1$ from $L_d^*$:

Lemma 1. After the dual-ascent step, for each $m \in M$, there exists at least one $j_1 \in L_{a_m}^*$ and $j_2 \in L_{b_m}^*$ such that $\bar{C}_{m j_1 j_2} = n_m$.

Proof. Observe that after Step 1 of the dual-ascent step, $S_{j_1}^{a_m} = 0$ and $S_{j_2}^{b_m} = 0$ for all $j_1 \in N$ and $j_2 \in N$. Now, define $T_{a_m} \subseteq N$ to be a set wherein for each $j_1 \in T_{a_m}$, $\min_{j_2 \in N} \{ \bar{C}_{m j_1 j_2} \} = n_m$, and, similarly, $T_{b_m} \subseteq N$ to be a set wherein for each $j_2 \in T_{b_m}$, $\min_{j_1 \in N} \{ \bar{C}_{m j_1 j_2} \} = n_m$, after Step 1. Clearly, from the manner in which slacks are increased in Steps 2 and 3, there is no increase in slacks $S_j^{a_m}$ for each $j \in T_{a_m}$ and $S_j^{b_m}$ for each $j \in T_{b_m}$. This implies that $T_{a_m} \subseteq L_{a_m}^*$ and $T_{b_m} \subseteq L_{b_m}^*$. Further, for each $j_1 \in T_{a_m}$, there exists some $j_2^* \in N$ such that $\bar{C}_{m j_1 j_2^*} = n_m$ and $S_{j_2^*}^{b_m} = 0$. Otherwise, there would be room to further increase $n_m$. Similarly, for each $j_2 \in T_{b_m}$, there exists some $j_1^* \in N$ such that $\bar{C}_{m j_1^* j_2} = n_m$ and $S_{j_1^*}^{a_m} = 0$. Since $|T_{a_m}| \ge 1$ and $|T_{b_m}| \ge 1$, the result follows.

To construct $L_d^1$, we first add copies of file type $d$ at essential locations to $L_d^1$ from $L_d^*$. A copy of file $d$ at location $j_1'$ is deemed essential if, for some $m$, $a_m = d$ and $j_1'$ is the only location in $L_d^*$ such that $\min_{j_2 \in N} \{ \bar{C}_{m j_1' j_2} \} = n_m$. Naturally, for such a user-query $m$, the only way for condition (13) to be satisfied is to set $x_{m j_1' j_2^*} = 1$, where $\bar{C}_{m j_1' j_2^*} = n_m$. This forces $j_1'$ to be added to $L_d^1$. Similarly, file type $d$ at location $j_2'$ is essential if, for some $m$, $b_m = d$ and $j_2'$ is the only location in $L_d^*$ such that $\min_{j_1 \in N} \{ \bar{C}_{m j_1 j_2'} \} = n_m$.

Having placed essential copies, only those additional copies are added to $L_d^1$ that are needed to ensure that the requirements of all user-queries are met while satisfying condition (13). This is done as follows: For each $m$, we check to see if there exists a $j_1 \in L_{a_m}^1$ and a $j_2 \in L_{b_m}^1$ such that $\bar{C}_{m j_1 j_2} = n_m$. If this condition is not met, then we select a $j_1^* \in L_{a_m}^*$ and a $j_2^* \in L_{b_m}^*$ such that

$$C_{m j_1^* j_2^*} = \min_{j_1 \in L_{a_m}^*,\ j_2 \in L_{b_m}^*} \left\{ C_{m j_1 j_2} \mid \bar{C}_{m j_1 j_2} = n_m \right\}. \tag{17}$$

If $j_1^* \notin L_{a_m}^1$ or $j_2^* \notin L_{b_m}^1$, they are each added to their respective sets. This, of course, makes it possible to set $x_{m j_1^* j_2^*} = 1$ and thereby satisfy condition (13). Due to Lemma 1, it is always possible to find a $j_1 \in L_{a_m}^*$ and a $j_2 \in L_{b_m}^*$ such that $\bar{C}_{m j_1 j_2} = n_m$.

Once the set $L_d^1$ is constructed, the $x$ values are obtained as follows: For each $m$, we set $x_{m j_1' j_2'} = 1$ such that

$$C_{m j_1' j_2'} = \min_{j_1 \in L_{a_m}^1,\ j_2 \in L_{b_m}^1} \left\{ C_{m j_1 j_2} \mid \bar{C}_{m j_1 j_2} = n_m \right\}. \tag{18}$$

By selecting a $j_1' \in L_{a_m}^1$ and a $j_2' \in L_{b_m}^1$ that satisfy (18), violations of complementary slackness conditions (14) and (15) are minimized. For some $m \in M$, to satisfy constraints (4) and (5), we can only set $x_{m j_1 j_2} = 1$ where $j_1 \in L_{a_m}^1$ and $j_2 \in L_{b_m}^1$. Further, to satisfy conditions (13), $\bar{C}_{m j_1 j_2} = n_m$. Observe that for some $j_1^* \in L_{a_m}^1$ and $j_2^* \in L_{b_m}^1$, if $\sum_{j_2} x_{m j_1^* j_2} = 0$ and $\sum_{j_1} x_{m j_1 j_2^*} = 0$, then $x_{m j_1^* j_2^*} = 0$ and complementary slackness conditions (14) and (15) are violated by the amount $(v_{m j_1^*}^{a_m} + v_{m j_2^*}^{b_m})$. Since the sum $C_{m j_1 j_2} + v_{m j_1}^{a_m} + v_{m j_2}^{b_m}$ is fixed at $n_m$ over the candidates in (18), minimizing $C_{m j_1 j_2}$ means that the sum $(v_{m j_1'}^{a_m} + v_{m j_2'}^{b_m})$ is the largest among all $j_1 \in L_{a_m}^1$ and $j_2 \in L_{b_m}^1$ such that $\bar{C}_{m j_1 j_2} = n_m$. By setting $x_{m j_1' j_2'} = 1$, we, in effect, avoid this amount of violation.

3.3. The Dual-adjustment Step

After the primal step, if $\{y^1, x^1\}$ and $\{n^1, v^1\}$ are found to satisfy all complementary slackness conditions, we stop with $\{y^1, x^1\}$ being the optimal solution to (P). If not, some of the dual variable values are adjusted, starting with a variable $v_{mj}^d$ that violates either condition (14) or (15). Note that (14) is violated if $v_{m j_1}^{a_m} > 0$, while $y_{j_1}^{a_m} = 1$ and $\sum_{j_2} x_{m j_1 j_2} = 0$. Similarly, (15) is violated if $v_{m j_2}^{b_m} > 0$, while $y_{j_2}^{b_m} = 1$ and $\sum_{j_1} x_{m j_1 j_2} = 0$. By decreasing $v_{m j_1}^{a_m}$, if it is associated with a violation of (14), or decreasing $v_{m j_2}^{b_m}$, if it is associated with a violation of (15), opportunities for increasing the dual value are created.

Observe that if $v_{m j_1}^{a_m}$ is associated with a violation of (14), then a copy of $a_m$ resides at node $j_1$, but it is not used to satisfy user-query $m$. Therefore, a copy of $a_m$ must also exist at another location $j_1'$ which is accessed, that is, $\sum_{j_2} x_{m j_1' j_2} = 1$. Hence, $|L_{a_m}^1| \ge 2$. Now, since $\min_{j_2} \{ \bar{C}_{m j_1 j_2} \} = n_m$ for each $j_1 \in L_{a_m}^1$, a decrease in $v_{m j_1}^{a_m}$ results in a corresponding decrease in $n_m$ and an increase in the slack $S_{j_1}^{a_m}$, by the same amount. However, since $|L_{a_m}^1| \ge 2$, it is possible to increase at least one more slack. This is because, due to (18), $(v_{m j_1'}^{a_m} + v_{m j_2'}^{b_m}) \ge (v_{m j_1}^{a_m} + v_{m j_2}^{b_m})$. Therefore, since $v_{m j_1}^{a_m} > 0$, at least one of $v_{m j_1'}^{a_m}$ or $v_{m j_2'}^{b_m}$ must be positive. This increasing of two or more slacks in exchange for decreasing $n_m$ creates the opportunity for increasing the dual. The same argument holds for decreasing $v_{m j_2}^{b_m}$, if it is associated with a violation of condition (15).

In the dual-adjustment step, if a variable $v_{m j_1}^{a_m}$ (or $v_{m j_2}^{b_m}$) is found to violate condition (14) or (15), then it is reduced to zero. Accordingly, $n_m$ is reduced while the slack $S_{j_1}^{a_m}$ (or $S_{j_2}^{b_m}$) is increased, by the same amount. Now, we attempt to increase all the slacks without changing $n_m$. This is done in exactly the same manner as Steps 2 and 3 of the dual-ascent step. Let $L_S^{a_m}$ and $L_S^{b_m}$ denote the index set of all locations
\(j_1'\) and \(j_2'\) whose corresponding slacks \(S_{a_m j_1'}\) and \(S_{b_m j_2'}\) have been increased, respectively. It is worth noting that \(L^S_{a_m} \subseteq L^*_{a_m}\) and \(L^S_{b_m} \subseteq L^*_{b_m}\). We now obtain a restricted set \(M^+ \subset M\), defined as the set of all user-queries \(m' \neq m\) such that either \(L^*_{a_{m'}} \subset L^S_{a_m}\) or \(L^*_{b_{m'}} \subset L^S_{b_m}\), or both. The dual-ascent step is first executed over the restricted set \(M^+\). Then, the set \(M^+\) is enlarged to include \(m\) and the dual-ascent step is executed over this enlarged set. This marks the end of the dual-adjustment step. A formal statement of the dual-adjustment step is presented in the Appendix.

The idea behind executing the dual-ascent step over the restricted set \(M^+\) is best illustrated by the following simple example: Let \(\omega^{a_m}_{mj_1} = \Delta\) violate complementary slackness condition (14). Further, upon setting \(\omega^{a_m}_{mj_1} = 0\), we get \(L^S_{a_m} = \{j_1, j_1'\}\), with \(S_{a_m j_1}\) and \(S_{a_m j_1'}\) each increased by \(\Delta\), while \(L^S_{b_m} = \emptyset\). Of course, \(\nu_m\) also decreases by \(\Delta\). Now suppose there exists a user-query \(m_1\) such that \(L^*_{a_{m_1}} = \{j_1\}\) with \(a_{m_1} = a_m\) and another user-query \(m_2\) such that \(L^*_{a_{m_2}} = \{j_1'\}\) with \(a_{m_2} = a_m\). Clearly, both \(m_1\) and \(m_2\) qualify to belong to the set \(M^+\). Also, due to Lemma 1, \(\min_{j_2}\{\bar{C}_{m_1j_1j_2}\} = \nu_{m_1}\) and \(\min_{j_2}\{\bar{C}_{m_2j_1'j_2}\} = \nu_{m_2}\). For the sake of simplicity, let us assume that (a) \(S_{a_{m_1} j} \geq \Delta\) for all \(j \neq j_1\) and (b) \(S_{a_{m_2} j} \geq \Delta\) for all \(j \neq j_1'\). Now, if the dual-ascent step is executed simply over \(m_1\) and \(m_2\), then both \(\nu_{m_1}\) and \(\nu_{m_2}\) will increase by \(\Delta\). Thus, even though \(\nu_m\) decreases by \(\Delta\), the net increase in the dual value is \(\Delta\). Of course, if \(S_{a_{m_1} j} < \Delta\) for some \(j \neq j_1\) and \(S_{a_{m_2} j} < \Delta\) for some \(j \neq j_1'\), then both \(\nu_{m_1}\) and \(\nu_{m_2}\) will increase by an amount less than \(\Delta\). However, this will result in \(S_{a_m j_1}\) and \(S_{a_m j_1'}\) both being positive after the initial dual-ascent step over \(M^+\). By including \(m\) into \(M^+\), the dual variable \(\nu_m\) also increases. Thus, the inclusion of \(m\) into \(M^+\) ensures that the dual at least reaches its original value.

4. AN EXAMPLE

We now illustrate the workings of the dual-ascent procedure by solving an example based on the data and network shown in Figures 2 and 3, respectively. To describe the problem further, let the nodes Chicago, San Francisco, New York, and Paris be indexed as 1, 2, 3, and 4, respectively. Similarly, the file types WINE_USA, WINE_FRANCE, and WEATHER are indexed as 1, 2, and 3, with respective file sizes of 12,000, 15,000, and 18,000 Mb. We now consider four user-queries: The users for user-queries \(m = 1, 2, 3,\) and \(4\) are, respectively, located at nodes \(i = 1, 2, 3,\) and \(4\). For \(m = 1\) and \(m = 2\), the queries require the join of file types 1 and 3, that is, \(a_1 = 1\), \(b_1 = 3\), \(a_2 = 1\), and \(b_2 = 3\). The queries associated with \(m = 3\) and \(m = 4\) require the join of file types 2 and 3, that is, \(a_3 = 2\), \(b_3 = 3\), \(a_4 = 2\), and \(b_4 = 3\). The frequencies of queries for the user-queries are \(f_m = [13, 35, 97, 90]\). Further, as discussed in Section 2, before performing a join operation, the relevant files are reduced by a factor \(\rho\) before shipping them to the join site. For \(m = 1\) and \(2\), \(\rho_1 = 0.85\) and \(\rho_3 = 0.75\) (file type 1 is reduced by 15% and file type 3 is reduced by 25%), while for \(m = 3\) and \(4\), \(\rho_2 = 0.35\) and \(\rho_3 = 0.35\). Similarly, the size of the result relation that is obtained after the join operation also undergoes a reduction operation. For \(m = 1\) and \(2\), the reduction factor is 0.12 (the result relation is 12% of the sum of the reduced files 1 and 3), and for \(m = 3\) and \(4\), the reduction factor is 0.28. Finally, the network is assumed to be packet-switched. Thus, the communication tariffs are independent of the distance between communicating sites. Here, all unit communication costs are assumed to be $0.04/Mb.

Table I presents the query costs associated with each user-query \(m\) and the location of file types \(a_m\) and \(b_m\). The query costs are calculated using (1). The table also presents, in parentheses, the optimal join site \(k\) associated with each of the query costs. It is instructive to note here that since the unit communication costs are the same for all links, the optimal join site simply depends on the relative reduced file sizes, \((\rho_{a_m} \cdot S_{a_m})\), \((\rho_{b_m} \cdot S_{b_m})\), and the result relation size \(S_m\). In particular, it is easy to see that the optimal join site \(k\) will be at \(j_1\), at \(j_2\), or at the user location \(i\). Consider, for instance, the case where \(m = 2\), \(j_1 = 2\), \(j_2 = 3\). This query originates from node 2 and involves the join of file types 1 and 3. Hence, \(\rho_1 \cdot S_1 = 10{,}200\) Mb, \(\rho_3 \cdot S_3 = 13{,}500\) Mb, and the result relation size for \(m = 2\) is \(0.12(10{,}200 + 13{,}500) = 2844\) Mb. Here, since \(10{,}200 + 2844 < 13{,}500\), the optimal join site is \(k = 3\), that is, the cheapest option is to send reduced file type 1 from node 2 to node 3, execute the join with file type 3 there, and then send the result relation back to node 2. The resulting query communication cost is computed to be \(0.04 \cdot 35 \cdot (10{,}200 + 2844) =\) $18,261. On the contrary, in the case of \(m = 3\), \(j_1 = 3\), \(j_2 = 1\), the optimal join site is \(k = 3\). To satisfy this query originating from node 3, the reduced file type 3 is sent from node 1 to node 3. At node 3, the join of file types 2 and 3 takes place, and therefore the result is available locally.

Table I also presents the sum of update and storage costs associated with each file type and location. To compute the update costs, the frequency of updates on a file type from a user location is assumed to be 30% of the frequency of queries using that file. The storage cost is assumed to be $0.1/Mb. Thus, for example, the update and storage cost associated with file type 3 at location 3 is \(0.3 \cdot 0.04 \cdot (13 + 35 + 90) \cdot 18{,}000 + 0.1 \cdot 18{,}000 =\) $31,608. Here, file type 3 is used for user-queries 1, 2, 3, and 4. However, since this file resides at node 3, to determine the update communication cost, we only consider the frequencies associated with user-queries 1, 2, and 4.

TABLE I. Query communication costs and update communication and storage costs

(a) Query communication costs ($); rows are j2, columns are j1; optimal join site k in parentheses

Query costs ($) for m = 1
j2\j1            1             2             3             4
1             0(1)       5304(1)       5304(1)       5304(1)
2          6782(2)       1478(2)       6782(2)       6782(2)
3          6782(3)       6782(3)       1478(3)       6782(3)
4          6782(4)       6782(4)       6782(4)       1478(4)

Query costs ($) for m = 2
j2\j1            1             2             3             4
1          3981(1)     18,261(1)     18,261(1)     18,261(1)
2        14,280(2)          0(2)     14,280(2)     14,280(2)
3        18,261(3)     18,261(3)       3981(3)     18,261(3)
4        18,261(4)     18,261(4)     18,261(4)       3981(4)

Query costs ($) for m = 3
j2\j1            1             2             3             4
1        12,547(1)     32,917(1)     24,444(3)     32,917(1)
2        32,917(2)     12,547(2)     24,444(3)     32,917(2)
3        20,370(3)     20,370(3)          0(3)     20,370(3)
4        32,917(4)     32,917(4)     24,444(3)     12,547(4)

Query costs ($) for m = 4
j2\j1            1             2             3             4
1        11,642(1)     30,542(1)     30,542(1)     22,680(4)
2        30,542(2)     11,642(2)     30,542(2)     22,680(4)
3        30,542(3)     30,542(3)     11,642(3)     22,680(4)
4        18,900(4)     18,900(4)     18,900(4)          0(4)

(b) Update communication and storage costs ($)

File type (d)    Location 1    Location 2    Location 3    Location 4
1                      6240          3072          8112          8112
2                    35,160        35,160        17,700        18,960
3                    49,752        45,000        31,608        33,120

In the dual-ascent procedure, we first initialize all slacks \(S_{dj}\) to equal the corresponding fixed (update and storage) costs \(F_{dj}\), and the dual variables \(\omega^{d}_{mj}\) and \(\nu_m\) to zero. The next three steps are executed iteratively for each \(m \in M\): For \(m = 1\), in Step 1, \((\omega^{1}_{1j_1}) = [6240, 3072, 8112, 8112]\) and \((\omega^{3}_{1j_2}) = [49{,}752, 45{,}000, 31{,}608, 33{,}120]\), while the slacks \((S_{1j_1})\) and \((S_{3j_2})\) are set to zero. The following revised cost matrix (rows \(j_2\), columns \(j_1\)) is obtained:

\[
(\bar{C}_{1j_1j_2}) =
\begin{bmatrix}
55{,}992 & 58{,}128 & 63{,}168 & 63{,}168 \\
58{,}022 & 49{,}550 & 59{,}894 & 59{,}894 \\
44{,}630 & 41{,}462 & 41{,}198 & 46{,}502 \\
46{,}142 & 42{,}974 & 48{,}014 & 42{,}710
\end{bmatrix}.
\]

Minimizing over the \((\bar{C}_{1j_1j_2})\) matrix above, we get \(\nu_1 = 41{,}198\). Next, in Step 2, we increase the slacks \((S_{1j_1})\) without decreasing \(\nu_1\) as follows: First, \(C^* = [44{,}630, 41{,}462, 41{,}198, 42{,}710]\) is obtained by minimizing over each column (or \(j_1\)) of the \((\bar{C}_{1j_1j_2})\) matrix. Next, as per Step 2.2, \(\Delta_1 = C^* - \nu_1 = [3432, 264, 0, 1512]\) and \(\Delta = \min\{\Delta_1, (\omega^{1}_{1j_1})\} = [3432, 264, 0, 1512]\) are obtained. It should now be apparent that \(\Delta\) represents the maximum amount by which each column in \((\bar{C}_{1j_1j_2})\) can be decreased without decreasing \(\nu_1\). Thus, as per Step 2.3, this \(\Delta\) amount is transferred from \((\omega^{1}_{1j_1})\) to \((S_{1j_1})\), resulting in \((\omega^{1}_{1j_1}) = [2808, 2808, 8112, 6600]\) and \((S_{1j_1}) = [3432, 264, 0, 1512]\). Similarly, in Step 3, the slacks \((S_{3j_2})\) are increased without decreasing \(\nu_1\). The \((\bar{C}_{1j_1j_2})\) matrix used is

\[
(\bar{C}_{1j_1j_2}) =
\begin{bmatrix}
52{,}560 & 57{,}864 & 63{,}168 & 61{,}656 \\
54{,}590 & 49{,}286 & 59{,}894 & 58{,}382 \\
41{,}198 & 41{,}198 & 41{,}198 & 44{,}990 \\
42{,}710 & 42{,}710 & 48{,}014 & 41{,}198
\end{bmatrix}.
\]

To obtain \(C^*\), we now determine the minimum value in each row of \((\bar{C}_{1j_1j_2})\). Thus, \(C^* = [52{,}560, 49{,}286, 41{,}198, 41{,}198]\). As per Step 3.2, \(\Delta_1 = [11{,}362, 8088, 0, 0]\) and \(\Delta = [11{,}362, 8088, 0, 0]\). As a result, \((\omega^{3}_{1j_2}) = [38{,}390, 36{,}912, 31{,}608, 33{,}120]\) and \((S_{3j_2}) = [11{,}362, 8088, 0, 0]\).

The next iterative step with \(m = 2\) begins by setting \((\omega^{1}_{2j_1}) = [3432, 264, 0, 1512]\) and \((\omega^{3}_{2j_2}) = [11{,}362, 8088, 0, 0]\), while \((S_{1j_1})\) and \((S_{3j_2})\) are set to zero. The resulting revised cost matrix obtained is

\[
(\bar{C}_{2j_1j_2}) =
\begin{bmatrix}
18{,}775 & 29{,}887 & 29{,}623 & 31{,}135 \\
25{,}800 & 8352 & 22{,}368 & 23{,}880 \\
21{,}693 & 18{,}525 & 3981 & 19{,}773 \\
21{,}693 & 18{,}525 & 18{,}261 & 5493
\end{bmatrix}.
\]

From Step 1.4, \(\nu_2 = 3981\). The reader can now verify that upon executing Steps 2 and 3, \((\omega^{1}_{2j_1}) = [0, 0, 0, 0]\), \((S_{1j_1}) = [3432, 264, 0, 1512]\), \((\omega^{3}_{2j_2}) = [0, 3981, 0, 0]\), and \((S_{3j_2}) = [11{,}362, 4107, 0, 0]\). For \(m = 3\), we begin by setting \((\omega^{2}_{3j_1}) = [35{,}160, 35{,}160, 17{,}700, 18{,}960]\) and \((\omega^{3}_{3j_2}) = [11{,}362, 4107, 0, 0]\), while \((S_{2j_1})\) and \((S_{3j_2})\) are set to zero. Then, by minimizing over the resulting \((\bar{C}_{3j_1j_2})\) matrix, we obtain \(\nu_3 = 17{,}700\). Next, we increase the slacks to \((S_{2j_1}) = [35{,}160, 34{,}114, 0, 13{,}807]\) and \((S_{3j_2}) = [6209, 0, 0, 0]\), while setting \((\omega^{2}_{3j_1}) = [0, 1046, 17{,}700, 5153]\) and \((\omega^{3}_{3j_2}) = [5153, 4107, 0, 0]\). Finally, with \(m = 4\), we begin by setting \((\omega^{2}_{4j_1}) = [35{,}160, 34{,}114, 0, 13{,}807]\), \((\omega^{3}_{4j_2}) = [6209, 0, 0, 0]\), and \((S_{2j_1}) = (S_{3j_2}) = [0, \ldots, 0]\). Now, by minimizing \((\bar{C}_{4j_1j_2})\) over \(j_1\) and \(j_2\), we obtain \(\nu_4 = 11{,}642\). With the execution of Steps 2 and 3, we obtain \((\omega^{2}_{4j_1}) = [0, 0, 0, 11{,}642]\), \((S_{2j_1}) = [35{,}160, 34{,}114, 0, 2165]\), \((\omega^{3}_{4j_2}) = [0, 0, 0, 0]\), and \((S_{3j_2}) = [6209, 0, 0, 0]\). This marks the
end of the first round of the dual-ascent step, with \(\sum_m \nu_m = 74{,}521\).

We now look for a feasible primal solution \(\{y^+, x^+\}\) by first examining the slacks \((S_{dj})\). They are \((S_{1j}) = [3432, 264, 0, 1512]\), \((S_{2j}) = [35{,}160, 34{,}114, 0, 2165]\), and \((S_{3j}) = [6209, 0, 0, 0]\). We now obtain the sets \(L^*_1 = \{3\}\), \(L^*_2 = \{3\}\), and \(L^*_3 = \{2, 3, 4\}\). These sets represent potential sites where copies of file type \(d\) can be kept so as to satisfy the complementary slackness condition (12). We now try to develop the set \(L^+_d\) from \(L^*_d\) by first identifying essential locations for each file type \(d\). For file type 1, clearly, node 3 is an essential location, since it is the only potential site in \(L^*_1\). Table II lists the \((\bar{C}_{mj_1j_2} - \nu_m)\) values for each \(m\), \(j_1\), and \(j_2\). It is clear from this table that for \(m = 1\), \(\min_{j_2 \in N}\{\bar{C}_{13j_2}\} = \nu_1\) at \(j_2 = 3\), and for \(m = 2\), \(\min_{j_2 \in N}\{\bar{C}_{23j_2}\} = \nu_2\) at \(j_2 = 3\). Hence, node 3 is a valid location for file type 1 in the sense that if \(x_{133} = 1\) and \(x_{233} = 1\), then the complementary slackness condition (13) associated with them would be satisfied. Thus, we set \(L^+_1 = \{3\}\), in effect setting \(y_{13} = 1\). Similarly, it is easy to see that for file type 2 also node 3 is an essential location. Again, from observing the \((\bar{C}_{mj_1j_2} - \nu_m)\) values for \(m = 3\) and \(m = 4\), it is a valid location. Hence, \(L^+_2 = \{3\}\) (or \(y_{23} = 1\)). However, observe that none of nodes 2, 3, or 4 in \(L^*_3\) is an essential location for file type 3. Hence, currently \(L^+_3 = \emptyset\). We now look for additional copies so as to ensure that all user-queries \(m\) are met while satisfying condition (13). Starting with \(m = 1\), since \(b_1 = 3\) and \(L^+_3\) is a null set, user-query 1 cannot be met. Therefore, we select a \(j_1^* \in L^*_1\) and a \(j_2^* \in L^*_3\) in the manner specified by (17). We get \(j_1^* = 3\) and \(j_2^* = 3\) and, accordingly, update \(L^+_3 = \{3\}\) (\(y_{33} = 1\)). Now, from (18), we obtain \(x_{133} = x_{233} = x_{333} = x_{433} = 1\) and the rest of the \(x\) variables equal to zero. There are no complementary slackness violations, and the primal objective value also is 74,521. Thus the primal solution obtained is optimal. We note that the optimal solution was identified without going into the dual-adjustment phase. This optimal solution involves placing a copy of WINE_USA, WINE_FRANCE, and WEATHER, all at New York. Further, all four queries access copies of these files at New York, with the joins also occurring at New York.

TABLE II. The set of (C̄_mj1j2 - ν_m) values for each m, j1, j2 (rows are j2, columns are j1)

m = 1
j2\j1          1          2          3          4
1              0       5304     10,608       9096
2           5304          0     10,608       9096
3              0          0          0       3792
4           1512       1512       6816          0

m = 2
j2\j1          1          2          3          4
1              0     14,280     14,280     14,280
2         14,280          0     14,280     14,280
3         14,280     14,280          0     14,280
4         14,280     14,280     14,280          0

m = 3
j2\j1          1          2          3          4
1              0     21,416     29,597     25,523
2         19,324          0     28,551     24,477
3           2670       3716          0       7823
4         15,217     16,263     24,444          0

m = 4
j2\j1          1          2          3          4
1              0     18,900     18,900     22,680
2         18,900          0     18,900     22,680
3         18,900     18,900          0     22,680
4           7258       7258       7258          0

5. COMPUTATIONAL EXPERIMENTS

In this section, we present results of our computational experiments that test our dual-ascent procedure on several randomly generated problems with varying characteristics. The purpose of these experiments was (1) to evaluate the quality of the solutions generated by the dual-ascent procedure and (2) to test the ability of the procedure to solve large problems quickly. The solution procedure was implemented in FORTRAN and the testing was performed on an IBM 3090 mainframe computer. A total of 36 test problem types were considered, the characteristics of which are highlighted in Tables III and IV.

Table III is restricted to problems where the number of file types in the system is two. The problems in Table IV allow for a greater number of file types. The number of locations considered ranges from 10 to 30, while the number of file types varies from 10 to 50. The number of queries in the system considered was 100 and 1000. In all the test problems, it was assumed that every user location is also a potential file location. Furthermore, a user can access every potential file location. This results in a large number of \(x\) variables, as shown in Tables III and IV.

TABLE III. Dimension of test problems: two-files scenario

Problem No.   No. Nodes   No. Queries   No. y Variables   No. x Variables   No. Constraints
1                    10           100                20            10,000              2100
2                    10          1000                20           100,000            21,000
3                    20           100                40            40,000              4100
4                    20          1000                40           400,000            41,000
5                    30           100                60            90,000              6100
6                    30          1000                60           900,000            61,000

TABLE IV. Dimension of test problems: multifile scenario

Problem No.   No. Nodes   No. Files   No. Queries   No. y Variables   No. x Variables   No. Constraints
7                    10          10           100               100            10,000              2100
8                    10          10          1000               100           100,000            21,000
9                    10          20           100               200            10,000              2100
10                   10          20          1000               200           100,000            21,000
11                   10          30           100               300            10,000              2100
12                   10          30          1000               300           100,000            21,000
13                   10          40           100               400            10,000              2100
14                   10          40          1000               400           100,000            21,000
15                   10          50           100               500            10,000              2100
16                   10          50          1000               500           100,000            21,000
17                   20          10           100               200            40,000              4100
18                   20          10          1000               200           400,000            41,000
19                   20          20           100               400            40,000              4100
20                   20          20          1000               400           400,000            41,000
21                   20          30           100               600            40,000              4100
22                   20          30          1000               600           400,000            41,000
23                   20          40           100               800            40,000              4100
24                   20          40          1000               800           400,000            41,000
25                   20          50           100              1000            40,000              4100
26                   20          50          1000              1000           400,000            41,000
27                   30          10           100               300            90,000              6100
28                   30          10          1000               300           900,000            61,000
29                   30          20           100               600            90,000              6100
30                   30          20          1000               600           900,000            61,000
31                   30          30           100               900            90,000              6100
32                   30          30          1000               900           900,000            61,000
33                   30          40           100              1200            90,000              6100
34                   30          40          1000              1200           900,000            61,000
35                   30          50           100              1500            90,000              6100
36                   30          50          1000              1500           900,000            61,000

In all the problems, the query transmission costs \(C_{mj_1j_2}\) were generated as follows: If \(i = j_1 = j_2\), that is, the user location and the location of both file types coincide, then \(C_{mj_1j_2} = 0\). Otherwise, the volume of query \(m\) was first obtained as the sum of a percentage of the sizes of file types \(a_m\) and \(b_m\). This is consistent with the observation that, in general, the larger the file, the more frequent are the queries on it. These percentages were randomly generated in the range from 10 to 100%. The total volume of the query was now determined as the product of the percentage of size and the frequency of usage, which is randomly generated in the range from 10 to 100. Since the result of joining multiple files is usually less than its sum, the total volume of the query was multiplied by a reduction factor ranging from 0.01 to 0.5. The test problems also assume that the networks are packet-switched. Hence, all unit query communication costs were
set to equal $0.04/kilobit. The update cost, \(CU_{dj}\), is the result of transmitting updates from all user locations \(i\) to file location \(j\), if file \(d\) is kept there. The volume of updates was assumed to be 10, 20, 30, 40, and 50% of the volume of the query that requires file \(d\).

The file sizes (megabytes) and the unit storage costs ($/megabyte) were randomly generated in the ranges [6000, 12,000] and [0.04, 0.1], respectively. The storage costs, \(CS_{dj}\), are the product of the file size and the unit storage cost. The fixed cost \(F_{dj}\) is now obtained as the sum of the update and storage costs. For each problem type listed in Tables III and IV, five test cases were generated. These cases differed from each other in the ratio of the update to query traffic. These ratios were 10, 20, 30, 40, and 50%. Tables V and VI present results on the quality of the solutions obtained after the application of the dual-ascent procedure. Also presented are the computational times taken in terms of CPU seconds. Here, solution quality is measured as the percentage duality gap between the best primal value \(Z^*_P\) and the best dual value \(Z^*_D\), that is, \(\%\,\mathrm{GAP} = 100\,(Z^*_P - Z^*_D)/Z^*_D\).

The average percentage gap ranged from 0 to 0.68%. Clearly, these gaps are extremely low, indicating that the dual-ascent procedure is successful in obtaining very close to optimal solutions. From these results, there seems to be no discernible pattern as to how the percentage gap varies with the number of nodes, the number of files, or the number of queries. Equally important, the results indicate that near-optimal primal and dual solutions are obtained very swiftly. The average time taken to execute the dual-ascent procedure and the add-drop heuristic ranged from 0.015 to 52.204 seconds. These results suggest that the dual-ascent procedure is indeed very useful as a fast heuristic, which, in addition, also provides performance guarantees, especially for very large sized problems. If an optimal solution is desired, then the dual-ascent procedure can be easily embedded into a branch-and-bound procedure. Since the gaps obtained were uniformly so small, it is conjectured that an optimal solution can be found and verified by enumerating a relatively small number of branch-and-bound nodes. Hence, an optimal solution can be found in a reasonable amount of time.

TABLE V. Solution quality and CPU times: two-files scenario

Problem No.   Average Percentage Duality Gap   Average CPU Time (seconds)
1                                      0.000                        0.015
2                                      0.021                        0.231
3                                      0.124                        0.813
4                                      0.001                        1.648
5                                      0.123                        1.582
6                                      0.000                        4.439

TABLE VI. Solution quality and CPU times: multifiles scenario

Problem No.   Average Gap (%)   Average CPU Time (seconds)
7                       0.011                        0.026
8                       0.041                        0.337
9                       0.021                        0.133
10                      0.655                         7.14
11                      0.435                         0.21
12                      0.475                         8.67
13                      0.631                         0.11
14                      0.014                         0.78
15                      0.428                         0.08
16                      0.02                          3.56
17                      0.023                         1.02
18                      0.075                        24.96
19                      0.09                          1.15
20                      0.9                          41.12
21                      0.253                         1.23
22                      0.021                        36.46
23                      0.093                         0.29
24                      0.249                        32.37
25                      0.044                         0.38
26                      0.25                         17.31
27                      0.488                         5.09
28                      0.679                        11.94
29                      0.152                         8.03
30                      0.001                         8.65
31                      0.111                         1.06
32                      0.012                        52.20
33                      0.049                         3.37
34                      0.051                         8.15
35                      0.053                         0.72
36                      0.027                        36.29

The performance of the dual-ascent procedure was also compared to MPSX/370 V2. MPSX/370 V2 has one of the most efficient implementations of the simplex method for large-sized LP problems. While comparing the dual-ascent procedure to MPSX/370 V2, the simplex routine OPTIMIZE was used to solve the LP relaxation of (P). This was done to make the comparison a fair one, as both solve the LP relaxation of (P). The efficient implementation features of MPSX/370 V2 are (1) the OPTIMIZE procedure, which uses the LU factorization method, which is most useful when the problem is very large and sparse, and which also takes advantage of dynamic alteration of pricing and cycling routines during the optimization process, and (2) the Vector Facility Support feature, which incorporates the use of vectorization in the pricing-out step to speed up the overall run time. The performance comparison of the dual-ascent procedure to MPSX/370 V2 is presented in Table VII. For illustration purposes, only the problems with two file types were tested. For all the problems tested, the dual-ascent procedure was found to be more than an order of magnitude faster than MPSX. The dual-ascent procedure outperformed MPSX/370 V2 by factors ranging from about 30 to more than 100. More significantly, on close examination, it was found that as the problem size increased, the factor by which the dual-ascent procedure outperformed MPSX/370 V2 also increased. This indicates that the time taken by the dual-ascent procedure proposed in this research increases at a slower rate with the problem size than that of MPSX/370 V2.

TABLE VII. A performance comparison of the dual-ascent procedure and MPSX: two-files scenario

                             Average CPU Time in Seconds
No. Nodes   No. Queries   Dual-ascent Procedure   MPSX/370 V2
10                   10                   0.003           1.2
10                  100                   0.015           7.8
20                   20                   0.161           4.8
20                  100                   0.818          53.7
30                   30                   0.265          17.7
30                  100                   1.582         186.9

6. CONCLUSIONS

In this paper, a methodology was proposed that attempts to unify the two aspects of distributed database systems: the file allocation and the query optimization problems. We assume that each query requires the processing of at most two file types. In our methodology, we developed an integer linear program that simultaneously determines the locations where data files are to be stored and the sites where the joins of two files are to occur for each query. We presented a dual-ascent procedure that solves the integer program approximately. Our computational results show that the dual-ascent procedure is able to quickly solve even large-scale problems to near optimality.

While the primary purpose of this paper was to address a specific problem, the File Allocation and Join Site Selection Problem with 2-way joins [FAJSP-2], it is indeed interesting to note that this problem is closely related to certain other well-known problems in Distributed Computing Systems and Facility Location. We think that it is important to know these connections since the solution procedure developed for [FAJSP-2] can be suitably modified to solve these other problems also.

A problem in Distributed Computing Systems that [FAJSP-2] is closely related to is the File Allocation and Report Assignment problem introduced by Ramesh and Ryan [25]. In several large corporations, generating and transmitting reports is an important function of the information system department. These reports typically require the pooling and processing of data from several databases that are located at various computer sites. Clearly, the report-generating function has an impact on where the files are placed due to the cost of transmitting the data from the file sites to the location where the report is to be prepared and the cost of transmitting the report to the end user. Given a set of file types needed to prepare a report, the site where the report is prepared is analogous to the join site in [FAJSP-2]. If the number of files needed to prepare a report is limited to two, then the resulting problem is equivalent to [FAJSP-2].

It is well known that File Allocation and Facility Location problems are closely related. For instance, the Simple File Allocation Problem is analogous to the Uncapacitated Facility Location Problem [28]. Another type of location problem, the Two Echelon Distribution Facility Location Problem [TEDFLP], is closely related to [FAJSP-2]. In [TEDFLP], one level of facilities represents distribution centers where multiple products are consolidated. The next level of facilities are market-oriented warehouses that serve retailers or final consumer demands. The objective here is to determine the location of facilities at each echelon that minimizes total cost. This problem was well described by Gao and Robinson [11], who assumed that all facilities are homogeneous, in that all the products flow in a simple route through the two echelons to the final customer. However, oftentimes, the second level of facilities, that is, the warehouses that serve retailers or final customers, may be used to store not just a single product type but several product types. Even though these product types may be similar, they may come from different types of facilities at the higher echelon. This would be so especially if the higher-echelon facilities consisted of manufacturing/assembly units. Such a problem requiring two facility types at the higher echelon is analogous to [FAJSP-2].

A worthwhile future research endeavor is to develop suitable heuristics for the more general File Allocation and Join Site Selection Problem in which several queries may require the joining of more than two file types, as illustrated in Figure 1. We are currently investigating suitable ways of decomposing the general problem into several [FAJSP-2] or 2-way join problems. One approach under investigation is a top-down approach, where, for example, assuming temporary relations \(T_1\) and \(T_2\) to be result relations, the file placements of \(R_1\), \(R_2\), \(R_3\), and \(R_4\) are obtained for each location. Then, assuming \(T_3\) to be a result relation, the locations for \(T_1\) and \(T_2\) are determined. In a similar way, the problem may also be decomposed in a bottom-up fashion.
The authors gratefully acknowledge the contribution of the referees and the associate editor toward substantial improvements in the presentation of the paper. The editorial assistance of Deb. Ghosh is also gratefully accepted.

APPENDIX A: THE DUAL-ADJUSTMENT STEP

[A] Dual adjustment on \(\omega^{a_m}_{mj_1}\), if \(\omega^{a_m}_{mj_1} > 0\), \(y_{a_m j_1} = 1\), \(\sum_{j_2} x_{mj_1j_2} = 0\)

STEP 1. [INITIALIZATION]
1.1. Set \(L^S_{a_m} \leftarrow \emptyset\), \(L^S_{b_m} \leftarrow \emptyset\), \(M^+ \leftarrow \emptyset\).

STEP 2. [DECREASE DUALS \(\nu_m\), \(\omega^{a_m}_{mj_1}\)]
2.1. Set \(\nu_m \leftarrow \nu_m - \omega^{a_m}_{mj_1}\), \(S_{a_m j_1} \leftarrow S_{a_m j_1} + \omega^{a_m}_{mj_1}\).
2.2. Set \(\omega^{a_m}_{mj_1} \leftarrow 0\), \(L^S_{a_m} \leftarrow L^S_{a_m} \cup \{j_1\}\).

STEP 3. [INCREASE SLACKS \(S_{a_m j_1'}\)]
For each \(j_1' \neq j_1\), do the following:
3.1. Set \(C^* \leftarrow \min_{j_2' \in N}\{\bar{C}_{mj_1'j_2'}\}\).
3.2. Set \(\Delta_1 \leftarrow C^* - \nu_m\), and \(\Delta = \min\{\Delta_1, \omega^{a_m}_{mj_1'}\}\).
3.3. Set \(\omega^{a_m}_{mj_1'} \leftarrow \omega^{a_m}_{mj_1'} - \Delta\), \(S_{a_m j_1'} \leftarrow S_{a_m j_1'} + \Delta\).
3.4. If \(\Delta > 0\), then set \(L^S_{a_m} \leftarrow L^S_{a_m} \cup \{j_1'\}\).

STEP 4. [INCREASE SLACKS \(S_{b_m j_2'}\)]
For each \(j_2' \in N\), do the following:
4.1. Set \(C^* \leftarrow \min_{j_1' \in N}\{\bar{C}_{mj_1'j_2'}\}\).
4.2. Set \(\Delta_1 \leftarrow C^* - \nu_m\), and \(\Delta = \min\{\Delta_1, \omega^{b_m}_{mj_2'}\}\).
4.3. Set \(\omega^{b_m}_{mj_2'} \leftarrow \omega^{b_m}_{mj_2'} - \Delta\), \(S_{b_m j_2'} \leftarrow S_{b_m j_2'} + \Delta\).
4.4. If \(\Delta > 0\), then set \(L^S_{b_m} \leftarrow L^S_{b_m} \cup \{j_2'\}\).

STEP 5. [DUAL-ASCENT STEP]
5.1. For each \(m' \neq m\), if either \(L^*_{a_{m'}} \subset L^S_{a_m}\), where \(a_{m'} = a_m\), or \(L^*_{b_{m'}} \subset L^S_{b_m}\), where \(b_{m'} = b_m\), then set \(M^+ \leftarrow M^+ \cup \{m'\}\).
5.2. Perform the dual-ascent step over \(M^+\).
5.3. Set \(M^+ \leftarrow M^+ \cup \{m\}\), and repeat the dual-ascent step over \(M^+\).

[B] Dual adjustment on \(\omega^{b_m}_{mj_2}\), if \(\omega^{b_m}_{mj_2} > 0\), \(y_{b_m j_2} = 1\), \(\sum_{j_1} x_{mj_1j_2} = 0\)

STEP 1. [INITIALIZATION]
1.1. Set \(L^S_{a_m} \leftarrow \emptyset\), \(L^S_{b_m} \leftarrow \emptyset\), \(M^+ \leftarrow \emptyset\).

STEP 2. [DECREASE DUALS \(\nu_m\), \(\omega^{b_m}_{mj_2}\)]
2.1. Set \(\nu_m \leftarrow \nu_m - \omega^{b_m}_{mj_2}\), \(S_{b_m j_2} \leftarrow S_{b_m j_2} + \omega^{b_m}_{mj_2}\).
2.2. Set \(\omega^{b_m}_{mj_2} \leftarrow 0\), \(L^S_{b_m} \leftarrow L^S_{b_m} \cup \{j_2\}\).

STEP 3. [INCREASE SLACKS \(S_{b_m j_2'}\)]
For each \(j_2' \neq j_2\), do the following:
3.1. Set \(C^* \leftarrow \min_{j_1' \in N}\{\bar{C}_{mj_1'j_2'}\}\).
3.2. Set \(\Delta_1 \leftarrow C^* - \nu_m\), and \(\Delta = \min\{\Delta_1, \omega^{b_m}_{mj_2'}\}\).
3.3. Set \(\omega^{b_m}_{mj_2'} \leftarrow \omega^{b_m}_{mj_2'} - \Delta\), \(S_{b_m j_2'} \leftarrow S_{b_m j_2'} + \Delta\).
3.4. If \(\Delta > 0\), then set \(L^S_{b_m} \leftarrow L^S_{b_m} \cup \{j_2'\}\).

STEP 4. [INCREASE SLACKS \(S_{a_m j_1'}\)]
For each \(j_1' \in N\), do the following:
4.1. Set \(C^* \leftarrow \min_{j_2' \in N}\{\bar{C}_{mj_1'j_2'}\}\).
4.2. Set \(\Delta_1 \leftarrow C^* - \nu_m\), and \(\Delta = \min\{\Delta_1, \omega^{a_m}_{mj_1'}\}\).
4.3. Set \(\omega^{a_m}_{mj_1'} \leftarrow \omega^{a_m}_{mj_1'} - \Delta\), \(S_{a_m j_1'} \leftarrow S_{a_m j_1'} + \Delta\).
4.4. If \(\Delta > 0\), then set \(L^S_{a_m} \leftarrow L^S_{a_m} \cup \{j_1'\}\).

STEP 5. [DUAL-ASCENT STEP]
5.1. For each \(m' \neq m\), if either \(L^*_{a_{m'}} \subset L^S_{a_m}\), where \(a_{m'} = a_m\), or \(L^*_{b_{m'}} \subset L^S_{b_m}\), where \(b_{m'} = b_m\), then set \(M^+ \leftarrow M^+ \cup \{m'\}\).
5.2. Perform the dual-ascent step over \(M^+\).
5.3. Set \(M^+ \leftarrow M^+ \cup \{m\}\), and repeat the dual-ascent step over \(M^+\).

REFERENCES

[1] P. Apers, Data allocation in distributed database systems, ACM Trans Database Syst 13 (1988), 263-304.
[2] P. Apers, A. Hevner, and S. Yao, Optimization algorithms for distributed queries, IEEE Trans Software Eng SE-9 (1983), 57-68.
[3] A. Balakrishnan, T. Magnanti, and R. Wong, A dual-ascent procedure for large-scale uncapacitated network design, Oper Res 37 (1989), 716-740.
[4] S. Ceri and G. Pelagatti, Allocation of operations in distributed database access, IEEE Trans Comput C-31 (1982), 119-128.
[5] W. Chu, Optimal file allocation in a multiple computer system, IEEE Trans Comput C-18 (1969), 865-889.
[6] W. Chu and P. Hurley, Optimal query processing for distributed database systems, IEEE Trans Comput C-31 (1982), 135-150.
[7] C. Cornell and P. Yu, On optimal site assignment for relations in the distributed database environment, IEEE Trans Software Eng 15 (1989), 1004-1009.
[8] D. Erlenkotter, A dual-based procedure for uncapacitated facility location, Oper Res 26 (1978), 992-1009.
[9] M. Fisher and D. Hochbaum, Database location in computer networks, J ACM 27 (1980), 718-735.
[10] M. Fisher, R. Jaikumar, and L. Van Wassenhove, A multiplier adjustment method for the generalized assignment problem, Mgmt Sci 32 (1986), 1095-1103.
[11] L. Gao and E. Robinson, A dual-based optimization procedure for the two-echelon uncapacitated facility location problem, Naval Res Log 39 (1992), 191-212.
[12] B. Gavish and O.R. Liu Sheng, Dynamic file migration in distributed computer systems, Commun ACM 33 (1990), 177-189.
[13] B. Gavish and H. Pirkul, Computer and database location in distributed computer systems, IEEE Trans Comput C-35 (1987), 583-590.
[14] B. Gavish and A. Segev, Set query optimization in distributed database systems, ACM Trans Database Syst 11 (1986), 265-293.
[15] D. Ghosh, I. Murthy, and A. Moffett, File allocation problem: comparison of models with worst case and average communication delays, Oper Res 40 (1992), 1074-1085.
[16] A. Hevner and S. Yao, Query processing in distributed database systems, IEEE Trans Software Eng SE-5 (1979), 177-187.
[17] H. Katzan, Jr., Distributed information systems, Petrocelli, New York, 1979.
[18] L.J. Laning and M.S. Leonard, File allocations in a distributed computer communications network, IEEE Trans Comput C-32 (1983), 232-244.
[19] K.D. Levin and H.D. Morgan, Optimal program and data locations in computer networks, Commun ACM 20 (1976), 315-321.
[20] S. Mahmoud and J.S. Riordan, Optimal allocation of resources in distributed information networks, ACM Trans Database Syst 1 (1976), 66-78.
[21] S. Manning and R.W. Peebles, A homogeneous network for data sharing communications, Comput Network (1977), 211-224.
[22] I. Murthy and D. Ghosh, File allocation involving worst case response times and link capacities: model and solution procedure, Eur J Oper Res 67 (1993), 418-427.
[23] I. Murthy and P. Seo, A nested dual ascent procedure for the dynamic single file allocation problem, INFOR 31 (1993), 186-204.
[24] S. Ram and P. Narasimhan, Database allocation in a distributed environment: incorporating a concurrency control mechanism and queuing costs, Mgmt Sci 40 (1994), 969-983.
[25] R. Ramesh and B. Ryan, Optimal file allocation and report assignment in distributed information networks, Naval Res Log 37 (1990), 165-181.
[26] A. Segev, Optimization of join operations in horizontally partitioned database systems, ACM Trans Database Syst 11 (1986), 48-80.
[27] P. Seo, File allocation and join site selection problem in distributed database systems, unpublished Ph.D. dissertation, Louisiana State University, 1994.
[28] B. Wah, File placement on distributed computer systems, IEEE Trans Comput 17 (1984), 23-30.
[29] E. Wong, Retrieving dispersed data from SDD-1: A system for distributed databases, Second International Berkeley Workshop on Distributed Data Management and Computer Networks, 1977.
[30] R. Wong, A dual ascent approach for Steiner tree problem on a directed graph, Math Program 28 (1984), 271-287.
[31] P.S. Yu, H.V. Heiss, and S. Lee, Workload characterization of relation database environments, IBM research report no. RC 4675, IBM Research Division, T.J. Watson Research Center, Yorktown Heights, NY 10598, 1989.

