Oracle & Distributed Databases

Sikkim Manipal University

Unit 6    Distributed Database Design and Query Processing

Structure:
6.1   Introduction
      Objectives
6.2   A Framework for Distributed Database Design
      6.2.1 Objectives of the Design of Data Distribution
      6.2.2 Top-Down and Bottom-Up Approach: Classical Design Methodologies
      Self Assessment Questions
6.3   The Design of Database Fragmentation
      6.3.1 Horizontal Fragmentation
            6.3.1.1 Primary Fragmentation
            6.3.1.2 Derived Horizontal Fragmentation
      6.3.2 Vertical Fragmentation
      6.3.3 Mixed Fragmentation
      Self Assessment Questions
6.4   The Allocation of Fragments
      Self Assessment Questions
6.5   Query Processing Problem
      Self Assessment Questions
6.6   Objectives of Query Processing
      Self Assessment Questions
6.7   Characterization of Query Processors
      Self Assessment Questions
6.8   Layers of Query Processing
      6.8.1 Query Decomposition
      6.8.2 Data Localization
      6.8.3 Global Query Optimization
      6.8.4 Local Query Optimization
      Self Assessment Questions
6.9   Summary
6.10  Terminal Questions
6.11  Answers to Self Assessment Questions

6.1 Introduction
Data distribution is difficult to design and implement because of various technical and organizational issues, so an efficient design methodology is needed. On the technical side, the issues are the interconnection of sites and the appropriate distribution of data and applications to the sites, depending on the requirements of the applications and with the aim of optimizing performance. On the organizational side, the issue of decentralization is crucial, since distributing an application has a great effect on the organization.

The increasing success of relational database technology in data processing is due, in part, to the availability of nonprocedural languages, which can significantly improve application development and end-user productivity. Query processing is important in both centralized and distributed systems. However, the query processing problem is much more difficult in distributed environments than in conventional systems. In particular, the relations involved in distributed queries may be fragmented and/or replicated, thereby inducing communication overhead costs.

Objectives:
By the end of Unit 6, learners should be able to describe:
o A framework for distributed database design
o The objectives of the design of data distribution
o Top-Down and Bottom-Up design approaches
o The design of database fragmentation: Horizontal Fragmentation, Vertical Fragmentation and Mixed Fragmentation
o The allocation of fragments and the general criteria for fragment allocation
o Various problems of query processing
o An ideal query processor
o The concept of layering in query processing

6.2 A Framework for Distributed Database Design


The design of a centralized database concentrates on:
o Designing the conceptual schema, which describes the complete database.
o Designing the physical database, which maps the conceptual schema to the storage areas and determines the appropriate access methods.

In a distributed database, these two steps become the design of the global schema and the design of the local databases. The added steps are:
o Designing the fragmentation: the actual procedure of dividing the existing global relations into horizontal, vertical or mixed fragments.
o Designing the allocation of fragments: allocating the fragments to sites according to the site requirements.

Before designing a distributed database, a thorough knowledge of the applications is a must. The designer is expected to know the following for each application:
o Site of origin: the site from which the application is issued.
o The frequency of invoking the request at each site.
o The number, type and statistical distribution of the accesses made by each application to each required data item.
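The application information listed above can be recorded in a small data structure and used to compare candidate allocations. The following is a minimal sketch, not a prescribed method: the `Access` record, the `locality` function, and the site, application and fragment names are all hypothetical, and the count of local versus remote references is only one simple way to measure processing locality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    """One application's use of one fragment, issued from one site."""
    application: str
    site_of_origin: str
    fragment: str
    frequency: int   # activations per unit time at this site
    accesses: int    # references to the fragment per activation


def locality(profile, allocation):
    """Return (local, remote) reference counts for a given allocation.

    allocation maps a fragment name to the set of sites holding a copy.
    """
    local = remote = 0
    for a in profile:
        refs = a.frequency * a.accesses
        if a.site_of_origin in allocation.get(a.fragment, set()):
            local += refs
        else:
            remote += refs
    return local, remote


# Hypothetical application profile and allocation.
profile = [
    Access("billing", "site1", "CUSTOMER_1", frequency=10, accesses=3),
    Access("billing", "site2", "CUSTOMER_1", frequency=2, accesses=3),
    Access("report", "site2", "CUSTOMER_2", frequency=5, accesses=4),
]
allocation = {"CUSTOMER_1": {"site1"}, "CUSTOMER_2": {"site2"}}

local_refs, remote_refs = locality(profile, allocation)
print(local_refs, remote_refs)  # 50 6
```

Adding a second copy of CUSTOMER_1 at site2 would turn the 6 remote references into local ones, which is the effect of redundant allocation on processing locality.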

6.2.1 Objectives of the Design of Data Distribution
In the design of data distribution, the following objectives should be considered.
o Processing Locality: The primary aim of data distribution is to maximize local references and thereby reduce remote references. This can be achieved by a redundant allocation of fragments that meets the site requirements. Complete locality is an extension of this idea, in which an application can be executed entirely at its site of origin; this simplifies the execution of the application.
o Availability and Reliability of Distributed Data: Availability is achieved by having multiple copies of the data for read-only applications. Reliability is achieved by storing multiple copies of the information, which helps in case of system crashes.
o Workload Distribution: Distributing the workload over the sites is a major goal, since it allows a high degree of parallelism.
o Storage Costs and Availability: Storage costs and the availability of storage areas at each site should be handled carefully for effective data distribution.

Using all of the above criteria at once may increase the design complexity. So the most important aspects are taken as objectives, depending on the requirements, and the others are treated as constraints. The sections that follow take a simple approach that aims at maximizing processing locality.

6.2.2 Top-Down and Bottom-Up Approach: Classical Design Methodologies
There are two classical approaches to distributed database design:

1. Top-Down Approach: This is useful when the system has to be designed from scratch. The steps are:
   o Design of the global schema.
   o Design of the fragmentation schema.
   o Design of the allocation schema.
   o Design of the local schemas (design of the physical databases).
2. Bottom-Up Approach: This is used for an existing system. It is based on the integration of existing schemata into a single global schema, and requires the following:
   o The selection of a common database model for describing the global schema of the database.
   o The translation of each local schema into the common data model.
   o The integration of the local schemata into a common global schema, i.e. the merging of common data definitions and the resolution of conflicts among different representations given to the same data.
   The bottom-up design requires solving these three problems. The design steps then run in the reverse order of the top-down method.

Self Assessment Questions 6.2
1. ________ is the actual procedure of dividing the existing global relations into horizontal, vertical or mixed fragments.
2. In the objectives of the design of data distribution, ________ is an extended idea, which simplifies the execution of an application.
3. There are ________ classical approaches to distributed database design.
4. The design of the global schema, fragmentation schema, allocation schema and local schema are the steps of the ________ approach.
6.3 The Design of Database Fragmentation


Here we discuss the design of non-overlapping fragments, which are the logical units of allocation. An efficient design methodology at this stage helps to avoid problems later, during allocation. In the following, we explain the design of horizontal, vertical and mixed fragmentation.

6.3.1 Horizontal Fragmentation
Here we discuss two important methods, called primary and derived. Determining the horizontal fragmentation involves knowing:
o The logical properties of the data, such as the fragmentation predicates.
o The statistical properties of the data, such as the number of references of applications to the fragments.

6.3.1.1 Primary Fragmentation
The correctness of primary fragmentation requires that each tuple of the global relation be selected in one and only one fragment. Thus, determining the primary horizontal fragmentation of a global relation requires determining a set of disjoint and complete selection predicates: complete means that every tuple satisfies at least one predicate, and disjoint means that no tuple satisfies more than one. The property we expect from each fragment is that its elements are referenced homogeneously by all applications.

6.3.1.2 Derived Horizontal Fragmentation
A derived horizontal fragmentation is not based on the properties of the relation's own attributes; it is derived from the horizontal fragmentation of another relation. It is used to facilitate the join between fragments. A distributed join is a join between horizontally fragmented relations.

6.3.2 Vertical Fragmentation
Vertical fragmentation requires grouping the attributes into sets that are referenced in a similar manner by applications. It has been discussed by considering two separate types of problems:
The Vertical Partitioning Problem: Here the sets of attributes must be disjoint, except for the key attribute, which must be common to all fragments so that the relation can be reconstructed. For example, assume that a relation S is vertically fragmented in this way into S1 and S2. This is useful when an application can be executed using either S1 or S2 alone. Without the fragmentation, the site would have to hold the complete S, which may be an unnecessary burden.
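Vertical partitioning of the relation S into S1 and S2 can be sketched as follows. This is an illustration under stated assumptions: the attribute names, the sample rows and the helper functions `vertical_fragment` and `reconstruct` are hypothetical, and `id` is assumed to be the key that both fragments share.

```python
def vertical_fragment(rows, key, attribute_sets):
    """Project each row onto the key plus one attribute set per fragment."""
    fragments = []
    for attrs in attribute_sets:
        columns = [key] + [a for a in attrs if a != key]
        fragments.append([{c: row[c] for c in columns} for row in rows])
    return fragments


def reconstruct(fragments, key):
    """Rebuild the relation by joining the fragments on the shared key."""
    merged = {}
    for fragment in fragments:
        for row in fragment:
            merged.setdefault(row[key], {}).update(row)
    return [merged[k] for k in sorted(merged)]


# Hypothetical global relation S with key "id".
S = [
    {"id": 1, "name": "Asha", "city": "Gangtok", "salary": 100},
    {"id": 2, "name": "Ravi", "city": "Imphal", "salary": 120},
]

S1, S2 = vertical_fragment(S, "id", [["name", "city"], ["salary"]])
print(S1[0])  # {'id': 1, 'name': 'Asha', 'city': 'Gangtok'}
print(reconstruct([S1, S2], "id") == S)  # True
```

An application that only needs names and cities can run against S1 alone, while the join on `id` shows why the key has to be present in every fragment.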

Two design approaches are possible:
1. The Split Approach: The global relations are progressively split into fragments.
2. The Grouping Approach: The attributes are progressively aggregated to constitute fragments.
Both are heuristic approaches, as each iteration step looks for the best choice. In both cases, formulas are used to indicate the best possible splitting or grouping.

[Figure 6.1: The different possible join graphs]
The Vertical Clustering Problem: Here the sets can overlap. Depending on the requirements, two different fragments of a global relation may have more than one attribute in common. This introduces replication within fragments, since the common attributes are present in several fragments. It is best suited to read-only applications, because an application that updates the common attributes must access all the sites where these attributes are stored. Therefore, vertical clustering is suggested where the overlapping attributes are not heavily updated.

6.3.3 Mixed Fragmentation
The simplest ways of performing mixed fragmentation are:
o Apply horizontal fragmentation to vertical fragments.
o Apply vertical fragmentation to horizontal fragments.

These two cases are illustrated in Figures 6.2 and 6.3 for a relation with attributes A1 to A5.

[Figure 6.2: Vertical fragmentation followed by horizontal fragmentation]

[Figure 6.3: Horizontal fragmentation followed by vertical fragmentation]
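The first case, vertical fragmentation followed by horizontal fragmentation, can be sketched on a relation with attributes A1 to A5. The sample values, the choice of A1 as the key, the split of attributes and the predicate on A3 are all assumptions made for illustration.

```python
# Hypothetical relation with attributes A1..A5; A1 is assumed to be the key.
R = [
    {"A1": 1, "A2": "x", "A3": 10, "A4": "p", "A5": 0.5},
    {"A1": 2, "A2": "y", "A3": 20, "A4": "q", "A5": 1.5},
    {"A1": 3, "A2": "z", "A3": 30, "A4": "r", "A5": 2.5},
]


def project(rows, attrs):
    """Vertical operation: keep only the named attributes."""
    return [{a: row[a] for a in attrs} for row in rows]


def select(rows, predicate):
    """Horizontal operation: keep only the rows satisfying the predicate."""
    return [row for row in rows if predicate(row)]


# Step 1: vertical fragmentation (the key A1 is kept in both fragments).
V1 = project(R, ["A1", "A2", "A3"])
V2 = project(R, ["A1", "A4", "A5"])

# Step 2: horizontal fragmentation of the vertical fragment V1.
V1_low = select(V1, lambda row: row["A3"] <= 10)
V1_high = select(V1, lambda row: row["A3"] > 10)

# Reconstruction: union the horizontal fragments, then join on the key.
v2_by_key = {row["A1"]: row for row in V2}
rebuilt = [{**row, **v2_by_key[row["A1"]]} for row in V1_low + V1_high]
print(rebuilt == R)  # True
```

The second case swaps the two steps: select first to obtain horizontal fragments, then project each of them.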


Self Assessment Questions 6.3
1. The correctness of ________ fragmentation requires that each tuple of a global relation be selected in one and only one fragment.
2. A ________ is a join between horizontally fragmented relations.
3. In the vertical partitioning problem, the approach in which the attributes are progressively aggregated to constitute fragments is called the ________.
4. ________ is suggested where overlapping attributes are not heavily updated.
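To make the primary and derived horizontal fragmentation of Section 6.3.1 concrete, the sketch below splits one relation by selection predicates and then fragments a second relation to match. The relations SUPPLIER and SUPPLY, their attributes and the helper functions are hypothetical. The check inside `primary_fragments` only verifies completeness and disjointness on the rows actually present.

```python
def primary_fragments(rows, predicates):
    """Split rows by named selection predicates, checking correctness.

    Raises ValueError unless every row satisfies exactly one predicate,
    i.e. the predicates are complete and disjoint on this data.
    """
    fragments = {name: [] for name in predicates}
    for row in rows:
        matches = [name for name, p in predicates.items() if p(row)]
        if len(matches) != 1:
            raise ValueError(f"row {row} matches {len(matches)} predicates")
        fragments[matches[0]].append(row)
    return fragments


def derived_fragments(rows, owner_fragments, join_attr):
    """Fragment rows by a semi-join with each fragment of the owner relation."""
    derived = {}
    for name, owner_rows in owner_fragments.items():
        keys = {r[join_attr] for r in owner_rows}
        derived[name] = [row for row in rows if row[join_attr] in keys]
    return derived


# Hypothetical data: SUPPLIER is fragmented on city, SUPPLY is derived from it.
SUPPLIER = [
    {"sno": 1, "city": "SF"},
    {"sno": 2, "city": "LA"},
    {"sno": 3, "city": "SF"},
]
SUPPLY = [{"sno": 1, "pno": 10}, {"sno": 2, "pno": 11}, {"sno": 3, "pno": 12}]

supplier_frags = primary_fragments(SUPPLIER, {
    "SF": lambda r: r["city"] == "SF",
    "LA": lambda r: r["city"] == "LA",
})
supply_frags = derived_fragments(SUPPLY, supplier_frags, "sno")
print([r["sno"] for r in supply_frags["SF"]])  # [1, 3]
```

Because each SUPPLY fragment holds exactly the rows that join with the matching SUPPLIER fragment, the distributed join can be evaluated fragment by fragment.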

6.4 The Allocation of Fragments


In this section we explain the aspects to be considered when allocating a fragment to a site, and describe some general criteria that can be used for allocating fragments. There are two types of allocation methods:

1. Non-Redundant Allocation: This is the simpler case. A method known as the best-fit approach can be used: a measure is associated with each possible allocation, and the site with the best measure is selected. This approach does not take into account the effect of placing a fragment at a site where a related fragment is already present.
2. Redundant Allocation: This is a more complex design problem, since:
   o The degree of replication is a variable of the problem.
   o The modeling of read applications is complicated, as an application may select any one of several alternative copies.

The following two methods can be used for determining the redundant allocation of fragments:
o Determine the set of all sites where the benefit of allocating one copy of the fragment is higher than the cost, and allocate a copy of the fragment to each site in this set; this method selects all beneficial sites.
o Start from a non-replicated version, then progressively introduce replicated copies, starting from the most beneficial; the process terminates when no additional replication is beneficial.
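The best-fit and "all beneficial sites" methods can be sketched with a very simple cost model. The model is an assumption made for this illustration, not one given in the unit: a copy at a site saves the reads issued from that site, but must receive every update issued from the other sites. The function names, the `update_weight` parameter and the reference counts are hypothetical.

```python
def best_fit_site(reads, updates):
    """Non-redundant best fit: the single site with the highest measure.

    The measure used here (reads plus updates issued locally) is assumed.
    """
    return max(reads, key=lambda s: reads[s] + updates.get(s, 0))


def all_beneficial_sites(reads, updates, update_weight=1.0):
    """Redundant allocation: every site where benefit exceeds cost.

    reads[s] and updates[s] are the references issued from site s.
    Benefit of a copy at s: the reads issued at s become local.
    Cost of a copy at s: it must apply updates issued at all other sites.
    """
    total_updates = sum(updates.values())
    chosen = []
    for site in reads:
        benefit = reads[site]
        cost = update_weight * (total_updates - updates.get(site, 0))
        if benefit > cost:
            chosen.append(site)
    return chosen


# Hypothetical reference counts for one fragment.
reads = {"site1": 50, "site2": 30, "site3": 5}
updates = {"site1": 4, "site2": 6, "site3": 10}

print(best_fit_site(reads, updates))         # site1
print(all_beneficial_sites(reads, updates))  # ['site1', 'site2']
```

Raising `update_weight` makes replication less attractive, which mirrors the remark that replication suits data that is read far more often than it is updated.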

Both the reliability and the availability of the system increase if there are two or three copies of a fragment, but further copies give a less than proportional increase.

Self Assessment Questions 6.4
1. In the allocation of fragments, ________ allocation is a complex design problem, since the degree of replication is a variable of the problem.

6.5 Query Processing Problem


The main duty of a relational query processor is to transform a high-level query (in relational calculus) into an equivalent lower-level query (in relational algebra). The design of the distributed database is of major importance for query processing, since the definition of fragments is based on the objective of increasing reference locality, and sometimes parallel execution, for the most important queries. The role of a distributed query processor is to map a high-level query on a distributed database (a set of global relations) into a sequence of database operations (of relational algebra) on relational fragments. Several important functions characterize this mapping:
o The calculus query must be decomposed into a sequence of relational operations called an algebraic query.
o The data accessed by the query must be localized, so that the operations on relations are translated to bear on local data (fragments).
o The algebraic query on fragments must be extended with communication operations and optimized with respect to a cost function to be minimized. This cost function refers to computing resources such as disk I/Os, CPUs, and communication networks.

The low-level query actually implements the execution strategy for the query. The transformation must achieve both correctness and efficiency. The well-defined mapping with the functional characteristics listed above makes the correctness issue easy, but producing an efficient execution strategy is more complex. A relational calculus query may have many equivalent and correct transformations into relational algebra. Since each equivalent execution strategy can lead to a different consumption of computer resources, the main problem is to select the execution strategy that minimizes resource consumption.

Self Assessment Questions 6.5
1. The role of a distributed ________ is to map a high-level query on a distributed database into a sequence of database operations on relational fragments.
2. The calculus query must be decomposed into a sequence of relational operations called an ________ query.
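The point that equivalent strategies can consume different amounts of resources can be seen in a small sketch. The relations EMP and ASG, their attributes and the selection `dur > 60` are hypothetical, and the number of intermediate tuples is used as a rough stand-in for cost.

```python
# Hypothetical relations: 8 employees and 8 assignments.
EMP = [{"eno": i, "name": f"e{i}"} for i in range(1, 9)]
ASG = [{"eno": i, "dur": 12 * i} for i in range(1, 9)]


def join_then_select(emp, asg):
    """Strategy A: join first, filter afterwards."""
    joined = [{**e, **a} for e in emp for a in asg if e["eno"] == a["eno"]]
    result = [row for row in joined if row["dur"] > 60]
    return result, len(joined)      # second value: intermediate tuples


def select_then_join(emp, asg):
    """Strategy B: filter ASG first, then join the smaller input."""
    selected = [a for a in asg if a["dur"] > 60]
    result = [{**e, **a} for e in emp for a in selected if e["eno"] == a["eno"]]
    return result, len(selected)    # second value: intermediate tuples


result_a, size_a = join_then_select(EMP, ASG)
result_b, size_b = select_then_join(EMP, ASG)

print(result_a == result_b)  # True
print(size_a, size_b)        # 8 3
```

Both strategies return the same answer, but strategy B builds a smaller intermediate relation. In a distributed setting that intermediate relation may have to be shipped between sites, so the difference matters even more.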

6.6 Objectives of Query Processing

The main objective of query processing in a distributed environment is to transform a high-level query on a distributed database, which is seen as a single database by the users, into an efficient execution strategy expressed in a low-level language on local databases.

An important point of query processing is query optimization. Because many execution strategies are correct transformations of the same high-level query, the one that optimizes (minimizes) resource consumption should be retained.

Good measures of resource consumption are:
o The total cost incurred in processing the query. It is the sum of all the times incurred in processing the operations of the query at the various sites and in inter-site communication.
o The response time of the query. This is the time elapsed in executing the query. Since operations can be executed in parallel at different sites, the response time of a query may be significantly less than its total cost.

Obviously the total cost should be minimized.
o In a distributed system, the total cost to be minimized includes CPU, I/O, and communication costs. The I/O cost can be minimized by reducing the number of I/O operations through fast access methods to the data and efficient use of main memory. The communication cost is the time needed for exchanging data between the sites participating in the execution of the query.
o In centralized systems, only the CPU and I/O costs have to be considered.
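The difference between total cost and response time can be shown with a small calculation. The time figures, the two-site plan and the function names are hypothetical. The sketch assumes that the two sites work fully in parallel and that a final step then assembles the result.

```python
# Hypothetical times (arbitrary units) for one query plan.
# Each site does local CPU and I/O work, then ships its result.
site_work = {
    "site1": {"cpu_io": 4.0, "communication": 2.0},
    "site2": {"cpu_io": 6.0, "communication": 1.0},
}
final_step = 3.0  # assembling the result at the query site


def total_cost(work, final):
    """Sum of all processing and communication times at all sites."""
    return sum(w["cpu_io"] + w["communication"] for w in work.values()) + final


def response_time(work, final):
    """Elapsed time when the sites work in parallel before the final step."""
    slowest = max(w["cpu_io"] + w["communication"] for w in work.values())
    return slowest + final


print(total_cost(site_work, final_step))     # 16.0
print(response_time(site_work, final_step))  # 10.0
```

The response time is lower than the total cost because the work at site1 overlaps with the work at site2. A strategy that minimizes one measure does not necessarily minimize the other.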

Self Assessment Questions 6.6
1. An important point of query processing is ________ optimization.
2. The ________ of the query is the time elapsed for executing the query.
3. The ________ cost is the time needed for exchanging the data between sites participating in the execution of the query.

6.7 Characterization of Query Processors


It is difficult to give a set of characteristics that differentiates centralized and distributed query processors. Still, some of them are listed here. The first four are common to both, and the next four are particular to distributed query processors.
o Languages: The input language to the query processor can be based on relational calculus or relational algebra. In a distributed context, the output language is generally some form of relational algebra augmented with communication primitives.


o Types of Optimization: Conceptually, query optimization aims to choose the point in the solution space that leads to the minimum cost. A popular approach is exhaustive search, in which the cost of each candidate strategy is predicted and the cheapest one is selected. Because exhaustive search can be expensive, heuristic techniques are also used to restrict the search. In both centralized and distributed systems, a common heuristic is to minimize the size of intermediate relations. This can be done by performing unary operations first and ordering the binary operations by the increasing size of their intermediate relations.

o Optimization Timing: A query may be optimized at different times relative to the actual time of query execution. Optimization can be done statically, before executing the query, or dynamically, as the query is executed. The main advantage of the latter method is that the actual sizes of the intermediate relations are available to the query processor, thereby minimizing the probability of a bad choice.

o Statistics: The effectiveness of query optimization depends on statistics on the database. Dynamic query optimization requires statistics in order to choose the operation that has to be done first. Static query optimization requires statistics to estimate the size of intermediate relations. The accuracy of the statistics can be improved by periodic updating.

o Decision Sites: Most systems use a centralized decision approach, in which a single site generates the strategy. However, the decision process could be distributed among the various sites participating in the elaboration of the best strategy. The centralized approach is simpler but requires knowledge of the complete distributed database, whereas the distributed approach requires only local information.

o Exploitation of the Network Topology: The distributed query processor exploits the network topology. This simplifies distributed query optimization, which can be dealt with as two separate problems: selection of the global execution strategy, based on inter-site communication, and selection of each local execution strategy, based on centralized query processing algorithms. With local area networks, communication costs are comparable to I/O costs.
o Exploitation of Replicated Fragments: For reliability purposes it is useful to have fragments replicated at different sites. Query processors have to exploit this information, either statically or dynamically, to process the query efficiently.
o Use of Semi-Joins: The semi-join operation reduces the size of the data exchanged between the sites, so that the communication cost can be reduced.

Self Assessment Questions 6.7
1. In a distributed context, the ________ is generally some form of relational algebra augmented with communication primitives.
2. Dynamic query optimization requires ________ in order to choose the operation that has to be done first.
3. For ________ purposes it is useful to have fragments replicated at different sites.
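The effect of a semi-join on the amount of data exchanged between sites can be sketched as follows. The two relations, their placement at two sites and the use of row counts as a measure of transfer volume are assumptions for illustration.

```python
def semi_join(left, right, attr):
    """Rows of left that have a join partner in right on attr."""
    keys = {row[attr] for row in right}
    return [row for row in left if row[attr] in keys]


def join(left, right, attr):
    """Plain nested-loop equi-join on attr."""
    return [{**l, **r} for l in left for r in right if l[attr] == r[attr]]


# Hypothetical relations stored at two different sites.
EMP_at_site1 = [{"eno": i, "name": f"e{i}"} for i in range(1, 101)]
ASG_at_site2 = [{"eno": 3, "pno": "P1"}, {"eno": 7, "pno": "P2"}]

# Without a semi-join: ship all of EMP to site 2.
shipped_plain = len(EMP_at_site1)

# With a semi-join: ship ASG's join column to site 1, reduce EMP there,
# then ship only the matching EMP rows to site 2 for the final join.
join_values = [{"eno": row["eno"]} for row in ASG_at_site2]
reduced_emp = semi_join(EMP_at_site1, join_values, "eno")
shipped_semi = len(join_values) + len(reduced_emp)

same = join(reduced_emp, ASG_at_site2, "eno") == join(EMP_at_site1, ASG_at_site2, "eno")
print(same)                         # True
print(shipped_plain, shipped_semi)  # 100 4
```

The semi-join pays off here because few EMP rows take part in the join. When most rows match, the extra transfer of the join column can outweigh the saving.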

6.8 Layers of Query Processing


The problem of query processing can itself be decomposed into several sub-problems, corresponding to various layers. Figure 6.4 shows a generic layering scheme for query processing, in which each layer solves a well-defined sub-problem. The input is a query on distributed data expressed in relational calculus. This distributed query is posed on global (distributed) relations, meaning that data distribution is hidden. Four main layers are involved in mapping the distributed query into an optimized sequence of local operations, each acting on a local database. These layers perform the functions of query decomposition, data localization, global query optimization, and local query optimization. The first three layers are performed by a central site and use global information; the fourth is done by the local sites.

CONTROL SITE
    CALCULUS QUERY ON DISTRIBUTED RELATIONS
        -> QUERY DECOMPOSITION     (uses: GLOBAL SCHEMA)
    ALGEBRAIC QUERY ON DISTRIBUTED RELATIONS
        -> DATA LOCALIZATION       (uses: FRAGMENT SCHEMA)
    FRAGMENT QUERY
        -> GLOBAL OPTIMIZATION     (uses: STATISTICS ON)
    OPTIMIZED FRAGMENT QUERY WITH COMMUNICATION OPERATIONS
LOCAL SITES
        -> LOCAL OPTIMIZATION      (uses: LOCAL SCHEMA)
    OPTIMIZED LOCAL QUERIES

Figure 6.4: Generic Layering Scheme for Distributed Query Processing

6.8.1 Query Decomposition
The first layer decomposes the distributed calculus query into an algebraic query on global relations. The information needed for this transformation is found in the global conceptual schema describing the global relations. The information about data distribution is not used here but in the next layer. Thus the techniques used by this layer are those of a centralized DBMS. Query decomposition can be viewed as four successive steps:
o The calculus query is rewritten in a normalized form that is suitable for subsequent manipulation. Normalization of a query generally involves the manipulation of the query quantifiers and of the query qualification by applying logical operator priority.
o The normalized query is analyzed semantically so that incorrect queries are detected and rejected as early as possible. Techniques to detect incorrect queries exist only for a subset of relational calculus. Typically, they use some sort of graph that captures the semantics of the query.
o The correct query (still expressed in relational calculus) is simplified. One way to simplify a query is to eliminate redundant predicates.
o The calculus query is restructured as an algebraic query. The quality of an algebraic query is defined in terms of expected performance. The traditional way to do this transformation toward a "better" algebraic specification is to start with an initial algebraic query and transform it in order to find a "good" one. The initial algebraic query is derived immediately from the calculus query by translating the predicates and the target statement into relational operations as they appear in the query. This directly translated algebraic query is then restructured through transformation rules.
The algebraic query generated by this layer is good in the sense that the worst executions are avoided.
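The semantic analysis step mentions a graph that captures the semantics of the query. One common form is a query graph whose nodes are the relations of the query plus a result node, and whose edges are joins and projections; a query whose graph is not connected is usually treated as suspect. The sketch below is a minimal illustration of that check. The relation names and the edge lists are hypothetical.

```python
def is_connected(nodes, edges):
    """True if the undirected query graph links every node to every other."""
    if not nodes:
        return True
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(adjacency[node] - seen)
    return seen == set(nodes)


# Nodes: relations named in the query plus RESULT.
# Edges: joins (relation to relation) and projections (relation to RESULT).
nodes = ["EMP", "ASG", "PROJ", "RESULT"]
good_edges = [("EMP", "ASG"), ("ASG", "PROJ"), ("EMP", "RESULT")]
bad_edges = [("EMP", "ASG"), ("EMP", "RESULT")]  # PROJ is never joined

print(is_connected(nodes, good_edges))  # True
print(is_connected(nodes, bad_edges))   # False
```

In the second query PROJ appears but is never joined to the other relations, so the processor could reject the query or ask the user to supply the missing join predicate.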


6.8.2 Data Localization
The input to the second layer is an algebraic query on distributed relations. The main role of the second layer is to localize the query's data using data distribution information. Relations are fragmented and stored in disjoint subsets called fragments, each being stored at a different site. This layer determines which fragments are involved in the query and transforms the distributed query into a fragment query. Fragmentation is defined through fragmentation rules that can be expressed as relational operations. A distributed relation can be reconstructed by applying the fragmentation rules and then deriving a program, called a localization program, of relational algebra operations, which then act on fragments. Generating a fragment query is done in two steps:
o The distributed query is mapped into a fragment query by substituting each distributed relation by its reconstruction program (also called a materialization program).
o The fragment query is simplified and restructured to produce another good query. Simplification and restructuring may be done according to the same rules used in the decomposition layer.
As in the decomposition layer, the final fragment query is generally far from optimal because information regarding fragments is not utilized.

6.8.3 Global Query Optimization
The input to the third layer is a fragment query, that is, an algebraic query on fragments. The goal of query optimization is to find an execution strategy for the query that is close to optimal. An execution strategy for a distributed query can be described with relational algebra operations and communication primitives (send/receive operations) for transferring data between sites. The previous layers have already optimized the query, for example by eliminating redundant expressions. However, this optimization is independent of fragment characteristics such as cardinalities. In addition, communication operations are not yet specified. By permuting the ordering of operations within one fragment query, many equivalent queries may be found. Query optimization consists of finding the best ordering of operations in the fragment query, including communication operations, which minimizes a cost function. The cost function, often defined in terms of time units, refers to computing resources such as disk space, disk I/Os, buffer space, CPU cost, communication cost and so on. An important aspect of query optimization is join ordering, since permutations of the joins within the query may lead to improvements of orders of magnitude. One basic technique for optimizing a sequence of distributed join operations is through the semi-join operator. The main value of the semi-join in a distributed system is to reduce the size of the join operands and hence the communication cost. The output of the query optimization layer is an optimized algebraic query on fragments, with communication operations included.

6.8.4 Local Query Optimization
The last layer is performed by all the sites having fragments involved in the query. Each sub-query executing at one site, called a local query, is then optimized using the local schema of the site. At this time, the algorithms to perform the relational operations may be chosen. Local optimization uses the algorithms of centralized systems.

Self Assessment Questions 6.8
1. How many layers are involved to map the distributed query into an optimized sequence of local operations?
2. The ________ layer decomposes the distributed calculus query into an algebraic query on global relations.
3. The main role of the data localization layer is to ________ the query's data using data distribution information.
4. One basic technique for optimizing a sequence of distributed join operations is through the ________ operator.
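The data localization layer of Section 6.8.2 can be sketched for a horizontally fragmented relation. The sketch assumes a hypothetical relation EMP fragmented on `eno` into three ranges. Its reconstruction program is the union of the three fragments, and simplification drops every fragment whose range cannot contain rows requested by the query.

```python
# Hypothetical fragmentation of EMP on eno; each fragment is described by
# the half-open range [low, high) of eno values it holds and by its site.
FRAGMENTS = {
    "EMP1": {"site": "site1", "low": 0, "high": 100},
    "EMP2": {"site": "site2", "low": 100, "high": 200},
    "EMP3": {"site": "site3", "low": 200, "high": 300},
}


def localize(fragments, query_low, query_high):
    """Fragments that can hold rows with query_low <= eno < query_high.

    The reconstruction program for EMP is the union of all its fragments;
    fragments whose range cannot overlap the query range are dropped.
    """
    return [
        name
        for name, f in fragments.items()
        if f["low"] < query_high and query_low < f["high"]
    ]


print(localize(FRAGMENTS, 150, 160))  # ['EMP2']
print(localize(FRAGMENTS, 50, 150))   # ['EMP1', 'EMP2']
```

A selection on `eno` between 150 and 160 therefore only involves site2, so the other two sites do not have to take part in the query.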

6.9 Summary
In this unit we have discussed the four phases of the design of distributed databases: the global schema, the fragmentation schema, the allocation schema and the local schemas. Some important aspects of the design of the fragmentation and allocation schemas were described. We have also provided an overview of query processing in distributed DBMSs: we introduced the function and objectives of query processing, described a characterization of query processors based on their implementation choices, and presented a generic layering scheme for describing distributed query processing.

6.10 Terminal Questions


1. What are the approaches used while designing the distributed database? Explain.
2. State Vertical partitioning and Clustering problem.
3. Explain the general principle of allocation of fragments in distributed databases.
4. List the objectives of Query Processing.
5. What are characteristics of distributed query processor? Explain.
6. How do you identify the query decomposition? Explain.

6.11 Answers to Self Assessment Questions


Answers to Self Assessment Questions 6.2
1. Designing the fragmentation
2. Complete locality
3. Two
4. Top-down

Answers to Self Assessment Questions 6.3
1. Primary
2. Distributed join
3. Grouping Approach
4. Vertical clustering

Answers to Self Assessment Questions 6.4
1. Redundant

Answers to Self Assessment Questions 6.5
1. query processor
2. algebraic

Answers to Self Assessment Questions 6.6
1. query
2. response time
3. communication

Answers to Self Assessment Questions 6.7
1. output language
2. statistics
3. reliability

Answers to Self Assessment Questions 6.8
1. Four
2. query decomposition
3. localize
4. semi-join
