
July 2011 Master of Computer Application (MCA) Semester 4 MC0077 Advanced Database Systems 4 Credits

(Book ID: B0882) Assignment Set 2

1. Describe the following with suitable examples: o Cost Estimation o Measuring Index Selectivity

Ans: Cost Estimation


One of the hardest problems in query optimization is to accurately estimate the costs of alternative query plans. Optimizers cost query plans using a mathematical model of query execution costs that relies heavily on estimates of the cardinality, or number of tuples, flowing through each edge in a query plan. Cardinality estimation in turn depends on estimates of the selection factor of predicates in the query. Traditionally, database systems estimate selectivity through fairly detailed statistics on the distribution of values in each column, such as histograms. This technique works well for estimating the selectivity of individual predicates. However, many queries have conjunctions of predicates, such as select count(*) from R, S where R.make = 'Honda' and R.model = 'Accord'. Query predicates are often highly correlated (for example, model = 'Accord' implies make = 'Honda'), and it is very hard to estimate the selectivity of the conjunct in general. Poor cardinality estimates and uncaught correlation are among the main reasons why query optimizers pick poor query plans. This is one reason why a DBA should regularly update the database statistics, especially after major data loads/unloads.

The cardinality of a set is a measure of the "number of elements of the set". There are two approaches to cardinality: one which compares sets directly using bijections and injections, and another which uses cardinal numbers.

Measuring Index Selectivity

Index Selectivity

B*TREE indexes improve the performance of queries that select a small percentage of rows from a table. As a general guideline, we should create indexes on tables that are often queried for less than 15% of the table's rows. This value may be higher in situations where all data can be retrieved from an index, or where the indexed columns can be used for joining to other tables. The ratio of the number of distinct values in the indexed column(s) to the number of records in the table represents the selectivity of an index. The ideal selectivity is 1. Such selectivity can be reached only by unique indexes on NOT NULL columns.

Example with Good Selectivity

A table has 100000 records, and one of its indexed columns has 88000 distinct values; the selectivity of this index is then 88000 / 100000 = 0.88. Oracle implicitly creates indexes on the columns of all unique and primary keys that you define with integrity constraints. These indexes are the most selective and the most effective in optimizing performance. The selectivity of an index is the percentage of rows in a table having the same value for the indexed column. An index's selectivity is good if few rows have the same value.

Example with Bad Selectivity

If an index on a table of 100000 records has only 500 distinct values, then the index's selectivity is 500 / 100000 = 0.005, and in this case a query which uses such an index will return 100000 / 500 = 200 records for each distinct value. It is evident that a full table scan is more efficient than using such an index, where much more I/O is needed to repeatedly scan the index and the table.

How to Measure Index Selectivity?

Manually measuring index selectivity

The ratio of the number of distinct values to the total number of rows is the selectivity of the columns. This method is useful to estimate the selectivity of an index before creating it.

select count(distinct job) "Distinct Values" from emp;

select count(*) "Total Number Rows" from emp;

Selectivity = Distinct Values / Total Number Rows = 5 / 14 = 0.35

Automatically measuring index selectivity

We can determine the selectivity of an index by dividing the number of distinct indexed values by the number of rows in the table.

create index idx_emp_job on emp(job);

analyze table emp compute statistics;

select distinct_keys from user_indexes
 where table_name = 'EMP' and index_name = 'IDX_EMP_JOB';

select num_rows from user_tables
 where table_name = 'EMP';

Selectivity = DISTINCT_KEYS / NUM_ROWS = 5 / 14 = 0.35

Selectivity of each individual column

Assuming that the table has been analyzed, it is also possible to query USER_TAB_COLUMNS to investigate the selectivity of each column individually.

select column_name, num_distinct from user_tab_columns
 where table_name = 'EMP';
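The same measurement can also be scripted. The sketch below is illustrative only: it assumes an Oracle release where the DBMS_STATS package and extended (column group) statistics are available, and it reuses the EMP table and the R(make, model) example from above.

-- Refresh optimizer statistics for EMP and its indexes (the modern alternative to ANALYZE)
exec DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'EMP', cascade => TRUE);

-- Selectivity of every index on EMP, computed from the data dictionary
select i.index_name,
       round(i.distinct_keys / t.num_rows, 3) as selectivity
  from user_indexes i
  join user_tables t on t.table_name = i.table_name
 where i.table_name = 'EMP';

-- A column group lets the optimizer estimate correlated predicates such as
-- make = 'Honda' and model = 'Accord' together instead of multiplying them
select dbms_stats.create_extended_stats(user, 'R', '(make, model)') from dual;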

2. Describe the following: o Statements and Transactions in a Distributed Database o Heterogeneous Distributed Database Systems Ans:

Statements and Transactions in a Distributed Database


The following sections introduce the terminology used when discussing statements and transactions in a distributed database environment. Remote and Distributed Statements A Remote Query is a query that selects information from one or more remote tables, all of which reside at the same remote node. A Remote Update is an update that modifies data in one or more tables, all of which are located at the same remote node.

Note: A remote update may include a sub-query that retrieves data from one or more remote nodes, but because the update is performed at only a single remote node, the statement is classified as a remote update. A Distributed Query retrieves information from two or more nodes. A distributed update modifies data on two or more nodes. A distributed update is possible using a program unit, such as a procedure or a trigger, that includes two or more remote updates that access data on different nodes. Statements in the program unit are sent to the remote nodes, and the execution of the program succeeds or fails as a unit. Remote and Distributed Transactions A Remote Transaction is a transaction that contains one or more remote statements, all of which reference the same remote node. A Distributed Transaction is any transaction that includes one or more statements that, individually or as a group, update data on two or more distinct nodes of a distributed database. If all statements of a transaction reference only a single remote node, the transaction is remote, not distributed.
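As a minimal sketch of the distinction (the database link names sales_db and hq_db are assumptions introduced here, not part of the text), Oracle addresses remote objects through database links, and a statement is remote or distributed depending on how many nodes it touches:

-- Remote query: every referenced table resides at the single remote node sales_db
SELECT ename, sal FROM emp@sales_db WHERE deptno = 10;

-- Remote update: modifies data at one remote node only
UPDATE emp@sales_db SET sal = sal * 1.1 WHERE deptno = 10;

-- Distributed query: combines data from two different nodes
SELECT e.ename, d.dname
  FROM emp@sales_db e, dept@hq_db d
 WHERE e.deptno = d.deptno;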

Heterogeneous Distributed Database Systems


The Oracle distributed database architecture allows a mix of different versions of Oracle, along with database products from other companies, to create a heterogeneous distributed database system.

The Mechanics of a Heterogeneous Distributed Database

In a distributed database, any application directly connected to a database can issue a SQL statement that accesses remote data in the following ways (for the sake of explanation we have taken Oracle as a base): Data in another Oracle database is available, no matter what version. Databases at other physical locations are connected through a network and maintain communication. Data in a non-compatible database (such as an IBM DB2 database) is available, assuming that the non-compatible database is supported by the application's gateway architecture, say SQL*Connect in the case of Oracle. One can connect the Oracle and non-Oracle databases with a network and use SQL*Net to maintain communication. Figure 9.3 illustrates a heterogeneous distributed database system encompassing different versions of Oracle and non-Oracle databases.

Figure 9.3: Heterogeneous Distributed Database Systems

When connections from an Oracle node to a remote node (Oracle or non-Oracle) are initially established, the connecting Oracle node records the capabilities of each remote system and the associated gateways, and SQL statement execution then proceeds. However, in heterogeneous distributed systems, SQL statements issued from an Oracle database to a non-Oracle remote database server are limited by the capabilities of the remote database server and associated gateway. For example, if a remote or distributed query includes an Oracle extended SQL function (for example, an outer join), the function may have to be performed by the local Oracle database. Extended SQL functions in remote updates (for example, an outer join in a sub-query) are not supported by all gateways.
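The following sketch shows how such a connection is typically established; the link name, credentials and TNS alias are assumptions for illustration, and for a non-Oracle target the alias would resolve to the configured gateway rather than a native Oracle listener:

-- Define a database link from the local Oracle node to the remote system
CREATE DATABASE LINK db2_link
  CONNECT TO remote_user IDENTIFIED BY remote_password
  USING 'db2_gateway_alias';

-- Remote data is then addressed with the @link notation
SELECT * FROM customers@db2_link WHERE region = 'EAST';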

3. Explain: A) Data Warehouse Architecture B) Data Storage Methods Ans:


Data Warehouse Architecture The term Data Warehouse Architecture is primarily used today to describe the overall structure of a Business Intelligence system. Other historical terms include Decision Support Systems (DSS), Management Information Systems (MIS), and others. The Data Warehouse Architecture describes the overall system from various perspectives such as data, process, and infrastructure needed to communicate the structure, function and interrelationships of each component. The infrastructure or technology perspective details the various hardware and software products used to implement the distinct components of the overall

system. The data perspective typically diagrams the source and target data structures and aids the user in understanding what data assets are available and how they are related. The process perspective is primarily concerned with communicating the process and flow of data from the originating source system through the process of loading the data warehouse, and often the process that client products use to access and extract data from the warehouse.

Data Storage Methods

In OLTP (Online Transaction Processing) systems, relational database design uses the discipline of data modeling and generally follows the Codd rules of data normalization in order to ensure absolute data integrity. Information is broken down into its simplest structures (tables) where all of the individual atomic-level elements relate to each other and satisfy the normalization rules. Codd defines five increasingly stringent rules of normalization, and typically OLTP systems achieve third normal form. Fully normalized OLTP database designs often result in having information from a business transaction stored in dozens to hundreds of tables. Relational database managers are efficient at managing the relationships between tables, and the result is very fast insert/update performance because only a small amount of data is affected in each relational transaction.

OLTP databases are efficient because they typically deal only with the information around a single transaction. In reporting and analysis, thousands to billions of transactions may need to be reassembled, imposing a huge workload on the relational database. Given enough time the software can usually return the requested results, but because of the negative performance impact on the machine and all of its hosted applications, data warehousing professionals recommend that reporting databases be physically separated from the OLTP database. In addition, data warehousing suggests that data be restructured and reformatted to facilitate query and analysis by novice users. OLTP databases are designed to provide good performance to rigidly defined applications built by programmers fluent in the constraints and conventions of the technology. Add in frequent enhancements, and to many users a database is just a collection of cryptic names and seemingly unrelated, obscure structures that store data using incomprehensible coding schemes. All of these factors, while improving performance, complicate use by untrained people. Lastly, the data warehouse needs to support high volumes of data gathered over extended periods of time, is subject to complex queries, and needs to accommodate formats and definitions inherited from independently designed packages and legacy systems. Designing the data warehouse's data architecture to reconcile these demands is the realm of Data Warehouse Architects.

The goal of a data warehouse is to bring data together from a variety of existing databases to support management and reporting needs. The generally accepted principle is that data should be stored at its most elemental level because this provides the most useful and flexible basis for use in reporting and information analysis. However, because of different focuses on specific requirements, there can be alternative methods for designing and implementing data warehouses. There are two leading approaches to organizing the data in a data warehouse: the dimensional approach advocated by Ralph Kimball and the normalized approach advocated by Bill Inmon.
Whilst the dimensional approach is very useful in data mart design, it can result in a rat's nest of long-term data integration and abstraction complications when used in a data warehouse. In the "dimensional" approach, transaction data is partitioned into either "facts", which are generally numeric data that capture specific values, or "dimensions", which contain the reference information that gives each transaction its context. As an example, a sales transaction would be broken up into facts such as the number of products ordered and the price paid, and dimensions such as date, customer, product, geographical location and salesperson. The main

advantages of a dimensional approach are that the data warehouse is easy for business staff with limited information technology experience to understand and use, and that, because the data is pre-joined into the dimensional form, the data warehouse tends to operate very quickly. The main disadvantage of the dimensional approach is that it is quite difficult to add to or change later if the company changes the way in which it does business.

The "normalized" approach uses database normalization. In this method, the data in the data warehouse is stored in third normal form. Tables are then grouped together by subject areas that reflect the general definition of the data (customer, product, finance, etc.). The main advantage of this approach is that it is quite straightforward to add new information into the database. The primary disadvantage is that, because of the number of tables involved, it can be rather slow to produce information and reports. Furthermore, since the segregation of facts and dimensions is not explicit in this type of data model, it is difficult for users to join the required data elements into meaningful information without a precise understanding of the data structure.

Subject areas are just a method of organizing information and can be defined along any lines. The traditional approach has subjects defined as the subjects or nouns within a problem space. For example, in a financial services business, you might have customers, products and contracts. An alternative approach is to organize around business transactions, such as customer enrollment, sales and trades.
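A compact, hedged sketch may make the dimensional (star schema) approach concrete; every table and column name below is invented for illustration and is not taken from the text. The fact table holds the numeric measures, the dimension tables hold the context, and a typical report is a join plus an aggregation:

-- Dimension tables: the descriptive context of each transaction
CREATE TABLE dim_date     (date_key NUMBER PRIMARY KEY, calendar_date DATE, calendar_year NUMBER);
CREATE TABLE dim_product  (product_key NUMBER PRIMARY KEY, product_name VARCHAR2(100));
CREATE TABLE dim_customer (customer_key NUMBER PRIMARY KEY, customer_name VARCHAR2(100));

-- Fact table: numeric measures keyed by the dimensions
CREATE TABLE fact_sales (
  date_key     NUMBER REFERENCES dim_date,
  product_key  NUMBER REFERENCES dim_product,
  customer_key NUMBER REFERENCES dim_customer,
  quantity     NUMBER,
  amount       NUMBER
);

-- Typical analytic query: total sales per product per year
SELECT p.product_name, d.calendar_year, SUM(f.amount) AS total_sales
  FROM fact_sales f
  JOIN dim_product p ON p.product_key = f.product_key
  JOIN dim_date d    ON d.date_key = f.date_key
 GROUP BY p.product_name, d.calendar_year;

In the normalized approach the same information would instead be held in third-normal-form tables grouped by subject area, which simplifies incremental loading but requires more joins at query time.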

4. Discuss how the process of retrieving text data differs from the process of retrieving an image.

Ans:
Text-based Information Retrieval Systems

As indicated in Table 3.1, text-based information retrieval systems, or more correctly text document retrieval systems, have as long a development history as systems for management of structured, administrative data. The basic structure for digital documents, illustrated in Figure 3.5, has remained relatively constant: a header of descriptive attributes, currently called metadata, is prefixed to the text of each document. The resulting document collection is stored in a Document DB. Note that in Figure 3.5 the attribute body can be replaced by a pointer (or link) to a storage location separate from the metadata.

Figure 3.5: Basic digital document structure

In comparison to the structured/regular data used by administrative applications, documents are unstructured, consisting of a series of characters that represent words, sentences and paragraphs of unequal length. This requires different techniques for indexing, search and retrieval than those used for structured administrative data. Rather than indexing attribute values separately, a document retrieval system develops a term index similar to the ones found in the back of books, i.e. a list of the terms found in the documents with lists of where each term is located in the document collection. The frequency of term occurrence within a document is assumed to indicate the semantic content of the document. Search for relevant documents is commonly based on the semantic content of the document, rather than on the descriptive attribute values connected to it. For example, if we assume that the data stored in the attribute Document.Body in Figure 3.3a is the actual text of the document, then the retrieval algorithm, when processing Q2 in Figure 3.3c, searches the term index and selects those documents that contain one or more of the query terms database, management, sql3 and msql. It then sorts the resulting document list according to the frequency of these terms in each document.

There are two principal problems in using term matching for document retrieval: 1. Terms can be ambiguous, having meaning dependent on context, and 2. There is frequently a mismatch between the terms used by the searcher in his/her query and the terms used by the authors in the document collections. Techniques and tools developed to address these problems and thus improve retrieval quality include: indexing techniques based on word stems; dictionaries, thesauri, and grammatical rules as tools for interpretation of both search terms and documents; similarity and clustering algorithms;

mark-up languages (adaptations of the editor's tag set) to indicate areas of the text, such as titles, chapters, and its layout, that can be used to enhance relevance evaluations; and finally metadata standards for describing the semantic content and context of a document.

None of these techniques or tools is supported by the standard for relational database management systems. However, since there is a need to store text data with regular administrative data, various text management techniques are being added to OR-DBMS systems. Baeza-Yates & Ribeiro-Neto (1999) estimated that 90% of computerized data is in the form of text documents. This data is accessible using the retrieval technology developed for offline document/information retrieval systems and adapted for the newer Digital Libraries and Web search engines. Due to the expanding quantity of text available on the Internet, research and development efforts are (still) focused on improving the indexing and retrieval (similarity) algorithms used.

Image Retrieval Systems

Due to the large storage requirements for images, computer generation of image material, in the form of charts, illustrations and maps, predated the creation of image databases and the need for ad-hoc image retrieval. The development of scanning devices, particularly for medical applications, and digital cameras, as well as the rapidly increasing capacity of computer storage, has led to the creation of large collections of digital image material. Today, many organizations, such as news media, museums and art galleries, as well as police and immigration authorities, maintain large collections of digital images. For example, the New York Public Library has made its digital gallery, with over 480,000 scanned images, available to the Internet public. Maintaining a large image collection necessarily leads to a need for an effective system for image indexing and retrieval.

Image data collections have a structure similar to that used for text document collections, i.e. each digital image is associated with descriptive metadata, an example of which is illustrated in Figure 3.6. While management of the metadata is the same for text and image collections, the techniques needed for direct image comparison are quite different from those used for text documents. Therefore, current image retrieval systems use two quite different approaches for image retrieval (not necessarily within the same system).

Figure 3.6: Digital image document structure

1. Retrieval based on metadata, generated manually, that describes the content, meaning/interpretation and/or context of each image, and/or 2. Retrieval based on automatically selected, low-level features, such as color and texture distribution and identifiable shapes. This approach is frequently called CBIR, or content-based image retrieval.

Most of the metadata attributes used for digitized images, such as those listed in Figure 3.6, can be stored as either regular structured attributes or text items. Once collected, metadata can be used to retrieve images using either exact match on attribute values or text search on descriptive text fields. Most image retrieval systems utilize this approach. For example, a Google search for images about humpback whales listed over 15,000 links to images based on the text captions, titles, and file names accompanying the images (July 26th, 2006).

As noted earlier, images are strings of pixels with no other explicit relationship to the following pixel(s) than their serial position. Unlike text documents, there is no image vocabulary that can be used to index the semantic content. Instead, image pixel analysis routines extract dominant low-level features, such as the distribution of the colors and texture(s) used, and the location(s) of identifiable shapes. This data is used to generate a signature for each image that can be indexed and used to match a similar signature generated for a visual query, i.e. a query based on an image example. Unfortunately, using low-level features does not necessarily give a good semantic result for image retrieval.
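A short sketch of the first, metadata-based approach follows; the table, the columns and the use of an Oracle Text CONTEXT index are assumptions for illustration rather than a prescribed design. Structured attributes support exact-match predicates, while the text index supports ranked search on the free-text description:

-- Metadata record for each digital image (structure assumed for the example)
CREATE TABLE image_metadata (
  image_id     NUMBER PRIMARY KEY,
  title        VARCHAR2(200),
  photographer VARCHAR2(100),
  capture_date DATE,
  description  CLOB,
  image_data   BLOB
);

-- Text index on the free-text description (requires the Oracle Text option)
CREATE INDEX idx_image_desc ON image_metadata(description)
  INDEXTYPE IS CTXSYS.CONTEXT;

-- Exact match on structured metadata combined with ranked text search
SELECT image_id, title, SCORE(1) AS relevance
  FROM image_metadata
 WHERE photographer = 'Smith'
   AND CONTAINS(description, 'humpback AND whale', 1) > 0
 ORDER BY SCORE(1) DESC;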

5. What are the differences between Centralized and Distributed Database Systems? List the relative advantages of data distribution.

Ans:
Features of Distributed vs. Centralized Databases, or Differences between Distributed and Centralized Databases

Centralized Control vs. Decentralized Control

In centralized control one "database administrator" ensures the safety of the data, whereas in distributed control it is possible to use a hierarchical control structure based on a "global database administrator", who has central responsibility for the whole database, along with "local database administrators", who have responsibility for their local databases.

Data Independence

In centralized databases, data independence means that the actual organization of data is transparent to the application programmer. Programs are written with a "conceptual" view of the data (the "conceptual schema"), and the programs are unaffected by the physical organization of data. In distributed databases, a new aspect, distribution transparency, is added to the notion of data independence as used in centralized databases. Distribution transparency means programs are written as if the data were not distributed. Thus the correctness of programs is unaffected by the movement of data from one site to another; however, their speed of execution is affected.

Reduction of Redundancy

In centralized databases redundancy is reduced for two reasons: (a) inconsistencies among several copies of the same logical data are avoided, and (b) storage space is saved. Reduction of redundancy is obtained by data sharing. In distributed databases data redundancy is desirable because (a) the locality of applications can be increased if data is replicated at all sites where applications need it, and (b) the availability of the system can be increased, because a site failure does not stop the execution of applications at other sites if the data is replicated. With data replication, retrieval can be performed on any copy, while updates must be performed consistently on all copies.

Complex Physical Structures and Efficient Access

In centralized databases complex access structures like secondary indexes and inter-file chains are used; all these features provide efficient access to data. In distributed databases, efficient access requires accessing data from different sites. For this, an efficient distributed data access plan is required, which can be generated either by the programmer or produced automatically by an optimizer. Problems faced in the design of an optimizer can be classified in two categories: a) Global optimization consists of determining which data must be accessed at which sites and which data files must consequently be transmitted between sites. b) Local optimization consists of deciding how to perform the local database accesses at each site.

Integrity, Recovery and Concurrency Control

A transaction is an atomic unit of execution, and atomic transactions are the means to obtain database integrity. Failures and concurrency are two dangers to atomicity. Failures may cause the system to stop in the midst of transaction execution, thus violating the atomicity requirement. Concurrent execution of different transactions may permit one transaction to observe an inconsistent, transient state created by another transaction during its execution. Concurrent execution requires synchronization amongst the transactions, which is much harder in distributed systems.

Privacy and Security

In traditional databases, the database administrator, having centralized control, can ensure that only authorized access to the data is performed. In distributed databases, local administrators face the same problems plus two new aspects: (a) security (protection) problems are intrinsic to distributed database systems because of the communication networks involved, and (b) owners of data in databases with a high degree of "site autonomy" may feel more protected because they can enforce their own protections instead of depending on a central database administrator.

Distributed Query Processing

The DDBMS should be capable of gathering and presenting data from more than one site to answer a single query. In theory a distributed system can handle queries more quickly than a centralized

one, by exploiting parallelism and reducing disc contention; in practice the main delays (and costs) will be imposed by the communications network. Routing algorithms must take many factors into account to determine the location and ordering of operations. Communications costs for each link in the network are relevant, as are the variable processing capabilities and loadings of different nodes, and (where data fragments are replicated) trade-offs between cost and currency. If some nodes are updated less frequently than others, there may be a choice between querying the local out-of-date copy very cheaply and getting a more up-to-date answer by accessing a distant location. The ability to do query optimization is essential in this context, the main objective being to minimize the quantity of data to be moved around. As with single-site databases, one must consider both generalized operations on internal query representations and the exploitation of information about the current state of the database.

Distributed Directory (Catalog) Management

Catalogs for distributed databases contain information like fragmentation descriptions, allocation descriptions, mappings to local names, access method descriptions, statistics on the database, and protection and integrity constraints (consistency information), which is more detailed than that found in centralized databases.
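To make the data-movement objective concrete, the hedged example below (the link and table names are assumptions) uses Oracle's DRIVING_SITE hint to ask the optimizer to perform the join at the remote node, so that only the filtered result set crosses the network instead of an entire table:

-- Without a hint, the remote orders table might be shipped to the local site in
-- full; DRIVING_SITE(o) requests that the join be executed where orders lives
SELECT /*+ DRIVING_SITE(o) */ c.customer_name, o.order_total
  FROM customers c,
       orders@warehouse_db o
 WHERE o.customer_id = c.customer_id
   AND o.order_date > SYSDATE - 7;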

6. How does the process of retrieval of text differ from the retrieval of images? What considerations should be taken care of during information retrieval? Ans:
Text-based Information Retrieval Systems

As indicated in Table 3.1, text-based information retrieval systems, or more correctly text document retrieval systems, have as long a development history as systems for management of structured, administrative data. The basic structure for digital documents, illustrated in Figure 3.5, has remained relatively constant: a header of descriptive attributes, currently called metadata, is prefixed to the text of each document. The resulting document collection is stored in a Document DB. Note that in Figure 3.5 the attribute body can be replaced by a pointer (or link) to a storage location separate from the metadata.

In comparison to the structured/regular data used by administrative applications, documents are unstructured, consisting of a series of characters that represent words, sentences and paragraphs of unequal length. This requires different techniques for indexing, search and retrieval than those used for structured administrative data. Rather than indexing attribute values separately, a document retrieval system develops a term index similar to the ones found in the back of books, i.e. a list of the terms found in the documents with lists of where each term is located in the document collection. The frequency of term occurrence within a document is assumed to indicate the semantic content of the document. Search for relevant documents is commonly based on the semantic content of the document, rather than on the descriptive attribute values connected to it. For example, if we assume that the data stored in the attribute Document.Body in Figure 3.3a is the actual text of the document, then the retrieval algorithm, when processing Q2 in Figure 3.3c, searches the term index and selects those documents that contain one or more of the query terms database, management, sql3

and msql. It then sorts the resulting document list according to the frequency of these terms in each document.

There are two principal problems in using term matching for document retrieval: 1. Terms can be ambiguous, having meaning dependent on context, and 2. There is frequently a mismatch between the terms used by the searcher in his/her query and the terms used by the authors in the document collections. Techniques and tools developed to address these problems and thus improve retrieval quality include: indexing techniques based on word stems; dictionaries, thesauri, and grammatical rules as tools for interpretation of both search terms and documents; similarity and clustering algorithms; mark-up languages (adaptations of the editor's tag set) to indicate areas of the text, such as titles, chapters, and its layout, that can be used to enhance relevance evaluations; and finally metadata standards for describing the semantic content and context of a document.

None of these techniques or tools is supported by the standard for relational database management systems. However, since there is a need to store text data with regular administrative data, various text management techniques are being added to OR-DBMS systems. Baeza-Yates & Ribeiro-Neto (1999) estimated that 90% of computerized data is in the form of text documents. This data is accessible using the retrieval technology developed for offline document/information retrieval systems and adapted for the newer Digital Libraries and Web search engines. Due to the expanding quantity of text available on the Internet, research and development efforts are (still) focused on improving the indexing and retrieval (similarity) algorithms used.

Image Retrieval Systems

Due to the large storage requirements for images, computer generation of image material, in the form of charts, illustrations and maps, predated the creation of image databases and the need for ad-hoc image retrieval. The development of scanning devices, particularly for medical applications, and digital cameras, as well as the rapidly increasing capacity of computer storage, has led to the creation of large collections of digital image material. Today, many organizations, such as news media, museums and art galleries, as well as police and immigration authorities, maintain large collections of digital images. For example, the New York Public Library has made its digital gallery, with over 480,000 scanned images, available to the Internet public. Maintaining a large image collection necessarily leads to a need for an effective system for image indexing and retrieval. Image data collections have a structure similar to that used for text document collections, i.e. each digital image is associated with descriptive metadata, an example of which is illustrated in Figure 3.6. While management of the metadata is the same for text and image collections, the techniques needed for direct image comparison are quite different from

those used for text documents. Therefore, current image retrieval systems use two quite different approaches for image retrieval (not necessarily within the same system).

Information Retrieval can be defined as the retrieval of documents, commonly text but also visual and audio, that describe objects and/or events of interest. Both retrieval types match the query specifications to database values. However, while data retrieval only retrieves items that match the query specification exactly, information retrieval systems return items that are deemed (by the retrieval system) to be relevant or similar to the query terms. In the latter case, the information requester must select the items that are actually relevant to his/her request. Quick examples include a request for the balance of a bank account vs. selecting relevant links from a google.com result list. User requests for data are typically formed as "retrieval-by-content", i.e. the user asks for data related to some desired property or information characteristic. These requests or queries must be specified using one of the query languages supported by the DMS query processing subsystem. A query language is tailored to the data type(s) of the data collection.
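The contrast can be sketched with two queries; the table names and the Oracle Text index presumed on the document body are assumptions for illustration. Data retrieval returns only exact matches, whereas information retrieval ranks items by estimated relevance to the query terms:

-- Data retrieval: an exact-match query against structured data
SELECT balance FROM accounts WHERE account_no = '1234567';

-- Information retrieval: documents ranked by similarity to the query terms
SELECT doc_id, title, SCORE(1) AS relevance
  FROM documents
 WHERE CONTAINS(body, 'database OR management OR sql3', 1) > 0
 ORDER BY SCORE(1) DESC;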
