Data Warehousing Design Considerations

Data Warehouse Design Considerations
M. Tech. Course Seminar Report Submitted in partial fulfillment of the requirements for the degree of Master of Technology
by Abhishek Sugandhi Roll No: 04305016
under the guidance of Prof. N.L.Sarda
Department of Computer Science and Engineering Indian Institute of Technology, Bombay Mumbai
Acknowledgment
I would like to thank my seminar guide, Prof. N. L. Sarda, for his valuable guidance and encouragement, without which it would not have been possible for me to complete my work.
Abstract
A data warehouse is a complex information system used primarily in the decision-making process by means of On-Line Analytical Processing (OLAP) applications. Over the last few years, data warehouses have received a lot of attention from both the industrial and the research community. The reason lies in their great importance: making predictions about the (near) future has always been desirable for business companies. In chapter 1, I discuss the basics of the data warehouse and its modeling techniques. Decision support places some rather different requirements on database technology compared to traditional on-line transaction processing. Data warehouses are usually modeled using dimensional modeling, for better understandability and easy extensibility. As data warehouses store huge amounts of both current and historical data, special attention should be given to changing dimensions, time and date dimensions, and hierarchical dimensions while modeling a data warehouse. In chapter 2, I focus on handling these issues. Software vendors have quickly developed products and services for improving the efficiency of querying data warehouses. In chapter 3, I discuss the querying features provided by Oracle 9i for improving the efficiency of aggregate queries, and the querying features provided by MDX (Multidimensional Expressions), a language used to manipulate multidimensional information in Microsoft SQL Server 2000 Analysis Services.
Contents
1 Introduction
  1.1 What is Data Warehouse
  1.2 Warehouse Schemas
  1.3 Dimensional Model Vs. ER Model
2 Data Warehouse Design Issues
  2.1 How to model time and date dimension
  2.2 Dimension normalization
  2.3 Surrogate keys
  2.4 Slowly Changing Dimensions
    2.4.1 Type 1: Overwrite the Value
    2.4.2 Type 2: Add a new Dimension Row
    2.4.3 Type 3: Add a new Dimension Column
  2.5 Rapidly Changing Dimension
  2.6 Handling Hierarchies
    2.6.1 Fixed Depth Hierarchy
    2.6.2 Variable Depth Hierarchy
  2.7 Multivalued Dimension
  2.8 Heterogeneous Dimension
  2.9 Dimension Role Playing
  2.10 Conformed Dimensions
3 Querying on Data Warehouse
  3.1 Oracle 9i SQL extension for Aggregation Queries in data warehouse
    3.1.1 Syntax
    3.1.2 Applications in building Cross-Tabular Report
  3.2 Writing MDX queries for Data Warehouse
    3.2.1 Common Terms
    3.2.2 Rules
    3.2.3 MDX Query Structure
    3.2.4 Specifying Axis Dimensions
    3.2.5 Establishing Cube Context
    3.2.6 Specifying Slicer Dimensions
    3.2.7 Example Queries
    3.2.8 Difference of MDX with SQL
4 Conclusion
Chapter 1 Introduction
1.1 What is Data Warehouse
Large organizations have a presence in many places. For example, a retail bank may have branches located all over the country, and each of these may generate a large amount of data. Furthermore, large organizations have a complex internal structure, and therefore different data may be present in different locations, on different operational systems, or under different schemas. Corporate decision makers require data from all these sources. Setting up queries on each individual source is inefficient. Moreover, the data sources store only current data, whereas decision makers need access to past data as well. They may, for instance, want to assess the effect of a policy by observing how purchase patterns have changed since its introduction. Data warehouses provide a solution to all these problems. A data warehouse is a repository of information gathered from multiple sources, stored under a unified schema, at a single site. Once gathered, data are stored for a long time, permitting access to historical data. Thus, a data warehouse provides a single consolidated interface to data, making decision support queries easier to write [?].
1.2 Warehouse Schemas
Data warehouses typically have schemas that are designed for data analysis. The data is multidimensional, with measure attributes and dimension attributes. Measure attributes are those that measure some value and can be aggregated upon. For example, in a sales relation, the number of units sold and the amount are measure attributes. Dimension attributes are those that define the dimensions on which measure attributes, or summaries of measure attributes, can be viewed. For example, product name and sales time are dimension attributes [?]. In data warehouses, the fact table contains the multidimensional data, and these tables are usually very large. To minimize the storage requirements, dimension attributes are usually short identifiers that are foreign keys into other tables called dimension tables. Usually, a fact table is associated with many dimension tables and contains a foreign key for each of them. The fact table is kept highly normalized to reduce space requirements, whereas dimension tables are highly denormalized to ease browsing among the different attributes of a dimension and to enable simple and easily understandable queries.

[Figure: the Sales fact table (customer_id, time_id, channel_id, Promotion_id, quantity sold, cost, amount) surrounded by the dimension tables Customer (customer_id, other attributes), Channels (channel_id, other attributes), Time (time_id, other attributes) and Promotion (Promotion_id, other attributes).]
Figure 1.1: Star Schema [?]

A schema with a fact table, multiple dimension tables, and foreign keys from the fact table to the dimension tables is called a star schema. If we normalize the dimension tables, so that a dimension table contains foreign keys to other dimension tables, the resulting schema has multiple levels of dimension tables and is called a snowflake schema. Some complex data warehouses may have more than one fact table [?].
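The star schema of Figure 1.1 can be sketched with an in-memory SQLite database driven from Python. This is only an illustration: the table and key names follow the figure (with `times` used instead of `time` as a table name), while the `name` and `year` columns and all sample rows are assumptions made up for the example.

```python
import sqlite3

# In-memory database; sample rows and non-key columns are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer  (customer_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE channels  (channel_id   INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE times     (time_id      INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE promotion (promotion_id INTEGER PRIMARY KEY, name TEXT);
-- Fact table: one foreign key per dimension plus the measure attributes.
CREATE TABLE sales (
    customer_id   INTEGER REFERENCES customer(customer_id),
    time_id       INTEGER REFERENCES times(time_id),
    channel_id    INTEGER REFERENCES channels(channel_id),
    promotion_id  INTEGER REFERENCES promotion(promotion_id),
    quantity_sold INTEGER,
    cost          REAL,
    amount        REAL
);
INSERT INTO customer  VALUES (1, 'Asha'), (2, 'Ravi');
INSERT INTO channels  VALUES (1, 'Store'), (2, 'Internet');
INSERT INTO times     VALUES (1, 2003), (2, 2004);
INSERT INTO promotion VALUES (1, 'None');
INSERT INTO sales VALUES (1, 1, 1, 1, 2, 10.0, 15.0),
                         (2, 1, 2, 1, 1,  5.0,  8.0),
                         (1, 2, 2, 1, 3, 15.0, 24.0);
""")

# A typical star join: aggregate a measure, grouped by a dimension attribute.
amount_by_channel = conn.execute("""
    SELECT ch.name, SUM(s.amount)
    FROM sales s
    JOIN channels ch ON ch.channel_id = s.channel_id
    GROUP BY ch.name
    ORDER BY ch.name
""").fetchall()
print(amount_by_channel)
```

Every query against the star has the same shape: the fact table joined to whichever dimensions are being constrained or grouped on.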
1.3 Dimensional Model Vs. ER Model
The main difference between the dimensional model and the ER model is that dimension tables in the dimensional model are denormalized, whereas tables in the ER model are highly normalized. The ER design technique seeks to remove redundancy by normalizing the relations, so that less disk space is wasted and there are no insert or update anomalies. In a data warehouse, however, if the dimension tables are normalized into typical snowflake structures, two bad things happen. First, the data model becomes too complex to be presented to the user. Second, linking the elements among the various branches of the snowflake compromises browsing performance. Even when a long text string appears redundantly in the dimension table and could be moved to an outrigger table (a table formed by normalization), you won't save enough disk space to justify moving it, because most of the disk space is consumed by the fact table (which is highly normalized) [?].

In many cases, normalization can actually increase the storage requirements. If the cardinality of the repeated dimension data element is high (in other words, there are just a few duplications), the outrigger table may be nearly as big as the main dimension table, and we have introduced another key structure that is now repeated in both tables [?].

Another argument given for normalizing the dimensions is to improve insert or update performance. This is rarely important in a decision-support environment. Dimension tables are typically updated only once per night, and the processing associated with loading perhaps millions of fact records dominates the minor processing associated with inserting or updating dimension records. A dimensional database design has a fixed structure with no alternative join paths, which greatly simplifies the optimization and evaluation of queries on these schemas [?].

The fact table in a dimensional model represents a many-to-many relationship among the dimension tables. We can convert an ER model into a dimensional model wherever such a many-to-many relationship is present, and such a relationship is always present in data warehouses.
Chapter 2 Data Warehouse Design Issues

2.1 How to model time and date dimension
The date dimension is one of the most important dimensions in a data warehouse. It is guaranteed to be present in every data mart, because virtually every data mart is a time series. Instead of keeping the date as an attribute in the fact table or in another dimension table, we should build a separate dimension table, because it allows analysts to query the data warehouse on special attributes such as holidays or major events. SQL date functions do not support filtering by these attributes, so if the business process needs to slice the data by such nonstandard date attributes, an explicit dimension table is essential. Calendar logic belongs in the dimension table rather than in application code [?].

Unlike most other dimensions, the date dimension can be built in advance. Every day is represented as a row in the date dimension, so keeping 10 years of history needs only about 3650 rows. The date dimension key should be an integer rather than a date data type; this is explained further in the section on surrogate keys.

If we want access to the time of a transaction for day-part analysis, then instead of keeping time-of-day attributes such as hour and minute as fields in the date dimension table, we should handle them through a separate Time Of Day dimension joined to the fact table [?]. This saves a good amount of space: instead of 24 * 60 = 1440 rows per day in the date dimension table (3650 * 1440 = 5,256,000 rows for 10 years, which is very large for a dimension table), we build a single Time Of Day table containing only 1440 rows in total. The date dimension and the Time Of Day dimension are completely independent.
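As a minimal sketch of building both dimensions in advance, the following Python generates one row per day and one row per minute. The attribute names, the date range and the two-entry holiday list are assumptions chosen for illustration; a real warehouse would load its own business calendar.

```python
from datetime import date, timedelta

# Hypothetical holiday list, used only to show a nonstandard date attribute.
HOLIDAYS = {
    date(2004, 1, 26): "Republic Day",
    date(2004, 8, 15): "Independence Day",
}

def build_date_dimension(start, end):
    """Return one row per calendar day, keyed by a sequential integer."""
    rows = []
    day, key = start, 1
    while day <= end:
        rows.append({
            "date_key": key,                  # integer surrogate key
            "full_date": day.isoformat(),
            "month": day.month,
            "year": day.year,
            "is_weekend": day.weekday() >= 5,
            "holiday": HOLIDAYS.get(day, "Not a holiday"),
        })
        day += timedelta(days=1)
        key += 1
    return rows

def build_time_of_day_dimension():
    """Return one row per minute of the day, independent of any date."""
    return [
        {"time_key": hour * 60 + minute + 1, "hour": hour, "minute": minute}
        for hour in range(24)
        for minute in range(60)
    ]

date_dim = build_date_dimension(date(2000, 1, 1), date(2009, 12, 31))
time_dim = build_time_of_day_dimension()
print(len(date_dim), len(time_dim))
```

The ten calendar years 2000 to 2009 include three leap days, so this sketch yields 3653 date rows (close to the 3650 estimated above) and 1440 time-of-day rows, rather than the several million rows a combined table would need.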
2.2 Dimension normalization
Dimension table normalization is usually referred to as snowflaking. Redundant attributes are removed from the flat, denormalized dimension table and placed in normalized secondary dimension tables. We should generally avoid snowflaking, for the following reasons [?]:
1. Snowflaked tables make for a much more complex presentation.
2. Numerous tables and joins usually translate into slower query performance.
3. The minor space savings associated with snowflaked dimension tables are insignificant, as dimension tables are generally small and most of the space is consumed by the fact table.
4. Snowflaking slows down the user's ability to browse within a dimension.
5. Finally, snowflaking defeats the use of bitmap indexes, which are very useful for indexing low-cardinality fields in dimension tables.
There are, however, times when snowflaking is permissible, such as when a clump of correlated attributes is used repeatedly in various independent roles. For example, in a promotion dimension we would need to store both a promotion begin date and a promotion end date attribute [?]. Another example is a multivalued attribute, for which we need a bridge table.
2.3 Surrogate keys
Surrogate keys are integers that are assigned sequentially as needed to populate a dimension. They serve merely to join the dimension tables to the fact table. Surrogate keys are beneficial for the following reasons [?]:
1. We should avoid using operational codes or other smart keys as data warehouse keys, because operational codes are normally recycled after some period, say one year, whereas the data warehouse retains data for years. One of the primary benefits of surrogate keys is that they buffer the data warehouse environment from operational changes. If we rely on operational codes, we are also vulnerable to key overlap problems.
2. Smaller surrogate keys translate into smaller fact tables, smaller fact table indexes, and more fact table rows per I/O operation.
3. Surrogate keys can be used to record dimension conditions that have no operational code, for example when our dimensional model has dates that are yet to be determined. There is no SQL date value for this condition, but it can be handled with surrogate keys: we keep one more row in the date dimension table, with its own unique key, to identify the YET TO DETERMINE condition, and thereby avoid a null date dimension key in the fact table.
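The third point can be illustrated with a small SQLite sketch in Python. The table names, the choice of key 0 for the special row and the sample orders are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (
    date_key  INTEGER PRIMARY KEY,  -- surrogate key, no meaning of its own
    full_date TEXT,                 -- NULL for the special row
    label     TEXT
);
CREATE TABLE orders_fact (
    order_date_key INTEGER NOT NULL REFERENCES date_dim(date_key),
    ship_date_key  INTEGER NOT NULL REFERENCES date_dim(date_key),
    amount         REAL
);
-- Special row: a dimension condition that has no SQL date value.
INSERT INTO date_dim VALUES (0, NULL, 'Yet to be determined');
INSERT INTO date_dim VALUES (1, '2004-03-01', '1 March 2004');
INSERT INTO date_dim VALUES (2, '2004-03-02', '2 March 2004');
-- The second order has not shipped yet, so it points at key 0, not NULL.
INSERT INTO orders_fact VALUES (1, 2, 100.0), (2, 0, 250.0);
""")

# An ordinary inner join still returns every fact row.
amount_by_ship_date = conn.execute("""
    SELECT d.label, SUM(f.amount)
    FROM orders_fact f
    JOIN date_dim d ON d.date_key = f.ship_date_key
    GROUP BY d.label
    ORDER BY d.label
""").fetchall()
print(amount_by_ship_date)
```

Because the unshipped order joins to a real dimension row, it appears in reports under a readable label instead of silently dropping out of the join.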
2.4 Slowly Changing Dimensions
While dimension attributes are relatively static, they are not fixed forever. Dimension attributes change, albeit rather slowly, over time. Accurate tracking of change is necessary so that business users can see the impact of each dimension change. When we need to keep track of changes, it is unacceptable to put everything into the fact table or to make every dimension time-dependent. Instead, we can take advantage of the fact that most dimensions are nearly constant over time, and preserve the independent dimensional structure with only relatively minor adjustments to contend with the changes. We refer to these nearly constant dimensions as slowly changing dimensions [?].

For each attribute in a dimension table, we must specify a strategy to handle change. There are three basic techniques for dealing with attribute changes [?]:
1. Overwrite the value
2. Add a dimension row
3. Add a dimension column
For example, suppose that manufacturing operations makes a slight change in the packaging of SKU 38 (the unique product number given by the organization), and the packaging description changes from glued box to pasted box. Along with this change, manufacturing operations decides not to change the SKU number of the product or the bar code (UPC) printed on the box. Let us see how each of the above methods handles this change.
2.4.1 Type 1: Overwrite the Value
With Type 1, we merely overwrite the old attribute value in the dimension row, replacing it with the current value. The attribute therefore always reflects the most recent assignment. The Type 1 response is the simplest and fastest to implement, but it does not maintain any history of prior attribute values [?]. Nevertheless, overwriting is frequently used when the data warehouse team legitimately decides that the old value of the changed dimension attribute is not interesting [?]. In the above example, the original row

Product Key   Product Description   Packaging    SKU No.
12345         Scent                 glued box    ABC922

will be updated as

Product Key   Product Description   Packaging    SKU No.
12345         Scent                 pasted box   ABC922
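A Type 1 change amounts to a single in-place update. The sketch below replays the example in SQLite from Python; the table and column names are assumptions derived from the column headings used in the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE product_dim (
        product_key INTEGER PRIMARY KEY,
        description TEXT,
        packaging   TEXT,
        sku_no      TEXT
    )
""")
conn.execute("INSERT INTO product_dim VALUES (12345, 'Scent', 'glued box', 'ABC922')")

# Type 1: overwrite the attribute in place; the prior value is not kept.
conn.execute(
    "UPDATE product_dim SET packaging = ? WHERE sku_no = ?",
    ("pasted box", "ABC922"),
)
type1_rows = conn.execute("SELECT * FROM product_dim").fetchall()
print(type1_rows)
```

After the update there is still exactly one row for the product, and nothing in the warehouse records that the packaging was ever different.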
2.4.2 Type 2: Add a new Dimension Row
The second technique is the most common and has a number of powerful advantages. If the data warehouse team decides to track the change of an attribute, we issue another record (a row in the dimension table) with the changed attribute value. Apart from the dimension key, the only difference between the records is the changed attribute; even the operational codes are the same. This technique for tracking slowly changing dimensions is very powerful because new dimension records automatically partition history in the fact table. The old version of the dimension record points to all history in the fact table prior to the change, and the new version points to all history after the change [?]. There is no need for a time-stamp in the product table to record the change; it is best recorded by the fact table records that carry the key of the newly added record [?].

Another advantage of this technique is that we can gracefully track as many changes to a dimensional item as we wish. Each change generates a new dimension record, and each record partitions history perfectly. The main drawbacks of the technique are the requirement to generalize the dimension key and the growth of the dimension table itself [?]. Using the Type 2 technique for the previous example, we would have two product dimension rows (original and updated):

Product Key   Product Description   Packaging    SKU No.
12345         Scent                 glued box    ABC922
34567         Scent                 pasted box   ABC922
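The way a Type 2 row partitions fact history can be seen in a small SQLite sketch run from Python. The dimension rows come from the example above; the `sales_fact` table and its quantities are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (
    product_key INTEGER PRIMARY KEY,   -- generalized (surrogate) key
    description TEXT, packaging TEXT, sku_no TEXT
);
CREATE TABLE sales_fact (
    product_key INTEGER REFERENCES product_dim(product_key),
    quantity    INTEGER
);
INSERT INTO product_dim VALUES (12345, 'Scent', 'glued box', 'ABC922');
INSERT INTO sales_fact  VALUES (12345, 10), (12345, 5);   -- before the change

-- Type 2: add a new dimension row with a new key and the same SKU.
INSERT INTO product_dim VALUES (34567, 'Scent', 'pasted box', 'ABC922');
INSERT INTO sales_fact  VALUES (34567, 7);                -- after the change
""")

# Grouping by the changed attribute splits history at the change.
by_packaging = conn.execute("""
    SELECT p.packaging, SUM(f.quantity)
    FROM sales_fact f JOIN product_dim p USING (product_key)
    GROUP BY p.packaging ORDER BY p.packaging
""").fetchall()

# Grouping by the unchanged operational code spans both versions.
by_sku = conn.execute("""
    SELECT p.sku_no, SUM(f.quantity)
    FROM sales_fact f JOIN product_dim p USING (product_key)
    GROUP BY p.sku_no
""").fetchall()
print(by_packaging, by_sku)
```

No time-stamp is consulted anywhere: which packaging a sale belongs to is determined purely by the dimension key stored in the fact row.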
2.4.3 Type 3: Add a new Dimension Column
While the Type 2 response partitions history, it does not allow us to associate the new attribute value with old fact history, or vice versa. However, we sometimes want the ability to see fact data as if the change never occurred. We can attack this requirement, not by creating a new dimension record as in the Type 2 technique, but by creating a new current value field. The Type 3 technique allows us to see new and historical fact data by either the new or the prior attribute value [?]. Using the Type 3 technique for the previous example, we would update the original row as

Product Key   Product Description   Current Packaging   Previous Packaging   SKU No.
12345         Scent                 pasted box          glued box            ABC922
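A Type 3 change adds a column rather than a row. The following SQLite sketch in Python keeps the existing `packaging` column as the current value and adds an assumed `previous_packaging` column; the fact table and its quantities are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (
    product_key INTEGER PRIMARY KEY,
    description TEXT, packaging TEXT, sku_no TEXT
);
CREATE TABLE sales_fact (product_key INTEGER, quantity INTEGER);
INSERT INTO product_dim VALUES (12345, 'Scent', 'glued box', 'ABC922');
INSERT INTO sales_fact  VALUES (12345, 10), (12345, 7);

-- Type 3: add a column for the prior value, then move the old value into it.
ALTER TABLE product_dim ADD COLUMN previous_packaging TEXT;
UPDATE product_dim
   SET previous_packaging = packaging,   -- right-hand sides use the old row
       packaging = 'pasted box'
 WHERE product_key = 12345;
""")

type3_row = conn.execute("SELECT * FROM product_dim").fetchone()

def total_by(column):
    """Sum all facts grouped by either the current or the previous value."""
    assert column in ("packaging", "previous_packaging")
    return conn.execute(f"""
        SELECT p.{column}, SUM(f.quantity)
        FROM sales_fact f JOIN product_dim p USING (product_key)
        GROUP BY p.{column}
    """).fetchall()

print(type3_row, total_by("packaging"), total_by("previous_packaging"))
```

All fact rows, old and new, can be reported under either value, which is exactly what Type 2 cannot offer; the price is that only one prior value is remembered.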
2.5 Rapidly Changing Dimension
Normally, we would not use any of the techniques described above for handling changing dimensions if the dimension already contains millions of rows. Unfortunately, huge dimensions are also more likely to change than moderately sized dimensions. We sometimes call this situation a rapidly changing monster dimension [?]. The solution is to break off the frequently analyzed or frequently changing attributes into a separate dimension, referred to as a minidimension [?]. There is one row in the minidimension for every unique combination of these attribute values encountered in the data (not one row per customer). We leave the more constant or less frequently queried attributes behind in the original huge customer table. When creating the minidimension, continuously variable attributes should be converted to banded ranges. In other words, we force the attributes in the minidimension to take on a relatively small number of discrete values [?]. Although this restricts analysis to the predefined bands, it drastically reduces the number of combinations in the minidimension.

Every time we build a fact table row, we include two foreign keys related to the dimension: the regular dimension key and the minidimension key. The minidimension key should be part of the fact table's set of foreign keys, to provide efficient access to the fact table. This design delivers browsing and constraining performance benefits by providing a smaller point of entry to the fact table, and we avoid joins to the huge dimension table when the (static) attributes from that table are not constrained. Another benefit of having the minidimension key participate as a foreign key in the fact table is that the fact table captures the minidimension's attribute changes. When an attribute changes, new fact rows are simply loaded with a different minidimension key, while earlier rows still carry the old key. Thus we keep track of history as well [?].
[Figure: the original Customer Dimension (Customer Key, Customer ID, Customer Name, ..., Age, Gender, Annual Income) becomes a Customer Dimension (Customer Key, Customer ID, Customer Name, ...) plus a Customer Minidimension (Customer Minidimension Key, Customer Age Band, Customer Gender, Customer Income Band). The Fact Table holds Customer Key, Customer Minidimension Key, more foreign keys, and facts.]
Figure 2.1: Example of Handling Rapidly Changing large Dimension [?]
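The banding and key assignment can be sketched in a few lines of Python. The band boundaries, the sample customers and the fact amounts below are assumptions chosen only to make the mechanics visible.

```python
def age_band(age):
    """Map a continuously variable age onto a few hypothetical bands."""
    if age <= 20:
        return "20 or under"
    if age <= 40:
        return "21-40"
    if age <= 60:
        return "41-60"
    return "over 60"

def income_band(income):
    """Map annual income onto a few hypothetical bands."""
    if income < 50000:
        return "under 50000"
    if income < 100000:
        return "50000-99999"
    return "100000 or more"

minidimension = {}  # (age band, gender, income band) -> minidimension key

def minidimension_key(customer):
    """Return the key for the customer's band combination, creating it if new."""
    combo = (age_band(customer["age"]), customer["gender"],
             income_band(customer["income"]))
    return minidimension.setdefault(combo, len(minidimension) + 1)

customers = [
    {"customer_key": 1, "age": 25, "gender": "F", "income": 30000},
    {"customer_key": 2, "age": 38, "gender": "F", "income": 45000},
    {"customer_key": 3, "age": 52, "gender": "M", "income": 80000},
    {"customer_key": 4, "age": 26, "gender": "F", "income": 120000},
]

# Each fact row carries both the regular key and the minidimension key.
fact_table = [
    {"customer_key": c["customer_key"],
     "customer_mini_key": minidimension_key(c),
     "amount": 100}
    for c in customers
]

# Customer 1 moves into a higher income band: later fact rows simply carry
# a different minidimension key, while earlier rows keep the old one.
customers[0]["income"] = 60000
fact_table.append({"customer_key": 1,
                   "customer_mini_key": minidimension_key(customers[0]),
                   "amount": 40})
print(minidimension)
```

Customers 1 and 2 initially share one minidimension row, which is the point of banding: the minidimension grows with the number of distinct combinations, not with the number of customers.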
2.6 Handling Hierarchies
In many dimensions, a hierarchy is inherent. We consider two approaches to handling hierarchies. The first is straightforward and handles the hierarchy adequately with a simple approach. The second is much more advanced and complicated, but also much more extensible.
2.6.1 Fixed Depth Hierarchy
This case is rare, but sometimes we are confronted with a dimension whose hierarchy is highly predictable, with a fixed number of levels (say N). In this case, we can keep N attributes in the dimension, one for each level [?]. If some records in the dimension table do not have a hierarchy up to the maximum number of levels, we duplicate the lower-level attributes into the higher-level attributes. In this way, we can report at any level of the hierarchy for every record of that dimension.
2.6.2 Variable Depth Hierarchy
Representing an arbitrary variable depth hierarchy is an inherently difficult task in a relational environment. A simple computer science approach to storing such information would add a Parent Key field to the Customer dimension. The Parent Key field would be a recursive pointer containing the key value of the parent of any given customer. A special null value would be required for the topmost Customer in any given overall enterprise [?]. The problem with this recursive pointer approach is that it cannot be used effectively with standard SQL. The standard SQL GROUP BY clause cannot follow the recursive pointers downward to aggregate an additive fact in the fact table [?].

Instead of using a recursive pointer, we can solve this modeling problem by inserting a bridge table (helper table) between the dimension table and the fact table. The bridge table contains one record for each separate path from each node in the organization tree to itself and to every node below it. Each path row contains the key of the parent roll-up entity, the key of the subsidiary entity, the number of levels between the parent and the subsidiary, a bottom-most flag indicating that there are no further nodes beneath the subsidiary and, finally, a top-most flag indicating that there are no further nodes above the parent [?].

If we want to descend the hierarchy, we join the dimension table to the bridge table by connecting the dimension's primary key with the bridge table's parent key. We can then constrain any particular dimension record and request an aggregate measure of all records at or below it, using the number-of-levels attribute to control the depth of analysis. Similarly, when we want to ascend the hierarchy, we reverse the join by connecting the dimension key with the bridge table's subsidiary key [?].

When a group of nodes is moved from one part of an organization to another, only the bridge table rows that refer to paths from outside parents into the moved structure need to be altered. Rows referring to paths within the moved structure are not affected. We need to add rows for the paths from the new parents of the moved structure. When issuing SQL statements that use the bridge table, we need to be cautious about overcounting the facts: when connecting the tables, we must constrain the customer dimension to a single value and then join to the bridge table [?].
[Figure: the Customer Dimension (Customer Key, Customer ID, Customer Name) joins to the Hierarchy Bridge (Parent Customer Key, Subsidiary Customer Key, Level Name, Bottom Flag, Top Flag), which joins to the Fact Table (Date Key, Customer Key).]
Figure 2.2: Handling hierarchy through bridge table [?]
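To make the bridge table concrete, the Python sketch below derives it from a recursive parent pointer and then uses it to descend and ascend the hierarchy. The five-customer tree, the sales figures and the function names are hypothetical; the row layout follows the description of the bridge table above.

```python
# Hypothetical organisation tree: subsidiary customer key -> parent customer key.
# Customer 1 has no entry because it is the top-most customer.
parent_of = {2: 1, 3: 1, 4: 2, 5: 2}
customers = [1, 2, 3, 4, 5]
sales = {1: 100, 2: 50, 3: 30, 4: 20, 5: 10}   # additive fact per customer

def build_bridge(parent_of, customers):
    """One row for each path from a node to itself and to every node below it."""
    parents = set(parent_of.values())
    rows = []
    for node in customers:
        ancestor, levels = node, 0
        while ancestor is not None:
            rows.append({
                "parent_key": ancestor,
                "subsidiary_key": node,
                "levels": levels,
                "bottom_flag": node not in parents,     # nothing beneath subsidiary
                "top_flag": ancestor not in parent_of,  # nothing above parent
            })
            ancestor = parent_of.get(ancestor)
            levels += 1
    return rows

bridge = build_bridge(parent_of, customers)

def total_at_or_below(parent_key, max_levels=None):
    """Descend: constrain ONE parent, then sum the facts of its subsidiaries."""
    return sum(
        sales[row["subsidiary_key"]]
        for row in bridge
        if row["parent_key"] == parent_key
        and (max_levels is None or row["levels"] <= max_levels)
    )

def ancestors_of(subsidiary_key):
    """Ascend: reverse the join and list the parents of one subsidiary."""
    return [row["parent_key"] for row in bridge
            if row["subsidiary_key"] == subsidiary_key]

print(len(bridge), total_at_or_below(2), ancestors_of(4))
```

Constraining to a single parent is what prevents overcounting: if two parents on the same path were selected at once, the facts of their shared subsidiaries would be summed twice.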
2.7 Multivalued Dimension
There are situations in which we need to attach a multivalued dimension table to the fact table, for example when many customers are associated with one account, or when multiple diagnoses are associated with a single patient. Database designers usually take one of the following approaches to handling multivalued dimension attributes [?]:
1. Choose one value and omit the other values.
2. Extend the dimension list to have a fixed number of multivalued dimensions.
3. Put a bridge (helper) table between the fact table and the multivalued dimension table.
Frequently, designers choose a single value (the first approach). If we take this approach, the modeling problem goes away, but we are left in doubt whether the multivalued dimension data is useful. The second approach, creating a fixed number of additional multivalued dimension slots in the fact table key, is also not a good idea, as there can be situations where the number of values exceeds the slots we have allocated. Also, we cannot easily query the multiple separate multivalued dimensions [?].

A bridge table placed between the multivalued dimension and the fact table is the best solution. The multivalued dimension key in the fact table is changed to a multivalued dimension group key. The helper table in the middle is the multivalued dimension group table. It has one record for each member of a group of multivalued dimension values [?]. The group table is joined to the original multivalued dimension on the multivalued dimension key. The group table contains a very important numeric attribute: the weighting factor. The weighting factor allows reports to be created that don't double count the Billed Amount in the fact table. We can assign the weighting factors equally within a group: if a group has three members, each gets a weighting factor of 1/3. If we have some other rational basis for assigning the weighting factors differently, we can change the factors, as long as all the factors in a Diagnosis Group always add up to one.
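The effect of the weighting factor can be checked with a short Python sketch. The diagnosis names, group contents and billed amounts are made up for illustration, and equal weights are assumed within each group.

```python
from fractions import Fraction

# Hypothetical diagnosis groups: group key -> diagnoses in that group.
groups = {1: ["flu", "asthma", "diabetes"], 2: ["flu"]}

# Bridge (helper) table rows: (group key, diagnosis key, weighting factor).
# Equal weights are assumed, so the factors of each group add up to one.
bridge = [
    (group_key, diagnosis, Fraction(1, len(diagnoses)))
    for group_key, diagnoses in groups.items()
    for diagnosis in diagnoses
]

# Fact rows: (diagnosis group key, billed amount).
facts = [(1, 300), (2, 100)]

weighted = {}    # correctly weighted report
unweighted = {}  # "impact" report that double counts
for group_key, billed in facts:
    for bridge_group, diagnosis, weight in bridge:
        if bridge_group == group_key:
            weighted[diagnosis] = weighted.get(diagnosis, 0) + billed * weight
            unweighted[diagnosis] = unweighted.get(diagnosis, 0) + billed

print(weighted, sum(weighted.values()), sum(unweighted.values()))
```

With the weights applied, the per-diagnosis amounts add up to the 400 actually billed; without them the same report totals 1000, because the 300 billed to the three-diagnosis group is counted once per diagnosis.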
[Figure: the Patient Fact Table (Diagnosis Group Key, other attributes) joins to the Diagnosis Group helper table (Diagnosis Group Key, Diagnosis Key, Weight), which joins to the Diagnosis Dimension Table (Diagnosis Key, other attributes).]
Figure 2.3: Handling Multivalued Diagnosis Dimension through bridge table [?]

2.8 Heterogeneous Dimension
Often, in the real world, a business provides heterogeneous services or products. For example, a retail bank offers a variety of products, such as mortgages or checking accounts, to the same customer. These products have specific attributes and facts that relate to them only, as well as general attributes and facts that are common among them. In this case, business users typically require two different perspectives that are difficult to present in a single fact table. The first perspective is a global view, including the ability to slice and dice all general facts simultaneously, regardless of product type. The second perspective is a specific line-of-business view that focuses on the in-depth details of one business, such as checking or mortgages [?].

There is a long list of attributes specific to any one line of business. We cannot add these special facts to one fact table; if we did so for each line of business, we would end up with several hundred facts, most of which would be null in any given row. Likewise, if we attempted to include the attributes specific to each line of business in one dimension table, we would have hundreds of attributes, almost all of which would be empty for any given row. The solution to this dilemma is to create a custom schema for, say, the checking line of business that is limited to just checking accounts. Both the custom checking fact table and the corresponding product dimension are widened to describe all the specific facts and attributes that make sense only for checking products [?]. These custom tables also contain the core attributes and facts, so that we can avoid joining the core and custom schemas in order to get the complete set of facts and attributes. The keys of the custom product dimension are the same as those used in the core product dimension, which contains all possible product keys [?]. As conformed dimensions are essential, each custom product dimension is a subset of the rows of the core product dimension table.

A family of core and custom fact tables is needed when a business has heterogeneous products that have naturally different facts and descriptors, but a single customer base that demands an integrated view. We can consider the handling of the attributes specific to a line of business as a context-dependent outrigger to the core dimension. We can isolate the core attributes in the base product dimension table and include a snowflake key in each base record that points to its proper custom dimension outrigger [?].

If the lines of business are physically separate, so that the custom tables cannot reside in the same space as the core tables, the data in the core fact table needs to be duplicated exactly once to implement all the custom tables. Otherwise, we can avoid duplicating both the core fact keys and the core facts in the custom line-of-business fact tables [?].
[Figure: the Fact Table (Date Key, Dimension 1 Key, Dimension 2 Key, more foreign keys, core facts, custom facts) joins to General Dimension 1 (Dimension 1 Key, core attributes) and General Dimension 2 (Dimension 2 Key, core attributes). Each general dimension has a Specific Line of Business Dimension (Specific Dimension Key, custom attributes) attached to it.]
Figure 2.4: Handling Heterogeneous Dimensions [?]
2.9 Dimension Role Playing
A role in a data warehouse is a situation in which a single dimension appears several times in the same fact table. In certain kinds of fact tables, Date can appear repeatedly. For example, a typical fact table can include Order Date, Packaging Date, Shipping Date, Delivery Date, Payment Date, Return Date, and Refer to Collection Date, along with other facts [?]. We cannot join these seven foreign keys to the same table: SQL would interpret such a seven-way simultaneous join as requiring that all of the dates be the same. Instead, we have to create the illusion of seven independent Date dimension tables. We even need to go to the length of labeling all of the columns in each of the tables uniquely. If we don't, we will not be able to tell the columns apart when several of them have been dragged into a report [?].

For the user, we can create the illusion of seven independent date tables in a couple of ways. We can either make seven identical physical copies of the date table, or create seven virtual copies of it with the SQL SYNONYM command. Regardless of the approach, once we have made these seven clones, we still have to define a SQL view on each copy in order to make the field names uniquely different [?]. Now that we have seven differently described Date dimensions, they can be used as if they were independent. They can have completely unrelated constraints, and they can play different roles in a report.
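A minimal sketch of role playing, using SQLite from Python, is given here. SQLite has no synonyms, so the views are defined directly on the single physical date table; only two of the seven roles are shown, and all table, view and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, full_date TEXT, month_name TEXT);
INSERT INTO date_dim VALUES (1, '2004-03-30', 'March'), (2, '2004-04-02', 'April');

CREATE TABLE orders_fact (order_date_key INTEGER, shipping_date_key INTEGER, amount REAL);
INSERT INTO orders_fact VALUES (1, 1, 100.0), (1, 2, 250.0);

-- One view per role, each with uniquely labelled columns.
CREATE VIEW order_date AS
    SELECT date_key   AS order_date_key,
           full_date  AS order_date,
           month_name AS order_month
    FROM date_dim;
CREATE VIEW shipping_date AS
    SELECT date_key   AS shipping_date_key,
           full_date  AS shipping_date,
           month_name AS shipping_month
    FROM date_dim;
""")

# Each role is joined and constrained independently of the other.
ordered_march_shipped_april = conn.execute("""
    SELECT o.order_month, s.shipping_month, SUM(f.amount)
    FROM orders_fact f
    JOIN order_date    o ON o.order_date_key    = f.order_date_key
    JOIN shipping_date s ON s.shipping_date_key = f.shipping_date_key
    WHERE o.order_month = 'March' AND s.shipping_month = 'April'
    GROUP BY o.order_month, s.shipping_month
""").fetchall()
print(ordered_march_shipped_april)
```

Because the two views expose differently named columns, a report can show the order month next to the shipping month without ambiguity, even though both come from the same underlying rows.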
2.10 Conformed Dimensions
A conformed dimension is a dimension that means the same thing with every possible fact table to which it can be joined. Generally this means that a conformed dimension is identical in each data mart. A major responsibility of the central data warehouse design team is to establish, publish, maintain, and enforce the conformed dimensions. Conformed dimensions are enormously important to the data warehouse. Without a strict adherence to conformed dimensions, the data warehouse cannot function as an integrated whole. Conformed dimensions make possible a single dimension table to be used against multiple fact tables in the same database space, consistent user interfaces and data content whenever the dimension is used, and a consistent interpretation of attributes and, therefore, roll ups across data marts [?]. It is possible to create a subset of a conformed dimension table for certain data marts if you know that the domain of the associated fact table only contains that subset. For example, the master Product table can be restricted to just those products manufactured at a particular location if the data mart in question pertains only to that location. We could call this a simple data subset, because the reduced dimension table preserves all the attributes of the original dimension and exists at the original granularity [?].
Chapter 3 Querying on Data Warehouse

3.1 Oracle 9i SQL Extensions for Aggregation Queries in a Data Warehouse
In this section, all example queries are run against the Sales History schema shown in Figure 1.1. All examples and theory are taken from [?]. Aggregation is a fundamental part of data warehousing. To improve aggregation performance in the warehouse, Oracle provides the following extensions to the GROUP BY clause, which make query reporting faster and easier.
ROLLUP Extension
ROLLUP calculates aggregate functions such as SUM, COUNT, MAX, MIN, and AVG at increasing levels of aggregation, from the most detailed level up to a grand total. It is very helpful for subtotaling along a hierarchical dimension such as time or geography. The subtotals follow the grouping list specified in the ROLLUP clause.
CUBE Extension
CUBE is an extension similar to ROLLUP, enabling a single statement to calculate all possible combinations of aggregations. CUBE can generate the information needed in cross-tabulation reports with a single query. It is typically most suitable in queries that use columns from multiple dimensions rather than columns representing different levels of a single dimension. CUBE takes a specified set of grouping columns and creates subtotals for all of their possible combinations. In terms of multidimensional analysis, CUBE generates all the subtotals that could be calculated for a data cube with the specified dimensions.
Multiple SELECT statements combined with UNION ALL could provide the same information gathered through CUBE or ROLLUP, but this might require many SELECT statements. The more columns used in a CUBE or ROLLUP clause, the greater the savings compared to the UNION ALL approach.
GROUPING Functions
The GROUPING functions help you identify the group each row belongs to, and they enable sorting of subtotal rows and filtering of results. First, GROUPING distinguishes NULL values created by CUBE or ROLLUP from stored NULL values. Second, it lets a program find out the level of aggregation of a given subtotal. GROUPING returns 1 when it encounters a NULL value created by a ROLLUP or CUBE operation; that is, if the NULL indicates that the row is a subtotal, GROUPING returns 1. Any other type of value, including a stored NULL, returns 0.
GROUPING_ID Function
GROUPING_ID returns a single number that enables you to determine the exact GROUP BY level. For each row, GROUPING_ID takes the set of 1s and 0s that would be generated if you used the appropriate GROUPING functions and concatenates them, forming a bit vector. The bit vector is treated as a binary number, and the number's base-10 value is returned by the GROUPING_ID function.
GROUPING SETS Expression
Computing a full cube creates a heavy processing load, so replacing cubes with grouping sets can significantly increase performance. You can selectively specify the set of groups that you want to create using a GROUPING SETS expression within a GROUP BY clause. This allows precise specification across multiple dimensions without computing the whole CUBE.
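The bit-vector construction behind GROUPING_ID can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not Oracle's implementation; the function names and the idea of passing in the set of aggregated-away columns are assumptions made for the example.

```python
def grouping(rolled_up, column):
    """Mimic GROUPING(column): 1 if the column was aggregated away for this row, else 0."""
    return 1 if column in rolled_up else 0


def grouping_id(rolled_up, columns):
    """Mimic GROUPING_ID(columns): read the GROUPING bits left to right as a binary number."""
    value = 0
    for column in columns:
        value = (value << 1) | grouping(rolled_up, column)
    return value


COLUMNS = ["channel_desc", "country_id"]

# Each case names the columns that the CUBE aggregated away for one kind of row
# (hypothetical bookkeeping for the sketch; the database tracks this internally).
CASES = [
    ("detail row", set()),
    ("channel subtotal", {"country_id"}),
    ("country subtotal", {"channel_desc"}),
    ("grand total", {"channel_desc", "country_id"}),
]

for label, rolled_up in CASES:
    print(label, grouping_id(rolled_up, COLUMNS))
```

With two grouping columns the four aggregation levels map to the values 0, 1, 2 and 3. A stored NULL does not change the result, because the column was not aggregated away and its GROUPING bit stays 0.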
3.1.1 Syntax
Extension        Syntax
ROLLUP           SELECT ... GROUP BY ROLLUP(grouping_column_reference_list)
PARTIAL ROLLUP   GROUP BY expr1, ROLLUP(expr2, expr3)
CUBE             SELECT ... GROUP BY CUBE(grouping_column_reference_list)
PARTIAL CUBE     GROUP BY expr1, CUBE(expr2, expr3)
GROUPING         SELECT ... [GROUPING(dimension_column) ...] ... GROUP BY ... {CUBE | ROLLUP} (dimension_column)
GROUPING SETS    GROUP BY [GROUPING SETS (dimension_column ...)]
3.1.2 Applications in Building Cross-Tabular Reports
These extensions are used to generate cross-tabular reports easily and efficiently. For example, consider the cross-tabular report in Figure 3.1, which shows the total sales by country_id and channel_desc for the US and UK through the Internet and Direct Sales channels in September 2004.
Channel        UK       US       Total
Direct Sales   100000   200000   300000
Internet       75000    45000    120000
Total          175000   245000   420000
Figure 3.1: Cross Tabular Report
To build this report we need to calculate four subtotals and one grand total. Five of the nine values needed for the report would not be calculated by a query that requested SUM(amount_sold) and did a GROUP BY (channel_desc, country_id). To get the higher-level aggregates we would require additional queries. With the CUBE extension in the GROUP BY clause, however, a single query generates all the subtotals and the grand total. The query is:
SELECT channel_desc, country_id, SUM(amount_sold)
FROM sales, customers, times, channels
WHERE sales.time_id = times.time_id
  AND sales.cust_id = customers.cust_id
  AND sales.channel_id = channels.channel_id
  AND channels.channel_desc IN ('Direct Sales', 'Internet')
  AND times.calendar_month_desc = '2004-09'
  AND country_id IN ('UK', 'US')
GROUP BY CUBE(channel_desc, country_id);
The result of this query appears in the table below:
channel_desc   country_id   amount_sold
Direct Sales   UK           100000
Direct Sales   US           200000
Direct Sales   NULL         300000
Internet       UK           75000
Internet       US           45000
Internet       NULL         120000
NULL           UK           175000
NULL           US           245000
NULL           NULL         420000
In the table above, the GROUP BY clause of the query has evaluated four levels of aggregation:
1. (channel_desc, country_id)
2. (channel_desc)
3. (country_id)
4. Grand total
The values in the result of this query can be sent directly to the cross-tabular report.
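The four aggregation levels can be imitated outside the database to see where each value of the report comes from. The following Python sketch is an illustration only: it starts from the four detail cells of Figure 3.1 (already summed per channel and country) rather than from the sales table, and the names DETAIL and aggregate are invented for the example.

```python
from collections import defaultdict

# Detail cells (channel_desc, country_id, amount_sold) taken from Figure 3.1.
DETAIL = [
    ("Direct Sales", "UK", 100000),
    ("Direct Sales", "US", 200000),
    ("Internet", "UK", 75000),
    ("Internet", "US", 45000),
]
COLUMNS = ("channel_desc", "country_id")


def aggregate(rows, grouping_sets):
    """Sum amount_sold once per grouping set; columns outside the set become None."""
    result = []
    for gset in grouping_sets:
        totals = defaultdict(int)
        for channel, country, amount in rows:
            values = {"channel_desc": channel, "country_id": country}
            key = tuple(values[c] if c in gset else None for c in COLUMNS)
            totals[key] += amount
        result.extend((*key, total) for key, total in totals.items())
    return result


# The four levels evaluated by CUBE(channel_desc, country_id).
CUBE_LEVELS = [("channel_desc", "country_id"), ("channel_desc",), ("country_id",), ()]

cube_rows = aggregate(DETAIL, CUBE_LEVELS)
for row in cube_rows:
    print(row)
```

Each level contributes its own rows: four detail rows, two channel subtotals, two country subtotals, and one grand total, which together are the nine values of the cross-tabular report. None plays the role of the NULL that the database returns for an aggregated-away column.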
If a similar query is executed with the ROLLUP extension instead of the CUBE extension in its GROUP BY clause, it performs the following three levels of aggregation:
1. (channel_desc, country_id)
2. (channel_desc)
3. Grand total
The result of the query appears in the following table:
channel_desc   country_id   amount_sold
Direct Sales   UK           100000
Direct Sales   US           200000
Direct Sales   NULL         300000
Internet       UK           75000
Internet       US           45000
Internet       NULL         120000
NULL           NULL         420000
We can also generate the above tables using the GROUPING SETS extension. With a GROUPING SETS expression, we have to specify explicitly the levels of aggregation we wish to perform. The query that reproduces the CUBE result is:
SELECT channel_desc, country_id, SUM(amount_sold)
FROM sales, customers, times, channels
WHERE sales.time_id = times.time_id
  AND sales.cust_id = customers.cust_id
  AND sales.channel_id = channels.channel_id
  AND channels.channel_desc IN ('Direct Sales', 'Internet')
  AND times.calendar_month_desc = '2004-09'
  AND country_id IN ('UK', 'US')
GROUP BY GROUPING SETS((channel_desc, country_id), (channel_desc), (country_id), ());
Both CUBE and ROLLUP can be thought of as GROUPING SETS with very specific semantics.
CUBE(a, b, c) is equivalent to GROUPING SETS ((a, b, c), (a, b), (a, c), (b, c), (a), (b), (c), ())
ROLLUP(a, b, c) is equivalent to GROUPING SETS ((a, b, c), (a, b), (a), ())
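The two equivalences follow a simple pattern: ROLLUP produces every prefix of its column list, while CUBE produces every subset. A short Python sketch makes the pattern explicit; the function names rollup_sets and cube_sets are chosen for this illustration and are not part of any Oracle interface.

```python
from itertools import combinations


def rollup_sets(columns):
    """ROLLUP(c1, ..., cn): every prefix of the column list, longest first, ending with ()."""
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]


def cube_sets(columns):
    """CUBE(c1, ..., cn): every subset of the column list, largest first."""
    return [subset
            for size in range(len(columns), -1, -1)
            for subset in combinations(columns, size)]


print(rollup_sets(["a", "b", "c"]))
print(cube_sets(["a", "b", "c"]))
```

For n grouping columns, ROLLUP yields n + 1 grouping sets and CUBE yields 2^n, which is why a full cube becomes expensive as columns are added and why an explicit GROUPING SETS list can be cheaper.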
3.2 Writing MDX Queries for a Data Warehouse
MDX stands for Multidimensional Expressions. It is a syntax that supports the definition and manipulation of multidimensional objects and data. MDX is similar in many ways to the Structured Query Language (SQL) syntax, but it is not an extension of the SQL language; in fact, some of the functionality that is supplied by MDX can be supplied, although not as efficiently or intuitively, by SQL. Like an SQL query, an MDX query is built from a SELECT clause, a FROM clause, and an optional WHERE clause. These and other keywords provide the tools used to extract specific portions of data from a cube (a multidimensional structure) for analysis. MDX also supplies a robust set of functions for the manipulation of retrieved data, as well as the ability to extend MDX with user-defined functions.
Figure 3.2: Multidimensional Structure : Cube [?]
3.2.1 Common Terms
Cube
A cube is a multidimensional structure that contains dimensions and measures. Dimensions define the structure of the cube, while measures provide the numerical values of interest to the end user. Cell positions in the cube are defined by the intersection of dimension members, and the measure values are aggregated to provide the values in the cells [?].
Member
A member is the lowest level of reference when describing cell data in a cube. A member is an item in a dimension representing one or more occurrences of data. Members are combined to form tuples, and tuples are combined to form sets. These sets are used in the SELECT clause of an MDX query to retrieve data from the cube [?].
Tuples
A tuple is used to define a slice of data from a cube; it is composed of an ordered collection of one member from each of one or more dimensions. A tuple identifies a specific section of the multidimensional data in a cube; a tuple composed of one member from each dimension in the cube completely describes a cell value [?].
Sets
A set is an ordered collection of zero, one, or more tuples. In MDX syntax, tuples are enclosed in braces to construct a set. A set is most commonly used to define axis and slicer dimensions in an MDX query, and as such it may have only a single tuple or may be, in certain cases, empty [?].
Axis and Slicer Dimensions
The SELECT clause is used to select the dimensions and members to be returned, referred to as axis dimensions. The WHERE clause is used to restrict the returned data to specific dimension and member criteria, referred to as a slicer dimension. An axis dimension is expected to return data for multiple members, while a slicer dimension is expected to return data for a single member [?].
3.2.2 Rules
Rules for specifying Members
1. By specifying the actual name or the alias, for example [Packages]. If the member name starts with a number or contains spaces, it must be enclosed in square brackets.
2. By specifying the dimension name or any one of the ancestor member names as a prefix to the member name, for example [Measures].[Packages]. (The Measures dimension is associated with all the facts.)
3. By specifying the name of a calculated member defined in the WITH section [?].
Rules for specifying Tuples
1. A tuple consists of one or more members.
2. If a tuple is composed of members from more than one dimension, the members represented by the tuple must be enclosed in parentheses, for example (Time.[2nd half], Route.nonground.air).
3. If a tuple consists of only one member, the parentheses can be omitted [?].
Rules for specifying Sets
1. A set consists of one or more tuples enclosed in braces, except in some cases where the set is represented by an MDX function that returns a set [?]. For example: { (Time.[1st half], Route.nonground.air), (Time.[2nd half], Route.nonground.sea) } [?].
2. A set can contain more than one occurrence of the same tuple, for example { Time.[2nd half], Time.[2nd half] }.
3. When a set has more than one tuple, the members in each tuple must represent the same dimensions as the members of the other tuples of the set, and the dimensions must appear in the same order. In other words, every tuple of the set must have the same dimensionality [?]. For example: { (Time.[1st half], Route.nonground.air), (Time.[2nd half], Route.nonground.sea) } [?].
4. A set can also be a collection of sets, and it can also be empty (containing no tuples) [?].
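The dimensionality rule for sets can be checked mechanically. The Python sketch below is one way to express it, assuming a tuple is modelled as a sequence of (dimension, member) pairs; this representation and the function names are illustrative and do not correspond to any MDX engine's API.

```python
def dimensionality(tup):
    """Return the ordered dimension names of a tuple given as (dimension, member) pairs."""
    return tuple(dimension for dimension, _member in tup)


def is_valid_set(tuples):
    """All tuples of a set must share one dimensionality; an empty set is allowed."""
    return len({dimensionality(t) for t in tuples}) <= 1


same_order = [
    (("Time", "1st half"), ("Route", "nonground.air")),
    (("Time", "2nd half"), ("Route", "nonground.sea")),
]
swapped_order = [
    (("Time", "1st half"), ("Route", "nonground.air")),
    (("Route", "nonground.sea"), ("Time", "2nd half")),
]
repeated_tuple = [(("Time", "2nd half"),), (("Time", "2nd half"),)]

print(is_valid_set(same_order))      # True
print(is_valid_set(swapped_order))   # False
print(is_valid_set(repeated_tuple))  # True
```

The second set is rejected even though both tuples mention the same two dimensions, because the order of the dimensions differs. The third set shows that repeating a tuple is permitted.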
3.2.3 MDX Query Structure
A basic Multidimensional Expressions (MDX) query is structured in a fashion similar to the following example [?]:
SELECT <axis_specification> [, <axis_specification> ...]
FROM <cube_specification>
WHERE <slicer_specification>
In MDX, the SELECT statement is used to specify a dataset containing a subset of multidimensional data. To specify a dataset, an MDX query must contain information about:
1. The number of axes. You can specify up to 128 axes in an MDX query.
2. The members from each dimension to include on each axis of the MDX query.
3. The name of the cube that sets the context of the MDX query.
4. The members from a slicer dimension on which data is sliced for members from the axis dimensions [?].
3.2.4 Specifying Axis Dimensions
Axis dimensions determine the layout of query results from a database. MDX uses the SELECT clause to specify axis dimensions by assigning a set to a particular axis. Each <axis_specification> value in the SELECT clause defines one axis dimension, so the number of axes in the dataset is equal to the number of <axis_specification> values in the query. An MDX query can support up to 128 specified axes, but very few MDX queries will use more than 5 axes [?]. The breakdown of the <axis_specification> syntax is:
<axis_specification> ::= <set> ON <axis_name>
<axis_name> ::= COLUMNS | ROWS | PAGES | SECTIONS | CHAPTERS | AXIS(<index>)
Each axis dimension is associated with a number: 0 for the x-axis, 1 for the y-axis, 2 for the z-axis, and so on. The <index> value is the axis number. For the first 5 axes, the aliases COLUMNS, ROWS, PAGES, SECTIONS, and CHAPTERS can be used in place of AXIS(0), AXIS(1), AXIS(2), AXIS(3), and AXIS(4), respectively [?]. An MDX query cannot skip axes. That is, a query that includes one or more <axis_name> values must not exclude lower-numbered or intermediate axes. For example, a query cannot have a ROWS axis without a COLUMNS axis, or have COLUMNS and PAGES axes without a ROWS axis [?].
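The alias table and the rule against skipping axes can be expressed as a small check. The following Python sketch is an illustration written for this report, not part of any MDX tool; the helper names axis_index and axes_are_contiguous are assumptions.

```python
import re

# Aliases for the first five axes, as listed above.
AXIS_ALIASES = {"COLUMNS": 0, "ROWS": 1, "PAGES": 2, "SECTIONS": 3, "CHAPTERS": 4}


def axis_index(axis_name):
    """Translate an axis name (alias or AXIS(n)) into its axis number."""
    name = axis_name.strip().upper()
    if name in AXIS_ALIASES:
        return AXIS_ALIASES[name]
    match = re.fullmatch(r"AXIS\((\d+)\)", name)
    if match is None:
        raise ValueError(f"not an axis name: {axis_name!r}")
    return int(match.group(1))


def axes_are_contiguous(axis_names):
    """True if the axes used are exactly 0, 1, ..., k-1 with none skipped or repeated."""
    indexes = sorted(axis_index(name) for name in axis_names)
    return indexes == list(range(len(indexes)))


print(axes_are_contiguous(["COLUMNS", "ROWS"]))   # True
print(axes_are_contiguous(["ROWS"]))              # False: COLUMNS is missing
print(axes_are_contiguous(["COLUMNS", "PAGES"]))  # False: ROWS is skipped
```

Because aliases and AXIS(n) resolve to the same numbers, a query may mix the two forms as long as the resulting axis numbers start at 0 and have no gaps.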
3.2.5 Establishing Cube Context
To establish cube context, indicate the cube on which the MDX query is to run. The FROM clause in an MDX query determines the cube context. The following syntax indicates which cube supplies the context for the MDX query [?]:
FROM <cube_specification>
The cube specification is the name of a single cube. For example, if an MDX query is to be run against the SalesCube cube, the FROM clause would be:
FROM SalesCube
3.2.6 Specifying Slicer Dimensions
Slicer dimensions are used optionally in the WHERE clause of the query to limit the query to a specific area of the database. Dimensions that are not explicitly assigned to an axis are assumed to be slicer dimensions and filter with their default members. The default member is usually the All member if an (All) level exists, or else an arbitrary member of the highest level. The breakdown of the WHERE clause syntax is [?]:
WHERE [<slicer_specification>]
A slicer dimension can accept only expressions that evaluate to a single tuple. This does not mean that only a single tuple can be explicitly stated in the slicer dimension; an expression is acceptable as long as it resolves to one tuple. For example:
WHERE ( [Time].[1st half], [Route].[nonground] )
If the slicer specification cannot be resolved into a single tuple, an error will occur [?].
3.2.7 Example Queries
For example, for the cube shown in Figure 3.2, suppose we want to calculate the total Unit Sales and total Store Sales for all stores in California (USA, CA) in the years 1997 and 1998 for a sales schema. We would give the following query:
SELECT { [Measures].[Unit Sales], [Measures].[Store Sales] } ON COLUMNS,
       { [Time].[1997], [Time].[1998] } ON ROWS
FROM Sales
WHERE ( [Store].[USA].[CA] )
This query returns the result shown in the following table:
Year   Unit Sales   Store Sales
1997   75000        100000
1998   140000       200000
We can also rewrite the above query as:
SELECT { [Measures].[Unit Sales], [Measures].[Store Sales] } ON AXIS(0),
       { [Time].[1997], [Time].[1998] } ON AXIS(1)
FROM Sales
WHERE ( [Store].[USA].[CA] )
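How the two axes and the slicer combine to address cells can be mimicked with a dictionary lookup. The Python sketch below is a toy model under stated assumptions: the cube is a plain dictionary keyed by (measure, year, store), the four California values are those of the result table above, and the extra Washington cell is a made-up value included only to show that the slicer excludes it. It is not how an OLAP server stores or evaluates a cube.

```python
# Toy cell store: (measure, year, store path) -> value.
CELLS = {
    ("Unit Sales", "1997", ("USA", "CA")): 75000,
    ("Store Sales", "1997", ("USA", "CA")): 100000,
    ("Unit Sales", "1998", ("USA", "CA")): 140000,
    ("Store Sales", "1998", ("USA", "CA")): 200000,
    # Made-up cell outside the slice; the slicer keeps it out of the result.
    ("Unit Sales", "1997", ("USA", "WA")): 1,
}


def run_query(columns, rows, slicer):
    """Build a grid: one row per ROWS member, one cell per COLUMNS member, all within the slicer."""
    return [[CELLS.get((measure, year, slicer)) for measure in columns]
            for year in rows]


grid = run_query(
    columns=["Unit Sales", "Store Sales"],  # ON COLUMNS / AXIS(0)
    rows=["1997", "1998"],                  # ON ROWS / AXIS(1)
    slicer=("USA", "CA"),                   # WHERE ( [Store].[USA].[CA] )
)
for year, cells in zip(["1997", "1998"], grid):
    print(year, cells)
```

Every cell of the grid is identified by a full tuple made of one column member, one row member, and the slicer member, which is the sense in which a tuple with one member from each dimension describes a cell.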
3.2.8 Differences between MDX and SQL
The main differences between MDX and SQL are:
1. The principal difference between SQL and MDX is the ability of MDX to reference multiple dimensions. SQL refers to only two dimensions, columns and rows, when processing queries. Because SQL was designed to handle only two-dimensional tabular data, the terms column and row have meaning in SQL syntax. MDX, in comparison, can process one, two, three, or more dimensions in queries. Because multiple dimensions can be used in MDX, each dimension is referred to as an axis [?].
2. In SQL, the SELECT clause is used to define the column layout for a query, while the WHERE clause is used to define the row layout. In MDX, however, the SELECT clause can be used to define several axis dimensions, while the WHERE clause is used to restrict multidimensional data to a specific dimension or member [?].
3. In SQL, the WHERE clause is used to filter the data returned by a query. In MDX, the WHERE clause is used to provide a slice of the data returned by a query. While the two concepts are similar, they are not equivalent. An SQL query uses the WHERE clause to hold an arbitrary list of conditions on items that should (or should not) be returned in the result set. While a long list of conditions in the filter can narrow the scope of the data that is retrieved, there is no requirement that the elements in the clause produce a clear and concise subset of data. In MDX, however, the concept of a slice means that each member in the WHERE clause identifies a distinct portion of data from a different dimension. Because of the organizational structure of multidimensional data, it is not possible to request a slice for multiple members of the same dimension. For this reason, the WHERE clause in MDX can provide a clear and concise subset of data [?].
Chapter 4 Conclusion
The design of a data warehouse greatly influences the quality of the analysis that is possible with the data in it. If invalid or corrupt data is allowed to get into the data warehouse, the analysis done with this data is likely to be invalid. Special attention should therefore be given, while designing the data warehouse, to the issues discussed here, such as slowly changing dimensions, rapidly changing dimensions, and multivalued dimensions. Dimensional modeling should be used for designing a data warehouse instead of ER modeling, because the main focus in a data warehouse is not on removing redundancy from dimensions but on queries that are simple to understand and easy to write. After the rapid acceptance of data warehousing systems during the past three years, there will continue to be many more enhancements and adjustments to the data warehousing system model. Further evolution of hardware and software technology will also continue to greatly influence the capabilities that are built into data warehouses.
Bibliography
[1] Basic MDX. World Wide Web, http://www.msdn.microsoft.com/library.
[2] Essbase Analytic Services Database Administrator's Guide. World Wide Web, http://dev.hyperion.com/techdocs/essbase/essbase 71/Docs/dbag/frameset.htm.
[3] Ralph Kimball. Dimensional Modelling Manifesto. World Wide Web, http://www.dbmsmag.com.
[4] Ralph Kimball and Margy Ross. The Data Warehouse Toolkit. Second edition, 2004.
[5] Paul Lane. Oracle 9i Data Warehousing Guide. Release 1 (9.0.1) edition, 2001.
[6] Michael J. Corey, Michael Abbey, Ian Abramson, and Ben Taub. Oracle 8 Data Warehousing. Oracle Press edition, 1998.
[7] Korth, Silberschatz, and Sudarshan. Database System Concepts. Fourth edition, 2002.