
Q1. What are data marts?

Ans: Data marts are smaller data warehouses. A data mart is a subset of data warehouse fact and summary data that provides users with information specific to their requirements. Data marts can be considered of three types: 1. Dependent 2. Independent 3. Hybrid

The categorization is based primarily on the data source that feeds the data mart.
1. Dependent Data Mart:
a. The source is the data warehouse. Dependent data marts rely on the data warehouse for content.
b. The extraction, transformation, and transportation (ETT) process is easy. Dependent data marts draw data from a central data warehouse that has already been created, so the main effort in building a mart, the data cleansing and extraction, has already been performed. The dependent data mart simply requires data to be moved from one database to another.
c. The data mart is part of the enterprise plan. Dependent data marts are usually built to achieve improved performance and availability, better control, and lower telecommunication cost resulting from local access to data relevant to a specific department.
2. Independent Data Mart: Independent data marts are stand-alone systems built from scratch that draw data directly from operational and/or external sources of data.
a. Sources are operational systems and external sources.
b. The ETT process is difficult. Because independent data marts draw data from unclean or inconsistent data sources, effort is directed towards error processing and integration of data.
c. The data marts are built to satisfy analytical needs. The creation of independent data marts is often driven by the need for a quick solution to analytical demands.
3. Hybrid Data Marts: Hybrid data marts are a combination of dependent and independent data marts. They contain data from the data warehouse as well as data from external sources.
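As a rough sketch of the dependent case, a data mart can often be loaded by simply selecting the relevant slice out of warehouse tables that have already been cleansed. The schema and table names below (dw.sales_fact, dw.store_dim, mart_sales) are hypothetical, so treat this as a minimal illustration rather than a prescribed method:

-- Hypothetical dependent data mart load: move already-cleansed
-- warehouse data into a department-level mart table.
CREATE TABLE mart_sales AS
SELECT f.date_key, f.product_key, f.store_key, f.sales_amount
FROM   dw.sales_fact f
JOIN   dw.store_dim  s ON s.store_key = f.store_key
WHERE  s.region = 'WEST';   -- only the slice this department needs

-- Periodic refresh: append the latest day from the warehouse.
INSERT INTO mart_sales
SELECT date_key, product_key, store_key, sales_amount
FROM   dw.sales_fact
WHERE  date_key = (SELECT MAX(date_key) FROM dw.sales_fact);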

Q2. What are Slowly Changing Dimensions?
Ans: A type of CDC process, which applies to cases where an attribute of a record changes over time. Here the data is not only growing, but also changing.
a. SCD Type 1: If any update happens at the source, it should be updated at the target. If any insert happens at the source, it should be inserted at the target (Update + Insert + RWO). If beyond the time frame, no update; if within the time frame, Update + Insert.
b. SCD Type 2: In Type 2, we are creating history. Any update at the source we treat as an insert at the target. The previous data will still be present and the updated data will also be inserted. In this situation, primary keys will be repeated, so to maintain uniqueness we need a surrogate key. The history can be kept by versioning, flagging, or time stamping. With versioning, the first record is version 0; when it is updated, the newly inserted record becomes version 1, the next update version 2, and so on. With flagging, only two values, 0 and 1, are present: the most recent record carries 1 and the previous record 0, and the flags keep changing as the record is updated. With time stamping, only a start value and an end value are present for each record.
c. SCD Type 3: Vertical versioning (SCD Type 2 is horizontal versioning). SCD Type 3 is column-level maintenance. Suppose 1 lakh products have their price increased by Rs 50. If we do SCD Type 1 or SCD Type 2, row-wise maintenance, then 1 lakh rows will become 2 lakh rows, which creates a lot of problems. So it is better to add columns like old_value and new_value; this saves a lot of space. But it is not always advisable, as it needs a change in the database table structure.
SCD Type 1: The new record replaces the original record. Only one record exists in the database: current data.
SCD Type 2: A new record is added into the dimension table. Two records exist in the database: current data and previous history data.
SCD Type 3: The original data is modified to include new data. One record exists in the database: new information is attached to old information in the same row.
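A minimal SQL sketch of SCD Type 2 with flagging, assuming a hypothetical customer_dim table that carries a surrogate key fed by a sequence plus current_flag, start_date, and end_date columns (all names and values are illustrative, not a fixed standard):

-- Expire the current version of the changed record (flag + end date).
UPDATE customer_dim
SET    current_flag = 0,
       end_date     = SYSDATE
WHERE  customer_no  = 1001      -- natural key coming from the source
AND    current_flag = 1;

-- Insert the new version with a fresh surrogate key from a sequence.
INSERT INTO customer_dim
       (customer_key, customer_no, customer_name, state,
        current_flag, start_date, end_date)
VALUES (customer_key_seq.NEXTVAL, 1001, 'Sample Name', 'Sample State',
        1, SYSDATE, NULL);

The surrogate key (customer_key) keeps every historical version unique even though the natural key (customer_no) repeats, which is why Type 2 needs one.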

The "Slowly Changing Dimension" problem is a common one

particular to data warehousing. In a nutshell, this applies to cases where the attribute for a record varies over time. We give an example below: Christina is a customer with ABC Inc. She first lived in Chicago, Illinois. So, the original entry in the customer lookup table has the following record: Customer State Name Key 1001 Christina Illinois At a later date, she moved to Los Angeles, California on January, 2003. How should ABC Inc. now modify its customer table to reflect this change? This is the "Slowly Changing Dimension" problem. There are in general three ways to solve this type of problem, and they are categorized as follows: Type 1: The new record replaces the original record. No trace of the old record exists. Type 2: A new record is added into the customer dimension table. Therefore, the customer is treated essentially as two people. Type 3: The original record is modified to reflect the change. We next take a look at each of the scenarios and how the data model and the data looks like for each of them. Finally, we compare and contrast among the three alternatives. In Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other words, no history is kept. In our example, recall we originally have the following table: Customer Key Name State 1001 Christina Illinois After Christina moved from Illinois to California, the new information replaces the new record, and we have the following table: Customer Key 1001 Name Christina State California

Advantages: - This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.
Disadvantages: - All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the company would not be able to know that Christina lived in Illinois before.
Usage: About 50% of the time.
When to use Type 1: Type 1 Slowly Changing Dimension should be used when it is not necessary for the data warehouse to keep track of historical changes.
In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information. Therefore, both the original and the new record will be present. The new record gets its own primary key. In our example, recall we originally have the following table:

Customer Key | Name      | State
1001         | Christina | Illinois

After Christina moved from Illinois to California, we add the new information as a new row into the table:

Customer Key | Name      | State
1001         | Christina | Illinois
1005         | Christina | California

Advantages: - This allows us to accurately keep all historical information.

Disadvantages: - This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern. - This necessarily complicates the ETL process. Usage:

About 50% of the time.
When to use Type 2: Type 2 Slowly Changing Dimension should be used when it is necessary for the data warehouse to track historical changes.
In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute of interest, one indicating the original value, and one indicating the current value. There will also be a column that indicates when the current value becomes active. In our example, recall we originally have the following table:

Customer Key | Name      | State
1001         | Christina | Illinois

To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns: Customer Key, Name, Original State, Current State, Effective Date. After Christina moved from Illinois to California, the original information gets updated, and we have the following table (assuming the effective date of change is January 15, 2003):

Customer Key | Name      | Original State | Current State | Effective Date
1001         | Christina | Illinois       | California    | 15-JAN-2003

Advantages: - This does not increase the size of the table, since new information is updated. - This allows us to keep some part of history. Disadvantages: - Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Christina later moves to Texas on December 15, 2003, the California information will be lost.

Usage: Type 3 is rarely used in actual practice.
When to use Type 3: Type 3 Slowly Changing Dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur a finite number of times.

Q3. What is Snowflake Schema?
Ans: Any star schema with a flake in it. Flake means extension. In a snowflake schema, each dimension has a primary dimension table to which one or more additional dimensions can join. The primary dimension table is the only table that can join to the fact table. In a snowflake schema, dimensions may be interlinked or may have one-to-many relationships with other tables. The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and joining smaller lookup tables. The main disadvantage of the snowflake schema is the additional maintenance effort needed due to the increased number of lookup tables. The snowflake schema is an extension of the star schema, where each point of the star explodes into more points.

Q4. What is a Star schema?
Ans: A join between a central fact table and dimension tables is known as a star schema design. In a star schema design, all dimensions are linked directly to the fact table. A simple star schema consists of one fact table; a complex star schema has more than one fact table. In the star schema design, a single object (the fact table) sits in the middle and is radially connected to other surrounding objects (dimension lookup tables) like a star.
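A minimal DDL sketch to make the two shapes concrete; the table and column names (sales_fact, product_dim, category_dim, and so on) are hypothetical:

-- Star schema: every dimension joins directly to the fact table.
CREATE TABLE product_dim (
  product_key   NUMBER PRIMARY KEY,
  product_name  VARCHAR2(100),
  category_name VARCHAR2(50)       -- category kept denormalized in the dimension
);

CREATE TABLE store_dim (
  store_key  NUMBER PRIMARY KEY,
  store_name VARCHAR2(100),
  region     VARCHAR2(50)
);

CREATE TABLE sales_fact (
  date_key     NUMBER,             -- would reference a date dimension, omitted here
  product_key  NUMBER REFERENCES product_dim (product_key),
  store_key    NUMBER REFERENCES store_dim (store_key),
  sales_amount NUMBER(12,2)
);

-- Snowflake variant: the category attribute is normalized into its own
-- table, and only product_dim still joins to the fact table.
CREATE TABLE category_dim (
  category_key  NUMBER PRIMARY KEY,
  category_name VARCHAR2(50)
);

CREATE TABLE product_dim_sf (
  product_key  NUMBER PRIMARY KEY,
  product_name VARCHAR2(100),
  category_key NUMBER REFERENCES category_dim (category_key)
);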

Q5. What is the difference between OLTP and OLAP?

Ans: Difference between data warehouse and OLTP:
1. Workload: Data warehouses are designed to accommodate ad hoc queries. You might not know the workload of your data warehouse in advance, so a data warehouse should be optimized to perform well for a wide variety of possible query operations. OLTP systems support only predefined operations. Your applications might be specifically tuned or designed to support only these operations.
2. Data modifications: A data warehouse is updated on a regular basis by the ETL process (run nightly or weekly) using bulk data modification techniques. The end user of a data warehouse does not directly update the data warehouse. In OLTP systems, end users routinely issue individual data modification statements to the database. The OLTP database is always up to date and reflects the current state of each business transaction.
3. Schema design: Data warehouses often use denormalized or partially denormalized schemas (such as a star schema) to optimize query performance. OLTP systems often use fully normalized schemas to optimize update/insert/delete performance and to guarantee data consistency.
4. Typical operations: A typical data warehouse query scans thousands or millions of records; for example, "Find the total sales for all customers last month." A typical OLTP operation accesses only a handful of records; for example, "Retrieve the current order for this customer."
5. Historical data: A data warehouse usually stores many months or years of data, to support historical analysis. OLTP systems usually store data from only a few weeks or months; an OLTP system stores only as much historical data as is needed to successfully meet the requirements of the current transaction.

Q6. What is a dimension table?

Ans: Dimension tables, also known as lookup or reference tables, contain the relatively static data in the warehouse. Dimension tables store the information you normally use to constrain queries. Dimension tables are usually textual and descriptive and you can use them as the headers of the result set. Examples are customers or products.

Q7. What is a fact table?
Ans: Fact tables are the large tables in your warehouse schemas that store business measurements. Fact tables typically contain facts and foreign keys to the dimension tables. Fact tables represent data, usually numeric and additive, that can be analyzed and examined. Examples include sales, cost, and profit. A fact table typically has two types of columns: those that contain numeric facts (often called measurements), and those that are foreign keys to dimension tables. A fact table contains either detail-level facts or facts that have been aggregated. Fact tables that contain aggregated facts are called summary tables. A fact table usually contains facts with the same level of aggregation. Facts can be additive, semi-additive, or non-additive. The primary key of the fact table is usually a composite key that is made up of all of its foreign keys. There are three types of facts:
1. Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table.
2. Semi-additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
3. Non-additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.
Let us use examples to illustrate each of the three types of facts. The first example assumes that we are a retailer, and we have a fact table with the following columns: Date, Store, Product, Sales_Amount.

The purpose of this table is to record the sales amount for each product in each store on a daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because you can sum up this fact along any of the three dimensions present in the fact table -- date, store, and product. For example, the sum of Sales_Amount for all 7 days in a week represents the total sales amount for that week. Say we are a bank with the following fact table: Date, Account, Current_Balance, Profit_Margin. The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day. Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive fact, as it makes sense to add it up across all accounts (what's the total current balance for all accounts in the bank?), but it does not make sense to add it up through time (adding up all current balances for a given account for each day of the month does not give us any useful information). Profit_Margin is a non-additive fact, for it does not make sense to add it up at either the account level or the day level.

Q8. What is ETL?
Ans: ETL stands for extract, transform, and load, the processes that enable companies to move data from multiple sources, reformat and cleanse it, and load it into another database, a data mart or a data warehouse for analysis, or onto another operational system to support a business process.
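To make the additive versus semi-additive distinction from Q7 concrete, here is a hedged SQL sketch against hypothetical sales_fact and account_balance_fact tables with the columns listed above:

-- Additive: Sales_Amount can be summed across any dimension.
SELECT store, SUM(sales_amount) AS total_sales
FROM   sales_fact
GROUP  BY store;

-- Semi-additive: Current_Balance sums sensibly across accounts for one
-- day, but summing it across days for one account is meaningless.
SELECT SUM(current_balance) AS bank_total_balance
FROM   account_balance_fact
WHERE  balance_date = DATE '2003-01-15';   -- fixed point in time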

Q9. What are conformed dimensions?
Ans: Conformed dimensions are dimensions that are common to more than one data mart or business process across the organization. Common examples may be time, organization structure, and product. Conformed dimensions should be modeled at the lowest level of granularity used; this way they can service any fact table that needs to use them, either at their natural granularity or at a higher level. They provide a consistent view of the business, including attributes and keys, to all the data marts in the organization. Conformed dimensions can either be implemented as a single physical table or may be a replicated table used by each data mart.

Q10. What is ODS?
Ans: An ODS stores tactical data from production systems that is subject-oriented and integrated to address operational needs. The detailed current information in the ODS is transactional in nature, updated frequently (at least daily), and is only held for a short period of time. This is the database used to capture daily business activities and thus is a normalized database. The ODS captures day-to-day transactions and you can generate reports on the ODS. An Operational Data Store is the operational system whose function is to capture the transactions of the business. The source system should be thought of as being outside the data warehouse, since the data warehouse system has no control over the content and format of the data. The data in these systems can be in many formats, from flat files to hierarchical and relational databases. The objectives of an Operational Data Store are to:
1. Integrate information from the production systems.
2. Relieve the production systems of reporting and analytical demands.
3. Provide access to current data.

Q11. What is degenerated dimension table?
Ans: A degenerated dimension is data that is dimensional in nature, but stored in a fact table. For example, if you have a dimension that only has Order Number and Order Line Number, you would have a 1:1 relationship with the fact table. Do you want to have two tables with a billion rows, or one table with a billion rows? Therefore, this would be a degenerate dimension, and Order Number and Order Line Number would be stored in the fact table. When a fact table has such a dimensional value stored in it, it is called a degenerated dimension.

When the cardinality of a column's values is high, instead of maintaining a separate dimension and having the overhead of making a join with the fact table, a degenerated dimension can be built. For example, in a sales fact table, the invoice number is a degenerated dimension. Since the invoice number is not tied to an order header table, there is no need for the invoice number to join to a dimension table; hence it is referred to as a degenerated dimension. A degenerate dimension is a dimension which has only a single attribute. This dimension is typically represented as a single field in a fact table. Degenerate dimensions are the fastest way to group similar transactions. Degenerate dimensions are used when fact tables represent transactional data. They can be used as a primary key for the fact table but they cannot act as foreign keys.

Q12. What is dimensional modeling?
Ans: Warehouse designing is known as dimensional modeling. The aim is to optimize query performance. Dimensional modeling is the name of a logical design technique often used for data warehouses. ER is a logical design technique that seeks to remove the redundancy in data. DM is a logical design technique that seeks to present the data in a standard, intuitive framework that allows for high-performance access. It is inherently dimensional and it adheres to a discipline that uses the relational model with some important restrictions. Every dimensional model is composed of one table with a multipart key, called the fact table, and a set of smaller tables called dimension tables. Each dimension table has a single-part primary key that corresponds exactly to one of the components of the multipart key in the fact table. This characteristic star-like structure is often called a star join. A fact table, because it has a multipart primary key made up of two or more foreign keys, always expresses a many-to-many relationship. The most useful fact tables also contain one or more numerical measures, or facts, that occur for the combination of keys that define each record. The most useful facts in a fact table are numeric and additive. Additivity is crucial because data warehouse applications almost never retrieve a single fact table record; rather, they fetch back hundreds, thousands, or even millions of these records at a time, and the only useful thing to do with so many records is to add them up.

Q13. What is a lookup table?

Ans: Lookup Table: The lookup table provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse. Each row (each quarter) may have several fields, one for the unique ID that identifies the quarter, and one or more additional fields that specify how that particular quarter is represented on a report (for example, the first quarter of 2001 may be represented as "Q1 2001" or "2001 Q1"). A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables. Dimension tables are also known as lookup tables or reference tables.
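A tiny sketch of the quarter lookup table described above; quarter_dim and its columns are illustrative names only:

CREATE TABLE quarter_dim (
  quarter_key   NUMBER PRIMARY KEY,   -- unique ID referenced by the fact table
  quarter_label VARCHAR2(10),         -- how the quarter is shown on reports
  year_number   NUMBER(4)
);

INSERT INTO quarter_dim VALUES (1, 'Q1 2001', 2001);
INSERT INTO quarter_dim VALUES (2, 'Q2 2001', 2001);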

Q14. What is a conformed fact?
Ans: Two facts are conformed if they have the same name, units, and definition. If two facts do not represent the same thing to the business, then they must be given different names.

Q15. What is a junk dimension? What is the difference between a junk dimension and a degenerated dimension?
Ans: A junk dimension is a collection of random transactional codes, flags, and/or text attributes that are unrelated to any particular dimension. A number of very small dimensions might be lumped together to form a single dimension, a junk dimension; the attributes are not closely related. Grouping random flags and text attributes in a dimension and moving them to a separate sub-dimension is known as a junk dimension. A junk dimension is a dimension that is created and stored in a separate location for future use. While developing a dimensional model, there are lots of miscellaneous flags and indicators that don't logically belong to the core dimension tables. You can neither put them into the fact table, as they would unnecessarily increase its size, nor create a dimension table for them, as that would dramatically increase the size of the dimension tables. The third option is to create a junk dimension and put all these flags and indicators into it, where they can be referred to for future use; otherwise they would have to be discarded, as they have little significance in the dimensional model.

Q16. What are the various ETL tools in the market?
Ans:
1. Informatica PowerCenter
2. Ascential DataStage
3. Hyperion Essbase
4. Ab Initio
5. BO Data Integrator
6. SAS ETL
7. MS DTS
8. Oracle OWB
9. Pervasive Data Junction
10. Cognos DecisionStream
11. Hummingbird
12. Sunopsis

Q17. What is the main functional difference between ROLAP, MOLAP, and HOLAP?
Ans: In the OLAP world, there are mainly two different types: Multidimensional OLAP (MOLAP) and Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and ROLAP.
MOLAP: This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.
Advantages:
Excellent performance: MOLAP cubes are built for fast data retrieval, and are optimal for slicing and dicing operations.
Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.
Disadvantages:
Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself.

This is not to say that the data in the cube cannot be derived from a large amount of data. Indeed, this is possible. But in this case, only summary-level information will be included in the cube itself.
Requires additional investment: Cube technology is often proprietary and does not already exist in the organization. Therefore, to adopt MOLAP technology, chances are additional investments in human and capital resources are needed.

ROLAP: This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.
Advantages:
Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.
Can leverage functionalities inherent in the relational database: Often, the relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.
Disadvantages:
Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.
Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are therefore traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions as well as the ability to allow users to define their own functions.
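As a rough illustration of the point above that each ROLAP slice or dice amounts to an extra "WHERE" clause, here is a hedged sketch against a hypothetical sales_fact star schema:

-- Dice: sales broken down by product and region.
SELECT p.product_name, s.region, SUM(f.sales_amount) AS total_sales
FROM   sales_fact f
JOIN   product_dim p ON p.product_key = f.product_key
JOIN   store_dim   s ON s.store_key   = f.store_key
GROUP  BY p.product_name, s.region;

-- Slice: restricting to one region is just an added WHERE predicate.
SELECT p.product_name, SUM(f.sales_amount) AS total_sales
FROM   sales_fact f
JOIN   product_dim p ON p.product_key = f.product_key
JOIN   store_dim   s ON s.store_key   = f.store_key
WHERE  s.region = 'WEST'
GROUP  BY p.product_name;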

HOLAP: HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill through" from the cube into the underlying relational data. The functional difference between these is how the information is stored. In all cases, the users see the data as a cube of dimensions and facts.
ROLAP: Detailed data is stored in a relational database in 3NF, star, or snowflake form. Queries must summarize data on the fly.
MOLAP: Data is stored in multidimensional form, with dimensions and facts stored together. You can think of this as a persistent cube. The level of detail is determined by the intersection of the dimension hierarchies.
HOLAP: Data is stored using a combination of relational and multidimensional storage. Summary data might persist as a cube, while detail data is stored relationally, but transitioning between the two is invisible to the end user.

Q18. What is the level of granularity of a fact table?
Ans: Level of granularity means the level of detail that you put into the fact table in a data warehouse. For example, based on the design you can decide to store the sales data for each transaction. Level of granularity then means how much detail you are willing to keep for each transactional fact: do you store product sales at the level of each individual transaction, or do you aggregate them up to the minute (or higher) and store that data?

Q19. Which columns go to the fact table and which columns go to the dimension table?
Ans: Foreign key elements along with business measures go to the fact table, and detail information goes to the dimension table.

Q20. Which type of indexing mechanism do we need to use for a typical data warehouse?
Ans: Bitmap index.

Q21. What is data mining?
Ans: Data mining is the process of extracting hidden trends within a data warehouse.

For example, an insurance data warehouse can be used to mine data for the most high-risk people to insure in a certain geographical area. Data mining is the process of analyzing data from different perspectives and summarizing it into useful information.

Q22. What are the modeling tools available in the market?
Ans:
1. Oracle Designer
2. Erwin (Entity Relationship for Windows)
3. Informatica (Cubes/Dimensions)
4. Embarcadero
5. Power Designer (Sybase)

Q23. What is a general purpose scheduling tool?
Ans: A scheduling tool is a tool which is used to schedule the data warehouse jobs. All the jobs that do some processing are scheduled using this tool, which eliminates manual intervention. The basic purpose of a scheduling tool in a DW application is to streamline the flow of data from source to target at a specific time or based on some condition.

Q24. What are the various reporting tools in the market?
Ans:
1. MS Excel
2. Business Objects (Crystal Reports)
3. Cognos (Impromptu, PowerPlay, ReportNet)
4. MicroStrategy
5. MS Reporting Services
6. Informatica PowerAnalyzer
7. Actuate
8. Hyperion (Brio)
9. Oracle Express OLAP
10. Proclarity
11. SAS

Q25. What is an ER diagram?
Ans: ER is a logical design technique that seeks to remove the redundancy in data. The ER model is a conceptual data model that views the real world as entities and relationships. A basic component of the model is the Entity-Relationship diagram, which is used to visually represent data objects.

Q26. What is the difference between BO, MicroStrategy, and Cognos?
Ans: BO is a ROLAP tool, Cognos is a MOLAP tool, and MicroStrategy is a HOLAP tool.

Q27. What is the difference between a star schema and a snowflake schema, and when do we use those schemas?
Ans: A star schema is a schema design used in a data warehouse where a fact table is in the center and all dimension tables are connected to it directly. In a snowflake schema, the dimension tables are further normalized into different tables: one primary dimension table is joined to the fact table and other dimension tables may be joined to that dimension table. The star schema is denormalized and results in simple joins and less complex queries as well as faster results. The use depends on the requirement. A snowflake schema is a way to handle problems that do not fit within the star schema. It consists of outrigger tables which relate to dimensions rather than to the fact table. This schema is normalized and results in complex joins and very complex queries as well as slower results.
The amount of space taken up by dimensions is so small compared to the space required for a fact table as to be insignificant. Therefore, table space or disk space is not considered a reason to create a snowflake schema.

The main reason for creating a snowflake is to make it simpler and faster for a report writer to create drop-down boxes: rather than having to write a select distinct statement, they can simply select * from the code table. Junk dimensions and mini dimensions are another reason to add outriggers. The junk dimensions contain data from a normal dimension that you wish to separate out, such as fields that change quickly; updates are so slow that they can add hours to the load process. With a junk dimension, it is possible to drop and add records rather than update.

Mini dimensions contain data that is so dissimilar between two or more source systems that it would cause a very sparse main dimension. The conformed data that can be obtained from all source systems is contained in the parent dimension and the data from each source system that does not match is contained in the child dimension. Finally, if you are unlucky enough to have end users actually adding or updating data in the data warehouse rather than just batch loads, it may be necessary to add these outriggers to maintain referential integrity in the data being loaded.
A star schema is good for simple queries and logic, but a snowflake schema is good for complex queries and logic. A snowflake schema is nothing but an extension of the star schema in which the dimension tables are further normalized to reduce redundancy. In a star schema, when we try to access many attributes from a single dimension table the performance of the query falls, so we normalize this dimension table into two or more sub-dimensions; the same star schema is thereby transformed into a snowflake schema, and the performance improves. If the table space is not enough to maintain a star schema then we should go for a snowflake instead of a star schema, i.e., the table should be split into multiple tables. For example, if you maintain time data such as year, month, and day in one table in a star schema, you would split this data into three tables (year, month, day) in a snowflake schema.

Q28. What is slicing and dicing? Explain with real time usage and business reasons of its use.
Ans: Slicing and dicing is a feature that helps us see more detailed information about a particular thing. For example, you have a report which shows the quarterly performance of a particular product, but you want to see it month-wise. You can use the slicing and dicing technique to drill down to the monthly level.

Q29. What are the various attributes in a time dimension?

Ans: Date and time (if a datetime data type is not supported then you have to have hour/min/sec in separate columns), Week, Month, Quarter, Year.

Q30. What is the role of surrogate keys in a data warehouse and how will you generate them?
Ans: A surrogate key is a substitution for the natural primary key. It is just a unique identifier or number for each row that can be used as the primary key of the table. The only requirement for a surrogate primary key is that it is unique for each row in the table. It is useful because the natural primary key can change, and this makes updates more difficult. Surrogate keys are always integer or numeric. A surrogate key is a simple primary key which maps one to one with a natural compound primary key. The reason for using them is to alleviate the need for the query writer to know the full compound key and also to speed query processing by removing the need for the RDBMS to process the full compound key when considering a join. For example, a shipment could have a natural key of ORDER + ITEM + SHIPMENT_SEQ. By giving it a unique SHIPMENT_ID, subordinate tables can access it with a single attribute, rather than three. However, it's important to create a unique index on the natural key as well. A surrogate key is a substitution for the natural primary key. We tend to use our own primary keys (surrogate keys) rather than depend on the primary key that is available in the source system. When integrating the data, trying to work with the source system primary keys will be a little hard to handle. That's the reason why a surrogate key will be useful even though it serves the same purpose as that of a primary key. Another important need for it is that the natural primary key (i.e. Customer Number in the Customer table) can change and this makes updates more difficult.

Q31. What is a source qualifier?
Ans: A source qualifier is a transformation which extracts data from the source.

The source qualifier acts as a SQL query when the source is a relational database and it acts as a data interpreter if the source is a flat file. The source qualifier is a transformation which every mapping should have; it defines the source metadata to the Informatica Repository. The source qualifier has different properties such as filtering records, joining of two sources, and so on.

Q32. Summarize the difference between OLTP, ODS, and data warehouse.
Ans: OLTP, online transaction processing, is a database that handles real-time transactions which have some special requirements. ODS is the database used to capture daily business activities and thus is a normalized database. The ODS captures day-to-day transactions and you can generate reports on the ODS. A data warehouse is a relational database that is designed for analysis and query rather than for transaction processing.

Q33. Do you need separate space for the data warehouse and data marts?
Ans: We don't require any separate space for the data mart and data warehouse unless the marts are too big or the client requires it. We can maintain both in the same schema.

Q34. What is data cleansing? How is it done?
Ans: Data cleansing is the process of standardizing and formatting the data before we store it in the data warehouse. It is done by several methods implemented in Informatica, such as using LTRIM and RTRIM to delete extra spaces.

Q35. What is the difference between data warehouse and datawarehousing?

Ans: The data warehouse is the relational database that is designed for query and analysis, but the process is called datawarehousing. Datawarehousing is a concept.

Q36. What is the need of a surrogate key; why is the primary key not used as a surrogate key?

Ans: Refer to Question No. 30. If a column is made a primary key and later there needs to be a change in the data type or the length of that column, then all the foreign keys that depend on that primary key must also be changed, making the database unstable. Surrogate keys make the database more stable because they insulate the primary and foreign key relationships from changes in the data types and lengths. As data is extracted from disparate sources, where each source might have primary keys with data types or formats inherent to the underlying database, if the same primary keys are used in the DW there would be inconsistencies in the representation of data, which would make querying the database a difficult job; so surrogate keys are implemented to circumvent these kinds of situations.

Q37. For an 80 GB data warehouse, how many records are there in the fact table? There are 25 dimensions and 12 fact tables.
Ans: The estimation process is as follows:
1. We have to estimate the size table-wise. For example, one fact table may have say 50 columns, and one of the column sizes is varchar 200. We have to take all column sizes and find the maximum length of each record, take the required column sizes and find the minimum length of each record, then find the average and multiply it by the number of records available in the fact table.
2. We can also estimate future growth, table-wise. Take 3 to 4 years of data and find the increment in data quarterly. This will help you to estimate the future growth table-wise.

Q38. Can a dimension table contain numeric values?
Ans: A dimension table can contain anything you want - it is the business requirements around each data element that drive the data model. Example: Your company makes bicycles. Frame size is part of the product. It's not a fact - it makes no sense to add/average/sum the frame size. It is something you group/sort/filter on. "How many bikes with 19 inch frames were sold by region?"

WHERE ProductDimension.FrameSize = 19
A dimension table can contain numeric values. For example, a perishable product in a grocery store might have SHELF_LIFE (in days) as part of the product dimension. This value may, for example, be used to calculate optimum inventory levels for the product. Too much inventory and an excess of product expires on the shelf, while too little results in lost sales.

Q39. What is a hybrid slowly changing dimension?
Ans: Hybrid SCDs are a combination of both SCD Type 1 and SCD Type 2. It may happen that in a table some columns are important and we need to track changes for them, i.e. capture the historical data for them, whereas for some columns, even if the data changes, we don't care. For such tables we implement hybrid SCDs, wherein some columns are Type 1 and some are Type 2.

Q40. What is VLDB?
Ans: VLDB stands for Very Large Database; any database too large (normally more than 1 terabyte) is considered a VLDB. It is sometimes used to describe databases occupying magnetic storage in the terabyte range and containing billions of table rows.

Q41. How do you load the time dimension?
Ans: Time dimensions are usually loaded by a program that loops through all possible dates that may appear in the data. 100 years may be represented in a time dimension, with one row per day.

The time dimension in a data warehouse must be loaded manually. We load data into the time dimension using PL/SQL scripts: create a procedure to load data into the time dimension. The procedure needs to run only once to populate all the data. For example, the code below fills the dimension up to the end of 2015. You can modify the code to suit the fields in your table.

create or replace procedure QISODS.Insert_W_DAY_D_PR as
  LastSeqID number default 0;
  loaddate  Date   default to_date('12/31/1979','mm/dd/yyyy');
begin
  Loop
    LastSeqID := LastSeqID + 1;
    loaddate  := loaddate + 1;
    INSERT into QISODS.W_DAY_D values(
      LastSeqID,
      Trunc(loaddate),
      -- half of the year (1 for Q1/Q2, 2 for Q3/Q4)
      Decode(TO_CHAR(loaddate,'Q'), '1',1, decode(to_char(loaddate,'Q'),'2',1,2)),
      TO_NUMBER(TO_CHAR(loaddate, 'MM')),
      TO_NUMBER(TO_CHAR(loaddate, 'Q')),
      -- week of the year
      trunc((ROUND(TO_NUMBER(to_char(loaddate,'DDD'))) +
             ROUND(TO_NUMBER(to_char(trunc(loaddate,'YYYY'),'D'))) + 5) / 7),
      TO_NUMBER(TO_CHAR(loaddate, 'YYYY')),
      TO_NUMBER(TO_CHAR(loaddate, 'DD')),
      TO_NUMBER(TO_CHAR(loaddate, 'D')),
      TO_NUMBER(TO_CHAR(loaddate, 'DDD')),
      1, 1, 1, 1, 1,
      TO_NUMBER(TO_CHAR(loaddate, 'J')),
      ((TO_NUMBER(TO_CHAR(loaddate, 'YYYY')) + 4713) * 12) +
        TO_NUMBER(TO_CHAR(loaddate, 'MM')),
      ((TO_NUMBER(TO_CHAR(loaddate, 'YYYY')) + 4713) * 4) +
        TO_NUMBER(TO_CHAR(loaddate, 'Q')),
      TO_NUMBER(TO_CHAR(loaddate, 'J')) / 7,
      TO_NUMBER(TO_CHAR(loaddate, 'YYYY')) + 4713,
      TO_CHAR(loaddate, 'Day'),
      TO_CHAR(loaddate, 'Month'),
      Decode(To_Char(loaddate,'D'), '7','weekend', '6','weekend', 'weekday'),
      Trunc(loaddate,'DAY') + 1,
      Decode(Last_Day(loaddate), loaddate, 'y', 'n'),
      to_char(loaddate,'YYYYMM'),
      to_char(loaddate,'YYYY') || ' Half' ||
        Decode(TO_CHAR(loaddate,'Q'), '1',1, decode(to_char(loaddate,'Q'),'2',1,2)),
      TO_CHAR(loaddate, 'YYYY / MM'),
      TO_CHAR(loaddate, 'YYYY') || ' Q ' || TRUNC(TO_NUMBER(TO_CHAR(loaddate,'Q'))),
      TO_CHAR(loaddate, 'YYYY') || ' Week' || TRUNC(TO_NUMBER(TO_CHAR(loaddate,'WW'))),
      TO_CHAR(loaddate,'YYYY'));
    If loaddate = to_Date('12/31/2015','mm/dd/yyyy') Then
      Exit;
    End If;
  End Loop;
  commit;
end Insert_W_DAY_D_PR;

Q42. Why are OLTP database designs not generally a good idea for a data warehouse?
Ans: Because OLTP databases are transactional databases: they are used in real time to insert, update, and delete data. To accomplish these tasks in real time, the model used in OLTP databases is highly normalized. The problem with using this model in datawarehousing is that we have to join multiple tables to get a single piece of data. With the amount of historical data we deal with in a data warehouse, it is highly desirable not to have a highly normalized data model like OLTP. OLTP cannot store historical information about the organization; it is used for storing the details of daily transactions, while a data warehouse is a huge store of historical information obtained from different data marts for making intelligent decisions about the organization. OLTP database tables are normalized, which adds additional time for queries to return results. Additionally, an OLTP database is smaller and does not contain data for a longer period (many years), which needs to be analyzed. An OLTP system is basically an ER model and not a dimensional model. If a complex query is executed on an OLTP system, it may cause a heavy overhead on the OLTP server that will affect the normal business processes.

Q43. Explain the advantages of RAID 1, 1/0, and 5. What type of RAID setup would you put your Tx logs on?
Ans: Raid 0 - Makes several physical hard drives look like one hard drive. No redundancy but very fast. May be used for temporary spaces where loss of the files will not result in loss of committed data.

Raid 1 - Mirroring. Each hard drive in the drive array has a twin. Each twin has an exact copy of the other twin's data, so if one hard drive fails, the other is used to pull the data. Raid 1 is half the speed of Raid 0, and the read and write performance are good.

Raid 1/0 - Striped Raid 0, then mirrored Raid 1. Similar to Raid 1. Sometimes faster than Raid 1; it depends on the vendor implementation.
Raid 5 - Great for read-only systems. Write performance is one third that of Raid 1 but read performance is the same as Raid 1. Raid 5 is great for DW but not good for OLTP. Hard drives are cheap now, so I always recommend Raid 1.

Q44. Is it correct/feasible to develop a data mart using an ODS?
Ans: Yes, it is correct to develop a data mart using an ODS, because the ODS is used to store transaction data for a few days (less historic data), which is what a data mart requires. You can build a data mart directly with the ODS as the source; these are called independent data marts.

Q45. What is a CUBE in the data warehouse concept?
Ans: Cubes are a logical representation of multidimensional data. The edge of the cube contains dimension members and the body of the cube contains data values. The linking in the cube ensures that the data in the cube remains consistent. A CUBE is used in a data warehouse for representing multidimensional data logically. Using the cube, it is easy to carry out certain activities, e.g. roll-up, drill-down/drill-up, slice and dice, etc., which enable the business users to understand the trend of the business. It is good to have the design of the cube in the star schema so as to facilitate the effective use of the cube.

Q46. Difference between snowflake and star schema? What are situations where snowflake schema is better than star schema and when is the opposite true?

Ans: A star schema means a centralized fact table surrounded by different dimensions. Snowflake means that in the same star schema, dimensions are split into further dimensions. A star schema contains highly denormalized data; a snowflake contains partially normalized data. A star schema cannot have parent tables, but a snowflake can contain parent tables.
Why go for a star schema: 1) fewer joins 2) simpler database 3) supports drill-up options.
Why go for a snowflake schema: sometimes we need to provide separate dimensions derived from existing dimensions; in that case we go for a snowflake.
Disadvantage of the snowflake: query performance is lower because more joins are involved.

Q47. What is the main difference between a schema in an RDBMS and schemas in a data warehouse?
Ans: RDBMS schema: Used for OLTP systems. Traditional and old schema. Normalized. Difficult to understand and navigate. Cannot solve extract and complex problems. Poorly modeled. More transactions. Less time for query execution. More users. Has insert, update, and delete transactions.
DWH schema: Used for OLAP systems. New-generation schema. Denormalized. Easy to understand and navigate. Extract and complex problems can be easily solved. Very good model. Fewer transactions. Fewer users. More time for query execution. Will not have many inserts, deletes, and updates.

Q48. What are possible data marts in retail sales?
Ans: Product, sales, location, store, time.

Q49. What is meant by metadata in the context of a data warehouse and how is it important?

Ans: Metadata is data about data. Examples of metadata include data element descriptions, data type descriptions, attribute/property descriptions, range/domain descriptions, and process/method descriptions. The repository environment encompasses all corporate metadata resources: database catalogs, data dictionaries, and navigation services. Metadata includes things like the name, length, valid values, and description of a data element. Metadata is stored in a data dictionary and repository. It insulates the data warehouse from changes in the schema of operational systems. Metadata synchronization is the process of consolidating, relating, and synchronizing data elements with the same or similar meaning from different systems. Metadata synchronization joins these different elements together in the data warehouse to allow for easier access.

Q50. What is a surrogate key? Where do we use it? Explain with examples.
Ans: For the definition, refer to Q30. It is useful because the natural primary key (i.e. customer number in the customer table) can change, and this makes updates more difficult. Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated as the primary keys (according to the business users), but not only can these change, indexing on a numerical value is probably better, so you could consider creating a surrogate key, called say AIRPORT_ID. This should be internal to the system and, as far as the client is concerned, you may display only the AIRPORT_NAME.
Another benefit you can get from surrogate keys (SIDs) is tracking SCDs - Slowly Changing Dimensions.
Let me give you a simple, classical example: On the 1st of January 2002, Employee 'E1' belongs to Business Unit 'BU1' (that's what would be in your Employee Dimension).

This employee has a turnover allocated to him on the Business Unit 'BU1'. But on the 2nd of June the Employee 'E1' is moved from Business Unit 'BU1' to Business Unit 'BU2'. All the new turnover has to belong to the new Business Unit 'BU2', but the old turnover should belong to the Business Unit 'BU1'. If you used the natural business key 'E1' for your employee within your data warehouse, everything would be allocated to Business Unit 'BU2', even what actually belongs to 'BU1'. If you use surrogate keys, you could create on the 2nd of June a new record for the Employee 'E1' in your Employee Dimension with a new surrogate key. This way, in your fact table, your old data (before the 2nd of June) carries the SID of the Employee 'E1' + 'BU1', and all new data (after the 2nd of June) would take the SID of the Employee 'E1' + 'BU2'. You could consider a Slowly Changing Dimension as an enlargement of your natural key: the natural key of the Employee was Employee Code 'E1', but for you it becomes Employee Code + Business Unit - 'E1' + 'BU1' or 'E1' + 'BU2'. But the difference with the natural key enlargement process is that you might not have all parts of your new key within your fact table, so you might not be able to do the join on the new enlarged key -> so you need another id.

Q51. What is the main difference between the Inmon and Kimball philosophies of data warehousing?
Ans: Both differed in the concept of building the data warehouse.

According to Kimball, the data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in dimensional models. Kimball views data warehousing as a constituency of data marts. Data marts are focused on delivering business objectives for departments in the organization, and the data warehouse is a conformed dimension of the data marts. Hence a unified view of the enterprise can be obtained from the dimension modeling on a local departmental level. Inmon believes in creating a data warehouse on a subject-by-subject area basis. Hence the development of the data warehouse can start with data from the online store.

Other subject areas can be added to the data warehouse as the need arises. Point-of-sale (POS) data can be added later if management decides it is necessary. According to Inmon, the data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in 3rd normal form. That is:
Kimball - first data marts - combined together - data warehouse.
Inmon - first data warehouse - later - data marts.

Q52. What is the difference between a view and a materialized view?
Ans: A view stores the SQL statement in the database and lets you use it as a table. Every time you access the view, the SQL statement executes. A materialized view stores the results of the SQL in table form in the database. The SQL statement executes only once, and after that, every time you run the query, the stored result set is used. Pros include quick query results. Views do not take any space, but materialized views take space.

Q53. What are the advantages of data mining over traditional approaches?

Ans: Data mining is used for estimating the future. For example, if we take a company/business organization, by using the concept of data mining we can predict the future of the business in terms of revenue (or) employees (or) customers (or) orders etc. Traditional approaches use simple algorithms for estimating the future, but they do not give accurate results when compared to data mining.

Q54. What are the different architectures of a data warehouse?
Ans: There are three types of architectures.
Data warehouse basic architecture: In this architecture end users access data that is derived from several sources through the data warehouse.

Architecture: Source --> Warehouse --> End Users
Data warehouse with staging area architecture: Whenever the data that is derived from sources needs to be cleaned and processed before putting it into the warehouse, a staging area is used.
Architecture: Source --> Staging Area --> Warehouse --> End Users
Data warehouse with staging area and data marts architecture: When the warehouse architecture is customized for different groups in the organization, data marts are added and used.
Architecture: Source --> Staging Area --> Warehouse --> Data Marts --> End Users

Q55. What are the steps to build the data warehouse?
Ans:
Gathering business requirements
Identifying sources
Identifying facts
Defining dimensions
Defining attributes
Redefining dimensions and attributes
Organizing the attribute hierarchy and defining relationships
Assigning unique identifiers
Additional conventions: cardinality/adding ratios
1. Understand the business requirements.
2. Once the business requirements are clear, identify the grains (levels).
3. Once the grains are defined, design the dimension tables with the lower-level grains.
4. Once the dimensions are designed, design the fact table with the key performance indicators (facts).
5. Once the dimensions and fact tables are designed, define the relationships between the tables by using primary keys and foreign keys. In the logical phase the database design looks like a star, so it is named the star schema design.

Q56. Give an example of a degenerated dimension.
Ans: A degenerated dimension is a dimension key without a corresponding dimension table. Example: In the PointOfSale Transaction fact table, we have: Date Key (FK), Product Key (FK), Store Key (FK), Promotion Key (FK), and POS Transaction Number.

The Date dimension corresponds to the Date Key, and the Product dimension corresponds to the Product Key. In a traditional parent-child database, POS Transaction Number would be the key to the transaction header record that contains all the information valid for the transaction as a whole, such as the transaction date and store identifier. But in this dimensional model, we have already extracted this information into other dimensions. Therefore, POS Transaction Number looks like a dimension key in the fact table but does not have a corresponding dimension table. Therefore, POS Transaction Number is a degenerated dimension.

Q57. What is the data type of the surrogate key?
Ans: The data type of a surrogate key is either numeric or integer.
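A hedged DDL sketch of the Q56 example; pos_fact and its column names are hypothetical, chosen only to show the degenerate dimension sitting in the fact table with no lookup table of its own:

CREATE TABLE pos_fact (
  date_key               NUMBER,        -- FK to the date dimension
  product_key            NUMBER,        -- FK to the product dimension
  store_key              NUMBER,        -- FK to the store dimension
  promotion_key          NUMBER,        -- FK to the promotion dimension
  pos_transaction_number VARCHAR2(20),  -- degenerate dimension: no dimension table
  sales_amount           NUMBER(12,2)
);

-- Grouping by the degenerate dimension rolls line items back up into transactions.
SELECT pos_transaction_number, SUM(sales_amount) AS transaction_total
FROM   pos_fact
GROUP  BY pos_transaction_number;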

Q58. What is real-time data warehousing?
Ans: Real-time data warehousing is a combination of two things: 1) real-time activity and 2) data warehousing. Real-time activity is activity that is happening right now. The activity could be anything, such as the sale of widgets. Once the activity is complete, there is data about it. Data warehousing captures business activity data. Real-time data warehousing captures business activity data as it occurs. As soon as the business activity is complete and there is data about it, the completed activity data flows into the data warehouse and becomes available instantly. In other words, real-time data warehousing is a framework for deriving information from data as the data becomes available.
A real-time data warehouse provides live data for DSS (it may not be 100% up to that moment; some latency will be there). The data warehouse has access to the OLTP sources, and data is loaded from the source to the target not daily or weekly but perhaps every 10 minutes, through replication or log shipping or something like that. SAP BW provides a real-time DW; with the help of the extended star schema, source data is shared. In real-time data warehousing, your warehouse contains completely up-to-date data and is synchronized with the source systems that provide the source data. In near-real-time data warehousing, there is a minimal delay between source data being generated and being available in the data warehouse. Therefore, if you want to achieve real-time or near-real-time updates to your data warehouse, you'll need to do three things:
1. Reduce or eliminate the time taken to get new and changed data out of your source systems.
2. Eliminate, or reduce as much as possible, the time required to cleanse, transform, and load your data.
3. Reduce as much as possible the time required to update your aggregates.
Starting with version 9i, and continuing with the latest 10g release, Oracle has gradually introduced features into the database to support real-time, and near-real-time, data warehousing. These features include:
Change data capture
External tables, table functions, pipelining, and the MERGE command
Fast refresh materialized views

Q59. What is normalization, first normal form, third normal form?

segregating of table avoid duplication of process of removing attributes in data step is form. base design. of data. is modified or activities. each

The condition of data at completion of described as a normal Needs for normalization : improves data Ensures minimum redundancy Reduces need to reorganize data when design enhanced. Removes anomalies for database

First normal form: A table is in first normal form when it contains no repeating groups. The repeating columns or fields in an unnormalized table are removed from the table and put into tables of their own. Such a table becomes dependent on the parent table from which it is derived. The key to this table is called a concatenated key, with the key of the parent table forming a part of it.

Second normal form: A table is in second normal form if all its non-key fields are fully dependent on the whole key. This means that each field in the table must depend on the entire key. Fields that do not depend on the combination key are moved to another table on whose key they depend. Structures which do not contain combination keys are automatically in second normal form.

Third normal form: A table is in third normal form if all the non-key fields of the table are independent of all other non-key fields of the same table. A small decomposition example is sketched below.
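As a hedged illustration (the order and customer structures are invented for the example), the sketch below shows an unnormalized record with a repeating group being split for first normal form, and a transitive dependency being removed for third normal form.

# Unnormalized: one "order" record carries a repeating group of items and a
# customer attribute that depends on customer_id rather than on the order key.
unnormalized = {
    "order_id": 1001,
    "customer_id": "C01",
    "customer_city": "Pune",
    "items": [("P1", 2), ("P2", 5)],   # repeating group
}

# 1NF: the repeating group moves to its own table, keyed by a
# concatenated key (order_id + product_id).
orders      = [{"order_id": 1001, "customer_id": "C01", "customer_city": "Pune"}]
order_items = [
    {"order_id": 1001, "product_id": "P1", "quantity": 2},
    {"order_id": 1001, "product_id": "P2", "quantity": 5},
]

# 3NF: customer_city depends on customer_id (a non-key field of orders),
# not on order_id, so it moves to a customer table.
orders    = [{"order_id": 1001, "customer_id": "C01"}]
customers = [{"customer_id": "C01", "customer_city": "Pune"}]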

Q60. What is the difference between static and dynamic caches?

Ans: A static cache stores the lookup values in memory and does not change throughout the run of the session, whereas a dynamic cache stores the values in memory and changes dynamically while the session runs. The dynamic cache is used in SCD implementations, where the target table changes and the cache is updated along with it. A generic sketch of the two behaviors is shown below.
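The following is a generic sketch, not the API of any specific ETL tool: a static lookup cache is built once and only read, while a dynamic cache is updated as rows are inserted into the target, so later rows in the same run can see the new entries.

def build_static_cache(target_rows, key):
    """Built once at session start; never modified while the session runs."""
    return {row[key]: row for row in target_rows}

def load_with_dynamic_cache(source_rows, target_rows, key):
    """Cache is kept in sync with the target as rows are inserted or updated."""
    cache = {row[key]: row for row in target_rows}
    for row in source_rows:
        if row[key] in cache:
            cache[row[key]].update(row)        # existing key -> update target row
        else:
            cache[row[key]] = dict(row)        # new key -> insert, and cache it
            target_rows.append(cache[row[key]])
    return target_rows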

Q61. What is meant by an aggregate fact table?

Ans: An aggregate fact table stores information that has been aggregated, or summarized, from a detail fact table. Aggregate fact tables are useful in improving query performance. Often an aggregate fact table can be maintained through the use of materialized views, which, under certain databases, can automatically be substituted for the detailed fact table if appropriate in resolving a query.

Q62. What is the life cycle of data warehouse projects?

Ans:
STRATEGY & PROJECT PLANNING: Definition of scope, goals, objectives, expectations and purpose; establishment of implementation strategy; preliminary identification of project resources; assembling of the project team; estimation of the project schedule.
REQUIREMENTS DEFINITION: Definition of the requirements gathering strategy; conducting of interviews and group sessions with users; review of existing documents; study of source operational systems; derivation of business metrics and dimensions needed for analysis.
ANALYSIS & DESIGN: Design of the logical data model; definition of data extraction, transformation, and loading functions; design of the information delivery framework; establishment of storage requirements; definition of the overall architecture and supporting infrastructure.
CONSTRUCTION: Selection and installation of infrastructure hardware and software; selection and installation of the DBMS; selection and installation of ETL and information delivery tools; completion of the design of the physical data model; completion of the metadata component.
DEPLOYMENT: Completion of user acceptance tests; performance of initial data loads; making user desktops ready for the data warehouse; training and support for the initial set of users; provision for data warehouse security, backup, and recovery.
MAINTENANCE: Ongoing monitoring of the data warehouse; continuous performance tuning; ongoing user training; provision of ongoing support to users; ongoing data warehouse management.

Q63. What is a CUBE and why do we create a cube? What is the difference between ETL and OLAP cubes?

Ans: Any schema, table, or report which gives you meaningful information about one attribute with respect to more than one attribute is called a cube.

For example, in a product table with Product ID and Sales columns, we can analyze Sales with respect to Product Name; but if you analyze Sales with respect to Product as well as Region (region being an attribute in the Location table), the report or resultant table or schema would be a cube.

ETL cubes: Built in the staging area to load frequently accessed reports to the target.

Reporting cubes: Built after the actual load of all the tables to the target, depending on the customer's requirements for business analysis.
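As a hedged illustration of the "Sales by Product and Region" cube described above (the data and names are invented), a simple cross-tabulation can be built even without an OLAP engine:

from collections import defaultdict

# Invented sample rows: sales analyzed by Product and Region.
sales = [
    {"product": "P1", "region": "North", "amount": 100},
    {"product": "P1", "region": "South", "amount": 250},
    {"product": "P2", "region": "North", "amount": 175},
]

# Two-dimensional aggregation: (product, region) -> total sales.
cube = defaultdict(float)
for row in sales:
    cube[(row["product"], row["region"])] += row["amount"]

for (product, region), total in sorted(cube.items()):
    print(product, region, total)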

Q64. Explain the flow of data starting with OLTP to OLAP, including staging, summary tables, facts, and dimensions.

Ans: OLTP(1) ----> ODS(2) ----> DWH(3) ----> OLAP(4) ----> Reports(5) ----> Decision(6)
1-2: Extraction
2-3: Transformation (here the ODS itself is the staging area)
3-4-5: Use of the reporting tool and generation of reports
6: The decision is taken, i.e. the purpose of the data warehouse is served.

Q65. What is the definition of normalized and denormalized views, and what are the differences between them?

Ans: Normalized view -> the process of eliminating redundancies. Denormalized view -> the data is processed in a way that allows duplication; in other words, replication is not prevented. One more point: because OLTP data is in normalized form, more tables are scanned or referred to for a single query, since data must be fetched from the respective master tables through primary-key/foreign-key relationships. In OLAP, because the data is in denormalized form, fewer tables are queried for a given request. For example, in a banking application's OLTP environment we will have separate tables for customer personal details, address details, transaction details, and so on, whereas in the OLAP environment all these details can be stored in one single table, thus reducing the scanning of multiple tables for a single customer record. The sketch below contrasts the two layouts.
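As a hedged sketch of the banking example above (the table and column names are invented), the normalized OLTP layout needs a join across master tables to answer a simple question, while the denormalized OLAP layout answers it from one wide table:

# Normalized (OLTP): customer attributes split across master tables.
customers = {"C01": {"name": "Asha"}}
addresses = {"C01": {"city": "Mumbai"}}
transactions = [{"customer_id": "C01", "amount": 5000}]

# Answering "city-wise transaction amounts" requires joining the structures.
oltp_answer = [
    (addresses[t["customer_id"]]["city"], t["amount"]) for t in transactions
]

# Denormalized (OLAP): the same facts carried in one wide table.
wide_rows = [{"customer_id": "C01", "name": "Asha", "city": "Mumbai", "amount": 5000}]
olap_answer = [(r["city"], r["amount"]) for r in wide_rows]

assert oltp_answer == olap_answer == [("Mumbai", 5000)]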

Q66. What is a BUS schema?

Ans: A BUS schema is composed of a master suite of conformed dimensions and standardized definitions of facts. Bus schema: let us explain this with a two-axis layout. Dimension tables: A, B, C, D, E, F. Fact tables: R, S. The relation between the fact and dimension tables is as follows: R ->> A, B, E, F and S ->> D, C, A.

A conformed dimension must be identified across different subjects (any dimension which is found in both fact tables, say R and S). We take the fact tables along one axis and the dimension tables along the other; this type of matrix construction is called a bus matrix. It is constructed initially, before the universe is created, and can be regarded as the initial layout in designing a schema. Every schema starts as a star schema and then expands into a multi-star, snowflake, or constellation schema. A small bus-matrix sketch is shown below.
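For illustration, using the same letters as the answer above, a bus matrix can be printed as a simple grid of fact tables against dimensions, marking which dimensions each fact uses:

# Bus matrix for fact tables R and S against dimensions A..F,
# following the relationships R ->> A,B,E,F and S ->> D,C,A.
dimensions = ["A", "B", "C", "D", "E", "F"]
facts = {"R": {"A", "B", "E", "F"}, "S": {"A", "C", "D"}}

print("     " + "  ".join(dimensions))
for fact, used in facts.items():
    row = ["X" if d in used else "." for d in dimensions]
    print(fact + "    " + "  ".join(row))

# Dimension A appears in both R and S, so it must be a conformed dimension.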

Q67. Which automation tool is used in data warehousing testing?

Ans: No automated tool testing is done in a data warehouse; only manual testing is done.

Q68. What is a data warehousing hierarchy?

Ans: Hierarchies are logical structures that use ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation. For example, in a time dimension, a hierarchy might aggregate data from the month level to the quarter level to the year level. A hierarchy can also be used to define a navigational drill path and to establish a family structure. A small roll-up sketch is shown below.
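As a hedged illustration of the time hierarchy described above (the figures are invented), monthly values can be rolled up the month -> quarter -> year drill path like this:

from collections import defaultdict

# Invented monthly sales; the hierarchy is month -> quarter -> year.
monthly = {("2023", "Q1", "Jan"): 10, ("2023", "Q1", "Feb"): 12,
           ("2023", "Q2", "Apr"): 8}

quarterly = defaultdict(int)
yearly = defaultdict(int)
for (year, quarter, _month), amount in monthly.items():
    quarterly[(year, quarter)] += amount   # aggregate month -> quarter
    yearly[year] += amount                 # aggregate quarter -> year

print(dict(quarterly))   # {('2023', 'Q1'): 22, ('2023', 'Q2'): 8}
print(dict(yearly))      # {'2023': 30}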

Q69. What are the data validation strategies for validating data marts after the loading process?

Ans: Data validation is performed to make sure that the loaded data is accurate and meets the business requirements. Strategies are the different methods followed to meet the validation requirements. A few typical checks are sketched below.
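As a hedged sketch of such strategies (the checks and names are generic examples, not an exhaustive or tool-specific list), typical post-load validation compares row counts with the source, checks keys for nulls, and verifies referential integrity against the dimension tables:

def validate_load(source_count, fact_rows, dim_keys, key="product_key"):
    """Simple post-load checks for a data mart fact table."""
    errors = []
    if len(fact_rows) != source_count:
        errors.append("row count mismatch between source and target")
    if any(row.get(key) is None for row in fact_rows):
        errors.append(f"null {key} found in fact table")
    if any(row[key] not in dim_keys for row in fact_rows if row.get(key)):
        errors.append(f"{key} values with no matching dimension row")
    return errors

# Example: one orphan key should be reported.
facts = [{"product_key": 1}, {"product_key": 99}]
print(validate_load(source_count=2, fact_rows=facts, dim_keys={1, 2, 3}))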

Q70. Why should you put your data warehouse on a different system than your OLTP system?

Ans: A data warehouse is part of OLAP (On-Line Analytical Processing). It is the source from which BI tools fetch data for analytical, reporting, or data mining purposes. It generally contains data spanning the whole life cycle of the company or product. A DWH contains historical, integrated, denormalized, subject-oriented data. The OLTP system, on the other hand, contains data that is generally limited to the last couple of months or a year at most. The nature of data in OLTP is current, volatile, and highly normalized. Since both systems are different in nature and functionality, we should always keep them on different systems. A DW is typically used for intensive querying. Since the primary responsibility of an OLTP system is to faithfully record ongoing transactions (inserts/updates/deletes), those operations would be considerably slowed down by the heavy querying to which the DW is subjected.

Q71. What are semi-additive and factless facts, and in which scenarios would you use such fact tables?

Ans: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not for others. For example, Current_Balance and Profit_Margin are facts. Current_Balance is a semi-additive fact: it makes sense to add it up across all accounts (what is the total current balance for all accounts in the bank?), but it does not make sense to add it up through time (adding up the current balances for a given account for each day of the month does not give us any useful information).

A factless fact table is a fact table that does not contain any facts; it may consist of nothing but keys. The first type of factless fact table is a table that records an event; many event-tracking tables in dimensional data warehouses turn out to be factless. A second kind of factless fact table is called a coverage table. Coverage tables are frequently needed when a primary fact table in a dimensional data warehouse is sparse. A factless fact table captures the many-to-many relationships between dimensions but contains no numeric or textual facts. They are often used to record events or coverage information. Common examples of factless fact tables include: identifying product promotion events (to determine promoted products that didn't sell); tracking student attendance or registration events; tracking insurance-related accident events; and identifying building, facility, and equipment schedules for a hospital or university. A minimal example is sketched below.
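As a hedged sketch of the event-tracking case (the attendance table and its columns are invented for illustration), a factless fact table holds only dimension keys; counting rows answers questions such as "how many students attended each class on a given day":

from collections import Counter

# Factless fact table: only foreign keys, no numeric measures.
attendance_fact = [
    {"date_key": 20230105, "student_key": 1, "class_key": 10},
    {"date_key": 20230105, "student_key": 2, "class_key": 10},
    {"date_key": 20230105, "student_key": 1, "class_key": 11},
]

# "Measures" come from counting rows, e.g. attendance per class per day.
per_class = Counter((r["date_key"], r["class_key"]) for r in attendance_fact)
print(per_class)   # Counter({(20230105, 10): 2, (20230105, 11): 1})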

Q72. What are aggregate tables?

Ans: Aggregate tables contain a summary of existing warehouse data, grouped to certain levels of the dimensions. It is always easier to retrieve data from an aggregate table than to visit the original table, which may have millions of records. Aggregate tables reduce the load on the database server, improve query performance, and return results quickly. These are tables which contain aggregated/summarized data, e.g. yearly or monthly sales information, and they are used to reduce query execution time.

Q73. What are the different methods of loading dimension tables?

Ans: There are two ways to load data into dimension tables. Conventional (slow): all the constraints and keys are validated against the data before it is loaded; this way data integrity is maintained. Direct (fast): all the constraints and keys are disabled before the data is loaded. Once the data is loaded, it is validated against all the constraints and keys. If the data is found invalid or dirty, it is not included in the index and all further processing is skipped for that data.

Q74. What is the difference between data warehousing and business intelligence?

Ans: Data warehousing deals with all aspects of managing the development, implementation, and operation of a data warehouse or data mart, including metadata management, data acquisition, data cleansing, data transformation, storage management, data distribution, data archiving, operational reporting, analytical reporting, security management, backup/recovery planning, etc. Business intelligence, on the other hand, is a set of software tools that enable an organization to analyze measurable aspects of its business such as sales performance, profitability, operational efficiency, effectiveness of marketing campaigns, market penetration among certain customer groups, cost trends, anomalies and exceptions, etc. Typically, the term business intelligence is used to encompass OLAP, data visualization, data mining, and query/reporting tools.

Difference between star and snowflake schemas:

What is a star schema? A star schema is a relational database schema for representing multidimensional data. It is the simplest form of data warehouse schema and contains one or more dimension tables and fact tables. One fact table sits centrally and is surrounded by multiple dimension tables. Every dimension table has a primary key, which is connected to a foreign key in the fact table. The advantages of the star schema are increased performance and easy understanding of the data.

Advantages of the star schema: easy for users to understand; improved performance because of fewer joins and indexes; better optimizer navigation; most suitable for query processing.

Disadvantages of the star schema:

It occupies more storage space; it is difficult to maintain and update.

What is a snowflake schema? Dimension table hierarchies are broken into separate tables in a snowflake schema. In OLAP, this snowflake approach increases the number of joins and degrades performance when retrieving data.

Advantages of the snowflake schema: it occupies less storage space; it is easier to update and maintain the data because it has a normalized structure.

Disadvantages of the snowflake schema: not easy for end users to understand because it has a complex structure; decreased performance because of the greater number of joins. A small schema sketch contrasting the two is shown below.
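As a hedged illustration (the table and column names are invented), the same product dimension is shown first in star form, with the category hierarchy kept in one wide dimension table, and then snowflaked, with the hierarchy normalized into its own table at the cost of an extra join:

import sqlite3

# Star schema: one wide product dimension joined directly to the fact table.
STAR = """
CREATE TABLE product_dim (
    product_key    INTEGER PRIMARY KEY,
    product_name   TEXT,
    category_name  TEXT,        -- hierarchy kept in the same table
    department     TEXT
);
"""

# Snowflake schema: the category hierarchy is normalized into its own table,
# which means one more join when querying by category or department.
SNOWFLAKE = """
CREATE TABLE category_dim (
    category_key   INTEGER PRIMARY KEY,
    category_name  TEXT,
    department     TEXT
);
CREATE TABLE product_dim_sf (
    product_key    INTEGER PRIMARY KEY,
    product_name   TEXT,
    category_key   INTEGER REFERENCES category_dim(category_key)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(STAR + SNOWFLAKE)
conn.close()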
