
What is Data Warehousing?

Data Warehouse is a repository of integrated information, available for queries


and analysis. Data and information are extracted from heterogeneous sources as
they are generated. This makes it much easier and more efficient to run queries
over data that originally came from different sources.
Typical relational databases are designed for on-line transactional processing
(OLTP) and do not meet the requirements for effective on-line analytical
processing (OLAP). As a result, data warehouses are designed differently than
traditional relational databases.
Data warehousing is the process of creating, populating, and querying a data
warehouse. It includes a number of discrete technologies, such as identifying
sources, the ECCD process, and ETL, which covers data cleansing, data
transformation, and loading data into targets.
A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile
collection of data that enables decision making across disparate groups of users.

What is real-time data warehousing


Real-time data warehousing is a combination of two things: 1) real-time activity
and 2) data warehousing. Real-time activity is activity that is happening right
now. The activity could be anything such as the sale of widgets. Once the activity
is complete, there is data about it.
Data warehousing captures business activity data. Real-time data warehousing
captures business activity data as it occurs. As soon as the business activity is
complete and there is data about it, the completed activity data flows into the
data warehouse and becomes available instantly. In other words, real-time data
warehousing is a framework for deriving information from data as the data
becomes available.
A real-time data warehouse provides live data for DSS (it may not be 100% up to
the moment; some latency will be there). The data warehouse has access to the
OLTP sources, and data is loaded from the source to the target not daily or
weekly, but perhaps every 10 minutes, through replication, log shipping, or
something similar. SAP BW provides a real-time DW; with the help of the
extended star schema, source data is shared.
In real-time data warehousing, your warehouse contains completely up-to-date
data and is synchronized with the source systems that provide the source
data. In near-real-time data warehousing, there is a minimal delay between
source data being generated and being available in the data warehouse.
Therefore, if you want to achieve real-time or near-real-time updates to your
data warehouse, you'll need to do three things:
Reduce or eliminate the time taken to get new and changed data out of your
source systems.
Eliminate, or reduce as much as possible, the time required to cleanse,
transform and load your data.
Reduce as much as possible the time required to update your aggregates.

Starting with version 9i, and continuing with the latest 10g release, Oracle has
gradually introduced features into the database to support real-time, and
near-real-time, data warehousing. These features include:
Change Data Capture
External tables, table functions, pipelining, and the MERGE command, and
Fast refresh materialized views
Real-time data warehousing also implies combining heterogeneous databases for
query and analysis purposes, and for decision-making and reporting purposes.
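As a hedged sketch of the load-latency point above: the pattern is to pull only rows changed since the last run and merge (upsert) them into the target, rather than reloading everything. The example below uses SQLite's upsert as a stand-in for Oracle's MERGE; all table and column names are invented for illustration.

```python
import sqlite3

def incremental_load(conn, last_load_ts):
    """Merge only source rows changed since the previous load (hypothetical schema)."""
    changed = conn.execute(
        "SELECT id, amount, updated_at FROM src_orders WHERE updated_at > ?",
        (last_load_ts,),
    ).fetchall()
    # SQLite's ON CONFLICT upsert plays the role of a MERGE statement here.
    conn.executemany(
        "INSERT INTO dw_orders (id, amount, updated_at) VALUES (?, ?, ?) "
        "ON CONFLICT(id) DO UPDATE SET amount = excluded.amount, "
        "updated_at = excluded.updated_at",
        changed,
    )
    return len(changed)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src_orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)")
conn.execute("CREATE TABLE dw_orders  (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)")
conn.executemany("INSERT INTO src_orders VALUES (?, ?, ?)",
                 [(1, 10.0, "2024-01-01"), (2, 20.0, "2024-01-02")])
loaded = incremental_load(conn, "2024-01-01")  # only the row newer than the last load moves
```

Running a job like this every few minutes, rather than nightly, is essentially the near-real-time loading the answers above describe.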

What is ODS
ODS stands for Operational Data Store, not Online Data Storage.
It is used to maintain and store current, up-to-date information and
transactions from the source databases, taken from the OLTP system.
It is directly connected to the source database systems instead of to the staging
area.
It is further connected to data warehouse and moreover can be treated as a part
of the data warehouse database.
It is the final integration point in the ETL process before loading the data into the
Data Warehouse.
It contains near-real-time data. In a typical data warehouse architecture, the
ODS is sometimes used for analytical reporting as well as a source for the Data
Warehouse.
An Operational Data Store is a hybrid structure that has some aspects of a data
warehouse and other aspects of an operational system.
It contains integrated data.
It can support DSS processing.
It can also support high transaction processing.
It is placed in between the warehouse and the web to support web users.
Operational data stores can be updated, provide rapid response times, and
contain only a limited amount of historical data.
An Operational Data Store presents a consistent picture of the current data
stored and managed by transaction processing system. As data is modified in the
source system, a copy of the changed data is moved into the ODS. Existing data
in the ODS is updated to reflect the current status of the source system

What is data mining


Data mining is a process of extracting hidden trends within a data warehouse. For
example, an insurance data warehouse can be used to mine data for the
highest-risk people to insure in a certain geographical area.

In its simple definition you can say data mining is a way to discover new meaning
in data.
Data mining is the concept of deriving/discovering hidden, unexpected
information from existing data.
Data mining is the non-trivial process of identifying valid, potentially useful,
and ultimately understandable patterns in data.
A data warehouse typically supplies answers to a question like "Who is buying our
products?". A data mining approach would seek answers to questions like "Who is
NOT buying our products?".

What are Data Marts


Data Mart is a segment of a data warehouse that can provide data for reporting
and analysis on a section, unit, department or operation in the company, e.g.
sales, payroll, production. Data marts are sometimes complete individual data
warehouses which are usually smaller than the corporate data warehouse.
Data marts are small subsets of a data warehouse; they contain data at the
business-division or department level.
A data mart is a focused subset of a data warehouse that deals with a single
area of data (such as one department) and is organized for quick analysis.
Data Marts: A subset of data warehouse data used for a specific business
function whose format may be a star schema, hypercube or statistical sample
Data Mart: a data mart is a small data warehouse. In general, a data warehouse
is divided into small units according to the business requirements. For example,
if we take the data warehouse of an organization, it may be divided into
individual data marts. Data marts are used to improve performance during the
retrieval of data.
e.g.: Data Mart of Sales, Data Mart of Finance, Data Mart of Marketing, Data
Mart of HR, etc.

What is ER Diagram
ER stands for entity-relationship diagram. It is the first step in the design of a
data model, which will later lead to the physical database design of possibly an
OLTP or OLAP database.
The Entity-Relationship (ER) model was originally proposed by Peter Chen in 1976
[Chen76] as a way to unify the network and relational database views.
Simply stated, the ER model is a conceptual data model that views the real world
as entities and relationships. A basic component of the model is the
Entity-Relationship diagram, which is used to visually represent data objects.
Since Chen wrote his paper the model has been extended and today it is
commonly used for database design.
For the database designer, the utility of the ER model is:
It maps well to the relational model. The constructs used in the ER model can
easily be transformed into relational tables. It is simple and easy to understand
with a minimum of training. Therefore, the model can be used by the database
designer to communicate the design to the end user.

In addition, the model can be used as a design plan by the database developer to
implement a data model in a specific database management software.

What is a Star Schema


A relational database schema organized around a central table (fact table) joined
to a few smaller tables (dimension tables) using foreign key references. The fact
table contains raw numeric items that represent relevant business facts (price,
discount values, number of units sold, dollar value, etc.)
It is a way of organizing the entities such that you can retrieve results from the
database easily and quickly. Usually a star schema will have one or more
dimension tables linked around a fact table; because the diagram looks like a
star, hence the name.

What is Dimensional Modelling


In Dimensional Modeling, Data is stored in two kinds of tables: Fact Tables and
Dimension tables.
Fact Table contains fact data e.g. sales, revenue, profit etc.....
Dimension table contains dimensional data such as Product Id, product name,
product description etc.....
Dimensional modelling is a design concept used by many data warehouse
designers to build their data warehouses. In this design model all the data is
stored in two types of tables: fact tables and dimension tables. The fact table
contains the facts/measurements of the business, and the dimension table
contains the context of the measurements, i.e., the dimensions on which the
facts are calculated.
Dimensional Modeling is a logical design technique that seeks to present the data
in a standard, intuitive framework that allows for high-performance access. It is
inherently dimensional, and it adheres to a discipline that uses the relational
model with some important restrictions. Every dimensional model is composed of
one table with a multipart key, called the fact table, and a set of smaller tables
called dimension tables. Each dimension table has a single-part primary key that
corresponds exactly to one of the components of the multipart key in the fact
table.

Why is Data Modeling Important?


Data modeling is probably the most labor intensive and time consuming part of
the development process. Why bother especially if you are pressed for time? A
common response by practitioners who write on the subject is that you should no
more build a database without a model than you should build a house without
blueprints.
The goal of the data model is to make sure that all data objects required by
the database are completely and accurately represented. Because the data model
uses easily understood notations and natural language, it can be reviewed and
verified as correct by the end-users.
The data model is also detailed enough to be used by the database developers to

use as a "blueprint" for building the physical database. The information contained
in the data model will be used to define the relational tables, primary and foreign
keys, stored procedures, and triggers. A poorly designed database will require
more time in the long-term. Without careful planning you may create a database
that omits data required to create critical reports, produces results that are
incorrect or inconsistent, and is unable to accommodate changes in the user's
requirements.

Steps In Building the Data Model


While the ER model lists and defines the constructs required to build a data
model, there is no standard process for doing so. Some methodologies, such as
IDEF1X, specify a bottom-up development process where the model is built in
stages. Typically, the entities and relationships are modeled first, followed by
key attributes, and then the model is finished by adding non-key attributes.
Other experts argue that in practice, using a phased approach is impractical
because it requires too many meetings with the end-users. The sequence used for
this document is:
Identification of data objects and relationships
Drafting the initial ER diagram with entities and relationships
Refining the ER diagram
Adding key attributes to the diagram
Adding non-key attributes
Diagramming Generalization Hierarchies
Validating the model through normalization
Adding business and integrity rules to the Model

What is a Snowflake Schema


Snowflake schemas normalize dimensions to eliminate redundancy. That is, the
dimension data has been grouped into multiple tables instead of one large table.
For example, a product dimension table in a star schema might be normalized
into a products table, a product_category table, and a product_manufacturer
table in a snowflake schema. While this saves space, it increases the number of
dimension tables and requires more foreign key joins. The result is more complex
queries and reduced query performance
The snowflake schema is an extension of the star schema, where each point of
the star explodes into more points. The main advantage sometimes cited for the
snowflake schema is minimized disk storage requirements and joins against
smaller lookup tables. The main disadvantage of the snowflake schema is the
additional maintenance effort needed due to the increased number of lookup
tables.
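The product example above can be sketched as follows; splitting the product dimension's category attribute into its own table is the "snowflaking". Table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The star schema's single product dimension, normalized into two tables:
CREATE TABLE product_category (category_key INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE product_dim (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category_key INTEGER REFERENCES product_category(category_key)
);
INSERT INTO product_category VALUES (1, 'Hardware');
INSERT INTO product_dim VALUES (100, 'Widget', 1), (101, 'Gadget', 1);
""")

# The category text is now stored once, but resolving it for a product
# needs one extra join compared with a denormalized star dimension.
rows = conn.execute("""
    SELECT p.product_name, c.category_name
    FROM product_dim p JOIN product_category c USING (category_key)
    ORDER BY p.product_key
""").fetchall()
```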
Star schema
A single fact table with N dimension tables.

Snowflake schema
Any dimension with extended (normalized) dimension tables is known as a
snowflake schema.
Multiple Star (galaxy)
If the schema has more than one fact table, then the schema is said to be a
multiple star.

Differences between star and snowflake schemas


The star schema uses denormalized dimension tables, but the snowflake schema
uses normalized dimensions to avoid redundancy.
The star schema is created when all the dimension tables directly link to the fact
table. Since the graphical representation resembles a star it is called a star
schema. It must be noted that the foreign keys in the fact table link to the
primary key of the dimension table. This sample provides the star schema for a
sales_fact for the year 1998. The dimensions created are Store, Customer,
Product_class and time_by_day. The Product table links to the product_class
table through the primary key and indirectly to the fact table. The fact table
contains foreign keys that link to the dimension tables.
The snowflake schema is a schema in which the fact table is indirectly linked to a
number of dimension tables. The dimension tables are normalized to remove
redundant data and partitioned into a number of dimension tables for ease of
maintenance. An example of the snowflake schema is the splitting of the Product
dimension into the product_category dimension and product_manufacturer
dimension.

What are the Different methods of loading Dimension tables


There are two types: insert, if the row is not there in the dimension, and
update, if it already exists.
Conventional Load:
Before loading the data, all the Table constraints will be checked against the
data.
Direct load:(Faster Loading)
All the Constraints will be disabled. Data will be loaded directly. Later the data will
be checked against the table constraints and the bad data won't be indexed.
The conventional and direct load methods are applicable only to Oracle. The
naming convention is not a general one applicable to other RDBMSs like DB2 or
SQL Server.

How do you load the time dimension


Every data warehouse maintains a time dimension. It would be at the most
granular level at which the business runs (e.g., week, day of the month, and
so on). Depending on the data loads, these time dimensions are updated: a
weekly process gets updated every week, and a monthly process every month.
Generally (in DataStage, for example) we load the time dimension by using a
sequential file as the source stage, and in a transformer stage we manually
write functions such as Month and Year functions to load the time dimension;
for the lowest level, i.e., Day, there is also a function to implement loading
of the time dimension.
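Independent of any particular ETL tool, a day-grain time dimension can simply be generated programmatically, one row per calendar day with pre-computed attributes to drill on. A minimal sketch (column names are illustrative):

```python
from datetime import date, timedelta

def build_time_dimension(start, end):
    """Generate one row per day between start and end, inclusive."""
    rows, d = [], start
    while d <= end:
        rows.append({
            "date_key": d.strftime("%Y%m%d"),  # surrogate-style key, e.g. 20240101
            "year": d.year,
            "month": d.month,
            "day": d.day,
            "weekday": d.strftime("%A"),       # drill attribute: day-of-week name
        })
        d += timedelta(days=1)
    return rows

dim = build_time_dimension(date(2024, 1, 1), date(2024, 1, 7))
```

The same loop can be extended with week, quarter, or fiscal-period attributes as the business requires.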

What are Aggregate tables


Aggregate tables contain redundant data that is summarized from other data in
the warehouse.
These are the tables which contain aggregated / summarized data. E.g. Yearly,
monthly sales information. These tables will be used to reduce the query
execution time.
An aggregate table contains a summary of existing warehouse data, grouped to
certain levels of dimensions. Retrieving the required data from the actual
table, which may have millions of records, takes more time and also affects
server performance. To avoid this we can aggregate the table to a certain
required level and use it. These tables reduce the load on the database server
and increase query performance, so results can be retrieved very quickly.
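A small sketch of the idea, building a monthly aggregate table from a detail fact table (table names and figures are made up): summary queries then read the small aggregate instead of scanning every transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (sale_date TEXT, amount REAL)")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?)", [
    ("2024-01-05", 100.0), ("2024-01-20", 50.0), ("2024-02-01", 75.0),
])

# Pre-compute the monthly summary once; queries hit this small table instead.
conn.execute("""
    CREATE TABLE sales_monthly_agg AS
    SELECT substr(sale_date, 1, 7) AS month, SUM(amount) AS total_amount
    FROM sales_fact
    GROUP BY month
""")
rows = conn.execute(
    "SELECT month, total_amount FROM sales_monthly_agg ORDER BY month"
).fetchall()
```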

What is the Difference between OLTP and OLAP


OLTP
Current data
Short database transactions
Online update/insert/delete
Normalization is promoted
High volume transactions
Transaction recovery is necessary

OLAP
Current and historical data
Long database transactions
Batch update/insert/delete
Denormalization is promoted
Low volume transactions
Transaction recovery is not necessary
OLTP is nothing but OnLine Transaction Processing, which contains normalized
tables and online data with frequent inserts/updates/deletes.
But OLAP (OnLine Analytical Processing) contains the history of OLTP data;
it is non-volatile, acts as a Decision Support System, and is used for
creating forecasting reports.
Index
OLTP : FEW
OLAP : MANY
Joins
OLTP : MANY
OLAP : FEW

What is ETL

ETL is short for Extract, Transform and Load. It is a data integration function
that involves extracting data from outside sources, transforming it to fit
business needs, and ultimately loading it into a data warehouse.
ETL is an abbreviation for "Extract, Transform and Load". This is the process of
extracting data from their operational data sources or external data sources,
transforming the data which includes cleansing, aggregation, summarization,
integration, as well as basic transformation and loading the data into some form
of the data warehouse.

What are the various ETL tools in the Market


1. Informatica Power Center
2. Ascential Data Stage
3. Hyperion Essbase
4. Ab Initio
5. BO Data Integrator
6. SAS ETL
7. MS DTS
8. Oracle OWB
9. Pervasive Data Junction
10. Cognos Decision Stream
11. Sunopsis
12. SQL Loader

What are the various Reporting tools in the Market


1. MS-Excel
2 .Business Objects (Crystal Reports)
3. Cognos (Impromptu, Power Play)
4. Microstrategy
5. MS reporting services
6. Informatica Power Analyzer
7. Actuate
8. Hyperion (BRIO)
9. Oracle Express OLAP
10.Proclarity
11.SAS

What are modeling tools available in the Market


Modeling Tool - Vendor
==============
ERwin - Computer Associates
ER/Studio - Embarcadero
PowerDesigner - Sybase
Oracle Designer - Oracle

What is a dimension table


A dimension table is a collection of hierarchies and categories along which the
user can drill down and drill up. It contains only the textual attributes.
A dimension table in a data warehouse is one which contains a primary key and
attributes; the primary key is called a DIMID (dimension ID).

Dimension tables are nothing but master tables, through which you can interpret
the actual transactions. A dimension table typically contains more columns and
fewer rows than a fact table.
A dimension table is a table which contains the business dimensions through
which we analyze the business metrics.

What is Fact table


A fact table is the central table in a data warehouse whose entries record
measurements of the business; its rows are described by the surrounding
dimension tables.
A fact table in a data warehouse describes the transaction data. It contains
characteristics and key figures. A fact table is a collection of facts and
foreign-key relations to the dimensions.
The fact table contains the measurements, metrics, or facts of a business
process. If your business process is "Sales", then a measurement of this
business process, such as "monthly sales number", is captured in the fact table.
The fact table also contains the foreign keys for the dimension tables.

What is a lookup table


When a table is used to check for some data for its presence prior to loading of
some other data or the same data to another table, the table is called a LOOKUP
Table.
When a value for the column in the target table is looked up from another table
apart from the source tables, that table is called the lookup table.
When we want to get a related value from some other table based on a particular
value: suppose in table A we have two columns, emp_id and name, and in table B
we have emp_id and address. If in the target table we want emp_id, name, and
address, we take table A as the source and table B as the lookup table; by
matching emp_id we get the result as three columns: emp_id, name, address.
A lookup table is nothing but a 'lookup': it gives values to a referenced table
(it is a reference), it is used at run time, and it saves joins and space in
terms of transformations. For example, a lookup table called States provides
the actual state name ('Texas') in place of 'TX' in the output.
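The emp_id/name/address walkthrough above can be sketched in plain Python, with the lookup table held as a dictionary keyed on the match column (names and sample values are hypothetical):

```python
# Source table A supplies emp_id and name; lookup table B supplies address.
table_a = [{"emp_id": 1, "name": "Asha"}, {"emp_id": 2, "name": "Ravi"}]
lookup_b = {1: "12 Park Rd", 2: "34 Lake St"}   # emp_id -> address

# Build the target: for each source row, look up the address by emp_id
# instead of performing a full join against the source tables.
target = [
    {"emp_id": r["emp_id"], "name": r["name"],
     "address": lookup_b.get(r["emp_id"])}
    for r in table_a
]
```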

What is a general purpose scheduling tool
The general purpose of a scheduling tool is to run jobs, such as cleansing and
loading data, at a specific given time.
The basic purpose of a scheduling tool in a DW application is to streamline the
flow of data from source to target at a specific time or based on some condition.

What is Normalization, First Normal Form, Second Normal Form , Third Normal
Form
Normalization can be defined as splitting a table into two or more tables so as
to avoid duplication of values.
Normalization is a step-by-step process of removing redundancies and
dependencies of attributes in a data structure.
The condition of data at the completion of each step is described as a normal form.
Needs for normalization: it improves database design.
Ensures minimum redundancy of data.
Reduces the need to reorganize data when the design is modified or enhanced.
Removes anomalies for database activities.
First normal form:
A table is in first normal form when it contains no repeating groups.
The repeating columns or fields in an unnormalized table are removed from the
table and put into tables of their own.
Such a table becomes dependent on the parent table from which it is derived.
The key to this table is called a concatenated key, with the key of the parent
table forming a part of it.
Second normal form:
A table is in second normal form if all its non-key fields are fully dependent
on the whole key.
This means that each field in the table must depend on the entire key.
Those that do not depend upon the combination key are moved to another table
on whose key they depend.
Structures which do not contain combination keys are automatically in second
normal form.
Third normal form:
A table is said to be in third normal form if all the non-key fields of the
table are independent of all other non-key fields of the same table.
Normalization: The process of decomposing tables to eliminate data redundancy
is called normalization.

1 N.F:- The table should contain scalar or atomic values.


2 N.F:- Table should be in 1 N.F + no partial functional dependencies
3 N.F:- Table should be in 2 N.F + no transitive dependencies
2NF - table should be in 1NF + non-key should not depend on a subset of the
key ({part, supplier}, sup address)
3NF - table should be in 2NF + non-key should not depend on another non-key
({part}, warehouse name, warehouse addr)

(In the examples above, {...} marks the primary key.)
Further:
4,5 NF - for multi-valued dependencies (essentially to describe many-to-many
relations)
Normalization: It is the process of efficiently organizing data in a database.
There are two goals of the normalization process: 1. eliminate redundant data,
and 2. ensure data dependencies make sense (only storing related data in a table).
First Normal Form: it sets the very basic rules for an organized database.
1. Eliminate duplicate columns from the same table. 2. Create separate tables
for each group of related data and identify each row with a unique column or
set of columns.
Second Normal Form: further addresses the concept of removing duplicative data.
1. Remove subsets of data that apply to multiple rows of a table and place them
in separate tables. 2. Create relationships between these new tables and their
predecessors through the use of foreign keys.
Third Normal Form: 1. Remove columns that are not dependent upon the primary key.
Fourth Normal Form: 1. A relation is in 4NF if it has no multi-valued dependencies.
These normalization guidelines are cumulative. For a database to be in 2NF, it
must first fulfill all the criteria of a 1NF database.
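As a toy illustration of the 2NF step described above (sample data invented): in an order-lines table keyed on (order_id, product_id), product_name depends only on product_id, a subset of the key, so it is factored out into its own table.

```python
# Unnormalized order lines: product_name repeats for every order of a product.
order_lines = [
    {"order_id": 1, "product_id": 10, "product_name": "Widget", "qty": 2},
    {"order_id": 2, "product_id": 10, "product_name": "Widget", "qty": 5},
]

# products: one row per product_id (the partial dependency, factored out).
products = {r["product_id"]: r["product_name"] for r in order_lines}

# order_lines_2nf: only attributes dependent on the full key remain.
order_lines_2nf = [
    {"order_id": r["order_id"], "product_id": r["product_id"], "qty": r["qty"]}
    for r in order_lines
]
```

The product name is now stored once, so renaming a product means updating one row instead of every order line.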

What type of Indexing mechanism do we need to use for a typical Datawarehouse

On the fact table it is best to use bitmap indexes. Dimension tables can use
bitmap and/or the other types of clustered/non-clustered, unique/non-unique
indexes.
To my knowledge, SQLServer does not support bitmap indexes. Only Oracle
supports bitmaps.
It generally depends upon the data you have in the table: if a particular
column has few distinct values, it is always better to build a bitmap index on
it rather than another type. On dimension tables we generally have (regular)
indexes.
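A toy model of why bitmap indexes suit low-cardinality columns (this is not how any database physically stores them): each distinct value gets one bit per row, and an AND of two equality predicates becomes a single bitwise AND over the bitmaps.

```python
# Two low-cardinality columns over four fact rows (values are made up).
rows_region = ["east", "west", "east", "east"]
rows_status = ["open", "open", "closed", "open"]

def bitmap(values, target):
    """Build an integer bitmap: bit i is set when row i equals target."""
    bits = 0
    for i, v in enumerate(values):
        if v == target:
            bits |= 1 << i
    return bits

# WHERE region = 'east' AND status = 'open'  ->  AND the two bitmaps.
matches = bitmap(rows_region, "east") & bitmap(rows_status, "open")
match_rows = [i for i in range(len(rows_region)) if matches >> i & 1]
```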

Which columns go to the fact table and which columns go to the dimension table
The aggregation or calculated-value columns go to the fact table, and the
detail information goes to the dimension table.
To add on: foreign-key elements, along with business measures, such as sales in
$ amount or units (qty sold), are stored in the fact table; a date may be a
business measure in some cases. It also depends on the granularity at which the
data is stored.

What is a level of Granularity of a fact table


Level of granularity means the level of detail that you put into the fact table
in a data warehouse. For example, based on the design you can decide to record
the sales data for each transaction. Level of granularity then means what
detail you are willing to record for each transactional fact: product sales
with respect to each second, say, or aggregated up to the minute.

It also means that we can have (for example) data aggregated for a year for a
given product, while the same data can be drilled down to a monthly, weekly and
daily basis. The lowest level is known as the grain; going down to details is
granularity.

What does level of Granularity of a fact table signify


It indirectly determines the amount of space required for the database: the
finer the grain, the more rows must be stored.
The level of granularity indicates the extent of aggregation that will be
permitted to take place on the fact data. Finer granularity implies more
aggregation potential, and vice versa.
In simple terms, the level of granularity defines the extent of detail. As an
example, let us look at geographical levels of granularity: we may analyze data
at the levels of COUNTRY, REGION, TERRITORY, CITY and STREET. In this case, we
say the finest level of granularity is STREET.
Level of granularity also means the upper/lower level of the hierarchy up to
which we can see/drill the data in the fact table.
Granularity is nothing but the level at which measures and metrics are
represented.
The lowest level is called detailed data
and the highest level is called summary data.
The appropriate grain of the fact table depends on the project.

How are the Dimension tables designed


Most dimension tables are designed using normalization principles up to 2NF. In
some instances they are further normalized to 3NF.
Find where data for this dimension are located.
Figure out how to extract this data.
Determine how to maintain changes to this dimension (see more on this in the
next section).
Change fact table and DW population routines.

What are slowly changing dimensions


Dimensions that change over time are called Slowly Changing Dimensions. For
instance, a product price changes over time; People change their names for some
reason; Country and State names may change over time. These are a few
examples of Slowly Changing Dimensions since some changes are happening to
them over a period of time
If the data in a dimension table happens to change very rarely, it is called
a slowly changing dimension.
e.g.: changes to the name and address of a person, which happen rarely.
While handling a slowly changing dimension, the dimension schema might be
required to change; it depends on the business requirement.
E.g., the dimension table Product has Product ID and Price. If the price
changes and we simply update the price in the dimension, we might end up losing
history data. In this case we can add one column as Date of Change, so if the
price changes for a given date, one record gets added to the dimension while
keeping the history intact.

What is SCD1 , SCD2 , SCD3


SCD 1: Complete overwrite
SCD 2: Preserve all history. Add row
SCD 3: Preserve some history. Add additional column for old/new.
SCD Type 1: the attribute value is overwritten with the new value, obliterating
the historical attribute values. For example, when the product roll-up changes
for a given product, the roll-up attribute is merely updated with the current
value.
SCD Type 2: a new record with the new attributes is added to the dimension
table. Historical fact table rows continue to reference the old dimension key
with the old roll-up attribute; going forward, the fact table rows will
reference the new surrogate key with the new roll-up, thereby perfectly
partitioning history.
SCD Type 3: attributes are added to the dimension table to support two
simultaneous roll-ups - perhaps the current product roll-up as well as the
current version minus one, or the current version and the original.
SCD: dimension values that change very rarely are called slowly changing
dimensions.
There are mainly three types:
1) SCD1: replace the old values (overwrite with the new values)
2) SCD2: just create additional records
3) SCD3: maintain just the previous and recent values
Within SCD2 there are again three approaches:
1) Versioning
2) Flag value
3) Effective date range
Versioning: the updated dimension rows are inserted into the target along with
a version number; new dimension rows are inserted into the target along with a
primary key.
Flag value: the updated dimension rows are inserted into the target with flag 0,
and new dimension rows are inserted into the target with flag 1.
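The Product/Price example above, in the effective-date-range variant of SCD2, can be sketched as follows (column names invented): a price change closes out the current row and appends a new one, so history is kept intact.

```python
from datetime import date

def scd2_update(dim_rows, product_id, new_price, change_date):
    """Apply an SCD Type 2 change using effective date ranges."""
    for row in dim_rows:
        # The 'current' version of a product is the row with no end date.
        if row["product_id"] == product_id and row["end_date"] is None:
            if row["price"] == new_price:
                return dim_rows            # nothing changed, keep as-is
            row["end_date"] = change_date  # close out the current version
    # Append the new version; it becomes current (open-ended date range).
    dim_rows.append({"product_id": product_id, "price": new_price,
                     "start_date": change_date, "end_date": None})
    return dim_rows

dim = [{"product_id": 1, "price": 9.99,
        "start_date": date(2024, 1, 1), "end_date": None}]
scd2_update(dim, 1, 12.49, date(2024, 6, 1))
```

Old fact rows still join to the closed-out row via its date range, while new facts pick up the current version.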

What is degenerate dimension table?


The values of a dimension which are stored in the fact table are called
degenerate dimensions. These dimensions don't have dimension tables of their
own.
It is an attribute in the fact table that is neither a fact nor a key value.
In simple terms, it is a column in a fact table that does not map to any
dimension, nor is it a measure column.
e.g., Invoice_no and Invoice_line_no in a fact table will be degenerate
dimension columns, provided you don't have a dimension called Invoice.
Degenerate dimensions: if a fact table contains values which are neither
dimensions nor measures, they are called degenerate dimensions. Ex: invoice id,
empno.

What are conformed dimensions


If a dimension is 100% sharable across star schemas, then it is called a
conformed dimension.

They are dimension tables in a star schema data mart that adhere to a common
structure, and therefore allow queries to be executed across star schemas. For
example, the Calendar dimension is commonly needed in most data marts. By
making this Calendar dimension adhere to a single structure, regardless of what
data mart it is used in your organization, you can query by date/time from one
data mart to another to another.
Conformed dimensions are dimensions which are common to multiple cubes (cubes
are the schemas containing fact and dimension tables). Consider Cube-1
containing F1, D1, D2, D3 and Cube-2 containing F2, D1, D2, D4 as the facts and
dimensions; here D1 and D2 are the conformed dimensions.
If a table is used as a dimension table for more than one fact table, then the
dimension table is called a conformed dimension.
Conformed Dimensions are the one if they share one or more attributes whose
values are drawn from the same domains.
A conformed dimension is a single, coherent view of the same piece of data
throughout the organization. The same dimension is used in all subsequent star
schemas defined. This enables reporting across the complete data warehouse in a
simple format

What are non-additive facts


Non-additive facts are facts that cannot be summed up for any of
the dimensions present in the fact table. Examples: temperature, bill number, etc.
A fact table typically has two types of columns: those that contain numeric
facts (often called measurements), and those that are foreign keys to dimension
tables.
A fact table contains either detail-level facts or facts that have been aggregated.
Fact tables that contain aggregated facts are often called summary tables. A fact
table usually contains facts with the same level of aggregation.
Though most facts are additive, they can also be semi-additive or non-additive.
Additive facts can be aggregated by simple arithmetical addition. A common
example of this is sales. Non-additive facts cannot be added at all.
An example of this is averages. Semi-additive facts can be aggregated along
some of the dimensions and not along others. An example of this is inventory
levels, where you cannot tell what a level means simply by looking at it.
If the columns of a fact table cannot be aggregated, they are called
non-additive facts.
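The averages example can be made concrete with illustrative-only numbers: an average is non-additive, so aggregating pre-computed averages gives the wrong answer unless the underlying sums and counts are kept.

```python
# Illustrative-only numbers: per-region average order value. An average is
# a classic non-additive fact -- averaging the averages differs from the
# true overall average whenever group sizes differ.
region_a = [10.0, 20.0, 30.0]   # average 20.0
region_b = [40.0]               # average 40.0

avg_of_avgs = (sum(region_a) / len(region_a)
               + sum(region_b) / len(region_b)) / 2
true_avg = (sum(region_a) + sum(region_b)) / (len(region_a) + len(region_b))

print(avg_of_avgs)  # 30.0 -- wrong
print(true_avg)     # 25.0 -- correct
```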

What are Semi-additive and factless facts and in which scenario will you use
such kinds of fact tables
Semi-Additive: Semi-additive facts are facts that can be summed up for some of
the dimensions in the fact table, but not the others. For example:

Current_Balance and Profit_Margin are the facts. Current_Balance is a
semi-additive fact: it makes sense to add balances up across all accounts
(what's the total current balance for all accounts in the bank?), but it does
not make sense to add them up through time (adding up all current balances for
a given account for each day of the month does not give us any useful
information).
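The Current_Balance behaviour can be sketched with made-up numbers: summing balances across accounts on one day is meaningful, while summing one account's balance across days is not.

```python
# Illustrative numbers for the Current_Balance example: a semi-additive
# fact adds up along the account dimension but not along time.
balances = {                        # balances[day][account]
    "day1": {"acct_a": 100.0, "acct_b": 50.0},
    "day2": {"acct_a": 120.0, "acct_b": 50.0},
}
# Additive across accounts: total money held on day2.
total_day2 = sum(balances["day2"].values())
print(total_day2)                   # 170.0 -- meaningful

# NOT additive across time: this number answers no business question.
bogus = balances["day1"]["acct_a"] + balances["day2"]["acct_a"]
print(bogus)                        # 220.0 -- meaningless
```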
A factless fact table captures the many-to-many relationships between
dimensions, but contains no numeric or textual facts. They are often used to
record events or coverage information. Common examples of factless fact tables
include:
- Identifying product promotion events (to determine promoted products
that didn't sell)
- Tracking student attendance or registration events
- Tracking insurance-related accident events
- Identifying building, facility, and equipment schedules for a hospital or
university
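The student-attendance case above can be sketched like this, with hypothetical table names: the fact table holds only foreign keys and no measures, and questions are answered by counting rows.

```python
import sqlite3

# Hypothetical attendance mart: the fact table records only the event keys
# (student, class, date) with no measure columns -- a factless fact table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_attendance (
    student_key INTEGER,
    class_key   INTEGER,
    date_key    INTEGER   -- no numeric facts at all
);
INSERT INTO fact_attendance VALUES (1, 10, 20240101), (2, 10, 20240101),
                                   (1, 10, 20240102);
""")
# "How many attendances did class 10 get?" is answered by counting rows.
count = conn.execute(
    "SELECT COUNT(*) FROM fact_attendance WHERE class_key = 10"
).fetchone()[0]
print(count)  # 3
```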

Why are OLTP database designs not generally a good idea for a Data
Warehouse
An OLTP database is not designed to store historical information about the
organization. It is used for storing the details of daily transactions, while a
data warehouse is a huge store of historical information, obtained from
different data marts, used for making intelligent decisions about the
organization.

Why should you put your data warehouse on a different system than your OLTP
system
OLTP stands for on-line transaction processing.

OLTP systems are used to store only daily transactions, as the changes have to
be made in as few places as possible. OLTP systems do not hold historical data
of the organization; the data warehouse contains the organization's historical
information.
Data Warehouse is a part of OLAP (On-Line Analytical Processing). It is the
source from which any BI tools fetch data for Analytical, reporting or data mining
purposes. It generally contains the data through the whole life cycle of the
company/product. A DWH contains historical, integrated, denormalized,
subject-oriented data.
However, on the other hand the OLTP system contains data that is generally
limited to last couple of months or a year at most. The nature of data in OLTP is:
current, volatile and highly normalized. Since, both systems are different in
nature and functionality we should always keep them in different systems.
A DW is typically used most often for intensive querying. Since the primary
responsibility of an OLTP system is to faithfully record ongoing transactions
(inserts/updates/deletes), these operations would be considerably slowed down
by the heavy querying that the DW is subjected to.

Explain the advantages of RAID 1, 1/0, and 5. What type of RAID setup would
you put your TX logs on?
Raid 0 - Makes several physical hard drives look like one hard drive. No
redundancy but very fast. May be used for temporary spaces where loss of the
files will not result in loss of committed data.
Raid 1 - Mirroring. Each hard drive in the drive array has a twin, and each twin
holds an exact copy of the other twin's data, so if one hard drive fails, the
other is used to pull the data. Raid 1 writes at half the speed of Raid 0, but
read and write performance are good.
Raid 1/0 - Striped Raid 0, then mirrored Raid 1. Similar to Raid 1, and
sometimes faster; it depends on the vendor implementation.
Raid 5 - Great for read-only systems. Write performance is a third that of
Raid 1, but reads are the same as Raid 1. Raid 5 is great for a DW but not good
for OLTP.
Hard drives are cheap now, so I always recommend Raid 1 for TX logs.

Is it correct/feasible to develop a Data Mart using an ODS?


The ODS is technically designed to be used as the feeder for the DW and other
DM's -- yes. It is to be the source of truth.
Read the complete thread at
http://asktom.oracle.com/pls/ask/f?p=4950:8:16165205144590546310::NO::F4950_P8_DISPLAYID,F4950_P8_CRITERIA:30801968442845,

What is a CUBE in datawarehousing concept?


Cubes are logical representations of multidimensional data. The edges of the
cube contain dimension members and the body of the cube contains data values.
A cube is a logical schema which contains facts and dimensions.
Cubes are multi-dimensional views of the DW or data marts, designed in a
logical way for drilling and slicing-and-dicing. Every cell of the cube is a
logical representation of a combination of fact and dimension attributes.

What is the main difference between schemas in an RDBMS and schemas in a
Data Warehouse?
RDBMS Schema
- Used for OLTP systems
- Traditional and old schema
- Normalized
- Difficult to understand and navigate
- Cannot easily solve extraction and complex problems
- Poorly modelled

DWH Schema
- Used for OLAP systems
- New-generation schema
- Denormalized
- Easy to understand and navigate
- Extraction and complex problems can be easily solved
- Very good model

What is meant by metadata in the context of a Data Warehouse, and why is it
important?
Metadata is data about data. Examples of metadata include data element
descriptions, data type descriptions, attribute/property descriptions,
range/domain descriptions, and process/method descriptions. The repository
environment encompasses all corporate metadata resources: database catalogs,
data dictionaries, and navigation services. Metadata includes things like the
name, length, valid values, and description of a data element. Metadata is
stored in a data dictionary and repository. It insulates the data warehouse
from changes in the schema of operational systems.
Metadata synchronization is the process of consolidating, relating and
synchronizing data elements with the same or similar meaning from different
systems; it joins these differing elements together in the data warehouse to
allow for easier access.
In the context of a data warehouse, metadata means the information about the
data. This information is stored in the designer repository.
Metadata is the data about data; a business analyst or data modeler usually
captures information about data - the source (where and how the data
originated), the nature of the data (char, varchar, nullable, existence, valid
values, etc.) and the behavior of the data (how it is modified/derived and its
life cycle) - in a data dictionary, a.k.a. metadata. Metadata is also present at
the data mart level: subsets, facts and dimensions, ODS, etc. For a DW user,
metadata provides vital information for analysis/DSS.

What is a linked cube?


A cube can be stored on a single analysis server and then defined as a linked
cube on other Analysis servers. End users connected to any of these analysis
servers can then access the cube. This arrangement avoids the more costly
alternative of storing and maintaining copies of a cube on multiple analysis
servers. linked cubes can be connected using TCP/IP or HTTP. To end users a
linked cube looks like a regular cube.
A cube can be partitioned in 3 ways: replicated, transparent and linked.
In a linked cube the data cells can be linked to another analytical database.
If an end user clicks on a data cell, you are actually linking through to
another analytic database.
A linked cube is one in which a subset of the data can be analysed in great
detail; the linking ensures that the data in the cubes remains consistent.
Partitioning a cube is mainly used for optimization. For example, you may have
5 GB of data; to create a report you can specify a cube size of 2 GB, so if the
cube exceeds 2 GB a second cube is automatically created to store the data.

What is a surrogate key? Where do we use it? Explain with examples.


A surrogate key is the primary key for a dimension table.

A surrogate key is a substitution for the natural primary key.

It is just a unique identifier or number for each row that can be used for the
primary key to the table. The only requirement for a surrogate primary key is
that it is unique for each row in the table.
Data warehouses typically use a surrogate key (also known as an artificial or
identity key) for the dimension tables' primary keys. They can use the
Informatica sequence generator, an Oracle sequence, or SQL Server identity
values to generate the surrogate key.
It is useful because the natural primary key (e.g. Customer Number in the
Customer table) can change, and this makes updates more difficult.
Some tables have columns such as AIRPORT_NAME or CITY_NAME which are
stated as the primary keys (according to the business users), but not only can
these change, indexing on a numerical value is probably better, so you could
consider creating a surrogate key called, say, AIRPORT_ID. This would be
internal to the system, and as far as the client is concerned you may display
only the AIRPORT_NAME.
Another benefit you can get from surrogate keys (SIDs) is tracking SCDs -
Slowly Changing Dimensions.
Let me give you a simple, classical example:
On the 1st of January 2002, Employee 'E1' belongs to Business Unit 'BU1'
(that's what would be in your Employee dimension). This employee has turnover
allocated to him on Business Unit 'BU1'. But on the 2nd of June, Employee 'E1'
is moved from Business Unit 'BU1' to Business Unit 'BU2'. All the new turnover
has to belong to the new Business Unit 'BU2', but the old turnover should
belong to Business Unit 'BU1'.
If you used the natural business key 'E1' for your employee within your
data warehouse, everything would be allocated to Business Unit 'BU2', even what
actually belongs to 'BU1'.
If you use surrogate keys, you could create on the 2nd of June a new record for
the Employee 'E1' in your Employee Dimension with a new surrogate key.
This way, in your fact table, you have your old data (before 2nd of June) with the
SID of the Employee 'E1' + 'BU1.' All new data (after 2nd of June) would take the
SID of the employee 'E1' + 'BU2.'
You could consider a Slowly Changing Dimension as an enlargement of your
natural key: the natural key of the employee was Employee Code 'E1', but for
you it becomes Employee Code + Business Unit - 'E1' + 'BU1' or 'E1' + 'BU2'.
The difference from the natural-key enlargement process is that you might not
have every part of your new key within your fact table, so you might not be
able to join on the new enlarged key - hence you need another ID.
When creating a dimension table in a data warehouse, we generally create the
table with a system-generated key to uniquely identify a row in the dimension.

This key is also known as a surrogate key. The surrogate key is used as the
primary key in the dimension table. The surrogate key will also be placed in the
fact table and a foreign key will be defined between the two tables. When you
ultimately join the data it will join just as any other join within the database.
A surrogate key is a unique identification key; it is like an artificial or
alternative key to the production key, because the production key may be an
alphanumeric or composite key, while the surrogate key is always a single
numeric key.
Assume the production key is an alphanumeric field. If you create an index on
this field it will occupy more space, so it is not advisable to join or index
on it, because generally data warehousing fact tables hold historical data and
are linked with many dimension tables. If it is a numeric field, performance
is higher.
A surrogate key is any column or set of columns that can be declared as the
primary key instead of a "real" or natural key. Sometimes there can be several
natural keys that could be declared as the primary key, and these are all
called candidate keys; so a surrogate key is also a candidate key. A table
could actually have more than one surrogate key, although this would be
unusual. The most common type of surrogate key is an incrementing integer,
such as an auto_increment column in MySQL, a sequence in Oracle, or an
identity column in SQL Server.
Use of surrogate keys: every join between dimension tables and fact tables
in a data warehouse environment should be based on surrogate keys, not natural
keys. It is up to the data extract logic to systematically look up and replace
every incoming natural key with a data warehouse surrogate key each time either
a dimension record or a fact record is brought into the data warehouse
environment.
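A minimal sketch of the 'E1'/'BU1' scenario described above, with hypothetical table layouts: each change to the employee inserts a new dimension row with a fresh surrogate key, so old facts keep pointing at the old version and turnover splits correctly between business units.

```python
import sqlite3

# Hypothetical SCD type-2 sketch: emp_sid is the surrogate key; emp_code
# 'E1' is the natural key that stays the same across versions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_employee (
    emp_sid       INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    emp_code      TEXT,                               -- natural key 'E1'
    business_unit TEXT
);
CREATE TABLE fact_turnover (emp_sid INTEGER, amount REAL);
""")
# Version 1 of E1 (before 2nd of June): Business Unit BU1, gets sid 1.
conn.execute("INSERT INTO dim_employee (emp_code, business_unit)"
             " VALUES ('E1', 'BU1')")
conn.execute("INSERT INTO fact_turnover VALUES (1, 500.0)")
# On the change date a NEW row (new surrogate key, sid 2) is inserted for BU2.
conn.execute("INSERT INTO dim_employee (emp_code, business_unit)"
             " VALUES ('E1', 'BU2')")
conn.execute("INSERT INTO fact_turnover VALUES (2, 300.0)")

# Old turnover stays with BU1, new turnover with BU2:
rows = conn.execute("""
    SELECT d.business_unit, SUM(f.amount)
    FROM fact_turnover f JOIN dim_employee d ON f.emp_sid = d.emp_sid
    GROUP BY d.business_unit ORDER BY d.business_unit
""").fetchall()
print(rows)  # [('BU1', 500.0), ('BU2', 300.0)]
```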

What are data validation strategies for data mart validation after the loading
process?
Data validation is making sure that the loaded data is accurate and meets the
business requirements.
Strategies are the different methods followed to meet the validation
requirements.

What is Data warehousing Hierarchy?


Hierarchies
Hierarchies are logical structures that use ordered levels as a means of organizing
data. A hierarchy can be used to define data aggregation. For example, in a time
dimension, a hierarchy might aggregate data from the month level to the
quarter level to the year level. A hierarchy can also be used to define a
navigational drill path and to establish a family structure.
Within a hierarchy, each level is logically connected to the levels above and below
it. Data values at lower levels aggregate into the data values at higher levels. A
dimension can be composed of more than one hierarchy. For example, in the
product dimension, there might be two hierarchies--one for product categories
and one for product suppliers.
Dimension hierarchies also group levels from general to granular. Query tools use
hierarchies to enable you to drill down into your data to view different levels of
granularity. This is one of the key benefits of a data warehouse.
When designing hierarchies, you must consider the relationships in business
structures. For example, a divisional multilevel sales organization.
Hierarchies impose a family structure on dimension values. For a particular level
value, a value at the next higher level is its parent, and values at the next lower

level are its children. These familial relationships enable analysts to access data
quickly.
Levels
A level represents a position in a hierarchy. For example, a time dimension might
have a hierarchy that represents data at the month, quarter, and year levels.
Levels range from general to specific, with the root level as the highest or most
general level. The levels in a dimension are organized into one or more
hierarchies.
Level Relationships
Level relationships specify top-to-bottom ordering of levels from most general
(the root) to most specific information. They define the parent-child relationship
between the levels in a hierarchy.
Hierarchies are also essential components in enabling more complex rewrites. For
example, the database can aggregate an existing sales revenue on a quarterly
base to a yearly aggregation when the dimensional dependencies between
quarter and year are known.
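The month-to-quarter-to-year hierarchy described above can be sketched as follows, with a hypothetical time dimension: the same fact rows roll up to whichever level of the hierarchy the query groups by.

```python
import sqlite3

# Hypothetical time dimension carrying the hierarchy levels as columns:
# month -> quarter -> year. Grouping by a higher level aggregates the
# same sales rows at coarser granularity.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time (
    time_key INTEGER PRIMARY KEY,
    month    TEXT, quarter TEXT, year INTEGER
);
CREATE TABLE fact_sales (time_key INTEGER, amount REAL);
INSERT INTO dim_time VALUES (1, '2024-01', 'Q1', 2024),
                            (2, '2024-02', 'Q1', 2024),
                            (3, '2024-04', 'Q2', 2024);
INSERT INTO fact_sales VALUES (1, 10.0), (2, 20.0), (3, 30.0);
""")
by_quarter = conn.execute("""
    SELECT t.quarter, SUM(f.amount) FROM fact_sales f
    JOIN dim_time t ON f.time_key = t.time_key
    GROUP BY t.quarter ORDER BY t.quarter
""").fetchall()
by_year = conn.execute("""
    SELECT t.year, SUM(f.amount) FROM fact_sales f
    JOIN dim_time t ON f.time_key = t.time_key
    GROUP BY t.year
""").fetchall()
print(by_quarter)  # [('Q1', 30.0), ('Q2', 30.0)]
print(by_year)     # [(2024, 60.0)]
```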

what is BUS Schema?


A BUS schema is composed of a master suite of conformed dimensions and
standardized definitions of facts.
A BUS schema or a BUS matrix? A BUS matrix (in the Kimball approach) is used
to identify common dimensions across business processes, i.e. a way of
identifying conforming dimensions.

What are the methodologies of Data Warehousing?


There are 2 methodologies: 1) Kimball - first data marts, then the DWH;
2) Inmon - first the DWH, then data marts from the DWH.
Regarding the methodologies in data warehousing, there are mainly 2 models:
1. The Ralph Kimball model
2. The Inmon model
The Kimball model is always structured as a denormalized structure.
The Inmon model is structured as a normalized structure.
Depending on its requirements, a company will choose one of the above models
for its DWH.
Data warehousing has two methods:
1. The top-down method
2. The bottom-up method
In the top-down method, the data warehouse is loaded first and the data marts
are then derived from it. In the bottom-up method, the data marts are loaded
first and then combined into the data warehouse.
The top-down approach means preparing the individual departments' data
(data marts) from the enterprise data warehouse.
The bottom-up approach is first gathering each department's data, then
cleansing and transforming that data, and loading all the individual
departments' data into the enterprise data warehouse.

What is conformed fact?


Conformed facts are facts that are allowed to have the same name in separate
tables and can be combined and compared mathematically.
If the facts and dimensions of several schemas are related consistently and
can work with any type of join, the schema is sometimes called a conformed
schema, and its member facts are called conformed facts.

What is the difference between E-R Modeling and Dimensional Modeling?


The basic difference is that E-R modeling has a logical and a physical model,
while the dimensional model has only a physical model.
E-R modeling is used for normalizing the OLTP database design; dimensional
modeling is used for de-normalizing the ROLAP/MOLAP design.
E-R modelling revolves around the entities and their relationships, to capture
the overall process of the system. The dimensional (multi-dimensional) model
revolves around dimensions (points of analysis) for decision making, not
around capturing the process.
In E-R modeling the data is in normalised form, so there are more joins, which
may adversely affect system performance. In dimensional modelling the data is
denormalised, so there are fewer joins, and system performance improves.

Why is the fact table in normal form?


A fact table consists of measurements of business requirements and foreign
keys to dimension tables, as per the business rules.
Basically the fact table consists of the index keys of the dimension/lookup
tables and the measures.
So whenever we have only keys and measures in a table, that itself implies
that the table is in normal form.
Being in normal form, more granularity is achieved with less coding, i.e.
fewer joins while retrieving the facts.

What is the definition of a normalized and a denormalized view, and what are
the differences between them?
Normalization is the process of removing redundancies.
Denormalization is the process of allowing redundancies.
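A tiny illustration of both definitions, with made-up rows: the normalized form stores each customer's city once, while the denormalized form repeats it on every order row, trading redundancy for fewer joins at query time.

```python
# Normalized: the city is stored once per customer, never repeated.
customers = {1: "Berlin", 2: "Paris"}       # customer_id -> city
orders = [(101, 1), (102, 1), (103, 2)]     # (order_id, customer_id)

# Denormalized view of the same data: the city is repeated on every
# order row (the redundancy that denormalization deliberately allows).
denormalized = [(oid, cid, customers[cid]) for oid, cid in orders]
print(denormalized)
# [(101, 1, 'Berlin'), (102, 1, 'Berlin'), (103, 2, 'Paris')]
```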

What is junk dimension?


What is the difference between junk dimension and degenerated dimension?
A junk dimension is a collection of random transactional codes, flags and text
attributes that are unrelated to any particular dimension. The junk dimension
is simply a structure that provides a convenient place to store these junk
attributes.
A "junk" dimension is a collection of random transactional codes, flags and/or
text attributes that are unrelated to any particular dimension; it is simply a
structure that provides a convenient place to store the junk attributes. A
degenerate dimension, on the other hand, is data that is dimensional in nature
but stored in a fact table.
Junk dimension: grouping random flags and text attributes in a dimension and
moving them to a separate sub-dimension.
Degenerate dimension: keeping the control information in the fact table.
Example: consider a dimension table with fields like order number and order
line number that has a 1:1 relationship with the fact table. In this case the
dimension is removed and the order information is stored directly in the fact
table, in order to eliminate unnecessary joins when retrieving order
information.
Junk dimension:
columns that are used rarely or not at all are grouped together to form a
dimension, called a junk dimension.
Degenerate dimension:
a dimension-like column carried in the fact table itself. For example, the emp
table has empno, ename, sal, job, deptno; if we take only the columns empno
and ename from the emp table into the fact table without forming a separate
dimension, these act as a degenerate dimension.
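The junk-dimension idea can be sketched with hypothetical flags: instead of one tiny dimension per flag, every distinct combination of the flags becomes one row of a single junk dimension.

```python
from itertools import product

# Hypothetical low-cardinality flags that belong to no existing dimension.
# Rather than three one-column dimensions, each distinct combination of
# flag values becomes one row of a single junk dimension.
gift_flags = ["Y", "N"]
pay_methods = ["CASH", "CARD"]

junk_dim = [
    {"junk_key": i + 1, "is_gift": g, "pay_method": p}
    for i, (g, p) in enumerate(product(gift_flags, pay_methods))
]
print(len(junk_dim))  # 4 -- one row per combination
print(junk_dim[0])    # {'junk_key': 1, 'is_gift': 'Y', 'pay_method': 'CASH'}
```

A fact row then carries a single junk_key instead of several flag columns.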

What is the main difference between Inmon and Kimball philosophies of data
warehousing?
Basically speaking, Inmon professes the Snowflake Schema while Kimball relies
on the Star Schema
Both differ in their concept of building the data warehouse.
According to Kimball ...
Kimball views data warehousing as a constituency of data marts. Data marts are
focused on delivering business objectives for departments in the organization.
And the data warehouse is made up of the conformed dimensions of the data
marts. Hence a unified view of the enterprise can be obtained from the
dimensional modeling done at a local, departmental level.
Inmon believes in creating a data warehouse on a subject-by-subject-area basis.
Hence the development of the data warehouse can start with data from the
online store. Other subject areas can be added to the data warehouse as their
needs arise. Point-of-sale (POS) data can be added later if management decides
it is necessary.
i.e.,
Kimball: first data marts -> combined -> data warehouse
Inmon: first data warehouse -> later -> data marts
The main difference between the Kimball and Inmon approaches is:
Kimball - create data marts first, then combine them to form a data warehouse.
Inmon - create the data warehouse first, then the data marts.
Actually, the main difference is that Kimball follows dimensional modelling
while Inmon follows ER modelling.
Ralph Kimball follows a bottom-up approach, i.e. first create individual data
marts from the existing sources and then create the data warehouse.

Bill Inmon follows a top-down approach, i.e. first create the data warehouse
from the existing sources and then create the individual data marts.

What is the difference between view and materialized view


View - stores the SQL statement in the database and lets you use it as a table.
Every time you access the view, the SQL statement executes.
Materialized view - stores the result of the SQL in table form in the database.
The SQL statement executes only once, and after that, every time you run the
query the stored result set is used. Pros include quick query results.
VIEW: this is a pseudo table that is not stored in the database; it is just a
query. MATERIALIZED VIEW: this is similar to a view, but it is permanently
stored in the database and refreshed periodically. It is used as an
optimization for faster data retrieval and is useful for aggregation and
summarization of data.
Normal views are for access purposes, i.e. when one needs to give the end user
access to specific data.
Materialized views are for optimizing performance. When materialized views are
used in queries, the stored results are fetched (instead of fetching from the
individual tables), which improves performance. A refresh option can be set
for a materialized view.
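The difference can be demonstrated in SQLite, which has no materialized views, so this sketch simulates one with CREATE TABLE AS: the plain view re-runs its query on every access, while the "materialized" copy stays frozen until it is refreshed.

```python
import sqlite3

# Simulated comparison (hypothetical sales table): v_totals is a real view;
# mv_totals stands in for a materialized view by snapshotting the result.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, amount REAL);
INSERT INTO sales VALUES ('EU', 10.0), ('US', 20.0);
CREATE VIEW v_totals AS
    SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
CREATE TABLE mv_totals AS
    SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
""")
conn.execute("INSERT INTO sales VALUES ('EU', 5.0)")  # new data arrives

view_total = conn.execute(
    "SELECT total FROM v_totals WHERE region = 'EU'").fetchone()[0]
mv_total = conn.execute(
    "SELECT total FROM mv_totals WHERE region = 'EU'").fetchone()[0]
print(view_total)  # 15.0 -- the view re-executes and sees the new row
print(mv_total)    # 10.0 -- the materialized copy is stale until refreshed
```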

What are the advantages data mining over traditional approaches?


Data mining is used for estimating the future. For example, if we take a
company or business organization, by using data mining we can predict the
future of the business in terms of revenue, employees, customers, orders, etc.
Traditional approaches use simple algorithms for estimating the future, but
they do not give results as accurate as data mining.

What are the steps to build the Datawarehouse

Gather business requirements
Identify sources
Identify facts
Define dimensions
Define attributes
Redefine dimensions & attributes
Organise the attribute hierarchy & define relationships
Assign unique identifiers
Additional conventions: cardinality / adding ratios