
FALL 2013 ASSIGNMENT

NAME: SHUBHRA MOHAN MUKHERJEE

ROLL NO.: 511219019
PROGRAM: BACHELOR OF COMPUTER APPLICATION
SEMESTER: 6TH SEM
SUBJECT CODE & NAME: BC0058 DATA WAREHOUSING


Q1. Differentiate between OLTP and Data Warehouse.

Ans.

Data Warehouse: Data Warehouses and Data Warehouse applications are designed primarily to
support executives, senior managers, and business analysts in making complex business decisions.
Data Warehouse applications provide the business community with access to accurate consolidated
information from various internal and external sources. The goal of using a Data Warehouse is to have
an efficient way of managing information and analyzing data.

OLTP Systems: OLTP (Online Transaction Processing) systems handle day-to-day business
transactions. Examples are Railway Reservation Systems, Online Store Purchases, etc. These systems
handle tremendous amounts of data daily.

Differences between OLTP and Data Warehouse

Application databases are OLTP (On-Line Transaction Processing) systems where every
transaction has to be recorded as and when it occurs. Consider the scenario where a bank
ATM has disbursed cash to a customer but was unable to record this event in the bank records.
If this happens frequently, the bank wouldn't stay in business for too long. So the banking
system is designed to make sure that every transaction gets recorded within the time you stand
before the ATM machine.
A Data Warehouse (DW), on the other hand, is a database (yes, you are right, it's a database)
that is designed for facilitating querying and analysis. Often designed as OLAP (On-Line
Analytical Processing) systems, these databases contain read-only data that can be queried
and analyzed far more efficiently as compared to your regular OLTP application databases.
In this sense an OLAP system is designed to be read-optimized.
Separation from your application database also ensures that your business intelligence solution
is scalable (your bank and ATMs don't go down just because the CFO asked for a report), better
documented and managed.
Creation of a DW leads to a direct increase in quality of analysis as the table structures are
simpler (you keep only the needed information in simpler tables), standardized (well-
documented table structures), and often de-normalized (to reduce the linkages between
tables and the corresponding complexity of queries). Having a well-designed DW is the
foundation upon which successful BI (Business Intelligence)/Analytics initiatives are built.
Data Warehouses usually store many months or years of data. This is to support historical
analysis. OLTP systems usually store data from only a few weeks or months. The OLTP system
stores only historical data as needed to successfully meet the requirements of the current
transaction.
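The contrast between narrow, write-heavy OLTP work and broad, read-only analytical queries can be sketched with a small in-memory SQLite example (the table and data here are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, product TEXT, qty INTEGER, amount REAL)")

# OLTP-style work: many small writes, each recorded as it occurs.
rows = [("S1", "milk", 2, 3.50), ("S1", "bread", 1, 2.00), ("S2", "milk", 5, 8.75)]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
conn.commit()

# DW-style work: one broad, read-only aggregate query across all rows.
summary = conn.execute(
    "SELECT product, SUM(qty), SUM(amount) FROM sales GROUP BY product ORDER BY product"
).fetchall()
print(summary)  # [('bread', 1, 2.0), ('milk', 7, 12.25)]
```

In a real warehouse the aggregate query would run against de-normalized, read-optimized tables rather than the live transactional ones.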


OLTP vs. Data Warehouse

Property            OLTP                 Data Warehouse
Nature of data      3NF                  Multidimensional
Indexes             Few                  Many
Joins               Many                 Some
Duplicate data      Normalized           De-normalized
Aggregate data      Rare                 Common
Queries             Mostly predefined    Mostly ad hoc
Nature of queries   Mostly simple        Mostly complex
Updates             All the time         Not allowed, only refreshed
Historical data     Often not available  Essential


Q2. What are the key issues in Planning a Data Warehouse?

Ans. More than any other factor, improper planning and inadequate project management tend to
result in failures. First and foremost, determine if your company really needs a Data Warehouse.

Planning for your Data Warehouse begins with a thorough consideration of the key issues. Answers
to the key questions are vital for the proper planning and the successful completion of the
project. Therefore, let us consider the pertinent issues, one by one.

Values and Expectations. Some companies jump into Data Warehousing without assessing
the value to be derived from their proposed Data Warehouse. Of course, first you have
to be sure that, given the culture and the current requirements of your company, a Data
Warehouse is the most viable solution. After you have established the suitability of this
solution, only then can you begin to enumerate the benefits and value propositions.
Risk Assessment. Planners generally associate project risks with the cost of the project. If
the project fails, how much money will go down the drain? But the assessment of risks is
more than calculating the loss from the project costs. What are the risks faced by the
company without the benefits derivable from a Data Warehouse? What losses are likely to
be incurred? What opportunities are likely to be missed?

The Overall Plan

The seed for a data warehousing initiative gets sown in many ways. The initiative may get
ignited simply because the competition has a Data Warehouse. Different stakeholders may
have different opinions on Data Warehouse construction. Arriving at a consensus decision is
crucial here. The Data Warehouse plan discusses the type of Data Warehouse and enumerates
the expectations. This is not a detailed project plan. It is an overall plan to lay the foundation,
to recognize the need, and to authorize a formal project.
Requirements Gathering

Transaction Processing Systems focus on automating the process, making it faster and more
efficient. This, in turn, means that the requirements for transactional systems are specific and
more directed towards business process automation.

In contrast, the Data Warehousing environment focuses on facilitating the analysis that will
change the process to make it more effective.

Requirement Gathering Approaches

There are two widely used methods for deriving business requirements:

Source Driven Requirements Gathering - This process is based on defining the
requirements by using the source data in production transactional systems. This is done by
analyzing the E-R model of the source data, or the actual physical record layouts, and
selecting data elements deemed to be of interest.
User Driven Requirements Gathering - This process is based on defining the
requirements by conducting interviews and discussions with users about business
needs and also investigating the functions they perform.

Q3. Explain Source Data Component and Data Staging Components of Data Warehouse Architecture.

Ans.

Source Data Component
1. Production Data
2. Internal Data
3. Archived Data
4. External Data

1. Production Data This category of data comes from the various operational systems of
the enterprise. Based on the information requirements in the Data Warehouse,
you choose segments of data from the different operational systems. While dealing
with this data, you come across many variations in the data formats. You also notice
that the data resides on different hardware platforms. Further, the data is
supported by different database systems and operating systems. This is the data
from many vertical applications.
In operational systems, information queries are narrow. You query an operational
system for information about specific instances of business objects. You may want
just the name and address of a single customer. Or, you may need the orders placed by
a single customer in a single week. Or, you may just need to look at a single invoice and
the items billed on that single invoice. In operational systems, you do not have broad
queries. You do not query the operational system in unexpected ways. The queries are
all predictable. Again, you do not expect a particular query to run across different
operational systems. What do all these mean? There is no conformance of data
among the various operational systems of an enterprise. A term like an account may
have different meanings in different systems.
The significant and disturbing characteristic of production data is disparity. Your great
challenge is to standardize and transform the disparate data from the various
production systems, convert the data, and integrate the pieces into useful data for
storage in the Data Warehouse.
2. Internal Data In every organization, users keep their private spreadsheets,
documents, customer profiles, and sometimes even departmental databases. This is
the internal data, parts of which could be useful in a Data Warehouse for analysis.
If your organization does business with the customers on a one-to-one basis and the
contribution of each customer to the bottom line is significant, then detailed customer
profiles with ample demographics are important in a Data Warehouse. Profiles of
individual customers become very important for consideration. When your account
representatives talk to their assigned customers or when your marketing
department wants to make specific offerings to individual customers, you need the
details. Although much of this data may be extracted from production systems,
individuals and departments in their private files hold a lot of it.
You cannot ignore the internal data held in private files in your organization. It is a
collective judgment call on how much of the internal data should be included in the
Data Warehouse. The IT department must work with the user departments to gather
the internal data.
Internal data adds additional complexity to the process of transforming and integrating
the data before it can be stored in the Data Warehouse. You have to determine
strategies for collecting data from spreadsheets, find ways of taking data from
textual documents, and tie into departmental databases to gather pertinent data
from those sources. Again, you may want to schedule the acquisition of internal data.
Initially, you may want to limit yourself to only some significant portions before
going live with your first data mart.
3. Archived Data Operational systems are primarily intended to run the current
business. In every operational system, you periodically take the old data and store it in
archived files. The circumstances in your organization dictate how often and which
portions of the operational databases are archived for storage. Some data is archived
after a year.
Sometimes data is left in the operational system databases for as long as five years.
Many different methods of archiving exist. There are staged archival methods. At
the first stage, recent data is archived to a separate archival database that may
still be online. At the second stage, the older data is archived to flat files on disk
storage. At the next stage, the oldest data is archived to tape cartridges or microfilm
and even kept off-site.
As mentioned earlier, a Data Warehouse keeps historical snapshots of data. You
essentially need historical data for analysis over time. For getting historical
information, you look into your archived data sets. Depending on your Data Warehouse
requirements, you have to include sufficient historical data. This type of data is useful
for detecting patterns and analyzing trends.
4. External Data Most executives depend on data from external sources for a high
percentage of the information they use. They use statistics relating to their industry
produced by external agencies. They use market share data of competitors. They use
standard values of financial indicators for their business to check on their
performance.
For example, the Data Warehouse of a car rental company contains data on the current
production schedules of the leading automobile manufacturers. This external data in the
Data Warehouse helps the car rental company plan for their fleet management. The
purposes served by such external data sources cannot be fulfilled by the data
available within your organization itself. The insights gleaned from your production
data and your archived data are somewhat limited. They give you a picture based on
what you are doing or have done in the past. In order to spot industry trends and
compare performance against other organizations, you need data from external
sources.
Usually, data from outside sources do not conform to your formats. You have to
do conversions of data into your internal formats and data types. You have to
organize the data transmissions from the external sources. Some sources may
provide information at regular, stipulated intervals. Others may give you the data on
request. You need to accommodate the variations.

Data Staging Component
1. Data Extraction
2. Data Transformation
3. Data Loading

After you have extracted data from various operational systems and from external
sources, you have to prepare the data for storing in the Data Warehouse. The
extracted data coming from several disparate sources need to be changed,
converted, and made ready in a format that is suitable to be stored for querying and
analysis.
Three major functions need to be performed for getting the data ready. You have to
extract the data, transform the data, and then load the data into the Data Warehouse
storage. These three major functions of extraction, transformation, and preparation
for loading take place in a staging area. The data-staging component consists of a
workbench for these functions. Data staging provides a place and an area with a set
of functions to clean, change, combine, convert, de-duplicate, and prepare source data
for storage and use in the Data Warehouse.
Why do you need a separate place or component to perform the data
preparation? Can't you move the data from the various sources into the Data
Warehouse storage itself and then prepare the data? When we implement an
operational system, we are likely to pick up data from different sources, move the data
into the new operational system database, and run data conversions. Why can't this
method work for a Data Warehouse? The essential difference here is this: in a Data
Warehouse you pull in data from many source operational systems. Remember that
data in a Data Warehouse is subject-oriented and cuts across operational applications.
A separate staging area, therefore, is a necessity for preparing data for the
Data Warehouse.
Now that we have clarified the need for a separate data-staging component, let us
understand what happens in data staging. We will now briefly discuss the three major
functions that take place in the staging area.
1. Data Extraction This function has to deal with numerous data sources. You have to
employ the appropriate technique for each data source. Source data may be from
different source machines in diverse data formats. Part of the source data may be in
relational database systems. Some data may be on other legacy network and
hierarchical data models. Many data sources may still be in flat files. You may want
to include data from spreadsheets and local departmental data sets. Data extraction
may become quite complex.
Tools are available on the market for data extraction. You may want to consider
using outside tools suitable for certain data sources. For the other data sources, you
may want to develop in-house programs to do the data extraction. Purchasing outside
tools may entail high initial costs. In-house programs, on the other hand, may mean
ongoing costs for development and maintenance.
After you extract the data, where do you keep the data for further preparation?
You may perform the extraction function in the legacy platform itself if that approach
suits your framework. More frequently, Data Warehouse implementation teams
extract the source into a separate physical environment from which moving the data
into the Data Warehouse would be easier. In the separate environment, you may
extract the source data into a group of flat files, or a data-staging relational
database, or a combination of both.
2. Data Transformation In every system implementation, data conversion is an important
function. For example, when you implement an operational system such as a
magazine subscription application, you have to initially populate your database with
data from the prior system records. You may be converting over from a manual system.
Or, you may be moving from a file-oriented system to a modern system supported with
relational database tables. In either case, you will convert the data from the prior
systems.
Again, as you know, data for a Data Warehouse comes from many disparate
sources. If data extraction for a Data Warehouse poses great challenges, data
transformation presents even greater challenges. Another factor in the Data Warehouse
is that the data feed is not just an initial load. You will have to continue to pick up the
ongoing changes from the source systems. Any transformation tasks you set up for
the initial load will be adapted for the ongoing revisions as well.
You perform a number of individual tasks as part of data transformation. First, you
clean the data extracted from each source. Cleaning may just be correction of
misspellings, or may include resolution of conflicts between state codes and zip
codes in the source data, or may deal with providing default values for missing data
elements, or elimination of duplicates when you bring in the same data from multiple
source systems.
Standardization of data elements forms a large part of data transformation. You
standardize the data types and field lengths for same data elements retrieved
from the various sources. Semantic standardization is another major task. You
resolve synonyms and homonyms. When two or more terms from different source
systems mean the same thing, you resolve the synonyms. When a single term means
many different things in different source systems, you resolve the homonym.
Data transformation involves many forms of combining pieces of data from the different
sources. You combine data from a single source record or related data elements from
many source records. On the other hand, data transformation also involves
purging source data that is not useful and separating out source records into new
combinations. Sorting and merging of data takes place on a large scale in the data
staging area.
In many cases, the keys chosen for the operational systems are field values with built-in
meanings. For example, the product key value may be a combination of characters
indicating the product category, the code of the warehouse where the product is
stored, and some code to show the production batch. Primary keys in the Data
Warehouse cannot have built-in meanings. Data transformation also includes the
assignment of surrogate keys derived from the source system primary keys.
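The surrogate-key assignment described above can be sketched as follows (the source key values are made up for illustration):

```python
# Source system keys embed meaning: category + warehouse + batch.
source_keys = ["ELEC-WH3-B017", "GROC-WH1-B042", "ELEC-WH3-B017"]

# Assign meaningless sequential surrogate keys, keeping a lookup
# table that maps each source key back to its surrogate.
surrogate_map = {}
for key in source_keys:
    if key not in surrogate_map:
        surrogate_map[key] = len(surrogate_map) + 1

print(surrogate_map)  # {'ELEC-WH3-B017': 1, 'GROC-WH1-B042': 2}
```

The lookup table is what lets ongoing loads assign the same surrogate to the same source key.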
A grocery chain point-of-sale operational system keeps the unit sales and revenue
amounts by individual transactions at the checkout counter at each store. But in the
Data Warehouse, it may not be necessary to keep the data at this detailed level. You
may want to summarize the totals by product at each store for a given day and keep the
summary totals of the sale units and revenue in the Data Warehouse storage. In
such cases, the data transformation function would include appropriate
summarization.
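Such summarization before loading can be sketched as follows (the point-of-sale rows are hypothetical):

```python
from collections import defaultdict

# Hypothetical point-of-sale detail: (store, product, date, units, revenue).
pos_rows = [
    ("S1", "milk", "2013-10-01", 2, 3.50),
    ("S1", "milk", "2013-10-01", 1, 1.75),
    ("S1", "bread", "2013-10-01", 3, 6.00),
]

# Summarize units and revenue by (store, product, date) before loading.
totals = defaultdict(lambda: [0, 0.0])
for store, product, date, units, revenue in pos_rows:
    totals[(store, product, date)][0] += units
    totals[(store, product, date)][1] += revenue

print(dict(totals))
```

Only the summary rows, not the individual checkout transactions, would then be loaded into the warehouse.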
When the data transformation function ends, you have a collection of integrated
data that is cleaned, standardized, and summarized. You now have data ready to
load into each data set in your Data Warehouse.
3. Data Loading Two distinct groups of tasks form the data loading function. When you
complete the design and construction of the Data Warehouse and go live for the
first time, you do the initial loading of the data into the Data Warehouse storage. The
initial load moves large volumes of data using up substantial amounts of time. As
the Data Warehouse starts functioning, you continue to extract the changes to the
source data, transform the data revisions, and feed the incremental data revisions on
an ongoing basis. The figure below illustrates the common types of data
movements from the staging area to the Data Warehouse storage.

Q4. Discuss the Extraction Methods in Data Warehouses.

Ans. The extraction method you choose is highly dependent on the source system and also on
the business needs in the targeted Data Warehouse environment. Very often, there's no possibility
of adding additional logic to the source systems to enhance an incremental extraction of data, due
to the performance impact or the increased workload on these systems. Sometimes the customer is
not even allowed to add anything to an out-of-the-box application.
The estimated amount of the data to be extracted and the stage in the ETL process (initial load or
maintenance of data) may also impact the decision of how to extract, from a logical and a physical
perspective. Basically, you have to decide how to extract data logically and physically.

Logical Extraction Methods

There are two kinds of logical extraction:
Full Extraction
Incremental Extraction

Full Extraction

The data is extracted completely from the source system. Since this extraction reflects all the
data currently available on the source system, there's no need to keep track of changes to the data
source since the last successful extraction. The source data will be provided as-is and no additional
logical information (for example, timestamps) is necessary on the source site. An example for a full
extraction may be an export file of a distinct table or a remote SQL statement scanning the
complete source table.

Incremental Extraction

At a specific point in time, only the data that has changed since a well-defined event back in
history will be extracted. This event may be the last time of extraction or a more complex business
event like the last booking day of a fiscal period. To identify this delta change there must be a
possibility to identify all the changed information since this specific time event. This information
can be provided either by the source data itself, like an application column reflecting the
last-changed timestamp, or by a change table, where an appropriate additional mechanism keeps track
of the changes besides the originating transactions. In most cases, using the latter method means
adding extraction logic to the source system.
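A minimal sketch of timestamp-based incremental extraction, assuming the source rows carry a last-changed column (the data is invented):

```python
# Hypothetical source rows carrying a last-changed timestamp column.
source_rows = [
    {"id": 1, "name": "Alice", "last_changed": "2013-10-01"},
    {"id": 2, "name": "Bob",   "last_changed": "2013-10-05"},
    {"id": 3, "name": "Carol", "last_changed": "2013-10-09"},
]

# Pull only rows changed since the last successful extraction.
last_extraction = "2013-10-03"
delta = [row for row in source_rows if row["last_changed"] > last_extraction]
print([row["id"] for row in delta])  # [2, 3]
```

After a successful run, `last_extraction` would be advanced so the next run picks up only newer changes.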

Many Data Warehouses do not use any change-capture techniques as part of the extraction process.
Instead, entire tables from the source systems are extracted to the data warehouse or staging area,
and these tables are compared with a previous extract from the source system to identify the changed
data.
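This extract-and-compare approach can be sketched by diffing two full extracts keyed on the primary key (the data is invented):

```python
# Previous and current full extracts, keyed by primary key.
previous = {1: ("Alice", "NY"), 2: ("Bob", "LA"), 3: ("Carol", "SF")}
current  = {1: ("Alice", "NY"), 2: ("Bob", "TX"), 4: ("Dave", "CH")}

# Classify each key as inserted, deleted, or updated.
inserted = [k for k in current if k not in previous]
deleted  = [k for k in previous if k not in current]
updated  = [k for k in current if k in previous and current[k] != previous[k]]

print(inserted, deleted, updated)  # [4] [3] [2]
```

Only the changed rows then flow into the staging area, at the cost of extracting and comparing the full tables.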

Physical Extraction Methods

Depending on the chosen logical extraction method and the capabilities and restrictions on the source
side, the extracted data can be physically extracted by two mechanisms. The data can either be
extracted online from the source system or from an offline structure. Such an offline structure
might already exist or it might be generated by an extraction routine.

These are the following methods of physical extraction:

Online Extraction
Offline Extraction

Online Extraction

The data is extracted directly from the source system itself. The extraction process can connect
directly to the source system to access the source tables themselves or to an intermediate
system that stores the data in a reconfigured manner (for example, snapshot logs or change
tables). Note that the intermediate system is not necessarily physically different from the source
system. With online extractions, you need to consider whether the distributed transactions are using
original source objects or prepared source objects.

Offline Extraction

The data is not extracted directly from the source system but is staged explicitly outside the
original source system. The data already has an existing structure (for example, redo logs, archive
logs or transportable tablespaces) or was created by an extraction routine.

You should consider the following structures:

Flat Files:
Data is in a defined, generic format. Additional information about the source object is
necessary for further processing.
Dump Files:
An Oracle-specific format in which the information about the containing objects is included.
Redo and Archive Logs:
Redo logs comprise files in a proprietary format which log a history of all changes made to the
database. Each redo log file consists of redo records. A redo record (redo entry) holds a
group of change vectors, each of which describes or represents a change made to a single block
in the database.
For example, if a user UPDATEs a salary-value in an employee-table, the DBMS generates a
redo record containing change-vectors that describe changes to the data segment block for
the table. And if the user then COMMITs the update, Oracle generates another redo record and
assigns the change a "system change number" (SCN).
A single transaction may involve multiple changes to data blocks, so it may have more than one
redo record.
A group of redo log files copied to one or more offline destinations is known collectively as the
archived redo log, or more simply the archive log. The process of turning redo log files into
archived redo log files is called archiving. This process is only possible if the database is
running in ARCHIVELOG mode. You can choose automatic or manual archiving.

Q5. Define the process of Data Profiling, Data Cleansing and Data Enrichment.

Ans. The data warehousing literature distinguishes three perspectives: the conceptual
(or business) perspective, the logical (or data modeling) perspective, and the physical
(or distributed information flow) perspective. Any given data warehouse component or substructure can
be analyzed from all three perspectives. Moreover, in the design, operation, and especially
evolution of data warehouses it is crucial that these three perspectives are maintained consistent
with each other. Finally, quality factors are often associated with specific perspectives, or with
specific relationships between perspectives.

Data Profiling

Data Profiling is the process of examining the data available in an existing data source (e.g. a
database or a file) and collecting statistics and information about that data. The purpose of these
statistics may be to:

1. Find out whether existing data can easily be used for other purposes.
2. Give metrics on data quality including whether the data conforms to company
standards.
3. Assess the risk involved in integrating data for new applications, including the
challenges of joins.
4. Track data quality.
5. Assess whether metadata accurately describes the actual values in the source database.
6. Understand data challenges early in any data-intensive project, so that late
project surprises are avoided. Finding data problems late in the project can incur time
delays and project cost overruns.
7. Have an enterprise view of all data, for uses such as Master Data Management,
where key data is needed, or data governance, for improving data quality.
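A minimal profiling pass over one column might look like this (the values are hypothetical):

```python
from collections import Counter

# Hypothetical column values pulled from a source table.
values = ["NY", "NY", "la", "SF", None, "NY", "", "LA"]
present = [v for v in values if v not in (None, "")]

# Basic statistics a profiler would report for this column.
profile = {
    "count": len(values),
    "nulls": len(values) - len(present),          # missing or empty values
    "distinct": len(set(present)),                # "la" vs "LA" hints at a quality issue
    "top": Counter(present).most_common(1),       # most frequent value
}
print(profile)
```

Even this tiny profile surfaces problems, such as the mixed-case state codes, that would otherwise appear late in the project.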

Data Cleansing

Data cleansing or Data Scrubbing is the act of detecting and correcting (or removing) corrupt or
inaccurate records from a record set, table, or database.

Data cleansing involves the following tasks:

Converting data fields to a common format
Correcting errors
Eliminating inconsistencies
Matching records to eliminate duplicates
Filling in missing values, etc.

After cleansing, a data set will be consistent with other similar data sets in the system.
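The cleansing tasks listed above can be sketched as follows (the records are invented):

```python
records = [
    {"name": " Alice ", "state": "ny", "zip": "10001"},
    {"name": "Bob",     "state": "LA", "zip": None},
    {"name": "alice",   "state": "NY", "zip": "10001"},
]

cleaned, seen = [], set()
for rec in records:
    name = rec["name"].strip().title()   # convert to a common format
    state = rec["state"].upper()         # eliminate inconsistencies
    zip_code = rec["zip"] or "00000"     # fill in missing values with a default
    key = (name, zip_code)               # match records to eliminate duplicates
    if key not in seen:
        seen.add(key)
        cleaned.append({"name": name, "state": state, "zip": zip_code})

print(len(cleaned))  # 2
```

The two "Alice" records collapse into one once names are put into a common format.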

Data Enrichment

Data Enrichment is the process of adding value to your data. In some cases, external data
providers sell data, which may be used to augment existing data. In other cases, data from multiple
internal sources are simply integrated to get the big picture. In any event, the intended result is a data
asset that has been increased in value to the user community.
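A minimal enrichment sketch, assuming hypothetical external demographic attributes keyed by customer id:

```python
# Internal customer data and hypothetical external demographic data.
customers = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
external = {1: {"segment": "premium"}, 2: {"segment": "standard"}}

# Enrich each internal record with the matching external attributes.
enriched = [{**c, **external.get(c["id"], {})} for c in customers]
print(enriched[0])  # {'id': 1, 'name': 'Alice', 'segment': 'premium'}
```

Customers without a match in the external source simply keep their internal attributes.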

Q6. What is Metadata Management? Explain Integrated Metadata Management with a block diagram.

Ans. The purpose of Metadata management is to support the development and administration of the
data warehouse infrastructure, as well as analysis of the data over time.
Metadata is widely considered a promising driver for improving the effectiveness and efficiency of
data warehouse usage, development, maintenance and administration. Data warehouse usage can be
improved because metadata provides end users with additional semantics necessary to reconstruct the
business context of data stored in the data warehouse.

Integrated Metadata Management

An integrated Metadata Management supports all kinds of users who are involved in the data
warehouse development process. End users, developers and administrators can use/see the
Metadata. Developers and administrators mainly focus on technical Metadata but make use of business
Metadata if they want. Developers and administrators need metadata to understand
transformations of object data and underlying data flows as well as the technical and conceptual system
architecture.

Several Metadata management systems are in existence. One such system/tool is Integrated Metadata
Repository System (IMRS). It is a metadata management tool used to support a corporate data
management function and is intended to provide metadata management services. Thus, the IMRS
will support the engineering and configuration management of data environments incorporating e-
business transactions, complex databases, federated data environments, and data warehouses /
data marts.
















[Block diagram: several operational systems, each through an interface, feed metadata
flows into a central repository.]

Figure: Central Repository of Metadata Management

The metadata contained in the IMRS is used to support application development, data integration, and
the system administration functions needed to achieve data element semantic consistency across a
corporate data environment, and to implement integrated or shared data environments.

Metadata management, like data warehouse development, has several sub-processes.

Some of them are listed below,

Metadata definition
Metadata collection
Metadata control
Metadata publication to the right people at the right time.
Determining what kind of data is to be captured.

Metadata Requirement gathering

Before deciding on the overall collection strategy for metadata, the organization must assess its
metadata requirements. The method is an iterative process with planned intermittent releases. It is
the warehouse architect's responsibility to define and determine the scope and importance of the
metadata to be collected before a strategy is put into place. It is the responsibility of the
development team to assign roles and responsibilities in the implementation, ownership and
maintenance of the data.

Metadata classification
In order to have more clarity in Metadata requirements, it is often advisable to have a metadata
classification strategy and share it with all the stakeholders.

Metadata classification details are given as follows:

Classification I - user point of view

Business Metadata: It helps users to understand the data more. For example, key terminologies,
aggregates and summarized views etc.
Technical Metadata: It is used for process automation, analysis and administration.

Classification II - process point of view

Design Metadata: Schema definition, source tables, and views.
Population Metadata: ETL information, sources and interface details.

Administrative Metadata: Access rights, protocols, physical location, retention criteria, audit
controls, versioning and usage statistics.
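One way to sketch these process-oriented classifications as a single repository record (the field names and values are illustrative, not a standard schema):

```python
from dataclasses import dataclass, field

@dataclass
class TableMetadata:
    name: str
    design: dict = field(default_factory=dict)          # schema, source tables, views
    population: dict = field(default_factory=dict)      # ETL jobs, source interfaces
    administrative: dict = field(default_factory=dict)  # access rights, retention, audit

sales_fact = TableMetadata(
    name="sales_fact",
    design={"source_tables": ["pos_sales"], "grain": "store/product/day"},
    population={"etl_job": "load_sales_daily"},
    administrative={"retention_years": 5, "access": "analysts"},
)
print(sales_fact.name)
```

A real repository would store such records centrally so that every department reads the same definitions.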

[Diagram: metadata classifications related across the user view (business user vs.
technical user), the process view (design, populate, administer), and the data view
(operational systems, Data Warehouse, data marts).]

Figure: Relation between Metadata classifications

Metadata Collection Strategies

Metadata needs to be stored, defined and accessed in a comprehensive manner, just like the data in
the data warehouse. The approach to storage and access constitutes the metadata collection strategy.
Metadata collection strategies typically consist of one or more combinations of the following
technology enablers. Here, we have limited our discussion to the fundamental issues of metadata
repository.

Metadata Repository

One of the main problems with contemporary data warehouse management strategies is that
information changes rapidly. Because of this, it is difficult to be consistent when managing data
warehouses. One tool that can allow data warehouse managers to deal with Metadata is called a
repository. By using a repository, the Metadata can be coordinated among different warehouses.
By doing this, all the members of the organization would be able to share data structures and data
definitions. The repository could act as a platform that would be capable of handling information from a
number of different sources. One of the best advantages of using a repository is the consistency that will
exist within the system. It will create a standard that can be understood among a number of different
departments. If a new definition is created for a data mart implementation, a repository can
support the change. A number of different departments would be able to share this information.
A repository can help data warehouse managers in a number of different ways. It can help you in the
development phase, and it can also help lower the cost of maintenance.