Building A Data Warehouse - Stephen Forte 2010 09 11

Building a Data Warehouse Part I: When to build your data warehouse
Most developers are scared of “Business Intelligence” or BI. Most think that BI consists of cubes, pivot/drill down
apps, and analytical decision support systems. While those are very typical outcomes of a BI effort, many people
forget about the first step, the data warehouse.
Typically this is what happens with a BI effort. A system is built, usually a system that deals with transactions. We
call this an OLTP or on‐line transaction processing system. Some time passes and reports are bolted on and some
business analysts build some pivot tables from “raw dumps” of data. As the system grows, reports start to slow
since the system is optimized to deal with one record at a time. Someone, usually a CTO type says: “we need a BI
system.” A development effort is then spent to build a data warehouse and cubes, and some kind of analytical
system on top of those cubes.
I make the argument that developers and project planners should embrace the data warehouse up front. When
you design your OLTP system, also design the supporting data warehouse, even if you have no intention of building
a full‐fledged BI application with cubes and the like. This way you have two distinct advantages. First is that you
have a separate system that is optimized for reporting. This system will allow the rapid creation of many new
reports as well take the load off the OLTP system. Second, when you do decide to build a BI system based on
cubes, you will already have the hard part done, building the data warehouse and supporting ETL.
Since a data warehouse uses more of a flatter data model (more on this in Part II), you can even design your
application to use both the OLTP and data warehouse as data sources. For example, when you have highly
normalized, 3rd normal form transactional tables to support transactions, it is never easy to use those tables for
reporting and displaying of information. Those tables are optimized and indexed to support retrieving and editing
(or adding/deleting) one record at a time. When you try to do things in aggregate, you start to stress your system,
since it was designed to deal with one record at a time.
This design pattern is already in use today at many places. Consider your credit card company for example. I use
American Express, and I never see my transactions show up for at least 24 hours. If go buy something and I phone
American Express and say “what was my last transaction” they will tell you right away. If you look online, you will
not see that transaction until the next business day. Why? When you call the customer service representative, they
are looking at the OLTP system, pulling up one record at a time. When you are looking online, you are looking at
the data warehouse, a system optimized for viewing lots of data in a reporting environment.
You can take this to an extreme, if you ran an e‐commerce site, you can power your product catalog view portion
of the site with the data warehouse and the purchase (inventory) system with the OLTP model. Optimize the site
for browsing (database reads) and at the same time have super‐fast e‐commerce (database writes.) Of course you
have to keep the purchasing/inventory (OLTP) and product display (data warehouse) databases in sync. I’ll talk
about that in Part III. Next, I will take a look at how to build the data warehouse.

Building a Data Warehouse Part II: Building a new schema
See also:
• Part I: When to build your data warehouse
In Part I we looked at when you should build your data warehouse and concluded that you should build it sooner
rather than later to take advantage of reporting and view optimization. Today we will look at your options to build
your data warehouse schema.
When architecting a data warehouse, you have two basic options: build a flat “reporting” table for each operation
you are performing, or build with BI/cubes in mind and implement a “star” or “snowflake” schema. Let’s take a
quick look at the first option and then we will take a look at the star and snowflake schemas.
Whenever the business requests a complex report, developers usually slow down the system with a complex SQL
statement or operation. For example, pretend in our order entry system (OLTP) the business wants a report that
says this: show me the top ten customers in each market including their overall rank. You would usually have to
perform a query like this:
1. Complex joins for unique customer
2. Rollup the sales
3. Ranking functions to determine overall rank
4. Partition functions to segment the rank by country
5. Standard aggregates to get the sales
6. Dump all of this to a work table in order to pull out the top 10 (if you don’t do this, you will lose the
overall rank)
A typical SQL statement to do steps 1‐5 would look like this:
With CTETerritory
As
(
Select cr.Name as CountryName, CustomerID,
Sum(TotalDue) As TotalAmt
From Sales.SalesOrderHeader soh
inner join Sales.SalesTerritory ter
on soh.TerritoryID=ter.TerritoryID
inner join Person.CountryRegion cr
on cr.CountryRegionCode=ter.CountryRegionCode
Group By cr.Name, CustomerID
)
Select *, Rank() Over (Order by TotalAmt DESC) as OverallRank,
Rank() Over
(Partition By CountryName Order By TotalAmt DESC,
CustomerID DESC) As NationalRank
From CTETerritory
Argh! No wonder developers hate SQL and want to use ORMs! (I challenge the best ORM to make this query!)
Instead you can create a table, lets call it SalesRankByRegion, with the fields: CountryName, CustomerID,
TotalSales, OverallRank, and NationalRank, and use the above SQL as part of a synchronization/load script to fill
your reporting table on a regular basis. Then your SQL statement for the above query looks like this:
SELECT * FROM SalesRankByRegion
WHERE CustomerNationalRank Between 1 and 10
ORDER BY CountryName, CustomerNationalRank
The results look like:

That is more like it! A simple select statement is easier for the developer to write, the ORM to map, and the system
to execute.
The SalesRankByRegion table is a vast improvement over having to query all of the OLTP tables (by my count there
are three tables plus the temp table). While this approach has its appeal, very quickly, your reporting tables will
start to proliferate.
Your best option is to follow one of the two industry standards for data warehouse tables, a “star” or a “snowflake’
schema. Using a schema like this gives you a few advantages. They are more generic than the SalesRankByRegion,
which was a table built for one query/report, giving you the ability to run many different reports off each table.
Another advantage is that you will have the ability to build cubes very easily off of a star or snowflake schema as
opposed to a bunch of SalesRankByRegion tables.
The design pattern for building true data warehouse tables are to build a “fact” table, or a table that contains
detail level (or aggregated) “facts” about something in the real world, like an order or customer for example. Inside
of the fact table you will also have “measures” or a numeric value that represents a “fact.” To support your fact
table you will have “dimension” tables. Dimensions are a structure that will categorize your data, usually in the
form of a hierarchy. A dimension table for example could be “time” with a hierarch of OrderYear, OrderQuarter,
OrderMonth, OrderDate, OrderTime.
There are tons of tutorials on the internet that show you how to build a star or snowflake schema and the
difference between them, so I will not repeat them here. (You may want to start here.) I’ll give you the high level
on a simple star schema here.
Let’s say we have an order entry system, such as Northwind (in the Microsoft SQL Server sample database.) You
can have a fact table that revolves around an order. You can then have three (or more) fact tables that focus on:
time, product, and salesperson. The time dimension would roll‐up the order date by year, quarter, month, and
date. The product dimension would roll‐up the product by the product and category. (In most systems you would
have a much deeper hierarchy for products.) The salesperson dimension would be roll‐up of the employee, the
employee manager and the department they work in. The key in each of these tables would then be foreign keys in
the fact table, along with the measure (or numerical data describing the fact.)
There is an example similar to this in Programming SQL Server 2008, a book where I am a co‐author. Here is
modified version of that demo:
Dimension tables:
CREATE TABLE [dwh].[DimTime] (
[TimeKey] [int] IDENTITY (1, 1) NOT NULL Primary Key,
[OrderDate] [datetime] NULL ,
[Year] [int] NULL ,
[Quarter] [int] NULL ,
[Month] [int] NULL
)

CREATE TABLE [dwh].[DimProduct] (
[ProductID] [int] not null Primary Key,
[ProductName] nvarchar(40) not null,
[UnitPrice] [money] not null,
[CategoryID] [int] not null,
[CategoryName] nvarchar(15) not null
)

CREATE TABLE [dwh].[DimEmployee] (
EmployeeID int not null Primary Key,
EmployeeName nvarchar(30) not null,
EmployeeTitle nvarchar(30),
ManagerName nvarchar(30)
)

Fact table:
CREATE TABLE [dwh].FactOrder (
[PostalCode] [nvarchar] (10) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[ProductID] [int] NOT NULL ,
[EmployeeId] [int] NOT NULL ,
[ShipperId] [int] NOT NULL ,
[Total Sales] [money] NULL ,
[Discount] [float] NULL ,
[Unit Sales] [int] NULL ,
[TimeKey] [int] NOT NULL
)
We have the basis of a star schema. Now we have to fill those tables and keep them up to date. That is a topic for
Part III.

Building a Data Warehouse Part III: Location of your data warehouse
See also:
• Part II: Building a new schema
In Part I we looked at the advantages of building a data warehouse independent of cubes/a BI system and in Part II
we looked at how to architect a data warehouse’s table schema. Today we are going to look at where to put your
data warehouse tables.
Let’s look at the location of your data warehouse. Usually as your system matures, it follows this pattern:
• Segmenting your data warehouse tables into their own isolated schema inside of the OLTP database
• Moving the data warehouse tables to their own physical database
• Moving the data warehouse database to its own hardware
When you bring a new system online, or start a new BI effort, to keep things simple you can put your data
warehouse tables inside of your OLTP database, just segregated from the other tables. You can do this a variety of
ways, most easily is using a database schema (ie dbo), I usually use dwh as the schema. This way it is easy for your
application to access these tables as well as fill them and keep them in sync. The advantage of this is that your data
warehouse and OLTP system is self‐contained and it is easy to keep the systems in sync.
As your data warehouse grows, you may want to isolate your data warehouse further and move it to its own
database. This will add a small amount of complexity to the load and synchronization, however, moving the data
warehouse tables to their own table brings some benefits that make the move worth it. The benefits include
implementing a separate security scheme. This is also very helpful if your OLTP database scheme locks down all of
the tables and will not allow SELECT access and you don’t want to create new users and roles just for the data
warehouse. In addition, you can implement a separate backup and maintenance plan, not having your date
warehouse tables, which tend to be larger, slow down your OLTP backup (and potential restore!). If you only load
data at night, you can even make the data warehouse database read only. Lastly, while minor, you will have less
table clutter, making it easier to work with.
Once your system grows even further, you can isolate the data warehouse onto its own hardware. The benefits of
this are huge, you can have less I/O contention on the database server with the OLTP system. Depending on your
network topology, you can reduce network traffic. You can also load up on more RAM and CPUs. In addition you
can consider different RAID array techniques for the OLTP and data warehouse servers (OLTP would be better with
RAID 5, data warehouse RAID 1.)
Once you move your data warehouse to its own database or its own database server, you can also start to
replicate the data warehouse. For example, let’s say that you have an OLTP that works worldwide but you have
management in offices in different parts of the world. You can reduce network traffic by having all reporting (and
what else do managers do??) run on a local network against a local data warehouse. This only works if you don’t
have to update the date warehouse more than a few times a day.
Where you put your data warehouse is important, I suggest that you start small and work your way up as the
needs dictate.

Building a Data Warehouse Part IV: Extraction, Transformation, and Load
See also:
• Part III: Location of your data warehouse
we looked at how to architect a data warehouse’s table schema. In Part III, we looked at where to put the data
warehouse tables. Today we are going to look at how to populate those tables and keep them in sync with your
OLTP system.
No matter where your data warehouse is located, the biggest challenge with a data warehouse, especially one
where you are going to do real time reporting off of, is that the data is published from the transactional system
(OLTP). By definition, the data warehouse is out of date compared to the OLTP system.
Usually when you ask the boss, “How much latency can you accept between the OLTP and data warehouse
systems” the boss will reply: none. While that is impossible, the more time to develop and the more money you
have will to develop said system, the closer to real time you can get. Always bring up the American Express
example used in Part I for proof that your system can accept at least some latency.
If you remember the code examples from Part II, most of the time you have to query OLTP data, aggregate it, and
then load it into your new schema. This is known as the process of extraction, transformation, and loading (ETL)
data from the OLTP system into the data warehouse. The workflow usually goes like this: at a pre‐set time (at
every change, hourly, nightly, or weekly) query the OLTP data (extraction) and then aggregate and flatten it out
(transformation) and then copy the transformed data to the data warehouse (star or snowflake) tables (load).
You have several options ranging from near real time ETL (very complex and expensive) to nightly data dumps
(least complex and least expensive.) You can also perform full ETL (wipe out the data warehouse tables completely
and refill on each operation) and partial ETL (only add incremental data as per a time series.) The actual load can
be done via database triggers on an add/update/delete for near real time to simple scheduled SQL batches that
wipe data and run your load scripts. Most likely you will want to take a hybrid approach, depending on the
maturity of your system. You have three basic options, ranging from least complex to most complex:
• Direct database dump
• ETL tool
• Database triggers
When you have a long time series to publish your data, say nightly, or weekly, you can do a direct database dump.
The process would be pretty straightforward. At a regular interval (or manually) a process would start that would
query the OLTP database and perform all of the aggregations, etc and then load it into a staging data warehouse
database, then wipe out the production data warehouse and load the data in.
Another option is to use an ETL tool. A good example is SQL Server Integration Services (SSIS) if you are using
Microsoft SQL Server. (Actually SSIS will work with multiple database, you just need a SQL Server host.) A modern
ETL tool will give you the ability to segment the work into logical groups, have a control flow based on success and
failure of a condition, and allows rollbacks.

A typical workflow with an ETL tool is that the ETL will run on a schedule (or based on a condition, such as a
message arriving from a queue or a record written to a admin table) and have a parameter(s) passed to it. This
parameter is usually a time series and the ETL will perform all of the extraction on data from the OLTP database
filtered by that parameter. An ETL tool is the most likely solution you will employ, especially if you have to make
frequent updates from your OLTP system to your data warehouse.
Another option is to use database triggers. For those of you that don’t know a lot about triggers, well they can lead
to evil. ;) That said, they are events that fire when data changes. You can then write SQL code to run when the data
changes, even ETL code. Triggers are hard to debug and difficult to maintain, so I would only suggest that you use a
trigger when you need “real time” updates to your data warehouse and even then, the trigger should only write a
record into an admin table that your ETL process is polling to get started.
A common design pattern with a sophisticated ETL is to do all of the ETL to a staging data warehouse and allow a
check mechanism to verify that the ETL was successful. Once this check is performed (either by computer or by
human, depending on the system), you can then push the data from the staging tables to the production tables.
There is a SQL operator specifically for this process, the MERGE operator. MERGE looks at two tables and looks at
their joins and compares the data and will do add, updates, inserts, and deletes to keep the two tables in sync.
Here is how we would do that with a MERGE statement and our fact table from Part II.
‐‐MERGE the FACT Local to FACT Remote
MERGE dbo.FactOrder as t‐‐target
USING Northwind.dwh.FactOrder as s‐‐source
ON t.ProductID = s.ProductID AND
t.EmployeeID = s.EmployeeID AND
t.ShipperID = s.ShipperID AND
t.TimeKey = s.TimeKey ‐‐joining keys

‐‐record exists in target, but data different
WHEN MATCHED AND (s.OrderDate != t.OrderDate OR
s.PostalCode != t.PostalCode OR
s.[Total Sales] != t.[Total Sales] OR
s.Discount != t.Discount OR
s.[Unit Sales] != t.[Unit Sales]) THEN
UPDATE SET t.OrderDate = s.OrderDate,
t.PostalCode = s.PostalCode,
t.[Total Sales] = s.[Total Sales],
t.Discount = s.Discount,
t.[Unit Sales] = s.[Unit Sales]

WHEN NOT MATCHED BY SOURCE THEN
‐‐only in target, get rid of it
Delete

WHEN NOT MATCHED BY TARGET THEN
‐‐record only exists in source
INSERT VALUES (s.OrderDate, s.PostalCode, s.ProductID,s.EmployeeID,
s.ShipperID, s.[Total Sales], s.Discount, s.[Unit Sales], s.[TimeKey]);‐‐required!!
Tomorrow we will warp up with the application development options.

Building a Data Warehouse Part V: Application Development Options
See also:
• Part III: Location of your data warehouse
• Part IV: Extraction, Transformation, and Load
we looked at how to architect a data warehouse’s table schema. In Part III, we looked at where to put the data
warehouse tables. In Part IV, we are going to look at how to populate those tables and keep them in sync with your
OLTP system. Today, our last part in this series, we will take a quick look at the benefits of building the data
warehouse before we need it for cubes and BI by exploring our reporting and other options.
As I said in Part I, you should plan on building your data warehouse when you architect your system up front.
Doing so gives you a platform for building reports, or even application such as web sites off the aggregated data.
As I mentioned in Part II, it is much easier to build a query and a report against the rolled up table than the OLTP
tables.
To demonstrate, I will make a quick Pivot Table using SQL Server 2008 R2 PowerPivot for Excel (or just PowerPivot
for short!). I have showed how to use PowerPivot before on this blog, however, I usually was going against a SQL
Server table, SQL Azure table, or an OData feed. Today we will use a SQL Server table, but rather than build a
PowerPivot against the OLTP data of Northwind, we will use our new rolled up Fact table.
To get started, I will open up PowerPivot and import data from the data warehouse I created in Part II. I will pull in
the Time, Employee, and Product dimension tables as well as the Fact table.

Once the data is loaded into PowerPivot, I am going to launch a new PivotTable.

PowerPivot understands the relationships between the dimension and fact tables and places the tables in the
designed shown below. I am going to drag some fields into the boxes on the PowerPivot designer to build a
powerful and interactive Pivot Table. For rows I will choose the category and product hierarchy and sum on the
total sales. I’ll make the columns (or pivot on this field) the month from the Time dimension to get a sum of sales
by category/product by month. I will also drag in Year and Quarter in my vertical and horizontal slicers for
interactive filtering. Lastly I will place the Employee field in the Report Filter pane, giving the user the ability to
filter by employee.

The results look like this, I am dynamically filtering by 1997, third quarter and employee name Janet Leverling.

This is a pretty powerful interactive report build in PowerPivot using the four data warehouse tables. If there was
no data warehouse, this Pivot table would have been very hard for an end user to build. Either they or a developer
would have to perform joins to get the category and product hierarchy as well as more joins to get the order
details and sum of the sales. In addition, the breakout and dynamic filtering by Year and Quarter, and display by
month, are only possible by the DimTime table, so if there were no data warehouse tables, the user would have
had to parse out those DateParts. Just about the only thing the end user could have done without assistance from
a developer or sophisticated query is the employee filter (and even that would have taken some PowerPivot magic
to display the employee name, unless the user did a join.)
Of course Pivot Tables are not the only thing you can create from the data warehouse tables you can create
reports, ad hoc query builders, web pages, and even an Amazon style browse application. (Amazon uses its data
warehouse to display inventory and OLTP to take your order.)
I hope you have enjoyed this series, enjoy your data warehousing.

Building A Data Warehouse - Stephen Forte 2010 09 11

Diunggah oleh

Informasi Dokumen

Deskripsi Asli:

Judul Asli

Hak Cipta

Format Tersedia

Bagikan dokumen Ini

Bagikan atau Tanam Dokumen

Opsi Berbagi

Apakah menurut Anda dokumen ini bermanfaat?

Apakah konten ini tidak pantas?

Hak Cipta:

Format Tersedia

Building A Data Warehouse - Stephen Forte 2010 09 11

Diunggah oleh

Hak Cipta:

Format Tersedia

Building a Data Warehouse Part I: When to build your data warehouse

Anda mungkin juga menyukai