
Data Modeling for Analytical Data Warehouses.

Interview with Michael Blaha.


by Roberto V. Zicari on March 3, 2012

"Many data warehouses contain sensitive data such as personal data. There are legal and
ethical concerns with accessing such data. So the data must be secured and access
controlled as well as logged for audits." - Michael Blaha
This is the third interview with our expert Dr. Michael Blaha on the topic of database modeling. This
time we look at the issue of data design for analytical data warehouses.
In previous interviews we looked at how good UML is for database design, and how good
Use Cases are for database modeling.
Hope you'll find this interview interesting. I encourage the community to post comments.
RVZ
Q1: What is the difference between data warehouses and day-to-day business applications?
Michael Blaha: Operational (day-to-day business) applications serve the routine needs of a
business: handling orders, scheduling manufacturing runs, servicing patients, and generating
financial statements.
Operational applications have many short transactions that must process quickly. The transactions
both read and write.
Well-written applications pay attention to data quality, striving to ensure correct data and avoid
errors.
In contrast, analytical (data warehouse) applications step back from the business routine and
analyze data that accumulates over time. The idea is to gain insight into business patterns that are
overlooked when responding to routine needs. Data warehouse queries can have a lengthy
execution time as they process reams of data, searching for underlying patterns.
End users read from a data warehouse, but they don't write to it. Rather, writing occurs as the
operational applications supply new data that is added to the data warehouse.
Q2: How do you approach data modeling for data warehouse problems?
Michael Blaha: For operational applications, I use the UML class model for conceptual data
modeling. (I often use Enterprise Architect.) The notation is more succinct than conventional
database notations and promotes abstract thinking.
In addition, the UML class model is understandable for business customers as it defers database
design details. And, of course, the UML reaches out to the programming side of development.
In contrast, for analytical applications, I go straight to a database notation. (I often use ERwin.)
Data warehouses revolve around facts and dimensions. The structure of a data warehouse model
is so straightforward (unlike the model of an operational application) that a database notation alone
suffices.

For a business user, the UML model and the conventional data model look much the same for a
data warehouse.
The programmers of a data warehouse (the ETL developers) are accustomed to database
notations (unlike the developers in day-to-day applications).
As an aside, I note that in a past book (A Manager's Guide to Database Technology) I used a UML
class model for analytical modeling. In retrospect I now realize that was a forced fit. The class
model does not deliver any benefits for data warehouses and it's an unfamiliar technology for data
warehouse developers, so there's no point in using it there.
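The facts-and-dimensions structure mentioned above can be sketched as a minimal star schema. The following is an illustrative sketch using Python's built-in sqlite3 module; the table and column names (fact_sales, dim_customer, dim_date) are my own examples, not from the interview:

```python
import sqlite3

# Illustrative star schema: one fact table surrounded by dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_date     (date_id     INTEGER PRIMARY KEY, day  TEXT);
    -- The fact table holds measures plus foreign keys to each dimension.
    CREATE TABLE fact_sales (
        customer_id INTEGER REFERENCES dim_customer,
        date_id     INTEGER REFERENCES dim_date,
        amount      REAL
    );
""")
con.execute("INSERT INTO dim_customer VALUES (1, 'Acme')")
con.execute("INSERT INTO dim_date VALUES (1, '2012-03-03')")
con.execute("INSERT INTO fact_sales VALUES (1, 1, 99.50)")

# Analytical queries join the fact table to its dimensions.
row = con.execute("""
    SELECT c.name, SUM(f.amount)
    FROM fact_sales f JOIN dim_customer c USING (customer_id)
    GROUP BY c.name
""").fetchone()
print(row)  # ('Acme', 99.5)
```

The point of the shape is that every analytical question becomes a join from the central fact table out to whichever dimensions the question slices by, which is why a plain database notation captures it so well.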
Q3: Is there any synergy between non-relational databases (NoSQL, Object Databases) and
data warehouses?
Michael Blaha: Not for conventional data warehouses that are set-oriented. Mass quantities of
data must be processed in bulk. Set-oriented data processing is a strength of relational databases
and the SQL language. Furthermore, tables are a good metaphor for facts and dimensions and the
data is intrinsically strongly typed.
NoSQL (Hadoop) is being used for mining Web data. Web data is by its nature unstructured and
much different from conventional data warehouses.
Q4: How do data warehouses achieve fast performance?
Michael Blaha: The primary technique is pre-computation, by anticipating the need for aggregate
data and computing it in advance. Indexing is also important for data warehouses, but less
important than with operational applications.
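The pre-computation technique can be sketched as follows, again with sqlite3 and illustrative names of my own choosing: the aggregate is computed once at load time, so user queries read a small summary table instead of scanning the raw facts.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_sales (region TEXT, amount REAL)")
con.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                [("east", 10.0), ("east", 5.0), ("west", 7.0)])

# Pre-compute the aggregate once, at load time...
con.execute("""
    CREATE TABLE agg_sales_by_region AS
    SELECT region, SUM(amount) AS total FROM fact_sales GROUP BY region
""")

# ...so user queries read the small summary instead of scanning every fact row.
total = con.execute(
    "SELECT total FROM agg_sales_by_region WHERE region = 'east'"
).fetchone()[0]
print(total)  # 15.0
```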
Q5: What are some difficult issues with data warehouses?
Michael Blaha:
- Abstraction. Abstraction is needed to devise the proper facts and dimensions. It is always difficult
to perform abstraction.
- Conformed dimensions. A large data warehouse schema must be flexible for mining. This can only
be achieved if data is on the same basis. Therefore there is a need for conformed dimensions.
For example, there must be a single definition of Customer that is used throughout the warehouse.
- Size. The sheer size of schema and data is a challenge.
- Data cleansing. Many operational applications are old legacy code. Often their data is flawed and
may need to be corrected for a data warehouse.
- Data integration. Many data warehouses combine data from multiple applications. The application
data overlaps and must be reconciled.
- Security. Many data warehouses contain sensitive data such as personal data. There are legal
and ethical concerns with accessing such data. So the data must be secured and access controlled
as well as logged for audits.
Q6: What kind of metadata is associated with a data warehouse and is there a role for Object
Databases with this?
Michael Blaha: Maybe. Data warehouse metadata includes source-to-target mappings, definitions
(of facts, dimensions, and attributes), as well as the organization of the data warehouse into
subject areas. The metadata for a data warehouse is just like that of operational applications: it
has to be custom modeled and doesn't have a standard metaphor for structure like the
facts and dimensions of a data warehouse. Relational databases, OO databases, and possibly
other kinds of databases are all reasonable candidates.
Q7. In a recent interview, Florian Waas of EMC/Greenplum said: "In the Big Data era the old
paradigm of shipping data to the application isn't working any more. Rather, the application
logic must come to the data or else things will break: this is counter to conventional wisdom
and the established notion of strata within the database stack. Instead of stand-alone
products for ETL, BI/reporting and analytics we have to think about seamless integration: in
what ways can we open up a data processing platform to enable applications to get closer?
What language interfaces, but also what resource management facilities can we offer? And
so on."
What is your view on this?
Michael Blaha: It's well known that to get good performance from relational database applications,
stored procedures must be used. Stored procedures are logic that runs inside the database
kernel. They circumvent much of the overhead that is incurred by shuttling back and
forth between an application process and the database process. So the stored procedure
experience is certainly consistent with this comment.
What I try to do in practice is think in terms of objects. Relational database tables and stored
procedures are analogous to objects with methods. I put core functionality that is likely to be
reusable and computation intensive into stored procedures. I put lightweight functionality and
functionality that is peculiar to an application outside the database kernel.
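The shuttling overhead described above can be illustrated with a small sketch. SQLite has no stored procedures (it runs in-process), so this is only an analogy of my own, contrasting logic shipped to the data (one statement, one result row crosses the boundary) with data shipped to the logic (every row crosses into the application):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (amount REAL)")
con.executemany("INSERT INTO orders VALUES (?)", [(v,) for v in (1.0, 2.0, 3.0)])

# Logic comes to the data: the database aggregates; one row crosses the boundary.
in_db = con.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Data comes to the logic: every row is fetched into the application first.
in_app = sum(amount for (amount,) in con.execute("SELECT amount FROM orders"))

assert in_db == in_app == 6.0
```

At three rows the difference is invisible; at warehouse scale the second pattern is what breaks, which is the substance of the comment quoted in the question.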
Q8. Hadoop is the system of choice for Big Data and Analytics. How do you approach data
modeling in this case?
Michael Blaha: I have no experience with Hadoop. My projects have involved structured data. In
contrast, Hadoop is architected for unstructured data, as is often found on the Web.
Q9. A lot of insights are contained in unstructured or semi-structured data from Big Data
applications. Does it make any sense to do data modeling in this case?
Michael Blaha: I have no experience with unstructured data. I have some experience with
semi-structured data (XML / XSD). I routinely practice data modeling for XSD files. I published a
paper in 2010 lamenting the fact that so many SOA projects concern storage and retrieval of data
and completely lack a data model. I've been working on modeling approaches for XSD files, but
have not yet devised a solution to my satisfaction.

Michael Blaha is a partner at Modelsoft Consulting Corporation.


Dr. Blaha is recognized as one of the world's leading authorities on databases and data modeling.
He has more than 25 years of experience as a consultant and trainer in conceiving, architecting,
modeling, designing, and tuning databases for dozens of major organizations around the world. He
has authored six U.S. patents, six books, and many papers. Dr. Blaha received his doctorate from
Washington University in St. Louis and is an alumnus of GE Global Research in Schenectady, New
York.
