Managing Big Data: Building the Foundation for a
Scalable ETL Environment
Can you, at this moment in time, imagine managing 500 terabytes of data? Or integrating
billions of clicks from your web site with data from multiple channels and business units
every day, maybe more than once a day? It seems like an extreme scenario, but it’s one
that industry analysts uniformly predict organizations will be confronting within the
next three to four years. The exact figures may vary slightly, but the consensus is solid:
enterprises are going to be swamped with data, and they’re going to have to figure out
how to manage it or risk being left behind.
A rapid increase in data volume is not the only challenge enterprises will face. End users
want more information at more granular levels of detail, and they want flexible,
integrated, timely access. The number of users and the number of queries are also
growing dramatically. Additionally, organizations are placing more emphasis on high-
value, data-intensive applications such as CRM. All of these developments pose
problems for enterprise data management. Fortunately, there is an effective answer to these problems: scalable data solutions and, more specifically, scalable ETL environments.
SCALABLE PERFORMANCE
In the case of performance, scalability implies the ability to increase performance
practically without limit. Performance scalability as a concept is not new, but actually
achieving it is becoming much more challenging because of the dramatic increases in
data volume and complexity that enterprises are experiencing on every front. How long
will users tolerate growing latency in the reporting provided to them? How long can
enterprises keep adding hardware, installing new software, tweaking or building from
scratch new applications? The fact is, yesterday’s scalable solutions aren’t working in the
new environment.
The extract, transform and load (ETL) environment poses some especially difficult
scalability challenges because it is constantly changing to meet new requirements.
Enterprises must tackle the scalability problem in their ETL environments in order to
successfully confront increasing refresh frequencies, shrinking batch windows,
increasing processing complexity, and growing data volumes. Without scalability in the
ETL environment, scalability in the hardware or database layer becomes less effective.
Reduce Data
• Avoid scanning unnecessary data. Are there data sets or partitions that do not need to be scanned? Could the extra CPU/memory cost of passing unused rows or columns be avoided? Could the data be structured in columnar data sets so that only necessary columns are read? (A brief sketch follows this list.)
• Apply data reduction steps as early as possible in the ETL job flow. Could data be filtered during the extract processing instead of during the transformation processing? Could data be aggregated earlier in the processing stream, reducing the rows sent for subsequent processing?
• Use tools to facilitate data reduction. Could a "change data capture" feature on a
source database be used?
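To make these data-reduction ideas concrete, the sketch below uses the pyarrow library to read a columnar (Parquet) data set; the file name, column names and filter predicate are hypothetical. Because the format is columnar and the filter is pushed down into the scan, only the needed columns are read, and non-qualifying rows are dropped at extract time rather than in later transformation stages.

    # A minimal sketch of extract-time data reduction over a hypothetical
    # Parquet data set of web events.
    import pyarrow.parquet as pq

    clicks = pq.read_table(
        "events.parquet",                          # hypothetical source file
        columns=["user_id", "event_time", "url"],  # read only needed columns
        filters=[("event_type", "=", "click")],    # filter during the extract
    )
    print(clicks.num_rows)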
Eliminate I/O
Whether implemented within the ETL application, or more likely by the ETL tool, these
techniques avoid I/Os and reduce run times. The first three in-memory techniques are
generally implemented within the database and often come "free" with the ETL tool that
processes in the database. Pipelining is generally supported through an ETL tool.
• In-memory caching. Will the I/Os generated from caching a lookup file/table be
less than a join? Will the lookup file/table fit in memory–either OS buffers, tool
buffers or database buffers?
• Hash join versus sort-merge join. Will the join result fit in memory?
• In-memory aggregates. Will the aggregation result fit in memory?
• Pipelining. Can data be passed from one stage of ETL processing to the next? Does the data need to be landed to disk for reliability purposes? (A sketch combining caching and pipelining follows this list.)
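The sketch below illustrates the in-memory caching and pipelining ideas under simple assumptions: a small, hypothetical dimension table is cached in a dictionary so the enrichment step avoids a disk-based join, and generator stages pass rows directly from one step to the next without landing intermediate files.

    # A rough sketch of in-memory lookup caching plus pipelining;
    # file and field names are hypothetical.
    import csv

    def extract(path):
        # Stage 1: stream source rows one at a time.
        with open(path, newline="") as f:
            yield from csv.DictReader(f)

    def enrich(rows, lookup):
        # Stage 2: an in-memory dictionary lookup stands in for a join.
        for row in rows:
            row["region"] = lookup.get(row["store_id"], "UNKNOWN")
            yield row

    def load(rows, out_path):
        # Stage 3: write enriched rows; no intermediate data was landed to disk.
        with open(out_path, "w", newline="") as f:
            writer = None
            for row in rows:
                if writer is None:
                    writer = csv.DictWriter(f, fieldnames=row.keys())
                    writer.writeheader()
                writer.writerow(row)

    # The lookup table fits comfortably in memory.
    store_regions = {"s1": "EAST", "s2": "WEST"}  # hypothetical dimension data
    load(enrich(extract("sales.csv"), store_regions), "sales_enriched.csv")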
Apply Parallelism
• Application data parallelism. Is the ETL job stream too small to be partitioned to use parallelism? How will application logic such as joins and aggregates need to be changed to accommodate partitioned data? How many different partitioning schemes will be needed? How can the data be efficiently repartitioned? How will the data be gathered back into serial files? (A sketch of hash-partitioned aggregation follows this list.)
• Control and hardware parallelism. How will jobs running on different servers (a
distributed environment) be controlled and managed? How will data be moved
between different servers? How will failure recovery, monitoring and debugging
be implemented?
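The sketch below illustrates application data parallelism under simple assumptions: rows are hash-partitioned by key in a single pass, each worker process aggregates its partition independently, and the partial results are gathered back into one serial result. The data and key names are hypothetical.

    # A rough sketch of hash-partitioned parallel aggregation.
    from multiprocessing import Pool
    from collections import defaultdict

    def partition_by_key(rows, n):
        # Hash-partition so every row for a given key lands in one partition.
        parts = [[] for _ in range(n)]
        for key, amount in rows:
            parts[hash(key) % n].append((key, amount))
        return parts

    def aggregate_partition(rows):
        # Each worker aggregates its own partition independently.
        totals = defaultdict(float)
        for key, amount in rows:
            totals[key] += amount
        return dict(totals)

    if __name__ == "__main__":
        rows = [("east", 10.0), ("west", 5.0), ("east", 2.5)] * 100000
        parts = partition_by_key(rows, 4)
        with Pool(processes=4) as pool:
            partials = pool.map(aggregate_partition, parts)
        # Gather: merge the partial aggregates back into a single result.
        merged = defaultdict(float)
        for partial in partials:
            for key, total in partial.items():
                merged[key] += total
        print(dict(merged))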
SCALABLE AVAILABILITY
Availability hardly needs defining. An application is available if its end users can use it when they expect to be able to. End users and IT staff generally agree beforehand on the level of availability that is needed; IT must then design and operate applications so that the expected service level is actually delivered.
Let’s think of availability in different terms for a moment. Assume that the end user defines availability as one helium balloon staying no less than 10 feet above the ground. If left alone, the balloon eventually descends as helium molecules leak out, so it must be re-pressurized periodically to stay above the threshold.
What if the end user really needs 24x7x365 helium balloon availability? One way to provide it would be to add a second balloon to the system: while one is floating, the other can be pressurized. Because pressurizing takes far less time than leakage takes to bring a balloon below the 10-foot threshold, two balloons can meet complete 24x7x365 availability relatively easily. This system uses redundancy to provide availability; the waiting, pressurized balloon is the "hot spare."
As the availability threshold rises, the stress on the system increases, as well; it becomes
more difficult and more expensive to maintain the expected availability. The difficulty
and expense are closely correlated. As the system becomes more stressed, additional
vigilance is required to avoid unexpected outages. Additionally, as the stress goes up, the
number of balloons in the system goes up, and the frequency of maintenance operations
(pressurizing and replacing balloons) rises.
Big-data systems are especially at risk of outage because they inherently involve large numbers of storage devices. Although the mean time between failures (MTBF) for a single device might be hundreds of thousands of hours, the system-wide MTBF drops dramatically as the failure rates of hundreds or thousands of devices accumulate. Weekly disk drive failures are the norm rather than the exception in multi-terabyte repositories. As with the helium balloons, redundancy can effectively diminish the ill effects of these failures.
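To put rough numbers on this, for N independent, identical devices the system-wide MTBF is approximately the per-device MTBF divided by N. Assuming, for illustration, drives rated at 500,000 hours MTBF, a repository with 3,000 drives can expect a failure roughly every 500,000 / 3,000 ≈ 167 hours, or about once a week.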
Now that we’ve established that special considerations are called for in scalable big-data
applications, let’s look specifically at availability in ETL systems. End users generally do
not interact directly with ETL applications. Despite this, ETL systems often must
perform their functions within narrow batch windows. Availability of the ETL systems
during these windows is critical to maintain contracted service levels for downstream
systems, generally data warehouses. If the ETL process completes at 9 a.m. instead of 7 a.m., contracted service levels for those downstream systems may already be broken before end users run their first query.
What can be done to attain and retain contracted availability levels in big-data ETL applications? For ETL applications, availability means extracting data from source systems on time, transforming it on time and loading it on time. As with the helium balloons, redundancy is the principal defense against the failures that threaten those deadlines.
SCALABLE COST OF OWNERSHIP
A scalable ETL application is one whose cost of ownership scales predictably over time. Two aspects of ETL applications increase cost over time: growing data volumes (and, consequently, workloads) and feature-set enhancement. Cost-of-ownership predictability in each of these areas implies a different cost growth curve. As data volumes increase, a predictable
cost curve is one where cost increases linearly or sub-linearly. In other words, if an ETL
environment costs $300,000 per year to operate when the base data volume being
processed is one terabyte, the cost should be no more than $600,000 per year when the
volume increases to two terabytes. Cost predictability with regard to feature set
enhancement means two things. First, the cost of the enhancement itself should be
commensurate with the enhancement. Second, once enhanced, the software’s cost of
ownership should change primarily due to workload increases and not software
maintenance needs.
Before detailing the primary cost drivers and how best to manage them into
"predictability," we need to address the issue of measuring cost. Cost-of-ownership
measurement is more art than science. Ownership cost includes several components:
software development, testing and maintenance; capital expenditure; hardware expense;
software licensing expense; data center operations; and others. Furthermore, cost
depends on service level commitments on various quality of service (QoS) metrics. These
include availability, performance, timeliness, accuracy, completeness, disaster recovery
time and unplanned event rates.
This paper assumes that anyone trying to manage cost of ownership first implements consistent, systematic processes for measuring ownership cost and, second, adjusts these costs fairly with respect to service-level commitments. For example, raising an application’s availability commitment raises its cost of ownership even if nothing else about the application changes, so cost comparisons should account for the change in service level.
Following are various drivers that determine whether cost of ownership remains
predictable over time or not. These drivers should be taken into consideration when
attempting to improve cost-of-ownership scalability.
Application cost scalability depends on the hardware platform and hardware architecture
Assume that an ETL application costs $100,000 per year to own and is processing one
terabyte of data per day. When the volume demand increases to two terabytes per day,
the current platform is unable to scale to support such a workload. A new platform may
have to be chosen, and the software may have to be rearchitected to take advantage of
the more scalable hardware. These activities add dramatically to the ETL application’s
cost of ownership for the current year.
Enterprises must take care not to let software enhancements degrade the software’s maintainability
Complex, scalable ETL software contains features that address more than just
functionality. ETL software specifically addresses performance, scalability and other
quality of service features. These myriad features lend significant complexity to the ETL
software. This complexity calls for rigor in the original design and discipline during
enhancements. Enhancements that diminish the ETL software’s ability to meet or exceed service levels deal a double blow to cost of ownership. First, software maintenance costs rise directly. Second, operational costs rise because artificial measures must be put in place to restore service levels (for example, hiring additional operational staff and deploying additional hardware).