
Knightsbridge Solutions

WHITE PAPER

Managing Big Data:


Building the Foundation for a Scalable ETL Environment

Citicorp Center

500 West Madison, Suite 3100

Chicago, IL 60661 USA

800.669.1555 phone

312.577.0228 fax

www.knightsbridge.com
Managing Big Data: Building the Foundation for a
Scalable ETL Environment

Can you, at this moment in time, imagine managing 500 terabytes of data? Or integrating
billions of clicks from your web site with data from multiple channels and business units
every day, maybe more than once a day? It seems like an extreme scenario, but it’s one
that industry analysts uniformly predict organizations will be confronting within the
next three to four years. The exact figures may vary slightly, but the consensus is solid:
enterprises are going to be swamped with data, and they’re going to have to figure out
how to manage it or risk being left behind.

A rapid increase in data volume is not the only challenge enterprises will face. End users
want more information at more granular levels of detail, and they want flexible,
integrated, timely access. The number of users and the number of queries are also
growing dramatically. Additionally, organizations are placing more emphasis on high-
value, data-intensive applications such as CRM. All of these developments pose
problems for enterprise data management. Fortunately, there is an effective answer to
these problems: scalable data solutions and, more specifically, scalable ETL
environments.

Scalability is defined as the retention or improvement of an application’s performance, availability, and maintainability with increasing data volumes. This paper will explore the three dimensions of scalability as they relate to ETL environments and will suggest some techniques that IT organizations can use to ensure scalability in their own systems.

SCALABLE PERFORMANCE
In the case of performance, scalability implies the ability to increase performance
practically without limit. Performance scalability as a concept is not new, but actually
achieving it is becoming much more challenging because of the dramatic increases in
data volume and complexity that enterprises are experiencing on every front. How long
will users tolerate growing latency in the reporting provided to them? How long can
enterprises keep adding hardware, installing new software, and tweaking existing applications or building new ones from scratch? The fact is, yesterday’s scalable solutions aren’t working in the
new environment.

The extract, transform and load (ETL) environment poses some especially difficult
scalability challenges because it is constantly changing to meet new requirements.
Enterprises must tackle the scalability problem in their ETL environments in order to
successfully confront increasing refresh frequencies, shrinking batch windows,
increasing processing complexity, and growing data volumes. Without scalability in the
ETL environment, scalability in the hardware or database layer becomes less effective.

In terms of implementing scalable data solutions, enterprises should adopt a "build it once and build it right" attitude. If an ETL environment is designed to be scalable from
the start, the organization can avoid headaches later. Let’s consider a situation in which
this is not the case, and the ETL environment is architected without consideration for
scalability. The first generation of this solution will be fine until data volumes exceed
capacity. At that point, the organization will be able to make the fairly easy move to a
second-generation environment by upgrading hardware and purchasing additional
software licenses. Once this solution is no longer sufficient, however, the enterprise will
find it more difficult and costly to evolve to a third-generation solution, which usually
involves custom programming and buying point solutions. Finally, once the third-
generation solution has reached its limits, the enterprise will need to rebuild its ETL
environment entirely, this time using scalable and parallel technologies. Clearly,
enterprises can save time and money by implementing a scalable ETL environment from
the very beginning.

Although a thorough discussion of techniques to ensure performance scalability is beyond the scope of this paper, following are some basic considerations for improving the performance of the ETL environment.

Reduce data flow
Reducing the volume of data flowing through the ETL pipeline is common sense, yet the technique is often overlooked in ETL applications. Some ways to reduce data flow:

• Avoid scanning unnecessary data. Are there data sets or partitions that do not
need to be scanned? Could the extra CPU/memory cost of passing unused rows
or columns be avoided? Could the data be structured in columnar data sets so
that only necessary columns would be read?
• Apply data reduction steps as early as possible in the ETL job flow, as sketched after this list. Could data be
filtered during the extract processing instead of during the transformation
processing? Could data be aggregated sooner in the processing stream, reducing
the rows sent for subsequent processing?
• Use tools to facilitate data reduction. Could a "change data capture" feature on a
source database be used?
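
To make the first two points concrete, the following sketch (Python, standard library only) filters rows and prunes columns during the extract step, before anything reaches the transformation stream. The file name, column names, and date cutoff are hypothetical; an ETL tool or database would typically express the same logic declaratively.

    import csv

    NEEDED_COLUMNS = ("order_id", "order_date", "amount")   # hypothetical layout

    def extract_reduced(path, min_date):
        """Yield only the needed columns of rows at or after min_date (ISO dates)."""
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                if row["order_date"] >= min_date:                  # filter early
                    yield {c: row[c] for c in NEEDED_COLUMNS}      # project early

    # Downstream transformation steps now see fewer rows and narrower records.
    for record in extract_reduced("transactions.csv", "2002-01-01"):
        pass  # transform / aggregate here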

Eliminate I/O
Whether implemented within the ETL application or, more likely, by the ETL tool, these techniques avoid I/O and reduce run times. The first three (in-memory) techniques are generally implemented within the database and often come "free" with an ETL tool that processes in the database. Pipelining is generally supported through an ETL tool.

• In-memory caching. Will caching a lookup file/table generate fewer I/Os than performing a join? Will the lookup file/table fit in memory, whether in OS buffers, tool buffers, or database buffers? (See the sketch after this list.)
• Hash join versus sort-merge join. Will the join result fit in memory?
• In-memory aggregates. Will the aggregation result fit in memory?
• Pipelining. Can data be passed from one stage of ETL processing to the next?
Does the data need to be landed to disk for reliability purposes?
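
As an illustration of the in-memory caching question, here is a minimal sketch, assuming a lookup table small enough to fit comfortably in memory. The file and column names are hypothetical; the same pattern applies whether the lookup comes from a flat file or a query issued once at start-up.

    import csv

    def load_lookup(path, key_col, value_col):
        """Read the lookup table once and hold it in a dictionary."""
        with open(path, newline="") as f:
            return {row[key_col]: row[value_col] for row in csv.DictReader(f)}

    country_names = load_lookup("countries.csv", "country_code", "country_name")

    def enrich(record):
        # One in-memory dictionary probe replaces a disk-bound join or a per-row query.
        record["country_name"] = country_names.get(record["country_code"], "UNKNOWN")
        return record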

Optimize common ETL processing


Certain ETL tools provide optimized implementations of common, processing-intensive operations.

• Change data capture. Will the records that have changed (been inserted or updated) be sufficient, on their own, for all the ETL logic required before populating the target system?
• Surrogate key optimization. Will the surrogate key logic supported within the
tool allow the warehouse dimensions to change in the manner users need? Or
does custom logic need to be written?
• Incremental aggregates. Does the tool support optimizations for only applying
deltas/changed data to aggregates?
• Bulk loading. Does the bulk loading support the unit of work features required
for recoverability?
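
The incremental-aggregate question above can be pictured with the following sketch: only the current cycle's deltas are applied to a persisted aggregate, instead of recomputing the aggregate from full history. The file name and record layout are hypothetical; a commercial tool would typically persist the aggregate in the warehouse itself.

    import json, os

    AGG_PATH = "sales_by_region.json"        # hypothetical persistent aggregate store

    def load_aggregates():
        if os.path.exists(AGG_PATH):
            with open(AGG_PATH) as f:
                return json.load(f)
        return {}

    def apply_deltas(aggregates, delta_records):
        """delta_records holds only the rows inserted or updated in this cycle."""
        for rec in delta_records:
            aggregates[rec["region"]] = aggregates.get(rec["region"], 0.0) + rec["amount"]
        return aggregates

    def save_aggregates(aggregates):
        with open(AGG_PATH, "w") as f:
            json.dump(aggregates, f)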

Eliminate bottlenecks and balance resources


Hardware performance tuning is essential and is heavily dependent on the ETL
application workload being run through the system or systems. There is no such thing as
a perfectly tuned system, but there are several tuning techniques that get the system
running close to optimal.

• Eliminating configuration bottlenecks. Is the disk farm (I/Os) a bottleneck? Is the memory buffer allocation a bottleneck? Is the network a bottleneck? Is the CPU or operating system configuration a bottleneck? Is the ETL tool configuration causing a bottleneck?
• Balancing resource utilization within applications. Does the hardware
environment have the right mix of CPU/disk/memory/network for the ETL
environment? Should hardware be added or subtracted to balance the application
needs?

Provide rules to determine environment scalability


Without defined methods for measuring scalability, an IT organization will have
difficulty quantifying the impact of improvements. Following are two common
measurements of scalability within an ETL environment.

• Linear speed-up. Linear speed-up is achieved when the run time of an ETL application drops by a factor of N given N times more resources.
• Linear scale-up. Linear scale-up is achieved when run time holds constant as data volume grows by a factor of N given N times more resources.
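
Expressed as simple ratios (a value of 1.0 means perfectly linear), these two measurements might be computed from controlled test runs as in the sketch below; the timings in the example are hypothetical.

    def speedup_efficiency(t_base, t_scaled, resource_factor):
        """Run time with 1x resources versus run time with N times the resources."""
        return (t_base / t_scaled) / resource_factor

    def scaleup_efficiency(t_base, t_scaled):
        """N times the data on N times the resources: run time should hold flat."""
        return t_base / t_scaled

    # Example: a 4-hour job finishes in 1.25 hours on 4x the hardware.
    print(speedup_efficiency(4.0, 1.25, 4))   # 0.8, i.e. 80 percent of linear speed-up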

Use data parallelism


Parallel processing is only slowly being adopted into ETL tools, although many
enterprises are already making use of it without the help of tools. Following are some of
the considerations for using data parallelism.

• Application data parallelism. Is the ETL job stream too small to be partitioned to
use parallelism? How will application logic such as joins and aggregates need to
be changed to accommodate partitioned data? How many different partitioning
schemes will be needed? How can the data be efficiently repartitioned? How will
the data be gathered back into serial files?
• Control and hardware parallelism. How will jobs running on different servers (a
distributed environment) be controlled and managed? How will data be moved
between different servers? How will failure recovery, monitoring and debugging
be implemented?
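
A minimal sketch of application data parallelism, using only the Python standard library, appears below: records are partitioned by a hash of the key, the partitions are transformed in parallel, and the results are gathered back into a single serial stream. The record layout is hypothetical, and a production environment would rely on the ETL tool or framework for repartitioning, recovery, and control across servers.

    from multiprocessing import Pool

    N_PARTITIONS = 4

    def partition(records, key="customer_id"):
        parts = [[] for _ in range(N_PARTITIONS)]
        for rec in records:
            parts[hash(rec[key]) % N_PARTITIONS].append(rec)   # same key -> same partition
        return parts

    def transform_partition(part):
        # Per-key logic (aggregates, surrogate keys) stays correct because a
        # given key never spans partitions.
        return [dict(rec, amount=rec["amount"] * 1.1) for rec in part]

    if __name__ == "__main__":
        records = [{"customer_id": i % 10, "amount": float(i)} for i in range(1000)]
        with Pool(N_PARTITIONS) as pool:
            results = pool.map(transform_partition, partition(records))
        serial_output = [rec for part in results for rec in part]   # gather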

Again, an exhaustive discussion of techniques to improve ETL performance is beyond the scope of this paper, but hopefully these suggestions will encourage IT organizations to explore opportunities for performance improvement in their own ETL environments.

SCALABLE AVAILABILITY
Availability hardly needs defining. An application is available if its end users can use it when they expect to be able to do so. End users and IT staff generally agree beforehand on the level of availability that is needed. IT must then design and operate applications such that the expected service level is actually delivered.

Let’s think of availability in different terms for a moment. Let’s assume that the end user
defines availability as one helium balloon staying no less than 10 feet above the ground. If
left alone, the balloon eventually descends as a result of helium molecules leaking
through the balloon’s skin or through the seal. To retain the contracted availability,
someone periodically adds more helium to the balloon. The refilling process requires
some time, however modest. The refilling must occur outside the contracted availability
window. Therefore, it is probably unrealistic to expect 24-hour availability every day of
the week and every week of the year. Most helium balloons fall to the floor within three
or four days after having been filled. That means the availability contract can’t be met for
more than about three days before being broken to make time for a refill.

What if the end user really needs 24x7x365 helium balloon availability? One way to
provide it would be to add a second balloon into the system. While one is floating, the
other could be pressurized. Because pressurizing a balloon takes far less time than the slow leakage that eventually brings a floating balloon below the 10-foot threshold, two balloons can meet complete 24x7x365 availability relatively easily. This system utilizes redundancy to provide availability. The waiting, pressurized balloon is the "hot spare."

Consider the consequences of redefining availability to a 20-foot threshold. We now pressurize the balloons with additional helium, which stretches the skin more and
stresses the balloons more. The probability that a balloon will eventually fail (probably
with a loud BANG) after a few refills increases. We must now track the number of refills
a balloon has experienced to gauge its life against its expected age at the higher pressure
and retire it long before it actually explodes. This process requires us to increase the
redundancy by retaining at least one "cold spare" to add into the cycle of redundancy
needed for full availability.

As the availability threshold rises, the stress on the system increases, as well; it becomes
more difficult and more expensive to maintain the expected availability. The difficulty
and expense are closely correlated. As the system becomes more stressed, additional
vigilance is required to avoid unexpected outages. Additionally, as the stress goes up, the
number of balloons in the system goes up, and the frequency of maintenance operations
(pressurizing and replacing balloons) rises.

The helium-balloon metaphor models modern information technology systems pretty
well. If left alone, balloons leak their helium and IT systems eventually fail. Technology
systems can fail for any number of reasons, including power outages, disk failure, memory
failure, software errors, etc. Balloons require refills; IT systems require maintenance and
administration. For example, upgrading or patching the operating system may require an
outage. Reorganizing database storage structures may also require an outage.

Big-data systems are especially at risk of outage because they inherently involve large
numbers of storage devices. Although the mean time between failures (MTBF) for a single device might be thousands of hours, the system-wide MTBF drops dramatically when hundreds or thousands of devices are combined, because their individual failure rates add up. Weekly disk drive failures are the norm rather than the exception in multi-
terabyte repositories. As with the helium balloons, redundancy can effectively diminish
the ill effects of these failures.
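
The arithmetic behind this is simple. Assuming independent failures at a constant rate, and an illustrative device MTBF of 500,000 hours (not a vendor specification), a repository with a few thousand drives sees failures on roughly a weekly basis:

    device_mtbf_hours = 500_000      # illustrative figure for a single disk drive
    device_count = 3_000             # a multi-terabyte repository
    hours_per_week = 24 * 7          # 168

    # With independent devices, individual failure rates add up.
    system_mtbf_hours = device_mtbf_hours / device_count                      # ~167 hours
    failures_per_week = device_count * hours_per_week / device_mtbf_hours     # ~1.0

    print(system_mtbf_hours, failures_per_week)   # about one device failure per week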

Redundancy is a frequently deployed tool in IT’s arsenal. Particularly apropos is the concept that as scale increases, stress increases and, at certain stress points, special
considerations are needed to deliver contracted availability. Therefore, even with a static
availability requirement, if system scale increases, the cost of delivering that same
availability increases. Availability scaling differs from performance scaling in that
performance scaling can often be achieved with a linear or better than linear relationship
between system scale and cost. Availability scales linearly within ranges. Additional
expense often accompanies the jump from one scale range to the next, eliminating the
linearity of the scale/cost relationship at these transitions.

Now that we’ve established that special considerations are called for in scalable big-data
applications, let’s look specifically at availability in ETL systems. End users generally do
not interact directly with ETL applications. Despite this, ETL systems often must
perform their functions within narrow batch windows. Availability of the ETL systems
during these windows is critical to maintain contracted service levels for downstream
systems, generally data warehouses. If the ETL process completes at 9 a.m. instead of 7
a.m., end users who report against the database at 8 a.m. may retrieve stale data and may
further be confused by that same report showing conflicting information at 10 a.m.

What can be done to attain and retain contracted availability levels in big-data ETL
applications? For ETL applications, availability means extracting data from source
systems on time, transforming it on time and loading it on time. Some tips on attaining
availability in ETL applications include:

Implement higher redundancy for persistent data stores


Most sophisticated ETL processes produce and maintain data stores during the
transformation process. These stores fall into two categories: persistent and temporary.
Persistent stores live from ETL cycle to cycle. Frequently, these stores contain
information critical to computing aggregates or otherwise producing load-ready data in
the next ETL cycle. Recomputing this persistent store from scratch can be extremely
slow and expensive, if not impossible. If several months or years of transactional data are
summarized in this persistent store, it may well be impossible to recreate the summary.
Even if it were technically possible to resummarize the transactional data, it can often be
very difficult to arrive at the same result. Therefore, it’s best just not to lose the
persistent data store in the first place. Backups are clearly important, but the
backup/restore cycle is generally too slow to accommodate regular disk outages. Disk
redundancy (RAID 1 or RAID 5) is a far better solution. RAID 1 (full mirroring) can be
superior to RAID 5 (parity-based redundancy) or its derivatives in terms of greatly
reducing the probability of an outage, but RAID 1 comes at significant capital cost.

Implement less redundancy for temporary data repositories


ETL processes often produce transient or temporary result sets. These generally don’t
have to be protected if they can be easily recreated. If the time to recreate one of these
temporary results would be high enough to prevent the ETL process from fitting within
its allotted batch window, RAID 5 storage should be utilized.

Separate transient data from stateful/persistent data


Because persistent stores require different service levels than transient stores, separating
the two is essential to achieving high availability affordably.

Utilize cold spare servers
ETL applications rarely justify hot spare servers. The cost of operating a hot spare server
in all but the largest distributed architectures is prohibitive. If an ETL system consists of
eight or more servers, having an extra as a hot spare may be justifiable in that it
represents about 11 percent of the total cost. On the other hand, if the ETL system
consists of a single server, the hot spare represents 50 percent of the total server cost and
would probably be less than one percent utilized. In large data centers with
standardized server platforms, a cold spare with multiple potential personalities could
act as a spare for more than one ETL application, spreading the cost across multiple
systems. Thus, a cold spare is more easily justified but does require some degree of
platform and configuration standardization.

Utilize off-line backups


Backups are hardly conceptually novel. Nevertheless, in a big-data setting, special
consideration must be given to the amount of time required to perform a full or partial
restore. A conceptually simple restore can prove quite difficult and time-consuming in
practice if full backups are taken rarely and incremental backups taken frequently. A
good archive management package is critical. Furthermore, when a surgical restore is
called for, many partial backups can lead to a quicker restore than a single monolithic
backup. Design the backup process primarily to allow smooth and quick restores. Don’t
back up for backup’s sake; back up for restore’s sake.

Checkpoint at the right intervals


A complex ETL application should not occupy the full batch window. Some part of the
window must remain in reserve to allow for recovery in case of fault. All portions of the
ETL flow must then be designed to be recoverable within this reserve window. Such a
strategy allows the ETL process to fit within the batch window even if a single fault
occurs. Checkpointing is an effective technique for engineering intermediate recovery
points in a complex and long-running application.
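
A minimal sketch of stage-level checkpointing follows, using only the standard library: each completed stage is recorded, and a restart skips stages that already finished so that recovery consumes only the reserve portion of the batch window. The stage names and checkpoint file are hypothetical; ETL tools generally provide equivalent restart facilities.

    import json, os

    CHECKPOINT_FILE = "etl_checkpoint.json"

    def load_checkpoint():
        if os.path.exists(CHECKPOINT_FILE):
            with open(CHECKPOINT_FILE) as f:
                return set(json.load(f))
        return set()

    def mark_done(done, stage):
        done.add(stage)
        with open(CHECKPOINT_FILE, "w") as f:
            json.dump(sorted(done), f)

    def run(stages):
        done = load_checkpoint()
        for name, func in stages:
            if name in done:
                continue                    # completed in a prior attempt; skip on recovery
            func()
            mark_done(done, name)
        os.remove(CHECKPOINT_FILE)          # clean finish: the next cycle starts fresh

    # run([("extract", do_extract), ("transform", do_transform), ("load", do_load)])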

Clean transient data at the end of each cycle
Outages caused by exhausting free storage space for transient data are easily avoided by
religiously removing transient data left behind by each ETL cycle.
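
A minimal sketch of such a cleanup step, assuming a dedicated scratch directory and file naming convention (both hypothetical):

    import glob, os

    SCRATCH_DIR = "/data/etl/scratch"       # hypothetical transient-data area

    def clean_transient_data(pattern="*.tmp"):
        """Run as the last step of every ETL cycle."""
        for path in glob.glob(os.path.join(SCRATCH_DIR, pattern)):
            os.remove(path)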

Design the system such that no server is special


In a distributed server configuration, all servers should be as similarly configured as
possible. If any is treated as the master or is, in some way, special, it will quickly become
the weak link. Heterogeneous configurations can require multiple spares where a single
cold spare would otherwise do.

Publish the freshness of the data in the database


In the unfortunate circumstance that an ETL cycle exceeds its allotted batch window, end users may access the database mid-load, assume the cycle completed successfully, and then misinterpret the data they retrieve. Publishing the freshness of the data in the database, preferably on every report or in the OLAP tool, avoids this confusion.
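
One simple way to publish freshness is to have the final step of a successful load write an "as of" timestamp to a one-row metadata table that every report and OLAP view can display. The sketch below assumes hypothetical table and column names and uses sqlite3 as a stand-in for the warehouse database.

    import sqlite3
    from datetime import datetime, timezone

    def publish_freshness(db_path, business_date):
        conn = sqlite3.connect(db_path)
        with conn:   # commits only if every statement succeeds
            conn.execute("""CREATE TABLE IF NOT EXISTS load_status
                            (business_date TEXT, loaded_at TEXT)""")
            conn.execute("DELETE FROM load_status")
            conn.execute("INSERT INTO load_status VALUES (?, ?)",
                         (business_date, datetime.now(timezone.utc).isoformat()))
        conn.close()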

Make data loads atomic


The load phase of an ETL process should be atomic. Either all the data is published or
none is. This avoids end users reporting against a partial refresh in situations when the
ETL cycle exceeds its allotted batch window.
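
A common way to achieve this is to stage the refresh and publish it inside a single transaction, so readers see either the previous complete load or the new one, never a mix. The sketch below assumes a hypothetical sales table whose staging copy has already been populated by the transformation step, again with sqlite3 standing in for the warehouse database.

    import sqlite3

    def atomic_publish(db_path):
        """Publish sales_staging into sales as one all-or-nothing transaction."""
        conn = sqlite3.connect(db_path)
        try:
            with conn:   # a single transaction: committed in full or rolled back in full
                conn.execute("DELETE FROM sales")
                conn.execute("INSERT INTO sales SELECT * FROM sales_staging")
        finally:
            conn.close()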

Monitor everything in the chain between E and L


An effective technique for restoring some of the linearity between system scale and cost
of delivering a certain level of availability is to monitor the entire ETL chain thoroughly.
Automate exception reporting and alert operations staff to impending problems.
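
As a sketch of what automated exception reporting can look like, the wrapper below times each ETL step, logs its outcome, and hands failures or budget overruns to an alert hook; the alert function is a hypothetical placeholder for whatever paging or e-mail mechanism operations staff already use.

    import logging, time

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("etl")

    def alert_operations(message):
        # Placeholder: integrate with the site's paging or e-mail system here.
        log.error("ALERT: %s", message)

    def monitored(step_name, func, max_seconds):
        start = time.time()
        try:
            result = func()
        except Exception as exc:
            alert_operations(f"{step_name} failed: {exc}")
            raise
        elapsed = time.time() - start
        log.info("%s completed in %.1fs", step_name, elapsed)
        if elapsed > max_seconds:
            alert_operations(f"{step_name} ran {elapsed:.0f}s, over its {max_seconds}s budget")
        return result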

Scaling availability in a big-data ETL application ultimately comes down to good planning and design and to tremendous discipline. In a large-scale ETL system, an ounce
of prevention can be far cheaper (read tens of thousands of dollars cheaper) than a pound
of cure.

SCALABLE MAINTAINABILITY
In point of fact, “maintainability” is perhaps not the best word for the third scalability dimension, because it involves more than software maintainability alone. This dimension is primarily about achieving a low cost of ownership and keeping that cost predictable over time.

In short, a scalable ETL application’s cost scales predictably over time. Over time, two
aspects of ETL applications increase cost: increasing data volumes (and, consequently,
workloads) and feature-set enhancement. Cost-of-ownership predictability in each of
these areas implies different cost growth curves. As data volumes increase, a predictable
cost curve is one where cost increases linearly or sub-linearly. In other words, if an ETL
environment costs $300,000 per year to operate when the base data volume being
processed is one terabyte, the cost should be no more than $600,000 per year when the
volume increases to two terabytes. Cost predictability with regard to feature set
enhancement means two things. First, the cost of the enhancement itself should be
commensurate with the enhancement. Second, once enhanced, the software’s cost of
ownership should change primarily due to workload increases and not software
maintenance needs.

Before detailing the primary cost drivers and how best to manage them into
"predictability," we need to address the issue of measuring cost. Cost-of-ownership
measurement is more art than science. Ownership cost includes several components:
software development, testing and maintenance; capital expenditure; hardware expense;
software licensing expense; data center operations; and others. Furthermore, cost
depends on service level commitments on various quality of service (QoS) metrics. These
include availability, performance, timeliness, accuracy, completeness, disaster recovery
time and unplanned event rates.

This paper assumes that anyone trying to manage cost of ownership first implements
consistent, systematic processes for measuring ownership cost and, second, adjusts these
costs fairly with respect to service-level commitments. For example, increasing an
application’s availability metric while expecting cost of ownership to remain unchanged
is folly.

Following are various drivers that determine whether cost of ownership remains
predictable over time or not. These drivers should be taken into consideration when
attempting to improve cost-of-ownership scalability.

Application cost scalability depends on the hardware platform and hardware architecture
Assume that an ETL application costs $100,000 per year to own and is processing one
terabyte of data per day. When the volume demand increases to two terabytes per day,
the current platform is unable to scale to support such a workload. A new platform may
have to be chosen, and the software may have to be rearchitected to take advantage of
the more scalable hardware. These activities add dramatically to the ETL application’s
cost of ownership for the current year.

Cost scalability depends on frequency of software redesign


If scaling an application requires the software to be redesigned, the cost scalability
suffers. Software redesign may be called for to better take advantage of hardware
resources. For example, if an ETL application is written to run on a single server, and
increased data volumes call for a clustered architecture, software redesign may be necessary
to take advantage of the clustered hardware. Such a redesign effort would undoubtedly
defeat the application’s cost scaling.

QoS degradation affects cost scaling


Certain QoS measurements naturally degrade as data volumes grow over time. For
example, if an application’s data volume doubles, the number of disks required may also
double. Twice the disks means double the frequency of disk outages. Unless service levels
are specifically maintained as volumes increase, the application ownership cost will
increase. The hardware and software architecture must be designed to retain consistent
service levels as the data volume increases.

Data center operational costs must scale


Ideally, an application’s operational costs stay constant as data volume increases.
Unfortunately, operational costs often increase with data volume and workload. For
example, as an application grows and is scaled up on more servers and uses more storage
devices, the operations staff often increases to support the extra hardware. The staffing
level and hardware quantity can be decoupled if the proper hardware architecture is
matched to a well-designed ETL software environment.

Software maintenance workloads must scale


As with data center operational costs, software maintenance cost should be decoupled from application workload and data volume. Often this decoupling breaks down as the application reaches stressful workloads or data volumes, putting the application back in the software development cycle and adding dramatically to its ownership cost.

Enterprises must take care not to let software enhancements deteriorate the software’s
maintainability
Complex, scalable ETL software contains features that address more than just
functionality. ETL software specifically addresses performance, scalability and other
quality of service features. These myriad features lend significant complexity to the ETL
software. This complexity calls for rigor in the original design and discipline during
enhancements. Enhancements that diminish the ETL software’s ability to meet or exceed
service levels result in a double whammy with respect to cost of ownership degradation.
First, the software maintenance cost goes directly up. Second, the operational costs go
up from having to implement artificial measures to restore service levels (for example,
hiring additional operational staff and deploying additional hardware).

Operability must be designed into scalable ETL software


Data center operational costs are substantially affected by the operability of the software
being managed. For example, ETL software that leaves behind few artifacts that might
clutter storage or confuse operators is easier to manage than software that does
otherwise. Operable software provides simple interfaces, has no side effects, integrates
well with schedulers and operations consoles, and behaves predictably under stress.

Cost-of-ownership scalability depends on a variety of factors. Maintaining predictable cost over a long period requires that enterprises address all of the aforementioned issues.

CONCLUSION
Developing a scalable ETL environment is well worth the extra time and effort it
requires. Enterprises can’t afford to ignore the massive data growth industry analysts are
predicting for the near future. Those who take steps now to make their infrastructures
more scalable will gain a competitive edge by having access to data that is detailed,
timely, and highly available.

This paper was authored by:


Knightsbridge Solutions LLC
500 West Madison, Suite 3100
Chicago, IL 60661
Phone: (800) 669-1555
Fax: (312) 577-0228

For further information or additional copies, contact Knightsbridge at (312) 577-5258.

