In a Modern Data Architecture
Data is Big
"Space is big," Douglas Adams mused in The Hitchhiker's
Guide to the Galaxy. "Really big." The same can be said of
data: It's big. Really big. You might think you have a lot of
data in your financial system, but that's just peanuts to
big data.
If you're saying to yourself, "I don't have Big Data, so a Data Lake doesn't
apply to me," please keep reading. The principles of the Data Lake and
Modern Data Architecture apply well beyond sheer volumes of data.
Answering the Needs of the Business
Business analytics have been around as long as business itself. Analytics began the first time
someone tabulated two columns of numbers and used the difference between them to determine
a profit or loss. As business evolved and companies collected more data, it became possible, and
important, to create reports and analyses on different facets of an organization. Fast forward
a few thousand years and the concepts of data warehousing and business intelligence became
the norm. These disciplines promoted a single, central version of the truth for an organization; a
repository to gather and integrate data to quickly and easily create reports.
The progress of analytics was a response to evolving business
needs. Today's business leaders understand that data still
holds the key to understanding the patterns of their customers,
competitors and markets. Only by analyzing this information
can they take action and make educated and supportable
decisions.
The traditional data warehouse/business intelligence approach has
done a great job of simplifying data access and reporting, as well
as combining data from many sources, in order to answer all of the
questions an organization may have. But it's impossible to anticipate
every question a business might ask and every report it might need.
Metrics change from year to year, month to month and sometimes even
day to day.
In a traditional data warehouse solution, organizations would probably
ignore most of these external data sources because they are either too
voluminous or in a format that is difficult to manipulate and store. If
companies used any of it, it was probably for an edge-case reporting need.
Such limitations often result in potentially valuable data and insights being
inaccessible and possibly lost forever.
Apache Hadoop and the Hadoop data lake are at the center of the big data
movement. A data lake is a repository that stores vast amounts of raw data
for future use. With all the media hype, it is difficult to sift through the
buzzwords and understand where and even if these new technologies
make sense for your analytics needs. Many people believe that implementing
a Hadoop data lake means throwing away their investment in a data
warehouse. This perception ends up either sending them down the wrong
path or causing them to sideline big data as a future project.
The good news? Hadoop, big data and the data lake
don't replace a company's existing investment in
analytics. In fact, they complement it very nicely. By
building a Modern Data Architecture, organizations
can continue to leverage their existing investments in
analytics, while collecting all of the data they have been
ignoring or throwing away, all while enabling analysts
to get company data and insights faster.
Introducing the Modern Data Architecture
Big data technologies support and enhance modern analytics but do not necessarily replace traditional
analytics systems. Building a Modern Data Architecture that incorporates all of the benefits of a data lake,
combined with the high-speed query and analytics provided by traditional relational data warehouse and
online analytical processing (OLAP) engines, supports data consumption at all levels of the business. It
also provides all classes of data consumers with the capabilities they require.
All data, regardless of form, is collected into the Persistent layer of
the Data Lake
Data from all internal and external source systems, including structured, semi-structured
and unstructured data as well as streaming sources, is gathered in a single Persistent
layer in the data lake. Not all data in the Persistent layer is promoted to subsequent layers;
some is simply retained for future analytics use cases. Data scientists and analysts are granted
access to the data at this layer in order to perform discovery and experimentation in an
Analytics Sandbox set aside for their use. As these analysts identify new data sources that
may provide additional business insight, they will help to shape and Curate this data to
provide self-service analytics to a broader audience.
Analysts and data scientists help shape and Curate the data for
business use
As self-service analysts continue to refine the use of Curated data sources, they will work
with the data management team to Operationalize data to be presented to the broadest
audience of the business. Since these data artifacts are generally consumed through the
highest levels of the organization and are required for day-to-day decision making, they will
ultimately reside in the high-speed query engines of the Enterprise Data Warehouse (EDW)
and OLAP layers to support typical Business Intelligence functions.
The EDW supports a subset of data (generally governed by time). The Hadoop data lake
provides the opportunity to create an Active Archive to store additional historical data and
make it available for query for extended analytics use cases.
Maintaining control and records of the content stored in the various layers
of the data lake is very important. Having a strong but flexible governance
policy and mechanism for metadata and content management to support
discovery, standardization, master data management and security is a key
factor in the success of implementing a big data strategy.
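
To make the flow between layers concrete, here is a minimal sketch of how a metadata catalog might track which layer each data set lives in and who approved each promotion. The Catalog class, the data set names and the approver labels are all hypothetical; they illustrate the idea rather than any particular governance product.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Layer(IntEnum):
    """Functional layers named in the text, in promotion order."""
    PERSISTENT = 1       # raw landing zone: everything is collected here
    CURATED = 2          # shaped by analysts for self-service use
    OPERATIONALIZED = 3  # served from the EDW and OLAP engines


@dataclass
class Dataset:
    name: str
    source: str
    layer: Layer = Layer.PERSISTENT
    history: list = field(default_factory=list)  # audit trail for governance


class Catalog:
    """Hypothetical metadata catalog recording where each data set lives."""

    def __init__(self):
        self._datasets = {}

    def ingest(self, name, source):
        # Every data set starts in the Persistent layer, whatever its form.
        ds = Dataset(name, source)
        ds.history.append(("ingested", Layer.PERSISTENT.name))
        self._datasets[name] = ds
        return ds

    def promote(self, name, approved_by):
        # Move a data set up exactly one layer and record who approved it.
        ds = self._datasets[name]
        if ds.layer is Layer.OPERATIONALIZED:
            raise ValueError(f"{name} is already in the top layer")
        ds.layer = Layer(ds.layer + 1)
        ds.history.append((f"promoted by {approved_by}", ds.layer.name))
        return ds

    def in_layer(self, layer):
        return sorted(d.name for d in self._datasets.values() if d.layer is layer)


catalog = Catalog()
catalog.ingest("web_clickstream", "streaming")
catalog.ingest("sales_orders", "relational")
catalog.promote("sales_orders", approved_by="analyst")
catalog.promote("sales_orders", approved_by="data management team")

print(catalog.in_layer(Layer.PERSISTENT))
print(catalog.in_layer(Layer.OPERATIONALIZED))
```

In this sketch the clickstream data stays in the Persistent layer for future use while the sales data is promoted twice, and the history list keeps the record of each step that a governance policy would rely on.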
How Does a Data Lake Differ from a Data Warehouse?
Wikipedia¹ defines data warehouses as:
"Central repositories of integrated data from one or more
disparate sources. They store current and historical data and
are used for creating trending reports for senior management
reporting such as annual and quarterly comparisons."
This is a very high-level definition that describes the
purpose of a data warehouse, but doesn't explain
how the purpose is achieved.
1. From "Data Warehouse," Wikipedia: https://en.wikipedia.org/wiki/Data_warehouse
2. "Ralph Kimball," Wikipedia: https://en.wikipedia.org/wiki/Ralph_Kimball
3. "Bill Inmon," Wikipedia: https://en.wikipedia.org/wiki/Bill_Inmon
4. From the blog of James Dixon: https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/
Data warehouse development is
characterized by requiring lots of
discovery, planning and development
work before any data makes it into
the warehouse.
By way of contrast, the term "data lake" was coined by Pentaho CTO James
Dixon.⁴ He describes a data mart (a subset of a data warehouse) as akin to a
bottle of water, cleansed, packaged and structured for easy consumption,
while a data lake is more like a body of water in its natural state. Data flows
from the streams (the source systems) to the lake. Users have access to the
lake to examine, take samples or dive in.
So, to summarize,
a data warehouse is a highly structured store of the data that the business has
deemed important while a data lake is a more organic store of all data without
regard for the perceived value or structure of the data.
1 Data Lakes Retain All Data
During the development of a data warehouse, a considerable amount of time is spent
analyzing data sources, understanding business processes and profiling data. The result is
a highly structured data model designed for reporting. A large part of this process includes
making decisions about what data to include and to not include in the warehouse. Generally,
if data isn't used to answer specific questions or in a defined report, it may be excluded from
the warehouse. This is usually done to simplify the data model and also to conserve space on
expensive disk storage that is used to make the data warehouse performant.
In contrast, the data lake retains ALL data: not just data that is in use today but
data that may be used someday, and even data that may never be used at all,
just in case. Data is also kept for all time so organizations can go back to any
point in time to do analysis.
This approach becomes possible because the hardware for a data lake usually differs greatly
from that used for a data warehouse. Commodity, off-the-shelf servers combined with cheap
storage make scaling a data lake to terabytes and petabytes fairly economical.
2 Data Lakes Support All Types of Data
The data lake approach embraces these non-traditional data types. Data
lakes store all data, regardless of source and structure. Data is kept in its
raw form and only transformed when it is ready for use. This approach is
known as "Schema on Read," as opposed to the "Schema on Write" approach used in
the data warehouse.
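
A small sketch can show what Schema on Read means in practice. Below, raw records are kept exactly as they arrived, in mixed formats, and a schema is applied only when someone reads them. The sample records, the field names and the read_with_schema helper are all made up for illustration; a real data lake would do this with its own query engines rather than hand-written Python.

```python
import csv
import io
import json

# Raw records are stored exactly as they arrived (hypothetical sample data).
RAW_STORE = [
    ("json", '{"customer": "C1", "amount": "19.99", "channel": "web"}'),
    ("csv", "C2,5.00"),
    ("json", '{"customer": "C3", "amount": 42}'),
]


def read_with_schema(raw_store, schema):
    """Apply a schema at read time; the stored payloads are never modified."""
    for fmt, payload in raw_store:
        if fmt == "json":
            record = json.loads(payload)
        elif fmt == "csv":
            values = next(csv.reader(io.StringIO(payload)))
            record = dict(zip(("customer", "amount"), values))
        else:
            continue  # unknown formats stay in the lake for later use
        # Fields missing from a record become None instead of blocking the read.
        yield {
            name: cast(record[name]) if name in record else None
            for name, cast in schema.items()
        }


# Two different questions, two different schemas, one unchanged raw store.
sales_schema = {"customer": str, "amount": float}
channel_schema = {"customer": str, "channel": str}

sales_rows = list(read_with_schema(RAW_STORE, sales_schema))
channel_rows = list(read_with_schema(RAW_STORE, channel_schema))
total = sum(row["amount"] for row in sales_rows)
```

With Schema on Write, the channel field would have had to be planned into the table design before any data was loaded. Here it can be picked up later simply by reading the same raw data with a different schema.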
3 Data Lakes Support All Users
In most organizations, 80 percent or more of users are operational. They want to get their
reports, see their key performance metrics or slice the same set of data in a spreadsheet
every day. The data warehouse is usually ideal for these users because it is well structured,
easy to use and understand and it is purpose-built to answer their questions.
The next 10 percent or so do more analysis on the data. They use the data warehouse
as a source but often go back to source systems to get data that is not included in the
warehouse and sometimes bring in data from outside the organization. Their favorite tool
is the spreadsheet and they create new reports that are often distributed throughout the
organization. The data warehouse is their go-to source for data but they often go beyond its
bounds.
Finally, the remaining users do deep analysis. They may create totally new data sources
based on research. They mash up many different types of data and come up with entirely
new questions to be answered. These users may use the data warehouse but often ignore it
as they are usually charged with going beyond its capabilities. These users include the Data
Scientists and they may use advanced analytic tools and capabilities like statistical analysis
and predictive modeling.
4 Data Lakes Adapt Easily to Changes
One of the chief complaints about data warehouses is how
long it takes to change them.
Many business questions can't wait for the data warehouse team to
adapt their system for answers. This ever-increasing need for faster
answers has given rise to the concept of self-service business
intelligence.
However, this early access to the data comes at a price. The work typically done by the data
warehouse development team may not be done for some or all of the data sources required
to do an analysis. This leaves users in the driver's seat to explore and use the data as they see
fit. However, the operational users referenced earlier may not want to do that work. They
still just want their reports and KPIs.
These operational report consumers will make use of the more structured data views in
the data lake, those that resemble what they had in the data warehouse. The difference is
that these views exist primarily as metadata that sits over the data in the lake rather than
physically rigid tables that require a developer to change.
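
The idea of a view that exists only as metadata can be sketched with a few lines of SQL. The example below uses Python's built-in sqlite3 module purely as a stand-in, since it runs anywhere; in a data lake this role would be played by whatever SQL engine sits over the lake. The table, view and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_id, region, amount, status)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?, ?)",
    [
        (1, "east", 100.0, "shipped"),
        (2, "west", 250.0, "cancelled"),
        (3, "east", 75.0, "shipped"),
    ],
)

# The view is only a stored query (metadata); no data is copied.
conn.execute(
    "CREATE VIEW sales_by_region AS "
    "SELECT region, SUM(amount) AS total FROM raw_orders GROUP BY region"
)
before = conn.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall()

# The business changes the metric: cancelled orders should no longer count.
# Only the view definition changes; raw_orders is untouched.
conn.execute("DROP VIEW sales_by_region")
conn.execute(
    "CREATE VIEW sales_by_region AS "
    "SELECT region, SUM(amount) AS total FROM raw_orders "
    "WHERE status = 'shipped' GROUP BY region"
)
after = conn.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
```

Operational users keep querying the same view name before and after the change. Nothing was reloaded or restructured, which is what makes this kind of change quick compared with altering a physical warehouse table.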
Just Add Technology
The Modern Data Architecture described above is a functional
model. It describes layers within which data will be ingested,
organized and presented to the business, but it doesn't
specifically call out technologies that will be used to build
these layers. This functional model aligns to physical
layers within a final technical deployment.
Data Acquisition
This layer refers to the ingestion and initial movement of data from the
source systems, whether they be traditional relational/transactional
systems, user-generated data, unstructured or semi-structured data,
external data or streaming data.
Data Curation
In the Modern Data Architecture, Apache Hadoop plays a key role as a
data storage and curation layer. Using the data lake approach, all data,
no matter what type, is stored in the data lake and is organized, shaped
and made available for consumption by other layers. A variety of Hadoop
technologies are brought to bear in the curation layer to support the
required analytic and data processing workloads.
Data Provisioning
Operational reporting and analytics are best served by more traditional
data stores. The high-speed query capabilities of relational database
systems make them ideal for serving data to support interactive query
and analytics. Depending on the scale and needs of the organization, an
Enterprise Data Warehouse built on a relational database platform may be
coupled with several subject-oriented data marts to serve various reporting
needs. In addition, an online analytical processing (OLAP) engine can help
facilitate complex, interactive queries.
Data Consumption
This layer represents all end-user interfaces. A wide variety of tools and
technologies are available to fill the roles defined in this model. It should
be mentioned that although these physical layers may imply that there is
no direct flow of data from the Curation layer to the Consumption layer,
in some cases there is. The functional model supports the ability for some
users to connect directly to the data lake as needed.
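
The four physical layers can be pictured as four small steps in a pipeline. The sketch below is a toy illustration only: the source records are invented, plain Python stands in for the Hadoop curation layer, and an in-memory SQLite database stands in for the relational data warehouse.

```python
import sqlite3
from collections import defaultdict


def acquire():
    """Data Acquisition: pull records from (hypothetical) source systems."""
    relational = [{"store": "A", "sales": "120.50"}, {"store": "B", "sales": "80"}]
    streaming = [{"store": "A", "sales": "30"}, {"store": "?", "sales": "n/a"}]
    return relational + streaming


def curate(raw_records):
    """Data Curation: organize and shape raw data for use by other layers.

    Unparseable rows are skipped here, but the raw input is left unchanged.
    """
    totals = defaultdict(float)
    for rec in raw_records:
        try:
            amount = float(rec["sales"])
        except ValueError:
            continue
        totals[rec["store"]] += amount
    return dict(totals)


def provision(curated, conn):
    """Data Provisioning: load shaped data into a relational store."""
    conn.execute("CREATE TABLE store_sales (store TEXT PRIMARY KEY, sales REAL)")
    conn.executemany("INSERT INTO store_sales VALUES (?, ?)", curated.items())


def consume(conn):
    """Data Consumption: an end-user query against the provisioned store."""
    return conn.execute(
        "SELECT store, sales FROM store_sales ORDER BY sales DESC"
    ).fetchall()


raw = acquire()
curated = curate(raw)
conn = sqlite3.connect(":memory:")
provision(curated, conn)
report = consume(conn)
```

An analyst who needs the unparseable streaming record can still go straight to the raw list, which mirrors the direct path from the Curation layer to the Consumption layer mentioned above.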
To Cloud or Not to Cloud?
At least as popular as the topic of big data is the topic of cloud computing. Cloud
service providers give organizations the choice to avoid the costs of building and
managing a data center on premises by moving storage, compute and networking
to hosted solutions. Because a Modern Data Architecture involves a wide variety
of technologies and can represent a significant investment in both hardware and
software, a careful analysis of the options and costs is required before embarking
on the path of building a solution.
There are several considerations you should
take into account when making the decision
to deploy on-premises or in the cloud:
Is your data already in the cloud?
If all of your data is on premises, then you may be thinking it will be a challenge to move that data to
a data center outside of your four walls. However, if you are using hosted services for some or all of
your systems, your data may already be in the cloud. In this case, network bandwidth may be less of a
consideration and it may be easier for you to get your data to a cloud provider. A related consideration
is whether your company is global. If you have data centers all over the world, you may already have
bandwidth concerns moving data between your own data centers. In this case, it may actually be easier
to get your data to a cloud data center that is local to your own data centers than it is to centralize your
data at one of your own sites.
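
A quick back-of-the-envelope calculation can help gauge whether bandwidth is a real obstacle. The sketch below estimates how long a transfer would take; the data volumes, the link speed and the 70 percent utilization figure are assumptions chosen only for illustration, so substitute your own numbers.

```python
def transfer_days(data_terabytes, link_megabits_per_s, utilization=0.7):
    """Days needed to move a data set over a network link.

    utilization is the assumed fraction of the link actually available.
    """
    bits = data_terabytes * 1e12 * 8          # decimal terabytes to bits
    effective_bps = link_megabits_per_s * 1e6 * utilization
    return bits / effective_bps / 86_400      # seconds to days


initial_load = transfer_days(50, 1_000)       # 50 TB over a 1 Gbit/s link
daily_delta = transfer_days(0.2, 1_000)       # 200 GB of new data per day

print(f"Initial load: {initial_load:.1f} days")
print(f"Daily increment: {daily_delta * 24 * 60:.0f} minutes")
```

Under these assumed figures, the one-time initial load takes about 6.6 days while each day's new data moves in under 40 minutes, so the ongoing feed is far less of a concern than the first bulk transfer.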
In a survey conducted in early 2015 by Gartner⁵, CIOs named Business
Intelligence and Analytics as their top investment priority, followed closely
by Infrastructure and Data Center, and Cloud. When looking through the
other items on the list, such as Digitization/Digital Marketing and Customer
Relationship/Experience, it's clear they also fall under the data management
heading. Without good data management, the marketing organization can't
decide who to market to and the customer service organization doesn't
know what its customers are thinking. Because so much of the business is
driven by data, building a solid foundation for data analysis must be a high
priority for any organization that wants to make informed decisions.
5. From "CIOs name BI and Analytics No. 1 investment priority of 2015": http://gartnerevent.com/NABI13Survey