The Bloor Group

WHEN STREAMING BECOMES STRATEGIC
MapR Provides an Architectural Foundation for Data-Driven Enterprises

Robin Bloor, Ph.D. & Rebecca Jozwiak

WHITE PAPER

A New System of Record?


The system of record feeds the collection of applications that run the business, such as an ERP
system. Traditionally such systems passed information via ETL software to data warehouse
systems, which in turn fed data marts for the sake of BI and analytics applications. The defining
point is that the system of record holds the source data of the organization: the golden record of
the truth. Often different contributing applications did not agree perfectly on the definitions
of data, so between the system of record and the data warehouse sat master data management
(MDM) software to reconcile such differences and provide business level data definitions for
users to help understand the corporate data universe.
For years, that's how it was. It worked reasonably well until new technologies and influences
forced their way into the corporate IT environment. You can think of these as falling into two
categories:

The growth of data

The acceleration of data flows

Data didn't suddenly become big. On average, corporate data has been growing at roughly 60
percent per year for decades, and it continues to do so. Some of this growth can be attributed to
new applications, particularly web-facing applications and mobile applications. Some is from
external sources such as social media or public data feeds. Some is data we never bothered
to retain or process, such as that in log files. Some comes from office systems (email, instant
messaging, etc.).
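
To put the growth figure in perspective, the short calculation below compounds the roughly 60 percent annual rate quoted above. It is only a back-of-the-envelope sketch: it assumes the rate stays constant, and the function names are ours.

```python
import math

ANNUAL_GROWTH = 0.60  # the "roughly 60 percent per year" figure quoted above


def growth_factor(years: float, rate: float = ANNUAL_GROWTH) -> float:
    """Multiple by which a data volume grows after `years` of compounding."""
    return (1.0 + rate) ** years


def doubling_time(rate: float = ANNUAL_GROWTH) -> float:
    """Years needed for the volume to double at a constant annual rate."""
    return math.log(2.0) / math.log(1.0 + rate)


print(f"doubling time: {doubling_time():.2f} years")
print(f"growth over 10 years: {growth_factor(10):.0f}x")
```

At that rate, data volume doubles about every year and a half and grows roughly 110-fold in a decade, which is why steady growth eventually looks like a sudden change.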
Because these sources vary so widely, a good deal of the data did not arrive in a conveniently
structured form. Unstructured though some of it was, if it recorded or influenced decision making,
then it clearly qualified as part of the system of record. Hadoop and its attendant software
ecosystem emerged at the right time to qualify as a possible repository for such data.

The Data Flow Issue


While data continued to grow like bamboo, the second disruptive factor came into play.
Although we described this as the acceleration of data flows, we could have described it as
the trend toward data streaming. The old system of record and data warehouse arrangement was
founded on the idea that data lived in specific locations, and where there was a need, it was
replicated to other locations, including the data warehouse.
This data architecture came under pressure with the advent of the internet, which mandated
24 x 7 web-facing systems. These applications, in turn, impacted corporate applications via
web services and a service-oriented architecture, forcing some of them to run 24 x 7. This
engendered first generation messaging systems (such as ESBs) to service the flow of data and
messages between programs. Gradually, the time windows during which data could be moved
became shorter and shorter, provoking software changes and often more hardware to speed
up the process.
It has become increasingly apparent that a streaming (i.e., data flow-oriented) architecture is
necessary. Such an architecture can and should be seen as distinct from the streaming systems
that were introduced quite early on in the financial services sector for automated trading.
Although these were streaming architectures of a kind, they were focused on a particular
application rather than built to provide a general purpose streaming environment.

Streaming Architecture
What defines a streaming architecture is that it focuses equally on data flow and data storage.
It could be described as an event-driven architecture in the sense that the data that it presides
over includes events: website events, application events, customer events, social media events,
analytics triggers, sensor data, log file events, and so on. In the past decade, we have steadily
moved from a transactional world to an event-based world. The system of record now needs
to include the events that determine the behavior of the business, whether those events lead to
immediate action or are simply stored for later analysis.
Even businesses that are not real-time, in the sense of needing to process data as soon as it
arrives, need to think in terms of a streaming architecture. Ultimately, batches of data are
collections of events and should be processed event-by-event if possible. Operating in this way
makes it possible to reduce latencies without any need to make fundamental changes to the
software architecture. Either there is adequate capacity in the IT environment, or you deal with
latency simply with the judicious addition of hardware resources. Older transactional software
architectures are unable to reduce latencies so easily and may even get bent out of shape just
by increases in data volumes.
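
The point that a batch is simply a collection of events can be sketched in a few lines. In the illustration below, the same per-event handler serves both a stored batch and a simulated live feed, so moving from one to the other requires no change to the processing logic. The event shape and all names are hypothetical, chosen only for the example.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Event = Tuple[str, float]  # (customer_id, amount): a hypothetical event shape


def apply_event(totals: Dict[str, float], event: Event) -> None:
    """Update the running totals with a single event."""
    customer, amount = event
    totals[customer] = totals.get(customer, 0.0) + amount


def process(events: Iterable[Event]) -> Dict[str, float]:
    """Consume events one at a time, whether from a stored batch or a live feed."""
    totals: Dict[str, float] = {}
    for event in events:
        apply_event(totals, event)
    return totals


def live_feed(source: List[Event]) -> Iterator[Event]:
    """Stand-in for a stream: yields events one by one as they 'arrive'."""
    for event in source:
        yield event


nightly_batch = [("alice", 20.0), ("bob", 5.0), ("alice", 7.5)]
batch_result = process(nightly_batch)
stream_result = process(live_feed(nightly_batch))
```

Both calls produce the same totals. The only difference is when the events become available, which is a matter of capacity and scheduling rather than of software design.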

Hadoop and the Problem of Data Gravity


We noted that two factors brought disruption to the IT environment in recent years: the growth
of data and the acceleration of data flows. Hadoop gained its popularity because it provided an
economical way to manage the first of these two factors. But, on its own, it only handles
half the problem. The reality is that data has gravity in the sense that if you accumulate all the
data in a single place, then moving it starts to become prohibitive. You become obliged to move
the processing (the applications that wish to use the data) to the data.
This eventually forces centralization and puts an awkward constraint on flexibility. Ultimately,
if you build an ever-expanding data lake, applications will drown in it. Software architecture
requires being able to distribute data and processing flexibly across available resources. In our
view, the event-driven world that is gradually emerging requires both an ability to scale up the
processing of data and an ability to manage data flows. Both need to be intelligently catered
for.

MapR 5.1 and MapR Streams


The MapR Converged Data Platform was built with the idea of data movement in mind. For
a variety of reasons, MapR chose not to use HDFS directly, instead preferring to develop its
own file system, MapR-FS, which is POSIX compliant, supports the HDFS API and is
better suited to the full range of application workloads, not just large batch jobs. It improves
performance for most workloads and delivers better overall data security. It also provides a
sound foundation for implementing data distribution.
MapR's goal was to establish a coherent global file system that could distribute data and
applications across multiple Hadoop clusters, both locally and remotely, between data centers
and into the cloud. It never subscribed to the idea that organizations should be limited to a
large, central data lake. A key component needed to realize this vision was a real-time data and
messaging transport system. This is what MapR Streams, released with MapR 5.1, delivers.
Figure 1 illustrates how the MapR Streams publish/subscribe capability works. As the left
side of the diagram indicates, MapR Streams is one of three components that can be accessed
directly by applications. The other two are MapR-DB and MapR-FS. An application, whether
it is a normal business app, one of the bulk processing apps or a stream processing app, can
interact with any one or all of these MapR components. Applications use MapR Streams either
to send messages or data to other applications or to receive messages or data from them. There
is no limit to the number of applications that can connect to MapR Streams.

Figure 1. MapR Converged Data Platform
[Diagram labels: Applications (Business Applications, Bulk Processing, Stream Processing); MapR-FS, MapR-DB, MapR Streams; MapR Platform Services; Producers 1 to 4, Topics 1 and 2, Consumers 1 to 3; "Global platform, distributable across local and remote clusters".]

As illustrated on the right side of the diagram, data producers (publishers) or data consumers
(subscribers) connect to a specific data substream (a topic). One or more publishers send data
to the topic, and it is transmitted at once (record-by-record from memory in real time) to all the
consumers for that topic. Thus, the above diagram shows Producers 1, 2 and 3 writing data to
Topic 1, which is immediately sent to Consumers 1 and 2. Simultaneously, Producer 4 is writing
data to Topic 2 that is being transmitted to Consumer 3. A grouping of topics constitutes a
stream, which might be dedicated to a specific distributed application, business system or IT
service.
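
The publish/subscribe pattern described here can be illustrated with a small in-memory model. To be clear, this is not the MapR Streams API: it is a toy sketch with class and variable names of our own, and it leaves out persistence, delivery guarantees and security. It reproduces the Figure 1 scenario, with three producers writing to Topic 1 and one producer writing to Topic 2.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[str], None]


class Stream:
    """A named group of topics; each message goes to every subscriber of its topic."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a consumer callback for one topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: str) -> None:
        """Deliver a message, as it arrives, to all consumers of the topic."""
        for handler in self._subscribers[topic]:
            handler(message)


stream = Stream()
inbox: Dict[str, List[str]] = {"consumer1": [], "consumer2": [], "consumer3": []}

# Consumers 1 and 2 subscribe to Topic 1; Consumer 3 subscribes to Topic 2.
stream.subscribe("topic1", inbox["consumer1"].append)
stream.subscribe("topic1", inbox["consumer2"].append)
stream.subscribe("topic2", inbox["consumer3"].append)

# Producers 1 to 3 write to Topic 1; Producer 4 writes to Topic 2.
for producer in ("producer1", "producer2", "producer3"):
    stream.publish("topic1", f"event from {producer}")
stream.publish("topic2", "event from producer4")
```

Consumers 1 and 2 each end up with the three Topic 1 events, while Consumer 3 sees only the Topic 2 event. Producers never need to know which consumers exist, and that decoupling is what lets applications be added or moved without rewiring the others.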
MapR Streams supports multiple streams, with a maximum of 100,000 topics per stream and
no limit to the number of streams or to the number of messages/events within each stream. There
is also no limit to the number of producers that can write to a given topic and no limit to the
number of consumers that can receive data from a given topic. As such, you can think of MapR Streams as
a multitenant, in-memory streaming capability with a capacity of billions of messages/events
per second.
It is robust, guaranteeing message delivery and providing automatic management of
disconnection/reconnection in the event, for example, of the failure of a communications line.
Its security is part of a unified framework that also embraces MapR-FS and MapR-DB.
It provides authentication, wire-level encryption and a granular level of authorization
for producers, consumers and MapR Streams administrators. In practice, it requires little
administration beyond the definition of topics and streams and the specification of service
levels.

The Global Capabilities of MapR


With the addition of MapR Streams, MapR becomes a truly global data platform able to support
any type of distributed workload, ranging from the bulk processing of Hadoop applications
(MapReduce, Hive, HBase, etc.) to real-time stream processing using Spark, Storm or any other
data streaming capability. In its prior release, MapR already delivered global data distribution
capabilities via MapR-DB, but now its capabilities are more general, and a real-time data
transport capability is embedded in the data platform.

Figure 2. MapR Converged Data Platform: Real-Time Event Streaming
[Diagram: four instances of the platform (two in Data Center 1, one in Data Center 2, one in the cloud), each with Applications, Bulk Processing, Stream Processing, MapR-FS, MapR-DB, MapR Streams and MapR Platform Services, linked by Real-Time Data Transport.]

Figure 2 illustrates two instances of the platform in Data Center 1, one instance in Data Center
2, and one instance in the cloud. First, consider the disaster recovery (DR) possibilities. Using
MapR Streams, the cluster in Data Center 2 could be configured to mirror all the data held in
the three other clusters and be brought into action in the event of a catastrophic failure in Data
Center 1 or in the cloud. In fact, any of the MapR instances could be used in that way and kept
current, in real time.
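
The disaster recovery arrangement can be sketched as a standby that receives every write as it happens. The model below is a simplified illustration under our own assumptions, not a description of how MapR implements mirroring: the class names, cluster names and sample keys are hypothetical, and a direct method call stands in for the real-time transport between sites.

```python
from typing import Dict


class Standby:
    """Recovery cluster that keeps one up-to-date copy per source cluster."""

    def __init__(self) -> None:
        self.copies: Dict[str, Dict[str, str]] = {}

    def receive(self, source: str, key: str, value: str) -> None:
        """Apply a change streamed from a source cluster to that cluster's copy."""
        self.copies.setdefault(source, {})[key] = value


class Cluster:
    """Working cluster that streams every write to the standby as it happens."""

    def __init__(self, name: str, standby: Standby) -> None:
        self.name = name
        self.data: Dict[str, str] = {}
        self.standby = standby

    def write(self, key: str, value: str) -> None:
        self.data[key] = value
        self.standby.receive(self.name, key, value)  # a network hop in practice


standby = Standby()
clusters = [Cluster(name, standby) for name in ("dc1-a", "dc1-b", "cloud")]
clusters[0].write("order-17", "shipped")
clusters[1].write("sensor-4", "21.5C")
clusters[2].write("order-17", "invoiced")

# After a failure, the standby already holds each cluster's current state.
recovered = standby.copies["dc1-a"]
```

Because changes are forwarded record by record rather than copied in periodic bulk transfers, the standby stays current and no separate catch-up step is needed before it takes over.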
Data Center 1 depicts the distribution of local workloads. A Hadoop cluster can become
overloaded by too many applications competing for the same data. Adding more servers to
the cluster may fix the problem, but it is not guaranteed to do so. With the MapR platform, this
situation can be resolved by configuring another cluster and using MapR Streams to keep the
two clusters in sync. Alternatively, for workloads that work well together, the bulk processing
applications perhaps could be confined to one cluster, while streaming applications could run
on the other. MapR Streams can distribute the data that needs to be shared across the two
clusters.
This ability to distribute workloads across Hadoop clusters within a data center is remarkably
flexible and extremely useful. It enables true capacity planning and workload management,
rather than giving way to the Hadoop cluster sprawl that is currently becoming quite common.
It would, in fact, be relatively easy to migrate applications and their data from one cluster to
another when desired, or even to have the same application available to run on either cluster.
Similarly, it would be possible to transfer applications and their data between data centers.
This is a kind of multi-master replication: a situation where data is shared between multiple
instances or locations and can be updated from any one of them. This is a capability some
databases provide, and MapR provides it for both streams and databases.
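
Multi-master replication can be illustrated with two sites that both accept writes and forward them to each other. The sketch below is an assumption-laden toy, not MapR's mechanism: it resolves conflicts with a last-writer-wins rule based on a logical clock, a common technique that we have chosen purely for illustration, and it assumes every site is directly connected to every other.

```python
from typing import Dict, List, Tuple

Versioned = Tuple[int, str, str]  # (logical_time, origin_site, value)


class Site:
    """One replica that accepts local writes and applies updates from peers."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.clock = 0
        self.peers: List["Site"] = []
        self.store: Dict[str, Versioned] = {}

    def write(self, key: str, value: str) -> None:
        """Local update: stamp it, apply it, then forward it to every peer."""
        self.clock += 1
        update = (self.clock, self.name, value)
        self.store[key] = update
        for peer in self.peers:
            peer.receive(key, update)

    def receive(self, key: str, update: Versioned) -> None:
        """Remote update: keep it only if it is newer than what we hold."""
        self.clock = max(self.clock, update[0])
        current = self.store.get(key)
        if current is None or update[:2] > current[:2]:
            self.store[key] = update

    def read(self, key: str) -> str:
        return self.store[key][2]


us, eu = Site("us"), Site("eu")
us.peers, eu.peers = [eu], [us]

us.write("policy-9", "draft")      # written in the US, visible in the EU
eu.write("policy-9", "approved")   # updated in the EU, visible in the US
```

After these two writes, both sites hold the same value for the key, regardless of where each update originated. Received updates are applied but not forwarded again, which prevents them from echoing back and forth between sites.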

The Application Layer


Traditional IT architectures have provided a variety of components to cater for data flows. Aside
from the ponderous ETL products, there were change data capture and database replication
capabilities, message queues and ESBs, and streaming software, each catering for slightly
different usage contexts and none providing a comprehensive solution. Lacking coherent data
flow capabilities, Hadoop implementations often become silos. While that may be adequate for
the workloads involved, it does not support the timely flow of data between Hadoop clusters
or dependent applications in a cohesive way. Such siloed clusters are data sinks rather than
components of an event-oriented architecture.
MapR resolves these two awkward issues by catering for stream processing, bulk processing,
and everything in between and also by providing a versatile global distribution capability.
The primary use case for Hadoop is analytics in all its variety: time-series analytics, real-time
predictive analytics, data mining and discovery, graph analytics, analytics on unstructured
data and text, and so on. With the current release of MapR, an additional use case is added to
this: the maintenance of the system of record.
Bearing the capabilities of the MapR Platform in mind (security, versatile file system,
performance, real-time communications), what is emerging can be thought of as a global
operating system for data. Impressive though it is in its own right, the MapR Platform needs
also to be considered as a data service platform for both distributed and local applications. The
multitude of complementary Hadoop components (open source components such as Hive,
HBase, Drill, Spark, Storm, etc., as well as commercial components) provide an extremely
fertile application development and execution environment.
This is particularly the case for the analytics and BI applications that currently dominate the
Hadoop landscape. In terms of latency requirements, time series and predictive analytics are
generally streaming applications that need very low latencies. Some BI applications (alerts,
dashboards, performance management) are low latency, while others are less demanding. Data
discovery and exploration and big data analytics (on structured data, unstructured data, graph
data, text and so on) are averse to long wait times and hence need scale-out parallelism, but
don't demand very low latency.
Precise requirements vary from business to business, of course, but it is clear that both this
range of applications, and also the system of record, will be best served by a data environment
which implements a streaming architecture (or event-driven architecture) that is able to cater
for both bulk processing and real-time analytics.

MapR 5.1: A Customer Story


The new release of MapR has already been deployed in beta sites, including businesses in the
health sector and financial sector. The health care implementation was particularly interesting
because it required a global capability, as the company involved has data centers in the US
and in the European Union. Because of this, there were healthcare compliance requirements
involved: the exacting HIPAA regulations in the US and in-country storage regulations in the
EU, where personal health details needed to be stored in the country of origin. The company
wished to build a system of record, but because of compliance regulations, it would have to be
distributed. This is a requirement for which MapR is particularly well suited.
The overall objective was to build a global, flexible, compliant healthcare database. To add to
the complexity of the project, there were a variety of users (data consumers) using a variety of
applications, which meant that different data models needed to be catered for. Additionally, a
search capability was required, but, because of compliance constraints, not all data could be
searched.

Figure 3. Global Database (SOR) Implemented on MapR
[Diagram labels, shown for both the EU Data Center and the US Data Center: Applications; Search Engine (Elasticsearch); Data Input/Updates; Graph DB (Titan on MapR-DB); MapR-DB (JSON); MapR Streams; Records.]

An overview of the solution is illustrated in Figure 3. Applications local to the data center
would generally access local data, held on file, in the Graph DB or in MapR-DB. In this way,
all required data structures were catered for. Because of compliance restrictions, only some
data was replicated between data centers. For specific requirements, materialized views were
continuously computed in MapR-DB or the Graph DB for use by Elasticsearch. This was done
for the sake of performance.
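
The rule that only some data may leave a data center amounts to a filter applied before replication. The sketch below illustrates the idea only: the record layout and the `contains_personal_health_data` flag are hypothetical, and the actual compliance rules in the customer deployment are not described in this paper.

```python
from typing import Dict, List

Record = Dict[str, object]


def replicable(record: Record) -> bool:
    """Hypothetical rule: records holding personal health details stay in-country."""
    return not record["contains_personal_health_data"]


def select_for_replication(local_records: List[Record]) -> List[Record]:
    """Return only the records that may be sent to the other data center."""
    return [record for record in local_records if replicable(record)]


eu_records: List[Record] = [
    {"id": "r1", "contains_personal_health_data": True, "body": "patient history"},
    {"id": "r2", "contains_personal_health_data": False, "body": "clinic opening hours"},
    {"id": "r3", "contains_personal_health_data": True, "body": "lab result"},
]

outbound = select_for_replication(eu_records)
outbound_ids = [record["id"] for record in outbound]
```

Here only record `r2` qualifies to be replicated to the US data center, while `r1` and `r3` remain stored in the country of origin.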
The creation of the system of record was achieved as specified: secure, immutable, rewindable,
and auditable.
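
The properties "immutable" and "rewindable" can be made concrete with an append-only log read by offset. The sketch below is our own minimal illustration of the idea rather than the customer's implementation: records are only ever appended, and any reader can replay the history from any earlier position.

```python
from typing import Iterator, List, Tuple


class EventLog:
    """Append-only log: records are never changed or removed once written."""

    def __init__(self) -> None:
        self._records: List[str] = []

    def append(self, record: str) -> int:
        """Store a record and return its offset (its position in the log)."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset: int = 0) -> Iterator[Tuple[int, str]]:
        """Replay records starting at `offset`; offset 0 rewinds to the beginning."""
        for position in range(offset, len(self._records)):
            yield position, self._records[position]


log = EventLog()
log.append("record created")
log.append("address updated")
last_offset = log.append("consent withdrawn")

full_history = list(log.read_from())   # an auditor replays everything
recent = list(log.read_from(1))        # a consumer resumes from offset 1
```

Because the full history is retained in order, a view derived from the log can be rebuilt or audited at any time by replaying it from the start.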

In Summary
MapR's vision is quite distinct from that of other distributors. It delivers a different architectural
approach to Hadoop while shipping and supporting all open source projects in their entirety.
As far as we are aware, it is the only distribution that offers a truly global capability that
supports everything from real-time analytics to bulk processing. More than a data platform,
it is fast becoming an operating system for data and a global system of record. For companies
that are currently planning to implement Hadoop at a corporate level, we advise taking a close
look at MapR.

About The Bloor Group


The Bloor Group is a consulting, research and technology analysis firm that focuses on open
research and the use of modern media to gather knowledge and disseminate it to IT users.
Visit both www.TheBloorGroup.com and www.InsideAnalysis.com for more information.
The Bloor Group is the sole copyright holder of this publication.
Austin, TX 78720 | 512-524-3689