WHITE PAPER
Data didn't suddenly become big. On average, corporate data has been growing at roughly 60
percent per year for decades, and it continues to do so. Some of this growth can be attributed to
new applications, particularly web-facing applications and mobile applications. Some is from
external sources such as social media or public data feeds. Some is data we never bothered
to retain or process, such as that in log files. Some comes from office systems (email, instant
messaging, etc.).
Given these varied sources, a good deal of data did not arrive in a conveniently structured
form. Yet even unstructured data, if it recorded or influenced decision making, clearly
qualified as part of the system of record. Hadoop and its attendant software ecosystem
emerged at the right time to serve as a possible repository for such data.
Streaming Architecture
What defines a streaming architecture is that it focuses equally on data flow and data storage.
It could be described as an event-driven architecture in the sense that the data that it presides
over includes events: website events, application events, customer events, social media events,
analytics triggers, sensor data, log file events, and so on. In the past decade, we have steadily
moved from a transactional world to an event-based world. The system of record now needs
to include the events that determine the behavior of the business, whether those events lead to
immediate action or are simply stored for later analysis.
Even businesses that are not real-time, in the sense of needing to process data as soon as it
arrives, need to think in terms of a streaming architecture. Ultimately, batches of data are
collections of events and should, where possible, be processed event by event. Operating in
this way makes it possible to reduce latency without fundamental changes to the software
architecture: either the IT environment already has adequate capacity, or latency can be
reduced simply by the judicious addition of hardware resources. Older transactional software
architectures cannot reduce latency so easily and may struggle even with increases in data
volume.
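The point about treating a batch as a collection of events can be made concrete with a small illustrative sketch (the `Event` type and `process_stream` function below are hypothetical, not part of any MapR API): processing events one at a time means each result is available as soon as its event arrives, instead of only after the whole batch completes.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Event:
    """A single business event, e.g. a page view or a sensor reading."""
    key: str
    value: float


def process_stream(events: Iterable[Event]) -> List[float]:
    """Process events one at a time, emitting a running total per event.

    The same function works unchanged whether `events` is a stored batch
    or a live feed, which is the architectural benefit described above.
    """
    running = 0.0
    results = []
    for ev in events:           # event by event, not all at once
        running += ev.value
        results.append(running)  # each partial result is available immediately
    return results


batch = [Event("sensor-1", 2.0), Event("sensor-1", 3.0), Event("sensor-1", 5.0)]
print(process_stream(batch))  # → [2.0, 5.0, 10.0]
```

Because the loop makes no assumption about when the next event arrives, moving from batch to streaming input requires no change to the processing logic, only to the source of the iterable.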
[Figure: Producers 1–4 publish events from business applications to MapR Streams topics (Topic 1 and Topic 2); Consumers 1–3 subscribe to those topics, feeding stream processing and bulk processing over MapR-FS and MapR-DB.]
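The producer/consumer/topic model pictured above can be illustrated with a toy in-memory sketch. This is not the MapR Streams API itself (which is compatible with the Apache Kafka API and requires a running cluster); the `Topic` class below is a hypothetical stand-in that shows the essential decoupling: producers append to a shared log, and each consumer reads at its own pace.

```python
from collections import defaultdict


class Topic:
    """Toy in-memory topic: producers append, consumers read by offset.

    Illustrative only; real message streams add partitioning,
    persistence, and replication on top of this basic model.
    """

    def __init__(self):
        self.log = []                     # append-only message log
        self.offsets = defaultdict(int)   # independent read position per consumer

    def produce(self, message):
        """Any producer can append a message to the topic."""
        self.log.append(message)

    def consume(self, consumer_id):
        """Return this consumer's unread messages and advance its offset."""
        start = self.offsets[consumer_id]
        self.offsets[consumer_id] = len(self.log)
        return self.log[start:]


topic1 = Topic()
topic1.produce("click: /home")           # Producer 1
topic1.produce("click: /pricing")        # Producer 2
print(topic1.consume("consumer-1"))      # consumer 1 sees both messages
print(topic1.consume("consumer-1"))      # nothing new yet for consumer 1
print(topic1.consume("consumer-2"))      # consumer 2 reads independently from the start
```

Because producers and consumers share nothing but the topic, either side can be added, removed, or scaled without the other noticing, which is what lets the bulk and stream processing paths in the figure coexist over the same data.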
[Figure: The same converged stack of applications, bulk processing, and stream processing (MapR Streams, MapR-DB, MapR-FS) deployed and replicated across a cloud environment, Data Center 1, and Data Center 2.]
[Figure: Geo-distributed deployment spanning an EU data center and a US data center; data inputs/updates flow as records through MapR Streams into MapR-DB (JSON), with a search engine (Elasticsearch) and a graph database (Titan on MapR-DB) at each site.]
In Summary
MapR's vision is quite distinct from that of other distributors. It delivers a different architectural
approach to Hadoop while shipping and supporting all open source projects in their entirety.
As far as we are aware, it is the only distribution that offers a truly global capability that
supports everything from real-time analytics to bulk processing. More than a data platform,
it is fast becoming an operating system for data and a global system of record. For companies
that are currently planning to implement Hadoop at a corporate level, we advise taking a close
look at MapR.