Executive Summary
Traditional high performance computing (HPC)/
supercomputing, analytics and mainstream real-time/batch computing are quickly converging.
Mainstream workloads are crossing over into the high
performance computing arena, demanding faster
analytics and batch processing as well as resource-intensive
computations and algorithms. To succeed in today's accelerating
digital world, enterprises must collect and
analyze vast amounts of data, in real
time, and at ever-faster speeds that most legacy
enterprise HPC technologies and systems were
not originally designed to accommodate.
In our view, organizations need to embark on what
we call Enterprise HPC 2.0. This term refers to the
ecosystem that leverages the latest
commodity-hardware-based hyperscale grid technologies such as in-memory computing (IMC),
compute and data grid technologies, streaming
analytics and graph analytics. These work in conjunction with infrastructure advancements such
as solid state drives (SSD)-enabled technology,
GPGPU acceleration, general purpose Infiniband
In-Memory Computing: A Market Check
Effective use of IMC technology along with a clear
strategy for adoption can help enterprises reap
multiple benefits. Figure 1 lists some of the key
use cases across specific industries. The list is
indicative only; the possibilities extend well
beyond the use cases shown.
There have been rapid innovations in the IMC
space recently to enable faster computation
and processing speeds. These include Hadoop
[Figure 1: Key IMC use cases across industries.
Industries shown: telecom, retail, insurance, healthcare, manufacturing.
Use cases shown: real-time in-store analytics; fast real-time loyalty offers;
real-time ad placements; real-time sentiment analysis; faster medical imaging
processing; genome analysis; faster claim processing and modeling; faster
actuarial science; fraud detection; real-time trading decisions; faster
reporting; inventory management; predictive analytics to avoid unplanned
downtime.]
[Figure 2: Categories of IMC technology.
Categories shown: In-Memory Data Grid (IMDG); Distributed Caches; In-Memory
Data Fabric (IMDF); In-Memory Database (IMDB); In-Memory Compute Grid (IMCG).
Descriptions shown: a cache that partitions its data among all cluster nodes;
an RDBMS that stores data in memory instead of on disk; a distributed
key/value cache for low-latency access; in-memory computation and processing
of data stored on disk.
Products shown: Memcached, Ehcache, Pivotal GemFire, Pivotal GemFire XD,
Oracle Coherence, GigaSpaces XAP, Hazelcast, Infinispan (JBoss), Apache Spark,
Apache Ignite (GridGain), SAP HANA, Oracle Exalytics, Exadata, MS SQL2014.]
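Figure 2 describes a distributed cache as one that partitions its data among all cluster nodes. As a minimal sketch of that idea, the Python below assigns each key to a node by hashing it. The `PartitionedCache` class, the node names and the sample records are hypothetical, and the sketch does not reflect how any product named in this paper implements partitioning; real systems typically add replication and rebalancing.

```python
import hashlib


class PartitionedCache:
    """Toy key/value cache that spreads entries across named nodes."""

    def __init__(self, node_names):
        if not node_names:
            raise ValueError("at least one node is required")
        self.node_names = list(node_names)
        # One plain dict per node stands in for that node's memory.
        self.nodes = {name: {} for name in self.node_names}

    def owner(self, key):
        """Return the node that owns `key`, using a stable hash of the key."""
        digest = hashlib.sha256(str(key).encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.node_names)
        return self.node_names[index]

    def put(self, key, value):
        self.nodes[self.owner(key)][key] = value

    def get(self, key, default=None):
        return self.nodes[self.owner(key)].get(key, default)


# Hypothetical three-node cluster holding sample loyalty records.
cache = PartitionedCache(["node-a", "node-b", "node-c"])
for customer_id in range(1000):
    cache.put(customer_id, {"loyalty_points": customer_id % 50})

# Number of entries held by each node.
sizes = {name: len(entries) for name, entries in cache.nodes.items()}
```

Because the owner is derived from the key alone, any client can locate an entry without consulting a central directory, which is one reason this style of partitioning suits low-latency lookups.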
Figure 3
Step 1: Discovery
The business use cases and the workloads to be
implemented via IMC technology play a crucial
role in the selection of the products. So first the
workload is chosen and key goals for implementation are defined.
For this white paper, we studied a retail customer
analytics workload previously processed on a
modern scalable batch model using Apache Pig, a
Hadoop MapReduce-based technology with
a disk-based architecture. Because of the nature of this technology, the
solution was an offline, batch-based system.
To be better prepared for disruptive
consumer behavior, where latency
implies loss of business, we wanted
an alternative solution that could deliver faster or
near-real-time performance for the
customer's customers. We devised an internal
study to transform the batch workload using
multiple IMC technologies, and we successfully applied
an appropriate IMC technology to make it faster.
Next, we defined the key use cases that the
workload requires, which become the input
for the IMC system evaluation matrix. For quick
development of the use case and benchmarking, we wanted the following core features to be
SQL
Support
Support
Step 2: Analysis
Second, we needed to ascertain the segment of
IMC technology that would best suit the workload
and identify a potential list of IMC systems from
that category that readily support the evaluation
criteria for the specific use cases. This list is carefully
chosen after deliberation with the enterprise's
business and architecture stakeholders.
We then performed deep-dive fit and architectural analysis on the selected list and determined
the best-fit match based on the aforementioned
evaluation criteria. From the output of this
analysis, the final list of IMC systems that closely
fit the requirements was determined. Further
proof-of-concept, proof-of-technology and benchmarking were performed on the final list of IMC
systems to validate, establish and recommend the
best-fit IMC system for a given workload.
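The weighted evaluation described above can be sketched as a small scoring function. The criteria names loosely follow the evaluation criteria in this paper, but the weights, candidate names and scores below are hypothetical placeholders chosen for illustration, not the values used in the study.

```python
def weighted_score(scores, weights):
    """Combine per-criterion scores (0-100) into one weighted score."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(scores[name] * weight for name, weight in weights.items())


def rank_candidates(candidates, weights):
    """Return (name, score) pairs sorted from best fit to worst fit."""
    scored = [(name, weighted_score(s, weights)) for name, s in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical weights and scores, for illustration only.
weights = {"features": 0.5, "system_setup": 0.25, "dev_setup": 0.25}
candidates = {
    "system_x": {"features": 80, "system_setup": 60, "dev_setup": 40},
    "system_y": {"features": 60, "system_setup": 80, "dev_setup": 80},
}
ranking = rank_candidates(candidates, weights)
```

Keeping the weights separate from the scores makes it straightforward to revisit the ranking when stakeholders change the relative importance of a criterion.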
[Figure 4: Fitment analysis.
Groupings shown: Open Source; Alternate IMC Technology; Others.
Products shown: Pivotal GemFire XD, Oracle Coherence, GigaSpaces XAP,
Apache Spark, Apache Ignite, Infinispan (from JBoss), Hazelcast.]
[Figure 5: Evaluation criteria and weightage percent.
Entries appear in the source in this order: 60%, System Environment Setup,
25%, Dev Environment Setup, 15%, Features.]
[Figure 6: Percentages for the criteria System Environment Setup, Dev
Environment Setup and Features, shown for Apache Ignite, Hazelcast, Oracle
Coherence, JBoss Infinispan, Pivotal GemFire XD and GigaSpaces XAP.]
Node details (column headings): Disk Space (TB), RAM (GB), CPU Cores.
Values as they appear in the source: 128, 32, 2.6.

IMC System            Version
Apache Spark          1.3.1
Apache Ignite         1.2.0-incubating
Pivotal GemFire XD    1.4.1
Figure 7
Benchmark Task
Our study compared a batch workload against IMC-based
alternatives. The workload performed a mix of computations to
create new data sets, with computed fields based
on aggregations performed in previous steps. The
original data was persisted in four structured data sets with relational integrity between
them based on certain attributes/fields. The study
was run on 50 GB of data with 500 million records
using the traditional MR mode, and the results were compared with
two approaches: the alternate IMC technology [1] Apache
Spark [2], and IMDG NewSQL products.
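To make the shape of such a workload concrete, the sketch below joins two small in-memory data sets on a shared key and derives a computed field from an earlier aggregation. The record layouts (`customers`, `purchases`) and field names are hypothetical; the study's actual schemas and queries are not given in this paper.

```python
from collections import defaultdict

# Hypothetical structured data sets related through customer_id.
customers = [
    {"customer_id": 1, "segment": "gold"},
    {"customer_id": 2, "segment": "silver"},
]
purchases = [
    {"customer_id": 1, "amount": 40.0},
    {"customer_id": 1, "amount": 60.0},
    {"customer_id": 2, "amount": 25.0},
]

# Aggregation step: total spend and purchase count per customer.
totals = defaultdict(lambda: {"total": 0.0, "count": 0})
for row in purchases:
    entry = totals[row["customer_id"]]
    entry["total"] += row["amount"]
    entry["count"] += 1

# Join step: one denormalized row per customer, with a computed field
# (average purchase) derived from the aggregates above.
denormalized = []
for customer in customers:
    agg = totals.get(customer["customer_id"], {"total": 0.0, "count": 0})
    average = agg["total"] / agg["count"] if agg["count"] else 0.0
    denormalized.append(
        {
            **customer,
            "total_spend": agg["total"],
            "purchase_count": agg["count"],
            "average_purchase": average,
        }
    )
```

At benchmark scale the same join-then-aggregate pattern would be expressed in the query or API layer of the chosen system rather than in plain Python loops.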
Step 3: Recommendation
Third, after creating PoCs and performance-related benchmarks, we can easily derive, validate and
recommend the best-fit IMC system for any given
workload. We can also consider where these technologies would potentially give the most durable
benefit for enterprise workloads by performing
such detailed analysis of their architectural
aspects.
Benchmark Execution
We executed each task three times for each IMC
system and reported the average of the trials.
Each system executed the benchmark tasks
separately to ensure exclusive access to the
cluster's resources. During the tests, we found
that Apache Ignite, unlike the other three systems,
did not provide out-of-the-box support for bulk
ingestion of data from CSV files and, with its
default cluster environment settings, could not
stably ingest more than 1 GB of data.
This prevented us from running the benchmark
tasks on Apache Ignite.
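The procedure of running each task three times and averaging the trials can be sketched as a small timing harness. This is an illustrative sketch only: the `run_trials` helper and the stand-in task are hypothetical, and the study's actual tooling is not described in this paper.

```python
import statistics
import time


def run_trials(task, trials=3):
    """Run `task` `trials` times and return (mean_seconds, durations)."""
    durations = []
    for _ in range(trials):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), durations


def sample_task():
    # Stand-in for a benchmark task such as a join or an aggregation.
    return sum(i * i for i in range(100_000))


mean_seconds, durations = run_trials(sample_task)
```

Returning the individual durations alongside the mean makes it easy to spot an outlier trial, for example one affected by a cold cache.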
Results
Figure 8 depicts the overall performance numbers
of the three IMC systems under different task
scenarios.
It is important to note that although performance tuning was not considered in our study, for
[Figure 8: Benchmark workload and results.
Workload operations mix (percent): Aggregations/Computations 50%; Data Set
Joins 30%; Data Set Select/Create 10%; one further 10% entry with no label
in the source.
Data set metrics: input data size (4 datasets) 50G; input records count
500 mil; output data size (1 denormalized view) 150G; output records count
300 mil.
Performance metrics: pre-IMC and post-IMC execution time; values shown are
13hrs, 15min, 1hr 6sec and 13x.
Performance comparison chart: Apache Pig, Pivotal GemFire XD, Apache Spark
(axis from 0 to 25).]
Functional Findings
Pivotal GemFire XD: Easy to implement.
Apache Spark: Easy to implement.
Figure 9
Step 4: Planning
Finally, with the knowledge and validation
achieved in the previous steps, we can then
successfully plan and create an effective IMC
roadmap.
When
Key Recommendations
Our analysis establishes that IMC is the future
of computing and a key enabling technology for
enterprise HPC workloads that require analytical,
predictive and cognitive capabilities.
As such, we recommend that:
Organizations
Looking Forward
Although in-memory technology has been around
for many years, the latest advancements in
scale-out architecture, increased automation and
reduced memory costs have increased the technology's appeal to all enterprises. IMC innovation
continues unabated across the whole
spectrum of IT market segments from hardware
to application infrastructure to packaged
business applications. New in-memory technologies can support new and complex workloads
that organizations can confidently apply to
Footnotes
1 Xin, Reynold; Rosen, Josh; Zaharia, Matei; Franklin, Michael; Shenker, Scott; Stoica, Ion,
Shark: SQL and Rich Analytics at Scale, June 2013.
http://www.firstpost.com/business/ibms-apache-spark-push-plans-put-spark-bluemix-open-tech-centre-2296260.html.
http://searchaws.techtarget.com/news/4500248624/Amazon-Elastic-MapReduce-moves-forward-withApache-Spark.
References
Acknowledgment
Special thanks to Senthil Ramaswamy Sankarasubramanian, Director, Cognizant HyPerscale Computing
Lab, a unit of Cognizant Technology Labs, for his invaluable feedback during the course of writing this
paper.
About Cognizant
Cognizant (NASDAQ: CTSH) is a leading provider of information technology, consulting, and business
process outsourcing services, dedicated to helping the world's leading companies build stronger businesses. Headquartered in Teaneck, New Jersey (U.S.), Cognizant combines a passion for client satisfaction, technology innovation, deep industry and business process expertise, and a global, collaborative
workforce that embodies the future of work. With over 100 development and delivery centers worldwide
and approximately 218,000 employees as of June 30, 2015, Cognizant is a member of the NASDAQ-100,
the S&P 500, the Forbes Global 2000, and the Fortune 500 and is ranked among the top performing and
fastest growing companies in the world. Visit us online at www.cognizant.com or follow us on Twitter: Cognizant.
Copyright 2015, Cognizant. All rights reserved. No part of this document may be reproduced, stored in a retrieval system, transmitted in any form or by any
means, electronic, mechanical, photocopying, recording, or otherwise, without the express written permission from Cognizant. The information contained herein is
subject to change without notice. All other trademarks mentioned herein are the property of their respective owners.
TL Codex 1546