Jeff Buell, VMware, Inc. Richard McDougall, VMware, Inc. Sanjay Radia, Hortonworks
#vmworldapps
Disclaimer
Technical feasibility and market demand will affect final delivery. Pricing and packaging for any new technologies or features
discussed or presented have not been determined.
Financial Services
Scientific Research
Social Media
Hadoop's ability to handle large volumes of unstructured data affordably and efficiently makes it a valuable toolkit for enterprises across a number of applications and fields.
Source: http://architects.dzone.com/articles/how-hadoop-mapreduce-works
MapReduce: programming framework for highly parallel data processing
Simple to Operate: rapid deployment, unified operations across the enterprise, easy cloning of clusters
Highly Available
Elastic Scaling: shrink and expand the cluster on demand, resource guarantees, independent scaling of compute and data
Low Utilization
Dedicated clusters run Hadoop at low CPU utilization; there is no easy way to share resources between Hadoop and non-Hadoop workloads; noisy neighbors and a lack of resource containment.
Cluster Consolidation
[Diagram: separate MPP DB, HBase, and Hadoop clusters consolidated onto a shared virtualized platform]
Cluster Sprawl: single-purpose clusters for various business applications lead to cluster sprawl.
Simplify
Optimize: shared resources = higher utilization; elastic resources = faster on-demand access
Automate deployment
$ ssh serengeti@serengeti-vm
$ serengeti
serengeti>
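From the Serengeti shell, clusters are provisioned directly. A hypothetical session might look like the following (command names follow the Serengeti CLI; the cluster name is illustrative):

```
serengeti> cluster create --name myHadoop
serengeti> cluster list
```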
$ ssh rmc@rmc-elephant-009.eng.vmware.com
Configuring Distros
{
  "name" : "cdh",
  "version" : "3u3",
  "packages" : [
    {
      "roles" : ["hadoop_NameNode", "hadoop_jobtracker",
                 "hadoop_tasktracker", "hadoop_datanode", "hadoop_client"],
      "tarball" : "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"
    },
    {
      "roles" : ["hive"],
      "tarball" : "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"
    },
    {
      "roles" : ["pig"],
      "tarball" : "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"
    }
  ]
}
Commercial Vendors
Community Projects
Support major distributions and multiple projects; contribute the Hadoop Virtualization Extension (HVE) to the open-source community.
SAN Storage:    $2 - $10/GB;  $1M gets 0.5 PB, 200,000 IOPS, 8 GB/sec
NAS Filers:     $1 - $5/GB;   $1M gets 1 PB, 200,000 IOPS, 10 GB/sec
Local Storage:  $0.05/GB;     $1M gets 10 PB, 400,000 IOPS, 250 GB/sec
Hybrid Storage: SAN for boot images, VMs, and other workloads
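The "$1M gets" capacities can be sanity-checked with simple arithmetic. A minimal sketch, assuming decimal units and effective prices of $2, $1, and $0.10 per GB (the 10 PB local-storage figure implies roughly $0.10/GB, the upper end of commodity-disk pricing rather than the $0.05 low end quoted):

```python
# Capacity purchasable per storage tier for a fixed budget.
BUDGET = 1_000_000  # dollars

def petabytes_for(budget, dollars_per_gb):
    """Petabytes purchasable at a given $/GB price (decimal units)."""
    gigabytes = budget / dollars_per_gb
    return gigabytes / 1_000_000  # 1 PB = 1,000,000 GB

# Assumed effective prices; SAN/NAS match the slide's range low ends.
for name, price in [("SAN", 2.00), ("NAS", 1.00), ("local", 0.10)]:
    print(name, round(petabytes_for(BUDGET, price), 2), "PB")
```

The order-of-magnitude gap in $/GB, not raw IOPS, is what drives the local-disk recommendation for Hadoop data.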
[Diagram: Hadoop VMs distributed across virtualization hosts, with local disks for Hadoop data and SAN for boot images]
Hadoop
Issues of interest:
- Native vs. various virtual configurations
- Local disks vs. Fibre Channel SAN
- Effect of protecting Hadoop master daemons with Fault Tolerance
- Public cloud (renting) vs. private cloud (buying)
- Arista 7124SX 10 GbE switch
- 24x HP DL380 G7: 2x X5687, 72 GB RAM, 16x 146 GB SAS disks, Broadcom 10 GbE adapter, QLogic 8 Gb/s HBA
- EMC VNX7500
Configuration
Software: vSphere 5.0 U1 (storage tests), 5.1 (native/virtual and FT tests); RHEL 6.1 x86_64; Cloudera CDH3u4; Hadoop applications: TeraGen, TeraSort, TeraValidate (1 TB)
Hadoop VMs: processors (16 logical threads), memory (72 GB), and disks (12) partitioned among 1, 2, or 4 VMs per host; separate VMs for the NameNode and JobTracker in the storage and FT tests
Hadoop configuration: one map and one reduce task per vCPU (= logical thread), so the machines are highly loaded; 256 MB block size (FT tests use 8-256 MB block sizes to vary the load on the NN and JT)
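Shrinking the block size multiplies the metadata and task-scheduling load on the master daemons. A small illustration of the effect, assuming the 1 TB TeraSort input and decimal units:

```python
# Number of HDFS blocks (and hence map tasks / block reports) for a
# 1 TB dataset at the two block sizes used in the FT tests.
DATASET_MB = 1_000_000  # 1 TB in MB, decimal units

def num_blocks(dataset_mb, block_mb):
    return dataset_mb // block_mb

blocks_256 = num_blocks(DATASET_MB, 256)  # a few thousand blocks
blocks_8 = num_blocks(DATASET_MB, 8)      # over a hundred thousand blocks
print(blocks_8 // blocks_256)             # → 32
```

A 32x increase in block count lets a small test cluster place a NN/JT load comparable to one from a much larger cluster at the standard block size.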
[Chart: TeraGen, TeraSort, and TeraValidate performance — native vs. 1, 2, and 4 VMs per host]
[Chart: TeraSort and TeraValidate performance on local disks vs. SAN RAID-5]
NameNode and JobTracker placed in separate uniprocessor VMs. Small overhead: enabling FT causes a 2-4% slowdown for TeraSort. The 8 MB block-size case places a load on the NN & JT similar to that of >200 hosts with 256 MB blocks.
[Chart: TeraSort elapsed time ratio, FT on vs. FT off, in the range of roughly 1.01-1.04]
- Yahoo! Hadoop 2009: classic benchmark test, 1460 hosts
- Google/MapR: SaaS on Google Compute Engine
- vSphere 5.1: 24-host cluster, 2 VMs/host, 8 or 12 disks/host, CDH3u4
[Table: TeraSort benchmark results and approximate cost for the configurations above]
Simple to Operate: rapid deployment, unified operations across the enterprise, easy cloning of clusters
Highly Available
Elastic Scaling: shrink and expand the cluster on demand, resource guarantees, independent scaling of compute and data
Hortonworks goal
- Expand the Hadoop ecosystem; provide first-class support for various platforms
- Hadoop should run well on VMs; VMs offer several advantages, as presented earlier
- Take advantage of vSphere for HA
- MR-tmp on HDFS using block pools; elastic compute VMs will not need local disk; fast communications within VMs
[Diagram: jobs continue running while the NameNode and JobTracker fail over between servers; the JT enters safemode during NN failover; N+K failover]
HA is in HDP 1.0, using the Total System Availability Architecture.
Additional benefits:
- N-N & N+K failover
- Migration for maintenance
180 nodes, 200K files, 18 million blocks, 900 TB raw storage:
- Total failure detection and failover: 2-4.5 minutes
- Failure detection and failover: 0.5 to 2 minutes
- NameNode startup (exit safemode): 110 sec
For vSphere, OS boot-up is needed; its 10-20 seconds is included above. Cold failover is good enough for small/medium clusters.
Failure detection and automatic failover dominates.
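The failover budget can be added up directly. A sketch assuming detection and failover take 0.5-2 minutes and NameNode startup takes ~110 s (already including the 10-20 s vSphere OS boot):

```python
# Rough total-failover-time budget for cold failover of the NameNode.
DETECT_FAILOVER_S = (30, 120)  # failure detection + failover: 0.5 to 2 min
NN_STARTUP_S = 110             # exit safemode; includes OS boot on vSphere

def total_failover_minutes(detect_s):
    return (detect_s + NN_STARTUP_S) / 60

lo = total_failover_minutes(DETECT_FAILOVER_S[0])
hi = total_failover_minutes(DETECT_FAILOVER_S[1])
print(round(lo, 1), round(hi, 1))  # → 2.3 3.8
```

Both endpoints land inside the 2-4.5 minute total reported for the 180-node cluster, with detection dominating the variance.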
Summary
Simple to Operate: rapid deployment, unified operations across the enterprise, easy cloning of clusters
Highly Available
Elastic Scaling: shrink and expand the cluster on demand, resource guarantees, independent scaling of compute and data
1. Hadoop in a VM: single tenant, fixed resources
[Diagram: current Hadoop — compute and storage combined in VMs, with tenants T1 and T2]
[Diagram: a virtual Hadoop node (task slots and a DataNode backed by VMDKs) sharing a virtualization host with other workloads]
References
- www.projectserengeti.org
- www.hortonworks.com
- www.cloudera.com
- Fault Tolerance performance whitepaper: www.vmware.com/resources/techresources/10301
- MapR/Google blog: www.mapr.com/blog/google-mapr
APP-CAP2956