
Amazon EMR (Elastic MapReduce)

 Analyze and process vast amounts of data


 Uses Apache Hadoop
 EMR consists of Master and Slave Nodes
 Hadoop uses a distributed processing architecture
called MapReduce, in which a task is mapped to a set
of servers for processing. The results of the
computation performed by those servers are then
reduced to a single output set. One node,
designated as the master node, controls the
distribution of tasks
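The map-then-reduce flow described above can be sketched in plain Python. This is a simplified, single-machine illustration of the pattern Hadoop distributes across slave nodes, not EMR code:

```python
from collections import defaultdict

def map_phase(documents):
    """Map step: emit (word, 1) pairs, as each mapper node would."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce step: combine all mapped pairs into a single output set."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["hadoop maps tasks", "hadoop reduces results"]
word_counts = reduce_phase(map_phase(docs))
# word_counts["hadoop"] == 2
```

On a real cluster the master node assigns the map tasks to slave nodes and collects their intermediate pairs before the reduce step.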
Hadoop

 The Apache Hadoop software library is a framework that
allows for the distributed processing of large data
sets across clusters of computers using simple
programming models
 It is designed to scale up from single servers to
thousands of machines
 Hadoop is designed to detect and handle failures at
the application layer
EMR

 Hadoop clusters running on Amazon EMR use EC2
instances as virtual Linux servers for the master and
slave nodes
 Amazon S3 provides bulk storage of input and output data
 CloudWatch monitors cluster performance and
raises alarms
 You can also move data into and out of DynamoDB
using Amazon EMR and Hive
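The pieces above come together when launching a cluster. A hedged sketch of the request shape used by boto3's `emr.run_job_flow` — the cluster name, bucket, and instance types are illustrative assumptions, and no AWS call is made here:

```python
# Request in the shape boto3's emr.run_job_flow(**cluster_config) expects.
# All names and sizes are illustrative placeholders.
cluster_config = {
    "Name": "example-hadoop-cluster",           # hypothetical cluster name
    "LogUri": "s3://example-bucket/emr-logs/",  # S3 holds logs and bulk data
    "Instances": {
        "MasterInstanceType": "m5.xlarge",      # one master node
        "SlaveInstanceType": "m5.xlarge",       # slave nodes
        "InstanceCount": 3,                     # 1 master + 2 slaves
        "KeepJobFlowAliveWhenNoSteps": False,   # terminate when steps finish
    },
    "Applications": [{"Name": "Hadoop"}, {"Name": "Hive"}],
}

# With credentials configured, you would launch it like:
#   import boto3
#   emr = boto3.client("emr", region_name="us-east-1")
#   response = emr.run_job_flow(**cluster_config)
```

CloudWatch metrics and alarms would then be attached to the resulting cluster ID.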
EMR

 Open-source projects that run on top of the Hadoop
architecture can also be run on Amazon EMR
 Hive, Pig, HBase, DistCp, and Ganglia are already
integrated with Amazon EMR
EMR: Advantages

 The ability to provision clusters of virtual servers
within minutes.
 You can scale the number of virtual servers in your
cluster to match your computation needs, and pay
only for what you use.
 Integration with other AWS services
EMR: Features

 Resizeable Clusters
 When you run your Hadoop cluster on Amazon EMR, you can
easily expand or shrink the number of virtual servers in your
cluster depending on your processing needs
 Pay Only for What You Use
 Amazon EMR follows the standard AWS pay-as-you-go pricing model

 Easy to Use
 When you launch a cluster on Amazon EMR, the web service
allocates the virtual server instances and configures them with
the needed software for you. Within minutes you can have a
cluster configured and ready to run your Hadoop application
EMR: Features

 Use Amazon S3 or HDFS
 You can store your input and output data in Amazon S3, on the
cluster in HDFS, or a mix of both. Amazon S3 can be accessed
like a file system from applications running on your Amazon
EMR cluster
 Parallel Clusters
 If your input data is stored in Amazon S3, you can have
multiple clusters accessing the same data simultaneously
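Because S3 is addressed like a file system, input and output locations differ only in their URI scheme. A sketch of a Hadoop streaming step mixing the two, in the shape boto3's `emr.add_job_flow_steps` accepts — the bucket, paths, and script names are made up for illustration:

```python
# Step definition; all paths are illustrative placeholders.
# The point is the interchangeable s3:// and hdfs:// schemes.
step = {
    "Name": "example-streaming-step",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "hadoop-streaming",
            "-input", "s3://example-bucket/input/",    # read from S3
            "-output", "hdfs:///user/hadoop/output/",  # write to cluster HDFS
            "-mapper", "mapper.py",
            "-reducer", "reducer.py",
        ],
    },
}
```

Several clusters can point their `-input` at the same S3 prefix at once, since the data lives outside any single cluster.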
EMR: Features

 Hadoop Application Support
 You can use popular Hadoop applications such as Hive, Pig,
and HBase with Amazon EMR
 Save Money with Spot Instances
 Spot instances cost less than on-demand instances
 AWS Integration
 Amazon EMR is integrated with other Amazon Web Services
such as Amazon EC2, Amazon S3, DynamoDB, Amazon RDS,
CloudWatch, and AWS Data Pipeline
EMR: Features

 Instance Options
 When you launch a cluster on Amazon EMR, you specify the
size and capabilities of the virtual servers used in the cluster
 MapR Support
 Amazon EMR supports several MapR distributions

 Business Intelligence Tools
 Amazon EMR integrates with popular business intelligence
(BI) tools such as Tableau, MicroStrategy, and Datameer
EMR: Features

 User Control
 When you launch a cluster using Amazon EMR, you have root
access to the cluster and can install software and configure the
cluster before Hadoop starts
 Management Tools
 You can manage your clusters using the Amazon EMR console
(a web-based user interface), a command line interface, web
service APIs, and a variety of SDKs
 Security
 You can run Amazon EMR in an Amazon VPC in which you
configure networking and security rules
EMR Lab

INTRODUCTION TO AMAZON EMR


Redshift

 Amazon Redshift is a fast, fully managed, petabyte-scale
data warehouse service
 It is optimized for datasets ranging from a few
hundred gigabytes to a petabyte or more and costs
less than $1,000 per terabyte per year
 The first step to create a data warehouse is to launch
a set of nodes, called an Amazon Redshift cluster.
After you provision your cluster, you can upload your
data set and then perform data analysis queries
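After the cluster is provisioned, a common way to upload a data set is a SQL COPY from Amazon S3. A minimal sketch of composing that statement — the table name, bucket, and IAM role ARN are placeholders:

```python
def build_copy_statement(table, s3_path, iam_role):
    """Compose a Redshift COPY statement for loading CSV data from S3."""
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV;"
    )

sql = build_copy_statement(
    "sales",                                        # hypothetical table
    "s3://example-bucket/sales/2024/",              # hypothetical data set
    "arn:aws:iam::123456789012:role/RedshiftCopy",  # placeholder role ARN
)
```

The resulting statement would be run against the cluster endpoint with any PostgreSQL-compatible client, after which analysis queries can target the loaded table.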
