Deployment Options
As noted in the previous chapter, Spark is easy to download and install on
a laptop or virtual machine. Spark was built to run in a couple of different
ways: standalone, or as part of a cluster.
But for production workloads operating at scale, a single laptop or
virtual machine is not likely to be sufficient. In these circumstances, Spark
will normally run on an existing big data cluster. These clusters are often
also used for Hadoop jobs, and Hadoop's YARN resource manager is then
typically used to manage the whole cluster, including Spark. Running Spark
on YARN, from the Apache Spark project, provides more configuration
details.
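As a rough, hypothetical sketch, an application is typically submitted to an
existing YARN cluster with the spark-submit script; the configuration path,
application jar, main class, and resource settings below are all placeholders:

    # Hypothetical YARN submission; HADOOP_CONF_DIR must point at the
    # cluster's Hadoop configuration, and my-app.jar is a placeholder.
    export HADOOP_CONF_DIR=/etc/hadoop/conf
    ./bin/spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --class com.example.MyApp \
      --num-executors 4 \
      --executor-memory 2g \
      my-app.jar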
For those who prefer alternative resource managers, Spark can also run
just as easily on clusters controlled by Apache Mesos. Running Spark on
Mesos, from the Apache Spark project, provides more configuration
details.
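As a minimal, hypothetical sketch, pointing the same job at a Mesos cluster
is largely a matter of supplying a Mesos master URL (the host name, main
class, and jar below are placeholders):

    # Hypothetical Mesos submission; mesos-master.example.com is a placeholder.
    ./bin/spark-submit \
      --master mesos://mesos-master.example.com:5050 \
      --class com.example.MyApp \
      my-app.jar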
A series of scripts bundled with current releases of Spark simplifies the
process of launching Spark on Amazon Web Services' Elastic Compute
Cloud (EC2). Running Spark on EC2, from the Apache Spark project,
provides more configuration details.
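As a rough, hypothetical sketch (the key pair, identity file, and cluster name
are placeholders), launching a small cluster with the bundled spark-ec2 script
from a release's ec2 directory looks something like this:

    # Hypothetical spark-ec2 invocation; AWS credentials must be available
    # in the environment, and all names below are placeholders.
    export AWS_ACCESS_KEY_ID=<your-access-key>
    export AWS_SECRET_ACCESS_KEY=<your-secret-key>
    ./spark-ec2 --key-pair=my-keypair \
                --identity-file=my-keypair.pem \
                --slaves=3 \
                launch my-spark-cluster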
Storage Options
Although often linked with the Hadoop Distributed File System (HDFS),
Spark can integrate with a range of commercial or open source third-party
data storage systems, including:
• MapR (file system and database)
• Google Cloud
• Amazon S3
• Apache Cassandra
• Apache Hadoop (HDFS)
• Apache HBase
• Apache Hive
• Berkeley's Tachyon project
Developers are most likely to choose the data storage system they are
already using elsewhere in their workflow.
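Switching between these systems often requires little more than changing the
URI scheme on a path. A minimal sketch in Python, assuming hypothetical HDFS
and Amazon S3 locations (the host, bucket, and file names are placeholders):

    # Hypothetical paths; only the URI scheme changes between storage systems.
    from pyspark import SparkContext

    sc = SparkContext(appName="StorageExample")
    hdfs_data = sc.textFile("hdfs://namenode:8020/data/events.log")
    s3_data = sc.textFile("s3n://my-bucket/data/events.log")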
API Overview
Spark's capabilities can all be accessed and controlled using a rich API.
This API supports Spark's four principal development languages: Scala,
Java, Python, and R, and extensive documentation is provided on the API's
implementation in each of these languages. The Spark Programming Guide
provides further detail, with comprehensive code snippets in Scala, Java,
and Python. The Spark API was optimized for manipulating data, with a
design that reduces common data science tasks from hundreds or thousands
of lines of code to only a few.
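The classic word count task illustrates this brevity; a minimal sketch in
Python, assuming hypothetical input and output paths:

    # Count word occurrences in a text file -- a task that would take far
    # more code in a traditional MapReduce program. Paths are placeholders.
    from pyspark import SparkContext

    sc = SparkContext(appName="WordCount")
    counts = (sc.textFile("hdfs://namenode:8020/data/input.txt")
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile("hdfs://namenode:8020/data/output")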
An additional DataFrames API was added to Spark in 2015.
DataFrames offer:
• Ability to scale from kilobytes of data on a single laptop to petabytes
on a large cluster
• Support for a wide array of data formats and storage systems
• State-of-the-art optimization and code generation through the Spark
SQL Catalyst optimizer
• Seamless integration with all big data tooling and infrastructure via
Spark
• APIs for Python, Java, Scala, and R
Those already familiar with a DataFrames API in other languages, such as R
or Python's pandas library, will feel right at home with Spark's DataFrames.
Those new to the API but already familiar with Spark will find that this
extended API eases application development, while helping to improve
performance via Catalyst's optimizations and code generation.
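A brief, hypothetical sketch in Python (the file path and column names are
placeholders), loading a JSON file into a DataFrame and running a simple
aggregation:

    # Hypothetical DataFrames example: load JSON, filter, and aggregate.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="DataFrameExample")
    sqlContext = SQLContext(sc)
    people = sqlContext.read.json("hdfs://namenode:8020/data/people.json")
    people.filter(people.age > 21) \
          .groupBy("city") \
          .count() \
          .show()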