[Figure: Directed Acyclic Graph (DAG) – edges and vertices]
1. First, the user submits an Apache Spark application to Spark.
2. Then the driver module takes charge of the application.
3. The driver performs several operations on the application to identify
whether transformations and actions are present in it.
4. All the operations are then arranged into a logical flow of operations;
that arrangement is the DAG.
5. The DAG is then converted into a physical execution plan, which
contains stages.
6. As discussed earlier, the driver identifies transformations. It also sets
stage boundaries according to the nature of each transformation. There are
two types of transformations applied to an RDD: narrow transformations and
wide transformations. Let's discuss each in brief; a code sketch follows the list:
▪ Narrow Transformations – Transformations such as map() and filter() are
narrow transformations. They do not require shuffling data across
partitions.
▪ Wide Transformations – Transformations such as reduceByKey() are wide
transformations. They do require shuffling data across partitions.
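To make these steps concrete, here is a minimal sketch of a Spark application in Scala. The app name, local master, and toy dataset are illustrative choices, not taken from the original text. The comments mark which transformations are narrow and where the wide reduceByKey() forces a stage boundary; the final action is what triggers the DAG scheduler.

```scala
import org.apache.spark.sql.SparkSession

object NarrowVsWide {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("NarrowVsWide")
      .master("local[*]") // illustrative: run locally for the sketch
      .getOrCreate()
    val sc = spark.sparkContext

    val lines = sc.parallelize(Seq("spark builds a dag", "the dag has stages"))

    // Narrow transformations: each output partition depends on a single
    // input partition, so no shuffle is needed and no stage boundary is set.
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(w => (w, 1))

    // Wide transformation: reduceByKey() shuffles data across partitions,
    // so the driver places a stage boundary here.
    val counts = pairs.reduceByKey(_ + _)

    // The action triggers execution: logical DAG -> stages -> tasks.
    counts.collect().foreach(println)

    spark.stop()
  }
}
```

When this runs, the shuffle splits the job into two stages: everything up to reduceByKey() executes in the first stage, and the aggregation over the shuffled data executes in the second.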
▪ Through the DAG, it is possible to execute many operations at the same
time. In MapReduce only two operations (map and reduce) are available, so
an SQL-style query cannot be expressed directly, whereas it can be on a
DAG. This makes the DAG model more flexible than MapReduce.
▪ Fault tolerance is also achievable through the DAG: we can recover lost
RDDs by recomputing them from this graph (see the sketch below).
▪ In comparison to Hadoop MapReduce, the DAG provides better global
optimization, since the scheduler sees the whole computation at once rather
than one map-reduce pass at a time.
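The recovery point can be seen directly in the lineage Spark records for every RDD. The sketch below (again local mode with a toy dataset of our choosing) prints that lineage with toDebugString; if a partition of counts is lost, Spark replays only the listed transformations to rebuild it, with no data replication required.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("LineageDemo").setMaster("local[*]"))

    val counts = sc.parallelize(Seq("a", "b", "a"))
      .map(w => (w, 1))
      .reduceByKey(_ + _)

    // toDebugString prints the lineage, i.e. the DAG of transformations
    // that produced this RDD. On failure, Spark consults this graph and
    // recomputes only the lost partitions from their parents.
    println(counts.toDebugString)

    sc.stop()
  }
}
```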