
Directed Acyclic Graph

Spark DAG – DAG Visualisation


▪ Directed – Each edge points from one node to another. This creates a
sequence: every node links from an earlier step to a later one in the
appropriate order.
▪ Acyclic – There is no cycle or loop. Once a transformation has taken
place, execution cannot return to an earlier position.
▪ Graph – From graph theory, a combination of vertices and edges. The
pattern of those connections, taken together in sequence, is the graph.
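The three properties above can be checked on a small graph sketch. The stage names below are made up for illustration; the acyclicity test is Kahn's topological-sort algorithm in plain Python, not a Spark API:

```python
from collections import deque

# A tiny DAG: each edge points from an earlier stage to a later one (directed).
dag = {
    "load":   ["map"],
    "map":    ["filter"],
    "filter": ["reduce"],
    "reduce": [],
}

def is_acyclic(graph):
    """Kahn's algorithm: a graph is acyclic iff every vertex can be
    emitted in a topological order."""
    indegree = {v: 0 for v in graph}
    for targets in graph.values():
        for t in targets:
            indegree[t] += 1
    queue = deque(v for v, d in indegree.items() if d == 0)
    seen = 0
    while queue:
        v = queue.popleft()
        seen += 1
        for t in graph[v]:
            indegree[t] -= 1
            if indegree[t] == 0:
                queue.append(t)
    return seen == len(graph)

print(is_acyclic(dag))  # True: no transformation loops back
```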

[Figure: a DAG of edges and vertices over RDDs, contrasted with Hadoop MapReduce]

Spark DAG – DAG Scheduler


▪ It completes the computation and execution of stages for a job, keeps
track of RDDs, runs jobs in minimum time, and assigns tasks to the task
scheduler, which submits them for execution.
▪ It determines the preferred locations on which each task should run.
This is possible through the task scheduler, which obtains the current
cache status.
▪ It keeps track of which RDDs are cached to avoid re-computing them, and
in this way it also handles failure. Because it remembers at which stages
it has already produced output files, it can heal the loss: shuffle output
files may be lost, and this record helps recovery from such failures.
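The recovery idea above can be sketched in plain Python. This is a hypothetical toy class, not the Spark API: each "RDD" remembers its parent and the transformation that derived it, so a lost result can be recomputed from lineage instead of being re-read from a checkpoint:

```python
# Hypothetical sketch of lineage-based recovery (names are invented).
class SketchRDD:
    def __init__(self, data=None, parent=None, fn=None):
        self.parent, self.fn = parent, fn
        self._cache = data          # None means "not materialized yet"

    def map(self, fn):
        # Record the transformation; compute nothing until collect().
        return SketchRDD(parent=self, fn=fn)

    def collect(self):
        if self._cache is None:     # recompute from lineage if lost
            self._cache = [self.fn(x) for x in self.parent.collect()]
        return self._cache

base = SketchRDD(data=[1, 2, 3])
doubled = base.map(lambda x: x * 2)
print(doubled.collect())            # [2, 4, 6]
doubled._cache = None               # simulate losing the cached output
print(doubled.collect())            # recovered from lineage: [2, 4, 6]
```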

1. First, the user submits an Apache Spark application to Spark.
2. Then the driver module takes over the application.
3. The driver performs several checks on the application to identify
which transformations and actions it contains.
4. All the operations are then arranged into a logical flow of operations;
that arrangement is the DAG.
5. The DAG is then converted into a physical execution plan, which
contains stages.
6. As discussed earlier, the driver identifies transformations and sets
stage boundaries according to the nature of each transformation. There are
two types of transformation applied to an RDD: 1. narrow transformations,
2. wide transformations. Let's discuss each in brief:
▪ Narrow Transformations – Transformations such as map() and filter() are
narrow transformations. They do not require shuffling data across
partitions.
▪ Wide Transformations – Transformations such as reduceByKey() are wide
transformations. They require shuffling data across partitions.

7. Finally, the DAG scheduler produces the physical execution plan, which
contains tasks. Those tasks are then bundled together and sent out over
the cluster.
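The narrow/wide distinction in step 6 can be illustrated with a plain-Python analogy (not the Spark API), where partitions are simply lists:

```python
from collections import defaultdict

# Data split into two partitions of (key, value) pairs.
partitions = [[("a", 1), ("b", 2)], [("a", 3), ("b", 4)]]

# Narrow transformation: map() runs on each partition independently;
# no data crosses partition boundaries.
mapped = [[(k, v * 10) for k, v in part] for part in partitions]

# Wide transformation: reduceByKey() must first shuffle, so that all
# values for a given key land in the same place before reducing.
shuffled = defaultdict(list)
for part in mapped:
    for k, v in part:               # the shuffle: regroup by key
        shuffled[k].append(v)
reduced = {k: sum(vs) for k, vs in shuffled.items()}
print(reduced)                      # {'a': 40, 'b': 60}
```

The stage boundary falls exactly at the shuffle: everything before it can run per-partition, everything after it needs the regrouped data.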

▪ Many queries can be executed at the same time through a DAG, whereas
MapReduce offers only two operations (Map and Reduce). SQL-style queries,
which MapReduce cannot express directly, are also possible on a DAG. This
makes it more flexible than MapReduce.
▪ Fault tolerance can be achieved through the DAG: lost RDDs can be
recovered using this graph.
▪ In comparison to Hadoop MapReduce, the DAG provides better global
optimization.
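One way to picture the global-optimization point: because the whole DAG is known up front, consecutive narrow transformations can be fused into a single pass over the data, instead of one full job per step as in classic MapReduce. A minimal sketch in plain Python (the `fuse` helper is invented for illustration):

```python
# Fuse several per-element functions into one pass over the data.
def fuse(*fns):
    def fused(x):
        for fn in fns:
            x = fn(x)
        return x
    return fused

# Three chained steps become one function applied once per element.
step = fuse(lambda x: x + 1, lambda x: x * 2, str)
print([step(x) for x in [1, 2, 3]])  # ['4', '6', '8']
```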
