Anatomy of a Spark Job Run
A Spark program is fundamentally lazy — when you chain transformations like map, filter, or reduceByKey on an RDD, nothing actually executes. Spark just builds up a lineage graph describing what transformations need to happen. Execution is triggered only when you call an action such as count(), collect(), or saveAsTextFile(). That action call internally invokes SparkContext.runJob(), which hands control to the scheduler running inside the driver process. This is the boundary between “describing computation” and “actually computing.”
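The describe-then-compute split can be sketched with a toy lazy collection. This is a hypothetical mini-RDD for illustration, not Spark's real classes: transformations only record lineage, and the action replays it.

```python
# Hypothetical mini-RDD illustrating lazy evaluation: map/filter only record
# lineage; the action (collect) is what actually executes the chain.

class MiniRDD:
    def __init__(self, data, ops=()):
        self._data = data          # source data
        self._ops = ops            # recorded (lazy) transformations

    def map(self, f):              # no computation happens here
        return MiniRDD(self._data, self._ops + (("map", f),))

    def filter(self, p):           # just extends the lineage
        return MiniRDD(self._data, self._ops + (("filter", p),))

    def collect(self):             # the action: replay the lineage now
        out = list(self._data)
        for kind, fn in self._ops:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

rdd = MiniRDD(range(10)).map(lambda x: x * 2).filter(lambda x: x > 10)
# Nothing has run yet; only collect() triggers execution.
print(rdd.collect())  # -> [12, 14, 16, 18]
```

In real Spark, collect() bottoms out in SparkContext.runJob() exactly as described above; the toy version just makes the lazy/eager boundary visible.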
DAG Scheduler — Logical Level
The scheduler is split into two components: the DAG scheduler and the task scheduler, which operate at different levels of abstraction. The DAG scheduler works at the logical level — it takes the RDD lineage reachable from the action call and constructs a DAG of stages. One action call produces exactly one job, and one job has exactly one DAG. The DAG scheduler’s primary responsibility is to find shuffle boundaries in the lineage: operations like reduceByKey, join, and groupByKey require redistributing data across partitions by key, and each such operation creates a stage boundary. Everything between two shuffle boundaries is one stage, so the result is N shuffle stages followed by exactly one final result stage. If you have multiple independent RDDs with separate action calls, those are separate jobs with separate DAGs, each following this same rule independently. If multiple RDDs are combined via join or cogroup before a single action, the DAG scheduler sees both lineages and merges them into one DAG.
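The stage-cutting rule above can be modeled in a few lines. This is a simplified sketch of the idea, not Spark's actual DAGScheduler: walk the lineage in order and cut a stage at every wide (shuffle) dependency.

```python
# Simplified model of stage splitting: every wide transformation ends the
# current stage; whatever remains after the last shuffle is the result stage.

WIDE = {"reduceByKey", "join", "groupByKey"}   # shuffle-boundary operations

def split_into_stages(lineage):
    """lineage: ordered list of transformation names leading to the action."""
    stages, current = [], []
    for op in lineage:
        current.append(op)
        if op in WIDE:            # shuffle boundary: close the stage here
            stages.append(current)
            current = []
    stages.append(current)        # the final result stage
    return stages

lineage = ["map", "filter", "reduceByKey", "map", "groupByKey", "flatMap"]
print(split_into_stages(lineage))
# -> [['map', 'filter', 'reduceByKey'], ['map', 'groupByKey'], ['flatMap']]
# i.e. 2 shuffle stages followed by exactly 1 result stage
```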
Stages and Tasks
Within a stage, the unit of work is a task, and there is exactly one task per partition. So if your RDD has 100 partitions, that stage spawns 100 tasks which can run in parallel across executors. Every task in a stage is of the same type — either all shuffle map tasks or all result tasks.
Shuffle stages contain shuffle map tasks: each task applies all the chained narrow transformations (map, filter, flatMap, etc.) on its assigned partition entirely in memory, and at the end writes its output to local disk on the executor, partitioned by key. This disk write happens even for in-memory RDDs — it’s a deliberate fault-tolerance tradeoff so that if an executor dies, only that stage needs to be rerun. The next stage fetches these disk-written partitions over the network.
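A shuffle map task's two phases — in-memory narrow work, then key-partitioned output — can be sketched as follows. This is an illustrative model (function names are invented), with the returned buckets standing in for the per-reducer files Spark writes to local disk.

```python
# Sketch of one shuffle map task: apply chained narrow transformations to its
# assigned partition in memory, then bucket output records by hash of the key,
# one bucket per downstream partition (in Spark: one local-disk file region).

def shuffle_map_task(partition, narrow_ops, num_reducers):
    records = partition
    for op in narrow_ops:                 # all narrow work stays in memory
        records = [op(r) for r in records]
    buckets = {i: [] for i in range(num_reducers)}
    for key, value in records:            # partition the output by key
        buckets[hash(key) % num_reducers].append((key, value))
    return buckets                        # stand-in for the local disk write

part = [("a", 1), ("b", 2), ("a", 3)]
out = shuffle_map_task(part, [lambda kv: (kv[0], kv[1] * 10)], num_reducers=2)
# All records with the same key land in the same bucket, so the next stage
# can fetch everything for one key from known locations.
```

The key property is deterministic placement: every map task uses the same hash, so a reducer for bucket i knows it must fetch bucket i from every map task's output.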
The final stage contains result tasks: each task again applies any chained narrow transformations on its partition, but instead of writing to disk for a next stage, it either sends its computed value back to the driver (for actions like count() or collect()) or writes directly to external storage like HDFS or S3 (for actions like saveAsTextFile()). The driver assembles the per-partition results from all result tasks into the final answer returned to your program.
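For an action like count(), the driver-side assembly step is just a reduction over per-partition values. A minimal sketch, with invented names:

```python
# Sketch of result tasks for count(): each task applies its narrow
# transformations (here, a filter) and returns a per-partition count; the
# driver assembles the final answer by summing them.

def result_task_count(partition, predicates):
    records = partition
    for p in predicates:
        records = [r for r in records if p(r)]   # chained narrow filters
    return len(records)                          # value sent back to the driver

partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
per_partition = [result_task_count(p, [lambda x: x % 2 == 0]) for p in partitions]
total = sum(per_partition)     # driver-side assembly of the final result
print(total)  # -> 4  (the evens: 2, 4, 6, 8)
```

For collect() the driver would concatenate per-partition lists instead of summing counts; for saveAsTextFile() each task writes its own output file and only a success status flows back.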
Placement Preferences
The DAG scheduler also annotates each task with a placement preference before passing stages to the task scheduler. If a task’s input partition is cached in memory on a specific executor, the preference is process-local. If it’s stored on an HDFS DataNode, the preference is node-local. This annotation is passed down but the DAG scheduler doesn’t make scheduling decisions itself — that’s the task scheduler’s job.
Task Scheduler — Physical Level
The task scheduler operates at the physical level. It receives a stage’s set of tasks from the DAG scheduler and maps them to actual executors. It maintains a list of all executors running for the application and assigns tasks to those with free cores, defaulting to one core per task (configurable). It respects placement preferences in the order:
process-local → node-local → rack-local → nonlocal → speculative
Speculative tasks are duplicates of already-running tasks that the scheduler launches as a backup when a task is running unusually slowly. Child stages are only submitted to the task scheduler once all tasks in the parent stage have completed successfully — this enforces the DAG ordering.
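One way to picture locality-aware assignment is as a nested loop over the preference order. This is a heavily simplified sketch (Spark's TaskSetManager also uses delay scheduling with configurable wait timeouts, omitted here), and all names are illustrative:

```python
# Sketch of locality-aware task selection: given a free executor, scan pending
# tasks at each locality level in preference order and take the first match.

LOCALITY_ORDER = ["process-local", "node-local", "rack-local", "nonlocal"]

def pick_task(pending, executor, node, rack):
    """pending: list of (task_id, prefs), where prefs maps a locality level
    to the set of acceptable locations at that level."""
    here = {"process-local": executor, "node-local": node, "rack-local": rack}
    for level in LOCALITY_ORDER:                 # best locality first
        for task_id, prefs in pending:
            if level == "nonlocal":              # anything can run anywhere
                return task_id, level
            wanted = prefs.get(level)
            if wanted and here[level] in wanted:
                return task_id, level
    return None

pending = [
    (1, {"node-local": {"node-7"}}),             # wants a different node
    (2, {"process-local": {"exec-3"}}),          # cached on this very executor
]
print(pick_task(pending, executor="exec-3", node="node-9", rack="rack-1"))
# -> (2, 'process-local'): the process-local task wins despite being listed second
```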
Scheduler Backend and Executor
Once the task scheduler decides which task goes to which executor, it doesn’t communicate directly with executors. It hands off to the SchedulerBackend, which serializes the task code and sends a “launch task” message over the network to the ExecutorBackend on the worker node. On the worker side, the ExecutorBackend receives this message and passes it to the Executor process. The executor first ensures all JAR and file dependencies are up to date, then deserializes the task code from the message, and finally runs it. Critically, tasks run inside the same JVM as the executor — there is no process fork per task, which eliminates process startup overhead and is a key performance advantage over Hadoop MapReduce.
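The serialize → send → deserialize → run-in-process flow can be sketched with a pure-Python analogy. This is not Spark code: pickle stands in for the task's serialized payload, and a thread pool stands in for the executor's shared JVM, where launching a task costs a thread dispatch rather than a process fork.

```python
# Analogy for the launch path: the scheduler backend serializes the task
# payload into a "launch task" message; the executor deserializes it and runs
# the work on an existing thread inside one long-lived process.

import pickle
from concurrent.futures import ThreadPoolExecutor

def make_task(partition):
    def task():
        return sum(partition)          # the task body: the user's actual work
    return task

# Scheduler backend side: serialize the payload into the launch message.
message = pickle.dumps([1, 2, 3, 4])

# Executor side: deserialize and run in-process on a pooled worker thread.
with ThreadPoolExecutor(max_workers=2) as pool:    # one shared "executor"
    partition = pickle.loads(message)
    future = pool.submit(make_task(partition))
    print(future.result())  # -> 10
```

The point of the analogy: because the pool (like the executor JVM) already exists, per-task launch cost is a message plus a deserialization, not a process start.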
Result Flow and Fault Tolerance
When a task finishes, results flow back to the driver. A shuffle map task sends back metadata — specifically, the locations of its output partition files on local disk — so the next stage knows where to fetch its input from. A result task sends back the actual computed value for its partition, serialized and routed through the ExecutorBackend to the driver as a status update message. The driver assembles these per-partition values into the final result.
If a task fails, the task scheduler resubmits it on a different executor. If an RDD was persisted (explicitly cached) from a previous job in the same SparkContext, the DAG scheduler short-circuits and skips creating stages to recompute it or any of its ancestors — it treats the cached RDD as a known starting point.
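The cache short-circuit amounts to truncating the lineage walk at the first persisted RDD. A simplified model with invented names:

```python
# Sketch of the DAG scheduler's cache short-circuit: when planning stages for
# a lineage, treat the most recent cached RDD as a known starting point and
# skip scheduling anything upstream of it.

def stages_to_compute(lineage, cached):
    """lineage: ancestor-to-descendant list of RDD names.
    cached: set of RDD names whose partitions are already persisted.
    Returns the suffix of the lineage that must actually be computed."""
    start = 0
    for i, rdd in enumerate(lineage):
        if rdd in cached:         # known starting point: drop all ancestors
            start = i + 1
    return lineage[start:]

lineage = ["raw", "parsed", "filtered", "aggregated"]
print(stages_to_compute(lineage, cached={"filtered"}))   # -> ['aggregated']
print(stages_to_compute(lineage, cached=set()))          # -> full lineage
```

If the cached partitions are later lost (for example, the executor holding them dies), real Spark falls back to the full lineage and recomputes them, which is exactly what the empty-cache case models.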