Spark Driver in Apache Spark and Where Does the Spark Driver Run?

Spark Driver in Apache Spark

The Spark driver is the process that starts the SparkContext (or SparkSession) in a Spark application; it communicates with the resource manager and coordinates the tasks that run on the executors.
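For example, simply launching an interactive shell already starts a driver on the machine you run it from; a minimal illustration, using YARN as the cluster manager as in the examples later in this post:

#spark-shell starts a driver JVM on the local host; the driver creates a
#SparkContext (sc) and SparkSession (spark) and registers the application
#with the cluster manager (YARN here)
spark-shell --master yarn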

Below are the main functions of the Spark driver:

  • Schedules tasks
  • Monitors the tasks
  • Receives and collects the results of actions
[Figure: Spark Context]

Driver schedules Tasks

  • The driver program creates the SparkContext and tries to connect to the resource manager, which allocates resources across applications
  • Once resources are allocated, Spark can create executors on the worker nodes
  • The SparkContext then sends the application code and dependency JARs to the executors to run the tasks
  • Once the tasks are completed, the driver receives and displays the results/status (see the sketch below)
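As a sketch of this flow, the resource request that the driver makes to the resource manager can be spelled out explicitly on the command line; the executor counts and sizes below are only illustrative values, and the JAR path is the one used in the examples later in this post:

#The driver asks YARN for 2 executors with 1 core and 1 GB of memory each,
#ships the application JAR to them, and collects the result of the job
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client \
  --num-executors 2 --executor-cores 1 --executor-memory 1g \
  /opt/cloudera/parcels/CDH/jars/spark-examples*.jar 100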

Monitor the tasks

The driver opens a WebUI to monitor the running tasks of the Spark job, and the WebUI stays up for the lifetime of the job. Once the driver terminates, so does your Spark application.

http://<driver-host>:4040
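The UI listens on port 4040 by default; if that port is already taken (for example by another driver on the same host), Spark falls back to 4041, 4042, and so on. The port can also be pinned explicitly; a hedged example, reusing the SparkPi command from below with an arbitrarily chosen port:

#spark.ui.port pins the WebUI of this driver to a specific port
spark-submit --conf spark.ui.port=4050 --class org.apache.spark.examples.SparkPi \
  --master yarn --deploy-mode client /opt/cloudera/parcels/CDH/jars/spark-examples*.jar 1 1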

Where does the spark driver run?

The driver program runs on different hosts depending on the deploy mode of the Spark job

Deploy mode

  • Client Mode
  • Cluster Mode

Client Mode:

When the Spark job runs with the deploy mode set to client, the driver program runs on the same host from which the job is triggered

To specify the deploy mode, we need to add the option "--deploy-mode client" to the command, as shown below

spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client  /opt/cloudera/parcels/CDH/jars/spark-examples*.jar  1 1

When running the above example, we can see that the Spark driver runs on the same node where the command is triggered.
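One simple way to confirm this (assuming shell access to that node) is to look for the driver JVM on the submitting host while the job is running; in client mode the driver runs inside the SparkSubmit process:

#Run on the host where spark-submit was launched, while the job is still running
ps -ef | grep '[S]parkSubmit'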

Execution steps explained:

~]# spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client  /opt/cloudera/parcels/CDH/jars/spark-examples*.jar  1 1

#Driver creates spark context for this spark job

22/11/24 09:35:41 INFO spark.SparkContext: Running Spark version 2.4.0
22/11/24 09:35:41 INFO logging.DriverLogger: Added a local log appender at: /tmp/spark-4982c430-f0fb-4d71-887a-ce50118d0ecc/__driver_logs__/driver.log
22/11/24 09:35:41 INFO spark.SparkContext: Submitted application: Spark Pi

#Driver started a WebUI to monitor the Tasks

22/11/24 09:35:41 INFO util.Utils: Successfully started service 'sparkDriver' on port 37867.
22/11/24 09:35:41 INFO spark.SparkEnv: Registering MapOutputTracker
22/11/24 09:35:42 INFO spark.SparkEnv: Registering BlockManagerMaster
22/11/24 09:35:42 INFO server.AbstractConnector: Started ServerConnector@6594402a{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
22/11/24 09:35:42 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.

#Task distribution and collecting the results

22/11/24 09:35:56 INFO spark.SparkContext: Starting job: reduce at SparkPi.scala:38
22/11/24 09:35:56 INFO scheduler.DAGScheduler: Got job 0 (reduce at SparkPi.scala:38) with 1 output partitions
22/11/24 09:35:56 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:38)

22/11/24 09:36:05 INFO scheduler.DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 9.193 s
22/11/24 09:36:05 INFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 9.241394 s

#Results displayed

Pi is roughly 3.1392313923139232
22/11/24 09:36:05 INFO server.AbstractConnector: Stopped Spark@6594402a{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
22/11/24 09:36:05 INFO ui.SparkUI: Stopped Spark web UI at http://:4040

Cluster Mode:

In cluster mode, the driver program runs inside the Application Master (AM) container. Because of that, we won't see any driver-related logs or results in the console, as we did in client mode

To check the driver logs, we need to use the command below to collect the application logs; the driver logs are recorded as part of the AM container logs

yarn logs -applicationId <application-id> -appOwner <app-owner>

where

<application-id> is the ID of the corresponding application

<app-owner> is the name of the user who submitted the job
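For example, for the run shown below the application ID can be read from the AM container name (container_e05_1669264062105_0002_01_000001 belongs to application_1669264062105_0002); the owner used here is only a placeholder for whichever user submitted the job:

#Collect all container logs for the application into a file;
#the AM container section holds the driver logs
#("spark" below is a placeholder for the submitting user)
yarn logs -applicationId application_1669264062105_0002 -appOwner spark > app_logs.txt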

Execution steps explained:

~]# spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster  /opt/cloudera/parcels/CDH/jars/spark-examples*.jar  1 1

#The console output only has basic information, such as the resource requests and the AM container setup

22/11/24 09:41:34 INFO yarn.Client: Requesting a new application from cluster with 3 NodeManagers
22/11/24 09:41:34 INFO conf.Configuration: resource-types.xml not found
22/11/24 09:41:34 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
22/11/24 09:41:34 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (28672 MB per container)
22/11/24 09:41:34 INFO yarn.Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
22/11/24 09:41:34 INFO yarn.Client: Setting up container launch context for our AM
22/11/24 09:41:34 INFO yarn.Client: Setting up the launch environment for our AM container
22/11/24 09:41:34 INFO yarn.Client: Preparing resources for our AM container

.
.
.
22/11/24 09:41:49 INFO util.ShutdownHookManager: Shutdown hook called
22/11/24 09:41:49 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-8cd8db05-2c45-4e93-bced-cd17533e3ac8

Logs:

AM container

Container: container_e05_1669264062105_0002_01_000001

22/11/24 09:41:42 INFO yarn.ApplicationMaster: Starting the user application in a separate Thread
22/11/24 09:41:42 INFO yarn.ApplicationMaster: Waiting for spark context initialization...
22/11/24 09:41:42 INFO spark.SparkContext: Running Spark version 2.4.0
22/11/24 09:41:42 INFO spark.SparkContext: Submitted application: Spark Pi
22/11/24 09:41:43 INFO server.AbstractConnector: Started ServerConnector@47c45e5d{HTTP/1.1,[http/1.1]}{0.0.0.0:37117}
22/11/24 09:41:43 INFO util.Utils: Successfully started service 'SparkUI' on port 37117.
22/11/24 09:41:49 INFO scheduler.DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 3.903 s
22/11/24 09:41:49 INFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 3.951058 s
22/11/24 09:41:49 INFO server.AbstractConnector: Stopped Spark@47c45e5d{HTTP/1.1,[http/1.1]}{0.0.0.0:0}
22/11/24 09:41:49 INFO ui.SparkUI: Stopped Spark web UI at http://:37117

As we can see, the driver-related logs are available in the AM container logs
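If we also want to know which node actually hosted the driver, the YARN application report includes the AM host; a quick check using the same application ID (the exact report fields may vary slightly between YARN versions):

#Prints an application report that includes the AM Host, i.e. the node
#where the AM container (and therefore the driver) ran
yarn application -status application_1669264062105_0002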

NOTE: By default

  • Spark-shell and PySpark run in client mode (interactive shells cannot use cluster mode)
  • Spark-submit also runs in client mode, unless --deploy-mode cluster is specified or spark.submit.deployMode is set to cluster in spark-defaults.conf (a quick way to check this follows below)
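To confirm which default applies in a particular environment, check whether spark.submit.deployMode is set in spark-defaults.conf; the path below is the usual client-configuration location on CDH and may differ in your setup:

#If nothing is printed, Spark falls back to its built-in default (client)
grep deployMode /etc/spark/conf/spark-defaults.conf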

Conclusion:

The Spark driver starts the SparkContext to interact with the cluster manager (YARN, Mesos, etc.) to get resources, and it assigns tasks to the executors and monitors them. It basically drives your program.

Good luck with your learning!
