How to Run the Spark history server locally


This post is a step-by-step guide to setting up the Spark history server locally (Mac or Windows), which helps you debug Spark application event logs on your own machine.

What is the Spark history server?

The Spark history server helps monitor Spark application metrics such as the number of jobs, environment variables, and the time taken to complete each task. Without the Spark history server, the only way to check this information is by accessing the Spark context WebUI while the job is in a running state.

Usually, the Spark context WebUI is hosted on port 4040. If that port is occupied, Spark retries by incrementing the port by 1 until it finds an open port:

4040 => 4041 => 4042

Example:

23/01/31 13:04:23 INFO server.Server: Started @5922ms

23/01/31 13:04:23 INFO server.AbstractConnector: Started ServerConnector@609e288{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}

23/01/31 13:04:23 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.

This information is available only for the lifespan of the Spark application. Once the Spark application completes, the WebUI is no longer accessible.

To see this information afterwards, we need to make sure the property below is enabled. Once it is enabled, the same information, called event logs, is available in the Spark history server WebUI even after the Spark application has completed:

spark.eventLog.enabled = true

By default, this information is persisted to an HDFS location as shown below; the path can be changed based on the user's needs:

spark.eventLog.dir = hdfs://namenode/user/spark/applicationHistory/<applicationID>
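For illustration, both properties can also be set per job at submit time with spark-submit --conf flags; a minimal sketch, where the application class and jar names are placeholders:

# placeholder class and jar names, for illustration only
spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs://namenode/user/spark/applicationHistory \
  --class com.example.MyApp \
  myapp.jar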

Why do we need to set up the Spark history server (SHS) locally?

– The Spark history server maintains event information for all completed and in-progress Spark applications, such as:

  • A list of scheduler stages and tasks
  • A summary of RDD sizes and memory usage
  • Environmental information like driver/executor memory, cores, deploy mode, etc
  • Information about the running tasks/executors

– To analyze any Spark application issue, such as poor performance, a job failure, or a hang, Spark event logs are a valuable asset that keeps most of the relevant information handy.

– Instead of setting up a full Spark cluster, we can set up a Spark history server locally and troubleshoot the issue by uploading the event logs from the original job to it.

Download Spark event logs locally

– This can be done by clicking the [DOWNLOAD] button on the history server WebUI (screenshot below)

[Screenshot: the Download button on the Spark history server WebUI]

– Or the event logs can be copied from the HDFS directory to a local path:

hdfs dfs -get <spark.history.fs.logDirectory> <local path>
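For example, to fetch a single application's event log into /tmp/spark (the log directory and application ID below are placeholders; event log files are named after the application ID):

# application ID is a placeholder for illustration
hdfs dfs -get /user/spark/applicationHistory/application_1675190000000_0001 /tmp/spark/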

Set up the Spark history server

NOTE: Make sure Java is installed.

– Download the Apache Spark tarball

Example:

https://www.apache.org/dyn/closer.lua/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz

tar -xzf spark-3.3.0-bin-hadoop3.tgz

– Once you have extracted the Spark tarball, move it to a location you want to keep permanent (a combined sketch follows below)
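Putting the download, extract, and move steps together, a minimal sketch (assuming the archive.apache.org mirror and ~/spark as the permanent location; the remaining commands in this post are run from that directory):

# download from the Apache archive (mirror URL may differ)
curl -O https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
# extract and move to a permanent location (~/spark is an assumption)
tar -xzf spark-3.3.0-bin-hadoop3.tgz
mkdir -p ~/spark
mv spark-3.3.0-bin-hadoop3 ~/spark/
cd ~/spark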

We have almost completed the setup:

– Export SPARK_HISTORY_OPTS with spark.history.fs.logDirectory pointing to a local path

Note: The local path can be any location on your PC/laptop

export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=/tmp/spark"
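Multiple -D properties can be combined in the same variable; for example, spark.history.ui.port overrides the default UI port of 18080 (the port chosen below is an assumption):

# second -D overrides the default UI port; 18081 is an arbitrary choice
export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=/tmp/spark -Dspark.history.ui.port=18081"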

– Start the Spark history server using the command below

./spark-3.3.0-bin-hadoop3/sbin/start-history-server.sh

Spark history server logs are stored locally under the default "logs" directory: ./spark-3.3.0-bin-hadoop3/logs

This will start the Spark history server, which listens on the default port 18080.
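If you need to shut the history server down later, the companion stop script lives in the same sbin directory:

./spark-3.3.0-bin-hadoop3/sbin/stop-history-server.sh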

– Copy the downloaded Spark event logs into /tmp/spark (create the directory first if it does not exist)

– The WebUI can be accessed from a local browser: http://localhost:18080
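Besides the WebUI, the history server exposes a REST API that is handy for scripted checks; for example, listing the applications it has loaded:

curl http://localhost:18080/api/v1/applications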

Conclusion

To troubleshoot a Spark application, we need to understand the Spark metrics in order to identify issues. Setting up the Spark history server locally gives us the freedom to debug in our home setup.

Check here to learn how to find skewness in Spark using the Spark history server.

Good luck with your learning!!
