How to Run the Spark History Server Locally

This is a step-by-step guide to setting up the Spark History Server locally (on Mac or Windows), which makes it possible to debug Spark application event logs on your own machine.
What is the Spark history server?
The Spark History Server lets you monitor Spark application metrics such as the number of jobs, environment variables, and the time taken to complete each task. Without it, the only way to see this information is through the Spark context Web UI while the job is still running.
By default, the Spark context Web UI is hosted on port 4040. If that port is occupied, Spark retries by incrementing the port by 1 until it finds an open one:
4040 => 4041 => 4042
Example:
23/01/31 13:04:23 INFO server.Server: Started @5922ms
23/01/31 13:04:23 INFO server.AbstractConnector: Started ServerConnector@609e288{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
23/01/31 13:04:23 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
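If the port probing is undesirable, the UI can also be pinned to a fixed port via the spark.ui.port configuration. A minimal sketch (the port, class name, and jar below are placeholders):
# pin the Spark UI to port 4050 instead of the default 4040
spark-submit --conf spark.ui.port=4050 --class com.example.MyApp my-app.jar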
This information is only available during the lifespan of the Spark application; once the application completes, the Web UI is no longer accessible.
To retain these metrics, we need to make sure the property below is enabled. Once it is, the information is written out as event logs and can be viewed in the Spark History Server Web UI even after the application completes:
spark.eventLog.enabled=true
By default, the event logs are persisted to an HDFS location like the one below (each application writes its log under this directory, named by its application ID); the location can be changed based on the user's need:
spark.eventLog.dir=hdfs://namenode/user/spark/applicationHistory
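As an example, here is a minimal sketch of enabling event logging at submit time (the class name, jar, and HDFS path are placeholders; adjust them to your environment):
# enable event logging and point it at an HDFS directory
spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs://namenode/user/spark/applicationHistory \
  --class com.example.MyApp \
  my-app.jar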
Why do we need to set up the Spark History Server (SHS) locally?
– The Spark History Server maintains event information for all completed and in-progress Spark applications, such as:
- A list of scheduler stages and tasks
- A summary of RDD sizes and memory usage
- Environmental information like driver/executor memory, cores, deploy mode, etc
- Information about the running tasks/executors
– When analyzing Spark application issues such as poor performance, job failures, or hangs, the event logs are a valuable asset that keeps most of the relevant information at hand.
– Instead of setting up a full Spark cluster, we can run the Spark History Server locally and troubleshoot by copying the event logs from the original job into the local server's log directory.
Download Spark event logs locally
– This can be done by clicking the Download button for the application on the existing History Server Web UI
[Screenshot: Download button on the Spark History Server Web UI]
– Or the logs can be copied from the HDFS directory to the local machine:
hdfs dfs -get <spark.history.fs.logDirectory> <local path>
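For example (the application ID and paths below are illustrative; substitute your own):
# copy one application's event log from HDFS to a local directory
hdfs dfs -get /user/spark/applicationHistory/application_1675150000000_0001 /tmp/spark/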
Setup Spark History Server
NOTE: Make sure Java is installed.
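A quick way to verify (Spark 3.3.x runs on Java 8, 11, or 17):
java -version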
– Download the Apache Spark tarball
Example:
[https://www.apache.org/dyn/closer.lua/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz]
tar -xzf spark-3.3.0-bin-hadoop3.tgz
– Once you have extracted the Spark tarball, move it to a location where you want to keep it permanently
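For example (the destination is just a suggestion; the later commands in this guide are run from whatever directory contains the extracted folder):
# keep the extracted folder under the home directory
mv spark-3.3.0-bin-hadoop3 ~/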
We have almost completed the setup:
– Export SPARK_HISTORY_OPTS so that spark.history.fs.logDirectory points to a local path
Note: The local path can be any location on your PC/laptop
export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=/tmp/spark"
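Also make sure the directory exists before starting the server; the History Server may refuse to start if the configured log directory is missing:
mkdir -p /tmp/spark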
– Start the Spark History Server with the command below
./spark-3.3.0-bin-hadoop3/sbin/start-history-server.sh
The History Server's own logs are stored locally under the default "logs" directory: ./spark-3.3.0-bin-hadoop3/logs
This starts the Spark History Server, which listens on the default port 18080.
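To confirm it is up, check for the HistoryServer JVM process (jps ships with the JDK) or probe the port; both are just quick sanity checks:
# the history server runs as a Java process named HistoryServer
jps | grep HistoryServer
# or confirm the port answers
curl -s http://localhost:18080 > /dev/null && echo "History server is up"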
– Copy the downloaded Spark event logs into /tmp/spark
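For example, assuming the event log was downloaded to the Downloads folder (the file name is illustrative):
cp ~/Downloads/application_1675150000000_0001 /tmp/spark/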
– The History Server UI can then be accessed from a local browser: http://localhost:18080
Conclusion
Troubleshooting a Spark application requires a good understanding of its metrics in order to identify issues. Setting up the Spark History Server locally gives us the freedom to debug from our own machines.
Check here to learn how to find skewness in Spark using the Spark History Server.
Good luck with your learning!