How to Enable Debug Mode in spark-submit, spark-shell, and pyspark

We often need to enable the DEBUG log level in Spark to understand and troubleshoot an issue. In this article, we will learn how to enable debug mode in spark-submit, spark-shell, and pyspark sessions.

To enable the debug log level, we need to set the logging level to DEBUG for the appropriate logger. This can be done in the system default or a custom log4j.properties file, or by passing the configuration on the Spark command line.

Below are some easy ways to enable the debug log level in Spark, which help us troubleshoot issues in depth:

  • Using the Spark configuration API
  • Using the command line interface
  • Using a log4j.properties file (custom or default)
  • Setting debug at the session level (spark-shell & pyspark)

Using the Spark configuration API

We can set the spark log level programmatically using the Spark configuration API.

For Example:

import org.apache.log4j.{Level, Logger}

// Set the root logger to DEBUG for the current JVM (the driver)
val rootLogger = Logger.getRootLogger
rootLogger.setLevel(Level.DEBUG)
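
If the root logger is too noisy, the same log4j API can raise the level for a single package instead; a minimal sketch (the package name here is only an example):

import org.apache.log4j.{Level, Logger}

// Enable DEBUG only for Spark's scheduler; other loggers keep their configured level
Logger.getLogger("org.apache.spark.scheduler").setLevel(Level.DEBUG)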

Using the command line interface

We can pass the following options to the spark-submit command. spark.logConf=true logs the effective Spark configuration when the SparkContext starts, and -Dlog4j.debug=true makes log4j print its own initialization diagnostics, which helps verify which log4j.properties file is actually being loaded:

--conf "spark.logConf=true" --conf "spark.driver.extraJavaOptions=-Dlog4j.debug=true"

Using a log4j.properties file

We can use either a custom log4j.properties file or the system default properties file (/etc/spark/conf/log4j.properties).

Custom log4j.properties file

We can create a custom log4j.properties file and include it while running the Spark job (spark-submit, spark-shell, or pyspark).

Create a log4j.properties file locally and add the lines below.

NOTE: You can enable debug for any specific class as well, based on your needs (see the example after this file); here, debug is enabled for root.logger.

log4j.rootLogger=${root.logger}
root.logger=DEBUG,console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
shell.log.level=DEBUG
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
log4j.logger.org.apache.spark.repl.Main=${shell.log.level}
log4j.logger.org.apache.spark.api.python.PythonGatewayServer=${shell.log.level}
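
As the note above mentions, you can raise a specific package to DEBUG instead of the root logger; for example, add a line like the following (the package name is illustrative) while root.logger stays at INFO:

log4j.logger.org.apache.spark.sql=DEBUG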

spark-submit command:

~]$ spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --files /tmp/log4j.properties --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" --driver-java-options "-Dlog4j.configuration=file:/tmp/log4j.properties" /opt/cloudera/parcels/CDH/jars/spark-examples*.jar 1 1

spark-shell command:

~]$ spark-shell --files /tmp/log4j.properties --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" --driver-java-options "-Dlog4j.configuration=file:/tmp/log4j.properties"

pyspark command:

~]$ pyspark --files /tmp/log4j.properties --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" --driver-java-options "-Dlog4j.configuration=file:/tmp/log4j.properties"

Note that the executors reference log4j.properties by its bare file name because --files ships the file into each executor's working directory, while the driver reads it from the local path with the file: prefix.

Example output:

22/11/20 09:09:19 DEBUG ipc.Client: IPC Client (1119621607) connection to :8032 from test_user sending #39 org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport
22/11/20 09:09:19 DEBUG ipc.Client: IPC Client (1119621607) connection to :8032 from test_user got value #39
22/11/20 09:09:19 DEBUG ipc.ProtobufRpcEngine: Call: getApplicationReport took 5ms
22/11/20 09:09:19 INFO state.StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Python version 2.7.5 (default, Oct 14 2020 14:45:30)
SparkSession available as 'spark'.
>>> 22/11/20 09:09:20 DEBUG ipc.Client: IPC Client (1119621607) connection to :8032 from test_user sending #40 org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport
22/11/20 09:09:20 DEBUG ipc.Client: IPC Client (1119621607) connection to :8032 from test_user got value #40
22/11/20 09:09:20 DEBUG ipc.ProtobufRpcEngine: Call: getApplicationReport took 3ms

Default log4j.properties file (/etc/spark/conf/log4j.properties)

Alternatively, edit the Spark default log4j.properties file located under /etc/spark/conf:

conf]$ cat log4j.properties 
log4j.rootLogger=${root.logger}
root.logger=INFO,console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
shell.log.level=WARN
log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
log4j.logger.org.apache.spark.repl.Main=${shell.log.level}
log4j.logger.org.apache.spark.api.python.PythonGatewayServer=${shell.log.level}

Change the levels for root.logger and shell.log.level to DEBUG, as below:

log4j.rootLogger=${root.logger}
root.logger=DEBUG,console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
shell.log.level=DEBUG
log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
log4j.logger.org.apache.spark.repl.Main=${shell.log.level}
log4j.logger.org.apache.spark.api.python.PythonGatewayServer=${shell.log.level}
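
A quick way to apply both changes in place (a sketch that assumes GNU sed and write access to the file):

conf]$ sudo sed -i -e 's/^root.logger=INFO/root.logger=DEBUG/' -e 's/^shell.log.level=WARN/shell.log.level=DEBUG/' log4j.properties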

Once this change is made, any Spark job submitted from this node, whether through spark-shell, spark-submit, or pyspark, will have the debug log level enabled.

spark-shell output:

~]$ spark-shell
22/11/20 09:05:36 DEBUG ipc.ProtobufRpcEngine: Call: getApplicationReport took 4ms
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/
         
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_141)
Type in expressions to have them evaluated.
Type :help for more information.
22/11/20 09:05:37 DEBUG ipc.Client: IPC Client (1731685294) connection to :8032 from test_user sending #31 org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport
22/11/20 09:05:37 DEBUG ipc.Client: IPC Client (1731685294) connection to :8032 from test_user got value #31
22/11/20 09:05:37 DEBUG ipc.ProtobufRpcEngine: Call: getApplicationReport took 2ms

Setting debug at the session level (spark-shell & pyspark)

In this approach, you enable debugging at the session level by calling setLogLevel with "DEBUG". This is widely used for interactive Spark sessions, and the same call works in both spark-shell and pyspark.

$ spark-shell
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at 
Spark context available as 'sc' (master = yarn, app id = application_ID).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/
      
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sc.setLogLevel("DEBUG")

scala> 18/09/17 07:58:57 DEBUG Client: IPC Client (2312412312) connection to :8032 from spark sending #69
18/09/17 07:58:57 DEBUG Client: IPC Client (2312412312) connection to :8032 from spark got value #69
18/09/17 07:58:57 DEBUG ProtobufRpcEngine: Call: getApplicationReport took 8ms
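
setLogLevel accepts ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, and WARN, so you can drop back to a quieter level in the same session without restarting it:

scala> sc.setLogLevel("INFO")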

Conclusion:

We can use any one of these approaches to enable the debug log level for a Spark job, based on our needs.

  • To enable debugging without changing anything in the command, use the default log4j.properties file (/etc/spark/conf/log4j.properties). Note that this enables debugging for every application launched from that node.
  • To enable debugging for one specific application, use a custom log4j.properties file passed on the command line.
  • To enable debugging in an interactive Spark session, use setLogLevel at the session level.

Do let me know if you have any other ways to achieve the debug log level, or start a discussion if you are seeing any issues enabling debug.

Good luck with your learning!
