Resolve “Could not find CoarseGrainedScheduler” in Spark

In this article, we will learn what the CoarseGrainedScheduler is and why this error appears during job execution.

“Could not find CoarseGrainedScheduler” is a commonly encountered error message that typically indicates the driver has failed for some reason, while the executors, responsible for carrying out the tasks, have not terminated their connections and keep trying to reach the driver.

Before proceeding further, let’s understand what “CoarseGrainedScheduler” and “CoarseGrainedSchedulerBackend” are at a high level, which will help us make sense of this ERROR message.

CoarseGrainedScheduler & CoarseGrainedSchedulerBackend Explained

As we all know, a Spark job has two main components: the Driver and the Executor.

Driver

It is the program that runs the main function and coordinates the execution of Spark applications, interacting with the cluster manager and distributing tasks to worker nodes for processing.


Executor

It is a process launched on a worker node that runs tasks and processes data within a Spark application, providing parallel processing capabilities for distributed data processing.
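
To make these two roles concrete, below is a minimal sketch of a Spark application in Scala (the object name and input values are illustrative). The main function runs in the driver; the work on the partitions is shipped to executors as tasks.

import org.apache.spark.sql.SparkSession

object DriverExecutorDemo {
  def main(args: Array[String]): Unit = {
    // This main function runs inside the driver process.
    val spark = SparkSession.builder()
      .appName("driver-executor-demo")
      .getOrCreate()

    // The driver builds the execution plan; executors run the tasks in parallel.
    val sum = spark.sparkContext
      .parallelize(1 to 1000, numSlices = 8) // 8 partitions => 8 tasks
      .map(_ * 2)                            // runs on executors
      .reduce(_ + _)                         // partial results are merged at the driver

    println(s"sum = $sum")
    spark.stop()
  }
}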


Both the CoarseGrainedScheduler endpoint and the CoarseGrainedSchedulerBackend live on the Driver; similarly, on the Executor side there are the Executor endpoint and the CoarseGrainedExecutorBackend.

CoarseGrainedSchedulerBackend is a SchedulerBackend that ships with Spark and works with cluster managers such as YARN to schedule tasks on Executors. It is created at the time of Spark context initialization.

So when the CoarseGrainedSchedulerBackend starts (during Spark context initialization), it registers the CoarseGrainedScheduler RPC endpoint as the driver’s communication endpoint.

Similarly, on the Executor, the CoarseGrainedExecutorBackend registers itself as the Executor RPC endpoint.

So, whenever an executor starts, its CoarseGrainedExecutorBackend immediately connects to the CoarseGrainedScheduler to announce that it can run tasks, and this connection lasts until the job completes. (The log snippet below shows this.)

INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@<hostname>:<port>
INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver

Hope that explains the CoarseGrainedScheduler and the CoarseGrainedSchedulerBackend.

Why we see the “Could not find CoarseGrainedScheduler” ERROR

Based on the above explanation of the CoarseGrainedScheduler: whenever an executor starts, the CoarseGrainedExecutorBackend connects to the CoarseGrainedScheduler, and that connection stays up until the end of the job.

In the event of any problem with the Spark Driver, such as an unexpected exit, the connection between the executor and the driver may not be properly closed. Consequently, the executor keeps trying to reach the driver (CoarseGrainedExecutorBackend -> CoarseGrainedScheduler), leading to the error message below (a simplified model of this failure follows the stack trace).

ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
        at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154)
        at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:134)
        at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:570)
        at org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:180)
        at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
        at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119)
        at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
        at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
        at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
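
To see why the lookup fails, here is a small, self-contained Scala sketch. This is not Spark’s actual code, only a simplified model of its RPC dispatcher: the driver registers the “CoarseGrainedScheduler” endpoint in a registry; when the driver dies, the endpoint disappears, and any message an executor sends afterwards fails with exactly this kind of “Could not find” error (compare Dispatcher.postMessage in the stack trace above).

import scala.collection.mutable

object DispatcherModel {
  // Simplified stand-in for Spark's RPC dispatcher: a name -> handler registry.
  private val endpoints = mutable.Map[String, String => Unit]()

  def register(name: String)(handler: String => Unit): Unit =
    endpoints(name) = handler

  // What an unexpectedly exiting driver effectively does to its endpoints.
  def unregisterAll(): Unit = endpoints.clear()

  // Analogous to Dispatcher.postMessage in the stack trace above.
  def postMessage(name: String, msg: String): Unit =
    endpoints.get(name) match {
      case Some(handler) => handler(msg)
      case None          => throw new RuntimeException(s"Could not find $name.")
    }

  def main(args: Array[String]): Unit = {
    // 1. Driver starts: the CoarseGrainedScheduler endpoint is registered.
    register("CoarseGrainedScheduler")(msg => println(s"driver received: $msg"))

    // 2. An executor registers itself -- this succeeds.
    postMessage("CoarseGrainedScheduler", "RegisterExecutor(1)")

    // 3. The driver exits unexpectedly; its endpoints vanish with it.
    unregisterAll()

    // 4. The executor is still alive and keeps sending -- this now throws,
    //    mirroring "Could not find CoarseGrainedScheduler".
    postMessage("CoarseGrainedScheduler", "StatusUpdate(...)")
  }
}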

Therefore, it is important to note that the error message “Could not find CoarseGrainedScheduler” is not the actual issue itself. To address the problem, investigate the underlying cause of the driver’s unexpected failure and resolve that.

Resolution

Based on the above discussion, the error message “Could not find CoarseGrainedScheduler” is not itself the issue; we need to find out why the Driver <-> Executor connection failed and resolve that.

For that, we need to collect the driver and executor logs for further investigation.

Driver Logs

If you are running the Spark job in client mode, driver logs are printed to the console output. In cluster mode, driver logs are recorded along with the Application Master container logs.

Command to collect the YARN application logs:

yarn logs -applicationId application_1576830357464_0003 -appOwner <AppOwner>

where the application ID is the corresponding app ID and AppOwner is the name of the user who submitted the job.

Scenario 1

In one of my failures, we found that the driver had failed with a disk issue (triggering the ERROR message “Could not find CoarseGrainedScheduler”):

WARN  TaskSetManager:69 - Lost task 2342 in stage 342341 (TID 34234, hostname, executor 324): java.io.FileNotFoundException: /learn/yarn/local/usercache/learnerkb/application_12394857239845_34234/blockmgr/03/tmp_shuffle_323423423423423423 (Input/output error)

From the above log, we can see that the job was unable to write to its local disk (a disk-level issue), causing the failure. After resolving the disk issue, we could eliminate the CoarseGrainedScheduler ERROR.

Scenario 2

If there is a network glitch between the Driver and the executor hosts, you will see similar ERROR messages in the logs. Resolving the network issue should resolve the parent issue.
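
While the root cause is being fixed, you can also make the RPC layer a little more tolerant of transient glitches. Below is a minimal sketch using standard Spark configuration keys; the values are illustrative, not recommendations, and raising timeouts only buys tolerance, it does not cure the underlying network problem.

import org.apache.spark.sql.SparkSession

object NetworkTolerantJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("network-tolerant-job")
      // Raise the default timeout for network interactions (default: 120s).
      .config("spark.network.timeout", "300s")
      // RPC ask timeout; falls back to spark.network.timeout when unset.
      .config("spark.rpc.askTimeout", "300s")
      // Keep executor heartbeats well below spark.network.timeout (default: 10s).
      .config("spark.executor.heartbeatInterval", "20s")
      .getOrCreate()

    // ... run the job ...
    spark.stop()
  }
}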

NOTE: This ERROR message does not by itself guarantee job failure, but it can lead to job failure at some point (for example, if the issue persists and all of the minimum-threshold executors fail).

Conclusion

The “Could not find CoarseGrainedScheduler” ERROR is, by itself, harmless to the actual execution of Spark jobs. The message shows that there is some issue with Driver <-> executor communication; fixing that underlying issue helps resolve the job failure.

There is an existing JIRA, SPARK-22769, to change the log level from ERROR to WARN, as these messages are a little confusing at first glance.

Related Topics

What is the difference between apache spark and pyspark

Difference between DataFrame, Dataset, and RDD in Spark

What is the difference between Cache and Checkpoint in Spark

Resolve “Task serialization failed: java.lang.StackOverflowError” in Spark

spark.driver.memoryOverhead and spark.executor.memoryOverhead explained
