This article explains what the CoarseGrainedScheduler is and why the “Could not find CoarseGrainedScheduler” error shows up during Spark job execution.
“Could not find CoarseGrainedScheduler” is a commonly encountered error message that typically indicates the driver has failed, for any of several reasons, while an executor, which is responsible for carrying out the tasks, has not closed its connection and keeps trying to reach the driver.
Before going further, let’s look at “CoarseGrainedScheduler” and “CoarseGrainedSchedulerBackend” at a high level, which will help make sense of this error message.
CoarseGrainedScheduler & CoarseGrainedSchedulerBackend Explained
A Spark job has two main components: the Driver and the Executor.
The Driver is the program that runs the main function and coordinates the execution of Spark applications, interacting with the cluster manager and distributing tasks to worker nodes for processing.
An Executor is a process launched on a worker node that runs tasks and processes data within a Spark application, providing parallel processing capabilities for distributed data processing.
Both CoarseGrainedScheduler & CoarseGrainedSchedulerBackend live in the Driver. Their counterpart in the Executor is CoarseGrainedExecutorBackend.
CoarseGrainedSchedulerBackend is a SchedulerBackend that ships with Spark and supports cluster managers such as YARN in scheduling tasks on executors. It is created during SparkContext initialization.
When the CoarseGrainedSchedulerBackend starts (during SparkContext initialization), it registers the CoarseGrainedScheduler RPC endpoint as the driver’s communication endpoint.
Similarly, in the Executor, CoarseGrainedExecutorBackend registers itself as the Executor RPC endpoint.
Whenever an executor starts, CoarseGrainedExecutorBackend immediately connects to CoarseGrainedScheduler to announce that it can run tasks, and this connection lasts until the job completes. The log snippet below shows this:
INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@<hostname>:<port>
INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver
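When checking executor logs, it can help to extract which driver host and port an executor tried to reach. The following is a minimal sketch, assuming the log format shown in the snippet above. The function name `parse_driver_endpoint` and the sample host and port are hypothetical and used only for illustration.

```python
import re

# Matches the endpoint URL printed by CoarseGrainedExecutorBackend, e.g.
# spark://CoarseGrainedScheduler@<hostname>:<port>
DRIVER_URL = re.compile(
    r"spark://(?P<endpoint>[^@\s]+)@(?P<host>[^:\s]+):(?P<port>\d+)"
)


def parse_driver_endpoint(log_line):
    """Return (endpoint, host, port) from a 'Connecting to driver' line, or None."""
    match = DRIVER_URL.search(log_line)
    if match is None:
        return None
    return match.group("endpoint"), match.group("host"), int(match.group("port"))


# Hypothetical sample line: the host and port are made up for illustration.
sample = (
    "INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: "
    "spark://CoarseGrainedScheduler@driver-host.example.com:35000"
)
print(parse_driver_endpoint(sample))
```

The endpoint name in the URL is the same name that appears in the “Could not find CoarseGrainedScheduler” message, which is why the two log lines can be correlated.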
That covers CoarseGrainedScheduler & CoarseGrainedSchedulerBackend.
Why are we seeing the “Could not find CoarseGrainedScheduler” error?
As explained above, whenever an executor starts, CoarseGrainedExecutorBackend connects to CoarseGrainedScheduler, and the connection stays open until the end of the job.
In the event of any problems with the Spark Driver, such as an unexpected exit, the connection between the executor and the driver may not be properly closed. Consequently, the executor continues to attempt to connect with the driver (CoarseGrainedExecutorBackend -> CoarseGrainedScheduler), leading to the occurrence of the following error message.
ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
    at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154)
    at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:134)
    at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:570)
    at org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:180)
    at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
    at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
    at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
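The mechanism behind this stack trace can be illustrated with a toy model. The sketch below is not Spark's implementation: `ToyDispatcher`, `EndpointNotFound`, and the message strings are hypothetical names chosen for illustration. It only shows the general idea that messages are routed by endpoint name, so a message that arrives after the endpoint is gone has nowhere to go.

```python
class EndpointNotFound(Exception):
    """Raised when a message targets an endpoint name that is not registered."""


class ToyDispatcher:
    """Simplified stand-in for an RPC dispatcher that routes messages by name."""

    def __init__(self):
        self._endpoints = {}

    def register(self, name, handler):
        self._endpoints[name] = handler

    def unregister(self, name):
        self._endpoints.pop(name, None)

    def post_one_way_message(self, name, message):
        handler = self._endpoints.get(name)
        if handler is None:
            # Same shape as the message seen in the Spark logs.
            raise EndpointNotFound(f"Could not find {name}.")
        handler(message)


received = []
dispatcher = ToyDispatcher()

# Driver start-up: the scheduler backend registers its endpoint.
dispatcher.register("CoarseGrainedScheduler", received.append)

# An executor reports in while the driver endpoint is available.
dispatcher.post_one_way_message("CoarseGrainedScheduler", "RegisterExecutor")

# The driver endpoint goes away (for example, the driver has failed)...
dispatcher.unregister("CoarseGrainedScheduler")

# ...but an executor still sends a message to it.
try:
    dispatcher.post_one_way_message("CoarseGrainedScheduler", "StatusUpdate")
    error = None
except EndpointNotFound as exc:
    error = str(exc)

print(received)  # messages delivered while the endpoint existed
print(error)     # the lookup failure for the late message
```

In this model the late message is the symptom and the disappearance of the endpoint is the cause, which mirrors how the Spark error should be read.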
Therefore, the error message “Could not find CoarseGrainedScheduler” is not the actual issue itself. To address the problem, we need to find out why the driver failed unexpectedly, or why the Driver <-> Executor connection broke, and resolve that underlying cause.
For that, we need to download the driver and executor logs and check them.
If you are running the Spark job in client mode, the driver logs are printed in the console output. In cluster mode, the driver logs are recorded along with the Application Master container logs.
Command to collect the YARN application logs
yarn logs -applicationId application_1576830357464_0003 -appOwner <AppOwner>
Here, the application ID is the ID of the corresponding application, and AppOwner is the name of the user who submitted the job.
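If logs need to be collected for several applications, the command above can be wrapped in a small script. This is a minimal sketch, assuming the `yarn` CLI is on the PATH of the machine running it. The helper names `yarn_logs_command` and `collect_yarn_logs` and the owner `etl_user` are hypothetical; only the `-applicationId` and `-appOwner` options come from the command above.

```python
import subprocess


def yarn_logs_command(application_id, app_owner=None):
    """Build the argument list for `yarn logs` for one application."""
    cmd = ["yarn", "logs", "-applicationId", application_id]
    if app_owner:
        cmd += ["-appOwner", app_owner]
    return cmd


def collect_yarn_logs(application_id, output_path, app_owner=None):
    """Run `yarn logs` and save its output to output_path.

    Requires the yarn CLI; raises CalledProcessError if the command fails.
    """
    with open(output_path, "w") as out:
        subprocess.run(
            yarn_logs_command(application_id, app_owner),
            stdout=out,
            check=True,
        )


# Only prints the command; call collect_yarn_logs(...) on a cluster edge node.
print(" ".join(yarn_logs_command("application_1576830357464_0003", "etl_user")))
```

Passing the arguments as a list avoids shell quoting problems when the application ID or user name comes from another script.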
In one of the failures I investigated, the driver had failed because of a disk issue, which triggered the “Could not find CoarseGrainedScheduler” error message:
WARN TaskSetManager:69 - Lost task 2342 in stage 342341 (TID 34234, hostname, executor 324): java.io.FileNotFoundException: /learn/yarn/local/usercache/learnerkb/application_12394857239845_34234/blockmgr/03/tmp_shuffle_323423423423423423 (Input/output error)
From the log above, we can see that a write to the local disk failed with an Input/output error (a disk-level issue), which caused the failure. After the disk issue was resolved, the CoarseGrainedScheduler error went away.
A network glitch between the driver and executor hosts produces similar error messages in the logs. Resolving the network issue should resolve the parent issue.
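Since the CoarseGrainedScheduler message is only a symptom, a practical first step is to look at the WARN and ERROR lines logged before it. The sketch below is one possible way to do that, not a definitive tool: `root_cause_candidates` is a hypothetical helper, and the sample lines are the log lines quoted in this article.

```python
import re

# Lines that belong to the symptom itself, taken from the error shown earlier.
SYMPTOM_MARKERS = (
    "Could not find CoarseGrainedScheduler",
    "Error while invoking RpcHandler#receive()",
)
LEVEL = re.compile(r"\b(WARN|ERROR)\b")


def root_cause_candidates(log_lines):
    """Return WARN/ERROR lines logged before the first symptom line."""
    candidates = []
    for line in log_lines:
        if any(marker in line for marker in SYMPTOM_MARKERS):
            break  # everything after this point is the symptom, not the cause
        if LEVEL.search(line):
            candidates.append(line)
    return candidates


sample_log = [
    "INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver",
    "WARN TaskSetManager:69 - Lost task 2342 in stage 342341 "
    "(TID 34234, hostname, executor 324): java.io.FileNotFoundException: "
    "/learn/yarn/local/usercache/learnerkb/application_12394857239845_34234"
    "/blockmgr/03/tmp_shuffle_323423423423423423 (Input/output error)",
    "ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() "
    "for one-way message.",
    "org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.",
]

candidates = root_cause_candidates(sample_log)
for line in candidates:
    print(line)
```

The output of `yarn logs` is grouped per container rather than ordered by time across containers, so a filter like this is best applied to one container's log at a time.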
NOTE: This error message does not always mean the job will fail, but it can lead to job failure at some point (for example, if the issue persists and executors keep failing until the failure threshold is reached).
The “Could not find CoarseGrainedScheduler” error by itself is harmless to the actual execution of Spark jobs. It indicates a problem with Driver <-> Executor communication, and fixing that underlying problem is what resolves the job failure.
There is an existing JIRA, SPARK-22769, to change the log level from ERROR to WARN, since these messages are a little confusing at first look.