YARN application stuck in the ACCEPTED state (includes Spark, Hive, Tez, and MapReduce jobs)


A YARN application usually gets stuck in the ACCEPTED state when YARN cannot find enough resources in the cluster to create a new container and schedule a task.

We usually face this issue in the following scenarios:

  • When the total cluster resource or queue resource is exhausted
  • When the Application Master container creation threshold reaches its max

When the total cluster resource or queue resource is exhausted

  • Your total cluster capacity, or the particular queue you are submitting your job to, is being used at its maximum capacity
  • YARN accepts your job submission request, but because it cannot satisfy the memory/core requirement (due to resource constraints), it keeps the job in the ACCEPTED state until it is able to allocate the requested resources

How to find the resource usage:

Check the YARN ResourceManager web UI -> Scheduler page

– At the top of this page, you can easily identify the total cluster capacity (memory and cores) and the current usage

– If you are seeing usage of more than 90%, you are in a resource crunch
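The same numbers can also be read from the ResourceManager REST API instead of the web UI. Below is a minimal sketch in Python, assuming the RM REST endpoint `/ws/v1/cluster/metrics` is reachable; the RM address you pass in and the helper names (`fetch_cluster_metrics`, `usage_percent`, `is_resource_crunch`) are our own choices for illustration, and the 90% threshold is simply the rule of thumb mentioned above.

```python
import json
from urllib.request import urlopen


def fetch_cluster_metrics(rm_url):
    """Fetch the clusterMetrics object from the ResourceManager REST API.

    rm_url is an assumption, e.g. "http://<rm-host>:8088" on a default setup.
    """
    with urlopen(f"{rm_url}/ws/v1/cluster/metrics", timeout=10) as resp:
        return json.load(resp)["clusterMetrics"]


def usage_percent(metrics):
    """Return (memory %, vcore %) currently allocated in the cluster."""
    total_mb = metrics["totalMB"]
    total_cores = metrics["totalVirtualCores"]
    mem = 100.0 * metrics["allocatedMB"] / total_mb if total_mb else 0.0
    cores = 100.0 * metrics["allocatedVirtualCores"] / total_cores if total_cores else 0.0
    return mem, cores


def is_resource_crunch(metrics, threshold=90.0):
    """True when memory or vcore usage is above the threshold (in percent)."""
    mem, cores = usage_percent(metrics)
    return mem > threshold or cores > threshold
```

For example, `is_resource_crunch(fetch_cluster_metrics(rm_url))` would tell you whether the cluster as a whole is above the threshold at that moment.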

Example Screenshot:

Here, we can see that the total memory is 44 GB and the total vcores is 16. The cluster is currently idle, so there is no resource utilization.

Total Cluster capacity
  • Check the queue-level usage as well: a specific queue may be fully occupied and unable to allocate resources for new jobs submitted to it, even when the cluster as a whole has free capacity
  • This can be identified on the RM -> Scheduler page, where you can see the queue-level max resources and usage
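Queue-level usage can be scripted as well. The sketch below assumes the Capacity Scheduler and works on the `scheduler.schedulerInfo` object returned by the RM REST endpoint `/ws/v1/cluster/scheduler`, where queues are nested under `queues.queue` and each queue reports `queueName` and `usedCapacity` (a percentage). The Fair Scheduler returns a different layout, so treat the field names here as an assumption to verify against your own cluster's response; the function names are ours.

```python
def iter_queues(queue_list):
    """Yield every queue in a (possibly nested) Capacity Scheduler queue list."""
    for queue in queue_list:
        yield queue
        # Parent queues carry their children under queues.queue; leaf queues do not.
        children = (queue.get("queues") or {}).get("queue", [])
        yield from iter_queues(children)


def full_queues(scheduler_info, threshold=90.0):
    """Return (queueName, usedCapacity) for queues at or above the threshold."""
    top_level = (scheduler_info.get("queues") or {}).get("queue", [])
    return [
        (q["queueName"], q["usedCapacity"])
        for q in iter_queues(top_level)
        if q.get("usedCapacity", 0.0) >= threshold
    ]
```

Any queue this returns is a candidate explanation for jobs in that queue sitting in ACCEPTED.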

Example Screenshot:

In the screenshot below, we can easily see the used and max resources

Used and max resources

Resolution:

  • If your resources are being used at their max, make sure there is no rogue job occupying all the resources and causing the resource crunch
  • If the problem is specific to one queue, we can move the job to a different queue as an immediate solution
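To spot a rogue job quickly, we can rank the running applications by the resources they currently hold. This is a small sketch that works on the `apps.app` list returned by the RM REST endpoint `/ws/v1/cluster/apps?states=RUNNING`, using its `id`, `user`, `queue`, `allocatedMB` and `allocatedVCores` fields; the helper names are our own. Moving an application is typically done with `yarn application -movetoqueue <application id> -queue <queue name>`, although the exact option can differ between Hadoop versions.

```python
def top_consumers(apps, n=5):
    """Return the n running applications holding the most memory."""
    return sorted(apps, key=lambda app: app.get("allocatedMB", 0), reverse=True)[:n]


def describe(app):
    """One-line summary of an application entry from the RM apps API."""
    return (
        f'{app["id"]} user={app["user"]} queue={app["queue"]} '
        f'mem={app.get("allocatedMB", 0)}MB vcores={app.get("allocatedVCores", 0)}'
    )
```

Printing `describe(app)` for each entry of `top_consumers(apps)` gives a short list of the applications worth looking at first.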

When the Application Master container creation threshold reaches its max

In YARN, each queue has a limit on the resources that AM containers can use. This is to make sure we do not get into a deadlock situation

In this case, you will see the following message in the application logs

waiting for AM container to be allocated

This can be checked on the RM -> Scheduler page

There, we need to compare the AM max resource with the current AM used resource. If the AM used resource is at its max, YARN will not allow you to schedule another AM container in the same queue, which results in your job staying in the ACCEPTED state
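The check above boils down to simple arithmetic. Here is a simplified sketch of it, assuming the AM limit is just the queue capacity multiplied by the AM max share; real schedulers apply additional rules on top of this, and all names and numbers below are hypothetical.

```python
def am_limit_mb(queue_capacity_mb, max_am_share):
    """Memory (MB) that AM containers may use in the queue (simplified)."""
    return int(queue_capacity_mb * max_am_share)


def can_start_am(queue_capacity_mb, max_am_share, am_used_mb, new_am_mb):
    """True if one more AM container of new_am_mb fits under the AM limit."""
    return am_used_mb + new_am_mb <= am_limit_mb(queue_capacity_mb, max_am_share)


# Hypothetical queue: 10 GB capacity, AM max share of 0.5 -> 5 GB for AMs.
print(can_start_am(10240, 0.5, am_used_mb=4096, new_am_mb=1024))  # fits
print(can_start_am(10240, 0.5, am_used_mb=5120, new_am_mb=1024))  # stays ACCEPTED
```

In the second call the AM used resource is already at the limit, so the new application would wait in the ACCEPTED state even if the queue still has free memory for ordinary containers.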

AM max resource & Current AM Used resource

Resolution:

– We can check whether many small applications are occupying the AM resources

– Otherwise, we can increase the AM max resource to a higher value

In Cloudera, we can use the dynamic resource pool configuration and edit the pool config to increase the AM max share, as below

AM max resource

You might be wondering why we need to restrict AM container creation in the first place.

AM max share

The restriction is there to make sure you do not end up in a deadlock situation, where your job stays in the ACCEPTED state forever

Example:

Let’s say you have 100 GB of memory and 100 cores in your cluster, and only one queue with 100% of the cluster capacity

  • What happens when you submit 100 Oozie jobs?
  • Oozie creates 1 AM (Application Master) container for itself (the launcher) and 1 AM container for each action it is going to perform. Let’s assume each container needs 1 GB of memory and 1 core
  • In this case, 100 Oozie launcher AMs are created, and together they use up the entire cluster capacity
  • Now nothing can move forward: all the jobs stay in the same state forever, because Oozie cannot start a new AM container for any of its actions
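The example above can be sketched as a tiny model. It is deliberately simplified: we assume every container needs 1 GB, that only the launcher containers are counted against the AM limit, and that each launcher needs one more container to run its action. The function name and parameters are hypothetical; this is not how the scheduler is implemented, only an illustration of why a cap helps.

```python
def simulate(capacity_gb, jobs, max_am_share=1.0, container_gb=1):
    """Return (launchers started, actions that can start) in the toy model."""
    am_limit_gb = int(capacity_gb * max_am_share)
    # Launchers are admitted until either the jobs or the AM limit run out.
    launchers = min(jobs, am_limit_gb // container_gb)
    # Whatever is left over is available for the actions those launchers start.
    free_gb = capacity_gb - launchers * container_gb
    actions = min(launchers, free_gb // container_gb)
    return launchers, actions


print(simulate(100, 100))                    # -> (100, 0): no action can ever start
print(simulate(100, 100, max_am_share=0.5))  # -> (50, 50): every started launcher can run its action
```

Without a cap, the launchers take everything and no action can run, so no launcher ever finishes. With the cap at 0.5, half of the jobs wait in ACCEPTED, but the ones that started can complete and release resources for the rest.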

Start a discussion if you have any questions or know a better way to resolve the issue.

Hope this article is useful for you. Good luck with your learning!
