Resolve the “Container killed by YARN for exceeding memory limits” Issue in Hive, Tez, and Spark Jobs

“Container killed by YARN for exceeding memory limits” usually happens when the JVM memory usage goes beyond the YARN container memory limit.
Explanation
If you set the memory for a Spark executor container to 4 GB and the executor process running inside the container tries to use more than the allocated 4 GB, YARN will kill the container. This is a precautionary measure by YARN to make sure no single process takes more resources than it was given and causes a resource crunch. When this happens, YARN logs a message like the following:
Current usage: 1.0 GB of 1 GB physical memory used; 2.7 GB of 2.1 GB virtual memory used. Killing container.
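A side note on the “2.1 GB virtual memory” part of that message: the virtual memory limit is derived from the physical container size using the NodeManager setting yarn.nodemanager.vmem-pmem-ratio. Assuming the stock default ratio of 2.1 (an assumption, since the actual value is not shown in the log), the limit for a 1 GB container works out as:
virtual memory limit = 1 GB (physical container size) x 2.1 (yarn.nodemanager.vmem-pmem-ratio, default) = 2.1 GB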
Symptom
The job will fail with “Container killed by YARN for exceeding memory limits”. It could be any kind of job, such as MapReduce, Spark, Hive, or Hive-On-Tez.
You will see exceptions like the ones below in the application logs.
Hive:
INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
Diagnostics report from attempt_143534545934213_145321_m_002565_0:
Container [pid=66028,containerID=container_e54_143534545934213_145321_01_003666] is running beyond physical memory limits.
Current usage: 1.0 GB of 1 GB physical memory used; 2.7 GB of 2.1 GB virtual memory used. Killing container.
Hive-On-Tez:
ERROR : Vertex failed, vertexName=Map 1, vertexId=vertex_1644989390691_1174002_1_00, diagnostics=[Task failed, taskId=task_143534545934213_145321_1_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Container container_e26_143534545934213_145321_01_000002 finished with diagnostics set to [Container failed, exitCode=-104. [2022-09-14 00:01:30.085]Container [pid=5846,containerID=container_e26_143534545934213_145321_01_000002] is running 588910240B beyond the 'PHYSICAL' memory limit. Current usage: 6.5 GB of 6 GB physical memory used; 12.9 GB of 12.6 GB virtual memory used. Killing container.
Spark:
WARN scheduler.TaskSetManager: Lost task 13345.0 in stage 20.2 (TID 182591, <hostname>, executor 6): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 11.0 GB of 11 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Root Cause:
All of the above jobs use YARN for scheduling and resource management, and their tasks run inside YARN containers. Whenever a task/map/reduce grows beyond the size of its container, YARN kills the task with the above messages in the logs. This is a precautionary measure from YARN to ensure that no rogue job consumes the cluster's entire YARN resources.
Resolution:
Basically, the immediate fix is to increase the YARN container size so that the task fits into the container space. But first, we need to find where the limit is being exceeded.
For example:
Hive uses the MapReduce framework, so we need to check at which stage (map or reduce) the memory is getting out of control; for Spark, it is just the executor or the AM container.
To get this information, check the application logs and identify in which container and on which task the limit is being exceeded.
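For example, once the job has finished, one quick way to pull these diagnostics is the YARN CLI (a sketch assuming log aggregation is enabled; replace <application_id> with your own application ID):
# Fetch the aggregated application logs and filter for the memory-limit diagnostics
yarn logs -applicationId <application_id> | grep -i "beyond"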
For Hive
From the above Hive message, it is clear that the map task is going beyond the container memory: the attempt ID “attempt_143534545934213_145321_m_002565_0” contains an “m”, which stands for a mapper task.
In this case, we need to increase the mapper’s memory:
-- This sets a YARN container size of 8 GB for each mapper
-- We recommend keeping the Java heap (-Xmx) at roughly 80% of the container size
SET mapreduce.map.memory.mb=8192;
SET mapreduce.map.java.opts=-Xmx6554M;
If you see this issue in a reducer task instead (the task ID will contain the letter “r”), increase the reducer memory:
SET mapreduce.reduce.memory.mb=5120;
SET mapreduce.reduce.java.opts=-Xmx4096M;
For Hive-On-Tez
We can just increase the Tez container size:
SET hive.tez.container.size=8192;
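If you prefer to control the task JVM heap explicitly instead of relying on the defaults, the same ~80% guideline can be applied through hive.tez.java.opts (a sketch with an assumed value for an 8 GB container):
-- Keep the Tez task JVM heap at roughly 80% of the 8 GB container set above
SET hive.tez.java.opts=-Xmx6554M;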
For Spark
We need to tune the executor memory and the executor memory overhead so that the job can accommodate all of its memory requirements.
Add the below properties and tune them based on your needs:
--executor-memory 8G --conf spark.executor.memoryOverhead=1g
By default, spark.executor.memoryOverhead is 10% of the executor memory (with a minimum of 384 MB); YARN allocates it on top of the executor memory as part of the container. Alternatively, we can set the overhead explicitly using the above property in the spark-submit command:
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client --num-executors 10 --executor-memory 8G --driver-memory 8G --conf spark.executor.memoryOverhead=1g /opt/cloudera/parcels/CDH/jars/spark-examples*.jar 10
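As a rough back-of-the-envelope check of what YARN reserves per executor with the command above (a sketch; the exact figure gets rounded up to the scheduler's minimum allocation increment):
executor container size ≈ executor memory + memory overhead
                        = 8 GB + 1 GB (explicit spark.executor.memoryOverhead) = 9 GB per executor
(with the default overhead instead: 8 GB + max(384 MB, 10% of 8 GB) ≈ 8.8 GB)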
Points to be NOTED:
– Setting these memory configs depends on the cluster capacity and the job load; keep adjusting them until the issue no longer appears, to arrive at an optimal value (see the note after these points for the cluster-side cap on container size).
– If you are still unable to resolve the issue by increasing memory, for example if the task keeps demanding ever-higher values, then we need to optimize the code to make sure it is not using excessive memory.
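On the cluster-capacity point above: YARN will never grant a container larger than the scheduler's maximum allocation, so the sizes set earlier must stay within that cap. One quick way to check it (a sketch assuming a typical config path; adjust for your distribution):
# Show the maximum container size the scheduler will grant
grep -A1 "yarn.scheduler.maximum-allocation-mb" /etc/hadoop/conf/yarn-site.xml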
Check here for more troubleshooting guides on Spark.
Good luck with your learning. Hit the like button if this blog is helpful.