Resolve the “container exited with a non-zero exit code 143” issue in Spark, Hive, and Tez

The term “container exited” refers to a state where a container has stopped running and terminated its processes; this includes a normal, healthy shutdown. In this article, we will look specifically at why a container exits with “exit code 143” and what causes it.
“Exit code 143” means the process received a SIGTERM signal and was terminated. In Hadoop (Spark, Hive, and Tez), it usually means the process was terminated because it tried to use more memory than the container memory YARN allocated to it, or because the container was killed manually with a SIGTERM signal.
What is SIGTERM?
SIGTERM is the signal the Linux operating system uses to terminate a process in the most graceful way possible: it gives the process a chance to prepare for shutdown, for example by cleaning up resources or backing up state.
The process can also handle or ignore the signal and effectively refuse to shut down for now, for example because a critical operation is still in progress.
kill <Process ID>
When a process is terminated by SIGTERM, it exits with exit code 143 (128 + 15, where 15 is the signal number of SIGTERM).
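To see the graceful part in action, here is a small, hypothetical shell sketch (not part of the replication steps later in this article): a script that traps SIGTERM, performs its cleanup, and then exits, which is what a well-behaved process can do when YARN or a user sends it a kill.
#!/bin/bash
# Hypothetical demo: handle SIGTERM so the process can clean up before exiting.
cleanup() {
  echo "SIGTERM received, cleaning up temporary files..."
  rm -f /tmp/myjob.tmp          # illustrative cleanup step
  exit 143                      # conventional exit code after SIGTERM (128 + 15)
}
trap cleanup TERM
echo "working... send SIGTERM with: kill $$"
sleep 300 &                     # keep the "work" in the background so the trap fires promptly
wait $!
Running the script and issuing kill <PID> from another terminal prints the cleanup message and exits with code 143.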
Let’s also discuss the other signal the operating system uses to kill a process: SIGKILL.
What is SIGKILL?
SIGKILL terminates a process in the most forceful way, which means it does not allow the process to prepare for shutdown.
Once the process receives the signal, it is killed regardless of whatever operation it is currently carrying out (so the process cannot refuse to shut down).
kill -9 <process id>
When a process is killed by SIGKILL, it exits with exit code 137 (128 + 9, where 9 is the signal number of SIGKILL).
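As a quick, hypothetical way to see both exit codes side by side on any Linux shell, the snippet below terminates one background process with SIGTERM and another with SIGKILL, then prints the exit code the shell reports (128 plus the signal number):
# SIGTERM (signal 15) -> exit code 143
sleep 300 & pid=$!
kill "$pid"; wait "$pid"; echo "SIGTERM exit code: $?"
# SIGKILL (signal 9) -> exit code 137
sleep 300 & pid=$!
kill -9 "$pid"; wait "$pid"; echo "SIGKILL exit code: $?"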
Replicating exit code 143
We can replicate the issue with the kill command described above for any job (Spark, Hive, or Tez). Here, I am using a Spark job to demonstrate the behavior.
Run the sample Spark job below in a terminal:
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client /opt/cloudera/parcels/CDH/jars/spark-examples*.jar 10000
While this job is running, log in to any one of the NodeManager nodes and grep for the application ID as shown below:
ps -ef | grep application_1674621332782_0004
testuser 97618 40419 0 08:55 ? 00:00:00 /var/lib/yarn-ce/bin/container-executor systest systest 1 application_1674621332782_0004 container_e06_1674621332782_0004_01_000001 /data/yarn/nm/usercache/
Kill the above container process with the command below:
kill 97618
Running the above command sends a SIGTERM signal to process 97618, and the container then exits with exit code 143, as shown in the application logs below:
yarn logs -applicationId application_1674621332782_0004
23/01/25 08:56:48 WARN yarn.YarnAllocator: Container marked as failed: container_e06_1674621332782_0004_01_000009 on host: <Hostname>. Exit status: 143. Diagnostics: [2023-01-25 08:56:47.826]Container killed on request. Exit code is 143
[2023-01-25 08:56:47.829]Container exited with a non-zero exit code 143.
[2023-01-25 08:56:47.829]Killed by external signal
The replication above was done by manually killing the container process.
Killed by YARN (Spark, Hive, Tez)
There are scenarios where YARN itself sends a SIGTERM signal to the container process, most commonly when the container’s memory usage exceeds its allocated capacity.
When you see “exit code 143” in the logs, it is best to check all of the following logs:
- NodeManager logs
- ResourceManager logs
- YARN application logs
This way, you can confirm whether the error message below appears and conclude whether or not the failure is due to a memory issue:
ExecutorLostFailure Reason: YARN Killed the running container
In this case, we can see that the JVM used more memory than the allocated capacity. This applies to all the components (Spark, Hive, and Tez), since they all use YARN as the resource manager.
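One practical way to narrow this down is to pull the aggregated application logs and search them. This is only a sketch: the first grep pattern matches the message quoted earlier in this article, and the second matches the NodeManager diagnostic typically produced when a container exceeds its memory allocation (the exact wording can vary across Hadoop versions).
yarn logs -applicationId application_1674621332782_0004 > app.log
grep -i "exit code is 143" app.log
grep -i "beyond physical memory limits" app.log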
Resolution for the memory issue
A quick fix is to increase the container memory. If you see an ERROR message like the one below, where YARN killed the container because it used more memory than it was allocated, you can proceed with a memory tuning exercise:
Container killed by the ApplicationMaster. Container killed on request. Exit code is 143 Container exited with a non-zero exit code 143
Spark:
For Spark, increase the executor memory so that the executors can handle the incoming data:
--executor-memory 8G
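As a sketch of how the flag fits into a full submission (the 8G value and the optional spark.executor.memoryOverhead setting are illustrative assumptions; size them to your cluster and data volume):
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode client \
  --executor-memory 8G \
  --conf spark.executor.memoryOverhead=1g \
  /opt/cloudera/parcels/CDH/jars/spark-examples*.jar 10000
If it is the driver process that is being killed, --driver-memory can be raised in the same way.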
Hive/MR:
For Hive, which uses the MapReduce framework by default, check whether the failing attempt is in the mapper or the reducer stage and increase the corresponding memory settings.
Mapper:
SET mapreduce.map.memory.mb=8192;
SET mapreduce.map.java.opts=-Xmx7680M;
Reducer:
SET mapreduce.reduce.memory.mb=5120;
SET mapreduce.reduce.java.opts=-Xmx4096M;
Tez:
For Hive on Tez jobs, increase the Tez container size as shown below:
SET hive.tez.container.size=8192;
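Optionally, and as an assumption not covered above, the Tez task JVM heap can be raised in proportion to the container size; a common rule of thumb is to keep the -Xmx value at roughly 80% of hive.tez.container.size:
SET hive.tez.java.opts=-Xmx6554M;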
NOTE:
Setting the above values as-is can help in some cases, but make sure to tune the memory settings based on your cluster capacity and workload.
Not Related to Memory
If you do not see any memory-related exception in the NodeManager or application logs, it is worth investigating whether a manual SIGTERM (kill) was issued by another application or a user:
kill <container PID>
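If a manual kill is suspected, one place to look is the application, NodeManager, and ResourceManager logs for the affected IDs; the “Killed by external signal” diagnostic shown earlier in this article is the typical sign of it. This is a sketch, and the log directory below is an assumption that varies by distribution:
yarn logs -applicationId application_1674621332782_0004 | grep -i "Killed by external signal"
grep container_e06_1674621332782_0004_01_000009 /var/log/hadoop-yarn/*.log*   # on the NodeManager host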
Conclusion
It is essential to examine the patterns across the different components’ logs to determine whether the failure is a memory issue or the result of manual intervention.
Good luck with your learning!