Resolve the “container exited with a non-zero exit code 143” issue in Spark, Hive, and Tez

container exited with a non-zero exit code 143

The term “container exited” refers to a state where a container has stopped running and terminated its processes; it includes a normal, healthy shutdown. In this post, we will look specifically at why a container exits with “exit code 143” and what causes it.

“Exit code 143” means the process received a SIGTERM signal, which indicates it was terminated. In Hadoop (Spark, Hive, and Tez), it usually means the process was terminated by YARN because it tried to use more memory than the container was allocated, or that the container was killed manually with a SIGTERM signal.

What is SIGTERM?

It is a signal used by the Linux operating system to kill a process in the most graceful way possible, which means it gives the process a chance to prepare for shutdown, for example by performing cleanup or backup.

The process can even reply to the OS that it cannot shut down at the moment because of a critical ongoing operation.

kill <Process ID>

With SIGTERM, once the process is killed, it exits with exit code 143 (128 + 15, where 15 is the signal number of SIGTERM).
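
As a minimal local sketch (plain bash, unrelated to YARN), you can see the 143 exit code by terminating a background process with SIGTERM:

sleep 300 &        # start a long-running background process
kill $!            # send SIGTERM (signal 15) to it
wait $!            # collect its exit status
echo $?            # prints 143, i.e. 128 + 15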

Let’s also discuss the other signal the operating system uses to kill a process: SIGKILL.

What is SIGKILL?

SIGKILL kills a process in the most forceful way, which means it does not allow the process to prepare for shutdown.

Once the process receives this signal, it is killed regardless of whatever operation it is currently carrying out (so the process cannot refuse to shut down).

kill -9 <process id>

With SIGKILL, once the process is killed, it exits with exit code 137 (128 + 9, where 9 is the signal number of SIGKILL).
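
The same local sketch with SIGKILL shows the 137 exit code:

sleep 300 &        # start a long-running background process
kill -9 $!         # send SIGKILL (signal 9); it cannot be caught or ignored
wait $!            # collect its exit status
echo $?            # prints 137, i.e. 128 + 9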

Replicate Exit code 143

We can replicate the issue with the kill command for any job (Spark, Hive, or Tez). Here I use a Spark job to demonstrate the behavior.

Run the sample Spark job below in a terminal:

spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client /opt/cloudera/parcels/CDH/jars/spark-examples*.jar 10000

While this job is running, log in to any one of the Node Manager nodes and grep for the application id as below:

ps -ef | grep application_1674621332782_0004

testuser    97618   40419  0 08:55 ?        00:00:00 /var/lib/yarn-ce/bin/container-executor systest systest 1 application_1674621332782_0004 container_e06_1674621332782_0003_01_000001 /data/yarn/nm/usercache/

Kill the above container process with the command below:

kill 97618

Running the above command sends a SIGTERM signal to process 97618, and the process then exits with exit code 143, as shown in the application logs below:

yarn logs -applicationId application_1674621332782_0004

23/01/25 08:56:48 WARN yarn.YarnAllocator: Container marked as failed: container_e06_1674621332782_0004_01_000009 on host: <Hostname>. Exit status: 143. Diagnostics: [2023-01-25 08:56:47.826]Container killed on request. Exit code is 143

[2023-01-25 08:56:47.829]Container exited with a non-zero exit code 143.

[2023-01-25 08:56:47.829]Killed by external signal
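
For a busy application the aggregated log can be large; one way to narrow it down (a simple sketch using grep) is:

yarn logs -applicationId application_1674621332782_0004 | grep -i "exit code"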

The replication above was done by manually killing the container process.

Killed by YARN (Spark, Hive, Tez)

There are scenarios where YARN itself sends a SIGTERM signal to the container process, most commonly when the container’s memory usage exceeds its allocated capacity.

When you see “exit code 143” in the logs, it is best to validate all of the logs below:

  • Node Manager Logs
  • Resource Manager Logs
  • Yarn Application Logs

Checking them lets you determine whether the error message below is present, so you can conclude whether or not this is a memory issue.

ExecutorLostFailure Reason: YARN Killed the running container

In that case, the JVM is using more memory than the allocated capacity. This applies to all components (Spark, Hive, and Tez), since they all use YARN as the resource manager.
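
For example, one quick check (the log path below is an assumption; adjust it to wherever your distribution writes Node Manager logs) is to grep the Node Manager log on the failed host for the container id and memory-related messages:

grep "container_e06_1674621332782_0004_01_000009" /var/log/hadoop-yarn/*.log* | grep -i "memory"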

Resolution for Memory issue

A quick fix is to increase the container memory

If you see an ERROR message related to memory, where YARN killed the container because it used more memory than allocated, then you can proceed with a memory tuning exercise.

Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

Spark:

For Spark, increase the executor memory to make sure the executors can handle the incoming data:

--executor-memory 8G
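
For example, re-running the earlier SparkPi job with a larger executor heap could look like the sketch below; the 8G and 1024 values are only illustrative, and spark.executor.memoryOverhead covers the off-heap overhead YARN accounts for on top of the heap:

spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn --deploy-mode client \
  --executor-memory 8G \
  --conf spark.executor.memoryOverhead=1024 \
  /opt/cloudera/parcels/CDH/jars/spark-examples*.jar 10000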

Hive/MR:

For Hive, which by default uses the MapReduce framework, check whether the failing attempt is in the mapper or the reducer stage and increase the memory accordingly.

Mapper:

SET mapreduce.map.memory.mb=8192;
SET mapreduce.map.java.opts=-Xmx7680M;

Reducer:

SET mapreduce.reduce.memory.mb=5120;
SET mapreduce.reduce.java.opts=-Xmx4096M;

Tez:

For Hive on Tez jobs, increase the Tez container size as below:

SET hive.tez.container.size=8192;
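
It also helps to keep the Tez task heap below the container size (commonly around 80% of it); assuming your Hive version exposes hive.tez.java.opts, a sketch would be:

SET hive.tez.java.opts=-Xmx6553M;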

NOTE:

Applying the values above as-is can help in some cases, but make sure to tune the memory settings based on your cluster’s capacity and load.

Not Related to Memory

If you do not see any memory-related exception in the Node Manager or application logs, it is worth investigating whether a manual SIGTERM signal (kill) was issued by an application or a user:

kill <container PID>
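
To confirm whether the application was killed externally, one option (using the standard YARN CLI) is to check the application’s final state and diagnostics:

yarn application -status application_1674621332782_0004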

Conclusion

It is essential to review the patterns across the different components’ logs to determine whether the failure is a memory issue or manual intervention.

Good luck with your learning!
