The error “IPC’s epoch X is less than the last promised epoch Y” appears in both the NameNode and JournalNode logs.
The “IPC’s epoch X is less than the last promised epoch Y” error is default behavior in HDFS, designed to avoid a split-brain scenario between two NameNodes. A JournalNode only accepts writes from the NameNode with the latest epoch; if any other NameNode tries to flush edits, the request is rejected with this error message.
In this article, we will look at the function of the JournalNode and why the “IPC’s epoch X is less than the last promised epoch Y” error occurs.
Role of Journal Node
NameNode high availability is achieved using JournalNodes, ZooKeeper, and the ZooKeeper Failover Controller (ZKFC).
ZKFC makes sure that only one NameNode is active at a time. However, when a failover occurs, it is still possible for the previous active NameNode to keep serving read requests to clients.
The role of the JournalNodes is to keep the NameNodes in sync and prevent split brain through a mechanism called fencing. To implement fencing, JournalNodes use epoch numbers.
Epoch numbers are integers that only ever increase and are unique once assigned. The NameNode generates its epoch number using a simple algorithm and includes it in every RPC request it sends to the JournalNodes. When NameNode HA is first configured, the active NameNode gets epoch value 1. On each failover or restart, the epoch number increases by 1.
The NameNode with the higher epoch value is considered current and is allowed to flush edits to the JournalNodes. If a request comes from a NameNode with an older epoch number, the JournalNode simply rejects it.
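The fencing rule above can be sketched in a few lines. This is only an illustrative model, not Hadoop’s actual implementation (the real logic lives in `org.apache.hadoop.hdfs.qjournal.server.Journal`); the class and method names here are invented for the sketch.

```python
# Illustrative sketch of the epoch-based fencing a JournalNode applies.
# Names are hypothetical; only the rule itself mirrors HDFS behavior.

class JournalNodeSketch:
    def __init__(self):
        # Highest epoch this JournalNode has promised to honor so far.
        self.last_promised_epoch = 0

    def new_epoch(self, epoch):
        """A NameNode becoming active must claim a strictly higher epoch."""
        if epoch <= self.last_promised_epoch:
            raise IOError(f"Proposed epoch {epoch} <= last promised epoch "
                          f"{self.last_promised_epoch}")
        self.last_promised_epoch = epoch

    def journal(self, epoch, txns):
        """A NameNode flushing edits is fenced off if its epoch is stale."""
        if epoch < self.last_promised_epoch:
            raise IOError(f"IPC's epoch {epoch} is less than the last "
                          f"promised epoch {self.last_promised_epoch}")
        return f"accepted txns {txns}"

jn = JournalNodeSketch()
jn.new_epoch(58)            # old active NameNode
jn.new_epoch(59)            # failover: standby takes over with a higher epoch
jn.journal(59, "100-110")   # the new active NameNode writes fine
try:
    jn.journal(58, "100-110")   # the old active NameNode is rejected
except IOError as e:
    print(e)  # IPC's epoch 58 is less than the last promised epoch 59
```

Note that the check uses “promised”, not “written”: once a JournalNode has promised epoch 59 to the new active NameNode, it refuses epoch 58 even if no edits with epoch 59 have arrived yet.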
Let’s discuss why we see the error “IPC’s epoch X is less than the last promised epoch Y”.
JournalNodes are designed so that two NameNodes can never write edits at the same time. This is achieved by validating the NameNode epoch: only the NameNode with the latest epoch value is allowed to write.
If a NameNode with an old epoch value tries to flush edits to a JournalNode, the request is rejected with the error message “IPC’s epoch X is less than the last promised epoch Y”.
This can happen for various reasons in the cluster:
1.) An application that processes a huge number of files
2.) A NameNode memory issue
3.) A network issue
Application that processes a huge number of files
In this case, an application processing a very large number of files can cause the NameNode to hold the “write lock” for a long time.
NameNode logs:
INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: SnapshotDiffReport Total comparison dirs: 0/20391029, files: 0/20391029. Time snapChildrenListing: 0.206s, actual: 22.558s, total: 22.558s.
From the above log, we can see an application processing 20391029 files, which forces the NameNode to hold a write lock on the namespace while the operation runs.
NameNode logs:
INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem write lock held for 982839 ms via java.lang.Thread.getStackTrace(Thread.java:1564) org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032) org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:263) org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1604) org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:2969) org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:1093) org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:681)
While a NameNode holds the “write lock”, no other operation, not even a read, can proceed in the cluster. Because of this, the failover controller may conclude that the NameNode is unresponsive and fail over to the standby NameNode.
When the failover takes place, the epoch number increases by 1. When the old active NameNode finally completes its write operation and tries to write to the JournalNode, it is rejected with the error below:
WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Remote journal <ip>:8485 failed to write txns 3342342313-370234234. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 58 is less than the last promised epoch 59
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:458)
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:484)
at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:372)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:179)
This is a precautionary measure taken by the JournalNode to avoid split brain, through the fencing mechanism described above.
In this case, the JournalNode is behaving exactly as expected. Instead, we should investigate from the application side why it is processing such a huge amount of data, and whether the work can be done in chunks.
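The “process it in chunks” idea can be sketched as follows. This is a generic, illustrative pattern, not HDFS code: `operation` is a hypothetical stand-in for whatever namespace call the application makes (a delete, rename, permission change, and so on).

```python
# Illustrative sketch: instead of one operation over millions of paths
# (which can hold the NameNode write lock for a very long time), split
# the work into small batches so the lock is taken and released briefly
# many times, letting other RPCs and ZKFC health checks get through.

def chunked(items, size):
    """Yield successive fixed-size batches from a list of paths."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def process_in_batches(paths, operation, batch_size=1000):
    """Apply `operation` to paths batch by batch; returns the batch count."""
    count = 0
    for batch in chunked(paths, batch_size):
        operation(batch)  # each call holds the NameNode lock only briefly
        count += 1
    return count

# Example: 10 paths in batches of 4 -> 3 calls to `operation`
batches = []
n = process_in_batches([f"/data/part-{i}" for i in range(10)],
                       batches.append, batch_size=4)
print(n)  # 3
```

The right batch size depends on the workload; the point is simply that many short lock acquisitions are far friendlier to the NameNode than one very long one.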
Namenode Memory issue
We can also see the error “IPC’s epoch X is less than the last promised epoch Y” when the NameNode is affected by GC pauses.
A long GC pause on the server can cause ZooKeeper to lose its connection to the NameNode, which triggers a NameNode failover. If the old active NameNode then tries to flush edits to the JournalNode, it will hit the error “IPC’s epoch X is less than the last promised epoch Y”.
NameNode logs:
WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 72140ms GC pool 'ParNew' had collection(s): count=1 time=73ms GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=71208ms
WARN org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Skipping next heartbeat scan due to excessive pause
WARN org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 80245ms for sessionid 0x1013423edd84847
ERROR org.apache.hadoop.ha.ZKFailoverController: Couldn't transition NameNode at <hostname>:8022 to standby state
INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 8485, call Call#1223437669 Retry#0 org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocol.journal from <ip>:43925 java.io.IOException: IPC's epoch 49 is less than the last promised epoch 50
If the NameNode shows frequent GC activity or very long full GC pauses, we can raise its heap size.
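On a plain Apache Hadoop install, the NameNode heap is typically set through `HADOOP_NAMENODE_OPTS` in `hadoop-env.sh`; managed distributions (Cloudera Manager, Ambari) expose the same setting in their UI instead. The 16 GB value below is only an example; size the heap from your namespace (number of files and blocks), and keep `-Xms` equal to `-Xmx` so the heap does not resize under load.

```shell
# hadoop-env.sh — example values only; size to your namespace.
# On CDH/Ambari clusters, set "Java Heap Size of NameNode" in the UI instead.
export HADOOP_NAMENODE_OPTS="-Xms16g -Xmx16g ${HADOOP_NAMENODE_OPTS}"
```

After changing the heap, restart the NameNode (standby first, then failover) so both HA NameNodes pick up the new setting.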
Network issue
If the network in the cluster is not stable, we can face this issue as well: the NameNode fails to respond to the ZooKeeper Failover Controller, which triggers a NameNode failover, and when the old NameNode then tries to flush its edits to the JournalNode it is rejected with the error message “IPC’s epoch X is less than the last promised epoch Y”.
We need to fix the network issue at the host level, and we can also tune the property below to improve performance:
Raise dfs.datanode.max.transfer.threads from 4K to 16K
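In `hdfs-site.xml`, that change looks like the fragment below (the value 16384 matches the 16K suggested above; treat it as a starting point and tune for your workload, then restart the DataNodes to apply it):

```xml
<!-- hdfs-site.xml: maximum threads a DataNode uses for transferring data -->
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>16384</value>
</property>
```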
In conclusion, the “IPC’s epoch X is less than the last promised epoch Y” error occurs when the old active NameNode tries to write edits to the JournalNode after a failover. This is expected behavior to avoid a split-brain scenario; the real task is to understand why the NameNode became unresponsive and caused the failover in the first place. The causes above are some that I have faced.
Good luck with your learning !!