How to find and delete files older than X days in HDFS


In this article, we will see how to find and delete files older than X days in HDFS. It is essential to clean up HDFS files that are old or stale, and setting up a script to monitor and clean up HDFS is an effective way to avoid unnecessary job/service failures.

Usually, I check the list of files before deleting them as a sanity check, and I would recommend doing the same.

Step1: Find files older than 7 days

The script below lists all the files that are older than X days. Here I will use X = 7 days.

export HADOOP_CLIENT_OPTS="-Xmx4g"
today=$(date +%s)
hadoop fs -ls -R /tmp | while read -r file; do
  dir_date=$(echo "$file" | awk '{print $6}')
  difference=$(( (today - $(date -d "$dir_date" +%s)) / (24 * 60 * 60) ))
  if [ "$difference" -gt 7 ]; then
    echo "$file" >> /tmp/older_files.txt
  fi
done

The above script lists all the files older than 7 days under the ‘/tmp’ directory, which can be changed based on the requirement.

hadoop fs -ls -R /tmp 
  • Setting export HADOOP_CLIENT_OPTS="-Xmx4g" is essential when you are cleaning up lots of files, to avoid out-of-memory errors on the client
  • The script compares each file's modification timestamp against the current timestamp and lists the files that are older than 7 days
  • Finally, the names of the files older than 7 days are stored in "/tmp/older_files.txt"
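The age calculation at the heart of the script can be checked on its own. This is a standalone sketch assuming GNU date; "2020-01-01" is a placeholder for the modification date that `hadoop fs -ls` would print in column 6:

```shell
# Standalone sketch of the age calculation used above (GNU date assumed).
# "2020-01-01" is a placeholder for a file's modification date.
dir_date="2020-01-01"
today=$(date +%s)
difference=$(( (today - $(date -d "$dir_date" +%s)) / (24 * 60 * 60) ))
echo "$difference days old"
```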

Step2: Validation

  • The list of files will be in the "/tmp/older_files.txt" file, which can be cross-checked to confirm that the listed files really are older than 7 days (just a quick look)
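A quick way to eyeball the list is to count the entries and peek at the first few. In this sketch the `printf` lines are a fabricated stand-in for the `hadoop fs -ls -R` output that Step1 writes to /tmp/older_files.txt:

```shell
# Sanity-check sketch; the sample lines below stand in for the real listing
# that Step1 captures into /tmp/older_files.txt.
printf '%s\n' \
  '-rw-r--r--   3 hdfs supergroup   1024 2020-01-01 10:00 /tmp/app/old1.log' \
  '-rw-r--r--   3 hdfs supergroup   2048 2020-01-02 11:30 /tmp/app/old2.log' \
  > /tmp/older_files.txt

wc -l < /tmp/older_files.txt    # number of candidate files
head -5 /tmp/older_files.txt    # spot-check a few entries
```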

Step3: Delete files older than 7 days

Proceed with deletion:

awk '{print $8}' /tmp/older_files.txt | grep . > /tmp/older_file_name.txt
while read -r file; do
  hdfs dfs -rm "$file"
done < /tmp/older_file_name.txt
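Running `hdfs dfs -rm` once per file starts a new client JVM for every delete, which is slow for large lists. A batched variant with xargs passes many paths to each `hdfs dfs -rm` call. In this sketch `echo` is prefixed so the commands are only previewed (a dry run), and the sample paths are placeholders for the real contents of /tmp/older_file_name.txt:

```shell
# Batched-delete sketch: xargs groups paths so each 'hdfs dfs -rm' invocation
# removes several files at once. 'echo' makes this a dry run; drop it to delete.
printf '%s\n' /tmp/app/old1.log /tmp/app/old2.log /tmp/app/old3.log \
  > /tmp/older_file_name.txt
xargs -n 2 echo hdfs dfs -rm < /tmp/older_file_name.txt
```

With `-n 2` the preview prints two commands, each covering up to two paths; in practice a larger batch size such as `-n 100` is typical.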

If you would like to remove the files in a single shot, use the script below.

NOTE: Use it with caution

export HADOOP_CLIENT_OPTS="-Xmx4g"
today=$(date +%s)
hadoop fs -ls -R /tmp | while read -r file; do
  dir_date=$(echo "$file" | awk '{print $6}')
  difference=$(( (today - $(date -d "$dir_date" +%s)) / (24 * 60 * 60) ))
  if [ "$difference" -gt 7 ]; then
    hadoop fs -rm "$(echo "$file" | awk '{print $8}')" >> /tmp/older_files.txt
  fi
done

Good Luck with your Learning !!
