How to find and delete files older than X days in HDFS

In this article, we will see how to find and delete HDFS files older than X days. Cleaning up old or stale files in HDFS is essential, and setting up a script to monitor and clean them up regularly is an effective way to avoid unnecessary job/service failures.
As a sanity check, I usually review the list of files before deleting them, and I recommend you do the same.
Step 1: Find files older than 7 days
The script below lists all files older than X days; here I will use X = 7 days.
export HADOOP_CLIENT_OPTS="-Xmx4g"
today=$(date +%s)
hadoop fs -ls -R /tmp | while read -r file; do
    # Field 6 of the ls output is the modification date (YYYY-MM-DD)
    dir_date=$(echo "$file" | awk '{print $6}')
    difference=$(( (today - $(date -d "$dir_date" +%s)) / (24 * 60 * 60) ))
    if [ "$difference" -gt 7 ]; then
        echo "$file" >> /tmp/older_files.txt
    fi
done
The script above lists all files older than 7 days under the '/tmp' directory, which can be changed based on your requirement:
hadoop fs -ls -R /tmp
- Setting export HADOOP_CLIENT_OPTS="-Xmx4g" is essential when cleaning up a large number of files, to avoid out-of-memory errors
- The script compares each file's timestamp with the current timestamp and lists the files that are older than 7 days
- Finally, the names of the files older than 7 days are stored in "/tmp/older_files.txt"
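The script relies on the column layout of the `hadoop fs -ls` output and on epoch-second arithmetic; both can be checked in isolation. The line and dates below are made up for illustration:

```shell
# A hypothetical line as printed by 'hadoop fs -ls -R' (values are illustrative)
line='-rw-r--r--   3 hdfs supergroup      1024 2023-01-15 10:30 /tmp/sample/file.txt'

# Field 6 is the modification date, field 8 is the full path
dir_date=$(echo "$line" | awk '{print $6}')   # -> 2023-01-15
path=$(echo "$line" | awk '{print $8}')       # -> /tmp/sample/file.txt

# Age in days: epoch-second difference divided by 86400 (requires GNU date)
today=$(date -d "2023-01-25" +%s)             # pretend "today" for illustration
difference=$(( (today - $(date -d "$dir_date" +%s)) / (24 * 60 * 60) ))
echo "$difference"                            # -> 10
```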
Step 2: Validation
- The list of files will be in "/tmp/older_files.txt", which can be cross-checked to confirm that the files are older than 7 days (just a quick look)
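For the quick look, counting the entries and eyeballing the first few is usually enough. A sketch, using a demo file as a stand-in for the real /tmp/older_files.txt:

```shell
# Demo file standing in for the real /tmp/older_files.txt (entries are illustrative)
printf '%s\n' 'entry1' 'entry2' 'entry3' > /tmp/demo_older_files.txt

wc -l < /tmp/demo_older_files.txt   # how many files are candidates for deletion
head -5 /tmp/demo_older_files.txt   # eyeball the first few entries
```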
Step 3: Find and delete files older than 7 days
Proceed with deletion:
awk '{print $8}' /tmp/older_files.txt | grep . > /tmp/older_file_name.txt
while read -r file; do
    hdfs dfs -rm "$file"
done < /tmp/older_file_name.txt
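Since each `hdfs dfs -rm` invocation starts a JVM, deleting one file per call can be slow for large lists; batching paths with xargs reduces the number of invocations. Prefixing the command with echo also gives a dry run that prints what would be deleted. A sketch, using a demo list as a stand-in for the real /tmp/older_file_name.txt:

```shell
# Demo list standing in for the real /tmp/older_file_name.txt (paths are illustrative)
printf '%s\n' /tmp/a /tmp/b /tmp/c > /tmp/demo_older_file_name.txt

# Dry run: print the delete commands instead of executing them
xargs -n 100 echo hdfs dfs -rm < /tmp/demo_older_file_name.txt
# -> hdfs dfs -rm /tmp/a /tmp/b /tmp/c

# Real run would drop the 'echo' and use the actual list:
# xargs -n 100 hdfs dfs -rm < /tmp/older_file_name.txt
```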
If you would like to remove the files in a single shot, use the script below.
NOTE: Use it with caution
export HADOOP_CLIENT_OPTS="-Xmx4g"
today=$(date +%s)
hadoop fs -ls -R /tmp | while read -r file; do
    dir_date=$(echo "$file" | awk '{print $6}')
    difference=$(( (today - $(date -d "$dir_date" +%s)) / (24 * 60 * 60) ))
    if [ "$difference" -gt 7 ]; then
        # Field 8 is the full file path; the rm output is logged for later review
        hadoop fs -rm "$(echo "$file" | awk '{print $8}')" >> /tmp/older_files.txt
    fi
done
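To automate the cleanup as suggested at the start of this article, the one-shot script can be saved somewhere (the path below is hypothetical) and scheduled with cron:

```
# Run the HDFS cleanup every day at 02:00 (script path and log path are hypothetical)
0 2 * * * /opt/scripts/hdfs_cleanup.sh >> /var/log/hdfs_cleanup.log 2>&1
```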
Good luck with your learning!