Script to delete duplicate files - Programming On Unix

pkal
Long time nixers
Maybe someone besides me can make use of this script I wrote. I use it to delete duplicate files (hence it's called deldup) in the current and all contained directories. In my case it's been particularly helpful in cleaning up my .pdf collection, which contained multiple files with the same content (i.e. the same MD5 hash) but either different file names or copies placed in two directories. The script offers a selectable list of items to delete; typing "!" deletes every file with a given hash.
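The detection step itself boils down to something like this (just a minimal sketch of the idea, not the script itself; the /tmp path is only for illustration):

Code:
# hash every file, then print the hashes that occur more than once --
# each such hash marks a group of duplicate files
find . -type f -exec md5sum {} + | sort > /tmp/hashes.md5
cut -d' ' -f1 /tmp/hashes.md5 | uniq -d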

I've only tested it under Void Linux with bash, but I'd like to make it as portable as possible, which means replacing all the bash tricks (readarray, let, regular expression matching) with POSIX-compatible equivalents.
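To give an idea of what that would involve, here are rough, illustrative POSIX replacements for the constructs in question (not taken from the script itself):

Code:
# let C++              ->  POSIX arithmetic expansion
C=$((C + 1))

# [[ $i =~ ^[0-9]+$ ]] ->  a case pattern instead of a bash regex match
case $i in
    ''|*[!0-9]*) echo "\"$i\" is not a number" ;;
    *)           echo "\"$i\" is a number" ;;
esac

# readarray -t FILES   ->  POSIX sh has no arrays; process lines as they arrive
printf '%s\n' one two three | while IFS= read -r line; do
    printf 'got: %s\n' "$line"
done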

And here's the script:

Code:
#!/bin/bash
# deldup: interactively delete files with identical content (same MD5 hash)
HASH=md5sum
INODE=$(stat -c %i "${1:-.}")    # cache files are keyed by the directory's inode
MDF="/tmp/deldup-$INODE.md5"     # cached "hash  filename" list
DUP="/tmp/deldup-$INODE.list"    # hashes that occur more than once
if [ "$MDF" -nt . ]; then
    echo "found cached hash list in $MDF"
else
    echo -n "hashing..."
    FC=$(find "${@:-.}" -type f | wc -l)
    # background progress indicator: show how many files have been hashed so far
    while true; do
        P=$(cat "$MDF" 2> /dev/null | wc -l)
        echo -en "\rhashing files ($P/$FC)... "
        sleep 0.25
    done &
    find -O2 "${1:-.}" -type f -print0 | xargs -L 1 -P 8 -0 "$HASH" > "$MDF"
    kill $!
    wait $! 2> /dev/null
fi

# a hash appearing more than once marks a group of duplicate files
cut -f1 -d" " < "$MDF" | sort | uniq -d > "$DUP"

T=$(cat "$DUP" | wc -l)
N=1
echo "Found $T groups of duplicate files"
for dh in $(cat "$DUP"); do
    echo
    echo "[$N/$T] Files with hash $dh:"
    # strip the leading "<hash>  " so the array holds plain file names
    grep -a "^$dh" "$MDF" | sed 's/^[^ ]*  //' | {
        # collect the file names into an array, indexed from 1
        readarray -t -O 1 FILES

        for C in $(seq ${#FILES[@]}); do
            printf "%4d\t%s\n" "$C" "${FILES[$C]}"
        done

        # stdin inside this block is the grep pipe, so read the answer from the terminal
        echo -n 'Select to delete (eg. "1", "2,3,4", "!", "1,3-5"): '
        IFS=',' read -ra RANGE < /dev/tty
        if [ "${RANGE[0]}" = "!" ]; then # "!" deletes all matches
            for i in "${FILES[@]}"; do
                rm -v -- "$i"
            done
        else
            for i in "${RANGE[@]}"; do
                if [[ $i =~ ^([0-9]+)-([0-9]+)$ ]]; then
                    START=${BASH_REMATCH[1]}
                    END=${BASH_REMATCH[2]}
                    for x in $(seq "$START" "$END"); do
                        rm -v -- "${FILES[$x]}"
                    done
                elif [[ $i =~ ^[0-9]+$ ]]; then
                    rm -v -- "${FILES[$i]}"
                else
                    echo "Invalid range or number \"$i\""
                fi
            done
        fi
    }
    let N++
done

echo
read -p "delete cache [y/N]? " DC
[[ $DC =~ ^[yY] ]] && rm -v "$MDF" && rm -v "$DUP"

Alternative uploads: https://a.pomfe.co/udnjib, https://sub.god.jp/f/F1D3s2iQ.sh, https://comfy.moe/ypqbun

I never had any actual formal introduction to shell scripting, so this isn't the tidiest script. I wrote it over the span of a few months, adding features as I needed them. Feedback would be much appreciated. Also, is there a simpler way to do this under Unix? Maybe even a pre-existing tool?

