Script to delete duplicate files - Programming On Unix

pkal
Long time nixers
Maybe someone besides me can make use of this script I wrote. I use it to delete duplicate files (hence it's called deldup) in the current directory and all directories below it. In my case it's been particularly helpful in cleaning up my .pdf collection, which contained multiple files with the same content (i.e. the same MD5 hash) but either different file names, or the same file placed in two directories. It offers a selectable list of items to delete; typing "!" deletes every file with a given hash.

I've only tested it under Void Linux with bash, but I'd like to make it as portable as possible, which means replacing all the bash tricks (readarray, let, regular expression matching) with POSIX-compatible equivalents.
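
For reference, these are the kinds of POSIX replacements I have in mind (untested sketches, not drop-in patches): readarray becomes a read loop over the positional parameters, let becomes arithmetic expansion, and [[ =~ ]] becomes a case pattern:

Code:
# readarray -t FILES  ->  collect lines into the positional parameters
set --
while IFS= read -r line; do
    set -- "$@" "$line"
done
echo "read $# lines"

# let C++  ->  POSIX arithmetic expansion
C=$((C + 1))

# [[ $i =~ ^[0-9]+$ ]]  ->  case with a glob pattern
case $i in
    ''|*[!0-9]*) echo "\"$i\" is not a number" ;;
    *)           echo "\"$i\" is a number" ;;
esac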

And here's the script:

Code:
#!/bin/bash
# hash command used to fingerprint file contents
HASH=md5sum
INODE=$(stat -c %i "${1:-.}")
MDF="/tmp/deldup-$INODE.md5"
DUP="/tmp/deldup-$INODE.list"
# reuse the hash list if it is newer than the scanned directory
if [ "$MDF" -nt "${1:-.}" ]; then
    echo "found cached hash list in $MDF"
else
    echo -n "hashing..."
    FC=$(find "${1:-.}" -type f | wc -l)
    # background progress indicator, killed once hashing is done
    while true; do
        P=$(cat "$MDF" 2>/dev/null | wc -l)
        echo -en "\rhashing files ($P/$FC)... "
        sleep 0.25
    done &
    find -O2 "${1:-.}" -type f -print0 | xargs -0 -n 1 -P 8 "$HASH" > "$MDF"
    kill $!
    wait $! 2> /dev/null
fi
fi

# collect every hash that appears more than once
cut -f1 -d" " < "$MDF" | sort | uniq -d > "$DUP"

T=$(wc -l < "$DUP")
N=1
echo "Found $T sets of duplicate files"
for dh in $(cat "$DUP"); do
    echo
    echo "[$N/$T] Files with hash $dh:"
    grep -a "^$dh" "$MDF" | cut -f2- -d" " | {
        readarray -t -O 1 FILES

        # FILES is 1-indexed (readarray -O 1)
        for C in $(seq ${#FILES[@]}); do
            printf "%4d\t%s\n" "$C" "${FILES[$C]}"
        done

        echo -n 'Select to delete (eg. "1", "2,3,4", "!", "1,3-5"): '
        IFS=',' read -ra RANGE </dev/tty # stdin is the grep pipe here, so read from the terminal
        if [ "${RANGE[0]}" = "!" ]; then # "!" deletes all matches; read -a fills from index 0
            for i in "${FILES[@]}"; do
                rm -v "$(echo $i | xargs -0 echo)"
            done
        else
            for i in "${RANGE[@]}"; do
                if [[ $i =~ ^([0-9]+)-([0-9]+)$ ]]; then
                    START=${BASH_REMATCH[1]}
                    END=${BASH_REMATCH[2]}
                    for x in $(seq $START $END); do
                        rm -v "$(echo ${FILES[$x]} | xargs -0 echo)"
                    done
                elif [[ $i =~ ^[0-9]+$ ]]; then
                    rm -v "$(echo ${FILES[$i]} | xargs -0 echo)"
                else
                    echo Invalid range or number \"$i\"
                fi
            done
        fi
    }
    let N++
done

echo
read -p "delete cache [y/N]? " DC
[[ $DC =~ ^[yY] ]] && rm -v "$MDF" "$DUP"

Alternative uploads: https://a.pomfe.co/udnjib, https://sub.god.jp/f/F1D3s2iQ.sh, https://comfy.moe/ypqbun

I never had any formal introduction to shell scripting, so this isn't the tidiest script. I wrote it over a timespan of a few months, adding features as I needed them. Feedback would be much appreciated. Also, is there a simpler way to do this under Unix? Maybe even a pre-existing tool?
z3bra
Grey Hair Nixers
Since you learnt scripting on your own, it's rather good, so kudos for that!
Over the years, I've written a bunch of scripts, and I still use most of them on a day-to-day basis, either interactively, in pipes or through cronjobs.

Here are a few tips I can give you, based on my experience:

0. Think small. Scripts are supposed to glue programs together, not be programs on their own. Design your output to be useful to other programs.
1. Be quiet. Logging looks cool, especially with "...." and green "OK" or red "FAIL", but it definitely doesn't help in pipelines.
2. Avoid using 'rm' in shell scripts. I tend to make my scripts selectors or filters, so that I can inspect the output beforehand and append "xargs rm" after that.
3. Avoid interactivity. The best tools are the ones that are automated and can "think" on their own.

Note that these tips are totally subjective; they're based on my own experience, so things might differ for other people.

Now, I know that these are rather abstract, so here is an attempt at doing the same, "my way" ;) (written directly from memory, without testing of course :P)

Code:
#!/bin/sh
# read file list from input, output all duplicates on a single line, separated by tabs
# eg: find duplicates and only keep one:
# find . -type f | ./getdup | cut -f2- | xargs rm

# write hash + filename in a temp file, sorted by hash
TMP=$(mktemp)
xargs -n1 sha1sum | sort > "$TMP"

# find duplicated hashes, and match them in the list
for SHA1 in $(cut -d' ' -f1 < "$TMP" | uniq -d); do
    # print all files for each hash separated by tabs
    # assume files don't include spaces, of course...
    grep "^$SHA1" < "$TMP" | cut -d' ' -f2- | xargs echo | tr ' ' '\t'
done
done

rm "$TMP"

This basically only transforms the input and leaves the filesystem untouched, so even if the script is messed up, it cannot delete any file, lose data or corrupt anything (because I know how bad I can be at coding :P)

Anyway, keep scripting!

EDIT: OMG, it works!
pkal
Long time nixers
But why should shell scripts be treated differently from other programs in my path? To me, it's precisely that I don't know, and don't have to know, how a program works or how it's implemented that I find interesting about a Unix environment. For some reason, discriminating against shell scripts for being interactive or for deleting files (one can't argue the user doesn't know; they have to consciously call the script and give it valid input) seems arbitrary.

And regarding your script, I remember doing something like that before I had this script, but I didn't always want to delete the first file it found, so I had to manually scan the output and run rm for the files I wanted to delete. Saying that writing a script that automates the process by glueing the steps together is bad style (especially when I know that its output is never parsed, nor has to be parsed, by other programs) seems weird.
z3bra
Grey Hair Nixers
I'm not saying you should do it my way. If you felt offended or whatever, I'm sorry, that was definitely not the point. I just wanted to discuss your style.

I actually use interactive shell scripting a lot, and I tend to "improvise" those scripts within interactive sessions rather than saving them in my toolbox. Sometimes the logic is just too complex or too random, and interactive use is indeed best.

The reason I don't find shell scripting suitable for "real" programs is that its structure is too simple, and not safe enough. It also lacks some basic constructs, which makes it hard to deal with complex data structures.

I'm talking about POSIX shell only. I won't mention bashisms et al. here because I don't want to :)
There are no array types, quote escaping seems random and implementation-specific, arithmetic is barely handled... Add on top of that the builtin vs. binary problem and you get a language that is a pain to "program" in.
In my mind, shell scripting should be there to help external programs interact with each others, not be a program by itself.
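
To illustrate the kind of contortions I mean, here is a rough, untested sketch of the POSIX "equivalents": the closest thing to an array is the single list of positional parameters, and arithmetic goes through $(( )):

Code:
#!/bin/sh
# the only "array" is the positional parameter list (one per script or function)
set -- foo "bar baz" qux
echo "$# elements"
for elem in "$@"; do    # "$@" preserves the quoting of each element
    echo "$elem"
done

# no 'let', no '++': POSIX arithmetic expansion only
i=0
i=$((i + 1))
echo "$i"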

That being said, it's a really simple language that has the biggest "libraries" available through external programs, so it is definitely the quickest way to get started with programming!
pkal
Long time nixers
I wasn't offended; it's just that I generally really appreciate your opinion and was surprised when I read your post. All I'm saying is that I don't see a problem in using shell scripts to solve simple tasks (like this one). I'd be the last person (well, I'm actually not sure ;^D) who would want to write a physics simulation, a compiler or a browser as a shell script, since the shell obviously isn't the right tool for the task. To quote Master Foo:
Quote:“There is more Unix-nature in one line of shell script than there is in ten thousand lines of C.”
budRich
Long time nixers
That quote <3 I just uploaded a 1000-line bash script to GitHub. There is only one rule to shell scripting: get it done. I haven't tried your script, because I believe every file is sacred. I like your style. z3bra's style is also good.
z3bra
Grey Hair Nixers
Don't get me wrong, shell scripting is perfectly suited for your use case! I think that different people solve problems in different ways, and I like to see multiple solutions to each problem myself. That's why I wanted to share my POV: I think it differs significantly from yours, so it could be interesting to other people reading this thread.

I'm not saying my way is better than yours (even though I OBVIOUSLY prefer mine :P), just that I think differently, and thought it was cool to have different approaches to this.

The only important thing here is whether or not your script solved your issue ;)
budRich
Long time nixers
Great approach, z3bra. Scripting style is as personal as one's handwriting, and it's cool that we are different. I will make a separate thread and link to the script I mentioned (I don't want to hijack this thread more than necessary and such). It would be interesting to get some input on the style. I know I do a lot of things in a not-so-kosher way, and I am a proud bashist...
venam
Administrators
It seems like an endless quest.
Some years ago everyone was writing bots, kind of like a rite of passage.
It also seems like everyone was making a duplicate remover.
Etc...
Now we're back at writing bots.
And you're bringing back duplicate removers.

Jokes aside, I've also written one of those, in Perl this time: https://github.com/venam/duplicate-remover
I don't remember the quality of the code at all though; it might be bad (warning).
There are probably more efficient ways to solve this, taking edge cases into consideration, like (hard/soft) links.
EDIT: this is probably the total opposite of what z3bra proposes, full of colors and block characters, can't be used in a pipeline.
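
For the hard-link case specifically, a rough sketch (assuming GNU find, and filenames without newlines) would be to deduplicate on device and inode numbers before hashing anything, so each physical file is only considered once:

Code:
# keep only the first path seen for each device-inode pair,
# then feed the survivors to the duplicate detector
find . -type f -printf '%D-%i %p\n' |
    awk '!seen[$1]++ { sub(/^[^ ]+ /, ""); print }'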
budRich
Long time nixers
@venam: I don't really know perl, but your script is very readable, well documented with good comments; I wish more (public) scripts were like that. I have started to write the documentation/help (SCRIPT -h) first and then write the code needed to make the description valid. It's a good way to trick my trickster brain into not frankensteining my scripts.
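
The skeleton I start from looks something like this (a bare-bones sketch, all names made up):

Code:
#!/bin/bash
usage() {
  cat <<'EOF'
myscript - one line describing what the script will do

usage: myscript [-h] FILE...
  -h  show this help and exit
EOF
}

case "$1" in
  -h|--help) usage; exit 0 ;;
esac

# ...the actual code comes afterwards, written to make the help text true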
z3bra
Grey Hair Nixers
(20-11-2017, 09:40 AM)venam Wrote: It seems like an endless quest. [...]

We all used the same projects to either get started with programming or to learn a language, be it a task manager, IRC bot, text editor, ...
The thing I like best for that is code golfing though. You always find someone doing things in a more clever way than you do, stripping some bytes along the way.

I really enjoyed all the challenges we used to make this way. That's probably why I felt obliged to propose something else here ;)