venam
(25-09-2016, 12:55 PM)vain Wrote: I used to do what most people do: Attach a USB drive or boot up my NAS, and then backup data using rsync. The thing is, how do you know that the files you're backing up are actually intact? How do you verify that you're not creating the 8th backup in a row of a file that has been corrupt for two weeks? Backups are worthless if you should have restored a file from the old backup instead of creating a new corrupt backup.
You're touching on one big concern.
It's the same concern people have when sending data over any medium.
That's the reason big ISO files ship with checksums.

(25-09-2016, 12:55 PM)vain Wrote: The third issue is: How can you efficiently verify that your backups aren't corrupt?
(25-09-2016, 12:55 PM)vain Wrote: https://github.com/ambv/bitrot
It's basically "sha1sum on steroids". It makes it easy to scan large data sets, compute checksums, and verify them later. There are other tools like it. Maybe for your use case, sha1sum actually is good enough.
Yep, I've tried it, it's pretty nice.
I like the idea of keeping track of changes in bulk.
I'm not sure it's the best way but it's a good approach.

(25-09-2016, 12:55 PM)vain Wrote: Even today, USB sticks fail very often. Still, nobody really cared ...
If you don't plug in a USB stick for a while, the data on it fades away.
Every technology is a time bomb.

Incremental backups with diffs can probably solve this,
maybe by showing the diffs, like rsync with the verbose option.
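Something like this, maybe (just a sketch, the paths are made up):

<code>
# preview what a backup run would change before actually doing it
rsync -na --itemize-changes --delete ~/music/ /mnt/backup/music/
# review the output, then drop the -n to do the real transfer
</code>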
vain
(25-09-2016, 01:33 PM)venam Wrote: Yep, I've tried it, it's pretty nice.
I like the idea of keeping track of changes in bulk.
I'm not sure it's the best way but it's a good approach.

There's also "shatag" which can save the checksums in actual file metadata if your file system supports it (xattrs -- not available on OpenBSD and NetBSD, are they?):

https://bitbucket.org/maugier/shatag

This could be nicer because checksums aren't lost on file renames. I haven't tried this one yet.
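The rough idea, on a file system with xattrs (Linux attr tools here; the attribute name is just an example, not necessarily what shatag uses):

<code>
f=song.flac
sum="$(sha256sum "$f")" && sum="${sum%% *}"

# store the checksum next to the data itself, so it survives renames
setfattr -n user.checksum.sha256 -v "$sum" "$f"

# later: print the stored value and compare it against a fresh sha256sum
getfattr --only-values -n user.checksum.sha256 "$f"
</code>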

(25-09-2016, 01:33 PM)venam Wrote: Probably incremental backups with diffs can solve this.
Maybe by showing the diffs, like rsync with the verbose option.

Yeah, I'm going to try something like that.

I was thinking about doing this "review process" more often than I do actual backups. For example, let's say you create one backup a week. If you had a tool that could give you a report each day (!) on which files have been removed/added/updated, that might be easier to handle than to just read this report once a week.

(Remember, this isn't about "noisy" data, but about stuff like music. There's not a lot going on there, so I guess that most of the time nothing happens at all.)
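Roughly what I have in mind, just as a sketch (directory names are made up):

<code>
snapdir="$HOME/.snapshots"
mkdir -p "$snapdir"
prev="$(ls "$snapdir" | tail -n 1)"

# keep one checksum list per day
cd "$HOME/music" || exit 1
find . -type f -exec md5sum {} + | sort -k 2 > "$snapdir/$(date +%F)"

# removed/changed files show up as '<' lines, added/changed ones as '>' lines
[ -n "$prev" ] && diff "$snapdir/$prev" "$snapdir/$(date +%F)"
</code>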
josuah
I'm currently rolling my own "backup" script, which stores things like this:


<code>
<pre>
./              - root dir you want to backup
├── file1
├── dir1/
│   ├── file2
│   └── file3
├── file4
└── .version/   - where things are backed up
    ├── obj/    - the same files as in the root dir, but with different names: their hash
    │   ├── d41d8cd98f00b204e9800998ecf8427e - same content as file1
    │   ├── ba1f2511fc30423bdbb183fe33f3dd0f - same content as file2 and file3
    │   ├── c097062a51458c527f04b44c210efefc - same content as file4 (before)
    │   └── 1c93f779e81d1710e671e5355ee761ef - same content as file4 (now)
    ├── rev/    - "revisions": lists of equivalences between hashes and file names at a given time
    │   ├── fb82199922251c63d72b17a0860198e6 - initial version
    │   └── 4dca17472fcdda9affa259b318243a54 - file4 got edited
    └── history - list of rev/* in chronological order
</pre>
</code>


How to do this without all the CLI options & stuff:

<code>
mkdir -p .version/obj .version/rev

# checksum every file outside .version/ and store a copy of each new content
# under its hash in obj/
find . -type f ! -name .version ! -path '*/.version/*' -print0 | sort -z |
xargs -0 md5sum | tee .version/tmp | while read -r hash path
do [ -f ".version/obj/$hash" ] || cp "$path" ".version/obj/$hash"
done

# objects and revisions are never modified once written
chmod -w .version/obj/* .version/rev/*

# the revision file is named after the hash of the file list itself
new="$(md5sum .version/tmp)" && new="${new%% *}" || exit 1
mv -f .version/tmp ".version/rev/$new"

# append the new revision to the history, unless nothing changed
[ -f .version/history ] && old="$(tail -n 1 .version/history)" || old=''
[ -f .version/history ] && [ "$new" = "$old" ] ||
printf '%s\n' "$new" >> .version/history
printf 'old %s\nnew %s\n' "$old" "$new"
</code>


This makes versioning easy: adding a new <code>./.version/rev/*</code> text file of ~100 lines (md5sum output) is enough to create a new version.

This permits de-duplication of content across versions: any file that did not change is not duplicated, since its content already exists in <code>.version/obj/&lt;hash-of-the-file&gt;</code>, even if the file gets renamed.

I may even add some merging operations to gather content spread over multiple storage devices, then merge them with <code>version merge tree1 tree2</code>.

It seems that it is a bit like what git does: https://git-scm.com/book/en/v2/Git-Inter...it-Objects

For now I use md5sum because it is faster and I am just starting to play with it. The process is rather slow compared to git, but it's only a 227-line shell script for now...

So far, backing up data and switching between revisions work. You can safely run <code>version</code> without arguments to get a short usage message and start playing with it.

[EDIT] formatting



I just bought a hard drive enclosure (sorry if this sounds like some kind of ad).

(25-09-2016, 12:55 PM)vain Wrote: ...

I hope this can address a few of these issues, together with my script generating md5sums and comparing them to the previous revision at every version (diff with <code>version diff</code>).

As I cannot roll a piece of software with 10 years of testing behind it and all the work it takes to run everywhere, I bet on simplicity to keep things safe:

The script never ever deletes data. `rm` never acts on the read-only <code>./.version/{rev,obj}/*</code> content. So in case of disaster, <code>version revision &lt;revision-hash&gt;</code> restores everything.
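Restoring a revision boils down to something like this (not the actual <code>version</code> code, just the idea, given that each rev file is md5sum output, one "hash  path" per line):

<code>
# walk the chosen revision and copy each object back to its recorded path
while read -r hash path
do mkdir -p "$(dirname "$path")"
   cp ".version/obj/$hash" "$path"
done < ".version/rev/$1"
</code>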

I will let you know in the Unix Diary thread if I encounter a great disaster with this! ^_^

One problem: it does not provide any way to back up data on a remote host. Z3bra's <code>synk</code> may be much better for this purpose. Maybe I could combine them some day.
pranomostro
One thing: I recently saw a quite interesting video about GPG (https://begriffs.com/posts/2016-11-05-ad...gnupg.html) in which the presenter stated that one should not back up ~/.gnupg/random_seed if one wants to keep one's communications secure.

So I opened all my backups, deleted random_seed from ~/.gnupg/.

My backup process is quite simple: it backs up the files that were changed since the last backup (daily), using stat --format="%Y" rather than a checksum (I think it's faster and makes it easier to also pick up new files), and it does a full backup at the beginning of every month. That keeps the resulting files smaller and still provides a comfortable fallback in case anything goes wrong.
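The incremental part amounts to roughly this (not my actual script, paths are made up, and --null/-T assume GNU tar):

<code>
marker="$HOME/.last-backup"
dest="/mnt/backup/incremental-$(date +%F).tar.gz"

# archive everything modified since the marker file, then update the marker
if [ -f "$marker" ]
then find "$HOME/documents" -type f -newer "$marker" -print0 |
     tar -czf "$dest" --null -T -
else tar -czf "$dest" "$HOME/documents"   # first run: full backup
fi
touch "$marker"
</code>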

I should start copying my files to a remote host; until now I have been copying my backups onto an external HDD and a USB stick every 2 months or so. That's a bit careless, but my important programming projects are on github anyway.

If I started anew, I would encrypt my hard drive from the beginning, make encrypted backups (using GPG) and simply back up my encrypted files on dropbox or something like that. It would not be 100% privacy oriented, but more comfortable than my current solution. Also, I should use an existing solution (I did not do enough research about that, could anyone give me a recommendation for a stable one?). But for now, everything is working fine (and I already used my backups to restore files) and it is simple.
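The GPG part would be roughly this (paths, dates and the recipient are placeholders):

<code>
# encrypt the archive before it ever touches the synced folder
tar -czf - "$HOME/documents" |
gpg --encrypt --recipient you@example.com \
    > "$HOME/Dropbox/backup-$(date +%F).tar.gz.gpg"

# restoring later
gpg --decrypt "$HOME/Dropbox/backup-2017-01-01.tar.gz.gpg" | tar -xzf -
</code>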

My current backup script: https://raw.githubusercontent.com/pranom.../master/bu

cgdc (for copying changed files): https://raw.githubusercontent.com/pranom...aster/cgdc

Update: I considered this one: https://github.com/Pricetx/backup but unfortunately it seems to keep passwords in plaintext in a configuration file https://github.com/Pricetx/backup/blob/m...backup.cfg instead of using a password manager command.

Update 2: Or maybe there is a venti (https://en.wikipedia.org/wiki/Venti) version for Unix?
jkl
Quote:For dotfiles everyone uses a git hosting websites.

Actually, not.

On-topic: My backup procedures differ depending on which platform I'm on.

On Windows:

* My not-so-secret project files are entirely stored in my Dropbox, hard-linked to my local Projects directory. I randomly mirror the project folder with robocopy /MIR.
* My not-so-not-so-secret files reside in an encrypted folder in my Dropbox.

On BSD:

* My BSD laptop is encrypted. No files are stored outside that laptop. A hard disk failure would cost me some files. I can perfectly live with that.
* My BSD servers are unencrypted but double-protected by RAID and snapshots.
venam
(03-01-2017, 10:39 AM)jkl Wrote: * My BSD servers are unencrypted but double-protected by RAID and snapshots.
Are those physical servers?
What does the architecture of the whole thing look like (hardware and software)?
z3bra
I wanted to start an open discussion about backups. I can't believe this topic hasn't been brought up on the forums yet!

This is a pretty wide topic, and there are many points to cover.
We all have different setups, different data, ideas, needs... And there are so many different solutions to choose from.

There's no specific question here, you can drop your ideas about backups, explain yours, talk about good practices, ...


Of course, I'll open the discussion:

I'm convinced there are only 2 types of backups: configuration and data.

Configuration backups help you recover faster than rebuilding from scratch, so they're not strictly necessary, but they help a lot. They MUST come with a recovery plan.

Data backups are for data that cannot be regenerated and is specific to a context. That can be pictures of your cat, emails, cryptographic keys or user settings in a database.
This usually results in a large amount of data, and you need to be really careful not to screw it up!

I currently use tarsnap for my servers' configuration and emails, as this is pretty sensitive. Each server has its own key and uploads its configs directly. I don't have any recovery procedure yet (I know it's bad), but it's basically: reinstall the server and extract all the data back onto it.
I also started using drist for configuration management of these servers.
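For reference, the tarsnap side is roughly this (assuming the per-server key is already configured; archive names and paths are examples):

<code>
# one dated archive per run, taken from the usual config locations
tarsnap -c -f "config-$(date +%F)" /etc /usr/local/etc

# restoring is the reverse, on the freshly reinstalled server
tarsnap -x -f "config-2019-04-19"
</code>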

For the data... I'm still wondering how to proceed. All my data currently lives on a 1TB drive in my computer. I also back up my ssh keys, pgp keys and /home there (using cp/tar).
I have a USB drive (1TB as well) gathering dust somewhere that I'm supposed to use as an offline backup, but I hardly ever do it...
I recently subscribed to backblaze B2, which is a cheap cloud storage solution. I'm planning on using it as an offsite backup solution, but I need to find a good way to reduce my data size, and to encrypt it first.
For the size, I'll use dedup to deduplicate the data. Now I need a good/reliable way to encrypt the data before uploading it "to the cloud".
I'd also want a 3rd location, possibly in a place I can control (e.g. my mom's house or a friend's).

That's it! It's far from perfect, but I'm fully aware (I think) of the flaws of this setup, and I think it won't be that bad once it's finished.
If you have ideas that could help me, I'll hear them with pleasure!
BANGARANG, MOTHERFUCKER
venam
(19-04-2019, 08:39 AM)z3bra Wrote: I wanted to start an open discussion about backups. I can't believe this topic hasn't been brought up on the forums yet!
It was discussed before a bit in this thread about Backing up and Deploying.
There were a lot of great ideas shared in that old thread.
I'll merge the posts.



