venam
Administrators
(This is part of the podcast discussion extension)

Backing up and Deploying


Link of the recording [ https://raw.githubusercontent.com/nixers...-05-29.mp3
http://podcast.nixers.net/feed/?name=nix...05-291.mp3 ]

Everything about transporting your stuff somewhere else.


What was discussed earlier:
For dotfiles, everyone uses a Git hosting website.
As for other data, it differed greatly.

-- Show Notes

http://clonezilla.org/clonezilla-live.php
http://dar.linux.free.fr/
http://superuser.com/questions/343915/ta...difference
https://en.wikipedia.org/wiki/Cpio
https://en.wikipedia.org/wiki/RAID
https://wiki.archlinux.org/index.php/Disk_cloning
https://rsync.samba.org/
http://www.thegeekstuff.com/2010/08/cpio-utility/
http://linux.die.net/man/1/rdiff
http://www.cyberciti.biz/faq/disable-the...b-command/
http://unix.stackexchange.com/questions/...rving-acls
http://unix.stackexchange.com/questions/...ermissions
https://wiki.archlinux.org/index.php/udev
https://en.wikipedia.org/wiki/Distribute...ock_Device
swathe
Members
I think with dotfiles, deployment is made easier if the repo comes with a script that sets up the symlinks once you've cloned it. Some applications lend themselves to this really well too, such as emacs. I clone my .emacsd repo, install emacs and run it, and it downloads any packages I have specified and configures itself the way I want.
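
A minimal sketch of such a setup script, assuming the repo keeps its files without the leading dot (so `vimrc` ends up as `~/.vimrc`). The name `link_dotfiles` and that layout are made up for illustration and not taken from anyone's actual repo:

```shell
# link_dotfiles REPO [TARGET]
# Hypothetical helper: symlink every top-level entry of REPO into TARGET
# (default: $HOME), adding a leading dot to the name.
link_dotfiles() (
    repo=$(cd "$1" && pwd) || exit 1      # absolute path, so links stay valid
    target=${2:-$HOME}
    for src in "$repo"/*; do
        [ -e "$src" ] || continue         # empty repo: the glob matched nothing
        dest="$target/.$(basename "$src")"
        if [ -e "$dest" ] || [ -L "$dest" ]; then
            # Never overwrite something that is already there
            printf 'skip %s (already exists)\n' "$dest" >&2
        else
            ln -s "$src" "$dest"
        fi
    done
)
```

After a fresh clone, something like `link_dotfiles ~/dotfiles` would be enough. Skipping existing files is a deliberate choice here: a setup script that silently replaces files is hard to trust on a machine that already has some configuration.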

I think a podcast about backups could go for weeks on end if you covered everyone's preferences. In the end, I don't care how people do it as long as they do actually back up.

Golden rule too is, if it's not tested, it's not a backup.
TheAnachron
Members
Not sure if this is the right place to ask,
but what backup solution do you use?

I am thinking about using bup since it seems very flexible and is basically just a collection of small tools which do the job.

I looked into borg backup and duplicity, but I don't like their way of including/excluding files: they seem to include everything and only exclude the files you specify, and I want finer control.
venam
Administrators
We had a podcast about backups where we discussed many of the solutions available.

Personally, I just use a script that I manually fire and that rsyncs the files to my external hard disk.

If you find it more appropriate we can move this discussion to the backups and restore thread.
TheAnachron
Members
Hey venam, thanks for your reply! Indeed it seems good to move it to the existing thread. I cannot check the mp3 right now but will do when I am back home. Can this thread be merged into it? (Remove the first 2 sentences I wrote from the first post)

I was using rsync at first, but it doesn't compress and it creates too much data for me. So I would have to combine it with another solution, since I want to be able to go back in time with my backups.
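
For the "going back in time" part, plain rsync can get surprisingly far with its `--link-dest` option. The following is only a sketch under assumptions: the function name `snapshot`, the `latest` symlink and the directory layout are illustrative choices, and it does nothing about compression.

```shell
# snapshot SRC DEST [NAME]
# Sketch: dated snapshots with rsync --link-dest. Files unchanged since the
# previous snapshot are hard-linked to it, so every snapshot looks complete
# while only changed files use new space. All names here are examples.
snapshot() (
    src=$1
    mkdir -p "$2" || exit 1
    dest=$(cd "$2" && pwd) || exit 1           # absolute path for --link-dest
    name=${3:-$(date +%Y-%m-%d_%H%M%S)}
    if [ -d "$dest/latest" ]; then
        rsync -a --link-dest="$dest/latest" "$src"/ "$dest/$name"/ || exit 1
    else
        rsync -a "$src"/ "$dest/$name"/ || exit 1   # first run: plain copy
    fi
    # "latest" always points at the newest snapshot
    ln -sfn "$name" "$dest/latest"
)
```

Going back in time then just means looking into an older dated directory, and old snapshots can be removed with `rm -rf` without harming newer ones, since the data stays alive as long as one hard link to it remains. The space problem is only partly solved (no compression, no block-level deduplication), which is where tools like borg or bup still have the edge.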
z3bra
Grey Hair Nixers
I'm actually working on a tool to synchronize files across multiple hosts. It could be used for HA setups, or for backups: http://git.z3bra.org/synk/log.html
It's still a work in progress, but it managed to get a file synchronized between 3 hosts for me without issue.

In regards to backups, I wrote a post about it a while ago: http://blog.z3bra.org/2014/09/backup-someone.html
This was more an experiment than a *real* backup solution, but it can still give a few tips.
TheAnachron
Members
Hey z3bra, no offence, but rsync combined with borg/bup is imo in most cases a more suitable solution than a logrotate on that file.
While yours seems to be good enough for simple config files that can be easily backed up, it fails with massive data like my video or audio list.
(I do also understand it's just a pointer in the right direction)

Right now I am looking for a backup solution where I can go back in time and have checksums and incremental snapshots, like a mix between rsync and bup (or borg). Borg seems to handle it all very well and is tested against 10TB and more of storage. However, I'm not fully convinced yet, so I am still looking for alternatives. (I don't like that everything is included and you have to write an excessive list of excludes or generate it with a find command.)

I will give your "synk" a look and may report back for duty.
z3bra
Grey Hair Nixers
No worries, I'm not offended ;)

This post was more an experimentation / proof-of-concept. While it doesn't scale well, it was definitely fun to set up.
For synk, keep in mind that it's still a work in progress. As the only user, I couldn't test many cases, so I wouldn't want to be responsible for losing your 10TB of data!
synk uses ssh + rsync as a back end. You could use it for incremental backups without much trouble, I think. I'd be glad to have another user!

Quick question though. Do you plan on hosting the backups yourself, or rely on some sort of service for that?
TheAnachron
Members
So I've set up my backup solution, which mostly consists of a borg backup setup.

There are 4 repositories:
system-config: Contains data from /etc, /var and the like
system-data: Contains a list of all installed official and AUR packages, a list of all installed python packages and a backup of the LVM + LUKS headers, as well as a raw dd image of the /boot partition.
user-config: Contains all configuration from ~/.config and ~/.local
user-data: Contains all real user data like documents, movies, songs, savegames, code and the like.

In general I have to admit my home directory is pretty symlinked. Files/Folders are either linked to /data/private/$username/$folder or to /data/protected/$folder.

Dotfiles are stored in ~/dotfiles, which is a symlink to /data/private/$username/dotfiles. Those dotfiles are then linked to ~/.config/$program and the like.

Program data (like mails, hedgewars game data etc.) is stored inside /data/private/$username/appdata, whose subdirectories are symlinked from ~/.$program (like .mozilla, .thunderbird and the like).

This setup took quite some time, but ultimately it forces good behavior and splits the data into 4 maintainable repositories.
Dotfiles/Appdata that is not specially symlinked gets lost, thus forcing me to keep my ~ clean and updated.

When I create backups using borg I create them with this scheme:
For user-* backups (user-config and user-data): user-(config|data)::$username@$date_$time
And for system-* backups (system-config and system-data): system-(config|data)::$hostname@$date_$time
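
As a rough illustration of that naming scheme, a small helper could build the archive name. The date format is an assumption (the post does not say which one is used), `archive_name` is a made-up function, and the repository paths in the comments are hypothetical. One detail worth knowing if this is written in shell: `"$date_$time"` would expand a variable called `date_`, so real code needs `"${date}_${time}"` or a single formatted timestamp as below.

```shell
# archive_name OWNER
# Print "OWNER@DATE_TIME", where OWNER is a user name or a host name.
archive_name() {
    # One date call, so date and time cannot straddle midnight
    printf '%s@%s\n' "$1" "$(date +%Y-%m-%d_%H%M%S)"
}

# Possible use with borg's REPO::ARCHIVE syntax (paths are hypothetical):
#   borg create "/backup/user-data::$(archive_name "$USER")" /data/private/"$USER"
#   borg create "/backup/system-config::$(archive_name "$(hostname)")" /etc
```

Names built this way sort chronologically within each repository, which makes it easy to spot the newest archive of a given user or host.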

I will now connect this system to a cloud backup solution. The good thing with borg is that it's already encrypted by default, so syncing to the cloud will not be of much concern to me.

That's it for now folks!
movq
Long time nixers
I'm probably the last person on earth to fully realize the impact of this issue, but then again, it doesn't get much attention, either: Data integrity.

I used to do what most people do: Attach a USB drive or boot up my NAS, and then backup data using rsync. The thing is, how do you know that the files you're backing up are actually intact? How do you verify that you're not creating the 8th backup in a row of a file that has been corrupt for two weeks? Backups are worthless if you should have restored a file from the old backup instead of creating a new corrupt backup.

Rarely, people address this issue.

File systems are of little help. The common file systems just store whatever data you tell them to store and then blindly trust the hard drive. Was it written successfully? Were there I/O errors? When reading the data later, do we get back the same data? Nobody really knows. See also: http://danluu.com/file-consistency/

I only know of two "big" file systems that take care of this: ZFS and btrfs. They create checksums for both the metadata and the actual data. Given the amount of bugs, I don't fully trust btrfs yet (does anybody here use it?). ZFS is an alien in the Linux world, so I'm very reluctant to use that one, either. Maybe I'll use it on my NAS in the future.

To be fair, the last time I had a drive that silently corrupted data was about 10 years ago. Sometimes, files just contained garbage. SMART showed a steadily rising "reallocated sector count". Modern hardware seems to be more reliable. Still, it can happen.

So, the first issue is: Before creating a new backup, you should verify the integrity of the data you're about to backup.

If you're like me, you use Git whenever possible. Most of the really important data on my drive is in Git repos. It's not limited to code, it also includes letters and the like. The nice thing about Git is that it computes a lot of checksums, because it uses them to address objects. That is, when you commit a file, Git automatically hashes it and only refers to that file by its hash. Eventually, commit objects are hashed as well and that's how you end up with "revision e6865e9". Because everything has a checksum, you can do a "git fsck --full" to verify that each object (including each version of each file you ever committed) is still intact. That's awesome. You should do that for all your Git repos before creating backups.
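
Checking every repo by hand gets tedious, so here is a small sketch that walks a directory tree and runs the check on each repository it finds. The function name `check_git_repos` and the default search root are assumptions; bare repositories and paths containing newlines are not handled.

```shell
# check_git_repos [ROOT]
# Run "git fsck --full" on every repository below ROOT (default: $HOME).
# Prints one line per failing repository and exits non-zero if any failed.
check_git_repos() (
    root=${1:-$HOME}
    # A ".git" directory marks the top of a (non-bare) repository
    find "$root" -type d -name .git -prune -print | {
        status=0
        while IFS= read -r gitdir; do
            repo=${gitdir%/.git}
            if ! git -C "$repo" fsck --full >/dev/null 2>&1; then
                printf 'CORRUPT: %s\n' "$repo"
                status=1
            fi
        done
        exit "$status"
    }
)
```

A backup script could start with `check_git_repos ~/work || exit 1`, so that no new backup is made while a repository is reporting damage.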

But how about large data? You don't want to have all your photos or music in a Git repo. That would be a huge waste of space. I have no experience with tools like git-annex, maybe they can be of use here, too. What I did discover recently is this tool:

https://github.com/ambv/bitrot

It's basically "sha1sum on steroids". It makes it easy to scan large data sets, compute checksums, and verify them later. There are other tools like it. Maybe for your use case, sha1sum actually is good enough.

I recently started to do these two extra steps before creating new backups -- "git fsck --full" and run a tool like "bitrot". Only if they tell me that everything's fine, I go ahead and do the backup. I have a much better feeling about the quality of my backups now.
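
For the "sha1sum is good enough" case, the idea can be sketched with two small functions. This is not how bitrot itself works internally; the function names and the choice to keep the manifest outside the scanned directory are assumptions made for this example, and it relies on GNU find, sort and xargs.

```shell
# checksums_create DIR MANIFEST
# Record a SHA-1 checksum for every regular file below DIR.
# MANIFEST should live outside DIR so it does not end up hashing itself.
checksums_create() {
    ( cd "$1" && find . -type f -print0 | sort -z | xargs -0 -r sha1sum ) > "$2"
}

# checksums_verify DIR MANIFEST
# Re-hash the files listed in MANIFEST. Prints only the files that fail
# and exits non-zero if any file is missing or has different content.
checksums_verify() {
    ( cd "$1" && sha1sum --quiet -c ) < "$2"
}
```

Run `checksums_create ~/music ~/music.sha1` once, then `checksums_verify ~/music ~/music.sha1` before each backup. The obvious limitation is that a file you edited on purpose is reported as failed too, so this fits best for data that rarely changes, such as photos or music.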

This doesn't solve all problems, though. The second issue is: How do you deal with your own stupidity? :-) Let's say you accidentally deleted lots of your photos. If you don't notice that (which can happen, of course), eventually, you will no longer have a backup of these files because you can usually only store the last ~5 backups or so.

Honestly, I'm not sure how to solve this. I now run a tool like "bitrot" before creating backups that tells me exactly which files have been removed. Let's see how that works out. Since I only run "bitrot" on data like music or photos, there shouldn't be too much "noise" (i.e., temporary files).

The third issue is: How can you efficiently verify that your backups aren't corrupt?

Keep in mind that a typical workstation or laptop has an SSD with, say, 256GB of capacity. That thing is fast as hell. Plus, your CPU is very fast. This means that running a tool like "bitrot" is easy to do on a workstation and usually completes within a couple of minutes. But running the same tool on your NAS with ~2TB or more of HDDs will literally take hours. So that's not an option.

Plus, I not only do backups from my workstation and store them on my NAS. I also create full backups of my NAS, store it on a USB drive, and put that USB drive in a safe place in a different building. (I only do this kind of backup every 1-2 weeks.) That leaves me with soooooooo much more "unverified" data.

The answer probably lies in file systems like ZFS or btrfs. They have checksums. They automatically verify data. I should just go ahead and use one of them (if it weren't so much trouble ...).

(All this makes me wonder why there are only two well known file systems that create checksums of data. Why don't all file systems do that? I mean, 20-30 years ago, hard drives failed all the time. Floppy disks failed all the time. Even today, USB sticks fail very often. Still, nobody really cared ... Of course, older drives were really small and CPUs back then were really slow. I still think it's strange that checksums are not a core feature of every file system out there.)
venam
Administrators
(25-09-2016, 12:55 PM)vain Wrote: I used to do what most people do: Attach a USB drive or boot up my NAS, and then backup data using rsync. The thing is, how do you know that the files you're backing up are actually intact? How do you verify that you're not creating the 8th backup in a row of a file that has been corrupt for two weeks? Backups are worthless if you should have restored a file from the old backup instead of creating a new corrupt backup.
You're touching one big concern.
It's the same concern people have when sending data over anything.
That's the reason we have checksums that come with big ISO files.

(25-09-2016, 12:55 PM)vain Wrote: The third issue is: How can you efficiently verify that your backups aren't corrupt?
(25-09-2016, 12:55 PM)vain Wrote: https://github.com/ambv/bitrot
It's basically "sha1sum on steroids". It makes it easy to scan large data sets, compute checksums, and verify them later. There are other tools like it. Maybe for your use case, sha1sum actually is good enough.
Yep, I've tried it, it's pretty nice.
I like the idea of keeping track of changes in bulk.
I'm not sure it's the best way but it's a good approach.

(25-09-2016, 12:55 PM)vain Wrote: Even today, USB sticks fail very often. Still, nobody really cared ...
If you don't plug in a USB stick for a while, the data on it fades away.
Every technology is a time bomb.

Probably incremental backups with diffs can solve this.
Maybe by showing the diffs, like rsync with the verbose option.
movq
Long time nixers
(25-09-2016, 01:33 PM)venam Wrote: Yep, I've tried it, it's pretty nice.
I like the idea of keeping track of changes in bulk.
I'm not sure it's the best way but it's a good approach.

There's also "shatag" which can save the checksums in actual file metadata if your file system supports it (xattrs -- not available on OpenBSD and NetBSD, are they?):

https://bitbucket.org/maugier/shatag

This could be nicer because checksums aren't lost on file renames. Haven't tried this one, yet.

(25-09-2016, 01:33 PM)venam Wrote: Probably incremental backups with diffs can solve this.
Maybe by showing the diffs, like rsync with the verbose option.

Yeah, I'm going to try something like that.

I was thinking about doing this "review process" more often than I do actual backups. For example, let's say you create one backup a week. If you had a tool that could give you a report each day (!) on which files have been removed/added/updated, that might be easier to handle than to just read this report once a week.

(Remember, this isn't about "noisy" data, but about stuff like music. There's not a lot going on there, so I guess that most of the time nothing happens at all.)
josuah
Long time nixers
I'm currently rolling my own "backup" script, which stores things like this:


<code>
<pre style="line-height: 8px;">
./                - root dir you want to back up
├── file1
├── dir1/
│   ├── file2
│   └── file3
├── file4
└── .version/     - where things are backed up
    ├── obj/      - the same files as in the root dir, but named after their hash
    │   ├── d41d8cd98f00b204e9800998ecf8427e - same content as file1
    │   ├── ba1f2511fc30423bdbb183fe33f3dd0f - same content as file2 and file3
    │   ├── c097062a51458c527f04b44c210efefc - same content as file4 (before)
    │   └── 1c93f779e81d1710e671e5355ee761ef - same content as file4 (now)
    ├── rev/      - "revisions": list of equivalences between hashes and filenames at a given time
    │   ├── fb82199922251c63d72b17a0860198e6 - initial version
    │   └── 4dca17472fcdda9affa259b318243a54 - file4 got edited
    └── history   - list of rev/* in chronological order
</pre>
</code>


How to do this, without all the CLI options and such:

<code>
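# Create the object store (obj/) and the revision store (rev/).
mkdir -p .version/obj .version/rev

# Hash every file outside .version/, keep the listing in .version/tmp,
# and copy each content not stored yet to obj/, named after its hash.
find . -type f ! -name .version ! -path '*/.version/*' -print0 | sort -z |
xargs -0 md5sum | tee .version/tmp | while read -r hash path
do [ -f ".version/obj/$hash" ] || cp "$path" ".version/obj/$hash"
done

# The listing becomes a revision, named after its own hash.
new="$(md5sum .version/tmp)" && new="${new%% *}" || exit 1
mv -f .version/tmp ".version/rev/$new"

# Objects and revisions are read-only once stored.
chmod -w .version/obj/* .version/rev/*

# Record the revision in the history unless it is already the latest one.
[ -f .version/history ] && old="$(tail -n 1 .version/history)" || old=''
[ -f .version/history ] && [ "$new" = "$old" ] ||
printf '%s\n' "$new" >> .version/history
printf 'old %s\nnew %s\n' "$old" "$new"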
</code>


This makes versioning easy: creating a new version only takes adding one new <code>./.version/rev/*</code> text file (md5sum output, ~100 lines).

This de-duplicates content across versions: a file that did not change is not stored twice, since it already exists in <code>.version/obj/&lt;hash-of-the-file&gt;</code>, even if the file gets renamed.

I may even add merge operations to gather content from multiple storage locations and then merge it with <code>version merge tree1 tree2</code>.

It seems that it is a bit like what git does: https://git-scm.com/book/en/v2/Git-Inter...it-Objects

For now, I use md5sum because it is faster and I am just starting to play with it. The process is rather slow compared to git, but it's only a 227-line shell script for now...

So far, backing up data and switching between revisions works. You can safely run <code>version</code> without arguments to get a short usage message and start playing with it.

[EDIT] formatting



I just bought a hard drive enclosure (sorry for this being some kind of ads).

(25-09-2016, 12:55 PM)vain Wrote: ...

I hope this can address a few of these issues, along with my script generating md5sums and comparing them with the previous revision at every version (diff with <code>version diff</code>).

As I cannot produce software with 10 years of testing and a lot of work to make it run everywhere, I bet on simplicity to keep things safe:

The script never deletes data: `rm` never acts on the read-only <code>./.version/{rev,obj}/*</code> content. So in case of disaster, <code>version revision &lt;revision-hash&gt;</code> restores everything.
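
Since each file under `.version/rev/` is plain md5sum output, restoring a revision can be sketched in a few lines. This is a guess at what `version revision` might do, not the actual script; `restore_revision` and its optional destination argument are hypothetical.

```shell
# restore_revision REV [DEST]
# Copy every file recorded in .version/rev/REV back from .version/obj/.
# Run from the directory that contains .version/. DEST defaults to ".".
restore_revision() (
    rev=$1
    dest=${2:-.}
    [ -f ".version/rev/$rev" ] || { echo "unknown revision: $rev" >&2; exit 1; }
    while read -r hash path; do
        mkdir -p "$dest/$(dirname "$path")" || exit 1
        cp -f ".version/obj/$hash" "$dest/$path" || exit 1
        chmod u+w "$dest/$path"     # objects are read-only, working files are not
    done < ".version/rev/$rev"
)
```

In keeping with the "never delete" rule, this sketch only writes files. Anything that exists in the working tree but not in the chosen revision is left alone.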

I will let you know in the Unix Diary thread if I encounter a great disaster with this! ^_^

One problem: it does not provide any way to backup data on a remote host. Z3bra's <code>synk</code> may be much better for this purpose. Maybe I could combine them some day.
pranomostro
Long time nixers
One thing: I recently saw a quite interesting video about GPG (https://begriffs.com/posts/2016-11-05-ad...gnupg.html) and the presenter stated that one should not back up ~/.gnupg/random_seed if one is interested in keeping one's own communications secure.

So I opened all my backups, deleted random_seed from ~/.gnupg/.

My backup process is quite simple: it backs up the files that were changed since the last backup (daily) and makes a full backup at the beginning of every month. To detect changes it uses stat --format="%Y" rather than a checksum, which I think is faster and makes it easier to also back up new files. That makes the resulting files smaller and still provides a comfortable fallback in case anything goes wrong.
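
The "changed since the last backup" selection can be sketched as follows. Note that this uses a different technique from the script described above: instead of comparing `stat --format="%Y"` values, it compares against a timestamp file with `find -newer`. The function name and the file names in the comments are assumptions.

```shell
# changed_since STAMP DIR
# Print the regular files under DIR modified after the timestamp file STAMP.
# Without a STAMP file (first run), every file is listed.
changed_since() {
    if [ -f "$1" ]; then
        find "$2" -type f -newer "$1"
    else
        find "$2" -type f
    fi
}

# Possible daily run (file names are examples; "tar -T -" reads the list
# of files to archive from standard input):
#   touch stamp.new                  # taken before scanning, so files that
#   changed_since stamp ~/docs |     # change during the run are picked up
#       tar -cf daily.tar -T -       # again next time
#   mv stamp.new stamp
```

Like any mtime-based approach, this does not notice deleted files, which is one more reason to keep the monthly full backup.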

I should start copying my files to a remote host. Until now I have been copying my backups to an external HDD and a USB stick every 2 months or so. That's a bit careless, but my important programming projects are on github anyway.

If I started anew, I would encrypt my hard drive from the beginning, make encrypted backups (using GPG) and simply store my encrypted files on dropbox or something like that. It would not be 100% privacy oriented, but more comfortable than my current solution. Also, I should use an existing solution (I did not do enough research about that, could anyone recommend a stable one?). But for now, everything is simple and works fine (and I have already used my backups to restore files).

My current backup script: https://raw.githubusercontent.com/pranom.../master/bu

cgdc (for copying changed files): https://raw.githubusercontent.com/pranom...aster/cgdc

Update: I considered this one: https://github.com/Pricetx/backup but unfortunately it seems to keep passwords in plaintext in a configuration file https://github.com/Pricetx/backup/blob/m...backup.cfg instead of using a password manager command.

Update 2: Or maybe is there a venti (https://en.wikipedia.org/wiki/Venti) version for unix?
jkl
Long time nixers
Quote:For dotfiles, everyone uses a Git hosting website.

Actually, not.

On-topic: My backup procedures differ depending on which platform I'm on.

On Windows:

* My not-so-secret project files are entirely stored in my Dropbox, hard-linked to my local Projects directory. I randomly mirror the project folder with robocopy /MIR.
* My not-so-not-so-secret files reside in an encrypted folder in my Dropbox.

On BSD:

* My BSD laptop is encrypted. No files are stored outside that laptop. A hard disk failure would cost me some files. I can perfectly live with that.
* My BSD servers are unencrypted but double-protected by RAID and snapshots.

--
<mort> choosing a terrible license just to be spiteful towards others is possibly the most tux0r thing I've ever seen
venam
Administrators
(03-01-2017, 10:39 AM)jkl Wrote: * My BSD servers are unencrypted but double-protected by RAID and snapshots.
Are those physical servers?
How is the architecture of the whole thing (hardware and software)?
z3bra
Grey Hair Nixers
I wanted to start an open discussion about backups. I can't believe this topic hasn't been brought on the forums yet!

This is a pretty wide topic, and there are many points to cover.
We all have different setups, different data, ideas, needs... And there are so many different solutions to choose from.

There's no specific question here, you can drop your ideas about backups, explain yours, talk about good practices, ...


Of course, I'll open the discussion:

I'm convinced there are only 2 types of backups: configuration and data.

Configuration backups help you recover faster than starting from scratch, so they're not strictly necessary, but they help a lot. They MUST come with a recovery plan.

Data backups are for data that cannot be regenerated and is specific to a context. That can be pictures of your cat, emails, cryptographic keys or user settings in a database.
This usually results in a large amount of data, and you need to be really careful not to screw it up!

I currently use tarsnap for my servers' configuration and emails, as this is pretty sensitive. Each server has its own key and uploads its configs directly. I don't have any recovery procedure yet (I know it's bad), but it's basically "reinstall the server and extract all data back onto it".
I also started using drist for configuration's management of these servers.

For the data... I'm still wondering how to proceed. All my data currently lives on a 1TB drive in my computer. I also back up my ssh keys, pgp keys and /home there (using cp/tar).
I have a USB drive (1TB as well) gathering dust somewhere that I'm supposed to use as an offline backup, but I hardly ever do it...
I recently subscribed to backblaze B2, which is a cheap cloud storage solution. I'm planning on using it as an offsite backup solution, but I need to find a good way to reduce my data size and encrypt it first.
For the size, I'll use dedup to deduplicate the data. Now I need a good/reliable way to encrypt the data before uploading it "to the cloud".
I'd also want a 3rd location, possibly in a place I can control (e.g. my mom's house or a friend's).

That's it! It's far from perfect, but I'm fully aware (I think) of the flaws of this setup, and I think it will not be that bad once it is finished.
If you have ideas that could help me, I'll hear them with pleasure!
BANGARANG, MOTHERFUCKER
venam
Administrators
(19-04-2019, 08:39 AM)z3bra Wrote: I wanted to start an open discussion about backups. I can't believe this topic hasn't been brought on the forums yet!
It was discussed before a bit in this thread about Backing up and Deploying.
There were a lot of great ideas shared in that old thread.
I'll merge the posts.
z3bra
Grey Hair Nixers
Friendly reminder: check your backups. Today.

I lost my internal 1TB HDD 2 days ago. No idea why. I rebooted my computer, and dmesg started spitting out stuff like this:

Code:
[    1.138082] ata5.00: READ LOG DMA EXT failed, trying PIO
[    1.138610] ata5: failed to read log page 10h (errno=-5)
[    1.139096] ata5.00: exception Emask 0x1 SAct 0x400 SErr 0x0 action 0x0
[    1.139581] ata5.00: irq_stat 0x40000001
[    1.140072] ata5.00: failed command: READ FPDMA QUEUED
[    1.140569] ata5.00: cmd 60/08:50:00:00:00/00:00:00:00:00/40 tag 10 ncq dma 4096 in
                        res 51/04:50:00:00:00/00:00:00:00:00/40 Emask 0x1 (device error)
[    1.141606] ata5.00: status: { DRDY ERR }
[    1.142120] ata5.00: error: { ABRT }
[    1.142776] ata5.00: configured for UDMA/133 (device error ignored)
[    1.143727] ata5: EH complete

Now my HDD is inaccessible, with my whole /home and /var/data on it (including /var/data/backups!). Luckily, I made a one-time backup in May, so I didn't lose everything. But I lost 2 months' worth of data (and I had created a bunch of it since then!).

Oh, and the irony, I lost the configuration scripts to setup my backups !
opfez
Members
Damn, thanks for reminding me. I'll backup my stuff tomorrow.

(31-07-2020, 03:55 PM)z3bra Wrote: check your backups. Today.
Yeah, yeah, it's late, ok?
z3bra
Grey Hair Nixers
So quick update, my HDD definitely died, and I gave up on it… It's still plugged into my computer, "just in case", but I doubt I'll ever retrieve my data.

In reaction to this, I finally automated my backup system, which involves the following:
  • safe - to store API tokens, private keys, passwords, …
  • dedup - encrypted + compressed deduplicating snapshotting system
  • drist - automated configuration (similar to ansible/terraform)
  • rclone - rsync for "online clouds" (used for backblaze)
  • tarsnap - backup my servers' configuration (automated via cron)
  • backblaze - cheapest online storage provider (B2 storage)

All backups are done from within "cron", and I receive an email if any command produces output (which basically means an error).
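
This works because cron mails whatever a job prints. For backup commands that are chatty even when they succeed, a small wrapper can keep the "mail means trouble" property. It is only a sketch, `quiet_unless_failed` is a made-up name, and it is similar in spirit to `chronic` from moreutils:

```shell
# quiet_unless_failed COMMAND [ARGS...]
# Run COMMAND with its output captured. Print the output only if COMMAND
# fails, so a cron job stays silent (and sends no mail) on success.
quiet_unless_failed() (
    log=$(mktemp) || exit 1
    if "$@" >"$log" 2>&1; then
        status=0
    else
        status=$?
        printf 'FAILED (exit %s): %s\n' "$status" "$*"
        cat "$log"
    fi
    rm -f "$log"
    exit "$status"
)
```

A cron script would then call, for example, `quiet_unless_failed some-backup-command --its-options`, where the command name stands for whatever tool is being run.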

I configure my servers using drist. It also configures the backup system, by pushing the tarsnap private key to each server, as well as the list of directories/files to back up. I try to only back up what I really need, so I don't have to skim through a huge tarball and guess which files I need in case of a recovery.
These backups are done every week from within /etc/weekly.local, so I get a report whenever they run. I trust tarsnap on the topic of encryption, so I don't bother encrypting the data before sending it to their servers.

For my personal computer, I use backblaze storage, which is not encrypted. Tarsnap can be expensive when you have lots of data, so I decided on keeping it ONLY for config files (a few MB at most).
On backblaze I have 3 buckets: one for my password store, one for the drist configs, and the last one for my data (pics, videos, …).

As I use safe, all my passwords are stored encrypted on disk, so I just upload them as-is. For drist, I want to store it encrypted, and possibly version it. The best way to do this is using dedup, which basically deduplicates data and stores "snapshot" archives after encrypting and compressing them. The dedup key used to encrypt the repo is obviously stored in the safe I mentioned earlier :)
Then, I push the whole dedup repo to backblaze using rclone, knowing that my data is safe. I do the same for my other data (dedup + backblaze).

All I need now is a way to practice recovery !
venam
Administrators
Someone recently asked on IRC about how others went about doing backups.
So what's your current backup solution?
fre d die
Members
I boot from a live USB and use dd to back up the system onto an external HDD. I try to do this regularly, but I am very forgetful...
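
A sketch of that kind of dd backup with a read-back check added, in the spirit of "if it's not tested, it's not a backup". The function name and the device name in the comment are examples only, and the `conv=fsync` and `status=none` operands assume GNU dd.

```shell
# image_device SRC DEST
# Copy SRC (a disk or partition, e.g. /dev/sda from the live system) to the
# image file DEST, then compare both byte for byte.
image_device() (
    src=$1
    dest=$2
    # conv=fsync makes dd flush the image to disk before it exits
    dd if="$src" of="$dest" bs=4M conv=fsync status=none || exit 1
    # Exits non-zero if the image differs from the source
    cmp "$src" "$dest"
)
```

The comparison doubles the amount of reading, but it catches a bad write on the external disk right away. For interactive use, `status=progress` instead of `status=none` shows how far dd has got.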

Edit:
Luckily I don't have anything of much value stored on my laptop, other than my music library, which is stored on an mp3 player as well... but that often ends up corrupting the files, as it runs on a micro SD card...
jkl
Long time nixers
Currently I’m cross-storing .tbz2 backups between my servers. I hope that not too many of them will stop working at the same time.
z3bra
Grey Hair Nixers
Edit : Removed my whole description, because I'm stupid. I already described my setup last year, in this same thread. Like, 4 posts above… No need to spam redundant info here, sorry for the noise !
venam
Administrators
A little story reminded me of this thread. What do you do when the backup solution is automatic and you notice the corruption after the backup has already taken place?
I guess incremental backups would solve this, but you'd still lose everything in between the last "good" incremental backup and the next "bad" one.