(This is part of the podcast discussion extension)

Backing up and Deploying

Link of the recording [ ]

Everything about transporting your stuff somewhere else.

What was discussed earlier:
For dotfiles, everyone uses a Git hosting website.
As for other data it differed greatly.

-- Show Notes
I think with dotfiles, the deployment of them is made easier if accompanied by a script that sets up the symlinks once you've cloned your repo. Some applications lend themselves to this really well too, such as emacs. I clone my .emacsd repo, install emacs and run it, and it downloads any packages I have specified and configures itself the way I want.
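A minimal sketch of such a symlink script, in Python. The repository location (~/dotfiles) and the rule "link every top-level entry except .git" are assumptions for illustration, not anyone's actual setup. Existing files are skipped rather than overwritten.

```python
from pathlib import Path


def link_dotfiles(repo: Path, home: Path) -> list:
    """Symlink every top-level entry of `repo` into `home`.

    Entries that already exist in `home` are left untouched.
    Returns the links that were created.
    """
    created = []
    for source in sorted(repo.iterdir()):
        if source.name == ".git":
            continue  # repository metadata is not a dotfile
        target = home / source.name
        if target.exists() or target.is_symlink():
            continue  # never clobber something that is already there
        target.symlink_to(source.resolve())
        created.append(target)
    return created


if __name__ == "__main__":
    # Hypothetical location: adjust to wherever the repo was cloned.
    for link in link_dotfiles(Path.home() / "dotfiles", Path.home()):
        print(f"linked {link}")
```

Running it a second time creates nothing new, so it is safe to re-run after pulling changes.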

I think a podcast about backups could go for weeks on end if you covered everyone's preferences. In the end, I don't care how people do it as long as they do actually back up.

Golden rule too is, if it's not tested, it's not a backup.
Not sure if this is the right place to ask,
but what backup solution do you use?

I am thinking about using bup since it seems very flexible and is basically just a collection of small tools which do the job.

I looked into borg backup and duplicity, but I don't like their way of excluding/including files: they seem to include everything and only exclude the files you specify, and I would want finer control.
We had a podcast about backups where we discussed many of the solutions available.

Personally, I just use a script that I manually fire and that rsyncs the files to my external hard disk.
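For illustration, a manually fired rsync script of this kind could be sketched as below. The paths are hypothetical and the flag choice (--archive, plus an optional --dry-run) is an assumption, not the actual script mentioned above.

```python
import subprocess


def rsync_command(source: str, destination: str, dry_run: bool = False) -> list:
    """Build an rsync invocation that copies `source` into `destination`."""
    command = ["rsync", "--archive"]
    if dry_run:
        command.append("--dry-run")  # report what would change, copy nothing
    # A trailing slash on the source copies the directory's contents
    # rather than the directory itself.
    command += [source.rstrip("/") + "/", destination]
    return command


if __name__ == "__main__":
    # Hypothetical paths: home directory to a mounted external disk.
    subprocess.run(rsync_command("/home/me", "/mnt/external/home"), check=True)
```

Trying it with dry_run=True first shows what would be copied before anything is written to the disk.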

If you find it more appropriate we can move this discussion to the backups and restore thread.
Hey venam, thanks for your reply! Indeed, it seems good to move it to the existing thread. I cannot check the mp3 right now but will do so when I am back home. Can this thread be merged into it? (Remove the first 2 sentences I wrote from the first post.)

I was using rsync first, but it's not compressing and it creates too much data for me. So I would have to use another solution in combination with it, since I want to be able to go back in time with my backups.
I'm actually working on a tool to synchronize files across multiple hosts. It could be used for HA setups, or for backups:
It's still a work in progress, but it managed to get a file synchronized between 3 hosts for me without issue.

In regards to backups, I wrote a post about it a while ago:
This was more an experiment than a *real* backup solution, but it can still give a few tips.
Hey z3bra, no offence, but rsync combined with borg/bup is imo in most cases a more suitable solution than a logrotate on that file.
While yours seems to be good enough for simple config files that can be easily backed up, it fails with massive data like my video or audio collection.
(I do also understand it's just a pointer into the right direction)

Right now I am looking for a backup solution where I can go back in time and have checksums and incremental snapshots, like a mix between rsync and bup (or borg). Borg seems to handle it all very well and has been tested with 10TB of storage and more. However, I'm not fully convinced yet, so I am still looking for alternatives. (I don't like that everything is included and you have to write an excessive list of excludes or let it be generated via a find command.)

I will give your "synk" a look and may report back for duty.
No worries, I'm not offended ;)

This post was more an experiment / proof of concept. While it doesn't scale well, it was definitely fun to set up.
For synk, keep in mind that it's still a work in progress. As the only user, I couldn't test many cases, so I wouldn't want to be responsible for losing your 10TB of data!
synk uses ssh + rsync as a back end. You could use it for incremental backups without much issues I think. I'd be glad to have another user!

Quick question though. Do you plan on hosting the backups yourself, or rely on some sort of service for that?
So I've set up my backup solution, which mostly consists of a borg backup setup.

There are 4 repositories:
system-config: Contains data from /etc, /var and the like
system-data: Contains a list of all installed official and AUR packages, a list of all installed python packages, a backup of the LVM + LUKS headers, and a raw dd image of the /boot partition.
user-config: Contains all configuration from ~/.config and ~/.local
user-data: Contains all real user data like documents, movies, songs, savegames, code and the like.

In general I have to admit my home directory is pretty symlinked. Files/Folders are either linked to /data/private/$username/$folder or to /data/protected/$folder.

Dotfiles are stored in ~/dotfiles, which is a symlink to /data/private/$username/dotfiles. Those dotfiles are then linked to ~/.config/$program and the like.

Program data (like mails, hedgewars gamedata etc.) is stored inside /data/private/$username/appdata, whose subdirectories are symlinked from ~/.$program (like .mozilla, .thunderbird and the like).

This setup took quite some time, but ultimately it forces good behavior and splits the data into 4 maintainable repositories.
Dotfiles/Appdata that is not specially symlinked gets lost, thus forcing me to keep my ~ clean and updated.

When I create backups using borg I create them with this scheme:
For user-* backups (user-config and user-data): user-(config|data)::$username@$date_$time
And for system-* backups (system-config and system-data): system-(config|data)::$hostname@$date_$time
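As a sketch, the naming scheme above could be generated like this. The exact date and time formats are not stated in the post, so YYYY-MM-DD and HH-MM-SS are assumptions here, as is the helper name archive_name.

```python
import getpass
import socket
from datetime import datetime


def archive_name(repository: str, now: datetime, owner: str = None) -> str:
    """Return 'repository::owner@date_time' for a user-* or system-* repo.

    user-* archives are tagged with the user name, system-* archives
    with the host name, unless `owner` is given explicitly.
    """
    kind = repository.split("-", 1)[0]
    if kind not in ("user", "system"):
        raise ValueError(f"unexpected repository name: {repository}")
    if owner is None:
        owner = getpass.getuser() if kind == "user" else socket.gethostname()
    # Assumed formats: ISO date, and dashes instead of colons in the time.
    return f"{repository}::{owner}@{now:%Y-%m-%d}_{now:%H-%M-%S}"


if __name__ == "__main__":
    for repo in ("user-config", "user-data", "system-config", "system-data"):
        print(archive_name(repo, datetime.now()))
```

The resulting string has the REPOSITORY::ARCHIVE shape that borg expects when creating an archive.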

I will now connect this system to a cloud backup solution. The good thing with borg is that it's already encrypted by default, so syncing to the cloud will not be of much concern to me.

That's it for now folks!
I'm probably the last person on earth to fully realize the impact of this issue, but then again, it doesn't get much attention, either: Data integrity.

I used to do what most people do: Attach a USB drive or boot up my NAS, and then back up data using rsync. The thing is, how do you know that the files you're backing up are actually intact? How do you verify that you're not creating the 8th backup in a row of a file that has been corrupt for two weeks? Backups are worthless if you should have restored a file from the old backup instead of creating a new corrupt backup.

Rarely, people address this issue.

File systems are of little help. The common file systems just store whatever data you tell them to store and then blindly trust the hard drive. Was it written successfully? Were there I/O errors? When reading the data later, do we get back the same data? Nobody really knows. See also:

I only know of two "big" file systems that take care of this: ZFS and btrfs. They create checksums for both the metadata and the actual data. Given the number of bugs, I don't fully trust btrfs yet (does anybody here use it?). ZFS is an alien in the Linux world, so I'm very reluctant to use that one, either. Maybe I'll use it on my NAS in the future.

To be fair, the last time I had a drive that silently corrupted data was about ~10 years ago. Sometimes, files just contained garbage. SMART showed a steadily rising "reallocated sector count". Modern hardware seems to be more reliable. Still, it can happen.

So, the first issue is: Before creating a new backup, you should verify the integrity of the data you're about to back up.

If you're like me, you use Git whenever possible. Most of the really important data on my drive is in Git repos. It's not limited to code, it also includes letters and the like. The nice thing about Git is that it computes a lot of checksums, because it uses them to address objects. That is, when you commit a file, Git automatically hashes it and only refers to that file by its hash. Eventually, commit objects are hashed as well and that's how you end up with "revision e6865e9". Because everything has a checksum, you can do a "git fsck --full" to verify that each object (including each version of each file you ever committed) is still intact. That's awesome. You should do that for all your Git repos before creating backups.
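A rough sketch of checking every repo in one go, assuming all repositories live somewhere below a single directory (the home directory is used here as a hypothetical root). The helper names are made up for illustration. It does not descend into a repository once found, so nested repos are not checked separately.

```python
import os
import subprocess
from pathlib import Path


def find_git_repos(root: Path) -> list:
    """Return every directory below `root` that contains a .git directory."""
    repos = []
    for current, dirs, _files in os.walk(root):
        if ".git" in dirs:
            repos.append(Path(current))
            dirs.clear()  # do not descend into the repository itself
    return sorted(repos)


def fsck_ok(repo: Path) -> bool:
    """Run 'git fsck --full' in `repo`; True means git reported no errors."""
    result = subprocess.run(
        ["git", "-C", str(repo), "fsck", "--full"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


if __name__ == "__main__":
    # Hypothetical root: point this at wherever your repositories live.
    for repo in find_git_repos(Path.home()):
        print(("ok      " if fsck_ok(repo) else "CORRUPT ") + str(repo))
```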

But how about large data? You don't want to have all your photos or music in a Git repo. That would be a huge waste of space. I have no experience with tools like git-annex, maybe they can be of use here, too. What I did discover recently is this tool:

It's basically "sha1sum on steroids". It makes it easy to scan large data sets, compute checksums, and verify them later. There are other tools like it. Maybe for your use case, sha1sum actually is good enough.
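To make the idea concrete, here is a minimal sketch of the "compute checksums now, verify them later" approach using only Python's hashlib. It is not the tool mentioned above, just an illustration. The function names are made up, and SHA-1 is chosen only to match sha1sum.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-1 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file below `root` (as a relative path) to its checksum."""
    return {
        str(path.relative_to(root)): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify(root: Path, manifest: dict) -> list:
    """Return files that still exist but no longer match their checksum."""
    return sorted(
        name
        for name, recorded in manifest.items()
        if (root / name).is_file() and file_digest(root / name) != recorded
    )
```

A mismatch is not necessarily corruption, since the file may have been edited on purpose. A real tool has to tell those cases apart, for example by also looking at modification times.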

I recently started to do these two extra steps before creating new backups -- "git fsck --full" and run a tool like "bitrot". Only if they tell me that everything's fine, I go ahead and do the backup. I have a much better feeling about the quality of my backups now.

This doesn't solve all problems, though. The second issue is: How do you deal with your own stupidity? :-) Let's say you accidentally deleted lots of your photos. If you don't notice that (which can happen, of course), eventually, you will no longer have a backup of these files because you can usually only store the last ~5 backups or so.

Honestly, I'm not sure how to solve this. I now run a tool like "bitrot" before creating backups that tells me exactly which files have been removed. Let's see how that works out. Since I only run "bitrot" on data like music or photos, there shouldn't be too much "noise" (i.e., temporary files).
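The "which files have been removed" report can be sketched as a comparison of two checksum listings (path mapped to checksum), for example one saved at the previous backup and one taken now. The function name and the sample data below are hypothetical.

```python
def compare_manifests(old: dict, new: dict) -> dict:
    """Compare two {path: checksum} listings taken at different times."""
    return {
        "removed": sorted(old.keys() - new.keys()),
        "added": sorted(new.keys() - old.keys()),
        "changed": sorted(
            name for name in old.keys() & new.keys() if old[name] != new[name]
        ),
    }


if __name__ == "__main__":
    # Hypothetical listings from two consecutive runs.
    previous = {"photos/a.jpg": "1111", "photos/b.jpg": "2222"}
    current = {"photos/a.jpg": "1111", "photos/c.jpg": "3333"}
    report = compare_manifests(previous, current)
    if report["removed"]:
        print("Missing since last run:", ", ".join(report["removed"]))
```

Refusing to start the backup while the "removed" list is non-empty and unreviewed would catch accidental deletions before the last good copy rotates out.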

The third issue is: How can you efficiently verify that your backups aren't corrupt?

Keep in mind that a typical workstation or laptop has an SSD with, say, 256GB of capacity. That thing is fast as hell. Plus, your CPU is very fast. This means that running a tool like "bitrot" is easy to do on a workstation and usually completes within a couple of minutes. But running the same tool on your NAS with ~2TB or more of HDDs will literally take hours. So that's not an option.

Plus, I not only do backups from my workstation and store them on my NAS. I also create full backups of my NAS, store them on a USB drive, and put that USB drive in a safe place in a different building. (I only do this kind of backup every 1-2 weeks.) That leaves me with soooooooo much more "unverified" data.

The answer probably lies in file systems like ZFS or btrfs. They have checksums. They automatically verify data. I should just go ahead and use one of them (if it weren't so much trouble ...).

(All this makes me wonder why there are only two well known file systems that create checksums of data. Why don't all file systems do that? I mean, 20-30 years ago, hard drives failed all the time. Floppy disks failed all the time. Even today, USB sticks fail very often. Still, nobody really cared ... Of course, older drives were really small and CPUs back then were really slow. I still think it's strange that checksums are not a core feature of every file system out there.)