movq
Long time nixers
I'm probably the last person on earth to fully realize the impact of this issue, but then again, it doesn't get much attention, either: Data integrity.

I used to do what most people do: Attach a USB drive or boot up my NAS, and then back up data using rsync. The thing is, how do you know that the files you're backing up are actually intact? How do you verify that you're not creating the 8th backup in a row of a file that has been corrupt for two weeks? A backup is worthless if what you really needed was to restore the file from an older backup instead of creating yet another corrupt one.
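
To give an idea of what I mean: the backup itself is nothing fancy, roughly along these lines (the paths are placeholders, of course):

    # Mirror $HOME to a mounted USB drive or NAS share.
    # -a preserves permissions and timestamps, --delete mirrors deletions.
    rsync -a --delete ~/ /mnt/backup/home/

rsync happily copies whatever it finds on disk, corrupt or not -- and that's exactly the problem.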

People rarely address this issue.

File systems are of little help. The common file systems just store whatever data you tell them to store and then blindly trust the hard drive. Was it written successfully? Were there I/O errors? When reading the data later, do we get back the same data? Nobody really knows. See also: http://danluu.com/file-consistency/

I only know of two "big" file systems that take care of this: ZFS and btrfs. They create checksums for both the metadata and the actual data. Given the number of bugs, I don't fully trust btrfs yet (does anybody here use it?). ZFS is an alien in the Linux world, so I'm very reluctant to use that one as well. Maybe I'll use it on my NAS in the future.

To be fair, the last time I had a drive that silently corrupted data was about 10 years ago. Sometimes, files just contained garbage. SMART showed a steadily rising "reallocated sector count". Modern hardware seems to be more reliable. Still, it can happen.
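
If you want to keep an eye on that yourself, smartmontools can show you the raw attributes (/dev/sda is just a placeholder here):

    # Print the SMART attribute table and look for reallocated sectors.
    smartctl -A /dev/sda | grep -i reallocated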

So, the first issue is: Before creating a new backup, you should verify the integrity of the data you're about to backup.

If you're like me, you use Git whenever possible. Most of the really important data on my drive is in Git repos. It's not limited to code; it also includes letters and the like. The nice thing about Git is that it computes a lot of checksums, because it uses them to address objects. That is, when you commit a file, Git automatically hashes it and only refers to that file by its hash. Eventually, commit objects are hashed as well and that's how you end up with "revision e6865e9". Because everything has a checksum, you can do a "git fsck --full" to verify that each object (including each version of each file you ever committed) is still intact. That's awesome. You should do that for all your Git repos before creating backups.
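
If all your repos live in one place, checking them before a backup is a quick loop -- ~/src here is just an example layout:

    # Run a full consistency check on every repo under ~/src.
    for repo in ~/src/*/; do
        echo "== $repo"
        git -C "$repo" fsck --full || echo "PROBLEM in $repo"
    done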

But how about large data? You don't want to have all your photos or music in a Git repo. That would be a huge waste of space. I have no experience with tools like git-annex; maybe they can be of use here, too. What I did discover recently is this tool:

https://github.com/ambv/bitrot

It's basically "sha1sum on steroids". It makes it easy to scan large data sets, compute checksums, and verify them later. There are other tools like it. Maybe for your use case, sha1sum actually is good enough.
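
If plain sha1sum turns out to be enough for you, something like this already goes a long way (~/photos is just an example):

    # First run: record a checksum for every file under ~/photos.
    find ~/photos -type f -print0 | xargs -0 sha1sum > ~/photos.sha1

    # Later runs: verify; corrupt or missing files are reported as FAILED.
    sha1sum -c --quiet ~/photos.sha1

The downside compared to "bitrot": this doesn't notice new files, and it can't tell an intentional edit from silent corruption (as far as I understand, "bitrot" compares modification times to make that distinction).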

I recently started doing two extra steps before creating new backups: run "git fsck --full" and run a tool like "bitrot". Only if they tell me that everything's fine do I go ahead and do the backup. I have a much better feeling about the quality of my backups now.
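
Roughly, my pre-backup routine now looks like this sketch -- the directories and the rsync target are placeholders, and I'm assuming "bitrot" is simply run inside the directory it should scan:

    #!/bin/sh
    set -e

    # 1. Verify every Git repo.
    for repo in ~/src/*/; do
        git -C "$repo" fsck --full
    done

    # 2. Verify the large, non-Git data (photos, music, ...).
    (cd ~/photos && bitrot)

    # 3. Only reached if nothing above exited with an error.
    rsync -a --delete ~/ /mnt/backup/home/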

This doesn't solve all problems, though. The second issue is: How do you deal with your own stupidity? :-) Let's say you accidentally deleted lots of your photos. If you don't notice that (which can happen, of course), you will eventually no longer have a backup of those files, because you can usually only keep the last five backups or so.

Honestly, I'm not sure how to solve this. I now run a tool like "bitrot" before creating backups, and it tells me exactly which files have been removed. Let's see how that works out. Since I only run "bitrot" on data like music or photos, there shouldn't be too much "noise" (i.e., temporary files).

The third issue is: How can you efficiently verify that your backups aren't corrupt?

Keep in mind that a typical workstation or laptop has an SSD with, say, 256GB of capacity. That thing is fast as hell. Plus, your CPU is very fast. This means that running a tool like "bitrot" is easy to do on a workstation and usually completes within a couple of minutes. But running the same tool on your NAS with ~2TB or more of HDDs will literally take hours. So that's not an option.

Plus, I don't just do backups from my workstation and store them on my NAS. I also create full backups of the NAS itself, store them on a USB drive, and put that drive in a safe place in a different building. (I only do this kind of backup every 1-2 weeks.) That leaves me with soooooooo much more "unverified" data.

The answer probably lies in file systems like ZFS or btrfs. They have checksums. They automatically verify data. I should just go ahead and use one of them (if it weren't so much trouble ...).
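
Both have a "scrub" operation that re-reads everything and compares it against the stored checksums -- the pool and mount point names below are just placeholders:

    # ZFS: verify all data in the pool (runs in the background).
    zpool scrub tank
    zpool status tank            # shows scrub progress and any errors found

    # btrfs: same idea for a mounted filesystem.
    btrfs scrub start /mnt/data
    btrfs scrub status /mnt/data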

(All this makes me wonder why there are only two well known file systems that create checksums of data. Why don't all file systems do that? I mean, 20-30 years ago, hard drives failed all the time. Floppy disks failed all the time. Even today, USB sticks fail very often. Still, nobody really cared ... Of course, older drives were really small and CPUs back then were really slow. I still think it's strange that checksums are not a core feature of every file system out there.)

