Suggestion for better data integrity
Posted: Sun Dec 05, 2010 10:46 am
TLDR at the bottom.
I am a Vail refugee and am interested in Amahi and Greyhole. However, I really want data integrity and protection. I've been looking at ZFS, and it sounds really good for catching corrupted files and automatically fixing them, except that it's not truly on Linux yet (FUSE doesn't count, and I want to use Linux programs like CrashPlan), you cannot EASILY add more drives to RAID-Z, and I cannot have the zpool selectively apply redundancy to certain files. It's all or nothing.
I know it's still early in development, but Greyhole gives me all of the above except the data integrity part (beyond keeping multiple copies of files). ZFS checksums a file when it is read and written, and users can scrub the pool, which I understand verifies the integrity of every file.
Can we do a poor man's checksum and data integrity with Greyhole?
I have been looking into md5sum to detect corrupted files and par2 to repair them. I'm not sure yet whether par2 can also detect a corrupted file on its own; if it can, md5sum would not really be needed. From the little I've read of par2, I'm also trying to understand whether it can repair any corruption whatsoever in a file, given enough space for the parity information, without creating a duplicate of the entire file.
Then I got to thinking, par2 might not really be necessary if Greyhole is keeping multiple copies of files.
It would take me a while since I'm not familiar with the ins and outs of Linux but I figured I could write a Python script to do this on a periodic basis. Thinking more, I thought it might be better if Greyhole was involved, at least with the first 2 points below.
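To make the moving parts of such a script concrete, here is a minimal sketch of the helpers it might be built around. It shells out to the par2cmdline binary, and the function names are just mine for illustration. As far as I can tell, par2 stores its own checksums of the source files, so a plain verify pass already flags corruption on its own; md5sum would mainly earn its keep as a lighter check for files that also have duplicate copies.

import hashlib
import subprocess

def md5_of(path, chunk_size=1024 * 1024):
    # Hash the file in chunks so large media files don't have to fit in RAM.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

def create_parity(path, redundancy=10):
    # Create .par2 recovery files next to the original.
    # redundancy is the percentage of parity data to generate.
    subprocess.check_call(['par2', 'create', '-r%d' % redundancy, path + '.par2', path])

def verify_and_repair(path):
    # 'par2 verify' exits 0 when the file matches its recovery set;
    # anything else means damage, so attempt a repair.
    if subprocess.call(['par2', 'verify', path + '.par2']) != 0:
        subprocess.check_call(['par2', 'repair', path + '.par2'])

md5_of() is straight hashlib; the par2 calls just drive the command-line tool, which seems simpler than trying to speak the PAR2 format directly.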
So I've come up with this idea:
1. When Greyhole writes a new file (a rough sketch of this step follows the list):
If the file is to have multiple copies, create an md5sum for each copy after it has been copied to each hard drive.
or
If there is only one copy of the file, create a par2 file so any corruption can be repaired (and an md5sum to detect it, if par2 alone can't).
Maybe not necessary or not possible (I don't know how Samba works):
2. If Samba keeps a log of the files that are read:
Keep a list of every file that was read so we can process it later. Then on some periodic basis X, maybe every 12 or 24 hours, check each file read in the last X hours for corruption and repair any errors (using a duplicate copy, or the par2 information for single-copy files).
3. Then on an even longer basis, maybe once a week, go over the entire Greyhole pool and check for corruption. For files with multiple copies, a corrupted copy detected via its md5sum can simply be overwritten with another copy that isn't corrupted; for single-copy files, use the par2 information to fix the errors. Email a report of which files were fixed and which hard drives they were on. (A rough sketch of this scrub also follows the list.)
Give users options for how often to run the check and maybe what percentage of the pool to check per run. Large pools could take a really long time, so maybe do 50% one week and the other 50% the next.
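To make point 1 a bit more concrete, here is a hypothetical hook that Greyhole could call once a file's copies have landed on their drives. It is not a real Greyhole interface, just the shape of the logic; it reuses md5_of() and create_parity() from the earlier sketch, and stored_md5 stands in for wherever the checksums would actually be recorded (Greyhole's MySQL database seems like the obvious place).

def after_write(copies, stored_md5):
    # copies: on-disk paths of the copies Greyhole just wrote, one per drive.
    # stored_md5: dict standing in for wherever the checksums get recorded.
    for path in copies:
        stored_md5[path] = md5_of(path)   # checksum every copy (point 1)
    if len(copies) == 1:
        create_parity(copies[0])          # single copy, so par2 is the only safety net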
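And for points 2 and 3, the scrub itself could look something like this. get_copies() is another made-up helper that would have to ask Greyhole where the other copies of a file live; the return value is meant to become the weekly email report.

import shutil

def scrub(paths_to_check, stored_md5, get_copies):
    # Re-hash each file and repair any mismatch from a good sibling copy,
    # falling back to par2 for single-copy files.
    report = []
    for path in paths_to_check:
        if md5_of(path) == stored_md5[path]:
            continue  # checksum matches, copy is intact
        # Look for a copy on another drive that still matches its recorded checksum.
        good = next((c for c in get_copies(path)
                     if md5_of(c) == stored_md5[c]), None)
        if good is not None:
            shutil.copy2(good, path)       # overwrite the corrupted copy
            report.append((path, 'restored from ' + good))
        else:
            verify_and_repair(path)        # no good copy left, try par2
            report.append((path, 'repaired with par2'))
    return report

For point 2, paths_to_check would just be whatever turned up in the Samba logs over the last X hours; for point 3, it would be the whole pool, or that week's 50% slice.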
That's my idea in a nutshell. Not being an expert on any of this, I don't know whether it would just be a waste of space, horrible to implement with a ton of corner cases, too inefficient to catch many errors, or a super awesome idea.
Any input would be appreciated.
Sean
TLDR - Use md5sum and par2 to detect and fix bit rot, hopefully in a manner similar to ZFS, while still retaining the advantages of an EASILY expandable storage pool.