Greyhole's lost loads of files. :( What other options?

dinomic
Posts: 65
Joined: Mon Jan 03, 2011 6:49 am

Postby dinomic » Fri Dec 28, 2012 4:37 pm

Hi all,

It seems that I've lost a lot of files because Greyhole wasn't too good at dealing with a failing drive.

Basically, I had 2 drives set up as "stickies" for 1 particular share (let's say drives 7 and 8). Drive 7 was the landing zone for this share, and Greyhole seems to have made the first copy onto drive 8 as a result (which is the drive that has slowly been dying). It then seems to have used that copy on drive 8 as the source for the copy on drive 7 too. That's why any "broken" file shows its full size on drive 8 (e.g. 8.2GB), but can't be copied anywhere else: the copy dies before it gets to the end (say at 5.3GB) with an "Invalid Argument" error, which you can then only "Skip".

This has happened to about 500GB worth of data, so I'm a bit annoyed. I'm trying to recover the bad sectors on drive 8, but my first attempt took about 2 days and doesn't seem to have "fixed" any of the bad files.

I'm therefore looking at any options. Is there any way that I can recover these files? If not, what other options do I have for drive pooling so that I can get rid of Greyhole? I generally want to keep drives paired up (almost like a RAID1), hence why I always tie 2 drives together for any given share using "stickies".

I'm looking for options, guys, because with 20 drives in my machine, I don't want to lose any more data, and I want to get this sorted once and for all.
Norco 4220 Case
Gigabyte GA-G33-DS3R Motherboard w/ 8GB RAM
LSI SAS2116 PCIe 6Gb/s SAS (replaced 3Ware 9690SA-4I & Chenbro CK1360)
1 x Hitachi 160GB 2.5" System Drive (original 1 x OCZ 60GB Vertex 2 SATAII 2.5" SSD died)
8 x 4TB Hitachi Deskstars
6 x 3TB Hitachi Deskstars
6 x 2TB Hitachi Deskstars (1 dead!)

ribbles
Posts: 11
Joined: Tue Dec 11, 2012 5:46 pm

Re: Greyhole's lost loads of files. :( What other options?

Postby ribbles » Fri Jan 04, 2013 9:26 pm

You might want to look into ZFS at http://zfsonlinux.org/. You can read more about it on Wikipedia.
ZFS does drive pooling and file checksumming, and it verifies those checksums on every read of a file; when a redundant copy exists (e.g. a mirror), it can self-heal the bad data.
It's not officially supported on Amahi, though, but that may not be important to you.

gboudreau
Posts: 606
Joined: Sat Jan 23, 2010 1:15 pm
Location: Montréal, Canada

Postby gboudreau » Sat Jan 05, 2013 6:49 am

dinomic wrote:
> Basically, I had 2 drives set up as "stickies" for 1 particular share (let's say drives 7 and 8). Drive 7 was the landing zone for this share, and Greyhole seems to have made the first copy onto drive 8 as a result (which is the drive that has slowly been dying). It then seems to have used that copy on drive 8 as the source for the copy on drive 7 too. That's why any "broken" file shows its full size on drive 8 (e.g. 8.2GB), but can't be copied anywhere else: the copy dies before it gets to the end (say at 5.3GB) with an "Invalid Argument" error, which you can then only "Skip".

So what you're saying is that Greyhole copied the file from the LZ onto drive 8, thought it worked, and deleted the file from the LZ; then, when it tried to create the 2nd copy from the copy it had just created on drive 8, that failed?
Can you show me the part of your greyhole.log where that happened?
I've never seen a file copy operation complete without error and leave the copied file unreadable as soon as the cp command finished.

I could add an option to MD5 each file copy, instead of just checking the copied file's size, to ensure the new copy is readable, but that would certainly be disabled by default, as it would add a lot of processing time, and it's the first time I'm hearing of such a problem.
- Guillaume Boudreau
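
For readers following along, the checksum option described above could look roughly like the sketch below. It is written in Python purely for illustration and is not Greyhole's code; the function names `file_digest` and `copy_and_verify` are hypothetical. It copies a file, compares sizes, then reads both files to the end and compares their digests, so an unreadable region in the new copy surfaces as a failure instead of passing a size-only check.

```python
import hashlib
import os
import shutil


def file_digest(path, algo="md5", chunk_size=1024 * 1024):
    """Hash a file by reading it to the end; raises OSError if a read fails."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def copy_and_verify(src, dst, algo="md5"):
    """Copy src to dst, then confirm that size and checksum both match.

    Returns True only if both files could be read in full and their
    digests are equal. The source is never deleted here (hypothetical
    helper; the caller decides what to do with the original).
    """
    shutil.copyfile(src, dst)
    if os.path.getsize(src) != os.path.getsize(dst):
        return False
    try:
        return file_digest(src, algo) == file_digest(dst, algo)
    except OSError:
        # A read error on either side means the copy cannot be trusted.
        return False
```

One caveat on the design: a file re-read immediately after being written may be served from the operating system's cache rather than from the disk, so a check like this can still pass on a failing drive unless the cache is bypassed or the verification is repeated later.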

Postby dinomic » Sat Jan 05, 2013 7:53 am

Hi Guillaume,

Firstly, let me say thanks for developing Greyhole in the first place, all the more so since it's free! Secondly, I do understand your situation as a developer getting feedback - I'm a developer myself! Often, users will manage to do things that we developers haven't replicated before (or haven't foreseen).

gboudreau wrote:
> So what you're saying is that Greyhole copied the file from the LZ onto drive 8, and thought it worked, then deleted the file from the LZ, and then, when it tried to create the 2nd copy from the copy it just created on drive 8, that failed?

Yes, that's exactly what I'm saying.

gboudreau wrote:
> Can you show me the part of your greyhole.log where that happened?

If I still have the log, and you could point me towards what to look for in it, I'd be happy to. I'm not sure if greyhole.log gets truncated, as I often just look at the "tail", but it's currently at 144.7MB, and there are some zipped files that look like archives, though they only seem to go back to 20121230.

gboudreau wrote:
> I've never seen a file copy operation complete without error, and have the copied file unreadable as soon as the cp command finished.

Have you tried with a drive that's failing (i.e. one that's getting an increasing number of bad sectors)? I'm guessing you haven't - unless you happen to have such a drive, it's not something you can replicate at will.

gboudreau wrote:
> I could add an option to md5 each file copy, instead of just checking the copy file size, to ensure the new copy is readable, but for sure that would be disabled by default, as it would add a lot of processing time, and it's the first time I'm hearing of such a problem.

I think this would be a good idea, because the bad drive still reports the file as having its full original size even though the file can't be read in full. E.g. if you try to copy the damaged file manually to a non-Samba share (i.e. directly to the file system), the operation works up to a certain point. As mentioned in my first post, the 8.2GB file is only readable up to the 5.3GB mark, which is where the failure happened in the first place.
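
The reported size comes from the file system's metadata, not from reading the data, which is why a size check alone can pass on a file like this. As a rough illustration (a Python sketch, not anything Greyhole does; `first_unreadable_offset` is a hypothetical name), a file can be read sequentially to find where reading stops working:

```python
import os


def first_unreadable_offset(path, chunk_size=1024 * 1024):
    """Read a file sequentially and return the offset where reading fails.

    Returns None if the whole file can be read. The offset is accurate
    only to the nearest chunk boundary. A file that yields fewer bytes
    than the size reported by the file system is also reported, at the
    offset where the data ran out.
    """
    expected = os.path.getsize(path)
    offset = 0
    with open(path, "rb") as f:
        while offset < expected:
            try:
                chunk = f.read(chunk_size)
            except OSError:
                # The read failed somewhere inside this chunk.
                return offset
            if not chunk:
                return offset
            offset += len(chunk)
    return None
```

On a file like the 8.2GB example, a check along these lines would be expected to return an offset near the 5.3GB mark rather than None.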

Anyway, a checksum is still not enough, though, IMHO. For a start, you need a "good" copy in order to create/compare a checksum against. But if the copy on the LZ is corrupted to start with, then this won't help either. Would Samba report the file as corrupted to start with, before it notifies Greyhole of an impending operation? I'm not sure if there's anything else that's available from the filesystem that will tell you the state of a file on disk. I do know that the drive can be queried for its "health" (after all, that's how the Disk Utility, and the Fedora OS, know when a drive is failing), but again, I don't know (without researching further) if there's a way to identify properly when a file has been corrupted.

Anyway, let me know what you need to help with this issue. I still have the damaged drive (disconnected for now), so we can look at any of a number of damaged files on there. To add to my initial report, some file operations obviously do break with this scenario, as the 2nd copy hasn't even had the filename returned back to its original state (looks like some kind of temporary file). However, there's nothing that I'm aware of which Greyhole does to notify me if there's a problem, so even if there was a problem with creating the file copies, how would I know that there's a problem that needs fixing?

Incidentally, I have also been documenting a few other bugs that I've discovered with Greyhole if you're interested in seeing those too?

Postby gboudreau » Sat Jan 05, 2013 1:19 pm

dinomic wrote:
> gboudreau wrote:
> > Can you show me the part of your greyhole.log where that happened?
>
> If I still have the log, and you could point me towards what to look for in it, I'd be happy to. I'm not sure if greyhole.log gets truncated, as I often just look at the "tail", but it's currently at 144.7MB, and there are some zipped files that look like archives, but they only seem to have been done from 20121230 onwards?

Maybe try:

greyhole --debug some_unique_filename_you_lost

This should grep the existing logs you still have, and output anything that happened with the specified file. Use just a filename, not a full path.

dinomic wrote:
> gboudreau wrote:
> > I've never seen a file copy operation complete without error, and have the copied file unreadable as soon as the cp command finished.
>
> Have you tried with a drive that's failing (i.e. one that's getting an increasing number of bad sectors)? I'm guessing you haven't - unless you happen to have such a drive, it's not something you can replicate at will.
> ...
> Anyway, a checksum is still not enough, though, IMHO. For a start, you need a "good" copy in order to create/compare a checksum against. But if the copy on the LZ is corrupted to start with, then this won't help either. Would Samba report the file as corrupted to start with, before it notifies Greyhole of an impending operation? I'm not sure if there's anything else that's available from the filesystem that will tell you the state of a file on disk. I do know that the drive can be queried for its "health" (after all, that's how the Disk Utility, and the Fedora OS, know when a drive is failing), but again, I don't know (without researching further) if there's a way to identify properly when a file has been corrupted.

I've had my share of bad drives, controllers and cables over the years I've been using Greyhole. Every time, the cp operation would just fail. It was my assumption that the file system would be in charge of ensuring that the data was indeed written correctly, and that if it wasn't (i.e. it wouldn't be readable), cp would just error out. I think I'll post something on ServerFault about this, to try to better understand what can fail, and how.

dinomic wrote:
> However, there's nothing that I'm aware of which Greyhole does to notify me if there's a problem, so even if there was a problem with creating the file copies, how would I know that there's a problem that needs fixing?

When a file copy operation fails, Greyhole will simply leave the original file behind, and retry creating the copies later.
But indeed, it won't notify you in any way; that's something I could add.

dinomic wrote:
> Incidentally, I have also been documenting a few other bugs that I've discovered with Greyhole if you're interested in seeing those too?

Sure. You can file bug reports in the bug tracker: https://github.com/gboudreau/Greyhole/issues
Or if you're unsure if it's a real bug, or you have questions, post something on the GetSatisfaction forum: https://getsatisfaction.com/greyhole

Cheers.
- Guillaume Boudreau

Postby gboudreau » Sat Jan 05, 2013 1:27 pm

What file-system are you using on the broken drive?
ext3/4? Journaled?
- Guillaume Boudreau

Postby dinomic » Sat Jan 05, 2013 1:29 pm

gboudreau wrote:
> What file-system are you using on the broken drive?
> ext3/4? Journaled?

GPT partition, ext4.

Not sure what you mean by journaled?

Postby gboudreau » Sat Jan 05, 2013 1:36 pm

$ sudo dumpe2fs /dev/sdXX
dumpe2fs 1.42.5 (29-Jul-2012)
...
Filesystem features: has_journal,...
...
- Guillaume Boudreau

Postby dinomic » Sat Jan 05, 2013 1:45 pm

gboudreau wrote:
> $ sudo dumpe2fs /dev/sdXX
> dumpe2fs 1.42.5 (29-Jul-2012)
> ...
> Filesystem features: has_journal,...
> ...

I can't do it on the actual damaged drive at the moment, but running it on the other drive in the pair, I get this:

Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
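
For anyone unsure how to read that output: `has_journal` in the feature list means the ext4 journal is enabled. As far as I know, ext4's default mode journals metadata rather than file contents, so a journal by itself would not be expected to detect unreadable file data. The `needs_recovery` flag typically shows up while a file system is mounted or after an unclean unmount. A small Python sketch (the helper name `parse_features` is hypothetical) that pulls the flags out of text like the above:

```python
def parse_features(dumpe2fs_output):
    """Return the set of feature names from a 'Filesystem features:' line."""
    for line in dumpe2fs_output.splitlines():
        if line.startswith("Filesystem features:"):
            _, _, rest = line.partition(":")
            # Features are whitespace-separated in the output quoted above.
            return set(rest.split())
    return set()


# Sample text copied from the post above.
sample = (
    "Filesystem features: has_journal ext_attr resize_inode dir_index "
    "filetype needs_recovery extent flex_bg sparse_super large_file "
    "huge_file uninit_bg dir_nlink extra_isize\n"
    "Filesystem flags: signed_directory_hash\n"
)

features = parse_features(sample)
print("journaled:", "has_journal" in features)
print("needs_recovery:", "needs_recovery" in features)
```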

Postby dinomic » Sat Jan 05, 2013 1:58 pm

gboudreau wrote:
> Maybe try:
>
> greyhole --debug some_unique_filename_you_lost

Hmmm. Again, because I've removed the damaged drive for now, I couldn't look up any of the affected filenames, so I thought I'd look at the Samba share to find a sample filename, only to discover that all of the broken files have now had their symlinks removed. So, as far as the Samba share is concerned, NONE of those damaged files exist anymore. Is this what happens when you remove one drive (i.e. one that has the symlinks pointing to it), even if it's working OK? Surely, this isn't good?

Anyway, looking at the "good" drive in the pair, that has part of the damaged files copied to it, I've been able to run that debug command and got an output.

Do you want me to PM you 2 sample outputs as follows?
*) For an operation that resulted in a properly named filename (on the good drive)
*) For an operation that resulted in a filename that wasn't renamed back properly (on the good drive)