Working on a bit of a recovery effort with an mdadm array. Here's the background:
An 8 x 1.5TB disk RAID 5 array, created with mdadm, housing an XFS filesystem (default chunk size, etc.).
The array has run fine for quite some time. Occasionally, usually under heavy load or with an aging disk, the array will kick out a disk (bad block, or some other I/O error). Not a problem: you can simply remove the disk, re-add it (or replace it), and resync happiness.
Once in a while, usually under some heavy I/O operation, TWO disks will get kicked out. This might otherwise be a problem, but a trick I've used (and one often cited online) is that you can simply RE-create the array with the same parameters and disk order on top of the existing array without data loss. Usually it's best to use the "missing" keyword for one of your disks in the create statement, so that the array is created but DOES NOT start syncing. That way you can verify all is well with your filesystem, then add your last disk, and again, resync happiness.
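For reference, a re-create along those lines looks roughly like this. Everything here is a placeholder - device names, chunk size, and metadata version MUST match whatever the original create used (mdadm's defaults have changed between versions):

```shell
# Hypothetical device names and order -- substitute your real ones.
# "missing" leaves one slot empty, so the array assembles degraded and
# no parity resync can start; --assume-clean skips the initial sync too.
mdadm --create /dev/md0 --assume-clean --level=5 --raid-devices=8 \
      --chunk=64 --metadata=0.90 \
      /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg missing

# Verify read-only before touching anything:
mount -o ro /dev/md0 /mnt
```

If the filesystem checks out, unmount, add the withheld disk back with `mdadm --add`, and let it rebuild.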
Recently I ran into the above scenario, but complicated things greatly. I rebooted the server before recreating the array, just to give things a clean start, and didn't realize that my Ubuntu system (12) often makes block device assignments in a DIFFERENT order, depending on which controller is seen first, etc. So /dev/sda on one boot might show up as /dev/sde on another. Again, this shouldn't be a problem, since every disk has mdadm metadata identifying the order, RAID details, UUID, etc. However, in this case, not thinking my drives were out of order, I RE-created the array with my last known order - which wasn't right after the reboot - essentially writing a NEW set of metadata to every drive, with the array out of order. No syncs happened (I used the missing-drive method above), so the data "should" be intact - I just need to figure out which drive was which in the original create!
I have SOME log information (hdparm output, /proc/mdstat output, etc.) from the original array, but thus far haven't succeeded in guessing the original order.
A note to self - a good bit of data to obtain from your healthy, in-sync array is a simple mapping of the constituent block devices to their corresponding drive serial numbers - that way you can always go back and know physically which drive was which at the time of happiness.
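For anyone wanting to capture that mapping while their array is still happy, a minimal sketch (assumes smartmontools is installed; the persistent `/dev/disk/by-id` symlinks encode the serial numbers as well):

```shell
# Map each block device to its drive serial number and stash the result
# somewhere that isn't on the array itself.
for d in /dev/sd?; do
    printf '%s: ' "$d"
    smartctl -i "$d" | awk -F': *' '/^Serial Number/ {print $2}'
done > disk-serials.txt

# The by-id symlinks capture the same device-to-serial mapping:
ls -l /dev/disk/by-id/ >> disk-serials.txt
```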
Anyhoo, I don't have that, so I've thought of scripting a process to literally try every drive-order combination (again, with one drive missing so no syncs occur), checking for a valid filesystem on each iteration. Back-of-the-envelope calculations put this process at 3-4 days to complete (9 actual disks, but only 8 in use ~ 141k permutations) - which I'm fine with, but would there be any dangers to this?
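A sketch of what that loop might look like - device names, chunk size, and the md device are all assumptions to adapt, and the chunk/metadata parameters must match the ORIGINAL create or no ordering will ever look valid:

```shell
#!/bin/bash
# Read one candidate ordering per line on stdin; each line is 8 tokens --
# 7 device names plus the literal word "missing" -- generated however you
# like (e.g. itertools.permutations in a small Python helper).
while read -r -a order; do
    mdadm --stop /dev/md0 >/dev/null 2>&1
    # --run suppresses the "appears to contain an array" confirmation;
    # --assume-clean plus the "missing" slot ensure nothing ever resyncs.
    mdadm --create --run /dev/md0 --assume-clean --level=5 \
          --raid-devices=8 --chunk=64 "${order[@]}" \
          >/dev/null 2>&1 || continue
    # xfs_repair -n is no-modify: it only inspects, never writes.
    if xfs_repair -n /dev/md0 >/dev/null 2>&1; then
        echo "POSSIBLE ORDER: ${order[*]}"
    fi
    mdadm --stop /dev/md0 >/dev/null 2>&1
done
```

One caveat baked into this approach: every `mdadm --create` rewrites the superblocks again, so the metadata version must stay the same across all attempts - different versions place the data area at different offsets.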
As a second question - is there something smarter I'm missing? Is there a way to examine the disks, aside from the RAID metadata (which has since been overwritten), that might give a clue as to their original order?
Thanks all in advance...