tltemple
May 25th, 2012, 08:43 PM
I would like to start a serious discussion on repairing RAID drives on Ubuntu systems.
I have spent quite a bit of time over the last several weeks trying to install and prove out RAID drives on a new Ubuntu 12.04 Server installation. I am becoming frustrated and disappointed at the lack of definitive knowledge and answers to RAID issues on Ubuntu and Linux in general.
My situation seems very simple to me: I want to load up a simple Ubuntu file server with RAID 1 configured hard disks. I want to fail a drive on purpose, then replace that drive, and get the system back to the exact same state as it was prior to the drive failing. If I can't do this, and record the exact steps required to do so, I don't want to waste any more time with RAID drives, because why use them if one can't prove them to be (easily) recoverable?
With that said, I have come very close to my goal, but I have run into small but almost insurmountable issues trying to get this done. I will review what I have done:
1) Configure the system with 4 physical hard drives: 2 drives for the OS, 2 drives for data. The OS drives are RAID 1, with 4 partitions and 4 RAID devices (md0 through md3). The data drives have one large partition and 1 RAID device (md4). The data drives may get more complicated later, once I prove I can recover them, but for now it is one large partition, one RAID array.
2) md4 (data drive) is mounted as /media/data.
3) Configure Samba with a share of /media/data.
4) Connect a Windows client and copy some data to the /media/data share.
Now for the trial run of the drive replacement:
5) Fail /dev/sdc1 on the md4 device
(sudo mdadm --manage /dev/md4 --fail /dev/sdc1)
This effectively puts md4 into a failed hard drive state.
6) Remove /dev/sdc1 from the array
(sudo mdadm --manage /dev/md4 --remove /dev/sdc1)
7) Shut down the system, take out sdc, replace it with another drive.
8a) Partition the new drive
(sudo sfdisk -d /dev/sdd | sfdisk /dev/sdc)
**** This is a solution proposed in a number of places for copying the partition table from one drive to another. It doesn't seem to work on Ubuntu.
8b) Partition the new drive manually instead
(sudo fdisk /dev/sdc)
(c)
(u)
(n) - new partition
(p) - primary partition
enter - accept default start location
enter - accept default end location
(w) - write partition table
**************************************
At this point, I have some questions:
Q1) Is it necessary or desirable to format the new drive here?
Q2) Should the partition be marked as something other than a plain old Linux partition? If so, how?
**************************************
9) Format the drive?
10) Add the drive back to the array
(sudo mdadm --manage /dev/md4 --add /dev/sdc1)
11) Check the status of the RAID devices
(sudo cat /proc/mdstat)
This shows the raid device rebuilding.
This process (at least in this case) went on and rebuilt the raid array, BUT, it marked the new drive as a spare drive in the array.
This proved to be virtually impossible for me to change. Maybe I shouldn't be concerned with the new drive being marked a spare, but it bothers me. I was basically not able to overcome this issue. Everything was functioning, and I didn't lose any data, but I was not back where I started. I found a number of posts where this specific issue was discussed, but no definitive answer as to how to fix it.
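For reference, the spare/active distinction can be read directly out of /proc/mdstat: a member listed with an (S) suffix is a spare, one with (F) is faulty, and the [n/m] pair on the following line gives the number of configured devices versus the number currently working. The sketch below only does that parsing. The function names member_flag and array_slots are made up for illustration, and the sketch makes no claim about why a member ends up flagged as a spare.

```shell
#!/bin/sh
# member_flag ARRAY MEMBER [MDSTAT_FILE]
# Prints spare, faulty, in-array, absent, or no-such-array,
# based only on the (S)/(F) suffixes that /proc/mdstat shows.
member_flag() {
    awk -v arr="$1" -v dev="$2" '
        $1 == arr && $2 == ":" {
            found = "absent"
            for (i = 3; i <= NF; i++) {
                # Member fields look like sdc1[2], sdc1[2](S) or sdc1[2](F).
                if (index($i, dev "[") == 1) {
                    if ($i ~ /\(S\)/)      found = "spare"
                    else if ($i ~ /\(F\)/) found = "faulty"
                    else                   found = "in-array"
                }
            }
            print found
            seen = 1
        }
        END { if (!seen) print "no-such-array" }
    ' "${3:-/proc/mdstat}"
}

# array_slots ARRAY [MDSTAT_FILE]
# Prints the n/m pair (configured/working devices), e.g. 2/1 for a
# two-disk mirror that is running on one disk.
array_slots() {
    awk -v arr="$1" '
        $1 == arr && $2 == ":" { want = 1; next }
        want {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^\[[0-9]+\/[0-9]+\]$/)
                    print substr($i, 2, length($i) - 2)
            want = 0
        }
    ' "${2:-/proc/mdstat}"
}
```

On the setup described here, `member_flag md4 sdc1` and `array_slots md4` would be the calls to compare before the test failure and after the rebuild finishes. `sudo mdadm --detail /dev/md4` reports the same state in a more verbose form.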
I have tried this several more times and run into other small problems:
In some cases, the rebuild goes very slowly, and I get all kinds of errors reported during the process. I am associating this with whether or not I format (create a file system) on the new drive before adding it to the array. I am still not sure (Q1 above) whether the drive needs to be formatted prior to adding it to the raid array.
In other cases, the system seems to grab the drive as soon as it sees the partition created. It adds it to the raid array and starts rebuilding it before I have the chance of stopping it.
In other cases, the system hangs on to the drive and says the drive is busy, and it won't let me partition it.
In other cases, I can't remove a partition from an array.
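The steps above can be collected into one script, sketched here under some assumptions. It prints the commands instead of running them unless DRY_RUN=0 is set. The helper names run, replace_steps_before_swap and replace_steps_after_swap are invented for this sketch. Two details differ from the steps as listed. First, in the pipeline from step 8, sudo applies only to the first sfdisk, so the second one runs without root; that may be why the copy did not work, and the sketch puts sudo on both sides. Second, it adds mdadm --zero-superblock, on the assumption that a replacement disk which was an array member in an earlier trial still carries md metadata, which could explain the drive being grabbed automatically or reported busy.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# DRY_RUN=0 runs them for real, which is destructive; check device names first.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Steps 5 and 6. mdadm --remove only accepts members that are failed or
# spare, so the member is marked failed first.
replace_steps_before_swap() {
    array=$1
    member=$2
    run sudo mdadm --manage "$array" --fail "$member"
    run sudo mdadm --manage "$array" --remove "$member"
}

# Steps 8 to 11, after the physical swap and reboot.
# There is no mkfs step: the filesystem lives on the md device, and the
# resync overwrites whatever was on the new member.
replace_steps_after_swap() {
    array=$1
    good_disk=$2
    new_disk=$3
    new_member=$4
    # Copy the partition table, including partition type ids, from the
    # surviving disk. Both ends of the pipe need root.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ sudo sfdisk -d $good_disk | sudo sfdisk $new_disk"
    else
        sudo sfdisk -d "$good_disk" | sudo sfdisk "$new_disk"
    fi
    # Assumption: only useful if the disk was an md member before; mdadm
    # may report an error on a partition that has no md metadata.
    run sudo mdadm --zero-superblock "$new_member"
    run sudo mdadm --manage "$array" --add "$new_member"
    run cat /proc/mdstat
}
```

With the default dry run, `replace_steps_before_swap /dev/md4 /dev/sdc1` prints the two mdadm commands from steps 5 and 6, and `replace_steps_after_swap /dev/md4 /dev/sdd /dev/sdc /dev/sdc1` prints the rest. Device names can change after a disk swap, so they are worth confirming before setting DRY_RUN=0.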
If you have lasted this long, maybe you can help.
I would really like to work through this issue, and move forward with confidence configuring this server with raid arrays. I am planning on adding another server to this installation when this one is done.
I am not after guesses to these issues. If you don't have definitive information, please don't offer it. I have scoured many forum posts that are full of suggestions and guesses. If I have to, I will get the source code for the raid software and dig through it myself to find the answers, but I would rather not spend that kind of effort.
If there is a better place to have this discussion, please point me in that direction.
Thanks,
Tom