Over the past few months, I’ve run into a situation where hard drives in servers I maintain have started to fail. While I keep backups of the data, instead of a “leisurely” replacement I found myself in a stressful situation, having to act quickly to acquire new hardware, load the system, load the software & restore the backup. While rsnapshot has been a great tool for these situations, the unscheduled downtime (usually two to three hours), unfortunately usually during business hours, was not ideal.
As a result, I started looking into using RAID to address the situation. Most of my servers were in the 320GB to 500GB disk range (usually storing much less data than the disk could hold), so a new drive today costs only around $50-$55. Definitely worth it for the peace of mind.
In addition to the physical hard drive, I looked at using the FreeBSD gmirror software to build and maintain the RAID. For testing purposes, I set up a FreeBSD box in VirtualBox and worked through the guide on OnLamp to get my initial RAID up and going. Fortunately, the system allows for preexisting data to be on the drive (no need to set up the RAID before the OS installation). All in all, the RAID setup took only a few minutes to configure (the basic steps are to tell gmirror to make a RAID, adjust the fstab to point to the new “mirror” devices, and load the gmirror/geom module on boot).
While the initial setup was straightforward, I was curious how the RAID would behave in some “real world” scenarios.
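Of the basic steps, the fstab adjustment is the easiest one to script. The following is only a minimal sketch, assuming the original disk is ad0, the mirror is named gm0 (the name used later in this post) and the disk uses slices (ad0s1a and so on). The function name fstab_to_mirror is made up for illustration. It reads an fstab on stdin and writes the rewritten version to stdout, so /etc/fstab itself is never touched.

```shell
#!/bin/sh
# Hypothetical helper: point fstab device paths at the mirror device.
# Usage: fstab_to_mirror DISK MIRROR < /etc/fstab > /etc/fstab.new
# Example: /dev/ad0s1a becomes /dev/mirror/gm0s1a.
#
# The other two steps are usually one command each (assumed names):
#   gmirror label -v -b round-robin gm0 /dev/ad0
#   echo 'geom_mirror_load="YES"' >> /boot/loader.conf
fstab_to_mirror() {
    disk=$1
    mirror=$2
    # Only rewrite the device field at the start of a line; comments and
    # unrelated devices (CD-ROM, other disks) pass through unchanged.
    sed "s|^/dev/${disk}s|/dev/mirror/${mirror}s|"
}
```

Writing to a separate file first makes it easy to diff the result against the original before copying it into place.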
Scenario 1: Failed Drive: atacontrol detach
To emulate this situation, I pulled a drive and booted the system. It did not matter whether I pulled the primary or the secondary drive: the system still booted and showed the RAID in “degraded” mode.
I also used the “atacontrol detach” command to stop a controller (and hence a drive) during operations. Once again, the system continued functioning without any indication the drive had failed (except for a console warning & degraded status). I had various disk operations running at the time and everything continued as expected. Reattaching the drive using “atacontrol attach” had the drive rebuilding in the background while existing server operations continued, unaffected (i.e. I’d imagine that if my servers had hot swap, there would be no need for a reboot or other interruptions to operation).
Scenario 2: Failed Drive: Replaced with bigger drive
My next test was to pretend a drive had failed and replace it with a different, larger drive. Instead of an auto-rebuild, the system required two commands to remove the failed drive from the mirror and add the new drive to the mirror:
- gmirror forget gm0 (gm0 is the RAID device)
- gmirror insert gm0 ad4 (ad4 is the new disk device)
After this, the RAID started to rebuild itself in the background and I was back up and running. Following the instructions in the OnLamp article, any additional space on the larger drive was not usable (there are other ways to configure gmirror so that slices or partitions are mirrored instead of the entire drive, but I did not test this functionality).
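For a procedure that will probably be run under some stress, it can help to wrap the two commands above in a small script that can be rehearsed first. This is just a sketch: the run and replace_failed_disk functions are hypothetical names, and gm0 and ad4 are the example devices from this test. With DRY_RUN set, the commands are only printed, not executed.

```shell
#!/bin/sh
# run: print a command, then execute it unless DRY_RUN is set.
run() {
    echo "+ $*"
    [ -n "${DRY_RUN:-}" ] || "$@"
}

# Hypothetical wrapper around the two gmirror commands.
# Usage: replace_failed_disk MIRROR NEW_DISK   (e.g. gm0 ad4)
replace_failed_disk() {
    mirror=$1
    new_disk=$2
    # Drop components that are no longer connected, then add the new
    # disk. Stop if "forget" fails so the insert is not attempted.
    run gmirror forget "$mirror" || return 1
    run gmirror insert "$mirror" "$new_disk"
}

# Rehearsal:  DRY_RUN=1; replace_failed_disk gm0 ad4
# Real run:   unset DRY_RUN; replace_failed_disk gm0 ad4
```

Printing each command before it runs also leaves a record in the terminal of exactly what was done to the mirror.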
Scenario 3: Failed Drive: Smaller drive
As I am mirroring the entire drive, I was curious what would happen if I only had a slightly smaller drive as the replacement. Of course, as expected, gmirror did not like the smaller drive. However, a workaround I was able to use was to slice & partition the new drive, add a boot manager and then use the dump/restore commands to copy the data from the larger drive to the smaller drive. I modified the fstab back to a standard single drive configuration, erased the existing gmirror RAID information (gmirror stop gm0) and then simply created a new mirror, using the smaller disk as my initial disk for the mirror.
This was a bit more time intensive, but in theory it is possible to dump a live system and then use rsync to pick up any data that changed after the dump occurred (I’d recommend doing the rsync in single user mode so there is more control over data changes).
Once this was done, the system and gmirror were back up and running without issue. This method does require two additional reboots: one to switch the system over to the new drive and one during the initial gmirror setup on the new drive (which *might* not be necessary; more testing required).
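The size mismatch in this scenario can be spotted before a replacement drive is ever inserted into the mirror. The sketch below is a rough pre-check only, assuming the usual diskinfo output layout where the third field is the media size in bytes. The function names and the ad0/ad4 devices are illustrative, and comparing whole-disk sizes is an approximation of what gmirror actually requires.

```shell
#!/bin/sh
# media_bytes: read one line of diskinfo-style output on stdin and
# print the third field (assumed to be the media size in bytes).
media_bytes() {
    awk 'NR == 1 { print $3 }'
}

# fits_mirror EXISTING_BYTES CANDIDATE_BYTES
# Succeeds when the candidate disk is at least as large as the
# existing one, i.e. a whole-disk mirror should accept it.
fits_mirror() {
    [ "$2" -ge "$1" ]
}

# Example use on a FreeBSD box (device names are assumptions):
#   old=$(diskinfo ad0 | media_bytes)
#   new=$(diskinfo ad4 | media_bytes)
#   fits_mirror "$old" "$new" || echo "ad4 is smaller than ad0"
```

If the check fails, the dump/restore route described above is the fallback.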
Scenario 4: Power Failure
Simply put, I had various disk activities going on and then killed the power to the box. On bootup, fsck ran to check the disk, and afterwards gmirror synced the disks back to a working state.
Monitoring & Conclusion
Overall, my initial tests with gmirror have been very positive. It is very straightforward and easy to use. I didn’t do any benchmarking given the virtual environment of my tests, but from my understanding, disk reads can be up to twice as fast, so adding that second hard drive can provide a performance benefit as well.
Some simple scripting around the “gmirror status” command will allow me to monitor the status of my RAIDs and hopefully address any issues with the systems before users are aware of a problem. “gmirror status” returns either a “DEGRADED” or a “COMPLETE” status. “DEGRADED” indicates either that a drive has failed or that the system is rebuilding. “COMPLETE” indicates a functional system. A script run from cron can email status changes to you, or this could be tied into a monitoring system such as Nagios (probably the better option, particularly if Nagios is already in place for other monitoring tasks).
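As a starting point for that scripting, here is a minimal sketch of a status check. It assumes that each mirror line in the “gmirror status” output starts with “mirror/” and has the status in the second column, and it uses the Nagios plugin exit code convention (0 OK, 2 CRITICAL, 3 UNKNOWN). The function names are made up for illustration, and the output format should be confirmed on your own system.

```shell
#!/bin/sh
# mirror_states: read "gmirror status" output on stdin and print
# "<mirror> <status>" for each mirror line. Header lines and the
# continuation lines that list extra components are ignored.
mirror_states() {
    awk '$1 ~ /^mirror\// { print $1, $2 }'
}

# check_mirrors: summarize the state using Nagios-style exit codes
# (0 = OK, 2 = CRITICAL, 3 = UNKNOWN).
check_mirrors() {
    states=$(mirror_states)
    if [ -z "$states" ]; then
        echo "UNKNOWN: no mirrors found"
        return 3
    fi
    # Keep only the mirrors whose status is anything but COMPLETE.
    bad=$(printf '%s\n' "$states" | awk '$2 != "COMPLETE"')
    if [ -n "$bad" ]; then
        echo "CRITICAL: $bad"
        return 2
    fi
    echo "OK: all mirrors COMPLETE"
    return 0
}

# Typical use, from cron or as a Nagios check:
#   gmirror status | check_mirrors
```

Because a rebuilding mirror also reports “DEGRADED”, this check stays critical until the resync finishes, which is usually what you want to know anyway.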
While gmirror is not a particularly new technology, based on my testing and reports from production use of this tool, it appears to be a solid and very worthwhile addition to any server configuration (and quite possibly most desktop configurations).