Hard Drive Failure Without Down Time

Posted By: Tony Baird

Last Updated: Friday January 18, 2008

I constantly read about web hosts who lose a hard drive and then have a server down for days because the data has to be restored from backups. I shake my head every time I read about this, because you know 99% of them are not running any form of RAID. These hosts figure, what are the chances the hard drives are going to fail? I can tell you the chances are much higher than they think, and that’s a big reason why we use RAID, specifically RAID-10 (4 drives), and make sure the drives in all our servers are hot swappable.

I’ve already used some terms that may not be known to everyone, so I’ll quickly describe a few things:

RAID - RAID is when a set of hard drives acts as one logical drive. This can mean that two 200GB drives become one 400GB drive to your operating system. Or you can take two 200GB drives and have them appear as just 200GB, with one being a mirror of the other, so you could lose a drive without losing data. There are other, more complicated setups, but the basic idea is more than one drive acting as one in the eyes of the system.

RAID-10 - RAID-10 is a nested RAID level which combines several RAID-1 arrays (2 drives each) into a RAID-0 set, which expands the drive space. So if you had 4 drives with 200GB each, you would end up with a total of 400GB of usable space, with each 200GB drive being mirrored by one other drive.
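If you want to check the numbers above, here is a rough sketch of the capacity arithmetic in Python. It assumes all drives are the same size and ignores any controller or filesystem overhead, and the function name `usable_capacity_gb` is just something made up for this example, not part of any RAID tool.

```python
def usable_capacity_gb(level: str, drive_count: int, drive_size_gb: int) -> int:
    """Usable space for identical drives under a simplified model (no overhead)."""
    if drive_count < 2:
        raise ValueError("this sketch assumes at least 2 drives")
    if level == "raid0":
        # Striping: every drive adds its full size.
        return drive_count * drive_size_gb
    if level == "raid1":
        # Mirroring: all drives hold the same data, so you get one drive's worth.
        return drive_size_gb
    if level == "raid10":
        # Stripe across mirrored pairs: half the drives hold copies.
        if drive_count < 4 or drive_count % 2 != 0:
            raise ValueError("this sketch assumes an even number of drives, 4 or more")
        return (drive_count // 2) * drive_size_gb
    raise ValueError(f"unsupported level: {level}")


print(usable_capacity_gb("raid0", 2, 200))   # 400
print(usable_capacity_gb("raid1", 2, 200))   # 200
print(usable_capacity_gb("raid10", 4, 200))  # 400
```

The three calls at the bottom match the examples in this post: two striped 200GB drives, two mirrored 200GB drives, and the four-drive RAID-10 setup.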


Hot swappable drives - This essentially means you can take out a hard drive and replace it with another without powering down the server. So in the case of RAID-10, a failed drive can be replaced without taking the server offline.
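To make the RAID-10 case a bit more concrete, here is a small Python sketch of which drive failures a 4-drive array can survive. The drive numbering and the pairing (drives 0 and 1 mirror each other, as do drives 2 and 3) are assumptions for illustration only. A real controller decides its own layout, and `array_state` is a made-up helper, not a real tool.

```python
from itertools import combinations

# Assumed layout for illustration: two mirrored pairs striped together.
MIRROR_PAIRS = [(0, 1), (2, 3)]


def array_state(failed: set) -> str:
    """Return 'healthy', 'degraded' or 'failed' for a set of failed drive numbers."""
    for pair in MIRROR_PAIRS:
        # If both drives of one mirror are gone, that half of the stripe is lost.
        if all(drive in failed for drive in pair):
            return "failed"
    # Any other failure leaves each mirror with at least one good copy.
    return "degraded" if failed else "healthy"


# Walk through every possible single and double drive failure.
for count in (1, 2):
    for failed in combinations(range(4), count):
        print(failed, array_state(set(failed)))
```

Under this model any single failure only degrades the array, and a second failure is fatal only if it hits the other drive in the same mirror pair. That is why a degraded array should still get its bad drive swapped promptly.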

Okay, now that I’ve explained a few things, on to the Hawk Host part. Today Venus alerted us that one of its drives had failed and the RAID was in degraded mode, which means one drive was not functioning. We acted quickly, scheduling a time for the bad drive to be swapped out. Fortunately all our servers are hot swappable, so no down time was expected. A technician on site got to the server, took out the bad drive and put a new one in. After this we simply told the RAID controller on the server to rebuild the array, which syncs the new drive up with the data on its mirror. We had now replaced a hard drive without any down time. For a lot of hosts this would have been an absolute nightmare. In our case the server was a tiny bit slower with one drive gone, but most customers will not even be aware we had a hard drive failure unless they check the notices on our forum.

So next time you’re worrying about data, just remember RAID is your friend. It does not replace backups, but it can make your life a lot easier when a drive fails, which will happen. Here are just a few statistics from a random site I found that was talking about failures way back in 2001!

Google’s write-up about drive failures is at http://research.google.com/archive/disk_failures.pdf. They have a lot of experience with hard drives, so I figured I’d post it as well.
