Recovering from a RAID Controller Failure

There are many reasons why a RAID goes down.  A technician will normally assume that one or more of the drives have failed.  This is a common diagnosis because the status lights on the drives may be blinking or amber, or in some cases a drive may not spin up at all.  These surface indicators would lead even the most seasoned technician to assume that the drives have failed or are on their way out.  There is another cause that produces all of the same symptoms: a RAID controller failure.  The challenge is that a damaged controller cannot be trusted to diagnose the problem.  Using a damaged controller to make a diagnosis is like having a sick doctor diagnose his own illness.  Some technicians will replace the controller and hope that the configuration reloads from the drives and the RAID mounts.  DTI Data has made a very good living from technicians who swap controllers and cross their fingers in the hope that the RAID will come up.  The problems with this method of bringing a RAID online are too numerous to mention.

What needs to be done is to separate the primary component of the RAID, the hard drives, from the controller in order to make a legitimate diagnosis.  The following methods work independently of the damaged RAID controller and will help you recover your client's data.

First of all, check the drives to make sure they are electronically sound.  If you have SCSI drives, use an Adaptec SCSI controller.  Perhaps an Adaptec 2930 would suit your needs.  They are inexpensive and have been around long enough that the firmware bugs have been worked out.  Put the SCSI card in a reliable computer and mount each drive individually.  If the drives are SATA or PATA, use a standard interface port to mount the drives.

If the drive shows up under ‘Disk Management’ in ‘Computer Management’, then it is a pretty safe assumption that the interface is intact and you have some I/O between the drive and the controller.  In addition to this, DTI Data has a free surface scanner that will allow you to look at each drive and map any bad sectors on it.  If two or more drives turn out to have bad sectors, that could be the reason why the RAID went down.  RAID controllers are very sensitive to more than one drive exhibiting bad sectors or slow reads.  A RAID 5 controller's firmware may be fault tolerant, but when two drives have bad sectors the controller will degrade the array and bring it offline.
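To make the idea of a surface scan concrete, here is a minimal Python sketch of one. It is not DTI Data's scanner. It simply reads the whole surface once in fixed-size chunks and records which chunks could not be read or took too long. The names `scan_surface` and `file_reader`, the 512-byte sector size, the chunk size, and the one-second "slow" threshold are all assumptions made for illustration.

```python
import os
import time

SECTOR = 512  # assumed logical sector size


def scan_surface(read_at, total_bytes, chunk=SECTOR * 128, slow_seconds=1.0):
    """Read the surface once; return (bad, slow) lists of (byte_offset, length)."""
    bad, slow = [], []
    offset = 0
    while offset < total_bytes:
        length = min(chunk, total_bytes - offset)
        started = time.monotonic()
        try:
            # A short read is treated the same as an unreadable chunk.
            ok = len(read_at(offset, length)) == length
        except OSError:
            ok = False
        elapsed = time.monotonic() - started
        if not ok:
            bad.append((offset, length))
        elif elapsed > slow_seconds:
            slow.append((offset, length))
        offset += length
    return bad, slow


def file_reader(path):
    """Hypothetical helper: return (read_at, total_bytes) for an image file
    or a device node opened read-only. The handle stays open for the scan."""
    handle = open(path, "rb", buffering=0)
    total = handle.seek(0, os.SEEK_END)

    def read_at(offset, length):
        handle.seek(offset)
        return handle.read(length)

    return read_at, total
```

Run it against each drive in turn and compare the results. Bad or slow ranges on two or more members point toward the drives, while clean scans on every member point back toward the controller.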

If, however, there are no bad sectors on any of the drives, then the problem is normally the controller.  The cause may have been a power spike or some kind of memory fault, but whatever the cause, the RAID controller has failed and will not mount your array.

In addition to doing a surface scan to verify whether you have had a RAID controller failure, you can check the integrity of the RAID.  In a RAID 5, the controller computes parity over the data so that the contents of a drive can be reconstructed if it drops out of the array.  These XOR operations are what allow a degraded RAID 5 array to be rebuilt.  The failed drive has to be replaced before the rebuild, but a RAID 5 controller has the ability to integrate a brand new drive back into the array.
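The XOR arithmetic is easy to see on a toy stripe. The sketch below is only an illustration, with made-up four-byte blocks and a hypothetical helper named `xor_blocks`. It computes the parity for three data blocks, then rebuilds one "lost" block from the survivors and the parity.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe across a four-drive RAID 5: three data blocks plus parity.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# The drive holding data[1] drops out; rebuild its block from the rest.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```

The same operation serves both purposes: XOR of the data blocks gives the parity, and XOR of everything that survives gives back the missing block.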

I bring up the RAID 5 mathematics because, for the array to have a ‘clean bill of health’, the parity must be intact.  If a RAID card does not detect that a drive has dropped out of the array, that drive becomes stale: its contents stop being updated and no longer match the rest of the array.  A RAID 5 will continue to function with one drive out of the array; however, the RAID card should notify the technician that the array is degraded so that the drive can be replaced and a rebuild performed.

DTI Data has a free diagnostic tool for RAID 5 that will allow you to see whether there is a stale drive in the array.  I wrote a blog post on how to detect a stale drive in the array, and hopefully it will help you diagnose a drive array controller failure.  If the software finds a stale drive and the RAID controller did not report it, then the only way to recover the data is to create a virtual RAID 5 array offline, using software and images created from the RAID 5 drives.
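As a rough illustration of what a parity integrity check looks for (this is not the DTI Data tool), the sketch below relies on one property of RAID 5: at any given offset, the XOR of the blocks from all members, data and parity together, should be zero, wherever the parity block happens to sit. The function name `parity_mismatches` and the block size are assumptions. On real drives, controller metadata areas and regions that were never initialized may also fail the check.

```python
import io


def parity_mismatches(streams, block_size=64 * 1024):
    """Return the offsets of blocks whose XOR across all members is non-zero."""
    mismatches = []
    offset = 0
    while True:
        blocks = [s.read(block_size) for s in streams]
        length = min(len(b) for b in blocks)
        if length == 0:  # the shortest member has ended
            break
        acc = 0
        for block in blocks:
            acc ^= int.from_bytes(block[:length], "little")
        if acc:
            mismatches.append(offset)
        offset += length
    return mismatches


# Tiny in-memory demo: two data members and one parity member, 4-byte blocks.
d0 = b"\x01\x02\x03\x04" + b"\x10\x20\x30\x40"
d1 = b"\x05\x06\x07\x08" + b"\x50\x60\x70\x80"
parity = bytes(a ^ b for a, b in zip(d0, d1))
stale = parity[:4] + bytes(4)  # second stripe was never updated on this member

healthy = parity_mismatches([io.BytesIO(m) for m in (d0, d1, parity)], 4)
degraded = parity_mismatches([io.BytesIO(m) for m in (d0, d1, stale)], 4)
```

A long run of mismatches on a set of otherwise healthy drives suggests that one member went stale. The XOR alone does not say which member it is, so that step needs further analysis of the data on each drive.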

To create the images, DTI Data has an inexpensive solution on our web site.  The software was designed and written for the technician who images multiple drives to a single drive.  It is as easy as mounting the drives, selecting the source drives and the destination drive, and then walking away.  The software will not only image the drives but also map all the bad sectors it finds and generate a comprehensive report.
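For readers who want to see what imaging with a bad sector map involves, here is a minimal Python sketch. It is not DTI Data's imaging software. It copies a source in chunks, retries a failed chunk one sector at a time, writes zeros in place of anything unreadable, and returns the list of bad sector numbers. The function name `image_drive`, its `read_at(offset, length)` callable, and the sector and chunk sizes are assumptions for illustration.

```python
import io


def image_drive(read_at, total_bytes, dest, sector=512, chunk_sectors=128):
    """Copy the source into dest; return the sector numbers that were zero-filled."""
    bad_sectors = []
    chunk = sector * chunk_sectors
    offset = 0
    while offset < total_bytes:
        length = min(chunk, total_bytes - offset)
        try:
            data = read_at(offset, length)
            if len(data) != length:
                raise OSError("short read")
        except OSError:
            # Fall back to sector-by-sector reads to isolate the bad sectors.
            data = bytearray()
            for pos in range(offset, offset + length, sector):
                size = min(sector, offset + length - pos)
                try:
                    piece = read_at(pos, size)
                    if len(piece) != size:
                        raise OSError("short read")
                except OSError:
                    piece = bytes(size)  # keep later offsets aligned in the image
                    bad_sectors.append(pos // sector)
                data += piece
        dest.write(data)
        offset += length
    return bad_sectors
```

Zero-filling keeps every later byte at its original offset in the image. That alignment matters when the images are later assembled into a virtual RAID 5 array.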

These are just a few things that you can do to detect a RAID controller failure.  DTI Data offers a comprehensive set of tools that will check all aspects of a RAID 5 array.  These tools are all on our website and will hopefully be a useful addition to your tool set.


Source: dtidata - Recovering from a RAID Controller Failure