30 December 2008

The Mysterious Failed Backup Job

"Tape labeling is not working." A quick look shows it is more serious: the certified-compliant-Mission-critical-Tape-backup? FAIL.

So, the tape drive is connected to the fiber card, the fiber card is connected to the server, the server is running the backup software Legato -- a Windows port of Legato at that.

So make a guess:
a) the Legato application - a Windows port of a UNIX application
b) a random Windows patch
c) the 3+ year old server
d) the fiber card
e) the $15000 tape robot
f) the drives in the robot
g) none of the above

Whenever a tape labeling happens, this mysterious error appears in Legato...

check_cdi_infop and cdi_open failed

Try Googling that. Yes, that is right, it IS cold out there in Legato land. I briefly consider calling Dell tech. But I know a "server specialist" will blame the tape robot, the "storage specialist" will blame the application, and the Dell Legato specialists -- let me get out the genie lamp.

Let's get ready to grind it out:

- Google does not help, so now I have to actually look at the Legato logs.

- Instead of looking at the logs, I Google some more.

- Okay, no help at all. So, now I have to check the logs.

FINDING #1: In the logs, the errors are only happening on one of the two drives.
COROLLARY: The application can't be the issue, as everything works on one of the drives. So, I do not have to rub the lamp really really hard after all. Goodbye Dell Legato specialist, it is as if you never existed!
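
For anyone who would rather not eyeball the logs, the per-drive tally behind Finding #1 can be sketched in a few lines of Python. This is only a sketch under assumptions: it supposes each relevant log line carries a Windows tape device path such as \\.\Tape0, which may not match what Legato actually writes, and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumption: log lines that mention a drive contain a Windows tape device
# path such as \\.\Tape0. The real Legato log layout may differ.
DEVICE_RE = re.compile(r"\\\\\.\\Tape\d+", re.IGNORECASE)

# The error text quoted in the post.
ERROR_TEXT = "cdi_open failed"


def count_errors_per_drive(lines):
    """Return a Counter mapping device path -> number of error lines."""
    counts = Counter()
    for line in lines:
        if ERROR_TEXT not in line:
            continue
        match = DEVICE_RE.search(line)
        counts[match.group(0) if match else "unknown"] += 1
    return counts


# Made-up sample lines, for illustration only.
sample = [
    r"12/30/08 09:14:02 \\.\Tape0: label operation completed",
    r"12/30/08 09:15:40 \\.\Tape1: check_cdi_infop and cdi_open failed",
    r"12/30/08 09:21:13 \\.\Tape1: check_cdi_infop and cdi_open failed",
]
result = count_errors_per_drive(sample)
```

If every error lands on a single device, the application is an unlikely suspect, which is the same reasoning as the corollary above.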

- So, if it is hardware, to the eventvwr we go.
- Lovely event viewer, so full of... nothing.
- Check device manager out of quiet desperation, as device manager ALWAYS reports a device as working, no matter what.
- But hold on...

FINDING #2?: There is a beautiful, glorious yellow dot and exclamation point on a PCI bus. Could it be that device manager is actually telling me something?

Unfortunately, a reboot clears it out. And now EVERYTHING is working: both drives, the labeling, the certified-compliant-Mission-critical-Tape-backup. Until...

The next hefty backup job and FAIL.

Once again the tape drive gets crunked and the yellow dot and exclamation point appear in device manager.
COROLLARY: Nothing like an intermittent error on a mission-critical server.

TO REVIEW: Could be tape drive, robot, fiber card, server, or mysterious Windows patch. Could not be the application.

- Troubleshoot the yellow dot and exclamation point. But no success. The specific error code 12 yields nothing.
- Reboot again. I mean really why not?
- During BIOS post, this appears:

FINDING #3: PCI initialization error
COROLLARY: An error in the server BIOS eliminates the "storage specialist". The Windows patch is not to be blamed, and the robot has not yet noticed the server is alive. It must be the hardware on the server: either the PCI bridge or the motherboard.

- A call to Dell, and the Dell rep reasonably decides to send out the PCI bridge, being that it is much cheaper than a server motherboard.
- I take the delivery Parts Only, no Dell tech needed -- 'cause I walk like that. Well, actually, it is because Dell gunked the extended warranty, and I had no choice, unless I waited a couple of days for the sales department to confirm to service that I did have an extended warranty.
- I replace the PCI bridge and discover the answer is

"the 3+ year old server".

