Sysadmin Stories: Blaming it on the hardware

by Stephen on October 19, 2009

in Sysadmin Stories

1

From: kelley@epg.nist.gov (Mike Kelley)
Organization: NIST

We have a cluster of HP workstations and, once upon a time, were using
1/4-inch tape as the backup medium. This was very slow and cumbersome, as
we were forever increasing the amount of disk space on our system, and
we decided to purchase HP’s optical jukebox to use both as large
removable media and as the primary backup device.

We had been experiencing occasional problems with the 1/4-inch tape
backups, but HP’s hardware service engineer convinced us that the
problems were resolved. A complete backup was performed prior to
installation (by the HP engineer) of the jukebox. Two unfortunate
things happened. First, the problems on our backup tapes were due to
intermittent hardware problems on the tape drive which were not
discovered by the extensive diagnostics performed on the tape drive.
Second, the engineer installed the jukebox with the same hardware SCSI
address as our root file system.

As you may have anticipated, the attempt to mediainit the first
optical cartridge resulted in a rather ungraceful failure of the root
file system. This was compounded by the fact that much of the data on
the backup tapes was not recoverable.

2

From: robjohn@ocdis01.UUCP (Contractor Bob Johnson)
Organization: Tinker Air Force Base, Oklahoma

We had an operator lay a book on the console keyboard, throwing the console
into system monitor mode. This stops the system clock, which locks every
session dead in its tracks. At that time we had over 100 user sessions
running. Most of our inbound lines are essentially modem lines on a very
large “rotor”. After their session hung for a minute or so, many users
disconnected and called back. They got connected, but received no login
prompt (the system was in a sort of suspended animation). Little did they
know that they were now on a different port than the one they just abandoned.

A call to the computer room soon identified the problem, and the operator was
given the commands to resume normal system operation. As near as we can
figure, somewhere around half of the users had disconnected but the system
didn’t notice because it never saw carrier drop on those ports (being dead).
New, different users had now connected to those ports. We received several
semi-confused user calls, realized what had happened and invoked the magic
“/etc/shutdown NOW” command. The procedure (should this ever happen again)
will be to manually panic the system and reboot. I also surgically removed
the keycap from that particular key on our terminal – you have to work to
press it now!

3

From: stehman%citron.cs.clemson.edu@hubcap.clemson.edu (Jeff Stehman)
Organization: Clemson University

Many years ago a tiny little college in the middle of nowhere purchased an
NCR tower, then a newfangled contraption. A half-dozen of us were using it
for an assembly class. The prof should have made his warnings about TRAP a
little more clear. One student ran his program and it suddenly began
spawning processes, rapidly filling the machine. The prof came in, amused,
logged on as superuser, and killed a process. Another process was
immediately spawned. The prof tried again. He was ignored. He was also no
longer amused. After several minutes he gave up and turned off the box.
The tower didn’t even flinch. He pulled the plug. Nothing. He ripped the
back off the box and dug around. Finally he found the fuse and pulled it,
killing the machine. Some of us later claimed we heard laughter as it went
down.

Many times since then I have wished other computers came with a backup
battery as standard issue.

4

From: pinard@IRO.UMontreal.CA (Francois Pinard)
Organization: Université de Montréal

Many things happened in those many years I’ve been with computers.
The most horrifying story I’ve seen is not UNIX related, but it is
certainly worth a tale. Here it goes.

This big (:-) CDC 6600 system was bootable from tape drive 0, using
those 12-inch reels of 1/2″ tape. The *whole* system was
reloaded anew from the tape each time we restarted the machine,
because there was no permanent file system yet; the disks were not
meant to retain files through computer restarts (unbelievable today, I
know :-). The deadstart tapes (as they were called) were quite
valuable, and we kept at least a dozen backups of them, going
back maybe one or two years in development.

The problem was that the two vacuum capstans which drove the tape on
drive 0, near the magnetic heads, were not perfectly synchronized, due
to a hardware misadjustment. So they stretched the tape while
reading it, wearing it in a way invisible to the eye but
nevertheless making the tape unrecoverable. Apart from that, everything
looked normal in the tape drive’s physical and electrical operation. Of
course, nobody knew about this problem when it suddenly appeared.

All this happened while the whole system administration team was on
vacation at the same time. Not being a traveler, I just stayed
available “on call”. The knowledgeable operators were able to solve many
situations, and being kind to me (as I was to them :-), they would
not disturb me just for a non-working deadstart tape. Further, they
had a full list of all deadstart backup tapes. So they first tried
(and destroyed) half a dozen backups before turning the machine over to
the hardware guys, who destroyed a few more themselves.

The technicians had their own systems for diagnostics, all bootable
from tape drive 0, of course. They had far fewer backups than we did.
They destroyed almost all of them before calling me in. Once told what
had happened, my only suggestion was to alter the deadstart sequence so
as to be able to boot from another tape drive. Strangely enough, nobody
had thought of it yet. In those old times, software guys were always
suspecting hardware, and vice versa :-).

Happily enough, the few tapes left started, both for production and
for the technicians. Tape drive 0 now being quite suspect, the
technicians finally discovered the problem and repaired it. My only
job left was to upgrade the system from almost one year back, before
turning it over to operations. This was at the time, now seemingly lost,
when system teams were heavily modifying their operating system
sources. This was also the time when everything not on big tapes was
on punched Hollerith cards, the only interactive device being the
system console. It took me many days, alone, with the machine in
standalone mode. The crowd of users stopped regularly at the windows
of the computer room, taking bets, as they used to do, on how
fast I would get the machine back up (some of my supporters
lost their money this time :-).

This was quite hard work for me, done under high pressure. When the
remainder of the staff returned from their trips and I told them the
whole tale, we decided never to synchronize our holidays again.

5

From: ravi@usv.com (Ravi Ramachandran)

At one time, there were three of us working on a unique SVR3.2
Motorola-based machine, on an R&D project. I took care of all the SysAdmin
tasks, I had a backup administrator, and the third person had been stuck
into my group (company politics). The group project files were in /user and
the individual ones in /user2. We had managed to get backups from the
operations department for /user only (not even /; security paranoia?).
Anyway, I had another SCSI hard disk that I used for making a disk copy
of the primary SCSI hard disk every Friday. This disk was connected, but
not mounted, so that I could do the disk backup from my desk when I wanted
to. This machine used to sometimes get a SCSI error such that you could
not log in, but the processes already running on the machine were not
affected. If you were logged in on the console, you just powered off the
machine for a few minutes and rebooted it.

Around holiday time the other admin was off on a long vacation. I had taken
Monday off, and headed off for a four-day weekend. The machine does the
same blurp. The third person decides to power off the machine & turn it
back on immediately. It does not come up properly. She decides to reinstall
the machine using the installation tape that I had unfortunately left in
the open. Reformats the hard disk, installs the base system, and is stuck at
that point when I come back in on Tuesday. I almost blow a blood vessel but
try to keep calm ’cause I had made a disk copy about 10 days before (too
anxious to get on my holiday the previous week). Try to mount the disk… hit
vacuum. Try using dd to look at the disk… Seemed to be a large /dev/null 😕
When the lady decided to reinstall the system, it asked her which SCSI disks
she wanted to reformat, and she said “y” for both 0 & 1!! All my
sample/trial&error work for a year had bitten the dust.
My only (small) consolation was that I was not the only one affected.

6

From: williams@nssdcs.gsfc.nasa.gov (Jim Williams)
Organization: NASA Goddard Space Flight Center, Greenbelt, Maryland

Story One is about The Sun 3/260 That Froze Solid. One day a user
reported that the Sun 3/260 he was using was “dead”. On inspection, I
found the Sun at the console prompt and the keyboard totally
unresponsive. The L1-A sequence did nothing. So I power cycled it.
Nothing. A blank screen, no activity. I was ready to call service,
then decided to try rebooting with the normal/diag switch set to diag.
On looking at the back of the pedestal, I saw that the ethernet cable
had been pressed up against the reset switch! ARGGGHHHH! The user
had pushed the machine back just enough to press the switch and keep
it pressed. (I don’t recall if there was a “watchdog reset” message
on the console when I found it, but I was new enough to Suns that that
would not have been a dead giveaway.)

Story Two involved connecting an HP laserjet to a Sun 3/280. This
sucker just would NOT do flow control correctly. I put a dumb
terminal in place of the HP and manually typed ^S/^Q sequences to
prove that the serial port really was honoring X-ON/X-OFF. But for
some reason the ^Ss from the HP didn’t “taste right” to the Sun, which
ignored them. Switching the HP serial port between RS422/RS232 had no
effect. It eventually turned out to be some sort of flakiness with
the Sun ALM-II board. Everything worked fine after I moved the
printer to one of the built-in Zilog ports. Death to flakey hardware…

7

From: ken@sugra.uucp (Kenneth Ng)
Organization: Private Computer, Totowa, NJ

In article <1992Oct16.152629.29804@nsisrv.gsfc.nasa.gov: williams@nssdcs.gsfc.na
[story about connecting HP LJ to a Sun 3/280 with an ALM-II board deleted]

ARRRGGGHHH!!!! DEATH TO ALM-II BOARDS! Funny though, I do have an HPLJ-2
hooked up to a Sun 690MP through the ALM-2 boards without problems. However,
I also had Sun going up the wall along with me over an Okidata 320 printer
that would hang the port until we rebooted the machine (not a nice thing to
do with a dozen stockbrokers). Funny thing is, we had ANOTHER Okidata 320
printer attached to the same Sun on another ALM-2 port, no problem with that
one. Hm, switch the printers, no change. Switch the cables, no change.
Switch the ports, no change. Weird. Finally discovered it was the DATA that
was being sent. The printer with problems was a label printer, which was
sending a control-S every 10-20 characters or so to pause the Sun.
Apparently the Sun ALM-2 drivers cannot handle control-S's arriving too
frequently.

No problem, Sun said, just switch to hardware flow control. Puzzled me,
because my docs said the ALM boards had no hardware flow control. But his
docs said it was there. Took the printer off line, started the lpd, data
scope showed the data going out. Talked to Sun again, tried RTS-CTS, DTR,
'crtscts' in printcap, '-crtscts' in printcap. Tried all kinds of
combinations. Finally he asked me which ALM-2 port I was using. 13, I
responded. Oh, ALM-2 boards only have hardware flow control on the first
four ports. Whoops :-). Both docs were true: my docs said there was no
hardware flow control, which was right for the last 12 ports. His docs said
that there was hw flow control, but he missed the 'on the first four ports'
part. Now it works, and I hope Sun now has this better documented.
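For readers who have not met software flow control: the control-S and
control-Q in the two stories above are the XOFF and XON characters. The
receiver sends XOFF to ask the sender to pause and XON to let it resume. The
following is only a minimal simulation of that handshake, not the ALM-2
driver's logic; the function name, the tick model, and the `incoming` input
are assumptions made up for illustration.

```python
XON = 0x11   # Ctrl-Q (DC1): receiver is ready, sender may resume
XOFF = 0x13  # Ctrl-S (DC3): receiver is busy, sender should pause


def transmit(data, incoming):
    """Simulate a sender that honors XON/XOFF (hypothetical helper).

    data:     bytes the sender wants to transmit.
    incoming: dict mapping a tick number to the control byte that arrives
              from the receiver at that tick (made-up test input).
    Returns one entry per tick: the byte sent, or None while paused.
    """
    timeline = []
    paused = False
    pos = 0
    tick = 0
    last_event = max(incoming, default=-1)
    while pos < len(data):
        ctrl = incoming.get(tick)
        if ctrl == XOFF:
            paused = True
        elif ctrl == XON:
            paused = False
        if paused:
            if tick > last_event:
                # No XON will ever arrive: a real port would simply hang.
                raise RuntimeError("paused with no XON pending")
            timeline.append(None)
        else:
            timeline.append(data[pos])
            pos += 1
        tick += 1
    return timeline
```

For example, `transmit(b"ABCD", {1: XOFF, 3: XON})` sends one byte, idles
for two ticks, then sends the remaining three.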

8

From: gary@resumix.portal.com (Gary M. Lin)
Organization: Resumix Inc.

My company markets turnkey solutions for resume-processing, so most of our
customers are non-technical HR recruiters. We contract third-party field
service to a fairly recognizable name in the industry.

I received a call from an irate user who noticed intolerable delays after
some upgrades were done to the customer’s branch offices. His ELC would use
dial-up to establish a link before running software off the server in a
different site.

He attributed the delay to slow dial-up links and software changes, but then
the customer mentioned that quitting WordPerfect and switching to our
application took over an hour. I asked what the system was doing during that hour.
He replied the disk was constantly spinning. Puzzled, I checked his swap,
which was more than sufficient. Then finally I noticed his ELC booted with
only 4 meg of memory.

I think the field technician swapped their CPU board a month ago and forgot to
move the SIMMs over. The worst part of it was the customer went on with this
situation for a month before bringing it to our attention!

Moral of the story: Check that the service guy puts everything back in.

9

From: greep@Speech.SRI.COM (Steven Tepper)
Organization: SRI International

I once had problems with files that mysteriously refused to stay
changed for very long. It was a PDP-11 Unix system that had crashed,
and I brought it up single-user. I would change some file and it
would stay changed for a minute or so but then revert to its earlier
state (contents, protection mode, etc). What happened was that the
write-protect switch on the disk drive had gotten bumped into the “on”
position but the device driver failed to report any write errors. As
long as the data stayed in kernel buffers the changes “took”, but they
would disappear once the buffers were reused and the system had to
reread the disk.
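
The effect can be reproduced with a toy model. This is only a sketch under
stated assumptions: the class names, the two-buffer LRU cache, and the block
contents are all made up for illustration and are not the PDP-11 kernel's
actual buffer code. The "disk" silently discards writes, as the driver in
the story did.

```python
from collections import OrderedDict


class WriteProtectedDisk:
    """Hypothetical disk with write-protect on; the driver drops writes
    silently instead of reporting an error."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)

    def read(self, n):
        return self.blocks[n]

    def write(self, n, data):
        # The hardware refuses the write, but no error reaches the caller.
        return None


class BufferCache:
    """Tiny LRU cache of disk blocks, standing in for kernel buffers."""

    def __init__(self, disk, capacity):
        self.disk = disk
        self.capacity = capacity
        self.buffers = OrderedDict()

    def _insert(self, n, data):
        self.buffers[n] = data
        self.buffers.move_to_end(n)
        while len(self.buffers) > self.capacity:
            old_n, old_data = self.buffers.popitem(last=False)
            self.disk.write(old_n, old_data)  # flushed on reuse, silently lost

    def read(self, n):
        if n in self.buffers:
            self.buffers.move_to_end(n)
            return self.buffers[n]
        data = self.disk.read(n)
        self._insert(n, data)
        return data

    def write(self, n, data):
        self._insert(n, data)


def demo():
    disk = WriteProtectedDisk({0: "old contents", 1: "other", 2: "more"})
    cache = BufferCache(disk, capacity=2)
    cache.write(0, "new contents")
    before = cache.read(0)  # served from the buffer: the change "took"
    cache.read(1)
    cache.read(2)           # block 0's buffer is reused here
    after = cache.read(0)   # reread from the unchanged disk
    return before, after
```

Calling `demo()` returns the new contents for the first read and the old
contents for the second, which is the file "reverting" a minute later.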
