Sysadmin Stories: Miscellaneous

by Stephen on October 19, 2009

1

From: hirai@cc.swarthmore.edu (Eiji Hirai)
Organization: Information Services, Swarthmore College, Swarthmore, PA, USA

We were running system software with a serious bug: if anyone
logged out ungracefully, the system wouldn’t let any more users log in,
and users who were already logged on couldn’t execute any new commands. (A
later release of the software did fix this bug.) I had to reboot
the machine to restore the system to a sane state. I did a wall <

2

From: robjohn@ocdis01.UUCP (Contractor Bob Johnson)
Organization: Tinker Air Force Base, Oklahoma

Management told us to email a security notice to every user on our
system (at that time, around 3000 users). A certain novice administrator
on our system wanted to do it, so I instructed them to extract a list of
users from /etc/passwd, write a simple shell loop to do the job, and
throw it in the background. Here’s what they wrote (Bourne shell)…

for USER in `cat user.list`; do
mail $USER

3

From: Iain.Lea%anl433.uucp@Germany.EU.net (Iain Lea)
Organization: ANL A433, Siemens AG., Germany.

I used to work at Siemens R&D in Erlangen (33000 people out of 115000
population work at Siemens – 12000 in the R&D area). We were working
on a project porting an ISO FTAM implementation in Ada to C.

About 2 months into the project we received a new project leader who
decided there were too few people working on the project (sigh!).
Anyway we were promised that a “Spitzen Klasse” (Outstanding) SW guy
was being sent over from the next lab.

The fateful day turned up (had to be a Monday) and there was our very
own ‘Einstein’. We gave him a tour of the lab (i.e. coffee machine on
the left, laser on the right etc.), finally getting to our work area.
We had a couple of fast 386s (this happened in ’89) running Xenix 386.
We told Einstein that I was the sysadmin for both machines and that if
*anything* was strange or not working to speak with me. OK, so the first
morning went off without a hitch and we all went to get something to eat
around midday. All except Einstein, who said he wanted to check a few
things out (code practices etc., we thought – it turned out to be page 3 of
that month’s Playboy).

We came back from eating to find Einstein twiddling his thumbs and
saying that he could no longer log in on either machine. Ermmm…

I asked him if *anything* had happened while we were away. He thought
and thought and then said “Nothing really but the lights went out for
a few minutes”. OK I thought “fsck the disks, remount them and away
we go” but then I stopped and asked him again “Anything else?”. He
then really started looking around and found the palms of his hands
the most interesting thing he’d ever seen. He answered “Well I know
a little about Unix and fsck is the ‘ajax’ cleaning program of Unix
so when it started again after the lights came back on it started
fsck and asked me for a scratchpad file. I just took the one it
printed on the line above!” (i.e. the name of the filesystem to clean).

Another comment he made was “Must be a fast machine as fsck ran quick”.

Bad enough, you might say, but then he told me he had done the same thing
to our backup machine.

Needless to say Einstein & our project leader exited stage left…

And we eventually got a backup tape from our data safe stored at
another lab. The SW guy is kind of a living legend around here :-)

4

From: rca@Ingres.COM (Bob Arnold)
Organization: Ask Computer Systems Inc., Ingres Division, Alameda CA 94501

Many moons ago, in my first sysadmin job, learning via “on-the-job
training”, I was in charge of a UNIX box whose user disk developed a
bad block. (Maybe you can see it already …)

The “format” man page seemed to indicate that it could repair bad
blocks. (Can you see it now?) I read the man page very carefully.
Nowhere did it indicate any kind of destructive behavior.

I was brave and bold, not to mention boneheaded, and formatted the user disk.
Heh.

The good news:
1) The bad block was gone.
2) I was about to learn a lot real fast :-)
The bad news:
1) The user data was gone too.
2) The users weren’t happy, to say the least.

Having recently made a full backup of the disk, I knew I was in for a
miserable all-day restore. Why all day? It took 8 hours to dump
that disk to 40 floppies. And I had incrementals (levels 1, 2, 3, 4,
and 5, which were another sign of my novice state) to layer on top
of the full.

Only it got worse. The floppy drive had intermittent problems reading
some of the floppies. So I had to go back and retry to get the files
which were missed on the first attempt.

This was also a port of Version 7 UNIX (like I said, this was many
moons ago). It had a program called “restor”, primordial ancestor of
BSD’s “restore”. If you used the “x” option to extract selected files
(the ones missed on earlier attempts), “restor” would use the *inode
number* as the name of the extracted files. You had to move the
extracted files to their correct locations yourself (the man page said
to write a shell script to do this :-(). I didn’t know much about shell
scripts at the time, but I learned a lot more that week.
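
The kind of relocation script the man page asked for might look roughly
like the sketch below. It is not the script from that week: the function
name relocate_by_inode and the map file format (one "inode path" pair per
line, built from whatever listing of the dump is available) are assumptions
made for illustration.

```sh
# relocate_by_inode MAPFILE SRCDIR DESTROOT
# MAPFILE (hypothetical format): one "inode-number relative/path" pair per line.
# SRCDIR holds extracted files named by inode number; each one is moved to
# DESTROOT/relative/path. Files not found in SRCDIR are reported and skipped.
relocate_by_inode() {
    map=$1 src=$2 dest=$3
    missing=0
    while read -r inode path; do
        # Skip blank lines in the map.
        [ -n "$inode" ] || continue
        if [ -f "$src/$inode" ]; then
            mkdir -p "$(dirname "$dest/$path")"
            mv "$src/$inode" "$dest/$path"
        else
            echo "missing: inode $inode ($path)" >&2
            missing=$((missing + 1))
        fi
    done < "$map"
    # A non-zero status tells the caller another extraction pass is needed.
    [ "$missing" -eq 0 ]
}

# Example (placeholder paths):
#   relocate_by_inode inode.map /mnt/extracted /users
```

The non-zero exit status is there because of the flaky floppy drive: any
inode reported as missing is a candidate for the next retry.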

Yes, it took me a full week, including the weekend, maybe 120 hours or
more, to get what I could (probably 95% of the data) off the backups.
And there were a few ownership and permissions problems to be cleaned up
after that.

Once burned, twice shy. This is the only truly catastrophic mistake I’ve
ever made as a sysadmin, I’m glad to be able to say.

I kept a copy of my memo to the users after I had done what I could.
Reading it over now is sobering indeed! I also kept my extensive notes
on the restore process – thank goodness I’ve never had to use them since.

5

From: jimh@pacdata.uucp (Jim Harkins)
Organization: Pacific Data Products

A friend of mine admins an RS6000 for a state college. The weekend before
the fall semester started the Powers That Be decided to physically move the
system to a different room. She stayed late Friday night, moved the machine,
and then it wouldn’t boot. I was in Sunday afternoon looking at it, wouldn’t
boot for nothing. Monday morning, first day of classes, an IBM rep comes in
and reformats the hard disk without telling her. Turns out this was the
machine all the professors were doing their class plans on. So not only
couldn’t they have them printed out, but when school started Monday morning
the teachers discovered they had lost all the work they’d done in the week
before school started. Seems she never did backups because the teachers
always bitched about how slow the system was when she did, and she hadn’t
learned about cron yet (I told her about that one).

In her defense, she’d only been using the RS6000 for less than a month before
this happened. She didn’t know UNIX. She hadn’t had any training. She
still had her regular job to do.

To make things worse, when she called me Monday night she was in tears as
she told me how she had to personally visit all the professors and tell them
their work was gone. I blurted out “Stupid of you not to make backups”. Here
she is looking for a shoulder to cry on and I go and tell her the same thing
everybody from the department chair on down to the janitor had been saying.
Oops.

The moral? If you appoint someone to admin your machine you better be willing
to train them. If they’ve never had a hard disk crash on them you might want
to ensure they understand hardware does stuff like that. I also found out
she was unplugging and plugging cables all over the place without powering
down the system. Her hardware knowledge was essentially “this thing goes into
the wall, then the lights blink”.
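
For readers who have not met cron either: a backup that runs unattended at
night, when nobody is around to notice the slowdown, can be sketched as
below. This is only an illustration. The function name backup_dir, the
archive naming, the script path and the 02:30 schedule are all assumptions,
and it relies on a tar that supports -C (GNU and BSD tar do).

```sh
# backup_dir SRC DESTDIR
# Writes DESTDIR/backup-YYYYMMDD.tar containing SRC and prints the archive
# name. All paths are placeholders.
backup_dir() {
    src=$1 destdir=$2
    stamp=$(date +%Y%m%d)
    archive="$destdir/backup-$stamp.tar"
    mkdir -p "$destdir" || return 1
    # -C makes the paths stored in the archive relative to SRC's parent.
    tar -cf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    echo "$archive"
}

# Hypothetical crontab entry for a script that calls backup_dir,
# run at 02:30 every night:
# 30 2 * * * /usr/local/sbin/nightly-backup.sh
```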

7

From: rick@sadtler.com (Rick Morris)
Organization: Sadtler Research Laboratories

Slightly off the subject, but not too far off, is the phenomenon of “Sysadmin
Wannabees.” I’ve been Sys Admin of UNIX at 3 sites now. The phenomenon has
occurred at all three.

You are talking to a fellow programmer, or a programmer is within ear shot.
A new user (or even an old user) comes up to you and asks something like:
“How would I list only directory files within a directory?”

Now it has been my experience that the question is not complete. Is this a
recursive list? Is this a “one-time” thing, or are you going to do it many
times? Is it part of a program? (Sometimes the answer to a question like
this ends up in a C program as a system(3) call rather than a preferred
library call.) Anyway, as you ponder the question, the many alternatives (in
unix there’s always another way), the questioner’s experience, whether or not
they want a techie answer or a DOSie answer, the programmer within ear shot
pipes in with an answer of how *THEY* do or would do it.
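
For the record, here is one of the many possible answers, assuming that a
non-recursive listing in a Bourne-style shell is what was wanted. The
function name list_dirs is made up for this sketch; it skips hidden
directories and also reports symlinks that point at directories.

```sh
# list_dirs [DIR]
# Prints the names of the immediate subdirectories of DIR (default: .),
# one per line. Hidden directories are not matched by the */ pattern.
list_dirs() {
    dir=${1:-.}
    for d in "$dir"/*/; do
        # If nothing matches, the pattern is left unexpanded; skip it.
        [ -d "$d" ] || continue
        d=${d%/}            # drop the trailing slash
        echo "${d##*/}"     # drop the leading path
    done
}

# Example: list_dirs /usr/local
```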

It is invariable. It happens every time. I don’t think I take all that
long to answer. But the Wannabee answer is rapid. Like the kid in class
who raises his hand going “oo” “oo” “oo”.

I have seen my predecessors get all bent out of shape when the Sysadmin
Wannabees jump on their toes. I usually let the answer proceed, indeed,
often these Wannabees give a complete answer, even doing it for the
questioner. After a bit I return to the questioner and ask if the question
was properly answered, if they understand the answer, or if they want any
more information. It also shows me how deeply the Wannabee understands
just what is going on inside that pizza box.

Have any other of you sys admins seen this phenomenon, or is it my slow
pondering of potential answers that drives the Wannabee to jump in?

8

From: rslade@cue.bc.ca (Rob Slade)
Organization: Computer Using Educators of B.C., Canada

I had a job one time teaching Pascal at a “visa school”. The machine was a
multi-user micro that ran UNIX. I have enough stories from that one course
to keep a group of computer educators in stitches for at least half an hour.

The finale of the course was on the last day of classes. When I showed up
and powered up the system, it refused to boot. Since all the students’ term
projects and papers were in the computer, it was fairly important. After
a few hours of work, and consultation with the other teacher, who did the
sysadmin and maintenance, we were finally informed that the new admin
assistant around the place had decided that the layout of the computer lab
was unsuitable. (I had noticed that all the desks were repositioned: I thought
the other teacher had done it, he thought I had.) The AA had, the night
before, moved all the furniture, including the terminals and the micro. She
did not know anything about parking hard disks.

We knew now that we were in trouble, but we didn’t realize how much until
we started reading up on emergency procedures. For some unknown reason,
booting the micro from the original system disks would automatically reformat
the hard disk.

(The visa school refunded the tuition for all the students in that course.)

9

From: corwin@ensta.ensta.fr (Gilles Gravier)
Organization: ENSTA, Paris, France

I am sysadmin at my office… I won’t name it, because that’s not
the subject… Of course, UNIX is my cup of tea… But, at home, I have an
MS DOS machine… As old habits die hard, I have set up MKS toolkit on my home
PC… And, as I have a C:\TMP directory where Windows and other applications
put stuff, that remains, as I sometimes have to reboot fast… (ah, the fun
of developing at home!)… So, in my AUTOEXEC.BAT file, I have the following:
rm -rf /tmp
mkdir c:\tmp
the recursive rm coming from MKS, and mkdir from horrible MSDOS.

At the time, I didn’t have a tape streamer on my PC… I was working,
and the mains went down… so did the PC. Windows was running, \TMP full
of stuff… So, when power comes back on, rm -rf /tmp has things to do…
While it’s doing those things, power goes down again (there was a storm).
Power comes back up, and this time, it seems that the autoexec takes really
too much time… So, I Control-C it… And, to my horror, realize that I don’t
have C:\DOS, C:\BIN or C:\USR anymore, and that my C:\WINDOWS was quite depleted…

After some unsuccessful investigation, I did the following: cd \tmp
and then DIR… And there, in C:\TMP, I find my C:\ files! The first power
down had resulted in the cluster number of C:\ being copied to that of C:\TMP,
actually resulting in a LINK! (Now, this isn’t supposed to happen under MSDOS!)
I had to patch the DIRECTORY cluster to change TMP’s name, replacing the
first T with the letter Sigma, so that DOS thought that TMP wasn’t there anymore,
then do a chkdsk /F, and then undelete the files that I could… And rebuild
the rest…

10

From: gert@greenie.gold.sub.org (Gert Doering)

I was on a 5-day vacation; on the first day my machine crashed…

How? Well…

cron started a shell script to extract some files from a “.lzh” archive.
LHarc found that the target file already existed, asked

“file exists, overwrite (y/n)?”

… since it was started from cron, it just read “EOF”. Tried again. Read
“EOF”. And so on.

All output went to /tmp… which was full after the file reached 90 MB!
What happened next? I’m using a SCO machine, /tmp is in my root filesystem
and when trying to log in, the machine said something about not being able
to write login information – and threw me out again.

Switched machine off.

Power on, go to single user mode. Tried to log in – immediately thrown out
again.

I finally managed to repair the mess by booting from floppy disk, mounting
(and fsck-ing) the root filesystem and cleaning /tmp/*.
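
A hedged sketch of how a cron job like this one could be kept from filling
a disk: give the command /dev/null as its standard input and cap how much
of its output is kept. The wrapper name run_unattended is hypothetical, and
head -c, while widely available, is not found on every old Unix. Once head
has its fill and exits, the next write from the runaway command normally
fails with a broken pipe, which ends the loop.

```sh
# run_unattended LOGFILE MAXBYTES COMMAND [ARGS...]
# Runs COMMAND with /dev/null as standard input and keeps at most MAXBYTES
# of its combined stdout and stderr in LOGFILE.
run_unattended() {
    log=$1 max=$2
    shift 2
    # head stops reading after MAXBYTES; a command that keeps writing then
    # gets a broken pipe instead of growing the log without bound.
    "$@" </dev/null 2>&1 | head -c "$max" > "$log"
}

# Example (placeholder names): keep at most 64 KB of output from a job.
#   run_unattended /tmp/unpack.log 65536 /usr/local/bin/unpack-job
```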
