Sysadmin Stories: How not to free up some space on your drives

by Stephen on October 19, 2009

in Sysadmin Stories


From: (Mitch Wright)
Organization: Cirrus Logic Inc.

A fellow sysadmin was looking to free up some much needed disk space. Since
it was purely a production machine I suggested that he go through and “strip”
his binaries. Unfortunately I made the assumption that he knew what strip
does and would use it wisely — flashes of the Bad News Bears come to mind
now. To make it short, he stripped /vmunix which didn’t destroy the system,
but certainly caused some interesting problems.


From: (Eiji Hirai)
Organization: Information Services, Swarthmore College, Swarthmore, PA, USA

I heard this from a fellow sysadmin friend. My friend was forced to
work with some sysadmins who didn’t have their act together. One day, one
of them was “cleaning” the filesystem and saw a file called “vmunix” in /.
“Hmm, this is taking up a lot of space – let’s delete it”. “rm /vmunix”.

My friend had to reinstall the entire OS on that machine after his coworker
did this “cleanup”. Ahh, the hazards of working with sysadmins who really
shouldn’t be sysadmins in the first place.

Moral of all these stories: if I had to hire a Unix sysadmin, the first
thing I’d look for is experience. NOTHING can substitute for down-to-earth,
real-life grungy experience in this field.


From: (David J Stevenson)
Organization: Joint European Torus

(Eiji Hirai) writes:
[story about “deleting /vmunix to save space” deleted – to save space! -ed.]

When this happened to a colleague (when I worked somewhere else) he restored
vmunix by copying from another machine. Unfortunately, a 68000 kernel does
not run very well on a Sparc…


From: smckinty@sunicnc.France.Sun.COM (Steve McKinty – Sun ICNC)
Organization: SunConnect

(Eiji Hirai) writes:
[story about “deleting /vmunix to save space” deleted – to save space! -ed.]

Hmm. A colleague of mine did much the same by accident on one of
our test machines. After discovering it, fortunately while the machine
was still up & running, he FTPed a copy of /vmunix from the other lab
system (both running exactly the same kernel).

After rebooting his machine everything (to his relief) worked fine.


From: greep@Speech.SRI.COM (Steven Tepper)
Organization: SRI International

At one place where I worked, someone had set up cron to delete any
file named “core” more than a few days old, since disk space was
always tight and most users wouldn’t know what core files were or care
about them. Unfortunately not everyone knew about this and one user
lost a plain text file (a project proposal) he’d spent a lot of
time working on because he called it “core”. This was around 1976,
when Unix was still considered exotic and before bookstores carried
entire sections of Unix books.
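
A cleanup job like this is safer when it checks what a file is rather than
what it is called. The following is a minimal sketch for a modern system
that writes core dumps in ELF format (an assumption; it does not apply to a
1976 Unix). The function names are hypothetical. A file named “core” is
treated as a real dump only if its ELF header says so, so a text file that
happens to be called “core” is left alone.

```python
import os
import struct
import time

ELF_MAGIC = b"\x7fELF"
ET_CORE = 4  # e_type value the ELF format uses for core dumps


def is_elf_core(path):
    """Return True if the file starts with an ELF header of type ET_CORE."""
    with open(path, "rb") as f:
        header = f.read(18)
    if len(header) < 18 or header[:4] != ELF_MAGIC:
        return False
    # Offset 5 (EI_DATA) gives the byte order: 1 = little-endian, 2 = big-endian.
    if header[5] == 1:
        byte_order = "<"
    elif header[5] == 2:
        byte_order = ">"
    else:
        return False
    # e_type is the 16-bit field that follows the 16-byte e_ident array.
    (e_type,) = struct.unpack(byte_order + "H", header[16:18])
    return e_type == ET_CORE


def stale_core_files(root, max_age_days=3):
    """Yield files named 'core' under root that are old AND are real core dumps."""
    cutoff = time.time() - max_age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        if "core" in filenames:
            path = os.path.join(dirpath, "core")
            if os.path.getmtime(path) < cutoff and is_elf_core(path):
                yield path
```

The generator only reports candidates; printing the list for review before
anything is removed keeps a human in the loop.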


From: (Tim Miller)
Organization: AL/HRTI, Brooks AFB

This one qualified for Stupid Act of the Month:

All this happened on my sparcII…

I was making room on / because I needed to test run something
(which was using a tmp file in, of all places, /var/tmp. I could have
recompiled the application to use more memory and/or /tmp, but I’m too
lazy for that), so I figure “I’ll just compress this, and this, and
this…” One of those “this'” was vmunix.

Well, of course the application crashes the machine, and stupid
me had forgotten that I’d compressed vmunix, so the damn thing won’t
boot. checksum: Bad value or some such error. Took me most of the day
to figure out just what I’d done to the dang thing. 8)

1) Never, ever, EVER play with vmunix.
2) Always keep a log of what you do to the root file system.


From: (Gilles Gravier)
Organization: ENSTA, Paris, France

Well, talk about horror stories… We have a Data General AViiON machine
where I work. I was doing regular admin tasks on it and decided, logged
in as root, to clean /tmp… (I can already see you laughing there!). So,
as usual, I typed “cd / tmp” then “rm *” as I was placed in / when the
dreaded rm was entered… My root directory was erased…

I realized my error fast enough… So, since I had deleted the kernel, and
the administration kernels (that both reside in /), I had to recreate a
new kernel. Luckily for me, DG/UX allows you to recreate one “on the fly”, using
parameters of the running kernel (in memory!)… So I did, and then rebooted.

Things started getting bad when I still couldn’t work on my machine, logins
didn’t work (No Shell messages…)… Until I could access the /etc/passwd
file using a trojan shell through an NFS mounted directory, and create a root
account whose shell was not /sbin/sh…

On a DG, /sbin and /bin are both links to /usr/sbin… The links were killed
when I did my “rm”…


From: (grover davidson)
Organization: CCAI

Several months ago here, we were reorganizing our disk space on an
RS/6000 with AIX 3.1. I have done this many times before, but for some
reason, I was rushing through expanding a file system. Instead of entering
the new file system size where it belongs, I entered it into the mount
point. It also turns out that I was attached 2 levels down in the file
system. Since the size was entered as a number (‘234567’) and was
INTERPRETED as a mount point directory, the result was a
circular hard link that basically left the file system unusable.
IBM was not able to help, and since we had done quite a bit of work that
day, we had to somehow recover some of the stuff. We ended up doing a dd of
the raw volume, and then read it back in a couple MB at a time and extracted
the pieces that we needed from the mess.

The other day I was reading Stevens’ new book, “Advanced Programming in
the UNIX Environment”, in which he states that he had done the exact same
thing during the preparation of his book. At least I am not alone…


From: hillig@U.Chem.LSA.UMich.EDU (Kurt Hillig)
Organization: Department of Chemistry, University of Michigan, Ann Arbor

Just so nobody gets the impression that you can only screw up
U**X systems….

Several years ago I was sysadmin for the department’s VAX/VMS system.
One day, trying to free up some space on the system disk, I noticed
there were a bunch of files like COBRTL.EXE, BASRTL.EXE etc. – i.e.
the Cobol, Basic, etc. run-time libraries. Since the only language
used was Fortran, I nuked them.

Three weeks later, a visiting professor came over from Greece for a few
weeks, mostly to do some calculations on the VAX. He got in on a Friday
morning, and started work that afternoon. About 7 PM I got a call at
home – he’d accidentally bumped the reset switch (on the VAX 3200, it
was just at knee height!) and it wouldn’t reboot. I went back in and
took a look, and the reason it wouldn’t come up was that the run-time
libraries were missing.

I ended up booting stand-alone backup from tape, dumping another data
disk to tape, restoring an old system from tape, copying the RTL’s,
then restoring the data disk from tape again – all with TK50’s. Took
me until 3 AM.


From: (Anthony DeBoer)
Organization: Geac Computer Corporation

At a former employer, I once watched our sysadmin reboot from the
distribution tape after making a typing error editing the root line in
/etc/passwd. After munging the colon count in this line, nobody could
login or su, and he hadn’t left himself in root in another session while
testing his changes (a rule I’ve adopted for myself).
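
A miscounted colon like this can be caught before the edited file goes
live. The sketch below assumes the traditional seven-field passwd layout
(login name, password, UID, GID, comment, home directory, shell) and uses a
hypothetical helper name. It checks only the field count and the numeric
IDs, and it skips NIS-style “+” and “-” entries as an assumption. It is no
substitute for keeping a root session open while testing.

```python
def check_passwd_lines(lines):
    """Return a list of (line_number, problem) for malformed passwd entries."""
    problems = []
    for number, line in enumerate(lines, start=1):
        entry = line.rstrip("\n")
        # Skip blank lines, comments, and NIS compat entries (assumption).
        if not entry or entry[0] in "#+-":
            continue
        fields = entry.split(":")
        if len(fields) != 7:
            problems.append((number, "expected 7 fields, found %d" % len(fields)))
            continue
        if not fields[0]:
            problems.append((number, "empty login name"))
        if not (fields[2].isdigit() and fields[3].isdigit()):
            problems.append((number, "UID and GID must be numeric"))
    return problems
```

Run against a scratch copy of the file (for example, a hypothetical
/etc/passwd.new) and install the copy only when the returned list is empty.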

My “big break”, the moment I became sysadmin, was partly by virtue of
being the only one to ask him for the root password the day he went out
the door for the last time.

What I’ve found preferable, when wanting to set up an alternative shell
for root (bash, in my case), is to add a second line in /etc/passwd with
a slightly different login name, same password, UID 0, and the other
shell. That way, if /usr/local/bin/bash or /usr/local/bin or the /usr
partition itself ever goes west, I still have a login with good ol’
/bin/sh handy. (I know, installing it as /bin/bash might bypass some
potential problems, but not all of them.)

This might, of course, be harder to do on a security fascist system like
AIX. Simply trying to create a “backup” login with UID 0 there once so
that the operator didn’t get a prompt and have to remember what to type
next was a nightmare. (I wound up giving “backup” a normal UID, put it
in a group by itself, and gave it setuid-root copies of find and cpio,
with owner root, group backup, and permissions 4550). BTW, this was to
make things easier for the backup operator, not to make it secure from
that person.


From: (Dave Williams)
Organization: Ericsson Network Systems

A sysadmin was told to change the root passwd on a dozen or so Sun servers
serving 400 diskless sun clients. He changed the passwd string to the wrong
encrypted string (with a sed-like string editor) and locked root out from
everywhere. Took hours to untangle.


From: (Rick Morris)
Organization: Sadtler Research Laboratories

Okay, I’ll bite. We had Zenith Data Systems’ Z-286’s, boosted to 386’s
via an accelerator (imagine a large boot stomping lots of data through
a small 16 bit funnel…). We were running SCO’s Xenix. The user filesystem
crashed in such a way that it couldn’t be repaired via fsck. fsck would
try to repair a specific file and then just stop, leaving the filesystem
dirty. The “dirty bit” in the superblock said that it couldn’t be mounted
because it was dirty. But it couldn’t be cleaned. But there was lots of
data on it and I hadn’t been doing backups because the only I/O device to
do backups was the floppy drive and I wasn’t about to sit there every night
or even once a week and slam 30 odd floppies into the drive while the backups
ran, even worse try to restore a file from a backup of 30 floppies….

Anyway, to recover the data I used fsdb to edit the superblock and change
the dirty bit to clean, mounted the disk, got off all the good data,
and remade the filesystem. Thanks, Xenix. fsck couldn’t clean it,
but you did supply fsdb! *whew*


From: (Valdis Kletnieks)
Organization: Virginia Tech, Blacksburg, VA

Well, here’s a few contributions of mine, over 10 years of hacking
Unixoid systems:

1) yesterday’s panic: Applying a patch tape to an AIX 3.2 system
to bring it to 3.2.3. Having had reasonable success at this before,
I used an xterm window from my workstation. Well, at some point,
a shared library got updated.. I’d seen this before on other machines –
what happens is that ‘more’, ‘su’, and a few other things start failing
mysteriously. Unfortunately, I then managed to nuke ANOTHER window
on my workstation – and the SIGHUP semantics took out all windows I
spawned from the command line of that window.

So – we got a system that I can login to, but can’t ‘su’ to root.
And since I’m not root, I can’t continue the update install, or clean
things up. I was in no mood to pull the plug on the machine when
I didn’t know what state it was in – was kind of in no mood to reboot
and find out it wasn’t rebootable.

I finally ended up using FTP to coerce all the files in /etc/security
so that I could login as root and finish cleaning up….

Ended up having to reboot *anyhow* – just too much confusion with the
updated shared library…

2) Another time, our AIX/370 cluster managed to trash the /etc/passwd
file. All 4 machines in the cluster lost their copies within
milliseconds. In the next few minutes, I discovered that (a) the
nightly script that stashed an archive copy hadn’t run the night before
and (b) that our backups were pure zorkumblattum as well. (The joys
of running very beta-test software).

I finally got saved when I realized the cluster had *5* machines in it –
a lone PS/2 had crashed the night before, and failed to reboot. So
it had a propagated copy of /etc/passwd as of the previous night.

Go to that PS/2, unplug its Ethernet.. reboot it. Copy /etc/passwd
to floppy, carry to a working (?) PS/2 in the cluster, tar it off,
let it propagate to other cluster sites. Go back, hook up the
crashed PS/2’s Ethernet.. All done.

Only time in my career that having beta-test software crash a machine
saved me from bugs in beta-test software. ;-)

3) Once I was in the position of upgrading a Gould PN/9080. I was
a good sysadmin, took a backup before I started, since the README said
that they had changed the I-node format slightly. I do the upgrade,
and it goes with unprecedented (for Gould) smoothness. mkfs all
the user partitions, start restoring files. Blam.

I/O error on the tape. All 12 tapes. Both Sets of backups.

However, ‘dd’ could read the tape just fine.

36 straight hours later, I finally track it down to a bad chip on the
tape controller board – the chip was involved in the buffer/convert
from a 32-bit backplane to a 8-bit I/O cable. Every 4 bytes, the
5th bit would reverse sense. 20 mins later, I had a program
written, and ‘dd | my_twiddle | restore -f -‘ running.
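
The filter itself is not shown in the post, but one like it can be sketched
as follows. This is an illustration under stated assumptions, not the
original program: it assumes the damaged byte is the last of every group of
four, and that “the 5th bit” means mask 0x10 (bit 4 counting from zero).
Both are guesses about the fault and are exposed as parameters. Because
flipping a bit twice restores it, the same filter that models the damage
also undoes it.

```python
def twiddle(data, offset=0, period=4, position=3, mask=0x10):
    """Flip `mask` in every byte whose stream index i has i % period == position.

    `offset` is the stream index of data[0], so a long stream can be
    processed chunk by chunk without losing alignment.
    """
    out = bytearray(data)
    for i in range(len(out)):
        if (offset + i) % period == position:
            out[i] ^= mask
    return bytes(out)


def filter_stream(src, dst, chunk_size=65536):
    """Copy binary stream src to dst, applying twiddle() with a running offset."""
    offset = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(twiddle(chunk, offset))
        offset += len(chunk)
```

To use it in a pipeline like the one above, a small wrapper script would
call filter_stream(sys.stdin.buffer, sys.stdout.buffer).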

Moral: Always *verify* the backups – the tape drive didn’t report a
write error, because what it *received* and what went on the tape
were the same….

I’m sure I have other sagas, but those are some of the more memorable
ones I’ve had…


From: mccalld@Sonoma.EDU

I was an engineer from the CYBER world (Control Data Corp.) when they got
involved with MIPS. They sold a contract to the Army Corps of Engineers
and I got a crash course in EP/IX, Enhanced Performance Unix, for the
San Francisco customer base. These were RISC 4000 machines with 128 MB of
memory and several 1.5 gig disks, connected to the world’s largest LAN.
One day the site administrator called me and said his machine was
continuously crashing with core dumps and many other bizarre error messages…
After arriving at the site and calling for help, it was determined that I
needed a kit of spares to swap for the problem…24 hours later a kit
arrived and all cards (3) were swapped to no avail. Software support was
then consulted and we booted to mini-root and then mounted the back door
partition into the regular root directory and went searching for the real
problem. After about 15 minutes of examining /etc it was apparent to the
support person that inittab had been deleted, and so we had to restore it
from backups. We found out later that one of the Corps network software
engineers was given su and told to learn the machine. Enough said. In this
day and age, the hardware is usually quite reliable and there are a number
of files which, if corrupted, could easily simulate a hardware failure…
MORAL: never give a network engineer the su password; he might attempt to
build bridges into nonexistent file systems, or just tear down all the
existing bridges hoping to get the bigger picture and maybe build a
better system!? Geeze.


From: Tatjana Heuser

I once thought it a nice idea to leave root *without* password at all
on my little Sun 3/50 at home. (I’m using that one to play with things
I don’t dare to mess with at work)

So I started with setting every tty including the console to insecure,
put only myself in group wheel and made sure that ftp denied access to
every account without a password.

Everything worked fine and I couldn’t imagine anything against it.

Then, after maybe a month or so, I decided for some reasons I have
entirely forgotten, to set my own login shell from /usr/local/bin/tcsh
to /bin/sh. Trying to make things as small as possible I just deleted
the entire shell entry in the passwd so /bin/sh would become the default shell.
Since a short test login in another xterm went fine, I didn’t give it
any more thought and logged off a few hours later.

Next time I wanted to su to root I was plain denied it!
(Needless to say that I was somewhat surprised)

`id` quickly revealed I had no other group than my login group
(which wasn’t wheel) -hence no su for me :-(

– booting single-user asked for the root password and wasn’t content
with a
– logging in as root had been disabled by myself
– ftp denies access to accounts without password
– I didn’t have an /.rhosts
– my tape drive stopped working (I later found out the head was blocked in a
faraway position)

Eventually I ended up inviting another 3/50 owner to my home with his
disk and booting from that one.

-since then I’ve moved experiments to diskless clients :-)


From: Tatjana Heuser

Being responsible for a small network where every single user had the root
passwd and mucked around with things (me being the lowliest person there
and not allowed to change this then) I started putting all important
configuration files under SCCS control. Of course I did this on the main
server, leaving instructions to all the other would-be administrators how
to use this. Everything went fine until all the machines were taken down
during x-mas vacation (no reboot of the server for quite some time).

Well, the first working day in January I got a phone call at the
place I spent that time. Missing /etc/rc*, the server would drop to a
desperate shell in a rather helpless state of things. :-} At my last
change of the rc’s I obviously had checked them in with the ‘delta’
command only :-( which left the original files deleted (or rather stored in
the SCCS directory) :-}

I had to return the 800 km to work a week earlier than planned.
(and learned a lot about startup :-)

No mistake any user ever made as root has ever outscored this one…
(oh yeah, extending the swap partition over the next one (almost one GB
without backup), but that was the boss of the department…)
