Sysadmin Stories: Upgrading the system

by Stephen on October 19, 2009

1

From: rsj@wa4mei (Randy Jarrett)
Organization: Amateur Radio Gateway WA4MEI, Chamblee, GA

Here’s one that will show that you shouldn’t work on a system
that you don’t thoroughly understand.

At my “previous” employer I was instructed to install a new
(larger) disk drive in a RS/6000 system. Since a full backup
of the system was done the previous day I just looked at the file
systems via df to see which were on the drive that I was replacing.
After this I did a tape backup of these filesystems, ran smit and
did a remove of these filesystems. I then installed the new disk
and brought the system back up. When I ran smit and was able
to install the new drive and set up the file systems,
I figured that this was going to be an easy one. WRONG!! I was
aware that you could expand filesystems under AIX but was not aware
that it would expand them ‘across physical drives’!!! I first
realized that I was in trouble when I went to read in the backup tape
and cpio was not found. I did an ls of the /usr/bin directory and it
said that the file was there but when I tried to run it it was not
found. And of course when I went looking for the original install tape
it was not to be found….

2

From: matthews@oberon.umd.edu (Mike Matthews)
Organization: /etc/organization

When I had first gotten my NeXTstation, it had the lil’ 105M hard drive in
it. I had a 330M external, but alas, no cable for it. (Life was not fun
when I was essentially netbooting off a “test” machine…. “.. um, guys, did
you just reboot is-next?”)

Finally got the cable, just in time for the winter holiday (read: no
network). Brought the machine home, and I figured I’d just copy the
configuration files over from the internal to the external (as a nice gesture
to my users so they wouldn’t have to change their passwords and everything).

The external was a brand new BuildDisk’d disk (had stock NeXTstep on it).
NeXT keeps the private information of each machine (/dev, /etc, stuff like
that) in a /private directory to make netbooting easier.

Hey, I’ll just move /private from the 105M to /private on the external. So I
deleted the external’s /private and tried to move it via the workspace.

/dev is in /private.

/dev contains device files. Can’t move them.

BUT. The workspace happily deleted all the files it DID copy, so the
internal couldn’t boot (no /etc) and the external couldn’t boot (no /dev).
This is before the advent of boot floppies so I was stuck for about a week at
home with $5000 of NeXT computer that I couldn’t boot.

The moral? *NEVER* move something important. Copy, VERIFY, and THEN delete.
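
The moral above can be sketched in a few lines of shell. This is only an illustration: `safe_move` is a made-up name, not a standard command, and it only handles ordinary files and directories. Device files such as those in /dev are outside what it attempts.

```shell
#!/bin/sh
# safe_move (hypothetical helper): copy a tree, verify the copy,
# and delete the source only if the verification succeeds.
safe_move() {
    src=$1
    dst=$2

    # Refuse to write over something that is already there.
    if [ -e "$dst" ]; then
        echo "safe_move: $dst already exists" >&2
        return 1
    fi

    # Step 1: copy, preserving modes and timestamps.
    cp -Rp "$src" "$dst" || return 1

    # Step 2: verify. diff -r exits non-zero if any file differs or is missing.
    if ! diff -r "$src" "$dst" >/dev/null; then
        echo "safe_move: verification failed, source left untouched" >&2
        return 1
    fi

    # Step 3: only now remove the original.
    rm -rf -- "$src"
}
```

If any step fails, the source is still intact, which is exactly what the workspace move in the story did not guarantee.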

3

From: grog@lemis.uucp (Greg Lehey)
Organization: LEMIS, W-6324 Feldatal, Germany

I’m currently trying to work out how ISC Unix/386 handles COFF files, and
discovered the /shlib directory, which I suspected wasn’t really used
(*wrong*). So, to try it out, I did:

+ root adagio:/ 819 -> mv shlib slob
+ root adagio:/ 820 -> xterm
+ /usr/bin/X11/xterm: Can not access a needed shared library

So far, so good. So, put it back:

+ root adagio:/ 821 -> mv slob shlib
+ /bin/mv: Can not access a needed shared library

Oops! So, tried it from a different system, but didn’t have
permission, so:

+ root adagio:/ 822 -> chmod 777 slob
+ /bin/chmod: Can not access a needed shared library

OK, so let’s just cp them across.

+ root adagio:/ 823 -> cd slob
+ root adagio:/slob 824 -> mkdir /shlib
+ /bin/mkdir: Can not access a needed shared library
+ root adagio:/slob 825 ->

Then I wrote a program which just did a link(2) of the directories.
Yes, gcc and ld didn’t have any problems, but even after the link was
in place, it still didn’t work. I had to reboot (but nothing else),
after which it did work. No idea why that made any difference.

4

From: erik@src4src.linet.org (Erik VanRiper)
Organization: The Source for Source

I run on a 386/25. Small system, 4 inbound lines, etc. I was installing a
new SCSI drive to complement my 2 MFM’s. Took me forever to get everything
just right. Things finally worked, so I figured I would shutdown and play
with the jumper settings to see what this thing could do. What did I do?
Well, I just turned off the power, that’s all.

erk. Just rebuilt the kernel, did not do a haltsys, or a shutdown, or anything.
Just shut the power off. ARGH! Took me 3 weeks to clean up the mess.

You tend to get in this cycle of “try” “haltsys” “power off” “change jumpers”
“power on” “try”. Well, once everything worked, I guess I was a wee bit
excited and forgot a step. :)

5

From: almquist@chopin.udel.edu (Squish)
Organization: Human Interface Technology Lab (on vacation)

Two miserable flubs:

1) /etc/rc cleans tmp but it wasn’t cleaning up directories so I changed
the line:
(cd /tmp; rm -f - *)
to
(cd /tmp; rm -f -r - *; rm -f -r - .*)

About 15 minutes later I had wiped out the hard drive.

2) One of the user discs got filled so I needed to move everyone over to
the new disc partition. So, I used the tar to tar command and flubbed:

cd /user1; tar cf - . | (cd /user1; tar xfBp - )

Next thing I know /user1 is coming up with lots of weird consistency errors and
other such nonsense. I meant to type /user2 not /user1. OOOPS!

My moral of the story is when you are doing something BIG, type the command and
reread what you’ve typed about 100 times to make sure it’s sunk in (:
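
Both flubs have a mechanical cause that a small guard can catch. In the first, the pattern `.*` also matches `..`, so the recursive rm climbs into the parent directory. In the second, source and destination were the same directory. The sketch below is one way to guard against each; `clean_dir` and `copy_tree` are hypothetical names chosen for illustration, and the tar flags are the portable ones rather than the SunOS-specific `B` used above.

```shell
#!/bin/sh
# clean_dir (hypothetical): empty a directory, dotfiles included,
# without ever naming "." or "..".
clean_dir() {
    dir=$1
    [ -d "$dir" ] || return 1
    # ".[!.]*" matches dotfiles but not "." or "..";
    # "..?*" catches odd names such as "..foo".
    for entry in "$dir"/* "$dir"/.[!.]* "$dir"/..?*; do
        # A pattern that matched nothing stays unexpanded; skip it.
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        rm -rf -- "$entry"
    done
}

# copy_tree (hypothetical): tar-to-tar copy that refuses to run
# when source and destination resolve to the same directory.
copy_tree() {
    src=$(cd "$1" 2>/dev/null && pwd -P) || return 1
    dst=$(cd "$2" 2>/dev/null && pwd -P) || return 1
    if [ "$src" = "$dst" ]; then
        echo "copy_tree: source and destination are both $src" >&2
        return 1
    fi
    (cd "$src" && tar cf - .) | (cd "$dst" && tar xpf -)
}
```

With these, `copy_tree /user1 /user1` fails with a message instead of reading and overwriting the same files at once.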

6

From: anne@maxwell.concordia.ca (Anne Bennett)
Organization: Concordia University, Montreal, Canada

After about four months as a Unix sysadm, and still feeling rather like a
novice, I was asked to “upgrade” a Sun lab (3/280 server and ten 3/50
diskless clients) from SunOS 4.0.3 to 4.1 — of course, this “upgrade” was
actually a complete re-install.

Well, the server had no tape drive, not even any SCSI controller. There
were no other machines on its subnet other than the clients, so I had no
boothost (at that time, I did not know that the routers could be
reconfigured to pass the appropriate rarp packets, nor do I think our
network people would have taken kindly to such a hack!). The clients did
have SCSI controllers, but I had no portable tape drive. Luckily, I had
a portable disk.

So, with great trepidation (remember, I was still a novice), I set up
one of the clients, with the spare disk, to be a boothost. I booted
the server off the client and read the miniroot from a tape on a remote
machine, and copied it to the server’s swap partition. Then I manually
booted the miniroot on the server by booting off the temporary boothost
with the appropriate options, and specified the server’s swap partition
as containing the kernel to be loaded. Once in the miniroot, I started
up routed to permit me to reach the tapehost, and finally invoked
suninstall. From then on, it worked like a charm.

Needless to say, I was extremely pleased with myself for figuring all of
this out. I then settled down to do the “easy stuff”, and got around to
configuring NIS (Yellow Pages). I decided to get rid of everything I
didn’t need, under the assumption that a smaller system is easier to
understand and keep track of. The Sun System and Network Administration
Manual, which is in many ways an admirable tome, had on page 476 a
section on “Preparing Files on NIS Clients”, which said:

“Note that the files networks, protocols, ethers, and services need
not be present on any NIS clients. However, if a client will on
occasion not run NIS, make sure that the above mentioned files do
have valid data in them.”

So I removed them. Several hours later, when I had finished configuring
the server to my satisfaction, reloading the user files, etc., I finally
got around to booting up the clients. Well, I *tried* to boot up the
clients, but got the strangest errors: the clients loaded their
kernels and mounted /, but failed trying to mount /usr with the message
“server not responding. RPC: Unknown protocol”. I was mystified. I tried
putting back the generic kernels on server and clients, several different
ifconfig values for the ethernet interfaces, enabling mountd and rexd on
server’s inetd.conf, removing the clients’ /etc/hostname.le0 (which I had
added)… all to no avail. ‘Twas the last work day before the Christmas
break, and I was flummoxed.

Of course, I finally connected the error message “unknown protocol”
with the removed /etc/protocols (and other) files, restored these
files, after which everything was fine again. I was pretty mad, since
I had wasted a whole day on this problem, but *technically*, the Sun
manual above is correct.

It just neglected to mention that of course, *no* machine is running
NIS at boot time, therefore *every* machine needs valid data in the
networks, services, protocols, and ethers files *at boot time*. Grrr!
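
A quick check before rebooting clients would have caught this. The sketch below only tests that each of the four files named in the manual exists and is non-empty; it does not judge whether the contents are valid. The function name `check_boot_files` and its optional directory argument are assumptions made for the example.

```shell
#!/bin/sh
# check_boot_files (hypothetical): report which of the four files
# are missing or empty under the given etc directory (default /etc).
check_boot_files() {
    etc_dir=${1:-/etc}
    rc=0
    for name in networks protocols ethers services; do
        # -s is true only if the file exists and is larger than zero bytes.
        if [ ! -s "$etc_dir/$name" ]; then
            echo "missing or empty: $etc_dir/$name"
            rc=1
        fi
    done
    return $rc
}
```

Run against a client's root, for example `check_boot_files /export/root/client1/etc` (a made-up path), it exits non-zero if any of the files would be absent at boot time.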

7

From: yared@anteros.enst.fr (Nadim Yared)
Organization: Telecom Paris, France

My story happened on a Sun Sparcstation 2.

I once wanted to update the libc.so.1.7 to libc.so.1.8 by myself, so
I got root, and then ftp’d the /lib/libc.so.1.8 to my /lib. Unfortunately
there was not enough room on this partition. So all I got was a file
with zero length.

The problem is that I ran /usr/etc/ldconfig in the directory /lib,
and that was all. No command could be executed, because ld.so
checked for /libc.so.1.8, being the newest one. All I needed was a
statically linked mv, but Sun does not usually provide the source.
Even going single user didn’t do anything. So I had to install a
miniroot on the swap partition, and cp /bin/mv from the CD-ROM,
and execute it.
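
One way to avoid leaving a truncated library where the loader can find it is to copy under a temporary name, verify the copy, and rename it into place only if it is complete. The sketch below assumes a dot-prefixed temporary name is not something the loader picks up; `install_file` is a hypothetical helper, and ldconfig would be run only after it succeeds.

```shell
#!/bin/sh
# install_file (hypothetical): copy src to dest via a temporary name
# in the same directory, so dest never exists in a half-written state.
install_file() {
    src=$1
    dest=$2
    tmp_dest="$(dirname "$dest")/.incoming.$$"

    # A zero-length or missing source is never worth installing.
    if [ ! -s "$src" ]; then
        echo "install_file: $src is missing or empty" >&2
        return 1
    fi

    # On a full partition cp fails or leaves a short file; cmp catches both.
    if ! cp -p "$src" "$tmp_dest" || ! cmp -s "$src" "$tmp_dest"; then
        rm -f "$tmp_dest"
        echo "install_file: copy incomplete, $dest left untouched" >&2
        return 1
    fi

    # Renaming within one filesystem swaps the file in as a single step.
    mv -f "$tmp_dest" "$dest"
}
```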

8

From: TRIEMER@EAGLE.WESLEYAN.EDU
Organization: Wesleyan College

I have been trying to put an AT&T 3b2/310 machine on the net for a
while. I’ll skip the unbelievable hardware problems. I’ll skip the
paranoid system admins that forced me to build a temporary net to show
them that the ethernet board worked. Anyway, I get it up and running
on the temp net – it works fine – a little slow, but hey. Ok, so I’m
ready to stick it on the net – you need to power down to do that right.
So, I powered down. Bad, bad bad mistake. I had been running a sysadm
shell script – I needed to change a password so that I could get into an
account. Well, would you believe that the script, despite the fact that
I wasn’t in the passwd option anymore, held onto the passwd file! Stupid
machine, stupid script. Anyway… what that means is that when I boot
up the machine, it passes diagnostics (A small miracle) runs unix and
doesn’t let anyone log in! I almost freaked. Anyway, so…

There’s an undocumented option on the installation disks called
‘magic mode’. At one point it offers 4 options (none of which is magic).
If you type magic mode at that point, you can get it… believe it or not,
some AT&T person had the nerve, and the bizarre sense of humor, to add one
extra line to magic mode: you see, when you type ‘magic mode’ it says

Poof!

That was just about the last thing I wanted to see… the rest was in a
sense trivial… ran an fsck… it fixed it all for me. So the moral of
the story… never ever assume that some prepackaged script that you are
running does anything right.
