Sysadmin Stories: Making backups

by Stephen on October 19, 2009

1

From: rickf@pmafire.inel.gov (Rick Furniss)
Organization: WINCO

Murphy’s law #??: preventive maintenance doesn’t.

try this one: /etc/dump /dev/rmt/0m /dev/dsk/0s1
Or: tar cvf /dev/root /dev/rmt0

Backup commands on Unix can be among the most dangerous commands used, even
though they exist to prevent problems rather than cause them. If any Unix
utility were a candidate for a warning message, or error checking, this would be it.

Just in case you didn’t catch the HORROR above, the parameters are backwards,
causing a TOTAL wipeout of the root file systems.

More systems have been wiped out by admins than any hacker could manage in
a lifetime.
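
One way to catch swapped arguments before they do damage is a small check that runs ahead of the real command. The sketch below is only an illustration and is not from the original post: `check_backup_args` is a hypothetical helper name, and it rejects just the obvious cases (an archive target that is a block device or a directory, or a source that is a tape-style character device).

```shell
# check_backup_args ARCHIVE SOURCE   (hypothetical helper, not a standard tool)
# Succeeds only if the pair looks like "write SOURCE to ARCHIVE"
# rather than the reverse.
check_backup_args() {
    cba_archive=$1
    cba_source=$2
    # A disk (block device) or a directory should never be the archive target.
    if [ -b "$cba_archive" ] || [ -d "$cba_archive" ]; then
        echo "refusing: archive target $cba_archive is a disk or directory" >&2
        return 1
    fi
    # A character device such as a raw tape should not be the thing backed up.
    if [ -c "$cba_source" ]; then
        echo "refusing: source $cba_source is a character device" >&2
        return 1
    fi
    if [ ! -e "$cba_source" ]; then
        echo "refusing: source $cba_source does not exist" >&2
        return 1
    fi
    return 0
}
```

For example, `check_backup_args /dev/rmt0 / && tar cvf /dev/rmt0 /` would run tar only when the arguments are the right way round.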

2

From: grant@unisys.co.nz (Grant McLean)
Organization: Unisys New Zealand

One of my customers (who shall remain nameless) was having a problem with
insufficient swap space. I recommended that he back up the system, boot
off the OS tape, repartition the disk, remake the filesystems and restore
the data (any idiot could do this, right? :-) ). I also suggested that if
he wasn’t confident of achieving all this, we could provide a skilled
person for a modest fee. Of course he was fully confident so I left him
to it.

Next day I get a call from the guy to say he’d been there all night and
he’d had all sorts of funny messages when restoring from tape.

Eventually we tracked his problem down to the backup script he’d been
using. It was a simple one liner:

find / -print | cpio -oc | dd obs=100k of=/dev/rmt0 2>/dev/null

This was a problem because:

1) His system had two 300MB drives
2) He only had a 150MB tape drive
3) The same script was being run every night by a cron job
4) All his backups were created by this script

(In case you haven’t worked it out, the dd is to speed up writes to tape
but it has the unfortunate side effect that cpio never finds out about
the end of tape. Because the errors were going to the bit bucket, they
never knew their backups were incomplete until they came to restore from
them).
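
Two cheap checks would have exposed this much earlier: compare the data size with the tape capacity, and keep the tape writer's errors instead of discarding them. The sketch below assumes sizes are given in MB (the 600 and 150 come from the two 300MB drives and the 150MB tape in the story); `fits_on_tape`, `write_archive`, and the log path are hypothetical names chosen for illustration. It also assumes the writer exits non-zero when a write to tape fails.

```shell
# fits_on_tape DATA_MB TAPE_MB
# Succeeds only if the data fits on a single tape.
fits_on_tape() {
    [ "$1" -le "$2" ]
}

# write_archive LOGFILE COMMAND [ARGS...]
# Runs COMMAND as the last stage of a backup pipeline, appending its
# stderr to LOGFILE instead of /dev/null, and reports a failed write.
write_archive() {
    wa_log=$1
    shift
    if "$@" 2>>"$wa_log"; then
        return 0
    fi
    echo "backup writer failed; see $wa_log" >&2
    return 1
}

# Example (device name from the story, log path is an assumption):
#   fits_on_tape 600 150 || echo "600MB will not fit on a 150MB tape" >&2
#   find / -print | cpio -oc | write_archive /var/log/backup.log dd obs=100k of=/dev/rmt0
```

In a plain POSIX shell the pipeline's exit status is that of its last stage, so placing `write_archive` last is what lets a cron job notice the failure. Errors from the earlier stages would still need separate handling.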

I would have loved to be a fly on the wall when he explained to his boss
that the data was gone and there was no way of getting it back.

3

From: ravi@usv.com (Ravi Ramachandran)

Live 24 hour online system. Does backup over the ethernet to a SCSI tape.
Unfortunately, no SCSI on this system to recover if root/ethernet dies.
This was a Compaq Systempro running SCO Unix. Slated a downtime of 4-6am.
I thought it would take me only 30 minutes, as I had installed a
similar (Adaptec) SCSI board on similar hardware running SCO. Only difference
was that this machine was running MPX (multiprocess extension) and you had
to deinstall it, install the SCSI, and then reinstall MPX (proper procedure).
I had made all my slot/IRQ charts the previous day, and so got busy removing
MPX. Then said “mkdev tape”, go through the IDs, and am almost at home
base. Then… “link kit not installed, use floppy X1” when I tried to remake
the kernel. For some reason, when I removed the multiprocessor extension,
the single processor files were not moved to their right location. And if
I reinstalled the single, all my changes would be lost. Finally, restored the
OS (from backup) on the remote machine, and then rcp-ed them over to bring back
the MPX version. Unfortunately, rcp does not maintain the date, permissions,
etc. Got a limping version of the machine back on-line about 45 minutes
after its slated time, and spent the rest of the day fixing vagrant files.
The next week, I moved the online programs to another machine (a headache),
and reinstalled this machine from scratch.
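
A common way to move a tree while keeping permissions and modification times is to pipe one tar into another. The sketch below shows the local form; `copy_tree` is a hypothetical helper name, and whether the tools on that SCO system behaved this way is an assumption. Many rcp implementations also accept a `-p` option for the same purpose.

```shell
# copy_tree SRC DST   (hypothetical helper)
# Copy the contents of directory SRC into the existing directory DST,
# keeping file permissions (tar's p flag) and modification times.
copy_tree() {
    ct_src=$1
    ct_dst=$2
    ( cd "$ct_src" && tar cf - . ) | ( cd "$ct_dst" && tar xpf - )
}
```

Across machines the same idea is usually written with the second tar run through a remote shell, for instance `rsh otherhost 'cd /some/dir && tar xpf -'`, where the host and directory are placeholders.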

4

From: keith@ksmith.uucp (Keith Smith)
Organization: Keith’s Computer, Hope Mills, NC

My dumbest move ever. Client in Charlotte, NC (3 hours + away) has
Xenix box with like 15 users running single app. They have a tape
backup of course. Anyway they ran slam out of space on the 70MB disk
drive so I upgraded them from an MFM to a SCSI 150MB disk. Restored
their app & data files, and they were off and running. Anyway they did
an application directories backup (tar) on a daily basis and backed the
rest of the system up with tar on Monday morning.

Being a nice guy I built a menu system and installed the backups on the
menu so they could do it with a push of the button. Swell, It’s Monday.
Call if anything else comes up. 1 week later I get a call. Console is
scrolling messages, App seems to be missing yesterday’s orders, etc.
Call in, and cannot log in. ‘w’ doesn’t work. Crazy stuff. Really
strange.

Grab old drive/controller, fly to Charlotte, replace drive, install
app backup tape. They re-key missing stuff, etc. Bring new disk back.
Won’t boot, won’t do anything. Boot emergency floppy set. Looking
around. Can’t figure but have backup tape from that morning that
“completed successfully”. tar tvf /dev/rct0. Hmm, why all these
files look very OLD. Uh, Where, Uh. Look at menu command for the
“backup” is ‘tar xvf /dev/rct0 /’

Anyway, I owned up to the mistake, re-loaded the SCSI drivers and
changed the command to ‘tar cvf ..’

Hehehe, Now I DOUBLE check what I put on a menu, and try not to be in a
*HURRY* when I do this stuff.
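
A backup that "completed successfully" can still be checked by listing the archive and looking for something that could only have been written by this run. The sketch below is one way to do that, under assumptions: `backup_and_verify` and the marker file name are hypothetical, the archive path must be absolute and outside the source directory, and on a tape the device would need to rewind between the write and the listing.

```shell
# backup_and_verify ARCHIVE SRCDIR   (hypothetical helper)
# Writes SRCDIR to ARCHIVE, then confirms the archive lists a marker
# file that was created just before the run.
backup_and_verify() {
    bv_archive=$1
    bv_src=$2
    bv_marker=".backup-marker-$$"      # hypothetical marker file name
    date > "$bv_src/$bv_marker" || return 1
    ( cd "$bv_src" && tar cf "$bv_archive" . )
    bv_status=$?
    rm -f "$bv_src/$bv_marker"
    [ "$bv_status" -eq 0 ] || return 1
    # A stale archive (for example one that was read with 'tar xvf'
    # instead of written) will not list the fresh marker, so grep fails.
    tar tf "$bv_archive" | grep -F -- "$bv_marker" >/dev/null
}
```

This would not have undone the damage from an accidental extract, but a menu entry built this way would at least have reported a failure on the first run.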

5

From: mike@pacsoft.com (Mike Stefanik)
Organization: Pacific Software Group, Riverside, CA

One of the more interesting problems that I ran into was a customer that
was having problems with their SCSI tape drive on a XENIX box. Around midnight,
every night, the system would automatically backup and verify their data. One
day, the customer needed to restore some data files from the last night’s
backup. She called because, although the restore worked just fine, she didn’t
see the busy light on the drive come on, and it didn’t sound like the tape was
moving. I dialed up the system, had her put a tape in and did a retension —
the drive started winding the tape back and forth, and we both concluded that
she was mistaken. After all, the tape was retensioning, and she wasn’t getting
any backup or verify errors at all. I just chalked this one up to user
confusion.

A few days later, she called back saying that there really is something wrong
with the tape. She needed to restore some data from a few days ago, and like
before, the busy light on the drive didn’t come on, but files did restore.
However when she started the application program, the data hadn’t changed. I
dialed up the system again, and just on a fluke, issued a “df” — it showed
their rather large root filesystem to be nearly full. Confused, I did a “find”,
searching for files over 1MB. Of course, what I found was this huge file named
/dev/rct0. As I later discovered, their system had crashed a few weeks ago,
and she had simply answered “yes” to a bunch of questions that it asked when
she brought it back up. The /dev/rct0 device was removed (but /dev/xct0 was
still there, which allowed me to retension the tape) and the backup script
never checked to make sure that it was actually writing to a character device.

Needless to say, I modified the backup program to make sure that it was really
writing to a device, and I made her promise to call me whenever the system
crashed or asked “funny questions” when it was booting.
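
The check described here can be as small as a single test before the backup starts. A minimal sketch, assuming the tape is expected to be a character special file; `require_char_device` is a hypothetical helper name, and the actual change made to the customer's script is not shown in the post.

```shell
# require_char_device DEVICE   (hypothetical helper)
# Succeeds only if DEVICE exists and is a character special file,
# such as a raw tape device. A regular file with the same name fails.
require_char_device() {
    rcd_dev=$1
    if [ -c "$rcd_dev" ]; then
        return 0
    fi
    echo "refusing to back up: $rcd_dev is not a character device" >&2
    return 1
}
```

A backup script could then begin with `require_char_device /dev/rct0 || exit 1`, so a missing device node stops the run instead of quietly filling the root filesystem.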

6

From: Nick Sayer

And then there was the time the / disk was full but nobody knew where
the space was going. ‘Course this was on an Ultrix box and everyone’s
used to using Suns, so they were tarring to /dev/rst*. Sure enough,
/dev/rst8 was a 20M file in a 25M partition.
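
A regular file under /dev is often a sign that something wrote to a device name that did not exist. A minimal sketch for spotting them follows; `list_regular_files` is a hypothetical helper name, and some systems legitimately keep a few regular files under /dev, so the output is a list to inspect rather than proof of a problem.

```shell
# list_regular_files [DIR]   (hypothetical helper)
# Print every regular file under DIR (default /dev). Device nodes,
# directories, and symlinks are skipped by -type f.
list_regular_files() {
    find "${1:-/dev}" -type f -print
}
```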
