Virtual Cloud: List of good Systems Administration practices

Note that while many of the tools listed are Unix specific, the theories behind them apply to all Systems Administration, no matter what the system.

To find out how to do something in your version of Unix, look at the Unix Rosetta Stone.

Limit yourself: Do as little as possible with elevated privileges (ie. only use root when necessary; learn the uses of sudo and su -c)

Automation of any repetitive tasks:

If you aren't capable of Automation, stop calling yourself a Systems Administrator until you learn how. Learn to program in perl, bash (including for loops), a little awk, and a little sed. Note that Windows users will need to download these from Cygwin.
Learn to use an automatic scheduler like crontab! This seems obvious until you find someone using atd and rescheduling their jobs by hand every week. [Thanks to Josh Peck]
Another obvious one: edit your startup/shutdown procedures so that all services are stopped cleanly before shutdown, and start again at boot time without anyone being there. This can be done under Redhat with ntsysv (or by editing the files in /etc/rc.d/). [Thanks to Josh Peck]
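
As a small illustration of the scheduling point above, here is a minimal sketch that assembles a weekly crontab entry. It is written in Python only for brevity; the same thing is a one-liner in bash or perl. The function name weekly_cron_line and the script path are hypothetical. The five-field layout (minute, hour, day of month, month, day of week) is the standard crontab format.

```python
def weekly_cron_line(minute: int, hour: int, day_of_week: int, command: str) -> str:
    """Return a crontab line that runs `command` once a week.

    Field order: minute hour day-of-month month day-of-week command.
    day_of_week is 0-6, with 0 meaning Sunday.
    """
    if not 0 <= minute <= 59:
        raise ValueError("minute must be 0-59")
    if not 0 <= hour <= 23:
        raise ValueError("hour must be 0-23")
    if not 0 <= day_of_week <= 6:
        raise ValueError("day_of_week must be 0-6 (0 = Sunday)")
    # "*" in the day-of-month and month fields means "every".
    return f"{minute} {hour} * * {day_of_week} {command}"


if __name__ == "__main__":
    # Hypothetical job: run a report script every Sunday at 03:30.
    print(weekly_cron_line(30, 3, 0, "/usr/local/bin/rotate-reports.sh"))
```

The resulting line can be pasted into the file opened by crontab -e, after which the job repeats without anyone having to remember it.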

Document Everything

Why document? To some, the reasons for documentation are obvious, but to others, less so. Here are some of the benefits:

When you have to do something in a hurry, you won't have to work it out all over again
When you come back to a particular subsystem of your system after some time (eg. years), then you will know what is going on
When a new sysadmin comes along, they won't feel that they have to replace all systems with clean ones due to lack of documentation (I've seen this a number of times)
When they hire a new sysadmin, you can tell them to read the manual. Then you won't have to explain things all the time.
Basically, unless you are setting up a home network (and sometimes even then), you need documentation. This may sound onerous, but some of the ways of documenting things (below) may do some of the work for you.

Ways of documenting (use all of them)

Package manager: Having a package manager does a lot of the documentation for you. Want to find out what package a file belongs to? Just ask your package manager. Want to find out what version of a piece of software you have? Ask your package manager. I've only experienced Redhat's package manager (command line version), but I'd imagine that others can do this (I've been told that Debian can -- try the Rosetta stone link at the top of the page).
Journal: Get a school exercise book (lined paper), and list everything you do to the system; all configuration changes, everything. If there are repetitive tasks that you feel don't need documentation (eg. adding users), you're right -- they should be automated, and someone else should be entering any required data.
Network map: Make a visual representation of your network. Then people can see what is going on.
List of machines and software: You should make a list of all the machines you have, including:

Their purpose(s)
Any non-standard software installed (Standard software, of course, is only that included on your auto-setup disk (see below, under Emergency kit)).
Configuration details
Where you bought the machine
Where to get support for each supplier

List of common procedures: Document all common procedures. For example, I have a list of things I need to do when making a virtual host. I have another one which tells me how to put a new machine online (ie. auto-install, configure, investigate security issues with any specific software).
Address Allocation list: List the way you intend to allocate IP addresses and machine names. Keep in mind that in future you may want to take advantage of subnetting.
Training documentation: If you care about your system, also document ways for future sysadmins to learn about things. I have a half page on learning about Cisco (mostly "Read this, look at that"), another half-page about Security, and a bit about Content Regulation (ie. the industry regulations here). Mostly it's just a good collection of links.
Include Doco in procedures: In the procedure for putting a machine online, I have things like "Add the machine to the documentation and the timeline"
To aid others in reading your journal, make a timeline with major events (new machines, retiring old machines, and personnel changes)
Document the process of installing each server, so that someone else can do it if you are away
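
For the address allocation list above, a short script can generate the subnet plan so that the document and the network stay in agreement. This is only a sketch: the function name plan_subnets, the example network and the group names are all hypothetical. It relies on the ipaddress module from the Python standard library.

```python
import ipaddress
from typing import Dict, List


def plan_subnets(network: str, new_prefix: int, names: List[str]) -> Dict[str, str]:
    """Assign one subnet of size /new_prefix to each name, in order."""
    net = ipaddress.ip_network(network)
    # subnets() yields the subnets lazily, in ascending address order.
    pairs = list(zip(names, net.subnets(new_prefix=new_prefix)))
    if len(pairs) < len(names):
        raise ValueError("not enough subnets for the requested names")
    return {name: str(subnet) for name, subnet in pairs}


if __name__ == "__main__":
    # Hypothetical plan: split one /24 into /26 blocks, leaving one spare.
    plan = plan_subnets("192.168.0.0/24", 26, ["servers", "desktops", "printers"])
    for name, subnet in plan.items():
        print(f"{name:10s} {subnet}")
```

Leaving at least one block unassigned, as in the example, is what makes it possible to take advantage of subnetting later without renumbering.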

Security

Uninstall anything you don't need. This is also part of knowing your system [Thanks to Jimi Thompson]
Update: keep everything up-to-date, but particularly your kernel, and your internet-accessible services
Have a logging server (preferably encrypted, ie. syslog-ng+stunnel -- see Alternate Software, below). It should duplicate your logs, but logs should also be logged locally.
Go through the following, and make sure you have some software from every category (except a few like decoys):

The software section of Whitehats (now unfortunately defunct -- hopefully I'll find a replacement someday)
The IDS section of linux-sec

Have other monitoring. Investigate Process Accounting (Linux only in that link), and Network Monitoring
Alternate software: Consider alternatives to common programs which are insecure. The following are more secure than what they replace:

sshd (replaces telnetd, rlogin, and ftpd (not compatible with these programs)); you may want to also try making sftp chroot
vsftpd (replaces ftpd with something compatible)
courier-mta (replaces sendmail). Some disagree with me about the security of postfix, and recommend Exim instead.
syslog-ng + stunnel (replaces syslog)

Know your system: This requires documentation. Follow the documentation procedures above, and familiarise yourself with any documentation of systems that you are new to
Learn to use chroot. It is your friend [Thanks to Jimi Thompson]
Familiarise yourself with the top 20 security holes on the Internet.
Read about Secure Systems Development

Emergency Preparation

Understand what emergencies you can run into:

Worms:

The Internet Worm (aka the Morris Worm): which in November 1988 disrupted a large part of the then much smaller Internet for several days. The Code Red worm was only kept from being an attack on a similar scale by the expertise of major network admins, expertise gained with the Internet Worm
The Sapphire Worm (aka SQL Slammer), the first high-speed Internet worm.

DoS, DDoS: Lots of these happen (in my experience, about once every 2 months). They can be brief, or worse. For stories about some more severe ones, see:

Hacker attacks: Hackers can break into your machine, and start doing things. Probably the most famous instance is recorded in "The Cuckoo's Egg", by Clifford Stoll.

Mistakes: Someone can make a mistake. Someone where I worked once accidentally removed all entries but one from the passwd file (for those who don't know, that meant that no-one could log in, not even root)
Hardware failures: I worked somewhere with a FreeBSD box, and it crashed 3 times in 3 weeks, because the hard drive was failing, but it wasn't until the final time that it left an error message. We disabled the failing hard drive, and it was still running a year later
Full drives: If a hard drive fills up, it can make things quite interesting -- software can start behaving in strange ways, and sometimes services need to be restarted even after the space is freed up
Lots of other things can cause problems too. If you have any other large categories, let me know. Here are some outage stories I've found:

1. Slashdot outage: A lot can be learned from this. It's worth checking out.
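
Of the emergencies above, full drives are the easiest to see coming. The following is a minimal sketch of a check that could be run from cron; the function names and the 90% threshold are assumptions, not a recommendation. It uses shutil.disk_usage from the Python standard library, whose figures can differ slightly from the percentage df prints because df accounts for reserved blocks.

```python
import shutil


def percent_used(total: int, used: int) -> float:
    """Return used space as a percentage of total space."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


def is_nearly_full(path: str, threshold: float = 90.0) -> bool:
    """True if the filesystem holding `path` is at or above `threshold` percent."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return percent_used(usage.total, usage.used) >= threshold


if __name__ == "__main__":
    # Hypothetical list of mount points to watch; adjust to your own fstab.
    for mount in ("/",):
        if is_nearly_full(mount):
            print(f"WARNING: {mount} is at or above the threshold")
```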

Plan what to do in each of the cases above. In particular, have a plan for what to do if you've been hacked
Ensure you know where to find help

Be able to search the Security Focus site.
Have hard copies of contact details for:

All people who provide you with bandwidth (their tech department, not sales)
Vendors of all servers (ie. Cisco, Sun, whoever)
Other sysadmins who might be willing to advise you

You may also want to keep other kinds of contact details -- IRC friends, dial-in details for long-distance providers that won't be on the same network section as you, or whatever

Have an emergency kit:

Have a rescue disk. I use Tom's Rescue disk (aka tomsrtbt), and Knoppix.
Printouts of /etc/fstab on all machines, and the output of df too
Backups, of course (see below)
Have monitoring in place (see Security section above)
Buy the O'Reilly book "Unix Disaster Recovery and Backup" [Thanks to Dave Vehrs]

USB or Parallel Port Zip Drive (or something larger like a Jaz or Orb drive). [Thanks to Dave Vehrs]

Have an emergency network (boot) disk which allows network diagnostics/analysis
Have an (almost) Auto-Installation disk: The point is that, if a server completely dies, you can construct another without spending lots of time choosing options. Have a base (secure) system, and add on special software (proxy, etc) as needed. You can also set up new servers (ie. additional proxies, and the like) quickly, and it keeps things uniform across your systems

As an example, I have a (Redhat Linux) kickstart disk. Kickstart itself comes with Redhat. Google "kickstart linux", or try the following:

Kickstart HOWTO (v0.2, 11 Jan 1999)
Redhat Linux Configuration Guide (Redhat 7.1 -- you probably want the Kickstart Options section)
Something which isn't mentioned anywhere in the Kickstart documentation is that, if you leave something out, the install will prompt you for it.
The more recent versions of the Redhat-based distributions will actually generate a file named /root/anaconda-ks.cfg when you install them. It usually needs its package list replaced with the output of 'rpm -qa --queryformat "%{NAME}\n"' (the trailing \n puts each package name on its own line), but it's a good start
There's also a mailing list. For info (including archives), use http://mail.gnu.org/mailman/listinfo/howto-kickstart
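
To show how a package list might be turned into the package section of a kickstart file, here is a minimal sketch. It assumes the rpm output has one package name per line, and the function name packages_section is hypothetical. Newer kickstart versions expect the section to be closed with %end, while older ones did not use it, so that part is optional here; check the documentation for your release.

```python
def packages_section(rpm_output: str, add_end_marker: bool = True) -> str:
    """Build a kickstart %packages section from a list of package names.

    `rpm_output` is expected to hold one package name per line.
    Blank lines and duplicate names are dropped; names are sorted.
    """
    names = sorted({line.strip() for line in rpm_output.splitlines() if line.strip()})
    lines = ["%packages"] + names
    if add_end_marker:
        # Assumption: the target kickstart version wants a closing %end.
        lines.append("%end")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical sample; in practice this text would come from rpm -qa.
    sample = "openssh-server\nbash\n\nbash\n"
    print(packages_section(sample), end="")
```

The result can replace the package list in a copy of anaconda-ks.cfg, which keeps the auto-install disk in step with what is really on the machine.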

Regular maintenance schedule

APPLY PATCHES. The number one security hole on the Internet is machines running old software. [Thanks to Dan Shauver]
Check your passwd/hosts/group files every once in a while. Sometimes people were granted special access, and have left, and no longer need it. This point and the following are also part of "Know your system". [Thanks to Dan Shauver]
Check your crontab to see if there are old entries which no longer need to be there [Thanks to Dan Shauver]
If you have support contracts, ensure that they cover all hardware & software -- some have found that upgrading old hardware/software is cheaper than renewing support contracts on it. [Thanks to Dan Shauver]
If you have any idea what else belongs in a Regular Maintenance Schedule, please send me info
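
The passwd check above is easier to do regularly if a script produces the list to review. This is a sketch only: the function name login_accounts and the set of "no login" shells are assumptions that may need adjusting for your system. It relies on the standard seven-field passwd format (name, password, UID, GID, comment, home directory, shell).

```python
from typing import List, Tuple

# Assumption: these shells mark accounts that cannot log in interactively.
NOLOGIN_SHELLS = {"/sbin/nologin", "/usr/sbin/nologin", "/bin/false"}


def login_accounts(passwd_text: str) -> List[Tuple[str, int, str]]:
    """Return (name, uid, shell) for every account that has a login shell."""
    accounts = []
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) != 7 or not fields[2].isdigit():
            continue  # skip malformed entries rather than guess
        name, uid, shell = fields[0], int(fields[2]), fields[6]
        if shell not in NOLOGIN_SHELLS:
            accounts.append((name, uid, shell))
    return accounts
```

Feeding it the contents of /etc/passwd and comparing the result with the previous run shows at a glance which accounts have appeared, and which belong to people who have left.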

Backups

Have a multi-phase backup plan, eg.

Nightly backups, kept for a week
Weekly backups, kept for 3 months (just use a nightly from the end of the week)
Monthly backups, kept for a year; make a duplicate and keep it offsite
Yearly backups, kept for 10 years, duplicated 2 places offsite. Transfer the old ones to new media each year, to avoid problems with media deterioration

Test your backups: From time to time, when you back up, test a full disaster recovery. Otherwise you won't know whether your disaster recovery procedures work. [Thanks to Dan Shauver]
Ensure security of backups - ie. physical location (more?)
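
The multi-phase plan above can be expressed as a retention rule that a clean-up script applies to each backup. The sketch below rests on several assumptions that are not in the plan itself: the weekly backup is the one taken on a Sunday, the monthly is the one taken on the 1st of the month, the yearly is the one taken on 1 January, and "3 months", "a year" and "10 years" are rounded to 92, 366 and 3660 days. The function name should_keep is hypothetical.

```python
from datetime import date


def should_keep(backup_day: date, today: date) -> bool:
    """Decide whether a backup taken on `backup_day` is still within retention."""
    age = (today - backup_day).days
    if age < 0:
        raise ValueError("backup is dated in the future")
    if age <= 7:
        return True  # nightly: kept for a week
    if backup_day.weekday() == 6 and age <= 92:
        return True  # weekly (Sunday): kept for about 3 months
    if backup_day.day == 1 and age <= 366:
        return True  # monthly (1st of the month): kept for a year
    if backup_day.month == 1 and backup_day.day == 1 and age <= 3660:
        return True  # yearly (1 January): kept for about 10 years
    return False


if __name__ == "__main__":
    today = date(2010, 2, 18)
    for day in (date(2010, 2, 15), date(2010, 2, 9), date(2009, 6, 1)):
        print(day, "keep" if should_keep(day, today) else "expire")
```

A rule like this only decides what may be deleted; the offsite duplicates and the yearly transfer to new media still have to be done by hand or by a separate job.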

Redundancy

Have backup machines where possible (ie. multiple proxies)
RAID: Mirrored disks allow you to take one out while the other keeps working. Hot-swappable is preferable in something like a mail server.

Do post-emergency analysis

Determine the cause of the problem
Consider if there is any way that the problem could be avoided in the future
Implement solutions to prevent it happening

Time-limited upgrades: Not exactly an emergency, but they can easily cause emergencies. If you have to upgrade something in a limited period of time, there are certain things you should do:

Prepare, prepare, prepare

Make sure you know exactly what you're going to do, and how you're going to do it
Get everything ready beforehand (preferably having everything ready by the day before)
Have a written plan (hand-written is suitable)
Make sure all involved know what part they are going to play
Label everything beforehand, if relevant. One place I worked, we took advantage of a 10-minute link outage (for upgrade) to re-cable all the power cables to our servers, in order to make our UPSs last longer. I wrote a master plan, and everyone else wrote a plan for their part in the scheme (if they write it, it ensures they know it). Then I put a label next to the power port on each computer saying which UPS port they had to link to.
Have 2 (or more) backup plans. That's another reason why, in the example above, everyone wrote their own plan from the master plan -- if they dropped their own plan somewhere inaccessible, they could still work from the master. And the labels were another backup.

Do everything as fast as possible
Document
Clean up the mess

Policies: Create policies, including penalties for breaking them. Possibilities include level of service, rights & responsibilities of users, and of admins. Anyone with links on how to make your own, e-mail me

Bookshelf: Have a bookshelf. SAGE used to have a page on this, but unfortunately it's disappeared.

Attitude: You should try not to develop too much of an attitude (but you should check out that link, in order to recognise one :) ). Unless, of course, you take your attitude from Hobbes:
Calvin: What we need is an attitude. Everyone who's anyone these days has an attitude
Hobbes: We could be courteously deferential!
Calvin [sarcasm]: Oh yeah, that's real cool.

Some sites you should know:

SAGE
Security Focus
WhiteHats (unfortunately defunct)

Some sites which may also contain tips on good sysadmin practices:

The Programmer's Stone. This site talks about doing things properly, and lots of other useful info.

Thursday, February 18, 2010
