When I drafted this article, I originally came up with 7 sysadmin habits. But out of those 7, three really stood out for me.
While habits are good, rules are sometimes even better, especially in the sysadmin world when you are handling a production environment.
Rule #1: Backup Everything (and validate the backup regularly)
Experienced sysadmins know that a production system will crash someday, no matter how proactive we are. The best way to be prepared for that situation is to have a valid backup.
If you don’t have a backup of your critical systems, you should start planning for one immediately. While planning for a backup, keep the following factors in mind:
- What software (or custom script) would you use to take the backup?
- Do you have enough disk space to keep the backups?
- How often would you rotate the backups?
- Apart from full backups, do you also need regular incremental backups?
- How would you execute your backup? Using crontab or some other scheduler? (a minimal crontab sketch follows this list)
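To make the last point concrete, here is a minimal sketch of what the schedule might look like in a crontab. The script names and paths are hypothetical placeholders for your own backup tool:

# hypothetical crontab entries: full backup Sunday 2 AM, incrementals the other nights
0 2 * * 0    /usr/local/bin/full-backup.sh  >> /var/log/backup.log 2>&1
0 2 * * 1-6  /usr/local/bin/incr-backup.sh  >> /var/log/backup.log 2>&1

Logging both jobs to a file gives you something to check the next morning, which matters once we get to validating backups below.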
Seriously: if you don’t have a backup of your critical systems, stop reading this article and get back to work. Start planning your backup immediately.
A while back I read about a survey (I don’t remember who conducted it) which claimed that only 70% of production applications get backed up at all, and that out of those, 30% of the backups are invalid or corrupted.
Assume that Sam takes backups of his critical applications regularly but never validates them, while Jack doesn’t bother to take any backups of his critical applications at all. It might sound like Sam, who has a backup, is in much better shape than Jack, who doesn’t. In my opinion, both Sam and Jack are in the same situation, as Sam has never validated his backup to make sure it can be restored when there is a disaster.
If you are a sysadmin and don’t want to follow this golden rule #1 (or like to break this rule), you should seriously consider quitting your sysadmin job and becoming a developer. 🙂
Rule #2: Master the Command Line (and avoid the UI if possible)
There is not a single task on a Unix/Linux server that you cannot perform from the command line. While there are user interfaces available to make some sysadmin tasks easier, you really don’t need them and should be using the command line all the time.
So, if you are a Linux sysadmin, you should master the command line.
On any system, if you want to be fluent and productive, you should master the command line. The main difference between a Windows sysadmin and a Linux sysadmin is GUI vs. command line: Windows sysadmins are often not very comfortable with the command line, while a Linux sysadmin should be very comfortable with it.
Even when you have a UI for a certain task, you should still prefer the command line, as you will understand how a particular service actually works when you drive it from there. In a lot of production server environments, sysadmins typically uninstall all GUI-related services and tools anyway.
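As an illustration, assuming a systemd-based distro and sshd as the service in question (swap in your own unit names), the usual GUI “service manager” clicks reduce to a few commands that also teach you where the logs and sockets live:

systemctl status sshd        # is the service up, and why/why not?
journalctl -u sshd -n 50     # the last 50 log lines for that unit
ss -tlnp | grep :22          # confirm something is actually listening on port 22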
If you are a Unix/Linux sysadmin and don’t want to follow this rule, there is probably a deep desire inside you to become a Windows sysadmin. 🙂
Rule #3: Automate Everything (and become lazy)
Lazy sysadmin is the best sysadmin.
There is not a single sysadmin I know of who likes to break this rule. That might have something to do with the lazy part.
Take a few minutes to think and list out all the routine tasks that you do daily, weekly or monthly. Once you have that list, figure out how to automate each of them (one small example follows). The best sysadmin typically doesn’t like to be busy; he would rather relax and let the system do the job for him.
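For instance, a recurring “delete last month’s application logs” chore might collapse into a three-line script dropped into cron. The path and the 30-day retention here are made-up examples:

#!/bin/sh
# prune-old-logs.sh: remove application logs older than 30 days (hypothetical path)
find /var/log/myapp -name '*.log' -mtime +30 -print -delete

Run it weekly from crontab and the task disappears from your list for good.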
Are there any other rules you think a sysadmin shouldn’t break? Leave a comment.
Comments on this entry are closed.
Good one, I like the tone and agree with most of what’s said.
Using the command line is advisable, even though one should be aware of the advantages GUIs may or may not have (and if they have any, leverage them!). The really strong suit of the CLI arises from rule 3 – automation is possible, powerful and makes your tasks safer, as the margin for human error shrinks.
The “Friday” rule.
Don’t schedule an outage for the last day of your work week.
– Use the last day of the work week to do/review preventive things that keep you from needing to be contacted during your time off.
– Most of the time, Murphy’s Law seems to intervene on things scheduled for the last work day: you end up departing for the day later than originally planned, and failing to explain the outage issues to your co-workers. This combination normally leads to calls at home (or calls to return to work) on your days off.
Really nice article.
and I like the first line of the 3rd point:
“Lazy sysadmin is the best sysadmin!!!!!!!!!!”
There is ALWAYS a better way of doing something… you just haven’t found it yet!!
Rule #2, right after perform backups: Document everything. All your system configs, all your system inter-relationships, all your processes. This, in effect, is an extension of rule 1: backup everything. In the case of disaster recovery, backing up data is only useful if you can recreate the system environment the data was running on. You are also backing up another critical resource: you! If you were hit by a bus, could someone pick up where you left off?
Never change the root shell, unless you have an alternative root account set up!
I see that’s more or less the way I was already working. One question: how can the backup be validated? With which method, script or program?
Thanks.
Using the shell fits naturally with automation, because shell commands are much simpler to automate than mouse clicks 😀
I agree with Francisco: how do you validate the data? I am new to sysadmin work and still learning, so a good pointer would be greatly appreciated.
The simplest way to validate is to restore from your backup media and compare the result to the existing data. The most simplistic approach is to run a ‘sum’ on the files and compare. If the data is more dynamic, run a sum on the files before they get backed up and include the result in the backup. Then restore to a different directory and check the sums against what was restored.
To generate a checksum file:
find . -type f -print -exec sum '{}' \; > checksumfile
then back it up.
Restore it someplace, run the same command (but put the output in a different file!), and diff the two checksum files.
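Putting those steps together, one possible end-to-end check looks like this. md5sum is used here instead of sum because it always prints the file name next to the hash; the /tmp paths and the restore directory are placeholders for your own setup:

# 1. before the backup: record a checksum for every file under the data root
cd /data/myapp && find . -type f -exec md5sum '{}' \; > /tmp/before.md5
# 2. take the backup as usual, and include /tmp/before.md5 in it
# 3. restore the backup into a scratch directory with your backup tool
# 4. verify every restored file against the recorded list
cd /tmp/restore-test && md5sum -c /tmp/before.md5 \
    && echo "backup validated" || echo "MISMATCH: do not trust this backup"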
Well… I think the author did not write “validate the data”, but “validate the backup”…
Kind of confusing, ha…
In my opinion, a sysadmin must validate the backup media now and then, ideally periodically.
For example, you might keep a backup of a particular application on tape, or maybe hire an online service; it depends on your budget and/or your needs.
You can restore this backup to another server, now that prices are dropping, or if you upgrade your server, you can keep the “old” server just for this purpose.
Use the checksum methods wisely: md5sum, cksum, a hash, whatever fits your needs. What I am saying is, make a checksum before the backup and another checksum in the new location, and compare.
Surely there are a lot of techniques to accomplish this.
This is only my “two cents”.
You did it wrong; the 1st rule is RTFM.
Backups (and disaster recovery plans) are rarely shown the respect they deserve. A failed backup could literally put your company out of business, yet backups are often handled by junior sysadmins with no supervision. I always tell new sysadmins that they will never truly know the importance of backups until they have to look a user in the face while telling them their data is gone forever. Gone because you didn’t do your job properly.
Very nice article; I can say it’s one of the best I’ve read here on thegeekstuff. I’ll take notes and keep them very close to my PC. I’m not an administrator yet, but that’s no reason not to apply these rules to my habits right now, especially rules #2 and #3.
2 things… validating the backup media is very important. I know a sysadmin for a large hotel in a well-known tourist area on the east coast who had been running backups for years; after a hurricane hit, he needed to do a restore, only to find out that the tape drive had a bad head in it and none of the backups had ever had anything written to them…
Also along the backup line: before making ANY changes to a configuration file, back up the current config so that you can revert and start over if the change you make has an unexpected consequence.
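One cheap way to follow that tip, for example (the sshd_config path is just an illustration):

cp -p /etc/ssh/sshd_config /etc/ssh/sshd_config.$(date +%Y%m%d-%H%M%S).bak
# edit and test; if it goes wrong, copy the timestamped .bak straight back

The -p flag preserves ownership and permissions, so the restored copy behaves exactly like the original.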
Very good article …
-Mike
Rule #4
(merge into #3) – Get a date and let the computer do the work.
Thanks a lot to all for your answers to my question. Very useful.
I’m not really a sysadmin, but at home I also have important information to keep safe and systems I wouldn’t like to reinstall from scratch again and again. So I take backups seriously, making sure my backups will work when I need them, and I also find it important to delegate as many repetitive tasks as possible to scripts or programs. Trying to become lazy involves learning scripting, more of the Linux shell, and learning in general. So I suppose laziness is the reward for knowledge.
Redundant backups are important too. Expect that when your system fails, your primary backup device will fail also; unfortunately I’ve learned this the hard way. It’s best if your secondary backup is kept off-site, so if something happens that flattens your entire block, you can drop the backup on a spare machine somewhere and be the hero when the company continues with minimal downtime, instead of being out of business.
Our servers are mostly virtualized now, so I make backups of the entire VM as well as backups at the OS level; if one backup stops functioning as expected for some reason, I have an alternate.
Employers tend to give admins crap for spending so much time and money on backups but when the shit hits the fan they are much happier that you were prepared.
Mike Hall has it wrong. The admins’ needs are totally disjoint from the end users’ needs. The end users need a reliable system that won’t fail; the admins need to make sure the system won’t fail, and so need time to test it in ways that might make it fail.
I’ve been an admin, and I know that sometimes the weekend is the best time for an outage. If a system will fail after an update, the weekend gives the most time to recover from the failure.
Rule #4: Chaos theory (“butterfly effect”) is for real.
…or: If it works don’t change it!
As sysadmins, we all know that even simple tasks, such as unplugging a network printer, can lead to an unpredictable series of events that eventually end in disasters such as the email server not working.
Even though the two things are not connected in any way, disasters may pop up 😉
One good argument for rule #2: when you have to administer a server a couple of time zones away, and all your network traffic goes through headquarters in the next state over, the difference between the command line and a GUI (XDMCP, VNC, etc.) can be hours of downtime for the customer.
Also, on the topic of backup: at home at least, I find the best option is to store everything on my networked RAID drives. Every so often I swap out one of the drives for my spare and store it in the fireproof safe.
I found the backup strategy one of the most challenging tasks of all. It is not as simple as “back up everything”, or you will end up backing up all your users’ music, family and party pictures, and tons of other junk. I agree that the most important part of the backup process is testing whether you can effectively restore the data; in the end, that is the whole purpose of backing up.
I would recommend checking your plan against the purpose of the backup:
[1] Disaster Recovery
[2] Archive or long term preservation of data.
The first strategy has the purpose of saving the most current data, to get your systems up and running as soon as possible with minimum data loss. Usually you don’t need old data for this purpose.
The second strategy is more complicated. Should you preserve all versions of your files? For how long? What data needs to be preserved, and what data can be ignored (for preservation purposes)?
A final note: especially for archival purposes, it is important to back up in a tool/format that you will still be able to use in the future. Try to use standard tools, and test whether you can still restore old data with your shiny new tape drive or backup software.
I have to disagree partially with rule #2, “master the command line”.
“Mastering” the command line implies that one should know by heart nearly all the commands and their associated options. Ever seen the book Linux in a Nutshell? There’s no way anybody could memorize even half the commands in that book. Instead of using the command line for each and every task, I would advise learning by heart only the more common tasks.
By the way, in the *Nix world shouldn’t you be referring to the “shell prompt”? “Command line” is Windows jargon, isn’t it?
Never, ever deploy version 1.0, or for that matter any brand spanking new product version that is significant to your daily operations, until it has been service-packed (OS, backup, database, email server, etc.). Let the early adopters toil and suffer. Case in point: an IT services org deployed Exchange 2010 two weeks after its release. 4 weeks later there are still ongoing problems, including the pres/CEO not being able to open attachments on emails more recent than 6/22/10. Besides a product that is surely filled with bugs, the services org had only a few weeks of newsgroup postings to draw on while deploying and remediating this new service. And then consider how monumentally poor Backup Exec 12.x was at backing up Exchange 2007. Is there any reason to believe Backup Exec 2010 will be any better?
Thou Shalt Not Maketh System Changes on Fridays
(Unless thou wishest to work weekends)
Just because you use the GUI doesn’t mean you aren’t comfortable with the cli.
I could add:
* Practice any change on a non-critical environment before trying it on the production environment.
On Rule #3, I would add to the description that notifications are essential for ensuring the availability of the automated process.
How can we automate the regular tasks? Is this done by writing scripts, or by other means? If anyone can explain, it would be very helpful to me and to others who are new to this field and want advice like this.
thank you
satheesh
Restoring from backup should be a last resort. Yes, you should take backups, but recovery scenarios should avoid the need for them whenever possible. Applications and configuration should be deployed in a repeatable fashion, such that you can start from a bare piece of hardware and know exactly how long it’ll take to rebuild the system exactly as it was before.
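A minimal sketch of that idea is to keep a bootstrap script (and the config files it copies) in version control, so a rebuild is one command; the package names, paths, and configs/ directory below are all made-up examples:

#!/bin/sh
# rebuild.sh: recreate a web host from a bare OS install (hypothetical)
set -e                                    # stop at the first failure
yum -y install httpd postgresql-server    # or apt-get, per your distro
cp configs/httpd.conf /etc/httpd/conf/httpd.conf   # configs/ lives in version control
systemctl enable --now httpd
# only the data itself then comes from backup, after the platform is reproducible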
RE the Friday rule: always plan outages before Wednesday. You then have at least 3 days for corrections of whatever can/will happen.
I agree with the rules; the third one is my very nature. I admit I’m lazy! Let the machine do the recurring tasks that are boring; I would certainly mess them up, because I’m human.
But there is a 4th rule, one I’ve suffered under because someone else didn’t follow it:
Document everything! You may be a single point of failure in the system, and the people who step in to keep things going will have a hard time figuring everything out.
Agree with 1 and 2.
Automation not only makes you lazy, it also makes you forget the less-used procedures/commands.
Also, I would add a 4th one: Documentation. A good sysadmin documents everything, so that another sysadmin (or management) can understand what the previous admin has done.
This is really really nice information.
A lazy sysadmin is great IF the lazy sysadmin documents his automated processes and trains a junior lazy sysadmin to lazily manage them. Otherwise the lazy sysadmin becomes a point of failure for the organization.
hi,
yes, what happened to the “only be root when it is absolutely necessary” rule? Basic, but an important one…
Hi Nataraj
The rules are very good. I don’t think there is a detailed article on these three by you; it would be great if you could provide one, as I am new to the Linux domain.
Rule #2: GUIs are great, but what if the GUI isn’t available? As sysadmin you will be called on to fix the system when major parts have failed; your only interface may be the CLI while you make repairs. The rule is: always have an alternate path into the machine. If you can’t connect to X Windows, maybe you can SSH or RSH from another server, or use the remote system console (iLO, RSC, etc.) through a remote KVM or serial switch. I’ve had to restart system processes using an SSH client from my cell phone, while parked in a supermarket parking lot. The CLI is always available, more reliable, faster, easier to document, etc. Sort of like using the vi editor: other editors may have an easier UI, but vi is always available.
For Windows, I install Cygwin SSHD on each server. Not quite the same as Linux, but much of the admin work can be done from the CLI, and automated.
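For example, once SSH is the fallback path, a failed GUI session costs nothing; the host, user, and service names here are placeholders:

ssh admin@prodbox 'systemctl status httpd'                 # look before you leap
ssh admin@prodbox 'systemctl restart httpd'                # restart the hung service
ssh admin@prodbox 'tail -n 50 /var/log/httpd/error_log'   # confirm it came back clean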
Here are my own rules:
1.) Always check the logs.
2.) Google is your friend.
3.) Think twice before pressing enter.
4.) Don’t try to fix things that aren’t broken. (Most of us, if not all, keep tinkering with our servers and eventually end up messing them up.)
IMHO an equally important rule for the sysadmin is:
Document every change you make on your system!
For every change made, the reason for the change should be clearly stated, and a reference to the test results for the change should be given.
A hint on how to revert the change could also be helpful, IF that is possible.
Best regards
Hellmut
NEVER, EVER, make a significant change to anything on a Friday afternoon!!
I wonder where this prejudice about Windows sysadmins and the CLI comes from.
Anyway, it’s all a bunch of bullshit. Any serious (corporate) Windows sysadmin knows his cmd/PowerShell, not to mention WSH scripting.
Setting up Windows XP and IIS over a weekend does not make one a Windows sysadmin…
I am also a new sysadmin now; I’m handling manufacturing servers, but I haven’t totally mastered the whole system. I agree with all of the rules I have heard here, and I will put them into practice. I already write some scripts to automate my common tasks, and I am also planning a backup system. I will take a training this coming 3rd week of this month (Storage Management), and this will be a big part of my plan to build a reliable backup system. I want to hear more advice here on how to become a lazy sysadmin, with all administrative tasks automated and running in the background.
Always “cover your ….”. Remember that all customers lie or leave things out when you’re trying to troubleshoot problems. Test, test and then test again!
Super.
Just found your site and I like it very much. You present your knowledge very well. Thank you for sharing it with us.
Excellent Rules of Sysadmin.
Really nice article. I will share it on my office notice board.
Good luck!
Loved this article!!!
One of my basic rules as a longtime *IX admin is to (almost) never delete a file directly. Instead, I rename it and set an ‘at’ job to delete it some time in the future. This can greatly reduce your need to restore files from backup. It’s not uncommon for me to let a file sit around for a month before it finally gets deleted.
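That habit fits in two lines; the file name and the 30-day window are arbitrary examples:

mv bigfile bigfile.deleteme.$(date +%Y%m%d)
echo "rm -f $PWD/bigfile.deleteme.$(date +%Y%m%d)" | at now + 30 days

If anyone screams in the meantime, the “delete” is undone with a single mv back.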
Thank you,
A simple and strong article …
A FOURTH and important rule to consider:
Rule #4: Set up a monitoring system (with thresholds you tune) that sends periodic alerts about system health (disk, memory, swap, directory-level space utilization). It always helps to know the production system’s health, so you can head off any forthcoming disaster.
– SK (Shekhar Koli)
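A bare-bones version of that suggestion as a cron script; the threshold, mount point, and address are examples, and a real shop would grow this into a proper monitoring tool:

#!/bin/sh
# disk-alert.sh: warn when root filesystem usage crosses 90% (hypothetical values)
USAGE=$(df -P / | awk 'NR==2 {print $5}' | tr -d '%')
if [ "$USAGE" -gt 90 ]; then
    echo "Disk usage on $(hostname) is ${USAGE}%" | mail -s "DISK ALERT: $(hostname)" admin@example.com
fi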
#4: Ensure that automated task scripts send email notifications to data center operators/help desk, and to SAs’ cell phones, when things go wrong.
Thanks for sharing your knowledge.
Khider Allos
Software Engineer
Sincere guidance… the whole truth… love u.
Document everything, even if it is in your ‘personal notes’ spiral notebook.
It just makes life-changing things more reproducible. It doesn’t have to be ready for ‘publication’ either… it just MUST communicate to the author.
Get another set of eyes on the problem if it is taking enough time that it is on your critical path.
Document your ‘upgrade scenario/process’ before doing it. Print it out to use during the ‘fire fight’, then use it as a guideline, writing down the times when you start/stop each step (so you can see whether you are keeping on track). Write notes on the printout about any deviations (so you can explain them to yourself later). These are NOT notes for your boss; they are YOURS.
I have been through several upgrades over weekends, with sleepless nights. These notes help get you through the Tuesday morning ‘post mortem’. (I reserve Mondays for recovering from the disasters that seldom happen, and for getting sleep. Bosses can wait; I am more interested in keeping production going than in reports.)
Very Nice and Keep up the good work,
What, sir, are the other 4 rules?
I like the last rule most… very innovative. I expect more Linux-based articles from you… thank you.
Hi Ramesh,
Your posts are very informative and valuable; I visit the website frequently. Regarding backup: we back up our files monthly to Iomega storage. The storage is mounted via NFS on the Linux server, and we do an incremental backup using the rsync command. And yes, we never bothered to verify the backed-up files. I just realized (after reading this article) that it is an important aspect.
The comments suggest checksums as one of the methods. Since we do incremental backups, can anyone suggest how such a backup can be validated?
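Since rsync is already in the picture, one lightweight check is rsync’s own checksum mode in a dry run; the paths below are placeholders for your source and your NFS-mounted backup:

rsync -avnc /data/ /mnt/iomega/backup/data/
# -n : dry run, change nothing
# -c : compare full checksums instead of size and mtime
# any file listed differs between source and backup; an empty list means they match

Run it right after the backup finishes, since files that legitimately changed since the last run will show up as differences too.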
Nice article. Thanks for sharing.
How do you add a disk/LUN from storage in Linux?
The “lazy sysadmin” is a nice one. However, with Linux being so utterly reliable, it is easy to forget the things you automated. At the very least you should have those tasks send you e-mails, so you remember what automated tasks you are running.
The most horrible example is this: while performing a kernel upgrade on a server, I built a script with the instruction: “if you can’t ping server X, the network did not survive the new kernel, so reboot into the old kernel”. I put that script in cron, to be executed every 10 minutes. But of course, after the successful kernel upgrade, I forgot to deactivate the script.
Three years later this server suddenly started to reboot every 10 minutes. What the F*? Well, I had decommissioned server X, hence it was no longer pingable. When I finally discovered the culprit, I could not even remember ever writing the script.
Had my script sent me an e-mail, it might have occurred to me sooner that I had some loose ends dangling somewhere.
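For what it’s worth, the fix is tiny. A reconstruction of such a watchdog (the names are invented, and the reboot itself is left as a comment) only needs one extra line to page its owner every time it fires:

#!/bin/sh
# net-watchdog.sh: hypothetical reconstruction of the script in the story above
if ! ping -c 3 serverX >/dev/null 2>&1; then
    echo "serverX unreachable, falling back to old kernel" | mail -s "WATCHDOG FIRED on $(hostname)" admin@example.com
    # ...reboot into the previous kernel here...
fi

A watchdog that emails every 10 minutes would have been noticed the same day, not three years later.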
#4 – NEVER give out Admin level passwords to anyone not directly responsible for the environment.
#5 – Avoid any “shared user” account creation like the plague.
#6 – If they can’t spell “sudo”, they don’t get Admin privileges.
Very nice article ! Lessons for budding system admins like me ! Thanks for sharing.
Those are great axioms to live your life as a sysadmin by. I would, however, like to add one for thought, one that has served me well over the years: “always have a plan”. I know it may sound basic, and in a way it is, but it is one of those basic tenets that will save your proverbial derriere.
Aside from the 3 rules given, 2 more important rules. First, don’t walk around with a round in the chamber… become root only when necessary! Second, I have seen sysadmins whose system backups are bombproof, but they may still need to consult the application users (turf wars) about what data needs to be backed up. You can rebuild an OS and you can re-install applications, but if you lose data, you could well be unemployed!
Someone suggested that automation is bad, because then you forget rarely used commands. This is wrong on several levels. First, if you are only automating things you do frequently (it is a waste of time to automate things you do rarely), your automation will never contain rarely used commands. Second, you will forget rarely used commands anyway, because you rarely use them!
I really like your page. I am a beginner network admin at INSA, Ethiopia. I need your help to learn more in this field of study. Thank you guys a lot.