If you are a system administrator, an IT manager, or anyone else responsible for IT infrastructure, you should implement an enterprise-level monitoring solution.
The shell script you’ve written that does a ps -ef and sends you an email might do the basic job, but it doesn’t count as monitoring.
If you want to be proactive, have peace of mind, and sleep well at night, you should implement a robust system and network monitoring solution for your IT infrastructure.
Nagios took the number one spot in our Top 5 System Monitoring Tools.
I’ve used Nagios intensively for several years and cannot live without it anymore. Knowing that all my systems (and services) are monitored by Nagios, which will notify me when something goes wrong, lets me sleep peacefully at night.
I’ll be launching my next eBook in a few weeks. Yes, it is about Nagios Core 3. For those who are new to Nagios, the eBook will walk you through installation, configuration, the Nagios web interface, and everything you need to know to set up Nagios Core 3 and start monitoring your systems. I’m very excited (despite a lot of sleepless nights spent creating the material) to release an eBook that will be helpful to those who are looking to implement Nagios Core in their enterprise.
Any monitoring solution you implement for your infrastructure should have the ability to do the following.
- Universal Monitoring – You don’t want to implement one monitoring solution for Linux servers, another for network equipment, another for hardware, and so on. You need one monitoring solution that covers almost all of your IT systems and services. A good monitoring system should provide a framework for plugins that monitor various services and devices. For example, it should be able to monitor:
  - Operating systems: *nix, Windows, etc.
  - System resources: CPU, disk, swap, processes, etc.
  - Network equipment: switches, routers, VPNs, firewalls, etc.
- Efficient Alert Notifications – Assign individuals (or groups) to a system or service as owners. This gives the power to the owners: let the owners of the system (or service) be notified and take action before you get involved. The solution should provide:
  - The ability to send notifications using various methods: email, pager, SMS, IM, etc.
  - The ability to set warning and critical alerts for the systems and services that are monitored.
  - Granular monitoring options to specify how often the system should be checked, how many retries to make in case of failure, how many failure notifications to send, which notification methods to use, etc.
- Web Dashboard that provides overall health, issues, and alerts for all the systems across the network, along with the ability to drill-down to individual hosts (and services).
- Issue Escalation – Should provide the ability to notify managers when the owner of the system does not take action on an issue within a certain time period. For example, when a database crashes and the DBA doesn’t fix it within a reasonable time, the monitoring system should alert the manager about the issue.
- Distributed Monitoring and Scalability – Should be capable of monitoring thousands of servers and services without too much overhead. Support distributed monitoring with multiple monitoring systems across the enterprise that can talk to a central monitoring server.
- Reporting – Should generate various monitoring reports, for example availability, trending, and notification reports for administrators. Should provide daily, weekly, monthly, or custom date range analysis of various monitoring statistics.
- External Application Integration – Should provide a framework (or API) that external applications can use to update the current status of a monitored system or service. Should provide enough detail for external vendors to integrate their solutions with the monitoring software. The more extensible the framework is, the more vendors will provide solutions, and the more companies will use it, which in turn makes the software more robust.
- Open source solution – Since you’ll be exposing all your mission critical systems to the monitoring software, you should make sure that you can trust it. Open source solutions are typically thoroughly tested and reviewed by the community for any potential security issues. Look at the track record of the software: how many years it has been on the market (the longer the better) and how many companies are using it (the more the better).
- Community and Commercial Support – When you are implementing it on a large scale (thousands of servers), you might want a solution that is officially supported and backed by a company. Several open source monitoring solutions are backed by a company that provides commercial support. Even if you don’t use the commercial support initially, you might want it when you expand your monitoring footprint.
- Easy to Learn and Use – This might be obvious to some of you, but you’ll be surprised how many people end up implementing a system that is very hard to learn and use. Don’t overlook this. The monitoring solution should be easy to implement and learn, as simple as that. You should not spend weeks trying to figure out how to get the software implemented and working successfully.
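To make the plugin and warning/critical ideas from the list above a little more concrete, here is a minimal sketch in Python of a disk usage check. It follows the exit code convention used by Nagios plugins (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN). The function names and the 80%/90% default thresholds are assumptions chosen for illustration; they are not taken from any particular plugin.

```python
import shutil

# Exit codes follow the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(used_percent, warn=80.0, crit=90.0):
    """Map a usage percentage to a state (thresholds are illustrative)."""
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (state, one-line message) for the filesystem holding `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - cannot read {path}: {exc}"
    if usage.total == 0:
        # Some pseudo-filesystems report a size of zero.
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero total size"
    used_percent = 100.0 * usage.used / usage.total
    state = classify(used_percent, warn, crit)
    return state, f"DISK {STATE_NAMES[state]} - {path} is {used_percent:.1f}% full"


def main(argv):
    """Print the status line and return the exit code for the caller."""
    path = argv[1] if len(argv) > 1 else "/"
    state, message = check_disk(path)
    print(message)
    return state
```

A script built around this would typically end with `sys.exit(main(sys.argv))`, so that the monitoring system can read the state from the exit code and show the printed line on the dashboard. Keeping `classify` separate from the measurement makes the thresholds easy to test and to reuse for other resources such as swap or CPU.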
My upcoming eBook on Nagios Core 3 is structured and organized in an easy to understand way, to help you implement, configure, and manage Nagios Core 3 on your IT infrastructure. Nagios is an extremely powerful monitoring tool that does all of the above very well.
Apart from the 10 things mentioned above, is there anything else that a monitoring solution should do (or have), in your opinion?
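The retry and escalation behaviour described in the list above (under Efficient Alert Notifications and Issue Escalation) can be sketched in a few lines of Python. This is only an illustration of the logic, not how Nagios implements it: the `ServiceState` class, the `process_check` function, the "owner" and "manager" labels, and the defaults of 3 retries and 30 minutes are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ServiceState:
    """Tracks one monitored service between checks (hypothetical names)."""
    failures: int = 0                   # consecutive failed checks
    alerted_at: Optional[float] = None  # time the owner was first notified
    escalated: bool = False             # whether the manager was notified


def process_check(state: ServiceState, ok: bool, now: float,
                  max_retries: int = 3,
                  escalate_after: float = 1800.0) -> List[str]:
    """Update `state` with one check result and return who to notify.

    `now` is the current time in seconds; it is passed in explicitly
    so the logic can be tested without waiting.
    """
    if ok:
        recovered = state.alerted_at is not None
        state.failures, state.alerted_at, state.escalated = 0, None, False
        # Send a recovery notice only if an alert went out earlier.
        return ["owner"] if recovered else []

    state.failures += 1
    if state.alerted_at is None:
        # Stay quiet until the failure has been confirmed by retries.
        if state.failures >= max_retries:
            state.alerted_at = now
            return ["owner"]
        return []

    # The owner already knows; escalate once if the problem persists.
    if not state.escalated and now - state.alerted_at >= escalate_after:
        state.escalated = True
        return ["manager"]
    return []
```

With these defaults, two failed checks produce no notification, the third notifies the owner, and a check that still fails 30 minutes after that first alert notifies the manager once. A later successful check resets the state and sends the owner a recovery notice.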
Comments on this entry are closed.
Agree with you regarding Nagios, I look forward to your new book!
Regards.
Good article Ramesh – looking forward to your book on Nagios 3 as it’s something that I have been experimenting with at home and have had mixed results.
Hopefully your eBook will be able to guide / walk us through the world of Nagios.
Thanks again!
Will be waiting for the ebook, Thanks for your work Ramesh.
Totally agree. We implemented Nagios for our customer to monitor 100+ Unix servers and 1000+ Windows servers using passive mode. Together with the n2rrd plugin, we are able to visualize historical performance data. This helps us to do capacity planning.
Look forward to your ebook
Hi Ramesh,
I like your site… Excellent work… I thoroughly enjoyed your previous eBook and cannot wait for this one. I think there are very few sysadmins who will not agree that Nagios is king.
Thanks again and keep up the good work!
Don’t forget about security. Many of the servers that I manage contain data that is subject to audits. Authentication and encryption are a critical part of any server administration now.
Great website content. Thanks for all the ideas.
Free e-Book is expected 🙂
Regards,
Tapas
Hi Ramesh,
what about Icinga?
Is Icinga the future of Nagios?
Thx
Very Good Article !!!
I work with Monitoring tools like Vitalnet (Alcatel) and Netview (Tivoli) etc…
So can you please give us a comparison chart between Nagios and these network management tools?
Hope it will add to the majesty of Nagios .
Dear Ramesh
When will you be releasing the new book on Nagios 3?
Monitors are overrated. They are very cool and make great presentations when you need to show your skills. Just watch out when you start spending more time looking at your monitor than you do making your systems stable.
Any information on how to configure distributed Nagios with a GUI?
For example, let’s assume there are 2 servers with Nagios installed on them (1 in America and 1 in Europe), and hundreds of hosts (all services such as HTTP, ICMP, etc.) are configured on these 2 Nagios servers.
Now, can we install distributed Nagios on a third server, manage all the hosts configured under the above 2 Nagios servers, and monitor both continents from that single Nagios instance?
How about op5’s merlin, ninja, etc….. ?
Thank you.
Ideally, it not only monitors system performance data, but also gives options to track the performance counters that actually imply a bottleneck, logs the data, summarizes it for easy access, and gives me charts/graphs for any time range of my choosing.
I have some experience using ASG PerfMan for this purpose, but have not seen any other tools that can do what PerfMan does. Instead, I rely on single-server solutions like PerfMon.msc on Windows and sar on Linux.
Most of these monitors claim to monitor performance, but they do not collect all of the data that you need to diagnose a performance problem.
Also, I would like to see which servers in the enterprise are doing the most I/O, for example. Most monitoring tools show you disk space. It is nice to know which disks are nearing capacity, and you need to know that, but when your SAN performance becomes an issue, you need to quickly determine which systems are the top hogs of I/O.
Nagios seems to be very capable and customizable. I intend to find some time to tinker with it because it seems to be widely adopted.
How do I configure web service monitoring, database monitoring, and the WebLogic monitoring plugin on a Nagios server?
Nice writeup Ramesh; nearly all system monitoring requirements have been identified by you with precision and clarity. Most companies want us to assist in monitoring individual applications that run on the servers and to assist in troubleshooting them. Even when the application servers and DB servers are up, if an application is not running at all (e.g. a custom nightly file transfer job), they want to know about it.