Enterprise Monitoring



Monitoring – Shedding light on your infrastructure


In July 2010, I started working at 5AM Solutions (http://www.5amsolutions.com). One of the first tasks we tackled was to take a hard look at our monitoring and find the best monitoring system for the company. The process was what anyone would expect from a software shop: meetings were held, requirements were gathered, and options were evaluated. Our process is outlined here, for better or worse.

In the beginning

5AM Solutions had the good fortune of growing quickly. As is often the case, the IT infrastructure wasn’t able to keep pace, despite the best efforts of a competent and dedicated team. The monitoring situation was a series of home-brewed scripts, some utilizing Hudson (http://jenkins-ci.org/) in very novel ways, to keep some basic health checks going on critical systems.

There were several issues with this approach that made it undesirable going forward.

·      It would be extremely difficult, bordering on impossible, to scale
·      There wasn’t any historical data retention for analysis
·      There wasn’t a way to get quick-hit visual representations of data
·      There wasn’t a consistent UI or methodology

Gathering Requirements

When the 5AM IT Group sat down to discuss the issues with the existing monitoring setup and come up with a more permanent solution, everyone walked away with a handful of requirements for the new platform.

·      The application stack had to suit the preferences of the group. Quite simply, we didn’t want an application written in Perl, Java, or .NET. We are all Python fanboys, and while MySQL is fine, we hold PostgreSQL near and dear to our hearts.
·      We didn’t have time to learn a whole new language just to run an application. Of course each app will have its unique syntax and workflows, but we wanted to be able to use standard tools (e.g. XML or SQL) to import and export data, configure the application, and talk to an API.
·      The solution had to have a short learning curve. The unofficial motto of 5AM IT is “do more with less”. We are committed to innovation and automation. Spending endless hours tweaking and supporting an application is not an option.
·      We wanted an Open Source solution. 5AM releases as much of its source code as possible under variations of the GPL. The 5AM IT Group is also committed to FOSS, and contributes to multiple projects and organizations.
·      We needed a monitoring application, not an entire helpdesk/inventory/monitoring/coffee solution. 5AM was already committed to the Atlassian software suite for Issue Tracking, and proper inventory management was to be considered further down the road. With that said, we also wanted a mature API, so we could automate and integrate whenever we had the opportunity.
·      The application had to be flexible. More and more companies are adopting a hybrid approach to their infrastructure. IaaS, PaaS, and straight web-based applications make up an increasing share of the things that a Systems Administrator has to keep an eye on with regard to performance and availability.
·      There had to be an established community for the product we chose. While we wanted to innovate, we didn’t have time to spend hour after hour solving basic issues with an immature application.

Choosing a Solution

With those requirements in mind, we began looking over the options. Some candidates were immediately eliminated due to their cost or implementation needs. After some research and consideration we ended up with a short list of contenders that included Zabbix, Nagios, and Zenoss.

Zenoss looked to be a great fit, on the surface. Zenoss Core is an open source project written in Python, with a large following. It has a very slick user interface as well. Unfortunately, multiple members of the IT group had deployed or maintained Zenoss installations that suffered from a history of stability issues. 

Also, the Zenoss “open core” business model, where it offers a free but limited or degraded version and a fully functional paid one, is not a truly open source solution in our eyes, and was wholly undesirable.

Nagios made its way onto the short list because of its venerable history and massive community within IT, especially among Linux administrators. We spent a long time scratching our heads on both of those counts. However, its business model mirrors that of Zenoss.

The final point that eliminated Nagios is often considered its strongest selling point: its age. The application architecture of Nagios, meaning what it defines as its mission and how it goes about it, is too limited, and we were concerned that we would hit those limitations down the road. The face of technology changes every day, and none of us felt that Nagios had kept up, or that its community had any desire for it to.

Zabbix was the next potential application to evaluate. Its feature list was impressive.

·      The ability to run multiple types of distributed setups
·      web-based checks
·      agent-based checks
·      SNMP/TCP/ICMP/UDP based passive checks
·      simple installation
·      mature JSON-RPC API
·      light network footprint
·      database agnostic (able to run on SQLite, Oracle, MySQL, and PostgreSQL currently)

The Zabbix business model is akin to that of OpenNMS (http://www.opennms.org), another open source monitoring system that is designed to solve a different sort of problem. At its heart is a for-profit company, which steers and contributes to a fully open source community. The company generates revenue by providing professional services such as training, specialized deployments, and support for the open source product.

The decision was made to go with Zabbix as our monitoring solution at 5AM.

Zabbix History

In 2001, Alexei Vladishev decided that he didn’t like the monitoring tools that were available, and he was going to do something about it. The product he came up with was named Zabbix. By releasing the software under the GPL (currently GPL version 2), he began the Zabbix community.

That community now contains over 20,000 members. Downloads of the Zabbix software average approximately 500 daily. This doesn’t include the binary installations from repositories like EPEL and the Ubuntu universes.

In 2005, Alexei Vladishev launched Zabbix SIA in Riga, Latvia. This corporation steers the Zabbix project and also provides consulting, training, paid support and custom deployment services for Zabbix to customers. Zabbix SIA currently has partners and resellers on four continents.

Training and Certification

At least in North America, this is the weakest link in the Zabbix solution. Currently there are no training classes scheduled in North America. Europe, Asia, and South America are the most common destinations for training certified by Zabbix SIA. This will hopefully not stay the case for much longer.

Installation and Getting Started

Zabbix SIA makes the source code available for download. The company and community also provide binary packages for all popular Linux distributions, and even Windows (a dirty little secret, to be sure). On a clean server, the Zabbix server daemon can be up and providing meaningful metrics in well under one hour. After installation, the server daemon requires editing only a single configuration file.

Configuring the database is also straightforward. After creating a database and user in your favorite database server (currently we’re running Zabbix on PostgreSQL 9.0.1), two supplied SQL scripts are run against that database. The first script creates the schema and configures the database with some default settings. The second script imports a few BLOB graphics and other items that are used within the application itself.

The web frontend, written in PHP, can also be up and running within the same one-hour time frame. After installation and configuration in Apache (we haven’t tested it in Nginx or another alternative web server at this point, but it should work if the server can run PHP), a web-based check confirms the required PHP settings and modules. The process ends with a second configuration file that holds the database connection settings and a few other key values.

And that’s it. With the installation of a few packages, the optional modification of a supplied Apache configuration into a VirtualHost, a web-based PHP check, some simple database commands, and two configuration file tweaks, you have a fully functioning monitoring system.
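As a rough illustration of the database step, here is a minimal Python sketch that runs two SQL scripts through psql in order. The file names (schema.sql, images.sql) and the database and user names are placeholders we made up; substitute the scripts that ship with your Zabbix release.

```python
import subprocess

# Placeholder file names: use the scripts shipped with your Zabbix release.
SCHEMA_SCRIPT = "schema.sql"
IMAGES_SCRIPT = "images.sql"


def build_load_commands(database, user, scripts=(SCHEMA_SCRIPT, IMAGES_SCRIPT)):
    """Return one psql command per script, in the order they must run."""
    return [
        # ON_ERROR_STOP makes psql exit non-zero on the first SQL error.
        ["psql", "-v", "ON_ERROR_STOP=1", "-U", user, "-d", database, "-f", script]
        for script in scripts
    ]


def load_scripts(database, user):
    """Run each script in turn; check=True raises on the first failure."""
    for command in build_load_commands(database, user):
        subprocess.run(command, check=True)


# Example call (needs psql on the PATH and an existing database and user):
# load_scripts("zabbix", "zabbix")
```

Keeping the command construction separate from the execution makes it easy to print and review the commands before touching a real database.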

Setting up your infrastructure for monitoring with Zabbix

For an existing infrastructure, the easiest way to populate Zabbix is to run a discovery action on your networks. After installing the Zabbix agent on each host via your management application (or by sheer pain if you don’t have centralized configuration management in your Enterprise) and editing its single configuration file to your liking, you initiate a scan from the Zabbix server’s web GUI to look for Zabbix agents on your networks.

This scan can intelligently look at data (typically by looking at the uname output of a Linux server) and classify these discovered hosts into various groups without your intervention. This makes life much easier down the road.  For example, Zabbix can determine that a system is a Linux server, check the version of the running kernel and assign it to groups for all of the different Linux releases running in your infrastructure.
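To make the idea concrete, here is a small Python sketch of the kind of classification described above. It is only an illustration of the logic, not how Zabbix implements discovery, and the group names are invented for the example.

```python
import re


def classify_host(uname_output):
    """Map `uname -sr` style output to a list of host group names.

    The group names are made up for illustration.
    """
    fields = uname_output.split()
    if not fields:
        return ["Unclassified"]
    system = fields[0]
    groups = [f"{system} servers"]
    # Pull "major.minor" out of a release such as 2.6.32-71.el6.x86_64.
    if len(fields) > 1:
        match = re.match(r"(\d+\.\d+)", fields[1])
        if match:
            groups.append(f"{system} {match.group(1)} kernels")
    return groups
```

Given the string "Linux 2.6.32-71.el6.x86_64", this sketch places the host in a general Linux group and a kernel-specific one.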

You can also manually edit group memberships, dividing and sub-dividing groups of servers into whatever views give the desired visibility. While this does help create better visibility, it is optional, and at this point you have a fully monitored infrastructure.

The default Zabbix Linux template checks over 100 items at various intervals on a given server. While all of these are likely not needed for every server, it is a great starting point as you analyze Zabbix and look at what capabilities are built into the Zabbix Agent by default.

Custom Checks in Zabbix

Of course, the Zabbix SIA team cannot conceive of every possible thing that a Zabbix Agent will ever need to check. That is where the Sys Admin can get a little creative and innovative. Configuring Zabbix to check a custom value on a server takes seconds. In our experience, it is often harder to come up with the string of commands that properly captures what needs to be tracked. Once you have decided what to monitor, adding a custom check is relatively simple: add a UserParameter entry to the host’s Zabbix Agent configuration file, then add a corresponding item in the server’s web GUI to tell the server to collect it. To put it another way, if a value can be output to the command line, or even a web page, it can be captured, stored, evaluated, and acted upon by Zabbix.
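As a minimal sketch of such a check, the Python script below prints a single number, which is all the agent needs from a custom command. The agent directive takes the form UserParameter=<key>,<command>; the key, script path, and spool directory in the comment are hypothetical names chosen for this example.

```python
#!/usr/bin/env python3
# Hypothetical agent configuration line (key and paths are made up):
#   UserParameter=custom.spool.files,/usr/local/bin/count_spool_files.py /var/spool/myapp
import os
import sys


def count_files(directory):
    """Return the number of regular files directly inside directory."""
    return sum(1 for entry in os.scandir(directory) if entry.is_file())


if __name__ == "__main__":
    # Default to the current directory when no argument is given.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    # The agent stores whatever the command prints to standard output.
    print(count_files(target))
```

Anything that can print one value this way, whether a shell one-liner or a longer script, can be turned into a monitored item.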

Not only can Zabbix check these values easily, it can also execute commands remotely on a given server if certain conditions are met. A great example of this is the old Java application that everyone has running in their Enterprise. It has hundreds of names, but it is always vaguely unstable, bringing itself to its knees at random, unpredictable intervals. Zabbix can of course monitor the amount of RAM used by the Java servlet container (let’s say Tomcat in this example) by issuing simple java commands on the command line as custom user parameters. It can monitor the system load on the server, as well as the total free system RAM. If any or all of these values breach the desired threshold, Zabbix can issue a command to restart Tomcat, reboot the server, initiate a garbage collection in the application itself, or run any other command that will keep this Java app from going south. It will do this 24x7x365 without intervention. This type of action takes minutes to set up, and is limited only by the creativity of the Zabbix administrator.
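The decision logic behind that kind of automation can be sketched in a few lines of Python. This is our own simplified model of "threshold breached, so run a command", not Zabbix’s trigger or action engine; the metric names, limits, and commands are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    metric: str       # name of the collected value
    threshold: float  # breached when the value exceeds this
    command: str      # remediation to run on the host


# Hypothetical metrics, limits, and commands.
RULES = [
    Rule("tomcat.heap.used.pct", 90.0, "service tomcat restart"),
    Rule("system.load.1min", 8.0, "service tomcat restart"),
]


def commands_to_run(latest_values, rules=RULES):
    """Return the distinct commands whose rule is breached, in rule order."""
    commands = []
    for rule in rules:
        value = latest_values.get(rule.metric)
        breached = value is not None and value > rule.threshold
        # Skip duplicates so one restart covers several breached rules.
        if breached and rule.command not in commands:
            commands.append(rule.command)
    return commands
```

Deduplicating the commands matters in practice: if heap usage and load both cross their limits at once, Tomcat should still be restarted only once.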

Zabbix in the clouds

More and more business is done in the grey, misty realm known in IT as “The Cloud”. Whether it is IaaS, PaaS, or any other XaaS, Zabbix can keep your infrastructure under control.

Zabbix can monitor URLs with its Web Checks module for web-based applications. By handling HTTP basic authentication along with POST and GET requests, a Zabbix web check can log into an application, calculate its response time and download rate, confirm the proper HTTP codes are being received, and search the HTTP source for a given string. Alerts via email, Jabber, or SMS can be triggered for all of these values. Actions on the remote server can also be configured, for example to restart an application.
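For readers who want to see what such a check measures, here is a rough standard-library Python sketch of a single web check step. It is an approximation of the idea, not Zabbix’s implementation; the function name and the shape of the result are our own.

```python
import time
import urllib.error
import urllib.request


def run_web_check(url, required_text, expected_status=200, timeout=10.0):
    """Fetch url once and report status, timing, and content match."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
            body = response.read()
    except urllib.error.HTTPError as error:
        # 4xx and 5xx answers arrive as exceptions but still carry a body.
        status = error.code
        body = error.read()
    elapsed = time.monotonic() - started
    text = body.decode("utf-8", errors="replace")
    return {
        "status": status,
        "status_ok": status == expected_status,
        "text_found": required_text in text,
        "response_seconds": elapsed,
        "bytes_per_second": len(body) / elapsed if elapsed > 0 else 0.0,
    }
```

Connection failures such as DNS errors or refused connections are deliberately left to raise urllib.error.URLError, since an unreachable host is a different alert from a wrong status code or a missing string.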

For IaaS providers like AWS, Zabbix has multiple options to keep your internal and cloud infrastructures healthy and happy. Zabbix proxies can operate behind firewalls, initiating contact with your Zabbix Server, collating data for its network and relaying it back to the Zabbix Server on scheduled, configurable intervals.

For larger deployments, Distributed Monitoring (DM) can allow for multiple levels of access and monitoring. Child nodes report to their master nodes, but each node can also act as a stand-alone Zabbix Server installation. With this configuration, certain responsibilities can be delegated to a remote location, but can be overridden from the Master node whenever it is desirable.

Configuration of both proxy servers and DM is well documented and straightforward. They can also be accomplished within the magical “one hour of your life” timeframe, like the standard Zabbix Server installation and configuration.

Closing Thoughts

Zabbix began providing returns for our IT group almost immediately. Within hours of the server being activated, it had found issues on systems that we had not detected previously. It currently helps us maintain our infrastructure by alerting us to issues before they affect our end users and customers, by automatically acting on certain conditions (e.g. restarting a balky application) so we don’t have to waste our time on mundane drudgery, and by giving us a troubleshooting platform to help diagnose issues from the perspective of the entire infrastructure.

It is true that any major monitoring application could do the job if given proper resources and time. However, with the requirements that we laid out to guide our decision, we are more than satisfied with Zabbix as our enterprise monitoring application.