Thoughts on Monitoring, part two

In my last post on monitoring, I attempted to define what monitoring is in practical terms.  Now I’ll give you a brief rundown of my history with monitoring and what I learned as a result.

I started with system monitoring way back in the ancient days of the web; probably somewhere around 1998, give or take.  I played with several different monitoring packages, like Big Brother and What’s Up Gold, but the one that I ended up settling into with some level of fondness was Netsaint (now Nagios).  Unfortunately, memory being what it is, I don’t know exactly when I started using it – but I can tell you that I found mailing list archives from 1999 where I was discussing it, so sometime in mid-1999.  ;)  (The crazy thing is that it looks like I must have started using it within a month or so of its initial public release; I have no idea how I even found it that quickly.)

The various monitoring tools back then were very primitive; you can get some idea of the kind of “innovative” features we were getting at the time by looking at some of the old release notes for Netsaint: http://web.archive.org/web/20000530175949/http://www.netsaint.org/docs/0_0_4/whatsnew.html (Thanks to the Internet Archive!)

Despite the limited featureset and the generally primitive state of the industry at the time, Netsaint (we’ll call it by its current name, Nagios, going forward) accomplished all four of the major monitoring requirements quite ably:

Collection: Nagios has primarily relied on external scripts or plugins that poll (remote) resources and then feed the results back to the parent process.  It has also, for several years, had the concept of passive checks, which are essentially push monitors: they sit idle, waiting for new status results to be pushed into them.

Fault Detection: Nagios has had the concept of warning and critical thresholds for each of its service checks since its inception; if the returned value trips a threshold, it acts accordingly.

Scheduling: Nagios has also had the concept of basic repeated check scheduling since the beginning; each check is assigned a polling time, and it attempts to run each check as often as specified.  It handles overloading fairly gracefully and doesn’t schedule the next check of a service until the previous one has run, preventing it from getting too far behind schedule by piling checks on top of each other.  It did not, however, have the concept of a cron’d check that runs at a specific time (but that could be accomplished through passive checks and cron).

Alerting/ Reporting: Finally, since the beginning, Nagios has supported the concept of calling an external program whenever it detects a fault – typically, a script that would drop an email to raise an alarm, or perhaps would send an SMS to your pager.  In addition, it has had a web interface for viewing the current status of hosts, services, etc. since early on.
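To make the collection and fault detection conventions above a bit more concrete, here’s a minimal sketch of what a Nagios-style check plugin typically looks like.  The plugin name, the path, and the threshold values are made up for illustration; the real contract is just the exit code – 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN – plus a line of status text, with optional performance data after the pipe.

    #!/usr/bin/env python
    # check_disk_pct.py - an illustrative Nagios-style plugin, not a real shipped one.
    # The contract is the exit code (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN) plus a
    # line of status text; anything after the '|' is performance data for graphing.
    import os
    import sys

    def check_disk(path, warn_pct, crit_pct):
        try:
            st = os.statvfs(path)
        except OSError as e:
            print("DISK UNKNOWN - %s: %s" % (path, e))
            return 3
        if st.f_blocks == 0:
            print("DISK UNKNOWN - %s reports no blocks" % path)
            return 3
        used_pct = 100.0 * (1 - float(st.f_bavail) / st.f_blocks)
        detail = "%s is %.1f%% full | used_pct=%.1f%%;%d;%d" % (
            path, used_pct, used_pct, warn_pct, crit_pct)
        if used_pct >= crit_pct:
            print("DISK CRITICAL - " + detail)
            return 2
        if used_pct >= warn_pct:
            print("DISK WARNING - " + detail)
            return 1
        print("DISK OK - " + detail)
        return 0

    if __name__ == "__main__":
        # e.g. ./check_disk_pct.py / 80 90
        path = sys.argv[1] if len(sys.argv) > 1 else "/"
        warn = int(sys.argv[2]) if len(sys.argv) > 2 else 80
        crit = int(sys.argv[3]) if len(sys.argv) > 3 else 90
        sys.exit(check_disk(path, warn, crit))

Nagios runs something like that on its schedule, reads the exit code, and the rest – notifications, the web UI, and so on – follows from there.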

So if Nagios could do all of those things over 10 years ago, why have people continued to create new monitoring systems?  Well, easy: for all of the things Nagios doesn’t do, or at least doesn’t do well.

One of the earliest things you discover as a Nagios user is that if you want historical reporting – say, a graph of the values Nagios is monitoring – you’re basically out of luck.  It has basic availability and trend reporting, but it’s very simplistic and only reports on which state a service is in – OK, WARNING, CRITICAL, etc.  In order to see graphs of actual values, you’ve always had to plug in an external tool – often an RRD-based graphing system.  So the first missing feature is basically a subset of alerting and reporting: Trending.  Several tools serve this purpose; some are stand-alone, like Cacti, and others are integrated with Nagios, like pnp4nagios.  Either way, without being able to go back in time and see how a service has responded, or how quickly a disk has been filling up, it can be very difficult to determine proper thresholds for services, and difficult to know how severe an alert really is.  On some systems, a disk hitting 90% might be cause for huge alarm; on others, it may normally hover there.  Teaching new people what to expect from a service without a good view of its history gets to be challenging.
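As a toy illustration of why that history matters – not tied to any particular tool, and with made-up numbers – here’s the kind of question a status-only view can’t answer: given a handful of (timestamp, percent-used) samples for a disk, roughly when does it hit 100%?

    # Toy trend estimate: given (timestamp, pct_used) samples, when do we hit 100%?
    # Purely illustrative - real setups keep this history in something like RRD files.
    def time_to_full(samples, limit=100.0):
        """samples: list of (unix_timestamp, pct_used) tuples, oldest first."""
        if len(samples) < 2:
            return None
        n = len(samples)
        mean_t = sum(t for t, _ in samples) / float(n)
        mean_v = sum(v for _, v in samples) / float(n)
        # Simple least-squares slope, in percent per second
        num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
        den = sum((t - mean_t) ** 2 for t, _ in samples)
        if den == 0 or num <= 0:
            return None  # flat or shrinking usage: no meaningful ETA
        slope = num / den
        last_v = samples[-1][1]
        return (limit - last_v) / slope  # rough seconds until the limit

    # e.g. hourly samples climbing ~1.5%/hour -> full in roughly 6-7 hours
    history = [(0, 87.0), (3600, 88.0), (7200, 90.0)]
    eta = time_to_full(history)
    if eta is not None:
        print("Disk full in roughly %.1f hours" % (eta / 3600.0))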

Speaking of alarms, an oft-forgotten but critical portion of managing them in a large environment is the concept of Dependencies.  Any medium to large sized environment is going to have at least one host or service that depends on another, and often will have a huge chain of cascading services; when one fails, you don’t want to get alerted about all of them!  Ideally, your monitoring system will be smart enough to figure out that if the router that connects you to data center 2 goes down, everything inside of data center 2 may or may not be down – and you have no way of knowing until the service everything there depends on (inter-data center routing!) is restored.  In those scenarios, getting an alert for every host and service in data center 2 is not helpful, and is potentially misleading.  Ideally, you just want to get an alert about that shared dependency.  Nagios added the ability to set up parent/ child dependency relationships between hosts and services over 10 years ago, solving that problem (as long as you had the time and patience to figure out every dependency and configure it!).
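The suppression logic itself is conceptually simple.  Here’s a rough sketch with made-up host names and topology: a down host only pages if nothing upstream of it is also down.

    # Toy dependency suppression: only alert on the "highest" failed dependency.
    # Host names and topology here are made up purely for illustration.
    parents = {
        "dc2-router": None,          # nothing upstream that we monitor
        "dc2-switch": "dc2-router",
        "web-01": "dc2-switch",
        "web-02": "dc2-switch",
        "db-01": "dc2-switch",
    }

    def alertable(host, down):
        """A down host only pages if nothing it depends on is also down."""
        parent = parents.get(host)
        while parent is not None:
            if parent in down:
                return False  # an upstream failure already explains this one
            parent = parents.get(parent)
        return True

    down = {"dc2-router", "dc2-switch", "web-01", "web-02", "db-01"}
    print([h for h in sorted(down) if alertable(h, down)])
    # -> ['dc2-router']: one page about the shared dependency, instead of five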

Sometimes, though, you don’t realize there’s a dependency, or there may be a dependency you don’t have an easy way of monitoring or expressing – for instance, you may not be able to monitor power or cooling to your data center racks, and a failure there could easily impact many unrelated services.  In those scenarios, dependency management won’t help; the only thing that will really relieve your aching pager is some sort of alert management, often in the form of Alert Aggregation.  In other words, if 100 things break at once – related or not – you probably don’t want to receive 100 separate alerts.  Nagios didn’t have a built-in way of managing this, so I wrote NANS, an aggregate notification system, which was fairly popular years ago.  Newer monitoring systems hopefully have some sort of throttling or (better) aggregation built in.
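Here’s a toy sketch of the aggregation idea – loosely in the spirit of what NANS did, though not its actual code or interface: hold alerts for a short window and send one combined notification instead of one page per alert.  The window length and message format are made up.

    import time

    # Toy alert aggregation: batch anything that fires within a short window into
    # a single notification instead of paging once per alert.  The window length
    # and message format are made up, not any particular tool's behavior.
    class AlertAggregator:
        def __init__(self, window_seconds=60):
            self.window = window_seconds
            self.pending = []
            self.window_started = None

        def add(self, alert, now=None):
            now = time.time() if now is None else now
            if self.window_started is None:
                self.window_started = now
            self.pending.append(alert)
            if now - self.window_started >= self.window:
                self.flush()

        def flush(self):
            if not self.pending:
                return
            if len(self.pending) == 1:
                print("ALERT: %s" % self.pending[0])
            else:
                print("%d alerts in the last %ds:" % (len(self.pending), self.window))
                for alert in self.pending:
                    print("  " + alert)
            self.pending = []
            self.window_started = None

    agg = AlertAggregator(window_seconds=60)
    for svc in ("web-01 HTTP", "web-02 HTTP", "db-01 MySQL"):
        agg.add("CRITICAL %s" % svc, now=0)
    agg.flush()  # one combined page instead of three

A real implementation would also flush on a timer rather than waiting for the next alert to arrive, but the idea is the same.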

There are other “features” you want in a monitoring tool, of course – the ability to be extended in a modular fashion, to perform well in a real environment and ideally scale almost without limit, to be stable and reliable, to be consistent, and to be easily configured and ideally automated… but to a certain extent, those are “expected”.  That doesn’t mean you always get them, of course, just that you’d be surprised to find a tool that didn’t offer them (or claim to, at the very least).

Typically, to get the “missing” features from a tool like Nagios, you bring in other tools.  They could be something like Cacti, for graphing, or Ganglia, to more easily monitor and manage clusters of machines.  You may also use third-party monitoring sites, especially in the web world, to monitor your performance from various regions, browsers, etc.  The problem with these scenarios is that you typically have to configure, manage, and scale each of these tools separately.  What’s even worse is that you also have to manage their reporting and alerting separately, meaning that it’s entirely possible to have several systems that all alert you at once when something breaks, defeating the purpose of the alert aggregation you worked so hard on earlier.  (Another alternative is that you run all of these disparate alerts into one aggregation and management system – PagerDuty is a popular option nowadays for people looking to manage their alerts in a single place.)

A newer class of monitoring tools attempts to be all-in-one.  The one I’m most familiar with is Zenoss, which is a “clean” (from-scratch) implementation of a monitoring system, but there are also several that build on the Nagios foundation or that have their own clean implementations.  You can find a decent – but not exhaustive – list of some of them here: http://en.wikipedia.org/wiki/Comparison_of_network_monitoring_systems.  (For the record: I don’t consider most of those to be network monitoring systems, as they can monitor so much more.  For most people, network monitoring is just one small piece of the picture.  I guess you could argue that you’re monitoring devices on a network, but even that’s not necessarily true – you could theoretically use one to monitor local components like disks or CPUs on the monitoring system itself.)

The thing I’ve found about so many of these all-in-one systems is that they don’t really work all that well, either.  But we’ll get into that in the next post.  ;)

Note: I actually started writing this post a year ago – in August 2012 – and then got distracted by work and life and left it.  Whoops?  I’m publishing it without further review or editing, because it’s worth getting out there.  I’m going to need to loop back around, reread these posts, and continue the series, but first, I think I’ll digress to another related series.  :P

MongoDB Performance Tuning and Monitoring Using MMS

I really don’t post often enough.  Hmm.

So, news: I’m working at 10gen now, as the Technical Services Manager for North America.  I recently spoke at MongoNYC and then delivered an extended version of the talk today in webinar form.  The video should be live tomorrow, but if you’d like to check out the slides, I uploaded them earlier; here they are:

Enjoy!

Thoughts on Monitoring, part one

I was inspired* to write this by a recent blog post I read by Michael Gorsuch.  I’ve been thinking about monitoring – and, when people have listened, ranting about it – for years… but I’ve never put those thoughts down in a blog post.  Here’s the first in what will be an ongoing series of posts.

First off, the ever-important question: what is monitoring?  Google tells us that to monitor means to…

Observe and check the progress or quality of (something) over a period of time; keep under systematic review.

Ok, that’s something to start with!  So we’re talking observation and checking on the progress or quality of something, and we’re looking at it over a period of time.  To break it down into its required core components, then, we get:

  • Collection – this is the observing part, where we pull in data on the state of something, often its availability (is it working?) or its performance (how quickly is it working?).
  • Fault Detection – this is the checking part, where we compare the data we’ve collected to a threshold or thresholds to determine its state.
  • Scheduling – this part kicks off the observation and checking portions of the monitoring software, where it functions over a period of time.
  • Alerting/ Reporting – this portion of the service tells you when a fault is detected, and hopefully when it returns to an acceptable state.

That’s it.  There’s a lot more that goes into a monitoring system in practice, but those are the core components that are required for a usable system.  Over time, monitoring systems have gone from bare-bones systems that barely met the requirements above to incredibly complex systems that attempt** to solve dozens of requirements, far above and beyond the base requirements.  Some of this is good… and some of it, not so much.
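To make those four components concrete, here’s a deliberately tiny sketch of a monitoring loop.  The host, port, thresholds, and the “alerting” (just a print) are all made up – the point is only to show where each of the four pieces lives.

    import socket
    import time

    # A deliberately tiny monitoring loop.  The host, port, thresholds, and the
    # "alerting" (a print) are all made up; the point is just to show where
    # collection, fault detection, scheduling, and alerting/reporting each live.

    def collect(host, port, timeout=5.0):
        """Collection: observe how long a TCP connect takes (or whether it fails)."""
        start = time.time()
        try:
            socket.create_connection((host, port), timeout).close()
            return time.time() - start
        except socket.error:
            return None

    def detect(latency, warn=0.5, crit=2.0):
        """Fault detection: compare the observation against thresholds."""
        if latency is None or latency >= crit:
            return "CRITICAL"
        if latency >= warn:
            return "WARNING"
        return "OK"

    def alert(state, latency):
        """Alerting/reporting: in real life this is email/SMS/pager; here, a print."""
        print("%s - connect latency: %s" % (state, latency))

    def monitor(host="example.com", port=80, interval=60, iterations=3):
        """Scheduling: repeat the check every `interval` seconds."""
        for _ in range(iterations):
            latency = collect(host, port)
            state = detect(latency)
            if state != "OK":
                alert(state, latency)
            time.sleep(interval)

    if __name__ == "__main__":
        monitor(interval=5, iterations=2)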

I’m going to wrap up here as it’s just about 1 am and I need to wake up for work tomorrow.  ;)  In future posts, I’ll cover the early days of web monitoring (where “early days” means “around 20 years ago” – I’m old, but I’m not that old), the progression to more modern solutions, and then after we’ve stepped through a very abridged history of monitoring, we can get to the fun part: where I think monitoring needs to go next.  The rants about where monitoring should be will probably never end.  ;)

Comments?  Questions?  Send me your feedback!

* – To give credit where credit is due, the #monitoringsucks conversation that has gone on for over a year has also been an inspiration, but I have to admit I haven’t kept up with it – I’ll probably have to fix that soon.  If you’d like to read more about what it has spawned, check out the GitHub repo.

** – I said “attempt” for a reason, but we haven’t gotten there yet.  ;)

Long overdue…

Yeah… it’s been a while.  (Photo: “long overdue” by breahn, on Flickr)

Wow, I just realized how long it has been since I’ve updated my blog.  Way, way too long.

First off: a new design!  I was bored of the old look, and honestly, it was a little fug.  The text wasn’t especially easy to read, and while the dark background was nice in some ways, it’s not very pleasant to stare at for long periods of time.  The new design is cleaner, brighter, and easier to read.  It’s also a little boring, so don’t be surprised to find it changing over the next few days as I try different layouts, color schemes, etc.

Next: I’ve added my Twitter feed on the sidebar.  I’ve found myself slowly adapting to using Twitter regularly, and since that’s usually fresh even when my blog isn’t, I’ve added it in.  Now at least I can pretend the content on here is up to date.

Speaking of up to date, the next topic: posting here.  I’ve made a random-middle-of-the-year resolution: I’m going to post to this blog at least twice a month.  (I know, ooooh, twice a month.  Considering my track record, though, that’d be a big improvement.)

So, there you go.  Time to dig this blog up from the dead, and get it going again.

Along those lines, I’m updating the About page – it’s 2 years out of date.  I’ve been working at Livestream for a while now, and it’s a little crazy that my blog still hasn’t noticed that.

Whoops?

Anyways, I’ve been doing a lot of interesting work at Livestream, and I’m hoping to start sharing it, both here (in the form of blog posts, presentations, etc.) and on GitHub, as we start sharing more of the code we write and the apps we build.  We’re a little new at this – for whatever reason, while we’re an avid open source shop, we’ve never really gotten good at sharing our work with everyone else.  We hope to change that.

(As far as my own stuff goes, it’s scattered far and wide – one of the things I need to do is pull it all into one place, aka GitHub.  Google Code is fine, but I like having everything in one easy-to-manage place, and GitHub is not only easy to use and powerful but also has that ever-important geek cred.)

This blog will remain focused on long-form content; I’ve got my Tumblr for more spontaneous content, including any live-blogging I may do (courtesy of the New Livestream, an awesome platform for it), and Twitter, for random spouting off.

See you soon!

Droid Bionic review

Ok, I’ve had the Bionic in hand for about a week and a half now – long enough to give more meaningful feedback.  (If you missed it, you can see my initial thoughts here.  This review assumes you’ve at least skimmed that; if not, go check it out.)

The Basics
The Bionic is Verizon’s latest high-end phone, and the only VZW phone with both a dual-core processor and a 4G (LTE) radio.  The specs are posted all over the place – I’m not going to repeat them all – but it’s got a dual-core CPU with 1GB of RAM, a 4.3″ qHD screen, and an LTE radio, making it the most powerful phone Verizon offers (and, arguably, the most powerful combo available on any carrier right now).  However – and this is a big one, potentially – we’re mere weeks away from a flood of high-end, dual-core powerhouses, including the new Nexus Prime, which will quickly knock the Bionic down a notch or two.
