Alright, a giant Twitter rant about a monitoring system with a shitastic installer turned into a brief discussion with @neojima about graphing systems, so this brief bitty is inspired by him.
Over the years I have mentioned the work I did on network data graphs while I worked at $weDropPackets. The brief backstory, for the few of you who don’t know: I used to work for an ISP, and I was the everything guy. When I started, we were using MRTG. Now, MRTG is great for what it does, but it has some shortcomings. For starters, it can only poll and display two data points. This seems to have led to the industry standard of all network traffic graphs: data in, and data out. It also cannot display negative numbers. While not useful for traffic data, plenty of other data points exist that can be negative numbers. MRTG, out of the box, also doesn’t scale very well at all. The base config and mode of operation regenerate the images every single time new data is ingested. This gets very disk I/O intensive very fast. There is a trick to do away with this, but given that I was faced with incoming staff members and graphs that provided very little actual, useful data for the untrained eye, I decided to start mucking about with some alternatives.
What led me to settle on writing my own code that used RRDTool as the storage was a hideously ugly little program called drraw. While ugly, seriously ugly, it is a tremendously useful program. For one thing, it is stupid simple to set up. But more importantly, you throw RRD files at it, and it exposes, via a powerful yet intuitive web interface, all of the possible ways that RRDTool can manipulate and display that data. It then shows you how to get your web-based efforts into the RRDTool command line, which I then translated into perl’s RRDTool::OO. I wasn’t about to take my perl display code, and fill it with system(“rrdtool foo bar baz….”), I have my dignity. In short, it is an amazing prototyping tool.
So, shortcomings of MRTG, and we were onboarding new people who were less technically proficient. A better analysis system for tier-1 support would be great, right? The owner didn’t even consider the possibility of doing something different, nor did I ever ask him, I just did this, and it turned out great enough that, as of at least mid 2019, they still use my code and ideas.
For the uninitiated, we were a WISP, that is Wireless Internet Service Provider in this context, the acronym has been overloaded a few times now. We used primarily Cambium’s Canopy product line for our last mile delivery to homes and businesses. The classic way almost all WISPs poll and display data, if they do it at all, is to track Received Signal Strength Indicator (RSSI), and Signal to Noise Ratio (SNR). RSSI is reported by the radio in decibel-milliwatts (dBm), which, due to the power levels involved, was a negative number. SNR is a positive number. So, we take a bit of customer data (this is from my 2014 presentation slide deck on this subject), and attempt to shoehorn these two values into something useful. Here is the result. The green is the RSSI, blue is SNR.
So here we have a one week view of RSSI and SNR. Since MRTG doesn’t do negative numbers, we had to manipulate the data as it was collected, so the displayed RSSI value is actually offset by 86, -86 dBm being a level low enough that basically no SM will hold a session to the AP. So, zero on this graph is actually -86 dBm, but only for the green stuff. Confused yet? Incoming CSRs sure were. Because if you sign into the CPE, it will report an actual negative number, this one is probably around -68 dBm on average. The blue SNR is displayed as an offset from zero, actually zero this time. The higher the line, the more noise you have. So, that is fairly intuitive. Polling this is a pain in the butt, because you have to settle on your data values at poll time, not interpretation time, meaning the MRTG poller had to be shunted with a little perl script that manipulated the data prior to storage.
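To make the offset concrete, here is a minimal Python sketch of the kind of shim described above. It is not the original perl script: the function names and the clamp at zero are assumptions for illustration, and only the 86 offset comes from the text.

```python
RSSI_OFFSET = 86  # from the text: zero on the MRTG graph means -86 dBm


def rssi_to_mrtg(rssi_dbm):
    """Shift a negative dBm reading into the non-negative range MRTG can plot."""
    # Clamping at zero is an assumption; the text only says that
    # basically nothing holds a session below -86 dBm.
    return max(0, rssi_dbm + RSSI_OFFSET)


def mrtg_to_rssi(plotted):
    """Undo the shift so a graph value can be compared with what the CPE reports."""
    return plotted - RSSI_OFFSET


print(rssi_to_mrtg(-68))  # 18
print(mrtg_to_rssi(18))   # -68
```

The second function is the mental arithmetic every CSR had to do when reading the green line.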
Now take a look at the chop on the 21st, what does that mean? The signal dropped, but did the customer lose service? MRTG also has a fine habit of smudging data points together when a gap exists in the actual RRD file, so, short of checking the actual database for gaps, we don’t know if it ever failed to poll the unit. Now take a look at some lesser chop on the 22nd, 24th, and 25th. That doesn’t look so bad does it? Probably just a little bad weather, but nothing service affecting.
Good lord we were wrong. When I developed a better tool, the first thing we did was start finding issues we never knew existed.
First things first, I wanted to store more than two data points. RRDTool can do this. MRTG, even when it uses RRDTool for storage, cannot. I also wanted to store the actual values reported, not have to pre-manipulate them into something I can shoehorn some level of utility out of later. Since I also wanted to be able to write cronjobs that analyze CPE data in bulk for large scale reporting, it would be far easier if the data was actually consistently sane. Third, I wanted to display one hell of a lot more data in a single graph, without overloading the viewer.
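As a rough illustration of storing several raw values per device, the Python sketch below builds the argument list for an `rrdtool create` call with one data source per value. The data source names, the 300 second step, and the one week of rows are assumptions, not the settings actually used; `U:U` leaves the minimum and maximum unbounded so negative dBm readings can be stored as reported.

```python
def build_create_args(rrd_path, ds_names, step=300, rows=2016):
    """Build an `rrdtool create` argument list with one GAUGE per value."""
    heartbeat = step * 2  # tolerate one missed poll before a value goes unknown
    args = ["rrdtool", "create", rrd_path, "--step", str(step)]
    for name in ds_names:
        # U:U means no min/max limits, so negative numbers are accepted as-is
        args.append(f"DS:{name}:GAUGE:{heartbeat}:U:U")
    # Keep every primary data point; 2016 rows is one week at a 300 s step
    args.append(f"RRA:AVERAGE:0.5:1:{rows}")
    return args


# Hypothetical data source names for a single CPE
for arg in build_create_args(
    "cpe.rrd", ["rssi", "snr", "session_uptime", "device_uptime"]
):
    print(arg)
```

The list can be handed to `subprocess.run` on a machine with rrdtool installed, or translated into whichever binding is preferred.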
Drraw prototyping had given me the knowledge and inspiration I needed. For starters, one incredibly underused feature of RRDTool is TICK marks. Not all data has to be displayed as a line graph, you can also simply display a single point of data, based on a conditional. Oh yeah, did you know RRDTool has a fully featured Reverse Polish Notation (RPN) calculator, and supports conditionals? Most people do not, but it is insanely powerful. RRDTool also supports labels and averages, and can generate text most anywhere you want. The feature list goes on and on.
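If RPN conditionals are unfamiliar, this toy Python evaluator shows how an expression such as `rssi,-86,LT,1,0,IF` is read: push the value, push -86, compare, then pick 1 or 0. It only handles the `LT` and `IF` operators and is a sketch of the semantics, not RRDTool's implementation; the variable name `rssi` is hypothetical.

```python
def eval_rpn(expr, variables):
    """Evaluate a comma-separated RPN expression supporting only LT and IF."""
    stack = []
    for token in expr.split(","):
        if token == "LT":
            b = stack.pop()
            a = stack.pop()
            stack.append(1.0 if a < b else 0.0)
        elif token == "IF":
            else_val = stack.pop()
            then_val = stack.pop()
            cond = stack.pop()
            # Any non-zero condition counts as true
            stack.append(then_val if cond != 0 else else_val)
        elif token in variables:
            stack.append(float(variables[token]))
        else:
            stack.append(float(token))
    return stack.pop()


expr = "rssi,-86,LT,1,0,IF"  # 1 when the signal is below -86 dBm, else 0
print(eval_rpn(expr, {"rssi": -90}))  # 1.0
print(eval_rpn(expr, {"rssi": -68}))  # 0.0
```

A result of 1 or 0 per sample is exactly the shape of data a conditional marker needs.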
So, what can I do with this information? Let’s find out. I have an idea! Wouldn’t it be great if I could show the CSR if the radio ever dropped its session with the access point, in a way that lined up with some signal events? Canopy SMs report their session uptime in seconds; if they drop association, it returns to zero. With this in mind, we now start polling the same customer for a third data item, its SNMP reported session uptime. If the value is ever less than the polling interval, we draw a TICK mark at the top of the graph.
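In RRDTool terms, that rule can be sketched as a `CDEF` feeding a `TICK`. The Python helper below only builds the graph arguments; the data source name `session_uptime`, the color, and the legend are assumptions, and the negative tick fraction relies on my reading of the rrdgraph documentation, where a negative fraction starts the mark at the top edge of the graph.

```python
def session_drop_graph_args(rrd_path, poll_interval=300):
    """Graph arguments that mark every poll where the session was younger
    than one polling interval."""
    return [
        # Hypothetical data source name inside the RRD file
        f"DEF:sess={rrd_path}:session_uptime:AVERAGE",
        # 1 when session uptime < polling interval, otherwise 0
        f"CDEF:dropped=sess,{poll_interval},LT,1,0,IF",
        # Red mark for every non-zero value; negative fraction = from the top
        "TICK:dropped#FF0000:-0.05:Session dropped",
    ]


for arg in session_drop_graph_args("cpe.rrd"):
    print(arg)
```

These would be appended to an `rrdtool graph` call alongside the usual RSSI and SNR lines.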
But wait! What if the customer rebooted their CPE? That would still show up as a session restart, can we show if that happens? We sure can. The CPEs also report their device uptime in seconds. If it ever goes backwards, we can safely assume the device restarted for some reason. So, if the radio drops too much signal and loses connection, we will have an indicator, and if that loss of session is caused by the device losing power, we will have another indicator. This simple idea, plus a whole host of additional data, looks like this.
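The two indicators boil down to a pair of comparisons per poll. Here is a small Python sketch of that logic over a list of samples; the tuple layout, the event labels, and the 300 second interval are assumptions, and the sample numbers are made up purely to exercise the function.

```python
def classify_events(samples, poll_interval=300):
    """samples: (timestamp, session_uptime_s, device_uptime_s) tuples in time order."""
    events = []
    prev_device_uptime = None
    for ts, session_uptime, device_uptime in samples:
        # Device uptime going backwards means the CPE itself restarted
        rebooted = prev_device_uptime is not None and device_uptime < prev_device_uptime
        # A session younger than one polling interval means it re-registered
        dropped = session_uptime < poll_interval
        if rebooted:
            events.append((ts, "reboot"))
        elif dropped:
            events.append((ts, "session_drop"))
        prev_device_uptime = device_uptime
    return events


samples = [
    (0, 5000, 90000),
    (300, 5300, 90300),
    (600, 120, 90600),   # session reset, device kept running
    (900, 420, 90900),
    (1200, 60, 40),      # device uptime went backwards
]
print(classify_events(samples))
# [(600, 'session_drop'), (1200, 'reboot')]
```

A reboot always implies a session restart, so the sketch reports only the more specific cause for that poll.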
Wow! Look at all that information! This customer’s service isn’t just losing some power, it is dropping off the network completely! Every single one of those red marks at the top indicates that the radio had a session uptime of less than 300 seconds, or five minutes (our polling interval). I now know that the events on the 24th were also service affecting, but not those on the 25th. They looked about the same, but now we can see plain as day evidence to the contrary. I also have some nice averages, min and max values, as well as the most recent. The graph itself is now completely portable. With MRTG, I had to surround the image with text showing the device name, image creation time, and a legend. This is all now contained in the image, so it can be saved elsewhere, attached to a ticketing system, or even sent to the client with an explanation of how we plan to resolve the issue.
This pretty much proved my point that there was a better way to view this data, and we migrated off of MRTG completely once we had a few months of trend data collected. We eventually built a massive ZFS file server to get this data to disk as fast as we could collect it. I wrote a custom SNMP poller that could query over 10,000 devices for nine or more data points, in well under two minutes. Honestly, that part was super easy; getting all that data to disk took some doing, as this was all spinning rust. It was more like ~20 seconds to collect (threaded code is awesome, yo), and another 70 to finish committing it to disk.
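The collection side is mostly waiting on the network, which is why threads help so much. Below is a minimal Python sketch of that fan-out pattern. It is not the original poller: `poll_device` is a hypothetical stand-in that returns placeholder values instead of speaking SNMP, and the worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def poll_device(host):
    """Hypothetical stand-in for a real SNMP query; returns placeholder values."""
    return {"host": host, "rssi": None, "snr": None, "session_uptime": None}


def poll_all(hosts, poll_fn=poll_device, workers=64):
    """Poll every host concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Threads spend most of their time blocked on I/O, so many can overlap
        return list(pool.map(poll_fn, hosts))


hosts = [f"cpe-{n}" for n in range(10)]  # made-up host names
results = poll_all(hosts)
print(len(results), results[0]["host"])
```

Collecting everything in memory first and writing afterwards also keeps slow disks from stalling the polling threads.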
In the months that followed, Cambium released a new kind of OFDM based Subscriber Module that was dual polarity. Sadly, I do not have an example of those graphs, but we went from graphing just RSSI and SNR to graphing each value for both vertical and horizontal polarities. To keep the graphs easy to see, we printed the absolute values for the new data sources at the bottom like we had before. But for the visual side of things, we made the existing lines a bit thicker, and represented the new polarity as thin lines over top of them in a color that stood out well. In a perfect deployment, they lined up very well. In a less than stellar installation, it was very easy to see when a single polarity was doing something wonky.
So, I looked around, and found another image for you that I just love. Same concept as before: choppy service, even choppier than just the gaps in RSSI would indicate. I mean, just look at all that red! What makes this one interesting is that I remember it quite well (it was a friend’s house). The drastic change where both RSSI and SNR got worse, but stability improved dramatically, perfectly illustrates my point: more information is better.
Stay tuned for Part II of this post, where I will show off traffic analysis.