That MSTP From Hell Story

Yet another story I tell all the time at hackercons, now in full textual glory.

As many of you know, I spent a good chunk of time working for a WISP. That is Wireless Internet Service Provider for you non-technical civilians. I used to refer to this org either by name, or as $weMicrowavePackets. Get it? We used microwaves to send packet data? It’s funny, it’s a funny joke, laugh. Anyways, one of the many, many, many reasons I left was my sheer disgust at just how bad our service had gotten in my final years. We were so completely oversubscribed. The customers complained to the customer service group constantly, who in turn came to me, the senior network engineer, for a solution. There was none. We were oversubscribed, simple fact, and I had neither the authority nor the budget to do anything about it. After I left, the name $weMicrowavePackets died, replaced by the far less flattering moniker $weDropPackets. Believe you me, in the last years I was there, we did more than our fair share of it.

The wireless network was essentially divided into two pieces: East and West. They were connected by way of a single 1.7 gbit wireless link, and later, a ring of mostly 10 gbit fiber was added. Sadly, one expensive as hell 1 gbit segment in that ring meant the East side of the network only had a theoretical max of 2.7 gbit bandwidth to the outside world. We had two 10 gbit upstream circuits, as well as 10 gbit to a CDN aggregation network that included Netflix, but none of that is where we had bottlenecks.

This meant roughly 50% of our customers had access to roughly 12.5% of our available bandwidth. In short, the entire eastern half of the network was oversubscribed right at its headend. This had the effect of making it far easier to deal with, despite generating just as many customer complaints as its western sister net. None of the links leading up to the headend, save the penultimate hops, were generally oversaturated. All the traffic had one way out, and every path to that egress sucked because the egress itself sucked. So, at least it sucked consistently. This actually made my life significantly easier.

Contrast this with wireless zone west, which had 10 gbit fiber to 20 gbit of available uplink. It should go without saying at this point that price, not utility, determined everything.

Balancing L3 traffic between these two zones and our upstream providers was also a lot of fun. We had a /19 and a /22 of public address space. A lot of that was announced as /23s in various states of AS path prepend, and was changed on a fairly regular basis. That madness is another entire rant unto itself.
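For a sense of how many knobs that is, here is a minimal Python sketch that counts the /23s hiding inside a /19 and a /22. The two prefixes below are private placeholder blocks chosen for illustration, not the real allocations.

```python
import ipaddress

# Placeholder private blocks; the real public allocations are not given here.
blocks = [
    ipaddress.ip_network("10.0.0.0/19"),
    ipaddress.ip_network("10.0.32.0/22"),
]

def split_into_23s(networks):
    """Return every /23 contained in the given networks."""
    out = []
    for net in networks:
        out.extend(net.subnets(new_prefix=23))
    return out

announcements = split_into_23s(blocks)
# 16 from the /19 plus 2 from the /22
print(len(announcements))
```

Each of those /23s is a separately tunable announcement, which is roughly why the prepend juggling got out of hand.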

Both sides of the network had evolved organically out of a significantly smaller network of the same core design. The earliest stages worked reasonably well, but they did not scale. In the earlier era, east and west were actually one large mesh. As we added new tower sites, the scale issue came to a head. I had advocated a switch from MSTP (Multiple Spanning Tree Protocol) to MPLS/VPLS very early on, but was consistently shot down.

Side note: Apparently, to this day, MPLS is considered some kind of Juniper/Cisco propaganda at $weDropPackets. They believe it doesn’t actually scale. I have since worked for networks 10 times the size of $weDropPackets that were near 100% MPLS. Trust me, it scales just fine. You can thank a disparaging 2001ish NANOG comment by Randy Bush for the owner’s attitude toward MPLS. It isn’t like technologies mature over time.

So the network lived on in two pieces, as a near unmanageable MSTP mess. A real network engineer will tell you that a backup link has to be able to absorb all the traffic from the primary, or it isn’t a very good backup. An owner/manager concerned with the bottom line will insist that all potential links should run balls to the wall, because, “stuff rarely breaks and traffic is money!” Unfortunately, near as I can tell, MSTP was created for managers with this exact mindset. If you are so oversubscribed that you need to split up your layer two paths, there is a damn good possibility you have a massive problem.

It wasn’t my design, it wasn’t what I wanted, but it was my problem. I had to keep this thing running. My name was Sisyphus and this was my boulder. I was, on paper, the senior network administrator. In reality, I was that so long as my boss hadn’t had a bad day. If he was in a foul mood, all bets were off. He became a micromanaging asshat who would make undocumented changes, ignore change controls, and say “Fuck” an awful lot. Then the undocumented mess he had created would piss him off because our documentation sucked, and the cycle would begin again. Keeping him in a less than bad mood actually became a network stability issue toward the end. Not a pretty picture is it? This was my job, seven days a week. The Internet never sleeps.

So, the western wifi zone. Sixteen major tower sites, and a dozen or so minor ones, and one, singular, way, out. Major sites had links to more than just one other tower site. Minor towers were stubs, they had one uplink to a major site. The end result was a “core” network of 20 or so wireless links, between 16 sites, and all traffic going toward, and coming from, a single headend tower.

This was all managed via MSTP. It was all layer 2, strictly Ethernet.

Let that sink in. Link speeds anywhere between 35 mbit and 700 mbit. Sixteen sites. Twenty something major connections between them. All MSTP. No end in sight, and the owner thought this was just fine. My job was to keep it alive. My solution was horrifying, but I stand by it as the only reasonable course of action left to me.

From memory, I have pieced together what this network looked like. I cannot seem to recall two of the towers, but this is pretty accurate for the 14 I can recall:

I’m missing a few towers and links, but that should illustrate the horror that was the western network. And yes, LU to LE had two links. Many of these links are actually dual wireless in a link aggregation, but this particular one was of two wildly different speeds. That doesn’t aggregate well at all, so they were handled as separate entities.

Also, this is not to any scale. These links existed mostly as geography dictated. Graphviz just built its own layout.

We had already deployed MSTP in the form of two instances, plus the common. When we split the network in half, that became four instances, plus two commons. That worked for a while.

Then, the over-subscription got really bad.

The western wifi zone effectively had, at least at the headend, 10 gbit of bandwidth for approximately 50% of the customer load. But the links leading into the headend, numerous as they were, added up to no more than four gigabit of total capacity. To add to the hell, none of the links were of the same capacity, nor did we have any automated way of dealing with overloaded links. Remember when I said I had advocated MPLS? Now I was outright begging for it. I even built the core network around it, at least in name. Switch management networks (layer 3 switches whenever I won that argument) were all OSPF connected, and named things like “WifiWest-mpls-105.” WifiWest was the network label, mpls was the predominant purpose of the network, and 105 was the VLAN tag. I was building the capability to switch to MPLS in multiple stages. Network tests indicated we could deliver (barring capacity issues) line rate L2 services between wireless east and west. MPLS was never deployed. We instead deployed a “carrier grade Ethernet” solution that was awful on many, many levels.

We also had a pile of aging Cisco L2 switches, some of which did not speak industry standard MSTP. They spoke only Cisco’s pre-standard version of the spec. We had become, under my direction, a Juniper shop. I will not apologize, I love Juniper EX switches. But this meant that we often had to create a buffer between tower sites with Juniper gear and tower sites with legacy Cisco switches.

The Juniper EX3200 was the standard, if-I-got-my-way switch, later replaced by the EX3300. The legacy switch was the Cisco 2955. The 2955 only speaks Cisco’s pre-standard MSTP implementation. The Juniper switches only spoke the industry standard. So, migrating between the two required the introduction of an intermediary. We chose the Cisco 2960s series. The 2960s will speak both: they will exchange BPDUs with “pre-standard” 2955s, as well as with standards-compliant, modern switches. This meant we literally had to place a 2960 in between any Juniper and any legacy Cisco deployment. A 2955 adjacent to a Juniper switch will not communicate MSTP frames... at best it will be reasonably RSTP sane.

This would have been fine, except... we kept bringing 2955s back into service. As I told one vendor when they inquired about our switch life cycle... “we run a switch until the smoke comes out, then we go out with a soldering iron, and put the smoke back in, and we run it some more!” The 2955 series hit end of life in early 2013, we were still deploying them in May 2016... the month I left.

Reason number god only knows I left this job... I regularly had to point out that we could not safely deploy a spare 2955 at new site X, because it would literally break the network. I was always told I was being difficult or accused of wanting more shiny new toys.

The culmination of hell was actually managing individual links of the western core. Two MSTP instances had become four, two east, and two west. After several colossal failures due to oversubscribed links, the two western instances became nine. Yes, nine MSTP instances.

The common instance was relegated to management traffic only at first, until reaching congestion-impacted switches via SSH became an issue, and management was broken up into the multiple MSTP instances. Eight of the nine customer zones were similar, they all had the same goal: Get traffic to the only way out, the headend of wifi west. The final instance was actually an emergency services related VLAN, and fortunately, had a different ultimate destination than the rest of the customer traffic. Unfortunately, that traffic also carried real urgency, often over antiquated equipment that had no real ability to prioritize L2 data, certainly not emergency services’ traffic. But we did consider it a higher priority. To make matters worse, the emergency services were using old analog radios, and forcing them to work over IP in a way that the manufacturer considered a very bad idea. It broke all the time, and we were always blamed.

At this point, I was literally adjusting metrics on a daily basis... live. For those of you not in the network world, spanning tree network metrics are purely layer two. This means useful tools like traceroute do not exist. Traceroute exists at layer three, a layer which is, quite frankly, not terribly difficult to diagnose. Layer two issues by contrast, are very hard to visualize, and the tools to assist you are far, far less common.

MSTP config changes are not easy, and they come in two flavors. The first is VLAN membership to instances. Changing VLAN memberships is insanely difficult, as you have to change every switch in the same MSTP domain. So, either make it all happen at once, or carefully plan how you will break the network and then put it back together in a new form. Even if you have kickass automation, this almost universally requires a 3am maintenance window.
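To make the "every switch" part concrete: switches only treat each other as part of the same MST region when the region name, the revision level, and the VLAN-to-instance mapping all match. Below is a minimal Python sketch of a pre-change sanity check along those lines. The switch names, the region name used here, and the mappings are all hypothetical, and the SHA-256 fingerprint is only a stand-in for the digest the real protocol exchanges in its BPDUs.

```python
import hashlib

def region_fingerprint(name, revision, vlan_to_instance):
    """Stand-in fingerprint of an MST region config (NOT the real protocol digest)."""
    # Full 4096-entry table; unmapped VLANs fall into instance 0, the common instance.
    table = [vlan_to_instance.get(vlan, 0) for vlan in range(4096)]
    payload = f"{name}|{revision}|" + ",".join(map(str, table))
    return hashlib.sha256(payload.encode()).hexdigest()

def mismatched_switches(configs):
    """Return switch names whose fingerprint differs from the most common one."""
    prints = {sw: region_fingerprint(*cfg) for sw, cfg in configs.items()}
    values = list(prints.values())
    majority = max(set(values), key=values.count)
    return sorted(sw for sw, fp in prints.items() if fp != majority)

# Hypothetical switches and mappings, for illustration only.
configs = {
    "tower-a": ("WifiWest", 1, {105: 1, 106: 2}),
    "tower-b": ("WifiWest", 1, {105: 1, 106: 2}),
    "tower-c": ("WifiWest", 1, {105: 1, 106: 3}),  # one VLAN moved on one switch only
}
print(mismatched_switches(configs))
```

A switch flagged by a check like this would sit in its own region until the rest of the network caught up, which is the "break the network and put it back together" part.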

The second is per link metric changes. One can change link costs on a single switch without technically causing a meltdown. Still, changing a single metric on a single instance on a single link at one end of the network could potentially send 500+ mbit of traffic thundering in a very new direction. Any sane IT admin would relegate this to a 3am maintenance window. I did not have the luxury of sanity. I started my prep at 3pm every day, and took changes live before 4:30pm.
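Here is a rough Python sketch of why one metric matters so much. It treats a single instance as a graph and picks the lowest total cost path from a tower to the root, which is the core of how spanning tree chooses a path. The tower names and costs are made up, and real spanning tree also breaks ties on bridge and port IDs, which this ignores.

```python
import heapq

def root_path(links, root, start):
    """Lowest-cost path from start to root over undirected links {(a, b): cost}."""
    graph = {}
    for (a, b), cost in links.items():
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))
    best = {start: 0}
    queue = [(0, start, [start])]
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == root:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, link_cost in graph.get(node, []):
            new_cost = cost + link_cost
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(queue, (new_cost, nxt, path + [nxt]))
    return None

# Hypothetical four-site topology with made-up costs.
links = {("headend", "A"): 10, ("headend", "B"): 10, ("A", "C"): 10, ("B", "C"): 20}
before = root_path(links, "headend", "C")   # C reaches the headend via A
links[("A", "C")] = 30                      # one metric raised on one link
after = root_path(links, "headend", "C")    # C now reaches the headend via B
print(before, after)
```

Everything hanging off C moves with it, so a single number on a single link can shift an entire neighborhood's traffic onto a different radio.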

In my last year at $weDropPackets, adjusting metrics became a daily admin task. My task. I was overworked, nothing was ever good enough, and despite my protests, the network alternated between, “just fine,” and “the way it’s gotta be.” We were great, we were the best WISP ever, we could do no wrong.

The final numbers speak for themselves.

  • 1000+ square miles
  • 16 major switches
  • 20+ minor switches
  • 9 MSTP instances (and the common zone)
  • one way out for 99.95% of customer traffic.

In the end, that one way out is what killed the network on a regular basis. All traffic had to reach the headend by any means necessary. A failure to balance the daily traffic on all links would result in dropped BPDU frames somewhere. The ensuing flood of traffic over a new switch path would cause more dropped BPDU frames, and a cascade failure was the end result. Network stability became a function of my ability to think on my feet, and predict the future. Starting at 3:00pm every day, seven days a week, I loaded core network graphs, looked for the overloaded sections, and adjusted MSTP metrics... LIVE. It was basically a question of what sections of the network were consuming the most Netflix after school that day. It changed all the time.
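The "look for the overloaded sections" step was done by eyeballing graphs, but the idea can be sketched in a few lines of Python. The link names, the traffic samples, and the 85% threshold below are all assumptions made for illustration. Only the 700 mbit and 35 mbit capacities echo the range of link speeds mentioned earlier.

```python
def overloaded_links(samples, threshold=0.85):
    """Return (link, utilization) pairs above threshold, worst first.

    samples maps a link name to (observed_mbit, capacity_mbit).
    """
    hot = []
    for link, (observed, capacity) in samples.items():
        utilization = observed / capacity
        if utilization > threshold:
            hot.append((link, round(utilization, 2)))
    return sorted(hot, key=lambda item: item[1], reverse=True)

# Hypothetical afternoon readings.
samples = {
    "A-headend": (650, 700),
    "B-headend": (120, 300),
    "C-A": (33, 35),
}
print(overloaded_links(samples))
```

A list like this only says where the pain is. Deciding which metric to nudge, and guessing where the traffic would land afterward, was still the human part.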

I did this for over a year, and I became extremely good at it. I had to be. Because when I failed to balance the daily traffic, the cascade failure happened. It would take over 50% of our paying customers offline for an hour or more, and guaranteed I would get my ass chewed out at the next staff meeting. I proposed many solutions, and yes, they all cost money, and yes, they were all rejected. We had no money, and when we did, the network rarely saw it. Yes, our primary product was our network, and it was held together with duct tape, blood, and tears.

I am ashamed to admit, those last points were my driving force for the last year I was at $weDropPackets. Everything I did was predicated on that low level of human need. I wanted to not get yelled at, that is literally all I wanted anymore. I had recently developed Meniere’s disorder, and stress was consistently triggering vertigo spells. I would wake up Monday morning, knowing I faced yet another scream out session before the senior staff, and immediately go into vertigo. In my final year, I burned through all my sick time, and nearly half my PTO. When I finally turned in my 30 days’ notice, friends started noticing a difference immediately. The words that guaranteed, more than anything else, that I would not reconsider leaving were from my wife. “You have been calmer, happier, and more affectionate in the last two weeks than you have been in the last two years, I feel like I got my husband back.”

No job is worth your health and happiness, not even for a funny horror story like this one.