How the T-Mobile outage of 2020 went down

Loop, oscillate, cascade: human meets machine, with unpredictable consequences.

Do you remember the T-Mobile outage of June 2020, which left T-Mobile scrambling to recover as users were unable to send messages or make calls for a prolonged period?

A brief presentation from Orange on network resiliency has shed further light on what happened.

At the time, T-Mobile’s CTO Neville Ray said that an optical link had gone down, and that link redundancy had also failed, which had “resulted in an overload situation that was then compounded by other factors. This overload resulted in an IP traffic storm that spread from the Southeast to create significant capacity issues across the IMS (IP Multimedia Subsystem) core network that supports VoLTE calls.”

The Orange presentation, given by research VP Brigitte Cardinael at Orange’s Salon de la Recherche exhibition, took up the story of an outage severe enough that some 911 calls could not be placed.

1. First, an optical link goes down. “It happens,” shrugs the pragmatic Cardinael.

2. A back-up fibre route is established. But someone had erred: the wrong parameters meant the back-up route was connected not to the original network server but to a much smaller server that couldn’t handle all the call requests, and quickly became saturated.

A new link, but wrongly connected to a much smaller server. Uh-oh.

3. Then the optical link kicked back in spontaneously, as part of the network’s own self-management. Devices automatically took this as a sign that they could safely reconnect to 4G. So they tried to call again, but the calls still had to go through the small server, and they didn’t get through.

The third error: disconnecting the restored optical link, in the belief that the problem lay with the link itself.

4. So the network ops staff say, “Ah shoot, the server is still saturated, so the optical link must have been restored incorrectly.” They disable it (deliberately, this time), and we are back to the same situation: a working back-up link connected to a server with insufficient capacity to cope. The mobiles go into signalling confusion, oscillating between 4G and WiFi, sending loads of requests that all pile up in our poor little server, which cannot handle them. The server offloads this load onto other servers, and by a cascade effect the problem that had been limited to Atlanta spreads across the US. For 12 hours it was difficult to make 4G or WiFi calls. Users had to try two or three times, and some calls, including 911 calls, did not go through at all.

Here comes the cascade. Erk.
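The mechanics of that cascade can be sketched as a toy model. This is emphatically not T-Mobile’s actual architecture, just an illustration of the dynamic Cardinael describes: a saturated server sheds excess load onto its neighbours, retrying devices amplify the total load, and a fault confined to one site spreads network-wide.

```python
def simulate_cascade(capacities, initial_load, retry_factor=1.5, steps=10):
    """Toy cascade model: overflow at any server is amplified by device
    retries and offloaded evenly onto all servers. Returns final loads."""
    loads = [0.0] * len(capacities)
    loads[0] = initial_load  # the fault starts at one small server
    for _ in range(steps):
        overflow = 0.0
        for i, cap in enumerate(capacities):
            if loads[i] > cap:
                # Requests the server cannot handle are retried by devices
                # (oscillating between 4G and WiFi) and offloaded to peers.
                excess = loads[i] - cap
                overflow += excess * retry_factor
                loads[i] = cap
        if overflow == 0:
            break  # nothing saturated: no cascade
        share = overflow / len(capacities)
        loads = [load + share for load in loads]
    return loads

# A small server (capacity 10) hit with load 50; its peers start idle,
# yet end up carrying load they never originally received.
result = simulate_cascade([10, 100, 100, 100], initial_load=50)
print(result)
```

Even in this crude sketch, the previously idle peers end up loaded purely through offloading and retries, which is the essence of a problem “limited to Atlanta” spreading across the country.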

So what lessons can we learn? Cardinael says, first, that there will always be software bugs, so we have to test software better before deploying it. Second, human error means it must be possible to verify a configuration before deploying it. Third, there are loop, oscillation and cascade effects, and none of them is simple to understand. That will become even more critical with 5G, which is both more complex and intended to support more mission-critical services.
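The second lesson, verifying a configuration before deploying it, can be made concrete with a minimal sketch. The route and server fields below are invented for illustration; the idea is simply that a back-up route should be rejected if its target server cannot absorb the load of the link it replaces, which is exactly the check that was missing in step 2.

```python
def validate_backup_route(route, servers):
    """Return a list of human-readable problems; an empty list means
    the back-up route is safe to deploy. (Hypothetical schema.)"""
    problems = []
    target = servers.get(route["target_server"])
    if target is None:
        problems.append(f"unknown target server {route['target_server']!r}")
    elif target["capacity"] < route["expected_load"]:
        problems.append(
            f"server {route['target_server']!r} capacity {target['capacity']} "
            f"< expected load {route['expected_load']}"
        )
    return problems

# The mistaken configuration: a back-up route pointed at a server far
# too small for the call volume it would need to carry.
servers = {"core-1": {"capacity": 10000}, "edge-9": {"capacity": 500}}
bad_route = {"target_server": "edge-9", "expected_load": 8000}
print(validate_backup_route(bad_route, servers))
```

A check like this run at deploy time would have flagged the misconnected route before any traffic hit the undersized server.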

Therefore, we need to do more research on simulating these effects, to understand how to build more resilient networks. 

Cardinael concludes, “We need to find new methods to maintain resilience, we need to launch new research.” As part of this, Orange has established a new research programme, involving electricity and rail companies, that aims to develop tools, methods and simulators to optimise maintenance phases, reduce downtime and improve responses to outages.

Maybe next time, the failure of one leased link won’t lead to network-wide issues, with potentially severe effects.