When Facebook suddenly stopped responding last Monday, the 4th of October, it may have felt like the end of the world to some. Others might have felt relieved; after all, it’s just social media, right? Facebook’s cousins also went dark. With no notifications from Facebook, Messenger, WhatsApp and Instagram, the return to the early 2000s was disturbed only by Twitter, buzzing about the ongoing outage. So, what did happen to the Facebook family that night?
Engineers all around the globe have written thousands of lines of text hypothesising about the cause of the outage. The possible explanations were numerous: everything from bugs to intentional outages to outside interference was theorised. However, it seems the answer was much more mundane than that. As Santosh Janardhan, the VP of Infrastructure at Facebook, explains: “During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally. Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command.”
But why exactly did everything stop working? Basically, all of Facebook’s data centres are connected to one another through a backbone network. One of the jobs of Facebook’s network is to show the rest of the internet the way to its different applications through what is called the Border Gateway Protocol (BGP), a routing protocol that, in essence, tells the networks that make up the internet the best road to take towards a certain IP address. But, because of the mistake mentioned by Mr. Janardhan, Facebook’s route advertisements were withdrawn from BGP, essentially disconnecting all of Facebook’s data centres from the internet. What followed could be described as an unintentional DDoS attack, since apps and users weren’t taking “error” for an answer and kept retrying. The resulting flood of requests affected internet providers all around the globe while Facebook’s engineers were scrambling to fix the issue.
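To make the effect of a withdrawn route more concrete, here is a minimal sketch in Python. It is a toy model, not how BGP or Facebook's network is actually implemented: the `RoutingTable` class, the next-hop names and the prefixes (taken from the address ranges reserved for documentation) are all assumptions made up for illustration. It only shows the core idea that a destination is reachable for as long as some advertised prefix covers it.

```python
import ipaddress


class RoutingTable:
    """Toy routing table: maps advertised prefixes to a next hop (hypothetical model)."""

    def __init__(self):
        self.routes = {}

    def advertise(self, prefix, next_hop):
        # A network announces: "traffic for this prefix can be sent to me".
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix):
        # The announcement is taken back; the prefix disappears from the table.
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        # Pick the most specific advertised prefix that contains the address.
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None  # no route: the destination is unreachable
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = RoutingTable()
# Documentation-only prefixes and invented next-hop names.
table.advertise("203.0.113.0/24", "social-network-edge")
table.advertise("198.51.100.0/24", "other-site-edge")

before = table.lookup("203.0.113.10")    # route exists
table.withdraw("203.0.113.0/24")         # the advertisement is withdrawn
after = table.lookup("203.0.113.10")     # nothing covers the address any more
still_up = table.lookup("198.51.100.7")  # unrelated routes are unaffected

print(before, after, still_up)
```

In this simplified picture the servers behind the withdrawn prefix may be running perfectly well; the rest of the network simply no longer knows a road that leads to them.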
In the end, the outage lasted only about six hours. It could’ve been much worse, and the engineers who fixed it should be commended for their swiftness. Nevertheless, it underlined a “single point of failure” issue that Facebook will no doubt be working on in the future.
All this shows how interdependent our world has become: a “routine maintenance job” in California can cause problems and serious financial loss for restaurants in Delhi and fashion brands in Ireland.