The unimaginable happened on October 4, 2021, at 15:39 UTC, when Facebook, Instagram, and WhatsApp all went down at the same time. While the rest of us twiddled our thumbs and wondered how to reach one another, Facebook’s servers were in a state of emergency.
To ordinary users like us, Facebook appeared to have vanished from the Internet. When users tried to open their favorite social media platform on their smart devices, they received an error message, and the servers were completely unreachable. It was as if someone had pulled the cables out of Facebook’s data centers all at once, cutting them off from the Internet. The services were down around the world for more than seven hours, a very exceptional occurrence for a corporation as large as Facebook.
According to a brief blog post by Facebook and a detailed write-up by Cloudflare, a web infrastructure and website security company, the downtime was caused by a faulty configuration change to Facebook’s backbone routers, which carry traffic between its data centers. The change stopped communication between the data centers, and as a result all of Facebook’s services were disrupted.
Because Facebook had stopped advertising its presence, ISPs and other networks could no longer locate Facebook’s network, making it unreachable. The routes to the IP prefixes that host Facebook’s DNS servers were no longer being announced. That meant that, at the very least, Facebook’s DNS servers could not be reached.
So let us discuss what exactly occurred.
Like every network on the Internet, Facebook relies on advertising its presence to other networks so that users can reach it. The Internet uses the Border Gateway Protocol (BGP) to do this. BGP ties together the Internet, which is essentially a network of networks.
What is BGP?
Border Gateway Protocol (BGP) is the routing protocol that allows Autonomous Systems (AS) on the Internet to exchange routing information. For networks to interact, they need a way to tell one another which addresses they can reach. This is done through peering, and BGP is what makes peering possible. Without BGP, Internet routers would not know where to send traffic, and the Internet as we know it would stop working.
Let us look at how this is done. BGP lets the many Autonomous Systems, each under a single organization’s control, learn how to reach one another. To take part, each organization announces the IP address prefixes it can deliver traffic to, tagged with its Autonomous System Number (ASN). This is called a BGP route announcement. Facebook has its own Autonomous System, and so do other organizations. Together these self-contained systems form the Internet’s backbone.
BGP helps us figure out how to get from one point to another over the Internet. Every computer or device that connects to the Internet does so through an AS, and each AS is assigned an Autonomous System Number, or ASN. Organizations must announce which prefixes can be reached through their AS and where it peers with other Autonomous Systems. This is how the Internet’s resilience is supposed to work. But, as we all know, great power comes with great risk. If a router suddenly announces all the wrong things, or stops announcing the right ones, possibly because automation tools did not validate the route changes, the routes can end up pointing to nothing. At Facebook this caused a cascading failure, and no one could even reach the tools needed to fix it.
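To make the idea of validating route changes concrete, here is a minimal sketch of a pre-flight check in Python. It is purely illustrative and is not Facebook’s tooling: the function name, the 50% threshold, and the prefixes (taken from the ranges reserved for documentation) are all assumptions.

```python
def validate_change(current_prefixes, proposed_prefixes, max_withdrawn_fraction=0.5):
    """Return a list of reasons to reject the change; an empty list means it passes.

    Hypothetical guard: the threshold and the rules are assumptions for illustration.
    """
    current = set(current_prefixes)
    proposed = set(proposed_prefixes)
    problems = []

    # Rule 1: never allow a change that leaves nothing announced.
    if current and not proposed:
        problems.append("change withdraws every announced prefix")

    # Rule 2: flag changes that withdraw a large share of prefixes at once.
    withdrawn = current - proposed
    if current and len(withdrawn) / len(current) > max_withdrawn_fraction:
        problems.append(
            f"change withdraws {len(withdrawn)} of {len(current)} prefixes"
        )
    return problems


# Documentation-only prefixes stand in for real announcements.
current = ["192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"]

print(validate_change(current, current[:2]))  # small change: no problems reported
print(validate_change(current, []))           # withdraw everything: rejected
```

A real network would layer many more checks on top of this, but even a simple guard shows why a change that withdraws everything at once deserves a second look before it ships.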
BGP, in simple words, is essentially a map of the networks on the Internet that helps each network choose the best path to the others. It is like the map that your phone, computer, or web browser relies on to reach Facebook.com when you type the address into the browser.
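The “map” idea can be sketched as a toy routing table in Python. This is a simplification under stated assumptions: real BGP also tracks AS paths, policies, and much more, and the class name, prefixes, and AS numbers below are hypothetical values taken from the ranges reserved for documentation.

```python
import ipaddress


class ToyRoutingTable:
    """A toy view of the routes one network has learned from its BGP peers."""

    def __init__(self):
        # Maps each announced prefix to the AS number that originated it.
        self.routes = {}

    def announce(self, prefix, origin_asn):
        self.routes[ipaddress.ip_network(prefix)] = origin_asn

    def withdraw(self, prefix):
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return (prefix, origin ASN) for the most specific match, or None."""
        ip = ipaddress.ip_address(address)
        matches = [net for net in self.routes if ip in net]
        if not matches:
            return None  # no route: the destination has "vanished"
        best = max(matches, key=lambda net: net.prefixlen)
        return str(best), self.routes[best]


table = ToyRoutingTable()
table.announce("192.0.2.0/24", 64500)
table.announce("192.0.2.128/25", 64501)

before = table.lookup("192.0.2.200")    # the more specific /25 wins
table.withdraw("192.0.2.128/25")
fallback = table.lookup("192.0.2.200")  # falls back to the covering /24
table.withdraw("192.0.2.0/24")
after = table.lookup("192.0.2.200")     # nothing left to match

print(before, fallback, after)
```

Once the last matching prefix is withdrawn, the lookup returns nothing at all, which is roughly what other networks saw when they looked for a path to Facebook.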
Throughout the day, Facebook.com displayed the error message “Sorry, something went wrong, we’re working on it and will repair it as soon as possible.” This message pointed to a failure in the Domain Name System (DNS), another integral part of the Internet. The outage appeared to be related to Facebook’s DNS servers, a problem Facebook has since resolved.
What is a DNS?
The Domain Name System, or DNS, converts human-readable domain names like ‘www.facebook.com’ into machine-readable IP addresses. It lets users navigate to their intended site using web addresses.
Suppose you enter www.facebook.com. DNS servers translate the domain name into an IP address, and traffic to that address is then routed across the Internet along the paths that Facebook advertises through BGP.
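For readers who want to see a name lookup in practice, here is a small Python sketch that asks the operating system’s resolver for the addresses behind a hostname, using the standard library’s `socket.getaddrinfo`. The helper name `resolve` is an assumption, and `localhost` is used only so the example does not depend on any outside network.

```python
import socket


def resolve(hostname):
    """Return the sorted, de-duplicated IP addresses the system resolver
    reports for hostname, or an empty list if the lookup fails."""
    try:
        results = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # Raised when the name cannot be resolved.
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # the address string is the first element of sockaddr.
    return sorted({sockaddr[0] for _family, _type, _proto, _canon, sockaddr in results})


addresses = resolve("localhost")
print(addresses)
```

A name that cannot be resolved yields an empty list from this helper, which is the situation applications found themselves in when they asked for Facebook’s domains during the outage.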
According to security experts, the failure was identified as a Border Gateway Protocol (BGP) withdrawal of the IP address prefixes in which Facebook’s DNS servers were hosted. As a result of the BGP withdrawal, DNS resolvers all over the world could no longer resolve Facebook’s domain names. This failure took the websites and services offline.
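The dependency between the two systems can be sketched in a few lines of Python. This is a toy model, not a real resolver: the nameserver address, the zone contents, and the prefixes are hypothetical values from the ranges reserved for documentation, and the point is only to show that a name cannot be resolved once the route to its authoritative nameserver disappears.

```python
import ipaddress

# Hypothetical data: a documentation prefix stands in for the block that
# hosts the authoritative DNS servers, and the zone holds a single name.
NAMESERVER_IP = "192.0.2.53"
ZONE = {"www.example.com": "198.51.100.10"}


def has_route(announced_prefixes, address):
    """True if any announced prefix covers the address."""
    ip = ipaddress.ip_address(address)
    return any(ip in ipaddress.ip_network(prefix) for prefix in announced_prefixes)


def resolve(name, announced_prefixes):
    """Resolve name, but only if the authoritative nameserver is reachable."""
    if not has_route(announced_prefixes, NAMESERVER_IP):
        return "SERVFAIL"  # the resolver cannot reach the nameserver at all
    return ZONE.get(name, "NXDOMAIN")


announced = {"192.0.2.0/24", "198.51.100.0/24"}
print(resolve("www.example.com", announced))  # nameserver reachable

announced.discard("192.0.2.0/24")             # the BGP withdrawal
print(resolve("www.example.com", announced))  # nameserver unreachable
```

Note that in this model the web server’s own prefix is still announced after the withdrawal. The lookup fails anyway, because nothing can reach the nameserver that knows the answer.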
The outage also cut off Facebook’s internal communications: employees could not send or receive external emails, access the corporate directory, or authenticate to some documents and services. Engineers tried to reach the data centers to fix the problem but were unable to do so.
Consequences of the Facebook outage:
Facebook, WhatsApp, and Instagram did come back online, but the outage had a significant impact on Facebook and its customers. Downdetector, a site that tracks network outages, received over 10 million problem reports, the most ever for a single event.
It had a massive impact on the company’s finances as well. The company’s stock dropped about 5% on the day of the outage, while Facebook CEO Mark Zuckerberg’s fortune fell by more than $6 billion. According to reports, Facebook lost at least $60 million in ad revenue. The outage also dented customers’ trust.
Final words:
BGP routing for the impacted prefixes was restored at about 20:50 UTC, and DNS services began to be available again around 21:05 UTC. However, application-layer services for Facebook, Instagram, and WhatsApp came back slowly, taking more than an hour longer, with service generally restored for users by 22:45 UTC.
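As a quick check on the timeline, the short Python sketch below works out how long each recovery milestone came after the 15:39 UTC start mentioned at the top of this article. The times are the approximate ones quoted here, and the helper names are assumptions.

```python
from datetime import datetime, timezone


def utc(hour, minute):
    """A time on October 4, 2021, in UTC."""
    return datetime(2021, 10, 4, hour, minute, tzinfo=timezone.utc)


OUTAGE_START = utc(15, 39)


def elapsed_hm(when):
    """Whole hours and minutes between the start of the outage and when."""
    total_minutes = int((when - OUTAGE_START).total_seconds()) // 60
    return divmod(total_minutes, 60)


milestones = {
    "BGP routes restored": utc(20, 50),
    "DNS available again": utc(21, 5),
    "services generally restored": utc(22, 45),
}

for label, when in milestones.items():
    hours, minutes = elapsed_hm(when)
    print(f"{label}: {hours} h {minutes} min after the outage began")
```

By this arithmetic the routes were back a little over five hours in, while users waited just over seven hours for service to be generally restored.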
Moreover, Facebook, on October 5, 2021, published, “Our services are now back online, and we’re actively working to return them to regular operations fully. We want to make clear that there was no malicious activity behind this outage — its root cause was a faulty configuration change on our end.”