Facebook explains the main reason behind its global outage

The massive outage that crippled Facebook's platform and its associated services (WhatsApp, Instagram, Messenger and Oculus), along with its business platform and the company's intranet, began with routine maintenance.

According to Santosh Janardhan, Vice President of Infrastructure, a maintenance job accidentally shut down the backbone network connecting all of the company's data centers around the world.

"The reason for this outage is the system that manages the capacity of our global backbone," Janardhan said. The backbone is a network the company built to connect all of its computing facilities. It consists of tens of thousands of kilometers of fiber-optic cable that span the globe and link all of its data centers.

These data centers come in many forms. Some are huge buildings housing millions of machines that store data and run the heavy computing loads that keep the platform running, while others are smaller facilities that tie the platform together.


When you open one of the company's apps and load your feed or messages, the app's data request travels from your device to the nearest facility, which then communicates directly with a larger data center over the company's backbone network. There, the information the app needs is retrieved and processed, then sent back over the network to your phone.

Data traffic between all these IT facilities is handled by routers, which determine where all incoming and outgoing data is sent.

To maintain this infrastructure, Facebook engineers often need to take part of the backbone offline. That routine work was the source of the outage.

During one such routine maintenance job, a command was issued to assess the availability of global backbone capacity. It inadvertently cut all connections in the company's backbone network, effectively disconnecting Facebook's data centers worldwide.


The company's system for auditing such commands is designed to prevent errors like this. However, a bug in the audit tool prevented it from stopping the command. The change completely severed the connections between the company's data centers and the Internet, and that total loss of connectivity led to a second problem involving DNS and BGP.
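To illustrate the kind of check an audit step like this might perform, here is a minimal Python sketch. The function name `audit_command`, the link names, and the rule itself (reject any command that would leave no backbone link up) are assumptions made for illustration. The article does not describe how the actual tool works.

```python
from typing import Set


def audit_command(links_up: Set[str], links_to_disable: Set[str]) -> bool:
    """Return True if the command may run under a hypothetical safety rule.

    Assumed rule: a maintenance command must leave at least one
    backbone link in service.
    """
    remaining = links_up - links_to_disable
    return len(remaining) > 0


# Hypothetical link names, used only for this example.
backbone = {"link-a", "link-b", "link-c"}

print(audit_command(backbone, {"link-a"}))   # partial maintenance
print(audit_command(backbone, backbone))     # would disable every link
```

Under this assumed rule, the first call is allowed and the second is rejected. In the outage described above, the equivalent safeguard did not stop the command.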

The backbone failure was serious in itself, but the reason users could not reach Facebook was that the DNS and BGP routing information pointing to its servers suddenly disappeared.

According to Janardhan, this second problem arose because the company's DNS servers noticed they had been disconnected from the backbone. In response, they stopped advertising the BGP routing information that lets every other computer on the Internet find them. The DNS servers were still running, but they were unreachable.
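The behavior described above can be modeled with a short Python sketch. The class `EdgeDnsNode`, its fields, and the node names are hypothetical and only stand in for the real systems. The sketch assumes a simple rule: a DNS node announces its route only while it can reach the backbone.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EdgeDnsNode:
    """Hypothetical model of a DNS server; not the company's actual code."""
    name: str
    running: bool = True       # the DNS process itself
    advertising: bool = True   # whether its BGP route is announced

    def health_check(self, backbone_reachable: bool) -> None:
        # Assumed rule: announce the route only while the backbone is reachable.
        self.advertising = backbone_reachable

    def reachable_from_internet(self) -> bool:
        # With no announced route, outside computers cannot find the node,
        # even though the DNS process is still running.
        return self.running and self.advertising


def reachable_nodes(nodes: List[EdgeDnsNode], backbone_up: bool) -> List[str]:
    """Run the health check on every node and list those still reachable."""
    for node in nodes:
        node.health_check(backbone_up)
    return [n.name for n in nodes if n.reachable_from_internet()]


nodes = [EdgeDnsNode("dns-1"), EdgeDnsNode("dns-2")]
print(reachable_nodes(nodes, backbone_up=True))
print(reachable_nodes(nodes, backbone_up=False))
```

With the backbone down, every node withdraws its route at the same time, so none is reachable even though each still reports `running` as true.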

The loss of network connectivity and of DNS prevented engineers from reaching the servers remotely to fix the problem. It also knocked out most of the tools they normally use for repair and communication.

The post noted that engineers faced additional hurdles because of the physical and system security protecting this critical equipment. Once the secure access protocols had been activated, they were able to restore the backbone and then bring services back gradually as the load increased.

This is one of the reasons some users took longer to regain access. Turning every service back on at once would have caused a sudden surge in power and computing demand, which could have led to further failures.
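A staged restore of this kind can be sketched in a few lines of Python. The function `ramp_schedule`, the use of equal-sized stages, and the numbers in the example are assumptions for illustration. The article does not say how many stages were used or how large they were.

```python
from typing import List


def ramp_schedule(total_capacity: int, steps: int) -> List[int]:
    """Return the cumulative capacity enabled after each stage.

    Hypothetical scheme: capacity comes back in equal-sized stages
    rather than all at once.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [total_capacity * i // steps for i in range(1, steps + 1)]


# Example: bring 100 units of capacity back in four stages.
print(ramp_schedule(100, 4))  # [25, 50, 75, 100]
```

Each stage adds only part of the load, so operators can confirm that systems are stable before enabling the next one.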
