If supposed “networking issues” in some of the most advanced data farms in the world could bring down this network …
This week’s massive outage of Facebook, WhatsApp, Messenger, Instagram and Oculus lasted almost six hours. It resulted from “a server configuration problem”: the backbone connection between Facebook’s data centers shut down during routine maintenance, causing the Domain Name System (DNS) servers to go offline.
What is server configuration, and how vulnerable are platforms to such glitches? If heavily funded online services that cater to millions of users around the clock can go down for hours, what does this say about human error, unpredictable vulnerabilities and the advanced cybersecurity practices and equipment that lesser organizations deploy?
Two cybersecurity experts offered their views to CybersecAsia:
Only Facebook knows the real answers: “Facebook is maintaining millions of servers to provide different offerings to their users. Part of this maintenance is also changing certain server settings that define how the server and the services on it are working.
In this outage, the problem was caused by a change to the Domain Name System settings. The Domain Name System, or DNS for short, is a hierarchical and decentralized naming system that makes it possible for computers on the internet to find each other. If this setting is wrong, then your servers will no longer be reachable, as other computers will not be able to find them.
In simpler terms, think about a phone number. If you have a phone number, other people can call you directly and easily. However, if you hand out the number with just one digit wrong, others will not be able to contact you. This is supposedly what happened to Facebook. Due to a misconfiguration of DNS settings, the Facebook servers were no longer accessible and therefore all services went offline for six hours.
Note that changes of that magnitude are not done by hand. IT administrators use scripts and specialized programs to execute this task on automated machines. However, the risk of running into glitches is always present. As with any programming language, scripting languages can also contain bugs. To avoid glitches on a large scale, the best way is to proceed step by step, rolling out changes in a small, controlled environment and containing any possible threat in a sustainable manner. Why this was not possible here, and why this happened, is a question only Facebook can answer.
However, changes of any kind should be done in smaller steps until they are confirmed to be working. On the other hand, we also see configuration mistakes that have the opposite effect: they do not bring the services down, but keep them up and running without any security implemented. For example, there have been many reports of misconfigured AWS S3 storage buckets that caused data leakage.” — Boris Cipot, Senior Security Engineer, Synopsys Software Integrity Group
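The step-by-step approach Cipot describes can be illustrated with a minimal Python sketch. It is not Facebook's tooling: the function name `staged_rollout`, its callbacks and the stage sizes are all hypothetical, chosen only to show a change reaching a small group of servers first and being undone if a health check fails.

```python
from typing import Callable, Iterable, List, Sequence


def staged_rollout(
    servers: Sequence[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    stage_fractions: Iterable[float] = (0.01, 0.1, 0.5, 1.0),
) -> List[str]:
    """Apply a change in growing stages; undo everything if a check fails.

    Returns the servers that ended up with the change applied
    (an empty list if the rollout was rolled back).
    All names and stage sizes here are illustrative assumptions.
    """
    changed: List[str] = []
    for fraction in stage_fractions:
        # Number of servers that should carry the change after this stage.
        target = max(1, int(len(servers) * fraction))
        for server in servers[len(changed):target]:
            apply_change(server)
            changed.append(server)
        # Confirm the stage works before widening the rollout.
        if not all(is_healthy(s) for s in changed):
            for server in reversed(changed):
                rollback(server)
            return []
    return changed
```

With ten servers and the default stages, the change reaches one server, then five, then all ten, and a single failed health check at any stage reverts every server touched so far.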
They were not down, just inaccessible: “While it might look like a colossal failure in all those services and apps, the reason is probably a DNS service they all use to route their pages and service to our devices. So, what is DNS? Simply, it is the internet protocol to convert the words in a URL we use, such as ‘Facebook.com’, to a language that computers can process. DNS servers do the conversion and route us to the services and applications we want to access. When this service fails, the services look like they are down, but actually they are just inaccessible.” — Lotem Finkelstein, Head of Cyber Intelligence, Check Point Software Technologies
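Finkelstein's distinction between a service that is down and one that simply cannot be found can be sketched in Python with the standard library's `socket.getaddrinfo`, which asks the system resolver to convert a hostname into IP addresses. The helper names `resolve` and `diagnose` and the message wording are assumptions made for illustration only.

```python
import socket
from typing import Callable, List


def resolve(hostname: str) -> List[str]:
    """Ask the system resolver for the IP addresses behind a hostname."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each entry ends with a socket address whose first field is the IP.
    return sorted({info[4][0] for info in infos})


def diagnose(
    hostname: str,
    resolver: Callable[[str], List[str]] = resolve,
) -> str:
    """Report whether a name can be converted to an address at all."""
    try:
        addresses = resolver(hostname)
    except socket.gaierror:
        # The lookup itself failed, so nothing is known about the servers:
        # they may be running but nobody can find them.
        return f"{hostname}: name lookup failed, servers cannot be found"
    return f"{hostname}: resolves to {', '.join(addresses)}"
```

Calling `diagnose("facebook.com")` during a DNS failure would take the lookup-failed branch even if every server behind the name were still running, which is the situation the quote describes.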
This could be a developing story, so stay tuned for more expert commentaries.