The day everything crashed: CrowdStrike and the future of cybersecurity
What actually happened and what are the long-term repercussions for the IT industry?
It has been hard to miss CrowdStrike in the news lately. From downed check-in counters at airports to hospitals admitting patients manually, and even frozen car park gantry systems in Singapore, the impact was widespread.
As hundreds chime in and competitors issue statements that further cloud the issue, here's a deeper look at this black swan event from a technical and cybersecurity perspective. What actually happened, could this have been prevented, and what are the long-term repercussions for the IT industry?
What happened: One file, outsized repercussions
I first wrote about the CrowdStrike outage in a post yesterday. I noted how the issue was triggered by a faulty file in the Falcon Sensor component of its highly regarded Falcon platform.
The problematic file was pushed out to customers globally as part of an automated update, crashing millions of systems. Restarting an affected computer simply crashes it again, trapping it in a continuous restart loop.
Here are some details about Falcon Sensor and the fault:
- Falcon Sensor does fairly basic data collection, packages it up, and uploads it to the CrowdStrike cloud (~5-8 MB/day). The passive, data-collection nature of this module means crashes are unlikely to be caused by the mistaken identification of innocuous files as malware.
- Despite its name, CrowdStrike has confirmed that the *.sys file is not kernel-mode software but is used to deliver channel updates. It is not a malware definition file; rather, it contains instructions for monitoring named pipes, a Windows mechanism for interprocess communication, for signs of abuse (a small illustrative sketch follows this list).
- Sensational headlines aside, it doesn't appear that every Windows PC with Falcon Sensor installed is affected. There is anecdotal evidence of users who went to the office expecting a quiet day of tidying their desks, only to find their PCs working just fine... and having to get back to work.
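To make "monitoring named pipes" slightly more concrete, here is a minimal Python sketch that simply enumerates the named pipes open on a Windows machine via the special \\.\pipe\ namespace. It illustrates the kind of thing an endpoint sensor watches; it is not CrowdStrike's implementation, and the listing trick may behave differently across Python versions.

```python
# Hypothetical illustration only: enumerate the named pipes open on a
# Windows host. Endpoint sensors watch pipes like these because malware
# often abuses them for interprocess communication.
import os
import sys

PIPE_DIR = "\\\\.\\pipe\\"  # the special \\.\pipe\ namespace

def list_named_pipes() -> list[str]:
    """Return the names of currently open named pipes (Windows only)."""
    return os.listdir(PIPE_DIR)

if __name__ == "__main__":
    if sys.platform != "win32":
        sys.exit("Named pipes of this kind only exist on Windows.")
    for pipe in sorted(list_named_pipes()):
        print(pipe)
```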
We can draw some conclusions from these details. For a start, CrowdStrike likely did the usual tests before releasing the update to its customers. There is no sign of overt negligence; everything appears to have been done according to established procedures, with no indication that the update would cause computers to crash.
The fix is relatively simple and entails replacing the problem file. It is laborious, however:
- Sit down at the PC and boot it into Safe Mode (pressing F4 at the Startup Settings screen).
- Key in the PC's unique 48-digit BitLocker recovery key*.
- Log in and replace offending file.
- Reboot and verify that it works.
- Repeat for each and every PC.
There are solutions for rebooting and managing PCs remotely, but they are pricey and must be set up ahead of time.
So almost everyone is fixing this manually; a rough sketch of the file swap follows the footnote below.
*BitLocker is a Windows disk encryption feature used by businesses to protect onboard data. The decryption key is unique for every PC.
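For illustration, here is a rough Python sketch of the kind of one-off cleanup an administrator might script once a machine is in Safe Mode. The directory and the C-00000291*.sys pattern follow the widely circulated workaround and are assumptions here, not verified details; always defer to CrowdStrike's official remediation guidance.

```python
# Hypothetical cleanup sketch, run from Safe Mode on an affected PC.
# The directory and file pattern follow the widely circulated workaround
# (delete channel files matching C-00000291*.sys) and are assumptions here;
# always defer to CrowdStrike's official remediation guidance.
from pathlib import Path

CROWDSTRIKE_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
BAD_CHANNEL_FILE_PATTERN = "C-00000291*.sys"

def remove_bad_channel_files() -> int:
    """Delete the offending channel file(s) and return how many were removed."""
    removed = 0
    for f in CROWDSTRIKE_DIR.glob(BAD_CHANNEL_FILE_PATTERN):
        print(f"Removing {f}")
        f.unlink()
        removed += 1
    return removed

if __name__ == "__main__":
    count = remove_bad_channel_files()
    print(f"{count} file(s) removed. Reboot and verify the PC comes up cleanly.")
```

Even with something like this on a USB stick, someone still has to get each machine into Safe Mode and past BitLocker by hand, which is why the fix scaled so badly.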
Could this have been prevented?
So how could a simple channel file have crashed everything? While CrowdStrike has yet to offer a detailed explanation for this unfortunate incident, technical experts have rolled up their sleeves and fired up their debugging tools to dissect the Falcon Sensor software.
Below is one such analysis by Zach Vorhies, a former senior software engineer at Google. Vorhies attributes the crash to poor coding practices. Essentially, the program tried to read an invalid memory region, resulting in the process being killed by Windows.
Similar analyses by others have also pointed to poorly written software as the culprit. In essence, weak application security practices led to the software accidentally accessing memory it shouldn't have, likely after encountering unexpected input from the file. The result was Windows stepping in to terminate the software, as it was designed to do.
But because the CrowdStrike software likely runs as kernel-level code, tightly integrated with the underlying operating system, Windows ended up committing digital fratricide on itself, triggering a BSOD, or blue screen of death. Upon rebooting, Falcon Sensor loads early as a privileged process, reads the accursed file, attempts to access memory it shouldn't, and is promptly killed again. And so on.
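To make the failure mode concrete, here is a deliberately broken Python snippet (using ctypes purely as a stand-in, not anything resembling CrowdStrike's code) that reads from an address it does not own. In user space the operating system blocks the access and the process dies, or the runtime surfaces an access-violation error, depending on the platform; the same class of fault inside kernel-mode driver code is what produces a bug check, better known as the blue screen of death.

```python
# WARNING: running this deliberately crashes or aborts the process.
# It illustrates the class of bug described above: reading memory the
# program has no right to touch. User-mode code only takes itself down;
# kernel-mode code doing the same thing takes Windows down with it (BSOD).
import ctypes

def read_where_we_should_not() -> bytes:
    # Address 0 is an unmapped page; the OS flags the read as an access
    # violation (Windows) or a segmentation fault (Linux/macOS).
    return ctypes.string_at(0, 1)

if __name__ == "__main__":
    read_where_we_should_not()
```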
All of this offers the likeliest explanation of why the problem affected Windows desktop and server systems but not other operating systems such as macOS or Linux. It is also at this point that we can probably discount malicious insiders or an external cyberattack as the root cause of the incident.
Of course, this does lead to an uncomfortable question: What else could have been poorly implemented?
What now? Looking ahead
So what now? Will customers drop CrowdStrike and shift en masse to competing products? I personally consider this scenario extremely unlikely, though customers who are about to sign on the dotted line might ask harder questions now.
Existing customers might also question the wisdom of pushing out updates globally instead of doing a staggered release – even though most if not all cybersecurity vendors utilise the same strategy for non-executable file updates – and demand the ability to block or install them later.
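For readers wondering what a staggered release looks like in practice, here is a hypothetical sketch of a ring-based rollout: update a small canary slice of the fleet first, watch the telemetry, and only then widen the blast radius. The ring sizes, soak time, and health check are invented for illustration and do not describe any particular vendor's pipeline.

```python
# Hypothetical sketch of a ring-based (staggered) update rollout.
# Ring sizes, soak time, and the health check are all invented for
# illustration; real deployment pipelines are considerably more involved.
import time

RINGS = {
    "canary":   0.01,  # 1% of the fleet goes first
    "early":    0.10,  # then 10%
    "broad":    0.50,  # then half
    "everyone": 1.00,  # finally the full fleet
}
SOAK_TIME_SECONDS = 5  # seconds here for the sketch; hours or days in practice

def push_update(ring: str, fraction: float) -> None:
    """Placeholder: deliver the update to this slice of the fleet."""
    print(f"Pushing update to ring '{ring}' ({fraction:.0%} of hosts)")

def check_fleet_health(ring: str) -> bool:
    """Placeholder: in reality, query crash/telemetry data for this ring."""
    return True  # e.g. no spike in BSOD or offline-host reports

def staged_rollout() -> None:
    for ring, fraction in RINGS.items():
        push_update(ring, fraction)
        time.sleep(SOAK_TIME_SECONDS)  # let the ring soak before widening
        if not check_fleet_health(ring):
            print(f"Halting rollout: ring '{ring}' looks unhealthy")
            return
    print("Rollout complete")

if __name__ == "__main__":
    staged_rollout()
```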
Ultimately, CrowdStrike offers a comprehensive suite of products that competes successfully across multiple niches. Replacing it with equivalent solutions won't be easy, or even possible in some cases, especially since cybersecurity products are notoriously difficult to switch away from.
What the incident revealed is the inherent danger of relying on the same cybersecurity platform for everything. After all, the debacle barely caused a ripple in China, where few use CrowdStrike (or Windows, for that matter), or among firms that are apparently still running Windows 95.
According to a long-time cybersecurity executive who leads a team specialising in implementing cybersecurity strategies, such an error could have happened with any EDR (Endpoint Detection and Response) vendor.
Properly addressing this calls for managing concentration risk, for instance by using different cybersecurity solutions so that a single flaw or vulnerability cannot bring down the entire organisation. There is a cost here, and the onus is on cybersecurity leaders to find the right balance between resilience, operational overhead, and budget.
For now, I would expect governments to start exercising greater regulatory oversight over the concentration of such risks, and forward-looking businesses to make more conscious decisions about the solutions they deploy. There is no model response, only one that is informed and tailored to the specific needs and risk profile of each organisation.
Life goes on, though the lessons learned from this incident will undoubtedly shape the future of cybersecurity strategies.