My CrowdStrike Incident Report
What could have been done to prevent the incident on 19th July (Part 1 of 3)
Hey, fellow Leader 🚀,
I am Artur and welcome to my weekly newsletter. I am focusing on topics like Project Management, Innovation, Leadership, and a bit of Entrepreneurship. I am always open to suggestions for new topics. Feel free to reach me on Substack and share my newsletter if it helps you in any way.
This article is the first of three covering the CrowdStrike incident. On Friday, 19th July, a simple update to CrowdStrike’s Falcon software, which is designed to detect computer vulnerabilities, crashed 8.5 million systems worldwide: desktop computers, servers, and even simple displays.
The Falcon Sensor is responsible for detecting malicious behavior, from both the system and the end-user, and requires a special level of security access. This means the Falcon Sensor is executed at the Kernel level (Ring 0) of the Windows operating system. The BSOD (Blue Screen Of Death) triggered by the update is normal behavior for Windows: when it detects a catastrophic event in the OS’ core, it shuts down the system entirely to prevent further damage.
One striking characteristic of this incident is that the root cause came from a configuration file allegedly not containing any executable code. The update was, in its very nature, a simple and innocuous update to the Falcon sensor. However, it was applied at Ring 0 of the operating system. Had the update been made at Ring 3 (the application level), the OS would have detected the faulty component and shut down that application alone, preventing a cascade effect on the rest of the system. Since it was done at the Kernel level, however, there was no safety net, and when it crashed, the OS simply protected itself and forced the computer to stop.
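To picture the difference in isolation, here is a loose user-space analogy in Python, not actual Windows internals: a fault inside a separate application-level process is contained by the OS, while everything around it keeps running. At Ring 0 there is no equivalent boundary.

```python
# Loose user-space analogy only (not Windows kernel code): a fault in a
# separate application-level process is contained by the OS, and the parent
# process (standing in for "the rest of the system") keeps running.
import subprocess
import sys

# Run a deliberately crashing "application" as an isolated, Ring 3-style process.
result = subprocess.run(
    [sys.executable, "-c", "raise RuntimeError('faulty component')"],
    capture_output=True,
    text=True,
)

print("application exit code:", result.returncode)  # non-zero: the app died alone
print("rest of the system still running:", True)    # the crash did not propagate
```

A fault in a kernel-level component has no such parent to catch it, which is exactly why Windows responds with a BSOD.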
This article is not intended to criticize CrowdStrike’s strategy and its way of operating, but rather to take an outside look at what happened and comment on the measures put in place and the processes linked to the incident.
The obvious remarks
To understand the obvious feedback that went all over the internet, we need to go through the deployment strategy used by CrowdStrike. There are basically two modes: Sensor Content and Rapid Response Content. Sensor Content delivers new code for the Falcon Sensor itself, while Rapid Response Content delivers the heuristics that enable Falcon to detect new suspect behaviors; the latter is classified as “Template Instances”, or configuration files, which avoid the need to deliver new code into Production.
Depending on which type of content is being delivered, one process is more lightweight than the other. The details can be seen in an official statement on the CrowdStrike website.
In other words, Template Types represent a sensor capability that enables new telemetry and detection, and their runtime behavior is configured dynamically by the Template Instance (i.e., Rapid Response Content). (Source: CrowdStrike)
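To make that relationship concrete, here is a hypothetical sketch in Python. The detector, field names, and format are invented for illustration and do not reflect CrowdStrike’s actual files; the point is only that the capability ships as code while its behavior ships as configuration.

```python
# Hypothetical illustration only: the detector (the "Template Type") ships as
# sensor code, while the patterns it hunts for (the "Template Instance") ship
# as Rapid Response Content. Names and format are invented, not CrowdStrike's.

def named_pipe_detector(event: dict, instance: dict) -> bool:
    """Capability delivered as Sensor Content: knows *how* to inspect events."""
    return any(pattern in event["pipe_name"] for pattern in instance["patterns"])

# Behavior delivered as Rapid Response Content: says *what* to look for.
template_instance = {
    "template_type": "named_pipe_detector",
    "patterns": [r"\\.\pipe\suspicious_", r"\\.\pipe\exfil_"],
}

event = {"pipe_name": r"\\.\pipe\suspicious_c2"}
print(named_pipe_detector(event, template_instance))  # True: detection fires
```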
Nevertheless, the deployment of Rapid Response Content does include a degree of testing, with stress tests and other types of automated tests. However, the Content Validator contained a bug, which in practice means the testing procedure run before a new configuration is delivered to Production was itself faulty.
In defense of CrowdStrike, I can only imagine how complex and hard it is to maintain such a technical architecture and to test this kind of configuration. Note that some Software Engineers argue those “configuration” files are, in practice, runnable code that is executed at the Kernel level.
Lack of Testing: The first obvious remark is the lack of testing. Something as catastrophic as what happened on 19th July should have been caught by some sort of testing. This was not a hidden bug triggered by a series of end-user actions; it was the deployment of a faulty file that immediately triggered a system shutdown. As soon as the file was installed, the OS was rendered inoperational. Unsurprisingly, several opinions online contested how runnable code could be delivered through such a lightweight pipeline. Reality has shown the faulty file was not tested at even the most basic level; the entire situation should have been detected in the first place.
CrowdStrike argues the crash was not detected due to a bug in their Content Validator. The faulty configuration was not caught, as evident as the problem now seems, because the mechanism created to test it in the first place had a bug. This begs the question: how many other “configuration files” were deployed on critical systems without having been properly tested?
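As a thought experiment, here is a minimal sketch, my own simplification and not CrowdStrike’s actual validator, of how a bug in the validation step itself lets a malformed file sail through to Production:

```python
# Minimal sketch of a content validator with a bug of its own: a malformed
# configuration should be rejected, but the check only logs a warning, so the
# file is reported as valid and shipped. Field names are illustrative.

def validate_content(config: dict) -> bool:
    required_fields = ["template_type", "patterns"]
    for field in required_fields:
        if field not in config:
            # BUG: this should be `return False`; instead it only warns.
            print(f"warning: missing field '{field}'")
    return True  # the faulty file passes validation anyway

faulty_config = {"template_type": "named_pipe_detector"}  # 'patterns' missing
if validate_content(faulty_config):
    print("deploying to every host...")  # ships worldwide despite being malformed
```

Once the gate itself is broken, every file that passes through it inherits the same false sense of safety, which is why the question above matters.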
This situation raises another perspective: CrowdStrike argues the Template Instances (simplified in this article as “configuration files”) don’t contain any sort of runnable code. However, searching the internet, I found a series of opinions that argue and attempt to prove otherwise. No matter who is right, the “configuration files” are run at the kernel level regardless. This fact will be important in the next obvious remark.
Simple update at Ring 0: There seems to be a clear difference between the two delivery pipelines (Sensor Content and Rapid Response Content), where one of them is designed to deliver responses to new types of threats fast. This makes sense, since the world of Cyber Security is very volatile and rapid, tactical measures need to be delivered very quickly. However, this doesn’t mean it should have a process that is significantly more lightweight than the delivery of new Software versions.
Since the Falcon sensor operates in the OS kernel, there is no safety net. In a typical software architecture, if an application is consuming large amounts of resources or behaving erratically, the OS can shut down that application without impacting the rest of the system. However, the Falcon Sensor operates in Ring 0, sharing almost the same level of responsibility and security as the OS itself. Any deployment at this level needs to be tested in detail, and therefore any Continuous Deployment mechanism in place should enforce the same level of requirements.
Having a faulty mechanism in a “rapid mode” deployment has the same level of impact as having one in the normal software delivery pipeline, for the simple reason that both mechanisms deliver content to the same level inside the OS.
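To illustrate the argument, here is a sketch of what a shared release gate could look like; the checks, names, and package fields are my own invention, not CrowdStrike’s pipeline:

```python
# Sketch only: both delivery modes end up at Ring 0, so both can be held to the
# same release gate. Check names and package fields are invented for the example.

def release_gate(package: dict, checks) -> bool:
    """Refuse to ship a package unless every check passes."""
    return all(check(package) for check in checks)

kernel_level_checks = [
    lambda p: bool(p.get("config")),             # static validation of the content
    lambda p: p.get("boot_test_passed", False),  # a reference host survives install
    lambda p: p.get("soak_hours", 0) >= 1,       # no crashes during a soak period
]

sensor_content = {"config": {"patterns": ["..."]}, "boot_test_passed": True, "soak_hours": 2}
rapid_response = {"config": {"patterns": ["..."]}, "boot_test_passed": False, "soak_hours": 0}

print(release_gate(sensor_content, kernel_level_checks))  # True: cleared to ship
print(release_gate(rapid_response, kernel_level_checks))  # False: held by the same bar
```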
Big Bang Deploy: I was surprised to see there was no mechanism for staggered (or canary) deployment at this level of risk. In practice, “staggered deployment” means delivering the software package to a small sample of users first and increasing the coverage as time goes by. With this approach, if there is some malfunction, the impact is contained to a small percentage of users. In this particular case, had the “configuration file” been delivered using a staggered deployment, the problem would have been seen immediately in the first batch.
Let’s imagine the deployment was set up at 25% (which, in my opinion, is already too high a percentage for this level of risk, though it is the standard when using Google services). In a canary deployment, instead of 8.5 million impacted systems, it would be roughly 2.1 million. Now it’s understandable why I argue 25% is too high. Starting at 5% and progressively increasing from there would be a more conservative approach.
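Here is a rough sketch of what such a staggered rollout could look like; the wave sizes and the health check are my own assumptions, not an actual CrowdStrike or Google policy:

```python
# Rough sketch of a staggered (canary) rollout: deploy wave by wave and halt as
# soon as a wave looks unhealthy. Wave sizes and the health check are invented.

TOTAL_HOSTS = 8_500_000
WAVES = [0.05, 0.25, 0.50, 1.00]  # cumulative share of the fleet per wave

def rollout(total_hosts: int, waves, wave_is_healthy) -> int:
    """Return how many hosts received the update before the rollout halted."""
    deployed = 0
    for share in waves:
        deployed = int(total_hosts * share)
        if not wave_is_healthy(deployed):
            return deployed  # halt here; the blast radius is capped at this wave
    return deployed

# A file that crashes every machine fails the very first health check:
impacted = rollout(TOTAL_HOSTS, WAVES, wave_is_healthy=lambda hosts: False)
print(impacted)  # 425000 hosts, instead of 8.5 million in a big-bang deploy
# With a 25% first wave, the same halt would already cover ~2,125,000 hosts.
```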
That’s it. If you found this post useful, please share it with friends or colleagues who might be interested in this topic. If you would like to see a different angle, suggest it in the comments or send me a message on Substack.
Cheers,
Artur