My Analysis of CrowdStrike's Action Plan
What could have been done to prevent the incident on 19th July (Part 2 of 3)
Hey, fellow Leader 🚀,
I am Artur, and welcome to my weekly newsletter. I focus on topics like Project Management, Innovation, Leadership, and a bit of Entrepreneurship. I am always open to suggestions for new topics. Feel free to reach out to me on Substack, and share my newsletter if it helps you in any way.
This article is a continuation of the incident report related to CrowdStrike; the first article is available below. The goal of today’s article is to analyze CrowdStrike’s measures and suggest some other ideas that could be put in place to mitigate future incidents.
The countermeasures
One important note: CrowdStrike is sharing the details of the incident, its root cause analysis, and its strategy for the future, which makes the assessment in this article a lot easier. In terms of clarity and transparency, CrowdStrike has provided valuable information.
Software Resiliency and Testing: The company has shared this action plan on its website:
Improve Rapid Response Content testing by using testing types such as:
Local developer testing
Content update and rollback testing
Stress testing, fuzzing, and fault injection
Stability testing
Content interface testing
Add additional validation checks to the Content Validator for Rapid Response Content. A new check is in process to guard against this type of problematic content from being deployed in the future.
Enhance existing error handling in the Content Interpreter.
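To make the validation point a bit more concrete, here is a minimal sketch of the kind of pre-deployment check a content validator could run. Everything in it is an assumption for illustration: the JSON file format, the expected field count, and the `validate_content` name are hypothetical, not CrowdStrike’s actual implementation.

```python
# Hypothetical pre-deployment check for a Rapid Response-style content file.
# File format, field count, and function names are illustrative assumptions.
import json

EXPECTED_FIELD_COUNT = 21  # assumed schema size for a content template

def validate_content(path: str) -> list[str]:
    """Return a list of problems found in a content file; empty means OK."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            records = json.load(f)
    except (OSError, json.JSONDecodeError) as exc:
        return [f"file unreadable or malformed: {exc}"]

    problems = []
    for i, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append(f"record {i}: not a structured record")
            continue
        fields = record.get("fields", [])
        # Guard against the class of bug where the interpreter expects a
        # fixed number of inputs and reads past the end otherwise.
        if len(fields) != EXPECTED_FIELD_COUNT:
            problems.append(
                f"record {i}: expected {EXPECTED_FIELD_COUNT} fields, got {len(fields)}"
            )
        if any(f is None for f in fields):
            problems.append(f"record {i}: contains null field values")
    return problems

if __name__ == "__main__":
    issues = validate_content("channel_update.json")
    if issues:
        raise SystemExit("Blocking deployment:\n" + "\n".join(issues))
    print("Content passed validation; safe to hand off to staged rollout.")
```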
For someone who has managed software development teams for more than a decade, this is what we have all said after a crisis, when the software crashed in our laps: “We will improve the testing, add more testing, and test it more”. I understand the CrowdStrike list is a PR response to what happened; we have all done it at least once. In practice, however, fixing the underlying issue requires more.

The real question is how the testing will be improved, because the mechanism should already have had testing in place. My goal with this article is to comment on and provide suggestions for this particular incident, not to fuel opinions without underlying data. However, when one of the communicated improvement points is “Local developer testing”, the implicit message is that the change which crashed 8.5 million systems was not tested by the developer. That might have been true, but when a major incident like this happens, the problem is almost certainly in the process. I would argue the “the developer didn’t test properly” excuse should not stick on this one.
Nevertheless, having the testing mechanism improved is huge, since the rapid deployment delivers changes into Ring 0 on Windows systems. There is no truly lightweight deployment mode, because every such change incurs the same level of risk.
However, with more controls and mechanisms around deployment and testing, CrowdStrike will lose some tactical advantage when delivering cybersecurity countermeasures. One of the challenges to tackle during the post-mortem of this type of incident is reinforcing the mechanisms without significantly impacting lead time.
Note that lead time may be one of the most valuable requirements for products of this level of importance, so automating the testing process is paramount. In some online forums there were discussions about investing in QA personnel to avoid this kind of incident. The issue with manual testing is that it dramatically lengthens lead time. I am not arguing for removing the human factor altogether; the solution, in my opinion, is to keep investing in and improving the automated testing process. Going through the system and designing fast, lean, and complete test coverage is the true challenge here.
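As an illustration of what such an automated gate could look like, here is a minimal fuzzing and fault-injection sketch that could run on every build. The `parse_content` function is a hypothetical stand-in for a content interpreter, not CrowdStrike’s real one; the point is that thousands of corrupted inputs can be exercised per commit without adding a manual step to the release path.

```python
# Minimal sketch of an automated fuzz/fault-injection gate for CI.
# parse_content() and the content format are hypothetical stand-ins.
import random

def parse_content(blob: bytes) -> dict:
    """Stand-in for a real content interpreter (assumption for the sketch)."""
    if not blob.startswith(b"CFG1"):
        raise ValueError("bad magic header")
    return {"fields": blob[4:].split(b",")}

def mutate(blob: bytes, rng: random.Random) -> bytes:
    """Randomly corrupt and truncate bytes to simulate garbage content."""
    data = bytearray(blob)
    for _ in range(rng.randint(1, 8)):
        data[rng.randrange(len(data))] = rng.randrange(256)
    return bytes(data[: rng.randint(1, len(data))])

def test_parser_never_crashes_on_garbage():
    rng = random.Random(1234)  # fixed seed keeps the CI run reproducible
    valid = b"CFG1" + b",".join(b"field%d" % i for i in range(21))
    for _ in range(10_000):
        try:
            parse_content(mutate(valid, rng))
        except ValueError:
            pass  # graceful rejection is the expected outcome
        # Any other exception (or a crash) fails the build.

if __name__ == "__main__":
    test_parser_never_crashes_on_garbage()
    print("fuzz gate passed")
```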
Rapid Response Content Deployment: On its website, CrowdStrike made official its intention to implement canary deployment. I would argue this will be the most significant change for remediating future errors.
Not only will CrowdStrike introduce this type of deployment, it will also allow clients to adjust the level of granularity inside their organizations. If this feature had been a reality on the 19th of July, it alone would have mitigated a whole series of incidents.
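For illustration, the sketch below shows how a staggered (canary) rollout with client-configurable granularity might be wired up. The ring fractions, the failure threshold, and helpers such as `push_update` and `collect_failures` are assumptions for the example, not CrowdStrike’s published design.

```python
# Sketch of a staggered (canary) rollout with client-configurable rings.
# Ring sizes, thresholds, and helper names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class RolloutPolicy:
    # Each ring is the fraction of an organization's fleet to have updated;
    # clients could tune these fractions and the failure threshold themselves.
    rings: tuple = (0.01, 0.10, 0.50, 1.00)
    max_failure_rate: float = 0.002  # halt if >0.2% of hosts report problems

def deploy_in_rings(hosts, policy, push_update, collect_failures):
    """Push an update ring by ring, halting and rolling back on bad telemetry."""
    updated = []
    for fraction in policy.rings:
        target_count = int(len(hosts) * fraction)
        batch = [h for h in hosts[:target_count] if h not in updated]
        push_update(batch)
        updated.extend(batch)

        failures = collect_failures(updated)
        if updated and len(failures) / len(updated) > policy.max_failure_rate:
            # One unhealthy canary ring stops the rollout before it reaches
            # the rest of the fleet; this is the mitigation missing on 19 July.
            push_update(updated, rollback=True)
            return f"halted after {len(updated)} hosts, rolled back"
    return f"completed: {len(updated)} hosts updated"

if __name__ == "__main__":
    fleet = [f"host-{i}" for i in range(1000)]
    push = lambda batch, rollback=False: None   # no-op push for the demo
    healthy = lambda updated: []                # pretend no host has failed
    print(deploy_in_rings(fleet, RolloutPolicy(), push, healthy))
```

The design choice worth noting is that the decision to continue is driven by telemetry from the hosts already updated, so a single bad ring never propagates to the whole fleet.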
Third-Party Validation: While on paper it makes sense to have a third party audit the process and security procedures through code review, in practice it would be difficult to build a lean process that can actually do it efficiently.
Exposing the code base to a third party, to the extent of having multiple code reviews done, can provide some insight and remove blind spots the system might have. However, that code base is CrowdStrike’s core business, and exposing it to a third party brings other security and commercial risks.
I would have replaced this measure with an independent third-party audit: once the findings were communicated, an action plan would be put in place with scheduled, actionable points. Any blind spot in the system would be addressed, followed by staff training and other retention policies. Software engineering involves a great deal of knowledge management, and a strategy would need to take that into account.
My Overall Analysis
Many of these measures amount to a PR stunt. I understand the need for it, since the incident greatly impacted the company’s image and commercial credibility, and some PR measures should be in place after an incident like this. In practical terms, the canary deployment is the most important and actionable improvement in the action plan; it alone would have greatly helped mitigate this incident. Also, letting each organization define the update granularity of its systems would transfer part of the responsibility to CrowdStrike’s clients should a similar bug occur in the future.
It would be interesting to understand how the testing will be improved in practice. It is one thing to put a bullet point in a public statement; it is another to define the actual changes that will be put in place to increase the robustness of the testing strategy and processes. Defining and planning these changes are the most important measures for improving CrowdStrike’s internal processes. Unfortunately, there is not much information about them.
That’s it. If you found this post useful, please share it with friends or colleagues who might be interested in this topic. If you would like to see a different angle, suggest it in the comments or send me a message on Substack.
Cheers,
Artur