Mere dozens of hours ago, millions of Microsoft Windows hosts running “endpoint protection” software cried out in a terrible system crash, and were suddenly silenced by a BSOD. If you’re reading this shortly after it happened, you will feel informed enough about the currently known details and origin story of the great CrowdStrike “content update” mishap of 2024. Otherwise, you should be able to find ample material on Wikipedia and in news archives about it, because this was a big deal: A few bytes in the wrong place crippled the IT-based operations of ambulances and emergency services, airlines, harbors, government agencies, multinationals, etc. pp. all around this island earth. It was very widespread, it happened very fast, and few felt (and even fewer were) prepared for such a thing to happen.
Around the middle of the 19th century, about a million people in Ireland starved to death because the predominant source of food, a single crop of potato, got virtually wiped out by its unstoppable fungus nemesis. The dependency on the Irish Lumper had become so stark that it, for all its virtues and benefits of feeding so many people while it could, became a gigantic liability the moment it couldn’t.
If we as a society continue down the path that current IT practices and trends suggest, with our daily lives becoming ever more inextricably linked to the systems that these circumstances bear, we are in for debacles of similar magnitude - with the underlying causes and reasons sharing likeness, too.
CrowdStrike is a very successful enterprise, offering products that promise to protect its customers from the kind of disastrous IT meltdown it just caused. Quite evidently, their sales teams are very effective, and many IT managers put at least a part of their digital fates in CrowdStrike’s hands. Sometimes, it seems, these hands tend to twitch, and dire consequences follow.
The premise under which CrowdStrike and other “endpoint protection” peddlers sell their product is not the easiest one to swallow. You see, IT platforms powering all the important applications and integrations of today are known and accepted to be very brittle. Digital threats putting them at existential risk lurk in each crack and every dark corner of the Internet. A single click by some hapless office clerk on the wrong link in an email is accepted to be able to cause ripple effects that will bring a financially healthy company to its knees. The single mouse click will, quite suddenly, compel the company to buy huge amounts of digital crypto currency - only to unload those riches into untraceable digital wallets shortly thereafter, to the benefit of whoever orchestrated the delivery of that dangerous, evil email.
Luckily, CrowdStrike(et al.)’s advanced software, algorithms and data - and surely, there’s a healthy dose of “AI” involved by now, too! - will help you and your company avoid this cruel fate. If only you erect this advanced digital fortification over the decrepit groundwork that, unfortunately, has to serve as basis of the rest of your IT infrastructure, you and your systems shall be safe and sound. So sayeth not only the vendors and their partners, but also the wise Audit Gods - and so it is checkboxed and written.
When I began to grasp the proportions of this most recent digital calamity, I felt an urge to climb these lands’ highest mountain to yell my frustration into the dales beneath. Since that would be hardly productive, I am typing up this rant. Because I know SO much in the IT industry to be just very, very wrong, and yet I’ve never taken the time to turn my thoughts about that into a coherent-ish essay.
Well, that changes today! I am pretty sure I will not be able to offer a solution to the grave troubles the industry faces (and re-creates and re-perpetuates, on a daily basis!) on a whim here, but “the first step in solving a problem is recognizing there is one”, as they say. And I hope that this text might help share an imho important insight or two with at least some of its readers.
The systems that failed their societies and communities in this way had two major things in common: They were running Microsoft Windows, and they were running CrowdStrike Falcon. There were (probably) many millions of them. That’s at least seven decimal digits required to count them, and that, to be quite frank, boggles my mind. Some organizations had tens or even hundreds of thousands of devices taken offline in a single sweep of dishing out a prophylaxis that turned out about as bad as any disease could get.
Now, accepting the premise that these systems all have to exist to perform some kind of important and useful computation/work in the first place, I feel confident to assert that it is downright folly that these systems existed, and will continue to exist, in this particular shape and form: All essentially the same, with a uniform and homogeneous collection of both known and unknown weaknesses, ready to be taken out again by the next near-fatal strain of digital malady.
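To put a rough number on that, here is a minimal sketch in Python. The fleet compositions and stack labels are made up for illustration, and the model is deliberately crude: it assumes that a single faulty update (or exploit) takes out every host running the affected stack, and nothing else.

```python
from collections import Counter

def worst_case_outage(fleet: list[str]) -> float:
    """Fraction of hosts lost if the most common stack fails everywhere at once."""
    if not fleet:
        return 0.0
    counts = Counter(fleet)
    return max(counts.values()) / len(fleet)

# Hypothetical fleets of 1000 hosts each; the stack labels are illustrative only.
monoculture = ["stack-a"] * 1000
mixed = ["stack-a"] * 400 + ["stack-b"] * 350 + ["stack-c"] * 250

print(worst_case_outage(monoculture))  # 1.0
print(worst_case_outage(mixed))        # 0.4
```

Even the mixed fleet in this toy model loses a painful 40% of its hosts in the worst case, but it keeps operating, which is more than can be said for the monoculture.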
A society that becomes too dependent on a single critical factor, whatever it may be, becomes unnecessarily fragile. Recent human history has cautionary examples of this, and yet all the incentives in corporate IT willingly ignore them. There’s probably hardly a company today where no strategic initiative seeks to level and unify whatever healthy diversity somehow seeped into the organization, before it became “professionally (i.e., someone’s getting paid to do it) managed” and “optimized” for hollow “synergy”.
Of course, having the same kind of software, with identical genetic makeup everywhere, yields convenient benefits (which I am not going to try to enumerate - random LinkedIn posts produced by IT transformation priests should have enough of that) - but this kind of fair-weather-thinking and relentless faux-efficiency-optimization mindlessness eventually induces serious downsides which are easy to spot. At least after 2024-07-19 they should be.
Staged deployment and incremental rollouts mitigate these risks by a lot, but that only applies to the kind of change that you actively want to implement. Actual adversaries won’t strike at your giant infrastructure monocultures one piece at a time.
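For readers who have never seen one, a staged rollout can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of how CrowdStrike or anyone else ships updates: the function name, the wave sizes (1%, 10%, 50%, 100%), the 5% failure threshold and the `apply_update` callback are all hypothetical.

```python
from typing import Callable, Sequence

def staged_rollout(
    hosts: Sequence[str],
    apply_update: Callable[[str], bool],  # hypothetical: returns True if the host survived
    stages: Sequence[float] = (0.01, 0.10, 0.50, 1.0),
    max_failure_rate: float = 0.05,
) -> tuple[int, bool]:
    """Update hosts in growing waves; stop as soon as a wave fails too often.

    Returns (number of hosts touched, whether the rollout completed).
    """
    done = 0
    for fraction in stages:
        # Each stage extends coverage to `fraction` of the fleet, at least one new host.
        target = min(len(hosts), max(done + 1, int(len(hosts) * fraction)))
        wave = hosts[done:target]
        if not wave:
            continue
        failures = sum(1 for host in wave if not apply_update(host))
        done = target
        if failures / len(wave) > max_failure_rate:
            return done, False  # halt: the rest of the fleet stays untouched
    return done, done == len(hosts)

fleet = [f"host-{i}" for i in range(1000)]
print(staged_rollout(fleet, lambda host: False))  # (10, False)
print(staged_rollout(fleet, lambda host: True))   # (1000, True)
```

With an update that breaks every host it touches, the sketch stops after the first wave of 10 machines instead of taking down all 1000. None of that helps, of course, when the change arrives from someone who does not bother with waves.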
One would think that the troubles of recent years - be it the world economy aching under the impact of the COVID-19 pandemic, or the 2021 Suez Canal obstruction - would have made some impression on people in positions of power to put the proverbial MBA mindset on hold for long enough to give the world a chance to gain a few additional shreds of much-needed resilience. All it would require is granting a little bit of slack (not the chat app), accepting some kinds of redundancy as a necessity, and fostering diversity of implementation across the various fields of human endeavor.
So you’ve spent titanic efforts to make all your IT systems look the same, work the same, and be the same. Great - now you can deploy measures to manage them all efficiently, all in the very same way. Single pane of glass and all that, well-oiled automation and orchestration everywhere. Finally, you can move fast! Sometimes you might break a thing or three, but for the really important stuff, there’s remedies: You deploy Multi-Factor-Auth for your hosted Single-Sign-On authentication portal, admins’ machines and their all-powerful session tokens are guarded by advanced endpoint protection from CrowdStrike. You have Privileged Access Management, Data Exfiltration Prevention and Security Information and Event Management systems looking at and logging each bit that traverses your organization at least twice. Still, companies like yours get hacked. All. The. Time.
The common denominator in this tragedy seems to (mostly) be centralized authentication and authorization hubs like Microsoft’s Active Directory - but you could probably substitute any other “ID provider” here, if AD weren’t so widely available and also easy to pwn. And yet, despite this being common knowledge, IT managers seem to mostly insist on hooking each and every insignificant piece of software into the one thing that needs to be protected at all costs, where the crown jewels are kept.
To what end? So that people don’t have to remember more than one set of credentials or something. (Well, apart from those in the org who have to touch the important stuff, because they will get segregated, personalized role-like accounts.) Where’s the sane approach to managing risk in that? Why should one compromised account in one service enable the nigh-universal compromise of that account in all services?
But wait, there’s more! Why does the thing you implement a network packet filtering policy with also have to MITM all your TLS connections and try to filter for malware in that data? Why is it also your TLS VPN endpoint (buffer overflows in its privileged TCP listener included), and why does it act as a security perimeter device, despite its vendor having “zero trust” written all over their marketing material? Why does it have colorful web-based dashboards built in, featuring at least six out of OWASP’s top ten security gaffes that were cringy to find even in a PHP beginner’s shopping list web app a decade ago?
The IT sector has cultivated a strange fetish where everything needs to be seamlessly connected to everything else, and where it always needs more: More resources, more features, more “integrations”, more plug-ins, more APIs, more attack surface.
Some of the checks and balances we as a society have tried to mandate to establish “best practices” in terms of applied IT security - ISO 27000, PCI DSS, NIST CSF, NIS-2, younameit - actually require compliant organizations to do this crazy thing or deploy that kind of unfit piece of junk to “pass the audit”, which seems to have become the most and in fact only pressing need of those who should be in charge of actual IT security at their place of work. These grandiose tomes of thick quasi-legalese leave no possibility for healthy minimalism, and implementing the smallest and most simple thing to get something done. Which leads me to…
What is known as Goodhart’s Law is in full effect for the IT industry at large. Good people with decades of experience, highly specialized knowledge and training, and very important duties to fulfill in their well-compensated day jobs will willingly and knowingly compromise on actually improving the security of their organization, because checking some boxes in a 3000+ row spreadsheet that has to be religiously re-validated each year has somehow gotten more important.
They do so by choosing the products and services of “widely-used and established vendors”, instead of making use of their own mental faculties to actually implement what any sane person would expect their job title to imply. That way, actually important organizations end up with Fortinet “security” appliances in front of their Avanti “security” appliance, because the (technically completely clueless) auditor “will ask fewer questions” when you present some patently crap product with commercial backing and sufficient market penetration, rather than just deploying WireGuard yourself. And since the person who made that decision, somewhere in their mind, is perfectly aware of that absurdity, they will choose to buy from two independent vendors of the same kind of shitware and serialize their “solutions”, just to be able to get at least a modicum of sleep at night.
In my book, having auditors drop by each year and check an organization’s compliance with myriads of rules and controls is all fine and dandy, if that process demonstrably improves the thing that it was supposed and originally designed to improve. In the real world though, that process has degenerated into an exercise in mostly checking boxes without much residual meaning, and a widely shared interest in making things look alright. Not only does the Emperor have no clothes - he’s a clacking skeleton, without an ounce of flesh left on his frame.
I have personally witnessed and experienced instances where the actual impact on security was a clear net negative, and I feel we’re beyond the point where we can afford that kind of foolish luxury.
Great question! Unfortunately, I don’t have the answer. Thanks for reading, bye!
On a more serious note, what I think would be necessary to fix these problems is a sudden explosion in individual competence across the IT industry, and a healthy dose of distrust for mostly empty vendor promises. Then, we would need people with a desire and ability to understand and implement systems that do one thing, and do that thing very well, and resist the urge to make the thing into a product - or if it already is, refuse to tack on everything but the kitchen sink to increase its mass-market appeal by piling up features upon features.
We would need people who understand and uphold an interpretation of the single-responsibility principle not only briefly during some form of code review, but also when there’s a design or buying decision to be made, and when something that’s less shiny, but took KISS to heart, competes against a very feature- and colorful, highly polished turd that will succumb to anyone who can smuggle a URI into a log message it emits.
We need to accept that one size does not fit all, because a resilient ecosystem requires a diversity of functionally redundant subsystems. Successful companies tend to grow quite absurdly large these days, to the point where only a few key players compete in important markets. If any one of those few players gets into trouble, the whole world feels the stinging effects. These organizations need to sufficiently diversify internally, so that no single piece of equipment or software or system malfunctioning can put the operation in jeopardy. And finally, we as a society need to sufficiently diversify to not depend on a single vendor, partner, or provider not fumbling it, ever.
It’s gonna be a tough ride for many, and success in untangling the mess that has been made is not guaranteed. But waiting to try any longer will only make matters worse - so let’s get to it!
Copyright ©2024 Johannes Truschnigg
This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.