Winds of Change – Infrastructure management returns to center stage

The Oxford Dictionary defines epoch-making events as those of major importance that are likely to have a significant impact over a period of time. The worldwide technology outage of July 19, 2024, might well be considered the largest IT outage in recorded history. It was caused by a faulty software update from CrowdStrike that affected millions of Windows machines around the world. The damage to Fortune 500 companies alone is estimated at USD 5.4 billion. Many lessons have surfaced, and in many ways things will now be dated as before or after the July disaster. It seems apt, therefore, to call this an epoch-making event.

Phiroze Vandrewala

Background for those who did not follow the events

CrowdStrike is an endpoint security vendor that primarily offers the Falcon platform. This platform protects against cyber threats to endpoints.

Software issues are very common in the technology industry. Rarely, though, do they manifest everywhere all at once. In this case, a buggy software update released on July 19, 2024, affected the core, or “kernel,” of the Windows operating system, causing it to shut itself down. What set this update apart was its capacity to cause upheaval on a global scale.

The effect of this software bug was amplified many times over by Falcon’s pervasiveness as a platform in many mission-critical applications and industries. According to a Microsoft estimate, 8.5 million Windows devices were directly affected. Airlines, public transit, healthcare, media, and, last but not least, financial services were among the industries brought to their knees.

The once obscure term “blue screen of death,” or “BSOD,” previously used only by tech geeks, featured prominently in news reports about this incident.

The advised fix required people to physically reach each impacted machine, boot it into Safe Mode or the recovery environment, delete the faulty update file, and reboot. This was very time-consuming, and it also holds a lesson that we will cover below.

Deep Dive

Many thousands of words have already been expended on the internet explaining the geekery behind how the bug impacted the core of the Windows system, and the steps required to get back to production readiness. I do not intend to dwell on that bit of wizardry here. This article is more focused on the changes that I see coming from all that I have read, understood, and analyzed, albeit baselined to my own three decades of experience in Infrastructure Management.

It is perhaps advisable to break the learnings into some broad groups.

1. Technology Management and Planning Fixes
2. Product and Process Changes
3. Risk Management and Disaster Recovery
4. Commercial and Legal Aspects

1. Technology Management and Planning Fixes:

Barring a few organizations that regularly propel people into space and the like, rarely does anyone plan for a third- or fourth-layer outage in all its dimensions of design, staffing, planning, and recovery. There is an unwritten median of possibility that most organizations calibrate to, based on their industry or peer group. Some calibrate more conservatively than others, depending on the criticality or extent of the loss or damage that can occur. This outage, however, forces us to rethink, recalibrate, or at least review that median and give it a good shake to check whether it is still appropriate.

A. Know thy self – God is indeed in the details… Over time, all organizations accumulate a sprawl of infrastructure, built and layered across projects, acquisitions, mergers, staff attrition, and so on. When faced with a pervasive outage like this one, a key learning is to know your environment well. What each system does, where in the enterprise it is located, how it is managed, what functionality it delivers, and how critical that functionality is to business operations all take a lot of time and effort to record correctly. There is, however, a rich dividend in having this information clearly mapped and recorded. It is invaluable when prioritizing actions.

B. Established response protocols – Prepare, plan, review, and rehearse. A clearly articulated response protocol that addresses technology risks, including outages and cyber events, is critical. This means preparing a Crisis Management plan that outlines the roles and responsibilities of the various stakeholders (the Crisis Management Team), the potential scenarios, and probable response strategies designed to plug and play. It is very important to have a chain of command that drives impact assessment, decision-making, and communications. This enables one to “respond” to the crisis in a concerted way rather than through “ad hoc reactions.”

A Business Impact Analysis (BIA) is also critical to driving recovery efforts. It categorizes systems as:

i. critical for survival,

ii. required for sustenance,

iii. good to have for scale.

A BIA exercise must be a hard-nosed categorization of systems. Sadly, however, business teams rarely have the time to participate in it, and it is left to the operating risk folks to drive. Most times, therefore, the baby that cries loudest ends up getting fed the most.
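The three BIA categories above can be recorded against each system so that recovery order falls out of the data rather than out of who shouts loudest. What follows is only a minimal sketch: the system names, locations, and tier assignments are invented for illustration, and a real inventory would carry far more detail.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """BIA categories from the list above; a lower value is recovered first."""
    CRITICAL_FOR_SURVIVAL = 1
    REQUIRED_FOR_SUSTENANCE = 2
    GOOD_TO_HAVE_FOR_SCALE = 3


@dataclass(frozen=True)
class System:
    name: str      # hypothetical system name
    location: str  # where in the enterprise it runs
    tier: Tier     # agreed with the business during the BIA


def recovery_order(systems):
    """Sort by BIA tier, then by name so the order is deterministic."""
    return sorted(systems, key=lambda s: (s.tier, s.name))


# Illustrative inventory only; none of these systems come from the article.
inventory = [
    System("intranet-wiki", "HQ", Tier.GOOD_TO_HAVE_FOR_SCALE),
    System("payments-gateway", "DC-1", Tier.CRITICAL_FOR_SURVIVAL),
    System("hr-portal", "DC-2", Tier.REQUIRED_FOR_SUSTENANCE),
    System("core-banking", "DC-1", Tier.CRITICAL_FOR_SURVIVAL),
]

for system in recovery_order(inventory):
    print(system.tier.name, system.name, system.location)
```

The point of the design is that the tier is decided calmly, in advance, and the crisis team only has to read the sorted list.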

C. Architecture & Engineering – In every enterprise today, there are many business applications. Underpinning those business applications are numerous infrastructure systems and software, which are the skeleton, sinews, and musculature that form the framework on which the applications are based. Most of these underpinning systems are pervasive to the overall framework. An outage in any of them is more than likely to cause a horizontal impact. There is a line of thought, especially post-CrowdStrike, that advocates not having any one piece of software as a single point of failure and going for multiple product implementations for a given objective. Experience indicates, however, that this creates more complexity in day-to-day operations without enough benefit in resilience. Having two points of failure for endpoint security is perhaps as bad as having one. The negatives seem to outweigh the perceived benefit.

D. Staffing – Endpoint management is widely categorized as a low-tech activity and is often outsourced on a service-level basis. Service providers support multiple organizations with the same pool of resources. The hands-and-feet support that can physically reach and touch the machine is sub-contracted even further for reasons of cost, geographical reach, etc. For the most part, that is indeed appropriate. An outage of this proportion, however, is likely to be treated as a force majeure event. A service provider with a highly utilized team would not be able to cope with the surge in demand even centrally, let alone in the field. Organizations outsourcing these activities would be well advised to have differentiated levels of service carved out for critical officers, machines, and locations.

2. Product & Process Changes:

In today’s world of specialization, there is an alphabet soup of products required to keep things running and secure. It is considered too difficult to size, recruit, and keep in-house teams motivated. Also, the sheer breadth of specialization would make it very expensive.

Therefore, teams that are manning this variety of software are most often drawn from either the OEM or their resellers. These folks are mostly immersed in ensuring product uptime itself, and its regular core updates. Minor versions and definition updates are designed as automated downloads and deployments. The updates across systems are just too numerous to manually sequence.

The answer probably lies in a couple of things. Zoning the endpoints into smaller and more manageable groups would deliver more fine-grained control than today’s “UAT, DR, Prod, DMZ” grouping, in which each group may hold thousands of machines. Grouping could be per network zone and then per criticality, or some such combination. No one pattern would fit all requirements. The idea, however, is to be able to limit damage and lateral spread. While the process is difficult, it can be achieved.
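As a rough illustration of zoning by network zone and then by criticality, the sketch below groups a handful of endpoints and orders the groups so that less critical ones are touched first. The hostnames, zone labels, and the two-level “low/high” criticality scale are assumptions made for the example, not a recommended taxonomy.

```python
from collections import defaultdict

# Hypothetical endpoint records: (hostname, network zone, criticality).
endpoints = [
    ("pc-001", "branch", "low"),
    ("pc-002", "branch", "high"),
    ("srv-101", "prod", "high"),
    ("srv-102", "prod", "low"),
    ("srv-201", "dmz", "high"),
    ("pc-003", "branch", "low"),
]


def zone_endpoints(records):
    """Group hostnames by (zone, criticality)."""
    groups = defaultdict(list)
    for hostname, zone, criticality in records:
        groups[(zone, criticality)].append(hostname)
    return dict(groups)


def rollout_waves(groups):
    """Order group keys so low-criticality groups are updated first."""
    rank = {"low": 0, "high": 1}
    return sorted(groups, key=lambda key: (rank[key[1]], key[0]))


groups = zone_endpoints(endpoints)
for key in rollout_waves(groups):
    print(key, groups[key])
```

With groups this small, a bad update that lands on one group damages a few machines rather than a few thousand.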

Technology product evaluations, too, will increasingly look for the ability to group endpoint systems and desktops as desired, and even for a kill switch that stops automatic updates in their tracks or delivers a bypass. Fine-grained and flexible reporting for different reviews is also the need of the hour.
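To make the kill-switch idea concrete, here is a minimal simulation, not any vendor’s actual mechanism. The function names, the 5% failure threshold, and the health check are all hypothetical; the sketch only shows the control flow an evaluator might ask a product to support.

```python
def staged_rollout(waves, apply_update, is_healthy, max_failure_rate=0.05):
    """Apply an update wave by wave and stop once a wave looks unhealthy.

    waves: list of lists of hostnames, in rollout order.
    apply_update, is_healthy: callables supplied by the caller (hypothetical).
    Returns (updated_hosts, halted).
    """
    updated = []
    for wave in waves:
        for host in wave:
            apply_update(host)
            updated.append(host)
        failures = sum(1 for host in wave if not is_healthy(host))
        if wave and failures / len(wave) > max_failure_rate:
            return updated, True  # kill switch: later waves are never touched
    return updated, False


# Simulated run in which the update breaks every host it touches.
broken = set()
waves = [["canary-1", "canary-2"], ["branch-1", "branch-2"], ["prod-1"]]
updated, halted = staged_rollout(
    waves,
    apply_update=broken.add,
    is_healthy=lambda host: host not in broken,
)
print(updated, halted)
```

In this simulated run the damage stops at the two canary machines, and the branch and production waves are never updated.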

3. Risk Management & Disaster Recovery:

Consider the horrifying prospect of having your Privileged Access Management system run on Windows and be patched and updated at the same time as the rest of the environment. In an outage like the one above, you would lose the ability to administer even those systems that were not directly impacted by the outage itself.
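A simple design-time check can flag this kind of exposure. The sketch below assumes two hypothetical mappings, one from each system to its update wave and one from each management system to the systems it administers, and reports any management system that shares a wave with something it is needed to recover.

```python
# Hypothetical mapping of system name to update wave number.
update_wave = {
    "pam-01": 1,
    "jump-host": 2,
    "core-banking": 1,
    "hr-portal": 2,
    "wiki": 3,
}

# Hypothetical mapping of management system to the systems it administers.
administers = {
    "pam-01": ["core-banking", "hr-portal"],
    "jump-host": ["wiki"],
}


def shared_wave_risks(update_wave, administers):
    """Return (manager, target, wave) wherever both are updated together."""
    risks = []
    for manager, targets in administers.items():
        for target in targets:
            if update_wave[manager] == update_wave[target]:
                risks.append((manager, target, update_wave[manager]))
    return risks


for manager, target, wave in shared_wave_risks(update_wave, administers):
    print(f"{manager} and {target} are both in wave {wave}")
```

Any finding here suggests moving the management system to its own, later wave so that it survives to help recover the rest.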

The responsibility for designing and implementing the infrastructure, and the solutions that run on it, with such possibilities in mind can no longer be left to the core system administration team alone. Technology risk today perhaps rivals operating risk and market risk in its sheer ability to cause an organization-wide meltdown, and in the speed with which it can do so.

Having well-trained technology risk and disaster recovery managers who contribute throughout the lifecycle, rather than only at times of audit or downtime, is well advised. They need to be embedded in the technology organization, reporting directly to the CIO.

4. Commercial and Legal aspects:

The overwhelming majority of legal agreements related to technology products limit direct or indirect liability to the value of the contract. Outside of government business, there is no vendor that would accept liability in excess of the contract. This is not an easy problem to resolve.

Increasingly, however, regulators and even Boards of Directors are going to mandate regular reviews of architecture, build to suit, and service levels. While this is certainly onerous, it does need to be done. Risk managers need to play a key role in creating frameworks that aid the listing, risk categorization, and review of technology vendors.

How can you use this information?

This article is designed to provoke thought in the minds of key decision makers in technology. Not everything will apply uniformly, but the broad elements are indeed horizontal. The priorities that a CIO has today are numerous and often conflicting. The need for agility of delivery and the need to design for scale and survival are often at odds with each other.

At TechBridge Governance and Strategy Consulting, we specialize in helping Technology and Risk Management teams review their current setups, and in providing expert consulting advice and recommendations across a broad range of technology risk, governance, and disaster recovery aspects.

Phiroze Vandrewala is an accomplished and visionary CTO with 30 years of distinguished experience in the Indian Banking and Financial Industry.
