Strategies for Seamless Business Continuity and Proactive Failover

Phiroze Vandrewala

Rarely does a single event cause cataclysmic failure. It is usually a chain of events, or of process and design failures that compound one another, that causes these disasters. Technical disasters come in strange guises. A contractor switched off a UPS, causing damage and a domino effect of failures at British Airways. Amazon’s shopping infrastructure keeled over under excessive demand during a Black Friday sale in 2018. In 2021, a major failure at a content delivery network provider rendered several major broadcasting and retail websites inaccessible. Closer to home, India too has seen several instances of significant corporations suffering technology outages with severe impact. These tend to be either played down or altogether under-reported.

In all these cases, one maxim rings true: “Failing to plan means you are planning to fail.” Paradoxically, although technology must recover in a disaster event, it’s the business that should dictate that recovery, its sequence, and its timelines.

The Role of Business in Disaster Recovery

Defining Survival, Sustenance, and Growth:
Businesses need to be very clear about which technology services are mission-critical for the business to continue undisrupted in an adverse event. It is for senior business leaders to push through the fog of emotions and ownership angst and make hard-nosed decisions on how their business processes, and therefore the resultant applications, stack up in the crucible of Survival vs. Sustenance vs. Growth. This classification, alongside other things like manpower planning and supply chain, is an important part of Business Continuity Planning (BCP).

Getting this planning right is very important for a business so that the most critical operations are resumed first, followed by the somewhat less critical or supporting business processes, and lastly, internal processes. This planning goes a long way in ensuring the organization avoids the impact trifecta of Reputation Loss, Business Loss, and Regulatory Opprobrium.
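The ordering described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the process names, the tier assigned to each process, and the max_downtime_minutes figures are all hypothetical, and in practice they would come from the business owners rather than from technology.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Criticality tiers from the article; a lower value is restored earlier."""
    SURVIVAL = 1
    SUSTENANCE = 2
    GROWTH = 3


@dataclass(frozen=True)
class BusinessProcess:
    name: str
    tier: Tier
    max_downtime_minutes: int  # hypothetical tolerance agreed with the business


def recovery_order(processes):
    """Sort by tier first, then by the tightest downtime tolerance within a tier."""
    return sorted(processes, key=lambda p: (p.tier, p.max_downtime_minutes))


# Hypothetical classification, for illustration only.
PROCESSES = [
    BusinessProcess("internal HR portal", Tier.GROWTH, 1440),
    BusinessProcess("customer payments", Tier.SURVIVAL, 15),
    BusinessProcess("statement generation", Tier.SUSTENANCE, 240),
    BusinessProcess("customer login", Tier.SURVIVAL, 10),
]

ORDER = [p.name for p in recovery_order(PROCESSES)]

if __name__ == "__main__":
    for position, name in enumerate(ORDER, start=1):
        print(position, name)
```

The point of the sketch is that the sort key is a business decision. Technology only executes the sequence that the classification produces.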

Challenges in Business Continuity Planning

BCP vs. DR confusion:
Many organizations under-specify the BCP part of the story and jump straight to Disaster Recovery (DR) planning. This results in an inadequate or altogether unsuitable DR plan.

Application failover planning vs. journey failover:
For the longest time, planners have done DR planning at an application level. The advent of virtualization and containerization means that, quite often, it’s a series of applications that completes a business journey. Disaster Recovery planners, therefore, must now reorient their planning from the failover of single applications to the failover of groups of applications that together service a business journey.
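One way to picture journey-level planning is as a dependency graph: each application in the journey lists the applications it needs to be running first, and the failover sequence is any order that respects those dependencies. The sketch below uses Python's standard graphlib module (Python 3.9 or later). The journey and its application names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical payment journey: each application maps to the set of
# applications that must already be up at the recovery site before it starts.
PAYMENT_JOURNEY = {
    "database": set(),
    "core-ledger": {"database"},
    "fraud-check": {"database"},
    "payment-api": {"core-ledger", "fraud-check"},
    "mobile-gateway": {"payment-api"},
}


def failover_sequence(journey):
    """Return one valid start-up order in which dependencies come first."""
    return list(TopologicalSorter(journey).static_order())


if __name__ == "__main__":
    for step, app in enumerate(failover_sequence(PAYMENT_JOURNEY), start=1):
        print(step, app)
```

Applications with no dependency between them, such as core-ledger and fraud-check here, may appear in either order, which is also a hint that they could be failed over in parallel.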

Recovery Point Objective (RPO) and Recovery Time Objective (RTO):
RPO is the maximum amount of data, measured as a window of time before the disruption, that the organization can afford to lose. RTO is the maximum time a service may remain unavailable before it must be restored. RPO should be decided as an outcome of examining process criticality. Architecting technology solutions that deliver Zero RPO can become very expensive very quickly. Hence, while it is intuitive to think that any loss of data is intolerable, the tolerance really depends on the type of application and data. If the data involved is transient and its loss does not lead to an incorrect financial or other outcome, it is entirely advisable to stagger the RPO strictly according to the amount of data loss that is tolerable.
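A staggered RPO can be monitored with very little code. The sketch below compares how stale each data set's copy at the recovery site is against its agreed RPO. The data set names, RPO values, and timestamps are hypothetical, and a real check would read the last replication time from the replication or backup tooling in use.

```python
from datetime import datetime, timedelta


def rpo_breaches(rpo_by_dataset, last_replicated, now):
    """Return data sets whose potential data loss currently exceeds their RPO.

    The exposure is the time elapsed since the last successful copy to the
    recovery site; anything written after that point would be lost.
    """
    breaches = {}
    for name, rpo in rpo_by_dataset.items():
        exposure = now - last_replicated[name]
        if exposure > rpo:
            breaches[name] = exposure
    return breaches


# Hypothetical figures for illustration only.
NOW = datetime(2024, 1, 1, 12, 0)
RPO = {
    "ledger": timedelta(0),            # zero RPO: synchronous replication
    "session-cache": timedelta(hours=1),
    "reports": timedelta(hours=4),
}
LAST_REPLICATED = {
    "ledger": NOW,
    "session-cache": datetime(2024, 1, 1, 11, 20),
    "reports": datetime(2024, 1, 1, 7, 0),
}

BREACHES = rpo_breaches(RPO, LAST_REPLICATED, NOW)

if __name__ == "__main__":
    for name, exposure in BREACHES.items():
        print(f"{name}: exposed for {exposure}, RPO is {RPO[name]}")
```

In this example only the reports data set is in breach: its last copy is five hours old against a four-hour RPO, while the transient session cache is comfortably within its looser target.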

Achieving Effective Failover

The dilemma for the business is that it wants the fastest failover at the lowest cost. Anyone who has done this at an enterprise scale knows that it is a journey.

The lowest level of maturity is usually a heavily resourced and overplanned exercise involving many system administrators of different flavors. Most individual failovers involve a series of commands that must be executed in the right sequence. It is when entire journeys have to be failed over in a short period of time that things get challenging and manual failovers no longer scale.

As organizations mature in their failover preparations, it is advisable that they first look for solutions that allow both scripting and orchestration of the failover of various technologies and applications. These solutions exist: some in DIY form, a few that are pretty good and relatively inexpensive, and some that might cost an arm and a leg and are sold by the big names in technology. The magic, however, lies in the inventiveness of the users in seeking the best outcomes from these technologies.
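At the DIY end of that spectrum, orchestration can start as something as simple as a runbook expressed in code: named steps executed strictly in sequence, halting at the first failure so that an operator knows exactly where things stopped. The sketch below is an illustration under that assumption. The step names and the stand-in functions are hypothetical, and real steps would invoke the relevant database, application, and network tooling.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_failover(steps: List[Step]) -> Tuple[bool, List[str], Optional[str]]:
    """Run steps in order; stop at the first step that fails or raises.

    Returns (overall success, names of completed steps, name of failed step).
    """
    completed: List[str] = []
    for name, action in steps:
        try:
            succeeded = action()
        except Exception:
            succeeded = False
        if not succeeded:
            return False, completed, name
        completed.append(name)
    return True, completed, None


# Hypothetical stand-ins for real failover actions.
def promote_replica() -> bool:
    return True


def start_applications() -> bool:
    return True


def switch_traffic() -> bool:
    return True


RUNBOOK: List[Step] = [
    ("promote database replica", promote_replica),
    ("start applications", start_applications),
    ("switch traffic", switch_traffic),
]

if __name__ == "__main__":
    ok, done, failed_at = run_failover(RUNBOOK)
    print("success" if ok else f"stopped at: {failed_at}", "| completed:", done)
```

Halting on the first failure is a deliberate choice here: continuing to switch traffic after a failed database promotion would make matters worse, not better.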

Testing, Testing, Testing 

Sweat in peace so that you don’t bleed in war. Your disaster recovery capability is only as good as how closely it mirrors production. The higher the change velocity, the harder it is to maintain the recovery capability. Code changes and capacity utilization changes can quickly render a well-tested application unrecoverable unless change and capacity management hygiene is well-developed.

Testing needs to be varied, as near to reality as possible, and done as often as possible. These tests will reveal whether the scripting and orchestration are holding true, whether the code base is faithful to production, and whether capacity will stand the punishment of live usage. 
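One of these checks, whether the recovery site is faithful to production, lends itself to automation between full tests. The sketch below compares deployed versions at the two sites and reports any drift. The application names and version strings are hypothetical, and a real check would pull them from the deployment or configuration management tooling.

```python
def find_drift(production, dr_site):
    """Return {app: (production version, DR version)} wherever the two differ.

    An application missing from one site is reported with None for that site.
    """
    drift = {}
    for app in sorted(set(production) | set(dr_site)):
        prod_version = production.get(app)
        dr_version = dr_site.get(app)
        if prod_version != dr_version:
            drift[app] = (prod_version, dr_version)
    return drift


# Hypothetical inventories for illustration only.
PRODUCTION = {"payment-api": "2.4.1", "core-ledger": "7.0.3", "fraud-check": "1.9.0"}
DR_SITE = {"payment-api": "2.3.8", "core-ledger": "7.0.3"}

DRIFT = find_drift(PRODUCTION, DR_SITE)

if __name__ == "__main__":
    for app, (prod_version, dr_version) in DRIFT.items():
        print(f"{app}: production={prod_version} dr={dr_version}")
```

A check like this does not replace a failover test. It only narrows the gap between tests by flagging the kind of drift that high change velocity produces.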

Invoking a disaster recovery is almost always a culmination of a failed resurrection in production, and stress is omnipresent. At such times, a practiced hand calmly executing a well-tested failover will result in the best outcomes. 

In conclusion 

Businesses today promise very high responsiveness to the customer. In the payments business alone, what was a 3-day turnaround in the era of the venerable cheque has been reduced to a turnaround of 30 seconds. Service providers process more than 900 transactions per second (tps) in the real world. Customers demand the highest levels of availability and continuity of service. Social media can be the rapier that rends many reputations asunder. Lastly, the regulators are increasingly hawk-eyed about customer service.

It is therefore imperative for organizations to invest in the right people, processes, and technology and to build an optimum business recovery capability.

Phiroze Vandrewala is an accomplished and visionary CTO with 30 years of distinguished experience in the Indian Banking and Financial Industry.
