Tuesday 17 July 2012

The Revival of IT Disaster Recovery as a BCM Core Issue



Lyndon Bird FBCI
The one thing that almost all Business Continuity professionals agree upon is that BCM originated in IT Disaster Recovery (ITDR). Until very recently I would have argued that ITDR was no longer a core issue for BC practitioners in multi-national or global organizations. Although the risks of IT failure were extensive and the potential impact massive, the probability of those risks materialising was very much under control. The sheer expenditure on, and sophistication of, ICT in major organizations should have built resilience into the infrastructure so that IT failures were largely invisible to customers and other external stakeholders.

My first reaction on hearing that the Royal Bank of Scotland had serious IT problems with a software upgrade was one of disbelief. Surely not, I told my colleagues; there has to be more to it than this, perhaps a cyber-attack or internal sabotage. When no rumours of such dramatic events emerged and the banking group kept extending its expected recovery period, I had to admit, reluctantly, that I was wrong. As I write, the banking group is still not fully operational at its Ulster Bank subsidiary after four weeks of disruption, and even customers at the bank’s main brands (RBS and NatWest) experienced delays of a week or more.

Put in context, this is increasingly bizarre. Most banks define their “Recovery Time Objectives” in hours or even minutes for particular services. An interruption of four weeks in accessing client accounts (i.e. basic banking functionality) would have been unthinkable.

For those non-UK residents who are not too familiar with this particular financial institution, it is sufficient to say that it was bailed out by the UK Government in 2009 and is now 82% owned by the UK taxpayer. Its problems have continued since then: not just the destruction of its share price and the failure to contain its losses, but also damage to its reputation and image. There was the arbitrary removal of the previous Chairman’s knighthood by the Queen, and a damaging, long-running dispute about whether the current CEO should be allowed to take his bonus. It has also faced much government and opposition criticism of its failure to lend sufficient money to small businesses, despite itself being entirely protected by public money.

Against this backdrop, the last thing needed was an operational failure. It appears that the IT problem started on a Tuesday evening, when a routine update of a software component failed and prevented access to customer accounts. It took until Friday to understand the problem fully; by then no transactions had been handled and a backlog of over 100 million transactions remained to be processed. RBS certainly had commercial problems, but like all major banks it has expensive, sophisticated, low-risk and highly protected computer systems.

Speculation on the reasons for the failure was, of course, widespread and highly imaginative. Some postulated that it was caused by the outsourcing of computer operations to India. The argument that RBS was running its computer operations in a risky manner just to save money was a popular line, although no evidence was presented. As the incident continued, a new explanation seemed to gain credence: it was not really about any specific failure, but about the complexity of the technical infrastructure that had grown up over the past decade. The view was that no-one could possibly understand the full potential consequences of a single change to the overall infrastructure. RBS had been unlucky, but it could (and would) happen to others on an increasingly regular basis. Leading ICT consultants called for a fundamentally different approach to the way large organizations manage the performance of their IT systems, recognizing that everyone now relies on such services in their day-to-day lives.

For those who were less than convinced by this argument, we were then hit by a different situation, but one in which the experts again blamed the complexity of technical integration. The mobile operator O2 (itself owned by the struggling Spanish telecom company Telefonica) was out of action for at least 17 hours for many customers, and was then only restored to downgraded 2G service while work continued on recovery of the 3G network. The network, which uses the slogan "we're better connected", had not issued a timetable for full recovery two days after the incident. Mobile operators set their acceptable downtimes in minutes, not days, so again the impact on profits and reputation will be massive.

Of these two examples, O2 probably has the most to lose. RBS already has a poor reputation and is protected against loss by the UK taxpayer. It is also more difficult to change banks than to change mobile operators, so RBS is unlikely to lose customers in large numbers. Other banking scandals have already overtaken RBS in the news agenda. O2 has no such protection and only limited brand loyalty.

Just when the Business Continuity community felt that ITDR was now a routine business process and our attention should turn to helping deal with business-related threats (such as the risk of a Eurozone breakup or political upheaval in the Middle East), the oldest BCM issue of all comes back to bite us. Technology recovery is back on our radar and, with increasing cyber threats emerging, it is likely to remain so.


1 comment:

  1. I wrote a similar article for my company's blog - http://blog.onyx.net - focusing on the complexities of systems we take for granted and the need for testing. The current ongoing electricity problems in India may also be symptomatic of not having a full understanding of the complexity of large interconnected systems.