What should businesses learn from the BA outage?
On the Saturday morning of the recent holiday weekend, BA’s computer systems suffered an outage. The details are not public, although there is a suggestion that a power surge in one of BA’s data centres damaged hardware. The cause matters to BA’s technicians, who will have been working throughout the holiday weekend to resolve the issue and the subsequent fallout, but the more interesting lessons come from looking at the incident as a whole.
Plan for it and Document it
The risks that businesses need to protect themselves against will vary from business to business and from sector to sector. One of the most important first steps is to recognise the risks and scenarios that apply to your company. Many firms make two critical mistakes in this area:
1. They don’t do anything
The old adage "to fail to plan is to plan to fail" is very apt for those in this situation. Following BA’s recent challenges, smart boards will be asking how they would handle something similarly disruptive. If your organisation doesn't have a Business Continuity Plan, then now is the time to create one. If you don’t know where to start, or don’t have the time or experience to create one, Fifth Step is always here to help.
2. They get too specific
When looking at risks and scenarios, it is important to look at their impact rather than getting too specific. I have, for example, seen Business Continuity Plans that went into detail about the loss of access to a building due to a gas main that ran close by, but had no plan for loss of access to the building generally. Always make sure that your plans deal with the general issues, and get more specific only where needed.
Lesson: Create a Business Continuity Plan
Test, Test and Test again.
A spokesman for BA said that they had backup systems, but that these failed to come online as expected.
Having backup systems is the first and sometimes the costliest part of building resilient IT systems, but without the occasional fire drill you won't truly know whether systems, processes and procedures will work and come together as expected. Testing can take different forms, but at least some of it must include a transfer of processing to the backup systems. Desktop exercises are also an important tool.
Lesson: Perform DR and BCP tests on a regular basis
There can be Only One.
In contrast to the Highlander mantra, if your organisation is to be more resilient, you need to identify any single points of failure (people, systems, vendors, infrastructure, etc.) and ensure that you have a backup plan should any of those key resources fail.
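One way to start identifying single points of failure is to keep a simple inventory of each critical business function and the resources able to deliver it. The following is a minimal sketch of that idea in Python; the function name, the inventory layout and the example entries are all hypothetical and are not taken from any real organisation's plan.

```python
from typing import Dict, List


def find_single_points_of_failure(inventory: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each function that relies on exactly one resource to that resource.

    `inventory` is an assumed structure: business function -> list of
    resources (people, systems, vendors, infrastructure) that can deliver it.
    """
    return {
        function: providers[0]
        for function, providers in inventory.items()
        if len(set(providers)) == 1  # only one distinct resource available
    }


# Hypothetical example inventory, for illustration only.
example_inventory = {
    "check-in": ["primary data centre"],
    "payroll": ["Alice", "Bob"],
    "power": ["mains supply", "generator"],
    "card payments": ["Vendor A"],
}

single_points = find_single_points_of_failure(example_inventory)
for function, resource in single_points.items():
    print(f"{function} depends solely on {resource}")
```

A list like this is only as good as the inventory behind it, so it is a prompt for discussion rather than a substitute for a proper review.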
In BA’s case, there seemed to be a single point of failure somewhere in their systems, as the backup systems failed to come online in a timely manner.
Whilst these days you may not have a physical duplicate computer system sitting in another data centre (although this may be the right approach in some cases), your systems need to be able to respond promptly to the loss of a key system or piece of infrastructure. In the best cases users are not even aware that an incident has occurred, because backup systems stepped in and plugged the gaps as soon as the failure was identified.
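The idea of a backup stepping in without users noticing can be illustrated with a small Python sketch. It assumes each system is represented by a callable that either returns a result or raises an exception. The names `call_with_failover`, `primary_booking_system` and `backup_booking_system` are hypothetical, and real failover usually happens at the infrastructure level rather than in application code like this.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(providers: Sequence[Callable[[], T]]) -> T:
    """Try each provider in order and return the first successful result."""
    last_error: Optional[Exception] = None
    for provider in providers:
        try:
            return provider()
        except Exception as error:  # a failed system: move on to the next one
            last_error = error
    # Every provider failed (or none were supplied).
    raise RuntimeError("all providers failed") from last_error


# Hypothetical systems, for illustration only.
def primary_booking_system() -> str:
    raise ConnectionError("primary data centre unavailable")


def backup_booking_system() -> str:
    return "booking confirmed by backup"


result = call_with_failover([primary_booking_system, backup_booking_system])
print(result)
```

The caller receives a result from whichever system responds first, which is the behaviour that regular testing of the backup path is meant to prove.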
Lesson: Know and eliminate your organisation’s single points of failure
Incident? What Incident?
One of the biggest complaints that BA’s customers had over the weekend (aside from the fact that they were still in the airport rather than where they intended to be) was that no one seemed to know what was going on.
Part of any good Business Continuity Plan is an Incident Response Plan. It is the job of the incident response plan to ensure that those involved know what has happened, and what is being done to get the business back to normal operations as quickly as possible. This may include the formation of an incident response team, whose role is to oversee and manage the process of returning the business to normal operations.
A critical part of an incident response plan is the communication plan. This ensures that all stakeholders, from customers to investors, are given timely updates and are reassured that the organisation is as much in control of the incident as possible.
Lesson: Have a good incident response plan.
Resiliency is too Expensive
It’s not as popular a view as it was (certainly pre-9/11), but I have met people who say that resiliency costs too much for their business. The (presently) 3 days of issues suffered by BA will reportedly cost in the region of $100m. This doesn't include any potential changes in BA’s stock price as the markets open following the holiday weekend, or the longer-term impact of customers refusing to fly with BA in the future, which looks likely if some of the customer interviews at Gatwick and Heathrow shown on the news channels over the weekend are to be believed.
For those operating in regulated sectors (such as financial services), regulators are increasingly requiring organisations to have good resiliency processes in place, and to be able to demonstrate that they are both documented and tested.
Lesson: Implementing resiliency in a proportionate and appropriate way is cheaper than the alternatives
King Maker or Career Breaker
Poor resiliency can in extreme cases call into question the management of an organisation. In BA’s case, questions have already been asked of Alex Cruz, the CEO. Some of these will of course come from angry customers whose plans for the weekend didn’t include being in Terminal 5 for a couple of days. The more worrisome questions are those from board members, who will no doubt ask questions of Alex and his team in the coming days and weeks.
Of course, people leaving is a natural part of any organisation’s life, and sometimes people leave at a time when they’re critical to that organisation. Having good succession plans in place is another way of improving an organisation’s resiliency. Such documents are often sensitive, and so may not form part of the general BCP documentation; they should, however, be present and in place, at the very least covering senior people and those who have been identified as key to business operations.
Lesson: Resiliency includes people aspects too. Ensure your Business Continuity Plans include succession plans.
Your Next Step
I hope that these lessons are useful to you and help you improve your organisation’s resiliency. If you need help with any part of this, please contact me via LinkedIn or our website (https://www.fifthstep.com), and a member of our highly experienced Resiliency Team will be in contact to help you make the improvements.
Darren Wray