Last year, Amazon’s Simple Storage Service (S3) at its US-EAST-1 data centres went down for four hours. Organisations that stored their content in just one data centre found that their services stopped working. Websites such as Business Insider, Quora, Associated Press, Expedia, Reddit, Netflix, Medium, Snapchat and Slack either crashed entirely or were broken.
We still do not know exactly what happened, despite Amazon explaining: “Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region.”
The outage was subsequently found to have been caused by a typo, a simple human error. It highlights just how dependent internet infrastructure has become on Amazon. The outage even incapacitated AWS’s own ‘customer care’ service dashboard.
AWS are now working to ensure that this situation doesn’t happen again by adding additional safeguards. But the whole incident underscores that even the most sophisticated cloud IT infrastructure is not flawless. It also shows how reliant we have become on the large cloud providers in general. We even heard reports of people who couldn’t turn their internet-connected lights on at home.
This may make organisations consider moving certain workloads to a private cloud. But ultimately the primary lesson here is that moving to the cloud does not remove the necessity for proper systems planning. We don’t put the redundancy for mission-critical systems in the same building as the mission-critical systems themselves, so why would we do anything else when it comes to our cloud setup?
We separate things for a reason and there are clear rules for this. Banks house their back-up systems in entirely different cities to guard against things like floods, bombs or even nuclear attacks. Most organisations understand this fundamental requirement for physical separation, but many large businesses lost all functionality because they stored their data in a single Amazon region. If they had hosted that information in different regions, or shared the capacity between them, they would have been protected against the outage. You wouldn’t put your backup server next to your main server. So why would you put your backup cloud systems within the same region as your primary cloud systems?
Group Head of Systems Architecture, Jake Greenland, highlights some of the key things to consider to mitigate the risk of cloud failure, and areas where you can prepare your business for the possibility of it.
How to mitigate the risk from cloud systems failure
- Use geographic separation to store your data across multiple regions at the very least. But preferably use different providers across different regions. Put simply, don’t be totally reliant on just S3!
- Use similar techniques to those you apply to your physical infrastructure, such as SDN and global load balancing. These principles are still valid for building redundancy and storage in the cloud. Using the cloud does not remove the need to do this; it just changes the way you do it.
- Apply normal rules to your risk analysis and understand that all systems have limitations.
- Use SDN to provide seamless transparency between the two networks.
- Try global load balancing to spread traffic between systems using content distribution networks.
- Use hybrid cloud capabilities so that some functionality can fall back to internal systems if you need or want it to.
- Perform adequate risk analysis as you would do for anything else, and don’t assume that your cloud provider is handling risk mitigation for you. You are ultimately responsible.
- Realise that any compensation from your cloud provider is unlikely to help you after the incident, especially since some organisations can lose millions of pounds from just one hour of outage.
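
To make the first point concrete, here is a minimal Python sketch of writing to several storage locations and reading from the first one that responds. It is an illustration under stated assumptions rather than a recommended implementation: `RegionalStore`, `replicated_put` and `failover_get` are hypothetical names, and the in-memory store simply stands in for whatever client your provider or on-premises system actually exposes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


class StoreUnavailable(Exception):
    """Raised when a storage location cannot be reached."""


@dataclass
class RegionalStore:
    """Hypothetical stand-in for one storage location (a region or a provider)."""
    name: str
    objects: Dict[str, bytes] = field(default_factory=dict)
    available: bool = True

    def put(self, key: str, data: bytes) -> None:
        if not self.available:
            raise StoreUnavailable(f"{self.name} is unavailable")
        self.objects[key] = data

    def get(self, key: str) -> bytes:
        if not self.available:
            raise StoreUnavailable(f"{self.name} is unavailable")
        return self.objects[key]


def replicated_put(stores: List[RegionalStore], key: str, data: bytes) -> List[str]:
    """Write to every reachable store and return the names that accepted the write."""
    written = []
    for store in stores:
        try:
            store.put(key, data)
            written.append(store.name)
        except StoreUnavailable:
            continue  # keep going: one failed location must not block the others
    if not written:
        raise StoreUnavailable("no store accepted the write")
    return written


def failover_get(stores: List[RegionalStore], key: str) -> Tuple[str, bytes]:
    """Read from the first store, in priority order, that can serve the key."""
    for store in stores:
        try:
            return store.name, store.get(key)
        except (StoreUnavailable, KeyError):
            continue  # location down or object missing there: try the next one
    raise StoreUnavailable(f"no store could serve {key!r}")


if __name__ == "__main__":
    # Assumed layout: two regions with one provider, plus a second provider.
    stores = [
        RegionalStore("provider-a/us-east-1"),
        RegionalStore("provider-a/eu-west-1"),
        RegionalStore("provider-b/europe"),
    ]
    replicated_put(stores, "index.html", b"<h1>hello</h1>")
    stores[0].available = False  # simulate the primary region going down
    print(failover_get(stores, "index.html"))
```

With the primary location marked unavailable, the read is served from the second store in the list. The ordering of the list is the only failover policy here; a real deployment would also need to think about consistency between copies and how quickly an outage is detected.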
How to prepare for failure
- Identify critical systems, applications and infrastructure.
- Identify how to make these systems redundant.
- Build a disaster recovery plan based on a thorough risk analysis that identifies a risk mitigation path. Even the smallest of businesses can afford a cloud disaster recovery plan.
- Plan in advance by replicating critical servers and ensuring you have the means for a seamless transition for your critical processes.
- Load balance across multiple providers and introduce elements of SDN.
- Don’t rely on the redundancy offered by public cloud. Just because your provider has multiple servers doesn’t mean that you don’t need them too.
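
The first two steps in this list can be sketched as a simple inventory check. The Python below is a hedged illustration only: the `System` and `Deployment` records, the example inventory and the thresholds (at least two regions, ideally two providers) are assumptions made for this sketch, chosen to mirror the advice above rather than taken from any particular tool.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Deployment:
    """One running copy of a system: which provider and region hosts it."""
    provider: str
    region: str


@dataclass
class System:
    """A system in the (hypothetical) inventory."""
    name: str
    critical: bool
    deployments: List[Deployment]


def redundancy_gaps(systems: List[System]) -> List[Tuple[str, str]]:
    """Return (system name, problem) pairs for critical systems lacking separation."""
    gaps = []
    for system in systems:
        if not system.critical:
            continue  # only critical systems are audited in this sketch
        locations = {(d.provider, d.region) for d in system.deployments}
        providers = {d.provider for d in system.deployments}
        if len(locations) < 2:
            gaps.append((system.name, "fewer than two regions"))
        elif len(providers) < 2:
            gaps.append((system.name, "all regions with one provider"))
    return gaps


if __name__ == "__main__":
    # Example inventory; every name and location here is made up.
    inventory = [
        System("checkout", True, [Deployment("provider-a", "us-east-1")]),
        System("image-store", True, [
            Deployment("provider-a", "us-east-1"),
            Deployment("provider-a", "eu-west-1"),
        ]),
        System("customer-portal", True, [
            Deployment("provider-a", "us-east-1"),
            Deployment("on-premises", "london"),
        ]),
        System("internal-wiki", False, [Deployment("provider-a", "us-east-1")]),
    ]
    for name, problem in redundancy_gaps(inventory):
        print(f"{name}: {problem}")
```

In this made-up inventory, `checkout` is flagged for sitting in one region and `image-store` for depending on one provider, while `customer-portal` passes because it can fall back to an internal system. A list like this is only a starting point for the risk analysis; it says nothing about whether the copies are actually kept in sync or tested.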
Our experts can design redundant systems and build traditional infrastructure to be resilient to this kind of damage. There is no reason for this to happen if you design your systems properly. Identify the risks, plan accordingly and have a mitigation plan. We have experience with hybrid cloud integration and inter-cloud deployments. So if you’re running your cloud systems with a single cloud provider and want to learn more about how to protect your business, get in touch.