
Mastering Disaster Recovery for Private Equity: Insights from Technical Diligence



Introduction

When we conduct technical diligence at RingStone, a major part of our work consists of uncovering risks that the target company is either underestimating or isn't adequately prepared for. Not surprisingly, this is a top priority for private equity funds making large investments - large enough that losing them, even to a freak accident, is simply not acceptable. While earlier-stage investors might be willing to write off some share of their portfolios in pursuit of unicorns, by the time the larger PE funds are calling, risks with potentially business-ending worst-case scenarios aren't tolerable.


Because we frequently encounter companies at an inflection point of maturity, there are a handful of risks we see over and over again. Foremost among these is inadequate disaster preparedness. I would even go so far as to say it's the most common weakness we uncover, even in well-established companies. In this blog post, we'll look at why this is, why it matters more than ever in 2024, and how to get your house in order.




Why Is Disaster Preparedness Such A Common Weak Spot?

 
“Because people are limited in their ability to comprehend and evaluate extreme probabilities, highly unlikely events are either ignored or overweighted” - Kahneman & Tversky, 1979
 

Disaster preparedness is an interesting beast, deeply rooted in human psychology and cognitive biases. It's been well established by psychologists and behavioral economists that, in general, people don't evaluate risks and rewards rationally, even when given enough information to do so. In particular, we are notoriously bad when it comes to highly unlikely events with very large impacts, a trait savvy companies take advantage of, whether in marketing lottery tickets, insurance, or safety features. Nassim Nicholas Taleb has written on this extensively, including in his acclaimed book The Black Swan, exploring how seemingly unpredictable, highly unlikely, and rare events have outsized impacts on our lives and the economy.


More recently, Dr. Joakim Sundh at Uppsala University in Sweden, who researches how people perceive low-probability, high-impact risks, wrote in a 2024 article in the scientific journal Nature: “… it appears that people tend to mentally edit such probabilities, either up to a less extreme number or down to effectively zero, meaning that low probabilities are either grossly overestimated or discounted entirely.”


He also quotes a landmark paper by Kahneman and Tversky from 1979, stating, “Because people are limited in their ability to comprehend and evaluate extreme probabilities, highly unlikely events are either ignored or overweighted.” 


My own experience in technical diligence confirms this pattern. Frequently, companies have barely tested their disaster recovery plans, even when an extended outage could cost them their customers permanently, or their plans depend entirely on the presence and ability of one or two key employees. Given that a disaster can effectively destroy a company, it seems irrational to balk at the nominal costs required to ensure proper disaster preparedness. Dr. Sundh observes that “we apparently fail so spectacularly to take such small, but important, risks into consideration, particularly in such situations where a very low-cost intervention (such as using a safety belt) would lower this risk substantially.”


Why Disaster Preparation Matters More Than Ever

 
In May 2024, UniSuper, a pension fund with 647,000 members and $135 billion in assets, had its entire Google Cloud account accidentally deleted
 

Now, you might respond that you’ve enabled your cloud backups, and you trust the engineers at the big cloud service providers to know more about backups than you do, so you feel secure. Indeed, a fairly typical scenario we see in diligence is that daily backups are enabled on the cloud, using the default settings provided by the cloud service, and that’s considered good enough. However, some recent events should give us pause for thought.


In May 2024, UniSuper, a pension fund with 647,000 members and $135 billion in assets, had its entire Google Cloud account accidentally deleted by a Google employee. In Google’s own words, “There was an inadvertent misconfiguration of the GCVE service by Google operators due to leaving a parameter blank. This had the unintended and then unknown consequence of defaulting the customer’s GCVE Private Cloud to a fixed term, with automatic deletion at the end of that period.” 


Fortunately, the IT head at UniSuper was appropriately paranoid and had full secondary backups stored outside of Google Cloud, so the fund was able to restore service completely. However, in my experience, this is rarely the case for early- to mid-stage companies.


While this is admittedly an extremely uncommon scenario, it's not the only one in which offsite backups save the day. I have personally had a private equity client ask about this after a cyberattack locked one of their portfolio companies out of its AWS account. Since then, the ability to restore operations without access to the main cloud account has been considered a necessity for any new investment the fund makes.


Similarly, in 2024, London Drugs, a Canadian retail chain with 79 stores, closed for over a week while it rebuilt its infrastructure following a devastating ransomware attack.


 
To evaluate risks rationally, we must consider the worst-case scenario and decide if the measures in place are sufficient, given their likelihood and impact
 

To evaluate risks rationally, we must consider the worst-case scenario and decide if the measures in place are sufficient, given their likelihood and impact. A surprising number of companies have never gotten around to doing this properly. When we ask exactly how long they could be offline before losing their customers, they often provide educated guesses, meaning they don’t truly know the potential impact. This is a critical step in disaster preparedness. If you don’t know what the business impact of a worst-case scenario is, you can’t decide how much you should spend preparing for it. 


Ask yourself: if there were a disaster and your primary backups failed or became inaccessible, how long would it take to restore service? Would you still retain your customers after that much downtime? Is saving the nominal cost of secondary backups worth that risk?


You may also have Service Level Agreements (SLAs) in place with your customers that define the maximum allowable downtime. For some enterprise customers, these may have financial penalties for failures. If this is the case, have you calculated the potential financial impact if your primary backups fail? Could your business survive?
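

Even a rough, back-of-the-envelope calculation is better than none. The sketch below shows the shape of the exercise; every figure in it (fees, credit rate, outage length) is hypothetical, not drawn from any real contract.

```python
# Back-of-the-envelope sketch of SLA penalty exposure during an extended
# outage. Every figure below is hypothetical - substitute your own contract
# terms and your honest worst-case recovery time.
monthly_fees = {"customer_a": 40_000, "customer_b": 25_000}  # hypothetical monthly fees
credit_per_day_down = 0.10  # hypothetical: 10% of the monthly fee credited per day of outage
days_down = 5               # e.g. rebuilding from scratch without a tested plan

exposure = sum(fee * credit_per_day_down * days_down for fee in monthly_fees.values())
print(f"SLA credits owed after {days_down} days down: ${exposure:,.0f}")
```

Run the same arithmetic against your real contracts and your honest worst-case recovery time, and the case for secondary backups usually makes itself.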


What Does Prepared Look Like In 2024?

Given the ever-increasing importance of disaster preparedness, what does being prepared look like in 2024? Of course, there's a wide range, and what is feasible for one company may be unaffordable for another, but I can share what our clients now consider the gold standard. The goal is to be adequately prepared for disasters of all kinds, including physical catastrophes, human error, the loss of key personnel, malicious insiders, and sophisticated cyberattacks. Moreover, the cost of being prepared should be reasonable when weighed against the likelihood and potential cost of a disaster and against the business's revenue.


Secondary Data Backups

At a minimum, backups of all information necessary for operational recovery should be made at an appropriate frequency and stored in a secondary location, one that is both physically separate and managed by a different company. While these copies might not be made quite as frequently as the primaries, they should still be frequent enough that your business can survive if your primary backups are inaccessible. Secondary backups are a classic example of our poor ability to evaluate low-probability events: the likelihood of needing them is indeed very low, but the cost of automating secondary copies is so small, and the potential impact of not having them so severe, that it makes no sense to skip this step.
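

As an illustration of how little automation this requires, here is a minimal sketch that copies the newest primary backup object to a bucket at a different provider. It assumes primary backups already land in an S3 bucket and that the secondary provider exposes an S3-compatible API; the bucket names, endpoint, and credentials are placeholders, not a prescribed setup.

```python
# Minimal sketch: copy the newest primary backup to a bucket at a different
# provider. Assumes primary backups are objects in an S3 bucket and the
# secondary provider exposes an S3-compatible API; all names, endpoints,
# and credentials below are placeholders.
import boto3

primary = boto3.client("s3")  # uses your normal AWS credentials
secondary = boto3.client(
    "s3",
    endpoint_url="https://s3.secondary-provider.example",  # different company
    aws_access_key_id="SECONDARY_KEY_ID",
    aws_secret_access_key="SECONDARY_SECRET",
)

# Find the most recent backup object in the primary bucket.
objects = primary.list_objects_v2(Bucket="primary-backups")["Contents"]
latest = max(objects, key=lambda obj: obj["LastModified"])

# Stream it across to the secondary location without writing to local disk.
body = primary.get_object(Bucket="primary-backups", Key=latest["Key"])["Body"]
secondary.upload_fileobj(body, "offsite-backups", latest["Key"])
print(f"Copied {latest['Key']} to the secondary provider")
```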


Through discussions with customers, you should also determine what an acceptable Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are. If your secondary backups are weekly, and your customers will all leave forever if you can only restore them to data from a week ago (the recovery point), then your secondary backups are inadequate.
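

Once an RPO is agreed, it is worth checking it continuously rather than assuming the copies are happening. A small sketch along these lines, assuming a hypothetical 24-hour RPO and the same placeholder secondary bucket as above, could run on a schedule and alert when the newest offsite copy is too old.

```python
# Minimal sketch: alert if the newest secondary backup is older than the
# agreed RPO. The 24-hour RPO and the bucket/endpoint names are hypothetical
# placeholders; use whatever values you have actually agreed with customers.
from datetime import datetime, timedelta, timezone
import boto3

RPO = timedelta(hours=24)  # hypothetical recovery point objective

secondary = boto3.client("s3", endpoint_url="https://s3.secondary-provider.example")
objects = secondary.list_objects_v2(Bucket="offsite-backups").get("Contents", [])
if not objects:
    raise SystemExit("No secondary backups found at all - the RPO cannot be met")

newest = max(obj["LastModified"] for obj in objects)
age = datetime.now(timezone.utc) - newest
if age > RPO:
    raise SystemExit(f"Newest secondary backup is {age} old, exceeding the {RPO} RPO")
print(f"Secondary backups are {age} old, within the {RPO} RPO")
```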


Infrastructure Backups

In the case of a complete cloud lock-out, whether from human error or a sophisticated cyberattack, can you rebuild your infrastructure fast enough to save your business? A common weakness we see is an infrastructure rebuild estimated at "probably around two days" of manual effort. If you aren't certain about the timeline, the rebuild will almost certainly take longer than anticipated. Adequate preparation means having the automation to rebuild quickly, knowing how long this will take (because you've tested it), and knowing that this timeline is acceptable to your business. Anything requiring human intervention carries a chance of errors, delays, or personnel issues.
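

One way to make the timeline concrete is to wrap the rebuild in a simple drill harness that times every step. The sketch below assumes the rebuild is already broken into scripts; the step names are hypothetical stand-ins for whatever your own automation provides (infrastructure-as-code applies, restore scripts, smoke tests, and so on).

```python
# Minimal sketch of a timed rebuild drill. Assumes the rebuild is already
# broken into scripted steps; the script names below are hypothetical
# placeholders for your own automation.
import subprocess
import time

REBUILD_STEPS = [
    ["./provision_network.sh"],    # hypothetical step scripts
    ["./provision_compute.sh"],
    ["./restore_database.sh"],
    ["./deploy_application.sh"],
]

total_start = time.monotonic()
for step in REBUILD_STEPS:
    start = time.monotonic()
    subprocess.run(step, check=True)  # stop the drill if a step fails
    print(f"{step[0]} took {time.monotonic() - start:.0f}s")

print(f"Full rebuild took {time.monotonic() - total_start:.0f}s")
```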


Documentation and Key Person Risks

Related to the above is the question of who can run a restoration. We often find that only one or two people can restore services, with no plan in place if they are unavailable. The thinking might be that the odds of a disaster coinciding with both of those people being unavailable are exceedingly low. However, this is a common mistake in risk evaluation. One might calculate the probability of all three events (disaster, unavailability of person one, unavailability of person two) by multiplying the probability of each, giving an incredibly low likelihood. But that method only applies to independent events. As Nate Silver explains in The Signal and the Noise, confusing interconnected risks with isolated ones is a common blind spot and can lead to a house-of-cards effect. The real question we should ask is, "If there is a disaster, what is the likelihood of the other risks also being a problem?"


If you need the backups because of a physical disaster, it's also possible that the two key people - who are friends and go out of town together - can’t access the office because of the same disaster. Physical catastrophes often present correlated risks. Personnel problems do so as well: if you’re protecting against a disgruntled employee causing deliberate damage, that employee could also be the one responsible for restorations. If you’re safeguarding against an employee leaving, the departure may affect multiple team members. As any HR person can tell you, employees often leave in clumps.


The only 100% reliable solution here is to have your disaster recovery plan automated and documented to the point that someone who has never done it before (but who has the appropriate technical background) could make a recovery in a timeframe that wouldn’t jeopardize the business.  Ensure the recovery documentation and scripts don’t only live in your primary cloud account. A common mistake is storing recovery documentation on the same cloud-hosted system that might be down. Disaster recovery plans should be printed and saved to removable media, such as thumb drives, stored both in and outside of the office.


Recovery Routing

Another often overlooked issue is how your customers will access the restored service.  If your main cloud account is compromised, and the same cloud provider manages all your DNS records, how long will it take to get URLs routed to the new cloud setup? If secondary URLs are required, how will customers find them? Ideally, DNS routing is handled outside the main cloud account, and there are also communication channels to customers that do not depend on the main environment. This can be as simple as a status board hosted on a third-party service or as sophisticated as a routing setup that fails over to a pre-provisioned disaster recovery environment on certain conditions. Planning how to inform customers during recovery is a critical part of disaster preparedness and should be included in tabletop exercises.
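

A small but telling detail here is DNS time-to-live: even after you repoint records, customers' resolvers keep serving the old addresses until the TTL expires. A quick sketch using the dnspython library (the domain names below are placeholders) shows how long your current records would keep customers pinned to a dead environment.

```python
# Minimal sketch: check the TTLs on customer-facing DNS records, since a long
# TTL extends how long cached lookups keep pointing at the old environment
# after you repoint. Requires the dnspython package; the domain names are
# placeholders.
import dns.resolver

for name in ["app.example.com", "api.example.com"]:
    answer = dns.resolver.resolve(name, "A")
    print(f"{name}: TTL {answer.rrset.ttl}s -> {[r.address for r in answer]}")
```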


Recovery Testing

Finally, a disaster recovery plan is not a plan if it hasn’t been tested - that’s a disaster recovery aspiration. Only top-to-bottom tests will reveal what might have been overlooked and show how long recovery really takes. This is a very common gap, especially in early-stage companies. While testing does take time, it’s important if you’re courting enterprise customers or later-stage investors, as this is a key question they’ll ask. Tests help you determine what your RPO and RTO can be in SLAs with customers, ensure your documentation is thorough, and clarify who knows how to execute each step. Tabletop exercises, where teams talk through the plan, are a great preliminary step and are especially useful to determine whether someone not intimately familiar with the systems could run a recovery by following the instructions.
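

A full test should also end with objective pass/fail checks against the restored environment, not just a feeling that things look right. Something as simple as the sketch below, which assumes a placeholder health endpoint and landing page on the recovered system, gives the drill a clear finish line.

```python
# Minimal sketch of a post-restore smoke test for a recovery drill. Assumes
# the restored environment exposes a health endpoint and a customer-visible
# landing page; both URLs are placeholders for your own services.
import urllib.request

CHECKS = {
    "health endpoint": "https://dr.example.com/healthz",
    "landing page": "https://dr.example.com/",
}

for label, url in CHECKS.items():
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            assert response.status == 200, f"unexpected status {response.status}"
        print(f"OK: {label}")
    except Exception as exc:
        print(f"FAILED: {label} ({exc})")
```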


A good way to justify the expense of full testing to company leadership is to point out that an automated recovery system can double as the infrastructure needed to offer isolated installations to enterprise customers. Running a complete recovery into a fresh cloud account allows you to assess whether this service is feasible and helps determine pricing.


Conclusion

 
The first thing Van Halen would do on arrival was check for the M&Ms backstage, and if they weren’t there or the brown ones weren’t missing, the entire stage rig would be line-checked meticulously
 

When we're doing technical diligence at RingStone, asking about the DR plan is a bit like the infamous Van Halen M&M test. In the eighties, when the rock band was touring with some of the heaviest stage sets in history, they buried a line deep in the middle of their technical rider stating that the backstage area had to have a bowl of M&Ms with all the brown ones removed. The first thing they'd do on arrival was check for the M&Ms backstage. If they weren't there, or the brown ones weren't missing, the band would meticulously check the entire stage rig. As David Lee Roth explained, this was a perfect sniff test - if the M&Ms were wrong, there were almost always other safety concerns.


Similarly, when we ask about the DR plan, it gives us an immediate sense of the maturity of an organization and its ability to mitigate risks. When there’s no tested DR plan, we dig much deeper for infrastructure-related risks. When a plan is absent altogether, you can guarantee the cost of implementing adequate disaster preparedness will be factored into the investor’s business model - and thus out of the valuation - at a price that won’t be favorable to the target company. Given this, getting the DR plan together more than pays for itself. 


Hopefully, this brief tour of disaster preparedness has given you food for thought and some ideas for improving your own readiness. If you need guidance in making sure you're ready for diligence in this and other areas, RingStone is available to help.


About The Author

Iain C.T. Duncan has spent the last 20 years working in the software industry as a CTO, developer, software architect, and consultant, building B2B and internal web applications at startups, agencies, and early-stage companies in Canada and the U.S. 


At RingStone, Iain advises private equity firms globally, conducting technical due diligence for early and mid-stage companies. He has six years of experience in the diligence sector and has participated in hundreds of efforts as a practitioner, reviewer, and trainer. Before entering the diligence sector, he worked on software for e-commerce, association management, non-profit fundraising, genomics, and online education, among other industries.


An active open-source developer and researcher, Iain is currently completing an interdisciplinary PhD in Computer Science and Music at the University of Victoria, working on applications of Scheme Lisp to algorithmic music and music pedagogy. He is the author of the Scheme for Max open-source programming extension to the Max/MSP audio programming platform and is the founder and developer of the online music education startup SeriousMusicTraining.com. Contact Iain at iain.dunwhite@ringstonetech.com.



