Why Embracing Chaos is Crucial to Your Success and Longevity
Chaos engineering is a popular idea in software engineering, centered around the premise that deliberately breaking a system to gain information will ultimately help improve that system's resiliency.
This article first appeared at The Entrepreneur on 23 Mar 2023.
We are living in turbulent times. Natural disasters are becoming more frequent, a tragic war is again raging in Europe, and a central U.S. bank loved by VCs and startups on both sides of the Atlantic has unexpectedly collapsed in a matter of days. All of this, coupled with technology advancing at breakneck speed, makes it crystal clear that the future is getting increasingly challenging to map out.
Given this toxic mix of volatility and unpredictability, with an occasional black swan event thrown in, what can senior executives do to ensure they are leading companies into a sustainable future? I believe that chaos engineering could help.
Chaos engineering to the rescue
Typically, there are three stages to chaos engineering: forming a hypothesis about how a system will behave when something goes wrong, designing the smallest possible experiment to test that hypothesis in the system and then measuring the impact of any failures at each step to gain a deeper understanding of the system's real-world behavior.
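For readers who want to see these three stages in software terms, here is a minimal Python sketch. It is an illustration under stated assumptions, not a real chaos tool: the toy `Service`, the `run_experiment` helper and the 99% success threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Replica:
    name: str
    healthy: bool = True


class Service:
    """Toy service (hypothetical): a request succeeds if any replica is healthy."""

    def __init__(self, replicas: List[Replica]) -> None:
        self.replicas = replicas

    def handle_request(self) -> bool:
        return any(replica.healthy for replica in self.replicas)


@dataclass
class ExperimentResult:
    hypothesis: str
    success_rate: float
    hypothesis_held: bool


def run_experiment(
    service: Service,
    hypothesis: str,
    inject_failure: Callable[[Service], None],
    rollback: Callable[[Service], None],
    min_success_rate: float,
    requests: int = 100,
) -> ExperimentResult:
    """Stage 2 and 3: run the smallest experiment, then measure the impact."""
    inject_failure(service)
    try:
        successes = sum(service.handle_request() for _ in range(requests))
    finally:
        # Always restore the system, even if the measurement itself fails.
        rollback(service)
    rate = successes / requests
    return ExperimentResult(hypothesis, rate, rate >= min_success_rate)


def stop_one_replica(service: Service) -> None:
    # Smallest possible failure: take a single replica offline.
    service.replicas[0].healthy = False


def restore_all(service: Service) -> None:
    for replica in service.replicas:
        replica.healthy = True


def demo() -> ExperimentResult:
    service = Service([Replica("a"), Replica("b"), Replica("c")])
    # Stage 1: state the hypothesis before touching anything.
    hypothesis = "With one of three replicas down, at least 99% of requests succeed"
    return run_experiment(
        service, hypothesis, stop_one_replica, restore_all, min_success_rate=0.99
    )
```

Calling `demo()` returns an `ExperimentResult` that records whether the hypothesis held. The rollback in the `finally` clause reflects the caution argued for in this article: the experiment should never leave lasting damage behind.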
In a sense, chaos engineering is not that different from the idea of antifragility and can pave a path toward better risk management, ensuring corporate sustainability in the long run. Even so, I believe it should be applied cautiously because some risks are unacceptable. For example, risks that could lead to corporate bankruptcy or loss of life must be identified and avoided at all costs.
To put it another way, having a fire escape plan and conducting regular fire drills is an absolute must (some 20+ years ago, the building I was working in burned down, causing more than 100 deaths and making me take fire safety very seriously). Still, setting fire to a building simply to observe people's behavior to draw up a better safety protocol for future use is not acceptable by any means.
Unacceptable risks
Practicing simulations, scenarios and drills is an essential part of testing risk management processes for any company, provided it does not cause irreparable harm. For example, it is possible to turn a few servers off and see what happens, but kicking a few customers out of a system or from a physical store just to observe the fallout is hardly a practical solution.
To be sure, regarding career longevity, running around creating fires is not an advantageous strategy for a top manager, who should want to build and maintain relationships with their subordinates. After all, the job of a senior executive, first and foremost, is to remove hurdles and enable their team to work and deliver results – not to throw sand in the gears every now and then because a chaos monkey told them to.
So, to make chaos engineering work for both the company and their own career, a CEO should work with the Board of Directors to establish a Risk Committee that includes some senior leaders or a Board member, which lends it greater authority and credibility.
Many companies, including ours, have a two-step process to identify and manage risk. Low- and medium-impact risks are reviewed and managed by line managers, and high-impact risks are communicated to the Board quarterly. How high-impact risks are mitigated depends on the Board's risk appetite and on the potential impact of those risks on the company's sustainability.
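The two-step process above can be sketched as a simple routing rule over a risk register. This is a minimal illustration, not a description of any company's actual tooling: the `Risk` record, the `Impact` levels and the two queue names are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Impact(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Risk:
    title: str
    owner: str  # the line manager who tracks this risk (hypothetical field)
    impact: Impact


def route_risks(risks: List[Risk]) -> Dict[str, List[Risk]]:
    """Split a risk register into the two review tracks.

    Low- and medium-impact risks stay with line managers;
    high-impact risks go into the quarterly Board report.
    """
    routed: Dict[str, List[Risk]] = {"line_managers": [], "board_quarterly": []}
    for risk in risks:
        key = "board_quarterly" if risk.impact is Impact.HIGH else "line_managers"
        routed[key].append(risk)
    return routed


def example_register() -> List[Risk]:
    # Illustrative entries only; not taken from a real register.
    return [
        Risk("Key supplier misses a delivery", "Head of Operations", Impact.MEDIUM),
        Risk("Office printer outage", "Office Manager", Impact.LOW),
        Risk("Primary bank becomes insolvent", "CFO", Impact.HIGH),
    ]
```

Running `route_risks(example_register())` places the first two entries in the line managers' queue and the third in the Board's quarterly report. Keeping the rule this explicit makes it easy to revisit when the Board's risk appetite changes.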
Getting everyone involved
To carry out this work on a day-to-day basis at FunCorp, we task our department heads with keeping track of risks within their areas of focus and determining what could go wrong if certain moving parts stop working. This applies to relationships within the company and to those with suppliers, partners, consultants and other counterparts.
Every company has a history of past incidents. It should therefore make an effort to document and analyze what went wrong and why, including in technology and its other areas of operations. These sorts of 'postmortems' are invaluable resources for any company.
Thus, risk management should be a company-wide effort involving everyone from senior executives and management to line employees. Enabling staff to contribute to risk analysis and mitigation, and thus further highlighting to them their impact on the company's sustainability, goes a long way in building mutual trust and a sense of common ownership. I strongly believe in this approach.
Planning risk response
Why is it important to consider risks and introduce processes to avoid or mitigate their effects?
Poorly planned or poorly executed risk responses, as well as a lack of follow-through in response to organizational learning, can end up causing real damage to a company's reputation and bottom line. This means that risk plans should be regularly reviewed and updated.
It is also important to keep in mind that even companies within the same industry have different risk profiles, so repurposing a competitor's risk playbook would still require additional work to fine-tune it.
It is never easy to pinpoint all the potential risks and get the complete list right on the first try. So, a process for modifying this list, including expanding or editing it as times change, should be part and parcel of routine corporate risk management. After all, any failure to incorporate new risks or modify the potential impacts of existing ones would put any company in a vulnerable position.
Known unknowns and radical transparency
Risk management is about known unknowns. The business environment is mostly about unknown unknowns, and I can confidently say that being reasonably paranoid (i.e., having an up-to-date risk map) is an excellent way to build a toolbox large enough to apply even to currently unknown problems.
To share one example, at my previous business, Aviasales, the third-largest global flight metasearch engine, we had contingency plans for scenarios in which passenger traffic fell by 25%, 50% and 70%. What looked like a purely intellectual exercise ended up being incredibly useful when Covid-19 struck, and instead of endlessly searching for a 'new normal,' we were able to implement our risk response plan with minor modifications. The team was confident in a positive outcome because they themselves had participated in developing the plan.
In my current industry, entertainment tech, other risks are commonplace. For example, all companies that deal with User-Generated Content (UGC) face the risk of exposing their audience to content that is inappropriate in some way. This risk poses an almost existential threat to the business: content that does not meet company standards hurts user retention and worries advertisers. At FunCorp, we identified the need for a proactive solution that aims to block such content before it appears on our platform for everyone to see.
I am a strong believer in corporate sustainability, and risk management is one of its essential pillars. It requires the proactive identification of risks, timely involvement of the Board and C-suite, and radical transparency between management and employees so that problems are never swept under the rug. Given the chaotic nature of risk itself, chaos engineering can also help along the way.