Chaos engineering is the practice of intentionally injecting failure into production systems and their configurations to measure their resilience to attacks and faults. It’s a confidence-building experiment that checks whether a system is strong enough to withstand the turbulent conditions it may encounter in production.
Chaos engineering began at Netflix, where a team developed the initial concept of chaos testing and began promoting its importance to others. In 2008, a significant database corruption event stopped Netflix from shipping DVDs (its core business at the time) for about three days. In response, the company began planning how to move away from an architecture with a single point of failure and started migrating to the cloud. Because cloud service providers weren’t as mature then as they are now, the full migration from the data center to the cloud was not completed until early 2016.
The nature of Netflix’s engineering work at the time was such that engineers had full autonomy and responsibility to write their features and deploy them to production without anyone questioning or directing them. Chaos engineering was created, in part, to address the errors that can arise from this style of development. The goal was to move away from a development model that assumed there wouldn’t be any outages and toward one that assumed outages were bound to happen, which makes system resilience a necessity.
Netflix introduced Chaos Monkey in 2011. Chaos Monkey is an application that goes through a list of clusters, selects a random instance from each cluster, and turns it off without warning during work hours every workday, so engineers can see how the remaining systems respond to the outage.
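The selection logic described above can be sketched in a few lines of Python. This is not Netflix’s implementation; the function names, the 9-to-17 work-hours window, and the `terminate` callback are assumptions made for illustration.

```python
import random
from datetime import datetime


def is_work_hours(now, start_hour=9, end_hour=17):
    """True Monday-Friday between start_hour (inclusive) and end_hour (exclusive).

    The 9-17 window is an assumed default, not a documented Chaos Monkey setting.
    """
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def pick_victims(clusters, rng=None):
    """Pick one random instance from every non-empty cluster."""
    rng = rng or random.Random()
    return {
        name: rng.choice(instances)
        for name, instances in clusters.items()
        if instances
    }


def run_once(clusters, terminate, now=None, rng=None):
    """Terminate one instance per cluster, but only during work hours.

    `terminate` is a hypothetical callback that shuts down an instance.
    Returns the mapping of cluster name to terminated instance.
    """
    now = now or datetime.now()
    if not is_work_hours(now):
        return {}
    victims = pick_victims(clusters, rng)
    for cluster, instance in victims.items():
        terminate(cluster, instance)
    return victims
```

Restricting runs to work hours means engineers are at their desks to observe and respond when an instance disappears.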
Today, chaos engineering is a well-known, widely practiced process that helps you get to know your system and its vulnerabilities better to improve its overall reliability and durability.
Overview
Chaos engineering isn’t just about causing chaos; it exists to build confidence in production systems and to find the sources of failure within an acceptable timeframe.
Injecting chaos into software systems falls into three major categories:
- State Attacks: Test how your system handles unexpected changes to the state of your environment, like app crashes, outages, or other failures
- Resource Attacks: Test how your system handles unexpected, rapid increases in demand on computing resources such as CPU, memory, or disk
- Network Attacks: Test how your system handles unpredictable network conditions, such as added latency, packet loss, or dropped connections
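To make the three categories concrete, here is a minimal Python sketch with one toy fault per category. All names (`with_state_attack`, `with_network_attack`, `resource_attack`, `InjectedFault`) are hypothetical, and real tooling injects faults at the infrastructure level rather than inside application code.

```python
import random
import time


class InjectedFault(RuntimeError):
    """Raised by the state attack to mimic a crashed dependency."""


def with_state_attack(call, failure_rate, rng=None):
    """Wrap `call` so it fails with probability `failure_rate`."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("simulated crash")
        return call(*args, **kwargs)

    return wrapped


def with_network_attack(call, delay_seconds, sleep=time.sleep):
    """Wrap `call` so every invocation first waits, simulating latency."""

    def wrapped(*args, **kwargs):
        sleep(delay_seconds)
        return call(*args, **kwargs)

    return wrapped


def resource_attack(megabytes):
    """Allocate a buffer; the memory stays in use while the caller holds it."""
    return bytearray(megabytes * 1024 * 1024)
```

For example, `with_network_attack(fetch_profile, 0.5)` would return a version of a (hypothetical) `fetch_profile` function that responds half a second late on every call.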
In all three categories, implementing chaos engineering generally follows the same process.
Implementing Chaos Engineering
A chaos injection session involves a lot of careful planning, observation, and reflection. When you’re preparing for chaos engineering and introducing it into your development practice, you’ll generally take the following steps:
- Start by determining your steady-state hypothesis, which describes your system’s normal behavior and how you expect that behavior to hold up under the chaos injection
- Plan which real-world turbulent conditions you will use and inject them into your system. In other words, introduce the chaos to see how your system responds
- Return to your steady-state hypothesis and see how your system’s response measures up. Take notes on where your system excelled and where its performance wasn’t ideal
- Collect metrics about your system’s performance. Observability tools will be helpful here. You want measurable data that you can use to refine and strengthen your system’s response. Deciding which KPIs and performance metrics to focus on is crucial to getting solid insights into your system’s resiliency. Since chaos engineering seeks to validate and enhance your system’s durability, it’s good practice to use metrics related to user experience and availability
- Make necessary changes to improve the reliability of your system
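The steps above can be sketched as a small experiment runner. This is a minimal sketch under stated assumptions: `measure`, `inject`, `rollback`, and `hypothesis` are hypothetical callables you would supply, and the metrics are plain dictionaries rather than data from a real observability tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Metrics = Dict[str, float]


@dataclass
class ExperimentResult:
    baseline: Metrics      # metrics collected before the injection
    during: Metrics        # metrics collected while the fault was active
    hypothesis_held: bool  # whether the steady state survived the fault


def run_experiment(
    measure: Callable[[], Metrics],
    inject: Callable[[], None],
    rollback: Callable[[], None],
    hypothesis: Callable[[Metrics], bool],
) -> ExperimentResult:
    """Run one chaos experiment and compare the result to the hypothesis."""
    baseline = measure()
    # Do not inject chaos into a system that is already unhealthy.
    if not hypothesis(baseline):
        raise RuntimeError("system is not in its steady state; not injecting")
    inject()
    try:
        during = measure()
    finally:
        # Always undo the fault, even if measuring fails.
        rollback()
    return ExperimentResult(baseline, during, hypothesis(during))
```

The returned result gives you the before and after metrics side by side, which is the raw material for the note-taking and improvement steps.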
Principles of Chaos Engineering
There are five advanced principles of chaos engineering. The degree to which you implement and prioritize these five principles in your chaos engineering tests directly affects the reliability of the practice and your system.
Build a Hypothesis Around Steady-State Behavior
Chaos engineering borrows from the scientific method: each test is treated as an experiment, and an experiment starts with a hypothesis. You want to create a working theory that your investigation will confirm or disprove.
A general form of the hypothesis for chaos engineering is as follows, where the unknown cause of failure is replaced by x:
When x fails, the software or system will still be available for customers.
Using this formula encourages you to focus on how the system is expected to behave and to compare its actual response to the attack against this baseline.
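A hypothesis is easiest to confirm or disprove when "available" is a number. The sketch below assumes availability is measured as the share of successful requests and uses a 99% threshold; both the metric and the threshold are illustrative assumptions, not values prescribed by the practice.

```python
def availability(successes, total):
    """Fraction of requests that succeeded during the observation window."""
    if total <= 0:
        raise ValueError("total must be positive")
    return successes / total


def hypothesis_holds(successes, total, min_availability=0.99):
    """True if the system stayed available enough while x was failing."""
    return availability(successes, total) >= min_availability


# 995 of 1,000 requests succeeded while the dependency was down.
print(hypothesis_holds(995, 1000))  # True
# Only 980 of 1,000 succeeded, which is below the assumed 99% threshold.
print(hypothesis_holds(980, 1000))  # False
```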
Vary Real-World Events
This principle encourages chaos engineers to model the faults they inject on events that happen in real life, such as server crashes, traffic spikes, or malformed responses. This may mean choosing variables that are tied more closely to the users’ experience than to the engineers’ view of the system.
Run Experiments in Production
To build confidence and test resilience properly, chaos engineering experiments should run in the production environment rather than only in staging. Systems behave differently depending on their environment and traffic patterns, so production is the only place where you can see how your system responds under real conditions. Running experiments in production also avoids the cost of building and maintaining a complete replica of the environment.
Automate Experiments to Run Continuously
This principle strives to abstract away the headaches of navigating complex systems and running chaos experiments over large sets of instances. In line with the second principle, the probable faults in a large and complex system might be too numerous to test by hand, which makes automation critical.
Minimize the Blast Radius
Damaging or breaking too much in production may adversely affect customer traffic, which is the very scenario chaos engineering seeks to prevent. Experimenting with your system in production creates the potential for customers to be impacted by unexpected errors. When performing chaos engineering, it’s crucial to define in advance how much short-term impact is acceptable and to scope each experiment so that its overall adverse effects stay minimal.
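One simple way to bound the blast radius is to cap the share of instances an experiment may touch and to define an abort condition up front. The following is a minimal sketch; `choose_targets`, `should_abort`, and the example thresholds are hypothetical.

```python
import math
import random


def choose_targets(instances, max_fraction, rng=None):
    """Randomly pick at most `max_fraction` of the instances as targets.

    The count is rounded down, so a small fleet may yield no targets at all.
    """
    if not 0 < max_fraction <= 1:
        raise ValueError("max_fraction must be in (0, 1]")
    rng = rng or random.Random()
    count = math.floor(len(instances) * max_fraction)
    return rng.sample(instances, count)


def should_abort(error_rate, abort_threshold):
    """True once the observed error rate exceeds the agreed limit."""
    return error_rate > abort_threshold
```

Rounding down is a deliberately conservative choice here: with a 10% cap, a fleet of five instances produces no targets rather than exposing 20% of capacity.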
Benefits of Chaos Engineering
Chaos engineering enables you to better understand your system, its vulnerabilities, and how to address them both proactively and reactively. Let’s take a look at some of the benefits below:
Increase System Durability
The main goal of chaos engineering is to make the system more durable and reliable. No matter how robust the cloud architecture of a software application is, there’s always the potential for vulnerabilities to arise. Chaos engineering enables you to locate these weak areas and make the system more resilient and fault-tolerant based on what you find.
Reduce and Prevent Outages
Chaos engineering doesn’t necessarily stop outages from happening altogether. However, implementing chaos engineering will reduce the likelihood of severe and unpredictable outages. It will also give you a strategy for understanding the risks inherent in your software system.
Improve Incident Management
Practice makes perfect, and you need to practice to improve your incident response and management. Chaos engineering lets you hone your incident management skills by creating incidents for you and your system to respond to. This keeps you on your toes and expands your knowledge of the kinds of errors or failures that can occur in the system — and how to deal with them effectively.
Gain System Insights
Similar to technical debt, software systems can have dark debts (unknown weaknesses or inherent vulnerabilities) that aren’t visible until they cause an issue. Dark debts aren’t easy to spot while you’re building a system, and most traditional testing methods can’t reveal them.
Chaos engineering experiments will help you gain insights into how your system would respond to turbulent conditions in production environments. Actively seeking out errors enables you to identify dark debts and other performance-related issues while simultaneously identifying the parts of your system performing as — or better than — expected.
Key Takeaways
- Chaos engineering describes the practice of performing experiments on your distributed system to build confidence in its ability to withstand unexpected production conditions.
- When implementing chaos engineering, you should determine which kinds of attacks you’ll prepare: state, resource, or network. Then, carefully plan how you’ll implement chaos engineering and closely observe your system’s response.
- Implementing chaos engineering helps you increase the durability of your system, reduce and prevent outages, improve your incident response strategy, and gain insights into system performance.