In 1984, Charles Perrow coined the term normal accidents to describe systems so prone to failure that no amount of safety procedures could eliminate accidents entirely. According to Perrow, normal accidents are not the product of bad technology or incompetent staff. Systems that experience normal accidents display two important characteristics.
They are tightly coupled. When two separate components are dependent on each other, they are said to be coupled. In tightly coupled situations, there’s a high probability that a change to one component will affect the other. For example, if a change to one code base requires a corresponding change to another code base, the two repositories are tightly coupled. Loosely coupled components, on the other hand, are ones where a change made to one component doesn’t necessarily affect the other.
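As a minimal sketch of that distinction (all names here are invented for illustration), tight coupling can look like one piece of code depending on the exact internal structure of another, while loose coupling goes through an agreed interface:

```python
# Tightly coupled: the report code reaches directly into the billing
# module's internal data structure. If billing renames "amt_cents" or
# reorganizes its records, the report code breaks too.
billing_records = [{"amt_cents": 1250}, {"amt_cents": 300}]

def report_total_tight():
    return sum(r["amt_cents"] for r in billing_records)

# Loosely coupled: the report code depends only on a stable interface.
# Billing can change its internals freely as long as total_cents()
# keeps its contract.
def total_cents():
    return sum(r["amt_cents"] for r in billing_records)

def report_total_loose():
    return total_cents()

print(report_total_tight())  # 1550
print(report_total_loose())  # 1550
```

Both versions compute the same answer today; the difference only shows up when one side changes.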
Tightly coupled systems produce cascading effects. One change creates a response in another part of the system, which creates a response in another part of the system. Like a domino effect, parts of the system start executing without a human operator telling them to do so. If the system is simple, it is possible to anticipate how failure will happen and prevent it, which leads to the second characteristic of systems that experience normal accidents.
They are complex. Big systems are often complex, but not all complex systems are big. Signs of complexity in software include the number of direct dependencies and the depth of the dependency tree, the number of integrations, the hierarchy of users and their ability to delegate, the number of edge cases the system must control for, the amount of input from untrusted sources, the amount of legal variety in that input, and so on. Computer systems naturally grow more complex as they age, because over time we tend to add more and more features to them, which increases at least a few of the above characteristics. Computer systems also tend to start off tightly coupled, and they may stay that way if we don’t prioritize refactoring the code occasionally.
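One of those signals, dependency-tree depth, is easy to measure. Here is a small sketch (the graph data is invented) of computing the longest chain of dependencies below a package:

```python
# Hypothetical dependency graph: each package maps to the packages
# it depends on directly.
deps = {
    "app": ["web", "auth"],
    "web": ["http"],
    "auth": ["http", "crypto"],
    "http": [],
    "crypto": [],
}

def depth(pkg, graph):
    """Length of the longest dependency chain starting at pkg."""
    children = graph.get(pkg, [])
    if not children:
        return 0
    return 1 + max(depth(c, graph) for c in children)

print(depth("app", deps))  # 2: app -> auth -> crypto
```

The absolute number matters less than the trend: if this depth keeps climbing release after release, complexity is accumulating.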
Tightly coupled and complex systems are prone to failure because the coupling produces cascading effects, and the complexity makes the direction and course of those cascades impossible to predict.
If your goal is to reduce failures or minimize security risks, your best bet is to start by evaluating your system on those two characteristics: where are things tightly coupled, and where are things complex? Your goal should not be to eliminate all complexity and all coupling; there will be trade-offs in each specific instance.
Suppose you have three services that need to access the same data. If you configure them to talk to the same database, they are tightly coupled.
Such coupling creates a few potential problems. To begin with, any of the three services could make a change to the data that breaks the other two services. Any changes to the database schema have to be coordinated across all three services. By sharing a database, you might lose the scaling benefit of having three separate services, because as load increases on one service, it is passed down to the database, and the other services see a dip in performance.
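The first of those problems can be shown in a few lines. In this sketch (the schema is invented), one service alters the shared table for its own needs and silently breaks a query that another service still relies on:

```python
import sqlite3

# One database shared by multiple services.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
db.execute("INSERT INTO orders VALUES (1, 'shipped')")

# Service A renames a column to suit its own code...
db.execute("ALTER TABLE orders RENAME COLUMN status TO state")

# ...and service B, still querying the old name, now fails.
try:
    db.execute("SELECT status FROM orders").fetchall()
except sqlite3.OperationalError as e:
    print("service B failed:", e)
```

Nothing in the schema change itself was wrong; the failure comes entirely from the coupling through the shared table.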
However, giving each service its own database trades those problems for other potential problems. You now must figure out how to keep the data between the three separate databases consistent.
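One common answer to that consistency problem is to propagate changes as events rather than sharing tables. The sketch below (schemas and names invented; a plain list stands in for a real message queue) shows the shape of the trade: the services are no longer coupled through a schema, but their databases are only eventually consistent:

```python
import sqlite3

# Each service owns its database.
orders_db = sqlite3.connect(":memory:")
billing_db = sqlite3.connect(":memory:")
orders_db.execute("CREATE TABLE orders (id INTEGER, total INTEGER)")
billing_db.execute("CREATE TABLE invoices (order_id INTEGER, total INTEGER)")

events = []  # stand-in for a message queue

def create_order(order_id, total):
    orders_db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
    events.append(("order_created", order_id, total))

def billing_consumer():
    # Billing applies events at its own pace; between publish and
    # consume, the two databases disagree.
    while events:
        _, order_id, total = events.pop(0)
        billing_db.execute(
            "INSERT INTO invoices VALUES (?, ?)", (order_id, total)
        )

create_order(1, 4200)
billing_consumer()
print(billing_db.execute("SELECT total FROM invoices").fetchone()[0])  # 4200
```

The extra machinery (events, a consumer, retry logic in a real system) is exactly the added abstraction layer the next paragraph describes.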
Loosening the coupling between two components usually means creating additional abstraction layers, which raises the complexity of the system. Minimizing the complexity of a system tends to mean more reuse of common components, which tightens coupling. The goal is not to transform your system into something completely simple and uncoupled; it’s to be strategic about where you are coupled and where you are complex, and to what degree. Places of complexity are where human operators make the most mistakes and have the greatest probability of misunderstanding. Places of tight coupling are areas of acceleration, where effects both good and bad move faster, which means less time for intervention.
Once you have identified the parts of the system where there is tight coupling and where there is complexity, study the role those areas have played in past problems. Will changing the ratio of complexity to coupling make those problems better or worse?
A helpful way to think about this is to classify the types of failures you’ve seen so far. Problems that are caused by human beings failing to read something, understand something, or check something are usually improved by minimizing complexity. Problems that are caused by failures in monitoring or testing are usually improved by loosening the coupling (and thereby creating places for automated testing). Remember also that an incident can include both elements, so be thoughtful in your analysis. A human operator may have made a mistake to trigger the incident, but if that mistake was impossible to discover because the logs aren’t granular enough, minimizing complexity will not pay off as much as changing the coupling.
This is an excerpt from Marianne’s upcoming book about running complex legacy system modernizations, “Kill it with Fire,” which will be available from No Starch Press later this year.