Failure Mode & Effects Analysis (FMEA) for dependency management
Managing system & software dependencies with FMEA for cloud architects & engineers
Wouldn’t it be great if all systems were equally highly stable? Unfortunately, reality begs to differ. As proprietors of stable systems, we need to make sure the dependencies of those systems are well managed. This article covers a proven methodology - failure mode & effects analysis (FMEA) - to identify and better understand dependencies, and gives an overview of how to mitigate the failures of those dependencies.
What is a 'system dependency'?
First, what is a system dependency?
A system dependency occurs when a system requires another system or software library in order to perform its functions, at design time or at run time.
Based on this definition, the scope of dependency management is quite broad. One needs to identify and consider both dependencies external and internal to the system in both design time and run time scenarios. In addition, dependencies can be recursive or indirect in nature. For example, as illustrated below a system (System X) may directly call one system (System Y), and that system may call another (System Z). In this chain of dependencies, System X is also dependent on System Z, but can usually rely on System Y’s mitigation of its dependency on System Z. Speaking of mitigations, one needs to prioritize, define and implement mitigations to support well managed dependencies.
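To make the idea of indirect dependencies concrete, here is a minimal sketch that walks a dependency map and lists everything a system relies on, directly or indirectly. The map, the function name `all_dependencies`, and the data structure are assumptions made for illustration; only the System X, Y and Z chain comes from the example above.

```python
from collections import deque

# Hypothetical dependency map: each system lists the systems it calls directly.
DIRECT_DEPENDENCIES = {
    "System X": ["System Y"],
    "System Y": ["System Z"],
    "System Z": [],
}


def all_dependencies(system, graph):
    """Return every direct and indirect dependency of `system`, nearest first."""
    found = []
    queue = deque(graph.get(system, []))
    while queue:
        current = queue.popleft()
        if current == system or current in found:
            continue  # skip repeats and guard against circular dependencies
        found.append(current)
        queue.extend(graph.get(current, []))
    return found


print(all_dependencies("System X", DIRECT_DEPENDENCIES))
# prints ['System Y', 'System Z']
```

A real inventory would more likely be generated from service registries or package manifests than maintained by hand, but the traversal idea stays the same.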
How to perform dependency management (with a design time example)
Let’s review dependency management via a design time scenario.
A typical system will implement the following steps to develop software:
- Develop and store code
- Test source code
- Package code
- Test packaged code
- Deploy code
Let’s overlay these steps with their typical dependencies:
Step | Dependency |
---|---|
1. Develop and store code | System used to store code |
2. Test source code | System(s) used to test code |
3. Package code | System(s) used to package code and store the packaged result, including retrieving any OS and/or software dependencies required to create that package |
4. Test packaged code | System(s) used to test code, including any testing environments |
5. Deploy code | System(s) used to deploy code, including the deploy target environments |
Utilizing failure mode & effects analysis (FMEA)
With this understanding of dependencies, one can now proceed to ‘managing’ them by first analyzing what can fail, how impactful that failure is, and how to mitigate those failures. This is commonly known as failure mode & effects analysis (FMEA).
Applying FMEA in an abbreviated form could look like this:
Step | Dependency | Failure | Probability | Severity | Mitigation |
---|---|---|---|---|---|
1. Develop and store code | System used to store code | a. System can’t be reached to push code | 2 (unlikely) | i. 3 (minor for normal operations, delay acceptable); ii. 5 for critical patch | i. Retry with logical backoff; ii. Use alternative system to store code |
1. Develop and store code | System used to store code | b. System can’t be reached to retrieve code | 2 (unlikely) | | |
1. Develop and store code | System used to store code | c. System runs out of capacity to store code | 1 (extremely unlikely) | | |
The severity rating is based on the business understanding of the failure’s impact. Multiplying probability by severity yields a score that is used to prioritize investment in mitigations. In the example above, ‘System can’t be reached to push code’ scores 2 × 5 = 10 during a critical patch, compared with 2 × 3 = 6 during normal operations. That makes the failure scenarios ‘System can’t be reached to push code’ and ‘System can’t be reached to retrieve code’ in a ‘critical patch’ situation the highest priorities. Mitigations can be automated or manual, and typically depend on observability and failure predictability to detect and respond to the failure.
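The scoring described above can be sketched in a few lines of code. This is only an illustration: the `FailureMode` class and `prioritize` function are hypothetical names, and the severity of 5 for the ‘retrieve code’ scenario is an assumption inferred from the text, since the abbreviated table does not list it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    description: str
    probability: int  # 1 = extremely unlikely, 2 = unlikely, and so on
    severity: int     # higher means a larger business impact

    @property
    def score(self) -> int:
        # FMEA priority score as described above: probability times severity.
        return self.probability * self.severity


def prioritize(failure_modes):
    """Return failure modes ordered from highest to lowest score."""
    return sorted(failure_modes, key=lambda fm: fm.score, reverse=True)


failure_modes = [
    FailureMode("Push code fails (normal operations)", 2, 3),
    FailureMode("Push code fails (critical patch)", 2, 5),
    # Assumption: severity 5 is inferred from the text, not given in the table.
    FailureMode("Retrieve code fails (critical patch)", 2, 5),
]

for fm in prioritize(failure_modes):
    print(fm.score, fm.description)
```

Keeping the ratings as data makes it easy to re-rank when the business revises a severity or when monitoring shows a failure is more likely than first assumed.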
Dependency mitigation strategies
The following describes mitigation strategies for dependency failure in more detail:
- Simple retry: A system receives an error and retries the request at a predefined frequency. Depending on the failure scenario, this could suffice. However, it can lead to an unintended denial of service, especially if a maximum number of retries is not set.
- Retry with exponential backoff: A system receives an error and retries the request, waiting a longer period of time between each retry. It is best practice to configure a maximum delay interval and a maximum number of retries. This approach enables better flow control.
- Feature toggles: A system can change behavior during runtime based on conditions. For example, if a service used to refresh data is unavailable, a feature toggle can allow the system to use stale data and be transparent about that to the end user rather than outright failing. While feature toggles increase complexity, they are highly recommended to support predictable failure scenarios as well as blue/green or canary deployment strategies.
- Circuit breaker: A system uses a circuit breaker to monitor and execute a dependency. For example, if a dependency is a remote service, a failure can be an exception and/or a timeout. When the number of failures reaches a predefined threshold, the circuit breaker opens the circuit such that subsequent requests end with an error or result in executing an alternative. After a reset period, the circuit breaker sets the circuit state to half-open and allows a single request through to gauge the health of the service. Upon success, the circuit breaker closes the circuit and waiting requests proceed as normal. Upon failure, the circuit breaker re-opens the circuit.
- Reduce dependencies: A system that can remove dependencies does not need to manage those dependencies. This is most commonly applied to software library dependencies and also helps reduce package size and attack surface. It typically requires a robust mapping capability to understand software dependencies in the first place, especially recursive or indirect dependencies, which are those called by the software's direct dependencies rather than by the software itself.
- Self healing: A system can detect a failure and automatically recover from it to restore normal operations without human intervention. For example, if a process fails, the system can detect that and restart it; assuming the restart corrects the issue, the system has effectively self healed.
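Two of the strategies above, retry with exponential backoff and the circuit breaker, can be sketched as follows. This is a minimal, single-threaded illustration under stated assumptions, not a production implementation: the names `retry_with_backoff`, `CircuitBreaker` and `CircuitOpenError`, along with all default thresholds and delays, are hypothetical, and callers that hit an open circuit simply receive an error rather than waiting.

```python
import time


def retry_with_backoff(call, max_retries=3, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep):
    """Run call(); after each failure wait twice as long, up to max_delay."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(min(base_delay * 2 ** attempt, max_delay))


class CircuitOpenError(Exception):
    """Raised when the circuit is open and the dependency is not called."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, func):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency call skipped")
            self.state = "half-open"  # reset period over: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.state = "closed"  # success closes the circuit again
        return result


# Usage: wrap a (here trivially successful) dependency call in both mechanisms.
breaker = CircuitBreaker()
print(breaker.call(lambda: retry_with_backoff(lambda: "pong")))
```

The `sleep` and `clock` parameters exist only so the behavior can be exercised without real waiting. In practice, a well-tested resilience library is usually a better choice than hand-rolled code.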
Final thoughts on dependency management
Dependency management is an essential technique to improve system stability. While it requires effort and investment to properly identify, understand, and mitigate dependencies, the resulting improvement of a system’s stability cannot be overstated.
To review:
- Dependency management must cover both dependencies external and internal to the system, and both design time and runtime scenarios, to be holistic.
- Dependency management can use FMEA to identify and prioritize failure scenarios, in order to define and implement mitigation strategies.
- There is a wide array of mitigation strategies that can be used depending on the business priority and level of automated sophistication needed.
I hope you’ve found this post useful and that you look forward to performing dependency management!