Software fault tolerance
Software fault tolerance is the ability of computer software to continue its normal operation despite the presence of system or hardware faults. Fault-tolerant software has the ability to satisfy requirements despite failures.[1][2]
The following design patterns can be combined to make a system more fault tolerant: retry, fallback, timeout, circuit breaker, and bulkhead.[3][4]
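As an illustration of how some of these patterns can be combined, the following minimal Python sketch wraps a call in a circuit breaker, retries it a bounded number of times, and returns a fallback value when the call cannot succeed. The names CircuitBreaker, CircuitOpenError, and call_with_retry_and_fallback, and all default values, are hypothetical and chosen for this example; they do not come from any particular library. The timeout and bulkhead patterns are omitted for brevity.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Hypothetical circuit breaker: opens after repeated consecutive failures."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_after = reset_after              # cool-down in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, call skipped")
            # Cool-down elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def call_with_retry_and_fallback(func, fallback, breaker, retries=2, delay=0.0):
    """Try func through the breaker, retry on failure, then use the fallback."""
    for attempt in range(retries + 1):
        try:
            return breaker.call(func)
        except CircuitOpenError:
            break  # do not keep calling a dependency known to be failing
        except Exception:
            if attempt < retries:
                time.sleep(delay)  # fixed delay; real systems often back off
    return fallback()
```

In this sketch the breaker counts consecutive failures, so a single success resets it. Breaking out of the retry loop when the circuit is open keeps retries from adding load to a dependency that is already failing, which is the main reason for combining the two patterns.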
Tail latency also affects fault tolerance: the 99th-percentile latency should be measured, and the slowest 1% of requests (the tail latencies) kept in check through self-healing mechanisms.[5]
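The measurement step can be sketched in Python as follows. The example uses the nearest-rank definition of a percentile, which is one of several definitions in common use; the function names percentile and tail_exceeds_budget and the latency budget are assumptions made for this illustration. The action taken when the budget is exceeded (for example, restarting an instance or shedding load) depends on the system and is not shown.

```python
import math


def percentile(samples, p):
    """Return the p-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("at least one sample is required")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    # Nearest rank: the smallest value with at least p% of samples at or below it.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def tail_exceeds_budget(latencies_ms, budget_ms, p=99):
    """Hypothetical health check: is the p-th percentile latency over budget?"""
    return percentile(latencies_ms, p) > budget_ms
```

With 100 samples, the 99th percentile under this definition is the 99th smallest value, so a single slow request among 100 does not change it, while two slow requests do. This is why the slowest 1% must be tracked separately if they matter to the system.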