Software fault tolerance

Software fault tolerance is the ability of computer software to continue its normal operation despite the presence of system or hardware faults. Fault-tolerant software continues to satisfy its requirements despite failures.[1][2]

The following design patterns can be combined to make a system more fault tolerant: retry, fallback, timeout, circuit breaker, and bulkhead.[3][4]
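As an illustration of how some of these patterns can be combined, the following minimal sketch wraps a call in a circuit breaker, retries it a few times, and returns a fallback value if every attempt fails. The names `CircuitBreaker`, `CircuitOpenError`, and `call_with_retry_and_fallback`, as well as the thresholds, are hypothetical and not taken from the cited sources. Timeouts and bulkheads are left out for brevity; in practice a timeout would usually be enforced inside the wrapped call.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the circuit breaker rejects a call without trying it."""


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, then rejects calls
    until `reset_after` seconds have passed (illustrative thresholds)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit is open")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single further failure will reopen the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # Any success resets the failure count.
        return result


def call_with_retry_and_fallback(func, breaker, fallback, attempts=3, delay=0.0):
    """Retry `func` through the breaker; return `fallback()` if all attempts fail."""
    for attempt in range(attempts):
        try:
            return breaker.call(func)
        except CircuitOpenError:
            break  # Retrying is pointless while the circuit is open.
        except Exception:
            if attempt < attempts - 1:
                time.sleep(delay)
    return fallback()
```

A caller might pass a function that queries a remote service as `func` and a function that returns a cached or default value as `fallback`. Stopping the retry loop as soon as the circuit opens is a design choice here: it keeps retries from adding load to a dependency that is already failing.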

Fault tolerance also depends on monitoring tail latency: measuring the 99th-percentile latency shows how the slowest 1% of requests (the tail latencies) behave, and self-healing mechanisms can be used to keep those latencies in check.[5]
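As a rough sketch of the measurement step, the 99th-percentile latency can be computed from a batch of recorded request latencies and compared against a budget. The nearest-rank method used here is one common percentile definition, chosen for simplicity; the function names and the budget values are hypothetical.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def tail_latency_exceeded(latencies_ms, budget_ms, p=99):
    """Return True if the p-th percentile latency is above the budget."""
    return percentile(latencies_ms, p) > budget_ms


# Example: 98 fast requests and 2 slow ones (made-up numbers).
latencies_ms = [10] * 98 + [500, 900]
p99 = percentile(latencies_ms, 99)
```

A monitoring loop could call `tail_latency_exceeded` periodically and trigger a self-healing action, such as restarting or draining an unhealthy instance, when it returns True.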

  1. "Software Fault Tolerance". Carnegie Mellon University.
  2. "Portable and Fault Tolerant Software Systems" (PDF). Massachusetts Institute of Technology.
  3. Kubernetes Native Microservices with Quarkus and MicroProfile. Manning. 2022. ISBN 9781638357155.
  4. Acing the System Design Interview. Manning. 2024. ISBN 9781638355915.
  5. Understanding Distributed Systems: What every developer should know about large distributed applications. 2021. ISBN 9781838430207.
