The Discounted Failure Pitfall for autonomous system safety

The Discounted Failure Pitfall:

Arguing that something is safe because it has never failed before doesn't work if you keep ignoring its failures based on that reasoning.

A particularly tricky pitfall occurs when a proven-in-use argument rests on a lack of observed field failures, but field failures have been systematically under-reported or not reported at all. In this case, the argument is based upon faulty evidence.

One way that this pitfall manifests in practice is that faults that result in low consequence failures tend to go unreported, with system redundancy tending to reduce the consequences of a typical incident. It can take time and effort to report failures, and there is little incentive to report each incident if the consequence is small, the system can readily continue service, and the subjective impression is that the system is safe overall. Perversely, numerous recovered malfunctions or warnings can actually increase user confidence in a system even while those events go unreported. This can be a significant issue when removing backup systems such as mechanical interlocks based on a lack of reported loss events. If the interlocks were keeping the system safe but interlock or other failsafe engagements go unreported, removing those interlocks can be deadly. Various aspects of this pitfall came into play in the Therac-25 loss events (Leveson 1993).
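
The interlock scenario can be made concrete with a small sketch. Everything here is hypothetical: the counts are made up, the names (`FieldHistory`, `reported_loss_events`, and so on) are ours, and the model assumes the system sees the same hazardous faults over the same exposure once the interlock is removed. It is not data from the Therac-25 investigation.

```python
from dataclasses import dataclass


@dataclass
class FieldHistory:
    """Hypothetical field record for a system with a backup interlock."""
    hazardous_faults: int    # faults that would cause a loss if unmitigated
    interlock_catches: int   # faults the interlock stopped (low consequence)
    catches_reported: int    # interlock engagements that were actually logged


def reported_loss_events(h: FieldHistory) -> int:
    """Losses that occur, and so get recorded, while the interlock is installed."""
    return h.hazardous_faults - h.interlock_catches


def apparent_hazard_count(h: FieldHistory) -> int:
    """What a reviewer sees: losses plus whichever engagements were reported."""
    return reported_loss_events(h) + h.catches_reported


def losses_without_interlock(h: FieldHistory) -> int:
    """Assumed losses over the same exposure once the interlock is removed."""
    return h.hazardous_faults


# Made-up numbers: the interlock caught every hazardous fault, nobody logged it.
history = FieldHistory(hazardous_faults=12, interlock_catches=12, catches_reported=0)

print(reported_loss_events(history))      # 0
print(apparent_hazard_count(history))     # 0
print(losses_without_interlock(history))  # 12
```

In this sketch the field record shows nothing at all, yet every one of the twelve hazardous faults would become a loss without the interlock. The clean record is evidence about the interlock, not about the component it was protecting.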

An alternate way that this pitfall manifests is when there is a significant economic or other incentive for suppressing or mischaracterizing the cause of field failures. For example, there can be significant pressure, and even systematic approaches to failure reporting and analysis, that emphasize human error over equipment failure (Koopman 2018b). Similarly, if technological maturity is being measured by a trend of reduced safety mechanism engagements (e.g. autonomy disengagements during road testing (Banerjee et al. 2018)), there can be significant pressure to artificially reduce the number of events reported. Claiming proven-in-use integrity for a component subject to artificially reduced or suppressed error reports is again basing an argument on faulty evidence.

Finally, arguing that a component cannot be unsafe because it has never caused a mishap before can result in systems that are unsafe by induction. More specifically, if each time a mishap occurs you argue that it wasn't that component's fault, and therefore don't attribute cause to that component, then you can continue that behavior forever, all the while denying that the component is unsafe.
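
The induction can be sketched in a few lines. This is an illustration under stated assumptions, not a model of any real review process: the rule `review_by_track_record` and the mishap list are hypothetical, and the rule is deliberately written so that the verdict depends only on the component's record so far.

```python
from typing import Sequence


def review_by_track_record(mishaps_attributed_so_far: int) -> str:
    """Hypothetical review rule: a clean record exonerates the component."""
    if mishaps_attributed_so_far == 0:
        return "other cause"
    return "component"


def count_attributions(true_causes: Sequence[str]) -> int:
    """Count mishaps blamed on the component under the track-record rule."""
    attributed = 0
    for _true_cause in true_causes:
        # The true cause is never consulted: only the record so far matters.
        verdict = review_by_track_record(attributed)
        if verdict == "component":
            attributed += 1
    return attributed


# Made-up history: the component really did cause all 25 mishaps.
mishaps = ["component"] * 25
print(count_attributions(mishaps))  # 0
```

Because the record starts clean and the rule never lets it become anything else, the count stays at zero no matter how long the list of mishaps grows. The "never caused a mishap" claim is then a product of the review rule rather than of the component's behavior.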

(This is an excerpt of our SSS 2019 paper: Koopman, P., Kane, A. & Black, J., "Credible Autonomy Safety Argumentation," Safety-Critical Systems Symposium, Bristol UK, Feb. 2019.)

Leveson, N. (1993) An investigation of the Therac-25 Accidents, IEEE Computer, July 1993, pp. 18-41.
Koopman, P. (2018b), "Practical Experience Report: Automotive Safety Practices vs. Accepted Principles," SAFECOMP, Sept. 2018.
Banerjee, S., Jha, S., Cyriac, J., Kalbarczyk, Z., Iyer, R. (2018), "Hands Off the Wheel in Autonomous Vehicles?: A Systems Perspective on over a Million Miles of Field Data," 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2018.