The Insufficient Testing Pitfall for autonomous system safety

The Insufficient Testing Pitfall:
Testing for less exposure than the acceptable mean time between hazardous failures doesn't prove you are safe. In fact, you probably need to test for about 10x that amount to be reasonably sure you've met the target. For life-critical systems this means too much testing to be feasible.



In a field testing argumentation approach, a fleet of systems is tested in real-world conditions to build up confidence. At a certain point the testers declare that the system has been demonstrated to be safe and proceed with production deployment.

An appropriate argumentation approach for field testing is that a sufficiently large number of exposure hours have been attained in a highly representative real-world environment. In other words, this is a variant of a proven-in-use argument in which the “use” was via testing rather than a production deployment. As such, the same proven-in-use argumentation issues apply, especially the need for representativeness and statistical significance. However, there are additional pitfalls encountered in field testing that must also be dealt with.

Accumulating an appropriate amount of field testing data for high dependability systems is challenging. In general, real-world testing needs to last approximately 3 to 10 times the acceptable mean time between hazardous failures to provide statistical significance (Kalra and Paddock 2016). For life-critical testing this can be an infeasible amount of testing (Butler and Finelli 1993, Littlewood and Strigini 1993).
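
To see where a multiplier in that range comes from, consider the simplest case of a test that completes with zero observed hazardous failures. Assuming an exponential failure model, the exposure needed to demonstrate a given MTBF reduces to:

Required Test Time (zero failures) = -ln(1 - confidence) × MTBF ≈ 3 × MTBF at 95% confidence

Tolerating a few failures during the test, or demanding higher confidence, pushes the multiplier up toward roughly 10, consistent with the 3 to 10 times rule of thumb above.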

Even if a sufficient amount of testing has been performed, it must also be argued that the testing is representative of the intended operational environment. To be representative, the field testing must at least have an operational profile (Musa et al. 1996) that matches the types of operational scenarios, actions, and other attributes that the system will experience when deployed. For autonomous vehicles this includes a host of factors such as geography, roadway infrastructure, weather, expected obstacle types, and so on.
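
As a rough sketch of what one slice of a representativeness check might look like in practice, the snippet below compares the scenario mix accumulated during field testing against a target operational profile using a chi-square goodness-of-fit test. The scenario categories, proportions, and counts are purely illustrative assumptions, and a real operational profile would cover far more attributes than this.

    # Illustrative sketch only: hypothetical scenario categories and counts.
    from scipy.stats import chisquare

    # Target operational profile: expected fraction of exposure per scenario type
    profile = {"highway": 0.50, "urban": 0.30, "rain": 0.15, "night_rural": 0.05}

    # Number of test runs actually accumulated in each scenario type
    observed = {"highway": 900, "urban": 350, "rain": 80, "night_rural": 20}

    total_runs = sum(observed.values())
    f_obs = [observed[k] for k in profile]
    f_exp = [profile[k] * total_runs for k in profile]

    stat, p = chisquare(f_obs, f_exp)
    print(f"chi-square = {stat:.1f}, p = {p:.3g}")
    # A very small p-value indicates the field testing exposure does not match
    # the intended operational profile, undermining a representativeness claim.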

While the need to perform a statistically significant amount of field testing should be obvious, it is common to see plans to build public confidence in a system via comparatively small amounts of public field testing. (Whether there is additional non-public-facing argumentation in place is unclear in many of these cases.)

To illustrate the magnitude of the problem, in 2016 there were 1.18 fatalities per 100 million vehicle miles traveled in the United States, making the Mean Time Between Failures (MTBF) with respect to human driver fatalities about 85 million miles (NHTSA 2017). Assuming an exponential failure distribution, the test time required to demonstrate a given MTBF with confidence α, when r failures occur during the test, can be computed using a chi-square distribution (Morris 2018):

Required Test Time = χ²(α, 2r + 2) × MTBF / 2

(To try out some example numbers, there is an interactive calculator at https://reliabilityanalyticstoolkit.appspot.com/mtbf_test_calculator)

Based on this, a single-occupant system needs to accumulate 255 million test miles with no fatalities to be 95% sure that the MTBF is at least 85 million miles. If there is a mishap during that time, more testing is required to distinguish normal statistical fluctuations from a lower MTBF: a total of 403 million miles to reach 95% confidence. If a second mishap occurs, 535 million miles of testing are needed, and so on. Additional testing might also be needed if the system is changed, as discussed in a subsequent section. Significantly more testing would also be required to ensure a comparable per-occupant fatality rate for multi-occupant vehicle configurations.
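
To reproduce these numbers, here is a minimal sketch using SciPy's chi-square quantile function, assuming the exponential failure model, 95% confidence, and the 85 million mile human-driver baseline discussed above.

    from scipy.stats import chi2

    MTBF = 85e6         # miles between fatal mishaps for human drivers (NHTSA 2017)
    confidence = 0.95

    for r in range(3):  # number of fatal mishaps observed during field testing
        required_miles = chi2.ppf(confidence, 2 * r + 2) * MTBF / 2
        print(f"{r} mishaps observed: {required_miles / 1e6:.0f} million test miles")

    # Prints roughly 255, 403, and 535 million miles for 0, 1, and 2 mishaps.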

Attempting to use a proxy metric such as the rate of non-fatal crashes and extrapolating to fatal mishaps also requires substantiating the assumption that mishap profiles will be the same for autonomous systems as they are for human drivers. (See the human filter pitfall previously discussed.)

(This is an excerpt of our SSS 2019 paper:  Koopman, P., Kane, A. & Black, J., "Credible Autonomy Safety Argumentation," Safety-Critical Systems Symposium, Bristol UK, Feb. 2019.  Read the full text here)


  • Kalra, N., Paddock, S. (2016) Driving to Safety: how many miles of driving would it take to demonstrate autonomous vehicle reliability? Rand Corporation, RR-1479-RC.
  • Butler, R., Finelli, G. (1993) “The infeasibility of experimental quantification of life-critical software reliability,” IEEE Trans. SW Engr. 19(1):3-12, Jan 1993.
  • Littlewood, B., Strigini, L. (1993) “Validation of Ultra-High Dependability for Software-Based Systems,” Communications of the ACM, 36(11):69-80, November 1993.
  • Musa, J., Fuoco, G., Irving, N., Kropfl, D., Juhlin, B., (1996) “The Operational Profile,” Handbook of Software Reliability Engineering, pp. 167-216, 1996.
  • NHTSA (2017) Traffic Safety Facts Research Note: 2016 Fatal Motor Vehicle Crashes: Overview. U.S. Department of Transportation National Highway Traffic Safety Administration. DOT HS-812-456.
  • Morris, S. (2018) Reliability Analytics Toolkit: MTBF Test Time Calculator. https://reliabilityanalyticstoolkit.appspot.com/mtbf_test_calculator. Reliability Analytics. (accessed December 12, 2018)