Nondeterministic Behavior and Legibility:
How do you know your autonomous vehicle passed the test for the right reason? What if it just got lucky, or is gaming the test?
The nature of the algorithms used by autonomy systems creates problems for modelling and testing that go beyond typical safety-critical software. Some autonomy algorithms, such as randomized path planning, are inherently non-deterministic. Others can be brittle, failing dramatically with subtle variations in data, such as perception false negatives induced by adversarial attacks (Szegedy et al. 2013) or false negatives induced by slight image degradation due to haze or defocus (Pezzementi et al. 2018).
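For illustration only, here is a minimal sketch of the kind of robustness probe this brittleness implies: apply a mild synthetic degradation (haze, in this case) to an image and compare a detector's confidence before and after. The `add_haze` and `pedestrian_confidence` functions are hypothetical stand-ins, not code from the paper or from Pezzementi et al.

```python
# Minimal sketch (assumed, not from the paper): probe a perception model's
# sensitivity to mild image degradation, in the spirit of haze/defocus
# robustness testing. pedestrian_confidence is a hypothetical stand-in
# for a real detector's score on a single image.
import numpy as np

def add_haze(image: np.ndarray, strength: float) -> np.ndarray:
    """Blend the image toward uniform white to mimic atmospheric haze."""
    return (1.0 - strength) * image + strength * 255.0

def pedestrian_confidence(image: np.ndarray) -> float:
    """Placeholder for a real detector; returns a score in [0, 1]."""
    # Toy stand-in so the example runs end to end: treat mean brightness
    # as a proxy "feature". A real detector would go here.
    return float(np.clip(1.5 - image.mean() / 255.0, 0.0, 1.0))

rng = np.random.default_rng(0)
image = rng.uniform(0, 255, size=(64, 64, 3))

clean_score = pedestrian_confidence(image)
for strength in (0.05, 0.10, 0.20):
    degraded_score = pedestrian_confidence(add_haze(image, strength))
    # A large score drop under a barely perceptible change flags brittleness.
    print(f"haze={strength:.2f}  score drop={clean_score - degraded_score:+.3f}")
```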
A related issue is over-fitting to the test, in which an autonomy system learns how to beat a fixed test. By analogy, this is the pitfall of the system cheating by memorizing the correct answers. A proposed way to deal with this risk is to randomly vary aspects of test cases.
In such a fuzzing or variable testing approach it is important to randomly vary all relevant aspects of a problem. For example, varying geometries for traffic situations can be helpful, but probably does not address potential over-fitting for perception algorithms that perform object classification.
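As a concrete illustration of "vary all relevant aspects," here is a minimal sketch of a fuzzed scenario generator that samples perception-relevant parameters (appearance, illumination, haze) alongside geometry and dynamics. The field names and ranges are illustrative assumptions, not from the paper.

```python
# Minimal sketch (assumed) of a fuzzed scenario generator: every relevant
# aspect is sampled, not just road geometry, so that perception
# (appearance, illumination, haze) is exercised as well as planning.
import random
from dataclasses import dataclass

@dataclass
class Scenario:
    ego_speed_mps: float          # geometry / dynamics
    ped_crossing_offset_m: float
    ped_clothing: str             # perception-relevant appearance
    illumination_lux: float
    haze_strength: float
    seed: int                     # recorded so any failure can be replayed

def sample_scenario(rng: random.Random) -> Scenario:
    return Scenario(
        ego_speed_mps=rng.uniform(5.0, 20.0),
        ped_crossing_offset_m=rng.uniform(-2.0, 2.0),
        ped_clothing=rng.choice(["dark", "light", "high-vis", "patterned"]),
        illumination_lux=rng.choice([10.0, 400.0, 10_000.0, 100_000.0]),
        haze_strength=rng.uniform(0.0, 0.3),
        seed=rng.randrange(2**32),
    )

rng = random.Random(2019)
test_suite = [sample_scenario(rng) for _ in range(1000)]
```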
The use of potentially non-deterministic test scenarios combined with non-deterministic system behaviors and opaque system designs means it is difficult to know whether a system has passed a test, because there is no single correct answer. Rather, there must be some algorithmic way to determine whether a particular system response is acceptable or not, making that test oracle algorithm safety critical.
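A minimal sketch of what such a test oracle might look like, assuming the vehicle's behavior is logged as a trace of per-timestep measurements: the oracle checks explicit acceptability properties rather than comparing against a single "correct" trajectory. The thresholds and field names are illustrative assumptions.

```python
# Minimal sketch (assumed) of an algorithmic test oracle: rather than
# matching one golden trajectory, it checks that whatever the system did
# satisfies explicit acceptability properties. Because a defective oracle
# silently accepts unsafe behavior, this code is itself safety critical.
from typing import Iterable, NamedTuple

class Snapshot(NamedTuple):
    min_pedestrian_clearance_m: float
    lateral_offset_from_lane_m: float
    decel_mps2: float

def acceptable(trace: Iterable[Snapshot]) -> bool:
    """Return True iff every timestep satisfies all safety properties."""
    return all(
        s.min_pedestrian_clearance_m >= 1.0           # never closer than 1 m
        and abs(s.lateral_offset_from_lane_m) <= 0.5  # stays within lane
        and s.decel_mps2 <= 8.0                       # no impossible braking
        for s in trace
    )
```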
Moreover, it is possible that a system has passed a particular test by chance. For example, a pedestrian might be avoided due to a properly functioning detection and avoidance algorithm. But a pedestrian might also be avoided merely because a random path planner by chance picked a path that did not intersect the pedestrian, or responded to a completely unrelated aspect of the environment that caused it to pick a fortuitously safe path. Similarly, a pedestrian might be detected in one image, but undetected in another that differs in ways that are essentially imperceptible to a human.
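One way to reduce the chance of mistaking luck for competence is sketched below, under assumptions about what the system logs: require supporting internal evidence (for example, that the pedestrian was actually detected before the avoidance maneuver) and rerun the scenario across planner seeds. `run_scenario` and the logged fields are hypothetical.

```python
# Minimal sketch (assumed) of checking that a test was passed for the
# right reason: combine the outcome with logged evidence that perception
# actually saw the pedestrian, and repeat across planner RNG seeds so a
# single fortunate path is not mistaken for safe behavior.
from dataclasses import dataclass

@dataclass
class TrialLog:
    collided: bool
    pedestrian_detected_before_maneuver: bool  # hypothetical logged signal

def passed_for_right_reason(log: TrialLog) -> bool:
    return (not log.collided) and log.pedestrian_detected_before_maneuver

def scenario_verdict(run_scenario, seeds) -> bool:
    """Require every seeded rerun to pass with supporting evidence."""
    return all(passed_for_right_reason(run_scenario(seed)) for seed in seeds)

# Usage sketch: run_scenario(seed) would execute the simulation with a
# fixed planner RNG seed and return a TrialLog for that trial.
```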
It is unclear if resolving this issue requires solving the difficult problem of explainable AI (Gunning 2018). As a minimum, a credible safety argument will need to address the problem of how plans to test vehicles with less than a statistically valid amount of real-world exposure data can avoid these pitfalls. It seems likely that a credible argument will also have to establish that each type of test has been passed due to safe operation of the system rather than simply by chance (Koopman & Wagner 2018).
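For scale, here is a back-of-envelope calculation (not from the excerpt) of what a statistically valid amount of real-world exposure implies, using the standard rule-of-three bound: with zero observed events in m failure-free miles, the one-sided 95% upper confidence bound on the event rate is roughly 3/m. The target rate is an assumption chosen only to illustrate the orders of magnitude involved.

```python
# Back-of-envelope sketch (assumed target rate): demonstrating better than
# 1 event per 10^8 miles at ~95% confidence, with zero observed events,
# requires on the order of 3 x 10^8 failure-free miles (rule of three).
target_rate_per_mile = 1e-8   # assumed target: 1 event per 10^8 miles
confidence = 0.95

required_miles = 3.0 / target_rate_per_mile   # rule-of-three approximation
print(f"~{required_miles:.1e} failure-free miles for {confidence:.0%} confidence")
```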
(This is an excerpt of our SSS 2019 paper: Koopman, P., Kane, A. & Black, J., "Credible Autonomy Safety Argumentation," Safety-Critical Systems Symposium, Bristol UK, Feb. 2019. Read the full text here)
- Gunning, D. (2018), Explainable Artificial Intelligence (XAI), Defense Advanced Research Projects Agency, https://www.darpa.mil/program/explainable-artificial-intelligence (accessed October 27, 2018).
- Koopman, P. & Wagner, M., (2018) "Toward a Framework for Highly Automated Vehicle Safety Validation," SAE World Congress, 2018. SAE-2018-01-1071.
- Pezzementi, Z., Tabor, T., Yim, S., Chang, J., Drozd, B., Guttendorf, D., Wagner, M., Koopman, P., "Putting image manipulations in context: robustness testing for safe perception," IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR), Aug. 2018.
- Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R. (2013) "Intriguing properties of neural networks," arXiv preprint arXiv:1312.6199.