Safe road testing of autonomous vehicle technology assumes that human "safety drivers" will be able to prevent mishaps. But humans are notoriously bad at supervising autonomy. Ensuring that road testing is safe requires designing the test platform to have high "supervisability." In other words, it must be easy for a human to stay in the loop and compensate for autonomy errors, even when the autonomy gets pretty good and the supervisor job gets pretty boring. This excerpt from a draft paper explains the concept and why it matters.
(Update: full paper here: https://users.ece.cmu.edu/~koopman/pubs/koopman19_TestingSafetyCase_SAEWCX.pdf)
An essential observation regarding self-driving car road testing is that it relies upon imperfect human responses to provide safety. There is some non-zero probability that the supervisor (a "safety driver") will not react in a timely fashion, and some additional probability that the supervisor will react incorrectly. Either of these outcomes could be an incident or mishap. Such a non-zero probability of unsuccessful failure mitigation means it is necessarily the case that the frequency of autonomy failures will influence on-road safety outcomes.
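As a rough illustration of that last point, the sketch below multiplies an autonomy failure rate by the chance that the supervisor does not successfully mitigate a given failure. The function name, the per-hour units, the way the two supervisor probabilities are combined, and all of the numbers are assumptions made for illustration; they are not taken from the paper.

```python
def unmitigated_failure_rate(failures_per_hour, p_late, p_wrong):
    """Expected autonomy failures per hour that the supervisor does not mitigate.

    Hypothetical model (an assumption, not the paper's formulation):
      p_late  - probability the supervisor does not react in time
      p_wrong - probability the supervisor reacts in time but incorrectly
    """
    if failures_per_hour < 0:
        raise ValueError("failure rate must be non-negative")
    for p in (p_late, p_wrong):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    # A failure goes unmitigated if the reaction is late, or timely but wrong.
    p_unmitigated = p_late + (1.0 - p_late) * p_wrong
    return failures_per_hour * p_unmitigated


# Same (made-up) supervisor performance, two different autonomy failure rates.
for rate in (2.0, 0.2):
    print(rate, unmitigated_failure_rate(rate, p_late=0.10, p_wrong=0.05))
```

With the supervisor probabilities held fixed, the result scales linearly with the autonomy failure rate. The rest of this excerpt argues that those probabilities do not stay fixed as the failure rate changes.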
However, lower autonomy failure rates are not necessarily better. The types and frequencies of autonomy failures will affect the supervisability of the system. Therefore, the field failure rate and types of failures must be compatible with the measures being taken to ensure supervisor engagement. Thus, the failure profile must be “appropriate” rather than low.
Non-Linear Autonomy/Human Interactions
A significant difficulty in reasoning about the effect of autonomy failure on safety is that there is a non-linear response of human attentiveness to autonomy failure. We propose that there are five different regions of supervisability of autonomy failures, with two different hypothetical scenarios based on comparatively lower and higher supervisability trends illustrated in the figures.
1. Autonomy fails frequently in a dangerous way. In essence this is autonomy which is not really working. A supervisor faced with an AV test platform that is trying to run off the road every few seconds should terminate the testing and demand more development. We assume that such a system would never be operated on public roads in the first place, making a public risk assessment unnecessary. (Debugging of highly immature autonomy on public roads seems like a bad idea, and presents a high risk of mishaps.)
2. Autonomy fails moderately frequently but works or is benign most of the time. In this case the supervisor is more likely to remain attentive since an autonomy failure in the next few seconds or minutes is likely. The risk in this scenario is probably dominated by the ability of the supervisor to plan and execute adequate fault responses, and eventual supervisor fatigue.
3. Autonomy fails infrequently. In this case there is a real risk that the supervisor will lose focus during testing, and fail to respond in time or respond incorrectly due to loss of situational awareness. This is perhaps the most difficult situation for on-road testing, because the autonomy could be failing frequently enough to present an unacceptably high risk, but so infrequently that the supervisor is relatively ineffective at mitigation. This dangerous situation corresponds to the “valley of degraded supervision” in Figure 1.
4. Autonomy fails very infrequently, with high diagnostic coverage. At a high level of maturity, the autonomy might fail so infrequently that it is almost safe enough, and even a relatively disengaged driver can deal with failures well enough to result in a system that is overall acceptably safe. High coverage failure detection that prompts the driver to take over in the event of a failure might help improve the effectiveness of such a system. The ultimate safety of such a system will likely depend upon its ability to detect a risky situation with sufficient advance warning for the supervisor to re-engage and take over safely. (This scenario is generally aligned with envisioned production deployment of SAE Level 3 autonomy.)
5. Autonomy essentially never fails. In this case the role of the supervisor is to be there in case the expectation of “never fails” turns out to be incorrect in testing. It is difficult to know how to evaluate the potential effectiveness of a supervisor, other than that the supervisor will have the same tasks as the “very infrequently” preceding case, but is expected not to have to perform them.
Perhaps counter-intuitively, the probability of a supervisor failure is likely to increase as the autonomy failure rate decreases from regions 1 to 5 above (from left to right along the horizontal axis of the figures). In other words, the less often autonomy fails, the less reliable supervisor intervention becomes. The most dangerous operational region will be #3, in which the autonomy is failing often enough to present a significantly elevated risk, but not often enough to keep the supervisor alert and engaged. This is a well understood risk that must be addressed in a road testing safety case.
Figure 2 illustrates this effect with hypothetical performance data that results in an overall test platform safety value in accordance with [math in the full paper]. A hypothetical lower supervisability curve results in a region in which the vehicle is less safe than a conventional vehicle driven by a human driver.
Safe testing requires a comparatively higher supervisability curve to ensure that the overall test platform safety is sufficiently high, as shown by Figure 2.
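The shape of that argument can be reproduced with a toy model. The sketch below is not the math from the full paper: the exponential form of the supervisor curve, the parameter names (p_floor, p_ceiling, k), the human-driver baseline, and every number are hypothetical, chosen only to show how a lower supervisability curve can produce a region worse than the baseline while a higher one stays below it.

```python
import math


def p_supervisor_miss(rate, p_floor, p_ceiling, k):
    """Hypothetical chance that the supervisor fails to mitigate one failure.

    Rises from p_floor (frequent failures, alert supervisor) toward
    p_ceiling (rare failures, disengaged supervisor) as rate falls below k.
    """
    return p_floor + (p_ceiling - p_floor) * math.exp(-rate / k)


def mishap_rate(rate, p_floor, p_ceiling, k):
    """Unmitigated failures per hour: failure rate times miss probability."""
    return rate * p_supervisor_miss(rate, p_floor, p_ceiling, k)


def worst_rate(rates, **curve):
    """Autonomy failure rate (from rates) with the highest mishap rate."""
    return max(rates, key=lambda r: mishap_rate(r, **curve))


# Autonomy failure rates from 1e-4 to 1 failure per hour, log-spaced.
RATES = [10.0 ** (e / 4.0) for e in range(-16, 1)]

# Two made-up supervisability curves; only p_ceiling differs.
LOWER = dict(p_floor=0.0005, p_ceiling=0.5, k=0.05)
HIGHER = dict(p_floor=0.0005, p_ceiling=0.05, k=0.05)

# Made-up mishap rate for a conventional human-driven vehicle.
HUMAN_BASELINE = 0.002

for name, curve in (("lower", LOWER), ("higher", HIGHER)):
    r = worst_rate(RATES, **curve)
    worse = mishap_rate(r, **curve) > HUMAN_BASELINE
    print(name, "worst failure rate:", r, "exceeds baseline:", worse)
```

In this toy model the worst case sits at an intermediate failure rate near k, not at either end of the range, which is the behavior described for region 3.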
Because autonomy capabilities are generally expected to mature over time, the safety argument must be revisited periodically during test and development campaigns as the autonomy failure rate decreases from region 2 to 3 above. An intuitive – but dangerously incorrect – approach would be to assume that the requirements for test supervision can be relaxed as autonomy becomes more mature. Rather, it seems likely that the rigor of ensuring supervisors are vigilant and continually trained to maintain their ability to react effectively needs to be increased as autonomy technology transitions from immature to moderately mature. This effect only diminishes when the AV technology starts approximating the road safety of a conventional human driver all on its own (regions 4 & 5).
If you are actively doing self-driving car testing on public roads, please contact me for a preprint of the full paper that includes a GSN safety argumentation structure for ensuring road testing safety. I plan to present the full paper at SAE WCX 2019 in April.
Figure 1.
Figure 2.
-- Phil Koopman, Edge Case Research & Carnegie Mellon University