Command Override Anti-Pattern for autonomous system safety

Command Override Anti-Pattern:
Don't let a non-critical "Doer" override the safety-critical "Checker." If you do this, the Doer can tell the Checker that something is safe when it isn't.


(First in a series of postings on pitfalls and fallacies we've seen used in safety assurance arguments for autonomous vehicles.)

A common pitfall when identifying the safety-relevant portions of a system is overlooking the safety relevance of sensors, actuators, software, or some other part of the system. A common example is a system design that permits the Doer to perform a command override of the Checker. (In other words, the designers think they are building a Doer/Checker pattern in which only the Checker is safety relevant, but in fact the Doer is safety relevant due to its ability to override the Checker’s functionality.)

The usual claim being made is that a safety layer will prevent any malfunction of an autonomy layer from creating a mishap. This claim generally involves arguing that an autonomy failure will activate a safing function in the safety layer, so that any attempt by the autonomy layer to do something unsafe will be prevented. A typical (deficient) scheme is to detect autonomy failure via some sort of self-test combined with the safety layer monitoring an autonomy layer heartbeat signal. It is further argued that the safety layer is designed in conformance with a suitable functional safety standard, and therefore acts as a safety-rated Checker as part of a Doer/Checker pair.
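
To make that scheme concrete, here is a minimal sketch of the detection logic (hypothetical Python with invented names such as SafetyLayer and HEARTBEAT_TIMEOUT_S, not taken from any real vehicle stack): the safety layer engages its safing function only when the autonomy layer's self-test reports a fault or its heartbeat goes stale.

```python
import time

HEARTBEAT_TIMEOUT_S = 0.5  # hypothetical staleness threshold

class SafetyLayer:
    """Deficient scheme: safing is triggered only by a self-reported
    fault or a stale heartbeat from the autonomy layer."""

    def __init__(self):
        self.last_heartbeat = time.monotonic()

    def on_heartbeat(self):
        # Called each time the autonomy layer emits a heartbeat message.
        self.last_heartbeat = time.monotonic()

    def gate_command(self, trajectory_command, self_test_passed):
        heartbeat_fresh = (time.monotonic() - self.last_heartbeat) < HEARTBEAT_TIMEOUT_S
        if self_test_passed and heartbeat_fresh:
            # Autonomy is presumed healthy, so its command passes through.
            return trajectory_command
        return "SAFE_STOP"  # safing function engages

# The gap: a failure that keeps the heartbeat alive and the self-test
# green still gets its (possibly unsafe) trajectory through this gate.
layer = SafetyLayer()
layer.on_heartbeat()
print(layer.gate_command({"speed_mps": 25.0}, self_test_passed=True))  # passes unchecked
```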

The flaw in that safety argumentation approach is that the autonomy layer has been presumed to fail silent via a combination of self-diagnosed fault detection and lack of heartbeat. However, self-diagnosis and heartbeat detection methods provide only partial fault detection (Hammett 2001). For example, there is always the possibility that the checking function itself has been compromised by a fault that leads to false negatives when checking for incorrect functionality.

As a simple example, a heartbeat signal might be generated by a timer interrupt in the autonomy computer that continues to function even if significant portions of the autonomy software have crashed or are generating incorrect results. In general, such an architectural pattern is unsafe because it permits a non-silent failure in the autonomy layer to issue an unsafe vehicle trajectory command that overrides the safety layer. Fixing this fault requires making the autonomy layer safety-critical, which defeats a primary purpose of using a Doer/Checker architecture.
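
The following runnable sketch (again hypothetical; a daemon thread stands in for the timer interrupt) shows why such a heartbeat proves so little: the beat keeps arriving even after the planning loop has crashed.

```python
import threading
import time

def start_heartbeat(send_heartbeat, period_s=0.1):
    """Heartbeat driven by an independent timer thread, analogous to a
    timer interrupt that keeps firing after the planner has crashed."""
    def beat():
        while True:
            send_heartbeat()
            time.sleep(period_s)
    threading.Thread(target=beat, daemon=True).start()

def planning_loop(compute_trajectory, publish_trajectory):
    """The safety-relevant work. If compute_trajectory crashes or returns
    a wrong answer, the heartbeat above keeps arriving anyway."""
    while True:
        publish_trajectory(compute_trajectory())

if __name__ == "__main__":
    start_heartbeat(lambda: print("heartbeat"))
    try:
        planning_loop(lambda: 1 / 0, print)   # planner crashes immediately
    except ZeroDivisionError:
        time.sleep(0.5)                       # heartbeats continue regardless
```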

In practice, safety layer logic is usually less permissive than the autonomy layer. By less permissive, we mean that it under-approximates the safe state space of the system in exchange for simplifying computation (Machin et al. 2018). As a practical example, the safety layer might leave a larger spatial buffer area around obstacles to simplify computations, resulting in a longer total path length for the vehicle or even denying the vehicle entry into situations such as a tight alleyway that is only slightly larger than the vehicle.
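
To picture this under-approximation, here is a hypothetical sketch (invented geometry and parameter names, not any particular monitor's algorithm): the checker requires an inflated clearance around obstacles, so it denies entry to a tight alleyway that the vehicle could physically negotiate.

```python
import math
from dataclasses import dataclass

@dataclass
class Obstacle:
    x: float
    y: float   # point obstacle, coordinates in metres (hypothetical units)

def checker_allows(px, py, obstacles, vehicle_radius=1.0, safety_buffer=0.5):
    """Conservative check: require vehicle_radius + safety_buffer of clearance.
    The buffer keeps the computation simple but under-approximates the
    truly safe state space."""
    required = vehicle_radius + safety_buffer
    return all(math.hypot(px - ob.x, py - ob.y) >= required for ob in obstacles)

# A 2.2 m wide alleyway: walls 1.1 m either side of the centreline.
# The 1.0 m radius vehicle physically fits, but the buffered check says no.
alley = [Obstacle(x, 1.1) for x in range(10)] + [Obstacle(x, -1.1) for x in range(10)]
print(checker_allows(5.0, 0.0, alley))   # False: entry denied by the buffer
```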

A significant safety compromise can occur when vehicle designers attempt to increase permissiveness by enabling a non-safety-rated autonomy layer to say “trust me, this is OK” to override the safety layer. This creates a way for a malfunctioning autonomy layer to override the safety layer, again subverting the safety of the Doer/Checker pair architecture.

Eliminating this command override anti-pattern requires that the designers accept that there is an inherent trade-off between permissiveness and simplicity. A simple Checker tends to have limited permissiveness. Increasing permissiveness makes the Checker more complex, increasing the fraction of the system design work that must be done with high integrity. Permitting a lower-integrity Doer to override the safety-relevant behavior of a high-integrity Checker in an attempt to avoid Checker complexity is unsafe.
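
The anti-pattern itself fits in a few lines (a hypothetical sketch; the override_checker flag and the toy speed limit are illustrative only): a single flag asserted by the non-safety-rated Doer turns the Checker's veto into mere advice.

```python
SPEED_LIMIT_MPS = 10.0   # toy safety envelope for illustration

def checker_approves(command):
    """Safety-rated check: vetoes any command it cannot show to be safe."""
    return command.get("speed_mps", 0.0) <= SPEED_LIMIT_MPS

def arbitrate(command):
    # ANTI-PATTERN: the Doer can assert "trust me" and bypass the Checker,
    # so a malfunctioning Doer can push an unsafe command to the actuators.
    if command.get("override_checker"):
        return command
    return command if checker_approves(command) else {"action": "safe_stop"}

print(arbitrate({"speed_mps": 25.0}))                            # vetoed -> safe_stop
print(arbitrate({"speed_mps": 25.0, "override_checker": True}))  # unsafe command passes
```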

There are related pitfalls. The first is a system in which the Checker only covers a subset of the safety properties of the system. This implicitly trusts the Doer not to have certain classes of defects, potentially including requirements defects. If the Checker does not actually check some aspects of safety, then the Doer is in fact safety relevant. A second pitfall is having the Checker supervise a diagnosis operation for a Doer health check. Even if the Doer passes a health check, that does not mean its calculations are correct. At best it means that the Doer is operating as designed, which might be unsafe since the Doer has not been verified with the level of rigor required to assure safety.
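
As a small illustration of the first pitfall (hypothetical property names, not drawn from any real system), consider a Checker that bounds speed but never examines steering; every lateral-safety requirement is then implicitly delegated to the Doer.

```python
def partial_checker(command):
    """Checks longitudinal safety only; steering is never examined, so the
    Doer is trusted (and therefore safety relevant) for lateral safety."""
    return command.get("speed_mps", 0.0) <= 10.0

# Passes the check while commanding a hard swerve the Checker never sees.
print(partial_checker({"speed_mps": 5.0, "steering_rad": 0.6}))   # True
```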

We have found it productive to conduct the following thought experiment when evaluating Doer/Checker architectural patterns and other systems that rely upon assuring the integrity of only a subset of safety-related functions. Ask this question: “Assume the Doer (the portions of the system, including software, sensors, and actuators, that are not ‘safety related’) maliciously attempts to compromise safety in the worst way possible, with full and complete knowledge of every aspect of the design. Could it compromise safety?” If the answer is yes, then the Doer is in fact safety relevant, and must be designed to a sufficiently high level of integrity. Attempts to argue that such an outcome is unlikely in practice must be supported by strong evidence.
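
The thought experiment translates naturally into an adversarial test. The sketch below (hypothetical, mirroring the toy gate from the earlier anti-pattern example) lets a worst-case Doer try its full bag of tricks and flags the Doer as safety relevant if any unsafe command reaches the vehicle.

```python
def gate(command):
    """Toy command gate mirroring the earlier anti-pattern sketch."""
    if command.get("override_checker"):
        return command
    return command if command.get("speed_mps", 0.0) <= 10.0 else {"action": "safe_stop"}

def is_unsafe(command):
    return command.get("speed_mps", 0.0) > 10.0   # ground-truth safety property

# A "malicious" Doer with full design knowledge tries its worst commands.
adversarial_doer = [
    {"speed_mps": 25.0},
    {"speed_mps": 25.0, "override_checker": True},
]

if any(is_unsafe(gate(cmd)) for cmd in adversarial_doer):
    print("Doer is safety relevant: an unsafe command reached the vehicle")
else:
    print("Checker contains every Doer misbehavior considered")
```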

(This is an excerpt of our SSS 2019 paper:  Koopman, P., Kane, A. & Black, J., "Credible Autonomy Safety Argumentation," Safety-Critical Systems Symposium, Bristol UK, Feb. 2019.  Read the full text here)
  • Hammett, R., (2001) “Design by extrapolation: an evaluation of fault-tolerant avionics,” 20th Conference on Digital Avionics Systems, IEEE, 2001.
  • Machin, M., Guiochet, J., Waeselynck, H., Blanquart, J-P., Roy, M., Masson, L., (2018) “SMOF – a safety monitoring framework for autonomous systems,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, 48(5), May 2018, pp. 702-715.