Autonomous Vehicle Fault and Failure Management

When you build an autonomous vehicle you can't count on a human driver to notice when something's wrong and "do the right thing." Here is a list of faults, system limitations, and fault responses AVs will need to get right. Did you think of these?

System Limitations:

Sometimes the issue isn't that something is broken, but simply that the vehicle has reached a limit. Every vehicle has limitations, and you have to know what your system's are.
  • Current capabilities of sensors and actuators, which can depend upon the operational state space.
  • Detecting and handling a vehicle excursion outside the operational state space for which it was validated, including all aspects of {ODD, OEDR, Maneuver, Fault} tuples.
  • Desired availability despite fault states, including any graceful degradation plan, and any limits placed upon the degraded operational state space.
  • Capability variation based on payload characteristics (e.g., passenger vehicle overloaded with cargo, uneven weight distribution, truck loaded with gravel, tanker half filled with liquid) and autonomous payload modification (e.g., trailer connect/disconnect).
  • Capability variation based on functional modes (e.g., pivot vs. Ackermann vs. crab steering, rear wheel steering, ABS or 4WD engaged/disengaged).
  • Capability variation based on ad-hoc teaming (e.g., V2V, V2I) and planned teaming (e.g., leader-follower or platooning vehicle pairing).
  • Incompleteness, incorrectness, corruption or unavailability of external information (V2V, V2I).
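
As a rough illustration of the excursion-detection item above, here is a minimal Python sketch that treats the validated operational state space as a set of {ODD, OEDR, Maneuver, Fault} tuples and flags any state outside it. The class name, field labels, and the contents of VALIDATED are all hypothetical; a real system would derive them from its own validation evidence, and its state space would rarely be a small enumerable set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperationalState:
    """One {ODD, OEDR, Maneuver, Fault} tuple. All labels are hypothetical."""
    odd: str       # operational design domain subset, e.g. "urban_day_dry"
    oedr: str      # object/event detection and response class
    maneuver: str  # maneuver currently being performed
    fault: str     # "none" or the label of an active fault


# Hypothetical set of tuples for which validation evidence exists.
VALIDATED = frozenset({
    OperationalState("urban_day_dry", "pedestrian", "lane_keep", "none"),
    OperationalState("urban_day_dry", "pedestrian", "lane_keep", "lidar_degraded"),
    OperationalState("highway_day_dry", "vehicle", "lane_change", "none"),
})


def is_validated(state: OperationalState, validated=VALIDATED) -> bool:
    """True only if this exact tuple was covered by validation."""
    return state in validated


def check_excursion(state: OperationalState) -> str:
    """Report an excursion whenever the tuple is not in the validated set."""
    return "continue" if is_validated(state) else "excursion: begin fallback"
```

Note that a fault can move the vehicle outside the validated space even when the ODD, OEDR, and maneuver are unchanged: in this sketch, a lane change on the highway with "lidar_degraded" would be reported as an excursion because that particular tuple was never validated.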

System Faults:
  • Perception failure, including transient and permanent faults in classification and pose of objects.
  • Planning failures, including those leading to collision, unsafe trajectories (e.g., rollover risk), and dangerous paths (e.g., roadway departure).
  • Vehicle equipment operational faults (e.g., blown tire, engine stall, brake failure, steering failure, lighting system failure, transmission failure, uncommanded engine power, autonomy equipment failure, electrical system failure, vehicle diagnostic trouble codes).
  • Vehicle equipment maintenance faults (e.g., improper tire pressure, bald tires, misaligned wheels, empty sensor cleaning fluid reservoir, depleted fuel/battery).
  • Operational degradation of sensors and actuators including temporary (e.g., accumulation of mud, dirt, dust, heat, water, ice, salt spray, smashed insects) and permanent (e.g., manufacturing imperfections, scratches, scouring, aging, wear-out, blockage, impact damage).
  • Equipment damage including detecting and mitigating catastrophic losses (e.g., vehicle collisions, lightning strikes, roadway departure), minor losses (e.g., sensor knocked off, actuator failures), and temporary losses (e.g., misalignment due to bent support bracket, loss of calibration).
  • Incorrect, missing, stale, and inaccurate map data.
  • Training data incompleteness, incorrectness, known bias, or unknown bias.
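
Several items in this list distinguish transient from permanent faults. One simple way to make that distinction at run time is to count consecutive failed health checks, sketched below in Python. The FaultMonitor class and its threshold of three are assumptions for illustration only; real thresholds depend on the sensor, the check rate, and the safety analysis.

```python
from dataclasses import dataclass


@dataclass
class FaultMonitor:
    """Classifies a fault by how many health checks in a row have failed.

    Hypothetical sketch: the threshold value is an assumption, not a
    recommendation.
    """
    persistent_after: int = 3      # failures in a row before "persistent"
    consecutive_failures: int = 0  # running count, reset on any pass

    def update(self, check_passed: bool) -> str:
        """Feed one health-check result and return the current classification."""
        if check_passed:
            self.consecutive_failures = 0
            return "ok"
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.persistent_after:
            return "persistent"
        return "transient"


monitor = FaultMonitor()
history = [monitor.update(passed)
           for passed in [True, False, True, False, False, False]]
print(history)
```

A single dropped check followed by a pass is reported as transient and then cleared, while three failures in a row are reported as persistent. This sketch does not latch the persistent state, so a later passing check clears it; whether that is acceptable is itself a fault-response design decision.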

Fault Responses:

Some of the faults and limitations fall within the purview of safety standards that apply to non-autonomous functions. However, a unified system-level view of fault detection and mitigation can be useful to ensure that no faults are left unaddressed. More importantly, to the degree that safety standards take credit for a human driver participating in fault mitigation, removing the driver shifts those fault mitigation obligations onto the autonomy.
  • How the system behaves when encountering an exceptional operational state space, experiencing a fault, or reaching a system limitation.
  • Diagnostic gaps (e.g., latent faults, undetected faults, undetected faulty redundancy).
  • How the system re-integrates failed components, including recovery from transient faults and recovery from repaired permanent faults during operation and/or after maintenance.
  • Response and policies for prioritizing or otherwise determining actions in inherently risky or certain-loss situations.
  • Withstanding an attack (system security, compromised infrastructure, compromised other vehicles), and deterring inappropriate use (e.g., malicious commands, inappropriately dangerous cargo, dangerous passenger behavior).
  • How the system is updated to correct functional defects, security defects, safety defects, and addition of new or improved capabilities.
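
To make the first item in this list a little more concrete, the Python sketch below maps each active fault to a response and picks the most restrictive one, treating any fault that has no policy entry as the worst case. The response levels, their ordering, the fault names, and the POLICY table are all hypothetical and are not taken from any standard; deciding which response is actually safest in a given situation is a substantial engineering problem in its own right.

```python
from enum import IntEnum


class Response(IntEnum):
    """Hypothetical responses, ordered from least to most restrictive."""
    CONTINUE = 0
    REDUCED_SPEED = 1
    PULL_OVER = 2
    IMMEDIATE_STOP = 3


# Hypothetical policy table: fault label -> required response.
POLICY = {
    "sensor_dirty": Response.REDUCED_SPEED,
    "map_stale": Response.REDUCED_SPEED,
    "tire_blowout": Response.PULL_OVER,
    "steering_failure": Response.IMMEDIATE_STOP,
}


def select_response(active_faults) -> Response:
    """Return the most restrictive response demanded by any active fault.

    A fault with no policy entry is assumed to need the most restrictive
    response, so a gap in the table fails toward caution.
    """
    return max(
        (POLICY.get(fault, Response.IMMEDIATE_STOP) for fault in active_faults),
        default=Response.CONTINUE,  # no active faults
    )
```

For example, select_response(["sensor_dirty", "tire_blowout"]) yields Response.PULL_OVER, and an unrecognized fault label yields Response.IMMEDIATE_STOP.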

Is there anything we missed?

(This is an excerpt of Koopman, P. & Fratrik, F., "How many operational design domains, objects, and events?" SafeAI 2019, AAAI, Jan 27, 2019.)