Edge Case Defense — How We Enable Warfighter Safety

Edge Case Research
13 min read · Mar 1, 2023


ECD System Safety Process Overview

Edge Case Defense (ECD) has a mission of enabling warfighter safety across the US DoD, and we do this by ensuring the safety of the systems the DoD acquires. We use our expertise and experience to apply MIL-STD 882, DoD Standard Practice — System Safety to systems of all types, with a focus on complex, software-intensive vehicle and weapon systems. As a spin-out from the Carnegie Mellon National Robotics Engineering Center, many of the systems we work on are some combination of robotic, unmanned, and autonomous.

ECD safety engineering capabilities are specifically tailored to facilitate DoD acquisition. The lifecycle enablement built into our approach is a critical multiplier at a time when system autonomy is ever increasing, software dependency continues to grow, and there is pressure to deploy and update systems rapidly. This article details the ECD MIL-STD 882 process along with insights and pitfalls we’ve noted over the years.

MIL-STD 882 background and overview

MIL-STD 882 specifies the system safety process as including the eight elements in Figure 1. Even at a high level, it’s easy to see how the eight elements impact the entire system lifecycle, from cradle to grave.

Figure 1 — MIL STD 882 eight elements of the system safety process

The elements have a sequential relationship but as with many systems engineering processes there are iterative feedback loops which update, refine, and improve as the system lifecycle progresses.

ECD insight: Safety risk management has many similarities to more general program risk management in that risks are identified, documented, mitigated, and accepted. Tracking the status of the safety program and its associated risks in a clearly documented manner is critical to success of the program.

MIL-STD 882 normative sections and highlights

The details of the eight elements are further discussed in sections 4.2 and 4.3 of the MIL-STD, where requirements for the system safety program are captured. A highlight from the requirements in section 4.2 is that all functional disciplines (e.g., systems engineers, software engineers, hardware engineers, and test engineers) are responsible for applying system safety measures to their area.

ECD Insight: Safety engineers should have insight into the specifications and designs of all functional areas to ensure safety process adequacy throughout the system. Safety requirements are a key feedback mechanism for safety engineers to influence system designs.

ECD Pitfall: Commercial Off-the-Shelf (COTS), Government-Off-the-Shelf (GOTS), Non-Developmental Item (NDI), Government-Furnished Equipment (GFE), and Government-Furnished Information (GFI) commonly lack adequate evidence of safety. However, MIL STD 882 applies to these items, and they must be addressed for safety implications.

System safety process requirements

Section 4.3 covers all of the eight elements of the system safety process.

Documenting the system safety approach is most commonly achieved through safety planning documents like a government System Safety Management Plan (SSMP) and developing contractor System Safety Program Plan (SSPP). Initiating a Hazard Tracking System (HTS) in this phase allows continuous monitoring of the state of the safety program and its risks as they are identified and managed.

ECD Insight: The government Program Office will capture their intended management process for the system safety program in a System Safety Management Plan (SSMP). The developing contractor responds to the SSMP with their System Safety Program Plan (SSPP). Ask the government customer if a SSMP is available for review before finalizing the SSPP.

Types of hazard analysis vs techniques for hazard analysis

Safety process elements 2 through 5 in Figure 1 rely heavily on a number of hazard analyses, which are a key aspect of system safety engineering. It’s common for DoD contracts to cite MIL-STD 882 200 series tasks as contract deliverables in order to show how these process elements were achieved. These 200 series hazard analysis tasks largely describe types of hazard analysis, to include goals, scopes, and specific risks to be included. The performer is required to specify the methods and techniques intended to be used for each required hazard analysis type. For example, Task 205 System Hazard Analysis describes identifying hazards in the integrated system design. Since Task 205 requires consideration of independent, dependent, and simultaneous events that could increase risk, a Fault Tree Analysis (FTA) is a natural fit for the technique to be used.
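To make the FTA fit concrete, here is a minimal sketch of evaluating a small fault tree by combining basic-event probabilities through AND/OR gates. The event names and probabilities are hypothetical illustrations, not drawn from any real analysis, and the sketch assumes the basic events are independent, as a first-pass quantitative screen often does.

```python
# Minimal fault tree evaluation sketch. Event names and probabilities are
# HYPOTHETICAL, for illustration only; assumes independent basic events.

def and_gate(probs):
    """All inputs must occur: multiply independent probabilities."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def or_gate(probs):
    """Any input may occur: complement of none occurring."""
    result = 1.0
    for p in probs:
        result *= (1.0 - p)
    return 1.0 - result

# Hypothetical top event "uncommanded motion": requires a command-path
# fault (either of two causes) AND a failed hardware interlock.
command_fault = or_gate([1e-4, 5e-5])        # software fault OR sensor fault
top_event = and_gate([command_fault, 1e-3])  # AND interlock failure

print(f"Top event probability: {top_event:.2e}")
```

Note how the AND gate with the interlock drives the top-event probability well below either contributing fault, which is why independent interlocks figure so prominently in risk assessment.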

A primary result of hazard analysis is the identification of safety requirements. This might flag existing requirements as having safety impact or identify new mitigating requirements, such as those for detecting and responding to unsafe conditions. Identifying and tracking safety requirements is necessary to enable later safety verification, which results in credible reduction of safety risk.

ECD Insight: Functional Hazard Analysis (FHA) is described in MIL STD 882 Task 208. A tabular technique for FHA is described in detail in the Joint Services — Software Safety Authorities (JS-SSA) Software System Safety Implementation Process and Tasks Supporting MIL-STD 882E (further discussed here as the JS-SSA Implementation Guide (IG)). This FHA technique is powerful for complex systems, especially those levying safety requirements on a combination of hardware, software, and human interactions. By focusing on functions rather than component types, failure modes can be simplified using guidewords and concepts from Nancy Leveson’s Systems Theoretic Process Analysis (STPA) method. STPA originated with software-intensive control systems in mind. Unsafe functionality resulting from missing, inadvertent, late, or similar guidewords can be considered without needing to focus on the originating causal factor and, for example, the complexities in software that may have caused the unsafe functionality. As described in Task 208, the FHA should be initiated as early in the systems engineering process as possible (which can precede system detailed design).

ECD Insight: ECD loves FHA as a technique. The FHA technique can be useful for other types of hazard analysis such as Subsystem Hazard Analysis (SSHA) where a tabular method is appropriate. When using FHA as a technique, the difference in the hazard analysis types can be defined by the scope and boundary of where the analysis starts and ends. For example, the Task 208 FHA may consider the system and its subsystems, while the Task 204 SSHA may provide a deeper dive into the low-level functions of those subsystems identified as safety critical.

Risk definition and risk acceptance summary

It’s important to note the risk assessment definitions in section 4.3.3 (shown below in Table 1, Table 2, and Table 3) and how those risks are accepted.

ECD Insight: Some government program offices have approved modifications for severity, probability, risk assessment, or risk acceptance definitions and processes. Modifications should be supported with justification / rationale. If the government program office does not have an approved set of modifications, it’s wise to assume the MIL-STD definitions apply.

ECD Pitfall: Software contribution to system risk is handled separately per section 4.4, which uses evidence from software Level of Rigor tasks to adequately mitigate software risk rather than quantifying a reduction in likelihood of unsafe software functionality. Application of probabilities to software failures is cited as “hard at best” in the MIL-STD and should generally be avoided.

Table 1 — MIL-STD 882 Hazard Severity Categories
Table 2 — MIL-STD 882 Hazard Probability Categories and example quantification
Table 3 — MIL-STD 882 Risk Assessment Matrix

ECD Insight: Since many DoD vehicles and weapon systems have the potential to cause loss of life due to unsafe system functionality, it is not uncommon to have hazards with Catastrophic severity based on considering the credible worst-case result of unsafe functionality. Related to this is the common DoD stance that hazards are not reduced in severity (or eliminated) without changes to the system such that the energy source is inherently modified or removed. For example, a laceration risk may be eliminated if a sharp edge is removed from the design and replaced by a rounded surface. More often, high severity risks can be mitigated to reduce their probability. In this way, hazards with Catastrophic potential are commonly targeted for reduction to Improbable likelihood, which results in a generally acceptable Medium risk.

ECD Insight: Note that the Improbable probability level is cited as “less than one in a million” likelihood of mishap with an unspecified time frame. It’s important to define and clarify how this will be assessed with stakeholders. For example, the one in a million could be in the lifetime of an item, the lifetime of a fleet, or per demand of a safety critical functionality. The lifetime may look at a fielded system lifecycle or consider shorter events like demonstrations or tests.
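To illustrate why the assessment basis matters, the sketch below converts a hypothetical per-demand probability into the chance of at least one mishap over different numbers of demands (assuming independent demands; all numbers are illustrative, not from any program):

```python
# Sketch of how the "one in a million" framing changes with the
# assessment basis. All numbers are HYPOTHETICAL illustrations.

def prob_at_least_one_mishap(p_per_demand, n_demands):
    """Probability of >= 1 mishap over n independent demands."""
    return 1.0 - (1.0 - p_per_demand) ** n_demands

p = 1e-6  # one-in-a-million per demand of a safety critical function

# Same per-demand probability, three different assessment bases:
per_test_event = prob_at_least_one_mishap(p, 100)        # short test campaign
per_item_life = prob_at_least_one_mishap(p, 10_000)      # one item's lifetime
per_fleet_life = prob_at_least_one_mishap(p, 1_000_000)  # fleet lifetime

print(f"100 demands:       {per_test_event:.2e}")
print(f"10,000 demands:    {per_item_life:.2e}")
print(f"1,000,000 demands: {per_fleet_life:.2e}")
```

With a million fleet-lifetime demands, a one-in-a-million per-demand probability yields a better-than-even chance of a mishap, which is exactly why the time frame and basis must be pinned down with stakeholders.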

ECD Insight: Safety assessment can be qualitative, quantitative, or a combination. Each approach has advantages and disadvantages. MIL STD 882 states that quantitative safety analysis and assessment is generally preferred over qualitative. An advantage of a quantitative approach is that there are clear assessment thresholds. Disadvantages of a quantitative approach include that manufacturer failure data can lack evidence supporting its claims and that software failures are difficult to quantify. Advantages of a qualitative approach are that scrutiny can focus on the number of interlocks that must be defeated to cause a hazard, and that it is consistent with some aspects of software safety assessment. A disadvantage of a qualitative approach is the potential ambiguity of the probability assessments.

Software safety and level of rigor summary

Section 4.4 details software contribution to system risk. The standard states that determination of the probability of failure of a single software function is difficult at best. Hence, it requires a different approach for assessing software and handling of those software items which are determined to have safety significance. New terminology is introduced, including:

Software Control Category (SCC), defined in Table 4 — A simpler way to consider SCC is: for a given software function, if unsafe functionality occurs, how many unsafe actions from other independent functions (i.e., independent interlocks) must also occur before a mishap results? The fewer other interlocks, the higher the level of software control of safety. This is shown later in Table 7 — Software hazard causal factor risk assessment criteria.

Software Criticality Index (SwCI), defined in Table 5 — SwCI (often pronounced “swicky”) uses the SCC and considers the credible worst-case hazard(s) the software function could cause. Based on the severity and the SCC, the SwCI is determined. Note that a SwCI rating of 1 is “red”, but this is not a risk rating, nor is it unacceptable. The SwCI simply determines the development rigor required for the software.

Level of Rigor (LoR), with LoR tasks defined in Table 5 — The LoR tasks and resulting evidences are required for a given software item based on its assigned SwCI. An LoR task template is also provided in the JS-SSA IG and is the default task list starting point for most program offices.
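The relationship between severity, SCC, and SwCI can be sketched as a simple table lookup. The matrix values below are illustrative placeholders only; the authoritative assignments are in the Software Safety Criticality Matrix (Table 5 / MIL-STD 882E Table V) and must be taken from the standard:

```python
# Sketch of a SwCI lookup keyed by (severity, SCC). The cell values are
# ILLUSTRATIVE PLACEHOLDERS -- the authoritative assignments come from
# the Software Safety Criticality Matrix in MIL-STD 882E.

SEVERITIES = ["Catastrophic", "Critical", "Marginal", "Negligible"]
SCC_LEVELS = [1, 2, 3, 4, 5]  # 1 = Autonomous ... 5 = No Safety Impact

# Rows = SCC 1..5; columns follow SEVERITIES order. Placeholder values.
SWCI_MATRIX = [
    [1, 1, 3, 5],  # SCC 1
    [1, 2, 3, 5],  # SCC 2
    [2, 3, 4, 5],  # SCC 3
    [3, 4, 4, 5],  # SCC 4
    [5, 5, 5, 5],  # SCC 5
]

def swci(severity: str, scc: int) -> int:
    """Look up SwCI from worst-case hazard severity and assigned SCC."""
    return SWCI_MATRIX[scc - 1][SEVERITIES.index(severity)]

print(swci("Catastrophic", 1))  # highest rigor end of the matrix
print(swci("Negligible", 5))    # lowest rigor end of the matrix
```

The structure makes the key point visible: higher software control (lower SCC number) and higher severity drive the SwCI toward 1 and the required LoR evidence upward.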

ECD Pitfall: It’s critically important to be aware of the nature of software LoR tasks and their expected evidences prior to committing to a safety concept within the system architecture. Allocating LoR 1/2 to overly complex software can lead to safety designs that cannot be adequately verified and result in elevated safety risk.

ECD Insight: Using Functional Hazard Analysis, as discussed in MIL STD 882E App B, is incredibly useful for determining how software functions may contribute to system-level hazards and hence for identifying the correct SwCIs.

Table 4 — Software Control Categories
Table 5 — Software Safety Criticality Matrix

ECD Pitfall: Avoid assessing SCC autonomy levels based on the concept of operations for uncrewed systems. SCC levels of autonomy are related to the level of direct control that a software function has over a hazard or mishap. For example, a teleoperated robot with no autonomous mobility control may still have autonomous SCC software responsible for detection and failsafe handling of lost radio link scenarios where the operator has no way to intervene. It’s important to use an analysis like FHA to assess each software function and find its appropriate SCC and SwCI.

As the software engineering lifecycle is executed, the identification of the appropriate LoR tasks for each software item allows identification of the required software safety LoR evidences. These evidences span the software development lifecycle to include planning, specification, design, coding, and verification. Early identification of safety critical software is a key enabler of success. Failure to provide evidence of the necessary LoR tasks requires consideration of Table 6 to assess the safety risk impact.

ECD Pitfall: Developing and verifying safety critical software is resource intensive. As much as possible, architect to minimize safety critical software (e.g., use more easily verifiable electromechanical mitigations instead of software), and where safety critical software is required, aim to isolate it from non-safety software. Failure to architect the system in this manner can result in safety critical functionality being scattered throughout the design, causing the associated schedule and budget resources to multiply rapidly.

ECD Insight: LoR tasks can be tailored as described in the JS-SSA IG, but must be agreed upon by the developer, acquirer, and any other relevant government safety assessors. Rationale / justification should be provided to tailor LoR tasks.

ECD Insight: MIL-STD 882 has less to say about hardware in safety critical functions compared to its software safety guidance. Other standards like ISO-26262 and ISO-13849 have detailed, quantifiable hardware safety guidance that is worth considering as a best practice. This includes redundancy-based architecture with fault detection and response.

Table 6 — Relationship between SwCI, Risk Level, LoR Tasks, and Risk

Verification, validation, and documenting risk reduction

As previously mentioned, safety requirements will result from the hazard analysis process and require complete and methodical verification. In addition to verifying the system achieves its mission, verification of safety requirements will demonstrate that off-nominal, unexpected, or anomalous conditions are handled in a safe manner. Safety significant software also has unique evidence requirements as mandated by the identified LoR, such as structural coverage of the software code. The results of verifying safety requirements can be captured in the HTS, and hazard risk reduction credit can be gained based on traceability from each safety requirement to the hazards it helps mitigate.
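As a hedged sketch of that traceability idea, a completeness check over HTS records might flag hazards that lack traced or fully verified mitigating requirements. The record fields and IDs below are hypothetical, not ECD’s actual HTS schema:

```python
# Hypothetical HTS traceability check. Field names, IDs, and records are
# ILLUSTRATIVE, not a real HTS schema or real program data.

hazards = {
    "HAZ-001": {"mitigating_reqs": ["SR-10", "SR-11"]},
    "HAZ-002": {"mitigating_reqs": ["SR-12"]},
    "HAZ-003": {"mitigating_reqs": []},  # no mitigation traced yet
}

requirements = {
    "SR-10": {"verified": True},
    "SR-11": {"verified": False},  # verification still open
    "SR-12": {"verified": True},
}

def open_findings(hazards, requirements):
    """Flag hazards lacking traced or fully verified mitigating requirements."""
    findings = []
    for haz_id, haz in hazards.items():
        reqs = haz["mitigating_reqs"]
        if not reqs:
            findings.append((haz_id, "no mitigating requirements traced"))
        elif not all(requirements[r]["verified"] for r in reqs):
            findings.append((haz_id, "mitigating requirements not fully verified"))
    return findings

for haz_id, issue in open_findings(hazards, requirements):
    print(f"{haz_id}: {issue}")
```

A check like this is what makes risk reduction credit defensible: a hazard only closes when every requirement traced to it has verification evidence behind it.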

ECD Insight: Software LoR evidence is commonly a combination of process documentation, analytical evidence, and test evidence from a number of environments like unit testing, integration testing, and system testing.

ECD Insight: Careful consideration should be given to simulator verification, validation, and accreditation (VV&A) with stakeholders as early as possible. The state space and accuracy in simulators needed to verify complex functionality is challenging. For example, environment simulation to verify and validate perception functionality in an autonomous vehicle has a much larger state space than that of a weapon simulator used to test a vehicle weapon interface. Gaining acceptable confidence in the simulator(s) to enable use for safety evidence collection is resource intensive for smaller scopes of simulation. For autonomous systems, simulation may need to aid in gaining coverage and volume of test cases across a much larger number and range of variables. Stakeholder VV&A expectations should be elicited before counting on simulator data as part of the safety story.

Accepting and documenting risk

A number of events throughout the system lifecycle will require acceptance of safety risk. Examples might be acquisition milestone based (e.g., to support design reviews or milestone decisions) or based on warfighter usage of the system (e.g., demonstrations, test events, or fielding events). For all of these, it’s common for the developing contractor to deliver a Safety Assessment Report (SAR). The SAR takes a snapshot of the safety story at a point in time, for a (set of) specific system configuration(s), in a defined set of environments and use cases. This commonly includes the latest HTS, and associated hazard analyses and verification data used to populate the HTS.

ECD Insight: Specifically for autonomous systems — understanding how the system works and how it might fail is critical to enabling sufficient hazard analysis and safety risk reduction. Using emerging industry standards like UL 4600 — Standard for Safety — Evaluation of Autonomous Products can accelerate this learning. Additionally, given the novel implementation methods often required for autonomous systems, some level of safety consideration for unknown unknowns is likely required. Focusing on system robustness via methods like fault insertion is critical to ensure complex autonomous functionality can recognize and handle edge cases.

ECD Pitfall: The SAR is sometimes the primary or only safety deliverable on contract. However, a SAR without the associated safety process artifacts will merely be a shell that does not hold up under assessor scrutiny. A critical eye on safety deliverables in the government Request for Proposal is necessary to ensure the right technical information can be generated and is supported with adequate schedule and budget.

Appendix B provides additional software system safety guidance. Table 7 and Figure 2 below provide useful concepts and visuals for considering interlock counts in risk assessment and how to integrate software, hardware, and operator elements into risk assessment.

Table 7 — Software hazard causal factor risk assessment criteria
Figure 2 — Assessing software contribution to system risk

Safety engineering in the systems engineering lifecycle

ECD executes system safety within the program systems engineering lifecycle to ensure that adequate planning, resourcing, and risk decision making occurs. Successful programs recognize that system safety is not a result of a certain test event at the end of development, but rather a holistic process that includes safety considerations from program inception.

ECD Pitfall: Missing/weak systems or software engineering core practices can be a blocker to a successful system safety program. System decomposition, traceability, configuration management, quality assurance, and verification & validation for all functionalities are table stakes for developing DoD weapon and vehicle systems.

Figure 3 shows ECD’s system safety process which is categorized into Safety Planning, Safety Analysis, Safety Verification, and Safety Reporting. Additional detailed writeups will be available for commonly executed MIL-STD 882 tasks including: SSPP, HTS, Preliminary Hazard Analysis, System Hazard Analysis, Subsystem Hazard Analysis, System Requirements Hazard Analysis, Operations & Support Hazard Analysis, Safety Verification and Validation & Software Level of Rigor evidences, and SAR.

ECD Insight: Given the rise in level of system autonomy and inclusion of complex or black box functionality like machine learning, developers and acquirers will benefit from discussing risk with stakeholders early and often. Planning for elevated risk acceptance when needed and understanding missions where elevated risk can be accepted will enable the acquisition effort to take early steps to success.

Figure 3 — ECD system safety process

About Edge Case Research:

At Edge Case Research, we believe that complex systems should be built safely from the ground up. Founded in 2014 by leaders in autonomous vehicle safety, our expert team provides system safety engineering services, nLoop Live Safety Case software, and risk management solutions. Our clients include government agencies and developing contractors in the aerospace and defense industries, as well as automotive and trucking OEMs, Tier 1 suppliers, and insurance providers. For more information, visit ecr.ai or ecr-defense.ai.
