Reason's Swiss Cheese Model

Risk Management and Error Trapping in Software and Hardware Development, Part 2

This is part 2 of a 3-part piece on risk management and error trapping in software and hardware development. The first post is located here (and should be read first to provide context for the content below).

Error Causality, Detection & Prevention

Errors occurring during software and hardware development (resulting in bugs) can be classified into two broad categories: (1) skill-based errors, and (2) knowledge-based errors.

Skill-based errors

Skill-based errors are errors that emerge through the application of knowledge and experience. They are differentiated from knowledge-based errors in that they arise not from a lack of knowing what to do, but from either misapplication or failure to apply what is known. The two types of skill-based errors are errors of commission and errors of omission.

Errors of commission are the misapplication of previously learned behavior or knowledge. To use a rock-climbing metaphor, if I tied my climbing rope to my harness with the wrong type of knot, I would be committing an error of commission. I know I need a knot, I know which knot to use, and I know how to tie the correct knot – I simply did not tie it correctly. In software development, one example of an error of commission is an engineer providing the wrong variable to a function call, as in:

var x = 1;        // variable to provide
var y = false;    // variable not to provide
function callVariable(x) {
    return x;
}
callVariable(y);  // should have provided x but gave y instead

Errors of omission, by contrast, are the failure to apply knowledge or experience (previously learned behaviors) to the given problem. In my climbing example, not tying the climbing rope to my harness (at all) before beginning to climb is an error of omission. (Don’t laugh – this actually happens.) In software development, an example of an error of omission would be an engineer forgetting to provide a variable to a function call (or forgetting to add the function call at all), as in:

var x = 1;        // variable to provide
var y = false;    // variable not to provide
function callVariable(x) {
    return x;
}
callVariable();   // should have provided x but was left empty

Knowledge-based errors

Knowledge-based errors, in contrast to skill-based errors, arise from the failure to know the correct behavior to apply (if any). An example of a knowledge-based error would be a developer checking in code without any unit, integration, or system tests. If the developer is new and was never taught that code check-in requires writing and running a suite of automated unit, integration, and system tests, the error is caused by a lack of knowledge (as opposed to an error of omission, where the developer had been informed of the need to write and run the tests but failed to do so).
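
To make the check-in requirement concrete, here is a minimal sketch of the kind of unit test such a policy might demand before code is accepted. It assumes Node.js and its built-in assert module, and reuses the callVariable example from above.

// Minimal unit test a check-in policy might require (sketch, assuming Node.js
// and its built-in assert module).
var assert = require('assert');

function callVariable(x) {
    return x;
}

// Verify the function returns exactly the value it was given.
assert.strictEqual(callVariable(1), 1);
assert.strictEqual(callVariable(false), false);
console.log('All checks passed.');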

Defense-in-depth, the Swiss cheese model, bug prevention and detection

Prevention comprises the systems and processes employed to trap bugs and stop them from getting through development environments and into certification and/or production environments (depending on your software / hardware release process). In envisioning our Swiss cheese model, we need to understand that the layers include both latent error traps (built into the system and always in place) and active error traps (requiring deliberate human action), and that each layer is designed to mitigate certain types of errors.
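
As a rough illustration of why layering matters, the sketch below models each layer as an independent error trap with some probability of catching a given bug. The layer names and probabilities are hypothetical, chosen only to show how quickly the odds of a bug slipping through every layer shrink.

// Illustrative sketch only: each defensive layer is modeled as an independent
// trap with a hypothetical probability of catching a given bug.
var layers = [
    { name: 'code review',           trapProbability: 0.5 },  // active
    { name: 'unit tests',            trapProbability: 0.7 },  // active
    { name: 'integration tests',     trapProbability: 0.6 },  // active
    { name: 'release certification', trapProbability: 0.8 }   // active
];

// Probability that a bug slips through every layer (the "holes" line up).
var escapeProbability = layers.reduce(function (p, layer) {
    return p * (1 - layer.trapProbability);
}, 1);

console.log(escapeProbability.toFixed(3)); // "0.012" with the numbers above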

The following are intended to aid in preventing bugs:

Tools & methods to mitigate skill-based errors in bug prevention:

  • Code base and architecture [latent] (see the sketch after this list)
  • Automated test coverage [active]
  • Manual test coverage [active]
  • Unit, feature, integration, system, and story tests [active]
  • TDD / ATDD / BDD / FDD practices [active]
  • Code reviews [active]
  • Pair Programming [active]
  • Performance testing [active]
  • Software development framework / methodology (e.g., Scrum, Kanban, DevOps) [latent]
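
As an example of how the code base itself can serve as an error trap, the sketch below (illustrative only) adds a guard clause to the earlier callVariable example so that the error of omission is caught at the moment of the call, rather than surfacing later as a wrong result.

// Illustrative sketch: a guard clause as a small error trap built into the
// code base itself, catching the earlier error of omission at call time.
function callVariable(x) {
    if (x === undefined) {
        throw new TypeError('callVariable requires an argument');
    }
    return x;
}

callVariable(); // now fails loudly here instead of quietly returning undefined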

Tools & methods to mitigate knowledge-based errors in bug prevention:

  • Education & background [latent]
  • Recruiting and hiring practices [active]
  • New-hire Onboarding [active]
  • Performance feedback & professional development [active]
  • Design documents [active]
  • Definition of Done [active]
  • User Story Acceptance Criteria [active]
  • Code reviews [active]
  • Pair Programming [active]
  • Information Radiators [latent]

Detection is the term for the ways in which we find bugs, ideally in the development environment, though this phase also includes certification if your organization has a certification / QA phase. The primary focus of detection methods is to ensure that no bugs escape into production. As such, the entire software certification system may itself be considered one large, active layer of error trapping. In fact, in many enterprise companies, the certification or QA team (if you have one) is the last line of defense.

The following are intended to aid in detecting bugs:

Tools & methods to mitigate skill-based errors in detecting bugs:

  • Automated test coverage [active]
  • Manual test coverage [active]
  • Unit, feature, integration, system, and story tests [active]
  • TDD / ATDD / BDD / FDD practices [active]
  • Release certification testing [active]
  • Performance testing [active]
  • User Story Acceptance Criteria [active]
  • User Story “Done” Criteria [active]
  • Bug tracking software [active]
  • Triage reports [active] (see the sketch after this list)
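
To illustrate the last item, the sketch below (all names hypothetical) runs a set of certification-style checks to completion and collects the failures into a simple triage report, rather than stopping at the first failing check.

// Illustrative sketch (names hypothetical): run every check, then report all
// failures together as a simple triage report.
function callVariable(x) {
    return x;
}

var checks = [
    { name: 'returns the provided value', pass: callVariable(1) === 1 },
    { name: 'preserves boolean input',    pass: callVariable(false) === false },
    { name: 'rejects a missing argument', pass: (function () {
        try { callVariable(); return false; } catch (e) { return true; }
    })() }
];

var failures = checks.filter(function (c) { return !c.pass; });
console.log((checks.length - failures.length) + '/' + checks.length + ' checks passed');
failures.forEach(function (f) { console.log('TRIAGE: ' + f.name); });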

Tools & methods to mitigate knowledge-based errors in detecting bugs:

  • Education & background [latent]
  • Professional development (individual / organizational) [latent / active]
  • Code reviews [active]
  • Automated & manual test coverage [active]
  • Unit, feature, integration, system, story tests [active]

When bugs “escape” the preventative measures of your Defense-in-depth system and are discovered in either the development or production environment, a root cause analysis should be conducted, based on the nature of the bug, to determine how it could have been prevented and / or detected earlier. Based upon the findings of your root cause analysis, your system can be improved in specific, meaningful ways to increase both its robustness and resilience.

How an organization should conduct root cause analysis, analyze risk and make purposeful decisions about it, and improve its system is the subject of part 3 in this series, available here.

 

Chris Alexander is a former U.S. Naval Officer who was an F-14 Tomcat flight officer and instructor. He is Co-Founder and Executive Team Member of AGLX Consulting, creators of the High-Performance Teaming™ model, a Scrum Trainer, Scrum Master, and Agile Coach.
