
Risk Management and Error Trapping in Software and Hardware Development, Part 3

This is part 3 of a 3-part piece on risk management and error trapping in software and hardware development. The first post is located here (and should be read first to provide context on the content below), and part 2 is located here.

Root Cause Analysis and Process Improvement

Once a bug has been discovered and risk analysis / decision-making has been completed (see below), a retrospective-style analysis of the circumstances surrounding the engineering practices which failed to trap the bug completes the cycle.

The purpose of the retrospective is not to assign blame or find fault, but rather to understand the cause of the failure to trap the bug, inspect the layers of the system, and determine if any additional layers, procedures, or process changes could effectively improve collective engineering surety and help to prevent future bugs emerging from similar causes.

Methodology

  1. Review the sequence of events that led to the anomaly / bug.
  2. Determine the root cause.
  3. Map the root cause to our defense-in-depth (Swiss cheese) model.
  4. Decide if there are remediation efforts or improvements which would be effective in supporting or restructuring the system to increase its effectiveness at error trapping.
  5. Implement any changes identified, sharing them publicly to ensure everyone understands the changes and the reasoning behind them.
  6. Monitor the changes, adjusting as necessary.
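
None of these steps requires special tooling, but recording each retrospective in a structured form keeps the analysis honest and reviewable. A minimal sketch of such a record in TypeScript (the field names are my illustrative assumptions, not part of the original process):

interface BugRetrospective {
  sequenceOfEvents: string[];   // 1. what happened, in order
  rootCauses: string[];         // 2. e.g., "Discipline", "Experience"
  layersMapped: string[];       // 3. layers that should have trapped the bug
  remediations: string[];       // 4-5. changes identified and shared publicly
  reviewDate: Date;             // 6. when the changes will be re-checked
}

const example: BugRetrospective = {
  sequenceOfEvents: ["story merged without tests", "bug found in certification"],
  rootCauses: ["Discipline"],
  layersMapped: ["TDD/ATDD practices", "definition of Done"],
  remediations: ["definition of Done now lists tests run and pass/fail results"],
  reviewDate: new Date("2025-01-01"),
};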

Review sequence of events

With appropriate representatives from engineering teams, certification, hardware, operations, customer success, etc., review the discovery path which led to finding the bug. The point is to understand the processes used, which ones worked, and which let the bug pass through.

Determine root cause and analyze the optimum layers for improvement

What caused the bug? There are many enablers and contributing factors, but typically only one or two root causes. The root cause is one or a possible combination of Organization, Communication, Knowledge, Experience, Discipline, Teamwork, or Leadership.

  • Organization – typically latent, organizational root causes include things like existing processes, tools, practices, habits, customs, etc., which the company or organization as a whole employs in carrying out its work.
  • Communication – a failure to convey necessary, important, or vital information to or among an individual or team who required it for the successful accomplishment of their work.
  • Knowledge – an individual, team, or organization did not possess the knowledge necessary to succeed. This is the root cause for knowledge-based errors.
  • Experience – an individual, team, or organization did not possess the experience necessary to successfully accomplish a task (as opposed to the knowledge about what to do). Experience is often a root cause in skill-based errors of omission.
  • Discipline – an individual, team, or organization did not possess the discipline necessary to apply their knowledge and experience to solving a problem. Discipline is often a root cause in skill-based errors of commission.
  • Teamwork – individuals, possibly at multiple levels, failed to work together as a team, support one another, and check one another against errors. Additional root causes may be knowledge, experience, communication, or discipline.
  • Leadership – less often seen at smaller organizations, a Leadership failure is typically a root cause when a leader and/or manager has not effectively communicated expectations or empowered execution regarding those expectations.

Map the root cause to the layer(s) which should have trapped the error

Given the root cause analysis, determine where in the system (which layer or layers) the bug should have been trapped. Often there will be multiple locations at which the bug should or could have been trapped; however, the best location to identify is the one which most closely corresponds to the root cause of the bug. Consideration should also be given to timeliness: the earlier an error can be caught or prevented (trapped), the less costly it is in terms of both time (to find, fix, and eliminate the bug) and effort (a bug in production requires more effort from more people than a developer discovering a bug while checking their own unit test).

While we should seek to apply fixes at the locations best suited for them, the earliest point at which a bug could have been caught and prevented will often be the optimum place to improve the system.

For example, if a bug was traced back to a team’s discipline in writing and using tests (root cause: discipline and experience), then it would map to layers dealing with testing practices (TDD/ATDD), pair programming, acceptance criteria, definition of “Done,” etc. Those layers to which the team can most readily apply improvements and which will trap the error sooner rather than later should be the focus for improvement efforts.
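
One way to make this mapping explicit is a simple lookup from root cause to the layers best positioned to trap errors of that kind. A hypothetical TypeScript sketch (each team would fill in the layer names from its own defense-in-depth model; these entries are purely illustrative):

type RootCause =
  | "Organization" | "Communication" | "Knowledge"
  | "Experience" | "Discipline" | "Teamwork" | "Leadership";

// Illustrative mapping from root cause to candidate layers for improvement.
const layerMap: Record<RootCause, string[]> = {
  Organization: ["development framework", "release process"],
  Communication: ["information radiators", "design documents"],
  Knowledge: ["onboarding", "code reviews", "pair programming"],
  Experience: ["pair programming", "professional development"],
  Discipline: ["TDD/ATDD practices", "definition of Done", "acceptance criteria"],
  Teamwork: ["pair programming", "code reviews"],
  Leadership: ["expectation-setting and empowerment practices"],
};

// For the testing example above: discipline maps to the layers to improve first.
console.log(layerMap["Discipline"]);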

Decide on improvements to increase system effectiveness

Based on the knowledge gained through analyzing and mapping the root cause, decisions are made on how to improve the effectiveness of the system at the layers identified. Using the testing example above, a team could decide to adjust its definition of Done to require listing the tests a story was verified against, along with their pass/fail results.

Implement the changes identified, and monitor them for effectiveness.

Risk Analysis

Should our preventative measures fail to stop a bug from escaping into a production environment, the level of risk needs to be explicitly analyzed. (This is often done, but in an implicit way.) The analysis of the level of risk derives from two areas.

Risk Severity – the degree of impact the bug can be expected to have on the data, operations, or functionality of affected parties (the company, vendors, customers, etc.).

Blocker – A bug that is so bad, or a feature that is so important, that we would not ship the next release until it is fixed/completed. Can also signify a bug that is currently impacting a customer’s operations, or one that is blocking development.
Critical – A bug that needs to be resolved ASAP, but for which we wouldn’t stop everything. Bugs in this category are not impacting operations (a customer’s, or ours), but are significant enough to warrant attention.
Major – Use best judgment to determine how this stacks against other work. The bug is serious enough that it needs to be resolved, but the value of other work and timing should be considered. If a bug sits in Major for too long, its categorization should be reviewed and either upgraded or downgraded.
Minor – A bug that is known, but which we have explicitly de-prioritized. Such a bug will be fixed as time allows.
Trivial – Seriously consider closing bugs at this level. At best, they should be put into the “Long Tail” for tracking.

Risk Probability – the likelihood, expressed as a percentage, that those potentially affected by the bug will actually experience it (e.g., always; only if they have a power outage; or only if the sun aligns with Jupiter during the slackwater phase of a diurnal tide in the northeastern hemisphere between 44 and 45 degrees latitude).

Definite 100% – issue will occur in every case
Probable 60-99% – issue will occur in most cases
Possible 30-60% – roughly a coin-flip; issue may or may not occur
Unlikely 2-30% – issue will occur in a small minority of cases
Won’t 1% or less – occurrence of the issue will be exceptionally rare

Given Risk Severity and Probability, the risk can be assessed according to the following matrix and assigned a Risk Assessment Code (RAC).

Risk Assessment Matrix

Severity \ Probability   Definite   Probable   Possible   Unlikely   Won’t
Blocker                      1          1          1          2        3
Critical                     1          1          2          2        3
Major                        2          2          2          3        4
Minor                        3          3          3          4        5
Trivial                      3          4          4          5        5

Risk Assessment Codes
1 – Strategic     2 – Significant     3 – Moderate     4 – Low     5 – Negligible

The Risk Assessment Codes are a significant factor in Risk decision-making.

  1. Strategic – the risk to the business or customers is significant enough that its realization could threaten operations, basic functioning, and/or professional reputation to the point that the basic survival of the business could be in jeopardy. As Arnold said in Predator: “We make a stand now, or there will be nobody left to go to the chopper!”
  2. Significant – the risk poses considerable, but not life-threatening, challenges for the business or its customers. If left unchecked, these risks may elevate to strategic levels.
  3. Moderate – the risk to business operations, continuity, and/or reputation is significant enough to warrant consideration against other business priorities and issues, but not significant enough to trigger higher responses.
  4. Low – the risk to the business is not significant enough to warrant special consideration of the risk against other priorities. Issues should be dealt with in routine, predictable, and business-as-usual ways.
  5. Negligible – the risk to the business is not significant enough to warrant further consideration except in exceptional circumstances (i.e., we literally have nothing better to do).
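
Because the matrix is small and fixed, it can be encoded directly, which also makes the severity and probability scales explicit in code. A minimal TypeScript sketch of the RAC lookup (the names and structure are my illustrative assumptions, not part of the original model):

type Severity = "Blocker" | "Critical" | "Major" | "Minor" | "Trivial";
type Probability = "Definite" | "Probable" | "Possible" | "Unlikely" | "Wont";

// Rows mirror the Risk Assessment Matrix above: severity x probability -> RAC (1-5).
const racMatrix: Record<Severity, Record<Probability, number>> = {
  Blocker:  { Definite: 1, Probable: 1, Possible: 1, Unlikely: 2, Wont: 3 },
  Critical: { Definite: 1, Probable: 1, Possible: 2, Unlikely: 2, Wont: 3 },
  Major:    { Definite: 2, Probable: 2, Possible: 2, Unlikely: 3, Wont: 4 },
  Minor:    { Definite: 3, Probable: 3, Possible: 3, Unlikely: 4, Wont: 5 },
  Trivial:  { Definite: 3, Probable: 4, Possible: 4, Unlikely: 5, Wont: 5 },
};

function assessRisk(severity: Severity, probability: Probability): number {
  return racMatrix[severity][probability];
}

assessRisk("Critical", "Possible"); // 2 – Significant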

Risk Decision

The risk decision is the point at which an explicit decision is made about how to handle the risk. Typically, risk decisions take the form of:

  • Accept – accept the risk as it is and do not mitigate or take additional steps.
  • Delay – for less critical issues or dependencies, a decision about whether to accept or mitigate a risk may be delayed until additional information, research, or steps are completed.
  • Mitigate – establish a mitigation strategy and deal with the risk.

For risk mitigation, feasible Courses of Action (CoAs) should be developed to assist in making the mitigation plan. These potential actions comprise the mitigation and/or reaction plan. Specifically, given a bug’s risk severity, probability, and resulting RAC, the courses of action are the possible mitigation solutions for the risk. Examples include:

— Pre-release —

  • Apply software fix / patch
  • Code refactor
  • Code rewrite
  • Release without the code integrated (re-build)
  • Hold the release and await code fix
  • Cancel the release

— In production —

  • Add to normal backlog and prioritize with normal workflow
  • Pull / create a team to triage and fix
  • Swarm / mob multiple teams on fix
  • Pull back / recall release
  • Release an additional fix as a micro-upgrade

All risk decisions should be recorded, and those which remain active need to be tracked. There are many methods available for logging and tracking risk decisions, from spreadsheets to documentation to support tickets. There are entire software platforms expressly designed to track and monitor risk status and record decisions taken (or not) about risks.

Decisions to delay risk mitigations are the most important to track: they still require action, and at the speed most businesses move today, a real risk exists of losing track of them. Therefore a Risk Log or regular Risk Review should be used to routinely revisit the status of pending risk decisions and reevaluate them. Risk changes constantly, and a risk’s severity and probability may shift significantly overnight. By reviewing risk decisions regularly, leadership can simultaneously ensure both that emerging risks are mitigated and that effort is not wasted unnecessarily (as when effort is put against a risk whose impact has significantly declined due to changes external to the business).
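
A risk log need not be elaborate; the essential point is that delayed decisions carry a review date so they cannot silently expire. A minimal TypeScript sketch (hypothetical field names, shown only to illustrate the idea):

type RiskDecision = "accept" | "delay" | "mitigate";

interface RiskLogEntry {
  description: string;
  rac: number;            // Risk Assessment Code, 1-5
  decision: RiskDecision;
  reviewBy?: Date;        // should be set whenever the decision is "delay"
}

// Surface delayed decisions whose review date has passed.
function overdueReviews(log: RiskLogEntry[], today: Date): RiskLogEntry[] {
  return log.filter(
    (entry) =>
      entry.decision === "delay" &&
      entry.reviewBy !== undefined &&
      entry.reviewBy.getTime() <= today.getTime()
  );
}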

Conclusion

I hope you’ve enjoyed this 3-part series. Risk management and error trapping is a complicated and – at times – complex topic. There are many ways to approach these types of systems and many variations on the defense-in-depth model.

The specific implementation your business or organization chooses to adopt should reflect the reality and environment in which you operate, but the basic framework has proven useful across many domains and industries, and is directly adapted from Operational Risk Management as I used to practice and teach it in the military.

Understanding the root cause of your errors, where they slipped through your system, and how to improve your system’s resiliency and robustness are critical skills which you need to develop if they are not already in place. A mindful, purposeful approach to risk decision-making throughout your organization is equally critical to your business operations.

Good luck!

 

Chris Alexander is a former U.S. Naval Officer who was an F-14 Tomcat flight officer and instructor. He is Co-Founder and Executive Team Member of AGLX Consulting, creators of the High-Performance Teaming™ model, a Scrum Trainer, Scrum Master, and Agile Coach.


Risk Management and Error Trapping in Software and Hardware Development, Part 2

This is part 2 of a 3-part piece on risk management and error trapping in software and hardware development. The first post is located here (and should be read first to provide context on the content below).

Error Causality, Detection & Prevention

Errors occurring during software and hardware development (resulting in bugs) can be classified into two broad categories: (1) skill-based errors, and (2) knowledge-based errors.

Skill-based errors

Skill-based errors are those errors which emerge through the application of knowledge and experience. They are differentiated from knowledge-based errors in that they arise not from a lack of knowing what to do, but instead from either misapplication or failure to apply what is known. The two types of skill-based errors are errors of commission, and errors of omission.

Errors of commission are the misapplication of a previously learned behavior or knowledge. To use a rock-climbing metaphor: if I tied my climbing rope to my harness with the wrong type of knot, I would be committing an error of commission. I know I need a knot, I know which knot to use, and I know how to tie it – I simply did not tie it correctly. In software development, one example of an error of commission might be an engineer providing the wrong variable to a function call, as in:

var x = 1;        // variable to call
var y = false;    // variable not to call

function callVariable(value) {
    return value;
}

callVariable(y);  // should have provided "x" but gave "y" instead

Errors of omission, by contrast, are the failure to apply knowledge or experience (previously learned behaviors) to the given problem. In my climbing example, not tying the climbing rope to my harness (at all) before beginning to climb is an error of omission. (Don’t laugh – this actually happens.) In software development, an example of an error of omission would be an engineer forgetting to provide a variable to a function call (or forgetting to add the function call at all), as in:

var x = 1;        // variable to call
var y = false;    // variable not to call

function callVariable(value) {
    return value;
}

callVariable();   // should have provided "x" but was left empty
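
As an aside, both of these skill-based errors are precisely the kind of mistake a statically typed language can trap at compile time. A minimal TypeScript sketch (my illustration, not part of the original example):

const x = 1;     // variable to call
const y = false; // variable not to call

function callVariable(value: number): number {
  return value;
}

// Both mistakes now fail to compile instead of slipping into the build:
// callVariable(y); // error: type 'boolean' is not assignable to 'number'
// callVariable();  // error: expected 1 argument, but got 0
callVariable(x);    // OK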

Knowledge-based errors

Knowledge-based errors, in contrast to skill-based errors, arise from the failure to know the correct behavior to apply (if any). An example of a knowledge-based error would be a developer checking in code without any unit, integration, or system tests. If the developer is new and was never taught that code check-in requires writing and running a suite of automated unit, integration, and system tests, this is an error caused by a lack of knowledge (as opposed to an error of omission, where the developer knew of the requirement to write and run the tests but failed to do so).

Defense-in-depth, the Swiss cheese model, bug prevention and detection

Prevention comprises the systems and processes employed to trap bugs and stop them from getting through development environments and into certification and/or production environments (depending on your software / hardware release process). In envisioning our Swiss cheese model, we need to understand that the layers include both latent and active types of error traps, and are designed to mitigate against certain types of errors.

The following are intended to aid in preventing bugs.

Tools & methods to mitigate against Skill-based errors in bug prevention:

  • Code base and architecture [latent]
  • Automated test coverage [active]
  • Manual test coverage [active]
  • Unit, feature, integration, system, and story tests [active] (see the test sketch following this list)
  • TDD / ATDD / BDD / FDD practices [active]
  • Code reviews [active]
  • Pair Programming [active]
  • Performance testing [active]
  • Software development framework / methodology (e.g., Scrum, Kanban, DevOps) [latent]
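
To illustrate how the test-based layers above trap skill-based errors, here is a minimal unit test in TypeScript, using Node’s built-in assert module, for the callVariable example from the previous section (the test itself is my illustrative sketch):

import assert from "node:assert";

function callVariable(value: number): number {
  return value;
}

// The test encodes the expected behavior; a commission error such as
// passing the wrong variable would surface here as a failing assertion
// before the code ever leaves the development environment.
const x = 1;
assert.strictEqual(callVariable(x), 1);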

Tools & methods to mitigate against Knowledge-based errors in bug prevention:

  • Education & background [latent]
  • Recruiting and hiring practices [active]
  • New-hire Onboarding [active]
  • Performance feedback & professional development [active]
  • Design documents [active]
  • Definition of Done [active]
  • User Story Acceptance Criteria [active]
  • Code reviews [active]
  • Pair Programming [active]
  • Information Radiators [latent]

Detection is the term for the ways in which we find bugs, hopefully in the development environment, though this phase also includes certification if your organization has a certification / QA phase. The primary focus of detection methods is to ensure no bugs escape into production. As such, the entire software certification system itself may be considered one large, active layer of error trapping. In fact, in many enterprise companies, the certification or QA team (if you have one) is the last line of defense.

The following are intended to aid in detecting bugs:

Tools & methods to mitigate against Skill-based errors in detecting bugs:

  • Automated test coverage [active]
  • Manual test coverage [active]
  • Unit, feature, integration, system, and story tests [active]
  • TDD / ATDD / BDD / FDD practices [active]
  • Release certification testing [active]
  • Performance testing [active]
  • User Story Acceptance Criteria [active]
  • User Story “Done” Criteria [active]
  • Bug tracking software [active]
  • Triage reports [active]

Tools & methods to mitigate against Knowledge-based errors in detecting bugs:

  • Education & background [latent]
  • Professional development (individual / organizational) [latent / active]
  • Code reviews [active]
  • Automated & manual test coverage [active]
  • Unit, feature, integration, system, story tests [active]

When bugs “escape” the preventative measures of your Defense-in-depth system and are discovered in either the development or production environment, a root cause analysis should be conducted on your system based on the nature of the bug and how it could have been prevented and/or detected earlier. Based upon the findings of your root cause analysis, your system can be improved in specific, meaningful ways to increase both its robustness and resilience.

How an organization should, specifically, conduct root cause analysis, analyze risk and make purposeful decisions about it, and improve its system is the subject of part 3 in this series, available here.

 


Risk Management and Error Trapping in Software and Hardware Development, Part 1

The way in which we conceptualize and analyze risk and error management in technology projects has never received quite the same degree of scrutiny as business process frameworks and methodologies such as Scrum, Lean, or Traditional Project Management. Yet risk is inherent in everything we do, every day, regardless of our industry, sector, work domain, or process.

We actually practice risk management in our everyday lives, often without consciously realizing that what we are doing is designed to manage levels of risk against degrees of potential reward, and to either prevent errors from occurring or minimize their impact when they do.

For example, I recently took a trip to the San Juan Islands with my wife and parents. I woke up early, made coffee, roused the troops, and checked the weather. I’d filled up the gas tank the day before and booked our ferry tickets online. Based on the forecast, I recommended we each take an extra layer. We departed the house a bit earlier than really necessary, but ended up encountering a detour along the way due to a traffic accident on the Interstate. Nevertheless, we made it to the ferry terminal with about 10 minutes to spare, just in time to drive onto the ferry and depart for Friday Harbor.

My personal example is relatively simple but, with a little analysis, demonstrates how intuitively we assess and manage risk:

  • Wake up early: mitigates risk of oversleeping and departing late (which could result further in forgetting important things, leaving coffee pot/equipment on, etc.), waiting on others in the bathroom, and not being able to prepare and enjoy some morning coffee (serious risk).
  • Check the weather: understanding the environment we are entering into is critical to mitigating environment-related risks – in this case real environmental concerns such as temperature, wind, and precipitation – enabling us to mitigate potentially negative effects and capitalize on positives. Bad weather might even cause us to change our travel plans entirely: a clear form of risk mitigation in which we weigh our chance of a successful journey against the value we would derive from undertaking it, and decide the goal is not sufficient to accept present risk levels.
  • Book ferry tickets online: a mitigation against the risk of arriving late and having to wait in line to purchase tickets, which could result in us missing the ferry due to either running out of time or the ferry already being completely booked.
  • Departing earlier than necessary: a mitigation against unforeseen and unknowable specific risk, in this case the generic risk of en route delays, which we did encounter on this occasion.

As you can see, as a story my preparations for our trip seem rather routine and unremarkable, but when viewed through the lens of risk mitigation and error management, each action and decision can be seen as specifically targeted to mitigate one or more specific risks or minimize the potential effects of an error. Unfortunately, our everyday intuitive actions and mental processes seldom translate into our work environments in such direct and meaningful ways.

Risk and Error Management in Software and Hardware Development – Defense-in-Depth and the Swiss Cheese Model

Any risk management system can be seen as a series of layers designed to employ a variety of means to mitigate risk and prevent errors from progressing further through the system. We call this “trapping errors.” Additionally, each of these layers is often just one part of a larger system. A system constructed with these layers is referred to as having “defense-in-depth.”

Defense-in-depth reflects the simple idea that instead of employing one single, catch-all solution for eliminating risk and trapping errors, a layered approach which employs both latent and active controls in different areas throughout the system will be far more effective in both detecting and preventing errors from escaping.

These layers are often envisioned as slices of Swiss cheese, with each slice representing a different part of the larger system. As a potential risk or error progresses through holes in the system’s layers, it should eventually be trapped in one of the layers.

Risk and errors are then only able to impact the system when all the holes in the system’s Swiss cheese layers “line up.”
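
A toy model makes the “holes line up” idea concrete: treat each layer as an imperfect filter, and an error impacts production only when every layer misses it. A minimal TypeScript sketch (the layer names and matching rules are invented purely for illustration):

type Layer = (bug: string) => boolean; // true if this layer traps the bug

// Each hypothetical layer traps some classes of bugs and misses others.
const codeReview: Layer = (bug) => bug.includes("logic");
const unitTests: Layer = (bug) => bug.includes("regression");
const qaPass: Layer = (bug) => bug.includes("ui");

const layers: Layer[] = [codeReview, unitTests, qaPass];

// A bug escapes only when the holes line up: no layer traps it.
function escapes(bug: string): boolean {
  return layers.every((layer) => !layer(bug));
}

console.log(escapes("logic error in pricing")); // false – trapped by code review
console.log(escapes("race condition"));         // true – slipped through every layer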

Latent and Active Layers

There are two basic types of layers (or traps) in any system: latent and active. In your day-to-day life, latent traps are things such as the tires on your car or the surface of the road. Active traps are things such as checking the weather, putting on safety gear, wearing a helmet, or deciding not to go out at all.

Latent layers in software or hardware development may be things such as the original (legacy) code base, development language(s) used, system architecture & design, hardware (types of disk drives, manufacturer), and so forth. It may even include educational requirements for hiring, hiring practices, and company values.

Active layers in software and hardware development may include release processes, User Story writing and acceptance criteria, and development practices like TDD/ATDD, test automation, code reviews, and pair programming.

Separation of Risk and Error Management Concerns

To focus on the most appropriate work at the appropriate time when responding to error detection, triage, and risk mitigation, we can separate our risk and error analysis into the following areas:

During development: focus on trapping errors

  • Prevention – the practices, procedures, and techniques we undertake in engineering disciplines to help ensure we do not release bugs or errors into our code base or hardware products.
  • Detection – the methods available to us as individual engineers, teams, and the organization as a whole to find and respond to errors in our code base or hardware products (which includes reporting and tracking).

Risk mitigation: steps for errors that have escaped into certification or production environments

  • Risk Analysis – the steps required to analyze the severity and impact of an error.
  • Risk Decision-making – the process of ensuring decisions about risk avoidance, acceptance, or mitigation are made at appropriate levels with full transparency.

Continuous Improvement in every case

Improvement – the process of refining workflows and practices through shared knowledge and experience in order to strengthen engineering practices and further harden our release cycles. This step uses root cause analysis to help close the holes we find in the layers of our Swiss cheese model.

Here is one conceptualization of what a Defense-in-depth Risk Management model might look like. Bear in mind that this is simply one way to conceive of layers at a more macro level, and each layer could easily itself be broken down into a set of layers, or you could conceive of it as one very large model.

[Figure: a Defense-in-depth Risk Management model visualized as layered slices of Swiss cheese]

Given our model and our new ability to conceive of Risk and Error Management in this more meaningful and purposeful way, our next step is to understand error causality and what we can do to apply our causal analysis to strengthening our software and hardware risk management and error trapping system.

Continue reading in part 2 of this 3-part series.

 
