Tag Archives: Continuous Improvement

Risk Management and Error Trapping in Software and Hardware Development, Part 3

This is part 3 of a 3-part piece on risk management and error trapping in software and hardware development. The first post is located here (and should be read first to provide context on the content below), and part 2 is located here.

Root Cause Analysis and Process Improvement

Once a bug has been discovered and risk analysis / decision-making has been completed (see below), a retrospective-style analysis of the circumstances surrounding the engineering practices which failed to trap the bug completes the cycle.

The purpose of the retrospective is not to assign blame or find fault, but rather to understand the cause of the failure to trap the bug, inspect the layers of the system, and determine if any additional layers, procedures, or process changes could effectively improve collective engineering surety and help to prevent future bugs emerging from similar causes.

Methodology

  1. Review sequence of events that led to the anomaly / bug.
  2. Determine root cause.
  3. Map the root cause to our defense-in-depth (Swiss cheese) model.
  4. Decide if there are remediation efforts or improvements which would be effective in supporting or restructuring the system to increase its effectiveness at error trapping.
  5. Implement any changes identified, sharing them publicly to ensure everyone understands the changes and the reasoning behind them.
  6. Monitor the changes, adjusting as necessary.

Review sequence of events

With appropriate representatives from engineering teams, certification, hardware, operations, customer success, etc., review the discovery path which led to finding the bug. The point is to understand the processes used, which ones worked, and which let the bug pass through.

Determine root cause and analyze the optimum layers for improvement

What caused the bug? There are many enablers and contributing factors, but typically only one or two root causes. The root cause is one, or possibly a combination, of: Organization, Communication, Knowledge, Experience, Discipline, Teamwork, or Leadership.

  • Organization – typically latent, organizational root causes include things like existing processes, tools, practices, habits, customs, etc., which the company or organization as a whole employs in carrying out its work.
  • Communication – a failure to convey necessary, important, or vital information to or among an individual or team who required it for the successful accomplishment of their work.
  • Knowledge – an individual, team, or organization did not possess the knowledge necessary to succeed. This is the root cause for knowledge-based errors.
  • Experience – an individual, team, or organization did not possess the experience necessary to successfully accomplish a task (as opposed to the knowledge about what to do). Experience is often a root cause in skill-based errors of omission.
  • Discipline – an individual, team, or organization did not possess the discipline necessary to apply their knowledge and experience to solving a problem. Discipline is often a root cause in skill-based errors of commission.
  • Teamwork – individuals, possibly at multiple levels, failed to work together as a team, support one another, and check one another against errors. Additional root causes may be knowledge, experience, communication, or discipline.
  • Leadership – less often seen at smaller organizations, a Leadership failure is typically a root cause when a leader and/or manager has not effectively communicated expectations or empowered execution regarding those expectations.
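
As a concrete (and entirely optional) illustration, here is a minimal Python sketch of how a team might record the outcome of such a retrospective; the class and field names are hypothetical, not part of any prescribed tooling:

    from dataclasses import dataclass, field
    from enum import Enum, auto
    from typing import List

    class RootCause(Enum):
        # The seven root-cause categories described above.
        ORGANIZATION = auto()
        COMMUNICATION = auto()
        KNOWLEDGE = auto()
        EXPERIENCE = auto()
        DISCIPLINE = auto()
        TEAMWORK = auto()
        LEADERSHIP = auto()

    @dataclass
    class BugRetrospective:
        bug_id: str
        sequence_of_events: List[str]        # step 1: the discovery path
        root_causes: List[RootCause]         # step 2: typically one or two
        layers_to_improve: List[str]         # step 3: e.g. "TDD", "pair programming"
        improvements: List[str]              # steps 4-5: changes agreed and shared
        monitoring_notes: List[str] = field(default_factory=list)  # step 6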

Map the root cause to the layer(s) which should have trapped the error

Given the root cause analysis, determine where in the system (which layer or layers) the bug should have been trapped. Often there will be multiple locations at which the bug should or could have been trapped; however, the best location to identify is the one which most closely corresponds to the root cause of the bug. Consideration should also be given to timeliness: the earlier an error can be caught or prevented (trapped), the less costly it is in terms of both time (to find, fix, and eliminate the bug) and effort (a bug in production requires more effort from more people than a developer discovering a bug while checking their own unit test).

While we should seek to apply fixes at the locations best suited for them, the earliest point at which a bug could have been caught and prevented will often be the optimum place to improve the system.

For example, if a bug was traced back to a team’s discipline in writing and using tests (root cause: discipline and experience), then it would map to layers dealing with testing practices (TDD/ATDD), pair programming, acceptance criteria, definition of “Done,” etc. Those layers to which the team can most readily apply improvements and which will trap the error sooner rather than later should be the focus for improvement efforts.

Decide on improvements to increase system effectiveness

Based on the knowledge gained through analyzing and mapping the root cause, decisions are made on how to improve the effectiveness of the system at the layers identified. Using the testing example above, a team could decide that they need to adjust their definition of Done to include listing which tests a story has been tested against and their pass/fail conditions.

Implement the changes identified, and monitor them for effectiveness.

Risk Analysis

Should our preventative measures fail to stop a bug from escaping into a production environment, an explicit analysis of the level of risk needs to be completed. (This is often done, but only implicitly.) The analysis derives from two areas.

Risk Severity – the degree of impact the bug can be expected to have on the data, operations, or functionality of affected parties (the company, vendors, customers, etc.).

  • Blocker – A bug so severe, or a feature so important, that we would not ship the next release until it is fixed/completed. May also signify a bug that is currently impacting a customer’s operations, or one that is blocking development.
  • Critical – A bug that needs to be resolved ASAP, but for which we wouldn’t stop everything. Bugs in this category are not impacting operations (a customer’s, or ours), but are challenging enough to warrant attention.
  • Major – Best judgement should be used to determine how this stacks against other work. The bug is serious enough that it needs to be resolved, but the value of other work and timing should be considered. If a bug sits in Major for too long, its categorization should be reviewed and either upgraded or downgraded.
  • Minor – A bug that is known, but which we have explicitly de-prioritized. Such a bug will be fixed as time allows.
  • Trivial – We should seriously consider closing bugs at this level. At best they should be put into the “Long Tail” for tracking.

Risk Probability – the likelihood, expressed as a percentage, that those potentially affected by the bug will actually experience it (i.e., always, only if they have a power outage, or only if the sun aligns with Jupiter during the slackwater phase of a diurnal tide in the northeastern hemisphere between 44 and 45 degrees latitude).

  • Definite – 100%: the issue will occur in every case
  • Probable – 60-99%: the issue will occur in most cases
  • Possible – 30-60%: a coin flip; the issue may or may not occur
  • Unlikely – 2-30%: the issue will occur in less than 50% of cases
  • Won’t – 1%: occurrence of the issue will be exceptionally rare
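
If it helps to make the banding concrete, here is a minimal Python sketch of the same categories; note that because the published bands share endpoints, the boundary handling below is a judgment call:

    def probability_category(p: float) -> str:
        """Map an estimated probability (0.0 to 1.0) to the categories above."""
        if p >= 1.0:
            return "Definite"
        if p >= 0.60:
            return "Probable"
        if p >= 0.30:
            return "Possible"
        if p >= 0.02:
            return "Unlikely"
        return "Won't"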

Given Risk Severity and Probability, the risk can be assessed according to the following matrix and assigned a Risk Assessment Code (RAC).

Risk Assessment Matrix (columns: Probability; rows: Severity)

               Definite   Probable   Possible   Unlikely   Won’t
  Blocker          1          1          1          2         3
  Critical         1          1          2          2         3
  Major            2          2          2          3         4
  Minor            3          3          3          4         5
  Trivial          3          4          4          5         5

Risk Assessment Codes
1 – Strategic     2 – Significant     3 – Moderate     4 – Low     5 – Negligible

The Risk Assessment Codes are a significant factor in Risk decision-making.

  1. Strategic – the risk to the business or customers is significant enough that its realization could threaten operations, basic functioning, and/or professional reputation to the point that the basic survival of the business could be in jeopardy. As Arnold said in Predator: “We make a stand now, or there will be nobody left to go to the chopper!”
  2. Significant – the risk poses considerable, but not life-threatening, challenges for the business or its customers. If left unchecked, these risks may elevate to strategic levels.
  3. Moderate – the risk to business operations, continuity, and/or reputation is significant enough to warrant consideration against other business priorities and issues, but not significant enough to trigger higher responses.
  4. Low – the risk to the business is not significant enough to warrant special consideration of the risk against other priorities. Issues should be dealt with in routine, predictable, and business-as-usual ways.
  5. Negligible – the risk to the business is not significant enough to warrant further consideration except in exceptional circumstances (i.e., we literally have nothing better to do).
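
For those who prefer code to tables, the lookup can be sketched in a few lines of Python; this simply mirrors the matrix above, with severity and probability passed as category names:

    # Nested dict mirroring the Risk Assessment Matrix above, row by row.
    RAC_MATRIX = {
        "Blocker":  {"Definite": 1, "Probable": 1, "Possible": 1, "Unlikely": 2, "Won't": 3},
        "Critical": {"Definite": 1, "Probable": 1, "Possible": 2, "Unlikely": 2, "Won't": 3},
        "Major":    {"Definite": 2, "Probable": 2, "Possible": 2, "Unlikely": 3, "Won't": 4},
        "Minor":    {"Definite": 3, "Probable": 3, "Possible": 3, "Unlikely": 4, "Won't": 5},
        "Trivial":  {"Definite": 3, "Probable": 4, "Possible": 4, "Unlikely": 5, "Won't": 5},
    }

    def risk_assessment_code(severity: str, probability: str) -> int:
        """Look up the RAC for a severity/probability pair."""
        return RAC_MATRIX[severity][probability]

    # A Major bug that will probably occur is a Significant (RAC 2) risk.
    assert risk_assessment_code("Major", "Probable") == 2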

Risk Decision

The risk decision is the point at which a deliberate decision is made about how to handle the risk. Typically, risk decisions take one of the following forms:

  • Accept – accept the risk as it is and do not mitigate or take additional steps.
  • Delay – for less critical issues or dependencies, a decision about whether to accept or mitigate a risk may be delayed until additional information, research, or steps are completed.
  • Mitigate – establish a mitigation strategy and deal with the risk.

For risk mitigation, feasible Courses of Action (CoAs) should be developed to inform the mitigation plan. Specifically, given a bug’s risk severity, probability, and resulting RAC, the courses of action are the possible mitigation options for that risk; together they comprise the mitigation and/or reaction plan. Examples include:

— Pre-release —

  • Apply software fix / patch
  • Code refactor
  • Code rewrite
  • Release without the code integrated (re-build)
  • Hold the release and await code fix
  • Cancel the release

— In production —

  • Add to normal backlog and prioritize with normal workflow
  • Pull / create a team to triage and fix
  • Swarm / mob multiple teams on fix
  • Pull back / recall release
  • Release an additional fix as a micro-upgrade

All risk decisions should be recorded, and those which remain active need to be tracked. There are many methods available for logging and tracking risk decisions, from spreadsheets to documentation to support tickets, and there are entire software platforms expressly designed to track and monitor risk status and record the decisions taken (or not) about risks.

Decisions to delay risk mitigations are the most important to track: they still require action, and at the speed most businesses move today there is a real danger of losing track of them. A Risk Log or regular Risk Review should therefore be used to revisit the status of pending risk decisions and reevaluate them. Risk changes constantly, and a risk’s severity and probability may shift significantly overnight. By reviewing risk decisions regularly, leadership can ensure both that emerging risks are mitigated and that effort is not wasted (as when effort is put against a risk which has significantly declined in impact due to changes external to the business).
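
As an illustration of the kind of lightweight tracking described above, here is a minimal Python sketch of a risk log entry with a scheduled review date; the field names are hypothetical, and any real implementation should follow your own tooling:

    from dataclasses import dataclass
    from datetime import date
    from typing import List, Optional

    @dataclass
    class RiskLogEntry:
        risk_id: str
        description: str
        severity: str                       # e.g. "Major"
        probability: str                    # e.g. "Possible"
        rac: int                            # Risk Assessment Code from the matrix
        decision: str                       # "Accept", "Delay", or "Mitigate"
        decision_date: date
        next_review: Optional[date] = None  # especially important for delayed decisions
        mitigation_plan: Optional[str] = None

    def due_for_review(log: List[RiskLogEntry], today: date) -> List[RiskLogEntry]:
        """Return every logged risk whose scheduled review date has arrived."""
        return [r for r in log if r.next_review is not None and r.next_review <= today]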

Conclusion

I hope you’ve enjoyed this 3-part series. Risk management and error trapping is a complicated and – at times – complex topic. There are many ways to approach these types of systems and many variations on the defense-in-depth model.

The specific implementation your business or organization chooses to adopt should reflect the reality and environment in which you operate, but the basic framework has proven useful across many domains and industries, and is directly adapted from Operational Risk Management as I practiced and taught it in the military.

Understanding the root cause of your errors, where they slipped through your system, and how to improve your system’s resiliency and robustness are critical skills which you need to develop if you have not already done so. A mindful, purposeful approach to risk decision-making throughout your organization is equally critical to your business operations.

Good luck!

 

Chris Alexander is a former U.S. Naval Officer who was an F-14 Tomcat flight officer and instructor. He is Co-Founder and Executive Team Member of AGLX Consulting, creators of the High-Performance Teaming™ model, a Scrum Trainer, Scrum Master, and Agile Coach.


Agile Retrospectives: High-Performing Teams Don’t Play Games

Scrum, The Lean Startup, Cyber Security and some product development loops have fighter aviation origins. But retrospectives (debriefs)—the most important continuous improvement event—have been hijacked by academics, consultants, and others who have never been part of a high-performing team; sure, they know how things ought to work but haven’t lived them. We have.

Learn what’s wrong with current retrospectives and discover how an effective retrospective process can build the high-performance teaming skills your organization needs to compete in today’s knowledge economy.

Special thanks to Robert “Cujo” Teschner, Dan “Bunny” O’Hara, Chris “Deuce” Alexander, Jeff “T-Bell” Dermody, Ryan “Hook-n-Jab” Bromenschenkel, Ashok “WishICould” Singh, John “Shorn” Saccomando, Dr. Daniel Low, and Allison “I signed up for what?” Rivera.

Brian “Ponch” Rivera is a recovering naval aviator, co-creator of High-Performance Teaming™ and the co-founder of AGLX Consulting, LLC.


What the Agile Community Should Learn from Two Little Girls and Their Weather Balloon

As reported by GeekWire, over the weekend two Seattle sisters, Kimberly (8) and Rebecca (10) Yeung, launched a small weather balloon to the edge of space (roughly 78,000 feet). They have the GoPro video from two cameras to prove it.

While this is certainly an impressive, if not amazing, feat for two young girls to have accomplished (despite some parental assistance), what is perhaps most impressive (at least to me) is the debrief (or retrospective) they held after the mission. While I’m not fortunate enough to have been there to witness it personally, I can see from the photo of their debrief sheet (as posted in the GeekWire article) that it was amazingly productive and far surpasses most of the agile retrospectives (debriefs) I’ve witnessed.

*Photo copied from the article on GeekWire.

Apart from the lesson about their Project Plan (“We were successful because we followed a Project Plan & Project Binder”), this sheet is astonishingly solid. Even though I think it is a misconception to attribute success to having had a project plan, for an 8- and 10-year-old this is awesome work!

My friend and fellow coach Brian Rivera and I have often discussed the dire lack of quality, understanding, and usefulness of most agile retrospectives. I might even go so far as to call the current state of agile retrospectives in general “abhorrent” or “pathetic,” even “disgraceful.” Yes, I might just use one of those adjectives.

For teams using agile methodologies and frameworks focused on continuous improvement (hint: everything in agile is about enabling continuous improvement), the retrospective is the “how” which underlies the “what” of continuous improvement.

Supporting the concrete actions of how to improve within the retrospective are the lessons learned. Drawing out lessons learned during an iteration isn’t magic and it isn’t circumstantial happenstance – it requires focused thought, discussion, and analysis. Perhaps for high-performing teams who have become expert at this through positive practice, distilling lessons learned and improving their work may occur at an almost unconscious level of understanding, but that’s maybe 1% (5% if I’m optimistic) of all agile teams.

So what does a team need to understand to actually conduct a thorough and detailed analysis during their retrospective? Actually only a few things:

  1. What were they trying to do? (Goals)
  2. How did they plan to do it? (Planning / strategy)
  3. What did they actually do? (Execution – what actually occurred)
  4. What were their outcomes? (Results of their work)
  5. What did they learn, derived from analyzing the results of their efforts measured against the plan they had to achieve their goals? (Lessons learned)

A simple example:

  1. I want to bake peach scones which are light, fluffy, and taste good. (Goal + acceptance criteria)
  2. I plan to wake up early Saturday morning and follow a recipe for peach scones which I’ve found online, which is highly rated, and which comes from a source I trust. It should take 30 minutes. (Planning – who / what / when / where / how)
  3. I wake up early Saturday morning and follow the recipe, except for the baking powder. It can leave a metallic taste behind, so I leave it out. (Execution)
  4. It took almost an hour to make the scones, and they did not rise. They tasted alright, but were far, far too dense and under-cooked internally, partially due to being flat. (Outcomes)
  5. I didn’t allocate enough time based on the fact that it was my first attempt at baking scones and I was trying to modify a known good recipe (reinventing the wheel, root causes: experience). Although I wanted light, fluffy scones, I didn’t get them because I deliberately left out a key ingredient necessary to help the dough rise (good intention – bad judgment, root causes: knowledge / discipline). (Lessons learned)

Perhaps a bit overly simplistic, but this is exactly the type of concrete, detailed analysis into which most teams simply never delve. Instead, retrospectives for most agile teams have devolved into a tragic litany of games, complaining sessions, and “I liked this / I didn’t like that” reviews with no real outcomes, takeaways, or practical concepts for how to actually improve anything. Their coaches leave them with simple statements such as “we need to improve.” Great. Thanks.

Taking what we know from Kimberly and Rebecca’s plan to send a weather balloon to the edge of space, let’s do a little analysis on their retrospective. I can tell you already that it is not only solid, but will ensure they’re able to improve the technical design itself as well as their team’s “meta” – the ways they work, their collaboration, their teamwork, their research – everything which enables them to continually improve and produce powerful results.

  • Bigger balloon – create more lift – ensure faster rate of ascent (Technical / work – related but important. They have learned through iterating.)
  • Remember to weigh payload with extra – more accurate calculations – correct amount of helium (Technical but also process-related, this draws root causes arising from both knowledge and experience, enabling them to adapt both their work itself and their meta – how they work.)
  • Don’t stop trying – you will never know if you don’t ask. Eg GoPro (Almost purely meta, reflecting a great lesson which builds not only a team mindset but also reflects a core value, perseverance!)
  • Washington Geography – Map research on launch locations taught us a lot of geography (This is both technical and meta, addressing their research data and inputs/outputs but also learning about how to learn and the value of research itself!)
  • Always be optimistic – We thought everything went wrong but every thing went right. Eg. SPOT Trace max altitude mislead [sic] our expectations. Eg. We thought weather cloudy but it was sun after launch. Eg. Weight. Thought payload too heavy for high altitude. (Are you kidding me?! Awesome! Lessons about situational awareness and current operational picture, data inconsistencies, planned versus actual events, planning data and metrics, and the importance of outlook/attitude! #goldmine!)
  • Be willing to reconstruct – If you find out there is a problem, do not be afraid to take it apart and start all over again. (Invaluable lesson – learning to embrace failure when it occurs and recover from it, realizing that the most important thing is not to build the product right, but to build the right product!)
  • Have a redundant system – Worry less. (Needs no explanation.)
  • SPOT Trace technology awesome – Very precise (This is a fantastic example of a positive lesson learned – something that is equally important to acknowledge and capture to ensure it gets carried forward and turned into a standard practice / use.)
  • Live FB updates – add to fun + excitement (Yes yes yes!! To quote an old motto, “If you’re not having fun, you’re not doing it right!” This stuff should be fun!!)
  • Speculation – Don’t guess. Rely on data. (Fantastic emphasis on the importance of data-oriented decisions and reflects another potential team core value!)
  • Project Plan – We were sucessful [sic] because we followed a Project Plan + Project Binder. (The only lesson I disagree with. I would advocate a good 5 Whys session on this one. My suspicion is that the project was successful because they as a team worked both hard and well together [high-performing], had fun, and iterated well [based on the lesson about not being afraid to reconstruct / start over]. I have serious doubts that their mission was a success because they had and followed a project plan. Regardless, this is far too small a point to detract from the overall impressiveness of their work!)

Take a few lessons from two girls who have demonstrated concrete learning in ways most adults fail miserably to even conceptually grasp. If you are on a team struggling to get productive results from your retrospectives, stop accepting less than solid, meaningful analysis coupled with clear, actionable results. The power is in your hands (and head).

If you are one of those agile coaches who thinks retrospectives are just for fun and celebration, who plays games instead of enabling concrete analysis, and who wonders why their teams just cannot seem to make any marked improvements, get some education and coaching yourself and stop being part of the problem!

(Written with the sincerest of thanks to Kimberly and Rebecca Yeung, and the Yeung family for their outstanding work, and to GeekWire for publishing it!)

* Chris Alexander is an agile coach, thinker, ScrumMaster, recovering developer, and co-founder of AGLX Consulting, who spends too little time rock climbing.


An Agile Approach to Process Management

Does your business process do what it’s supposed to do?

This is a common question for those involved in business process management, and of course there are many existing methodologies designed to address it. Additionally, process improvement can be carried out at both the macro and micro levels. At the macro level, the business develops processes at the organizational level in order to achieve strategic goals. At the micro level, individual teams, departments, or individuals may develop processes (both formal and informal) to help them accomplish their work.

The issue with existing process development and analysis methodology is that it remains relatively waterfall-ish: define all of the requirements, develop the process, communicate and implement, then occasionally analyze and change (improve) as needed. The cycle time involved with this approach can be considerable, and a formal, macro-level process in particular can be challenging to analyze or change once initiated.

An iterative approach to process design and management.

What if we considered process design in an adaptive, iterative way?

In other words, design a narrow process around the first, most important requirement, enabling rapid creation and implementation, feedback, and improvement, while developing toward additional requirements.

This serves two important purposes: first, it enables us to rapidly design and implement a process to satisfy the most important business need; second, it enables the process manager to receive fast feedback about the usefulness of the process as implemented thus far.

Once the critical first requirement has been met, the process can be adapted through iteration to satisfy additional requirements, again enabling feedback on the utility of the process overall.

Let’s look at an example of process improvement in new hire onboarding.

CDE Company has a challenging and sub-optimal process in place for onboarding new hires. CDE Company knows this because it is the number one pain point new hires relate about their experience joining the company. Let’s help CDE Co. improve their process through the application of Agile practices – the same ones their Scrum teams are currently using to develop their products (so everyone is already familiar with them and can use them!).

Step 1 – for an existing process, conduct a Process Retrospective

  • Define the goals of the process itself – what is the process supposed to do? “Minimize the amount of time new hires need to get up to speed quickly and become productive.”
  • Analyze the current process plan (how it is supposed to work). “New hire is welcomed and given a desk; requests are sent to create and activate their email account and grant access to the folders, tools, and utilities which they will need to do their job; new hire is added to email lists, meetings, and taken out to lunch; new hire will work alongside another developer to understand the code base and standard practices, and will be given an overview of the main products on which they’ll be working; new hires begin to work.”
  • Gather feedback on the actual current process and how it is executed. “New hire is welcomed and requests are sent; email and access to environments/tools takes anywhere from a couple of days to a week; being added to email lists/meetings requires finding the owners and asking to be added individually; buddying with another developer works ok; lunch is always good.”
  • Identify the lessons from analyzing the current process, the intended plan, and the actual goals. “Onboarding is slower than it should be and we are not achieving the goal of getting new developers up to speed and working quickly; getting email / account / tools / utilities / environments access is too slow and holds developers up; our buddying model works well once the new developer has the access needed; getting the new developer into meetings / email lists is fractured and takes too long.”
  • Transfer those lessons into a plan of action. “We need to re-prioritize and redesign the process to address the issues found in Lessons Learned. Specifically we need to plan to get access for the new hire on Day 1; we need to have email lists / meetings cleared up prior to Day 1; we should maintain the onboarding buddy system.”

Step 2 – hold a Process Improvement Planning Session

  • Set a Timebox for this process improvement / development cycle (1 or 2 weeks should generally be enough time – otherwise you’re taking on too much work).
  • Using the process owner(s) as a sort of Product Owner, determine the most important requirement(s) of the process in terms of either the Lessons Transferred from the Retrospective or the needs of the business, team, or individual. “Get the new developer access to email / development tools / environments on Day 1.”
  • It is important to note that the size / number of planned improvements should be something achievable within the agreed timebox. (Just to reiterate: 1 improvement implemented is better than 5 improvements “in work”.)
  • Plan how to achieve the requirement identified, and name an owner. “Develop a consolidated list of tools / utilities / environments and work with infrastructure & engineering leads to create account permissions according to that list for new hires at Day 1 minus 2; on Day 1 minus 1 Operations places the account in a password reset state; on Day 1 new hire completes password reset process and has access to email and all required tools & utilities.”
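
For teams that keep this backlog in a tool or script, the selection logic of the planning session could be sketched roughly as follows (hypothetical names; the point is simply ranked items pulled into an agreed timebox):

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class ImprovementItem:
        description: str
        owner: str
        priority: int            # 1 = most important, as ranked by the process owner
        estimated_days: float

    def plan_cycle(backlog: List[ImprovementItem], timebox_days: float) -> List[ImprovementItem]:
        """Pull the highest-priority improvements that fit inside the agreed timebox."""
        planned: List[ImprovementItem] = []
        remaining = timebox_days
        for item in sorted(backlog, key=lambda i: i.priority):
            if item.estimated_days <= remaining:
                planned.append(item)
                remaining -= item.estimated_days
        return planned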

Step 3 – implement the Plan 

The next new hire through the door gets to use the new process.

Step 4 – conduct the next Retrospective and Planning Session

Retrospect and then prioritize any new lessons along with the existing work which is still in the backlog and waiting to be done (such as improving new hires getting added to email lists and meetings). Moving forward to the next planning session, plan improvements for the next highest priority item(s) which can be achieved within the timebox.

An iterative approach to process improvement also provides the opportunity to stop working on process improvements once the process has been improved enough to be generating sufficient utility – regardless of whether every conceived improvement has been implemented.

The majority of business value will likely be derived from implementing a portion of the desired improvements to a process, rather than every one that has been conceived. If we can determine that the Pareto principle (80% of the value of our process is delivered by 20% of the features/work) is indeed applicable to our process, then eliminating sub-processes or work which is not required will further contribute to business capacity by eliminating waste.

Whatever process you’re seeking to implement or improve upon, an Agile approach can help you build and deliver faster and with greater effectiveness. Payroll, feature request intakes, customer polling, software bug remediation, hardware certification, inventory management, data analysis and reporting – it is challenging to think of a single process which couldn’t benefit in some way from utilizing an iterative, Agile approach to design and continuous improvement.
