Like agile, hypothesis-driven development (HDD) is more a point of view with various associated practices than it is a single, particular practice or process. That said, my goal here is for you to leave with a solid understanding of how to do HDD and a specific set of steps that work for you to get started.
After reading this guide and trying out the related practice you will be able to:
- Diagnose when and where hypothesis-driven development (HDD) makes sense for your team
- Apply techniques from HDD to your work in small, success-based batches across your product pipeline
- Frame and enhance your existing practices (where applicable) with HDD
What is hypothesis-driven development (HDD)?
Does your product program feel like a Netflix show you’d binge watch? Is your team excited to see what happens when you release stuff? If so, congratulations- you’re already doing it and please hit me up on Twitter so we can talk about it! If not, don’t worry- that’s pretty normal, but HDD offers some awesome opportunities to work better.
Underlying the growing excitement about HDD is the realization that in the product world we’re underusing the scientific method. Now, that doesn’t mean you need to put on a lab coat and be super formal about what you do. What it does mean is that as we deal with uncertainty (aka innovate), we should have explicit, testable ideas and focus on either definitively disproving them, so we can move on to a better idea, or proving them, so we can build on them with confidence. In the diagram here, I’ve borrowed language from Lean Startup and referred to these conditions as pivot (disproven) or persevere (proven).
Building on the scientific method, HDD is a take on how to integrate test-driven approaches across your product development activities- everything from creating a user persona to figuring out which integration tests to automate. Yeah- wow, right?! It is a great way to energize and focus your practice of agile and your work in general.
By product pipeline, I mean the set of processes you and your team undertake to go from a certain set of product priorities to released product. If you’re doing agile, then iteration (sprints) is a big part of making these work.
How do you know if it’s working?
It wouldn’t be very hypothesis-driven if I didn’t have an answer to that! In the diagram above, you’ll find metrics for each area. For your application of HDD to what we’ll call continuous design, your metric to improve is the ratio of all your release content to the release content that meets or exceeds your target metrics on user behavior. For example, if you developed a new, additional way for users to search for products and set the success threshold at it being used in >10% of user sessions, did that feature succeed or fail by that measure? For application development, the metric you’re working to improve is basically velocity, meaning story points or, generally, release content per sprint. For continuous delivery, it’s how often you can release. Hypothesis testing is, of course, central to HDD and generally to doing agile with any kind of focus on valuable outcomes, and I think it shares the metric on successful release content with continuous design.
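To make the search example concrete, here is a minimal sketch of checking a feature against its success threshold. The session records and the `used_new_search` field are hypothetical stand-ins for whatever your analytics export provides; only the 10% threshold comes from the example above.

```python
from dataclasses import dataclass


@dataclass
class Session:
    session_id: str
    used_new_search: bool  # hypothetical flag from your analytics export


def feature_met_threshold(sessions, threshold=0.10):
    """Return (share, passed): the share of sessions that used the feature
    and whether that share is strictly above the success threshold."""
    if not sessions:
        return 0.0, False
    share = sum(s.used_new_search for s in sessions) / len(sessions)
    return share, share > threshold


# Made-up data: every 8th session uses the new search.
sessions = [Session(f"s{i}", used_new_search=(i % 8 == 0)) for i in range(200)]
share, passed = feature_met_threshold(sessions)
print(f"share={share:.3f} passed={passed}")
```

The point of writing the threshold down as a parameter is that it gets decided before the data arrives, not after.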
Teams find these useful, particularly for individuals in their various areas, but how do those success metrics interact? The single best feature of HDD is that it calls attention to what all the different practices across your pipeline and related disciplines have in common. For answering this question and generally helping these various metrics cohere into outcomes that are significant for the team’s product, I like ‘F’.
The first component is team cost, which you would sum up over whatever period you’re measuring. This includes ‘c$’, which is total compensation plus loading (benefits, equipment, etc.), and ‘g’, which is the cost of the gear you use- that might be application infrastructure like AWS, GCP, etc. along with any other infrastructure you buy or share with other teams. For example, using a backend-as-a-service like Heroku or Firebase might push up your value for ‘g’ while deferring the cost of building your own app infrastructure.
The next component is release content, fe. If you’re already estimating story points somehow, you can use those. If you’re a NoEstimates crew, and, hey, I get it, then you’d need to do some kind of rough proportional sizing of your release content for the period in question. The next term, rf, is optional but this is an estimate of the time you’re having to invest in rework, bug fixes, manual testing, manual deployment, and anything else that doesn’t go as planned.
The last term, sd, is one of the most critical and is an estimate of the proportion of your release content that’s successful relative to the success metrics you set for it. For example, if you developed a new, additional way for users to search for products and set the success threshold at it being used in >10% of user sessions, did that feature succeed or fail by that measure? Naturally, if you’re not doing this it will require some work and changing your habits, but it’s hard to deliver value in agile if you don’t know what that means and define it against anything other than actual user behavior.
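As a rough illustration of how these terms might combine, the sketch below computes a cost per unit of successful release content. The text above names the terms but does not spell out the formula, so the combination used here, F = (c$ + g) / (fe × (1 − rf) × sd), is an assumption and only one plausible reading. The function name and all of the numbers are hypothetical.

```python
def innovation_f(team_comp, gear_cost, release_content, success_share, rework_share=0.0):
    """Cost per unit of successful release content for one period.

    ASSUMPTION: F = (c$ + g) / (fe * (1 - rf) * sd). This is one plausible
    way to combine the terms, not a confirmed definition.
    """
    if release_content <= 0:
        raise ValueError("release_content must be positive")
    if not 0.0 <= rework_share < 1.0:
        raise ValueError("rework_share must be in [0, 1)")
    if not 0.0 < success_share <= 1.0:
        raise ValueError("success_share must be in (0, 1]")
    effective_content = release_content * (1.0 - rework_share) * success_share
    return (team_comp + gear_cost) / effective_content


# Hypothetical month; every number here is made up for illustration.
f_value = innovation_f(
    team_comp=50_000,     # c$: loaded compensation for the period
    gear_cost=2_000,      # g: infrastructure and tooling
    release_content=100,  # fe: story points released
    success_share=0.5,    # sd: share of content that hit its target metric
    rework_share=0.2,     # rf: share of time lost to rework (optional term)
)
print(round(f_value, 2))
```

Under this reading, more rework or a lower success share pushes ‘F’ up, and a higher success share pulls it down, which matches the direction of the trends described for the metric.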
Here’s how some of the key terms lay out in the product pipeline:
The example here shows how a team might tabulate this for a given month:
Is the punchline that you should be shooting for a cost of $1,742 per story point? No. First, this is for a single month and would only serve the purpose of the team setting a baseline for itself. Like any agile practice, the interesting part of this is seeing how your value for ‘F’ changes from period to period, using your team retrospectives to talk about how to improve it. Second, this is just a single team and the economic value (ex: revenue) related to a given story point will vary enormously from product to product. There’s a Google Sheets-based calculator that you can use here: Innovation Accounting with ‘F’.
Like any metric, ‘F’ only matters if you find it workable to get in the habit of measuring it and paying attention to it. As a team, say, evaluates its progress on OKRs (objectives and key results), ‘F’ offers a view on the health of the team’s collaboration in the context of their product and organization. For example, if the team’s accruing technical debt, that will show up as a steady increase in ‘F’. If a team’s invested in test or deploy automation or started testing their release content with users more specifically, that should show up as a steady lowering of ‘F’.
In the next few sections, we’ll step through how to apply HDD to your product pipeline by area, starting with continuous design.
How do you apply HDD to ‘continuous design’?
For the practice of HDD with continuous design, I like to use the ‘double diamond’ framing of ‘right problem’ vs. ‘right solution’, which I first learned about in Donald Norman’s seminal book, ‘The Design of Everyday Things’.
I’ve organized the balance of this section around three big questions:
- How do you test that you’ve found the ‘Right Problem’?
- How do you test that you’ve found demand and have the ‘Right Solution’?
- How do you test that you’ve designed the ‘Right Solution’?
How do you test that you’ve found the ‘Right Problem’?
Basically, you talk to customers about your area of interest and see what’s actually important to them. This process is very different from selling, and that’s where most people run into trouble.
Let’s say it’s an internal project- a ‘digital transformation’ for an HVAC (heating, ventilation, and air conditioning) service company. The digital team thinks it would be cool to organize the documentation for all the different HVAC equipment the company’s technicians service. But, would it be?
The only way to find out is to go out and talk to these technicians and find out! First, you need to test whether you’re talking to someone who is one of these technicians. For example, you might have a screening question like: ‘How many HVAC’s did you repair last week?’. If it’s <10, you might instead be talking to a handyman or a manager (or someone who’s not an HVAC tech at all).
Second, you need to ask non-leading questions. The evidentiary value of a specific answer to a general question is much higher than that of a specific answer to a specific question. Also, some questions are just leading. For example, if you ask such a subject ‘Would you use a documentation system if we built it?’, they’re going to say yes, just to avoid the awkwardness and sales pitch they expect if they say no.
How do you draft personas? Much more renowned designers than myself (Donald Norman among them) disagree with me about this, but personally I like to draft my personas while I’m creating my interview guide and before I do my first set of interviews. Whether you draft or interview first is also of secondary importance if you’re doing HDD- if you’re not iteratively interviewing and revising your material based on what you’ve found, it’s not going to be very functional anyway.
Really, the persona (and the job-to-be-done/problem scenario) is a means to an end- it should be answering some facet of the question ‘Who is our customer, and what’s important to them?’. It’s iterative, with a process that looks something like this:
How do you draft jobs-to-be-done? Personally- I like to work these in a similar fashion- draft, interview, revise, and then repeat, repeat, repeat.
You’ll use the same interview guide and subjects for these. The template is the same as the personas, but I maintain a separate (though related) tutorial for these–
How do you interview subjects? And, action! The #1 place I see teams struggle is at the beginning and it’s with the paradox that to get to a big market you need to nail a series of small markets. Sure, they might have heard something about segmentation in a marketing class, but here you need to apply that from the very beginning.
The fix is to create a screener for each persona. This is a factual question whose job is specifically and only to determine whether a given subject does or does not map to your target persona. In the HVAC in a Hurry technician persona (see above), you might have a screening question like: ‘How many HVAC’s did you repair last week?’. If it’s <10, you might instead be talking to a handyman or a manager (or someone who’s not an HVAC tech at all).
And this is the point where (if I’ve made them comfortable enough to be candid with me) teams will tell me ‘But we want to go big- be the next Facebook.’ And then we talk about how just about all those success stories, where a product has for all intents and purposes a universal user base, started out by killing it in small, specific segments and learning and growing from there.
Sorry for all that, reader, but I find all this so frequently at this point and it’s so crucial to what I think is a healthy practice of HDD it seemed necessary.
The key with the interview guide is to start with general questions where you’re testing for a specific answer and then progressively get into more specific questions. Here are some resources–
An example interview guide related to the previous tutorials
A general take on these interviews in the context of a larger customer discovery/design research program
A template for drafting an interview guide
To recap, what’s a ‘Right Problem’ hypothesis? The Right Problem (persona and PS/JTBD) hypothesis is the most fundamental, but the hardest to pin down. You should know what kind of shoes your customer wears and when and why they use your product. You should be able to apply factual screeners to identify subjects that map to your persona or personas.
You should know what people who look like/behave like your customer who don’t use your product are doing instead, particularly if you’re in an industry undergoing change. You should be analyzing your quantitative data with strong, specific, emphatic hypotheses.
If you make software for HVAC (heating, ventilation and air conditioning) technicians, you should have a decent idea of what you’re likely to hear if you ask such a person a question like ‘What are the top 5 hardest things about finishing an HVAC repair?’
In summary, HDD here looks something like this:
01 IDEA: The working idea is that you know your customer and you’re solving a problem/doing a job (whatever term feels like it fits for you) that is important to them. If this isn’t the case, everything else you’re going to do isn’t going to matter.
Also, you know the top alternatives, which may or may not be what you see as your direct competitors. This is important as an input into focused demand testing to see if you have the Right Solution.
02 HYPOTHESIS: If you ask non-leading questions (like ‘What are the top 5 hardest things about finishing an HVAC repair?’), then you should generally hear relatively similar responses.
03 EXPERIMENTAL DESIGN: You’ll want an Interview Guide and, critically, a screener. This is a factual question you can use to make sure any given subject maps to your persona. With the HVAC repair example, this would be something like ‘How many HVAC repairs have you done in the last week?’ where you’re expecting an answer >5. This is important because if your screener isn’t tight enough, your interview responses may not converge.
04 EXPERIMENTATION: Get out and interview some subjects- but with a screener and an interview guide. The resources above have more on this, but one key thing to remember is that the interview guide is a guide, not a questionnaire. Your job is to make the interaction as normal as possible and it’s perfectly OK to skip questions or change them. It’s also 1000% OK to revise your interview guide during the process.
05 PIVOT OR PERSEVERE: What did you learn? Was it consistent? Good results are:
a) We didn’t know what was on their A-list and what alternatives they are using, but we do now.
b) We knew what was on their A-list and what alternatives they are using- we were pretty much right (doesn’t happen as much as you’d think).
c) Our interviews just didn’t work/converge. Let’s try this again with some changes (happens all the time to smart teams and is very healthy).
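The screener and the 'did the responses converge?' check in the steps above can be sketched in a few lines. This is only an illustration: the interview records, the field names, and the idea of hand-coding each answer into a short label are assumptions. The '>5 repairs' cutoff is the one from the experimental design step.

```python
from collections import Counter


def passes_screener(repairs_last_week, minimum=5):
    """Factual screener: keep subjects reporting more than `minimum` repairs."""
    return repairs_last_week > minimum


def answer_convergence(interviews, top_n=3):
    """Share of screened subjects who mentioned each of the most common answers."""
    screened = [i for i in interviews if passes_screener(i["repairs_last_week"])]
    if not screened:
        return {}
    # set() so a subject who repeats an answer is only counted once
    counts = Counter(answer for i in screened for answer in set(i["hardest_things"]))
    return {answer: n / len(screened) for answer, n in counts.most_common(top_n)}


# Hypothetical interview notes, coded by hand after each conversation.
interviews = [
    {"repairs_last_week": 12, "hardest_things": ["finding parts", "paperwork"]},
    {"repairs_last_week": 9, "hardest_things": ["finding parts", "scheduling"]},
    {"repairs_last_week": 2, "hardest_things": ["billing"]},  # screened out
    {"repairs_last_week": 15, "hardest_things": ["finding parts", "paperwork"]},
]
convergence = answer_convergence(interviews)
print(convergence)
```

If no answer shows up for a clear majority of screened subjects, that is a hint the screener is too loose or the questions need another pass, which is outcome (c) above.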
How do you test that you’ve found demand and have the ‘Right Solution’?
By this, I mean: How do you test whether you have demand for your proposition? How do you know whether it’s better enough at solving a problem (doing a job, etc.) than the current alternatives your target persona has available to them now?
If an existing team was going to pick one of these areas to start with, I’d pick this one. While they’ll waste time if they haven’t found the right problem to solve and, yes, usability does matter, in practice this area of HDD is a good forcing function for really finding out what the team knows vs. doesn’t. This is why I show it as a kind of fulcrum between Right Problem and Right Solution:
This is not about usability and it does not involve showing someone a prototype, asking them if they like it, and checking the box.
Lean Startup offers a body of practice that’s an excellent fit for this. However, it’s widely misused because it’s so much more fun to build stuff than to test whether or not anyone cares about your idea. Yeah, seriously- that is the central challenge of Lean Startup.
Here’s the exciting part: You can massively improve your odds of success. While Lean Startup does not claim to be able to take any idea and make it successful, it does claim to minimize waste- and that matters a lot. Let’s just say that a new product or feature has a 1 in 5 chance of being successful. Using Lean Startup, you can iterate through 5 ideas in the space it would take you to build 1 out (and hope for the best)- this makes the improbable probable, which is pretty much the most you can ask for in the innovation game.
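The arithmetic behind that claim is worth a quick look. The sketch below assumes each idea succeeds independently with the same 1 in 5 chance, which is a simplification; real ideas are rarely independent.

```python
def chance_of_at_least_one_success(p_success, attempts):
    """P(at least one success) across independent attempts."""
    return 1.0 - (1.0 - p_success) ** attempts


one_big_bet = chance_of_at_least_one_success(0.2, 1)
five_experiments = chance_of_at_least_one_success(0.2, 5)
print(f"{one_big_bet:.2f} vs {five_experiments:.2f}")
```

Under those assumptions, one full build gives you a 20% chance of a hit, while five cheap experiments give you roughly a two in three chance that at least one idea works.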
Build, measure, learn, right? Kind of. I’ll harp on this since it’s important and a common failure mode related to Lean Startup: an MVP is not a 1.0. As the Lean Startup folks (and Eric Ries’ book) will tell you, the right order is learn, build, measure. Specifically–
Learn: Who your customer is and what matters to them (see Solving the Right Problem, above). If you don’t do this, you’ll be throwing darts with your eyes closed. Those darts are a lot cheaper than the darts you’d throw if you were building out the solution all the way (to strain the metaphor some), but far from free.
In particular, I see lots of teams run an MVP experiment and get confusing, inconsistent results. Most of the time, this is because they don’t have a screener and they’re putting the MVP in front of an audience that’s too wide ranging. A grandmother is going to respond differently than a millennial to the same thing.
Build: An experiment, not a real product, if at all possible (and it almost always is). Then consider MVP archetypes (see below) that will deliver the best results and try them out. You’ll likely have to iterate on the experiment itself some, particularly if it’s your first go.
Measure: Have metrics and link them to a kill decision. The Lean Startup term is ‘pivot or persevere’, which is great and makes perfect sense, but in practice the pivot/kill decisions are hard, and as you design your experiment you should really think about what metrics and thresholds are going to convince you.
How do you code an MVP? You don’t. This MVP is a means to running an experiment to test motivation- so formulate your experiment first and then figure out an MVP that will get you the best results with the least amount of time and money. Just since this is a practitioner’s guide, with regard to ‘time’, that’s both time you’ll have to invest as well as how long the experiment will take to conclude. I’ve seen them both matter.
The most important first step is just to start with a simple hypothesis about your idea, and I like the form of ‘If we [do something] for [a specific customer/persona], then they will [respond in a specific, observable way that we can measure]’. For example, if you’re building an app for parents to manage allowances for their children, it would be something like ‘If we offer parents an app to manage their kids’ allowances, they will download it, try it, make a habit of using it, and pay for a subscription.’
All that said, for getting started here is-
A guide on testing with Lean Startup
A template for creating motivation/demand experiments
To recap, what’s a Right Solution hypothesis for testing demand? The core hypothesis is that you have a value proposition that’s better enough than the target persona’s current alternatives that you’re going to acquire customers.
As you may notice, this creates a tight linkage with your testing from Solving the Right Problem. This is important because while testing value propositions with Lean Startup is way cheaper than building product, it still takes work and you can only run a finite set of tests. So, before you do this kind of testing I highly recommend you’ve iterated to validated learning on what you see below: a persona, one or more PS/JTBD, the alternatives they’re using, and a testable view of why your VP is going to displace those alternatives. With that, your odds of doing quality work in this area dramatically increase!
What’s the testing, then? Well, it looks something like this:
01 IDEA: Most practicing scientists will tell you that the best way to get a good experimental result is to start with a strong hypothesis. Validating that you have the Right Problem and know what alternatives you’re competing against is critical to making investments in this kind of testing yield valuable results.
With that, you have a nice clear view of what alternative you’re trying to see if you’re better than.
02 HYPOTHESIS: I like a cause and effect stated here, like: ‘If we [offer something to said persona], they will [react in some observable way].’ This really helps focus your work on the MVP.
03 EXPERIMENTAL DESIGN: The MVP is a means to enable an experiment. It’s important to have a clear, explicit declaration of that hypothesis and for the MVP to deliver a metric for which you will (in advance) decide on a fail threshold. Most teams find it easier to kill an idea decisively with a kill metric vs. a success metric, even though they’re literally different sides of the same threshold.
04 EXPERIMENTATION: It is OK to tweak the parameters some as you run the experiment. For example, if you’re running a Google AdWords test, feel free to try new and different keyword phrases.
05 PIVOT OR PERSEVERE: Did you end up above or below your fail threshold? If below, pivot and focus on something else. If above, great- what is the next step to scaling up this proposition?
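One way to keep yourself honest about the fail threshold is to write it down next to the hypothesis before the experiment runs. The sketch below is a hypothetical illustration: the `DemandExperiment` record, the sign-up metric, and the 5% threshold are all made up, and treating a result exactly at the threshold as 'persevere' is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DemandExperiment:
    hypothesis: str
    metric_name: str
    fail_threshold: float  # decided before the experiment runs


def pivot_or_persevere(experiment, conversions, visitors):
    """Compare the observed rate to the pre-declared fail threshold."""
    if visitors <= 0:
        raise ValueError("no observations yet")
    observed = conversions / visitors
    decision = "persevere" if observed >= experiment.fail_threshold else "pivot"
    return observed, decision


# Hypothetical smoke test for the allowance app example.
smoke_test = DemandExperiment(
    hypothesis="If we offer parents an app to manage allowances, they will sign up.",
    metric_name="landing_page_signup_rate",
    fail_threshold=0.05,
)
observed, decision = pivot_or_persevere(smoke_test, conversions=18, visitors=400)
print(f"{smoke_test.metric_name}: {observed:.3f} -> {decision}")
```

Freezing the record is a small nudge against quietly moving the threshold once results start coming in.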
How does this relate to usability? What’s usability vs. motivation? You might reasonably wonder: If my MVP has something that’s hard to understand, won’t that affect the results? Yes, sure. Testing for usability and the related tasks of building stuff are much more fun and (short-term) gratifying. I can’t emphasize enough how much harder it is for most founders, etc. to push themselves to focus on motivation.
There’s certainly a relationship and, as we transition to the next section on usability, it seems like a good time to introduce the relationship between motivation and usability. My favorite tool for this is BJ Fogg’s Fogg Curve, which appears below. On the y-axis is motivation and on the x-axis is ‘ability’, which rises as the thing gets easier to do (so better usability moves you to the right). If you imagine a point in the upper left, that would be, say, a cure for cancer where no matter how hard it is to deal with, you really want it. On the bottom right would be something like checking Facebook- you may not be super motivated but it’s so easy.
The punchline is that there’s certainly a relationship but beware that for most of us our natural bias is to neglect testing our hypotheses about motivation in favor of testing usability.
How do you test that you’ve designed the ‘Right Solution’?
For my money, fully articulated user stories are the most important foundation element for getting to great usability in HDD. These have a specific format-
As a [person],
I want to [do something],
so that I can [achieve some testable reward].
First and foremost, delivering great usability is a team sport. Without a strong, co-created narrative, your performance is going to be sub-par. This means your developers, testers, analysts should be asking lots of hard, inconvenient (but relevant) questions about the user stories. For more on how these fit into an overall design program, let’s zoom out and we’ll again stand on the shoulders of Donald Norman.
Usability and User Cognition
To unpack usability in a coherent, testable fashion, I like to use Donald Norman’s 7-step model of user cognition:
The process starts with a Goal and that goal interacts with an object in an environment, the ‘World’. With the concepts we’ve been using here, the Goal is equivalent to a job-to-be-done/problem scenario. The World is your application in whatever circumstances your customer will use it (in a cubicle, on a plane, etc.).
The Reflective layer is where the customer is making a decision about alternatives for their JTBD/PS. In his seminal book, The Design of Everyday Things, Donald Norman’s example is to continue reading a book as the sun goes down. In the framings we’ve been using, we looked at understanding your customer’s Goals/JTBD in ‘How do you test that you’ve found the ‘right problem’?’, and we looked at evaluating their alternatives relative to your own (proposition) in ‘How do you test that you’ve found the ‘right solution’?’.
The Behavioral layer is where the user interacts with your application to get what they want- hopefully engaging with interface patterns they know so well they barely have to think about it. This is what we’ll focus on in this section. Critical here is leading with strong narrative (user stories), pairing those with well-understood (by your persona) interface patterns, and then iterating through qualitative and quantitative testing.
The Visceral layer is the lower level visual cues that a user gets- in the design world this is a lot about good visual design and even more about visual consistency. We’re not going to look at that in depth here, but if you haven’t already I’d make sure you have a working style guide to ensure consistency (see Creating a Style Guide).
How do you unpack the UX Stack for Testability? Back to our example company, HVAC in a Hurry, which services commercial heating, ventilation, and A/C systems, let’s say we’ve arrived at the following tested learnings for Trent the Technician:
As we look at how we’ll iterate to the right solution in terms of usability, let’s say we arrive at the following user story we want to unpack (this would be one of many, even just for the PS/JTBD above):
As Trent the Technician,
I know the part number and I want to find it on the system,
so that I can find out its price and availability.
Let’s step through the 7 steps above in the context of HDD, with a particular focus on achieving strong usability.
Goal: This is the PS/JTBD: Getting replacement parts to a job site. An HDD-enabled team would have found this out by doing customer discovery interviews with subjects they’ve screened and validated to be relevant to the target persona. They would have asked non-leading questions like ‘What are the top five hardest things about finishing an HVAC repair?’ and consistently heard that one such thing is sorting out replacement parts. This validates the hypothesis that said PS/JTBD matters.
Plan: For the PS/JTBD/Goal, which alternative are they likely to select? Is our proposition better enough than the alternatives? This is where Lean Startup and demand/motivation testing is critical. This is where we focused in ‘How do you test that you’ve found the ‘right solution’?’ and the HVAC in a Hurry team might have run a series of MVPs to understand both how their subject might interact with a solution (Concierge MVP) and whether they’re likely to engage (Smoke Test MVP).
Specify: Our first step here is just to think through what the user expects to do and how we can make that as natural as possible. This is where drafting testable user stories, looking at comps, and then pairing clickable prototypes with iterative usability testing is critical. Following that, make sure your analytics are answering the same questions but at scale and with the observations available.
Perform: If you did a good job in Specify and there are no overt visual problems (like ‘Can I click this part of the interface?’), you’ll be fine here.
Perceive: We’re at the bottom of the stack and looping back up from World: Is the feedback from your application readily apparent to the user? For example, if you turn a switch for a lightbulb, you know if it worked or not. Is your user testing delivering similar clarity on user reactions?
Interpret: Do they understand what they’re seeing? Does it make sense relative to what they expected to happen? For example, if the user just clicked ‘Save’, do they know that whatever they wanted to save is saved and OK? Or not?
Compare: Have you delivered your target VP? Did they get what they wanted relative to the Goal/PS/JTBD?
How do you draft relevant, focused, testable user stories? Without these, everything else is on a shaky foundation. Sometimes, things will work out. Other times, they won’t. And it won’t be that clear why/not. Also, getting in the habit of pushing yourself on the relevance and testability of each little detail will make you a much better designer and a much better steward of where and why your team invests in building software.
How do you find the relevant patterns and apply them? Once you’ve got great narrative, it’s time to put the best-understood, most expected, most relevant interface patterns in front of your user. Getting there is a process.
For getting started here is-
A guide on interface patterns and prototyping
How do you run qualitative user testing early and often? Once you’ve got something great to test, it’s time to get that design in front of a user, give them a prompt, and see what happens- then rinse and repeat with your design.
How do you focus your outcomes and instrument actionable observation? Once you release product (features, etc.) into the wild, it’s important to make sure you’re always closing the loop with analytics that are a regular part of your agile cadences. For example, in a high-functioning practice of HDD the team should be interested in and reviewing focused analytics to see how they pair with the results of their qualitative usability testing.
For getting started here is-
A guide on quantitative usability testing with Google Analytics.
To recap, what’s a Right Solution hypothesis for usability? Essentially, the usability hypothesis is that you’ve arrived at a high-performing UI pattern that minimizes cognitive load and maximizes the user’s ability to act on their motivation to connect with your proposition.
01 IDEA: If you’re writing good user stories, you already have your ideas implemented in the form of testable hypotheses. Stay focused and use these to anchor your testing. You’re not trying to test what color drop-down works best- you’re testing which affordances best deliver on a given user story.
02 HYPOTHESIS: Basically, the hypothesis is that ‘For [x] user story, this interface pattern will perform well, assuming we supply the relevant motivation and have the right assessments in place.’
03 EXPERIMENTAL DESIGN: Really, this means having tests set up that, beyond working, link user stories to prompts and narrative which supply motivation and have discernible assessments that help you make sure the subject didn’t click in the wrong place by mistake.
04 EXPERIMENTATION: It is OK to iterate on your prototypes and even your test plan in between sessions, particularly at the exploratory stages.
05 PIVOT OR PERSEVERE: Did the patterns perform well, or is it worth reviewing patterns and comparables and giving it another go?
How do you apply HDD to application development?
There’s a lot of great material and successful practice on the engineering management part of application development. But should you pair program? Do estimates or go NoEstimates? None of these are the right choice for every team all of the time. In this sense, HDD is the only way to reliably drive up your velocity, or fe. What I love about agile is that fundamental to its design is the coupling and integration of working out how to make your release content successful while you’re figuring out how to make your team more successful.
What does HDD have to offer application development, then? First, I think it’s useful to consider how well HDD integrates with agile in this sense and what existing habits you can borrow from it to improve your practice of HDD. For example, let’s say your team is used to doing weekly retrospectives about its practice of agile. That’s the obvious place to start introducing a retrospective on how your hypothesis testing went and deciding what that should mean for the next sprint’s backlog.
Second, let’s look at the linkage from continuous design. Primarily, what we’re looking to do is move fewer designs into development through more disciplined experimentation before we invest in development. This leaves the developers free to do things better and keeps the pipeline healthier (faster and able to produce more content or story points per sprint). We’d do this by making sure we’re dealing with a user that exists, a job/problem that exists for them, and only propositions that we’ve successfully tested with non-product MVP’s.
But wait– what does that exactly mean: ‘only propositions that we’ve successfully tested with non-product MVP’s’? In practice, there’s no such thing as fully validating a proposition. You’re constantly looking at user behavior and deciding where you’d be best off improving. To create balance and consistency from sprint to sprint, I like to use a ‘UX map’. You can read more about it at that link but the basic idea is that for a given JTBD:VP pairing you map out the customer experience (CX) arc broken into progressive stages that each have a description, a dependent variable you’ll observe to assess success, and ideas on things (independent variables or ‘IV’s’) to test. For example, here’s what such a UX map might look like for HVAC in a Hurry’s work on the JTBD of ‘getting replacement parts to a job site’.
From there, how can we use HDD to bring better, more testable design into the development process? One thing I like to do with user stories and HDD is to make a habit of pairing every single story with a simple, analytical question that would tell me whether the story is ‘done’ from the standpoint of creating the target user behavior or not. From there, I consider focal metrics. Here’s what that might look like at HinH.
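As a rough sketch of that pairing in code, here is one way a story could carry its analytical question and focal metric. Everything named here is an assumption for illustration: the `Story` class, the technician persona, the `orders_per_search` metric, and the 0.30 threshold are hypothetical and not taken from HinH's actual backlog.

```python
from dataclasses import dataclass


@dataclass
class Story:
    """A user story paired with the question that tells us if it worked."""
    narrative: str            # the usual "As a..., I want..., so that..." text
    analytical_question: str  # what we will ask of the data after release
    focal_metric: str         # name of the metric that answers the question
    threshold: float          # hypothetical bar for the target behavior

    def is_done(self, observed: float) -> bool:
        # 'Done' here means the target user behavior showed up,
        # not merely that the code shipped.
        return observed >= self.threshold


story = Story(
    narrative=("As a technician, I want to look up a replacement part "
               "so that I can get it to the job site."),
    analytical_question="How often does a part search lead to an order?",
    focal_metric="orders_per_search",
    threshold=0.30,  # illustrative value only
)

print(story.is_done(observed=0.42))  # True
print(story.is_done(observed=0.10))  # False
```

The point of the design is simply that the question and the metric travel with the story, so the retrospective has something concrete to check.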
How do you apply HDD to continuous delivery?
For the last couple of decades, test and deploy/ops was often treated like a kind of stepchild to development- something that had to happen at the end of development and was the sole responsibility of an outside group of specialists. It didn’t make sense then, and now an integral test capability is table stakes for getting to a continuous product pipeline, which is at the core of HDD itself.
A continuous pipeline means that you release a lot. Getting good at releasing relieves a lot of energy-draining stress on the product team as well as creating the opportunity for rapid learning that HDD requires. Interestingly, research by outfits like DORA (now part of Google) and CircleCI shows teams that are able to do this both release faster and encounter fewer bugs in production.
Amazon famously releases code every 11.6 seconds. What this means is that a developer can push a button to commit code and everything from there to that code showing up in front of a customer is automated. How does that happen? For starters, there are two big (related) areas: Test & Deploy.
While there is some important plumbing that I’ll cover in the next couple of sections, in practice most teams struggle with test coverage. What does that mean? In principle, what it means is that even though you can’t test everything, you iterate to test automation coverage that is catching most bugs before they end up in front of a user. For most teams, that means a ‘pyramid’ of tests like you see here, where the x-axis is the number of tests and the y-axis is the level of abstraction of the tests.
The reason for the pyramid shape is that, as you move up, the tests are progressively more work to create and maintain, and each one also provides less and less isolation about where a bug actually resides. In terms of iteration and retrospectives, what this means is that you’re always asking ‘What’s the lowest level test that could have caught this bug?’.
Unit tests isolate the operation of a single function and make sure it works as expected. Integration tests span two or more functions, and system tests, as you’d guess, more or less emulate the way a user or endpoint would interact with a system.
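To make the difference between the first two levels concrete, here is a minimal sketch using plain assertions. The functions `unit_price` and `order_total` and the catalog values are hypothetical, made up for this example.

```python
def unit_price(part_id: str, catalog: dict[str, float]) -> float:
    """Look up the price of one part (hypothetical example function)."""
    return catalog[part_id]


def order_total(part_ids: list[str], catalog: dict[str, float]) -> float:
    """Sum the prices for an order; depends on unit_price."""
    return sum(unit_price(p, catalog) for p in part_ids)


CATALOG = {"filter": 12.5, "capacitor": 30.0}


def test_unit_price():
    # Unit test: exercises a single function in isolation.
    assert unit_price("filter", CATALOG) == 12.5


def test_order_total():
    # Integration test: exercises order_total together with unit_price.
    # If this fails while test_unit_price passes, the bug is probably in
    # order_total itself, which is the isolation the pyramid is after.
    assert order_total(["filter", "capacitor"], CATALOG) == 42.5


test_unit_price()
test_order_total()
```

A system test would sit above both of these and drive the application the way a user or endpoint would, which is why it tells you the least about where a failure lives.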
Feature Flags: These are a separate but somewhat complementary facility. The basic idea is that as you add new features, they each have a flag that can enable or disable them. They start out disabled and you make sure they don’t break anything. Then, on small sets of users, you can enable them and test whether a) the metrics look normal and nothing’s broken and, closer to the core of HDD, b) whether users are actually interacting with the new feature.
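Here is a minimal sketch of that idea, assuming a simple percentage rollout keyed on a hash of the user ID. The flag name `new_parts_search`, the `rollout` dictionary, and both helper functions are hypothetical; real teams typically use a feature flag service or library rather than rolling their own.

```python
import hashlib


def bucket(user_id: str, flag: str) -> int:
    """Map a user to a stable bucket from 0 to 99 for a given flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: str, user_id: str, rollout: dict[str, int]) -> bool:
    """Return True if the flag is on for this user.

    rollout maps flag name -> percent of users (0-100) who see the feature.
    Flags missing from rollout are treated as disabled.
    """
    return bucket(user_id, flag) < rollout.get(flag, 0)


rollout = {"new_parts_search": 0}   # ships disabled
assert not is_enabled("new_parts_search", "user-17", rollout)

rollout["new_parts_search"] = 5     # enable for roughly 5% of users
exposed = [u for u in (f"user-{i}" for i in range(1000))
           if is_enabled("new_parts_search", u, rollout)]
print(len(exposed))  # roughly 50; the exact count depends on the hash
```

Hashing the flag name together with the user ID keeps each user's experience stable across sessions while giving different flags different user samples.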
In the olden days (which is when I last did this kind of thing for work), if you wanted to update a web application, you had to log in to a server, upload the software, and then configure it, maybe with the help of some scripts. Very often, things didn’t go according to plan for the predictable reason that there was a lot of opportunity for variation between how the update was tested and the machine you were updating, not to mention how you were updating.
Now computers do all that- but you still have to program them. As such, the job of deployment has increasingly become a job where you’re coding solutions on top of platforms like Kubernetes, Chef, and Terraform. These folks are (hopefully) working closely with developers on this. For example, rather than spending time and money on writing documentation for an upgrade, the team would collaborate on code/config. that runs on the kind of application I mentioned earlier.
Most teams with a continuous pipeline orchestrate something like what you see below with an application made for this like Jenkins or CircleCI. The Manual Validation step you see is, of course, optional and not a prevalent part of truly continuous delivery. In fact, if you automate up to the point of a staging server or similar before you release, that’s what’s generally called continuous integration.
Finally, the two yellow items you see are where the team centralizes their code (version control) and the build that they’re taking from commit to deploy (artifact repository).
To recap, what’s the hypothesis?
Well, you can’t test everything but you can make sure that you’re testing what tends to affect your users and likewise in the deployment process. I’d summarize this area of HDD as follows:
01 IDEA: You can’t test everything and you can’t foresee everything that might go wrong. This is important for the team to internalize. But you can iteratively, purposefully focus your test investments.
02 HYPOTHESIS: Relative to the test pyramid, you’re looking to get to a place where you’re finding issues with the least expensive, least complex test possible- not an integration test when a unit test could have caught the issue, and so forth.
03 EXPERIMENTAL DESIGN: As you run integrations and deployments, you see what happens! Most teams move from continuous integration (deploy-ready system that’s not actually in front of customers) to continuous deployment.
04 EXPERIMENTATION: In retrospectives, it’s important to look at the test suite and ask what would have made the most sense and how the current processes were or weren’t facilitating that.
05 PIVOT OR PERSEVERE: It takes work, but teams get there all the time- and research shows they end up both releasing more often and encountering fewer production bugs, believe it or not!
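One way to make the retrospective question about the lowest-level test concrete is to tally escaped bugs by the level that could have caught them. This is only a sketch: the `retro_log` entries, field names, and bug IDs are hypothetical, and I've added ‘production’ as a fourth level to represent bugs that reached users.

```python
from collections import Counter

# Ordered from cheapest to most expensive place to find a bug.
LEVELS = ["unit", "integration", "system", "production"]


def missed_by_level(bugs: list[dict]) -> Counter:
    """Count bugs found later than the lowest level that could have caught them."""
    missed = Counter()
    for bug in bugs:
        found = LEVELS.index(bug["found_at"])
        lowest = LEVELS.index(bug["lowest_possible"])
        if found > lowest:
            missed[bug["lowest_possible"]] += 1
    return missed


retro_log = [
    {"id": "BUG-1", "found_at": "production", "lowest_possible": "unit"},
    {"id": "BUG-2", "found_at": "system", "lowest_possible": "unit"},
    {"id": "BUG-3", "found_at": "integration", "lowest_possible": "integration"},
]

print(missed_by_level(retro_log))  # Counter({'unit': 2})
```

In this made-up log, two bugs slipped past the unit level, which would suggest the next test investment belongs at the bottom of the pyramid.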
How does HDD work with Design Thinking, Lean, etc.?
Topline, I would say it’s a way to unify and focus your work across those disciplines. I’ve found that’s a pretty big deal. While none of those practices are hard to understand, practice on the ground is patchy. Usually, the problem is having the confidence that doing things well is going to be worthwhile, and knowing who should be participating when.
My hope is that with this guide and the supporting material (and of course the wider body of practice), teams will get in the habit of always having a set of hypotheses, and that this will improve their work and their confidence as a team.
Naturally, these various disciplines have a lot to do with each other, and I’ve summarized some of that here:
Mostly, I find practitioners learn about this through their work, but I’ll point out a few big points of intersection that I think are particularly notable:
- Learn by Observing Humans
We all tend to jump on solutions and overinvest in them when we should be observing our user, seeing how they behave, and then iterating. HDD helps reinforce problem-first diagnosis through its connections to relevant practice.
- Focus on What Users Actually Do
A lot of things might happen- more than we can deal with properly. The good news is that by just observing what actually happens you can make things a lot easier on yourself.
- Move Fast, but Minimize Blast Radius
Working across so many types of orgs at present (startups, corporations, a university), I can’t overstate how important this is and yet how big a shift it is for more traditional organizations. The idea of ‘moving fast and breaking things’ is terrifying to these places, and the reality is that with practice you can move fast and rarely break things/only break them a tiny bit. Without this, you end up stuck waiting for someone else to create the perfect plan or for that next super important hire to fix everything (spoiler: it won’t and they don’t).
- Minimize Waste
Succeeding at innovation is improbable, and yet it happens all the time. Practices like Lean Startup do not warrant that by following them you’ll always succeed; however, they do promise that by minimizing waste you can test five ideas in the time/money/energy it would otherwise take you to test one, making the improbable probable.
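To put some arithmetic behind ‘making the improbable probable’, here is a small sketch. It assumes each idea succeeds independently with the same probability, and the 20% figure is purely illustrative, not a number from any research cited here.

```python
def chance_of_at_least_one_win(p: float, n: int) -> float:
    """Probability that at least one of n independent ideas succeeds,
    if each one succeeds with probability p."""
    return 1 - (1 - p) ** n


# One big bet versus five cheaper tests for the same time/money/energy.
print(round(chance_of_at_least_one_win(0.2, 1), 2))  # 0.2
print(round(chance_of_at_least_one_win(0.2, 5), 2))  # 0.67
```

Under those assumptions, testing five ideas instead of one moves the odds of finding something that works from one in five to roughly two in three.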
What I love about Hypothesis-Driven Development is that it solves a really hard problem with practice: that all these behaviors are important and yet you can’t learn to practice them all immediately. What HDD does is it gives you a foundation where you can see what’s similar across these and how your practice in one is reinforcing the other. It’s also a good tool to decide where you need to focus on any given project or team.