Learning Objectives
After reading this note and the related practice, you’ll be able to:
- Design intentional functional tests with the ‘given-when-then’ pattern
- Code system test automation with Playwright
Getting Started
To code is to test. Getting intimate with this proposition is a lot of what separates those who ‘can code’ from those who have convinced themselves they can’t. This note assumes some basic familiarity with HTML, CSS, and JS. It also assumes you have Cursor and GitHub set up and working together (see note: Cursor & GitHub), though you can readily do pretty much the same thing with just the basic ChatGPT UX (or similar).
Now more than ever, being a ‘good’ coder is just as much about being a clear, disciplined thinker as it is having lots of experience with a particular programming language or set of development tools. To that end, we’ll approach the same four steps you’ll know from our degree program or online courses. Those are:
- Focusing Your Design Intent
What do you want to test? Why? And how will you know if it’s working or not?
- Unpacking Into Codeable Steps
For standard test code, how do you take your design intent and describe it for testing that is not overly detailed but thorough for your purposes?
- Effectuating Between Alternatives
Where does this test sit in the bigger picture of testing your code? What tools make sense for it, and why?
- Iteratively Coding, Testing, and Debugging
How do you get it done? How do you verify that it really is done, in the sense of working properly?
We’ll get after all that in the sections that follow.
Focusing Your Design Intent
You want to make sure the code you’ve just written works the way you intended. You also want to minimize false positives and the related overhead of keeping your test suite up and running. That’s a good place to start, generally speaking.
Within this framing, we’re looking specifically at one part of your overall test portfolio: the system test. Basically, these are a kind of summary test to make sure that all the individual pieces that make up your UX are working together. They are not the best way to test those individual components; for more on that, see Exhibit 1 on the test portfolio.
Let’s turn our attention to an example you’ll be familiar with from our cases: searching for an HVAC part. Here’s the user story:
I don’t know the part number and I want to try to identify it online so I can move the job forward.
And here’s a working version of the UI, to refresh your memory:

In the current implementation, filtering the list of parts by, say, manufacturer is a key piece of the UX. To recap, our intent is to write a high-level system test to check that the functional implementation of that user story is working as we designed it. In the next section, we’ll look at how to unpack that into a test that’s readily codeable (by you or by an AI).
Unpacking Your Tests into Codeable Steps
How might we unpack our intent for such a system test of the implemented user story above? Probably the most contemporary (and popular) take on this is ‘behavior-driven development’ (BDD). BDD is a collaborative approach to specifying and testing software that focuses on observable behavior, not internal implementation. Instead of writing requirements as abstract statements (“the system should…”) or low-level technical tasks, BDD expresses expected behavior in a narrative form that business, design, engineering, and QA/test can all read, understand, and collaborate around.
The core structure is the Given–When–Then pattern:
Given — sets the initial context or preconditions.
(“Given a logged-in user with an empty cart…”)
When — describes a meaningful action the user or system performs.
(“When they add a product to the cart…”)
Then — specifies the expected, observable outcome.
(“Then the cart should display that product with the correct price.”)
This structure serves two key purposes:
1. Shared understanding – Everyone sees requirements as concrete user scenarios rather than abstract intentions.
2. Executable specifications – These scenarios can be turned into automated tests (e.g. Playwright) that validate behavior continuously as the product evolves.
BDD keeps teams aligned on what “functionally done” looks like, anchoring development, design, and testing around real user behavior rather than assumptions or implementation details. The idea is to verify ‘working as designed’ vs. ‘the design is working’; the latter question matters too, but one thing at a time, and BDD is excellent for the former.
For the HinH story on filtering HVAC parts, a given-when-then unpacking might be something like:
Given that the user has arrived at the baseline state of the ‘Parts’ page with no filter applied
When they filter for ‘Acme’ and they press the ‘Go’ button to filter,
Then only parts with ‘Manufacturer’ as ‘Acme’ display.
Now, you might wonder: should we test that, or should we check more generally that every part in the database with manufacturer ‘Acme’ (and available for sale) displays? Perhaps! But then we get into the question of whether a system test is the right fit for that, and other things you can find in Exhibit 1. For now, we’ll assume this is the right test for our particular purpose.
Effectuating Between Alternatives
Short, Starter Version
If we narrow our design intent to the specific example here of automating a system test for a given user story and implementation, then choosing the right approach and toolchain is straightforward. If we don’t already have one we like, we want a system test tool. These are basically libraries created to make traversing the DOM easy and transparent. Wait, what’s that? Well, the DOM (Document Object Model) is basically the conceptual model and standard for Views in HTML. To ‘traverse’ it is just to click through and interact with its various components. For example, such tools make coding an operation like this easy and natural:
‘select [something] from the dropdown, press the submit button, and then make sure that [certain things] appear on the next page’.
In this example, we’ll use a popular, newer framework called ‘Playwright’ that Microsoft maintains. Selenium (also open source) is another long-standing, popular option. We’ll also assume you’re set up on the Cursor IDE (see the tech note on Cursor).
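To make that concrete, here’s a minimal sketch of the quoted ‘select, submit, verify’ operation in Playwright. Everything specific in it (the URL, the ‘#manufacturer’ and ‘#go’ selectors, the ‘.results’ container) is a hypothetical placeholder, not taken from a real app:

```javascript
// A minimal Playwright sketch of 'select, submit, verify'.
// All selectors and the URL below are hypothetical placeholders.
const { test, expect } = require('@playwright/test');

test('select a manufacturer and verify the results', async ({ page }) => {
  // Open the page under test (assumed local dev server)
  await page.goto('http://localhost:3000/parts');

  // Select [something] from the dropdown...
  await page.selectOption('#manufacturer', 'Acme');

  // ...press the submit button...
  await page.click('#go');

  // ...and make sure [certain things] appear on the next page.
  await expect(page.locator('.results')).toContainText('Acme');
});
```

You’d run this with `npx playwright test` once Playwright is installed; the point here is just how closely the code tracks the English description.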
Longer Version
There are a few other things about the job of test that are worth knowing for a general manager (MBA, etc.) in tech:
- Test Pyramid
Basically, there’s an established rubric about the kinds of tests you want to have and the proportions of those types. The system test we’re working on in this note is just one type (at the top of the test pyramid).
- Toolchains through the Pyramid
Given there’s a pyramid and several different types of tests, teams will select a set of tools suited to their needs. For tests other than system tests, those needs depend on the programming language or languages and the CI (continuous integration) tool they are using.
- Adaptive (Agile) Test Plan Design
How do you do step 1 (express design intent) for automated testing across your codebase? That’s a good question, and the answer is basically ‘as you learn more about what breaks and why’.
If you’re not familiar with these, it’s a good idea to learn a little bit about their fundamentals. For that, see ‘Exhibit 1: Effectuating an Approach Across the Job of Test’.
Now, on to the job of getting your system tests running!
Iteratively Coding, Testing (Verifying) and Debugging
Step 1: Setting Up Playwright with Cursor
You’ll need the Playwright packages and the Playwright extension for Cursor. As is so often the case with Cursor, we mostly just need a clear intention around what we want, and can rely on Cursor’s interface with your favorite LLM to do the rest.
Ask the Agent AI how to get those.
Once you have that, ask it to create the test or tests you want using the given-when-then pattern. Have it run the tests and see how it works out. If some fail, ask why. Ask it to comment the code and explain it back to you. Worst case, ensemble that with advice from another model/chat UX.
We’re going to set up Playwright to run on our local dev environment with Cursor. I started with this simple prompt:
I'd like to add some system tests with Playwright. Can you set that up?
Cursor (along with Claude, OpenAI, etc.) will then handle downloading and installing the right packages, for which it will need your permission. It will then add configuration files to your code so that you can run the tests. Here’s a snippet of how that might look in Cursor:

It will also suggest that you install an extension for Playwright, which will simplify (abstract) some of its interaction with Cursor; I’ve found that useful as well.

Note: This says ‘for VS Code’; Cursor is actually a fork of VS Code and compatible with most of its plug-ins.
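As for those configuration files: the central one is playwright.config.js. The sketch below is a hypothetical minimal version, not necessarily what Cursor will generate for you; the testDir and baseURL values are assumptions about a typical local setup:

```javascript
// playwright.config.js — a minimal, hypothetical configuration sketch
const { defineConfig } = require('@playwright/test');

module.exports = defineConfig({
  testDir: './tests',                 // where the system tests live (assumed)
  use: {
    baseURL: 'http://localhost:3000', // your local dev server (assumed)
    headless: true,                   // run browsers without opening windows
  },
});
```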
Step 2: Writing & Interpreting Your First Test
If it’s feeling energetic, it may also write some sample tests. It did in my case, so I proceeded with the following prompt so I could start simple and intentional with the test I designed above:
I don't actually want all those tests. Would you remove them and just create one test for this behavior:
Given that the user has arrived at the baseline state of the 'Parts' page with no filter applied
When they filter for 'Acme' and they press the 'Go' button to filter,
Then only parts with 'Manufacturer' as 'Acme' display.
Cursor did just that. To see the report with the tests running, just type these commands (from Playwright) into Terminal (Mac) or PowerShell (Windows):
npm test
npx playwright show-report
From there, Playwright will open a browser window with an HTML/CSS View of the test results:

You’ll find all the test code Playwright has written in a folder called /tests, alongside the rest of your file tree (HTML, CSS, JS, etc.). The code’s a little long, but you can find it in the project repo here: Playwright JS File for Testing Parts Page. That code has a header, comments, etc., but, as always, your best option for understanding it is probably to run it, or your own version (including edits!), yourself with help from ChatGPT.
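Before you open that file, here’s a hedged sketch of how the given-when-then above might translate into a Playwright test. The selectors (‘#manufacturer’, ‘#go’, ‘.part-row .manufacturer’) and the /parts path are illustrative guesses, not the project’s actual code:

```javascript
// Hypothetical sketch: given-when-then mapped onto a Playwright test.
// Selectors and the /parts path are guesses, not the real project code.
const { test, expect } = require('@playwright/test');

test('only Acme parts display after filtering', async ({ page }) => {
  // Given: the baseline 'Parts' page with no filter applied
  await page.goto('/parts');

  // When: they filter for 'Acme' and press the 'Go' button
  await page.selectOption('#manufacturer', 'Acme');
  await page.click('#go');

  // Then: only parts with Manufacturer 'Acme' display
  const manufacturers = await page
    .locator('.part-row .manufacturer')
    .allTextContents();
  for (const name of manufacturers) {
    expect(name).toBe('Acme');
  }
});
```

Notice how each clause of the scenario becomes a comment anchoring a few lines of code; that one-to-one mapping is much of BDD’s practical appeal.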
Exhibit 1: Effectuating an Approach Across the Job of Test
More on the Test Pyramid
System tests are great for what they do, but that’s only a small part of the picture. Also, it’s easy to overdo it with system tests, creating waste. A conventional take on all this is the ‘test pyramid’:

This concept is attributed to agilist Mike Cohn, but given a heavy lift in popularity by agilist Martin Fowler. The fundamental finding is that most teams minimize waste and maximize outcomes with ‘pyramid’ proportions on their system, integration, and unit tests (top to bottom). System tests are useful for verifying the ‘big picture’ of the target UX: finding a part, for example, in the case of HVAC in a Hurry. In other words, most teams find it’s useful to have a few ‘summary’ tests of this type to make sure all of the lower-level subcomponents are working together as intended. However, system tests have major limitations: they are slow and resource-heavy to run, and they tend to be brittle, breaking when the UI changes even if the underlying logic is fine.
There are more specific tests supporting the lower-level components of the app (Model, Controller, components). ‘Unit’ and ‘integration’ tests should do most of the heavy lifting to make sure your code stays functional. The idea with a unit test, for starters, is that it’s limited to the scope of a single function: it only gets input from or delivers output to ‘fake’ functions designed specifically to test it. These fake functions are often called ‘mocks’ or ‘stubs’. The snippet below shows a function that takes a price and returns the total price with tax:
function addTax(amount, taxRate) {
  return amount + amount * taxRate;
}

/* example usage:
   price = 5.00;
   total = addTax(price, 0.10); // 10% tax → 5.50
*/
This is a pretty simple function, so we might want just two test cases. The snippet below shows two unit tests written in a common test framework (for unit and integration tests) called Jasmine. Basically, what you’re seeing in the snippet is a general declaration of the test suite (‘describe(…)’) and then two actual tests (each starting with ‘it(…’). You might wonder “Why the need for all this labeling?”, and the answer is that the output of these tests is often rendered into a View where developers (testers, etc.) can see exactly which tests ran, whether they passed or failed, and why (if you’re lucky). The other concept that might be new to you is the idea of an assertion: basically, this is the pattern most test languages use to say ‘check that the result is [x]’. In terms of the given-when-then pattern, this is the coding language’s way of checking the ‘then’.
/* ---------------------------------------------------------
   Jasmine Unit Test Example
   Function Under Test: addTax(amount, taxRate)
   This test demonstrates:
   • how to call a function inside a test
   • how to compare expected and actual values
   • how to write multiple test cases
--------------------------------------------------------- */
describe('addTax', function () {
  // Test #1: normal case with a tax rate
  it('adds tax correctly', function () {
    // Arrange + Act
    const result = addTax(100, 0.1); // 10% tax
    // Assert
    expect(result).toBe(110);
  });

  // Test #2: edge case—no tax applied
  it('works with zero tax', function () {
    // Arrange + Act
    const result = addTax(50, 0); // 0% tax
    // Assert
    expect(result).toBe(50);
  });
});
Integration tests look a lot like unit tests, but they test an interaction between multiple functions (or third-party APIs). If a team were using, say, Jasmine for testing their JS code, most likely they’d use it across unit and integration tests. If the team also has, for example, Python, they’d probably use a different testing framework for that: ‘pytest’ being one popular option.
More on Toolchains through the Pyramid
Different layers of the test pyramid naturally pair with different kinds of tools. The higher you go, the more the tooling looks like a full “user simulation”; the lower you go, the more the tooling becomes lightweight, fast, and close to the code itself. Good teams match their tools to the purpose and cost of each layer.
At the top (system tests) teams rely on browser-automation frameworks—Playwright, Cypress, Selenium, etc. These tools mimic real user behavior: clicking, typing, navigating, waiting for async events, and rendering full pages. They’re powerful, but also the slowest and most resource-heavy, which is why teams keep the number of system tests relatively small.
In the middle (integration tests) developers typically use lighter-weight testing libraries—Jasmine, Jest, Mocha, Pytest, etc.—but extend them with plugins, mocks, or lightweight servers to simulate interactions between modules or services. Here the toolchain is about balancing realism with speed: realistic enough to catch misaligned components, but fast enough to run often.
At the base (unit tests) teams use the same testing libraries as integration tests, but without any of the heavier scaffolding. These tests leverage simple assertion libraries, mock/stub utilities, or snapshot features. They’re intentionally minimal because the aim is rapid feedback: hundreds or thousands of tests running in seconds.
The general trend down the pyramid is:
more realism at the top → more speed and simplicity at the bottom.
And across the whole pyramid, the toolchain forms a complementary stack. Heavy system-test tools verify the end-to-end experience; flexible integration-test tools validate the connections; fast unit-test tools make everyday development safe. When these tools are chosen intentionally, they reinforce one another, producing a testing strategy that is robust, efficient, and, above all, waste-minimizing. How do teams actually decide what to test? That’s the topic of our next and last section.
More on Adaptive (Agile) Test Plan Design
In adaptive (agile) environments, test planning isn’t a one-time activity. Instead, it evolves as the team learns about the system, the users, and the kinds of failures that actually matter. The test plan is less of a fixed document and more of a strategy for allocating testing effort and tooling based on what the team knows today.
A common pattern is to begin by automating the highest-risk behaviors using whatever system-test tool the team already has in place—often Playwright, Cypress, or Selenium. These early, top-of-pyramid tests serve as guardrails for the core user journeys. They give the team confidence while the product is still taking shape, long before the architecture underneath is stable enough for meaningful unit or integration tests.
As the design solidifies, teams gradually shift effort down the pyramid, supplementing the few expensive system tests with broader, faster suites using libraries like Jasmine, Jest, Mocha, or Pytest. These lower-level tests are easier to write once the underlying functions and modules stop changing every day. They also run far faster, which makes them ideal for continuous integration: developers get feedback with every commit, often in seconds.
Throughout the project, the team continuously adjusts its tool usage by asking a simple question whenever something breaks:
“At which layer—and with which tool—should this failure have been caught?”
Sometimes the answer is a missing unit test. Sometimes it’s an integration test that needs better mocking. And occasionally it’s a system test that should be promoted to cover a newly discovered edge case. This reflective loop gradually shapes a balanced, resilient test suite.
In this way, Agile test design becomes a process of ongoing calibration. Teams start with the tools that provide the most immediate value, add or rebalance tests as the architecture matures, and use tooling strategically to maintain speed without sacrificing confidence. The result isn’t just a test plan—it’s a testing practice that adapts as the product and team evolve.