Sunday, February 19, 2023

Testing Tale of Toil

As a developer who's committed to Test-Driven Development (TDD), I've believed in the value of writing tests for a long time.  However, at times I've questioned whether they truly deliver on their promise and wondered if they're worth the effort.  Recently I measured just how much time we waste maintaining tests.  The answer?

flaky tests waste 2/3 of developer time

Before I tear testing a new one, let me assert for the record that I still value testing.  But with such a high proportion of time wasted maintaining tests, I have to question its value.  Part of my aim in sharing this analysis is to call on the developer community for help in reducing the toil, so that more development teams can enjoy the benefits of writing automated tests.

In this blog post, I will describe a case study of upgrading a dependency and the costs and benefits of our regression test suite.  To conclude, I propose a better approach: a robot tester that can click, see, and think like a human.

Case Study: Upgrade to React Router 6.x

This tale of testing toil pertains to a web app written with React in TypeScript and a test suite written with Cypress.  The code base was about 3 years old at the time of writing and consisted of about 20 thousand lines of production code and well over 500 Cypress tests across 14 thousand lines of test code.  As a routine task, we needed to upgrade dependencies, and in this case, upgrading to React Router 6.x involved a lot of work because of its many breaking changes.  This is where regression tests can prove very helpful because, ideally, a dependency upgrade changes no functionality from an end-user perspective.  Well, did they?

To examine whether the regression tests proved beneficial or more of a nuisance, we analyzed all the commits during the bug-fixing phase of the upgrade.  In other words, once the upgrade was code-complete, we began evaluating the quality using our regression test suite and fixing bugs.  The analysis below will reveal that 2/3 of the changes made during this bug-fixing phase were actually to stabilize flaky tests.  Only 1/3 of the commits made to the code repository pertained to fixing legitimate end-user-facing bugs.

The first chart below, Test Status, shows the portion of tests passing and failing over time.  The timeframe was 3 days.  At time 0 in this chart, we believed the migration to React Router 6.x was complete, so we ran the first Continuous Integration (CI) pipeline to see if the regression tests passed.  At this point, about 15% of the tests failed:

[Chart: Test Status, showing the share of passing (green) and failing (red) tests over the 3-day period]
A glass-half-full perspective would conclude that was a good starting point.  If the failures had all been due to real issues caused by upgrading the dependency, the tests would have done a great job preventing problems from reaching our customers.  But what ensued was mainly a struggle with Cypress to stabilize the tests.  In fact, that large hump of red on the right side of the chart was entirely flakiness.  Do you see where the red cliff drops off completely at the end?  Well, that, my friends, was simply a matter of rerunning the test suite!  To be clear, we made no code changes at the end, and we went from over half the tests failing to all the tests passing.  That is a very frustrating experience, but ultimately a happy conclusion (as long as the tests still pass on the next run...).

The second chart below, Bug Fixing vs. Toil, covers the same timeframe as the first and shows how we spent our efforts to achieve the green passing-test victory shown above.  It plots, cumulatively, how our work was split between fixing legitimate end-user-facing bugs and straight-up "toil".

[Chart: Bug Fixing vs. Toil, showing cumulative effort spent on legitimate bug fixes vs. test stabilization over the same 3 days]

Our friend ChatGPT explains toil this way: 

In the context of Agile software development, "toil" refers to work that is necessary but does not add any value to the customer or the end-user.  Toil is different from productive work, which delivers value to the customer, and is often seen as a form of waste that can be minimized or eliminated through automation, process improvement, or other techniques.

The most significant thing to observe in this graph is the final proportion of Bug Fixing to Toil: for every 2 test-stability issues we fixed (Toil), we fixed only 1 legitimate end-user-facing bug.  That's the basis for my claim at the beginning:

flaky tests waste 2/3 of developer time

For our third and final graph, we'll reveal what all that time was wasted on.  The graph below shows that there were three main types of changes required to stabilize tests:

[Chart: breakdown of test-stabilization changes into three types: adding waits, unnatural tests, and redone validation]
To be clear, this is an analysis of just the 2/3 of changes that did not pertain to legitimate bugs; the rest of this post describes the three types of issues that were a pure waste of time from a developer's perspective.  The legitimate bugs won't be described in detail here, aside from one example: an end-user warning dialog box had to be reimplemented because React Router 6.x dropped support for its Prompt component, and it took several tries to come up with a working solution.
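For the curious, here is a minimal sketch of the kind of replacement that entailed.  It assumes a React Router version that ships the unstable_useBlocker hook (6.7 or later) and an app mounted on a data router; the component and prop names are hypothetical, and our actual implementation differed:

    import { unstable_useBlocker as useBlocker } from "react-router-dom";

    // Hypothetical stand-in for React Router 5's <Prompt>: warn the user
    // before in-app navigation while there are unsaved changes.
    // Requires a data router (e.g., createBrowserRouter) in React Router 6.7+.
    export function UnsavedChangesPrompt({ when }: { when: boolean }) {
      const blocker = useBlocker(when);

      // Nothing to render unless a navigation attempt is currently blocked.
      if (blocker.state !== "blocked") return null;

      return (
        <div role="alertdialog">
          <p>You have unsaved changes. Leave this page anyway?</p>
          <button onClick={() => blocker.proceed()}>Leave</button>
          <button onClick={() => blocker.reset()}>Stay</button>
        </div>
      );
    }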

Toil Type: Add wait

Many times when a test fails, it's just because the test code didn't know how long to wait before clicking or validating something on the screen.  A handy solution is to tell the test code, Cypress in this case, to wait.  However, the Cypress documentation tells us this is unnecessary: 
"In Cypress, you almost never need to use cy.wait() for an arbitrary amount of time." (source)  
The reality is that avoiding a wait statement can require considerable effort or unnatural acts.  When a command fails, Cypress nudges me toward its preferred alternatives with the following error message:
You typically need to re-query for the element or add 'guards' which delay Cypress from running new commands. Learn More
For one thing, the Learn More link Cypress gives us is dead.  But more importantly, sometimes it's just not possible or natural to add a "guard".  Modern web applications are asynchronous: parts of a page can load in any order.  Cypress recommends the approach of "guards", by which I believe they mean extra steps in your Cypress script that ensure a particular change has taken effect.  For example, wait for a loading indicator to disappear, or wait for a particular element to appear on the screen.  But sometimes this is infeasible.  For example, consider a table of data that is supposed to update in response to a button click.  The structure of the page doesn't change, just the numbers in the table.

A Cypress guard would entail one statement to verify the desired content has loaded and another to validate that content.  But in this case, those are the same.  Furthermore, although displaying a loading indicator would help Cypress know when to attempt to validate the table of data, loading indicators only make sense for actions that take more than a second or so to complete.  It would be unnatural to add one just to help Cypress.
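To make that concrete, here is a sketch of the guard pattern; the selectors and values are hypothetical:

    // A typical Cypress guard: hold off further commands until the UI
    // signals that loading has finished, then validate the result.
    cy.get('[data-testid="loading-spinner"]').should('not.exist');      // guard
    cy.get('[data-testid="totals-table"]').should('contain', '1,234');  // validation

    // But when a click only changes the numbers inside an existing table
    // (no spinner, no structural change), the guard and the validation
    // collapse into the same assertion, so the guard adds nothing:
    cy.get('[data-testid="refresh-button"]').click();
    cy.get('[data-testid="totals-table"]').should('contain', '1,234');  // guard == validation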

So, the most pragmatic solution is just to add a wait statement.
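In practice, that meant changes like this one (the selectors are hypothetical, and the 500ms value was found by trial and error):

    cy.get('[data-testid="refresh-button"]').click();
    cy.wait(500); // give the table time to re-render before validating
    cy.get('[data-testid="totals-table"]').should('contain', '1,234');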

Toil Type: Unnatural tests

Ideally, test scripts reflect real, natural sequences of user actions.  But sometimes, test frameworks need the steps to be different from what a real human would do.  In this project, steps had to be added to tests just to clear tooltips from the screen.  Normally when a user wants to explore the tooltips on a screen, they mouse over one widget, read the tooltip, and then mouse over another widget to read its tooltip.  What we began seeing was that the tooltip of the first widget would not disappear from the screen when the test moved on to the second widget.  Perhaps this was a peculiarity of Cypress's tooling or of Electron, the obscure browser it runs by default.  Suffice it to say, it's not a situation a real user would encounter in a popular browser.

The solution we found for this issue was to add a step to press the escape key.  We added this after the validation of the first tooltip and before the step that moves to the second widget.  However, Cypress wouldn't just let us press the escape key on its own.  We had to click on a completely unrelated text field and direct the escape key press at that field.
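Roughly, the resulting script looked like this sketch (the selectors and tooltip text are hypothetical):

    cy.get('[data-testid="widget-a"]').trigger('mouseover');
    cy.get('.tooltip').should('contain', 'First tooltip text');

    // Unnatural step: Cypress offers no free-floating "press Escape", so we
    // focus an unrelated text field and type {esc} there just to dismiss
    // the lingering tooltip.
    cy.get('input[name="search"]').click().type('{esc}');

    cy.get('[data-testid="widget-b"]').trigger('mouseover');
    cy.get('.tooltip').should('contain', 'Second tooltip text');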

This workaround, made purely for Cypress's sake, encodes an unnatural user flow, and having to come up with it was a complete waste of developer time.

Toil Type: Redo validation

Although Cypress offers many programming constructs to validate what's on the screen, sometimes it's just more efficient to find a way to validate outside of Cypress.  In this project, we started seeing flaky tests due to timing issues that could not be solved by adding wait statements.  This was extremely frustrating because our human eyes could see valid data on the screen, but because Cypress can hold a slightly outdated representation of that data (even just a single millisecond older!), it failed the test.

Our solution was to add a feature to download the screen's data to CSV and then have Cypress call out to a third-party library to validate the CSV contents.  Although this was a further waste of developer time, there was a silver lining: our users were delighted that we added this download capability.  Their feedback was that we had "read their minds".
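On the Cypress side, the validation ended up looking roughly like this sketch.  It assumes the CSV lands in Cypress's default downloads folder and uses papaparse as the third-party parser; the file name, selectors, expected values, and choice of library are all assumptions:

    import Papa from 'papaparse'; // hypothetical choice of CSV parser

    it('validates the table via CSV download', () => {
      cy.get('[data-testid="download-csv"]').click();

      // Validate the file contents instead of the rendered DOM,
      // sidestepping Cypress's possibly stale view of the screen.
      cy.readFile('cypress/downloads/report.csv').then((csv: string) => {
        const { data } = Papa.parse<string[]>(csv.trim());
        expect(data[0]).to.deep.equal(['Region', 'Total']);
        expect(data.length).to.be.greaterThan(1);
      });
    });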

Conclusion

Although the creativity required for testing sometimes pushes us to introduce valuable new features like the CSV download, we would rather devote our scarce time and effort to developing end-user-facing features.  Quality is just a hygiene factor; customers pay for value-added features, especially in innovative products.  That's where we want to spend our time, not on tests.

To recap, this case study examined the value of test automation.  Regression testing is where testing can potentially deliver a lot of value: we're not writing new tests, we're just hoping our existing tests catch bugs introduced by low-level changes.  Our low-level change was the upgrade of a complex dependency, React Router.  Fortunately, our tests did catch some legitimate end-user-facing bugs.  We are grateful to the tests for those catches, but for every legitimate test failure, there were 2 illegitimate failures.  Thus 2/3 of our bug-fixing efforts went toward stabilizing tests: adding waits, modifying tests to the point that they reflect unnatural user behavior, and simply ditching Cypress's validation in favor of more reliable third-party tools.

I believe it will never be easy to test software if we keep trying to emulate browsers, simulate clicks, and push test frameworks to try to see what is plainly obvious to the human eye.  We are trying to put the cart before the horse.  Hardware and software platform providers will always prioritize their end-user experience over making their system easy to test.  So if we can't beat 'em, join 'em.

In the future, I would like a test framework that tests from an end-user perspective.  Literally, I want a physical robot that can click, see, and think like a human.  For more on that vision, please see another rant I wrote recently, There must be a better way to automate integration testing.
