Friday, February 24, 2023

Unlocking Your Programming Potential: How AI-powered Test-driven Development Can Help You Write Better Code

Full disclosure: ChatGPT wrote everything below; at the end of the blog post I will list out all of my prompts that generated this post.

As a software engineer, I've always struggled with coming up with efficient algorithms. I'm constantly amazed by engineers who can think of optimized algorithms on the fly. To compensate for this weakness, I've relied heavily on Test-Driven Development (TDD) to help me focus on what I feel confident in: identifying test cases, especially corner cases. However, where I struggle is in writing efficient algorithms that satisfy the tests. So I've been thinking about ways to use ChatGPT to help me with that.

To test this idea, I gave ChatGPT a challenge: write a function that takes a list of integers as input and returns a sorted list in ascending order, without revealing the goal to ChatGPT. I provided ChatGPT with several test cases, such as a list with one element, a list with two elements, or a list with three elements. Slowly but surely, ChatGPT developed an efficient algorithm that passed all of the provided test cases.

For example, one of the test cases I provided was a list with two elements, which should be sorted in ascending order. ChatGPT initially wrote a simple comparison-based sorting algorithm that used a loop to compare each pair of elements in the list, and then swapped them if they were out of order. This algorithm was inefficient for larger lists, but it passed the provided test case.

As I provided more test cases, such as a list with three elements, ChatGPT began to realize that the goal of the algorithm was to sort the input list in ascending order, and it developed a more efficient algorithm based on the merge sort algorithm. This algorithm used divide-and-conquer to recursively divide the input list into smaller sublists, sort the sublists, and then merge the sorted sublists to obtain the final sorted list. This algorithm was much more efficient than the initial sorting algorithm, and it passed all of the provided test cases.

Through this exercise, I realized that TDD can be partitioned into the part I feel good at (coming up with test cases) and the part I'm not good at (writing efficient algorithms), and that ChatGPT can help with the latter part. By using ChatGPT to generate code based on the provided test cases, I can focus on what I'm good at and still develop high-quality software that meets my requirements.

Here is ChatGPT's final implementation (in Python):


import random
import unittest

def my_function(input_list):
    # Merge sort: a list of zero or one elements is already sorted.
    if len(input_list) <= 1:
        return input_list

    mid = len(input_list) // 2
    left = input_list[:mid]
    right = input_list[mid:]

    left = my_function(left)
    right = my_function(right)

    return merge(left, right)

def merge(left, right):
    # Merge two already-sorted lists into a single sorted list.
    result = []
    i, j = 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    result += left[i:]
    result += right[j:]
    return result

class TestMyFunction(unittest.TestCase):
    def test_returns_input_list(self):
        input_list = [0]
        expected_output = [0]
        result = my_function(input_list)
        self.assertEqual(result, expected_output)

    def test_reverses_input_list(self):
        input_list = [1, 0]
        expected_output = [0, 1]
        result = my_function(input_list)
        self.assertEqual(result, expected_output)

    def test_sorts_input_list(self):
        input_list = [1, 0, 2]
        expected_output = [0, 1, 2]
        result = my_function(input_list)
        self.assertEqual(result, expected_output)

    def test_stress(self):
        input_list = [random.randint(0, 1000000) for _ in range(1000000)]
        expected_output = sorted(input_list)
        result = my_function(input_list)
        self.assertEqual(result, expected_output)

if __name__ == '__main__':
    unittest.main()


ChatGPT prompts

  1. Can you help me do test driven development?
  2. Can you write my first test? Input: [0], Output: [0]
  3. Add the next test which is Input: [1, 0], Output: [0, 1]
  4. Add my third test which is Input: [1, 0, 2], Output: [0, 1, 2]
  5. I do not want my_function to call sorted() though. Redo it please.
  6. What is the time complexity of my_function?
  7. Optimize my_function
  8. Optimize it even further please
  9. Whoa hold up I asked you not to call any built-in sort functions. Please redo.
  10. Write a stress test
  11. Show me the final code
  12. Write a blog post about how ChatGPT can be used for test-driven development and show this coding exercise as an example. Rewrite it though assuming that ChatGPT does not know what the goal of the algorithm is. The human understands the goal but does not reveal it to ChatGPT. The human is supposed to provide the test cases and ChatGPT is supposed to generalize from the test cases and write the function. Write an introductory paragraph saying that I have been thinking about ways of using ChatGPT to help me write better code. I want to confess that I have never felt very confident as a software engineer because I find it hard to think of efficient algorithms. I have always been amazed by engineers who can think of efficient algorithms. I have found Test-Driven Development helps me a lot because it allows me to focus on what I feel confident in. I feel confident identifying test cases, especially corner cases. But where I struggle is in writing efficient algorithms that satisfy the tests. So I wanted to see if I could do Test-Driven Development partitioning the part I feel good at (coming up with test cases) from the part I'm not good at, writing the algorithm, and let ChatGPT do that for me. Then describe the challenge I gave ChatGPT (the coding exercise). Describe how I gave ChatGPT test cases and slowly but surely ChatGPT developed an efficient algorithm. Explain some of the tests as examples.
  13. List out all of the prompts I have given you today.

Sunday, February 19, 2023

Testing Tale of Toil

As a developer who's committed to Test-Driven Development (TDD), I've believed in the value of writing tests for a long time.  However, at times I've questioned whether they truly deliver on their promise and wondered if they're worth the effort.  Recently I measured just how much time we waste maintaining tests.  The answer?

flaky tests waste 2/3 of developer time

Before I tear testing a new one, let me assert for the record that I still value testing.  But with such a high proportion of time wasted maintaining tests, I have to question whether the value justifies the cost.  My aim in sharing this analysis is also to call on the developer community for help in reducing the toil, so that more development teams can enjoy the benefits of writing automated tests.

In this blog post, I will describe a case study of upgrading a dependency and the costs and benefits of our regression test suite.  To conclude, I propose a better approach: a robot tester that can click, see, and think like a human.

Case Study: Upgrade to React Router 6.x

This tale of testing toil pertains to a web app written with React in TypeScript and a test suite written with Cypress.  The code base was about 3 years old at the time of writing and consisted of about 20 thousand lines of production code, plus well over 500 Cypress tests spanning 14 thousand lines of test code.  As a routine task, we needed to upgrade dependencies, and in this case, upgrading to React Router 6.x involved a lot of work because of its many breaking changes.  This is where regression tests can prove very helpful because, ideally, no functionality should change from an end-user perspective when upgrading a dependency.  Well, did they?

To examine whether the regression tests proved beneficial or more of a nuisance, we analyzed all the commits during the bug-fixing phase of the upgrade.  In other words, once the upgrade was code-complete, we began evaluating the quality using our regression test suite and fixing bugs.  The analysis below will reveal that 2/3 of the changes made during this bug-fixing phase were actually to stabilize flaky tests.  Only 1/3 of the commits made to the code repository pertained to fixing legitimate end-user-facing bugs.

The first chart below, Test Status, shows the portion of tests passing and failing over time.  The timeframe was 3 days.  At time 0 in this chart, we believed the migration to React Router 6.x was complete so we ran the first Continuous Integration (CI) pipeline to see if the regression tests passed.  At this point, about 15% of the tests failed:


A glass-half-full perspective would conclude that was a good starting point.  If the failures had all been due to real issues caused by upgrading the dependency, the tests would have done a great job preventing problems from reaching our customers.  But what ensued was mainly a struggle with Cypress to stabilize the tests.  In fact, that large hump of red on the right side of the chart was entirely flakiness.  Do you see where the red cliff drops off completely at the end?  Well that, my friends, was simply a matter of rerunning the test suite!  To be clear, we made no code changes at the end, and we went from over half the tests failing to all the tests passing.  That is a very frustrating experience, but ultimately a happy conclusion (as long as the tests still pass on the next run...).

The second chart below, Bug Fixing vs. Toil, shows how we spent our efforts to achieve the green passing test victory shown in the first chart.  The first and second graphs cover the same timeframe.  The second graph shows how our efforts were spread cumulatively between fixing legitimate end-user-facing bugs and straight-up "toil".

Our friend ChatGPT explains toil this way: 

In the context of Agile software development, "toil" refers to work that is necessary but does not add any value to the customer or the end-user.  Toil is different from productive work, which delivers value to the customer, and is often seen as a form of waste that can be minimized or eliminated through automation, process improvement, or other techniques.

The most significant thing to observe in this graph is the final proportion of Bug Fixing to Toil: for every 2 test-stability issues we fixed (Toil), we fixed only 1 legitimate end-user-facing bug.  That is the basis for my claim at the beginning:

flaky tests waste 2/3 of developer time

For our third and final graph, we'll reveal what all that time was wasted on.  The graph below shows that there were three main types of changes required to stabilize tests:


To be clear, this is an analysis of just the 2/3 of changes that did not pertain to legitimate bugs.  The legitimate bugs won't be described in detail here aside from one example: an end-user warning dialog box that had to be reimplemented because React Router 6.x dropped support for its Prompt component.  It took several tries to come up with a working solution.  Apart from that example, we'll focus now on describing the three types of issues that were a pure waste of time from a developer's perspective.

Toil Type: Add wait

Many times when a test fails, it's just because the test code didn't know how long to wait before clicking or validating something on the screen.  A handy solution is to tell the test code, Cypress in this case, to wait.  However, Cypress documentation tells us this is unnecessary: 
"In Cypress, you almost never need to use cy.wait() for an arbitrary amount of time." (source)  
The reality is that avoiding a wait statement can require considerable effort or unnatural acts.  Cypress often tells me otherwise with the following error message:
You typically need to re-query for the element or add 'guards' which delay Cypress from running new commands. Learn More
For one thing, the Learn More link Cypress gives us is a dead link.  More importantly, sometimes it's just not possible or natural to add a "guard".  Modern web applications are asynchronous; the parts of a page can load in any order.  Cypress recommends the approach of "guards", by which I believe they mean writing extra steps in your Cypress script to ensure that a particular change has taken effect.  For example, wait for a loading indicator to disappear, or wait for a particular element to appear on the screen.  But sometimes this is infeasible.  Consider a table of data that is supposed to update in response to a button click: the structure of the page doesn't change, just the numbers in the table.

A Cypress guard would entail one statement to verify the desired content has loaded and another to validate that content.  But in this case, those are the same.  Furthermore, although displaying a loading indicator would help Cypress know when to attempt to validate the table of data, I believe loading indicators are only for actions that require more than a second to complete.  So it would be unnatural to add a loading indicator just to help Cypress.

So, the most pragmatic solution is just to add a wait statement.
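
To make that concrete, here is a minimal sketch of the dilemma as a Cypress spec written in TypeScript.  The route, selectors, and expected value are hypothetical stand-ins, not taken from our actual code base.

describe('totals table', () => {
  it('updates the totals after clicking Refresh', () => {
    cy.visit('/dashboard');                      // hypothetical route
    cy.get('[data-cy=refresh-button]').click();  // triggers a data refresh

    // A "guard" would be an extra step that proves the refresh has taken
    // effect, such as waiting for a loading indicator to disappear.  Here
    // there is no indicator and the table's structure doesn't change --
    // only the numbers do -- so the only available guard is the validation
    // itself.  The pragmatic fix: wait, then validate.
    cy.wait(500);
    cy.get('[data-cy=totals-table]').should('contain', '42');
  });
});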

Toil Type: Unnatural tests

Ideally, test scripts reflect real, natural sequences of user actions.  But sometimes, test frameworks need the steps to be different from what a real human would do.  In this project, test steps had to be added to clear tooltips from the screen.  Normally, when a user wants to explore the tooltips on a screen, they mouse over one widget, read the tooltip, and then mouse over another widget to read its tooltip.  What we began seeing was that the tooltip of the first widget would not disappear from the screen when the test clicked on a second widget.  Perhaps this was a peculiarity of Cypress' tooling or of Electron, the obscure browser that Cypress runs by default.  Suffice it to say, it's not a natural situation for a real user on a popular browser.

The solution we found for this issue was to add a step to press the escape key.  We added this after the validation of the first tooltip and before the step to click on the second widget.  However, Cypress didn't just let us press an escape key.  We had to click on a completely unrelated text field and direct the escape key press at that field.  

This workaround, made purely for Cypress' sake, is based on an unnatural user flow, and having to come up with it was a complete waste of developer time.
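
For illustration, the unnatural sequence looks roughly like this in a Cypress test; the selectors and tooltip text are hypothetical stand-ins for our real widgets.

it('shows tooltips for both widgets', () => {
  // Hover the first widget and validate its tooltip.
  cy.get('[data-cy=widget-one]').trigger('mouseover');
  cy.get('[data-cy=tooltip]').should('contain', 'First widget help text');

  // Unnatural step: a real user would simply mouse over the next widget,
  // but in our runs the first tooltip stayed on screen.  Cypress would not
  // accept a bare Escape key press, so we aim it at an unrelated field.
  cy.get('[data-cy=search-input]').click().type('{esc}');

  // Now hover the second widget and validate its tooltip.
  cy.get('[data-cy=widget-two]').trigger('mouseover');
  cy.get('[data-cy=tooltip]').should('contain', 'Second widget help text');
});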

Toil Type: Redo validation

Although Cypress offers many programming constructs to validate what's on the screen, sometimes it's just more efficient to find a way to validate outside of Cypress.  In this project, we started seeing flaky tests due to timing issues that could not be solved by adding wait statements.  This was extremely frustrating because our human eyes could see valid data on the screen, but because Cypress may hold a slightly outdated representation of that data (even just a single millisecond older!), it failed the test.

Our solution was to add a feature to download the screen's data to CSV and then have Cypress call out to a third-party library to validate the CSV contents.  Although this was a further waste of developer time, there was a silver lining: our users were delighted that we added this download capability.  Their feedback was that we had "read their minds".
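
A simplified sketch of that workaround is shown below.  The button selector, file name, and expected header row are hypothetical, and where the real test handed the file to a third-party CSV library, this sketch just splits strings for illustration.

it('exports the table data to CSV', () => {
  cy.get('[data-cy=download-csv-button]').click();

  // Cypress saves downloads to cypress/downloads by default.
  cy.readFile('cypress/downloads/table-data.csv').then((csv: string) => {
    // Naive parse for illustration; the real test used a CSV library.
    const rows = csv.trim().split('\n').map((line) => line.split(','));
    expect(rows[0]).to.deep.equal(['Region', 'Total']);
    expect(rows.length).to.be.greaterThan(1);
  });
});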

Conclusion

Although the creativity required for testing sometimes pushes us to introduce valuable new features like the CSV download, we would rather devote our scarce time and effort to developing end-user-facing features.  Quality is just a hygiene factor; customers pay for value-added features, especially in innovative products.  That is where we would rather spend our time, not on tests.

To recap, this case study examined the value of test automation.  Regression testing is potentially where testing delivers the most value: we aren't writing new tests, we're relying on existing tests to catch bugs introduced by low-level changes.  Our low-level change was the upgrade of a complex dependency, React Router.  Fortunately, our tests did catch some legitimate end-user-facing bugs.  We are grateful to the tests for those catches, but for every legitimate test failure, there were 2 illegitimate failures.  Thus 2/3 of our bug-fixing efforts went toward stabilizing tests: adding waits, modifying tests to the point that they reflect unnatural user behavior, and simply ditching Cypress' validation for more reliable third parties.

I believe it will never be easy to test software if we keep trying to emulate browsers, simulate clicks, and push test frameworks to try to see what is plainly obvious to the human eye.  We are trying to put the cart before the horse.  Hardware and software platform providers will always prioritize their end-user experience over making their system easy to test.  So if we can't beat 'em, join 'em.

In the future, I would like a test framework that tests from an end-user perspective.  Literally, I want a physical robot that can click, see, and think like a human.  For more on that vision, please see another rant I wrote recently, There must be a better way to automate integration testing.