From c77d9a8b19ac5b7687cb35788be46b82bbc80e98 Mon Sep 17 00:00:00 2001
From: Li Haoyi
Date: Tue, 31 Dec 2024 22:07:37 +0800
Subject: [PATCH] .

---
 blog/modules/ROOT/pages/4-flaky-tests.adoc | 391 +++++++++++++++++++++
 blog/modules/ROOT/pages/index.adoc         |   4 +
 2 files changed, 395 insertions(+)
 create mode 100644 blog/modules/ROOT/pages/4-flaky-tests.adoc

diff --git a/blog/modules/ROOT/pages/4-flaky-tests.adoc b/blog/modules/ROOT/pages/4-flaky-tests.adoc
new file mode 100644
index 00000000000..6bec91864e4
--- /dev/null
+++ b/blog/modules/ROOT/pages/4-flaky-tests.adoc
@@ -0,0 +1,391 @@
// tag::header[]

# How To Manage Flaky Tests

:author: Li Haoyi
:revdate: 31 Dec 2024
_{author}, {revdate}_

include::mill:ROOT:partial$gtag-config.adoc[]

Many projects suffer from the problem of flaky tests: tests that pass or fail
non-deterministically. These cause confusion, slow development cycles, and endless
arguments between individuals and teams in an organization.

This article dives deep into the problems and solutions of working with flaky tests,
from the perspective of someone who supervised the implementation of the first flaky
test management systems at Dropbox and Databricks. The issue of flaky tests can be
surprisingly unintuitive, with many "obvious" approaches to managing them being ineffective
or counterproductive. But there are both "right" and "wrong" answers to many of
these issues, and we will discuss both so you can better understand what flaky tests
are all about.

// end::header[]

## What Causes Flaky Tests?

A flaky test is a test that sometimes passes and sometimes fails, non-deterministically.

Flaky tests can be caused by a whole range of issues:

### Race conditions within tests

Often this manifests as `sleep`/`time.sleep`/`Thread.sleep` calls in your test
expecting some concurrent code path to complete, which may or may not wait
long enough depending on how much CPU contention there is slowing down your test code.
But any multi-threaded or multi-process code has the potential for race conditions or
concurrency bugs. (A concrete sketch of this pattern appears at the end of this section.)

### Race conditions between tests

Two tests both reading and writing
to the same global variables or files (e.g. in `~/.cache`) can cause non-deterministic
outcomes. These can be tricky, because every test may pass when run alone, and the
test that fails when run in parallel may not be the one that is misbehaving!

### Test ordering dependencies

Even in the absence of parallelism, tests may interfere
with one another by mutating global variables or files on disk. Depending on the exact
tests you run or the order in which you run them, the tests can behave differently
and pass or fail unpredictably.

### Resource contention

Depending on exactly how your tests are run, they may rely
on process memory, disk space, file descriptors, ports, or other limited
physical resources. These are subject to noisy neighbour problems: e.g. an
overloaded Linux system with too many processes using memory will start OOM-killing
processes at random.

### External flakiness

Integration or end-to-end tests often
interact with third-party services, and it is not uncommon for the service
to be flaky, rate limit you, have transient networking errors, or just be entirely
down for some period of time. Even fundamental workflows like "downloading
packages from a package repository" can be subject to flaky failures.

Sometimes the flakiness is test-specific; sometimes it is actually flakiness in the
code being tested, which may manifest as real flakiness when customers are trying
to use your software.
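
To make the first of these causes concrete, here is a minimal sketch in Scala (the object,
method names, and timing values are illustrative, not taken from any particular codebase)
showing why a fixed `Thread.sleep` makes a check non-deterministic, and how polling for the
actual condition with a deadline avoids that:

[source,scala]
----
import java.util.concurrent.atomic.AtomicBoolean

object SleepRaceExample {
  // Kick off some "background work" that finishes after an unpredictable delay,
  // standing in for an async code path the test wants to observe.
  def startBackgroundWork(done: AtomicBoolean): Unit = {
    val t = new Thread(() => {
      Thread.sleep(50 + scala.util.Random.nextInt(200))
      done.set(true)
    })
    t.setDaemon(true)
    t.start()
  }

  // Flaky: assumes 100ms is always enough, which fails under CPU contention
  // or an unlucky scheduling of the background thread.
  def flakyCheck(): Boolean = {
    val done = new AtomicBoolean(false)
    startBackgroundWork(done)
    Thread.sleep(100)
    done.get()
  }

  // More robust: poll for the condition itself, with a generous overall deadline.
  def robustCheck(): Boolean = {
    val done = new AtomicBoolean(false)
    startBackgroundWork(done)
    val deadline = System.currentTimeMillis() + 5000
    while (!done.get() && System.currentTimeMillis() < deadline) Thread.sleep(10)
    done.get()
  }

  def main(args: Array[String]): Unit = {
    println(s"flakyCheck passed:  ${flakyCheck()}")  // sometimes false
    println(s"robustCheck passed: ${robustCheck()}") // reliably true
  }
}
----

The polling version is still time-bounded, but it only fails when the deadline is genuinely
exceeded, rather than whenever the scheduler happens to be slow.
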
## Why Are Flaky Tests Problematic?

### Development Slowdown

Flaky tests generally make it impossible to know the state of your test suite,
which in turn makes it impossible to know whether the software you are working on
is broken or not. In general, even a small number of flaky tests is enough to
destroy the core value of your test suite:

* Ideally, if the tests pass on a proposed code change, even someone
  unfamiliar with the codebase can be confident that the code change did not break
  anything.

* Once test failures start happening spuriously, it quickly becomes
  impossible to get a fully "green" test run without failures, so in order to make
  progress (merging a pull-request, deploying a service, etc.) the developer then
  needs to individually triage and make judgements on those test failures to determine
  whether they are real issues or spurious.

This means that you are back to the "manual" workflow of developers squinting at
test failures to decide if they are relevant or not. This risks both false positives
and false negatives:

* A developer may waste time triaging a failure that turns out to be a flake,
  unrelated to the change being tested.

* A developer may wrongly judge that a test failure is spurious, only for it to be a
  real failure that causes breakage for customers once released.

Even relatively low rates of flakiness can cause issues. For example, consider the following:

* A codebase with 10,000 tests
* 1% of the tests are flaky
* Each flaky test case fails spuriously 1% of the time

Although 1% of tests each failing 1% of the time may not seem like a huge deal, it means
that someone running the entire test suite only has a `0.99^100 ≈ 37%` chance of getting
a green test report! The other 63% of the time, someone running the test suite without any
real breakages gets one or more spurious failures that they then have to spend time and energy
triaging and investigating.
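
The arithmetic above is easy to reproduce if you want to plug in your own suite's numbers;
here is a tiny sketch using the same 10,000 / 1% / 1% figures from this example:

[source,scala]
----
object FlakyOdds {
  def main(args: Array[String]): Unit = {
    val totalTests    = 10000
    val flakyFraction = 0.01 // 1% of tests are flaky
    val flakeRate     = 0.01 // each flaky test fails spuriously 1% of the time

    val flakyTests = (totalTests * flakyFraction).toInt
    // Probability that every flaky test happens to pass on a given run
    val greenRunProbability = math.pow(1 - flakeRate, flakyTests)

    println(f"Flaky tests: $flakyTests, chance of a green run: ${greenRunProbability * 100}%.1f%%")
    // Prints roughly: Flaky tests: 100, chance of a green run: 36.6%
  }
}
----
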
### Inter-team Conflict

One fundamental issue with flaky tests is organizational:

* The team that owns a test benefits from the test running and providing coverage
  for the code they care about

* Other teams that run the test in CI suffer from the spurious failures and wasted time
  that hitting flaky tests entails

This is a fundamental mis-alignment between the different teams in an organization, and
it leads to no end of discussion:

* For the team that owns a flaky test, it is in their best interest to keep running it in CI,
  because it provides coverage on their code. And fixing the flakiness usually isn't a top priority.

* For teams that do not own the flaky test but run it, it is in their best interest to disable
  it, because they don't care about the code being covered, or at least don't care enough
  to wait for the flakiness to be fixed.

There really is no way to square this mis-alignment as long as flaky tests are allowed to exist,
as the two teams fundamentally want different things. This results in all sorts of conflict and
endless debates or arguments between teams, wasting enormous amounts of time and energy.

## Mitigating Flaky Tests

In general it is impossible to completely avoid flaky tests, but you can take steps to
mitigate them:

1. Avoid race conditions in your application code to prevent random crashes
   affecting users, and avoid race conditions in your test code

2. Run parallel test processes inside "sandbox" empty temp folders, to try and avoid
   them reading and writing to the same files on the filesystem and risking race conditions.
   (See https://bazel.build/docs/sandboxing[Bazel Sandboxing]; a minimal sketch of this idea
   follows this list.)

3. Run test processes inside CGroups to mitigate resource contention: e.g. if every test process is limited
   in how much memory it uses, it cannot cause memory pressure that might cause other tests
   to be OOM-killed (see Bazel's https://github.com/bazelbuild/bazel/pull/21322[Extended CGroup Support])

4. Mock out external services: e.g. AWS can be mocked using https://www.localstack.cloud/[LocalStack],
   Kubernetes can be simulated locally using https://kind.sigs.k8s.io/[KIND], etc.
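
As referenced in point 2 above, here is a minimal sketch of the sandbox idea (the `inSandbox`
helper is illustrative and not part of any particular test framework): each test body runs in a
freshly created, empty temporary directory, so tests that write relative paths cannot trample
each other's files.

[source,scala]
----
import java.nio.file.{Files, Path}
import java.util.Comparator

object SandboxSketch {
  // Run `body` inside a fresh, empty temporary "sandbox" directory and clean it
  // up afterwards, so parallel tests never share scratch files.
  def inSandbox[T](testName: String)(body: Path => T): T = {
    val dir = Files.createTempDirectory(s"sandbox-$testName-")
    try body(dir)
    finally {
      // Best-effort recursive cleanup: delete children before parents
      Files.walk(dir)
        .sorted(Comparator.reverseOrder[Path]())
        .forEach { p => Files.deleteIfExists(p); () }
    }
  }

  def main(args: Array[String]): Unit = {
    val result = inSandbox("writes-scratch-file") { dir =>
      val scratch = dir.resolve("scratch.txt")
      Files.write(scratch, "hello".getBytes("UTF-8"))
      new String(Files.readAllBytes(scratch), "UTF-8")
    }
    assert(result == "hello")
    println(s"sandboxed test result: $result")
  }
}
----

Build-system-level sandboxing such as Bazel's goes much further, but even this lightweight
version removes a whole class of "two tests fighting over the same file" failures.
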
However, although you can mitigate the flakiness, you should not expect to make it go away
entirely:

* Race conditions _will_ find their way into your code despite your best efforts.

* There will always be _some_ limited physical resource you didn't realize could run out,
  until it does.

* Mocking out third-party services never ends up working 100%: inevitably
  you hit cases where the mock isn't accurate enough, or trustworthy enough, and you still
  need to test against the real service to get confidence in the correctness of your system.

End-to-end tests and integration tests are especially prone to flakiness, as are UI
tests exercising web or mobile app UIs.

As a developer, you should work hard to make your application and test
code as deterministic as possible. But you should also accept that despite your best efforts,
flaky tests will appear, and so you will need some plan or strategy to deal with them when they do.

## How Not To Manage Flaky Tests

Flaky test management can be surprisingly counter-intuitive. Below we discuss some common
mistakes people make when they first start dealing with flaky tests.

### Do Not Block Code Changes on Flaky Tests

The most important thing to take note of is that you should not block
code changes on flaky tests: merging pull-requests, deploying services, etc.

That is despite blocking code changes being the default and most obvious behavior: e.g.
if you wait for a fully-green test run before merging a code change, and a flaky test
makes the test run red, then it blocks the merge. However, this is not a good workflow,
for a variety of reasons:

1. A flaky failure when testing a code change does not indicate that the code change caused
   the failure. So blocking the merge on the flaky failure just prevents progress
   without actually helping increase system quality.

2. The flaky test may be in a part of the system totally unrelated to the code change
   being tested, which means the individual working on the code change has zero context
   on why it might be flaky.

3. Blocking progress on a flaky test introduces an incentives problem: the code/test owner
   benefits from the flaky test's existence, but other people working in that codebase
   get blocked with no benefit. This directly leads to the endless xref:_inter_team_conflict[]
   mentioned earlier.

Although _"all tests should pass before merging"_ is a common requirement, it is ultimately
unhelpful when you are dealing with flaky tests.

### Preventing Flaky Tests From Being Introduced Is Hard

Consider the example we used earlier: 10,000 tests, with 1% of them flaky, each failing 1% of
the time.

* If someone adds a new test case, in order to have 95% confidence that it is not flaky,
  you would need to run it about 300 times (`log(0.05) / log(0.99)`).

* Even if we do run every new test 300 times, 1 in 20 flaky tests will still slip through,
  and over time they will still build up into a population of flaky tests actively causing
  flakiness in your test suite.

* Furthermore, many tests are not flaky in isolation! Running the same test 300 times
  alone may not demonstrate any flakiness, since e.g. the test may only be flaky when
  run in parallel with another test (due to xref:_race_conditions_between_tests[] or
  xref:_resource_contention[]), or when run in a specific order after other tests
  (due to xref:_test_ordering_dependencies[]).

* Lastly, it is not only new tests that are flaky! When I was working on this area at Dropbox
  and Databricks, the majority of flaky tests we detected were existing tests that
  had been stable for days/weeks/months before turning flaky (presumably due to a code change
  in the application code or test code). Blocking new tests that are flaky does nothing
  to prevent the code changes that cause old tests to become flaky!

* To block code changes that cause either new or old tests to become flaky, we would need
  to run every single test about 300 times on each pull request, to give us 95% confidence that
  each 1%-flaky test introduced by the code change would get caught. This is prohibitively
  slow and expensive, turning a test suite that may take 5 minutes to run and cost $1 into one
  that takes 25 hours and costs $300.

In general, it is very hard to block flaky tests "up front". You have to accept that
over time some parts of your test suite will become flaky, and then make plans for how
to respond to and manage those flaky tests when they inevitably appear.

## Managing Flaky Tests

Once flaky tests start appearing in your test suite, you need to do something about them.
This generally involves (a) noticing that flaky tests exist, (b) identifying which tests
are flaky, and (c) mitigating those specific problematic tests to prevent them from
causing pain to your developers.

### Monitor Flaky Tests Asynchronously

As mentioned earlier,
xref:_preventing_flaky_tests_from_being_introduced_is_hard[preventing flaky tests from being introduced is hard].
Thus, you must assume that flaky tests _will_ make their way into your test suite,
and monitor the flakiness when it occurs. This can be done in a variety of ways, for example:

1. Most CI systems allow manual retries, and developers usually retry tests they suspect are
   flaky. If a test fails once then passes when retried on the same version of the code, it
   was a flaky failure.

2. Some CI systems or test frameworks have automatic retries. If a test fails initially and then
   passes on the retry, it is flaky: the fact that it's non-deterministic means that
   next time, it might fail initially and then fail on the retry!

3. Most CI systems run tests to validate code changes before merging, and then run tests
   again to validate the code post-merge. Post-merge runs should "always" be green, but sometimes
   suffer breakages or flakiness. If a test passes, fails, then passes on three consecutive
   commit test runs post-merge, it is likely to be flaky: breakages
   tend to cause a string of consecutive test failures before being fixed or reverted, and
   very rarely get noticed and dealt with immediately. (A sketch of this heuristic follows
   this list.)
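
Here is a minimal sketch of that third heuristic, assuming you already record per-commit,
per-test results somewhere (the `TestRun` shape and the simple pass/fail/pass window are
illustrative, not a prescribed schema):

[source,scala]
----
// One row of post-merge test history: which test ran at which commit
// (in chronological order) and whether it passed.
final case class TestRun(testName: String, commitIndex: Int, passed: Boolean)

object FlakeDetector {
  // A test is suspected flaky if, on three consecutive post-merge commits, it
  // went pass -> fail -> pass. Real breakages tend to fail repeatedly until
  // fixed or reverted, so an isolated failure is likely a flake.
  def suspectedFlaky(history: Seq[TestRun]): Set[String] =
    history.groupBy(_.testName).collect {
      case (name, runs) if hasPassFailPass(runs) => name
    }.toSet

  private def hasPassFailPass(runs: Seq[TestRun]): Boolean = {
    val outcomes = runs.sortBy(_.commitIndex).map(_.passed)
    outcomes.sliding(3).exists(_ == Seq(true, false, true))
  }

  def main(args: Array[String]): Unit = {
    val history = Seq(
      TestRun("testA", 1, passed = true),  TestRun("testA", 2, passed = false),
      TestRun("testA", 3, passed = true),  // pass -> fail -> pass: suspicious
      TestRun("testB", 1, passed = true),  TestRun("testB", 2, passed = false),
      TestRun("testB", 3, passed = false)  // consecutive failures: likely a real breakage
    )
    println(suspectedFlaky(history)) // Set(testA)
  }
}
----
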
Notably, most test failures that occur when validating code changes (e.g. on pull requests)
are not useful here, as tests are _meant_ to break when validating code changes in order to
catch problems! Hence the need for the slightly roundabout methods above to determine which
tests are flaky.

Once you have noticed a test is flaky, there are two main options: retries and quarantine.

### Retrying Flaky Tests

Retries are always controversial. A common criticism is that they can mask real flakiness
in the system that can cause real problems to customers, which is true. However,
we already discussed why we xref:_do_not_block_code_changes_on_flaky_tests[should
not block code changes on flaky tests], since doing so just causes pain while
not being an effective way of getting the flakiness fixed.
Thus, we should feel free to add retries around flaky tests, to try and make them pass.

Retrying flaky tests can be surprisingly effective. As mentioned earlier, even
infrequently flaky tests can cause issues, with a small subset of tests flaking
1% of the time being enough to block all progress. However, one retry
turns a 1% flaky test into a 0.01% flaky test, and two retries turn it into a 0.0001% flaky test.
So even one or two retries is enough to make most flaky tests stable enough that the
flakiness does not cause issues.

Retrying flaky tests has two weaknesses:

* Retries can be expensive for real failures. If you retry a test twice, that
  means that an actually-failed test would run three times before giving up.
  If you retry every test by default, and a code change breaks a large number of
  them, running all those failing tests three times can be a significant performance
  and latency penalty.

** To mitigate this, you should generally avoid "blanket" retries, and only add
   retries around specific tests that you have detected as being flaky.

* Retries may not work if they are not coarse-grained enough. For example, if `test_a` fails
  due to interference with `test_b` running concurrently, retrying `test_a`
  immediately while `test_b` is still running will fail again. Or if the flakiness is
  due to some bad state on the filesystem, the test may continue flaking until
  it is run on a completely new machine with a clean filesystem.

** This failure mode can be mitigated by retrying the failed tests only after the
   entire test suite has completed, possibly on a clean test machine.
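
As a concrete illustration, here is a minimal sketch of the kind of `retry{}` helper mentioned
in the implementation section later in this article (the name, signature, and default of three
attempts are all illustrative):

[source,scala]
----
import scala.util.control.NonFatal

object RetryHelper {
  // Re-run `body` up to `maxAttempts` times, returning the first successful
  // result and rethrowing the last exception if every attempt fails.
  def retry[T](maxAttempts: Int = 3)(body: => T): T = {
    require(maxAttempts >= 1, "maxAttempts must be at least 1")
    var lastError: Throwable = null
    var attempt = 0
    while (attempt < maxAttempts) {
      try return body
      catch {
        case NonFatal(e) =>
          lastError = e
          attempt += 1
      }
    }
    throw lastError
  }

  def main(args: Array[String]): Unit = {
    var calls = 0
    // A stand-in for a flaky assertion: fails the first two times, then passes.
    val result = retry(maxAttempts = 3) {
      calls += 1
      if (calls < 3) sys.error(s"flaky failure on attempt $calls")
      "ok"
    }
    println(s"succeeded with '$result' after $calls attempts")
  }
}
----

In practice you would also want to record how often each retry fires, since a retry that
silently fires on every run is a flaky test that nobody is looking at.
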
### Auto-Quarantining Flaky Tests

Quarantine involves detecting that a test is flaky, and simply not counting it when deciding
whether or not to merge a code change.

This is much more aggressive than retrying flaky tests, as even real breakages will get
ignored for quarantined tests. You effectively lose the test coverage given by a particular
test for the period while it is quarantined. Only when someone eventually fixes the flaky test
can it be removed from quarantine and begin running and blocking code changes again.

Quarantining is best automated, both to remove the busy-work of finding and quarantining
flaky tests, and to avoid the inevitable back-and-forth between the people
quarantining the tests and the people whose tests are getting quarantined.


#### Why Quarantine?

The obvious value of quarantining flaky tests is that it unblocks the merging of code changes
by ignoring flaky tests that are probably not relevant. Quarantine basically automates what
people do manually in the presence of flaky tests anyway:

* When enough tests are flaky, eventually developers are going to start merging/deploying code
  changes despite the failures being present, because getting a "fully green" test run is
  impossible

* When that happens, the developer is not going to be able to tell whether a failure
  is flaky or real, so if a code change causes a real breakage in a flaky test the
  developer is likely going to merge/deploy it anyway without noticing!

So although naively it seems like quarantining flaky tests costs you test coverage, in
reality it costs you nothing and simply automates the loss of coverage that you were going
to suffer anyway. It saves a lot of manual effort by no longer having your developers
decide which test failures to ignore based on which tests they remember to be flaky, since
now the quarantine system remembers the flaky tests and ignores them on your behalf.

#### Why Quarantine? Part 2

The non-obvious value of quarantining flaky tests is that it aligns incentives across a
development team or organization:

* Normally, a flaky test means the test owner continues to benefit from the test coverage while
  other teams suffer from the flakiness

* With auto-quarantine, a flaky test means the test owner both benefits from the coverage of
  their healthy tests and suffers the lack of coverage caused by their flaky test being quarantined.

This aligning of incentives means that with auto-quarantine enabled, the normal
endless discussions and disputes about flaky tests tend to disappear. The test owner
can decide for themselves how urgently they need to fix a quarantined flaky test, depending
on how crucial that test coverage is, or even whether they should fix it at all! Other teams
are not affected by the quarantined flaky test, and do not care what the test owner ends
up deciding.

Most commonly, quarantining is automatic, while un-quarantining a test can be automatic or manual.
Due to the non-deterministic nature of flakiness, it is often hard to determine whether a flaky
test has been truly fixed or not, but it turns out it doesn't matter. If you try to fix a test,
take it out of quarantine, and it turns out to be still flaky, the auto-quarantine system will
just put it back into quarantine for you to take another look at.

## Implementing Flaky Test Management Systems

So far, all the discussion in this article has been at a high level. Exactly how to implement
it is left as an exercise to the reader, but it is usually a mix of:

* `retry{}` helpers in the languages you use, which you can sprinkle through your test code where necessary
* A SQL database storing historical test results and retries
* A SQL database or a text file committed in-repo to track quarantined tests
  (a minimal sketch of using such a file follows this list)
* A service that looks at historical test results and retries and decides when/if to quarantine a test
* Tweaks to your existing CI system to be able to work with all of the above: ignoring quarantined
  tests, tracking retry counts, tracking test results, etc.
* Some kind of web interface giving you visibility into all the components and workflows above,
  so when things inevitably go wrong you are able to figure out what's misbehaving
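
As referenced above, here is a minimal sketch of the "text file committed in-repo" approach
(the file name `quarantined-tests.txt` and the `TestResult` shape are illustrative): quarantined
tests still run and get reported, but their failures are ignored when deciding whether a CI run
is green.

[source,scala]
----
import java.nio.file.{Files, Paths}
import scala.jdk.CollectionConverters._

final case class TestResult(testName: String, passed: Boolean)

object QuarantineSketch {
  // Load the set of quarantined test names from a file checked into the repo,
  // one test name per line. A missing file means nothing is quarantined.
  def loadQuarantined(path: String): Set[String] = {
    val p = Paths.get(path)
    if (!Files.exists(p)) Set.empty
    else Files.readAllLines(p).asScala.map(_.trim).filter(_.nonEmpty).toSet
  }

  // A CI run is green if every non-quarantined test passed. Quarantined tests
  // still run so their results can be recorded, but they cannot block a merge.
  def isGreen(results: Seq[TestResult], quarantined: Set[String]): Boolean =
    results.forall(r => r.passed || quarantined.contains(r.testName))

  def main(args: Array[String]): Unit = {
    val quarantined = Set("testFlakyUpload") // in practice: loadQuarantined("quarantined-tests.txt")
    val results = Seq(
      TestResult("testCoreLogic", passed = true),
      TestResult("testFlakyUpload", passed = false) // ignored: quarantined
    )
    println(s"CI run green: ${isGreen(results, quarantined)}") // true
  }
}
----
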
Usually flaky test management starts off as an entirely manual process, which works fine for
small projects. But as the size of the project grows, you inevitably need to augment the manual
work with some basic automation, and over time build out a fully automated system to do what you want.

None of the techniques discussed in this article are rocket science, and the challenge is mostly
just plumbing the necessary data back and forth between different parts of your CI system. But
hopefully this high-level discussion of how to manage flaky tests gives you a head start,
and saves you weeks, months, or even years of problematic designs before you arrive at the same
lessons that I have learned working on flaky tests over the past decade.

diff --git a/blog/modules/ROOT/pages/index.adoc b/blog/modules/ROOT/pages/index.adoc
index 4081e94046c..7577012eafc 100644
--- a/blog/modules/ROOT/pages/index.adoc
+++ b/blog/modules/ROOT/pages/index.adoc
@@ -7,6 +7,10 @@ Welcome to the Mill development blog! This page contains posts and articles on
 technical topics related to the development of the Mill build tool. These discuss
 topics related to JVM languages and monorepo build tooling.
 
+include::4-flaky-tests.adoc[tag=header,leveloffset=1]
+
+xref:4-flaky-tests.adoc[Read More...]
+
 include::3-selective-testing.adoc[tag=header,leveloffset=1]
 
 xref:3-selective-testing.adoc[Read More...]