
How to Reduce Flaky Tests by 90%: A Practical Playbook

1/15/2025

Flaky tests are the silent killer of engineering velocity. When your CI goes red and developers shrug because "it's probably just a flake," you've lost the feedback loop that makes automation valuable. I've worked on suites with flake rates above 25% and brought them under 1%. Here's how.

Step 1: Measure before you fix

You can't reduce what you don't measure. Before touching any test code, set up flake tracking. Most CI systems can report test stability over time. Tools like Allure, Currents, or even a simple script that parses JUnit XML can give you a flake rate per test.

Key metric: flake rate = (number of flaky runs / total runs) per test, over a rolling 7-day window, counting a run as flaky when it fails and then passes on retry with no code change. Any test with a flake rate above 5% gets flagged.
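As a rough illustration of that metric, here is a minimal sketch that computes a per-test flake rate over a rolling 7-day window and flags anything above 5%. The TestRun shape and the "flaky" outcome label are assumptions made for the example; in practice you would fill these records from whatever your CI or JUnit XML reports expose.

```typescript
// Hypothetical record shape: one entry per test execution.
interface TestRun {
  testName: string;
  finishedAt: Date;
  // Assumption: "flaky" means the run failed, then passed on retry of the same commit.
  outcome: "passed" | "failed" | "flaky";
}

const WINDOW_DAYS = 7;
const FLAG_THRESHOLD = 0.05; // 5%

// Returns flake rate (flaky runs / total runs) per test for runs inside the window.
function flakeRates(runs: TestRun[], now: Date): Record<string, number> {
  const cutoff = now.getTime() - WINDOW_DAYS * 24 * 60 * 60 * 1000;
  const counts: Record<string, { flaky: number; total: number }> = {};

  for (const run of runs) {
    if (run.finishedAt.getTime() < cutoff) continue; // older than the window
    const entry = counts[run.testName] ?? (counts[run.testName] = { flaky: 0, total: 0 });
    entry.total += 1;
    if (run.outcome === "flaky") entry.flaky += 1;
  }

  const rates: Record<string, number> = {};
  for (const name of Object.keys(counts)) {
    rates[name] = counts[name].flaky / counts[name].total;
  }
  return rates;
}

// Names of tests whose flake rate exceeds the threshold, sorted alphabetically.
function flaggedTests(rates: Record<string, number>): string[] {
  return Object.keys(rates)
    .filter((name) => rates[name] > FLAG_THRESHOLD)
    .sort();
}
```

Running this on a schedule and posting the flagged list somewhere visible is usually enough to get started; a dashboard can come later.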

Step 2: Categorize the root causes

In my experience, 90% of flakes fall into five categories:

  • Timing issues: Tests that don't wait for the right condition. The #1 cause.
  • Shared state: Tests that depend on data from other tests or a shared database.
  • Network dependencies: Tests that hit real APIs or services that are occasionally slow.
  • Animation/transition races: Clicking elements that are still animating.
  • Resource contention: Parallel tests competing for the same resources.

Step 3: Fix timing issues with proper waits

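Never use arbitrary fixed-duration waits such as sleep(), page.waitForTimeout(), or cy.wait(500). Instead, wait for specific conditions. In Playwright, use await expect(locator).toBeVisible() or await page.waitForResponse(). In Cypress, register the request with cy.intercept(), give it an alias with .as(), and wait on that alias with cy.wait('@alias') so the test continues only once the request has completed.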
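To make the difference from a fixed sleep concrete, here is a minimal sketch of condition-based waiting: poll until a condition holds, and fail loudly on timeout. The waitFor helper and its option names are hypothetical and exist only for illustration. Playwright's expect assertions and Cypress commands already retry like this internally, so in those frameworks you would reach for the built-ins rather than a helper of your own.

```typescript
interface WaitOptions {
  timeoutMs?: number;  // give up after this long
  intervalMs?: number; // pause between checks
}

// Polls `condition` until it returns a truthy value, then resolves with that value.
// Rejects if the deadline passes first, so a stuck test fails with a clear message.
async function waitFor<T>(
  condition: () => T | Promise<T>,
  options: WaitOptions = {}
): Promise<T> {
  const timeoutMs = options.timeoutMs ?? 5000;
  const intervalMs = options.intervalMs ?? 50;
  const deadline = Date.now() + timeoutMs;

  while (true) {
    const value = await condition();
    if (value) return value;
    if (Date.now() >= deadline) {
      throw new Error(`Condition not met within ${timeoutMs} ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

The test returns as soon as the condition is true instead of always paying the full sleep, and a slow environment gets up to the full timeout instead of failing at an arbitrary cutoff.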

Step 4: Isolate test data

Every test should create its own data and clean up after itself. Use factories or fixtures to generate unique test data. Never rely on seed data that other tests might modify.
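A factory for this can be very small. The sketch below assumes a hypothetical User shape and a buildUser helper (neither comes from a specific library); the point is only that every call yields identifiers no other test shares, with per-test overrides for the fields a test actually cares about.

```typescript
// Hypothetical entity shape for illustration.
interface User {
  id: string;
  email: string;
  name: string;
  role: "admin" | "member";
}

let sequence = 0;

// Builds a user with unique id and email. The counter guarantees uniqueness
// within one process; the timestamp and random suffix make collisions across
// parallel workers unlikely.
function buildUser(overrides: Partial<User> = {}): User {
  sequence += 1;
  const unique = `${Date.now().toString(36)}-${sequence}-${Math.random()
    .toString(36)
    .slice(2, 8)}`;
  return {
    id: `user-${unique}`,
    email: `user-${unique}@example.test`,
    name: "Test User",
    role: "member",
    ...overrides,
  };
}

// Usage: each test states only what matters to it.
const admin = buildUser({ role: "admin" });
const member = buildUser();
```

Pair the factory with cleanup in an afterEach hook (or a per-test transaction that rolls back) so created records do not accumulate.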

Step 5: Mock external dependencies

If a test doesn't need to verify the integration with an external service, mock it. Use Playwright's page.route() or Cypress's cy.intercept() to return consistent responses.
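Here is a sketch of what stubbing with page.route() might look like. The endpoint pattern, the payload, and the stubExchangeRates helper are all made up for the example. The small PageLike and RouteLike interfaces cover only the parts of Playwright's API used here, so the snippet compiles on its own; I'm assuming a real Playwright page satisfies them structurally.

```typescript
// Minimal structural types mirroring the subset of Playwright's Page/Route used below.
interface FulfillOptions {
  status?: number;
  contentType?: string;
  body?: string;
}
interface RouteLike {
  fulfill(options: FulfillOptions): Promise<void>;
}
interface PageLike {
  route(url: string, handler: (route: RouteLike) => Promise<void>): Promise<void>;
}

// Hypothetical endpoint and response payload.
const RATES_URL = "**/api/exchange-rates";
const STUBBED_RATES = { base: "USD", rates: { EUR: 0.9, GBP: 0.8 } };

// Registers a route handler that answers matching requests with a fixed JSON body,
// so the test never depends on the real service being fast or available.
async function stubExchangeRates(page: PageLike): Promise<void> {
  await page.route(RATES_URL, async (route) => {
    await route.fulfill({
      status: 200,
      contentType: "application/json",
      body: JSON.stringify(STUBBED_RATES),
    });
  });
}
```

In a Playwright test you would call await stubExchangeRates(page) before page.goto(), so the handler is in place before the first request fires. Keep a small number of unmocked contract or smoke tests for the integrations that do need verifying.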

Step 6: Quarantine and fix systematically

Don't skip flaky tests. Quarantine them. Move them to a separate test tag or suite that runs but doesn't block CI. Then fix them one by one, starting with the most frequently failing ones.
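The gating logic can be sketched in a few lines. Everything here is illustrative: the TestResult shape, the "@quarantine" tag name, and both helpers are assumptions, and most runners let you get the same effect with tag filters instead of custom code. The sketch only shows the rule itself: quarantined failures are reported but do not fail the build, and fixes are ordered by how often each test fails.

```typescript
// Hypothetical result shape and tag name.
interface TestResult {
  name: string;
  passed: boolean;
  tags: string[];
}
const QUARANTINE_TAG = "@quarantine";

interface GateReport {
  blockingFailures: string[];
  quarantinedFailures: string[];
  shouldFailBuild: boolean;
}

// Splits failures into blocking and quarantined; only blocking ones fail CI.
function evaluateGate(results: TestResult[]): GateReport {
  const isQuarantined = (r: TestResult) => r.tags.indexOf(QUARANTINE_TAG) !== -1;
  const failures = results.filter((r) => !r.passed);
  const blockingFailures = failures.filter((r) => !isQuarantined(r)).map((r) => r.name);
  const quarantinedFailures = failures.filter(isQuarantined).map((r) => r.name);
  return {
    blockingFailures,
    quarantinedFailures,
    shouldFailBuild: blockingFailures.length > 0,
  };
}

// Orders quarantined tests by failure count, most frequent first.
function fixOrder(failureCounts: Record<string, number>): string[] {
  return Object.keys(failureCounts).sort(
    (a, b) => failureCounts[b] - failureCounts[a]
  );
}
```

Whatever mechanism you use, keep the quarantined failures visible in the CI summary; a quarantine nobody looks at is just a skip with extra steps.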

Results

Following this playbook, I've consistently reduced flake rates from 15-25% to under 1% within 2-4 weeks. The key is measurement, categorization, and systematic fixes, not heroic debugging sessions.

Want help reducing flake in your test suite? Let's talk. I've done this for SaaS, e-commerce, and enterprise teams.
