Debugging and Managing Flaky Tests

.agents/skills/playwright-best-practices/debugging/flaky-tests.md

Table of Contents

  1. Understanding Flakiness Types
  2. Detection and Reproduction
  3. Root Cause Analysis
  4. Fixing Strategies by Type
  5. CI-Specific Flakiness
  6. Quarantine and Management
  7. Prevention Strategies

Understanding Flakiness Types

Categories of Flakiness

Most flaky tests fall into distinct categories requiring different remediation:

| Category | Symptoms | Common Causes |
| --- | --- | --- |
| UI-driven | Element not found, click missed | Missing waits, animations, dynamic rendering |
| Environment-driven | CI-only failures | Slower CPU, memory limits, cold browser starts |
| Data/parallelism-driven | Fails with multiple workers | Shared backend data, reused accounts, state collisions |
| Test-suite-driven | Fails when run with other tests | Leaked state, shared fixtures, order dependencies |

Flakiness Decision Tree

Test fails intermittently
├─ Fails locally too?
│  ├─ YES → Timing/async issue → Check waits and assertions
│  └─ NO → CI-specific → Check environment differences
│
├─ Fails only with multiple workers?
│  └─ YES → Parallelism issue → Check data isolation
│
├─ Fails only when run after specific tests?
│  └─ YES → State leak → Check fixtures and cleanup
│
└─ Fails randomly regardless of conditions?
   └─ External dependency → Check network/API stability

Detection and Reproduction

Confirming Flakiness

bash
# Run test multiple times to confirm instability
npx playwright test tests/checkout.spec.ts --repeat-each=20

# Run with single worker to isolate parallelism issues
npx playwright test --workers=1

# Run in CI-like conditions locally
CI=true npx playwright test --repeat-each=10
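
How many repetitions are enough depends on how often the test fails. As a rough guide, and assuming each run fails independently with the same probability (an idealization, since real flakes are often correlated), the chance of seeing at least one failure in n runs is 1 - (1 - p)^n. The sketch below turns that into two helpers. The names `chanceOfCatchingFlake` and `repeatsNeeded` are illustrative and not part of Playwright.

```typescript
// Probability of observing at least one failure in `runs` independent runs,
// given a per-run failure probability `failureRate` (between 0 and 1).
function chanceOfCatchingFlake(failureRate: number, runs: number): number {
  return 1 - Math.pow(1 - failureRate, runs)
}

// Smallest number of runs needed to observe at least one failure with the
// requested confidence. Both arguments must be strictly between 0 and 1.
function repeatsNeeded(failureRate: number, confidence: number): number {
  if (failureRate <= 0 || failureRate >= 1) {
    throw new RangeError('failureRate must be between 0 and 1 (exclusive)')
  }
  if (confidence <= 0 || confidence >= 1) {
    throw new RangeError('confidence must be between 0 and 1 (exclusive)')
  }
  return Math.ceil(Math.log(1 - confidence) / Math.log(1 - failureRate))
}

console.log(chanceOfCatchingFlake(0.05, 20).toFixed(2)) // "0.64"
console.log(repeatsNeeded(0.05, 0.95)) // 59
```

Under these assumptions, a test that fails 5% of the time slips through `--repeat-each=20` about one time in three, so a clean run of 20 is weak evidence that a test is stable.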

Reproduction Strategies

typescript
// playwright.config.ts - Enable artifacts for flaky test investigation
import {defineConfig} from '@playwright/test'

export default defineConfig({
  retries: process.env.CI ? 2 : 0,
  use: {
    trace: 'on-first-retry', // Capture trace on retry
    video: 'retain-on-failure',
    screenshot: 'only-on-failure',
  },
})

Identify Flaky Tests Programmatically

typescript
// Track test results across runs
test.afterEach(async ({}, testInfo) => {
  if (testInfo.retry > 0 && testInfo.status === 'passed') {
    console.warn(`FLAKY: ${testInfo.title} passed on retry ${testInfo.retry}`)
    // Log to your tracking system
  }
})
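
One way to act on the "log to your tracking system" step is to store one record per test attempt and compute a flake rate per test over time. The sketch below assumes a hypothetical `AttemptRecord` shape filled from `testInfo.title`, `testInfo.retry`, and `testInfo.status`. How the records are stored is left open.

```typescript
// Hypothetical record: one entry per test attempt, e.g. captured from
// testInfo in an afterEach hook and appended to your own run history.
interface AttemptRecord {
  title: string
  retry: number // testInfo.retry: 0 for the first attempt of a run
  status: 'passed' | 'failed' | 'timedOut' | 'skipped' | 'interrupted'
}

// For each test title, the fraction of runs that passed only after a retry.
function flakeRates(records: AttemptRecord[]): Record<string, number> {
  const runs: Record<string, number> = {}
  const flaky: Record<string, number> = {}
  for (const r of records) {
    // Every run starts with exactly one attempt where retry is 0
    if (r.retry === 0) runs[r.title] = (runs[r.title] ?? 0) + 1
    if (r.retry > 0 && r.status === 'passed') flaky[r.title] = (flaky[r.title] ?? 0) + 1
  }
  const rates: Record<string, number> = {}
  for (const title of Object.keys(runs)) {
    const total = runs[title] ?? 0
    if (total > 0) rates[title] = (flaky[title] ?? 0) / total
  }
  return rates
}

const history: AttemptRecord[] = [
  {title: 'checkout', retry: 0, status: 'failed'},
  {title: 'checkout', retry: 1, status: 'passed'},
  {title: 'checkout', retry: 0, status: 'passed'},
  {title: 'login', retry: 0, status: 'passed'},
]
console.log(flakeRates(history)) // { checkout: 0.5, login: 0 }
```

Sorting tests by this rate gives a simple priority list for which flaky tests to investigate first.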

Root Cause Analysis

Event Logging for Race Conditions

Add comprehensive event logging to expose timing issues:

typescript
test.beforeEach(async ({page}) => {
  page.on('console', (msg) => console.log(`CONSOLE [${msg.type()}]:`, msg.text()))
  page.on('pageerror', (err) => console.error('PAGE ERROR:', err.message))
  page.on('requestfailed', (req) => console.error(`REQUEST FAILED: ${req.url()}`))
})

For comprehensive console error handling (fail on errors, allowed patterns, fixtures), see console-errors.md.

Network Timing Analysis

typescript
// Capture slow or failed requests
test.beforeEach(async ({page}) => {
  page.on('requestfinished', (request) => {
    const timing = request.timing()
    const duration = timing.responseEnd - timing.requestStart
    if (duration > 2000) {
      // Report immediately so slow requests appear next to the test output
      console.warn(`SLOW: ${request.url()} took ${Math.round(duration)}ms`)
    }
  })

  page.on('requestfailed', (request) => {
    console.error(`Failed: ${request.url()} - ${request.failure()?.errorText}`)
  })
})
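
The fields returned by `request.timing()` are reported in milliseconds relative to `startTime` and are `-1` when a value is not available, so a plain subtraction can produce misleading durations. A minimal sketch of a guard follows. `TimingLike` and `requestDurationMs` are illustrative names, and the interface only mirrors the two fields used here.

```typescript
// Subset of the object returned by request.timing(). Values are in
// milliseconds relative to startTime, or -1 when not available.
interface TimingLike {
  requestStart: number
  responseEnd: number
}

// Returns the request duration in ms, or null if either timestamp is missing.
function requestDurationMs(timing: TimingLike): number | null {
  if (timing.requestStart < 0 || timing.responseEnd < 0) return null
  return timing.responseEnd - timing.requestStart
}

console.log(requestDurationMs({requestStart: 10, responseEnd: 2510})) // 2500
console.log(requestDurationMs({requestStart: -1, responseEnd: 2510})) // null
```

Inside a `requestfinished` handler, the result can be compared against the slow-request threshold only when it is not `null`.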

Trace Analysis

bash
# View trace from failed CI run
npx playwright show-trace path/to/trace.zip

# Generate trace for specific test
npx playwright test tests/flaky.spec.ts --trace on

Fixing Strategies by Type

UI-Driven Flakiness

Problem: Element not ready when action executes

typescript
// ❌ BAD: Nothing confirms the page reached the expected state between actions
await page.click('#submit')
await page.fill('#username', 'test') // May target the wrong page state

// ✅ GOOD: Actions + assertions pattern (auto-waiting built-in)
await page.getByRole('button', {name: 'Submit'}).click()
await expect(page.getByRole('heading', {name: 'Dashboard'})).toBeVisible()

Problem: Animations or transitions interfere

typescript
// ❌ BAD: Click during animation
await page.click('.menu-item')

// ✅ GOOD: Locator actions wait for the element to be stable; then assert the result
await page.getByRole('menuitem', {name: 'Settings'}).click()
await expect(page.getByRole('dialog')).toBeVisible()
// Or request reduced motion (only helps if the app honors prefers-reduced-motion)
await page.emulateMedia({reducedMotion: 'reduce'})

Problem: Brittle selectors

typescript
// ❌ BAD: Fragile CSS chain
await page.click('div.container > div:nth-child(2) > button.btn-primary')

// ✅ GOOD: Semantic selectors
await page.getByRole('button', {name: 'Continue'}).click()
await page.getByTestId('checkout-button').click()
await page.getByLabel('Email address').fill('user@example.com')

Async/Timing Flakiness

Problem: Race between test and application

typescript
// ❌ BAD: Arbitrary sleep
await page.click('#load-data')
await page.waitForTimeout(3000) // Hope data loads in 3s

// ✅ GOOD: Wait for specific condition
await page.click('#load-data')
await expect(page.locator('.data-row')).toHaveCount(10, {timeout: 10000})

// ✅ BETTER: Wait for network response, then assert
const responsePromise = page.waitForResponse(
  (r) => r.url().includes('/api/data') && r.request().method() === 'GET' && r.ok(),
)
await page.click('#load-data')
await responsePromise
await expect(page.locator('.data-row')).toHaveCount(10)

For comprehensive waiting strategies (navigation, element state, network, polling with toPass()), see assertions-waiting.md.

Problem: Complex async state

typescript
// Custom wait for application-specific conditions
await page.waitForFunction(() => {
  const app = (window as any).__APP_STATE__
  return app?.isReady && !app?.isLoading
})

// Wait for multiple conditions
await Promise.all([
  page.waitForResponse('**/api/user'),
  page.waitForResponse('**/api/settings'),
  page.getByRole('button', {name: 'Load'}).click(),
])

Data/Parallelism-Driven Flakiness

Problem: Tests share backend data

typescript
// ❌ BAD: All workers use same user
const testUser = {email: 'shared-user@example.com', password: 'pass123'}

// ✅ GOOD: Unique data per worker
// (createTestUser and deleteTestUser stand in for your own backend helpers)
import {test as base} from '@playwright/test'

export const test = base.extend<{}, {testUser: {email: string; id: string}}>({
  testUser: [
    async ({}, use, workerInfo) => {
      const email = `test-${workerInfo.workerIndex}-${Date.now()}@example.com`
      const user = await createTestUser(email)
      await use(user)
      await deleteTestUser(user.id)
    },
    {scope: 'worker'},
  ],
})

Problem: Shared storageState across workers

typescript
// ❌ BAD: All workers share same auth state
use: {
  storageState: '.auth/user.json',
}

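// ✅ GOOD: Per-worker auth state
// (authenticateUser stands in for your own login helper)
import fs from 'fs'
import {test as base} from '@playwright/test'

export const test = base.extend<{}, { workerStorageState: string }>({
  // Make every test's context load this worker's saved state
  storageState: ({ workerStorageState }, use) => use(workerStorageState),
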
  workerStorageState: [
    async ({ browser }, use, workerInfo) => {
      const id = workerInfo.workerIndex;
      const fileName = `.auth/user-${id}.json`;

      if (!fs.existsSync(fileName)) {
        const page = await browser.newPage({ storageState: undefined });
        await authenticateUser(page, `worker${id}@test.com`);
        await page.context().storageState({ path: fileName });
        await page.close();
      }

      await use(fileName);
    },
    { scope: "worker" },
  ],
});

Test-Suite-Driven Flakiness (State Leaks)

Problem: Tests affect each other

typescript
// ❌ BAD: Module-level state persists across tests
let sharedPage: Page

test.beforeAll(async ({browser}) => {
  sharedPage = await browser.newPage() // Shared across tests!
})

// ✅ GOOD: Use Playwright's default isolation (fresh context per test)
test('first test', async ({page}) => {
  // Fresh page for this test
})

test('second test', async ({page}) => {
  // Fresh page for this test
})

Problem: Fixture cleanup not happening

typescript
// ✅ GOOD: Proper fixture with cleanup
import fs from 'fs'
import {test as base} from '@playwright/test'

export const test = base.extend<{tempFile: string}>({
  tempFile: async ({}, use) => {
    const file = `/tmp/test-${Date.now()}.json`
    fs.writeFileSync(file, '{}')

    await use(file)

    // Cleanup always runs, even on failure
    if (fs.existsSync(file)) {
      fs.unlinkSync(file)
    }
  },
})

CI-Specific Flakiness

Why Tests Fail Only in CI

| CI Condition | Impact | Solution |
| --- | --- | --- |
| Slower CPU | Actions complete later than expected | Use auto-waiting, not timeouts |
| Cold browser start | No cached assets, slower initial load | Add explicit waits for first navigation |
| Headless mode | Different rendering behavior | Test locally in headless mode |
| Shared runners | Resource contention | Reduce parallelism or use dedicated runners |
| Network latency | API calls slower | Mock external APIs, increase timeouts for real calls |
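
Several of these solutions can be expressed as settings that differ between CI and local runs. The sketch below is one possible arrangement. `runSettings` is a hypothetical helper, and the worker count and local timeout are placeholder values to tune for your runners.

```typescript
// Top-level Playwright Test options that commonly differ between CI and local runs.
interface RunSettings {
  workers: number | undefined
  retries: number
  timeout: number
}

// Hypothetical helper: the specific numbers are assumptions, not recommendations.
function runSettings(isCI: boolean): RunSettings {
  return {
    // Fewer workers reduce resource contention on shared runners;
    // undefined leaves the choice to Playwright's default.
    workers: isCI ? 2 : undefined,
    // Retries in CI surface flaky tests without hiding them locally.
    retries: isCI ? 2 : 0,
    // Per-test timeout in ms, with more headroom for slower CI machines.
    timeout: isCI ? 60000 : 30000,
  }
}

console.log(runSettings(true)) // { workers: 2, retries: 2, timeout: 60000 }
```

In `playwright.config.ts` the result can be spread into the config object, for example `defineConfig({...runSettings(!!process.env.CI), use: {...}})`.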

Simulating CI Locally

bash
# Run headless with CI environment variable
CI=true npx playwright test

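# Limit CPU (Linux/Mac; requires the separately installed cpulimit tool)
cpulimit -l 50 -- npx playwright test

# Run in Docker matching CI environment
# (use the image tag that matches your installed Playwright version)
docker run -it --rm --ipc=host \
  -v "$(pwd)":/work \
  -w /work \
  mcr.microsoft.com/playwright:v1.40.0-jammy \
  npx playwright test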

Consistent Viewport and Scale

typescript
// playwright.config.ts - Match CI rendering exactly
import {defineConfig} from '@playwright/test'

export default defineConfig({
  use: {
    viewport: {width: 1280, height: 720},
    deviceScaleFactor: 1,
  },
})

Network Stubbing for External APIs

typescript
// Eliminate external API flakiness
test.beforeEach(async ({page}) => {
  // Stub unstable third-party APIs
  await page.route('**/api.analytics.com/**', (route) => route.fulfill({body: ''}))
  await page.route('**/api.payment-provider.com/**', (route) =>
    route.fulfill({json: {status: 'ok'}}),
  )
})

// Test-specific stub
test('checkout with payment', async ({page}) => {
  await page.route('**/api/payment', (route) =>
    route.fulfill({json: {success: true, transactionId: 'test-123'}}),
  )
  // Test proceeds with deterministic response
})

Quarantine and Management

Quarantine Pattern

typescript
// playwright.config.ts - Separate flaky tests
import {defineConfig} from '@playwright/test'

export default defineConfig({
  projects: [
    {
      name: 'stable',
      testIgnore: ['**/*.flaky.spec.ts'],
    },
    {
      name: 'quarantine',
      testMatch: ['**/*.flaky.spec.ts'],
      retries: 3,
    },
  ],
})

Annotation-Based Quarantine

typescript
// Mark flaky tests with annotations
test('intermittent checkout issue', async ({page}, testInfo) => {
  testInfo.annotations.push({
    type: 'flaky',
    description: 'Investigating payment API timing - JIRA-1234',
  })

  // Test implementation
})

// Skip flaky test conditionally
test('known CI flaky', async ({page}) => {
  test.skip(!!process.env.CI, 'Flaky in CI - investigating JIRA-5678')
  // Test implementation
})
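
Quarantine only helps if every quarantined test stays linked to a tracking ticket. As a sketch, assuming annotations are exported somewhere (for example by a custom reporter) in the hypothetical shape below, a small check can list `flaky` annotations whose description lacks a ticket key such as `JIRA-1234`. The pattern is an assumption about your ticket format.

```typescript
// Mirrors the {type, description} shape pushed onto testInfo.annotations.
interface Annotation {
  type: string
  description?: string
}

// Hypothetical export format: one entry per test with its annotations.
interface AnnotatedTest {
  title: string
  annotations: Annotation[]
}

// Assumed ticket format: uppercase project key, hyphen, number (e.g. JIRA-1234).
const TICKET_PATTERN = /\b[A-Z][A-Z0-9]+-\d+\b/

// Titles of tests marked flaky without a ticket reference in the description.
function flakyWithoutTicket(tests: AnnotatedTest[]): string[] {
  return tests
    .filter((t) =>
      t.annotations.some(
        (a) => a.type === 'flaky' && !TICKET_PATTERN.test(a.description ?? ''),
      ),
    )
    .map((t) => t.title)
}

const quarantined: AnnotatedTest[] = [
  {
    title: 'intermittent checkout issue',
    annotations: [{type: 'flaky', description: 'Investigating payment API timing - JIRA-1234'}],
  },
  {title: 'profile upload', annotations: [{type: 'flaky', description: 'sometimes slow'}]},
]
console.log(flakyWithoutTicket(quarantined)) // [ 'profile upload' ]
```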

Prevention Strategies

Test Burn-In

bash
# Run new tests many times before merging
npx playwright test tests/new-feature.spec.ts --repeat-each=50

# Run in parallel to expose race conditions
npx playwright test tests/new-feature.spec.ts --repeat-each=20 --workers=4

Isolation Checklist

typescript
// ✅ Each test should be self-contained
test.describe('User profile', () => {
  test('can update name', async ({page, testUser}) => {
    // Uses unique testUser fixture
    // No dependency on other tests
    // Cleanup handled by fixture
  })

  test('can update email', async ({page, testUser}) => {
    // Independent of "can update name"
    // Own testUser, own state
  })
})

Defensive Assertions

typescript
// ❌ BAD: Single point of failure
await expect(page.locator('.items')).toHaveCount(5)

// ✅ GOOD: Progressive assertions that help diagnose
await expect(page.locator('.items-container')).toBeVisible()
await expect(page.locator('.loading')).not.toBeVisible()
await expect(page.locator('.items')).toHaveCount(5)

Retry Budget

typescript
// playwright.config.ts - Limit retries to avoid masking issues
import {defineConfig} from '@playwright/test'

export default defineConfig({
  retries: process.env.CI ? 2 : 0, // Only retry in CI
  expect: {
    timeout: 10000, // Reasonable assertion timeout
  },
  timeout: 60000, // Test timeout
})
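
A retry budget can also be enforced per run: if too many tests needed a retry to pass, treat the run as unhealthy even though it is green. The sketch below assumes you already have the two counts from your reporter output. `RunSummary`, `withinRetryBudget`, and the 2% threshold in the example are illustrative choices.

```typescript
// Hypothetical per-run summary derived from reporter output.
interface RunSummary {
  total: number // tests executed in the run
  passedOnRetry: number // tests that failed at first and passed on a retry
}

// True when the share of tests that needed a retry stays within the budget.
function withinRetryBudget(summary: RunSummary, maxFlakyFraction: number): boolean {
  if (summary.total === 0) return true // nothing ran, nothing to flag
  return summary.passedOnRetry / summary.total <= maxFlakyFraction
}

console.log(withinRetryBudget({total: 200, passedOnRetry: 3}, 0.02)) // true
console.log(withinRetryBudget({total: 200, passedOnRetry: 5}, 0.02)) // false
```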

Anti-Patterns to Avoid

| Anti-Pattern | Problem | Solution |
| --- | --- | --- |
| waitForTimeout() as primary wait | Arbitrary, hides real timing issues | Use auto-waiting assertions |
| Increasing global timeout to "fix" flakes | Masks root cause, slows all tests | Find and fix actual timing issue |
| Retrying until pass | Hides systemic problems | Fix root cause, use retries for diagnosis only |
| Shared test data across workers | Race conditions, collisions | Isolate data per worker |
| Testing real external APIs | Network variability | Mock external dependencies |
| Module-level mutable state | Leaks between tests | Use fixtures with proper cleanup |
| Ignoring flaky tests | Problem compounds over time | Quarantine and track for fixing |
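
The first anti-pattern is easy to detect mechanically. The sketch below scans test source text for `waitForTimeout(` calls and skips lines that are entirely commented out. `findFixedSleeps` is a hypothetical helper, and a line-based regex like this is a rough check, not a replacement for a proper lint rule.

```typescript
interface Finding {
  line: number // 1-based line number in the scanned source
  text: string // trimmed source line
}

// Reports lines that call waitForTimeout(), ignoring whole-line // comments.
function findFixedSleeps(source: string): Finding[] {
  const findings: Finding[] = []
  source.split(/\r?\n/).forEach((raw, index) => {
    const text = raw.trim()
    if (text.startsWith('//')) return
    if (/\bwaitForTimeout\s*\(/.test(text)) {
      findings.push({line: index + 1, text})
    }
  })
  return findings
}

const sample = [
  "await page.click('#load-data')",
  'await page.waitForTimeout(3000)',
  '// await page.waitForTimeout(500)',
].join('\n')
console.log(findFixedSleeps(sample))
// [ { line: 2, text: 'await page.waitForTimeout(3000)' } ]
```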