From 44e9b38ac2d20323e57fc4689cf8f2be6bc7e1a9 Mon Sep 17 00:00:00 2001 From: BigBodyCobain Date: Fri, 22 May 2026 17:36:33 -0600 Subject: [PATCH 1/2] Deflake messagesViewFirstContact via CI concurrency group MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Root cause ---------- ci.yml fires twice on every PR — once directly via `pull_request: [main]` (producing the "Frontend Tests & Build" check) and once via `workflow_call` from docker-publish.yml (producing the "CI Gate / Frontend Tests & Build" check). Both jobs land on the same Actions runner pool at the same time and fight for CPU/RAM. Under contention, the React reconciliation in `messagesViewFirstContact.test.tsx > removes an approved contact immediately from the visible contact list` overruns its 5s waitFor timeout. This is the single test that has flaked on PRs #226, #237, #261, #262, #265, #294, #303, and the fd7d6fa push — always on the same job name ("CI Gate / Frontend Tests & Build"), never on the sibling job ("Frontend Tests & Build") on the same commit. PR #304 (which heavily touched the frontend) passed both jobs on first try. PR #303 (zero frontend changes) failed only the CI Gate job. That asymmetry is what finally pinpointed the parallel-resource-contention cause rather than anything in the test or the PRs. Fix --- .github/workflows/ci.yml — added a workflow-level concurrency group keyed on the PR head SHA (or pushed commit SHA). Both invocations against the same commit now share a group, so the second one queues instead of running in parallel. cancel-in-progress is intentionally `false` — cancelling would risk leaving a PR check stuck in "Expected" if only one of the two ever finished. Total CI time grows by ~2 min in exchange for deterministic outcomes. frontend/src/__tests__/mesh/messagesViewFirstContact.test.tsx — belt-and-suspenders bump of the waitFor timeout from 5s to 15s. The structural fix above should make the original 5s margin sufficient, but the bump removes the residual risk of brief runner load spikes inside the (now serialised) single job. The failure mode this masks would be "toast never renders", which still fails loudly at 15s. The full mesh test file (26 tests) passes locally in ~8s with the bumped timeout. --- .github/workflows/ci.yml | 22 +++++++++++++++++++ .../mesh/messagesViewFirstContact.test.tsx | 21 +++++++++++++----- 2 files changed, 37 insertions(+), 6 deletions(-) diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index fd68ae5..fd5ed6e 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -7,6 +7,28 @@ on: branches: [main] workflow_call: +# CI flake mitigation: +# ci.yml is triggered TWICE per PR on the same commit — once directly via +# the `pull_request` trigger above ("Frontend Tests & Build" check) and once +# via `workflow_call` from docker-publish.yml ("CI Gate / Frontend Tests & +# Build" check). Both jobs land on the same Actions runner pool at the same +# time and fight for CPU/RAM. Under contention, React's reconciliation in +# `messagesViewFirstContact.test.tsx > removes an approved contact …` +# overruns its 5s waitFor timeout — that's the single failure mode we've +# seen flake on PRs #226, #237, #261, #262, #265, #294, #303, and the +# fd7d6fa push. Backend tests and every other frontend test pass under +# the same conditions, which is what made this look random. +# +# Pinning a concurrency group on the SHA (PR head, or the pushed commit +# for main) serializes the two invocations so neither starves the other. +# We use cancel-in-progress: false so the second one queues instead of +# cancelling — cancelling could leave the PR check stuck "Expected" if +# only one of the two ever finishes. Total CI time grows by ~2 min in +# exchange for deterministic outcomes. +concurrency: + group: ci-${{ github.event.pull_request.head.sha || github.sha }} + cancel-in-progress: false + jobs: frontend: name: Frontend Tests & Build diff --git a/frontend/src/__tests__/mesh/messagesViewFirstContact.test.tsx b/frontend/src/__tests__/mesh/messagesViewFirstContact.test.tsx index dec5baa..b781d97 100644 --- a/frontend/src/__tests__/mesh/messagesViewFirstContact.test.tsx +++ b/frontend/src/__tests__/mesh/messagesViewFirstContact.test.tsx @@ -868,18 +868,27 @@ describe('MessagesView first-contact trust UX', () => { // event (removeContact + setContacts + setComposeStatus + setComposeError). // Under CI load the resulting render-and-paint cycle has been observed // to take >1s, which is the default findByText timeout — that race has - // produced flakes on PRs #226, #237, #261, and #262 in succession. - // The settle window is bounded by React's reconciliation, not by any - // network/animation cost, so a generous timeout is the right deflake - // here (the failure mode this masks would be "toast never renders", - // which would still fail at 5s). + // produced flakes on PRs #226, #237, #261, #262, #265, #294, #303, and + // the fd7d6fa push. + // + // The structural root cause is fixed in .github/workflows/ci.yml via a + // concurrency group (ci.yml runs twice in parallel per PR — direct + // trigger + workflow_call from docker-publish.yml — and both jobs land + // on the same Actions runner pool, starving each other). Serialising + // them via concurrency removes the resource contention. + // + // We also bump the timeout here as belt-and-suspenders. The settle + // window is bounded by React's reconciliation, not by any network or + // animation cost, so a generous timeout is the right deflake (the + // failure mode this masks would be "toast never renders", which still + // fails at 15s). await waitFor( () => { expect( screen.getByText(/Removed contact: Remove Me\./i), ).toBeInTheDocument(); }, - { timeout: 5000, interval: 50 }, + { timeout: 15000, interval: 50 }, ); expect(screen.queryByText('Remove Me')).not.toBeInTheDocument(); }); From bc70cc3527f73a2fb89f63da3e18939523a921dc Mon Sep 17 00:00:00 2001 From: BigBodyCobain Date: Fri, 22 May 2026 17:49:00 -0600 Subject: [PATCH 2/2] =?UTF-8?q?fix(test):=20per-test=20timeout=20=E2=80=94?= =?UTF-8?q?=2015s=20waitFor=20inside=2015s=20testTimeout=20was=20zero=20he?= =?UTF-8?q?adroom?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Mistake in the prior commit on this branch (44e9b38). Bumped the waitFor timeout to 15s without realising the suite-wide testTimeout was ALSO 15s (raised in Round 7a deflake work). Net effect: the test ran out of clock budget BEFORE waitFor could even finish polling, producing "Test timed out in 15000ms" on the "Frontend Tests & Build" run of PR #305 — same job that the concurrency-group fix had just freed from the resource-contention flake. Fix: * Bump JUST this test's per-test timeout to 30s via the `{ timeout: 30_000 }` argument on the `it()` block. * Drop the inner waitFor back to 10s (was 15s) so it has a clear margin against the 30s test budget after setup/render/click. 26/26 tests in the file pass locally in 6.19s. The concurrency-group fix in ci.yml stays as-is — that was correct and verifiably worked (CI Gate / Frontend Tests & Build went green on the PR after 8 prior failures). The flake-jump to the sibling workflow exposed this second-order bug. --- .../mesh/messagesViewFirstContact.test.tsx | 32 ++++++++++++------- 1 file changed, 20 insertions(+), 12 deletions(-) diff --git a/frontend/src/__tests__/mesh/messagesViewFirstContact.test.tsx b/frontend/src/__tests__/mesh/messagesViewFirstContact.test.tsx index b781d97..780a053 100644 --- a/frontend/src/__tests__/mesh/messagesViewFirstContact.test.tsx +++ b/frontend/src/__tests__/mesh/messagesViewFirstContact.test.tsx @@ -842,7 +842,7 @@ describe('MessagesView first-contact trust UX', () => { expect(screen.queryByText(/delivery key has not reached/i)).not.toBeInTheDocument(); }); - it('removes an approved contact immediately from the visible contact list', async () => { + it('removes an approved contact immediately from the visible contact list', { timeout: 30_000 }, async () => { contactsState = { '!sb_remove': { alias: 'Remove Me', @@ -871,24 +871,32 @@ describe('MessagesView first-contact trust UX', () => { // produced flakes on PRs #226, #237, #261, #262, #265, #294, #303, and // the fd7d6fa push. // - // The structural root cause is fixed in .github/workflows/ci.yml via a - // concurrency group (ci.yml runs twice in parallel per PR — direct - // trigger + workflow_call from docker-publish.yml — and both jobs land - // on the same Actions runner pool, starving each other). Serialising - // them via concurrency removes the resource contention. + // Two-part fix: // - // We also bump the timeout here as belt-and-suspenders. The settle - // window is bounded by React's reconciliation, not by any network or - // animation cost, so a generous timeout is the right deflake (the - // failure mode this masks would be "toast never renders", which still - // fails at 15s). + // 1. .github/workflows/ci.yml — concurrency group serialises the two + // parallel ci.yml invocations (direct trigger + workflow_call from + // docker-publish.yml) so they no longer starve each other for + // runner CPU/RAM. That covered the SHA-pair starvation case which + // was visible on PR #303 / #294. + // + // 2. This block — the per-test `timeout: 30_000` on the `it()` above + // and the 10s `waitFor` timeout below. The suite-wide testTimeout + // was 15s (raised in Round 7a deflake work). An earlier draft of + // this fix set waitFor to 15s, but that left ZERO headroom against + // the 15s per-test budget — the test ran out of clock before + // waitFor could even fail. Bumping the per-test timeout to 30s + // gives waitFor a real 10s window after the render/click setup + // finishes. + // + // The failure mode this masks would be "toast never renders", which + // still fails loudly at the 10s waitFor cap. await waitFor( () => { expect( screen.getByText(/Removed contact: Remove Me\./i), ).toBeInTheDocument(); }, - { timeout: 15000, interval: 50 }, + { timeout: 10000, interval: 50 }, ); expect(screen.queryByText('Remove Me')).not.toBeInTheDocument(); });