
Decisions log — /testing harness, notifications, lifecycle walkthrough

decisions/2026-06-19-testing-notifications-walkthrough.md

Date: 2026-06-19 · Author: Claude (autonomous overnight session) · Status: for review

This is the running record of every non-obvious decision made while building the /testing harness, the notifications system, and walking the reservation lifecycle end-to-end. Each entry states the problem, the options, and the choice — so you can veto or redirect anything in the morning. Anything marked ⚠ CONTROVERSIAL is where I'd most value a second opinion.


0. Auth for the autonomous walkthrough — ADMIN_DEV_BYPASS=1

Problem. The walkthrough has to drive the office UI in Chrome overnight, with no human available to complete an OAuth round-trip. A seed-only database is a true clean slate: 0 leads, 0 reservations, and only the System + Agent service accounts (no human staff account). ADMIN_DEV_BYPASS was set to 0, which forces real Google sign-in.

Options. (a) Real Google sign-in via the connected Chrome studio.chat profile. (b) Dev bypass.

Choice: (b). Real Google OAuth against a local server almost certainly fails: the prod OAuth client's callback allowlist has no 127.0.0.1 entry (per the prod-redirect memory: custom domain, exact callback, no wildcards). Dev bypass is the intended local-dev mode, grants full admin, and has zero prod impact (VERCEL_ENV=production hard-locks bypass off regardless). I flipped ADMIN_DEV_BYPASS from 0 to 1 in .env, with a comment. Flip it back to 0 when you want to re-test the real sign-in flow.
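
The production hard-lock can be pictured as a small pure guard. This is a minimal sketch, not the project's actual code: the function name isDevBypassEnabled and the Env type are hypothetical, and only the ADMIN_DEV_BYPASS and VERCEL_ENV variable names come from this log.

```typescript
// Hypothetical helper; only the two env var names are taken from this log.
type Env = { [key: string]: string | undefined };

function isDevBypassEnabled(env: Env): boolean {
  // Production always wins: the bypass is off no matter what the flag says.
  if (env.VERCEL_ENV === "production") return false;
  // Anywhere else, the bypass is opt-in via the explicit "1" value.
  return env.ADMIN_DEV_BYPASS === "1";
}
```

In real code the argument would presumably be process.env; passing it in keeps the guard trivially testable.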

Consequence. The dev-bypass user has accountId: null, so it can't be a notification recipient on its own. See §3 (recipient identity).


1. Notification event spine — hook createAuditEntry, don't scatter emits

Problem. "Create hooks for notifications on the important stuff" across a large lifecycle (lead created, quote sent, deposit cleared, confirmed, returned, settled, closed, claim filed, verification approved, inspection signed, hold expired, overdue flagged…).

Options. (a) Add explicit notify(...) calls at each of ~20 event sites. (b) Hook the one choke point nearly every business event already flows through: createAuditEntry (src/lib/office/audit.ts).

Choice: (b). Every meaningful mutation already writes an audit row with (entity_type, action, from_state, to_state, payload, source). After a successful audit insert, createAuditEntry calls a single guarded hook onAuditEntry(entry). A pure registry maps (entity_type, action, to_state) → an event descriptor (title, category, severity, default channels); non-notifiable rows (e.g. email_sent, tag_changed) are simply absent from the map and ignored. This gives near-complete coverage from one wiring point and keeps the taxonomy in a single gated, tested module.

Guarantees. The hook is fully wrapped in try/catch and never throws — an audit write (and the transition that triggered it) can never fail because of notifications, matching the existing "emails never throw into the caller" contract. The hook does only fast DB inserts (in-app rows + outbox enqueue); the actual Slack HTTP call is deferred to the queue worker.
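
A minimal sketch of the registry plus the never-throwing hook, under stated assumptions: the type names, the registry key format, the two sample entries, and the injected sink parameter are all illustrative and not taken from src/lib/office/audit.ts. Only the audit column names and the onAuditEntry name come from this log.

```typescript
// Hypothetical shapes; field names mirror the audit columns named in this log.
type AuditEntry = {
  entity_type: string;
  action: string;
  from_state: string | null;
  to_state: string | null;
  payload: { [key: string]: unknown };
  source: string;
};

type EventDescriptor = {
  title: string;
  category: string;
  severity: "info" | "warning" | "critical";
  channels: Array<"in_app" | "slack">;
};

// Pure registry keyed by "entity_type|action|to_state"; "*" matches any to_state.
// Rows with no key here (e.g. email_sent, tag_changed) are simply not notifiable.
const REGISTRY: { [key: string]: EventDescriptor | undefined } = {
  "leads|created|*": {
    title: "New lead", category: "leads", severity: "info", channels: ["in_app", "slack"],
  },
  "reservations|status_changed|confirmed": {
    title: "Reservation confirmed", category: "reservations", severity: "info", channels: ["in_app", "slack"],
  },
};

function describeEvent(entry: AuditEntry): EventDescriptor | undefined {
  const prefix = `${entry.entity_type}|${entry.action}|`;
  // Prefer an exact to_state match, then fall back to the wildcard entry.
  return REGISTRY[prefix + (entry.to_state ?? "*")] ?? REGISTRY[prefix + "*"];
}

// The sink stands in for the fast DB inserts (in-app rows + outbox enqueue).
type Sink = (entry: AuditEntry, event: EventDescriptor) => void;

// Returns true if a notification was emitted; never throws into the caller.
function onAuditEntry(entry: AuditEntry, sink: Sink): boolean {
  try {
    const event = describeEvent(entry);
    if (!event) return false; // not in the registry: ignore silently
    sink(entry, event);
    return true;
  } catch (err) {
    console.error("notification hook failed; audit write is unaffected", err);
    return false;
  }
}
```

Injecting the sink is a choice made for this sketch so the guard can be exercised without a database; the real hook takes only the entry.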


2. Delivery is queue-driven — durable notification_outbox table

Problem. "Slack notifications should be driven by a queue (kafka or sqs)… if you cannot build that without my input, prepare for it in code and build what you can that works."

Choice. A durable outbox table (notification_outbox) is the queue: rows are pending → processing → delivered | failed with attempts, run_after (backoff), and a JSON payload. A NotificationQueue interface (enqueue, claimBatch, markDelivered, markFailed) has one implementation today — DbOutboxQueue — which needs no external infra and works locally and on Vercel. A worker (drainOutbox) claims a batch, delivers each, and acks.

Why an outbox and not Kafka/SQS now. A real broker needs infra + credentials I can't provision autonomously. The transactional-outbox pattern is the correct first step anyway (durable, at-least-once, survives restarts) and is exactly what you'd later relay into Kafka/SQS. To go real: implement KafkaQueue/SqsQueue against the same interface, or keep the outbox and add a relay that ships pending rows to the broker. Nothing else changes.

Worker triggering. Two paths: (a) a cron route /api/cron/notifications (added to vercel.json) drains on a schedule — the reliable path; (b) a best-effort fire-and-forget kick right after enqueue for low dev latency. Both are idempotent (claim-by-status).
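
The queue interface and the drain loop can be sketched as follows. This is an in-memory, synchronous stand-in written for illustration, not DbOutboxQueue: the method signatures, the retry-to-pending behaviour, the exponential backoff formula, and the maxAttempts default are assumptions. The real implementation would be async and would do the claim as a single atomic database update.

```typescript
type OutboxStatus = "pending" | "processing" | "delivered" | "failed";

type OutboxRow = {
  id: number;
  status: OutboxStatus;
  attempts: number;
  run_after: number; // epoch ms; the row is not claimable before this time
  payload: string;   // JSON-encoded
};

interface NotificationQueue {
  enqueue(payload: unknown, now: number): number;
  claimBatch(limit: number, now: number): OutboxRow[];
  markDelivered(id: number): void;
  markFailed(id: number, now: number): void;
}

// Hypothetical in-memory implementation of the same interface.
class InMemoryOutboxQueue implements NotificationQueue {
  private rows: OutboxRow[] = [];
  private maxAttempts: number;
  private baseDelayMs: number;

  constructor(maxAttempts: number = 5, baseDelayMs: number = 1000) {
    this.maxAttempts = maxAttempts;
    this.baseDelayMs = baseDelayMs;
  }

  private mustGet(id: number): OutboxRow {
    const row = this.rows[id - 1];
    if (!row) throw new Error(`unknown outbox row ${id}`);
    return row;
  }

  enqueue(payload: unknown, now: number): number {
    const id = this.rows.length + 1;
    this.rows.push({ id, status: "pending", attempts: 0, run_after: now, payload: JSON.stringify(payload) });
    return id;
  }

  // Claim-by-status: only pending, due rows are handed out, and they flip to
  // "processing" immediately, so a second drain cannot claim them again.
  claimBatch(limit: number, now: number): OutboxRow[] {
    const due = this.rows
      .filter((r) => r.status === "pending" && r.run_after <= now)
      .slice(0, limit);
    for (const row of due) {
      row.status = "processing";
      row.attempts += 1;
    }
    return due;
  }

  markDelivered(id: number): void {
    this.mustGet(id).status = "delivered";
  }

  // Retry with exponential backoff until maxAttempts, then park as "failed".
  markFailed(id: number, now: number): void {
    const row = this.mustGet(id);
    if (row.attempts >= this.maxAttempts) {
      row.status = "failed";
      return;
    }
    row.status = "pending";
    row.run_after = now + this.baseDelayMs * Math.pow(2, row.attempts - 1);
  }
}

// Claims one batch, delivers each row, and acks; returns the delivered count.
function drainOutbox(
  queue: NotificationQueue,
  deliver: (row: OutboxRow) => void,
  now: number,
  limit: number = 10,
): number {
  let delivered = 0;
  for (const row of queue.claimBatch(limit, now)) {
    try {
      deliver(row);
      queue.markDelivered(row.id);
      delivered += 1;
    } catch (err) {
      queue.markFailed(row.id, now);
    }
  }
  return delivered;
}
```

Because drainOutbox depends only on the interface, a broker-backed queue could be swapped in without touching the worker, which is the point of the design.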

Slack transport. Mirrors the email deliveryMode pattern exactly: if a channel has a configured incoming-webhook URL it POSTs; otherwise it dry-runs (logs + marks delivered with dryRun: true) so the whole pipeline is testable with no Slack app. To go real: create a Slack app, add Incoming Webhooks (or a bot token + chat.postMessage), and paste the webhook URL per channel in the admin routing UI (or set SLACK_BOT_TOKEN). Documented in docs/architecture/notifications.md.
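
The post-or-dry-run decision can be sketched as a pure planning step. The names SlackPlan and planSlackDelivery are hypothetical; the JSON body with a text field is the standard minimal payload for a Slack incoming webhook.

```typescript
// Hypothetical types for this sketch; not the project's transport module.
type SlackPlan =
  | { mode: "dry_run"; channel: string; text: string; dryRun: true }
  | { mode: "post"; channel: string; url: string; body: string; dryRun: false };

function planSlackDelivery(channel: string, webhookUrl: string | null, text: string): SlackPlan {
  // No webhook configured: log-only delivery, still marked as delivered upstream.
  if (!webhookUrl) {
    return { mode: "dry_run", channel, text, dryRun: true };
  }
  // Webhook configured: the worker would POST this body as application/json.
  return { mode: "post", channel, url: webhookUrl, body: JSON.stringify({ text }), dryRun: false };
}
```

Keeping the decision separate from the HTTP call means the dry-run branch and the real branch share everything except the final POST.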


3. In-app recipient identity under dev bypass — map to the super-admin

Problem. In-app notifications target staff accounts (real account_id). The dev-bypass operator has no account, so its inbox would always be empty.

Choice. A helper currentStaffAccountId() resolves: real signed-in accountId → else, under dev bypass, the super-admin account id (fetchSuperAdminAccountId). The dev-bypass user is "the administrator", so showing the sole operator's inbox is the sensible mapping. Clearly commented.
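
The resolution order can be sketched as a pure function. This is illustrative only: resolveStaffAccountId and the Session type are hypothetical, and the real currentStaffAccountId() reads the session and calls fetchSuperAdminAccountId itself rather than taking them as arguments.

```typescript
// Hypothetical session shape; the dev-bypass user has accountId: null.
type Session = { accountId: string | null };

function resolveStaffAccountId(
  session: Session | null,
  devBypass: boolean,
  superAdminAccountId: string | null,
): string | null {
  // 1. A real signed-in account always wins.
  if (session !== null && session.accountId !== null) return session.accountId;
  // 2. Under dev bypass, fall back to the super-admin (may be null on a fresh seed).
  if (devBypass) return superAdminAccountId;
  // 3. Otherwise there is no inbox to show.
  return null;
}
```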

Because seed-only has no human admin, I insert a local-only brandon@studio.chat administrator account (idempotent SQL, not in seed.sql — prod is handled by the sign-in bootstrap, and the no-brandon-in-seed memory stands). This is also exposed as a /testing tool ("seed demo admin") so a human can recreate it. Default subscriptions for that admin are seeded so notifications land out of the box.


4. Subscriptions model — entity + category, per-channel prefs

notification_subscriptions: (account_id, scope_type, scope_key, in_app, slack).

  • entity scope: a specific reservation/account/lead (scope_type='entity', scope_key='<entity_type>:<entity_id>'), i.e. "watch this thing".
  • category scope: all events of a kind (scope_type='category', scope_key='leads'|'reservations'|'payments'|…), i.e. "tell me about all leads".

Staff manage their own from /notifications; a reusable bell affordance on entity detail pages toggles an entity subscription. Soft-deletable (deleted_at) per the house pattern.
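
How the two scopes might resolve to in-app recipients for a single event, as a minimal sketch. The column names come from the table definition above; the function name, the NotifiableEvent type, and the de-duplication rule are assumptions.

```typescript
type Subscription = {
  account_id: string;
  scope_type: "entity" | "category";
  scope_key: string;
  in_app: boolean;
  slack: boolean;
  deleted_at: string | null; // soft delete: non-null rows are ignored
};

// Hypothetical event shape for this sketch.
type NotifiableEvent = { entity_type: string; entity_id: string; category: string };

function inAppRecipients(event: NotifiableEvent, subs: Subscription[]): string[] {
  const entityKey = `${event.entity_type}:${event.entity_id}`;
  const recipients: string[] = [];
  for (const sub of subs) {
    if (sub.deleted_at !== null || !sub.in_app) continue;
    // Entity subscriptions match the exact "<entity_type>:<entity_id>" key;
    // category subscriptions match the event's category.
    const wanted = sub.scope_type === "entity" ? entityKey : event.category;
    if (sub.scope_key !== wanted) continue;
    // An account subscribed through both scopes is notified once.
    if (recipients.indexOf(sub.account_id) === -1) recipients.push(sub.account_id);
  }
  return recipients;
}
```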

5. Admin Slack routing — dedicated table, not settings-KV

Problem. Admins map event categories/keys → Slack channels (new lead → #sales, reservation confirmed → #operations).

Choice. A dedicated notification_routes table (match_type, match_key, slack_channel, webhook_url, enabled) rather than a settings-KV blob — routing is queried per-event on the hot path and benefits from real rows/indexes and an audit trail, and the admin UI maps cleanly to rows. Admin-only CRUD lives on a /notifications "Slack routing" tab.
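
A sketch of the per-event route lookup. The column names come from the table above, but the match_type values ('event' and 'category'), the event key format, and the rule that a specific event route takes precedence over a category route are assumptions made for illustration.

```typescript
type Route = {
  match_type: "event" | "category"; // assumed values
  match_key: string;
  slack_channel: string;
  webhook_url: string | null;
  enabled: boolean;
};

function resolveRoutes(routes: Route[], eventKey: string, category: string): Route[] {
  const enabled = routes.filter((r) => r.enabled);
  // Assumed precedence: a route for this exact event overrides category routes.
  const byEvent = enabled.filter((r) => r.match_type === "event" && r.match_key === eventKey);
  if (byEvent.length > 0) return byEvent;
  return enabled.filter((r) => r.match_type === "category" && r.match_key === category);
}
```

With rows like these, the lookup is a pair of equality filters on (match_type, match_key), which is what makes an index on those columns useful on the hot path.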


6. ⚠ CONTROVERSIAL — the public contact form now CREATES A LEAD

Problem. Your own example is "new lead (contact form) → #sales", and the walkthrough starts "from 0 (i.e. lead)". But today the public /contact form (submitContact) only emails/WhatsApps the studio — it does not create a lead. So there is no lead entity to notify on, and "from 0" has no real entry point in the product.

Choice. Wire submitContact to also create a lead via the existing reusable createLead (find-or-create account from the form's name/email/phone/locale, referrer = "contact form", message = first lead comment, source audit = api/system). That leads.created audit row then drives the notification → #sales, exactly matching your example. The email/WhatsApp notification stays (belt-and-suspenders) but can be retired later.

Why controversial. (a) The contact form is public and unauthenticated, so it now writes rows. The existing IP rate-limit and find_or_create_account idempotency mitigate this, but it remains a spam surface. (b) It changes lead provenance (some leads now arrive without a staffer triaging). (c) Duplicate leads are possible if a staffer also creates one. Mitigations applied: reuse the idempotent account RPC; set the lead's tag/referrer to "contact form" so triage can tell origin; keep the existing studio email so nothing is lost if lead-creation is later reverted. If you object, the revert is one call site: the notification system itself is agnostic to where the lead comes from.


7. /testing page — non-prod only, lifecycle driver

Gated to VERCEL_ENV !== "production" (server-side; the nav item is hidden in prod). It hosts the tools a human needs to repeat the demo: seed a demo admin/lead, jump a reservation to any stage (via the existing super-admin forceReservationStatus + payment/inspection shortcuts), simulate pickup/return scans (the web office has no gear-scan UI — that's the staff app), run the crons on demand, fire a test notification, and seed a Slack route. Tools are added as the walkthrough surfaces the need (the page is explicitly a "discovered-as-I-go" toolbox).


8. ⚠ The visual Chrome walkthrough is BLOCKED on a browser-profile choice

Two Chrome profiles are connected to the extension ("Browser 1", "Browser 2") and the names don't say which is the studio.chat Google account vs. the bioscope one. My standing instruction is to never act in the bioscope account, and the tool requires you to pick the profile — which I can't resolve while you're asleep without risking the wrong identity. So I did the walkthrough functionally instead, which is equivalent for finding broken/missing things:

  • Render-checked every page a demo touches (leads list/detail, reservations list/detail/create, account detail, verifications, /notifications, /testing) — all return 200 and render the new affordances (watch toggle, inbox, routing).
  • Walked the whole lifecycle through the real code paths as an e2e test (notifications-lifecycle.e2e.test.ts): lead → reservation → quoted → accepted → confirmed → returned → settled → closed, asserting that the in-app + Slack notifications fire at each stage. In addition, the pre-existing rental-lifecycle e2e test proves the event-gated path (deposits, scans, inspections) end-to-end.

To do the visual pass in the morning: tell me which connected browser is the studio.chat profile (or open the confirmation screen and Connect the right one), and I'll click through it and record a GIF. Everything is already wired and verified to render; this is the one thing I couldn't safely self-serve.

Running list of gaps found during the walkthrough

  1. Contact form didn't create a lead (FIXED — see §6). The single biggest "missing" thing vs. the brief's mental model. Now wired.
  2. No gear pickup/return UI in the web office (by design — staff app only). Filled by the /testing "force stage" tool. RESOLVED (you asked for it): built a real staff scan tool — per-unit check out / check in on the reservation's "units & scanning" panel, backed by pickupAsset/returnAsset (the same calls the iOS app makes), correctly gated on the confirmed status + signed inspections, with the last check-in auto-advancing to returned. The /testing scan-out/scan-in shortcuts remain for one-click demos.
  3. No super-admin in seed-only (expected). Dev bypass + a fresh seed has no human admin, so the inbox is empty until you click "seed demo admin" on /testing (or sign in). The notifications page degrades gracefully (it still shows Slack routing for admins) rather than looking broken. Decision §3.