May: AI on both sides of the diff, browser tests as state guardrails, MCP as production integrations

May was about three threads sitting next to each other: AI as both the writer and the reviewer of the diff, browser-test-as-code finally becoming the guardrail that fits stateful systems, and MCP servers earning the kind of supply-chain attention any other production dependency would.

AI on both sides of the diff

Codegen velocity has roughly 10x’d what an engineer ships in a day; review velocity has not kept pace. The honest awkwardness is that the same vendors now sell both ends (Copilot writes the diff, Copilot Code Review scores it), and a model family grading its own homework is exactly the kind of correlated blind spot a second reviewer is supposed to catch. The instructive case sits inside the Copilot CLI itself: its rubber duck feature (named for the classic debugging metaphor) hands the main agent’s plan to a second model for review. It earns its keep when the duck is from a different family than the main agent (an OpenAI-class reviewer over a Claude-class planner, say), and much less when both are the same model with a different system prompt.
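The cross-family rule is simple enough to make mechanical. A minimal sketch, assuming each available model carries a family label; the `Model` type, the family strings, and `pickReviewer` are hypothetical names for illustration and not part of the Copilot CLI.

```typescript
// Hypothetical types for illustration; not a Copilot CLI API.
type Model = { name: string; family: string };

// Prefer a reviewer from a different family than the planner.
// Fall back to a same-family model only when nothing else exists,
// and mark that review as not independent so a human can discount it.
function pickReviewer(
  planner: Model,
  candidates: Model[],
): { reviewer: Model; independent: boolean } | undefined {
  const others = candidates.filter((m) => m.name !== planner.name);
  const crossFamily = others.find((m) => m.family !== planner.family);
  if (crossFamily) return { reviewer: crossFamily, independent: true };
  const sameFamily = others[0];
  return sameFamily ? { reviewer: sameFamily, independent: false } : undefined;
}
```

The `independent` flag is the useful part: a same-family review is not worthless, but it should not count as a second opinion.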

variable "zone_id" {
  validation {
    # invalid on Terraform before 1.9: a validation block could only
    # reference the variable being validated, not other variables.
    condition     = var.validation_method != "DNS" || var.zone_id != "" || length(var.zones) > 0
    error_message = "Either zone_id or zones must be provided when validation_method is DNS."
  }
}

A newer class of PR-review tools is doing something the diff-only pass still does not: pulling in the rest of the repo (related tests, prior decisions, naming conventions, coverage deltas) and scoring the diff against that, not just the changed lines in isolation. Diff-only review tops out fast because most real bugs are not local. The block above is a concrete example from this month: invisible to a diff-only pass, caught by an agent pair where one agent specifically had context on the upstream community module’s contract. The bug lived in the relationship between two variables, not in either line.

The job of “reviewer” is becoming the job of “triage operator.” Knowing which agents to fan out, how to weight conflicting findings, when to override the model and trust the gut: this is the skill being hired for, even when the job description still says “senior engineer.” The same-vendor-on-both-sides critique stands, and it matters less in practice if the human stays in the override seat. The failure mode to watch for is the one where the human stops reading and starts merge-stamping.
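The triage step can be sketched as a small aggregation over agent findings. This is an illustration under assumptions, not a description of any shipping tool: the `Finding` shape, the per-agent weights, and the rule that any disagreement goes to a human are all hypothetical.

```typescript
// Hypothetical shapes for illustration only.
type Finding = { agent: string; location: string; verdict: "block" | "pass" };
type Triage = { location: string; score: number; needsHuman: boolean };

// Group findings by location, sum weighted votes, and flag any location
// where agents disagree so the human stays in the override seat.
function triage(findings: Finding[], weights: Record<string, number>): Triage[] {
  const byLocation = new Map<string, Finding[]>();
  for (const f of findings) {
    const group = byLocation.get(f.location) ?? [];
    group.push(f);
    byLocation.set(f.location, group);
  }
  const out: Triage[] = [];
  byLocation.forEach((group, location) => {
    let score = 0;
    for (const f of group) {
      const w = weights[f.agent] ?? 1; // unweighted agents count as 1
      score += f.verdict === "block" ? w : -w;
    }
    const blocked = group.some((f) => f.verdict === "block");
    const passed = group.some((f) => f.verdict === "pass");
    out.push({ location, score, needsHuman: blocked && passed });
  });
  return out;
}
```

A positive score leans toward blocking. The weights are where the operator's judgment lives: an agent with context on an upstream contract plausibly deserves more weight than a diff-only pass.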

Browser tests as the state guardrail

Two recurring failure modes sit at the same root: dashboards that go blank after instance replacement (control-plane artifacts hardcoding data-plane identity), and stateful systems that quietly resist agentic destroy/reapply work (secrets that must survive snapshot restore, TLS certificate chains, replica-set identity). Both are the same problem with a different surface: AI is great at generating code, and bad at reasoning about state that exists right now. The practical answer that is emerging is not a smarter agent. It is a cheaper test that exercises the running thing.

playwright.spec.ts
        │
   ┌────┴────┐
   ▼         ▼
 PR CI    synthetic
(regress)  monitor
           Checkly
           Datadog

Playwright tests are the cheapest guardrail anyone has found for AI-generated changes to anything that renders, queries, or authenticates: they assert the behavior of the deployed thing, not the shape of the code. The interesting shift is that the same test files now run on two clocks: once at PR time as a regression gate, and continuously in production via synthetic-monitoring runners that treat browser tests as a new tier of alert. Checkly has been doing this as a category for years; Datadog’s browser tests are productizing the same idea inside an existing synthetic-test surface. The convergence signal is that platform teams everywhere are realizing that “synthetic monitoring” and “end-to-end test” are the same artifact run at different cadences.

This is the symmetric answer to “agents are bad at state.” You do not need the agent to be state-aware if the test fleet around it is. Any change (human, agent, or vendor patch) that breaks the running thing gets caught by a test that exercises the running thing, regardless of whether the change looked clean in the diff. The generalized lesson for any platform team: pick the artifact that can run at both PR-time and monitoring-time, and you have collapsed two adjacent budgets into one tool. The next leg of this pattern lives below the browser too: synthetic API checks, synthetic data-plane checks, synthetic auth-chain checks, all treated as code that runs on two clocks.
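The two-clocks idea can be sketched without a browser. In a real setup the check would be a Playwright spec driving a page; here the target is abstracted as a plain `Probe` function so the sketch stays self-contained. `Check`, `Probe`, `prGate`, `monitorTick`, the `/dashboard` path, and the `data-loaded` marker are all hypothetical names chosen for illustration.

```typescript
// Hypothetical types; a real probe would wrap a browser page or HTTP client.
type CheckResult = { name: string; ok: boolean; detail: string };
type Probe = (path: string) => Promise<{ status: number; body: string }>;
type Check = { name: string; run: (probe: Probe) => Promise<CheckResult> };

// The check is written once and knows nothing about which clock runs it.
const dashboardRenders: Check = {
  name: "dashboard renders data",
  run: async (probe) => {
    const res = await probe("/dashboard");
    const ok = res.status === 200 && res.body.includes("data-loaded");
    return { name: "dashboard renders data", ok, detail: `status=${res.status}` };
  },
};

// Clock 1: PR time. Any failing check fails the gate.
async function prGate(checks: Check[], preview: Probe): Promise<boolean> {
  const results = await Promise.all(checks.map((c) => c.run(preview)));
  return results.every((r) => r.ok);
}

// Clock 2: monitoring time. The same checks run against production,
// and a failure becomes an alert instead of a red build.
async function monitorTick(
  checks: Check[],
  prod: Probe,
  alert: (r: CheckResult) => void,
): Promise<void> {
  for (const c of checks) {
    const r = await c.run(prod);
    if (!r.ok) alert(r);
  }
}
```

The design choice that matters is that the check takes its target as a parameter. Once that holds, the PR runner and the monitor differ only in which probe they pass and what they do with a failure.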

MCP servers as production integrations

The agent runtime now routinely reaches into ticketing, cloud APIs, source control, calendars, drives, and dashboards, and almost all of that reach lives behind MCP servers. Each MCP server is effectively a production integration with a third-party system: it carries credentials, it has rate limits, it depends on a remote schema that changes, it has policy boundaries that can quietly tighten. None of this is new in distributed systems. What is new is that the operational discipline that usually surrounds a production integration (versioning, changelogs, status visibility, incident playbooks) has not caught up to the MCP layer.

| MCP server class | Version pinned in lock? | Readable changelog? | Credential boundary documented? | Failure mode when remote down? | Supply-chain provenance |
| --- | --- | --- | --- | --- | --- |
| First-party, vendor-shipped (e.g. gh MCP via gh-aw) | yes, in the .lock.yml | yes, with the CLI release notes | yes, scoped token | tool call returns error; runner continues | first-party |
| Major-vendor MCP (cloud, ticketing, drive) | usually | sometimes | sometimes | varies, often opaque | first-party, varying maturity |
| Community MCP from a public directory | rarely by default | rarely | rarely | undefined | unvetted; treat as any other untrusted dep |

The specific operational pain this month was version drift between MCP servers and the agent runtime consuming them: a policy change in one MCP server interacted with the runtime in a way that took disproportionate time to track down, because debugging across an agent boundary is dramatically harder than debugging an HTTP call. The pattern is the one that bit SDK consumers a decade ago: a dependency you did not think of as a dependency upgrades itself underneath you, and the symptom shows up three layers from the cause.

The interesting counter-pattern this month came from gh-aw’s safe-outputs: a workflow can call back into its own gh MCP (to upload artifacts, post comments, open PRs), and the set of side effects is constrained to a declared output schema that CI validates before anything leaves the runner. The agent gets to choose what to write; the platform decides what shapes are allowed. Deterministic guardrails around a non-deterministic actor.
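The shape of that guardrail can be sketched in a few lines. To be clear about assumptions: this is not the gh-aw safe-outputs format or implementation. `ProposedOutput`, `Rule`, `validateOutputs`, and the specific limits are hypothetical, chosen only to show a declared policy filtering what an agent proposes.

```typescript
// Hypothetical shapes for illustration; not the gh-aw schema.
type ProposedOutput = { kind: string; body: string };
type Rule = { max: number; maxBodyLength: number };
type Rejection = { output: ProposedOutput; reason: string };

// The agent proposes side effects; the declared policy decides which
// ones are allowed to leave the runner. Anything undeclared is dropped.
function validateOutputs(
  proposed: ProposedOutput[],
  policy: Map<string, Rule>,
): { allowed: ProposedOutput[]; rejected: Rejection[] } {
  const allowed: ProposedOutput[] = [];
  const rejected: Rejection[] = [];
  const counts = new Map<string, number>();
  for (const output of proposed) {
    const rule = policy.get(output.kind);
    if (rule === undefined) {
      rejected.push({ output, reason: "kind not declared" });
      continue;
    }
    const used = counts.get(output.kind) ?? 0;
    if (used >= rule.max) {
      rejected.push({ output, reason: "over declared max" });
      continue;
    }
    if (output.body.length > rule.maxBodyLength) {
      rejected.push({ output, reason: "body too long" });
      continue;
    }
    counts.set(output.kind, used + 1);
    allowed.push(output);
  }
  return { allowed, rejected };
}
```

The validator is deliberately dumb: it never inspects why the agent wanted an output, only whether the output fits a declared shape and budget.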

Treat MCP servers like any other third-party SaaS dependency, and an unvetted MCP from an unofficial directory like any other unvetted dependency you would not paste into package.json without reading first: pin versions, read the changelog before bumping, draw the credential boundary explicitly, design for the case where the server is unreachable or returns nonsense. Where a first-party option exists (an MCP that ships and versions with the product it talks to), default to it; reach for community servers only when no first-party option exists, and document why. The platform-engineering frontier here is mostly boring: MCP allowlists, MCP health checks, MCP-aware tracing, safe-output schemas. Nothing exciting, exactly the kind of work that is load-bearing. The senior move is to recognize the agent runtime as a production system whose dependency graph happens to be made of MCP servers instead of npm packages, and apply the discipline you would already apply to any other supply chain.
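Most of that checklist is auditable in CI. A minimal sketch, assuming the runtime's MCP configuration can be read into a list and compared against an allowlist of exact version pins; `McpServer`, `auditServers`, and the field names are hypothetical and do not correspond to any particular runtime's config format.

```typescript
// Hypothetical config shape for illustration only.
type McpServer = {
  name: string;
  version: string;
  firstParty: boolean;
  justification?: string; // why a community server was chosen
};

// allowlist maps server name to the exact pinned version.
function auditServers(servers: McpServer[], allowlist: Map<string, string>): string[] {
  const problems: string[] = [];
  for (const s of servers) {
    const pinned = allowlist.get(s.name);
    if (pinned === undefined) {
      problems.push(`${s.name}: not on allowlist`);
    } else if (s.version !== pinned) {
      problems.push(`${s.name}: version ${s.version} does not match pin ${pinned}`);
    }
    if (!s.firstParty && !s.justification) {
      problems.push(`${s.name}: community server without documented justification`);
    }
  }
  return problems;
}
```

An empty result is the pass condition. Exact-match pinning is intentionally strict here: a floating tag such as `latest` fails the audit, which is the point.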

Long-form pieces on synthetic-test-as-code and on the MCP supply chain are queued for the /writing section.