EXPLAIN ANALYZE

Death by a Thousand Cuts: the AI Database Failure You Can't Restore

Thu, 04 Jun 2026 00:00:00 +0000

TL;DR

The catastrophic AI failure (the agent that DROPs a table) is the recoverable one: loud, attributable, and on a gated pipeline it mostly can’t happen anyway. What actually bleeds you is the change that clears every gate because the gates check correctness at review time, and the failure is a function of volume and time, both of which are zero when the PR is open. Plot failures on loud-vs-quiet and recoverable-vs-not, and AI floods the quiet-and-unrecoverable corner the pipeline was never watching.

A scraper ships behind the same pipeline as everything else: feature branch, two approvals, CI green, a day in staging, then deploy. There is already a postings table, one row per job posting the crawler tracks, keyed by posting_id. Part of the change is a new table to track crawl state, storing the content hash of every posting each run sees so the next run can tell what changed. The agent designed the table, a reviewer approved it, and at review time it held zero rows. It looked reasonable:

CREATE TABLE crawl_state (
 id BIGSERIAL PRIMARY KEY,
 run_id BIGINT NOT NULL,
 posting_id BIGINT NOT NULL REFERENCES postings (posting_id),
 content_hash CHAR(64) NOT NULL, -- SHA-256 of the posting body
 crawled_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX ON crawl_state (posting_id);

It was reasonable, for about three months.

Nothing on that screen looks wrong, and that is the problem. A surrogate id, a foreign key to postings, an index on the column you look postings up by. It reviews as boilerplate. The grain is the part nobody states out loud: one row per posting per run. Every pass appends the entire posting set under a fresh run_id instead of updating the rows already there, so the table grows by the full set every time the scraper runs. postings holds roughly 100,000 rows. Three months and a hundred-odd runs later, crawl_state holds 11 million. The job that decides whether a posting changed runs the obvious query:

SELECT content_hash
FROM crawl_state
WHERE posting_id = $1
ORDER BY crawled_at DESC
LIMIT 1;

The posting_id index finds the matching rows, but there are now a hundred-odd of them per posting, and to return the single latest hash it sorts that pile by crawled_at on every call. Across the working set the pipeline is timing out. Asked to fix the slowness, the agent recommends what it always recommends: widen the index to CREATE INDEX ON crawl_state (posting_id, crawled_at DESC), so the latest-hash lookup stops sorting.

An engineer who knows the system reads it differently. The table shouldn’t have run grain at all. A posting’s current hash is one value, stored once and overwritten each run, and there is nothing to accumulate. Better than that, the hash isn’t a separate concern from the posting. It belongs on the postings row that already exists, so there is no second table, no run_id, and no latest-per-posting lookup at all:

ALTER TABLE postings
 ADD COLUMN content_hash CHAR(64),
 ADD COLUMN hash_checked_at TIMESTAMPTZ;

-- each run, per posting:
UPDATE postings
SET content_hash = $2, hash_checked_at = now()
WHERE posting_id = $1 AND content_hash IS DISTINCT FROM $2;

That stays at 100k rows forever, the changed-or-not check is a single primary-key row read, and there is no crawl_state to bloat. The index the agent suggested speeds the symptom while doubling down on the grain that is the actual bug, paying write cost and storage on a table that should not exist. And there is no deploy to revert. The bad grain is a schema decision three months old plus 11 million rows of accumulated state, and unwinding it is a planned migration and a backfill, reviewed like any other change, not a rollback.
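The two grains are easy to compare at toy scale. What follows is a minimal sketch, not the scraper's code: it uses Python's built-in sqlite3 in place of PostgreSQL, 100 postings and 10 runs in place of the real numbers, and a hypothetical `run_crawl` helper with fake hashes. SQLite's `IS NOT` stands in for `IS DISTINCT FROM`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE postings (
    posting_id      INTEGER PRIMARY KEY,
    content_hash    TEXT,
    hash_checked_at TEXT
);
CREATE TABLE crawl_state (
    id           INTEGER PRIMARY KEY,
    run_id       INTEGER NOT NULL,
    posting_id   INTEGER NOT NULL REFERENCES postings (posting_id),
    content_hash TEXT NOT NULL
);
""")
conn.executemany("INSERT INTO postings (posting_id) VALUES (?)",
                 [(i,) for i in range(1, 101)])

def run_crawl(run_id):
    """One crawler pass over every posting (hypothetical helper)."""
    for posting_id in range(1, 101):
        # Fake hash: only posting 1 changes from run to run.
        new_hash = f"h{run_id}" if posting_id == 1 else "stable"
        # Run grain: append one row per posting per run.
        conn.execute(
            "INSERT INTO crawl_state (run_id, posting_id, content_hash) "
            "VALUES (?, ?, ?)",
            (run_id, posting_id, new_hash))
        # Posting grain: overwrite in place, and only when the hash changed.
        conn.execute(
            "UPDATE postings SET content_hash = ?, hash_checked_at = datetime('now') "
            "WHERE posting_id = ? AND content_hash IS NOT ?",
            (new_hash, posting_id, new_hash))

for run_id in range(1, 11):
    run_crawl(run_id)

run_grain_rows = conn.execute("SELECT COUNT(*) FROM crawl_state").fetchone()[0]
posting_grain_rows = conn.execute("SELECT COUNT(*) FROM postings").fetchone()[0]
print(run_grain_rows, posting_grain_rows)  # 1000 100
```

The run-grain table grows by the full posting set on every pass; the posting-grain column never grows at all.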

The DROP is the lucky case, and here’s the 2x2 that shows why

The failure everyone pictures is the destructive one: the DROP TABLE, the migration that truncates the wrong relation, the script that deletes the binlogs. On a gated system that is the one you have mostly handled, with destructive migrations caught in review, credentials scoped, and a restore runbook practiced. When the Replit agent wiped a production database during a code freeze in July 2025, the data was restored: loud, attributable to one timestamp, recoverable. The scraper cleared that same pipeline because at merge time it was correct, with no rows for the grain bug to express and a CI seed that never reached the row count where it breaks.

Rank a failure on two axes, loud-vs-quiet and recoverable-vs-not, and you get four corners:

Loud and recoverable is the DROP. You know instantly and you roll it back.
Loud and unrecoverable is rarer: a destructive operation you catch but can’t undo.
Quiet and recoverable is the bug that sat unnoticed but is still reversible when you find it.
Quiet and unrecoverable ruins quarters. You don’t find out for months, and by then the prior state is gone or never existed as a clean artifact.

The axes correlate, which is what makes the bad corner deep. Loud failures get caught while the prior state still exists; quiet ones sit, and the longer one sits the more downstream systems consume it as truth, until a recoverable error has been aggregated and propagated into an unrecoverable one. Loudness buys the time, so quiet and unrecoverable travel together.

Note

This post is about the write path: statements that change data, schema, or the planner’s behavior. The read-path failure (an AI-generated SELECT that returns a confidently wrong number) is the sibling problem, covered in What AI Gets Wrong About Your Database. A wrong read misleads a decision; a wrong write becomes the new truth.

AI floods the bad corner for a structural reason. Each change it ships is plausible: it compiles, runs, returns the right shape, passes whatever checks exist. That plausibility is the corruption floor, the same mechanism that makes the output useful making it occasionally wrong in a way that looks exactly right. A loud failure is one the output failed to make plausible; the quiet one is the model working as designed. Then volume multiplies it: a team shipping eighty changes a week instead of eight samples that floor ten times as often, on the same review and CI budget. The DROP is the rare draw the pipeline was built to stop. The thousand cuts are the modal draw, and it waves them through.

The quiet corner comes in three shapes. The scraper is the first: a schema correct at zero rows and fatal at eleven million, because the model optimizes from its training distribution, not your scale. Asked to partition a large table it reaches for created_at in the primary key, the common corpus shape, not the primary-key partitioning that fits a high-scale OLTP table. The second is the value it computes wrong because it doesn’t hold your domain: Cursor’s support bot invented a login policy that didn’t exist and users cancelled before anyone knew, and Air Canada lost in court over a bereavement refund its chatbot made up. Move that same generator onto a write path computing a discount or a tax split and the row is well-typed and wrong about what the number means, with nothing reconciling it against the contract until quarter-end. The third is the change shipped past the author’s own understanding: it looked good and worked, so it went, and the judgment that would have caught it is built by the slow work the agent now skips. That is the paradox of the fast engineer, which a July 2025 study measured as 19% slower even as the developers felt faster.

A worked example: the soft-delete leak

The grain bug is loud once you go looking, because it shows up as latency. The worse version of the same class is a write that corrupts a number and never moves a performance metric at all. Soft deletes are where this lives in most enterprise schemas.

The convention is old and unwritten. A row is never physically removed; it gets a deleted_at stamp, and every query that reads the table is expected to filter it out. A billing system for a SaaS company looks like this:

-- one row per recurring line on an account: base plan, seats, add-ons
CREATE TABLE subscription_items (
 id BIGSERIAL PRIMARY KEY,
 account_id BIGINT NOT NULL REFERENCES accounts (account_id),
 sku TEXT NOT NULL,
 monthly_cents BIGINT NOT NULL,
 deleted_at TIMESTAMPTZ -- set when a customer drops the line
);

When a customer downgrades, the application does not delete the row, it stamps it:

UPDATE subscription_items SET deleted_at = now() WHERE id = $1;

Every existing query that touches money knows this. The MRR rollup, the invoice generator, the revenue dashboard, all of them carry AND deleted_at IS NULL, because the team learned years ago that forgetting it double-counts churned revenue. That knowledge lives in the queries and in the heads of the people who wrote them. It is nowhere in the schema; deleted_at is just a nullable timestamp, and nothing stops a query from ignoring it.

Now an agent is asked to add a board-facing metric, monthly recurring revenue by region, that the nightly ETL persists into the warehouse fact tables the dashboards read. It writes the obvious thing:

CREATE VIEW mrr_by_region AS
SELECT a.region, SUM(si.monthly_cents) AS mrr_cents
FROM accounts a
JOIN subscription_items si ON si.account_id = a.account_id
WHERE a.deleted_at IS NULL -- remembered on accounts
GROUP BY a.region;

It filtered deleted_at on accounts but not on subscription_items, because nothing in the schema said it had to and the training corpus is full of joins shaped exactly like this. Every cancelled add-on and downgraded seat is now summed back into regional MRR, and each night the ETL reads this view and writes the inflated number into the warehouse the whole company reads as truth. The shape is right and the number is plausible, a little high, and growing as the soft-deleted pool grows.

Nothing catches it. The engineer saw a working view and a deleted_at filter sitting right there and moved on; the AI reviewer flagged a naming nit; the human skimmed the green summary and approved; CI ran on a seed database with almost no deleted rows, so the leak was a rounding error in the test. A missing predicate is an absence, the one thing every reviewer, human or model, reliably fails to see.

The drift is the tell. Small at launch, ten or fifteen percent high a year later in the regions with the most downgrade history, and finance finds it the only way anyone does: reconciling the dashboard against billed revenue, a year in, with no single cause and no deploy to revert.
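The reconciliation finance ran by hand can run on a schedule instead. A minimal sketch, assuming you can pull per-region totals from both the dashboard and billing; the function names, the 2% threshold, and the numbers are all made up for illustration.

```python
def relative_drift(reported_cents, billed_cents):
    """Signed drift of the reported figure relative to billed revenue."""
    if billed_cents == 0:
        raise ValueError("billed revenue is zero; nothing to reconcile against")
    return (reported_cents - billed_cents) / billed_cents

def regions_over_threshold(reported, billed, threshold=0.02):
    """Return {region: drift} for regions whose absolute drift exceeds threshold."""
    flagged = {}
    for region, billed_cents in billed.items():
        drift = relative_drift(reported.get(region, 0), billed_cents)
        if abs(drift) > threshold:
            flagged[region] = round(drift, 4)
    return flagged

# Made-up numbers: EMEA reads 12% high, APAC 1% high.
reported = {"emea": 1_120_000, "apac": 505_000}
billed = {"emea": 1_000_000, "apac": 500_000}
flagged = regions_over_threshold(reported, billed)
print(flagged)  # {'emea': 0.12}
```

The check alarms on an aggregate drifting, not on an error, which is the only signal this class of bug ever emits.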

Warning

A soft-delete filter is a contract the schema cannot enforce. Nothing in subscription_items makes a query honor deleted_at, so the rollup should have joined an active_items view (CREATE VIEW active_items AS SELECT ... WHERE deleted_at IS NULL) and a rule should forbid money queries from touching the base table at all. That is worth more than any amount of review attention, because the absence of a predicate is the one thing a reviewer, human or model, reliably fails to see.
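Here is the leak and the view-based fix side by side, as a minimal sketch on Python's built-in sqlite3 instead of PostgreSQL. The column list inside `active_items`, the trimmed `accounts` table, and the two sample rows are assumptions for illustration; only the shape of the join comes from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (
    account_id INTEGER PRIMARY KEY,
    region     TEXT NOT NULL,
    deleted_at TEXT
);
CREATE TABLE subscription_items (
    id            INTEGER PRIMARY KEY,
    account_id    INTEGER NOT NULL REFERENCES accounts (account_id),
    sku           TEXT NOT NULL,
    monthly_cents INTEGER NOT NULL,
    deleted_at    TEXT
);
-- The only relation money queries are meant to read.
CREATE VIEW active_items AS
SELECT id, account_id, sku, monthly_cents
FROM subscription_items
WHERE deleted_at IS NULL;

INSERT INTO accounts VALUES (1, 'emea', NULL);
INSERT INTO subscription_items VALUES
    (1, 1, 'base',  10000, NULL),
    (2, 1, 'seats',  5000, '2026-01-15');  -- downgraded: soft-deleted
""")

# Same rollup, parameterized only by which relation it joins.
ROLLUP = """
SELECT a.region, SUM(si.monthly_cents)
FROM accounts a
JOIN {items} si ON si.account_id = a.account_id
WHERE a.deleted_at IS NULL
GROUP BY a.region
"""

leaky = conn.execute(ROLLUP.format(items="subscription_items")).fetchall()
fixed = conn.execute(ROLLUP.format(items="active_items")).fetchall()
print(leaky)  # [('emea', 15000)]
print(fixed)  # [('emea', 10000)]
```

Both queries read identically at review time. The difference lives in the relation name, which a lint rule can check mechanically.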

The fix is a business call, not a technical one

The mitigations are known and none of them are clever: reconcile against a source of truth on a cadence the business can stand, alarm on aggregates and drift instead of only errors, run CI against production-shaped volume, bake invariants into views and constraints. All worth doing, all secondary, because every one is a net thrown after the write has committed. The load-bearing decision is upstream of the tooling, and it is a positioning call leadership makes on purpose. Four honest positions:

Ship at full speed, accept the corner. Take the velocity and take the unrecoverable write, the silent data loss, the bug the customer finds, as the cost. Legitimate for a seed-stage product with no prior state worth protecting. Catastrophic for a billing system.
Fast where it’s cheap, gated where it’s not. Agents and junior engineers run on loud, recoverable surfaces (internal tooling, dashboards, throwaway analysis); writes that touch money, schema, or multi-writer tables go through someone who holds the domain. This is where Amazon landed the expensive way, requiring senior sign-off on AI-assisted changes to its sensitive stack after a run of incidents (The Register, April 2026). They named the cost: controlled friction.
SME on everything. Only the smallest or most regulated shops can afford it, and it collapses into the second position the moment change volume outgrows the reviewers.
Encode the domain into tests. The one position that scales without scaling reviewers, and the one with the sharpest trap. An SME who knows the soft-delete convention writes an assertion that fails the build on the leak. Ask the agent to “add tests” under deadline and it writes one that sums the same leaked rows and asserts the inflated total is correct: the bug ships with a green check certifying it. And tests only check behavior, never whether the design should exist. The scraper passes everything you could write against it; the bug is the table, and a green suite is guaranteed to bless an architecture that works exactly as built.
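An SME-written assertion for the soft-delete convention might look like the sketch below: plant a soft-deleted sentinel row, re-run the rollup, and fail if the total moved. This is one possible shape, shown on Python's sqlite3 with a cut-down schema; `rollup_ignores_soft_deleted`, the sentinel amount, and the two sample queries are hypothetical.

```python
import sqlite3

SENTINEL_CENTS = 987_654_321  # large enough to be obvious if it leaks

def rollup_ignores_soft_deleted(conn, rollup_sql):
    """True if adding a soft-deleted line leaves the rollup unchanged.

    rollup_sql must return (region, total_cents) rows. The sentinel row is
    rolled back afterwards, so the database is left as it was found.
    """
    before = conn.execute(rollup_sql).fetchall()
    conn.execute(
        "INSERT INTO subscription_items (account_id, sku, monthly_cents, deleted_at) "
        "VALUES (1, 'sentinel', ?, '2026-01-01')", (SENTINEL_CENTS,))
    try:
        after = conn.execute(rollup_sql).fetchall()
    finally:
        conn.rollback()
    return before == after

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, region TEXT NOT NULL);
CREATE TABLE subscription_items (
    id INTEGER PRIMARY KEY, account_id INTEGER NOT NULL,
    sku TEXT NOT NULL, monthly_cents INTEGER NOT NULL, deleted_at TEXT);
INSERT INTO accounts VALUES (1, 'emea');
INSERT INTO subscription_items VALUES (1, 1, 'base', 10000, NULL);
""")

LEAKY = """SELECT a.region, SUM(si.monthly_cents) FROM accounts a
           JOIN subscription_items si ON si.account_id = a.account_id
           GROUP BY a.region"""
CORRECT = """SELECT a.region, SUM(si.monthly_cents) FROM accounts a
             JOIN subscription_items si ON si.account_id = a.account_id
             WHERE si.deleted_at IS NULL
             GROUP BY a.region"""

leaky_ok = rollup_ignores_soft_deleted(conn, LEAKY)
correct_ok = rollup_ignores_soft_deleted(conn, CORRECT)
print(leaky_ok, correct_ok)  # False True
```

The test brings its own deleted row, so it does not depend on the seed database happening to contain one. It still says nothing about whether the rollup should exist.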

The trap is choosing the second or fourth on a slide and the first in practice. All four only work if the people doing the reading still have judgment to read with, and the paradox of the fast engineer is draining that pool: hand the slow work that grows an SME to an agent for two years and the sign-off is staffed by people with the title and not the instinct. If you are not willing to lose your seniors, the budget item is the work that makes them, not just the headcount that has the rank today.

What your monitoring is actually for

Everything you monitor fires in the loud quadrant. Error rates, 5xx, failed-job alerts, latency thresholds set above current numbers, all of it watching the corner you were already going to survive, because the DROP has a backup and a practiced runbook. The scraper that dies three months after a clean review, the write that computes the wrong number, the logic bug a customer found first, those never alarm, because the pipeline reads every change at the one moment it is still correct and then stops looking.

A single one of those is not a crisis. You find it, you trace it, you fix it. The problem is rate. Each is one draw from the corruption floor, and a team shipping ten thousand lines a day draws constantly, laying down a sediment of small wrongs that surface not the day they’re written but a year later, together, as a system nobody fully understands returning numbers nobody fully trusts. By then it is past untangling: a thread to pull assumes a thread, and a year of compounded cuts is the whole fabric. The options shrink to a rewrite or living with numbers you can’t defend, and no senior worth the title signs up to reverse-engineer a year of an agent’s confident guesses.

The fake citations got caught because the judge knew the real ones. That is the whole job, and the agent can’t do it for you: someone has to ship nothing they don’t understand, and understand it the whole way down, what the value means and what it does to every system that reads it later. Your product has no judge unless you are one. The agent makes the drafts faster; knowing what they cost is still the part you can’t hand off.

Narrow Tools, Narrow Agents: Where Agent Reliability Actually Comes From

Sat, 30 May 2026 00:00:00 +0000

TL;DR

The agents that hallucinate least are the ones calling narrow, opinionated tools and given the smallest possible job. A “diagnostic” endpoint that returns a present-state boolean plus a lag number beats a generic query(sql) surface, and a scoped agent with three tools beats a wide agent with twenty. Even with both halves in place, the output still has inconsistencies on some level. The goal is “hallucinates least”, not “right”.

Here is what an on-call agent posted to the incident channel when pg_stat_replication.replay_lag on the orders cluster crossed the 60s SLO:

Logical replication slot pub_orders_v1 is behind on the publisher.
restart_lsn has not advanced in the last 12 minutes. Recommend
dropping the subscription and recreating it, then advancing the
slot manually with pg_replication_slot_advance(). Runbook:
ops/runbooks/postgres/replica-lag-orders.md (last reviewed 2023-08).

The slot the agent named was decommissioned twenty months earlier. This is what pg_replication_slots returned the same minute on the same replica:

=> SELECT slot_name, slot_type, active FROM pg_replication_slots;
 slot_name | slot_type | active
-----------+-----------+--------
(0 rows)

The summary was produced by an agent with RAG over the full ops repo and a standard query(sql) MCP tool against the replica. The orders cluster moved off logical replication in the Q3 2024 migration to streaming. Two paragraphs of an ADR titled “Move publication off logical replication” explain exactly why. The ADR lives in the same repo, indexed by the same embedding model, available to the same retrieval call. It ranked behind the 2023 runbook, behind two other runbooks for the same alert family, and behind a post-mortem from a different cluster entirely. The ADR’s vocabulary didn’t match the alert’s. The agent never read it. The cluster’s actual problem (a long autovacuum on the orders_2026_05 partition generating WAL faster than the replica could apply) was sitting one query away in pg_stat_progress_vacuum. The agent’s summary never reached for it.

What a smarter model wouldn’t have fixed

The familiar levers are a bigger context window, better embeddings, a smarter model. None of them touch the underlying mechanic. Embeddings retrieve by similarity, not by truth-value. The 2023 runbook scored highest because it talked about the exact alert, with the exact column names, in the exact phrasing the alert text used. A bigger window pulls in more competing documents, including more wrong ones. A better embedding model sharpens the same match against the same stale corpus. A smarter LLM produces a more confident summary on the wrong grounding. The retrieval surface is the problem, and the model is doing what models do.

Anthropic’s Writing effective tools for agents (September 2025) makes the structural version of the same point: more tools and broader tools don’t improve agent outcomes. The team behind Claude found that purposefully narrower tools (search_contacts over list_contacts, with shaped responses) beat broader ones consistently, because agents struggle to extract signal from irrelevant context and burn tokens trying. The same shape applies one layer up. A search_runbooks tool grounded against a corpus where half the docs are out of date is a broad tool dressed in narrow clothes. The narrowness has to live in what the tool actually returns.

The same alert through a narrow tool

Here is what a diagnose_replica_lag(cluster) endpoint returns when the same alert fires:

{
 "lagging": true,
 "lag_seconds": 47,
 "replication_mode": "streaming",
 "publisher_lsn": "F1/A2C3D400",
 "replay_lsn": "F1/A2B14C00",
 "wal_replay_paused": false,
 "blocking_autovacuum": {
 "relation": "orders_2026_05",
 "phase": "scanning heap",
 "started_at": "2026-05-24T02:18:11Z",
 "duration_seconds": 5421
 }
}

The agent reads present-state structured input. No runbook, no link, no freeform “here’s what this might mean”. The endpoint is the only thing in the stack that knows the cluster moved off logical replication. Whatever slot the response names (if any) is whatever the slot is called today. The blocking autovacuum is computed by joining pg_stat_replication and pg_stat_progress_vacuum on the server side, where the join is cheap and the freshness is guaranteed. The agent’s summary against this input says what the input says. There is no 2023 runbook in the prompt to retrieve from.
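The server-side shaping step could be sketched as below. This is not the endpoint's actual code: the input row shapes, the `lag_threshold_seconds` parameter, the extra `lag_bytes` field, and the choice to report the longest-running vacuum as the blocker are all assumptions. The LSN arithmetic is the standard one: the hex halves on either side of the slash are the high and low 32 bits of a byte position.

```python
def lsn_to_int(lsn):
    """Convert a PostgreSQL LSN such as 'F1/A2C3D400' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)

def shape_replica_lag(replication, vacuums, lag_threshold_seconds):
    """Reduce raw stats rows to the small present-state payload the agent reads.

    replication: dict with publisher_lsn, replay_lsn, lag_seconds
    vacuums: list of dicts with relation, phase, duration_seconds
    """
    # Assumption: the longest-running vacuum is the one worth surfacing.
    longest = max(vacuums, key=lambda v: v["duration_seconds"], default=None)
    return {
        "lagging": replication["lag_seconds"] > lag_threshold_seconds,
        "lag_seconds": replication["lag_seconds"],
        "lag_bytes": (lsn_to_int(replication["publisher_lsn"])
                      - lsn_to_int(replication["replay_lsn"])),
        "blocking_autovacuum": longest,
    }

payload = shape_replica_lag(
    {"publisher_lsn": "F1/A2C3D400", "replay_lsn": "F1/A2B14C00", "lag_seconds": 47},
    [{"relation": "orders_2026_05", "phase": "scanning heap", "duration_seconds": 5421},
     {"relation": "order_items", "phase": "vacuuming indexes", "duration_seconds": 12}],
    lag_threshold_seconds=30,
)
print(payload["lag_bytes"])                        # 1214464
print(payload["blocking_autovacuum"]["relation"])  # orders_2026_05
```

Everything the agent could get wrong by composing the join itself has already been decided here, before a single token is spent.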

The Unix mantra ports to agent tools more directly than most. Each tool does one thing, does it well, and shapes its output for the consumer. The consumer is an agent with a token budget, no working memory between calls, and a fondness for whichever input most resembles a familiar pattern. The output has to be small, shaped, current, and complete enough that the agent’s job is to read it, not compose it.

A schema introspector for an agent doing query work returns the active subset of the catalog: tables, columns, and indexes touched by queries in the last thirty days, with column types, foreign keys, and the indexes that have non-zero idx_scan over the same window. It does not dump 4,200 rows of pg_catalog.pg_class joined against pg_attribute. The catalog carries years of accumulated noise: deprecated audit tables, the experiment from 2022 that never got cleaned up, the staging-only mirror copies. An agent given the full dump pattern-matches against the noise as readily as the signal.

A “currently breaking” endpoint returns a pre-joined view of active alerts, the playbook each alert routes to, the affected service, the deploy SHA from the last fifteen minutes, and the on-call’s contact. It does not return three underlying APIs and trust the agent to compose the join. The join is the question. The tool encoded it. Re-deriving the join from raw sources every call is where the agent burns tokens and where the misattributions accumulate.

Web search joins this set when the question is about something fresh. For a CVE on a specific Postgres minor, a web_search(release_notes) tool grounded against pgsql-announce or a vendor advisory beats the same question routed against an internal RAG corpus where the relevant note doesn’t exist yet. Fresh source. Narrow scope. The tool encodes the question.

The pattern across all four: the tool encodes the question, not the source. An agent calling query(sql) gets to compose every question itself, including the badly-worded ones. An agent calling a diagnostic tool only asks the questions the diagnostic was designed for. That constraint is the feature.

Warning

“Narrow” has to be narrow in what the tool returns, regardless of what it’s named. A diagnose_replica_lag tool that runs SELECT * FROM pg_stat_replication and dumps every column on every row is no better than letting the agent write the SQL itself. The shaping (which columns, with what projection, with what filtering, with what defaults) is where the reliability lives. A tool that returns 5,000 rows because nobody put a LIMIT on the server side has handed the responsibility back to the model, which is the responsibility the tool existed to remove.

Narrow tools don’t help a wide agent

Tools shaped for the right answer still get the wrong call from an agent given fifty of them and the instruction “you are the on-call assistant, help the human”. The model picks whichever tool pattern-matches best to whatever the input string looked like, and that pattern-matching is biased toward whichever description sounds most familiar. A scoped agent (one job, three tools, one decision to produce) has nothing else to reach for.

Scope is what the agent is for, expressed narrowly enough that the right tool call is mechanical. “Triage a replica-lag page” is a scope. The tools are the diagnostic, the recent-deploys lookup, and the playbook. The agent calls them in order and produces the summary. “Help the engineer with whatever they ask” is not a scope, it is the absence of one. The same model with the same tools resolves the first job consistently and freestyles the second.
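In harness code, a scope like that can be as small as a fixed tuple of tool names. A minimal sketch, assuming each tool is a callable that takes the cluster name and returns a dict; the names `TRIAGE_STEPS`, `triage_replica_lag`, and the stub tools are hypothetical, and the summary-writing model call is left out.

```python
from typing import Callable, Dict, List, Tuple

# The whole scope: one job, three tools, a fixed call order.
TRIAGE_STEPS: Tuple[str, ...] = ("diagnose_replica_lag", "recent_deploys", "playbook")

def triage_replica_lag(cluster: str, tools: Dict[str, Callable[[str], dict]]) -> dict:
    """Run the three scoped tools in order and collect their structured output."""
    extra = set(tools) - set(TRIAGE_STEPS)
    if extra:
        # Refuse to run with anything the scope did not name.
        raise ValueError(f"tools outside the triage scope: {sorted(extra)}")
    return {name: tools[name](cluster) for name in TRIAGE_STEPS}

calls: List[str] = []

def make_stub(name: str) -> Callable[[str], dict]:
    """Stub tool that records when it was called."""
    def stub(cluster: str) -> dict:
        calls.append(name)
        return {"tool": name, "cluster": cluster}
    return stub

stub_tools = {name: make_stub(name) for name in reversed(TRIAGE_STEPS)}
report = triage_replica_lag("orders", stub_tools)
print(calls)  # ['diagnose_replica_lag', 'recent_deploys', 'playbook']
```

The call order is a property of the harness, not a choice the model makes per incident, so there is nothing to debug six months later except the tuple.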

Wide scope produces a failure that’s worse than picking the wrong tool. A wide-scope agent given a hybrid problem reaches for one of the relevant tools and silently drops the other half. The scoping decision lived in the model’s pattern-match against the input string, which is the worst place to put a decision the next on-call wants to debug six months from now. A scoped agent doesn’t have to decide. The scope already decided.

The two halves compound. Narrow tools make the right call mechanical inside a scope. Narrow scope makes it obvious which tools the agent will call. A wide agent with narrow tools wastes the narrowness. A scoped agent with broad tools wastes the scope. The Redis diagnostic API in internal tools are AI 10x is what the canonical version looks like in production: one endpoint, present-state answers (key counts, memory pressure, slow-log entries, replication offset), pre-shaped for whoever is asking. The agent never composes Redis commands. It calls the diagnostic and reads structured output. The same shape ports to Postgres, MySQL, Kafka, ES.

Note

The argument is about agents acting on production systems, where a bad tool call has operational consequences and the corpus the agent reads from has years of accumulated drift. Pure-text Q&A bots over a curated help-center corpus are a different regime. The corpus there is small, maintained, and the cost of a bad retrieval is the user reading a stale answer rather than the system running a destructive command. Retrieval is the right primitive in that regime. The argument here is that “retrieval over the production docs corpus” is the wrong primitive for an agent that’s going to act.

When this doesn’t earn its keep

One-shot scripts and ad-hoc analysis where the human reviews every output. The agent’s tool is query(sql) against a sandbox; the human reads the result and discards it. The blast radius is the engineer’s own time.
Low-stakes generative tasks. Commit messages, variable names, refactoring suggestions, the docstring for a private helper. The cost of a wrong output is noticing and editing.
Greenfield code where there’s no accumulated context to be stale. The agent is the only author the repo has seen, the conventions are whatever it wrote yesterday, and there’s no two-year-old runbook to mis-retrieve.
Small teams with a single agent and a single use case. Three tools is the size the agent already has. Building a narrowing layer for something already narrow is overkill.
Genuinely exploratory questions where the agent is supposed to ask broadly. “What in the catalog looks unused” is a question that wants the full catalog, not the thirty-day-active subset. Exploratory questions need exploratory tools.

The engineering work has moved. The 2023 question was how to prompt the model better. The 2026 question is what tool the agent reaches for, and how small the job is that the harness hands it. Most of the reliability budget lives in those two surfaces and not in the model weights. A team adopting agents on production systems will spend more time building agent-shaped APIs and scoping per-task harnesses than tuning prompts, because that is where the floor on hallucination actually moves. The model is going to misattribute and fabricate and confidently quote the wrong thing on some fraction of calls regardless of how the harness is built. The tool the agent calls and the scope of the call are the surfaces the engineer controls. The 2023 runbook for a slot that doesn’t exist anymore stops being a problem when the agent never reads runbooks, only calls the diagnostic that knows what the cluster actually is today.

It's Almost Always the Queries, Part V: Disk Has Two Alarms, Not One

Wed, 27 May 2026 00:00:00 +0000

TL;DR

Two alarms ride the same dashboard tile: the disk filling up, and the disk slowing down. Both have query-level and schema-level fixes that hold for years. Capacity is partition-and-archive, not DELETE. IOPS is covering indexes and the access patterns that go with them. The cloud’s autoscaling and burst credits mask both, and the bill is where the symptom finally surfaces.

On-call gets paged on RDS I/O latency at 2pm Tuesday. The Datadog graph shows read latency at four times its baseline, write latency climbing in lockstep. The engineer on rotation bumps the instance from db.t3.medium to db.t3.large, latency drops back inside the SLO within five minutes, the page closes, and the incident channel goes quiet. Three days later: same alert, same dashboard, same “fix.” Same five-minute window. By the fourth time, somebody pulls the BurstBalance metric out of CloudWatch and the picture changes. The instance upgrade had not actually done anything to the workload. It had reset the gp2 burst-credit pool from zero back to full. The query mix was steady. The variable was the credit accounting, and the dashboard the team was looking at did not graph it.

The obvious fix and why it buys you weeks

Reach for a bigger volume, more provisioned IOPS, a beefier instance class, or all three at once. Each lever is real, and during an active incident with revenue tied to checkout latency, the right call is often whichever one moves the graph fastest. They share a property the postmortem usually skips: each rents capacity proportional to the pattern underneath - the bloat that produces dead pages, the retention nobody set, the SELECT list that stopped being covered when a column got added. The cost recurs every time growth or concurrency pushes the workload back into the same shape. Part I called this renting the bug. The disk case is two bugs sharing one alert.

Two failure modes, two upstream fixes

The disk tile collapses two genuinely different problems into one number. The disk is filling, which is capacity. The disk is slowing, which is IOPS. They have different mechanics, different upstream causes, and different fixes that hold for years rather than weeks. The 2pm incident is almost always the IOPS one. The Friday-afternoon “we’re at 87% storage” thread is the capacity one. Same dashboard, two alarms, two posts’ worth of mechanism.

Capacity is bloat plus growth, and the shape that holds up is partition-and-archive. A team’s first instinct on a too-full disk is to write a DELETE FROM events WHERE created_at < NOW() - INTERVAL '180 days' and ship it. The space does not come back. On PostgreSQL, DELETE marks tuples dead and leaves them on the page; VACUUM makes that space reusable inside the table, but it only returns physical space to the OS when the pages at the very end of the table are empty. VACUUM FULL does return the space, but it is an ACCESS EXCLUSIVE rewrite of the table, which is not a thing you run on a busy production system. On an UPDATE-heavy table, autovacuum can fall behind the dead-tuple production rate, and bloat grows unboundedly until somebody intervenes. InnoDB has its own version of the same problem: deletes and updates fragment the clustered index, and a long-running transaction (an analytics session left open, a misbehaving connection pool, an export that took longer than expected) pins the undo log via the history list and prevents purge from cleaning up. SHOW ENGINE INNODB STATUS lists “History list length” precisely so you can spot the case where purge is losing.

The pattern that holds: partition by date or by tenant, drop or detach old partitions on a schedule, offload the detached partitions to cheaper storage if compliance or analytics still need them. PostgreSQL declarative partitioning (PG 10+) with pg_partman handles the rotation; the extension’s background worker can create new partitions ahead of the curve and run the retention drop on a schedule with no external cron. ALTER TABLE ... DETACH PARTITION turns a partition into a standalone table you can dump and drop, or move to a different tablespace on slower disk. MySQL has the same shape via native PARTITION BY RANGE and ALTER TABLE ... DROP PARTITION, which on InnoDB returns the space directly because each partition is its own tablespace. The space comes back instantly, instead of waiting on VACUUM and never quite catching up. The trade is schema churn upfront, and the rest of the partitioning post is what to know before you commit to a partition key you cannot easily change later.
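The retention decision itself is plain date arithmetic. The sketch below is not how pg_partman does it; it is a hypothetical helper that assumes monthly partitions named with a `_YYYY_MM` suffix and returns the ones that have aged out, which a scheduled job could then turn into DETACH PARTITION statements.

```python
from datetime import date

def partitions_to_detach(partition_names, today, retention_months):
    """Return monthly partitions older than the retention window, oldest first.

    Expects names ending in _YYYY_MM, e.g. events_2026_01. The current month
    and the retention_months before it are kept.
    """
    # Oldest month that must be kept, expressed as a running month index.
    keep_from = today.year * 12 + (today.month - 1) - retention_months
    expired = []
    for name in partition_names:
        year, month = name.rsplit("_", 2)[-2:]
        if int(year) * 12 + (int(month) - 1) < keep_from:
            expired.append(name)
    return sorted(expired)

names = ["events_2026_05", "events_2025_10", "events_2025_11",
         "events_2025_09", "events_2026_04"]
expired = partitions_to_detach(names, date(2026, 5, 27), retention_months=6)
statements = [f"ALTER TABLE events DETACH PARTITION {name};" for name in expired]
print(expired)  # ['events_2025_09', 'events_2025_10']
```

Detaching first and dropping later leaves a window to dump the partition to cheaper storage before the space is released.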

IOPS is access pattern, and more IOPS is the answer to the wrong question. A query that is “well-indexed” can still saturate the disk if the index does not cover the SELECT list. The classic shape: a composite index on (customer_id, status) happily serves WHERE customer_id = $1 AND status = 'open', but the SELECT projects customer_id, status, total_cents, created_at, and the engine follows a heap pointer for each of the few thousand matching rows to fetch the columns the index does not contain. A few thousand random heap fetches per call, multiplied by call volume, is an IOPS load that no amount of provisioning quietly absorbs. The plan looks correct. The dashboard reads “needs more IOPS.” The covering-index post walks the diagnostic in detail; the fix is INCLUDE columns on PostgreSQL 11+ for the projection-only payload, or a reordered composite index on MySQL that puts the projected columns inside the index. Same query, two orders of magnitude fewer pages read, the heap-fetch count drops to zero in EXPLAIN (ANALYZE, BUFFERS) output.
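You can watch a planner make the covered-versus-not distinction without a production database. A minimal sketch on Python's built-in sqlite3, which has no INCLUDE clause, so it demonstrates the MySQL-style variant where the projected columns go into the index key. The `orders` table and the `plan_for` helper are assumptions, and SQLite's plan text is not what PostgreSQL's EXPLAIN prints.

```python
import sqlite3

def plan_for(index_columns):
    """Return SQLite's query-plan text for the lookup, given one index."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE orders (
            id          INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL,
            status      TEXT NOT NULL,
            total_cents INTEGER NOT NULL,
            created_at  TEXT NOT NULL
        )""")
    conn.execute(f"CREATE INDEX orders_lookup ON orders ({index_columns})")
    rows = conn.execute("""
        EXPLAIN QUERY PLAN
        SELECT customer_id, status, total_cents, created_at
        FROM orders
        WHERE customer_id = ? AND status = 'open'""", (1,)).fetchall()
    conn.close()
    # The human-readable plan step is the fourth column of each row.
    return " ".join(row[3] for row in rows)

narrow = plan_for("customer_id, status")
wide = plan_for("customer_id, status, total_cents, created_at")
print("COVERING INDEX" in narrow, "COVERING INDEX" in wide)  # False True
```

With the narrow index every match costs a second lookup into the table; with the wide one the query is answered from the index alone.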

A worked example with real numbers: a February 2026 write-up titled “Between select and disk” documents a single query reading 27,841 blocks (217 MB) to return zero rows - roughly 1,989 IOPS from a query that filtered everything out on the heap because a JSONB predicate could not be evaluated inside a B-tree on account_id. A companion query did the same shape: 12,071 rows fetched, 107 MB, ~1,944 IOPS, zero rows returned. Combined, the two queries demanded ~3,900 IOPS against a 3,000 IOPS provisioned ceiling, with reads briefly hitting 3,668 IOPS as burst credits allowed. The fix was a GIN index that let the JSONB filter run inside the index scan, instead of after the heap fetch. The disk dashboard during the incident read “IOPS saturated”; the actual cause was an index that did not match the predicate.

The query-side moves that keep an index covering: project the columns you actually need (an ORM defaulting to SELECT * defeats coverage the moment any column lands outside the index), prefer keyset pagination over deep OFFSET (a LIMIT 50 OFFSET 100000 reads a hundred thousand index entries to discard them and return the next fifty), match the index’s column order in ORDER BY so the planner skips the sort, and write WHERE predicates the planner can push down to the index’s leading column. Non-SARGable predicates are the third leg of this: wrap a column in a function, lead a LIKE pattern with a wildcard, or force an implicit cast from bigint to text, and the engine evaluates the predicate per row instead of seeking, and the IOPS graph follows. Each of these is a query-level move with no schema change, and each removes IO that more provisioned IOPS would only hide.

The managed-cloud overlay produces false fixes. Three behaviors on AWS, with analogues on Azure and GCP, make the disk dashboard easy to misread. The first is gp2 burst credits (gp3 replaces the credit bucket with a flat 3,000 IOPS baseline, so its ceiling is a provisioned number and not a draining balance). A gp2 volume earns 3 I/O credits per GiB per second up to a 5.4-million-credit cap, sustains 3,000 IOPS while credits last, and falls to its baseline (as low as 100 IOPS on small volumes) when the pool drains. The AWS blog post that introduced the BurstBalance metric in 2016 is still the cleanest reference. A workload that has been steady for months can hit a credit wall during a backup window or an end-of-quarter report, and the latency graph tells you “the disk got slow” without showing you that the disk was throttled because the credit counter hit zero. Bumping the instance, or growing the volume, resets the picture. Three days later the credits drain again. Same incident, same fix, same cycle, and BurstBalance is the metric that closes the loop.
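The credit wall is easier to reason about with the drain rate written out. The sketch below is a simplified model, not an AWS API: the constants are the figures quoted above, the function names are made up for illustration, and it ignores the per-volume maximum and any credits earned back during quiet periods.

```python
# Simplified gp2 burst-credit model. Constants come from the paragraph above;
# function names are illustrative, and the per-volume IOPS maximum is ignored.
CREDIT_CAP = 5_400_000     # maximum I/O credits in the bucket
BURST_IOPS = 3_000         # ceiling while credits last
BASELINE_PER_GIB = 3       # credits earned per GiB per second
MIN_BASELINE_IOPS = 100    # floor for small volumes


def baseline_iops(size_gib: float) -> float:
    """Sustained IOPS once the credit bucket is empty."""
    return max(MIN_BASELINE_IOPS, BASELINE_PER_GIB * size_gib)


def seconds_until_throttled(size_gib: float, demand_iops: float,
                            credits: float = CREDIT_CAP) -> float:
    """Seconds the given credit balance lasts under a steady demand.

    Returns infinity when demand is at or below the baseline, because the
    bucket then refills at least as fast as it drains.
    """
    served = min(demand_iops, BURST_IOPS)
    drain_per_second = served - baseline_iops(size_gib)
    if drain_per_second <= 0:
        return float("inf")
    return credits / drain_per_second
```

Under these assumptions a 100 GiB volume driven at the full 3,000 IOPS empties a full bucket in 2,000 seconds, about 33 minutes, which is roughly the length of a backup window.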

The second is Aurora’s no-disk-in-the-traditional-sense model. Aurora storage scales transparently to 128 TB, so the disk-full alarm never fires. On Aurora Standard, IO is billed per request, and the alert nobody sets is on the bill. In May 2023, AWS announced Aurora I/O-Optimized, a flat-rate pricing option that removes per-IO charges in exchange for a higher instance and storage rate. The break-even, per AWS’s own guidance, is roughly 25% of total Aurora spend going to IO; above that, I/O-Optimized wins, below that, Standard does. VGS’s case study from May 2025 puts numbers on it: their Aurora:StorageIOUsage was 30–40% of daily Aurora cost, traced to a Monday cleanup cron job concentrating millions of I/O operations into one window, and the move to I/O-Optimized cut their overall Aurora bill by roughly 20%, which at their scale was hundreds of thousands of dollars per year. The point is not the calculator. The point is that on Aurora the failure mode is not a graph that goes red, it is an invoice line item that climbs, and the cause is the same access pattern that would have shown up as IOPS saturation on RDS.

The third is RDS storage autoscaling. Enable it, and the disk-full alarm never fires because the volume grows automatically up to the configured ceiling. The bloat keeps growing, the retention policy still does not exist, and the issue surfaces six months later at finance review when the storage line is double what it was. Autoscaling is fine; running it without a retention policy underneath turns “we need to delete old data” into “we need to delete old data and reclaim a terabyte of provisioned storage we’re paying for.”

What this costs

Each upstream fix has trade-offs the postmortem should name out loud.

Partition-and-archive is schema churn upfront and operational scaffolding forever. Partition key choice, query routing across detached partitions, and the rotation tooling itself are the trade-offs worth making in advance, and the partitioning post is the canonical reference for that decision pass. The thing to keep in mind here is that none of it looks urgent until the disk is full, and that is the wrong moment to design a partition strategy.

Covering indexes are write amplification and storage overhead. Every INSERT and UPDATE to a covered column writes to every index that covers it; an INCLUDE clause adds payload columns to the index leaf without making them part of the key, which keeps the index smaller than a wide composite but still means the leaf gets updated on every write to those columns. A covering index designed for today’s SELECT list ages out the moment the SELECT list grows; the same query pattern from the covering-index post reappears six months later when a new feature adds a column. Adding INCLUDE columns interacts with PostgreSQL’s HOT update path too: HOT updates need the new tuple to fit on the same page and not modify any indexed columns, and a wider index payload combined with a fillfactor near 100% can starve the HOT optimization without changing any query. ALTER TABLE ... SET (fillfactor = 90) for write-heavy tables is the standard accompaniment to wide covering indexes, and it is the easy thing to forget.

Cloud-side moves are mostly upside, with one trap worth naming. Aurora I/O-Optimized’s break-even moves with workload. A cluster that was fine on Standard last quarter can cross the I/O-heavy threshold this quarter and nobody notices until the next bill review. AWS publishes an estimator using CloudWatch metrics for the recalculation; running it quarterly catches the drift.

Warning

The most common partition-and-archive footgun is queries that span the archive boundary returning silently incomplete results. A report that used to read three years of data still asks for three years, the partition that holds year three has been detached and archived to S3, and the query returns two years of data with no error. Once is a bug. Recurring is an architecture problem. The fix is making the boundary explicit, either by routing historical queries to a federated view that includes the archive (CREATE FOREIGN TABLE on PostgreSQL, or a UNION ALL against a separate archive schema), or by rejecting queries that ask for ranges the live table does not cover. Failing loud beats answering wrong.

When this doesn’t apply

Three cases where the hardware reading is right and the schema reading is not.

A working set that genuinely does not fit in RAM. If a hot table is 12 GB on an 8 GB instance and the top of pg_stat_statements is dominated by reads against that table, no partition strategy and no covering index change the fact that the buffer cache is too small. The wait events tell the story: IO:DataFileRead dominating the active-session-by-wait graph in Part IV’s terms. The fix is RAM, or a smaller working set, and “smaller working set” usually means partitioning so the active subset shrinks, which means the line between “more RAM” and “fewer rows” is fuzzier than the framing suggests.

Snapshot or backup operations consuming live IOPS during a known window. If the latency spike lines up with the 2am backup window, or with a once-a-month consistency check, and the rest of the day is fine, the answer is scheduling and IO throttling rather than query optimization. RDS snapshots are incremental and cheap to take, but the first snapshot on a fresh volume and any major change to the dataset force a full sweep that competes with live traffic.

A one-time migration off a system that should have moved to cheaper storage years ago. If the disk is full of 2017 data that nobody has read in three years, the fix is dumping it to S3 once and reclaiming the space, not designing a rotation strategy for data that is not being produced anymore. Partition-and-archive is for recurring patterns. One-off cleanup is one-off cleanup.

The bigger picture

Capacity and IOPS are the slowest-to-alert resources a relational database has, and on managed cloud the autoscaling, burst credits, and per-IO billing models hide the cause while the bill quietly absorbs the symptom. Fix-once strategies survive workload growth in a way that “bigger instance” does not: a partition rotation dropping a month every month is not less effective the year you triple traffic, and a covering index that touches zero heap pages is not less effective when the table grows tenfold. The diagnostic discipline is the same one running through Parts II–IV. Pull the top-10 from pg_stat_statements by total_exec_time, read the plan with BUFFERS, check BurstBalance and the storage autoscaling history before resizing the volume. The instance type is the last thing to change.

It's Almost Always the Queries, Part IV: When the Sort Spills

Thu, 21 May 2026 00:00:00 +0000

TL;DR

A database alerting on memory pressure is almost never a workload that needs more RAM. The dangerous allocation is per-sort, per-hash, per-connection and transient, so the instance-level memory graph never shows it until data size and concurrency line up at the same instant. The exception is a genuinely large working set with a low cache hit ratio, and that one is real.

A reporting endpoint sorts orders by created_at for an account’s quarterly export. Last quarter it ran in about 200 ms. This quarter it takes four seconds, and nobody changed the query. The EXPLAIN ANALYZE output is the whole story:

-- last quarter
Sort (cost=8420.11..8556.34 rows=54492 width=84)
 Sort Key: created_at
 Sort Method: quicksort Memory: 3072kB

-- this quarter
Sort (cost=41922.88..42698.05 rows=310068 width=84)
 Sort Key: created_at
 Sort Method: external merge Disk: 24960kB

The data grew, which is the one thing that always happens. The sort set crossed work_mem, the executor stopped sorting in memory and started an external merge against a temp file on disk, and a CPU-bound operation became an IO-bound one. No crash, no OOM kill, no page. The memory dashboards show nothing, because a temp file written by one backend for the duration of one sort is not a number that appears on an instance-level “RAM used %” graph. The query just got slow, and the only place the cause is visible is in a plan nobody ran.

More RAM doesn’t raise work_mem

The reflex is to resize the box. The endpoint is “running out of memory,” the instance has 32 GB, so move it to 64 and the sort has room. It’s a clean story and it is wrong, for a reason that is structural rather than situational.

The sort spilled because it exceeded work_mem, and work_mem is a fixed per-operation budget that has nothing to do with how much RAM the box has. The default is 4 MB. It is 4 MB on a 4 GB instance and 4 MB on a 256 GB instance. Doubling the instance does not move that number by a byte. MySQL’s per-thread buffers behave the same way: sort_buffer_size defaults to 256 KB and stays 256 KB on a 4 GB box and on a 256 GB box, exactly as work_mem does. The export’s sort set is 24 MB; it will spill on the bigger box exactly as it spilled on the smaller one, because the limit it crossed is a config value, not a quantity of physical memory. Part I called this renting the bug. The memory case is the cleanest version of the metaphor: you can buy RAM the database is structurally unable to point at the operation that needs it.

So the real fix looks like raising work_mem. Set it to 64 MB, the export’s sort fits in memory, the endpoint is fast again. That works for the export. It also arms a multiplier across every other connection on the system, and the multiplier is the actual subject of this article.

The memory the dashboard can’t see

work_mem is not a global pool the server draws down. It is a per-operation allowance, granted independently to every sort, hash, and materialize node, in every query, in every session, at the same time. The PostgreSQL “Resource Consumption” documentation states the consequence directly:

Note that a complex query might perform several sort and hash operations at the same time, with each operation generally being allowed to use as much memory as this value specifies before it starts to write data into temporary files. Also, several running sessions could be doing such operations concurrently. Therefore, the total memory used could be many times the value of work_mem.

“Many times” is the load-bearing phrase, and it has three separate multipliers stacked inside it.

The first is plan shape. A report query that joins four tables, aggregates, and sorts the result does not allocate work_mem once. It allocates it per memory-using node: a hash for each hash join, a sort for the ORDER BY, another for a DISTINCT, a hash for the GROUP BY. Four memory-using nodes in one plan is ordinary, not pathological. That single query’s peak is 4 × work_mem, not work_mem.

The second is hash_mem_multiplier. PostgreSQL 13 introduced the setting to give hash-based nodes their own limit, and since PostgreSQL 15 it defaults to 2.0, so hash-based nodes get a larger allowance than sort-based ones. The “Resource Consumption” docs: “The final limit is determined by multiplying work_mem by hash_mem_multiplier. The default value is 2.0, which makes hash-based operations use twice the usual work_mem base amount.” A hash join in that plan is not budgeted at work_mem. It is budgeted at 2 × work_mem, on a current default, before anyone has touched a setting.

The third is parallel workers. A parallel sequential scan feeding a parallel hash or sort gives each worker its own work_mem allocation for its slice of the work. A query with max_parallel_workers_per_gather set to 4 can have five processes (the leader plus four workers) each holding a work_mem-sized hash for the same node.

Now do the arithmetic the way Part III did it with CPU-seconds. You raised work_mem to 64 MB to stop the export from spilling. A moderately complex query has four memory-using nodes, two of them hashes that get the 2.0 multiplier. Call its peak a conservative 64 MB × 6 once the hash multiplier is counted, roughly 384 MB for one execution. Your Rails app runs 300 worker processes, each holding a connection, and on a busy afternoon 80 of them are running a query of about that shape at the same instant. 384 MB × 80 is just over 30 GB of transient sort and hash memory, none of it in shared_buffers, none of it visible on the memory graph until the second it is all allocated at once. The instance has 32 GB. The export’s sort fit. The OOM killer arrives anyway, two weeks after the change, with no obvious deploy to blame, because the change that caused it was a config edit and the trigger was an ordinary Tuesday with slightly more concurrency than Monday.
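The same multiplication, written as a rough Python sketch. The function names are illustrative, and the model assumes every memory-using node fills its budget at the same moment, so it is a worst-case bound, not a measurement. The 64 MB, 2.0, and 80 figures are the ones used in the text.

```python
def query_peak_mb(work_mem_mb, sort_nodes, hash_nodes,
                  hash_mem_multiplier=2.0, processes_per_node=1):
    """Worst-case transient memory for one execution of a plan, in MB.

    Sort nodes are budgeted at work_mem; hash nodes at
    work_mem * hash_mem_multiplier. processes_per_node models the leader
    plus parallel workers each holding their own allocation (an upper bound).
    """
    per_process = work_mem_mb * (sort_nodes + hash_nodes * hash_mem_multiplier)
    return per_process * processes_per_node


def fleet_peak_mb(per_query_mb, concurrent_queries):
    """Worst-case transient memory across all concurrent executions, in MB."""
    return per_query_mb * concurrent_queries


per_query = query_peak_mb(64, sort_nodes=2, hash_nodes=2)  # 384.0
total = fleet_peak_mb(per_query, 80)                       # 30720.0
```

At the 4 MB default the same plan peaks at 24 MB per execution and 1,920 MB across 80 concurrent executions, which is why the system survived for years before the config edit.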

The arithmetic above assumes the plan goes as costed. But the planner chose that plan from estimates, not from the rows it would actually find. Sort versus no-sort, hash join versus merge join, HashAggregate versus GroupAggregate are all decided at planning time from row-count guesses in pg_statistic. A stale estimate, or an n_distinct that was always wrong, and the planner picks a memory-hungry plan accurate stats would have avoided, or under-sizes an allocation that blows past work_mem at runtime. GROUP BY is the sharp case. Before PostgreSQL 13, released 2020-09-24, a HashAggregate chosen on a too-low n_distinct estimate had no disk fallback at all: the hash table grew past work_mem with no bound and could OOM the server. PG13 added the spill, and its release notes state the new behavior plainly. Hash aggregation “was avoided if it was expected to use more than work_mem memory”; now the plan can be chosen anyway, and “the hash table will be spilled to disk if it exceeds work_mem times hash_mem_multiplier.” An uncapped OOM became the same external merge trade the export opened on, and the data never had to grow to trigger it. The estimate only had to go stale.

The same structure runs on MySQL with different knob names. The per-thread buffers (sort_buffer_size, join_buffer_size, read_rnd_buffer_size) are allocated per connection, and the MySQL documentation is explicit that join_buffer_size can be allocated more than once for a single query, once per join that cannot use an index. Multiply the per-thread total by max_connections and you have MySQL’s version of the same blind multiplier. MySQL’s optimizer is just as estimate-bound: it picks join order and access paths from index cardinality (refreshed by ANALYZE TABLE, stored under innodb_stats_persistent) and optional column histograms, and stale cardinality pushes it onto a join that cannot use an index and falls back to a block-nested-loop on join_buffer_size. Internal temporary tables add a second multiplier: under the MySQL 8.0 TempTable engine, an in-memory temp table grows until it hits tmp_table_size, at which point, in the documentation’s words, “MySQL automatically converts the in-memory internal temporary table to an InnoDB on-disk internal temporary table.” temptable_max_ram (default 1 GiB) caps the engine’s total RAM before it spills to memory-mapped files. The MySQL spill is the same event as the Postgres external merge, reached through a slightly different accounting path.

Note

This is also why “give the database more memory” so often goes to the wrong place. Told to add memory, teams enlarge shared_buffers (or innodb_buffer_pool_size). That is the buffer cache, fixed at server startup, and it does nothing for sorts and hashes, which allocate from a separate per-backend region. Worse, oversizing it starves the resource the database quietly depends on. PostgreSQL uses no direct IO; every page read goes through the operating system page cache, and RAM you did not hand to shared_buffers or to backends is not idle, it is that cache. The PostgreSQL wiki tuning guide puts the starting point at “1/4 of the memory in your system” and warns that “it’s unlikely you’ll find using more than 40% of RAM to work better than a smaller amount,” precisely because the OS cache needs the rest. Every memory area in this article is a facet of one accounting problem: the instance metric sums physical pages, and none of the allocations that actually break the database are visible at that resolution.

The OOM killer is where the silent spill stops being silent. Linux overcommits memory by default: it hands out address space freely on the assumption that not every process touches all of it. When resident memory across the system exceeds RAM plus swap, the kernel invokes the OOM killer, which picks a victim by oom_score and sends it SIGKILL. If the victim is a PostgreSQL backend, the damage does not stop at that one connection. The PostgreSQL documentation on managing kernel resources explains that the kernel “might terminate the PostgreSQL postmaster” outright, and even when it takes a backend instead, a backend killed by SIGKILL had no chance to release its locks or detach cleanly from shared memory. The postmaster can no longer assume shared memory is consistent, so it does the only safe thing: it terminates every other backend, and the entire instance runs crash recovery. One report query, on one connection, restarts the whole database. MySQL gets to the same place by a shorter route. mysqld is a single multi-threaded process, so the OOM killer has exactly one target; kill it and the entire server goes down at once, and on restart InnoDB runs crash recovery by replaying its redo log. Both engines end in a full restart, Postgres because the postmaster cascades the kill outward and MySQL because there was only ever one process to kill.

Warning

This is the failure mode behind a real, dated incident. In a March 2026 write-up titled “work_mem: it’s a trap!”, PostgreSQL contributor Lætitia Avrot walked through a production cluster with 2 TB of RAM that the OOM killer reaped. work_mem on that cluster was 2 MB, below the 4 MB default, not some reckless 1 GB. A single badly structured query accumulated allocations inside one ExecutorState memory context faster than anything released them. The context’s dump showed 524,059 separate chunks. That memory is not freed until the operation finishes, and the operation never finished, so it climbed until 2 TB was gone. A 2 MB setting and a 2 TB box, and the box still lost. The problem was never the size of the box.

Fixes, and what each one costs

Do not raise work_mem globally. The export needs 64 MB; the rest of the workload does not, and the global setting applies the change to every connection whether it needs it or not. Raise it where the big sort actually runs: SET LOCAL work_mem = '256MB' inside the reporting transaction, so the value reverts at commit or rollback (a plain SET lasts for the rest of the session), or ALTER ROLE analytics SET work_mem = '256MB' so the change attaches to the role the reports run as and the OLTP path keeps the small default. MySQL’s per-thread buffers are session-settable in the same way. Keep the my.cnf global values for sort_buffer_size, join_buffer_size, and tmp_table_size small, and SET SESSION sort_buffer_size = ... on the reporting connection. The cost is that this requires knowing which workload is which. It assumes the reporting queries connect as a distinguishable role or run through a distinguishable code path, and on a system where the web app and the nightly export share one database user, that separation is work you have to do first, on either engine.

A connection pooler bounds the other multiplier. The arithmetic above had 80 concurrent heavy queries because 300 app workers each held a real backend. Put PgBouncer in transaction mode in front, sized to a 40-connection pool, and the database can never run more than 40 backends no matter how many app workers exist. The multiplier is capped at 40 instead of 300. MySQL’s analogue is ProxySQL, an external connection multiplexer that fronts the server the way PgBouncer fronts Postgres, plus the thread pool plugin shipped in MySQL Enterprise Edition and in Percona Server, which caps the number of threads executing at once. Bounding concurrent threads bounds the per-thread-buffer multiplier the same way bounding backends bounds it on Postgres. Part III covered pool sizing for the CPU case, and the reasoning transfers exactly: a query that waits briefly for a pool slot and then runs is cheaper than one that starts immediately and helps exhaust memory. The cost is latency under burst, and a pool sized too small turns into its own incident when every slot is held and the queue backs up.

The fix that removes the spill instead of feeding it is fixing the query. The export sorts by created_at; an index on (account_id, created_at) lets the planner return rows already in order and the Sort node disappears from the plan entirely, no work_mem consumed because no sort happens. Keeping statistics fresh is part of the same fix: autovacuum runs ANALYZE, but a bulk load or a fast-growing table outruns it, and a manual ANALYZE (or a raised per-column target via ALTER TABLE ... ALTER COLUMN ... SET STATISTICS, or CREATE STATISTICS for correlated columns) keeps the planner’s estimate close enough to reality that it sizes the plan correctly. MySQL’s equivalent is ANALYZE TABLE for index cardinality and ANALYZE TABLE ... UPDATE HISTOGRAM ON ... for column histograms, with innodb_stats_auto_recalc governing whether InnoDB refreshes cardinality on its own. The diagnostic tell is the one Part II already named: EXPLAIN ANALYZE showing estimated rows and actual rows diverging by orders of magnitude means the planner is flying blind. A hash join’s allocation is proportional to the rows on its build side, so a predicate that filters earlier, or an index that avoids scanning rows the query then discards, shrinks the hash. This is the crossover with the rest of the series: a non-SARGable predicate that forces a scan also inflates every downstream sort and hash that scan feeds, and a covering index that quietly stopped covering when a column joined the SELECT list adds heap fetches that widen the rows a sort has to buffer. Tuning the query to be smaller per call is the per-operation answer; Part V takes the IO side of it further.

Leave shared_buffers near the conventional fraction and do not raise it to “use the memory.” The page cache needs that headroom, and on a managed service the default is usually already in the sane range. The kernel side is worth one deliberate pass: vm.overcommit_memory controls how freely Linux hands out address space, and setting the postmaster’s oom_score_adj lower than its backends’ (PostgreSQL ships PG_OOM_ADJUST_FILE and PG_OOM_ADJUST_VALUE for exactly this) means that when the OOM killer does fire, it reaps a single backend and not the supervisor. That converts a full crash-recovery restart into one dropped connection. It does not fix the multiplier; it makes the multiplier’s worst day cheaper. vm.overcommit_memory is an OS-level setting that applies to both engines, but the per-process oom_score_adj trick has nothing to work with on MySQL. There is no supervisor and worker split, so one mysqld process means nothing to spare. For MySQL the levers are the system-wide overcommit setting and the discipline of not over-provisioning the per-thread buffers in the first place.

When more RAM is the honest answer

The thesis has real exceptions, and a staff engineer wants the boundary, not the slogan.

A genuinely large working set is the first one. If the buffer-cache hit ratio is low and the wait events are dominated by IO:DataFileRead, the database is going to disk because the data it needs does not fit in RAM, and that is a capacity problem that more memory honestly solves. The tell is in the waits, not the memory graph: steady IO waits on reads, a hit ratio that has been falling for weeks. This shades into Part V’s territory, where the line between “needs more RAM for cache” and “needs more IOPS” gets drawn properly.

A correctly isolated analytical workload is the second. A reporting role that runs deliberate large sorts, on its own connection budget, behind a pooler that caps its concurrency, genuinely benefits from a high work_mem for that role. Raising it there is not the bug this article describes. The bug is raising it globally and arming it across 300 OLTP connections. Scoped to a role that runs ten concurrent queries at most, a large work_mem is a correct decision.

And plainly low concurrency. A database with a 20-connection pool and a workload that never runs more than a handful of heavy queries at once has a small multiplier, and raising work_mem globally on that system is safe because work_mem × 20 is a number the box can hold with room to spare. The multiplier is only dangerous when it is large. Measure it before assuming it is.

The number that isn’t on the graph

The memory graph on the instance dashboard is an honest number. It reports resident physical pages, sampled every minute, and it is the wrong instrument for this failure for the same reason the slow-query log was the wrong instrument in Part III. The slow-query log filters for queries that are individually expensive and misses the cheap query run a million times. The memory graph reports memory that is allocated right now and misses the allocation that is small per operation, multiplied by plan nodes, by the hash multiplier, by parallel workers, and by concurrent connections, and exists only in the instant all of those line up. The export sort that spilled to a 24 MB temp file and the 2 TB cluster the OOM killer reaped are the same failure at two scales: a per-operation cost the instance metric was never built to see. Resize the box and you move the ceiling that cost climbs toward. You do not change the cost, and the cost is set by the query and the connection count, the same two numbers every part of this series keeps coming back to.

It's Almost Always the Queries, Part III: When the CPU Is Pegged

Tue, 19 May 2026 00:00:00 +0000

TL;DR

A relational database pinned at 100% CPU is almost never running one expensive query. It’s running a cheap one too many times. The slow-query log and mean-time sorting both look right past it; total_exec_time is the only view that finds it.

A client portal has a status dropdown at the top of the orders page. It shows a count next to “Open”: SELECT COUNT(*) FROM orders WHERE status = 'open' AND account_id = $1. For any given account that’s around 200 rows. The query runs in 0.4 ms. It has run in 0.4 ms for two years.

Then marketing buys a Super Bowl ad. Thirty seconds of airtime, a short URL, and for the next twenty minutes the portal takes the kind of traffic it normally sees in a quarter. Every visitor lands on the orders page. Every page render fires that COUNT(*). The primary’s CPU graph goes from a comfortable 35% to a flat 100% ceiling and stays there. Checkout latency triples. The on-call engineer pulls up the slow-query log to find the offending statement and the log is empty. Nothing crossed the 100 ms threshold. The slowest query in pg_stat_activity right now is 12 ms. Sorted by mean execution time, the dropdown count doesn’t appear until page four.

It is, by a wide margin, the cheapest query in the system. It is also the entire problem.

More CPU rents the bug

Resize the instance. Double the vCPUs, the graph drops from 100% to 55%, the incident closes. This works, and during a live traffic spike with revenue on the line it is often the correct first move. Part I called this renting the bug, and the CPU case is the cleanest example of why the metaphor holds.

The cost of that COUNT(*) is linear in traffic. A box twice the size moves the ceiling, it does not change the slope. The dropdown still fires once per page render, the render count still tracks visitor count, and visitor count for a business that just discovered TV advertising only goes up. The next campaign, or organic growth over two quarters, walks the bigger box back to the same 100%. Each round of resizing buys time proportional to the headroom purchased, and the bill recurs.

The other reflex is more replicas. For a read-heavy workload that genuinely is read-heavy, spreading reads across replicas is sound. It does not help here for a reason worth being precise about: the COUNT(*) is not slow because the primary is contended. It is slow-in-aggregate because it does real work every single call, and that work executes on whatever node serves the query. Move it to a replica and the replica’s CPU pegs instead. You have not removed the work. You have bought another machine to do it.

Why a 0.4 ms query saturates a core

The arithmetic is the whole mechanism. A query that averages 0.4 ms and fires 50,000 times a minute consumes 50,000 × 0.0004 = 20 CPU-seconds of work every 60 seconds of wall-clock time. That’s one core, a third occupied, by one statement. Push the campaign traffic to 150,000 calls a minute and that statement alone wants a full core. Add the other queries the orders page fires (the order list, the account lookup, the session check) and a handful of cores disappear into a workload where no individual query is doing anything you’d call slow.
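As a small Python sketch, assuming the query is CPU-bound for its whole execution time (so mean execution time stands in for CPU time per call); the function name is illustrative:

```python
def cores_consumed(calls_per_minute, mean_ms):
    """Average number of cores one statement keeps busy.

    Assumes each call spends its full mean execution time on CPU.
    """
    cpu_seconds_per_minute = calls_per_minute * mean_ms / 1000
    return cpu_seconds_per_minute / 60


normal = cores_consumed(50_000, 0.4)     # about 0.33 of a core
campaign = cores_consumed(150_000, 0.4)  # about 1.0 core
```

The per-call cost never appears in the result on its own. Only the product of cost and call count does, which is the quantity the slow-query log never computes.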

This is why the slow-query log is the wrong instrument. A log thresholded at 100 ms is a filter for queries that are individually expensive. The CPU-bound failure mode is a population of queries that are individually trivial and collectively enormous. The log is working exactly as designed; it is designed to miss this. Mean execution time has the same blind spot. The dropdown count’s mean is 0.4 ms and will stay 0.4 ms while it burns a full core, because the mean says nothing about how often the query runs.

The view that sees it is total execution time. pg_stat_statements ordered by total_exec_time, or MySQL’s events_statements_summary_by_digest ordered by SUM_TIMER_WAIT, multiplies per-call cost by call count, which is the number that actually maps to CPU consumed. Sort by that column and the dropdown COUNT(*) is on the first row, with a call count an order of magnitude above anything else in the list. Part II’s Step 6 is the procedure for getting there; this is the failure mode it was written for.
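A small Python sketch of why the two views disagree. The statement names and numbers are invented for illustration, and the slow-log filter is simplified to compare the mean against the threshold, which assumes each execution takes roughly the mean.

```python
from dataclasses import dataclass


@dataclass
class StatementStats:
    query: str
    calls: int
    mean_ms: float

    @property
    def total_ms(self) -> float:
        # The quantity total_exec_time / SUM_TIMER_WAIT represents:
        # per-call cost multiplied by call count.
        return self.calls * self.mean_ms


# Hypothetical one-minute sample.
sample = [
    StatementStats("dropdown COUNT(*)", 150_000, 0.4),
    StatementStats("monthly revenue report", 2, 900.0),
    StatementStats("order list for account", 12_000, 1.5),
]

SLOW_LOG_THRESHOLD_MS = 100.0

# What a slow-query log thresholded at 100 ms would surface.
slow_log = [s.query for s in sample if s.mean_ms >= SLOW_LOG_THRESHOLD_MS]

# What sorting by total execution time surfaces.
by_total = sorted(sample, key=lambda s: s.total_ms, reverse=True)

print(slow_log)
print([(s.query, round(s.total_ms)) for s in by_total])
```

In this sample the slow log names only the report, which accounts for under two seconds of work, while the count that never crosses the threshold accounts for about sixty.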

Note

pg_stat_statements aggregates since the last pg_stat_statements_reset() or server start, so the top of the list reflects history, not the current minute. During an incident, reset it and wait sixty seconds, or compare two snapshots a minute apart. A query that dominates a clean sixty-second window is the one burning CPU right now, not the one that happened to run a lot last Tuesday.
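The two-snapshot comparison can be sketched in a few lines of Python. The snapshots here are hypothetical dictionaries of cumulative total_exec_time in milliseconds keyed by a statement label; in practice they would come from two reads of pg_stat_statements taken a minute apart. The sketch assumes no reset happened between the two reads, since a reset would make the differences meaningless.

```python
def window_totals(
    before: dict[str, float], after: dict[str, float]
) -> list[tuple[str, float]]:
    """Execution time accumulated between two snapshots, largest first."""
    deltas = {
        statement: total - before.get(statement, 0.0)
        for statement, total in after.items()
    }
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical cumulative totals (ms). The backfill dominates history
# but did nothing during the window being measured.
snap_t0 = {"count_open_orders": 4_000_000.0, "tuesday_backfill": 90_000_000.0}
snap_t1 = {"count_open_orders": 4_060_000.0, "tuesday_backfill": 90_000_000.0}

print(window_totals(snap_t0, snap_t1))
```

Sorted by the cumulative column, the backfill wins. Sorted by the difference, the count does, and the count is the one burning CPU during the incident.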

COUNT(*) has no shortcut, and that’s structural

The reason this particular query does real work every call, rather than returning a cached number, is MVCC. Under multi-version concurrency control, two transactions running at the same instant can correctly see different row counts for the same table, because each sees the snapshot consistent with its own start. There is no single true count the database could cache and hand back. The PostgreSQL wiki’s “Slow Counting” page states it plainly: PostgreSQL “must walk through all rows to determine visibility,” which “normally results in a sequential scan reading information about every row in the table.” Citus’s 2016 write-up on counting performance puts the same point in one sentence: “There is no single universal row count that the database could cache, so it must scan through all rows counting how many are visible.”

InnoDB works the same way for the same reason. The MySQL documentation notes that InnoDB does not keep an internal stored row count, because a single counter cannot be correct for all transactions at once, and processes COUNT(*) by traversing the smallest available index. This is the detail behind a stubborn piece of stale advice. MyISAM, the old default engine, did keep an exact row count in table metadata, so SELECT COUNT(*) against a MyISAM table really was a constant-time metadata read. Advice written in that era (“COUNT(*) is free, don’t worry about it”) survived the engine that made it true. On InnoDB it is wrong.

An index helps, with conditions. A COUNT(*) filtered by an indexed column scans the index instead of the heap, and PostgreSQL’s index-only scan can satisfy a count from the index alone, but only for the pages the visibility map marks all-visible. The visibility map is maintained by VACUUM. On a table with steady write traffic and autovacuum falling behind, a growing fraction of pages are not marked all-visible, the index-only scan falls back to heap fetches for those pages, and the count gets slower precisely when the table is busiest. The shortcut exists. It is conditional on vacuum keeping pace, and a write-heavy table under a traffic spike is the exact case where vacuum is least likely to be winning.

The CPU-bound family

The dropdown COUNT(*) is the canonical case because it is so cheap per call that it defeats every individually-focused diagnostic. The same shape (cheap-or-medium per call, ridden to 100% by call volume or concurrency) shows up in four other forms worth recognizing on sight.

The first is aggregation. GROUP BY, SUM, AVG, DISTINCT, and the dashboard rollups built on them spend CPU on hashing and sorting, and the work is proportional to rows scanned, not rows returned. A “revenue by region this month” tile that scans 4 million order rows to return 6 numbers does 4 million rows of work every time someone loads the dashboard. One analyst with the dashboard open and a 30-second auto-refresh is 2,880 full rollups a day. A team of forty analysts, each with it open, is a standing CPU load that has nothing to do with how many people are actually looking. The query is not slow. It is medium, and it runs constantly.
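The dashboard numbers above work out as follows. This is a back-of-the-envelope sketch in Python with illustrative function names, and it assumes every dashboard stays open around the clock, which is the worst case the paragraph describes.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def rollups_per_day(refresh_seconds: int, open_dashboards: int = 1) -> int:
    """Full rollup executions per day from auto-refreshing dashboards."""
    return (SECONDS_PER_DAY // refresh_seconds) * open_dashboards


def rows_scanned_per_day(
    rows_per_rollup: int, refresh_seconds: int, open_dashboards: int
) -> int:
    """Rows the database reads per day to serve those rollups."""
    return rows_per_rollup * rollups_per_day(refresh_seconds, open_dashboards)


print(rollups_per_day(30))                       # one analyst
print(rollups_per_day(30, 40))                   # forty analysts
print(rows_scanned_per_day(4_000_000, 30, 40))   # rows read to show 6 numbers
```

Forty open tabs at a 30-second refresh is 115,200 rollups a day, each one a 4-million-row scan, whether or not anyone is looking at the screen.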

The second is parallel-worker starvation. PostgreSQL runs large scans and aggregates across parallel workers drawn from a shared, server-wide pool capped by max_parallel_workers. The PostgreSQL documentation on parallel query is explicit that workers come from one pool and “the requested number of workers may not actually be available at run time.” A few heavy analytics queries, each fanning out to max_parallel_workers_per_gather workers, can drain that pool. Everything else then runs with fewer workers than the planner costed for, or serially. The symptom is strange: CPU pegged, yet many sessions appear to be waiting rather than working, because the plan each one is executing was costed for workers it never received.

The third is the connection pool sized past the core count. A database with 16 cores can do at most 16 things at once. Point 400 active connections at it and the operating system time-slices 400 runnable processes across 16 cores, and an increasing share of every core’s time goes to context switching rather than query execution. The PgBouncer documentation and the conventional sizing guidance both land near the same place: a pool sized close to the core count, often (cores × 2), beats a much larger pool. The HikariCP “About Pool Sizing” page makes the underlying point bluntly: running two queries sequentially is always faster than time-slicing them across one core, and its Oracle benchmark cut response times from roughly 100 ms to 2 ms by shrinking a pool from 2,048 connections to 96. CPU reads as 100%, but a measurable fraction of it is the scheduler shuffling processes, not the database answering questions.
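The sequential-versus-time-sliced point can be checked with a toy model. The Python sketch below simulates one core running equal-sized queries either back to back or round-robin in fixed quanta. The function names, the 1 ms quantum, and the switch cost are all assumptions made for illustration; a real scheduler is far more involved.

```python
def sequential_finish_times(work_ms: list[float]) -> list[float]:
    """Finish time of each query when one core runs them back to back."""
    finish, clock = [], 0.0
    for work in work_ms:
        clock += work
        finish.append(clock)
    return finish


def time_sliced_finish_times(
    work_ms: list[float], quantum_ms: float = 1.0, switch_cost_ms: float = 0.0
) -> list[float]:
    """Finish time of each query when one core round-robins between them."""
    remaining = list(work_ms)
    finish = [0.0] * len(work_ms)
    clock = 0.0
    while any(r > 0 for r in remaining):
        runnable = [i for i, r in enumerate(remaining) if r > 0]
        for i in runnable:
            if len(runnable) > 1:
                clock += switch_cost_ms  # swapping a process onto the core
            run = min(quantum_ms, remaining[i])
            clock += run
            remaining[i] -= run
            if remaining[i] <= 0:
                finish[i] = clock
    return finish


queries = [10.0, 10.0]  # two queries needing 10 ms of CPU each
print(sequential_finish_times(queries))
print(time_sliced_finish_times(queries))
print(time_sliced_finish_times(queries, switch_cost_ms=0.5))
```

Even with a free context switch, time-slicing finishes nothing sooner: the first query completes at 19 ms instead of 10 ms and the second still completes at 20 ms. Any nonzero switch cost pushes both later.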

The fourth is fan-out from a distributed service mesh. Picture twelve microservices, each running six instances, and every instance polling the database on its own timer for state it treats as live: a config row it reloads every few seconds, a routing table it re-reads on a schedule. Each of the seventy-two issues a modest, reasonable number of queries, and none of it looks alarming on that service’s own dashboard. The database underneath sees the sum, a baseline load that no team owns and no team’s monitoring displays. What makes this its own shape, rather than the volume problem restated, is that every caller is asking the identical question and getting back the identical answer. The dropdown COUNT was parameterized per account, and every execution did different work for a different row. Here, seventy-two instances poll for one config row that reads the same for all of them, because the database is the single place every service agrees holds the truth. The work is real and redundant: seventy-one of those reads per tick exist only because nothing in front of the database kept the answer the seventy-second already fetched.

Warning

The unread-count badge is the fan-out problem hiding in your own frontend. A “you have N notifications” badge in the global nav re-runs its COUNT on every page load of every authenticated user. It is not one feature’s query, it is a tax on every route in the application. An ORM N+1 in a list view is the same pattern, this time from ORM coupling: one page render silently expands into one query per row, and at list-page volume that is a CPU load with no single slow statement to point at.

There is a fifth shape that is per-call expensive rather than per-call cheap, and it belongs here because it pins CPU the same way: predicates the planner cannot turn into an index seek. A LIKE '%term%' with a leading wildcard, a regex match, JSONB extraction in the WHERE clause, a function wrapped around a column, an implicit type cast applied to the column because the column is text and the parameter arrived as a number. Each forces the engine to evaluate an expression on every candidate row, which is CPU work that an ordinary index on the bare column does not remove until the predicate itself is rewritten. Non-SARGable predicates covers the rewrite. The reason it shares this article is the diagnostic: a non-SARGable filter under concurrency reads as CPU saturation, and total_exec_time is again where it surfaces.

Fixes, and what each one costs

For the dropdown COUNT(*), the first question is whether the number needs to be exact. Often it does not. A status badge that says “Open: 204” is not measurably more useful than one that says “Open: ~200,” and an estimate is close to free. PostgreSQL’s planner statistics already hold a row estimate in pg_class.reltuples; the PostgreSQL wiki’s “Count estimate” page gives the query and the caveat, that reltuples is maintained by VACUUM and ANALYZE and is only as fresh as the last run. For a filtered count, parsing the row estimate out of EXPLAIN output gets you a per-predicate estimate. The trade-off is accuracy: an estimate can be off by a few percent, and it is wrong to use one where the count drives a financial total or a correctness check.

When the number must be exact, the choice is between caching it and maintaining it. A cached count (in Redis, or a materialized view refreshed on a schedule) turns thousands of COUNT(*) executions into one, at the cost of staleness equal to the refresh interval. A counter table or a counter column, incremented and decremented by trigger or by application code, keeps the count exact and reads in constant time, and it moves the cost to writes. Every insert and delete now also writes the counter row, and if that counter is global, every writer contends on one row. That contention is its own CPU and lock problem, sometimes a worse one than the count you started with. Per-account or per-shard counters spread the contention; a global “total orders” counter concentrates it. The pragmatic middle for a UI badge is the bounded count: SELECT COUNT(*) FROM (SELECT 1 FROM orders WHERE status='open' AND account_id=$1 LIMIT 100) t, which stops scanning at 100 and lets the interface render “99+”. The product question of whether anyone needs the exact number above 99 is usually answered “no” the moment you ask it.
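A runnable sketch of the bounded count. It uses Python's built-in sqlite3 module as a stand-in for the production database, with a made-up `orders` table and a hypothetical `open_badge` helper; the SQL shape is the one quoted above, with the cap passed as a parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, account_id INTEGER, status TEXT)"
)
conn.execute("CREATE INDEX orders_account_status ON orders (account_id, status)")
conn.executemany(
    "INSERT INTO orders (account_id, status) VALUES (?, ?)",
    [(1, "open")] * 250 + [(2, "open")] * 7 + [(2, "closed")] * 40,
)

BADGE_CAP = 100


def open_badge(account_id: int) -> str:
    """Badge text for an account; the inner LIMIT stops the scan at the cap."""
    (n,) = conn.execute(
        "SELECT COUNT(*) FROM ("
        "  SELECT 1 FROM orders WHERE status = 'open' AND account_id = ? LIMIT ?"
        ") AS t",
        (account_id, BADGE_CAP),
    ).fetchone()
    return f"{BADGE_CAP - 1}+" if n >= BADGE_CAP else str(n)


print(open_badge(1))  # account with 250 open orders
print(open_badge(2))  # account with 7 open orders
```

The work per call is now bounded by the cap rather than by how many open orders the largest account has accumulated.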

Aggregation rollups want to stop running at read time, and the obvious instrument is a materialized view. On an OLTP primary it is more dependency than it looks. REFRESH MATERIALIZED VIEW still runs the full scan; it has only moved the work off the viewers and onto a schedule, on the same cores you are trying to protect, in periodic spikes rather than a steady drip. Plain REFRESH holds an ACCESS EXCLUSIVE lock that blocks reads of the view until it finishes; REFRESH ... CONCURRENTLY trades that lock for a mandatory unique index and a slower full recompute. Add a scheduler to run it and a staleness window to reason about, and a feature that looked like one line of DDL is a small system to operate. The version that actually removes the scan is a summary table the write path maintains: each order insert also bumps the per-region, per-day total, so the dashboard reads a few pre-computed rows and the 4-million-row scan never runs. The cost shifts onto writes, a couple of extra row updates per transaction, and the numbers stay current with no refresh job at all. Where the data tolerates lag, the other move is to get the aggregation off the primary: point the dashboard’s GROUP BY at a read replica so the scan burns the replica’s cores, or feed a separate analytics database and let the heavy rollups live there. A revenue tile that updates every five minutes is fine; an inventory count a customer sees at checkout usually is not.
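A minimal sketch of a write-path summary table, again using sqlite3 as a stand-in so it runs anywhere. The table and column names (`revenue_by_region_day`, `amount_cents`) and the `insert_order` helper are assumptions for illustration. The upsert qualifies the target column with the table name, which as far as I know is also what PostgreSQL's ON CONFLICT syntax expects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    region TEXT NOT NULL,
    order_day TEXT NOT NULL,
    amount_cents INTEGER NOT NULL
);
CREATE TABLE revenue_by_region_day (
    region TEXT NOT NULL,
    order_day TEXT NOT NULL,
    total_cents INTEGER NOT NULL,
    PRIMARY KEY (region, order_day)
);
""")


def insert_order(region: str, order_day: str, amount_cents: int) -> None:
    """Insert the order and bump its summary row in one transaction."""
    with conn:  # both writes commit together, or neither does
        conn.execute(
            "INSERT INTO orders (region, order_day, amount_cents) VALUES (?, ?, ?)",
            (region, order_day, amount_cents),
        )
        conn.execute(
            "INSERT INTO revenue_by_region_day (region, order_day, total_cents) "
            "VALUES (?, ?, ?) "
            "ON CONFLICT (region, order_day) DO UPDATE SET "
            "total_cents = revenue_by_region_day.total_cents + excluded.total_cents",
            (region, order_day, amount_cents),
        )


insert_order("emea", "2026-06-01", 1200)
insert_order("emea", "2026-06-01", 800)
insert_order("apac", "2026-06-01", 500)

# The dashboard reads a handful of pre-computed rows instead of scanning orders.
summary = conn.execute(
    "SELECT region, total_cents FROM revenue_by_region_day ORDER BY region"
).fetchall()
print(summary)
```

Keeping both writes in one transaction is what keeps the summary exact. Deletes, refunds, and updates to an order's amount would each need the matching adjustment, which this sketch leaves out.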

Parallel-worker starvation is a sizing problem. Cap max_parallel_workers_per_gather so a single analytics query cannot drain the pool, and size max_parallel_workers against the cores you can spare after the OLTP workload has what it needs. The connection-pool case is the same discipline: size the pooler near the core count rather than near peak concurrency, and let connections queue briefly in the pooler instead of oversubscribing the scheduler. Both fixes feel like throttling, and they are. A query that waits 5 ms for a pool slot and then runs at full speed beats one that starts immediately and fights 399 others for a core.

Fan-out has two fixes, and which applies depends on whether the callers want the same answer. When they do (a config row, a feature-flag set that every instance reads identically), a cache in front of the database is the direct fix. One instance’s read populates a shared cache, the other seventy-one read from the cache, and the database serves the query once per TTL instead of once per instance per tick. When the callers genuinely need different answers, no single query is wrong and the lever is governance: a platform-level view of aggregate QPS broken down by calling service, and a per-service query budget treated as a real constraint. Health checks can poll every 30 seconds instead of every 2. Polling can become a push. None of that happens without someone holding the number that no individual team’s dashboard shows.
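The once-per-TTL behavior can be sketched with a tiny read-through cache in Python. `TTLCache` and `load_config_from_db` are hypothetical names, the in-process object stands in for whatever shared cache actually sits in front of the database, and the class is not thread-safe as written.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class TTLCache(Generic[T]):
    """Read-through cache holding one value for ttl_seconds."""

    def __init__(
        self,
        loader: Callable[[], T],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[T] = None
        self._expires_at: Optional[float] = None

    def get(self) -> T:
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._loader()  # the only path that reaches the database
            self._expires_at = now + self._ttl
        return self._value


db_reads = {"count": 0}


def load_config_from_db() -> dict:
    """Stand-in for the config-row query; counts how often it runs."""
    db_reads["count"] += 1
    return {"feature_x": True}


fake_now = [0.0]  # injectable clock so the example needs no sleeping
cache = TTLCache(load_config_from_db, ttl_seconds=5.0, clock=lambda: fake_now[0])

for _ in range(72):  # seventy-two instances polling within one TTL
    cache.get()
reads_first_tick = db_reads["count"]

fake_now[0] = 6.0  # TTL has expired
for _ in range(72):
    cache.get()

print(reads_first_tick, db_reads["count"])
```

A production version would also want to stop every caller from reloading at once when the entry expires, which this sketch sidesteps by being single-threaded.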

Note

Across every fix here, the move is the same: do the work fewer times. Estimate instead of count, cache instead of recompute, refresh on a schedule instead of per-view, poll less often, queue instead of oversubscribe. Tuning a query to be faster per call is the Part IV and Part V conversation. The CPU-bound failure is a volume problem, and volume is what these fixes attack.

When it really is the load

Sometimes the workload genuinely needs the cores. A reporting database that runs heavy analytical queries, an ETL window, a system doing real per-row computation that no rollup can precompute because the parameters change every call, will sit at high CPU because that is the job. The tell is in total_exec_time: when the top of the list is a spread of genuinely heavy statements rather than one trivial query with an enormous call count, you are looking at a workload that wants capacity, and adding cores is the honest answer. The diagnostic distinguishes the two cases; the dropdown COUNT(*) at the top of the list means volume, a flat distribution of expensive queries means load.

And the one-day spike can be a legitimate reason to rent. A Super Bowl ad, a product launch, a Black Friday window: a known, bounded surge where the cost of engineering a permanent fix before the date exceeds the cost of a bigger box for 48 hours. Scale up Friday, scale down Monday, fix the COUNT(*) in the next sprint with the incident graph as the justification. That is renting the bug on purpose, with a return date. The failure is renting it by reflex, with no return date and no ticket, so the bigger box becomes the permanent baseline and the next spike starts the cycle again.

The dropdown count on the orders page was always going to break. The Super Bowl ad only decided the date. A query that does work proportional to traffic, on a system whose traffic only grows, has a ceiling it will reach; the box size sets the date, and the query sets the slope. Sort by total_exec_time before you size the instance, and you find out which one you’re actually fighting.

The Paradox of the Fast Engineer

Mon, 18 May 2026 00:00:00 +0000

TL;DR

The judgment that lets an engineer override a model is built in the slow work the model now offers to do for them. Accept enough of that help on the work that would have built the judgment, and the agent’s speed arrives without the quality, security, scalability, maintainability, or operational sense that the slow work used to deposit alongside the code.

Three months after shipping, customers start complaining that menu items are missing from the navigation. The query that builds the menu does three LEFT JOINs against a self-referencing categories table. The agent produced that shape when the engineer described the requirement; review passed because the test fixtures were three levels deep. Production grew to seven. The query was silently truncating subcategories the day it shipped, and the engineer who accepted the output had never reached for a recursive CTE, because nobody on the team had ever shown them one.

The fix is a recursive CTE with UNION ALL, anchored on the root row and joining the source table back to the CTE’s own output until no more rows come out. Five lines. Both shapes are valid SQL; the one that holds up against arbitrary depth is the one the engineer reaches for only after having seen it before. Without that prior, the idiom isn’t in their decision space. They can’t ask the agent for it, and they wouldn’t recognize it as the right answer if the agent offered it. No memory of a broken version that lacked it, no internal alarm that “three LEFT JOINs against a tree” is the shape of a future incident.
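Both shapes side by side, as a sketch that runs against Python's built-in sqlite3 rather than the production database. The `categories` schema and the seven-level chain are invented to match the scenario; the article does not show the real table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE categories ("
    " id INTEGER PRIMARY KEY,"
    " parent_id INTEGER REFERENCES categories (id),"
    " name TEXT NOT NULL)"
)
# One chain, seven levels deep: 1 <- 2 <- 3 <- ... <- 7.
rows = [(1, None, "root")] + [(i, i - 1, f"level-{i}") for i in range(2, 8)]
conn.executemany("INSERT INTO categories VALUES (?, ?, ?)", rows)

# The fixed-depth shape: three LEFT JOINs reach four levels and stop.
FIXED_DEPTH = """
SELECT l1.id, l2.id, l3.id, l4.id
FROM categories AS l1
LEFT JOIN categories AS l2 ON l2.parent_id = l1.id
LEFT JOIN categories AS l3 ON l3.parent_id = l2.id
LEFT JOIN categories AS l4 ON l4.parent_id = l3.id
WHERE l1.parent_id IS NULL
"""

# The recursive shape: anchor on the root, then join children to the
# rows already found until a pass adds nothing.
TREE_WALK = """
WITH RECURSIVE menu (id, name, depth) AS (
    SELECT id, name, 1 FROM categories WHERE parent_id IS NULL
    UNION ALL
    SELECT c.id, c.name, m.depth + 1
    FROM categories AS c
    JOIN menu AS m ON c.parent_id = m.id
)
SELECT id, name, depth FROM menu ORDER BY depth, id
"""

fixed_ids = {v for row in conn.execute(FIXED_DEPTH) for v in row if v is not None}
menu = conn.execute(TREE_WALK).fetchall()

print(sorted(fixed_ids))          # levels the fixed-depth query can see
print([d for _, _, d in menu])    # depths the recursive walk reaches
```

Neither query raises an error on the deeper tree. The fixed-depth one just returns fewer categories, which is why the failure reached customers before it reached a log.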

The obvious fix isn’t the fix

Review the agent’s code before approving it. True, and insufficient. The reviewer who has never written a tree walk over a self-referencing table doesn’t know what they should be looking for. They see SQL that compiles, returns rows on the test data, and matches the shape of the request. The internal alarm that says “this assumes a fixed depth, what happens when the tree is deeper than the joins” doesn’t come from reading SQL. It comes from writing the broken version yourself, watching it fail in production, and tracing the failure back through your own assumption.

Code review without that prior pain is pattern matching against the surface of the query. The bugs that ship through review are the ones where the surface looks right.

The paradox

Here is the paradox. The judgment that lets an engineer override the model is built in the slow work the agent now offers to do for them. The engineer who accepts the output, reviews it briefly, and ships it has gotten the speed. They have not gotten the read on whether the query holds under the production tree shape, the security sense for whether the patch closed the CVE without invalidating something downstream, the scalability instinct for whether the join multiplies under real data, the maintainer’s eye for whether this diff just doubled the toil bill six months from now, or the operational feel for which parts of the system are load-bearing and which are decoration. None of those come bundled with the agent’s output. The five-minute version of the recursive CTE problem passes through them without depositing anything, the way watching someone debone a chicken on YouTube does not teach you when the knife is sharp enough.

The pattern shows up in the public data. METR ran a controlled study in July 2025 on sixteen experienced open-source developers working in repositories averaging more than a million lines and a decade old. The developers self-reported a 20% speedup from AI assistance. Measured against the control, they were 19% slower. A gap of thirty-nine points between what the engineer feels and what the stopwatch records, on a population that does this work for a living.

Google’s 2025 DORA report found 90% of developers using AI and over 80% reporting it made them more productive, while organizational delivery metrics stayed roughly flat for teams without strong measurement practices. The same report measured bugs per developer up 54% and the median time a pull request spends in review up 441%. The verification work the agent created moved to the reviewer. The reviewer is now the bottleneck the agent isn’t helping, and the skill that makes a reviewer fast (the recognition of which agent-generated PR is hiding a fixed-depth assumption, or a missing index, or a quietly invalidated invariant) is the skill the same reviewer is no longer building by writing the slow version themselves.

Cloudflare’s Project Glasswing write-up lands on the same shape from the security side. When they let a security-focused model write its own patches against live infrastructure code, the fixes “fixed the original bug while quietly breaking something else the code depended on.” The thing standing between those patches and production was a senior engineer who could read a regression suite and notice when a patch had quietly invalidated a load-bearing assumption. That recognition was built over years of debugging exactly that class of mistake. The model has no way to produce the recognition, and accepting the patch without it means shipping the regression and learning nothing in the process.

Note

None of this is saying the agent is useless. Its reliable surface is pattern-matching across volume, the way grep is reliably better than reading the whole file when you already know what string you’re looking for. Surfacing every place a deprecated API is called across a million-line repo. Pulling the regex syntax you’d otherwise have to look up. Flagging the four files in a 200-file diff that touched the auth path. The agent is a faster grep against language, and on that narrow ground it earns its seat. What is being sold and billed for, though, is autonomous production, and the autonomous-production claim does not survive the METR result above. The agent is nowhere near human decision-making, and the cost of treating its output as if it were is exactly the gap between the perceived 20% speedup and the measured 19% slowdown.

The slow-onset failure

The damage falls hardest on engineers who came up after the tools landed. The current cohort of senior engineers built their judgment in a decade when the slow work was the only available path. Every recursive query was a recursive query they had to figure out. Every migration was one they had to plan. Every 2 a.m. incident was one they had to root-cause without a model offering a first-guess hypothesis (Alert Triage Without an Agent goes deeper on that specific muscle). The path that produced today’s seniors ran straight through the slow work the agent now does on demand.

Juniors who let the agent do that work will not arrive at the same place by the same route. Three years of accepting every agent PR, and the engineer who used to be a junior in their codebase is still a junior in their codebase, except now the codebase has grown more complex and the parts they don’t understand have grown faster than the parts they do. The gap doesn’t show on day one, or month six, or even year two. It shows the first time the agent produces output the engineer cannot evaluate: when the question a senior would ask about a migration is one the junior doesn’t know to ask, or when the bug in the agent’s PR is invisible to anyone who hasn’t written the broken version themselves (see also What AI Gets Wrong About Your Database for the database-specific shape of this).

Warning

By the time the gap shows, it has been compounding for years. The engineer is on the wrong side of a hiring market that pays for exactly the recognition they no longer have, and nothing in a quarterly performance review catches the deposit you didn’t make to your own long-term memory.

The calibration

The skill worth building is knowing which work the agent should do, which work you should do by hand, and which work you should accept from the agent and then rewrite anyway to internalize the pattern.

The agent is the right tool for work where the context you’d gain by doing it yourself is marginal. Boilerplate. Syntax you’d otherwise look up. Test scaffolding for code paths you already understand. The migration template you’ve written for the tenth time this year. The fifty-line helper that’s mechanically obvious once you’ve decided what it should do. Let the agent handle these with a brief review and move on.

The agent is the wrong tool for work where the context is the asset. The parts of the system you don’t yet understand. New code paths through a critical module. Database changes whose consequences you’d want to feel in your fingers before approving them in production. The first recursive CTE against a tree-shaped table you’ve never queried before. The first incident in a class of failures you haven’t seen, where the agent’s hypothesis is a hypothesis you should also be forming yourself. Do this work by hand, even when the agent would have produced a working diff faster. The slow version is what builds the alarm that catches the agent’s mistake the next time the same shape of work shows up.

The hard part is the middle. Work that’s neither pure boilerplate nor entirely novel. Some of it belongs in a tight loop where you drive and the agent assists on syntax. Some gets reviewed line by line as a learning exercise rather than a compliance step. The rest gets rewritten by hand after the agent produces a working version, just to deposit the pattern in your own muscle. The choice turns on whether the work sits in a part of the codebase you need to know deeply or one you can afford to treat as a black box.

When this doesn’t apply

The argument cuts cleanest for engineers building depth in a domain they intend to stay in. A platform engineer who needs to know the database. A security engineer building the recognition Cloudflare’s example demands. A backend engineer whose career bet is on a specific stack. A frontend engineer whose framework just shipped thirteen advisories in a coordinated security release (auth bypass, SSRF, i18n path bypass, an RSC DoS hitting every App Router deployment on 13.x through 16.x) and who needs to read their own dependency graph well enough to know whether they were exposed. For these engineers, the slow work is the investment that pays back over the next decade.

It cuts less cleanly for work that doesn’t depend on depth. The hobbyist exploring a new language. The throwaway script that ships in an afternoon and dies in a week (The 10x Is Real, on Internal Tools You’d Otherwise Never Ship covers that end of the spectrum). The pre-product-market-fit startup whose entire codebase is throwaway in expected value, where vibe-coding the MVP and finding out if anyone wants the product is the rational trade against the dominant risk of nobody wanting it. The bill on that last case comes if the product wins, in the form of hiring engineers who can read the agent’s output and untangle the parts that now have to scale. That is a good problem to have.

It also doesn’t apply where the agent’s baseline beats what the company can actually hire at its price point. The frontier model is mediocre in absolute terms (METR again), but it is a consistent floor, and not every company can outhire that floor at the salaries they actually pay. In those shops the cheaper path is to let the model produce and have a senior reviewer (often a contractor) clean up after it. The agent there is competitive at the level the company can afford, which sits below senior judgment but above the median hire the budget will permit.

Senior engineers who already have the context sit outside the trap entirely. The one who has written the recursive CTE a dozen times can accept the agent’s first-draft query and review it competently because the alarm is already wired. The asymmetry is that the trap falls hardest on the engineers least equipped to recognize it.

The bigger picture

The market for engineering judgment is splitting. Work the model can do at the level of a competent mid-level engineer is being commoditized; work that requires the judgment to recognize when the model is wrong is being concentrated. Which side an engineer ends up on is determined less by the tools they use than by which work they choose to do by hand.

The senior’s value is going up because the volume of model output needing adult supervision grew faster than the supply of adults to supervise it. The junior’s floor is the level the model now hits without help. The path from one to the other used to be the slow work, and the path is still the slow work, except the slow work is now optional and most engineers will not opt in.

It's Almost Always the Queries, Part II: Troubleshooting Steps

Sun, 17 May 2026 00:00:00 +0000

TL;DR

Database troubleshooting is a learnable skill with a repeatable sequence: observe what’s happening now, categorize the wait, narrow to the specific cause, then act. The sequence matters more than the tools. This article walks through it for engineers who don’t do this often.

Your APM lights up. Endpoint latency on /api/checkout has tripled in the last three minutes. The graph shows a wall of slow requests, no deploy in the window, no traffic spike. Something changed in the database layer. You have maybe fifteen minutes before someone senior asks what’s happening. What do you actually do?

If your instinct is to paste the alert into an LLM and ask what to do, pause. The model will give you something that looks like an answer. It might suggest killing the longest-running query, or restarting the connection pool, or adding an index. It pattern-matches on symptoms (duration, state labels, error messages) without access to the causal structure underneath. If you don’t understand why it’s suggesting what it’s suggesting, you can’t tell when it’s wrong. This applies even if you’re running an agent with MCP access to your database (hopefully read-only). The agent can query pg_stat_activity faster than you can type it, but if you don’t understand what the output means and can’t evaluate whether the agent’s next step is appropriate, you’ve handed control of a production incident to something that can’t distinguish a victim from a cause. When it’s wrong during a live incident, you make things worse. This article builds the mental model that lets you troubleshoot yourself. Use LLMs to learn these concepts on your own time. Don’t rely on them at 3am.

The sequence below is designed to give you understanding before you reach the “act” step.

If you have SQL access to the primary

This is the fuller diagnostic path. You can query the system tables directly. What follows assumes you can connect to the primary (or a replica that exposes these views). System tables show what’s happening right now; dashboards show what happened over the past hour. Both have a place, and Step 7 covers what to look for in dashboards. Learn whatever monitoring you have before you need it. Figuring out which tab shows wait events at 3am is wasted time.

The first thing you want to see is the process list, filtered to active queries and ordered by time. On PostgreSQL that’s pg_stat_activity. On MySQL that’s SHOW PROCESSLIST or, better, performance_schema.processlist.

Before you run anything, protect your own session:

-- PostgreSQL (session-level; SET LOCAL only takes effect inside a transaction block)
SET statement_timeout = '5s';

-- MySQL (per-query)
SELECT /*+ MAX_EXECUTION_TIME(5000) */ ...

Why: if the database is under heavy pressure, your diagnostic query competes for the same resources. A five-second timeout means your debugging doesn’t pile onto the problem. If your diagnostic can’t finish in five seconds, that itself tells you something (extreme contention, buffer pressure, WAL pressure).

Now pull the active sessions with the columns that actually help you categorize:

-- PostgreSQL
SELECT
 pid,
 state,
 wait_event_type,
 wait_event,
 now() - xact_start AS xact_duration,
 now() - query_start AS query_duration,
 pg_blocking_pids(pid) AS blocked_by,
 LEFT(query, 100) AS query_snippet
FROM pg_stat_activity
WHERE state != 'idle'
 AND pid != pg_backend_pid()
ORDER BY xact_start NULLS LAST;

Most diagnostic snippets you’ll find online show pid, query, state, and duration. They skip two columns that matter: wait_event_type and wait_event. These tell you why a query is taking long, not just that it’s taking long.

-- MySQL (use performance_schema, not INFORMATION_SCHEMA)
SELECT
 t.PROCESSLIST_ID AS pid,
 t.PROCESSLIST_STATE AS state,
 t.PROCESSLIST_TIME AS duration_sec,
 w.BLOCKING_THREAD_ID AS blocked_by_thread,
 LEFT(t.PROCESSLIST_INFO, 100) AS query_snippet
FROM performance_schema.threads t
LEFT JOIN performance_schema.data_lock_waits w
 ON t.THREAD_ID = w.REQUESTING_THREAD_ID
WHERE t.PROCESSLIST_COMMAND != 'Sleep'
 AND t.TYPE = 'FOREGROUND'
ORDER BY t.PROCESSLIST_TIME DESC;

Warning

On MySQL, use performance_schema.threads or performance_schema.processlist (8.0.22+), not INFORMATION_SCHEMA.PROCESSLIST. MySQL Bug #94077 (January 2019) documented a 70% performance drop from polling INFORMATION_SCHEMA.PROCESSLIST under load. Bug #100049 (June 2020) showed the same query causing pending queries to pile up until the server became unresponsive. Both trace to a mutex held during execution. The diagnostic query you run during an incident should not itself become part of the incident.

Note

Worth asking your DBA or platform team to wrap these queries into views (active_queries, blocking_chains, long_transactions) so that during an incident you’re running SELECT * FROM active_queries instead of assembling joins from memory. You don’t want to be copy-pasting SQL from a blog post at 3am.
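As a starting point, the PostgreSQL query from Step 2 could be saved as a view along these lines. The view name active_queries comes from the note above; the ops schema is an assumption, so substitute whatever schema your team keeps tooling in.

```sql
-- Sketch only: wraps the Step 2 PostgreSQL query in a view.
-- The "ops" schema is a placeholder, not something this article defines.
CREATE SCHEMA IF NOT EXISTS ops;

CREATE OR REPLACE VIEW ops.active_queries AS
SELECT
    pid,
    state,
    wait_event_type,
    wait_event,
    now() - xact_start    AS xact_duration,
    now() - query_start   AS query_duration,
    pg_blocking_pids(pid) AS blocked_by,
    LEFT(query, 100)      AS query_snippet
FROM pg_stat_activity
WHERE state != 'idle'
  AND pid != pg_backend_pid();   -- evaluated per caller, so it hides your own session

-- During an incident: oldest open transaction first
SELECT * FROM ops.active_queries
ORDER BY xact_duration DESC NULLS LAST;
```

Because the view hides the caller's own session and idle connections, write that down next to it; a filter you forgot about is the kind of thing that misleads at 3am.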

Step 3: Look for obvious offenders

Start with the longest-running queries in the output. Some problems are visible from the query snippet alone. A SELECT * with no WHERE clause on a large table, a function wrapping a column in the predicate (non-SARGable), a COUNT(*) over millions of rows, an N+1 pattern showing up as dozens of identical queries with different IDs. If you recognize the shape, you already know what to fix.
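For the non-SARGable shape specifically, a minimal PostgreSQL sketch looks like this. The orders table appears in Part I; the created_at column is an assumption made for illustration.

```sql
-- Non-SARGable: the function wraps the column, so a plain index on
-- created_at cannot be used to seek and every row gets evaluated.
SELECT count(*)
FROM orders
WHERE date_trunc('day', created_at) = DATE '2026-06-01';

-- SARGable rewrite: the bare column is compared against a range,
-- which an ordinary B-tree index on created_at can satisfy.
SELECT count(*)
FROM orders
WHERE created_at >= TIMESTAMPTZ '2026-06-01'
  AND created_at <  TIMESTAMPTZ '2026-06-02';
```

If you see the first shape in a query snippet, that is often enough to explain a sequential scan before you ever open the plan.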

Also look for DDL in the list. An ALTER TABLE, CREATE INDEX, or DROP INDEX that’s been running longer than you’d expect is usually not doing the work it looks like it’s doing. It’s waiting on a lock. On MySQL this shows up as a metadata lock (MDL): the DDL waits for every transaction still holding the table open to commit or roll back, and meanwhile every new query against the table queues behind the DDL. On PostgreSQL the equivalent is a relation-level lock: ACCESS EXCLUSIVE for most ALTER TABLE forms and for DROP INDEX (a plain CREATE INDEX takes a SHARE lock, which blocks writes but not reads). The cascade is the same: the DDL waits on active transactions, and everything that conflicts with its lock waits on the DDL.

The non-obvious version is when no heavy query is running but the server is wedged anyway. The trigger is usually something innocuous: a connection pooler holding a session ‘idle in transaction’, an analytics job that opened a transaction and never closed it, a long SELECT still holding its ACCESS SHARE lock, or an anti-wraparound autovacuum on the same table (a routine autovacuum normally cancels itself when it blocks a lock request). The DDL blocks on whichever of those it is, then every new query against the table queues behind the DDL. The process list shows a wall of waiting sessions against one table, the DDL at the head of the queue, and the actual root cause somewhere further down (or in a connection that doesn’t look problematic at all).
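One way to walk from the wall of waiters back to the root cause on PostgreSQL is to expand pg_blocking_pids into one row per waiter and blocker pair. This is a sketch, not a full lock-graph reconstruction; the column aliases are mine.

```sql
-- PostgreSQL: for every session waiting on a lock, list the sessions it waits behind.
SELECT
    waiting.pid                AS waiting_pid,
    LEFT(waiting.query, 60)    AS waiting_query,
    blocker.pid                AS blocker_pid,
    blocker.state              AS blocker_state,   -- 'idle in transaction' is the usual surprise
    now() - blocker.xact_start AS blocker_xact_age,
    LEFT(blocker.query, 60)    AS blocker_last_query
FROM pg_stat_activity AS waiting
CROSS JOIN LATERAL unnest(pg_blocking_pids(waiting.pid)) AS b(pid)
JOIN pg_stat_activity AS blocker ON blocker.pid = b.pid
WHERE waiting.wait_event_type = 'Lock'
ORDER BY blocker_xact_age DESC NULLS LAST;
```

A blocker_pid that never shows up as a waiting_pid is at the head of the chain. The queued DDL usually appears on both sides: as the blocker of the ordinary queries and as a waiter on the session that actually started the pile-up.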

If nothing jumps out from the query text alone, move to Step 4 and Step 5 to dig deeper into the suspect query. If nothing looks slow at all, skip to Step 6.

Step 4: Check the schema

Once you have a suspect query, look at the table it’s hitting. Pull the table definition (\d tablename in psql, SHOW CREATE TABLE tablename in MySQL) and compare it against what the query needs.

Walk the clauses one at a time. WHERE columns need indexes the optimizer can actually use, and ‘usable’ depends on the predicate shape: equality on a high-cardinality column is the straightforward case, range predicates (>, <, BETWEEN) work but constrain what can follow them in a composite, and a large IN (...) list may flip the planner to a sequential scan even when an index exists. JOIN columns need an index on the inner side of the join (the side being looked up once per outer row); in PostgreSQL, foreign-key columns are not indexed automatically, which is a frequent cause of joins that ran fine at low volume and fell over at scale. ORDER BY can sometimes use an index to skip the sort entirely, but only when the index’s leading columns line up with the ORDER BY columns. GROUP BY is the same shape: an index on the grouping columns lets the planner stream the aggregation instead of building a hash, which on a multi-million-row table can be the difference between a sub-second query and one that exhausts work_mem and spills to disk.

A missing index on a high-cardinality filter column is the single most common cause of queries that worked fine at low volume and fell over at scale. Composite index column order is the runner-up. An index on (status, created_at) serves WHERE status = 'pending' ORDER BY created_at. An index on (created_at, status) does not, even though it contains the same columns: with created_at leading, the index cannot seek to status = 'pending'. The general rule is equality columns first, then the range or sort column. A column placed after a range column can still filter or cover, but the index no longer returns it in sorted order.
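The column-order rule is easy to try in a scratch database. The jobs table below is hypothetical and exists only to illustrate the (status, created_at) case described above.

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE jobs (
    id         BIGSERIAL PRIMARY KEY,
    status     TEXT        NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Equality column first, then the sort column.
CREATE INDEX jobs_status_created_at_idx ON jobs (status, created_at);

-- The index can seek to status = 'pending' and hand rows back
-- already ordered by created_at, so no separate sort is needed.
EXPLAIN
SELECT id
FROM jobs
WHERE status = 'pending'
ORDER BY created_at
LIMIT 50;
```

On an empty or tiny table the planner may still pick a sequential scan because it is cheaper; load realistic volume and run ANALYZE before reading anything into the plan. Swapping the index to (created_at, status) and rerunning the EXPLAIN is the quickest way to see the difference for yourself.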

Also check for covering index gaps: an index that covered the query last month might have stopped covering it after a column was added to the SELECT list, forcing a heap lookup per row where there used to be an index-only scan.
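A sketch of how coverage breaks, using PostgreSQL 11+ INCLUDE syntax. The customer_id, status, and total columns on orders are assumptions made for the example.

```sql
-- Hypothetical columns on orders; INCLUDE stores status in the index leaf pages.
CREATE INDEX orders_customer_incl_idx
    ON orders (customer_id) INCLUDE (status);

-- Covered: every column the query touches lives in the index.
EXPLAIN (ANALYZE, BUFFERS)
SELECT status FROM orders WHERE customer_id = 42;

-- No longer covered once "total" joins the SELECT list:
-- each matching row now needs a heap fetch.
EXPLAIN (ANALYZE, BUFFERS)
SELECT status, total FROM orders WHERE customer_id = 42;
```

On a recently vacuumed table the first plan can show an Index Only Scan with few heap fetches, while the second has to fall back to an Index Scan or Bitmap Heap Scan. Index-only scans also depend on the visibility map, so a table that has not been vacuumed lately can show heap fetches even when the index covers the query.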

Step 5: Read the execution plan

If the schema looks right and you still can’t explain the behavior, ask the database what it’s actually doing. EXPLAIN (the same keyword in PostgreSQL and MySQL) shows the planner’s chosen strategy without executing the query. EXPLAIN ANALYZE executes it and shows actual row counts alongside the estimates. Reading these outputs is the most important skill in query troubleshooting. Every claim about what a query ‘should’ do is wrong until the plan confirms it; the optimizer might pick a different index than you expect, fall back to a sequential scan because of stale statistics, or choose a join order you’d never write by hand. The plan is the ground truth.

Run this on a replica if you have one. The plan will be the same (assuming similar data and stats), and you avoid adding load to a primary that’s already under pressure. If you don’t have a replica, plain EXPLAIN (without ANALYZE) gives you the plan without executing the query. It’s an estimate, not a measurement, but it’s often enough to spot the problem.

What to look for in the output: sequential scans on large tables (the planner couldn’t find a usable index), rows estimated vs. rows actual diverging by orders of magnitude (stale statistics or a bad cardinality guess), reading a million rows to return ten (missing or ignored index), nested loops where each iteration does its own index lookup against a large table (the N+1 shape at the engine level).

Warning

EXPLAIN ANALYZE executes the query fully. On a SELECT that takes 30 seconds in production, it will take 30 seconds when you run it too. If the query modifies data (INSERT, UPDATE, DELETE), wrap it in a transaction and roll back: BEGIN; EXPLAIN ANALYZE UPDATE ...; ROLLBACK;. On a system already under pressure, be deliberate about what you choose to execute.
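Spelled out for PostgreSQL, the rollback pattern can carry the five-second cap from earlier in the playbook with it. The UPDATE is a made-up example against a hypothetical orders schema.

```sql
BEGIN;

-- Applies only inside this transaction; mirrors the five-second diagnostic cap.
SET LOCAL statement_timeout = '5s';

EXPLAIN (ANALYZE, BUFFERS)
UPDATE orders
SET status = 'closed'
WHERE status = 'open'
  AND created_at < now() - interval '90 days';

-- Discards the data change; the plan output has already been printed.
ROLLBACK;
```

The rollback undoes the change, not the work: the statement still takes row locks until the ROLLBACK, writes WAL, and leaves dead tuples behind for vacuum. That is another reason to prefer a replica or a plain EXPLAIN when the primary is already struggling.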

Step 6: If nothing looks slow, check volume

Sometimes every query in the process list finishes in a few milliseconds and nothing looks wrong individually. The problem isn’t one slow query. It’s thousands of fast ones hitting the same resources concurrently.

This is where digest-level views help. pg_stat_statements (PostgreSQL) and performance_schema.events_statements_summary_by_digest (MySQL) aggregate queries by their normalized pattern and track call counts. Sort by total_exec_time (PostgreSQL) or SUM_TIMER_WAIT (MySQL), not by mean time. A query that averages 5 ms but fires 10,000 times per second consumes 50 CPU-seconds per wall-second. It will never show up in a “longest running” list, but it dominates the workload.

-- PostgreSQL
SELECT
 LEFT(query, 80) AS query_snippet,
 calls,
 total_exec_time::int AS total_ms,
 round(mean_exec_time::numeric, 2) AS avg_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- MySQL (digests of the statements running right now, ranked by cumulative cost)
SELECT DISTINCT
 LEFT(d.DIGEST_TEXT, 80) AS query_snippet,
 d.COUNT_STAR AS calls,
 ROUND(d.SUM_TIMER_WAIT / 1e9) AS total_ms, -- timers are in picoseconds
 ROUND(d.AVG_TIMER_WAIT / 1e9) AS avg_ms
FROM performance_schema.events_statements_current esc
JOIN performance_schema.events_statements_summary_by_digest d
 ON esc.DIGEST = d.DIGEST
 AND esc.CURRENT_SCHEMA <=> d.SCHEMA_NAME -- the summary is keyed by schema and digest
JOIN performance_schema.threads t
 ON esc.THREAD_ID = t.THREAD_ID
WHERE t.PROCESSLIST_COMMAND != 'Sleep'
ORDER BY total_ms DESC -- DISTINCT collapses threads running the same digest
LIMIT 10;

This is the case from Part I, where the COUNT(*) averaged 40 ms and never triggered the slow-query log. Part III walks the CPU-bound version of this in detail.

Step 7: Read the dashboards

Dashboards see what the system tables don’t: history. By the time you query pg_stat_activity, you’ve lost the picture of what was happening five minutes ago. Whatever you have (Datadog DBM, Aurora and RDS Performance Insights, pganalyze, Grafana with the right exporters), this is where you bring it in. If you have no SQL access at all, the dashboard is your entire diagnostic toolkit; everything you can learn, you’ll learn here.

Start with timing. A graph with a hard step-change at 14:23 is a different problem from one that climbed slowly over the past hour. The hard step points at a discrete event: a deploy, a config change, a single long-running query that started blocking others. The slow climb points at a workload trend or a plan regression that compounded as the working set grew. Overlay deployment markers, autoscaling events, and scheduled job runs on the latency graph if you can. A latency jump that exactly tracks a deploy is the deploy until proven otherwise.

Active session count is the other timing graph worth pulling. A flat baseline that doubles at 14:23 points at a blocking event. A slow climb over an hour points at workload growth or queue buildup. Session count is harder to lie to than latency: a single slow query can sit beyond the p99 and leave the latency graph untouched, but it can’t hide in the count of sessions waiting on it.

Which resource is pinned tells you which dimension is the cause and which is following. CPU at 100% with IOPS low and stable is a CPU-bound workload, often a missing index causing repeated sorting or hashing, or a regex or JSON predicate doing per-row work the planner can’t push down. IOPS pinned with CPU low and waits on IO:DataFileRead (PostgreSQL) is a buffer-cache miss problem: the working set has outgrown RAM and every query is going to disk. RAM climbing steadily for days while the buffer cache hit rate falls at the same rate is a forecast, not an incident.

Waits are where dashboards earn their keep. Aurora Performance Insights, pganalyze, and Datadog DBM all stack active sessions by wait class over time, which is exactly the information pg_stat_activity can’t give you historically. Locks dominating means contention: a deadlock storm, or an open long-running transaction blocking everything that touches the same rows. IO dominating means buffer pressure or storage saturation. CPU dominating means the queries themselves are doing more work per call than they used to. Client waits dominating points at the application side: in PostgreSQL, ClientWrite means a slow consumer (the app isn’t reading results fast enough and sessions stack up waiting to send the next batch), while ClientRead means the session is waiting for the app to send its next command, often inside an open transaction.
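If you have SQL access but no wait-class dashboard, a rough point-in-time version of the same stack is available from pg_stat_activity. It is one sample, not history, so run it several times a few seconds apart. Labeling a NULL wait event as CPU follows the convention most tools use and is an approximation.

```sql
-- PostgreSQL: active sessions grouped by what they are waiting on right now.
SELECT
    COALESCE(wait_event_type, 'CPU') AS wait_class,  -- no wait event: running, not waiting
    wait_event,
    count(*) AS sessions
FROM pg_stat_activity
WHERE state = 'active'
  AND pid != pg_backend_pid()
GROUP BY 1, 2
ORDER BY sessions DESC;
```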

WAL and checkpoint pressure are easy to miss because they don’t appear in the active query list. WAL generation rate climbing with no proportional traffic increase points at write amplification: a runaway UPDATE rewriting the same rows, a hot index getting bloated, a trigger writing more than it needs to. Checkpoint duration climbing, or checkpoint frequency increasing, means the system can’t keep pace with the write rate. On MySQL the same signal shows up as InnoDB log file utilization approaching its configured size, with checkpoint-related stalls visible in SHOW ENGINE INNODB STATUS. These often correlate with sudden IOPS spikes, because the checkpoint flush is what saturates the disk, not the workload directly.

Workload composition completes the picture. Most monitoring breaks queries-per-second down by type. A 10x spike in write QPS with read QPS flat is a different incident from a 10x read spike. Within reads, the ratio of index scans to sequential scans is a leading indicator of plan regression: in PostgreSQL, pg_stat_user_tables.seq_scan climbing on a table that previously got index scans; in MySQL, Handler_read_rnd_next rising relative to Handler_read_key. A jump in read-ahead activity (InnoDB’s Innodb_buffer_pool_read_ahead, Aurora’s read-IOPS metrics) often signals large scans that weren’t there before: a new query, or an old query whose plan changed.
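For the PostgreSQL half of that check, the counters live in pg_stat_user_tables. They are cumulative since the last statistics reset, so the signal is the change between two samples rather than the absolute number.

```sql
-- PostgreSQL: tables reading the most rows through sequential scans.
SELECT
    relname,
    seq_scan,        -- number of sequential scans started
    idx_scan,        -- number of index scans started (NULL if the table has no indexes)
    seq_tup_read,    -- rows read by those sequential scans
    n_live_tup
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 10;
```

A large table whose seq_scan keeps climbing between samples while idx_scan stays flat is the plan-regression candidate to carry back to Step 4 and Step 5.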

You’re not trying to find the exact query from the dashboard. The goal is to narrow the category before going back to the system tables: read-side vs write-side, query-level vs workload-level, lock contention vs buffer pressure vs CPU saturation. That tells you where to focus in the earlier steps, and if you have no SQL access, it’s enough to escalate with specifics. “40 sessions in Lock wait starting at 2:43, no deploy in the window, getting worse” is something whoever owns the database can act on. “It’s slow” is not.

Before the next incident

Everything above assumes you can run the queries when you need them, that you know what their output means, and that you’ve seen what ‘normal’ looks like so the abnormal stands out. None of that is true at 3am unless you’ve done the work before then.

Know your access. Can you connect to the primary, or only to a replica? Read-only or with permission to call pg_terminate_backend? Does your role have access to pg_stat_statements (on PostgreSQL 10+, query text from other users is hidden unless your role is a member of pg_read_all_stats) and to the MySQL performance_schema tables? On managed services, some catalog views are hidden behind parameter groups that take a restart to enable. The time to discover you don’t have pg_stat_statements is not the moment you need it.
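On PostgreSQL 10 or later, most of those access questions can be answered from a psql prompt in under a minute. A sketch; the column aliases are mine.

```sql
-- Am I on a replica?
SELECT pg_is_in_recovery() AS on_replica;

-- Can I see other users' query text in the statistics views?
SELECT pg_has_role(current_user, 'pg_read_all_stats', 'MEMBER') AS can_read_all_stats;

-- Can I cancel or terminate sessions that belong to other roles?
SELECT pg_has_role(current_user, 'pg_signal_backend', 'MEMBER') AS can_signal_other_roles;

-- Is pg_stat_statements installed in this database and loaded by the server?
SELECT extname, extversion FROM pg_extension WHERE extname = 'pg_stat_statements';
SHOW shared_preload_libraries;
```

Superusers pass both role checks automatically. On managed services the privileged role is usually not a true superuser, so check with the role you will actually be using during an incident.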

Know your team’s views. Ops teams often wrap the queries from this article into named views: active_queries, blocking_chains, long_transactions, top-N digest views. Find out whether yours exist. If they do, learn the column names and what they filter out (some hide replication workers, autovacuum, or your own session, which can mislead during an incident if you don’t know). If they don’t, ask your DBA whether they’d take a pull request, or write them yourself. A view you can SELECT * FROM is faster to run and harder to typo at 3am than a 15-line join assembled from memory.

Run the queries on a quiet system. Pull pg_stat_activity against your dev database while nothing stressful is happening. Note what idle connections look like (the pool’s keep-alives, your IDE’s introspection queries, your monitoring’s polling traffic). Pull a pg_stat_statements snapshot and read through the top 20. The point is to know what your environment looks like at rest, so during an incident the abnormal jumps out instead of getting lost in baseline noise.

Read the column documentation once. wait_event_type and wait_event in PostgreSQL have a documented enumeration with dozens of values. MySQL’s performance_schema instruments follow a naming convention you can learn in fifteen minutes. Knowing that IO:DataFileRead means a buffer cache miss and Lock:transactionid means waiting on another transaction’s row lock turns opaque output into a diagnosis.

Practice reading EXPLAIN output. Pull a slow-ish query from your own codebase and run EXPLAIN (ANALYZE, BUFFERS) on it in a quiet environment. The ‘what to look for’ in Step 5 is more useful when you’ve spent thirty minutes staring at a plan in low-stakes context first. Visualizers like explain.depesz.com and explain.dalibo.com help with PostgreSQL output, and MySQL’s tree-format EXPLAIN is more readable than the default table format. But none of those substitutes for knowing what each node type means. Read the docs for sequential scan, index scan, bitmap heap scan, nested loop, hash join, and merge join once.

What this article doesn't cover

None of the seven steps is comprehensive. The full diagnostic surface (lock-graph reconstruction, planner cost-model tuning, statistics histograms, MVCC and vacuum internals, the catalog views nobody talks about) takes years to learn and a book to lay out. What’s here is less than the minimum a DBA assumes you already know - enough to triage with confidence and escalate with specifics, not enough to skip the call to whoever owns the database.

It's Almost Always the Queries, Part I: Why Metal Doesn't Help

Sat, 16 May 2026 00:00:00 +0000

TL;DR

Infrastructure alerts on a relational database almost always trace to query and schema choices, not capacity. Scaling the box rents the bug. The exceptions are real but smaller than most teams assume.

What unites replication lag, CPU at 100%, dashboard timeouts, disk filling, and the server that crashes every Tuesday afternoon? Almost always the same thing: bad queries. Throwing metal at it fixes the symptom, leaves the cause, and rents back the same outage at the next traffic threshold.

A team added a third read replica because the primary was at 95% CPU. Lag got worse, not better. The slow-query log was empty because the threshold was 100 ms and the offending statement averaged 40 ms. pg_stat_statements sorted by total_exec_time showed it on the first row: SELECT COUNT(*) FROM orders WHERE status = 'open', fired by the status-filter dropdown on the orders page, roughly 600 calls per second at peak. Forty milliseconds becomes 24 CPU-seconds per wall-second the moment a few hundred users land on that page in parallel. The same shape is documented publicly in Rails Admin issue #2699 from August 2016, where COUNT(*) on tables with 1-10 million rows ran 10-20 seconds and made the admin dashboard unusable. Part III walks the CPU case in depth; Part II is the troubleshooting playbook that gets you there, and this article is the framing for the whole series.
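The arithmetic is worth keeping as a one-liner. It assumes each call is busy on CPU for its full duration, which is roughly true for a COUNT(*) served from cache and an overestimate for anything waiting on I/O or locks.

```sql
-- Calls per second times average milliseconds, expressed as CPU-seconds per wall-second.
SELECT
    calls_per_sec,
    avg_ms,
    calls_per_sec * avg_ms / 1000.0 AS cpu_seconds_per_wall_second
FROM (VALUES
    (600,   40),   -- the orders dropdown above: 24
    (10000,  5)    -- the Step 6 example in Part II: 50
) AS workload(calls_per_sec, avg_ms);
```

Read the result as "this many cores kept fully busy by one statement", which is why neither case ever trips a slow-query threshold.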

The obvious fix and why it buys you weeks

Bigger instance. More replicas. Faster IOPS. Every one of those is a real lever, and on a sufficiently bad day, the right immediate move. They share a property the postmortem usually skips: each rents capacity proportional to the bug’s cost, and the bug stays. The 40 ms COUNT(*) eats half the share of a box twice the size, but the cost is still proportional to traffic, and traffic only goes one direction. Six months later the same team is sizing up the box again, and the dropdown is still firing 600 times a second.

I know, I know - digging into the query, reading the plan, refactoring the ORM call is engineering time, and engineering time looks more expensive than a bigger instance. On the day of the incident, the math holds. A quarter later it stops holding, when the same dropdown is firing 900 times a second instead of 600 and the bigger box is back at 95%. You are going to deal with it. The only question is whether you spend one SME hour now, or whether you keep paying the surcharge that grows with traffic and end up spending more time on it over the year than the tuning would have cost in the first place.

Four symptoms, one cause

Almost every infrastructure alert on a relational database has a hardware-shaped reading and a query-shaped reading. The query-shaped reading is right more often. The four symptoms below each get their own post in this series; what follows is the map, not the territory.

CPU pegged at 100% usually means an aggregate or lookup that runs cheaply per call and fires under heavy concurrency: a dashboard COUNT(*), an unread-count badge that re-renders on every page load, an N+1 from the ORM inside a list view that nobody noticed. Sort pg_stat_statements by total_exec_time, not mean_exec_time, and the offender is on the first page. MySQL’s equivalent is performance_schema.events_statements_summary_by_digest ordered by SUM_TIMER_WAIT; same trick, different schema. Part III.

Memory pressure is rarely a workload that needs more RAM. More often it’s work_mem multiplied by connection count: each sort or hash node claiming up to its full allocation, multiplied by a few hundred Rails workers, blows past whatever the instance has. The MySQL shape is the same with per-thread buffers (sort_buffer_size, join_buffer_size, tmp_table_size) multiplied by max_connections. Bigger box, same multiplier, same alert. Part IV.
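A quick way to see the multiplier on PostgreSQL, as a sketch. It assumes every connection runs exactly one sort or hash node at full work_mem, which is a simplification.

```sql
-- PostgreSQL: what work_mem adds up to if every connection uses one full allocation.
SELECT
    current_setting('work_mem')        AS work_mem,
    current_setting('max_connections') AS max_connections,
    pg_size_pretty(
        pg_size_bytes(current_setting('work_mem'))
        * current_setting('max_connections')::bigint
    ) AS one_allocation_per_connection;
```

The real ceiling can be several times higher, because a single query can hold one allocation per sort or hash node and per parallel worker. Compare the result against the instance's RAM minus shared_buffers, not against total RAM.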

Disk filling and IOPS saturation usually mean bloat (no-op UPDATEs producing dead tuples faster than autovacuum cleans them), audit tables without retention, or non-SARGable predicates and coverage that broke when a column got added to the SELECT, forcing random heap fetches that look like an IOPS shortfall. InnoDB has the same shape with undo-log growth under long-running transactions starving purge, and with random reads from non-clustered secondary indexes that force a clustered-index lookup per row. Provisioning more IOPS works, and leaves the access pattern intact. Part V.
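For the bloat half of that list, PostgreSQL exposes dead-tuple estimates per table. A sketch; the numbers come from the statistics collector and are estimates, not exact counts.

```sql
-- PostgreSQL: tables carrying the most dead tuples, with size and last autovacuum.
SELECT
    relname,
    n_live_tup,
    n_dead_tup,
    round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
    last_autovacuum,
    pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```

A large table with a high dead_pct and an old or empty last_autovacuum is the one to look at before provisioning more storage or IOPS.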

Replication lag presents as a replica problem and is almost always a writer problem. Long transactions hold back replay. Over-indexed tables under heavy UPDATE traffic produce write amplification; the same WAL stream replays single-threaded on every replica. ORMs that re-write every column on every save produce no-op WAL records that every replica then applies. MySQL has the same pattern through the binlog, with one applier thread per replica by default before MySQL 8.0.27; parallel replication helps on independent workloads but rarely closes the gap on write-heavy ones with intra-transaction dependencies. Long-held locks blocking unrelated work compound the same way. Adding replicas makes it worse, not better. Part VI.
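Checking the writer first can look like this on PostgreSQL 10 or later, run on the primary. A sketch; the aliases are mine.

```sql
-- How far behind is each replica, in bytes of WAL and in time?
SELECT
    application_name,
    client_addr,
    state,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_backlog,
    replay_lag
FROM pg_stat_replication;

-- What has been open longest on the writer?
SELECT
    pid,
    state,
    now() - xact_start AS xact_age,
    LEFT(query, 80)    AS query_snippet
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND pid != pg_backend_pid()
ORDER BY xact_start
LIMIT 5;
```

If the backlog grows on every replica at once, the second query is the more useful of the two, because the cause is on the writer.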

Read the top-10 before opening the cloud console

The discipline is mechanical. Before touching the instance type, the replica count, or the IOPS budget, pull the top-10 from pg_stat_statements sorted by total_exec_time and diff against last week. If the offender is new (a recently shipped feature, a new dashboard tile, an admin tool someone built last quarter) the fix is at that callsite. If the offender has been there the whole time and is only now problematic, traffic crossed a threshold the query couldn’t hold. Either way, the action is at the query, not the box.
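"Diff against last week" needs a last week to diff against, because pg_stat_statements only holds running totals. One low-tech way to get there is a snapshot table filled on a schedule. The stmt_snapshot name and the weekly cadence are assumptions; total_exec_time requires PostgreSQL 13 or later.

```sql
-- One-time setup: an empty table shaped like the columns we want to keep.
CREATE TABLE IF NOT EXISTS stmt_snapshot AS
SELECT now() AS captured_at, queryid, dbid, userid,
       LEFT(query, 200) AS query, calls, total_exec_time
FROM pg_stat_statements
WITH NO DATA;

-- Run on a schedule (cron or similar).
INSERT INTO stmt_snapshot
SELECT now(), queryid, dbid, userid, LEFT(query, 200), calls, total_exec_time
FROM pg_stat_statements;

-- Top 10 by time spent since the most recent snapshot that is at least a week old.
WITH last_week AS (
    SELECT DISTINCT ON (queryid, dbid, userid) *
    FROM stmt_snapshot
    WHERE captured_at <= now() - interval '7 days'
    ORDER BY queryid, dbid, userid, captured_at DESC
)
SELECT
    LEFT(cur.query, 80)                                              AS query_snippet,
    cur.calls - COALESCE(prev.calls, 0)                              AS calls_since,
    (cur.total_exec_time - COALESCE(prev.total_exec_time, 0))::bigint AS total_ms_since,
    prev.captured_at IS NULL                                         AS new_this_week
FROM pg_stat_statements AS cur
LEFT JOIN last_week AS prev USING (queryid, dbid, userid)
ORDER BY total_ms_since DESC
LIMIT 10;
```

Counters restart from zero after a server restart or pg_stat_statements_reset(), so negative deltas mean the baseline is stale, not that the query got cheaper. Rows flagged new_this_week are the recently shipped callsites; the rest are the ones that traffic caught up with.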

The trade-off this advice doesn't fix

Query tuning is cheap in dollars and expensive in SME hours. A team without a database specialist and with a deadline in two weeks does not have the headcount to read an execution plan, refactor an ORM call, and verify the fix under load. For that team, scaling the box is the right move, and the bug stays on the backlog as planned debt. The article’s framing assumes you have, or are willing to develop, the skill to read pg_stat_statements and EXPLAIN ANALYZE output. Without that skill, capacity is what you can buy; query understanding is what you can’t.

Asking Claude or another LLM is a real option for narrow questions (“what does this EXPLAIN ANALYZE mean?”, “is this index doing what I think?”) and worth using as a first pass. It hallucinates more on architecture than on syntax, and the only thing standing between that and a worse outage is whether someone on your team can read what it produced and tell when it’s wrong.

When this doesn’t apply

In an early-stage startup where four engineers are doing four jobs each and the storage layer is one of fifteen things on someone’s plate, reading pg_stat_statements weekly is not where the next dollar of engineering time goes. Scale the box. The cloud upsell exists for a reason, and at that stage the bug stays on the backlog as planned debt while the team finds product-market fit. The trade-off is honest as long as someone knows the debt is there.

The version of this that hurts later is a data-heavy company building without anyone who owns the storage layer. If the product is fundamentally about reading and writing data (OLTP-heavy SaaS, analytics-adjacent dashboards, event ingestion at volume, anything that touches embeddings or vector search), the schema and access patterns chosen in the first six months decide what is available to build on for the next three years. Without an SME on the foundation, the team ships a model the workload can’t actually run, and the same query-shaped failures arrive on a much shorter timeline than the founders planned for. The cheap version of doing this right is hiring or contracting someone who has seen this fail before, before the schema is hardened by code that depends on it.

What the next five parts cover

Part II is the troubleshooting playbook: what to open first when an alert fires, the built-in views worth knowing (pg_stat_activity, pg_stat_statements, pg_locks on Postgres; performance_schema.threads, data_locks, events_statements_summary_by_digest on MySQL), the handful of custom views worth saving for the next incident, and when a third-party tool like pganalyze, PMM, or Datadog DBM earns its cost. Part III takes the CPU case in detail: why sorting by total_exec_time finds the offender that mean_exec_time hides, and how MVCC visibility makes unbounded COUNT(*) the canonical example. Part IV is memory pressure and work_mem math. Part V is disk and IOPS: bloat, retention, fillfactor, the access patterns that look like a storage shortfall. Part VI is replication lag, where the fix is always on the writer. Each post stands on its own; reading them in order makes the pattern visible.

Exposing Data to an Agent: MCP vs API

Fri, 15 May 2026 00:00:00 +0000

TL;DR

MCP is a wire protocol; what sits behind it decides the blast radius. In non-prod, pointing it at the database tends to be fine, because unbounded exploration is worth more than the occasional mistake. In prod, the shape that holds up is having the MCP server’s tools call an agent-specific API that enforces allowlisted operations, row caps, column masking, and per-prompt audit, rather than the database directly. The version that points at the database tends to surface later as a privacy incident.

Note

This is about the third-party database MCP servers from public registries (Postgres, MySQL, MongoDB, Redis, Elasticsearch), whose load-bearing tool is query(sql_string) against whatever connection they were configured with. A custom MCP server you wrote to wrap your own API is a different shape and isn’t the argument here.

A revenue dashboard agent runs against production through the MCP server the analytics team stood up last quarter. Marketing asks for enterprise signups in Q1 with their account contacts. The agent generates SELECT id, email, phone, last_login_at, plan, mrr FROM users JOIN subscriptions ... WHERE created_at >= '2026-01-01' AND plan = 'enterprise', and 2.3M rows come back. The agent truncates the chat-side display to the first fifty. The full result set leaves the database, crosses the MCP server, and lands in the conversation history the model provider keeps for thirty days. The connection that ran the SELECT held a slot on the read replica for fourteen minutes before the proxy reaped it, and p99 read latency for the customer-facing dashboard tripled over that window. The audit log records one MCP call from mcp-readonly@analytics. No prompt, no agent identity, no user attribution. The post-mortem has six unanswered questions.

Read-only doesn’t bound any of this

The patch the post-mortem will land on in fifteen minutes is “make the MCP connection read-only.” The connection already was. Read-only restricts the verb set, and every failure above happened on SELECT.

A read-only SELECT against a 50M-row table is still a SELECT, with the same cost on the replica. Read access on users is read access on users.password_hash and users.api_token. The corruption floor that If Your Guardrail Is a Prompt describes eventually emits a query against a table the agent had no business touching, and read-only lets it through. And every row the agent reads becomes part of the context window the model provider keeps for thirty days, regardless of what your privacy policy says.

The verb was never the surface. The catalog is.

MCP is the wire, the endpoint is the policy

MCP is a tool-surface protocol. The standard database MCP server exposes query(sql_string): the model writes SQL, the server forwards it to whatever connection it was configured with. That makes the MCP server a conduit between the model and the catalog. The agent’s effective permissions are the connection’s, the agent’s query surface is every SQL statement the connection can run, and the audit trail is one row per call from one identity with the SQL as the only payload. A pre-AI audit log could treat that as sufficient; an AI-era audit log can’t.

Note

The protocol isn’t the problem. MCP solves a real coordination problem: how a model discovers and calls tools across hosts, harnesses, and vendors. What you put on the other end is the part that decides whether you’ve exposed a database or an API.

A SQL conduit also makes the silent-failure shapes from What AI Gets Wrong About Your Database reachable from a chat window: JOIN paths against tables the model inferred from names, status = 1 filters where 1 means “pending” not “active”, unconstrained bridge tables that multiply rows. None of it requires write access, and all of it lands in the model provider’s trace.

The thing you want on the other end of MCP is an API. Not your customer-facing API. An API written for the agent: a list of operations it can call, with parameters, shaped responses, per-operation entitlements, row caps, timeouts, column masking, and an audit trail that records the agent identity and the prompt that produced the call. The agent never composes SQL. It calls get_enterprise_signups(quarter, plan) and gets back an aggregated result.

What the agent API looks like

Named operations, not raw SQL. get_revenue_by_segment(quarter, segment), list_active_enterprise_accounts(limit, cursor), get_customer_summary(customer_id). The agent picks from a menu the platform team curated. Operations get added when an analysis pattern proves useful enough to commit to a stable interface.

Responses shaped for the agent, not for the application. A revenue-by-segment call returns aggregated totals, not the 2.3M rows behind them. The shape is token-budget aware: a top-N list with totals beats a paged row dump.

Column-level masking inside the API. Email becomes a domain plus a hash. Account IDs are opaque tokens the API resolves on the next call, not database primary keys. Sensitive columns are gated by per-operation entitlements granted explicitly to the agent identity.

Row caps and statement timeouts the API enforces. Every operation has a hard cap on rows and database time. Caps live in code the API team owns, not in the prompt. If an operation needs higher caps, the cap is raised for that operation, not the connection.

Per-call audit with prompt provenance. Every call records the agent identity, upstream user, operation, parameters, response shape, row counts, latency, and the prompt that produced the call. Six months later, “who ran the query that leaked the enterprise customer list” is two SELECTs away.

Per-agent rate limits. Agents loop. Agents retry. The API budgets calls per identity, per operation, and per database time. The budget is a backstop on cost, on the replica, and on the model provider’s trace volume.
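To make two of those properties concrete, here is a sketch of the SQL that might sit behind such an API on PostgreSQL 11 or later: a masked projection for the email rule and an audit table carrying the fields listed above. Every table, view, and column name is hypothetical, and the API service itself (routing, entitlements, rate limits) is out of scope here.

```sql
-- Masked projection the API reads instead of the base table.
-- "Email becomes a domain plus a hash."
CREATE VIEW agent_users_masked AS
SELECT
    split_part(email, '@', 2)                                          AS email_domain,
    LEFT(encode(sha256(convert_to(lower(email), 'UTF8')), 'hex'), 16) AS email_hash,
    plan
FROM users;

-- One row per agent call, with prompt provenance.
CREATE TABLE agent_audit (
    id             BIGSERIAL   PRIMARY KEY,
    called_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    agent_id       TEXT        NOT NULL,
    upstream_user  TEXT        NOT NULL,
    operation      TEXT        NOT NULL,
    parameters     JSONB       NOT NULL,
    response_shape TEXT        NOT NULL,
    row_count      INTEGER     NOT NULL,
    latency_ms     INTEGER     NOT NULL,
    prompt         TEXT        NOT NULL
);

-- Six months later: who called the operation that returned the customer list?
SELECT called_at, agent_id, upstream_user, row_count, prompt
FROM agent_audit
WHERE operation = 'list_active_enterprise_accounts'
  AND called_at >= now() - interval '6 months'
ORDER BY called_at;
```

An unsalted hash of an email address can be reversed with a dictionary of known addresses, so a real implementation would likely use a keyed hash held by the API. The view only shows where the masking lives: in code the API team owns, not in the prompt.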

Warning

Don’t reuse your customer-facing API for this. Your customer API is shaped for an authenticated user reading their own data. The agent API is shaped for a service account reading across users, returning aggregates rather than rows, masking PII by default, and logging every call against a prompt. Two consumers, two contracts. One API that tries to serve both ends up either too permissive for customers or too restrictive for agents.

The MCP server’s tools then become thin wrappers over the API. Each MCP tool corresponds to one API operation. The agent sees get_revenue_by_segment as a tool; under the hood it’s an HTTP call to a service that talks to the database with its own pool, its own identity, and its own rules. The model never speaks SQL to anything.

What you get for the work

Control over what’s exposed, including the catalog. The API is the curated surface; what isn’t on the surface isn’t reachable. PII is masked or omitted by default, sensitive tables don’t have an operation, and the system catalog (information_schema, pg_catalog, MongoDB’s listCollections) never reaches the agent. Hide the catalog and you hide the menu of mistakes the model can make. The same surface-narrowing pays a partial dividend on prompt injection: an instruction smuggled into a document the agent reads has no query(sql) tool to hijack, only the operations on the menu.

Observability. Who called, when, with what parameters, against what prompt, returning what row counts. You can see which agents are over-fetching, which operations are getting hammered, which prompts produce weird call patterns. Patterns drive the next iteration: the operation called twenty times an hour gets cached, the one that always returns a million rows gets a tighter cap.

Throttling in a layer the database doesn’t reach. Per-agent, per-operation, per-minute, with hard backpressure during a customer-facing incident. This matters most when the agent is pointed at a primary: it shares a connection pool and CPU budget with the customer-facing write path, and a runaway loop or deep aggregation can move primary CPU enough to slow checkout. Statement timeouts on the database alone don’t help, because most of the damage lands in the first ten seconds. The API can apply the throttle at the call boundary, before the SQL reaches the connection: per-agent QPS caps, per-operation concurrency limits, a circuit breaker on customer-facing latency.
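
The circuit breaker on customer-facing latency could be sketched roughly as follows. The class name, the p99 input, and the threshold and cooldown values are assumptions for illustration; where the latency signal comes from is left to whatever metrics pipeline already exists.

```python
import time


class LatencyBreaker:
    """Rejects agent calls for a cooldown period after customer latency spikes."""

    def __init__(self, threshold_ms, cooldown_sec, clock=time.monotonic):
        self._threshold_ms = threshold_ms
        self._cooldown_sec = cooldown_sec
        self._clock = clock                   # injectable so tests can control time
        self._open_until = float("-inf")      # breaker starts closed

    def record_customer_latency(self, p99_ms):
        """Feed in the latest customer-facing p99; trip the breaker if too high."""
        if p99_ms > self._threshold_ms:
            self._open_until = self._clock() + self._cooldown_sec

    def allow_agent_call(self):
        """Checked at the call boundary, before any SQL reaches a connection."""
        return self._clock() >= self._open_until
```

The check runs before the API takes a connection from the pool, which is the part a database-side statement timeout cannot do.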

Where MCP-direct still earns its keep

Local development against a seeded test database. Nightly-refreshed sanitized snapshots of production with PII stripped. CI integration tests against ephemeral databases built from fixtures. Single-operator setups where the agent’s permissions are explicitly the operator’s. In all four, the cost of a mistake is bounded, and the loop of asking any question and throwing the answer away is the point of the environment. Patterns that prove useful in dev or snapshots get promoted to operations on the prod API; the rest stay in dev.

The dividing line is who pays the cost of a mistake. If it’s the same person running the agent, MCP-direct is fine. If it’s a customer whose contact list just got absorbed into a model provider’s training-eligible context buffer, MCP through the API. A two-engineer team with one agent and one use case can defer the API, but they’ll feel the cost the first time a second agent shows up or the first time a privacy review asks where customer data has been read from.

If MCP-direct, harden the database side

When the team picks MCP-direct in prod anyway, the database layer has knobs worth turning on. None substitute for an API. All are cheap.

A dedicated database user for the MCP connection. Not the analytics role, not an existing service account, not anything with grants accumulated over years. The agent’s user gets its own grants and an audit-log identity that names a single purpose.

Per-schema and per-table grants. PostgreSQL’s REVOKE ALL ON SCHEMA ... FROM PUBLIC is the underused default. The agent’s role gets read on a small set of schemas (often a dedicated analytics schema of shaped views) and no access to schemas holding credentials, secrets, audit logs, and the system catalog. PostgreSQL has no DENY statement, so the exclusion is enforced by revoking from PUBLIC and never granting to the role.

Column-level masking via views or row-level security. A view over users that hashes email and omits password_hash, api_token, and phone closes most PII exfiltration in five minutes. RLS policies on tenant-scoped tables enforce a single-tenant read by default.

Aggressive statement timeouts and connection caps. statement_timeout set per role at five or ten seconds kills runaway aggregations before they can hold replica CPU for long, and idle_in_transaction_session_timeout at the same order closes sessions that sit idle inside an open transaction. Connection caps via PgBouncer prevent the agent from monopolizing the pool during a retry storm.
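
To make the knobs concrete, here is a small sketch that renders the PostgreSQL statements for a dedicated agent role. The role name, schema names, and default values are assumptions. The role-level CONNECTION LIMIT is a swapped-in, database-side cap, not the PgBouncer cap described above, and the masking views are left out because they depend on the real column list.

```python
import re

_IDENTIFIER = re.compile(r"^[a-z_][a-z0-9_]*$")


def _ident(name):
    """Accept only plain lowercase identifiers so the rendered SQL stays safe."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def hardening_statements(role, read_schemas, statement_timeout="5s",
                         idle_tx_timeout="10s", connection_limit=5):
    """Render DDL for a single-purpose, read-only agent role (illustrative)."""
    role = _ident(role)
    statements = [
        f"CREATE ROLE {role} LOGIN CONNECTION LIMIT {int(connection_limit)};",
        f"ALTER ROLE {role} SET statement_timeout = '{statement_timeout}';",
        f"ALTER ROLE {role} SET idle_in_transaction_session_timeout = '{idle_tx_timeout}';",
        "REVOKE ALL ON SCHEMA public FROM PUBLIC;",
    ]
    for schema in read_schemas:
        schema = _ident(schema)
        statements.append(f"GRANT USAGE ON SCHEMA {schema} TO {role};")
        statements.append(f"GRANT SELECT ON ALL TABLES IN SCHEMA {schema} TO {role};")
    return statements


if __name__ == "__main__":
    # Hypothetical role and schema names.
    print("\n".join(hardening_statements("agent_mcp", ["analytics"])))
```

Reviewing the rendered statements before applying them keeps the grant list visible in one place, which is most of the value.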

The bigger picture

The pattern is the one every public-facing system already settled into a decade ago: you don’t expose the database to the internet, you put an API in front. The agent is a new principal that deserves the same treatment. MCP is the transport, the way HTTP is the transport for your frontend. Transports don’t make policy. Pointing MCP at a database makes the database the endpoint, and the database has no concept of an agent identity, a prompt, or a column-level mask for a non-human caller.

Building the agent API is the ideal case of an internal tool an AI agent can write quickly: greenfield code, one team owning the contract, low blast radius, replaceable v1, sandbox available for the first cut. A day or two with a coding agent rather than the quarter-long platform initiative it would have been in 2022. It’s testable, observable, and the thing that lets you point MCP at production without filing a privacy incident the following Tuesday.

The 10x Is Real, on Internal Tools You'd Otherwise Never Ship

Wed, 13 May 2026 00:00:00 +0000

TL;DR

AI coding agents do hit 10x or better on a specific slice of work: greenfield internal tools where the agent authors the code and the conventions it later re-reads. Outside that envelope (existing services, cross-team consumers, regulated code paths) the gain compresses to 10–30% at best and turns net negative on mature codebases, because verification cost dominates typing cost.

A DBRE has had a MySQL binlog purge script in the back of their head for six months. Current process: every other Friday, run SHOW REPLICA STATUS against each replica to find the oldest binlog any of them is still reading from, take the minimum source-log-file across the fleet, factor in any replica that is lagging, then connect to the primary and run mysql -e "PURGE BINARY LOGS TO 'mysql-bin.XXXXXX'" with a safe cutoff. By hand, in a notebook, while making sure no replica is far enough behind that the cutoff would yank a binlog the replica still needs and break replication.

Pre-AI estimate to script it properly: half a day, with the replica-position scan across the fleet, the safe-cutoff math, a dry-run mode, and the README so the next on-call knows what it does. Never made the sprint.

Friday afternoon with a coding agent: a small Go binary that walks each replica, parses SHOW REPLICA STATUS, computes the minimum source-log-file with a configurable safety margin, refuses to act if any replica is more than 30 seconds behind, runs PURGE BINARY LOGS TO on the primary, supports --dry-run, emits structured logs, ships with a README the agent wrote in the same pass, and has a one-shot CI job that exercises it against a sandbox primary plus replica. Forty minutes including the test. Ships Monday. The twenty minutes a week of toil it removes is the kind of work that has never made anyone’s quarterly goals.
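
The safe-cutoff math is the only part of that tool with real logic in it, and it fits in a few lines. The tool in the story is a Go binary; this is a Python sketch of the same calculation, not that tool's code. It assumes each replica's SHOW REPLICA STATUS row has already been fetched into a dict, that binlog names look like mysql-bin.000123, and that the safety margin is counted in whole files.

```python
import re

_BINLOG_NAME = re.compile(r"^(?P<base>.+)\.(?P<index>\d+)$")


def safe_purge_target(replicas, margin_files=2, max_lag_sec=30):
    """Return the file name to pass to PURGE BINARY LOGS TO, or raise.

    `replicas` is a list of dicts holding the Source_Log_File and
    Seconds_Behind_Source columns from SHOW REPLICA STATUS.
    """
    if not replicas:
        raise ValueError("no replica status rows supplied")
    bases, indexes = set(), []
    for row in replicas:
        lag = row["Seconds_Behind_Source"]
        # NULL lag means replication is not running; treat it as unsafe.
        if lag is None or lag > max_lag_sec:
            raise RuntimeError(f"replica lag {lag!r} exceeds {max_lag_sec}s, refusing to purge")
        match = _BINLOG_NAME.match(row["Source_Log_File"])
        if match is None:
            raise ValueError(f"unrecognized binlog name: {row['Source_Log_File']!r}")
        bases.add(match["base"])
        indexes.append(int(match["index"]))
    if len(bases) != 1:
        raise ValueError(f"replicas disagree on binlog base name: {sorted(bases)}")
    # PURGE ... TO keeps the named file, so step back by the margin and
    # never go below the first possible file index.
    target = max(min(indexes) - margin_files, 1)
    return f"{bases.pop()}.{target:06d}"
```

With replicas reading mysql-bin.000120 and mysql-bin.000118 and a two-file margin, the function returns mysql-bin.000116. A dry-run mode would print that name instead of issuing the purge.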

Where the multiplier comes from

“AI is faster everywhere now” is the reflexive read of the scenario above, and it’s the wrong read. The same agent on the customer-facing payments service, modifying a checkout flow three years old with four other teams reading the code, lands at 10–20% faster on a good day and net negative on a bad one. The numbers in the literature back this up.

Stanford’s Software Engineering Productivity research on a corpus of more than 100,000 developers found AI gains of roughly 30 to 35 percent on low-complexity greenfield tasks, 10 to 15 percent on high-complexity greenfield, and brownfield work compressing further from there. Google’s 2025 DORA State of AI-Assisted Software Development report found 90% of developers using AI and over 80% reporting it made them more productive, while organizational software delivery metrics stayed flat. Individual perceived productivity is not translating into faster delivery to customers. DORA’s authors describe AI as an amplifier of existing engineering capability and flag a negative relationship between AI adoption and software delivery stability, with 30% of developers reporting little or no trust in the code AI generates. The verification tax is real: time saved on creation is getting spent on audit.

METR’s July 2025 controlled study on sixteen experienced open-source developers in repositories averaging a million-plus lines of code and a decade old found something stranger. The developers were 19% slower with AI than without, and they thought they were 20% faster. Mature codebases are an antagonistic environment for an agent and a confusing one for the human paired with it.

“AI” by itself doesn’t earn the 10x. Four other conditions have to stack on top of it: greenfield, AI-authored conventions, small audience, fast feedback. Remove one and the math compresses to single-digit percentages. Remove two and the agent is net negative.

Greenfield is cheap context

An existing codebase has conventions the agent has to infer from reading. The error-handling pattern, the test-fixture convention, the logger wrapper everyone uses, the half-deprecated config layer the new code is supposed to use instead. Some of this is written down in a CONTRIBUTING file from 2022 that is now wrong in two places. Most of it lives in tribal knowledge. The agent reads twelve files and guesses, sometimes wrongly, and the engineer’s time goes into correcting the guess. The context tax is real and it doesn’t show up on any productivity dashboard.

Greenfield collapses that tax to zero. The agent writes the first file and decides the convention, because there is no prior convention to defer to. Error handling is whatever the first handler returned. Logging is whatever the agent picked in the first module. The pattern propagates forward because the agent is reading its own work on every subsequent call. The convention isn’t reliable in the LLM sense (nothing the model does is reliable), but the distribution narrows considerably when the only style the agent has seen in this repo is the style it wrote yesterday. Less to hallucinate about. Fewer plausible alternatives competing for the next token.

The agent reads its own code and stays in pattern

The cleaner version of this story is that the agent authors both the code and the docs and re-reads its own docs on the next session, so everything stays coherent. That’s wishful. Agents leave READMEs stale routinely. A session edits the code, claims success, and silently leaves the documentation pointing at a function signature that’s been renamed since. Pretending otherwise is the same “AI is reliable now” framing that fails the moment someone trusts it.

The real driver is smaller and more mechanical. On a small greenfield repo, the agent’s first move on a new session is usually to scan the files that are already there. The code it reads is code the agent wrote last week. The patterns it produces in this session mirror what it finds in the existing files, because the model is doing what models do: sampling tokens that match the recent style it’s already seen in the context. Error handling looks the same in module five as it did in module one because the agent read module one before writing module five. Logging conventions, naming, test layout: all of it propagates from re-reading the code, not from a doc the agent has any particular discipline about.

Warning

Treat any AI-authored README as a build artifact, not a maintained source of truth. Agents skip README updates routinely, and a stale doc is worse than no doc because the next session will follow the wrong instruction confidently. If docs are sticking around, regenerate them rather than maintain them, and don’t trust any README older than the most recent code change. The code is the source of truth. The moment the tool grows to where re-reading the code on each session stops being cheap, it has graduated out of this regime and needs the discipline of any other production system.
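
The "don't trust any README older than the most recent code change" rule is easy to turn into a check. This is a rough sketch with assumed names (readme_is_stale, the default file name, the suffix list). It uses file modification times as a proxy, which a fresh checkout can reset, so commit timestamps would be the sturdier signal in practice.

```python
from pathlib import Path


def is_stale(readme_mtime, code_mtimes):
    """True if the README is missing or older than any code file."""
    if readme_mtime is None:
        return True
    return any(mtime > readme_mtime for mtime in code_mtimes)


def readme_is_stale(repo_dir, readme_name="README.md", code_suffixes=(".go", ".py")):
    """Compare the README's mtime against every code file under repo_dir."""
    repo = Path(repo_dir)
    readme = repo / readme_name
    readme_mtime = readme.stat().st_mtime if readme.is_file() else None
    code_mtimes = [
        path.stat().st_mtime
        for path in repo.rglob("*")
        if path.is_file() and path.suffix in code_suffixes
    ]
    return is_stale(readme_mtime, code_mtimes)
```

Wired into the start of a session, a True result is the cue to regenerate the doc or ignore it, not to patch it by hand.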

The shape of the speedup is the same as why a single author writes a coherent short piece faster than three co-authors with the same word count. Coordination overhead is the multiplier, not raw output speed. In a human team, coordination is meetings, code review, RFC documents, the slow accumulation of shared style. In a single-agent greenfield repo, coordination is one session re-reading the small artifact it wrote yesterday and matching the patterns it finds. The cost of “what’s our convention here” is the cost of one repo scan.

Small audience, fast deploy, throwaway

The MySQL binlog organizer is used by one team. The AZ backup shipper is used by the storage on-call. The diagnostic API in front of Redis is curl’d by whoever is debugging tonight. None of these tools have customer SLOs, escalation paths, or a third-party integration to coordinate. v1 broken on Monday is fixed by Tuesday afternoon. The cost of a bug is bounded by how loudly someone says “this broke” in Slack.

Fast feedback is what makes the testing strategy work. A --dry-run flag and a sandbox replica are sufficient verification for the binlog purge script, because the worst case is the script crashes and the on-call reverts to the manual SHOW REPLICA STATUS procedure, which is what they were already doing. There’s no need for a 2000-row test suite. There’s barely a need for a runbook. The tool exists in the negative space between “the manual process” and “a properly engineered service” and both ends of that range are operationally fine.

And the tools are throwaway. If the binlog organizer turns out to be shaped wrong (the team wanted archival, the agent built categorization, the bucketing scheme is awkward) it gets thrown out and rewritten from scratch. Sunk cost is hours, not weeks. The agent doesn’t carry the same emotional attachment to the previous version’s design that a human author would, which makes the rewrite cheaper than the original. That’s the property that lets a platform team take more shots than they otherwise would. Most of the toil-removal scripts that sit in the backlog die there because the expected effort feels higher than the expected payoff. The 10x reframes the expected effort, and a noticeable chunk of the backlog suddenly clears its own bar.

The rewrite property has a second face: the replacement is often an upgrade, not just a regeneration. Most platform teams have the internal HTML dashboard from 2012 nobody wants to touch. Bootstrap, jQuery, server-rendered templates, no input validation on any of the forms, no real documentation, and one engineer left who remembers which buttons do what. Pre-AI estimate to modernize: a quarter that nobody schedules. Friday afternoon with a coding agent: a React frontend with form validation built in, a typed backend behind it, a README that documents what the endpoints actually do, and the input-validation gap that has been a low-grade footgun for three years closed by default, because the new stack treats validation as table stakes. The rewrite is cheaper than the original and an upgrade at the same time, because the agent is writing in a stack that already has the properties the old code lacked.

Note

The other property internal-only buys you: v2 ships side by side with v1. Both links go on the wiki, the team tries the new one in real work for a week, parity gets confirmed against the old one for the handful of operations anyone actually uses, and v1 gets deleted when nobody opens it anymore. No cutover plan, no customer comms, no parallel-infrastructure cost worth pricing. The team migrates itself. Try this rollout shape with a customer-facing service and the conversation goes very differently.

How to operationalize this without standardizing it to death

The temptation, as soon as the team notices the multiplier, is to standardize. Pick a Go scaffolding template, pick a logging library, mandate a directory layout, route everything through the platform team for review. Resist that. The standardization is where the multiplier collapses, because the moment two teams have to negotiate conventions, the agent is back in coordination-cost territory.

A workable shape: every team picks two or three toil-removers from their own backlog and commits to shipping them this quarter. Each tool is owned by one team. Code review is one teammate, same-day, with the explicit understanding that the review is checking the tool does what the README says and doesn’t delete prod data, not enforcing a corporate style guide. Testing is on a sandbox or staging instance, not in CI for two weeks. A Backstage plugin backend that surfaces deploy status for the storage team’s services lives in the storage team’s repo, with the storage team’s conventions, and gets reused by other teams only if the other teams want to depend on it (and accept the version it ships in).

The honest trade-off: tools built this way are good enough for one team and not good enough for the platform catalog. Some will need rewriting if they cross teams. That’s fine. The cost of rewriting is hours, and most of them will never cross teams anyway. The cost of insisting on cross-team-grade quality up front is that the diagnostic API the network team would have shipped in a Friday afternoon turns into a Q3 platform initiative, then a Q4 platform initiative, then nothing.

The same multiplier extends past engineering. Marketing wants to bulk-edit campaign tags. Customer success wants a dashboard joining ticket history against feature adoption. Finance wants a quarterly close helper. With a coding agent, the team that actually knows what shape they want can build the first version themselves. Engineering takes the tool over and hardens it only if it proves valuable enough to graduate. Anything that doesn’t prove itself disappears quietly, which is the right outcome for a prototype.

Warning

The sandbox is non-negotiable when non-engineers build their own internal tools. The environment has to be one where the tool cannot reach production credentials, cannot read customer data at production scale, and cannot expose anything to the public internet. A finance analyst prototyping a close helper against the prod database with their own AI agent is the worst-case version of this trend. The same prototype against pseudonymized data in a network-isolated environment is the version that pays off. What the platform team owes the business isn’t a scaffolding template. It’s a safe place to ship in. The infrastructure cost of getting the sandbox right is small. The cost of getting it wrong is the breach.

When the math runs the other way

The regime above doesn’t extend to every tool a platform team might build. The decision matrix:

Cross-team tools from day one. Two teams on the same internal tool means convention negotiation, which means coordination cost, which means the multiplier is gone. Build it the way you build any shared service: design review, versioned API, deprecation policy. The agent is still useful here. It is not 10x useful.
Regulated internal systems. HR, finance close, anything in SOX, SOC2, or HIPAA scope. The verification bar rises sharply because the audit trail has to survive an external reviewer, and AI-speed advantage compresses against the human review time the controls require.
Tools that touch customer data, even internally. A script that joins across users and subscriptions is a customer-facing risk regardless of who runs it. Read access to PII is read access to PII whether the caller is a customer-facing API, an analytics agent, or a backup shipper. Blast-radius arguments don’t get a discount for being operational.
Destructive operations on prod infra. A binlog purge script can break replication if it computes the cutoff wrong on a replica that’s silently behind. A snapshot shipper writes to S3 buckets other systems read from. The testing rigor required to ship destructive code pulls the productivity gain down. Still positive, often still 3x or 4x, but not the 10x of a pure read-only diagnostic.
Tools that drift into load-bearing. The orchestrator that started life as a Friday-afternoon “drain, snapshot, upgrade, verify, restore traffic” script and is now the deploy path for six services. Once a tool crosses that line it has graduated out of the throwaway regime and into the production-systems regime. The conventions need writing down, the test suite needs to exist, and the next human who edits the code needs to be able to read it without an agent’s help. The productivity question is no longer “how fast can the agent ship v1” but “how cheap is the system to operate at year three.” Different question. Different answer.

The bigger picture

The 10x envelope is narrow: greenfield, internal, replaceable, one team, fast feedback. It contains the toil backlog that has cost the team twenty silent minutes a week per uncreated tool for years, and the cost of writing those tools just dropped by an order of magnitude. The business case runs higher: when a non-engineer ships a tool with a coding agent, the multiplier is closer to 100x, because the pre-AI version of the tool doesn’t exist. Nobody ever wrote a ticket for it.

The most ambitious version of the regime is the agent API in front of the database: greenfield, owned by one team, conventions the agent sets, audience small enough that v1 wrong is recoverable. The team that has been deferring it because it sounds like a quarter of work has a path to ship it in days.

What doesn’t extend: customer-facing systems and customer data don’t tolerate the same speed. Once PII leaks into a model provider’s buffer, or a destructive script hits the wrong replica, you don’t get it back. The apology you’ll get is some flavor of “you’re absolutely right”. The phrase has no recall semantics.

Why You Should Use an Agent to Assist with Quarterly Reviews

Wed, 13 May 2026 00:00:00 +0000

TL;DR

Pulling six months of evidence from Jira, GitHub, Slack, and 1:1 notes is mechanical work an agent can do in an hour. Making the judgment calls (rating, promotion, raise, PIP, fire) stays with the manager and is the part that should consume the saved hours. Done with verification (every claim traced to source, no decision delegated) the bookkeeping that used to take a week takes a day, and the team’s tracking hygiene improves as a side effect because the agent only sees what’s in the system. Done without verification, the failure mode is a confidently wrong review of a real person.

Will Larson’s writeup of a typical performance cycle at scale puts calibration at three to five hours per round per participating manager, across three rounds (sub-organization, organization, executive). Calibration is the part where managers argue ratings against each other under a budget. It is not the part where the review gets written. The writing happens earlier: pulling six months of context out of 1:1 notes, Jira, GitHub, Slack, design docs, and PagerDuty into a coherent per-report narrative, which runs one to three hours per direct report and is mostly gathering and cross-referencing rather than deciding. A manager with eight reports spends most of a working week on the full cycle, and the judgment-call portion (rating, promotion recommendation, calibration argument) is a fraction of that hour count.

The standard responses (raise headcount, adopt a better template, push self-reviews onto reports) each redistribute the gathering cost without reducing it. The architectural fix is to script the gathering and use a narrow LLM call for the synthesis. The job is bounded, the inputs are structured, the output is a draft a human will edit. The same work pattern holds up across the agent-skeptical posts on this blog (alert triage, index management, LLM-driven SQL): scripted gathering, narrow LLM synthesis, human keeps every decision. The review is a particularly clean instance because the decision is human-only by definition. No agent can fire someone. No agent can sign off on a promotion. The agent’s job is the part nobody wants to do.

The five things the agent earns its cost on

The first thing it earns: the team starts using Jira. Work the team doesn’t track is work the team can’t talk about, and the review version of that argument is sharper than the standup version. If a project isn’t in Jira, it might not be in the review, because the manager can’t hold six months of work for eight people in their head and the agent only sees what’s tracked. People notice when the first draft of their review omits the project they worked on for half the quarter. The conversation that follows is “where did that work go in the system, and how do we make sure it’s there next time?” Tracking hygiene improves automatically as a side effect of the review process, without needing its own mandate.

The second thing it earns: completeness. The thing that always slips in manual reviews is the project from month two of the quarter that everyone has stopped thinking about. The migration that finished. The incident that got handled cleanly. The cross-team contribution that left a paper trail in someone else’s repo. The release-management work the IC quietly took over when the previous owner left. These show up in the corpus the agent reads. They don’t show up in the manager’s working memory in March. Reviews written from the corpus capture them; reviews written from memory don’t, and the gap shows up cumulatively over years as the engineers whose work is visible to the manager move ahead of the ones whose work isn’t.

The third thing it earns: estimation discipline becomes a review signal. The blog’s Point After the Fact post argues for re-pointing tickets after they close, so the team accumulates real data on where time actually goes. That data is review-grade material. Not “did they hit their point targets” (a metric the team will game within a quarter of being told about it) but how complete their record is and how well their post-hoc estimates correlate with what shipped. A report who consistently re-points after the fact, even when the numbers move against them, is demonstrating a discipline the agent can surface directly. A report who doesn’t is harder to evaluate at all, which is itself information.

The fourth thing it earns: GitHub activity, loosely weighted. The agent pulls PR counts, review participation, commit cadence, and approximate complexity (lines changed, files touched, languages crossed). This is loose data, and any review that uses it needs to be explicit about that. A staff engineer who unblocks the team on three substantive reviews a week is worth more than one who ships ten unreviewed PRs of their own. The numbers spot patterns; they do not grade. A draft that says “shipped 23 PRs and reviewed 14 from teammates, with review depth averaging 3 substantive comments per PR” is useful. A draft that grades the report on those numbers has overstepped, and the manager should ask the agent to remove the grade and re-present the data.

The fifth thing it earns: focus and follow-through patterns. Started-vs-finished ratios. Average time in-flight per ticket. Tickets opened and abandoned. Context switches per week per person. These surface patterns the manager would otherwise have to reconstruct from memory, and they’re patterns memory is bad at. Same caveat as the PR numbers. A high abandonment rate could be a focus problem or it could be an environment dragging the engineer between tickets every two days because nobody else is around to handle interruptions. The data spots the pattern. The manager makes the call about what the pattern means.
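
A sketch of how those follow-through numbers might be computed from exported ticket records. The field names (started_at, closed_at, resolution) and the "done" and "abandoned" values are assumptions about the export, not Jira's actual schema. The output is a pattern to look at, not a grade.

```python
from datetime import datetime


def follow_through(tickets):
    """Summarize started-vs-finished and time-in-flight for one person's tickets."""
    started = [t for t in tickets if t["started_at"] is not None]
    finished = [
        t for t in started
        if t["resolution"] == "done" and t["closed_at"] is not None
    ]
    abandoned = [t for t in started if t["resolution"] == "abandoned"]
    days_in_flight = [
        (t["closed_at"] - t["started_at"]).total_seconds() / 86400
        for t in finished
    ]
    return {
        "started": len(started),
        "finished": len(finished),
        "abandoned": len(abandoned),
        # None, not zero, when there is nothing to divide by.
        "finish_ratio": len(finished) / len(started) if started else None,
        "avg_days_in_flight": (
            sum(days_in_flight) / len(days_in_flight) if days_in_flight else None
        ),
    }
```

Returning None for an empty denominator keeps "no data" distinct from "zero", which matters when the reader is deciding what a number means about a person.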

The agent is the index, not the conclusion

Every claim in a good draft is a pointer into the corpus, not a finding about it. “Delivered the auth migration, 10 weeks (TICKET-4421, TICKET-4422, PR #1138)” is a draft worth editing. “Met expectations on delivery” is a draft worth rejecting. The first is something the manager can verify in fifteen seconds; the second is a conclusion the agent has no business making, and a conclusion the manager can’t trace without redoing the gathering work.
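
The pointer-versus-finding distinction can be checked mechanically before the manager reads a word. A minimal sketch, assuming citations look like the ones in the example above (an uppercase ticket key such as TICKET-4421, or "PR #1138"); the pattern and function name are illustrative. It only detects that a pointer is present, not that the pointer is right.

```python
import re

# Assumption: ticket keys are two or more uppercase letters, a hyphen, digits.
SOURCE_REF = re.compile(r"\b[A-Z]{2,}-\d+\b|\bPR #\d+\b")


def uncited_claims(draft_lines):
    """Return the non-empty draft lines that carry no ticket or PR reference."""
    return [
        line for line in draft_lines
        if line.strip() and not SOURCE_REF.search(line)
    ]
```

Lines that come back from this function are candidates for rejection or a re-prompt. Lines that pass still need each cited ticket opened and read.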

The architecture under all five is the same shape: deterministic gathering, narrow LLM synthesis, human decision. The gathering can be done two ways.

The first is MCP. Wire up off-the-shelf Jira, GitHub, Slack, and PagerDuty MCP servers and let the agent query each system directly. Lower setup cost, faster to a first draft, easier to extend when a new system gets added to the team’s stack. The agent decides what to query and how.

The second is scripts. Write Python (or any deterministic language) that hits each API and returns structured data as a CSV or JSON file before the agent ever sees it. Tickets closed per report with dates, points, and labels. PRs opened and reviewed with timestamps, approval counts, and lines changed. Time-in-flight distributions per ticket. Abandonment rates. On-call shifts and incident participation. The output is an artifact the manager can open, sort, and audit independently, and re-running the script reproduces the same numbers exactly.
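
A minimal sketch of the script side, covering only the PR counts. It assumes the PR records have already been fetched from the API into dicts with author and reviewers keys (hypothetical field names) and shows the part that matters for auditing: a deterministic summary and a CSV the manager can open.

```python
import csv
import io
from collections import Counter

FIELDS = ["person", "prs_opened", "prs_reviewed"]


def summarize_prs(prs):
    """Count PRs opened and PRs reviewed per person, sorted by name."""
    opened, reviewed = Counter(), Counter()
    for pr in prs:
        opened[pr["author"]] += 1
        # De-duplicate reviewers and ignore self-reviews.
        for reviewer in set(pr["reviewers"]):
            if reviewer != pr["author"]:
                reviewed[reviewer] += 1
    people = sorted(set(opened) | set(reviewed))
    return [
        {"person": p, "prs_opened": opened.get(p, 0), "prs_reviewed": reviewed.get(p, 0)}
        for p in people
    ]


def to_csv(rows):
    """Render the summary as CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Sorting by name and fixing the column order means two runs over the same records produce byte-identical files, so a diff between runs shows only real changes in the data.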

The trade-off is control. MCP is faster to stand up and weaker on verification. Scripts cost more upfront and produce a deterministic audit trail. For review-grade work where the cost of a wrong cited metric is paid by a real person, scripts are usually worth the upfront cost. For the qualitative exploration the agent does later in the same workflow (pull a specific 1:1 note, read one design doc), MCP is fine because verification by re-reading the source is what catches errors anyway. A reasonable middle path is scripts for anything that becomes a quoted number and MCP for anything the agent only needs to read once.

Either way, the agent gets the structured inputs, the report’s name, their role expectations, and the review period, and produces a draft with explicit sections: shipped projects with source links, focus and follow-through patterns with the metrics inline, collaboration evidence with PR-review counts and threads referenced, contradictions surfaced rather than smoothed. Do not ask for ratings. Do not ask for recommendations. The bright line is that the agent assembles and the manager decides, and the prompt enforces it.

When the gathering uses scripts, hallucinated numbers become the easiest class of agent error to catch, because the script that produced them is a source of truth the manager can re-run in seconds. A draft claiming “23 PRs and 14 substantive reviews” is one re-run away from being confirmed or rejected. With MCP, the verification surface is weaker: the agent’s report of a query result is what the manager sees, and reproducing the exact same fetch isn’t always trivial. Either way, keep the gathering deterministic where it can be. Keep the synthesis narrow. Don’t let the agent count things you can count for it.

What it can’t be allowed to touch

Every reason above is contingent on a verification discipline that deserves more space than the gathering architecture does. Frontier LLMs corrupt a measurable fraction of delegated multi-step work, and the rate doesn’t drop with better prompts, more tools, or longer context windows (Corruption Is a Feature). A draft review the agent writes confidently, citing a specific ticket, a specific PR, a specific 1:1 quote, has a meaningful probability of being wrong in any of those specifics. The ticket exists but says something different. The 1:1 line paraphrases what was said into something close-but-not. The PR review that counts as “substantive” was a thumbs-up emoji. The migration the agent attributes to the IC was actually led by their teammate, and the IC was a reviewer.

These errors land in a document that determines whether a real person gets promoted, doesn’t, gets put on a PIP, or doesn’t. The cost of getting this wrong is paid by the report, not the manager, which makes the verification discipline an ethical obligation, not just an operational one.

The failure mode that ends careers

A confidently written review citing fabricated specifics (“Q1 incident response on the payments outage”) that the manager doesn’t verify is the failure mode. The agent invents the attribution from a Slack thread it misread, or compresses three engineers’ contributions into one name, or quotes a 1:1 line that was said by a different report in a different week. The review is wrong, the manager signs it, and the person on the receiving end has no idea the underlying corpus contradicts what they’re reading. Verify every citation. Open every linked ticket. Read every quoted line in its original context. Re-run the script that produced any cited number. That’s the fastest verification surface and the one most likely to catch a fabricated metric.

Two disciplines hold the work together. The first is that every claim in the draft has to be traced to a source the manager opens and reads before the draft becomes a review. Not skimmed. Read. If the draft says “delivered the auth migration in Q1, ten weeks”, the manager opens the ticket, confirms the dates, confirms the scope. If the draft says “needs growth on cross-team collaboration”, the manager opens the threads cited as evidence and forms their own assessment. The agent’s draft is the index into the corpus, not the conclusion about it.

The second discipline is rejecting first drafts. The first generation always reads cleaner than the corpus actually is. Patterns get smoothed. Conflicting signals get harmonized into a coherent narrative that isn’t quite the truth. Re-prompt with the parts that look wrong. Ask the agent to surface contradictions it smoothed over. Ask for the strongest negative case before the draft contains any praise, and then for the strongest positive case in a separate pass. Read both. The first draft is a hypothesis. The third draft, after the manager has read the corpus and contested the obvious narrative, is closer to a review.

No decision belongs to the agent. Not the rating. Not the promotion recommendation. Not the raise. Not the PIP. Not the fire. The agent assembles. The manager decides. A draft that recommends a rating is a draft that has overstepped; reject it and re-prompt for the evidence underneath, without the conclusion.

When this doesn’t apply

The setup is overkill on small teams. A manager with three reports has the full corpus in their head and doesn’t need synthesis. The discipline pays back at six reports and up, and the curve gets sharper above ten. A new manager who doesn’t yet have judgment to verify against is in the most dangerous position, because the agent’s narrative will be the most coherent thing in the room and the temptation to trust it is highest exactly when it shouldn’t be. The agent assembles confidently regardless of how well it matches reality. Wait until you’ve written a few rounds of reviews manually before adopting this workflow. Orgs without basic systems hygiene (no Jira, scattered PR reviews, no design docs) don’t have the corpus the agent reads, and the synthesis is hollow. And teams where reviews are calibrated heavily on a single visible metric have already decided that the synthesis doesn’t matter; the agent’s draft is decoration in that environment.

The bigger picture

Scullen, Mount, and Goff (2000) found that idiosyncratic rater effects accounted for 62% and 53% of performance-rating variance across two large samples (n=2,350 and n=2,142), more than twice the variance attributable to actual ratee performance. Most of what makes a review is the manager, not the engineer. Scripts and structured corpora don’t eliminate that bias, but they pull the underlying evidence onto a surface the next reviewer in calibration can audit, which moves the argument off the manager’s narrative and onto the data.

The agent doesn’t fix the underlying performance-management problem either. Gallup’s 2024 survey of Fortune 500 CHROs found 2% strongly agree their system inspires employees to improve, and 22% of employees agree the process is fair and transparent. Those numbers have been some version of themselves for as long as anyone has measured. The agent makes the mechanical parts mechanical, which is the most an architecture decision can do here. The hours that buys back get spent on the parts the corpus can’t help with: the conversation with the report about growth, the calibration argument with peer managers, the year-out career planning, the things that take attention and presence rather than data.

Hidden Database Costs of an AI Rollout: Storage, CPU, Memory, and Cache

Sun, 10 May 2026 00:00:00 +0000

TL;DR

Adding RAG to your existing Postgres usually 5x’s the storage on the affected tables, drives the HNSW index off a memory cliff that doesn’t degrade gracefully, and pollutes the buffer cache hard enough that unrelated OLTP queries regress at p95. Halfvec, binary quantization with rerank, and a separate replica recover most of it, after the bill has already arrived.

A pgvector user opened issue #666 in September 2024. They had one million records, 512-dimensional vectors, an HNSW index. Cold-cache search took 83 seconds. Warm cache, the same query returned in roughly 100 milliseconds. Three orders of magnitude. The index had not grown. The query had not changed. What changed was that other applications running on the same Postgres evicted the vector index pages from cache, and the next search read the entire HNSW graph back from disk one block at a time. The capacity plan that approved the AI feature six weeks earlier had a line for storage and a line for CPU. It did not have a line for the OLTP buffer pool quietly fighting the vector index for residency, and losing.

The senior reader’s first response is “throw a bigger instance at it” or “stop running this on the OLTP Postgres and use a dedicated vector DB”. Both are right answers in some configurations. A bigger instance buys headroom for the working set without changing the slope of the cost curve as the corpus grows. A dedicated vector store removes the cache-pollution problem and adds a network hop, a second consistency model, and another piece of infrastructure to back up. The choice being made is between paying the cost on the OLTP Postgres in the form of a bigger instance and worse p95s, or paying it on a second system in the form of more operational surface area. The conversation that didn’t happen is the one about what the cost actually consists of and where it accrues. That is what the rest of this article is.

The four cost categories

Each pgvector vector value takes 4 * dimensions + 8 bytes. A 1536-dimensional embedding from text-embedding-3-small lands at 6,152 bytes. A 50-million-row table that occupied 40 GB on its existing columns becomes 350 GB after the embedding column is added, before any index. Supabase published a case study in August 2023 on 224,482 embeddings where Postgres RAM consumption went from 4 GB on 384-dimensional vectors to 7.5 GB on 1536-dimensional vectors, on the same hardware and the same row count. The dimensions alone were the difference.
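The sizing arithmetic above is easy to reproduce before the column ships. A minimal sketch, using only the per-value formula and the row counts quoted in this section; the function names are illustrative, and decimal gigabytes (10^9 bytes) with no tuple or TOAST overhead are simplifying assumptions.

```python
def vector_bytes(dimensions: int) -> int:
    """Bytes per pgvector `vector` value: 4 bytes per dimension plus 8 bytes of overhead."""
    return 4 * dimensions + 8


def table_size_gb(base_gb: float, rows: int, dimensions: int) -> float:
    """Table size in decimal GB after adding an embedding column, before any index."""
    return base_gb + rows * vector_bytes(dimensions) / 1e9


per_value = vector_bytes(1536)                    # 6,152 bytes
after_gb = table_size_gb(40, 50_000_000, 1536)    # about 348 GB, the "350 GB" above
print(per_value, round(after_gb, 1))
```

Running the same function with the candidate model's dimension count is the cheapest way to see the storage line item before the capacity plan is signed.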

Vector index builds cost CPU at a scale that breaks the assumptions of any normal migration window. Jonathan Katz benchmarked HNSW build on the dbpedia 1M corpus across pgvector versions: 7,479 seconds on 0.5.0, 250 seconds on 0.7.0 with the parallel-build improvements, 49 seconds with binary quantization. AWS published Aurora numbers on a 5M OpenAI dataset showing 29,752 seconds on 0.5.1 versus 445 seconds on 0.7.0 with binary quantization. Versions matter. They also bound how bad it can get on older versions. pgvector issue #300 reports a 24-plus-hour HNSW build on 10M rows of 768-dimensional vectors with 4 CPU and 16 GB RAM. Issue #807 reports the connection dropping after roughly two hours of an HNSW build on 17M rows of 1536-dimensional vectors with 48 CPU and 192 GB RAM. The build is the part of the workload least visible to the production dashboard and most painful to retry.

HNSW does not degrade gracefully when its working set exceeds RAM. A search hops between graph neighbors in effectively random page order, which is fast when every page is in shared_buffers and falls off a cliff when it isn’t. pgvector issue #844 caught the in-build version of the same problem: the user got the message hnsw graph no longer fits into maintenance_work_mem after 5908085 tuples, and the build slowed dramatically from that tuple onward. The query-side equivalent has the same shape. Crunchy Data’s HNSW write-up reports index sizes of roughly 8 GB per million rows on typical AI embeddings. Neon’s operational guide recommends keeping maintenance_work_mem at no more than 50–60% of available RAM for vector workloads. The exact latency-vs-RAM curve past the cliff is not published in any vendor source I can find. The shape is well-known to anyone who has watched it happen, and the absence of a published curve is itself a sign of how much of this knowledge lives in incident channels rather than docs.

The cache problem is the inverse of the build problem. Once the index exists, every ANN query touches thousands of pages chased through a graph traversal. A handful of vector queries running concurrently with OLTP traffic is enough to evict the heap pages the OLTP queries depend on, and the OLTP p95 regresses without any change to OLTP code. The pgvector #666 numbers from the opening are the cleanest single data point on this. The instructive piece is the magnitude. Not 2x worse, not 5x worse. Three orders of magnitude depending on cache state. There is no other workload class on a typical OLTP Postgres that produces that swing.

All four cost categories converge on the managed-service bill. Aurora storage runs $0.10 per GB-month at the base tier. Full-precision 1536-dimensional vectors take roughly 6.15 GB raw per million rows, plus an HNSW index closer to 8 GB per million on typical configurations. At 100 million rows that is about 1.4 TB before backups, replicas, or growth, around $140/month in raw storage at the base rate. Storage is the floor, not the cost. The cost is the instance class needed to keep the active portion of that index in shared_buffers, which on the memory-cliff curve above means an instance one or two tiers above what the rest of the workload required.
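The storage floor in that paragraph is a two-line calculation. A small sketch with the figures quoted above as inputs; the function name is illustrative, and the per-million index size and the base storage rate are the article's rough numbers, not measurements.

```python
def storage_floor(rows_millions, raw_gb_per_million, index_gb_per_million, usd_per_gb_month):
    """Return (total GB, USD per month) for raw vectors plus HNSW index, storage only."""
    total_gb = rows_millions * (raw_gb_per_million + index_gb_per_million)
    return total_gb, total_gb * usd_per_gb_month


total_gb, usd_month = storage_floor(100, 6.15, 8.0, 0.10)
print(round(total_gb), round(usd_month, 2))  # roughly 1.4 TB and about $140 a month
```

The number it leaves out is the one that matters: the instance class needed to keep the hot part of the index resident.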

The two events the capacity plan didn’t budget for

Embedding model upgrades rewrite the embedding column. text-embedding-3-small was released January 2024 alongside text-embedding-3-large, and any team that wanted the better recall on the larger model also wanted to re-embed the existing corpus. The migration is a full rewrite of the largest column on the largest table, plus an HNSW index rebuild on the new vectors, plus the API call cost of generating the new embeddings, plus double storage during cutover unless the team is willing to take a recall regression by deleting the old embeddings before the new ones are validated. There is no published postmortem from a named company giving real numbers on this event. The closest public signal is pgvector issue #559 from December 2024, where a user reports that individual inserts on a 1M-row HNSW table went from millisecond-scale before the index existed to “5–8s” afterward. The same write amplification applies to the migration, except now it applies to every row at once. The absence of postmortems is itself worth noticing. The event is recent enough that most teams haven’t lived through their second model upgrade yet.

The other unpriced event is connection pool starvation when the application holds a Postgres connection while waiting on an LLM call. The pattern is straightforward. A request needs context from the database, the application opens a transaction, fetches the rows, builds a prompt, calls the model, gets a 4-second response, writes the result back, commits. The connection is held for the entire round trip. A pool sized for 200ms transactions exhausts at one-twentieth the request rate it was sized for, and the failure surfaces as too many connections errors on requests that have nothing to do with the AI feature. There is no named-company postmortem in the public record for this one either. The pattern is recognizable to anyone who has run a database behind a synchronous LLM call. The fix is structural. Do the LLM call outside the transaction. Release the connection before the model call begins and acquire a new one after. Or move to a connection pooler that explicitly supports this pattern. None of those is free, and none of them is what the first version of the feature ships with.
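The difference between the two request shapes can be shown without a database. The sketch below uses a hypothetical toy pool and a manually advanced clock, so it runs instantly; the 0.1-second database steps are an assumption, and the 4-second model latency is the figure from the paragraph above.

```python
from contextlib import contextmanager

DB_SECONDS = 0.1    # assumed cost of each database step
LLM_SECONDS = 4.0   # model latency from the example above


class FakeClock:
    """Manually advanced clock so the sketch needs no real waiting."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


class ToyPool:
    """Hypothetical stand-in for a pool; records how long each checkout is held."""

    def __init__(self, clock):
        self.clock = clock
        self.held = []

    @contextmanager
    def connection(self):
        start = self.clock.now
        try:
            yield object()  # placeholder for a real connection
        finally:
            self.held.append(self.clock.now - start)


def handle_holding(pool, clock):
    with pool.connection():
        clock.advance(DB_SECONDS)   # fetch rows
        clock.advance(LLM_SECONDS)  # model call inside the transaction
        clock.advance(DB_SECONDS)   # write result, commit


def handle_released(pool, clock):
    with pool.connection():
        clock.advance(DB_SECONDS)   # fetch rows, commit, release
    clock.advance(LLM_SECONDS)      # model call holds no connection
    with pool.connection():
        clock.advance(DB_SECONDS)   # short second transaction for the write


def held_per_request(handler):
    clock = FakeClock()
    pool = ToyPool(clock)
    handler(pool, clock)
    return sum(pool.held)


def max_requests_per_second(pool_size, held_seconds):
    """Little's law: sustained throughput is bounded by pool_size / hold time."""
    return pool_size / held_seconds


holding = held_per_request(handle_holding)     # about 4.2 s of connection time
released = held_per_request(handle_released)   # about 0.2 s of connection time
```

With a hypothetical pool of 20 connections, the first shape tops out near 5 requests per second and the second near 100. The price of the second shape is that the write happens in a separate transaction, so the code has to tolerate the rows having changed while the model was thinking.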

What actually moves the bill

Three levers do most of the work, and each one carries a trade-off the AI-feature team would rather not own.

Halfvec is the cheapest move. The pgvector halfvec type stores each component as a 16-bit float instead of 32-bit, halving storage at no measured recall cost on most embedding models. AWS’s Aurora benchmark shows the Cohere 10M corpus dropping from 38 GB to 19 GB, with database memory consumption going from 15.12% to 7.55% on the same r7g.12xlarge instance. Neon’s July 2024 post on halfvec reports 50% storage reduction, 23% faster index build, 50% faster prewarming, and equivalent recall on a 1M DBpedia 1536-dimensional corpus. The trade-off is that halfvec only buys 2x. The corpus growth that earned the bill in the first place still applies. Halfvec moves the bill, it does not change its slope.
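What a 16-bit component costs in precision can be seen with the standard library alone. This is an illustration of the component width, not pgvector's implementation, and it says nothing about recall; the random vector is a stand-in for a real embedding, and the byte counts are payload only, excluding per-value overhead.

```python
import random
import struct

random.seed(0)
dims = 1536
vec = [random.uniform(-1.0, 1.0) for _ in range(dims)]  # stand-in for an embedding

full = struct.pack(f"<{dims}f", *vec)  # 32-bit floats
half = struct.pack(f"<{dims}e", *vec)  # 16-bit floats (IEEE 754 binary16)

restored = struct.unpack(f"<{dims}e", half)
max_error = max(abs(a - b) for a, b in zip(vec, restored))

print(len(full), len(half), max_error)  # the payload halves; the per-component error stays small
```

Whether that rounding error matters for a given embedding model is an empirical question, which is why the published recall comparisons are worth more than the arithmetic.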

Binary quantization with reranking is the next lever, and it is the one with a published tension worth understanding. Qdrant’s binary quantization article from September 2023 reports recall of 0.985 on text-embedding-3-small at 3x oversampling with reranking, and 0.997 on text-embedding-3-large at the same setting. Storage drops by roughly 32x relative to full precision. Neon ran a binary-quantization test on 1536-dimensional vectors in 2024 and concluded recall was “insufficient for production use”. Both are correct. Qdrant tested with a rerank stage that re-evaluates the top-k binary candidates against full-precision vectors held elsewhere; Neon tested binary alone. Binary quantization without rerank is dangerous. Binary quantization with rerank requires keeping a full-precision copy of the vectors somewhere accessible, which is back to a storage problem of a different shape.
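The two-stage shape behind the Qdrant result can be sketched in a few lines. This is a toy version under stated assumptions: sign-bit binarization (one common scheme, not necessarily what either vendor tested), brute-force Hamming scan instead of an index, inner-product rerank, and random vectors in place of real embeddings. All names are illustrative.

```python
import random


def binarize(vec):
    """One bit per dimension: 1 where the component is positive."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return bin(a ^ b).count("1")


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search(query, vectors, codes, k, oversample):
    """Shortlist k * oversample candidates on binary codes, then rerank on full precision."""
    q_code = binarize(query)
    by_hamming = sorted(range(len(vectors)), key=lambda i: hamming(q_code, codes[i]))
    shortlist = by_hamming[: k * oversample]
    return sorted(shortlist, key=lambda i: dot(query, vectors[i]), reverse=True)[:k]


random.seed(1)
vectors = [[random.gauss(0, 1) for _ in range(64)] for _ in range(500)]
codes = [binarize(v) for v in vectors]  # 1 bit per dimension instead of 32

top = search(vectors[42], vectors, codes, k=5, oversample=3)
```

The rerank line is the one that reads `vectors[i]`: the full-precision copy has to live somewhere the query path can reach, which is the storage problem the paragraph above ends on.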

Separating the vector workload onto its own physical Postgres replica or onto a dedicated vector store addresses cache pollution directly. The trade-off is operational. A second system to back up, a second consistency model to reason about, a second incident-response runbook. On a small team this can dominate the cost it was meant to save. On a larger team where the AI feature has its own owners and the OLTP database has its own owners, the separation aligns infrastructure boundaries with team boundaries and is usually worth it on those grounds alone, before any cache argument is made.

A scalar quantization note worth keeping in view: Jonathan Katz’s scalar and binary quantization benchmark from April 2024 measured halfvec recall at 0.968 versus full-precision 0.968 on dbpedia-openai-1000k-angular at the same ef_search. The author’s verdict was direct: “Scalar quantization from 4-byte to 2-byte floats looks like a clear winner.” On most embedding models, halfvec is the move you can make today without touching application code or rerank pipelines, and the bill drops by half.

When the math runs the other way

This article is overkill on three configurations. Small corpora, under roughly 100,000 vectors at 1536-dim or smaller, do not generate enough storage or index volume to matter on any modern Postgres instance. Low-QPS internal tools where the vector search runs a few times a minute do not pollute the buffer cache enough to regress OLTP, and do not need the operational complexity of a separate vector store. Teams already on a dedicated vector store from day one (Pinecone, Weaviate, Qdrant Cloud, pgvectorscale on a separate instance) have paid the operational price up front and have a different cost structure that this article does not speak to. The four cost categories above all assume a vector workload colocated with an OLTP Postgres at production scale, which is the configuration most teams ship first because it is the configuration that requires the fewest decisions.

The bigger picture

The shape of the problem recurs across every AI-rollout postmortem worth reading. The feature ships fast because the existing infrastructure is already there. The cost lands later because the existing infrastructure was sized for a different workload. Embeddings on the OLTP Postgres are cheap to add and expensive to operate. The capacity plan that signed off on the AI feature did not have line items for storage at 6 KB per row, for HNSW builds that consume entire instances for hours, for memory cliffs that do not degrade gracefully, or for cache pollution that regresses unrelated p95s. The standard “what does this feature cost” template was written for application features that read and write rows the database was designed to read and write. Vector search is a different access pattern. The team that surfaces these four numbers before the feature ships pays them on a normal capacity ticket. The team that doesn’t, pays them anyway, six weeks later, in the form of an emergency one.

Your Alert Triage Doesn't Need an Autonomous Agent

Fri, 08 May 2026 00:00:00 +0000

TL;DR

Autonomous agents are the wrong abstraction for alert triage. A scripted playbook of queries curated by a reliability engineer (RE), plus one LLM call to summarize the structured output, gives the responder a triage hint with the raw data attached. The summary saves time on the easy pages; the raw data carries them through the cases the LLM gets wrong.

3:14am page. p99 latency on /api/orders checkout past the 1500ms SLO for six straight minutes. The on-call assistant’s summary at the top of the alert reads “elevated checkout latency correlated with deploy of order-service r8472, 14 minutes ago. Recommend rollback.” The responder pages the deploy author and starts the rollback. Latency stays past SLO. The actual cause is a worker that hung yesterday holding an open transaction, idle for 18 hours, blocking vacuum the entire time. Bloat on the orders table is what made the checkout query slow enough to finally cross the SLO during normal early-morning traffic. The agent pulled the pg_stat_activity snapshot and the idle session was in it. The summary picked the deploy anyway, because the deploy was the most legible recent change in the data it had. The responder did not read pg_stat_activity because the summary said roll back the deploy. Twenty-three minutes page-to-fix, twenty on the wrong path.

The senior reader’s first response is “give the agent better access. Wire in pg_locks, slow-query log, replication slot state, recent lag, the works.” The agent in the scenario already had pg_stat_activity. Missing data was never the problem. The agent had the data and picked the most legible recent change as the cause, because that is what the model defaults to on partial structured data. Adding more sources gives it a longer list to pattern-match against. The summary that comes back is more confident without being more correct, and a responder defers more readily to a confident summary. That is how the failure mode shifts from “agent missed the data” (visible, fixable in tooling) to “agent had the data and misattributed cause” (invisible, the post-mortem has to reconstruct what the responder would have seen without the summary).

What the agent abstraction blends together

“Agent” in the current vendor pitch means autonomy in tool selection: the LLM reads the alert, decides which MCP servers to consult, what queries to run, in what order, what to do with the results. That bundle does two jobs at once. Summarizing structured data into prose is the job where an LLM hallucinates least, though it still fabricates values and misreads fields on the way. Deciding which queries to run for a given symptom is detective work that depends on a system model the LLM does not have. The reliability engineer has the model. They have been on call. They have read every post-mortem. They know that a replication-lag alert wants the slot state, the publisher’s WAL position, the largest active transaction’s age, and the last three deploys, in that order, every time. The LLM does not know that. It can pattern-match toward it on familiar shapes and miss it on the rest.

The right design splits the two jobs. The reliability engineer curates the playbook: for this alert ID, run these queries against these systems, with this scope. A script runs the playbook on every fire of that alert. One LLM call at the end takes the structured output and writes a paragraph: what is affected, what is notable in the data, what jumped out. No tool selection by the model. No causal claims unsupported by the queries the playbook ran. The model is doing the safest thing it can do.
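The split described above fits in a small amount of code. A minimal sketch, assuming a registry keyed by alert ID; the names `Query`, `PLAYBOOKS`, and `triage` are hypothetical, the query callables return canned rows, and a stub stands in for the single LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Query:
    name: str
    run: Callable[[], list]  # returns rows; the SQL behind it is written and tested by the RE


PLAYBOOKS: Dict[str, List[Query]] = {}


def triage(alert_id: str, summarize: Callable[[dict], str]) -> dict:
    """Run the curated queries for this alert, then make exactly one summarization call."""
    queries = PLAYBOOKS.get(alert_id)
    if queries is None:
        # No playbook for this alert: no summary, the responder gets the raw alert.
        return {"alert_id": alert_id, "raw": {}, "summary": None}
    raw = {q.name: q.run() for q in queries}  # fixed set, fixed order, every fire
    return {"alert_id": alert_id, "raw": raw, "summary": summarize(raw)}


# Hypothetical wiring with canned rows and a stub in place of the LLM call.
PLAYBOOKS["replication-lag"] = [
    Query("slot_state", lambda: [{"slot_name": "pub_orders", "active": False}]),
    Query("recent_deploys", lambda: []),
]


def stub_summarize(raw: dict) -> str:
    return "; ".join(f"{name}: {len(rows)} row(s)" for name, rows in raw.items())


artifact = triage("replication-lag", stub_summarize)
```

Nothing in `triage` lets the model choose a query. The only place a model appears is the `summarize` argument, and it sees the output after every query has already run.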

This is not a smaller version of an agent. The autonomy in tool selection has been removed entirely. The reliability engineer chose the queries, the playbook runs them, the LLM formats the result.

In practice, most teams that ship an agent in production end up not trusting its autonomy unsupervised either. Tool descriptions accumulate, system prompts get tuned with hints like “for slow-query alerts, consider checking pg_stat_activity, pg_locks, and the last three deploys.” That natural-language playbook lives inside the prompt, with no guarantee the model executes it on any given run. The engineer is authoring a playbook either way. The only choice is whether the playbook lives in code that runs every time, or in a prompt the model may or may not honor on this alert.

What every round trip costs

The autonomous-agent design pays for the same data several times. Each tool call’s output goes into the input prompt of the next step, and the loop is serial: read alert, decide query, run query, read result, decide next query, run query, read result. A pg_stat_activity dump fetched at step one is in the input for every step after it. In a six-step loop that is steps two through six, so that payload is billed as input five times, plus the output tokens spent emitting tool-call JSON at each hop. At a page rate of a few hundred a day across a platform team, the bill compounds. Prompt caching cuts the bill but does not change the shape. Every step still serializes through a model call, every tool error still pollutes the conversation, and every retry still spends real wall-clock seconds.
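The re-billing can be put in numbers with a toy model. The sketch assumes each step's prompt carries every earlier tool output and ignores the system prompt, output tokens, and prompt caching; the 2,500-token dump size is hypothetical.

```python
def agent_input_tokens(tool_output_tokens):
    """Input tokens billed for tool outputs across a serial loop.

    tool_output_tokens[i] is the size of the tool output produced at step i + 1.
    Each step's prompt carries every earlier tool output, so an early payload
    is billed again at every later step.
    """
    billed = 0
    carried = 0
    for produced in tool_output_tokens:
        billed += carried     # this step re-reads everything fetched so far
        carried += produced   # its own output joins the context for later steps
    return billed


DUMP_TOKENS = 2_500  # hypothetical size of one pg_stat_activity dump

agent = agent_input_tokens([DUMP_TOKENS, 0, 0, 0, 0, 0])  # six steps, dump fetched at step one
playbook = DUMP_TOKENS                                     # one call reads the dump once
```

Under this model the dump is billed as input at steps two through six. Add a second and third payload at later steps and the total grows with the square of the loop length, not linearly.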

The agent also does this fresh on every alert. It carries no schema knowledge between runs the way the reliability engineer does. On a given page it queries pg_stat_statements for total_time on a PG14 cluster (the column was split into total_exec_time and total_plan_time in PG13), reads the SQL error in the next prompt, retries with a different guess, gets it wrong again, queries information_schema to discover what columns actually exist, dumps that result into the conversation, and finally runs the query it should have run from the start. Every error and every discovery dump piles into the next prompt. Per alert. The playbook does this once when the RE writes it, in version control, against a real database.
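The column rename above is the kind of knowledge a playbook encodes once. A small sketch of a query builder that picks the column from `server_version_num`; the function name and the selected columns are illustrative choices, not a prescribed playbook query.

```python
def slow_statements_sql(server_version_num: int, limit: int = 10) -> str:
    """Build a pg_stat_statements query using the column name valid for this server.

    server_version_num is what `SHOW server_version_num` returns, e.g. 140005 for 14.5.
    """
    column = "total_exec_time" if server_version_num >= 130000 else "total_time"
    return (
        f"SELECT queryid, calls, {column} AS total_ms "
        "FROM pg_stat_statements "
        f"ORDER BY {column} DESC LIMIT {int(limit)}"
    )


print(slow_statements_sql(140005))
```

The branch is written, reviewed, and tested against a real cluster once. The agent rediscovers it through failed queries on every page.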

And the loop takes time. Even on the fast tier, the loop is serial: every step is a model call followed by a tool call followed by another model call. Six round trips compound, and the responder paged at 3am has opened three dashboards manually before the agent posts its first summary. The supposed time savings of the agent are negative against a responder who already knows where to look.

The playbook design fetches everything once, in parallel, and passes one shaped bundle into one LLM call. The shaping is where most of the token savings come from. Raw pg_stat_activity output is verbose JSON with more than twenty columns per row, most of them irrelevant to a triage summary. A playbook can project the four columns the prompt actually needs (pid, state, query_start, query), format them as a small table rather than nested JSON, truncate long query text, and pass a hundred bytes where the agent would have passed ten kilobytes. Page-to-summary time is the slowest single query plus one summarization call, regardless of how many queries the playbook fetches.
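Both halves of that paragraph, the parallel fetch and the shaping, are short. A sketch under assumptions: queries are zero-argument callables returning lists of dicts, the 80-character truncation is an arbitrary choice, and the pipe-separated layout is one of many that would do.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_parallel(queries):
    """Run every playbook query concurrently; `queries` maps name -> zero-arg callable."""
    with ThreadPoolExecutor(max_workers=max(1, len(queries))) as pool:
        futures = {name: pool.submit(fn) for name, fn in queries.items()}
        return {name: future.result() for name, future in futures.items()}


def shape_activity(rows, max_query_chars=80):
    """Project pg_stat_activity rows to four columns and render a compact table."""
    columns = ("pid", "state", "query_start", "query")
    lines = [" | ".join(columns)]
    for row in rows:
        query = " ".join(str(row.get("query") or "").split())  # collapse whitespace
        if len(query) > max_query_chars:
            query = query[: max_query_chars - 3] + "..."
        cells = [str(row.get("pid")), str(row.get("state")), str(row.get("query_start")), query]
        lines.append(" | ".join(cells))
    return "\n".join(lines)


# Hypothetical row; a real one would carry many more keys, all dropped here.
sample = [{"pid": 88234, "state": "idle in transaction", "query_start": "03:14:07",
           "query": "SELECT *\n  FROM orders " + "x" * 200, "client_addr": "10.0.0.7"}]
shaped = shape_activity(sample)
```

The raw rows still go into the artifact untouched. Only the copy handed to the summarizer is shaped.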

The alert artifact

What the responder gets has three layers.

Tier sits at the bottom, set by the routing layer: prod page, non-prod channel, Jira queue. The tier picks the playbook. A P0 page activates the prod-read playbook, which can hit replicas and recent deploy state. A Jira ticket runs only the runbook-lookup playbook with no live read access. Tier-as-scope is the security half of the design and falls out for free once the playbook is the unit of action.

Raw data sits in the middle: every query the playbook ran, with its output. The pg_stat_activity snapshot. The lock graph. Replication slot state. Last five deploys with author and SHA. Slow-query log entries from the last fifteen minutes. The artifact attaches all of it because the playbook already paid the cost to fetch it. Re-running the queries from the responder’s terminal at 3am is exactly the time the design exists to save.

Summary sits on top: one paragraph from one LLM call, generated from the structured output of the playbook. “Replication lag of 47 seconds. Slot pub_orders is held with restart_lsn 18 hours stale. No recent deploys touch the publisher service. Largest active transaction is session 88234, idle in transaction for 18h2m.” That paragraph is doing the job an LLM hallucinates least on: compressing structured input into readable prose, with the inputs visible to the responder one scroll below. It is not claiming the slot is the cause. The responder reads the summary, scrolls to confirm in the slot-state output, kills the session, slot drains, lag recovers.

The summary is a reading hint, not a source of truth. On the bulk of pages where it is right (bad CPU, lag, slow query, full disk), the responder saves a few minutes of dashboard-tab opening. On the cases where it is wrong, they scroll past the summary, read the raw output the playbook already gathered, and override. They never have to wait on the model to fetch anything.

Lower tiers do not run the playbook automatically. A non-prod channel post or a Jira ticket lands with the alert payload and an “investigate” button. Most of those alerts get glanced at and dismissed: known flake, the synthetic that fires every Tuesday morning. Running a playbook and a summarization call on every one wastes tokens and clutters the channel. The button is for the alerts the responder decides to look at; pressing it runs the playbook and attaches raw data and summary the same way a P0 page would have them. P0 pages skip the button because the responder is already committed; the summary is there the moment the page opens.

What the post-mortem actually changes

Post-mortem deltas in this design land somewhere specific. Usually the playbook needs another query: pg_prepared_xacts was missing, or the lock graph was dumped without the waiter chain. Sometimes the prompt template needed to surface a signal the playbook already gathered but the LLM ignored. Occasionally the routing tier was wrong and the alert hit the wrong playbook entirely. All three ship as a pull request a reviewer can read.

The same post-mortem in an autonomous-agent setup is harder to reason about. The agent decided to run queries A, B, C this time. It might run D, E, F next time on a similar-looking alert. The prompt and the run are intertwined, and the fix is “tune the agent’s tool descriptions” with no guarantee the next run reaches for the right tool.

How you’d actually measure this

The argument that a curated playbook plus a summarization call beats an autonomous agent is testable on a team’s own pages. Pull the last quarter of P0 and P1 alerts. For each one, run the candidate playbook against a snapshot of the systems’ state at the time the page fired (or against archived metrics, depending on what’s stored). Generate the summary the way the design would. Compare it against the post-mortem’s documented root cause.

Two regressions to count separately: cases where the summary names a wrong root cause that the bundle’s raw data would have ruled out, and cases where the summary fabricates a value the playbook never produced. The first measures the model’s misattribution rate on data it actually saw. The second measures the model’s tendency to invent facts that are not in the input, the floor problem the raw-data layer exists to catch.
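Both counts can be automated crudely before anyone builds a proper harness. The sketch below is a rough proxy under loud assumptions: the named cause and root cause are hand-labeled per case, misattribution is counted without checking whether the raw data would have ruled the cause out, and fabrication is approximated as any number in the summary that appears nowhere in the bundle text. The two cases are hypothetical.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def fabricated_numbers(summary: str, bundle_text: str) -> set:
    """Numbers that appear in the summary but nowhere in the playbook output."""
    seen = set(NUMBER.findall(bundle_text))
    return {n for n in NUMBER.findall(summary) if n not in seen}


def score(cases):
    """Count the two regressions separately over hand-labeled cases."""
    misattributed = sum(
        1 for c in cases
        if c["named_cause"] is not None and c["named_cause"] != c["root_cause"]
    )
    fabricated = sum(1 for c in cases if fabricated_numbers(c["summary"], c["bundle_text"]))
    return {"cases": len(cases), "misattributed": misattributed, "fabricated": fabricated}


# Hypothetical cases; real ones come from archived pages and their post-mortems.
cases = [
    {"summary": "Replication lag of 47 seconds.", "bundle_text": "lag_seconds=47",
     "named_cause": None, "root_cause": "idle_transaction"},
    {"summary": "Deploy r8472 caused a lag of 90 seconds.",
     "bundle_text": "deploy=r8472 lag_seconds=47",
     "named_cause": "deploy", "root_cause": "idle_transaction"},
]
result = score(cases)
```

The exact-string number match will flag reformatted values (an interval rendered as 18h2m, for instance), so flagged cases need a human read. As a first pass it is enough to show which summaries deserve one.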

Run the same evaluation against an autonomous-agent baseline on the same alerts, and the comparison is concrete rather than theoretical. Either the agent’s tool selection picks queries the playbook would not, in which case the playbook needs editing. Or the agent’s summaries hallucinate at a higher rate on the same inputs. Either result is useful. The eval is cheap to set up once and pays back every time the playbook is changed.

Where the design strains

A few real caveats. None of them the agent design solves either.

Playbook maintenance is work, but the work is the cleanest accuracy lever the engineer has. Adding a query directly improves the summary’s grounding, because the model now reads more of the data the cause lives in. Tuning an agent’s prompt does not have the same property. The model can still ignore the hint, trade it off against a conflicting instruction, or pick a different tool, and there is no deterministic check that any of those did not happen. The bundle either has the data or it does not. The strain is the silent failure mode on the maintenance side. When a query references a column that was renamed or a service that moved, the query returns empty, the bundle gets thinner, and the summary gets less informative without anything visibly breaking. The discipline that catches it is owned playbooks (one team, one engineer named in the file) plus a cadence: post-mortems produce playbook deltas, and a periodic review flags queries that have returned zero rows on every recent run. Without that, the playbook decays.
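The periodic review mentioned above is a few lines once row counts are logged per run. A sketch assuming the playbook runner records a row count for every query on every fire; the history format, the function name, and the five-run window are all hypothetical.

```python
def stale_queries(history, min_runs=5):
    """Flag queries that returned zero rows on each of the last `min_runs` runs.

    history maps query name -> list of row counts, oldest run first.
    """
    flagged = []
    for name, counts in history.items():
        recent = counts[-min_runs:]
        if len(recent) >= min_runs and all(count == 0 for count in recent):
            flagged.append(name)
    return sorted(flagged)


history = {
    "slot_state": [1, 1, 2, 1, 1, 1],
    "lock_graph": [3, 0, 0, 0, 0, 0],     # went empty five runs ago: renamed column?
    "recent_deploys": [0, 0, 1, 0, 0, 0],
    "new_query": [0, 0],                  # too few runs to judge
}
flagged = stale_queries(history)
```

A flagged query is not necessarily broken, since some legitimately return nothing on a healthy system. It is a prompt for the owning engineer to look.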

The summary can still be wrong on data the playbook surfaced. Curating the input does not fix the model’s tendency toward confident misattribution. The bundle might include the idle-in-transaction session and the recent deploy side by side, and the model can still pick the deploy because it pattern-matches better to recent-change framing. The raw-data layer is the floor under the summary. The responder scrolls, reads the idle session, overrides. What curation changes is how easy the misattribution is to catch, not how often it happens.

The summary can also fabricate facts the playbook did not produce. Even with a curated bundle as input, the model can describe values that were not in the data (a lag of 47 seconds when no lag query ran), invent observations from a single-row snapshot, or restate the bundle in a way that adds confidence the data does not support. The raw-data layer is again the floor: the responder catches a fabricated number by reading the actual query output the playbook attached. The agent design has the same failure plus a worse one. The agent’s summary can claim observations that no tool call ever produced, because in an agent trace the summary text and the actual tool calls are separate artifacts and the responder rarely reads both.

Prompt injection is a real exposure. Raw strings from user-controlled fields end up in the bundle: query text, application_name, log message bodies. An attacker who can write into those fields can attempt to steer the summary. Tier-as-scope helps because low-trust alerts get less context to work with, but the playbook design does not eliminate the risk any more than the agent design does. Standard mitigations apply: prompt isolation, output sanitization, and treating the summary as untrusted input to anything downstream.

Page delivery takes longer if the summary blocks. The LLM call adds 200ms to a couple of seconds, and on paging tiers that is a regression on time-to-acknowledge. The fix is the same shape as the investigate button on lower tiers: lazy. Deliver the raw alert and the playbook output the moment they are ready. The summary lands asynchronously and appends to the thread when the LLM call returns. The responder starts reading the raw data while the summary is still rendering. Blocking page delivery on the summarizer is the kind of regression the design was supposed to prevent.
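The lazy delivery described above maps onto a background task. A minimal asyncio sketch; `post` and `summarize` are hypothetical hooks for the paging integration and the LLM call, and the stub summarizer sleeps for 10 ms in place of real model latency.

```python
import asyncio


async def deliver_page(alert, bundle, summarize, post):
    """Post the alert and raw playbook output now; append the summary when it is ready."""
    post(f"PAGE {alert}")
    post(bundle)

    async def append_summary():
        try:
            post("SUMMARY " + await summarize(bundle))
        except Exception:
            post("SUMMARY unavailable")  # a failed summarizer never blocks the page

    return asyncio.create_task(append_summary())


async def demo():
    posted = []

    async def slow_summarize(bundle):  # stub standing in for the LLM call
        await asyncio.sleep(0.01)
        return "lag 47s, slot pub_orders stale"

    task = await deliver_page("replication-lag", "raw playbook output", slow_summarize, posted.append)
    before_summary = list(posted)  # what the responder sees immediately
    await task
    return before_summary, posted


before_summary, after_summary = asyncio.run(demo())
```

The responder's view at page time is `before_summary`: the alert and the raw output, with no model in the critical path.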

When this doesn’t apply

A few places the discipline costs more than it pays.

Small systems with a single responder who has the full mental model. A two-service team with five alert types and one on-call does not need playbook authoring overhead and a per-alert LLM call. The responder’s pattern-match resolves the page in thirty seconds and the summary is friction.

Alerts with no diagnostic surface. A boolean health check with no associated query set is not a playbook target. The alert is the data; there is nothing structured to summarize on top.

Novel incidents the playbook has not seen. By design, no playbook matches and no summary is generated. The responder gets the raw alert and reads pg_stat_activity themselves. That is the correct behavior. The alternative, an autonomous agent that reaches for whichever queries it pattern-matches to, would produce a confident summary on a problem the team has never seen, which is the worst case for both MTTR and post-mortem quality.

Tier-one NOC responders working escalation playbooks. The design assumes a responder senior enough to override a confident summary by reading the raw output below it. A 24/7 NOC tier whose runbook says “if summary recommends rollback, page deploy author and roll back” inherits the worst of both worlds: the summary’s confidence with none of the override capacity. For that org shape the same design needs an additional rule. The summary never names a recommended action, only what is notable in the data, and the runbook explicitly tells the responder to escalate when the summary’s notable-fact list does not match the runbook’s expected pattern. Without that, a senior-on-call design imported into an L1 environment makes incident outcomes worse.

The bigger picture

Summarization of structured data into prose is the job where LLM hallucinations are smallest. They still happen, but less often than in tool selection or causal attribution. Where the work depends on a system model the model does not have, the hallucinations are everywhere. The same shape shows up across letting AI manage indexes and prompts as guardrails. The reliability engineer keeps the system model. The playbook is where that model lives in code. The model formats. Choosing what to fetch is the part the engineer is for.

If Your Guardrail Is a Prompt, You Don't Have a Guardrail

Fri, 01 May 2026 00:00:00 +0000

TL;DR

A prompt instruction biases the next-token distribution. It cannot bound it. Real guardrails for agents holding production credentials sit below the prompt, in layers the model cannot read or override: scoped identities, vetted tool surfaces, harness hooks, wire-level statement filtering, provenance-tagged logs, behavioral monitoring.

The agent’s instructions said staging only, read-only by default, no schema changes, confirm before any DELETE. The environment said DATABASE_URL=postgres://app_writer:...@prod-cluster:5432/app. Twenty turns in, the user wrote “looks good, can you also clean up the old events”, and the agent ran DELETE FROM events WHERE created_at < '2025-01-01' against the only connection string it had. The instruction never lost an argument with the destructive call. By turn 20 it had decayed into background.

A stronger prompt won’t fix this

The reflex is to write a stronger prompt. ALL CAPS, with a rules block at the top of the system message and a reminder at the bottom of every user turn. Replit’s July 2025 incident already ran the experiment: eleven all-caps messages forbidding writes during a code freeze, an agent that ignored every one, 1,206 executive accounts dropped, no rollback path the agent could find. Twelve would not have helped. Fifty would not have helped. The prompt is the wrong layer for a guardrail, and the strength of the wording is not the variable.

The same holds for the moves around it. Fine-tuning lowers the rate without zeroing it, and a fine-tune is harder to update than a SQL REVOKE. Self-verification runs the verifier through the same architecture as the actor, ratifying the destructive call with the same confidence that produced it. The corruption piece walks through the mechanism. Same model checking the same model is not a check.

Why the prompt can’t bound the distribution

A prompt is more tokens. The system message, the developer instructions, the user turns, the tool responses all land in the same context window. They feed the same attention mechanism that produces the next-token probability. “Never run DROP TABLE” shifts that probability toward continuations consistent with the rule. It does not remove the token from the vocabulary. It does not produce a hard zero on the path that emits it. Sampling has no off-switch. Production agents run at positive temperature, where every token keeps nonzero probability of being sampled. Even at temperature zero, the model takes the argmax over the distribution, and the argmax is whichever token the context tilted highest. The prompt shifts that ranking. It does not bound it.
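A toy calculation makes the shift-versus-bound distinction concrete. The scores below are invented for illustration (a real model has a vocabulary of many thousands of tokens and no such hand-set numbers); the only claim is about the arithmetic of softmax and argmax.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities at a positive temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits):
    """Temperature zero: pick the index with the highest score."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical scores for three continuations; index 2 is the forbidden one.
without_rule = [2.0, 1.0, 1.5]
with_rule = [2.0, 1.0, -3.0]  # assume the rule pushes the forbidden score down
drifted = [2.0, 1.0, 2.4]     # assume later context tilts it back up

p_before = softmax(without_rule)[2]
p_after = softmax(with_rule)[2]
```

With these numbers the rule cuts the forbidden continuation's probability sharply, but `p_after` is still greater than zero, and `greedy(drifted)` returns the forbidden index once the context has tilted the ranking.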

The model is also stateless. Every turn, the harness resends the entire conversation, and the “memory” the agent appears to have is the transcript being rebuilt on each call. Nothing in the weights retains the rule from turn 1 at turn 50. There is only a longer transcript with the rule somewhere in the middle, competing for attention weight against everything else. More tokens means the same attention budget spread across more positions, and a smaller share for any single instruction. Long system prompts loaded with rules and exceptions are the worst case: each rule dilutes every rule already there. Keep the prompt short.

Corruption Is a Feature, Not a Bug walks through the architecture in detail. Across enough sessions, with enough varied phrasings, some context tilts the distribution far enough that the forbidden token wins the sample. The rate per session is small. Multiplied by the sessions a production agent runs, it becomes a count. The count is the incident, guaranteed in expectation.
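The rate-times-sessions arithmetic can be written down directly. This is a sketch under a simplifying assumption that sessions are independent with the same per-session rate; the rate and session count used in the note below are hypothetical, not measurements.

```python
def expected_violations(rate_per_session, sessions):
    """Expected number of sessions in which the forbidden action fires."""
    return rate_per_session * sessions


def prob_at_least_one(rate_per_session, sessions):
    """Chance of one or more violations, assuming independent sessions."""
    return 1.0 - (1.0 - rate_per_session) ** sessions
```

At an assumed rate of one violation per 10,000 sessions, 50,000 sessions give an expected count of five, and the chance of at least one violation is above 99%.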

Adversarial inputs share the channel. Every document the agent reads, every tool response it parses, every user message lands in the same window as the system prompt. There is no privileged layer. Input that statistically reads as “the user wants this destructive action” can override the instruction that says don’t. Prompt injection is the named version. The unnamed version is conversation drift.

Each new session starts fresh. If the guardrail lives in the prompt, every session re-establishes it from scratch.

The honesty-suppression test in the corruption piece is the smallest reproducible demonstration. A single banned word, surfaced through every channel the harness exposes (CLAUDE.md, skills, system reminders, project memory), still leaks. Every more complex guardrail is violated at a higher rate.

What lives outside the prompt

A guardrail is something the agent cannot remove by re-reading its instructions. Six pieces, each catching a different failure class.

Scoped identities with minimum-necessary credentials. The agent has its own database role, service account, and API keys. The role grants the minimum permission the task requires: read-only by default, explicit grants on the narrow write paths it actually needs. Revocation is one DROP ROLE away. The agent cannot escalate by rephrasing its own context, because the credentials live in a layer the model cannot read.
MCP and tool surfaces from trusted sources only. An untrusted MCP server is a malicious tool the agent will call with the same confidence as a benign one. A public-registry MCP server from an unknown author is unsigned code from the internet, with the agent doing the executing. Trust belongs at the connection layer: allowlists, signed manifests, internal-only registries. A prompt that says “only use trusted tools” is the agent grading its own homework.
Harness hooks that intercept tool calls. Claude Code fires shell commands on events like PreToolUse and PostToolUse. A PreToolUse hook reads the tool name and arguments and returns allow or deny; the decision is made in a separate process the model cannot override. The pattern handles concrete bans the prompt cannot reliably enforce: blocking Edit on .env or secrets/, blocking Bash against regexes like rm -rf or DROP TABLE, blocking Write to migration directories outside an explicit unlock. Codex’s approvals primitive is a narrower version of the same idea, and some harnesses expose nothing comparable, in which case this layer has to be built outside the harness.
Wire-level filtering between the agent and the database. A SQL-aware proxy in front of the database parses every statement and blocks denylist matches: DROP, TRUNCATE, unqualified DELETE, schema-modifying DDL outside an unlock window, queries that touch tables the agent’s role has no business reading. ProxySQL with query rules, pgBouncer with extensions, and custom proxies all sit here; commercial SQL-firewall products exist but the open-source space is thin. The same pattern applies one layer up: an MCP server or RPC layer you control exposes only validated operations rather than passing arbitrary SQL through. The agent never speaks raw SQL to production. It speaks to an interface that decides what reaches the wire.
Structured logs with prompt provenance. Every tool call captured: input prompt, tool name, arguments, response, timestamp, agent identity. A pre-AI audit log captured the SQL and treated it as sufficient. The AI-era version captures the reasoning context that produced it, because the SQL alone is incomplete in any incident review. Corruption can surface months after the prompt that produced it.
Behavioral monitoring with anomaly alerts. Rate limits per agent identity. Baselines for normal call volume, tables touched, read and write volume. Alerts on threshold crossings: a 100x increase in DELETE calls, a sudden read of a table the agent has never touched, a write to a schema outside its usual scope. Agents are non-human users with their own behavioral baselines, and the reference frame is anomaly detection on user accounts in security tooling.
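The hook layer in the list above reduces to a pure decision function. A minimal sketch follows: the payload field names (`file_path`, `command`) and the ban lists are assumptions for illustration, and the wiring that feeds the harness's hook payload into `decide` and reports the verdict back is deliberately left out because it differs per harness.

```python
import re

# Hypothetical ban list; a real deployment would load this from config.
BLOCKED_COMMANDS = [
    re.compile(r"\brm\s+-rf\b"),
    re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE),
]


def decide(tool_name, tool_input):
    """Return ("allow" or "deny", reason) for one proposed tool call."""
    if tool_name in ("Edit", "Write"):
        path = tool_input.get("file_path", "")
        parts = path.replace("\\", "/").split("/")
        # Deny edits to a .env file or to anything under a secrets directory.
        if parts[-1] == ".env" or "secrets" in parts[:-1]:
            return "deny", f"protected path: {path}"
    if tool_name == "Bash":
        command = tool_input.get("command", "")
        for pattern in BLOCKED_COMMANDS:
            if pattern.search(command):
                return "deny", f"blocked pattern: {pattern.pattern}"
    return "allow", ""
```

The function never consults the conversation. Its inputs are the tool name and arguments, so no phrasing in the context window changes the verdict.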

None of these lives in the context window, which means the agent cannot route around them by being asked nicely. The credentials are checked by the database. The MCP allowlist is checked by the connection layer. The hooks run in separate processes. The proxy parses every statement before it reaches the database. The logs are written by the harness. The alerts are evaluated by a separate system. Every layer is enforced by something that does not sample from a probability distribution.
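The statement check the proxy performs can be sketched as a denylist. This is a simplification: it matches with regular expressions on a single statement, where a real SQL-aware proxy would parse the statement and handle comments, CTEs, and multi-statement strings. The rule set is an assumption for illustration.

```python
import re

# Hypothetical denylist mirroring the categories named in the list above.
DENY_RULES = [
    ("DROP", re.compile(r"^\s*DROP\b", re.IGNORECASE)),
    ("TRUNCATE", re.compile(r"^\s*TRUNCATE\b", re.IGNORECASE)),
    ("schema-modifying DDL", re.compile(r"^\s*(ALTER|CREATE)\b", re.IGNORECASE)),
]
DELETE_RE = re.compile(r"^\s*DELETE\s+FROM\b", re.IGNORECASE)
WHERE_RE = re.compile(r"\bWHERE\b", re.IGNORECASE)


def check_statement(sql):
    """Return (allowed, reason) for one statement headed to the database."""
    for name, pattern in DENY_RULES:
        if pattern.search(sql):
            return False, f"{name} is not permitted"
    if DELETE_RE.search(sql) and not WHERE_RE.search(sql):
        return False, "DELETE without WHERE is not permitted"
    return True, "ok"
```

Note that the DELETE from the opening anecdote carries a WHERE clause and would pass this denylist. That is why the statement filter is one layer among six and not a substitute for a scoped role.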

Each layer is engineering work. On a small enough deployment, the cost exceeds the cost of an incident.

When this doesn’t apply

Read-only agents, with a caveat. Analytics, query, chat. A read-only role at the database is the layer for write damage, but reads are not free: every row the agent fetches is sent to the model provider as part of the next prompt. A table holding API keys, customer PII, or internal hostnames is not safe to expose to a third-party-hosted model just because the agent cannot write. Either the role’s grants exclude those tables, a wire-level filter strips sensitive columns, or the agent runs against a sanitized snapshot.

Toy environments. Playgrounds, scratch databases, demos. The whole point is that the agent can break things; guardrails are friction.

Single-operator small teams where the agent is the operator’s autocomplete. The operator is the verification layer. The agent’s permissions are intentionally equivalent to the operator’s.

Everyone else needs all six. Agents holding production credentials. Agents working against shared infrastructure. Agents wired into CI/CD. Agents reading and writing customer data.

The bigger picture

None of these layers is novel for staff engineers. Scoped credentials, vetted tool surfaces, request interception at every layer, audit logs, anomaly detection. Every one is standard practice for any system that touches production, except they tend to be applied to humans and to other software. The shift the AI era forces is treating the agent as a separate principal that needs its own version, and treating the prompt as something other than a security control.

Letting AI Manage Your Indexes: the System and Guardrails the SME Has to Build

Wed, 29 Apr 2026 00:00:00 +0000

TL;DR

AI can propose and ship index changes against a database where the SME has built two things: a context system (comments on indexes, recorded history, workload evidence surfaced into the prompt) and a guardrails layer (performance regression tests, catalog-redundancy checks, drop-safety rules, post-deploy monitors) that catches the corruption-floor errors any LLM produces. Without both, the loop collapses into “ask the assistant to fix the slow query,” and the index set grows monotonically because the model has neither memory nor visibility into what already exists.

The dashboard is slow. An engineer pastes the query into the assistant and gets back a CREATE INDEX on three columns from the WHERE clause. The query drops from 800 ms to 12 ms. Ticket closed. Three weeks later a different engineer files a similar ticket against a sibling query on the same table. Same flow, different index, same satisfying speedup. Six months and a hundred sessions later, the orders table carries fourteen secondary indexes. pg_stat_user_indexes reports idx_scan = 0 on eight of them. Three are strict prefixes of larger composite indexes that already cover the same predicate. The table’s index volume now exceeds its heap volume. p99 INSERT latency has drifted from 9 ms to 31 ms over the same period, and no single deployment is responsible. Nobody added more than one index. Everyone added one.

The obvious response is “the model is bad at this, don’t let it touch indexes.” That’s half right. Any LLM is bound by the corruption floor; any proposal can quietly miss a constraint and ship a plausible-looking wrong answer, which is exactly what happens above when prefix-redundant indexes get created because the model can’t see the existing list. The model is the wrong variable to focus on. What does fix the loop is two pieces of work the SME owns: surface the context the model needs to make a grounded proposal, and build the guardrails that catch the residual errors before they ship. With both, AI does the bookkeeping a careful human would do, faster. With neither, the dashboard scenario above is the steady-state behavior.

Why the catalog isn’t enough

Indexes are a workload property. The catalog is a schema property. When nothing has been built to bridge that mismatch, every AI-driven index mistake traces back to it.

A schema describes columns, types, and constraints. It says nothing about how the table is read, in what proportions, with what selectivity, or how often each predicate fires under production load. The two pieces of evidence that matter most for any index decision live entirely outside the catalog: the workload itself (slow-query log, pg_stat_statements, the application’s actual query mix) and the planner’s recorded behavior (pg_stat_user_indexes, pg_stat_user_tables). An assistant reading the catalog sees neither. It sees the schema, the one query in the prompt, and a vague sense from training data of what indexes “tend to” exist on tables that look like this one.

That gap explains the failure modes. Without the existing index list, the assistant proposes (customer_id, status) when (customer_id, status, created_at) already exists and serves the same predicate as a left-prefix match. Without selectivity statistics, the assistant orders composite columns by the order they appeared in the WHERE clause, producing indexes whose leading column has 4 distinct values across 50M rows. Without the write/read ratio, every proposal is implicitly priced as free on the write path.

Each session also starts fresh. There’s no continuity between the assistant that proposed idx_orders_status_created last quarter and the one being asked the same question today. A reviewer six months ago tried that exact index, found the planner ignored it because of correlated columns, and removed it. The next session has no record of any of that and proposes it again.

The lifecycle is asymmetric. AI is asked to make slow queries faster, a question that resolves with a CREATE. AI is rarely asked to make the index set smaller, because nobody files a ticket for “we have too many indexes” until something is on fire. Every interaction nudges the count up; nothing in the loop nudges it back down.

All of these gaps have the same shape: information that exists somewhere outside the catalog, and that the assistant has no path to unless something puts it there.

What every new index actually costs

The default mental model is “an index makes reads faster, what’s the harm.” The harm is real and shows up across every layer the database touches.

Every INSERT, every UPDATE on an indexed column, and every DELETE updates every relevant index. A row written to a table with nine indexes is nine extra B-tree descents, nine page pins, nine potential page splits, and nine WAL records. The per-row write cost grows roughly linearly with the index count, and an UPDATE that migrates the row touches every index when HOT can’t apply.

WAL volume is what replicas consume, so the same cost replays on every standby and shows up as replication lag under load. A write-heavy workload with a redundant index set can saturate the replication channel before it saturates the primary’s local disk, and the failure mode reads as “the replicas are falling behind” rather than “we have too many indexes.”

Indexes occupy buffer-pool pages that would otherwise hold hot heap data. A table with twice the index volume has roughly half the cache headroom for the heap. Backup size, restore time, and vacuum I/O all scale with heap plus index volume, not heap volume alone. On most production systems, index volume already exceeds heap volume.

There’s also a hidden second-order effect that the local optimization framing misses. Index choice changes the planner’s decisions for other queries. An index that helps query A by a measured 50 ms can shift the plan for query B onto a worse path costing 500 ms on a code path nobody’s currently watching. The session adding index A has no visibility into this. The regression surfaces a week later as “query B got slower,” the on-call engineer reaches for the assistant, and another index gets proposed for query B. The cycle repeats.

Every new index is a permanent write tax paid on every transaction, in exchange for a read benefit on a subset of queries. The math only works when the read benefit is real and large enough to matter, the index is actually used by the planner under production statistics, and no existing index could have served the workload through extension or column reordering. Establishing those three points is exactly what the system around the assistant exists to do.

Most slow queries aren’t index problems

The assistant’s default framing is “slow query → CREATE INDEX.” That collapses a much larger decision tree into the move with the highest permanent cost. Four cheaper interventions usually exist, and at least one of them resolves the slowness without adding anything to the catalog.

The query itself can be wrong. A predicate wrapped in a function (WHERE LOWER(email) = ?, WHERE DATE(created_at) = ?) is non-sargable and won’t use any regular index, so adding one accomplishes nothing. The fix is rewriting the predicate or fixing the column’s collation. Non-SARGable predicates covers the catalog. The same shape applies to implicit type casts on join columns, OR predicates that defeat composite indexes, and OFFSET-based pagination, where each page costs time proportional to its offset and a full traversal costs quadratic time.

The application can change access pattern. List views that paginate with LIMIT/OFFSET past page 50 belong on keyset pagination, where the client passes the last seen (created_at, id) tuple and the query becomes a sargable range scan against an existing index. Sorting that the database is doing on a non-indexed column for a result set the client only ever reads twenty rows of can move client-side. Aggregations that fire on every page load can be cached or pre-computed by the application, removing the read pressure entirely rather than indexing around it. The pattern across all three: the slowness is a property of how the application is asking, not of what the database has indexed.
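Keyset pagination as described above can be sketched in a few lines. SQLite is used here only so the sketch runs standalone; the row-value comparison it relies on is also available in PostgreSQL. The `orders` columns, the index name, and the demo data are assumptions for illustration.

```python
import sqlite3


def fetch_page(conn, page_size, after=None):
    """Return one page ordered by (created_at, id) descending.

    `after` is the (created_at, id) tuple of the last row the client saw.
    """
    if after is None:
        return conn.execute(
            "SELECT created_at, id FROM orders "
            "ORDER BY created_at DESC, id DESC LIMIT ?",
            (page_size,),
        ).fetchall()
    # Row-value comparison: resume strictly after the last row seen,
    # which lets the database range-scan the (created_at, id) index.
    return conn.execute(
        "SELECT created_at, id FROM orders "
        "WHERE (created_at, id) < (?, ?) "
        "ORDER BY created_at DESC, id DESC LIMIT ?",
        (after[0], after[1], page_size),
    ).fetchall()


def build_demo_db():
    """Create a small in-memory table; pairs of rows share a timestamp."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX idx_orders_created_id ON orders (created_at, id)")
    conn.executemany(
        "INSERT INTO orders (id, created_at) VALUES (?, ?)",
        [(i, f"2026-01-{(i // 2) + 1:02d}") for i in range(1, 11)],
    )
    return conn
```

The client passes back the last `(created_at, id)` pair from each page as the cursor for the next one. The `id` tiebreaker keeps rows with identical timestamps from being skipped or repeated across page boundaries.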

The statistics can be stale. The planner can have the right index already and refuse to use it because pg_stats thinks the predicate matches 80% of the table when it actually matches 0.1%. ANALYZE is the first move on any “the planner won’t use the index” complaint, and on correlated columns the fix is CREATE STATISTICS (Postgres) or histogram statistics (MySQL, which builds them per column through ANALYZE TABLE ... UPDATE HISTOGRAM), not another index.

The schema can be the actual problem. A JSON column the workload filters on every read is paying the JSON-extract cost on every row no matter what indexes are added on the side; the durable fix is promoting the queried keys to typed or generated columns. A VARCHAR column carrying numeric IDs forces an implicit cast on every lookup that no index can rescue. A polymorphic resource_id column whose target depends on a sibling discriminator can’t be indexed in a way the planner uses for the conditional join the application actually wants.

The assistant’s default skips all four because the catalog doesn’t surface any of them. A model prompted with the diagnostic ladder explicitly will work through it; a model prompted with “this query is slow, what should we do” will reach for the index. The difference is whether the prompt was constructed by a system that knows the ladder exists.

What the SME builds: context

The first half of the SME’s work is the context system: what the model sees in the prompt, and what persists between sessions. Five surfaces, each closing a specific gap the catalog leaves open.

Each index has a documented purpose. PostgreSQL supports COMMENT ON INDEX; MySQL supports an INDEX ... COMMENT clause on creation. Use either, and put a one-line description of which query the index serves and why the column order is what it is. Naming conventions carry the same load: idx_orders_dashboard_list_v2 is more legible than idx_orders_status_created_customer. Both surfaces are queryable: PostgreSQL stores index comments in pg_description (readable through obj_description()), MySQL exposes them as INDEX_COMMENT in information_schema.STATISTICS, and index names appear in either catalog, so any tool reading the catalog picks them up.

History is persisted somewhere queryable. A migration log, a docs/indexes/<table>.md file, or a comment block on the table itself, recording what was tried and dropped. “We tried (status, created_at) in 2025-Q3, the planner ignored it because of correlated columns, removed in migration 0142” is the cheapest way to keep the next session (any session, six months from now) from proposing the same dead end.

Workload evidence goes into the prompt. A slow-query log entry, an EXPLAIN ANALYZE against representative data, the current pg_stat_statements count for the query, and a snapshot of pg_stat_user_indexes for the table. The artifacts ground the proposal in real workload rather than a hypothetical, and they’re exactly the evidence the model can use if it’s handed in.

The existing-index dump goes into the prompt. Before any proposal, dump the current indexes on the table and the indexes the planner has been using for nearby queries. The dump catches redundancy (proposed index is a left-prefix of an existing composite) and supersession (the existing composite would serve the new query if its column order were tweaked or an INCLUDE clause were added). A model handed the dump routinely catches this; a model not handed it routinely misses it.

Hypothetical indexes go first. PostgreSQL’s HypoPG creates a fake index, the planner costs it as if it existed, and EXPLAIN reports whether it would be used. The cost is zero, and the signal is whether the proposed index would actually change the plan under current statistics. MySQL has no direct equivalent; the discipline there is to validate the proposal against a recent production snapshot before merging.

What the SME builds: guardrails

The context system gets the model to a grounded proposal. The guardrails catch what the model still gets wrong, the same way tests on database code catch regressions a thoughtful human can produce.

Performance regression tests on hot queries. For each query the team cares about, write a test that runs EXPLAIN and asserts the plan uses the expected index, or stays under a row-count budget, or doesn’t fall back to a sequential scan. Run on every migration. The test catches “AI added an index that shifted the planner onto a worse path for query B” - a class of failure that’s invisible at code review time.
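One way to express such an assertion is to walk the plan tree that EXPLAIN (FORMAT JSON) returns. The sketch below works on an already-parsed plan, so it runs without a database; how the plan is fetched, and the function and parameter names, are assumptions for illustration.

```python
def walk_plan(node):
    """Yield a plan node and every node nested under its "Plans" key."""
    yield node
    for child in node.get("Plans", []):
        yield from walk_plan(child)


def plan_violations(explain_json, expected_index, forbid_seq_scan_on=()):
    """Return a list of problems found in a parsed EXPLAIN (FORMAT JSON) result."""
    root = explain_json[0]["Plan"]  # the JSON output is a one-element array
    nodes = list(walk_plan(root))
    problems = []

    used = {n["Index Name"] for n in nodes if "Index Name" in n}
    if expected_index not in used:
        problems.append(f"expected index {expected_index} not used")

    for n in nodes:
        table = n.get("Relation Name")
        if n.get("Node Type") == "Seq Scan" and table in forbid_seq_scan_on:
            problems.append(f"sequential scan on {table}")
    return problems
```

A test for a hot query then asserts that `plan_violations(...)` returns an empty list, and the failure message names which expectation broke.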

Catalog-redundancy linter. Block CI when a migration adds an index that’s a strict left-prefix of an existing composite, or a single-column index that duplicates a leading column of one. The check is mechanical, the rule fits in a small SQL query against pg_index, and it catches the most common AI failure mode without any human in the loop.
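The prefix rule itself is small. The sketch below takes column lists as plain tuples and ignores uniqueness, partial-index predicates, sort direction, and operator classes, all of which a production linter would need to compare; the index names are hypothetical.

```python
def is_left_prefix(candidate, existing):
    """True when `candidate` columns are the leading columns of `existing`."""
    candidate, existing = tuple(candidate), tuple(existing)
    return len(candidate) <= len(existing) and existing[: len(candidate)] == candidate


def redundant_with(proposed, existing_indexes):
    """Return names of existing indexes that already cover `proposed`."""
    return [
        name
        for name, columns in existing_indexes.items()
        if is_left_prefix(proposed, columns)
    ]
```

Given an existing index on `(customer_id, status, created_at)`, a proposal for `(customer_id, status)` is flagged, while `(customer_id, created_at)` is not, because `status` sits between the two columns in the existing index.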

Drop-safety check. Before any DROP INDEX lands, the check confirms idx_scan has not advanced for N days (the counter is cumulative, so this means comparing snapshots) and the index’s comment doesn’t flag it as kept for a known non-daily workload. The check fails loud and the migration doesn’t run. This is what the comment-on-index discipline above pays back: the comment is the data the check reads.
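A sketch of the check, assuming something already records periodic `(date, idx_scan)` snapshots from pg_stat_user_indexes. The `keep:` comment marker and the 30-day default are invented conventions for illustration, not anything the database defines.

```python
from datetime import date, timedelta

KEEP_MARKER = "keep:"  # hypothetical comment convention for non-daily workloads


def safe_to_drop(snapshots, comment, today, quiet_days=30):
    """Decide whether an index may be dropped.

    snapshots: list of (date, idx_scan) pairs; idx_scan is cumulative.
    Returns (allowed, reason).
    """
    if KEEP_MARKER in (comment or "").lower():
        return False, "comment flags a known non-daily workload"

    window_start = today - timedelta(days=quiet_days)
    before = [s for s in snapshots if s[0] <= window_start]
    if not before:
        return False, f"no snapshot at least {quiet_days} days old"

    baseline = max(before)[1]   # latest snapshot at or before the window start
    latest = max(snapshots)[1]  # most recent snapshot overall
    if latest != baseline:
        # Also trips on a stats reset, which makes the history unreliable.
        return False, f"idx_scan changed from {baseline} to {latest} in the window"
    return True, "no scans in the window"
```

The check refuses by default: a missing history, a changed counter, or a keep marker each block the drop, and only an unchanged counter across the full window allows it.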

Lock-budget guards. Block DDL that would take a write-blocking lock on tables tagged as hot (SHARE for a plain CREATE INDEX, ACCESS EXCLUSIVE for a plain DROP INDEX), unless the migration uses CONCURRENTLY (Postgres) or the equivalent online algorithm (MySQL). Catches “AI proposed CREATE INDEX without CONCURRENTLY on a 500M-row table” before it reaches production.
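The CREATE INDEX case can be sketched as a text check on the migration file. This is a simplification that matches PostgreSQL syntax with a regular expression and does not handle quoted identifiers or SQL comments; the hot-table set is assumed to come from wherever the team tags its tables.

```python
import re

CREATE_INDEX_RE = re.compile(
    r"CREATE\s+(?:UNIQUE\s+)?INDEX\s+"
    r"(?P<concurrently>CONCURRENTLY\s+)?"
    r"(?:IF\s+NOT\s+EXISTS\s+)?"
    r"(?:(?P<name>\w+)\s+)?"
    r"ON\s+(?:ONLY\s+)?(?P<table>[\w.]+)",
    re.IGNORECASE,
)


def lock_violations(migration_sql, hot_tables):
    """Return one message per CREATE INDEX on a hot table lacking CONCURRENTLY."""
    problems = []
    for match in CREATE_INDEX_RE.finditer(migration_sql):
        # Drop any schema qualifier before comparing against the hot list.
        table = match.group("table").split(".")[-1].lower()
        if table in hot_tables and not match.group("concurrently"):
            problems.append(
                f"CREATE INDEX on hot table {table} without CONCURRENTLY"
            )
    return problems
```

CI fails the migration when the returned list is non-empty, so the fix is visible in the diff before anything reaches the database.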

Continuous index-health monitoring. Workloads shift constantly: queries get removed from the application, access patterns change, table sizes grow past the point where a once-useful index stops mattering, a deploy reroutes the planner to a different index. None of those surface in the catalog. A long-running monitor watches pg_stat_user_indexes, pg_stat_statements, and write-path latency, and fires when a previously-hot index’s scan count flatlines, when its write-cost-to-read-benefit ratio crosses a threshold, or when the planner walks away from an index that hot queries used to depend on. Each is a separate alert. The corruption-floor failure that survives every other check usually shows up here first, as a metric change.

When a monitor fires, the alert is a context-gathering job the model is well-suited for. The LLM pulls the index’s comment, the migration history, the documented purpose, recent pg_stat_statements data, and any related queries from across the catalog, and produces a summary: what the index was meant to serve, what the data shows about its current usage, and a proposed disposition. The SME reads the summary and makes the call. The drop-safety check above is the floor underneath the call: even if the SME approves the drop, the migration doesn’t run if the comment flags a known non-daily workload.

The honest trade-off is that this isn’t free. Building the context surface and the guardrails up front is real work, and on a small or short-lived database it’s overkill. The work pays back when the system is large enough that no human reliably has the whole picture, the workload changes faster than any one person can track, and AI is being used in the loop. At that scale, every component above is cheaper than the failure it prevents.

When the discipline isn’t worth the friction

The system earns its cost on production OLTP databases with multiple writers, sustained traffic, and a year or more of accumulated drift ahead. It’s overkill in three places.

OLAP and columnar workloads work differently. ClickHouse, DuckDB, and BigQuery don’t carry the same per-row write tax, and the article’s mental model doesn’t transfer. Very small tables don’t repay the discipline either; a 5,000-row admin table with a dozen lookup indexes is using a few megabytes of cache and adding microseconds to writes that aren’t on any hot path. Single-writer workloads with a small, enumerable query set are the third case: a reporting database serving twenty known queries from one ingestion job has an index set that can be designed up front and reviewed by hand. The system pays for itself when the query mix is large enough that no human reliably has the whole picture.

The bigger picture

The recurring pattern across self-documenting schemas, foreign keys, and column comments is that the work of making AI useful on a production database is the same work that makes the database legible to any reader. Indexes are the same shape one level up. An index whose comment explains its purpose, whose history is recorded, and whose usage shows up in a regression test that runs on every migration is an index a human or an assistant can reason about safely. An index named idx_orders_status_created_customer_3 with no comment, no recorded history, and no test asserting which query depends on it is an index neither can reason about, and the failure mode is the same in both cases.

The SME’s role in an AI-assisted database is the work AI doesn’t do: build the context surface, and build the guardrails that catch the corruption-floor errors the model still produces against the best context. The model proposes. The system the SME built is what makes those proposals safe to ship.

Corruption Is a Feature, Not a Bug: Why LLMs Corrupt by Design

Wed, 22 Apr 2026 00:00:00 +0000

TL;DR

Frontier LLMs corrupt at least 25% of delegated multi-step document work in lab conditions. The rate rises with document size and turn count, and tool use doesn’t help. Corruption is a property of the architecture, not a defect to be patched, and the only thing that closes the gap is a best-in-class domain expert at every checkpoint.

Microsoft Research has a number on it. Laban, Schnabel, and Neville’s LLMs Corrupt Your Documents When You Delegate (arxiv 2604.15597, April 2026) ran the DELEGATE-52 benchmark across 52 professional domains (coding, crystallography, music notation, professional writing) against 19 frontier LLMs including Claude 4.6 Opus, GPT 5.4, and Gemini 3.1 Pro. Average corruption: 25% of document content by the end of long workflows. Tool use doesn’t fix it. Agentic harnesses don’t fix it. Larger documents and longer interactions make it worse, not better. The number doesn’t depend on which frontier model you pick.

The 25% is a floor, not a ceiling

The benchmark is a controlled lab measurement against curated tasks with known ground truth. Real production has every reality catalogued in What AI Gets Wrong About Your Database: undocumented conventions, polysemic columns, four-format date strings, JSON-as-schema, business logic in tribal knowledge, ten-year-old codebases with three “current” patterns for the same operation. The model in production reads from that impoverished signal, and the rate multiplies. The 25% is what you get on a good day on clean data. Production is not a good day.

The version doesn’t matter - corruption is a feature

Claude Opus 4.7 is the latest as of writing. DELEGATE-52 measured 4.6, GPT 5.4, Gemini 3.1 Pro. The next generation will measure at the same floor. Not because the labs aren’t trying (they are) but because the corruption isn’t a defect to patch. It’s the property you bought when you bought “language model.” The same mechanism that makes the model useful (generalizing from a training distribution to plausible novel output) is the one that makes it corrupt your document (generalizing from a training distribution to plausible novel output that doesn’t match your specific facts). You can’t fix one without losing the other.

The framing this post takes is that an LLM is a probability machine first and an intelligence-shaped artifact second. That’s a stance, not an uncontested fact, but the engineering implications of the post all follow from taking the first reading seriously.

The obvious fixes that don’t work

The reflex when the rate is 25% is to reach for the things that usually fix software defects. None of them touch the floor:

“Use a better model.” The benchmark already measured the frontier. Same rate.
“Add tools, RAG, fine-tuning.” Tool use doesn’t change the rate. RAG narrows the prior, but the same sampling mechanism draws from it. Fine-tuning shifts the distribution; it doesn’t add deterministic constraints.
“Add agent self-verification.” The verifier is the same architecture reading the same training distribution as the generator. It will ratify the corruption with the same confidence the generator produced it with.
“Add more context.” What AI Gets Wrong About Your Database already covered this. More context lowers the rate, doesn’t drive it to zero. The hallucination floor is structural.

These aren’t bad ideas. They lower the rate from terrible to bad. They don’t make delegation safe.

Why this is structural: the mechanism

The mechanism is worth understanding because it’s what tells you why “use a better model” doesn’t move the floor.

Start with embeddings. The model “understands” users.deleted_at as a vector position adjacent to other deleted_at columns it saw during training. There is no concept of your soft-delete convention, your tenant filter, the incident your team had last quarter, or the rule you wrote into the catalog comment two months ago. The vector is a fingerprint of what tokens like that one tend to appear next to in a billion training documents, not a fact set the model can check against.

Attention works the same way. Each output token is a weighted blend of every other token in context, with the weights being learned similarity scores. The model isn’t looking up “the right answer for this schema.” It’s computing a weighted average of what tokens like this one tend to be followed by in its training distribution. Correctness is the special case where the distribution happens to be sharply peaked on the correct token.

Generation pulls all of that into a sampling step. Every token is a draw from a probability distribution. When training data is dense and consistent for the topic, the distribution is sharp; but a sharp distribution is reliable token-relationship probability, not knowledge of the answer. The tokens in that region of the data point reliably to a particular continuation, and the relationships are co-occurrence statistics with no internal check against reality. When the underlying relationship happens to match the world, the guess looks right. When it doesn’t (a popular misconception in the training corpus, an outdated convention, a pattern typical of training data but not of your specific case) the guess is wrong and delivered with the same confidence. When training data is sparse, contradictory, or local to your codebase, the distribution flattens, the guess gets noisier, and the confidence the model expresses stays the same. Nothing in the architecture says “I don’t know.” It says “this token has the highest probability in my distribution,” even when that probability is barely above random, or sharp on a relationship that doesn’t hold for your case.
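A minimal sketch of that last point, using made-up logits rather than anything read out of a real model. The four candidate strings and both score vectors are illustrative assumptions. The decoding step hands back a token either way, and the returned text carries no trace of which distribution it came from.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick(tokens, logits):
    """Greedy decoding: return the most probable token and its probability."""
    probs = softmax(logits)
    best = max(range(len(tokens)), key=lambda i: probs[i])
    return tokens[best], probs[best]

# Hypothetical candidate continuations for a filter on a users table.
tokens = ["deleted_at IS NULL", "is_active = true", "status = 'live'", "archived = 0"]

sharp_logits = [9.0, 2.0, 1.0, 0.5]      # stands in for dense, consistent training data
flat_logits = [1.02, 1.00, 0.99, 0.98]   # stands in for sparse or contradictory data

sharp_token, sharp_p = pick(tokens, sharp_logits)
flat_token, flat_p = pick(tokens, flat_logits)
```

Both calls return the same string. In the flat case its probability sits just above the 0.25 a uniform pick over four options would give, but the caller normally sees only the token, not the number behind it.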

To collapse those three steps into the operation that actually runs: every token is a vector, a position in a high-dimensional space, typically thousands of dimensions (4,096 in some models, 12,288 in others). Similarity between two tokens is the dot product (or cosine similarity) of their vectors. Attention computes its weights by taking those similarity scores between the current position and every prior token in context, then softmax-normalizing. The probability of the next token is the dot product of the model’s predicted direction against every candidate token’s vector in the vocabulary, passed through a softmax to produce a distribution. Every probability you read out of the model is a similarity computation between vectors in that high-dimensional space. Understanding is position. Probability is geometric proximity. There’s no step in the pipeline where knowledge enters; only matrix multiplications and a normalizing function.
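The same operation as a toy sketch, under heavy simplification. The 3-dimensional vectors are invented, and the learned projection matrices and scaling that a real transformer applies are left out. It shows only the shape of the computation: dot products, a softmax, a blend, another set of dot products, another softmax.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 3-dimensional vectors (made up); real models use thousands of dimensions.
context = {
    "users":      [0.9, 0.1, 0.0],
    "deleted_at": [0.2, 0.8, 0.1],
    "WHERE":      [0.1, 0.3, 0.9],
}
query = [0.1, 0.9, 0.2]   # vector for the current position (made up)

# Attention: similarity of the current position to each context token, then softmax.
names = list(context)
weights = softmax([dot(query, context[n]) for n in names])

# Blend the context vectors by those weights to get the predicted direction.
blended = [sum(w * context[n][d] for w, n in zip(weights, names)) for d in range(3)]

# Next-token distribution: dot the blend against each vocabulary vector, then softmax.
vocab = {
    "IS":   [0.2, 0.9, 0.1],
    "=":    [0.3, 0.4, 0.5],
    "JOIN": [0.9, 0.0, 0.1],
}
next_probs = dict(zip(vocab, softmax([dot(blended, v) for v in vocab.values()])))
likeliest = max(next_probs, key=next_probs.get)
```

Nothing in those lines consults a schema, a catalog comment, or a convention. The winner is whichever vocabulary vector points most nearly the same way as the blend.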

DELEGATE-52’s 25% is the rate at which the distribution flattened across 52 domains’ edge cases and the sampling collapsed to plausible-sounding hallucination. The confidence reading stayed identical to when the model was right. This is Part 1’s “confidence is anti-signal” restated at the architecture level: confidence and correctness are produced by different mechanisms, neither tied to the other.

Why best-in-class SME is the load-bearing safeguard

Humans don’t operate this way. A crystallographer knows the unit cell parameters have to satisfy specific symmetry constraints, not because she’s seen a million similar structures, but because the constraints follow from a small set of facts she can verify against. A senior database engineer knows the soft-delete convention because she wrote it. A composer knows the chord progression doesn’t resolve because the leading tone wasn’t raised. That knowledge is symbolic, propositional, traceable to evidence the human can produce on demand. It isn’t a probability distribution.

That is the gap an SME closes. Not “any reviewer with a checklist”; a generalist reviewer can’t tell when the model has silently corrupted the symmetry constraints, the soft-delete predicate, or the leading-tone rule. The corruption looks plausible because the model’s job is producing plausible output. Catching it requires someone who holds the actual constraints in their head and can check the model’s output against them. The cheaper the SME, the more corruption ships.

The labor-market reading of this is already visible. IBM announced in February 2026 it would triple US entry-level hiring, explicitly because the AI era hollowed out the rote tasks that used to fill junior roles and left the load-bearing work (judgment, customer interaction, oversight of automated systems) needing humans who grow into it. The pipeline argument is unforgiving: cut juniors to capture the AI-productivity dividend, save short-term, and starve the senior layer the next decade of work depends on. The companies treating today’s juniors as a long bet on the experts they’ll become are reading the architecture honestly. The ones cutting them to bank the LLM savings are paying down the pipeline their competitors are building.

The supply side is tightening on both ends. The senior tier is retiring out (the engineers who built the soft-delete conventions, the schema histories, the production-incident memory) and that institutional knowledge isn’t transferring into a probability distribution any model can sample from. On the formation side, Anthropic’s own research shows engineers using AI assistance score 50% on comprehension quizzes about the code they shipped, against 67% for engineers writing the same code unaided - a 17-point gap. The skill-formation loop that turns juniors into SMEs over a decade (write, struggle, debug, internalize) gets shortcut by tooling that produces working code without the struggle. Companies that don’t actively design against both effects get the worst of three pressures: SMEs retiring, juniors not deepening, and an architecture that has no internal substitute for either.

The unintuitive recommendation that follows: don’t fire the humans you have. The SME labor market will tighten faster than LLM tooling can replace what SMEs do. Supply is shrinking on both ends, the architecture has no substitute, and today’s senior engineer is cheap relative to their replacement cost in three to five years. Companies banking the LLM productivity dividend by cutting senior staff are trading short-term margin for a much steeper rehiring bill against a constrained future market. The math will look obvious in retrospect. It doesn’t look obvious now because the LLM line item lands on the income statement before the SME-shortage bill arrives.

This is why “best-in-class” is load-bearing in the title. A junior with a checklist runs the same architecture-level pattern-matching the model does: recognizes things that look right, doesn’t catch silent semantic drift. A top-of-class SME has the constraint set internalized to the point that the wrong answer feels wrong, even when the surface looks correct. That feeling is the safeguard the architecture cannot provide.

The system

Don’t delegate end-to-end. Decompose work into chunks small enough that an SME can verify each one in minutes. Checkpoint between chunks. Route each checkpoint to the SME whose domain it’s in. Treat every chunk as 25%-floor untrusted by default. Don’t trust agentic chains to self-verify (the verifier reads from the same training distribution as the generator). Don’t trust LLM-judge eval as a release gate (Part 2 of the testing series covered why; the architectural reason is in the mechanism above). The system is decomposition plus checkpoints plus SMEs, not a better model and not a better prompt.
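One way the loop could look, as a sketch only. `Chunk`, `run_with_checkpoints`, and the two reviewer functions are hypothetical names, and the lambdas stand in for a human SME’s judgment, which is the one part that cannot be automated.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Chunk:
    domain: str            # which SME owns verification, e.g. "schema"
    output: str            # the model's draft for this chunk
    verified: bool = False

def run_with_checkpoints(
    chunks: list,
    reviewers: dict,       # domain -> Callable[[str], bool], one per SME
) -> tuple:
    """Stop at the first chunk its SME rejects so nothing is built on top of it."""
    accepted = []
    for chunk in chunks:
        review: Optional[Callable[[str], bool]] = reviewers.get(chunk.domain)
        if review is None:
            raise LookupError(f"no SME registered for domain {chunk.domain!r}")
        if not review(chunk.output):   # untrusted until a human signs off
            return accepted, chunk     # hand the rejected chunk back for rework
        chunk.verified = True
        accepted.append(chunk)
    return accepted, None

# Stand-ins for human checks (assumptions for the example, not real review logic).
reviewers = {
    "schema": lambda sql: "deleted_at IS NULL" in sql,
    "billing": lambda sql: "tenant_id" in sql,
}
chunks = [
    Chunk("schema", "SELECT id FROM users WHERE deleted_at IS NULL"),
    Chunk("billing", "SELECT sum(amount) FROM invoices"),   # missing the tenant filter
    Chunk("schema", "SELECT id FROM orders WHERE deleted_at IS NULL"),
]
accepted, rejected = run_with_checkpoints(chunks, reviewers)
```

The third chunk never runs. The design choice is that a rejection halts the chain instead of being logged and skipped, because later chunks would otherwise build on the rejected one.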

The cost is real. SMEs are expensive, the workflow is slower, and the temptation to skip checkpoints when the early ones look fine is constant. The cost of the alternative (silent 25%-floor corruption layered through a long workflow, surfaced six months later when the data has propagated past the recovery window) is much higher and structurally harder to detect. The math is the math the testing series already laid out: catching corruption pre-deployment is a fraction of the cost of finding it after.
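Rough arithmetic on “layered through a long workflow”. It rests on a strong simplifying assumption that the benchmark does not make: each unchecked chunk is corrupted independently at the same per-chunk rate. Under that assumption the chance that an unchecked chain comes out clean shrinks geometrically.

```python
def clean_probability(steps, per_step_rate=0.25):
    """Chance that every one of `steps` unchecked chunks is uncorrupted,
    assuming independent corruption at `per_step_rate` per chunk."""
    return (1 - per_step_rate) ** steps

# Print the chance of a fully clean chain for a few workflow lengths.
for steps in (1, 3, 5, 10):
    print(steps, round(clean_probability(steps), 3))
```

Three chained chunks are already likelier than not to contain at least one corruption. A checkpoint after each chunk resets the exponent to one.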

When this doesn’t apply

Drafts you’ll discard. Brainstorming, throwaway code, content the human will rewrite anyway. The model is generating a starting point, not delegated output.
The user is the SME. A senior database engineer using AI to draft SQL she’ll review line-by-line is using the model as autocomplete, not as delegation. The 25% is irrelevant because she’s the verification layer.
Low-stakes, recoverable work. A typo in a personal email isn’t a 25% corruption event you need to system-design around.
Bounded, well-trodden problems. Generating boilerplate in a popular language with a well-documented framework is the dense-distribution sweet spot. The rate is much lower because the prior is sharp.
Proof-of-concept and rapid-feedback work. “Does this idea work at all” needed in minutes. The 25% floor is the right trade because the output is a directional signal, not production code; the cost of being wrong is “we tried, didn’t pan out.”

The article is about the rest - production work where corruption is invisible and expensive to fix, and where the team is treating the model as a co-author instead of a guess machine.

The bigger picture

Calling the model “intelligence” is the framing that gets engineers in trouble. Intelligence implies a knowing entity that holds facts, checks them against evidence, and tells you when it doesn’t know. The architecture has none of those properties. It has a learned distribution and a sampling procedure. The output is a guess every time, and the guess is well-calibrated only where the training data was dense and consistent - precisely not where your specific codebase, your specific schema, or your specific domain conventions live.

The 25% floor is what that guarantees, in numbers. Versions don’t move it. Tools don’t move it. Bigger context doesn’t drive it to zero. The only thing that closes the gap between the architecture and the work is a human who knows the domain, checking the output against constraints the architecture can’t represent.

Treat the model as a probability machine and the engineering decisions get easier. Decompose. Checkpoint. Put the best SME you have on each domain. Build the testing layer the way the testing series describes. Stop expecting the next model to fix it.

How Teams Actually Finish What They Start, Part V: The Sprint as a Working Set

Tue, 21 Apr 2026 00:00:00 +0000

TL;DR

For teams whose week is shaped by inbound work, the sprint should hold only what is being worked on now plus what gets pulled next. No forward estimates, no velocity commitments. Priority lives in labels; the team pulls from the labeled backlog as in-flight work completes.

Tuesday at 2pm. Sprint planning. The team has been here for ninety minutes. Eighteen tickets on the board, points being argued about (was this a 5 or an 8 last quarter?). A senior engineer flags they have to leave for an interview at 3. The product manager wants to commit to 42 points so the velocity curve in the leadership deck stays smooth. Three operational tickets came in during the meeting itself. Nobody has touched code today. The sprint will start tomorrow with eighteen tickets the team has not properly looked at, plus the three that arrived during planning, plus whatever arrives over the next two weeks. The planning meeting was the work today.

Better grooming doesn’t fix it

The standard fixes target the planning meeting: better grooming, T-shirt sizing instead of points, async estimation in Slack. Each saves twenty minutes and leaves the underlying mistake intact. The mistake is the sprint trying to be a committed plan for two weeks of work the team has not done yet, on a team whose two weeks are not predictable. No amount of grooming makes the unpredictable predictable. The fix is a sprint that admits it.

The sprint as a working set

A different mode: the sprint holds only what is being worked on right now, plus the highest-priority items the team will pull next. That is it. The sprint stops being a two-week forecast or a velocity commitment. The board reflects reality rather than narrating it.

The mechanics follow. The backlog holds everything, labeled by priority. The manager keeps the labels current, and high-priority items rise to the top of the filtered view. The sprint holds only tickets currently in progress plus immediate next pulls. Engineers pull from the labeled backlog into the sprint as their current work completes. There is no forward estimation at planning, because points are written post-hoc (see Part IV). Planning becomes a short check-in: the team confirms priorities, surfaces blockers, and returns to work.
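Those mechanics are small enough to sketch. `Ticket` and `Board` are hypothetical stand-ins, not any tracker’s API, and the label names assume a `priority:pN` scheme.

```python
from dataclasses import dataclass, field

# Hypothetical label scheme; lower number means pull sooner.
PRIORITY_ORDER = {"priority:p0": 0, "priority:p1": 1, "priority:p2": 2}

@dataclass
class Ticket:
    key: str
    label: str
    status: str = "backlog"    # backlog -> in_progress -> done

@dataclass
class Board:
    backlog: list = field(default_factory=list)   # everything, labeled by priority
    sprint: list = field(default_factory=list)    # only what is being worked on now

    def pull_next(self):
        """Move the highest-priority backlog ticket into the sprint."""
        if not self.backlog:
            return None
        ticket = min(self.backlog, key=lambda t: PRIORITY_ORDER[t.label])
        self.backlog.remove(ticket)
        ticket.status = "in_progress"
        self.sprint.append(ticket)
        return ticket

    def complete(self, ticket):
        """Close a ticket, drop it from the sprint, and pull the next one."""
        ticket.status = "done"
        self.sprint.remove(ticket)
        return self.pull_next()

board = Board(backlog=[
    Ticket("OPS-3", "priority:p2"),
    Ticket("OPS-1", "priority:p0"),
    Ticket("OPS-2", "priority:p1"),
])
first = board.pull_next()
second = board.complete(first)
```

No estimate is attached anywhere. The only inputs are the label the manager maintains and the moment an engineer frees up.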

Engineers file their own tickets when they discover work. A bug found while shipping a feature. A refactor that surfaces during code review. A dependency that needs chasing. The IC who found it writes the ticket. “Someone will write this up later” becomes nobody, and the work disappears from the tracker without disappearing from reality. The team’s tracker has to hold the team’s actual work; if the work is not in the tracker, the work does not exist for planning purposes.

The responder rotation absorbs incoming interruption tickets (Part III) so the sprint is not churned by every Slack message and every cross-team request. The sprint is what the team is building. The responder column is what arrives. The two queues stay separate, and the sprint stays small.

The sprint is not a copy of the backlog

The pressure to grow the sprint is constant. Leadership wants velocity numbers, the team wants to look ambitious, every new priority feels like it should land “in this sprint.” It should not. The sprint is what the team is doing now plus what they will pull next. If the sprint contains tickets nobody has looked at, the discipline has slipped, and the velocity that comes out the other side is fiction.

In the tracker

Jira gives you priority fields, labels, components, ranks, epics, and themes. The working-set sprint needs three things from the tracker: a backlog the manager can prioritize, a way to see the labeled top of the backlog, and a sprint board that shows what the team is doing right now. The rest is decoration.

A workable setup: priority lives on a single field or a single label, picked once and used consistently. A saved filter (JQL or board view) shows the labeled high-priority backlog, and that filter is the team’s entry point when their current ticket closes. The sprint board shows in-progress and next-up tickets only. Standups walk that board ticket by ticket. Estimation columns are optional; if used, they are filled in after the ticket closes, not before.

Use one priority mechanism, not three

Jira lets you mix the priority field, labels, components, and rank order. Pick one. A label like priority:p0 works. So does the built-in priority field. Mixing them means engineers pull from one filter while the manager updates another, and the team works on the wrong tickets while the tracker says everything is fine.

When forward sprints work

A team with a stable, well-scoped backlog and predictable interruptions can run forward sprints with point commitments. Some maintenance teams have this. Some platform teams whose remit has been narrowed and frozen do too. For everyone else, the working-set sprint is the one that matches reality. Part IV’s measurement discipline is how the team finds out which side it is on; the working-set sprint is what the team does with the answer.

What changes

Sprint planning gets short. Velocity stops being a fiction the leadership deck has been running on. With the board reflecting actual work, the rest of the cadence gets honest too. Standups walk the board ticket by ticket and finish in fifteen minutes. Retros talk about what actually happened, not about the gap between estimate and reality. And reviews compare engineers on the same shapes of work, with the data Part IV produced. The tracker stops being a stage for the planning ritual and becomes a tool for getting work done.

How Teams Actually Finish What They Start, Part IV: Point After the Fact

Fri, 03 Apr 2026 00:00:00 +0000

TL;DR

Forward estimation breaks for any team whose week is shaped by work that arrives. The discipline that scales is to point after the fact: when the ticket closes, the person who did the work writes down what it took. Over a quarter the team has real data about where time actually goes.

Standup Monday. A 5-point ticket lands in the engineer’s column. Wednesday the engineer is still on it, two production fires deep, the original scope half-discovered. By Friday it ships. The retro looks at velocity. The 5 stays a 5, the team’s data says the engineer did 5 points of work, and the next sprint’s planning uses that data to size the next ticket of the same shape. The bug compounds.

Better estimates don’t fix it

The obvious move is to estimate better: planning poker, three-point estimates, finer story points, more grooming up front. None of it works on a team whose week is shaped by inbound work. The variance is not in how the engineer reads the ticket. It is in what arrives between Monday and Friday. A ticket scoped honestly Monday gets eaten by an unrelated incident Wednesday. A 5-point ticket stays 5 points until the dependency the engineer didn’t know about turns it into 13. Forward estimation is trying to predict the team’s week, and the team’s week is not predictable.

Point what you did

The fix is mechanical. When the ticket closes, the person who did the work writes down what it took. Over a quarter the team has real data: which categories of work consume the most time, which engineers carry which kinds of load, where the same shape of problem keeps eating a day each time. The cost is near zero (the ticket is closed; the person who did the work is sitting there). Bottlenecks surface fast: a category that always takes three times what its siblings take is a place to invest, and the data makes it visible without anyone having to argue for it.

The rule has to be load-bearing in the workflow, not aspirational. A ticket cannot move to Done without a points value. Without that constraint, the data has gaps, and the gaps are not the random kind.
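As a sketch of what “load-bearing” could mean in code: the transition itself refuses to run without a value. The ticket dictionaries and function names are illustrative assumptions. In a real tracker this would be a workflow validator, not application code.

```python
from collections import defaultdict

class MissingPoints(Exception):
    """Raised when a ticket is closed without an after-the-fact points value."""

def close_ticket(ticket, points=None):
    """Refuse the Done transition unless whoever did the work recorded what it took."""
    if points is None or points <= 0:
        raise MissingPoints(f"{ticket['key']}: record what it took before closing")
    ticket["points"] = points
    ticket["status"] = "done"
    return ticket

def points_by_category(tickets):
    """Sum recorded points per category across closed tickets."""
    totals = defaultdict(int)
    for t in tickets:
        if t.get("status") == "done":
            totals[t["category"]] += t["points"]
    return dict(totals)

tickets = [
    {"key": "OPS-10", "category": "incident"},
    {"key": "OPS-11", "category": "refactor"},
    {"key": "OPS-12", "category": "incident"},
]
close_ticket(tickets[0], points=8)
close_ticket(tickets[1], points=3)
close_ticket(tickets[2], points=13)
totals = points_by_category(tickets)
```

Because the guard sits on the transition, the hard ticket closed in a rush hits the same wall as the easy one.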

Half-pointed data is worse than no data

Without a load-bearing constraint, easy tickets get pointed and hard tickets get closed in a rush. The long-tail work the team most needs to see disappears from the record. The team trusts the partial data anyway and reaches the wrong conclusions about its own capacity.

What the data enables

Operation-heavy teams can drop forward sprint commitments entirely. The manager sets priorities through labels on tickets and epics. The team pulls from the top of the labeled backlog as engineers free up. After-the-fact points accumulate over the quarter and give the team a real baseline: capacity, distribution, ticket-shape patterns. Forecasting becomes a quarter-long view rather than a sprint-long commitment.

The same data benchmarks people. Recurring work converges in shape over a quarter. Responder rotations cluster around the same handful of incident types. Refactors pattern-match to a few common shapes. When ten engineers have closed the same kind of ticket, the spread of points across people becomes visible. Reviews stop being a debate about effort and become a comparison against work everyone has done. The conversation shifts from “I think Alice ships fast” to “Alice closes the same shape of ticket in 6 points where the team median is 8.”

Compare same shapes, not raw velocity

The benchmark only works for like-for-like work. A frontend ticket and a database investigation are not on the same scale. Group tickets by shape (incident type, refactor pattern, investigation category) before comparing engineers, and ignore raw point totals across the team.
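A small sketch of the like-for-like comparison, using invented tickets. The shape names and point values are assumptions chosen to mirror the Alice example above.

```python
from collections import defaultdict
from statistics import median

def spread_by_shape(closed):
    """closed: iterable of (engineer, shape, points) for closed tickets.
    Returns, per shape, the team median and each engineer's own median."""
    by_shape = defaultdict(lambda: defaultdict(list))
    for engineer, shape, points in closed:
        by_shape[shape][engineer].append(points)
    report = {}
    for shape, per_engineer in by_shape.items():
        all_points = [p for pts in per_engineer.values() for p in pts]
        report[shape] = {
            "team_median": median(all_points),
            "engineers": {e: median(pts) for e, pts in per_engineer.items()},
        }
    return report

closed = [
    ("alice", "redis-eviction", 6),
    ("bob",   "redis-eviction", 8),
    ("carol", "redis-eviction", 9),
    ("alice", "redis-eviction", 6),
    ("bob",   "redis-eviction", 8),
    ("alice", "frontend-form", 3),
    ("carol", "frontend-form", 2),
]
report = spread_by_shape(closed)
```

Here `report["redis-eviction"]` shows Alice at 6 against a team median of 8. Her frontend tickets never enter that number, which is the whole point of grouping first.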

When forward estimation works

Forward estimation works when the team’s work is genuinely stable: same shape every week, predictable interruptions, no unscoped depth. Some maintenance teams have this. Most product teams and most operational teams do not. The point of measuring after the fact is to find out which side your team is actually on, and to design the cadence around the answer.

What changes over a quarter

A team that points after the fact has a quarter’s worth of evidence about itself: which work consumes time, who is faster on what, where the bottlenecks live. A team that points before the fact has a quarter’s worth of guesses, refined and re-litigated each sprint. Retros become data-driven instead of opinion-driven. Reviews compare engineers against the same shapes of work. And when the team asks for headcount, the ask is grounded in a category that has been swallowing unbudgeted time, not a hunch.

How Teams Actually Finish What They Start, Part III: A Working Responder Rotation

Tue, 17 Mar 2026 00:00:00 +0000

TL;DR

For teams whose week is operational by nature (SRE, DevOps, platform, database and storage reliability, anyone whose sprint work coexists with a steady stream of alerts and partner-team asks), a weekly responder rotation breaks silos only when the operational rules force the responder to actually do the work, not route it. Five rules carry the load: a fixed resort order (runbooks → docs → LLM → SME), SMEs as advisors rather than handoff targets, tiered routing for alerts and asks so the responder isn’t drowning, improvement tickets that shrink the rotation’s load over time, and tech implementation as the only valid mechanism for easing the role.

A team has a responder rotation. The responder’s name is in the channel topic every Monday. Six months in, the database alert that pages at 2am still goes to Sarah because she wrote the schema in 2024 and nobody else has dug into it. The Kafka partition imbalance at 3pm still goes to Marcus. The Redis eviction issue still goes to whoever has been on the team longest. The responder forwarded all three to the original owners within two minutes of the page. The rotation existed. The silos held. The responder name on the calendar was a routing layer, not a learning one.

That is the failure mode the rest of this post is about. The rotation is necessary but not sufficient. What separates a working rotation from a name on a calendar is a small set of operational rules that force the responder to actually own the work, not just route it.

The reflex is to write down a clearer rule: the responder handles everything, no exceptions, never escalate. That fails on first contact with a real production page. The 2am Kafka issue is genuinely faster to resolve if Marcus picks up. The customer is on the line. The deadline is tomorrow. Saying “no, the responder must learn” costs the company money this week and won’t carry next week either, because the next 2am page also has a real-world reason it should go to the specialist. “No escape” is not the rule. The rule that holds is a different shape.

The five rules

Each rule sits below the prompt the team gives itself, in a layer the team has to defend actively. None is novel on its own. The combination is what makes the rotation produce the silo-breaking it claims.

Resort order: runbooks, then docs, then internal LLM, then SME. Before the responder asks a human, they work through the documented self-serve options in order. Runbooks for known incidents, docs for system understanding, an internal LLM or RAG tool for synthesis across both, and only after those run out, the SME. The point is not to ban the SME. It is to make sure the responder has tried the cheaper rungs first, so that when the SME does get pulled in, the conversation starts from “I read the runbook section on this and tried X, here is what I’m seeing” instead of “what is this and what do I do.”
SMEs advise, they don’t take handoff. When the responder consults the SME, the conversation produces a path forward, not a transfer. The SME explains the issue, suggests the next steps, and goes back to their declared work. The responder owns the ticket through resolution and writes the resolution into the runbook. This is the rule that turns “ask Marcus” into knowledge transfer instead of work transfer.
Alert and ask routing tiered by urgency and source. The responder watches a high-signal channel actively for prod alerts and urgent asks, skims a separate channel for non-prod, and works a Jira queue between fires for non-urgent automated signals (failed nightly backup, config drift, flag mismatch). Asks from other engineers, especially storage-team work like Kafka or Redis, go to a live conversation channel rather than a ticket queue; tickets get filed downstream when the conversation produces work worth tracking, they are not the entry point. The detailed channel structure, severity policy, and alert-tuning discipline are their own subject for a future installment in this series.
Improvement tickets generate themselves. Every recurring incident produces a ticket. A noisy alert that fired three times this quarter gets a tuning ticket. A runbook gap the responder hit gets a doc-update ticket. An LLM that gave a confident wrong answer gets a source-update ticket. The rule works because the role concentrates pain on one person for five days: the responder absorbs what is normally scattered across the team, and the person paged at 2am is the same person who will write the runbook on Wednesday. If the same fire happens twice in a quarter and no improvement ticket exists, the rotation is wallpapering toil instead of reducing it. The improvement queue is also the team’s most honest signal that the rotation is working: if the queue is shrinking and the runbooks are growing, the responder is producing the cross-training the rotation promised.
Reduce responder load only through tech, never through policy carve-outs. As the rotation matures and the team wants to make the responder’s week easier, the only valid mechanism is technical implementation: new automation, better runbooks, alert tuning, self-service tools for partner teams who keep asking the same questions. Carving out categories back to the SME, lowering the bar for what the responder is expected to handle, or routing painful asks somewhere off-stage all shift toil rather than removing it. Tech implementation takes the toil out of the system entirely. Every responder week that produces a real engineering ticket reduces what the next rotation has to do. Policy carve-outs do the opposite, quietly.
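The first rule is the most mechanical of the five, so here is a minimal sketch of it. The four lookup functions are hypothetical placeholders for a runbook search, a docs search, an internal LLM tool, and a conversation with the SME.

```python
RESORT_ORDER = ("runbook", "docs", "llm", "sme")

def resolve(question, sources):
    """Walk the rungs in order. Returns (rung that answered, answer, rungs tried)."""
    tried = []
    for rung in RESORT_ORDER:
        tried.append(rung)
        answer = sources[rung](question)
        if answer is not None:
            return rung, answer, tried
    return None, None, tried

# Placeholder sources (assumptions for the example).
runbooks = {"redis eviction spike": "follow the eviction runbook, section 3"}
sources = {
    "runbook": runbooks.get,
    "docs": lambda q: None,
    "llm": lambda q: None,
    "sme": lambda q: "advice: check consumer lag first, then write the steps into the runbook",
}

known = resolve("redis eviction spike", sources)
novel = resolve("kafka partition imbalance", sources)
```

The `tried` list is what the responder brings to the SME conversation. Per the second rule, what comes back from the last rung is advice, and the ticket stays with the responder.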

The SME who keeps picking up is the failure mode

Rule 2 fails more often than any of the other four. The first time the responder pings the SME, the SME thinks “I’ll just fix this one.” The second time, it is a habit. The fix is on the SME, not the responder: when consulted, the right response is “this belongs in the runbook” or “this should be automated,” never “let me handle it.” Every ask that reaches the SME is evidence of a gap the team should close.

What the five rules buy for the rest of the team is a clean focus week. The non-responder ICs are not in the partner-team channel triaging asks, not fielding DMs about Kafka topics or Redis schema, not on the other end of the 2am page. They are doing the work they declared on Monday, with the morning declaration and 3pm sync from Part II structuring their day, and they are finishing what they started. A morning declaration is a goal that four hours of unscheduled interruption will eventually destroy. The rotation is what protects the declaration long enough for the IC to actually finish it.

The cost of all five is real. Each requires discipline that is easier to skip than to keep. The SME who always picks up will keep picking up unless the team explicitly stops the handoff pattern. The improvement-ticket discipline only works if there is a half-hour each rotation set aside to file them, and rotations under heavy load lose that half-hour first. The tiered routing requires actively maintaining alert filters and channel topology that drift fast. The tech-implementation rule asks the team to file real engineering work after each rotation, which competes with sprint commitments and slips when sprints get tight. None of this is one-time setup. The rules are an active practice the team defends the same way it defends the 3pm sync slot from Part II.

Handoff at the week boundary

The responder’s week ends Friday afternoon. Three categories of things sit in the queue.

Open incidents and paused investigations get explicitly handed off in a written note: what is the state, what is the next action, what was already tried, who has been pulled in. The note is short. Half a page is plenty. Bullet form is fine. The rule is not that everything is exhaustively documented; it is that nothing is silently dropped.
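If it helps to make the note’s four questions hard to skip, a sketch like this could back a template. The field and function names are assumptions. The rule being enforced is only that state and next action are never blank.

```python
from dataclasses import dataclass, field

@dataclass
class HandoffItem:
    title: str
    state: str                                          # what is the state
    next_action: str                                    # what is the next action
    already_tried: list = field(default_factory=list)   # what was already tried
    pulled_in: list = field(default_factory=list)       # who has been pulled in

def blank_fields(item):
    """Names of required fields left empty."""
    return [name for name in ("state", "next_action") if not getattr(item, name).strip()]

def render_note(items):
    """Render a short bullet-form note; refuse to render an item with blanks."""
    lines = []
    for item in items:
        gaps = blank_fields(item)
        if gaps:
            raise ValueError(f"{item.title}: missing {', '.join(gaps)}")
        lines.append(f"- {item.title}")
        lines.append(f"  state: {item.state}")
        lines.append(f"  next: {item.next_action}")
        lines.append(f"  tried: {'; '.join(item.already_tried) or 'nothing yet'}")
        lines.append(f"  pulled in: {', '.join(item.pulled_in) or 'nobody'}")
    return "\n".join(lines)

note = render_note([
    HandoffItem(
        title="Redis eviction spike",
        state="paused, eviction rate back to baseline",
        next_action="recheck eviction rate Monday morning",
        already_tried=["restarted the cache warmer"],
        pulled_in=["Marcus"],
    )
])
```

Five lines per open item keeps the note within the half page the rule asks for.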

Closed work gets closed, with whatever runbook update or follow-up ticket the resolution produced. Improvement work the responder identified but didn’t get to becomes a backlog ticket, scheduled into someone’s planned-work queue rather than left to the next responder to either pick up or ignore.

The next responder picks up Monday morning with full context on what is actually inbound. No ramp-up day spent figuring out what the previous week was working on. No ticket that quietly got dropped at the rotation boundary because nobody owned it across the weekend.

What the responder does not handle

A working rotation is honest about specialization. Some categories of work genuinely require the specialist, and naming them upfront prevents the format from feeling like a fiction.

Security incidents. Corner-case data-recovery operations. Load-bearing decisions that require historical context the rotation can’t reasonably build (the schema choice from 2022 that has shaped every query since). The responder still owns the ticket and stays in the loop, but the actual work happens with the specialist driving and the responder learning. Over a quarter, the responder may move into the specialist column for some of these. Some they won’t, and that’s fine.

Carve-outs need a budget

Three named carve-outs is honest about specialization. Thirty is theater wearing a costume. Keep the list short and explicit so it cannot expand quietly over a quarter into “the rotation only handles the easy stuff.”

When this doesn’t apply

The structure earns less than its cost when interrupt volume is too low. If the team gets pinged twice a day and most of those are easily handled, the rotation is a structure without a job. Volume is the test, not team size or specialization. A three-person team with heavy operational load still benefits from concentrating interruptions on one person while the other two ship sprint work; a ten-person team with light load doesn’t. A team of four MySQL DBAs benefits more from rotation, not less, because everyone can handle everything; uniform specialization makes the rotation easier, not harder. The format earns its cost when interrupt volume is high enough that focused weeks are otherwise impossible.

The bigger picture

A working responder rotation has visible evidence: improvement tickets are filed every rotation, the rotation’s interrupt volume is trending down quarter over quarter, the runbook count and quality are going up, and at handoff Friday the next responder receives a written note rather than a tribal-knowledge briefing. None of that is rocket science. All of it requires the team to defend the rules every week against the easier path of letting the SME pick up.
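One piece of that evidence can be checked mechanically. The sketch below flags incident types that recurred within a quarter with no improvement ticket filed against them. The incident-type strings are invented, and the threshold of two follows the “same fire twice in a quarter” test from the fourth rule.

```python
from collections import Counter

def unaddressed_repeats(incidents, improvement_tickets, threshold=2):
    """Incident types seen at least `threshold` times with no improvement ticket.

    incidents: one incident-type string per incident this quarter
    improvement_tickets: incident types that already have a ticket filed
    """
    counts = Counter(incidents)
    covered = set(improvement_tickets)
    return sorted(kind for kind, n in counts.items()
                  if n >= threshold and kind not in covered)

quarter = [
    "redis-eviction", "kafka-imbalance", "redis-eviction",
    "backup-failed", "kafka-imbalance", "kafka-imbalance",
]
gaps = unaddressed_repeats(quarter, improvement_tickets=["redis-eviction"])
```

An empty result does not prove the rotation is healthy, but a non-empty one names exactly where toil is being absorbed instead of removed.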

The teams that abandon the rotation usually didn’t abandon it because the rotation was wrong. They kept the calendar entry and dropped the rules. The SMEs kept picking up, the improvement tickets stopped getting filed, and the rotation became a name in a channel topic that nobody was treating as load-bearing.

Designing Partitioning You Don't Have to Babysit

Fri, 06 Mar 2026 00:00:00 +0000

TL;DR

Partition by the primary key, not by created_at, and let a background service manage boundaries based on observed growth. Queries keep using the keys they already have, partition pruning works automatically, and the partition column never leaks into application code. The same “service watches and adjusts” pattern applies to hash and list partitioning with different operations.

The orders dashboard started loading slowly the week after the partitioning deploy, and the team’s first instinct is to blame the new index strategy. The actual culprit shows up in EXPLAIN: thirty-six lines of Partitions: orders_p2025_01, orders_p2025_02, ... on a query that’s just SELECT * FROM orders WHERE id = 12345. The plan reads every partition because the WHERE clause doesn’t include created_at, and created_at is the partition key. The lookup that used to be one index probe is now thirty-six.

The proposed fix is the one that always gets proposed: add created_at >= '2024-11-01' AND created_at < '2024-12-01' to the dashboard query. It works. The plan drops to one partition, because the range is bounded on both sides; a lower bound alone would only skip the older partitions. Then the audit page does the same thing, then the admin tool, then the migration script. Three months later there’s an internal lint rule that flags any SELECT FROM orders without a date filter, and code reviews include “did you add the partition filter?” as a standard check. The partition key has stopped being a storage decision and become a contract every query has to honor. Forgetting still produces no error. Just slowness.

The partition key problem

Both PostgreSQL and MySQL require the partition key to be part of any primary key or unique constraint on the table. That rule exists for correctness: if the primary key didn’t include the partition key, the database couldn’t enforce uniqueness without scanning every partition.

The consequence is that if you want to partition by created_at, you can’t just have PRIMARY KEY (id) anymore. You need PRIMARY KEY (id, created_at). The date column is now part of the primary key whether your application needed it to be or not.

The more subtle cost is that id is no longer unique in the eyes of the database. Uniqueness is enforced on the tuple (id, created_at): the database will cheerfully accept two rows with the same id as long as they have different timestamps. The application probably still treats id as unique, but nothing in the schema guarantees it. And you can’t recover the guarantee with a separate UNIQUE (id) constraint: both MySQL and PostgreSQL require every unique constraint on a partitioned table to include the partition key columns. The uniqueness property has effectively been traded away.

This isn’t purely cosmetic; it changes the query plans the optimizer is willing to generate:

With PRIMARY KEY (id), WHERE id = 1 is a constant-time lookup. MySQL’s EXPLAIN shows this as the const access type; the optimizer knows exactly one row matches and the executor stops after finding it. Joins on id are eq_ref, the fastest join access type.
With PRIMARY KEY (id, created_at), the same query becomes a ref lookup: a prefix scan on the leftmost index column that could, as far as the database is concerned, return multiple rows. Joins that used to be eq_ref become ref. Cardinality estimates fall back to index statistics instead of the guaranteed “one row” assumption, which can push the optimizer toward worse plans further up the query tree.

To get the old const plan back, every lookup has to spell out the full primary key:

-- Was a const lookup, now a ref lookup (one of potentially many rows)
SELECT * FROM orders WHERE id = 1;

-- Back to const, but only if the caller knows the created_at
SELECT * FROM orders WHERE id = 1 AND created_at = '2026-04-01 12:34:56';

That’s the same leakage as partition pruning, from a different angle: the partition key has forced its way into queries that had nothing to do with dates, first to get pruning and now to get single-row access.

-- Before partitioning
CREATE TABLE orders (
 id BIGINT AUTO_INCREMENT PRIMARY KEY,
 customer_id BIGINT NOT NULL,
 total_cents INT NOT NULL,
 created_at DATETIME NOT NULL
);

-- After partitioning by month
CREATE TABLE orders (
 id BIGINT AUTO_INCREMENT,
 customer_id BIGINT NOT NULL,
 total_cents INT NOT NULL,
 created_at DATETIME NOT NULL,
 PRIMARY KEY (id, created_at) -- created_at forced into the PK
)
PARTITION BY RANGE (TO_DAYS(created_at)) (
 PARTITION p202601 VALUES LESS THAN (TO_DAYS('2026-02-01')),
 PARTITION p202602 VALUES LESS THAN (TO_DAYS('2026-03-01')),
 ...
);

At this point everything still works. The table accepts inserts, queries return correct results, and the partition boundaries exist. The problem shows up the first time someone runs a query that doesn’t include created_at in the WHERE clause.

Partition pruning only works if you ask for it

Partition pruning is the optimization that makes partitioning worth doing. When a query’s WHERE clause restricts the partition key, the database can skip partitions that can’t possibly match. A query for last week’s orders only reads the one or two partitions that contain last week’s data.

That optimization depends on the partition key appearing in the WHERE clause. A query that filters on anything else doesn’t get pruned; it scans every partition.

-- This query scans every partition. There are 36 of them.
SELECT * FROM orders WHERE id = 12345;

-- This one prunes to a single partition
SELECT * FROM orders WHERE id = 12345 AND created_at >= '2026-03-01' AND created_at < '2026-04-01';

The first query is the kind of lookup that happens constantly: fetch an order by its primary key. On a non-partitioned table, it’s a single index seek. On a partitioned table where the pruning key isn’t in the WHERE clause, it’s a separate index probe against every partition: 36 index lookups instead of one. Still fast in absolute terms, but much worse than the non-partitioned version, which is the opposite of why partitioning was introduced.

The “fix” teams usually land on is to add the partition key to every query that touches the table. That’s a leaky abstraction. A storage decision is now a contract with every caller: new code has to remember the partition filter, old code has to be audited, the ORM has to be configured around it.

No error, just slowness

A query that should prune but doesn’t still returns correct results. The plan just scans every partition. No exception, no warning, no flag in the application logs, only an EXPLAIN that nobody reads until a dashboard times out. Most teams discover the failure by reviewing slow-query logs after a partition deploy, not from anything the database surfaces during query execution.

Static partition boundaries don’t age well

The other thing that tends to go wrong is hardcoding partition boundaries at table creation time. The initial layout reflects whatever the team’s growth projection looked like at that moment. Six months later the traffic pattern has changed, some partitions are 10x larger than others, and the p_future catch-all partition is holding half the table.

-- Defined at creation: looks reasonable
PARTITION p2026_q1 VALUES LESS THAN (100000000),
PARTITION p2026_q2 VALUES LESS THAN (200000000),
...

-- Six months later: growth accelerated, p_future is now the entire active workload
PARTITION p_future VALUES LESS THAN MAXVALUE -- 800M rows and growing

Manually splitting and rebalancing partitions is operational work nobody wants to own. It requires scheduling maintenance windows, running ALTER TABLE ... REORGANIZE PARTITION against tables that might be hundreds of gigabytes, coordinating with application teams, and not making a mistake. It tends not to happen until there’s a performance incident, and at that point the fix is expensive.

The shape of the better approach

The primary key already exists. For tables using BIGINT AUTO_INCREMENT, it’s monotonically increasing: newer rows have larger IDs. That’s the property range partitioning needs. The primary key is the partition key.

CREATE TABLE orders (
 id BIGINT AUTO_INCREMENT PRIMARY KEY,
 customer_id BIGINT NOT NULL,
 total_cents INT NOT NULL,
 created_at DATETIME NOT NULL
)
PARTITION BY RANGE (id) (
 PARTITION p0001 VALUES LESS THAN (100000000),
 PARTITION p0002 VALUES LESS THAN (200000000),
 PARTITION p0003 VALUES LESS THAN (300000000),
 PARTITION p_future VALUES LESS THAN MAXVALUE
);

Every query that filters by id (which is most of them) gets partition pruning for free, with no changes to application code. Range queries by ID prune across a small number of partitions. Point lookups prune to exactly one. The primary key is already in every WHERE clause that matters, because it’s the primary key.
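The pruning behavior described here can be modeled in a few lines. The sketch below is not how the database implements pruning; it only mirrors the VALUES LESS THAN semantics of the DDL above, and the function names are made up for illustration.

```python
from bisect import bisect_right

# Exclusive upper bounds of the bounded partitions in the DDL above.
# p_future (MAXVALUE) is the implicit last partition.
BOUNDS = [100_000_000, 200_000_000, 300_000_000]
NAMES = ["p0001", "p0002", "p0003", "p_future"]


def partition_for(row_id: int) -> str:
    """Return the one partition a point lookup on id has to read."""
    # VALUES LESS THAN (b) is exclusive, so id == b falls into the next partition.
    return NAMES[bisect_right(BOUNDS, row_id)]


def partitions_for_range(lo: int, hi: int) -> list[str]:
    """Return the partitions touched by WHERE id BETWEEN lo AND hi (inclusive)."""
    first = bisect_right(BOUNDS, lo)
    last = bisect_right(BOUNDS, hi)
    return NAMES[first:last + 1]


if __name__ == "__main__":
    print(partition_for(12345))
    print(partitions_for_range(150_000_000, 250_000_000))
```

A point lookup always resolves to one name, and a range resolves to a contiguous slice, which is why neither needs a date filter to prune.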

The trade-off is that partition boundaries aren’t directly defined by time anymore, which looks like it breaks time-based retention. In practice this is less of a trade-off than it looks; the point of partitioning often isn’t retention but keeping index sizes manageable, making maintenance operations cheap, and bounding the blast radius of a bad query. When retention is a goal, boundaries can still be chosen to align with time. They just get picked at DDL time by the partitioner service, rather than baked into the schema. See Time-aligned boundaries without a date in the key.

Automating range partition management

Everything up to this point assumes range partitioning: partitions defined by continuous boundaries on an ordered value (an ID range, a date range). The operational work is mechanical: watch the active partition fill up, split the MAXVALUE catch-all into a new bounded partition before that happens, and drop partitions that have fallen past the retention threshold. A small service running on a schedule is enough to keep the layout healthy. The hard part isn’t the logic, it’s doing it safely: running DDL against a large table without locking out writes, handling partial failures, and recovering cleanly if the service crashes mid-operation.

-- Split the catch-all partition into a new bounded partition + new catch-all
-- This is the operation the service runs periodically
ALTER TABLE orders REORGANIZE PARTITION p_future INTO (
 PARTITION p0037 VALUES LESS THAN (3700000000),
 PARTITION p_future VALUES LESS THAN MAXVALUE
);

REORGANIZE PARTITION on an empty catch-all partition is fast; there’s nothing to move. If you split the catch-all while it is still empty, before any rows have landed in it, the operation is effectively metadata-only. The service’s job is to stay ahead of the write workload: split the catch-all when it’s still small or empty, not when it’s already holding hundreds of millions of rows.
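One way to express “stay ahead of the write workload” is as a headroom check: estimate how many days of IDs remain below the newest bounded partition’s upper bound, and emit the split once that drops under a threshold. This is a minimal sketch under assumptions: the Layout fields, the seven-day threshold, and the fixed step are hypothetical, and the growth rate is assumed to be measured elsewhere.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Layout:
    table: str
    last_bound: int      # exclusive upper bound of the newest bounded partition
    max_id: int          # current MAX(id) in the table
    ids_per_day: float   # observed growth rate (assumed to be measured elsewhere)


def days_until_catch_all(layout: Layout) -> float:
    """Days of headroom before new rows start landing in the MAXVALUE partition."""
    headroom = layout.last_bound - layout.max_id - 1  # ids still free below the bound
    return max(headroom, 0) / layout.ids_per_day


def split_statement(layout: Layout, step: int, name: str,
                    min_days: float = 7.0) -> Optional[str]:
    """Return the REORGANIZE DDL when headroom is at or below min_days, else None."""
    if days_until_catch_all(layout) > min_days:
        return None
    new_bound = layout.last_bound + step
    return (
        f"ALTER TABLE {layout.table} REORGANIZE PARTITION p_future INTO ("
        f"PARTITION {name} VALUES LESS THAN ({new_bound}), "
        f"PARTITION p_future VALUES LESS THAN MAXVALUE)"
    )


if __name__ == "__main__":
    layout = Layout("orders", 3_600_000_000, 3_599_000_000, 500_000.0)
    print(split_statement(layout, 100_000_000, "p0037"))
```

If max_id has already passed last_bound, the headroom is zero and the statement is still produced, but by then the catch-all holds rows and the split is no longer cheap; that is the case the alerting should catch first.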

What makes the service tricky in production

The logic is a couple of DDL statements. The hard parts are everything around them: not locking out writes during the REORGANIZE, surviving a service crash mid-DDL (idempotency on retry), handling concurrent migration tooling that’s also taking an exclusive lock on the table (an exclusive metadata lock in MySQL, ACCESS EXCLUSIVE in PostgreSQL), and having a clear runbook for “the service has stalled and the catch-all is now 200M rows, what do we do.” Production-grade partitioners usually spend more code on the operations bracket than on the DDL itself.

There’s no single right target for partition size; it depends on what’s driving the partitioning in the first place. If the goal is keeping the OLTP working set small via retention, the boundary spacing is a business decision: how long does the data need to stay queryable in the hot store, one year, seven years, somewhere in between. If the goal is performance, sizing each partition so its indexes fit comfortably in memory is a reasonable rule of thumb, provided there’s no significant key skew concentrating reads or writes on a single partition. The service can be configured against either target and adjust boundary spacing based on observed growth.

Time-aligned boundaries without a date in the key

Partitioning by id doesn’t mean giving up time-based boundaries; it just means choosing them after the fact. The service can run a single query against the live table to find the ID boundary for any point in time:

-- Where was the ID pointer at the start of March?
SELECT MAX(id) FROM orders WHERE created_at < '2026-03-01';
-- -> 3700842139

That value becomes the upper bound for the next bounded partition. The catch-all stays above it, and future partitions get cut at time-aligned ID boundaries:

-- Partition is still defined by ID range, but chosen to align with a month boundary
-- (every row created before March 1 sits below this bound)
ALTER TABLE orders REORGANIZE PARTITION p_future INTO (
 PARTITION p2026_02 VALUES LESS THAN (3700842140),
 PARTITION p_future VALUES LESS THAN MAXVALUE
);

The resulting partition p2026_02 contains roughly all orders from February 2026 (everything above the previous boundary that was created before March 1), but created_at never appears in the primary key, never needs to be in any WHERE clause to get pruning, and never leaks into application code. The date column is used once, at boundary-creation time, by the service running the DDL. Queries continue to filter by id and get pruning for free.

Retention works the same way. To drop data older than twelve months, the service runs SELECT MAX(id) FROM orders WHERE created_at < NOW() - INTERVAL 12 MONTH, identifies every partition with an upper bound below that ID, and drops them. The MAXVALUE catch-all is what makes this pattern work; there’s always a place for new rows to land while the service is deciding where to cut next.
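The selection step can be sketched as a pure function over the partition catalog. The names below are hypothetical; bounds is assumed to hold only the bounded partitions, and cutoff_id is assumed to be the result of the retention lookup described above.

```python
from typing import Optional


def droppable_partitions(bounds: dict[str, int], cutoff_id: int) -> list[str]:
    """Bounded partitions whose upper bound sits below cutoff_id, oldest first.

    bounds maps partition name -> exclusive upper bound (VALUES LESS THAN).
    The MAXVALUE catch-all is never listed in bounds, so it is never dropped.
    """
    ordered = sorted(bounds.items(), key=lambda item: item[1])
    return [name for name, bound in ordered if bound < cutoff_id]


def drop_statement(table: str, names: list[str]) -> Optional[str]:
    """Build one DROP PARTITION statement, or None when nothing has expired."""
    if not names:
        return None
    return f"ALTER TABLE {table} DROP PARTITION {', '.join(names)}"


if __name__ == "__main__":
    layout = {"p0001": 100_000_000, "p0002": 200_000_000, "p0003": 300_000_000}
    expired = droppable_partitions(layout, cutoff_id=250_000_000)
    print(drop_statement("orders", expired))
```

Comparing strictly below the cutoff follows the rule as stated and errs on the side of keeping a partition for one more tick rather than dropping rows that are still inside the window.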

What the service looks like

The service itself is small. It runs on a schedule (hourly for high-throughput tables, daily for slower-moving ones) and on each tick it does a handful of things:

Inventory. Read the current partition layout from the catalog: partition names, upper bounds, and approximate row counts.
Sizing check. Look at the active partition (the bounded one just below the catch-all). If it’s filled past a configured threshold of the target size, it’s time to cut the next boundary.
Boundary selection. Pick where to cut. For time-aligned partitions, query the live table for the highest ID created before the month boundary that just passed; that ID plus one becomes the upper bound of the new partition.
Split. Reorganize the catch-all into a new bounded partition plus a fresh catch-all above it. As long as the catch-all is still empty when the split runs, this is metadata-only.
Retention pruning. Translate the retention window (e.g. twelve months) into an ID via the same created_at-to-id lookup, then drop any partition whose upper bound sits below that ID.
Concurrency guard. A single advisory lock or leader election so two instances don’t run DDL against the same table simultaneously.
Metrics and alerting. Per-partition size and row count, time-since-last-tick, and a clear alert if the active partition starts filling faster than the service is splitting ahead of it.

Run it as a cron job, a Kubernetes CronJob, or a tiny always-on worker; the operational footprint is intentionally small. The bulk of the production-readiness work goes into the concurrency guard, naming conventions, and handling of split failures from lock contention or concurrent DDL. None of that changes what the service does on each tick: a few catalog reads and one DDL statement.

Existing tools and how this pattern differs

The automation itself isn’t new; several tools already manage partition boundaries on a schedule. What differs across approaches is which column the partitioning is done on, and how much of the schema contract that choice locks in.

pg_partman is the widely used partition manager in the PostgreSQL ecosystem. It pre-creates future partitions on a schedule, drops old ones against a retention window, and can migrate non-partitioned tables into partitioned ones in place. Its defaults (and most tutorials written on top of it) assume time-based range partitioning on a timestamp column. That’s the pattern earlier sections argue against: the date column ends up in the primary key and leaks into every query that expects pruning.

TimescaleDB goes further. Its “hypertables” are automatically time-partitioned PostgreSQL tables, with chunk creation, retention, and compression all managed by the extension. It’s the right tool for workloads where every query is genuinely time-scoped: observability, IoT telemetry, append-only event streams. It’s a worse fit for OLTP tables where some queries are time-scoped and others aren’t, because the time column is mandatory and every non-time query pays the same partition-key-in-the-WHERE-clause tax as manual time partitioning.

Vitess includes partition management as part of its broader MySQL sharding solution. Its partitioning conventions are flexible, but most production uses land on the same time-based defaults.

The common thread across all three: the automation layer assumes the partition key is picked in advance (usually a time column) and manages boundaries on top of that assumption.

The approach in this post keeps the automation pattern (small service, pre-split the catch-all, drop behind retention) but changes the key choice. The partition key is the primary key, and time alignment is derived at DDL time via the SELECT MAX(id) WHERE created_at < X lookup. The schema-level contract stays PRIMARY KEY (id); time-based retention still works, computed once per boundary instead of baked into every query.

The trade-off is owning the service. pg_partman is a well-tested extension; a DIY partitioner is real operational surface area: advisory locks, failure recovery, metrics, alerts. The useful question is whether created_at is already a natural filter in every query that matters. If every query is time-scoped by design (observability, telemetry, audit logs, anything time-series) pg_partman or TimescaleDB against a time column is the right answer. The partition key isn’t leaking because it was already there.

If the workload is mixed (some queries filter by date, most don’t) then adding created_at to the PK forces a choice: retrofit every non-time query with a date filter, or eat full partition scans on point lookups. Either option is the overhead, whether it’s already visible as slow queries or just spreading through the codebase as AND created_at >= ? clauses added “for partitioning reasons” on queries that have nothing to do with dates. At that point, owning a small DDL service is cheaper than propagating a partition key through every caller forever.

Automating hash and list partitioning

The same “service watches and adjusts” idea transfers to other partitioning strategies, but the operations change.

Hash partitioning distributes rows across a fixed number of partitions via a hash function. With good cardinality, every partition grows at roughly the same rate; there’s nothing to split based on growth. What there is to monitor is skew: a low-cardinality column or a hot key produces one partition that grows faster than the others, which is the failure mode partitioning was supposed to prevent.

Automation here isn’t about adjusting boundaries. Changing the hash partition count rebuilds the entire table, which isn’t something a background service should trigger. The useful work is detection: track per-partition size and growth, surface skew early, run OPTIMIZE or VACUUM FULL on partitions as they bloat. The service flags problems; a human decides whether to reshape the table. The consequence is that hash partition count is a decision to get right the first time. Over-provisioning (64 partitions when 16 would do today) is cheap insurance against a later full rebuild.
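The detection half can be very small. A minimal sketch, assuming per-partition row counts have already been read from the catalog; the function name and the 1.5x tolerance are illustrative, not a recommended threshold.

```python
def skewed_partitions(rows: dict[str, int], tolerance: float = 1.5) -> list[str]:
    """Names of partitions holding more than tolerance x the mean row count."""
    if not rows:
        return []
    mean = sum(rows.values()) / len(rows)
    return sorted(name for name, count in rows.items() if count > tolerance * mean)


if __name__ == "__main__":
    counts = {"p0": 1_000, "p1": 1_100, "p2": 900, "p3": 5_000}
    print(skewed_partitions(counts))
```

The function only reports; consistent with the division of labor above, deciding whether to reshape the table stays with a human.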

List partitioning maps enumerated values to specific partitions: one partition per region, per significant tenant, etc., with the long tail in a DEFAULT catch-all. The automation problem is partition promotion: when a value in the catch-all grows large enough to deserve its own partition.

-- Starting state: named partitions for known-large tenants, DEFAULT for the rest
-- (DEFAULT list partitions are MariaDB syntax; stock MySQL has no catch-all for LIST)
CREATE TABLE events (
 id BIGINT AUTO_INCREMENT,
 tenant_id BIGINT NOT NULL,
 payload JSON NOT NULL,
 PRIMARY KEY (id, tenant_id)
)
PARTITION BY LIST (tenant_id) (
 PARTITION p_tenant_42 VALUES IN (42),
 PARTITION p_tenant_73 VALUES IN (73),
 PARTITION p_default DEFAULT
);

-- The service notices tenant_id = 108 is now 15% of p_default and growing quickly.
-- It promotes that tenant into its own partition.
ALTER TABLE events REORGANIZE PARTITION p_default INTO (
 PARTITION p_tenant_108 VALUES IN (108),
 PARTITION p_default DEFAULT
);

Promotion is more expensive than splitting an empty range catch-all (the rows for that tenant have to physically move out of DEFAULT into the new partition) but it can be batched and scheduled during low-traffic windows. Dormant values can be merged back into DEFAULT in the reverse direction to keep the partition count bounded.
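Deciding which tenants to promote is a share-of-catch-all calculation. The sketch below assumes the service has already counted rows per tenant_id inside the DEFAULT partition; the function name is hypothetical and the 15% threshold simply mirrors the example above.

```python
def promotion_candidates(default_counts: dict[int, int], share: float = 0.15) -> list[int]:
    """tenant_ids whose rows make up at least `share` of the DEFAULT partition."""
    total = sum(default_counts.values())
    if total == 0:
        return []
    return sorted(t for t, n in default_counts.items() if n / total >= share)


if __name__ == "__main__":
    # Seventy small tenants plus one that has outgrown the long tail.
    counts = {tenant: 10 for tenant in range(1, 71)}
    counts[108] = 300
    print(promotion_candidates(counts))
```

A real service would probably also look at growth rate, since the example promotes a tenant that is both large and “growing quickly”; a static share is the simpler half of that test.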

What actually belongs in the partition key

The question worth asking before adopting any partitioning scheme: what column is already in every query that matters? In most OLTP systems the answer is the primary key. It sits in every lookup, every join, every foreign-key fetch, so partitioning by it gets pruning for free. The other answers are real but narrower. tenant_id works in a multi-tenant system if every query is tenant-scoped; a date column works if the workload is time-series and every query already filters by date. When those conditions don’t hold, the partition key leaks into application code the first time someone writes a query without it.

The failure mode is partitioning by a column that isn’t already in every query, then retrofitting the query layer to add it. That’s the path that ends with AND created_at >= ? stapled onto queries that have nothing to do with dates, just to avoid a 36-partition scan.

How Teams Actually Finish What They Start, Part II: A Two-Stage Standup

Sat, 28 Feb 2026 00:00:00 +0000

TL;DR

Teams ship more when each IC declares one focused goal in writing every morning and the team holds a brief sync at roughly 3pm to surface what isn’t going to plan and offer help on it. The morning declaration creates focus and commitment. The mid-day sync catches problems while there is still a work day left to fix them, with help that’s actionable now rather than tomorrow.

It is 3pm on a Tuesday. The team meets for fifteen minutes. The first engineer reads back this morning’s declaration: green CI on the orders-export flake by 1pm. What actually happened: CI is green, but the snapshot endpoint is slow enough that the test takes eight minutes, which probably isn’t tolerable. Two minutes of conversation. Someone else hit a similar slow-endpoint problem last quarter and links the PR. A third engineer suggests a smaller fix that gets the test under three minutes. The first engineer goes back to their desk with a concrete next step and three hours to finish.

A 9am standup could not have produced that conversation. At 9am the engineer didn’t know the snapshot endpoint was slow; they were committing to green CI. By the time they hit the slow-endpoint problem at 12:30, the morning meeting was three and a half hours behind them, and the teammate with the fix had context-switched to something else. By 5pm, the conversation that could have unblocked the engineer is happening with no work day left to act on it. 3pm is the sweet spot, and the morning declaration is what the 3pm session is anchored to.

The reflex is to fix the morning meeting instead. Run it tighter. Ask better questions. Replace it with Geekbot or an async post in Slack. Each produces a marginal improvement without changing what the format can do, because the morning is the wrong moment for solving problems. At 9am, today’s blockers haven’t surfaced yet. The conversations that could resolve them are happening eight hours before the conversations are useful.

Two artifacts, one cadence

The morning declaration goes into the team channel before the IC starts work, typically by 10:00 local time. The post answers three questions: what I’m working on today, what outcome I’m aiming for, what could block me. Two to four sentences. A single focused goal, not a list of three. A statement of intent that the IC, the lead, and the rest of the team can all read.

A real example:

Working on the orders-export flake (#4127). Aim: green CI by 1pm. Blocker if it turns out to be the upstream snapshot endpoint, in which case I’ll switch to the dashboard cleanup ticket.

Compare to “today I’ll keep working on bug #4127, no blockers.” The first names a target and a failure path. The second is filler. The act of writing the first one forces the IC to commit to something concrete and to audit whether it fits in a day. The second is something an IC can recite half-asleep.

The mid-day sync meets around 3pm, fifteen to thirty minutes. Each IC’s contribution starts with status: what I declared this morning, what actually happened so far, where I am right now. That part is unavoidable and useful. The rest of the team needs the read. The differentiator is the second beat: what isn’t going to plan, what could help. Each contribution is anchored to the morning post, which means the team works through real, named issues in real time once the status piece is on the table.

3pm is the load-bearing detail. A meeting at 5pm reports problems that already cost the team a day. A meeting at 11am hasn’t seen most of today’s blockers yet. 3pm is when today’s reality is mostly visible and today’s work day still has runway. A blocker named at 3pm is one a teammate can suggest a fix for at 3:05 and the IC can act on at 3:10. The same blocker named at 5pm is tomorrow’s problem.

The two artifacts are paired. The morning declaration is what the 3pm sync is anchored to. Without the morning, the 3pm meeting is a synchronous standup with all of standup’s problems. Without the 3pm sync, the morning is a pile of writing the team never discusses, and the productivity benefit evaporates the same way classic standup’s information does.

Why this raises output

Several mechanisms compound.

One focused goal per IC, declared in writing, beats a list of three. Teams that try to ship four things a day per IC finish fewer than teams that pick one, and Sophie Leroy’s 2009 paper on “attention residue” (Organizational Behavior and Human Decision Processes, vol. 109) names the mechanism: cognitive load from an unfinished task A persists into task B, even when nothing in the environment is reminding the worker of A. Declaring one thing in the morning forces the IC to name the priority before the noise starts. The declaration is also a forcing function for honest scoping: an engineer writing “today I’ll finish the export refactor and start the migration” notices, in the act of writing, that two days of work won’t fit in one. Speaking has no such friction. A standup at 9am hears “I’ll work on the export and start the migration” and nobody, including the engineer, has audited the claim.

Writing is binding, and the binding holds because the writing sticks. The commitment-and-consistency principle in social psychology (Cialdini’s Influence, 1984, building on Deutsch and Gerard’s 1955 work on normative social influence) is the canonical statement: written commitments produce more consistent follow-through than spoken or merely-considered ones. A goal stated in writing, in a public channel, with a name attached, is a stronger commitment than the same goal mumbled aloud at 9:02. By 10:00, most of what was said at 9:00 is gone, not only from the listeners but from the speaker, who at 4pm cannot reliably recall what they committed to. The morning post is still readable at 4pm. Anyone who joined late, anyone in a different time zone, anyone in a partner team that needs to know what your team is touching today, can read the channel.

The live 3pm sync produces saves the rest of the format cannot. Someone’s blocker is someone else’s two-minute fix; the live, bounded conversation surfaces the fix while the IC still has runway to apply it. Async help in Slack threads doesn’t carry the same load: nobody is required to read at any particular time, and a fix offered at 4:45pm with the IC heads-down is functionally a fix offered tomorrow. The sync works because everyone is present, the conversation is short, and the help is actionable now.

The sync also builds a different kind of team feeling than ad-hoc help does. Going through real issues together in a session everyone planned to attend produces hints, half-remembered prior fixes, and “oh, I had this last week” moments that don’t carry the social cost of random interruption. A teammate pinged at 2:34pm pays a context-switch cost the asker doesn’t see; Gloria Mark’s research at UC Irvine on interrupted work has consistently measured refocus times around twenty to twenty-five minutes after a single interruption. That cost disappears when the conversation is the format. Over a quarter, ICs hear each other think out loud and the team starts to know each other as collaborators rather than as Slack avatars.

Achievement compounds morale. Declaring something and getting it done is its own reinforcement. Teresa Amabile and Steven Kramer’s analysis of nearly 12,000 daily diary entries from 238 knowledge workers (The Progress Principle, Harvard Business Review Press, 2011) found that perceiving daily progress on meaningful work was the single biggest driver of inner work life and motivation, larger than recognition or compensation. A team where ICs make daily declarations and meet most of them ends each week with a visible record of completed commitments. That isn’t a soft benefit. Over a quarter, it’s the difference between a team that perceives itself as producing and one that doesn’t, and the perception drives the next quarter’s output. Teams that don’t see their own progress slow down regardless of the underlying work; teams that do speed up.

Inclusivity falls out of the format. Written-first formats produce more balanced contribution than live round-robins. Quiet team members get equal weight when their declaration is in the channel, and the 3pm session anchors back to what they wrote rather than rewarding whoever talks first. Over a year, this is the difference between knowing what your introverts are doing and not.

Where this fits among existing patterns

Most components are not new.

Async written standups are widely practiced. Geekbot, Standuply, Range, and Polly are tools built specifically for the morning-write half. Basecamp’s Check-ins does the same thing. GitLab’s handbook documents an async-first variant publicly, as do Doist (Twist), Automattic, and parts of Basecamp’s own engineering. Anyone who has worked at a remote-first org in the last five years has probably written one of these.

What is less common is the deliberate pairing with a same-day live sync, and the timing of that sync at 3pm rather than at end-of-day. Most teams that move to async written standups skip the live half entirely. Most teams that keep a live standup don’t bother with the written morning. End-of-day retros exist in some agile shops but tend to surface problems too late to act on them today. The 3pm window is the productivity unlock, and it isn’t formalized in any of the named tools.

The other less-common piece is using the written trail as a coaching primitive. Some tools surface trends, but most teams treat written standups as status updates rather than as longitudinal evidence about how an IC scopes their work. The same artifact that makes the morning bind makes it readable as a coaching signal across days and weeks.

The article isn’t claiming async standups are new. The combination (one-goal morning declaration + 3pm team sync + multi-day longitudinal use) is the angle worth picking apart.

How team leads use the trail

A week of declarations, read together, is a different artifact from any single one. A lead reading Monday-through-Friday for one IC sees the rhythm of declarations, what’s being achieved, what’s drifting, where the IC chooses to spend focus.

Three patterns:

Weekly review. Skim the week’s declarations on Friday or Monday. Look for ICs whose intent and outcome diverged repeatedly. Look for blockers that recur. Look for scope that grew without being renegotiated.
1:1 prep. Walk into the 1:1 with the IC’s last two weeks of declarations open. The conversation has a concrete anchor instead of “how’s everything going.”
Drift detection. When an IC’s declarations stop matching outcomes, when blockers recur, when commitments shrink without explanation, the trail surfaces it before the next sprint review does.

The discipline cuts both ways. A lead who reads declarations as ammunition turns the artifact into a control mechanism, and the team adapts by writing declarations that are safely vague.

The trail is not evidence

The fastest way to kill this format is to use it as a productivity-tracking tool. A lead who cites past declarations as evidence against an IC, who treats missed commitments as a record to hold, who reads the trail to judge instead of to help, breaks it within a quarter. The team adapts: declarations get written defensively, blockers go unmentioned because admitting one is now on the record, and the real coordination work quietly moves into DMs and 1:1 channels where nothing is being filed. What’s left in the public channel is theater. The longitudinal coaching value disappears and the productivity gain goes with it. The format depends on honest declarations and the team feeling that the trail exists to help, not to judge. It amplifies whatever culture is already there. It does not fix a broken one.

Failure modes

The structure has its own anti-patterns.

Declarations become theater. “My goal: fix bugs” produces nothing useful at 3pm because there’s nothing concrete to talk through. The fix is to push the format toward concrete outcomes (a ticket number, a measurable end-state) and to model good declarations from the lead’s own posts.
The 3pm sync collapses to status alone. Some status is built into the format and that’s fine. The failure mode is when status is the whole meeting. When the lead asks “what’s your update” and nobody jumps in with a fix, a question, or “I had this last week,” the meeting has reverted to a synchronous standup with extra writing overhead. The discipline is the second beat: surfacing what isn’t working and inviting help on it before the next person reports.
Over-commitment in writing. Engineers declare more than fits in a day because written commitments feel public. Two weeks of “declared three things, finished one” is a coaching signal, not a discipline issue. The format is asking for honest scoping; the lead’s job is to model it.
Declarations nobody reads. If the morning post goes into a channel nobody scrolls, the persistence benefit evaporates and the 3pm session has no anchor. The fix isn’t more pings. It’s making the channel a routine read for the lead, the partner teams, and the IC’s peers.
The 3pm slot gets eroded by other meetings. Mid-afternoon is prime calendar real estate, and over a quarter the slot can be eaten by partner-team syncs, reviews, and one-offs. Defend it. Moving the sync to 4:30 to accommodate another meeting kills the productivity property the 3pm timing was buying.

When classic standup is still the right tool

This isn’t a one-size-fits-all replacement.

Tight single-time-zone teams get less from the async morning. A team in one office, all in the same five-hour window, can run a synchronous standup at 9:00 and the persistence benefit is small because everyone is already in the same room. The 3pm sync still pays back; the morning declaration is the part that becomes optional.

Very small teams (three or four people) face overhead that is high relative to their size. A daily fifteen-minute conversation accomplishes most of what the two-stage format does, and the multi-day coaching trail matters less when the lead can ask any of three people directly.

Pair-programming and mob-programming workflows complicate the morning declaration. The artifact becomes the pair’s intent, not the individual’s. The 3pm sync still works; the longitudinal trail is less useful because the artifact you’re reading is the pair, not the IC.

Crisis modes are the clearest exception. Incident response, deploy days, anything where the team is already on the same shared context for hours. Use a war-room channel and resume the regular cadence after.

Trade-offs

This isn’t free.

ICs do more writing every day. Five sentences before 10am may not sound like much, but compounded across a year and a team of ten it’s real overhead. The format is paying for that with output, morale, and a coaching trail. If none of those matter in the team’s context, the cost isn’t worth it.

The 3pm slot is expensive calendar real estate. Carving fifteen to thirty minutes out of the most productive part of the work day is a real tax, paid back in problems solved while still solvable. A team that can’t defend the slot will see the productivity benefit decay over a quarter as the meeting slides later or gets cancelled.

The format depends on psychological safety. A team where missed commitments become punishment material, where blockers are read as weakness, will produce performative declarations regardless of the format. The structure amplifies whatever the team’s actual culture is.

The longitudinal trail can be misused. Already covered under “The trail is not evidence”; it is the most expensive failure because it kills the productivity benefit fastest. The same artifact that makes the format work as coaching makes it dangerous as compliance theater.

The tooling fit is partial. Geekbot, Standuply, Range, Polly, and Basecamp Check-ins each support parts of the structure. None map perfectly to “morning declaration + 3pm live sync + longitudinal trail.” Most teams running this end up with a Slack channel for the morning, a recurring 3pm calendar event, and the lead’s own notes for the trail. The tooling is a wrapper around the discipline, not a substitute for it.

The bigger picture

The test for whether this format fits a team is concrete. Does the lead read the morning declarations, and does the 3pm meeting actually run as a working sync? If both are true, the productivity gain is real. If either drifts (the channel goes unread, or the meeting becomes a status round), the format collapses to a more expensive version of the standup it was supposed to replace.

Teams that try this and abandon it usually abandoned it because they kept the format and dropped the discipline. The format is cheap. The discipline is the part that ships.

Testing Your Database, Part 2: What to Test, and How

Tue, 17 Feb 2026 00:00:00 +0000

TL;DR

What most teams call “database tests” are application tests with a database underneath. They cover whether the code reads and writes correctly, not whether the database does what its catalog claims. Real database testing covers five distinct categories, each requiring a different tool, and each invisible to the others.

The CI suite has 600 tests. Every one is green. Judged on what they actually exercise: 480 are unit tests using a stub database for fixtures, 80 are application integration tests using a real Postgres container with seeded fixtures, 20 are migration tests that run db:migrate against the empty test database, 20 are end-to-end tests that hit the API and observe response shape. Coverage looks comprehensive. None of these tests catches: an ALTER TABLE that locks the table for 40 minutes against production volume, a CHECK constraint that’s syntactically valid and semantically wrong, an ON CONFLICT (email) that depends on a UNIQUE constraint nobody declared, a JOIN that multiplies rows through a missing bridge UNIQUE, a query that returns the right shape with the wrong number, a generated column whose definition drifts from its dependencies after a migration. The suite isn’t bad. The failures live in categories the suite doesn’t cover.

The obvious response is “use Testcontainers.” Testcontainers is a real tool, addresses a real gap (the test container is the production engine, not SQLite-as-Postgres), and most teams should adopt it. It still only addresses one of the categories below. A team that adopts Testcontainers and stops there has moved from “no database tests” to “application tests with a real database engine”; better, and still missing the four other categories. The same applies to every other “the one thing we should do” answer: each tool below addresses a class of failure the others can’t see, and an AI-introduced bug can land in any of them.

Five categories of database test

1. Lint DDL safety before merge

The category most teams don’t have at all. The test runs against the migration source itself (not the database) and checks for patterns known to lock or rewrite tables in production: ADD COLUMN ... NOT NULL with a default on Postgres before 11 (which rewrites the table unless split into add, backfill, then constrain) or with a volatile default on Postgres 11+ (only a constant default is stored as metadata; a volatile one still rewrites the table), ALTER COLUMN TYPE that triggers a table rewrite, ADD FOREIGN KEY without NOT VALID, DROP COLUMN on a table other services still write, indexes created without CONCURRENTLY (Postgres) or without ALGORITHM=INPLACE, LOCK=NONE (MySQL).

Tool. Squawk for Postgres; lints SQL migrations for known-bad patterns. Configurable rule set, fast failure, runs as a CI step or pre-commit hook. Atlas for cross-engine coverage (Postgres, MySQL, ClickHouse); its 2025 analyzer set covers destructive changes, data-dependent modifications like ADD COLUMN NOT NULL without a default, nested transactions, and SQL-injection-prone migration code, with hooks designed to gate AI-authored migrations specifically. For MySQL, pt-online-schema-change and gh-ost defaults plus a custom lint script that flags raw ALTER on tables over a configured row threshold.
What it catches. The locking-migration class from Part 1: ALTER TABLE users ADD COLUMN tier TINYINT NOT NULL DEFAULT 0 against a 50M-row table. Squawk’s adding-not-nullable-field, disallowed-unique-constraint, and require-concurrent-index-creation rules surface this pattern before the migration is applied anywhere.
What it misses. Anything that requires running the migration to detect. It’s a syntactic check. A migration that’s safe by lint but breaks an invariant the team relied on still passes.
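A minimal sketch of what a source-level lint does, assuming a naive split on semicolons and three hand-written rules. The rule names are made up for illustration and are not Squawk's or Atlas's; both tools parse the SQL properly and cover far more patterns than this.

```python
import re

def lint(sql: str) -> list[tuple[int, str]]:
    """Return (statement number, hypothetical rule name) for each risky statement.

    Splitting on ';' is a simplification: it breaks on semicolons inside
    string literals or function bodies, which a real linter handles by parsing.
    """
    findings = []
    for number, raw in enumerate(sql.split(";"), start=1):
        stmt = " ".join(raw.split()).upper()  # collapse whitespace, ignore case
        if not stmt:
            continue
        # A plain CREATE INDEX blocks writes to the table while it builds.
        if re.match(r"CREATE (UNIQUE )?INDEX\b", stmt) and "CONCURRENTLY" not in stmt:
            findings.append((number, "index-without-concurrently"))
        # Adding a validated FK scans the table while holding a lock.
        if stmt.startswith("ALTER TABLE") and "FOREIGN KEY" in stmt and "NOT VALID" not in stmt:
            findings.append((number, "foreign-key-without-not-valid"))
        # NOT NULL with no default fails or forces a backfill on a populated table.
        if re.search(r"\bADD COLUMN\b", stmt) and "NOT NULL" in stmt and "DEFAULT" not in stmt:
            findings.append((number, "not-null-column-without-default"))
    return findings

MIGRATION = """
ALTER TABLE users ADD COLUMN tier INT NOT NULL;
CREATE INDEX users_email_idx ON users (email);
CREATE INDEX CONCURRENTLY users_tier_idx ON users (tier);
ALTER TABLE orders ADD CONSTRAINT orders_user_fk FOREIGN KEY (user_id) REFERENCES users (id);
ALTER TABLE orders ADD CONSTRAINT orders_item_fk FOREIGN KEY (item_id) REFERENCES items (id) NOT VALID;
"""
```

Calling lint(MIGRATION) flags statements 1, 2, and 4 and leaves the CONCURRENTLY and NOT VALID variants alone. The point is the shape of the check: it reads text, never touches a database, and so runs in milliseconds as a pre-commit hook.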

2. Assert what the catalog says

The catalog is full of declarations that are easy to write incorrectly and hard to read back. Did the FK actually cascade, or did it default to NO ACTION? Does the partial index cover the predicate the query actually uses? Does the CHECK reject the values it claims to? After a migration applies, are the constraints, indexes, and triggers the team intended actually present, with the names, columns, and behavior they intended? Tests in this category run after migrations apply and assert against the resulting schema and the resulting behavior, not the application’s view of it.

A note specifically on database-resident business logic. The companion post Where Business Logic Lives argues to keep most of it in the application layer, where review density and tooling are higher. Triggers, stored procedures, functions, RLS policies, generated-column expressions, and multi-statement CHECK constraints encoding state-machine rules are running code that doesn’t show up on a normal PR diff and doesn’t execute in local development the way application code does. If business logic is in the database (by accident, by legacy, or by deliberate choice) every one of those needs unit-test coverage the same way an application function would. For stored procedures and functions: assert every branch, every EXCEPTION block, every side effect on the rows the procedure touches, every return value the caller depends on. For triggers: assert each firing condition (BEFORE / AFTER × INSERT / UPDATE / DELETE), every WHEN filter, the actual state change the action performs, and any interaction with other triggers on the same table. The same pgTAP / tSQLt harness covers all of it; the discipline is treating database code as code, not as configuration. A trigger that’s never been asserted against isn’t a guarantee. It’s a hope.

Tool. pgTAP for Postgres, tSQLt for SQL Server, utPLSQL for Oracle. Schema-level assertions: has_column('users', 'tier'), col_type_is('users', 'tier', 'integer'), has_index('users', 'users_email_idx'), fk_ok('orders', 'user_id', 'users', 'id'). Behavior-level assertions: insert an illegal row and assert the CHECK rejects, insert a parent and child and delete the parent and assert the cascade fired, attempt a forbidden state transition and assert it’s rejected.
What it catches. The CHECK that lists valid values but doesn’t constrain transitions. The FK declared without ON DELETE CASCADE despite the team thinking it was. The partial index whose WHERE clause has drifted from the queries that use it. The UNIQUE constraint the upsert depends on but nobody declared.
What it misses. Anything outside the catalog: query result correctness, lock duration, performance regressions, runtime data invariants.
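A sketch of the two assertion styles, using Python's built-in sqlite3 module purely so the example is self-contained. SQLite is a stand-in here, not a recommendation; in a real suite the same checks would run through pgTAP (or tSQLt, utPLSQL) against the production engine. The tables and helper names are hypothetical.

```python
import sqlite3

def build_schema() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
    conn.executescript("""
        CREATE TABLE users (
            id INTEGER PRIMARY KEY,
            tier INTEGER NOT NULL CHECK (tier IN (0, 1, 2))
        );
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            user_id INTEGER NOT NULL REFERENCES users (id)  -- no ON DELETE clause
        );
    """)
    return conn

def on_delete_actions(conn, table):
    """Schema-level: read the declared ON DELETE action of every FK on `table`."""
    return [row[6] for row in conn.execute(f"PRAGMA foreign_key_list({table})")]

def check_rejects(conn, bad_tier) -> bool:
    """Behavior-level: does the CHECK actually reject this value?"""
    try:
        conn.execute("INSERT INTO users (id, tier) VALUES (99, ?)", (bad_tier,))
        return False
    except sqlite3.IntegrityError:
        return True
    finally:
        conn.rollback()  # leave the database as we found it

def delete_cascades(conn) -> bool:
    """Behavior-level: does deleting a parent remove its children?"""
    conn.execute("INSERT INTO users (id, tier) VALUES (1, 0)")
    conn.execute("INSERT INTO orders (id, user_id) VALUES (10, 1)")
    try:
        conn.execute("DELETE FROM users WHERE id = 1")
        return conn.execute("SELECT count(*) FROM orders").fetchone()[0] == 0
    except sqlite3.IntegrityError:
        return False  # the delete was blocked: NO ACTION, not CASCADE
    finally:
        conn.rollback()
```

Against this schema, delete_cascades returns False and on_delete_actions reports NO ACTION, which is exactly the "we thought it cascaded" case: the declaration reads fine in the migration, and only a test that deletes a parent shows what the catalog really holds.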

3. Assert query results, not just shapes

The hardest category and the most under-covered. The query is syntactically valid, the result set is the right shape, the EXPLAIN is clean, and the number is wrong by 30%. The test that catches this asserts the result against a known dataset: given a fixture with 1,000 users where 200 are soft-deleted, the active-users query returns 800; given five orders with stacked discounts, the revenue query returns the discounted total, not the row-multiplied total.
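As a rough illustration of the two fixtures just described, the sketch below loads them into an in-memory SQLite database (a stand-in chosen only to keep the example runnable; a real test would use the production engine). Table names, column names, and amounts are assumptions made for the example.

```python
import sqlite3

def load_fixture() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, deleted_at TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, amount_cents INTEGER NOT NULL);
        CREATE TABLE order_discounts (order_id INTEGER NOT NULL,
                                      discount_cents INTEGER NOT NULL);
    """)
    # 1,000 users, the first 200 soft-deleted.
    conn.executemany(
        "INSERT INTO users (id, deleted_at) VALUES (?, ?)",
        [(i, "2026-01-01" if i <= 200 else None) for i in range(1, 1001)],
    )
    # Five orders of 10,000 cents, each with two stacked discounts (500 + 250).
    for order_id in range(1, 6):
        conn.execute("INSERT INTO orders VALUES (?, 10000)", (order_id,))
        conn.executemany("INSERT INTO order_discounts VALUES (?, ?)",
                         [(order_id, 500), (order_id, 250)])
    conn.commit()
    return conn

ACTIVE_USERS = "SELECT count(*) FROM users WHERE deleted_at IS NULL"

# Joining before aggregating repeats each order once per discount row.
NAIVE_REVENUE = """
    SELECT sum(o.amount_cents) - sum(d.discount_cents)
    FROM orders o JOIN order_discounts d ON d.order_id = o.id
"""

# Aggregate discounts per order first, then join one row per order.
CORRECT_REVENUE = """
    SELECT sum(o.amount_cents - coalesce(d.total, 0))
    FROM orders o
    LEFT JOIN (SELECT order_id, sum(discount_cents) AS total
               FROM order_discounts GROUP BY order_id) d ON d.order_id = o.id
"""

def scalar(conn, sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]
```

Both revenue queries return one plausible-looking number. Only the assertion against the known total (5 × 9,250 = 46,250 cents) separates them: the naive join counts every order twice and reports 96,250.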

Example-based fixtures aren’t enough on their own. The AI-generated bugs in this category live in inputs the test author didn’t think to write. Take a paginated query the model emits as ORDER BY created_at LIMIT 100 OFFSET ?. An example test inserts ten orders, paginates through them, gets all ten back, passes. The bug (that created_at isn’t unique, so rows with identical timestamps swap positions between pages and rows get skipped or duplicated) never surfaces against ten hand-written rows. A property-based test that asserts “every row appears exactly once across all pages” finds the timestamp collision in seconds and the fix is a deterministic tie-breaker (ORDER BY created_at, id). The same pattern applies to round-trip (insert and read back unchanged), conservation (sum of children equals parent total), and idempotency (running twice equals running once).
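The property can be sketched without a database at all. The block below is a hand-rolled property check using only the standard library, standing in for what Hypothesis or fast-check would generate and shrink automatically. The pager is a model, not an engine: it assumes, as SQL permits, that rows with equal sort keys may come back in a different order on every execution.

```python
import random
from collections import Counter

def fetch_page(rows, order_key, limit, offset, rng):
    """Model of `ORDER BY ... LIMIT ? OFFSET ?` with arbitrary tie order."""
    shuffled = rows[:]
    rng.shuffle(shuffled)                      # tie order differs on each call
    ordered = sorted(shuffled, key=order_key)  # stable sort keeps that tie order
    return ordered[offset:offset + limit]

def paginate_all(rows, order_key, page_size, rng):
    """Walk every page the way a client would and collect what it saw."""
    seen, offset = [], 0
    while offset < len(rows):
        seen.extend(fetch_page(rows, order_key, page_size, offset, rng))
        offset += page_size
    return seen

def every_row_exactly_once(rows, order_key, page_size, rng):
    """The property: no row skipped, no row duplicated across pages."""
    seen = paginate_all(rows, order_key, page_size, rng)
    return Counter(r["id"] for r in seen) == Counter(r["id"] for r in rows)

def find_counterexample(order_key, trials=200, seed=0):
    """Generate random tables and page sizes; return the first that breaks the property."""
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randint(1, 30)
        # Few distinct timestamps, so collisions are common.
        rows = [{"id": i, "created_at": rng.randint(0, 3)} for i in range(n)]
        page_size = rng.randint(1, 5)
        if not every_row_exactly_once(rows, order_key, page_size, rng):
            return rows, page_size
    return None

def by_timestamp(row):
    return row["created_at"]                   # ORDER BY created_at

def by_timestamp_then_id(row):
    return (row["created_at"], row["id"])      # ORDER BY created_at, id
```

find_counterexample(by_timestamp) returns a failing table, while find_counterexample(by_timestamp_then_id) returns None because the tie-breaker makes the order total. Ten hand-written rows with distinct timestamps would pass under either key, which is why the example-based test never sees the bug.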

Property and fixture tests assert that the result is correct against a known input. They don’t assert the query means what the human asked. The second tier addresses that gap: dual-track evaluation, where the query runs programmatically (count, aggregate, expected rows) and a separate LLM judge scores semantic alignment against the original natural-language intent, the schema, and the result set. Thomson Reuters’ internal SQL agent shipped with a 73% silent-failure rate on time-based analyses (predicates landed on the parent date column but not the joined ones); adding a “consistent time constraint across joined tables” validator plus dual-track judge eval drove it below 10%.

The 10% is lab-clean and best-case. TR measured it against curated analytical queries with known ground truth on an instrumented internal data lake. The gap from there to a typical legacy schema (undeclared FKs, polysemic TINYINT status columns, tribal-knowledge soft-deletes, four-format VARCHAR dates, JSON-as-schema, the realities in What AI Gets Wrong About Your Database) multiplies that rate, because the judge reads the same impoverished catalog the generator does. 10% is unshippable on its own: a daily financial query at 10% silent-failure lays down corrupted data every week, errors layer into next month’s inputs, and by the time a customer flags the discrepancy six months later the WAL retention is exhausted and the backups have rolled past. That’s a continuity event, not a bug. Treat LLM-judge eval as a floor-raiser for what reaches human review, never the release gate. The gate is property tests, fixture-based result assertions, and human review against representative data; the judge sits underneath all three.

Tool. dbt tests for analytical/transformation SQL: built-in unique, not_null, accepted_values, relationships, plus custom SQL tests. Soda Core for production data quality assertions. For application SQL, custom integration tests that load a representative fixture, run the query, and assert the count and one or two known aggregates. Hypothesis (Python), fast-check (TypeScript), or PropEr (Erlang) for property-based generators that exercise the input distribution the fixture doesn’t. data-diff for regression: run the query against the same dataset before and after a change, fail if the result diff is larger than expected. For semantic verification of AI-generated SQL: LLM-judge templates from Arize, Evidently, Langfuse, or Monte Carlo, scoped to the referenced tables only; schema bloat poisons the judge the same way it poisons the generator. Differential testing (running the agent’s query and a hand-written reference against the same dataset and diffing) is the natural extension. No productized tool exists for it yet, and any team adopting AI-authored SQL at scale should be ready to roll their own harness.
What it catches. The opening incident from Part 1: the soft-delete-naive LEFT JOIN that over-reported revenue by 7%. The JOIN cardinality blowup through a bridge table without composite UNIQUE. The polysemic-TINYINT predicate landing on the wrong meaning. The pagination that drops rows on timestamp ties. The temporal-misalignment failure from the Thomson Reuters case: predicates landing on the parent date column but not the joined ones. The aggregate that ran in 80ms and was off by $1.4M. Anything where the failure shape is “result is plausible but wrong.”
What it misses. Anything that only manifests at production scale, under concurrency, or against data shapes the fixture and the property generators didn’t include. The LLM-judge tier specifically misses any failure mode invisible from the result set: a query that returns the right number for the wrong reason still passes.

4. Run the migration and budget what it costs

The category that catches the locking migration from Part 1. The test applies the migration to a database the size of production (or a representative fraction), measures lock duration with pg_locks or information_schema.innodb_trx, runs concurrent reads and writes against the table to surface metadata-lock contention, and asserts against a budget. Same idea for queries: run EXPLAIN against a representative dataset, assert the planner uses the index the team meant, assert the cost is below a budget, assert the query plan didn’t change unexpectedly between the previous and current revisions.

Tool. Testcontainers for spinning up a real database engine inside the test runner, seeded with an anonymized prod-shaped snapshot (PostgreSQL Anonymizer or equivalent for the snapshot pipeline). A test harness that applies the migration with a stopwatch and a pg_locks watcher running in a parallel session. EXPLAIN budgets via per-query assertions in CI; production query-plan regressions via pg_stat_statements snapshots.
What it catches. The 40-minute migration. The query that’s fast on 10K rows and slow on 10M. The index the planner ignores. The migration that succeeds in isolation and deadlocks against concurrent writers.
What it misses. Catalog-level invariants the migration doesn’t change but the test should still verify; data invariants that drift over time independent of any migration.
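A sketch of the "assert the planner uses the index the team meant" half, using SQLite's EXPLAIN QUERY PLAN so the example runs anywhere. This is an assumption-laden stand-in: against Postgres the same idea would read the plan from EXPLAIN (FORMAT JSON) and could also compare the estimated cost to a budget, which SQLite's plan output does not expose. The table and index names are hypothetical.

```python
import sqlite3

def plan_details(conn, sql, params=()):
    """Return the human-readable detail text of each step in the query plan."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]

def uses_index(conn, sql, index_name, params=()):
    """True if any plan step mentions the named index."""
    return any(index_name in detail for detail in plan_details(conn, sql, params))

def build(with_index: bool) -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL, name TEXT)"
    )
    conn.executemany(
        "INSERT INTO users (email, name) VALUES (?, ?)",
        [(f"user{i}@example.com", f"user {i}") for i in range(10_000)],
    )
    if with_index:
        conn.execute("CREATE INDEX users_email_idx ON users (email)")
    conn.commit()
    return conn

LOOKUP = "SELECT id, name FROM users WHERE email = ?"
```

A CI test would call uses_index(conn, LOOKUP, "users_email_idx", ("user42@example.com",)) after applying the migration and fail the build when it returns False, for example because a later migration dropped or renamed the index.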

5. Catch data drift after the schema is correct

The category most useful for catching production drift, not pre-deployment bugs. Once a day, in CI or as a scheduled job, run a set of assertions against actual production data: every order has a user, every user with an active subscription has a payment method, the soft-delete column is consistent across related tables, the JSON keys in the column match the documented shape, the count of users.deleted_at IS NOT NULL matches the count of soft-delete audit records. These assertions don’t run against fixtures; they run against the real data and surface inconsistencies the schema can’t enforce - the integrity rules that live in application code, not in the catalog.

Tool. Soda Core, Great Expectations, and custom SQL assertions wrapped in a test runner that fails loudly and pages on a missed assertion are the standard toolchain. Schemathesis is the cousin that handles property-based testing of API contracts hitting the database, useful for catching drift introduced through the API rather than through backfills.
What it catches. What the schema cannot: soft-delete inconsistencies between related tables, orphaned rows the FK should have caught but didn’t (because the FK was declared after the orphans existed and was created NOT VALID), JSON shape drift, business-logic invariants that live in application code and got bypassed by a backfill or a one-off script.
What it misses. Pre-deployment bugs. By the time a data invariant fires, the bad data is already there; this category is the safety net under the others, not a replacement for them.
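The "custom SQL assertions wrapped in a test runner" option can be as small as the sketch below. It assumes each invariant is written as a query that counts violating rows, and it uses an in-memory SQLite database with made-up tables only to stay runnable; a scheduled job would point the same queries at a production replica and page on a non-empty result.

```python
import sqlite3

# Each invariant is a query that counts violating rows; zero means it holds.
INVARIANTS = {
    "every order has a user": """
        SELECT count(*) FROM orders o
        LEFT JOIN users u ON u.id = o.user_id
        WHERE u.id IS NULL
    """,
    "soft-deleted users match audit records": """
        SELECT abs(
            (SELECT count(*) FROM users WHERE deleted_at IS NOT NULL)
          - (SELECT count(*) FROM soft_delete_audit)
        )
    """,
}

def check_invariants(conn):
    """Return {invariant name: violation count} for every failing invariant."""
    failures = {}
    for name, sql in INVARIANTS.items():
        violations = conn.execute(sql).fetchone()[0]
        if violations:
            failures[name] = violations
    return failures

def sample_data() -> sqlite3.Connection:
    """Hypothetical data with one orphaned order and one missing audit row."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, deleted_at TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL);
        CREATE TABLE soft_delete_audit (user_id INTEGER NOT NULL);
        INSERT INTO users VALUES (1, NULL), (2, '2026-01-01'), (3, '2026-01-02');
        INSERT INTO orders VALUES (10, 1), (11, 2), (12, 99);
        INSERT INTO soft_delete_audit VALUES (2);
    """)
    return conn
```

Order 12 points at a user that does not exist and user 3 was soft-deleted without an audit row, so both invariants fail with a count of one. Keeping the invariants as plain SQL in one mapping makes adding a rule a one-entry change.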

The minimum useful subset

Five categories is more than most teams will adopt at once. The order to add them, ranked by leverage per hour invested:

DDL safety lint (Squawk or equivalent). One config file, zero runtime cost, catches the highest-impact failure mode: schema migrations that lock production. Adopt this week.
Schema invariants (pgTAP for Postgres, tSQLt for SQL Server). One test file per migration, asserting the migration produced the schema the team intended and that constraints behave the way the team thinks. Catches the constraints that look right and aren’t.
Lock and performance budgets (Testcontainers + prod-shaped snapshot). The largest setup cost - the snapshot pipeline has to be built and maintained - and the largest payoff. Catches the failures that only manifest at production scale.
Query result regressions (dbt tests or custom integration tests). High value for analytical workloads, lower for transactional. Pick the queries that drive business decisions and assert their results against fixtures; expand from there.
Data invariants (Soda or scheduled SQL). Useful once the pre-deployment categories are in place. Without them, you’re chasing drift the earlier categories should have prevented.

A team that has none of these and adopts the first three closes most of the AI-introduced surface from Part 1. A team that has all five has built the verification layer that, pre-AI, lived in the heads of senior engineers - now written down, runnable on demand, and cheap to re-run on every change.

The recovery layer

The five categories test the database on the way in. They don’t test the path back when something AI-introduced makes it past every category and corrupts data anyway. Backups that have never been restored aren’t backups; they’re unverified hopes. Replication and failover that have never been broken on purpose aren’t HA; they’re configuration that hasn’t been disproved.

The drills are mechanical: pull a random backup from the last seven days, restore to a fresh ephemeral instance, run a checksum query against a known table, destroy the instance. Daily. Alert on failure. Same idea for point-in-time recovery: pick a timestamp from yesterday, restore to it, verify. Same for failover: kill the primary on a schedule in staging, confirm promotion, restore. Each drill costs an hour or two to automate and catches the class of failure where backups silently stopped working three months ago and nobody noticed because nobody needed them yet.
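The shape of the backup drill, sketched with SQLite's online backup API as a stand-in for the real backup and restore tooling. The assumptions are loud ones: an in-memory connection plays the "fresh ephemeral instance", Connection.backup plays "restore", and the checksum is recorded when the backup is taken. Table and helper names are hypothetical.

```python
import hashlib
import sqlite3

def table_checksum(conn, table, order_by):
    """SHA-256 over every row of `table`, read in a deterministic order."""
    digest = hashlib.sha256()
    for row in conn.execute(f"SELECT * FROM {table} ORDER BY {order_by}"):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()

def take_backup(source, table, order_by):
    """Copy the source database and record the checksum of a known table."""
    backup = sqlite3.connect(":memory:")
    source.backup(backup)
    return backup, table_checksum(source, table, order_by)

def restore_drill(backup, recorded_checksum, table, order_by):
    """Restore to a throwaway instance, verify the known table, destroy the instance."""
    scratch = sqlite3.connect(":memory:")  # stand-in for a fresh ephemeral instance
    try:
        backup.backup(scratch)             # stand-in for "restore the backup"
        return table_checksum(scratch, table, order_by) == recorded_checksum
    finally:
        scratch.close()                    # always tear the instance down

def make_source() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE ledger (id INTEGER PRIMARY KEY, amount_cents INTEGER NOT NULL);
        INSERT INTO ledger VALUES (1, 100), (2, 250), (3, 75);
    """)
    return conn
```

A scheduled job would call restore_drill once a day and alert when it returns False. The drill fails the moment a backup loses or alters a row in the known table, which is the signal that otherwise stays silent until the day the backup is needed.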

The framing shifts with agents in the loop. An agent that holds write permissions on production can cause damage at machine speed; recovery has to work at machine speed too. A four-hour restore that requires three engineers isn’t a recovery procedure. It’s a postmortem.

The performance environment

Category 4 covers per-PR migration safety and per-query plan budgets, good for CI feedback loops, where a Testcontainers engine spins up, runs one operation, asserts a budget, and tears down. That’s the wrong shape for the failure modes that only emerge under sustained load: throughput collapse under realistic concurrency, p95/p99 latency creeping past SLO under traffic mix, replication lag under write pressure that staging never produces, connection pool saturation when a new query plan blows up the average query duration, buffer cache thrash when an added index pushes hot data out of shared_buffers, and the slow degradation that only shows up after the cache has warmed and the workload has run for hours.

The environment those tests need is a shadow of production: same data volume, same replication topology, ideally same hardware shape, populated from an anonymized snapshot refreshed on a regular cadence. Traffic comes from replay rather than synthesis. Capture statement traces from query logs over a representative window (pg_stat_statements shows the aggregate statement mix but not a replayable trace), and replay them with pgreplay, Percona Playback, or a custom harness against the shadow at production rate. Synthetic load generators (k6, JMeter) work for application-level scenarios but miss the long-tail query distribution that production carries, which is exactly where AI-introduced regressions hide.

The payoff is the failure class CI cannot reach: a migration passes every Testcontainers check, deploys to production, and degrades p99 latency 40% three hours later because the new index shifted the planner’s choice for a different query the CI run never executed. That regression is invisible in a five-second container and obvious in a four-hour replay.

The cloud collapses most of the setup cost. Aurora’s fast database cloning is copy-on-write against the source. Ready in minutes regardless of database size, you only pay for the diff, and you tear it down when done. Neon’s branches do the same for managed Postgres outside Aurora. Google Cloud SQL clones and Azure SQL database copy are slower but in the same family. Plain RDS without Aurora is slower still (snapshot + restore) but cheaper than building the pipeline yourself. The “we’d need to maintain a parallel copy of production” objection that used to kill this kind of testing infrastructure is a one-API-call problem in 2026 for anyone on managed cloud Postgres or MySQL: clone prod on demand, run the migration and the replay against the clone, throw it away. The blocker shifted from infrastructure to discipline. Are AI-authored migrations actually running this gate before merge, or is the clone capability sitting unused?

A read-only replica in production is the degenerate version of this. It’s the cheapest shadow you’ll ever build, and the answer to “is the new query going to scan a billion rows” is to run it on the replica before merging. Many teams already have one; far fewer route AI-generated queries through it as a standing gate.

When this doesn’t apply

The minimum useful subset assumes a production-facing system with multiple writers, frequent migrations, and AI in the loop. Cases where less is enough:

A read-only analytical workload with no migrations. Categories 1 and 2 don’t apply. Category 3 (result regressions) and category 5 (data invariants) carry the load.
A throwaway service with one writer and no enduring data. None of this is necessary; the cost of a wrong query is recoverable.
A team with zero AI in its data layer. The case for the test categories is weaker - the implicit human review is intact. The categories still matter, but they aren’t load-bearing in the same way.
A schema small and stable enough to fit in one head. Twenty tables, three engineers, one service writing every row. The reviewer who wrote the migration is the test, the same way they always were. Grow the team or the schema by an order of magnitude and the math flips.

For everything else, the cost of building the test layer is a fraction of the cost of one production incident. AI made the math obvious.

How Teams Actually Finish What They Start, Part I: Designing the Team as a Process

Wed, 11 Feb 2026 00:00:00 +0000

TL;DR

A team’s output and motivation come from designing the work as a process: each person in a role that uses them well, an owned outcome attached to it, a cadence chosen deliberately. Every team’s process is custom; the operating principle is universal. Break the work into chunks the team can consume, run them on a schedule the team can hold, and the team ships like a clock.

Six engineers, three quarters in. Weekly standup, retro every two weeks, sprint planning Monday, sprint review Friday, an architecture review the second Tuesday of the month, OKR tracking, the usual mix of Slack channels. The team ships, but the velocity chart has been flat for two quarters and three engineers are quietly looking. Retros produce theme complaints (too many meetings, scope creep) and no underlying diagnosis. From the outside the team looks well-managed; from inside it feels like rituals running on inertia. Nobody can answer “what is this team’s process actually for” without falling back on the names of meetings.

The reflex is to cut meetings. Cancel one, the others stay; the calendar gets quieter, the velocity stays flat, and the same three engineers keep looking. The problem isn’t meeting volume. The team has no process design: no answer to what each person is at the team to produce, what role they fill, what output flows from them to whom. Cutting rituals doesn’t add design. It removes friction from a system that wasn’t producing anything in particular.

Process beats capacity

NYC port had less infrastructure than Boston or Charleston in 1817. By 1830 it was the dominant Atlantic port and stayed that way for the next century. The difference was one decision in 1818: Black Ball Line started running ships on a published monthly schedule from NYC to Liverpool, full hold or not. Boston ran ships when there was cargo. Charleston the same. Traders started routing through NYC because they could plan around the schedule. Volume followed predictability, predictability compounded into market position, and the gap kept widening for a century. The engineering equivalent is the same shape. A team that ships on schedule draws downstream commitments. Partner teams plan around it. Leadership trusts it with bigger scope. Customers stop hedging. A team that ships fast when convenient does not.

Andy Grove’s High Output Management opens with a breakfast service. Same people, same stove, same eggs. With role design (one cook on the line, one runner, one dishwasher) the kitchen produces roughly twice the output of the same people working as a generalist mob. Grove’s frame is that a manager’s output is the output of the organization under their influence, and the leverage is in process design rather than in the manager’s individual speed at any task. An engineering team is the same shape. A team where everyone does everything looks egalitarian and produces less than a team where each person owns a role that uses their strengths and feeds the next person’s input.

NASA’s Apollo mission control ran the same shape under maximum stakes. Each flight controller owned a specific system: FIDO for trajectory, RETRO for the return burn, EECOM for electrical and life support. The Flight Director consolidated. Each role had a clear handoff to the people whose decisions depended on theirs. The same engineers working as generalists across every screen would have caught nothing in time. When an oxygen tank ruptured on Apollo 13, the room routed the problem to the right console in seconds. The role design was what made that speed possible.

These run on the same mechanism: consumable chunks on a held schedule, each station fed by the one before it. The Toyota Production System named the principle decades later. Place each person where the work arrives. Arrange the layout so output flows to the next station without wasted motion. A line designed that way produces more than the same people working as a generalist mob, because the design strips out the waiting, the searching, and the reaching that eat individual capacity. Engineering teams that ship reliably run the same shape, with whatever chunks and whatever schedule the work can sustain.

Roles as the unit of design

Each station on that line is a role. Team Topologies frames a team as exposing an API: the interface other teams plan against. The same shape applies inside a team. A role is an interface. The rest of the team needs to know what outcome it produces, who is accountable, and the cadence it runs on. Take the code-review queue. Reviewing PRs is the activity, not the role. The engineer on this week’s rotation is accountable for the queue. The outcome is PRs reviewed within four hours, with reviewers learning from each other in the process. The cadence is weekly: the rotation rolls every Monday. A team with that interface explicit has a designed role. A team with just the activity has a queue that may or may not produce anything. Roles fit together by design: one role’s output is another’s input, and engineers cooperate at the boundaries instead of throwing work over a wall. Owning a scope means accountability for what comes out of it, not territory.

A team’s process is the set of roles it runs and the way work passes between them. A team without designed roles still has roles. They emerged accidentally, drifted to whoever was loudest or most senior, and rarely get pruned. Most silos start there: a role belongs to one person not because the team chose them, but because they happened to be the one who picked it up.

Three rules separate designed roles from drifted ones:

Custom to the team, not a template. Spotify Squads, Atlassian Goalies, the latest FAANG framework: every successful process is the team’s own. Importing someone else’s wholesale gets the form without the function. Templates are inputs to design, not the design itself.
Each role owns an outcome, not a task list. A role that lists what to do produces compliance. A role that names the outcome the IC owns produces ownership. The difference shows up over a quarter: owners go deep, propose improvements, push back on bad designs, and stay engaged. Task-fillers don’t. The team’s reputation across the org follows the depth: whatever the team owns gets serious treatment, and the team gets credit for it.
Cadence justifies itself by output. A weekly meeting that exists because “that is our cadence” is a role optimized for nothing in particular. Each recurring role should serve a specific output. An SRE-to-INFRA weekly runs weekly because cross-team dependency surprises happen on a weekly horizon, and the meeting is less evil than the alternative of implementing in silos and surprising each other with operational load. Roles without a justification should be ad-hoc, set up when a purpose appears and dissolved when it goes away.

The manager’s role

The manager designs the interface and picks the people to staff it. Which roles exist, what each one produces, the cadences, the handoffs - those are decisions, not emergent properties. The manager hires into the roles, understands what each engineer does well and badly, and places them where the role’s outcome and the engineer’s strengths meet. When two roles overlap or interfere, redrawing the boundary is the manager’s call.

The manager also decides last. ICs propose. The manager arbitrates from the only seat in the room that sees all roles at once. A team that runs every decision past its manager has a manager doing the team’s work. The reverse failure is a team that never escalates anything, where the manager has quietly stopped owning the scope.

Priorities are the other lever. Most engineers have a tendency to rewrite things they find ugly.

The rewrite cycle

If it were up to engineers, no release would ever ship. The legacy gets rewritten, the rewrite ages into legacy, and the rewrite of the rewrite begins.

The manager carries the business context that engineers do not always see: what the company is trying to do this quarter, what each system costs to run, what the customer actually pays for. That context turns “rebuild this because it’s ugly” into “fix the slow part first because it doubles conversion.” Low-hanging fruit and process optimization beat rebuilding from scratch nearly every time.

Take a talented IC. Do they ship more inside the process you built than they would alongside the same teammates with no designed roles at all? Grove’s frame from the breakfast service makes this the central test: a manager’s output is the output of the team under their influence. A process that fits the work makes a strong IC compound: their output flows into a role where someone else is waiting for it, and that person’s output flows somewhere in turn. Without designed roles, the same IC works hard and the team still drifts, because the work has nowhere to go between people. That delta, between what the IC ships with the process and what they ship without it, is the manager’s actual output.

The delta has a second face that velocity charts miss. A role with a named outcome and a clear handoff lets the IC running it see who is waiting on their work and what gets built on top of it. That visibility is what makes the team feel like a team, and what makes engineers want to push the work forward instead of clearing tickets. A process that ships work but leaves engineers feeling interchangeable is missing the half of the manager’s output that the engineers themselves feel.

Write the interface down

That kind of visibility does not happen by default. The interface only exists for the rest of the team if it is written down. Every role gets a short entry: the owner, the outcome it produces, the cadence it runs on, the SLA it holds itself to, who pages whom when something slips. Team Topologies publishes a Team API template for the team-to-team version, and the same shape works scoped down to the role level. The doc lives somewhere everyone reads, whether a wiki page, a pinned Slack thread, or the README of a process repo. Without a written record the interface lives in one person’s head, and the team is one departure away from rediscovering whose role each thing was. The doc is also where the team negotiates changes: an outcome shifts, a cadence moves, an owner rotates, and the change is visible to everyone affected before it happens. The same doc carries into quarterly reviews. A role’s outcome is what its owner committed to producing, and the review can work from a shared reference instead of from whatever incidents are easiest to recall. Engineers know in advance what the bar is.

Parts II through V as worked examples

This series puts role design to work. Part II covers the morning declaration and 3pm sync as a productivity cadence: two roles that turn morning intent into the day’s output. Part III covers the responder rotation as the role that protects the others, especially for SRE, DevOps, and platform teams whose work is otherwise pinged into noise. Part IV covers pointing tickets after they close, as the measurement discipline that tells the team where the design is actually working and where it is not. Part V covers what the sprint should actually contain - work in flight plus immediate next pulls, with priority living in labels and the team pulling from the labeled backlog.

All four are worked examples of the same discipline. Name the role, name the outcome, design the cadence to fit, measure what comes out, and prune the roles that don’t justify themselves. None of them is a template to copy. The discipline transfers; the specifics depend on your team.

When this doesn’t apply

Process design overhead exceeds benefit in four cases.

Genuinely small teams (three or fewer) where the “process” is mostly synchronous conversation. Designing roles for three people is over-engineering when the three of them can talk in real time about anything that matters.

Pre-PMF or product-discovery teams, where the activity boundaries themselves are unstable. When the team does not yet know what it should be producing, designing roles around outcomes locks in the wrong shape, and the cost of redesigning weekly outpaces the cost of running ad-hoc. Stay loose until the work has a recognizable rhythm, then design.

Pure-research environments where output is uncertain by definition. Process design assumes a known output to optimize for; research breaks that assumption. The right structure for research is closer to “give people room to wander and meet to share findings.”

Crisis modes. Incidents, deploy days, market events. The ad-hoc response is the right move, and process discipline is the wrong tool until the crisis is over. Resume role discipline after.

The bigger picture

A team designed as a process is one whose members can answer “what role do I fill, what outcome do I own, what flows in and out.” That answer is the difference between rituals running on inertia and rituals serving output. From the outside the rituals look the same; the team’s quarter does not.

The teams that improve over a year tend to be the teams where someone designed the process: pruned the roles that did not justify themselves, gave each surviving role an owner, made the cadence honest. Teams that don’t improve usually didn’t have that work happening.

Testing Your Database, Part 1: Why AI Made It Mandatory

Tue, 10 Feb 2026 00:00:00 +0000

TL;DR

AI removed the implicit human review that used to catch database bugs. The engineer writing the migration was the test, and that test was never written down or runnable in CI. Once an LLM writes any portion of your data layer, the test suite is the only line between hallucinated SQL and a production incident.

A senior engineer asks an assistant to write a query returning each user’s total order value. The codebase has used soft deletes for three years; deleted_at columns are on most tables. The model produces:

SELECT u.id, u.email, COALESCE(SUM(o.amount), 0) AS total
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.deleted_at IS NULL
GROUP BY u.id, u.email;

The query compiles. It runs. It returns numbers. It’s wrong: there’s no o.deleted_at IS NULL predicate, so soft-deleted orders inflate the total of every user who has one. The reviewer skims the diff, sees LEFT JOIN, sees COALESCE, sees deleted_at IS NULL somewhere in the WHERE clause, approves. The bug ships. A finance dashboard over-reports revenue by 7% for two months until a customer flags a discrepancy against their own records.
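A fixture small enough to read by eye is enough to make this bug fail a test. The sketch below is an illustration, not the team's actual suite: it uses Python's built-in sqlite3 as a stand-in for the production database, keeps the article's table and column names, and invents three users and three orders. The corrected query is one plausible fix, assuming the team wants soft-deleted orders excluded from revenue.

```python
import sqlite3

# In-memory SQLite stands in for the production database. Table and column
# names follow the article's example. The fixture rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, email TEXT, deleted_at TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER,
                         amount NUMERIC, deleted_at TEXT);
    INSERT INTO users VALUES (1, 'a@example.com', NULL),
                             (2, 'b@example.com', NULL),
                             (3, 'c@example.com', NULL);
    -- user 1: one live order and one soft-deleted order
    -- user 2: no orders
    -- user 3: only a soft-deleted order
    INSERT INTO orders VALUES (10, 1, 100, NULL),
                              (11, 1, 50, '2025-01-01'),
                              (12, 3, 70, '2025-02-01');
""")

# The query as generated: soft-deleted orders are still summed.
BUGGY = """
    SELECT u.id, COALESCE(SUM(o.amount), 0) AS total
    FROM users u
    LEFT JOIN orders o ON o.user_id = u.id
    WHERE u.deleted_at IS NULL
    GROUP BY u.id ORDER BY u.id
"""

# The filter goes in the ON clause. In the WHERE clause it would drop
# user 3 entirely, because all of that user's matched rows are soft-deleted.
FIXED = """
    SELECT u.id, COALESCE(SUM(o.amount), 0) AS total
    FROM users u
    LEFT JOIN orders o ON o.user_id = u.id AND o.deleted_at IS NULL
    WHERE u.deleted_at IS NULL
    GROUP BY u.id ORDER BY u.id
"""

buggy_totals = conn.execute(BUGGY).fetchall()
fixed_totals = conn.execute(FIXED).fetchall()
```

The fixture is chosen so that each mistake changes a number: the missing predicate changes users 1 and 3, and a predicate placed in WHERE instead of ON would change the row count.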

The new shape of database failure

The bugs an LLM introduces don’t look like junior-developer bugs. Junior devs ship SQL that doesn’t compile, throws an obvious type error, or returns nothing at all: the kind of failure CI catches by running the query once. The model ships SQL that compiles, runs, and returns something, just not the right something. It pattern-matched on a million open-source codebases and missed the rule that lives only in yours: the soft-delete convention this team adopted three years ago, the multi-tenancy filter every query is supposed to carry, the polysemic TINYINT whose 0 means “unknown” in one column and “free tier” in another, the denormalized counter that has to be bumped in the same transaction as the row it counts.

The obvious response is “review AI-generated code more carefully.” That doesn’t survive contact with how reviews actually happen. Engineers using AI assistance score measurably lower on comprehension quizzes about the code they shipped, and they read the output less carefully than code they would have written themselves. The syntax is right, the change is small, and the reviewer ratifies. Even if that effect didn’t exist, the reviewer who can spot the missing o.deleted_at IS NULL is the same reviewer who would have written the predicate in the first place, which is exactly the value AI was supposed to provide. “Review more carefully” asks the team to do a check the AI was supposed to obviate.

It also misses how broad the surface is. The same reasoning shape (plausible SQL that pattern-matches on training data and misses the local rule) produces migrations that lock production at scale, upserts that assume a UNIQUE that was never declared, indexes the planner won’t use, and CHECK constraints that allow the state transition the team wanted to forbid. The SQL is syntactically valid in every case; the EXPLAIN is clean; the failure only appears against real data, real concurrency, real volume, or the conventions that live in nobody’s head but the team’s.

What the experienced engineer was doing that CI doesn’t

Before the model wrote that LEFT JOIN, an engineer who had been on the team for a year would have done several things in their head:

Recognized the soft-delete convention from the column naming pattern and added o.deleted_at IS NULL to the join, because the team’s rule is “every query that isn’t an audit filters soft-deletes on every joined table.”
Considered whether the report should attribute orders to soft-deleted users at all (probably not for revenue; possibly yes for retention analysis) and asked the requester instead of guessing.
Recognized o.amount is nullable for refund rows and decided whether refunds count toward the total or not.
Checked whether the orders table has a tenant_id column the multi-tenancy filter requires, and added the predicate the codebase’s row-level security depends on.
Recalled that an earlier version of this exact report shipped six months ago with a JOIN through order_items, doubled every total, and required a backfill, and chose the simpler aggregation deliberately.

None of these checks is in CI. Each is the kind of context the first post in the AI series describes: information the schema doesn’t carry, that lives in the engineer’s head, that the model has no way to read. That post argues for putting as much of that context back into the catalog as possible: declared FKs, column comments, named constraints, conventions. The work raises the floor; it doesn’t change what’s true here. Even with a perfectly described catalog, no schema declaration says “this team’s rule is to filter soft-deletes on every joined table” or “the previous version of this query had a JOIN cardinality bug, don’t repeat it.” Some classes of risk only exist at runtime against real data, real conventions, and real history, and the test suite is the only place those conditions can be reproduced before production.

Confidence is anti-signal

The dangerous dynamic isn’t that the model gets things wrong. It’s that it sounds most confident exactly where it’s most likely wrong. Ask for a basic SELECT with a JOIN and the output is right, no hedging. Ask for NULL semantics in an outer join, timezone arithmetic across a DST boundary, isolation-level behavior under contention, dialect-specific JSON path syntax, or a window-function frame clause, and the output reads with the same calm tone. The probability of correctness has dropped; the prose around it hasn’t.

Human reviewers anchor on tone. There’s no visual signal in the diff that says “this part lives in a region where the training data is sparse and contradictory.” A query that drops rows because of an unintended tie in ORDER BY created_at LIMIT 100 OFFSET 200 reads exactly like a query that doesn’t. A CHECK that lists valid values but doesn’t constrain transitions reads exactly like one that does. The reviewer’s internal calibration (“this part looked sketchy, I’ll dig”) is fed by uncertainty cues that the model doesn’t emit. “Review more carefully” cannot fix this, because the things worth scrutinizing don’t look worth scrutinizing.
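The pagination case is a one-column difference in the diff. A minimal sketch, assuming a hypothetical events table whose rows share timestamps, with SQLite standing in for the real database: adding the primary key as a tiebreaker makes the order total, so pages neither skip nor repeat rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT NOT NULL)")
# 10 rows sharing only 2 distinct timestamps, so created_at alone cannot order them.
conn.executemany(
    "INSERT INTO events (id, created_at) VALUES (?, ?)",
    [(i, "2026-01-01" if i % 2 else "2026-01-02") for i in range(1, 11)],
)

def fetch_page(limit, offset):
    """Return one page of ids, ordered by created_at with id as the tiebreaker."""
    rows = conn.execute(
        "SELECT id FROM events ORDER BY created_at, id LIMIT ? OFFSET ?",
        (limit, offset),
    ).fetchall()
    return [r[0] for r in rows]

# Walk every page and collect the ids in the order they were served.
pages = [fetch_page(3, offset) for offset in range(0, 12, 3)]
seen = [i for page in pages for i in page]
```

A test that walks every page and asserts each id appears exactly once catches the missing tiebreaker only when the fixture contains ties, which is why the fixture above is built from duplicate timestamps.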

Volume changes the math

The bugs above existed before AI. What changed is the rate. A team that used to ship eight migrations a week now ships eighty, because the cost of writing one collapsed and the cost of reviewing one didn’t. The reviewer who used to give each migration ten minutes now gets one; the senior engineer who would have caught the soft-delete bug on her own change has eight more changes in queue waiting for the same scrutiny. The implicit human review didn’t disappear, it got rationed.

The acute version of this is the wrong query that ships to production. The chronic version is schema drift, and it’s worse. Naming conventions decay because nobody enforces them on every PR. Redundant indexes accumulate because the model suggested one without checking what already exists. NOT NULL constraints get dropped because the migration was failing and removing the constraint made it pass. Soft deletes get reinvented every quarter because each AI session starts fresh on the team’s conventions, and “fresh” plus “confident” plus “eighty PRs a week” is how a codebase ends up with three different ways to mark a row deleted, all in production.

None of this is catastrophic in any given week. The compounding cost shows up eighteen months later, when query plans degrade, an ORM upgrade exposes the inconsistencies, or a new engineer can’t tell which of three “current” patterns to follow. The test suite isn’t only catching the acute incident; it’s the artifact that resists the drift, because every assertion the team writes is a convention written down where the next session can’t unlearn it.

Murphy’s Law used to apply intermittently. With AI in the loop, it applies 100% of the time. Every weak corner of the test suite gets exercised on a weekly cadence, and what used to be a rare edge case becomes the load-bearing case.

The failure mode is broader than wrong queries

The soft-delete bug is one shape. The same dynamic produces every other class of database failure AI introduces:

Migrations that lock production at scale. Asked to add a tier column to users, the model produces ALTER TABLE users ADD COLUMN tier TINYINT NOT NULL DEFAULT 0. CI’s empty-database migration test runs in 200ms and passes; in production, on a MySQL version or table that can’t apply the change instantly, the server rebuilds all 50 million rows and the table stays locked for 40 minutes during business hours. The DDL is syntactically valid; the failure is volumetric, and CI is structurally incapable of seeing it.
JOIN cardinality bugs that produce plausible numbers. An AI-generated LEFT JOIN with a predicate placed in the WHERE rather than the ON clause filters out the unmatched side; a JOIN through a bridge table without composite UNIQUE multiplies aggregations. The result set has the right shape, the row count is plausible, the number is wrong. Catching it requires a test that asserts the result against a known dataset where the failure mode would change the count.
Upserts assuming undeclared constraints. ON CONFLICT (email) DO UPDATE is correct only if email actually has a UNIQUE constraint. Without it, PostgreSQL throws the moment the planner sees no usable arbiter index; in MySQL, the equivalent INSERT ... ON DUPLICATE KEY UPDATE silently inserts duplicates, because with no unique key there is nothing to conflict on. The model writes the upsert from the column name and the question, doesn’t check the catalog for the constraint, and the test database has no concurrent writers exercising the path.
CHECK constraints that look right and aren’t. Asked to “prevent transitioning from cancelled to active,” the model emits CHECK (status IN ('pending','active','cancelled')), which lists the valid values but doesn’t constrain transitions. The constraint is syntactically right, semantically wrong, and indistinguishable from a working constraint until an UPDATE in production succeeds where it shouldn’t.
Indexes the planner will ignore. Asked to speed up WHERE LOWER(email) = ?, the model adds CREATE INDEX ON users (email). The planner can’t use it for the function call; the query stays slow; the EXPLAIN was never inspected because the index was the visible “fix.”
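The ignored-index case from the list above can be asserted directly against the plan. This is a sketch under assumptions: SQLite and its EXPLAIN QUERY PLAN stand in for the production planner, the index names are made up, and the plan text differs by engine, so a real suite would match whatever its own database prints.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
# The "visible fix": a plain index on the raw column.
conn.execute("CREATE INDEX idx_users_email ON users (email)")

QUERY = "SELECT name FROM users WHERE lower(email) = ?"

def plan(sql):
    """Return the human-readable detail strings from EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, ("a@example.com",)).fetchall()
    return " | ".join(row[-1] for row in rows)

plan_before = plan(QUERY)  # plain index on email: expected to be a full scan

# Index the same expression the query filters on.
conn.execute("CREATE INDEX idx_users_lower_email ON users (lower(email))")
plan_after = plan(QUERY)
```

The assertion worth keeping in CI is on the plan, not on the existence of the index: the first index exists and is never used.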

Every one of these is detectable. None of them is detected by the kind of test most teams run.
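As one example of what detection can look like, the upsert case reduces to running the statement against the declared schema. The sketch below uses SQLite, which rejects an ON CONFLICT target with no matching constraint much as the article describes for PostgreSQL. It does not reproduce the MySQL behavior. The users table and the upsert_twice helper are hypothetical.

```python
import sqlite3

UPSERT = """
    INSERT INTO users (email, name) VALUES (?, ?)
    ON CONFLICT (email) DO UPDATE SET name = excluded.name
"""

def upsert_twice(create_table_sql):
    """Run the same upsert twice on a fresh database.

    Returns the resulting row count, or the error message if the database
    rejects the statement.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(create_table_sql)
    try:
        conn.execute(UPSERT, ("a@example.com", "first"))
        conn.execute(UPSERT, ("a@example.com", "second"))
    except sqlite3.Error as exc:
        return str(exc)
    return conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]

# Same statement, two schemas: only the second declares the constraint.
without_unique = upsert_twice("CREATE TABLE users (email TEXT, name TEXT)")
with_unique = upsert_twice("CREATE TABLE users (email TEXT UNIQUE, name TEXT)")
```

On an engine that accepts the statement without the constraint, the same test still works if it asserts the row count after the second write instead of expecting an error.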

What “test the database” actually means today

For most teams, “we test the database” means one of two things:

The application’s unit tests use a real database (often SQLite or an empty Postgres) for fixtures, and they pass.
CI runs db:migrate against a fresh empty database before the test suite, and it doesn’t error.

Neither is testing the database. The first tests the application’s happy path with a database underneath. The second tests migration syntax. Both leave the entire surface above untested. The class of failure AI introduces - semantically valid SQL that misbehaves under realistic conditions - passes both checks every time.

What’s missing is a layer of tests that asserts against the database’s actual behavior on representative data: the migration finishes within a duration budget; the constraint rejects the values it claims to; the query, run against a known fixture, returns the expected count and the expected aggregate; the index in the EXPLAIN is actually used; the schema invariants the team thinks they have are present and behave the way the team thinks they do. These categories of test exist. Tools for each have existed for years. Most teams have never written them. Pre-AI, the teams that didn’t write them got away with it because the engineer writing each migration and each query was applying the checks in their head: slowly, with friction, sometimes wrong, but applying them. Post-AI, that engineer is increasingly not in the loop, and the only thing left is what’s actually written down.
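For the "constraint rejects what it claims to" category, a behavioral test is short. The sketch below is one way to do it, not the only one: it assumes a hypothetical subscriptions table, uses SQLite in place of the production engine, and enforces the forbidden cancelled-to-active transition with a trigger, since a CHECK on the column can only see the new value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subscriptions (
        id INTEGER PRIMARY KEY,
        status TEXT NOT NULL
            CHECK (status IN ('pending', 'active', 'cancelled'))
    );
    -- The CHECK above only limits values. This trigger guards the transition.
    CREATE TRIGGER forbid_reactivation
    BEFORE UPDATE OF status ON subscriptions
    WHEN OLD.status = 'cancelled' AND NEW.status = 'active'
    BEGIN
        SELECT RAISE(ABORT, 'cancelled subscriptions cannot be reactivated');
    END;
    INSERT INTO subscriptions VALUES (1, 'cancelled'), (2, 'pending');
""")

def try_update(sub_id, new_status):
    """Attempt a status change and return True if the database accepted it."""
    try:
        conn.execute("UPDATE subscriptions SET status = ? WHERE id = ?",
                     (new_status, sub_id))
        return True
    except sqlite3.DatabaseError:
        return False

reactivated = try_update(1, "active")   # forbidden transition
activated = try_update(2, "active")     # allowed transition
bad_value = try_update(2, "archived")   # rejected by the CHECK itself
final_status = conn.execute(
    "SELECT status FROM subscriptions WHERE id = 1").fetchone()[0]
```

The test exercises the rule from both sides: one write that must fail, one that must succeed. A constraint that rejects everything passes the first assertion and fails the second.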

Part 2 covers what to test in each category and the tools that exist for each.

When this doesn’t apply

The argument is conditional: if AI is writing any portion of your data layer, you need this. Cases where the conditional doesn’t hold:

AI isn’t writing your migrations or queries. The team uses AI for application-level boilerplate; SQL and schema changes are still hand-written. The implicit human review is intact for the data layer, and the case for tests is the same as it was before.
The database is small and single-writer. A 200-row admin table maintained by one service. Lock duration on ALTER is microseconds, JOIN cardinality is unambiguous, the surface AI can hurt is small enough to manage by reading every diff.
The work is read-only and a domain expert validates each result. An analyst uses AI to draft queries and reviews each result against expectations. The AI is generating drafts, not shipping queries. The domain expert is the test, the same way the senior engineer used to be.
The data is throwaway. Dev environments, ephemeral analytics, throwaway scripts. The cost of being wrong is low and the failures are reversible.

For everything else (production-facing systems, multi-writer schemas, data the business depends on) the conditional holds.

The bigger picture

Automation moves verification work; it doesn’t eliminate it. Pre-AI, “we don’t really test the database” worked for most teams because the engineer writing each migration and each query was the test, applying dozens of unwritten checks per change. The check was slow, expensive, and unreliable, but it existed. Post-AI, the engineer is increasingly not in the loop, and the team’s database verification is whatever the test suite explicitly asserts. The useful choice isn’t whether to verify; it’s whether to verify in CI or in production.

The picture worsens with agents in the loop. An agent that runs migrations, executes restores, or modifies schemas under its own permissions doesn’t have an implicit human review even at the moment of execution. A model that’s right 99.9% of the time, given a thousand operations a week, gets about one operation wrong every week (and “right 99.9%” is generous). The test suite is what makes any of that safe; the cost of building it is a fraction of the cost of one production incident, and AI made the math explicit.

Part 2 covers what each layer has to assert and the frameworks (Squawk, pgTAP, Testcontainers, lock-duration probes, result-regression diffs) that already exist for each category.

What AI Gets Wrong About Your Database

Tue, 03 Feb 2026 00:00:00 +0000

TL;DR

LLMs generate SQL from the catalog (column names, types, whatever constraints have been declared), and the silent-failure rate is set by the gap between what the schema describes and what the database actually contains. Close the gap with declared FKs, column comments, named constraints, and enforced conventions: the same work that makes the schema describe itself to any reader.

An analyst asks the assistant for total revenue per enterprise customer for Q1. The model reads the catalog (customers, orders, order_items, subscriptions), generates a four-table JOIN, applies what looks like the right status = 1 filter on subscriptions and a created_at >= '2026-01-01' predicate, and returns a number. $4.2M.

The number is $1.4M too high. order_items is joined through a promotions bridge that multiplies rows for any order with a stacked discount, and the bridge has no UNIQUE constraint that would have stopped the multiplication. status = 1 on subscriptions means “pending,” not “active” (the column is a TINYINT reused with different semantics across tables, with no comment to disambiguate). The date filter only constrains subscriptions.created_at, so historical orders attach to current subscriptions. The query ran in 80ms. EXPLAIN looked clean. The result set had the right shape. Nothing about it said it was wrong.
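The bridge fan-out is the easiest of the three to reproduce on a few rows. The schema below is a guess at the shape, not the real one: item_promotions is a hypothetical bridge with no UNIQUE constraint, and SQLite stands in for the production database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE order_items (id INTEGER PRIMARY KEY, order_id INTEGER, price INTEGER);
    -- Bridge with no UNIQUE constraint: one row per discount stacked on an item.
    CREATE TABLE item_promotions (item_id INTEGER, promotion_id INTEGER);
    INSERT INTO order_items VALUES (1, 100, 40), (2, 100, 60);
    INSERT INTO item_promotions VALUES (1, 7), (1, 8);  -- item 1 has two discounts
""")

# Joining through the bridge repeats item 1 once per promotion.
naive_total = conn.execute("""
    SELECT SUM(i.price)
    FROM order_items i
    LEFT JOIN item_promotions p ON p.item_id = i.id
""").fetchone()[0]

# Collapse the bridge to one row per item before joining.
safe_total = conn.execute("""
    SELECT SUM(i.price)
    FROM order_items i
    LEFT JOIN (SELECT item_id, COUNT(*) AS promo_count
               FROM item_promotions GROUP BY item_id) p
           ON p.item_id = i.id
""").fetchone()[0]
```

Both queries return a single plausible number. Only the fixture with a stacked discount tells them apart.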

A senior engineer who knows the schema would have caught all three. They know the bridge multiplies rows, they know status is overloaded across tables, they know which created_at belongs to which entity. The model has none of that internal context. It has only what the catalog tells it. The rest of this post is about what’s missing from the catalog and how to put it back, so the model gets the same affordances the senior engineer relies on.

The obvious fix is “give the model more context: connect MCP, dump the schema into the prompt, fine-tune on the company’s queries.” Each helps at the margin. None of them closes the gap in the opening scenario, because the model’s mistake wasn’t a context-window problem. It was working from a description of the database that didn’t contain the information it needed. Even a perfect, fully-loaded schema describes the contract the database enforces on writes, not the meaning the data carries or the parts of it that live entirely outside the catalog. Every model also has a hallucination floor against any prompt, no matter how complete: asked for customers.email, it’ll sometimes produce customer_email because that’s the more common pattern in the training data; asked to join through a bridge, it’ll sometimes invent a column that doesn’t exist. More context lowers the rate; it doesn’t drive it to zero.

This post is the index of where those gaps live. Each issue gets a brief description and a link to the dedicated post that covers the mechanics in full. The filter throughout: a knowledgeable human handles it because of context that isn’t in the schema. The same context, encoded in the catalog, is what the model needs.

Caveat first: the verification gap nothing in this article fixes

Before any of the layers, the ceiling. Application code has a build step. The type checker rejects bad assignments, the compiler rejects ill-formed programs, unit tests run in CI, integration tests catch runtime mismatches. If it compiles and the tests pass, the mistake is more likely in the test than in the code. The default assumption is that the pipeline would have caught an obvious bug. SQL has none of that. The parser accepts any syntactically valid query, regardless of whether it matches the question the human was trying to answer. EXPLAIN tells you the plan’s cost, not whether the predicates are aimed at the right columns. The database compiles, runs, and returns a result whether the query is right or not.

A handful of teams write meaningful tests for their queries (dbt tests, Soda checks, data-diff regressions, assertions against known inputs on every PR). Most don’t. Most production SQL has no test coverage in the sense application code does: no “given this, expect that” assertion, no diff against last week’s numbers on a representative dataset, no guard that fails a PR when a predicate changes the result set in an unexpected way. Code review is the verification step, and code review on SQL is usually “does this look right”, which is exactly the check AI-generated output is designed to pass.

That’s the ceiling. The reviewer who can catch a wrong AI-generated query is, by definition, the reviewer who already knew enough to write the right query, which is exactly the value the model was supposed to provide. A richer catalog narrows the window where this matters: more constraints to violate, more types to mismatch, more comments to flag a wrong predicate. It doesn’t close the window. Everything below is about pushing the floor up under that ceiling.

A note on what “reads the catalog” actually means

The model doesn’t always read the catalog on the first attempt. With most setups (Copilot in the IDE, ChatGPT with a database tool, MCP-backed agents) the first pass is pattern-matching against the question, the table names mentioned in chat, and whatever the training data suggests. The catalog gets queried lazily, often only after a syntax error sends the model back for another try. Silent-failure SQL never errors, so the catalog never gets read at all.

Everything below describes what’s missing when the model reads the catalog. In practice, the failure rate is higher than the layers alone predict, because the catalog is the ceiling of what the model could see, not a guarantee of what it actually sees. The fixes still apply. A richer catalog at least gives the model something useful when it does read.

Three layers the catalog can fix

1. Relationships - how tables connect

Missing foreign keys. Without declared FKs, the only signal connecting tables is column-name matching. That works for user_id → users.id and breaks on the column vocabulary every legacy schema accumulates: creator_id, modified_by, owner, assigned_to, ref_id, parent. The FK is the one machine-readable statement of how tables actually connect, and every assistant that reads information_schema falls back to guessing without it. Foreign Keys Are Not Optional.
Bare id primary keys. table_a.id = table_b.id is syntactically valid SQL between every pair of tables in the database, so the model can construct nonsense joins that return rows. With mixed PK strategies coexisting (older services on BIGINT AUTO_INCREMENT, newer ones on CHAR(36) UUID), joining a UUID id to a BIGINT id silently casts in MySQL and returns zero rows or false matches with no error. Generic id Primary Keys.
Polymorphic references. A resource_id column whose target table depends on a sibling resource_type discriminator ('order' → orders, 'invoice' → invoices) can’t be enforced as an FK and looks like a normal column. The correct query needs a conditional JOIN or UNION pattern the model won’t generate without being told. Polymorphic References.
TEXT/JSON columns. Column type JSON says nothing about the keys inside; the actual shape lives in a serializer class six repos away. JSON_EXTRACT paths the model writes from the column name and the question match zero rows once the producer renamed a key two years ago, and old generations of payloads coexist in the same column with no version field to dispatch on. TEXT and JSON Columns.
Cross-database references. A service whose account_id points at alpha.businesses.id in another schema is invisible to a model scoped to one connection (the default for most MCP setups). The reference exists in application code or in views, not in the catalog of either database, so neither end describes it.
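For the polymorphic case in the list above, the pattern the model has to be told about looks roughly like this. The notes table and its columns are hypothetical, and SQLite stands in for the real engine. The point is only the shape: one branch per target table, each guarded by the discriminator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders   (id INTEGER PRIMARY KEY, total INTEGER);
    CREATE TABLE invoices (id INTEGER PRIMARY KEY, amount_due INTEGER);
    -- resource_id points at orders or invoices depending on resource_type.
    CREATE TABLE notes (id INTEGER PRIMARY KEY, resource_type TEXT,
                        resource_id INTEGER, body TEXT);
    INSERT INTO orders   VALUES (1, 250);
    INSERT INTO invoices VALUES (1, 900);
    INSERT INTO notes VALUES (1, 'order', 1, 'rush'), (2, 'invoice', 1, 'overdue');
""")

# One branch per target table, each guarded by the discriminator.
RESOLVED = """
    SELECT n.id, n.resource_type, o.total AS amount
    FROM notes n JOIN orders o ON o.id = n.resource_id
    WHERE n.resource_type = 'order'
    UNION ALL
    SELECT n.id, n.resource_type, i.amount_due AS amount
    FROM notes n JOIN invoices i ON i.id = n.resource_id
    WHERE n.resource_type = 'invoice'
    ORDER BY 1
"""
resolved = conn.execute(RESOLVED).fetchall()

# Without the discriminator, note 2 (an invoice note) matches order 1 too.
unguarded = conn.execute(
    "SELECT n.id FROM notes n JOIN orders o ON o.id = n.resource_id ORDER BY n.id"
).fetchall()
```

The unguarded join is the query a model writes from column names alone: it returns rows, raises no error, and attaches an invoice note to an order that happens to share its id.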

2. Meaning - what values actually mean

Polysemic types and data drift. TINYINT NOT NULL accepts 1 meaning “active” in one table, “pending” in another, “has been processed” in a third. Soft-delete coverage is partial across tables; VARCHAR dates carry multiple format generations in the same column; sentinel rows like user_id = 0 for “anonymous” or email = 'DO_NOT_USE@test.com' get treated as real data. Copilot ranked the test row as the top customer with $99,999 in revenue because it had the highest total. Visible to anyone querying the table for years, invisible to anyone reading only the DDL. Reading the Schema Is Not Reading the Data.
Legacy schema drift. tmp_orders is the main orders table; old_price is the current price; flag1 means something nobody remembers. The model reasons from the names (“this is staging, prefer the non-tmp table; this is historical, ignore for current queries”) and each reasonable inference is wrong in this specific schema. Legacy Schemas Are Sediment.
Missing column comments. The lowest-cost fix in the entire schema-as-context surface, and almost universally absent. status TINYINT COMMENT 'Order lifecycle: 1=pending, 2=processing, 3=shipped, 4=delivered, 5=cancelled' is the difference between a model that knows and a model that guesses, and adding it is a pure-metadata operation with zero downtime. Comment Your Schema.
NULL semantics. The catalog says a column is nullable; it doesn’t say whether NULL means “unset,” “not applicable,” “still in progress,” or “data lost during the 2019 migration.” A knowledgeable human knows what NULL signifies on each column from context that lives outside the catalog; the model has no such reflex and writes predicates that work for the non-null path. The fix is the same as polysemic types: encode the meaning in a comment. NULL and Three-Valued Logic.
Business rules outside the schema. “Active customer” is status = 'active' to one team, a last_login within the last 90 days to another, account_balance > 0 to a third. Discount logic, approval workflows, regulatory carve-outs all live in application code, queue workers, or a Confluence page. The fraction the model can read is whichever fraction the team chose to put in the database. Encoded as CHECK constraints, generated columns, views, or stored procedures, the rule becomes part of the schema and visible to every reader. Where Business Logic Lives.
Schema as the source of truth. The catalog is only useful as self-documentation if it’s the source of truth. ORM-heavy codebases split the data model across the migration, model class, serializer, fixtures, and any query helpers, and the version the model class describes can drift from what the schema actually enforces (CHECK constraints the model class doesn’t know about, triggers that mutate after insert, generated columns treated as regular fields). Schema-first tools (sqlc, Drizzle, jOOQ) keep the database authoritative; ORM-first frameworks bury constraints in code the model has no signal to read. ORMs Are a Coupling.
Inconsistent conventions. userId, user_id, and UserID referring to the same entity across tables built by different teams in different eras. Mixed PK strategies, partial soft-delete adoption, ambiguous boolean prefixes. Every inconsistency forces the model to guess which variant each table uses, and the senior engineer who knows the per-era convention from history is the only one closing the gap. Schema Conventions and Why They Matter.
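The NULL item above is easy to reproduce. This is a minimal sketch using SQLite from Python; the orders table and cancel_reason column are invented for illustration, and the three-valued-logic behavior shown is standard SQL rather than anything SQLite-specific.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, cancel_reason TEXT);
    INSERT INTO orders (order_id, cancel_reason) VALUES
        (1, NULL),        -- never cancelled: NULL means "not applicable" here
        (2, 'fraud'),
        (3, 'customer');
""")


def count(sql: str) -> int:
    return conn.execute(sql).fetchone()[0]


# "Orders not cancelled for fraud", written for the non-null path only.
# NULL <> 'fraud' evaluates to NULL, so the never-cancelled order is dropped.
naive = count("SELECT COUNT(*) FROM orders WHERE cancel_reason <> 'fraud'")

# The same question with the NULL case handled explicitly.
null_aware = count(
    "SELECT COUNT(*) FROM orders "
    "WHERE cancel_reason IS NULL OR cancel_reason <> 'fraud'"
)
```

Both queries run clean. Only a reader who knows what NULL means in this column can tell which count is the right one.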

3. Integrity - what the catalog actually enforces

Missing UNIQUE constraints. A many-to-many bridge table without a composite UNIQUE silently inflates aggregations on join. That’s the row-multiplication failure in the opening scenario. ON CONFLICT (email) DO UPDATE only works if email actually has a declared UNIQUE constraint or unique index; without one, Postgres rejects the statement, and MySQL’s ON DUPLICATE KEY UPDATE quietly inserts a second row. The constraints exist in the team’s heads (“these can’t repeat”) but not in the catalog, so the database can’t enforce them and the model can’t read them. Join Cardinality Silent Bugs; Uniqueness and Selectivity.
Type, charset, and collation drift. Two VARCHAR(50) columns with the same name in different tables can have different charsets (utf8mb4 vs latin1) or collations (utf8mb4_general_ci vs utf8mb4_0900_ai_ci), causing joins to silently break equality or fall back to per-row conversion. The information is in information_schema.COLUMNS, but it’s a column nobody reads. The senior engineer knows the charset history from the migrations they were around for; the model has no reflex to check. The self-doc fix is enforcing one charset and one collation per database and documenting (or migrating away from) the legacy pockets. Schema Conventions covers the enforcement side.
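The bridge-table inflation is small enough to show end to end. The sketch below uses SQLite from Python with invented orders and order_tags tables: one accidental duplicate in the bridge doubles a SUM, and a composite unique index makes the database reject the duplicate instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, total INTEGER NOT NULL);
    CREATE TABLE order_tags (order_id INTEGER NOT NULL, tag TEXT NOT NULL);
    INSERT INTO orders VALUES (1, 100);
    -- The same (order, tag) pair written twice; nothing in the catalog forbids it.
    INSERT INTO order_tags VALUES (1, 'priority'), (1, 'priority');
""")

# Revenue for priority orders: the single order is counted once per bridge row.
inflated_total = conn.execute("""
    SELECT SUM(o.total)
    FROM orders o
    JOIN order_tags t ON t.order_id = o.order_id
    WHERE t.tag = 'priority'
""").fetchone()[0]

# Declare the rule the team had in their heads, then retry the duplicate write.
conn.executescript("""
    DELETE FROM order_tags;
    CREATE UNIQUE INDEX order_tags_unique ON order_tags (order_id, tag);
    INSERT INTO order_tags VALUES (1, 'priority');
""")
try:
    conn.execute("INSERT INTO order_tags VALUES (1, 'priority')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

The inflated query returns 200 for an order worth 100, with no error and no warning. After the index exists, the bad write fails at insert time, which is where the mistake was made.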

What actually helps

The leverage is in making the catalog a richer description of the database, so the model has more to read and the database has more to enforce. Each lever has a real cost; none is free.

Declare the relationships. FKs are the highest-leverage single fix; every assistant that reads information_schema immediately gets the join graph. Cost: orphan cleanup on long-lived tables.
Comment the columns. The single largest gain in benchmarked LLM SQL accuracy comes from semantic descriptions next to the schema. Pure metadata, zero downtime, almost universally absent.
Constrain the writers. CHECK, UNIQUE, and NOT NULL are facts the model can read and classes of bad query the database will reject. Composite uniqueness on bridge tables prevents the multiplication failure in the opening scenario.
Promote what gets queried out of the blob. JSON keys that drive most filters belong in real columns; generated columns are the low-friction path.
Pick conventions and enforce them. Naming, PK strategy, charset, soft-delete pattern; one of each per database, linted on every migration.
Move business rules into the schema where it makes sense. CHECK constraints, generated columns, views make the team’s definitions readable to every consumer of the database, not only the service that owns them.
Treat AI-generated SQL as external input. Profile the columns in the predicates against the actual data before the query ships. SELECT col, COUNT(*) FROM t GROUP BY col ORDER BY 2 DESC LIMIT 20 catches polysemic-TINYINT and sentinel-value mistakes in seconds. For aggregations, sanity-check against an order-of-magnitude expectation.
Read every statement before shipping it. Don’t vibe-code production SQL. If you can’t explain why this LEFT JOIN is LEFT rather than INNER, why this column is in the GROUP BY, or why the predicate is status = 1 instead of status IN (1, 2), you’re trusting the model’s understanding instead of your own. The catalog work above gives the model better signal; it doesn’t replace the human review where each clause has to be read, understood, and justified before the query goes near production.
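The profiling query in the list above is a one-liner to wrap. A minimal sketch, again with SQLite standing in for the production database; profile_column is a hypothetical helper and the accounts data is made up. Table and column names are interpolated into the SQL, so the helper is only suitable for identifiers you already trust.

```python
import sqlite3


def profile_column(conn: sqlite3.Connection, table: str, column: str, limit: int = 20):
    """Return the most frequent values of a column as (value, row_count) pairs."""
    sql = (
        f'SELECT "{column}", COUNT(*) FROM "{table}" '
        f'GROUP BY "{column}" ORDER BY 2 DESC LIMIT ?'
    )
    return conn.execute(sql, (limit,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, status INTEGER NOT NULL);
    INSERT INTO accounts (status) VALUES (1), (1), (1), (2), (2), (9);
""")
profile = profile_column(conn, "accounts", "status")
```

Here the profile shows three values, including a 9 that a predicate like status = 1 OR status = 2 would silently ignore. Whether 9 is a sentinel, a retired state, or a bug is the question to answer before the query ships.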

What self-documentation doesn’t fix

Improving the catalog closes most of the silent-failure surface. Four classes of concern persist regardless of how well-described the schema is, flagged here so the article isn’t read as overclaiming.

Protocol and integration gaps. MCP and other text-to-SQL connectors have their own reliability holes: context-window truncation, no standard error contract, tool definitions that change after confirmation. Mitigation lives in the integration, not the schema.
Agentic blast radius. Read-only assistants are one risk profile; agents that can run DDL or arbitrary writes are another. The Replit incident in 2025 (deleted production data, generated 4,000 fake users to cover it up) was an authorization failure, not a schema failure. Lever: read-only credentials and audit trails.
Comprehension debt. Engineers using AI assistance score measurably lower on comprehension quizzes about the code they shipped. A perfect catalog doesn’t help if the team has lost the mental model of what they’re maintaining.
Adversarial inputs. Text-to-SQL is sensitive to crafted prompts that produce malicious SQL. Mitigation is read-only credentials, query parsers, row-limit caps. Not richer schema metadata.

When the schema-only model is fine

Greenfield schemas with strict conventions. A six-month-old service database with FKs everywhere, every column commented, every enum an ENUM, every date TIMESTAMPTZ. The drift hasn’t accumulated and the conventions are linted on every migration.
Curated demo databases. Sakila, Northwind, Chinook. AI performs dramatically better on these than on any production schema, and benchmarks run on them aren’t predictive of production performance.
Read-only exploration with a domain-aware human in the loop. The model writes the query; the human reads the result and recognizes the wrong answer. The mistake is treating the model’s output as an answer rather than as a draft.
Single-team, single-database workloads. Twenty tables, three engineers, one service writing every row. The model has less to get wrong because there’s less schema to read. Grow the team or the schema by an order of magnitude and the math flips.

The bigger picture

A production database is the smallest version of itself in information_schema. The catalog is what was declared; the database is what’s actually in it. Every gap between the two is a place where a knowledgeable human carries the missing context in their head and an LLM produces plausible-shaped wrong answers (relationships that aren’t constraints, meaning that isn’t named, integrity rules that live in code instead of in the schema). None of these failures are unique to AI. Humans hit them too, more slowly and with more friction, and the friction is what gives experienced engineers a chance to notice. AI removes the friction without closing the gaps the friction was compensating for.

The lever is making the catalog a richer description: declared FKs, column comments, CHECK and UNIQUE constraints, conventional naming, generated columns where the JSON gets queried. The schema describes itself and the database enforces what it describes. Each investment pays off whether or not LLMs are in the loop. The schema gets more useful to every reader, the integrity gets more enforced, and the part of the database that lives in tribal knowledge shrinks. AI is the forcing function that makes the cost of skipping any of this immediately visible.

The Hello-World Procurement Problem: Why LLM Tooling Gets Bought Wrong

Sun, 21 Dec 2025 00:00:00 +0000

TL;DR

A CTO declares “full agentic” off a vendor demo. Without an SME watching the rollout, corruption ships and surfaces a year later when a customer reports a wrong number. With an SME, the work is information infrastructure first (so agents have enough context to make high-probability decisions) and guardrails for the cases where context isn’t enough.

A CTO sits through a vendor demo. A sales engineer types “show me the top ten customers by revenue last quarter” into a prompt and a working SQL query materializes in 30 seconds, runs against a sample dataset, returns plausible numbers. The CTO declares the company is going full agentic. Procurement closes the contract by Friday.

Procurement closes on Friday. The first agents reach production Tuesday. By the end of the quarter, three internal dashboards are running LLM-generated SQL, two of them against the company’s order data. The SQL passes the lint check (EXPLAIN runs, the result set has the right columns), and the dashboards display plausible numbers. Whether the numbers match what they should be is a question nobody on the team is positioned to answer.

Without SMEs

If the agent-generated SQL looks like gold to everyone in the room, the Rounders rule applies: if you can’t spot the sucker in your first half hour at the table, you are the sucker. Without someone in the room who’d catch the polysemic tier column or the undocumented soft-delete convention buried in three tables, the team is approving plausibility on a system optimized to produce it.

If the produced code looks good to you, you’re probably not the SME.

The corruption rate observed in the demo is a lower bound for what the tool produces against real data, often by a multiple. The realities catalogued in What AI Gets Wrong About Your Database (undocumented conventions, polysemic columns, business logic in tribal knowledge, ten-year-old codebases with three “current” patterns) are exactly the regions of input space where the model’s training distribution is sparse and contradictory. Demos run in the dense-distribution sweet spot. Production runs the inverse on every axis.

With nobody positioned to measure the gap, nothing flags it. Corruption is silent by construction. It doesn’t surface as one identifiable bug; it surfaces as drift across many places at once, traced back to LLM-generated code or queries whose authors can’t reconstruct what the model meant. By the time the rate is visible, the corruption has been propagating for weeks or months. The team has too many simultaneous issues to triage one at a time. Backups have rolled past the worst of the window.

The detection mode is external. A customer reports a number that doesn’t match what they expected. An analyst running LLM-powered queries on the company’s data publishes a report that contradicts internal numbers. A regulator asks a question and the answer doesn’t match the previous quarter’s filing. Whatever surfaces it, the failure is now a public one, and the team learning the failure mode is the same team trying to contain it.

With SMEs

The CTO’s declaration doesn’t change. The job changes. With an SME watching the rollout, the work is infrastructure first.

Agents make high-probability decisions when their inputs are dense. That means the schema is documented, polysemic columns are tagged, conventions are written down somewhere the model can reach, the dataset the agent runs against mirrors production rather than a curated subset. The realities that make a mature codebase mature (patterns evolved over years, decisions encoded in column names, exceptions buried in tribal knowledge) are exactly the inputs the agent doesn’t have unless someone puts them there. The SME’s first job is documenting what currently lives in heads. Without that, the agent operates in the sparse regions of its training distribution, and the floor on its corruption rate stays high regardless of how the harness is tuned.

Guardrails are the second piece, for the cases where dense inputs still aren’t enough. Decompose work into chunks small enough to verify. Route checkpoints between chunks to the SME whose domain it is. Audits produce a failure-rate number against ground truth, not a yes/no. Recovery drills test rolling back six months of LLM-generated changes, because that’s the realistic detection horizon for silent corruption. The point is to catch the cases where the agent’s confidence and its accuracy are decoupled, which is where most of the corruption lives.
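An audit that yields a rate instead of a yes/no can be very small. The sketch below assumes the SME has hand-verified a set of named figures (ground truth) and compares the agent-produced figures against them. The function name, the check names, and the 1% relative tolerance are all hypothetical choices.

```python
def failure_rate(generated: dict[str, float],
                 ground_truth: dict[str, float],
                 tolerance: float = 0.01) -> float:
    """Fraction of verified figures the generated output misses or gets wrong."""
    if not ground_truth:
        raise ValueError("an audit needs at least one verified figure")
    failures = 0
    for name, expected in ground_truth.items():
        actual = generated.get(name)
        # A missing figure counts as a failure, as does one outside the tolerance.
        allowed = tolerance * max(abs(expected), 1.0)
        if actual is None or abs(actual - expected) > allowed:
            failures += 1
    return failures / len(ground_truth)


verified = {"revenue_q3": 1000.0, "active_customers": 410, "orders": 980, "refunds": 35}
from_agent = {"revenue_q3": 1200.0, "active_customers": 410, "orders": 980, "refunds": 35}
rate = failure_rate(from_agent, verified)
```

Tracked per release, the number shows whether the harness is improving or drifting, which a pass/fail review cannot.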

Both pieces have to be in place before the deployment goes wide. Once the rate is visible from outside, the SME bench is already triaging incidents instead of building infrastructure, and the architecture won’t grow either piece on its own.

When this doesn’t apply

Small teams. The buyer is the SME, or one degree away. The infrastructure question gets answered by the same person making the rollout call.
Bounded, low-stakes use cases. Personal productivity tooling, draft generation, internal-only knowledge work where corruption is recoverable.
Mature vendor categories. Office suites, established CI/CD platforms, well-trodden CRM tooling. The failure modes are known and the buyer has reference points. New categories are where the asymmetry lives, and that’s exactly where LLM tooling sits in 2026.

The bill arrives later

The productivity dividend the CTO booked off the demo is real, in the sense that the deal closed and the harness shipped. The bill arrives a quarter or two later when a customer surfaces a number that doesn’t match the books, the auditor asks how the model arrived at it, and the team learns the failure mode in public instead of in QA.

Where Your Cloud Bill Actually Leaks: An Audit Nobody Runs

Thu, 13 Nov 2025 00:00:00 +0000

TL;DR

Cloud bills creep up because nobody owns bringing them down. The largest leaks are S3 versioning without a lifecycle policy, backup retention set when the database was a fraction of its current size, cross-AZ traffic on chatty services, lower environments running 24/7 at production sizes, and old workloads on instance generations the cloud now surcharges. An annual one-day audit by one engineer typically recovers a five-figure monthly sum and the savings stop being mysterious.

The S3 bill on a team’s data-lake bucket went from $1,400 a month to $9,800 over six months without anyone deploying anything new. The bucket had versioning enabled in 2022, no lifecycle policy, and a daily ETL job overwriting the same 40,000 objects every morning. Six months of overwrites left each object with roughly 180 stored versions, and the storage charge was for all of them. Two hours on a one-page lifecycle policy reclaimed about $8,000 a month. Versioning had been on for three and a half years; nobody had looked at the line item that broke it down.

“Buy a FinOps tool” is the reflexive answer, and it’s half right. Cost tools surface the bill but don’t fix it. They tell you the storage line is up 40%; they don’t tell you which 40,000 objects are versioned 180 times, which dev environment has been running 24/7 since the previous CTO, or which AZ boundary your chatty cache sits across. The savings live in walking the items.

Audit storage and lifecycle first

Object storage with versioning enabled and no lifecycle policy is the most common large leak in any AWS account. S3 versioning charges for every version of every object indefinitely; a bucket with daily writes to the same keys can carry hundreds of versions per object after a year. The audit takes one query against the ListObjectVersions API or one tab in S3 Storage Lens. For buckets holding derived data (build artifacts, ETL outputs, logs with authoritative copies elsewhere), turn versioning off (S3 can only suspend it once enabled, so clear the existing non-current versions with a one-time expiry rule); that’s cheaper than maintaining a lifecycle policy against it. For buckets that genuinely need versioning, a lifecycle rule expiring non-current versions after 30, 60, or 90 days reclaims most of the cost. Incomplete multipart uploads are the related sweep: failed uploads sit on the bucket forever unless a separate lifecycle rule clears them.
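The two lifecycle rules described above fit in one small document. The sketch below builds it as a Python dict using the field names from the S3 lifecycle configuration API; the rule IDs, the 30-day and 7-day defaults, and the helper name are assumptions to adjust per bucket.

```python
import json


def versioned_bucket_lifecycle(noncurrent_days: int = 30, multipart_days: int = 7) -> dict:
    """Lifecycle configuration: expire old versions and abandoned multipart uploads."""
    whole_bucket = {"Prefix": ""}  # empty prefix applies the rule to every key
    return {
        "Rules": [
            {
                "ID": "expire-noncurrent-versions",
                "Status": "Enabled",
                "Filter": whole_bucket,
                "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
            },
            {
                "ID": "abort-incomplete-multipart-uploads",
                "Status": "Enabled",
                "Filter": whole_bucket,
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": multipart_days},
            },
        ]
    }


policy = versioned_bucket_lifecycle(noncurrent_days=30)
policy_json = json.dumps(policy, indent=2)
```

The resulting JSON is the shape expected by aws s3api put-bucket-lifecycle-configuration, or by boto3's put_bucket_lifecycle_configuration as the LifecycleConfiguration argument. Applying it replaces any lifecycle configuration already on the bucket, so check for existing rules first.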

Check before you disable versioning

Versioning is sometimes the only mechanism preventing data loss from an application bug, a misconfigured deletion policy, or a compliance retention requirement. A bucket that looks like “derived data” today might be the audit log a regulator asks about next year. Check the application’s recovery model and any compliance scope before turning it off.

Backup retention runs second. Most managed databases ship with a default of 7 days, and most teams later bumped it to 30 or 90 days “for safety” when the database was a fraction of its current size, then never revisited it. Snapshot storage above the database’s allocated size is billed separately at object-storage rates. A database that grew from 200 GB to 4 TB while retention stayed at 90 days can have up to roughly 360 TB of snapshots on the line item (90 daily snapshots at 4 TB each, as a ceiling; incremental snapshots come in lower), much of it for backups nobody has restored from. Cross-region snapshot replication, on by default in some compliance configurations, doubles that number. The conversation worth having is which databases need 90 days of point-in-time recovery and which only need 7. The answer is almost never “all of them”.
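The retention arithmetic is worth re-running with current numbers. This sketch computes a ceiling only: it assumes one snapshot per day and that every snapshot is billed at the full database size, which overstates the bill wherever snapshots are incremental. The function name is hypothetical.

```python
def snapshot_ceiling_gb(db_size_gb: int, retention_days: int, cross_region_copies: int = 0) -> int:
    """Upper bound on retained snapshot storage, in GB.

    Assumes one full-size snapshot per retained day. Incremental snapshots
    will come in lower, so treat the result as a worst case.
    """
    return db_size_gb * retention_days * (1 + cross_region_copies)


when_retention_was_set = snapshot_ceiling_gb(200, 90)            # 18,000 GB = 18 TB
today = snapshot_ceiling_gb(4_000, 90)                           # 360,000 GB = 360 TB
today_replicated = snapshot_ceiling_gb(4_000, 90, cross_region_copies=1)
at_seven_days = snapshot_ceiling_gb(4_000, 7)                    # 28,000 GB = 28 TB
```

The same retention setting moved from an 18 TB ceiling to a 360 TB one without anyone changing it, and dropping to 7 days where the recovery model allows cuts the ceiling by more than 90%.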

Temporary and scratch storage is the third item. Buckets named tmp-*, scratch-*, data-export-*, and migration-2023-* get created for one-off jobs and never deleted. EFS file systems mounted for migration work that finished two years ago. Test datasets uploaded for vendor pitches nobody pursued. Logs shipped to a debugging bucket during last summer’s incident. The discipline is a tag-based lifecycle policy: every temporary resource carries an expires=YYYY-MM-DD tag at creation, and a scheduled job deletes anything past its expiry. Same principle for ephemeral compute and infra: TTLs at creation, not retroactive sweeps.
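The scheduled job behind the expires tag reduces to a date comparison. A minimal sketch of the selection step, assuming resources have already been listed into dicts with a name and a tags mapping; that inventory shape and the sample resource names are hypothetical, and the deletion itself is left out.

```python
from datetime import date


def sweep(resources: list[dict], today: date) -> tuple[list[str], list[str]]:
    """Split resources into (past their expiry, missing an expires tag)."""
    expired, untagged = [], []
    for resource in resources:
        expires = resource.get("tags", {}).get("expires")
        if expires is None:
            # No tag means nobody committed to a lifetime; report it, don't delete it.
            untagged.append(resource["name"])
        elif date.fromisoformat(expires) < today:
            expired.append(resource["name"])
    return expired, untagged


inventory = [
    {"name": "tmp-export", "tags": {"expires": "2025-01-31"}},
    {"name": "scratch-ml", "tags": {"expires": "2026-06-30"}},
    {"name": "migration-2023-users", "tags": {}},
]
expired, untagged = sweep(inventory, today=date(2025, 11, 13))
```

Keeping untagged resources in a separate list matters: they are the ones created before the policy existed, and they need an owner's decision, not an automatic delete.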

Database tiering is the fourth storage item, and the cost shows up twice: in steady-state storage charges, and again every time someone touches the cluster. On a 30 TiB RDS cluster left alone as “the archive”, a routine ALTER TABLE to change one column’s datatype kicked off a full table rewrite that ran for a month and cost about $5,000 in IO before completing. The cluster had no active alerts, no recent change requests, and no one watching the bill; the charge accrued the full month before anyone noticed. Hot OLTP storage is the most expensive byte the cloud sells, and tables carrying years of archival rows the application reads less than monthly pay that premium plus a surprise tax on every schema migration. Partitioning by date and moving old partitions to S3, to a slower instance class, or to a column-store warehouse is a one-week project on a known shape. The migration looks unglamorous on a roadmap and gets perpetually deferred until the storage line, or an unexpected four-figure ALTER, crosses a threshold finance flags.

The unifying discipline across all four items is preventive: every storage resource needs an explicit retention policy and ownership tags at creation time, enforced in provisioning code rather than by human attention.

The audit recovers; the wrapped module prevents

A Terraform module that creates an S3 bucket should require lifecycle_rule, owner, and expires (or an explicit retention_class) as inputs and refuse to plan if they’re missing. The same wrapper-module pattern applies to RDS, dev environments, scratch buckets, and one-off compute. Tags applied retroactively only cover what someone remembered to update; tags enforced at the module cover everything provisioned from that point forward, including the infra a future engineer spins up without thinking about cost.

Pin chatty pairs and right-size capacity

Cross-AZ traffic is the silent compute leak. AWS charges roughly $0.01 per GB in each direction for data crossing AZs; on a chatty service that fans out to a cache and a database in different AZs, the round-trip charges add up to more than the instance cost itself within a few months. The fix is placement. Pin the chatty pair to the same AZ when the consistency model allows it. Batch the calls when it doesn’t. Move the cache layer to a per-AZ deployment so each application instance hits its local replica. The audit is one query against VPC flow logs or a glance at the Cost Explorer “Data Transfer” breakdown filtered by AZ.
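A back-of-envelope estimate is usually enough to decide whether a pair is worth pinning. The sketch below takes the roughly one cent per GB rate quoted above and assumes it is charged on both the sending and the receiving side of each transfer. The traffic volume is an invented example, and the current rate for your region should be checked against the price list.

```python
CENTS_PER_GB_PER_SIDE = 1  # the "roughly $0.01 per GB" rate, kept in cents to avoid float drift
BILLED_SIDES = 2           # assumption: charged once leaving one AZ and once entering the other


def monthly_cross_az_usd(gb_transferred_per_month: int) -> float:
    """Estimated monthly charge for traffic between two AZs."""
    cents = gb_transferred_per_month * CENTS_PER_GB_PER_SIDE * BILLED_SIDES
    return cents / 100


# Hypothetical chatty service moving 50 TB a month between app and cache.
chatty_pair = monthly_cross_az_usd(50_000)
```

At that volume the transfer line is about $1,000 a month for traffic that costs nothing once both ends sit in the same AZ.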

Right-sizing is the next item. Instances provisioned for a load test in 2023 that ran at 12% CPU for two years are still on the bill at the size they were provisioned for. AWS Compute Optimizer and the equivalent recommenders in GCP and Azure are accurate enough to act on for the obvious cases without further investigation. The non-obvious cases (memory-bound workloads, spiky workloads, workloads with seasonal peaks, services with strict latency budgets) need a human pass with a week of metrics in front of them. Either way the data is already in the cloud; nothing has to be instrumented.

HA is the third. Multi-AZ on a Postgres primary roughly doubles the instance cost. On services where a five-minute outage is genuinely tolerable (internal tools, batch jobs, dev databases, services with a clear retry path on the caller) the second instance is paying for an SLA the business doesn’t actually need. The conversation worth having is which services have an RPO and RTO that justify the standby. Most don’t. The original architecture review made the call on every service the same way (HA on, by default) and never revisited it as the service catalog grew.

Tune queries on the hot paths

Most items above remove waste in the infra layer. Query tuning and application-side optimization make the existing infra do more work per dollar, and on most systems they’re the largest single cost lever in this article. A single N+1 query in a hot path can put 10x more load on the database than necessary, which shows up as a larger RDS instance, a higher tier in every downstream cache, and more cross-AZ traffic. The infra audit cuts the bill by trimming what isn’t needed. Query tuning cuts it by reducing what’s actually being done.

Pick any production codebase older than eighteen months. At least one of the patterns covered elsewhere on this blog is in it, and almost always more than one: N+1 ORM iteration on a hot route, non-SARGable predicates that defeat any index, indexes built without understanding selectivity, OFFSET pagination past page 50, retry loops without backoff that triple request rate during the exact conditions that caused the original timeout, aggregations recomputed every request that could be cached for thirty seconds, and long-held row locks blocking unrelated work. Each one shows up at the cost layer as more vCPU, more IO, more cache pressure, and more cross-AZ traffic than the workload actually requires.
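The N+1 pattern at the top of that list is the easiest to see in miniature. The sketch below uses SQLite from Python with invented authors and posts tables, and counts statements through a small wrapper. It illustrates the query-count difference only and is not a benchmark.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (author_id INTEGER PRIMARY KEY);
    CREATE TABLE posts (post_id INTEGER PRIMARY KEY, author_id INTEGER NOT NULL);
    INSERT INTO authors VALUES (1), (2), (3);
    INSERT INTO posts (author_id) VALUES (1), (1), (2);
""")

counter = {"queries": 0}


def run(sql: str, params: tuple = ()) -> list:
    """Execute one statement and count it as one round trip."""
    counter["queries"] += 1
    return conn.execute(sql, params).fetchall()


# N+1: one query for the list, then one more per row in it.
per_author = {}
for (author_id,) in run("SELECT author_id FROM authors ORDER BY author_id"):
    per_author[author_id] = run(
        "SELECT COUNT(*) FROM posts WHERE author_id = ?", (author_id,))[0][0]
n_plus_one_queries = counter["queries"]

# Same answer from a single aggregate join.
counter["queries"] = 0
joined = dict(run("""
    SELECT a.author_id, COUNT(p.post_id)
    FROM authors a
    LEFT JOIN posts p ON p.author_id = a.author_id
    GROUP BY a.author_id
"""))
single_queries = counter["queries"]
```

With three authors the difference is four statements against one. With ten thousand it is ten thousand and one, each a network round trip that may also cross an AZ boundary.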

Query tuning is more expensive work than the infra audit: reading the slow-query log, profiling hot paths, and refactoring application code that touches the database. The payback shape is better, though. An infra audit recovers a fixed amount once. A query optimization saves on every future request, scaling with traffic growth.

Sweep sprawl and version surcharges

Lower environments default to running 24/7 at sizes someone picked when production was a quarter of its current size. The cheapest move is scheduled shutdown nights and weekends, where the workload isn’t worldwide and the engineers using it aren’t online at 3am. A 16-hour weekday shutdown plus full weekends recovers about three-quarters of the weekly hours (128 of 168). AWS Instance Scheduler, GCP’s recommender, and a 30-line Lambda all do the job. Lower environments don’t need HA, don’t need the same retention, and don’t need the same instance class.
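The hours recovered by a shutdown schedule are simple to check before committing to one. A small sketch, assuming a five-day working week; the function name is hypothetical.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def hours_off_per_week(weekday_shutdown_hours: int, weekends_off: bool = True) -> tuple[int, float]:
    """Return (hours off per week, fraction of the week off) for a shutdown schedule."""
    off = 5 * weekday_shutdown_hours
    if weekends_off:
        off += 2 * 24
    return off, off / HOURS_PER_WEEK


sixteen_hours, sixteen_fraction = hours_off_per_week(16)  # environment up 8 hours a weekday
twelve_hours, twelve_fraction = hours_off_per_week(12)    # environment up 12 hours a weekday
```

A 16-hour weekday shutdown plus weekends is 128 of 168 hours, about 76%. A more cautious 12-hour shutdown still recovers 108 hours, about 64%. Storage attached to stopped instances keeps billing, so the saving applies to compute hours only.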

Per-engineer dev environments and PR-preview deployments are a related leak. Preview environments that spin up on every pull request and don’t tear down on close. Forgotten branches with attached infra. Personal sandboxes from engineers who left the company two years ago. Same TTL-at-creation discipline as the temp storage section above.

The cloud charges a surcharge on deprecated product versions in two places. EC2 instance generations are the visible one. AWS retired a long list of older EC2 generations and quietly raised the per-hour price on the ones still launchable; eventually they refuse to launch at all. Workloads still running on m4, c4, t2, or r3 generations are paying the surcharge today. Migrating to a current generation is usually a stop-start with a different instance type and a brief test window. The audit is aws ec2 describe-instances --query 'Reservations[*].Instances[*].InstanceType' and a join against the published deprecated-generation list. Same pattern in GCP for retired n1 machines, same in Azure for older v2 series.
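The join against the old-generation list can be a few lines once the instance types are in hand. The sketch below takes the nested list shape that the describe-instances query above produces when emitted as JSON. The family set holds only the four generations named in the text and is an assumption to replace with the provider's published list.

```python
# Families named in the text; replace with the provider's current list.
OLD_GENERATION_FAMILIES = {"m4", "c4", "t2", "r3"}


def flag_old_generations(reservations: list[list[str]]) -> list[str]:
    """Return instance types whose family (the part before the dot) is on the old list."""
    flagged = []
    for instance_types in reservations:
        for instance_type in instance_types:
            family = instance_type.split(".", 1)[0]
            if family in OLD_GENERATION_FAMILIES:
                flagged.append(instance_type)
    return flagged


# Hypothetical output of the describe-instances query, one inner list per reservation.
sample = [["m4.large", "m6i.large"], ["t2.micro"], ["c7g.xlarge"]]
flagged = flag_old_generations(sample)
```

Adding the instance ID and a Name tag to the --query expression turns the flagged list into a migration worklist.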

The same surcharge runs on managed databases and the rest of the managed-storage catalog, less visibly. AWS RDS Extended Support charges per-vCPU-per-hour for Postgres, MySQL, and Aurora major versions past community end-of-life, stepping up each year until the version is forcibly upgraded. Postgres 11 hit that surcharge in early 2024; MySQL 5.7 followed. ElastiCache, OpenSearch managed, and DocumentDB have equivalent timelines. Azure SQL and Cloud SQL apply similar fees on out-of-support versions. EBS gp2 volumes carry a quieter version of the dynamic: gp3 is usually cheaper at the same IOPS budget even though gp2 isn’t formally deprecated. The audit is aws rds describe-db-instances joined against the engine’s published support timeline. The major upgrade was going to happen eventually; the surcharge puts a deadline on it.

Unused infra is the easiest sweep and the smallest line item per resource. EBS volumes left detached after the instance was terminated, billed monthly for storage that nothing reads. Elastic IPs not associated with any instance, billed hourly for the privilege of holding them. NAT gateways carrying near-zero traffic at the same hourly base rate as one carrying terabytes. Load balancers with zero healthy targets. RDS snapshots from databases deleted years ago. CloudWatch log groups with no retention policy that have been growing since 2019. The audit script is twenty lines per resource type. The savings are small per item and large in aggregate, and the cleanup is the safest of any item in this article. Nothing in production depends on a detached volume by definition.
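As an illustration of how short these per-resource audit scripts are, here is a sketch for detached EBS volumes. It filters records shaped like the entries in aws ec2 describe-volumes output, where State is "available" for a volume not attached to anything. The sample records and the helper name are invented.

```python
def detached_volumes(volumes: list[dict]) -> list[tuple[str, int]]:
    """Return (volume id, size in GiB) for volumes not attached to any instance."""
    return [
        (volume["VolumeId"], volume["Size"])
        for volume in volumes
        if volume.get("State") == "available" and not volume.get("Attachments")
    ]


# Hypothetical records in the shape of describe-volumes "Volumes" entries.
sample_volumes = [
    {"VolumeId": "vol-aaa", "Size": 500, "State": "in-use",
     "Attachments": [{"InstanceId": "i-123"}]},
    {"VolumeId": "vol-bbb", "Size": 2000, "State": "available", "Attachments": []},
    {"VolumeId": "vol-ccc", "Size": 100, "State": "available", "Attachments": []},
]
orphans = detached_volumes(sample_volumes)
orphaned_gib = sum(size for _, size in orphans)
```

Snapshotting a volume before deleting it is a cheap safety net for the rare case where someone does come looking.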

Shared infra is the last item and the hardest call. Centralized logging, metrics, CI runners, internal developer platforms, and shared lower-environment clusters all start as obvious wins because the per-team cost is low and the operational burden is borne by a platform team. Years later the per-team cost has crossed the threshold where running it locally to the team that owns the workload would be cheaper, but the original decision is rarely revisited. The conversation worth having is per-team cost vs. operational complexity, not absolute cost. Centralization wins on operations and loses on per-team economics at scale, and the right answer for a 50-engineer org is rarely the right answer for a 500-engineer one.

When this discipline isn’t worth running

Three conditions make the cost sweep overkill. Very small accounts where the total monthly bill is under a few thousand dollars don’t repay the engineering time it takes to walk the list. Workloads in a hard regulatory regime where retention, HA, and cross-region replication are externally mandated have less room to cut than the article suggests; the audit still surfaces the line items, but the action set is smaller. And teams in a steep growth phase, where the engineer’s time on cost work costs more than the savings it recovers, should defer the sweep until the growth stabilizes. The discipline pays back at sustained scale, in established workloads, with engineering time available to allocate.

Make it an annual habit

The exercise is re-running old decisions against current numbers, in the order where the gap is biggest. The S3 lifecycle that was reasonable when the bucket held 40,000 objects, the backup retention that was reasonable when the database was 200 GB, the m4 instance that was a fine choice in 2021. None of those decisions were wrong when they were made; the numbers underneath them changed by an order of magnitude and nobody re-ran the math. The audit takes longer the first time because nothing is documented. By year three it’s a quarterly quick-pass, and the line item nobody used to read is the one finance forwards as good news.

TEXT and JSON Columns: Where the Schema Goes to Hide

Thu, 25 Sep 2025 00:00:00 +0000

TL;DR

A TEXT or JSON column moves the schema out of the database catalog and into application code; the data inside has a shape, but the DDL won’t tell you what it is. Promote the fields that actually get queried into real columns, and treat the rest as genuinely opaque.

An AI assistant is asked to “find customers who upgraded to enterprise in the last quarter.” It reads the catalog, finds api_logs(id, endpoint VARCHAR, payload LONGTEXT, created_at DATETIME), and generates the reasonable query:

SELECT JSON_EXTRACT(payload, '$.action') AS action, created_at
FROM api_logs
WHERE JSON_EXTRACT(payload, '$.action') = 'upgrade'
 AND JSON_EXTRACT(payload, '$.plan') = 'enterprise'
 AND created_at >= NOW() - INTERVAL 90 DAY;

Runs clean. Returns zero rows. The actual key was renamed from action to event.type two years ago when the team adopted a shared event schema; new rows match $.event.type, old rows still match $.action, and no one migrated the historical data because it wasn’t queryable anyway. Neither column nor catalog said any of this. The query is syntactically perfect, semantically correct for the key it guessed, and wrong because the key doesn’t exist in most of the rows.

The obvious fix is “switch to JSONB, validate with a JSON schema, add a GIN index.” Each one helps at the margin and none of them closes the gap. JSONB tells you the blob is valid JSON, not what keys are in it. CHECK constraints with JSON_SCHEMA_VALID (MySQL) or jsonb_matches_schema (the pg_jsonschema extension for PostgreSQL) work prospectively, but the six years of rows already in the table were written against five format generations and no validator reaches back in time. A GIN index accelerates key lookups, but only if you know which keys to look up. The problem isn’t the storage format. The schema emigrated to application code, and changing the column type doesn’t bring it back.

What leaves the catalog when the column becomes a blob

DDL is the contract between the database and everything that reads it. A typed column says “this value is an integer between −2³¹ and 2³¹−1, and here’s the index I’ve built over it.” A TEXT or JSON column says “this value is a string the application decided on, and the application can tell you what that means.” The second contract is thinner in ways that compound.

Readers can’t discover the shape from the schema. information_schema.COLUMNS for a JSON column returns COLUMN_TYPE = 'json' and nothing else. Every tool that reads catalog metadata (MCP servers, ERD generators, typed-client code generators, AI assistants, new engineers running \d+) sees a blob. The shape lives in the serializer class, the protobuf definition, the TypeScript interface, or nowhere. Whichever of those the reader happens to find is the shape they’ll assume. See Comment Your Schema for the lowest-effort way to leave a trail, but comments can describe the shape; they can’t make the catalog enforce it.

Generational drift is silent. Year one the payload is {action, user}. A migration adds nested metadata: {action, user, metadata: {source}}. A rewrite flattens and renames: {event: {type, user_id}, source}. A new service standardizes with a version field: {version: 3, event: {...}}. All four versions are sitting in the same column with nothing to distinguish them at read time except the keys they happen to have. A JSON_EXTRACT path written against today’s producer hits the newest generation and silently misses the older ones. The failure mode is exactly the one described in Legacy Schemas Are Sediment: the schema’s history is compressed into the data, and the data can’t decompress itself.
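
To make the drift concrete, here is a minimal Python sketch of a reader that tolerates the four generations listed above. The sample payloads and the event_type helper are hypothetical, built only from the shapes named in this section; a real producer history will usually have more variants than this.

```python
from typing import Any, Optional


def event_type(payload: dict[str, Any]) -> Optional[str]:
    """Return the event type from any of the four payload generations."""
    # Generations 3 and 4: {event: {type, user_id}, ...}, with or without a version field.
    event = payload.get("event")
    if isinstance(event, dict) and "type" in event:
        return event["type"]
    # Generations 1 and 2: flat {action, user, ...}.
    if "action" in payload:
        return payload["action"]
    # Unknown shape: report it as missing rather than guess.
    return None


# One hypothetical row per generation, all describing the same upgrade.
rows = [
    {"action": "upgrade", "user": 7},
    {"action": "upgrade", "user": 7, "metadata": {"source": "web"}},
    {"event": {"type": "upgrade", "user_id": 7}, "source": "web"},
    {"version": 3, "event": {"type": "upgrade", "user_id": 7}},
]

# A path written against the oldest shape versus the normalizing reader.
naive = [r for r in rows if r.get("action") == "upgrade"]
normalized = [r for r in rows if event_type(r) == "upgrade"]
print(len(naive), len(normalized))
```

The filter on action alone finds two of the four rows; the normalizing reader finds all four. The logic has to live somewhere, and when it is not in the database, every reader re-derives it.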

Writes are untyped. Without CHECK constraints or a JSON-schema validator, the writer is the only guardrail. A service deployed last Tuesday that emits amount as the string "9900" instead of the integer 9900 silently poisons the column. Downstream queries comparing amount > 1000 work on the older rows and misbehave on the poisoned batch, because the extracted value is now a string and the comparison is no longer numeric (depending on the engine, it is lexicographic or ordered by JSON type). It is the same class of mismatch a typed column would reject on INSERT.
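
One cheap way to spot a poisoned batch is to count the JSON types a key actually holds. The sketch below is illustrative Python, not engine behaviour: json_type_profile and the sample rows are assumptions, and the string comparison at the end shows Python's lexicographic ordering as a stand-in for what a non-numeric comparison can do.

```python
import json
from collections import Counter


def json_type_profile(raw_payloads, key):
    """Count the Python type of `key` across raw JSON payload strings."""
    counts = Counter()
    for raw in raw_payloads:
        value = json.loads(raw).get(key)
        counts[type(value).__name__] += 1
    return counts


# Hypothetical rows: two from the old writer, two from the new one.
raw_rows = [
    '{"amount": 9900}',
    '{"amount": 12000}',
    '{"amount": "9900"}',
    '{"amount": "12000"}',
]

profile = json_type_profile(raw_rows, "amount")
print(dict(profile))

# As numbers 9900 < 12000, but as strings "9900" sorts after "12000"
# because the comparison stops at the first differing character.
numeric_order = 9900 > 12000
string_order = "9900" > "12000"
print(numeric_order, string_order)
```

A profile with more than one type for the same key is the signal; which rows are "wrong" is then a question for the producer's commit history.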

The planner is working blind. Row-count estimates on JSON_EXTRACT(payload, '$.event.type') = 'upgrade' have no histogram to consult; the planner falls back to a default selectivity estimate that’s usually wrong. Plans for queries filtered on JSON fields are routinely pessimistic or optimistic by an order of magnitude, and ANALYZE alone can’t fix that, because no statistics exist for the interior of the blob until the path is promoted to a generated column or an indexed expression.

Indexes are per-key, not per-column. A functional index on JSON_EXTRACT(payload, '$.event.type') accelerates one path. The next query filters on $.source and scans the table. Generated columns are the cleaner version of this (payload_event_type VARCHAR(50) GENERATED ALWAYS AS (JSON_UNQUOTE(JSON_EXTRACT(payload, '$.event.type'))) STORED), but each one is a schema change with a backfill, and you have to know in advance which keys matter. GIN indexes on JSONB cover arbitrary keys but are large, slow to update, and still don’t tell the reader what keys exist.

Untyped writes + untyped reads = silent schema drift

A TEXT or JSON column accepts anything the writer emits and returns exactly that on read. Two services writing to the same column with slightly different shapes don’t conflict at the database level; they produce a column whose contents depend on which service wrote the row. The divergence is invisible until a query tries to read uniformly across both.

Plausible paths, empty results

Schema-reading LLMs generate JSON_EXTRACT paths the same way they generate column names in a typed schema, by pattern-matching the column name and the question. Asked about “upgrade actions,” the model guesses $.action = 'upgrade' because the English-to-JSON-path mapping is obvious. It has no way to know that the key was renamed, that three generations coexist, or that the canonical name is now buried under two layers of nesting. The catalog gives it a column type of json and nothing else, and the model’s best guess is reasonable and wrong.

The failure pattern is familiar from other schema-hiding designs. Polymorphic references hide which table a foreign-key-shaped column points at; bare id primary keys hide which identifier is being compared; TEXT/JSON columns hide what’s in the column at all. All three are cases where the LLM generates a plausible query against a schema that isn’t telling it enough, and the query returns plausibly-shaped but semantically empty results.

The fix, and where it stops being free

The lever is being honest about what’s inside and picking the right storage per field.

Promote fields that get queried. If the application filters on event.type more than occasionally, that’s a real column. Generated columns are the low-friction middle path: derive a typed, indexable column from the JSON, keep the raw payload as the audit trail.

ALTER TABLE api_logs
 ADD COLUMN event_type VARCHAR(50) GENERATED ALWAYS AS
 (JSON_UNQUOTE(JSON_EXTRACT(payload, '$.event.type'))) STORED,
 ADD INDEX idx_event_type (event_type);

The trade-off: every promoted field is a migration, and a generated column doesn’t rewrite rows that were written with a different shape; for the older generations event_type is simply NULL. You still need the COALESCE(JSON_EXTRACT(payload, '$.event.type'), JSON_EXTRACT(payload, '$.action')) cleanup for those rows, but you do it once, as part of the promotion, rather than in every query.

Enforce new writes with a JSON schema. The pg_jsonschema extension for PostgreSQL and MySQL 8.0’s JSON_SCHEMA_VALID let a CHECK constraint reject writes that don’t match a named schema. Doesn’t fix existing rows; does stop the next silent format change from landing. If the team doesn’t already have a shared event schema, a CHECK constraint is the forcing function that produces one.

Version the payload explicitly. {"version": 3, "payload": {...}} at the top lets every reader dispatch on version instead of inferring it from which keys happen to be present. Doesn’t help rows written before versioning started, but bounds the drift going forward and turns “which generation is this row?” from archaeology into a lookup.

Document what stays inside. Comments on the column (“see github.com/org/events for the schema; versions 1–3 coexist in rows older than 2024-Q2”) won’t replace types, but they give the reader a place to look. Comments on the schema are cheap, in-place, and propagate through every tool that reads the catalog; for genuinely-opaque columns this is the best available signal.

When JSON is actually the right answer

The pattern earns its keep in specific shapes where the alternative (typed columns) is worse.

Truly variable shape per row. User-supplied settings blobs, custom-field configurations, extension points where the keys are genuinely per-tenant or per-user. Modeling each variant as a column produces a wide table full of NULLs; see God Tables for the cost of that direction. The column is honest about being schemaless because the data is schemaless.

Audit payloads nobody queries. Raw API request/response bodies retained for compliance, debug traces, incident forensics. Written once, read by humans one row at a time, never aggregated. The lack of a queryable schema is fine because no query needs one. A sensible default here is to keep the payload compressed and add a small set of typed columns (endpoint, status_code, user_id, created_at) for the predicates the operational queries actually use.

Short-lived staging. Job queues, idempotency cache payloads, outbox entries, where the producer and consumer are deployed together, the payload is read once, and the row is deleted on completion. Drift can’t accumulate in rows that don’t stay around.

Document stores on purpose. PostgreSQL JSONB with a stable schema, validated on write, with functional indexes on the paths that matter. This is a real design; it’s not the unspoken default that most TEXT columns represent. If the team is reaching for JSONB and treating it as a document store, it should look like one (with validation, indexes, and documentation) not like a TEXT column that happens to parse.

The bigger picture

A TEXT or JSON column is a specific architectural choice: move part of the schema out of the catalog, in exchange for cheaper writes and looser contracts between producer and consumer. When the trade is deliberate (genuinely variable data, write-once audit, short-lived buffer) it’s the correct shape. When it’s the path of least resistance because typed columns would require a migration, the cost is deferred to every future reader who has to reconstruct the format from commit history.

Databases are good at enforcing the contracts they know about. The column types are how they know. Every field that matters to a query deserves to be in the part of the schema the database can see; everything else is honestly opaque and should look it. The default drift (“stick it in the payload, we’ll parse it later”) produces columns whose contents nobody fully knows, including the team that wrote them.

Reading the Schema Is Not Reading the Data

Mon, 08 Sep 2025 00:00:00 +0000

TL;DR

A schema describes the shape the database enforces; the data inside follows a second set of conventions (soft-delete coverage, sentinel values, encoding quirks, format drift) that live nowhere the catalog can show. Queries written from the DDL alone run clean and return results that look right and mean something different. Treat the data as a second source that has to be read, sampled, and documented alongside the types.

An engineer (or an AI) writes a query to find pending orders:

SELECT id, total_cents, created_at
FROM orders
WHERE status = 1
 AND created_at > NOW() - INTERVAL 7 DAY;

orders.status is TINYINT NOT NULL. The query runs. Forty thousand rows come back. Most of them shipped days ago. The mistake lives in the column’s other life: status on this table is a boolean is_processed flag where 1 means “has been through the fulfillment pipeline.” The order lifecycle state (pending, processing, shipped, delivered, cancelled) is in orders.state, also TINYINT NOT NULL, also no comments, and whoever read the schema first picked the column whose name they recognized. The DDL was no help; both columns have the same type, the same nullability, and the same look in information_schema. The data was telling the real story, and the data wasn’t read.

The obvious fix is “add comments, use ENUM, lint for ambiguous names.” Each of those helps on new columns and the next migration. None of them touch the existing data, which is where the ambiguity actually lives: forty thousand rows of status = 1 that mean one thing on this table and a different thing on its sibling, ten million VARCHAR dates written by five generations of code in three formats, and a users table where rows with email = 'DO_NOT_USE@test.com' have been on the leaderboard for two years. Fixing forward keeps the problem from growing. Reading the data is how you find out what’s already there.

Five ways the data disagrees with the schema

These are not the exotic cases. They show up in nearly every mature production database, and each one is a place where a schema-only read produces a plausible, wrong query.

TINYINT(1) is polysemic. It stores a boolean flag (is_active, has_seen_onboarding, email_verified), a small enum (lifecycle states, tier levels, priority), a bit-packed byte (eight flags in a single column), or a count that never exceeds 127. All four uses produce identical entries in information_schema. Naming conventions (is_*, has_*, can_* for booleans; _type, _status, _level for enums) are the informal signal, and like every informal signal, they’re applied inconsistently and broken in legacy tables. See Schema Conventions and Why They Matter for the prescriptive side; this is the descriptive reality.

Soft-delete coverage is partial. Some tables have deleted_at TIMESTAMP NULL. Some have is_deleted TINYINT(1) DEFAULT 0. Most have neither, because the original author decided the table didn’t need soft deletes and nobody revisited. A query that correctly filters WHERE deleted_at IS NULL on customers returns the right answer; the same pattern applied to addresses either errors out (column doesn’t exist) or silently matches everything (column exists but is always NULL because the application never writes to it). There’s no global rule to encode and no way to know from the catalog which tables fall in which bucket. You have to read the data, or read the application code that writes to it (which is usually worse).

VARCHAR dates in multiple formats. A column called signup_date VARCHAR(10) is a tell. The first generation of rows has YYYY-MM-DD. A rewrite that switched import vendors introduced MM/DD/YYYY. An international expansion produced DD/MM/YYYY for rows that came in through a specific endpoint and DD-Mon-YYYY for one partner’s CSV imports. All four formats live in the same column. WHERE signup_date >= '2025-01-01' is a string comparison: it matches the first generation correctly and orders the other three by their leading characters, which are a day or a month rather than a year. “25/12/2019” sorts after “2025-01-01” and matches; “15/03/2025” sorts before it and is dropped; the Mon strings land on whichever side their day-of-month puts them. The query returned rows, so the reviewer moved on.
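
The string-comparison behaviour is easy to check outside the database. This is a hedged Python sketch: the sample values are invented, the four format strings mirror the four formats named above, and %b assumes an English locale. candidate_dates returns every date a string could mean, because MM/DD and DD/MM cannot always be told apart from the value alone.

```python
from datetime import date, datetime

# The four formats named in the text, as strptime patterns.
FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d/%m/%Y", "%d-%b-%Y")


def matches_string_filter(value, bound="2025-01-01"):
    """Mimic WHERE signup_date >= '2025-01-01' on a VARCHAR column."""
    return value >= bound


def candidate_dates(value):
    """Return every calendar date the string could mean under the known formats."""
    found = set()
    for fmt in FORMATS:
        try:
            found.add(datetime.strptime(value, fmt).date())
        except ValueError:
            continue  # not this format; try the next one
    return found


samples = ["2025-03-15", "25/12/2019", "15/03/2025", "03/04/2025", "15-Mar-2025"]
for s in samples:
    print(s, matches_string_filter(s), sorted(candidate_dates(s)))
```

A 2019 row passes the filter, two 2025 rows fail it, and 03/04/2025 parses to two different dates. Any migration has to decide what to do with that last class of row, usually from the source system rather than from the string.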

Sentinel values and test data. Row with user_id = 0 means “anonymous.” Row with email = 'DO_NOT_USE@test.com' is a test account that’s been in production for three years because nobody wanted to take responsibility for deleting it. Row with created_at = '1970-01-01 00:00:00' is a backfill where the original timestamp was unknown and epoch zero got written as a placeholder. Every one of these is an intentional violation of the apparent meaning of the column, and every schema-level read treats them as ordinary data. Copilot ranked DO_NOT_USE as the top customer with $99,999 in revenue because the row had the highest total; the test record had been sitting there for years, visible to anyone who queried the table but invisible to anyone who only read the DDL.

Input-convention drift. VARCHAR(255) accepts “Acme Corp,” “ACME CORPORATION,” “Acme Corp.,” “acme corp,” and “ACME  CORP” (two spaces, somebody’s whitespace bug). All five are the same company in different rows. The unique constraint, if it exists, didn’t catch any of them because they’re not byte-identical. Any query that groups or joins on the text field silently double-counts, and not by a small amount: by however much the convention drift is worth. Encoding quirks compound: café in NFC and NFD look identical in the terminal and hash differently; case-folding depends on collation; trailing whitespace varies by source system.
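
A minimal Python sketch of the kind of normalization key that collapses most of these variants. normalize_name is a hypothetical helper and the rules (NFC, case-folding, whitespace collapsing, dropping a trailing period) are assumptions about what counts as "the same" here; abbreviations such as Corp versus Corporation need an alias table, which this does not attempt.

```python
import re
import unicodedata


def normalize_name(raw: str) -> str:
    """Build a comparison key; the stored value itself is left untouched."""
    text = unicodedata.normalize("NFC", raw)   # one Unicode form for accents
    text = text.casefold()                     # case-insensitive comparison
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    return text.rstrip(".")                    # drop a trailing period


variants = ["Acme Corp", "Acme Corp.", "acme corp", "ACME  CORP "]
keys = {normalize_name(v) for v in variants}
print(keys)

# Same visible word, different code points: precomposed vs combining accent.
nfc = "caf\u00e9"
nfd = "cafe\u0301"
print(nfc == nfd, normalize_name(nfc) == normalize_name(nfd))
```

Grouping on the key instead of the raw column is a reporting fix, not a data fix; the rows still disagree, but the count stops lying about it.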

Why the catalog can’t tell you this

information_schema describes the contract the database enforces on writes. That contract is narrow: types, nullability, defaults, constraints, foreign keys. It doesn’t describe what got written before the constraint was added (almost all of it), what gets written by code paths that bypass the ORM (a surprising fraction of it), or what the application decided to write into a column that the database happily accepts because the type matches.

Type compatibility is a floor, not a ceiling. TINYINT NOT NULL excludes strings, NULLs, and integers outside [-128, 127]. It doesn’t exclude 1 meaning five different things in five different tables, because that’s not a type constraint - it’s a semantic one, and the database has no vocabulary for semantics. The same logic applies to NULL handling: the catalog tells you a column is nullable; it doesn’t tell you whether NULL means “unset,” “not applicable,” “still in progress,” or “data lost during the 2019 migration.”

LLMs inherit this limitation directly. A model generating SQL from the catalog sees column names and types, not data distributions. It has no way to tell that status is polysemic across tables, that deleted_at exists on four of the six relevant tables, or that signup_date holds four different formats. The LLM’s best guess is the one a new engineer would make: the schema looks uniform, so the data probably is. Neither is wrong in general; both are wrong often enough in mature databases to produce plausibly-shaped and semantically-hollow query results. This is the generalization of the specific patterns covered in Legacy Schemas Are Sediment; legacy schemas are one source of data drift, and there are others.

Runs clean, returns plausible, means something else

Schema-only queries fail in the quietest way a query can fail. The SQL is syntactically correct. The types match. Rows come back. Some fraction of those rows mean what the author intended, and some fraction mean something else, and there’s no signal at the database level telling you which is which. Reviewers who only look at the query text can’t catch it. The data is where the check has to happen.

The fix is a habit, not a migration

You can’t retroactively enforce a schema on ten years of writes. You can change what the next reader (human or model) has available before they generate the next query.

Profile before you query. Before writing a predicate against an unfamiliar column, run a one-liner: SELECT col, COUNT(*) FROM t GROUP BY col ORDER BY COUNT(*) DESC LIMIT 20. For low-cardinality columns (status, type, flags) this reveals the actual value distribution in thirty seconds and catches the flag-versus-enum mistake before the query ships. For higher-cardinality columns, sample: SELECT col FROM t ORDER BY RAND() LIMIT 50 (on a very large table, bound it to a recent id or date range first, because ORDER BY RAND() sorts every row it is given). The time cost is minutes; the catch rate is substantial.
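
The profiling one-liner can be wrapped so it is run the same way every time. The sketch below uses Python's built-in sqlite3 with an in-memory table as a stand-in for the production database; the orders rows are made up, and profile is a hypothetical helper. Against MySQL or PostgreSQL the SQL is the same but the driver and placeholder style differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, "
    "status INTEGER NOT NULL, state INTEGER NOT NULL)"
)
# Invented sample rows: status is the processed flag, state the lifecycle.
conn.executemany(
    "INSERT INTO orders (status, state) VALUES (?, ?)",
    [(1, 3), (1, 4), (1, 3), (0, 1), (1, 5), (0, 2)],
)


def profile(conn, table, column, limit=20):
    """Return (value, count) pairs for a column, most frequent first."""
    # Identifiers cannot be bound as parameters, so validate them instead.
    if not (table.isidentifier() and column.isidentifier()):
        raise ValueError("unexpected identifier")
    sql = (
        f"SELECT {column}, COUNT(*) FROM {table} "
        f"GROUP BY {column} ORDER BY COUNT(*) DESC, {column} LIMIT ?"
    )
    return conn.execute(sql, (limit,)).fetchall()


status_profile = profile(conn, "orders", "status")
state_profile = profile(conn, "orders", "state")
print(status_profile)
print(state_profile)
```

Two distinct values in one column and five in the other is usually enough to tell the flag from the enum before the predicate is written.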

Comment the columns the DDL can’t describe. A one-line comment on orders.state ('Pending=1, Processing=2, Shipped=3, Delivered=4, Cancelled=5') and on orders.status ('Boolean: 1 if order has been through fulfillment') is the difference between a reader who gets it right and one who guesses. Comment Your Schema covers the mechanics in full; for the flag/enum disambiguation specifically, this is the highest-leverage fix per character of effort anywhere in schema maintenance.

CHECK constraints for new values. CHECK (state IN (1,2,3,4,5)) is the forcing function for the next writer. It won’t clean up existing rows, and it won’t stop a future engineer from reaching for 6, but it will fail loudly when they try, instead of silently accepting a value the readers of the table don’t know about. On nullable columns, CHECK (deleted_at IS NULL OR deleted_at > created_at) catches the backfill-sentinel case.
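
A small demonstration of both constraints failing loudly, again using Python's sqlite3 as a stand-in engine. The table layout, the TEXT timestamps, and the try_insert helper are assumptions for the demo; MySQL 8 and PostgreSQL accept the same CHECK expressions with their own column types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE orders (
        id         INTEGER PRIMARY KEY,
        state      INTEGER NOT NULL CHECK (state IN (1, 2, 3, 4, 5)),
        created_at TEXT    NOT NULL,
        deleted_at TEXT,
        CHECK (deleted_at IS NULL OR deleted_at > created_at)
    )
    """
)


def try_insert(state, created_at, deleted_at=None):
    """Return 'accepted' or 'rejected' instead of raising, to keep the demo short."""
    try:
        conn.execute(
            "INSERT INTO orders (state, created_at, deleted_at) VALUES (?, ?, ?)",
            (state, created_at, deleted_at),
        )
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"


results = [
    try_insert(3, "2025-06-01 12:00:00"),                         # known state
    try_insert(6, "2025-06-01 12:00:00"),                         # undocumented state
    try_insert(5, "2025-06-01 12:00:00", "1970-01-01 00:00:00"),  # epoch sentinel
]
print(results)
```

The second and third writes are refused at the database, which is the point: the writer finds out immediately, not the reader months later.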

Migrate VARCHAR dates when you can afford it. The migration is real work: parse each row, fail loudly on unparseable formats, pick a canonical representation, backfill. Leaving VARCHAR in place guarantees the next query is written against whichever format the author happened to sample. The right-sized fix in the meantime: a comment on the column listing the known formats, and a view that exposes a parsed DATE for the queries that can tolerate loss on the unparseable rows.

Treat data profiling as part of review. When a PR adds a new query, the reviewer’s first question is “does this predicate match the data?”, which requires actually looking at the data, not just the query. For AI-assisted development this is even more load-bearing: the model generated the query from the catalog, so the human review is the only layer that can compare the query’s predicates to the column’s actual contents.

When schema-only reading is fine

Not every database carries this baggage. Three cases where the schema really is the data’s description:

Schemas designed from scratch with strict conventions. New services, greenfield tables, codebases where every column has a comment, every enum is an ENUM type, and every date column is DATE or TIMESTAMPTZ. The drift hasn’t had time to accumulate, and the conventions are enforced by linters on migrations. The failure modes described above can still show up; they show up as bugs that get caught, not as the steady-state of the table.

Small, single-team databases. Twenty tables, three engineers, all the data flowing through one service. Everyone who writes to the table knows what the conventions are; the data drift is small because there are only three writers. The cost of the habit described above exceeds the cost of the drift it catches. Grow the team or the table count by a factor of ten and the math flips.

Analytical warehouses that expect exploration. In a BigQuery, Snowflake, or ClickHouse dataset built for analytics, everyone who queries the data profiles it as a matter of course: sample the column, check the distribution, look for nulls. The profiling habit is already the workflow; the schema is treated as a hint rather than a contract. This is the part of the data stack where reading the data is assumed, and the failure mode is correspondingly rare.

The bigger picture

A production database has two artifacts worth reading: the DDL the engine enforces, and the data the engine happens to hold. The first is legible, indexed, and comes with tooling; the second is tribal knowledge, distributed across rows written by years of code, and invisible to every tool that stops at the catalog. Everyone from new engineers to LLMs reads the first artifact and assumes it describes the second, which is true in schemas fresh enough to have no drift and false in every schema old enough to have generated any.

Rigor on new tables pays off, but the larger lever is routine comparison between what the schema says and what the data does: sampling before querying, commenting columns whose meaning isn’t self-evident, treating data profiling as part of review rather than a debugging step. None of it scales to “we documented the whole schema in one sprint.” It scales one column at a time, on the columns that are about to be queried, until the fraction of the schema that lies to its readers is small enough to stop costing incidents.

Random UUIDs as Primary Keys: The B-Tree Penalty

Fri, 22 Aug 2025 00:00:00 +0000

TL;DR

UUIDv4 primary keys are globally unique and coordination-free, and the cost is paid every time you write a row: random B-tree positions, page splits, secondary indexes bloated with 16- or 36-byte key copies, and a working set that stops fitting in the buffer pool once the table is large enough. UUIDv7 fixes the insert-locality problem (time-ordered, sortable) without changing storage size; the full fix is picking v7, storing as BINARY(16) or native uuid, and keeping UUIDs at the API boundary rather than internal to every join.

A table configured like this on day one looks unremarkable:

CREATE TABLE orders (
 id CHAR(36) PRIMARY KEY, -- UUIDv4, generated by the application
 ...
);

Inserts are fast, reads are fast, the ORM is happy. At 100,000 rows, it’s still fine. At 10 million, the nightly ingest job gets noticeably slower. At 200 million, inserts take 50 ms each instead of 2 ms, the buffer pool is constantly churning, and the secondary indexes are three to four times the size they’d be with a BIGINT primary key. Nothing about the schema changed. The table just got large enough for a design decision to start charging rent.

The obvious fix is “use BIGINT auto-increment.” That’s the right answer in a lot of cases and the wrong one in others; it reintroduces coordination requirements, leaks row counts through URL-exposed IDs, and doesn’t work for IDs that need to be generated offline or across shards. UUIDs exist because those constraints are real. The sharper question is: what exactly is UUIDv4 costing you at scale, and which of those costs have cheaper alternatives?

What random keys do to a B-tree

B-tree indexes are sorted structures. When the primary key is an auto-incrementing integer, every new row goes to the end. The rightmost leaf page is the only one that gets written to, and the rest of the index stays in cache undisturbed. Inserts are sequential and cheap.

UUIDv4 is random by design. Every new row lands at a random position in the B-tree. Instead of appending to one page, the engine has to:

Find the right page somewhere in the middle of the tree.
Load it into the buffer pool if it isn’t already (on a large table, it usually isn’t).
Split it if it’s full.
Write both halves back.

On a table with hundreds of millions of rows, the index doesn’t fit in memory, so most inserts trigger a random disk read before they can do anything else. The write amplification is real and measurable: a factor of 5 to 10 versus sequential inserts isn’t unusual.
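
A crude way to see the locality difference, as a Python sketch. It stands in for the leaf level with a sorted list cut into fixed-size pages and counts how many distinct pages the most recent inserts landed on. The page size, row counts, and the leaf_pages_touched helper are illustrative assumptions; a real B-tree has splits, fill factors, and a buffer pool that this model ignores.

```python
import bisect
import random


def leaf_pages_touched(keys, page_size=100, window=1000):
    """Insert keys in order; count distinct 'pages' hit by the last `window` inserts."""
    index = []                      # sorted list standing in for the leaf level
    touched = set()
    first_counted = len(keys) - window
    for i, key in enumerate(keys):
        pos = bisect.bisect_left(index, key)
        index.insert(pos, key)
        if i >= first_counted:
            touched.add(pos // page_size)   # which page this insert landed on
    return len(touched)


rng = random.Random(42)
n = 20_000
sequential_keys = list(range(n))                          # auto-increment style
random_keys = [rng.getrandbits(64) for _ in range(n)]     # UUIDv4 style

seq_pages = leaf_pages_touched(sequential_keys)
rand_pages = leaf_pages_touched(random_keys)
print(seq_pages, rand_pages)
```

In this model the sequential keys confine their last thousand inserts to ten pages at the right edge, while the random keys spread them over most of the index. That spread is the working set the buffer pool has to hold.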

The damage doesn’t stop at the primary-key index. In InnoDB (MySQL), every secondary index includes a copy of the primary key at its leaves. A 36-byte CHAR(36) UUID embedded in every secondary index entry means larger indexes, more pages, more I/O compared to an 8-byte BIGINT. Secondary indexes on a UUID-keyed table are routinely 3–4× the size of the same indexes on a BIGINT-keyed table. Every lookup through a secondary index reads more pages to cover the same rows.

PostgreSQL handles storage differently. Its heap means the primary key is just another index, so the physical table isn’t ordered by it. The primary-key index still suffers the same random-insertion pathology, and the write amplification from random page loads still applies.

Page splits compound over time. When a new UUID lands in a full page, InnoDB splits the page in two, each roughly half full. Over millions of inserts, the index develops internal fragmentation: pages allocated but only partially used. The index is physically larger than it needs to be, and scans read more pages for the same row count. OPTIMIZE TABLE (MySQL) or REINDEX (PostgreSQL) can repack the index, but on a busy table it’s a maintenance window you have to schedule.

UUIDv7: the insert-locality fix

UUIDv7 is the version most new code should reach for when UUIDs are the right answer. It encodes a Unix millisecond timestamp into the high 48 bits; the rest is version and variant markers plus random bits. Two practical consequences:

Sortable. Sequential generation means new IDs land at the end of the B-tree, not scattered across it. Insert locality is close to a BIGINT’s. The pathological page-split behaviour of v4 goes away.
Time-parseable. The creation time is embedded in the ID, recoverable from the primary key alone: useful for log correlation, rough time-range filtering, and debugging without reaching for created_at.

-- UUIDv7: time-ordered, so inserts are roughly sequential
-- PostgreSQL 18 ships a built-in uuidv7() function
CREATE TABLE orders (
 id UUID PRIMARY KEY DEFAULT uuidv7(),
 ...
);

-- Recover creation time from the ID - no created_at column needed
SELECT id,
 uuid_extract_timestamp(id) AS created_at,
 uuid_extract_timestamp(id)::date AS created_date
FROM orders
ORDER BY id DESC -- v7 sorts chronologically, newest first
LIMIT 10;

uuid_extract_timestamp() has existed in PostgreSQL since 17 but only returned a value for UUIDv1. PG 18 extended it to support v7 alongside the new uuidv7() generator. One caveat: calling it in a WHERE clause (WHERE uuid_extract_timestamp(id) >= '2026-04-01') is non-SARGable and forces a scan; see Non-SARGable Predicates. For indexed time-range filtering, keep a created_at column as the query target, or compare against a boundary UUID generated at the target timestamp.
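
The boundary-UUID approach can be sketched in a few lines of Python. uuid7_lower_bound is a hypothetical helper, not a library call: it packs the target time into the top 48 bits, sets the version and variant bits, and leaves every random bit at zero, so any UUIDv7 generated at or after that millisecond compares greater than or equal to it.

```python
import uuid
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def uuid7_lower_bound(moment: datetime) -> uuid.UUID:
    """Smallest UUIDv7-shaped value whose timestamp is `moment` (ms precision)."""
    ms = (moment - EPOCH) // timedelta(milliseconds=1)   # exact integer milliseconds
    value = (ms & ((1 << 48) - 1)) << 80                 # 48-bit timestamp, top of the ID
    value |= 0x7 << 76                                   # version 7
    value |= 0x2 << 62                                   # RFC variant bits (10)
    return uuid.UUID(int=value)                          # all random bits left at zero


target = datetime(2026, 4, 1, tzinfo=timezone.utc)
lower_bound = uuid7_lower_bound(target)
print(lower_bound)
```

The bound is then an ordinary query parameter: WHERE id >= :lower_bound can use the primary-key index directly, assuming the column type compares UUIDs byte by byte, as PostgreSQL’s uuid type and BINARY(16) do.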

MySQL 8 doesn’t ship a v7 generator or a timestamp extractor, so application-side generation is the norm there: libraries exist in every major language, and a growing number of ORMs and ID libraries can emit v7 directly. Extraction is manual: for BINARY(16) storage (the recommended form), the first 6 bytes hold the millisecond timestamp.

-- MySQL: manually parse v7's timestamp prefix (BINARY(16) storage)
SELECT id,
 FROM_UNIXTIME(CONV(HEX(SUBSTRING(id, 1, 6)), 16, 10) / 1000) AS created_at,
 DATE(FROM_UNIXTIME(CONV(HEX(SUBSTRING(id, 1, 6)), 16, 10) / 1000)) AS created_date
FROM orders
ORDER BY id DESC -- v7 sorts chronologically
LIMIT 10;

For CHAR(36) storage, the extraction strips hyphens first: CONCAT(SUBSTRING(id, 1, 8), SUBSTRING(id, 10, 4)) gives the 12 hex characters of the timestamp prefix. If your UUIDs were stored with UUID_TO_BIN(id, 1) (the swap flag that reorders bytes for v1 index locality), the byte layout differs and the substring offsets change. Most v7-generating libraries skip the swap because v7 is already time-ordered without it; check what yours does before trusting the extraction.
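
The same two extractions, done application-side in Python as a cross-check for the SQL. The sample ID is constructed in the code from a known timestamp rather than taken from a real system, both helper names are hypothetical, and the sketch assumes the bytes were stored unswapped.

```python
import uuid
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def uuid7_time_from_bytes(raw: bytes) -> datetime:
    """BINARY(16) form: the first 6 bytes are the big-endian millisecond timestamp."""
    ms = int.from_bytes(raw[:6], "big")
    return EPOCH + timedelta(milliseconds=ms)


def uuid7_time_from_text(text: str) -> datetime:
    """CHAR(36) form: 8 hex chars, skip the hyphen, then 4 more."""
    ms = int(text[0:8] + text[9:13], 16)
    return EPOCH + timedelta(milliseconds=ms)


# Build a v7-shaped sample for 2023-11-14 22:13:20 UTC (1,700,000,000,000 ms).
sample_ms = 1_700_000_000_000
sample = uuid.UUID(int=(sample_ms << 80) | (0x7 << 76) | (0x2 << 62) | 12345)

print(len(sample.bytes), len(str(sample)))   # storage width of each form
print(uuid7_time_from_bytes(sample.bytes))
print(uuid7_time_from_text(str(sample)))
```

Both forms decode to the same instant; the only difference is that one of them costs 16 bytes per key copy and the other 36.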

What v7 doesn’t change. It’s still 16 bytes on disk, and still 36 if you stored it as CHAR(36). The insert-locality win doesn’t come with a storage discount, so the overhead versus a BIGINT is the same as v4. The readable creation timestamp is usually a feature and occasionally a problem: in systems where row-creation time is sensitive (order IDs revealing traffic patterns to competitors, user IDs exposing signup timing), it’s the one property v4 had that v7 gives up.

CHAR(36) is the silent tax

The worst-case UUID storage, CHAR(36), is what most ORM-generated schemas default to, because it’s the portable representation. BINARY(16) in MySQL or the native uuid type in PostgreSQL cuts storage by more than half and keeps comparisons on fixed-width binary values instead of collation-aware strings. Pick the narrow form on day one; retrofitting it later is a full-table rewrite that touches every secondary index.

UUID-to-integer mapping: keep UUIDs at the edge

The other workable fix is structural: expose UUIDs externally, use integers internally. A single lookup table maps the external UUID to an internal BIGINT, and every other table in the database uses the BIGINT as its foreign key. The UUID lookup happens once (at the API boundary) and everything downstream is fast, compact, 8-byte integer joins.

CREATE TABLE users (
 id BIGINT AUTO_INCREMENT PRIMARY KEY,
 external_id BINARY(16) NOT NULL UNIQUE, -- the UUID the outside world sees
 ...
);

-- Every other table references the BIGINT, not the UUID
CREATE TABLE orders (
 id BIGINT AUTO_INCREMENT PRIMARY KEY,
 user_id BIGINT NOT NULL,
 ...
 FOREIGN KEY (user_id) REFERENCES users (id) -- table-level: MySQL 8 parses but ignores inline REFERENCES
);

-- API request comes in with a UUID; one indexed lookup to resolve it
SELECT id FROM users WHERE external_id = UUID_TO_BIN('a1b2c3d4-...');
-- From here on, everything uses the BIGINT

The UUID column has a unique index, so the lookup is a single index seek, typically sub-millisecond once the index is warm. The rest of the schema gets 8-byte keys everywhere: smaller indexes, faster joins, appends at the right edge of the B-tree instead of mid-tree page splits, no secondary-index bloat. The external-facing API still uses UUIDs, so you don’t leak sequence information or row counts.

The trade-off is an extra layer of indirection. Every inbound request resolves the UUID before anything else; in practice this is negligible (one indexed lookup), but it means the schema has two identity systems to maintain. For long-lived OLTP applications where every join on every table pays the UUID cost, this structure is often worth the extra lookup.

When random UUIDs are actually fine

Not every schema needs to bend. Three cases where UUIDv4 as a primary key is a defensible choice:

Small tables that stay small. A configuration table, a lookup table, a feature-flag table. At 50,000 rows the page-split pathology doesn’t show up, secondary indexes are tiny, and the convenience of client-generated IDs outweighs any cost.

Write rates low enough that random I/O doesn’t matter. An admin tool recording 50 events per minute doesn’t care about write amplification. The index fits in cache, every page is warm, page splits happen rarely enough that fragmentation stays manageable. “Doesn’t survive scale” is only a problem at scale.

Information-leak concerns that outweigh performance. If hiding creation-order is a hard requirement (competitive, privacy, or security), v7’s embedded timestamp is a non-starter and v4 is the only UUID version that meets the requirement. Pay the write-amplification cost and use the UUID-to-integer mapping to contain the damage.

Why v4 keeps showing up as the default

Schema-reading assistants and scaffolding tools reinforce UUIDv4 as the default answer, and the reason is mostly inertial. Training corpora are heavy on examples where uuid4() is the canonical “globally unique ID” call; CREATE TABLE ... id UUID DEFAULT gen_random_uuid() appears in orders of magnitude more tutorials than the v7 equivalent. Asked for a new table schema, a model produces the v4 version because that’s what the surrounding code it learned from produced. B-tree locality and write amplification don’t show up in the DDL (they’re runtime properties of the key distribution) so the catalog gives no signal that v4 and v7 behave differently at 100M rows. Both look identical in information_schema: uuid or char(36), primary key, not null.

The fix is the same discipline this post already makes the case for, with one amplified beat: document the choice where a schema reader can find it. A comment on the PK column ('UUIDv7: time-ordered; required for insert locality' or 'UUIDv4: randomized; chosen to hide creation order') turns a silent convention into a machine-readable decision. The next reader (teammate or model) sees why the column is what it is and why the alternative was rejected. Without the comment, the next table scaffolded by an assistant inherits whichever version the training data sampled, and the schema drifts toward whichever default produces the most plausible-looking DDL.
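
As a sketch of what that comment looks like in PostgreSQL: the events table below is hypothetical, and it assumes the application generates the UUIDv7 value before inserting. The last query shows how any catalog reader can pick the decision up together with the column.

```sql
-- Hypothetical table; the application supplies the UUIDv7 value.
CREATE TABLE events (
  id      uuid  PRIMARY KEY,
  payload jsonb NOT NULL
);

-- Record the decision on the column itself, not in a wiki.
COMMENT ON COLUMN events.id IS
  'UUIDv7: time-ordered; required for insert locality';

-- Read column comments back from the catalog alongside the column names.
SELECT a.attname AS column_name,
       col_description(a.attrelid, a.attnum) AS column_comment
FROM pg_attribute AS a
WHERE a.attrelid = 'events'::regclass
  AND a.attnum > 0          -- skip system columns
  AND NOT a.attisdropped;
```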

The bigger picture

UUIDv4 is a tool that solved a coordination problem (distributed ID generation without central authority) and accidentally became the default for everything, including the cases where coordination wasn’t a problem and the cost of random writes is non-trivial. “Pick a UUID for your PK” is a decision most schemas make without ever being explicit about what they’re trading.

The decision matrix is short. Do you need globally unique, coordination-free IDs? If no, use BIGINT. If yes, use UUIDv7 and store it as BINARY(16) or native uuid, never CHAR(36). If v7’s embedded timestamp is a problem, use v4 but keep it at the API boundary and use integers inside the schema. Each of those decisions costs almost nothing on day one and saves a lot of rework at 100 million rows.

God Tables: 150 Columns and the Quiet Cost of 'Just Add a Column'

Tue, 05 Aug 2025 00:00:00 +0000

TL;DR

A wide table looks cheap because every column was added for a real reason; the expensive part is that rows grow, every write amplifies, and every secondary index inherits the bloat. The fix is splitting by access pattern (columns read together stay together, rarely-touched columns move out), not aggressive normalization that trades one wide table for six-way joins on every read.

The schema started clean four years ago: users(id, email, password_hash, created_at), four columns. Today the table is renamed customers and has 184 columns. Billing address. Shipping address. Three additional shipping addresses numbered 2 through 4. preferences_json for user settings. Twelve feature-flag TINYINTs. Three Stripe identifiers from three processor migrations. last_login_at, last_seen_at, last_purchase_at, last_notification_sent_at. Forty more columns whose meaning lives in Confluence, if anywhere. No single ALTER TABLE ADD COLUMN was unreasonable at the time. The accumulated result is an average row size of 6KB, an UPDATE to last_login_at that rewrites every byte of it, and a buffer pool holding two customer rows per page instead of forty.

The obvious fix is to normalize it: split into customer_profile, customer_billing, customer_addresses, customer_preferences, customer_feature_flags, customer_audit. That’s the textbook answer and it’s the one that breaks the moment you look at the dominant read. The list view on the admin page needs name, email, status, last login, Stripe status, and total spent. Now it’s a six-way join on every page load. The fix that looked clean in the migration doc makes the most-frequent query more expensive, not less. The read cost moves to the place it’s paid most often, and somebody (usually a few months later) proposes a materialized view to “just flatten it back out,” which is the god table returning through a different door.

How a row-store actually reads a row

Before the cost math makes sense: OLTP engines like InnoDB and PostgreSQL’s heap store complete rows laid out contiguously on fixed-size pages - typically 16KB in InnoDB, 8KB in PostgreSQL. A page holds as many rows as fit. When a query needs one column of one row, the engine doesn’t read that column alone; it locates the row’s page via an index lookup or scan, loads the whole page into the buffer pool, and reads the requested column out of the in-memory row image.

The one exception is the index-only scan: if every column the query projects and filters on is already present inside an index, the base table doesn’t have to be touched and only the index pages are loaded. See Covering Index Traps for how quickly this optimization disappears, usually the moment a SELECT list grows by one column. Every other read path goes through the row, which means the row’s width sets how much cache each lookup consumes. The engine loads a whole page either way, but returning a 50-byte email from a 184-column customer row ties up 6KB of that page, while an 800-byte row ties up 800 bytes; the same 6KB would hold seven of the narrower rows. The buffer pool is a fixed size and every byte of unused column data in it is displacing something another query needs.

Column stores (ClickHouse, BigQuery, Parquet-backed warehouses) invert this entirely. Data is laid out by column, so reading one column reads only that column’s storage. The wide-table cost math doesn’t apply there, which is why this anti-pattern is specifically a row-store OLTP problem and why denormalized fact tables in analytical warehouses are fine at 300 columns.

What 150 columns actually costs

The individual cost of one column is negligible. The system-level cost shows up in several places at once, and none of them are visible in a diff that adds one more.

Row size and write amplification. InnoDB stores full rows on disk pages, and an UPDATE rewrites the entire row even if only one column changed. On a 184-column table averaging 6KB per row, updating last_login_at on every sign-in rewrites 6KB, not 8 bytes. PostgreSQL doesn’t rewrite in place (MVCC creates a new tuple for every UPDATE and marks the old one dead) but the new tuple is 6KB too, and VACUUM has that much more to reclaim. Either engine, the write cost per logical change scales with row width.

Buffer pool density. The page-per-read mechanism above means buffer-pool efficiency scales inversely with row width. At 6KB per row, an InnoDB 16KB page holds two rows; at 400 bytes per row it holds forty. A database with 10GB of buffer pool has the effective working set of a much smaller instance once rows get wide. Queries that used to run hot start touching disk for no reason other than that the rows they cared about no longer fit in memory alongside the rows other queries cared about.
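
One rough way to put a number on this for a real table, sketched for MySQL/InnoDB. It assumes the table is named customers and lives in the current schema. AVG_ROW_LENGTH and TABLE_ROWS are InnoDB estimates, and page headers and free space are ignored, so read the result as an order of magnitude rather than a measurement:

```sql
-- Approximate rows per InnoDB page from the engine's own statistics.
SELECT TABLE_NAME,
       TABLE_ROWS,
       AVG_ROW_LENGTH,
       FLOOR(@@innodb_page_size / NULLIF(AVG_ROW_LENGTH, 0)) AS approx_rows_per_page
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'customers';
```

With the figures used in this post, a 16KB page and a 6KB average row give FLOOR(16384 / 6144) = 2, while a 400-byte row gives FLOOR(16384 / 400) = 40.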

Secondary indexes inherit the width problem. Every secondary index in InnoDB carries a copy of the primary key at its leaves; every index entry is a key-columns + PK-copy record. A wide table tends to accumulate indexes: you index email, Stripe ID, last-login, phone, region, account-manager-ID, each for a different query path. Six secondary indexes on a 184-column table isn’t unusual. Their entries don’t grow with row width, but each one is another structure every affected write has to maintain, and each competes for the same buffer pool the wide rows have already crowded. Covering indexes are also harder to arrange: the list view wants six columns projected, and indexing six columns of a 184-column table to cover one query is an expensive trade.

Lock and transaction width. Every UPDATE acquires a row-level lock. Transactions that touch a wide row hold that lock for the duration of the transaction, and because the row spans many concerns (billing, preferences, audit timestamps) transactions from unrelated code paths contend on the same row. A background job updating last_seen_at now serializes against a billing job updating stripe_customer_id on the same customer, because both paths lock the same row. In the split-by-concern shape, they’d contend on different rows of different tables.

Schema migrations get more expensive. ALTER TABLE ADD COLUMN on a 184-column table is slower, holds metadata locks longer, and has a larger blast radius if it fails. MySQL’s online DDL is usually fine for NULL-default additions; PostgreSQL is generally fast for the same case. Any migration that needs to rewrite rows (changing a column type, adding NOT NULL with a backfill) scales with row size, and a 6KB row rewrite on 200 million rows is a different operation than an 800-byte row rewrite on the same count.

Every column is a commitment

The cost of adding a column is small and immediate. The cost of having 150 columns is systemic and deferred: buffer-pool density, index size, write amplification, lock contention, migration cost. None of the deferred costs are visible in the PR that adds one more column, which is why they accumulate uncorrected until the table is painful.

Why LLMs make this worse

Schema drift in the wide-table direction is what language models reinforce by default. A model generating ALTER TABLE for a feature request reads the current schema and proposes the smallest change that makes the feature work, which is almost always adding columns to the table that already holds the related data. Proposing a split requires understanding the access pattern, the transaction boundaries, and the write frequency of the new columns versus the existing ones. None of that is in the CREATE TABLE.

The loop reinforces itself: the wider the table gets, the more natural it is for the next change to widen it further. “Where do loyalty tier and tier expiry go?” The model sees customers has every other user-attached concept in it and adds two columns. The alternative (CREATE TABLE customer_loyalty (customer_id PK FK, tier, expires_at)) requires the model to argue for a split, and splits are rare in the training data compared to additions because splits are rare in real codebases for the same reason: they’re harder to ship than additions. The model is correctly pattern-matching on what humans actually do, which is exactly the problem.

ORMs compound this. One model equals one table is the default shape in ActiveRecord, Django ORM, Prisma, SQLAlchemy, and Ecto. Refactoring a Customer model into three co-owned tables is a change that touches every query, every serializer, every test. The ORM makes “add a column to the existing model” a five-line change and “split the model” a project. Engineers pick the cheap option every time, and the wide table ratchets.

Split by access pattern, not by concept

“Normalize it” isn’t the fix because normalization is a property of data shape, not query cost. The fix is to look at what columns are actually read and written together, and keep those co-located; the rest moves out.

A workable decomposition for the customers example:

Core hot table. The columns read on nearly every query: id, email, name, status, tier, stripe_customer_id, created_at. Maybe twenty columns. This is what the list view, the auth path, and most API responses need.
1:1 cold tables. Concerns that are read rarely or in specific flows: customer_audit for login/seen/purchase timestamps, customer_preferences for user settings, customer_feature_flags for the twelve TINYINT flags. Each is a separate table with customer_id as PK and FK, joined only when the flow actually needs it. Writes to last_login_at stop rewriting the billing row.
1:N tables for repeating groups. Addresses, payment methods, anything that was modeled as shipping_address_2, shipping_address_3, shipping_address_4 becomes an addresses table with a FK and a type. This collapses polymorphic-ish schema decisions that shouldn’t have been made at the column level in the first place; see Polymorphic References for the related pattern where doing this without a FK goes wrong.

The trade-off is that some queries now join two or three tables instead of reading one. On the hot path this is fine; the joins are on PK-equals-FK, the join tables are small, and the read is usually cheaper than scanning a fat row. The cold path is where the join cost actually lands: the audit screen now joins customers to customer_audit, which costs one indexed lookup and nobody notices. The place to be careful is the query that reads from three of the split tables on every request. If that’s dominant, one of those tables probably belongs merged back in.
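
The decomposition above might look roughly like this in PostgreSQL. It is a sketch under assumptions: the column types, the address_type values, and the trimmed column lists are illustrative choices, not taken from a real schema.

```sql
-- Core hot table: only what nearly every query reads (trimmed for brevity).
CREATE TABLE customers (
  id         BIGINT PRIMARY KEY,
  email      TEXT NOT NULL,
  name       TEXT NOT NULL,
  status     TEXT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- 1:1 cold table: customer_id is both primary key and foreign key.
CREATE TABLE customer_audit (
  customer_id      BIGINT PRIMARY KEY REFERENCES customers (id),
  last_login_at    TIMESTAMPTZ,
  last_seen_at     TIMESTAMPTZ,
  last_purchase_at TIMESTAMPTZ
);

-- 1:N table replacing shipping_address_2 .. shipping_address_4.
CREATE TABLE customer_addresses (
  id           BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  customer_id  BIGINT NOT NULL REFERENCES customers (id),
  address_type TEXT NOT NULL CHECK (address_type IN ('billing', 'shipping')),
  line1        TEXT NOT NULL,
  city         TEXT NOT NULL
);
CREATE INDEX ON customer_addresses (customer_id);

-- The sign-in path now rewrites a narrow audit row, not the customer row.
UPDATE customer_audit SET last_login_at = now() WHERE customer_id = 42;

-- The audit screen pays one PK-equals-FK lookup for the cold columns.
SELECT c.id, c.email, a.last_login_at, a.last_purchase_at
FROM customers AS c
LEFT JOIN customer_audit AS a ON a.customer_id = c.id
WHERE c.id = 42;
```

The LEFT JOIN is deliberate: a customer with no audit row yet still appears, with NULL timestamps.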

When a wide table is actually fine

Not every 100-column table is a god table. Three cases where width is defensible:

Analytical and reporting tables on columnar storage. As noted above, warehouses like ClickHouse, BigQuery, and Redshift invert the cost calculus. Reading one column doesn’t load the rest, and the normalization pressure flips: denormalize aggressively because joins are expensive and per-column reads are cheap. This anti-pattern is specifically a row-store OLTP problem.

Small tables that stay small. A tenants table with 80 columns and 500 rows fits entirely in the buffer pool. The write amplification is paid a few thousand times a day, not a few million. The secondary-index cost is negligible because the indexes are small. Width matters when row count is large enough for the per-row cost to dominate; on small tables it doesn’t.

Every query reads every column. Uncommon but real. If the dominant read is “fetch the full customer record for display” and the split would produce a join that runs on every request anyway, the split doesn’t help. The test is whether the queries you actually run touch disjoint column sets. If they do, the split has a real win; if they don’t, it’s architecture for its own sake.

The bigger picture

Relational databases aren’t built for developer convenience. They’re built for storage efficiency and retrieval speed: narrow rows, well-placed indexes, joins on indexed keys, query plans that read only what they need. Normalization isn’t an academic ideal; it’s the shape that lines up with how the engine actually pays its bills. Every cost mechanism in this post (buffer-pool density, write amplification, index bloat, row-lock width) is the engine reporting the same thing in different dialects: the shape you’re asking it to hold isn’t the shape it was optimized for.

God tables are the limit of a sequence of rational local decisions where the global cost is invisible at each step. The column count of a mature production table is usually a decent proxy for how long the team has been making the cheap choice, which is most teams most of the time, and that is not by itself a failure. The failure is that the cost goes uncounted. A 6KB row is a write-amplification multiplier on every UPDATE, a buffer-pool multiplier on every read, and an index-size multiplier on every secondary index. None of those costs are on the PR that adds a column; all of them are on the dashboard that shows p99 drifting up quarter after quarter.

The lever is to count the cost at the system level when the table hits a certain width (pick a threshold: sixty columns, a hundred, whatever fits) and make the next column addition a conversation about whether this concern belongs here, not a line in a migration. The answer is often still yes, but it shouldn’t be the default answer.
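
A threshold like that is easy to check mechanically. The query below is a sketch that uses only information_schema.columns, so it should run on both PostgreSQL and MySQL; the cutoff of 60 and the list of excluded system schemas are assumptions to adjust.

```sql
-- Tables at or past the agreed column-count threshold, widest first.
SELECT table_schema,
       table_name,
       COUNT(*) AS column_count
FROM information_schema.columns
WHERE table_schema NOT IN
      ('information_schema', 'pg_catalog', 'mysql', 'performance_schema', 'sys')
GROUP BY table_schema, table_name
HAVING COUNT(*) >= 60
ORDER BY column_count DESC;
```

Run in CI or on a schedule, the result gives the "does this concern belong here" conversation a concrete trigger.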

Covering Index Traps: When Adding One Column Breaks Your Query

Fri, 18 Jul 2025 00:00:00 +0000

TL;DR

An index-only scan is the fastest way a relational database can answer a query: the engine reads the index and never touches the table. Adding a single column to the SELECT list that isn’t in the index silently breaks the optimization, and the query that ran in a millisecond now takes seconds. The SELECT list is part of the query’s performance contract with the index.

Here’s a query that ran in production for a year with sub-millisecond latency:

SELECT status, created_at FROM orders WHERE customer_id = 42;

The orders table has a composite index on (customer_id, status, created_at). Every column the query needs (customer_id for the filter, status and created_at for the output) is in that index. The database reads the index, returns the results, and never touches the table. This is an index-only scan: one of the most significant optimizations a relational engine makes, and the mechanism behind “covering” queries.

Then a feature request: “show the order total on this page.” The change looks trivial.

SELECT status, created_at, total_cents FROM orders WHERE customer_id = 42;

One column added. The query is still correct. The index still matches the filter. But total_cents isn’t in the index, so for every matching row, the engine now follows a pointer back to the table to fetch that one extra column. On a table with millions of rows, that’s a random I/O per match. The query that was 0.4 ms is now 1243 ms.

The obvious fix is “just don’t add columns to queries.” That doesn’t work; features need data. The slightly-less-obvious fix is “always project the minimum columns,” which is fine as advice and ignored in practice because every ORM defaults to SELECT *. The actual fix is to treat the SELECT list as part of the query’s performance contract with the index, and to know what that contract is before changing it.

What’s actually happening

The execution plan tells the whole story:

-- Before: index-only scan
EXPLAIN ANALYZE SELECT status, created_at FROM orders WHERE customer_id = 42;
-- Index Only Scan using idx_orders_cust_status_created on orders
-- Heap Fetches: 0
-- Execution Time: 0.4 ms

-- After: index scan + table lookups
EXPLAIN ANALYZE SELECT status, created_at, total_cents FROM orders WHERE customer_id = 42;
-- Index Scan using idx_orders_cust_status_created on orders
-- Execution Time: 1243.7 ms

Same index. Same filter. Same rows returned. The only difference is the select list, and it moves the query from a pure index walk to an index walk plus one random I/O per matching row.

Buffer pool pollution compounds the damage. When the engine fetches full rows from the table instead of reading compact index entries, it loads entire data pages into the buffer pool. Those pages (carrying every column of every matched row, most of which the query doesn’t need) evict pages that other queries do need. On a busy system with a finite buffer pool, one query losing its covering index degrades performance for unrelated queries across the database. The slow query you noticed is rarely the only thing getting slower.

Nothing in the query results tells you. The rows come back correctly. The response looks the same. A SELECT COUNT(*) returns the same count. The only place the degradation is visible is in the execution plan, and nobody checks the execution plan when the feature ships.
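
For the review step, a small PostgreSQL sketch of what to run and what to read, using the orders query from above. Note that EXPLAIN ANALYZE executes the statement, so this belongs on a replica or a staging copy with production-like data.

```sql
-- BUFFERS adds page-access counts to the plan, which makes the extra
-- table reads visible even when timing is noisy.
EXPLAIN (ANALYZE, BUFFERS)
SELECT status, created_at, total_cents
FROM orders
WHERE customer_id = 42;

-- What to look for in the output:
--   * the node type: "Index Only Scan" versus "Index Scan"
--   * "Heap Fetches" on an index-only scan (nonzero means table visits)
--   * the "Buffers: shared hit=... read=..." line, before and after the change
```

Heap Fetches can be nonzero even on a covered query when the table has recently modified pages that have not been vacuumed, so compare plans taken under similar conditions.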

ORM defaults

Most ORMs emit SELECT * (or the full column list, which amounts to the same thing) unless explicitly told otherwise. ActiveRecord needs .select(:id, :status); Django needs .only('id', 'status'); SQLAlchemy needs load_only() or explicit columns in select(); Prisma needs an explicit select block. On a high-traffic table, a one-line change to project only the needed columns is one of the highest-leverage optimizations available. It’s worth checking what your ORM actually generates on the query paths that matter; the generated SQL is the contract, not the method call.

The fix: match the index or extend it

There are two workable fixes when a query loses its covering index, and they trade different costs:

Project only what the index covers. If the new column isn’t worth fetching from the table on every row, don’t fetch it. Split the query: one covered query for the list view, a targeted lookup for the detail row the user actually wants. Most feature requests that “need” an extra column on a list page are actually fine with lazy-loading the value on click.

Extend the index to include the new column. If the column is genuinely needed on every row, add it to the index, either as an additional indexed column or (in PostgreSQL 11 and later) as an INCLUDE clause that adds the value to the leaf pages without making it part of the B-tree ordering:

-- PostgreSQL: add total_cents as a non-key included column
CREATE INDEX idx_orders_cust_status_created_total
 ON orders (customer_id, status, created_at)
 INCLUDE (total_cents);

INCLUDE is the right tool when you need the column covered but don’t want it affecting the sort order or filter path. The trade-off is write cost: the index is now larger, and every update to total_cents has to update the index entry. On a write-heavy table that’s meaningful; on a read-heavy table it’s usually negligible compared to the read speedup.

MySQL (InnoDB) doesn’t support INCLUDE but has a natural equivalent: every secondary index already contains the primary key at its leaves, and you can extend the secondary index to cover additional columns by adding them as regular key columns. The planner is smart enough to use the covered form when the column is present.
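
A sketch of the MySQL form, assuming the same orders table and columns as the PostgreSQL example; the index name is illustrative.

```sql
-- MySQL/InnoDB: cover total_cents by appending it as a trailing key column.
ALTER TABLE orders
  ADD INDEX idx_orders_cust_status_created_total
    (customer_id, status, created_at, total_cents);

-- Verify before relying on it.
EXPLAIN
SELECT status, created_at, total_cents
FROM orders
WHERE customer_id = 42;
-- A covered read reports "Using index" in the Extra column, and the key
-- column shows which index the optimizer actually chose.
```

Unlike INCLUDE, the appended column becomes part of the key ordering, so the index is maintained as sorted on total_cents within each (customer_id, status, created_at) prefix. Once the wider index is confirmed in use, the narrower one it replaces is a candidate for removal.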

When covering isn’t the right call

Covering indexes aren’t a universal good. Three cases where chasing a covering index is the wrong move:

Low-selectivity filters. If customer_id = 42 matches 80% of the table, the planner will usually skip the index; a sequential scan is cheaper. Index-only scans matter when the filter is selective. On a low-selectivity predicate, covering changes little.

Write-heavy tables. Every index slows writes. A table taking 50,000 inserts per second with five secondary indexes already pays a real cost for every index entry. Adding a covering variant of an existing index to shave read latency from 15 ms to 3 ms is a bad trade if the table is write-dominated; the write penalty compounds on every row, and only the reads benefit.

Rapidly changing projections. If the feature team is adding and removing columns from the list view every sprint, chasing the covering index is a losing game. Freeze the list-view columns as a contract, document them in the schema, and let the index match that contract, or don’t bother indexing for coverage at all.

One more column, silently uncovered

The archetypal AI-generated version of this bug is a one-line change that adds a column to the SELECT list. A feature request says “show the order total on the list page”; the assistant reads the existing query, adds total_cents to the projection, and returns the patch. The query still runs, the list page still renders, and the p99 quietly moves from 0.4 ms to 1200 ms because the index-only scan became a heap-fetch scan and nobody noticed until the dashboard did.

Coverage checking itself isn’t hard reasoning. Given a query and an index definition, working out whether the SELECT list stays inside the index is a short syntactic check any capable model can do. The catalog exposes the ingredients: PostgreSQL’s pg_index separates key columns from INCLUDE ones via indnkeyatts vs indnatts, MySQL’s information_schema.STATISTICS lists all columns per index. The signal is there.

What fails in practice is subtler. The relevant index often isn’t in the prompt’s context window; schema-aware tools pull catalog metadata, but whether idx_orders_cust_status_created lands in the retrieved context for “add total_cents to the list view” depends on retrieval heuristics, not the model’s capability. Even when the index definition is available, the default behavior for “modify this query” is to modify the query; re-verifying that the projection stays covered is a step the assistant rarely takes unsolicited. And only the planner’s actual choice is authoritative; static analysis gets most of the way, but nothing short of EXPLAIN tells you which index the query will use under real statistics.
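
The catalog side of that check can be sketched in PostgreSQL (11 or later, where indnkeyatts exists), assuming the table is named orders. Each index has its own pg_attribute rows, numbered in index order, so the key/INCLUDE split falls out of a single comparison:

```sql
-- One row per index column on orders, flagged as key or INCLUDE-only.
SELECT i.indexrelid::regclass       AS index_name,
       a.attnum                     AS position_in_index,
       a.attname                    AS column_name,
       (a.attnum <= i.indnkeyatts)  AS is_key_column
FROM pg_index AS i
JOIN pg_attribute AS a ON a.attrelid = i.indexrelid
WHERE i.indrelid = 'orders'::regclass
ORDER BY i.indexrelid::regclass::text, a.attnum;
```

A query stays coverable by an index only if every column it projects and filters on appears in that index’s rows here. This is the static half of the check; whether the planner actually picks the index is still a question for EXPLAIN.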

The fix at the schema level is what makes the coverage relationship legible to the next reader, human or model: name indexes after the query they support (idx_orders_list_view tells you what depends on it), document INCLUDE columns in the index comment, and put a comment on the query itself pointing at the index. None of this is novel advice. It becomes load-bearing once an assistant is routinely modifying queries: the explicit link between query and covering index is the signal that tells the assistant (and the human reviewer) “this change has an index implication” rather than silently shipping the uncovered patch.
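
A sketch of those three habits in PostgreSQL, assuming the INCLUDE index created earlier in this post; the new name and the comment wording are illustrative.

```sql
-- 1. Name the index after the query it supports.
ALTER INDEX idx_orders_cust_status_created_total
  RENAME TO idx_orders_list_view;

-- 2. Document the coverage relationship on the index itself.
COMMENT ON INDEX idx_orders_list_view IS
  'Covers the customer order list view: key (customer_id, status, created_at), INCLUDE (total_cents). Adding a column to that query''s SELECT list breaks the index-only scan.';

-- 3. Point the query back at the index.
-- Order list view. Covered by idx_orders_list_view; check EXPLAIN before
-- adding columns to this SELECT list.
SELECT status, created_at, total_cents
FROM orders
WHERE customer_id = 42;

-- The comment is retrievable from the catalog like any other metadata.
SELECT obj_description('idx_orders_list_view'::regclass, 'pg_class');
```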

The bigger picture

The SELECT list is a performance contract in most code reviewers’ blind spot. WHERE clauses get scrutinized because they’re obviously performance-relevant. JOINs get scrutinized because cardinality mistakes are visible. The SELECT list gets waved through because “it’s just what we display”, and then a one-column addition drops a query from 0.4 ms to 1243 ms with no code-review signal to catch it.

EXPLAIN ANALYZE is the only authority here. Reading execution plans isn’t glamorous, but it’s the difference between a query that works and a query that works at scale, and between a select-list change that’s free and one that silently broke the optimization the index existed to enable. On the queries that carry the most traffic, the execution plan belongs in code review alongside the query itself.

Legacy Schemas Are Sediment, Not Design

Tue, 01 Jul 2025 00:00:00 +0000

TL;DR

A legacy schema looks like a design and reads like sediment: layers of decisions from different eras, where names that once described the data no longer do and conventions that look uniform aren’t. Renaming is prohibitively expensive once every caller depends on the current names. The workable fix is documenting the drift so the next reader (human or LLM) can navigate what’s actually there.

A new engineer joins the team and reads the schema. tmp_orders looks like scaffolding, something to delete once the real migration ships. The tech lead answers: never delete it. tmp_orders is the main orders table. The temp-to-permanent rename was planned for 2017, nobody shipped it, and every service in the company now writes to the table. The name is a lie the schema tells every new reader, and every LLM generating SQL against the catalog.

The obvious fix is to rename the table. Nothing about the database itself prevents it: drop the tmp_ prefix, update every call site, ship. The reality is that every service, ORM model, report, integration, and runbook references tmp_orders by name. The rename is a multi-quarter effort that crosses team boundaries, and the only justification is legibility. Teams rarely prioritize legibility work, so the name stays, and the schema keeps lying.

What’s drifted

Legacy drift shows up in three visible modes and one invisible one.

Names that stopped describing the data. tmp_ tables that are permanent. old_ columns that are current. deprecated_ fields that every write path still populates. flag1, flag2, status_code: names whose meaning was obvious when the column was added, because the person adding it remembered why. By the time a new reader arrives, the intent is gone and the name is false advertising. Comment Your Schema covers the documentation side of this; legacy schemas are the case where comments would help most and where they’re most often absent.

Conventions per era. The 2014-era backend team used camelCase. The 2019 rewrite adopted snake_case. The 2022 microservice added a third convention, PascalCase, because the Go team wrote it and nobody pushed back. Now one database has userId, user_id, and UserID, all referring to the same entity across different tables. The LLM that generates business.created_at when the column is actually business.createdDate isn’t wrong in any sense the schema could catch; it’s inferring a convention from one table and applying it to another, which is a reasonable thing to do in a schema that has only one convention.
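
A quick way to surface the per-era spellings, sketched for PostgreSQL. The normalization (lowercase, underscores stripped) is an assumption that catches userId / user_id / UserID but not renamed concepts such as created_at versus createdDate.

```sql
-- Column names that collapse to the same normalized form but are spelled
-- more than one way across the schema.
SELECT lower(replace(column_name::text, '_', ''))   AS normalized_name,
       string_agg(DISTINCT column_name::text, ', ') AS spellings,
       count(DISTINCT column_name::text)            AS spelling_count
FROM information_schema.columns
WHERE table_schema NOT IN ('pg_catalog', 'information_schema')
GROUP BY 1
HAVING count(DISTINCT column_name::text) > 1
ORDER BY spelling_count DESC;
```

The output is a starting list for the per-era mapping, not a substitute for it: it shows that drift exists, not which spelling each table’s callers expect.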

Tables that were supposed to be temporary. tmp_orders is the canonical example, but every long-lived database has some. Staging tables that got promoted to production. Migration tables that weren’t cleaned up. “Phase 2” tables built for a transitional period by a team that shipped phase 1 and never came back to finish. The names encode the original intent; the data encodes the current reality; the two diverge a little more with every migration that preserves the name instead of fixing it.

Invisible structural drift. Charsets and collations are the version of drift that doesn’t even show up in the column list. Older tables created before the Unicode migration default to latin1; newer tables use utf8mb4. A join between a VARCHAR(100) column in one table and a VARCHAR(100) column in another (both with the same name, both with the same logical meaning) silently produces different results depending on which side’s collation MySQL picks. In the bad cases, an implicit charset conversion kills index usage and turns the query into a table scan. SHOW TABLE STATUS reveals the table default and SHOW FULL COLUMNS the per-column collation; reading the column list doesn’t. Most LLMs read the column list.
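
A sketch of a MySQL query that makes this drift visible. It assumes that same-named columns in the current schema are the likely join partners, which is a heuristic rather than a guarantee.

```sql
-- Same-named columns that carry different collations in different tables.
SELECT COLUMN_NAME,
       COUNT(DISTINCT COLLATION_NAME) AS collation_count,
       GROUP_CONCAT(CONCAT(TABLE_NAME, ': ', COLLATION_NAME)
                    ORDER BY TABLE_NAME SEPARATOR '; ') AS where_used
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
  AND COLLATION_NAME IS NOT NULL      -- non-string columns have no collation
GROUP BY COLUMN_NAME
HAVING COUNT(DISTINCT COLLATION_NAME) > 1
ORDER BY COLUMN_NAME;
```

Any row returned is a pair of columns that look joinable in the column list and may not be joinable on an index.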

Why this is worse for LLMs than for humans

A new human engineer working with a legacy schema can ask. They can ping the on-call channel, look up the original migration in git, trace a column back to the PR that introduced it, or simply ask “what is flag1?” and get an answer from someone who knows. The answer is often wrong or outdated, but it’s a starting point, and the engineer learns to treat the schema with appropriate suspicion.

An LLM generating SQL from the catalog has no such recourse. It sees tmp_orders and reasons from the name (probably “this is a staging table, prefer the non-tmp version if one exists, otherwise deprioritize”). It sees old_price and treats it as historical. It sees flag1 BOOLEAN and infers a generic flag. Each inference is reasonable; each is wrong in the specific case; the schema gives no signal that this is one of the cases where reasoning from the name produces bad SQL.

This is the sharper version of the generic id primary key problem. Both are failures of the schema to describe itself. The PK case hides what’s being matched; legacy drift hides what anything means. Neither failure shows up at write time; both produce queries that run, return data, and look plausible, because the rows exist and the types match. The wrongness is in the interpretation, which the database has no way to check.

The fix is documentation, not renaming

The obvious fix (rename everything to match intent and convention) fails on cost. Every table, column, and constraint in a mature schema is referenced by services the team has forgotten about: scheduled jobs, Redshift imports, third-party integrations, BI dashboards built by a contractor in 2019, runbooks pasted into wiki pages that nobody has edited since. A rename that looks like a one-line migration touches every surface the table is exposed on, and the projects that survive the attempt usually take a year and leave the schema worse during the transition.

The workable fix is to stop the drift from continuing and make the existing drift visible. Stopping new drift means picking a convention for new tables and columns and writing it down where CI can enforce it (Schema Conventions and Why They Matter covers the mechanics). Making existing drift visible means column and table comments on everything whose name doesn’t match its meaning, plus a per-era mapping somewhere in the repo that says “this database has four naming conventions, used in these periods, applied to these tables.” Legacy schemas are the case where COMMENT ON pays off most. The names are already wrong, the cost of fixing them is prohibitive, and the comment is the one affordable signal the next reader gets.

COMMENT ON TABLE tmp_orders IS
 'Main orders table. The tmp_ prefix is historical: a 2017 migration was planned to rename this and was never completed. Do not drop.';

COMMENT ON COLUMN customers.flag1 IS
 'VIP customer flag. Legacy name from the 2014 schema; never renamed because of external reporting dependencies.';

Each of these is a one-line migration that changes catalog metadata and no data, and every reader (human and LLM) now has a chance of reading the schema correctly. This isn’t a fix in the sense of “problem solved.” It’s a fix in the sense of “the next reader has a chance.” The drift is structural; the documentation is how you navigate it without making it worse.
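One way to keep that documentation from drifting in turn is to let CI list the columns that still carry no comment. A minimal sketch, assuming PostgreSQL and that the application tables live in the public schema (both are assumptions; the schema name in particular is illustrative):

```sql
-- List columns of ordinary tables in the (assumed) public schema
-- that have no COMMENT ON COLUMN text yet.
SELECT c.relname AS table_name,
       a.attname AS column_name
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
JOIN pg_attribute a ON a.attrelid = c.oid
WHERE n.nspname = 'public'      -- assumption: adjust to your schema
  AND c.relkind = 'r'           -- ordinary tables only
  AND a.attnum > 0              -- skip system columns
  AND NOT a.attisdropped        -- skip dropped columns
  AND col_description(c.oid, a.attnum) IS NULL
ORDER BY c.relname, a.attnum;
```

A CI job could fail when this returns rows for newly added tables while the legacy tables are worked through over time. How strict to be is a team decision, not something the catalog dictates.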

When a clean rewrite is actually worth it

Renames and migrations aren’t always wrong. Three cases where the rewrite earns its cost:

A misleading name is actively causing incidents. If tmp_orders is regularly truncated or dropped by someone who reads the name literally and acts on it, the rename cost is less than the recovery cost from the next incident. Usually the practical fix here isn’t a bare rename; it’s a rename paired with a compatibility alias: ALTER TABLE ... RENAME makes orders the canonical name, and a view (or a synonym, on engines that have them) keeps tmp_orders working for legacy callers.

A schema migration is happening anyway. If the team is replatforming the OLTP database or splitting it across services, the rewrite opens a window where renames are cheap because callers are being updated either way. Take the opportunity; don’t schedule a separate naming cleanup six months later when the window has closed.

The database is small enough to fit in one person’s head. Early-stage startups, internal tools, bounded-scope services. At twenty tables and three developers, a Saturday afternoon of renames is cheaper than a decade of comments.
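For the first of these cases, the rename-plus-alias approach might look like the following in PostgreSQL. It is a sketch under assumptions: tmp_orders is a plain table, no object named orders exists yet, and legacy callers only run ordinary SELECT, INSERT, UPDATE, and DELETE statements, which simple views accept.

```sql
BEGIN;

-- Make the honest name canonical.
ALTER TABLE tmp_orders RENAME TO orders;

-- Keep the old name alive for legacy callers. A simple single-table
-- view like this one is automatically updatable in PostgreSQL.
CREATE VIEW tmp_orders AS
SELECT * FROM orders;

COMMENT ON VIEW tmp_orders IS
  'Compatibility alias for orders. New code should query orders directly.';

COMMIT;
```

A side effect worth having: TRUNCATE tmp_orders and DROP TABLE tmp_orders now fail, because the old name is a view rather than a table. Note that SELECT * in a view is expanded when the view is created, so columns added to orders later do not appear through the alias until the view is recreated.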

In every other case, the schema is load-bearing history, and you renovate it the way you renovate a building with people still living in it: patch, document, and schedule the demolition for a window when it’s genuinely cheap.

The bigger picture

Every production schema is a compressed record of the decisions the team made under pressure. Some of those decisions were good and still fit; some were good at the time and don’t fit now; some were expedient and nobody noticed. The schema can’t tell you which is which, and it was never going to. The aspiration isn’t a clean schema that doesn’t accumulate history (no such schema exists past a three-year horizon) but enough signal for the next reader to decompress the sediment without guessing.

Comment the columns that lie. Document the conventions per era. Treat LLMs generating SQL against the catalog as the same kind of reader a new engineer is, and give them the same written context.

Non-SARGable Predicates: How a Function in WHERE Kills Your Index

Sat, 14 Jun 2025 00:00:00 +0000

TL;DR

A predicate is SARGable (Search ARGument able) if the database can use an index to evaluate it. Wrapping a column in a function makes the predicate non-SARGable: the engine has to compute the function on every row before it can filter, which means a full table scan no matter what indexes exist on the bare column. The fix isn’t always to rewrite the predicate (sometimes the column’s type or collation is wrong and the code is masking it), but every non-SARGable predicate on a hot path is a performance bug waiting for the table to grow.

Here are two queries that return the exact same rows:

-- Version A
SELECT id, status FROM events
WHERE YEAR(created_at) = 2025;

-- Version B
SELECT id, status FROM events
WHERE created_at >= '2025-01-01' AND created_at < '2026-01-01';

On a 10,000-row events table, both run in under a millisecond and nobody notices the difference. On a 200-million-row events table with an index on created_at, version A does a sequential scan and takes 45 seconds; version B does an index range scan and takes 12 milliseconds. Neither query is wrong. They don’t even disagree about the answer. One just does the same work in a way the planner can’t optimize.

The obvious fix is “rewrite every function-wrapped predicate as a range.” That works for the date-extraction case and a few others. For WHERE LOWER(email) = 'alice@example.com', the rewrite needs to know whether the column’s collation is case-insensitive, and if it isn’t, there’s no direct equivalent, only a functional index or a schema change. The fix depends on why the function is there, and “why” usually points back at something in the schema that’s pretending to be something it isn’t.

What SARGable means in practice

An index on created_at is a sorted structure: the engine can jump to any date range in O(log n) time by walking the B-tree. For the planner to use that index on a predicate, the predicate has to be expressible as “the column is in this range”: a direct comparison between the column and a constant or parameter.

created_at >= '2025-01-01' meets that contract. The planner translates it to “walk the index to the first entry ≥ 2025-01-01, read forward from there.” That’s a range scan.

YEAR(created_at) = 2025 doesn’t meet the contract. The value being compared isn’t created_at; it’s the output of YEAR() applied to created_at. The index on created_at doesn’t know the output of YEAR() for any row without computing it. So the planner falls back to evaluating the function on every row (a sequential scan) and only then filtering.
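The difference is visible in the plan. The example above uses the MySQL function YEAR(); the sketch below is a PostgreSQL counterpart, with EXTRACT standing in for YEAR() and a hypothetical events table definition (the column types are assumptions). No expected plan text is shown, because the planner's choice depends on table size and statistics; on a near-empty table it may pick a sequential scan for both queries.

```sql
-- Hypothetical table shaped like the example above.
CREATE TABLE events (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    status     text NOT NULL,
    created_at timestamptz NOT NULL
);
CREATE INDEX idx_events_created_at ON events (created_at);

-- Function wrapped around the column: the predicate cannot be used
-- as a range condition on idx_events_created_at.
EXPLAIN
SELECT id, status FROM events
WHERE EXTRACT(YEAR FROM created_at) = 2025;

-- Bare column compared to constants: eligible for a range scan
-- on idx_events_created_at.
EXPLAIN
SELECT id, status FROM events
WHERE created_at >= '2025-01-01' AND created_at < '2026-01-01';
```

On a table large enough for the index to matter, the thing to look for is whether the created_at comparison shows up as an index condition or only as a filter applied to every row.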

Common forms of the same mistake:

-- Non-SARGable: function on column → full scan
WHERE LOWER(email) = 'alice@example.com'
WHERE DATE(created_at) = '2025-01-15'
WHERE CAST(price AS INT) > 100
WHERE CONCAT(first_name, ' ', last_name) = 'Alice Smith'

-- SARGable equivalents
WHERE email = 'alice@example.com' -- if collation is case-insensitive
WHERE created_at >= '2025-01-15' AND created_at < '2025-01-16'
WHERE price > 100 -- fix the type at the schema level
WHERE first_name = 'Alice' AND last_name = 'Smith'

Three of the four non-SARGable forms have clean rewrites. The first one (LOWER(email)) depends on collation, which is where a lot of real-world cases live.

The collation case

WHERE LOWER(email) = 'alice@example.com' is almost always a tell that the email column has a case-sensitive collation and the application is hiding it at query time. Two real fixes, one cosmetic fix:

Fix the column. If the data should be matched case-insensitively, give the column a case-insensitive collation. In PostgreSQL that’s CITEXT or a nondeterministic ICU collation (PostgreSQL 12+, created with deterministic = false; the stock "und-x-icu" collation is deterministic and still compares case-sensitively); in MySQL it’s a _ci collation (which is usually the default anyway). Once the column’s collation handles the case folding, WHERE email = 'alice@example.com' is SARGable and fast. This is the right fix when case-insensitivity is a property of the data.
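In PostgreSQL 12 and later, a case-insensitive collation has to be declared as nondeterministic. A sketch, assuming a server built with ICU support and the users.email column used in the examples in this post; the collation name case_insensitive is arbitrary:

```sql
-- Nondeterministic ICU collation that ignores case differences.
CREATE COLLATION case_insensitive (
    provider = icu,
    locale = 'und-u-ks-level2',
    deterministic = false
);

-- Attach it to the column; indexes on email are rebuilt with it.
ALTER TABLE users
    ALTER COLUMN email TYPE text COLLATE case_insensitive;

-- A plain equality is now case-insensitive and can use an index on email.
SELECT * FROM users WHERE email = 'Alice@Example.com';
```

Any unique constraint on the column becomes case-insensitive as well, so the ALTER fails if two existing rows differ only by case. Nondeterministic collations can also restrict some operations, so it is worth testing pattern-matching queries on your PostgreSQL version before relying on this.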

Add a functional (expression) index. If you can’t change the column’s collation (say, a case-sensitive comparison elsewhere in the schema depends on the current behavior), index the expression itself:

-- PostgreSQL: functional index
CREATE INDEX idx_users_email_lower ON users (LOWER(email));
-- Now WHERE LOWER(email) = '...' uses the index

-- MySQL 8.0.13+: functional key part (the query must use the identical expression)
ALTER TABLE users ADD INDEX idx_email_lower ((LOWER(email)));

This works, with caveats. The index’s storage and write cost is real. The predicate has to match the indexed expression exactly: LOWER(email) is indexed, but UPPER(email) isn’t, and the planner won’t translate between them. Every non-SARGable expression you want fast needs its own index.

Cosmetic fix: case-fold at write time. Store the email as already-lowercased. WHERE email = 'alice@example.com' is now SARGable directly, no expression index needed. This usually requires application changes (whoever’s writing has to remember to case-fold), which is why the functional index is more popular even though it’s heavier. Where business logic lives covers the general shape of this decision; case-folding at the database with a generated column (GENERATED ALWAYS AS (LOWER(email)) STORED) is often the cleanest answer when the application can’t be trusted to normalize consistently.
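The generated-column variant might look like this in PostgreSQL 12 or later. It is a sketch; the column name email_lower and the index name are made up for illustration:

```sql
-- The database keeps a lowercased copy in sync on every write.
ALTER TABLE users
    ADD COLUMN email_lower text
    GENERATED ALWAYS AS (lower(email)) STORED;

-- An ordinary index on an ordinary column.
CREATE INDEX idx_users_email_lower_col ON users (email_lower);

-- The caller lowercases the search value once; the column stays bare.
SELECT id
FROM users
WHERE email_lower = lower('Alice@Example.com');
```

Here lower() is applied to the constant, not the column, so the predicate stays SARGable. Adding a stored generated column rewrites the table, which is worth scheduling on a large one.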

Implicit type conversions are the subtler version

The function isn’t always in the query. Sometimes the planner is adding one:

-- account_id is VARCHAR, literal is numeric
WHERE account_id = 12345

MySQL will silently cast every account_id value to a number for comparison: a per-row function call that kills index usage just as effectively as an explicit CAST(). PostgreSQL is stricter and usually errors, but can still do implicit conversions between compatible types that undermine indexes.

The fix is matching types in both directions: the column type should be what the column is (a numeric ID should be BIGINT, not VARCHAR), and the query should write the literal in the column’s type (WHERE account_id = '12345' if the column is genuinely a string). Either fix works; matching the column type to the data’s real shape is usually the durable answer.
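If the column really holds numeric IDs, the durable fix could be sketched as follows for PostgreSQL. The table name accounts is hypothetical, and the pre-check matters because the conversion is only safe when every stored value is a plain integer:

```sql
-- 1. Look for values that would not survive the conversion.
--    Zero-padded values such as '00123' pass this check but lose
--    their padding after the cast, so inspect those separately.
SELECT account_id
FROM accounts
WHERE account_id !~ '^[0-9]+$';

-- 2. If the check returns nothing, change the type. This rewrites
--    the table and rebuilds its indexes.
ALTER TABLE accounts
    ALTER COLUMN account_id TYPE bigint
    USING account_id::bigint;

-- 3. Literal and column now agree, so no per-row cast is needed.
SELECT * FROM accounts WHERE account_id = 12345;
```

If the column is genuinely a string, skip the ALTER and quote the literal in the query instead.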

This is also where mixed PK strategies show up. Joining a BIGINT id to a UUID id doesn’t just return wrong results; on MySQL it forces a per-row conversion (the string side is converted to a number before the comparison), which is the same implicit-function problem dressed up as a join.

When non-SARGable is acceptable

Not every non-SARGable predicate is a bug. Three cases where it’s fine:

Small tables. A 5,000-row lookup table with a function-wrapped predicate scans in microseconds. The planner isn’t going to use an index on that size anyway. WHERE UPPER(code) = 'NY' on a 50-row states table is not worth worrying about.

One-off analytical queries. A one-time data extract that scans a large table is going to scan it regardless. If the query will never run again, the function call isn’t the bottleneck (the table size is) and adding a functional index to optimize one query isn’t worth the write cost on every future insert.

When the function genuinely can’t be avoided. Some predicates legitimately need to compute. WHERE haversine_distance(lat, lng, user_lat, user_lng) < 10 on a geospatial query can’t be rewritten as a simple range; you need a spatial index (PostGIS, MySQL spatial extensions) to make it SARGable in the geometric sense. The fix is a different kind of index, not a rewrite.
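For the geospatial case, a PostGIS version could look like the sketch below. It assumes the PostGIS extension is available, a hypothetical places table, and that the threshold of 10 means 10 kilometres; none of that comes from the predicate above.

```sql
CREATE EXTENSION IF NOT EXISTS postgis;

-- Hypothetical table storing each location as a geography point.
CREATE TABLE places (
    place_id bigint PRIMARY KEY,
    geog     geography(Point, 4326) NOT NULL
);

-- Spatial (GiST) index on the location column.
CREATE INDEX idx_places_geog ON places USING GIST (geog);

-- Places within 10,000 metres of a given longitude/latitude.
-- ST_MakePoint takes (longitude, latitude) in that order.
SELECT place_id
FROM places
WHERE ST_DWithin(
    geog,
    ST_SetSRID(ST_MakePoint(-73.9857, 40.7484), 4326)::geography,
    10000
);
```

ST_DWithin is written so that the spatial index can narrow the candidates before exact distances are computed, which is the geometric counterpart of a range scan.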

Why natural-language-to-SQL tilts non-SARGable

Schema-reading assistants and text-to-SQL models produce this class of bug more often than hand-written queries do. A user asks “events in 2025”; the closest English-to-SQL mapping is WHERE YEAR(created_at) = 2025, and that’s what the model writes. The correct form (a half-open range) requires knowing the calendar boundary of the year and producing two comparison operators, which is a less-direct translation of the question. WHERE LOWER(email) = 'alice@example.com' is the natural translation of “find the user with this email, case-insensitive,” even when the column’s collation already handles case and the function wrap defeats the index it would otherwise use.

The catalog-level fix is the same one the bigger-picture section below points at: model the column so the natural query is already SARGable. Pick a case-insensitive collation on email, store prices as NUMERIC so no cast is needed, partition or index date columns so the range-literal form performs. When the schema matches the shape of the question, the model’s default translation works. When it doesn’t, the model produces a query that runs clean and scans the table, and no plan inspection is built into the generation loop to catch it.

The bigger picture

Non-SARGable predicates are easy to write, and they come from somewhere: almost always a schema decision that’s being papered over at query time. LOWER(email) hides a collation mismatch. CAST(price AS INT) hides a type that should have been NUMERIC from the start. DATE(created_at) hides the fact that the query is answering a date-range question but written in a way that reads more naturally as an equality. Every one of these is a query-level workaround for a schema-level issue, and every one of them costs an index when the table grows large enough to care.

EXPLAIN ANALYZE is the diagnostic. If the plan shows a sequential scan on a predicate that should hit an index, the predicate is almost certainly non-SARGable; look at what’s wrapping the column. Fix the schema if you can, add a functional index if you can’t, and treat non-SARGable predicates on hot paths as latent performance bugs, not style issues.
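A possible starting point for that hunt in PostgreSQL is the statistics view pg_stat_user_tables, which shows where sequential scans are reading the most rows. The LIMIT and the example query are arbitrary choices for illustration:

```sql
-- Tables whose sequential scans have read the most rows since the
-- statistics were last reset.
SELECT relname, seq_scan, seq_tup_read, idx_scan
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 20;

-- Then inspect a suspect query against one of those tables.
-- EXPLAIN ANALYZE actually executes the statement, so use it with
-- care on anything that writes.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM users WHERE LOWER(email) = 'alice@example.com';
```

A high seq_tup_read on a large table is only a hint; the plan for the individual query is what confirms whether a wrapped column is the cause.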

The Bare `id` Primary Key: When Every Table Joins to Every Other Table

Tue, 27 May 2025 00:00:00 +0000

TL;DR

A bare id primary key on every table makes a.id = b.id valid SQL between any two tables, which means neither a human reviewing the query nor an LLM generating one can tell which of those equalities are meaningful. Name primary keys after the table they identify, and the schema describes its own relationships.

Here’s a query an AI assistant generated against a real production schema:

SELECT u.email, a.payload
FROM users u
JOIN actions a ON u.id = a.id
WHERE u.email = 'alice@example.com';

Syntactically clean. Ran without error. Returned zero rows, which the assistant reported back as “this user has no actions.” The real answer: users.id is a BIGINT and actions.id is a CHAR(36) UUID. MySQL silently converted each UUID string to a number, compared it to the integer, and found no match. The join wasn’t wrong, exactly. It was meaningless, and the database had no way to say so.

The experienced reader’s first fix is “just use UUIDs everywhere” or “enforce the type at join time.” Neither works. The footgun isn’t the type mismatch; it’s the column name. When every table’s primary key is named id, a.id = b.id is a valid expression between any two tables in the schema, and nothing in the column names tells you whether that expression means anything. Fix the types and you close one failure mode; the identically typed, semantically unrelated case (users.id = 42 matched against orders.id = 42) still ships.

What nobody can see

The <table>_id convention is older than most of us, and the case for it is usually framed as clarity or style. The sharper framing is that bare id hides the information that matters most at the point of the join (which table’s identity is being compared, and whether comparing them makes sense) from every reader of the query.

The query’s reviewer. ON u.id = a.id gives no hint of what’s being matched. A human reviewer has to carry the table-to-alias mapping (u is users, a is actions) and the table-to-type mapping (users.id is BIGINT, actions.id is UUID) in working memory, then cross-check them against the join condition. None of those steps are hard, but reviewers skip them because the column names look symmetric. Two .id references read as “joining on primary keys,” which is the kind of join nobody flags.

The LLM reading the schema. An assistant generating SQL from the catalog sees users(id BIGINT, ...) and actions(id CHAR(36), ...) as two tables with primary keys named id. Absent a full column-type check on every candidate join (and most schema-reading prompts don’t do this), the natural-looking join between “a user and their actions” is u.id = a.id, which is exactly wrong. The schema presented the column as joinable; the LLM took it at face value. The same mistake a tired human makes, but at scale and without fatigue to blame.

The static analyzer. Linters and schema-aware query builders operate on names first and types second. A rule that warns on suspicious cross-table joins has no signal to fire on when both sides are .id; the column names match, so the join is “legitimate” by shape. The same rule on users.user_id = actions.action_id would flag it immediately, because the names would be obviously non-corresponding.

None of these readers are missing a step they should have taken. They’re all doing the reasonable thing, and the reasonable thing produces wrong queries because the schema is telling them id is id in both tables.

Three failure modes, ranked by how loudly they fail

Three distinct outcomes hide behind a.id = b.id, and they don’t fail equally:

PostgreSQL, mixed types. The comparison errors out with operator does not exist: bigint = uuid. Loud, caught in development, fixed before merge. The best failure mode.
MySQL, mixed types. Silent numeric coercion of the string side, and usually zero rows returned. The opening example. Bad, because “no results” looks like valid data to every downstream consumer.
Either engine, same type but semantically unrelated. BIGINT users.id = 42 matched against BIGINT orders.id = 42 returns the rows where the integers happen to collide. The query runs, the result set isn’t empty, and the rows look plausible because they’re real rows from real tables. The worst failure mode, because nothing about the output signals that the join was nonsense.

The first two are loud enough to catch in review. The third is the one that ships. The third is the default once more than one table in the schema uses a plain BIGINT id, which is almost every relational schema in existence.
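The third case is easy to reproduce. A self-contained sketch for PostgreSQL, using temporary tables with made-up names and rows so it does not touch any real schema:

```sql
CREATE TEMPORARY TABLE demo_users (
    id    bigint PRIMARY KEY,
    email text NOT NULL
);
CREATE TEMPORARY TABLE demo_orders (
    id      bigint PRIMARY KEY,
    user_id bigint NOT NULL REFERENCES demo_users (id),
    total   numeric(10, 2) NOT NULL
);

INSERT INTO demo_users VALUES (7, 'bob@example.com'), (42, 'alice@example.com');
INSERT INTO demo_orders VALUES (42, 7, 19.99);  -- order 42 belongs to user 7

-- Nonsense join: pairs user 42 with order 42 because the integers collide.
SELECT u.email, o.id AS order_id, o.total
FROM demo_users u
JOIN demo_orders o ON u.id = o.id;
-- returns: alice@example.com | 42 | 19.99

-- Intended join: follows the foreign key.
SELECT u.email, o.id AS order_id, o.total
FROM demo_users u
JOIN demo_orders o ON o.user_id = u.id;
-- returns: bob@example.com | 42 | 19.99
```

Both queries return one plausible-looking row. Only the second one is about Bob's order.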

Zero rows looks like no data

A join that silently returns zero rows because of a type coercion is indistinguishable from a join that legitimately has no matches. Code generators, dashboards, and AI assistants all interpret empty results as “the relationship exists but has no rows,” not “the query is nonsense.” The failure hides inside success.

Mixed PK types make the naming problem sharper

Production schemas rarely stay on one PK strategy for long. The original tables are usually BIGINT AUTO_INCREMENT because the framework defaulted to it; a newer service switches to UUIDs to let clients generate IDs offline or to distribute across shards; join tables pick up composite keys because (user_id, role_id) is the natural identity. Nothing in the schema announces which tables fall into which bucket; SHOW CREATE TABLE or \d is the only source of truth, and even that requires reading every table to know what joins are legal.

Mixed types are where the naming footgun turns from theoretical to frequent. When every PK was a BIGINT, the “same type but semantically unrelated” case was the main risk and reviewers caught most of it. Once the schema has BIGINT and UUID sitting next to each other (all named id) the mismatched-type cases pile on top, and “no data found” becomes a regular report from any tool generating queries from the schema.

The sizing question (when to pick BIGINT versus UUID versus UUIDv7 versus composite, and what each costs at the index level) is covered separately in Random UUIDs as Primary Keys. The two problems interact but have independent fixes: pick your PK types deliberately, and name them so the schema describes its own relationships. Neither fix substitutes for the other.

Naming is the lever that actually helps

Naming is what makes a schema describe its own relationships without requiring the reader (human or otherwise) to open every CREATE TABLE. Two conventions, consistently applied, close most of the gap:

Name the primary key after the table. users.user_id, orders.order_id, actions.action_id. The equality users.user_id = orders.order_id reads as obvious nonsense, because the column names are no longer identical. Reviewers see it, LLMs don’t produce it, linters can flag it. The cost is a small amount of redundancy in queries (users.user_id instead of users.id), which is almost always a fair trade. This lines up with the broader guidance in Schema Conventions and Why They Matter.

Foreign keys mirror the target PK. orders.user_id clearly references users.user_id. actions.user_id clearly references users.user_id. This is already common practice; the only change is that the target’s PK name matches, closing the loop. Foreign Keys Are Not Optional covers why the FK itself matters; naming is what makes the FK legible without the REFERENCES clause in hand.
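Applied to a fresh pair of tables, the two conventions might look like this. The columns other than the keys are illustrative:

```sql
CREATE TABLE users (
    user_id bigint PRIMARY KEY,
    email   text NOT NULL UNIQUE
);

CREATE TABLE orders (
    order_id  bigint PRIMARY KEY,
    user_id   bigint NOT NULL REFERENCES users (user_id),
    placed_at timestamptz NOT NULL DEFAULT now()
);

-- The meaningful join names the same column on both sides,
-- so USING works and reads naturally.
SELECT u.email, o.order_id
FROM users u
JOIN orders o USING (user_id);

-- The meaningless join still parses, but now it looks wrong on sight:
--   JOIN orders o ON u.user_id = o.order_id
```

The database still accepts the nonsense comparison; what changes is that a reviewer, a linter, or a model has a name mismatch to react to.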

The bare id convention is defensible when the PK column only ever shows up in queries alongside its table name (users.id) and never as a bare id in a SELECT list or join condition. That discipline is hard to enforce across a team over years, and every framework’s default query builder produces SELECT id FROM users without thinking about it. The naming fix makes the discipline unnecessary.

When bare `id` is actually fine

Not every schema needs to bend. A small application, a service with a handful of tables, or a database where every query is reviewed by one team has plenty of context to keep the a.id = b.id landmine out of reach. The cost of the convention scales with the number of tables, the number of engineers, and the number of non-human query generators; in the small case it rarely shows up.

What changes once any of those numbers grow: nobody remembers which tables are BIGINT versus UUID, the assistant pattern of generating queries from schema is routine, and the review process that caught a.id = b.id in a 20-table schema can’t read every join in a 400-table one. At that size the convention pays rent, and renaming PKs is a migration that gets slower every quarter.

The bigger picture

A schema’s job is to hold data correctly and describe its own shape well enough that the tools reading it can reason about relationships without reading every line. The bare id PK is a small departure from that (one column name shared across tables) but it’s the departure that most consistently produces silent-wrong-answer queries, because SQL has no way to distinguish “same name, same meaning” from “same name, different meaning.”

Name the primary key after the table it identifies, so the schema tells its own story when someone (human or otherwise) joins two of them together. It costs almost nothing on day one and leaves the schema legible at 400 tables.

Polymorphic References Are Not Foreign Keys

Sat, 10 May 2025 00:00:00 +0000

TL;DR

A polymorphic reference is resource_id plus resource_type where the type string chooses which table the ID points to. ORMs make it a one-liner; the database enforces nothing. Reads need conditional joins, orphans accumulate silently, and for most uses (comments, notifications, attachments) per-target tables or mutually-exclusive FKs are the better trade.

What the pattern looks like

CREATE TABLE notifications (
 id BIGINT PRIMARY KEY,
 user_id BIGINT NOT NULL REFERENCES users(id),
 resource_id BIGINT NOT NULL,
 resource_type VARCHAR(50) NOT NULL,
 message TEXT NOT NULL,
 created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- resource_type = 'order' → resource_id references orders.id
-- resource_type = 'invoice' → resource_id references invoices.id
-- resource_type = 'ticket' → resource_id references support_tickets.id

The tell is resource_id BIGINT NOT NULL with no REFERENCES clause; it can’t have one, because there are multiple targets. What the application treats as a foreign key is, at the database level, a plain integer with a sibling tag string.

What the database can’t do

The cost shows up as absence: every mechanism the database offers for reasoning about relationships is disabled, because the column’s meaning depends on data in another column.

No foreign key. A REFERENCES clause names exactly one target. Orphaned resource_id values are a write-time non-event and a read-time mystery. (Foreign Keys Are Not Optional covers the general cost; polymorphic is the case where skipping isn’t a choice.)
No cascade. Delete an order and nothing cleans up the notifications pointing at it. The application has to know every table that might hold a polymorphic reference to orders and clean each one. New tables added later don’t get noticed.
No planner metadata. Foreign keys feed join ordering and row estimates, especially in PostgreSQL. The planner sees resource_id as a BIGINT with a histogram and no known target.
No schema-level description. Anything that reads the catalog (ERD tools, query generators, AI assistants, typed-client generators) sees no link between notifications.resource_id and the tables it points at. The mapping lives in model files and string literals. (Comment Your Schema helps here but can’t fully restore the information.)

Orphans accumulate silently

A polymorphic column with no FK and no cascade develops orphans over time. Reads paper over them with LEFT JOIN ... WHERE target.id IS NOT NULL, so the broken rows disappear from the UI but stay in the table. In schemas a few years old, the orphan rate is rarely zero, and nobody designed for it.
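Measuring the orphan rate takes one anti-join per target. A sketch against the notifications table defined above, assuming the three target types listed in its comments are the only valid ones:

```sql
-- Count notifications whose target row no longer exists,
-- grouped by the type tag they carry.
SELECT n.resource_type,
       count(*) AS orphaned_rows
FROM notifications n
WHERE CASE n.resource_type
        WHEN 'order' THEN
          NOT EXISTS (SELECT 1 FROM orders o WHERE o.id = n.resource_id)
        WHEN 'invoice' THEN
          NOT EXISTS (SELECT 1 FROM invoices i WHERE i.id = n.resource_id)
        WHEN 'ticket' THEN
          NOT EXISTS (SELECT 1 FROM support_tickets t WHERE t.id = n.resource_id)
        ELSE true  -- unrecognized type strings count as orphans too
      END
GROUP BY n.resource_type
ORDER BY orphaned_rows DESC;
```

The ELSE branch also surfaces typos and retired type strings, which no FK would have allowed in the first place.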

Reads pay for the write-side convenience

The absent FK is the schema problem. The read-path shape is where the cost becomes daily. A query that needs any column from the referenced row can’t write a single join; the target depends on a per-row value, and SQL’s join syntax takes a static target.

-- Conditional LEFT JOIN per target
SELECT n.id, n.message,
 COALESCE(o.order_number, i.invoice_number, t.ticket_code) AS ref
FROM notifications n
LEFT JOIN orders o ON n.resource_type = 'order' AND n.resource_id = o.id
LEFT JOIN invoices i ON n.resource_type = 'invoice' AND n.resource_id = i.id
LEFT JOIN support_tickets t
 ON n.resource_type = 'ticket' AND n.resource_id = t.id
WHERE n.user_id = 42;

Every new target type adds a join clause here and in every other read-path query that displays a related field. The alternative (a UNION ALL per target) is narrower per branch but scales linearly with target count and pushes pagination up to the union level. Most ORMs’ default resolution is one query per (resource_type, resource_id) group, which is the N+1 pattern that makes polymorphic feeds slow once the target set widens.
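For comparison, the UNION ALL form of the same read could be sketched as below. The casts to text are an assumption, since the types of order_number, invoice_number, and ticket_code are not stated; the ORDER BY and LIMIT show what pagination at the union level means.

```sql
SELECT n.id, n.message, n.created_at, o.order_number::text AS ref
FROM notifications n
JOIN orders o ON o.id = n.resource_id
WHERE n.resource_type = 'order' AND n.user_id = 42

UNION ALL

SELECT n.id, n.message, n.created_at, i.invoice_number::text
FROM notifications n
JOIN invoices i ON i.id = n.resource_id
WHERE n.resource_type = 'invoice' AND n.user_id = 42

UNION ALL

SELECT n.id, n.message, n.created_at, t.ticket_code::text
FROM notifications n
JOIN support_tickets t ON t.id = n.resource_id
WHERE n.resource_type = 'ticket' AND n.user_id = 42

-- Sorting and paging apply to the combined result, not to each branch.
ORDER BY created_at DESC
LIMIT 20;
```

Unlike the LEFT JOIN version, the inner joins here drop orphaned notifications silently, which may or may not be what the feed wants.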

“One column can point at many tables” on the write side turns into “every read query enumerates every possible table” on the read side. The symmetry people expect isn’t there.

Why the pattern spreads

It’s the path of least resistance that framework ergonomics encourage. Rails’ polymorphic: true, Django’s GenericForeignKey, and Laravel’s morphTo turn what would otherwise be multiple belongs_to associations and a migration into a one-liner. “Comments on orders” and “comments on invoices” look like duplication, so a single comments table with commentable_id / commentable_type feels cleaner. An open-ended “add comments to anything” product ask reads as an argument against committing to a target list.

Each of those framings overweights the write-side cost (another table or another FK column) and underweights the integrity loss (no enforcement, no cascades, schema no longer describes itself). ORMs Are a Coupling covers the broader trade. Polymorphic is the canonical case where the ORM’s preferred shape is actively incompatible with what the database wants to enforce.

What the schema-reading assistant sees

A tool reading the catalog (Copilot on a schema dump, an MCP-backed agent, a RAG pipeline indexing DDL) sees notifications.resource_id BIGINT NOT NULL with no REFERENCES clause and no way to tell the column is anything other than an integer. Asked for “notifications about orders,” the assistant’s best guess is notifications.resource_id = orders.id: a join that runs clean, returns every notification whose resource_id happens to collide with an order ID (which includes invoice notifications, ticket notifications, and anything else pointing at an integer that also appears in orders), and surfaces plausible-looking but semantically nonsense rows. The resource_type filter that would make the join correct is the piece the schema doesn’t advertise.
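Concretely, the guess and the intended query differ by one predicate. The sketch below reuses the notifications and orders tables from earlier in this post; the comment text is illustrative wording, not taken from a real schema:

```sql
-- What a catalog-only reader tends to produce: runs, but matches any
-- notification whose resource_id collides with an order id.
SELECT n.id, n.message
FROM notifications n
JOIN orders o ON n.resource_id = o.id;

-- What was meant: the type tag restricts the join to order notifications.
SELECT n.id, n.message
FROM notifications n
JOIN orders o
  ON n.resource_type = 'order'
 AND n.resource_id = o.id;

-- Until the column is split into real FKs, a comment at least puts the
-- rule where a schema reader can find it.
COMMENT ON COLUMN notifications.resource_id IS
  'Polymorphic: target table is chosen by resource_type (order -> orders.id, invoice -> invoices.id, ticket -> support_tickets.id). Always filter on resource_type when joining.';
```

The comment is a stopgap that depends on the reader looking at it; it does not enforce anything.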

This is the structural version of the problem covered in the bare id primary key: schema that can’t describe its own relationships forces every reader to guess, and schema-reading models guess confidently. Pulling the polymorphic column apart (per-target tables, mutually-exclusive FKs, supertype) restores the signal in the catalog. The assistant stops hallucinating the join; any RAG system indexing the schema picks up real REFERENCES metadata; the next engineer reading the table doesn’t need to grep the ORM models to find out which target types exist. The integrity win and the catalog-legibility win come in the same migration.

Alternatives

Each alternative gives back some of the database’s relational machinery at different levels of verbosity.

Per-target tables. Split along the target dimension: order_notifications, invoice_notifications, ticket_notifications, each with a real FK. Real cascades, real planner metadata, self-describing schema. Cost: duplicated column sets and an explicit UNION ALL for cross-target reads. That union already exists implicitly in the polymorphic shape, just moved from the read query into typed branches.
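A sketch of the per-target shape in PostgreSQL, assuming the users, orders, invoices, and support_tickets tables from the earlier example exist. ON DELETE CASCADE and the view name all_notifications are choices made for illustration:

```sql
CREATE TABLE order_notifications (
    id         bigint PRIMARY KEY,
    user_id    bigint NOT NULL REFERENCES users (id),
    order_id   bigint NOT NULL REFERENCES orders (id) ON DELETE CASCADE,
    message    text NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE invoice_notifications (
    id         bigint PRIMARY KEY,
    user_id    bigint NOT NULL REFERENCES users (id),
    invoice_id bigint NOT NULL REFERENCES invoices (id) ON DELETE CASCADE,
    message    text NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE ticket_notifications (
    id         bigint PRIMARY KEY,
    user_id    bigint NOT NULL REFERENCES users (id),
    ticket_id  bigint NOT NULL REFERENCES support_tickets (id) ON DELETE CASCADE,
    message    text NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- The cross-target read, written once instead of in every query.
CREATE VIEW all_notifications AS
SELECT 'order'::text AS resource_type, id, user_id,
       order_id AS resource_id, message, created_at
FROM order_notifications
UNION ALL
SELECT 'invoice'::text, id, user_id, invoice_id, message, created_at
FROM invoice_notifications
UNION ALL
SELECT 'ticket'::text, id, user_id, ticket_id, message, created_at
FROM ticket_notifications;
```

Because each table has its own id sequence, a row in the view is identified by (resource_type, id) rather than by id alone.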

Mutually-exclusive nullable FKs with CHECK. One table, one FK column per target, a constraint enforcing exactly one is non-null:

CREATE TABLE notifications (
 id BIGINT PRIMARY KEY,
 user_id BIGINT NOT NULL REFERENCES users(id),
 order_id BIGINT REFERENCES orders(id),
 invoice_id BIGINT REFERENCES invoices(id),
 ticket_id BIGINT REFERENCES support_tickets(id),
 message TEXT NOT NULL,
 CONSTRAINT exactly_one_target CHECK (
 (order_id IS NOT NULL)::int +
 (invoice_id IS NOT NULL)::int +
 (ticket_id IS NOT NULL)::int = 1
 )
);

Real FKs per target, real cascades, row’s meaning unambiguous. Scales reasonably up to a handful of targets and stops scaling cleanly somewhere around ten.

Supertype table. A shared parent table carries a common ID; each target type’s table references the parent. The polymorphic column then points at the parent, which is a single real FK. Cleanest structural answer and the one with the highest adoption cost; retrofitting this onto an existing schema is substantial migration work.
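The supertype shape might be sketched like this for PostgreSQL. The table name resources, the allowed type list, and the cascade behavior are assumptions, and the notifications definition shows the target shape after a backfill rather than a statement to run against the existing table:

```sql
-- Parent table: one row per referenceable thing, whatever its type.
CREATE TABLE resources (
    resource_id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    resource_type text NOT NULL
        CHECK (resource_type IN ('order', 'invoice', 'ticket'))
);

-- Each target table points at its parent row (backfill not shown).
ALTER TABLE orders
    ADD COLUMN resource_id bigint UNIQUE REFERENCES resources (resource_id);
ALTER TABLE invoices
    ADD COLUMN resource_id bigint UNIQUE REFERENCES resources (resource_id);
ALTER TABLE support_tickets
    ADD COLUMN resource_id bigint UNIQUE REFERENCES resources (resource_id);

-- Target shape: the reference is now a single, real foreign key.
CREATE TABLE notifications (
    id          bigint PRIMARY KEY,
    user_id     bigint NOT NULL REFERENCES users (id),
    resource_id bigint NOT NULL
        REFERENCES resources (resource_id) ON DELETE CASCADE,
    message     text NOT NULL,
    created_at  timestamptz NOT NULL DEFAULT now()
);
```

The type tag moves to the parent row, so a notification can no longer claim a type that disagrees with what it points at.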

When polymorphic is actually the right call

The trade-offs stack up unfavorably for most common uses, but not all. The pattern earns its keep when the relationship is genuinely best-effort: audit events, activity logs, “recently viewed” lists, undo history, where a lost reference is a recoverable annoyance rather than a correctness incident. The FK was never going to be load-bearing, and the polymorphic shape matches the actual semantics: “reference anything, and if it’s gone, show a tombstone.”

Outside that zone the default bias should run the other way. A comment system with three possible parents is not a case for polymorphism; it’s a case for three comment tables or mutually-exclusive FK columns, with the ORM abstracting the read-side stitching.

The bigger picture

Polymorphic references are a specific case of a broader pattern: designs that move information out of the schema and into the application, in exchange for ergonomics in the model layer. The schema drifts from “self-describing relational structure” toward “indexed key-value store the application interprets.” That’s a legitimate position (DynamoDB and friends live there on purpose) but a relational database running on polymorphic associations is paying for a relational engine and choosing not to use most of what it offers.

The pattern isn’t wrong. It’s an aggressive trade, priced on day one by the convenience of polymorphic: true and on day three hundred by the silent orphan count, the conditional joins, and resource_id BIGINT telling no one what the table is related to. Reach for it on purpose. Keep the option of pulling it back onto typed FK columns open, because the migrations away are slower the longer the schema has been pretending the reference isn’t there.

ORMs Are a Coupling, Not an Abstraction

Wed, 23 Apr 2025 00:00:00 +0000

TL;DR

An ORM is a coupling between schema shape and code shape, not an abstraction over it. The coupling pays off in year one and compounds against you in year five. For long-lived OLTP systems, a thinner layer over raw SQL (sqlc, jOOQ, typed query builders) ages better.

There’s a period early in a project where an ORM feels like pure upside. You define a model, the framework generates a migration, and User.where(email: …) returns typed objects. No SQL to write, no mapping layer to maintain, no integration boilerplate. Five years later the same project has four migration directories, a model class with thirty custom methods overriding the ORM defaults, team memory of which relations are lazy-loaded and which aren’t, and a quarterly discussion about whether it’s time to upgrade Rails 4 to Rails 7 or skip straight to something else entirely.

Somewhere between those two points, the ORM stopped being an abstraction and became a coupling: a bidirectional contract between schema and code that both sides have to honor for every change. The contract shapes more than how changes propagate. It also shapes the schema itself, because an ORM’s default output is a database structured like the class graph rather than one designed for the workload. Short-lived prototypes and simple CRUD apps still benefit from ORMs. The defensible use cases are narrower than the industry’s default deployment pattern suggests, and the coupling is real, durable, and consistently underestimated at the point a team decides to adopt one.

The oddity worth pausing on

SQL is arguably the most widely-deployed, longest-lived programming language in the industry. Every major database speaks it, every backend engineer eventually learns it, and the core DDL and DML haven’t meaningfully changed in decades. The ORMs wrapping it are the opposite: framework-specific, tied to a particular version of a particular stack, with conventions that differ across ecosystems and shift across major releases. The default across most engineering orgs is to go out of their way to adopt the less portable, less stable of the two and hide the more durable one behind it. A team joining a new project expects to relearn the ORM. Nobody expects to relearn SELECT.

The rest of this post offers one answer for why that’s the default: the coupling an ORM introduces hides its cost long enough that the trade looks very different in year one than it does in year five.

What the ORM is actually doing

The name “object-relational mapping” suggests abstraction, as if the mapping were hidden plumbing. The practical reality is the opposite: the mapping is the product. An ORM takes your schema shape and projects it onto code shape. Columns become fields. Tables become classes. Foreign keys become methods. Indexes are invisible until you care about them. Constraints are whatever the ORM’s DSL exposes and nothing more.

That projection is useful. It lets application code avoid SQL, most of the time. It also means the code and schema are now two views of the same data model, and those views have to be kept in sync: by you, by your migration framework, by your tests, and by every developer who touches either side.

Staying in sync, in practice, means every schema change is also a code change. Every code change that adds a field triggers a schema change. Every migration is a coordinated edit across multiple files. The coupling isn’t an implementation detail; it’s the defining characteristic of the tool.

Source of truth: pick one, know which

Every ORM ecosystem has a default answer to “where does the schema canonically live”, and most teams never think about it.

Model-first. Rails and Django generate migrations from changes to model classes. The model is the source of truth; the schema follows. Running rails db:schema:dump produces a schema.rb that describes the current state, and the migration files are the history of how it got there.
Schema-first. sqlc and jOOQ read SQL DDL files and generate typed client code. The schema is the source of truth; the code follows.
Hybrid / unclear. Hibernate can do either, depending on configuration. SQLAlchemy lets you declare models in Python and generate migrations via Alembic, or point Alembic at an existing schema and generate models. Teams that don’t decide end up doing both.

The hybrid case is where the real damage happens. Over years, a team that migrates from model-first to schema-first (or vice versa) without a clean cutover ends up with a schema that neither the models nor the migration history correctly describes. Rows backfilled by a DBA with direct SQL don’t show up in the ORM’s understanding of the world. Columns added by a production hotfix get rediscovered six months later when someone regenerates models from the database.

The fix isn’t to prefer one approach over the other. It’s to decide, document, and enforce, the way you would any other convention.
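One way to enforce the decision is a drift check that compares what the code side believes the schema is against the live catalog. The sketch below is a minimal illustration against SQLite, not any particular tool’s behavior; EXPECTED_COLUMNS, find_drift, and the legacy_ref column are hypothetical names standing in for whatever your models or checked-in DDL declare.

```python
import sqlite3

# Hypothetical: what the code side (models or checked-in DDL) believes exists.
EXPECTED_COLUMNS = {
    "users": {"id", "email", "created_at"},
}

def actual_columns(conn, table):
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    return {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}

def find_drift(conn, expected):
    drift = {}
    for table, cols in expected.items():
        actual = actual_columns(conn, table)
        missing = cols - actual   # declared in code, absent in the database
        unknown = actual - cols   # present in the database, unknown to the code
        if missing or unknown:
            drift[table] = {"missing": sorted(missing), "unknown": sorted(unknown)}
    return drift

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL, created_at TEXT NOT NULL)"
)
# Simulate a production hotfix applied with direct SQL, outside the migration history.
conn.execute("ALTER TABLE users ADD COLUMN legacy_ref TEXT")

drift = find_drift(conn, EXPECTED_COLUMNS)
print(drift)
```

Run in CI against each environment, a check like this surfaces the hotfix column on the day it lands rather than six months later.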

Migrations stop being “DB work”

In a raw-SQL codebase, a schema migration is a single file: CREATE TABLE, ALTER TABLE, DROP COLUMN. The migration is the change.

In an ORM codebase, a single logical schema change is typically:

A migration file (add_email_to_users.rb).
The model class (User#email getter, validation, serialize calls).
The serializer (UserSerializer#email).
The API contract (OpenAPI spec, GraphQL schema, whatever the team uses).
Fixtures and factories (FactoryBot, factory_boy, test data).
Query helpers that need to know the new column.
Type stubs or generated types (TypeScript declarations, Python stubs).
Admin UI config, sometimes.

What should be a single metadata-level change is now a coordinated edit across five to eight files, and missing any one of them produces a subtly broken application. The ORM didn’t create the complexity; it distributed it. The schema change is still one change. It just has to be propagated to every place the code has a mirror of the schema.

At small scale this is fine. The friction compounds once the team is big enough that the people writing the migration aren’t the same people owning the serializers and the API consumers. A schema change now requires coordinating across teams, each with their own view of the data model, each needing their files updated. The schema itself didn’t get harder to change. The ORM layer around it did.

Hidden queries

The ORM generates SQL you didn’t write. That’s the value proposition. It’s also a persistent failure mode.

Lazy loading. user.orders triggers a query. user.orders.first.line_items triggers another. In a loop over 100 users, that’s at least 101 queries, none of them visible in the code. The classic N+1.
Implicit joins. .includes(:orders) eager-loads associations, but only if someone remembers to write it. The default is lazy. Defaults win.
Magic methods. where(status: :active).first_or_create(email: …) is three or four queries depending on the code path, and the code says nothing about it.
Generated sort and filter. User.order(:created_at).limit(10) on a table without an index on created_at does a full table scan. The query was generated by the ORM; the reviewer never saw it.

None of these are the ORM doing something wrong. They’re the ORM doing exactly what it said it would. The cost is that the SQL the database actually runs isn’t in version control, isn’t code-reviewed, and isn’t profiled until it shows up in slow-query logs. Every ORM codebase accumulates query shapes nobody intentionally wrote.
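The lazy-loading arithmetic is easy to reproduce without an ORM. The sketch below imitates what a lazy association does (one query for the parents, one per parent for the children) and counts the statements the database actually receives, using Python’s sqlite3 trace callback. The table names and the lazy/eager helpers are illustrative assumptions, not any ORM’s generated SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id),
    total INTEGER NOT NULL
);
""")
conn.executemany(
    "INSERT INTO users (id, email) VALUES (?, ?)",
    [(i, f"u{i}@example.com") for i in range(1, 101)],
)
conn.executemany(
    "INSERT INTO orders (user_id, total) VALUES (?, ?)",
    [(i, 10 * i) for i in range(1, 101)],
)
conn.commit()

statements = []
conn.set_trace_callback(statements.append)  # records every statement SQLite runs

def count_selects(fn):
    statements.clear()
    fn()
    return sum(1 for s in statements if s.lstrip().upper().startswith("SELECT"))

def lazy():
    # What a loop over user.orders does under lazy loading: 1 + N queries.
    for (user_id,) in conn.execute("SELECT id FROM users").fetchall():
        conn.execute(
            "SELECT id, total FROM orders WHERE user_id = ?", (user_id,)
        ).fetchall()

def eager():
    # The same data in one round trip.
    conn.execute(
        "SELECT u.id, o.id, o.total FROM users u LEFT JOIN orders o ON o.user_id = u.id"
    ).fetchall()

lazy_count = count_selects(lazy)
eager_count = count_selects(eager)
print(lazy_count, eager_count)  # 101 1
```

The loop and the join read identically at the call site in most ORMs; only the statement count tells them apart, which is why it has to be measured rather than reviewed.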

The queries you don't see

The SQL emitted by an ORM is invisible until something breaks. Code review covers the method call; the database sees three joins and a subquery. Teams relying heavily on ORMs end up needing separate tooling (query logs, APM, pg_stat_statements, EXPLAIN on every slow path) just to know what’s actually running.

Two query languages, neither complete

Past the CRUD ceiling, every ORM codebase ends up with raw SQL living alongside ORM calls. Window functions, recursive CTEs, PostgreSQL DISTINCT ON, LATERAL joins, MySQL INSERT ... ON DUPLICATE KEY UPDATE with complex update clauses, exclusion constraints, full-text search, spatial queries: the list of things awkward or impossible to express through the ORM grows over the life of the project.

The result is a codebase with two query languages coexisting. Reviewers have to know both. Type safety is uneven; ORM calls produce typed objects, raw SQL produces hashes or arrays that need manual mapping. The two styles drift. The ORM-side queries follow the ORM’s conventions; the raw-SQL queries follow whatever the author happened to write that day.

The honest consequence: past a certain complexity threshold, the ORM isn’t reducing the SQL surface area, it’s adding a second layer on top of it. The SQL didn’t go away. It got pushed into the half of the codebase that’s harder to trace.

Bidirectional coupling

The part that surprises teams is how hard it is to leave.

Migrating a database schema (renaming a column, changing a type, splitting a table) is mechanical. It’s a migration file and a deploy window. The mechanics are well-understood and the blast radius is bounded.

Migrating off an ORM is not mechanical. The ORM’s conventions have bled into:

Controller and API code. JSON shapes match model attributes. as_json, serializable_hash, and ORM callbacks define what the outside world sees.
Test suites. Fixtures, factories, and in-memory SQLite test databases depend on the ORM being there.
Third-party integrations. Export formats, webhooks, analytics pipelines, all built against the ORM’s JSON representation of the data.
Admin UIs. Rails Admin, Django Admin, Laravel Nova; hard-wired to specific ORM conventions.
Query helpers. Every scope, every association, every callback is ORM-native.
Team knowledge. Every engineer who’s been there more than a year thinks in the ORM’s abstractions.

None of this is the database’s problem. It’s the surrounding code that grew up expecting the ORM to be there. Replacing the ORM means replacing or rewriting every one of those layers. A schema migration is a weekend project; an ORM migration is a yearlong initiative.

The asymmetry is worth naming. The coupling is bidirectional, and one direction (schema → code) is much harder to undo than the other. Teams that adopt an ORM for velocity rarely account for the exit cost.

Database-side logic doesn’t round-trip

Most ORMs have a tunnel-vision view of the schema: they see what they created. They don’t see:

CHECK constraints. The ORM has no concept of them. A constraint like CHECK (amount >= 0) is invisible to the model; the ORM’s validations become the only gatekeeper the application knows about.
Triggers. A trigger that mutates a row after insert produces data the ORM didn’t know would be there. Reading back the row often requires an explicit reload.
Generated columns. MySQL’s GENERATED ALWAYS AS (…) STORED and PostgreSQL’s equivalent produce values the ORM treats as regular columns, but they can’t be written to, and the ORM’s default behavior is to try.
Partial and expression indexes. The ORM sees the column, not the index. A query that should hit a partial index on WHERE deleted_at IS NULL gets generated without that predicate and misses the index.
Exclusion constraints. PostgreSQL EXCLUDE USING gist (…). Completely outside the ORM’s worldview.
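The partial-index case can be checked directly from the query plan. The sketch below uses SQLite, where a partial index is only eligible when the query’s WHERE clause implies the index predicate; PostgreSQL applies a similar rule. The table and index names are hypothetical, and the exact plan text varies by SQLite version, so the example only looks for the index name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    deleted_at TEXT
);
-- Partial index: only live (not soft-deleted) rows are indexed.
CREATE INDEX idx_orders_live_user ON orders (user_id) WHERE deleted_at IS NULL;
""")

def plan(sql, params=()):
    # Column 3 of each EXPLAIN QUERY PLAN row is the human-readable detail.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)

# Hand-written query that repeats the index predicate.
with_predicate = plan(
    "SELECT id FROM orders WHERE user_id = ? AND deleted_at IS NULL", (1,)
)
# The shape a generated query takes when the generator only knows the column.
without_predicate = plan("SELECT id FROM orders WHERE user_id = ?", (1,))

print(with_predicate)
print(without_predicate)
```

The first plan names idx_orders_live_user; the second falls back to scanning the table, with nothing in the application code to show the difference.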

The ORM’s view of the schema is a subset of the real schema. Writes generated against that subset can violate invariants the database enforces. The application’s own validations pass, the INSERT comes back with a constraint violation, and the code has no model of why. Teams paper over this with application-level validation that duplicates the database’s, and then the two drift, which is its own class of production incident.
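A small sketch of that drift, assuming a hypothetical payments table and a hypothetical model-layer validator (app_validate) that has fallen behind the database’s CHECK constraint:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments ("
    " id INTEGER PRIMARY KEY,"
    " amount INTEGER NOT NULL CHECK (amount >= 0))"
)

def app_validate(amount):
    # Hypothetical application rule: it checks the type but not the sign,
    # so it has drifted from the CHECK constraint above.
    return isinstance(amount, int)

def save_payment(amount):
    if not app_validate(amount):
        return "rejected by application"
    try:
        conn.execute("INSERT INTO payments (amount) VALUES (?)", (amount,))
        conn.commit()
        return "saved"
    except sqlite3.IntegrityError:
        # The database is the only layer that knows this invariant exists.
        conn.rollback()
        return "rejected by database"

ok = save_payment(25)
drifted = save_payment(-5)
print(ok, "/", drifted)  # saved / rejected by database
```

The second call is the production incident in miniature: every application-level check passed, and the failure arrives as a database error the model layer never described.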

Relational modeling isn’t object modeling

One direction of the coupling is easy to see: schema changes require code changes. The other direction is harder to see: the ORM’s object model is what shapes the schema in the first place. For simple data, a User with an email and a password hash, that’s fine. For non-trivial domains, the shape inherited from object modeling produces schemas that look like class hierarchies and perform like poorly-designed databases.

This mismatch has a name: the object-relational impedance mismatch. Its practical consequence is that ORM-driven schemas get shaped by class hierarchies rather than by the relationships and access patterns the workload actually has.

Normalization doesn’t look like inheritance. A properly normalized schema is structured by the shape of the relationships between entities, not by a class graph. Consider a scheduling application with three kinds of entries: appointments, days off, and product launches. All of them are events. They have a start time, an end time, an owner. Each has different additional fields.

The relational answer is a supertype/subtype pattern (sometimes called class table inheritance): a base events table with the shared fields, and specialized tables for each subtype, each with event_id as a primary key that’s also a foreign key back to events:

CREATE TABLE events (
 id BIGINT PRIMARY KEY,
 user_id BIGINT NOT NULL REFERENCES users(id),
 starts_at TIMESTAMPTZ NOT NULL,
 ends_at TIMESTAMPTZ NOT NULL,
 kind TEXT NOT NULL CHECK (kind IN ('appointment', 'day_off', 'launch'))
);

CREATE TABLE appointments (
 event_id BIGINT PRIMARY KEY REFERENCES events(id) ON DELETE CASCADE,
 client_id BIGINT NOT NULL REFERENCES clients(id),
 location TEXT,
 notes TEXT
);

CREATE TABLE days_off (
 event_id BIGINT PRIMARY KEY REFERENCES events(id) ON DELETE CASCADE,
 reason TEXT,
 paid BOOLEAN NOT NULL
);

CREATE TABLE launches (
 event_id BIGINT PRIMARY KEY REFERENCES events(id) ON DELETE CASCADE,
 product_id BIGINT NOT NULL REFERENCES products(id),
 audience TEXT
);

Each subtype has its own columns, indexes, and constraints. Each can evolve independently. A new field on appointments doesn’t touch events, days_off, or launches. Dropping the launches feature drops one table and a CHECK-constraint value. Queries that only care about one subtype hit a narrow, well-indexed table instead of scanning across fifty columns of mostly-null data.

The ORM-driven shape tends to produce something different. Rails’ single-table inheritance (STI) collapses everything into one wide table with a type column and every possible subtype field nullable. Django’s multi-table inheritance is closer to the relational answer but introduces implicit joins the developer didn’t ask for. Hibernate offers all three strategies (SINGLE_TABLE, JOINED, TABLE_PER_CLASS) but most teams pick SINGLE_TABLE because it’s the default and the fastest for small-scale CRUD.

STI-style tables start showing their cost around the 10-million-row mark. Every query now scans a table with dozens of nullable columns. Indexes have to include the type column to be useful. Adding a field to one subtype means adding a nullable column visible to every other subtype. The schema looks like a class hierarchy and performs like one table doing the job of four.
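The constraint difference between the two shapes can be seen in a few lines. The sketch below uses SQLite and a trimmed-down, hypothetical version of the tables above (events_sti is an assumed name for the single-table variant) to show that the wide table has to accept an appointment with no client, while the subtype table can refuse it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- STI shape: one wide table, every subtype column nullable.
CREATE TABLE events_sti (
    id INTEGER PRIMARY KEY,
    type TEXT NOT NULL,
    starts_at TEXT NOT NULL,
    client_id INTEGER,   -- only meaningful for appointments
    paid INTEGER,        -- only meaningful for days off
    product_id INTEGER   -- only meaningful for launches
);
-- Supertype/subtype shape.
CREATE TABLE events (
    id INTEGER PRIMARY KEY,
    starts_at TEXT NOT NULL,
    kind TEXT NOT NULL CHECK (kind IN ('appointment', 'day_off', 'launch'))
);
CREATE TABLE appointments (
    event_id INTEGER PRIMARY KEY REFERENCES events(id) ON DELETE CASCADE,
    client_id INTEGER NOT NULL
);
""")

# STI accepts an appointment without a client: the column must stay nullable
# because day-off and launch rows share it.
conn.execute(
    "INSERT INTO events_sti (type, starts_at) VALUES ('Appointment', '2025-01-01T09:00:00Z')"
)
sti_accepted = conn.execute(
    "SELECT COUNT(*) FROM events_sti WHERE client_id IS NULL"
).fetchone()[0]

# The subtype table can state the rule directly.
conn.execute(
    "INSERT INTO events (id, starts_at, kind) VALUES (1, '2025-01-01T09:00:00Z', 'appointment')"
)
try:
    conn.execute("INSERT INTO appointments (event_id, client_id) VALUES (1, NULL)")
    subtype_rejected = False
except sqlite3.IntegrityError:
    subtype_rejected = True

print(sti_accepted, subtype_rejected)  # 1 True
```

The wide table can be patched with conditional CHECK constraints per type, but those have to be written and maintained by hand, one per subtype rule.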

Complex relationships don’t fit class graphs. Many-to-many bridges with their own columns, polymorphic references (one column that points to different tables depending on a sibling column’s value), temporal tables, recursive self-references; once the data model has these, the object graph starts fraying. The ORM’s answer is usually a custom association that looks natural in code and generates SQL nobody would write by hand.

Normalization decisions are driven by access patterns, not classes. A well-designed schema decides what to normalize and what to denormalize based on read/write ratios, query patterns, and storage trade-offs. The ORM-first approach tends to normalize by class structure, which is mostly correlated with good access-pattern normalization at small scale and mostly uncorrelated with it at scale.

The coupling here isn’t only code to schema. It’s class-graph to schema-shape, and that second form is the one that dictates how the database performs under real traffic.

When scale exposes the modeling

The class-shaped schema is cheap at small scale. Its cost is hidden until the workload grows, and because the schema shape is coupled to the class graph the application assumes, fixing it isn’t a schema migration. It’s an application restructure. The ORM’s opinions about data modeling are fine at 1,000 rows. Tolerable at 1 million. Breaking at 10 million. At 100 million, the patterns that were quietly suboptimal become the production incidents of the quarter.

Wide STI tables that scanned fine for 100k rows become the reason a query times out at 100M, because the planner can’t pick an efficient path through dozens of columns of mostly-null data with mixed cardinalities.
Lazy-loaded associations that were 200ms at small scale are now 60-second requests fanning out to a thousand queries.
find_or_create_by races that never mattered when two users hit the same endpoint now cause daily deadlocks on hot rows.
Unindexed ORM-generated sorts that worked at 10k rows become sequential scans over hundreds of gigabytes.
Connection-pool exhaustion from ORMs that hold connections across application logic becomes a top-of-funnel incident when traffic grows.
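For the find_or_create_by item in particular, one common alternative is to let a UNIQUE constraint arbitrate instead of a check-then-insert in application code. The sketch below shows the shape in SQLite; PostgreSQL accepts the same INSERT ... ON CONFLICT DO NOTHING form. get_or_create_user is a hypothetical helper, and this is an illustration of the pattern rather than a drop-in fix for any specific ORM.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)"
)

def get_or_create_user(conn, email):
    # The UNIQUE constraint, not application code, decides who wins a race:
    # a second insert for the same email becomes a no-op instead of a duplicate.
    conn.execute(
        "INSERT INTO users (email) VALUES (?) ON CONFLICT (email) DO NOTHING",
        (email,),
    )
    conn.commit()
    return conn.execute(
        "SELECT id FROM users WHERE email = ?", (email,)
    ).fetchone()[0]

first = get_or_create_user(conn, "a@example.com")
second = get_or_create_user(conn, "a@example.com")
total = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(first == second, total)  # True 1
```

The point is where the invariant lives: the database enforces uniqueness on every write path, whether or not the caller went through the model.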

At this point, teams reach for tools that weren’t supposed to be in the solution space for an OLTP application. Materialized views are the common one. They’re legitimately useful for analytical workloads, wrong for write-heavy OLTP because they have to be refreshed, and refresh windows during traffic either stall the primary or serve stale reads. Read replicas with application-level routing get bolted on not because the read workload demands it, but because the primary is buckling under queries that would have been cheap on a better-designed schema. Caching layers get introduced to paper over query shapes the ORM insists on generating. Each of these has legitimate uses. None of them is a fix for a schema that wasn’t designed for the access pattern it’s getting.

Materialized views aren't an OLTP tool

A materialized view is a precomputed query result stored as a table. In an OLTP system with heavy writes, the refresh cost either stalls the primary during the refresh or leaves the view stale. Neither is acceptable for a live application. Materialized views are an analytical-workload tool; reaching for them to fix an OLTP performance problem is a sign the underlying schema shape is wrong.

The pattern: ORM-driven schemas work until they don’t, and when they don’t, the options are rewrite the schema (hard, because the ORM’s conventions are everywhere) or add infrastructure that papers over the problem (expensive, and eventually stops working too). The schema that was designed to be ergonomic for the ORM at 1,000 rows is now the binding constraint on what the application can do at 100M.

The thinner alternatives

There’s a spectrum between “hand-roll every query with database/sql” and “full ORM with identity map, lazy loading, and 200-line models.” Several tools occupy the middle ground by treating SQL as the source of truth and generating typed code from it, without the mapping layer.

sqlc. Go, Kotlin, Python, TypeScript. You write SQL queries in .sql files; sqlc generates type-safe client code. The schema is canonical, the queries are code-reviewed SQL, and there’s no runtime layer to reason about. Migrations stay plain DDL.
jOOQ. JVM. Reads your schema and produces a fluent, type-safe DSL for building queries. Feels like SQL, reads like SQL, with compile-time type checking. Schema-first, no model mapping.
Kysely. TypeScript. Typed query builder with no ORM layer. You describe the schema in types; Kysely ensures queries match. The full SQL surface area is reachable.
Drizzle. TypeScript. Despite the name, closer to a typed query builder than a classical ORM. Schema declared in code, queries written in a SQL-like DSL, no identity map.
Plain database/sql or pgx with a small query helper. Go in particular has a tradition of “raw SQL plus a thin wrapper.” More boilerplate, minimal coupling.

The common thread across these tools: schema is the source of truth, queries are code-reviewed first-class artifacts, and there’s no mapping layer pretending the database doesn’t exist. The payoff is predictability; the SQL you see is the SQL that runs. The cost is some of the magic: no User.find(1).orders.where(total: 100..).first_or_create one-liners.
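To make the shape concrete, here is a hand-written Python sketch of the kind of code these tools aim to generate: the SQL is a visible constant, and a small typed function wraps it. This is not the output of sqlc or any other tool; User, GET_USER_BY_EMAIL, and get_user_by_email are hypothetical names.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class User:
    id: int
    email: str

# The query is a reviewable artifact: what you read here is what runs.
GET_USER_BY_EMAIL = "SELECT id, email FROM users WHERE email = ?"

def get_user_by_email(conn: sqlite3.Connection, email: str) -> Optional[User]:
    row = conn.execute(GET_USER_BY_EMAIL, (email,)).fetchone()
    return User(id=row[0], email=row[1]) if row else None

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)"
)
conn.execute("INSERT INTO users (id, email) VALUES (1, 'a@example.com')")

found = get_user_by_email(conn, "a@example.com")
missing = get_user_by_email(conn, "nobody@example.com")
print(found, missing)
```

There is no identity map, no lazy association, and no hidden second query; the price is that every new query shape is another constant and another function.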

For long-lived OLTP systems with non-trivial query shapes, that predictability is worth more than the magic. For short-lived CRUD apps, it isn’t.

When ORMs still earn their place

ORMs have a place. It’s narrower than the industry’s default deployment suggests. The workloads where the velocity payoff consistently outweighs the coupling cost share two properties: they’re bounded in scope and they’re bounded in lifespan.

Short-lived prototypes and experiments. Projects that will be rewritten, replaced, or discarded within a year. Model-first iteration is genuinely faster when the schema is fluid, and the coupling cost doesn’t compound if the project doesn’t live long enough to hit it.
CRUD-heavy internal tools and admin UIs. Query shapes are uniform and simple, the workload won’t scale past the ORM’s comfort zone, and the system doesn’t outlive the product it supports. The ORM’s constraints function as a style guide rather than as a limit on what the application can do.

That’s the list. Not “projects where the team knows Rails.” Not “workloads with uniform query shape, for now.” Not “small teams.” Those framings start as short-lived exceptions and end up as the default, and once the project outlives its original scope the coupling cost compounds silently until it’s too expensive to remove.

The failure mode isn’t picking an ORM for a prototype. It’s keeping it ten years later, after the prototype has become the company’s main production system, after the workload has grown past its original shape, and after migrating off costs more than a rewrite of the application. Most of the ORM codebases engineers end up cursing started in one of the two bullets above and were never reconsidered when they outgrew them.

Trade-offs

Everything in this post has a counter-argument, and the counter-arguments are real.

ORMs save real time on simple queries. User.find(1) is shorter than SELECT * FROM users WHERE id = 1. Across a codebase it adds up.
Type safety in the application layer. Rails’ ActiveRecord doesn’t give compile-time types. Hibernate’s entity types are checked by the Java compiler, and Django’s model fields and SQLAlchemy’s typed columns at least give type checkers and runtime validation something to work with. Raw SQL’s answer is schema-first code generation (sqlc, jOOQ), which works but requires tooling.
Domain modeling. Some teams legitimately want their data model to have methods, validations, and behavior co-located with the data. An ORM gives that for free; a query builder doesn’t.
Team familiarity. A team that knows Rails deeply will out-ship a team learning sqlc for the same project. The right answer depends on the team, not the abstract merits.
The middle ground isn’t free. Typed query builders require maintained type definitions. Schema-first code generation adds a build step. “No ORM” means a different abstraction, maintained by you.

The choice isn’t ideological. It’s a trade between two failure modes: the ORM’s coupling cost versus the query-builder’s boilerplate and maintenance cost. For short-lived systems, the ORM wins. For long-lived systems, the thinner layer wins. The catch is that most systems surviving their first year are long-lived, and most teams underestimate how long their system will live. If the project is still running three years from now, you’re probably in the second category whether or not you planned to be.

The bigger picture

The thing an ORM sells is a mapping between code and schema. The thing it delivers is a coupling. For short-lived projects (the prototype, the internal CRUD tool, the bounded experiment) the trade is worth it; the coupling cost is deferred, and by the time it would catch up the project has served its purpose or been replaced.

For projects that live long enough and grow complex enough (which is almost any project that survives its first year) the coupling becomes the dominant cost. Every major framework upgrade is a migration of its own. Every scale inflection requires working around the ORM’s opinions. Every query past the CRUD ceiling is raw SQL anyway. The better default for an application the team expects to still be running in three years is schema-first: keep the DDL canonical, keep queries as first-class code-reviewed artifacts, use a thin typed layer (sqlc, jOOQ, Kysely, Drizzle) to bridge to the application, and leave the ORM in the toolbox for cases that genuinely match its narrow strengths.

If you’re starting a project expected to live more than a year, default to schema-first. Inside an existing ORM codebase where the signals are showing up (raw-SQL ratio creeping up, migrations that require cross-team coordination, queries the ORM can’t express, performance paths that bypass it anyway) the useful question isn’t whether to migrate off. It’s where to draw the schema-first boundary for new work. Usually at new subsystems, not legacy code. Grandfather what’s there, pick up sqlc or jOOQ or Kysely for new code, and let the boundary move over years.

Schema Conventions Don't Survive Without Automation

Sun, 06 Apr 2025 00:00:00 +0000

TL;DR

Schema conventions only survive when automation enforces them. A rule a linter, ORM, migration runner, or IaC module checks will hold for years; a rule the team merely agreed to won’t outlast the people who agreed. Pick the conventions your automation needs and skip the purely subjective ones, because they’ll drift regardless of how strongly anyone feels.

A new engineer adds a table to the analytics schema and runs into a build break: the CI lint rule complains about a missing soft-delete column. She checks ten other analytics tables. Seven have deleted_at. Two have is_deleted. One ignores soft-delete because its rows are immutable. She asks in #data-eng which convention applies and gets back “depends on the table, ask the original author.” Two of the original authors have left. She adds deleted_at TIMESTAMP NULL to match the majority, ships the PR, and the dashboard that aggregates across all eleven tables starts double-counting the rows where is_deleted = 1 overlaps with the new deleted_at IS NULL.

The convention the lint rule was meant to enforce never actually existed. The deleted_at pattern landed in 2017; is_deleted in 2019 from an engineer who preferred the explicit boolean; the no-soft-delete table in 2021 from a third engineer who argued (correctly, for that table) that soft delete didn’t fit the use case. Each decision was right in isolation. None of them got written down anywhere a tool could read. The lint rule enforces the column name. It cannot tell that the column name is meaningless when three patterns coexist for the same operation.
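The gap between the name and the meaning is easy to demonstrate. In the sketch below, two hypothetical tables (page_views and signups) both end up with a deleted_at column, so a name-only lint passes on both, yet the same “live rows” filter gives a wrong answer on the table whose real marker is is_deleted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page_views (id INTEGER PRIMARY KEY, deleted_at TEXT);
CREATE TABLE signups (id INTEGER PRIMARY KEY, is_deleted INTEGER NOT NULL DEFAULT 0);
INSERT INTO page_views (deleted_at) VALUES (NULL), ('2024-05-01T00:00:00Z');
INSERT INTO signups (is_deleted) VALUES (0), (1);
-- A later migration adds the column the lint rule asks for, without a backfill.
ALTER TABLE signups ADD COLUMN deleted_at TEXT;
""")

def live_rows(table):
    # The filter a dashboard would apply uniformly once every table "has" deleted_at.
    return conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE deleted_at IS NULL"
    ).fetchone()[0]

views_live = live_rows("page_views")
signups_live_by_name = live_rows("signups")
signups_live_actual = conn.execute(
    "SELECT COUNT(*) FROM signups WHERE is_deleted = 0"
).fetchone()[0]

print(views_live, signups_live_by_name, signups_live_actual)  # 1 2 1
```

Both tables satisfy “has a deleted_at column,” and the soft-deleted signup is counted as live anyway.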

The corollary is the thesis of this post. Conventions only survive when a tool is enforcing them, and they only matter when the tool checks what they actually mean, not just what they’re called. Pick the ones a linter, ORM, or migration runner can check fully (column name, behavior, and consistency against the surrounding schema), enforce them in CI, and skip the ones the tool can only validate by name. Those drift the same way the team’s memory does.

What “conventions” means here

Conventions in this post means the decisions that apply across every table, not the design of any particular table:

Naming. snake_case, camelCase, or ALLCAPS for tables and columns.
Table names. Singular (user) or plural (users).
Primary keys. Bare id or <table>_id. BIGINT, UUID, or composite.
Foreign keys. user_id referencing users.id, or ad-hoc names like owner and creator.
Mandatory columns. created_at, updated_at, deleted_at, created_by. Which tables need them and which don’t.
Status and enum patterns. INT with documented values, CHECK constraint, or native ENUM. Zero-indexed or one-indexed.
Boolean naming. is_active, has_completed, can_edit, or bare active / completed.
Timestamp types. TIMESTAMP, DATETIME, TIMESTAMPTZ. Timezone-aware or naive.
Character sets and collations. utf8mb4 vs latin1; en_US.UTF-8 vs C.

None of these have one right answer. All of them have consequences that multiply across the lifetime of the schema.

Humans benefit, but not durably

Consistent schemas are easier for humans. Onboarding is faster, review is mechanical, queries are predictable. These benefits are real. They’re also entirely dependent on something other than memory holding the convention in place.

A new engineer spends less time building a mental model when PKs, FKs, and timestamps are named the same way everywhere. True, and the convention enabling it exists only as long as someone is actively keeping it enforced.

A migration adding CustomerReference INT in a codebase where everything else is customer_id BIGINT gets flagged when conventions are consistent. True, and whether it actually gets flagged depends on whether the reviewer remembers the rule or a linter is enforcing it.

JOIN users ON orders.user_id = users.id works without a lookup when the convention is <table>_id. True, and the query is right only because every prior migration followed the rule, which is only the case if something kept them on track.

The pattern: every human benefit is downstream of enforcement. A rule that exists only because the current team agreed to it lasts exactly as long as that team does. People change jobs, preferences evolve, new hires bring their own instincts. Within a few quarters of turnover, a human-only convention is gone, and so is the benefit.

The reasons worth picking a convention are the reasons a machine can enforce it.

Why it matters for automation

Automation is the only thing that holds a convention over time. A linter fails the build when snake_case becomes camelCase and keeps failing until someone addresses it; a team agreement doesn’t. The tools below are both the enforcement mechanisms and, by that logic, the only reasons a convention is worth picking in the first place. If none of them apply to your stack, the convention probably isn’t worth the debate.

Every tool that touches the schema reads conventions implicitly. When conventions are consistent, the tool works without configuration. When they’re not, someone has to tell the tool how to handle each exception. Usually in a config file nobody maintains.

ORMs rely on naming rules. ActiveRecord assumes a table named users has a primary key id and that a user_id column is the foreign key. Deviate and you write explicit mappings. Every non-standard table adds a line of configuration; every belongs_to :author, foreign_key: :creator_ref is convention drift showing up as code. Other ORMs are more explicit but still benefit from predictable column names: autogeneration works, inference works, magic methods work.
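The inference itself is simple, which is why it is brittle. A deliberately naive sketch of the idea follows; infer_fk_target is a hypothetical function, and real ORMs use a proper inflector rather than appending an “s”.

```python
def infer_fk_target(column, known_tables):
    """Return the table a column references under the <singular>_id convention, else None."""
    if not column.endswith("_id"):
        return None
    candidate = column[: -len("_id")] + "s"  # naive pluralization, for illustration only
    return candidate if candidate in known_tables else None

tables = {"users", "orders"}
print(infer_fk_target("user_id", tables))      # users
print(infer_fk_target("creator_ref", tables))  # None
```

Every column that returns None here is a line of explicit mapping someone has to write and keep correct.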

Code generators produce better output. sqlc, Prisma, jOOQ, and similar tools read schema metadata and emit type-safe client code. Consistent naming means the generated output looks like hand-written code. Inconsistent naming produces getCustomerReferenceByUserId() sitting next to getOrderByUserId(): same concept, different shape, and every caller has to remember the difference.

Migration tools depend on mandatory columns. Frameworks that manage created_at / updated_at automatically assume every table has them. Tables that omit these columns silently break the assumption: inserts work, updates work, but the “last modified” display in an admin UI shows null for some tables and not others.

Deployment pipelines assume a consistent migration shape. Migration runners that execute schema changes as part of CI/CD (Flyway, Liquibase, Alembic, Atlas, skeema) rely on migration files following a predictable naming and ordering convention, up/down scripts that mirror each other, and tables that don’t need per-case special-handling. Zero-downtime patterns like expand-and-contract assume updated_at exists for cache invalidation, that new columns are nullable or have defaults so old and new application versions can both write the table, and that soft-delete markers are consistent so rolling deploys across mixed versions don’t resurrect rows one version thought were gone. Every convention that drifts turns a deploy playbook into a per-table checklist, and the checklists are what get skipped under time pressure.

Schema diffing and drift detection depend on consistent shape. Tools like Atlas and skeema compare the desired schema (in version control) to the actual state of each environment and generate the migration to reconcile them. They work well when naming, types, and mandatory columns are uniform, and produce noisy diffs, false positives, and hand-maintained exception lists when they aren’t. Environment parity between dev, staging, and prod degrades the same way: the drift the team never notices becomes the one that breaks a deploy at the worst time.

Schema linters only work if there’s a rule to check. SQLFluff, sqruff, and similar tools can enforce naming conventions, require certain columns on new tables, reject forbidden types, and flag style issues. But the lint rule has to match the team’s convention. No convention, no rule. No rule, no enforcement.

Documentation generators like tbls and SchemaSpy produce browsable schema docs straight from the catalog. Consistent conventions make the generated output navigable. Inconsistent ones make it look like a dump.

Schema-reading LLMs and RAG pipelines have joined the same list. Copilot, MCP-backed agents, text-to-SQL tools, and retrieval-augmented coding systems pull column names and types from information_schema and pattern-match them against natural-language questions. When one table uses createdAt, another uses created_date, and a third uses date_created, the model either generalizes from the most-frequent variant and gets the other two wrong, or hedges and produces verbose conditional SQL. Uniform naming lets the model carry an assumption across tables without re-checking the catalog for every column; the accuracy gains from clean conventions stack on top of the 27% lift studies attribute to column comments alone. Conventions that were about making humans and codegen tools agree turn out to matter just as much for the machine-reading layer.

The common thread: tools treat conventions as a contract. When the contract holds, tools work. When it doesn’t, tools either break or force the team to maintain exceptions forever.

The contract is implicit

Nobody writes down that created_at must be a TIMESTAMPTZ or that FKs must be named <table>_id; the tooling silently starts expecting it. The moment a table violates the expectation, every tool built on it starts producing surprises. Conventions are a contract whether or not anyone acknowledges them, and the tools are the ones keeping score.

Each decision below matters only if something in your stack cares about it. The notes below lean on what tools typically expect. Pick the option that matches your automation. If nothing in your stack cares either way, skip the decision; it won’t survive the next round of team change regardless of which side “won” the debate.

Naming: snake vs camel

snake_case is the idiomatic choice for PostgreSQL and MySQL. PostgreSQL folds unquoted identifiers to lowercase, so createdAt silently becomes createdat unless it is double-quoted, and a column created with a quoted mixed-case name has to be quoted in every query that touches it. camelCase works if the team is disciplined about quoting, but most teams aren’t. Pick snake_case unless there’s a specific reason not to.
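The folding behavior is easy to illustrate outside the database. The sketch below is a simplification, not PostgreSQL's actual parser: fold_identifier, needs_quoting, and is_snake_case are hypothetical helper names, and escaped quotes and identifier length limits are ignored.

```python
import re

SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def fold_identifier(identifier: str) -> str:
    """Approximate how PostgreSQL resolves a single identifier.

    Double-quoted identifiers keep their case; unquoted ones fold to lowercase.
    """
    if len(identifier) >= 2 and identifier.startswith('"') and identifier.endswith('"'):
        return identifier[1:-1]
    return identifier.lower()


def needs_quoting(column_name: str) -> bool:
    """A stored name with uppercase letters can only be reached with quotes."""
    return column_name != column_name.lower()


def is_snake_case(identifier: str) -> bool:
    return SNAKE_CASE.fullmatch(identifier) is not None
```

Under these assumptions, fold_identifier("createdAt") yields createdat while fold_identifier('"createdAt"') keeps the mixed case, which is why a column created with a quoted camelCase name has to be quoted forever after.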

Table names: singular or plural

Both work. Rails and Django default to plural (users). CREATE TABLE user will actually fail in PostgreSQL because user is a reserved word, which is an argument for plural. Singular reads cleaner in joins (user.id feels like “the user’s id”). This is the smallest decision on the list in terms of consequences. The real requirement is that whatever you pick, you use it everywhere.

Primary keys: `id` vs `<table>_id`

Bare id is shorter and matches the default of most ORMs. It also creates a subtle hazard: table_a.id = table_b.id is syntactically valid SQL even when the two ids are unrelated, and the join silently returns wrong results. <table>_id (so user_id on the users table) makes that mistake hard to write accidentally, because the identifier tells you which table the ID belongs to.

The trade-off is that ORM defaults expect id, so using <table>_id means configuring every model. For teams that rely heavily on an ORM’s conventions, staying with id is pragmatic. For teams with more ad-hoc SQL, <table>_id pays off.

Foreign key naming

user_id referencing users.id is the convention most tools expect. Ad-hoc names like owner, creator, assigned_to, ref_id are sometimes necessary (multiple FKs to the same table need different names) but should be explicit about what they reference, either in the column name (owner_user_id) or in a schema comment. A column named owner with no comment and no FK is a question nobody can answer from the schema alone.

Mandatory columns

Decide which columns every table must have. Common choices:

created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(). Row creation time.
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(). Last modification, driven by a trigger or application logic.
created_by / updated_by. Audit fields, if the team needs them.
deleted_at TIMESTAMPTZ. Soft-delete marker.

Partial adoption is worse than none

If 80% of tables have deleted_at and 20% don’t, every query has to remember which tables to filter and which not to. The queries that forget silently return soft-deleted rows from some tables and not others. Pick a rule (“every table has created_at, updated_at; soft-delete tables have deleted_at”) and apply it uniformly.

Status and enum patterns

Three common strategies, each with trade-offs:

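INT with documented values. status TINYINT NOT NULL COMMENT '1=active, 2=paused, 3=cancelled'. Compact, fast, relies on comments for semantics. The approach works across engines, though the syntax shown is MySQL’s; PostgreSQL has no TINYINT and attaches comments with COMMENT ON COLUMN, so the equivalent there is a SMALLINT plus a separate comment statement.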
CHECK constraint. status VARCHAR(20) CHECK (status IN ('active', 'paused', 'cancelled')). Self-documenting in the DDL, slightly larger storage, human-readable in query results.
Native ENUM. PostgreSQL has first-class ENUM types, MySQL has ENUM(...). Compact and typed, but changing the set requires a schema migration; in PostgreSQL, removing a value is genuinely hard.

Any of these is fine. Mixing them (one table uses INT, another uses CHECK, a third uses ENUM) is what creates the problem. Every query that aggregates across tables has to handle three value formats.

Boolean prefixes

is_active, has_completed, can_edit make filter expressions self-documenting: WHERE is_active AND NOT is_deleted. Bare names like active or completed create ambiguity in review. Is this column a flag or a timestamp? Adjective or verb? Prefixing eliminates the ambiguity at no runtime cost.

Timestamp types

The choice matters more than the name. TIMESTAMP in MySQL auto-converts between UTC and the session timezone, which is usually not what you want. DATETIME stores the literal value with no timezone awareness. PostgreSQL’s TIMESTAMPTZ stores UTC with automatic conversion on input and output, the most forgiving option for most applications.

Mixing types across related tables is where silent timezone bugs come from. A TIMESTAMPTZ on one table joined to a TIMESTAMP without time zone on another in PostgreSQL, or a TIMESTAMP joined to a DATETIME in MySQL, will either implicit-cast or mismatch, depending on engine, version, and session timezone. Pick one per engine and apply it everywhere.

Character sets and collations

utf8mb4 in MySQL, UTF-8 in PostgreSQL. Anything else in 2026 is a legacy holdover. The subtle hazard: mixing charsets across columns causes joins between text columns to fail silently or return wrong results. PostgreSQL is stricter about this; MySQL is more permissive and more dangerous because of it.

Conventions beyond the schema

Schema conventions usually stop at the DDL, but the automation layer around the database depends on naming decisions that live outside it: secrets, endpoints, users, roles, hostnames, backup files, environment variables. Those names show up in Terraform modules, Vault paths, Kubernetes resources, IAM policies, service-discovery records, monitoring dashboards, and every deploy pipeline. When they’re consistent, the infrastructure is self-describing and IaC modules stay generic. When they aren’t, every piece of automation grows a special case.

Common places this shows up:

Secret names. prod/db/orders/primary/password vs prod-orders-db-pw vs orders_prod_password. A clear prefix/suffix pattern lets secret rotation scripts, IAM scopes (arn:aws:secretsmanager:*:*:secret:prod/db/*), and environment-promotion automation use wildcards instead of hardcoded lists.
Hostnames and endpoints. db-orders-rw.internal and db-orders-ro.internal for reader/writer splits, db-orders-primary-0.us-east-1 for cluster node addressing. Consistent patterns mean DR runbooks, connection pools, and failover scripts can resolve endpoints by transforming a base name rather than reading from config.
Database users and roles. app_orders_rw, app_orders_ro, migration_bot, readonly_analytics. The role name should say what it can do. Teams without a convention end up with svc_user_42, rails, monitoring, and nobody can audit privileges without a spreadsheet.
Database names. orders_prod vs prod_orders vs orders-production. Consistent environment placement (always suffix or always prefix) means wildcard grants, backup pattern matching, and cross-environment queries stay simple.
Environment variables. DB_ORDERS_HOST, DB_ORDERS_USER, DB_ORDERS_PASSWORD_SECRET. A per-service naming convention lets config loaders and IaC modules generate the full variable set from a single identifier.
Backup and snapshot names. orders-prod-20260420-0000 vs backup_orders_20260420. Retention jobs, restore runbooks, and compliance audits all read these names by pattern.

These aren’t schema conventions in the strict sense; they’re operational conventions that happen to be tied to the schema. They follow the same rules: pick a pattern, apply it everywhere, document it where the infrastructure code lives, and enforce it in the IaC linter (tflint, checkov) or the Kubernetes admission controller so new resources can’t be named off-pattern.
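A minimal sketch of what "generate the full set from a single identifier" can look like. The patterns are the ones used as examples above; derive_names and DatabaseNames are hypothetical names, and the choice of the rw host and rw role for the environment variables is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseNames:
    secret_path: str
    writer_host: str
    reader_host: str
    database: str
    env_vars: dict


def derive_names(service: str, env: str) -> DatabaseNames:
    """Build every operational name from one service identifier and environment."""
    secret_path = f"{env}/db/{service}/primary/password"
    writer_host = f"db-{service}-rw.internal"
    prefix = f"DB_{service.upper()}"
    return DatabaseNames(
        secret_path=secret_path,
        writer_host=writer_host,
        reader_host=f"db-{service}-ro.internal",
        database=f"{service}_{env}",  # environment is always the suffix
        env_vars={
            f"{prefix}_HOST": writer_host,
            f"{prefix}_USER": f"app_{service}_rw",
            f"{prefix}_PASSWORD_SECRET": secret_path,
        },
    )


names = derive_names("orders", "prod")
```

With one pattern per resource type, adding a new service means passing a new identifier, not writing a new special case.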

The failure mode is the same as inside the schema. A team with three secret-naming patterns needs a custom script per resource. A team with three hostname patterns runs DR runbooks twice as long as they should be. Operational conventions have the same compounding cost as schema conventions, in a different layer; the tooling to enforce them is different (IaC linters instead of SQLFluff), but the discipline is identical.

Enforcement: conventions without enforcement decay

Written conventions that nobody enforces last until the next person who didn’t read the doc. The only conventions that hold over years are the ones CI checks.

Schema linters

SQLFluff is the most popular for PostgreSQL and MySQL. It runs on migration files in CI and, through its built-in rules plus custom plugin rules for team-specific checks, can enforce:

Naming rules (snake_case only, specific prefixes/suffixes).
Required columns on CREATE TABLE (every table must have created_at).
Forbidden types (reject TIMESTAMP in favor of TIMESTAMPTZ).
Style (trailing commas, keyword casing, indentation).

The alternative is a custom linter, a script that parses migration files and checks them against a ruleset. More work to build but more flexible if the rules are unusual. Teams with strong opinions often end up here.
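As a rough illustration of how small such a script can start, here is a sketch that checks CREATE TABLE statements for required columns and forbidden types. It is regex-based and deliberately naive: it assumes one statement per table terminated by a semicolon, assumes the team writes TIMESTAMPTZ and never the long form TIMESTAMP WITH TIME ZONE, and ignores ALTER TABLE entirely. The rule sets and function names are hypothetical.

```python
import re

REQUIRED_COLUMNS = {"created_at", "updated_at"}
FORBIDDEN_TYPES = {"TIMESTAMP": "use TIMESTAMPTZ"}
CONSTRAINT_KEYWORDS = {"PRIMARY", "FOREIGN", "UNIQUE", "CHECK", "CONSTRAINT"}

CREATE_TABLE = re.compile(
    r"CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?(\w+)\s*\((.*?)\)\s*;",
    re.IGNORECASE | re.DOTALL,
)


def split_top_level(body: str) -> list:
    """Split on commas that are not nested inside parentheses."""
    parts, current, depth = [], [], 0
    for ch in body:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


def lint_migration(sql: str) -> list:
    """Return a list of human-readable findings; empty means the file passes."""
    findings = []
    sql = re.sub(r"--[^\n]*", "", sql)  # drop line comments before parsing
    for table, body in CREATE_TABLE.findall(sql):
        columns = {}
        for definition in split_top_level(body):
            tokens = definition.split()
            if len(tokens) < 2 or tokens[0].upper() in CONSTRAINT_KEYWORDS:
                continue  # table-level constraint, not a column
            columns[tokens[0].lower()] = tokens[1].upper().split("(")[0]
        for missing in sorted(REQUIRED_COLUMNS - set(columns)):
            findings.append(f"{table}: missing required column {missing}")
        for name, col_type in columns.items():
            if col_type in FORBIDDEN_TYPES:
                hint = FORBIDDEN_TYPES[col_type]
                findings.append(f"{table}.{name}: {col_type} is forbidden, {hint}")
    return findings
```

A CI step would run lint_migration over each changed migration file and fail when the returned list is non-empty. A real linter would want a proper SQL parser; the point here is only that the ruleset is data and the check is mechanical.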

CI checks on the schema itself

Beyond linting migration files, a CI job can introspect the database after migrations are applied and assert properties of the final schema:

-- Every table in the application schema has created_at
SELECT table_name
FROM information_schema.columns
WHERE table_schema = 'public'
GROUP BY table_name
HAVING COUNT(*) FILTER (WHERE column_name = 'created_at') = 0;

If the result is non-empty, fail the build. This catches the migration that adds a new table without the mandatory columns, including the case a file-level linter can miss: a table whose columns are spread across a CREATE TABLE and later ALTER TABLE migrations.
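The "fail the build" half is a few lines of glue. The sketch below uses an in-memory SQLite database as a stand-in for the migrated database so that it runs anywhere; SQLite has no information_schema, so the catalog lookup goes through sqlite_master and PRAGMA table_info instead of the query shown above. Against PostgreSQL the same function would run that query through the team's driver. Function names are hypothetical.

```python
import sqlite3

REQUIRED_COLUMN = "created_at"


def tables_missing_column(conn: sqlite3.Connection, column: str) -> list:
    """Return the names of user tables that lack the given column."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    missing = []
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = {row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')}
        if column not in columns:
            missing.append(table)
    return missing


def check_schema(conn: sqlite3.Connection) -> int:
    """Print one line per violation and return a process exit code."""
    missing = tables_missing_column(conn, REQUIRED_COLUMN)
    for table in missing:
        print(f"FAIL: table {table} has no {REQUIRED_COLUMN} column")
    return 1 if missing else 0


# Stand-in for "the database after migrations are applied".
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, created_at TEXT NOT NULL);
    CREATE TABLE audit_log (id INTEGER PRIMARY KEY, message TEXT NOT NULL);
    """
)
exit_code = check_schema(conn)
```

In a pipeline the script would end with sys.exit(check_schema(conn)), so a non-zero code stops the deploy.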

Other useful assertions:

-- No table uses TIMESTAMP without timezone
SELECT table_name, column_name
FROM information_schema.columns
WHERE data_type = 'timestamp without time zone'
 AND table_schema = 'public';

-- Every FK column has an index
-- (expensive to query but worth running on schedule)

Introspection-based checks run against the shape of the schema after migrations are applied; they catch drift the file-level linter can’t see.

Pre-commit hooks

Developer-machine enforcement: running sqlfluff on staged migration files before commit. Faster feedback than CI, but only works if every developer has the hook installed. Treat pre-commit hooks as a developer experience improvement, not as the real gate. CI is the gate.

CODEOWNERS on migration directories

Putting a small group of owners on migrations/ forces review by someone who understands the conventions. This is a human check, not a mechanical one, but it catches things the linter can’t (“this new table has all the right columns but the design is wrong”). The owner doesn’t have to be one person; a rotating review responsibility works.

Review templates

A PR template that includes a checklist for schema changes (“does this follow the naming convention? does it include mandatory columns? are the types consistent with existing tables?”) nudges the author to check before review. The cost is zero; the benefit is that most issues get caught before they reach a reviewer.

Scope: strict for new, lenient for legacy

The enforcement question that derails most teams: do existing tables have to meet the convention? Trying to retrofit decades of legacy is an impossible project; requiring only new tables to meet the convention is achievable. The practical pattern:

New tables. Linter is strict. No exceptions without a documented reason.
Existing tables. Grandfathered. Linter skips them or only checks newly-added columns.
Legacy migrations. An explicit backlog, prioritized by frequency of use and onboarding pain.

This splits the problem into “hold the line on new work” and “improve legacy opportunistically.” Both are manageable. Trying to do both at once isn’t.

The hardest part: changing conventions without creating a new one

Conventions decay not because they were bad, but because they changed faster than the team could propagate the change. The result isn’t “the new convention”. It’s a schema with three coexisting conventions, none of which applies everywhere.

The discipline is straightforward, even if it’s not always followed.

Write the convention down

Before enforcement, before any migration, there has to be a single authoritative document: a SCHEMA-CONVENTIONS.md in the repo, or a runbook, or an RFC. Not a Slack thread, not tribal knowledge. Something a new engineer can read and apply.

The doc is short by design: a page or two, not a book. It answers “what naming convention do we use?” and “what columns does every table need?” and “which timestamp type?”. It doesn’t try to teach relational design. Short docs get read; long ones don’t.

Use a lightweight RFC process for changes

When someone wants to change a convention (switch from id to <table>_id, add updated_by as a mandatory column, move from INT to UUID primary keys) it goes through a written proposal:

What’s changing and why.
Impact on existing tables (migrate all, grandfather, or cutover by date).
Impact on tools, ORMs, dashboards, and downstream consumers.
Who decides (single decision-maker or review board).
Explicit cutover date if changing for new work only.

The RFC doesn’t have to be heavyweight. A paragraph in a shared doc, reviewed by two or three people, approved by a named owner. The value isn’t the document. It’s the forcing function that prevents conventions from changing by PR comment.

Decide: migrate, grandfather, or both

Three options, each with a different risk profile:

Migrate everything. Rename columns across the schema, update every query, every ORM model, every dashboard. This is the clean option and almost never the practical one. Retroactive renaming breaks downstream consumers the team may not even know exist: analytics jobs, exports, integration partners, cached query plans.
Grandfather legacy, enforce on new. Old tables stay as-is; new tables follow the new rule. The schema ends up with two conventions coexisting, but it’s predictable: “tables before this date use X, tables after use Y.”
Cutover with a migration window. Pick a date, migrate the highest-traffic or highest-visibility tables before the date, grandfather the rest, close out the long tail opportunistically.

The grandfather option is the most common in practice because it respects the reality that the schema is a shared resource nobody fully owns. Write the decision down (“before 2025-Q3, tables used camelCase; after, snake_case”) so future engineers know the split exists and isn’t a bug.

The two-generation rule

Two is the limit

One convention is best. Two coexisting conventions are survivable: new engineers can be told “look at the table’s creation date.” Three or more is where schemas become unreviewable. Any proposal to change a convention needs to answer: “are we ending up with two generations, or a third?” The prospect of a third generation should force the team to finish migrating the first one, not to add a new one on top.

This is a heuristic, not a hard rule, but it’s a useful test. When a proposed change would create a third convention without a plan to eliminate one of the existing two, the change probably isn’t worth it.
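The test can be approximated mechanically. The sketch below uses naming style as a proxy for "generation", which is an assumption: real generations may differ by key naming or type choices instead. It classifies column names and counts how many styles coexist; classify and convention_generations are hypothetical helpers.

```python
import re
from collections import Counter

STYLES = {
    "snake_case": re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)+"),
    "camelCase": re.compile(r"[a-z][a-z0-9]*([A-Z][a-z0-9]*)+"),
    "PascalCase": re.compile(r"([A-Z][a-z0-9]*)+"),
}


def classify(name: str):
    """Return the naming style of an identifier, or None if it is ambiguous.

    A single lowercase word such as "id" fits every style, so it is not counted.
    """
    for style, pattern in STYLES.items():
        if pattern.fullmatch(name):
            return style
    return None


def convention_generations(column_names) -> Counter:
    """Count how many columns follow each naming style."""
    return Counter(style for style in map(classify, column_names) if style)


columns = ["id", "created_at", "user_id", "createdAt", "DateCreated"]
generations = convention_generations(columns)
too_many = len(generations) >= 3
```

Fed from the catalog on a schedule, a count like this turns "are we about to create a third generation?" from a debate into a number.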

When to accept legacy drift

Not every legacy convention is worth fixing. The calculation:

How often does the old convention cause bugs? Column names nobody can remember, types that force implicit casts, missing mandatory columns that break tooling. Real costs, worth migrating.
How often is the table touched? A table used by ten queries a day is different from one used by ten thousand. Migration risk scales with usage.
What breaks downstream? ORM models, dashboards, exports, cached plans, monitoring. Every consumer of the table name or column name has to update. If the count is unknown, it’s higher than you think.
Is there a cheap alternative? A VIEW that exposes the table under the new convention, while the underlying table keeps its legacy name, can bridge the gap without a full migration.

The honest answer is often “leave it alone and document why.” A comment in the schema, or a note in the conventions doc, is cheaper than a migration and accomplishes the main goal: making the inconsistency visible and intentional.
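The VIEW bridge from the list above is small enough to show end to end. This sketch uses an in-memory SQLite database as a stand-in so it is runnable as-is; the table and column names are made up for illustration. One difference to keep in mind: SQLite views are read-only, whereas PostgreSQL treats simple single-table views as updatable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Legacy table keeps its original names, so existing consumers keep working.
    CREATE TABLE tblCustomer (
        customerId INTEGER PRIMARY KEY,
        fullName   TEXT NOT NULL,
        createdAt  TEXT NOT NULL
    );

    -- The view exposes the same rows under the current convention.
    CREATE VIEW customers AS
    SELECT customerId AS customer_id,
           fullName   AS full_name,
           createdAt  AS created_at
    FROM tblCustomer;
    """
)
conn.execute(
    "INSERT INTO tblCustomer VALUES (1, 'Ada Lovelace', '2020-01-01T00:00:00Z')"
)

cursor = conn.execute("SELECT customer_id, full_name, created_at FROM customers")
view_columns = [description[0] for description in cursor.description]
row = cursor.fetchone()
```

New code reads from customers and never sees the legacy names; the rename of the underlying table can then happen later, or never.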

Trade-offs

Conventions have a cost. A rule that doesn’t serve automation is noise. It takes space in the conventions doc, invites bikeshedding in review, and adds nothing to the schema’s consistency over time, because there’s nothing to keep it from decaying the moment the people who cared move on. The heuristic: if no tool fails when the rule is violated, the rule doesn’t need to exist.

Over-specifying is the second failure mode. A team with thirty linter rules will find a way around them or ignore them. Rules that block common, legitimate cases get bypassed with -- noqa comments until the linter stops being a gate.

The lightweight approach:

A small set of rules, each one tied to a specific tool that cares (naming, mandatory columns, forbidden types).
A larger set of advisory warnings, not blockers.
A clear escape hatch for exceptions, with the exception documented.
Periodic review. Rules that fire too often are wrong; rules that never fire are noise.

Strict conventions are a feature only as long as enforcement keeps pace with the rule count. Beyond that, they become a tax on every change. The right level is the smallest set automation will actually enforce without constant arguments.

The bigger picture

The useful question is what your automation needs, and whether a machine can enforce it. If yes, pick the convention your automation needs and wire it into CI. If no, skip the decision; debating aesthetics in the absence of enforcement produces nothing that will still be true a year from now. People change, teams turn over, preferences drift. A convention enforced by a linter doesn’t care who wrote the migration; a convention enforced by “we agreed last quarter” does.

The schemas that age well are the ones where the only surviving conventions are ones a linter, ORM, migration runner, or IaC module is actively enforcing. Everything else (bikeshed questions about singular vs. plural, religious debates about column ordering) drifts the moment the people who cared stop working there. That’s the predictable result of anchoring a rule to something as ephemeral as a team’s current preference.

Where Business Logic Lives - Database vs. Application

Wed, 19 Mar 2025 00:00:00 +0000

TL;DR

Keep the database narrow: NOT NULL, UNIQUE, FK within a service, simple CHECK for per-row invariants, generated columns for stable derived values. Put everything else (orchestration, computation, rules that change weekly, anything crossing services) in an application-layer library every writer uses. “Dumb database” is half right: dumb across service boundaries, narrowly smart within one.

amount >= 0 lives in three places. A CHECK on the column, a Pydantic validator in the API model, a guard in the order-creation service. Added in different quarters by different teams. Out of sync since GDPR forced a change to the validator that nobody propagated to the constraint. The migration tightening the CHECK to match fails on 4,000 rows the application thought were fine.

This is the default state of any rule about valid data, eventually. It lives in more than one place. The places drift. The reflex answer, “both layers for safety,” is what produced the drift in the first place; “application-only because we have microservices” is the same answer applied to a different fashion cycle. Neither is a decision, both are defaults. The useful question is what each layer can enforce, what it costs, and how often the rule will change. Four axes do the work: scope, cadence, cost, and write-path count.

The short history of the “dumb database” position

The microservices canon and the cloud databases built to support it have already answered one half of this question.

Chris Richardson’s database-per-service pattern rules out cross-service foreign keys as a design choice: each service owns its schema and no one else touches it. Fowler and Lewis’s “Microservices” article coined “smart endpoints and dumb pipes” and “decentralized data management”. Neither the middleware nor a shared database holds cross-service logic. Fowler calls the alternative, integration through a shared database, the canonical encapsulation breach. Vaughn Vernon’s DDD work puts the consistency boundary at the aggregate, enforced in process, not in the DBMS.

The storage layer follows suit. Google Spanner does not support user-defined stored procedures or triggers; its docs explicitly say that on migration, “business logic implemented by database-level stored procedures and triggers must be moved into the application.” DynamoDB has no CHECK, no foreign keys, no triggers; integrity is a per-item conditional write. Cassandra, Bigtable, and Uber’s Schemaless are the same story. Facebook’s TAO keeps the social graph’s integrity inside TAO itself; the underlying MySQL shards don’t enforce it. Shopify, even inside a Rails monolith, doesn’t enforce relationships at the database layer; foreign keys are maintained only in the model code, a choice driven by their sharding and cell architecture.

That’s the position the last fifteen years of large-scale engineering has converged on, and it’s right in the scope it applies to. Across service boundaries, the database physically can’t enforce most cross-cutting rules, the dominant cloud storage engines won’t host procs or triggers, and the pattern literature has codified the split.

The mistake is generalizing from this to “the database should be dumb, period.” That collapses two different debates into one slogan.

Where the position is strong and where it isn’t

The near-unanimous consensus is about cross-service integrity: FK between services, triggers as integration glue, stored procs as the coordination layer. There the answer is genuinely settled. Application-layer, usually in a shared library, sometimes in an orchestration service.

The within-service question is different. Inside a single service’s private schema, with one team owning the reads and writes, the database still sees every write path the service produces: the normal request path, backfill scripts, admin tools, the occasional DBA command at 2am, the new code path the team added last sprint. Richardson, Fowler, and Vernon don’t argue against NOT NULL, CHECK, or UNIQUE inside that boundary. Shopify’s position is an outlier driven by sharding operations, not ideology. Yugabyte goes further and defends stored procedures and triggers inside a service boundary.

So the real framing: the “dumb database” position is unanimous across service boundaries and contested within them. The rest of this post is about where the line actually sits within a service. The honest answer is still “mostly keep the database lean, but not empty,” for reasons that have more to do with deployment cadence and scaling economics than with purity.

The four axes that actually decide the split

The rule-by-rule question is a balance across four properties of the system, not a preference between layers.

1. Scope: does this rule cross service boundaries?

If the rule spans services, the database can’t enforce it. A foreign key into another service’s database doesn’t exist. A trigger that writes to tables owned by another team isn’t compatible with any sane microservices pattern. Cross-service correctness lives in application code, typically in a library that every writing service depends on, or in event-driven compensation (sagas, outbox patterns, eventual-consistency protocols).

The only databases that let you enforce cross-service rules are ones the pattern literature treats as an anti-pattern on purpose: shared databases with multiple writers.

2. Cadence: how often does this rule change?

Application code deploys in minutes. Schema migrations deploy on a migration window, with expand-and-contract dances, NOT VALID + VALIDATE phases, and careful ordering across rolling deploys. A rule that lives in the database inherits the database’s deployment cadence.

That’s fine for rules that change annually or never: “email column is not null”, “amount is non-negative”, “status is one of four values for the life of the product”. It’s painful for rules that change with product experiments: pricing logic, promotion codes, fraud thresholds, discount stacking rules, feature gates. The friction of modifying a CHECK constraint or a stored procedure for a rule that’s going to change again next quarter adds up to “this probably shouldn’t have been in the database in the first place.”

3. Cost: where can this rule run cheapest?

The application tier scales horizontally. The primary database, for most OLTP workloads, scales vertically until sharding, and sharding is a project, not a tuning knob. Every CPU cycle spent inside the database is a cycle not spent on I/O, lock management, query planning, or serving other requests. A busy primary at 80% CPU doesn’t have slack for an additional stored procedure body to run on every write.

For a simple CHECK (amount >= 0), the cost is measured in nanoseconds per write. Irrelevant. For a trigger that recomputes an aggregate on every insert, the cost is a hot row plus whatever the aggregation costs, charged to the most scarce compute tier in the system. For a stored procedure that loops over rows, the cost is full procedure-body CPU on the primary for every call.

Application code, by contrast, has near-free horizontal scale. Adding a pod is cheap. Adding database CPU is vertical-scaling dollars until you’ve run out of instance sizes, then it’s a sharding project.

The database is a vertical-scaling tier

Moving computation into the database moves it toward the scaling ceiling. Declarative constraints (CHECK, FK, UNIQUE) are cheap enough to be irrelevant. Triggers that do nontrivial work, procedures that run loops, and anything that touches multiple rows per call eat CPU on the one tier that’s hardest to scale. The “app can do this orders of magnitude faster” intuition is right when “faster” is measured in throughput under load: not because a single call is faster, but because the application tier absorbs more of them without a scaling event.

4. Write-path count: how many things write to this schema?

One service, one codebase, one team, one ORM writing to a schema the team fully owns: application-layer enforcement works. A shared library is the single choke point; every write goes through it.

More than one writer (multiple services, admin tools in a different language, backfill scripts maintained by a different team, DBA incident-response SQL) and the library has gaps. Every writer that isn’t the library bypasses the validation. The database is the only layer that catches them all, and the cost of catching them is a small set of declarative constraints.

Two writers isn’t a lot. Most systems that survive a few years accumulate more: data-migration jobs for a table split, an admin dashboard written in a different stack than the service, a reporting ETL that occasionally writes aggregates back, a partner integration that writes through a shared DB user.

The balance that holds in practice

The four axes point at a consistent split. Keep the database narrow and declarative. Put everything else in application code, ideally in a library every writer depends on.

The narrow set the database earns its keep on:

NOT NULL, UNIQUE, FOREIGN KEY within a service’s private schema.
Simple CHECK constraints for per-row invariants: ranges, regex on identifiers, enum membership.
Generated columns for derived values that are deterministic, stable, and cheap to compute.
Indexes the application needs for performance (not business logic, but a reminder they belong in the schema, not in code).

These are declarative, near-zero CPU cost per write, cover every write path, and change rarely enough that the schema’s deployment cadence isn’t a problem. Foreign keys in particular are the canonical within-service example. A post of their own goes deeper on why application-layer referential integrity consistently loses to database-enforced FKs over time, and that argument is this whole post’s framework applied to one specific constraint.

What stays in application code:

Orchestration across multiple statements, services, or external calls.
Rules that depend on request context, caller identity, time-of-day, or anything outside the row.
Rules that change with product experiments.
Rules that span services.
Computation that would cost measurable database CPU per call.
Derived values that involve complex business logic or are likely to change.

If there’s one writer, a shared library is the single source of truth. If there are multiple writers (or there will be, which is most systems after a year), the library is still valuable but needs a narrow safety net in the database for the invariants that would corrupt data if they slipped.

The library as the primary, the schema as the safety net

The pattern that works in practice: a validation library (or a rich domain model) owns the full rule set, including validation messages, business logic, cross-field checks, everything the UI and API need. The schema carries only the declarative subset the database can enforce cheaply: NOT NULL, CHECK, UNIQUE, FK. When the library’s rules diverge from the schema’s, the database rejects the write. The schema is the safety net, not the primary enforcement path. Violations surface as 500s that flag drift, not silent corruption.
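A compact sketch of that division of labor, using an in-memory SQLite database as a stand-in so it runs without a server. The library (validate_order, create_order, and ValidationError are hypothetical names) owns the friendly messages; the table carries only the declarative subset. A writer that skips the library still hits the constraint.

```python
import sqlite3


class ValidationError(ValueError):
    """Friendly, user-facing error raised by the library."""


def validate_order(amount_cents: int, currency: str) -> None:
    if amount_cents < 0:
        raise ValidationError("Order amount cannot be negative.")
    if not (len(currency) == 3 and currency.isascii()
            and currency.isalpha() and currency.isupper()):
        raise ValidationError("Currency must be a three-letter code such as USD.")


def create_order(conn: sqlite3.Connection, amount_cents: int, currency: str) -> None:
    """The sanctioned write path: validate first, then insert."""
    validate_order(amount_cents, currency)
    conn.execute(
        "INSERT INTO orders (amount_cents, currency) VALUES (?, ?)",
        (amount_cents, currency),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        amount_cents INTEGER NOT NULL CHECK (amount_cents >= 0),
        currency TEXT NOT NULL CHECK (length(currency) = 3)
    )
    """
)

create_order(conn, 1250, "USD")  # library path succeeds

# A backfill script that bypasses the library still hits the safety net.
try:
    conn.execute("INSERT INTO orders (amount_cents, currency) VALUES (-5, 'USD')")
    bypass_blocked = False
except sqlite3.IntegrityError:
    bypass_blocked = True
```

The library's check produces the message a user sees; the constraint's rejection is the signal that some writer, or some rule, has drifted.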

CHECK constraints, the cheap, defensible middle ground

Declarative CHECK constraints are the strongest example of database-side logic that justifies itself on every axis.

CREATE TABLE orders (
 id BIGINT PRIMARY KEY,
 user_id BIGINT NOT NULL REFERENCES users(id),
 amount_cents BIGINT NOT NULL CHECK (amount_cents >= 0),
 currency CHAR(3) NOT NULL CHECK (currency ~ '^[A-Z]{3}$'),
 status TEXT NOT NULL CHECK (status IN ('pending', 'paid', 'shipped', 'refunded')),
 placed_at TIMESTAMPTZ NOT NULL,
 shipped_at TIMESTAMPTZ CHECK (shipped_at IS NULL OR shipped_at >= placed_at)
);

Scope is within the service’s schema, so the database can enforce it. Cadence is annual or never; adding a new status value is a planned migration, not a product-experiment iteration. Cost is near zero, since the engine evaluates the expression once per row written and for the operators shown it’s nanoseconds. Write-path count covers every path, including the backfill job someone writes next year in a different language.

The trade-off is real but small. Error messages from a constraint violation are less friendly than a hand-crafted validation message, and adding a CHECK to a large existing table is a migration project (MySQL rewrites the table; PostgreSQL needs NOT VALID then VALIDATE CONSTRAINT to avoid long locks). Both are known problems with known workarounds.

The common pattern that holds up: application library owns the error message and UX, the database owns the enforcement. The library’s check is a fast-path for better errors; the constraint is the gate.
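
A minimal sketch of that split, using Python's built-in sqlite3 module as a stand-in for the production engine so the example runs anywhere. The trimmed orders table and the names validate_order and place_order are assumptions made for illustration, not part of any real codebase.

```python
import sqlite3


class ValidationError(Exception):
    """Friendly, user-facing error raised by the library's fast path."""


ALLOWED_STATUSES = ("pending", "paid", "shipped", "refunded")


def validate_order(amount_cents, status):
    # Fast path: mirrors the schema's rules, but with readable messages.
    if amount_cents < 0:
        raise ValidationError("Amount cannot be negative.")
    if status not in ALLOWED_STATUSES:
        raise ValidationError(
            f"Status must be one of: {', '.join(ALLOWED_STATUSES)}."
        )


def place_order(conn, order_id, amount_cents, status):
    validate_order(amount_cents, status)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO orders (id, amount_cents, status) VALUES (?, ?, ?)",
            (order_id, amount_cents, status),
        )


conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        amount_cents INTEGER NOT NULL CHECK (amount_cents >= 0),
        status TEXT NOT NULL
            CHECK (status IN ('pending', 'paid', 'shipped', 'refunded'))
    )
""")

place_order(conn, 1, 1999, "paid")  # library path: passes both layers

# A writer that bypasses the library (a backfill script, say) still hits the gate.
try:
    conn.execute(
        "INSERT INTO orders (id, amount_cents, status) VALUES (2, -5, 'paid')"
    )
    bypass_rejected = False
except sqlite3.IntegrityError:
    bypass_rejected = True
```

The library raises ValidationError before the round trip; the bypassing INSERT gets the engine's blunter IntegrityError. Both stop the bad row, but only the constraint covers the writer that never imported the library.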

Generated columns, the most underused declarative tool

Generated columns produce a derived value from other columns in the same row. MySQL since 5.7, PostgreSQL since 12. Indexable. Can’t be written to. Consistency guaranteed by the engine.

CREATE TABLE line_items (
 id BIGINT PRIMARY KEY,
 order_id BIGINT NOT NULL REFERENCES orders(id),
 unit_price_cents BIGINT NOT NULL CHECK (unit_price_cents >= 0),
 quantity INT NOT NULL CHECK (quantity > 0),
 total_cents BIGINT GENERATED ALWAYS AS (unit_price_cents * quantity) STORED
);

CREATE TABLE users (
 id BIGINT PRIMARY KEY,
 email TEXT NOT NULL,
 email_normalized TEXT GENERATED ALWAYS AS (LOWER(email)) STORED,
 UNIQUE (email_normalized)
);

On the four axes: scope is within-service, cadence is stable (the formula is an identity, not a business rule), cost is negligible (pure arithmetic or string operations), write-path count covers everything because every writer gets the same result automatically. Generated columns are the cleanest way to handle derived values that would otherwise be maintained by discipline.

The cost: the derivation has to be stable. Changing email_normalized = LOWER(email) to add Unicode normalization is a migration. If the formula is an active business rule, it’s the wrong tool.
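
A small runnable sketch of the engine-maintained total, again assuming SQLite (3.31 or newer, which added generated columns) as a stand-in; PostgreSQL and MySQL accept the same GENERATED ALWAYS AS ... STORED clause, though their error classes differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE line_items (
        id INTEGER PRIMARY KEY,
        unit_price_cents INTEGER NOT NULL CHECK (unit_price_cents >= 0),
        quantity INTEGER NOT NULL CHECK (quantity > 0),
        total_cents INTEGER
            GENERATED ALWAYS AS (unit_price_cents * quantity) STORED
    )
""")

# Writers supply only the base columns; the engine derives the total.
conn.execute(
    "INSERT INTO line_items (id, unit_price_cents, quantity) VALUES (1, 250, 4)"
)
total_after_insert = conn.execute(
    "SELECT total_cents FROM line_items WHERE id = 1"
).fetchone()[0]

# Changing an input recomputes the derived value with no sync code.
conn.execute("UPDATE line_items SET quantity = 10 WHERE id = 1")
total_after_update = conn.execute(
    "SELECT total_cents FROM line_items WHERE id = 1"
).fetchone()[0]

# Writing the generated column directly is rejected by the engine.
try:
    conn.execute("UPDATE line_items SET total_cents = 1 WHERE id = 1")
    direct_write_rejected = False
except sqlite3.Error:
    direct_write_rejected = True
```

No writer, in any language, can store a total that disagrees with price times quantity, which is the "consistency guaranteed by the engine" property in practice.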

Triggers, for schema migrations only

Triggers run procedural code on insert, update, or delete. That’s exactly what makes them wrong for implementation logic. A trigger mutates rows the caller didn’t ask to change, fires cascades the caller didn’t initiate, and makes “this update touches one column” a lie. The caller’s application logs say one thing; the database does something else. When a bug surfaces, the stack trace goes to application code that never ran the hidden logic.

The usual defenses (updated_at maintenance, audit logging, soft-delete cascades, counter caches) are all better handled in application code.

updated_at belongs in the ORM’s model callback, the shared write library, or a middleware that sets it on every persist. Every writer already goes through that path, and adding a timestamp is one line. If backfill scripts or admin tools bypass the library, the fix is to make them use the library, not to paper over the gap with a trigger.

Audit logs need application context: the user ID, the request ID, the reason, the session, the tenant. A trigger can’t see any of that without awkward session-variable tricks that break across connection pools. Write the audit row in application code, next to the logic that knows why the change is happening.

Soft-delete cascades are business rules. Which child rows get deleted when a parent is soft-deleted, in what order, with what side effects, is a product decision, not a storage concern. Orchestrate it in the application.

Counter caches via trigger create a hot row where every concurrent write serializes on the same parent lock. Application-side counters, background rollups, or a separate events-with-aggregation pipeline all scale better and leave the hot path free.
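
The first two of those, sketched as a shared write helper rather than a trigger. Everything here is hypothetical (the RequestContext fields, the order_audit table, the update_order_status name), and SQLite stands in for the real database; the point is only that the timestamp and the audit row are written by code that knows who asked and why.

```python
import sqlite3
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RequestContext:
    """Application context a trigger cannot see."""
    user_id: int
    request_id: str
    reason: str


def update_order_status(conn, ctx, order_id, new_status):
    now = datetime.now(timezone.utc).isoformat()
    with conn:  # one transaction: the change and its audit row commit together
        conn.execute(
            "UPDATE orders SET status = ?, updated_at = ? WHERE id = ?",
            (new_status, now, order_id),
        )
        conn.execute(
            "INSERT INTO order_audit"
            " (order_id, new_status, user_id, request_id, reason, at)"
            " VALUES (?, ?, ?, ?, ?, ?)",
            (order_id, new_status, ctx.user_id, ctx.request_id, ctx.reason, now),
        )


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        status TEXT NOT NULL,
        updated_at TEXT
    );
    CREATE TABLE order_audit (
        order_id INTEGER NOT NULL,
        new_status TEXT NOT NULL,
        user_id INTEGER NOT NULL,
        request_id TEXT NOT NULL,
        reason TEXT NOT NULL,
        at TEXT NOT NULL
    );
    INSERT INTO orders (id, status) VALUES (1, 'pending');
""")

ctx = RequestContext(user_id=42, request_id="req-7f3a", reason="customer paid")
update_order_status(conn, ctx, order_id=1, new_status="paid")
```

Both writes are visible in the same function the on-call engineer will read, and the audit row carries the request ID and reason without any session-variable plumbing.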

The general principle: application logic should be visible in application code. A trigger that modifies data the application wrote is a hidden side effect, and hidden side effects are an anti-pattern for the same reason global variables are. They make the reachable state of the system larger than the code the reader is looking at.

The debugging cost is the real cost

When an on-call engineer is looking at a production incident, they read the application code that ran. A trigger that fired three levels down, in a language they may not read fluently, mutating rows nobody expected, is the single biggest source of “the code says X, the database did Y” incidents. That’s not a tooling problem. It’s a design choice that can be avoided by not writing triggers as implementation.

This gap widens when an ORM sits between the application and the database. ORMs model what they created (columns and relations) and don’t reflect triggers, CHECK constraints, or generated columns in the model class. A trigger that mutates a row after insert produces data the ORM didn’t know would be there, and the in-memory object diverges from the persisted row until someone thinks to reload. The ORM coupling post covers this failure mode in more depth; triggers are one of the specific shortfalls that show up as “the model says one thing, the database did another.”

The legitimate case: schema migrations

The one place triggers earn their keep is time-bounded, explicit migration work. During an expand-and-contract schema change (renaming a column, splitting a table, changing a type), a trigger can dual-write between the old and new shape so that mixed old-application and new-application traffic both see consistent data. The trigger exists for the duration of the migration window and is dropped once the backfill is complete and all writers are on the new shape.

This is trigger-as-scaffolding. A temporary mechanism that bridges a specific transition, with a clear removal criterion. It doesn’t hide business logic; it handles transitional compatibility between two versions of a schema while the application rolls forward.
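
A compressed sketch of that lifecycle for a column rename, assuming SQLite trigger syntax and a hypothetical users.fullname to users.display_name migration. PostgreSQL would express the same thing with a trigger function, and tools built for this job manage the triggers themselves; the shape to notice is expand, dual-write, backfill, tear down.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT NOT NULL);
    INSERT INTO users (id, fullname) VALUES (1, 'Ada Lovelace');

    -- Expand: add the new column and the temporary dual-write triggers.
    ALTER TABLE users ADD COLUMN display_name TEXT;

    CREATE TRIGGER users_dual_write_ins AFTER INSERT ON users
    BEGIN
        UPDATE users SET display_name = NEW.fullname WHERE id = NEW.id;
    END;

    CREATE TRIGGER users_dual_write_upd AFTER UPDATE OF fullname ON users
    BEGIN
        UPDATE users SET display_name = NEW.fullname WHERE id = NEW.id;
    END;
""")

# Old application code still writes only the old column during the window.
conn.execute("INSERT INTO users (id, fullname) VALUES (2, 'Grace Hopper')")
conn.execute("UPDATE users SET fullname = 'Grace B. Hopper' WHERE id = 2")

# Backfill rows the triggers never saw (row 1 predates them).
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")

# Contract: the removal criterion is met, so the scaffolding comes down.
conn.executescript("""
    DROP TRIGGER users_dual_write_ins;
    DROP TRIGGER users_dual_write_upd;
""")

rows = conn.execute("SELECT id, display_name FROM users ORDER BY id").fetchall()
remaining_triggers = conn.execute(
    "SELECT COUNT(*) FROM sqlite_master WHERE type = 'trigger'"
).fetchone()[0]
```

The triggers copy a value between two shapes of the same fact and nothing else, and the last step removes them. A trigger that outlives that step has stopped being scaffolding.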

The most common real-world instance of this pattern in MySQL is Percona’s pt-online-schema-change: it creates a shadow table with the target schema, installs INSERT/UPDATE/DELETE triggers on the original to replicate writes into the shadow while data is copied in chunks, then atomically renames and drops the triggers. The triggers exist for the migration’s duration and nothing longer. In PostgreSQL, pgroll does the same kind of dual-write-via-trigger for zero-downtime schema changes. Both treat triggers exactly as this section argues they should be treated: time-bounded scaffolding with an explicit tear-down step.

Worth noting the counter-example. GitHub’s gh-ost performs the same migrations without triggers, reading the binlog instead. Their stated reason is that triggers add synchronous load to the primary during the migration and share its locking fate. That argument is about migration tooling trade-offs, not a defense of triggers in application logic. The conclusion in both camps is the same: triggers outside of migration scaffolding don’t earn their keep.

Everything outside that narrow case (cross-cutting concerns, derived values, audit logs, product rules) belongs in application code where it’s visible, testable, and traceable from the same stack trace as the logic that caused the write.

How companies end up with triggers anyway

A large share of production databases carrying heavy trigger logic didn’t get there by choice. They got there by losing track of the write boundary. The pattern is predictable. A database starts as one service’s store. A second team needs the same data and connects directly because it’s easier than building an API. A data-warehouse ETL starts writing back aggregates. An analytics job needs a “last seen” column updated. A partner integration gets a read-write user “just for this quarter.” Five years later the database has a dozen clients, some inside the company, some not, some on systems nobody actively maintains, and nobody has a full list.

At that point, asking every writer to go through a shared library stops being possible. The library is only the single source of truth if every writer imports it, and “every writer” now includes a Java batch job, a Go analytics worker, a legacy PHP admin tool, a vendor ETL, and a spreadsheet someone’s been running for years. The company doesn’t know where all the calls are coming from, so moving rules into an API layer isn’t an option. There’s no API layer every caller can be forced through.

The database, meanwhile, sees every writer. That’s how a team ends up with a trigger enforcing a rule that should have been in application code. The trigger is the only remaining place. It’s a symptom of losing the boundary, not a design choice made on its merits.

The real lesson is that the boundary is the thing worth defending. Once multiple unknown clients are writing to a schema, every future rule either becomes a trigger by necessity or goes un-enforced. Greenfield systems should treat “who is allowed to write to this schema” as a first-class architectural decision, with one service in front of it and everyone else going through that service. Migrations out of the trap exist (service extraction, proxying direct-DB clients through a write API, introducing a write-time event bus) but they’re multi-quarter projects, and the trigger layer usually stays in place throughout because it’s doing the job nothing else is available to do.

Stored procedures, the vertical-scaling trap

Stored procedures move application logic into the database process. They’re the tool most directly opposed to the “database as storage” position, and the one with the clearest scaling argument against them. On the four axes, stored procedures fail most of them for general business logic.

Scope is within one database. Across services, impossible (which is part of why Spanner and DynamoDB don’t support them). Cadence is schema-migration speed; a product rule that needs a hotfix takes a migration. Cost is the procedure body running on the primary’s CPU, competing with every query for the same scarce resource, when the application tier could run the same logic on a pod that scales horizontally. Write-path count is the one axis where procedures are strongest: if the procedure is the only way to perform the operation, every write path is covered.

The narrow case for stored procedures is the intersection of those trade-offs. Operations that must be atomic, must cover every write path, and would be prohibitively expensive to run row-by-row over the network. Bulk data operations that are genuinely row-by-row expensive. Security boundaries where the application is explicitly not trusted with direct table access. Legacy systems where procedures are the system of record.

Outside those cases, stored procedures trade a scaling-ceiling problem and a deployment-cadence problem for centralization that a shared application library provides at lower cost. The argument that “a stored procedure prevents the application from drifting” is real, and the same argument applies to a validation library without the scaling or deployment penalty.

Views, the quietly useful option

Views don’t enforce writes but they do shape reads, and shaping reads affects correctness in practice. A view that filters soft-deleted rows means every consumer sees the same definition of “active”. Updatable views can also be a migration-compatibility tool.

CREATE VIEW active_orders AS
SELECT * FROM orders WHERE deleted_at IS NULL;

Scope is within-service. Cadence is fine either way; view bodies change as often as the underlying queries. Cost is the planner expanding views at query time, and complex views can hide expensive plans from the caller. Write-path count is read-time only, so views don’t help with integrity.

Views are underused for their cheap benefits (canonical join shapes, soft-delete filtering, migration shims) and overused when they become a layer of logic the calling code can’t see. Materialized views are a separate topic; they add refresh-cadence questions the live-query tools don’t.

Derived columns and counter caches, implicit logic

Comment counts, follower counts, status summaries, running totals. Every one of these encodes business logic; the question is which mechanism maintains it.

CREATE TABLE posts (
 id BIGINT PRIMARY KEY,
 author_id BIGINT NOT NULL REFERENCES users(id),
 comment_count INT NOT NULL DEFAULT 0,
 last_comment_at TIMESTAMPTZ
);

Through the four-axis lens, four mechanisms:

Application code maintains it. Cadence is fast. Cost is zero on the DB, per-write work on the app tier. Write-path count fails if any writer skips the library. Scope is fine within the service.
Materialized view or batch job. Cadence is decoupled from the write. Cost is the refresh window. Write-path count covers everything, but the value is stale between refreshes. Scope is within-service.
Read-time aggregation. Cadence is irrelevant. Cost is per-read and can be expensive on feed-style queries. Write-path count is always correct. Scope is within-service.
Separate counter service with async events. Cadence is fast. Cost is extra infrastructure and delivery semantics to reason about. Write-path count covers everything if every writer publishes the event. Scope is any.

A trigger is conspicuously absent from that list on purpose. Counter-cache triggers are the canonical example of hidden logic causing a contention problem the application team can’t see: every concurrent comment insert serializes on the parent post’s row lock, and the debugging path goes straight through PL/pgSQL the service engineers didn’t write. The four-axis analysis points instead at the library-maintained counter when there’s one writer, the background rollup when reads are hot, and a separate counter service at scale or across boundaries.
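
A sketch of the background-rollup option, assuming a comments table (not shown in the post) and SQLite as a stand-in engine. The function name rollup_comment_counts is hypothetical; a real job would likely restrict the update to posts touched since the last run.

```python
import sqlite3


def rollup_comment_counts(conn):
    """Recompute the counter-cache columns from the comments table."""
    with conn:
        conn.execute("""
            UPDATE posts SET
                comment_count = (
                    SELECT COUNT(*) FROM comments c WHERE c.post_id = posts.id
                ),
                last_comment_at = (
                    SELECT MAX(c.created_at) FROM comments c
                    WHERE c.post_id = posts.id
                )
        """)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        comment_count INTEGER NOT NULL DEFAULT 0,
        last_comment_at TEXT
    );
    CREATE TABLE comments (
        id INTEGER PRIMARY KEY,
        post_id INTEGER NOT NULL REFERENCES posts(id),
        created_at TEXT NOT NULL
    );
    INSERT INTO posts (id) VALUES (1), (2);
    -- Comment inserts never touch the parent post row, so there is no hot row.
    INSERT INTO comments (post_id, created_at) VALUES
        (1, '2025-03-01T10:00:00Z'),
        (1, '2025-03-01T11:30:00Z');
""")

rollup_comment_counts(conn)
counts = conn.execute(
    "SELECT id, comment_count, last_comment_at FROM posts ORDER BY id"
).fetchall()
```

The trade is the one named in the list above: the counts are stale between runs, and in exchange concurrent comment writers never serialize on the post.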

The library pattern, done seriously

The natural consequence of “narrow database, logic in application” is that the application layer’s logic has to be reusable. A validation that only lives in one service’s Rails app isn’t a library, it’s service code. A library every writer imports is the actual mechanism.

Four shapes show up in practice:

Monolith, one language. A package inside the codebase, imported by every write path. Works well. Admin tools and background jobs depend on the same package as the web request path. Backfill scripts should depend on it too; in practice this is where discipline breaks down.
Microservices, one language. A shared library published as a package. Every service depends on the same version, or accepts that a rollout takes a deploy cycle across services. Version skew is the operational tax.
Polyglot services. A shared library doesn’t exist. Validation gets reimplemented per service, or pushed into a validation service that every caller hits over RPC. The RPC option is real and works; it turns “shared library” into “shared service” with the same logical role.
Schema-first code generation. Tools like sqlc and jOOQ generate typed client code from the schema, which gives a narrow kind of library reuse (type safety and query shapes) without attempting to encode business logic. For logic itself, schemas aren’t enough; the library is separate.

The discipline that makes this work: the library is the only write path, and if it isn’t, the database’s declarative constraints are the backup. The two pieces reinforce each other. The library holds the full rule set, fast and rich and horizontal-scale. The schema holds the small subset the database can enforce cheaply and that every writer, library or not, has to pass through.

The duplication trap

The most common failure mode isn’t picking the wrong layer. It’s picking both without deciding which is authoritative.

Application validator: email must match regex A.
Database CHECK: email must match regex B.
Over the years, one gets updated (for GDPR, for internationalization); the other doesn’t.
Legacy rows exist that pass the old version but not the new one.
A migration that tries to tighten the CHECK fails on legacy rows the application thought were fine.

The pattern repeats with status enums, numeric ranges, referential rules, and soft-delete semantics. Two versions of the truth stay in sync as long as someone is actively keeping them in sync, and then they don’t.
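
One way to make that drift visible before a migration finds it is a test that runs both rules over the same samples. The two regexes below are invented for illustration (neither comes from a real system), and the sketch assumes the database rule can be mirrored as a Python regex at all.

```python
import re

# Hypothetical "regex A": the permissive rule in the application validator.
APP_EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Hypothetical "regex B": the stricter pattern frozen into the CHECK years ago.
DB_EMAIL = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def disagreements(samples):
    """Return the samples that one layer accepts and the other rejects."""
    return [
        s for s in samples
        if bool(APP_EMAIL.match(s)) != bool(DB_EMAIL.match(s))
    ]


samples = ["ada@example.com", "o'brien@example.com", "not-an-email"]
drift = disagreements(samples)
```

Here the address with an apostrophe passes the application and fails the database, which is exactly the kind of row that surfaces as a surprise 500 in production.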

The useful framing: pick one layer as authoritative and name the other as a UX mirror or a safety net. The authoritative layer is the one that runs when the other doesn’t, which, for correctness invariants where write paths multiply, still points at the database for the narrow declarative subset.

-- authoritative: the declarative CHECK
CHECK (status IN ('pending', 'active', 'closed'))

# mirror in the library: better errors, fast-fail before the round trip
class ValidationError(Exception):
    """Stand-in; use the framework's own validation error if it has one."""

def validate_status(value):
    if value not in ("pending", "active", "closed"):
        raise ValidationError("Status must be pending, active, or closed.")

If the library and the schema disagree, the schema wins and the write fails. The failure is loud, traceable, and tells you the drift exists, instead of the silent corruption you get when neither layer enforces a rule.

Rules the assistant can see

The choice of where to put a rule is, among other things, a choice about which readers can see it. An AI assistant writing SQL or application code against the schema reads the catalog (column types, constraints, FKs, CHECK definitions) and whatever source files the prompt happens to include. Declarative rules show up in information_schema and pg_constraint. The assistant can reason about them without being pointed at additional files. A CHECK (status IN ('pending', 'active', 'closed')) is visible to any schema-reading tool on day one.

Rules living in triggers, stored procedures, ORM callbacks, or a shared Python validation library don’t surface when the same tool reads the catalog. The write path enforces them at runtime; the schema doesn’t describe them. A model generating an INSERT statement against a table whose uniqueness is enforced only by a before-insert trigger will produce a query that looks correct and violates an invariant the catalog never mentioned. This doesn’t change the conclusion that most logic belongs in the application, but it does tip the math, at the margin, toward the narrow set of correctness invariants where declarative constraints pay double: they enforce on every write path, and they’re the only form of the rule a schema-reading assistant sees for free.

Trade-offs

Every position in this post has counter-arguments, and they’re real.

Declarative database constraints lock you into SQL semantics. A CHECK constraint doesn’t survive a migration to DynamoDB or Spanner without rework. Teams building for a future migration accept less database-side logic in exchange for portability. The trade is real; the frequency of actual cross-engine migrations is lower than the frequency of discussions about them.
Schema changes are slow enough that even “simple” constraints are friction. Adding a CHECK to a 500M-row table is a migration project. For teams shipping schema changes weekly, every constraint is a cost, and sometimes the cheaper answer is to accept looser database-side invariants and stricter application-side ones.
Application-side validation is easier to test, version, and roll back. A library’s tests run in milliseconds; a constraint’s tests need a real database. Teams with weak integration-testing infrastructure end up under-testing database-side rules.
Horizontal-scaling arithmetic isn’t universal. For services running on a single database at moderate load, the “vertical scaling ceiling” argument is an abstraction. The primary has plenty of headroom and the scaling argument is theoretical. The argument matters more as traffic grows.
Shopify’s position is internally consistent. No database-level foreign keys, all integrity in models, sharded storage. It works because every write path goes through Rails and because the operational investment in model-layer integrity is serious. A smaller team without that investment can’t safely adopt the same pattern; the constraints in the database are what a smaller team can afford.
Stored procedures aren’t universally bad. The Yugabyte post is right that in a single-service OLTP context, procedures can centralize logic effectively. The scaling argument is real but not always the binding constraint. Teams with deep SQL skills and disciplined version-control-for-procedures can extract more value than the “avoid them” position suggests.

The balance described above is what holds across the most common cases. Specific cases have specific answers. The failure mode is rarely picking the wrong point on the axis. It’s not picking at all.

A rule-by-rule framework

Instead of a blanket policy, a set of questions that point at the right layer per rule.

Does the rule cross service boundaries? If yes, application library or orchestration service. The database can’t help.
Would violation corrupt data? If yes, the database should enforce it as a declarative constraint, because every write path has to be covered.
Is the rule a derived value with a stable formula? Generated column. Cheap, covers every writer, zero sync code.
Is the rule a derived value with a changing formula or external inputs? Application library.
Does the rule depend on anything outside the row (request context, external services, feature flags)? Application library.
Does the rule change more often than quarterly? Application library.
Is the rule a cross-cutting concern every write path needs (timestamps, audit logs)? Application library that every writer imports, not a trigger. The trigger hides the logic; the library makes it visible to the reader of the code that caused the write.
Does the rule involve non-trivial computation or touch multiple rows per call? Application library. Database CPU is the scarce tier.
Is there more than one write path? The library alone isn’t enough; declarative constraints in the schema are the backup.

The questions don’t eliminate judgment (several rules will land on edges) but they make the trade-offs visible and keep decisions from being driven by which layer the author was working in when the rule came up.
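
The questions can be read as a small decision function. This is one possible encoding, not a definitive one: the field names, the precedence between questions, and the choice to withhold a constraint from rules that need outside context are all assumptions made to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    crosses_services: bool = False
    violation_corrupts_data: bool = False
    derived_value: bool = False
    stable_formula: bool = False
    needs_outside_context: bool = False
    changes_more_than_quarterly: bool = False
    heavy_computation: bool = False
    multiple_write_paths: bool = False


def place_rule(rule):
    """Return the layers a rule should live in, asking the questions in order."""
    if rule.crosses_services:
        return ["application library or orchestration service"]
    if rule.derived_value and rule.stable_formula:
        return ["generated column"]

    layers = ["application library"]
    # Rules that change often, need request context, or burn CPU stay out
    # of the schema even when a constraint would otherwise be tempting.
    volatile = (
        rule.needs_outside_context
        or rule.changes_more_than_quarterly
        or rule.heavy_computation
    )
    if not volatile and (rule.violation_corrupts_data or rule.multiple_write_paths):
        layers.append("declarative constraint")
    return layers


non_negative_amount = Rule(violation_corrupts_data=True, multiple_write_paths=True)
promo_eligibility = Rule(needs_outside_context=True, changes_more_than_quarterly=True)
line_total = Rule(derived_value=True, stable_formula=True)
```

Rules that land on an edge still need a human call; the value of writing it down is that the call is made against the same questions every time.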

The bigger picture

Across services, the database is storage and logic lives in services and shared libraries. That’s the direction Spanner, DynamoDB, Cassandra, and the pattern literature all point, and the cross-service question is genuinely settled. Within a service it’s softer. The database can enforce things the application can’t, a narrow set of declarative constraints costs almost nothing, and the schema is the only layer that sees every writer the library’s author didn’t plan for. Keep the database lean. Put the full rule set in a library the application owns. Let the schema carry the small subset that catches the writes the library missed (which is more writes than anyone planning the system thought there would be).

Database Deadlocks, Part 2: Diagnosis, Retries, and Prevention

Sun, 02 Mar 2025 00:00:00 +0000

TL;DR

Part 1 covered the patterns. This post is the operational half: reading the deadlock log to identify which pattern fired, designing retries that fail loudly instead of hiding the real bug, isolating hot rows before they become incidents, and the prevention primitives (NOWAIT, SKIP LOCKED, isolation-level changes, lock_timeout) that remove entire categories from the workload.

A nightly inventory sync deadlocks every Tuesday at 02:14. The job runs INSERT INTO inventory ... ON DUPLICATE KEY UPDATE across eighteen partner warehouses in parallel, and the deadlock counter spikes from 3 per hour to 400 per hour for the duration of the run. The on-call response over the last three incidents has been the same: bump the retry limit from 5 to 10. The deadlocks still happen, they just take longer to surface as user-facing errors. After the third week, someone reads the LATEST DETECTED DEADLOCK output and finds lock_mode X locks rec but not gap on the primary key. The workers aren’t sorting the upsert batches. One worker processes (1, 2, 3) while another processes (3, 2, 1); each ends up holding a row the other needs next, and the cycle closes.

The fix is one line in application code: rows.sort(key=lambda r: r.sku) before each batch. The deadlock counter drops back to baseline immediately. The retry limit goes back to 5. The three weeks of bumping retry settings were chasing the symptom; the actual fix lived in the access path, not the retry layer. The log told the story the first time it was generated.
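
Why sorting is enough can be shown without a database. The sketch below treats a batch as a plain list of keys (an assumption; the real rows carry a sku attribute) and checks whether two batches touch any pair of keys in opposite order, which is the precondition for this kind of cycle. The helper names are hypothetical.

```python
from itertools import combinations


def has_order_inversion(batch_a, batch_b):
    """True if the two batches lock some pair of shared keys in opposite order.

    Assumes each batch contains a key at most once.
    """
    pos_b = {key: i for i, key in enumerate(batch_b)}
    shared = [key for key in batch_a if key in pos_b]  # kept in batch_a's order
    # combinations() yields (x, y) with x before y in batch_a.
    return any(pos_b[x] > pos_b[y] for x, y in combinations(shared, 2))


def ordered(batch):
    """Sort a batch so every worker acquires row locks in the same order."""
    return sorted(batch)


worker_a = [1, 2, 3]
worker_b = [3, 2, 1]

unsorted_can_deadlock = has_order_inversion(worker_a, worker_b)
sorted_can_deadlock = has_order_inversion(ordered(worker_a), ordered(worker_b))
```

Once every worker sorts on the same key, no pair of batches can hold inverted orders, so a worker may wait on another but the wait can never close into a cycle.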

This post is the operational half of the deadlocks series. Part 1 covered the patterns. What follows is the sequence: read the log to identify which pattern fired, design retries that don’t hide the bug, then reach for the prevention primitives that remove categories from the workload entirely.

Read the MySQL deadlock log

SHOW ENGINE INNODB STATUS dumps the most recent deadlock in the LATEST DETECTED DEADLOCK section. The catch: only the most recent. On a busy system, deadlocks overwrite each other faster than someone can log in and copy the output. Before anything else, turn on innodb_print_all_deadlocks = ON in every production deployment. It writes every deadlock to the error log instead of a single overwriting slot. The volume is negligible, the diagnostic value is high, and there is no downside.

A representative entry looks like this:

*** (1) TRANSACTION:
TRANSACTION 4823941, ACTIVE 3 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 4 lock struct(s), heap size 1136, 3 row lock(s)
MySQL thread id 892, OS thread handle 0x7f..., query id 18293 ...
UPDATE orders SET status = 'shipped' WHERE id = 1001

*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 48 page no 112 n bits 144 index PRIMARY
of table `shop`.`orders` trx id 4823941 lock_mode X locks rec but not gap
waiting

*** (2) TRANSACTION:
TRANSACTION 4823942, ACTIVE 2 sec starting index read
UPDATE orders SET status = 'paid' WHERE id = 1002

*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 48 page no 112 n bits 144 index PRIMARY
of table `shop`.`orders` trx id 4823942 lock_mode X locks rec but not gap

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 48 page no 112 n bits 144 index PRIMARY
of table `shop`.`orders` trx id 4823942 lock_mode X locks rec but not gap
waiting

*** WE ROLL BACK TRANSACTION (2)

The parts that matter:

lock_mode vs. lock_type. X is exclusive, S is shared. locks rec but not gap is a pure record lock; locks gap before rec is a gap lock; the unadorned X under REPEATABLE READ is usually next-key (record + gap). Matching lock_mode S locks rec but not gap against Part 1’s unique-index section tells you immediately that this is a duplicate-key-on-insert deadlock.
index name. index PRIMARY vs. index idx_customer reveals whether the cycle formed on the clustered index or a secondary one. Two transactions approaching the same rows from different indexes is the “secondary index locks on InnoDB” pattern from Part 1; the fix is usually consolidating access paths.
The query text. This is the last statement the transaction executed before the deadlock, not necessarily the one that caused it. A transaction holding locks from three earlier statements can deadlock on the fourth, and the log only shows the fourth. Cross-reference with the application’s structured logs to reconstruct the full transaction.
trx id is monotonically increasing and stable for the life of the transaction, but the general log and slow-query log don’t record it. They record the connection ID, which the deadlock entry prints as MySQL thread id. Searching those logs for that thread id reconstructs the full statement sequence, but only if general-query logging is on for the window in question, which it usually isn’t.
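
With innodb_print_all_deadlocks on, the error log accumulates entries faster than anyone reads them, so pulling out the fields listed above is worth automating. A rough sketch, assuming entries shaped like the sample in this post; real entries carry more lines and other lock descriptions, and parse_deadlock is a hypothetical helper, not part of any MySQL tooling.

```python
import re

HEADER = re.compile(
    r"^\*\*\* \((\d+)\) "
    r"(TRANSACTION|HOLDS THE LOCK\(S\)|WAITING FOR THIS LOCK TO BE GRANTED):"
)
VICTIM = re.compile(r"^\*\*\* WE ROLL BACK TRANSACTION \((\d+)\)")
LOCK = re.compile(
    r"index (?P<index>\S+) of table `(?P<schema>[^`]+)`\.`(?P<table>[^`]+)` "
    r"trx id \d+ lock_mode (?P<mode>[XS])(?P<detail>[a-z ]*)"
)
STATEMENT = re.compile(r"^(SELECT|INSERT|UPDATE|DELETE|REPLACE)\b", re.IGNORECASE)


def _record(result, number, kind, lines):
    trx = result["transactions"].setdefault(number, {})
    if kind == "TRANSACTION":
        statements = [line for line in lines if STATEMENT.match(line)]
        trx["last_statement"] = statements[-1] if statements else None
        return
    # Lock descriptions may wrap across lines, so match on the joined text.
    match = LOCK.search(" ".join(lines))
    if match:
        lock = match.groupdict()
        lock["detail"] = lock["detail"].replace("waiting", "").strip()
        trx["holds" if kind.startswith("HOLDS") else "waits_for"] = lock


def parse_deadlock(entry):
    """Parse one deadlock entry into {'victim': n, 'transactions': {...}}."""
    result = {"victim": None, "transactions": {}}
    current, lines = None, []
    for raw in entry.splitlines():
        line = raw.strip()
        header, victim = HEADER.match(line), VICTIM.match(line)
        if header or victim:
            if current:
                _record(result, *current, lines)
            current, lines = None, []
            if header:
                current = (int(header.group(1)), header.group(2))
            else:
                result["victim"] = int(victim.group(1))
        elif line:
            lines.append(line)
    if current:
        _record(result, *current, lines)
    return result


SAMPLE = """\
*** (1) TRANSACTION:
TRANSACTION 4823941, ACTIVE 3 sec starting index read
UPDATE orders SET status = 'shipped' WHERE id = 1001

*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 48 page no 112 n bits 144 index PRIMARY
of table `shop`.`orders` trx id 4823941 lock_mode X locks rec but not gap
waiting

*** (2) TRANSACTION:
TRANSACTION 4823942, ACTIVE 2 sec starting index read
UPDATE orders SET status = 'paid' WHERE id = 1002

*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 48 page no 112 n bits 144 index PRIMARY
of table `shop`.`orders` trx id 4823942 lock_mode X locks rec but not gap

*** WE ROLL BACK TRANSACTION (2)
"""

parsed = parse_deadlock(SAMPLE)
```

Grouping parsed entries by (table, index, mode, detail) over a week is usually enough to tell one recurring pattern from several unrelated ones.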

performance_schema.data_locks and data_lock_waits give a real-time view of current locks and waits. Useful for catching a deadlock-adjacent pathology (long wait chains, hot rows) before the cycle forms:

SELECT
 bl.lock_type, bl.lock_mode, bl.object_name, bl.index_name,
 w.REQUESTING_ENGINE_TRANSACTION_ID AS waiting_trx,
 w.BLOCKING_ENGINE_TRANSACTION_ID AS blocking_trx
FROM performance_schema.data_lock_waits w
JOIN performance_schema.data_locks bl
 ON w.BLOCKING_ENGINE_LOCK_ID = bl.ENGINE_LOCK_ID
WHERE bl.OBJECT_SCHEMA = 'shop';

Read the PostgreSQL deadlock log

PostgreSQL’s diagnostic story is narrower by design. Deadlocks are logged automatically when the cycle is detected. log_lock_waits = on logs any wait exceeding deadlock_timeout (default 1s), which catches the wait-chain escalation before the detector fires. There’s no equivalent to SHOW ENGINE INNODB STATUS; everything lives in postgresql.log or the extensions you’ve installed.

A representative deadlock entry:

ERROR: deadlock detected
DETAIL: Process 14234 waits for ShareLock on transaction 89234;
 blocked by process 14235.
 Process 14235 waits for ShareLock on transaction 89233;
 blocked by process 14234.
 Process 14234: UPDATE accounts SET balance = balance - 100 WHERE id = 2
 Process 14235: UPDATE accounts SET balance = balance + 50 WHERE id = 1
HINT: See server log for query details.
CONTEXT: while updating tuple (0,18) in relation "accounts"

The ShareLock on transaction X wording is PostgreSQL-specific. One transaction is waiting to see the commit status of another (a row-lock wait manifests as waiting on the holder’s transaction ID). The tuple identifier (0,18) points to the exact physical row (page 0, tuple 18 in the heap), which is useful for reproducing the scenario but changes as rows are updated (MVCC creates new versions at new (page, tuple) locations).

For real-time inspection, pg_locks joined against pg_stat_activity shows live lock state:

SELECT
 a.pid, a.usename, a.state,
 a.wait_event_type, a.wait_event,
 l.mode, l.locktype, l.relation::regclass,
 pg_blocking_pids(a.pid) AS blocked_by,
 LEFT(a.query, 80) AS query
FROM pg_stat_activity a
LEFT JOIN pg_locks l ON l.pid = a.pid AND NOT l.granted
WHERE a.state != 'idle'
ORDER BY a.xact_start;

pg_blocking_pids(pid) returns the array of PIDs blocking a given transaction. Walking it recursively reconstructs the live wait-for graph: the same data the deadlock detector uses, just before a cycle forms. For hot systems, pg_stat_statements combined with pg_stat_activity snapshots at regular intervals builds a picture of which statements accumulate the most wait time, which is almost always the right first place to look.
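
The walk itself is a few lines once a snapshot is in hand. This sketch assumes the snapshot has already been fetched into a dict mapping each waiting pid to its pg_blocking_pids() result; the function names are hypothetical, and only the first blocker is followed at each hop to keep it short.

```python
def blocking_chain(blocked_by, pid):
    """Follow blockers from `pid` until a root blocker or a cycle is reached.

    `blocked_by` maps a waiting pid to the list of pids blocking it;
    pids that are not waiting are absent from the mapping.
    Returns (chain, is_cycle).
    """
    chain, seen = [pid], {pid}
    while blocked_by.get(chain[-1]):
        nxt = blocked_by[chain[-1]][0]
        chain.append(nxt)
        if nxt in seen:
            return chain, True  # a cycle: the deadlock detector would fire
        seen.add(nxt)
    return chain, False


def root_blockers(blocked_by):
    """Pids that block someone but are not themselves waiting."""
    blockers = {b for pids in blocked_by.values() for b in pids}
    return sorted(b for b in blockers if not blocked_by.get(b))


# Two sessions queue behind 103, one of them through 102.
snapshot = {101: [102], 102: [103], 104: [103]}
chain, is_cycle = blocking_chain(snapshot, 101)
roots = root_blockers(snapshot)
```

The root blockers are the sessions worth looking at first: everything else in the chain is a symptom of whatever they are holding.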

Row locks are invisible in pg_locks

PostgreSQL’s row-level locks (the result of SELECT ... FOR UPDATE, FK checks, and plain UPDATE/DELETE) are stored on the tuple itself, in the xmax system column, not in the shared lock table. They don’t show up in pg_locks. The only way to see them is through the pgrowlocks extension, which scans the heap directly:

CREATE EXTENSION pgrowlocks;
SELECT * FROM pgrowlocks('accounts');

This is the single biggest difference between PG and InnoDB lock introspection, and the reason PG operators often feel blind to row-level contention until a cycle actually forms.

Handle SERIALIZABLE serialization failures separately

Under SERIALIZABLE isolation, PostgreSQL uses Serializable Snapshot Isolation (SSI), an optimistic mechanism based on SIREAD predicate locks that track read-write dependencies between transactions. SIREAD locks never block, so SSI itself cannot deadlock; the ordinary row locks that writes take still can. What SSI does instead is abort one transaction with a serialization failure when it detects a dangerous read-write dependency cycle that would violate serializability.

The two failure codes look similar but have fundamentally different semantics:

40001 serialization_failure. SSI detected a dependency cycle and aborted the transaction before it could commit a non-serializable result. The transaction did nothing wrong; the combination of its operations with a concurrent transaction would have produced an inconsistency. Retrying is always safe and usually succeeds (the concurrent transaction will have committed or aborted, so the second attempt doesn’t see the same conflict).
40P01 deadlock_detected. A cycle in the wait-for graph was broken by killing a victim. Retrying may or may not succeed depending on what caused the cycle. If the cycle was deterministic (two code paths with inconsistent ordering), it will keep recurring.

The practical consequence for retry architecture: an application running under SERIALIZABLE must handle 40001. It’s not a deadlock, it’s the normal failure mode of SSI, and retries are the only recovery path. PostgreSQL’s REPEATABLE READ raises 40001 as well, when a transaction tries to update a row that changed after its snapshot was taken. An application running under READ COMMITTED never sees 40001. An application that handles 40001 identically to 40P01 is correct but coarse. The right granularity is: always retry 40001 (the workload’s own correctness guarantee assumes this); retry 40P01 with caution and escalate on repeat.

import psycopg

# backoff(), log_deadlock_for_analysis(), and TransactionRetryExhausted are
# application-defined helpers.
def retry_on_conflict(fn, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            return fn()
        except psycopg.errors.SerializationFailure:
            # 40001: always retry. SSI guarantees make this safe.
            backoff(attempt)
        except psycopg.errors.DeadlockDetected:
            # 40P01: retry with caution; log for root-cause analysis.
            log_deadlock_for_analysis(...)
            backoff(attempt)
    raise TransactionRetryExhausted

Architect retries that surface bugs, not bury them

Every database driver documentation says “deadlocks happen, retry the transaction.” That’s true. It’s also incomplete. The dangers are subtle enough that a naive retry loop becomes part of the problem:

Retries without backoff make the cycle worse. The condition that caused the deadlock (contention on a hot key set) is still in effect when the retry runs. A tight retry loop turns one deadlock into a thundering herd: all victims retry simultaneously, all hit the same contention, all deadlock again. Use exponential backoff with full jitter, capped at a few hundred milliseconds.

Retries mask lock-ordering bugs. If an application deadlocks 10x/minute but retries successfully, the operator sees no failures, but each affected transaction is doing at least twice the work (original + retry). The deadlock rate itself is a metric worth tracking, not just the post-retry error rate. When the rate grows, the fix is diagnosing the pattern, not tuning the retry limit.

Retries aren’t always safe. A transaction that sent an external notification, wrote to a message queue, or called a non-idempotent HTTP endpoint before the deadlock can’t be blindly retried; the external side effect already happened. Retries belong on database-only transactions, or on transactions where the external calls are idempotent and tolerant of duplicate execution. The boundary is architectural, not a library setting.

Retries need a budget. If a transaction can’t complete after ~3 attempts, the problem is no longer transient contention. It’s either a systemic hot spot or a bug that retries will never resolve. Escalate (alert, fail the request, enqueue for manual review), don’t loop forever.

The retry pattern that works in production:

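import random
import time

# db, metrics, DeadlockError, SerializationFailure, and RetryBudgetExhausted
# are application-defined.
def run_with_retry(do_work):
    for attempt in range(3):
        try:
            with db.transaction():
                do_work()
            return
        except (DeadlockError, SerializationFailure) as e:
            metrics.increment("db.retry", tags={"error": e.code})
            # Full jitter: sleep a random amount up to 0.1 * 2**attempt seconds.
            time.sleep(random.uniform(0, 0.1 * 2**attempt))
    raise RetryBudgetExhausted()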

Three attempts, exponential backoff up to ~400ms, metrics emitted on every retry, hard failure past the budget. The metric matters as much as the retry: without it, the team never learns which transactions are retrying frequently and why.
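For larger retry budgets the exponential term needs an explicit ceiling, which the three-attempt loop gets for free. A minimal sketch of a capped full-jitter delay, assuming a 100ms base and a 400ms cap; the function name and both defaults are illustrative choices, not taken from any driver:

```python
import random
from typing import Optional


def full_jitter_delay(attempt: int, base: float = 0.1, cap: float = 0.4,
                      rng: Optional[random.Random] = None) -> float:
    """Return a sleep duration in seconds for a zero-based retry attempt.

    The ceiling doubles with each attempt but never exceeds `cap`; the actual
    delay is drawn uniformly from [0, ceiling] so concurrent victims spread out.
    """
    source = rng if rng is not None else random
    ceiling = min(cap, base * 2 ** attempt)
    return source.uniform(0.0, ceiling)
```

Passing a seeded random.Random as rng makes the delays reproducible in tests without changing production behavior.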

Idempotency is a transaction-shape property

A transaction is safe to retry iff re-running it produces the same observable state. That includes downstream side effects. A transaction that writes to a table AND sends a webhook is not safe to retry even if both operations are internally correct: the second attempt sends a duplicate webhook. The fix is the outbox pattern: write the webhook-send intent to a table in the same transaction, then have a separate worker process the outbox with its own idempotency guarantees. This is a non-negotiable part of building deadlock-retry-safe systems at scale.
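The outbox shape described above can be sketched as follows. This is an illustration only: it uses the standard-library sqlite3 module so it runs anywhere, and the table names, the event_key format, and the send callback are all assumptions made for the example.

```python
import json
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    conn.executescript(
        """
        CREATE TABLE orders (id INTEGER PRIMARY KEY, total INTEGER NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            event_key TEXT NOT NULL UNIQUE,  -- lets the receiver drop duplicates
            payload TEXT NOT NULL,
            sent INTEGER NOT NULL DEFAULT 0
        );
        """
    )


def place_order(conn: sqlite3.Connection, order_id: int, total: int) -> None:
    """Write the order and the webhook intent in one transaction.

    If the transaction is rolled back (deadlock victim, constraint error),
    the intent disappears with it, so a retry cannot double-send.
    """
    with conn:  # commits on success, rolls back on exception
        conn.execute("INSERT INTO orders (id, total) VALUES (?, ?)", (order_id, total))
        conn.execute(
            "INSERT INTO outbox (event_key, payload) VALUES (?, ?)",
            (f"order-created:{order_id}", json.dumps({"order_id": order_id})),
        )


def drain_outbox(conn: sqlite3.Connection, send) -> int:
    """Deliver pending intents in insertion order; returns how many were sent.

    Delivery is at-least-once: a crash between send() and the UPDATE resends
    the event, which is why `send` receives the event_key for deduplication.
    """
    rows = conn.execute(
        "SELECT id, event_key, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, event_key, payload in rows:
        send(event_key, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

The business transaction never talks to the network, so retrying it is a database-only operation; the worker owns the side effect and its own duplicate handling.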

Sort writes to lock in one direction

Part 1’s lock-ordering deadlock (two transactions updating the same set of rows in opposite orders) is the single most common production pattern and the one with the highest-leverage fix. If every code path that writes to a set of tables acquires locks in the same order, the wait-for graph literally cannot form a cycle on those rows. The engine still takes the locks, still holds them for the duration of the transaction, but the second transaction waits cleanly for the first instead of grabbing a lock the first will need.

The rule is: sort the rows by a stable key (usually the primary key) in application code, before any SQL is issued. Lock acquisition order in both engines is determined by the order the engine processes rows, which for most write patterns is the order the application submitted them. Sort once up front, and N workers all doing the same thing can’t cycle because they all approach the row set from the same end.

The three batch shapes that matter in practice, and where the ordering actually happens:

1. Loop of per-row UPDATEs. The classic batch worker:

# Sort in the application; the iteration order IS the lock acquisition order.
rows.sort(key=lambda r: r.id)
for row in rows:
    cursor.execute(
        "UPDATE accounts SET balance = balance + %s WHERE id = %s",
        (row.delta, row.id),
    )

Each UPDATE locks its target row at execution time; the loop order determines acquisition order. No SQL-level ORDER BY involved; the fix lives in the .sort() call.

2. Bulk UPSERT. INSERT ... ON DUPLICATE KEY UPDATE (MySQL) or INSERT ... ON CONFLICT (PostgreSQL):

from psycopg2.extras import execute_values

# Sort rows by the unique key BEFORE building the batch.
rows.sort(key=lambda r: r.id)
execute_values(
    cursor,
    "INSERT INTO accounts (id, balance) VALUES %s "
    "ON CONFLICT (id) DO UPDATE SET balance = accounts.balance + EXCLUDED.balance",
    [(r.id, r.delta) for r in rows],
)

The engine processes the VALUES list in order and acquires locks as it goes. Two concurrent batches sorted by the same key approach the key space from the same end; without the sort, one batch might process (1, 2, 3) while another processes (3, 2, 1) and they deadlock on the middle row. This is the exact shape of the bulk-UPSERT deadlocks called out in Part 1’s unique-index section.

3. Small multi-row transactions. The canonical bank transfer:

-- Always update the lower id first, regardless of transfer direction.
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;

For 2–3 rows with per-row different values, the application computes sorted((X, Y)) and issues UPDATEs in that order. Same principle as the batch case, just smaller.

Where SELECT ... FOR UPDATE ORDER BY actually earns its keep. Most batches don’t need it; they control lock order through the submission order. The one shape where it’s the right answer is a single UPDATE statement over a derived table where the engine decides scan order and you can’t control it from outside:

UPDATE accounts a
SET balance = a.balance + u.delta
FROM (VALUES (1, 50), (2, 100), (3, 25)) AS u(id, delta)
WHERE a.id = u.id;

Here, sorting the VALUES list in application code doesn’t reliably control lock order; the planner picks the scan. A SELECT id FROM accounts WHERE id IN (...) ORDER BY id FOR UPDATE up front pre-acquires locks in deterministic order before the UPDATE runs. Or refactor into shape 1 or 2.

ORDER BY controls result order, not scan order

ORDER BY on a SELECT ... FOR UPDATE controls the result ordering; whether it also controls lock order depends on the engine. InnoDB acquires locks during the scan. With a primary-key or unique-index predicate (WHERE id IN (...) on a PK), the planner does ordered index lookups and locks land in ORDER BY order in practice. For non-indexed predicates or range scans on non-unique columns, the planner may scan in a different order and sort results afterward; locks get acquired in scan order. PostgreSQL applies ORDER BY first and locks rows as they come out of the sort, so locks there follow the sort order. Verify with EXPLAIN before relying on this pattern against non-PK predicates. Also: MySQL’s UPDATE ... ORDER BY syntax applies one SET clause to all matching rows; it doesn’t help when rows need different values.

This sounds trivial. It almost never is in practice; the ordering has to hold across every code path that writes to the same tables: the main request handler, backfill scripts, admin scripts, scheduled jobs, ORM bulk-save methods, and whatever migration scripts run during releases. One path that writes in a different order is enough to reopen the cycle. The durable fix is encoding the order in the access layer so individual query sites can’t diverge from it:

A repository function that always sorts by PK before writing.
A stored procedure or database function that owns the multi-row write.
A service method with the ordering baked in, and a lint rule or review check forbidding direct table writes from elsewhere.
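The first option can be sketched as a small repository class. This is a hypothetical shape, assuming a DB-API cursor that uses %s placeholders; the class and method names are made up for the example.

```python
from collections import defaultdict
from typing import Iterable, Tuple


class AccountRepository:
    """Hypothetical single entry point for multi-row writes to `accounts`."""

    UPDATE_SQL = "UPDATE accounts SET balance = balance + %s WHERE id = %s"

    def __init__(self, cursor):
        self._cursor = cursor  # any DB-API cursor using %s placeholders

    def apply_deltas(self, deltas: Iterable[Tuple[int, int]]) -> None:
        """Apply (account_id, delta) pairs in ascending primary-key order."""
        # Merge repeated ids so each row is locked exactly once.
        merged = defaultdict(int)
        for account_id, delta in deltas:
            merged[account_id] += delta
        # Ascending id order is the lock acquisition order for every caller.
        for account_id in sorted(merged):
            self._cursor.execute(self.UPDATE_SQL, (merged[account_id], account_id))
```

Callers hand over deltas in whatever order they produced them; the ordering decision lives in one place, which is what makes a "no direct writes to accounts" rule enforceable.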

Two places this invariant breaks without anyone noticing:

ORM bulk-save methods. ORMs hide whether they process in input order or reorder internally. Django’s Model.objects.bulk_update, SQLAlchemy’s bulk_update_mappings, ActiveRecord’s upsert_all, Hibernate’s batch inserts; some process in input order, some sort by PK internally, some chunk before doing either. If you can’t tell from docs, test it: two concurrent bulk-saves with opposing-ordered input lists will either deadlock (proving input order matters) or not (proving the ORM sorts internally). Either way, sorting the input collection before handing it to the ORM is cheap insurance.

Batches sourced from a SELECT in the same transaction. A common pattern: “grab N pending rows, process them.” If the feeder SELECT doesn’t have an ORDER BY, rows come back in scan order, non-deterministic across workers, which reopens the cycle. The fix is an explicit ORDER BY on the feeder query, not in the subsequent loop:

# Bad: scan order feeds the loop.
cursor.execute("SELECT id, delta FROM pending WHERE status = 'ready' LIMIT 100")
for row in cursor.fetchall():  # Whatever the scan produced; varies across workers.
    ...

# Good: deterministic order, same across every worker.
cursor.execute(
    "SELECT id, delta FROM pending WHERE status = 'ready' ORDER BY id LIMIT 100"
)
for row in cursor.fetchall():
    ...

The lint-rule angle matters more than it sounds. Deadlock-ordering violations are almost impossible to catch in code review: two PRs that each look correct in isolation can introduce inconsistent ordering when combined. The check that actually works is structural: no direct writes to tables X, Y from anywhere except the repository. Once that invariant is enforced, the ordering invariant follows.

Multi-table transactions need the same rule applied to table order. A transaction that updates users then orders in one code path, and orders then users in another, can deadlock through the FK chain even with per-table row ordering. The rule generalizes: sort rows within a table, and sort tables within a transaction, by a stable convention the whole codebase agrees on (alphabetical is fine; just pick one).

Weigh the isolation-level trade-offs

The isolation level a workload runs under determines which deadlock categories even apply. Most MySQL deadlock incidents stem from REPEATABLE READ’s gap locks, a category that doesn’t exist on PostgreSQL or on MySQL at READ COMMITTED. Changing the isolation level is the single highest-leverage tuning lever, and also the one with the most potential to quietly break application correctness.

Dropping MySQL from REPEATABLE READ to READ COMMITTED. Under READ COMMITTED, InnoDB still takes row locks but skips most gap locks (they exist only for unique-key and FK enforcement on inserts). Most OLTP workloads don’t need REPEATABLE READ’s range-consistency guarantee. Most application code was designed around READ COMMITTED semantics anyway, because that’s what PostgreSQL and SQL Server default to. Teams migrating to READ COMMITTED typically see deadlock rates drop by an order of magnitude with no functional change.

Avoiding range locks on write paths. Independent of isolation level, replacing SELECT ... FOR UPDATE scans over ranges with point lookups by primary key removes the gap-lock surface entirely on the statements that do it. If a write path doesn’t need to lock “all orders for customer 5,” locking just the specific row it’s about to update is both faster and safer.

FK shared locks are shorter-lived under READ COMMITTED. The foreign-key shared-lock pattern from Part 1 (high-write child tables concentrating shared locks on hot parent rows) has a narrower window under READ COMMITTED simply because the lock lifespan is tied to the statement rather than the transaction. The cycle potential is still there, but the wait window is smaller.

Isolation change is a behavior change, not a tuning knob

READ COMMITTED eliminates most gap locks but also changes the visibility semantics of long-running transactions. Any code that relied on re-reading a row and getting the same result (transfer logic, inventory deductions, financial calculations) has to be re-examined. The safe migration is application-by-application, not database-wide. Run it in staging under production-like load and watch for subtle correctness regressions: “phantom” rows appearing inside a transaction that used to see a stable snapshot, inventory counts that shift mid-transaction, calculations that no longer match because an underlying row changed between reads.

Session-scoped change as a migration path. Both engines let you set isolation level per session or per transaction (SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED in MySQL, SET TRANSACTION ISOLATION LEVEL READ COMMITTED per transaction in PostgreSQL). The usual migration pattern is to start with the most contended code paths, move them to session-scoped READ COMMITTED, monitor for regressions, then expand the scope. A global flip from REPEATABLE READ to READ COMMITTED on a large, stable MySQL deployment is rarely the right first step.

Isolate hot rows off the contention path

When the top N deadlocks on a production system concentrate on a small set of rows (a counter, a config row, an AUTO_INCREMENT source of truth), retries don’t converge. Every retry hits the same row, takes the same lock, cycles with the same peers. The fix is removing the row from the hot path, not tuning the retry layer.

Three patterns that work:

Counter tables with sharding, for extreme write-hot counters only. Reach for this only when the counter is taking thousands of writes per second against a single row and the simpler options below aren’t viable. For anything less, the queue pattern below or an external store (Redis atomic INCR, a time-series DB) is almost always the better answer: less complexity, no schema overhead, no sum-across-rows read cost. Sharded counters are the specialized escalation, not the default.

When it does fit the workload: N physical shards per logical counter, keyed on a compact integer (counter_id, shard_id) composite. Application code picks the shard. Keeping the random choice out of SQL makes it portable across engines and testable independently:

CREATE TABLE counter_shards (
 counter_id BIGINT NOT NULL, -- FK to a counters metadata table if you need names
 shard_id SMALLINT NOT NULL, -- 0..N-1, fixed per-counter
 value BIGINT NOT NULL DEFAULT 0,
 PRIMARY KEY (counter_id, shard_id)
);

-- Seed the shards once per counter (e.g., when the counter is created):
INSERT INTO counter_shards (counter_id, shard_id, value)
VALUES (42, 0, 0), (42, 1, 0), (42, 2, 0), (42, 3, 0),
 (42, 4, 0), (42, 5, 0), (42, 6, 0), (42, 7, 0),
 (42, 8, 0), (42, 9, 0), (42, 10, 0), (42, 11, 0),
 (42, 12, 0), (42, 13, 0), (42, 14, 0), (42, 15, 0);

import random

# Increment: application picks the shard. Portable, cheap, no SQL-side RAND().
shard = random.randrange(16)
cursor.execute(
    "UPDATE counter_shards SET value = value + 1 "
    "WHERE counter_id = %s AND shard_id = %s",
    (42, shard),
)

# Read: sum across shards.
cursor.execute(
    "SELECT COALESCE(SUM(value), 0) FROM counter_shards WHERE counter_id = %s",
    (42,),
)

16 shards turn one hot row into 16 warm rows. Per-row contention falls roughly in proportion to the shard count. The read cost is one SUM across N rows instead of a single-row SELECT, usually acceptable for counter use cases; if not, cache the aggregate.

A common refinement is deriving the shard deterministically from the connection or worker ID (e.g., connection_id % 16) so each worker consistently hits the same shard. That improves InnoDB buffer-pool locality and reduces cross-shard interference, at the cost of slightly less even distribution if worker counts aren’t balanced.
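A minimal sketch of that refinement, assuming each worker has a stable integer ID; the helper names are illustrative. The second function makes the distribution caveat concrete by counting how many workers land on each shard.

```python
from collections import Counter
from typing import Iterable


def shard_for_worker(worker_id: int, shard_count: int = 16) -> int:
    """Map a stable worker (or connection) ID to a fixed shard."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    return worker_id % shard_count


def workers_per_shard(worker_ids: Iterable[int], shard_count: int = 16) -> Counter:
    """Count workers per shard to check how even the assignment is."""
    return Counter(shard_for_worker(w, shard_count) for w in worker_ids)
```

With 20 workers and 16 shards, four shards carry two workers each while the other twelve carry one, which is the imbalance to watch for when worker counts are not a multiple of the shard count.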

Advisory locks for app-level serialization. Both MySQL and PostgreSQL support advisory locks: named locks that exist outside the table model and don’t take row locks. For operations that need to be serialized at the application level (leader election, rate limiting, config migration), advisory locks are dramatically cheaper than row locks and take no row or table locks of their own. They are still locks, though: acquired in inconsistent order, or interleaved with row locks, they can end up in a wait cycle like any other.

-- PostgreSQL: advisory lock keyed by a bigint.
SELECT pg_advisory_xact_lock(hashtext('refresh_cache:customer_42'));
-- Lock released at transaction end. Only one worker per key runs at a time.

-- MySQL equivalent:
SELECT GET_LOCK('refresh_cache:customer_42', 10);
-- Returns 1 if acquired, 0 on timeout. Must explicitly RELEASE_LOCK.

The caveats: advisory locks are application-layer discipline; they don’t enforce anything the database checks. Use them where the application chooses to serialize, not where correctness requires it.
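PostgreSQL's advisory lock functions take a bigint key, while hashtext returns a 32-bit integer, so two distinct names collide more easily than the key space suggests. One option is to derive the key in the application instead. A minimal sketch, assuming SHA-256 truncated to 64 bits is acceptable for the workload; the function name is illustrative:

```python
import hashlib
import struct


def advisory_lock_key(name: str) -> int:
    """Derive a stable signed 64-bit key from a lock name.

    The first 8 bytes of the SHA-256 digest are read as a big-endian signed
    integer, which fits the bigint argument of pg_advisory_xact_lock.
    """
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    (key,) = struct.unpack(">q", digest[:8])
    return key
```

The resulting integer would be passed as a bound parameter, for example SELECT pg_advisory_xact_lock(%s) with advisory_lock_key('refresh_cache:customer_42'), and every service computing the key the same way agrees on the lock.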

Queue patterns instead of direct updates. For counter-like workloads, write intents to an append-only table and aggregate periodically:

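INSERT INTO counter_events (counter_key, delta, created_at) VALUES (?, ?, NOW());
-- No contention: every insert creates a new row.

-- Periodic aggregation job (PostgreSQL). Claiming the rows and summing them in
-- one statement keeps a re-run from counting the same events twice.
WITH batch AS (
  UPDATE counter_events SET processed = TRUE
  WHERE processed = FALSE
  RETURNING counter_key, delta
)
INSERT INTO counter_totals (counter_key, total)
SELECT counter_key, SUM(delta) FROM batch
GROUP BY counter_key
ON CONFLICT (counter_key) DO UPDATE SET total = counter_totals.total + EXCLUDED.total;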

Trades real-time accuracy for throughput. The right trade-off for page-view counters, metric accumulation, any workload where eventual consistency is acceptable.

Hot parent rows behind FK chains. Part 1 described how a high-write child table concentrates shared locks on a hot parent row, and how any exclusive-lock request on that parent (a name update, a soft-delete, a trigger-driven counter update) becomes a contention point. Two levers that work:

Move high-frequency parent-row updates to a side table. The last_activity_at timestamp, the cached counter, the updated_at that a trigger bumps on every child insert; none of these need to live on the parent table. Moving them to customer_activity(customer_id, last_seen_at) or customer_counters(customer_id, order_count) eliminates the exclusive-lock contention entirely. The parent row stops changing on hot paths, the shared locks from FK checks coexist fine, and the cycle potential disappears.
Narrow the FK scope where integrity can tolerate it. Not every child table needs an enforced FK to every parent. Logs, events, and audit tables are often the biggest offenders, and often have the least need for strict integrity (an orphaned log row is rarely a correctness problem). Dropping the FK removes the shared-lock dependency entirely. This trades integrity for throughput, a decision that belongs with the team owning the data, not a default.

Under READ COMMITTED, both levers matter less because the FK shared locks release at statement end rather than transaction end. A workload that runs on REPEATABLE READ and can’t change isolation level (because of application semantics) gets the most benefit from these two fixes.

Shorten long-running transactions

The longer a transaction holds locks, the wider the window for a cycle. Two patterns recur in production:

Application-layer long transactions. A transaction that opens at request start, makes several queries, calls an external API, then commits. The external call is where the transaction actually spends its time: seconds of open transaction holding row locks the whole time. Every concurrent transaction that touches those rows waits. Deadlock probability scales with transaction duration. The fix is the inverse of the outbox pattern: do the external call outside the transaction, passing in any needed IDs.

Idle-in-transaction sessions. A session that runs BEGIN, some writes, then stalls; idle but not committed. In PostgreSQL, this blocks vacuum on touched tables, bloats MVCC, and holds locks indefinitely. pg_stat_activity shows state = 'idle in transaction'. MySQL’s equivalent is a thread with an open transaction and no current query.

PostgreSQL has a first-class timeout for this; MySQL does not:

-- PostgreSQL: kill idle-in-transaction sessions after 5 minutes.
-- (Units required; bare integer would be interpreted as milliseconds.)
SET idle_in_transaction_session_timeout = '5min';

MySQL has no direct equivalent. wait_timeout and interactive_timeout govern idle connections, not sessions idle inside an open transaction. A connection that did BEGIN then stopped sending queries will hold its locks until the connection drops or the client commits. The production workaround is either a watchdog script (e.g., Percona’s pt-kill) that polls information_schema.innodb_trx and terminates transactions exceeding a duration threshold, or a connection pool with per-connection transaction lifetimes. Connection pools that acquire a connection, start a transaction, then return the connection to the pool without committing (rare but real) will produce sessions that live indefinitely otherwise.
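The decision logic of such a watchdog is small enough to sketch. This is an illustration, not a replacement for pt-kill: it assumes the rows have already been fetched as (trx_mysql_thread_id, trx_started) pairs from information_schema.innodb_trx, and the five-minute default threshold is an arbitrary choice for the example.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def kill_statements(trx_rows: Iterable[Tuple[int, datetime]], now: datetime,
                    max_age: timedelta = timedelta(minutes=5)) -> List[str]:
    """Return KILL statements for transactions older than `max_age`, oldest first.

    `trx_rows` holds (trx_mysql_thread_id, trx_started) pairs; the thread id is
    the connection id that MySQL's KILL statement expects.
    """
    victims = [
        (started, thread_id)
        for thread_id, started in trx_rows
        if now - started > max_age
    ]
    return [f"KILL {thread_id}" for _started, thread_id in sorted(victims)]
```

Keeping the selection separate from the execution makes the threshold testable and lets the watchdog log what it would kill before it is allowed to act.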

Finding long-running transactions:

-- PostgreSQL
SELECT pid, usename, state, xact_start, now() - xact_start AS duration, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL AND now() - xact_start > INTERVAL '30 seconds'
ORDER BY duration DESC;

-- MySQL
SELECT trx_id, trx_started, trx_mysql_thread_id, trx_rows_locked, trx_query
FROM information_schema.innodb_trx
WHERE TIMESTAMPDIFF(SECOND, trx_started, NOW()) > 30;

Alerting on any transaction exceeding 30s in an OLTP workload catches most of the long-transaction-induced deadlocks before they produce incidents.

Audit triggers and cascades for hidden locks

A trigger that updates a second table on every write to the first adds an edge to the wait-for graph that isn’t visible in the original query. ON DELETE CASCADE foreign keys behave similarly: one delete can take locks on every child row in the cascade, and if the cascade order differs between two concurrent deletes, they can deadlock through tables neither statement directly referenced.

This is the origin of the “why is my DELETE FROM users deadlocking against an INSERT INTO events?” question. The DELETE triggered a cascade to user_preferences, which had a trigger that updated a counter in tenants, which was locked by the INSERT. Four tables in the cycle, two in the application’s explicit query, zero mention of the other two in any log entry until someone reads the DDL.

The operational pattern: when a deadlock log mentions a table the application’s code doesn’t explicitly reference, check (1) FK cascades on the tables that are in the query, (2) triggers on those tables, (3) generated columns that fire on update. All three are non-obvious lock sources, all three are fixable, but only after they’re identified.

Set innodb_autoinc_lock_mode deliberately

MySQL InnoDB’s AUTO_INCREMENT column has its own lock, historically a source of contention and occasional deadlock. The innodb_autoinc_lock_mode parameter controls the behavior:

Mode 0 (traditional). Table-level AUTO-INC lock held for the duration of the statement. Serialized across inserts. Safe for statement-based replication, terrible for concurrency.
Mode 1 (consecutive). A lighter lock for simple inserts (single-row or known-row-count), and the traditional table lock for bulk inserts (INSERT ... SELECT). Was the default in 5.7.
Mode 2 (interleaved). No AUTO-INC table lock; IDs are assigned per-row as needed, possibly interleaved across concurrent statements. Default in MySQL 8.0. Fastest, and correct for row-based replication (which is also the 8.0 default).

The mode-2 default in 8.0 eliminated a substantial source of historical deadlocks and contention. Bulk inserts that used to serialize on the AUTO-INC lock now proceed in parallel. If you’re migrating from 5.7 to 8.0, this is a free win. If you’re still on binlog_format = STATEMENT (uncommon but not unheard of in legacy deployments), you cannot safely run mode 2; the replica may generate different IDs than the source, corrupting the data. Switch to binlog_format = ROW first, then adopt mode 2.

Guard DDL migrations with lock_timeout

Online schema change isn’t deadlock-prone in the classical sense, but it interacts with deadlocks in a specific operational way: DDL takes heavy locks that queue behind ongoing DML, and while the DDL waits, every subsequent query on that table queues behind the DDL. In PostgreSQL, a DDL taking ACCESS EXCLUSIVE that waits for an existing long-running SELECT will cause every new SELECT to wait behind the DDL. The system grinds to a halt, and application logs fill with timeout errors that look like deadlocks but aren’t. It’s a queue, not a cycle.

The standard prevention idiom for PostgreSQL migrations:

SET lock_timeout = '2s';
ALTER TABLE orders ADD COLUMN notes TEXT;

If the ALTER can’t acquire its lock in 2 seconds, it fails instead of queueing. The migration tool catches the error and retries with backoff. This prevents the queue-behind-DDL outage entirely; the cost is that some migrations need multiple attempts to land, which is almost always the right trade-off.
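The catch-and-retry step might look like the sketch below. It rests on several assumptions: execute is a caller-supplied callable that runs one statement in autocommit mode, the driver exposes the SQLSTATE on the exception as a sqlstate attribute (as psycopg 3 does), and the attempt count and 30-second backoff ceiling are arbitrary. A lock_timeout expiry is reported by PostgreSQL as SQLSTATE 55P03 (lock_not_available).

```python
import random
import time

LOCK_NOT_AVAILABLE = "55P03"  # SQLSTATE PostgreSQL raises when lock_timeout expires


def run_ddl_with_retry(execute, ddl: str, attempts: int = 5,
                       lock_timeout: str = "2s", sleep=time.sleep) -> int:
    """Run `ddl` under lock_timeout, retrying only on lock timeouts.

    Returns the number of attempts used. `lock_timeout` must come from trusted
    configuration because SET does not accept bound parameters.
    """
    for attempt in range(attempts):
        try:
            execute(f"SET lock_timeout = '{lock_timeout}'")
            execute(ddl)
            return attempt + 1
        except Exception as exc:
            if getattr(exc, "sqlstate", None) != LOCK_NOT_AVAILABLE:
                raise  # not a lock timeout: a real migration error
            if attempt == attempts - 1:
                raise  # budget exhausted: surface the failure to the operator
            sleep(random.uniform(0, min(30.0, 2 ** attempt)))
    raise ValueError("attempts must be at least 1")
```

Only the lock timeout is retried; any other error propagates immediately so a broken migration fails on the first attempt instead of the fifth.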

MySQL’s equivalent tooling is pt-online-schema-change (Percona) and gh-ost (GitHub). Both create a copy of the table, stream writes to both via trigger or binlog, and swap at the end. They run concurrent DML against the original and the copy, so they inflate deadlock rates during the migration window: not because the tool is buggy, but because there are now more transactions touching the same rows. The operational practice: run migrations at low-traffic windows, watch the deadlock counter during the run (not just replica lag), and have a rollback path ready.

DDL inside transactions is engine-dependent

PostgreSQL supports transactional DDL: BEGIN; ALTER TABLE ...; COMMIT; is atomic. MySQL does not; every DDL statement implicitly commits the current transaction. A migration script that assumes it can roll back mid-migration works on PostgreSQL and silently half-applies on MySQL. Know which engine you’re writing migrations for.

Reach for NOWAIT and SKIP LOCKED

Both engines support two SQL-level concurrency primitives that remove the need for application-layer deadlock handling in specific patterns:

SELECT ... FOR UPDATE NOWAIT. If the row is locked by another transaction, fail immediately with an error instead of waiting. Useful for user-facing paths where “I can’t get this resource right now” is a better UX than “wait 500ms and maybe deadlock anyway.” Also useful for detecting lock contention synthetically in tests.
SELECT ... FOR UPDATE SKIP LOCKED. If rows are locked by another transaction, skip them and return only rows the current transaction can lock. Transforms a contended queue-processor pattern into a lock-free one: N workers each grab a different set of rows, zero contention, zero deadlocks.

-- Queue processor: deadlock-free, contention-free.
SELECT * FROM jobs
WHERE status = 'pending'
ORDER BY priority, created_at
LIMIT 10
FOR UPDATE SKIP LOCKED;

-- Fast-fail acquisition: don't wait, fail now.
SELECT * FROM leader_election
WHERE resource_id = 'cache-refresh'
FOR UPDATE NOWAIT;

SKIP LOCKED arrived in PostgreSQL 9.5 and MySQL 8.0. Before those versions, queue-processor patterns required either advisory locks or application-level coordination (Redis, Zookeeper). Post-SKIP LOCKED, they can live entirely in the database with a single primitive. For any workload where workers pull from a shared queue, this is the pattern, not retry loops on FOR UPDATE.

Monitor the metrics that catch regressions

The single most useful metric is deadlock rate over time. Not error rate, not retry rate; the raw count of deadlocks per minute or per thousand transactions. A workload with 0.1 deadlocks per thousand transactions is healthy; 10 per thousand is a paging threshold; 100 per thousand means retries aren’t converging and something is structurally wrong.

For MySQL: there’s no Innodb_deadlocks status variable. The correct source is performance_schema.events_errors_summary_global_by_error, which is enabled by default:

-- Cumulative deadlock count (compare over time windows).
SELECT SUM_ERROR_RAISED AS deadlock_count
FROM performance_schema.events_errors_summary_global_by_error
WHERE ERROR_NAME = 'ER_LOCK_DEADLOCK';

-- Lock-wait activity (useful for adjacent contention, NOT a deadlock counter):
SHOW GLOBAL STATUS LIKE 'Innodb_row_lock_waits';

-- Plus the error log, searchable for "LATEST DETECTED DEADLOCK"
-- once innodb_print_all_deadlocks=ON.

Innodb_row_lock_waits is commonly misread as a deadlock counter; the manual defines it as “the number of times operations on InnoDB tables had to wait for a row lock,” which is contention in general. Use it alongside the events_errors_summary query, not in place of it.

For PostgreSQL:

-- pg_stat_database exposes per-database deadlock counter.
SELECT datname, deadlocks, xact_commit, xact_rollback
FROM pg_stat_database
WHERE datname = current_database();

Scrape both into Prometheus (mysqld_exporter and postgres_exporter both expose these), compute the rate, alert on sustained rises. Pair the deadlock rate with a retry rate from the application layer: a spike in one without the other means either the retry logic is broken or the workload shape changed. A spike in both means a real regression.
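Because the counters are cumulative, the rate comes from the difference between two samples. A minimal sketch of that arithmetic for the pg_stat_database columns queried above; the DbSample type and the label names are assumptions for the example, and only the 10 and 100 per-thousand cut points come from the thresholds stated earlier.

```python
from typing import NamedTuple


class DbSample(NamedTuple):
    """One reading of pg_stat_database for a single database."""
    deadlocks: int
    xact_commit: int
    xact_rollback: int


def deadlocks_per_thousand(prev: DbSample, curr: DbSample) -> float:
    """Deadlocks per thousand transactions between two cumulative samples."""
    transactions = (curr.xact_commit - prev.xact_commit) + (
        curr.xact_rollback - prev.xact_rollback
    )
    if transactions <= 0:
        return 0.0  # idle window or a statistics reset; nothing to report
    return 1000.0 * (curr.deadlocks - prev.deadlocks) / transactions


def classify(rate: float) -> str:
    """Map a rate to an action; the labels are illustrative."""
    if rate >= 100:
        return "structural"  # retries are not converging
    if rate >= 10:
        return "page"
    return "ok"
```

For example, 15 new deadlocks across 50,000 transactions works out to 0.3 per thousand, well under the paging threshold.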

Beyond the rate itself, the top-K pairs of statements involved in deadlocks (extracted from innodb_print_all_deadlocks logs or PG’s deadlock log entries) identify exactly which code paths are fighting. This list rarely changes: the same two or three patterns account for most deadlocks in any given system. Fix those and the rate drops by an order of magnitude.
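Once the statement pairs have been pulled out of the logs, ranking them is a few lines. A minimal sketch, assuming log parsing has already produced (statement, statement) tuples; the literal-stripping rules in normalize are deliberately crude and would need tuning for real query text.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def normalize(sql: str) -> str:
    """Collapse literals and whitespace so statements group by shape."""
    sql = re.sub(r"'[^']*'", "?", sql)   # string literals
    sql = re.sub(r"\b\d+\b", "?", sql)   # standalone numeric literals
    return re.sub(r"\s+", " ", sql).strip().lower()


def top_pairs(deadlocks: Iterable[Tuple[str, str]],
              k: int = 3) -> List[Tuple[Tuple[str, str], int]]:
    """Return the k most frequent statement-shape pairs with their counts."""
    counts: Counter = Counter()
    for first, second in deadlocks:
        # Sort within the pair so (A, B) and (B, A) count as the same fight.
        pair = tuple(sorted((normalize(first), normalize(second))))
        counts[pair] += 1
    return counts.most_common(k)
```

Sorting inside each pair matters: which statement the log lists first depends on which transaction was chosen as the victim, not on which code path is at fault.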

The mental model for Part 2

Part 1’s patterns answer why deadlocks happen. Part 2’s operations answer what to do about them, and the useful reframe is that the answer is almost never “tune the retry logic.” Retries are the recovery mechanism that keeps the application running while the actual fix lands. The actual fix is almost always one of:

Identify the pattern from the log. This is step zero; skipping it means you’re tuning blind.
Enforce consistent lock ordering at the access-layer level. Highest-leverage fix for the lock-ordering pattern; deterministically eliminates the cycle rather than shrinking its window.
Change the code path to use SKIP LOCKED / NOWAIT where the pattern matches (queue processors, resource acquisition).
Isolate hot rows (counter tables, shards, advisory locks, queue patterns; move high-frequency parent-row updates to side tables).
Shorten transactions. Move external calls out, enforce idle-transaction timeouts.
Drop isolation level where the workload allows it; session-scoped first, global only after regression testing. Eliminates the gap-lock category on MySQL.
Remove cascades/triggers from the hot path when they’re the hidden lock source.
Handle SERIALIZABLE’s 40001 as a normal event if you’re on SERIALIZABLE, and don’t confuse it with 40P01.
Plan DDL windows with lock_timeout and watch the deadlock counter through the migration.

Retries let the application survive while the fix is in flight. Monitoring tells you which fix to prioritize. Each of the above removes a category from the workload entirely. The goal is a system where the few remaining deadlocks are rare enough that the retry layer handles them invisibly and the team’s attention can go elsewhere. Not zero (that’s a theoretical fiction at realistic concurrency), but managed.

Database Deadlocks, Part 1: The Patterns

Thu, 13 Feb 2025 00:00:00 +0000

TL;DR

A deadlock is two transactions each holding a lock the other needs, caught in a cycle the engine breaks by killing one. The patterns are finite and repeatable: inconsistent lock ordering across workers, InnoDB gap locks under REPEATABLE READ, foreign-key shared locks on hot parent rows, unique-index conflicts, index-scan lock amplification, and parallelism patterns that only surface on replicas or under worker-pool load. This post is the patterns. Part 2 covers diagnosis, retry architecture, and prevention.

Deadlocks occupy a strange place in production operations. They’re rare enough that most engineers haven’t thought hard about them, frequent enough in high-concurrency workloads to show up as paging incidents, and subtle enough that the first instinct (“just retry”) is right often enough to keep the root cause hidden. The transaction that got killed was syntactically perfect. The one that survived was too. The bug wasn’t in either statement; it was in the order the two transactions touched rows.

That makes deadlocks harder to reason about than most database failures. The query text in the error log isn’t wrong. The lock it was waiting for isn’t held by a misbehaving process. The system is doing exactly what concurrency control says it should. The failure mode is the interaction between transactions, and those interactions are almost never visible from any single query.

This post covers the patterns: the shapes deadlocks take and why each one exists. The companion post covers reading the deadlock log end-to-end, retry architecture, hot-row isolation, SERIALIZABLE’s serialization failures, DDL migration windows, and NOWAIT / SKIP LOCKED as prevention primitives.

What a deadlock actually is

A deadlock is a cycle in the wait-for graph. Transaction A holds lock L1 and needs L2; transaction B holds L2 and needs L1. Neither can proceed. The only way out is to kill one: pick a victim, roll it back, release its locks, and let the other complete. Every modern relational database does this automatically: InnoDB checks for a cycle as soon as a transaction has to wait, and PostgreSQL runs its detector once a wait has lasted deadlock_timeout (1 second by default).

The three preconditions are always the same:

Two or more transactions hold locks.
Each needs a lock the other holds.
The locks can’t be acquired atomically (there’s no single “grab both or grab nothing” operation).

Remove any one and the deadlock can’t form. In practice, that’s the shape of every prevention strategy: reduce the number of locks held concurrently, reduce the duration they’re held, or make the acquisition order consistent across all code paths.
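
The wait-for graph framing is easy to make concrete. The sketch below is an illustration only, not how either engine implements its detector: find_cycle is a hypothetical helper that takes a mapping from each waiting transaction to the transactions it waits on and returns one cycle if there is one.

```python
def find_cycle(wait_for):
    """Return one cycle in a wait-for graph as a list of transactions, or None.

    wait_for maps each waiting transaction to the transactions it waits on.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(txn):
        color[txn] = GREY
        path.append(txn)
        for holder in wait_for.get(txn, ()):
            state = color.get(holder, WHITE)
            if state == GREY:
                # holder is already on the current path: the path loops back.
                return path[path.index(holder):] + [holder]
            if state == WHITE:
                found = visit(holder)
                if found:
                    return found
        path.pop()
        color[txn] = BLACK
        return None

    for txn in wait_for:
        if color.get(txn, WHITE) == WHITE:
            found = visit(txn)
            if found:
                return found
    return None
```

With {"A": ["B"], "B": ["A"]} the helper reports the A, B, A cycle; with {"A": ["B"], "B": ["C"]} it reports nothing, because that is a queue rather than a cycle.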

Deadlock vs. lock wait timeout

A deadlock is a cycle. A lock wait timeout is a long queue: transaction A is waiting for transaction B, which is waiting for something reasonable, which is taking too long. No cycle, no victim selection, just a timer expiring. Both produce errors that look similar in application logs, but they’re entirely different failure modes with different fixes. innodb_lock_wait_timeout (MySQL) and lock_timeout (PostgreSQL) govern the second. The deadlock detector is a separate mechanism that fires independently of those timers.

The canonical lock-ordering deadlock

The single most common production deadlock is two transactions updating the same two rows in opposite orders:

-- Transaction A
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;

-- Transaction B (concurrent)
BEGIN;
UPDATE accounts SET balance = balance - 50 WHERE id = 2;
UPDATE accounts SET balance = balance + 50 WHERE id = 1;
COMMIT;

If A and B interleave such that A takes the row-level lock on row 1, B takes the row-level lock on row 2, and each then tries to grab the other row, the engine has a cycle. One gets killed.

sequenceDiagram
 participant A as Transaction A
 participant R1 as Row id=1
 participant R2 as Row id=2
 participant B as Transaction B

 Note over A,B: t=0, both transactions begin
 A->>R1: UPDATE (acquire X-lock)
 R1-->>A: granted
 B->>R2: UPDATE (acquire X-lock)
 R2-->>B: granted

 Note over A,B: t=1, each reaches for the other's row
 A->>R2: UPDATE (request X-lock)
 R2--xA: BLOCKED (held by B)
 B->>R1: UPDATE (request X-lock)
 R1--xB: BLOCKED (held by A)

 Note over A,B: Wait-for graph has a cycle: A → B → A
 Note over A,B: Detector fires; victim: Transaction B
 B->>B: ROLLBACK, release R2 lock
 R2-->>A: now granted
 A->>A: COMMIT

The key property is that neither transaction is wrong in isolation. Each acquires locks in an order that’s locally correct. The cycle forms in the global ordering across concurrent sessions, which no single query can see. That’s the defining shape of the pattern: correct code, in both cases, interacting at the transaction boundary. The fix is in how all code paths agree on an ordering, covered in Part 2: Consistent lock ordering.
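
A minimal sketch of what agreeing on an ordering means for the transfer above, assuming a driver that uses %s placeholders. transfer_statements is a hypothetical helper, not an API from either engine: it emits the two UPDATEs lowest account id first, whichever direction the money moves.

```python
def transfer_statements(src_id, dst_id, amount):
    """Return the two UPDATEs for a transfer, always lowest account id first."""
    if src_id == dst_id:
        raise ValueError("source and destination must differ")
    # Signed deltas keyed by account id, so sorting the keys fixes the lock order.
    deltas = {src_id: -amount, dst_id: amount}
    sql = "UPDATE accounts SET balance = balance + %s WHERE id = %s"
    return [(sql, (deltas[account_id], account_id)) for account_id in sorted(deltas)]
```

Transaction A (1 to 2) and Transaction B (2 to 1) now both lock row 1 before row 2, so one simply waits for the other and no cycle can form between them.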

InnoDB gap locks turn inserts into deadlock sources

MySQL’s default isolation level is REPEATABLE READ, and under that isolation level, InnoDB takes next-key locks: a row lock plus a gap lock on the range before it. The gap lock prevents other transactions from inserting into that range, which is how REPEATABLE READ keeps range queries consistent across re-execution.

The consequence: a SELECT ... FOR UPDATE or an UPDATE with a range predicate locks not just the matching rows, but the gaps between them. Two concurrent transactions that both try to insert into the same gap can deadlock without sharing a single row:

-- Table: orders(id BIGINT PK, customer_id BIGINT, amount_cents BIGINT)
-- Index on customer_id
-- Existing rows: customer_id = 5 has orders with ids 100, 200, 300

-- Transaction A
BEGIN;
SELECT * FROM orders WHERE customer_id = 5 FOR UPDATE;
-- Takes next-key locks covering ids 100, 200, 300 AND the gaps between them,
-- AND the gap after 300 extending to the next customer_id.

-- Transaction B (concurrent)
BEGIN;
INSERT INTO orders (id, customer_id, amount_cents) VALUES (250, 5, 10000);
-- Blocks: tries to insert into a gap A has locked.

-- Transaction A
INSERT INTO orders (id, customer_id, amount_cents) VALUES (150, 5, 5000);
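-- Deadlock if B already holds a lock A's insert needs, for example a gap lock of its own
-- on the same index range: gap locks coexist, but each blocks the other session's INSERT.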

Two transactions inserting into what looks like “different rows” can still cycle through gap locks. The failure mode is especially insidious because the EXPLAIN plan doesn’t mention gaps - only rows - and the lock information in SHOW ENGINE INNODB STATUS requires reading the next-key notation carefully.

The category is peculiar to InnoDB under REPEATABLE READ. PostgreSQL prevents phantom reads through MVCC snapshot isolation starting at its own REPEATABLE READ level (stricter than the SQL standard requires) without any range-locking mechanism, so the whole class of gap-lock deadlocks doesn’t exist on PG, at any isolation level. On MySQL, switching to READ COMMITTED is the first lever most teams reach for once this pattern dominates their deadlocks: gap locks are disabled for most searches and index scans but retained for foreign-key and duplicate-key checking, so the change shrinks the gap-lock category without eliminating it entirely. The isolation-level trade-off and the “avoid range locks on write paths” refactor both live in Part 2: Isolation-level trade-offs.

Unique-index deadlocks are a category of their own

The detailed patterns are covered in Uniqueness and Selectivity, but the shape worth naming here: when InnoDB detects a duplicate-key error on an INSERT, it acquires a shared lock on the conflicting index record before raising the error. Under REPEATABLE READ that shared lock is next-key (record + gap). Under READ COMMITTED, gap locks are mostly disabled, but duplicate-key checking is one of the documented exceptions where gap locking still occurs, so dropping isolation alone doesn’t eliminate the category. Two concurrent transactions inserting toward the same unique key end up holding shared locks and waiting for each other to release: a deadlock caused entirely by the uniqueness check, not by the rows the application thought it was writing.

INSERT ... ON DUPLICATE KEY UPDATE behaves differently: on conflict it takes an exclusive lock instead of a shared one, because the statement is about to modify the row. This matters for reasoning about cycles. Two concurrent ODKU statements contend on exclusive locks (mutually exclusive, one always waits), whereas two concurrent plain INSERTs can both hold shared locks at once and then deadlock when either tries to upgrade. Blog posts and older documentation sometimes conflate the two; the locking rules are documented in the MySQL reference: Locks Set by Different SQL Statements in InnoDB.

The equivalent in PostgreSQL is less severe (the duplicate-key check doesn’t hold long-lived shared locks the same way) but INSERT ... ON CONFLICT with multiple unique indexes can still produce deadlocks when batches touch overlapping keys in different orders. The shape is the same across engines: the uniqueness check itself is what forces the extra locking, and the cycle forms when two sessions approach the same key from different batches.
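
One way to take the "different orders" out of the picture is to normalize each batch before it reaches the database. The sketch below is a hypothetical helper (prepare_upsert_batch, with an assumed sku key column), not a driver feature: it collapses duplicate keys so the last write wins and sorts by the unique key, so every batch approaches overlapping keys in the same order.

```python
def prepare_upsert_batch(rows, key="sku"):
    """Collapse duplicate keys (last write wins) and sort by the unique key."""
    latest = {}
    for row in rows:
        latest[row[key]] = row
    return [latest[k] for k in sorted(latest)]
```

Sorting by one key only fixes the order on that index. When a table has several unique indexes, two batches can still meet in opposite orders on the second one, so this narrows the window rather than closing it.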

Foreign keys take shared locks you didn’t ask for

Both MySQL and PostgreSQL acquire shared locks on the referenced row when you insert or update a row with a foreign key. The purpose is to prevent the referenced row from being deleted mid-transaction; you can’t have an orders.customer_id pointing to a customers.id that’s being concurrently deleted.

The side effect is that a high-write child table concentrates shared locks on hot parent rows:

-- customers has id = 42 (a frequently-used customer)
-- Many concurrent transactions inserting orders for customer 42:

INSERT INTO orders (customer_id, amount_cents) VALUES (42, 1000);
-- Takes a shared lock on customers(id=42)

Shared locks don’t block each other, so concurrent inserts coexist fine. What breaks is the interaction with any transaction that wants an exclusive lock on the parent row: an update to the customer’s name, a soft-delete, a trigger that updates a cached counter. Suddenly, dozens of shared-lock holders are blocking one exclusive-lock request, and if any of them start trying to acquire other locks (say, through a trigger cascade), a cycle can form.

The symptoms: deadlocks that mention tables far removed from the one the application thought it was touching. “Why is my UPDATE customers deadlocking against an INSERT INTO order_items?” Because the order_items insert took a shared lock on customers through the FK chain, and the UPDATE wanted exclusive on the same row.

This is one of the hardest patterns to diagnose on sight, because the offending query never references the contended table explicitly. Mitigations (narrowing FK scope, moving hot parent-row updates to side tables, isolation-level trade-offs) are covered in Part 2: Hot-row isolation.

Index scans lock more rows than queries return

Under InnoDB’s default REPEATABLE READ, an UPDATE with a WHERE clause on a non-indexed column locks every row it scans (exclusive next-key locks, so the gaps between rows are covered too), not just the ones that match. The engine has to examine each row to check the predicate, and it takes a lock to guarantee the check is stable for the duration of the transaction.

-- Without an index on status, under REPEATABLE READ:
UPDATE orders SET priority = 'high' WHERE status = 'pending';
-- Locks every row in orders during the scan.

If the table has a million rows and only a thousand match, all million get locked for the duration of the update. Any concurrent transaction touching any of those rows has to wait, which inflates the wait-for graph and makes deadlocks more likely.

Under READ COMMITTED, InnoDB narrows this substantially: per the docs, it releases locks on non-matching rows after the WHERE evaluation and uses semi-consistent reads, returning the latest committed version of an already-locked row so the engine can check whether it matches the WHERE before deciding to wait. The net effect is much lower lock footprint and deadlock risk on the same query. PostgreSQL behaves similarly by default: only rows actually updated retain their locks. This is one of the few cases where the same underlying issue (an unindexed predicate) shows up as both a latency problem and a concurrency problem, and where the concurrency angle is specifically a REPEATABLE READ-on-InnoDB amplifier.

Secondary index locks on InnoDB

InnoDB takes locks on both the clustered index (primary key) and any secondary indexes touched by the query. A WHERE status = 'pending' using a status index locks the relevant index entries and the corresponding PK entries. Transactions that approach the same rows from different indexes (one via status, another via customer_id) can deadlock on the PK-side lock even though their index-side locks don’t overlap. This is the most common “why are these two queries deadlocking, they don’t even reference the same columns?” failure mode.

Parallelism-induced deadlocks

The lock-ordering patterns above assume two separate transactions from two separate sessions. Parallelism adds a few variants that don’t fit that frame; the cycle can form inside a single logical unit of work, or show up only on a replica that never issued the original statements.

Worker pools racing on a shared queue. The archetypal production pattern: N application workers pulling jobs from the same table (jobs, outbox, email_queue) and locking rows for processing. If every worker does SELECT ... FOR UPDATE on “the next available batch” without a deterministic ordering, two workers can grab overlapping row sets in opposite orders and cycle. This is the same lock-ordering cycle from earlier, distributed across workers that all look identical from a code-review perspective.

Intra-query parallel workers. PostgreSQL has a full parallel query executor (parallel sequential scans, bitmap heap scans, index and index-only scans (B-tree), parallel aggregates, parallel joins) that spawns worker processes to cooperate on a single query. MySQL has a much narrower feature: innodb_parallel_read_threads (added in 8.0.14) enables parallel scanning of the clustered index, used initially by CHECK TABLE and extended to unconditional SELECT COUNT(*) in 8.0.17. It is not general parallel query; MySQL does not parallelize arbitrary SELECTs, joins, or aggregates. In both engines, workers coordinating on a single query don’t deadlock among themselves in normal operation; the engine manages the shared lock state. What can happen is a parallel worker holds a lock an unrelated transaction needs, and the parallel query itself takes longer than a serial one would, widening the wait window. Usually not a direct deadlock source, but it changes the timing of existing ones.

Parallel replication on replicas. MySQL’s multi-threaded replica applies committed transactions in parallel. Transactions that committed serially on the source (no possibility of deadlock there) can deadlock on the replica because the applier threads are racing on rows the source never had concurrent writers on. The replica’s deadlock detector resolves them the same way it would a live deadlock, but they show up in the replica’s error log with no corresponding entry on the source. Since MySQL 8.0.27, replica_parallel_type=LOGICAL_CLOCK and replica_parallel_workers=4 are the defaults, and replica_parallel_type was deprecated in 8.0.29; LOGICAL_CLOCK is effectively the only supported mode going forward. The slave_* → replica_* rename happened in 8.0.26; older deployments and blog posts still use the legacy names. PostgreSQL 16 introduced parallel apply for logical replication (streaming = parallel on CREATE SUBSCRIPTION; opt-in in 16 and 17, the default from 18), which exposes the same class of apply-side cycles on a setup that historically didn’t have them: a surprise for teams upgrading from 15 and earlier.

Parallel/online DDL interacting with DML. Tools like pt-online-schema-change and gh-ost run concurrent DML against the table being altered (through triggers or a row-copy process). Under load, the trigger-installed writes and the copy process can both take locks on the same rows the application is updating, and the wait-for graph gains edges that wouldn’t exist during a normal workload. This rarely manifests as a hard deadlock (the tools are written defensively) but it does manifest as elevated deadlock rates during the migration window.

None of these are properties of the queries themselves. They’re properties of how work gets distributed across workers, processes, or replicas, which means they’re invisible to query-level review and only surface when the deadlock counter is watched over time. The primitives for fixing them (SKIP LOCKED, NOWAIT, advisory locks, DDL timeouts) are covered in Part 2: NOWAIT and SKIP LOCKED.
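
The worker-pool fix is easier to see in miniature. The sketch below is an in-memory analogy, not database code: JobTable is a hypothetical stand-in with one threading.Lock per row, and a non-blocking acquire plays the role of SKIP LOCKED. Workers walk the rows in id order and never wait on a row someone else holds, so there is nothing for a cycle to form on.

```python
import threading


class JobTable:
    """In-memory stand-in for a jobs table with one lock per row."""

    def __init__(self, job_ids):
        # Sorted once so every worker walks the rows in the same order.
        self._locks = {job_id: threading.Lock() for job_id in sorted(job_ids)}

    def claim_batch(self, limit):
        """Claim up to `limit` unlocked jobs in id order, skipping locked ones."""
        claimed = []
        for job_id, lock in self._locks.items():
            if len(claimed) == limit:
                break
            # Non-blocking acquire is the SKIP LOCKED analogue: never wait.
            if lock.acquire(blocking=False):
                claimed.append(job_id)
        return claimed

    def release(self, job_ids):
        """Release claimed jobs, standing in for COMMIT or ROLLBACK."""
        for job_id in job_ids:
            self._locks[job_id].release()
```

Two consecutive claim_batch(3) calls on six jobs return disjoint sets, and a third returns an empty batch instead of blocking. That empty batch is the trade-off: a worker may find nothing to do while rows are locked, which is acceptable for a queue and usually not for anything that needs a specific row.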

Engine-level differences that shape the patterns

The same pattern can deadlock on one engine and not the other. These differences are pattern-shaping; they change which of the above sections apply to your workload. Operational tuning (detector cost, wait timeouts, monitoring) is covered in Part 2: Monitoring.

Default isolation. PostgreSQL defaults to READ COMMITTED. MySQL defaults to REPEATABLE READ (with gap locks). The same application code has measurably different deadlock rates between the two because of this alone, before any other tuning.
Gap locks. Only InnoDB has them, and only under REPEATABLE READ (plus the foreign-key and duplicate-key exceptions that retain gap locking even under READ COMMITTED). PostgreSQL prevents phantom reads through MVCC at its own REPEATABLE READ (stricter than the SQL standard requires) without a range-locking mechanism, so the entire gap-lock deadlock category doesn’t exist on PG at any isolation level.
Lock granularity. PostgreSQL takes row-level (tuple) locks; InnoDB takes record locks on index entries (with next-key extension under REPEATABLE READ). The practical consequence is that InnoDB locks are more entangled with index choice than PostgreSQL’s; changing which index a query uses can change which rows and gaps get locked.
FK lock style. MySQL’s FK check holds a shared lock on the referenced row (next-key under REPEATABLE READ, and the docs list FK checking as one of the places gap locks persist even under READ COMMITTED). PostgreSQL takes a FOR KEY SHARE lock (added in 9.3 specifically to reduce FK lock contention vs. the older FOR SHARE). Hot parent rows are more contended under MySQL as a result.
Row-lock visibility. PostgreSQL row-level locks don’t show up in pg_locks. Per the docs, they’re stored on the tuple header on disk, not in shared memory. A process waiting for a row lock usually appears in pg_locks as waiting for the holder’s transaction ID, not the row. InnoDB’s performance_schema.data_locks exposes row-level lock state directly. More on this in Part 2.

Neither engine is “better.” The behaviors are different, and code that assumes one can deadlock mysteriously when moved to the other.

Why schema-reading assistants hit these patterns

Locking behavior has no syntax in the query text. A SELECT ... FOR UPDATE advertises the intent; a plain INSERT ... ON DUPLICATE KEY UPDATE or INSERT ... ON CONFLICT doesn’t. The shared next-key lock on a duplicate-key violation, the FK shared lock on the parent row, the gap-lock extension under MySQL’s default isolation are all implementation details of the storage engine. Schema-reading assistants read the catalog, which describes tables, columns, and constraints, and the codebase, which describes queries. Neither surfaces lock ordering, gap-lock semantics, or the difference between READ COMMITTED and REPEATABLE READ unless the prompt specifically includes them.

That’s why AI-generated UPSERT and batch-insert code deadlocks in production the way it does. The model reads INSERT ... ON DUPLICATE KEY UPDATE as an atomic upsert, not as “takes an exclusive lock on the conflicting index record (next-key, so possibly including a gap, when the conflict is on a unique secondary index) before updating it,” and it reads a plain INSERT wrapped in a retry-on-duplicate handler without seeing the shared lock the failed insert still holds. It generates batch INSERTs that process rows in whatever order the application supplies, not sorted by key: fine for a single writer, a lock-ordering cycle under any realistic concurrency. The patterns above are the ones the catalog and query text can’t warn about, and they’re the ones that arrive in production as “intermittent deadlocks under load” after passing every test that didn’t include a second concurrent worker. The fix lives one level up from the query (sorted batches, explicit lock ordering, retry loops) which is the subject of Part 2.

What’s in Part 2

The patterns are the first half. Turning them into working systems takes a different set of skills: reading the deadlock log to identify which pattern is firing, building retry logic that doesn’t mask the real bug, isolating hot rows before they become incident reports, and choosing the right tool for each (NOWAIT, SKIP LOCKED, advisory locks, counter tables, or the isolation-level change that eliminates the category entirely). PostgreSQL’s SERIALIZABLE/SSI produces serialization failures that look like deadlocks but aren’t; the difference matters for retry architecture. AUTO_INCREMENT and sequence-related locking have their own failure modes. DDL migrations on both engines introduce lock queues that manifest as deadlock-like incidents.

All of that is in Database Deadlocks, Part 2: Diagnosis, Retries, and Prevention.

Mental model for the patterns

Deadlocks are what consistent concurrency control does when two transactions make the engine choose between them. The database isn’t misbehaving; it’s refusing to let both of two contradictory orderings win. The error in the application log is a notification, not a fault.

That makes the diagnostic question concrete. Which pattern is firing? Every deadlock in production fits one of the shapes above: lock-ordering cycle, gap lock on a range, duplicate-key shared lock, FK shared lock on a hot parent, unindexed predicate lock amplification, worker-pool race, or replication-apply cycle. Identifying the pattern from the deadlock log narrows the fix enormously. “Two transactions deadlocked, retry the transaction” is true but useless. “Two workers took locks on the same jobs queue in different orders, switch to SKIP LOCKED” is a fix. The work is in the identification.

NULL in SQL: Three-Valued Logic and the Silent Bug Factory

Sun, 26 Jan 2025 00:00:00 +0000

TL;DR

NULL is the absence of a value, and SQL evaluates expressions involving it under three-valued logic (TRUE / FALSE / UNKNOWN). Most operators return UNKNOWN when one of their operands is NULL, so rows with NULLs silently drop out of !=, IN, and NOT IN filters and behave inconsistently across JOIN, GROUP BY, DISTINCT, and aggregate functions. The rules are consistent if you know them, and a source of silently wrong results when you don’t.

There’s a category of SQL bug that shows up in almost every mature codebase. Someone writes a filter like WHERE status != 'closed', expecting it to return every row that isn’t closed. Instead it returns fewer rows than the raw table contains. The rows where status is NULL silently dropped out. No error. No warning. The query is doing exactly what the SQL standard says it should, and the result is still wrong for what the author meant.

NULL handling is the single most common source of silently wrong query results in relational databases. The behavior is consistent if you know the rules, but the rules don’t match the intuition most programming languages build. In Java or Python, null != "closed" is true. In SQL, it’s UNKNOWN, and UNKNOWN rows get filtered out. That one difference produces most of the bugs.

NULL is not a value

Every introduction to SQL NULL starts here because it has to. NULL is the absence of a value, a marker that says “this column has no data.” It’s not zero, not empty string, not false. It doesn’t equal itself. It doesn’t not-equal itself either. Any comparison involving NULL returns a third logical state: UNKNOWN.

This is called three-valued logic (3VL), and SQL uses it consistently throughout the language. The three values are TRUE, FALSE, and UNKNOWN. Most operators propagate UNKNOWN: any arithmetic, string, or comparison operation with a NULL operand returns NULL (or UNKNOWN, in a boolean context).

SELECT NULL = NULL; -- NULL (not TRUE)
SELECT NULL = 5; -- NULL
SELECT NULL != 5; -- NULL
SELECT NULL + 1; -- NULL
SELECT 'hello' || NULL; -- NULL in PostgreSQL (ANSI standard behavior)
SELECT CONCAT('hello', NULL); -- NULL in MySQL, 'hello' in PostgreSQL

The CONCAT difference is a good example of how engines diverge even within well-defined territory. MySQL’s CONCAT propagates NULL: any NULL argument makes the whole result NULL. PostgreSQL’s CONCAT function does the opposite, silently skipping NULL arguments and returning the concatenation of the non-NULL parts. (PostgreSQL’s || operator still propagates NULL, matching ANSI.) Two queries that look identical can return different results on different engines, and the difference only shows up when a NULL appears.

For NULL-skipping concatenation that behaves the same on both engines, use CONCAT_WS (concat with separator). Both MySQL and PostgreSQL skip NULL arguments with it:

SELECT CONCAT_WS(' ', first_name, middle_name, last_name);
-- "Alice Smith" even if middle_name IS NULL, on both engines.

One gotcha: if the separator itself is NULL, the whole result is NULL. The separator is the one argument CONCAT_WS still propagates NULL from (MySQL documents this explicitly; PostgreSQL’s documentation says the separator should not be NULL). As long as the separator is a literal string, the function is a reliable NULL-safe concat across engines.

IS NULL, not = NULL

The standard way to test for NULL is IS NULL or IS NOT NULL. WHERE col = NULL always returns zero rows, because col = NULL evaluates to NULL, which is not TRUE, so the row is filtered out. This is one of those mistakes every SQL engineer makes exactly once.

WHERE clauses filter out UNKNOWN

The rule that drives most NULL bugs: WHERE only keeps rows where the condition evaluates to TRUE. UNKNOWN rows are filtered out, same as FALSE rows.

-- "Users not on the sales team"
SELECT * FROM users WHERE team_id != 3;

If team_id is NULL for unassigned users (a completely normal state) those rows are silently dropped. The expression NULL != 3 evaluates to UNKNOWN, and UNKNOWN is not TRUE, so the row doesn’t survive the filter.

The mental model most developers carry from application code (“anything that isn’t team 3 is included”) is wrong in SQL. To get that behavior, you have to spell it out:

SELECT * FROM users WHERE team_id != 3 OR team_id IS NULL;

This is one of the most common sources of “the numbers don’t match” bugs. A report that’s supposed to count “everyone outside the sales team” quietly excludes every unassigned user, and the total looks plausible because unassigned users aren’t visible in the team-level breakdown either. The discrepancy only surfaces when someone reconciles against a direct row count.
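
The discrepancy is easy to reproduce on a laptop. The sketch below uses Python’s built-in sqlite3 module as a stand-in engine (an assumption of convenience; SQLite follows the same three-valued rule for this filter) with a five-row users table made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, team_id INTEGER)")
conn.executemany(
    "INSERT INTO users (id, team_id) VALUES (?, ?)",
    [(1, 3), (2, 1), (3, None), (4, 2), (5, None)],  # users 3 and 5 are unassigned
)


def ids(sql):
    """Run a query and return the id column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))


# The naive filter drops the unassigned users; the explicit one keeps them.
not_sales_naive = ids("SELECT id FROM users WHERE team_id != 3")
not_sales_fixed = ids("SELECT id FROM users WHERE team_id != 3 OR team_id IS NULL")
total_rows = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

One user is on team 3 and the table has five rows, so the reconciliation expects four. The naive filter returns two.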

NOT IN is a trap

NOT IN with a nullable subquery is the classic silent-failure NULL bug. The trap only springs when the subquery returns a column that can contain NULL. That rules out primary keys, but it is extremely common for foreign keys, self-references, and any column that’s optional by design.

-- "Find users who aren't anybody's manager."
SELECT * FROM users
WHERE id NOT IN (SELECT manager_id FROM users);

The subquery returns every manager_id in the table, including NULL for users who don’t have a manager (the CEO, top-level roles, anyone unassigned). The moment the subquery contains a single NULL, the outer query returns zero rows.

The reason is how NOT IN expands. x NOT IN (a, b, c) is equivalent to x != a AND x != b AND x != c. If any of a, b, c is NULL, that comparison returns UNKNOWN, and AND with UNKNOWN can only ever be FALSE or UNKNOWN. The row never passes the filter.
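
The expansion can be traced by hand in a few lines of Python. This is a model of the logic, not engine code: None stands in for UNKNOWN, and sql_ne, and3, not_in, and where are hypothetical helpers named for this sketch.

```python
def sql_ne(x, y):
    """SQL x != y: None stands for UNKNOWN when either operand is NULL."""
    if x is None or y is None:
        return None
    return x != y


def and3(a, b):
    """Three-valued AND: FALSE dominates, then UNKNOWN, then TRUE."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def not_in(x, values):
    """Expand x NOT IN (values) into a chain of x != v joined by AND."""
    result = True
    for v in values:
        result = and3(result, sql_ne(x, v))
    return result


def where(rows, predicate):
    """Keep only the rows whose predicate is TRUE, as WHERE does."""
    return [row for row in rows if predicate(row) is True]
```

With a NULL in the list, not_in can return only False (the value matched something) or None (it matched nothing, but the NULL comparison is UNKNOWN). It never returns True, so where keeps nothing.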

Safer alternatives:

-- Use NOT EXISTS - handles NULLs correctly
SELECT * FROM users u
WHERE NOT EXISTS (
 SELECT 1 FROM users m WHERE m.manager_id = u.id
);

-- Or filter NULLs out of the subquery explicitly
SELECT * FROM users
WHERE id NOT IN (
 SELECT manager_id FROM users WHERE manager_id IS NOT NULL
);

NOT EXISTS is the better habit. It’s correct regardless of NULL presence, and the query planner handles it at least as well as NOT IN on any modern engine. Treating NOT IN as “suspicious until proven NULL-free” saves a category of bug that’s almost impossible to catch in review.

COUNT and NULL: skipped, not zero

The single most important thing to know about aggregates and NULL: NULL is not treated as zero. It’s skipped entirely. Nothing about NULL gets coerced or counted; it’s as if the row weren’t there for the purposes of the aggregate.

COUNT makes this visible because it has three forms that behave differently:

COUNT(*) counts rows, regardless of their contents. NULLs in the row don’t matter.
COUNT(col) counts non-NULL values of col. A row where col IS NULL is skipped.
COUNT(DISTINCT col) counts distinct non-NULL values. NULL is not treated as a distinct value; it’s excluded.

SELECT
 COUNT(*) AS total_rows,
 COUNT(email) AS rows_with_email,
 COUNT(DISTINCT email) AS distinct_emails
FROM users;

On a table of 1,000 users where 200 have NULL emails:

COUNT(*) returns 1000 (all rows)
COUNT(email) returns 800 (NULLs skipped)
COUNT(DISTINCT email) returns ≤ 800 (distinct non-NULL emails only)

This shows up in reports all the time. “How many users signed up this month?” gets answered with COUNT(signup_source) and comes up short because the column was added later and older rows have NULL. The row is there. COUNT(*) would see it. COUNT(signup_source) doesn’t.

The rule: use COUNT(*) when you want rows, COUNT(col) when you specifically want “rows with that column populated.”
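
A scaled-down version of the same comparison, runnable with Python’s built-in sqlite3 module as a stand-in engine. The five-row table is invented for the sketch: three populated emails (one of them duplicated) and two NULLs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [("a@example.com",), ("b@example.com",), ("a@example.com",), (None,), (None,)],
)

# One pass, three different answers to "how many?"
total_rows, rows_with_email, distinct_emails = conn.execute(
    "SELECT COUNT(*), COUNT(email), COUNT(DISTINCT email) FROM users"
).fetchone()
```

The three counts come back as 5, 3, and 2: every row, every non-NULL email, and every distinct non-NULL email.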

SUM, AVG, MIN, MAX: also skip NULL

The same rule holds for SUM, AVG, MIN, and MAX. NULL is not contributed to the sum, not counted in the denominator for the average, not considered for min or max.

SELECT SUM(rating), AVG(rating) FROM reviews;

If half the rows have NULL rating:

SUM(rating) is the sum of the non-NULL half. NULLs don’t contribute 0; they contribute nothing.
AVG(rating) is the sum of the non-NULL half divided by the count of non-NULL rows, not the total row count.

The AVG behavior is the most common source of surprise. If 10,000 rows have rating = 5 and 10,000 have rating = NULL, AVG(rating) is 5.0, not 2.5. The NULL rows don’t pull the average down toward zero. They’re not in the denominator at all.

If you want NULL-as-zero behavior, you have to opt in:

SELECT AVG(COALESCE(rating, 0)) FROM reviews;
-- Now NULLs become 0 and land in both the sum and the denominator.
-- Returns 2.5 in the example above.

SUM of all NULLs is NULL, not zero

SUM(col) over a set where every value is NULL returns NULL, not 0. A SUM that feeds into arithmetic downstream (total + tax, for example) can propagate NULL through the rest of the expression, often somewhere the query author wasn’t expecting. COALESCE(SUM(col), 0) is the idiomatic fix; make the fallback explicit at the aggregate.

The framing that keeps this straight: NULL is not a value, so aggregates have nothing to aggregate. Absent, not zero. If you want absent to mean zero, that’s a COALESCE decision the query author makes; the engine won’t make it for you.
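
The same numbers at toy scale, again using Python’s built-in sqlite3 module as a stand-in engine: two reviews rated 5 and two with NULL rating. The scalar helper and the variable names are made up for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (id INTEGER PRIMARY KEY, rating INTEGER)")
conn.executemany(
    "INSERT INTO reviews (rating) VALUES (?)", [(5,), (5,), (None,), (None,)]
)


def scalar(sql):
    """Run a single-value query and return that value."""
    return conn.execute(sql).fetchone()[0]


# NULL ratings are absent from both the numerator and the denominator.
avg_skipping_nulls = scalar("SELECT AVG(rating) FROM reviews")
# Opting in to NULL-as-zero puts them back into both.
avg_nulls_as_zero = scalar("SELECT AVG(COALESCE(rating, 0)) FROM reviews")
# A SUM over nothing but NULLs is NULL (None in Python), not 0.
sum_all_null = scalar("SELECT SUM(rating) FROM reviews WHERE rating IS NULL")
sum_with_fallback = scalar(
    "SELECT COALESCE(SUM(rating), 0) FROM reviews WHERE rating IS NULL"
)
```

The averages come back as 5.0 and 2.5, the bare SUM as None, and the COALESCE-wrapped SUM as 0.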

GROUP BY and DISTINCT treat NULLs as equal

Here’s where the rules get inconsistent in a way that genuinely surprises people: GROUP BY and DISTINCT treat all NULLs as the same group, even though NULL = NULL returns UNKNOWN.

-- All rows where team_id is NULL land in one group, as if they were equal.
SELECT team_id, COUNT(*) FROM users GROUP BY team_id;
-- team_id | count
-- NULL | 200
-- 1 | 500
-- 2 | 300

-- DISTINCT collapses all NULLs into one row.
SELECT DISTINCT team_id FROM users;
-- NULL
-- 1
-- 2

This is a deliberate exception carved out by the SQL standard. GROUP BY and DISTINCT use a “NULL-safe” equality for grouping purposes, because the alternative (one group per NULL row) would be useless. But it means the behavior is internally inconsistent: WHERE a = b says NULLs aren’t equal, GROUP BY a says they are.

The practical implication: COUNT(DISTINCT col) excludes NULL entirely (consistent with COUNT(col)), while GROUP BY col produces a single row for all NULLs. Two different “null-handling” behaviors under the same umbrella of “treats NULLs as equal for grouping.” Queries that rely on either for correctness should be written with the awareness that the two operations don’t agree.
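
The disagreement shows up as an off-by-one between the two counts. A minimal sketch on a five-row users table (the data is made up for illustration; SQLite via Python's sqlite3 is a stand-in engine):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, team_id INTEGER)")
conn.executemany(
    "INSERT INTO users (team_id) VALUES (?)",
    [(1,), (1,), (2,), (None,), (None,)],
)

# GROUP BY puts both NULL rows into a single group.
groups = conn.execute(
    "SELECT team_id, COUNT(*) FROM users GROUP BY team_id"
).fetchall()

# COUNT(DISTINCT ...) ignores NULL entirely.
distinct_teams = conn.execute(
    "SELECT COUNT(DISTINCT team_id) FROM users"
).fetchone()[0]

print(len(groups), distinct_teams)
# 3 2
```

Three groups, two distinct values: a report that uses one number to sanity-check the other will be off by exactly the NULL group.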

NULL-safe comparison operators

Both MySQL and PostgreSQL offer operators that treat NULL as equal to NULL, mirroring the GROUP BY behavior for regular comparisons.

-- MySQL
SELECT * FROM users WHERE email <=> NULL;
-- Matches rows where email IS NULL. <=> is the null-safe equal operator.

-- PostgreSQL (ANSI SQL)
SELECT * FROM users WHERE email IS NOT DISTINCT FROM NULL;
-- Same idea. Treats NULLs as equal to each other.

These are useful when joining or filtering on columns that may contain NULL on both sides and you want NULLs to match:

-- Standard equality misses NULL-to-NULL matches
SELECT * FROM a JOIN b ON a.col = b.col;

-- IS NOT DISTINCT FROM treats NULLs as matching
SELECT * FROM a JOIN b ON a.col IS NOT DISTINCT FROM b.col;

Neither is used often in practice. The habit most teams settle on is “don’t let NULL be meaningful in join columns”: either constrain the columns NOT NULL or filter NULLs out before joining. The operators are there for the cases where those aren’t options.

ORDER BY: NULL placement varies by engine

When sorting, NULL has to go somewhere. The SQL standard leaves the default placement implementation-defined, and engines disagree.

PostgreSQL. NULLs sort last for ASC and first for DESC by default.
MySQL. NULLs sort first for ASC and last for DESC by default.
Oracle. Matches PostgreSQL’s behavior (NULLs last for ASC).
SQL Server. Matches MySQL’s behavior (NULLs first for ASC).

The fix is to be explicit:

SELECT * FROM events ORDER BY event_time ASC NULLS LAST;
SELECT * FROM events ORDER BY event_time DESC NULLS LAST;

NULLS FIRST / NULLS LAST is ANSI standard and supported by PostgreSQL and Oracle. MySQL and SQL Server don’t support the NULLS FIRST/LAST syntax directly; in MySQL you fake it with a computed sort key:

-- MySQL idiom for "NULLS LAST" on an ASC sort
SELECT * FROM events ORDER BY event_time IS NULL, event_time ASC;
-- event_time IS NULL returns 0 for non-nulls, 1 for nulls; 0 sorts first.

Teams that run the same reports against different engines (especially during a migration or in a polyglot analytics stack) hit this one hard. A top-10 leaderboard quietly reorders when the engine changes underneath it.
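
When a report has to produce the same order on every engine, one option is a sort key that doesn't rely on any engine default. The sketch below uses a CASE expression, which is plain SQL, and runs it on SQLite through Python's sqlite3 as a stand-in; the events data is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, event_time TEXT)")
conn.executemany(
    "INSERT INTO events (event_time) VALUES (?)",
    [("2025-01-02",), (None,), ("2025-01-01",)],
)

# First sort key: 0 for real values, 1 for NULL, so NULLs land at the end.
nulls_last = [row[0] for row in conn.execute(
    """
    SELECT event_time FROM events
    ORDER BY CASE WHEN event_time IS NULL THEN 1 ELSE 0 END, event_time ASC
    """
)]

# Flip the CASE to pin NULLs to the front instead.
nulls_first = [row[0] for row in conn.execute(
    """
    SELECT event_time FROM events
    ORDER BY CASE WHEN event_time IS NULL THEN 0 ELSE 1 END, event_time ASC
    """
)]

print(nulls_last)   # ['2025-01-01', '2025-01-02', None]
print(nulls_first)  # [None, '2025-01-01', '2025-01-02']
```

The placement is now part of the query text, so it survives an engine swap.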

JOINs don’t match on NULL

A standard equi-join a.col = b.col doesn’t match rows where either side is NULL. This is consistent with the three-valued logic rule: NULL = NULL is UNKNOWN, so the join predicate fails.

-- Users can have no manager (manager_id IS NULL).
-- This join drops any user with no manager.
SELECT u.name, m.name AS manager_name
FROM users u
JOIN managers m ON u.manager_id = m.id;

If the intent is “every user, with manager info if present,” use a LEFT JOIN. If the intent is “users where manager_id matches some manager row,” the INNER JOIN is correct but it’s worth naming the exclusion: users with NULL manager_id are gone, on purpose.
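
The difference between the two intents is one row in this toy dataset. A minimal sketch with two users and one manager (names are invented; SQLite via Python's sqlite3 is a stand-in engine):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE managers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        manager_id INTEGER REFERENCES managers(id)  -- nullable: Ben has no manager
    );
    INSERT INTO managers VALUES (10, 'Mia');
    INSERT INTO users VALUES (1, 'Ana', 10), (2, 'Ben', NULL);
    """
)

inner_rows = conn.execute(
    """
    SELECT u.name, m.name FROM users u
    JOIN managers m ON u.manager_id = m.id
    ORDER BY u.id
    """
).fetchall()

left_rows = conn.execute(
    """
    SELECT u.name, m.name FROM users u
    LEFT JOIN managers m ON u.manager_id = m.id
    ORDER BY u.id
    """
).fetchall()

print(inner_rows)  # [('Ana', 'Mia')]
print(left_rows)   # [('Ana', 'Mia'), ('Ben', None)]
```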

For joins that should treat NULLs as matching (both sides have NULL, and that means “same”), use the null-safe operator:

SELECT *
FROM a JOIN b ON a.external_ref IS NOT DISTINCT FROM b.external_ref;

This is rare but legitimate (e.g., matching optional identifiers where “both unspecified” should be treated as a match). Most of the time, the correct answer is to make the column NOT NULL and use a sentinel if needed (and then deal with the sentinel’s own problems, covered below).

Foreign keys are nullable by default

A foreign key column is nullable unless declared NOT NULL. A nullable FK means the reference is optional: users may or may not have a manager, orders may or may not be linked to a promotion. This is often the correct intent, but it’s frequently unintentional.

-- manager_id is nullable by default. This is intentional if users can be unmanaged.
CREATE TABLE users (
 id BIGINT PRIMARY KEY,
 name TEXT NOT NULL,
 manager_id BIGINT REFERENCES managers(id)
);

Review migration files with this in mind. A column that should always be populated but was added as nullable will accept NULLs forever. Retrofitting NOT NULL later requires backfilling or cleaning up existing NULL rows: easy when the table is small, painful at scale. (Foreign Keys Are Not Optional covers the broader picture of FK enforcement and why application-level validation is an incomplete substitute.)

What NULL actually means is context-dependent

The SQL rules for NULL are unambiguous. What NULL means in a given column is not. NULL can mean:

Unknown. The data exists but we don’t have it. A user’s birthdate where the user declined to share.
Not applicable. The field doesn’t make sense for this row. spouse_name on a row for a single person.
Ongoing or not yet set. The state isn’t finalized. end_date on an active subscription.
Data entry error. The column should have been populated but wasn’t.
Legacy. The column was added after the row was created and never backfilled.

The same column may mean different things in different rows, and the schema doesn’t tell you which is which. This is where schema comments earn their keep, documenting the semantics of NULL in each column in the DDL itself rather than in a wiki page nobody finds.

Sentinel values: the alternative, and its own problems

A common workaround: use a sentinel value instead of NULL. end_date = '9999-12-31' for “ongoing.” status = -1 for “unknown.” deleted_at = '1970-01-01' for “not deleted.”

Sentinels avoid the three-valued-logic rules at the cost of introducing their own bugs. A few to watch for:

Aggregates include sentinels. AVG(rating) over a column where “unknown” is stored as -1 drags the average down. Sentinels break the “aggregates skip missing values” assumption that NULL provides for free.
Range queries break in unexpected directions. WHERE end_date > NOW() returns all the sentinel rows along with real future dates. Every filter has to explicitly exclude the sentinel.
Indexes skew. A column where 80% of the values are the sentinel has a low-selectivity index. The planner may skip the index entirely on queries that filter out the sentinel, because it doesn’t know that’s the intent.
Downstream consumers have to know. Every system that reads the data has to treat 9999-12-31 specially. Miss one consumer and wrong data shows up in a report.

The trade-off is real. NULL forces every query author to think about three-valued logic. Sentinels let queries use normal equality but require every author to know the sentinel. Neither is free; they move the cost around.

The pragmatic middle ground: use NULL for genuinely absent data (ongoing subscriptions, optional fields), use sentinels sparingly and document them, and declare NOT NULL everywhere you can enforce presence. A column that’s NOT NULL is the one case where the rules don’t matter, because NULL can’t get in.
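
The aggregate problem with sentinels is easy to reproduce. A minimal sketch comparing the same three ratings stored two ways; the table names are hypothetical and SQLite (via Python's sqlite3) is a stand-in engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- "Unknown" stored as NULL
    CREATE TABLE reviews_null (rating INTEGER);
    INSERT INTO reviews_null VALUES (4), (2), (NULL);

    -- "Unknown" stored as the sentinel -1
    CREATE TABLE reviews_sentinel (rating INTEGER NOT NULL);
    INSERT INTO reviews_sentinel VALUES (4), (2), (-1);
    """
)

avg_null = conn.execute("SELECT AVG(rating) FROM reviews_null").fetchone()[0]
avg_sentinel = conn.execute("SELECT AVG(rating) FROM reviews_sentinel").fetchone()[0]
# Every query has to remember to exclude the sentinel by hand.
avg_filtered = conn.execute(
    "SELECT AVG(rating) FROM reviews_sentinel WHERE rating <> -1"
).fetchone()[0]

print(avg_null, round(avg_sentinel, 2), avg_filtered)
# 3.0 1.67 3.0
```

With NULL the engine skips the missing value on its own; with the sentinel the correct answer depends on every author adding the filter.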

Diagnosing a NULL bug

When a query returns fewer rows than it should (or more, or none), the fastest way to narrow it down to a NULL issue:

-- Are there NULLs in the columns referenced by the filter?
SELECT
 COUNT(*) AS total,
 COUNT(team_id) AS with_team,
 COUNT(*) - COUNT(team_id) AS no_team
FROM users;

If no_team is non-zero and the filter is team_id != X or team_id IN (...), the NULL rows are the likely culprit. Rewriting with explicit NULL handling (team_id != X OR team_id IS NULL, or NOT EXISTS, or COALESCE(team_id, -1) != X) will reveal whether NULLs were being silently excluded.

For NOT IN, inspect the subquery:

-- Does the NOT IN subquery contain NULL?
SELECT COUNT(*) FROM users WHERE manager_id IS NULL;

If the answer is non-zero, NOT IN is returning an empty set regardless of the outer query’s data.
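
A minimal reproduction, assuming the question is "which users manage nobody" and the subquery reads manager_id from the same users table (an illustrative setup, run on SQLite via Python's sqlite3 as a stand-in engine):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, manager_id INTEGER)")
# User 1 has no manager (NULL); users 2 and 3 report to user 1.
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, None), (2, 1), (3, 1)])

# The subquery returns (NULL, 1, 1). The NULL makes every NOT IN test
# either FALSE or UNKNOWN, so no row survives the WHERE.
not_in_ids = [row[0] for row in conn.execute(
    "SELECT id FROM users WHERE id NOT IN (SELECT manager_id FROM users) ORDER BY id"
)]

# NOT EXISTS only asks whether a matching row exists, so NULL never poisons it.
not_exists_ids = [row[0] for row in conn.execute(
    """
    SELECT u.id FROM users u
    WHERE NOT EXISTS (SELECT 1 FROM users r WHERE r.manager_id = u.id)
    ORDER BY u.id
    """
)]

print(not_in_ids)      # []
print(not_exists_ids)  # [2, 3]
```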

The mental model

NULL handling is consistent once you internalize the rule set, and the rule set is smaller than it looks:

Any comparison involving NULL returns UNKNOWN. WHERE filters out UNKNOWN rows.
Aggregates skip NULLs. COUNT(*) doesn’t. COUNT(col) does.
GROUP BY, DISTINCT, and ORDER BY treat all NULLs as equivalent (with engine-specific sort placement).
NOT IN against a subquery that returns any NULL returns no rows. Use NOT EXISTS.
Join predicates don’t match NULLs unless you use IS NOT DISTINCT FROM.

Past that, most NULL bugs are prevented by one habit: declare NOT NULL wherever the column should actually be populated. Every NOT NULL column is a column where none of these rules matter, because there’s nothing for them to misbehave on. The fewer nullable columns the schema has, the less of this there is to think about.

The columns where NULL genuinely carries meaning (optional references, ongoing states, data that may not exist) are the ones worth documenting. A schema comment that says “NULL means the subscription is still active” pulls the NULL semantics into the DDL itself, where it’s visible to every engineer, every tool, and every query author who wasn’t around when the decision was made.

Joins That Lie: The Cardinality Problem

Thu, 09 Jan 2025 00:00:00 +0000

TL;DR

Most silently wrong SQL comes from the same root cause: a join that multiplies rows in a way the author didn’t expect. Aggregations built on those rows (SUM, COUNT, AVG) inflate without producing any error. The fix is understanding the cardinality of every join before writing the aggregation, not more careful SQL.

There’s a category of SQL bug that never throws an error, never fails code review, and never shows up in tests. The query runs. The results look reasonable. Someone ships a dashboard, and a month later finance asks why revenue is 40% higher than what the billing system reports. That 40% isn’t a bug in the data; it’s a join that multiplied rows, and a SUM that dutifully added them all up.

The tricky part is that structurally the query is fine. The joins are valid. The filters are valid. The aggregation is valid. Every individual piece is correct. The cardinality of the relationships (how many child rows exist per parent, and how that changes when multiple child tables are joined at once) is doing damage the query never surfaces.

Cardinality, briefly

Cardinality describes the number of rows on each side of a relationship:

One-to-one (1:1). Each row in table A matches at most one row in table B. Less common than 1:N, but legitimately used for optional extensions (splitting off rarely-accessed or sensitive columns into a side table), inheritance patterns (a base table with specialized sub-tables), or separating hot and cold data for caching and storage reasons. A 1:1 join preserves row count.
One-to-many (1:N). Each row in A matches zero or more rows in B. The common case: one order has many order items, one user has many sessions, one post has many comments. Joining A to B duplicates the parent row once per matching child. If a parent has zero children, an inner join drops it entirely; a left join keeps it with NULLs on the child side. This difference matters and it’s the source of another whole class of silent bugs.
Many-to-many (N:M). Rows in A match many rows in B and vice versa. Always implemented through a bridge table (junction table) that sits between them. A bridge is two 1:N relationships back-to-back: the bridge table holds a foreign key to A and a foreign key to B, with each row pairing one A with one B. A has many bridge rows, and B has many bridge rows. Joining through it multiplies by the cardinality on both sides.

The shape of the relationship determines what a join does to row counts. This is where aggregations start to lie.

Schema cardinality vs. data cardinality

There’s a distinction worth naming: what the schema allows vs. what the data actually contains. A foreign key from user_profiles.user_id to users.id with a unique constraint is 1:1 at the schema level. A relationship that is 1:N by constraint can still be 1:1 in practice; if every order in your system happens to have exactly one line item, the relationship is legally 1:N but effectively 1:1. This matters for query planning (the optimizer can prove things only from constraints; statistics on the observed data give it estimates, not guarantees), index choice, and reasoning about whether a join can actually multiply rows. A query that’s safe against the current data can break as soon as the data starts exercising the cardinality the schema permits.

The row multiplication problem

The examples in this article use a deliberately simple customers / orders / order_items schema so the mechanics are easy to follow. In real systems the shape changes constantly: invoices and payments, subscriptions and usage events, tickets and messages, events and dimensions in a warehouse. The permutations are endless, but the underlying failure is the same: a join that multiplies rows in a way the author didn’t expect, feeding an aggregation that now lies. Once the pattern is visible in one schema, it’s visible everywhere.

Consider a schema everyone has seen some version of:

CREATE TABLE customers (
 id BIGINT PRIMARY KEY,
 name VARCHAR(255) NOT NULL
);

CREATE TABLE orders (
 id BIGINT PRIMARY KEY,
 customer_id BIGINT NOT NULL REFERENCES customers(id),
 total_cents BIGINT NOT NULL
);

CREATE TABLE order_items (
 id BIGINT PRIMARY KEY,
 order_id BIGINT NOT NULL REFERENCES orders(id),
 price_cents BIGINT NOT NULL,
 quantity INT NOT NULL
);

A question: what’s the total revenue per customer? The obvious query:

SELECT c.name, SUM(o.total_cents) AS revenue
FROM customers c
JOIN orders o ON o.customer_id = c.id
GROUP BY c.name;

This is correct (assuming customer names are unique; otherwise add c.id to the GROUP BY). One row per order, total_cents summed per customer. Now someone asks: “can we also see how many items they bought?” The change looks trivial; add a join and a count:

SELECT
 c.name,
 SUM(o.total_cents) AS revenue,
 COUNT(oi.id) AS items_purchased
FROM customers c
JOIN orders o ON o.customer_id = c.id
JOIN order_items oi ON oi.order_id = o.id
GROUP BY c.name;

The items_purchased count is correct. The revenue is wrong.

Here’s what happened. orders to order_items is 1:N. Joining them multiplies each order row by the number of items it contains. An order with 5 items now appears 5 times in the result set, once per item. total_cents, which lives on the orders row, is duplicated in each of those 5 copies.

SUM(o.total_cents) now sums the same order total once per item. A $100 order with 5 items contributes $500. Each order is inflated by its own item count, so overall revenue grows by roughly the average number of items per order.

The query runs. The numbers look like revenue. Nothing is flagged. The dashboard ships.
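
The $100-order example can be reproduced directly. A minimal sketch using the schema above with one made-up customer, one order, and five items; SQLite (via Python's sqlite3) is a stand-in engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total_cents INTEGER NOT NULL
    );
    CREATE TABLE order_items (
        id INTEGER PRIMARY KEY,
        order_id INTEGER NOT NULL REFERENCES orders(id),
        price_cents INTEGER NOT NULL,
        quantity INTEGER NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Acme');
    INSERT INTO orders VALUES (1, 1, 10000);   -- one $100 order
    INSERT INTO order_items (order_id, price_cents, quantity) VALUES
        (1, 2000, 1), (1, 2000, 1), (1, 2000, 1), (1, 2000, 1), (1, 2000, 1);
    """
)

true_revenue = conn.execute("SELECT SUM(total_cents) FROM orders").fetchone()[0]

name, flat_revenue, items_purchased = conn.execute(
    """
    SELECT c.name, SUM(o.total_cents), COUNT(oi.id)
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    JOIN order_items oi ON oi.order_id = o.id
    GROUP BY c.name
    """
).fetchone()

print(true_revenue, flat_revenue, items_purchased)
# 10000 50000 5
```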

Why it's easy to miss

The inflation is proportional to the cardinality of the join, so it affects every row by roughly the same factor. Totals grow uniformly, relative rankings stay intact, and top-10 lists still look “right.” There’s nothing that stands out as obviously wrong, except the grand total doesn’t match the source system.

The bridge table trap

Many-to-many relationships make this problem worse because the multiplication happens in both directions. Take a schema where order items and promotions are linked through a bridge table:

CREATE TABLE order_items (
 id BIGINT PRIMARY KEY,
 order_id BIGINT NOT NULL,
 product_id BIGINT NOT NULL,
 price_cents BIGINT NOT NULL
);

CREATE TABLE order_item_promotions (
 order_item_id BIGINT NOT NULL,
 promotion_id BIGINT NOT NULL,
 PRIMARY KEY (order_item_id, promotion_id)
);

CREATE TABLE promotions (
 id BIGINT PRIMARY KEY,
 name VARCHAR(255) NOT NULL
);

An order item can have multiple promotions applied to it (a percentage discount stacked with a free shipping promo). Query: total revenue, broken down by promotion:

SELECT p.name, SUM(oi.price_cents) AS revenue
FROM order_items oi
JOIN order_item_promotions oip ON oip.order_item_id = oi.id
JOIN promotions p ON p.id = oip.promotion_id
GROUP BY p.name;

If an order item had two promotions, its price_cents shows up twice (once under each promotion). Sum those up and total revenue exceeds actual revenue. Worse, if you then compare “sum across all promotions” to “total revenue from order_items,” the numbers don’t tie out, and there’s no obvious reason why.

The bridge table is doing exactly what it’s supposed to do. The query is doing exactly what the SQL says. The meaning of the aggregation drifts as soon as you cross a many-to-many boundary.
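
Here is the tie-out failure on the smallest possible dataset: one order item with two stacked promotions. The rows are invented for illustration, and SQLite (via Python's sqlite3) is a stand-in engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE order_items (
        id INTEGER PRIMARY KEY,
        order_id INTEGER NOT NULL,
        product_id INTEGER NOT NULL,
        price_cents INTEGER NOT NULL
    );
    CREATE TABLE promotions (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE order_item_promotions (
        order_item_id INTEGER NOT NULL,
        promotion_id INTEGER NOT NULL,
        PRIMARY KEY (order_item_id, promotion_id)
    );
    INSERT INTO order_items VALUES (1, 1, 1, 1000);
    INSERT INTO promotions VALUES (1, '10% off'), (2, 'Free shipping');
    INSERT INTO order_item_promotions VALUES (1, 1), (1, 2);
    """
)

per_promotion = conn.execute(
    """
    SELECT p.name, SUM(oi.price_cents)
    FROM order_items oi
    JOIN order_item_promotions oip ON oip.order_item_id = oi.id
    JOIN promotions p ON p.id = oip.promotion_id
    GROUP BY p.name
    ORDER BY p.name
    """
).fetchall()

sum_across_promotions = sum(revenue for _, revenue in per_promotion)
actual_revenue = conn.execute("SELECT SUM(price_cents) FROM order_items").fetchone()[0]

print(per_promotion)                          # [('10% off', 1000), ('Free shipping', 1000)]
print(sum_across_promotions, actual_revenue)  # 2000 1000
```

Each per-promotion row is defensible on its own ("revenue from items that had this promotion"); it is only the sum across rows that has no meaning.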

A variation of the grain problem shows up in schemas where related tables each carry their own independently-moving date column: orders vs. shipments, subscriptions vs. invoices, tickets vs. updates, orders vs. returns. When a question is time-bounded (“Q1 revenue from items shipped in Q1”), the date filter has to land on the column that matches the question. Filtering on both tables “to be safe” silently excludes rows whose dates diverge. An order placed in December with items shipping in January is a Q1 shipment; a filter on orders.created_at throws it out.

The rule is the same as for row multiplication: pick the grain that matches the question, once. If the question is about shipments, filter on shipped_at. If it’s about orders, filter on created_at. Combining both feels more rigorous and quietly returns the wrong set.
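
A sketch of the December-order, January-shipment case. The shipments table and its columns (shipped_at, amount_cents) are hypothetical names chosen for this example, dates are stored as ISO-8601 text so string comparison sorts correctly, and SQLite (via Python's sqlite3) is a stand-in engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT NOT NULL);
    CREATE TABLE shipments (
        id INTEGER PRIMARY KEY,
        order_id INTEGER NOT NULL REFERENCES orders(id),
        shipped_at TEXT NOT NULL,
        amount_cents INTEGER NOT NULL
    );
    INSERT INTO orders VALUES (1, '2024-12-20'), (2, '2025-01-10');
    INSERT INTO shipments VALUES
        (1, 1, '2025-01-05', 4000),   -- ordered in December, shipped in Q1
        (2, 2, '2025-01-15', 6000);
    """
)

# Question: revenue from items shipped in Q1 2025. Filter on shipped_at only.
q1_by_shipment = conn.execute(
    """
    SELECT SUM(s.amount_cents) FROM shipments s
    WHERE s.shipped_at >= '2025-01-01' AND s.shipped_at < '2025-04-01'
    """
).fetchone()[0]

# "To be safe" version: also filters on the order date, dropping shipment 1.
q1_both_filters = conn.execute(
    """
    SELECT SUM(s.amount_cents) FROM shipments s
    JOIN orders o ON o.id = s.order_id
    WHERE s.shipped_at >= '2025-01-01' AND s.shipped_at < '2025-04-01'
      AND o.created_at >= '2025-01-01' AND o.created_at < '2025-04-01'
    """
).fetchone()[0]

print(q1_by_shipment, q1_both_filters)
# 10000 6000
```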

How to diagnose it

The symptom is always the same: a number that doesn’t match what another system says it should be. Revenue doesn’t match billing. User counts don’t match the auth service. Item totals don’t match inventory. When that happens, the first thing to check isn’t the aggregation; it’s the row count at each stage of the query.

Take the aggregation off and see what you’re actually summing:

-- Original (wrong) query
SELECT c.name, SUM(o.total_cents) AS revenue
FROM customers c
JOIN orders o ON o.customer_id = c.id
JOIN order_items oi ON oi.order_id = o.id
GROUP BY c.name;

-- Diagnostic: see the raw rows for one customer
SELECT c.name, o.id AS order_id, o.total_cents, oi.id AS item_id
FROM customers c
JOIN orders o ON o.customer_id = c.id
JOIN order_items oi ON oi.order_id = o.id
WHERE c.id = 42
ORDER BY o.id;

If the same order_id and total_cents appear on multiple rows, the sum is going to double-count. Seeing the raw rows makes the multiplication obvious in a way the aggregated output never does.

Another useful check: compare counts at each level independently.

-- Count orders directly
SELECT COUNT(*) FROM orders WHERE customer_id = 42;
-- Returns: 3

-- Count orders through the joined query
SELECT COUNT(*) FROM orders o
JOIN order_items oi ON oi.order_id = o.id
WHERE o.customer_id = 42;
-- Returns: 12

-- The 4x multiplication is the join's cardinality

When the two numbers don’t match, the join is multiplying rows. Every aggregation downstream of that join is suspect.

How to solve it

There’s no single fix; the right technique depends on whether the aggregation lives on the parent or the child, and how many cardinality boundaries you’re crossing.

Aggregate at the correct grain, then join

The cleanest approach is usually to do each aggregation at the table where the data actually lives, then join the pre-aggregated results together. This keeps row counts under control and makes the query’s intent obvious.

WITH order_stats AS (
 SELECT customer_id, SUM(total_cents) AS revenue
 FROM orders
 GROUP BY customer_id
),
item_stats AS (
 SELECT o.customer_id, COUNT(oi.id) AS items_purchased
 FROM orders o
 JOIN order_items oi ON oi.order_id = o.id
 GROUP BY o.customer_id
)
SELECT
 c.name,
 order_stats.revenue,
 item_stats.items_purchased
FROM customers c
LEFT JOIN order_stats ON order_stats.customer_id = c.id
LEFT JOIN item_stats ON item_stats.customer_id = c.id;

Revenue is summed from orders where it lives, once per order. Items are counted through the orders→order_items join separately. Then both are joined back to customers. Each aggregation happens at its correct grain, and the final join is 1:1:1, no multiplication.

It looks more verbose. It is. That’s the point. The verbosity is making the cardinality explicit instead of hiding it behind a single flat join.
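
To see the two shapes side by side, the sketch below runs the flat join and the pre-aggregated CTE version against the same made-up data (one customer, a $100 order with three items and a $50 order with one). SQLite via Python's sqlite3 is a stand-in engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER NOT NULL, total_cents INTEGER NOT NULL);
    CREATE TABLE order_items (id INTEGER PRIMARY KEY, order_id INTEGER NOT NULL, price_cents INTEGER NOT NULL);
    INSERT INTO customers VALUES (1, 'Acme');
    INSERT INTO orders VALUES (1, 1, 10000), (2, 1, 5000);
    INSERT INTO order_items (order_id, price_cents) VALUES
        (1, 4000), (1, 3000), (1, 3000), (2, 5000);
    """
)

flat = conn.execute(
    """
    SELECT c.name, SUM(o.total_cents), COUNT(oi.id)
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    JOIN order_items oi ON oi.order_id = o.id
    GROUP BY c.name
    """
).fetchone()

pre_aggregated = conn.execute(
    """
    WITH order_stats AS (
        SELECT customer_id, SUM(total_cents) AS revenue
        FROM orders GROUP BY customer_id
    ),
    item_stats AS (
        SELECT o.customer_id, COUNT(oi.id) AS items_purchased
        FROM orders o JOIN order_items oi ON oi.order_id = o.id
        GROUP BY o.customer_id
    )
    SELECT c.name, order_stats.revenue, item_stats.items_purchased
    FROM customers c
    LEFT JOIN order_stats ON order_stats.customer_id = c.id
    LEFT JOIN item_stats ON item_stats.customer_id = c.id
    """
).fetchone()

print(flat)            # ('Acme', 35000, 4)
print(pre_aggregated)  # ('Acme', 15000, 4)
```

Both queries agree on the item count; only the order-level measure moves, which is why the flat version tends to pass a casual review.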

Use `DISTINCT` inside the aggregate, with caution

When the multiplication is already there, SUM(DISTINCT ...) can sometimes paper over it:

SELECT
 c.name,
 SUM(DISTINCT o.total_cents) AS revenue -- suspicious
FROM customers c
JOIN orders o ON o.customer_id = c.id
JOIN order_items oi ON oi.order_id = o.id
GROUP BY c.name;

This only works if no two of a customer’s orders share the same total_cents. If two different orders happen to have the same total, DISTINCT collapses them into one and revenue drops. It’s fragile: the result is right only as long as the data happens to cooperate.

COUNT(DISTINCT o.id) is safer because id is unique by definition. Use DISTINCT on keys, not on measures.
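
The collapse takes only two orders with the same total to trigger. A minimal sketch with invented data (two $50 orders, two items each), run on SQLite via Python's sqlite3 as a stand-in engine:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER NOT NULL, total_cents INTEGER NOT NULL);
    CREATE TABLE order_items (id INTEGER PRIMARY KEY, order_id INTEGER NOT NULL, price_cents INTEGER NOT NULL);
    -- Two different orders that happen to share the same total.
    INSERT INTO orders VALUES (1, 1, 5000), (2, 1, 5000);
    INSERT INTO order_items (order_id, price_cents) VALUES
        (1, 2500), (1, 2500), (2, 2500), (2, 2500);
    """
)

true_revenue = conn.execute("SELECT SUM(total_cents) FROM orders").fetchone()[0]

flat_sum, distinct_sum, distinct_orders = conn.execute(
    """
    SELECT SUM(o.total_cents),            -- inflated by the join
           SUM(DISTINCT o.total_cents),   -- over-corrected: equal totals collapse
           COUNT(DISTINCT o.id)           -- safe: the key is unique per order
    FROM orders o
    JOIN order_items oi ON oi.order_id = o.id
    """
).fetchone()

print(true_revenue, flat_sum, distinct_sum, distinct_orders)
# 10000 20000 5000 2
```

The flat sum is too high and the DISTINCT sum is too low; neither matches the 10000 that orders actually holds.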

Window functions for “per parent” aggregates

When you need a running or per-group aggregate without collapsing rows, window functions keep the row count intact and do the math within a partition:

SELECT
 o.id AS order_id,
 o.customer_id,
 oi.id AS item_id,
 oi.price_cents,
 SUM(oi.price_cents) OVER (PARTITION BY o.id) AS order_total,
 SUM(oi.price_cents) OVER (PARTITION BY o.customer_id) AS customer_total
FROM orders o
JOIN order_items oi ON oi.order_id = o.id;

No group-by, no row collapsing, totals computed at the right grain. The cost is a result set the size of order_items, so use this pattern when the row-level detail is actually needed, not as a default replacement for GROUP BY.

LATERAL joins and correlated subqueries

When you need a per-row aggregate (the total for each order, or the most recent child row), a lateral join keeps the parent’s grain and evaluates the child aggregation row by row.

-- PostgreSQL: LATERAL join
SELECT o.id, o.customer_id, items.total, items.item_count
FROM orders o,
LATERAL (
 SELECT SUM(price_cents) AS total, COUNT(*) AS item_count
 FROM order_items
 WHERE order_id = o.id
) items;

One row per order, aggregation computed inside the lateral subquery, no multiplication. This is often faster than joining and then grouping, especially when orders is heavily filtered, order_items is large, and order_items.order_id is indexed.
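
On engines without LATERAL, correlated scalar subqueries in the SELECT list give the same one-row-per-parent shape. The sketch below shows that variant (it is not the LATERAL query above) on SQLite via Python's sqlite3, which has no LATERAL; the data is invented, including one order with no items.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER NOT NULL);
    CREATE TABLE order_items (id INTEGER PRIMARY KEY, order_id INTEGER NOT NULL, price_cents INTEGER NOT NULL);
    INSERT INTO orders VALUES (1, 1), (2, 1);            -- order 2 has no items
    INSERT INTO order_items (order_id, price_cents) VALUES (1, 100), (1, 200);
    """
)

per_order = conn.execute(
    """
    SELECT o.id,
           (SELECT SUM(oi.price_cents) FROM order_items oi WHERE oi.order_id = o.id) AS total,
           (SELECT COUNT(*)            FROM order_items oi WHERE oi.order_id = o.id) AS item_count
    FROM orders o
    ORDER BY o.id
    """
).fetchall()

print(per_order)
# [(1, 300, 2), (2, None, 0)]
```

Order 2 keeps its row, with a NULL total and a zero count: SUM over no rows is NULL while COUNT over no rows is 0. The trade-off against LATERAL is one subquery per aggregate instead of one for all of them.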

Schema-level defenses

Query-level fixes only work if the person writing the query knows to apply them. Schema-level guarantees work for every query, forever.

Foreign keys tell the query planner about cardinality. PostgreSQL in particular uses FK metadata to improve its row estimates for joins, and uses uniqueness on the joined side to drop redundant LEFT JOINs during planning. Beyond the integrity benefits, FKs make the shape of the data visible to both humans and the planner. (Foreign Keys Are Not Optional goes deeper on why skipping them compounds into silent corruption over time.)

Unique constraints on bridge tables prevent accidental many-to-many explosions. A bridge table with PRIMARY KEY (a_id, b_id) can’t contain duplicates, so joining through it can’t multiply rows because of duplicate bridge entries (only because of legitimate N:M relationships).

CREATE TABLE order_item_promotions (
 order_item_id BIGINT NOT NULL REFERENCES order_items(id),
 promotion_id BIGINT NOT NULL REFERENCES promotions(id),
 PRIMARY KEY (order_item_id, promotion_id) -- prevents duplicates
);

Without that composite primary key, a bug in the application layer that inserts the same (order_item_id, promotion_id) pair twice would silently double revenue for that item in any query joining through the bridge. With it, the database rejects the duplicate at write time.
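
The write-time rejection looks like this in practice. A minimal sketch on SQLite via Python's sqlite3 (a stand-in engine; the foreign key references are left out to keep the example self-contained, and the ids are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE order_item_promotions (
        order_item_id INTEGER NOT NULL,
        promotion_id INTEGER NOT NULL,
        PRIMARY KEY (order_item_id, promotion_id)  -- prevents duplicates
    )
    """
)
conn.execute("INSERT INTO order_item_promotions VALUES (1, 7)")

# Simulate the application bug: the same pair is inserted a second time.
try:
    conn.execute("INSERT INTO order_item_promotions VALUES (1, 7)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

bridge_rows = conn.execute("SELECT COUNT(*) FROM order_item_promotions").fetchone()[0]

print(duplicate_rejected, bridge_rows)
# True 1
```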

Schema comments on tables and columns document the cardinality and semantics that aren’t visible from the DDL. A line like COMMENT ON TABLE order_item_promotions IS 'N:M bridge. One row per (item, promotion). Joining this multiplies order_item rows by avg promotions-per-item.' tells every future engineer exactly what the table does to row counts. (Comment Your Schema covers the mechanics across MySQL and PostgreSQL and why this metadata layer is almost always empty.)

Denormalized totals, when the trade-off is worth it. For heavily queried aggregates (order totals, user balance, post comment counts), storing the aggregate on the parent table eliminates the join entirely. The write-path cost is keeping the denormalized value consistent through application code, triggers, or scheduled reconciliation. For high-read, low-write aggregates, the read simplicity often wins. For everything else, computing on demand is cleaner.

Denormalization has its own failure mode

A stored orders.total_cents that’s out of sync with SUM(order_items.price_cents) is its own form of silent corruption, moved from the query layer to the write layer. Either invest in keeping it consistent (triggers, reconciliation jobs) or don’t denormalize it at all. A half-maintained denormalized aggregate is worse than no denormalization.
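
A reconciliation job can be as small as one query that lists the orders whose stored total has drifted. This is a sketch of one possible check, not a prescribed implementation: it compares against SUM(price_cents) as in the sentence above (multiply by quantity if the schema stores unit prices), the data is invented, and SQLite via Python's sqlite3 is a stand-in engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total_cents INTEGER NOT NULL);
    CREATE TABLE order_items (id INTEGER PRIMARY KEY, order_id INTEGER NOT NULL, price_cents INTEGER NOT NULL);
    INSERT INTO orders VALUES (1, 300), (2, 999);       -- order 2's stored total is stale
    INSERT INTO order_items (order_id, price_cents) VALUES (1, 100), (1, 200), (2, 500);
    """
)

drifted = conn.execute(
    """
    SELECT o.id,
           o.total_cents                      AS stored_cents,
           COALESCE(SUM(oi.price_cents), 0)   AS computed_cents
    FROM orders o
    LEFT JOIN order_items oi ON oi.order_id = o.id   -- keep orders with no items
    GROUP BY o.id, o.total_cents
    HAVING o.total_cents <> COALESCE(SUM(oi.price_cents), 0)
    ORDER BY o.id
    """
).fetchall()

print(drifted)
# [(2, 999, 500)]
```

The grain here is one row per order, and the join to order_items is aggregated before it is compared, so the check itself doesn't fall into the multiplication trap.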

The pause that schema-reading assistants don’t take

A schema-reading assistant asked for “total revenue by customer” reads the catalog, finds the chain of tables it needs, writes the JOINs, adds the SUM, and hands back a query that looks right. The pause described in the section below (“wait, does this join multiply rows?”) is a step the model doesn’t take unless the prompt asks for it. The catalog tells the assistant that customers, orders, order_items, and the order_item_promotions bridge exist; it doesn’t tell it that joining through the bridge duplicates every order_items row once per promotion. The inflated total and the correct one look the same on the way back.

The same schema-level defenses that help humans give the model more to work with. FK metadata lets a catalog-reading tool see which joins are 1:N versus N:M. Composite primary keys on bridge tables prevent the “duplicate-in-bridge” multiplier from ever materializing in the data. Table comments that spell out cardinality (something like 'N:M bridge. Joining this multiplies order_item rows by avg promotions-per-item.') put the warning in the part of the schema the assistant actually reads. This doesn’t replace the pause described below; it narrows the set of cases where the pause has to do all the work.

The mental model

The shortcut that prevents most of these bugs: before writing an aggregation, picture the row count at every stage of the query.

Start with the leftmost table. How many rows?
Each join: does this multiply, preserve, or filter the row count?
At the point where the aggregate runs: what is the grain of each row? What does “one row” represent?
Does the aggregate make sense at that grain?

When the answer is “one row represents an order item, but I’m summing an order-level field,” the bug is already obvious. When the answer is “one row represents an order, and I’m summing order totals,” the query is correct.

This isn’t a skill that scales with query complexity; it’s a habit that kicks in before the query gets written. The senior engineers who never seem to hit these bugs aren’t writing smarter SQL. They’re pausing before the SUM and asking what row they’re actually summing over.

Putting it together

Cardinality bugs are a specific kind of wrong: syntactically valid, semantically broken, and invisible to every automated check. Tests pass. Code reviews approve. Reports render. The numbers just happen to be wrong.

The defense is structural, not tactical. Understand the cardinality of each relationship before writing the join. Aggregate at the grain where the data lives. Use the schema to make cardinality explicit: foreign keys, composite primary keys on bridges, comments that document the shape. When diagnosing a wrong number, strip the aggregation and look at the raw rows; the multiplication is almost always visible as soon as the SUM is out of the way.

The worst thing about silent bugs is that they stay silent. A crash gets fixed; wrong numbers persist for quarters. The habit of thinking about cardinality first (before writing the aggregation, not after someone flags the total) is one of the highest-leverage habits in working with relational data.

Uniqueness and Selectivity: The Two Numbers That Drive Query Plans

Mon, 23 Dec 2024 00:00:00 +0000

TL;DR

Uniqueness governs correctness, selectivity governs performance. The interesting parts of both live in the edge cases: partial unique indexes and their UPSERT targeting quirks, the way partitioning weakens every uniqueness guarantee, correlated columns that defeat planner assumptions, stale statistics that turn a 5ms query into a 5-minute one. Declaring the constraints the planner can see and keeping its statistics fresh buys more than any amount of query rewriting.

Everyone who works with relational databases knows UNIQUE. What they often don’t know is how it behaves under partitioning, how ON CONFLICT targets it (and doesn’t), and what the planner actually does with it beyond rejecting duplicates. Selectivity is in the same category. The definition is trivial, but the behavior that matters lives in composite column ordering, stale statistics, and the correlated-columns problem that breaks the planner’s core assumption.

This is the territory where “the query is correct” and “the query is fast” stop being the same question, and both depend on what the database can actually prove about the data. The constraints are the contract between the schema and the planner. Everything else is inference.

Partial and filtered unique indexes

PostgreSQL supports partial unique indexes: uniqueness enforced only over rows matching a predicate. This is the right tool for the common real-world case “email must be unique among active users”:

-- PostgreSQL: email unique only among non-soft-deleted rows.
CREATE UNIQUE INDEX users_active_email_uniq
 ON users (email)
 WHERE deleted_at IS NULL;

A plain UNIQUE (email) forces a choice: keep the constraint and block re-registration (frustrating users whose accounts were soft-deleted long ago), or drop it to allow re-registration (and lose the uniqueness guarantee for active rows as well). The partial index lets both coexist.
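
The behavior can be tried without a PostgreSQL instance, since SQLite also supports partial unique indexes with the same CREATE INDEX ... WHERE form. The sketch below uses SQLite through Python's sqlite3 as a stand-in; the users rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL, deleted_at TEXT);
    CREATE UNIQUE INDEX users_active_email_uniq
        ON users (email)
        WHERE deleted_at IS NULL;
    """
)

insert = "INSERT INTO users (email, deleted_at) VALUES (?, ?)"
conn.execute(insert, ("alice@example.com", "2020-01-01"))  # old soft-deleted account
conn.execute(insert, ("alice@example.com", None))          # re-registration: allowed

# A second active row with the same email violates the partial index.
try:
    conn.execute(insert, ("alice@example.com", None))
    second_active_rejected = False
except sqlite3.IntegrityError:
    second_active_rejected = True

total_rows = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]

print(second_active_rejected, total_rows)
# True 2
```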

MySQL doesn’t support partial unique indexes directly. The workaround exploits MySQL’s treatment of NULL as distinct under UNIQUE (covered in NULL: Three-Valued Logic):

-- MySQL idiom: generated column that's NULL for deleted users.
ALTER TABLE users
 ADD COLUMN email_active VARCHAR(255)
 GENERATED ALWAYS AS (CASE WHEN deleted_at IS NULL THEN email END) VIRTUAL,
 ADD UNIQUE KEY users_active_email_uniq (email_active);

The constraint effectively fires only for rows where email_active is non-NULL: exactly partial-index semantics, just expressed through a generated column. Awkward to write, but portable-ish and the ORMs catch on eventually.

Partitioned tables force uniqueness compromises

Partitioned tables in both PostgreSQL and MySQL require the partition key to be part of every unique constraint, including the primary key. The rule exists because unique indexes are enforced per partition: without the partition key in the constraint, the database would have to check every partition on every insert to enforce uniqueness, defeating the point of partitioning.

The practical consequence is that PRIMARY KEY (id) isn’t allowed on a table partitioned by created_at. It has to become PRIMARY KEY (id, created_at). The same applies to every other unique constraint: UNIQUE (email) on a users table partitioned by region becomes UNIQUE (email, region), which quietly weakens the guarantee. The schema now allows the same email to exist in multiple regions, whether or not the application ever intended that.
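
The weakening is a property of the composite constraint itself, so it can be shown without any partitioning at all. The sketch below uses an ordinary, unpartitioned SQLite table (via Python's sqlite3) purely to illustrate what UNIQUE (email, region) does and does not reject; the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        email TEXT NOT NULL,
        region TEXT NOT NULL,
        UNIQUE (email, region)   -- what UNIQUE (email) becomes once region is the partition key
    )
    """
)

insert = "INSERT INTO users (email, region) VALUES (?, ?)"
conn.execute(insert, ("alice@example.com", "eu"))
conn.execute(insert, ("alice@example.com", "us"))   # same email, different region: accepted

# Only an exact (email, region) repeat is rejected.
try:
    conn.execute(insert, ("alice@example.com", "eu"))
    same_region_rejected = False
except sqlite3.IntegrityError:
    same_region_rejected = True

rows_for_email = conn.execute(
    "SELECT COUNT(*) FROM users WHERE email = ?", ("alice@example.com",)
).fetchone()[0]

print(same_region_rejected, rows_for_email)
# True 2
```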

This is one of the sharper trade-offs in partitioning decisions. A uniqueness guarantee the schema used to provide gets weaker, and point lookups that used to be single-row const accesses become ref lookups because the full primary key isn’t spelled out in every query. How Partitioning Turns WHERE id = 12345 Into a 36-Partition Scan covers the full picture, including why partitioning by the primary key itself (when the PK is monotonically increasing) sidesteps the trade-off entirely.

UPSERT targeting is more specific than it looks

INSERT ... ON CONFLICT (PostgreSQL) and INSERT ... ON DUPLICATE KEY UPDATE (MySQL) bind to specific unique constraints, not to “any uniqueness that happens to apply.” The difference between the two engines is where most of the subtle bugs live.

PostgreSQL is explicit. ON CONFLICT (email) requires a unique constraint or unique index exactly matching email. If none exists, the statement errors out. If a partial unique index exists instead of a plain one, ON CONFLICT (email) does not match it; the conflict target has to carry a predicate that implies the index’s predicate, which in practice means repeating it:

-- Must match the partial index's predicate to target it.
INSERT INTO users (email, name)
VALUES ('alice@example.com', 'Alice')
ON CONFLICT (email) WHERE deleted_at IS NULL
DO UPDATE SET name = EXCLUDED.name;

If the partial index changes (predicate tightened, column added), every ON CONFLICT targeting it has to change too. This is explicit coupling, but it’s coupling.
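The coupling can be observed without a PostgreSQL instance. The sketch below again leans on Python’s sqlite3 module as a stand-in: SQLite’s upsert accepts the same ON CONFLICT (...) WHERE ... form and, like PostgreSQL, refuses a bare ON CONFLICT (email) when the only unique index on email is partial. The table and statement names are hypothetical, and the exact error text differs between engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id         INTEGER PRIMARY KEY,
        email      TEXT NOT NULL,
        name       TEXT NOT NULL,
        deleted_at TEXT
    );
    CREATE UNIQUE INDEX users_active_email_uniq
        ON users (email) WHERE deleted_at IS NULL;
    INSERT INTO users (email, name) VALUES ('alice@example.com', 'Alice');
""")

UPSERT_WITH_PREDICATE = """
    INSERT INTO users (email, name) VALUES (?, ?)
    ON CONFLICT (email) WHERE deleted_at IS NULL
    DO UPDATE SET name = excluded.name
"""
UPSERT_WITHOUT_PREDICATE = """
    INSERT INTO users (email, name) VALUES (?, ?)
    ON CONFLICT (email)
    DO UPDATE SET name = excluded.name
"""

# Predicate repeated: the partial index is targeted and the row is updated.
conn.execute(UPSERT_WITH_PREDICATE, ("alice@example.com", "Alice B."))
names = [row[0] for row in conn.execute("SELECT name FROM users")]

# Predicate omitted: no matching unique index, so the statement is rejected.
try:
    conn.execute(UPSERT_WITHOUT_PREDICATE, ("alice@example.com", "Alice C."))
    bare_target_error = None
except sqlite3.OperationalError as exc:
    bare_target_error = str(exc)
```

The useful property is that the failure is loud: a statement whose target no longer matches any index stops working instead of quietly inserting duplicates.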

MySQL is implicit and more dangerous. ON DUPLICATE KEY UPDATE fires on conflict with any unique key on the table, not just the one the query author had in mind. If the table has UNIQUE (email) and UNIQUE (external_id), an insert that conflicts on either key triggers the update. For rows where the inserted email matches one existing row and the external_id matches a different one, the behavior depends on which index is checked first and is undefined as far as the language is concerned.

The practical implication: adding a new unique key to a table can silently change the semantics of every existing INSERT ... ON DUPLICATE KEY UPDATE against that table. There’s no error, no warning, just different behavior on the next conflict that falls into the new key’s path. On large schemas with dozens of unique keys, this is the UPSERT equivalent of action at a distance.

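The mitigation on MySQL is to prefer INSERT ... ON DUPLICATE KEY UPDATE only when there’s a single obvious unique key, and to reach for an explicit SELECT ... FOR UPDATE + conditional UPDATE/INSERT flow when the semantics need to be explicit. REPLACE is not the escape hatch it looks like: it also acts on any unique key, and it deletes every conflicting row before inserting the new one.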

Unique indexes also concentrate deadlock pressure

Two deadlock patterns are specific to unique indexes and show up almost nowhere else:

Duplicate-key inserts take locks even when they fail. When InnoDB detects a duplicate on insert, it doesn’t just raise the error; it first acquires a shared next-key lock on the conflicting row. Under REPEATABLE READ (the default), that lock covers the gap too. Two concurrent transactions inserting near the same unique key can deadlock on those shared locks before either sees the duplicate-key error. The most common production signature: a batch-upsert worker hitting the same hot row ranges from multiple threads.

ON DUPLICATE KEY UPDATE batches deadlock when key ordering differs. Each row in a batch insert acquires its lock when the row is processed, not when the batch starts. Two batches touching overlapping keys (A, B) vs (B, A) take locks in opposite order and cycle. The fix is either sorting rows by unique key before the batch (so lock acquisition order is consistent across workers) or switching to an insert-only pass (INSERT IGNORE in MySQL, INSERT ... ON CONFLICT DO NOTHING in PostgreSQL) plus a separate targeted UPDATE pass.

Neither of these shows up the same way with non-unique indexes; the uniqueness check itself is what forces the extra locking. It’s the cost of making the database enforce the guarantee, and it scales badly once the hot-key set is small and write concurrency is high. (Database Deadlocks, Part 1 covers the broader patterns; Part 2 covers reading the log, retries, and prevention.)
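The sort-before-batch fix for the second pattern is small enough to sketch. The helper names and the email key below are assumptions for illustration; the only point is that every worker orders its batch the same way before sending it, so no two batches ever request the same pair of row locks in opposite order.

```python
def keys_of(rows, key="email"):
    """Unique-key values in the order the batch would process them."""
    return [row[key] for row in rows]

def lock_order_conflicts(batch_a, batch_b):
    """True if two batches visit their shared keys in different orders.

    Assumes each key appears at most once per batch. A differing order means
    at least one pair of keys is locked A-then-B by one worker and B-then-A
    by the other, which is the cycle described above.
    """
    shared = set(batch_a) & set(batch_b)
    order_a = [k for k in batch_a if k in shared]
    order_b = [k for k in batch_b if k in shared]
    return order_a != order_b

def sorted_for_locking(rows, key="email"):
    """Sort a batch by its unique key so all workers lock in one order."""
    return sorted(rows, key=lambda row: row[key])

worker_1 = [{"email": "a@example.com", "n": 1}, {"email": "b@example.com", "n": 2}]
worker_2 = [{"email": "b@example.com", "n": 3}, {"email": "a@example.com", "n": 4}]

before = lock_order_conflicts(keys_of(worker_1), keys_of(worker_2))
after = lock_order_conflicts(
    keys_of(sorted_for_locking(worker_1)),
    keys_of(sorted_for_locking(worker_2)),
)
```

Any total order works for this purpose as long as it is the same one in every worker; sorting does nothing for the first pattern, where the contention is on a single hot key range.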

Composite index column ordering

The order of columns in a composite index is a selectivity decision that determines whether the index helps the query it was built for. The usual rules compress to three:

Equality filters before range filters. An index on (customer_id, created_at) is efficient for WHERE customer_id = 42 AND created_at > '2026-01-01'. Reversed ((created_at, customer_id)), the index has to scan a wide range of created_at values and filter customer_id as a secondary step, which reads far more index entries and, on a wide range, can be no better than a sequential scan.
More selective column first for equality-only predicates. For filters of the form WHERE a = ? AND b = ?, the column with more distinct values goes first so the first lookup narrows more aggressively.
Match the query’s access pattern. An index on (a, b, c) serves queries filtering by a, (a, b), or (a, b, c). It does not serve queries filtering by b alone, c alone, or (b, c). The leading column is load-bearing.

These interact with covering index considerations, sort order requirements, and the planner’s ability to combine multiple indexes via bitmap scans. But the starting point is: think about how the index will be read, not what columns are available to throw at it.
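The leading-column rule is easy to see in a plan. A minimal sketch, using SQLite’s EXPLAIN QUERY PLAN through Python’s sqlite3 module as a stand-in for the PostgreSQL and MySQL planners (table t, its columns, and idx_abc are hypothetical, and a planner with statistics and skip-scan support can behave differently):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER PRIMARY KEY, a INT, b INT, c INT, payload TEXT);
    CREATE INDEX idx_abc ON t (a, b, c);
""")

def plan(where_clause):
    """Return the planner's description for SELECT * FROM t WHERE <clause>."""
    rows = conn.execute(
        f"EXPLAIN QUERY PLAN SELECT * FROM t WHERE {where_clause}"
    ).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the detail text

plan_a = plan("a = 1")              # leading column: index search
plan_ab = plan("a = 1 AND b = 2")   # leading prefix: index search
plan_b = plan("b = 2")              # skips the leading column: table scan
plan_bc = plan("b = 2 AND c = 3")   # still no leading column: table scan

for label, text in [("a", plan_a), ("a,b", plan_ab), ("b", plan_b), ("b,c", plan_bc)]:
    print(f"{label:>4}: {text}")
```

The first two plans name idx_abc; the last two fall back to scanning the table, even though both b and c are in the index.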

MySQL clustered indexes flip the rule

The above applies cleanly to secondary indexes. The MySQL InnoDB primary key is a different animal: a clustered index, meaning the PK’s leaf pages are the table. The ordering of PK columns decides physical row order on disk, and that often matters more than selectivity.

The canonical example is PRIMARY KEY (tenant_id, id) on a multi-tenant table. tenant_id has maybe 10K distinct values (low selectivity); id is near-unique. By “most selective first,” the answer would be (id, tenant_id), and it would be wrong:

Physical clustering. All rows for one tenant sit contiguously in the B-tree. Tenant-scoped range scans read a narrow slice of pages sequentially, and the buffer pool caches a single tenant’s hot data together. (id, tenant_id) scatters that same tenant’s rows across the whole table.
Secondary index lookups cost less. InnoDB secondary indexes store the PK, not a row pointer. A query that uses a secondary index and then needs a full row does a PK lookup per match. With (tenant_id, id), those lookups for one tenant cluster together. With (id, tenant_id), each is random I/O across the table.
Insert locality. If id is monotonically increasing within a tenant, inserts land on recent pages per tenant, avoiding page splits scattered across the index.

The rule for an InnoDB PK is: put the column that represents the dominant access pattern first, even if it’s less selective. Selectivity cuts rows; clustering cuts I/O. On a large clustered index, I/O usually dominates.
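A toy simulation makes the I/O argument concrete. It models only row order, not InnoDB’s real page format: rows are sorted by the primary key and cut into fixed-size pages, and the page size, tenant count, and round-robin id allocation are all assumptions chosen to keep the arithmetic visible.

```python
ROWS_PER_PAGE = 100     # assumed page capacity
TENANTS = 50
ROWS_PER_TENANT = 200

# (tenant_id, id) pairs; ids are handed out round-robin across tenants,
# the way interleaved traffic against one auto-increment counter would.
rows = [
    (tenant, seq * TENANTS + tenant)
    for seq in range(ROWS_PER_TENANT)
    for tenant in range(TENANTS)
]

def pages_touched(rows, pk_order, tenant):
    """Count the pages holding one tenant's rows under a given PK ordering."""
    ordered = sorted(rows, key=pk_order)
    return len({
        position // ROWS_PER_PAGE
        for position, row in enumerate(ordered)
        if row[0] == tenant
    })

# PRIMARY KEY (tenant_id, id): one tenant's rows are contiguous.
tenant_first = pages_touched(rows, lambda r: (r[0], r[1]), tenant=7)

# PRIMARY KEY (id, tenant_id): the same rows are spread over the whole table.
id_first = pages_touched(rows, lambda r: (r[1], r[0]), tenant=7)

print(f"(tenant_id, id): {tenant_first} pages; (id, tenant_id): {id_first} pages")
```

Under these assumptions, reading one tenant’s 200 rows touches 2 pages with the tenant-first key and all 100 pages with the id-first key.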

This is also why PRIMARY KEY (id) plus INDEX (tenant_id) on a multi-tenant table is often slower than PRIMARY KEY (tenant_id, id); the secondary index forces a PK-lookup hop on every read that the clustered choice avoids entirely.

PostgreSQL’s primary key is a separate B-tree unique index, not clustered (a CLUSTER command exists but isn’t maintained as rows are inserted), so the ordering logic there stays closer to the secondary-index rules.

The planner doesn’t read the data - it reads statistics

The planner’s entire decision-making process rests on statistics that summarize the data, not the data itself. PostgreSQL’s per-column statistics live in pg_stats:

SELECT
 attname,
 n_distinct, -- estimated distinct values (negative = minus the distinct fraction of rows)
 null_frac, -- fraction of rows that are NULL
 most_common_vals, -- top values by frequency
 most_common_freqs, -- corresponding frequencies
 histogram_bounds -- distribution of non-common values
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'customer_id';

MySQL exposes similar information through information_schema.STATISTICS and INNODB_TABLESTATS, though less granularly than PostgreSQL’s statistics. MySQL has no per-column histograms before 8.0; 8.0+ supports them, but only on columns where one has been created explicitly with ANALYZE TABLE ... UPDATE HISTOGRAM.

These statistics are gathered by ANALYZE in PostgreSQL (run by hand, or by autovacuum’s auto-analyze once enough rows have changed) and maintained automatically by InnoDB in MySQL. They go stale between runs. A table that was analyzed at 10M rows and is now 200M rows has planner statistics that no longer reflect reality. Join reorderings based on those estimates are decisions made on outdated data.

The usual symptom is a query that was fast yesterday and slow today, with no schema or query change. The planner’s row estimate for some step has drifted far enough from reality that the plan shape flipped: nested loop where it should have been hash join, or a sequential scan where an index seek would have won. EXPLAIN ANALYZE with its estimated-vs-actual row counts is the fastest way to confirm this:

EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;
-- Index Scan using orders_customer_id_idx
-- (cost=0.43..1234.56 rows=1000 width=128)
-- (actual time=0.123..45.678 rows=180000 loops=1)

The rows=1000 is the estimate. The actual rows=180000 is reality. A ratio of 100x+ between them is the signal. The fix is statistical (refresh stats, increase the statistics target for that column, add extended statistics for correlated columns) and not a query rewrite.
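Pulling that ratio out of plan text is mechanical. The sketch below is a rough helper, not a real plan parser: the regular expression assumes the text layout shown above, the 10x threshold is an arbitrary default, and the function name is made up. Keep in mind that PostgreSQL reports actual rows per loop, so nodes with loops greater than 1 need the loop count factored in by hand.

```python
import re

# Matches the "(cost=... rows=N ...) (actual time=... rows=M loops=K)" pair
# that EXPLAIN ANALYZE prints for each plan node.
NODE = re.compile(
    r"\(cost=[\d.]+\.\.[\d.]+ rows=(?P<est>\d+) width=\d+\)\s*"
    r"\(actual time=[\d.]+\.\.[\d.]+ rows=(?P<actual>[\d.]+) loops=\d+\)"
)

def misestimates(plan_text, threshold=10.0):
    """Return (estimated, actual, ratio) for nodes off by at least threshold."""
    findings = []
    for match in NODE.finditer(plan_text):
        estimated = int(match["est"])
        actual = float(match["actual"])
        # Clamp to 1 so a zero-row node does not divide by zero.
        high = max(estimated, actual, 1)
        low = max(min(estimated, actual), 1)
        ratio = high / low
        if ratio >= threshold:
            findings.append((estimated, actual, ratio))
    return findings

sample_plan = """
Index Scan using orders_customer_id_idx on orders
  (cost=0.43..1234.56 rows=1000 width=128)
  (actual time=0.123..45.678 rows=180000 loops=1)
"""

findings = misestimates(sample_plan)
for estimated, actual, ratio in findings:
    print(f"estimated {estimated}, actual {actual:.0f}, off by {ratio:.0f}x")
```

For the sample plan, the helper reports a single node that is off by 180x.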

Cardinality estimation errors and their shape

The single most common cause of bad query plans in production is a bad row-count estimate on an intermediate step. Two flavors, each with distinctive symptoms:

Underestimates. The planner thinks a step will return 10 rows, actually returns 10 million. The plan picks a nested loop (good for a small outer side), which now runs 10 million iterations. A query that should have been a 50ms hash join takes 50 minutes. The telltale sign in EXPLAIN ANALYZE is loops=10000000 on an inner node that was costed for a handful.

Overestimates. The planner thinks 10 million rows, actually 10. The plan allocates a hash table sized for millions, spills to disk under memory pressure, and runs a 5ms lookup in 5 seconds. Less common but more insidious, because the query didn’t “fail” in any obvious way; it just used more memory and I/O than it needed.

Both are failures of the statistics, not the query. Both are especially hard to diagnose because the query text is identical in the fast and slow cases; only the planner’s belief about the data changed. When the ratio between estimated and actual is large and consistent, the problem is upstream of the query.

Correlated columns break the independence assumption

The planner estimates the selectivity of a compound predicate WHERE a = x AND b = y by multiplying the individual selectivities, assuming the columns are statistically independent. When they’re not, the estimate can be off by orders of magnitude.

The canonical example is (country, state):

EXPLAIN ANALYZE
SELECT * FROM addresses WHERE country = 'US' AND state = 'CA';
-- Estimate: (0.25) * (0.02) * N = 0.5% of rows
-- Reality: ~2% of rows - state = 'CA' implies country = 'US'

The planner assumed the two filters cut the rowcount independently. In reality, state = 'CA' already determines country = 'US' (there are no California rows with a different country) so the compound filter isn’t as selective as the multiplication suggests.
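The arithmetic can be reproduced on a toy dataset. The rows below are synthetic, sized so the individual selectivities match the 25% and 2% used in the example; nothing here touches a real planner, it only contrasts the multiplied estimate with a direct count.

```python
# Synthetic (country, state) rows: every 'CA' state row is also a 'US' row.
rows = (
    [("US", "CA")] * 20      # California, all inside the US
    + [("US", "NY")] * 230   # other US rows
    + [("DE", "BY")] * 750   # rows outside the US
)

def selectivity(predicate):
    """Fraction of rows matching a single-column predicate."""
    return sum(1 for row in rows if predicate(row)) / len(rows)

sel_country = selectivity(lambda r: r[0] == "US")   # 0.25
sel_state = selectivity(lambda r: r[1] == "CA")     # 0.02

# Independence assumption: multiply the two selectivities.
estimated_rows = sel_country * sel_state * len(rows)

# What the compound filter really returns.
actual_rows = sum(1 for r in rows if r == ("US", "CA"))

print(f"estimated {estimated_rows:.0f} rows, actual {actual_rows} rows")
```

The multiplied estimate comes out at about 5 rows against 20 actual: a 4x underestimate from two perfectly accurate single-column statistics.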

PostgreSQL 10+ supports extended statistics to fix this:

CREATE STATISTICS country_state_corr (dependencies, ndistinct)
 ON country, state FROM addresses;
ANALYZE addresses;

The dependencies statistic captures functional dependencies (one column determines another) and is what corrects the independence-assumption multiplication for equality filters like the one above. The ndistinct statistic captures the number of distinct combinations of the column set, which improves GROUP BY estimates over those columns.

MySQL has no equivalent. Correlated-column estimation errors there are harder to fix at the planner level; the workaround is usually to restructure the query (force a specific join order, introduce an intermediate CTE, or add a covering index that captures the correlated access pattern directly).

UNIQUE as a planner signal, not just a guardrail

A UNIQUE constraint is also a proof the planner can use. Knowing a column is unique lets the optimizer reason about the shape of joins and aggregates in ways it can’t when uniqueness is only implicit:

Deduplication elimination. SELECT DISTINCT o.id FROM orders o JOIN users u ON u.id = o.user_id can skip the DISTINCT step entirely if the planner knows that both orders.id and users.id are unique. Each order matches at most one user, so the join produces at most one row per o.id and the DISTINCT becomes a no-op. Flip the query to SELECT DISTINCT u.id and the dedup pass is required again, because a user with three orders appears three times. Without the declared uniqueness, the planner has to run the dedup pass either way.
Join elimination. When joining A to B on a unique column of B, and selecting only columns from A, the planner can drop the join entirely in some cases, typically a LEFT JOIN, where it can prove the join neither adds nor removes rows. This is a real optimization on star-schema queries.
Reorderable joins. Unique constraints make certain join orderings provably equivalent, giving the optimizer more plan shapes to choose from. The more plans it can try, the more likely it finds a good one.
Index-only scan eligibility. Unique indexes are natural targets for index-only scans, which skip the heap/table access when every column the query needs is already in the index.

Schemas that leave uniqueness implicit (enforced in application code, promised in a wiki) can still produce correct results, but the planner can’t trust assumptions it can’t see. The constraint is what turns uniqueness from a property of the data into a property of the schema that the planner reads as a fact.

What UNIQUE tells a schema-reading model

The planner isn’t the only consumer of declared uniqueness. Schema-reading assistants (Copilot, MCP-backed agents, text-to-SQL tools) read information_schema.TABLE_CONSTRAINTS and pg_constraint the same way they read column types. A declared UNIQUE is the only signal in the catalog that says “at most one row per X.” Without it, the model has no way to prove 1:1 semantics and either hedges with a defensive LIMIT 1 it can’t justify or writes GROUP BY / DISTINCT passes that shouldn’t be necessary. ON CONFLICT and ON DUPLICATE KEY UPDATE targeting is especially fragile: the model picks the column name that matches the prompt (“upsert by email”) and the query either fails at runtime because no unique constraint exists on that column, or silently targets a different constraint than intended.

Selectivity is the part the model has even less access to. Planner statistics (pg_stats.n_distinct, MySQL’s information_schema.STATISTICS cardinality estimates) aren’t part of the prompt for most schema-aware tools, and the model has no way to query them mid-generation. Asked “how do I speed this query up?” the assistant’s default answer is “add an index,” regardless of whether the indexed column has two distinct values or two million. The same schema discipline that keeps the planner honest (declared unique constraints on every at-most-one relationship, composite primary keys on bridge tables, column-level comments that describe the value shape) is what gives catalog-reading models enough context to produce queries that don’t require a second human pass.

Diagnosing the usual suspects

Three patterns cover most of the uniqueness/selectivity-shaped bugs in production:

“This query got slow and nothing changed.” Run EXPLAIN ANALYZE. Compare estimated to actual row counts on each node. A large ratio (10x+) means the planner has stale statistics, missing extended statistics on correlated columns, or both. Refresh stats with ANALYZE; add extended statistics if a compound predicate is the source.

“I built an index and the planner ignores it.” Check how selective the predicate really is: the share of the table it matches. Once a single value matches more than a few percent of the rows (~5% is a rough rule of thumb), a sequential scan is usually the right choice and the planner isn’t wrong. If the predicate is selective, check for functions in the WHERE clause (non-SARGable predicates), implicit type casts (an indexed VARCHAR column compared against a numeric literal can fall off the index), or stale statistics underreporting the column’s uniqueness.

“My UPSERT corrupts data under load.” Check which unique key it’s targeting. In MySQL, ON DUPLICATE KEY UPDATE fires on conflict with any unique key, including ones added after the query was written. In PostgreSQL, a partial unique index is only targeted when ON CONFLICT repeats its predicate; a mismatch fails with an error rather than falling through, but rows outside the index predicate never conflict and are always inserted as new rows.
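For the second pattern, the two numbers worth having in hand (distinct values over total rows, and the share of the table held by the most common value) can be measured with two aggregate queries. The sketch below runs them against a throwaway SQLite table through Python’s sqlite3 module; the orders data is synthetic, the column_profile helper is hypothetical, and the same queries work unchanged on PostgreSQL or MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INT, status INT)"
)
# 10,000 rows: 500 customers with 20 orders each; status is 1 for 90% of rows.
conn.executemany(
    "INSERT INTO orders (customer_id, status) VALUES (?, ?)",
    [(i % 500, 1 if i % 10 else 2) for i in range(10_000)],
)

def column_profile(table, column):
    """Distinct ratio and largest single-value share for one column.

    Table and column names are interpolated directly, so they must come
    from trusted code, never from user input.
    """
    distinct, total = conn.execute(
        f"SELECT COUNT(DISTINCT {column}), COUNT(*) FROM {table}"
    ).fetchone()
    (largest,) = conn.execute(
        f"SELECT MAX(n) FROM (SELECT COUNT(*) AS n FROM {table} GROUP BY {column})"
    ).fetchone()
    return {
        "distinct_ratio": distinct / total,
        "largest_value_share": largest / total,
    }

status_profile = column_profile("orders", "status")
customer_profile = column_profile("orders", "customer_id")
print("status:", status_profile)
print("customer_id:", customer_profile)
```

Here status = 1 matches 90% of the table, so an index on status alone is rightly ignored for that value, while any single customer_id matches 0.2% of the rows and is a good index candidate.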

The mental model

Uniqueness and selectivity collapse to two questions that both the planner and the engineer need answered for every table and query:

How many rows per key? Uniqueness. Determines whether joins multiply, whether UPSERTs target the right constraint, and whether aggregations can be trusted.
How many distinct values relative to total? Selectivity. Determines whether indexes help, which join order the planner picks, and how badly a compound filter will miss.

Both answers are visible to the planner if the constraints are declared and the statistics are current. Both become guesswork when they’re not. The habit that pays off isn’t heroic query tuning. It’s keeping the database’s model of the data honest: declare the unique constraints that exist (including composite ones on bridge tables), refresh statistics on busy tables, add extended statistics where correlation has burned you before, and read EXPLAIN ANALYZE for the ratio between estimated and actual rows every time a query slows down.

Comment Your Schema

Mon, 18 Nov 2024 00:00:00 +0000

TL;DR

Every major database engine lets you attach comments to tables and columns: descriptions that live in the schema itself and show up in every tool that reads it. They cost nothing to add, require no downtime, and make every schema dump, ER diagram, and monitoring tool more useful. Almost nobody uses them.

A new engineer is debugging a customer support ticket: “order #4421 shows up as ‘Failed’ in the admin tool but the customer received it.” She opens the orders table in DataGrip and finds status TINYINT NOT NULL. The admin tool displays “Failed” when status = 2. The fulfillment service ships when status = 3. The reporting view treats status = 1 as “active.” None of the three definitions are in the schema, and nobody on the team remembers the original mapping; the engineer who designed the table left eight months ago.

Resolving the ticket takes ninety minutes: grep three service codebases, find three different mappings, reconcile them against the actual row’s status = 2, draft the customer email. Every part of that work happens because the integer values aren’t grounded anywhere the database knows about, and the code’s three guesses disagree. The mechanism that would have grounded them in the catalog has existed in every major database engine since the 1990s. The team has just never written it down.

What schema comments are

Schema comments are metadata strings attached directly to database objects: tables, columns, indexes, views. They’re stored in the database catalog and exposed through standard metadata queries.

In PostgreSQL:

COMMENT ON TABLE orders IS 'Customer purchase orders. One row per checkout.';
COMMENT ON COLUMN orders.status IS '1=pending, 2=processing, 3=shipped, 4=delivered, 5=cancelled';
COMMENT ON COLUMN orders.end_date IS 'NULL means order is still in progress';

In MySQL:

ALTER TABLE orders MODIFY COLUMN status TINYINT NOT NULL
 COMMENT '1=pending, 2=processing, 3=shipped, 4=delivered, 5=cancelled';

-- Or at table creation:
CREATE TABLE orders (
 id BIGINT PRIMARY KEY AUTO_INCREMENT,
 user_id BIGINT NOT NULL COMMENT 'References users.id',
 status TINYINT NOT NULL COMMENT '1=pending, 2=processing, 3=shipped, 4=delivered, 5=cancelled',
 total_cents BIGINT NOT NULL COMMENT 'Order total in cents, not dollars',
 end_date DATE DEFAULT NULL COMMENT 'NULL = order still in progress'
) COMMENT='Customer purchase orders. One row per checkout.';

SQL Server uses extended properties, Oracle uses COMMENT ON. The syntax varies. The concept is universal.

Where comments show up

This is the part that makes comments more useful than a wiki page or a README. Because they live in the catalog, they propagate automatically to every tool that reads schema metadata.

Schema dumps. pg_dump and mysqldump include comments in the output. Anyone restoring a backup or reviewing a migration gets the context without looking elsewhere.

ER diagram tools. DBeaver, DataGrip, pgAdmin, MySQL Workbench all render column comments in schema viewers and diagrams. Hover over a column and the description is right there.

information_schema and catalog queries. Any script, tool, or automation that queries metadata picks up comments for free.

In PostgreSQL:

-- Get a table comment
SELECT obj_description('orders'::regclass, 'pg_class');

-- Get a column comment (second argument is the column's ordinal position)
SELECT col_description('orders'::regclass, 1);

-- Or just use psql
\d+ orders

In MySQL:

SELECT COLUMN_NAME, COLUMN_COMMENT
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'your_database'
 AND TABLE_NAME = 'orders';

ORM introspection. Many ORMs and code generators that reverse-engineer models from databases will pull comments into generated code as docstrings or annotations.

Monitoring and alerting tools. When an alert fires about a table or column, the comment provides immediate context without requiring someone to look up external documentation.

CI pipelines and schema validation. Linting tools can check whether new tables and columns have comments, the same way code linters check for function docstrings.

The point is that comments flow through the entire toolchain. You write them once, in one place, and every tool that reads the schema benefits, without configuration, without plugins, without maintaining anything separately.

What goes wrong without them

The absence of comments creates a category of problems that compounds over time. Onboarding takes longer than it should: every new engineer who encounters status TINYINT has to ask someone or investigate. Multiply that by every ambiguous column in every table across every service, and it stops being a one-time cost. It’s paid every time someone new touches the schema.

Debugging becomes archaeology. When something breaks at 2am and you’re looking at a table with columns named type, flag, ref_id, and config, all with no comments, you’re not debugging. You’re reverse-engineering institutional knowledge that should have been written down.

Schema reviews lose context too. A migration that adds is_processed TINYINT(1) DEFAULT 0 looks fine syntactically, but processed by what, and when, and is the processing idempotent? A comment turns the review from “does this look right?” into “does this match what we agreed on?”

External documentation drifts the moment someone adds a column or changes a status code without updating the doc. Comments live in the schema itself. They move with ALTER TABLE. They show up in every dump. They can’t be in a different repo than the data they describe.

What to comment

Not everything needs a comment. A column called created_at TIMESTAMP NOT NULL is self-documenting. Focus on the columns where the schema doesn’t tell the whole story:

Status and type columns. What do the values mean? 1=active, 2=suspended, 3=closed.
Nullable columns where NULL has meaning. Does NULL mean “not set,” “not applicable,” or “ongoing”?
ID columns that reference other tables without foreign keys. owner_id BIGINT COMMENT 'References users.id'.
Columns with non-obvious units. total_cents vs total (dollars? cents? units?), duration (seconds? milliseconds? minutes?).
Columns with business logic encoded in values. plan_type TINYINT COMMENT '1=free, 2=starter, 3=pro, 4=enterprise'.
Tables themselves. What does this table represent? One row per what?

A good table comment answers: “What is one row in this table?” A good column comment answers: “What does this value mean when I see it in a query result?”

Adding comments to an existing schema

This is the part that makes the cost-benefit ratio hard to argue against.

In PostgreSQL, COMMENT ON is a catalog-only operation. It takes only a SHARE UPDATE EXCLUSIVE lock on the table, which does not block reads or writes. It doesn’t rewrite data. On a table with 500 million rows, it completes in milliseconds.

-- This is instant. No table rewrite. No blocked reads or writes. No downtime.
COMMENT ON COLUMN orders.status IS '1=pending, 2=processing, 3=shipped, 4=delivered, 5=cancelled';

In MySQL, ALTER TABLE ... MODIFY COLUMN with a comment is a metadata-only change in most cases with InnoDB online DDL, but behavior depends on the version and what else is in the MODIFY. For comment-only changes on MySQL 8.0+, the ALGORITHM=INSTANT path applies. On older versions or when combined with type changes, it may trigger a table rebuild.

-- MySQL 8.0+: instant for comment-only changes
ALTER TABLE orders MODIFY COLUMN status TINYINT NOT NULL
 COMMENT '1=pending, 2=processing, 3=shipped, 4=delivered, 5=cancelled';

Zero-downtime operation

In PostgreSQL, COMMENT ON is a catalog-only update: no rewrite, no blocked reads or writes (it takes only a SHARE UPDATE EXCLUSIVE lock), and it completes in milliseconds even on tables with hundreds of millions of rows. In MySQL 8.0+, comment-only changes go through the ALGORITHM=INSTANT path. This is about as safe as database changes get.

The risk profile is as close to zero as database changes get. There’s no reason not to do this incrementally; comment a few columns every time you touch a table. Over time, the schema becomes self-documenting.

Generating documentation from comments

Because comments live in the catalog, tools can extract them and produce browsable documentation automatically. A few worth knowing:

SchemaSpy. Java-based, generates interactive HTML with ER diagrams. Reads table and column comments from the catalog. Run it against your database and you get a full documentation site with relationships, comments, and diagrams; no manual authoring.

tbls. A single Go binary, CI-friendly. Outputs Markdown, PlantUML, or SVG. Reads comments directly from the schema. Designed to run in pipelines: generate docs on every migration, commit them to the repo, and they stay in sync.

DataGrip / DBeaver. Not doc generators per se, but both render column comments inline in their schema browsers. For teams already using these, comments become immediately visible without any extra tooling.

The pattern is the same across all of them: comments in the schema become descriptions in the output. No separate documentation source to maintain. The schema is the source.

For teams that want generated docs as part of CI, tbls is the lowest-friction option: add it to your pipeline, point it at the database, and commit the Markdown output. Every migration that adds or changes a comment updates the docs automatically.

The RAG surface most teams forget

Schema-reading assistants (Copilot, MCP-backed agents, text-to-SQL tools, retrieval-augmented coding models) start with the same catalog every human does: information_schema, pg_description, \d+. If the catalog contains only column names and types, that’s the context the model gets. A column named status TINYINT is ambiguous to the model for the same reason it’s ambiguous to a new engineer, except the model won’t ping the on-call channel; it will generate a plausible query and hand it back. Published studies on text-to-SQL accuracy have put the lift from adding column-level semantic descriptions as high as ~27%, not because models are bad at reading schemas, but because most schemas don’t tell them enough to read.

Comments are the one catalog field that can carry business meaning. Every other metadata row is mechanical: type, nullability, length, constraint name. A comment on orders.status ('1=pending, 2=processing, 3=shipped, 4=delivered, 5=cancelled') turns a blind guess into a grounded answer for any tool that reads the catalog, human or otherwise. It’s the cheapest RAG context a team can ship: no vector store, no separate doc pipeline, no sync problem; the description travels with the column it describes. If the team is rolling out database-aware AI assistants and hasn’t commented the ambiguous columns first, the assistants are working from less context than a new hire would get on day one.

Making it stick

The challenge isn’t the mechanism, it’s the habit. Comments rot just like any other documentation if they’re not maintained. A few things help:

Comment at creation time. If the comment is part of the CREATE TABLE or migration, it happens naturally. Retrofitting is always harder than including it from the start.

Add it to your migration template. If your team uses a migration tool, add comment fields to the template. Make the absence visible rather than the default.

Lint for it. A simple CI check can flag tables or columns without comments. It doesn’t have to block merges - even a warning changes behavior over time.

Treat comments as part of the schema, not as documentation. When a column’s semantics change, the comment changes in the same migration. Same PR, same review.
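The lint step does not need a dedicated tool to get started. A minimal sketch, assuming the CI job has already fetched (table, column, comment) rows from the catalog (for MySQL, the information_schema.COLUMNS query shown earlier); the function name, the ignore list of self-documenting columns, and the sample rows are all hypothetical.

```python
def missing_comments(columns, ignore=("id", "created_at", "updated_at")):
    """Return 'table.column' names that have no usable comment.

    columns: iterable of (table, column, comment) tuples from the catalog.
    ignore:  column names treated as self-documenting (an assumption; tune it).
    """
    return sorted(
        f"{table}.{column}"
        for table, column, comment in columns
        if column not in ignore and not (comment or "").strip()
    )

# Sample rows in the shape the catalog query would return.
catalog_rows = [
    ("orders", "id", ""),
    ("orders", "status", "1=pending, 2=processing, 3=shipped, 4=delivered, 5=cancelled"),
    ("orders", "end_date", None),       # no comment at all
    ("orders", "total_cents", "   "),   # whitespace only counts as missing
    ("orders", "created_at", ""),       # ignored: self-documenting
]

uncommented = missing_comments(catalog_rows)
for name in uncommented:
    print(f"warning: {name} has no comment")
```

Whether the job prints warnings or fails the build is a policy choice; starting with warnings keeps the check from blocking merges while the backlog is worked down.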

The trade-offs

Comments aren’t a substitute for all documentation. They’re good at describing what a column or table is, not how a multi-table workflow operates. System-level documentation (data flow diagrams, service interaction maps, runbooks) still belongs somewhere else.

Watch out

Stale comments are worse than no comments; they actively mislead. A column that says '1=active, 2=inactive' when the code now also uses 3=suspended will send someone down the wrong path. Treat comment updates as part of the migration, not a follow-up task.

There’s also a maintenance cost: a comment has to change every time the column’s meaning does, or it turns into the stale comment described above. The mitigation is treating comments as schema, not as a nice-to-have; they change when the schema changes.

For teams with hundreds of tables and thousands of columns, retrofitting comments is a slow process. It’s not a weekend project. It’s an incremental habit that pays off over months.

Where to start

Start with the columns that make people ask questions. The status column where 0/1/2 means something nobody can quote from memory. The nullable date that means “ongoing” in one place and “missing” in another. The foreign key with no foreign key. Those are the columns where a one-line COMMENT ON recovers more institutional knowledge per character than any other change a schema can absorb.

Foreign Keys Are Not Optional

Fri, 01 Nov 2024 00:00:00 +0000

TL;DR

Foreign keys are the last line of defense against orphaned data, silent corruption, and integrity issues that compound over time. Application-level validation covers the happy path; production finds every other path. The overhead is negligible; the cost of skipping them isn’t.

A new engineer joins the team and runs SELECT count(*) FROM order_items oi LEFT JOIN orders o ON o.id = oi.order_id WHERE o.id IS NULL as part of a routine schema audit. The number comes back 4,127. Four thousand line items pointing at orders that no longer exist. She asks the tech lead when this started, and the answer is “we dropped the foreign key two years ago during a launch crunch to shave write latency on the bulk import path.” The PR that did it has eleven approvals and one comment: “good catch.” The orphan cleanup will take a week, and the bigger question - what else has accumulated - will take longer.

Most of the orphans trace back to two services racing to write the parent row. Service A creates the order, service B creates the line items, and B sometimes ran first when A’s deploy lagged. With the FK in place, B’s inserts would have failed and the orchestration layer would have retried. Without it, B’s inserts landed. The retries never fired. The orphans piled up at roughly six per day for two years.

The happy path isn’t the only path

Application-level validation works great when everything is running normally. When every deploy goes clean, when no one is running a backfill at 2am, when your ORM is doing exactly what you think it’s doing. Production is the set of conditions where one of those isn’t true.

Every developer writing a query, a migration, or a backfill script has to carry the full mental model of the schema in their head: which tables depend on which, what breaks if a row disappears, where the implicit relationships are. Without foreign keys, that mental model is the only thing keeping the data consistent. The example writes itself.

-- Looks harmless. Is it?
INSERT INTO order_items (order_id, product_id, qty) VALUES (9999, 7, 2);
-- order 9999 doesn't exist. No FK, no error. Silent corruption.

Think of it this way

A foreign key is to data integrity what a type system is to application code: it catches mistakes at write time instead of letting them surface as bugs in production.

With a foreign key, the database rejects this immediately, the same way a compiler catches a type error before runtime. Without one, you find out weeks or months later when a report doesn’t add up or a customer calls about missing data. By then, your backups may have already rotated out. The data is just gone.
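For comparison, here is the same INSERT with the constraint declared. This is a sketch in PostgreSQL; the table definitions are trimmed-down assumptions (product_id is left unconstrained for brevity), and the constraint name in the error is the one PostgreSQL generates by default.

```sql
-- Minimal parent and child tables; only the order_id relationship is enforced here.
CREATE TABLE orders (
  id bigint PRIMARY KEY
);

CREATE TABLE order_items (
  id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  order_id   bigint  NOT NULL REFERENCES orders (id),
  product_id bigint  NOT NULL,
  qty        integer NOT NULL
);

-- Order 9999 was never created, so the write is refused at insert time.
INSERT INTO order_items (order_id, product_id, qty) VALUES (9999, 7, 2);
-- ERROR:  insert or update on table "order_items" violates foreign key constraint "order_items_order_id_fkey"
-- DETAIL:  Key (order_id)=(9999) is not present in table "orders".
```

The failure is loud and lands on the writer that caused it, which is what lets a retry or an alert fire.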

Service contracts drift the same way. Service A creates the parent record; service B creates the child. B ships a bug and starts referencing IDs that don’t exist. Without a foreign key there’s no error, just bad data accumulating quietly until someone notices the numbers don’t add up. ORMs have their own edge cases: race conditions in bulk inserts, upserts that skip association checks, lazy loading that masks broken references. Every major ORM has documented ways to let bad data slip through. The database is the one place where the check is guaranteed.

This is one specific case of a broader pattern: where business logic lives, database vs. application. Referential integrity is the textbook example of a correctness invariant that every write path has to pass through, and the database is the only layer that sees them all.

The performance question

If your architecture is so perfectly optimized that a foreign key check is the last thing left to tune, that’s not an FK problem. There are almost certainly unindexed queries, missing covering indexes, suboptimal join patterns, N+1 queries, poor partitioning strategy, collation mismatches forcing implicit conversions, functions wrapped around predicates killing index usage, datatype mismatches between join columns, oversized datatypes wasting pages and cache, stale statistics misleading the planner, parameter sniffing locking in bad plans, redundant data bloating tables, under-normalized or over-normalized schemas, tables with too many columns per row, tables with too few columns forcing constant joins, bad query designs pulling more data than needed, misconfigured OS settings, undersized buffer pools, wrong parallelization thresholds, and the list goes on. All hiding somewhere in the stack. If removing an FK constraint is the performance win on the table, it’s worth looking harder at everything else first.

Foreign keys add a check on every insert and update to the child table; the database verifies that the referenced row exists. In practice, this is a lookup against a primary key index. It’s fast. Microseconds. The overhead is negligible compared to the cost of tracking down integrity issues after the fact. Teams routinely spend weeks debugging problems that a foreign key would have caught at insert time.
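The lookup described above covers writes to the child table. The reverse direction is worth a note: deleting or re-keying a parent row makes the database search the child table for references, and PostgreSQL does not create an index on the referencing column automatically. A hedged sketch, assuming the orders and order_items tables from this post and a hypothetical index name:

```sql
-- Inserts into order_items check orders through its primary key index: already covered.
-- Deletes from orders must look for matching order_items rows; without an index on
-- order_items.order_id, that lookup has to scan the child table.
-- CONCURRENTLY avoids blocking writes during the build, but cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY IF NOT EXISTS order_items_order_id_idx
  ON order_items (order_id);
```

With that index in place, both directions of the check are index lookups.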

In practice, the FK check is almost never the bottleneck. This is:

EXPLAIN ANALYZE
SELECT * FROM events WHERE created_at > '2024-01-01';
-- Seq Scan on events (cost=0.00..4125892.80 rows=198234567 ...)
-- Planning Time: 0.2 ms
-- Execution Time: 287643.109 ms

That missing index on a 200-million-row table is the bottleneck. Not the FK check.
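A sketch of the usual first response, assuming PostgreSQL and a hypothetical index name. Whether the planner actually picks the index depends on how selective the predicate is: a range that matches most of the table will still, and correctly, get a sequential scan.

```sql
-- Hypothetical index name. CONCURRENTLY builds without blocking writes,
-- but cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY events_created_at_idx ON events (created_at);

-- Refresh planner statistics, then look at the plan again.
-- Plain EXPLAIN shows the chosen plan without executing the query.
ANALYZE events;
EXPLAIN
SELECT * FROM events WHERE created_at > '2024-01-01';
```

If the plan does not change, the next things to check are the width of the date range and whether the query really needs SELECT *.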

It’s much harder to add them later

Adding a foreign key to a table with a few thousand rows is trivial. Adding one to a table with hundreds of millions of rows in a production database that’s been running for years is a different story entirely.

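The database has to validate every existing row against the constraint. On MySQL (InnoDB), adding a foreign key while foreign_key_checks is enabled falls back to a table-copying ALTER TABLE that blocks writes for its duration; on a large table, that can mean minutes or hours. On PostgreSQL, you can split it into two steps, so that only the first, brief step takes a heavy lock and the validation scan runs without blocking normal reads and writes: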

-- Step 1: add the constraint without validating existing rows
ALTER TABLE order_items
 ADD CONSTRAINT fk_order
 FOREIGN KEY (order_id) REFERENCES orders(id)
 NOT VALID;

-- Step 2: validate (full table scan, can't avoid it)
ALTER TABLE order_items VALIDATE CONSTRAINT fk_order;
-- ERROR: insert or update on table "order_items" violates
-- foreign key constraint "fk_order"
-- Detail: Key (order_id)=(9912) is not present in table "orders".

The longer you wait, the harder it gets

Adding a foreign key to a table with years of accumulated data means finding and resolving every orphaned row first. On a large, busy database, that cleanup alone can take weeks.

There it is: orphaned data that’s been silently accumulating. The constraint can’t be validated until every violation is found and resolved, and on a database with years of drift that means careful decisions about which rows to delete and which to repair. Starting with the constraints on day one avoids this entirely.
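One way the cleanup can go, sketched for PostgreSQL. It assumes the NOT VALID constraint from step 1 is already in place (so no new orphans can be written while you work), and order_items_orphans is a hypothetical quarantine table. Whether deleting is the right resolution is a business decision; some orphans may deserve a restored parent instead.

```sql
-- 1. Measure the damage: child rows whose parent is missing.
--    NULL order_id values are skipped because a foreign key does not reject them.
SELECT count(*)
FROM order_items oi
WHERE oi.order_id IS NOT NULL
  AND NOT EXISTS (SELECT 1 FROM orders o WHERE o.id = oi.order_id);

-- 2. Keep a copy before touching anything (hypothetical table name).
CREATE TABLE order_items_orphans AS
SELECT oi.*
FROM order_items oi
WHERE oi.order_id IS NOT NULL
  AND NOT EXISTS (SELECT 1 FROM orders o WHERE o.id = oi.order_id);

-- 3. Remove the orphans from the live table.
DELETE FROM order_items oi
WHERE oi.order_id IS NOT NULL
  AND NOT EXISTS (SELECT 1 FROM orders o WHERE o.id = oi.order_id);

-- 4. Validation should now pass.
ALTER TABLE order_items VALIDATE CONSTRAINT fk_order;
```

On a table with hundreds of millions of rows, step 3 would normally be run in batches rather than as a single statement.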

What foreign keys actually give you

Beyond preventing bad data, foreign keys serve as living documentation. They tell every engineer who looks at the schema:

This table depends on that table
These rows cannot exist without those rows
This is the shape of the data, enforced by the system itself

Documentation gets outdated. Code comments drift. Constraints are always current because the database enforces them on every write.

Foreign keys also help the query planner. Since version 9.6, PostgreSQL uses FK relationships when estimating join selectivity, so joins that follow a declared constraint get more accurate row estimates. You’re protecting your data and helping the database perform better in the same migration.

FKs as the schema’s relationship map

The documentation value compounds once the readers aren’t all human. information_schema.KEY_COLUMN_USAGE in MySQL, pg_constraint in PostgreSQL: foreign keys are queryable catalog metadata, and every schema-reading assistant (Copilot, MCP-backed agents, text-to-SQL tools, RAG systems indexing the catalog) uses that metadata to reason about how tables connect. A declared FK is a machine-readable statement that order_items.order_id references orders.id. The model doesn’t have to guess from the column name.
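As an illustration of what that metadata looks like to a tool, the query below lists every declared foreign key in a PostgreSQL database. It is a sketch of one way to read the catalog, not a claim about how any particular assistant does it.

```sql
-- One row per foreign key: which table references which, and the full definition.
SELECT c.conname                   AS constraint_name,
       c.conrelid::regclass        AS child_table,
       c.confrelid::regclass       AS parent_table,
       pg_get_constraintdef(c.oid) AS definition
FROM pg_constraint c
WHERE c.contype = 'f'               -- 'f' marks foreign key constraints
ORDER BY c.conrelid::regclass::text, c.conname;
```

For the order_items example, the definition column would read along the lines of FOREIGN KEY (order_id) REFERENCES orders(id): the relationship stated outright, with nothing to infer from names.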

Drop the constraint and the signal disappears. The assistant falls back to guessing joins by name match, which works for obvious cases (user_id → users.id) and fails on the real-world column vocabulary every mature schema accumulates: creator_id, modified_by, owner, assigned_to, ref_id, parent. Each of those is a logical FK with no metadata backing it, and a schema-reading model will confidently invent a relationship that doesn’t hold. Adding the FK fixes the integrity hole and, in the same migration, makes the schema self-describing to every tool that consumes catalog metadata, including the ones that didn’t exist when the table was first created.
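A rough way to find some of those unbacked relationships is to list columns that look like references but appear in no foreign key. The sketch below assumes PostgreSQL, the public schema, and a naming convention where reference columns end in _id; by construction it misses exactly the harder names listed above (modified_by, owner, parent), so treat the output as a starting list only.

```sql
-- Columns named like references (…_id) that no foreign key constraint covers.
SELECT a.attrelid::regclass AS table_name,
       a.attname            AS column_name
FROM pg_attribute a
JOIN pg_class     t ON t.oid = a.attrelid
JOIN pg_namespace n ON n.oid = t.relnamespace
WHERE n.nspname = 'public'          -- assumption: application tables live in public
  AND t.relkind IN ('r', 'p')       -- ordinary and partitioned tables
  AND a.attnum > 0                  -- skip system columns
  AND NOT a.attisdropped
  AND a.attname ~ '_id$'
  AND NOT EXISTS (
    SELECT 1
    FROM pg_constraint c
    WHERE c.contype = 'f'
      AND c.conrelid = a.attrelid
      AND a.attnum = ANY (c.conkey)
  )
ORDER BY 1, 2;
```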

The NoSQL comparison

It’s worth noting that document databases like MongoDB don’t have this problem in the same way. When an order, its line items, and its shipping address all live inside a single document, there’s nothing to orphan - integrity is structural. The data can’t reference something that doesn’t exist because it’s all embedded together.

That’s actually one of the real strengths of the document model. The moment a relational database splits that same data across orders, line_items, and addresses tables, those relationships need to be enforced somewhere. The application can try, but the database is the only place that guarantees it across every write path: manual queries, migrations, ORM edge cases, and all.

Foreign keys exist because relational databases chose normalization over duplication. That’s a good trade-off, but only if the relationships are actually enforced.

Compare the two models:

// Document model - integrity is structural
{
 "order_id": 1001,
 "user": { "id": 42, "name": "Alice" },
 "items": [
  { "product": "Widget", "qty": 2 },
  { "product": "Gadget", "qty": 1 }
 ]
}

-- Relational model - integrity must be enforced
CREATE TABLE orders (
 id INT PRIMARY KEY,
 user_id INT NOT NULL REFERENCES users(id)
);

CREATE TABLE order_items (
 id INT PRIMARY KEY,
 order_id INT NOT NULL REFERENCES orders(id),
 product_id INT NOT NULL REFERENCES products(id),
 qty INT NOT NULL
);

In the document, there’s nothing to orphan. In the relational model, remove those REFERENCES clauses and every row is on its own.

When it’s reasonable to skip them

In an OLTP system, almost never. If you’re using a relational database, you want relational integrity. That’s the whole point. If you don’t need enforced relationships between your data, a relational database might not be the right tool in the first place.

Where foreign keys can be impractical is in specific edge cases:

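Partitioned tables, where foreign key support has historically been limited. PostgreSQL 11 allowed foreign keys on a partitioned table that reference a regular table, and PostgreSQL 12 added foreign keys that reference a partitioned table. One limitation remains: the referenced columns need a primary key or unique constraint, and on a partitioned table that constraint must include the partition key, so the FK often has to carry the partition key as well.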
Staging tables used for temporary ETL ingestion before data is validated and moved to its final destination.
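For the partitioned case, a sketch of what a foreign key into a partitioned table can look like on PostgreSQL 12 or later. The tables are a hypothetical variant of the orders schema, partitioned by date; the point is that the parent's primary key has to include the partition key, so the child carries both columns.

```sql
-- Parent is range-partitioned by created_at, so its primary key must include created_at.
CREATE TABLE orders (
  id         bigint NOT NULL,
  created_at date   NOT NULL,
  PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

CREATE TABLE orders_2024 PARTITION OF orders
  FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

-- The child references the full composite key, including the partition key column.
CREATE TABLE order_items (
  id               bigint PRIMARY KEY,
  order_id         bigint NOT NULL,
  order_created_at date   NOT NULL,
  qty              integer NOT NULL,
  FOREIGN KEY (order_id, order_created_at)
    REFERENCES orders (id, created_at)
);
```

The extra column on the child is the price of keeping enforcement; whether that is acceptable depends on how the child table is queried and written.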

Even in analytics and data warehouses, integrity still matters; orphaned or dangling references mean wrong aggregations, broken joins, and reports that silently lie. The enforcement mechanism might look different, but the need for referential integrity doesn’t go away just because the workload changed.

Before you drop one

Before dropping a foreign key for performance, exhaust the thousand other ways to tune your system first. The check itself is a primary-key lookup measured in microseconds; on a profile, it’s almost always rounding error compared to the indexes you haven’t built or the queries that no covering index serves. The constraint also hands the planner join information for free. It is almost never the bottleneck, and removing it has a habit of creating new ones a year or two down the line, usually discovered by an engineer who wasn’t on the team when the original PR landed.