PostgreSQL as a queue

Jun 8, 2026 · 21 min readBackend Postgres

For about four years I owned the parcel-tracking job queue at a logistics SaaS, and it ran on a single Postgres table. No Redis, no RabbitMQ, no SQS. One table, two indexes, a claim query built on SKIP LOCKED. At peak it pushed something like 40M jobs/day. That's a rounded peak off a holiday-surge dashboard, not a clean benchmark, so read it as "tens of millions," not a number I'd defend in a lab. The operation was boring on purpose. Here's how it worked, why I kept turning down the broker, the SLO I defended, the incidents that left marks, and where I'd have stopped recommending it.

40M

jobs / day at peak

< 2 min

queue-lag SLO

extra brokers

Transactional enqueue deletes a class of bugs

The lazy version of this post is "brokers are complicated, SQL is simple." That's not the argument. I re-evaluated the obvious options every time a new senior asked why we didn't just use a queue product.

Redis lost on the dual-write problem. A Redis list is fast and BRPOPLPUSH is nice, but the job is produced by a transaction that already commits to Postgres. A shipment row is written, and that write is what should spawn the tracking job. With the queue in Redis you write the shipment, then push to Redis as a second non-atomic step. Die in the gap and you've got a shipment with no job, or a job for a shipment that rolled back, plus a year of reconciliation sweepers hunting orphans. Persistence narrows the window. It doesn't close the seam.

RabbitMQ is a proper work queue with routing and fan-out, but it's a clustered stateful service with its own failure modes, upgrade dance, and on-call burden, and it still doesn't give transactional enqueue with my Postgres data. Kafka I rule out on a category error as much as cost: it's a partitioned log and pub/sub bus, not a work queue. Per-message ack, redelivery of one failed job, visibility timeouts — that's not what it's for. You can bolt a work-queue shape onto it, but you fight the grain of the tool.

SQS was the serious contender, since it's managed and I didn't want another stateful system. But it has the same seam from the other side: the enqueue is a network call to AWS outside my transaction, so I'm back to dual writes and an outbox table to make it reliable. At which point I'm already running a queue table in Postgres to feed the "real" queue, reconciling two systems instead of one. SQS is right when the producer isn't already transactional against a database you control. Mine was.

So the queue lived in Postgres for one decisive reason: the job row and the row that caused it commit in the same transaction. Both land or neither does. The outbox pattern exists to paper over the seam a broker introduces, and a queue inside the database has no seam. The second reason nearly matched it. The team already operated Postgres — backups, failover, monitoring, muscle memory for its internals. A broker meant a system nobody was fluent in yet, at 3am, mid-incident. The choice was deliberate: keep the number of stateful systems a small team must be excellent at as low as possible.

The SLO: lag, claim p99, flat DLQ

The mistake juniors make is calling a queue healthy when it's "running." A queue is a buffer, and a buffer is healthy only while it drains as fast as it fills. So I defined health numerically and put it on the wall:

Oldest ready job — queue lag — under two minutes at p99 of any one-minute window. The SLO that mattered.
Claim latency p99 under a few milliseconds. The claim is on every worker's hot path; once it crosses tens of milliseconds, something's wrong with the heap or the index, not with load.
Dead-letter growth flat over a rolling day. A DLQ that climbs means a handler is silently broken, not that one job failed.

I defended the lag SLO because it was the business. This was parcel tracking, so a lagging job is a tracking page showing stale carrier state. A customer refreshes, sees nothing new, opens a ticket. Enough of them and the carrier escalates. Queue lag wasn't an abstract metric. It was, with a short delay, the support-ticket volume for the next hour. Every decision below served lag < 2 min. When I talk about autovacuum or fairness, that's the number I was protecting.

The table: fillfactor and HOT updates

The whole thing fits in one table, but two physical-storage decisions matter as much as the columns:

sql

CREATE TABLE jobs (
    id           bigint      GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    kind         text        NOT NULL,
    payload      jsonb       NOT NULL,
    state        text        NOT NULL DEFAULT 'ready',
    run_at       timestamptz NOT NULL DEFAULT now(),
    locked_at    timestamptz,
    locked_by    text,
    attempts     int         NOT NULL DEFAULT 0,
    max_attempts int         NOT NULL DEFAULT 20,
    last_error   text
) WITH (
    fillfactor = 80,
    autovacuum_vacuum_scale_factor = 0.02,
    autovacuum_vacuum_cost_delay   = 0
);
 
-- The only index the hot path touches.
CREATE INDEX jobs_claimable ON jobs (run_at) WHERE state = 'ready';

A queue table is update-heavy. Every job is written once and updated several times: claimed, retried, completed. Postgres updates are copy-on-write, so each UPDATE writes a new tuple version. If that version fits on the same heap page, Postgres does a HOT (heap-only tuple) update. It chains the new version off the old one on the page and touches no index. That kills index write amplification and lets the next vacuum reclaim the dead version cheaply within the page. Leaving 20% of each page free at insert time gives later updates room to stay HOT. Skipping fillfactor is the difference between a queue that bloats itself to death and one that doesn't.

Index only the column whose change is the transition

The partial index is the whole game. The hot path only looks at ready rows, so the index stays proportional to backlog depth, not table size, even with tens of millions of done rows in the heap awaiting archival. Without it, dequeue degrades to a heap scan and p99 climbs into the seconds as the table grows.

There's a tension. Updating state moves a row in and out of jobs_claimable, an index write you can't avoid, since the predicate depends on state. But every other churning column (locked_at, locked_by, attempts, last_error) is deliberately out of every index, so once a row leaves ready, its heartbeat and retry bookkeeping touch no index and stay HOT. The rule: index only the column whose change is the queue transition, and keep every high-churn bookkeeping column un-indexed. One column earns its keep beyond the claim. That's kind, which is how I stopped one tenant from eating the queue, below.

The claim: SKIP LOCKED

sql

UPDATE jobs
SET state     = 'running',
    locked_at = now(),
    locked_by = $1,            -- worker id, for reaper + observability
    attempts  = attempts + 1
WHERE id IN (
    SELECT id FROM jobs
    WHERE state = 'ready' AND run_at <= now()
    ORDER BY run_at
    FOR NO KEY UPDATE SKIP LOCKED
    LIMIT $2
)
RETURNING id, kind, payload, attempts;

SKIP LOCKED makes this a queue instead of a thundering herd. Any row another worker holds in its in-flight transaction is silently skipped, not blocked on. Twenty workers running this concurrently each walk away with a disjoint batch. No advisory locks, no coordination.

The inner-SELECT/outer-UPDATE shape is mandatory, because you can't attach SKIP LOCKED to an UPDATE. Select IDs with skip semantics, update by id, and RETURNING hands the rows back in one round trip. Tune $2 against job duration: short jobs want larger batches to amortize the round trip, while long jobs want LIMIT 1 so one slow worker doesn't hoard a batch.

FOR NO KEY UPDATE, not FOR UPDATE

The lock strength is FOR NO KEY UPDATE, deliberately. FOR UPDATE takes the strongest row lock Postgres has, and it blocks anything needing a key-share lock, which includes transactions inserting or updating rows that hold a foreign key referencing this one. In a queue the job row is frequently the parent of audit/result rows that reference jobs.id, so a plain FOR UPDATE claim needlessly serializes against those FK checks. FOR NO KEY UPDATE locks against concurrent updates and deletes but doesn't block the key-share lock FK validation takes, since the claim never touches key columns. It's also the level Postgres itself takes for an ordinary UPDATE of non-key columns, so it matches the outer statement's semantics.

The day one carrier backed everyone up

A single shared queue with ORDER BY run_at is strict FIFO, and FIFO has a failure mode no tutorial mentions: a burst from one producer or one kind starves everything behind it. The claim doesn't know the five hundred thousand rows in front are all the same low-value backfill. It takes them in order.

I learned this the hard way. One afternoon a single carrier integration replayed its entire webhook history (a misconfigured retry on their side) and dumped a few hundred thousand carrier_sync jobs in a couple of minutes. All with a run_at of roughly now, they sat at the front of the FIFO line, and every other kind — tracking refreshes, notification sends — queued behind that wall. Lag on the jobs that mattered went from seconds to twenty minutes while workers chewed one carrier's backfill in arrival order. The queue was healthy by any "is it running" measure and completely failing the SLO.

Per-kind routing and per-tenant caps

The fix had two parts. First, route by kind. The kind column already existed, so I split the claim so a worker pool claims with WHERE kind = ANY($pool_kinds), and the partial index became (kind, run_at) WHERE state = 'ready' so each pool's claim still hit the index cleanly. Latency-sensitive kinds got their own pool a bulk backfill could never crowd out. Second, cap any single tenant within a pool:

sql

-- Weighted claim: no single carrier eats more than its share of a batch.
WITH ranked AS (
    SELECT id,
           row_number() OVER (PARTITION BY payload->>'carrier_id'
                              ORDER BY run_at) AS rn
    FROM jobs
    WHERE state = 'ready' AND run_at <= now() AND kind = ANY($1)
    ORDER BY run_at
    FOR NO KEY UPDATE SKIP LOCKED
    LIMIT $2 * 4
)
UPDATE jobs SET state='running', locked_at=now(), locked_by=$3, attempts=attempts+1
WHERE id IN (SELECT id FROM ranked WHERE rn <= $4)   -- per-carrier cap
RETURNING id, kind, payload, attempts;

It's not perfectly fair and doesn't try to be. Strict weighted fair queuing on a table is more machinery than it's worth. It's a cap: one hot tenant can use spare capacity but can't monopolize a pool. After that change the same carrier could replay its whole history again and customer-facing lag never moved. That's the version of fairness that defends an SLO.

Why the naive reaper is a footgun

The claim transaction commits the moment the worker holds the batch. It has to, or the lock would be held for the whole job duration and SKIP LOCKED would gain you nothing. The consequence: a job is durably running before it has run. If the worker OOMs, gets SIGKILL-ed mid-deploy, or the box panics, that row sits in running forever. You need a reaper:

sql

-- Resurrect rows whose lease has expired.
UPDATE jobs
SET state = 'ready', locked_at = NULL, locked_by = NULL
WHERE state = 'running'
  AND locked_at < now() - interval '5 minutes'
  AND attempts < max_attempts;

Here's the hazard nobody warns you about, and it bit me in production: this reaper can't tell a dead worker from a slow one. If a legitimate job runs past the lease, whether from a tail-latency spike, a slow downstream, or a GC pause, the reaper resurrects a job that's still executing, a second worker claims it, and the same job runs twice concurrently. The reaper doesn't fix double execution. Configured carelessly, it causes it.

Heartbeat lease plus idempotent handlers

Two correct mitigations, and you want both.

A heartbeat lease. The worker periodically bumps locked_at while it works (every 30 s for a 5-minute lease). The lease then expires only when the worker has genuinely stopped, not because a job is slow. The heartbeat touches locked_at only, which is un-indexed, so it stays HOT and is nearly free:

sql

UPDATE jobs SET locked_at = now()
WHERE id = ANY($1) AND locked_by = $2 AND state = 'running';

The locked_by = $2 guard makes this safe. If the reaper already resurrected the row and another worker re-claimed it, locked_by no longer matches and the heartbeat updates zero rows. The original worker detects that and aborts.

Idempotent handlers, unconditionally. Even with a heartbeat, a worker can complete a job, then crash in the microsecond before it commits state = 'done'. The lease expires, the job reruns. At-least-once delivery is the only honest guarantee a queue like this gives. Every handler must be keyed on the job id or a natural idempotency key so a second execution is a no-op.

I set the lease from the job-duration distribution, not a round number. Take a week of per-kind durations, grab the p99.9, set the visibility timeout a couple of multiples above. Long enough that a slow job never trips the reaper, short enough that a dead worker's jobs return before lag notices. Kinds with wildly different profiles (a sub-second notification versus a multi-minute reconciliation) got different leases rather than one global number wrong for both. Treat the heartbeat plus idempotency as the correctness boundary. The lease is a backstop.

LISTEN/NOTIFY with a poll backstop

Workers poll. Polling every 200 ms gives a latency floor and a steady drip of idle queries. LISTEN/NOTIFY removes the floor: the producer signals on commit and idle workers wake immediately.

sql

-- In the enqueue transaction, after the INSERT:
NOTIFY jobs_ready;
-- Worker, on startup:
LISTEN jobs_ready;

Be precise about what NOTIFY guarantees. It fires at commit time, so a notify in a transaction that rolls back is never delivered, which is what you want. But it's best-effort in-memory delivery: not durable, not queued for offline listeners, and a worker that drops and re-establishes its connection misses every notification sent in the gap. NOTIFY is a latency optimization, never a correctness mechanism. The poll loop stays as the backstop:

text

loop:
    wait for NOTIFY jobs_ready, or 1s timeout, whichever first
    run the claim query
    process the batch

Sub-millisecond pickup on a quiet queue, bounded worst-case latency if a notification is ever lost.

Retries, backoff, and the dead-letter table

Failure handling lives in the worker's completion transaction. On success it sets state = 'done'. On failure it reschedules with exponential backoff or, once attempts reaches max_attempts, moves the job to a dead-letter table so it stops consuming claim capacity:

sql

-- Transient failure: capped exponential backoff, jittered upstream.
UPDATE jobs
SET state = 'ready',
    run_at = now() + (least(power(2, attempts), 3600) * interval '1 second'),
    locked_at = NULL, locked_by = NULL, last_error = $2
WHERE id = $1 AND attempts < max_attempts;
 
-- Exhausted: move it out of the hot table entirely.
WITH dead AS (
    DELETE FROM jobs WHERE id = $1 AND attempts >= max_attempts RETURNING *
)
INSERT INTO jobs_dead SELECT * FROM dead;

Keeping dead jobs out of the live table matters. The partial index only sees ready rows, but the heap still has to be vacuumed, and a pile of permanently-failed rows is dead weight on every vacuum pass.

Retention, and who owns the dead letters

A dead-letter table is only useful if someone owns it, so I gave it a policy. Dead rows lived 30 days, then dropped. Long enough for a slow Monday-morning triage cycle and a holiday weekend, short enough that the table never grew into something you had to think about. The on-call engineer owned the DLQ. A daily alert fired if jobs_dead grew past a small threshold over the prior day, because a growing DLQ is almost always one broken handler. Triage meant grouping by kind and last_error, which nearly always pointed at a single deploy or a single upstream that changed shape.

Completed jobs followed a different policy, and this is where partitioning earned its place. I didn't DELETE completed rows. The jobs table was range-partitioned by completion time, and a scheduled job dropped whole partitions older than the window. That's an instant catalog operation rather than millions of row deletes vacuum then chases. Recent partitions stayed online for debugging and replay. Older ones were dropped, or for audit-required kinds, copied to cold storage first. Retention was a DROP PARTITION on a schedule, not a sweeper fighting autovacuum.

Autovacuum debt: the day "slow database" was a costume

The worst day looked like a database problem and wasn't. We were in a holiday spike, churn on jobs well above normal, and the first symptom was claim p99 creeping from the usual couple of milliseconds into the tens, then the low hundreds. Lag followed it over the two-minute SLO, and the tracking-page tickets started.

The dashboards said "the database is slow," so the first hypothesis was the obvious wrong one: the box is overloaded, scale it up, maybe the claim needs another index. We bumped the instance class. It bought twenty minutes, then latency climbed right back. Adding an index would have made it strictly worse, since that's more index writes per update, less HOT, more bloat, and I'm glad I talked myself out of shipping it.

The actual cause was autovacuum falling behind on the high-churn table. The spike pushed update volume past what autovacuum reclaimed, dead tuples piled up, and the heap and partial index bloated. Index-only scans stopped being index-only because the visibility map was stale, so each claim touched more pages and a few-millisecond probe turned into hundreds of milliseconds chasing bloat. The "slow database" was vacuum debt wearing a CPU-load costume. Scaling treated the symptom. It couldn't catch up on the debt.

The fix, in order of how much it helped

Aggressive per-table autovacuum so it kept pace under burst: autovacuum_vacuum_scale_factor = 0.02 and autovacuum_vacuum_cost_delay = 0, which is why those settings are on the table. Then confirming fillfactor = 80 so recovery updates stayed HOT and stopped generating index garbage. And the big one: getting completed rows out of the heap via partition DROP instead of DELETE, so the steady-state dead-tuple load dropped by an order of magnitude. After that the heap stopped bloating in the first place and the claim returned to single-digit milliseconds through the next spike.

The runbook entry I wrote that night still reads roughly: "Claim p99 climbing and lag rising under load — do not scale the box first, do not add an index. Check n_dead_tup and last_autovacuum on jobs. If dead tuples are climbing and autovacuum is lagging, this is vacuum debt. Trigger a manual VACUUM to stop the bleeding, confirm partition drops are running, and verify a migration did not reset the per-table autovacuum settings." It's saved the next on-call more than once.

Where the ceiling actually is

This pattern scaled further than I expected, and the limiter isn't the one people assume. The bottleneck is almost never raw SELECT throughput, since SKIP LOCKED claims are cheap. It's two things on the single primary. First, autovacuum keeping up with dead tuples: every claim, heartbeat, retry, and completion makes one, and a high-churn queue can generate them faster than autovacuum reclaims them. Second, WAL write throughput: every state transition is a logged write, so you're bounded by how fast one primary can fsync WAL, and you can't shard that away without losing the transactional-enqueue property that was the whole point.

In practice that put the sustainable ceiling in the low thousands to roughly 10k claims/s on a well-provisioned primary, well below where the planner starts to care. The most effective mitigation for the vacuum half is to stop deleting completed rows one at a time. Range-partition by completion time and DROP whole partitions instead: an instant catalog operation that produces zero dead tuples. Partition dropping bought the most headroom by far.

Owning it day to day

A queue you own is a queue you can see. The dashboard had five panels, each with an alert:

Oldest ready job (queue lag), per pool. Alert when it crosses the SLO. The page that woke me.
Claim rate and claim p99. Alert on p99 above a few milliseconds, the early warning for the vacuum-debt spiral, well before lag moves.
Running count. A sudden climb means workers are claiming but not completing: a stuck handler or downstream outage.
DLQ size and its daily delta. Flat is healthy; growing is a broken handler.
Autovacuum lag on jobs — dead tuples and time since last autovacuum. The leading indicator I didn't have the day of the incident, and the first thing I added after.

Two tools made it bearable. A DLQ replay/repair tool: pull dead rows by kind or error signature, fix the cause, and re-enqueue them in a controlled trickle rather than dumping a backlog that would itself trip the fairness problem. And discipline about payloads. The payload jsonb carried customer addresses and tracking identifiers, which is PII, so the rule was never to log raw payloads. Errors logged the job id and kind. If you needed the payload you went to the row under access control, not a log line that outlives its purpose in a search index. That rule predated any audit asking for it, which made the audit boring.

When to actually leave

For the large majority of backend systems I've seen, the database queue is the right call and a broker is premature. Keep it in Postgres when throughput is in the thousands of jobs/s or below, your jobs are produced by transactions that already hit the database, and you value transactional enqueue and one fewer thing to operate, back up, and fail over.

Leave for a dedicated broker when you hit a genuine wall, and only then. Real multi-consumer fan-out, where one event must reach many independent consumer groups (faking that with copied rows is write amplification you don't want). Sustained five-figure jobs/s the single primary's WAL and vacuum load can't absorb. Or strict total ordering across partitions, which SKIP LOCKED deliberately doesn't provide. Those are message-bus problems, and a work queue on a table is the wrong tool for them.

The day I would migrate is the day the queue table's vacuum load starts contending with the application's own queries on the same primary, and partitioning has stopped buying headroom. Until then, one table and one well-understood claim query, owned by someone who watches the lag and reads the autovacuum stats, will carry you a remarkably long way.

The claim query, the per-tenant fairness cap, the heartbeat lease and reaper, and the autovacuum notes from this post are a runnable crate: skip-locked — one Postgres table, with migrations, examples, and tests. The numbers here are committed benchmarks, not assertions: bench_throughput clears ~38k claims/s on the SKIP LOCKED hot path, and bench_claim_latency puts a single claim's p99 in the single-digit milliseconds — the floor being a transaction's commit, not the lock.