Migrating a Django queue to Rust
We ran a Celery worker fleet on carrier webhooks. Every status update a parcel carrier pushed became a task that normalized the payload, advanced the shipment state machine, and fanned out notifications. At roughly 8M tasks/day it worked, but badly. The migration to a Rust worker took a quarter. The "rewrote it in Rust, 40x faster" posts skip the second half. The language swap was the easy 30%. The boring parts did the work.
The SLO we kept breaching
This didn't start as "make it faster." It started as a recurring SLO breach. The contract with the carriers was unforgiving: ACK an inbound webhook with p99 < 100 ms, and keep end-to-end queue lag under about 2 minutes even during a retry storm. That second number is the one people skip. Carriers retry any webhook we don't ACK fast enough, so a slow tail didn't just hurt latency, it manufactured load. A webhook we were 400 ms late to ACK came back as a second and a third, each a fresh task doing the same work. The tail fed the volume that produced the tail.
That had a price tag, which is why it was a project and not a backlog ticket. Duplicate customer notifications and the support tickets behind them ("why did I get three texts saying my parcel shipped"). Wasted per-carrier API quota. And on contracts with delivery-confirmation SLAs and penalty clauses, line-item financial exposure. We weren't chasing a benchmark. We were trying to stop spending the error budget faster than it refilled, and every decision below was measured against "does this defend the SLO."
The three measured problems
- Tail latency.
p50was fine, around 18 ms;p99was 900 ms andp99.9over 4 s. The tail was a mix I never fully prised apart: Python's allocator churning through the per-task object graph, the odd GC pause, and CPU-bound normalization we could only scale by adding prefork processes. Nothing serialized across the children the way a thread-bound workload would, and prefork just didn't absorb spikes cleanly. And every slow ACK came back as a retry, sop99.9was a load multiplier, not a cosmetic number. - Memory. Each prefork child sat at ~280 MB RSS from the import graph and payload buffering, so the boxes were memory-bound long before they were CPU-bound.
- CPU efficiency. Normalization (parsing heterogeneous carrier XML and JSON into our canonical schema) was pure CPU dominated by Python object allocation. Profiling put about 70% of wall time in parse and validate.
Why not the cheaper options
A rewrite is the most expensive thing you can do to a working system, and "rewrite it in Rust" reads as competence while usually being appetite. So I spent real weeks trying not to migrate. It's the only honest way to justify the expensive one.
Optimize the Python path was the longest stop. We moved JSON to orjson, shifted validation off vanilla pydantic onto pydantic-core / msgspec for the hot structs, pulled apart the worst per-task allocations, looked hard at PyPy. It shaved p99 and tightened the median, but faster parsing didn't touch the thing killing us: each child carried ~280 MB regardless of how fast it parsed. Optimizing CPU inside a process doesn't give back the RAM the process costs to exist. That bought maybe a quarter of the headroom we needed, with no obvious second quarter.
Scale Celery horizontally meant buying RAM linearly, and boxes hit their memory ceiling well before their CPUs were busy. Packed tight enough that one payload buffering up could tip a box into swap. Past a point you're scaling runtime overhead at cloud rates, not the workload. A separate Python parser service added a network hop on the hottest path, a second service to deploy and page on, and still a Python process's worth of memory per unit of parallelism. Same ceiling, two deploys.
Go instead of Rust
This is the comparison I owe the most honesty on, because Go would have worked. No prefork tax, cheap goroutines, a GC that for this workload would have been a non-issue, a gentler curve. If another team had picked Go here I'd have nodded. What tipped it wasn't benchmarks. We already ran two small Rust services, pockets of tokio/sqlx that a couple of us knew and that had been quietly reliable. Rust meant reusing patterns, CI, and a deploy story we trusted instead of standing up a Go toolchain we'd have to learn to operate. The deciding factor was the existing operational surface, not the language.
Switching the broker to Kafka or SQS is what someone always asks about, but the broker was never the bottleneck. The Postgres queue claimed and committed well inside budget. The lag was CPU and allocation inside normalization. Swapping it would have been a large, risky migration that solved a problem we didn't have.
The moment it flipped
"We knew it was time" is what people say without evidence, so I made it measurable. After the optimization pass I drew the line out: with another sprint of the same work, what headroom was left before the next seasonal peak breached the SLO again? Roughly one peak's worth, and each increment cost more than the last. Optimizing the old path stopped being the cheap option the moment the remaining headroom was under one peak and the next slice cost more than the last. A worker on tokio, sqlx, and serde answered all three problems: no GC pauses, a fraction of the allocation, parallelism that didn't cost a whole process each. But the language change alone wouldn't have made the migration safe. What follows is the boring 70% that did.
Strangling, not rewriting
We didn't rewrite the system. We strangled it. The decision that made it safe: both fleets claimed from the same Postgres queue table. I'd already moved this team off Celery's Redis broker onto a Postgres-backed queue, so a job was just a row, and any process that could speak the claim protocol could process it. The claim is one statement, a SELECT ... FOR UPDATE SKIP LOCKED to grab a batch of ready rows without blocking on rows another worker holds, wrapped in UPDATE ... RETURNING so the claim and the state transition are atomic.
#[derive(sqlx::FromRow)]
struct Job {
id: i64,
kind: String,
payload: serde_json::Value,
attempts: i32,
}
async fn claim_batch(pool: &PgPool, n: i64) -> sqlx::Result<Vec<Job>> {
sqlx::query_as::<_, Job>(
r#"
UPDATE jobs
SET state = 'running', locked_at = now(), attempts = attempts + 1
WHERE id IN (
SELECT id FROM jobs
WHERE state = 'ready' AND run_at <= now()
ORDER BY run_at
FOR UPDATE SKIP LOCKED
LIMIT $1
)
RETURNING id, kind, payload, attempts
"#,
)
.bind(n)
.fetch_all(pool)
.await
}SKIP LOCKED is what lets two heterogeneous fleets share one table without coordinating: each claim steps over rows the other has locked, so there's no contention and no double-claim. The batch LIMIT amortizes round-trips. We ran it at 64 and let the per-worker pool cap real concurrency.
Routing was per task kind plus a feature flag in the enqueuer. We started by sending 1% of webhook.normalize jobs to the Rust worker, ran a shadow phase that diffed Rust output row-for-row against Python output in a side table, and widened the split only once the diff held at zero across a few million jobs. The flag was a dial from 0 to 100% per task type, reversible in seconds: a second consumer on the same queue, dialed up slowly.
Building the rollback before the forward path
The thing I lose sleep over isn't the deploy, it's the deploy I can't undo. One you can't reverse because the data has moved on under you is an incident with your name on the postmortem. So the rollback story was designed first, and most of the strangler's shape exists to serve it.
The kill switch was the simplest thing possible: the split percentage was one value in our config service, and setting it to 0 stopped routing to the Rust worker within seconds. No deploy, no CI wait while the graphs are red, doable from a phone. I didn't want rollback to require the person who wrote the service, because at 3am that person might be me and asleep.
The harder half: for the whole split, both workers wrote to the same shipment state and queue, so their outputs had to be mutually readable. We froze the canonical schema additive-only. Same shape, new fields allowed but nothing renamed, removed, or redefined, every consumer tolerant of unknown fields. You can't clean up the schema mid-migration, but it let a shipment be normalized by Python at 10:00 and Rust at 10:01 with no consumer able to tell. A mixed-version fleet is only safe when neither version can write something the other can't read.
What rollback actually looked like
We ran both in parallel about six weeks. People asked why not push to 100% once the shadow diff was clean. Because the diff proves correctness, not operability. It proves that the output matches, not that the thing pages cleanly, drains on deploy, and holds at the next peak. Six weeks bought a real peak under partial load with rollback still one config change away.
We rehearsed rollback rather than hoped. Dialing to zero at 50% stranded nothing: in-flight Rust jobs finished under their own graceful drain, claimed rows committed normally, and new jobs routed back to the Celery fleet that had never stopped running. Rolling back was just changing who claimed the next row. No drain-and-replay, no broker to reconcile, no half-migrated state to repair. That rollback was a routing change and not a data migration is the single most valuable thing the shared-queue design bought us, more than the latency.
One noisy carrier should not eat the queue
The claim query has a quiet flaw that doesn't show up until a carrier misbehaves. ORDER BY run_at is FIFO, and FIFO across a shared queue means a burst from one source is serviced strictly ahead of everything behind it. The day a carrier replayed several hours of backlog at us, those older run_at values sorted to the front, and the workers drained tens of thousands of one carrier's events while every other carrier's webhooks (and the unrelated task kinds sharing the table) sat behind them aging past the lag SLO. A perfectly correct queue starved itself. FIFO is fair to rows and unfair to tenants.
Per-kind routing already isolated task types, so a flood of one kind couldn't starve another. Within the hot kind we added fairness: the inner select partitions by carrier and caps the per-carrier share of the LIMIT, so a backlogged carrier drains steadily instead of all at once. A shared queue is a multi-tenant system whether you designed it to be one or not, and you find out the first time a tenant goes pathological. Build the fairness in before that.
The PgBouncer prepared-statement trap
This one cost me an afternoon and is worth more than the rest of the post, so here's how it actually went. The symptom showed up only in production, only under load, only sometimes: a small fraction of claims failing with a database error, no pattern, nothing on my laptop or in staging where the pool was cold and tiny. Intermittent, load-dependent, non-reproducible. The three words that mean you're about to lose a day.
My first hypothesis was the obvious one and wrong. New Rust worker, opens connections, errors under load, so clearly I'd saturated something. I checked max_connections, watched pg_stat_activity, and spent a good while making that story fit. It didn't. Connection counts were comfortable, the error was neither a timeout nor "too many clients", backend count flat, no acquire-timeout spike, error rate uncorrelated with connection pressure. I'd been debugging the story in my head instead of the graphs in front of me, which after fifteen years I still catch myself doing.
Reading the error instead of theorizing
The clue was the error itself: prepared statement "sqlx_s_3" does not exist. Not a saturation error. A statement that existed a moment ago and then didn't, and only behind PgBouncer; workers pointed straight at Postgres never threw it. Correlating by backend pid turned the lights on. I watched a statement get prepared on one backend, then referenced in a later transaction against a different backend where it had never existed. The connection under it was being swapped out between transactions.
That's exactly what PgBouncer's transaction-pooling mode does, by design. sqlx prepares statements on the server and caches them per connection: the first time it sees a query string it issues a PREPARE, then reuses that server-side statement over the same connection. A real win against a direct connection, a correctness bug behind a transaction-mode bouncer, where a backend is handed to a client for one transaction and then returned, so the next lands on a different backend where the statement doesn't exist. Under load, intermittently, in exactly the way that doesn't reproduce on a cold laptop where you're the only client.
The runbook line that came out of this: "prepared statement ... does not exist means a pooler is moving connections between transactions. Check whether the worker is behind PgBouncer in transaction mode before you touch connection limits."
Two fixes that trade off
The first: keep PgBouncer in transaction mode and disable the statement cache on PgConnectOptions (capacity zero) so sqlx doesn't assume a statement survives across transactions. The cost is a parse/plan round-trip per execution. For a hot claim loop, that's real overhead.
The second, which we shipped: point the worker directly at Postgres (or PgBouncer in session-pooling mode) with a small bounded pool. The insight people miss is that a worker fleet is a handful of processes, not the hundreds of web request handlers PgBouncer exists to fan in. You don't need transaction pooling to protect Postgres from three workers. Keep the statement cache, keep the plan reuse, bound the connections yourself.
use sqlx::postgres::{PgConnectOptions, PgPoolOptions};
use std::time::Duration;
// Direct to Postgres (or session-mode bouncer): keep the statement cache.
let opts: PgConnectOptions = std::env::var("DATABASE_URL")?.parse()?;
// Behind a transaction-mode bouncer instead, accept the re-prepare cost:
// let opts = opts.statement_cache_capacity(0);
let pool = PgPoolOptions::new()
.max_connections(8)
.acquire_timeout(Duration::from_secs(5))
.connect_with(opts)
.await?;Pool sizing is backwards
When the Rust worker first hit production it opened too many connections and briefly starved the API tier. max_connections in Postgres is a hard, shared, cluster-wide ceiling. Every connection costs a backend process and a slice of work_mem, and the queue worker competes with the web app for the same budget.
Here's the counterintuitive part. The faster worker needs fewer connections, held for shorter durations, because a connection is only busy for the milliseconds a query is in flight. Python held them longer per task because the task itself was slow, so it needed many to keep Postgres fed. The Rust worker finishes in single-digit milliseconds and hands the connection straight back, so a small pool sustains far more throughput. We sized at max_connections(8) per process, three processes, and a 5 s acquire_timeout so a pathological backlog surfaces as a timeout instead of an unbounded queue of waiters. Twenty-four backends total, down from where Celery had been.
Idempotency, because both could run a job
The moment two heterogeneous workers share a queue you have to assume a job can run more than once: a Python worker claims it, the reaper requeues it after a stale lock, the Rust worker runs it too. Our handlers were "meant to be" idempotent, which isn't the same as being idempotent. We made it explicit with a dedupe key derived from the real-world event, carrier_id + event_id, unique per status change regardless of how many times it's delivered, and put the dedupe insert in the same transaction as the side effect, branching on rows_affected():
let mut tx = pool.begin().await?;
let key = format!("{}:{}", job.carrier_id, job.event_id);
let dedupe = sqlx::query(
"INSERT INTO processed_events (key) VALUES ($1) ON CONFLICT DO NOTHING",
)
.bind(&key)
.execute(&mut *tx)
.await?;
if dedupe.rows_affected() == 0 {
tx.rollback().await?; // already processed; nothing to do
return Ok(Outcome::Duplicate);
}
apply_state_transition(&mut tx, &job).await?;
tx.commit().await?;Because the insert and the state transition commit or roll back together, there's no window where the event is marked processed but the effect didn't land, or the reverse. The dedupe table needs a unique constraint on key or ON CONFLICT has nothing to fire on. This caught real duplicates during the shadow run: Celery's at-least-once delivery had been silently double-applying a small fraction of notifications for years. The Rust cutover didn't introduce the bug, it exposed it.
Fast enough to break the downstream
This one I didn't see coming. We pushed the split to around 50% on webhook.normalize and a pager went off, but not mine. The notification service's on-call got paged: its p99 climbing, its downstream calls backing up, looking from where they sat like a problem in their service. A short, slightly tense cross-team call traced it upstream to us. Nobody had touched the notification service. The only thing that had changed was that its producer had gotten faster.
That's the whole trap. The notification service had been sized against the old slow worker, and that slowness was load-shaping the downstream without anyone having decided so on purpose. Being fast enough to break everything that was quietly relying on you being slow is a real failure mode, and the default outcome of a successful performance migration. The Rust worker was working perfectly. That was the problem.
The fix is to bound fan-out concurrency with a tokio::sync::Semaphore so the worker can't push more in-flight calls than the downstream can absorb. The semaphore is the throttle the slow worker used to be.
let permits = Arc::new(Semaphore::new(32)); // cap concurrent fan-out calls
for job in claim_batch(&pool, 64).await? {
let permit = permits.clone().acquire_owned().await?;
let client = http.clone();
tokio::spawn(async move {
let _permit = permit; // released on drop, when the task finishes
if let Err(e) = fan_out(&client, &job).await {
tracing::warn!(job_id = job.id, error = %e, "fan-out failed");
}
});
}Draining on SIGTERM
The first version had a real gap: on deploy the orchestrator sends SIGTERM, the old process exited immediately, and any job it had claimed sat in state = 'running' until the reaper timed out the lock minutes later. Duplicate work and delayed delivery on every deploy. A worker has to do three things on SIGTERM: stop claiming new batches, drain in-flight jobs, and bound how long it waits. A CancellationToken plus tokio::select! does it.
let token = CancellationToken::new();
let signal_token = token.clone();
tokio::spawn(async move {
let _ = tokio::signal::ctrl_c().await; // wire SIGTERM via unix signals in prod
signal_token.cancel();
});
loop {
tokio::select! {
biased;
_ = token.cancelled() => break, // stop claiming
batch = claim_batch(&pool, 64) => {
for job in batch? {
process(&pool, job).await; // let in-flight work finish
}
}
}
}
// Bounded grace period for spawned fan-out tasks to settle.
let _ = tokio::time::timeout(Duration::from_secs(30), drain_inflight()).await;biased makes the select check the cancellation arm first, so once the token fires we stop claiming on the next loop turn instead of grabbing one more batch. Anything still running gets a 30 s window. In production we wire SIGTERM through tokio::signal::unix, and the orchestrator's termination grace period has to be longer than ours or it gets SIGKILLed mid-drain, right back where you started.
Retry or dead-letter
Celery retried on any exception with its own backoff. We reproduced the retry-versus-dead-letter decision deliberately: transient failures requeue with backoff, permanent ones go to a dead-letter table for a human. The classification is a match, and getting it wrong in either direction is expensive. Dead-letter a retryable timeout and you drop work. Retry a schema violation forever and you build a hot loop.
match handle(&pool, &job).await {
Ok(_) => mark_done(&pool, job.id).await?,
Err(e) => match classify(&e) {
// carrier 5xx, DB timeout, connection reset: back off and requeue
Class::Transient => {
let delay = backoff(job.attempts); // capped exponential + jitter
requeue(&pool, job.id, delay).await?;
}
// schema validation, unknown event type: a human decides
Class::Permanent => dead_letter(&pool, job.id, &e.to_string()).await?,
},
}We shipped this subtly wrong at first and dead-lettered some retryable timeouts. We caught it because the dead-letter table grew when it shouldn't have, the kind of thing you only see if you alert on it.
Done is when on-call can operate it
The migration doesn't finish at 100% traffic. It finishes when the person on call at 2am, who didn't write a line of this, can be paged, understand what's wrong, and fix it without calling me. Everything here existed before we widened the split past single digits, because instrumenting after you've shifted load is instrumenting blind.
The dashboards went up first, one panel per failure mode: queue lag (the SLO number that pages), claim rate (is the worker pulling work), dead-letter size and growth (the canary that caught the classification bug), downstream saturation (the metric we wished we had before the backpressure incident), and carrier retry rate (the load multiplier). Each alert was tied to the error budget, not a round number that felt nice.
A few things easy and expensive to skip. PII redaction: carrier payloads carry names, addresses, phone numbers, and the instant you log a payload to debug a parse failure you've written PII to your log store. Redaction went in the logging layer from the first commit. An audit trail: every state transition writes an immutable row (old state, new state, which worker, when, which job) so when a customer disputes a delivery status, the trail is the answer. A replay tool for the dead-letter table: useless if the only tool is hand-writing SQL at midnight, so we built a repair/replay command that re-enqueues through the normal path with its idempotency key intact so the replay can't double-apply. That turns the common case (a new carrier event type we hadn't mapped) from an incident into a five-minute fix. A runbook keyed by the alerts: what each means, what to check, where the kill switch is. The PgBouncer line lives there.
Who owns the Rust afterward
Here's the risk that showed up on no latency graph and worried me most. This was a Python and Django team, and after the migration they owned a Rust service they hadn't asked for, in a language most of them didn't write. The failure mode isn't that the service is slow. It's that the service becomes mine, the bus-factor-of-one corner everyone routes around until the day I'm on holiday and it breaks. I've built that thing before, by accident, and wasn't going to do it again on purpose.
So a chunk of the budget went into not letting that happen, and it shaped the code as much as any performance concern. We paired through the build, so by cutover at least three other people had typed into this codebase and argued with me about it. The Rust surface was kept deliberately small and boring: no clever async gymnastics, no exotic crates, no lifetime puzzles where a clone would do. Just tokio, sqlx, serde, and as little else as I could manage. The most senior thing you can do in Rust on a team like this is write the least impressive Rust that does the job, because the next person to read it at 2am isn't a specialist. I optimized for "a Python developer can follow this" over "this is idiomatic," and walked on-call through the main loop, the claim, the error match, and the shutdown before they had to operate it.
We also talked about hiring, because pretending the language choice has no staffing cost is how you orphan a service. We bet Rust skills were gettable and the service small and boring enough to hand to a competent engineer who hadn't written Rust before, a different bet than "hire specialists to babysit it." If it had needed exotic Rust the bet would have been reckless and I'd have pushed harder for Go. That this wouldn't be a one-person service was a harder problem than the tail latency, and it shows up in none of the numbers below.
The measured result
After full cutover, at the same daily volume:
p99task time dropped from 900 ms to 34 ms,p99.9from 4.2 s to 110 ms.- The fleet went from 18 boxes to 3. Steady-state RSS was about 280 MB per Celery child against roughly 40 MB for the whole Rust process. Not like-for-like, but per-child-versus-whole-process was the comparison that decided how many we packed per box.
- Carrier webhook retries from slow ACKs fell by something like 95%, removing a chunk of load that had only existed because we were slow.
These are steady-state medians after the rollout settled, rounded. The first couple of weeks were noisier and the retry number bounced around as carriers adjusted, so treat them as the shape of the win, not a benchmark you can hold me to.
The number I actually cared about isn't in that chart, because it's a non-event: we stopped breaching the ACK p99 and the queue-lag SLO during retry storms, and the error budget went from chronically overspent to mostly untouched. A faster worker was the means. Staying inside the contract was the goal.
If I did it again the order would be the same but tighter: exhaust the cheap options and write down where each ceiling is; build the rollback path and the dashboards before the forward path; put graceful shutdown and the kill switch in the first commit, not the fourth; pair from day one so the bus factor is never one. The biggest risk in a "rewrite the hot worker" project is never the new language. It's the two things the language can't help with: that the new worker breaks everything downstream that was quietly relying on the old one being slow, and that the team it lands on can't operate it after you've moved on. Solve those and the rewrite is boring, which is exactly what you want a production migration to be.