Why refund processing is harder than it looks

Apr 3, 2026 · 21 min readPayments Backend

Every payments system I've worked on started with refunds as an afterthought. A button that called gateway.refund(charge_id) and trusted the return value. Every one eventually paid for that, sometimes literally. Refunds look like payments run backwards, but they're not. A payment has a live human at checkout supplying the retry pressure. A refund happens asynchronously, often days later, and it's usually fired by a batch job rather than a person, against a charge that may have been only partially captured. Every failure mode ends in one of two places: a customer owed money who doesn't have it, or your company paying the same money twice. I'm writing this from having carried the pager and cleaned up the aftermath more than once.

double refunds

2 yr

in production

~12k

refunds auto-recovered

Money is the error budget

The numbers are rounded. The 12k is an incident I'll get to, and the zero double-refunds is what the whole design exists to protect. That's the lens for everything below. A bug here doesn't degrade a percentile. It moves cash out of the company's bank account, or strands it in transit to a customer who was promised it. The error budget is denominated in dollars and in customer trust, and neither one refills on its own.

The service-level objective I held the system to was plain enough to put in a finance review. Every refund reaches a terminal, reconciled state within two business days, and at any moment the count of refunds stuck in submitted past that threshold is near zero. That isn't aesthetic. A stuck refund is a customer who was told their money is coming and doesn't have it, which becomes a support ticket, then an escalation, then a chargeback, then a one-star review about a company that "stole" from them. The alert is wired to exactly that count: when submitted-and-aging rows cross a small threshold, someone gets paged, because that number is the error budget burning in real time.

Why build this at all

Before any code, the honest question is why this needs to be its own system, because the alternatives are real and someone will ask.

The first is to lean entirely on the gateway's idempotency. Stripe-class gateways dedupe on an idempotency key, so why not retry safely and let them be the system of record? Because the gateway answers exactly one question (did this specific API call already happen) and you've got three others. You need a durable record of intent, because a refund is often decided minutes or hours before submission by a batch job or an approval workflow, and that intent has to survive a crash. You need a cross-charge invariant the gateway doesn't know about: it'll happily accept refunds summing past what you captured if you ask in two calls. And you need reconciliation against the bank, which is a different source of truth than the gateway's API. Its idempotency is necessary and I rely on it hard, but it's one layer of three.

The second is a durable-execution engine like Temporal, or Step Functions. On paper it fits. A refund is a long-lived, multi-step process with retries and timeouts, exactly what those engines are for. I've used Temporal for harder orchestration and I'd do it again. Here I didn't. The refund has five or six states, and the dangerous behavior isn't orchestration but a single invariant and a single never-advance-on-optimism rule, both of which live in SQL transactions, not workflow steps. An engine meant more infrastructure to operate. And the deciding part: it moved a refund's state out of a column the on-call engineer, finance analyst, and support agent could all read with a plain SELECT, into an execution history you need the engine's tooling to inspect. At 3am, during the incident below, "what state is this refund in" had to be answerable in SQL by a tired human. An explicit status column won on debuggability, not capability.

The third is full event sourcing or a double-entry ledger. Where money has to balance to the cent across many accounts, a ledger is right and I'd build one. For a refunds table that reduces a charge, append-only event history buys a beautiful audit trail at the cost of every read becoming a fold. I took the middle path. A boring mutable status column for the current state, plus a separate append-only audit table for the transitions. That gives auditability where it's legally required without making "is this refund settled" an event replay. The boring representation, chosen deliberately, knowing the fancier options existed and what they'd cost.

Money is not a negative payment

The first instinct is to model a refund as a payment with a negative amount. Resist it. A refund has its own lifecycle. It can be partial. It can be issued against a charge that was itself only partially captured, in which case the capturable base is amount_captured, not the authorized amount. It settles on a different schedule. It can be reversed by the issuing bank weeks later. And a payment's terminal state is "captured," while a refund's is "settled and reconciled against the original charge," which is a constraint spanning two records, not one.

Before any of that, get the money representation right, because everything downstream inherits it. Amounts are integer minor units (cents) stored as bigint, never floats. A float eventually hands you a refund of 4.999999 dollars, and the rounding bug surfaces in the reconciliation file months later when nobody remembers why. The refund currency has to match the charge's. A refund in a different currency than the capture isn't a refund, it's a second unhedged FX transaction wearing a refund's clothes.

Store refunds as negative rows in the payments table and your balance arithmetic looks right while your audit trail lies. You can't answer "how much of charge X has been refunded" without summing across a polymorphic column, and you can't enforce the one invariant that matters, sum(refunds) <= amount_captured, as anything real. Refunds get their own table, their own state machine, and a foreign key to the charge they reduce.

sql

CREATE TABLE refunds (
    id              uuid PRIMARY KEY,
    charge_id       uuid NOT NULL REFERENCES charges(id),
    amount_cents    bigint NOT NULL CHECK (amount_cents > 0),
    currency        text   NOT NULL,
    idempotency_key text   NOT NULL UNIQUE,
    gateway_ref     text,                -- null until submitted
    status          text   NOT NULL DEFAULT 'requested'
        CHECK (status IN ('requested','pending_review','submitted',
                          'settled','failed','canceled')),
    reason_code     text,                -- enumerated, not free text
    requested_by    text   NOT NULL,     -- actor; feeds separation of duties
    failure_reason  text,
    created_at      timestamptz NOT NULL DEFAULT now(),
    updated_at      timestamptz NOT NULL DEFAULT now()
);
 
CREATE INDEX refunds_charge_id_idx ON refunds (charge_id);
CREATE UNIQUE INDEX refunds_gateway_ref_idx ON refunds (gateway_ref)
    WHERE gateway_ref IS NOT NULL;
 
-- Append-only. Never UPDATE or DELETE here. This is what a dispute,
-- an auditor, or a 3am on-call query reads to reconstruct truth.
CREATE TABLE refund_transitions (
    id          bigserial PRIMARY KEY,
    refund_id   uuid NOT NULL REFERENCES refunds(id),
    from_status text,
    to_status   text NOT NULL,
    actor       text NOT NULL,          -- 'user:1234', 'job:...', 'webhook'
    reason      text,
    at          timestamptz NOT NULL DEFAULT now()
);

A refund is a state machine

The single most valuable thing I did was make the refund an explicit state machine with a persisted status column, instead of an implicit sequence of function calls whose state lived in a call stack that dies with the process.

The states: requested, then submitted, then settled, with failed and canceled as terminal off-ramps, plus pending_review for amounts over a threshold so a human approves the large ones before they reach the gateway. The transitions are the hard part. The dangerous moment is always the window between submitted (we told the gateway to refund) and settled (the gateway confirmed the money moved). A process can die in that window. A call can time out in it. A deploy can SIGKILL the worker in it. The code must never read "I haven't seen a success" as "the refund didn't happen." It means only that you don't yet know, and the difference between those two readings is the difference between a clean retry and a double payout.

Preventing over-refund under concurrency

This is what separates a refund system that survives from one that quietly leaks money. The invariant is sum(refunds) <= amount_captured. The naive enforcement reads the captured amount, sums existing refunds, checks the new one fits, and inserts. It passes every test you write, because your tests are single-threaded.

Now run two partial-refund requests for the same charge concurrently, say a support agent and an automated returns job. Both read amount_captured of 100 dollars. Both sum existing refunds and see 0. Both check their 60-dollar refund fits under 100. Both insert. You've refunded 120 dollars against a 100-dollar capture, and no constraint stopped it, because the check-then-insert wasn't atomic. A CHECK constraint can't express a cross-row aggregate, so the database won't save you. You have to make the read and the write happen under a lock.

sql

BEGIN;
 
-- Lock the charge row. Any concurrent refund against this charge now
-- blocks here until we commit. This is what makes the invariant real.
SELECT amount_captured, currency
  FROM charges
 WHERE id = $1
   FOR UPDATE;
 
-- With the row locked, this sum cannot change underneath us.
-- Exclude failed/canceled refunds; they never moved money.
SELECT COALESCE(SUM(amount_cents), 0) AS refunded_so_far
  FROM refunds
 WHERE charge_id = $1
   AND status NOT IN ('failed', 'canceled');
 
-- In app code: assert
--   $new_amount_cents <= amount_captured - refunded_so_far
--   and $new_currency = charges.currency
-- then, still inside this transaction:
INSERT INTO refunds (id, charge_id, amount_cents, currency,
                     idempotency_key, requested_by, status)
VALUES ($2, $1, $new_amount_cents, $new_currency, $2, $actor, 'requested');
 
-- Same transaction: the audit row. State and transition commit
-- together or not at all, so they can never disagree.
INSERT INTO refund_transitions (refund_id, from_status, to_status, actor, reason)
VALUES ($2, NULL, 'requested', $actor, $reason_code);
 
COMMIT;

The SELECT ... FOR UPDATE on the charge row is the entire trick. It serializes every refund attempt against a given charge through one lock, so the sum I read is the sum that's true at insert time, not a stale snapshot. Without the lock the check is advisory. With it, the check is an invariant. Teams reach for SERIALIZABLE isolation instead, which also works, but then you handle serialization failures and retry the whole transaction, and under contention on a hot charge that gets expensive. An explicit row lock scoped to the one charge is cheaper, and it's legible to the next engineer. Note the second insert: the state change and its audit record commit in the same transaction, so there's no window where a refund advanced but the log missed it.

Idempotency keys done right

Every refund request to the gateway carries an idempotency key, generated up front and persisted before the first call. If our process crashes and retries, it sends the same key, and the gateway returns the original result rather than issuing a second refund.

The mistake I've seen repeatedly is deriving the key by hashing mutable fields like charge id, amount, free-text reason. That's fragile both ways. The same logical refund hashes differently if anyone edits the reason between attempts, so the retry looks new and double-pays. And two genuinely distinct refunds that happen to share a charge, amount, and reason hash to the same key, so the second silently no-ops and a customer gets shortchanged.

The correct key is a client-generated UUID that's the refund row's own id, allocated before the first gateway call and sent verbatim as the Idempotency-Key header. It's stable across retries because the row id doesn't change, and unique across refunds because every row gets its own id. Two legitimate same-amount goodwill credits against the same charge carry distinct ids and therefore distinct keys, with no special handling. The idempotency_key UNIQUE constraint is the database-side mirror of the gateway's dedupe: a duplicate logical refund fails locally before any network call. This is precisely what let us recover 12,000 refunds without guessing.

python

def submit_refund(refund_id: str, conn) -> None:
    # refund_id was persisted as status='requested' inside the
    # FOR UPDATE transaction above. It IS the idempotency key.
    row = conn.fetchrow(
        "SELECT amount_cents, currency, status FROM refunds WHERE id = $1",
        refund_id,
    )
    if row["status"] in ("settled", "failed", "canceled"):
        return  # already terminal
 
    conn.execute(
        "UPDATE refunds SET status='submitted', updated_at=now() "
        "WHERE id=$1 AND status IN ('requested','submitted')",
        refund_id,
    )
 
    # A retry of a call that already succeeded returns the SAME
    # refund object instead of moving money again.
    resp = gateway.create_refund(
        amount=row["amount_cents"],
        currency=row["currency"],
        idempotency_key=refund_id,
    )
 
    # Record the gateway ref, but DO NOT mark settled here.
    # Settlement is the gateway's word, by webhook or poll.
    conn.execute(
        "UPDATE refunds SET gateway_ref=$2, updated_at=now() "
        "WHERE id=$1 AND gateway_ref IS NULL",
        refund_id, resp["id"],
    )

Note what that code doesn't do: it doesn't set status = 'settled' on a 2xx from the gateway. A successful API call means the gateway accepted the request, not that the bank moved the money. Advancing to settled from your own optimism is how you end up with a database full of refunds finance can't find in the settlement file. Never settle from optimism.

Hostile webhooks

Settlement comes from the gateway over webhooks, which are the most adversarial input in the whole system. They arrive out of order, so a refund.settled can land before the refund.submitted you expect to precede it. They arrive duplicated, sometimes many times. And occasionally a refund.failed shows up for a refund you were sure had gone through, because the bank rejected it after acceptance. The handler verifies the signature on every payload, dedupes on the event id so replays are no-ops, and tolerates any order by keying off the refund's gateway_ref rather than assuming a sequence.

python

def handle_webhook(event: dict, raw_body: bytes, sig: str, conn) -> None:
    if not verify_signature(raw_body, sig):
        raise PermissionError("bad webhook signature")
 
    # Dedupe: dropping a row we already processed is a no-op.
    inserted = conn.execute(
        "INSERT INTO webhook_events (id) VALUES ($1) "
        "ON CONFLICT (id) DO NOTHING RETURNING id",
        event["id"],
    )
    if inserted is None:
        return  # duplicate delivery
 
    ref = event["data"]["gateway_ref"]
    if event["type"] == "refund.settled":
        # Out-of-order safe: only advance from a non-terminal state.
        conn.execute(
            "UPDATE refunds SET status='settled', updated_at=now() "
            "WHERE gateway_ref=$1 AND status IN ('submitted','requested')",
            ref,
        )
    elif event["type"] == "refund.failed":
        conn.execute(
            "UPDATE refunds SET status='failed', "
            "failure_reason=$2, updated_at=now() "
            "WHERE gateway_ref=$1 AND status NOT IN ('settled','canceled')",
            ref, event["data"].get("reason"),
        )

The guard status IN ('submitted','requested') on the settle path makes out-of-order delivery harmless. A duplicate or late settled for an already-settled row changes nothing, and a settled for a row since marked failed gets rejected rather than resurrecting a dead refund. Every transition also writes its refund_transitions row, actor webhook, in the same transaction. I left that out here for brevity, but it's not optional in the real handler.

The 12k incident

Here's the failure that bit us, and the reason I trust the design above rather than just believing in it.

A carrier's tracking service went down and an automated policy kicked in: shipments past a guarantee window get refunded. The outage was wide, so the policy queued automatic refunds for roughly 12,000 shipments and a batch worker started churning. Partway through, an unrelated deploy rolled the worker pods and Kubernetes SIGKILLed the process mid-flight. Some refunds were requested. A large block were submitted, meaning we'd called the gateway. Almost none were confirmed settled. The worker didn't come back cleanly, and the alert that fired was the SLO alert: thousands of refunds aging in submitted.

The question on the morning call was the only one that mattered, and it was frightening not to be able to answer it immediately. Did we double-refund anyone? The instinct of whoever finds a half-finished batch is to rerun it, and a naive rerun would have re-issued a fresh refund for every shipment whose first attempt had actually succeeded. Twelve thousand potential double payouts, real money gone, recoverable only by clawing it back from customers one ticket at a time. The fear was specifically that we couldn't tell the succeeded from the failed, because the worker died before recording outcomes.

What saved us was discipline that had felt like overkill when it was written. We never advanced a row to settled on our own optimism, and the idempotency key was the refund's own id, persisted before the gateway call. So for every submitted row, recovery wasn't a guess but a question for the gateway: do you have a refund with idempotency key equal to this id? If yes, it had already happened, so we recorded the gateway_ref, waited for the settled webhook, and didn't resubmit. If no, it hadn't happened, and resubmitting with the same key was safe by construction. If our judgment was wrong and it had gone through, the gateway's dedupe returned the original rather than moving money twice. We walked all twelve thousand rows through that one converging procedure and the double-refund count was zero. Finance lined up too: the next day's settlement file showed exactly the refunds the gateway confirmed it held, no more, and the number finance feared (settlement lines our system didn't know about) was also zero, because the system of record never lost track of intent even though the worker died.

We didn't walk away feeling clever. We walked away with two things. First, a runbook. "Batch worker died mid-refund" is now a named procedure: don't rerun the batch, query the gateway by idempotency key for every submitted row, converge, then resume. Second, the guardrail that should have existed. Bulk auto-refund flows now go through a rate limiter and a circuit breaker, so an automated policy reacting to a vendor outage can't fire twelve thousand money movements in one uninterruptible burst. A human gets paged and approves continuation past a threshold. The scar was cheap because the invariants held. The lesson was that a system capable of issuing 12,000 refunds in minutes needs a brake, and we hadn't built one.

Audit, approval, and PII

Refunds are regulated money movement, and the design assumes an auditor, a disputed transaction, and a hostile insider, not just a buggy worker.

That's why pending_review is a first-class state, not a feature flag. Above a threshold a refund enters pending_review and can't reach the gateway until a second person approves it, and the approver must not be the requester. Separation of duties is enforced in code: the approval transition checks approver != requested_by and rejects otherwise. This stops both an honest fat-finger refund of 50,000 instead of 500, and a dishonest employee quietly refunding to an account they control. The requested_by and the approver both land in the append-only refund_transitions table, so "who authorized this money to move and when" is a query, not an archaeology project.

The audit table earns its keep in disputes. When a customer or card network contests a refund months later, the answer has to be reconstructable: who requested it, under what reason code, what states it passed through, what the gateway said, when it settled. Because transitions are append-only and written in the same transaction as the state change, that history is complete and tamper-evident. We never UPDATE a transition row, so no version of events quietly overwrites another. Financial records carry retention obligations, so these rows are kept for the statutory window regardless of whether the refund is "interesting," and excluded from the data-minimization jobs that prune everything else.

One PII rule lives at the layer that talks to the card network. Request and response carry card data, and nothing in this path may log a full PAN. We log the gateway_ref, the refund id, the amount, and the last four. Never the PAN, never the full token in plaintext. The structured logger has a redaction filter that drops anything matching a card pattern before it reaches the sink. PCI scope is a cost you pay forever once a secret lands in a log, and the cheapest time to prevent it is in the code review of the function that makes the call.

Reconciliation is co-owned

The number in your database isn't the number that matters. The number on the bank's settlement file is. A daily job matched each settled refund against the file by gateway_ref and amount, flagging three discrepancy classes:

A refund settled in our DB but missing from the file after N days. The bank dropped it, and a customer waits on money we think we sent.
A settlement line with no matching refund row. Money that left the account we can't account for, the worst class, because a refund moved that our system never recorded.
An amount mismatch, almost always a currency, rounding, or partial-capture bug, the kind that bigint-in-cents discipline mostly prevents but doesn't eliminate.

That job found bugs nothing else could, because it compared our story to the bank's, and the bank is the only source of truth for whether money actually moved.

Reconciliation isn't an engineering job finance consumes. It's co-owned. The job is owned by eng, but the second class (money left the account with no matching row) is a finance escalation with a clock on it, closed by people who sit in finance, not the codebase. So the output had to be legible to them: dashboards they read without an engineer translating. Refunds by status, the count aging in submitted, mismatches by the three classes, and the headline finance trusts, total refunded in our system today versus total on the settlement file. Each one has an alert. The aging-submitted count pages on-call eng. A class-two mismatch pages both eng and the finance controller, because neither resolves it alone.

When a refund does get stuck there's a repair tool, and the dangerous version would let an operator click "retry" and blindly resubmit. Ours can't. The repair action re-queries the gateway by idempotency key first and shows the operator what the gateway actually believes before offering any action. It's the same converge-don't-guess logic from the incident, packaged so a support lead runs it safely without understanding the locking. That's the deliberate consequence of the boring status column. A refund sits at the seam of three orgs: eng owns the mechanism, finance owns the money truth, support owns the customer, and the representation had to be one all three could read. A workflow engine's history would have been correct and unreadable to two of them.

Refunds are harder than they look because the easy 90% (call the gateway, mark it done) works in every demo and every happy-path test. Then the 10% involving a dead worker, a timed-out call, two concurrent requests, a dropped settlement line, or an auditor's question two years later turns out to be the entire reason the system needs to exist. You're not building the refund. You're building the recovery from the refund that half-happened, and the record that proves which half.

The state machine, the row-locked over-refund check, the idempotency keys, and the reconciliation from this post are a runnable crate: refund-engine — with a demo that replays the near-double-pay incident. Run it at INCIDENT_N=12000 and it seeds the full twelve thousand, SIGKILLs the batch mid-flight, then converges every row with zero double-refunds — the same invariant, at the same scale as the incident.