How I load anonymized production data into dev with evil_seed

A teammate's laptop almost caught fire during a download. Not literally. Their dev-database refresh was on hour four, restoring a production snapshot into Postgres, and the laptop's fans had been on full for so long that he finally stepped away and let it run overnight. The next morning the restore had filled the disk and crashed halfway through. The drive was almost full of a database he hadn't even been able to log into.

That was the team's normal way to get a realistic dev DB. Every developer downloaded a production dump (real PII included, column-encrypted but still PII once decrypted), restored it locally, and tried to keep enough free space to actually code. The restored database swallowed a chunk of the disk you would otherwise use for things like funny cats videos.

I had never downloaded it. I had written seeds instead. The seeds were small, fast, version-controlled, and ran against an empty database in seconds, which made the dev loop tight. But each new feature or bug fix wanted its own seed, and adding a seed-per-fix turned into its own friction: conflicts when two branches added the same record, timestamps drifting, factories quietly stale. A hundred times better than a quarter-of-the-disk staging dump (I had also seen what happens when the disk cleaning goes wrong: "How two AIs made me wipe my Mac"), but still a chore.

Then the ask landed: please produce a comprehensive, realistic, anonymous demo database for a single tenant. Something we could use in staging, hand to a new hire, or run a sales demo against, without leaking any real PII.

I thought back to a talk at Tropical on Rails in São Paulo a couple of months earlier: Evil Martians' evil-seed. You declare roots and associations, and it dumps the subset of your database that satisfies them, optionally rewriting each row through a customizer.

That sounded like the shape of the problem, but the first two things I tried did not use it properly. The plan in both was to chunk the production DB down to something more manageable first, and only then point evil-seed at the smaller thing to anonymize what was left.

Attempt one was the cheapest move: run the production restore with the largest tables excluded at pg_restore time. It collapsed immediately. The biggest table in the database is shared across every tenant, and the demo cannot render without it. There was no --exclude-table line that did not also break the demo.

Attempt two was the obvious correction: restore the production snapshot in full, then prune. Load everything, write DELETE chains that peel away other tenants' rows, end up with a single-tenant DB. On a hyper-connected schema this turns into its own problem. Every DELETE has a FK trail behind it, and the chain is open-ended in the same way the excludes were: when do I stop pruning?

evil-seed had been the right tool the whole time, but I had not embraced what it was asking for. I was looking for shortcuts instead of accepting that the work was going to be hard. So I sat down and drove the gem the way I should have from the start.

Bottom-up: declare every root explicitly

By bottom-up I do not mean leaves. I mean picking the handful of tables the demo actually reads and giving each its own scoped root, instead of starting from one tenant and hoping the walker stops at the right place.

Even a single Account-as-root walk explodes on a schema like this. Tables like the audit log and the PaperTrail-style version history are populated by every action across every tenant; walking the associations of one Account pulls in slices of them the demo does not need, and through them, slices of other tenants. A single root tries to bound the entire graph from one place. The schema does not let it.

Instead of starting from the database and pruning outward, I gave each table the demo needed its own scoped root. The tenant became one of many entries, not the entry. The walker stopped escaping because there was nothing to escape to: every root knew its own bounds. I started with the dozen or so roots I knew I would need, then loaded the dump, clicked through the demo, and added a new root every time something rendered empty or crashed. The final config declares around fifty.

The skeleton of the config:

EvilSeed.configure do |config|
  config.unscoped     = true   # don't double-filter through model default scopes
  config.dont_nullify = true   # we'll load with FK triggers disabled

  config.root("Account", id: account_id) { |r| r.limit_deep(0) }

  config.root("Member",
    "members.account_id = #{account_id}") { |r| r.limit_deep(0) }

  config.root("Document",
    "documents.id IN (#{ids_list})") { |r| r.limit_deep(0) }

  config.root("ItemResponse",
    "item_responses.member_id IN (#{member_ids_list})") { |r| r.limit_deep(0) }

  # one root per table the demo actually reads

  config.customize("Member") { |attrs| anonymize_member!(attrs) }
  config.customize("ItemResponse") { |attrs| scrub_json!(attrs, "response") }
end

Two flags matter. unscoped = true because a default scope (a kept from soft-deletes, for instance) will silently re-filter your scoped root and drop rows you carefully selected. dont_nullify = true because the loader runs with FK triggers disabled anyway, so evil-seed's per-row FK-nullify pass would just burn time on tables it does not need to touch.

The other choice that mattered was less obvious. The scoping lists are interpolated as literal IN (1,2,3,...), not IN (SELECT ...). With subqueries the planner kept picking parallel seq scans on the largest response tables, turning a single dump into hours. Materializing the id sets up front and interpolating them as literals forced index usage on the per-batch SELECTs evil-seed issues. The dump dropped from hours to minutes.

The silent login break

Anonymization felt straightforward until login broke silently in a way nothing in CI caught. The model encrypts email and a few other PII columns at the column level. I had written a customizer that ran the new fake value through the encryption serializer, dropped the ciphertext into the row, and moved on. Every demo user looked perfect in the dump. None of them could sign in.

evil-seed's row dumper already applies column.type.serialize(value) once at INSERT time. Calling .serialize a second time in the customizer produced encrypt(encrypt(plaintext)). The reader returned ciphertext where the app expected an email, and Devise's deterministic find_by(email: ...) silently could not locate the user. Customizers see the model's logical value, not the database's serialized form. I deleted the explicit .serialize call, wrote plaintext into the customizer, and login worked.

The PII sweep that catches these regressions started life as a grep -F against a sample of source values, and it false-positived in two ways. Short fake names (a two-character first name, for example) were substring-matching inside base64-encoded ciphertexts, producing 7,691 spurious hits in one run; adding -w for word boundaries fixed that. Separately, the demo's shared bcrypt hash would occasionally byte-collide with a real user's hash by accident, blocking the run. I replaced the byte-equality check with a positive invariant: every bcrypt-format token in the dump must equal the known demo hash.

The dump that wouldn't unpack

Then the dump shipped, and then the dump didn't. Bottom-up scoping had bounded the entity catalog to what the demo rendered, but demo users' responses, comments, and todos still pointed at rows outside that bound. Loading with session_replication_role = replica let those orphans in: FK triggers were off, so nothing complained at load time.

The complaint showed up only when I repacked the loaded demo DB into a pg_dump -Fc archive for distribution. A consumer's pg_restore blew up on a handful of ADD CONSTRAINT statements covering thousands of orphan rows. The fix was a short DELETE block at load time, before the pack step. Disabling FK triggers to load a slice means you also have to clean up after them before you ship the slice.

What changed in how we work

The dump shrank by orders of magnitude. A production restore weighs in around half a terabyte. Build produces a plaintext .sql under a gigabyte (the PII sweep greps against it), and a pg_dump -Fc archive under 200 MB. We ship the archive, not the SQL: roughly 2,500× smaller than the source, loadable on a fresh laptop in single-digit minutes.
No real PII on disk. A scripted sweep fails the build if a source name, email, or subdomain leaks into the dump file.
Bottom-up is the move whenever a graph walker keeps escaping. Top-down works for trees. Bottom-up works for densely cross-linked schemas.
Anonymization belongs upstream of the serializer, not in front of it. If a column has an attribute type, write the logical value and let the column do its job.

It went out as one clean commit on main. The four days and forty-three checkpoint commits behind it stayed on my working branch, where they belong. Nobody on the team has waited four hours for a staging restore since.

Thanks to Evil Martians for building and open-sourcing evil-seed. The gem did the heavy lifting; the bottom-up walk was just learning to ask the right thing of it.

I'm Fred Sapuppo, a Rails developer who does not make new hires download huge production databases. More field notes at fedesapuppo.com.