"Lost Time" in Alembic Migrations and the Defensive Migration Strategy

Managing database schemas is a standard operation for most of us. In a local environment, you run alembic revision --autogenerate, then alembic upgrade head, and everything works flawlessly. But what if one day your Production (or Staging) database is at the version indicated by your migration file, but physically does not contain those tables or columns?

That is the moment you have come face to face with the infamous database "Drift" problem and hours of Alembic chaos. Today I want to share the war story behind the "UndefinedTable" error we encountered and the "Defensive Migration" strategy we subsequently developed.

Setting the Scene: A Failed Deploy

It all started on an ordinary Friday when we tried to deploy the Nodus project to the Staging environment on Railway. The pipeline ran, the container spun up, and suddenly that cold-sweat-inducing error log appeared: psycopg2.errors.UndefinedTable: relation "health_metrics" does not exist.

This was impossible. Because the health_metrics table should have been in the database from a migration done weeks ago.

Situation Assessment: We looked at the infamous alembic_version table that Alembic keeps in the database. Version: 5c85ade3019c (Meaning the version where the table had already been created, and even "file_id" added on top of it). The Problem: The database was saying "I'm on step 5", but the table that physically should have been in step 1 was nowhere to be found. Someone had either deleted the table manually, or a synchronization loss (drift) had occurred during database dump/restore operations coming from different environments.

Attempt 1: The Repair Script and "Things Left in the Past"

Our first instinct was to add a "repair file" (123456789abc) at the very beginning. We placed the file right after the initial_schema. The thought was: "If the table is missing while Alembic is scanning from the beginning, let it create it."

What Happened? The operation failed. Alembic is a smart (?) tool. When it saw that the database was already at the 5c85... (step 5) version, it completely ignored our poor repair file from the past, saying "This must have run before anyway".

Attempt 2: Laying Stones Where You Stepped (The Linked List Trick)

Since Alembic didn't look back at the past, we placed our repair file at the exact stair step where the database stood. We set our file's down_revision value to 5c85..., the current version of the database.

What Happened? The table was created! Alembic saw our file and executed it. But it was too early to rejoice. The migration chain had split in two. Because there were other files pointing to the same step. We encountered the infamous Git/Alembic nightmare: Multiple heads are present for given argument 'head'.

Different environments (Dev, Stage, Prod) were in different timelines. The Stage version was fixed, but this time the Prod database blew up because it was on a different step. When we tried to merge the main and stage branches, things literally turned into "soup".

Attempt 3: Fixing the Chain and the Ironic "DuplicateColumn" Error

We resolved the Git conflicts, put the repair file where the Prod database was, and arranged the entire migration chain onto a single line (linear). Now everything was going to run in order.

What Happened? This time we got a wonderful (!) psycopg2.errors.DuplicateColumn error. Why? Because our repair script added some basic columns (e.g., ref_min, analysis_id) to the table so it wouldn't crash while creating it. But since the original files further down the Alembic chain also tried to add these columns, the database rejected the operation saying "This column already exists!".

The Reliable Solution: "Defensive" Migration Strategy

The fundamental reason for all this chaos was that Alembic files were acting blindly from memory. The assumption was being made: "If I am at this step, the current table is like this". However, in external databases, reality does not always match the code.

Leaving these assumptions behind, we decided to make all risky migration files in the project Defensive (Data-Driven). Using SQLAlchemy's Inspector capability, we added "querying" logic to each step:

from sqlalchemy.engine.reflection import Inspector

def upgrade() -> None:
    # 1. Let's see what physically exists in the environment:
    bind = op.get_bind()
    inspector = Inspector.from_engine(bind)
    tables = inspector.get_table_names()

    # 2. ULTIMATE SAFEGUARD: If the table is missing for some reason, CREATE it in ANY file!
    if 'health_metrics' not in tables:
        op.create_table('health_metrics', ...)

    # 3. Defensive Column Addition: If the column ALREADY exists, pass silently
    columns = [c['name'] for c in inspector.get_columns('health_metrics')]
    if 'analysis_id' not in columns:
        op.add_column('health_metrics', sa.Column('analysis_id', sa.Integer()))

    # 4. Defensive Constraint(FK) Deletion: Do not try to delete if it's not really there
    fks = inspector.get_foreign_keys('health_metrics')
    fk_names = [fk['name'] for fk in fks if fk['name']]
    if 'health_metrics_file_id_fkey' in fk_names:
        op.drop_constraint('health_metrics_file_id_fkey', 'health_metrics', type_='foreignkey')

Conclusion: A Bulletproof Process

Thanks to these little "if" logics, the project's migration structure suddenly turned into an impenetrable fortress.

While the Prod (Main) environment was updating, it found the table missing; silently created it, and didn't try to add existing columns again.
While the Stage environment was updating, the table was already there due to the repair; but the constraint was missing, it just updated what was there without throwing an error.
The Dev (Local) environment was already up to date; it continued on its way normally without tripping the defensive checks.

Lesson Learned

ORM migration tools like Alembic are designed for "Perfect World" scenarios and always assume the revision history is the absolute truth. However, in real life (dump/restores, different database states across environments, branch switching) the "This was already created in the past" assumption ends up costing dearly.

The best possible practice is; in critical migrations that touch tables, to operate (=defensively) by believing in what actually exists on the physical hardware at that moment with the help of Inspector, rather than the version the database claims to be on.

"Lost Time" in Alembic Migrations and the Defensive Migration Strategy

"Lost Time" in Alembic Migrations and the Defensive Migration Strategy

Setting the Scene: A Failed Deploy

Attempt 1: The Repair Script and "Things Left in the Past"

Attempt 2: Laying Stones Where You Stepped (The Linked List Trick)

Attempt 3: Fixing the Chain and the Ironic "DuplicateColumn" Error

The Reliable Solution: "Defensive" Migration Strategy

Conclusion: A Bulletproof Process

Lesson Learned

Enjoyed the read?