# Database migrations

The vast majority of APIs should be the sole owners of the databases they connect to. This applies not only to the database schema but also to the data stored within it.

No other applications should make connections directly to an API's database to query data or perform schema or data migrations. Operations to retrieve, create, update, and delete data should always go through the API.

Following this guidance provides several benefits:

  • Awareness of all applications with dependencies on the data
  • Ability to make internal database changes without affecting consumers, provided the API contracts remain unchanged
  • Confidence that the schema will always match the expectations of the API that owns it
  • Predictable and accurate execution of the correct set of up and down migrations when deploying or rolling back an API release

# Migration mantras

  • Migrations must be able to migrate a fresh, empty database to current expected state
  • A migration always consists of a corresponding up and down
  • Encourage errors
  • Be transactional
  • Keep migrations isolated by feature
  • Do not seed or move data to/from external sources

# API and data store coupling

An API and its underlying data store function as a single unit (provided the API stores and works with data). While an API may support multiple storage engines (SQL Server, PostgreSQL, Redis, MongoDB, etc), its dependence on some data store cannot be separated or removed from its execution.

Each incremental release of an application that has a dependency on changes to its data store must ship with corresponding up and down migrations.

Executing the up migration(s) applies any changes that update the database schema or migrate data from the previous release version to fit the expectations of the code in the current release. The corresponding down migration(s) perform the exact same operation(s) in reverse.

These changes include, but are not limited to, the following:

  • Creation, modification, or deletion of tables, views, functions, stored procedures, indices, or other objects
  • Changes to table column constraints
  • Migration of data from one place to another
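As a sketch, a release that adds a column might ship with a pair of files like the following. The table, column, and file names here are hypothetical, purely for illustration:

```sql
-- 0002_addEmailToUsers_up.sql
-- Up: bring the schema forward to what the new release expects.
ALTER TABLE users ADD COLUMN email TEXT NOT NULL DEFAULT '';
```

```sql
-- 0002_addEmailToUsers_down.sql
-- Down: reverse the exact same change, restoring the previous release's schema.
ALTER TABLE users DROP COLUMN email;
```

Deploying the release runs the up file; rolling the release back runs the down file, leaving the schema exactly as the previous release expects.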

# Migration opinions

# Be vulnerable to errors

In most aspects of programming we want to be defensive about preventing errors. With database migrations, however, this should not be the case. Because our APIs are so dependent on a very particular database schema, we want to be extremely cautious in ensuring, without exception, that the database schema is exactly what we expect it to be before and after a migration is run.

For example, it is tempting to write migration scripts that are defensive around creating a new table, guarding against the possibility that it already exists or that the migration may be run more than once.

Consider the following:

```sql
CREATE TABLE IF NOT EXISTS my_table (
  ...
);
```

This query will only create the table my_table if it doesn't already exist and simply do nothing if it does.

Writing a migration in this way allows a potential violation of our expectations about the current database schema to pass silently.

Allowing the query to fail is a much better approach in this case.


```sql
CREATE TABLE my_table (
  ...
);
```

This example will cause the query to fail if my_table already exists and, depending on how you manage your migration script execution, should halt any further migrations from running.

This is exactly what we want to happen. The moment we encounter a database schema that is outside of our expectations, we should halt execution, roll back any partially applied changes, and manually intervene to resolve the issue.

# Keep migrations isolated to features

Migration scripts should be limited to as small a scope as possible while still encompassing a given feature. They are treated in the same way as any other code going into a main branch: as a single unit that supports a given feature.

This does not necessarily mean that you will never have more than one migration in a single merge request. But, if you find yourself submitting more than one migration in a merge request, you should question if you would be better off submitting multiple merge requests and grouping code more appropriately by feature.

# Use transactions as necessary

In instances where a migration is not limited to a single statement, use transactions to limit or eliminate the need for manual rollback intervention, beyond possibly resetting the migration tracking table to a clean and correct state (if applicable to your migration tooling).

When using transactions in your migrations, allow queries to error out appropriately. If migration failures pass silently and fail to report errors, we allow our schema to become out of sync with our expectations and all of our worst fears will be realized.
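For example, a multi-statement migration can be wrapped in a single transaction so that a failure in any statement rolls back the entire migration. The table and column names below are illustrative:

```sql
BEGIN;

-- Both changes succeed together or not at all.
ALTER TABLE orders ADD COLUMN shipped_at TIMESTAMP;

UPDATE orders
SET shipped_at = updated_at
WHERE status = 'shipped';

COMMIT;
```

Note that not every engine supports transactional DDL (PostgreSQL does; MySQL implicitly commits around most DDL statements), so verify how your storage engine behaves before relying on this.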

# Avoid migrating data from external sources

Moving data from one table to another within the same database is a fairly common task and is certainly appropriate to do within a migration. Migrating data from one database (or any external source) to another in a migration script is not appropriate and should be avoided.
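A minimal sketch of an appropriate in-database data move, assuming hypothetical legacy_addresses and addresses tables in the same database:

```sql
BEGIN;

-- Copy rows into the new table, then remove the old one.
INSERT INTO addresses (user_id, street, city)
SELECT user_id, street, city
FROM legacy_addresses;

DROP TABLE legacy_addresses;

COMMIT;
```

Everything this migration touches lives inside the database it is run against, so it remains safe to execute against a fresh, empty database at any point in the future.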

There are a couple of use cases, central to these recommendations, that make cross-database data migrations inappropriate:

  1. Developers need the ability to quickly spin up new, empty databases from scratch to begin contributing to a project quickly and safely

  2. Automated integration and E2E testing can be run in a fresh, isolated environment on demand

In either of these scenarios, as in many others for schema migrations, there exists the possibility that the location the data is being migrated from is offline, cannot be reached, or no longer exists. Do not assume that the migrations you write today will never be executed against a fresh database far into the future, possibly years from now.

Code that moves data from one database to another (or any external system) should be stored and maintained within the same repository as the API and migration code but in a separate location, isolated away from migrations, that allows us to ensure migrations can always run safely.

# Use tooling that tracks already applied migrations

As stated previously, migrations should be written with the expectation that they are to be run on any given database instance once and never again. The exception to this expectation is the scenario in which a database was migrated down or restored to a previous version. In those instances, however, from the perspective of the migration, it is being run for the first time.

In order to achieve this, some type of tool should be used that tracks migrations that have already been run or, at least, knows what the last applied migration version was. This allows new migration runs to begin at the oldest or lowest numbered migration that has not yet been run.
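Most migration tools implement this with a tracking table along these lines. The table and column names vary by tool; this is an illustrative sketch, not any specific tool's schema:

```sql
CREATE TABLE schema_migrations (
  version     INTEGER PRIMARY KEY,  -- matches the file numbering: 1, 2, 3, ...
  name        TEXT NOT NULL,
  applied_at  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- The runner applies, in order, every migration numbered higher than this.
SELECT COALESCE(MAX(version), 0) FROM schema_migrations;
```

On an up run, the tool inserts a row per applied migration; on a down run, it deletes the corresponding row, so a re-run of that migration is treated as a first run.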

# Migrations should be numbered in execution order

Most tools ship with a CLI that creates a new migration with a given name, increments the highest migration number, creates empty up and down migration files, and puts them in the correct location for you.

  • 0001_myMigration_up.sql
  • 0001_myMigration_down.sql
  • 0002_anotherMigration_up.sql
  • 0002_anotherMigration_down.sql