- Yair Mark
On a project I am working on, there are tables with fairly large amounts of reference data that need to be loaded into a Postgres DB. These are a series of large .sql insert files. The files vary in size from 12MB to 65MB.
While this is not considered large by big data standards, it is large enough that running it as part of a Flyway migration significantly slows down your apps startup time when running the initial migration. This made our integration tests take ridiculously long and our UAT container would not even startup as it kept getting killed by the orchestration framework as it took so long to run the migrations and in turn startup. It never got out of this shutdown loop since it hit a long Flyway migration that kept getting rolled back when the app was killed.
There are some obvious solutions to this problem, the most obvious one to load it directly into the DB without Flyway. The issue with this is that now when deving, devs will not have this data and will not be able to properly check if their code works locally. It also means we do not have repeatable migrations for other environments and have to run this manually in every environment.
After a bit of investigation into how best to do this in Postgres, I came across the following article.
This gives a few options to deal with this, the first being to wrap all the inserts in one transaction as by default not doing this results in each insert being wrapped in a transaction. I tried this with Flyway but it did not work. After a bit of investigation, I saw that flyway wraps each migration script in its transaction - this was already happening.
This idea of batching inserts together lead me to investigate Flyway's other options and I came across the batch flag, but this is a paid feature :'(
I revisited the article from earlier and saw the final and recommended option of using Postgres' copy utility. Initially, I dismissed this as it did not look like Flyway supported it. But after further investigation, I came across this answer and a comment that point to the Flyway docs that discuss it. Basically, Flyway does support it, but there are no decent examples of how to set it up.
In the end, after a bit of fiddling, I worked out how to do this.
- Your migration file still is a .sql file with the same naming convention as any Flyway .sql migration file
- The table is any normal table, it does not need to be a special table to accommodate the CSV data
- The script will start with the COPY command
- The COPY command by default expects space-delimited rows without the header row, but this can be overridden
- The CSV data is put in the file below the copy command and not as a separate file
An example of this can be seen below:
COPY person (id, age, gender,name) FROM stdin delimiter ';' quote '"' csv; c3a8f58e-59ac-49fe-ace6-a136cbbc2d38;19;M;"John" c44f46d9-3e92-4afd-8b7f-5e88985bd061;23;F;"Sally" \.
Using a bit of Vim (due to the enormous file sizes) and regex, a large SQL file of inserts can very easily and quickly be converted into the above format.