12.6 COPY: Bulk Loading Data from Files

Right, you’ve been dutifully inserting rows one by one or in small batches. It’s fine for a few hundred, maybe a thousand records. But you and I both know that’s not how the real world works. The real world sends you a 12-gigabyte CSV file from some mainframe and says “have this loaded by lunchtime.” You’re not going to do that with a million individual INSERT statements. You’re going to use COPY, PostgreSQL’s built-in data firehose.

COPY is the tool you break out when you mean business. It bypasses a lot of the overhead of the standard SQL command protocol and streams data directly into a table. It’s orders of magnitude faster. We’re talking about going from minutes (or hours) to seconds.

The basic incantation is simple. Let’s say you have a file, new_employees.csv, and a table to match.

CREATE TABLE employees (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    start_date DATE,
    department_id INTEGER
);

Your CSV looks like this:

42,"Brenda Berkman",2023-05-01,5
43,"Wendy Lopez",2023-05-15,5

To load it, you’d run:

COPY employees (id, name, start_date, department_id)
FROM '/path/to/new_employees.csv'
WITH (FORMAT CSV);

Notice I explicitly listed the columns. This is not just pedantry; it’s a best practice. It protects you from the file having a different column order than the table, which happens more often than anyone admits.

The WITH Clause: Where the Magic (and Landmines) Live

The bare command is okay, but the WITH clause is where you configure COPY to handle the bizarre realities of your data. The most common options are FORMAT, HEADER, and DELIMITER.

FORMAT CSV: Use this for… well, CSVs. It handles quoted fields and commas correctly.
HEADER: Tell COPY to ignore the first line. Crucial if your file has column names.
DELIMITER '|': Because sometimes you get pipe-delimited files, and you just have to deal with it.

A more robust command for a typical messy CSV would be:

COPY employees
FROM '/path/to/employees_with_headers.csv'
WITH (
    FORMAT CSV,
    HEADER true,
    DELIMITER ',',
    NULL '' -- Treat empty strings as NULL
);

Common Pitfalls: The Devil’s in the Defaults

Here’s where COPY can bite you. Its default behavior is optimized for speed, not hand-holding.

Permissions. This is the big one. The PostgreSQL server process (usually user postgres) must have read access to the file you’re loading. This is why you often see \copy in the psql command-line tool—it’s a meta-command that runs COPY using the client’s file permissions, which usually avoids this entire headache. For scripts, you’re often better off using \copy.
Absolute Paths. The FROM filename must be an absolute path on the server’s filesystem. A relative path won’t work, and it won’t tell you why; it’ll just fail.
Quoting and Escaping. The default quote character in CSV mode is a double quote. If your data has embedded quotes that aren’t properly escaped, the entire import will fail halfway through. You’ll get an error, and the entire transaction will be rolled back. This is good (data integrity!) but frustrating. Test with a small sample of your data first.
Errors Stop Everything. By default, COPY is all-or-nothing. A single malformed row in a 10-million-row file will abort the entire operation. This is often what you want, but if you’re dealing with a dirty data source, you might use ON_ERROR or the older LOG_ERRORS option to save what you can.

The Power User’s Tool: FROM PROGRAM

This is one of COPY’s killer features. You can stream the output of any shell command directly into your table. Need to import a gzipped file? Don’t decompress it first. Just do:

COPY employees
FROM PROGRAM 'gunzip -c /path/to/huge_file.csv.gz'
WITH (FORMAT CSV, HEADER true);

This is incredibly powerful for ETL workflows. You can pull data straight from curl, awk, or any custom script you can dream up.

A Final Word of Caution

COPY is fast because it’s minimally logged. Be very, very careful with it on a replicated setup. A large COPY operation can generate a massive amount of WAL, which can cause replication lag. For truly massive bulk loads, you might even consider using pg_bulkload or temporarily altering your replication level. But for 99% of use cases, COPY is the undisputed champion of getting data into your database. Use it liberally.