6.4 Database Encoding: UTF-8, SQL_ASCII, and Collations
Right, let’s talk about the alphabet soup of database text storage. You’ve probably already created a varchar column or two, but the real decisions—the ones that will either save your bacon or cause a multi-byte, accented-character-filled nightmare at 2 AM—happen at the encoding level. This isn’t academic; it’s the bedrock of storing anything beyond “Hello World.”
Think of encoding as the dictionary your database uses to translate bits into letters. If you tell it to use an English dictionary (like LATIN1) and then try to store a Japanese character (こんにちは), it’s going to have a complete meltdown. The only sane choice for a modern application is UTF-8. It’s the one dictionary to rule them all, capable of storing pretty much every character from every human writing system, plus emojis. Yes, your database can store that pile of poop emoji (💩). You’re welcome.
The Ghost of ASCII Past: SQL_ASCII
Now, a horror story. You might stumble upon SQL_ASCII in older PostgreSQL docs or, heaven forbid, in a legacy system. This is not so much an encoding as it is a cry for help. SQL_ASCII means “I will blindly store whatever bytes you give me and pretend everything is fine.” It offers zero validation. If you send it UTF-8 bytes, it will store them. If you send it a video of a cat playing the piano, it will try to store it in a text field and produce glorious, irreversible nonsense.
-- DO NOT RUN THIS ON A PRODUCTION DATABASE. SERIOUSLY.
-- This is how you create a time bomb.
CREATE DATABASE bad_ideas_db ENCODING 'SQL_ASCII';
-- You can insert garbage, and it will "work"...
INSERT INTO some_table (text_column) VALUES (E'\x00\x01\x02ffd8ffe0ThisIsTheStartOfAJPGFile');
The only reason to use SQL_ASCII is if you have a pre-Unicode application that depends on a specific 8-bit encoding and you’re too terrified to migrate it. For everyone else, it’s a hard no.
Why UTF-8 is Non-Negotiable
UTF-8 is brilliant because it’s variable-length. Common characters (like those in the ASCII set) are stored in a single byte, while less common ones use two, three, or up to four bytes. It’s efficient and comprehensive. Setting your database to UTF-8 is the first and most important step.
-- This is how you do it right. Create the DB with UTF-8.
CREATE DATABASE sensible_db ENCODING 'UTF8';
-- Double-check your encoding! Connect to your DB and run:
SELECT character_set_name FROM information_schema.character_sets;
-- You should see 'UTF8' smiling back at you.
But here’s the kicker: just because your database is UTF-8 doesn’t mean your connection to it is. Your client library (like libpq for Python’s psycopg2) must also set the client encoding to UTF-8. Most modern drivers do this by default, but it’s worth verifying. If this gets out of sync, you’ll see the infamous invalid byte sequence for encoding "UTF8" error. This usually means a client is sending Latin-1 or Win-1252 encoded text but the database is expecting UTF-8.
The Devil in the Details: Collations (LC_COLLATE and LC_CTYPE)
So you’ve got UTF-8. Great. Now, how does the database sort and compare those text strings? That’s what collations are for. Is 'ä' equal to 'a'? Does 'Z' come before or after 'a'? (Spoiler: it depends.)
Collations are set when you create the database cluster (initdb) and are incredibly difficult to change later. They are split into two main settings:
LC_COLLATE: Dictates the sort order. This is whyORDER BY text_columnmight behave differently on a German server vs. an American one.LC_CTYPE: Dictates character classification (e.g., what is considered a letter, what is uppercase/lowercase).
-- See your current settings
SELECT lc_collate, lc_ctype FROM pg_database WHERE datname = current_database();
-- lc_collate | lc_ctype
--------------+----------
-- en_US.UTF-8| en_US.UTF-8
The most important thing to know is that indexes built on text columns rely on the collation. If you change the collation after building indexes, they become corrupt because the sort order they rely on has fundamentally changed. You can’t change it on the fly.
The One Weird Trick: Column-Level COLLATE
Need a specific sort order for one query but not the whole database? PostgreSQL lets you override the collation at the column level, or even directly in a query.
-- For a column that must always sort in German order
CREATE TABLE german_users (
name text COLLATE "de_DE"
);
-- Or for a one-off query, like generating a case-insensitive sorted report
SELECT brand_name FROM products
ORDER BY brand_name COLLATE "en_US";
This is incredibly powerful. It allows you to respect local sorting rules for specific data without rebuilding your entire database cluster. Just remember that using a different collation in a WHERE clause or JOIN can prevent the use of standard indexes, as the comparison rules are different. You’d need an index that specifically uses that collation.
The bottom line? Always use UTF8 encoding. Never use SQL_ASCII. Be acutely aware of your collation settings at cluster creation time, because you’re stuck with them. And use column-level COLLATE for those edge cases where you need a different dictionary for sorting. Get this right now, and you’ll never be the person explaining why the user’s last name “Silva” is suddenly appearing after “Zuckerberg.”