21.1 CREATE INDEX: Basic Syntax and Options

Right, let’s talk about creating indexes. This is where you stop politely asking the database to find your data and start giving it a detailed map. The CREATE INDEX statement is your cartography tool. It seems simple on the surface, but the devil—and the real performance gains—are in the details.

The most basic form is a thing of beauty in its simplicity. You tell it the name of the index, the table it belongs to, and the column(s) you want it to track.

CREATE INDEX idx_customer_last_name ON customers (last_name);

Boom. Done. The database now has a sorted list of all last_name values in the customers table, along with a pointer back to the actual row. The next time you run SELECT * FROM customers WHERE last_name = 'Smith';, the database engine can skip the full table scan, go straight to the ‘S’ section in its shiny new index, and find all the Smiths in a fraction of the time. It feels like magic, but it’s just computer science.

The Perils of Non-Unique Indexes

Notice I didn’t say CREATE UNIQUE INDEX. The default index is non-unique, meaning it happily allows duplicate values. This is fine, expected even, for a last name column. But this is your first “gotcha”: an index does not inherently enforce uniqueness. If you need to prevent duplicates in a column (like a username or a product SKU), you must explicitly create a unique index or a primary key. The database will politely use a non-unique index to find rows quickly, but it will trust the data you gave it. If you tell it there are a hundred identical values, it will just assume you’re a bad person and store all hundred pointers.

Multicolumn Indexes: Order is Everything

This is the big one. The concept that most beginners get wrong. You can index multiple columns, but this isn’t a bag of groceries where order doesn’t matter. A multicolumn index on (last_name, first_name) is not the same as an index on (first_name, last_name).

Think of it like a phone book. It’s sorted by last name first. Within each last name, it’s sorted by first name. This is brilliantly efficient for a query like:

SELECT * FROM customers WHERE last_name = 'Smith' AND first_name = 'John';

It finds the Smiths, then within that group, finds the Johns. It’s also great for finding just last_name = 'Smith'. But it is utterly useless for finding first_name = 'John' on its own. The phone book isn’t sorted by first name, so you’d have to scan every page. This is what the database would have to do—a full index scan, which is only marginally better than a table scan.

So the rule of thumb: put the most selective, most frequently queried column first. The order in your WHERE clause doesn’t matter; the database’s query planner is smart enough to figure that out. What matters is the order you define in the index.

The `INCLUDE` Clause: A Performance Cheat Code

Here’s a fantastic feature that arrived to save us from ourselves. Often, you’ll run a query that uses an index to quickly find rows, but then the database has to do an extra lookup back to the main table to get some other column you’re selecting. This is a “heap lookup,” and it kills performance for large result sets.

Let’s say you have a common query:

SELECT last_name, first_name, email FROM customers WHERE last_name = 'Smith';

Your index on (last_name) is used to find the rows, but then it has to jump back to the table to fetch the email for each Smith. Enter INCLUDE, which lets you tack non-key data directly onto the index leaf pages.

CREATE INDEX idx_customer_last_name_include_email ON customers (last_name) INCLUDE (email);

Now, for that query, the database never has to leave the index! It finds all the Smiths in the index and the email is right there, nestled alongside the row pointer. It’s a covering index, and it’s a beautiful thing. Use INCLUDE for those often-selected but rarely filtered-on columns to avoid those expensive lookups.

Where to Put Them: Tablespaces and Your Sanity

You can, if you’re feeling fancy, specify a TABLESPACE for your index. Why would you do this? Mostly for storage management. By default, it just gets plopped in the same tablespace as its parent table. But if you have a massive, heavily accessed table, you might want to put its indexes on a separate, faster disk array (like screaming NVMe SSDs) to spread the I/O load. It’s an advanced tuning trick, but it’s good to know it’s there. For 99% of databases, the default is fine.

CREATE INDEX idx_gigantic_table ON gigantic_table (some_column) TABLESPACE fast_ssd_tablespace;

The `CONCURRENTLY` Option (PostgreSQL)

This one is so important I have to mention it, even though it’s PG-specific. Normally, when you create an index, it takes a lock on the table that blocks writes (inserts, updates, deletes). For a large table, this can mean a maintenance window outage. Not great.

PostgreSQL offers CREATE INDEX CONCURRENTLY. It builds the index without blocking writes. It’s slower and uses more resources, but it keeps your application online. The trade-off is absolutely worth it for production systems. If your database supports a similar option, use it. Your on-call phone will thank you.

CREATE INDEX CONCURRENTLY idx_no_downtime ON orders (customer_id);

The takeaway? Creating an index is easy. Creating the right index is an art. It requires thinking about your queries, understanding the data, and anticipating how the database engine will use the tools you give it. Now go forth and map.

The Perils of Non-Unique Indexes

Multicolumn Indexes: Order is Everything

The INCLUDE Clause: A Performance Cheat Code

Where to Put Them: Tablespaces and Your Sanity

The CONCURRENTLY Option (PostgreSQL)

The `INCLUDE` Clause: A Performance Cheat Code

The `CONCURRENTLY` Option (PostgreSQL)