29.7 Generating a search index (Lunr.js / Pagefind)

Right, so you’ve built this beautiful, content-rich site. I’m proud of you. But now your readers are going to want to find things. Scrolling through pages is for chumps and people who still use AOL dial-up. We’re building a search index.

This isn’t about slapping a Google Custom Search bar on there and calling it a day. That’s lazy, and it hands your users’ queries over to a third party. We’re going to build our own client-side search. It’s fast (no network round trip per query), it’s private, and it gives you total control. The two heavy hitters in this space are Lunr.js and Pagefind. I’ll show you both because they represent two very different, very valid philosophies.

The Core Concept: Indexing is Pre-Work

Here’s the secret they don’t tell you in most tutorials: the real work happens at build time, not in the browser. Client-side search doesn’t mean your user’s laptop compiles the index from scratch when they hit the search page. Oh god, no. That would be a performance nightmare.

Think of it like this: you’re the librarian. At build time (when you run eleventy or hugo or whatever), you walk through every book (page) in your library (site), you note down the title, the author, the content, and the location (URL) on the shelf. You then write all this down in a highly optimized, easily searchable card catalog (a JSON file). When a visitor comes in, you just hand them the card catalog. Their browser does the actual looking up, which is lightning fast because all the hard work is already done.

Generating the Index with Lunr.js

Lunr is the old reliable. It’s a powerful, flexible search library that you wire up yourself. It gives you immense control, which means you also get immense responsibility to not screw it up.

First, you need to gather all your content into one data set. In Eleventy, a global data file does the job: put a file, let’s call it searchIndex.js, in your _data directory, and whatever it returns is available to your templates as searchIndex.

// _data/searchIndex.js
const { promises: fs } = require('fs');
const path = require('path');
// gray-matter is the front matter parser Eleventy itself uses; list it in your
// own package.json rather than leaning on it as a transitive dependency.
const matter = require('gray-matter');

module.exports = async function() {
  // Let's get all our markdown content, for example
  const contentDir = path.join(__dirname, '../src/content/posts/');
  const files = await fs.readdir(contentDir);
  const markdownFiles = files.filter(file => file.endsWith('.md'));

  // Map over each file to create our searchable objects
  const searchRecords = await Promise.all(
    markdownFiles.map(async (file) => {
      const filePath = path.join(contentDir, file);
      const fileContents = await fs.readFile(filePath, 'utf8');
      // matter() splits the file into parsed front matter (data) and the body (content)
      const { data, content } = matter(fileContents);

      return {
        // Fall back to empty strings so the JSON output never contains undefined
        title: data.title || '',
        description: data.description || '',
        content, // Still raw Markdown at this point
        url: `/posts/${path.basename(file, '.md')}/`, // Build your URL slug
      };
    })
  );

  return searchRecords;
};

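Now, in the template that writes the actual file (e.g., a Nunjucks file called search.njk that builds search.json), you serialize this data as JSON. Lunr itself doesn’t show up until the browser. Two things to watch: the front matter has to be the very first thing in the file, and Nunjucks has no json filter unless you register one yourself. Its built-in dump filter is the one that calls JSON.stringify, and safe keeps the quotes from being HTML-escaped.

---
permalink: /search.json
---
{
  "index": [
    {% for post in searchIndex %}
      {
        "title": {{ post.title | dump | safe }},
        "description": {{ post.description | dump | safe }},
        "content": {{ post.content | dump | safe }},
        "url": {{ post.url | dump | safe }}
      }{% if not loop.last %},{% endif %}
    {% endfor %}
  ]
}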

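Finally, in your client-side JavaScript, you fetch this search.json file, hand the records to Lunr, and hook the result up to your search input. This is where you can customize the hell out of it: boosting titles over content, adding custom fields, etc. For a small site, letting the browser build the index from those records is fine. Once the site grows, stick to the build-time rule from earlier: build the index in a Node script, write out JSON.stringify(index), and have the browser restore it by passing the parsed JSON to lunr.Index.load(). The pitfall here is complexity. It’s easy to over-engineer and end up with a massive index file that hurts your load time. Be judicious about what you index.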
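To make that last step concrete, here’s a minimal sketch of the client side, not the one true wiring. It assumes Lunr is already on the page as the global lunr (say, from a script tag), that search.json has the { "index": [...] } shape from the template above, and that your markup has elements with the ids search-input and search-results. Those two ids, the boost values, and the function names buildIndex, runSearch, and initSearch are all made up for illustration.

```javascript
// Build a Lunr index from the records. `lunr` is passed in so the same
// function works with the browser global or with require('lunr') in Node.
function buildIndex(lunr, records) {
  return lunr(function () {
    this.ref('url');                         // search() hands this back as `ref`
    this.field('title', { boost: 10 });      // title hits outrank body hits
    this.field('description', { boost: 5 });
    this.field('content');
    records.forEach((record) => this.add(record));
  });
}

// Lunr results only carry a ref and a score, so map them back to full records.
function runSearch(index, records, query) {
  const byUrl = new Map(records.map((record) => [record.url, record]));
  return index.search(query).map((hit) => byUrl.get(hit.ref));
}

// Browser wiring. `#search-input` and `#search-results` are hypothetical ids.
async function initSearch() {
  const response = await fetch('/search.json');
  const { index: records } = await response.json();
  const index = buildIndex(lunr, records);
  const input = document.querySelector('#search-input');
  const list = document.querySelector('#search-results');

  input.addEventListener('input', () => {
    list.innerHTML = '';
    const query = input.value.trim();
    if (!query) return;

    let matches = [];
    try {
      matches = runSearch(index, records, query);
    } catch (err) {
      // Lunr throws on malformed query syntax (e.g. a dangling "title:")
      return;
    }

    for (const match of matches) {
      const item = document.createElement('li');
      const link = document.createElement('a');
      link.href = match.url;
      link.textContent = match.title;
      item.append(link);
      list.append(item);
    }
  });
}

// Only wire things up when there is a DOM to wire up to.
if (typeof document !== 'undefined') {
  document.addEventListener('DOMContentLoaded', initSearch);
}
```

Passing lunr in as an argument looks fussy, but it means the same buildIndex works unchanged in a Node script if you later decide to pre-build the index.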

Embracing Simplicity with Pagefind

Pagefind is the new kid on the block, and it’s brilliant because it’s opinionated. It says, “Stop messing around with config and just give me your built HTML files.” It runs after your static site generator has done its job, crawling the actual output. This is genius for one huge reason: your search index is guaranteed to match what your users actually see, not some pre-rendered markdown concept of it.

Install the CLI (npm install -g pagefind), and run it pointed at your build output directory:

pagefind --site dist

That’s it. No, seriously. It will spider all your HTML, extract the text, and write an optimized, split index plus the necessary JavaScript and CSS into a pagefind directory within your dist. (Releases before 1.0 spelled the flag --source and named the directory _pagefind, so adjust the paths below if you’re pinned to an old version.) To add it to your site, you just drop this snippet in:

<link href="/pagefind/pagefind-ui.css" rel="stylesheet">
<script src="/pagefind/pagefind-ui.js"></script>
<div id="search"></div>
<script>
    window.addEventListener('DOMContentLoaded', (event) => {
        new PagefindUI({ element: "#search" });
    });
</script>

The biggest “pitfall” is one of philosophy: you surrender control. You can’t easily index data that isn’t in the final HTML. But for most sites, this is not just fine, it’s preferable. It eliminates an entire class of build-time bugs. The Pagefind team made a questionable choice in hiding the advanced configuration a bit, but it’s all there if you need it, including the data-pagefind-body and data-pagefind-ignore attributes for choosing which parts of a page get indexed.
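If the control you miss is over the results UI rather than the index, Pagefind also ships a small JavaScript API in pagefind.js, the file that sits next to pagefind-ui.js in the generated directory. Here’s a rough sketch of calling it yourself. The function name searchPages and its limit parameter are hypothetical, and the result fields shown (url, meta.title, excerpt) are the ones I’d expect from a default build, so check them against the version you have installed.

```javascript
// `pagefind` is the module object you get from a dynamic import() of the
// generated pagefind.js file. It is passed in so this function stays testable.
async function searchPages(pagefind, term, limit = 5) {
  const search = await pagefind.search(term);

  // Each result is a lightweight handle; calling data() loads that page's
  // fragment, so only resolve as many as you intend to show.
  const pages = await Promise.all(
    search.results.slice(0, limit).map((result) => result.data())
  );

  return pages.map((page) => ({
    url: page.url,
    title: (page.meta && page.meta.title) || '',
    excerpt: page.excerpt, // HTML string with the matched terms marked up
  }));
}
```

In the browser you’d load the module with a dynamic import() of that pagefind.js file, pass it in, and render the returned objects however you like.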

The Universal Best Practice: Keep it Lean

No matter which tool you choose, this isn’t a data hoarding contest. You don’t need to index every single word on every single page. For each record, ask yourself: “If someone searches for this, will the result be relevant?” Indexing the entire text of your legal disclaimer? Don’t. Indexing the product name, description, and category? Yes.
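As a rough illustration of trimming before indexing, here’s a hypothetical helper you could run over each record in the Lunr data file. The name toLeanRecord, the 2000-character cap, and the regexes are assumptions for the sketch; the regexes are a crude approximation of Markdown, not a parser.

```javascript
// Hypothetical helper: shrink one search record before it goes into the index.
function toLeanRecord(record, maxChars = 2000) {
  const text = String(record.content || '')
    .replace(/`{3}[\s\S]*?`{3}/g, ' ')          // drop fenced code blocks
    .replace(/!\[[^\]]*\]\([^)]*\)/g, ' ')      // drop images entirely
    .replace(/\[([^\]]*)\]\([^)]*\)/g, '$1')    // keep link text, drop the URL
    .replace(/^\s*(#{1,6}|>)\s+/gm, '')         // heading and blockquote markers
    .replace(/[*`~]/g, '')                      // emphasis, inline code, strikethrough marks
    .replace(/\s+/g, ' ')                       // collapse whitespace
    .trim();

  // Copy every other field untouched and cap the body length.
  return { ...record, content: text.slice(0, maxChars) };
}
```

You’d call it as searchRecords.map((record) => toLeanRecord(record)) just before returning from the data file. Whether to drop code blocks is a judgment call: on a programming blog, they may be exactly what people search for.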

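The weight of your index is a tax every user pays before the first result shows up. With Lunr that means the whole JSON file; Pagefind splits its index into chunks and fetches only the ones a query needs, but a bloated site still means bloated chunks. Make it count. Lunr’s default pipeline handles stemming (“run” matches “running”) and stop-word removal (“the”, “a”, “an”), and Pagefind stems words too, which helps keep things efficient. Your job is to not feed them garbage in the first place. Now go build a search that doesn’t suck.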