36.2 Content Ingestion: Reading, Parsing, and Front Matter Decoding

Right, let’s get our hands dirty. You’ve told Hugo where your content is, and you’ve run hugo server. The first thing it does is the most crucial: it has to actually read your files and figure out what the hell they are. This isn’t just a simple file copy; it’s a full-on archaeological dig, and Hugo is the over-caffeinated professor who has to categorize every artifact before the museum (your public directory) opens.

The process breaks down into three distinct phases: reading the raw bytes off the disk, parsing the content format (like Markdown), and, most importantly, decoding the front matter, the block of metadata at the top of each content file that holds all the magic. Get this part wrong, and your build is either hopelessly broken or, worse, subtly wrong in ways that’ll make you tear your hair out later.

The FileSystem Abstraction: It’s Not Just Your Disk

First, Hugo doesn’t just slam open files directly from your OS. It goes through a filesystem abstraction layer (the Afero library). Why? Because your content might not be sitting in one directory on your local disk. It could be in an in-memory filesystem for tests, mounted from a theme or module, or, in the future, fetched from a remote source. This abstraction keeps the rest of the ingestion process blissfully unaware of where the bytes came from. The important part for you is that it walks the content/ directory recursively, and every file it finds with a recognized content extension (.md, .html, .org, etc.) becomes a Page object in memory. Anything else is not a page: it’s treated as a plain file (a page resource, if it lives in a page bundle) and published without being rendered.
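To make the walk concrete, here is a minimal sketch in Python (Hugo itself is written in Go, so this is a model of the idea, not its code). The extension set, classify_content_tree, and demo are made-up names for illustration, and the sketch reads the real disk through os.walk rather than through any abstraction layer.

```python
import os
import tempfile

# Hypothetical extension list, for illustration only; the formats Hugo
# actually recognizes depend on its version and configuration.
CONTENT_EXTENSIONS = {".md", ".markdown", ".html"}


def classify_content_tree(root):
    """Walk root recursively and split files into pages and other files."""
    pages, others = [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            rel = rel.replace(os.sep, "/")  # normalize to forward slashes
            ext = os.path.splitext(name)[1].lower()
            if ext in CONTENT_EXTENSIONS:
                pages.append(rel)
            else:
                others.append(rel)
    return sorted(pages), sorted(others)


def demo():
    """Build a throwaway content tree and classify it."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "posts"))
        for rel in ("_index.md", "posts/hello.md", "posts/cover.png"):
            with open(os.path.join(root, *rel.split("/")), "w") as fh:
                fh.write("placeholder")
        return classify_content_tree(root)


if __name__ == "__main__":
    pages, others = demo()
    print("pages:", pages)
    print("other files:", others)
```

Swapping os.walk for an interface with the same shape is all it would take to point this at an in-memory tree instead, which is the whole argument for the abstraction.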

The Front Matter Multiverse: YAML, TOML, or JSON?

This is where the real fun begins. Hugo needs to split your file into two parts: the metadata (front matter) and the actual content. It does this by checking how the file opens. If it starts with ---, it assumes YAML. If it starts with +++, it assumes TOML. If it starts with {, well, you’d better be using JSON, you beautiful masochist.

Here’s the critical part: the front matter must be the very first thing in your file. Put any stray text ahead of the opening delimiter and Hugo will quietly treat your entire file as content, which means no title, no date, no nothing. Recent Hugo versions appear to tolerate leading whitespace and a byte-order mark, but don’t count on it: keep the delimiter on line one, column one. It’s the most common “why isn’t my page building?!” pitfall.
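The detection rule can be sketched in a few lines of Python. This is a deliberately strict model, not Hugo’s parser: split_front_matter is a hypothetical helper, it insists that the delimiter sits on the very first line, and it leans on the standard library’s JSON decoder to find where a JSON block ends.

```python
import json

# Opening delimiter -> front matter format.
DELIMITERS = {"---": "yaml", "+++": "toml"}


def split_front_matter(text):
    """Return (format, front_matter, body); format is None if none is found."""
    lines = text.splitlines(keepends=True)
    if not lines:
        return None, "", text
    first = lines[0].rstrip("\r\n")

    if first in DELIMITERS:
        # Look for the matching closing delimiter on a line of its own.
        for i in range(1, len(lines)):
            if lines[i].rstrip("\r\n") == first:
                front = "".join(lines[1:i])
                body = "".join(lines[i + 1:])
                return DELIMITERS[first], front, body
        return None, "", text  # never closed: treat everything as content

    if first.startswith("{"):
        try:
            # raw_decode reports where the JSON object ends.
            _obj, end = json.JSONDecoder().raw_decode(text)
        except json.JSONDecodeError:
            return None, "", text
        return "json", text[:end], text[end:]

    # Anything else on line one means no front matter at all.
    return None, "", text


if __name__ == "__main__":
    print(split_front_matter('+++\ntitle = "Hi"\n+++\n\nBody\n'))
    print(split_front_matter("Oops\n---\ntitle: Hi\n---\n"))
```

The second call shows the pitfall: one stray line ahead of the delimiter and the whole file comes back as body, with no front matter at all.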

Let’s look at a properly formatted file with TOML front matter (+++):

+++
title = "Why Static Sites Are Better and Your CMS is Wrong"
date = 2023-10-27T18:45:00-04:00
draft = false
tags = ["hugo", "rant", "web-dev"]
author = "Your Humble Narrator"
[taxonomies]
category = "tutorials"
+++

## The Actual Content Starts Here

Now that we've properly separated the config from the content, Hugo can actually do its job. This content will be processed as Markdown.

And here’s the same thing, but with YAML (---):

---
title: "Why Static Sites Are Better and Your CMS is Wrong"
date: 2023-10-27T18:45:00-04:00
draft: false
tags:
  - hugo
  - rant
  - web-dev
taxonomies:
  category: tutorials
author: Your Humble Narrator
---

## The Actual Content Starts Here

The content is identical, only the config syntax changed.

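Hugo parses this front matter block into a key-value map and attaches it to the Page object. This data structure is your golden ticket. Predefined fields such as title and date get their own accessors (.Title, .Date); everything else, including custom keys like author, is reachable through .Params (.Params.author, .Params.tags).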
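As a rough model of that attachment step, the sketch below decodes a JSON front matter block (JSON because Python’s standard library can parse it) and hangs the result on a toy Page. The Page class and page_from_front_matter are invented for illustration and are far simpler than Hugo’s real types; keeping the whole map under params is a simplification too.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, Optional


@dataclass
class Page:
    """Toy stand-in for a page: a few typed fields plus the full map."""
    title: str = ""
    date: Optional[datetime] = None
    draft: bool = False
    params: Dict[str, Any] = field(default_factory=dict)


def page_from_front_matter(raw_json):
    """Decode a JSON front matter block and attach it to a Page."""
    data = json.loads(raw_json)
    date = data.get("date")
    return Page(
        title=str(data.get("title", "")),
        date=datetime.fromisoformat(date) if date else None,
        draft=bool(data.get("draft", False)),
        params=data,  # simplification: keep every key here
    )


FRONT_MATTER = """
{
  "title": "Why Static Sites Are Better and Your CMS is Wrong",
  "date": "2023-10-27T18:45:00-04:00",
  "draft": false,
  "tags": ["hugo", "rant", "web-dev"],
  "author": "Your Humble Narrator"
}
"""

if __name__ == "__main__":
    page = page_from_front_matter(FRONT_MATTER)
    print(page.title, page.date, page.params["tags"])
```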

The Content Parsing: Markdown and Beyond

Once the front matter is neatly sliced off, Hugo is left with the content body. By default, it passes this through the Goldmark Markdown renderer. But here’s the clever bit: rendering sits behind an internal converter abstraction, so Markdown isn’t the only game in town. Hugo can hand formats like AsciiDoc, Pandoc, and reStructuredText to external helper programs, provided you have them installed. What you can’t do is drop in your own renderer as a plugin; that means patching Hugo itself. Whichever route the body takes, the result is stored as a blob of HTML, ready to be slotted into your layout templates.

But wait, what about those .Content vs .Plain vs .RawContent methods you see? This is the difference:

.RawContent is the raw, un-rendered content body after the front matter has been removed. No Markdown processing, just the raw text.
.Plain is a rough approximation of the rendered text with all HTML tags stripped out. Useful for descriptions or SEO stuff.
.Content is the fully rendered HTML, the final product. This is what you’ll use 99% of the time.
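The relationship between the three is easier to see with data in hand. In the sketch below the rendered HTML is written out by hand (Python’s standard library has no Markdown renderer), and plain_text is a hypothetical helper that approximates .Plain by throwing away the tags; Hugo’s own stripping may treat whitespace differently.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def plain_text(html):
    """Approximate .Plain: the rendered HTML with every tag removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered
    return "".join(parser.parts)


# What .RawContent would hold: the body exactly as written.
RAW_CONTENT = "## Hello\n\nSome *emphasis* here.\n"

# What .Content would hold after rendering (written by hand here).
CONTENT = "<h2>Hello</h2>\n<p>Some <em>emphasis</em> here.</p>\n"

if __name__ == "__main__":
    print(repr(RAW_CONTENT))
    print(repr(CONTENT))
    print(repr(plain_text(CONTENT)))
```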

The Page State Machine: Drafts, Dates, and Expiry

After ingestion, Hugo doesn’t just blindly build every page it finds. It runs each page through a simple state machine based on the front matter values you provided.

Is draft = true? Unless you pass -D (--buildDrafts), it’s skipped. It’s dead to Hugo.
Is publishDate in the future? The page is on hold. It isn’t built at all, so it won’t show up in RSS feeds, section listings, or anywhere else until the clock strikes and you rebuild, or until you pass -F (--buildFuture).
Is expiryDate in the past? The page is dropped, even if it’s not a draft, unless you pass -E (--buildExpired). Use this for time-sensitive content you want to automatically vanish.
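Those three checks boil down to one small predicate. The sketch below is a model, not Hugo’s code: should_build is a made-up name, its keyword arguments stand in for the --buildDrafts, --buildFuture, and --buildExpired flags, and it only looks at publishDate and expiryDate as already-parsed datetimes, ignoring any fallback to date.

```python
from datetime import datetime, timezone


def should_build(front_matter, now, build_drafts=False,
                 build_future=False, build_expired=False):
    """Decide whether a page makes it into the build."""
    # Drafts are skipped unless explicitly requested.
    if front_matter.get("draft", False) and not build_drafts:
        return False

    # Pages scheduled for the future wait their turn.
    publish = front_matter.get("publishDate")
    if publish is not None and publish > now and not build_future:
        return False

    # Expired pages are dropped.
    expiry = front_matter.get("expiryDate")
    if expiry is not None and expiry < now and not build_expired:
        return False

    return True


if __name__ == "__main__":
    now = datetime(2024, 1, 1, tzinfo=timezone.utc)
    scheduled = {"publishDate": datetime(2024, 6, 1, tzinfo=timezone.utc)}
    print(should_build(scheduled, now))
    print(should_build(scheduled, now, build_future=True))
```

Passing now in as an argument rather than reading the clock inside the function keeps the rule easy to test against fixed dates.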

The key insight here is that this filtering can only happen after ingestion: Hugo has to read the file and decode its front matter before it knows whether the page is a draft, scheduled, or expired. What it does not do is keep excluded pages lying around for your templates. In a normal build they’re missing from .Pages, .Site.RegularPages, and friends, exactly as if the file weren’t there; pass the matching build flag and they come back. This design is what makes Hugo so handy for scheduled content. It’s not just copying files; it’s building a content model and then deciding what to publish.