27.2 robots.txt: Controlling Crawler Access

Right, let’s talk about robots.txt. This is the file where you, the website owner, get to politely ask search engine crawlers (or “robots”) to please stay out of certain parts of your digital house. I say “politely ask” because that’s the crucial bit everyone forgets: a robots.txt file is a set of guidelines, not a set of enforced rules. It’s the “Employees Only” sign on a door. It keeps honest people honest, but a burglar isn’t going to read it and suddenly decide to be a law-abiding citizen. Malicious scrapers and some less-scrupulous bots will merrily ignore it. Its real audience is the well-behaved crawlers from Google, Bing, and the like.

You place this file at the root of your domain (https://www.yoursite.com/robots.txt). This is non-negotiable. A crawler will only look for it there. If it’s in /some-folder/robots.txt, it might as well not exist.
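If you want to see what “only at the root” means from the crawler’s side, here is a minimal sketch of how a bot might work out which robots.txt governs a given page. The function name robots_url and the blog.yoursite.com host are made up for illustration. The one real-world detail worth noticing is that the file is looked up per scheme, host, and port, so a subdomain needs its own copy.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port); replace the path
    # and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://www.yoursite.com/some-folder/page.html?x=1"))
# https://www.yoursite.com/robots.txt

# Hypothetical subdomain: it gets its own robots.txt, not the www one.
print(robots_url("https://blog.yoursite.com/post"))
# https://blog.yoursite.com/robots.txt
```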

The Basic Syntax: It’s Just Two Main Rules

The language itself is mercifully simple. It has two primary directives: User-agent to specify which robot you’re talking to, and Disallow to tell them what to avoid. You can also use Allow as an exception to a broader Disallow rule, which is where things get useful.

Let’s look at the most common example, the “please crawl everything” setup:

User-agent: *
Disallow:

See that empty Disallow:? That’s the key. You’re saying, “To all user-agents (*), there are no paths I’m disallowing.” It’s a green light. You could also just not have a robots.txt file, but that’s sloppy. This explicitly tells everyone you’ve thought about it and are cool with being crawled.

Now, the classic “stay out of my private stuff” example:

User-agent: *
Disallow: /private-storage/
Disallow: /temp-files/
Disallow: /cgi-bin/

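This tells every crawler to avoid those three directories and anything inside them. Each rule is a prefix match on the URL path, so Disallow: /temp-files/ also covers, say, /temp-files/old/draft.html. Note the leading slash: paths are written from the root of the site. Forgetting that slash is a classic rookie mistake that renders the rule useless.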
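You don’t have to take the leading-slash warning on faith. Here is a quick sketch using Python’s standard-library urllib.robotparser to ask the question a well-behaved crawler would ask before fetching a URL. The bot name ExampleBot, the helper load_rules, and the file notes.txt are assumptions for the demo, and other parsers may treat a malformed rule differently from Python’s.

```python
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


GOOD = """\
User-agent: *
Disallow: /private-storage/
Disallow: /temp-files/
Disallow: /cgi-bin/
"""

# The same first rule with the leading slash forgotten.
BAD = """\
User-agent: *
Disallow: private-storage/
"""

url = "https://www.yoursite.com/private-storage/notes.txt"  # hypothetical file

print(load_rules(GOOD).can_fetch("ExampleBot", url))  # False: the rule applies
print(load_rules(BAD).can_fetch("ExampleBot", url))   # True: the rule never matches
```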

Targeting Specific Crawlers and Using Allow

Sometimes you want to give special instructions to a specific bot. The most common one is Googlebot. Here’s how you’d tell everyone to stay out of /temp-files/ but give Google a special pass to one specific file inside it.

User-agent: *
Disallow: /temp-files/

User-agent: Googlebot
Allow: /temp-files/important-public-report.pdf
Disallow: /temp-files/

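Order matters less than you might expect. A compliant crawler does not work through every block from top to bottom. It picks the one group whose User-agent line matches it most specifically and ignores all the others. So Googlebot skips the User-agent: * block entirely and obeys only its own. That is why the Disallow: /temp-files/ line in the Googlebot block is not redundant: leave it out and Googlebot has no restriction on that folder at all. Within its group, Googlebot applies the most specific matching rule, meaning the longest matching path, which is also what the Robots Exclusion Protocol standard (RFC 9309) specifies. The Allow for the PDF therefore beats the shorter Disallow for the folder no matter which line comes first. Some older or simpler parsers stop at the first rule that matches instead, so listing the Allow line first, as above, is the safe habit.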
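To make the two-step decision concrete, here is a toy sketch: pick a group, then let the longest matching path win. It is an illustration under stated assumptions, not any search engine’s actual code. The GROUPS dictionary is a hand-written stand-in for the parsed file, group selection is simplified to an exact name match, wildcards (* and $) are not handled, and OtherBot and scratch.txt are invented names.

```python
from typing import Dict, List, Tuple

Rule = Tuple[str, str]  # ("allow" or "disallow", path prefix)

# Hand-written stand-in for the example file above.
GROUPS: Dict[str, List[Rule]] = {
    "*": [("disallow", "/temp-files/")],
    "googlebot": [
        ("allow", "/temp-files/important-public-report.pdf"),
        ("disallow", "/temp-files/"),
    ],
}


def select_group(groups: Dict[str, List[Rule]], agent: str) -> List[Rule]:
    """Use the group named for this agent; fall back to '*' only if none exists."""
    return groups.get(agent.lower(), groups.get("*", []))


def is_allowed(rules: List[Rule], path: str) -> bool:
    """Longest matching prefix wins; Allow wins a tie; no match means allowed."""
    best_len, allowed = -1, True
    for kind, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty rule or a non-matching prefix restricts nothing
        if len(prefix) > best_len or (len(prefix) == best_len and kind == "allow"):
            best_len, allowed = len(prefix), kind == "allow"
    return allowed


pdf = "/temp-files/important-public-report.pdf"
print(is_allowed(select_group(GROUPS, "Googlebot"), pdf))                       # True
print(is_allowed(select_group(GROUPS, "Googlebot"), "/temp-files/scratch.txt")) # False
print(is_allowed(select_group(GROUPS, "OtherBot"), pdf))                        # False
```

Python’s own urllib.robotparser is one of the first-match parsers, which is another reason to keep the Allow line above the Disallow it carves an exception out of.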

The Sitemap Directive and Common Pitfalls

You can also tell crawlers where to find your sitemap. This is brilliantly helpful because you’re essentially handing them a map of your entire site. Do this.

User-agent: *
Disallow: /private-storage/
Sitemap: https://www.yoursite.com/sitemap.xml
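For a sense of how a crawler picks that line up, here is a short sketch using Python’s standard-library parser (site_maps() needs Python 3.8 or newer). The Sitemap line is not tied to any User-agent group, so it is read no matter where in the file it sits.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private-storage/
Sitemap: https://www.yoursite.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns None when the file has no Sitemap lines,
# so fall back to an empty list.
sitemaps = parser.site_maps() or []
print(sitemaps)  # ['https://www.yoursite.com/sitemap.xml']
```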

Now, for the pitfalls. Pay attention, because this is where people shoot themselves in the foot.

Blocking CSS and JS: This is the big one. For the love of all that is good, do NOT do this:
```
User-agent: *
Disallow: /css/
Disallow: /js/
```
If you block crawlers from your assets, Google cannot see your site as a user does. It can’t render your fancy layouts and interactive elements properly. This directly harms how your pages are understood and ranked. Modern SEO requires letting bots access your CSS and JS.

Accidentally Blocking Everything: A single typo can stop every compliant crawler from fetching any page on your site, and your search visibility goes with it. This is the nightmare scenario:
```
User-agent: *
Disallow: /
```
That single slash disallows everything. It’s the nuclear option. Use it only where nothing should be crawled at all, like on a staging site, and even there put the site behind a login as well, because a blocked URL can still turn up in search results if other pages link to it. Double-check your file. Triple-check it.

Trying to Hide Private Data: I cannot stress this enough: robots.txt is a terrible way to hide sensitive information. The file itself is public. Anyone can open /robots.txt and read every path you have disallowed. You’re basically publishing a list of “interesting hidden folders right here!” for anyone to see. If something truly needs to be private, use proper authentication and server-side security, not a text file that politely asks bots not to look.
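The first two pitfalls are easy to catch automatically before a file goes live. Here is a rough sketch of a pre-deploy check built on Python’s urllib.robotparser. The function blocked_paths, the bot name ExampleBot, and the asset paths in MUST_BE_CRAWLABLE are all assumptions for illustration, so swap in whatever pages and assets matter on your own site.

```python
from typing import List
from urllib.robotparser import RobotFileParser

# Hypothetical paths that should always stay crawlable on this example site.
MUST_BE_CRAWLABLE = ["/", "/css/site.css", "/js/app.js"]


def blocked_paths(robots_txt: str, paths: List[str], agent: str = "ExampleBot") -> List[str]:
    """Return the subset of `paths` that this robots.txt forbids for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(agent, p)]


NUCLEAR = "User-agent: *\nDisallow: /\n"
NO_ASSETS = "User-agent: *\nDisallow: /css/\nDisallow: /js/\n"
SAFE = "User-agent: *\nDisallow: /private-storage/\n"

for name, text in [("nuclear", NUCLEAR), ("no assets", NO_ASSETS), ("safe", SAFE)]:
    print(name, blocked_paths(text, MUST_BE_CRAWLABLE))
# nuclear ['/', '/css/site.css', '/js/app.js']
# no assets ['/css/site.css', '/js/app.js']
# safe []
```

A non-empty list is your cue to stop and reread the file before it ships. Note that this only checks what the rules say. It does nothing for the third pitfall, which no amount of linting can fix.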

Use the robots.txt file for what it’s good for: guiding friendly crawlers away from things that waste their time (and your bandwidth), like infinite calendar URLs, search result pages, or duplicate content areas. For everything else, especially security, use a real tool for the job.