Robots.txt — What It Does and What It Doesn't

What Robots.txt Actually Is

A robots.txt file is a plain text file at the root of your domain — yoursite.com/robots.txt — that tells web crawlers which parts of your site they should and shouldn't crawl. It was first proposed in 1994 as part of the Robots Exclusion Protocol, and despite being over 30 years old, it remains one of the first things crawlers like Googlebot look for when they arrive at your domain.

The key word is "tells." Robots.txt is advisory — it relies entirely on the crawler choosing to follow it. Well-behaved bots from Google, Bing, Yandex, and others do follow it. Malicious scrapers, spammers, and vulnerability scanners often do not.

The simplest possible robots.txt, one that allows all crawlers to access everything, is two directives plus an optional comment:

# Allow all crawlers to access everything
User-agent: *
Disallow:

The empty Disallow: means "disallow nothing", which allows everything. A missing robots.txt that returns 404 is also fine (crawlers treat it as no restrictions), so this file is not strictly required. An explicit allow-all file still makes your intent unambiguous and gives you a place to add a Sitemap directive.
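To see how a parser reads this file, Python's standard-library urllib.robotparser is a convenient check. This is only a sketch: the helper name build_parser, the BLOCK_ALL counter-example, and the example.com URL are assumptions for illustration, and Python's parser does not match Googlebot in every edge case.

```python
from urllib.robotparser import RobotFileParser

ALLOW_ALL = """\
User-agent: *
Disallow:
"""

# Counter-example (not from the article's snippet): block everything.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    url = "https://example.com/any/page"  # hypothetical URL
    print(build_parser(ALLOW_ALL).can_fetch("Googlebot", url))  # True
    print(build_parser(BLOCK_ALL).can_fetch("Googlebot", url))  # False
```

Parsing from a string keeps the check offline, so you can run it against a draft file before it is ever deployed.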

The Myth That Breaks SEO

The most dangerous misunderstanding about robots.txt: Disallow does not prevent pages from being indexed.

If you block a URL with Disallow: /secret-page, Google will stop crawling that page. But if any other website links to /secret-page, Google still knows it exists. Google may then index the URL and show it in search results without ever crawling its content. The result is typically a bare listing: the URL (or a title guessed from link text), no description, and a link to a page Google can't read.

🚫

Do not use robots.txt to hide content from Google. If you don't want a page in search results, add <meta name="robots" content="noindex"> to the page, or return an X-Robots-Tag: noindex HTTP header. Robots.txt prevents crawling. Noindex prevents indexing. They're different operations.

The correct tool for each job:

Goal	Right tool	Wrong tool
Stop Google from crawling a URL	`Disallow` in robots.txt	—
Stop a page appearing in search results	`noindex` meta tag or header	robots.txt Disallow
Block access to private content	Server-side auth, password protection	robots.txt (it's advisory)
Remove existing content from Google	Google Search Console URL removal tool	robots.txt alone
Tell Google about all your pages	XML sitemap + robots.txt Sitemap directive	—
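The two noindex signals mentioned above can be checked in a few lines of Python. This is a simplified sketch under stated assumptions: the names RobotsMetaFinder and has_noindex are hypothetical, the caller supplies the response headers and HTML, and it ignores crawler-specific forms such as a bot name prefix in the X-Robots-Tag value.

```python
from html.parser import HTMLParser
from typing import Dict, List


class RobotsMetaFinder(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            content = attributes.get("content", "")
            self.directives.extend(d.strip().lower() for d in content.split(","))


def has_noindex(headers: Dict[str, str], html: str) -> bool:
    """Return True if either the header or the meta tag says noindex."""
    header_directives: List[str] = []
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            header_directives.extend(d.strip().lower() for d in value.split(","))
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in header_directives or "noindex" in finder.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
    print(has_noindex({}, page))                           # True
    print(has_noindex({"X-Robots-Tag": "noindex"}, ""))    # True
    print(has_noindex({}, "<html><body>Hi</body></html>"))  # False
```

Remember that a crawler can only see either signal if robots.txt lets it fetch the page in the first place.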

Anatomy of a robots.txt File

A robots.txt file is made of one or more groups. Each group must start with at least one User-agent line, followed by Disallow and/or Allow directives. Lines starting with # are comments. Blank lines separate groups.

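# This is a comment
User-agent: *
Disallow: /admin/
Disallow: /checkout
Allow:    /admin/public-announcements/

# A bot that matches this group ignores the * group above,
# so any rules it should follow must be repeated here
User-agent: Bingbot
Disallow: /admin/
Crawl-delay: 5

Sitemap: https://example.com/sitemap.xml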

User-agent

The User-agent directive names the bot this group applies to. * is a wildcard that matches all bots not covered by a more specific group. Some commonly targeted agents:

User-agent value	Targets
`*`	All bots
`Googlebot`	Google's main web crawler
`Googlebot-Image`	Google Images crawler
`Googlebot-Video`	Google Video crawler
`Bingbot`	Microsoft Bing crawler
`Slurp`	Yahoo crawler
`DuckDuckBot`	DuckDuckGo crawler
`facebookexternalhit`	Facebook link preview crawler
`Twitterbot`	Twitter Card crawler
`AhrefsBot`	Ahrefs SEO tool crawler

If a bot's name matches a specific group and the wildcard group, the specific group takes precedence. The wildcard group is completely ignored for that bot.
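The group-selection rule is easy to confirm with Python's urllib.robotparser, which also falls back to the * group only when no named group matches. The paths /private/ and /drafts/ and the bot name SomeOtherBot are made up for this sketch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: one wildcard group, one Googlebot-specific group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

if __name__ == "__main__":
    # Googlebot uses only its own group, so /private/ is NOT blocked for it.
    print(parser.can_fetch("Googlebot", "/private/report"))     # True
    print(parser.can_fetch("Googlebot", "/drafts/post"))        # False
    # Any other bot falls back to the wildcard group.
    print(parser.can_fetch("SomeOtherBot", "/private/report"))  # False
    print(parser.can_fetch("SomeOtherBot", "/drafts/post"))     # True
```

The practical consequence: if you add a named group for one bot, copy over every rule from the * group that should still apply to it.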

Disallow and Allow

Disallow tells the bot not to fetch any URL whose path starts with the given value. Allow creates an exception, permitting access within an otherwise disallowed tree.

User-agent: *
Disallow: /wp-admin/
Allow:    /wp-admin/admin-ajax.php

In this example, everything under /wp-admin/ is blocked except /wp-admin/admin-ajax.php, which is the WordPress AJAX endpoint needed by some public-facing features.

ℹ️

Priority rule: Google applies the most specific matching rule, where "most specific" means the rule with the longest path. If two matching rules are equally long, Allow wins. So Allow: /cms/public/ beats Disallow: /cms/ for URLs under /cms/public/ because its path is longer.
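The priority rule can be sketched as a small function. This is an illustration, not Google's implementation: the name is_allowed and the (directive, path) tuple format are assumptions, and it handles plain path prefixes only (no wildcards). It is written by hand because Python's built-in urllib.robotparser applies the first matching rule in file order rather than the longest one.

```python
from typing import List, Tuple

Rule = Tuple[str, str]  # (directive, path), e.g. ("Disallow", "/cms/")


def is_allowed(rules: List[Rule], path: str) -> bool:
    """Longest matching rule wins; on a tie, Allow wins; no match means allowed."""
    best_length = -1
    allowed = True
    for directive, rule_path in rules:
        if not rule_path or not path.startswith(rule_path):
            continue  # empty or non-matching rules have no effect
        is_allow = directive.lower() == "allow"
        longer = len(rule_path) > best_length
        tie_won_by_allow = len(rule_path) == best_length and is_allow
        if longer or tie_won_by_allow:
            best_length = len(rule_path)
            allowed = is_allow
    return allowed


if __name__ == "__main__":
    rules = [("Disallow", "/cms/"), ("Allow", "/cms/public/")]
    print(is_allowed(rules, "/cms/public/news"))  # True
    print(is_allowed(rules, "/cms/settings"))     # False
    print(is_allowed(rules, "/blog/post"))        # True
```

Because the decision depends on path length rather than order, the rules can appear in any order in the file and the result stays the same.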

Wildcards — * and $

Google (and most modern crawlers) support two wildcard characters in path values:

* — matches zero or more characters at any position
$ — anchors the pattern to the end of the URL

User-agent: *
Disallow: /*.pdf$           # Block URLs ending in .pdf
Disallow: /search*          # Block /search, /search-results, /searching… (the trailing * is optional)
Disallow: /*?*              # Block all URLs with any query string
Disallow: /*?s=*            # Block WordPress search query URLs (already covered by the rule above)

The original 1994 robots.txt spec didn't include wildcard support. It was added later by individual search engines and only standardized in RFC 9309 (2022), so some crawlers still don't support it. Googlebot and Bingbot support both characters.
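One way to reason about these patterns is to translate them into regular expressions. The sketch below does that under simple assumptions: the function names are hypothetical, * becomes "any run of characters", a trailing $ anchors the end, and everything else is matched literally from the start of the path. Real crawlers may differ in details such as percent-encoding.

```python
import re


def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces, then join them with ".*" where each * stood.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern: str, url_path: str) -> bool:
    """True if the pattern matches from the start of the path."""
    return robots_pattern_to_regex(pattern).match(url_path) is not None


if __name__ == "__main__":
    print(matches("/*.pdf$", "/files/report.pdf"))             # True
    print(matches("/*.pdf$", "/files/report.pdf?download=1"))  # False
    print(matches("/search*", "/search-results"))              # True
    print(matches("/*?*", "/products?color=red"))              # True
    print(matches("/*?*", "/products"))                        # False
```

The second result is the one that surprises people: with $, a PDF URL that carries a query string no longer matches.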

The Sitemap Directive

The Sitemap directive is technically separate from the robots exclusion rules. It's just a convenient place to tell crawlers where your XML sitemap lives. Because it is independent of User-agent groups, it can appear anywhere in the file, it applies to every crawler, and you can list more than one. Use the full absolute URL:

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/news-sitemap.xml

User-agent: *
Disallow:

Google, Bing, and Yandex all read the Sitemap directive. You can also submit your sitemap URL directly in Google Search Console and Bing Webmaster Tools — doing both is harmless and recommended.
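If you want to confirm which sitemaps a parser picks up from your file, Python's urllib.robotparser exposes them through site_maps() (available in Python 3.8 and later). A minimal sketch using the example file above:

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/news-sitemap.xml

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of URLs, or None if the file lists no sitemaps.
sitemaps = parser.site_maps() or []

if __name__ == "__main__":
    for sitemap_url in sitemaps:
        print(sitemap_url)
```

Both URLs are returned even though they appear before any User-agent line, which matches the point that the directive is independent of groups.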

Crawl-delay — Use It Carefully

Crawl-delay tells a bot how many seconds to pause between requests to your server. It's useful for limiting aggressive crawlers that can overload shared hosting. There's one major caveat: Googlebot ignores it completely.

Google retired the crawl rate limiter in Search Console in early 2024, so there is no longer a setting for Googlebot's crawl rate. Googlebot adjusts its rate automatically based on how your server responds, and temporarily returning 503 or 429 responses makes it slow down. Crawl-delay is honored by Bingbot and many smaller crawlers, but support varies, so check each crawler's documentation.

User-agent: Bingbot
Crawl-delay: 10

User-agent: *
Disallow:
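From the crawler's side, honoring Crawl-delay looks roughly like the sketch below. It reads the value with urllib.robotparser's crawl_delay() method; the helper name delay_for, the bot name ExampleBot, and the one-second fallback are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Bingbot
Crawl-delay: 10

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def delay_for(user_agent: str, default: float = 1.0) -> float:
    """Seconds to wait between requests for this bot (default is an assumption)."""
    delay = parser.crawl_delay(user_agent)  # None if no Crawl-delay applies
    return float(delay) if delay is not None else default


if __name__ == "__main__":
    print(delay_for("Bingbot"))     # 10.0
    print(delay_for("ExampleBot"))  # 1.0
```

A polite crawler would call time.sleep(delay_for(name)) between requests. Nothing forces it to, which is why the directive only helps against bots that choose to cooperate.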

Crawl Budget — Why This Actually Matters

Google allocates a "crawl budget" to each site — an amount of crawling bandwidth based on your site's health, speed, and authority. For most small to medium sites (under ~10,000 pages), crawl budget is rarely a concern. For large sites, it matters a lot.

A well-configured robots.txt helps Google spend its crawl budget on pages that matter. Common crawl budget wasters you should block:

Faceted navigation — filter combinations like /products?color=red&size=M&sort=price create infinite URL variations of the same content
Session IDs in URLs — ?sessionid=abc123 creates unique URLs that are actually duplicate content
Paginated search results — page 47 of a product listing is rarely worth indexing
CMS admin paths — /wp-admin/, /admin/, /.git/
API endpoints — /api/ paths that return JSON, not HTML
Test / staging pages — drafts, preview URLs, internal tools
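To get a rough feel for how much of your crawl activity goes to parameter variations, you could count query parameters across URLs pulled from your server logs. The sketch below assumes you already have a list of crawled URLs; the NOISY_PARAMS set and the function name are hypothetical and should be adapted to your own site.

```python
from collections import Counter
from typing import Iterable
from urllib.parse import parse_qs, urlsplit

# Hypothetical set of parameters that only create duplicate views of a page.
NOISY_PARAMS = {"sessionid", "sort", "color", "size"}


def count_noisy_params(urls: Iterable[str]) -> Counter:
    """Count how often each noisy query parameter appears in the given URLs."""
    counts: Counter = Counter()
    for url in urls:
        query = urlsplit(url).query
        for param in parse_qs(query, keep_blank_values=True):
            if param.lower() in NOISY_PARAMS:
                counts[param.lower()] += 1
    return counts


if __name__ == "__main__":
    crawled = [
        "/products?color=red&size=M&sort=price",
        "/products?sessionid=abc123",
        "/about",
    ]
    for param, hits in count_noisy_params(crawled).most_common():
        print(param, hits)
```

Parameters that dominate the counts are good candidates for a Disallow pattern, provided the pages behind them really are duplicates.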


Real-World Configurations

WordPress

User-agent: *
Disallow: /wp-admin/
Disallow: /?s=
Disallow: /search
Allow:    /wp-admin/admin-ajax.php
# Leave /wp-includes/ and /wp-content/ crawlable so Google can load
# the CSS and JS it needs to render your pages

Sitemap: https://example.com/sitemap.xml

E-commerce

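User-agent: *
Disallow: /cart
Disallow: /checkout
Disallow: /account
Disallow: /order-confirmation
# Match each parameter whether it comes first (?) or later (&) in the query string
Disallow: /*?sort=
Disallow: /*&sort=
Disallow: /*?filter=
Disallow: /*&filter=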

Sitemap: https://example.com/sitemap.xml

Next.js

User-agent: *
Disallow: /api/
Disallow: /admin
# Leave /_next/ crawlable: it serves the JS and CSS Google needs to render pages

Sitemap: https://example.com/sitemap.xml

Note: Next.js 13+ can auto-generate a robots.txt via the app/robots.ts file that returns a MetadataRoute.Robots object — no static file needed if you go that route.

Blocking Specific Bots

You can block individual bots by name. This is commonly used to block SEO tool crawlers (which don't contribute to rankings but consume bandwidth) or known scrapers:

User-agent: *
Disallow:

User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

User-agent: DotBot
Disallow: /

User-agent: MJ12bot
Disallow: /

To repeat: blocking these bots in robots.txt only works if they choose to comply. The Ahrefs and Semrush bots are generally well-behaved and will respect it. Malicious scrapers won't. Server-level blocking (IP, rate limiting, Cloudflare rules) is more effective for persistent violators.
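If your block list changes often, generating the file from a list of names avoids copy-paste mistakes. The sketch below is one possible approach, with a hypothetical helper name build_blocklist; it also verifies the result with Python's urllib.robotparser.

```python
from typing import Iterable
from urllib.robotparser import RobotFileParser


def build_blocklist(bot_names: Iterable[str]) -> str:
    """Allow everyone by default, then add one Disallow: / group per named bot."""
    groups = ["User-agent: *\nDisallow:"]
    for name in bot_names:
        groups.append(f"User-agent: {name}\nDisallow: /")
    return "\n\n".join(groups) + "\n"


robots_txt = build_blocklist(["AhrefsBot", "SemrushBot", "DotBot", "MJ12bot"])

checker = RobotFileParser()
checker.parse(robots_txt.splitlines())

if __name__ == "__main__":
    print(robots_txt)
    print(checker.can_fetch("AhrefsBot", "https://example.com/"))  # False
    print(checker.can_fetch("Googlebot", "https://example.com/"))  # True
```

The check at the end only tells you what a compliant parser would do with the file, not whether a given bot actually obeys it.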

Mistakes That Silently Hurt SEO

Blocking CSS and JS files — Google renders pages like a browser. If you block /wp-content/, Google can't load your theme's styles and sees a broken layout, which affects rendering quality and potentially indexing quality.
Blocking Googlebot-Image — your product or article images may disappear from Google Images, reducing a potential traffic source.
Disallow with noindex — if a page has noindex and is also disallowed in robots.txt, Google can't crawl the page to read the noindex directive. It may still appear in search results from link signals. Pick one: allow crawling + noindex, or just disallow.
Incorrect file placement — /blog/robots.txt is not a valid robots.txt. It must be exactly at yourdomain.com/robots.txt.
Trailing whitespace or BOM characters — some text editors add a Byte Order Mark (BOM) to files. UTF-8 BOM at the start of robots.txt can break parsing. Save as UTF-8 without BOM.
Relative paths — paths must start with /. Disallow: admin/ is invalid; it should be Disallow: /admin/.
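Two of the mistakes above, a leading BOM and paths that don't start with /, are easy to catch automatically. The following is a minimal lint sketch, not a full validator; the function name lint_robots_txt and the message wording are assumptions.

```python
import codecs
from typing import List


def lint_robots_txt(raw: bytes) -> List[str]:
    """Return a list of human-readable problems found in raw robots.txt bytes."""
    problems: List[str] = []
    if raw.startswith(codecs.BOM_UTF8):
        problems.append("file starts with a UTF-8 BOM")
        raw = raw[len(codecs.BOM_UTF8):]
    for number, line in enumerate(raw.decode("utf-8").splitlines(), start=1):
        rule = line.split("#", 1)[0].strip()  # drop comments
        if ":" not in rule:
            continue
        field, value = (part.strip() for part in rule.split(":", 1))
        is_path_rule = field.lower() in ("allow", "disallow")
        # An empty value is valid; otherwise the path must begin with / or *.
        if is_path_rule and value and not value.startswith(("/", "*")):
            problems.append(f"line {number}: path '{value}' should start with '/'")
    return problems


if __name__ == "__main__":
    bad = codecs.BOM_UTF8 + b"User-agent: *\nDisallow: admin/\n"
    for problem in lint_robots_txt(bad):
        print(problem)
```

Running a check like this in your deploy pipeline catches editor-introduced problems before a crawler ever sees them.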

How to Test Your Robots.txt

Always verify your file after changes. There are two main ways:

Google Search Console → Settings → robots.txt report. This shows the robots.txt files Google found for your site, when each was last fetched, and any parsing warnings or errors. To check whether a specific URL is blocked, use the URL Inspection tool.
Direct URL check — visit https://yourdomain.com/robots.txt in a browser to confirm the file is serving correctly with a 200 status and plain text content.

After deploying a new robots.txt, expect a short lag: Google generally caches the file for up to 24 hours, and cached rules may persist longer if the file is temporarily unreachable. If you need a change picked up sooner, the robots.txt report in Search Console lets you request a recrawl.

💡

Tip: Use Google's URL Inspection tool in Search Console (not just the robots.txt report) to check whether a specific URL is crawlable and indexable. It shows both the robots.txt verdict and any noindex signals on the page, so you get the full picture in one place.
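You can also run a quick offline check before a new file goes live. The sketch below uses Python's urllib.robotparser to assert that important URLs stay crawlable and private ones stay blocked. The function name verify_rules and the sample paths are hypothetical, and Python's parser uses first-match rule ordering, so treat it as a sanity check rather than a replica of Googlebot.

```python
from typing import Iterable, List
from urllib.robotparser import RobotFileParser


def verify_rules(
    robots_txt: str,
    must_allow: Iterable[str],
    must_block: Iterable[str],
    user_agent: str = "Googlebot",
) -> List[str]:
    """Return a list of failures; an empty list means every expectation held."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    failures: List[str] = []
    for url in must_allow:
        if not parser.can_fetch(user_agent, url):
            failures.append(f"blocked but should be crawlable: {url}")
    for url in must_block:
        if parser.can_fetch(user_agent, url):
            failures.append(f"crawlable but should be blocked: {url}")
    return failures


if __name__ == "__main__":
    draft = "User-agent: *\nDisallow: /cart\nDisallow: /checkout\n"
    print(verify_rules(draft, ["/products/shoes"], ["/cart", "/checkout/step-1"]))  # []
```

Keeping the expected URLs in version control next to robots.txt turns an accidental Disallow: / into a failed check instead of a traffic drop.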

The One-Paragraph Summary

Put robots.txt at yoursite.com/robots.txt. Use Disallow to stop crawlers visiting pages that waste crawl budget: admin sections, cart pages, search results, query string variations. Always include a Sitemap directive. Never use robots.txt as a privacy or security measure, because it's advisory only. To stop a page appearing in Google search results, use noindex, not Disallow. Test changes before deploying to production, then confirm in Google Search Console that Google has picked up the new file.

🔧

The Tool Empire Team

We build free, browser-based tools for developers and creators. No signup, no installs, no nonsense — just tools that work.