What Robots.txt Actually Is
A robots.txt file is a plain text file at the root of your domain — yoursite.com/robots.txt — that tells web crawlers which parts of your site they should and shouldn't crawl. It was first proposed in 1994 as part of the Robots Exclusion Protocol, and despite being over 30 years old, it remains one of the first things crawlers like Googlebot look for when they arrive at your domain.
The key word is "tells." Robots.txt is advisory — it relies entirely on the crawler choosing to follow it. Well-behaved bots from Google, Bing, Yandex, and others do follow it. Malicious scrapers, spammers, and vulnerability scanners often do not.
The simplest possible robots.txt, one that allows all crawlers to access everything, is a single group with two directives:
# Allow all crawlers to access everything
User-agent: *
Disallow:
The empty Disallow: means "disallow nothing", which allows everything. A missing robots.txt returns 404, which crawlers treat as "no restrictions", so the file is not strictly required. Serving an explicit allow-all file is still worthwhile: it states your intent unambiguously and gives you a place to add a Sitemap directive.
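One way to confirm how a parser reads this file is Python's standard-library urllib.robotparser. The sketch below is only an illustration: the bot name ExampleBot and the example.com URL are placeholders, and other crawlers may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

ALLOW_ALL = """\
# Allow all crawlers to access everything
User-agent: *
Disallow:
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# ExampleBot and the URL are placeholders for illustration.
print(can_crawl(ALLOW_ALL, "ExampleBot", "https://example.com/any/page"))
```

Swapping the empty Disallow: for Disallow: / flips the result, which makes this a handy sanity check before publishing a file.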
The Myth That Breaks SEO
The most dangerous misunderstanding about robots.txt: Disallow does not prevent pages from being indexed.
If you block a URL with Disallow: /secret-page, Google stops crawling that page. But if any other website links to /secret-page, Google still knows it exists. Google may then index the URL and show it in search results without ever crawling its content. The result is a bare listing with no description, pointing at a page Google can't read.
To keep a page out of search results, add <meta name="robots" content="noindex"> to the page, or return an X-Robots-Tag: noindex HTTP header. Robots.txt prevents crawling. Noindex prevents indexing. They're different operations.
The correct tool for each job:
| Goal | Right tool | Wrong tool |
|---|---|---|
| Stop Google from crawling a URL | Disallow in robots.txt | — |
| Stop a page appearing in search results | noindex meta tag or header | robots.txt Disallow |
| Block access to private content | Server-side auth, password protection | robots.txt (it's advisory) |
| Remove existing content from Google | Google Search Console URL removal tool | robots.txt alone |
| Tell Google about all your pages | XML sitemap + robots.txt Sitemap directive | — |
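As a rough illustration of the header-based noindex option, here is a minimal WSGI application that serves a page with X-Robots-Tag: noindex. The function name noindex_app and the page body are made up for this sketch; in practice the header is usually set in your web server or framework configuration.

```python
def noindex_app(environ, start_response):
    """Minimal WSGI app: serve a page crawlers may fetch but should not index."""
    body = b"<html><body>Internal report</body></html>"
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # Crawlers only see this header if robots.txt lets them fetch the URL.
        ("X-Robots-Tag", "noindex"),
    ]
    start_response("200 OK", headers)
    return [body]


# To try it locally (port 8000 is arbitrary):
#   from wsgiref.simple_server import make_server
#   make_server("", 8000, noindex_app).serve_forever()
```

Note that the URL must stay crawlable for this to work: a crawler that is disallowed from fetching the page never receives the header.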
Anatomy of a robots.txt File
A robots.txt file is made of one or more groups. Each group must start with at least one User-agent line, followed by Disallow and/or Allow directives. Lines starting with # are comments. Blank lines separate groups.
# This is a comment
User-agent: *
Disallow: /admin/
Disallow: /checkout
Allow: /admin/public-announcements/

User-agent: Bingbot
Crawl-delay: 5

Sitemap: https://example.com/sitemap.xml
User-agent
The User-agent directive names the bot this group applies to. * is a wildcard that matches all bots not covered by a more specific group. Some commonly targeted agents:
| User-agent value | Targets |
|---|---|
| * | All bots |
| Googlebot | Google's main web crawler |
| Googlebot-Image | Google Images crawler |
| Googlebot-Video | Google Video crawler |
| Bingbot | Microsoft Bing crawler |
| Slurp | Yahoo crawler |
| DuckDuckBot | DuckDuckGo crawler |
| facebot | Facebook link preview crawler |
| Twitterbot | Twitter Card crawler |
| AhrefsBot | Ahrefs SEO tool crawler |
If a bot's name matches a specific group and the wildcard group, the specific group takes precedence. The wildcard group is completely ignored for that bot.
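The precedence rule can be observed with Python's standard-library urllib.robotparser. This is a small sketch with made-up paths (/private/, /drafts/) and a placeholder bot name (OtherBot); it shows one parser's behavior, not a guarantee about every crawler.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot matches its own group, so the wildcard rule for /private/
# is not applied to it.
googlebot_private = parser.can_fetch("Googlebot", "/private/report.html")
googlebot_drafts = parser.can_fetch("Googlebot", "/drafts/post.html")

# A bot without its own group falls back to the wildcard group.
other_private = parser.can_fetch("OtherBot", "/private/report.html")

print(googlebot_private, googlebot_drafts, other_private)
```

The practical consequence: if you add a Googlebot group, every rule you still want Googlebot to follow has to be repeated inside that group.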
Disallow and Allow
Disallow tells the bot not to crawl any URL whose path starts with the given value. Allow creates an exception, permitting crawling even within a disallowed tree.
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
In this example, everything under /wp-admin/ is blocked except /wp-admin/admin-ajax.php, which is the WordPress AJAX endpoint needed by some public-facing features.
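Google and RFC 9309 resolve conflicts between Allow and Disallow by picking the longest matching path, with Allow winning a tie. The function below is a simplified sketch of that idea: is_allowed and WP_RULES are names invented here, wildcards are not handled, and some parsers (Python's urllib.robotparser among them) apply the first matching rule instead.

```python
def is_allowed(rules, path):
    """Return True if path may be crawled under longest-match precedence.

    rules is a list of (directive, prefix) pairs for one group, e.g.
    ("disallow", "/wp-admin/"). Wildcards are not handled in this sketch.
    """
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty value matches nothing
        is_allow = directive.lower() == "allow"
        # Longer prefix wins; on a tie, prefer the less restrictive Allow.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


WP_RULES = [
    ("disallow", "/wp-admin/"),
    ("allow", "/wp-admin/admin-ajax.php"),
]

print(is_allowed(WP_RULES, "/wp-admin/admin-ajax.php"))  # True
print(is_allowed(WP_RULES, "/wp-admin/options.php"))     # False
print(is_allowed(WP_RULES, "/blog/hello-world"))         # True
```

Because the decision depends on path length rather than line order, the Allow line can sit before or after the Disallow line with the same result under this rule.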
When an Allow and a Disallow rule both match a URL, the more specific (longer) rule wins. Allow: /cms/public/ beats Disallow: /cms/ for URLs under /cms/public/ because it's more specific.
Wildcards: * and $
Google (and most modern crawlers) support two wildcard characters in path values:
- * matches zero or more characters at any position
- $ anchors the pattern to the end of the URL
User-agent: *
Disallow: /*.pdf$ # Block all .pdf URLs
Disallow: /search* # Block /search, /search-results, /searching…
Disallow: /*?* # Block all URLs with any query string
Disallow: /*?s=* # Block WordPress search query URLs
The original 1994 robots.txt proposal didn't include wildcard support. It was added later as an extension (and is now part of RFC 9309), but not every crawler supports it. Googlebot and Bingbot support both characters.
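To see how these patterns behave, one option is to translate them into regular expressions. The helper names below (robots_pattern_to_regex, pattern_matches) are invented for this sketch, and it only approximates crawler behavior: it ignores percent-encoding and other normalization that real crawlers perform.

```python
import re


def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern using * and $ into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with ".*" for every "*".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def pattern_matches(pattern: str, url_path: str) -> bool:
    """True if url_path (path plus optional query string) matches pattern."""
    return robots_pattern_to_regex(pattern).match(url_path) is not None


print(pattern_matches("/*.pdf$", "/docs/report.pdf"))             # True
print(pattern_matches("/*.pdf$", "/docs/report.pdf?download=1"))  # False
print(pattern_matches("/search*", "/search-results"))             # True
print(pattern_matches("/*?*", "/products"))                       # False
```

The second call shows why $ matters: without the anchor, the pattern would also match PDF URLs that carry a query string.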
The Sitemap Directive
The Sitemap directive is technically separate from the robots exclusion rules; it's just a convenient place to tell crawlers where your XML sitemap lives. It doesn't belong to any User-agent group, so it can appear anywhere in the file, and you can list more than one. Each value should be a full absolute URL:
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/news-sitemap.xml
User-agent: *
Disallow:
Google, Bing, and Yandex all read the Sitemap directive. You can also submit your sitemap URL directly in Google Search Console and Bing Webmaster Tools — doing both is harmless and recommended.
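If you want to check which sitemaps a file advertises, Python's urllib.robotparser exposes them through site_maps() (available from Python 3.8). A minimal sketch using the example file above:

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/news-sitemap.xml

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the listed URLs in file order, or None if there are none.
sitemaps = parser.site_maps() or []
for url in sitemaps:
    print(url)
```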
Crawl-delay — Use It Carefully
Crawl-delay tells a bot how many seconds to pause between requests to your server. It's useful for limiting aggressive crawlers that can overload shared hosting. There's one major caveat: Googlebot completely ignores it.
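Google retired the Search Console crawl rate limiter in January 2024. Googlebot now adjusts its crawl rate automatically based on how your server responds, and Google's documented way to slow it down in an emergency is to temporarily return 500, 503, or 429 status codes. The Crawl-delay directive is honored by Bingbot, Yandex, and many smaller crawlers.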
User-agent: Bingbot
Crawl-delay: 10
User-agent: *
Disallow:
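From the crawler's side, honoring Crawl-delay means reading the value and sleeping between requests. The sketch below uses crawl_delay() from Python's urllib.robotparser; delay_for and OtherBot are names made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Bingbot
Crawl-delay: 10

User-agent: *
Disallow:
"""


def delay_for(robots_txt: str, user_agent: str, default: float = 0.0) -> float:
    """Seconds a polite crawler should wait between requests."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)  # None when no Crawl-delay applies
    return float(delay) if delay is not None else default


print(delay_for(ROBOTS_TXT, "Bingbot"))
print(delay_for(ROBOTS_TXT, "OtherBot"))
```

A crawler written this way would call time.sleep() with the returned value between fetches; bots that ignore the directive simply never perform this lookup.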
Crawl Budget — Why This Actually Matters
Google allocates a "crawl budget" to each site — an amount of crawling bandwidth based on your site's health, speed, and authority. For most small to medium sites (under ~10,000 pages), crawl budget is rarely a concern. For large sites, it matters a lot.
A well-configured robots.txt helps Google spend its crawl budget on pages that matter. Common crawl budget wasters you should block:
- Faceted navigation: filter combinations like /products?color=red&size=M&sort=price create infinite URL variations of the same content
- Session IDs in URLs: ?sessionid=abc123 creates unique URLs that are actually duplicate content
- Paginated search results: page 47 of a product listing is rarely worth indexing
- CMS admin paths: /wp-admin/, /admin/, /.git/
- API endpoints: /api/ paths that return JSON, not HTML
- Test / staging pages: drafts, preview URLs, internal tools
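One rough way to spot faceted-navigation or session-ID bloat is to count how many distinct query-string variants of each path crawlers have requested. The sketch below assumes you have already extracted requested URLs from your server logs; the CRAWLED list and the function name query_variants_by_path are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def query_variants_by_path(urls):
    """Count how many distinct query-string variants were seen per path."""
    seen = set()
    counts = Counter()
    for url in urls:
        parts = urlsplit(url)
        key = (parts.path, parts.query)
        if parts.query and key not in seen:
            seen.add(key)
            counts[parts.path] += 1
    return counts


# Hypothetical URLs, e.g. pulled from server logs.
CRAWLED = [
    "https://example.com/products?color=red&size=M&sort=price",
    "https://example.com/products?color=blue&size=M&sort=price",
    "https://example.com/products?sessionid=abc123",
    "https://example.com/products",
    "https://example.com/about",
]

for path, count in query_variants_by_path(CRAWLED).most_common():
    print(path, count)
```

Paths with a very high variant count are candidates for a Disallow rule on the offending parameters.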
Real-World Configurations
WordPress
User-agent: *
Disallow: /wp-admin/
Disallow: /?s=
Disallow: /search
Allow: /wp-admin/admin-ajax.php
# /wp-includes/ and /wp-content/ stay crawlable so Google can load the CSS and JS it needs for rendering
Sitemap: https://example.com/sitemap.xml
E-commerce
User-agent: *
Disallow: /cart
Disallow: /checkout
Disallow: /account
Disallow: /order-confirmation
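# The "*" after "?" matches the parameter anywhere in the query string, not only in first position
Disallow: /*?*sort=
Disallow: /*?*filter=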
Sitemap: https://example.com/sitemap.xml
Next.js
User-agent: *
Disallow: /api/
# /_next/ is left crawlable: it serves the JS and CSS Google needs to render pages
Disallow: /admin
Sitemap: https://example.com/sitemap.xml
Note: Next.js 13+ can auto-generate a robots.txt via the app/robots.ts file that returns a MetadataRoute.Robots object — no static file needed if you go that route.
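For stacks without a built-in generator, the same idea of producing robots.txt from structured data at build time can be sketched in a few lines of Python. The function render_robots and its argument shapes are invented for this example; they are not part of Next.js or any library.

```python
def render_robots(groups, sitemaps=()):
    """Build robots.txt text.

    groups: list of (user_agent, rules) where rules is a list of
    (directive, path) pairs such as ("Disallow", "/api/").
    """
    lines = []
    for user_agent, rules in groups:
        lines.append(f"User-agent: {user_agent}")
        for directive, path in rules:
            lines.append(f"{directive}: {path}")
        lines.append("")  # blank line between groups
    for url in sitemaps:
        lines.append(f"Sitemap: {url}")
    return "\n".join(lines).rstrip("\n") + "\n"


text = render_robots(
    [("*", [("Disallow", "/api/"), ("Disallow", "/admin")])],
    sitemaps=["https://example.com/sitemap.xml"],
)
print(text)
```

Generating the file this way keeps the rules next to the route definitions they describe, which makes it harder for the two to drift apart.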
Blocking Specific Bots
You can block individual bots by name. This is commonly used to block SEO tool crawlers (which don't contribute to rankings but consume bandwidth) or known scrapers:
User-agent: *
Disallow:
User-agent: AhrefsBot
Disallow: /
User-agent: SemrushBot
Disallow: /
User-agent: DotBot
Disallow: /
User-agent: MJ12bot
Disallow: /
To repeat: blocking these bots in robots.txt only works if they choose to comply. The Ahrefs and Semrush bots are generally well-behaved and will respect it. Malicious scrapers won't. Server-level blocking (IP, rate limiting, Cloudflare rules) is more effective for persistent violators.
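As one illustration of server-level blocking, the WSGI middleware below rejects requests whose User-Agent contains a blocked name. The names block_bots, site, and protected_site are placeholders for this sketch. Because the User-Agent header is trivially spoofed, this only stops bots that identify themselves honestly; IP-based rules and rate limiting are still needed for persistent violators.

```python
BLOCKED_AGENTS = ("ahrefsbot", "semrushbot", "dotbot", "mj12bot")


def block_bots(app, blocked=BLOCKED_AGENTS):
    """Wrap a WSGI app and return 403 to blocked User-Agent substrings."""
    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(name in agent for name in blocked):
            body = b"Forbidden"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return app(environ, start_response)
    return middleware


def site(environ, start_response):
    """Placeholder application standing in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello"]


protected_site = block_bots(site)
```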
Mistakes That Silently Hurt SEO
- Blocking CSS and JS files: Google renders pages like a browser. If you block /wp-content/, Google can't load your theme's styles and sees a broken layout, which affects rendering quality and potentially indexing quality.
- Blocking Googlebot-Image: your product or article images may disappear from Google Images, reducing a potential traffic source.
- Disallow with noindex: if a page has noindex and is also disallowed in robots.txt, Google can't crawl the page to read the noindex directive. It may still appear in search results from link signals. Pick one: allow crawling + noindex, or just disallow.
- Incorrect file placement: /blog/robots.txt is not a valid robots.txt. It must be exactly at yourdomain.com/robots.txt.
- Trailing whitespace or BOM characters: some text editors add a Byte Order Mark (BOM) to files. UTF-8 BOM at the start of robots.txt can break parsing. Save as UTF-8 without BOM.
- Relative paths: paths must start with /. Disallow: admin/ is invalid; it should be Disallow: /admin/.
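A couple of these mistakes can be caught mechanically before the file ships. The function below is a small, incomplete linter sketch: lint_robots is a name invented here, and it only checks for a leading BOM, rules that appear before any User-agent line, and paths that don't start with / (or * for wildcard patterns).

```python
import codecs


def lint_robots(raw: bytes):
    """Return a list of warnings for common robots.txt mistakes."""
    warnings = []
    if raw.startswith(codecs.BOM_UTF8):
        warnings.append("file starts with a UTF-8 BOM")
        raw = raw[len(codecs.BOM_UTF8):]
    seen_user_agent = False
    for number, line in enumerate(raw.decode("utf-8", "replace").splitlines(), 1):
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            seen_user_agent = True
        elif field in ("allow", "disallow"):
            if not seen_user_agent:
                warnings.append(f"line {number}: rule before any User-agent")
            # Paths should start with "/" (or "*" when using wildcards).
            if value and not value.startswith(("/", "*")):
                warnings.append(f"line {number}: path {value!r} should start with '/'")
    return warnings


sample = codecs.BOM_UTF8 + b"User-agent: *\nDisallow: admin/\n"
for warning in lint_robots(sample):
    print(warning)
```

Running a check like this in CI is one way to keep an editor's save settings from silently breaking the file.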
How to Test Your Robots.txt
Always verify your file after changes. There are two main ways:
- Google Search Console → Settings → robots.txt report. This shows the robots.txt files Google has fetched for your site, when each was last crawled, and any parsing warnings or errors. It replaced the older robots.txt Tester; to check whether a specific URL is blocked, use the URL Inspection tool.
- Direct URL check: visit https://yourdomain.com/robots.txt in a browser to confirm the file is served with a 200 status and plain text content.
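The direct URL check can also be scripted. The sketch below uses only the Python standard library; response_problems and check_live are names invented for this example, and yourdomain.com is the same placeholder used above. The network call is left commented out so the snippet can be read and run without contacting a real server.

```python
import urllib.error
import urllib.request


def response_problems(status: int, content_type: str):
    """Return a list of problems with a robots.txt HTTP response."""
    problems = []
    if status != 200:
        problems.append(f"expected status 200, got {status}")
    if not content_type.lower().startswith("text/plain"):
        problems.append(f"expected text/plain, got {content_type or 'no Content-Type'}")
    return problems


def check_live(url: str):
    """Fetch url (network access required) and report any problems."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return response_problems(resp.status, resp.headers.get("Content-Type", ""))
    except urllib.error.HTTPError as err:
        return response_problems(err.code, err.headers.get("Content-Type", ""))


# Example call; replace the placeholder domain with your own:
#   print(check_live("https://yourdomain.com/robots.txt"))
```

An empty list means the file is being served the way crawlers expect.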
After you deploy a new robots.txt, Google generally picks it up within about a day, because it caches the file for up to 24 hours (sometimes longer). Previously cached rules may persist until then. If you need a rule to take effect sooner, the robots.txt report in Search Console lets you request a recrawl of the file.
The One-Paragraph Summary
Put robots.txt at yoursite.com/robots.txt. Use Disallow to stop crawlers visiting pages that waste crawl budget: admin sections, cart pages, search results, query string variations. Always include a Sitemap directive. Never use robots.txt as a privacy or security measure, because it's advisory only. To stop a page appearing in Google search results, use noindex, not Disallow. Test changes before deploying to production, then confirm in Google Search Console that Google has fetched the new file.