Robots.txt: Control Crawl Access Carefully

robots.txt is a plain-text file served from the site root that defines access rules for compliant crawlers.

Why It Matters

Misconfigured rules can block important pages, prevent rendering resources from being crawled, or let low-value paths consume crawl resources. A single broad directive can materially reduce search visibility.

robots.txt is most useful for crawl management, not for guaranteed deindexing.

What Robots.txt Can and Cannot Do

  • It can suggest which paths compliant crawlers should avoid.
  • It can help reduce crawl waste on filters, parameters, and utility paths.
  • It cannot guarantee a URL will never be indexed.
  • It cannot replace page-level indexing directives like noindex.

If you need reliable deindexing behavior, use page-level or header-level robots directives (meta robots or X-Robots-Tag) and keep the URL crawlable long enough for bots to fetch and process them; a URL blocked in robots.txt can still be indexed from external links, and bots will never see a noindex on a page they cannot crawl.
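For illustration, the same noindex instruction can be expressed either in the page markup or as an HTTP response header (a sketch; the header form also works for non-HTML resources such as PDFs):

```
<!-- In the page's <head>: -->
<meta name="robots" content="noindex">

Or as an HTTP response header:
X-Robots-Tag: noindex
```

Once the page has been recrawled and dropped from the index, the path can then be disallowed in robots.txt to save crawl budget.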

Best Practices

  • Keep directives minimal and intentional.
  • Avoid blocking critical assets needed for rendering.
  • Include a Sitemap directive when applicable.
  • Use environment-safe defaults so staging blocks cannot leak to production.
  • Review rules after major IA, routing, or CDN changes.
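The staging-leak guard in particular is easy to automate as a release check. A minimal sketch, assuming a simple prefix-only robots.txt (the function name and the group-tracking logic are illustrative, not a full robots.txt parser):

```python
def has_global_block(robots_txt: str) -> bool:
    """Return True if a blanket 'Disallow: /' applies to all user agents."""
    agents = set()     # user agents named in the current group
    in_group = False   # True once rule lines for the group have started
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if in_group:  # a User-agent line after rules starts a new group
                agents = set()
                in_group = False
            agents.add(value)
        elif field == "disallow":
            in_group = True
            if value == "/" and "*" in agents:
                return True
        elif field in ("allow", "crawl-delay"):
            in_group = True
    return False


# Example: fail a production deploy if the staging block leaked through.
if has_global_block("User-agent: *\nDisallow: /"):
    print("refusing to deploy: robots.txt blocks all crawlers")
```

Wiring a check like this into CI makes "Disallow: / in production" a build failure rather than a post-launch surprise.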

Common Mistakes

  • Deploying Disallow: / to production.
  • Blocking /css, /js, or image paths required for rendering.
  • Overly broad wildcards that suppress valid content sections.
  • Keeping obsolete disallow rules after site migrations.

Example

User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml

An example that intentionally limits crawling of low-value filtered and sorted URLs (note that * wildcards are an extension honored by major crawlers, not part of the original robots.txt standard):

User-agent: *
Disallow: /search?
Disallow: /*?sort=
Allow: /
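Prefix rules like the one above can be sanity-checked with Python's standard-library parser. A sketch, using the hypothetical example.com URLs; note that urllib.robotparser implements the original prefix-matching behavior and does not understand the * wildcard extension, so only the literal /search? rule is exercised here:

```python
from urllib import robotparser

rules = """\
User-agent: *
Disallow: /search?
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Regular content pages remain crawlable.
print(rp.can_fetch("*", "https://example.com/products"))       # True
# Internal search result URLs are blocked by the prefix rule.
print(rp.can_fetch("*", "https://example.com/search?q=shoes")) # False
```

For rules that depend on wildcard matching, validate with a crawler-specific testing tool rather than the standard-library parser.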

Validation Workflow

  1. Fetch https://yourdomain.com/robots.txt from production.
  2. Confirm critical pages and rendering assets are not blocked.
  3. Validate with crawler testing tools before and after releases.
  4. Re-check after CDN/proxy and redirect updates.
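Steps 1 and 2 can be scripted as a release check. A minimal sketch: the robots.txt body is inlined here for self-containment, but in practice you would fetch the live file from production first, and the critical-path list is an illustrative assumption for an example site:

```python
from urllib import robotparser

# In production, fetch this from https://yourdomain.com/robots.txt.
production_robots = """\
User-agent: *
Disallow: /search?
Allow: /
"""

# Hypothetical critical pages and rendering assets for an example site.
CRITICAL_PATHS = ["/", "/products", "/static/app.css", "/static/app.js"]

rp = robotparser.RobotFileParser()
rp.parse(production_robots.splitlines())

blocked = [p for p in CRITICAL_PATHS if not rp.can_fetch("Googlebot", p)]
if blocked:
    raise SystemExit(f"release check failed; blocked paths: {blocked}")
print("robots.txt check passed")
```

Running this before and after each release, and again after CDN or redirect changes, catches most accidental blocks early.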

Final Takeaway

Use robots rules precisely and conservatively. Treat robots.txt as a crawl-control file where small mistakes can create large indexing impact.