Robots.txt: Control Crawl Access Carefully
robots.txt, served from the site root, tells crawlers which paths they may request.
Why It Matters
Misconfigured rules can block important pages, prevent rendering resources from being crawled, or let low-value paths consume crawl resources. A single broad directive can materially reduce search visibility.
robots.txt is most useful for crawl management, not for guaranteed deindexing.
What Robots.txt Can and Cannot Do
- It can suggest which paths compliant crawlers should avoid.
- It can help reduce crawl waste on filters, parameters, and utility paths.
- It cannot guarantee a URL will never be indexed.
- It cannot replace page-level indexing directives like noindex.
If you need reliable deindexing behavior, use page/header robots directives and allow crawling long enough for bots to process them.
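A page-level signal only works if crawlers are allowed to fetch the page and see it. As a minimal sketch, a hypothetical helper (the function name and inputs are illustrative, not a real library API) can check the two standard places a noindex directive appears: the X-Robots-Tag response header and the robots meta tag.

```python
import re

# Hypothetical helper: a URL can be reliably deindexed only if crawlers can
# fetch it and observe one of these page-level noindex signals.
def has_noindex(headers: dict, html: str) -> bool:
    # X-Robots-Tag response header, e.g. "noindex, nofollow".
    tag = headers.get("X-Robots-Tag", "")
    if "noindex" in tag.lower():
        return True
    # <meta name="robots" content="noindex"> in the document head.
    meta = re.search(
        r'<meta\s+name=["\']robots["\']\s+content=["\']([^"\']*)["\']',
        html, re.IGNORECASE)
    return bool(meta and "noindex" in meta.group(1).lower())
```

If robots.txt disallows the URL, neither signal is ever seen, which is why blocking a page in robots.txt can actually keep it indexed.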
Best Practices
- Keep directives minimal and intentional.
- Avoid blocking critical assets needed for rendering.
- Include a Sitemap reference when applicable.
- Use environment-safe defaults so staging blocks cannot leak to production.
- Review rules after major IA, routing, or CDN changes.
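The environment-safe default above can be made mechanical at deploy time. A sketch, assuming a simple two-environment setup (the rule strings and function name are illustrative): serving logic refuses to guess, so an unrecognized environment raises instead of silently shipping staging's Disallow: / to production.

```python
# Illustrative deploy-time selector for robots.txt content by environment.
STAGING_RULES = "User-agent: *\nDisallow: /\n"
PRODUCTION_RULES = (
    "User-agent: *\n"
    "Allow: /\n"
    "Sitemap: https://example.com/sitemap.xml\n"
)

def robots_for_env(env: str) -> str:
    # Fail loudly on anything unexpected rather than picking a default:
    # either default is dangerous in the wrong environment.
    rules = {"staging": STAGING_RULES, "production": PRODUCTION_RULES}
    if env not in rules:
        raise ValueError(f"unknown deploy environment: {env!r}")
    return rules[env]
```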
Common Mistakes
- Deploying Disallow: / to production.
- Blocking /css, /js, or image paths required for rendering.
- Overly broad wildcards that suppress valid content sections.
- Keeping obsolete disallow rules after site migrations.
Example
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
Example of intentionally limiting low-value filtered URLs:
User-agent: *
Disallow: /search?
Disallow: /*?sort=
Allow: /
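Rules like these can be sanity-checked with Python's standard-library parser; a minimal sketch follows (the example.com URLs are placeholders). One caveat: urllib.robotparser does plain prefix matching and, to my knowledge, does not expand Google-style * wildcards, so only the /search? prefix rule is exercised here.

```python
from urllib.robotparser import RobotFileParser

# The example rules, inlined for testing rather than fetched over the network.
rules = """\
User-agent: *
Disallow: /search?
Disallow: /*?sort=
Allow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Internal search-result URLs are caught by the /search? prefix rule.
print(rp.can_fetch("*", "https://example.com/search?q=shoes"))  # False
# Ordinary content pages remain crawlable under Allow: /.
print(rp.can_fetch("*", "https://example.com/about"))           # True
```

For wildcard-sensitive rules, verify behavior in the search engines' own testing tools as well, since matching semantics differ between crawlers.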
Validation Workflow
- Fetch https://yourdomain.com/robots.txt from production.
- Confirm critical pages and rendering assets are not blocked.
- Validate with crawler testing tools before and after releases.
- Re-check after CDN/proxy and redirect updates.
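Parts of this workflow can run as an automated release gate. A sketch, assuming a hand-maintained list of critical URLs (the example.com paths are placeholders): the check fails the build if the robots.txt about to ship would block any critical page or rendering asset.

```python
from urllib.robotparser import RobotFileParser

# Illustrative must-stay-crawlable list: key pages plus rendering assets.
CRITICAL_URLS = [
    "https://example.com/",
    "https://example.com/products",
    "https://example.com/css/site.css",
    "https://example.com/js/app.js",
]

def blocked_critical_urls(robots_text: str, urls=CRITICAL_URLS):
    # Return every critical URL the candidate robots.txt would block;
    # an empty list means the release check passes.
    rp = RobotFileParser()
    rp.parse(robots_text.splitlines())
    return [u for u in urls if not rp.can_fetch("Googlebot", u)]
```

Running this against the file in version control before deploy, and against the live URL after, catches both bad edits and CDN/proxy overrides.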
Final Takeaway
Use robots rules precisely and conservatively. Treat robots.txt as a crawl-control file where small mistakes can create large indexing impact.