Most people encounter sitemap.xml and robots.txt at the same time and assume they serve the same purpose. They don't. They control opposite ends of the same problem — one tells search engines what to find, the other tells them what to ignore. Getting them confused is one of the most common (and most damaging) technical SEO mistakes.
What robots.txt Does
robots.txt is a plain text file at the root of your domain (for example, https://yourdomain.com/robots.txt). It contains instructions that tell web crawlers which parts of your site they're allowed to access. Compliant crawlers fetch it before requesting any other URL on the site.
A typical robots.txt looks like this:

    User-agent: *
    Allow: /
    Disallow: /dashboard/
    Disallow: /auth/
    Disallow: /api/
    Sitemap: https://yourdomain.com/sitemap.xml

This allows all crawlers to access the site, blocks them from three private sections, and points them to the sitemap. The Disallow directive is a request, not a wall — well-behaved bots (Googlebot, Bingbot) respect it, but bad bots may ignore it. robots.txt is not a security mechanism; use server-level authentication for truly private pages.
Think of robots.txt as the no entry signs on your site — it tells crawlers which doors not to open.
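
To see how a crawler might evaluate those rules, here is a minimal Python sketch. It assumes the single `User-agent: *` group from the example, treats every rule as a plain path prefix (no `*` or `$` wildcards), and applies the longest-match precedence described in RFC 9309. The names `parse_rules` and `is_allowed` are illustrative, not part of any library.

```python
ROBOTS_TXT = """\
User-agent: *
Allow: /
Disallow: /dashboard/
Disallow: /auth/
Disallow: /api/
Sitemap: https://yourdomain.com/sitemap.xml
"""

def parse_rules(robots_txt):
    """Collect (directive, path) pairs. Assumes one 'User-agent: *' group."""
    rules = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field in ("allow", "disallow") and value:
            rules.append((field, value))
    return rules

def is_allowed(path, rules):
    """Longest matching prefix wins; on a tie, allow wins; no match means allowed."""
    best = ("allow", "")
    for field, prefix in rules:
        if not path.startswith(prefix):
            continue
        longer = len(prefix) > len(best[1])
        tie_allow = len(prefix) == len(best[1]) and field == "allow"
        if longer or tie_allow:
            best = (field, prefix)
    return best[0] == "allow"

rules = parse_rules(ROBOTS_TXT)
for path in ("/", "/pricing", "/dashboard/settings", "/api/v1/users"):
    print(path, "allowed" if is_allowed(path, rules) else "blocked")
```

The broad `Allow: /` matches every path, but the longer `Disallow` prefixes are more specific, so they take precedence for the three private sections.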
What sitemap.xml Does
Your sitemap is an XML file that lists every URL you want search engines to find and index. It's a positive signal — an invitation. Where robots.txt defines exclusions, the sitemap defines inclusions.
Think of the sitemap as the welcome mat and directory — it shows crawlers exactly where to go and what matters.
Sitemaps are particularly valuable for:
- Pages that aren't linked from anywhere else on your site (orphan pages)
- New pages you've recently published that haven't yet been discovered by following links
- Large sites where link-following alone may miss pages deep in the architecture
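
For reference, a sitemap is a `urlset` element containing one `url` entry, each with a `loc`, per page. The sketch below builds a minimal one with Python's standard library. The page list is hypothetical, and optional fields such as `lastmod` are left out to keep it short.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return a minimal sitemap.xml document listing the given absolute URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as & when serializing
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body

# Hypothetical page list, for illustration only
pages = [
    "https://yourdomain.com/",
    "https://yourdomain.com/pricing",
    "https://yourdomain.com/blog/my-post",
]
print(build_sitemap(pages))
```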
Key Differences Side by Side
| | robots.txt | sitemap.xml |
|---|---|---|
| Purpose | Exclude pages from crawling | Include pages for indexing |
| Format | Plain text | XML |
| Location | Domain root (required) | Domain root (conventional) |
| Signal type | Negative (block) | Positive (invite) |
| Enforced by crawlers? | Voluntarily honoured | Treated as a suggestion |
| Required? | No, but strongly recommended | No, but strongly recommended |
The Mistake That Breaks Both: Blocking Sitemap URLs in robots.txt
The most destructive misconfiguration is also surprisingly common: using robots.txt to block the section of your site that your sitemap lists. For example:

    # ❌ This blocks Googlebot from crawling /blog/
    Disallow: /blog/

    # But your sitemap lists...
    <loc>https://yourdomain.com/blog/my-post</loc>

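When this happens, Google can see the URL in the sitemap but is blocked from crawling it. Because Googlebot can't read the content, the page usually won't be indexed; if other pages link to it, the bare URL can still appear in results without a description. This scenario shows up in Google Search Console as "Blocked by robots.txt" under the Page Indexing report.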
The rule is simple: never include a URL in your sitemap that you've blocked in robots.txt. They should complement each other, not contradict.
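
One way to catch this contradiction before it ships is to cross-check the two files. The sketch below uses Python's standard `urllib.robotparser` and `xml.etree.ElementTree`; the function name `blocked_sitemap_urls` and the sample file contents are assumptions for illustration. Python's parser applies rules in file order, which can differ from the longest-match evaluation major crawlers use, so treat the result as a first pass rather than a verdict.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

LOC_TAG = "{http://www.sitemaps.org/schemas/sitemap/0.9}loc"

def blocked_sitemap_urls(robots_txt, sitemap_xml, user_agent="Googlebot"):
    """Return sitemap URLs that the robots.txt rules disallow for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    root = ET.fromstring(sitemap_xml)
    urls = [loc.text.strip() for loc in root.iter(LOC_TAG) if loc.text]
    return [url for url in urls if not parser.can_fetch(user_agent, url)]

# Sample inputs mirroring the misconfiguration above (hypothetical content)
ROBOTS_TXT = """\
User-agent: *
Disallow: /blog/
"""

SITEMAP_XML = """\
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://yourdomain.com/pricing</loc></url>
  <url><loc>https://yourdomain.com/blog/my-post</loc></url>
</urlset>
"""

for url in blocked_sitemap_urls(ROBOTS_TXT, SITEMAP_XML):
    print("Listed in sitemap but blocked by robots.txt:", url)
```

Any URL this reports should either be removed from the sitemap or unblocked in robots.txt, depending on whether you want it in search results.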
The Other Common Mistake: noindex vs robots.txt
These two tools handle different things. robots.txt controls crawling. A noindex meta tag controls indexing. The difference matters:
- Block with robots.txt when you don't want a page crawled at all — staging environments, admin panels, private APIs.
- Use noindex when you're fine with the page being crawled (it can still pass link equity) but don't want it to appear in search results — thin pagination pages, thank-you pages, internal search result pages.
- If you block a page with robots.txt, Googlebot can't read it at all — which means it also can't see the noindex tag. If you want to remove a page from search results, use noindex, not robots.txt.
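
The interaction in the last point can be checked mechanically. The following Python sketch detects a `noindex` robots meta tag with the standard `html.parser` module and flags the case where the page is also blocked from crawling. The helper names `has_noindex` and `diagnose` are hypothetical, the `blocked_by_robots` flag is assumed to come from a separate robots.txt check, and a `noindex` sent through an HTTP header is not covered.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Records whether the page has <meta name="robots" content="...noindex...">."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            directives = [d.strip().lower() for d in attr.get("content", "").split(",")]
            if "noindex" in directives:
                self.noindex = True

def has_noindex(html):
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex

def diagnose(html, blocked_by_robots):
    """Describe how the noindex tag and the robots.txt block interact for one page."""
    noindex = has_noindex(html)
    if noindex and blocked_by_robots:
        return "conflict: the crawler cannot fetch the page, so it never sees noindex"
    if noindex:
        return "ok: the page is crawlable and asks to be left out of results"
    if blocked_by_robots:
        return "blocked: the page is not crawled"
    return "ok: the page is crawlable and indexable"

page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(diagnose(page, blocked_by_robots=True))
print(diagnose(page, blocked_by_robots=False))
```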
Do You Need Both?
Yes. They do different jobs, and neither can substitute for the other. You need robots.txt to keep crawlers out of sections that don't belong in search (it is not access control, so real protection still comes from authentication). You need a sitemap to make sure search engines find all your important content efficiently. Together, they give search engines a clear picture of what to crawl and what to skip.