ROBOTS_BLOCKED
The target site's robots.txt disallows GPTBot, either by name or through a wildcard user-agent rule.
| Code | HTTP status | Retryable? |
|---|---|---|
| ROBOTS_BLOCKED | 403 | No |
What this means
ROBOTS_BLOCKED is the polite version of WAF_BLOCKED. Onto checks `/robots.txt` before issuing the actual fetch — if the site disallows GPTBot or has a wildcard `User-agent: *` with `Disallow: /`, the request short-circuits with ROBOTS_BLOCKED and never burns an outbound fetch. The check costs zero against your quota.
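The decision described above can be sketched roughly as follows. This is a simplified illustration, not Onto's actual implementation: `isSiteBlocked` is a hypothetical name, it only detects a whole-site `Disallow: /`, and it assumes that a group naming GPTBot takes precedence over the wildcard group, as the Robots Exclusion Protocol (RFC 9309) describes.

```javascript
// Hypothetical helper, for illustration only.
// Returns true when robots.txt blocks the whole site for the given agent.
function isSiteBlocked(robotsTxt, agent = 'GPTBot') {
  const groups = [];        // each group: { agents: [...], rules: [{ type, path }] }
  let current = null;
  let lastWasAgent = false;

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.split('#')[0].trim();   // drop comments
    const sep = line.indexOf(':');
    if (sep === -1) continue;                    // blank or malformed line
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === 'user-agent') {
      // Consecutive User-agent lines belong to the same group.
      if (!lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if (field === 'allow' || field === 'disallow') {
      if (current) current.rules.push({ type: field, path: value });
      lastWasAgent = false;
    }
  }

  // Assumption: a group that names the agent overrides the wildcard group.
  const named = groups.filter((g) => g.agents.includes(agent.toLowerCase()));
  const applicable = named.length > 0 ? named : groups.filter((g) => g.agents.includes('*'));
  const rules = applicable.flatMap((g) => g.rules);

  const blocksRoot = rules.some((r) => r.type === 'disallow' && r.path === '/');
  const allowsRoot = rules.some((r) => r.type === 'allow' && r.path === '/');
  return blocksRoot && !allowsRoot;
}
```

Under these assumptions, a site with `User-agent: *` and `Disallow: /` counts as blocked unless it also carries a GPTBot group that does not disallow `/`. A production parser would also need path-level matching, which this sketch leaves out.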
When you'll see it
Every ROBOTS_BLOCKED response has HTTP status 403, and the body always includes `"code": "ROBOTS_BLOCKED"`. Branch on `code`, never on the human-readable `message`: the wording can change without notice, while the code is the stable contract.
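WAF_BLOCKED also returns HTTP 403 (see the related errors table), so the status alone cannot tell the two apart. A minimal sketch of branching on `code` follows; `classifyBlock` and its return labels are hypothetical names chosen for illustration, not part of any official client.

```javascript
// Hypothetical helper, for illustration only.
// Maps an error body to a short label by inspecting only the `code` field.
function classifyBlock(body) {
  switch (body && body.code) {
    case 'ROBOTS_BLOCKED':
      return 'robots-opt-out';   // site disallows the crawler in robots.txt
    case 'WAF_BLOCKED':
      return 'waf-refusal';      // origin's WAF or CDN refused the crawler
    default:
      return 'other';
  }
}

// Two 403 bodies with identical wording but different codes get different labels.
const robotsLabel = classifyBlock({ status: 'error', code: 'ROBOTS_BLOCKED', message: 'any wording' });
const wafLabel = classifyBlock({ status: 'error', code: 'WAF_BLOCKED', message: 'any wording' });
console.log(robotsLabel, wafLabel);
```

Because the function never reads `message`, a change in wording leaves the result unchanged.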
Example response
{
  "status": "error",
  "code": "ROBOTS_BLOCKED",
  "message": "robots.txt disallows GPTBot for this host"
}
How to handle
Honor the block. Respecting robots.txt isn't optional — it's the social contract that keeps the web crawler-friendly for everyone. If you're aggregating content for end users, surface the block as "the source has opted out of AI crawling" and let the user click through to the original site. If you own the target site and want Onto to read it, edit `robots.txt` to allow `GPTBot`.
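For an aggregator, surfacing the block might look something like the sketch below. The function name `toUserNotice` and the shape of the returned object are assumptions made for illustration. Only the `code` value and the notice wording come from this page.

```javascript
// Hypothetical helper, for illustration only.
// Turns a ROBOTS_BLOCKED error into a notice the UI can render,
// with a link back to the original page instead of fetched content.
function toUserNotice(data, sourceUrl) {
  if (!data || data.code !== 'ROBOTS_BLOCKED') return null;
  return {
    kind: 'source-opted-out',                      // assumed label for the UI layer
    text: 'The source has opted out of AI crawling.',
    linkUrl: sourceUrl,                            // let the user click through
    linkLabel: `Visit ${new URL(sourceUrl).hostname}`,
  };
}

const notice = toUserNotice(
  { status: 'error', code: 'ROBOTS_BLOCKED', message: 'robots.txt disallows GPTBot for this host' },
  'https://example.com/articles/1'
);
console.log(notice.linkLabel);
```

Returning a structured notice rather than a bare `null` keeps the opt-out visible to the end user, who can still reach the page directly.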
Suggested handling in a Node client:
if (data.code === 'ROBOTS_BLOCKED') {
  // The site explicitly opted out. Honor it.
  return null;
}
Related errors
| Code | Status | What it means |
|---|---|---|
| WAF_BLOCKED | HTTP 403 | The target site's WAF or CDN refused the Onto crawler (origin returned 401 / 403). |
| URL_NOT_FOUND | HTTP 404 | The target URL returned 404 — the page doesn't exist on the origin. |
See the full error index for the complete catalog and a handling switch statement that covers every code at once.