Soft 404 · crawl budget · status codes
Is it a real 404 or a soft 404?
A soft 404 looks "not found" to people but returns HTTP 200 to crawlers, so Google
wastes crawl budget on it and flags it in Search Console. Paste a URL and soft404scan compares it against a
guaranteed-missing page on the same host — then tells you the verdict and the exact signal that betrays it.
What it checks
How a soft 404 gives itself away
Status code
Whether the URL returns a clean 404/410, a 200 it shouldn't, a server error, or a redirect — the first tell of a soft 404.
The missing-URL baseline
soft404scan fetches a made-up URL in the same directory to learn what "not found" looks like on this host: a real 404, a catch-all 200, or a redirect to the homepage.
Content match
If your 200 page is near-identical to that guaranteed-missing page, it's the same not-found template with the wrong status — a soft 404.
Not-found wording
A page that returns 200 while its title or heading says "page not found" is a classic soft 404, even with unique styling.
Redirect-to-home
Redirecting a missing URL to the homepage instead of returning 404 is, by Google's own definition, a soft 404. soft404scan checks for it explicitly.
JS-app honesty
Single-page apps serve the same shell for every URL. soft404scan detects that and says "verify in a browser" instead of crying soft 404.
Open methodology
Every rule, in the open
No mystery score. Here is exactly how each verdict is decided — so you can verify and trust the result.
- Baseline probe: A made-up filename in the same directory as your URL is fetched as a "guaranteed missing" reference for how the host behaves for URLs that do not exist.
- True 404 / 410: PASS if the URL itself returns HTTP 404 or 410, the correct, clean status for missing content.
- Redirect soft 404: FAIL if a non-root URL is 30x-redirected to the homepage. Google explicitly treats redirecting missing pages to the home page as a soft 404.
- Catch-all 200 + similarity: FAIL if the URL returns 2xx and is ≥90% content-similar (shingled-token Jaccard) to the guaranteed-missing baseline. It is serving the same not-found page with a 200.
- Not-found wording: FAIL if a 2xx page's title/heading reads like a not-found page (e.g. "404", "page not found"); WARN if such wording only appears in the body of an otherwise normal page.
- Suspicious similarity: WARN (Possible soft 404) if a 2xx page is at least 75% but under 90% similar to the not-found baseline on a catch-all host.
- Thin content: WARN if a 2xx page has fewer than ~40 words of server-rendered text. Thin pages are prone to being treated as soft 404s.
- JavaScript-app caveat: A near-identical match on a client-rendered site (little text, several scripts) is reported as inconclusive, with a prompt to verify in a browser, rather than a hard soft-404 verdict.
- Host 404 behaviour: INFO/WARN describing whether the host returns a correct 404, a catch-all 200, or a homepage redirect for missing URLs.
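As a rough illustration, the rules listed above could be combined along these lines. This is a sketch, not soft404scan's actual code: the `Probe` fields, the verdict strings, and above all the order in which the rules are tried are assumptions, since the list does not state a precedence. `status` is assumed to be the final status after any redirects.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """Hypothetical summary of one scan; field names are illustrative."""
    status: int                  # final HTTP status of the checked URL
    redirected_to_home: bool     # non-root URL was 30x-redirected to the homepage
    similarity: float            # 0..1 similarity to the guaranteed-missing baseline
    heading_says_missing: bool   # title/heading reads like a not-found page
    body_says_missing: bool      # not-found wording appears only in the body
    word_count: int              # words of server-rendered text
    catch_all_host: bool         # the baseline (missing) URL itself returned 2xx
    looks_client_rendered: bool  # little text, several scripts


def verdict(p: Probe) -> str:
    """Apply the published rules in one assumed order of precedence."""
    if p.status in (404, 410):
        return "PASS: true 404/410"
    if p.redirected_to_home:
        return "FAIL: redirect soft 404"
    if 200 <= p.status < 300:
        if p.similarity >= 0.90:
            if p.looks_client_rendered:
                return "INCONCLUSIVE: verify in a browser"
            return "FAIL: catch-all 200 + similarity"
        if p.heading_says_missing:
            return "FAIL: not-found wording"
        if p.catch_all_host and p.similarity >= 0.75:
            return "WARN: possible soft 404"
        if p.body_says_missing:
            return "WARN: not-found wording in body"
        if p.word_count < 40:
            return "WARN: thin content"
        return "OK"
    # Server errors and other statuses; this label is hypothetical.
    return f"OTHER: HTTP {p.status}"
```

Checking the JavaScript-app caveat before issuing a FAIL mirrors the stated intent of reporting client-rendered shells as inconclusive rather than as soft 404s.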
Content similarity is a shingled-token Jaccard score (3-word shingles) over the server-rendered visible text. It reproduces the well-known soft-404 approach — comparing a URL to a guaranteed-missing baseline — but it is not Google's private classifier, so treat a verdict as a strong, transparent signal, not a guarantee.
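For readers who want to reproduce the score, here is a minimal sketch of a 3-word shingled Jaccard similarity. The tokenizer (lowercased `\w+` runs) and the handling of very short or empty texts are assumptions; the tool's own text extraction and tokenization may differ.

```python
import re


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word shingles over lowercased word tokens."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < k:
        # Too short for a full shingle: use the whole token run as one shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard_similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard score |A ∩ B| / |A ∪ B| over the two shingle sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # assumption: two empty texts count as identical
    return len(sa & sb) / len(sa | sb)
```

With this definition, "a b c d" and "a b c e" share one shingle out of three distinct ones, giving a score of 1/3, while two copies of the same not-found template score 1.0.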
Frequently asked questions
What is a soft 404?
A soft 404 is a page that tells a human "this content does not exist" but tells crawlers everything is fine by returning HTTP 200 (or by redirecting a missing URL to the homepage). Google then wastes crawl budget on it and may report it as "Soft 404" in Search Console. The fix is to return a real 404 or 410 status for content that is genuinely gone.
How does soft404scan detect a soft 404?
It fetches your URL and, at the same time, a made-up URL in the same directory that is guaranteed not to exist. It then compares the two: the HTTP status code, whether either redirects to the homepage, and how similar the page content is. If your URL returns 200 but is the same page a non-existent URL returns — or its title/heading says "not found", or it redirects to the homepage like a missing page does — that is a soft 404. Every rule is published in the methodology.
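The baseline URL can be built roughly like this. It is a sketch under assumptions: the `soft404-probe-` prefix and the random suffix length are hypothetical, and a random name is only overwhelmingly unlikely to exist rather than strictly guaranteed to be missing.

```python
import secrets
from urllib.parse import urlsplit, urlunsplit


def baseline_url(url: str) -> str:
    """Return a made-up URL in the same directory as `url`."""
    parts = urlsplit(url)
    # Keep everything up to and including the last "/" of the path.
    directory = parts.path.rsplit("/", 1)[0] + "/"
    # Hypothetical naming scheme: a fixed prefix plus 16 random hex characters.
    random_name = "soft404-probe-" + secrets.token_hex(8)
    # Drop the query string and fragment so only the path differs.
    return urlunsplit((parts.scheme, parts.netloc, directory + random_name, "", ""))
```

For example, `baseline_url("https://example.com/blog/post.html")` yields something of the form `https://example.com/blog/soft404-probe-<random hex>`.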
Is it free?
Yes — free, no account, no sign-up. Enter a URL and get an instant verdict. We keep no logs of the URLs you check.
Why does this matter for SEO?
Search engines have a limited crawl budget per site. Soft 404s burn that budget on pages that should not be indexed, can keep dead URLs lingering in the index, and muddy your coverage reports. Returning a clean 404/410 lets engines drop missing pages quickly and spend crawl budget on pages that matter.
Can I use it to confirm I fixed a soft 404?
Yes — that is a primary use case. After you change a missing URL to return a real 404/410, paste it here again: a "True 404" verdict confirms the fix is live and that engines will now drop the URL cleanly.
Does it run JavaScript / work on SPAs?
No — it reads the server-rendered HTML, like a crawler that does not execute JavaScript. Many single-page apps serve the same HTML shell for every URL, including ones that do not exist, so the static comparison can look identical. soft404scan detects that case and tells you to verify in a browser or Search Console rather than giving a false "soft 404" verdict.
Is this the same heuristic Google uses?
It reproduces the well-known, published approach (compare a URL against a guaranteed-missing baseline by status and content similarity). It is not Google's private classifier, so treat the result as a strong, transparent signal — not a guarantee of exactly what Search Console will say.
Is my data safe? Any SSRF concerns?
The scan runs on Cloudflare and only fetches public http(s) URLs; requests to private, loopback, link-local and cloud-metadata addresses are blocked, redirects are re-validated on every hop, and responses are size- and time-capped. We keep no logs of what you check.
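The kind of address filtering described here can be sketched with Python's standard `ipaddress` module. This is an illustration only, not the service's implementation: the function names are hypothetical, and a real fetcher would apply the check to the resolved IP of every request, including each redirect hop.

```python
import ipaddress

ALLOWED_SCHEMES = {"http", "https"}


def is_blocked_address(ip: str) -> bool:
    """True for private, loopback, link-local and similar non-public addresses."""
    addr = ipaddress.ip_address(ip)
    # Unwrap IPv4-mapped IPv6 (e.g. ::ffff:127.0.0.1) so the IPv4 rules apply.
    if addr.version == 6 and addr.ipv4_mapped is not None:
        addr = addr.ipv4_mapped
    return (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local  # covers 169.254.169.254, a common cloud-metadata address
        or addr.is_reserved
        or addr.is_multicast
        or addr.is_unspecified
    )


def is_fetch_allowed(scheme: str, resolved_ip: str) -> bool:
    """Allow only http(s) requests whose resolved address is public."""
    return scheme.lower() in ALLOWED_SCHEMES and not is_blocked_address(resolved_ip)
```

Validating the resolved address rather than the hostname matters because a public-looking hostname can resolve to an internal address.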