Crawlability and indexing: why Google may be ignoring your pages

Some search problems no amount of good writing will fix, because they happen before the writing is ever read. A page that a search engine cannot crawl, or chooses not to index, does not rank badly. It does not rank at all.

This article treats that problem directly. It explains the two stages a page must pass before it can appear in search, the specific reasons a page fails at each, how to tell which stage is failing, and what to do about it. It is written for the business that has found, or suspects, that some of its pages are simply absent from search results.

A note on sources is in order. Peer-reviewed research is cited by author and year and listed at the end; Google’s own published guidance is cited as a primary source and identified as such; and any claim resting on the common practice of the SEO field is identified as practitioner consensus.

What this article covers

This article covers crawlability and indexing: whether a search engine can reach a business’s pages, and whether it then stores them so they can be shown. It is a focused article within the technical SEO part of this series, and it builds on the technical SEO pillar and the non-developer audit, which a reader who wants the wider context should treat as the background.

The article first separates crawling from indexing, because they are two stages and the distinction governs everything that follows. It then treats the reasons a page fails at each stage, how to diagnose which stage is at fault, and the often-confused matter of the robots file and the noindex instruction. It closes on orphan pages, on whether crawl budget is something a small business should worry about, on how to get a page indexed, and on when a page is better left out of the index deliberately.

Crawling and indexing: two stages, not one

The single most useful idea in this article is that getting a page into search is not one event but two, and that the two can fail independently. A business that holds this distinction can diagnose its own problem; a business that treats “being in Google” as a single thing cannot.

The first stage is crawling. Crawling is a search engine’s automated program, its crawler, fetching a page: arriving at the page’s address and retrieving its contents. The foundational accounts of how a large-scale search engine works describe this crawler as the component that follows links across the web and collects the pages it finds (Brin & Page, 1998; Arasu et al., 2001).

The second stage is indexing. Indexing is the search engine then processing the page it fetched, making sense of it, and storing it in the vast structure, the index, from which search results are drawn. Only a page that has been indexed is a candidate to appear when someone searches.

This two-stage structure is not a quirk of one search engine but a feature of how web search is built. The standard accounts of search engine architecture describe a crawler and an indexer as separate components, each with its own job and its own ways of failing (Arasu et al., 2001). A business that has taken in the distinction is, in effect, working with the actual architecture of the thing rather than a vague mental image of “being on Google”.

The two stages are sequential and separable. A page can fail to be crawled, in which case the question of indexing never arises. Or a page can be crawled successfully and then not indexed, which is a different problem with different causes and different fixes. The figure below sets out the two stages and the points at which a page can be lost.

Figure 1. The two stages, and where a page is lost. Crawling and indexing fail for different reasons, which is why the first task, when a page is missing from search, is to find out which stage is at fault.

What it costs when a page is not indexed

Before the causes, pause on what an indexing failure actually costs, because the cost is easy to underestimate, and the underestimation is why these failures so often go unaddressed.

A page that is not indexed is not a page that performs poorly. As far as search is concerned, it is a page that does not perform at all. Every visitor it might have received through search, every customer it might have reached, is simply absent: not lost to a competitor who ranked higher, but never in play, because the page was never a candidate to be shown.

This makes an indexing failure quietly expensive. The business has paid the full cost of creating the page (the time, the thought, the writing) and is receiving none of the return; and because the page exists and looks finished, nothing signals that the return is missing. A poorly ranked page at least appears in the data as a page with little traffic; an unindexed page can be overlooked entirely.

The cost compounds when the missing page is an important one. If the page a search engine is failing to index is a main service page, the business is not losing a peripheral visitor or two but the search visibility of the page that does its heaviest commercial work. That is the case in which diagnosing and fixing an indexing failure is not housekeeping but a genuine priority.

Why a page might not be crawled

The first way a page goes missing is that it is never crawled: the search engine’s crawler never fetches it. A handful of causes account for most cases.

The commonest is that nothing links to the page. A crawler discovers pages largely by following links, so a page with no link pointing to it from anywhere, not from the site’s own navigation, not from another page, not from anywhere external, gives the crawler no path by which to arrive. Such a page is treated at length below, because it is common enough to deserve its own name.

A second cause is an instruction that tells crawlers to stay away. A site carries a small file, the robots file (robots.txt), that tells crawlers what they may and may not fetch, and a page can be excluded by that file, sometimes deliberately, often by accident, and occasionally as a leftover from when the site was being built. A crawler that has been told not to fetch a page will not fetch it.

Other causes are more occasional. A crawler may be unable to reach a page because of how the site is built, or because the page sits behind something the crawler cannot pass. And a page may simply be new: a recently published page, or a page on a new site, may not yet have been discovered, because discovery takes time. What unites the crawl-stage failures is that they are about access: the crawler either has no route to the page or has been refused one.

One modern cause deserves a specific mention. Some pages present their real content only after the browser runs code, rather than delivering it directly in the page the crawler first receives. Search engines have grown much better at handling such pages, but a page that depends heavily on this can still be read incompletely or late. A small business is unlikely to meet this often, but it is worth knowing the category exists, because it is the kind of crawl-stage problem whose cause is genuinely a developer’s to explain.
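As a rough illustration of this category, the sketch below checks whether a key phrase is present in the HTML a server first delivers, before any browser code runs. It is a minimal sketch using only the Python standard library; the function name `phrase_in_initial_html` and both sample pages are hypothetical, and a real search engine's handling of such pages is far more sophisticated than this test.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def phrase_in_initial_html(raw_html, phrase):
    """True if the phrase appears as text in the HTML as first delivered."""
    parser = VisibleText()
    parser.feed(raw_html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    return phrase.lower() in text.lower()


# Hypothetical pages: one delivers its heading directly, one builds it with code.
STATIC_PAGE = "<html><body><h1>Boiler repairs in Leeds</h1></body></html>"
SCRIPTED_PAGE = (
    '<html><body><div id="app"></div><script>'
    'document.getElementById("app").textContent = "Boiler repairs in Leeds";'
    "</script></body></html>"
)

print(phrase_in_initial_html(STATIC_PAGE, "boiler repairs"))
print(phrase_in_initial_html(SCRIPTED_PAGE, "boiler repairs"))
```

If the phrase is missing from the initial HTML but visible in a browser, the page is probably in this category, and the question is one to put to the site's developer.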

How a page gets discovered in the first place

The crawl-stage failures make more sense against a clear picture of how a search engine discovers a page at all, so it is worth setting that picture out first.

A search engine’s crawler does not somehow know, in advance, every page that exists. It discovers pages, and it does so principally by following links: arriving at a page it already knows, reading the links on it, and adding the pages those links point to onto the list of pages still to fetch. The web, to a crawler, is explored by moving along its links from the pages already found.

There is a second route to discovery, the sitemap, the file a site can provide that simply lists its pages. A crawler that reads a site’s sitemap is given, directly, a list of pages the site wants found, without having to discover each of them by following a link. The sitemap supplements the link-following; it does not replace the need for sound internal linking.

This explains the crawl-stage failures at a stroke. A page reached by no link and absent from the sitemap has neither route to discovery, so the crawler has no way to learn it exists. A page that is linked, or listed in the sitemap, has a route. Most crawl-stage problems are, at bottom, a page that the discovery process has no path to, and most crawl-stage fixes are the provision of such a path.
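The discovery process described above can be sketched as a walk along links, with the sitemap as a second source of addresses. This is a simplified model in Python, not how any particular search engine is implemented; the function `discover` and the small site in `SITE_LINKS` are hypothetical.

```python
from collections import deque


def discover(start_urls, links, sitemap_urls=()):
    """Return every page reachable by following links or listed in the sitemap.

    links maps each page to the list of pages it links to.
    """
    queue = deque(start_urls)
    queue.extend(sitemap_urls)  # the sitemap is a second route to discovery
    found = set()
    while queue:
        url = queue.popleft()
        if url in found:
            continue
        found.add(url)
        queue.extend(links.get(url, []))  # follow the links on this page
    return found


# Hypothetical site: "/old-offer" exists, but no page links to it.
SITE_LINKS = {
    "/": ["/services", "/about"],
    "/services": ["/services/repairs"],
    "/about": [],
    "/services/repairs": [],
    "/old-offer": [],
}

by_links_only = discover(["/"], SITE_LINKS)
with_sitemap = discover(["/"], SITE_LINKS, sitemap_urls=["/old-offer"])
print("/old-offer" in by_links_only)
print("/old-offer" in with_sitemap)
```

In this model the unlinked page is found only once the sitemap lists it, which is the sense in which a crawl-stage fix is the provision of a path.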

This also explains a common worry of new businesses: that a brand-new site, or a brand-new page, takes time to appear in search at all. It does, and the reason is discovery. A new site has few or no links pointing to it from the wider web, so the routes by which a crawler would stumble upon it are scarce. That is precisely why submitting the site through the search engine’s free tool, and providing a sitemap, matters most at the very start, when ordinary link-following has the least to work with. A business should expect a new site to be found gradually rather than at once, should give the discovery process the help those two tools provide, and should not mistake the ordinary slowness of a new site’s discovery for a fault that needs fixing. Patience, at this one stage, is the correct response rather than a sign of a problem.

Why a page might be crawled but not indexed

The second way a page goes missing is more puzzling to a business, because the page has been fetched and yet still does not appear. This is an indexing failure, and its causes are different in kind from the crawl-stage causes.

One cause is, again, an explicit instruction. A page can carry a small instruction, distinct from the one in the robots file and treated below, that tells a search engine not to index it. A search engine that crawls such a page reads the instruction and obeys it, declining to add the page to its index.

The other causes are matters of the search engine’s judgement. A search engine does not index every page it crawls; it makes a decision, and it can decline. It may decline to index a page it judges thin or low-value, a page that does not, in its assessment, genuinely answer anything. It may decline to index a page it sees as a duplicate of another, choosing one version and leaving the other out. And sometimes it simply does not select a page, because its sense of the page’s quality or usefulness does not clear the bar.

This is an important and slightly uncomfortable point for a business to absorb. Crawling is largely mechanical (a page is reachable or it is not), but indexing involves judgement: a search engine can crawl a page, understand it perfectly, and still decide it is not worth keeping. When that happens, the problem is not technical access at all; it is the page itself, and the on-page mistakes treated earlier in this series are often the real cause behind an indexing failure.

A business looking at the site-owner tool will often meet a particular status that captures this exactly: a page reported as discovered or crawled but, in the tool’s words, currently not indexed. That status is not a technical error to be hunted down; it is the search engine saying, in effect, that it has seen the page and has not, so far, judged it worth keeping. Read that way, the status points the business not at its plumbing but at the page.

How to tell which stage is failing

Because the two stages fail for different reasons and need different fixes, the first practical task when a page is missing from search is to find out which stage is at fault. This is, fortunately, something a business can determine directly.

The means is the search engine’s free service for site owners, which the technical SEO articles in this series have already recommended. That service reports, page by page, the status of a site’s pages, and it distinguishes the cases. It will show whether a page has not been crawled at all, or whether it has been crawled but not indexed, and it gives a reason in each case.

This distinction is the diagnosis. A page reported as crawled but not indexed points the business toward the indexing-stage causes: an instruction, or a judgement about the page’s quality or its duplication. A page reported as not crawled points toward the crawl-stage causes: a missing link, a block, an access problem. The fix follows from the diagnosis, and the diagnosis is a matter of reading the report rather than guessing.

Do this before attempting any fix, because the wrong diagnosis leads to wasted effort. A business that responds to an indexing failure by adding more links to the page, which is a crawl-stage fix, has not addressed the indexing-stage cause, and the page remains absent. The report tells the business which problem it actually has.

Give the tool’s report a little patience as well as attention. The data a search engine’s site-owner tool shows is not always instantaneous; it can lag behind the live state of the site, so a page recently fixed may still show an old status for a while. A business should read the report as a recent account rather than a live one, and re-check after a fix rather than expecting the status to change the moment the fix is made.
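The diagnostic rule in this section amounts to a small lookup: the reported status names the stage, and the stage names the causes worth checking. The sketch below expresses that in Python. The status labels are hypothetical simplifications, not the site-owner tool's actual wording, and the cause lists simply restate those given in this article.

```python
# Hypothetical, simplified status labels; the tool's own wording differs.
CAUSES_BY_STATUS = {
    "not_crawled": (
        "crawling",
        ["nothing links to the page", "the robots file blocks it",
         "an access problem", "the page is new"],
    ),
    "crawled_not_indexed": (
        "indexing",
        ["a noindex instruction is present", "judged thin or low-value",
         "seen as a duplicate"],
    ),
    "indexed": (None, []),
}


def diagnose(status):
    """Return (failing stage, causes to check) for a reported page status."""
    if status not in CAUSES_BY_STATUS:
        raise ValueError(f"unknown status: {status!r}")
    return CAUSES_BY_STATUS[status]


stage, causes = diagnose("crawled_not_indexed")
print(stage)
for cause in causes:
    print(" -", cause)
```

The point of the lookup is the one made above: adding links appears only under the crawling stage, so it is not a remedy for a page the report shows as crawled but not indexed.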

The robots file and the noindex instruction: two different things

Two mechanisms have been mentioned that both, loosely, keep a page out of search: the robots file and the noindex instruction. They are so often confused that the confusion deserves its own section. They are different things, they act at different stages, and they interact in a way that surprises people.

The robots file is a site-wide file that instructs crawlers what they may fetch. It acts at the crawl stage: it governs whether the crawler is allowed to retrieve a page at all. The noindex instruction is a separate, per-page instruction that tells a search engine not to index a page. It acts at the index stage: it governs whether a page, once fetched, is kept.

The surprising interaction is this. If a business wants a page kept out of the index, the noindex instruction is the right tool, but for it to work, the search engine must be able to crawl the page, because the noindex instruction lives on the page and can only be read by a crawler that fetches it. A business that blocks a page in the robots file and also places a noindex instruction on it has, in effect, prevented the search engine from ever reading the noindex instruction. The two tools, used together in that way, work against each other.

The practical rule is to be clear about which outcome is wanted. To keep a page out of the index reliably, allow it to be crawled and use the noindex instruction. To use the robots file is to manage crawling, not indexing, and a business that reaches for it expecting it to remove a page from search may be disappointed; Google’s own guidance distinguishes these mechanisms and their proper uses (Google Search Essentials, 2022).

The most damaging real-world version of this confusion is worth naming. A business that wants a page gone from search, reaches for the robots file to block it, and finds the page stubbornly still appearing has met the trap directly: a page already in the index, then blocked from crawling, may remain in the index precisely because the search engine can no longer crawl it to discover that it should be removed. The reliable route to removing a page is to let it be crawled and to let the noindex instruction be read.
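The interaction between the two mechanisms can be checked mechanically. The sketch below uses Python's standard `urllib.robotparser` to read robots rules and `html.parser` to look for a robots meta tag containing noindex. The helper `check_page`, the class `RobotsMetaFinder`, the example.com addresses, and the sample rules are all hypothetical; the sketch looks only at the meta-tag form of the instruction and tests rules for the generic `*` user agent.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Notes whether the page has <meta name="robots" content="...noindex...">."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = (attr.get("content") or "").lower()
        if name == "robots" and "noindex" in content:
            self.noindex = True


def check_page(url, robots_txt, html, user_agent="*"):
    """Report whether a page may be crawled and whether its noindex can be read."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    crawl_allowed = rules.can_fetch(user_agent, url)
    finder = RobotsMetaFinder()
    finder.feed(html)
    return {
        "crawl_allowed": crawl_allowed,
        "has_noindex": finder.noindex,
        # A noindex instruction only works if the crawler may fetch the page.
        "noindex_readable": crawl_allowed and finder.noindex,
    }


# Hypothetical example: the page is blocked AND carries noindex.
ROBOTS_TXT = "User-agent: *\nDisallow: /thank-you\n"
PAGE_HTML = '<html><head><meta name="robots" content="noindex"></head></html>'

blocked = check_page("https://example.com/thank-you", ROBOTS_TXT, PAGE_HTML)
allowed = check_page("https://example.com/thank-you", "User-agent: *\nDisallow:\n", PAGE_HTML)
print(blocked)
print(allowed)
```

With the block in place the noindex instruction is present but unreadable; with the block removed it can be read and obeyed, which is the arrangement the practical rule above recommends.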

Orphan pages: the page nothing links to

The orphan page deserves a section because it is among the commonest crawl-stage problems and among the easiest to overlook. An orphan page is a page that exists, and may be perfectly good, but that nothing on the site links to.

Orphan pages arise naturally and quietly. A page is created and then, as the site is reorganised, the link to it is removed; or a page is built for some purpose and never connected into the site’s navigation; or a page is published but the business forgets to link to it from anywhere. The page sits there, reachable by anyone who has its exact address, and invisible to a crawler that has no link to follow to it.

The cost is straightforward. A crawler discovers most pages by following links, so an orphan page is, to a crawler moving through the site, simply not there, and a page a crawler never reaches is a page that is never crawled, never indexed, and never found. The page’s quality is irrelevant; an excellent orphan page is as invisible as a poor one.

The fix is correspondingly straightforward, and it is one a business can do itself. The orphan page needs to be linked to from the relevant parts of the site: from its section, from related pages, from the navigation if it belongs there. A page that is properly connected into the site is a page a crawler can reach, and the act of connecting it is ordinary internal linking rather than anything technical.

Finding orphan pages systematically is worth a word, since they are by definition the pages a business is least likely to notice. The search engine’s site-owner tool, and the free tools that map a site, can reveal pages that exist but that the site’s own link structure does not reach: the gap between every page the business has and every page its links connect. Comparing those two lists is how the orphans are found, and the comparison is something an owner can do.
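The comparison of the two lists is a set difference, and can be sketched in a few lines of Python. The function `find_orphans` and the sample data are hypothetical; in practice the list of all pages might come from the website system or the sitemap, and the link data from a site-mapping tool.

```python
def find_orphans(all_pages, internal_links, home="/"):
    """Return pages that exist but that no page on the site links to.

    internal_links maps each page to the list of pages it links to.
    The home page is excluded, since it is the crawler's usual way in.
    """
    linked_to = {target for targets in internal_links.values() for target in targets}
    return sorted(set(all_pages) - linked_to - {home})


# Hypothetical site data.
ALL_PAGES = ["/", "/services", "/about", "/services/repairs", "/old-offer"]
INTERNAL_LINKS = {
    "/": ["/services", "/about"],
    "/services": ["/services/repairs"],
}

orphans = find_orphans(ALL_PAGES, INTERNAL_LINKS)
print(orphans)
```

This simple version treats a page as connected if anything at all links to it, including another orphan; a stricter check would follow links outward from the home page instead.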

Does crawl budget matter for a small business?

A term that a small business may encounter, and worry about, is “crawl budget”, the idea that a search engine will only crawl so many of a site’s pages in a given period. It is worth addressing directly, because for most small businesses it is a concern that does not apply.

Crawl budget is real, and it matters, for very large sites. A site with hundreds of thousands or millions of pages can genuinely run up against the limits of how much a search engine will crawl, and managing crawl budget becomes a real discipline at that scale. The literature on how search engines crawl the web treats the scheduling and prioritising of crawling as a serious problem (Arasu et al., 2001).

A small business site is not that kind of site. A site with a few dozen, or a few hundred, pages does not strain any crawl limit; a search engine can comfortably crawl all of it. The concept of crawl budget, transplanted from large-site SEO into small-business advice, becomes a worry about a problem the small site does not have. It is, as a companion article in this series argued about technical SEO generally, effort spent solving the difficulties of a different kind of site.

The practical conclusion, held as practitioner consensus, is that a typical small business should not spend attention on crawl budget. If its pages are not being crawled, the cause is almost certainly one of the specific crawl-stage problems above (a missing link, a block, an orphan page) and not a budget the site is too small ever to exhaust.

How to get a page indexed

With the diagnosis made, the practical question is how to get a page that should be in search into the index. The answer depends on which stage was failing, and follows from everything above.

If the page was not being crawled, the fixes address access. Link to the page from the relevant parts of the site, so the crawler has a route to it. Check that the robots file is not blocking it, and remove the block if it is. Ensure the page is included in the site’s sitemap, the file that lists pages for search engines. These together give the crawler both a path to the page and a notice that the page exists.
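The sitemap check among these fixes can be done by reading the file directly. The sketch below parses a sitemap in the standard sitemaps.org XML format with Python's `xml.etree.ElementTree` and reports which pages are not listed. The function names and the example.com addresses are hypothetical, and the sketch assumes a single plain sitemap rather than a sitemap index that points to several files.

```python
import xml.etree.ElementTree as ET

# Namespace used by the standard sitemap format.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml):
    """Return the set of page addresses listed in a sitemap."""
    root = ET.fromstring(sitemap_xml)
    return {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }


def missing_from_sitemap(pages, sitemap_xml):
    """Return the pages, in the order given, that the sitemap does not list."""
    listed = sitemap_urls(sitemap_xml)
    return [page for page in pages if page not in listed]


# Hypothetical sitemap for a small site.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/services</loc></url>
</urlset>"""

missing = missing_from_sitemap(
    ["https://example.com/services", "https://example.com/services/repairs"],
    EXAMPLE_SITEMAP,
)
print(missing)
```

Any page the check reports as missing is a candidate for adding to the sitemap, alongside the internal links that give the crawler its ordinary route.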

If the page was crawled but not indexed, the fixes are different. If a noindex instruction is present and the page should in fact be indexed, that instruction must be removed. If the page was judged thin, low-value, or duplicate, the fix is not technical at all: it is to make the page genuinely worth indexing, by giving it real substance and a clear purpose, or to resolve its duplication with another page. The on-page articles in this series describe that work directly.

In either case, the search engine’s site-owner tool allows a business to request that a specific page be crawled and considered, which can prompt the search engine to look again sooner than it otherwise would. Understand, though, that requesting indexing is an invitation, not a command: the search engine will still apply its own judgement, and a page that was not indexed because it is thin will not be indexed merely because indexing was requested. The request speeds up the reconsideration; it does not override the reasons.

A note on time is needed here too. Even once a page has been fixed and reconsideration requested, indexing is not instant; the search engine works to its own schedule, and a fixed page may take days or longer to appear. A business should make the fix, confirm with the tool that the fix is in place, and then allow time, resisting the urge to conclude the fix failed before the search engine has had the chance to act on it.

Indexed is not the same as ranking well

One clarification belongs near the end of this article, because without it a business can draw the wrong conclusion from getting a page indexed.

Being indexed is necessary, but it is not the goal. An indexed page is a page that is eligible to appear in search results, a candidate, and that is all. Whether it then ranks well, and is actually seen by customers, depends on everything the rest of this series has treated: whether the page is a genuine answer, whether the site is technically sound, whether it has earned authority.

This matters because a business that fixes an indexing problem can feel that the SEO work on a page is now done. It is not; it has only just become possible. An indexing failure was a barrier in front of the page, and removing it lets the page enter the contest, but the contest still has to be won, on the terms the on-page, technical, and off-page articles describe.

The right way to hold it is sequential. Crawling and indexing are the gate; ranking is the race beyond the gate. This article has been about getting through the gate, because a page that does not is excluded entirely, but a business that has got its pages indexed should understand that it has reached the starting line, not the finish.

When a page should not be indexed

This article has so far treated being absent from the index as a problem. It is worth closing the topic by noting that, for some pages, being absent from the index is correct, and a business should know which of its pages those are.

Not every page on a site is meant to be found through search. A site commonly has pages that exist for a function rather than to be a search result: internal utility pages, confirmation pages a customer sees after an action, certain administrative or duplicate-by-design pages. These pages serve their purpose perfectly well without ever appearing in search, and a search result leading a stranger to one of them would be a poor result.

For such pages, the noindex instruction is the right tool, used deliberately. A business can place it on the pages that should not be search results, keeping the index populated only with the pages that are genuinely meant to be found: the homepage, the service pages, the real content. This is not a fault to be fixed; it is a tidiness worth maintaining.

A business does not always need to act on this actively. Many website systems already keep their purely functional pages out of search by default, and a business that simply has not thought about its utility pages may find they are already handled. The deliberate use of the noindex instruction is for the cases where a page that should not be a search result is in fact appearing as one; where the functional pages are already absent, there is nothing to do.

The principle that resolves the whole topic is this. The aim is not for every page to be indexed, but for the right pages to be indexed: the pages that are genuine answers to genuine searches should be in the index, and the pages that merely serve a function need not be. A business that has its real, customer-facing pages indexed, and is untroubled by the absence of its utility pages, has the situation as it should be.

Crawl and index problems at a glance

The table below gathers the common crawl-stage and index-stage problems, with the stage each belongs to and the fix.

Problem	Stage	What to do
Nothing links to the page (orphan page)	Crawling	Link to it from relevant pages and navigation
The robots file blocks the page	Crawling	Remove the block if the page should be found
The page is brand new	Crawling	Link to it, add it to the sitemap, allow time
A noindex instruction is present	Indexing	Remove it if the page should be indexed
The page is judged thin or low-value	Indexing	Give the page genuine substance and purpose
The page is seen as a duplicate	Indexing	Consolidate or genuinely differentiate the page
A utility page is correctly absent	Indexing	Nothing — or apply noindex deliberately

Concluding remarks

A page that cannot be crawled, or is not indexed, does not rank badly. It is absent, and no improvement to its writing will change that. Diagnosing and resolving this is therefore prior to all other SEO work on a page.

The key idea is that getting a page into search is two stages, not one. Crawling is the search engine reaching and fetching the page; indexing is the search engine processing and keeping it. The two fail for different reasons: crawl-stage failures are about access (a missing link, a block, an orphan page) while index-stage failures involve the search engine’s judgement, declining to keep a page it finds thin, duplicate, or low-value, or obeying a noindex instruction.

The first task, when a page is missing, is to use the search engine’s free site-owner tool to find out which stage is failing, because the fix follows from the diagnosis. Crawl-stage problems are fixed by restoring access; index-stage problems are fixed by removing a noindex instruction or, more often, by making the page genuinely worth indexing. And not every page should be indexed: utility pages are correctly absent, and the aim is for the right pages to be found, not all of them.

The next article in this series completes the technical SEO part by drawing its threads together, before the series turns to off-page SEO and the authority a site earns from beyond itself.

Future developments

Crawling and indexing are among the most stable parts of how search works, and that stability is worth noting amid the changes elsewhere. A search engine, or an AI system answering questions from the web, must still reach a page and process it before it can use it: the two stages remain, whatever is built on top of them.

What is changing is the pressure on the indexing judgement. As the web grows and as automated systems become more selective about what they draw on, the bar a page must clear to be kept and used is, if anything, rising. A page indexed today is a page that has been judged worth keeping, and that judgement is becoming more discriminating rather than less.

For a small business this points back to the same conclusion the on-page articles reached. The durable way to ensure a page is crawled and indexed is to make it genuinely reachable and genuinely worth keeping: a real answer to a real question, properly connected into the site. A page built that way satisfies the crawl stage and the index stage at once, and it does so in a way that the tightening of the indexing judgement does not threaten. The mechanics may be refined; the requirement to be a genuine, reachable answer will not be.

References

Arasu, A., Cho, J., Garcia-Molina, H., Paepcke, A., & Raghavan, S. (2001). Searching the Web. ACM Transactions on Internet Technology, 1(1), 2–43.

Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1–7), 107–117.

Google Search Essentials. (2022). Google Search Central documentation. Google. [Primary source — official platform documentation, not peer-reviewed.]