The state of business listings in 2026: a synthesis of four analyses of a curated directory of 14,362 records

Author. Gombos Atila Robert, PhD. Owner and Chief Executive Officer, Jasmine Business Directory (D-U-N-S 10-276-4189), Valley Cottage, New York. ORCID: 0000-0001-6468-2811. Correspondence through the author profile.

Data statement. The empirical material analysed in this study was taken directly from the production database of the Jasmine Business Directory. The directory has run since 2009, has never used paid advertising to acquire listings, and adds about ninety per cent of its entries through manual editorial review. The analysed export is current as of 25 May 2026. This study is the fifth and final in a connected series; it draws together and extends the four preceding studies, which examined category concentration, listing completeness, geographic distribution, and growth over time in the same corpus.

Abstract

This study is the fifth and final in a connected series analysing the production database of a curated business directory, and it draws the series together. The four preceding studies examined, separately, the concentration of listings across categories, the completeness of individual listings, the geographic distribution of the corpus, and the accumulation of listings over time. Each reported a finding; none could see how the findings related to one another. This study cross-tabulates the four measures across the same 14,362 listings to establish that relationship.

The central result is that the corpus is, in effect, two corpora layered together. Listings added in the directory’s formative and intensive period of 2009 to 2014, which make up 63.3% of the corpus, have a mean completeness of 1.80 of a possible 10 fields and carry a location in only 13.6% of cases; listings added from 2015 onward have a mean completeness of 6.88 and carry a location in 70.1% of cases. The incompleteness found by the second study, the limited geographic coverage found by the third, and the age skew found by the fourth are therefore not three independent problems but three views of one structural fact: a large, old, sparse layer of listings beneath a smaller, recent, substantially complete one. The study reads this two-layer structure through the data-quality literature (Wang & Strong, 1996), states what it implies for the four prior studies and for the directory, and offers reasoned projections. It is, throughout, a descriptive analysis of a single corpus, and its claims are held to that scope.

Keywords. business directories; data quality; listing completeness; corpus structure; directory growth; geographic coverage; cohort analysis; cross-tabulation; online visibility; curated directory; synthesis; business listings.

Introduction

This study closes a connected series of five. The four that precede it each examined one property of a single corpus, the production database of the Jasmine Business Directory, a curated directory of 14,362 business listings, and each reported a finding about that property in isolation. The first measured how unevenly listings are distributed across the directory’s categories; the second, how complete the individual listings are; the third, where the listed businesses are located; the fourth, when the listings were added across the directory’s seventeen-year history. A series of that shape invites a synthesis, but a synthesis that did no more than place the four findings side by side would be a summary, not a study, and would add nothing the four papers had not already said.

The question this study examines is therefore not what the four findings are but how they relate to one another. Each of the prior studies, at its close, raised the same open question and named it as future work: whether the gaps each had measured, the incomplete listings, the listings with no location, the listings from the oldest cohorts, fall on the same records or on different ones. This study answers that question directly, by recomputing the four measures for every listing and cross-tabulating them against one another.

That cross-tabulation is the whole of the study’s new empirical content, and it is worth saying clearly that it is new. The synthesis does not rest on the prose of the four prior papers, nor on their published tables; it returns to the same database export, recomputes the four measures listing by listing, and crosses them in ways none of the four studies attempted. A reader who treated this paper as a summary would be mistaking it; it is a fifth analysis, distinguished from the others only in that its raw material is four measures rather than one.

The approach is descriptive and quantitative, and it adds a genuine analytical step rather than restating the earlier ones. For each of the 14,362 listings, the study recovers a completeness score, the year the listing was added, whether the listing carries a location, and the size of the category it sits in; it then cross-tabulates completeness against age, against geographic coverage, and against category crowding. As with the four studies it draws on, this is an exploratory analysis of a single corpus, and its claims are held to that scope.

The data and their setting are unchanged from the companion studies. The corpus is the production database of the Jasmine Business Directory, founded in 2009 and headquartered in Valley Cottage, New York; the directory has never used paid advertising to acquire listings, and it adds about ninety per cent of its entries through manual editorial review. The export analysed here was taken on 25 May 2026 and contains 14,362 listings. Every figure in this study is computed from that single export, exactly as in the four studies it synthesises.

One point of method is worth making explicit. Because the synthesis recomputes its four measures from the same export the prior studies used, rather than importing their published figures, the four measures in this study are guaranteed to be mutually consistent and drawn from one extraction. A synthesis that stitched together numbers from four separately conducted analyses would risk small inconsistencies of definition or timing; recomputing all four in a single pass removes that risk, and it is why the study returns to the export rather than to the prior papers.

Why an integrated picture matters can be put at three levels. For the directory, the difference between facing four separate problems and facing one structural fact with four visible faces is the difference between four maintenance agendas and one; the synthesis is what tells the directory which of these it confronts. For a business that holds a listing, the integrated picture explains why the condition of its listing is what it is, and what, concretely, would change it. And for the wider understanding of how directories should be studied, the synthesis shows that the properties usually measured one at a time are, in at least one real corpus, strongly interdependent. Each level is developed in the discussion.

The study contributes three things. It establishes, first, through cross-tabulation, that the corpus is built as two layers, an old, sparse layer and a recent, complete one, and that this single structure underlies the separate findings of the four prior studies. It reads, second, what this means for those studies and for the directory, arguing that what looked like four problems is largely one. It offers, third, reasoned projections for the corpus and a note on the implications for the study of directories generally. The paper proceeds through a review of the relevant background, a description of the dataset, the methodology, the results, the discussion, the projections, the limitations, and the concluding remarks.

Background

This study rests on the four that precede it, and the background therefore begins by restating their frames before turning to what a synthesis must add, to the data-quality literature that is the series’ common thread, and to the gap this final study addresses.

The four prior studies were deliberately separate, each isolating one property of the corpus. The first examined category concentration and found the distribution of listings across categories to be markedly uneven: a Gini coefficient of 0.522 across category sizes, with the largest category, Law, holding 4.4% of all listings and the hundred largest categories holding just over half. The second examined listing completeness and found a mean of 3.67 filled fields out of ten, distributed not around that mean but in a multimodal pattern, with large groups of listings at the empty and the near-complete extremes. The third examined geography and found that only 34.3% of listings carry a country at all, with the located minority concentrated heavily in the anglophone world. The fourth examined growth and found the corpus sharply front-loaded in time, with 63.3% of all listings added by the end of 2014 and half the corpus added in the two years 2013 and 2014 alone.
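To make the concentration figure concrete, the sketch below computes a Gini coefficient from a list of category sizes using the standard rank-based formula. Whether the first study used exactly this estimator is not stated here and is an assumption; the function name `gini` and the example sizes are invented for illustration.

```python
def gini(sizes):
    """Gini coefficient of non-negative sizes (0 means perfectly even)."""
    xs = sorted(sizes)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one non-zero size")
    # Rank-weighted sum over the ascending sizes, with ranks starting at 1.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Invented category sizes, for illustration only (not the directory's data).
even = [50, 50, 50, 50]
skewed = [0, 0, 0, 200]
print(round(gini(even), 3))    # 0.0
print(round(gini(skewed), 3))  # 0.75
```

A value of 0.522 sits between these two toy extremes, which is consistent with the description of the category distribution as markedly uneven.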

Each finding was reported on its own, and each prior study measured only its own property. That separation was a virtue for clarity, but it left a question standing: the four properties are properties of the same listings, and whether they vary together or independently across those listings could not be seen from any one study. The four findings are, in this sense, four facets of one object viewed from four fixed positions, and no single view shows how the facets join.

The image of facets should not be pressed too far, though, because facets of a gem are fixed in their relations while the properties of a corpus need not be. Whether completeness and age are joined, and how, is not given in advance; it is something the listings either do or do not exhibit. The metaphor captures the position of the four prior studies, each a fixed view, but how the views relate is genuinely open until the cross-tabulation answers it.

The five studies as a single investigation

The five studies of this series were conceived as one investigation rather than as five separate papers that happen to share a corpus. The decision to examine concentration, completeness, geography, and growth in four distinct studies was a decision about how to be precise, not a decision that the four properties were unrelated. A single study attempting all four at once would have had to treat each in less depth; four studies could each measure one property exactly.

The cost of that decision was that the relationships among the four properties had to be deferred, and this study is where that deferred work is taken up. The four prior studies are, on this view, four instruments trained in turn on one object, and this fifth study is the step of comparing what the four instruments recorded. The series was always meant to end this way; a reader who has followed the four studies should regard the synthesis not as an afterword but as the stage at which the investigation reaches its actual subject, the corpus as a whole rather than any one of its properties.

This also fixes what the present study may and may not assume. It may assume the four prior measures, because it recomputes them from the same export and they are defined there; it may not assume any relationship among them, because establishing those relationships is its own task. The synthesis therefore begins from the four measures as given and treats their interdependence as the open question to be settled.

One consequence of this design deserves notice. Because the four prior studies were written without knowing what the synthesis would find, none of them could tilt its own analysis toward the two-layer conclusion; each measured its property on its own terms. The synthesis therefore inherits four measures that were fixed before their interrelation was known, which is a modest safeguard against reading the two-layer structure into the data rather than out of it.

Why a synthesis requires its own analysis

A synthesis could be attempted in either of two ways. The weaker way would set the four findings beside one another and draw a general conclusion in prose, that the corpus is, in various respects, uneven. That conclusion would be true and uninformative, and it would rest on no evidence the four studies had not already presented.

The stronger way, and the one this study takes, treats the relationship between the four properties as an empirical question with an answer in the data. If the incomplete listings, the unlocated listings, and the oldest listings are largely the same listings, the corpus has one problem wearing three appearances; if they are largely different listings, it has three. Only a cross-tabulation can distinguish these cases: completeness against age, completeness against geographic coverage, completeness against category size, computed listing by listing. The synthesis, properly done, is therefore not a recapitulation but a further analysis, and this study is built around that analysis.

Data quality as the common thread

The four studies share a vocabulary, whether or not each named it, and that vocabulary is the data-quality literature. Wang and Strong (1996) established that data quality is not a single property but a set of distinct dimensions, among them completeness, currency, and the consistency of representation; Pipino, Lee, and Wang (2002) treated those dimensions as separately measurable. Seen through this lens, the second study measured the completeness dimension directly; the third measured a particular case of completeness, the coverage of the geographic fields; the fourth measured the age structure that drives the currency dimension; and the first measured a structural property, concentration, that conditions how much visibility the quality of any individual listing can earn it.

The series is, on this reading, four data-quality measurements of one corpus, and the data-quality literature expects that such dimensions need not be independent. A record created hastily may be incomplete, untagged geographically, and never since revised, so that several dimensions fail together on the same record. Whether that co-failure is what the present corpus shows is exactly the question the synthesis sets out to answer.

The data-quality framing also clarifies why the synthesis is worth doing at all. If the dimensions of quality were independent, a directory could be incomplete without being stale, or sparsely located without being old, and a separate remedy would be needed for each. If they fail together, one remedy serves several dimensions. The practical stakes, one maintenance task or several, therefore rest directly on whether the dimensions co-vary, which is why that question is the one the study is built around.

The gap addressed by this study

The literature on directories, on search, and on data quality is well developed, and the four prior studies have measured one substantial corpus against it in detail. What is missing, and what this study supplies, is the integrated picture: an analysis that establishes how the separately measured properties of a real, advertising-free directory relate to one another across its listings. The study does not claim that the two-layer structure it finds generalises to all directories; it claims to establish that structure exactly for one corpus, and to show that recognising it changes how the four prior findings, and the directory’s own situation, should be understood.

The dataset: one corpus, four measures

The empirical material for this study is the production database of the Jasmine Business Directory, the same corpus analysed in the four companion studies. The directory was founded in 2009 by Pecsi Andras and Robert Gombos, is headquartered in Valley Cottage, New York, and has run continuously since then as a general business directory organised by subject category. It has never used paid advertising to acquire listings, and it adds about ninety per cent of its entries through manual editorial review.

The unit of analysis is the individual listing record, of which the export, taken on 25 May 2026, contains 14,362. What sets this study apart from the four it draws on is not the data but the treatment: where each prior study isolated one property, this study recovers four measures for every listing at once, so that the four can be cross-tabulated. The four measures are a completeness score, the year of addition, a location indicator, and the size of the listing’s category. Table 1 summarises the dataset and the four measures.

**Table 1.** The dataset and the four per-listing measures recovered for the synthesis.
| Attribute | Value |
| --- | --- |
| Source | Jasmine Business Directory, production database |
| Directory founded | 2009 |
| Headquarters | Valley Cottage, New York |
| Listing-acquisition model | Organic submission and editorial addition; no paid advertising |
| Editorial curation | Approximately 90% of entries added through manual review |
| Export reference date | 25 May 2026 |
| Listings (universe of analysis) | 14,362 |
| Prior studies synthesised | Four (category concentration, completeness, geography, growth) |
| Measure 1: completeness | Count of filled fields, 0–10, across ten core fields |
| Measure 2: age | Calendar year the listing was added to the database |
| Measure 3: location | Whether the listing carries a country |
| Measure 4: category crowding | Number of listings in the listing’s category |

Methodology

The methodology is set out in three parts: the recovery of the four measures for every listing, the cross-tabulations themselves, and a note on the descriptive design and on the reading of correlation. The framework of the synthesis is shown in Figure 1.

Figure 1. The synthesis framework. The four prior studies are not merely summarised; their measures are recomputed together for every listing and cross-tabulated, and it is the cross-tabulation that yields the two-layer structure reported below.

Recovering the four measures for every listing

Four measures were computed for each of the 14,362 listings. The completeness score is the count of filled fields, from zero to ten, across ten core fields of a listing: its title, description, and keywords; its company name, street, city, state, postal code, and country; and its telephone number. This is the measure used in the second study, and a score of ten denotes a listing with every core field populated, a score of zero a listing with none.
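The completeness score described above can be sketched as a simple count. The field names in `CORE_FIELDS` are hypothetical stand-ins for the ten core fields, since the export's actual column names are not given in the text, and treating whitespace-only values as unfilled is likewise an assumption.

```python
# Hypothetical column names; the export's real field names are not given in the text.
CORE_FIELDS = (
    "title", "description", "keywords",
    "company_name", "street", "city", "state", "postal_code", "country",
    "telephone",
)


def completeness(listing):
    """Count the core fields that hold a non-blank value (0 to 10)."""
    filled = 0
    for field in CORE_FIELDS:
        value = listing.get(field)
        # Assumption: None and whitespace-only strings count as unfilled.
        if value is not None and str(value).strip():
            filled += 1
    return filled


example = {"title": "Acme Legal", "description": "Law firm", "country": "US", "city": "  "}
print(completeness(example))  # 3
```

The score is unweighted by design: each of the ten fields contributes exactly one point, so a value of ten means every core field is populated.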

The age measure is the calendar year drawn from the listing’s addition timestamp, the field analysed in the fourth study, which records when the listing entered the database. The location measure is a single indicator of whether the listing carries a country, the criterion of geographic coverage used in the third study. The crowding measure is the number of listings in the category to which the listing belongs, the quantity behind the first study’s concentration analysis. All four measures are deterministic functions of the export, and each reproduces a quantity already defined and used in one of the prior studies.
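The other three measures can be recovered in one pass over the listings, as in the following sketch. The keys `added_at`, `country`, and `category`, and the ISO-8601 timestamp format, are assumptions made for illustration; the text does not describe the export's schema.

```python
from collections import Counter
from datetime import datetime


def recover_measures(listings):
    """Return (year_added, has_country, category_size) for each listing."""
    # Category crowding: how many listings share each category.
    sizes = Counter(item["category"] for item in listings)
    rows = []
    for item in listings:
        # Age: calendar year of the addition timestamp (assumed ISO-8601).
        year = datetime.fromisoformat(item["added_at"]).year
        # Location: a single indicator of whether a country is present.
        has_country = bool((item.get("country") or "").strip())
        rows.append((year, has_country, sizes[item["category"]]))
    return rows


# Invented records, for illustration only.
sample = [
    {"category": "Law", "added_at": "2013-06-01T10:00:00", "country": ""},
    {"category": "Law", "added_at": "2021-02-14T09:30:00", "country": "US"},
    {"category": "Florists", "added_at": "2010-11-20T16:45:00", "country": None},
]
print(recover_measures(sample))
# [(2013, False, 2), (2021, True, 2), (2010, False, 1)]
```

Each output is a deterministic function of the input records, which is the property the synthesis relies on when it recomputes all four measures from a single export.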

The four measures are deliberately the simplest faithful version of each prior study’s central quantity. The completeness score does not weight the ten fields, the age measure does not try to distinguish editorial addition from self-submission, the location measure is a single indicator rather than the full geographic detail, and the crowding measure is a raw category count. This simplicity is intended: a synthesis needs measures that are easy to cross-tabulate and hard to dispute, and a more elaborate version of each would have complicated the cross-tabulation without changing the structure it reveals.

The cross-tabulations

The analysis then crosses these measures. Completeness was tabulated against age by grouping listings into the four eras of the fourth study, the formative years of 2009 to 2012, the intensive years of 2013 and 2014, the steady years of 2015 to 2019, and the recent years of 2020 to 2026, and computing the mean completeness within each. Completeness was tabulated against location by comparing the mean completeness of listings that carry a country with that of listings that do not. Completeness was tabulated against crowding by grouping categories into crowded, medium, and small bands and computing the mean completeness within each.
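A cross-tabulation of this kind reduces to assigning each listing to a group and averaging within groups. The sketch below does this for the four eras; the helper names `era_of` and `mean_by_group` and the toy scores are illustrative assumptions, not the study's code or data.

```python
from collections import defaultdict

# Era boundaries as defined in the text.
ERAS = (
    ("Formative", 2009, 2012),
    ("Intensive", 2013, 2014),
    ("Steady", 2015, 2019),
    ("Recent", 2020, 2026),
)


def era_of(year):
    """Map a calendar year to its era name."""
    for name, first, last in ERAS:
        if first <= year <= last:
            return name
    raise ValueError(f"year {year} is outside 2009-2026")


def mean_by_group(rows, key):
    """Mean completeness per group; rows are (year, completeness) pairs."""
    totals = defaultdict(lambda: [0, 0])  # group -> [sum of scores, count]
    for year, score in rows:
        bucket = totals[key(year)]
        bucket[0] += score
        bucket[1] += 1
    return {group: s / n for group, (s, n) in totals.items()}


# Invented (year, completeness) pairs, for illustration only.
toy = [(2010, 1), (2011, 3), (2014, 2), (2016, 6), (2018, 8), (2022, 7)]
print(mean_by_group(toy, era_of))
# {'Formative': 2.0, 'Intensive': 2.0, 'Steady': 7.0, 'Recent': 7.0}
```

The same grouping pattern would serve the other two cross-tabulations by swapping the grouping key for a location indicator or a crowding band.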

One further partition follows from these cross-tabulations and is used throughout the results: a division of the corpus at the 2014 to 2015 boundary into an earlier portion, comprising the formative and intensive eras, and a later portion, comprising the steady and recent eras. The results will show why that particular boundary is the natural one.

No cross-tabulation in this study introduces information from outside the export. Every cell of every table is a count of listings that satisfy a combination of the four recomputed measures, and the partition into two layers is a grouping of those same counts. The analysis adds a new view of the data, not new data, and that is the sense in which it is a genuine further step rather than a fifth independent collection.

Why the boundary falls between 2014 and 2015

The partition of the corpus into two layers is drawn at the boundary between 2014 and 2015, and because a boundary chosen by the analyst can always be questioned, the basis for this one should be set out before the results rely on it.

The boundary is not imposed on the data; it is read off the point where the data itself breaks. The era cross-tabulation shows that mean completeness is 1.81 in the intensive era ending in 2014 and 6.43 in the steady era beginning in 2015, a jump of 4.62 fields between two adjacent eras; the other two era boundaries show differences of only 0.06 and 0.78. The change between the two layers is a step, not a slope, and a step has a natural location: the single year-boundary it falls across. A cut placed elsewhere, between 2016 and 2017 for instance, would divide a layer that is internally similar and would not correspond to any feature of the data.

The two-layer partition is therefore a description of a discontinuity that is already in the corpus, not a grid laid over a smooth gradient. This is also why the study is willing to speak of two populations rather than of a continuum: the listings on either side of the 2014 to 2015 boundary differ sharply and consistently, while the listings within each side resemble one another.
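The location of the break can be checked from the era means alone. The sketch below takes the four means reported in Table 3 and computes the difference across each pair of adjacent eras; it works at era granularity only, so confirming that the break falls at a single year boundary would need the same calculation on yearly means, which are not reported in this section.

```python
# Era means of completeness as reported in Table 3 of this study.
era_means = [
    ("Formative", 1.75),
    ("Intensive", 1.81),
    ("Steady", 6.43),
    ("Recent", 7.21),
]

# Difference in mean completeness across each pair of adjacent eras.
jumps = [
    (earlier, later, round(m_later - m_earlier, 2))
    for (earlier, m_earlier), (later, m_later) in zip(era_means, era_means[1:])
]
for earlier, later, jump in jumps:
    print(f"{earlier} -> {later}: {jump:+.2f}")
# Formative -> Intensive: +0.06
# Intensive -> Steady: +4.62
# Steady -> Recent: +0.78

largest = max(jumps, key=lambda j: abs(j[2]))
print("largest break:", largest[0], "->", largest[1])
# largest break: Intensive -> Steady
```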

The descriptive design and a note on correlation

As in the four prior studies, the design is descriptive. The corpus is the entire production database of one directory, analysed in full; the cross-tabulated figures are exact properties of that corpus rather than estimates, and no inferential test is applied, because there is no sampling beyond the corpus itself.

One point of interpretation must be fixed before the results. A cross-tabulation establishes that two properties vary together; on its own it does not establish that one causes the other. Where the results report that listings in crowded categories are more complete, or that older listings are less complete, these are statements of association within this corpus, and the discussion is careful to mark where a third factor, age in particular, may lie behind an apparent relationship. The study reports the associations exactly and reserves causal language for where it is warranted.

Results

The results are presented in five parts: a restatement of the four prior findings; the cross-tabulation of completeness against age; the cross-tabulation of completeness against category crowding; the cross-tabulation of completeness against geographic coverage; and the two-layer structure of the corpus that these cross-tabulations together reveal. Each part is accompanied by the relevant figure or table.

The four prior findings restated

Table 2 collects, for reference, the four findings the synthesis builds on. Each was established in detail in its own study; each is a property of the same 14,362 listings; and none, taken alone, could show how it relates to the other three.

Table 2 is included for the reader’s convenience, not as a result of this study; the four findings it records were established and defended in the four prior papers, and nothing in the present study revises them. What the present study adds begins with the cross-tabulations that follow, and the table serves only to hold the four prior findings in view while those cross-tabulations are read.

**Table 2.** The four prior studies and their principal findings.
| Study | Dimension examined | Headline measure | Principal finding |
| --- | --- | --- | --- |
| 1 | Category concentration | Gini 0.522 across category sizes | Listings are distributed very unevenly across categories |
| 2 | Listing completeness | Mean 3.67 of 10 fields filled | Completeness is multimodal, with large groups at both extremes |
| 3 | Geographic distribution | 34.3% of listings carry a country | Coverage is partial; located listings are 88.3% anglophone |
| 4 | Growth over time | 49.9% of listings added in 2013–2014 | Accumulation is front-loaded; the corpus is age-skewed |

Completeness across the age cohorts

The first cross-tabulation sets completeness against age, computing the mean completeness score within each of the four eras. Figure 2 shows the result, and it is stark.

Figure 2. Mean listing completeness by era. The two early eras average under two filled fields of ten; the two recent eras (shown in the accent colour) average more than six. The change between them is abrupt rather than gradual.

A listing added in the steady or recent eras carries, on average, roughly three and a half to four times as many filled fields as a listing added in the formative or intensive eras. The mean completeness of the formative era is 1.75 and of the intensive era 1.81; the steady era jumps to 6.43 and the recent era to 7.21. The change is not a gradient spread evenly across the seventeen years; it is a step, and the step falls precisely at the boundary between 2014 and 2015. The listings of the two early eras and the listings of the two later eras differ in completeness so sharply that they are best regarded as belonging to two different populations, and the conjecture offered by the fourth study, that old and incomplete listings coincide, is confirmed.

The abruptness of the step deserves a moment’s attention, because a gradual decline and a sharp step would call for different explanations. A gradual decline in completeness with age would suggest a slow process, listings decaying, or standards drifting, year by year. A sharp step, by contrast, points to a discrete change: a change in how listings were added, happening at a particular time. The data show the step, and the discussion accordingly looks for a change in editorial practice around 2014 and 2015 rather than for a gradual process spread across the seventeen years.

Completeness and category crowding

The second cross-tabulation sets completeness against category crowding, comparing the mean completeness of listings in crowded categories with that of listings in medium and small ones. Figure 3 shows the result.

Figure 3. Mean listing completeness by category crowding. Listings in crowded categories are markedly more complete than those in small ones, an association whose interpretation, and a possible confound, are discussed below.

Listings in crowded categories, those holding a hundred or more listings, have a mean completeness of 7.15, against 3.43 for listings in medium categories and 2.27 for those in small ones. The association is clear: the more crowded a listing’s category, the more complete the listing tends to be. Reading that association, however, requires care, and the discussion returns to it. A cross-tabulation cannot on its own show whether crowding draws more complete listings or whether a third factor lies behind the relationship; and the obvious candidate for that third factor is age, since the recent era’s listings, established above as the more complete ones, may also be the listings that have filled the directory’s larger categories. The result is reported here as a genuine association within the corpus, with that confound flagged rather than resolved.

It is worth being explicit about how the confound would work, so the reader can weigh it. The later layer’s listings are the more complete ones, and if those listings also fell disproportionately into the directory’s larger categories, because the larger categories were the ones most actively populated in the steady and recent eras, then crowded categories would show higher completeness without crowding itself having any effect. The crowding association of Figure 3 is real in the data, but the synthesis cannot, with the measures in hand, separate a genuine crowding effect from this age-driven appearance of one.
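The mechanism just described can be made visible with a toy example. In the invented data below, crowding has no effect at all within either layer, yet the pooled comparison shows crowded categories as far more complete, purely because later-layer listings fall mostly into crowded categories. The numbers are made up to illustrate the mechanism and say nothing about which case the corpus is actually in.

```python
from collections import defaultdict


def group_means(rows, key):
    """Mean score per group, where key(row) names the group."""
    acc = defaultdict(lambda: [0, 0])  # group -> [sum of scores, count]
    for row in rows:
        bucket = acc[key(row)]
        bucket[0] += row["score"]
        bucket[1] += 1
    return {group: s / n for group, (s, n) in acc.items()}


# Invented listings: within each layer, both bands score identically.
toy = (
    [{"layer": "earlier", "band": "small", "score": 2}] * 8
    + [{"layer": "earlier", "band": "crowded", "score": 2}] * 2
    + [{"layer": "later", "band": "small", "score": 7}] * 2
    + [{"layer": "later", "band": "crowded", "score": 7}] * 8
)

pooled = group_means(toy, lambda r: r["band"])
within = group_means(toy, lambda r: (r["layer"], r["band"]))
print(pooled)  # {'small': 3.0, 'crowded': 6.0}
print(within)
# {('earlier', 'small'): 2.0, ('earlier', 'crowded'): 2.0,
#  ('later', 'small'): 7.0, ('later', 'crowded'): 7.0}
```

The pooled gap of three fields is produced entirely by the mix of layers inside each band, which is the age-driven appearance of a crowding effect that the paragraph above warns about.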

Completeness and geographic coverage are nearly the same variable

The third cross-tabulation sets completeness against geographic coverage, comparing the mean completeness of listings that carry a country with that of listings that do not. The result is not a correlation so much as a near-identity. Listings that carry a country have a mean completeness of 8.63 of ten; listings that do not have a mean of 1.07. A located listing is, on average, almost fully complete; an unlocated listing is, on average, almost entirely empty.

Part of this gap is mechanical: the country field is itself one of the ten fields counted in the completeness score, so a located listing is guaranteed at least that one point. But a single shared field cannot produce a gap of more than seven and a half points. The gap reflects, rather, the finding of the second study that the fields of a listing are populated together or not at all: a listing that carries a country tends also to carry its other address fields, its contact details, and its content fields, while a listing with no country tends to carry almost nothing. Geographic coverage and overall completeness, in this corpus, are very nearly the same variable measured twice. Table 3 sets out the master cross-tabulation, in which completeness, geographic coverage, and the share of near-complete and near-empty listings are reported together for each era.
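The size of the mechanical part of the gap follows from one line of arithmetic on the two reported means. The sketch assumes that the location indicator and the completeness score apply the same test of whether the country field is filled, so that the field adds exactly one point to every located listing and none to every unlocated one.

```python
# Mean completeness reported in the text for listings with and without a country.
mean_located = 8.63
mean_unlocated = 1.07

# Removing the country field lowers the located mean by exactly 1
# and leaves the unlocated mean unchanged (assumption stated above).
located_other_nine = mean_located - 1
unlocated_other_nine = mean_unlocated - 0

gap_all_ten = round(mean_located - mean_unlocated, 2)
gap_other_nine = round(located_other_nine - unlocated_other_nine, 2)
print(gap_all_ten, gap_other_nine)  # 7.56 6.56
```

On this reading, about one point of the 7.56-point gap is mechanical, and roughly six and a half points remain across the nine fields that the location indicator does not touch.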

The choice to organise the master cross-tabulation by era, rather than by completeness band or by coverage status, is deliberate. Age is the one of the four measures that is set automatically and is present for every listing, as the fourth study established; it is therefore the most reliable axis along which to array the others. Table 3 uses age as its backbone for that reason, and the result is that every other column can be read against a spine that is itself beyond question.

**Table 3.** The master cross-tabulation: completeness and geographic coverage by era (universe: 14,362 listings).
| Era | Years | Listings | % of corpus | Mean completeness (of 10 fields) | % with ≥7 fields | % with ≤1 field | % carrying a location |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Formative | 2009–2012 | 1,923 | 13.4% | 1.75 | 10.6% | 55.8% | 15.4% |
| Intensive | 2013–2014 | 7,163 | 49.9% | 1.81 | 10.9% | 50.0% | 13.1% |
| Steady | 2015–2019 | 2,248 | 15.7% | 6.43 | 63.2% | 13.5% | 71.3% |
| Recent | 2020–2026 | 3,028 | 21.1% | 7.21 | 65.1% | 1.3% | 69.2% |

The four columns to the right of Table 3 move together almost in lockstep. The two early eras hold a low mean completeness, a small share of near-complete listings, a large share of near-empty ones, and a low rate of geographic coverage; the two later eras reverse all four. The era boundary at 2014 to 2015 therefore separates not one property but all four at once: mean completeness, the near-complete share, the near-empty share, and geographic coverage.

How sharply the two populations separate

Before the two layers are set out as a partition, it is worth dwelling on how completely the cross-tabulation separates them, because the degree of separation is itself a result. The means alone, 1.80 against 6.88, understate it; the distributions behind those means are concentrated at opposite ends of the scale.

Table 3 carries the detail. In the two early eras, listings with one filled field or none make up 55.8% and 50.0% of the era; in the two later eras, that near-empty share falls to 13.5% and then to 1.3%. The share of near-complete listings, those with seven or more fields, moves in the opposite direction with equal force: from about a tenth in the early eras to roughly two-thirds in the later ones. A listing with one field and a listing with eight are not two points on a shared scale of effort; they are listings of two different kinds, and the two kinds sort almost entirely by era.

The practical expression of this separation is predictive. Knowing only the year a listing was added, a single, automatically recorded fact, one could predict which completeness band it falls into and be right in the large majority of cases. A property as consequential as completeness being so nearly determined by a property as incidental as date of entry is the clearest sign that the corpus is not one population but two.

This predictive framing should not be overstated into a claim it does not support. That date of entry predicts completeness does not mean date of entry causes completeness; the date is a marker for the era, and it is the era’s editorial practice, not the calendar, that did the work. The point of the predictive observation is only to convey how complete the separation is: two properties that track each other this closely are, for descriptive purposes, two names for one division of the corpus.
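How far the year of entry alone determines the completeness band can be bounded from Table 3. The sketch below is a minimal illustration, not the study's own procedure. It rebuilds approximate counts from the rounded percentages in the table and applies a hypothetical rule: predict "seven or more fields" for a listing added from 2015 onward, and "fewer than seven" otherwise. The rule and the names `ERAS` and `rule_accuracy` are assumptions introduced for the example.

```python
# Era rows from Table 3: (era, listings, share with >=7 fields, added from 2015 onward?)
ERAS = [
    ("Formative", 1923, 0.106, False),
    ("Intensive", 7163, 0.109, False),
    ("Steady",    2248, 0.632, True),
    ("Recent",    3028, 0.651, True),
]

def rule_accuracy(eras):
    """Share of listings for which the era-only rule guesses the band correctly.

    Hypothetical rule: a listing from a later era is predicted to have >=7
    fields; a listing from an earlier era is predicted to have fewer than 7.
    Counts are approximations rebuilt from rounded percentages.
    """
    total = sum(n for _, n, _, _ in eras)
    correct = 0.0
    for _, n, share_complete, is_later in eras:
        # Later eras: the rule is right for the near-complete share.
        # Earlier eras: the rule is right for everything else.
        correct += n * (share_complete if is_later else 1.0 - share_complete)
    return correct / total

accuracy = rule_accuracy(ERAS)
print(f"Era-only rule is right for about {accuracy:.0%} of listings")
```

On these figures the rule is right for roughly four listings in five. Its errors fall mainly in the later layer, where just over a third of listings still have fewer than seven fields.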

The two layers of the corpus

The three cross-tabulations converge on a single description of the corpus, and it is best stated by partitioning the corpus at the boundary the cross-tabulations themselves identify. Figure 4 and Table 4 set out that partition into two layers.

Figure 4. The corpus as two layers. The single figure of 14,362 listings resolves, once age is crossed with completeness and coverage, into a large sparse layer and a smaller substantially complete one. The heights of the two blocks are drawn roughly in proportion to the number of listings each holds.

**Table 4.** The two layers of the corpus compared.
Property	Earlier layer (2009–2014)	Later layer (2015–2026)
Listings	9,086	5,276
Share of the corpus	63.3%	36.7%
Mean completeness (of 10 fields)	1.80	6.88
Listings carrying a location	13.6%	70.1%
Typical character	Sparse, skeletal	Substantially complete

The partition is decisive. The earlier layer, comprising the formative and intensive eras, holds 9,086 listings, 63.3% of the corpus, with a mean completeness of 1.80 and a location on only 13.6% of them. The later layer, comprising the steady and recent eras, holds 5,276 listings, 36.7% of the corpus, with a mean completeness of 6.88 and a location on 70.1% of them. Two further figures show how far the corpus’s quality is concentrated in the later layer: 75.0% of all located listings, and 77.5% of all listings with seven or more filled fields, belong to it, although it holds only 36.7% of the listings. The directory’s substantially complete records and its located records are each, in about three cases out of four, recent records.
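The layer figures in Table 4 follow from the era rows of Table 3 once each era is weighted by its listing count. The sketch below is a minimal check under that assumption. Because it works from the rounded percentages printed in Table 3, it reproduces the Table 4 figures only to rounding. The names `ERAS` and `layer_summary` are illustrative.

```python
# Era rows from Table 3:
# (era, layer, listings, mean completeness, share >=7 fields, share located)
ERAS = [
    ("Formative", "earlier", 1923, 1.75, 0.106, 0.154),
    ("Intensive", "earlier", 7163, 1.81, 0.109, 0.131),
    ("Steady",    "later",   2248, 6.43, 0.632, 0.713),
    ("Recent",    "later",   3028, 7.21, 0.651, 0.692),
]

def layer_summary(eras, layer):
    """Count-weighted aggregates for one layer, rebuilt from era rows."""
    rows = [r for r in eras if r[1] == layer]
    n = sum(r[2] for r in rows)
    return {
        "listings": n,
        "mean_completeness": sum(r[2] * r[3] for r in rows) / n,
        "near_complete": sum(r[2] * r[4] for r in rows),  # approximate count
        "located": sum(r[2] * r[5] for r in rows),        # approximate count
    }

earlier = layer_summary(ERAS, "earlier")
later = layer_summary(ERAS, "later")
total = earlier["listings"] + later["listings"]

for name, s in (("earlier", earlier), ("later", later)):
    print(
        f"{name}: {s['listings']} listings ({s['listings'] / total:.1%}), "
        f"mean completeness {s['mean_completeness']:.2f}, "
        f"located {s['located'] / s['listings']:.1%}"
    )

# Share of all located and all near-complete listings held by the later layer.
later_share_located = later["located"] / (earlier["located"] + later["located"])
later_share_complete = later["near_complete"] / (
    earlier["near_complete"] + later["near_complete"]
)
print(f"later layer holds {later_share_located:.1%} of located listings")
print(f"later layer holds {later_share_complete:.1%} of near-complete listings")
```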

The convergence of the three partitions is worth stating plainly. Whether the corpus is partitioned by completeness, by geographic coverage, or by age, very nearly the same line is drawn, separating very nearly the same two groups of listings. Because the three partitions coincide so closely, the study is justified in speaking of two layers rather than of three loosely related splits; the layers are not an interpretation imposed on the data but the single partition that all three cross-tabulations independently produce.

Discussion

The results establish that the corpus divides cleanly into two layers, an earlier sparse one and a later substantially complete one, and that this single division underlies the separate findings of the four prior studies. The discussion now reads that structure: it states the two-layer finding plainly, considers why the earlier layer is sparse, re-reads each of the four prior studies in the light of the synthesis, sets out what the state of the directory’s listings in 2026 amounts to, and draws the implications for businesses, for the directory, and for how directories are studied.

The directory is two directories

The plainest statement of the synthesis is that the directory is, in effect, two directories that share a single database. There is a later directory of some 5,300 listings, recent, substantially complete, mostly carrying a location, and there is an earlier directory of some 9,100 listings, older, sparse, mostly carrying no location and little content beyond a title and a category. These two bodies of listings differ on every measure the series has examined, and they differ not by a little but by a wide margin.

This has a consequence for how the corpus should be described. The second study reported a mean completeness of 3.67 filled fields, and that figure is exact; but the present study shows that it is also, in a sense, a fiction, because almost no listing actually has between three and four fields filled. The mean of 3.67 is the midpoint of an empty gap between two populations, one clustered near one field and the other near eight; it describes the average of the corpus and the condition of hardly any listing in it. To speak of the typical listing in this directory is to speak of one of two very different things, and a single average conceals exactly the structure that matters.

The point about the misleading average holds beyond this directory. Any time a corpus is composed of two populations and a single mean is reported for it, the mean will land in the gap between them and describe neither; the more sharply the two populations separate, the more thoroughly the average misleads. The 3.67 of the second study is a particularly clear instance because the separation here is so sharp, but the general caution, that an average is informative only when the thing being averaged is roughly unimodal, is one the synthesis would urge on any corpus study.
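The arithmetic behind the misleading average is short. The sketch below is a minimal illustration that uses only the layer sizes and layer means of Table 4: the corpus mean is the count-weighted mean of the two layer means, and it lands between them at a value far from both. The variable names are illustrative.

```python
# Layer sizes and mean completeness, from Table 4.
layers = {
    "earlier": {"listings": 9086, "mean": 1.80},
    "later":   {"listings": 5276, "mean": 6.88},
}

total = sum(layer["listings"] for layer in layers.values())

# The corpus mean is the count-weighted mean of the layer means.
corpus_mean = sum(
    layer["listings"] * layer["mean"] for layer in layers.values()
) / total

# Distance from the corpus mean to each layer's own mean.
distance = {
    name: abs(corpus_mean - layer["mean"]) for name, layer in layers.items()
}

print(f"corpus mean: {corpus_mean:.2f}")
for name, d in distance.items():
    print(f"  {d:.2f} fields away from the {name} layer's mean")
```

The weighted mean recovers the 3.67 of the second study, and it sits nearly two fields above the earlier layer's mean and more than three below the later layer's.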

Why the earlier layer is sparse

The fourth study established that the earlier layer was built in a particular way: the formative years to 2012 and, above all, the intensive years of 2013 and 2014, when half the corpus was added in a documented period of concentrated editorial accumulation. The present study adds the dimension that the fourth could not see, that this intensive build produced breadth without depth.

A reasoned supposition follows. The intensive era appears to have been, in the main, a campaign to establish coverage: to get a large number of businesses and sites listed and sorted into categories, so that the directory would have the breadth a directory needs. Filling out those listings, the descriptions, the addresses, the contact details, the locations, was a separate and larger task, and for most of the earlier-layer listings it appears never to have followed.

The later layer’s much higher completeness then reflects a changed practice from 2015 onward: the directory added far fewer listings each year, as the fourth study showed, but it added them more fully. The most economical reading is that the directory shifted, around 2014 and 2015, from a breadth-first mode of accumulation to a mode in which a listing was added closer to complete. This is not a criticism of the editorial work of the intensive era; a large, broad skeleton of listings is a reasonable thing to build first, and the sparse earlier layer is simply what a breadth-first build leaves behind when the filling-out does not catch up.

This reading is a supposition about practice, not a finding read directly from a field. The database does not record an editorial intention, and the study cannot show that a breadth-first strategy was consciously chosen; what it can show is that the pattern in the data, many listings, added fast, left sparse, is what such a strategy would produce. The supposition is offered because it is the most economical account of the observed structure, and it is marked as a supposition for that reason.

What the synthesis means for the four prior studies

The two-layer structure changes how each of the four prior findings should be read. The second study’s multimodal completeness distribution, its large groups of listings at the empty and the near-complete extremes, is now explained: the empty mode is the earlier layer and the full mode is the later layer, and the multimodality was the two-layer structure showing through a measure that did not yet have the age dimension to name it. The third study’s finding that only 34.3% of listings carry a location is likewise re-read: three-quarters of the located listings belong to the later layer, so the geographic-coverage gap and the earlier layer are very nearly the same set of records, and improving one would be improving the other.

The fourth study’s front-loaded growth curve gains a second meaning. It is not only a fact about when listings were added but a fact about their quality, because the era that added the most listings added the least complete ones; growth and quality, in this corpus, ran in opposite directions. The first study’s concentration finding stands unaltered as a structural fact, but its interaction with completeness is now clarified: the result that crowded categories hold more complete listings is, in the light of the synthesis, substantially an age effect, since the later layer’s complete listings are also the listings that have most populated the directory’s larger categories. The crowding-completeness association is real, but age lies behind much of it, and the synthesis is what makes that visible.

Re-reading the four studies through the synthesis is not a matter of correcting them; each was accurate in what it measured. It is a matter of supplying the dimension each lacked. The second study saw a multimodal distribution but could not name its cause; the third saw a coverage gap but could not say which listings it fell on; the fourth saw an age skew but could not see that age governed quality. The synthesis does not overturn these findings, it explains them, and a finding that is explained is more useful than one that is merely recorded.

The state of the directory’s listings in 2026

The state of this directory’s listings in 2026 can now be stated as one finding rather than four. The directory holds a substantial and genuinely useful later layer, roughly 5,300 listings that are complete enough and located enough to do the work a directory listing exists to do, resting on a much larger earlier layer of roughly 9,100 listings that are, for the most part, skeletal. Figure 5 draws out the practical consequence of this structure.

Figure 5. The single lever. Because the earlier-layer listings fall short on completeness, geographic coverage, and currency together, one action directed at them, completing and locating them, would raise all three measures at once. The four prior studies’ separate agendas are, in practice, one.

The practical force of the synthesis is in that figure. The earlier-layer listings are the same listings on every count: they are the incomplete listings, the unlocated listings, and the oldest and least current listings. A single programme of work directed at them would therefore move all three measures together. It follows that the directory’s central maintenance question is not four separate questions but one: what to do about the earlier layer. The four prior studies, each of which closed with its own implications for the directory, were in fact describing one task from four directions.

What the two-layer structure does not explain

A synthesis is more trustworthy when it states its own limits, and the two-layer structure, powerful as it is, does not absorb all four prior findings equally. It unifies three of them. The incompleteness of the second study, the partial geographic coverage of the third, and the age skew of the fourth are, as the results have shown, three faces of the single division between the layers.

The first study’s finding is not absorbed in the same way. Category concentration, the uneven distribution of listings across categories, with a Gini of 0.522, is a structural property that holds within each layer, not a difference between the layers; a directory could have two completeness layers and an even category distribution, or one completeness layer and a concentrated one. The synthesis relates concentration to completeness through the age-confounded crowding association, but it does not show concentration to be a face of the two-layer structure, and it should not be read as claiming so.

The anglophone weighting of the third study is a second finding the layers do not explain. That the located listings are 88.3% anglophone is a property the earlier and later layers share; it is a fact about the directory’s linguistic reach, and it would survive the closing of the gap between the layers. The two-layer structure, in short, is the right account of how completeness, coverage, and currency co-vary, and it is not a theory of everything the series has measured. Marking that boundary is part of reporting the synthesis honestly.
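For readers unfamiliar with the concentration measure cited above, the sketch below computes a Gini coefficient from a list of per-category listing counts. It is an illustration under stated assumptions: the first study's exact estimator is not restated here, the formula used is the standard one for sorted non-negative values, and the counts in `example_counts` are invented for the example and are not the directory's category data.

```python
def gini(counts):
    """Gini coefficient of non-negative counts (0 = even, near 1 = concentrated).

    Uses the standard sorted-values form:
        G = sum_i (2i - n - 1) * x_i / (n * sum(x)),  i = 1..n, x ascending.
    """
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one category with a positive count")
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(values, start=1))
    return weighted / (n * total)

# Hypothetical per-category counts, invented for illustration only.
example_counts = [5, 10, 20, 40, 125]

print(f"even distribution:  {gini([40, 40, 40, 40, 40]):.3f}")
print(f"example categories: {gini(example_counts):.3f}")
```

Applied separately to the category counts of each layer, a function of this kind would be one way to check that concentration holds within each layer rather than between them.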

Implications for businesses

For a business that holds a listing in this directory, the synthesis carries a message that depends on which layer the listing belongs to, and the probabilities are not even. Because the earlier layer holds 63.3% of all listings, a business whose listing was created more than a decade ago (as most were) should expect, as its working assumption, that the listing is a skeleton: a title, a category, perhaps a link, and little else.

The action that follows is the same one the companion studies recommended, and the synthesis explains why its payoff is large. Completing a listing that already carries most of its fields is a refinement; completing an earlier-layer listing that carries almost none of them is a transformation, taking the listing from a near-invisible stub to a fully formed entry that can be matched, located, and read. The sparser the starting point, the greater the gain from the same effort, and the earlier-layer listings are the sparsest starting point in the corpus. For a business already in the later layer, the message is the lighter one of maintenance: the listing is probably substantial already, and the task is to keep it current.

Implications for the directory

For the directory, the two-layer structure is not one finding among several but the central fact around which a maintenance strategy should be built. The earlier layer is 9,086 listings, nearly two-thirds of the corpus, that under-deliver on completeness, on geographic coverage, and on currency at the same time, because they are sparse and old together. A programme that worked through the earlier layer, completing and locating its listings, would by the logic of Figure 5 raise three of the four quality measures the series has tracked, and it would do so as one effort rather than three.

A second implication concerns how the directory describes itself. Any single headline figure, whether a mean completeness or a total listing count, collapses two layers that have little in common and conceals the structure a maintenance strategy would need to see. The directory would describe its own corpus more honestly, and more usefully, by reporting the two layers separately: their sizes, their completeness, their coverage, tracked over time. The two-layer figure is not a more pessimistic account than the single average; it is a more accurate one, and a more actionable one, because it names the part of the corpus on which work would tell.

There is a sequencing point implicit in this. If the directory were to act on the synthesis, the order of work is given by the structure itself: the earlier layer, being both the largest part of the corpus and the part weakest on every measure, is where the first and largest gains lie. A maintenance programme need not treat all 14,362 listings as an undifferentiated mass; the two-layer analysis hands it a prioritised target, and a target of roughly 9,100 listings, while large, is a finite and well-defined body of work.

The corpus, the directory, and the honest number

A question runs underneath the whole synthesis: what single number, if any, honestly represents this directory. The two candidates in ordinary use both fail. The nominal listing count of 14,362 fails because it counts a skeletal earlier-layer stub and a complete later-layer entry as one unit each, when they are not comparable units. The mean completeness of 3.67 fails for the reason already given, that it describes the condition of almost no listing in the corpus.

The honest representation is not a single number but a pair: roughly 5,300 substantially complete listings and roughly 9,100 skeletal ones. A directory’s nominal size and its usable size are different quantities, and the synthesis is precisely the analysis that separates them. The directory is entitled to report its nominal size, as any directory does, but an account that stopped there would conceal the structure this series has spent five studies uncovering.

Stating the pair rather than the single figure is uncomfortable, because it concedes that most of the corpus is not, at present, doing the work a listing exists to do. It is, nonetheless, the accurate account, and it is also the useful one: a directory that knows it holds two layers knows where its corpus is strong, where it is weak, and where the return on maintenance would be greatest. The honest number and the actionable number turn out to be the same number, and it has two parts.

Implications for how directories are studied

The synthesis carries, finally, a methodological lesson that reaches beyond this directory. The four prior studies each measured one property of the corpus carefully and in isolation, and each was correct in what it reported; but each, on its own, missed that the properties it measured were not independent of the others. A directory examined one dimension at a time presents as a set of separate, moderate shortcomings; the same directory examined jointly presents as a single structural fact with several visible faces.

It can be concluded that a study of the quality of any substantial corpus should cross-tabulate its dimensions rather than report them side by side. The cross-tabulation was not, in this series, an optional epilogue to four self-sufficient studies; it was the step at which the actual structure of the corpus became visible, and without it the four findings would have remained four problems rather than one. The general point is that the dimensions of data quality (completeness, currency, and coverage) are liable to fail together on the same records, and that a method which never crosses them cannot see that they do.
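What crossing the dimensions involves can be sketched in a few lines. The example below is a minimal illustration, not the study's pipeline: the record fields `year_added`, `filled_fields`, and `has_location` are hypothetical names, the four sample records are invented, and only the era boundaries are taken from Table 3.

```python
from collections import defaultdict

# Era boundaries from Table 3: (era, first year, last year).
ERA_BOUNDS = [
    ("Formative", 2009, 2012),
    ("Intensive", 2013, 2014),
    ("Steady",    2015, 2019),
    ("Recent",    2020, 2026),
]

def era_of(year):
    """Map a year of entry to its era name."""
    for name, first, last in ERA_BOUNDS:
        if first <= year <= last:
            return name
    raise ValueError(f"year {year} is outside the corpus range")

def cross_tabulate(records):
    """Per-era count, mean completeness, and share carrying a location."""
    groups = defaultdict(list)
    for rec in records:
        groups[era_of(rec["year_added"])].append(rec)
    table = {}
    for era, recs in groups.items():
        n = len(recs)
        table[era] = {
            "listings": n,
            "mean_completeness": sum(r["filled_fields"] for r in recs) / n,
            "share_located": sum(r["has_location"] for r in recs) / n,
        }
    return table

# Invented sample records, for illustration only.
sample = [
    {"year_added": 2013, "filled_fields": 1, "has_location": False},
    {"year_added": 2014, "filled_fields": 2, "has_location": False},
    {"year_added": 2018, "filled_fields": 7, "has_location": True},
    {"year_added": 2022, "filled_fields": 8, "has_location": True},
]

for era, row in cross_tabulate(sample).items():
    print(era, row)
```

The design point is that every measure is grouped by the same key, so a record that is sparse, unlocated, and old is counted once in one row rather than three times in three separate reports.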

This lesson has a constructive side as well as a cautionary one. A study that does cross its dimensions gains something the separate studies could not offer: a single, prioritised account of where a corpus most needs work. The four prior studies each produced a recommendation; the synthesis produces one recommendation that subsumes three of them. For a directory, or for anyone maintaining a large corpus, that consolidation is not merely tidier; it is the difference between dividing effort across several fronts and concentrating it where it counts.

Projections and future developments

This study, like the four it synthesises, is a snapshot of one corpus at one reference date, and it is designed to be repeated. The projections below are reasoned conjectures drawn from the observed structure and the mechanisms discussed; they are not statistical forecasts, and they are marked as conjectures.

The first projection concerns the persistence of the two layers. It can be projected that, absent a deliberate programme directed at the earlier layer, the two-layer structure will not only persist but sharpen: the earlier layer will age further while remaining sparse, the later layer will grow slowly and stay relatively complete, and the gap between them will widen. The structure is stable because nothing in the ordinary operation of the directory acts to close it; an earlier-layer listing left alone stays an earlier-layer listing.

The second projection concerns what a remedial programme would look like in the data. Were the directory to undertake the work Figure 5 describes, its effect would be measurable as a convergence: the mean completeness of the earlier layer would rise toward that of the later layer, and the share of the corpus carrying a location would rise with it. The two-layer cross-tabulation of this study is, in that sense, both a diagnosis and a yardstick, and repeating it against later exports would show directly whether any such programme was succeeding.
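The yardstick can be made concrete. The sketch below takes layer summaries from two exports and reports the change in the gap between the layers; a shrinking gap would indicate convergence. The baseline figures are those of Table 4. The follow-up figures are invented purely to show the calculation and are not a forecast, and the function name `layer_gap` is an assumption.

```python
def layer_gap(summary):
    """Gap between later and earlier layers on the two tracked measures."""
    earlier, later = summary["earlier"], summary["later"]
    return {
        "mean_completeness": later["mean_completeness"] - earlier["mean_completeness"],
        "share_located": later["share_located"] - earlier["share_located"],
    }

# Baseline: Table 4 (export of 25 May 2026).
baseline = {
    "earlier": {"mean_completeness": 1.80, "share_located": 0.136},
    "later":   {"mean_completeness": 6.88, "share_located": 0.701},
}

# Hypothetical follow-up export; numbers invented for illustration only.
follow_up = {
    "earlier": {"mean_completeness": 2.60, "share_located": 0.250},
    "later":   {"mean_completeness": 6.90, "share_located": 0.705},
}

gap_before = layer_gap(baseline)
gap_after = layer_gap(follow_up)

for measure in gap_before:
    change = gap_after[measure] - gap_before[measure]
    direction = "converging" if change < 0 else "not converging"
    print(
        f"{measure}: gap {gap_before[measure]:.3f} -> "
        f"{gap_after[measure]:.3f} ({direction})"
    )
```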

The third projection concerns the changing value of the two layers. As discovery comes to be mediated by systems that read and synthesise structured data (Lewis et al., 2020; Aggarwal et al., 2024), the later layer, complete, located, and relatively current, is the part of the corpus such systems can actually use, while the sparse earlier layer offers them little and, where it is also stale, may mislead them. It may be projected that the directory’s effective size, measured as the count of listings genuinely useful for automated discovery, is closer to the 5,300 of its later layer than to its nominal 14,362. Two further lines of work would extend this study: a combined analysis that verified earlier-layer listings against the world, so the currency of the corpus could be measured rather than inferred; and the simple repetition of the whole five-study series against later exports, converting a set of snapshots into a record of change.

The value of repetition is worth one further word. A single snapshot, however carefully analysed, cannot distinguish a stable structure from one slowly changing; only a second measurement can. Were the five-study series repeated against an export taken a year or two later, the two-layer cross-tabulation would show at once whether the layers had begun to converge, whether the earlier layer had aged further untouched, and whether the directory’s recent practice had held. The series, in that sense, is best understood not as a final account but as the first establishment of a baseline.

Limitations of the study

The limitations follow from the design and are stated plainly. The study analyses a single corpus, the database of one directory, and is descriptive rather than inferential; the two-layer structure is established exactly for this corpus, and the study does not claim that a similar structure would be found in other directories, though the methodological lesson about cross-tabulation would carry across.

A related limitation concerns generality of a different kind. Because the directory has a particular history, an advertising-free, editorially curated history with a documented intensive period in 2013 and 2014, the specific two-layer structure found here is tied to that history. A directory built differently might show one layer, or three, or a genuine gradient. What this study establishes is that one real, substantial directory is built in two layers, and that the cross-tabulation method is what made the structure visible; it does not establish that two layers is the universal shape of a directory corpus.

The central methodological caution concerns the reading of the cross-tabulations. A cross-tabulation shows that two properties vary together; by itself it does not establish causation. The study has been explicit about this where it matters most: the association between category crowding and completeness is, as the discussion argued, substantially an age effect, and it is reported as an association with that confound named rather than as a causal claim. The age and completeness relationship, by contrast, is a direct cross-tabulation of two measured properties and is reported as such.

Several further limitations should be recorded. The completeness score counts whether a field holds a value, not whether the value is accurate or current; this limitation is inherited from the second study, and it means that an earlier-layer listing counted as sparse is certainly sparse, but a later-layer listing counted as complete is complete only in the sense of field presence. The 2014 to 2015 boundary used to define the two layers is a cut, although the era cross-tabulation shows that the underlying change is genuinely a step at that point rather than a gradient, which is what justifies the cut. The synthesis recovers the four prior measures rather than re-deriving every detail of the four studies, and it relies on those studies for the fuller treatment of each measure. And the entire analysis reflects the single reference date of 25 May 2026.

A last limitation concerns completeness as a measure of value. A listing counted as complete has its ten fields populated, but population is not the same as usefulness: a later-layer listing with a thin, generic description is complete by the score yet weak in substance, and the score cannot register the difference. The synthesis therefore measures the structural condition of the corpus, that is, which listings carry their fields and which do not, and not the editorial quality of what those fields contain. The reader should keep that distinction in mind wherever the word complete is used.
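The distinction between field presence and field quality is easiest to see in a scoring rule. The sketch below is an assumption-laden illustration: the ten field names are hypothetical (the second study defines the actual fields), and the rule shown, counting a field as filled when it holds a non-blank value, is a plausible reading of the score rather than its documented implementation.

```python
# Ten hypothetical field names; the actual fields are defined in the second study.
FIELDS = [
    "title", "category", "url", "description", "address",
    "city", "country", "phone", "email", "opening_hours",
]

def completeness(listing):
    """Count fields that hold a non-blank value (presence only, not quality)."""
    score = 0
    for field in FIELDS:
        value = listing.get(field)
        if value is not None and str(value).strip():
            score += 1
    return score

# Two invented listings: the description counts once in each, whatever it says.
thin = {"title": "Acme", "category": "Retail", "description": "A business."}
rich = {
    "title": "Acme",
    "category": "Retail",
    "description": "Family-run hardware shop; key cutting and tool hire.",
}

print(completeness(thin), completeness(rich))  # both score 3
```

A presence-based rule of this kind gives the two listings the same score, which is exactly the difference in substance that the score cannot register.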

Concluding remarks

This study set out to draw together a connected series of four analyses of a single curated business directory, and to do so not by summarising them but by establishing how their findings relate. The means was a cross-tabulation of the four measured properties (completeness, age, geographic coverage, and category crowding) across all 14,362 listings.

The central finding is that the corpus is two corpora layered together. An earlier layer, added between 2009 and 2014, holds 63.3% of the listings and is predominantly sparse: a mean completeness of 1.80 filled fields of ten, and a location on only 13.6% of its listings. A later layer, added from 2015 onward, holds the other 36.7% and is predominantly complete: a mean completeness of 6.88, and a location on 70.1% of its listings. Three of the four findings of the prior studies, the multimodal completeness, the partial geographic coverage, and the front-loaded growth, are not independent problems but three views of this one structure; the fourth, the uneven concentration across categories, holds within each layer and is tied to the structure only through the age-confounded association between crowding and completeness. The single average completeness of 3.67 fields, accurate as an arithmetic mean, describes the condition of almost no listing in the corpus; it is the midpoint of the empty gap between the two layers.

What follows is, for each audience, a single clear statement. For the directory, the maintenance agenda is not four tasks but one: the earlier layer is the same set of listings on every measure, and completing and locating it would raise completeness, coverage, and currency together. For a business, the message is to expect an older listing to be a skeleton and to treat completing it as the transformation it is. And for the study of directories, the lesson is that the dimensions of corpus quality must be cross-tabulated, because measured in isolation they conceal exactly the structure that matters. This study is the fifth and final in the connected series; together, the five provide a complete descriptive account of the state of one curated directory’s listings in 2026.

A final reflection concerns the standing of the series as a whole. Five studies of a directory’s own database, conducted and published by that directory, are at once authored by their subject and about their subject, and the synthesis has reported a structure that is, in plain terms, unflattering: most of the corpus is sparse. Reporting it has been the consistent obligation of the descriptive design adopted throughout, and the methodology of every study, this one included, has been set out in enough detail that any reader with the same export could reproduce the figures and judge the interpretation. A directory willing to measure itself this plainly is, whatever the measurements show, in a position to act on them.

References

Aggarwal, P., et al. (2024). Improving search systems with large language models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (pp. 5–16). Association for Computing Machinery.

Akerlof, G. A. (1970). The market for “lemons”: Quality uncertainty and the market mechanism. The Quarterly Journal of Economics, 84(3), 488–500.

Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7), 107–117.

Broder, A. (2002). A taxonomy of web search. ACM SIGIR Forum, 36(2), 3–10.

Google. Google Search Essentials. Google Search Central documentation. [Industry guidance, not peer-reviewed.]

Jones, R., Zhang, W. V., Rey, B., Jhala, P., & Stipp, E. (2008). Geographic intention and modification in web search. International Journal of Geographical Information Science, 22(3), 229–246.

Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604–632.

Lewis, P., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems (Vol. 33, pp. 9459–9474).

Nelson, P. (1970). Information and consumer behavior. Journal of Political Economy, 78(2), 311–329.

Newman, M. E. J. (2005). Power laws, Pareto distributions and Zipf’s law. Contemporary Physics, 46(5), 323–351.

Pipino, L. L., Lee, Y. W., & Wang, R. Y. (2002). Data quality assessment. Communications of the ACM, 45(4), 211–218.

Pirolli, P., & Card, S. K. (1999). Information foraging. Psychological Review, 106(4), 643–675.

Spence, M. (1973). Job market signaling. The Quarterly Journal of Economics, 87(3), 355–374.

Stigler, G. J. (1961). The economics of information. Journal of Political Economy, 69(3), 213–225.

Wang, R. Y., & Strong, D. M. (1996). Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 12(4), 5–33.