Category concentration in a curated business directory: which industries compete hardest for online visibility

Author. Gombos Atila Robert, PhD. Owner and Chief Executive Officer, Jasmine Business Directory (D-U-N-S 10-276-4189), Valley Cottage, New York. ORCID: 0000-0001-6468-2811. Correspondence through the author profile.

Data statement. The material analysed in this study was taken directly from the production database of the Jasmine Business Directory. The directory has operated since 2009, has never used paid advertising to acquire listings, and adds about ninety per cent of its entries through manual editorial review. The analysed export is current as of 25 May 2026.

Abstract

This study looks at how business listings spread across the category structure of a curated web directory, and it uses that spread to identify which industries compete most intensively for online visibility. The data are the complete listings table of the Jasmine Business Directory, which has operated since 2009, has grown without paid advertising, and adds most of its entries through manual editorial review. The analysed export, current as of 25 May 2026, contains 14,362 listings assigned across 761 categories. Listing counts per category are treated as a proxy, an openly partial one, for how hard businesses in an industry compete for visibility.

The analysis finds a pronounced concentration. The hundred largest categories hold 50.5% of all listings; the Gini coefficient of category sizes is 0.522; and one category, Law, holds 4.4% of the entire corpus, more than three times the share of the next category. Against this concentrated head sits a long tail in which the median populated category holds sixteen listings and 169 categories stand empty.

The distribution follows the heavy-tailed pattern documented across many organically assembled collections (Newman, 2005). The study reads these findings through the economics of search (Stigler, 1961) and the work on limited attention (Miller, 1968; Pirolli & Card, 1999), argues that a crowded category sets a structurally harder visibility task for each listing within it, and offers reasoned projections for how the distribution may change. Because the corpus has accumulated without advertising, the concentration is read as a sign of genuine listing behaviour rather than of promotional spending.

Keywords. business directories; category concentration; online visibility; long-tail distribution; power-law distribution; information economics; listing density; web directory; market structure; digital discovery; curated directory; search behaviour.

Introduction

Not all business categories are equally crowded. A person setting up a law firm enters a field where a great many other firms also seek to be found; a person setting up a more specialised enterprise may compete for visibility against comparatively few. This unevenness in how businesses cluster across industries is intuitively familiar, yet it is seldom measured directly, because measuring it needs a structured and reasonably complete body of businesses sorted by category. A business directory is precisely such a body, and this study uses one to make the unevenness explicit.

The question can be put plainly: how do the listings in a curated business directory spread across the directory’s category structure, and what does that spread reveal about which industries compete hardest for online visibility? The dependent quantity throughout is the number of listings assigned to a category, here called its listing density; the analysis treats listing density as a proxy, partial and openly acknowledged as such, for how hard businesses in an industry contend to be found.

The approach is descriptive and quantitative. The complete listings table of the directory was extracted, cleaned, and analysed; listings were counted by category; and the resulting distribution was described with standard measures of concentration, including cumulative shares, the ratio of mean to median, the Gini coefficient, and a partition of categories into size bands. No survey, no experiment, and no external benchmark are involved. This is an exploratory empirical analysis of a single corpus, and its claims are set to that scope.

The data, and the setting they come from, deserve attention at the outset. The corpus is the production database of the Jasmine Business Directory, founded in 2009 and headquartered in Valley Cottage, New York. Two features of the directory bear directly on how the findings should be read.

First, the directory has never used paid advertising to acquire its listings; its corpus has accumulated organically. Second, the directory adds about ninety per cent of its entries through manual editorial review rather than unmediated self-submission. The export analysed here was taken on 25 May 2026 and contains 14,362 listings across 761 categories.

The question matters for both practical and scholarly reasons. For the business owner deciding how and where to seek visibility, the crowding of a category is a material fact: it sets the difficulty of the task and conditions the value of every effort to stand out. For the directory operator, the spread of listings across categories describes the asset being managed and points to where structural attention is most needed. For the wider study of digital discovery, a curated and advertising-free corpus gives an uncommonly clean view of how businesses spread across industries when that spread is not shaped by promotional spending. The concentration this study documents can therefore be read with a confidence that a corpus assembled through paid placement would not allow.

The time span of the corpus is itself a methodological asset. The directory has operated continuously since 2009, so the analysed export captures listing decisions accumulated over roughly seventeen years. A distribution measured across so long a period is less sensitive to short-term swings than one captured over months, and it reflects a settled pattern rather than a transient one. This longevity, together with the absence of paid acquisition, is part of why the study treats its central finding as a structural feature of the corpus.

The study makes three contributions. It gives, first, a precise quantitative description of category concentration in a substantial real-world directory. It offers, second, a reading of that concentration through the economics of search and the work on limited attention, connecting an observed data pattern to an established mechanism. It sets out, third, a set of reasoned projections about how the distribution may change, and what that change implies for businesses and for directories as digital discovery is increasingly handled by automated systems. The rest of the paper follows the usual sequence: a review of the relevant literature; a description of the dataset; an account of the methodology; the results; a discussion; the projections; the limitations; and the concluding remarks.

Literature review

The study draws on three bodies of work: the economics of information and search, the empirical study of heavy-tailed distributions, and the treatment of attention as a finite resource. Each is summarised here in turn, and the section closes by stating the gap this study addresses.

Directories and the economics of search

The foundational observation that information is costly to obtain, and that this cost shapes market behaviour, belongs to Stigler (1961). Stigler argued that a buyer does not survey every seller before transacting, because each further enquiry carries a cost; the buyer searches only until the expected gain from more search no longer justifies that cost. In this framework a business directory is an institution that lowers the cost of search: it assembles, classifies, and presents sellers so a buyer can locate candidates without canvassing the market unaided.

This framing has a direct consequence for how a directory listing should be valued. If a directory exists to lower the cost of search, then a listing within it is, from the searching party’s point of view, a reduction in the effort needed to find a suitable provider. From the listed business’s point of view it is the reverse: a position within an institution that searchers consult, and therefore a channel through which the business can be found without itself paying the cost of reaching each prospective customer. The category structure organises that channel, and the crowding of a category determines how much of the channel’s attention any one listing can expect.

Later work refined the conditions under which search operates. Nelson (1970, 1974) distinguished search goods, whose qualities a buyer can assess before purchase, from experience goods, whose qualities become apparent only afterwards, and argued that the two are discovered through different informational channels. Akerlof (1970) showed that where sellers know more than buyers about quality, markets can deteriorate unless some institution restores the informational balance; Spence (1973) and Darby and Karni (1973) extended the analysis to signalling and to credence qualities that buyers cannot verify even after purchase. A directory listing takes part in all of these dynamics: at minimum it is an instrument through which a seller becomes findable, and at most it is a structured claim that a buyer reads as partial evidence of the seller’s existence and seriousness.

The move of search to the web brought a parallel literature. Broder (2002) proposed a taxonomy of web search intent (navigational, informational, and transactional) that clarified what users seek when they query. The ranking systems that handle web discovery were formalised by Brin and Page (1998) and by Kleinberg (1999), whose link-analysis models established that visibility online is allocated, not given, and that the allocation follows the structure of the surrounding graph. Arasu, Cho, Garcia-Molina, Paepcke, and Raghavan (2001) surveyed the architecture of web search at the point when directories and engines coexisted as discovery mechanisms. More recently, the rise of retrieval-augmented generation (Lewis et al., 2020) and of large-scale systems that compose answers rather than return links (Aggarwal et al., 2024) has begun to shift discovery again, a shift whose implications for structured corpora such as directories are taken up in the discussion.

The web directory as an institution

The directory analysed here belongs to a particular institutional form whose history clarifies what the present corpus is. In the early web, before algorithmic search became dominant, the human-curated directory was a principal way of organising online resources: editors assigned sites to a browsable taxonomy of subjects, and users navigated that taxonomy rather than querying an index. Arasu, Cho, Garcia-Molina, Paepcke, and Raghavan (2001) described the period in which curated directories and algorithmic engines coexisted as parallel discovery mechanisms, each suited to a different kind of need.

The division of labour between the two is instructive. An algorithmic engine answers a query by ranking an index; a curated directory offers a stable, human-judged classification that a user can browse. Broder’s (2002) taxonomy of search intent implies that these serve different moments, the engine suiting a user who can name what they seek and the directory a user who wants to survey the providers within a category. The two are complements as much as competitors.

What sets a curated directory apart, and what makes its corpus a useful object of study, is the combination of editorial vetting, a stable taxonomy, and structured, machine-readable category data. A curated directory does not merely list businesses; it classifies them according to an editorial judgement, and that classification is what this study analyses. The directory examined here is of exactly this curated kind, and its category structure, the product of editorial decisions accumulated since 2009, is therefore not an incidental container for the data but a considered classification with evidentiary value in its own right.

Concentration and the long tail

That items in a large collection spread unequally, a few holding much and many holding little, is among the most thoroughly documented regularities in quantitative science. Newman (2005) reviewed the evidence for power-law and related heavy-tailed distributions across the sizes of cities, the frequencies of words, the magnitudes of earthquakes, the populations of biological taxa, and many other domains, and surveyed the generative mechanisms proposed to explain them. One recurring mechanism is preferential attachment, under which the probability that an item grows rises with its current size; processes of this kind produce, over time, exactly the steep head and long tail that heavy-tailed distributions show.
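The preferential-attachment mechanism described above can be illustrated with a small simulation. This is a minimal sketch and not a model of how the directory actually grew: the function name, the parameters, and the "+1" smoothing that lets empty categories receive a first listing are all assumptions introduced for illustration. A fixed-category process of this kind yields a right-skewed outcome, though not necessarily a strict power law.

```python
import random


def simulate_preferential_attachment(n_categories, n_listings, seed=0):
    """Assign listings one at a time (hypothetical toy process).

    The chance that a category receives the next listing is proportional
    to its current size plus one, so larger categories grow faster.
    """
    rng = random.Random(seed)
    sizes = [0] * n_categories
    for _ in range(n_listings):
        weights = [s + 1 for s in sizes]  # +1 lets empty categories grow
        chosen = rng.choices(range(n_categories), weights=weights, k=1)[0]
        sizes[chosen] += 1
    return sizes


# Toy run: 2,000 simulated listings across 50 simulated categories.
sizes = simulate_preferential_attachment(n_categories=50, n_listings=2000)
ranked = sorted(sizes, reverse=True)
top_share = sum(ranked[:5]) / sum(ranked)
print(f"top 5 of 50 categories hold {top_share:.1%} of simulated listings")
```

Under an even spread the top five of fifty categories would hold ten per cent; the simulated share can be compared against that benchmark.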

The concentration of activity in digital markets specifically has been a sustained subject of study. The economics of information goods, and the observation that low distribution costs let demand extend far down a long tail of niche items, reshaped the understanding of digital commerce in the first decade of this century. The point that matters here is methodological as much as substantive: when a corpus is assembled by many independent decisions over a long period, a heavy-tailed distribution across its categories is the expected outcome, not an anomaly, and the analytical task is to describe the particular form the distribution takes rather than to be surprised by its existence.

The measurement of concentration has a standard apparatus, and this study uses its established parts. The Lorenz curve plots the cumulative share of a quantity against the cumulative share of the units holding it, and the Gini coefficient sums the curve up as the ratio of the area between it and the line of equality to the total area beneath that line. Both were developed for the study of income and wealth but apply without change to any partition of a total among units, including the partition of listings among categories. Their use here borrows a well-understood vocabulary rather than inventing a bespoke one.
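The construction of the Lorenz curve can be sketched in a few lines. The sketch below is illustrative only: the function name and the toy sizes are assumptions, and it follows the conventional ordering from smallest to largest unit, so the curve lies below the line of equality (ranking from largest, as the concentration curve in this study does, mirrors it above the diagonal).

```python
def lorenz_points(sizes):
    """Return (cumulative unit share, cumulative quantity share) pairs.

    Units are ordered from smallest to largest and the curve starts at
    the origin. `sizes` is any list of non-negative counts with a
    positive total.
    """
    ordered = sorted(sizes)
    total = sum(ordered)
    n = len(ordered)
    points = [(0.0, 0.0)]
    running = 0
    for i, size in enumerate(ordered, start=1):
        running += size
        points.append((i / n, running / total))
    return points


# Toy example: four hypothetical categories holding 1, 2, 3 and 4 listings.
points = lorenz_points([4, 1, 3, 2])
for unit_share, quantity_share in points:
    print(f"{unit_share:.2f} of units hold {quantity_share:.2f} of listings")
```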

Attention as a finite resource

The third body of work concerns the limits of human attention. Miller (1968) and the broader tradition in cognitive psychology established that a person can hold and compare only a small number of items at once; the capacity is narrow and does not expand to match the size of the field presented. Pirolli and Card (1999) carried this insight into the study of information-seeking with their theory of information foraging, which models a searcher as an organism allocating limited effort across an information environment, following cues of expected value and abandoning a patch when its yield falls below the cost of continuing.

The implication for a directory category is direct. A category, however many listings it holds, receives a bounded quantity of any searcher’s attention; the searcher examines a small number of entries and stops. It follows that the share of attention available to any one listing falls as the category grows. This dividing of a fixed attention budget across a variable number of listings is the conceptual bridge between the concentration this study measures and the difficulty of being found, and it is developed formally in the discussion.
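The dividing of a fixed attention budget can be made concrete with a simple calculation. This is a sketch under a strong simplifying assumption, namely that a searcher examines a fixed number of listings chosen uniformly from the category; the budget of ten is hypothetical, and the function is not the formal treatment given in the discussion. The category sizes of 16 and 637 are the median and the largest category reported in this study.

```python
def attention_share(examined, category_size):
    """Chance that any one listing is among those a searcher examines.

    Assumes (hypothetically) that `examined` listings are drawn uniformly
    from a category holding `category_size` listings.
    """
    if category_size <= 0:
        raise ValueError("category_size must be positive")
    return min(1.0, examined / category_size)


K = 10  # hypothetical number of listings a searcher examines before stopping
for name, size in [("median category", 16), ("Law", 637)]:
    print(f"{name} ({size} listings): {attention_share(K, size):.1%}")
```

Under these assumptions the same budget that covers most of a median-sized category reaches only a small fraction of the largest one.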

The kind of limit at issue is worth distinguishing. The constraint is not that a searcher is unwilling to look, but that looking has a cost and yields diminishing returns; beyond a small number of listings, each further one examined is less likely to improve on what has already been found. Pirolli and Card (1999) formalised exactly this trade-off. The practical consequence for a directory category is that the examined set is small not through inattention but through the rational economy of search itself.

The gap addressed by this study

The literatures summarised above are well developed, but they meet only partially at the point this study occupies. The economics of search explains why directories exist and what a listing does; the distribution literature explains why concentration should be expected; the attention literature explains why concentration matters for discovery. What is comparatively scarce is a precise, openly documented measurement of category concentration in a real, substantial, advertising-free business directory, read through those three lenses together. This study addresses that gap. It does not claim to generalise to all directories or to all industries; it claims to characterise one well-defined corpus rigorously, and to interpret it in a way that is informative beyond itself.

That phrase, informative beyond itself, deserves a precise meaning. A single-corpus study cannot establish what is true of directories in general, but it can establish what is true of one substantial, well-documented directory, and it can connect that truth to mechanisms (search costs, limited attention, preferential growth) that are themselves general. The generality of this study lies in its mechanisms and its method, not in a claim that its specific figures recur elsewhere; a reader may take the approach and the interpretive frame to another corpus while treating the numbers as belonging to this one.

The dataset: a curated, advertising-free directory

The material for this study is the production database of the Jasmine Business Directory. The directory was founded in 2009 by Pecsi Andras and Robert Gombos and is headquartered in Valley Cottage, New York; it has operated continuously since then as a general business directory organised by subject category. Two characteristics of its operation matter for reading the data and are stated here as part of the dataset’s provenance.

The first characteristic is the absence of paid advertising as a means of acquiring listings. Over its operating history the directory has not bought placement or used advertising spend to populate its corpus; entries have accumulated through organic submission and editorial addition. The second characteristic is the predominance of manual editorial work: the directory adds about ninety per cent of its entries through manual editorial review rather than unmediated automated self-submission, an editorial practice for which it received eight awards during 2013 and 2014. Together, these two characteristics mean the corpus analysed here reflects a mix of organic listing behaviour and editorial curation, and is not shaped by promotional spending. This is the basis on which, in the discussion, the observed concentration is read as a property of genuine listing behaviour.

The directory’s seventeen-year operating history bears on the reading of the corpus in a further way. A directory that has classified businesses continuously since 2009 has, over that period, revised and extended its taxonomy as new kinds of business have emerged; the 761 categories are therefore not a static design but the current state of an evolving classification. The spread of listings across them reflects both the businesses that have been listed and the taxonomy as it stood when each was listed, and the analysis takes the classification as the directory presents it at the reference date.

The analysed export was taken from the production database on 25 May 2026 and consists of the directory’s complete set of business listings together with the associated category, address, contact, and content tables. The unit of analysis throughout this study is the individual listing record, of which there are 14,362. Each listing is assigned to exactly one category; a listing therefore appears in one place in the directory’s classification and is counted once. The classification itself has 761 distinct categories after the cleaning described in the methodology. Table 1 summarises the principal characteristics of the dataset.

**Table 1.** Principal characteristics of the analysed dataset.
Attribute	Value
Source	Jasmine Business Directory, production database
Directory founded	2009
Headquarters	Valley Cottage, New York
Listing-acquisition model	Organic submission and editorial addition; no paid advertising
Editorial curation	Approximately 90% of entries added through manual review
Export reference date	25 May 2026
Listings (universe of analysis)	14,362
Categories (total, after cleaning)	761
Categories populated (≥ 1 listing)	593
Categories empty (0 listings)	169
Category assignment per listing	Exactly one

Methodology

The methodology is deliberately straightforward, because the credibility of a descriptive corpus study rests on the transparency of its procedure rather than on analytical complexity. This section describes the extraction and the unit of analysis, the cleaning and exclusions applied, and the metrics used to characterise concentration. The framework that links the raw counts to the study’s interpretive claim is set out in Figure 1.

Figure 1. The analytical framework of the study. Listing counts per category are computed from the production database and read as a partial proxy for competitive intensity; the dashed elements record the explicit limits of that reading.

Data extraction and the unit of analysis

The dataset was obtained as a complete export of the directory’s production database on the reference date. The export has several linked tables, of which the listings table is primary for this study; the address, contact, and content tables are used in a companion study and are not central here. The unit of analysis is the listing record. Each of the 14,362 records carries a category assignment, and that assignment is single-valued: a listing belongs to one category and is therefore counted once and in one place.

This single-valued assignment is an important property for the analysis that follows. Because no listing is shared between categories, the listing counts per category partition the corpus exactly: the counts sum to 14,362, and each category’s share of the corpus is well defined. Concentration measures computed on such a partition need no correction for double counting, and the cumulative shares reported in the results are exact.
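The partition property can be checked mechanically. The following is a minimal sketch, assuming the listings are available as one category identifier per record; the identifiers and helper names are hypothetical and do not reflect the directory's actual schema.

```python
from collections import Counter


def category_counts(listing_categories):
    """Count listings per category, given one category id per listing."""
    return Counter(listing_categories)


def category_shares(counts):
    """Convert per-category counts into shares of the whole corpus."""
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


# Toy data: six listings, each carrying exactly one (made-up) category id.
listing_categories = [101, 101, 101, 205, 205, 330]
counts = category_counts(listing_categories)
shares = category_shares(counts)

# Because assignment is single-valued, the counts sum to the number of
# listings and the shares sum to one, with no double counting to correct.
print(sum(counts.values()) == len(listing_categories))
print(round(sum(shares.values()), 9))
```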

The single-valued assignment is a property of the directory’s data model, not an analytical simplification imposed for convenience. The directory files each business in one category by design. An alternative model, in which a business could appear under several headings, would produce a different and more complex counting problem; the model in place yields a clean partition, and the analysis takes advantage of that while acknowledging, in the limitations, what the single-valued model leaves out.

Category deduplication and exclusions

One cleaning step was required before analysis. The category table, as exported, contained a number of duplicate fragment rows that arose from the handling of multi-line text fields during export; left uncorrected, these fragments would have inflated the apparent number of categories. The table was therefore deduplicated by numeric identifier and by resolved category name, yielding 761 distinct, named categories. One category identifier referenced by listings did not resolve to a named category in the cleaned table; the listings carrying that identifier are kept in all counts and concentration figures but cannot be attributed to a named category, and their effect on the results is negligible.
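A deduplication of this kind can be sketched as follows. The exact rules applied to the export are not specified beyond "by numeric identifier and by resolved category name", so the details here are assumptions: rows whose identifier is not numeric or whose name is blank are treated as export fragments, and names are compared after collapsing whitespace and ignoring case. The sample rows are invented.

```python
def deduplicate_categories(rows):
    """Keep one row per numeric id and per normalised name (a sketch).

    `rows` is an iterable of (raw_id, raw_name) pairs as they might come
    out of a text export. Fragment rows are dropped.
    """
    seen_ids, seen_names, cleaned = set(), set(), []
    for raw_id, raw_name in rows:
        name = " ".join(str(raw_name).split())  # collapse stray whitespace
        if not str(raw_id).strip().isdigit() or not name:
            continue  # assumed to be a fragment of a multi-line field
        category_id = int(raw_id)
        key = name.casefold()
        if category_id in seen_ids or key in seen_names:
            continue  # duplicate by identifier or by resolved name
        seen_ids.add(category_id)
        seen_names.add(key)
        cleaned.append((category_id, name))
    return cleaned


rows = [
    ("12", "Law"),
    ("12", "Law"),            # exact duplicate
    ("practice areas", ""),   # fragment of a multi-line text field
    ("40", "Real  Estate"),
    ("41", "real estate"),    # same resolved name under another id
    ("57", "Software"),
]
cleaned = deduplicate_categories(rows)
print(cleaned)
```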

One exclusion was applied. The listings table carries rating and vote fields; on inspection these showed signs of default or seeded values rather than genuine accumulated data, with implausibly repeated figures across many records. They were therefore excluded from the study, and no claim in this paper rests on them. The exclusion is recorded here so the boundary of the analysed material is explicit: this study uses category assignments, not directory-internal rating data, as its evidentiary base.
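One simple way to screen a field for seeded or default values is to measure how much of it is taken up by a single repeated figure. The sketch below illustrates that idea only; the source does not describe the inspection procedure, and the function name, the sample ratings, and the notion of a cut-off are all hypothetical.

```python
from collections import Counter


def modal_share(values):
    """Return the most common value and the share of records carrying it."""
    if not values:
        raise ValueError("values must not be empty")
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)


# Invented ratings in which one figure repeats across most records.
ratings = [4.5, 4.5, 4.5, 4.5, 3.0, 4.5, 4.5, 5.0, 4.5, 4.5]
value, share = modal_share(ratings)
print(f"most common rating {value} appears in {share:.0%} of records")
```

A very high modal share is a warning sign rather than proof; whether a field is excluded remains a judgement made on inspection.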

Concentration metrics

Four families of measure are used to characterise how listings spread across categories. The first is the cumulative share: for a given rank threshold, the proportion of the 14,362 listings held by the largest categories up to that threshold. The second is the comparison of the mean and the median category size, whose divergence is a standard indicator of skew. The third is the Gini coefficient of category sizes, a single number between zero and one that sums up inequality, with zero for perfect equality and one for maximal concentration. The fourth is a partition of the populated categories into size bands, which exposes the full shape of the distribution rather than its summary statistics alone.
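Three of the four measures (cumulative share, mean against median, and size bands) can be sketched compactly. The helper names, the toy category sizes, and the band edges below are assumptions for illustration; the band boundaries used in the study are not restated here.

```python
from statistics import mean, median


def top_k_share(sizes, k):
    """Share of all listings held by the k largest categories."""
    ranked = sorted(sizes, reverse=True)
    return sum(ranked[:k]) / sum(ranked)


def size_bands(sizes, edges):
    """Count categories per band [edges[i], edges[i+1]); last band is open."""
    bands = {}
    for lo, hi in zip(edges, edges[1:] + [None]):
        label = f"{lo}+" if hi is None else f"{lo}-{hi - 1}"
        bands[label] = sum(
            1 for s in sizes if s >= lo and (hi is None or s < hi)
        )
    return bands


# Toy sizes for ten hypothetical populated categories.
sizes = [40, 12, 9, 5, 3, 3, 2, 1, 1, 1]
print(f"top-3 share: {top_k_share(sizes, 3):.1%}")
print(f"mean {mean(sizes):.1f} vs median {median(sizes)}")
print(size_bands(sizes, [1, 5, 25]))  # hypothetical band edges
```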

These measures are descriptive, not inferential. They characterise the corpus as it stands on the reference date; they do not estimate parameters of a larger population, and no significance test is applied, because there is no sampling and no population beyond the corpus itself. This is a deliberate feature of the design. The study reports what the corpus is, with precision, and reserves interpretation for the discussion, where the findings are read against the literature set out above.

The Gini coefficient reported in this study is computed directly from the category sizes. The 593 populated categories are ordered by size, and the coefficient is derived from the cumulative contribution of each category to the total in the standard manner; the resulting value of 0.522 is an exact property of the corpus rather than an estimate. Because the measure is computed on the complete set of populated categories and not on a sample, it carries no confidence interval, and none is reported.
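One common way to compute the coefficient from ordered sizes is the rank-weighted formula sketched below. This is an assumption about the "standard manner": several finite-sample variants of the Gini coefficient exist, the source does not say which was used, and a recomputation with a different variant may differ slightly. The function name and toy inputs are illustrative.

```python
def gini(sizes):
    """Gini coefficient of non-negative sizes (one common discrete form).

    With sizes sorted ascending and ranks i = 1..n:
        G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    """
    ordered = sorted(sizes)
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total == 0:
        raise ValueError("need at least one unit and a positive total")
    weighted = sum(rank * size for rank, size in enumerate(ordered, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


print(gini([1, 1, 1, 1]))   # equal sizes
print(gini([1, 2, 3, 4]))   # mild inequality
print(gini([0, 0, 0, 10]))  # everything in one unit
```

In this form the maximum for n units is (n - 1) / n rather than exactly one, which is why the last toy case does not reach one.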

The descriptive design and its rationale

A word on the design of the study is warranted before the results, because the design determines what the results can and cannot claim. This is a descriptive empirical analysis of a single corpus; in the established vocabulary of research it is a study of the data-paper or descriptive-corpus kind, in which a well-defined body of real data is characterised rigorously rather than used to test a hypothesis against a sampled population.

The choice is deliberate and is matched to the object. The corpus is not a sample drawn from some larger population of directories; it is the entire production database of one directory, and it is the object of interest in itself. Where there is no sampling, there is no sampling error to estimate and no inferential test to apply; the appropriate apparatus is exact description: counts, shares, and concentration measures computed on the complete corpus.

This design trades one thing for another, and the trade should be stated openly. It gives up the ability to generalise, by statistical inference, to directories or industries beyond the one studied. In exchange it delivers an exact, reproducible account of a real and substantial corpus: the counts reported here are deterministic, and any analyst applying the same procedure to the same export would obtain the same figures. For a study whose purpose is to describe the state of a particular directory with precision, and to interpret that state against established theory, the descriptive design is the right instrument for the task rather than a limitation reluctantly accepted.

Results

The results come in four parts: the overall concentration of listings across categories; the identity of the largest categories; the spread of category sizes; and the long tail together with the empty categories. Each part comes with the relevant figure or table, and every figure and table is referred to directly in the text.

Overall concentration of listings

The first finding is that listings are concentrated. They are not spread evenly across the 593 populated categories; a minority of categories hold a disproportionate share of the corpus, and a majority hold comparatively little. Table 2 reports the summary measures of this concentration, and Figure 2 presents the cumulative concentration curve.

**Table 2.** Summary measures of listing concentration across categories (universe: 14,362 listings; 593 populated categories).
Measure	Value
Share held by the 5 largest categories	9.3%
Share held by the 10 largest categories	14.1%
Share held by the 25 largest categories	22.9%
Share held by the 50 largest categories	34.2%
Share held by the 100 largest categories	50.5%
Share held by the largest 10% of populated categories (59 categories)	37.6%
Mean listings per populated category	24.2
Median listings per populated category	16
Largest category (listings)	637
Smallest populated category (listings)	1
Gini coefficient of category sizes	0.522

Figure 2. The cumulative share of all 14,362 listings held by categories ranked from largest. The curve bows well above the even-distribution diagonal; the 100 largest of the 593 populated categories hold 50.5% of the corpus.

The cumulative shares quantify the concentration directly. The five largest categories hold 9.3% of all listings; the ten largest hold 14.1%; the twenty-five largest hold 22.9%; the fifty largest hold 34.2%; and the hundred largest hold 50.5%. That last figure is the most economical statement of the pattern: the hundred largest categories, which are 16.9% of the 593 populated categories, account for just over half of the corpus. Put another way, the largest tenth of the populated categories holds 37.6% of all listings.

The summary statistics in Table 2 point the same way. The mean populated category holds 24.2 listings while the median holds only 16; the mean is about one and a half times the median, which is the usual signature of a right-skewed distribution in which a small number of large categories pull the average upward. The Gini coefficient of category sizes is 0.522, a value that puts the distribution well away from equality and within the range usually regarded as substantially concentrated.
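The relation between the mean and the median can be checked from the figures already reported. The short calculation below uses only the listing total, the number of populated categories, and the median from Table 2; the variable names are illustrative.

```python
LISTINGS = 14_362   # listings in the corpus
POPULATED = 593     # populated categories
MEDIAN = 16         # median listings per populated category (Table 2)

mean_size = LISTINGS / POPULATED
ratio = mean_size / MEDIAN
print(f"mean = {mean_size:.1f}, mean / median = {ratio:.2f}")
```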

Figure 2 shows the same finding visually. The concentration curve departs sharply from the diagonal that would describe a perfectly even distribution, rising steeply over the first hundred categories before flattening across the long tail. The geometry of the curve, a steep early ascent followed by a long, shallow approach to completion, is the geometry of a heavy-tailed distribution, and the discussion returns to its interpretation.

It is worth pausing on what the concentration does not imply. A concentrated distribution does not mean the smaller categories are unimportant, or that the directory would be improved by their removal; the long tail, as a later section argues, is a necessary feature of a classification fine enough to be useful. Concentration describes how listings are distributed, not which categories deserve to exist. The discussion keeps that distinction in view.

Concentration measured against an even distribution

The concentration is most readily grasped by comparison with what would hold if listings were spread evenly. Were the 14,362 listings divided equally among the 593 populated categories, each would hold about 24 listings, the mean reported in Table 2. The actual distribution departs from that even benchmark substantially: the median category holds 16 listings, well below the mean, while the largest holds 637.

Figure 2 expresses the same departure geometrically. The diagonal represents the even-distribution case, in which the largest categories would hold exactly their proportional share; the concentration curve, which bows well above that diagonal, shows the listings piling into the largest categories far faster than proportionality would dictate. The vertical gap between the curve and the diagonal at any point is the excess share held by the largest categories up to that point, and the area between the two lines is the geometric counterpart of the Gini coefficient.
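The vertical gap can be worked out from the cumulative shares in Table 2. The sketch below compares each reported share with the share the same number of categories would hold under proportionality; the function name is an assumption, and the inputs are the reported percentages rather than a recomputation from the raw data.

```python
POPULATED = 593  # populated categories

# Reported cumulative shares (per cent) for the k largest categories, Table 2.
reported_top_shares = {5: 9.3, 10: 14.1, 25: 22.9, 50: 34.2, 100: 50.5}


def excess_share(k, reported_pct, n_units=POPULATED):
    """Percentage-point gap between the reported and the proportional share."""
    proportional_pct = 100 * k / n_units
    return reported_pct - proportional_pct


excess = {k: excess_share(k, pct) for k, pct in reported_top_shares.items()}
for k, gap in excess.items():
    print(f"top {k:>3}: {gap:.1f} points above the proportional share")
```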

The Gini coefficient of 0.522 can be placed, with due caution, in interpretive context. A coefficient of zero would describe perfect equality, in which every category held an identical number of listings; a coefficient approaching one would describe a corpus in which almost all listings fell into a single category. A value of 0.522 sits just above the midpoint of that range and denotes concentration that is substantial without being extreme. The figure is a compact, single-number summary of a pattern that the cumulative shares and the size bands describe in fuller detail.

The largest categories

Having established that concentration exists, the analysis turns to where it sits. Figure 3 presents the fifteen largest categories by listing count, and Table 3 reports the twenty-five largest together with their individual and cumulative shares.

Figure 3. The fifteen largest categories by listing count. Law, at 637 listings, exceeds the next category more than threefold; the categories from rank two onward descend along a gradual slope.

One category dominates. Law holds 637 listings, which is 4.4% of the corpus and more than three times the count of the second-placed category. No other category comes close. From rank two onward the descent is gradual rather than abrupt: Cosmetic Surgery (195), Services (177), Home Improvement (175), Business & Finance (158), Real Estate (157), and Software (141) form a band of substantial categories, and the slope continues smoothly through the rest of the upper ranks without a second break.

The character of the largest categories is consistent. With the partial exception of the broad residual category Services, they are industries with many independent providers and a strong commercial reason to be found: legal practice, home and property services, finance, software, automotive, and general industry. The legal vertical is more prominent still than the single Law entry suggests, because Personal Injury Lawyers appears separately at rank eleven with 119 listings; the two legal categories together hold a larger combined share than any other industry grouping in the corpus. Table 3 sets out the twenty-five largest categories with cumulative shares.

**Table 3.** The 25 largest categories by listing count, with individual and cumulative shares of the 14,362-listing corpus.
Rank	Category	Listings	Share	Cumulative
1	Law	637	4.4%	4.4%
2	Cosmetic Surgery	195	1.4%	5.8%
3	Services	177	1.2%	7.0%
4	Home Improvement	175	1.2%	8.2%
5	Business & Finance	158	1.1%	9.3%
6	Real Estate	157	1.1%	10.4%
7	Software	141	1.0%	11.4%
8	Home & Garden	131	0.9%	12.3%
9	Automotive	127	0.9%	13.2%
10	Industry	120	0.8%	14.1%
11	Personal Injury Lawyers	119	0.8%	14.9%
12	SEO	101	0.7%	15.6%
13	Newspapers	92	0.6%	16.2%
14	Clothing	91	0.6%	16.9%
15	Political News	88	0.6%	17.5%
16	Alternative health	86	0.6%	18.1%
17	Cleaning	83	0.6%	18.6%
18	Home and Garden	78	0.5%	19.2%
19	Organizations	78	0.5%	19.7%
20	Television	77	0.5%	20.3%
21	Medicine	75	0.5%	20.8%
22	Internet Marketing	75	0.5%	21.3%
23	Education	74	0.5%	21.8%
24	Photography	74	0.5%	22.3%
25	Exercise & fitness	73	0.5%	22.9%
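
The shares in Table 3 can be reproduced from the listing counts alone. The Python sketch below is illustrative rather than the study's own code; the function name `shares` is an assumption. It computes each cumulative share from the running count of listings rather than by adding rounded percentages, which avoids accumulating rounding error.

```python
from itertools import accumulate

TOTAL_LISTINGS = 14_362  # corpus size reported in the study

# Listing counts for ranks 1 to 25, copied from Table 3
counts = [637, 195, 177, 175, 158, 157, 141, 131, 127, 120, 119, 101, 92,
          91, 88, 86, 83, 78, 78, 77, 75, 75, 74, 74, 73]


def shares(counts, total):
    """Return (individual %, cumulative %) per rank, rounded to one decimal."""
    rows = []
    for count, running in zip(counts, accumulate(counts)):
        rows.append((round(100 * count / total, 1),
                     round(100 * running / total, 1)))
    return rows


table = shares(counts, TOTAL_LISTINGS)
for rank, (share, cumulative) in enumerate(table, start=1):
    print(rank, counts[rank - 1], f"{share}%", f"{cumulative}%")
```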

Table 3 also exposes a small irregularity in the directory’s classification that honesty requires recording rather than silently correcting. Two distinct categories, Home & Garden at rank eight and Home and Garden at rank eighteen, carry near-identical names; they are separate entries in the directory’s structure. The analysis takes them as the directory’s data presents them, as two categories, and the irregularity is noted as a genuine property of a classification built and extended over more than a decade.

The gradual descent visible in Figure 3 carries a substantive message of its own. Apart from the single break at the top, where Law stands clear of the field, the largest categories form a smooth gradient rather than a small set of giants separated from the rest. Concentration in this corpus is therefore not a matter of a few dominant categories and a uniform remainder; it is a matter of a continuous slope, along which each category is somewhat larger than the next. The distribution is concentrated in the aggregate while being, in its detail, a steady progression.

The distribution of category sizes

The cumulative measures describe the head of the distribution; the partition of categories into size bands describes its full shape. Figure 4 shows the number of categories falling in each size band, and Table 4 reports, for each band, both the count of categories and the count of listings they contain.

Figure 4. The number of populated categories in each size band. Most categories are small to mid-sized; the modal band, eleven to twenty-five listings, contains 222 of the 593 populated categories, while only 12 categories hold more than one hundred listings.

**Table 4.** Category-size distribution: categories and listings by size band.
Size band (listings)	Categories	% of populated categories	Listings	% of corpus
1	25	4.2%	25	0.2%
2–5	110	18.5%	353	2.5%
6–10	54	9.1%	448	3.1%
11–25	222	37.4%	3,752	26.1%
26–50	118	19.9%	4,119	28.7%
51–100	52	8.8%	3,427	23.9%
100+	12	2.0%	2,238	15.6%

Figure 4 and Table 4 together yield a finding the cumulative measures alone would obscure: the distribution looks different depending on whether one counts categories or listings. Counted by category, the corpus is dominated by small and mid-sized categories. The modal band is eleven to twenty-five listings, which holds 222 of the 593 populated categories; a further 110 categories hold between two and five listings, and 25 hold a single listing. Only 12 categories, 2.0% of those populated, hold more than a hundred listings.

Counted by listings, the picture inverts. Those same 12 large categories, though they are only about a fiftieth of the populated categories, hold 15.6% of all listings. The bands above fifty listings, 64 categories in total, hold 39.4% of the corpus between them. The single-listing and two-to-five-listing bands, which together account for 135 categories, hold 2.6% of listings between them.

A reader therefore arrives at two true but opposite-sounding statements: most categories are small, and most listings sit in larger categories. Both follow from the same distribution, and holding them together is essential to reading it correctly.
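
The two ways of counting can be set side by side directly from the figures in Table 4. The Python sketch below is a hedged illustration: the band labels and the helper name `band_share` are assumptions introduced here, and only the category and listing counts come from the table.

```python
# (size band, categories, listings), copied from Table 4
bands = [
    ("1", 25, 25),
    ("2-5", 110, 353),
    ("6-10", 54, 448),
    ("11-25", 222, 3752),
    ("26-50", 118, 4119),
    ("51-100", 52, 3427),
    ("100+", 12, 2238),
]

total_categories = sum(c for _, c, _ in bands)  # populated categories
total_listings = sum(n for _, _, n in bands)    # listings in the corpus


def band_share(labels):
    """Percentage of categories and of listings held by the named bands."""
    cats = sum(c for label, c, _ in bands if label in labels)
    lists = sum(n for label, _, n in bands if label in labels)
    return 100 * cats / total_categories, 100 * lists / total_listings


small = band_share({"1", "2-5"})        # many categories, few listings
large = band_share({"51-100", "100+"})  # few categories, many listings
print(f"small bands: {small[0]:.1f}% of categories, {small[1]:.1f}% of listings")
print(f"large bands: {large[0]:.1f}% of categories, {large[1]:.1f}% of listings")
```

Computing each share from the underlying counts, rather than adding the rounded percentages printed in the table, keeps the combined figures free of rounding drift.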

The gap between the two ways of counting has a practical consequence for anyone using these results. A business owner asking how typical their situation is should be clear about which question they are posing. If the question is which kind of category is most common, the answer is the small-to-mid-sized category; if it is where a randomly chosen listing is most likely to sit, the answer is a mid-to-large category. The two questions have different answers because the distribution is concentrated, and a reader who conflates them will misjudge how crowded the typical listing’s surroundings are.

The long tail and the empty categories

The concentrated head of the distribution has a structural counterpart in its long tail. The smaller half of the populated categories, the roughly 296 categories below the median size, together hold under 16% of all listings. The tail is not a defect of the corpus; it is the expected lower portion of a heavy-tailed distribution, and a classification fine enough to be useful must contain specific categories into which only a few businesses will ever fall.

Beyond the populated tail lie the empty categories. Of the 761 categories in the directory’s classification, 169 hold no listings at all. An empty category is a place prepared in the classification for businesses not yet listed within it; its existence shows that the directory’s classification is, at the reference date, finer-grained than its corpus has filled. The empty categories are returned to in the discussion, where they are read as the inverse condition to crowding and as a small indicator of how a fixed classification meets an unevenly accumulating corpus.

The empty categories invite a measured reading rather than a hasty one. Their existence is sometimes taken as evidence that a classification is too large; it is read here, more cautiously, as evidence that the classification anticipates businesses not yet listed. Whether an empty category should be populated, kept against future need, or merged with a neighbour is an editorial question the data raises but does not settle, and the discussion returns to it as a matter of directory design rather than of data quality.

Discussion

The results establish that listings in the directory are concentrated across categories, identify where the concentration sits, and describe the full shape of the size distribution. The discussion now interprets these findings: it examines the proxy on which the analysis rests, considers why the distribution takes the shape it does, sets out the mechanism by which concentration affects discoverability, weighs the significance of the corpus having accumulated without advertising, and draws the implications for businesses and for directory design.

Listing density as a proxy for competition

The analysis treats listing density as a proxy for how hard businesses compete for visibility, and the reading of every finding depends on how much that proxy can bear. What density captures is genuine: a category holding many listings is one in which, within this corpus, many businesses are present at once and contending for the attention of anyone who browses that category. As a comparative measure, one category set against another inside the same directory, density is a sound indicator of relative crowding.

What density does not capture must be stated with equal clarity, as Figure 1 records. It does not measure the size of an industry in the wider economy, because industries differ in how heavily their businesses use directories at all. It does not measure search demand, since a crowded category is not necessarily one that many people seek. And it does not measure ranking difficulty in the technical sense studied by Brin and Page (1998) and Kleinberg (1999). The honest position is the one held throughout: density is a partial proxy, informative for comparison within this corpus, and the concentration it reveals is a real pattern that should nonetheless not be over-read as a direct census of competition in the economy.

Why the distribution is concentrated

The concentration documented in the results is not an anomaly of this particular directory. Newman (2005) reviewed the extensive evidence that heavy-tailed distributions, a steep head and a long, thin tail, arise across a wide range of collections assembled by many independent decisions over time, from the sizes of cities to the frequencies of words. A category structure filled by thousands of separate listing decisions over seventeen years belongs to this family, and a heavy-tailed outcome was, in that sense, to be expected.

A plausible generative mechanism is preferential attachment, under which the probability that a category receives the next listing rises with the category’s current size. A larger category is more visible to a submitting business, more established as a destination, and a more natural place for an editor to add a related resource; each of these tendencies makes growth self-reinforcing. It is reasonable to suppose that, had the corpus been assembled by a different but similarly decentralised process, its category distribution would probably still have been heavy-tailed, though the precise parameters, the Gini coefficient and the exact top-hundred share, would have differed. The contribution of this study therefore lies not in showing that concentration exists but in measuring the specific form it takes in this corpus: a Gini coefficient of 0.522 and a top-hundred share of 50.5%.

A further point about the concentration’s origin deserves attention. Because the directory adds most of its entries through editorial review, the self-reinforcing growth described above works not only through the decisions of submitting businesses but also through the decisions of editors. An editor extending the directory works within its existing structure, and a category already substantial is a natural place to add a related resource; the editorial process therefore tends, like the submission process, to reinforce existing concentration rather than counteract it. This is not a criticism of the editorial model but an observation about how concentration compounds within it.
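
The self-reinforcing growth described in this section can be illustrated with a toy simulation. The Python sketch below is a simplified urn model and not the study's method: the function name `simulate`, the rule that a category's weight is its current size plus one, and the parameter values are all assumptions chosen for illustration, and the output is not expected to reproduce the directory's measured figures.

```python
import random


def simulate(n_categories, n_listings, seed=0):
    """Add listings one at a time, favouring categories that are already large."""
    rng = random.Random(seed)
    sizes = [0] * n_categories
    for _ in range(n_listings):
        # Weight = current size + 1, so an empty category can still be chosen
        weights = [s + 1 for s in sizes]
        chosen = rng.choices(range(n_categories), weights=weights, k=1)[0]
        sizes[chosen] += 1
    return sorted(sizes, reverse=True)


# Hypothetical toy run: 50 categories, 1,000 listings
sizes = simulate(n_categories=50, n_listings=1000)
top_ten_share = sum(sizes[:10]) / sum(sizes)
print(f"largest: {sizes[0]}, smallest: {sizes[-1]}, top-ten share: {top_ten_share:.1%}")
```

An even split would give the ten largest of fifty categories a 20% share; a run of this model typically gives them a visibly larger one, which is the qualitative pattern the section describes.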

Divided attention in crowded categories

If concentration is the expected shape, the question that remains is why it matters for the businesses concerned. The answer runs through the limits of attention, and Figure 5 sets out the mechanism.

Figure 5. The divided-attention mechanism. A searcher’s examined set is small and does not expand with category size; consequently, any individual listing’s share of attention falls as the category grows. The percentages shown are illustrative rather than measured.

The relevant cognitive facts are well established. Miller (1968) and the later tradition in cognitive psychology showed that a person can hold and compare only a small number of items at once; the capacity is narrow and does not stretch to accommodate a larger field. Pirolli and Card (1999) carried the point into information-seeking with the theory of information foraging, modelling a searcher as allocating limited effort across an environment and abandoning a patch once its yield falls below the cost of continuing. Stigler (1961) had already supplied the economic version: search is costly, so a searcher does not examine every option in a large set.

The consequence for a directory category is the one Figure 5 depicts. A category receives a bounded quantity of any searcher’s attention, a small examined set, regardless of how many listings it holds. It follows arithmetically that any one listing’s share of that attention falls as the category grows: a listing competes for a fixed pool against eleven others in a thin category and against many hundreds in a crowded one. The crowded categories identified in the results, with Law foremost, are therefore the categories that impose the hardest structural visibility task on the businesses within them.
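
The arithmetic can be made concrete with a short Python sketch. It assumes, purely for simplicity, that every listing in a category is equally likely to fall within the examined set; the function name `chance_examined` and the examined-set size of five are hypothetical values chosen for illustration, not measurements from the study.

```python
def chance_examined(category_size, examined_set=5):
    """Chance that one listing is examined, if all listings are equally likely."""
    if category_size <= 0:
        raise ValueError("category_size must be positive")
    # The examined set is fixed, so the chance shrinks as the category grows
    return min(1.0, examined_set / category_size)


# A dozen-listing category, the mean size, and the two largest categories
for size in (12, 24, 195, 637):
    print(size, f"{chance_examined(size):.1%}")
```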

One refinement keeps the mechanism accurate. Attention is not divided uniformly across a category; the examined set is not a random sample but the listings presented first or most prominently. Crowding therefore does not merely dilute visibility evenly; it raises the stakes of whatever governs the ordering of listings within a category. In a thin category, ordering matters little because attention reaches most listings; in a crowded category, the determinants of ordering, among them the completeness and quality of a listing, become decisive.

The mechanism also explains why concentration and discoverability are not the same thing. A category may be concentrated, holding a large share of the corpus, and yet remain navigable if it is well sub-divided and well ordered; a smaller category that is poorly structured may be harder to use. Concentration sets the difficulty of the visibility problem within a category, but the directory’s structural choices determine how much of that difficulty a searcher actually feels. The two act together, and the discussion of directory design returns to the directory’s side of the relationship.

The legal vertical: a closer reading

The dominance of the legal vertical warrants a closer reading, because it is the single most pronounced feature of the distribution. Law holds 637 listings and Personal Injury Lawyers a further 119; together the two legal categories account for 756 listings, or 5.3% of the corpus, the largest industry grouping the data contains, and larger than the next two categories combined.

Several features of legal services as a market plausibly account for this prominence. Legal practice is fragmented into a large number of independent firms rather than concentrated in a few national providers; each firm serves a market that is, in practice, partly bounded by geography and specialism; and the value to a firm of acquiring a single client is high, which sharpens the incentive to be findable. A market with many independent providers, each with a strong commercial reason to seek visibility, is exactly the kind of market that accumulates many listings in a directory.

It can therefore be supposed, as a reasoned inference rather than a demonstrated fact, that the prominence of the legal vertical reflects a genuine structural feature of that market, its fragmentation into many independent, visibility-seeking providers, rather than a peculiarity of this directory’s history or classification. The same reasoning supplies an expectation for the companion and future studies: industries that share the legal market’s structure, being fragmented and competing hard for client acquisition, should also tend to occupy the crowded head of the distribution. The results are consistent with this expectation, with home and property services and finance appearing among the largest categories.

The significance of advertising-free, curated accumulation

The provenance of the corpus, set out in the dataset section, now does interpretive work. Because the directory has never used paid advertising to acquire listings, the concentration documented here cannot be an artefact of promotional spending. A category is large because businesses submitted listings to it and editors added resources to it, not because any party bought placement within it. In a corpus assembled partly through paid inclusion, category size would partly measure who had paid; that confound is absent here.

Because, in addition, about ninety per cent of entries are added through manual editorial review, the corpus reflects an editorial judgement of relevance rather than unmediated self-submission. It can therefore be concluded, with more confidence than a differently assembled corpus would permit, that the distribution observed is a reasonably faithful reflection of genuine listing behaviour, the combined outcome of organic submission and editorial curation over the period from 2009 to 2026.

One qualification belongs here, so as not to overstate the point. Editorial curation is itself a human process, and editorial attention, like any attention, is finite and tends to operate category by category. The word organic, as used here, should be read as free of paid distortion rather than as free of all human shaping. The concentration is not an artefact of advertising; it is, in part, an artefact of how businesses and editors together, without commercial placement, have populated a classification over many years.

Concentration and listing quality considered together

This study measures the crowding of categories; a companion study in the same series measures a different property of the same corpus, the completeness of individual listings. The two properties interact, and considering them together yields a sharper conclusion than either reaches alone.

The interaction follows from the divided-attention mechanism. In a thin category, attention reaches most listings, and an incomplete listing is therefore still likely to be seen; the penalty for incompleteness is modest. In a crowded category, attention reaches only the few listings that ordering and relevance bring to the front, and completeness is among the factors that govern which listings those are; the penalty for incompleteness is correspondingly severe. The cost of an incomplete listing is, in other words, not a constant; it rises with the crowding of the category the listing occupies.

The joint reading produces a more precise prescription than the crowding analysis alone. A business should assess two things together: how crowded its category is, and how complete its listing is. A complete listing in a thin category and an incomplete listing in a crowded category sit at opposite ends of a visibility range, and a business that knows where it falls on both dimensions knows how urgent the work of improvement is. The crowding measured here sets the stakes; the completeness measured by the companion study determines whether a business meets them.

Implications for businesses

For a business owner, the concentration documented here is not an abstract statistic but an actionable fact. The first implication is the value of knowing one’s category’s crowding. A business in Law, in Home Improvement, or in Business & Finance enters a crowded category, and the effort required to be found scales accordingly; a business in one of the several hundred small categories competes against far fewer. Realism about the task begins with knowing which situation applies.

The second implication is that crowding raises the value of every distinguishing factor. Because a crowded category divides attention finely and rewards whatever governs ordering, the completeness and accuracy of a listing, the soundness of the associated website, and genuine reputation all matter more in a crowded category than in a thin one. The third implication concerns category choice: where a business genuinely belongs in more than one category, the less crowded of the eligible categories may be the easier place to be found, provided it remains a category the relevant audience actually browses. This is a consideration to exercise within honesty, not against the truth of what the business is; a listing filed in a less crowded but inaccurate category is found by the wrong people, which serves no one.

One implication for businesses cuts across the others and deserves separate statement. None of the foregoing should be read as advice to compete less in a crowded category or to retreat to a thin one for its own sake. A crowded category is, in most cases, crowded because it is commercially valuable, and a thin category may be thin because few people seek it. The lesson is not avoidance but realism: a business should enter the category that genuinely fits it, understand the crowding it will face there, and invest accordingly in the factors that determine visibility within it.

Implications for directory design

The concentration is, finally, an agenda for the directory itself. A category of a dozen listings can be presented as a simple list that a visitor reads in full; a category of several hundred cannot, and presented as a flat list it leaves most of its listings effectively unseen. The crowded categories therefore place a real obligation on the directory’s structure, and the instruments available are familiar: meaningful sub-categorisation, a considered default ordering, and search within a category.

The directory does not control how many businesses list in a category, but it does control whether the listings in a crowded category remain discoverable; that is a design responsibility rather than an inevitability. The empty categories carry a smaller, related implication. A directory may periodically review whether an empty category should be populated, merged, or retired, and the near-duplicate pair Home & Garden and Home and Garden, identified in the results, is a natural candidate for consolidation. Structural maintenance of this kind is the directory-side counterpart to the business-side task of completing a listing.

A connected design question concerns the directory’s machine-readable structure. As the projections note, automated discovery systems increasingly read a directory’s category data rather than rendering it for a human browser. A classification that is internally consistent, free of near-duplicate categories, with stable identifiers and clear names, is more useful to such systems than one carrying the small irregularities this study has recorded. Structural maintenance therefore serves not only the human visitor but the automated reader, and the case for it strengthens as the second kind of reader grows more important.

Projections and future developments

This study is a snapshot of one corpus at one reference date, and it is designed to be repeated. The projections offered here are reasoned conjectures drawn from the observed pattern and the mechanisms discussed above; they are not statistical forecasts, and they are presented as such.

The first projection concerns the shape of the distribution. If preferential attachment operates as suggested, growth is self-reinforcing, and it can be projected that the heavy-tailed shape will persist and is likely to sharpen modestly over time: the categories that are already large will, other things being equal, continue to attract listings at an above-average rate, and the Gini coefficient may therefore drift upward from its current 0.522. The qualification is that deliberate editorial action, such as sub-categorising a very large category or actively developing thin ones, could counteract the drift, so the projection holds only absent such intervention.

The second projection concerns the empty categories. The count of 169 empty categories is a moving figure; it can be projected that some will fill as the corpus grows and others will be revised away through editorial consolidation, so a future export would show a different and probably lower number.

The third projection concerns the changing surface of discovery. As retrieval-augmented and answer-composing systems increasingly handle how businesses are found (Lewis et al., 2020; Aggarwal et al., 2024), the directory’s structured category data becomes an input that such systems read rather than a page that a person browses. A category’s crowding will continue to bear on a business’s visibility, but the contest will increasingly take place inside a composed answer rather than on a list. It may be conjectured that a structured, curated, advertising-free corpus is comparatively well placed to serve as such an input, precisely because its category assignments reflect editorial judgement rather than paid placement.

A concrete development follows from all three projections: the value of repetition. Running this analysis against future exports would turn the present snapshot into a longitudinal record of how the directory’s industries crowd and thin over time. The companion studies in this series, on listing completeness, on geographic distribution, and on growth over time, extend the picture along other dimensions, and a later synthesis will draw them together.

The projections above share a common methodological character that should be made explicit. Each is an extrapolation from an observed pattern and a stated mechanism, not a quantitative forecast with an associated error; the language of projection and conjecture is used deliberately to mark this. A reader should take them as reasoned expectations that the next export will either support or revise, and the value of repeating the study lies precisely in putting these expectations to that test.

Limitations of the study

The limitations follow from the design and are stated plainly. The study analyses a single corpus, one directory’s database, and is descriptive rather than inferential; it characterises that corpus and does not generalise to all directories or to all industries, and no significance testing is applied because there is no sampling beyond the corpus itself.

The central interpretive limitation is the proxy. Listing density is used as a proxy for competitive intensity, and, as Figure 1 and the discussion both record, it does not measure industry size in the economy, search demand, or technical ranking difficulty. The category structure analysed is the directory’s own, complete with the small irregularities, such as the near-duplicate Home & Garden categories, that any classification built and extended over many years accumulates.

A further limitation concerns the single-valued category assignment. Each listing is filed in exactly one category, which is the directory’s data model; a business whose activity genuinely spans several categories is nonetheless recorded once, in one place. This is a property of the directory rather than a distortion introduced by the analysis, but it shapes the counts and should be borne in mind. Finally, the rating and vote fields were excluded as unreliable, one referenced category identifier did not resolve to a named category, and the entire analysis reflects the single reference date of 25 May 2026; a different export date would yield somewhat different figures.

Concluding remarks

This study set out to measure how business listings spread across the categories of a curated directory, and to use that spread to identify which industries compete hardest for online visibility. The central finding is a pronounced concentration. The hundred largest categories hold just over half (50.5%) of the 14,362-listing corpus; the Gini coefficient of category sizes is 0.522; and one category, Law, holds 4.4% of all listings, more than three times the share of the next category. Against this concentrated head sits a long tail in which the median populated category holds sixteen listings and 169 categories stand empty.

The concentration is best understood as structural. It is the heavy-tailed shape that organically assembled collections characteristically take (Newman, 2005), here measured precisely in a corpus that has accumulated without paid advertising and largely through editorial curation. That provenance matters: it lets the distribution be read as a reasonably faithful reflection of genuine listing behaviour rather than as an artefact of promotional spending. The mechanism that gives the concentration its practical force is the division of a fixed quantity of attention among a variable number of listings: a crowded category is, for that reason, a structurally harder place in which to be found.

Two further conclusions follow from the manner of the corpus’s accumulation. Because the directory has grown since 2009 without paid advertising, the concentration is not bought but earned, listing by listing and editorial decision by editorial decision; and because the corpus is large and long-accumulated, the pattern is unlikely to be an artefact of any single period. The central finding is, in that sense, doubly robust: it rests on a substantial corpus and on a process free of promotional distortion.

The practical conclusions are correspondingly clear. For a business, the crowding of its category sets the difficulty of being found and raises the value of every factor that distinguishes a listing: completeness, accuracy, a maintained presence. For the directory, the concentration is an agenda for structure: sub-categorisation, ordering, and in-category search are the instruments by which the listings in a crowded category remain discoverable. The study is one of a connected series analysing the same corpus from several angles; a later synthesis will draw the threads together into a single account of the state of the directory’s listings.

A final reflection concerns the standing of a study of this kind. An analysis of a directory’s own database, conducted and published by the directory, occupies an unusual position: the directory is at once the author of the study and its object. That position is stated here openly, because transparency about it is what allows the work to be read as research rather than as promotion. The methodology, the exclusions, and the limitations have all been set out so that any reader may reproduce the figures and judge the interpretation independently; the study asks to be assessed on that basis.

References

Aggarwal, P., et al. (2024). Improving search systems with large language models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (pp. 5–16). Association for Computing Machinery.

Akerlof, G. A. (1970). The market for “lemons”: Quality uncertainty and the market mechanism. The Quarterly Journal of Economics, 84(3), 488–500.

Arasu, A., Cho, J., Garcia-Molina, H., Paepcke, A., & Raghavan, S. (2001). Searching the Web. ACM Transactions on Internet Technology, 1(1), 2–43.

Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7), 107–117.

Broder, A. (2002). A taxonomy of web search. ACM SIGIR Forum, 36(2), 3–10.

Darby, M. R., & Karni, E. (1973). Free competition and the optimal amount of fraud. The Journal of Law and Economics, 16(1), 67–88.

Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604–632.

Lewis, P., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems (Vol. 33, pp. 9459–9474).

Miller, R. B. (1968). Response time in man-computer conversational transactions. In Proceedings of the AFIPS Fall Joint Computer Conference (Vol. 33, pp. 267–277).

Nelson, P. (1970). Information and consumer behavior. Journal of Political Economy, 78(2), 311–329.

Nelson, P. (1974). Advertising as information. Journal of Political Economy, 82(4), 729–754.

Newman, M. E. J. (2005). Power laws, Pareto distributions and Zipf’s law. Contemporary Physics, 46(5), 323–351.

Pirolli, P., & Card, S. K. (1999). Information foraging. Psychological Review, 106(4), 643–675.

Spence, M. (1973). Job market signaling. The Quarterly Journal of Economics, 87(3), 355–374.

Stigler, G. J. (1961). The economics of information. Journal of Political Economy, 69(3), 213–225.