Eighty-eight percent of organisations believe building the organisation of the future matters; only eleven percent claim to understand how to do it. That gap, documented by Deloitte Insights (2017), is the cleanest possible summary of the conversation now happening inside marketing teams about AI search visibility. Everyone agrees it matters. Almost nobody can describe the mechanism by which their brand ends up cited inside a ChatGPT answer rather than ignored — and almost nobody can describe why the directory listing they paid for in 2019 still appears to influence what Perplexity says about them in 2024.
The case that follows is a composite. It draws on three engagements run between late 2023 and mid-2024 with mid-market SaaS vendors whose organic search traffic was holding steady but whose presence inside answer engines was thin, inconsistent, or wrong. Names have been changed; numbers are real. The methodology is reproducible, the conclusions are defensible, and the principles are transferable — provided the reader understands which constraints they are transferring across.
The Client Brief That Started This
SaaS Vendor Losing AI Citations
The client — referred to here as Meridian, a workflow automation platform with roughly £14 million in annual recurring revenue and a sales team of forty-two — came in through a referral. Their head of demand generation had noticed something strange. Branded search traffic was up year-over-year. Organic non-branded traffic was flat but stable. Paid acquisition costs were creeping up but not alarmingly. The conventional dashboards looked fine.
What did not look fine was the share-of-voice analysis their content team had begun running quarterly against ChatGPT, Perplexity, Gemini, and Claude. When prospects asked any of those engines for “best workflow automation tools for finance teams” or “alternatives to Zapier for regulated industries”, Meridian appeared in roughly nine percent of responses. Two of their direct competitors — both smaller by revenue, both with weaker backlink profiles according to Ahrefs — appeared in over forty percent.
The brief was deliberately narrow: figure out why, and propose a fix that did not require rebuilding the website or commissioning twelve new pieces of pillar content. Budget ceiling was £38,000 for ninety days, inclusive of any third-party listing fees.
Initial Visibility Audit Findings
The first week was spent triangulating. The audit covered four data sources: Meridian’s own analytics, a sample of two hundred AI engine queries logged by the content team over the previous quarter, a fresh round of one hundred and twenty queries run during the audit week itself, and a backlink and citation profile pulled from Ahrefs and a manual review of the top forty referring domains.
Three findings emerged quickly. First, Meridian’s website itself ranked well — top five for most commercial queries on Google. Second, when AI engines did cite Meridian, they cited the company’s own website roughly sixty percent of the time and a single Forbes contributor article about thirty percent of the time. The remaining ten percent was scattered across Reddit, a 2021 G2 category page, and one trade publication. Third, when competitors were cited, the citation diversity was dramatically higher: a typical competitor showed up via G2, Capterra, two or three vertical-specific listing sites, a Reddit thread, a Software Advice page, and at least one regional listing aggregator.
The asymmetry was not in domain authority. It was in citation surface area. Competitors had simply seeded more independent third-party mentions on the kinds of pages that AI training and retrieval pipelines appear to favour.
Hypothesis About Directory Listings
The working hypothesis, formed at the end of week one, was that AI engines were treating curated third-party listings — particularly category-defining software directories and vertical aggregators — as a kind of authority-validation layer. A brand mentioned by its own site is asserting something. A brand mentioned by a curated listing on a domain that itself ranks well is being endorsed. Engines, when asked to summarise a category, appeared to be weighting endorsement signals more heavily than self-assertion.
This was a hypothesis, not a finding. Confirming it required a structured test rather than another anecdotal scan. As Pew Research Center (2017) notes in its survey of expert opinion on algorithmic systems, the opacity of these models means observed patterns are often the only practical evidence available; the underlying weights are not retrievable. The methodology had to work with that constraint.
Scoping the Investigation
Scoping a project of this kind is mostly an exercise in saying no. The temptation is to test everything: every engine, every prompt variation, every directory category, every region. With £38,000 and ninety days, that was not feasible. The scope was tightened along four axes.
The first axis was engine selection. Four engines were chosen — ChatGPT (with browsing enabled), Perplexity, Gemini, and Claude — on the grounds that they collectively represented the bulk of B2B research traffic the client could plausibly capture, and that their retrieval architectures differed enough to make comparison interesting. Bing Copilot was excluded; its citations overlapped heavily with ChatGPT browsing in early testing, and the marginal information was judged not worth the extra logging burden.
The second axis was query intent. Three intent classes were defined: category-discovery queries (“what are the best tools for X”), comparison queries (“X vs Y”), and problem-led queries (“how do I automate Y in a regulated environment”). Each class produces different citation patterns, and conflating them produces noise.
The third axis was geography. Meridian sold primarily into the UK and DACH region, with a smaller US book. The query set was weighted accordingly: sixty percent UK English, twenty-five percent US English, fifteen percent German.
The fourth axis was the directory universe itself. A long list of one hundred and forty potential listing sources was pruned to forty-two on the basis of three filters: whether they had appeared at least once in the audit’s citation data, whether their own organic visibility on Google for relevant terms was non-trivial, and whether they accepted submissions of the kind Meridian could plausibly produce. Pure pay-to-play link farms were excluded at this stage.
Building the Test Methodology
The methodology had to do two things simultaneously: characterise the existing citation behaviour of each engine, and produce a baseline against which post-implementation changes could be measured. Those are not the same exercise, and the temptation to collapse them is the single most common methodological error in this kind of work.
The team built a query log structured as a relational table. Every query had an ID, an engine, a date-time stamp, an intent class, a geographic locale (set via VPN where relevant), the full text of the response, every cited URL, the domain of every cited URL, and a manually applied tag indicating whether the citation was a primary source, a directory or aggregator, an editorial source, a forum, or a vendor’s own site. A second pass added a “directory authority tier” tag, drawn from a custom rubric that combined Ahrefs Domain Rating, age of the domain, evidence of editorial curation, and whether the directory appeared in our prior citation data.
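For readers who want to reproduce the logging structure, a minimal sketch of the two tables involved is below, assuming SQLite as the store; the column names and enumerated values are illustrative rather than the team's exact schema.

```python
import sqlite3

# Illustrative schema for the query log described above; field names and
# enumerated values are assumptions, not the team's exact column list.
conn = sqlite3.connect("citation_log.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS captures (
    capture_id      INTEGER PRIMARY KEY,
    query_id        TEXT NOT NULL,        -- stable ID for the query text
    engine          TEXT NOT NULL,        -- chatgpt | perplexity | gemini | claude
    run_at          TEXT NOT NULL,        -- ISO-8601 date-time stamp
    intent_class    TEXT NOT NULL,        -- category_discovery | comparison | problem_led
    locale          TEXT NOT NULL,        -- en-GB | en-US | de-DE
    response_text   TEXT                  -- full verbatim response
);
CREATE TABLE IF NOT EXISTS citations (
    citation_id     INTEGER PRIMARY KEY,
    capture_id      INTEGER NOT NULL REFERENCES captures(capture_id),
    cited_url       TEXT NOT NULL,        -- URL as claimed by the engine
    resolved_url    TEXT,                 -- destination after following redirects
    domain          TEXT NOT NULL,
    source_type     TEXT,                 -- primary | directory | editorial | forum | vendor
    authority_tier  INTEGER               -- 1 (strongest) to 4
);
""")
conn.commit()
```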
Each query was run three times, on three different days, at staggered times, to control for the stochasticity that all four engines exhibit. Responses were captured verbatim. Where a citation was given as a hyperlink, it was followed and its destination logged separately from the engine’s own claimed URL — engines occasionally cite a domain that redirects somewhere else, and the redirect target is what the user actually sees.
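Resolving a cited link to its final destination is a small but worthwhile step. A minimal sketch using the requests library, assuming straightforward HTTP redirects and no engine-specific handling:

```python
import requests

def resolve_citation(cited_url: str, timeout: float = 10.0) -> dict:
    """Follow a cited URL through any redirects and report where the user actually lands."""
    try:
        resp = requests.get(cited_url, timeout=timeout, allow_redirects=True)
        return {
            "cited_url": cited_url,
            "resolved_url": resp.url,                        # final destination
            "redirect_chain": [r.url for r in resp.history],  # intermediate hops, if any
            "status_code": resp.status_code,
        }
    except requests.RequestException as exc:
        return {"cited_url": cited_url, "resolved_url": None, "error": str(exc)}
```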
The total query volume across the baseline phase was four hundred and twenty distinct queries, each run three times, across four engines: 5,040 individual response captures. Roughly fourteen percent of captures contained no citations at all, which is itself a finding — Claude in particular often produced uncited prose.
Selecting Engines and Query Sets
Selecting the query set deserves more attention than it usually gets. The team did not write the queries from scratch. Instead, they sourced phrasings from three places: Meridian’s actual customer-support transcripts (filtered for prospect-stage questions), the “People Also Ask” data on Google for the relevant category, and a smaller pool of queries volunteered by Meridian’s sales engineers as the questions they hear most often on first calls. This sourcing matters because synthetic queries — queries written by marketers imagining what prospects might ask — tend to be cleaner, shorter, and more category-aware than real ones, which biases results toward whichever engine handles canonical phrasings well.
The final query set comprised 420 queries, 140 per intent class, each tagged with locale, then run across four engines for a baseline. The directory universe had been narrowed to forty-two candidates. The combination produced enough data to characterise each engine’s behaviour with reasonable confidence and enough granularity to support directory-level recommendations later.
Table 1 contrasts these approaches at the level of query construction and the trade-offs they introduce.
Table 1: Query sourcing approaches and their methodological trade-offs
| Sourcing approach | Strength | Weakness | When to use |
|---|---|---|---|
| Marketer-imagined queries | Fast, cheap, easily tagged by intent | Biased toward canonical phrasing engines handle well | Early-stage exploratory scoping only |
| Support transcript mining | Reflects real prospect language and confusion | Slow to extract and de-identify; uneven coverage | Mid-stage validation of hypothesis |
| People Also Ask scraping | Reflects what Google believes users want | Already filtered through one search engine’s logic | Comparative work where Google is a reference point |
| Sales engineer interviews | Captures objection-stage and edge-case queries | Small sample; biased toward technical buyers | B2B contexts with consultative sales motions |
Running Queries Across Four Engines
Capturing Directory Mentions Systematically
Running 5,040 captures by hand would have been infeasible. The team built a thin Python harness around each engine’s API where one was available, and used a headless browser with manual prompt entry for ChatGPT (whose browsing tool, at the time of the engagement, did not have a clean public API for the configuration we wanted). Captures were dumped into a shared spreadsheet and then ingested into a small SQLite database for analysis.
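A sketch of what such a harness might look like is below. The `ask_engine` function is a placeholder for whichever per-engine client is available; nothing here reflects a specific vendor API, and the capture loop assumes the SQLite schema sketched earlier.

```python
import sqlite3
from datetime import datetime, timezone

# `ask_engine` is a placeholder for whatever client wraps each engine's API;
# it is assumed to return (response_text, [cited_urls]) for a given prompt.
def ask_engine(engine: str, query: str, locale: str):
    raise NotImplementedError("wire up the per-engine client here")

def run_baseline(queries, engines, runs_per_query=3, db_path="citation_log.db"):
    conn = sqlite3.connect(db_path)
    for q in queries:                      # each q: {"id", "text", "intent", "locale"}
        for engine in engines:
            for _ in range(runs_per_query):
                text, cited_urls = ask_engine(engine, q["text"], q["locale"])
                cur = conn.execute(
                    "INSERT INTO captures (query_id, engine, run_at, intent_class, locale, response_text) "
                    "VALUES (?, ?, ?, ?, ?, ?)",
                    (q["id"], engine, datetime.now(timezone.utc).isoformat(),
                     q["intent"], q["locale"], text),
                )
                for url in cited_urls:
                    conn.execute(
                        "INSERT INTO citations (capture_id, cited_url, domain) VALUES (?, ?, ?)",
                        (cur.lastrowid, url, url.split("/")[2] if "://" in url else url),
                    )
                conn.commit()
    conn.close()
```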
The systematic part was the tagging rubric. Every cited URL was classified along three dimensions: source type (directory, editorial, forum, vendor, academic, other), authority tier (one through four, where one was the strongest), and recency of the most recent visible content on the cited page. The recency dimension matters because some engines clearly prefer fresher content; others appear to weight long-standing pages more heavily. Without recency tagging, the differences between engines collapse into noise.
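A first pass of the source-type dimension can be automated before manual review. The sketch below is illustrative only; the domain lists are examples rather than the rubric the team actually used, and anything ambiguous falls through to manual tagging.

```python
# Illustrative first-pass source-type tagger; real tagging was manual, and
# these domain lists are examples, not the team's rubric.
DIRECTORY_DOMAINS = {"g2.com", "capterra.com", "trustradius.com", "softwareadvice.com"}
FORUM_DOMAINS = {"reddit.com", "stackoverflow.com"}
VENDOR_DOMAINS = {"meridian.example"}          # the client's own domain (placeholder)

def tag_source_type(domain: str) -> str:
    d = domain.lower().removeprefix("www.")
    if d in VENDOR_DOMAINS:
        return "vendor"
    if d in DIRECTORY_DOMAINS:
        return "directory"
    if d in FORUM_DOMAINS:
        return "forum"
    return "needs_manual_review"               # editorial vs. other is a judgement call
```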
ChatGPT and Perplexity Patterns
ChatGPT and Perplexity, despite being often grouped together in popular discussion, behaved very differently in the data. ChatGPT (browsing on, GPT-4-class model) cited an average of 3.1 distinct domains per response in category-discovery queries. Of those citations, roughly thirty-eight percent were directory or aggregator pages, twenty-two percent were the vendor’s own site, eighteen percent were editorial (trade press, contributor blogs), and the remainder split between forums and miscellaneous. ChatGPT also showed a strong preference for recently updated category pages — a G2 category page updated in the last ninety days was substantially more likely to be cited than the same page eighteen months stale.
Perplexity cited an average of 6.4 distinct domains per response on the same query class. Its citation diversity was the highest of the four engines by a clear margin. Directory citations made up a smaller proportion (around twenty-eight percent) but a larger absolute number, because the total citation volume was higher. Perplexity also surfaced regional and vertical directories that ChatGPT rarely touched. A German-language industrial software listing, for example, appeared in seven percent of Perplexity’s German-locale responses and zero percent of ChatGPT’s.
The most useful pattern across both engines was that comparison queries — “Tool A vs Tool B” — surfaced directory citations at a much higher rate than category-discovery queries. Roughly fifty-four percent of comparison-query citations on Perplexity and forty-seven percent on ChatGPT pointed to directory or aggregator pages. The likely reason: comparison content is concentrated on those pages and on a small number of editorial sites, and engines lean on whichever sources offer side-by-side structured information.
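That pattern falls out of a simple aggregation over the tagged log. A sketch of the query, assuming the captures and citations tables described earlier and SQLite as the store:

```python
import sqlite3

# Share of citations pointing at directory/aggregator pages, by engine and
# intent class; assumes the captures/citations schema sketched earlier.
conn = sqlite3.connect("citation_log.db")
rows = conn.execute("""
    SELECT cap.engine,
           cap.intent_class,
           ROUND(100.0 * SUM(cit.source_type = 'directory') / COUNT(*), 1) AS pct_directory
    FROM citations AS cit
    JOIN captures  AS cap ON cap.capture_id = cit.capture_id
    GROUP BY cap.engine, cap.intent_class
    ORDER BY cap.engine, cap.intent_class
""").fetchall()
for engine, intent, pct in rows:
    print(f"{engine:<12} {intent:<20} {pct}% of citations are directory pages")
```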
Gemini and Claude Patterns
Gemini’s behaviour sat between ChatGPT and Perplexity on most dimensions but exhibited one strong idiosyncrasy: a heavy preference for sources that also rank well on Google, which is unsurprising given its provenance. Gemini’s directory citations skewed toward G2, Capterra, and TrustRadius — the three domains with the strongest Google footprint in the software-listing space. Niche or regional directories appeared less often than on Perplexity.
Claude was the outlier. Without browsing enabled (the default configuration during baseline testing), Claude produced fluent, often well-structured responses with no live citations at all. With its browsing tool active, Claude cited fewer sources than any other engine — an average of 1.7 per response on the same category-discovery query set — and showed a marked preference for editorial sources over directories. When Claude did cite a directory, it was almost always G2 or a similarly large, established name. Smaller directories, regional listings, and vertical aggregators effectively did not appear.
This matters because it implies that an investment strategy optimised for Claude visibility looks very different from one optimised for Perplexity visibility. The two engines reward different supply-side signals. Anyone reporting an aggregate “AI citation lift” without disaggregating by engine is hiding most of the useful information.
What the Citation Data Revealed
The aggregated baseline data told a coherent story, though one with sharp edges. Directories were not a uniform category. Within the forty-two-directory shortlist, citation frequency followed something close to a power-law distribution: G2 alone accounted for roughly twenty-two percent of all directory citations across the four engines. Capterra accounted for fourteen percent. The next five directories together accounted for another thirty percent. The remaining thirty-five directories shared the final thirty-four percent — meaning many of them appeared only a handful of times across thousands of queries.
A naive reading would conclude that only the top seven directories matter. The data did not support that reading. Looking at citation frequency weighted by query intent and locale produced a different picture. In German-locale queries, two German-language directories that appeared nowhere on the global top-ten list were among the top three sources cited. In queries tagged as “regulated industries”, a single vertical compliance-software directory accounted for nineteen percent of directory citations despite contributing well under one percent of the global volume. Concentration at the top is a feature of the unweighted average; the moment one slices by intent or locale, long-tail directories carry meaningful weight.
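The concentration-versus-long-tail point is easiest to see by computing the same frequency table twice, once globally and once within a slice. A minimal sketch with pandas, assuming the citation log has been exported to a DataFrame with domain, locale, and intent_class columns:

```python
import pandas as pd

# `citations` is assumed to be a DataFrame with one row per directory citation
# and at least the columns: domain, locale, intent_class.
def top_directories(citations: pd.DataFrame, n: int = 10) -> pd.Series:
    """Share of directory citations by domain, largest first."""
    return citations["domain"].value_counts(normalize=True).head(n)

# Global view: heavy concentration in a handful of domains.
# print(top_directories(citations))
# Locale slice: long-tail regional directories move up the ranking.
# print(top_directories(citations[citations["locale"] == "de-DE"]))
# Intent slice: vertical directories dominate within their query class.
# print(top_directories(citations[citations["intent_class"] == "comparison"]))
```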
Recency mattered, but unevenly. ChatGPT and Perplexity showed clear preference for pages updated within the last six months. Gemini was less recency-sensitive. Claude effectively ignored recency in favour of perceived editorial authority. This finding aligns with the broader observation, made by Pew Research Center (2017), that algorithmic systems trained on different objectives produce divergent behaviour on the same input — and that the divergence is often invisible to end users, who see only the surface response.
One more pattern was worth flagging: structured review content was cited disproportionately. Pages with visible review counts, star ratings, and pros/cons sections were cited at roughly twice the rate of pages with the same product information presented as flowing prose. Whether this is due to retrieval-time matching of structured content or to training-time exposure to review-rich pages is unknowable from the outside. The practical implication is the same either way.
Decision Forks During Analysis
Filtering Low-Authority Directories
The first major decision fork was how aggressively to filter the forty-two-directory candidate list. The temptation was to drop everything below a Domain Rating of fifty. That would have removed twenty-three directories from the list, including several that the baseline data showed were being cited by Perplexity and Gemini in narrow but commercially valuable contexts.
The team adopted a different filter. A directory was retained if it met any one of three conditions: a Domain Rating of fifty or higher, at least three citation appearances in the baseline data, or evident editorial curation combined with a focused vertical. The third condition is judgement-laden but defensible — a small directory with a real editor and a tight scope routinely outperforms a large directory with open submissions and weak moderation. Twenty-nine directories made the cut.
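The retention rule is easy to express as code, which also forces the judgement-laden third condition to be stated explicitly rather than applied silently. A sketch, with field names that are assumptions rather than the team's working spreadsheet:

```python
from dataclasses import dataclass

@dataclass
class DirectoryCandidate:
    name: str
    domain_rating: int          # Ahrefs Domain Rating
    baseline_citations: int     # appearances in the baseline capture data
    editorially_curated: bool   # manually assessed
    focused_vertical: bool      # manually assessed

def retain(d: DirectoryCandidate) -> bool:
    """Keep a directory if it meets ANY one of the three conditions."""
    return (
        d.domain_rating >= 50
        or d.baseline_citations >= 3
        or (d.editorially_curated and d.focused_vertical)
    )

# shortlist = [d for d in candidates if retain(d)]   # 29 of 42 made the cut
```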
The filtering logic explicitly avoided treating Domain Rating as a gating threshold. Domain Rating measures backlink profile strength on a logarithmic scale, which is a useful but indirect proxy for the kind of signal AI engines appear to weight. Several directories with modest Domain Ratings showed strong AI citation patterns because their content was structured, recent, and topical — three properties Domain Rating does not measure. As Harvard Business Review (2017) observes about cross-cultural management, conflating two related but distinct dimensions — in their case authority and decision-making — produces predictable failures. The same logic applies here: domain authority and AI-citation authority are correlated but not identical, and treating them as identical is the structural error.
Weighting Recency Versus Domain Strength
The second fork concerned how to score directories that scored well on one dimension but poorly on the other. A high-Domain-Rating directory with stale content was scored down; a moderate-Domain-Rating directory with active editorial cycles was scored up. This was operationalised through a composite score with three inputs — domain strength (40%), citation frequency in baseline data (35%), and content recency (25%) — applied within the twenty-nine-directory shortlist.
The weights were not arbitrary, but they were judgement calls. A different team running the same project with a different prior would land on different weights and produce a slightly different ordering. That is fine. The discipline is to make the weights explicit, document them in the project file, and revisit them when the data updates.
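A sketch of the composite score with those weights made explicit is below, assuming each input has already been normalised to a 0–1 range; a different team would pick different weights, which is precisely why writing them down matters.

```python
# Composite directory score with the weights used in this engagement.
# Inputs are assumed to be normalised to [0, 1] before weighting.
WEIGHTS = {"domain_strength": 0.40, "citation_frequency": 0.35, "content_recency": 0.25}

def composite_score(domain_strength: float, citation_frequency: float,
                    content_recency: float) -> float:
    return round(
        WEIGHTS["domain_strength"] * domain_strength
        + WEIGHTS["citation_frequency"] * citation_frequency
        + WEIGHTS["content_recency"] * content_recency,
        3,
    )

# Example: modest Domain Rating, strong baseline citations, actively refreshed content.
# composite_score(0.45, 0.80, 0.90)  -> 0.685
```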
Handling Conflicting Engine Signals
The third fork was the most difficult. Several directories were cited heavily by one engine and almost not at all by another. The clearest example was a UK-based business listing site that appeared in fourteen percent of Perplexity’s UK-locale responses, six percent of Gemini’s, three percent of ChatGPT’s, and zero percent of Claude’s. Was that a directory worth pursuing?
The decision was made by mapping engine usage against Meridian’s actual sales pipeline. The marketing operations team had spent the previous two quarters tagging inbound conversations with self-reported source data, including, where volunteered, which AI engine the prospect had used during research. Perplexity-attributed conversations represented a disproportionately high share of pipeline value relative to engine traffic share — Perplexity users were further along in their evaluation by the time they reached sales. That tilted the directory selection toward Perplexity-favoured properties even when ChatGPT visibility was weaker. A different client with a different funnel would have made a different call.
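The adjustment can be expressed as a weighted sum over engines, with weights drawn from pipeline attribution rather than raw traffic share. The figures in the sketch below are placeholders, not Meridian's actual attribution data:

```python
# Illustrative engine weights derived from pipeline value rather than traffic;
# the figures are placeholders, not the client's actual attribution data.
ENGINE_PIPELINE_WEIGHT = {"perplexity": 0.40, "chatgpt": 0.30, "gemini": 0.20, "claude": 0.10}

def pipeline_weighted_score(per_engine_citation_rate: dict) -> float:
    """Weight a directory's per-engine citation rate by where pipeline value comes from."""
    return sum(
        ENGINE_PIPELINE_WEIGHT.get(engine, 0.0) * rate
        for engine, rate in per_engine_citation_rate.items()
    )

# The UK listing site from the example above:
# pipeline_weighted_score({"perplexity": 0.14, "gemini": 0.06, "chatgpt": 0.03, "claude": 0.0})
# -> 0.077, high enough to keep despite near-zero Claude visibility.
```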
Picking Directories Worth Pursuing
The final shortlist was twelve directories, ranked by composite score and adjusted for engine-pipeline weighting. The list combined three globally established software directories, two regional generalist business listings (one UK, one DACH), four vertical aggregators focused on automation, finance-operations tooling, or compliance software, two review-led editorial properties, and one curated general business listing whose editorial standards had pushed its citation rate above what its modest Domain Rating would predict. As this case study illustrates, curated listings with active editorial review tend to outperform open-submission directories by a margin that grows over time as engines learn which sources have stable signal-to-noise ratios.
G2 and Capterra Tradeoffs
G2 and Capterra were both included, but with different effort allocations. G2 was the highest-priority listing on every dimension: highest citation share across engines, strongest structured review surfacing, most active editorial updates. The investment there was full — paid presence, refreshed product description, active solicitation of recent reviews, structured pros/cons content, comparison-page presence against the two named competitors that had been outperforming Meridian in baseline citations.
Capterra was treated as secondary. The cost of a full Capterra investment was high relative to the marginal citation lift the baseline data predicted. The team recommended a maintenance posture — keep the listing current, refresh the description, gather a modest stream of new reviews — rather than the full review-acquisition push given to G2. This split was contentious internally; Meridian’s marketing team had a long relationship with Capterra and resisted the deprioritisation. The compromise was to revisit the Capterra weight at the sixty-day checkpoint based on observed lift.
Niche Vertical Directories
The four vertical directories were where the most interesting work happened. Each one required a different submission approach: one accepted only directly invited vendors and required two customer references; another had open submissions but a six-week editorial review process; a third charged a flat annual fee with no editorial gating; the fourth required an editor-written profile based on a structured intake interview. The team built a project plan that staged these submissions according to their lead times so all four would land within the same eight-week window.
The vertical directories collectively accounted for less than ten percent of total citation volume in the baseline data but a substantially higher share of comparison-query citations within their respective verticals. That is the structural reason to pursue them: in the queries that matter most to a buyer who is close to deciding, vertical directories punch above their weight.
Regional Listing Considerations
The two regional generalist listings — one focused on the UK business landscape, one on the DACH region — were included for two reasons. First, the baseline data showed clear locale-sensitive citation behaviour, particularly on Perplexity. Second, regional listings often serve as bridge sources: they are cited in long-tail problem-led queries where the global directories do not appear because the query is too specific or too local. The cost of regional inclusion was modest, the lead time short, and the upside concentrated in queries where competition was lighter.
Implementation Plan for the Client
Submission Sequencing Over Twelve Weeks
The implementation plan covered twelve weeks and was sequenced around three constraints: editorial lead times, internal content production capacity at Meridian, and the staggered review-collection cadence needed to keep G2 and a couple of other review-led properties continuously fresh.
Weeks one and two were spent on content preparation: refreshed product descriptions tuned to the structural patterns the baseline data had shown were favoured (clear feature lists, explicit positioning against named competitors, structured pros and cons, locale-appropriate phrasing for the German-language listings). The team also produced two new comparison pages on Meridian’s own site, on the grounds that engines often cite vendor comparison pages alongside directory comparison pages, and one without the other was a missed opportunity.
Weeks three through six handled the high-priority submissions: G2’s full refresh and review-acquisition push, the two regional generalist listings, and the two vertical directories with the longest editorial lead times. Weeks four through eight handled the two remaining vertical directories and the secondary review-led properties. Weeks eight through ten covered the two editorial properties that required relationship-led pitching rather than open submission. Weeks ten through twelve were reserved for refresh cycles, review prompts, and any remediation needed where submissions had been rejected or returned for changes.
Two of the twelve directories required customer references. Meridian’s customer success team identified six willing references in the first week, which exceeded the minimum needed; the surplus was held as backup for any directory that requested additional verification later. This is a planning detail that often gets missed — directories with customer-reference requirements have natural bottlenecks if the vendor has not pre-qualified its reference pool.
Throughout the twelve weeks, the team ran the same baseline query set in compressed form once a fortnight: forty representative queries across the four engines, three runs each, 480 captures per fortnightly checkpoint. This was light enough to be sustainable and heavy enough to detect signal. The intent was not to produce a publication-grade longitudinal study; it was to detect whether any of the submissions had begun to surface within the engines’ citation patterns and to flag any directional changes early.
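Each checkpoint reduced to one question per segment: in what fraction of captures does the brand appear? A minimal sketch of that computation over the SQLite log, assuming the schema above and a simple text match on the brand name:

```python
import sqlite3

BRAND_NAME = "Meridian"   # placeholder brand string to look for in responses

def brand_appearance_share(db_path: str, since: str, engine: str) -> float:
    """Fraction of captures since `since` (ISO date-time) whose response mentions the brand."""
    conn = sqlite3.connect(db_path)
    total, mentioned = conn.execute(
        "SELECT COUNT(*), SUM(response_text LIKE ?) "
        "FROM captures WHERE run_at >= ? AND engine = ?",
        (f"%{BRAND_NAME}%", since, engine),
    ).fetchone()
    conn.close()
    return (mentioned or 0) / total if total else 0.0

# Fortnightly checkpoint, e.g.:
# brand_appearance_share("citation_log.db", "2024-04-01", "perplexity")
```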
Results After Ninety Days
Citation Lift By Engine
At day ninety, the team ran the full 420-query baseline set across all four engines, three runs each, replicating the original methodology. The headline number was that Meridian’s citation share across the combined query set rose from nine percent to twenty-six percent. That figure, reported on its own, would be flattering and slightly misleading. The disaggregated numbers tell the real story.
The figures presented in Table 2 confirm both the direction of the change and the unevenness across engines, query classes, and locales.
Table 2: Citation share change across engines, query classes, and locales — baseline versus ninety-day measurement
| Segment | Engine | Baseline citation share | Day-90 citation share | Change (pp) |
|---|---|---|---|---|
| Category discovery | ChatGPT | 11% | 27% | +16 |
| Category discovery | Perplexity | 14% | 41% | +27 |
| Category discovery | Gemini | 9% | 22% | +13 |
| Category discovery | Claude | 4% | 6% | +2 |
| Comparison queries | ChatGPT | 13% | 34% | +21 |
| Comparison queries | Perplexity | 16% | 49% | +33 |
| Comparison queries | Gemini | 10% | 28% | +18 |
| Comparison queries | Claude | 5% | 8% | +3 |
| Problem-led queries | ChatGPT | 6% | 15% | +9 |
| Problem-led queries | Perplexity | 8% | 23% | +15 |
| Problem-led queries | Gemini | 5% | 12% | +7 |
| Problem-led queries | Claude | 3% | 4% | +1 |
| UK locale (all classes) | Perplexity | 12% | 38% | +26 |
| DACH locale (all classes) | Perplexity | 9% | 31% | +22 |
| US locale (all classes) | Perplexity | 13% | 33% | +20 |
| UK locale (all classes) | ChatGPT | 10% | 26% | +16 |
| DACH locale (all classes) | ChatGPT | 7% | 19% | +12 |
| Aggregate (all engines, all classes) | Combined | 9% | 26% | +17 |
| Aggregate excluding Claude | Combined | 10% | 30% | +20 |
Perplexity Gains Outpaced ChatGPT
The strongest gains were on Perplexity, which is consistent with the baseline data’s prediction. Perplexity’s higher citation diversity meant that adding directory presence translated into citation appearances more efficiently than on engines with narrower citation vocabularies. The lift in comparison queries — from sixteen percent to forty-nine percent — was particularly large because the new and refreshed comparison pages on G2 and on two of the vertical directories were directly relevant to those query patterns.
ChatGPT showed strong gains too, particularly on comparison queries, but the magnitude was smaller. The likely reason is that ChatGPT’s citation pool is more concentrated; gaining presence in three or four high-weight properties produces a substantial jump, but additional presence in long-tail properties contributes little because ChatGPT seldom reaches those.
Claude Remained Stubbornly Flat
Claude’s lift was minimal — two to three percentage points across the board. That was within the noise band of the measurement protocol, and the team did not claim it as a meaningful improvement. The interpretation, supported by the baseline data, was that Claude’s citation behaviour weights perceived editorial authority and source consolidation in ways that directory submissions do not meaningfully affect over a ninety-day window. Moving Claude would likely require a different strategy entirely — earned editorial coverage in the small number of sources Claude treats as authoritative, rather than directory expansion.
This is a useful reminder that not every channel responds to the same lever. As Pew Research Center (2017) framed the broader question, algorithmic systems vary in objectives, training data, and architecture, and treating them as a single homogeneous “AI search” category is a category error that produces blurred reporting and misallocated investment.
Transferable Principles for Other Brands
Several principles emerged that appear to generalise across the three engagements this composite draws on. The first is that citation surface area, measured as the count of independent third-party properties that mention a brand in a given context, is a stronger predictor of AI engine visibility than aggregate domain authority. The second is that engine behaviour is heterogeneous enough that any reporting which collapses across engines hides most of the useful signal. The third is that comparison queries respond to directory work more dramatically than category-discovery queries, which has implications for how directory investment should be sequenced relative to a buyer’s funnel position.
A fourth principle, less easily quantified but observed across all three composite engagements: structured content beats prose at almost any size, and recency beats stasis on roughly half of engines. Vendors who had invested in structured comparison content and active update cycles on their directory listings showed citation patterns that materially diverged from vendors with static, prose-heavy entries even when other signals were comparable.
A fifth principle is that the directory universe is not flat. The forty-two-directory shortlist contained directories whose citation contribution per pound of investment varied by more than an order of magnitude. Treating directories as a homogeneous category is the single most expensive analytical shortcut a team can take.
What I Got Wrong Initially
Two specific calls turned out to be wrong, and both are worth surfacing. The first was the initial weighting of Capterra. The maintenance-posture recommendation for Capterra was based on baseline citation share, but the sixty-day checkpoint showed that Capterra was being cited more than expected on US-locale problem-led queries — particularly on Gemini, which had a stronger Capterra preference than the team’s prior had assumed. The plan was adjusted at the sixty-day mark to upgrade Capterra’s review-acquisition pace, which captured most of the missed lift but not all of it. A more cautious initial weighting would have been better than over-confident downgrading.
The second mistake was underestimating how long the German-locale signal would take to manifest. The DACH-region listings were submitted on schedule, but their citation appearance lagged the UK and US gains by roughly four weeks. The team had budgeted for parity. The likely reason for the lag is that some engines appear to recrawl or refresh non-English directory content less frequently than English equivalents, though that is a hypothesis rather than a finding. Anyone planning multi-locale work should bake a buffer into their measurement timeline rather than expecting concurrent lift across locales.
One reflection worth recording, since it shapes how I would scope a similar project today: the team underestimated how much of the work would be content production and how little would be technical submission. The instinct, going in, was to treat directory work as a logistics exercise. It is partly logistics, but the larger share of effort sits in writing, restructuring, and gathering the inputs (reviews, references, screenshots, locale-appropriate copy) that turn a submission into a citation-worthy listing.
Adjusting for Different Constraints
Smaller Budget Scenario
A version of this engagement with a £12,000 budget — roughly a third of what Meridian had — would look meaningfully different. The first cuts would be in measurement infrastructure: the 5,040-capture baseline would shrink to a 600-capture scoping pass, enough to confirm hypothesis direction but not enough to support engine-level disaggregated reporting at the same confidence level. The second cuts would be in directory breadth: the twelve-directory shortlist would shrink to four or five, almost certainly G2, one regional listing, one vertical directory, and one curated general listing. The third cut would be in content production: refreshed descriptions and one new comparison page rather than two.
The expected lift in that scenario would be smaller — likely a citation share rising from nine to roughly eighteen or twenty percent rather than twenty-six — and it would be more concentrated on Perplexity and Gemini, where directory effort produces the strongest marginal returns. Claude visibility would not move at all. ChatGPT visibility would move modestly. The reduced engagement would still produce defensible ROI but would not support engine-level reporting and would give the client less protection against engine-side changes in retrieval behaviour.
A practitioner working with a tight budget should accept narrower coverage rather than thinning the work across more directories. Three deeply executed directory presences will outperform eight half-finished ones in nearly every test scenario the team has observed. As this case study has argued throughout, the difference between a directory presence that contributes citations and one that does not is usually a function of structural completeness (full descriptions, recent reviews, live comparison content, locale-appropriate language) rather than of which directory was chosen.
Tighter Six-Week Timeline
A six-week version of the engagement is feasible but requires aggressive sequencing and willingness to give up the long-lead-time directories. The submissions with editorial review processes longer than four weeks would be deferred to a second phase or dropped. The G2 refresh would still anchor week one. Regional listings would be feasible. Vertical directories with editorial review processes would be either skipped or pursued through pre-existing relationships rather than cold submission.
The measurement protocol would also have to compress. A two-checkpoint design — baseline at week zero, final at week six — would replace the fortnightly checkpoint cadence. That is acceptable for engagements where the goal is operational rather than evidentiary, but it gives up the early signal that lets a team recalibrate mid-stream.
The expected lift from a six-week engagement would be smaller than a twelve-week one not because the work was lower quality but because some directories simply do not propagate into engine citation data within six weeks. The team has observed cases where new directory listings did not appear in Perplexity citation patterns until eight to ten weeks after submission. A six-week timeline systematically undercounts the long-lead-time properties.
Industry Variations to Consider
B2B Versus Consumer Brands
The Meridian engagement was B2B SaaS. Consumer brands operate under different conditions, and the directory landscape they face is not the same. Consumer-facing AI engine queries lean more heavily on editorial sources (Wirecutter-style review properties, mainstream press), aggregators, and marketplace pages than on traditional directory listings. The directory equivalent in consumer contexts is often a marketplace or comparison site — a different kind of property with different submission economics.
The principles transfer; the specific properties do not. A consumer brand running a similar investigation would build its candidate list from a different starting universe and would weight editorial coverage more heavily in its composite scoring. Recent commentary on how the directory landscape differs between B2B and consumer contexts, and how listing strategy should adjust accordingly, covers that ground in more detail.
Regulated Industry Adjustments
Regulated industries — financial services, healthcare, pharmaceuticals, legal — introduce additional constraints. Submissions often require legal review of copy. Customer-reference solicitation may be limited by confidentiality clauses. Review acquisition may be constrained by industry rules on testimonials. The twelve-week timeline that worked for Meridian would extend to sixteen or eighteen weeks for a regulated client, and the directory shortlist would tilt heavily toward properties with editorial review (which is consistent with regulated-industry brand-protection priorities anyway).
Regulated-industry directories also tend to be more concentrated. There are fewer of them, they cover narrower verticals, and they often serve as gateways to professional buyer audiences that general directories do not reach. In that context, a four- or five-directory strategy may be both adequate and preferable to a wider net. The signal-to-noise ratio of vertical compliance directories tends to be higher than general business listings, and AI engines appear to recognise that distinction in their citation patterns, though the team has not yet run a study that quantifies the effect with precision.
Where I’d Take This Research Next
Tracking Authority Decay Over Time
The most important question the engagement did not answer is how long directory-driven citation lift persists. The ninety-day measurement captured the rise; it did not establish the steady state. Several plausible decay patterns exist: lift could plateau and hold, lift could continue growing for another quarter as engines complete their crawl-and-update cycles, or lift could decay as competitors catch up and as engines rotate which sources they favour. Each pattern has different implications for ongoing investment.
A longitudinal extension of this study would track citation share monthly across the same query set for at least eighteen months, with a parallel control of comparable vendors who did not undertake directory investment. The control problem is the difficult part. Comparable vendors are doing their own marketing, and isolating directory effects from broader brand activity requires either cooperation with the controls (rare, given competitive sensitivity) or careful statistical work on observational data. Neither is easy.
A further extension would test whether directory work has differential effects on the engines as those engines update their underlying models. The engines visibly changed retrieval behaviour at least three times during the ninety-day Meridian engagement; the team observed shifts in citation patterns that did not correlate with anything Meridian had done and that affected competitors as well as the client. Mapping those shifts and characterising how directory presence interacts with them is a substantial research programme rather than a single project.
Three practical implications follow from the analysis as it stands. The first: treat AI engine visibility as a portfolio problem, not a single-channel problem. The data show that engines reward different supply-side signals, that the same investment produces different lift across engines, and that aggregate reporting hides most of the useful detail. A team that allocates a single budget against a single “AI visibility” metric will systematically over-invest in the engines that respond easily and under-invest in the ones that require different work — and will not know it is happening, because the aggregate number will look fine. Disaggregate by engine, by query class, and by locale, or accept that the reporting is decorative.
The second: prioritise structural completeness of a small number of directory presences over breadth across many. The single most consistent finding across the composite engagements was that directories with full descriptions, recent reviews, live comparison content, and locale-appropriate language outperformed directories with thinner entries by margins that swamped any difference in directory size or domain authority. Three thoroughly executed listings will produce more citation lift than ten thin ones, and at lower total cost. The temptation to spray submissions across the directory universe should be resisted; the data do not reward it.
The third, and the one most likely to be ignored: build the measurement before the campaign, not alongside it. A baseline that exists only in retrospect is not a baseline. The engagement described here worked because the first three weeks were spent characterising the existing state in enough detail that subsequent change could be attributed rather than asserted. Teams that skip that step end up with stories rather than evidence, and stories are difficult to defend when the next budget cycle arrives and the CFO wants to know which line items produced the lift.

