[ATLAS] May 15, 2026 · 13 min read

Citations Without Clicks

A reference dossier on Answer Engine Optimization in 2026: what the category actually is, what its tools can and cannot measure, and how to decide whether to buy, build, or wait.

The category, honestly

Answer Engine Optimization, or AEO, is the practice of measuring and improving how a brand shows up inside AI-generated answers. The surfaces are no longer just Google's ten blue links. They include ChatGPT, Gemini, Perplexity, Claude, Google AI Overviews, Google AI Mode, Bing Copilot, and the rest of the systems that synthesize an answer instead of ranking pages. The unit of work moves with it. Traditional SEO optimizes pages for rankings and click-through. AEO monitors prompts, mentions, citations, source selection, sentiment, and share of voice inside answers that may never produce a click at all.

The why-now is real, but messier than vendors imply. Q2 2026 is not yet complete, so there is no clean number for the share of informational queries answered inline without a click. What exists is a stack of proxies.

Semrush tracked AI Overviews on 15.69% of all queries in November 2025, after a July peak of 24.61%, and found AI Overviews bleeding from informational into commercial, transactional, and navigational intent. Pew Research Center found that when an AI summary appeared, users clicked a traditional result in 8% of visits versus 15% without one, and clicked a link inside the summary in only 1%. Similarweb reported zero-click news searches climbing from 56% to 69% between May 2024 and May 2025. Ahrefs' December 2025 rerun of its AI Overviews study found a 58% lower click-through rate for the top-ranking page when an AI Overview was present.

Most operationally relevant: Similarweb reported AI platform visits up 28.6% year-over-year into January 2026, while AI referrals to external sites stayed flat. Attention is moving into AI interfaces faster than referral traffic is moving out of them.

The category-formation signal is also strong. HubSpot launched HubSpot AEO on April 14, 2026, at $50 per month. Profound sells Answer Engine Insights and Prompt Volumes, with prompt-volume data sourced from double opt-in consumer panels. AthenaHQ, Otterly.AI, Goodie, Bluefish, Ahrefs Brand Radar, Semrush AI Visibility Toolkit, and Adobe LLM Optimizer all ship some mix of monitoring, citation tracking, prompt-volume estimation, and recommendations.

The disagreement is not whether AI answers matter. It is whether AEO is a new discipline or a new label for overlapping SEO, digital PR, brand authority, content structure, and analytics work.

Mike King at iPullRank uses GEO, Generative Engine Optimization, and frames the target as technical infrastructure and authority signals. The original 2023 GEO research paper from Aggarwal and collaborators reported visibility gains up to 40% in benchmark settings. Aleyda Solis treats AEO as high-overlap with SEO but with different retrieval style, optimization targets, and metrics. SEOFOMO's practitioner survey found the naming itself unsettled, with only 4% of respondents using AEO as their preferred label.

The honest definition: AEO is an emerging measurement and content-operations layer for AI answer surfaces. It is not separate from SEO. It is also not reducible to it. It adds new reporting targets, new uncertainty, and new buyer questions.

Figure 1 — Where each surface sits. Most surfaces cite without driving clicks. Perplexity is the outlier that bridges visibility and traffic.

If You Read Nothing Else

A 1 to 2 week AEO baseline before a vendor demo

Buying AEO tooling before running a manual baseline is how teams end up paying for dashboards that do not change a decision. This is the smallest version of an AEO baseline a team can run in two weeks: a manual prompt audit across the major engines that produces the data a paid tool would, at no software cost. Run it and you have a buy/build/wait decision grounded in your own brand, not a vendor demo.

Days 0 to 1: Pick the prompts. Compile 20 to 50 buyer prompts your customers would actually ask before purchase. Split by problem, category, comparison, pricing, integration, risk, and competitor. Include the prompts where the right answer cites you and the prompts where the right answer cites a competitor.

Days 1 to 3: Run them across the engines. ChatGPT, Gemini, Perplexity, and AI Overviews where available. Capture screenshots and source lists. Run each prompt twice on different days to surface variance.

Days 3 to 5: Log brand visibility. Record mentioned versus not mentioned, your citations versus competitors' citations, sentiment accuracy, share of voice within the answer, and source domains. Note hallucinated citations separately. Track the prompts where AI answers diverge meaningfully from your brand position.

Days 5 to 7: Map content and PR capacity. Catalog who can update help-center articles, earn third-party mentions, fix technical access (sitemaps, structured data, indexability), and respond to AI-driven changes within a quarter. Without that capacity, AEO data becomes theater.

Days 8 to 10: Read engine disagreements. Note where the engines agree on your brand and where they diverge, and work out why one engine cites you and another does not. Classify each gap as content (you have not written about it), authority (you have written but earn no third-party mentions), or technical (you are not crawlable in the right format).

Days 10 to 12: Calculate baseline value. What would moving 5 to 10 of your weakest prompts to "consistently mentioned" be worth? Compare against the cost of a paid tool plus the team capacity it requires. If the expected value cannot beat your manual baseline plus content fixes, wait.

Days 12 to 14: Decide buy, build, or wait. Produce a one-page memo: prompts measured, engine variance, current visibility, content and PR capacity, expected value of paid tooling, threshold for triggering purchase. The memo is the decision artifact, not the prompt-test results.
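The Days 10 to 12 comparison is plain arithmetic, and writing it down keeps the memo honest. The sketch below is one way to frame it, not a prescribed model. The `Option` class, its field names, and every number in the example are hypothetical inputs; replace them with figures from your own baseline.

```python
from dataclasses import dataclass


@dataclass
class Option:
    """One path through the decision. Every number is a hypothetical input."""
    prompts_moved: int       # weak prompts expected to reach "consistently mentioned"
    value_per_prompt: float  # assumed monthly value of one such prompt
    tool_cost: float         # monthly software cost (0 for the manual path)
    team_hours: float        # monthly hours needed to act on the findings
    hourly_cost: float       # loaded cost of one team hour

    def net_value(self) -> float:
        gain = self.prompts_moved * self.value_per_prompt
        cost = self.tool_cost + self.team_hours * self.hourly_cost
        return gain - cost


def buy_or_wait(paid: Option, manual: Option) -> str:
    """Return "buy" only if the paid path beats manual baseline plus content fixes."""
    return "buy" if paid.net_value() > manual.net_value() else "wait"


# Illustrative figures only.
paid = Option(prompts_moved=8, value_per_prompt=400.0,
              tool_cost=295.0, team_hours=20, hourly_cost=75.0)
manual = Option(prompts_moved=5, value_per_prompt=400.0,
                tool_cost=0.0, team_hours=16, hourly_cost=75.0)
print(buy_or_wait(paid, manual))
```

The weakest input is `value_per_prompt`. If the team cannot defend that number, the output of the comparison is not defensible either, which is itself a finding for the memo.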

Buying without this baseline produces a dashboard that does not change a decision. Running this baseline gives you the data to know whether the dashboard would.

The rest of this dossier explains what each step in the baseline is testing for. The tools section names what AEO vendors actually measure and what they cannot prove. The measurable-versus-noise section separates real signal from vendor framing. The buy/build/wait section names the threshold the baseline data has to clear.

What the tools actually do

Most AEO tools do four jobs. They run or observe prompts and record whether a brand appears. They parse answers for mentions, citations, sentiment, and competitors. They estimate prompt demand or generate prompt sets. They recommend content, PR, review, or technical changes intended to lift inclusion in future answers. The hard part is that these jobs sit at very different confidence levels. Counting a brand mention in a captured answer is measurable. Estimating the universe of real prompts is partly modeled. Claiming that a recommended change caused a citation lift is usually speculative unless the vendor shows controlled tests.

Tool	Public price	What it measures	What it does not prove
HubSpot AEO	$50/month, $45 annual	Visibility, sentiment, prompt tracking, competitor and citation analysis across ChatGPT, Gemini, Perplexity	No published statistical confidence, sample-size methodology, or repeated-run variance handling
HubSpot AEO Grader	Free	One-shot brand sentiment, recognition, share of voice, source quality across OpenAI, Perplexity, Gemini	Measures broad characterization more than durable citation inclusion
Profound	Custom; plan prompt counts public	Answer Engine Insights monitors AI responses; Prompt Volumes estimates real prompts from double opt-in panels across ChatGPT, Gemini, Claude, Perplexity	Panel data still under-samples enterprise, B2B, and niche buyer behavior
AthenaHQ	Self-serve from $295/month	Visibility, competitor monitoring, citation intelligence, prompt-volume estimation, content optimization across nine engines	Credit math gets expensive at scale; proprietary citation engine and volume estimates need careful causal reading
Otterly.AI	$29 / $189 / $489 monthly	User-defined prompts run across ChatGPT, AI Overviews, Perplexity, Copilot; AI Mode and Gemini as add-ons	Reports what happened on the tracked prompt set, not whether the prompt set represents the buyer universe
Goodie	Demo-only; tiers by prompt count	Explorer, Pro, and Enterprise tiers cover 100 to 500+ prompts and 3,000 to 15,000+ AI responses across up to eleven engines	Price opacity makes ROI hard to assess pre-demo; "optimization actions" count is not business impact
Bluefish AI	Custom, demo-based	Enterprise AI monitoring, GEO measurement, AI optimization, AI commerce	Public pages expose little methodology; April 2026 $43M Series B is signal, not validation
Semrush AI Visibility Toolkit	$99/month	One domain, 25 prompts, 300 daily AI Analysis queries, 1,000 daily Prompt Research queries	Low prompt limit at base; extra domains, locations, and users compound cost
Ahrefs Brand Radar	From $199/month; custom prompts from $50	AI mentions, custom prompts, search demand, web visibility across AI Overviews, AI Mode, ChatGPT, Perplexity, Copilot, Gemini, Grok	Search-backed prompt database scales, but custom prompts still face AI answer variance
Adobe LLM Optimizer	Custom annual; minimum 1,000 prompts	Daily prompt analysis, weekly trends, citations, mentions, recommendations, optional Auto-Optimize for AEM and CDN	Too heavy for SMBs; edge-tied recommendations need controlled validation

What each AEO tool measures today, what it costs, and what it does not prove.

Citation tracking and brand monitoring

This is the most mature use case. The tool runs a defined prompt set against the engines, captures the response, and records whether the brand, URL, competitor, or domain appears. Every serious vendor in the table does some version of this. The reliable measurement is narrow: for prompt set P, engine E, geography G, and date range D, how often did the brand appear, with which citations, and near which competitors. Anything broader is a generalization the tool cannot back.
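The narrow measurement can be written down as a record plus a filter. This is a minimal sketch, assuming each captured answer has already been parsed into a set of brand names. The `Capture` type, its fields, and the sample data are illustrative, not any vendor's schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Capture:
    """One captured answer, tagged with the test frame it belongs to."""
    prompt: str
    engine: str
    geography: str
    captured_on: date
    brands_mentioned: frozenset


def mention_rate(captures, brand, *, engine, geography, start, end) -> Optional[float]:
    """Share of in-frame captures that mention `brand`, or None if the frame is empty."""
    in_frame = [
        c for c in captures
        if c.engine == engine
        and c.geography == geography
        and start <= c.captured_on <= end
    ]
    if not in_frame:
        return None  # no data is not the same as a 0% mention rate
    hits = sum(brand in c.brands_mentioned for c in in_frame)
    return hits / len(in_frame)


# Hypothetical sample data.
captures = [
    Capture("best crm for smb", "chatgpt", "US", date(2026, 4, 1), frozenset({"Acme", "Globex"})),
    Capture("best crm for smb", "chatgpt", "US", date(2026, 4, 2), frozenset({"Globex"})),
    Capture("best crm for smb", "perplexity", "US", date(2026, 4, 2), frozenset({"Acme"})),
    Capture("best crm for smb", "chatgpt", "DE", date(2026, 4, 2), frozenset({"Acme"})),
]
rate = mention_rate(captures, "Acme", engine="chatgpt", geography="US",
                    start=date(2026, 4, 1), end=date(2026, 4, 30))
print(rate)
```

Returning `None` for an empty frame is deliberate. A brand that was never tested on an engine should not be reported as invisible on it.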

Content optimization

Less mature. HubSpot recommends pages to create or update. AthenaHQ ships a citation engine. Goodie surfaces "optimization actions." Adobe ships Auto-Optimize for eligible Adobe Experience Manager and CDN setups. Semrush adds AI checks to its site audit. The measurable part is whether the recommendation was implemented and whether later answer captures changed. The speculative part is causality. Engines change, competitors change, models change, indexes update. A before-and-after chart is not enough.

Prompt-volume estimation

The most contested product claim in the category. Profound has the strongest public methodology because its Prompt Volumes data is sourced from real, anonymized, double opt-in panel conversations refreshed weekly across four engines. AthenaHQ surfaces numerical volume estimates. Semrush ships Prompt Research. HubSpot suggests prompts from company, competitor, and industry context. These help with prioritization. They are not search-volume equivalents. The universe is less observable, less standardized, and more private.

Competitive analysis

Every serious tool now offers share-of-voice comparison. The output is only as useful as the prompt set. "Who are the best CRM tools" is weak. "What CRM should a 70-person B2B SaaS company with HubSpot marketing and Salesforce sales use" is closer to a real buying journey. This is where operators should spend the most time editing prompts before trusting any dashboard.
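Share of voice has no single agreed formula across vendors. The sketch below assumes one simple definition: a brand's share of all brand mentions across the captured answers, counting each brand at most once per answer. The function name and sample brands are hypothetical.

```python
from collections import Counter


def share_of_voice(answers):
    """Map each brand to its share of all brand mentions across `answers`.

    `answers` is a list of brand lists, one per captured answer.
    A brand counts once per answer, however often it is repeated there.
    """
    counts = Counter()
    for brands in answers:
        counts.update(set(brands))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {brand: n / total for brand, n in counts.items()}


# Hypothetical captures from one prompt set.
answers = [
    ["Acme", "Globex"],
    ["Globex"],
    ["Acme", "Globex", "Initech"],
]
sov = share_of_voice(answers)
print(sov)
```

Run the same function over a generic prompt set and a buyer-specific one. If the two distributions differ sharply, the dashboard number says more about the prompt set than about the market.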

What is measurable, and what is vendor noise

AEO measurement today is synthetic monitoring plus answer parsing. Pick prompts, pick engines, run on a schedule, capture answers, parse mentions and citations, report visibility, sentiment, share of voice, and source domains. Otterly says this directly. Profound says its prompts can be auto-generated, manually uploaded, or pulled from Prompt Volumes, and its responses come from a RAG-based monitor across ChatGPT, Perplexity, Microsoft Copilot, and AI Overviews.

That makes citation rate and mention frequency measurable, but only inside a declared test frame. A sound statement: "Your brand was mentioned in 12% of answers for this 400-prompt set, across these engines, in these regions, during these dates." A weak statement: "Your brand appears in 12% of relevant AI searches." The second requires knowing the universe of relevant prompts, the distribution of real user behavior, personalization, geography, temporal change, and model routing. Few vendors disclose enough to back it.
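Even the sound statement carries sampling error. One standard way to show it is a Wilson score interval around the observed rate. This is a swapped-in textbook method, not something the vendors above are known to publish, and it assumes the captured answers behave like independent samples, which repeated runs of related prompts may not.

```python
import math


def wilson_interval(mentions: int, answers: int, z: float = 1.96):
    """Wilson score interval for a mention rate; z = 1.96 gives roughly 95% coverage."""
    if answers <= 0:
        raise ValueError("answers must be positive")
    if not 0 <= mentions <= answers:
        raise ValueError("mentions must be between 0 and answers")
    p = mentions / answers
    z2 = z * z
    denom = 1 + z2 / answers
    center = (p + z2 / (2 * answers)) / denom
    half_width = z * math.sqrt(p * (1 - p) / answers + z2 / (4 * answers**2)) / denom
    return center - half_width, center + half_width


# The example from the text: mentioned in 12% of a 400-answer set (48 of 400).
low, high = wilson_interval(48, 400)
print(f"{low:.3f} to {high:.3f}")
```

For 48 mentions in 400 answers the interval runs from roughly 9% to 16%. A month-over-month move from 12% to 14% on a set this size sits inside that band and should not be reported as a lift.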

The biggest technical problem is non-determinism. SparkToro and Gumshoe ran 2,961 prompts across ChatGPT, Claude, and Google's AI search surfaces with repeated trials. The same brand list came back less than 1% of the time. The same list in the same order came back less than 0.1% of the time. This does not make AEO measurement useless. It makes single screenshots useless, and it makes rank-style reporting fragile. Visibility percentage across repeated runs is the credible unit. Exact answer position is not.
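The difference between the fragile unit and the credible one can be sketched in a few lines. The code below takes repeated runs of a single prompt and reports per-brand visibility alongside how often two runs returned the same list. The function name and sample runs are hypothetical, and this is not the SparkToro and Gumshoe methodology.

```python
from collections import Counter
from itertools import combinations


def run_stability(runs):
    """Summarize repeated runs of one prompt.

    `runs` is a list of ordered brand lists, one per run.
    """
    if not runs:
        raise ValueError("need at least one run")
    n = len(runs)

    # Credible unit: share of runs in which each brand appeared at all.
    seen = Counter()
    for run in runs:
        seen.update(set(run))
    visibility = {brand: count / n for brand, count in seen.items()}

    # Fragile unit: how often two runs agree on the list, with and without order.
    pairs = list(combinations(runs, 2))
    if pairs:
        same_set = sum(set(a) == set(b) for a, b in pairs) / len(pairs)
        same_order = sum(a == b for a, b in pairs) / len(pairs)
    else:
        same_set = same_order = None  # a single run says nothing about variance

    return {"visibility": visibility, "same_set": same_set, "same_order": same_order}


# Hypothetical: four runs of the same prompt.
runs = [
    ["Acme", "Globex", "Initech"],
    ["Globex", "Acme", "Initech"],
    ["Acme", "Globex"],
    ["Acme", "Globex", "Initech"],
]
summary = run_stability(runs)
print(summary)
```

In this toy data Acme is visible in every run even though only one pair of runs matches in order. That is the pattern to report: stable visibility, unstable position.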

Vendor disclosures vary. Profound discloses the most on prompt-volume sourcing: panel basis, anonymization, multi-engine aggregation, weekly refresh, GDPR and CCPA language. HubSpot's free Grader discloses what data it sends to the engines and the five dimensions it scores, but its paid AEO product does not publish a sample-size model. Otterly discloses prompt-based monitoring, daily runs, and the reasons manual searches diverge from platform results: memory, history, location, personalization.

AthenaHQ, Goodie, Bluefish, Semrush, Ahrefs, and Adobe expose plan limits and feature classes, but rarely the statistical treatment of repeated runs, confidence intervals, prompt-set selection, or engine drift.

The second problem is the gap between visibility and influence. A citation in ChatGPT or AI Overviews can shape a buyer without producing a referral visit. Similarweb's flat-referrals-while-visits-grow chart is the warning. But influence on purchase is harder to measure than mention frequency. Teams need downstream signals: branded search lift, direct traffic, sales-call mentions, demo-form source notes, AI-referral sessions, assisted pipeline, and controlled content tests where possible.

The third problem is causality. If a vendor recommends a comparison page and the brand starts appearing more often two weeks later, the tool may be right. It may also be benefiting from a model update, fresh indexing, PR coverage, reviews, forum threads, competitor decay, or prompt-set changes. Citation rate is measurable with caveats. Purchase influence is measurable indirectly. Causal lift from any specific AEO action is the least measurable thing in the stack.
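A controlled content test is one way to narrow the causality gap. The sketch below assumes a difference-in-differences comparison: mention rates on prompts targeted by the change, against comparable prompts left alone, before and after. The function names and counts are hypothetical. The result is evidence, not proof, and it only holds if the control prompts would otherwise have moved the same way.

```python
def rate(mentions: int, answers: int) -> float:
    """Mention rate for one group in one period."""
    if answers <= 0:
        raise ValueError("answers must be positive")
    return mentions / answers


def diff_in_diff(treated_before, treated_after, control_before, control_after) -> float:
    """Change in the treated group's mention rate minus the control group's change.

    Each argument is a (mentions, answers) pair. Shifts that hit both groups,
    such as a model update, cancel out of the estimate.
    """
    treated_change = rate(*treated_after) - rate(*treated_before)
    control_change = rate(*control_after) - rate(*control_before)
    return treated_change - control_change


# Hypothetical counts: prompts covered by a new comparison page vs. untouched prompts.
lift = diff_in_diff(
    treated_before=(10, 100), treated_after=(25, 100),
    control_before=(12, 100), control_after=(20, 100),
)
print(round(lift, 3))
```

Here the treated prompts rose 15 points, but the control prompts rose 8 on their own, so the change attributable to the page is closer to 7 points than 15. A plain before-and-after chart would have claimed the full 15.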

Buy, build, or wait

There are three rational paths. The wrong move is buying a tool because the dashboard looks like a rank tracker. AEO dashboards are not rank trackers with new labels. They are sampling instruments for a volatile answer layer.

Buy a tool now

Buy when AI-answer presence is already a competitive question. That usually means B2B SaaS, marketplaces, ecommerce, agencies, education, finance, travel, and any consumer-adjacent category where buyers ask "best X for Y" before a demo or a purchase. The buyer profile is a marketing team with real content capacity, named competitors, budget above the low hundreds per month, and enough volume to care.

HubSpot AEO fits HubSpot-centric SMB and mid-market. Otterly or Semrush fits SEO teams that want affordable baseline monitoring. AthenaHQ, Goodie, Profound, Bluefish, Adobe, and Ahrefs Brand Radar fit teams that need scale, deeper competitive intelligence, multi-engine coverage, enterprise controls, or prompt-volume data.

The threshold condition is simple: the team has to be able to act on the findings. If no one can update content, earn citations, fix technical access, improve review surfaces, or brief sales on the gaps, the tool becomes expensive theater.

Build internal monitoring

Build when the use case is narrow, the team has technical capacity, or the vendor price exceeds the decision value. A focused B2B company can start with 100 to 300 curated buyer prompts, split by journey stage, product, competitor, and use case.

The internal system runs prompts on a schedule where terms allow, captures answers, parses brand and competitor mentions, logs citations, and stores raw outputs. It should repeat prompts often enough to estimate variance, not snapshot once. Build is also right when the team needs methodology control, like enterprise-procurement-only prompts or industry-specific compliance questions.
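A minimal sketch of that loop follows. It assumes the engine call is supplied by the caller as a plain function, shown here as the hypothetical `ask` parameter, because each engine's API and terms differ. Brand matching is a simple case-insensitive whole-word search, which will miss misspellings and product nicknames.

```python
import json
import re
from datetime import datetime, timezone
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s)\]>\"']+")


def find_mentions(answer, brands):
    """Brands that appear in the answer as whole words, case-insensitive."""
    return [
        b for b in brands
        if re.search(rf"(?<!\w){re.escape(b)}(?!\w)", answer, re.IGNORECASE)
    ]


def cited_domains(answer):
    """Unique domains of URLs found in the answer text, in order of appearance."""
    domains = []
    for url in URL_RE.findall(answer):
        host = urlparse(url.rstrip(".,;")).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host and host not in domains:
            domains.append(host)
    return domains


def run_prompts(prompts, brands, ask, engine, repeats=3):
    """Run each prompt `repeats` times and keep the raw answer with the parsed fields."""
    records = []
    for prompt in prompts:
        for attempt in range(repeats):
            answer = ask(prompt)  # `ask` is caller-supplied; no real API is assumed
            records.append({
                "captured_at": datetime.now(timezone.utc).isoformat(),
                "engine": engine,
                "prompt": prompt,
                "attempt": attempt,
                "mentions": find_mentions(answer, brands),
                "domains": cited_domains(answer),
                "raw": answer,  # store raw output so parsing can be redone later
            })
    return records


def append_jsonl(records, path):
    """Append records to a JSON Lines file, one capture per line."""
    with open(path, "a", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")


# Stand-in for a real engine call, for illustration only.
def fake_ask(prompt):
    return ("Try Acme CRM (https://www.acme.example/pricing) or Globex. "
            "See https://reviews.example/crm.")


records = run_prompts(["best crm for a 70-person saas team"],
                      ["Acme", "Globex", "Initech"], fake_ask, engine="demo")
print(records[0]["mentions"], records[0]["domains"])
```

Storing the raw answer next to the parsed fields matters more than the parser itself. When the brand list or matching rules change, the history can be re-parsed instead of re-run.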

The weakness is coverage. AI Overviews, AI Mode, ChatGPT, Perplexity, Gemini, Copilot, and Claude do not expose identical APIs or identical user experiences. Google reports AI feature traffic inside the normal Search Console "Web" type, with no separate AI Overviews breakout. Internal builds are good baselines. They rarely match specialized vendor coverage out of the box.

Wait, but run a baseline

Wait when the category is not yet valuable enough for paid tooling. That describes brands with low recognition, weak content foundations, no clear competitor set, no marketing owner, or no evidence that buyers use AI engines during discovery. Wait also when higher-return work is queued: crawlability, category pages, comparison content, reviews, case studies, knowledge-base hygiene.

Waiting is not ignoring. The minimum move is manual baseline testing: run the top 20 to 50 buyer prompts across ChatGPT, Perplexity, Gemini, and AI Overviews monthly, log whether the brand appears, record cited sources. If the brand is absent from every high-intent prompt and buyers mention AI research in calls, paid monitoring becomes easier to justify. If neither is true, the next dollar belongs in basic content and authority first.

The contested part is timing. AEO tools are early. The search shift is not fictional. Make the call on answer-visibility risk, buyer behavior, content capacity, and budget, not on vendor urgency.

Pitfalls and anti-patterns

Optimizing for citations that don't drive business value

A citation is not a conversion path. A brand can appear in low-intent answers and gain nothing. Separate awareness prompts from buying prompts. Track downstream: branded search, direct visits, AI-referral sessions, demo forms, sales-call notes, assisted pipeline. The Similarweb referrals plateau is the warning. AI influence can grow while referral traffic stays small.

Confusing AEO with SEO

The overlap is high but the target is different. Google says standard SEO best practices remain relevant for AI Overviews and AI Mode and that there are no extra technical requirements beyond being indexable and snippet-eligible. That does not mean nothing changed. Aleyda Solis names the practical differences: query fan-out, synthesis, content chunks, factual spans, citations, and mention-based metrics. Treat AEO as an extension of search and brand operations, not a replacement for SEO.

Over-investing in measurement before the category settles

Tool prices range from $29 a month to enterprise annual contracts. That is a sign of category immaturity, not just market segmentation. SEOFOMO's survey is a useful brake: practitioners report traditional SEO still drives more than 95% of SEO ROI in many contexts, while AI search traffic and revenue often sit at 5% or less.

Ignoring AEO because it feels speculative

The opposite error is assuming search traffic returns to its old shape. Pew, Ahrefs, Similarweb, and Semrush all point to a real shift in how users receive answers, even when they disagree on causality and magnitude. Waiting is fine. Blindness is not.

Buying vendor claims without methodology

Any visibility score that does not disclose prompt set, engine coverage, geography, refresh cadence, repeated-run strategy, and source-parsing rules is hard to act on. Ask: How were prompts selected? How many times are they run? Are runs personalized or neutral? Are citations, mentions, and recommendations separated? How are hallucinated citations handled? What is the variance over repeated runs? Without those answers, the score is a sales artifact.

Optimizing for one engine

ChatGPT, Gemini, Perplexity, Claude, Copilot, AI Overviews, and AI Mode behave differently. HubSpot's paid product covers three engines. AthenaHQ and Goodie list broader coverage by tier. Single-engine measurement can work for a narrow buyer base. It cannot stand in for AI-answer visibility as a category.

Treating answers as stable

SparkToro's repeated-prompt findings are the clearest warning. AI recommendations vary from run to run on the same prompt. Teams that chase exact answer order will waste time. Teams that track repeated visibility, source patterns, and prompt clusters will learn more.

What to validate before you spend

List the 20 to 50 prompts your buyers would actually ask before purchase, split by problem, category, comparison, pricing, integration, risk, and competitor.
Run them across ChatGPT, Gemini, Perplexity, and AI Overviews where available. Log brand, competitors, cited sources, sentiment accuracy, and whether the answer would help or hurt the buying journey.
Validate that AI-answer visibility matters to your funnel: AI referrals, branded search change, sales-call notes, demo-form source fields, customer interviews.
Confirm capacity. AEO data only matters if the team can update content, improve citations, earn third-party mentions, repair technical access, and measure downstream impact.
Set the budget threshold before the demo. If the expected value of better AI visibility cannot beat internal monitoring and basic content fixes, wait.

The answer layer is a sampling problem

The answer layer is not a new ranking surface. It is a sampling problem dressed as a dashboard.

If your tool cannot tell you what its prompt set represents, how often it ran, and how much the answer drifted between runs, it is selling you confidence, not measurement.

Key Takeaways

Attention is moving into AI interfaces faster than referrals are moving out. AI platform visits grew 28.6% year-over-year into January 2026 while AI referrals to external sites stayed flat. Plan for citations without clicks.
AEO is not separate from SEO and not reducible to it. Treat it as a measurement and content-operations layer on top of search, not a replacement for it.
Citation rate is measurable inside a declared test frame. Influence on purchase is measurable indirectly. Causal lift from a specific AEO action is the least measurable thing in the stack.
Single screenshots are useless. SparkToro's repeated-prompt research returned the same brand list less than 1% of the time. Track repeated visibility and source patterns; ignore exact answer position.
The buy decision turns on capacity, not vendor urgency. If no one on the team can act on the findings, the tool is theater. Run the manual baseline before the demo.

Methodology

This dossier reads every public product, pricing, and documentation page shipped by HubSpot, Profound, AthenaHQ, Otterly.AI, Goodie, Bluefish, Semrush, Ahrefs, and Adobe, and grades them against their own disclosed methodology. Vendors that publish a sample-size model, a repeated-run protocol, or a panel basis are credited. Vendors that ship a "visibility score" without one are not. The why-now is anchored in five independent datasets: Pew Research Center on click behavior under AI summaries, Ahrefs on click-through decay under AI Overviews, Similarweb on zero-click news searches and AI referrals, Semrush on AI Overview prevalence by query intent, and SparkToro and Gumshoe on repeated-prompt variance. The measurement critique is built on those five plus standard sampling logic, not vendor marketing. The category framing draws on the most rigorous practitioner work in the space: Mike King at iPullRank, Aleyda Solis, the SEOFOMO practitioner survey, and the 2023 GEO paper by Aggarwal and collaborators that introduced the academic frame. Where they disagree, the disagreements are named. Public prices are current as observed on April 29, 2026. No vendor demos, sandbox trials, or private references were used, and none were needed. Every claim above is sourced.
