AI Brand Visibility Might Be Measured Wrong. Here’s What 6 Months of Data Suggest.

Six months ago I started building an analytics platform for AI brand visibility. The problem I was solving: when a potential customer asks ChatGPT or Perplexity “what’s the best X for Y,” how does the model decide what to recommend? And can brands influence that?

Building the product meant designing a pipeline that runs buyer-intent queries across ChatGPT, Claude, Gemini, Perplexity, and Grok at scale. Thousands of queries, dozens of product categories, multiple languages. Every week, new data. And pretty quickly, the data started telling a story that contradicted what the market assumed about AI brand visibility.

A note on methodology

Before I get into findings, some context on how the platform collects this data.

The pipeline runs each query multiple times across all five models to account for non-determinism in LLM outputs. Forced ranking prompts use a structured template requesting a top-10 with reasoning for each position. Source attribution is tracked via URL extraction and named entity recognition. Deal breaker phrases are identified through frequency analysis across reasoning outputs. Over six months the platform has processed thousands of queries across multiple product categories and languages.
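Stripped down, one pass of that pipeline looks roughly like this. The model names, run count, and prompt wording below are illustrative stand-ins rather than the production values, and query_model is a placeholder for each provider’s real client:

```python
# Minimal sketch of one pipeline pass: the same forced-ranking prompt,
# run several times per model to average out non-determinism.
# All names and values are illustrative, not the production setup.

MODELS = ["chatgpt", "claude", "gemini", "perplexity", "grok"]
RUNS_PER_MODEL = 5  # repeated sampling smooths run-to-run variance

FORCED_RANKING_TEMPLATE = (
    "A buyer asks: {query}\n"
    "Return a ranked top-10 list of brands. For each position, give the "
    "brand name and one sentence of reasoning for why it sits there, "
    "citing the sources you relied on."
)

def query_model(model: str, prompt: str) -> str:
    """Placeholder: swap in the provider's real API client here."""
    return f"[{model} response to: {prompt[:40]}...]"

def collect_responses(query: str) -> dict[str, list[str]]:
    prompt = FORCED_RANKING_TEMPLATE.format(query=query)
    return {
        model: [query_model(model, prompt) for _ in range(RUNS_PER_MODEL)]
        for model in MODELS
    }
```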

This isn’t a peer-reviewed study. It’s operational data from a product that does this every day. But the patterns have been consistent enough to be worth sharing.

The binary trap

The current generation of AI visibility analytics works like this: send a query to a model, parse the response, check if a brand was mentioned. Aggregate across queries. Output a percentage. “Your brand was mentioned in 47% of AI responses.”
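In code terms, the entire measurement collapses to something like this (deliberately simplified):

```python
# The binary approach, reduced to its essence: a substring check
# averaged over responses. Position and reasoning are discarded.

def mention_rate(brand: str, responses: list[str]) -> float:
    hits = sum(brand.lower() in r.lower() for r in responses)
    return hits / len(responses)  # e.g. 0.47 -> "mentioned in 47%"
```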

That number feels useful. It goes up, you celebrate. It goes down, you worry. It fits neatly into a dashboard and a board deck.

The problem is that it tells you almost nothing actionable.

When a user asks ChatGPT to recommend a CRM for their 10-person startup, the model doesn’t just randomly scatter brand names across the response. It generates an implicit hierarchy. The first brand mentioned gets the most context, the most favorable framing, the most detailed description. By position five or six, brands get a single sentence. By position eight, they’re filler.

Users behave accordingly. The overwhelming majority of attention goes to the first three recommendations. Being mentioned at position #2 and being mentioned at position #9 produce the same visibility score in most tools. In practice, the conversion difference is enormous.

This is the binary trap: treating a ranked output as a binary signal.

What happens when you force the model to explain itself

Standard AI responses are generated in default mode. The model isn’t asked to rank. It isn’t asked to justify. It produces a natural language response and you reverse-engineer structure from it.

But language models have a different mode: chain-of-thought reasoning, sometimes called thinking mode. When you structure the query to activate it, the model processes significantly more information. In our pipeline, models like Grok return up to 80 cited sources in a single reasoning response. They build explicit argumentation. And critically, they explain why they place one brand above another.

This is the approach I chose as the architectural foundation for the platform. It costs 3-5x more per query than default mode. That’s economically feasible when you batch queries across customers, but prohibitively expensive if you run it manually for one brand at scale. That trade-off is what makes it a product problem, not a script-you-run-once problem.

The output is qualitatively different from standard scraping.

Instead of “here are some good CRMs,” you get a structured ranking where each position has a stated reason. A typical output might look like: Brand A is first because multiple review platforms and a recent industry comparison cite its API depth. Brand B is third because it leads in a specific integration ecosystem. Brand F is eighth because the same phrase – “limited features compared to alternatives” – appears across several sources the model considers authoritative.
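Inside the platform, each response parses into a structure along these lines (the field names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class RankedMention:
    brand: str
    position: int       # 1-10 in the forced ranking
    reasoning: str      # the model's stated justification
    cited_sources: list[str] = field(default_factory=list)  # URLs extracted from the reasoning

# What the Brand F example above would parse into:
brand_f = RankedMention(
    brand="Brand F",
    position=8,
    reasoning="Limited features compared to alternatives.",
    cited_sources=["https://example.com/old-review"],  # placeholder URL
)
```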

That last part turned out to be the most valuable signal in the entire dataset.

The deal breaker phenomenon

As the platform processed more data across categories, a pattern emerged that shaped how I think about the whole space.

Most brands don’t have dozens of problems in AI visibility. They have one. Maybe two. A single phrase or characterization that shows up across models, across queries, across languages. One description that the model latches onto and uses as the primary reason to rank the brand lower.

I started calling these deal breakers. The term stuck because that’s exactly what they are for conversion.

Here’s a real example from a customer’s data. A SaaS brand in a competitive category showed a stable 45-50% mention rate in standard visibility tracking. Their marketing team was cautiously optimistic. The trend line was slightly positive.

When we ran forced ranking with reasoning on the same query set, the brand was consistently at position #8-9 out of 10. And in the majority of responses, the reasoning cited the same phrase: “limited features compared to alternatives.”

We traced it back to three sources: a G2 review from 2023, a tech news article from 2024, and a Crunchbase company description. Three pieces of content, written at different times by different people, that independently used similar language. The model aggregated them into a single signal and applied it universally.

The brand’s marketing team had never seen this phrase. Their visibility dashboard showed 47% and a green arrow. The deal breaker was invisible to every tool they were using.
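For illustration, here’s roughly what the frequency analysis behind deal breaker detection looks like. This sketch only catches exact repeats; the real pipeline also has to normalize paraphrases:

```python
# Flag candidate deal breakers: n-grams that recur across a large
# share of reasoning outputs for the same brand.

from collections import Counter

def ngrams(text: str, n: int = 4) -> list[str]:
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def candidate_deal_breakers(reasonings: list[str], threshold: float = 0.5):
    counts = Counter()
    for r in reasonings:
        counts.update(set(ngrams(r)))  # set(): count each phrase once per response
    cutoff = threshold * len(reasonings)
    return [(phrase, c) for phrase, c in counts.most_common() if c >= cutoff]
```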

The most surprising finding wasn’t the deal breakers themselves. It was how stable they are. When I was designing the reporting cadence, I expected model outputs to vary wildly week to week. They don’t. Once a deal breaker phrase enters a brand’s narrative, it persists across 4-6 weeks of measurement until the underlying source is changed. Models are remarkably consistent at perpetuating the language they were exposed to. That stability is what makes weekly tracking meaningful and what makes the whole analytics model work.

How AI models actually build brand associations

The mechanics behind this are worth understanding if you’re trying to influence the outcome. This is also what drove most of the architectural decisions when I was building the platform.

Language models form brand recommendations through two parallel systems. The first is retrieval: the model (or its RAG pipeline) pulls content from external sources during inference. Ahrefs found that 62% of citations in Google’s AI Overviews come from pages outside the traditional top 10 search results. If Google’s own AI layer already looks beyond its own rankings, standalone assistants like ChatGPT and Claude, which aren’t anchored to Google’s index, cast an even wider net.

The second is parametric memory: associations encoded in the model’s weights during training. How often your brand appeared alongside category-relevant terms across the training corpus determines the strength of the association. This is why brands with strong presence on authoritative platforms (Wikipedia, major publications, industry-specific review sites) tend to show stronger baseline positions.

During generation, the model synthesizes both signals. The order of brands in the response reflects an internal relevance score that weighs authority, recency, specificity, and sentiment. In default mode, this score is implicit. In thinking mode, it becomes explicit.

This bifurcation has practical implications. Brands optimizing only for SEO miss the parametric memory layer entirely – strong recent rankings can’t override years of weak training data presence. Conversely, brands with strong Wikipedia and historical media presence can underperform in retrieval-driven engines like Perplexity if their structured data is poor. Understanding this split is what led me to build the platform around multi-model tracking rather than focusing on a single engine.

The bottom line: a single outdated review on a high-authority platform can override thousands of positive mentions on lower-authority sites. The model doesn’t count mentions. It weighs sources.
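A toy calculation makes the difference concrete. The authority weights below are invented purely for illustration; the models’ actual internal weighting isn’t directly observable:

```python
# Counting mentions vs. weighing sources, with invented numbers.
# (authority, sentiment) pairs; sentiment: +1 positive, -1 negative.

mentions = [(0.0005, +1)] * 1000      # 1,000 positive low-authority mentions
mentions += [(0.9, -1)]               # one negative high-authority review

count_score = sum(s for _, s in mentions)          # +999: looks great
weighted_score = sum(a * s for a, s in mentions)   # 0.5 - 0.9 = -0.4: net negative

print(count_score, round(weighted_score, 2))       # 999 vs -0.4
```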

The measurement gap

Adobe reports that one in four buyers now uses AI as a primary product research tool. Gartner projects continued decline in traditional search volume as AI assistants take share. Authoritas research shows that even ranking #1 on Google gives only a 33% chance of being cited in AI responses.

The channel is growing. The stakes are real. And the dominant measurement approach strips out the two pieces of information that matter most: where exactly you stand, and why.

It’s the equivalent of measuring your sales pipeline by counting how many times prospects mentioned your company name in any context. Technically data. Practically useless for decision-making.

What actually moves the needle

After six months of running the platform and watching how brands respond to the data, a few patterns became clear about what actually shifts position in AI recommendations. These are observations, not controlled experiments, but they’ve been consistent enough across categories and models to be worth sharing.

Structured data matters more than you’d expect. Schema.org markup (Product, Organization, FAQPage) gives models a machine-readable shortcut to understanding what your brand does. In our dataset, sites with complete Product schema tended to rank noticeably higher than comparable brands without it. The implementation cost is hours, not weeks. It’s one of the highest-ROI changes a brand can make for AI visibility.
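A minimal Product schema, generated here from Python as JSON-LD, looks something like this. The brand and values are placeholders; schema.org/Product documents the full vocabulary:

```python
import json

# Placeholder values throughout; adapt to the real product and brand.
product_schema = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "ExampleCRM",
    "description": "CRM built for sub-20-person startup teams.",
    "brand": {"@type": "Organization", "name": "ExampleCRM Inc."},
    "aggregateRating": {
        "@type": "AggregateRating",
        "ratingValue": "4.6",
        "reviewCount": "312",
    },
}

# Embed in the page <head> as a JSON-LD script tag.
script_tag = (
    '<script type="application/ld+json">'
    + json.dumps(product_schema, indent=2)
    + "</script>"
)
```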

Content extractability is a real factor. Language models cite what’s easy to cite. Numbered lists, comparison tables, standalone factual claims in 1-2 sentences. If your site is walls of prose with no structural hooks, the model will cite a competitor who made the answer easy to grab. We saw this pattern repeat across categories: the brand with the clearest, most parseable content on its site consistently outperformed brands with better products but worse content structure.

Entity clarity determines attribution. If your site says “we” and “our product” everywhere instead of your actual brand name, models struggle with entity association. We tracked several brands that increased their entity mentions on key pages and saw measurable position improvements within weeks. The brand that names itself clearly and consistently wins the attribution game.
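A crude check for this is easy to sketch, assuming all you want is the ratio of explicit brand naming to vague self-reference on a page:

```python
import re

def entity_clarity(page_text: str, brand: str) -> float:
    """Share of self-references that actually name the brand (rough heuristic)."""
    text = page_text.lower()
    brand_hits = text.count(brand.lower())
    vague_hits = len(re.findall(r"\b(we|our|us)\b", text))
    return brand_hits / max(brand_hits + vague_hits, 1)

# entity_clarity(page, "ExampleCRM") near 0 means the page mostly says
# "we"/"our"; models then struggle to attribute claims to the brand.
```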

Freshness signals affect ranking more than most realize. Models that incorporate retrieval (especially Perplexity) heavily weight publication dates and update timestamps. A comprehensive but undated comparison page loses to a thinner but recent one. Brands that regularly update their cornerstone content with visible dates tend to hold stronger positions in retrieval-heavy models.

And most importantly: deal breakers can be fixed. The SaaS brand from my earlier example updated their descriptions on three platforms, addressed the “limited features” narrative with detailed comparison content, and requested corrections on outdated reviews. Within a few report cycles, their positions improved meaningfully across models. That’s the kind of outcome that validated the whole approach for me.

What this data changed about how I think

If you’re spending any budget on AI visibility tracking, here’s the mental model shift I’d suggest.

Stop thinking about AI visibility as a score. Start thinking about it as a position with a reason. The score tells you that something is happening. The position and reasoning tell you what to do about it.

The brands that move fastest on this are the ones that treat AI model outputs as a feedback loop, not a leaderboard. They read the reasoning, find the deal breaker, trace it to the source, fix the source, and measure the result. It’s not unlike debugging. The bug is in the training data and retrieval sources, not in the model.

The brands that move slowest are the ones staring at a mention rate percentage, unsure whether 47% is good or bad, and unable to explain to their CEO what they plan to do about it.

That gap between the two is why I built this product. And six months in, the data keeps confirming that the gap is real.
