For the last few years, companies have been pouring money into AI tracking and AI visibility, hoping to gain a competitive edge in the evolving search landscape. Estimates place annual spending on this new breed of search analytics at over $100 million. But are these AI SEO tools actually delivering on their promise, or are they simply generating more noise?

Despite the significant investment, a critical question remains unanswered: are AI tools consistent enough in their recommendations to produce valid visibility metrics? While numerous studies explore AI accuracy in various contexts (we even used Carnegie Mellon’s research on LLM consistency as a model), almost none assess the consistency of product and brand recommendations from leading AI platforms like ChatGPT, Claude, and Google AI.

How can executives justify allocating substantial marketing budgets to AI tracking without addressing this fundamental question? The absence of research in this area is baffling, considering the potential impact on ROI and strategic decision-making.

Instead of simply lamenting the lack of data, we at AI Tech Insights decided to investigate. Partnering with Patrick O'Donnell of Gumshoe.ai, we embarked on a research project to uncover the truth behind AI-driven visibility. While Patrick’s connection to an AI tracking startup might raise eyebrows, his access to a large corpus of AI response data and his analytical rigor proved invaluable, and I brought my own skepticism to bear throughout. I assure you, the process wasn't without rigorous questioning and scrutiny.

Our Hypothesis: A Sea of Randomness?

Our central hypothesis was that AI tools produce such randomized lists of recommendations, and user prompts are so varied, that tracking brand or product rankings/visibility for a specific topic or user intent is ultimately futile. Furthermore, we suspected that companies with ample resources might be better off simply paying AI platforms directly for impression data through their forthcoming advertising products.

The Experiment: Probing the AI Mind

To test our hypothesis, we designed an experiment involving 600 volunteers who repeatedly ran the same AI prompts and meticulously recorded the responses. We focused on the three most popular AI tools in the US: ChatGPT, Claude, and Google search’s AI Overview (or AI Mode when Overviews were unavailable).

The volunteers executed 12 different prompts across these three platforms, resulting in a combined total of 2,961 responses. The collected data was then normalized into ordered product and brand results. One sample prompt was: “What are the top chef’s knives, brand and model, for an amateur home chef with a budget <$300?”

We deliberately chose prompts spanning multiple sectors and market sizes, ensuring that spaces with both larger and smaller pools of potentially recommended brands and products were represented.
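To make the normalization step concrete, here is a minimal sketch of how recorded responses for one prompt might be tallied into unique-entity counts. The brand names and data structure are illustrative assumptions, not the study's actual data or pipeline.

```python
from collections import Counter

# Hypothetical transcribed responses for one prompt: each run is an
# ordered list of recommended entities (brand + model strings).
# These example names are placeholders, not results from the study.
responses = [
    ["Victorinox Fibrox", "Mac MTH-80", "Tojiro DP"],
    ["Mac MTH-80", "Victorinox Fibrox", "Wusthof Classic"],
    ["Tojiro DP", "Victorinox Fibrox", "Mac MTH-80"],
]

# Tally how often each entity appears across all runs of this prompt,
# and count how many distinct entities were ever recommended.
mentions = Counter(entity for resp in responses for entity in resp)
unique_entities = len(mentions)

print(unique_entities)           # distinct entities recommended
print(mentions.most_common(3))   # most frequently recommended entities
```

A tally like this is what feeds the unique-brand counts shown in the visualization that follows.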

The Results: A Kaleidoscope of Recommendations

The sheer variety of brand and product combinations generated by the AI tools was striking. The following visualization illustrates the number of unique brands, products, and entities recommended by ChatGPT (green), Claude (orange), and Google AI (blue) in response to the 12 prompts:

[Insert Visualization Here - Description: A bar graph showing the number of unique brands/products recommended by each AI platform for the 12 prompts. The x-axis lists the prompts. The y-axis represents the number of unique entities recommended. A secondary y-axis shows the average number of responses, indicated by scatterplot dots.]

As the visualization demonstrates, the range of recommendations varies significantly depending on the prompt. However, this variation tracks the size of the candidate pool for a given topic in the AI's training data. For instance, there are far fewer Volvo dealerships in Los Angeles than recently published Science Fiction novels.

The pink scatterplot dots, plotted on the secondary axis, represent the average number of responses provided by the AI tools. This further complicates the challenge of tracking rankings and visibility.

Inconsistency Reigns Supreme

Our analysis revealed a stark reality: if you ask an AI tool for brand or product recommendations multiple times, nearly every response will be unique. This inconsistency manifests in three key ways: the specific brands/products recommended, the order in which they are listed, and the accompanying descriptions or sentiment expressed.

Quantifying this inconsistency, our data shows that ChatGPT or Google AI, when prompted 100 times, have less than a 1% chance of producing the same list of brands in any two responses. Claude exhibits slightly more consistency, but even then, the likelihood of generating the same list twice in 100 runs remains low.

Furthermore, the ordering of recommendations is even more random. You would need to run the same prompt approximately 1,000 times to observe two lists in the same order. And we didn't even try to analyze the variance in how the AIs describe the recommendations or how positive/negative the sentiment was.
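The two agreement measures above can be sketched with a small helper that compares repeated runs of one prompt both as unordered brand sets and as exact ordered lists. The function and the toy data are assumptions for illustration, not the study's actual analysis code.

```python
from itertools import combinations

def pairwise_match_rate(responses, ordered=False):
    """Fraction of response pairs that contain the same brands.

    ordered=False compares runs as unordered sets (same brands, any order);
    ordered=True requires the exact same sequence.
    """
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 0.0
    if ordered:
        same = sum(a == b for a, b in pairs)
    else:
        same = sum(set(a) == set(b) for a, b in pairs)
    return same / len(pairs)

# Toy runs of one prompt (hypothetical brand lists, not study data).
runs = [
    ["A", "B", "C"],
    ["B", "A", "C"],   # same brands, different order
    ["A", "B", "C"],   # exact repeat of the first run
    ["A", "C", "D"],   # different brand set
]

print(pairwise_match_rate(runs))                # set-level agreement
print(pairwise_match_rate(runs, ordered=True))  # order-level agreement
```

On real data, the set-level rate corresponds to the "same list of brands" figure, and the much lower order-level rate to the "same order" figure quoted above.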

The Verdict: Proceed with Caution

The bottom line is clear: AI tools do not provide consistent lists of brand or product recommendations. This inconsistency raises serious concerns about the validity of using these tools for tracking brand visibility and informing marketing strategies.

If your brand doesn't appear where you want it to, simply asking the AI tool a few more times might yield a different result. This inherent randomness presents a challenge for marketers seeking reliable and consistent insights.

As I mentioned on Gaetano DiNardi’s LinkedIn post, AI visibility "experts" could easily exploit this inconsistency. Just as unethical SEO practitioners did in the past, manipulating AI prompts to achieve desired outcomes is a real possibility. This raises ethical concerns and underscores the need for transparency and critical evaluation of AI-driven visibility metrics.

Moving Forward: A Call for Rigor

Our research highlights the need for a more rigorous approach to evaluating and utilizing AI SEO tools. Before investing heavily in these technologies, businesses must demand evidence of their consistency and reliability.

Further research is needed to explore the factors that influence AI recommendation variability and to develop methodologies for mitigating its impact. Until then, proceed with caution and critically evaluate the results generated by AI SEO tools. The allure of AI-powered insights is strong, but it's crucial to discern between genuine value and mere noise.