In June 2024, the authors first described how randomness works in AI and proposed frameworks to measure brand performance in AI as part of a Reforge Webinar. Specifically, we proposed running multiple prompts multiple times to estimate the probability distribution of a brand's presence in responses, now commonly referred to as “visibility.”
Since then, over 70 tools have launched to measure visibility. Nearly every company we speak with is using one of these tools to measure its performance.
However, there is increasing scrutiny of how accurate these tools are, and confusion about how randomness in AI works.
In this paper, we build on our previous work to demystify how randomness works in AI. We explain that AI often provides different responses to the exact same prompt because randomness is built into how Large Language Models (LLMs) generate text, and show that measuring AI responses is no different from measuring everyday stochastic processes such as the weather or commute times. By applying basic statistics to a sample of responses, we can successfully estimate a brand's visibility and ranking.
Key Takeaways
1. AI responds randomly, but the randomness is predictable. Responses are generated from a probability distribution, and the generative process is easy to understand even without a technical background.
2. Measuring visibility is straightforward. We need to generate multiple responses, but as few as 10 are enough for a quick estimate on entity comparison prompts.
3. Measuring ranking is also straightforward. Again, a sample of 10 responses is enough for a quick estimate.
4. For cases that require more precision, we can use statistical tools to assess the accuracy of ranking and visibility estimates, and use sequential sampling to efficiently determine the right number of responses, based on the desired precision and our risk tolerance.
5. Responses from APIs, logged-out accounts, and logged-in accounts can vary significantly. Prompt tracking tools either track logged-out accounts or use APIs. This means that while prompt tracking tools provide directional data, they should not be used as ground truth.
How to Measure Visibility and Ranking
1. Do not rely on a single response. Instead, run each prompt at least 10 times for a quick estimate.
2. For more precise estimates, rather than gathering large numbers of responses by default, use sequential sampling: add responses iteratively until the confidence interval is sufficiently tight.
3. Use confidence intervals or statistical tests to decide if changes are real (make a copy of the provided sheet). Generally, be skeptical of small differences in visibility and ranking, especially for responses that do not use web search.
4. Track weekly or biweekly, not daily.
5. Use a new chat for each prompt and disable memory.
6. Focus on the right metric for your situation:
- Visibility (how often a brand appears) is the best initial metric, especially if your brand is not yet consistently mentioned.
- Ranking (how prominently a brand appears when it does appear) matters more when a brand already appears frequently, and the focus is on its position relative to competitors.
7. Supplement prompt tracking tools by also manually gathering responses from AI with a logged-in account.
LLMs Typically Generate Different Responses to the Same Prompt
How does ChatGPT respond to “What are the best flavors of ice cream?”
If we ask multiple times (making a new chat each time), we get different responses. Here are three distinct responses from gpt-5.2-chat-latest using the OpenAI API. We highlight the flavors that occur only in a single response in red and those that occur in two responses in blue.
We generate 200 responses and show the top 30 most frequently mentioned flavors (after grouping similar flavors) below, with the percentage of responses in which they occur, also known as visibility. Some flavors, such as vanilla, chocolate, and cookies & cream, have a visibility of 100%, meaning they appear in all responses. Others, like fudge brownie and coffee, have visibilities of 94% and 49%, respectively. Long-tail flavors (outside the top 30), such as s’mores and earl grey, have visibilities of 4% and 1%, respectively. In total, there are 57 unique flavors in the 200 responses. We show the top 30 in the plot below.

The order in which the flavors appear in the responses also varies. Vanilla and chocolate rank first and second, respectively, across all 200 responses. Salted caramel always appears, but in different positions, with an average ranking of 7.3. In the 49% of responses in which coffee appears, its average ranking is 15.5.
If every answer is different, are visibility and ranking impossible to measure?
Visibility is Measurable
If we look only at one response (as some tools do), for example, the one in the rightmost column of the table above, we would think that s’mores appears as often as vanilla. In reality, s’mores appears in only 8 of the 200 responses, while vanilla appears in every response.
If we look at the first 10 responses instead, our visibility estimates are 100% for vanilla and 10% for s’mores, which is much closer to what we see in 200 responses. As with other statistical processes, our estimates improve as we use more data.
In the following figure, we show how the visibility values change with the number of responses. Notice how the estimates stabilize as we add responses, but with diminishing returns. The visibility estimates after 10 responses are not substantially different from the estimates with 200 responses.

Statistics also provides us with tools to compute margins of error and evaluate whether differences are significant or could be due to noise.
With an estimate of 10%, we can be 95% confident that the true visibility for s’mores is between 1.79% and 40.42%, and that the true visibility of vanilla is between 72.25% and 100.00%. Since these ranges do not overlap, we can conclude that the difference in visibility between vanilla and s'mores is statistically significant, even though we only examined 10 responses.
Ranking is Measurable
Ranking is measurable in the exact same way.
Here we visualize the rank distribution for cookies & cream and coffee. Cookies & cream appears in every response, most frequently at rank 5.

Coffee appears in 49% of responses and most frequently at ranks 16-18, though the rank distribution is more spread out.

As with visibility, the average ranking estimates stabilize as we add responses. Here, we only consider entities with visibility greater than 10%. As we discuss later, it does not make sense to measure ranking with entities that appear infrequently.

“What are the best flavors of ice cream?” is also highly subjective, and AI responses vary significantly; the metrics converge much more quickly for “What is the best board management platform for nonprofits, and how do their security and compliance features stack up?”

We explore how statistics can help with prompt tracking in detail below. But first, we explain why ChatGPT responds differently to the same prompt.
How LLMs Generate Text
It is generally understood that LLM responses differ, but the variation is often attributed to incoherence or to the AI “changing its mind.” The real reason is that generation involves randomly sampling from a probability distribution. We next explain how Large Language Models (LLMs) generate text.
Large Language Models
LLMs are neural networks built on the transformer architecture, trained on massive amounts of text from the internet. During training, the model learns patterns about which words tend to follow other words in different contexts. This knowledge is encoded in billions of numerical parameters called weights. After initial training, the model is further refined to follow instructions and hold conversations.
In this article, we focus on generating text from a trained LLM. While the training process is complex, the generative process is easy to understand without a technical background.
Next Word Prediction
Large language models (LLMs) generate text one token at a time. A token is a sequence of characters that is typically smaller than a word. For this article, the distinction is not critical, so we will assume token = word to make the examples easier to understand.
To generate the next word, an LLM first predicts a probability distribution over all possible words, based on the previous words. The better the word “fits” after the previous words, the higher its probability. The sum of the probabilities over all possible words is 1. Most words will have probabilities very close to 0.
For example, consider the prompt “What is the best CRM?”
Suppose that the LLM has already generated a few words of the response, and is now generating the word represented by ___.
“The best CRM is ___”
The next word is likely to be the name of a CRM. For example, we may have next word probabilities as follows:

Note that “chocolate”, which does not fit semantically or grammatically, receives a probability very close to 0.
To select the next word, the generative process randomly samples one according to this probability distribution. Think of this step as rolling a (giant) weighted die, where there is one face for each possible word, and the weight, or chance of rolling each word, is proportional to its probability. That is, since “Salesforce” has a probability of 0.31, it will be selected as the next word roughly 31% of the time, and Zoho will be selected as the next word roughly 25% of the time.
After the word is selected, it is added to the context (the preceding words), and the process is repeated. Suppose the algorithm randomly selected “HubSpot” above.
“The best CRM is HubSpot ___”
A period or a preposition (such as “for”) has the highest probability of being generated next.
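The weighted-die step is easy to sketch in code. The distribution below uses the illustrative probabilities from the example (with the remaining mass lumped into an "other" bucket); these values are assumptions for the demo, not real model outputs.

```python
import random

random.seed(0)

# Illustrative next-word distribution for "The best CRM is ___".
# These probabilities are assumptions for the example, not model outputs.
next_word_probs = {
    "Salesforce": 0.31,
    "Zoho": 0.25,
    "HubSpot": 0.20,
    "Pipedrive": 0.10,
    "other": 0.14,  # mass spread over the rest of the vocabulary
}

def sample_next_word(probs: dict) -> str:
    """Roll the weighted die: pick one word according to its probability."""
    words = list(probs)
    weights = list(probs.values())
    return random.choices(words, weights=weights, k=1)[0]

# Over many rolls, each word is chosen in proportion to its probability.
counts = {w: 0 for w in next_word_probs}
for _ in range(10_000):
    counts[sample_next_word(next_word_probs)] += 1
print({w: round(c / 10_000, 2) for w, c in counts.items()})
```

Across 10,000 rolls, “Salesforce” comes up roughly 31% of the time and “Zoho” roughly 25%, matching their weights.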
Temperature
LLM APIs often have a configuration parameter called the temperature. A lower temperature value makes the probability distributions more concentrated on the most likely tokens, while a higher temperature value makes them more uniform. We can use this parameter to increase or decrease the “creativity” of the generations. However, ChatGPT and other consumer-facing AI assistants do not allow users to adjust the temperature and other sampling parameters, such as top-k and top-p, and the random seed, so we do not consider them for the remainder of this article. In our experiments, we use the default API temperature.
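To make the effect concrete, here is a minimal sketch of softmax with temperature over three made-up logits: lower temperature concentrates probability on the top token, and higher temperature flattens the distribution.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; lower T sharpens, higher T flattens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate words
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```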
Inputs to Next Word Probability Calculation
The inputs to the next word probability calculation are:
- Preceding words in the context window, including
- System prompt
- User prompt
- Previous messages in the chat
- Reasoning tokens
- Personalization: messages from previous chats or model knowledge about the user
- Output of a tool, for example, content retrieved for retrieval augmented generation (RAG), or a response from a sub-agent
- Previously generated words in the response
- LLM weights
What Causes Responses to the Same Prompt to Be Different?
Now that we know how LLMs generate text, we review the factors that could lead to different responses to the same prompt.
Sampling Changes Responses
First, note that the randomness (rolling the weighted die) in the generative process above means the responses will almost always differ, even if the probabilities are identical.
Previously Generated Words Change Responses
The response could also differ if the probability distributions over the next word change. The probability distributions depend on the preceding words in the context window and the LLM weights.
Randomness Compounds (Autoregressive Stochastic Process)
Because the random selection of words changes subsequent probability distributions, the randomness in selecting each word “compounds” to change the later probability distributions. Early random selections can take the response in drastically different directions.
To illustrate, suppose we ask a yes/no question whose highest-probability first words are “Yes” and “No.” The explanation or justification that follows is determined by that initial random selection. For example, if we ask “Is it wrong to steal food to feed your starving family?” (instructing the model to answer only yes or no, followed by a justification, to avoid the default “it depends” behavior), we see that it sometimes selects “yes” and sometimes “no”, and the rest of the response depends on this choice.

Random token selections and the compounding of that randomness account for the vast majority of variation in responses.
Changes to Other Inputs
The other components that determine the probability distributions change less frequently than the random selections, which change with every generation.
We assume that the prompt is constant.
Tool Outputs and Reasoning Change Responses
Tool outputs change responses because they are added to the context. The most common example is the web search tool, which retrieves additional content from the web and adds it to the context window.
Web search results may change frequently as new content is published. However, ChatGPT, the most popular AI assistant, frequently does not use the search tool. Based on Graphite’s prompt tracking tool, in ChatGPT, about 10% of prompts trigger a web search when logged out, and about 50% do so when logged in. Search engines are also generally stable, in that the results do not change drastically day to day. For example, in our Google SERP database, on average, 6.2 of 10 results remain the same over a three-month period.
Reasoning and responses from sub-agents also change responses, though these appear less frequently in consumer-facing AI assistants.
Personalization Changes Responses
Personalization changes probabilities by incorporating additional context, so different users can expect different responses. In traditional search, personalization is limited to specific intents, like local search, but little is known about how much personalization affects information-seeking requests in AI. We discuss this further in the API, Logged Out, and Logged In Responses section.
System Prompt and LLM Weights Change Responses When Updated
We do not know how often the system prompt is changed, but doing so likely requires extensive testing, so we do not expect it to change very frequently.
The LLM weights also change infrequently. Pre-training the base LLM is extremely expensive and may occur only once a year. (This is why LLMs often have surprisingly distant knowledge cutoffs.) Fine-tuning is significantly cheaper, and newly fine-tuned LLMs are likely to be released more frequently; however, as with the system prompt, these releases will likely require extensive testing. In the OpenAI API, for example, new named versions of LLMs are typically released every few months, and AI companies announce major model releases. For example, GPT-5 was released on August 7, 2025, GPT-5.1 was released on November 12, 2025, and GPT-5.2 was released on December 11, 2025.
There may be slight differences in probabilities due to implementation details, hardware, non-deterministic approximations, and related factors, but we do not focus on these here.
Implications for Prompt Tracking
We now understand that LLM responses are randomly selected from a probability distribution, and we know what inputs affect that distribution. What are the implications for prompt tracking?
Recent reports have highlighted inconsistencies in prompt tracking tools. Many in the community are questioning whether visibility and ranking are too difficult to measure. In the following sections, we conduct experiments and provide statistical tools to empirically derive recommendations for prompt tracking.
Measure Using a Sample of Responses
Responses vary, but the underlying probability distributions are stable and measurable. To measure them, we have to prompt multiple times and take the average.
We proposed this idea in June 2024. We explained that, unlike traditional search, which has low position variance, LLM responses can vary substantially. We proposed tracking the percentage of time a brand appears relative to its competitors by running the same prompt multiple times, rather than just once, as in traditional search. Additionally, we described running different prompt variations across multiple LLMs. We suggested combining the multiple runs, prompt variants, and surfaces into a single metric we called “share of AI answers.” This is now commonly referred to as visibility: the percentage of responses in which a brand appears.

We next discuss tracking visibility and ranking in detail.
We mostly use the more general term “entity” in place of brand or product going forward.
Tracking Visibility
First, note that it does not matter that entire responses and lists of products mentioned typically differ across runs. Imagine flipping a set of weighted coins simultaneously. The exact set of coins that land heads will differ every time, but that does not mean the system is unmeasurable. Each coin has a bias, and by running the experiment many times, we can estimate the probability of heads for each coin. Similarly, while individual ChatGPT responses vary in which products appear, the visibility of a given product across many responses is stable and measurable.
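This analogy is easy to simulate. The coin biases below are hypothetical; the point is that the set of heads differs on every flip, while the per-coin frequencies converge to the biases.

```python
import random

random.seed(42)

# Hypothetical coin biases: each entity's true probability of appearing.
biases = {"vanilla": 1.00, "salted caramel": 0.80, "coffee": 0.49, "s'mores": 0.04}

def flip_all(biases: dict) -> set:
    """One 'response': the set of coins that land heads this time."""
    return {name for name, p in biases.items() if random.random() < p}

print(flip_all(biases))  # individual responses differ...
print(flip_all(biases))

# ...but frequencies across many responses recover each coin's bias.
n = 1000
counts = {name: 0 for name in biases}
for _ in range(n):
    for name in flip_all(biases):
        counts[name] += 1
estimates = {name: round(counts[name] / n, 2) for name in biases}
print(estimates)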
Note that we focus on entity comparison prompts, which are the primary use case for prompt tracking tools. Results may differ for other prompt types.
How many times should we run each prompt? One is not enough, but how many responses do we need? Empirical analysis and statistical tools allow us to answer these questions precisely.
More data increases the expected accuracy of an estimate, though with diminishing returns. Since prompt tracking has a cost, we would like to generate only the responses necessary to get the estimate we need. In many settings, rough estimates are sufficient. If the goal is to understand whether a brand appears frequently or rarely, distinguishing between 70% and 80% visibility matters less than distinguishing between 70% and 10%. Similarly, when tracking trends over time, what matters most is detecting significant changes, e.g., 30% to 60%, rather than pinpointing the exact value in any given week. The actions that result from prompt tracking, such as adjusting content or messaging, typically depend on coarse thresholds rather than precise numbers.
Even if a small sample is not precise enough for a specific need, the statistical tools we describe below allow us to quantify errors and determine how many responses are required.
Visibility Estimation Error
We conduct an experiment to measure the mean absolute error (MAE) in visibility estimates as a function of the number of responses. For example, if the estimated visibility is 5% and the true visibility is 15%, the absolute error is 10%. The MAE for a prompt is the average absolute error across all entities. We use this prompt-level metric because we are typically concerned with a brand's visibility relative to other competing entities. The MAE captures how well a sample of responses measures the visibility landscape for a prompt.
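As a sketch, the prompt-level error computation looks like this. The visibility values are hypothetical, and entities missing from a sample are treated as 0% visibility.

```python
def visibility_mae(estimated: dict, truth: dict) -> float:
    """Mean absolute error across all entities for one prompt."""
    entities = set(estimated) | set(truth)
    errors = [abs(estimated.get(e, 0.0) - truth.get(e, 0.0)) for e in entities]
    return sum(errors) / len(errors)

# Hypothetical visibilities from a 10-response sample vs. the full 200 responses.
sample_vis = {"vanilla": 1.00, "coffee": 0.40, "s'mores": 0.10}
true_vis = {"vanilla": 1.00, "coffee": 0.49, "s'mores": 0.04}
print(f"MAE: {visibility_mae(sample_vis, true_vis):.2%}")  # (0 + 9% + 6%) / 3 = 5.00%
```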
The dataset consists of 200 entity comparison question prompts from three sources: prompts tracked in Graphite’s prompt tracking tool, prompts generated from editorial search keywords, and manually written prompts on topics of general interest, like “What are the best science fiction books right now?” We obtain 200 responses for each prompt from the gpt-5.2-chat-latest model via the OpenAI API and extract entities that are answers to the question, as described later in the Entity Extraction section. (Note, we use the API for this experiment to get large numbers of responses, but see also the “API, Logged Out, and Logged In Responses” section.) We provide the full dataset of all responses and the entities.
We subsample these 200 responses 100 times, with different sample sizes, to simulate obtaining visibility metrics with fewer responses. We treat visibilities estimated from 200 responses as the “ground truth”, though of course they are also estimates. Because the subsamples are drawn from the same 200 responses, the errors shown may slightly understate the true error. However, this experiment illustrates the diminishing returns of generating additional responses.

In the figure above, we plot the mean absolute error (solid line) and one standard deviation above and below the mean (shaded area) to visualize the diminishing returns. The curve starts to flatten around 10 responses, at which point the mean absolute error is 5.2%, which is likely accurate enough for most use cases. To make sure the MAE is not overly biased by low-visibility entities, we also compute the MAE only considering entities with visibilities of at least 10%. The error is slightly higher in this setting, 8.4% mean absolute error, but the shape of the curve is exactly the same.
We also ran this experiment using Gemini 3 Flash and got similar results. With 10 responses, MAE is 5.0% with all entities, and 7.8% with entities with visibility >= 10%. The shapes of the curves are also very similar. So these results are not specific to ChatGPT.

We can also look at what percentage of prompts have a mean absolute error of less than or equal to 10% and 5% with a given number of responses.

Here we see that 99.6% of prompts have a mean absolute error <= 10% after only 10 responses, and 99.8% of prompts achieve a mean absolute error <= 5% after 40 responses.
This empirical evidence supports our recommendation of 10 responses as a starting point for most use cases when tracking entity comparison prompts. When greater precision is required, the following sections provide statistical tools to quantify the margin of error for any sample size and determine exactly how many responses are needed to achieve a target accuracy.
Visibility Confidence Intervals
Confidence intervals are statistical margins of error on estimates. To compute a confidence interval for an individual visibility value, note that the number of responses in a sample of size n that contain an entity can be modeled using a Binomial distribution, where p is the probability of the product appearing in an individual response.
For a technical audience: p is the marginal probability that an entity appears in a response at a particular time, assuming the distribution is stationary: the context is fixed and there are no changes to the model during that time. The Binomial model assumes responses are independent. We verified this by partitioning the 200 responses for each prompt into independent batches and comparing the observed variance of visibility estimates across batches to the expected Binomial variance. The median observed/expected variance ratio is 1.01, confirming that responses from independent API calls behave as independent draws.
We can compute a confidence interval for our estimate of p using the Wilson score interval (formula in the Appendix). For example, if 16 of 20 responses mention a product, the 95% confidence interval for the visibility is 58.40% to 91.93%. This means we are 95% confident that the true visibility value is in this range. However, a 95% confidence level may be unnecessary, depending on our risk tolerance. If we drop the confidence level to 80%, the confidence interval is tighter: 66.47% to 88.98%.
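A minimal implementation of the Wilson score interval reproduces the numbers above:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score confidence interval for a Binomial proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half_width, center + half_width

lo, hi = wilson_interval(16, 20)  # 95% confidence (z = 1.96)
print(f"95% CI: {lo:.2%} to {hi:.2%}")  # 58.40% to 91.93%
lo80, hi80 = wilson_interval(16, 20, z=1.2816)  # 80% confidence
print(f"80% CI: {lo80:.2%} to {hi80:.2%}")  # 66.47% to 88.98%
```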

Note that while the MAE suggests a sample of 10 responses is sufficient on average, individual entities may have larger errors. Confidence intervals provide entity-level precision estimates.
Do Some Prompts Need More Responses?
The similarity of responses varies across prompts. One way to measure the similarity of responses for entity comparison question prompts is how similar the set of entities mentioned is across the responses. Here, we use the mean cosine similarity between binary entity vectors for pairs of responses. The higher the similarity, the more entities the responses have in common. The following figure shows that the prompts vary in how similar their responses are.
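For two responses represented as sets of mentioned entities, the cosine similarity between their binary entity vectors reduces to the shared entity count divided by the geometric mean of the set sizes. A sketch with hypothetical entity sets:

```python
import math

def entity_cosine_similarity(a: set, b: set) -> float:
    """Cosine similarity between the binary entity vectors of two responses."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

r1 = {"vanilla", "chocolate", "strawberry", "pistachio"}
r2 = {"vanilla", "chocolate", "coffee"}
print(f"{entity_cosine_similarity(r1, r2):.3f}")  # 2 / sqrt(4 * 3) ≈ 0.577
```

The per-prompt metric is then the mean of this value over all pairs of responses.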

Here are examples of prompts with low, medium, and high entity similarity.
There are no entities that are always present in responses to “What are the best science fiction books right now?”, and there are many entities that have moderate visibility.

“What are the best hotels in Mexico City?” has more visibility concentrated in the top entities.

There is a small set of high-visibility entities for “What is the best board management platform for nonprofits, and how do their security and compliance features stack up?”

We can also visualize how these visibility values converge.



Should we handle prompts with more diverse responses differently?
Yes, but the distribution of visibility values is more important than the number of entities.
Number of Entities
With more long-tail answer options, we expect lower visibility values. The equations above are still correct, but the relative errors (the widths of the confidence intervals relative to the visibility values) may be large. However, whether a product appears 1% or 5% of the time may not be meaningful; the difference between 25% and 75% is meaningful.
Therefore, having more entities does not necessarily require larger sample sizes. The same sample size provides the same absolute precision; the high-visibility entities are estimated just as accurately as in responses with few entities.
Skew of Visibility Values
Visibility confidence intervals are widest when the visibility is close to 50%, because the variance of a Binomial proportion, p(1 − p)/n, is largest at p = 50%. Values close to 0% or 100% are easier to estimate; in other words, visibility estimates for the entities that appear very frequently and very rarely converge faster.
Empirically, in our dataset, we find that more entities have visibility values closer to 0% or 100% than to 50%. Therefore, a small number of responses, such as 10, is often sufficient because the worst case for the Binomial distribution (visibility near 50%) is relatively infrequent in practice. This explains why we can often achieve a low mean absolute error with fewer responses than expected.

We next plot the number of responses required to achieve MAE <= 10% vs. the mean entropy of the Bernoulli distribution (single coin flip) across all entities for each prompt. A higher value means the visibility values are, on average, closer to 50%. We find that, indeed, prompts with higher average entropy require more responses to obtain a low mean absolute error.

Prompts from our internal prompt tracking tool, which are focused on brands, typically have lower average entropy than other prompts in our dataset (0.584 vs. 0.631, out of a maximum of 1.0), indicating they are easier to estimate. Highly subjective questions on general interest topics, like “What are the best restaurants in Oakland right now?”, are harder to estimate.
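For reference, the Bernoulli entropy used here is H(p) = −p·log2(p) − (1 − p)·log2(1 − p), which peaks at 1.0 bit when visibility is 50% and falls to 0 at 0% and 100%. A sketch with hypothetical visibility profiles:

```python
import math

def bernoulli_entropy(p: float) -> float:
    """Entropy (in bits) of one weighted coin flip; maximum 1.0 at p = 0.5."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mean_prompt_entropy(visibilities) -> float:
    """Average Bernoulli entropy across the entities of one prompt."""
    return sum(bernoulli_entropy(v) for v in visibilities) / len(visibilities)

# Hypothetical profiles: visibilities near 0%/100% vs. near 50%.
concentrated = [1.0, 0.95, 0.05, 0.02]
diffuse = [0.50, 0.45, 0.60, 0.55]
print(f"{mean_prompt_entropy(concentrated):.3f} vs {mean_prompt_entropy(diffuse):.3f}")
```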
For these higher-entropy prompts that require more responses, we can use sequential sampling, described next, to efficiently determine the appropriate number of responses.
Prompt-Specific Sample Sizes with Sequential Sampling
In cases where we need more accurate estimates, we can use sequential sampling, so that different prompts get different numbers of responses. If we care about a specific entity getting a confidence interval of width E, we can start with a few responses and iteratively add more until the width of the confidence interval falls below E.
For example, if the 95% confidence interval of 58.40% to 91.93% for visibility 80% with 20 responses is too wide, and we want a width less than 25%, we can try iteratively adding 10 responses:
- 16/20 => width: 33.54%
- 23/30 => width: 29.14%
- 33/40 => width: 23.20% STOP
We could also stop when the average or max width of the confidence interval across all entities reaches a target error width.
For example, suppose we wanted visibility MAE below 5% (a low target) on our dataset. The standard approach would be to use the same number of responses for each prompt, such as 60 or 100 (100 being the number of responses where all prompts achieve visibility MAE less than 5%). Using sequential sampling, we could do this with 24.8 responses per prompt on our dataset, a 59% reduction in the number of responses we need to gather when compared to 60 responses per prompt.
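The stopping rule can be sketched as a loop around the Wilson interval width. Here the model calls are simulated with the cumulative counts from the worked example above; in practice, each iteration would gather another batch of responses.

```python
import math

def wilson_width(successes: int, n: int, z: float = 1.96) -> float:
    """Width of the Wilson score confidence interval for a proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    return 2 * (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))

# Cumulative (mentions, responses) after each simulated batch of 10 responses,
# matching the worked example; in practice each step calls the model again.
batches = [(16, 20), (23, 30), (33, 40)]
target_width = 0.25  # stop once the 95% CI is narrower than 25 points

for successes, n in batches:
    width = wilson_width(successes, n)
    print(f"{successes}/{n} => width: {width:.2%}")
    if width < target_width:
        print("STOP")
        break
```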
Tracking Ranking
The ranking of an entity in a response is its position relative to other entities, based on its first appearance in the response. For example, in the ice cream example, if the first three entities were “vanilla”, “chocolate”, and “strawberry”, the rank of “strawberry” would be three. One subtlety with ranking is what to do when an entity does not appear in a response. The clearest approach is to define ranking only when an entity appears.
Therefore, visibility measures how often an entity appears, and ranking measures how prominently it appears when present. Track visibility when an entity doesn’t appear often, but if it appears very frequently, focus on ranking.
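A sketch of the rank computation: each response is reduced to an ordered list of the entities it mentions (by first appearance), and the average ranking is taken only over responses where the entity appears. The response lists are made up for the example.

```python
def average_ranking(responses, entity):
    """Average 1-based rank of entity, over responses where it appears."""
    ranks = [r.index(entity) + 1 for r in responses if entity in r]
    if not ranks:
        return None  # ranking is undefined when the entity never appears
    return sum(ranks) / len(ranks)

# Hypothetical responses, each reduced to entities ordered by first appearance.
responses = [
    ["vanilla", "chocolate", "strawberry"],
    ["vanilla", "chocolate", "salted caramel", "strawberry"],
    ["chocolate", "vanilla", "mint chip"],
]
print(average_ranking(responses, "strawberry"))  # (3 + 4) / 2 = 3.5
print(average_ranking(responses, "coffee"))      # None (never appears)
```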
We can apply the same methodology we applied to visibility to ranking. We should not infer anything from individual responses, but with a sample of responses, we can estimate the distribution over rankings and measure the average ranking over all responses. (Even in traditional search, ranking positions change frequently, which is why Google Search Console displays the average position.)
One thing to note, however, is that because ranking is defined only when an entity occurs, the effective sample size for measuring the average ranking depends on the entity's visibility. Tracking the ranking of low-visibility entities yields a low effective sample size and unstable estimates, which is another reason to avoid ranking until visibility is high.
Ranking histograms are a good way to visualize the variance in observed rankings.

Ranking Estimation Error
As we did with visibility, we compute the mean absolute error of ranking estimates as a function of the number of responses. We measure error for entities with visibility of at least 10% in the full set of responses, and we discard entities that do not appear in the sub-sample. Note that when estimating rankings from k responses, the effective sample is < k unless the true visibility of the entity is 100%, since ranking only considers responses where the entity is present. The error decreases more slowly, but by 10 responses, it is already below one ranking position, at 0.94, despite the smaller effective sample size.

We repeated this experiment using Gemini 3 Flash and found similar results. With 10 responses, MAE is 0.82 positions, and the shapes of the curves are also very similar.

Below, we show the percentage of prompts that have a mean absolute error <= 2 or 1 position after different numbers of responses. With 10 responses, 93.2% of prompts have a mean absolute error <= 2 positions.

Ranking Confidence Intervals
We can compute a confidence interval for the average ranking (formula in the Appendix). For this calculation, we need to compute the standard deviation of observed ranks. We can see some example confidence intervals in the table below.

As with visibility, sequential sampling can be used to efficiently collect additional responses for ranking estimates that are not precise enough.
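Assuming a normal approximation (mean plus or minus z times the standard error of the observed ranks), the average-ranking interval can be sketched as follows; the observed ranks are hypothetical.

```python
import math
import statistics

def average_ranking_ci(ranks, z: float = 1.96):
    """Normal-approximation confidence interval for the average ranking.

    ranks: observed 1-based ranks from the responses where the entity
    appeared (the effective sample), so len(ranks) <= total responses.
    """
    n = len(ranks)
    mean = statistics.mean(ranks)
    half_width = z * statistics.stdev(ranks) / math.sqrt(n)
    return mean - half_width, mean + half_width

observed_ranks = [5, 7, 6, 9, 7, 8, 6, 7, 5, 10]  # hypothetical
lo, hi = average_ranking_ci(observed_ranks)
print(f"average ranking: {statistics.mean(observed_ranks):.1f}, 95% CI: {lo:.1f} to {hi:.1f}")  # 7.0, 6.0 to 8.0
```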
Tracking Over Time
We should track responses over time. However, daily tracking may be excessive, as we do not expect meaningful day-to-day changes. We must be careful not to interpret fluctuations in the metrics (often due to randomness in the generation) as meaningful changes. A simple method is to compute confidence intervals for each time interval, as described in the Visibility and Ranking sections above. If the confidence intervals do not overlap, the difference is statistically significant. If they do overlap, the difference may be due to noise rather than a significant change, and we can run a statistical significance test to check whether the difference is real (see the Appendix for details).
Below are some example test results:

In the following figure, we show whether two visibility estimates can be distinguished based on a particular sample size and confidence level.

For example, with 10 responses, we can statistically distinguish between 75% and 25%, but not 75% and 50%. With 50 responses, we can statistically distinguish between 50% and 30%, but not 50% and 40%.
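These comparisons can be reproduced with a standard two-proportion z-test (the same calculation as the visibility significance test in the Appendix):

```python
import math

def visibilities_differ(p1, n1, p2, n2, z=1.96):
    """Two-proportion z-test: True if the visibility difference is significant."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return abs(p1 - p2) > z * se

# The four comparisons quoted above:
a = visibilities_differ(0.75, 10, 0.25, 10)  # True: distinguishable
b = visibilities_differ(0.75, 10, 0.50, 10)  # False
c = visibilities_differ(0.50, 50, 0.30, 50)  # True
d = visibilities_differ(0.50, 50, 0.40, 50)  # False
```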
We can also run statistical significance tests for ranking.

Note that this test assumes independent samples (see also the note for a technical audience in Visibility Confidence Intervals), so it applies to comparing the same entity across time periods, but not to comparing two different entities from the same set of responses. When comparing visibilities from the same set of responses, we can use McNemar's test, which accounts for the paired structure of the data. Alternatively, we can simply compare the entities' individual confidence intervals. Each interval is valid on its own, regardless of correlation, so non-overlapping intervals still imply a significant difference.
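A minimal sketch of McNemar's test, using the chi-square approximation with the 95% critical value (the counts are illustrative):

```python
def mcnemar_significant(only_a, only_b, critical_value=3.84):
    """McNemar's chi-square test for paired visibility comparisons.

    `only_a`: responses mentioning entity A but not B;
    `only_b`: responses mentioning entity B but not A.
    Responses mentioning both or neither do not affect the statistic.
    3.84 is the 95% critical value of chi-square with 1 degree of freedom.
    """
    if only_a + only_b == 0:
        return False
    statistic = (only_a - only_b) ** 2 / (only_a + only_b)
    return statistic > critical_value

# Illustrative: out of 50 responses, 18 mention only brand A, 4 only brand B.
significant = mcnemar_significant(18, 4)
```

Only the discordant responses (those mentioning exactly one of the two entities) enter the statistic, which is what makes the test valid for paired data.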
Sequential sampling can also be applied within each time period to ensure estimates are precise enough to make meaningful comparisons.
A/B Testing
The above statistical test can also be used to measure the difference between the visibility of a brand among control and treatment groups of prompts in an A/B test, assuming the prompts are randomly assigned to the groups. We will discuss experimental design for prompt tracking in a future article.
Search
Note that if the response does not include citations, the AI assistant is not using a search tool, thereby eliminating one frequent source of variation. Therefore, the response distribution has fewer reasons to change over time, and observed differences between time periods are more likely to be due to sampling noise.
When responses include citations, we can also measure citation visibility, the percentage of responses that cite a particular source, using the exact same methodology as entity visibility.
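As a sketch, citation visibility reduces to counting responses whose citation list includes the source (the URLs below are made up):

```python
def citation_visibility(responses, domain):
    """Fraction of responses citing a given source domain.

    Each response is represented by its list of cited URLs.
    """
    cited = sum(any(domain in url for url in citations) for citations in responses)
    return cited / len(responses)

# Illustrative: citation lists from 4 responses.
responses = [
    ["https://example.com/review", "https://news.site/a"],
    ["https://news.site/b"],
    ["https://example.com/guide"],
    [],
]
share = citation_visibility(responses, "example.com")
```

Because this is again a per-response binary outcome, the Wilson intervals, significance tests, and sequential sampling described for entity visibility apply unchanged.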
Personalization
To account for personalization, we need access to responses to the same prompt from many different user accounts. That is, we need panel data (responses from a representative sample of real user accounts). For prompt tracking without panel data, we recommend using new chats without conversation history and disabling memory to ensure responses reflect the underlying probability distribution.
What Prompts Should We Track?
We focus on measuring visibility and ranking for a particular prompt. In practice, users may express the same intent using many different phrasings, so selecting which prompts to track is an important consideration. We plan to formally quantify the sensitivity of visibility and ranking to prompt variation and provide recommendations for selecting and grouping prompts and for allocating samples among them in future work.
API, Logged Out, and Logged In Responses
Prior work showed that there can be significant differences between the responses and citations output by LLM APIs and those that users see in consumer-facing applications. Should we track non-API responses differently?
We conducted experiments with 200 scraped, logged-out responses from ChatGPT and Gemini, provided by Gauge, for each of the 200 prompts used in our API experiments. We provide the full dataset of responses and entities.
Are the responses users see different?
To quantify the entity-level agreement between logged-out and API responses, we merged the entity maps across both datasets and computed pairwise cosine similarity. Focusing on entities with at least 20% visibility, the mean within-dataset similarity is 0.68 for logged-out responses and 0.76 for API responses, while the cross-dataset similarity is 0.46. This validates that API responses and the responses users see are different. The logged-out responses are also slightly more variable.
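The cosine-similarity computation can be sketched as follows, treating each dataset's entity map as a vector of per-entity visibilities (the maps below are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two entity-visibility maps (dicts)."""
    entities = set(a) | set(b)
    dot = sum(a.get(e, 0.0) * b.get(e, 0.0) for e in entities)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# Illustrative visibility maps for one prompt (made-up numbers):
api = {"Brand A": 0.9, "Brand B": 0.6, "Brand C": 0.2}
logged_out = {"Brand A": 0.8, "Brand B": 0.3, "Brand D": 0.5}
sim = cosine_similarity(api, logged_out)
```

Entities missing from one dataset contribute zero to the dot product, which is what drags the cross-dataset similarity below the within-dataset similarities.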
Are the responses users see harder to measure?
In the figure below, we see that although the answers differ, the variability does not make visibility measurement more difficult.

When measuring ranking, we see a larger error gap between API and logged-out responses at the same number of responses. This is primarily because logged-out responses mention 90 unique entities per prompt on average, vs. 73 for the API, which means a larger range of positions and, hence, a higher potential for error. The absolute errors are still quite small, averaging about 1.24 positions (out of 27.8 qualifying entities, on average) at 10 responses. But to achieve error comparable to the API, we need more responses.

We see similar results for Gemini.


We also tested independence here and found median observed/expected variance ratios of 1.05 and 1.03 for ChatGPT and Gemini (batch size 10), confirming that responses from independent scrapes behave as independent draws.
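The independence check can be sketched as follows: for batches of a fixed size, compare the observed variance of per-batch mention counts to the binomial variance implied by the pooled visibility estimate (the counts below are illustrative, not from our data):

```python
def variance_ratio(batch_counts, batch_size):
    """Ratio of observed to expected (binomial) variance of per-batch counts.

    A ratio near 1 is consistent with responses behaving as independent draws.
    """
    m = len(batch_counts)
    mean = sum(batch_counts) / m
    observed = sum((c - mean) ** 2 for c in batch_counts) / (m - 1)
    p = mean / batch_size  # pooled visibility estimate
    expected = batch_size * p * (1 - p)
    return observed / expected

# Illustrative: brand mention counts in ten batches of 10 responses each.
ratio = variance_ratio([6, 4, 8, 5, 7, 6, 3, 7, 5, 6], 10)
```

A ratio well above 1 would indicate correlated responses (e.g. a shared cache or drifting distribution); well below 1 would suggest the responses are less variable than independent draws.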
We are actively working on collecting logged-in responses for our prompt data set, and will post new experimental results soon. Early results suggest a similar pattern: while the responses and entities vary quite a bit with logged-in user responses, the error is not substantially larger, and the same recommendations apply.
Note that scraping and having users run prompts manually are difficult to scale, so gathering large numbers of responses is likely only feasible with APIs. Sequential sampling is especially impactful here.
Entity Extraction
Extracting product names from a response may seem straightforward, but there are subtleties to consider. First, entities are often referred to by different names, so we need to group them. Second, responses often mention entities that are not part of the answer. For example, responses to the prompt “What are the best restaurants in California?” will include entities like “San Francisco” and “Dominique Crenn”, but the actual entities we care about, the “answers”, are the names of restaurants.
One solution is to define the set of entities to track for each prompt. For this article, we use the following algorithm:
- Use GPT-5.2 to extract the main entities from each response in each round.
- Use GPT-5.2 to cluster unique entity mentions into groups of mentions that refer to the same canonical entity.
We provide the prompts in the Appendix.
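For contrast, here is a naive, purely string-based canonicalization (the brand names are illustrative): it catches casing and punctuation variants but misses true aliases, which is why we use an LLM for this step:

```python
import re
from collections import defaultdict

def naive_canonicalize(mentions):
    """Group entity mentions by a normalized key (lowercase, alphanumerics only).

    A simplistic stand-in for the LLM-based clustering step: it handles
    casing and punctuation variants but not true aliases (e.g. a product
    name vs. its parent brand).
    """
    groups = defaultdict(list)
    for mention in mentions:
        key = re.sub(r"[^a-z0-9]", "", mention.lower())
        groups[key].append(mention)
    # Use a deterministic representative of each group as the canonical name.
    return {max(sorted(set(g)), key=g.count): sorted(set(g)) for g in groups.values()}

canonical = naive_canonicalize(["HubSpot", "Hubspot", "hub-spot", "Salesforce"])
```

This collapses the three surface forms of the first brand into one canonical entity while leaving the second untouched.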
Conclusion and Summary of Recommendations
There is randomness in how LLMs respond, but this does not make responses unmeasurable. The statistical tools described in this paper allow practitioners to quantify precision, determine appropriate sample sizes, and make confident comparisons. These tools work for any data collection method. Our recommendations:
- Do not rely on a single response. Instead, run each prompt at least 10 times for a quick estimate.
- For more precise estimates, rather than gathering large numbers of responses by default, use sequential sampling: add responses iteratively until the confidence interval is sufficiently tight.
- Use confidence intervals or statistical tests to decide if changes are real (make a copy of the provided sheet). Generally, be skeptical of small differences in visibility and ranking, especially for responses that do not use web search.
- Track weekly or biweekly, not daily.
- Use a new chat for each prompt and disable memory.
- Focus on the right metric for your situation:
- Visibility (how often a brand appears) is the best initial metric, especially if your brand is not yet consistently mentioned.
- Ranking (how prominently a brand appears when it does appear) matters more when a brand already appears frequently, and the focus is on its position relative to competitors.
- Supplement prompt tracking tools by also manually gathering responses from AI with a logged-in account.
Future Work
There are many additional questions about prompt tracking to answer. In future studies, we plan to investigate and evaluate how to select prompts to track, how to think about related prompts, and how to group prompts into topics.
Disclosures
The authors are the Chief AI Officer and CEO at Graphite, a research-powered AEO and SEO consulting agency. All data and calculations referenced in this paper are publicly available.
Appendix
Statistics
Wilson score binomial confidence interval for visibility

$$\frac{\hat{p} + \frac{z^2}{2n} \pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}$$

where $\hat{p}$ is the observed visibility, $n$ is the number of responses, and $z$ is the standard normal critical value (1.96 for 95% confidence).
Statistical significance test for visibility
We can compute the standard error of the difference as follows:

$$SE = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$

Then, the visibility difference is statistically significant if

$$|\hat{p}_1 - \hat{p}_2| > z \cdot SE$$
Confidence interval for ranking

$$\bar{x} \pm z \cdot \frac{s}{\sqrt{n}}$$

where $\bar{x}$ is the mean ranking, $s$ is the sample standard deviation of the observed ranks, and $n$ is the number of responses in which the entity appears.
Statistical significance test for ranking
We can compute the standard error as follows:

$$SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

Then, the ranking difference is statistically significant if

$$|\bar{x}_1 - \bar{x}_2| > z \cdot SE$$
Entity Extraction Prompts
Both prompts use structured outputs.
Tagging
You will receive a text that compares several named entities.
Provide a list of the main named entities that are compared.
Canonicalization
You will receive a list of named entity strings extracted from responses to a question.
Group named entity strings that refer to the same canonical named entity.
Make sure that each input named entity is included in the output.

