AI Search Collapse: AI Responses Collapse When AI Retrieves Its Own Generations


Summary

  • AI responses are based on the internet (via training or search), and AI is now used to generate a significant amount of content online, creating the potential for an infinite feedback loop.
  • Prior work has shown that when LLMs are recursively trained on their own output, they experience model collapse: responses become less diverse, and eventually no longer resemble the original training data.
  • In this report, we show that a similar collapse occurs if LLM-based AIs retrieve references they authored using a search tool.
  • We conduct extensive experiments with three simulations of AI retrieving references it generated, using three LLMs, and 1,019 information-seeking prompts, totaling 1,528 simulations and over one million LLM API calls.
  • We find that 79.6% (1,216/1,528) of simulations end in collapse.

Introduction

AI systems like ChatGPT are powered by large language models (LLMs). LLMs are trained on large text datasets, primarily from the internet. AI systems often have access to a search tool to retrieve additional information from the internet. Generating responses using search results is called retrieval-augmented generation (RAG). LLM responses are randomly sampled from a probability distribution defined by the internet. Therefore, responses from AI should mirror the diversity of thought on the internet.

AI is now also used to generate large amounts of text online. In our recent AI-generated content paper, we showed that there are now as many AI-generated articles as human-written articles being published. In a forthcoming paper, we find that a substantial share of references AI cites are AI-generated. This creates the potential for an infinite feedback loop.

Prior work has shown that training LLMs on their own generations can lead to model collapse, in which token probabilities become increasingly concentrated, resulting in substantially less diversity in responses and eventually text that does not resemble the original training data.

In this report, we investigate what happens when search-enabled LLMs retrieve pages they authored. We call a reference self-authored when it was generated by the same model using its own answer to the question. Across 1,528 simulations, we find that retrieving self-authored references in RAG leads to collapse (essentially the same response every time) in 79.6% of simulations (1,216/1,528). In one experiment (see the Entity experiment in Larger-Scale Experiments), 22% of pairs of AI responses are essentially paraphrases initially. By the end of the simulation, 89% are.

Surprisingly, this collapse begins almost immediately, even when only a single reference is self-authored by the model. The results suggest that self-authored references have a disproportionate influence on responses. They are cited more often than original references, including those that are themselves AI-generated, even after we control for reference quality. This suggests self-bias: AI prefers its own answer in its own writing style.

Our simulations demonstrate a risk: AI responses no longer reflect the diversity of human thought, and as more AI-generated content is published, the internet converges on a single perspective.

Example

Let’s consider the information-seeking prompt “Who are the best Twitch streamers currently?” This may trigger a search that identifies pages like the following as relevant:

If we ask GPT-5.2 ten times, using the above as references, we get ten different answers. Here are two examples that differ in structure and streamers mentioned.

[Figure: two example responses]

The visibility of an entity in a set of responses is the percentage of responses that mention the entity. Across ten responses, we observe the following visibilities for each streamer.

[Figure: streamer visibility across ten responses]

Note the diversity in responses. Some streamers, like “Kai Cenat”, are mentioned in every response, while others, like “Squeezie”, are mentioned occasionally (2 of 10 responses).
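As a minimal sketch of the visibility calculation defined above (not the code used in this report), the snippet below counts the share of responses that mention each entity. The function name and the example mentions are hypothetical and chosen only for illustration.

```python
from collections import Counter

def visibility(responses_entities):
    """Return the percentage of responses that mention each entity.

    responses_entities: list of sets, one set of entity names per response.
    """
    n = len(responses_entities)
    # Count each entity at most once per response.
    counts = Counter(e for ents in responses_entities for e in set(ents))
    return {entity: 100.0 * c / n for entity, c in counts.items()}

# Hypothetical mentions for ten responses (not the report's actual data):
# one streamer in every response, another in only two of them.
responses = [{"Kai Cenat", "Squeezie"}] * 2 + [{"Kai Cenat"}] * 8
print(visibility(responses))
```

An entity that never appears in a round simply has no entry, which corresponds to a visibility of 0%.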

What if we use AI to generate a new article based on one of the model’s responses, replace one of the original references with it, and then regenerate the answers? What if we repeat this process over and over? We answer this question by running our Replace One simulation. The simulation takes place over a number of rounds, and in each round, we replace one original reference with a self-authored reference. We provide a complete description in the Simulations section below.

After five rounds of recursively replacing references, we end up with the following streamer visibility plot:

[Figure: streamer visibility after five rounds]

Note that every streamer has visibility 0% or 100%. This means that every answer mentions exactly the same streamers. The previous wide distribution over streamers considered the best has collapsed. Although the most frequently mentioned streamers in the original distribution have survived, others that were also mentioned frequently at first, like “Ninja”, have dropped out of the responses entirely, and “xQc”, originally mentioned 50% of the time, now always appears. These visibilities remain the same for the rest of the simulation.

Below is an animation of how visibilities change over the course of the simulation.

[Animation: streamer visibility over the course of the simulation]

Background and Related Work

LLMs and RAG

LLMs are pre-trained on massive text datasets from the internet. An LLM can respond to a prompt (without web search) based on its “fuzzy recollection of the internet” (credit: Andrej Karpathy), but cannot attribute its answer to a particular document in its training set.

To generate a response, LLMs predict a probability distribution over possible next tokens (sub-word character sequences), and then randomly select a next token according to that distribution. As a result, LLM responses are non-deterministic, and they often generate different responses to the same prompt. We describe this in detail in Demystifying Randomness in AI.
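As a toy illustration of this sampling step, the sketch below draws a next token from a small made-up distribution. The tokens and probabilities are hypothetical; real models sample over vocabularies of many thousands of tokens.

```python
import random

def sample_next_token(probs, rng):
    """Randomly pick one token according to its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution (illustrative values only).
probs = {"Jordan": 0.6, "LeBron": 0.3, "Bird": 0.1}

rng = random.Random(0)
samples = [sample_next_token(probs, rng) for _ in range(10)]
print(samples)  # repeated calls can yield different tokens for the same prompt
```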

To reduce hallucinations and incorporate more real-time information, AI systems like ChatGPT now have access to web search. In this report, we use retrieval-augmented generation (RAG) as a general term that encompasses any use of web search, including agentic search, in which the model can selectively choose when to search and which results to read.

When the model uses the web search tool, it adds the retrieved information to its context window to better address the prompt. Research suggests that LLMs are heavily influenced by this context. RAG also allows the LLM to provide citations to the documents it has read to inform the answer.

At a high level, for information-seeking prompts (i.e., questions), when using RAG, the LLM's answer distribution should be similar to the distribution of answers on the internet. In some sense, the model should faithfully represent the diversity of opinion and thought on the internet. (Though post-training alignment with human preferences, to remove some types of unwanted bias, may shift this distribution.)

Model Collapse

What happens now that these models are being used to generate text published on the internet? What if an AI trains on or retrieves its own generations?

Prior work has shown that training LLMs on their own generations can lead to model collapse, in which token probabilities become increasingly concentrated, resulting in substantially less diversity in responses and eventually text that does not resemble the original training data.

Why does this happen? Recursively resampling and re-estimating the probability distribution causes it to shift over time. Specifically,

  • When you take a finite sample from a probability distribution, it will not faithfully represent that distribution.
  • The model may not faithfully represent the original distribution, either because it lacks the capacity to fit the data or because learning algorithms may not find the optimal weights.

As a simple illustration, suppose we roll an unweighted die. The probability of rolling each number between one and six is equal (⅙), and over many rolls, we expect to get about the same number of 2s and 5s, for example. However, suppose we roll the die ten times and never get a 2. If we were to take those ten rolls and use them to reweight the die, we would end up with a very different distribution, as the probability of rolling a 2 would become 0. In this way, recursively re-estimating a probability distribution from a sample causes drift.
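The die example can be simulated directly. The sketch below is a toy illustration, not part of our experiments: it repeatedly rolls a die ten times and uses the observed frequencies as the die's new weights. The number of rounds, rolls, and the seed are arbitrary assumptions.

```python
import random
from collections import Counter

FACES = [1, 2, 3, 4, 5, 6]

def reestimate(weights, n_rolls, rng):
    """Roll a weighted die n_rolls times and return the observed
    frequencies, which become the weights for the next round."""
    rolls = rng.choices(FACES, weights=weights, k=n_rolls)
    counts = Counter(rolls)
    return [counts[f] / n_rolls for f in FACES]

def simulate(rounds=50, n_rolls=10, seed=0):
    rng = random.Random(seed)
    weights = [1 / 6] * 6          # start with a fair die
    history = [weights]
    for _ in range(rounds):
        weights = reestimate(weights, n_rolls, rng)
        history.append(weights)
    return history

history = simulate()
for round_number in (0, 1, 5, 50):
    print(round_number, history[round_number])
```

Once a face is missing from a sample, its weight becomes 0 and it can never return, so the number of faces with nonzero weight can only shrink from one round to the next.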

AI Search Collapse

What if, rather than being retrained on its generations, a model retrieves its own generations?

Suppose we ask AI, “Who is the best NBA player of all time?” repeatedly, and get different answers. The response will likely always include Michael Jordan and LeBron James, but may occasionally include Hakeem Olajuwon or Larry Bird. If a few people generated articles using the model’s response, and those articles happened to include Hakeem Olajuwon more frequently than Larry Bird, and the LLM retrieved those articles, Larry Bird could drop out of the answer entirely, while Hakeem Olajuwon could start to appear more frequently alongside Jordan and James. These self-authored responses could further influence future articles, and over time, the answers could change, and the diversity of responses could decrease.

Because LLMs are highly influenced by retrieved documents in RAG, we may expect the LLM’s answer distribution to mirror that of the retrieved documents, leading to a similar collapse in the RAG setting as in the retraining setting. The collapse could be more dramatic in RAG because responses are based on a relatively small number of documents compared to the size of the model's training set (the internet).

However, surprisingly, in our experiments, we find that even small changes to the distribution, replacing only one of the original references with a self-authored reference, can cause the responses to collapse. This suggests that RAG is not faithfully reproducing the original input distribution. Instead, there is a bias toward the model's own generations. We show that this happens because of self-bias and because the self-authored references are more direct.

Concurrent work simulates a network of LLMs communicating via a shared RAG database and shows that different models converge toward each other over time, and provides a theoretical analysis (using Gaussian Mixture Models). Our work focuses on the collapse of response diversity within a single model's answers across a wide range of prompts, simulations, and frontier models.

Simulations

We now describe our simulations of AI retrieving content that it previously generated.

We focus on RAG and do not retrain the base model. We limit the scope to information-seeking prompts, in particular, question prompts.

In each round, we use an LLM to generate several responses to a question based on references. We use the following system instructions:

You will be given a question, labeled "question", and context from the internet, labeled "context".

Use the context to answer the question thoroughly.

Cite each context that informs your answer. Each context has a number. To cite "context 3", use [3].

and the following prompt:

context:

CONTEXT

question:

QUESTION

where CONTEXT has the form:

context 0
TITLE
TEXT

context 1
TITLE
TEXT

Initially, CONTEXT contains text from the original retrieved references.
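As a rough sketch of how the prompt above could be assembled, the function below fills in CONTEXT and QUESTION from a list of references. The function name and the (title, text) tuple layout are assumptions for illustration; the exact whitespace used in our runs may differ.

```python
def build_prompt(question, references):
    """Assemble the user prompt from a question and a list of
    (title, text) tuples. References are numbered from 0, matching
    the "context n" labels used for citations."""
    blocks = [
        f"context {i}\n{title}\n{text}"
        for i, (title, text) in enumerate(references)
    ]
    context = "\n\n".join(blocks)
    return f"context:\n\n{context}\n\nquestion:\n\n{question}"

# Hypothetical references for illustration.
refs = [("Top streamers", "Text of the first page."),
        ("Streamer rankings", "Text of the second page.")]
print(build_prompt("Who are the best Twitch streamers currently?", refs))
```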

We then take some of these responses (10 for Replace All, 1 for Replace One and Search), remove citations (e.g., [3]), and convert them into the form of online articles using the following system instructions:

You will be given an answer, labeled "answer", to a question, labeled "question", and your job is to expand the answer into an online article.

Do not change the answer, just expand it into an online article.

Return only the article in plain text (not markdown).

and the following prompt:

question:

QUESTION

answer:

RESPONSE

We do this because AI responses to a question may look quite different from the original references (typically articles), and we want to simulate someone publishing an article online based on the AI answer.
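The citation-removal step that precedes article generation can be sketched with a regular expression. This is a minimal illustration that assumes citations always take the bracketed numeric form shown in the system instructions (e.g., [3]); the helper names are hypothetical.

```python
import re

# Matches a bracketed numeric citation such as "[3]", plus any
# whitespace immediately before it.
CITATION = re.compile(r"\s*\[\d+\]")

def strip_citations(response):
    """Remove citation markers like [3] from a response."""
    return CITATION.sub("", response)

def build_article_prompt(question, response):
    """Fill the article prompt with the question and the cleaned response."""
    return f"question:\n\n{question}\n\nanswer:\n\n{strip_citations(response)}"

print(strip_citations("Kai Cenat is popular [3]. xQc too [0][2]."))
```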

We then replace the original references in the context with self-authored references, simulating responses being published and then retrieved by the model during RAG. There are three variants of our simulation that differ in the way they use these self-authored references.

Replace All Simulation

In the Replace All simulation, we replace each reference with a self-authored reference based on a response from the previous round. We visualize this simulation with the figure below. Note, for clarity, in the figures below we show three references and responses, but our actual simulations use ten responses and at least five references.

[Figure: Replace All simulation]

This setup is most similar to the experiments in the model collapse paper.

However, this simulation makes several assumptions. First, it is not actually searching; it assumes that the self-authored references would be retrieved and included in the context. Additionally, it assumes all self-authored references are generated from the same set of responses and that the retriever only retrieves that set. In practice, we expect the replacement of the original references to happen more slowly and for the search tool to retrieve self-authored references from different rounds.

Replace One Simulation

In the Replace One simulation, we replace one of the original references with a randomly selected self-authored reference in each round. We would expect the collapse to happen more slowly in this simulation. We depict this simulation in the figure below.

[Figure: Replace One simulation]

However, this simulation still replaces the original references, just more slowly, and it still assumes that the self-authored references would be retrieved in search.
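The Replace One loop can be sketched as follows. This is a simplified outline under stated assumptions, not our implementation: generate_responses and write_article are hypothetical callables standing in for the LLM calls, the original reference to replace is chosen at random, and replacement simply stops once no original references remain.

```python
import random

def replace_one_simulation(question, references, generate_responses,
                           write_article, rounds=20, seed=0):
    """Run a Replace One style loop.

    generate_responses(question, context) -> list of responses (stand-in
        for the LLM answering the question from the context).
    write_article(question, response) -> article text (stand-in for the
        LLM expanding one response into an article).
    """
    rng = random.Random(seed)
    context = list(references)                   # working copy of the context
    original_slots = list(range(len(context)))   # slots still holding originals
    history = []
    for _ in range(rounds):
        responses = generate_responses(question, context)
        history.append(responses)
        if original_slots:
            # Assumption: pick a random original and a random response.
            slot = original_slots.pop(rng.randrange(len(original_slots)))
            context[slot] = write_article(question, rng.choice(responses))
    return history, context

# Stub "models" so the sketch runs without any API access.
demo_history, demo_context = replace_one_simulation(
    "Who are the best Twitch streamers currently?",
    [f"original reference {i}" for i in range(5)],
    generate_responses=lambda q, ctx: [f"answer based on: {ctx[0]}"] * 10,
    write_article=lambda q, r: f"ARTICLE: {r}",
)
print(len(demo_history), demo_context[0][:40])
```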

Search Simulation

In the Search simulation, we maintain a pool of references that includes the original references and add self-authored references to it one at a time. To determine which references to add to the context in each round, we search the reference pool and retrieve the top-k most relevant chunks. We illustrate this simulation in the following figure:

[Figure: Search simulation]


In practice, other content would continue to be published alongside AI-generated content. We do not attempt to simulate this in this report, as it would introduce many additional free parameters. However, the ease and low cost of creating AI-generated content make it highly likely that AI-generated content will comprise an increasing share of online content.

Implementation Details

We use LLM APIs to run the simulation, so that we control the context provided to the model. In particular, we use the OpenAI, Gemini, and Anthropic APIs.

Due to randomness, there are many possible outcomes from running the simulation on each question. For example, here are nine different simulations of “Who are the best Twitch streamers currently?” Different entities survive each time.

[Figure: nine simulations of “Who are the best Twitch streamers currently?”]

However, rather than running a small number of questions many times, we opt to run many different questions once.

For each question, we initialize references using actual references for these prompts from consumer-facing AI tools. For OpenAI models, we collect references cited by ChatGPT using logged-in accounts. For Gemini and Claude models, we collect references cited in Google AI Overviews.


We scrape the content of the original references and identify the page's main content (removing boilerplate, etc.) using trafilatura. We spot-checked 50 pages; they were all scraped properly, and the accuracy of the main content extraction was high. For the Replace All and Replace One simulations, we additionally identify the portions of the main content that are relevant to the question using an LLM to reduce costs. For the Search simulation, we use chunks from the full main content.

We initialize the context with the relevant content from the original references. We do not provide the original URLs in the prompt to avoid bias; instead, we rename them as “context n”. We also randomly shuffle the excerpts in each round to reduce position bias.

In each round of the simulation, we generate ten responses to the question.

To reduce costs, we stop the Replace All simulation after 10 rounds, the Replace One simulation after 20 rounds, and the Search simulation after 30 rounds. Although Replace One and Search run for more rounds, Replace All replaces all original references after round one, whereas Replace One replaces all original references by round 10. The original references may never be fully replaced in the Search simulation, depending on whether the original references continue to be retrieved.

We exclude questions with fewer than 5 references. For the Replace All and Replace One simulations, we truncate the number of references at 10, keeping the longest references, whereas for the Search simulation, all references are added to the pool.

For the Search simulation, we use OpenAI vector stores to implement RAG. We use the default chunking strategy, although as part of our research, we experimented with other chunking strategies and generally found that larger chunks lead to faster collapse. In each round, we upload new documents into the pool, and then retrieve the top 10 chunks and add them to the context.

As part of debugging, we verified that collapse does not occur when no self-authored references are added. We observe only the variation expected from sampling LLM responses.


We leave the temperature at the default value (1.0) for all experiments.

We use GPT-5.2 for evaluation tasks such as computing the same answer percentage, extracting entities, and scoring reference quality. We use GPT-5.2 Chat for many experiments because this was the version of GPT-5.2 that ChatGPT used.

To distinguish AI-generated and human-written original references, we use GPTZero, an AI-content detector, following a methodology similar to our AI content paper. We classify a reference as AI-generated when GPTZero labels its content type as AI or Mixed.

Costs

Simulating each question requires hundreds of calls to an LLM API, specifically:

  • Generating responses to the question (10 per round)
  • Generating articles based on responses (1-10 per round)
  • Computing the same answer percentage metric (10 per round)
  • Extracting entities (10 per round)
  • Extracting relevant content (1 per citation)

Many of the calls involve large inputs and outputs (e.g., full articles).

As a result, the simulation often costs $5 or more per question. Therefore, we cannot reasonably evaluate every combination of dataset, simulation, and model.

For the Larger-Scale Experiments, to help reduce costs, we also detect whether the simulation converges. We stop the simulation if the Same Answer % (defined below) is 100% in four consecutive rounds.
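The stopping rule can be expressed in a few lines. The sketch below assumes the Same Answer % for each completed round is kept in a list; the function name is hypothetical.

```python
def has_converged(same_answer_history, window=4):
    """Return True if the Same Answer % was 100 in each of the last
    `window` rounds (four consecutive rounds by default)."""
    recent = same_answer_history[-window:]
    return len(recent) == window and all(pct == 100.0 for pct in recent)

print(has_converged([22.0, 60.0, 100.0, 100.0, 100.0, 100.0]))
```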

Datasets

Getting access to real user prompts is challenging. Instead, we translate Google search keywords to information-seeking question prompts or write them manually.

Entity Questions

A category of particular interest is entity comparison questions, or entity questions for short. Answers to entity questions compare different named entities, such as products, brands, businesses, and locations. For example, “What is the best restaurant in San Francisco?”

We collect entity questions from a variety of sources:

  • The Graphite prompt tracking tool (typically translated from high-volume keywords)
  • The categorized keywords dataset in our previous paper
  • Manually written questions on topics of particular interest

Editorial Questions

We also experiment with a more general set of questions that are neither entity comparisons nor factual. We call these editorial questions. An example is “How can I improve my personal branding?” We exclude factual questions like “Who were the draft picks for the Commanders in 2023?” because they have one correct answer, and, as a result, responses will already be collapsed (correctly). The editorial questions are translated from the categorized keywords dataset in our previous paper.

Original References

As described above, we need references to initialize the simulations, and we gather these from consumer-facing products: ChatGPT and Google AI Overviews. Therefore, for each dataset, we filter out questions that do not trigger a web search in ChatGPT or do not produce an AI Overview in Google Search. (In ChatGPT, we could force the use of web search, but we suspect few users do this.)

Dataset Statistics

Name                    Questions
Entity ChatGPT          843
Editorial ChatGPT       159
Entity AI Overview      60
Editorial AI Overview   45

The AI Overview questions and ChatGPT questions largely overlap but are not identical due to differences in whether an AI Overview appeared and whether ChatGPT used the web search tool. Across all datasets, there are 1,019 unique questions.

AI-Generated Original References

The references in our dataset were collected in January 2026. We also collected new references for the prompts from ChatGPT in early June 2026. We find that 38.9% of ChatGPT references in January 2026 and 42.7% in June 2026 are predicted to be AI-generated. There are more AI-generated references with Entity prompts than with Editorial, and for both Entity and Editorial prompts, the percentage of AI-generated references increased from January 2026 to June 2026. We plan to explore this in depth in a forthcoming paper.

Source         Overall              Entity               Editorial
1/26 ChatGPT   38.9% [38.0, 39.9]   41.1% [40.1, 42.1]   20.6% [18.3, 23.0]
6/26 ChatGPT   42.7% [41.7, 43.7]   45.8% [44.7, 46.8]   27.5% [25.4, 29.7]

Brackets are 95% Wilson confidence intervals, computed over unique reference URLs.
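For readers unfamiliar with Wilson intervals, the sketch below implements the standard Wilson score interval for a proportion. The counts in the example are hypothetical and are not the reference counts behind the table above; z = 1.96 is the usual approximation for a 95% interval.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    Returns (low, high) as fractions between 0 and 1.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Hypothetical example: 389 of 1,000 references flagged as AI-generated.
low, high = wilson_interval(389, 1000)
print(f"{100 * low:.1f}% to {100 * high:.1f}%")
```

Unlike the simpler normal approximation, the Wilson interval stays within 0% and 100% even for proportions near the extremes, which matters for rates such as “Collapsed at Start”.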

Evaluation Metrics

To measure how generations change over the rounds, we use the following metrics:

Semantic similarity measures the semantic similarity of pairs of responses generated in the same round. A larger value means the responses are more similar to each other. It is computed as the average cosine similarity between the embeddings of pairs of responses.
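A minimal sketch of this metric is shown below. It assumes each response has already been converted to an embedding vector by some embedding model (the choice of model is outside the scope of this sketch), and it averages cosine similarity over all pairs in a round.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def semantic_similarity(embeddings):
    """Average cosine similarity over all pairs of response embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy 2-dimensional "embeddings" for illustration only.
print(semantic_similarity([[1.0, 0.0], [0.0, 1.0]]))   # unrelated directions
print(semantic_similarity([[1.0, 2.0], [2.0, 4.0]]))   # same direction
```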

Unique words measures the number of distinct words across all responses in a round.

AI citation percentage measures the percentage of citations of self-authored references. In some plots, we show the AI citation percentage alongside the expected percentage based on the proportion of self-authored references in the context. In others, we show the AI citation percentage relative to the expected percentage (100 * (observed − expected) / expected), so that values above zero indicate disproportionate citation of self-authored references. We call this AI Citations Over Expected.
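As a worked example of AI Citations Over Expected, the sketch below applies the formula above to hypothetical counts. The function and argument names are ours for illustration.

```python
def ai_citations_over_expected(cited_self, cited_total, self_refs, total_refs):
    """Relative excess of self-authored citations, in percent.

    observed = share of citations that point to self-authored references
    expected = share of references in the context that are self-authored
    """
    if self_refs == 0:
        raise ValueError("undefined when there are no self-authored references")
    observed = 100.0 * cited_self / cited_total
    expected = 100.0 * self_refs / total_refs
    return 100.0 * (observed - expected) / expected

# Hypothetical round: 1 of 10 references is self-authored (expected 10%),
# but 3 of 10 citations point to it (observed 30%).
print(ai_citations_over_expected(3, 10, 1, 10))
```

In this hypothetical round the self-authored reference is cited three times as often as its share of the context would suggest, which the metric reports as 200% over expected.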

Same answer percentage measures the percentage of pairs of responses that are paraphrases of each other. We use GPT-5.2 to evaluate whether two responses are paraphrases. To reduce costs, we sample 10 of the 45 possible unique pairs of 10 generations in each round.
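The pair-sampling step can be sketched as follows. Here is_paraphrase is a hypothetical callable standing in for the LLM judgment, and the seed is an arbitrary assumption.

```python
import random
from itertools import combinations

def same_answer_percentage(responses, is_paraphrase, n_pairs=10, seed=0):
    """Estimate the share of response pairs that are paraphrases.

    With 10 responses there are 45 unique pairs; to limit the number
    of judge calls, only n_pairs of them are sampled and evaluated.
    """
    pairs = list(combinations(range(len(responses)), 2))
    rng = random.Random(seed)
    sampled = rng.sample(pairs, min(n_pairs, len(pairs)))
    hits = sum(bool(is_paraphrase(responses[i], responses[j])) for i, j in sampled)
    return 100.0 * hits / len(sampled)

# Stand-in judge for illustration: exact string equality.
identical = ["same answer"] * 10
print(same_answer_percentage(identical, lambda a, b: a == b))
```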

Entity Question Metrics

For Entity Questions, we additionally identify the entities mentioned in each response and use them to compute additional metrics.

Entities are often referred to by several names, so in addition to extracting them, we need to canonicalize them. Additionally, responses often mention entities that are not really part of the answer. For example, responses to a prompt “What are the best restaurants in California?” will include entities like “San Francisco” and “Dominique Crenn”, but the actual entities we care about here are the names of restaurants. We use the following algorithm:

  1. Use an LLM to extract the entities being listed or compared from each response in each round.
  2. Use an LLM to cluster unique entity mentions into groups of mentions that refer to the same canonical entity.
  3. We find that the LLM sometimes misses entity mentions. Therefore, using the canonical entity map, we identify any missed matches in the original entity tagging by searching for exact entity string matches.
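Step 3 of the algorithm above can be sketched as a simple string search. The sketch assumes the canonical entity map is a dictionary from each known mention string to its canonical name, and it uses case-sensitive substring matching; both are assumptions for illustration.

```python
def recover_missed_entities(response_text, tagged, canonical_map):
    """Add canonical entities whose known mention strings appear in the
    response but were missed by the LLM tagger.

    tagged: set of canonical entities the LLM already extracted.
    canonical_map: dict mapping mention string -> canonical entity name.
    """
    found = set(tagged)
    for mention, canonical in canonical_map.items():
        if mention in response_text:   # exact (case-sensitive) string match
            found.add(canonical)
    return found

# Hypothetical example.
canonical_map = {"RDR2": "Red Dead Redemption 2",
                 "Red Dead Redemption 2": "Red Dead Redemption 2",
                 "Half-Life": "Half-Life"}
text = "Many lists include RDR2 and Half-Life."
print(recover_missed_entities(text, {"Half-Life"}, canonical_map))
```

Plain substring matching can over-match when one entity name is contained in another word or name, so a production version would likely need word-boundary checks.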

Unique entities measures the number of distinct entities mentioned across all responses in a round.

Entity ranking similarity measures the similarity of the rankings of entity mentions across pairs of responses generated in the same round. In particular, we rank the entities by the position of their first mention. We use Kendall’s tau as the similarity metric. If two responses mention entities in the same order, the entity ranking similarity is 1.0.
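The sketch below illustrates the ranking comparison with a simple Kendall's tau (the tau-a variant, with no tie handling). It assumes the comparison is restricted to entities mentioned in both responses, which is our simplification here; the helper names are hypothetical.

```python
from itertools import combinations

def first_mention_ranking(text, entities):
    """Order entities by the position of their first mention in the text."""
    positions = {e: text.find(e) for e in entities if e in text}
    return sorted(positions, key=positions.get)

def kendall_tau(order_a, order_b):
    """Kendall's tau between two rankings, computed over shared entities.

    Returns None when fewer than two entities are shared.
    """
    in_b = set(order_b)
    shared = [e for e in order_a if e in in_b]
    if len(shared) < 2:
        return None
    rank_b = {e: i for i, e in enumerate(order_b)}
    concordant = discordant = 0
    for x, y in combinations(shared, 2):   # x precedes y in order_a
        if rank_b[x] < rank_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

a = first_mention_ranking("Tetris, then DOOM, then Half-Life", ["DOOM", "Half-Life", "Tetris"])
b = first_mention_ranking("Tetris, then Half-Life, then DOOM", ["DOOM", "Half-Life", "Tetris"])
print(a, b, kendall_tau(a, b))
```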

We report the mean of these metrics for each round across all questions in the dataset.

Defining Collapse

The more similar responses to a prompt are, the more collapsed we consider them to be. Complete collapse would mean that every single response is exactly the same. However, in practice, we care about whether responses essentially say the same thing, even if they are not word-for-word identical.

For Entity Questions, there is a natural and intuitive representation of collapse: whether the entities compared are the same or not. Therefore, we check whether the entities being compared are identical across all responses in a round.

For Editorial Questions, there is no similar compact representation of the answer, so we instead assess whether the same answer percentage is 100%.

Collapsed at start is the percentage of questions that are collapsed after round 1, before any self-authored references have been introduced.

Collapsed at end is the percentage of questions that are collapsed at the end of the simulation.

Rounds collapsed is the percentage of total rounds across all simulations for a set of questions in which responses were collapsed.
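These definitions can be sketched in code as follows. The data layout (one list of per-round collapse flags per simulation) and the function names are assumptions for illustration.

```python
def is_collapsed_entity(entity_sets):
    """Entity questions: collapsed if every response in the round
    mentions exactly the same set of entities."""
    return all(s == entity_sets[0] for s in entity_sets)

def is_collapsed_editorial(same_answer_pct):
    """Editorial questions: collapsed if the same answer percentage is 100%."""
    return same_answer_pct == 100.0

def collapse_rates(simulations):
    """simulations: list of lists of booleans, one flag per round.

    Returns (collapsed at start, collapsed at end, rounds collapsed),
    each as a percentage.
    """
    n = len(simulations)
    at_start = 100.0 * sum(flags[0] for flags in simulations) / n
    at_end = 100.0 * sum(flags[-1] for flags in simulations) / n
    total_rounds = sum(len(flags) for flags in simulations)
    rounds = 100.0 * sum(sum(flags) for flags in simulations) / total_rounds
    return at_start, at_end, rounds

# Two hypothetical four-round simulations.
print(collapse_rates([[False, True, True, True],
                      [False, False, False, False]]))
```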

Experiments

Experiment Summary

In the table below, we summarize the experiments presented in this report. Most simulations end in collapse.

Simulation   Model              Dataset                Questions  Collapsed at Start  Collapsed at End  Rounds Collapsed
Replace All  GPT-5.2 Chat       Entity ChatGPT         101        2.97%               88.12%            68.51%
Replace All  GPT-5.2 Chat       Editorial ChatGPT      57         29.82%              91.23%            79.65%
Replace One  GPT-5.2 Chat       Entity ChatGPT         101        2.97%               88.12%            67.18%
Replace One  GPT-5.2 Chat       Editorial ChatGPT      57         21.05%              94.74%            81.67%
Search       GPT-5.2 Chat       Entity ChatGPT         101        1.98%               77.23%            62.31%
Search       GPT-5.2 Chat       Editorial ChatGPT      57         31.58%              75.44%            73.63%
Replace One  Gemini 3 Pro       Entity AI Overview     60         3.33%               80.00%            43.83%
Replace One  Claude Sonnet 4.5  Entity AI Overview     60         1.67%               91.67%            64.33%
Replace One  Gemini 3 Pro       Editorial AI Overview  45         6.67%               68.89%            42.56%
Replace One  Claude Sonnet 4.5  Editorial AI Overview  45         11.11%              97.78%            69.33%
Search       GPT-5.2            Entity ChatGPT         742        7.01%               73.85%            59.93%
Search       GPT-5.2            Editorial ChatGPT      102        7.84%               83.33%            65.85%

See Implementation Details for the distinction between GPT-5.2 and GPT-5.2 Chat.

While the results vary across the experiments, we consistently find the following:

  • The number of unique words and entities mentioned across responses decreases over the course of the simulation.
  • The semantic similarity between responses and the similarity between the entities mentioned increase over the course of the simulation.
  • When both original and self-authored references are available, self-authored references consistently have disproportionate influence on the answer.

In the following sections, we go through the experiments in detail.

All raw data from experiments is available here.

Comparison of Simulations

We first compare the results of the three simulations using GPT-5.2 Chat with 101 prompts from Entity ChatGPT and 57 from Editorial ChatGPT.

After running the Replace All simulation for 10 rounds, replacing all of the references with self-authored references after each round, we find that 88.1% of Entity questions and 91.2% of Editorial questions collapse. The semantic and entity ranking similarities across the responses in a round increase as the simulation progresses. The same answer percentage also increases. The number of unique words and entities decreases across the rounds.

Next, we examine the Replace One simulation, in which we replace one reference with a self-authored reference in each round. Because the Replace One simulation has a mix of self-authored and original references in the early rounds (while Replace All replaces all references after each round), we would expect collapse to be slower and less frequent.

Instead, we find that the collapse begins almost immediately, and the Replace One simulation leads to collapse at least as often as Replace All for both Entity and Editorial questions (88.1% vs. 88.1% and 94.7% vs. 91.2%, respectively).

This result is surprising, and we attribute it to the disproportionate influence of self-authored references on the answer. We see in the figures below that the rate of self-authored references cited in the response far exceeds expectations based on the number of self-authored versus original references. Prior work on LLM-as-judge evaluation has shown that LLMs can recognize and prefer their own generations, and we hypothesize that this bias holds here as well. Note that there is evidence that LLMs tend to favor retrieved content even when it disagrees with their parametric knowledge. We investigate bias in detail in the “Why Do Self-Authored References Have Disproportionate Influence?” section.

Finally, in the Search simulation, rather than replacing references with self-authored content, we add to a pool of content and retrieve relevant items from it. Collapse rates are lower in the Search simulation than in the other two, but the absolute rates remain high: 77.2% and 75.4% for Entity and Editorial questions, respectively. This demonstrates that collapse occurs frequently even in the most realistic simulation.

In the figures below, we summarize the rates of collapse in the different simulations. The percentage of rounds collapsed refers to the total percentage of rounds across all questions in which the responses are collapsed. For error bars, we use Wilson score intervals.

[Figures: collapse rates by simulation]

Note that “Collapsed at Start” rates may vary between Search and the other simulations because Search uses chunks, and rates may also vary generally due to random variation in responses.

The next set of figures plots the percentage of questions collapsed per round for different simulations. This illustrates the surprising finding that Replace One and Search start to collapse immediately, even with a small percentage of self-authored references. While Replace All appears to collapse faster, recall that it replaces 100% of the references in each round, whereas Replace One replaces just 10-20%.

[Figures: percentage of questions collapsed per round, by simulation]

Next, we present additional metrics that illustrate the simulations' behavior. For error bands for metric-by-round line charts, we use one standard error above and below the mean.
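The error bands can be reproduced from per-question metric values for a round. The sketch below assumes the standard error is the sample standard deviation divided by the square root of the number of questions; the function name and values are hypothetical.

```python
import math
import statistics

def mean_and_band(values):
    """Return (lower, mean, upper), where the band is one standard
    error of the mean below and above the mean."""
    mean = statistics.fmean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return mean - se, mean, mean + se

# Hypothetical per-question metric values for a single round.
print(mean_and_band([1.0, 2.0, 3.0, 4.0, 5.0]))
```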

Entity Questions

[Figures: unique entities and unique words per round]

The unique sets of entities and words mentioned across responses decrease over the course of the simulation, indicating a decline in response diversity.

[Figures: semantic similarity and entity ranking similarity per round]

The semantic and entity ranking similarities among pairs of responses generated in the same round increase over the course of the simulation.

[Figure: same answer percentage per round]

The same answer percentage increases over the course of the simulation, from less than 30% in Round 1 to over 90%.

[Figure: AI citation percentage per round]

The percentage of AI citations increases over the course of the simulation and is generally higher than the proportion of self-authored references in the reference pool. Note that AI Citations Over Expected is not defined for the Replace All simulation, as all references are self-authored after the first round.

[Figure: self-authored references retrieved in the Search simulation]

Similar to citations, in the Search simulation, more self-authored references are retrieved than would be expected by chance. Note that there is no retrieval in the Replace All and Replace One simulations.

Editorial Questions

[Figures: semantic similarity and unique words per round]

Semantic similarity increases while the number of unique words decreases for all simulations.

The same answer percentage increases over the course of the simulations.

Self-authored references are disproportionately cited. Note that all references are self-authored after the first round of the Replace All simulation.

Below are visibility snapshots for entities during the Replace One simulation for the question “What are the best video games of all time today?”

In the first round, in which the original references are used to generate the responses, we observe a wide distribution of entities. Some, like Red Dead Redemption 2, are mentioned in 100% of responses (10/10). Others, like DOOM, are mentioned in only 10% of responses (1/10).
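
Visibility, as used in this example, is the fraction of a round's responses that mention an entity. A minimal sketch follows, assuming entities have already been extracted per response. The function name and the toy data are illustrative.

```python
from collections import Counter

def visibility(entity_lists):
    """Map each entity to the fraction of responses that mention it."""
    n = len(entity_lists)
    if n == 0:
        return {}
    # Count each entity at most once per response.
    counts = Counter(e for entities in entity_lists for e in set(entities))
    return {entity: count / n for entity, count in counts.items()}

# Toy round of 10 responses: one entity in all of them, another in just one.
responses = [["Red Dead Redemption 2"] for _ in range(9)]
responses.append(["Red Dead Redemption 2", "DOOM"])
vis = visibility(responses)
```

Here `vis["Red Dead Redemption 2"]` is 1.0 (10/10) and `vis["DOOM"]` is 0.1 (1/10), matching the two cases described above.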

After converting the original responses into articles and using them as references, the distribution begins to shift.

By the fifth round, the distribution is much more peaked: each entity appears in either all of the responses or none of them.

By the tenth round, more entities drop out, including “Half-Life”, which at the beginning of the simulation had 100% visibility. This demonstrates the random nature of the collapse.

Next, we examine the actual responses to this question at rounds 1 and 20 (the final round).

Round 1: The first few games are the same, but after that, the answers differ.

Round 20: While the answers have slightly different formatting, the games mentioned are exactly the same, and in the same order, and the text is nearly word-for-word identical.

As with model collapse via retraining, responses at the end of the simulations differ substantially from the original responses. Rather than a variety of words and entities in different orders, the wording is nearly identical, and the same set of entities is mentioned in the same order. Some entities that initially had 100% visibility drop out completely.

Comparison of Models

Next, we compare the results of the Replace One simulation across models. To ensure a fair comparison, we restrict the analysis to questions that appear in both the ChatGPT and AI Overview datasets, which yields 41 Entity Questions and 45 Editorial Questions. As a result, the numbers here differ from those presented in the Experiment Summary and the Appendix. Note that we use different initial references for GPT-5.2 Chat (from the ChatGPT UI) than for Gemini 3 Pro and Claude Sonnet 4.5 (from Google AI Overviews), because we wanted to pair each model with references it would actually use. However, this means there may be differences in reference quality.

Overall, we observe collapse with all three models, demonstrating that AI search collapse is not specific to OpenAI models. Gemini 3 Pro has lower collapse rates (though in most cases with overlapping confidence intervals) and a smaller drop in unique entities per round.

Most simulations lead to collapse regardless of the model used to generate responses and articles, though fewer simulations with Gemini 3 Pro collapse than with the other models.

We next review the individual metrics.

Entity Questions

Notice that the three models return very different numbers of entities. Claude and Gemini tend to be more comprehensive than GPT-5.2 Chat, including more entities. However, the mean number of unique entities per round decreases with all models.

The mean number of unique words per round also decreases with all models.

The mean semantic similarity of the responses increases for all models, though more slowly for Gemini 3 Pro.

The entity ranking similarity of the responses increases at a similar rate across all models.

The same answer percentages also increase over the course of the simulation, though more slowly for Gemini 3 Pro and Claude Sonnet 4.5.

Self-authored references are disproportionately cited by all models.

Editorial Questions

For Editorial Questions, similarly, the number of unique words decreases with all models.

The semantic similarities increase with all models, though more slowly for Gemini 3 Pro.

The same answer percentages also increase over the course of the simulation, though more slowly for Gemini 3 Pro.

Self-authored references are disproportionately cited by all models.

Larger-Scale Experiments

Finally, we conduct larger-scale experiments with the Search simulation using GPT-5.2 to ensure that our results generalize to many questions. Note that these experiments use GPT-5.2 rather than GPT-5.2 Chat, but the two model variants give similar results.

We run the Search simulation on 742 additional Entity Questions that were not part of the preceding experiments. The results are consistent with those presented above.

We also run simulations for the 102 additional Editorial Questions. Again, the results are consistent with the experiments presented previously.

Why Do Self-Authored References Have Disproportionate Influence?

The fact that the Replace All simulation leads to collapse is not surprising, as it most closely mirrors the original model collapse paper. However, in the Replace One simulation, we found that collapse begins even when self-authored references constitute only a small proportion of the total reference set. That is, we start to see evidence of collapse after adding even a single self-authored reference, or 10-20% of the total references. We find that these references are disproportionately cited, suggesting they exert greater influence on the response than the original references.

Is this because the references are AI-generated, or because they are self-authored? In this section, we analyze each candidate reference in round 2 of the Replace One simulation, in which self-authored and original references compete directly for citations.

We find that 39.7% of the original references in this dataset are AI-generated. However, only about 3% of entity questions are collapsed at the start, before any self-authored references are added. Collapse emerges only when self-authored references are introduced. This suggests self-authorship, not AI generation, is the driver of collapse. We test this more thoroughly below.

Self-authored references are also cited far more often than original references, both AI-generated and human-written. We use the AI detector to split references three ways: human-written originals, AI-generated originals, and self-authored references. AI-generated originals are cited at 9.4%, only modestly higher than the human-written originals (7.4%), while self-authored references are cited at 38.9%.

One explanation is that self-authored references are simply higher quality. To test this, we use GPT-5.2 to independently score each reference on eight dimensions, each on a 1–5 scale: overall, relevance, accuracy, thoroughness, specificity, up to date, organization, and direct answer. Note that although LLM-as-judge is subject to bias, here we explicitly want to know what the LLM thinks, since it is the LLM choosing which references to cite.

Self-authored references do score higher on most dimensions, with the largest gaps on direct answer (+1.23), relevance (+0.78), thoroughness (+0.74), organization (+0.65), and overall (+0.49). The two types are roughly tied on accuracy and up to date, and original references have a slight edge on specificity.

To control for this quality advantage, we restrict to references the judge scored 5/5 on the three dimensions where self-authored references most consistently scored highest: direct answer, organization, and relevance. Even within this subset (n = 193: 75 self-authored, 118 original), self-authored references are cited at 38.2% compared to 13.3% for original references, a 2.88x gap.

As a stronger test, we regress citation rate on two indicators, whether a reference is self-authored and whether it is an AI-generated original, controlling for scores on all eight dimensions. Being self-authored increases the citation rate by 0.26 (95% bootstrap CI: [+0.23, +0.29], p < 0.001), whereas being an AI-generated original has no effect (−0.01, not significant). In other words, after controlling for all eight measurable quality dimensions, self-authorship alone adds about 26 percentage points to a reference's citation rate, while AI generation adds none.
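
The shape of such an analysis can be sketched with a small linear probability model and a percentile bootstrap. This is an illustration on synthetic data, not the report's code or data: the single quality column, the simulated effect sizes, and all function names are assumptions, and a real analysis would likely use a statistics library rather than this hand-rolled solver.

```python
import random

def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def bootstrap_ci(X, y, coef_index, n_boot=200, seed=0):
    """95% percentile bootstrap interval for one coefficient, resampling rows."""
    rng = random.Random(seed)
    n = len(y)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        coef = ols([X[i] for i in idx], [y[i] for i in idx])
        estimates.append(coef[coef_index])
    estimates.sort()
    return estimates[int(0.025 * n_boot)], estimates[int(0.975 * n_boot) - 1]

# Synthetic references: columns are intercept, self-authored indicator,
# AI-generated-original indicator, and one 1-5 quality score (all made up).
rng = random.Random(1)
X, y = [], []
for _ in range(200):
    group = rng.choice(["self", "ai_original", "human_original"])
    quality = rng.randint(1, 5)
    p_cited = 0.05 + 0.26 * (group == "self") + 0.03 * quality
    X.append([1.0, float(group == "self"), float(group == "ai_original"), float(quality)])
    y.append(1.0 if rng.random() < p_cited else 0.0)

coefficients = ols(X, y)
low, high = bootstrap_ci(X, y, coef_index=1)
```

The coefficient at index 1 estimates how much self-authorship changes the citation rate with the quality score held fixed, and `(low, high)` is its bootstrap interval.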

Higher judged quality therefore does not fully explain why self-authored references are cited more often: the higher citation rate persists even after controlling for all eight quality dimensions, which suggests self-bias. Prior work showed that LLMs prefer their own generations. A self-authored reference is direct, agrees with what the model would say, and is written in the model's style, and our experiments cannot separate these three properties. In particular, we cannot tell whether the model prefers these references because their content matches its answer or because of their style, and because the quality scores are coarse, we cannot fully rule out that they are simply more direct. In future work, we could add a reference that states a different answer, or test whether one model over-cites another model's generated references.

This does not mean AI-generated content is harmless. The references that drive collapse are the model’s own answers, and as AI is increasingly used to answer questions, a growing share of published content may be generated this way.

In the Search simulation experiments, we also found that self-authored references are disproportionately retrieved. The retrieval mechanism itself may favor them due to higher semantic similarity with the query. We leave a deeper investigation of these mechanisms to future work.

What Types of Prompts Collapse?

We aggregate all of the Search simulation experiments using the Entity Questions dataset and examine several factors to see which are correlated with collapse. Specifically, we look at statistics of the generations in the initial round, using only the original references. The factors most correlated with collapse are the number of unique entities and the mean length of responses.

Factor (initial round)       Pearson correlation    p-value
unique entities              -0.342                 1.32e-24
length of responses          -0.245                 5.31e-13
unique words                 -0.223                 6.21e-11
same answer %                 0.151                 1.01e-05
entity ranking similarity     0.164                 1.61e-06

While statistically significant, these correlations are weak to moderate, which suggests that other factors also contribute to collapse. Still, prompts whose initial responses are longer and contain more unique entities, that is, prompts whose responses are more varied to begin with, are more resistant to collapse.
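
For reference, a Pearson correlation between an initial-round statistic and a 0/1 collapse indicator can be computed as in the sketch below. The data are made up for illustration, and the report's own computation, including how the p-values were obtained, is not shown here.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Made-up example: unique entities in round 1 vs. whether the prompt collapsed.
unique_entities_round1 = [5, 10, 15, 20]
collapsed = [1, 1, 0, 0]
r = pearson(unique_entities_round1, collapsed)
```

A negative `r` here means prompts with more unique entities in the first round collapse less often, which is the direction reported in the table.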

Which Entities Drop Out?

Next, we examine which entities tend to drop out of responses by the end of the simulation. As expected, the probability that an entity drops out decreases as its initial visibility increases. Low-visibility entities, which initially appear in few responses, drop out very frequently, while high-visibility entities rarely do. Even so, entities with 100% initial visibility drop out about 7-10% of the time.
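
A minimal sketch of this grouping, assuming each entity is recorded as a pair of its initial visibility and a flag for whether it is absent from all final-round responses. That definition of dropping out, the function name, and the toy records are assumptions.

```python
from collections import defaultdict

def dropout_rate_by_visibility(records):
    """Map each initial-visibility level to the fraction of entities that dropped out.

    records: iterable of (initial_visibility, dropped_out) pairs.
    """
    totals = defaultdict(int)
    drops = defaultdict(int)
    for initial_visibility, dropped_out in records:
        totals[initial_visibility] += 1
        drops[initial_visibility] += int(dropped_out)
    return {level: drops[level] / totals[level] for level in sorted(totals)}

# Toy records: three low-visibility entities and two full-visibility entities.
records = [(0.1, True), (0.1, True), (0.1, False), (1.0, False), (1.0, False)]
rates = dropout_rate_by_visibility(records)
```

Plotting `rates` with visibility on the x-axis gives a curve of the kind described above.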

We also find a correlation between visibility and position: high-visibility entities tend to appear near the beginning of the response. (We provide data supporting this in our paper How to Track Entity Position in AI.) Therefore, better-positioned entities tend to drop out less frequently.

Does RAG Mirror the Input Entity Distribution?

Previously, we hypothesized that RAG does not mirror the input distribution. Here we provide direct evidence: the distribution of entity mentions in responses does not necessarily mirror the distribution in the references. For each entity, we plot how many responses mention it (y-axis) against how many references mention it (x-axis). Across all models, an entity is generally mentioned by more responses than references. Gemini 3 Pro includes low-mention-count entities in its responses more often than the other models.
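
The points in such a plot can be assembled as in the sketch below, which assumes entities have already been extracted from each reference and each response. The function name and the toy inputs are hypothetical.

```python
from collections import Counter

def mention_points(reference_entities, response_entities):
    """Map each entity to (references mentioning it, responses mentioning it)."""
    ref_counts = Counter(e for entities in reference_entities for e in set(entities))
    resp_counts = Counter(e for entities in response_entities for e in set(entities))
    all_entities = set(ref_counts) | set(resp_counts)
    return {e: (ref_counts[e], resp_counts[e]) for e in all_entities}

# Toy question with three references and three responses.
points = mention_points(
    reference_entities=[["A", "B"], ["A"], ["C"]],
    response_entities=[["A", "C"], ["A", "C"], ["A", "C"]],
)
```

Each value is an (x, y) pair for the plot; entity "C" is mentioned by one reference but by all three responses, which is the kind of mismatch described above.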

More Examples

Here are more animations of simulations from the larger-scale experiments.

Simulation Design Choices

The simulations require a number of configuration choices. While developing the approach for this report, we conducted 68 experiments to ensure the results were fair and did not depend on a single simulation detail. Here are some factors we found that accelerate collapse but that we chose not to include in the final experiments.

In early experiments, GPT-5.1 returned very short responses, resulting in short articles and quick collapse. We changed the prompt to elicit longer responses, so the average length of the original and self-authored references is similar.

We generally observed collapse across all prompts we tried, though we did not experiment with prompts explicitly designed to prevent collapse; we plan to study this in future work.

In the Search simulation, larger chunk sizes increase the rate and probability of collapse. The fact that smaller chunks lead to less collapse supports the hypothesis that self-authored references have disproportionate influence because they provide a direct and complete answer. However, the results we report here use the default chunk size.
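
To make the role of chunk size concrete, the sketch below splits a reference into fixed-size word chunks. The report does not specify its chunker or default size in this section, so the word-based splitting and the function name are assumptions.

```python
def chunk_words(text, chunk_size):
    """Split text into consecutive chunks of at most chunk_size words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]

# With a small chunk size, a short article is spread over several chunks.
small_chunks = chunk_words("a b c d e", chunk_size=2)
# With a large chunk size, the whole article fits in a single chunk.
large_chunks = chunk_words("a b c d e", chunk_size=100)
```

With a large chunk size, a retrieved chunk can contain an entire self-authored answer, whereas small chunks expose only part of it.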

Initially, we used raw responses as references, which led to more rapid and frequent collapse. However, we were concerned this was unfair because the self-authored and original references looked very different, and decided to convert them into articles.

In early experiments, we used smaller models to reduce costs, and in some cases, their responses collapsed more quickly. We chose to use larger frontier models that power consumer-facing AI systems for the experiments in this report.

Can We Mitigate AI Search Collapse?

In practice, there may be ways to mitigate collapse:

  • Filter out AI-generated references during retrieval
  • Diversify search results to filter out near-duplicates
  • Encourage the model to favor diversity or fall back to its parametric knowledge via the system prompt or fine-tuning
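
As one hedged illustration of the second mitigation, the sketch below drops search results that nearly duplicate an earlier result, using Jaccard overlap of word shingles. The shingle size, the 0.8 threshold, and all names are assumptions, and nothing here reflects how any commercial system works.

```python
def shingles(text, k=3):
    """Set of k-word shingles; texts shorter than k words yield one shingle."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard overlap of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def filter_near_duplicates(results, threshold=0.8, k=3):
    """Keep each result only if it is not too similar to any result already kept."""
    kept, kept_shingles = [], []
    for text in results:
        current = shingles(text, k)
        if all(jaccard(current, other) < threshold for other in kept_shingles):
            kept.append(text)
            kept_shingles.append(current)
    return kept

results = [
    "the best games of all time are listed here",
    "the best games of all time are listed here",
    "a completely different article about cooking pasta at home",
]
diverse = filter_near_duplicates(results)
```

Results are processed in ranked order, so the highest-ranked copy of each near-duplicate group is the one that survives.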

We do not know precisely how commercial AI systems work, so these mechanisms may already be in place. However, whether these mitigations will be effective remains an open question.

Diversifying search results would only help if there is diverse, relevant information online; in some cases, a lack of diversity indicates real consensus, which can be challenging to distinguish.

AI Detection

Filtering out AI-generated references could mitigate the risk of collapse, since self-authored articles are AI-generated. Are AI companies already doing it?

As discussed in the Datasets section, we found that as of June 2026, 42.7% of the references for prompts in our dataset are AI-generated. While it does not appear that ChatGPT is filtering out AI-generated references now, would this be an effective strategy generally?

While AI detection algorithms are accurate at distinguishing purely AI-generated content from human-written content, their performance on AI-generated, human-edited content is less well studied. Additionally, identifying AI-generated content is likely to become more challenging as models improve.

We may be able to predict self-authorship directly by comparing a reference to an AI’s answer, but this is more specific and challenging than standard AI detection.

And as we have seen, even a single self-authored reference can start to trigger collapse.

Knowledge Collapse

The self-authored references in our simulations are also AI-generated: AI wrote the text, and the substance is its own answer. But our results suggest that it is the substance, not the AI-written text, that drives the collapse.

There is also potential for collapse if many writers use AI as a thought partner and the substance of their articles comes from AI, even if they write or edit the final text themselves. For example, a writer covering “the best movie of all time” may use ChatGPT to compile the list of movies without fully AI-generating the article. The article is not self-authored by our definition, yet the substance is still the model's answer. We refer to this as knowledge collapse. In this setting, AI detection would not help.

This is also why AI detection is a limited defense more generally. It looks for AI-generated text, while collapse follows whether the substance is the model’s own answer. A human-written article based on the model's answer may drive collapse but would not be flagged, while an AI-written article based on a human's own conclusions would be flagged but does not drive collapse. This is consistent with our finding that AI-generated references are not over-cited unless their substance is the model’s own answer.

A similar dynamic occurs without AI when humans read the same sources or rely on Google's top results while researching an article. But the cost of creating that content is much lower with AI, making it easier to flood the internet with derivatives.

Limitations

Our simulations seed each question with references retrieved from consumer AI tools in January 2026, which we call original references. We measure that 38.9% are already AI-generated, and we cannot directly identify which are self-authored, the subset that drives collapse. Any self-authored content already in the seed pool would only reduce the contrast we measure, so our collapse and disproportionate-influence results are conservative.

Our simulations suggest that retrieving self-authored articles will lead to collapse, but we do not definitively prove that AI search collapse is already happening. As discussed, there may already be mechanisms in place intended to mitigate collapse, though it is unclear whether they will be successful.

We focus on information-seeking prompts, especially those involving comparisons of named entities. It is possible that collapse behaves differently on other types of prompts.

Conclusion and Future Work

AI-generated content is already prevalent online, and AI systems already retrieve it. We show that this feedback loop leads to a significant reduction in the diversity of AI responses in our simulations. We find that this collapse is consistent across models, question types, and simulation designs. Surprisingly, even a single self-authored reference can start to trigger collapse, because self-authored references have disproportionate influence on the response. Whether current or future mitigations can prevent AI search collapse in practice remains an open question.

Future work in progress includes studying collapse under agentic search, the effectiveness of mitigations, and how collapse may affect other types of prompts.

Appendix

In the Comparison of Models section, we presented experiments comparing models on questions common to both the ChatGPT and AI Overview datasets. Here we provide complete results on those datasets, including questions that are not in the intersection.

Replace One, GPT-5.2 Chat

Replace One, Gemini 3 Pro

Replace One, Claude Sonnet 4.5

Gregory Druck, PhD
Chief AI Officer
Gregory Druck is Chief AI Officer at Graphite.io, where he leads a team of scientists and engineers building AI tools for growth and researching how AI is reshaping marketing. Previously, he was the Chief Data Scientist at Yummly, where he built NLP and computer vision systems for the smart kitchen. Before that, he was an NLP and search researcher at Yahoo! Research, with internships at Google and Microsoft. He earned a Ph.D. from the University of Massachusetts Amherst, where he worked on semi-supervised and active machine learning with Andrew McCallum.
Ethan Smith
CEO
Ethan Smith is CEO of Graphite.io, a research-driven growth agency that works with companies like Webflow, Adobe, and Upwork. He is an adjunct professor at IE Business School and teaches SEO and AEO at Reforge. His research has been published in ACM, Axios, Financial Times, and The Atlantic. Prior to founding Graphite, Ethan was a growth advisor to Masterclass, Robinhood, and Honey. Ethan was a research assistant focused on human-computer interaction and psychology at UC Santa Barbara and University College London.