What you need to know about llms.txt

Back in October, we had the pleasure of hearing The Atlantic’s CEO, Nicholas Thompson, speak at the AMO Summit. One of the prominent topics he addressed was the commercial deal the company struck with OpenAI. While controversial, his point-of-view is very specific: AI companies have two costs – compute and training data (i.e. content) – yet why are they only willing to pay for the former?

More recently, Bluesky CEO Jay Graber addressed the consent rights of creators and users. At SXSW in March, Graber outlined a potential framework the company is developing that would allow users to provide consent as to whether and how their data is used for generative AI.

Companies that create content are beginning to push back on generative AI’s unfettered use of content on the internet. A new standard proposed by technologist Jeremy Howard, called llms.txt, is another step towards a more standardized, structured way for AI models to interact with content on the internet. While the proposal first came to light in the fall of 2024, it has been gaining traction of late.

We’re here to help clarify (and demystify) what llms.txt is, and to provide a short FAQ covering what you need to know in case you’re interested in testing the potential efficacy of this newly proposed standard.

What is llms.txt?

llms.txt is a proposed web standard designed to help large language models (LLMs) such as ChatGPT better understand your site on your terms. It’s like robots.txt, but instead of guiding search engine bots, it helps LLMs access clean, structured content and learn what they should and shouldn’t use. Think of it as a roadmap for AI systems to find:

  • What your site is about (summary)
  • Which pages are most important (key links)
  • Whether they’re allowed to use your content (permissions)

Why is something like this being proposed?

Simply put, websites were not built to allow AI tools or LLMs to crawl or access their information easily; websites are far too complex. Per llmstxt.org: “Converting complex HTML pages with navigation, ads, and JavaScript into LLM-friendly plain text is both difficult and imprecise.”
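As a rough illustration of that imprecision, here is a naive tag-stripping sketch using Python’s standard-library HTML parser (the class name and HTML snippet are invented for the example). Even after discarding scripts, navigation text survives right alongside the article body, which is the kind of noise llms.txt aims to help models avoid:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Naive extractor: strips tags but keeps nav text,
    illustrating why HTML-to-plain-text conversion is imprecise."""
    def __init__(self):
        super().__init__()
        self.skip = 0          # depth inside <script>/<style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        # Keep visible, non-empty text only
        if not self.skip and data.strip():
            self.chunks.append(data.strip())

html = """<html><body>
<nav>Home | Products | Login</nav>
<script>trackUser();</script>
<p>Our flagship product ships in March.</p>
</body></html>"""

parser = TextExtractor()
parser.feed(html)
# The nav menu text is mixed in with the actual article content
print(" ".join(parser.chunks))
```

A real crawler faces far messier markup than this; the point is simply that an HTML page carries no reliable signal about which text matters.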

What is inside an llms.txt file?

Placed at the root of your site, this simple text file can include:

  • A brief description of your site
  • A list of important URLs (e.g., /about, /products, /docs, etc.)
  • Optional markdown (.md) versions of pages for easier parsing by LLMs
  • Licensing info or terms that declare if and how your content can be used
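For illustration, a minimal llms.txt following the markdown format sketched at llmstxt.org might look like the following (the site name, URLs, and descriptions are placeholders). The format calls for an H1 title, a blockquote summary, and H2 sections containing annotated link lists; a section named “Optional” marks links that can be skipped when context is limited:

```markdown
# Example Publisher

> An independent publisher covering technology and media.

## Key pages

- [About](https://example.com/about.md): Who we are and what we cover
- [Products](https://example.com/products.md): Our product catalog
- [Docs](https://example.com/docs.md): Developer documentation

## Optional

- [Archive](https://example.com/archive.md): Older articles and past coverage
```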

Why would you consider creating an llms.txt?

As mentioned above, llms.txt works much like a robots.txt file, but it offers potential benefits specific to publishers and website owners:

  • Control: Provide guidelines around which of your site’s content LLMs may train on
  • Visibility: Maximize discovery of valuable content to LLMs and users of AI systems
  • Clarity: Minimize potential issues LLMs may encounter if summarizing or referencing content from your site
  • SEO Synergy: Align your site to the evolving search dynamics

What are the potential drawbacks?

  • Adoption: This standard is not yet widely adopted – neither by AI companies nor by websites – and AI companies can still bypass it
  • Management with other sitemaps: If not maintained, conflicts with robots.txt or XML sitemaps could introduce inconsistencies in how crawlers interpret your site
  • Competition: There is a risk of competitors scraping your content to identify gaps and/or weaknesses
  • Redundancy: There are still open questions about its efficacy; is it any more effective than existing, comparable tools (XML sitemaps, robots.txt)?

What’s the bottom line?

AI and LLM-powered search are not going away, but increased regulation of how content is used by these models and systems is gaining significant momentum. llms.txt is one example of this and a step in the right direction. And while not yet an industry standard, it gives you a voice in how your content is used. Its functional approach is proven in other contexts (e.g., search engines), and it provides an easy way to protect your IP and make your content more discoverable.

____________________________

Sources / Additional Reading:

If you’re interested in learning more about llms.txt and its adoption and applications, please consult the sources below:

  • llmstxt.org
  • “The Role and Functionality of llms.txt in LLM-Driven Web Interactions” (Profound)
  • “LLMs.txt Explained” (Towards Data Science)
  • “Getting Started with llms.txt” (David Dias)
  • “Meet LLMs.txt, a proposed standard for AI website content crawling” (Search Engine Land)