llms.txt directory


Yes! Please standardize the web into simple hypertext so “LLMs can use it”. I promise I won’t build any tools to read it without the ads, tracking, and JavaScript client-side garbage myself. I will not partake in any such efforts to surf the web as it was intended to be, before its commercialization and commodification. No, sir, I could never!

The problem I see is that people will develop two versions of websites -- the LLM version, optimized to give the model a good impression of their products or services and get them into the training data, and the human version (with sub-versions for mobile and such), which will be SEO'd to hell.

No one wins in the long run by creating technical solutions to human incentive problems. It is just a prolonged arms race until

* the incentives are removed

* the process is made so technically complex or expensive that only a few players can profit from it

* it is regulated such that people can make money doing other things which have better risk/reward

* most people just avoid the whole ecosystem because it becomes a cesspool

A brief guide on avoiding this pitfall, for search engine and LLM operators:

Step 1: Penalize the ranking and visibility of sites whose llms.txt differs from a random sampling of their actual web (HTML) content.

Step 2: There is no step 2.

If you can reliably machine-diff format A and format B, there's no need for two different formats in the first place.
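A rough sketch of what such a consistency check could look like, assuming the crawler has already fetched both the llms.txt and a sample of HTML pages (the tokenization, the Jaccard overlap, and the cutoff value are all arbitrary, illustrative choices):

```python
# Sketch: estimate how far a site's llms.txt has drifted from its real pages.
# Everything here (tokenizer, Jaccard overlap, the 0.7 cutoff) is illustrative.
import re

def tokens(text: str) -> set[str]:
    """Lowercased word set with HTML tags crudely stripped."""
    text = re.sub(r"<[^>]+>", " ", text)
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def divergence(llms_txt: str, sampled_pages: list[str]) -> float:
    """0.0 = llms.txt fully overlaps some sampled page, 1.0 = no overlap at all."""
    llm_words = tokens(llms_txt)
    if not llm_words:
        return 1.0
    best = 0.0
    for page in sampled_pages:
        page_words = tokens(page)
        union = llm_words | page_words
        if union:
            best = max(best, len(llm_words & page_words) / len(union))
    return 1.0 - best

# A ranking pipeline might, say, demote sites where divergence(...) > 0.7.
```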

> No one wins in the long run by creating technical solutions to human incentive problems. It is just a prolonged arms race

But maybe this time the naive technical solution will work

Everything is temporary.

You, me, LLMs, Google, humanity, Earth, Sol...

We can choose to carry on and perform [what we think are] improvements and make the best of it, or we can choose to cash it in early and simply give up.

Brief guide for site optimizers:

1. Figure out how to embed content that only LLMs see and that affects their output

2. Wait for that to stop working

3. Innovate another way to get past the new technical obstacle

llms.txt has a section on "Existing standards" which completely overlooks well-known URIs [0]; an issue was opened three months ago [1] but seems to have been ignored.

[0] https://en.wikipedia.org/wiki/Well-known_URI

[1] https://github.com/AnswerDotAI/llms-txt/issues/2

Exactly, this fits perfectly in the `.well-known` use cases. What a shame.

Unfortunately this requires registration, which is not a simple process.

No more than throwing stuff in the root.

Folks, please note that this proposal is designed to help end users who wish to use AI tools. For instance, so that when you use Cursor or VS Code, you can get good documentation about the libs you're coding with, for the LLM to help you better.

It’s not related to model training. Nearly all the responses so far are about model training, just like last time this came up on HN.

For instance, I provide llms.txt for my FastHTML lib so that more people can get help from AI to use it, even though it's too new to be in the training data.

Without this, I’ve seen a lot of folks avoid newer tools and libs that AI can’t help them with. So llms.txt helps avoid lock-in of older tech.

(I wrote the llms.txt proposal and web site.)
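For anyone who hasn't seen one: the format is just markdown, with an H1 title, a blockquote summary, and H2 sections listing links. A minimal sketch (the project name and URLs below are made up for illustration):

```markdown
# ExampleLib

> ExampleLib is a small web framework for building server-rendered apps.

## Docs

- [Quick start](https://example.com/docs/quickstart.md): A short tour of the core API
- [Reference](https://example.com/docs/reference.md): Full list of functions and parameters

## Optional

- [Design notes](https://example.com/docs/design.md): Background reading; safe to skip
```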

This sounds quite good in that case. I've been attempting to convert documentation into markdown to feed to the LLMs to get fresher/more accurate responses.

In this case, we might need a versioning scheme. Libraries have multiple versions and not everyone is on the latest. I still need a way to point my LLM to the version I'm actually using.
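The spec doesn't cover versioning today; one hypothetical workaround is to publish a link section per release (or separate files like /docs/2.x/llms.txt) and point your tool at the one matching your installed version. Purely illustrative:

```markdown
## Docs (v2.x)

- [v2 API reference](https://example.com/docs/2.3/api.md): Current stable release

## Docs (v1.x, legacy)

- [v1 API reference](https://example.com/docs/1.9/api.md): For projects still on 1.x
```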

Have there been any declarations by various AI companies (e.g. OpenAI, Anthropic, Perplexity) that they are actually relying upon these llms.txt files?

Is there any evidence that the presence of the llms.txt files will lead to increased inclusion in LLM responses?

And if they are, can I put subtly incorrect data in this file to poison LLM responses while keeping my content designed for humans of the best quality?

I'm curious, what would be the reason for doing this?

If one doesn’t want LLMs to scrape data and knows the LLMs will be ignoring the robots.txt file.

Undermine the usefulness of LLMs in an attempt to force people to visit your site directly.

Anthropic itself publishes a bunch of its own llms.txt files, so I guess that means something.

It's telling that nearly every site listed is directly involved with AI in some way; unsurprisingly, the broader internet isn't particularly interested in going out of its way to make its content easier for AI companies to scrape.

Deliberately putting garbage data in your llms.txt could be funny though.

You seem to be misunderstanding why a website would make llms.txt

Obviously, they would not make it just for an AI company to scrape

Here's an example. Let's say I run a dev tools company, and I want users to be able to find info about me as easily as possible. Maybe a user's preferred way of searching the web is through a chatbot. If that chatbot also uses llms.txt, it's easy for me to deliver the info, and easy for them to consume. Win-win

Of course adoption is not very widespread, but such is the case for every new standard.

The point of LLMs is that they can make sense of the web the same way humans can (roughly speaking); so why do they get the special treatment of a direct, ad-free, plain-text version of the actual info they’re looking for, while humans aren’t allowed to scroll through a salad recipe without being bombarded with 20 ads?

A human could read the llms.txt if they want to. And a developer could put ads in llms.txt if they wanted to!

I've seen many people joke about intentionally poisoning training data but has that ever worked?

It's hard to gauge the effectiveness of poisoning huge training sets since anything you do is a figurative drop in the ocean, but if you can poison the small amount of data that an AI agent requests on-the-fly to use with RAG then I would guess it's much easier to derail it.

This study suggests that controlling 0.1% of the training data may be enough.

https://arxiv.org/abs/2410.13722v1

I have noticed some popular copied but incorrect leetcode examples leaking into the dataset.

I suspect it depends on domain specificity, but that seems within the ability of an SEO spammer or decentralized group of individuals.

Seems silly to put garbage data there. Like intentionally doing bad SEO so Google doesn't link you.

I think you should think about it as: I want the LLM to recognize my site as a high quality resource and direct traffic to me.

Imagine a user asks ChatGPT a question. The LLM has scraped your website and answers the question. The user wants some kind of follow-up -- read more, what's the source, how can I buy this, whatever -- so the LLM links the page it got the data from.

LLMs seem like they're supplanting search. Being early to work with them is an advantage. Working to make your pages look low quality seems like an odd choice.

That sounds like those “react youtubers” taking your content without permission and telling you that you should be grateful for the exposure.

Is it a good response to the react YouTubers to make your content terrible? Or to provide something in your content not available on theirs?

Whether you like it or not LLMs are going to be how people explore the web. They simply work better than search engines - not least because they can quickly scan numerous sites simultaneously, consume and synthesize the content.

You can choose to sabotage your own content in a likely futile effort to make things worse for LLM users if you want - my point is just that it serves no purpose and misses out on the opportunities in front of you.

Are you kidding? The follow-through on attribution links, where present, is nearly zero. There are no gains to be had here, only losses.

I'd prefer not to play that game. I'd rather lose a bit of money and traffic and not help LLMs as far as humanly possible.


Making it easier for tech companies to steal my art. Sure, I will get right to it. In what world do these thieves live? I hope they catch something nasty!

Has your art been stolen in the past? If so, how did you get it back?

Poorly specified, wedging structured data into markdown, not widely supported, ignores /.well-known/.

I also don’t understand the problem it purports to solve.

Perplexity is listed, but do they actually abide by llms.txt? And how can we prove they do? Is it all good faith? I wish there were a better way.

llms.txt isn't an opt-out signal like robots.txt, it's a way to provide a simplified version of pages that are easier for LLMs to ingest. It's more of an opt-in for being scraped more accurately.

Or scraped inaccurately. It seems like you could have some fun with this if you were so inclined...

For trusted sites, this is a logical next step.

Why should websites implement yet another custom output format for people^Wsoftware that won’t bother to use existing, loosely yet somewhat structured, open formats?

In a world where people gravitate to LLMs for quick answers instead of wading through ads and whatnot, it seems like you would want an LLM to cite your content for further context. If the user just wanted an answer, they probably wouldn't have spent much time on your site anyway.

The ads and whatnot are why the site exists! That’s the point, with the content being the hook. If people aren’t looking at the ads, it’s a loss.

This is a great resource to at least figure out all the LLMs out there and block them. I already updated my robots.txt file. Of course, that is not sufficient, but at least it's a start and hopefully the blocking can get more sophisticated as time goes on.

It looks like the opposite. It is a way to make your site easier to parse for LLMs.

It is, but you can use it as a list of targets for blocking.
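For reference, blocking at the crawler level is just a few robots.txt entries. The user-agent names below are the publicly documented crawlers for OpenAI, Anthropic, and Perplexity; note that honoring robots.txt is entirely voluntary on the crawler's part, and the names change over time:

```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /
```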

it's not "productive", of course, but i don't see any issue with expressing this opinion whatsoever. and i say this being about as starry-eyed a techno-llm-utopian-esque dreamer as they come... sure, the "google" version of LLMs paving over industry has already crossed the rubicon, but everyone should have to reckon the value that they are truly providing not just for consumers but for producers as well... and no one should be offended by showing up in someone's robots.txt... just as i'm sure this commenter is realistic enough to know and understand that putting entries in one's robots.txt is nothing more than a principled, aspirational statement about how the world should be, rather than any sort of real technological impediment.

(But we'll just ignore the obvious irony in that end bit about detection of bots getting smarter... wonder where all this "intelligence" will come from? Probably not some natural source, but possibly some sort of... Antinatural Intelligence?)