llms.txt directory


Yes! Please standardize the web into simple hypertext so “LLMs can use it”. I promise I won’t build any tools to read it without the ads, tracking, and JavaScript client-side garbage myself. I will not partake in any such efforts to surf the web as it was intended to be, before its commercialization and commodification. No, sir, I could never!

The problem I see is that people will develop two versions of websites -- the LLM version, optimized to give the model a good impression of their products or services and get them into the training data, and the human version (with sub-versions for mobile and such), which will be SEO'd to hell.

No one wins in the long run by creating technical solutions to human incentive problems. It is just a prolonged arms race until

* the incentives are removed

* the process is made so technically complex or expensive that only a few players can profit from it

* it is regulated such that people can make money doing other things which have better risk/reward

* most people just avoid the whole ecosystem because it becomes a cesspool

A brief guide on avoiding this pitfall, for search engine and LLM operators:

Step 1: Penalize the ranking and visibility of sites whose llms.txt differs from a random sampling of their actual web (HTML) content.

Step 2: There is no step 2.

If you can reliably machine-diff format A and format B, there's no need for two different formats in the first place.
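A rough sketch of what such a consistency check could look like, assuming the crawler has already fetched both the llms.txt and a sample of HTML pages (the tokenization, the Jaccard overlap, and the cutoff value are all arbitrary, illustrative choices):

```python
# Sketch: estimate how far a site's llms.txt has drifted from its real pages.
# Everything here (tokenizer, Jaccard overlap, the 0.7 cutoff) is illustrative.
import re

def tokens(text: str) -> set[str]:
    """Lowercased word set with HTML tags crudely stripped."""
    text = re.sub(r"<[^>]+>", " ", text)
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def divergence(llms_txt: str, sampled_pages: list[str]) -> float:
    """0.0 = llms.txt fully overlaps some sampled page, 1.0 = no overlap at all."""
    llm_words = tokens(llms_txt)
    if not llm_words:
        return 1.0
    best = 0.0
    for page in sampled_pages:
        page_words = tokens(page)
        union = llm_words | page_words
        if union:
            best = max(best, len(llm_words & page_words) / len(union))
    return 1.0 - best

# A ranking pipeline might, say, demote sites where divergence(...) > 0.7.
```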

> No one wins in the long run by creating technical solutions to human incentive problems. It is just a prolonged arms race

But maybe this time the naive technical solution will work

Everything is temporary.

You, me, LLMs, Google, humanity, Earth, Sol...

We can choose to carry on and perform [what we think are] improvements and make the best of it, or we can choose to cash it in early and simply give up.

Brief guide for site optimizers:

1. Figure out how to embed content that only LLMs see and that affects their output

2. Wait for that to stop working

3. Innovate another way to get past the new technical obstacle

llms.txt has a section on "Existing standards" which completely overlooks well-known URIs [0]; an issue was opened three months ago [1] but seems to have been ignored.

[0] https://en.wikipedia.org/wiki/Well-known_URI

[1] https://github.com/AnswerDotAI/llms-txt/issues/2

Exactly, this fits perfectly in the `.well-known` use cases. What a shame.

Unfortunately this requires registration, which is not a simple process.

No more than throwing stuff in the root.

Folks, please note that this proposal is designed to help end users who wish to use AI tools. For instance, so that when you use Cursor or VS Code, you can get good documentation about the libs you're coding with, for the LLM to help you better.

It’s not related to model training. Nearly all the responses so far are about model training, just like last time this came up on HN.

For instance, I provide llms.txt for my FastHTML lib so that more people can get help from AI to use it, even though it's too new to be in the training data.

Without this, I’ve seen a lot of folks avoid newer tools and libs that AI can’t help them with. So llms.txt helps avoid lock-in of older tech.

(I wrote the llms.txt proposal and web site.)
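For anyone who hasn't seen one: the format is just markdown, with an H1 title, a blockquote summary, and H2 sections listing links. A minimal sketch (the project name and URLs below are made up for illustration):

```markdown
# ExampleLib

> ExampleLib is a small web framework for building server-rendered apps.

## Docs

- [Quick start](https://example.com/docs/quickstart.md): A short tour of the core API
- [Reference](https://example.com/docs/reference.md): Full list of functions and parameters

## Optional

- [Design notes](https://example.com/docs/design.md): Background reading; safe to skip
```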

This sounds quite good in that case. I've been attempting to convert documentation into markdown to feed to the LLMs to get fresher/more accurate responses.

In this case, we might need a versioning scheme. Libraries have multiple versions and not everyone is on the latest. I still need a way to point my LLM to the version I'm actually using.
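The spec doesn't cover versioning today; one hypothetical workaround is to publish a link section per release (or separate files like /docs/2.x/llms.txt) and point your tool at the one matching your installed version. Purely illustrative:

```markdown
## Docs (v2.x)

- [v2 API reference](https://example.com/docs/2.3/api.md): Current stable release

## Docs (v1.x, legacy)

- [v1 API reference](https://example.com/docs/1.9/api.md): For projects still on 1.x
```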

Have there been any declarations by various AI companies (e.g. OpenAI, Anthropic, Perplexity) that they are actually relying upon these llms.txt files?

Is there any evidence that the presence of the llms.txt files will lead to increased inclusion in LLM responses?

And if they are, can I put subtly incorrect data in this file to poison LLM responses while keeping my content designed for humans of the best quality?

I'm curious, what would be the reason for doing this?

If one doesn’t want LLMs to scrape data and knows the LLMs will be ignoring the robots.txt file.

Undermine the usefulness of LLMs in an attempt to force people to visit your site directly.

Anthropic itself publishes a bunch of its own llms.txt files, so I guess that means something.

It's telling that nearly every site listed is directly involved with AI in some way; unsurprisingly, the broader internet isn't particularly interested in going out of its way to make its content easier for AI companies to scrape.

Deliberately putting garbage data in your llms.txt could be funny though.

You seem to be misunderstanding why a website would make llms.txt

Obviously, they would not make it just for an AI company to scrape

Here's an example. Let's say I run a dev tools company, and I want users to be able to find info about me as easily as possible. Maybe a user's preferred way of searching the web is through a chatbot. If that chatbot also uses llms.txt, it's easy for me to deliver the info, and easy for them to consume. Win-win

Of course adoption is not very widespread, but such is the case for every new standard.

The point of LLMs is that they can make sense of the web the same way humans can (roughly speaking); so why do they get the special treatment of a direct, ad-free, plain-text version of the actual info they’re looking for, while humans aren’t allowed to scroll through a salad recipe without being bombarded with 20 ads?

A human could read the llms.txt if they want to. And a developer could put ads in llms.txt if they wanted to!

I've seen many people joke about intentionally poisoning training data but has that ever worked?

It's hard to gauge the effectiveness of poisoning huge training sets since anything you do is a figurative drop in the ocean, but if you can poison the small amount of data that an AI agent requests on-the-fly to use with RAG then I would guess it's much easier to derail it.

This study suggests that controlling 0.1% of the training data may be enough.

https://arxiv.org/abs/2410.13722v1

I have noticed some popular copied but incorrect leetcode examples leaking into the dataset.

I suspect it depends on domain specificity, but that seems within the ability of an SEO spammer or decentralized group of individuals.

Seems silly to put garbage data there. Like intentionally doing bad SEO so Google doesn't link you.

I think you should think about it as: I want the LLM to recognize my site as a high quality resource and direct traffic to me.

Imagine a user asks ChatGPT a question. The LLM has scraped your website and answers the question. The user wants some kind of follow-up -- read more, what's the source, how can I buy this, whatever -- so the LLM links the page it got the data from.

LLMs seem like they're supplanting search. Being early to work with them is an advantage. Working to make your pages look low quality seems like an odd choice.

That sounds like those “react youtubers” taking your content without permission and telling you that you should be grateful for the exposure.

Is it a good response to the react YouTubers to make your content terrible? Or to provide something in your content not available on theirs?

Whether you like it or not LLMs are going to be how people explore the web. They simply work better than search engines - not least because they can quickly scan numerous sites simultaneously, consume and synthesize the content.

You can choose to sabotage your own content in a likely futile effort to make things worse for LLM users if you want - my point is just that it serves no purpose and misses out on the opportunities in front of you.

Are you kidding? The follow-through on attribution links, where present, is nearly zero. There are no gains to be had here, only losses.

I'd prefer not to play that game. I'd rather lose a bit of money and traffic and not help LLMs as far as humanly possible.


Making it easier for tech companies to steal my art. Sure, I will get right to it. In what world do these thieves live? I hope they catch something nasty!

Has your art been stolen in the past? If so, how did you get it back?

Poorly specified, wedging structured data into markdown, not widely supported, ignores /.well-known/.

I also don’t understand the problem it purports to solve.

Perplexity is listed, but do they actually abide by llms.txt? And how can we prove they do? Is it all good faith? I wish there were a better way.

llms.txt isn't an opt-out signal like robots.txt, it's a way to provide a simplified version of pages that are easier for LLMs to ingest. It's more of an opt-in for being scraped more accurately.

Or scraped inaccurately. It seems like you could have some fun with this if you were so inclined...

For trusted sites, this is a logical next step.

Why should websites implement yet another custom output format for people^Wsoftware that won’t bother to use existing, loosely yet somewhat structured, open formats?

In a world where people gravitate to LLMs for quick answers instead of wading through ads and whatnot, it seems like you would want an LLM to cite your content for further context. If the user just wanted an answer, they probably wouldn't have spent much time on your site anyway.

The ads and whatnot are why the site exists! That’s the point, with the content being the hook. If people aren’t looking at the ads, it’s a loss.

This is a great resource to at least figure out all the LLMs out there and block them. I already updated my robots.txt file. Of course, that is not sufficient, but at least it's a start and hopefully the blocking can get more sophisticated as time goes on.

It looks like the opposite. It is a way to make your site easier to parse for LLMs.

It is, but you can use it as a list of targets for blocking.
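For reference, blocking at the crawler level is just a few robots.txt entries. The user-agent names below are the publicly documented crawlers for OpenAI, Anthropic, and Perplexity; note that honoring robots.txt is entirely voluntary on the crawler's part, and the names change over time:

```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /
```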

it's not "productive", of course, but i don't see any issue with expressing this opinion whatsoever. and i say this being about as starry-eyed a techno-llm-utopian-esque dreamer as they come... sure, the "google" version of LLMs paving over industry has already crossed the rubicon, but everyone should have to reckon the value that they are truly providing not just for consumers but for producers as well... and no one should be offended by showing up in someone's robots.txt... just as i'm sure this commenter is realistic enough to know and understand that putting entries in one's robots.txt is nothing more than a principled, aspirational statement about how the world should be, rather than any sort of real technological impediment.

(But we'll just ignore the obvious irony in that end bit about detection of bots getting smarter... wonder where all this "intelligence" will come from? Probably not some natural source, but possibly some sort of... Antinatural Intelligence?)