TOON – Token Oriented Object Notation

URL: github.com
17 comments

JSON unmarshalling often has to consider separately whether an attribute is absent, false, zero, null, or the empty string, but this was never quite semantically ambiguous enough for my tastes, so the fact that void-ish values may now also be serialised as a tuple of length [0] strikes me as an excellent additional obfuscation.

The use case here is to reduce token usage with LLMs, such as an agent that outputs a list of commands, e.g. tuples of files to write and their new contents.
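
For illustration, a sketch of what that might look like in TOON's tabular form (my own example, glossing over the spec's exact quoting rules for strings containing commas or newlines):

    commands[2]{file,content}:
      src/index.ts,"export const VERSION = 1;"
      README.md,"# my project"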

Supporting this use case doesn’t require perfectly marshaling every data structure ever.

But to your point, the tool could have wider use cases without the limitations.

If one trains a model to understand it then that model will inevitably emit it, which means in turn one shall have to parse it, and now the application supports TOON for anything, and good luck telling the users/customers any different.

What if there’s a simple converter back to JSON after the model output? Is that possible?
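
It should be; the mapping is deterministic. As a rough illustration (hand-rolled here, not the library's actual API), a decoder for just the tabular subset shown in the example might look like:

    // Sketch: decode TOON's flat tabular form back into a JS object,
    // handling only the single-table shape from the README example
    // (header `name[N]{f1,f2}:` plus comma-separated rows). Quoting,
    // nesting, and type restoration are deliberately out of scope.
    function toonTableToJson(toon: string): Record<string, unknown[]> {
      const lines = toon.trim().split("\n");
      const header = lines[0].match(/^(\w+)\[(\d+)\]\{([^}]*)\}:$/);
      if (!header) throw new Error("not a tabular TOON block");
      const [, name, count, fieldList] = header;
      const fields = fieldList.split(",");
      const rows = lines.slice(1, 1 + Number(count)).map((line) => {
        const values = line.trim().split(","); // naive: breaks on quoted commas
        return Object.fromEntries(fields.map((f, i) => [f, values[i]]));
      });
      return { [name]: rows };
    }

    // toonTableToJson('users[2]{id,name,role}:\n  1,Alice,admin\n  2,Bob,user')
    // => { users: [{ id: "1", name: "Alice", role: "admin" },
    //              { id: "2", name: "Bob", role: "user" }] }

Everything comes back as strings in this sketch; a real decoder would also restore numbers and booleans.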

I’ll be interested to see benchmarks. My expectation is that accuracy will take a hit on mid-length or longer context prompts: I’d bet that the heavy use of JSON in fine-tuning will end up impacting the quality of a more terse (less reasoning space) novel encoding.

That said: I like the idea!

There are some very light benchmarks in the Readme, or are you looking for more?

Do you mean the [0] Token Benchmarks section? I only see token count numbers.

Which doesn't address the question: do LLMs understand TOON the same as they would JSON? It's quite likely that this notation is not interpreted by most LLMs the same way JSON would be. So benchmarks on, say, data processing tasks would be warranted.
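
For what it's worth, a minimal sketch of such a benchmark (callLLM is a hypothetical stand-in for whatever model client you actually use): encode the same records both ways, ask a question with a checkable answer, and compare.

    // Sketch of an accuracy benchmark: same records, two encodings,
    // one question with a known answer. `callLLM` is hypothetical.
    declare function callLLM(prompt: string): Promise<string>;

    async function compareEncodings(json: string, toon: string) {
      const question = 'How many users have the role "admin"? Answer with a number only.';
      const expected = "1"; // ground truth for the README's users example
      for (const [label, payload] of [["JSON", json], ["TOON", toon]]) {
        const answer = (await callLLM(`${payload}\n\n${question}`)).trim();
        console.log(label, answer === expected ? "correct" : `wrong: ${answer}`);
      }
    }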

[0] https://github.com/johannschopplich/toon?tab=readme-ov-file#...

I don’t know what I’m talking about (pure fantasy), but what if you train a model on compressed data and then perform inference on compressed data as well? Could this work? With the output also being compressed and then decompressed by the client?

Neat. I did a similar thing with CSV (instead of JSON) a year back. Great that there are measurements, but I think the really interesting measure would be to run it against the actual "Structured Output Format" endpoints of LLM providers, e.g. those fine-tuned to return valid JSON.

This is awesome, I saw it on twitter and gave it a star

Indentation-based sounds pretty brittle for a serialization format. I imagine a tabular format that factors out repeating keys could be expressed fairly compactly in JSON itself.
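
For example, hoisting the keys out of the users example while staying plain JSON (the "fields"/"rows" names here are just made up for illustration):

    {
      "users": {
        "fields": ["id", "name", "role"],
        "rows": [[1, "Alice", "admin"], [2, "Bob", "user"]]
      }
    }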

It would be interesting to compare this to BAML and TOML.

This is definitely a core feature of BAML. My main complaint with BAML is that it's all or nothing: it's very opinionated, and we can't get the benefits without the DX and vice versa. Separating this feature out, without requiring a model-definition DSL, is a great addition.

TOML has some readability and compactness benefits over JSON while still being both common enough for models to process it relatively reliably and widely supported in most languages. I suspect BAML still performs better, but likewise, due to the tooling work involved, I haven't integrated it.
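
For reference, the users example from the README in TOML's array-of-tables syntax, which repeats every key per record, so it optimizes readability rather than token count:

    [[users]]
    id = 1
    name = "Alice"
    role = "admin"

    [[users]]
    id = 2
    name = "Bob"
    role = "user"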

Hello, it's probably better to add leading spaces before all of the words rather than none of them

What is the font used on that README image?

I'm sorry I don't see this adding value over various other formats. I don't really want a new object serialization format, I just want the existing ones to have the features I need. YAML but with static typing and schema. XML but without crazy internet features. TOML but with an object format that doesn't hurt my brain. JSON but with decent multiline strings and comments. NestedText but with a sub-standard that provides static-typing and schema and whatnot.

This isn't really an interchange format so much as something you'd JIT compile down to when handing things off to an LLM, right?

And on the way out of the LLM. Token savings are nice on the way out too, and I have to imagine it's better for the LLM to see one format in all of its context instead of two.

It seems like a nice idea to me if restricted to that. Although I guess I am not sure if it's really intended that way: the array count, for example, is probably pretty bad for LLM output, since the model has to commit to a length before it has generated the rows.

I feel like on the output side you might be working against LLM training? But I don't know.

https://cuelang.org | https://cuetorials.com

CUE can emit the other formats (minus XML, because it's a beast of ambiguity, but there are other tools for that, e.g. json->xml converters).

It also has modules and imports, a very underrated feature for config languages if you haven't experienced it before.

The benchmarks show it performs better than them, so that's the value - cost savings and improved accuracy. I suppose you could convert JSON to TOON just for the LLM and not actually read it with your own brain.

I'll say the obvious. A lot of this you can just do in JSON.

Let's take the example:

    {
      "users": [
        { "id": 1, "name": "Alice", "role": "admin" },
        { "id": 2, "name": "Bob", "role": "user" }
      ]
    }

    users[2]{id,name,role}:
      1,Alice,admin
      2,Bob,user

We can keep it JSON, but use more compact list expressions, as tuples when pragmatic:

    ["users",
       [1, "Alice", "admin"],
       [2, "Bob", "user"]
    ]

The thing is, the game with LLMs is not what's shortest, but what's:

1. Mainstream, so they understand it.

2. What they're tuned for, and they're tuned for what's mainstream (JSON).

If you want to go for extreme compression, you can shove it all into JSON strings too and keep the larger structure JSON:

    ["users",
       "1:admin:Alice",
       "2:user:Bob",
    ]

You may say "how is this better?" Well, it's better because it's still JSON: there's less to explain to the LLM, and to your other devs. Even if we use a weird compact format like "id:role:name", this is still shorter to explain than a completely different syntax with its whole world of rules.

In fairness to toon, the alternative JSON you're giving doesn't include hints on the structure.

Not sure LLMs are more “tuned” to JSON.

That said, your general point holds that toon may be unnecessary, especially in the examples given, where perhaps plain text would suffice. Toon could be useful when automating inputs with many different shapes.

I don't get it; can't you just use YAML instead of inventing another DSL?

For repeating objects of the same structure, YAML will still require each key on each object, whereas this is a hybrid with CSV, so it defines the keys once.
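
Concretely, the users example from the README in YAML still spells out every key on every item:

    users:
      - id: 1
        name: Alice
        role: admin
      - id: 2
        name: Bob
        role: user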

No one forces us to use objects with repeated keys in JSON, you know.

Indeed a

    {"header": ["some","column","names"], "values": [[1,2,3],[4,5,6],...]}

could fit.

It's more compact than YAML. More like a combination of YAML and CSV.

Norway.

YAML 1.2 has been out for 16 years now, so I would simply not assume that the suggestion to use YAML for a new purpose means “use YAML 1.1”.

I could agree that you would not make poor assumptions.

Your LLM, however, may experience cross-format feature superposition and consequential spurious activation.

It is, also no one uses it :)

I'm not sure which one would win, but it's a bit telling that compression isn't mentioned at all.

I guess it's about LLMs, so the idea is it has to be plaintext? But if you can train it on TOON, can't you train it on BSON?