Testing LLM Agents Like Software – Behaviour Driven Evals of AI Systems

13 comments

I appreciate the details shared in this paper, but it'd be great if they open-sourced their implementation!

Curious whether the behaviour-driven testing could be done by another LLM agent (or a group of agents), with one LLM agent testing another. Could that lead to a self-improving loop?

[deleted]

[deleted]

A powerful move beyond benchmarks: this paper redefines LLM evaluation through realistic, behavior-driven testing.

Very interesting work.

Excellent work

Interesting

Nice Work

Nice work

Great work

interesting

[dead]