News | drihu.com

By hubertmarek, 2 hours ago

URL: agentdiff.dev

2 comments

By zachderhake, 32 minutes ago

I tested this. The pain point for me is that running evals against production APIs requires maintaining mock accounts, dealing with rate limits from integrations, and constantly cleaning up test data. The interception approach here (the agent calls real URLs and gets routed to local fakes) looks like it could eliminate that overhead. I dealt with this at my last startup and it was really annoying.
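To make the interception idea concrete, here is a minimal sketch of "real URL in, local fake out". It is not how agentdiff does it; the host `api.example.com`, the `FakeIssueAPI` handler, and the `reroute`/`fetch_json` helpers are all made up for illustration, and a real setup would more likely intercept at the proxy or DNS level.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlsplit, urlunsplit


class FakeIssueAPI(BaseHTTPRequestHandler):
    """Hypothetical local fake: answers every GET with a canned issue list."""

    def do_GET(self):
        payload = {"path": self.path, "issues": [{"id": "FAKE-1"}]}
        body = json.dumps(payload).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def start_fake():
    """Start the fake on a free localhost port; return (server, base_url)."""
    server = HTTPServer(("127.0.0.1", 0), FakeIssueAPI)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, f"http://127.0.0.1:{server.server_port}"


def reroute(url, routes):
    """Rewrite a real URL to its local fake if the host appears in `routes`."""
    parts = urlsplit(url)
    fake_base = routes.get(parts.hostname)
    if fake_base is None:
        return url  # unknown hosts pass through untouched
    fake = urlsplit(fake_base)
    # Swap scheme and host, keep the original path, query, and fragment.
    return urlunsplit(
        (fake.scheme, fake.netloc, parts.path, parts.query, parts.fragment)
    )


def fetch_json(url, routes):
    """What an agent's HTTP tool might call: real URL in, fake response out."""
    # An empty ProxyHandler bypasses any system proxy for the local request.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(reroute(url, routes), timeout=5) as resp:
        return json.load(resp)
```

With `server, base = start_fake()` and `routes = {"api.example.com": base}`, a call like `fetch_json("https://api.example.com/v1/issues", routes)` never leaves the machine, so there are no mock accounts, rate limits, or test data to clean up afterwards.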

By hubertmarek, an hour ago

Hi HN, I noticed it is almost impossible to run evals or train models on third-party integrations, so I built interactive environments for them. Feedback is more than welcome. Thanks!

Interesting fact: on 40 eval tasks against the Linear API, most frontier models scored surprisingly well:

- Claude Opus 4.5: 95% (38/40)
- GLM 4.6: 87.5% (35/40)
- Claude Sonnet 4.5: 85% (34/40)
- Claude Haiku 4.5: 82.5% (33/40)
- Kimi K2: 82.5% (33/40)
- Grok 4.1 Fast: 80% (32/40)
- GPT 5.1: 77.5% (31/40)
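For anyone who wants to sanity-check the numbers, the percentages are just passed tasks over 40. A quick sketch, with the pass counts copied from the list above (the `pass_rate` helper is only illustrative):

```python
TOTAL_TASKS = 40

# Passed-task counts as reported in the list above.
passed = {
    "Claude Opus 4.5": 38,
    "GLM 4.6": 35,
    "Claude Sonnet 4.5": 34,
    "Claude Haiku 4.5": 33,
    "Kimi K2": 33,
    "Grok 4.1 Fast": 32,
    "GPT 5.1": 31,
}


def pass_rate(count, total=TOTAL_TASKS):
    """Return the pass rate as a percentage."""
    return 100 * count / total


for model, count in passed.items():
    print(f"{model}: {pass_rate(count):g}% ({count}/{TOTAL_TASKS})")
```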

This makes me wonder whether we really need to reinvent the wheel and build special interfaces (MCPs) for agents interacting with services, when they can just use the APIs as they are.

Show HN: Do we need MCPs? Reverse-engineered Slack and Linear API for Evals & RL