Show HN: Gemini 2.5 is the best model for Kotlin and Android dev

I built Kotlin-bench, the first benchmark for evaluating LLM performance on real-world Kotlin and Android tasks.

Existing AI benchmarks such as SWE-bench focus primarily on Python and JavaScript, leaving Kotlin and Android development largely overlooked. Kotlin-bench addresses this gap by providing a clear metric for evaluating AI models specifically on Kotlin.

Like SWE-bench, Kotlin-bench assesses AI models using real GitHub issues from widely used Kotlin repositories such as kotlinx.coroutines and Anki-Android. Solutions are validated objectively by running each project's unit tests: a model passes only if its generated code actually resolves the original issue.

The dataset includes 100 GitHub issues selected from well-maintained and thoroughly tested repositories.
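The pass/fail check above boils down to two steps per issue: apply the model-generated patch to a clean checkout, then run the project's test suite. Here's a minimal sketch of that loop; the Gradle invocation, paths, and function name are illustrative assumptions, not the actual Kotlin-bench harness code (which is in the repo linked below).

```python
import subprocess

def evaluate_patch(repo_dir: str, patch_path: str,
                   test_cmd=("./gradlew", "test")) -> bool:
    """Apply a model-generated diff, then run the project's unit tests.
    Hypothetical sketch: the real harness runs inside Android containers."""
    # A patch that does not apply cleanly counts as a failure.
    applied = subprocess.run(["git", "apply", patch_path],
                             cwd=repo_dir, capture_output=True)
    if applied.returncode != 0:
        return False
    # The model passes only if the project's test suite succeeds.
    tests = subprocess.run(list(test_cmd), cwd=repo_dir, capture_output=True)
    return tests.returncode == 0
```

The interesting engineering is less this loop than running it at scale; Android builds are heavy, which is why the evaluation fans out across thousands of containers (details in the blog post below).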

Major takeaways:

- Gemini 2.5 performed best, solving 14% of tasks, closely followed by Claude 3.7 (thinking mode).

- "Thinking" models substantially outperformed standard models, with Gemini 2.5, Claude 3.7 thinking, and DeepSeek R1 topping our charts.

- Surprisingly, OpenAI’s latest models struggled; notably, o3-mini solved only 2% of the tasks.

For a technical deep-dive on how Kotlin-bench was curated and how we scaled thousands of Android containers in parallel for evaluation, read more here:

- https://firebender.com/blog/kotlin-bench

I also open sourced all the code and datasets:

- https://github.com/firebenders/Kotlin-bench

- https://huggingface.co/datasets/firebenders/Kotlin-Bench

Thanks for reading! I'd love feedback, questions, or suggestions on how to improve Kotlin-bench further!