Show HN: ModernBERT in Pure C

A minimal implementation of ModernBERT in pure C, inspired by karpathy's llama2.c. The core (tokenizer + inference code) is around 1000 lines of C with only two dependencies: OpenBLAS for fast matrix multiplication and PCRE for the tokenizer's regex.

If you haven't come across it, ModernBERT is a recent encoder-only model from answer.ai. Unlike decoder-only models like Llama, encoder models process all input tokens in a single forward pass (no autoregression), which makes them a great fit for tasks like token classification.

The implementation supports loading any ModernBERT checkpoint from Hugging Face. I've tested it with the base model and a token classification model for anonymizing PII. You can get >1200 tokens/s throughput on a single thread (slightly faster than the PyTorch implementation), though that number isn't directly comparable to decoder models since there's no token-by-token generation.

I hard-coded the architecture to keep things simple and readable. The tokenizer is a from-scratch BPE implementation that handles most inputs correctly, though some edge cases are still missing. The main goal was to support a lightweight deployment of this model, without the heavy baggage of the PyTorch ecosystem.

Enjoy.

URL: github.com