Hello HackerNews!
I’m excited to share what we’ve been working on at nCompass Technologies: an AI inference* platform that gives you a scalable and reliable API to access any open-source AI model — with no rate limits. We don't have rate limits as optimizations we made to our AI model serving software enable us to support a high number of concurrent requests without degrading quality of service for you as a user.
If you’re thinking, "well, aren’t there a bunch of these already?", so were we when we started nCompass. When using other APIs, we found that they weren’t reliable enough to use open-source models in production environments. To resolve this, we're building an AI inference engine that enables you, as an end user, to reliably use open-source models in production.
Underlying this API, we’re building optimizations at the hosting, scheduling and kernel levels with the single goal of minimizing the number of GPUs required to maximize the number of concurrent requests you can serve, without degrading quality of service.
We’re still building a lot of our optimizations, but we’ve released what we have so far via our API. We currently keep time-to-first-token (TTFT) 2-4x lower than vLLM at an equivalent concurrent request rate. You can check out a demo of our API here:
https://www.loom.com/share/c92f825ac0af4ab18296a16546a75be3
As a result of the optimizations we’ve rolled out so far, we’re releasing a few unique features on our API:
1. Rate-Limits: we don’t have any
Most other APIs out there have strict rate limits and can be rather unreliable. We don’t want APIs for open-source models to remain a solution for prototypes only. We want people to use these APIs like they do OpenAI’s or Anthropic’s and actually build production-grade products on top of open-source models.
2. Underserved models: we have them
There are a ton of models out there, but not all of them are readily available for people to use if they don’t have access to GPUs. We envision our API becoming a system where anyone can launch any custom model of their choice with minimal cold starts and run the model as a simple API call. Our cold starts for any 8B or 70B model are only 40s and we’ll keep improving this.
Towards this goal, we already have models like `ai4bharat/hercule-hi` hosted on our API to support non-English language use cases and models like `Qwen/QwQ-32B-Preview` to support reasoning-based use cases. You can find the other models that we host here: https://console.ncompass.tech/public-models for public ones, and https://console.ncompass.tech/models for private ones that work once you've created an account.
We’d love for you to try out our API by following the steps here: https://www.ncompass.tech/docs/llm_inference/quickstart. We provide $100 of free credit on sign up to run models, and like we said, go crazy with your requests, we’d love to see if you can break our system :)
We’re still actively building out features and optimizations and your input can help shape the future of nCompass. If you have thoughts on our platform or want us to host a specific model, let us know at hello@ncompass.tech.
Happy Hacking!
* it's called inference because the process of taking a query, running it through the model and providing a result is referred to as "inference" in the AI / machine learning world. It's as opposed to "training" or "finetuning" which are processes used to actually develop the AI models that you then run "inference" on.
Random idea -- I think it would be cool for hosts that advertise efficiency to have a dashboard that shows total tokens per watt-hour (or whatever usage:energy metric) graphed over time for each model they host, taking into account as much of their infra as possible.
This would:
- let you boast about your cool proprietary optimizations
- naturally get better over time just from applying public algorithmic improvements
- show up hosts that refuse to do the same
- give you a good incentive to keep on top of your own efficiency and competitiveness over time
- be a good response to users who vaguely know that AI takes "a lot" of energy -- it's actually gotten a lot better, but how much better?
Happy to chat if it would help to have a neutral academic voice involved.
Thanks for the suggestions! This is definitely something we'll be looking at.
We're currently working on providing a more extensive interface to show users a variety of performance metrics of the models they're running. Having efficiency metrics would be a great addition.
I think additionally an important facet of these tests would be providing clarity on the details of the tests to make them reproducible. I find that sometimes reported stats don't quite translate to real-world experiences. It can feel like results are presented using the workloads that look best on a system, so a standardized/reproducible approach would be best.
We're always keen to chat to as many users/experts/academics/enthusiasts as possible. Please feel free to reach me at diederik.vink@ncompass.tech and we can set up a time to meet!
What are the trade-offs you've made to achieve this?
We focused mainly on the scheduling side of things. So we essentially prioritize prefills over decodes. In order to do this correctly, we had to monitor KV cache usage and whenever it's close to running out of memory, we schedule more decodes again.
So this means that you end up either having many decodes wait for prefills to complete or scheduling decodes together with prefills. Both scenarios result in slower decodes, which is why we're seeing an increase in inter-token latency (ITL). This is the main tradeoff we've made.
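To make the policy above concrete, here is a minimal sketch of a scheduler that prefers prefills until the KV cache is close to full and then falls back to decodes. It is only an illustration of the idea as described in this comment, not nCompass's actual implementation: the class name `PrefillFirstScheduler`, the `high_watermark` threshold, and the one-token-per-KV-entry accounting are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int       # KV entries written during prefill
    max_new_tokens: int      # decode steps this request needs in total
    generated: int = 0       # tokens decoded so far


class PrefillFirstScheduler:
    """Toy scheduler: run prefills first, decode when the KV cache is nearly full."""

    def __init__(self, kv_capacity: int, high_watermark: float = 0.9):
        self.kv_capacity = kv_capacity        # total KV slots (hypothetical unit: tokens)
        self.high_watermark = high_watermark  # hypothetical "close to full" threshold
        self.kv_used = 0
        self.waiting = deque()                # requests that still need a prefill
        self.running = []                     # requests currently decoding

    def add(self, req: Request) -> None:
        self.waiting.append(req)

    def _fits(self, req: Request) -> bool:
        # A prefill is admitted only if it keeps usage under the watermark.
        limit = self.high_watermark * self.kv_capacity
        return self.kv_used + req.prompt_tokens <= limit

    def step(self) -> str:
        """Run one scheduling step and return the phase that was chosen."""
        if self.waiting and self._fits(self.waiting[0]):
            req = self.waiting.popleft()
            self.kv_used += req.prompt_tokens
            self.running.append(req)
            return "prefill"
        if self.running:
            still_running = []
            for req in self.running:
                req.generated += 1
                self.kv_used += 1  # each decoded token adds one KV entry
                if req.generated >= req.max_new_tokens:
                    # Finished: release the prompt and generated KV entries.
                    self.kv_used -= req.prompt_tokens + req.generated
                else:
                    still_running.append(req)
            self.running = still_running
            return "decode"
        return "idle"
```

In this toy version decodes only run once a new prefill no longer fits, which is the "decodes wait for prefills" scenario; a real engine would also mix the two phases in one batch and guard against decodes overflowing the cache.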
So, while time to first token is lower, throughput might also be lower in most cases?
Per user throughput might be lower at the moment yes. We're working on GPU kernel level optimizations now to fix that.
But across all users on our system, the throughput is better because doing more prefills or a large number of grouped decodes has better utilization of the GPU.
The idea is that this works for someone who wants to build a product that is consistent across users in terms of initial response but can trade off some end-to-end (E2E) latency. It ensures that no one is waiting for a long time before getting the first response.
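For anyone who wants to check this tradeoff on their own workload, the three numbers discussed in this thread (TTFT, ITL, and E2E latency) can be derived from per-token arrival times of a streamed response. The sketch below assumes you have already recorded those timestamps yourself; the names `LatencyStats` and `latency_stats` are made up for the example and are not part of any API mentioned here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LatencyStats:
    ttft: float  # time to first token, in seconds
    itl: float   # mean inter-token latency, in seconds
    e2e: float   # end-to-end latency, in seconds


def latency_stats(request_sent: float, token_times: List[float]) -> LatencyStats:
    """Compute TTFT, mean ITL, and E2E latency from token arrival timestamps."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - request_sent
    e2e = token_times[-1] - request_sent
    # Gaps between consecutive tokens; empty if only one token arrived.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return LatencyStats(ttft=ttft, itl=itl, e2e=e2e)
```

A prefill-first scheduler should show up here as a lower `ttft` and a higher `itl`, with `e2e` depending on how many tokens the response contains.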
I don’t really get it. Prefill saturates compute and decode saturates memory bandwidth. Why are you not doing mixed batch?
You're totally right and we are doing a mixed batch. What we changed was the priority of performing prefills over decodes.
When looking at a variety of workloads, we realized that prioritizing finishing a query (prioritizing decodes) led to underutilization of the GPU. We noticed there tended not to be enough concurrently running requests (because prefill wasn't prioritized) to meaningfully utilize the memory bandwidth with the available decodes. This led to a system that was unfortunately neither compute nor memory bound.
By running mixed batches that prioritize prefills we still compute some decode tokens in our spare capacity, but ensure compute is as saturated as possible. This additionally leads to a buildup of decodes, so that when we are primarily computing decode we're pushing our memory bandwidth as much as we can.
Of course there are still plenty of improvements that can be made on this front. Finding a dynamic balance between prefill and decode that pushes both memory bandwidth and compute to their limits is the goal from a scheduling perspective. There are a whole host of factors such as the model architecture, input-token:output-token ratio, underlying hardware, KV-cache allocation (and many more) that all play into the pressure placed on memory and compute, so there's definitely still exploration to be done!
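As a rough illustration of a prefill-first mixed batch, the sketch below fills a fixed per-step token budget with prefill tokens first and gives whatever capacity is left to decodes. The budget parameter, the function name `build_mixed_batch`, and the choice to split long prompts into chunks (the chunked-prefill idea found in some open-source engines) are assumptions for the example, not a description of how nCompass builds its batches.

```python
from typing import List, Tuple


def build_mixed_batch(
    pending_prefills: List[int],  # prompt lengths (in tokens) waiting for prefill
    active_decodes: int,          # sequences that each need one decode token
    token_budget: int,            # hypothetical max tokens per forward pass
) -> Tuple[List[int], int]:
    """Return (prefill chunk sizes, decode token count) for one batch."""
    prefill_chunks: List[int] = []
    remaining = token_budget
    for prompt_len in pending_prefills:
        if remaining == 0:
            break
        # Prefill gets first claim on the budget; long prompts are chunked to fit.
        chunk = min(prompt_len, remaining)
        prefill_chunks.append(chunk)
        remaining -= chunk
    # Decodes only use the capacity that prefills left over.
    decode_tokens = min(active_decodes, remaining)
    return prefill_chunks, decode_tokens
```

With no pending prefills the whole budget goes to decodes, which corresponds to the phase where the built-up decodes are meant to saturate memory bandwidth.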
1. Why do you have a limited number of models publicly? Do you have to configure each one manually?
2. I don't see the 50% cheaper option. According to your pricing page, 16B+ models will cost $0.90, which is the same price for Together.ai and fireworks.ai
1. Yes, that's correct to some degree. Depending on the model details we might need to do some manual tweaking to get everything up and running, but generally we can get a model up within a day. There are always optimizations and tests we like to run before listing something as publicly available to ensure the best experience for our users.
If a fully self-serve system is something you would like to see, we would love to hear more!
2. Could you please elaborate on the 50% cheaper option? If you're referring to the line on our website, that is due to our efficiency at scale. This efficiency benefit allows us to provide the models at the price that we do without implementing rate limits to manage our costs. Additionally, this 50% more efficient GPU utilization also benefits anyone looking to use our infrastructure for on-prem solutions.
> Reduce AI GPU Infrastructure Bills by 50%
Ok so how does #2 help me do this?
If you deploy our solution on-prem, you would be able to handle 2x the workload on the same amount of hardware. This ensures you scale up your hardware 2x slower, giving you a ~50% reduction in your GPU Infrastructure bills.
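The arithmetic behind that claim can be written out as a small calculation. It assumes the bill scales linearly with GPU count and that the 2x figure holds for your workload; the function names and the example numbers are placeholders, not measurements.

```python
import math


def gpus_needed(workload_rps: float, per_gpu_rps: float) -> int:
    """GPUs required to serve a workload, rounded up to whole devices."""
    return math.ceil(workload_rps / per_gpu_rps)


def bill_reduction(workload_rps: float, baseline_per_gpu_rps: float, speedup: float) -> float:
    """Fractional drop in GPU count (and bill, if cost is linear in GPUs)."""
    baseline = gpus_needed(workload_rps, baseline_per_gpu_rps)
    optimized = gpus_needed(workload_rps, baseline_per_gpu_rps * speedup)
    return 1 - optimized / baseline
```

For example, a workload that needs 100 GPUs at the baseline rate needs 50 when each GPU handles twice as much, which is where the ~50% comes from. Rounding up to whole GPUs means small deployments will see a somewhat different percentage.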
Unrelated: During the dot-com boom, there was a company called nCompass Labs that developed one of the first content management systems (https://en.wikipedia.org/wiki/NCompass_Labs_Inc). Microsoft bought them in 2001. Their product was, "a plug-in for hosting ActiveX controls in Netscape Navigator named ScriptActive." ActiveX itself was a novelty, using C++ templates to define reusable and _downloadable_ web components.
All of this crap was happily replaced with JavaScript frameworks in later years. Yes, back in the early-2000s, your browser might literally download executable code just to render a custom button.
It now makes sense that when we tested the domain ncompass.com it took us to a Microsoft home page, which is why we're ncompass.tech :)
That’s hilarious. I bet if you reach out to Microsoft, they will give you that domain. There’s no way they’re using the trademark anymore.
Interesting approach to model serving - the 2-4x lower TTFT compared to vLLM is impressive, but I'd be curious to see detailed benchmarks across different batch sizes and model architectures to validate those performance claims. The no rate limits policy is bold but could get expensive fast if you're not doing some clever GPU utilization under the hood.
Thanks for your comments. Absolutely, as we were mentioning in one of the other threads, we are really keen on building towards having a reproducible dashboard of efficiency and other metrics.
Also, regarding the no rate limits policy, we agree this is a real challenge and it's part of why we're interested in building this. The clever GPU utilization tricks are exactly what we're building out, and we're looking forward to seeing what issues we run into at such scale.
One vote for image inputs here. I would love a fine-tuned qwen-2-vl-72b on demand, but most of the solutions are "talk to us" level expensive. I'm assuming you beat the price or convenience of a replicate / modal solution?
Thanks for the feedback and the specific model suggestion!
Compared to the replicate/modal solutions our big focus is to ensure you don't experience rate limits. We want to ensure you get a good quality of service no matter what.
When it comes to requesting and running specific models, we won't ask you to pay extra just because there's lower demand for that specific model (which it sounds like other providers are doing). We manage scaling up and down instances for the models on your behalf to make sure you get good performance at a fair price point, so you don't have to worry about making the costs work.
Since you're calling out your support for underserved models, can I request you support some SOTA embeddings models? Support for embeddings is poor from other providers with only a handful of outdated models and poor latency.
Hey, great that you mentioned this. We actually had BAAI/bge-m3 on our list of models to put up in the near future to see if people had use for it over an API. It's great to hear that this is something you're looking for. If you could let us know if there was a specific model you wanted to run, we can look into getting that put up soon.
ColBERT and ColQwen are underserved and would benefit from a latency-optimized inference service.
Awesome, we really appreciate the suggestions! We'll look into getting these up and running shortly!
https://console.ncompass.tech/models has no models on it, just a "Get in Touch" button.
Hey, I'm one of the co-founders, thanks for letting us know! I've just run it a few times and seems fine on my end. It does take a second to load the models, but feel free to let me know if this persists.
Same here. Waited 10 seconds. Then gave up. If the list of models takes so long to load, why should I trust you with loading the models themselves? :)
Hey, that link corresponds to the private models list, which only works once you've created an account. If you'd like to see the public models page, please check it out here: https://console.ncompass.tech/public-models.
We've put a wrong hyperlink on the website, but we've fixed that now, thanks for letting us know.
As for why we can reliably host models but slipped up on the website, that largely comes down to our technical background. All of us are hardware engineers, so front-end work is not our strong suit :). But our experience as hardware engineers makes us confident in hosting the models themselves. Both 8B and 70B models, if cached, do actually load in 40s, but please feel free to try out the system and see for yourself!
Looks like you put the private models link in your Show HN post text as well - it's worth fixing.
Are you planning to support any image or video generation models, or focusing on text for now?
Thanks for letting us know, we've updated it now!
Although we're currently only supporting text models, we definitely have image and video generation models on our roadmap. These are very compute-intensive models, so they would benefit greatly from optimizations. We'd love to hear more about any specific models you're hoping to run! Please feel free to message us with further details (diederik.vink@ncompass.tech).
I've edited the text to include both links and explain the difference between them. Thanks!
this sounds like black magic, kudos to you. i'd love to chat, dm me on https://twitter.com/swyx if you'd find it useful to chat with someone like me.
Thank you! Absolutely, I'll send over a DM and we can take it from there!
What is Groq (rate limited) missing that you aren't?
That's a great question, but it's hard to get enough insight into how Groq is serving models to properly know what's missing.
If I had to hazard a guess, it would be that their system architecture (# of chips and chip architecture itself) might not be designed for a high concurrency situation.