I have heard this mentioned online and in casual conversation:
**During business hours these endpoints may get so highly utilised that accuracy and quality decline (ignoring latency). Perhaps the effective available token capacity is reduced?**
Has anyone run studies or experiments, across a large dataset, to show whether the performance of these endpoints changes between peak and low usage hours? I assume some statistical significance test would be required; a sketch of the kind of experiment I mean is below the links.
I don't have the knowledge to answer this question or even to judge whether it's a valid hypothesis. Anyone got resources on this? I could only find tangential mentions, e.g. in the links below.
[0] https://arxiv.org/html/2507.18007v1 "These challenges contribute to bottlenecks during peak workloads, ultimately affecting inference service quality, scalability, and responsiveness, which requires accurate resource profiling for LLM inference task."
[1] Asking Gemini - https://share.google/aimode/bWb5w9dpf2ZeggqSj
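To make the hypothesis concrete, here's a minimal sketch of the experiment I have in mind, assuming an OpenAI-style chat-completions HTTP API. The URL, model name, benchmark items, and peak-hour window are all placeholders, not anything from the papers above:

```python
# Sketch: probe an LLM endpoint at peak vs. off-peak hours and test whether
# per-run accuracy differs. Everything endpoint-specific is a placeholder.
import datetime
import json
import urllib.request

from scipy.stats import mannwhitneyu  # nonparametric two-sample test

API_URL = "https://api.example.com/v1/chat/completions"  # hypothetical endpoint
API_KEY = "sk-..."  # your key

# Fixed benchmark with known answers, so scoring is deterministic.
BENCHMARK = [
    {"prompt": "What is 17 * 23? Answer with the number only.", "answer": "391"},
    # ... more items
]

def query(prompt: str) -> str:
    """Send one prompt to the endpoint and return the model's text."""
    body = json.dumps({
        "model": "some-model",  # placeholder
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # pin sampling so only server-side effects vary
    }).encode()
    req = urllib.request.Request(
        API_URL, data=body,
        headers={"Authorization": f"Bearer {API_KEY}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def run_benchmark() -> list[int]:
    """Return a 0/1 correctness score per benchmark item."""
    return [int(item["answer"] in query(item["prompt"])) for item in BENCHMARK]

def record(scores_file: str) -> None:
    """Run the benchmark once and append scores, tagged by time-of-day condition.
    Schedule this (e.g. via cron) many times in each window."""
    hour = datetime.datetime.now(datetime.timezone.utc).hour
    condition = "peak" if 14 <= hour < 22 else "offpeak"  # pick your own window
    with open(scores_file, "a") as f:
        f.write(json.dumps({"condition": condition,
                            "scores": run_benchmark()}) + "\n")

def analyze(scores_file: str) -> None:
    """Compare per-run accuracy between conditions with a Mann-Whitney U test.
    H0: per-run accuracy has the same distribution at peak and off-peak."""
    peak, offpeak = [], []
    with open(scores_file) as f:
        for line in f:
            rec = json.loads(line)
            (peak if rec["condition"] == "peak" else offpeak).append(sum(rec["scores"]))
    stat, p = mannwhitneyu(peak, offpeak, alternative="two-sided")
    print(f"peak n={len(peak)}, offpeak n={len(offpeak)}, U={stat}, p={p:.4f}")
```

The temperature-0 setting and fixed benchmark are meant to hold the client side constant, so any distribution shift between the two conditions points at the server side; whether that isolation actually works is part of what I'm asking.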
I've wondered this too actually.