Request volume to an endpoint varies widely, from moderate in small projects to heavy in production environments. A private endpoint can be scaled to match the load: under the hood, it can combine multiple virtual or dedicated GPU servers, with our chat.immers.cloud load balancer distributing requests evenly across them.
You can specify the desired number of servers when creating a private endpoint; each server runs its own vLLM instance with its own copy of the model weights.
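
To illustrate what even distribution across servers means, here is a minimal sketch in Python. It assumes a simple round-robin policy; the actual algorithm used by the chat.immers.cloud load balancer is not described here, and the server names (`replica-1` and so on) and the `distribute` function are hypothetical, chosen only for illustration.

```python
from collections import Counter
from itertools import cycle


def distribute(requests, servers):
    """Assign each request to a server in round-robin order.

    Round-robin is an assumed policy used for illustration only;
    the real balancer may use a different strategy.
    """
    rotation = cycle(servers)
    return [(request, next(rotation)) for request in requests]


# Hypothetical replicas: each one stands for a server running its own
# vLLM instance with its own copy of the model weights.
servers = ["replica-1", "replica-2", "replica-3"]
requests = [f"request-{i}" for i in range(9)]

assignments = distribute(requests, servers)

# Count how many requests each replica received.
load = Counter(server for _, server in assignments)
print(dict(load))  # {'replica-1': 3, 'replica-2': 3, 'replica-3': 3}
```

With nine requests and three replicas, each replica receives three requests. Because every replica holds a full copy of the model, any of them can serve any request, which is what allows the balancer to spread the load freely.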