How to reduce GPU infrastructure costs by 2.5x and enable 24/7 AI development

How Cels cut GPU infrastructure costs by 2.5x by migrating its AI staging environments to immers.cloud: a case study on how on-demand GPU server configurations solved GPU shortages and enabled 24/7 development.

  • Client: LLC “Medical Screening Systems” (brand Cels)
  • Industry: MedTech (AI in radiology, LLMs for healthcare)
  • Product: AI-powered medical image processing service + AI scribe for clinical consultations and analytics on medical databases

Problem: How cloud pricing and GPU shortages were slowing down development

Cels develops a wide range of AI solutions for healthcare — from medical image analysis to automation of clinical documentation and data processing. In this case study, we focus on infrastructure optimization for two active projects that drove the migration:

  • 3D Computer Vision for detection and segmentation of pathologies in chest CT scans;
  • LLM-based projects for transcribing doctor–patient conversations (AI scribe) and analytics on structured medical data.

Prior to migration, the team hosted their staging environments and test instances in a major public cloud. Over time, they encountered a classic ML development challenge: sharp price increases for GPU servers and a lack of configurations with 16–24 GB VRAM — ideal for staging and debugging.

To stay within budget, the team had to shut down machines overnight and on weekends. But as the workload grew and GPU availability at the provider dwindled, even this workaround stopped working. Development of new versions, calibration tests under the Moscow Experiment, and LLM hypothesis validation began to depend not on business priorities but on pricing windows and resource availability.

How moving infrastructure to immers.cloud solved the resource availability issue

Instead of purchasing dedicated hardware or overpaying for oversized cloud instances, Cels’ CTO decided to rent GPU configurations from immers.cloud. The prepaid model and transparent pricing let the team quickly deploy staging environments without lengthy approvals or hidden markups.

Infrastructure setup was handled internally by the DevOps engineer, and communication with our team was limited to operational questions about pricing plans. The migration went smoothly: pipelines remained intact, and integration with the production backend in Yandex Cloud stayed seamless.

Technical implementation: Two projects, one platform

Project 1: 3D CV Model Inference (Chest CT)

  • Stack: PyTorch, Python, Redis
  • Orchestration: Kubernetes
  • Configuration: rtx3090-1.32.64.160 + dedicated CPU node for connectivity with the main cluster in Yandex Cloud
  • Workload: Staging inference (internal tests, external calibration runs)
  • Performance: Stable parallel processing of 1–2 studies (each containing 300–1,000 images), fully meeting pre-production needs (see the worker sketch after this list)
  • Storage: Local instance disks
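
For illustration, here is a minimal sketch of what such a staging worker could look like. Everything in it is an assumption rather than a detail from the Cels deployment: the TorchScript model file, the Redis queue names, and the slab size are placeholders.

```python
# Minimal sketch of a staging inference worker for 3D CT studies.
# Assumptions (not from the case study): the model is a TorchScript export
# of a 3D segmentation network, and studies arrive as JSON jobs on a Redis
# list pushed by the backend. All names here are illustrative.
import json

import redis
import torch

DEVICE = "cuda"  # single RTX 3090 (24 GB VRAM) on the staging node

model = torch.jit.load("models/chest_ct_seg.pt").to(DEVICE).eval()
queue = redis.Redis(host="localhost", port=6379)


@torch.inference_mode()
def segment_study(volume: torch.Tensor, chunk: int = 32) -> torch.Tensor:
    """Run the model over a CT volume (D, H, W) in slabs of `chunk`
    slices so a 300-1,000 slice study fits in 24 GB of VRAM."""
    masks = []
    for start in range(0, volume.shape[0], chunk):
        slab = volume[start:start + chunk].unsqueeze(0).unsqueeze(0)  # (1, 1, d, H, W)
        masks.append(model(slab.to(DEVICE)).squeeze(0).squeeze(0).cpu())
    return torch.cat(masks, dim=0)


while True:
    # Block until the backend pushes the next study job onto the queue.
    _, payload = queue.blpop("ct_jobs")
    job = json.loads(payload)
    volume = torch.load(job["volume_path"])  # preprocessed (D, H, W) tensor
    mask = segment_study(volume)
    torch.save(mask, job["volume_path"] + ".mask.pt")
    queue.rpush("ct_done", json.dumps({"job_id": job["id"], "status": "ok"}))
```

Processing the volume in fixed-size slabs is one common way to keep peak VRAM bounded, which is what lets a single 24 GB card handle even 1,000-slice studies.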

Project 2: LLM Inference (AI Scribe)

  • Stack: vLLM
  • Orchestration: Docker Compose
  • Configuration: rtx3090-1.8.32.160 + A2-based GPU machine for Speech-to-Text model inference
  • Workload: Inference of a custom LLM based on Qwen 3, A/B testing of hypotheses, quality metrics collection (see the sketch after this list)
  • Performance: 3–5 parallel requests, optimal for research tasks and prompt/architecture validation
  • Storage: Local, with scheduled artifact synchronization
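
As a rough sketch of this environment, the snippet below runs a small batch of requests through vLLM's offline API. The checkpoint name, context length, and prompts are placeholders: the team's actual model is a custom Qwen 3 derivative, not this exact configuration.

```python
# Batched LLM inference via vLLM's offline API (illustrative sketch).
from vllm import LLM, SamplingParams

# On a single RTX 3090 (24 GB), an 8B-class model in FP16 is a tight fit;
# a quantized or smaller variant is a safer assumption for staging.
llm = LLM(model="Qwen/Qwen3-8B", max_model_len=8192)

params = SamplingParams(temperature=0.2, max_tokens=1024)

# A batch of 3-5 requests, matching the parallelism cited above.
transcripts = [
    "Summarize this doctor-patient conversation into a SOAP note: ...",
    "Extract medications and dosages from the following dialogue: ...",
    "List the patient's reported symptoms from this transcript: ...",
]

for output in llm.generate(transcripts, params):
    print(output.outputs[0].text)
```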

Both projects run in isolated staging environments but can easily scale when transitioning to training or production deployment.

Results in numbers

Each metric is shown as "before migration → with immers.cloud":

  • Staging infrastructure cost: base rate + surcharges for idle time/scaling → reduced by 2–2.5x
  • Resource availability: limited to night/weekend windows → available 24/7, no restrictions
  • Speed of deploying new hypotheses: dependent on quotas and GPU availability → nodes launched on demand in minutes
  • Administration: manual pricing-plan management and constant limit monitoring → handled by internal DevOps, with immers.cloud support available during business hours

The team gained a predictable budget, the ability to run tests at any time, and the flexibility to spin up additional machines for training or new projects.

Client quote:

“Migrating our AI development environments to immers.cloud significantly reduced our infrastructure costs while making resources available 24/7. This allows our ML teams to confidently run any tests, validate new model versions, and develop new AI projects and hypotheses.”

Conclusion

The Cels case confirms: staging, pipeline debugging, and hypothesis validation often do not require enterprise-grade instances based on data center GPUs—in many scenarios, such configurations offer no meaningful advantage for inference and R&D workloads.

In practice, these tasks are efficiently handled by widely available GPUs like the NVIDIA RTX 3090 and RTX 4090, which feature 24 GB of VRAM and sufficient performance for both CV and LLM inference. Moreover, their high availability in our cloud enables rapid scaling.
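
A quick back-of-envelope check (a sketch, assuming FP16 weights at 2 bytes per parameter) illustrates why: the weights of 7–8B-parameter models fit a 24 GB card with headroom left for KV cache and activations, while larger models call for quantization.

```python
# Back-of-envelope VRAM estimate for a 24 GB card (RTX 3090/4090).
# Assumption: FP16 weights, 2 bytes per parameter; KV cache and
# activations need additional headroom on top of this.
def weight_vram_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    return params_billion * 1e9 * bytes_per_param / 1024**3

for size in (7, 8, 14):
    print(f"{size}B params @ FP16 ≈ {weight_vram_gb(size):.1f} GB of weights")
# 7B ≈ 13.0 GB and 8B ≈ 14.9 GB fit in 24 GB with room for KV cache;
# 14B ≈ 26.1 GB needs quantization (e.g., 8-bit ≈ 13 GB) on a 24 GB card.
```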

If your team works on research tasks, computer vision calibration, or LLM inference, these GPUs allow you to launch experiments quickly and scale workloads without delays.

Updated: 16.04.2026