In this case study, we examine how three RTX 4090 GPUs, isolated compute streams, and a switch to NVMe storage turned an unstable pipeline into a fault-tolerant system that automatically blocks hundreds of spam attacks daily, fully replacing manual moderation. With no Kubernetes, no hidden egress fees, and model loading in seconds, discover how KS Auto built a self-sustaining architecture, and why choosing the right GPU infrastructure became the key strategic decision for securely scaling their AI product.
The KS Auto Group has built a digital ecosystem for car enthusiasts, including the https://ks-auto.ru classifieds platform, a club service, a Telegram channel with ~200,000 subscribers, and a YouTube channel with ~1.5 million subscribers. As user activity grew, ensuring content safety and quality became critical. The team needed to automate three core processes:

- filtering spam and malicious profiles in the Telegram community;
- moderating photos in vehicle listings, including make/model verification and license plate blurring;
- moderating text content for the blog.
This led to the development of a multi-layered AI system with approximately 10 validation layers.
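The exact composition of those layers is internal, but conceptually such a pipeline is an ordered chain of independent checks, where content advances only if the previous layer approves it. A minimal sketch of the pattern (layer names and rules are hypothetical, for illustration only):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    approved: bool
    reason: str = ""

# A validation layer is any callable that inspects content and votes.
Layer = Callable[[dict], Verdict]

def run_pipeline(content: dict, layers: list[Layer]) -> Verdict:
    """Pass content through each layer in order; stop at the first rejection."""
    for layer in layers:
        verdict = layer(content)
        if not verdict.approved:
            return verdict
    return Verdict(approved=True)

# Two hypothetical layers out of ~10:
def check_spam_keywords(content: dict) -> Verdict:
    banned = {"casino", "giveaway"}
    hits = banned & set(content.get("text", "").lower().split())
    return Verdict(not hits, f"spam keywords: {hits}" if hits else "")

def check_make_consistency(content: dict) -> Verdict:
    # In production this check would call a multimodal model that
    # compares the listing photo against the declared make/model.
    ok = content.get("declared_make") == content.get("detected_make")
    return Verdict(ok, "" if ok else "make/model mismatch")

print(run_pipeline(
    {"text": "selling my sedan", "declared_make": "Lada", "detected_make": "Lada"},
    [check_spam_keywords, check_make_consistency],
))
```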
Initially, the team prototyped the solution using external cloud APIs (e.g., OpenAI and similar services). However, upon moving to production, key limitations of this approach quickly emerged, all of them common challenges in ML infrastructure:

- dependence on third-party API policies, including the risk of account bans when moderation traffic inevitably contains prohibited material;
- reliance on residential internet quality and local hardware once inference was moved to a home server;
- no guarantee of continuous availability for a pipeline that must run 24/7.
As a result, the moderation system, designed to operate as a seamless, autonomous pipeline, became hostage to third-party API policies, home internet reliability, and local hardware constraints. Further product development and platform expansion began to depend not on business priorities, but on whether the team could keep the system online and retain access to external AI services.
To break the cycle of instability and dependency, the team migrated inference to a hosted cloud server from immers.cloud, deploying a single Ubuntu-based instance equipped with three RTX 4090 GPUs. The core architectural principle was stream isolation: each GPU was permanently assigned to one moderation task. This eliminated task queuing across shared resources and ensured every pipeline operated independently and predictably.
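One straightforward way to implement this kind of isolation (a sketch, not necessarily the client's exact code; the task names and the worker.py entry point are assumptions) is to pin each moderation worker to its own GPU via CUDA_VISIBLE_DEVICES:

```python
import os
import subprocess

# Each moderation task gets its own process pinned to a dedicated GPU.
# Task names and the worker script are illustrative assumptions.
TASKS = {
    0: "telegram_spam_filter",
    1: "listing_photo_moderation",
    2: "blog_text_moderation",
}

workers = []
for gpu_id, task in TASKS.items():
    env = os.environ.copy()
    # The worker process sees only its assigned GPU (as cuda:0),
    # so pipelines never queue on a shared device.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    workers.append(subprocess.Popen(
        ["python", "worker.py", "--task", task], env=env))

for w in workers:
    w.wait()
```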
Implementation was not without challenges, but each technical obstacle became a growth point for the system architecture:
Model loading speed. Initially, the server was delivered with HDD drives. Loading a large multimodal model took about 12 minutes, making real-time task switching impossible. As a temporary workaround, the team used a RAM disk, keeping model weights in memory for instant access. In parallel, the issue was raised with immers.cloud support, and the server was quickly upgraded to NVMe storage. The problem is now fully resolved: large models load in seconds (up to 1.5 minutes for the heaviest versions).
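A RAM disk workaround of this kind can be as simple as staging the model directory into a tmpfs mount before loading. The paths below are hypothetical, and /dev/shm is assumed to be available, as on a standard Ubuntu install:

```python
import shutil
from pathlib import Path

# Assumption: model weights live on slow HDD storage; /dev/shm is a
# tmpfs (RAM-backed) mount present on standard Ubuntu systems.
SLOW_DIR = Path("/data/models/multimodal")    # hypothetical HDD path
RAM_DIR = Path("/dev/shm/models/multimodal")  # RAM disk copy

def stage_to_ramdisk(src: Path, dst: Path) -> Path:
    """Copy model weights into RAM once; later loads read at memory speed."""
    if not dst.exists():
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copytree(src, dst)
    return dst

model_dir = stage_to_ramdisk(SLOW_DIR, RAM_DIR)
# ...then point the inference framework at model_dir instead of SLOW_DIR.
```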
Model juggling. Before stable multimodal solutions became available, separate weights for text and images had to be maintained on the GPUs, requiring constant unloading and reloading, which introduced unnecessary delays and pipeline failure risks. With the arrival of high-quality models like Qwen and Gemma, this phase is now obsolete: a single neural network per GPU now processes both text and images in parallel without reloads.
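As a sketch of the current setup, a single open multimodal model loaded once per GPU can answer both text and image moderation queries through the Hugging Face transformers API (the model version, prompts, and file name are illustrative assumptions; the article names Qwen and Gemma without specifying versions):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

MODEL_ID = "Qwen/Qwen2-VL-7B-Instruct"  # illustrative version choice

# Loaded once per GPU; no unloading/reloading between text and image tasks.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="cuda:0")
processor = AutoProcessor.from_pretrained(MODEL_ID)

def moderate(text: str, image: Image.Image | None = None) -> str:
    """Run one moderation query; works with or without an image."""
    content = ([{"type": "image"}] if image is not None else []) + [
        {"type": "text", "text": text}]
    prompt = processor.apply_chat_template(
        [{"role": "user", "content": content}], add_generation_prompt=True)
    inputs = processor(text=[prompt],
                       images=[image] if image is not None else None,
                       return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    return processor.batch_decode(out, skip_special_tokens=True)[0]

# The same loaded weights handle both modalities:
print(moderate("Is this blog comment spam? Text: 'win a free casino bonus'"))
print(moderate("Does this photo show the declared make and model?",
               Image.open("listing_photo.jpg")))
```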
Compared to a home server: The key difference is 24/7 stability. The system no longer depends on residential internet quality or local power grid overloads. Moderation runs continuously, and the risk of spam breakthroughs during downtime has been completely eliminated.
Compared to other cloud providers: Transparent pricing with no hidden traffic fees, combined with high connection speeds of up to 10 Gbps. Most importantly, OpenStack provides flexible virtualization: GPUs are managed as full-fledged virtual resources, so instances can be scaled quickly, configurations changed, and transitions to new hardware (e.g., RTX 5090) prepared, all without physical equipment replacement or service downtime.
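For illustration, with the openstacksdk Python client such a flavor change can look like the following (the cloud and flavor names are hypothetical; actual names depend on the provider's catalog):

```python
import openstack

# Assumptions: a cloud named "immers" is configured in clouds.yaml,
# and the flavor name below is a placeholder, not a real catalog entry.
conn = openstack.connect(cloud="immers")

server = conn.compute.find_server("moderation-node")
flavor = conn.compute.find_flavor("3x-rtx4090")  # hypothetical flavor name

# Resize the instance to the new flavor (e.g., when preparing a move
# to newer GPUs) without manually replacing physical hardware.
conn.compute.resize_server(server, flavor)
server = conn.compute.wait_for_server(server, status="VERIFY_RESIZE")
conn.compute.confirm_server_resize(server)
```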
The client came through a partner recommendation. After comparing with other platforms, the key arguments were:

- guaranteed 24/7 stability instead of a home server's residential internet and power constraints;
- transparent pricing with no hidden traffic fees and connection speeds of up to 10 Gbps;
- flexible OpenStack-based virtualization of GPU resources;
- responsive support that provides substantive answers.
| Metric | Value | Business Impact |
|---|---|---|
| System Operation Mode | 24/7, without human intervention | Continuous content protection, eliminating vulnerability windows during downtime |
| Telegram Spam Filtering | 100+ profiles daily | Preserving community quality and channel reputation with a 200,000+ audience |
| Photo Moderation in Listings | Automatic make/model consistency verification + license plate blurring | Reduced risk of moderator errors, faster publication of legitimate listings |
| Blog Text Moderation | Fully AI-powered | Content publishes quickly, without delays for manual review |
| Model Loading | Seconds (up to 1.5 min for large models) on NVMe | Ability to rapidly switch between tasks and update models |
| Architecture | 3× RTX 4090, isolated streams | No queuing, independent scaling for each moderation direction |
| Labor Cost Savings | Equivalent to 10–15 full-time moderators | Resources redirected to product development instead of routine moderation |
| Content Safety | Local inference, no data sent to external APIs | Eliminated risk of account bans due to transmission of prohibited materials |
> We set it up, and it just works. For us, that's the main criterion. The infrastructure's reliability fully met our expectations, and support is always available and provides substantive answers.
>
> — Mikhail, AI/ML Technical Consultant, KS Auto
This case study confirms our cloud's strengths in the AI infrastructure market and provides clear directions for sales and product development.