24/7 AI Moderation Instead of a 15-Person Team: How KS Auto Built an AI System on immers.cloud GPU Cloud

In this case study, we examine how three RTX 4090 GPUs, task-isolated inference streams, and a switch to NVMe storage turned an unstable pipeline into a fault-tolerant system that blocks more than a hundred spam accounts every day, completely replacing manual labor. With no Kubernetes, no hidden egress fees, and model loading measured in seconds, read on to learn how KS Auto built a self-sustaining architecture and why choosing the right GPU infrastructure became the key strategic decision for securely scaling their AI product.

  • Client: KS Auto Group (ks-auto.ru platform, Club Service, Telegram channel with ~200,000 subscribers)
  • Industry: Automotive Tech / AI-Powered User Content Moderation
  • Product: GPU Cloud Server from immers.cloud (3× RTX 4090, NVMe storage, OpenStack virtualization)

Problem: How reliance on external APIs, home servers, and slow disks blocked real-time AI moderation

The KS Auto Group has built a digital ecosystem for car enthusiasts—including the https://ks-auto.ru classifieds platform, a club service, a Telegram channel with ~200,000 subscribers, and a YouTube channel with ~1.5 million subscribers. As user activity grew, ensuring content safety and quality became critical. The team needed to automate three core processes:

  • Spam-bot filtering in messenger communications;
  • Photo verification in listings (e.g., checking vehicle make/model consistency and automatic license plate blurring);
  • Text moderation in blog sections.

This led to the development of a multi-layered AI system with approximately 10 validation layers.

Initially, the team prototyped the solution using external cloud APIs (e.g., OpenAI and similar services). However, upon moving to production, key limitations of this approach quickly emerged—common challenges in ML infrastructure:

  1. Compliance and blocking risks: Spam bots frequently uploaded content prohibited by public API terms (e.g., explicit or fraudulent material). Sending such data to third-party services risked immediate developer account bans and permanent loss of AI access.
  2. Fragility of local hardware: An attempt to run inference on a high-end GPU installed at a developer’s home exposed severe reliability issues. Unstable residential internet meant any connection drop halted moderation entirely. During just 30 minutes of downtime, dozens of spam accounts infiltrated the Telegram channel—directly damaging community trust and brand reputation.
  3. Scalability and cost inefficiency: Manual moderation at scale would have required a 10–15 person team working 24/7. Buying multiple top-tier GPUs for idle “just-in-case” use was financially unjustifiable, while lack of uptime guarantees and scaling flexibility stifled product growth.
  4. Performance bottlenecks: Early test servers used HDDs, causing large AI models to load in ~12 minutes—making rapid task switching impossible. The team resorted to RAM disks and maintained separate models for text and images, constantly juggling weights and overcomplicating the pipeline.

As a result, the moderation system, designed to operate as a seamless, autonomous pipeline, became hostage to third-party API policies, home internet reliability, and local hardware constraints. Further product development and platform expansion began depending not on business priorities, but on whether the team could keep the system online and retain access to external AI services.

Solution: Three GPUs, Three Tasks, One Server

To break the cycle of instability and dependency, the team migrated inference to a hosted cloud server from immers.cloud, deploying a single Ubuntu-based instance equipped with three RTX 4090 GPUs. The core architectural principle was stream isolation: each GPU was permanently assigned to one moderation task. This eliminated task queuing across shared resources and ensured every pipeline operated independently and predictably.
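To make the isolation principle concrete, below is a minimal sketch of how one Ollama server per GPU could be pinned to a dedicated moderation task. The task names, ports, and launch details are illustrative assumptions, not KS Auto's actual configuration.

```python
# Sketch: one Ollama server per GPU, each dedicated to a single moderation task.
# Assumes Ollama is installed on the host; task names and ports are illustrative.
import os
import subprocess

TASKS = {
    "spam_filter":     {"gpu": "0", "port": 11434},  # Telegram spam filtering
    "photo_check":     {"gpu": "1", "port": 11435},  # listing photo verification
    "text_moderation": {"gpu": "2", "port": 11436},  # blog text moderation
}

servers = []
for task, cfg in TASKS.items():
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = cfg["gpu"]          # this server sees only its GPU
    env["OLLAMA_HOST"] = f"127.0.0.1:{cfg['port']}"   # separate endpoint per task
    servers.append(subprocess.Popen(["ollama", "serve"], env=env))

# Each moderation pipeline talks only to its own endpoint, so a burst of work
# in one task (e.g., a spam wave in Telegram) never queues behind inference
# running for another task.
```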

Technical Stack & Optimizations

  • 24/7 Inference: Powered by Ollama and multimodal models including Qwen 3 (e.g., Qwen 3.6:35B), Gemma 4 (31B), and similar. Each GPU runs a single multimodal model capable of processing both text and images simultaneously, removing the need to constantly unload/load separate weights.
  • Fast Image Classification: Lightweight computer vision models run without LLMs, acting like photo sensors to handle basic filtering — offloading primary workload before deeper analysis.
  • Algorithmic Pre-Filter: Image deduplication via 64×64 perceptual hashes computed on the CPU — 12,000 comparisons in just 3 ms, enabling ultra-fast initial spam rejection (a minimal sketch of this pre-filter follows the list).
  • Storage Upgrade: Critical migration from HDD to NVMe. Model loading time dropped from ~12 minutes to seconds (max ~1.5 min for largest weights). During transition, a RAM disk kept models hot for instant access.
  • Orchestration: No Kubernetes or Slurm. Instead: custom per-GPU configs + OpenStack GPU virtualization, enabling flexible resource control and rapid scaling without orchestration overhead.
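As a sketch of the algorithmic pre-filter above, the snippet below computes a perceptual hash on the CPU and compares it against all previously blocked images in a single vectorized pass. It assumes the Pillow, imagehash, and NumPy libraries; the hash size (8×8, i.e. 64 bits) and distance threshold are illustrative and differ from the exact 64×64 hashes mentioned in the case.

```python
# Minimal sketch of a CPU-only perceptual-hash pre-filter.
# Assumes Pillow, imagehash, and NumPy; parameters are illustrative.
import numpy as np
import imagehash
from PIL import Image

HASH_SIZE = 8       # 8x8 DCT hash = 64 bits (the case itself mentions 64x64 hashes)
MAX_DISTANCE = 6    # Hamming-distance threshold for "same image" (assumption)

def compute_hash(path: str) -> np.ndarray:
    """Perceptual hash of one image as a flat boolean array."""
    return imagehash.phash(Image.open(path), hash_size=HASH_SIZE).hash.ravel()

def is_known_spam(path: str, known_hashes: np.ndarray) -> bool:
    """Compare a new image against all known spam hashes in one vectorized pass."""
    h = compute_hash(path)
    # XOR-style comparison + row sums = Hamming distance to every stored hash at once
    distances = np.count_nonzero(known_hashes != h, axis=1)
    return bool((distances <= MAX_DISTANCE).any())

# Usage: known_hashes is an (N, 64) boolean matrix built from previously blocked
# images; thousands of comparisons complete in milliseconds on a single CPU core.
```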

Challenges encountered and how they were resolved

Implementation was not without challenges, but each technical obstacle became an opportunity to strengthen the system architecture:

Model loading speed. Initially, the server was delivered with HDD storage. Loading a large multimodal model took about 12 minutes, making real-time task switching impossible. As a stopgap, the team used a RAM disk to keep models in memory for instant access. In parallel, the issue was raised with immers.cloud support, and the server was quickly upgraded to NVMe storage. The problem is now fully resolved: large models load in seconds (up to 1.5 minutes for the heaviest versions).

Model juggling. Before stable multimodal models became available, separate weights for text and images had to be kept on the GPUs and constantly unloaded and reloaded, which introduced unnecessary delays and pipeline failure risks. With the arrival of high-quality multimodal models such as Qwen and Gemma, this phase is obsolete: a single neural network per GPU processes both text and images in parallel without reloads.
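A minimal sketch of what "one multimodal model per GPU" looks like in practice: the same Ollama endpoint and model handle both a text-only check and an image-plus-text check, so no weights are ever swapped. The endpoint, model tag, and prompts are assumptions for illustration, not the production configuration.

```python
# Sketch: one multimodal model serving both text and image checks through the
# same Ollama HTTP endpoint. Endpoint, model tag, and prompts are assumptions.
import base64
import requests

OLLAMA_URL = "http://127.0.0.1:11435/api/chat"   # e.g., the listing-moderation GPU
MODEL = "gemma3:27b"                             # placeholder multimodal model tag

def moderate_text(text: str) -> str:
    """Text-only moderation request."""
    resp = requests.post(OLLAMA_URL, json={
        "model": MODEL,
        "messages": [{"role": "user",
                      "content": f"Is this listing text spam? Answer yes or no.\n\n{text}"}],
        "stream": False,
    })
    return resp.json()["message"]["content"]

def moderate_photo(text: str, image_path: str) -> str:
    """Image + text moderation request handled by the same model, no reload."""
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()
    resp = requests.post(OLLAMA_URL, json={
        "model": MODEL,
        "messages": [{"role": "user",
                      "content": f"Does the photo match this listing? {text}",
                      "images": [img_b64]}],
        "stream": False,
    })
    return resp.json()["message"]["content"]
```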

What distinguished this experience from previous ones

Compared to a home server: The key difference is 24/7 stability. The system no longer depends on residential internet quality or local power grid overloads. Moderation runs continuously, and the risk of spam breakthroughs during downtime has been completely eliminated.

Compared to other cloud providers: Transparent pricing with no hidden traffic fees, combined with high connection speeds of up to 10 Gbps. Most importantly, flexible virtualization via OpenStack: GPUs are managed as full-fledged virtual resources, so instances can be scaled quickly, configurations changed, and transitions to new hardware (e.g., RTX 5090) prepared, all without physical equipment replacement or service downtime.

Why immers.cloud specifically

The client came through a partner recommendation. After comparing with other platforms, the key arguments were:

  1. Comprehensive cost efficiency. No hidden fees for internet traffic and predictable pricing for cloud GPUs, which is critical for 24/7 inference;
  2. Qualified support. All issues were resolved promptly via Telegram: from selecting the optimal configuration for multimodal inference to migrating to NVMe storage. Specialists provided substantive answers and helped resolve technical issues without bureaucracy.

Results in numbers

| Metric | Value | Business Impact |
|---|---|---|
| System Operation Mode | 24/7, without human intervention | Continuous content protection, eliminating vulnerability windows during downtime |
| Telegram Spam Filtering | 100+ profiles daily | Preserves community quality and channel reputation for a 200,000+ audience |
| Photo Moderation in Listings | Automatic make/model consistency checks + license plate blurring | Reduced risk of moderator errors, faster publication of legitimate listings |
| Blog Text Moderation | Fully AI-powered | Content publishes quickly, without delays for manual review |
| Model Loading | Seconds (up to 1.5 min for large models) on NVMe | Ability to rapidly switch between tasks and update models |
| Architecture | 3× RTX 4090, isolated streams | No queuing, independent scaling for each moderation direction |
| Labor Cost Savings | Equivalent to 10–15 full-time moderators | Resources redirected to product development instead of routine moderation |
| Content Safety | Local inference, no data sent to external APIs | Eliminated risk of account bans due to transmission of prohibited materials |

"We set it up, and it just works. For us, that's the main criterion. The infrastructure's reliability fully met our expectations, and support is always available and provides substantive answers."

— Mikhail, AI/ML Technical Consultant, KS Auto

Conclusions

This case study confirms our cloud's strengths in the AI infrastructure market and provides clear directions for sales and product development:

  • Performance + availability: The RTX 4090 + NVMe combination delivers low-latency inference, critical for real-time moderation and multimodal model operation;
  • Economics without surprises: AI-segment clients highly value fixed pricing with no hidden traffic fees. This is a direct competitive advantage over major public clouds;
  • Readiness for complex stacks: We support custom configurations and OpenStack virtualization without forcing boxed orchestrators. This matters for teams building pipelines for Ollama/llama.cpp without Kubernetes overhead;
  • Support as part of the product: Rapid response to infrastructure requests (like enabling NVMe) directly impacts retention, LTV, and the creation of referral case studies.
Updated: 06.05.2026