How to Keep Your AI Budget Fixed When Your Database Grows 10×: Switching from the OpenAI API to Local Inference on immers.cloud

When the listings database grew 10×, OpenAI API token costs became unsustainable. How can you lock in AI expenses and reliably process millions of requests without overpaying?

In this article, we examine the Affario case study: migrating to local inference of the Qwen 2.5 model on an immers.cloud GPU server. Learn how to choose the right tech stack, automate categorization, and replace unpredictable spending with a fixed budget.

Read the full solution breakdown to discover how to scale your AI projects without increasing costs!

  • Industry: AI Integration into Business Processes
  • Task: Automatic categorization of millions of auto parts listings
  • Solution: Local deployment of Qwen 2.5 on an RTX 3090 GPU server
  • Result: Fixed costs instead of token-based payments, stable operation with a database of 3+ million listings

Introduction: The Scaling Problem with External APIs

Affario specializes in integrating artificial intelligence into business processes. One of their key projects is an auto parts marketplace, where partner stores upload products without strict data structuring.

Until February 2026, product classification was handled via the OpenAI API. The system worked well until the database volume began growing exponentially.

Before: With a database of 300,000 listings, token costs amounted to approximately ₽30,000 over just a few days of peak load.

The forecast was grim: once the database reached 3 million listings (which it soon did), an external API would become financially unsustainable. Variable costs were unpredictable, while buying and maintaining their own hardware was impractical.

The company needed a solution that would allow them to pay a fixed amount for data processing, regardless of volume.

Technical Solution: Migration to Local Inference

To address this challenge, the team chose to migrate to local inference. The key criterion was the ability to deploy the model quickly, reliably, and without hidden infrastructure setup complexities.

Why immers.cloud?

According to Almaz, the company's AI engineer, the platform was chosen for its optimal balance of price, hardware specifications, and interface usability.

Technology Stack:

  • Model: Qwen 2.5 (selected for its excellent understanding of Russian-language text and auto parts context)
  • Deployment Tool: vLLM for high-performance inference (see the launch sketch after this list)
  • Infrastructure: immers.cloud cloud server with NVIDIA RTX 3090 GPU
  • Data Storage: Separate server with S3 storage for images and original listing data
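
Because vLLM exposes an OpenAI-compatible HTTP endpoint, a migration like this can come down to launching a local server and changing the client's base URL. Below is a minimal sketch of that idea; the exact model variant (Qwen2.5-7B-Instruct), port, and prompt are illustrative assumptions, not Affario's actual configuration.

```python
# Server side (run once on the GPU host). The assumed 7B instruct variant
# fits in the RTX 3090's 24 GB of VRAM in fp16:
#
#   vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000
#
# Client side: the standard OpenAI SDK, pointed at the local endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # local vLLM instead of api.openai.com
    api_key="EMPTY",                      # vLLM does not check the key by default
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{
        "role": "user",
        "content": "Identify the vehicle make and model: 'Front brake pads, Camry 70, 2018-2023'",
    }],
    temperature=0.0,  # deterministic output suits classification tasks
)
print(response.choices[0].message.content)
```

Since the request format is unchanged, existing OpenAI-based application code keeps working; only the endpoint and model name differ.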

System Workflow Logic

The classification process is fully automated and occurs in several stages without human intervention:

Input Data: A store uploads a listing containing only a title and description. Photos are stored separately and are not processed by the LLM.

Sequential Analysis (Chain-of-Thought):

  • Step 1: The model identifies the vehicle make and model from the text.
  • Step 2: Based on the make/model, the vehicle type is determined (passenger car, truck, motorcycle, etc.).
  • Step 3: The parts category is identified (e.g., "silent blocks," "gearbox").

Output: The model returns the category ID and name from the platform's internal classifier.
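
A minimal sketch of this three-step chain is shown below. The prompt wording, helper names, and returned structure are hypothetical illustrations of the described flow; the real prompts and category classifier are Affario's internal details.

```python
# Hypothetical three-step classification chain over a local vLLM endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "Qwen/Qwen2.5-7B-Instruct"  # assumed model variant


def ask(prompt: str) -> str:
    """Send one prompt to the local model and return its text reply."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip()


def classify_listing(title: str, description: str) -> dict:
    text = f"{title}\n{description}"
    # Step 1: vehicle make and model from the raw title and description.
    make_model = ask(f"Extract the vehicle make and model from this auto parts listing:\n{text}")
    # Step 2: vehicle type, conditioned on the make/model found in step 1.
    vehicle_type = ask(
        f"The vehicle is '{make_model}'. Answer with its type only: "
        "passenger car, truck, or motorcycle."
    )
    # Step 3: parts category, mapped to the platform's internal classifier.
    category = ask(
        f"Listing:\n{text}\nVehicle: {make_model} ({vehicle_type})\n"
        "Return the matching parts category ID and name."
    )
    return {"make_model": make_model, "vehicle_type": vehicle_type, "category": category}
```

Because each step conditions on the previous answer, an error at the make/model stage is easy to trace, which is the usual motivation for this kind of staged decomposition.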

Implementation Results

Collaboration began in February 2026. Since then, the project has demonstrated impressive scalability.

| Metric | Before (OpenAI API) | After (immers.cloud) |
|---|---|---|
| Database volume | 300,000 listings | 3,000,000+ listings |
| Cost model | Pay-per-token (variable, growing) | Fixed server hosting |
| Costs with 10× growth | Would have grown proportionally (prohibitively expensive) | Remained unchanged |
| Deployment complexity | | Deployed independently in a short time |

Key Benefits for the Client:

Predictable budget. The cost of processing 100,000 or 3 million listings is the same—you pay only for server hosting.

Autonomy and speed. Almaz noted that deploying the Qwen 2.5 model via vLLM was completed without contacting technical support or encountering any difficulties. The platform proved to be intuitive.

Data security. Local inference ensures full control over information, which is critical for commercial aggregators.

"When the volume of listings grew 10×, token costs became unmanageable. Switching to local inference via immers.cloud allowed us to lock in processing costs regardless of volume. We deployed the model quickly, without unnecessary hurdles, and it just works," says Almaz, Affario's AI engineer.

What's Next?

The success of the first phase launched a new initiative. Right now, the Affario team is developing an AI agent for initial handling of customer inquiries. The agent will engage with marketplace users, answering routine questions before escalating requests to a live manager. The infrastructure for this solution will also be built on immers.cloud resources.

Want the Same?

If your API costs are growing faster than your revenue, consider switching to local inference. Hosting GPU servers on immers.cloud lets you scale AI solutions without overpaying per token.

Choose your server configuration

Updated: 14.05.2026