When the listings database grew 10×, OpenAI API token costs became unsustainable. How can you lock in AI expenses and reliably process millions of requests without overpaying?
In this article, we examine the Affario case study: migrating to local inference of the Qwen 2.5 model on an immers.cloud GPU server. Learn how to select the right tech stack, automate categorization, and replace unpredictable spending with a fixed budget.
Read the full solution breakdown to discover how to scale your AI projects without increasing costs!
Affario specializes in integrating artificial intelligence into business processes. One of their key projects is an auto parts marketplace, where partner stores upload products without strict data structuring.
Until February 2026, product classification was handled via the OpenAI API. The system worked well until the database volume began growing exponentially.
Before: With a database of 300,000 listings, token costs amounted to approximately ₽30,000 over just a few days of peak load.
The forecast was grim: as the database grew to 3 million listings (which happened quickly), using an external API would become financially unsustainable. Variable costs became unpredictable, while purchasing and maintaining proprietary hardware was impractical.
The company needed a solution that would allow them to pay a fixed amount for data processing, regardless of volume.
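The economics behind this decision can be sketched in a few lines. The numbers below are hypothetical placeholders (not Affario's actual figures), but they show why per-token pricing scales with volume while a rented GPU server does not:

```python
# Illustrative cost comparison: per-token API pricing vs. fixed GPU hosting.
# All prices here are hypothetical placeholders, not Affario's real costs.

def api_cost(listings: int, tokens_per_listing: int, price_per_1k_tokens: float) -> float:
    """Variable cost: grows linearly with the number of listings processed."""
    return listings * tokens_per_listing / 1000 * price_per_1k_tokens

def hosted_cost(months: int, monthly_rate: float) -> float:
    """Fixed cost: depends only on how long the server is rented."""
    return months * monthly_rate

# Assumed: 500 tokens per listing, $0.002 per 1K tokens, $600/month GPU server.
variable = api_cost(3_000_000, 500, 0.002)  # 10x the cost of 300,000 listings
fixed = hosted_cost(1, 600.0)               # unchanged at any database volume
```

With these placeholder rates, the API bill grows exactly tenfold when the database does, while the hosting bill stays flat.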
Technical Solution: Migration to Local Inference
To address this challenge, the team chose to migrate to local inference. The key criterion was the ability to deploy the model quickly, reliably, and without hidden infrastructure setup complexities.
According to Almaz, the company's AI engineer, the platform was chosen for its optimal balance of price, hardware specifications, and interface usability.
Technology Stack: the Qwen 2.5 model, served via vLLM on an immers.cloud GPU server.
The classification process is fully automated and occurs in several stages without human intervention:
Input Data: A store uploads a listing containing only a title and description. Photos are stored separately and are not processed by the LLM.
Sequential Analysis (Chain-of-Thought): The model reasons step by step over the title and description to determine the most suitable category in the catalog.
Output: The model returns the category ID and name from the platform's internal classifier.
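The stages above can be sketched as a small Python client talking to a local vLLM server through its OpenAI-compatible API. The prompt wording, endpoint, model size, and JSON output schema below are assumptions for illustration, not Affario's production code:

```python
import json

def build_prompt(title: str, description: str) -> str:
    """Input stage: only text fields are sent to the LLM; photos stay in S3."""
    return (
        "Classify this auto-parts listing into the internal catalog. "
        "Reason step by step, then answer with a JSON object "
        '{"category_id": ..., "category_name": ...}.\n'
        f"Title: {title}\nDescription: {description}"
    )

def parse_category(model_output: str) -> dict:
    """Output stage: extract the final JSON object from the model's reply."""
    start = model_output.rindex("{")
    return json.loads(model_output[start:])

def classify(title: str, description: str, base_url: str = "http://localhost:8000/v1") -> dict:
    """Send the prompt to a local vLLM server via its OpenAI-compatible API."""
    from openai import OpenAI  # pip install openai
    client = OpenAI(base_url=base_url, api_key="not-needed-locally")
    reply = client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct",  # assumed model variant
        messages=[{"role": "user", "content": build_prompt(title, description)}],
        temperature=0.0,
    )
    return parse_category(reply.choices[0].message.content)
```

Because the server speaks the same API as OpenAI, switching from the external service to local inference mostly comes down to changing the `base_url` and model name.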
Collaboration began in February 2026. Since then, the project has demonstrated impressive scalability.
| Metric | Before (OpenAI API) | After (immers.cloud) |
|---|---|---|
| Database volume | 300,000 listings | 3,000,000+ listings |
| Cost model | Pay-per-token (variable, growing) | Fixed server hosting |
| Costs with 10× growth | Would have grown proportionally (prohibitively expensive) | Remained unchanged |
| Deployment complexity | — | Deployed independently in a short time |
Key Benefits for the Client:

- **Predictable budget.** The cost of processing 100,000 or 3 million listings is the same: you pay only for server hosting.
- **Autonomy and speed.** Almaz noted that deploying the Qwen 2.5 model via vLLM was completed without contacting technical support or encountering any difficulties. The platform proved to be intuitive.
- **Data security.** Local inference ensures full control over information, which is critical for commercial aggregators.
> "When the volume of listings grew 10×, token costs became unmanageable. Switching to local inference via immers.cloud allowed us to lock in processing costs regardless of volume. We deployed the model quickly, without unnecessary hurdles, and it just works."
The success of the first phase launched a new initiative. Right now, the Affario team is developing an AI agent for initial handling of customer inquiries. The agent will engage with marketplace users, answering routine questions before escalating requests to a live manager. The infrastructure for this solution will also be built on immers.cloud resources.
If your API costs are growing faster than your revenue, consider switching to local inference. Hosting GPU servers on immers.cloud lets you scale AI solutions without overpaying per token.
Choose your server configuration