Llama-3-8B GPT-4o-RU1.0 is a fine-tuned version of the Llama-3-8B-Instruct model, created to significantly improve performance on Russian-language tasks. The developer's key idea was to build a high-quality training dataset using GPT-4o, an OpenAI model known for its strong multilingual abilities. A carefully cleaned and structured subset of the tagengo-gpt4 dataset served as the foundation, enabling efficient training. 80% of the training examples were in Russian, making this model a specialized tool for Russian-language tasks.
From a technical standpoint, training was conducted for one epoch on two NVIDIA A100 accelerators using the Axolotl framework. The model architecture retains the base structure of Llama 3, with optimizations applied during training: Flash Attention 2 for accelerated processing and DeepSpeed ZeRO-2 for efficient memory distribution. The weights are saved in bfloat16 format for an optimal balance of performance and precision.
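The setup described above could be expressed as an Axolotl config along these lines. This is an illustrative sketch, not the author's actual file: the dataset path and value choices are assumptions; only standard Axolotl option names are used.

```yaml
# Illustrative Axolotl config matching the described setup (assumed values)
base_model: meta-llama/Meta-Llama-3-8B-Instruct
datasets:
  - path: tagengo_ru_subset.jsonl   # hypothetical cleaned tagengo-gpt4 subset
    type: sharegpt
sequence_len: 8192
num_epochs: 1                       # trained for one epoch
bf16: true                          # weights saved in bfloat16
flash_attention: true               # Flash Attention 2
deepspeed: deepspeed_configs/zero2.json  # DeepSpeed ZeRO-2 memory sharding
```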
The quality of this model is confirmed by results on the MT-Bench benchmark (a multilingual test for evaluating dialog capabilities). In Russian, the model scored 8.12 points, surpassing GPT-3.5-turbo (7.94) and coming very close to the Suzume model (8.19), especially considering that the latter was trained on a dataset eight times larger and more diverse. It is important to note that, unlike many multilingual models where improving one language can degrade performance in English, this model shows an increase in English scores from 7.98 (for the base Llama-3) to 8.01, making it a well-balanced solution.
Thanks to its high proficiency in Russian-language tasks, the model is well-suited for a wide range of applications in the Russian-speaking domain.
| Model Name | Context | Type | GPU | TPS | Tooling | Status | Link |
|---|---|---|---|---|---|---|---|
| ruslandev/llama-3-8b-gpt-4o-ru1.0 | 8,192 | Public | RTX4090 | 54.00 |  | AVAILABLE | chat |
curl https://chat.immers.cloud/v1/endpoints/Llama-3-8B-GPT-4o-RU/generate/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer USER_API_KEY" \
-d '{"model": "Llama-3-8B-GPT-4o-RU", "messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Say this is a test"}
], "temperature": 0, "max_tokens": 150}'
$response = Invoke-WebRequest https://chat.immers.cloud/v1/endpoints/Llama-3-8B-GPT-4o-RU/generate/chat/completions `
-Method POST `
-Headers @{
"Authorization" = "Bearer USER_API_KEY"
"Content-Type" = "application/json"
} `
-Body (@{
model = "Llama-3-8B-GPT-4o-RU"
messages = @(
@{ role = "system"; content = "You are a helpful assistant." },
@{ role = "user"; content = "Say this is a test" }
)
} | ConvertTo-Json)
($response.Content | ConvertFrom-Json).choices[0].message.content
# pip install openai --upgrade
from openai import OpenAI
client = OpenAI(
api_key="USER_API_KEY",
base_url="https://chat.immers.cloud/v1/endpoints/Llama-3-8B-GPT-4o-RU/generate/",
)
chat_response = client.chat.completions.create(
model="Llama-3-8B-GPT-4o-RU",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Say this is a test"},
]
)
print(chat_response.choices[0].message.content)
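The same request can also be issued with only the Python standard library, which is convenient when installing the openai package is not an option. This is a sketch built from the endpoint, model name, headers, and payload shown in the examples above; `USER_API_KEY` is a placeholder, and `build_chat_request` is a helper name introduced here for illustration.

```python
import json
import urllib.request

def build_chat_request(api_key: str, messages: list) -> urllib.request.Request:
    """Build the POST request for the chat completions endpoint."""
    url = ("https://chat.immers.cloud/v1/endpoints/"
           "Llama-3-8B-GPT-4o-RU/generate/chat/completions")
    body = json.dumps({
        "model": "Llama-3-8B-GPT-4o-RU",
        "messages": messages,
        "temperature": 0,
        "max_tokens": 150,
    }).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

req = build_chat_request("USER_API_KEY", [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Say this is a test"},
])
# Sending it is one call away: urllib.request.urlopen(req)
```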
Rent your own physically dedicated instance with hourly or long-term monthly billing.
We recommend deploying private instances in the following scenarios:
| Name | Context | Parallelism | GPUs | Price/hr | TPS |
|---|---|---|---|---|---|
|  | 8,192 |  | 1 | $0.33 | 6.200 |
|  | 8,192 |  | 1 | $0.38 | 1.700 |
|  | 8,192 |  | 1 | $0.38 | 6.200 |
|  | 8,192 |  | 1 | $0.53 | 13.400 |
|  | 8,192 |  | 1 | $0.57 | 0.800 |
|  | 8,192 |  | 1 | $0.83 | 13.400 |
|  | 8,192 |  | 1 | $1.02 | 13.400 |
|  | 8,192 |  | 1 | $1.20 | 20.600 |
|  | 8,192 | tensor | 2 | $1.23 | 32.500 |
|  | 8,192 |  | 1 | $1.59 | 20.600 |
|  | 8,192 |  | 1 | $2.37 | 63.800 |
|  | 8,192 |  | 1 | $3.83 | 63.800 |
|  | 8,192 |  | 1 | $4.11 | 76.400 |
|  | 8,192 |  | 1 | $4.74 | 118.700 |
| Name | Context | Parallelism | GPUs | Price/hr | TPS |
|---|---|---|---|---|---|
|  | 8,192 |  | 1 | $0.33 | 4.421 |
|  | 8,192 |  | 1 | $0.38 | 4.421 |
|  | 8,192 |  | 1 | $0.53 | 11.621 |
|  | 8,192 | tensor | 2 | $0.69 | 7.321 |
|  | 8,192 |  | 1 | $0.83 | 11.621 |
|  | 8,192 | tensor | 2 | $0.97 | 5.521 |
|  | 8,192 |  | 1 | $1.02 | 11.621 |
|  | 8,192 |  | 1 | $1.20 | 18.821 |
|  | 8,192 | tensor | 2 | $1.23 | 30.721 |
|  | 8,192 |  | 1 | $1.59 | 18.821 |
|  | 8,192 |  | 1 | $2.37 | 62.021 |
|  | 8,192 |  | 1 | $3.83 | 62.021 |
|  | 8,192 |  | 1 | $4.11 | 74.621 |
|  | 8,192 |  | 1 | $4.74 | 116.921 |
| Name | Context | Parallelism | GPUs | Price/hr | TPS |
|---|---|---|---|---|---|
|  | 8,192 |  | 1 | $0.53 | 3.030 |
|  | 8,192 | tensor | 2 | $0.54 | 7.730 |
|  | 8,192 | tensor | 2 | $0.57 | 7.730 |
|  | 8,192 |  | 1 | $0.83 | 3.030 |
|  | 8,192 | pipeline | 3 | $0.84 | 6.130 |
|  | 8,192 |  | 1 | $1.02 | 3.030 |
|  | 8,192 | tensor | 4 | $1.12 | 13.530 |
|  | 8,192 |  | 1 | $1.20 | 10.230 |
|  | 8,192 | tensor | 2 | $1.23 | 22.130 |
|  | 8,192 | pipeline | 3 | $1.43 | 3.430 |
|  | 8,192 |  | 1 | $1.59 | 10.230 |
|  | 8,192 | tensor | 4 | $1.82 | 9.930 |
|  | 8,192 |  | 1 | $2.37 | 53.430 |
|  | 8,192 |  | 1 | $3.83 | 53.430 |
|  | 8,192 |  | 1 | $4.11 | 66.030 |
|  | 8,192 |  | 1 | $4.74 | 108.330 |
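A quick way to compare the tiers above is cost per generated token: given an hourly price and a throughput figure, cost per million tokens is price / (TPS × 3600) × 10⁶. This sketch assumes the TPS column reflects sustained throughput at full utilization, which is an optimistic upper bound.

```python
def cost_per_million_tokens(price_per_hour: float, tps: float) -> float:
    """Dollar cost per 1M generated tokens at sustained throughput."""
    tokens_per_hour = tps * 3600
    return price_per_hour / tokens_per_hour * 1e6

# Example rows from the first pricing table:
print(round(cost_per_million_tokens(0.33, 6.200), 2))    # → 14.78 ($/M tokens)
print(round(cost_per_million_tokens(4.74, 118.700), 2))  # → 11.09 ($/M tokens)
```

Note that the fastest tier can end up cheaper per token than the cheapest hourly tier, so the right choice depends on whether you can keep the instance saturated.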
Contact our dedicated neural networks support team at nn@immers.cloud or send your request to the sales department at sale@immers.cloud.