To LLM or not to LLM?

And more importantly: which LLM?

If you can’t wrap your head around the recent changes in the Large Language Models (LLMs) scene, fear not. You’re not alone.

With over 60 LLMs available on the market, it’s easy to get lost in the labyrinth of choices. The scene gets more crowded month after month as new models enter the game. No wonder users around the globe keep pondering whether these new additions can beat the well-known giants.

If these users happen to work in the translation and localization industry, the first question that comes to mind is: Can I successfully use these new LLMs for translation tasks?

You’ll find the answer in our comprehensive test below.

🛝 New kids on the block

Earlier this year, DeepSeek made it to the headlines with a splash. However, that’s not the only LLM that has recently launched. There are dozens of other newcomers, too.

While DeepSeek promises high performance, open-source accessibility, high efficiency, and low computational costs, it raises significant data privacy concerns. For starters, it stores user information on servers located in China and fails to guarantee that EU users' rights under the GDPR will be respected. But that’s not all. The company collects various types of user data, including text inputs, prompts, uploaded files, chat history, device information, and even keystroke patterns. If that doesn’t sound like a Black Mirror episode coming to life, I don’t know what does.

As if that weren’t enough, DeepSeek tends to be highly biased. Prompt it for information on any topic censored by the Chinese government, and you’ll find yourself in a loop of diversions. 👀 The Cultural Revolution in China? It’s beyond the tool’s scope. Who owns the Spratly Islands? Let’s switch topics.

[Image: Censorship in DeepSeek.]

Questions about Chinese matters are not the only issues that DeepSeek likes to cover with silence. A study by Enkrypt AI found that 83% of bias tests on DeepSeek-R1 resulted in discriminatory output, with biases detected across race, gender, health, and religion. Whatever you plan to do with the tool, make sure to double-check the facts.

Doubao, another Chinese LLM released in January 2025 by ByteDance (the creators of TikTok), shows similar behavior when it comes to biases and censorship. However, it promises performance that outshines GPT-4o or Claude in areas such as knowledge retention, coding, reasoning, and Chinese language processing. It also scores higher on operational efficiency, charging merely USD 0.11 per million tokens for input and USD 0.013 for output. The industry giants’ costs are on average 70% higher. 📈

Cost-effectiveness is the secret weapon of yet another Chinese LLM. Qwen was released by Alibaba in 2023 and received its most powerful version, Qwen2.5-Max, in January this year. This model supports multilingual tasks across more than 29 languages and promises excellent long-form text generation, outstanding coding proficiency, and mathematical reasoning. It’s also fast and efficient: in terms of speed, Qwen matches GPT-4o and outperforms Gemini 1.5.

Doubao shines in operational efficiency, with considerably lower costs than the LLM giants. Qwen scores high in data privacy, in contrast to DeepSeek, which raises security concerns and tends to be highly biased.

Unlike DeepSeek, Qwen scores high in data privacy and seems to be less biased. It correctly answers questions about Chinese or international history and shows fewer biases in other areas. 🧑‍🏫 Another piece of good news is that Qwen aligns with the GDPR and other global privacy standards.

If it all sounds too good to be true, here’s a reason to temper your expectations: Qwen2.5-Max also has its downsides. It’s not an open-source LLM and doesn’t provide transparent reasoning behind its responses, as DeepSeek R1 or OpenAI’s o1 do. That’s not necessarily a deal-breaker, but something to consider depending on your needs for transparency and adaptability.

With so many options, choosing the right LLM for translation sounds like an impossible mission. It’s tough to decide which model will help you break language barriers and facilitate all those complex localization and internationalization processes.

🎙️ To what extent are AI tools useful in localization and translation? Listen to our conversation with Dialect’s CEO Jiří Šebek for an insider take on this topic.

LLM giants or the shiny new Chinese toys?

Let’s put the models to the test. We’ll analyze the translation quality delivered by the new models and compare it with that of pioneers such as ChatGPT, Claude, Mistral, and Gemini.

Why these models?

Well, ChatGPT, Claude, Mistral, and Gemini were all released before 2024, gained widespread recognition, and earned praise for their capabilities. They are also the most popular LLMs. The three Chinese LLMs, on the other hand, all emerged in 2025, promising an AI revolution. Our translation quality comparison will be a battle between the old and the new, the West and the East, the traditional and the innovative.

Who will win?

A note about DeepL

DeepL has been praised by language experts for the quality of its translations and has recently integrated an LLM into its newest models. However, we decided to exclude it from this analysis, since it doesn’t let us feed it a prompt or upload a TM or style guide. The resulting translations would be penalized on the selected indicators (consistency with the TM and style guide, and partially also contextual appropriateness). In addition, DeepL doesn’t support Croatian, which is one of the two languages chosen for the test.


⚔️ The great LLM translation battle

Battles without rules equal chaos. To make sure our translation quality comparison doesn’t spiral into confusion, we’ll use the following indicators:

1️⃣ Consistency with translation memories

2️⃣ Rendering variables in the translated strings

3️⃣ Consistency with the style guide (e.g., gender neutral, engaging)

4️⃣ Contextual appropriateness (e.g., does it fit the mobile app context? Is it right for the target users?)

5️⃣ Accuracy (how closely the translated text matches the meaning of the source text)

6️⃣ Fluency (whether the translated text reads naturally in the target language)

Each indicator will be scored on a scale from 1 to 5 for an English-Polish 🇵🇱 translation of mobile app strings. We’ll compare the scores for this language pair with those for the same strings translated from English into Croatian 🇭🇷, a low-resource language (Polish is considered a medium-resource language).
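To make the rubric concrete, here is a minimal sketch (in Python, not part of our actual workflow) of how the six 1-to-5 indicator scores roll up into a total; the indicator keys are our own shorthand:

```python
# Minimal sketch of the 1-5 rubric; names and example values are illustrative.
INDICATORS = [
    "tm_consistency",       # consistency with translation memories
    "variable_rendering",   # rendering variables in the translated strings
    "style_guide",          # consistency with the style guide
    "context",              # contextual appropriateness
    "accuracy",             # closeness to the source meaning
    "fluency",              # naturalness in the target language
]

def total_score(scores: dict) -> int:
    """Sum the six 1-5 indicator scores into a total (max 30)."""
    assert set(scores) == set(INDICATORS), "score every indicator exactly once"
    assert all(1 <= v <= 5 for v in scores.values()), "scores must be 1-5"
    return sum(scores.values())

# Example: a model scoring 4 on every indicator totals 24 out of 30.
example = {name: 4 for name in INDICATORS}
print(total_score(example))  # 24
```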

Each LLM will be asked to provide a context-specific, gender-neutral translation, based on a style guide and translation memory uploaded as a TMX or TXT file. The instructions will also contain the type of results we’re looking for, such as natural, engaging translation consistent with the uploaded documents.

In each test, we'll use the same set of 10 strings of varied lengths and complexity from an educational app:

  1. Tap here to start your learning adventure!
  2. You are competing against [User].
  3. Settings have been successfully updated.
  4. Swipe left to see more options or tap More to open a full menu.
  5. You have %s new notifications.
  6. Tip of the day: Switch your phone's language setting to the language you're learning to boost daily immersion and accelerate your learning process.
  7. You've earned a badge for your streak of %s days! Keep up the great work to unlock more rewards.
  8. Your pronunciation score has improved by s% since last month! Well done!
  9. Do you really want to delete your account? This action cannot be undone.
  10. Your subscription has expired. Renew now to continue enjoying all the great courses.

🧠 The translation memory contains two 100% matches and two 70% matches, both for Polish and Croatian.
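For readers curious how match percentages like these are computed, here is a toy illustration using Python's difflib; real CAT tools use more sophisticated, word- and token-based algorithms, so treat this as a sketch only:

```python
from difflib import SequenceMatcher

def fuzzy_match(source: str, tm_entry: str) -> int:
    """Rough similarity percentage between a new source string
    and a translation memory entry (character-level)."""
    ratio = SequenceMatcher(None, source.lower(), tm_entry.lower()).ratio()
    return round(ratio * 100)

# An identical segment is a 100% match; an edited one scores lower.
print(fuzzy_match("Settings have been successfully updated.",
                  "Settings have been successfully updated."))  # 100
print(fuzzy_match("You have %s new notifications.",
                  "You have %s unread notifications."))
```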

📓 The style guide briefly explains the type of the mobile app (language learning), describes the target audience (adults), and the desired tone (engaging and motivational). It also stresses that translation should be gender-neutral and use the correct terminology for app-related words.

💬 Finally, each model will be presented with the same prompt:

Act as a professional English-Polish (English-Croatian) translator specializing in the translation of mobile apps and educational content. I will provide you with 10 strings from a language learning app for adults. As a reference, I will upload a TMX file with a translation memory, and a DOC file with a style guide. Translate the strings based on the reference files, making sure the translation is engaging, consistent, contextually appropriate, and gender-neutral.

Disclaimer: The translation quality assessments and comparisons presented in this article are based on a limited set of tests performed by language professionals and should not be considered exhaustive or definitive. Due to constraints such as access to full model capabilities, proprietary algorithms, and the broad range of potential test conditions, our benchmarking efforts may not fully capture the capabilities of each model. We encourage readers to consult additional research and papers that offer more extensive benchmarking analyses. Bear in mind also that the models used in these tests are updated all the time, and our benchmarking criteria might be limited.

ChatGPT, version GPT-4o

ChatGPT struggled with some app-related terms in the Polish translations and failed to provide engaging, gender-neutral strings. For example, it offered alternative gendered verb forms (Zdobyłeś/aś), which were not only written incorrectly (the correct form is Zdobyłeś(-łaś)) but are also not typical wording in mobile apps. In such cases, the translation would usually be rendered with the verb in the present tense (Zdobywasz), which is gender-neutral, instead of the past tense, which requires a different form for each gender.

Some strings were translated quite literally (codzienne zanurzenie for daily immersion); the model didn’t respect the translation memory for the 100% matches and wasn’t consistent with the 70% matches, either. The strings with variables were quite problematic, too: “You are competing against [User]” and “You have %s new notifications” were rendered incorrectly. In the first sentence, the variable was translated, and the second reads well for only one plural form.
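The recurring plural problem stems from Polish having three cardinal plural categories where English has two. Here is a simplified sketch of the CLDR-style selection rules for integers (the Polish example strings are our own illustration, not the test output):

```python
def polish_plural_category(n: int) -> str:
    """Simplified CLDR plural rules for Polish cardinal integers."""
    if n == 1:
        return "one"                      # 1 powiadomienie
    if n % 10 in (2, 3, 4) and n % 100 not in (12, 13, 14):
        return "few"                      # 2-4, 22-24, ... powiadomienia
    return "many"                         # 5-21, 25-31, ... powiadomień

def notifications_pl(n: int) -> str:
    """Render 'You have %s new notifications.' with the right Polish form."""
    forms = {
        "one": "Masz %s nowe powiadomienie.",
        "few": "Masz %s nowe powiadomienia.",
        "many": "Masz %s nowych powiadomień.",
    }
    return forms[polish_plural_category(n)] % n

print(notifications_pl(5))  # Masz 5 nowych powiadomień.
```

A single translated string can therefore be correct for one plural form and wrong for the others, which is exactly what happened here.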

In Croatian, the model also translated the variable (“User”) but did well with preserving the segments from the translation memory.

All in all, the translation quality wasn’t bad; however, it required thorough editing, especially in terms of consistency and fluency.

[Image: Polish translation in ChatGPT.]
[Image: Croatian translation in ChatGPT.]

Claude, version Sonnet 3.5

Since Claude doesn’t recognize TMX files, the TM for this model was uploaded as a TXT file, and the prompt was adapted accordingly. In Polish, this model translated “Tap” with different, generally more correct terminology than ChatGPT did. The translation flowed more naturally and was more engaging than ChatGPT’s, although the strings were not consistent with the translation memory. Gender neutrality wasn’t the model’s strong point, either.
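Workarounds like this TXT re-upload can be scripted. Below is a hypothetical minimal converter that flattens a simple bilingual TMX into tab-separated source/target lines; real TMX files may contain inline tags and metadata that would need extra handling:

```python
import xml.etree.ElementTree as ET

def tmx_to_txt(tmx_path: str, txt_path: str) -> None:
    """Flatten TMX translation units into 'source<TAB>target' lines."""
    tree = ET.parse(tmx_path)
    lines = []
    # Each <tu> holds one translation unit; its <seg> children carry the text.
    for tu in tree.getroot().iter("tu"):
        segs = [seg.text or "" for seg in tu.iter("seg")]
        if len(segs) >= 2:
            lines.append(f"{segs[0]}\t{segs[1]}")
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))
```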

Both in Croatian and Polish, the first variable (“User”) was rendered flawlessly, but the strings with variables for multiple plurals turned out to be problematic again.

In both languages, Claude added notes on its translation approach at the end of the output, which is a handy way to sum up the steps performed.

[Image: Polish translation in Claude.]
[Image: Croatian translation in Claude.]

Mistral

This time, both the TM and the style guide were uploaded as TXT files, as the model supports neither TMX nor DOC uploads. The prompts were tweaked accordingly.

Overall, the Polish translation was more natural and engaging than those of the previous two models. However, Mistral struggled with variables and didn’t stick to the translation memory. Instead of gender-neutral strings, the model chose male verb forms, failing to adhere to the style guide. Some phrases were translated too literally (e.g., in strings 3, 6, and 9).

Surprisingly, the model failed to preserve the variable “User” in Polish but retained it correctly in the Croatian translation. Strings with variables for multiple plural forms were again a weak point.

[Image: Polish translation in Mistral.]
[Image: Croatian translation in Mistral.]

Gemini, version 2.0 Flash

Gemini supports only image upload, so both the TM and style guide were fed to the model by copying and pasting the content directly into the chat window.

In Polish, the model not only used the strings from the TM but also commented on the instances where it copied a string from the translation memory (see strings 3 and 8). The translations had a decent flow with very few literal phrases, and the terminology was correct as well.

Regarding gender neutrality, the model tried its best to stay impartial, but the passive voice in string 7 came out a bit too awkward. A better choice would be the present tense, in which the verb forms are identical for both genders.

The TM entries were used correctly in Croatian, although the model could do better in terms of fluency.

[Image: Polish translation in Gemini.]
[Image: Croatian translation in Gemini.]

DeepSeek, version 3

In DeepSeek, the TM was again uploaded as a TXT, as the model doesn’t read TMX files. The style guide remained in the DOC format.

Overall, the Polish translation sounded more natural and engaging than those of all the LLMs mentioned above. There were fewer literal translations, the terminology was correct, and the model applied the 100% matches from the TM. Although DeepSeek correctly rendered the variable in string 2, it struggled with string 5 (the translation wasn’t correct for all plural forms) and failed to follow the gender-neutrality rule.

The model provided handy comments explaining where the TM was used and displayed an encouraging summary at the end of the process.

In Croatian, the model was average in terms of accuracy and fluency.

[Image: Polish translation in DeepSeek.]
[Image: Croatian translation in DeepSeek.]

Qwen2.5-Max

This model allows only one file upload per conversation, so the content of the translation memory and style guide was pasted into the chat window.

In Polish, there were no surprises when it came to gender neutrality: it was all wrong again (e.g., string 7). Variables were problematic, too (e.g., string 2), and some phrases sounded too literal. However, the model used the TM matches, followed the style guide quite well, and used the correct terminology. Overall, the translation quality was good.

Variables were the main issue in Croatian (string 2), and the translation didn’t flow naturally.

[Image: Polish translation in Qwen.]
[Image: Croatian translation in Qwen.]

Doubao

There are two ways to use this model: struggle with the Chinese user interface or head straight to Hugging Face, a global AI community. If you choose the first option, don’t worry: your prompt can remain in English, and the tool will reply in English, too. For the rest of the interface, get ready to rely on your intuition 😊 Alternatively, turn to Google Translate or… an LLM of your choice.

The model doesn’t support TMX files, so the TM was again uploaded as a TXT.

Overall, the quality of the Polish translation wasn’t spectacular: issues with variables (string 2), a lack of gender neutrality, and a lot of literal translations. Plus, the matches from the TM weren’t applied. On top of that, there were three grammar mistakes (strings 4 and 6), which didn’t occur with the other models tested above, and the terminology wasn’t correct (e.g., in string 10).

The Croatian results were disappointing, too, although this time the “User” variable in string 2 remained untouched.

[Image: Polish translation in Doubao.]
[Image: Croatian translation in Doubao.]

🏅 What’s the LLM leaderboard?

This is how each LLM scores for each indicator and language combination:

| LLM | Consistency with TM | Variable rendering | Style guide consistency | Contextual appropriateness | Accuracy | Fluency | Total score |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DeepSeek | 5 | 3 | 4 | 5 | 5 | 4 | 22 |
| Gemini | 5 | 3 | 4 | 5 | 5 | 4 | 22 |
| Qwen | 5 | 2 | 4 | 5 | 5 | 4 | 21 |
| ChatGPT | 3 | 3 | 4 | 5 | 5 | 3 | 20 |
| Claude | 3 | 3 | 4 | 5 | 5 | 4 | 20 |
| Mistral | 3 | 2 | 4 | 5 | 5 | 3 | 19 |
| Doubao | 1 | 2 | 3 | 2 | 4 | 2 | 12 |

Table 1. Test results in Polish.

| LLM | Consistency with TM | Variable rendering | Style guide consistency | Contextual appropriateness | Accuracy | Fluency | Total score |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Claude | 5 | 5 | 5 | 5 | 3 | 3 | 26 |
| Gemini | 4 | 5 | 5 | 5 | 4 | 3 | 26 |
| DeepSeek | 5 | 5 | 5 | 5 | 3 | 3 | 26 |
| ChatGPT | 4 | 4 | 5 | 5 | 3 | 3 | 24 |
| Mistral | 4 | 5 | 5 | 5 | 3 | 2 | 24 |
| Qwen | 4 | 4 | 5 | 5 | 3 | 3 | 24 |
| Doubao | 2 | 5 | 5 | 5 | 2 | 1 | 20 |

Table 2. Test results in Croatian.

There’s no clear winner in our tests. DeepSeek, Gemini, and Qwen lead the ranking for the Polish translation. In Croatian, the podium belongs to Claude, Gemini, and DeepSeek, all with the same number of points.

Our great translation battle proved one thing: the East matched the West, and the New shook hands with the Old. The Chinese AI models delivered quality comparable to Gemini, Claude, and ChatGPT. In both languages, Doubao ended at the bottom, scoring lower than Qwen and DeepSeek, but this shouldn’t come as a surprise: the LLM from ByteDance was trained with a main focus on Chinese language processing.

In general, the LLMs in our test delivered accurate, though not always fluent, translations. Variables proved problematic to some extent for nearly every LLM, while the TM and style guide were followed by most models, with some deviations in the Polish text. Context was preserved quite well in both languages, although the terminology wasn’t always top-notch. Preserving gender neutrality (as required by the style guide) was the major issue across all LLMs.
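Checks for surviving variables, at least, can be automated before a human review pass. Here is a minimal sketch that verifies placeholders such as %s or [User] appear unchanged in the translation; the placeholder pattern is illustrative, not exhaustive:

```python
import re

# Matches printf-style %s/%d placeholders and bracketed names like [User].
PLACEHOLDER = re.compile(r"%s|%d|\[[A-Za-z]+\]")

def placeholders_preserved(source: str, translation: str) -> bool:
    """True if the translation keeps exactly the source's placeholders."""
    return (sorted(PLACEHOLDER.findall(source))
            == sorted(PLACEHOLDER.findall(translation)))

print(placeholders_preserved("You are competing against [User].",
                             "Konkurujesz z [User]."))        # True
print(placeholders_preserved("You are competing against [User].",
                             "Konkurujesz z [Użytkownik]."))  # False: variable translated
```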

The general quality of the translations in our tests was good, although not always fluent. Variables and gender neutrality were the most problematic elements. Context was preserved quite well for both languages, but the terminology wasn't always accurate.

Translation quality depends, of course, on many factors: the source text type, language combination, prompt, and reference files. You might run a similar test with a legal translation from Baha to Swahili and receive completely different results. However, in our scenario, DeepSeek matches or outperforms most LLM giants.

Finally, although LLMs usually struggle with low-resource languages, in our test scenario the Croatian translations of mobile app strings were quite decent. This surprisingly good result could be attributed to the ample context (translation memory, style guide), language pair, text type, or all these factors.

🤔 Where’s the golden middle?

What do these results tell us?

It’s simple.

Don’t rely solely on LLMs, no matter the scenario. Mixed approaches work best in most situations, so if you decide to employ an LLM for your translation and localization tasks, remember to add the human touch. That’s how you ensure your content is fluent, engaging, and tailored to your target audience.

If you’re ready for an effective mix of human and machine translation, take a look at our MT and AI features, and turn to the Continuous Localization Team to cut your time to the global market and focus on what matters most: your product.