Tokenization

The process of breaking text into smaller parts, such as words or variables, so that machines can interpret and translate it more easily.

Tokenization refers to the process of breaking down text into smaller units, known as tokens.

In software localization and Natural Language Processing (NLP), tokenization plays a key role in analyzing, processing, and translating content accurately.

In localization workflows, tokenization helps identify which parts of the text should be translated, preserved (e.g., code variables, names, placeholders), or treated differently.

Proper tokenization supports features like string segmentation, translation memory matching, and automatic formatting. Inaccurate token boundaries can cause translation errors or layout issues in the final localized product, whereas accurate tokenization keeps code and formatting intact when localizing dynamic UI content.
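
As a minimal sketch of the idea (the `tokenize` function and sample string below are illustrative, not any particular library's API), a regex-based tokenizer splits a UI string into word and punctuation tokens; production systems would typically rely on locale-aware segmentation such as ICU:

```python
import re

def tokenize(text: str) -> list[str]:
    """Split text into word and punctuation tokens with a simple regex.

    A naive illustration only; real localization pipelines use
    locale-aware segmentation rules rather than a single pattern.
    """
    # \w+ captures word-like runs; [^\w\s] captures single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Save changes? Click 'OK' to continue."))
# ['Save', 'changes', '?', 'Click', "'", 'OK', "'", 'to', 'continue', '.']
```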

➡️ How do tokenization rules work? #️⃣

  • They vary widely between languages. For example, English uses spaces to separate words, but in languages like Chinese or Thai, tokens are not separated by spaces (see the sketch after this list).
  • These tokens can be words, phrases, symbols, or other meaningful elements depending on the language and application.
  • Localization systems must adapt tokenization logic to suit the structure of each target language.
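
To make the difference concrete, here is a short illustrative comparison (the sample strings are made up): whitespace splitting produces word tokens for English but returns Chinese text as a single unsegmented token.

```python
english = "Welcome back to the app"
chinese = "欢迎回到应用程序"  # roughly "Welcome back to the app"

# Whitespace splitting yields word tokens for English...
print(english.split())  # ['Welcome', 'back', 'to', 'the', 'app']

# ...but Chinese has no spaces between words, so the whole string
# comes back as one token. Segmenting it correctly requires
# dictionary- or model-based tokenizers (e.g., the jieba package).
print(chinese.split())  # ['欢迎回到应用程序']
```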

In software development, tokenization can also refer to placeholder management, where variables (like {username} or %1) are inserted into translatable strings and need protection during translation.
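
Here is a minimal sketch of that placeholder protection, assuming `{named}` and `%1`-style placeholders (the regex, function names, and marker format are hypothetical, not a specific tool's implementation): each placeholder is masked with an opaque marker before translation and restored afterwards.

```python
import re

# Matches {named} placeholders and positional ones like %1 or %s.
PLACEHOLDER = re.compile(r"\{\w+\}|%\d+|%[sd]")

def protect(text: str) -> tuple[str, list[str]]:
    """Replace each placeholder with an opaque marker translators won't touch."""
    found: list[str] = []

    def mask(match: re.Match) -> str:
        found.append(match.group(0))
        return f"⟦{len(found) - 1}⟧"  # e.g. ⟦0⟧, ⟦1⟧

    return PLACEHOLDER.sub(mask, text), found

def restore(text: str, found: list[str]) -> str:
    """Put the original placeholders back after translation."""
    for i, original in enumerate(found):
        text = text.replace(f"⟦{i}⟧", original)
    return text

masked, found = protect("Hello {username}, you have %1 new messages.")
print(masked)  # Hello ⟦0⟧, you have ⟦1⟧ new messages.

# Simulated translation that leaves the opaque markers alone:
translated = "Hola ⟦0⟧, tienes ⟦1⟧ mensajes nuevos."
print(restore(translated, found))  # Hola {username}, tienes %1 mensajes nuevos.
```

Masking rather than merely flagging placeholders has a practical benefit: the translator (human or machine) never sees syntax that could be accidentally reordered, translated, or broken.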

Localazy handles tokenized content safely, preserving placeholders during translation and helping translators focus only on what needs to be translated.

Curious about software localization beyond the terminology?

⚡ Manage your translations with Localazy! 🌍