Why Most Large Language Models Struggle With African Languages

Ask any of the most capable AI systems available today to translate a sentence into Luganda, Dholuo, Runyankole, or Kikuyu, and the results range from mediocre to confidently wrong. Ask them to respond fluently in Luganda and most will switch to English or produce something that no native speaker would recognise as natural.

This is not because the engineers who built these models were careless or indifferent. It is because several structural problems (in data, in architecture, and in how progress gets measured) compound each other in ways that specifically disadvantage African languages. Understanding those problems is the first step toward fixing them.

The data problem

Modern large language models learn from text. Enormous amounts of it: hundreds of billions of words drawn from the internet, books, code repositories, and digitised documents. The more high-quality text a language has in that training corpus, the better the model learns it.

English has over 6 trillion words of text available on the internet. Mandarin has over 1 trillion. French, Spanish, and German each have hundreds of billions. Luganda has a few million. Runyankole has less. Many Ugandan languages have almost nothing in digitised form at all.

This is not simply because fewer people speak these languages. Luganda has over 7 million speakers, more than the population of many European countries whose languages are well represented in AI training data. It is because the digital record of these languages is thin. Much of the knowledge, discourse, and civic life conducted in Luganda happens in speech, in communities, and in documents that were never digitised.

The compounding effect

Low data representation is not just a mild disadvantage: it compounds. Models trained on little data for a language learn it poorly. Poor performance means fewer people trust or use AI in that language. Fewer users means fewer examples to learn from. The gap between well-represented and underrepresented languages does not stay constant. It actively widens over time without deliberate intervention.

The tokenisation problem

Even for the text that does exist, there is a second, more technical problem that receives much less attention: tokenisation.

Large language models do not process text as words. They process it as tokens, which are fragments of text that the model treats as its basic units. In many models these fragments are determined by an algorithm called Byte Pair Encoding (BPE), which analyses the training corpus and learns which character sequences appear frequently enough to deserve their own token.

Because the training corpus is dominated by English, the BPE algorithm learns to tokenise English efficiently. English words typically become one or two tokens each. But when the same tokeniser encounters a morphologically complex African language, the results are very different.

Take the Luganda verb “okubala” (meaning “to count”). A tokeniser trained on English data has no merge rules that build up this word, so it fragments it into several subword pieces that do not correspond to any meaningful unit in Luganda grammar. The same happens across Bantu languages, where a single verb form can encode subject agreement, tense, object agreement, aspect, and negation in one surface form. English spreads that information over three or four separate words. The tokeniser then cuts the single form at boundaries set by English character statistics, which have nothing to do with Luganda morphology.
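To make the mechanism concrete, here is a minimal sketch of BPE training in Python. It is a toy, not the tokeniser of any real model: the four-word English corpus, its frequencies, and the names learn_bpe, merge_pair, and tokenize are all assumptions made for illustration. A real tokeniser learns tens of thousands of merges from a far larger corpus, so it would usually split “okubala” into a few multi-character pieces rather than single characters, but the cause is the same.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(word_freqs, num_merges):
    """Learn an ordered list of merge rules from a {word: frequency} dict."""
    # Every word starts as a sequence of single characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        vocab = {merge_pair(s, best): f for s, f in vocab.items()}
    return merges

def tokenize(word, merges):
    """Apply the learned merges, in training order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Hypothetical English-only training corpus (invented frequencies).
ENGLISH_CORPUS = {"the": 100, "count": 50, "counting": 30, "counted": 20}
MERGES = learn_bpe(ENGLISH_CORPUS, num_merges=8)

for word in ["count", "counted", "okubala"]:
    pieces = tokenize(word, MERGES)
    print(f"{word!r}: {len(pieces)} tokens {pieces}")
```

Every merge rule in this sketch was earned by an English word, so “count” collapses into a single token while “okubala” never triggers a merge and stays at the character level. Nothing in the algorithm is biased against Luganda; the corpus is.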

Why fragmentation matters

When a word is fragmented into tokens that do not correspond to meaningful units, the model has to learn the language's grammar across broken pieces. It is like trying to learn that 'un-' means 'not' when the tokeniser consistently splits the word at a different point (say, 'unf-' and 'air') so the prefix never appears as a consistent unit. The model can partially compensate with enough training data, but without enough Luganda text to learn these patterns, the fragmentation compounds the data problem.
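The 'un-' analogy can be reproduced in a few lines. The sketch below uses greedy longest-match segmentation, which is simpler than BPE and is swapped in here only to keep the example short. The vocabulary is hypothetical and was chosen so that one entry, 'unf', swallows the prefix.

```python
def greedy_tokenize(word, vocab):
    """Greedy longest-match segmentation; single characters are the fallback."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest candidate first and shrink until something matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

# Hypothetical subword vocabulary, invented for this illustration.
VOCAB = {"un", "unf", "air", "fair", "happy", "kind", "lock"}
WORDS = ["unfair", "unhappy", "unkind", "unlock"]

segmented = {w: greedy_tokenize(w, VOCAB) for w in WORDS}
for word, pieces in segmented.items():
    print(word, pieces)

# How often does the prefix surface as its own token?
prefix_hits = sum(1 for pieces in segmented.values() if pieces[0] == "un")
print(f"'un' is a separate token in {prefix_hits} of {len(WORDS)} words")
```

In three of the four words the model sees 'un' as a unit. In 'unfair' the prefix is hidden inside 'unf', so that word contributes nothing to learning what 'un' means. With abundant data the loss is tolerable. With a few million words it is not.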

The measurement problem

A third structural issue is how progress in AI translation gets measured and celebrated. The dominant metric for translation quality is the BLEU score, usually reported on datasets that primarily cover high-resource languages. The dominant leaderboards compare models on English, French, German, Chinese, and Spanish competence. African languages, when they appear, are often represented by Swahili alone, a language that, while important, is better resourced than most others in the region.

When a major AI laboratory releases a new model and announces that it outperforms competitors on translation benchmarks, those benchmarks rarely include Luganda, Runyankole, Acholi, or Ateso. The absence is invisible in the announcement. The model is called multilingual, capable, state-of-the-art. For the languages measured, it may be. For the languages not measured, nothing was claimed and nothing was checked.

This creates a perverse incentive structure. Improving performance on African languages does not improve a model's position on the leaderboards that drive commercial and research attention, so the work does not attract resources or recognition in proportion to its importance to the communities who would benefit.

The assumption problem

There is a fourth issue that is harder to quantify but equally important: the assumptions embedded in how models are designed and how their outputs are evaluated.

Consider the task of evaluating whether a translation is correct. Human evaluation, which means asking native speakers to rate quality, is the most reliable method. But it requires finding and compensating fluent speakers of the target language, which is expensive and logistically complex for languages where the speaker community is geographically dispersed and not represented in the networks of contractors used by major AI companies.

As a result, African language evaluation often defaults to automated metrics like BLEU, or to human evaluators who are not native speakers, or simply does not happen at all. A model that scores reasonably on automated metrics but produces translations that no Luganda speaker would actually use passes evaluation and gets deployed. The gap between benchmark performance and real-world usefulness goes undetected.

Similarly, models trained primarily on written, formal text in African languages (the small amount that exists) learn a register of those languages that may not reflect how they are actually spoken. A civic information system that speaks Luganda the way a colonial-era beginner would serves no one.

What it takes to change this

The good news is that none of these problems are unsolvable. The bad news is that solving them requires deliberate effort that the current incentive structure of the AI industry does not naturally produce.

More data is needed: not just more text scraped from the internet, but carefully curated parallel data with native speaker verification, speech datasets that capture how Luganda is actually spoken, and civic domain corpora that reflect the specific tasks these models will be used for. This kind of data collection is slow, expensive, and unglamorous. It does not produce research papers, but it produces the foundation that makes research possible.

Better evaluation is needed: human evaluation by native speakers, measured against tasks that actually matter in civic contexts rather than against Wikipedia sentences. BLEU and chrF++ are proxies. They are useful proxies, but they are not the same as asking a community health worker in Kampala whether the translation actually makes sense.
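To see what kind of proxy these metrics are, it helps to look at what they count. The sketch below is a simplified illustration, not the official BLEU or chrF++ definition: word_unigram_precision stands in for word-level matching and char_ngram_fscore for character-level matching, and both function names and the max_n and beta defaults are assumptions. The hypothesis "okubla" is an artificially corrupted copy of “okubala”, not a real word.

```python
from collections import Counter

def ngrams(seq, n):
    """Multiset of the n-grams in a sequence (a list of words or a string)."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

def overlap(hyp_counts, ref_counts):
    """Number of n-grams shared by hypothesis and reference (clipped counts)."""
    return sum((hyp_counts & ref_counts).values())

def word_unigram_precision(hyp, ref):
    """Share of hypothesis words that also appear in the reference."""
    h, r = ngrams(hyp.split(), 1), ngrams(ref.split(), 1)
    total = sum(h.values())
    return overlap(h, r) / total if total else 0.0

def char_ngram_fscore(hyp, ref, max_n=3, beta=2.0):
    """F-score over character n-grams, averaged over n = 1..max_n."""
    hyp_chars, ref_chars = hyp.replace(" ", ""), ref.replace(" ", "")
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp_chars, n), ngrams(ref_chars, n)
        if not h or not r:
            continue  # string too short to contain n-grams of this order
        matched = overlap(h, r)
        precisions.append(matched / sum(h.values()))
        recalls.append(matched / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    rec = sum(recalls) / len(recalls)
    if p + rec == 0:
        return 0.0
    # beta > 1 weights recall more heavily than precision.
    return (1 + beta**2) * p * rec / (beta**2 * p + rec)

reference = "okubala"
hypothesis = "okubla"  # artificial one-character corruption, for illustration only

print("word-level:", word_unigram_precision(hypothesis, reference))
print("char-level:", round(char_ngram_fscore(hypothesis, reference), 3))
```

The word-level score treats the near miss as a total failure, while the character-level score gives partial credit. That is why character-based metrics tend to suit languages that pack a whole clause into one word. Neither number says whether a reader would understand the output, which is the question only a native speaker can answer.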

Local teams need to lead this work because the contextual knowledge required (about language registers, about community communication patterns, about what civic language sounds like in practice) only comes from being part of those communities.

Amplified Access is building high-quality SLMs using a community-based approach, verifying translations and conversations in the communities that speak these languages.