Research · Sorensen.ai · Open Source · 2026

Study: even when you train AI in Catalan, it still thinks in English — the coca case

Even when you train a language model in Catalan, the concepts still come in English. 10 of 11 LLMs tested — including Catalan-native models — place coca next to cocaine and Coca-Cola when no context is given. The taxonomy behind it — and how to rescue meaning.

What does the study show?

«Coca Is Not Cocaine: Three Lexical-Cultural Collision Modes in Open-Weight LLMs, Probed in Catalan» is an empirical study published on GitHub by Xavier Vinaixa Roselló (Sorensen.ai) testing whether large language models understand Catalan cultural concepts or confuse them with their English translations.

The central question is deliberately mundane: when you say coca in Catalan — a traditional pastry, a flatbread of Catalan cuisine — does the model think of the dessert, or does it think of cocaine and Coca-Cola? Empirical answer: 10 of the 11 models tested, with no context, place coca next to noise anchors (drugs, brands, beverages).

This is not anecdotal: it is a structural property of latent space geometry. The collision doesn't depend on training data alone — it appears even in models explicitly trained to speak Catalan (Salamandra, ALIA) from the Barcelona Supercomputing Center. The good news: all models recover with a Catalan prompt frame.

Methodology

The study analyses K-Nearest Neighbors (KNN) relationships in each model's embedding space, measuring how many anchors from the «noise cluster» (cocaine, Coca-Cola, drugs, beverages, brands) appear among the 10 nearest neighbours of the word coca. A preregistered threshold of ≥ 2 noise anchors outside the English frame identifies a structural collision.
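The counting step described above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's code: the function names, the cosine-similarity metric, the toy vectors and the exact membership of the noise set are all assumptions made for the example. Only the neighbourhood size of 10 and the threshold of 2 noise anchors come from the text.

```python
import numpy as np

# Illustrative noise cluster; the study's actual anchor list may differ.
NOISE_ANCHORS = {"cocaine", "Coca-Cola", "drug", "beverage", "brand"}
K = 10         # neighbourhood size stated in the study
THRESHOLD = 2  # at least 2 noise anchors marks a structural collision


def top_k_neighbours(query_vec, vocab, vectors, k=K):
    """Return the k vocabulary items most cosine-similar to query_vec."""
    q = query_vec / np.linalg.norm(query_vec)
    m = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = m @ q
    order = np.argsort(-sims)[:k]
    return [vocab[i] for i in order]


def noise_count(neighbours, noise=NOISE_ANCHORS):
    """Count how many neighbours belong to the noise cluster."""
    return sum(1 for w in neighbours if w in noise)


def is_collision(neighbours, threshold=THRESHOLD):
    """Apply the threshold to one neighbour list."""
    return noise_count(neighbours) >= threshold


# Toy 2-D vectors, hypothetical and chosen only to make the example run.
vocab = ["cocaine", "Coca-Cola", "pastry", "flatbread", "sugar"]
vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.5, 0.5]])
coca_vec = np.array([1.0, 0.2])

neighbours = top_k_neighbours(coca_vec, vocab, vectors, k=3)
print(neighbours, noise_count(neighbours), is_collision(neighbours))
```

In a real run the vectors would come from each model's embedding space and the check would be repeated for every prompt frame; the study applies its threshold outside the English frame.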

Each model is tested with four prompt frames to isolate the effect of context: bare (no context), anglo_en (English context), catalan_frame (Catalan cultural frame), and neutral_ca (neutral Catalan text). This measures not only the initial position of the word in latent space, but also the rescue capacity of the model as a function of linguistic framing.
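As a rough sketch of how the four frames could be organised in code: the frame names below are the study's, but the template sentences and the `build_prompts` helper are hypothetical placeholders. The actual wording used in the experiments is documented in the repository, not here.

```python
# Frame names follow the study; the template texts are invented for illustration.
FRAMES = {
    "bare": "{word}",
    "anglo_en": "In English-speaking countries, people often talk about {word}.",
    "catalan_frame": "A la revetlla de Sant Joan, la família comparteix una {word}.",
    "neutral_ca": "Aquesta paraula és {word}.",
}


def build_prompts(word, frames=FRAMES):
    """Return one prompt per frame for the given probe word."""
    return {name: template.format(word=word) for name, template in frames.items()}


prompts = build_prompts("coca")
for name, text in prompts.items():
    print(f"{name}: {text}")
```

Keeping the frames in a single mapping makes it easy to run the same nearest-neighbour probe once per frame and compare the results side by side.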

All code (Python), CSVs with raw data, reproduction scripts and the final report are public on the GitHub repository. The detailed methodology is documented in metodologia.pdf and docs/methodology.md.

11 models analysed

The study covers three families of open-weight models, stratified by size and purpose:

01

Generalist — small

  • Gemma 2 2B (Google)
  • Gemma 4 E2B (Google)
  • Qwen2 1.5B (Alibaba)
  • Qwen 3.5 4B Base (Alibaba)

02

Generalist — large

  • Gemma 4 26B-A4B (Google)
  • Mistral Small 24B Base (Mistral)
  • Qwen 3.6 35B-A3B (Alibaba)
  • DeepSeek 67B Base (DeepSeek)

03

Catalan control — BSC

  • Salamandra 2B (Barcelona Supercomputing Center)
  • Salamandra 7B (Barcelona Supercomputing Center)
  • ALIA 40B (Barcelona Supercomputing Center)

Key results

Five findings that disrupt assumptions about scale and Catalan training:

  • Structural collision

    10 of 11 models, with no context, place coca next to cocaine or Coca-Cola. It's a pipeline phenomenon, not just data bias.

  • Salamandra 7B paradox

    The Catalan-specialised model performs worse under English framing than generalist models (9/10 noise anchors). Catalan training does not guarantee immunity.

  • DeepSeek 67B diverges

    It is the only model that avoids the drug cluster, but it drifts straight into fairy-tale semantics (princesses, enchanted forest). This is the first empirical evidence that the F1 and F2 collision modes are separable.

  • Scale doesn't save you

    67B parameters don't make a model more prompt-sensitive. ALIA-40B shows stronger frame-rescue than larger competitors.

  • Rescue works

    All models recover the correct sense with strong Catalan cultural framing. The most effective intervention is not retraining — it's providing Catalan context at inference time.
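One simple way to quantify the rescue effect is to compare noise-anchor counts between the bare frame and the Catalan cultural frame. The sketch below assumes that reading; the helper names are hypothetical, and the counts are placeholder values, not results from the study.

```python
def rescue_delta(counts, baseline="bare", rescue="catalan_frame"):
    """Drop in noise-anchor count from the baseline frame to the rescue frame."""
    return counts[baseline] - counts[rescue]


def is_rescued(counts, rescue="catalan_frame", threshold=2):
    """True when the rescue frame falls below the collision threshold."""
    return counts[rescue] < threshold


# Placeholder counts for an imaginary model (NOT data from the study).
example_counts = {"bare": 4, "anglo_en": 6, "catalan_frame": 0, "neutral_ca": 1}

print(rescue_delta(example_counts), is_rescued(example_counts))
```

Treating "rescued" as "below the collision threshold under the Catalan frame" is an assumption of this sketch; the study's own criterion is described in its methodology documents.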

Taxonomy: three collision modes

The study proposes a taxonomy of three lexical-cultural failure modes that were previously lumped together:

F1 — Lexical cluster collision

Cognate drag: the native word is dragged into the cluster of a formally similar word in another dominant language. Here: coca → cocaine / Coca-Cola. Mostly observed under bare and anglo_en framing.

F2 — Anglo narrative drift

Cultural frame substitution: the model relocates the concept into an alien imaginary — fairies, princesses, enchanted forest (DeepSeek 67B) — because the native culture is not dense enough in pretraining. The word stops being a drug, but doesn't recover its meaning either.

F3 — Polysemic dominance

The most common sense erases the others: among all valid meanings of the word in the native language, the model imposes the most basic or everyday one and hides the specialised, dialectal or ritualistic senses. The semantic diversity of Catalan compresses into a single meaning.
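To make the three modes concrete, here is one possible way to label a neighbour list. It is an illustrative operationalisation, not the study's classifier: the anchor sets, the `sense_of` mapping, the `min_hits` and `dominance` parameters and the function name are all assumptions.

```python
from collections import Counter

# Hypothetical anchor sets for the first two modes.
F1_ANCHORS = {"cocaine", "Coca-Cola"}                    # lexical cluster collision
F2_ANCHORS = {"princess", "fairy", "enchanted forest"}   # Anglo narrative drift


def label_modes(neighbours, sense_of=None, min_hits=2, dominance=0.8):
    """Return the set of collision modes suggested by a neighbour list.

    sense_of maps a neighbour to the native sense it evokes; F3 is flagged
    when one sense takes at least `dominance` of the sense-tagged neighbours
    even though the mapping knows more than one sense.
    """
    modes = set()
    if sum(w in F1_ANCHORS for w in neighbours) >= min_hits:
        modes.add("F1")
    if sum(w in F2_ANCHORS for w in neighbours) >= min_hits:
        modes.add("F2")
    if sense_of:
        senses = Counter(sense_of[w] for w in neighbours if w in sense_of)
        total = sum(senses.values())
        several_senses_known = len(set(sense_of.values())) > 1
        if total and several_senses_known and max(senses.values()) / total >= dominance:
            modes.add("F3")
    return modes


# Hypothetical sense inventory: two everyday senses, one ritual sense.
senses = {"pastís": "pastry", "forn": "pastry", "revetlla": "ritual"}

print(label_modes(["cocaine", "Coca-Cola", "sugar"]))
print(label_modes(["princess", "fairy", "castle"]))
print(label_modes(["pastís", "forn", "farina"], sense_of=senses))
```

Returning a set rather than a single label reflects the point made by the DeepSeek 67B result: the modes are separable, so a model can show one without the other.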

Why it matters

This study is an empirical exhibit in service of a broader thesis: that the Anglo-centric bias of LLMs isn't only a problem of symbolic inclusion, but a measurable distortion of the semantic reality of minoritised languages. If you ask an English-language chatbot about Catalan cuisine, there's a non-trivial chance it reads the question through a ghost layer of Anglo-American stereotypes.

It's also a practical guide for journalists, experience designers, and prompt engineers working with non-dominant languages. The engineering takeaway is clear: put cultural context into your inference. Don't outsource it to model size. A Salamandra 7B with a well-built Catalan frame rescues coca; a far larger generalist model with no frame doesn't.

It connects directly with Xavi Vinaixa's broader work on data sovereignty, LLM hallucination, and the idea that digital resistance starts by reclaiming control over the raw material of culture — starting with language.

Open source and reproducible

All material (Python code, CSV datasets, reproduction scripts, results tables and full documentation) is open source and lives in the GitHub repository. Code is released under the MIT license and data under CC-BY 4.0. The /docs folder contains: methodology.md, findings.md, limitations.md, related-work.md and reproduce.md.

To cite the study: Vinaixa Roselló, X. (2026). Coca Is Not Cocaine: Three Lexical-Cultural Collision Modes in Open-Weight LLMs, Probed in Catalan. GitHub. ORCID: 0009-0005-2769-9215.

Key concepts

LLM · Embeddings · KNN · Catalan · BSC Salamandra · BSC ALIA · Gemma · Qwen · Mistral · DeepSeek · Anglo-centric bias · Linguistic sovereignty · Open Source

References

  1. Vinaixa Roselló, X. (2026). Coca Is Not Cocaine: Three Lexical-Cultural Collision Modes in Open-Weight LLMs, Probed in Catalan. — GitHub https://github.com/xaviviro/coca-is-not-cocaine
  2. Barcelona Supercomputing Center. Salamandra-7B model card. — Hugging Face https://huggingface.co/BSC-LT/salamandra-7b
  3. Barcelona Supercomputing Center. ALIA-40B model card. — Hugging Face https://huggingface.co/BSC-LT/ALIA-40b
  4. Vinaixa Roselló, X. (2025). Hallucination patterns in open-weight LLMs. — Zenodo (DOI 18976059) https://zenodo.org/records/18976059
  5. Vinaixa Roselló, X. (2025). Fine-tuning of literary style in instruction-tuned LLMs. — Zenodo (DOI 18975628) https://zenodo.org/records/18975628
  6. Cassany, R., Vinaixa, X. & Mauri, M. (2025). Identidad sonora personalizada mediante IA para personas sordas signantes. — Libro de Actas XVII CILCS, Madrid (ISBN 979-13-87819-03-3) https://congresolatina.net/wp-content/uploads/2025/12/Libro-de-actas-XVII-CILCS-2025.pdf
  7. Anthropic. Model Context Protocol — open specification. — modelcontextprotocol.io https://modelcontextprotocol.io/