Research · Sorensen.ai · Open Source · 2026
Study: even when you train AI in Catalan, it still thinks in English — the coca case
Even when you train a language model in Catalan, the concepts still come in English. Ten of the 11 LLMs tested, including Catalan-native models, place coca next to cocaine and Coca-Cola when no context is given. This page covers the taxonomy behind that result and how to rescue meaning.
What does the study show?
«Coca Is Not Cocaine: Three Lexical-Cultural Collision Modes in Open-Weight LLMs, Probed in Catalan» is an empirical study published on GitHub by Xavier Vinaixa Roselló (Sorensen.ai) testing whether large language models understand Catalan cultural concepts or confuse them with their English translations.
The central question is deliberately mundane: when you say coca in Catalan — a traditional pastry, a flatbread of Catalan cuisine — does the model think of the dessert, or does it think of cocaine and Coca-Cola? Empirical answer: 10 of the 11 models tested, with no context, place coca next to noise anchors (drugs, brands, beverages).
This is not anecdotal: it is a structural property of latent space geometry. The collision doesn't depend on training data alone — it appears even in models explicitly trained to speak Catalan (Salamandra, ALIA) from the Barcelona Supercomputing Center. The good news: all models recover with a Catalan prompt frame.
Methodology
The study analyses K-Nearest Neighbors (KNN) relationships in each model's embedding space, measuring how many anchors from the «noise cluster» (cocaine, Coca-Cola, drugs, beverages, brands) appear among the 10 nearest neighbours of the word coca. A preregistered threshold of ≥ 2 noise anchors outside the English frame identifies a structural collision.
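The neighbour count described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the anchor list, the toy two-dimensional vectors and the function names (`nearest_neighbours`, `noise_count`, `is_collision`) are all assumptions made for the example. The real anchors and embeddings live in the repository.

```python
import math

# Hypothetical noise cluster for illustration; the study's actual anchor list is in its repository.
NOISE_ANCHORS = {"cocaine", "coca-cola", "drug", "beverage", "brand"}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_neighbours(target, vocab_vectors, k=10):
    """Return the k vocabulary words whose vectors are most similar to the target."""
    ranked = sorted(vocab_vectors, key=lambda w: cosine(target, vocab_vectors[w]), reverse=True)
    return ranked[:k]

def noise_count(neighbours, noise=NOISE_ANCHORS):
    """Count how many neighbours belong to the noise cluster."""
    return sum(1 for w in neighbours if w.lower() in noise)

def is_collision(count, threshold=2):
    """Apply the >= 2 noise-anchor threshold mentioned in the text."""
    return count >= threshold

# Toy vectors (made up, not model embeddings) just to exercise the functions.
toy_vectors = {
    "cocaine": [0.9, 0.1],
    "coca-cola": [0.8, 0.2],
    "pastry": [0.1, 0.9],
    "bread": [0.2, 0.8],
}
coca = [1.0, 0.0]
neighbours = nearest_neighbours(coca, toy_vectors, k=2)
count = noise_count(neighbours)
print(neighbours, count, is_collision(count))
```

Cosine similarity is assumed here as the distance measure; the methodology documents state which metric the study actually uses.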
Each model is tested with four prompt frames to isolate the effect of context: bare (no context), anglo_en (English context), catalan_frame (Catalan cultural frame), and neutral_ca (neutral Catalan text). This measures not only the initial position of the word in latent space, but also the rescue capacity of the model as a function of linguistic framing.
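As a rough sketch of how the four frames could be laid out in code: the frame names below come from the study, but the template wording and the `build_prompts` helper are hypothetical, written only to show the idea of holding the target word fixed while varying its context.

```python
# Frame names are the study's; the template text is a hypothetical stand-in,
# not the wording used in the actual experiments.
FRAME_TEMPLATES = {
    "bare": "{word}",
    "anglo_en": "In English, the word {word} refers to",
    "catalan_frame": "A la pastisseria catalana, per Sant Joan, la {word} és",
    "neutral_ca": "En català, la paraula {word} vol dir",
}

def build_prompts(word):
    """Return one prompt per frame, with the same target word in each."""
    return {frame: template.format(word=word) for frame, template in FRAME_TEMPLATES.items()}

prompts = build_prompts("coca")
for frame, prompt in prompts.items():
    print(f"{frame}: {prompt}")
```

Keeping the word constant and changing only the frame is what lets the neighbour counts be compared across conditions.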
All code (Python), CSVs with raw data, reproduction scripts and the final report are public in the GitHub repository. The detailed methodology is documented in metodologia.pdf and docs/methodology.md.
11 models analysed
The study covers three families of open-weight models, stratified by size and purpose:
Generalist — small
- Gemma 2 2B (Google)
- Gemma 4 E2B (Google)
- Qwen2 1.5B (Alibaba)
- Qwen 3.5 4B Base (Alibaba)
Generalist — large
- Gemma 4 26B-A4B (Google)
- Mistral Small 24B Base (Mistral)
- Qwen 3.6 35B-A3B (Alibaba)
- DeepSeek 67B Base (DeepSeek)
Catalan control — BSC
- Salamandra 2B (Barcelona Supercomputing Center)
- Salamandra 7B (Barcelona Supercomputing Center)
- ALIA 40B (Barcelona Supercomputing Center)
Key results
Five findings that disrupt assumptions about scale and Catalan training:
- Structural collision: 10 of 11 models, with no context, place coca next to cocaine or Coca-Cola. It is a pipeline phenomenon, not just data bias.
- Salamandra 7B paradox: the Catalan-specialised model performs worse under English framing than generalist models (9/10 noise anchors). Catalan training does not guarantee immunity.
- DeepSeek 67B diverges: it is the only model that avoids the drug cluster, but it goes straight into fairy-tale semantics (princesses, enchanted forest). This is the first empirical evidence that the F1 and F2 collision modes are separable.
- Scale doesn't save you: 67B parameters don't make a model more prompt-sensitive. ALIA-40B shows stronger frame-rescue than larger competitors.
- Rescue works: all models recover the correct sense with strong Catalan cultural framing. The most effective intervention is not retraining but providing Catalan context at inference time.
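The collision and rescue findings above can be expressed as a small per-model check. This is a sketch under stated assumptions: it reads "outside the English frame" as "in any frame other than anglo_en", the helper names are hypothetical, and the counts in `example_counts` are made-up values, not data from the study.

```python
THRESHOLD = 2  # the >= 2 noise-anchor threshold stated in the methodology

# Assumption: "outside the English frame" means every frame except anglo_en.
NON_ENGLISH_FRAMES = ("bare", "catalan_frame", "neutral_ca")

def collides(counts, frame):
    """True if the noise-anchor count for this frame reaches the threshold."""
    return counts[frame] >= THRESHOLD

def structural_collision(counts):
    """True if any non-English frame reaches the threshold."""
    return any(collides(counts, frame) for frame in NON_ENGLISH_FRAMES)

def rescued_by_catalan_frame(counts):
    """True if the word collides with no context but not under the Catalan frame."""
    return collides(counts, "bare") and not collides(counts, "catalan_frame")

# Made-up noise-anchor counts per frame for one imaginary model (not study data).
example_counts = {"bare": 4, "anglo_en": 6, "catalan_frame": 0, "neutral_ca": 1}

print("structural collision:", structural_collision(example_counts))
print("rescued by Catalan frame:", rescued_by_catalan_frame(example_counts))
```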
Taxonomy: three collision modes
The study proposes a taxonomy of three lexical-cultural failure modes that were previously lumped together:
F1 — Lexical cluster collision
Cognate drag: the native word is dragged into the cluster of a formally similar word in another dominant language. Here: coca → cocaine / Coca-Cola. Mostly observed under bare and anglo_en framing.
F2 — Anglo narrative drift
Cultural frame substitution: the model relocates the concept into an alien imaginary (fairies, princesses, an enchanted forest, as with DeepSeek 67B) because the native culture is not dense enough in the pretraining data. The word stops being a drug, but it does not recover its meaning either.
F3 — Polysemic dominance
The most common sense erases the others: among all valid meanings of the word in the native language, the model imposes the most basic or everyday one and hides the specialised, dialectal or ritualistic senses. The semantic diversity of Catalan compresses into a single meaning.
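To make the F1/F2 distinction concrete, here is one way a neighbour list could be labelled by the cluster that dominates it. This is only an illustrative sketch: the two anchor sets, the reuse of a threshold of 2 and the `label_mode` function are assumptions for the example, not the study's classification procedure. F3 is left out because it would require an inventory of the word's valid senses.

```python
# Hypothetical anchor sets for illustration only.
F1_ANCHORS = {"cocaine", "coca-cola", "drug"}          # cognate / brand cluster
F2_ANCHORS = {"fairy", "princess", "enchanted"}        # imported narrative cluster

def label_mode(neighbours, threshold=2):
    """Label a neighbour list as 'F1', 'F2' or 'none' by the dominant anchor set."""
    words = {w.lower() for w in neighbours}
    f1_hits = len(words & F1_ANCHORS)
    f2_hits = len(words & F2_ANCHORS)
    if max(f1_hits, f2_hits) < threshold:
        return "none"
    # Ties are resolved in favour of F1 in this sketch.
    return "F1" if f1_hits >= f2_hits else "F2"

print(label_mode(["cocaine", "Coca-Cola", "bread"]))
print(label_mode(["princess", "fairy", "bread"]))
print(label_mode(["bread", "pastry"]))
```

A list can leave the drug cluster without landing on the pastry sense, which is the situation the text describes for DeepSeek 67B.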
Why it matters
This study is an empirical exhibit in service of a broader thesis: that the Anglo-centric bias of LLMs is not only a problem of symbolic inclusion, but a measurable distortion of the semantic reality of minoritised languages. If you ask an English-language chatbot about Catalan cuisine, there is a non-trivial chance it reads the question through a ghost layer of Anglo-American stereotypes.
It is also a practical guide for journalists, experience designers, and prompt engineers working with non-dominant languages. The engineering takeaway is clear: put cultural context into your inference and do not outsource it to model size. A Salamandra 7B with a well-built Catalan frame rescues coca; a far larger model with no frame does not.
It connects directly with Xavi Vinaixa's broader work on data sovereignty, LLM hallucination, and the idea that digital resistance starts by reclaiming control over the raw material of culture — starting with language.
Open source and reproducible
All material (Python code, CSV datasets, reproduction scripts, results tables and full documentation) is open source and lives in the GitHub repository. The code is released under the MIT license and the data under CC-BY 4.0. The /docs folder contains methodology.md, findings.md, limitations.md, related-work.md and reproduce.md.
To cite the study: Vinaixa Roselló, X. (2026). Coca Is Not Cocaine: Three Lexical-Cultural Collision Modes in Open-Weight LLMs, Probed in Catalan. GitHub. ORCID: 0009-0005-2769-9215.
Resources and links
GitHub — coca-is-not-cocaine
Official repository: Python code, CSV datasets, reproduction scripts, full report (MIT + CC-BY 4.0)
/docs — documentation
methodology.md, findings.md, limitations.md, related-work.md and reproduce.md
metodologia.pdf
Full methodology document (PDF) with preregistration, thresholds and experimental protocol
Substack — La coca no és cocaïna (excepte per a la IA)
Long-form article (in Catalan) explaining the findings and their political/cultural implications
Salamandra (BSC) on Hugging Face
Catalan-native model from the Barcelona Supercomputing Center tested in the study
ALIA-40b (BSC) on Hugging Face
BSC's Catalan-native 40B-parameter model tested in the study
ORCID — Xavier Vinaixa Roselló
Academic identity of the author
References
- Vinaixa Roselló, X. (2026). Coca Is Not Cocaine: Three Lexical-Cultural Collision Modes in Open-Weight LLMs, Probed in Catalan. — GitHub https://github.com/xaviviro/coca-is-not-cocaine
- Barcelona Supercomputing Center. Salamandra-7B model card. — Hugging Face https://huggingface.co/BSC-LT/salamandra-7b
- Barcelona Supercomputing Center. ALIA-40B model card. — Hugging Face https://huggingface.co/BSC-LT/ALIA-40b
- Vinaixa Roselló, X. (2025). Hallucination patterns in open-weight LLMs. — Zenodo (DOI 18976059) https://zenodo.org/records/18976059
- Vinaixa Roselló, X. (2025). Fine-tuning of literary style in instruction-tuned LLMs. — Zenodo (DOI 18975628) https://zenodo.org/records/18975628
- Cassany, R., Vinaixa, X. & Mauri, M. (2025). Identidad sonora personalizada mediante IA para personas sordas signantes. — Libro de Actas XVII CILCS, Madrid (ISBN 979-13-87819-03-3) https://congresolatina.net/wp-content/uploads/2025/12/Libro-de-actas-XVII-CILCS-2025.pdf
- Anthropic. Model Context Protocol — open specification. — modelcontextprotocol.io https://modelcontextprotocol.io/