Multilingual Search

Query in any of 24 EU languages. Find documents in any language. Semantic search understands meaning across languages.

How Cross-Lingual Search Works

Traditional keyword search requires you to know the exact words used in the target document. If a regulation is published in German and you search in Finnish, a keyword engine will usually find nothing, because the query and the document share few or no words.

Pauhu uses semantic search powered by multilingual embeddings. Every document is converted into a 1024-dimensional vector that captures its meaning, not just its words. Because the embedding model was trained on 100+ languages simultaneously, documents with similar meaning get similar vectors regardless of what language they are written in.

This means you can:

Not keyword matching: Pauhu does not translate your query and then keyword-match. It encodes your query and all documents into the same semantic space. Two texts with the same meaning but different words (even in different languages) will be close together in this space.
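To make the "same semantic space" idea concrete, here is a minimal Python sketch that ranks documents by cosine similarity to a query vector. The 4-dimensional vectors, the document labels, and the function names are made up for illustration; real BGE-M3 embeddings have 1024 dimensions and are produced by the model, and this is not a description of Pauhu's internal implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors with hypothetical values, NOT real embeddings.
query_fi = [0.9, 0.1, 0.0, 0.2]  # stands in for a Finnish query about AI regulation
documents = {
    "de_ai_regulation": [0.8, 0.2, 0.1, 0.3],     # German text, same topic
    "en_fisheries_notice": [0.0, 0.9, 0.4, 0.1],  # English text, unrelated topic
}

def rank(query, docs):
    """Return (doc_id, score) pairs sorted by descending similarity."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = rank(query_fi, documents)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

In this toy setup the German document ranks first even though the query stands in for Finnish text, because ranking depends only on the direction of the vectors, not on the language of the words behind them.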

Embedding Model: BGE-M3

Property | Value
Model | BGE-M3 (BAAI General Embedding, Multilingual)
Dimensions | 1024
Similarity metric | Cosine similarity (0.0–1.0)
Training languages | 100+ (includes all 24 EU official languages)
Max sequence length | 8,192 tokens
Format | ONNX (browser-native inference supported)

BGE-M3 is specifically designed for multilingual and cross-lingual retrieval. It outperforms monolingual models on cross-lingual benchmarks and handles code-mixed text well (e.g., an English query with a German legal term).

Using the lang Parameter

The lang parameter on search endpoints controls result filtering, not query language. The search engine always understands your query regardless of what language you write it in.

Usage | Behaviour
lang omitted | Returns results in all available languages, ranked by semantic similarity
lang=fi | Returns only Finnish-language results, still ranked by semantic similarity to your query (which can be in any language)
lang=en | Returns only English-language results

Key insight: You can write your query in Finnish and set lang=de to find German documents that match your Finnish query. The embedding model bridges the language gap.
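As a sketch of how a client might combine a query in one language with a result filter in another, the helper below builds a search URL with Python's standard library. The endpoint and the q, limit, and lang parameter names are taken from this page; the helper name build_search_url is hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

# Endpoint as used in the curl examples on this page.
BASE_URL = "https://staging.pauhu.eu/v1/search/eurlex"

def build_search_url(query, lang=None, limit=1):
    """Build a search URL; lang filters results and is left out when None."""
    params = {"q": query, "limit": limit}
    if lang is not None:
        params["lang"] = lang
    # urlencode percent-encodes non-ASCII characters as UTF-8 and turns spaces into "+".
    return f"{BASE_URL}?{urlencode(params)}"

# Finnish query text, results restricted to German-language documents.
url = build_search_url("tekoäly korkean riskin järjestelmät", lang="de")
print(url)
```

Leaving lang as None produces a URL without the parameter, which corresponds to the "lang omitted" behaviour described above.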

Cross-Lingual Example

The same concept, the EU AI Act, searched in three different languages returns the same document each time:

Query in English

curl "https://staging.pauhu.eu/v1/search/eurlex?q=artificial+intelligence+high-risk+systems&limit=1" \
  -H "Authorization: Bearer YOUR_API_KEY"
{
  "product": "eurlex",
  "results": [{
    "id": "32024R1689",
    "title": "Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (AI Act)",
    "score": 0.96
  }]
}

Query in Finnish

curl "https://staging.pauhu.eu/v1/search/eurlex?q=teko%C3%A4ly+korkean+riskin+j%C3%A4rjestelm%C3%A4t&limit=1" \
  -H "Authorization: Bearer YOUR_API_KEY"
{
  "product": "eurlex",
  "results": [{
    "id": "32024R1689",
    "title": "Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (AI Act)",
    "score": 0.93
  }]
}

Query in German

curl "https://staging.pauhu.eu/v1/search/eurlex?q=k%C3%BCnstliche+Intelligenz+Hochrisiko-Systeme&limit=1" \
  -H "Authorization: Bearer YOUR_API_KEY"
{
  "product": "eurlex",
  "results": [{
    "id": "32024R1689",
    "title": "Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (AI Act)",
    "score": 0.94
  }]
}

All three queries return the same AI Act regulation with high similarity scores (0.93 to 0.96), despite being written in different languages. The slight score variation reflects natural differences in how the embedding model represents each language.
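The comparison can also be checked programmatically. The following sketch parses the three response bodies shown above (with titles omitted for brevity) and confirms that the top hit is the same document in each case. The helper name top_hit is hypothetical, and the code assumes only the response fields that appear in the examples.

```python
import json

# Response bodies copied from the three examples above, titles omitted.
responses = {
    "en": '{"product": "eurlex", "results": [{"id": "32024R1689", "score": 0.96}]}',
    "fi": '{"product": "eurlex", "results": [{"id": "32024R1689", "score": 0.93}]}',
    "de": '{"product": "eurlex", "results": [{"id": "32024R1689", "score": 0.94}]}',
}

def top_hit(body):
    """Return (id, score) of the first result in a search response body."""
    first = json.loads(body)["results"][0]
    return first["id"], first["score"]

hits = {lang: top_hit(body) for lang, body in responses.items()}

# True when every query language produced the same top document id.
same_document = len({doc_id for doc_id, _ in hits.values()}) == 1

# Difference between the highest and lowest top-hit score.
scores = [score for _, score in hits.values()]
score_spread = max(scores) - min(scores)

print(same_document, round(score_spread, 2))
```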

24 Supported Languages

Pauhu supports all 24 official languages of the European Union:

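Code | Language | Code | Language | Code | Language
bg | Bulgarian | fi | Finnish | mt | Maltese
cs | Czech | fr | French | nl | Dutch
da | Danish | ga | Irish | pl | Polish
de | German | hr | Croatian | pt | Portuguese
el | Greek | hu | Hungarian | ro | Romanian
en | English | it | Italian | sk | Slovak
es | Spanish | lt | Lithuanian | sl | Slovenian
et | Estonian | lv | Latvian | sv | Swedish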
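A client may want to validate a language code before sending it as the lang parameter. The sketch below is one way to do that with the 24 codes listed above; the names EU_LANGUAGES and normalise_lang are hypothetical, and how the API itself responds to an unsupported code is not covered here.

```python
# The 24 codes from the table above.
EU_LANGUAGES = frozenset({
    "bg", "cs", "da", "de", "el", "en", "es", "et",
    "fi", "fr", "ga", "hr", "hu", "it", "lt", "lv",
    "mt", "nl", "pl", "pt", "ro", "sk", "sl", "sv",
})

def normalise_lang(code):
    """Return the trimmed, lowercase code if it is listed, else raise ValueError."""
    cleaned = code.strip().lower()
    if cleaned not in EU_LANGUAGES:
        raise ValueError(f"unsupported language code: {code!r}")
    return cleaned

print(normalise_lang(" FI "))
```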

Language Support Per Product

Most products contain documents in all 24 EU languages. Some specialised products have narrower coverage:

Product | Languages | Notes
EUR-Lex | 24 | All EU official languages. Most documents available in all languages.
TED | 24 | Notices in the language of the contracting authority, summaries in English.
IATE | 24 | Terminology in all EU languages. Coverage varies by term.
Consilium | 24 | Council conclusions typically in all official languages.
Commission | 24 | Major communications in all languages; working documents often English/French/German only.
OEIL | 24 | Procedure summaries in the language of the rapporteur plus English.
National Law (lex) | 23 | National language per country. Malta uses both Maltese and English.
CURIA | 24 | Language of the case plus French (working language of the Court).
EPO | 3 | English, French, German (official languages of the EPO).
CORDIS | 2 | English plus the project coordinator’s language.
OSM | Multilingual | Names in local language plus English/international variants.
Wiki | 24 | Curated articles in all EU languages where available.
Code | 2 | Primarily English. Some projects include localised documentation.
All others | 24 | ECB, ECHA, EMA, Eurostat, Publications, data.europa.eu, Who is Who, DPP, European Parliament.

Search Tips

Semantic, not keyword: Pauhu matches on the meaning of your query, not just its individual words. Synonyms, paraphrases, and cross-lingual equivalents all work, so you do not need to guess the exact wording used in the source document.
