API Reference
Pauhu provides IATE terminology (2.4M terms), neural translation (1,440+ language pairs), semantic search (20 Vectorize indexes), and document annotation. All services run on Cloudflare Workers in EU jurisdiction.
Architecture
Pauhu® EU runs as a fleet of Cloudflare Workers in EU jurisdiction. Each service is a separate worker with its own domain. There is no single unified gateway URL — each API is accessed at its own worker endpoint.
| Service | Worker | Purpose |
|---|---|---|
| Terminology | terminology.pauhu.eu | IATE term lookup, search, TBX/TMX export |
| Translation | translate.pauhu.eu | Helsinki-NLP OPUS-MT neural translation |
| Search | search.pauhu.eu | Semantic search across 20 data sources |
| Annotation | annotate.pauhu.eu | Topic + deontic classification |
| Models | models.pauhu.ai | ONNX model CDN (2,342 models) |
| Gateway | staging.pauhu.eu | Gate orchestration |
All workers return JSON by default. CORS is enabled for pauhu.ai, pauhu.eu, and localhost development origins.
Authentication
API keys
Generate a self-service API key via the Pauhu search service. Keys follow the format pk_*. Each key is linked to a seat tier that determines data access entitlements.
Create an API key. Requires an email address. New keys start as live tier (activated by Stripe subscription).
curl -X POST https://staging.pauhu.eu/keys/generate \
-H "Content-Type: application/json" \
-d '{"email": "you@company.com"}'
Response:
{
"api_key": "pk_...",
"tier": "live",
"entitlements": {
"raw_feeds": true,
"annotated_feeds": false,
"training_export": false,
"pauhu_ai": false,
"search": true,
"terminology": true,
"translation": true,
"rerank": true
},
"created_at": "2026-02-24T12:00:00.000Z"
}
Check current entitlements and burst limits. Requires Authorization: Bearer <api_key>.
{
"tier": "live",
"entitlements": {
"raw_feeds": true,
"annotated_feeds": false,
"training_export": false,
"pauhu_ai": false,
"search": true,
"terminology": true,
"translation": true,
"rerank": true
},
"burst_limit": "10 req/sec sustained, 50 req/sec peak",
"active": true,
"org_id": null
}
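A client can gate features on this payload before calling an endpoint. The following is a minimal sketch that parses the sample response above; the helper names `is_entitled` and `missing_features` are illustrative and not part of any Pauhu SDK.

```python
import json

# Sample payload copied from the entitlements response shown above.
SAMPLE = json.loads("""
{
  "tier": "live",
  "entitlements": {
    "raw_feeds": true,
    "annotated_feeds": false,
    "training_export": false,
    "pauhu_ai": false,
    "search": true,
    "terminology": true,
    "translation": true,
    "rerank": true
  },
  "burst_limit": "10 req/sec sustained, 50 req/sec peak",
  "active": true,
  "org_id": null
}
""")


def is_entitled(payload, feature):
    """Return True only if the key is active and the feature flag is set."""
    if not payload.get("active", False):
        return False
    return bool(payload.get("entitlements", {}).get(feature, False))


def missing_features(payload, required):
    """List the required features that this key cannot use."""
    return [name for name in required if not is_entitled(payload, name)]


print(is_entitled(SAMPLE, "search"))            # True
print(is_entitled(SAMPLE, "annotated_feeds"))   # False
print(missing_features(SAMPLE, ["search", "training_export"]))
```

Unknown feature names are treated as not entitled, which is the conservative choice for a client-side check.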
Request authentication
Include your API key as a Bearer token:
Authorization: Bearer pk_...
Without a key, requests run in trial mode: 3 requests/day (IP-based), search + terminology + translation only. No access to raw feeds, annotations, or training export.
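As a sketch of how a client might attach the Bearer token, the snippet below builds a request with Python's standard library without sending it. The key `pk_example` is a placeholder, and `build_request` is a hypothetical helper; omitting the key produces a request that would run in trial mode.

```python
import json
import urllib.request


def build_request(url, payload, api_key=None):
    """Build a JSON POST request; add the Bearer header only when a key is given."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = "Bearer " + api_key
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


payload = {"text": "General Data Protection Regulation",
           "source_lang": "en", "target_lang": "fi"}

# Authenticated request (placeholder key, not a real credential).
authed = build_request("https://staging.pauhu.eu/translate", payload, api_key="pk_example")

# Trial-mode request: no Authorization header at all.
trial = build_request("https://staging.pauhu.eu/translate", payload)

print(authed.get_header("Authorization"))
print(trial.has_header("Authorization"))
```

Passing the result to `urllib.request.urlopen` would send it; that step is left out so the example runs offline.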
Terminology API LIVE
Serves 2,456,445 IATE terms across 24 EU languages with exact lookup and semantic search (BGE-M3, 1024 dimensions).
Exact lookup
Exact match against IATE terminology database.
| Parameter | Type | Description |
|---|---|---|
| term | string | Term to look up (required) |
| lang | string | ISO 639-1 language code (optional, searches all if omitted) |
{
"query": "data protection",
"found": true,
"count": 3,
"results": [...],
"source": "IATE",
"stage": "LOOKUP"
}
Semantic search
BGE-M3 embedding search via Vectorize. Returns semantically similar terms ranked by cosine similarity.
curl -X POST https://staging.pauhu.eu/search \
-H "Content-Type: application/json" \
-d '{"query": "personal data processing", "lang": "en", "limit": 10}'
{
"query": "personal data processing",
"count": 10,
"results": [...],
"source": "IATE Pauhu Search",
"stage": "MODEL"
}
Statistics
Term counts by language.
{
"total": 2456445,
"languages": 24,
"byLanguage": [{"lang": "en", "count": 312847}, ...],
"source": "IATE",
"reliability": "4-star"
}
TBX export (ISO 30042)
Export terminology in TermBase eXchange format. Returns application/x-tbx+xml.
TMX export
Export translation pairs in Translation Memory eXchange format. Returns application/x-tmx+xml.
Batch export
Paginated export for embedding generation pipelines.
{
"lang": "en",
"offset": 0,
"limit": 1000,
"count": 1000,
"hasMore": true,
"nextOffset": 1000,
"terms": [...],
"embedding_model": "bge-m3-onnx"
}
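The `hasMore` and `nextOffset` fields above are enough to drive a pagination loop. This is a sketch under assumptions: `fetch_page` stands in for whatever HTTP call retrieves one page, and the shape of each item in `terms` is not specified in this reference, so the fake data below is purely illustrative.

```python
from typing import Callable, Dict, Iterator


def iter_terms(fetch_page: Callable[[int, int], Dict], limit: int = 1000) -> Iterator[Dict]:
    """Yield every term by following hasMore / nextOffset until the export ends."""
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        for term in page["terms"]:
            yield term
        if not page.get("hasMore"):
            return
        offset = page["nextOffset"]


def make_fake_fetch(total: int) -> Callable[[int, int], Dict]:
    """Hypothetical stand-in for the export endpoint, serving `total` fake terms."""
    data = [{"term": "term-%d" % i} for i in range(total)]

    def fetch_page(offset: int, limit: int) -> Dict:
        chunk = data[offset:offset + limit]
        has_more = offset + limit < total
        return {
            "offset": offset,
            "limit": limit,
            "count": len(chunk),
            "hasMore": has_more,
            "nextOffset": offset + len(chunk) if has_more else None,
            "terms": chunk,
        }

    return fetch_page


terms = list(iter_terms(make_fake_fetch(2500)))
print(len(terms))  # prints 2500
```

Because `iter_terms` is a generator, an embedding pipeline can process terms page by page instead of holding the full export in memory.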
Custom glossaries (tenant)
Upload a custom glossary (CSV or TBX). Terms are merged with IATE at lookup time, with tenant terms taking priority.
Merged lookup: tenant glossary first, then IATE fallback.
List all glossaries for a tenant.
Delete a tenant glossary.
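The priority rule for merged lookup (tenant glossary first, then IATE fallback) can be illustrated with two in-memory dictionaries. This is a sketch of the rule only, not the service's implementation; the sample entries and the `"tenant"` source label are assumptions.

```python
def merged_lookup(term, tenant_glossary, iate):
    """Return tenant results when present, otherwise fall back to IATE."""
    key = term.casefold()
    if key in tenant_glossary:
        return {"query": term, "found": True,
                "results": tenant_glossary[key], "source": "tenant"}
    if key in iate:
        return {"query": term, "found": True,
                "results": iate[key], "source": "IATE"}
    return {"query": term, "found": False, "results": [], "source": None}


# Illustrative data only.
tenant = {"data protection": ["tietosuoja (house style)"]}
iate = {"data protection": ["tietosuoja"], "personal data": ["henkilötiedot"]}

print(merged_lookup("Data Protection", tenant, iate)["source"])  # tenant
print(merged_lookup("personal data", tenant, iate)["source"])    # IATE
print(merged_lookup("unknown term", tenant, iate)["found"])      # False
```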
Translation API LIVE
Helsinki-NLP OPUS-MT models run in the browser via ONNX, so there is no server-side inference cost. 1,440+ language pairs are available.
Translate text
6-stage translation cascade: KV cache → IATE terminology → rules engine → (reserved) → browser inference (OPUS-MT) → Vectorize semantic verification.
| Parameter | Type | Description |
|---|---|---|
| text | string | Text to translate (required) |
| source_lang | string | Source language (ISO 639-1) |
| target_lang | string | Target language (required) |
curl -X POST https://staging.pauhu.eu/translate \
-H "Content-Type: application/json" \
-d '{"text": "General Data Protection Regulation", "source_lang": "en", "target_lang": "fi"}'
{
"source_lang": "en",
"target_lang": "fi",
"text": "General Data Protection Regulation",
"translation": "Yleinen tietosuoja-asetus",
"model_id": "Helsinki-NLP/opus-mt-en-fi"
}
Full cascade (verbose)
Returns all 6 cascade stages with timing and provenance for each step.
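To make the cascade idea concrete, the sketch below runs a list of stages in order, stops producing output at the first stage that returns a translation, and records a timing trace for every stage. It is a generic illustration: the stage names mirror the list above, but the stub stages, the trace fields, and the "first hit wins" rule are assumptions, and the verification stage is not modeled.

```python
import time
from typing import Callable, List, Optional, Tuple

Stage = Callable[[str], Optional[str]]


def run_cascade(text: str, stages: List[Tuple[str, Stage]]) -> dict:
    """Try each stage in order; later stages are skipped once one returns a result."""
    trace = []
    translation = None
    for name, stage in stages:
        if translation is not None:
            trace.append({"stage": name, "status": "skipped", "ms": 0.0})
            continue
        start = time.perf_counter()
        candidate = stage(text)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        status = "hit" if candidate is not None else "miss"
        trace.append({"stage": name, "status": status, "ms": elapsed_ms})
        translation = candidate
    return {"text": text, "translation": translation, "trace": trace}


# Stub stages for illustration: an empty cache, a one-entry glossary, unused rules.
glossary = {"General Data Protection Regulation": "Yleinen tietosuoja-asetus"}
stages = [
    ("kv_cache", lambda text: None),
    ("iate_terminology", glossary.get),
    ("rules_engine", lambda text: None),
]

result = run_cascade("General Data Protection Regulation", stages)
print(result["translation"])
print([(step["stage"], step["status"]) for step in result["trace"]])
```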
Batch segment translation
Translate an array of segments in a single request.
Supported languages
List all supported language codes and available pairs.
Search API LIVE
Fan-out semantic search across 20 indexes (BGE-M3, 1024 dimensions, cosine similarity). Powered by the Laine Algorithm.
Semantic search
Search across all 20 data source Vectorize indexes simultaneously.
| Parameter | Type | Description |
|---|---|---|
| q | string | Search query (required) |
| limit | integer | Max results (default: 20) |
| domain | string | EuroVoc domain filter (1-21) |
{
"query": "digital product passport ESPR",
"limit": 20,
"results": [
{"product": "eurlex", "id": "32024R1781", "title": "...", "score": 0.89, "url": "..."},
{"product": "commission", "id": "...", "title": "...", "score": 0.84, "url": "..."}
]
}
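A fan-out search has to merge per-index hits into one ranked list. The sketch below shows the simplest possible merge, a global top-k by `score`; it is not the Laine Algorithm, whose details are not described here, and the document IDs other than `32024R1781` are made up for the example.

```python
import heapq


def merge_results(per_index, limit=20):
    """Flatten hits from every index and keep the `limit` highest-scoring ones."""
    all_hits = [hit for hits in per_index.values() for hit in hits]
    return heapq.nlargest(limit, all_hits, key=lambda hit: hit["score"])


# Illustrative per-index results (hypothetical IDs except the EUR-Lex one).
per_index = {
    "eurlex": [
        {"product": "eurlex", "id": "32024R1781", "score": 0.89},
        {"product": "eurlex", "id": "example-eurlex-2", "score": 0.61},
    ],
    "commission": [
        {"product": "commission", "id": "example-com-1", "score": 0.84},
    ],
}

top = merge_results(per_index, limit=2)
print([(hit["product"], hit["score"]) for hit in top])
```

A plain score merge assumes that scores from different indexes are comparable, which holds here only because every index is described as using the same embedding model and cosine similarity.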
Instant answers
Knowledge panels from IATE, EUR-Lex, and Wikidata. Returns structured answer snippets.
Web proxy
CORS proxy for institutional search sources. Normalizes results into a common schema.
| Source | Description |
|---|---|
| arxiv | arXiv academic papers |
| eurostat | Eurostat datasets |
| ted | TED procurement notices |
Cross-language siblings
Find all language versions of a EUR-Lex document by CELEX number.
Semantic reranking
Rerank a set of results using BGE-M3 cross-encoder scoring.
DLC packs (browser models)
Signed manifest of downloadable model packs with Ed25519 signatures and SHA-256 checksums.
Core DLC pack: ONNX models and terminology for browser-native inference.
Delta pack: incremental updates since last core download.
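After downloading a pack, a client can compare its SHA-256 digest with the checksum from the manifest. The sketch below covers only the checksum step using Python's standard library; Ed25519 signature verification needs a third-party library and is not shown, and the manifest field that carries the checksum is not specified in this reference.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the downloaded bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_pack(data: bytes, expected_hex: str) -> bool:
    """Constant-time comparison of the computed digest with the expected one."""
    return hmac.compare_digest(sha256_hex(data), expected_hex.strip().lower())


pack = b"example pack bytes"          # stand-in for a downloaded pack
expected = sha256_hex(pack)           # in practice this comes from the manifest

print(verify_pack(pack, expected))                 # True
print(verify_pack(pack + b"tampered", expected))   # False
```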
Annotation API LIVE
Classifies documents with topic annotations, deontic modalities, and language detection. Split across two service instances (A–E and E–W) for the 20 data sources.
Annotate document
Full annotation pipeline: language detection, topic classification, deontic modality (obligation/prohibition/permission/exemption), word count, product-specific metadata.
curl -X POST https://annotate.pauhu.eu/annotate \
-H "Content-Type: application/json" \
-d '{"text": "Member States shall ensure...", "product": "eurlex"}'
{
"original_path": "...",
"organized_path": "...",
"language": "en",
"deontic": {"modality": "obligation", "confidence": 0.95},
"product": "eurlex",
"topic_domain": "law",
"word_count": 847,
"char_count": 5231
}
Batch annotate
Annotate up to 50 documents in a single request.
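Larger collections have to be split client-side to respect the 50-document limit. A minimal sketch, with `chunk_documents` as an illustrative helper name:

```python
def chunk_documents(documents, size=50):
    """Split a list of documents into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [documents[i:i + size] for i in range(0, len(documents), size)]


documents = [{"text": "Document %d" % i, "product": "eurlex"} for i in range(120)]
batches = chunk_documents(documents)
print([len(batch) for batch in batches])  # prints [50, 50, 20]
```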
Deontic classification only
Lightweight endpoint: returns only deontic modality classification.
{
"language": "en",
"annotation": {
"modality": "prohibition",
"confidence": 0.92
}
}
Deontic modalities
| Modality | Meaning | Example |
|---|---|---|
| Prohibition | Action is forbidden | "Member States shall not permit..." |
| Obligation | Action is required | "Member States shall ensure..." |
| Permission | Action is allowed | "Member States may designate..." |
| Exemption | No requirement applies | "This Regulation shall not apply to..." |
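The four modalities can be told apart on the example sentences above with a few ordered keyword rules. This is a toy heuristic for illustration only, not the classifier the service uses; real legal text needs far more than pattern matching.

```python
import re

# Order matters: the most specific phrase is checked first.
RULES = [
    ("exemption", re.compile(r"\bshall not apply\b", re.IGNORECASE)),
    ("prohibition", re.compile(r"\b(shall|must|may) not\b", re.IGNORECASE)),
    ("obligation", re.compile(r"\b(shall|must)\b", re.IGNORECASE)),
    ("permission", re.compile(r"\bmay\b", re.IGNORECASE)),
]


def classify_modality(sentence):
    """Return the first matching modality, or None when no rule applies."""
    for modality, pattern in RULES:
        if pattern.search(sentence):
            return modality
    return None


examples = [
    "Member States shall not permit...",
    "Member States shall ensure...",
    "Member States may designate...",
    "This Regulation shall not apply to...",
]
for sentence in examples:
    print(classify_modality(sentence), "<-", sentence)
```

Checking "shall not apply" before "shall not" is what keeps exemptions from being misread as prohibitions.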
Service metadata
List all registered data source annotators and their product codes.
Annotation counts from R2 sidecar metadata, grouped by product.
Full provenance audit: total annotated documents, per-product breakdown, provenance tier distribution (NATIVE 1.0, PARSED 0.95, KEYWORD ≤0.9).
Indexing API LIVE
Hybrid semantic + BM25 search, alert monitoring, and document health checks across all 20 data sources. Split across two service instances (A–E and E–W).
Health check
Binding smoke test. Returns status of R2, D1, Vectorize, and KV bindings for the worker’s product set.
{
"service": "index",
"status": "healthy",
"bindings": {
"R2_COMMISSION": "ok",
"D1_COMMISSION": "ok",
"RECIPE_ALERTS": "bound",
"CF_AI_TOKEN": "set"
},
"timestamp": "2026-03-03T09:00:00Z"
}
Alerts
Query stored regulatory alerts from KV. Filter by recipe name and severity level.
| Parameter | Type | Description |
|---|---|---|
| recipe | string | Recipe name filter (optional, defaults to all) |
| severity | string | Severity filter: critical, high, medium, low (optional) |
| limit | integer | Max results (default: 20) |
{
"count": 5,
"filters": { "recipe": "*", "severity": "*" },
"alerts": [...]
}
Hybrid query
Hybrid semantic (70% BGE-M3) + BM25 keyword (30%) search within a single product index. Includes DSA Article 27 ranking transparency metadata.
| Parameter | Type | Description |
|---|---|---|
| product | string | Product code, e.g. COMMISSION, EURLEX (required) |
| q | string | Search query (required) |
| lang | string | ISO 639-1 language filter (optional) |
| domain | string | EuroVoc domain ID (optional) |
| limit | integer | Max results (default: 10) |
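The 70/30 weighting described above amounts to a weighted sum of two scores. The sketch below assumes both scores are already normalized to the range 0 to 1 (raw BM25 scores are unbounded, and how the service normalizes them is not stated here); the candidate data is invented.

```python
def hybrid_score(semantic, keyword, w_semantic=0.7, w_keyword=0.3):
    """Weighted sum of a semantic score and a keyword score, both assumed in [0, 1]."""
    return w_semantic * semantic + w_keyword * keyword


def rank(candidates):
    """Sort candidates by hybrid score, highest first."""
    return sorted(
        candidates,
        key=lambda c: hybrid_score(c["semantic"], c["bm25"]),
        reverse=True,
    )


candidates = [
    {"id": "a", "semantic": 0.9, "bm25": 0.1},  # 0.7*0.9 + 0.3*0.1 = 0.66
    {"id": "b", "semantic": 0.6, "bm25": 1.0},  # 0.7*0.6 + 0.3*1.0 = 0.72
]
ranked = rank(candidates)
print([c["id"] for c in ranked])  # prints ['b', 'a']
```

The example shows why the keyword component matters: a strong exact-term match can outrank a document with a higher semantic score alone.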
Backfill ADMIN
Admin-only endpoint. Index unprocessed R2 sidecars for a product. Useful after initial data seeding or to recover from indexing gaps. Requires infrastructure-level access (not available via public API keys).
| Parameter | Type | Description |
|---|---|---|
| product | string | Product code, e.g. COMMISSION, EURLEX (required) |
| limit | integer | Max documents to index in one batch (default: 2000) |
| prefix | string | R2 key prefix filter, e.g. en/ for English documents only (optional) |
{
"product": "COMMISSION",
"limit": 2000,
"prefix": "en/",
"indexed": 500,
"errors": 3
}
Cross-language siblings
Find all language versions of a EUR-Lex document. Returns available languages and document paths.
Ranking methodology
DSA Article 27 ranking transparency. Returns algorithm weights (0.7 semantic, 0.3 keyword), manipulation resistance details, and update frequency. Cached for 24 hours.
Statistics
Document counts per product. The secondary index instance includes IATE term counts.
IATE deontic distribution
Distribution of deontic modalities across IATE terminology entries.
IATE cross-lingual translation
Look up a term in one language and retrieve translations across all 24 EU languages via IATE concept IDs.
{
"concept_id": "C12345",
"source_term": "tietosuoja",
"source_language": "fi",
"languages": 24,
"translations": {
"en": [{ "term": "data protection", "reliability": 4 }],
"de": [{ "term": "Datenschutz", "reliability": 4 }]
}
}
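A consumer of this response often wants a single preferred term per language. The following sketch picks the entry with the highest `reliability` for each language from the sample above; `best_translations` and its `min_reliability` parameter are illustrative names.

```python
SAMPLE = {
    "concept_id": "C12345",
    "source_term": "tietosuoja",
    "source_language": "fi",
    "languages": 24,
    "translations": {
        "en": [{"term": "data protection", "reliability": 4}],
        "de": [{"term": "Datenschutz", "reliability": 4}],
    },
}


def best_translations(payload, min_reliability=0):
    """Map each language to its most reliable term at or above the threshold."""
    best = {}
    for lang, entries in payload["translations"].items():
        usable = [e for e in entries if e["reliability"] >= min_reliability]
        if usable:
            best[lang] = max(usable, key=lambda e: e["reliability"])["term"]
    return best


print(best_translations(SAMPLE))
```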
Model CDN LIVE
Serves ONNX models from EU storage (2,342 models). Supports HTTP range requests for large files.
Model manifest with all available models and SHA-256 checksums.
Serve model files with Accept-Ranges: bytes for resumable downloads. CORS enabled for browser-native inference via ONNX Runtime Web / Transformers.js.
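Resuming a download with a range request means asking for the bytes after those already on disk. The helper below builds the standard HTTP `Range` header for that case; the function name is illustrative and no Pauhu-specific path is assumed.

```python
def range_header(bytes_on_disk):
    """Return the headers needed to resume a download after `bytes_on_disk` bytes."""
    if bytes_on_disk < 0:
        raise ValueError("bytes_on_disk must not be negative")
    if bytes_on_disk == 0:
        return {}  # nothing downloaded yet: request the whole file
    return {"Range": "bytes=%d-" % bytes_on_disk}


print(range_header(0))        # prints {}
print(range_header(1048576))  # prints {'Range': 'bytes=1048576-'}
```

A server that honors the range replies with 206 Partial Content. A 200 response means the full file is being sent, so the partial file should be overwritten rather than appended to.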
Model categories
| Category | Models | Format |
|---|---|---|
| Translation | 1,440+ OPUS-MT pairs | ONNX |
| Embeddings | BGE-M3 (1024d) | ONNX |
| Speech-to-text | Whisper variants | ONNX |
| Text-to-speech | TTS models | ONNX |
| Classification | DistilBERT, domain classifiers | ONNX |
| Code | Qwen2.5-Coder, Phi-3 Mini | ONNX |
Data infrastructure
20 EU institutional data sources are ingested into per-product R2 buckets, annotated via queue-triggered workers, and indexed in Vectorize for semantic search. Each source has matching R2, Queue, D1, and Vectorize resources.
Data sources (20)
Pipeline
Documents flow through: Ingestion → EU storage (with metadata) → Event notification → Queue → Annotation service (topic + deontic classification) → Sidecar JSON. Annotated documents are then searchable via the Laine Algorithm across all 20 indexes.
National law databases (27 countries)
27 national law adapters with source database links. Connected to EUR-Lex Sector 7 (290,172 national transposition measures linking EU directives to national implementations).
EuroVoc domains (21)
All annotations use the EU Publications Office EuroVoc thesaurus for domain classification:
04 Politics 08 Education & Comms 16 Environment
08 International 10 Business 17 Industry
10 EU Institutions 11 Agriculture 20 Energy
16 Economics 12 Law 24 Production
20 Trade 14 Geography 28 Employment
24 Finance 16 Intl Organisations 32 Information
28 Social Affairs 20 Transport
EU AI Act transparency
All workers expose an Article 52 transparency endpoint:
{
"ai_system": true,
"provider": "Pauhu Ltd",
"eu_ai_act_article": 52,
"purpose": "...",
"risk_category": "limited",
"jurisdiction": "EU"
}
Access control
Pauhu uses entitlement-based access control, not volume-based rate limiting. Your seat tier determines what data you can access, not how many requests you can make.
Seat tiers
| Tier | Data access | Auth |
|---|---|---|
| Trial | Search, terminology, translation (3 req/day) | IP-based (no key needed) |
| Live | Raw feeds from 20 sources + search + terminology + translation + reranking | API key |
| Annotated | Live + annotated feeds (EuroVoc, deontic) + Pauhu AI platform | API key |
| Training | Live + bulk export for ML training | API key |
Burst protection
Paying seats have no daily request caps. Burst protection prevents abuse:
- Sustained: 10 requests/second sliding window
- Peak: 50 requests/second absolute maximum
Trial tier: 3 requests/day (IP-based), plus burst protection.
Response headers
X-Pauhu-Tier: live
Retry-After: 1 (only if burst limit hit)
Trial tier also receives X-RateLimit-Limit: 3.
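Clients can honor the `Retry-After` header by pausing before the next attempt. The sketch below reads the delay from a header mapping; it handles only the delay-in-seconds form shown above, and the helper name is an assumption.

```python
def retry_after_seconds(headers):
    """Return the Retry-After delay in seconds, or None if absent or unparseable."""
    normalized = {name.lower(): value for name, value in headers.items()}
    value = normalized.get("retry-after")
    if value is None:
        return None
    try:
        return max(0.0, float(value))
    except ValueError:
        return None  # the HTTP-date form is not handled in this sketch


print(retry_after_seconds({"X-Pauhu-Tier": "live", "Retry-After": "1"}))  # prints 1.0
print(retry_after_seconds({"X-Pauhu-Tier": "live"}))                      # prints None
```

A caller would typically pass the result to `time.sleep` and then retry the request once.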
Guides
| Guide | Domain | Description |
|---|---|---|
| IATE API Reference | pauhu.eu | Full reference for all 11 terminology endpoints: lookup, search, TBX/TMX export, custom glossaries |
| Recipe Catalog | pauhu.eu | 6 pre-configured monitoring recipes with alert format specification |
| How We Protect Your Data | pauhu.eu | Zone isolation, EU data residency, encryption, access control, audit trails |
| GPU Extensions | pauhu.eu | 6 GPU extension types (LLMs, video, image, real-time video, audio, 3D). Bring your own API keys. |
| Data Source Attributions | pauhu.eu | Licenses, publishers, and modification notices for all 35 data sources |
| Getting Started (Recipe Wizard) | pauhu.ai | Configure your EU regulatory feed in 3 steps |
| Benchmark Guide | pauhu.ai | Interpret browser inference benchmark results |
| Search Guide | pauhu.com | Query syntax, filters, boolean operators, CELEX lookup, 20 product examples |
| Translation Quality Pipeline | pauhu.com | 10-stage translation quality cascade |
| Getting Started (Containerverse) | pauhu.dev | Install and run the EU context container in 3 lines |
| Who is Who Privacy Notice | pauhu.eu | GDPR privacy notice for personal data from the EU Who is Who directory |
| LDS Connector Deployment | pauhu.eu | Deploy and configure the Language Data Space connector |
| LDS Demo Runbook | pauhu.eu | Step-by-step: login, Swagger, certificate upload, data publishing for lds.pauhu.eu |
| Data Catalog | pauhu.eu | 24 data products: source institution, update frequency, record count, languages, license. Machine-readable YAML + /v1/search API reference. |
| Cross-References | pauhu.eu | How EUR-Lex, CURIA, OEIL, TED, and national law link together. Example API responses with linked documents. |
| Data Freshness | pauhu.eu | Sync schedules per product, what “Last updated” means, data currency SLA. |
| Multilingual Search | pauhu.eu | Cross-lingual semantic search with BGE-M3. Query in one language, find documents in another. 24 EU languages. |
| Data Pipeline | pauhu.eu | From EU source to grounded answer: ingestion, STAM annotation, paragraph indexing, semantic search, FiD generation. Sovereign deployment data flow. |
| FiD Dual-Brain Architecture | pauhu.eu | Fusion-in-Decoder: how the retrieval brain (20 data sources) and generation brain (ONNX specialists) work together, cloud and sovereign modes |
| Compass Search | pauhu.com | How the Compass works: 3.2B endpoints, sync, annotation, Vectorize indexing, Laine Algorithm ranking. Developer guide for adding new data sources. |
| MCP Sovereign Mode | pauhu.dev | API reference for sovereign MCP server: 4 tools, air-gapped deployment, SHA-256 audit logging, model adapter patterns |
| AI Transparency (Art. 52) | pauhu.eu | EU AI Act Art. 52 compliance: how Pauhu discloses AI involvement, system classification, user notification, training data sources |
| Pauhu for Government | pauhu.eu | Data sovereignty, GDPR Art. 25/32, NIS2, Traficom compliance, data residency guarantees, procurement compatibility |
| Testausopas (Anne) | pauhu.eu | 5-step testing guide: open, login, search “julkinen hankinta”, translate, chat — with expected results |
| Pauhu julkishallinnolle (yleiskatsaus) | pauhu.eu | 1-page overview for Anne Miettinen’s 40-org GovAI network: procurement, legal compliance, terminology, translation |
| API Quickstart (EN) | pauhu.eu | English API quickstart with curl examples, 20 data sources, eForms procurement, translation, security overview |
| eForms-kenttäopas (BT) | pauhu.eu | eForms SDK 1.14 BT field reference for TED procurement data: 40+ Business Terms with Finnish descriptions, CPV codes, API response example |
| Government Procurement Training | pauhu.eu | 6-module training guide for procurement officials: EU law search, TED notices, IATE terminology, cross-references, multilingual search, compliance checklists |
| Demo: Government Procurement | pauhu.eu | Step-by-step walkthrough: search EUR-Lex, translate to Finnish, check national transposition, sovereign FiD deployment |
| Demo: eForms Procurement Search | pauhu.eu | Search TED notices by BT fields, CPV codes, country comparison, monitoring recipes, CSV/JSON export |
| Demo: Pharmaceutical EMA Compliance | pauhu.eu | EMA variation procedures, ECHA substance checks, SmPC translation, CURIA case law monitoring |
| MACC Guide | all | Microsoft Azure Consumption Commitment — hot-swap Azure workloads to Pauhu at identical North Europe EUR rates |
| CRM API Reference | internal | Sales pipeline API: contacts, companies, deals, activities, tasks, email sequences, AI lead scoring |
| Changelog | all | Release notes: Document extraction integration, two-part tariff, 20 data feeds, browser-native inference |
| Getting Started | pauhu.eu | 7-section guide: signup, first search, filters, products, export, chat, next steps |
| Sovereign Brain | pauhu.eu | How Pauhu thinks: dual-hemisphere architecture, browser-native inference, grounded generation |
| Install Sovereign AI | pauhu.eu | 8-container self-hosted deployment guide for air-gapped and on-premises environments |
| Chip-Agnostic Architecture | pauhu.eu | Why Pauhu runs on any device: ONNX Runtime, WebAssembly, ARM/x86, browser-native inference |
| Two-Path Pricing | pauhu.eu | Pauhu license + Azure pass-through pricing model explained |
| Onboarding Wizard | pauhu.eu | Step-by-step account setup and configuration walkthrough |
| Guide vs. Encyclopedia | pauhu.eu | How Pauhu differs from static reference databases: guided search vs. keyword lookup |
Support
Technical: support@pauhu.eu
Sales: sales@pauhu.eu