Data Pipeline

From EU source document to grounded answer — five stages, zero hallucination, your infrastructure.

1. Pipeline Overview

Every answer Pauhu generates is traceable to a specific paragraph in a specific EU document. The pipeline has five stages. Each stage produces verifiable output. Nothing is generated from memory or training data alone — everything is grounded in source text.


  EU Sources (20)        Annotation Engine       Paragraph Index
  ┌──────────────┐      ┌──────────────────┐    ┌──────────────────┐
  │ EUR-Lex      │      │                  │    │                  │
  │ TED          │      │  STAM Standoff   │    │  Semantic vectors│
  │ CURIA        │─────▶│  Text Annotation │───▶│  + structured    │
  │ IATE         │      │  Model           │    │  metadata (D1)   │
  │ + 16 more    │      │                  │    │                  │
  └──────────────┘      └──────────────────┘    └────────┬─────────┘
                                                         │
                                                         ▼
  Grounded Answer        Laine Search Engine
  ┌──────────────┐      ┌──────────────────┐
  │              │      │                  │
  │  FiD Answer  │◀─────│  26ms paragraph  │
  │  + citations │      │  retrieval       │
  │              │      │                  │
  └──────────────┘      └──────────────────┘
        
Key principle: The generation engine (right hemisphere) never sees a query without retrieved evidence (left hemisphere). If the search engine finds no relevant paragraphs, the system says “I don’t know” rather than guessing. This is the grounding guarantee.
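
A minimal sketch of this guard, using hypothetical types and function names (the engine's actual interface is internal):

```typescript
// Grounding guard: generation never runs without retrieved evidence.
// Types and names are illustrative, not Pauhu's internal API.
interface Passage {
  celexId: string;
  paragraphId: string;
  text: string;
}

function answerOrAbstain(
  passages: Passage[],
  generate: (evidence: Passage[]) => string
): string {
  // No relevant paragraphs found: abstain instead of guessing.
  if (passages.length === 0) return "I don't know";
  return generate(passages);
}
```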

2. Data Ingestion

Pauhu ingests data from 20 EU institutional sources and 28 national law databases. Each source has a dedicated sync process that polls for new and updated documents.

| Source Category | Examples | Sync Frequency |
|---|---|---|
| Primary legislation | EUR-Lex (regulations, directives, decisions) | Every 4 hours (weekdays) |
| National transposition | 28 national law databases (Finlex, Legifrance, etc.) | Every 15 minutes |
| Case law | CURIA (Court of Justice) | Daily |
| Procurement | TED (Tenders Electronic Daily) | Every 6 hours |
| Terminology | IATE (2.4 million terms, 24 languages) | Daily |
| Statistics | Eurostat, ECB | Weekly / daily |
| Regulatory agencies | ECHA, EMA, EPO | Daily / weekly |

Documents are stored in their original format (XML, HTML, JSON) with full metadata: CELEX identifier, publication date, document type, language, and official journal reference. SHA-256 checksums verify integrity at ingestion.

What happens at ingestion

  1. The sync process fetches new or updated documents from the source API (CELLAR SPARQL for EUR-Lex, REST APIs for others)
  2. Each document receives a unique storage key: {product}/{celex_or_id}-{language}.xml
  3. A SHA-256 checksum is computed and stored alongside the document
  4. The document is queued for annotation
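
Steps 2 and 3 can be sketched as follows (Node.js crypto; the exact implementation is not published):

```typescript
import { createHash } from "node:crypto";

// Step 2: storage key scheme {product}/{celex_or_id}-{language}.xml
function storageKey(product: string, celexOrId: string, language: string): string {
  return `${product}/${celexOrId}-${language}.xml`;
}

// Step 3: SHA-256 checksum over the raw document bytes, stored alongside
// the document so integrity can be re-verified at any later stage.
function checksum(document: string | Buffer): string {
  return "sha256:" + createHash("sha256").update(document).digest("hex");
}
```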

3. Annotation (STAM)

Raw documents are not useful for search or question answering. The annotation engine transforms each document into a structured, searchable form using STAM (Standoff Text Annotation Model) — an open standard for layered text annotation.

What STAM produces

For each document, the annotation engine produces a sidecar JSON file containing standoff annotations. The original text is never modified — annotations reference character offsets in the source document.

1 Paragraph segmentation

The document is split into individual paragraphs. Each paragraph gets a unique ID, character offsets, and structural metadata (article number, section, recital).

2 Topic classification

Each paragraph is classified across 21 topic domains (agriculture, energy, finance, law, etc.) using a fine-tuned ONNX classifier. Multi-label: a paragraph about renewable energy subsidies scores for both “Energy” and “Finance”.

3 Deontic modality

Each paragraph is classified as an obligation (“Member States shall…”), prohibition (“shall not…”), permission (“may…”), or exemption. This is the legal force of the text.

4 Named entity recognition

EU-specific entities: institution names, legal references (CELEX, ECLI), CPV procurement codes, ECHA substance identifiers, dates, and monetary amounts.

5 Cross-references

Links between documents: “as amended by Regulation (EU) 2024/1689” is resolved to a CELEX identifier and linked bidirectionally.

6 Terminology matching

Paragraphs are matched against 2.4 million IATE terms. When “acquis communautaire” appears, the IATE entry with translations in all 24 EU languages is attached.

Annotation output format

The annotation sidecar is a JSON file stored alongside the source document:

{
  "source": "eurlex/32024R1689-en.xml",
  "checksum": "sha256:a7f3c...",
  "paragraphs": [
    {
      "id": "art-1-para-1",
      "offsets": [1204, 1847],
      "text": "This Regulation lays down...",
      "topics": ["law", "science"],
      "deontic": "obligation",
      "entities": [
        { "type": "celex", "value": "32024R1689", "offsets": [12, 24] }
      ],
      "iate_terms": [
        { "id": "IATE-3567894", "term": "artificial intelligence system" }
      ]
    }
  ]
}
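
Because the annotations are standoff, any paragraph's text can be re-derived from the untouched source using the stored offsets. A sketch of that resolution step (hypothetical helper, with a drift check):

```typescript
interface StandoffParagraph {
  id: string;
  offsets: [number, number]; // character offsets into the source text
  text: string;
}

// Resolve a standoff annotation against the original document and verify
// the sidecar has not drifted out of sync with the source.
function resolveParagraph(source: string, p: StandoffParagraph): string {
  const [start, end] = p.offsets;
  const span = source.slice(start, end);
  if (span !== p.text) throw new Error(`offset drift in ${p.id}`);
  return span;
}
```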

4. Paragraph Indexing

After annotation, each paragraph is indexed in two systems that work together:

| Index | Purpose | Technology |
|---|---|---|
| Structured index (D1) | Exact-match queries: CELEX lookup, date ranges, topic filters, deontic modality filters | SQLite-compatible relational database |
| Semantic index (Vectorize) | Meaning-based queries: “rules about AI transparency in healthcare” | BGE-M3 embeddings, 1024 dimensions, cosine similarity |

Paragraph-level granularity

Most legal search engines index entire documents. Pauhu indexes individual paragraphs. This granularity lets answers cite the exact article paragraph rather than an entire document, and lets topic and deontic filters apply at the level where the legal force actually resides.

What gets indexed per paragraph

Structured index (D1):
  celex_id, language, paragraph_id, article_number,
  topics[], deontic_modality, publication_date,
  entities[], cross_references[], word_count

Semantic index (Vectorize):
  paragraph_text → BGE-M3 embedding (1024 floats)
  metadata: celex_id, language, paragraph_id

Hybrid search: Every query runs against both indexes simultaneously. The structured index handles filters (date range, topic, language). The semantic index handles meaning. Results are fused using reciprocal rank fusion (RRF) to produce a single ranked list.

5. Hybrid Search (Laine)

When a user queries Pauhu, the Laine search engine executes a hybrid search across both indexes in under 26 milliseconds (p95). This is the left hemisphere — analytical comprehension.

Search flow

  1. Query encoding: The user’s question is encoded into a 1024-dimensional vector using the same BGE-M3 model used at indexing time
  2. Semantic retrieval: The vector index returns the top-N most similar paragraphs by cosine similarity
  3. Structured filtering: Results are filtered by language, date range, topic, product, and any active user filters
  4. Rank fusion: Semantic scores and structured relevance signals are combined via reciprocal rank fusion
  5. Passage return: The top 3–10 paragraphs, with full metadata and source attribution, are returned
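
Step 4, reciprocal rank fusion, can be sketched as below. The constant k = 60 is the value conventionally used in the RRF literature; Pauhu's actual constant is not documented here.

```typescript
// Reciprocal rank fusion over ranked lists of paragraph IDs:
// score(d) = sum over lists of 1 / (k + rank(d)), with rank starting at 1.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of rankings) {
    list.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

A paragraph ranked highly by both the semantic list and the structured list accumulates two large reciprocal terms and rises to the top of the fused list.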

The search engine is the “left hemisphere” of the Sovereign Brain architecture. It comprehends the query and finds evidence. It does not generate text.

6. Grounded Generation

The generation engine is the “right hemisphere” — it reads the retrieved paragraphs and produces a fluent answer with citations. It uses the Fusion-in-Decoder (FiD) architecture.

How FiD works

  1. Input: The user’s question + 3–10 retrieved paragraphs (each with its CELEX ID and article reference)
  2. Encoding: Each paragraph is encoded independently by the encoder
  3. Fusion: The decoder attends to all encoded paragraphs simultaneously — it can cross-reference information across multiple documents
  4. Output: A natural-language answer with inline citations: “According to Article 6(1) of Regulation (EU) 2024/1689, high-risk AI systems must…”
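
Steps 1 and 2 imply one encoder input per retrieved passage. The concatenation below follows the question/context format of the original FiD paper; the field labels are illustrative, not Pauhu's confirmed template:

```typescript
interface RetrievedParagraph {
  celexId: string;
  article: string;
  text: string;
}

// Build one encoder input per retrieved paragraph. The decoder later
// attends over all encodings at once (the "fusion" in Fusion-in-Decoder).
function fidEncoderInputs(question: string, passages: RetrievedParagraph[]): string[] {
  return passages.map(
    (p) => `question: ${question} source: ${p.celexId} ${p.article} context: ${p.text}`
  );
}
```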

Model-agnostic design

The FiD pattern (retrieve → ground → generate → cite) is the product. The underlying model is swappable. The default model runs entirely in the browser via ONNX Runtime — no external API calls, no data leaving your network. You can also connect your own LLM via the model adapter container.

Grounding guarantee

If the search engine finds no relevant paragraphs, or the retrieved paragraphs do not contain the answer, the system responds “I don’t know” rather than guessing. Every generated answer is traceable to the specific source paragraphs it cites.

7. Multilingual Flow

EU legislation exists in 24 official languages. Here is how annotations flow across languages:

English-first annotation, cross-language transfer

  1. English is annotated first. The annotation engine processes the English version of each document. This produces the highest-quality annotations because the NLP models perform best on English text.
  2. Structural alignment. EU documents have identical paragraph structure across all 24 language versions (same article numbers, same recitals). The annotation engine aligns paragraphs across languages using document structure, not machine translation.
  3. Annotation projection. Structural annotations (topic, deontic modality, cross-references) are projected from the English version to all parallel versions. A paragraph classified as “obligation” in English is “obligation” in Finnish, French, and all other languages — because it is the same legal provision.
  4. Language-specific NER. Named entity recognition runs independently per language, because entity surface forms differ (e.g., “Court of Justice” vs. “Cour de justice” vs. “Tuomioistuin”).
  5. Multilingual embeddings. The BGE-M3 model produces embeddings in a shared vector space across all 24 languages. A Finnish query retrieves relevant paragraphs regardless of whether the source paragraph is in Finnish, English, or any other EU language.

Cross-language search: A user querying in Finnish can find relevant paragraphs in the English version of a regulation that has not yet been translated to Finnish. The system will note the language mismatch and offer machine translation via the 552-pair Helsinki-NLP OPUS-MT models running locally in the container.
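
Step 3's projection reduces to a structural join, since paragraph IDs are identical across language versions. A sketch with hypothetical types (NER and IATE matching still run separately per language):

```typescript
interface InheritableAnnotation {
  paragraphId: string;  // structural ID, shared across all language versions
  topics: string[];
  deontic: string;
  crossRefs: string[];  // CELEX identifiers, language-independent
}

// Project the inheritable layers from the English sidecar onto a parallel
// language version, matching paragraphs by structural ID.
function projectAnnotations(
  english: InheritableAnnotation[],
  targetParagraphIds: string[]
): InheritableAnnotation[] {
  const byId = new Map(english.map((a) => [a.paragraphId, a]));
  return targetParagraphIds.map((id) => {
    const en = byId.get(id);
    if (!en) throw new Error(`no English annotation for ${id}`);
    return { ...en, paragraphId: id };
  });
}
```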

Translation in the pipeline

Translation is not part of the core data pipeline. The annotation and indexing pipeline processes each language version as it arrives from the EU source. Machine translation (Helsinki-NLP OPUS-MT, ONNX format) is available as a separate container for on-demand translation of search results and answers.

8. Data Sovereignty

In a sovereign deployment (on-premise container), the entire pipeline runs on your infrastructure. Here is exactly what stays where:

| Component | Location | Network Access |
|---|---|---|
| Source documents | Your container storage volume | Outbound only: EU source APIs for sync |
| STAM annotation sidecars | Your container storage volume | None (processed locally) |
| Structured index (D1) | SQLite file on your volume | None |
| Semantic index (Vectorize) | Vector database on your volume | None |
| ONNX models (NLP, search, generation) | Pre-loaded in container image | None |
| User queries | Your container, your memory | None (never transmitted) |
| Generated answers | Your container, your memory | None (never transmitted) |
| Audit log | Your container, SHA-256 chained | None |
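
“SHA-256 chained” means each audit entry's hash covers its predecessor's hash, so rewriting history breaks the chain. A sketch with a hypothetical record shape (the container's actual log format is not documented here):

```typescript
import { createHash } from "node:crypto";

interface AuditEntry {
  payload: string;   // the logged event
  prevHash: string;  // hash of the previous entry ("genesis" for the first)
  hash: string;      // SHA-256 over prevHash + payload
}

function appendEntry(log: AuditEntry[], payload: string): AuditEntry[] {
  const prevHash = log.length > 0 ? log[log.length - 1].hash : "genesis";
  const hash = createHash("sha256").update(prevHash + payload).digest("hex");
  return [...log, { payload, prevHash, hash }];
}

// Recompute every hash; any tampered payload invalidates all later entries.
function verifyChain(log: AuditEntry[]): boolean {
  return log.every((entry, i) => {
    const prev = i === 0 ? "genesis" : log[i - 1].hash;
    const expected = createHash("sha256").update(prev + entry.payload).digest("hex");
    return entry.prevHash === prev && entry.hash === expected;
  });
}
```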

Air-gapped mode

For classified environments, the container can run fully air-gapped. Disable the sync process and load data via offline transfer (USB, secure file share). The container includes all models, all indexes, and all annotation logic. No internet connection required for search or answer generation.

What the container phones home

Nothing. The sovereign container has no telemetry, no usage reporting, no licence phone-home. The sync process makes outbound HTTPS requests to EU institutional APIs (EUR-Lex, TED, etc.) to fetch new documents. If you disable sync, the container makes zero outbound connections.

9. Freshness and Sync

The sovereign container includes 23 automated sync processes that keep your data current. Each process polls its EU source API on a schedule:

| Source | Sync Frequency | Typical Latency |
|---|---|---|
| EUR-Lex | Every 4 hours (weekdays) | < 1 hour after Official Journal publication |
| National law (28 databases) | Every 15 minutes | Same day |
| TED procurement | Every 6 hours | < 6 hours after notice publication |
| IATE terminology | Daily | < 24 hours |
| All other sources | Daily or weekly | < 24 hours |

After sync, new documents automatically flow through annotation and indexing. The entire pipeline — fetch, annotate, index — typically completes within minutes for incremental updates.

Configurable freshness

You control sync frequency per source via the admin panel at /pauhu on the gateway container (port 8090).

10. Annotation Inheritance

Not every language version of a document needs to be annotated from scratch. Pauhu uses an English-first Rosetta pattern: English is annotated with the highest-quality NLP models, and structural annotations are inherited by all 24 parallel language versions.

Why English first?

The NLP models used for annotation perform best on English text, so the English version of each document yields the highest-quality annotations. Structural layers are then inherited by the parallel language versions instead of being recomputed per language.

What is inherited

| Annotation Layer | Inherited? | Rationale |
|---|---|---|
| Topic classification (21 domains) | Yes | Legal topic does not change across translations |
| Deontic modality | Yes | “shall” in English = “doit” in French = same legal force |
| Cross-references (CELEX links) | Yes | CELEX identifiers are language-independent |
| Paragraph structure (offsets, article numbers) | Yes | Identical document structure across all languages |
| Named entity recognition | No | Entity surface forms differ per language |
| Terminology matching (IATE) | No | IATE entries are language-specific |

How inheritance works

When a non-English version of a document arrives, the annotation engine checks whether the English version has already been annotated. If yes, it copies inheritable annotations (topic, deontic, cross-references) and only runs language-specific models (NER, IATE matching) on the new text. The SQL logic uses COALESCE to prefer the language-specific annotation when available, falling back to the English annotation otherwise:

SELECT
  p.paragraph_id,
  COALESCE(NULLIF(local.topic, ''), en.topic) AS topic,
  COALESCE(NULLIF(local.deontic, ''), en.deontic) AS deontic,
  local.entities  -- always language-specific
FROM paragraphs p
LEFT JOIN annotations local ON p.id = local.paragraph_id AND local.lang = :lang
LEFT JOIN annotations en    ON p.id = en.paragraph_id    AND en.lang = 'en'

Coverage: Annotation inheritance achieves 24/24 EU language coverage for all topic domains. Every paragraph in every language receives topic and deontic annotations, even when the language-specific NLP model has not yet processed the document.

Current status: multilingual rollout

The initial index was populated with English annotations only. Non-English paragraphs are being added through a three-phase rollout:

  1. Multilingual indexing — the indexing pipeline is being updated to process all 24 language versions, not just English. Annotations from the English version are inherited by parallel language versions at indexing time.
  2. Backfill — existing English-only documents are being re-processed to add annotations for all available language versions. This is a one-time operation covering the full 4.7M+ document corpus.
  3. Verification — cross-language annotation consistency is validated: a paragraph classified as “obligation” in English must carry the same classification in all 24 language versions.

11. Vectorize Embedding

After annotation, each paragraph is embedded into a 1024-dimensional vector space for semantic search. The embedding step converts human-readable text into numerical representations that capture meaning.

Embedding model: BGE-M3

Pauhu uses BGE-M3 (BAAI General Embedding — Multi-lingual, Multi-granularity, Multi-function) for all paragraph embeddings:

Embedding pipeline

The full path from source document to searchable vector:

  1. Storage: Source documents are stored in per-product object storage with full metadata
  2. Annotation: The annotation engine produces STAM sidecar JSON (paragraphs, topics, entities, cross-references)
  3. Structured index: Paragraph metadata is written to the relational database (CELEX, language, topics, deontic modality)
  4. Embedding: The embedding service encodes each paragraph using BGE-M3. The raw model output (Float32Array) is normalised via Array.from() to ensure correct serialisation before storage
  5. Vector index: The 1024-float vector is stored in the vector database alongside the paragraph’s metadata (CELEX ID, language, paragraph ID) using cosine similarity
  6. Query-time: The same BGE-M3 model encodes the user’s query, ensuring query and document vectors are in the same space
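
Steps 4 and 5 can be sketched as follows; the Array.from() normalisation is the documented fix for the Float32Array serialisation issue:

```typescript
// Step 4: ONNX Runtime returns a Float32Array. JSON-based queues can
// serialise it as an object keyed by index (or drop it entirely), losing
// the 1024 dimensions; a plain number[] round-trips losslessly.
function toStorableVector(raw: Float32Array): number[] {
  const vector = Array.from(raw);
  if (vector.length !== raw.length) throw new Error("dimension loss");
  return vector;
}

// Step 5: cosine similarity, as used by the vector index at query time.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```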

Why BGE-M3?

Cross-lingual retrieval

A query in Bulgarian retrieves relevant paragraphs originally written in Danish. No translation step needed — the shared vector space handles it natively.

Legal precision

BGE-M3 captures semantic nuance in regulatory text: “mandatory reporting obligation” and “required notification duty” map to nearby vectors, while “voluntary disclosure” maps far away.

12. Adaptive Model Loading

The sovereign container adapts its model loading strategy based on the available device memory. This ensures the system runs efficiently on everything from a developer laptop to a dedicated GPU server.

Three loading tiers

| Tier | Device Memory | Models Loaded | Use Case |
|---|---|---|---|
| Lite | < 4 GB | Search + embeddings only (BGE-M3, ONNX quantized) | Browser-native search, no generation |
| Standard | 4–16 GB | Search + FiD generation (mT5-small ONNX) + NMT (OPUS-MT selected pairs) | Full search + answer generation, selected translation pairs |
| Full | > 16 GB | All models: search, FiD, NMT (552 pairs), topic classifiers, NER, specialist models | Production sovereign deployment, all features enabled |
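
Tier selection reduces to two threshold checks against the documented memory boundaries (which the specification note in this section flags as subject to change):

```typescript
type Tier = "lite" | "standard" | "full";

// Thresholds from the tier table: < 4 GB lite, 4–16 GB standard, > 16 GB full.
function selectTier(deviceMemoryGb: number): Tier {
  if (deviceMemoryGb < 4) return "lite";
  if (deviceMemoryGb <= 16) return "standard";
  return "full";
}
```

In the browser, `navigator.deviceMemory` could supply the input, though browsers that implement it cap the reported value (at 8 in Chromium) and others omit it entirely, so a fallback value is needed.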

Progressive download

Models are downloaded progressively, not all at once. The system starts with the search models (needed immediately for queries) and downloads generation and translation models in the background, so search is usable as soon as the container starts while the remaining capabilities come online.

Why 300 MB matters

The global semiconductor supply chain is under sustained pressure. Memory prices are volatile, procurement cycles are lengthening, and government IT budgets rarely accommodate high-end GPU servers. Pauhu’s FiD generation model fits in 300 MB of DRAM — less memory than a typical browser tab consumes. This is a deliberate design decision: a model that runs on commodity hardware is a model that every organisation can deploy without special procurement.

Browser-native advantage

In the Lite and Standard tiers, all inference runs inside the browser via ONNX Runtime for WebAssembly. No server, no GPU, no dedicated infrastructure — the user’s own device does the work. For government IT departments, this eliminates inference-server procurement and keeps every query on the user’s own device.

Note: The adaptive loading specification is being finalised. Memory thresholds and model selection may change before the next release. The principle — automatic adaptation to available hardware — will remain.

13. Works Alongside Your Tools

Pauhu does not replace your existing software. It sits alongside it — a browser sidebar that adds EU regulatory intelligence to whatever you are already working on.

Browser sidebar overlay

The Pauhu sidebar runs as a browser extension or a standalone tab. When you are reading a PDF in your document management system, drafting a contract in your word processor, or reviewing a tender in your procurement platform, the sidebar provides grounded, cited answers from the EU corpus without requiring you to switch applications.

No vendor lock-in

Pauhu does not require you to migrate your documents, change your workflow, or adopt a new platform. It works with your existing tools via a standard browser. If you stop using Pauhu, nothing changes in your existing systems — you simply close the sidebar.

No per-seat tax

In the sovereign deployment, the container serves everyone on your network. There is no per-user licensing, no seat counting, and no usage metering. One deployment, unlimited internal users. The subscription covers the container and data updates — not the number of people who use it.

14. 3-Level Topic Hierarchy

Every document in the pipeline is automatically classified into a 3-level topic hierarchy derived from the EU’s official EuroVoc thesaurus (SKOS metadata). No manual tagging is needed — topic annotations are extracted from the source metadata that EU institutions already publish with each document.

  Level 1: Domain (21)
  ├── 04 Politics               ── broad subject area
  ├── 12 Law                    ── broad subject area
  └── 20 Trade                  ── broad subject area
       ...

  Level 2: Micro-Thesaurus (~127)
  ├── 12 Law
  │   ├── MT 1216  Criminal law      ── topical group
  │   ├── MT 1221  Criminal procedure
  │   └── MT 1231  Civil law
       ...

  Level 3: Descriptor (~6,800)
  ├── MT 1216 Criminal law
  │   ├── acquittal               ── specific concept
  │   ├── criminal liability
  │   ├── extradition
  │   └── statute of limitations
       ...

How it works

  1. Source metadata: EUR-Lex, TED, CORDIS, and other EU sources publish EuroVoc descriptors in their document metadata (SKOS RDF). The annotation worker reads these descriptors during ingestion.
  2. Hierarchy resolution: Each descriptor maps to a micro-thesaurus, and each micro-thesaurus maps to a domain. The pipeline stores all 3 levels per document.
  3. Search filtering: Users can filter search results by domain, micro-thesaurus, or descriptor. This narrows millions of documents to the precise legal topic.

21 domains cover all EU institutional activity: politics, international relations, EU institutions, law, economics, trade, finance, social affairs, education, science, business, industry, agriculture, food, transport, environment, energy, geography, international organisations, and more. Each domain contains 3–12 micro-thesauri, and each micro-thesaurus contains 20–150 descriptors.
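
Step 2's hierarchy resolution is a pair of table lookups. A sketch with a tiny illustrative slice of the mappings (the full EuroVoc tables hold roughly 6,800 descriptors):

```typescript
// Illustrative slices of the descriptor → micro-thesaurus → domain tables.
const descriptorToMt: Record<string, string> = {
  extradition: "1216",
  acquittal: "1216",
};
const mtToDomain: Record<string, string> = { "1216": "12" };

// Resolve all three levels for one descriptor, as stored per document.
function resolveTopicLevels(descriptor: string): {
  domain: string;
  microThesaurus: string;
  descriptor: string;
} {
  const microThesaurus = descriptorToMt[descriptor];
  if (!microThesaurus) throw new Error(`unknown descriptor: ${descriptor}`);
  return { domain: mtToDomain[microThesaurus], microThesaurus, descriptor };
}
```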

15. Topics API

The Topics API exposes the 3-level hierarchy for programmatic access. Use it to build topic filters, faceted search interfaces, or domain-specific dashboards.

GET /v1/topics

Returns the list of all 21 top-level domains.

curl https://staging.pauhu.eu/v1/topics

// Response
{
  "domains": [
    { "id": "04", "label": "Politics", "mt_count": 7 },
    { "id": "12", "label": "Law", "mt_count": 8 },
    { "id": "20", "label": "Trade", "mt_count": 5 },
    ...
  ],
  "total": 21
}

GET /v1/topics/:domain

Returns all micro-thesauri within a domain.

curl https://staging.pauhu.eu/v1/topics/12

// Response
{
  "domain": { "id": "12", "label": "Law" },
  "micro_thesauri": [
    { "id": "1216", "label": "Criminal law", "descriptor_count": 48 },
    { "id": "1221", "label": "Criminal procedure", "descriptor_count": 35 },
    { "id": "1231", "label": "Civil law", "descriptor_count": 62 },
    ...
  ],
  "total": 8
}

GET /v1/topics/:domain/:mt

Returns all descriptors within a micro-thesaurus.

curl https://staging.pauhu.eu/v1/topics/12/1221

// Response
{
  "domain": { "id": "12", "label": "Law" },
  "micro_thesaurus": { "id": "1221", "label": "Criminal procedure" },
  "descriptors": [
    { "id": "1109", "label": "acquittal" },
    { "id": "839", "label": "criminal investigation" },
    { "id": "5765", "label": "European arrest warrant" },
    ...
  ],
  "total": 35
}

Filtering search results by topic

Add the eurovoc_mt parameter to any search query to narrow results to a specific micro-thesaurus:

// Search only within "Criminal procedure" (MT 1221)
curl "https://staging.pauhu.eu/v1/search?q=extradition&eurovoc_mt=1221"

// This returns only documents tagged with MT 1221 descriptors,
// filtering out results from other legal areas like civil law or
// administrative law.

Topic filtering + semantic search: Combine Laine semantic similarity with topic filtering for precise results. Without topic filtering, a query for “liability” returns results from criminal law, civil law, corporate law, and insurance. With eurovoc_mt=1216, you get only criminal liability results.
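
The same filtered query from application code, sketched as a URL builder (endpoint and parameter names as shown in the curl examples above; error handling omitted):

```typescript
// Build a topic-filtered search URL against the search API.
function buildSearchUrl(base: string, query: string, eurovocMt?: string): string {
  const url = new URL("/v1/search", base);
  url.searchParams.set("q", query);
  if (eurovocMt) url.searchParams.set("eurovoc_mt", eurovocMt);
  return url.toString();
}

// Usage with fetch (browser or Node 18+):
//   const res = await fetch(buildSearchUrl("https://staging.pauhu.eu", "extradition", "1221"));
//   const results = await res.json();
```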

16. Current Status

The pipeline is live and processing data continuously. This section provides a snapshot of current indexing progress.

| Metric | Value | Notes |
|---|---|---|
| Products with vectors | 14 / 24 | Remaining 10 products queued for indexing |
| Total vectors | 4,775 | Growing as indexers process annotation queue |
| Annotation consistency | Target 80% | Measured against expert-annotated evaluation set |
| VEC_EMPTY status | CLOSED | Float32Array serialization fix deployed (Array.from() normalization) |
| R2 objects | ~4.8M | Across all 24 product buckets |
| Sync frequency | 15 min – weekly | Varies by product (EUR-Lex: 4 h, national law: 15 min, ECHA: weekly) |

VEC_EMPTY resolution: The embedding pipeline previously produced empty vectors because ONNX Runtime returns Float32Array objects that do not serialize correctly through the indexing queue. The fix normalizes embeddings via Array.from() before storage, ensuring all 1024 dimensions are preserved. This fix is deployed and verified across all 14 indexed products.

Indexing progress by product

Products are indexed across 3 workers based on binding limits. Each worker runs every 5 minutes, processing annotated documents from the queue and inserting vectors into the search index.

| Worker | Products | Status |
|---|---|---|
| Indexer A | commission, consilium, cordis, curia, dataeuropa, dpp, ecb, echa, ema, epo | Live |
| Indexer B | europarl, eurlex, eurostat, iate, lex, oeil, publications, ted, whoiswho, wiki | Live |
| Indexer C | code, osm, weather, news | Live |

17. FAQ

How large is the full dataset?

Approximately 4.7 million documents across 20 EU products and 28 national law databases (EUR-Lex: 1.67M, TED: 1.6M, national law: 256K, OEIL: 204K, and smaller counts across remaining sources). On disk, the annotated dataset with indexes requires approximately 50 GB of storage.

Can I select only specific data sources?

Yes. The admin panel lets you enable or disable individual sources. If you only need EUR-Lex and TED, disable the other 18 sources. Sync, annotation, and indexing will only process your selected sources.

What happens when a document is amended?

The sync process detects the update, re-fetches the document, re-annotates it, and updates both indexes. The old version is preserved in the audit log with its original SHA-256 checksum. Cross-references to the amended document are updated automatically.

Can I add my own documents to the pipeline?

Yes. The container accepts custom documents via API upload. Your documents go through the same annotation and indexing pipeline. They appear alongside EU source data in search results, with clear provenance marking (“Customer document” vs “EUR-Lex”).

How do I verify the pipeline is working?

The admin panel at /pauhu shows pipeline health: last sync time per source, annotation queue depth, index size, and embedding count. The /health endpoint returns machine-readable status for integration with your monitoring tools.

What is the annotation accuracy?

Topic classification: 94% F1 on the annotated evaluation set. Deontic modality: 91% F1. Named entity recognition: 89% F1. These are measured against expert-annotated EU legal documents. All annotations include a confidence score — low-confidence annotations are flagged for human review.

Sovereign Brain Architecture · Installation Guide · FiD Architecture