FiD Dual-Brain Architecture

Fusion-in-Decoder: two specialized brains — retrieval and generation — fused at the decoder layer for grounded, citation-backed EU intelligence.

Overview

Pauhu® uses a Fusion-in-Decoder (FiD) architecture — originally described by Izacard & Grave (2021) — adapted for EU regulatory intelligence. Instead of a single monolithic model, the system splits into two specialized brains that fuse at inference time:

  1. Retrieval Brain — The Compass search system. Queries fan out across all 20 Vectorize indexes simultaneously, and results are ranked by the Laine Algorithm. This is the encoder half of FiD.
  2. Generation Brain — ONNX specialist models running in the browser or on-premises. Domain-specific encoders plus chat models. This is the decoder half of FiD.

The key insight: the generation brain attends to all retrieved passages simultaneously, not sequentially. Each retrieved passage is independently encoded by the domain specialist, then all encoded representations are concatenated and fed to the decoder in a single forward pass. This is what distinguishes FiD from simple Retrieval-Augmented Generation (RAG).

  FiD Dual-Brain Architecture
  ===========================

  Query: "NIS2 implementation deadlines for essential entities"
    |
    v
  +---------------------------------------------------------------+
  |                     RETRIEVAL BRAIN                            |
  |                     (Compass Search)                           |
  |                                                                |
  |  Query --+--> [eurlex index  ] --+                             |
  |          +--> [curia index   ] --+                             |
  |          +--> [ted index     ] --+                             |
  |          +--> [commission    ] --+-- Laine Algorithm ------+   |
  |          +--> [echa index    ] --+  (70% semantic,         |   |
  |          +--> [ema index     ] --+   30% BM25)             |   |
  |          +--> [epo index     ] --+                         |   |
  |          +--> [ecb index     ] --+                         |   |
  |          +--> [... 12 more   ] --+                         |   |
  |                                                            |   |
  |          20 BGE-M3 indexes (1024d, cosine)                 |   |
  |                                                            v   |
  |                                                   Top-K passages|
  +------------------------------------------------------------+---+
                                                               |
                              +--------------------------------+
                              |
                              v
  +---------------------------------------------------------------+
  |                     GENERATION BRAIN                           |
  |                     (ONNX Specialists)                         |
  |                                                                |
  |  Top-K passages ---> [Encode each passage independently]       |
  |                           |                                    |
  |                           v                                    |
  |                 [Concatenate all encodings]                     |
  |                           |                                    |
  |                           v                                    |
  |              +---------------------------+                     |
  |              |  Domain Specialist        |                     |
  |              |  (e.g., Law, Finance)     |                     |
  |              |  XLM-RoBERTa, ~145MB INT8 |                     |
  |              +---------------------------+                     |
  |                           |                                    |
  |                           v                                    |
  |              +---------------------------+                     |
  |              |  Chat Model (decoder)     |                     |
  |              |  SmolLM-135M (free)       |                     |
  |              |  TinyLlama-1.1B (pro)     |                     |
  |              +---------------------------+                     |
  |                           |                                    |
  |                           v                                    |
  |              Answer with inline citations                      |
  |              [source: CELEX 32022L2555, Art. 21(1)]            |
  +---------------------------------------------------------------+

Retrieval Brain

The retrieval brain is responsible for finding relevant passages across all 20 EU data sources. It runs on edge infrastructure within EU jurisdiction.

Fan-out search

Every query is dispatched to all 20 Vectorize indexes in parallel. Each index contains BGE-M3 embeddings (1024 dimensions, cosine similarity) for one data product. The fan-out ensures that a query like “carbon border adjustment” finds results across EUR-Lex legislation, CURIA case law, TED procurement notices, and Eurostat data simultaneously.
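The fan-out can be sketched as a set of concurrent index queries. This is a minimal illustration: `query_index` is a stub standing in for the real Vectorize call, and only 8 of the 20 indexes are listed.

```python
import asyncio

# 8 of the 20 indexes, for brevity
INDEXES = ["eurlex", "curia", "ted", "commission", "echa", "ema", "epo", "ecb"]

async def query_index(index: str, query_vec: list[float], top_k: int = 20) -> list[dict]:
    # Stub for a real Vectorize query; returns one placeholder hit per index.
    await asyncio.sleep(0)  # stands in for the network round-trip
    return [{"index": index, "doc_id": f"{index}-001", "score": 0.5}]

async def fan_out(query_vec: list[float]) -> list[dict]:
    # Dispatch the same query vector to every index concurrently,
    # then flatten the per-index result lists for ranking.
    per_index = await asyncio.gather(*(query_index(ix, query_vec) for ix in INDEXES))
    return [hit for results in per_index for hit in results]

hits = asyncio.run(fan_out([0.0] * 1024))  # BGE-M3 vectors are 1024-dimensional
```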

  Parameter          Value
  Embedding model    BGE-M3 (BAAI)
  Dimensions         1024
  Similarity metric  Cosine
  Index count        20 (one per data product)
  Languages          24 EU official languages

Laine Algorithm

Raw vector similarity alone misses keyword-critical matches (e.g., CELEX numbers, article references). The Laine Algorithm therefore combines two signals:

  1. Semantic similarity: cosine similarity between the BGE-M3 query and document embeddings (weight 0.70).
  2. Keyword relevance: BM25 score over the query and document tokens (weight 0.30).

The hybrid score is computed as:

  score = 0.70 * cosine_similarity(query_emb, doc_emb)
        + 0.30 * bm25_score(query_tokens, doc_tokens)

Results from all 20 indexes are merged, deduplicated by document ID, and sorted by hybrid score. The top-K passages (default K=10) are forwarded to the generation brain.
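A sketch of the merge, dedupe, and rank step, assuming each hit carries pre-computed semantic and keyword scores normalised to [0, 1] (the field names are illustrative):

```python
def laine_score(semantic: float, keyword: float) -> float:
    # 70% semantic (cosine) + 30% keyword (BM25, assumed normalised to [0, 1])
    return 0.70 * semantic + 0.30 * keyword

def merge_and_rank(result_lists: list[list[dict]], top_k: int = 10) -> list[dict]:
    # Merge results from all indexes, dedupe by document ID (keeping
    # the best hybrid score), and return the top-K passages.
    best: dict[str, dict] = {}
    for results in result_lists:
        for hit in results:
            score = laine_score(hit["semantic"], hit["keyword"])
            seen = best.get(hit["doc_id"])
            if seen is None or score > seen["combined"]:
                best[hit["doc_id"]] = {**hit, "combined": score}
    return sorted(best.values(), key=lambda h: h["combined"], reverse=True)[:top_k]
```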

Layer 2: Domain embeddings

Domain specialist models enrich retrieval quality via Layer 2 embeddings. When a query is classified as belonging to a specific topic (e.g., Domain 12: Law), the corresponding specialist generates domain-tuned embeddings that are blended with the base BGE-M3 score. This narrows the semantic gap for domain-specific vocabulary — for instance, “consideration” means something very different in contract law vs. general usage.
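One plausible shape for the blend is a linear interpolation between the base and domain scores. This is a sketch only: the actual blend weight is not specified here, and 0.3 is a placeholder.

```python
def layer2_blend(base_score: float, domain_score: float,
                 domain_weight: float = 0.3) -> float:
    # Blend the base BGE-M3 similarity with the domain specialist's
    # Layer 2 score. The 0.3 weight is illustrative, not documented.
    return (1 - domain_weight) * base_score + domain_weight * domain_score
```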

Ranking transparency (DSA Article 27)

Every search result includes provenance metadata (semantic score, keyword score, and provenance tier) to comply with the Digital Services Act ranking transparency requirements.

Generation Brain

The generation brain runs entirely in the user’s browser (via ONNX Runtime Web) or on-premises in a containerised deployment. No document content leaves the user’s device during generation.

Domain specialists

21 fine-tuned specialist models cover all EuroVoc topic domains. Each specialist shares an XLM-RoBERTa backbone but is fine-tuned on domain-specific EU corpora:

  Property             Value
  Backbone             XLM-RoBERTa (cross-lingual)
  Quantisation         INT8
  Size per specialist  ~145 MB
  Specialist count     21 (one per topic domain)
  Runtime              ONNX Runtime Web (browser) or ONNX Runtime (server)

In the FiD pipeline, the specialist serves two purposes:

  1. Passage re-encoding: Each retrieved passage is re-encoded through the domain specialist, producing richer representations than the base embedding model alone.
  2. Layer 2 scoring: The domain-specific encoding feeds back into passage ranking, allowing the system to promote passages that are more relevant within the identified domain.

Chat models (decoder)

After specialist encoding, a generative chat model synthesises the answer from all passage encodings:

  Model      Parameters  Tier  Use case
  SmolLM     135M        Free  Short answers, summaries, term definitions
  TinyLlama  1.1B        Pro   Multi-paragraph analysis, cross-reference synthesis

Both models run as ONNX graphs in the browser. The pro-tier model loads on demand — only downloaded when the user first triggers a pro-level query.

Browser-native execution

The generation brain uses ONNX Runtime Web with WebAssembly (WASM) and WebGPU backends.

Fusion-in-Decoder

The fusion step is what distinguishes this architecture from simple RAG. Here is how passages flow through the system:

Step-by-step

  1. Query encoding: The user’s query is encoded into a 1024-dimensional BGE-M3 vector.
  2. Fan-out retrieval: The query vector is dispatched to all 20 Vectorize indexes in parallel. Each index returns its top matches.
  3. Laine ranking: Results from all indexes are merged and ranked by the hybrid 70/30 score. Top-K passages are selected.
  4. Independent encoding: Each of the K passages is independently encoded by the domain specialist. This produces K separate hidden-state tensors.
  5. Concatenation: All K encoded representations are concatenated along the sequence dimension into a single extended context.
  6. Decoder attention: The chat model (SmolLM or TinyLlama) attends to the entire concatenated context in one forward pass. Cross-attention layers see all passages simultaneously.
  7. Generation: The decoder generates an answer token by token, with attention weights distributed across all K passages. Citations are produced inline by tracking which passage each attention head focuses on.
  Standard RAG                       Fusion-in-Decoder (Pauhu)
  ============                       =========================

  Passage 1 --+                      Passage 1 --> [Encode] --+
  Passage 2 --+-- Concatenate        Passage 2 --> [Encode] --+-- Concatenate
  Passage 3 --+-- as text            Passage 3 --> [Encode] --+-- as tensors
              |                                                |
              v                                                v
  [Single prompt with                 [Decoder cross-attends
   all passages as                    to ALL encoded passages
   plain text context]                simultaneously]
              |                                                |
              v                                                v
  LLM generates answer               Decoder generates answer
  (context window limit              (scales with K, not
   constrains passage count)          context window length)

  Problem: passages compete          Advantage: each passage
  for context window space.          is encoded independently.
  Adding more passages               Adding more passages
  dilutes each one.                  does not dilute quality.
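The encode-independently-then-concatenate step can be sketched with toy tensors. This is a minimal illustration: the random "encoder" and the toy shapes stand in for the real domain specialist.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 16  # toy hidden size; real specialists use larger dimensions

def encode_passage(text: str, seq_len: int = 8) -> np.ndarray:
    # Stand-in for the domain specialist: one hidden-state
    # vector per token position.
    return rng.standard_normal((seq_len, HIDDEN))

def fuse(passages: list[str]) -> np.ndarray:
    # FiD fusion: encode each passage independently, then concatenate
    # along the sequence dimension into one extended context that the
    # decoder cross-attends to in a single forward pass.
    encoded = [encode_passage(p) for p in passages]
    return np.concatenate(encoded, axis=0)

context = fuse(["passage one", "passage two", "passage three"])
# The decoder now sees all 3 * 8 = 24 positions at once; adding a
# passage grows the context by 8 positions without a longer prompt.
```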

Why FiD matters for EU data

EU regulatory questions often require synthesising information from multiple legal instruments. A question about NIS2 implementation might need passages from the directive itself (EUR-Lex), national transposition measures (National Law), relevant CURIA case law, and ECHA guidance documents. Standard RAG would concatenate these as text, competing for a fixed context window. FiD encodes each independently, so the decoder can attend to all of them with equal fidelity.

Cloud vs Sovereign

Pauhu supports two deployment modes. The architecture is identical — only the infrastructure layer changes.

  Aspect             Cloud FiD                                       Sovereign FiD
  Retrieval brain    EU edge infrastructure (Vectorize indexes)      Local SQLite + local vector index
  Generation brain   User's browser (ONNX Runtime Web)               On-premises server (ONNX Runtime)
  Data residency     EU jurisdiction (edge) + user device (browser)  Entirely on-premises
  Activation         Default                                         PAUHU_SOVEREIGN=true
  Internet required  Yes (for retrieval)                             No (fully air-gapped capable)
  Model delivery     CDN → browser cache (IndexedDB)                 Docker volume mount
  IATE terminology   API lookup (EU edge)                            Local SQLite database

Cloud FiD (default)

In cloud mode, the retrieval brain runs on EU edge infrastructure. Query vectors are computed on the edge, fan-out search hits all 20 indexes, and ranked passages are returned to the browser. The generation brain then runs entirely in the browser — no document content is sent to any server during generation.

  Cloud FiD
  =========

  Browser                              EU Edge
  +---------------------+              +---------------------+
  | 1. User types query |              |                     |
  |    |                |   HTTPS/TLS  |                     |
  |    +--- query ------+------------->| 2. Encode query     |
  |                     |              |    |                 |
  |                     |              |    v                 |
  |                     |              | 3. Fan-out to 20    |
  |                     |              |    Vectorize indexes |
  |                     |              |    |                 |
  |                     |              |    v                 |
  |                     |              | 4. Laine rank       |
  |                     |   passages   |    |                 |
  | 5. Receive passages |<-------------+----+                 |
  |    |                |              |                     |
  |    v                |              +---------------------+
  | 6. Domain specialist|
  |    encodes each     |
  |    |                |
  |    v                |
  | 7. FiD: concatenate |
  |    + decode         |
  |    |                |
  |    v                |
  | 8. Answer + cites   |
  +---------------------+

Sovereign FiD

In sovereign mode, both brains run on-premises. The containerised deployment includes a gateway, MCP context server, translation server, and an optional sovereign LLM adapter. Set PAUHU_SOVEREIGN=true in your environment to activate.

  Sovereign FiD (air-gapped)
  ==========================

  On-premises server
  +-------------------------------------------------------+
  |                                                       |
  |  Gateway (orchestrator)                               |
  |    |                                                  |
  |    +--- query --> Local Retrieval Brain               |
  |    |              (SQLite FTS5 + local vector index)  |
  |    |                        |                         |
  |    |                  ranked passages                 |
  |    |                        |                         |
  |    +--- passages --> Local Generation Brain           |
  |                      (ONNX Runtime, domain specialist)|
  |                             |                         |
  |                      +------+------+                  |
  |                      |             |                  |
  |                   SmolLM       or sovereign LLM       |
  |                   (default)    (ALLaM, Mistral, etc.) |
  |                      |             |                  |
  |                      +------+------+                  |
  |                             |                         |
  |                      Answer + citations               |
  |                                                       |
  |  No external network access required                  |
  +-------------------------------------------------------+

The sovereign LLM adapter supports multiple model providers:

  Provider               Config value        Example models
  Local ONNX             onnx-local          SmolLM, TinyLlama, Mistral (quantised)
  Local Transformers     transformers-local  Any HuggingFace model
  OpenAI-compatible API  openai-compatible   ALLaM, SwissGPT, vLLM, Ollama

IATE Integration

IATE (Inter-Active Terminology for Europe) provides 2.4 million terms in 24 EU official languages. In the FiD pipeline, IATE is injected into both brains:

Retrieval brain: term expansion

When a query contains a term that exists in IATE, the retrieval brain expands the query with equivalent terms in the same and related languages. For example, a query containing “data controller” is expanded with “Verantwortlicher” (DE), “responsable du traitement” (FR), and “rekisterinpitäjä” (FI). This expansion happens at the embedding level — the expanded terms are encoded and their vectors are averaged with the original query vector.

  IATE Term Expansion (Retrieval Brain)
  =====================================

  Input query: "data controller obligations under GDPR"
                    |
                    v
  IATE lookup: "data controller" --> IATE ID 1688230
    |
    +-- EN: data controller
    +-- DE: Verantwortlicher
    +-- FR: responsable du traitement
    +-- FI: rekisterinpitäjä
    +-- ... (24 languages)
    |
    v
  Expanded query vector = avg(
    embed("data controller obligations under GDPR"),
    embed("Verantwortlicher obligations under GDPR"),
    embed("responsable du traitement obligations under GDPR")
  )
    |
    v
  Fan-out with expanded vector --> finds multilingual passages
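The embedding-level averaging can be sketched as follows; the `embed` callable is a stand-in for the BGE-M3 encoder.

```python
import numpy as np

def expand_query(base_query: str, term: str, translations: list[str],
                 embed) -> np.ndarray:
    # Average the embedding of the original query with embeddings of
    # the query rewritten using each IATE translation of the matched
    # term. `embed` maps a string to a vector (stand-in for BGE-M3).
    variants = [base_query] + [base_query.replace(term, t) for t in translations]
    return np.mean([embed(v) for v in variants], axis=0)
```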

Generation brain: term constraints

During generation, IATE terms serve as output constraints. When the decoder generates text that includes domain-specific terminology, the IATE database provides the canonical term form for the target language. This prevents the model from paraphrasing standardised terms — “data controller” remains “data controller”, not “person responsible for data”.
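As a simplification, the constraint can be approximated as a post-hoc substitution of paraphrases with canonical IATE forms. The real system applies constraints during decoding, and the term table here is purely illustrative.

```python
# Hypothetical canonical-term table keyed by paraphrase; in the real
# pipeline the canonical forms come from the IATE database.
IATE_CANONICAL = {
    "person responsible for data": "data controller",
    "processing manager": "data processor",
}

def enforce_terms(text: str) -> str:
    # Replace paraphrases of standardised terms with the canonical
    # IATE form so terminology stays consistent across answers.
    for paraphrase, canonical in IATE_CANONICAL.items():
        text = text.replace(paraphrase, canonical)
    return text
```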

API Endpoints

The FiD pipeline is accessible via the Pauhu EU API. All endpoints run in EU jurisdiction and return DSA Article 27 ranking metadata.

  Endpoint        Method  Description
  /v1/search      GET     Retrieval brain only — returns ranked passages with provenance metadata
  /v1/search/fid  POST    Full FiD pipeline — retrieval + generation, returns answer with inline citations
  /iate/lookup    GET     IATE term lookup — returns translations, definitions, reliability scores
  /v1/classify    POST    Topic classification — identifies which of 21 topic domains a text belongs to

Example: FiD search

  POST /v1/search/fid
  Content-Type: application/json
  Authorization: Bearer pk_your_api_key

  {
    "query": "What are the NIS2 incident reporting deadlines?",
    "sources": ["eurlex", "lex", "commission"],
    "lang": "en",
    "top_k": 10,
    "model": "smollm-135m"
  }

Response

  {
    "answer": "Under NIS2 (Directive 2022/2555), essential and important
      entities must report significant incidents in three stages:
      (1) early warning within 24 hours, (2) incident notification
      within 72 hours, and (3) final report within one month.",
    "citations": [
      {
        "source": "eurlex",
        "celex": "32022L2555",
        "article": "Art. 23(4)",
        "snippet": "...shall submit an early warning within 24 hours...",
        "semantic_score": 0.94,
        "keyword_score": 0.88,
        "combined_score": 0.92
      }
    ],
    "model": "smollm-135m",
    "passages_used": 10,
    "ranking_transparency": {
      "algorithm": "laine-v1",
      "semantic_weight": 0.70,
      "keyword_weight": 0.30,
      "indexes_queried": 3,
      "total_candidates": 847
    }
  }
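A minimal client sketch using only the standard library; the base URL is a placeholder, not the real API host.

```python
import json
import urllib.request

def build_fid_request(query: str, api_key: str,
                      base_url: str = "https://api.pauhu.example") -> urllib.request.Request:
    # base_url is a placeholder; substitute your actual Pauhu EU API host.
    body = json.dumps({
        "query": query,
        "sources": ["eurlex", "commission"],
        "lang": "en",
        "top_k": 10,
        "model": "smollm-135m",
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/search/fid",
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )

# To send the request and parse the answer:
# answer = json.load(urllib.request.urlopen(build_fid_request(
#     "What are the NIS2 incident reporting deadlines?", "pk_your_api_key")))
```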

Comparison with RAG

Pauhu’s FiD architecture addresses several limitations of standard Retrieval-Augmented Generation:

  Passage handling
    Standard RAG: passages concatenated as plain text in a single prompt
    Pauhu FiD:    each passage encoded independently, then fused at the decoder layer
  Scaling with K
    Standard RAG: more passages = longer prompt = diluted attention
    Pauhu FiD:    more passages = more encodings = richer cross-attention (no dilution)
  Context window
    Standard RAG: limited by LLM context window (4K–128K tokens)
    Pauhu FiD:    limited by memory for encoded tensors (typically supports 50+ passages)
  Domain adaptation
    Standard RAG: general-purpose embeddings for retrieval
    Pauhu FiD:    domain specialist re-encoding enriches passage representations
  Citation tracking
    Standard RAG: heuristic (search generated text for passage overlap)
    Pauhu FiD:    structural (attention weights directly indicate source passage)
  Multilingual
    Standard RAG: depends on the LLM's multilingual capability
    Pauhu FiD:    BGE-M3 retrieval + XLM-RoBERTa encoding + IATE term expansion across 24 languages
  Privacy
    Standard RAG: passages typically sent to a cloud LLM API
    Pauhu FiD:    generation runs in the browser (Cloud FiD) or on-premises (Sovereign FiD)
  Ranking transparency
    Standard RAG: opaque — no standard for explaining why a passage was selected
    Pauhu FiD:    DSA Article 27 compliant — semantic score, keyword score, provenance tier exposed per result

When to use each mode

Security