FiD Dual-Brain Architecture

Fusion-in-Decoder: two specialized brains — retrieval and generation — fused at the decoder layer for grounded, citation-backed EU intelligence.

Overview

Pauhu® uses a Fusion-in-Decoder (FiD) architecture — originally described by Izacard & Grave (2021) — adapted for EU regulatory intelligence. Instead of a single monolithic model, the system splits into two specialized brains that fuse at inference time:

  1. Retrieval Brain — The Compass search system. Queries fan out across all 20 Vectorize indexes simultaneously, and results are ranked by the Laine Algorithm. This is the encoder half of FiD.
  2. Generation Brain — ONNX specialist models running in the browser or on-premises. Domain-specific encoders plus chat models. This is the decoder half of FiD.

The key insight: the generation brain attends to all retrieved passages simultaneously, not sequentially. Each retrieved passage is independently encoded by the domain specialist, then all encoded representations are concatenated and fed to the decoder in a single forward pass. This is what distinguishes FiD from simple Retrieval-Augmented Generation (RAG).

  FiD Dual-Brain Architecture
  ===========================

  Query: "NIS2 implementation deadlines for essential entities"
    |
    v
  +---------------------------------------------------------------+
  |                     RETRIEVAL BRAIN                            |
  |                     (Compass Search)                           |
  |                                                                |
  |  Query --+--> [eurlex index  ] --+                             |
  |          +--> [curia index   ] --+                             |
  |          +--> [ted index     ] --+                             |
  |          +--> [commission    ] --+-- Laine Algorithm ------+   |
  |          +--> [echa index    ] --+  (70% semantic,         |   |
  |          +--> [ema index     ] --+   30% BM25)             |   |
  |          +--> [epo index     ] --+                         |   |
  |          +--> [ecb index     ] --+                         |   |
  |          +--> [... 12 more   ] --+                         |   |
  |                                                            |   |
  |          20 BGE-M3 indexes (1024d, cosine)                 |   |
  |                                                            v   |
  |                                                   Top-K passages|
  +------------------------------------------------------------+---+
                                                               |
                              +--------------------------------+
                              |
                              v
  +---------------------------------------------------------------+
  |                     GENERATION BRAIN                           |
  |                     (ONNX Specialists)                         |
  |                                                                |
  |  Top-K passages ---> [Encode each passage independently]       |
  |                           |                                    |
  |                           v                                    |
  |                 [Concatenate all encodings]                     |
  |                           |                                    |
  |                           v                                    |
  |              +---------------------------+                     |
  |              |  Domain Specialist        |                     |
  |              |  (e.g., Law, Finance)     |                     |
  |              |  XLM-RoBERTa, ~145MB INT8 |                     |
  |              +---------------------------+                     |
  |                           |                                    |
  |                           v                                    |
  |              +---------------------------+                     |
  |              |  Chat Model (decoder)     |                     |
  |              |  SmolLM-135M (free)       |                     |
  |              |  TinyLlama-1.1B (pro)     |                     |
  |              +---------------------------+                     |
  |                           |                                    |
  |                           v                                    |
  |              Answer with inline citations                      |
  |              [source: CELEX 32022L2555, Art. 21(1)]            |
  +---------------------------------------------------------------+

Retrieval Brain

The retrieval brain is responsible for finding relevant passages across all 20 EU data sources. It runs on edge infrastructure within EU jurisdiction.

Fan-out search

Every query is dispatched to all 20 Vectorize indexes in parallel. Each index contains BGE-M3 embeddings (1024 dimensions, cosine similarity) for one data product. The fan-out ensures that a query like “carbon border adjustment” finds results across EUR-Lex legislation, CURIA case law, TED procurement notices, and Eurostat data simultaneously.
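The fan-out can be sketched as a set of concurrent index queries. This is a minimal illustration: `query_index` is a stub standing in for the real Vectorize call, and only 8 of the 20 indexes are listed.

```python
import asyncio

# 8 of the 20 indexes, for brevity
INDEXES = ["eurlex", "curia", "ted", "commission", "echa", "ema", "epo", "ecb"]

async def query_index(index: str, query_vec: list[float], top_k: int = 20) -> list[dict]:
    # Stub for a real Vectorize query; returns one placeholder hit per index.
    await asyncio.sleep(0)  # stands in for the network round-trip
    return [{"index": index, "doc_id": f"{index}-001", "score": 0.5}]

async def fan_out(query_vec: list[float]) -> list[dict]:
    # Dispatch the same query vector to every index concurrently,
    # then flatten the per-index result lists for ranking.
    per_index = await asyncio.gather(*(query_index(ix, query_vec) for ix in INDEXES))
    return [hit for results in per_index for hit in results]

hits = asyncio.run(fan_out([0.0] * 1024))  # BGE-M3 vectors are 1024-dimensional
```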

  Parameter          Value
  Embedding model    BGE-M3 (BAAI)
  Dimensions         1024
  Similarity metric  Cosine
  Index count        20 (one per data product)
  Languages          24 EU official languages

Laine Algorithm

Raw vector similarity alone misses keyword-critical matches (e.g., CELEX numbers, article references). The Laine Algorithm therefore combines two signals:

  1. Semantic similarity: cosine similarity between the BGE-M3 query and document embeddings (weight 0.70).
  2. Keyword relevance: BM25 score over the query and document tokens (weight 0.30).

The hybrid score is computed as:

  score = 0.70 * cosine_similarity(query_emb, doc_emb)
        + 0.30 * bm25_score(query_tokens, doc_tokens)

Results from all 20 indexes are merged, deduplicated by document ID, and sorted by hybrid score. The top-K passages (default K=10) are forwarded to the generation brain.
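A sketch of the merge, dedupe, and rank step, assuming each hit carries pre-computed semantic and keyword scores normalised to [0, 1] (the field names are illustrative):

```python
def laine_score(semantic: float, keyword: float) -> float:
    # 70% semantic (cosine) + 30% keyword (BM25, assumed normalised to [0, 1])
    return 0.70 * semantic + 0.30 * keyword

def merge_and_rank(result_lists: list[list[dict]], top_k: int = 10) -> list[dict]:
    # Merge results from all indexes, dedupe by document ID (keeping
    # the best hybrid score), and return the top-K passages.
    best: dict[str, dict] = {}
    for results in result_lists:
        for hit in results:
            score = laine_score(hit["semantic"], hit["keyword"])
            seen = best.get(hit["doc_id"])
            if seen is None or score > seen["combined"]:
                best[hit["doc_id"]] = {**hit, "combined": score}
    return sorted(best.values(), key=lambda h: h["combined"], reverse=True)[:top_k]
```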

Layer 2: Domain embeddings

Domain specialist models enrich retrieval quality via Layer 2 embeddings. When a query is classified as belonging to a specific topic (e.g., Domain 12: Law), the corresponding specialist generates domain-tuned embeddings that are blended with the base BGE-M3 score. This narrows the semantic gap for domain-specific vocabulary — for instance, “consideration” means something very different in contract law vs. general usage.
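One plausible shape for the blend is a linear interpolation between the base and domain scores. This is a sketch only: the actual blend weight is not specified here, and 0.3 is a placeholder.

```python
def layer2_blend(base_score: float, domain_score: float,
                 domain_weight: float = 0.3) -> float:
    # Blend the base BGE-M3 similarity with the domain specialist's
    # Layer 2 score. The 0.3 weight is illustrative, not documented.
    return (1 - domain_weight) * base_score + domain_weight * domain_score
```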

Ranking transparency (DSA Article 27)

Every search result includes provenance metadata (semantic score, keyword score, and provenance tier) to comply with the Digital Services Act ranking transparency requirements.

Generation Brain

The generation brain runs entirely in the user’s browser (via ONNX Runtime Web) or on-premises in a containerised deployment. No document content leaves the user’s device during generation.

Domain specialists

21 fine-tuned specialist models cover all EuroVoc topic domains. Each specialist shares an XLM-RoBERTa backbone but is fine-tuned on domain-specific EU corpora:

  Property             Value
  Backbone             XLM-RoBERTa (cross-lingual)
  Quantisation         INT8
  Size per specialist  ~145 MB
  Specialist count     21 (one per topic domain)
  Runtime              ONNX Runtime Web (browser) or ONNX Runtime (server)

In the FiD pipeline, the specialist serves two purposes:

  1. Passage re-encoding: Each retrieved passage is re-encoded through the domain specialist, producing richer representations than the base embedding model alone.
  2. Layer 2 scoring: The domain-specific encoding feeds back into passage ranking, allowing the system to promote passages that are more relevant within the identified domain.

Chat models (decoder)

After specialist encoding, a generative chat model synthesises the answer from all passage encodings:

  Model      Parameters  Tier  Use case
  SmolLM     135M        Free  Short answers, summaries, term definitions
  TinyLlama  1.1B        Pro   Multi-paragraph analysis, cross-reference synthesis

Both models run as ONNX graphs in the browser. The pro-tier model loads on demand — only downloaded when the user first triggers a pro-level query.

Browser-native execution

The generation brain uses ONNX Runtime Web with WebAssembly (WASM) and WebGPU backends.

Fusion-in-Decoder

The fusion step is what distinguishes this architecture from simple RAG. Here is how passages flow through the system:

Step-by-step

  1. Query encoding: The user’s query is encoded into a 1024-dimensional BGE-M3 vector.
  2. Fan-out retrieval: The query vector is dispatched to all 20 Vectorize indexes in parallel. Each index returns its top matches.
  3. Laine ranking: Results from all indexes are merged and ranked by the hybrid 70/30 score. Top-K passages are selected.
  4. Independent encoding: Each of the K passages is independently encoded by the domain specialist. This produces K separate hidden-state tensors.
  5. Concatenation: All K encoded representations are concatenated along the sequence dimension into a single extended context.
  6. Decoder attention: The chat model (SmolLM or TinyLlama) attends to the entire concatenated context in one forward pass. Cross-attention layers see all passages simultaneously.
  7. Generation: The decoder generates an answer token by token, with attention weights distributed across all K passages. Citations are produced inline by tracking which passage each attention head focuses on.
  Standard RAG                       Fusion-in-Decoder (Pauhu)
  ============                       =========================

  Passage 1 --+                      Passage 1 --> [Encode] --+
  Passage 2 --+-- Concatenate        Passage 2 --> [Encode] --+-- Concatenate
  Passage 3 --+-- as text            Passage 3 --> [Encode] --+-- as tensors
              |                                                |
              v                                                v
  [Single prompt with                 [Decoder cross-attends
   all passages as                    to ALL encoded passages
   plain text context]                simultaneously]
              |                                                |
              v                                                v
  LLM generates answer               Decoder generates answer
  (context window limit              (scales with K, not
   constrains passage count)          context window length)

  Problem: passages compete          Advantage: each passage
  for context window space.          is encoded independently.
  Adding more passages               Adding more passages
  dilutes each one.                  does not dilute quality.
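The encode-independently-then-concatenate step can be sketched with toy tensors. This is a minimal illustration: the random "encoder" and the toy shapes stand in for the real domain specialist.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 16  # toy hidden size; real specialists use larger dimensions

def encode_passage(text: str, seq_len: int = 8) -> np.ndarray:
    # Stand-in for the domain specialist: one hidden-state
    # vector per token position.
    return rng.standard_normal((seq_len, HIDDEN))

def fuse(passages: list[str]) -> np.ndarray:
    # FiD fusion: encode each passage independently, then concatenate
    # along the sequence dimension into one extended context that the
    # decoder cross-attends to in a single forward pass.
    encoded = [encode_passage(p) for p in passages]
    return np.concatenate(encoded, axis=0)

context = fuse(["passage one", "passage two", "passage three"])
# The decoder now sees all 3 * 8 = 24 positions at once; adding a
# passage grows the context by 8 positions without a longer prompt.
```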

Why FiD matters for EU data

EU regulatory questions often require synthesising information from multiple legal instruments. A question about NIS2 implementation might need passages from the directive itself (EUR-Lex), national transposition measures (National Law), relevant CURIA case law, and ECHA guidance documents. Standard RAG would concatenate these as text, competing for a fixed context window. FiD encodes each independently, so the decoder can attend to all of them with equal fidelity.

Cloud vs Sovereign

Pauhu supports two deployment modes. The architecture is identical — only the infrastructure layer changes.

  Aspect             Cloud FiD                                       Sovereign FiD
  Retrieval brain    EU edge infrastructure (Vectorize indexes)      Local SQLite + local vector index
  Generation brain   User's browser (ONNX Runtime Web)               On-premises server (ONNX Runtime)
  Data residency     EU jurisdiction (edge) + user device (browser)  Entirely on-premises
  Activation         Default                                         PAUHU_SOVEREIGN=true
  Internet required  Yes (for retrieval)                             No (fully air-gapped capable)
  Model delivery     CDN → browser cache (IndexedDB)                 Docker volume mount
  IATE terminology   API lookup (EU edge)                            Local SQLite database

Cloud FiD (default)

In cloud mode, the retrieval brain runs on EU edge infrastructure. Query vectors are computed on the edge, fan-out search hits all 20 indexes, and ranked passages are returned to the browser. The generation brain then runs entirely in the browser — no document content is sent to any server during generation.

  Cloud FiD
  =========

  Browser                              EU Edge
  +---------------------+              +---------------------+
  | 1. User types query |              |                     |
  |    |                |   HTTPS/TLS  |                     |
  |    +--- query ------+------------->| 2. Encode query     |
  |                     |              |    |                 |
  |                     |              |    v                 |
  |                     |              | 3. Fan-out to 20    |
  |                     |              |    Vectorize indexes |
  |                     |              |    |                 |
  |                     |              |    v                 |
  |                     |              | 4. Laine rank       |
  |                     |   passages   |    |                 |
  | 5. Receive passages |<-------------+----+                 |
  |    |                |              |                     |
  |    v                |              +---------------------+
  | 6. Domain specialist|
  |    encodes each     |
  |    |                |
  |    v                |
  | 7. FiD: concatenate |
  |    + decode         |
  |    |                |
  |    v                |
  | 8. Answer + cites   |
  +---------------------+

Sovereign FiD

In sovereign mode, both brains run on-premises. The containerised deployment includes a gateway, MCP context server, translation server, and an optional sovereign LLM adapter. Set PAUHU_SOVEREIGN=true in your environment to activate.

  Sovereign FiD (air-gapped)
  ==========================

  On-premises server
  +-------------------------------------------------------+
  |                                                       |
  |  Gateway (orchestrator)                               |
  |    |                                                  |
  |    +--- query --> Local Retrieval Brain               |
  |    |              (SQLite FTS5 + local vector index)  |
  |    |                        |                         |
  |    |                  ranked passages                 |
  |    |                        |                         |
  |    +--- passages --> Local Generation Brain           |
  |                      (ONNX Runtime, domain specialist)|
  |                             |                         |
  |                      +------+------+                  |
  |                      |             |                  |
  |                   SmolLM       or sovereign LLM       |
  |                   (default)    (ALLaM, Mistral, etc.) |
  |                      |             |                  |
  |                      +------+------+                  |
  |                             |                         |
  |                      Answer + citations               |
  |                                                       |
  |  No external network access required                  |
  +-------------------------------------------------------+

The sovereign LLM adapter supports multiple model providers:

  Provider               Config value        Example models
  Local ONNX             onnx-local          SmolLM, TinyLlama, Mistral (quantised)
  Local Transformers     transformers-local  Any HuggingFace model
  OpenAI-compatible API  openai-compatible   ALLaM, SwissGPT, vLLM, Ollama

IATE Integration

IATE (Inter-Active Terminology for Europe) provides 2.4 million terms in 24 EU official languages. In the FiD pipeline, IATE is injected into both brains:

Retrieval brain: term expansion

When a query contains a term that exists in IATE, the retrieval brain expands the query with equivalent terms in the same and related languages. For example, a query containing “data controller” is expanded with “Verantwortlicher” (DE), “responsable du traitement” (FR), and “rekisterinpitäjä” (FI). This expansion happens at the embedding level — the expanded terms are encoded and their vectors are averaged with the original query vector.

  IATE Term Expansion (Retrieval Brain)
  =====================================

  Input query: "data controller obligations under GDPR"
                    |
                    v
  IATE lookup: "data controller" --> IATE ID 1688230
    |
    +-- EN: data controller
    +-- DE: Verantwortlicher
    +-- FR: responsable du traitement
    +-- FI: rekisterinpitäjä
    +-- ... (24 languages)
    |
    v
  Expanded query vector = avg(
    embed("data controller obligations under GDPR"),
    embed("Verantwortlicher obligations under GDPR"),
    embed("responsable du traitement obligations under GDPR")
  )
    |
    v
  Fan-out with expanded vector --> finds multilingual passages
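The embedding-level averaging can be sketched as follows; the `embed` callable is a stand-in for the BGE-M3 encoder.

```python
import numpy as np

def expand_query(base_query: str, term: str, translations: list[str],
                 embed) -> np.ndarray:
    # Average the embedding of the original query with embeddings of
    # the query rewritten using each IATE translation of the matched
    # term. `embed` maps a string to a vector (stand-in for BGE-M3).
    variants = [base_query] + [base_query.replace(term, t) for t in translations]
    return np.mean([embed(v) for v in variants], axis=0)
```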

Generation brain: term constraints

During generation, IATE terms serve as output constraints. When the decoder generates text that includes domain-specific terminology, the IATE database provides the canonical term form for the target language. This prevents the model from paraphrasing standardised terms — “data controller” remains “data controller”, not “person responsible for data”.
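As a simplification, the constraint can be approximated as a post-hoc substitution of paraphrases with canonical IATE forms. The real system applies constraints during decoding, and the term table here is purely illustrative.

```python
# Hypothetical canonical-term table keyed by paraphrase; in the real
# pipeline the canonical forms come from the IATE database.
IATE_CANONICAL = {
    "person responsible for data": "data controller",
    "processing manager": "data processor",
}

def enforce_terms(text: str) -> str:
    # Replace paraphrases of standardised terms with the canonical
    # IATE form so terminology stays consistent across answers.
    for paraphrase, canonical in IATE_CANONICAL.items():
        text = text.replace(paraphrase, canonical)
    return text
```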

API Endpoints

The FiD pipeline is accessible via the Pauhu EU API. All endpoints run in EU jurisdiction and return DSA Article 27 ranking metadata.

  Endpoint        Method  Description
  /v1/search      GET     Retrieval brain only — returns ranked passages with provenance metadata
  /v1/search/fid  POST    Full FiD pipeline — retrieval + generation, returns answer with inline citations
  /iate/lookup    GET     IATE term lookup — returns translations, definitions, reliability scores
  /v1/classify    POST    Topic classification — identifies which of 21 topic domains a text belongs to

Example: FiD search

  POST /v1/search/fid
  Content-Type: application/json
  Authorization: Bearer pk_your_api_key

  {
    "query": "What are the NIS2 incident reporting deadlines?",
    "sources": ["eurlex", "lex", "commission"],
    "lang": "en",
    "top_k": 10,
    "model": "smollm-135m"
  }

Response

  {
    "answer": "Under NIS2 (Directive 2022/2555), essential and important
      entities must report significant incidents in three stages:
      (1) early warning within 24 hours, (2) incident notification
      within 72 hours, and (3) final report within one month.",
    "citations": [
      {
        "source": "eurlex",
        "celex": "32022L2555",
        "article": "Art. 23(4)",
        "snippet": "...shall submit an early warning within 24 hours...",
        "semantic_score": 0.94,
        "keyword_score": 0.88,
        "combined_score": 0.92
      }
    ],
    "model": "smollm-135m",
    "passages_used": 10,
    "ranking_transparency": {
      "algorithm": "laine-v1",
      "semantic_weight": 0.70,
      "keyword_weight": 0.30,
      "indexes_queried": 3,
      "total_candidates": 847
    }
  }
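A minimal client sketch using only the standard library; the base URL is a placeholder, not the real API host.

```python
import json
import urllib.request

def build_fid_request(query: str, api_key: str,
                      base_url: str = "https://api.pauhu.example") -> urllib.request.Request:
    # base_url is a placeholder; substitute your actual Pauhu EU API host.
    body = json.dumps({
        "query": query,
        "sources": ["eurlex", "commission"],
        "lang": "en",
        "top_k": 10,
        "model": "smollm-135m",
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/search/fid",
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )

# To send the request and parse the answer:
# answer = json.load(urllib.request.urlopen(build_fid_request(
#     "What are the NIS2 incident reporting deadlines?", "pk_your_api_key")))
```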

Comparison with RAG

Pauhu’s FiD architecture addresses several limitations of standard Retrieval-Augmented Generation:

  Passage handling
    Standard RAG: passages concatenated as plain text in a single prompt
    Pauhu FiD:    each passage encoded independently, then fused at the decoder layer
  Scaling with K
    Standard RAG: more passages = longer prompt = diluted attention
    Pauhu FiD:    more passages = more encodings = richer cross-attention (no dilution)
  Context window
    Standard RAG: limited by LLM context window (4K–128K tokens)
    Pauhu FiD:    limited by memory for encoded tensors (typically supports 50+ passages)
  Domain adaptation
    Standard RAG: general-purpose embeddings for retrieval
    Pauhu FiD:    domain specialist re-encoding enriches passage representations
  Citation tracking
    Standard RAG: heuristic (search generated text for passage overlap)
    Pauhu FiD:    structural (attention weights directly indicate source passage)
  Multilingual
    Standard RAG: depends on the LLM's multilingual capability
    Pauhu FiD:    BGE-M3 retrieval + XLM-RoBERTa encoding + IATE term expansion across 24 languages
  Privacy
    Standard RAG: passages typically sent to a cloud LLM API
    Pauhu FiD:    generation runs in the browser (Cloud FiD) or on-premises (Sovereign FiD)
  Ranking transparency
    Standard RAG: opaque — no standard for explaining why a passage was selected
    Pauhu FiD:    DSA Article 27 compliant — semantic score, keyword score, provenance tier exposed per result

When to use each mode

Security