Architecture

Murmuration: self-organizing infrastructure for EU data intelligence. Zone-based security, automated task discovery, and multi-reviewer quality gates. PQ Ready EU Only

Overview

Pauhu® infrastructure follows a murmuration pattern — inspired by the self-organizing flight patterns of starlings. Each terminal (processing unit) operates independently within its security zone, communicates via structured messages, and adapts to changing workload through automated task discovery and health monitoring.

Key principles:

Zone isolation: Three security zones (Protected, Controlled, External) with conduit-controlled data flow per IEC 62443-3-3
Model Last: ML inference runs only after all security gates pass verification
Browser-native: ONNX models run in the browser — no data leaves the user's device
EU jurisdiction: All infrastructure runs on Cloudflare Workers in EU regions
Post-quantum ready: AES-256 at rest (PQ-safe), hybrid PQ TLS on edge (X25519Kyber768), algorithm-agile crypto layer. See security docs.

Zone model (IEC 62443-3-3)

Pauhu implements IEC 62443-3-3 zone-based security with four distinct zones and conduit-controlled data flow between them.

  +-----------+    +------------+    +----------+
  | Protected | -> | Controlled | -> | External |
  |  Zone     |    |   Zone     |    |   Zone   |
  |  SL-4     |    |  SL-2/3    |    |   SL-1   |
  +-----------+    +------------+    +----------+
                         |
                   ==============
                   ||  Conduit  ||
                   ==============
                         |
  +-----------+    +----------+
  | Business  | <- |  Audit   |
  |  Zone     |    |   Zone   |
  |  SL-0     |    |   FR6    |
  +-----------+    +----------+

Security Level	Target
SL-4	Protection against state-sponsored attack
SL-3	Protection against sophisticated attack
SL-2	Protection against intentional attack
SL-1	Protection against casual violation
SL-0	No specific requirements

Gate orchestration

A central orchestrator routes each request through zone-specific security gates. Gates pass before any ML model runs (Model Last principle).

                 +-- Protected ------+-- Architecture
                 |   (Constraints)    +-- ML Pipeline
                 |                    +-- Data Engineering
                 |                    +-- Security Audit
                 |                    +-- Legal Review
                 |                    +-- Red Team
                 |
Orchestrator ----+-- Controlled -----+-- Development
                 |   (Conduit)       +-- Operations
                 |                    +-- Infrastructure
                 |                    +-- Product
                 |
                 +-- External -------+-- Frontend
                      (Actions)      +-- Documentation
                                     +-- Internationalization

  Model Last: All security gates pass FIRST → then AI inference

Gate classification

Each gate enforces one of four deontic modalities (from modal logic), determining how strictly it controls data flow:

Modality	Behaviour	Example
Prohibition	Block if violated (MUST NOT)	Reject requests containing PII in search queries
Obligation	Require completion (MUST)	Enforce data license terms before export
Permission	Approve if proposed (MAY)	Allow optional semantic ranking add-on
Exemption	No action needed (EXEMPT)	Pass-through for static documentation

Document extraction on Hetzner

Document extraction is a headless Chrome bridge running on a dedicated server in Helsinki (EU jurisdiction). It provides server-side document extraction, PDF rendering, and accessibility tree snapshotting.

  +-----------------------+        +----------------------+
  | Edge Workers          |        | Helsinki Server      |
  | (EU edge)             |        | (EU datacenter)      |
  |                       |        |                      |
  |  API Router           |        |  Document extraction Bridge     |
  |    |                  |        |    |                  |
  |    +-- /extract ------+------->|    +-- Chrome CDP     |
  |    +-- /pdf-render ---+------->|    +-- Tab lifecycle  |
  |    |                  |        |    +-- Text extract   |
  |  Vision Service       |        |    +-- PDF render     |
  |    +-- annotate ------+--+     |                      |
  |    +-- terminology ---+--+     +----------------------+
  |    |                  |  |
  |  Annotation Service   |  |     +-----------------------+
  |    +-- STAM sidecar --+--+---->| Object Storage (EU)   |
  |                       |        | pinchtab/{product}/   |
  |  Index Service        |        |   {hash}.stam.json    |
  |    +-- DB + Vectorize |        +-----------------------+
  +-----------------------+

Data flow

Client sends URL to /extract or /pdf-render via the API router
The vision service opens a Chrome tab on Document extraction (Helsinki) via authenticated bridge token
Chrome navigates to the URL, extracts text (or renders PDF)
Tab is closed in background — stateless, no data retained
If annotate: true, text is sent to the annotation service for topic and deontic classification
If terminology: true, IATE terms are extracted via the terminology service
For /extract-and-index, STAM sidecar JSON is written to object storage for indexing

Security

Document extraction bridge is authenticated with a rotating token (rotated quarterly)
Documents are processed in ephemeral Chrome tabs — no persistent storage on the server
All data stays in EU jurisdiction (Helsinki datacenter)
Document extraction server is hardened per IEC 62443-3-3 (controlled zone)

Automated task discovery

The murmuration health loop automatically discovers work from four signal sources and creates tasks in the registry. Each task is routed to the appropriate terminal based on its zone and type.

Source	Signal	Routing
Error collector	>10 same error in 24h	By phase: model→ML team, auth→security, ui→frontend, api→development
Git log	FIXME / TODO / HACK in recent commits	Development team
CI failures	Same workflow fails 3+ times in 7d	Operations team
Stale PRs	Open >7 days, no updates	Original author

Failed tasks are automatically retried up to 3 times. After 3 failures, the task is marked as blocked and requires human intervention.

Multi-reviewer quality

Every pull request is reviewed by three independent reviewers:

CodeRabbit: Style, patterns, and best practices
Pauhu Review: Multi-perspective analysis (security, compliance, correctness, performance, maintainability) with EU AI Act compliance checks
Codex Review: Edge cases, logic errors, race conditions, and resource leaks

A consensus job runs after all three reviewers complete. If 2 out of 3 flag critical issues, the PR is blocked until the issues are resolved.

Health loop

A 10-minute cron job runs the murmuration health check across the entire infrastructure:

Open PR status and check rollup
CI workflow failure trends
Task registry: auto-respawn failed tasks (up to 3 attempts)
Task discovery: scan all 4 signal sources

Results are written to a JSON report and committed back to the repository by the murmuration-bot.

Data pipeline

Every EU document flows through a 6-stage pipeline from source to searchable index. This is the same pipeline for all 20 data sources — only the seed script and product code differ.

  Seed Script         Object Storage       Queue              Annotation Service
  (per source)        (per product)        (per product)      (topic + deontic)
       |                   |                    |                    |
  Fetch from EU    Upload with         Storage event         Classify:
  institution      metadata            notification          - language detection
  (SPARQL, REST,   (celex_id,          triggers              - topic domain (1-21)
   SDMX, OAI-PMH)  product, lang)      consumer              - deontic modality
       |                   |                    |               - word/char count
       v                   v                    v                    |
                                                                     v
  STAM Sidecar       Index Service        Database            Vectorize Index
  (.stam.json)       (hybrid search)      (per product)       (BGE-M3, 1024d)
       |                   |                    |                    |
  Annotation         Index into           Structured           Semantic search
  stored next to     DB + Vectorize       metadata for         across 20 indexes
  source doc         (70% semantic,       SQL queries          via Laine Algorithm
  in storage         30% BM25)

Stage 1: Seed

Source-specific seed scripts fetch documents from EU institutions via their official APIs (SPARQL for EUR-Lex/CELLAR, REST for TED/ECHA/EMA, SDMX for ECB/Eurostat, OAI-PMH for CORDIS). Each document is uploaded to its product-specific R2 bucket with customMetadata containing the CELEX ID, product code, language, and source URL.

Stage 2: Queue

Storage event notifications trigger a queue message for each new or updated document. The 20 products are split across two queue consumers for load balancing (products A–E and products E–W).

Stage 3: Annotate

The annotation worker classifies each document:

Language detection: Identifies the document language (24 EU official languages)
EuroVoc classification: Assigns one of 21 topic domains
Deontic modality: Classifies as prohibition, obligation, permission, or exemption
Metadata: Word count, character count, provenance tier (NATIVE 1.0, PARSED 0.95, KEYWORD ≤0.9)

Stage 4: STAM sidecar

Annotations are stored as .stam.json sidecar files next to the source document in R2. The STAM (Stand-off Text Annotation Model) format keeps annotations separate from source text, enabling non-destructive updates and full provenance tracking.

Stage 5: Index

The indexing service reads annotated documents and indexes them into:

D1 databases: Structured metadata for SQL queries (per product)
Vectorize indexes: BGE-M3 embeddings (1024 dimensions, cosine similarity) for semantic search

Stage 6: Search

The Laine Algorithm fans out queries across all 20 Vectorize indexes simultaneously, combining 70% semantic similarity with 30% BM25 keyword matching. Results include DSA Article 27 ranking transparency metadata.

Dead letter queue

Documents that fail annotation after 3 retries are routed to a dead letter queue for manual inspection. The /backfill admin endpoint can re-index documents after the issue is resolved.

DSPy modules

Pauhu uses Stanford DSPy for all prompt optimization and ML orchestration. The codebase contains 486 DSPy signatures, 401 modules, and 34 orchestrators.

DSPy modules replace traditional prompt engineering with programmatic, optimizable pipelines. Each module defines typed signatures (input/output contracts) and can be composed into larger orchestration flows.

Component	Count	Purpose
Signatures	486	Typed input/output contracts
Modules	401	Composable processing units
Orchestrators	34	Multi-module coordination flows

The MetaPromptEngine handles prompt optimization across all services, ensuring consistent quality and EU AI Act transparency compliance.