Metadata Extraction Rulesets

Metadata rulesets let you teach DocBot to extract structured fields from your documents during ingestion. Rules are configured per-source (connector or integration), so you can attach extraction logic to local connectors as well as GitHub, Slack, and Google Drive integrations. Once extracted, these fields power filtered search, sorted retrieval, and document grouping — turning DocBot from a generic Q&A bot into a domain-aware document analysis tool.

Why Rulesets Matter

Without rulesets, DocBot only understands documents as unstructured text. It can find semantically similar chunks, but it cannot reliably answer:

"What does Article 25 of the Copyright Act say?" — semantic search matches legal language, not article numbers
"Show me the last 3 visit notes for patient P-1234" — requires filtering by patient ID and sorting by date
"Compare Article 10 across DSG and URG" — requires grouping by law name

With rulesets, every chunk in Qdrant carries a metadata payload like:

{
  "entity_id": "p-1234",
  "doc_type": "visit_note",
  "doc_date": "2024-01-15",
  "gesetz": "urheberrechtsgesetz",
  "artikel": "25"
}

The AI's query classifier detects the user's intent and automatically uses these fields to filter, sort, or group before (or instead of) semantic search.
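
As a rough illustration of how extracted fields could become a retrieval filter, the sketch below builds a dictionary in the shape of Qdrant's JSON filter syntax (`must` conditions with `key` and `match.value`). The function name `build_metadata_filter` is hypothetical, and how DocBot actually builds its filters is not described on this page.

```python
def build_metadata_filter(fields: dict) -> dict:
    """Turn fields detected in a query into a Qdrant-style `must` filter."""
    return {
        "must": [
            # Stored values are lowercase, so the query value is lowercased too.
            {"key": key, "match": {"value": str(value).lower()}}
            for key, value in sorted(fields.items())
        ]
    }


# Hypothetical classifier output for "Article 25 of the Copyright Act".
query_filter = build_metadata_filter(
    {"gesetz": "Urheberrechtsgesetz", "artikel": "25"}
)
print(query_filter)
```

Every condition under `must` has to hold, which is what narrows retrieval to a single article of a single law.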

How to Configure Rules via UI

1. Navigate to your source in the admin panel (Connector or Integration).
2. Open the source settings → Metadata Extraction Rules.
3. Click Add Rule and fill in:
   - Field name: the metadata key (e.g. entity_id, artikel, gesetz, doc_date)
   - Pattern: a Python regex with one capture group; the captured text becomes the value
   - Scope: where to search: full (entire document), header (first 512 chars), or filename
   - Level: document (runs once, stamps all chunks) or chunk (runs per chunk with inheritance)
   - Chunk boundary: when enabled, each match marks a pre-split boundary before normal chunking runs
   - Priority: higher numbers are tried first; first match per field wins
4. Use the Test button to validate your pattern against sample text before saving.
5. After saving rules, re-sync the source to apply them to existing documents.

Scope Reference

Scope	What it searches	Best for
`full`	Entire extracted text of the document	General-purpose: IDs, dates, types anywhere in the document
`header`	First 512 characters only	Structured headers like `PATIENT RECORD`, `INVOICE`, law titles
`filename`	Source filename (basename only)	Naming conventions: `invoice_march.pdf`, `patient_P1234_report.docx`
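
The three scopes can be pictured as three ways of choosing the text a pattern runs against. The following is a minimal sketch under that reading; `text_for_scope` and its parameters are hypothetical names, not DocBot internals.

```python
import os

HEADER_CHARS = 512  # header length stated in the scope table


def text_for_scope(scope: str, document_text: str, source_path: str) -> str:
    """Return the text a rule with the given scope would be matched against."""
    if scope == "full":
        return document_text
    if scope == "header":
        return document_text[:HEADER_CHARS]
    if scope == "filename":
        # Basename only: directory components are dropped.
        return os.path.basename(source_path)
    raise ValueError(f"unknown scope: {scope!r}")


print(text_for_scope("filename", "", "/data/in/patient_P1234_report.docx"))
```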

Level Reference

Level	Behaviour	Best for
`document`	Runs once against the full document text. The extracted value is stamped on all chunks of that document.	Static document-wide metadata: patient ID, law name, document type
`chunk`	Runs against each chunk individually. If a chunk doesn't match, it inherits the last matched value from a preceding chunk.	Section-level metadata: article numbers that change within a document
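
The inheritance behaviour of chunk-level rules can be sketched as a single pass that remembers the last matched value. This is an illustration only: `stamp_chunk_level` is a hypothetical helper, and leaving chunks before the first match unset (`None`) is an assumption.

```python
import re
from typing import List, Optional


def stamp_chunk_level(chunks: List[str], pattern: str) -> List[Optional[str]]:
    """Assign a value to every chunk, inheriting from the preceding match."""
    regex = re.compile(pattern, re.IGNORECASE)
    values: List[Optional[str]] = []
    last: Optional[str] = None  # assumption: nothing to inherit before the first match
    for chunk in chunks:
        match = regex.search(chunk)
        if match:
            last = match.group(1).lower()
        values.append(last)
    return values


chunks = ["Art. 1 Zweck ...", "... continued text ...", "Art. 2 Geltungsbereich ..."]
artikel_per_chunk = stamp_chunk_level(chunks, r"Art\.?\s*(\d+)")
print(artikel_per_chunk)
```

The middle chunk has no article header of its own, so it inherits the value from the chunk before it.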

Chunk Boundary Splitting

chunk_boundary lets a metadata rule influence indexing segmentation directly.

When a rule is configured with "chunk_boundary": true, every regex match becomes a document pre-split boundary before the regular chunking strategy runs. This keeps final chunks aligned with logical document structure instead of splitting mid-section.

Use this for documents with predictable repeating section headers, for example:

Legal text with Art. 1, Art. 2, Art. 3
Markdown handbooks with repeated ## Section headings

Field	Type	Required	Description
`chunk_boundary`	boolean	No	When `true`, rule matches act as document pre-split boundaries during indexing. Default `false`.

Note: chunk_boundary changes chunk boundaries only. Metadata extraction behavior (field_name, level, scope, priority) remains the same.
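
One way to picture the pre-split step is to cut the document at the start of every match and hand the resulting segments to the regular chunker. The sketch below assumes exactly that; `presplit_on_boundaries` is a hypothetical name and the real splitter may treat edge cases differently.

```python
import re
from typing import List


def presplit_on_boundaries(text: str, pattern: str) -> List[str]:
    """Split text so that each regex match starts a new segment."""
    regex = re.compile(pattern, re.IGNORECASE)
    starts = [m.start() for m in regex.finditer(text)]
    if not starts:
        return [text]
    # Cut points: document start, every match start, document end.
    cuts = sorted(set([0] + starts + [len(text)]))
    segments = [text[a:b] for a, b in zip(cuts, cuts[1:])]
    return [segment for segment in segments if segment.strip()]


law_text = "Preamble\nArt. 1 Purpose\nArt. 2 Scope\n"
segments = presplit_on_boundaries(law_text, r"Art\.?\s*(\d+)")
print(segments)
```

Text before the first match (the preamble here) is kept as its own segment rather than dropped.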

Global Rule Loading for Classification

At query time, the classifier now loads metadata rules globally across all connectors and integrations. Users no longer need to specify which source their query targets — matching rules from any synced connector or integration can be used for intent detection and filter extraction.

Priority

Rules are sorted by priority descending (highest number first)
For each field_name, the first matching rule wins — remaining rules for the same field are skipped
Different fields are evaluated independently
Default priority is 0

Example: given two rules for entity_id, Patient ID: (...) at priority 5 and Member: (...) at priority 10, the Member pattern is tried first; if it matches, the Patient pattern is never evaluated.
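
The priority rules above can be sketched as a sort followed by a first-match-wins loop per field. The function name and the concrete capture groups `(\S+)` are illustrative assumptions; the text only shows the patterns as `(...)`.

```python
import re
from typing import Dict, List


def extract_fields(rules: List[dict], text: str) -> Dict[str, str]:
    """Apply rules in descending priority; the first match per field wins."""
    result: Dict[str, str] = {}
    ordered = sorted(rules, key=lambda r: r.get("priority", 0), reverse=True)
    for rule in ordered:
        field = rule["field_name"]
        if field in result:
            continue  # a higher-priority rule already set this field
        match = re.search(rule["pattern"], text, re.IGNORECASE)
        if match:
            result[field] = match.group(1).lower()
    return result


rules = [
    {"field_name": "entity_id", "pattern": r"Patient ID:\s*(\S+)", "priority": 5},
    {"field_name": "entity_id", "pattern": r"Member:\s*(\S+)", "priority": 10},
]
fields = extract_fields(rules, "Member: M-77\nPatient ID: P-1234")
print(fields)
```

Both patterns would match this text, but only the Member rule contributes because it has the higher priority.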

Pattern

A Python-compatible regex:

Should contain one capture group (...); the captured text becomes the metadata value
If no capture group is present, the full match is used instead
Patterns are case-insensitive
Patterns are validated on creation, and invalid ones are rejected with HTTP 422; at extraction time an invalid pattern is skipped and logged, and never crashes ingestion
All extracted values are normalized to lowercase before storage
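
Taken together, these pattern semantics could look like the following sketch, which uses Python's standard `re` module. `apply_pattern` and the logger name are hypothetical; the actual extraction code is not shown in this document.

```python
import logging
import re
from typing import Optional

logger = logging.getLogger("metadata_rules")  # hypothetical logger name


def apply_pattern(pattern: str, text: str) -> Optional[str]:
    """Return the lowercase extracted value, or None if nothing was extracted."""
    try:
        regex = re.compile(pattern, re.IGNORECASE)  # patterns are case-insensitive
    except re.error as exc:
        # Invalid patterns are skipped and logged instead of failing ingestion.
        logger.warning("skipping invalid pattern %r: %s", pattern, exc)
        return None
    match = regex.search(text)
    if match is None:
        return None
    # Use the capture group when one exists, otherwise the full match.
    value = match.group(1) if regex.groups >= 1 else match.group(0)
    return value.lower() if value is not None else None


print(apply_pattern(r"Rechnungs-?Nr\.?[:\s]*(\S+)", "rechnungs-nr: RE-2024-001"))
print(apply_pattern(r"INVOICE", "Invoice 42"))
```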

Industry Examples

Legal — Swiss Federal Law

[
  { "field_name": "gesetz", "pattern": "Bundesgesetz.*?\\((.+?)\\)", "scope": "header", "level": "document", "priority": 10 },
  { "field_name": "gesetz_kurz", "pattern": "\\(([A-Z]{2,5})\\)", "scope": "header", "level": "document", "priority": 5 },
  { "field_name": "artikel", "pattern": "Art\\.?\\s*(\\d+)", "scope": "full", "level": "chunk", "chunk_boundary": true, "priority": 0 }
]

This lets users ask: "Was sagt Artikel 25 vom Urheberrechtsgesetz?" ("What does Article 25 of the Copyright Act say?") → the classifier extracts artikel=25 + gesetz=urheberrechtsgesetz and returns only the correct chunk.

Medical — Patient Records

[
  { "field_name": "entity_id", "pattern": "Patient\\s*ID[:\\s]*(\\S+)", "scope": "full", "level": "document", "priority": 0 },
  { "field_name": "doc_type", "pattern": "^(PATIENT RECORD|DISCHARGE SUMMARY|LAB REPORT)", "scope": "header", "level": "document", "priority": 0 },
  { "field_name": "doc_date", "pattern": "(\\d{4}-\\d{2}-\\d{2})", "scope": "header", "level": "document", "priority": 0 }
]

Enables: "Letzte 3 Berichte für Patient P-1234" ("Last 3 reports for patient P-1234") → metadata-only retrieval, sorted by doc_date descending.
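
To see what the medical rules would extract, the sketch below runs the three patterns against an invented sample document and then sorts a few invented records by doc_date. The sample text and records are made up for illustration, and the loop is a simplification of the real pipeline.

```python
import re

rules = [
    {"field_name": "entity_id", "pattern": r"Patient\s*ID[:\s]*(\S+)", "scope": "full"},
    {"field_name": "doc_type", "pattern": r"^(PATIENT RECORD|DISCHARGE SUMMARY|LAB REPORT)", "scope": "header"},
    {"field_name": "doc_date", "pattern": r"(\d{4}-\d{2}-\d{2})", "scope": "header"},
]

# Invented sample document.
sample = "DISCHARGE SUMMARY\nPatient ID: P-1234\nDate: 2024-01-15\n\nThe patient was admitted ..."

payload = {}
for rule in rules:
    text = sample[:512] if rule["scope"] == "header" else sample
    match = re.search(rule["pattern"], text, re.IGNORECASE)
    if match:
        payload[rule["field_name"]] = match.group(1).lower()
print(payload)

# ISO dates (YYYY-MM-DD) sort correctly as plain strings.
records = [{"doc_date": d} for d in ("2023-11-02", "2024-01-15", "2023-06-30", "2024-03-01")]
latest_three = sorted(records, key=lambda r: r["doc_date"], reverse=True)[:3]
print([r["doc_date"] for r in latest_three])
```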

Financial — Invoices

[
  { "field_name": "entity_id", "pattern": "Kunde[:\\s]*(K-\\d+)", "scope": "full", "level": "document", "priority": 0 },
  { "field_name": "doc_type", "pattern": "^(RECHNUNG|OFFERTE|MAHNUNG)", "scope": "header", "level": "document", "priority": 0 },
  { "field_name": "rechnungsnummer", "pattern": "Rechnungs-?Nr\\.?[:\\s]*(\\S+)", "scope": "full", "level": "document", "priority": 0 }
]

How Rulesets Improve AI Answers

Without rulesets	With rulesets
"Artikel 25" returns semantically similar legal text from any law	Returns the exact Article 25 from the specified law
"Patient P-1234" returns chunks mentioning similar patient names	Returns only documents tagged with `entity_id=p-1234`
"Last 3 reports" returns semantically recent-sounding text	Sorts by `doc_date` descending, returns exactly 3
All chunks arrive as a flat blob to the LLM	Chunks are grouped by document with structured metadata headers

What This Unlocks for Further Analysis

Once metadata is attached to every chunk, the system can:

Filter across dimensions: combine entity + date + document type in a single query
Track document lineage: which connector, which file, what version
Power dashboard analytics: count documents by type, entity, or date range
Enable comparison queries: "Compare Article 10 across DSG and URG" groups chunks by law
Support audit trails: trace which specific documents informed each answer

Query Pattern Override

By default, the same pattern is used for both document ingestion and query-time classification. In some cases this is not desirable — for example, an extraction pattern anchored to the start of a line (e.g. ^Chapter\s+(\d+)) will reliably extract chapter numbers from structured documents, but the same pattern will not match when a user types "what does chapter 3 say?" mid-sentence.

The optional query_pattern field solves this. When set, it is used exclusively for query-time matching; the original pattern continues to be used during document ingestion.

Example:

{
  "field_name": "chapter",
  "pattern": "^Chapter\\s+(\\d+)",
  "query_pattern": "Chapter\\s+(\\d+)",
  "scope": "header",
  "level": "document",
  "priority": 5
}

pattern — anchored regex, used when indexing documents (reliable extraction from structured headers)
query_pattern — unanchored regex, used when classifying natural language queries (matches mid-sentence)

query_pattern is optional and backwards-compatible — omitting it means the existing pattern is used for both ingestion and query classification, exactly as before.
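
The difference between the two patterns is easy to check with Python's `re` module. In this sketch, `match_query` is a hypothetical helper that falls back to `pattern` when `query_pattern` is absent, mirroring the behaviour described above.

```python
import re
from typing import Optional

rule = {
    "field_name": "chapter",
    "pattern": r"^Chapter\s+(\d+)",
    "query_pattern": r"Chapter\s+(\d+)",
}


def match_query(rule: dict, query: str) -> Optional[str]:
    """Match a user query with query_pattern, falling back to pattern."""
    pattern = rule.get("query_pattern") or rule["pattern"]
    match = re.search(pattern, query, re.IGNORECASE)
    return match.group(1).lower() if match else None


query = "what does chapter 3 say?"
anchored_hit = re.search(rule["pattern"], query, re.IGNORECASE)
print(anchored_hit)              # the anchored pattern finds nothing mid-sentence
print(match_query(rule, query))  # the unanchored query_pattern finds the chapter number
```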

Fuzzy Matching & Typo Tolerance

The query classifier supports fuzzy matching for metadata rule patterns. When a query contains an identifier that almost — but not exactly — matches a configured rule, fuzzy matching corrects it before falling through to semantic search.

Fuzzy matching is enabled by default and applies three strategies:

Prefix expansion: bare numbers like 123456 automatically match rules expecting a prefix (e.g., a rule for XYZ123456). Useful when users omit a standard prefix.
Edit-distance tolerance: minor typos, such as the letter O typed in place of a zero, are corrected. Accepts substitutions within a small Levenshtein distance.
Digit-count tolerance: identifiers that are one digit too long or too short (e.g., XYZ1234567 for a 6-digit ID pattern) are accepted to handle common input slips.

Fuzzy matches carry a reduced confidence score (0.65 vs. 0.85 for exact matches) so the system signals that a correction was applied. Common keyword typos in queries are also auto-corrected (e.g., "artkel" → "artikel").

Note: Fuzzy matching operates before semantic search. If a fuzzy match is found, the classifier returns a metadata-filtered result immediately instead of performing a full vector search.
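
The three strategies can be approximated in a few lines. The sketch below is only one possible reading: it compares a query token against a set of known identifiers, which is an assumption (DocBot's matcher works from rule patterns and its internals are not shown here). The names `fuzzy_lookup` and `known_ids` and the distance threshold of 1 are hypothetical; the confidence values are the ones quoted above.

```python
from typing import Iterable, Optional, Tuple

EXACT_CONFIDENCE = 0.85
FUZZY_CONFIDENCE = 0.65


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def fuzzy_lookup(token: str, known_ids: Iterable[str], prefix: str,
                 max_distance: int = 1) -> Optional[Tuple[str, float]]:
    """Return (identifier, confidence), or None when nothing is close enough."""
    known = set(known_ids)
    candidate = token.lower()
    if candidate in known:
        return candidate, EXACT_CONFIDENCE
    # Prefix expansion: a bare number gets the expected prefix.
    if candidate.isdigit() and prefix + candidate in known:
        return prefix + candidate, FUZZY_CONFIDENCE
    # Edit distance 1 covers one substituted, missing, or extra character.
    scored = sorted((levenshtein(candidate, k), k) for k in known)
    if scored and scored[0][0] <= max_distance:
        return scored[0][1], FUZZY_CONFIDENCE
    return None


known_ids = {"xyz123450"}
for token in ("XYZ123450", "123450", "XYZ12345O", "XYZ1234500", "ABC999"):
    print(token, fuzzy_lookup(token, known_ids, prefix="xyz"))
```

Returning the confidence alongside the corrected identifier lets a caller decide whether to trust the correction or fall through to semantic search.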

REST API

Method	Path	Description
`GET`	`/api/metadata/{source_id}/rules`	List all rules for a source
`POST`	`/api/metadata/{source_id}/rules`	Create a new rule
`PUT`	`/api/metadata/{source_id}/rules/{rule_id}`	Update a rule
`DELETE`	`/api/metadata/{source_id}/rules/{rule_id}`	Delete a rule
`POST`	`/api/metadata/{source_id}/test-rule`	Test a pattern against sample text without saving
`POST`	`/api/metadata/{source_id}/refresh`	Re-apply rules to existing chunks (async job)

All endpoints require editor or admin role.

source_id can be either a connector ID or an integration ID.
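
A request to create a rule might be assembled as in the sketch below, which uses only the Python standard library. The host name, the bearer-token header, and the assumption that the request body is a rule object shaped like the JSON examples on this page are all hypothetical; check your deployment for the actual authentication scheme and payload format.

```python
import json
import urllib.request

BASE_URL = "https://docbot.example.com"  # hypothetical host
API_TOKEN = "YOUR_TOKEN"                 # hypothetical; auth scheme is not documented here


def build_create_rule_request(source_id: str, rule: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request that creates a metadata rule."""
    return urllib.request.Request(
        url=f"{BASE_URL}/api/metadata/{source_id}/rules",
        data=json.dumps(rule).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_TOKEN}",  # assumed header format
        },
        method="POST",
    )


request = build_create_rule_request(
    "my-source-id",  # placeholder connector or integration ID
    {
        "field_name": "doc_date",
        "pattern": r"(\d{4}-\d{2}-\d{2})",
        "scope": "header",
        "level": "document",
        "priority": 0,
    },
)
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```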

Important Notes

Warning: After adding or changing rules, you must re-sync the source for changes to take effect on existing documents. New documents ingested after rule creation are processed automatically.

Note: Metadata rules endpoints require a Pro plan or higher.

Note: All metadata values are normalized to lowercase during extraction. This ensures consistent filtering regardless of casing in the source documents or user queries.

Regex patterns are validated on creation, and invalid ones are rejected with HTTP 422; during extraction an invalid pattern is skipped and logged, and never crashes ingestion
The query classifier uses metadata fields automatically — no API parameter changes needed for end users
Default chat mode is auto, which runs the classifier before every query
