Metadata Extraction Rulesets
Metadata rulesets let you teach DocBot to extract structured fields from your documents during ingestion. Rules are configured per-source (connector or integration), so you can attach extraction logic to local connectors as well as GitHub, Slack, and Google Drive integrations. Once extracted, these fields power filtered search, sorted retrieval, and document grouping — turning DocBot from a generic Q&A bot into a domain-aware document analysis tool.
Why Rulesets Matter
Without rulesets, DocBot only understands documents as unstructured text. It can find semantically similar chunks, but it cannot reliably answer:
- "What does Article 25 of the Copyright Act say?" — semantic search matches legal language, not article numbers
- "Show me the last 3 visit notes for patient P-1234" — requires filtering by patient ID and sorting by date
- "Compare Article 10 across DSG and URG" — requires grouping by law name
With rulesets, every chunk in Qdrant carries a metadata payload like:
{
  "entity_id": "p-1234",
  "doc_type": "visit_note",
  "doc_date": "2024-01-15",
  "gesetz": "urheberrechtsgesetz",
  "artikel": "25"
}
The AI's query classifier detects the user's intent and automatically uses these fields to filter, sort, or group before (or instead of) semantic search.
How to Configure Rules via UI
- Navigate to your source in the admin panel (Connector or Integration)
- Open the source settings → Metadata Extraction Rules
- Click Add Rule and fill in:
  - Field name: the metadata key (e.g. entity_id, artikel, gesetz, doc_date)
  - Pattern: a Python regex with one capture group; the captured text becomes the value
  - Scope: where to search, one of full (entire document), header (first 512 chars), or filename
  - Level: document (runs once, stamps all chunks) or chunk (runs per chunk with inheritance)
  - Chunk boundary: when enabled, each match marks a pre-split boundary before normal chunking runs
  - Priority: higher numbers are tried first; first match per field wins
- Use the Test button to validate your pattern against sample text before saving
- After saving rules, re-sync the source to apply them to existing documents
Scope Reference
| Scope | What it searches | Best for |
|---|---|---|
| full | Entire extracted text of the document | General-purpose: IDs, dates, types anywhere in the document |
| header | First 512 characters only | Structured headers like PATIENT RECORD, INVOICE, law titles |
| filename | Source filename (basename only) | Naming conventions: invoice_march.pdf, patient_P1234_report.docx |
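As a rough illustration of the table above, the sketch below shows how a scope could select the text a pattern is matched against. The function name `text_for_scope` and its arguments are hypothetical and not part of DocBot's API; only the three scope names and the 512-character header length come from this page.

```python
import os

HEADER_CHARS = 512  # header scope length stated in the table above


def text_for_scope(scope: str, document_text: str, source_path: str) -> str:
    """Return the text a rule with the given scope would be matched against.

    Hypothetical helper for illustration only.
    """
    if scope == "full":
        return document_text
    if scope == "header":
        return document_text[:HEADER_CHARS]
    if scope == "filename":
        return os.path.basename(source_path)  # basename only, no directories
    raise ValueError(f"unknown scope: {scope!r}")


doc = "PATIENT RECORD\nPatient ID: P-1234\n" + "x" * 1000
path = "/data/patient_P1234_report.docx"
print(len(text_for_scope("header", doc, path)))
print(text_for_scope("filename", doc, path))
```

A practical consequence: a header-scoped pattern can never match an ID that first appears after character 512, so full is the safer choice when the position of a field varies.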
Level Reference
| Level | Behaviour | Best for |
|---|---|---|
| document | Runs once against the full document text. The extracted value is stamped on all chunks of that document. | Static document-wide metadata: patient ID, law name, document type |
| chunk | Runs against each chunk individually. If a chunk doesn't match, it inherits the last matched value from a preceding chunk. | Section-level metadata: article numbers that change within a document |
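The inheritance behaviour of chunk-level rules can be pictured with a small sketch. The helper `extract_per_chunk` is hypothetical, and it assumes that chunks before the first match carry no value (`None`); this page does not specify that case.

```python
import re
from typing import List, Optional


def extract_per_chunk(pattern: str, chunks: List[str]) -> List[Optional[str]]:
    """Chunk-level extraction: unmatched chunks inherit the last matched value."""
    regex = re.compile(pattern, re.IGNORECASE)
    values: List[Optional[str]] = []
    last: Optional[str] = None
    for chunk in chunks:
        match = regex.search(chunk)
        if match:
            last = match.group(1).lower()
        values.append(last)  # stays None until the first match (assumption)
    return values


chunks = ["Preamble", "Art. 1 Purpose", "continued text", "Art. 2 Scope"]
print(extract_per_chunk(r"Art\.?\s*(\d+)", chunks))
# [None, '1', '1', '2']
```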
Chunk Boundary Splitting
chunk_boundary lets a metadata rule influence indexing segmentation directly.
When a rule is configured with "chunk_boundary": true, every regex match becomes a document pre-split boundary before the regular chunking strategy runs. This keeps final chunks aligned with logical document structure instead of splitting mid-section.
Use this for documents with predictable repeating section headers, for example:
- Legal text with Art. 1, Art. 2, Art. 3
- Markdown handbooks with repeated ## Section headings
| Field | Type | Required | Description |
|---|---|---|---|
| chunk_boundary | boolean | No | When true, rule matches act as document pre-split boundaries during indexing. Default false. |
chunk_boundary changes chunk boundaries only. Metadata extraction behavior (field_name, level, scope, priority) remains the same.
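A minimal sketch of the pre-split idea, assuming each boundary falls at the start of a match and that text before the first match is kept as its own segment. The function `presplit` is hypothetical; the regular chunking strategy would then run on each returned segment.

```python
import re
from typing import List


def presplit(text: str, pattern: str) -> List[str]:
    """Split text so that every regex match starts a new segment."""
    starts = [m.start() for m in re.finditer(pattern, text, re.IGNORECASE)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text that precedes the first match
    bounds = starts + [len(text)]
    segments = [text[a:b] for a, b in zip(bounds, bounds[1:])]
    return [s for s in segments if s.strip()]  # drop empty segments


law = "Preamble.\nArt. 1 Purpose.\nArt. 2 Scope.\n"
for segment in presplit(law, r"Art\.?\s*\d+"):
    print(repr(segment))
```

With this sample, each article lands in its own segment, so a later chunker cannot merge the end of Art. 1 with the start of Art. 2.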
Global Rule Loading for Classification
At query time, the classifier now loads metadata rules globally across all connectors and integrations. Users no longer need to specify which source their query targets — matching rules from any synced connector or integration can be used for intent detection and filter extraction.
Priority
- Rules are sorted by priority descending (highest number first)
- For each field_name, the first matching rule wins; remaining rules for the same field are skipped
- Different fields are evaluated independently
- Default priority is 0
Example: given two rules for entity_id, Patient ID: (...) at priority 5 and Member: (...) at priority 10, the Member pattern is tried first; if it matches, the Patient pattern is never evaluated.
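The resolution order described above can be sketched as follows. The function `apply_rules` is hypothetical, and because the example elides the capture groups as (...), the `\S+` groups used here are placeholders chosen for illustration.

```python
import re
from typing import Dict, List


def apply_rules(rules: List[dict], text: str) -> Dict[str, str]:
    """Try rules from highest to lowest priority; first match per field wins."""
    result: Dict[str, str] = {}
    ordered = sorted(rules, key=lambda r: r.get("priority", 0), reverse=True)
    for rule in ordered:
        field = rule["field_name"]
        if field in result:
            continue  # a higher-priority rule already matched this field
        match = re.search(rule["pattern"], text, re.IGNORECASE)
        if match:
            result[field] = match.group(1).lower()
    return result


rules = [
    # Capture groups are illustrative placeholders.
    {"field_name": "entity_id", "pattern": r"Patient ID:\s*(\S+)", "priority": 5},
    {"field_name": "entity_id", "pattern": r"Member:\s*(\S+)", "priority": 10},
]
print(apply_rules(rules, "Patient ID: P-1234\nMember: M-77"))  # {'entity_id': 'm-77'}
print(apply_rules(rules, "Patient ID: P-1234"))                # {'entity_id': 'p-1234'}
```

The second call shows the fallback: the lower-priority rule is only consulted when the higher-priority one finds nothing.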
Pattern
A Python-compatible regex:
- Must contain exactly one capture group (...); the captured text becomes the metadata value
- If no capture group is present, the full match is used
- Patterns are case-insensitive
- Invalid patterns are validated on creation (HTTP 422) and silently skipped at extraction time (logged, never crashes ingestion)
- All extracted values are normalized to lowercase before storage
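Taken together, the behaviours listed above might look like the sketch below. It is an approximation built on Python's standard `re` module, not DocBot's actual extraction code; `extract_value` and the logger name are hypothetical.

```python
import logging
import re
from typing import Optional

log = logging.getLogger("metadata_rules")  # hypothetical logger name


def extract_value(pattern: str, text: str) -> Optional[str]:
    """Apply one rule pattern: case-insensitive, lowercase result, never raises."""
    try:
        regex = re.compile(pattern, re.IGNORECASE)
    except re.error as exc:
        # Invalid patterns are logged and skipped instead of failing ingestion.
        log.warning("skipping invalid pattern %r: %s", pattern, exc)
        return None
    match = regex.search(text)
    if match is None:
        return None
    # Use the capture group when the pattern has one, otherwise the full match.
    raw = match.group(1) if regex.groups >= 1 else match.group(0)
    return raw.lower() if raw is not None else None


print(extract_value(r"Patient\s*ID[:\s]*(\S+)", "patient id: P-1234"))  # p-1234
print(extract_value(r"LAB REPORT", "Lab Report 7"))                     # lab report
print(extract_value(r"(unclosed", "anything"))                          # None
```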
Industry Examples
Legal — Swiss Federal Law
[
{ "field_name": "gesetz", "pattern": "Bundesgesetz.*?\\((.+?)\\)", "scope": "header", "level": "document", "priority": 10 },
{ "field_name": "gesetz_kurz", "pattern": "\\(([A-Z]{2,5})\\)", "scope": "header", "level": "document", "priority": 5 },
{ "field_name": "artikel", "pattern": "Art\\.?\\s*(\\d+)", "scope": "full", "level": "chunk", "chunk_boundary": true, "priority": 0 }
]
This lets users ask: "Was sagt Artikel 25 vom Urheberrechtsgesetz?" ("What does Article 25 of the Copyright Act say?") → the classifier extracts artikel=25 + gesetz=urheberrechtsgesetz and returns only the matching chunk.
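To see what the three legal rules extract, the sketch below runs them against a made-up document. The header and article text are invented sample input, and the loop is a simplified stand-in for the real extraction pipeline (it ignores level and priority).

```python
import json
import re

# The three patterns from the ruleset above, loaded from JSON as configured.
RULES_JSON = r'''
[
  {"field_name": "gesetz", "pattern": "Bundesgesetz.*?\\((.+?)\\)", "scope": "header"},
  {"field_name": "gesetz_kurz", "pattern": "\\(([A-Z]{2,5})\\)", "scope": "header"},
  {"field_name": "artikel", "pattern": "Art\\.?\\s*(\\d+)", "scope": "full"}
]
'''

# Invented sample document, for illustration only.
header = "Bundesgesetz über das Urheberrecht (Urheberrechtsgesetz)\n(URG)\n"
body = header + "Art. 25 Sample article text"

extracted = {}
for rule in json.loads(RULES_JSON):
    text = header[:512] if rule["scope"] == "header" else body
    match = re.search(rule["pattern"], text, re.IGNORECASE)
    if match:
        extracted[rule["field_name"]] = match.group(1).lower()

print(extracted)
# {'gesetz': 'urheberrechtsgesetz', 'gesetz_kurz': 'urg', 'artikel': '25'}
```

Because patterns are case-insensitive, `[A-Z]{2,5}` also matches lowercase letters; here the length limit of 2 to 5 characters is what keeps gesetz_kurz from capturing the long law name.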
Medical — Patient Records
[
{ "field_name": "entity_id", "pattern": "Patient\\s*ID[:\\s]*(\\S+)", "scope": "full", "level": "document", "priority": 0 },
{ "field_name": "doc_type", "pattern": "^(PATIENT RECORD|DISCHARGE SUMMARY|LAB REPORT)", "scope": "header", "level": "document", "priority": 0 },
{ "field_name": "doc_date", "pattern": "(\\d{4}-\\d{2}-\\d{2})", "scope": "header", "level": "document", "priority": 0 }
]
Enables: "Letzte 3 Berichte für Patient P-1234" ("Last 3 reports for patient P-1234") → metadata-only retrieval, sorted by doc_date descending.
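The retrieval behind such a query amounts to a filter followed by a sort. The sketch below performs it on in-memory payloads rather than in the vector store; `latest_documents` and the sample payloads are hypothetical.

```python
from typing import Dict, List


def latest_documents(
    payloads: List[Dict[str, str]], entity_id: str, limit: int
) -> List[Dict[str, str]]:
    """Filter by entity_id, then sort by doc_date, newest first."""
    wanted = entity_id.lower()  # stored values are normalized to lowercase
    hits = [p for p in payloads if p.get("entity_id") == wanted and "doc_date" in p]
    # ISO 8601 dates (YYYY-MM-DD) sort correctly as plain strings.
    hits.sort(key=lambda p: p["doc_date"], reverse=True)
    return hits[:limit]


payloads = [
    {"entity_id": "p-1234", "doc_type": "visit_note", "doc_date": "2024-01-15"},
    {"entity_id": "p-1234", "doc_type": "visit_note", "doc_date": "2023-11-02"},
    {"entity_id": "p-9999", "doc_type": "lab_report", "doc_date": "2024-02-01"},
    {"entity_id": "p-1234", "doc_type": "lab_report", "doc_date": "2024-03-09"},
    {"entity_id": "p-1234", "doc_type": "visit_note", "doc_date": "2022-06-30"},
]
for hit in latest_documents(payloads, "P-1234", limit=3):
    print(hit["doc_date"], hit["doc_type"])
```

Sorting only works this simply because the doc_date rule captures dates in YYYY-MM-DD form; other date formats would need parsing first.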
Financial — Invoices
[
{ "field_name": "entity_id", "pattern": "Kunde[:\\s]*(K-\\d+)", "scope": "full", "level": "document", "priority": 0 },
{ "field_name": "doc_type", "pattern": "^(RECHNUNG|OFFERTE|MAHNUNG)", "scope": "header", "level": "document", "priority": 0 },
{ "field_name": "rechnungsnummer", "pattern": "Rechnungs-?Nr\\.?[:\\s]*(\\S+)", "scope": "full", "level": "document", "priority": 0 }
]
How Rulesets Improve AI Answers
| Without rulesets | With rulesets |
|---|---|
| "Artikel 25" returns semantically similar legal text from any law | Returns the exact Article 25 from the specified law |
| "Patient P-1234" returns chunks mentioning similar patient names | Returns only documents tagged with entity_id=p-1234 |
| "Last 3 reports" returns semantically recent-sounding text | Sorts by doc_date descending, returns exactly 3 |
| All chunks arrive as a flat blob to the LLM | Chunks are grouped by document with structured metadata headers |
What This Unlocks for Further Analysis
Once metadata is attached to every chunk, the system can:
- Filter across dimensions: combine entity + date + document type in a single query
- Track document lineage: which connector, which file, what version
- Power dashboard analytics: count documents by type, entity, or date range
- Enable comparison queries: "Compare Article 10 across DSG and URG" groups chunks by law
- Support audit trails: trace which specific documents informed each answer
Query Pattern Override
By default, the same pattern is used for both document ingestion and query-time classification. In some cases this is not desirable — for example, an extraction pattern anchored to the start of a line (e.g. ^Chapter\s+(\d+)) will reliably extract chapter numbers from structured documents, but the same pattern will not match when a user types "what does chapter 3 say?" mid-sentence.
The optional query_pattern field solves this. When set, it is used exclusively for query-time matching; the original pattern continues to be used during document ingestion.
Example:
{
  "field_name": "chapter",
  "pattern": "^Chapter\\s+(\\d+)",
  "query_pattern": "Chapter\\s+(\\d+)",
  "scope": "header",
  "level": "document",
  "priority": 5
}
- pattern: anchored regex, used when indexing documents (reliable extraction from structured headers)
- query_pattern: unanchored regex, used when classifying natural language queries (matches mid-sentence)
query_pattern is optional and backwards-compatible — omitting it means the existing pattern is used for both ingestion and query classification, exactly as before.
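The difference between the two patterns is easy to reproduce with Python's `re` module. In this sketch, `match_query` is a hypothetical helper that mimics the fallback described above.

```python
import re
from typing import Optional

rule = {
    "field_name": "chapter",
    "pattern": r"^Chapter\s+(\d+)",
    "query_pattern": r"Chapter\s+(\d+)",
}


def match_query(rule: dict, query: str) -> Optional[str]:
    """Use query_pattern when set, otherwise fall back to pattern."""
    regex = rule.get("query_pattern") or rule["pattern"]
    m = re.search(regex, query, re.IGNORECASE)
    return m.group(1).lower() if m else None


query = "what does chapter 3 say?"
# The anchored ingestion pattern fails mid-sentence.
print(re.search(rule["pattern"], query, re.IGNORECASE))  # None
# The unanchored query pattern finds the chapter number.
print(match_query(rule, query))  # 3
```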
Fuzzy Matching & Typo Tolerance
The query classifier supports fuzzy matching for metadata rule patterns. When a query contains an identifier that almost — but not exactly — matches a configured rule, fuzzy matching corrects it before falling through to semantic search.
Fuzzy matching is enabled by default and applies three strategies:
- Prefix expansion: bare numbers like 123456 automatically match rules expecting a prefix (e.g., a rule for XYZ123456). Useful when users omit a standard prefix.
- Edit-distance tolerance: minor typos such as XYZ123456 (letter O instead of zero) are corrected. Accepts substitutions within a small Levenshtein distance.
- Digit-count tolerance: identifiers with ±1 digit (e.g., XYZ123456 for a 6-digit ID pattern) are accepted to handle common input slips.
Fuzzy matches carry a reduced confidence score (0.65 vs. 0.85 for exact matches) so the system signals that a correction was applied. Common keyword typos in queries are also auto-corrected (e.g., "artkel" → "artikel").
Fuzzy matching operates before semantic search — if a fuzzy match is found, the classifier returns a metadata-filtered result immediately instead of performing a full vector search.
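The sketch below shows one way the three strategies could be combined for a single identifier shape. It is heavily simplified and assumption-laden: the prefix `xyz`, the six-digit length, the distance threshold of 1, and the function `classify_token` are all hypothetical, and typo handling covers only the letter prefix. Only the two confidence scores are taken from this page.

```python
import re
from typing import Optional, Tuple

PREFIX = "xyz"   # hypothetical identifier prefix
DIGITS = 6       # hypothetical digit count
EXACT_CONF, FUZZY_CONF = 0.85, 0.65  # scores quoted in the text above


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def classify_token(token: str) -> Optional[Tuple[str, float]]:
    """Return (identifier, confidence), or None when nothing plausible matches."""
    t = token.lower()
    if re.fullmatch(rf"{PREFIX}\d{{{DIGITS}}}", t):
        return t, EXACT_CONF
    # Prefix expansion: bare digits gain the expected prefix.
    if re.fullmatch(rf"\d{{{DIGITS}}}", t):
        return PREFIX + t, FUZZY_CONF
    # Digit-count tolerance: one digit too many or too few is accepted.
    if re.fullmatch(rf"{PREFIX}\d{{{DIGITS - 1},{DIGITS + 1}}}", t):
        return t, FUZZY_CONF
    # Edit-distance tolerance: letter part within one edit of the prefix.
    m = re.fullmatch(rf"([a-z]+)(\d{{{DIGITS}}})", t)
    if m and levenshtein(m.group(1), PREFIX) <= 1:
        return PREFIX + m.group(2), FUZZY_CONF
    return None


for token in ["XYZ123456", "123456", "XYZ1234567", "XYY123456", "hello"]:
    print(token, classify_token(token))
```

Returning the confidence alongside the value lets a caller decide whether a corrected identifier is trustworthy enough to skip vector search.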
REST API
| Method | Path | Description |
|---|---|---|
| GET | /api/metadata/{source_id}/rules | List all rules for a source |
| POST | /api/metadata/{source_id}/rules | Create a new rule |
| PUT | /api/metadata/{source_id}/rules/{rule_id} | Update a rule |
| DELETE | /api/metadata/{source_id}/rules/{rule_id} | Delete a rule |
| POST | /api/metadata/{source_id}/test-rule | Test a pattern against sample text without saving |
| POST | /api/metadata/{source_id}/refresh | Re-apply rules to existing chunks (async job) |
All endpoints require editor or admin role.
source_id can be either a connector ID or an integration ID.
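As an illustration, a create-rule request could be assembled like this with Python's standard library. The host, source ID, token, and bearer-token authentication are placeholders and assumptions; the request body reuses the rule fields shown in the examples on this page, but the exact request schema is not documented here. The request is built but not sent.

```python
import json
import urllib.request

BASE_URL = "https://docbot.example.com"  # hypothetical host
SOURCE_ID = "my-connector-id"            # hypothetical connector or integration ID
TOKEN = "YOUR_API_TOKEN"                 # hypothetical; the auth scheme is assumed

rule = {
    "field_name": "artikel",
    "pattern": r"Art\.?\s*(\d+)",
    "scope": "full",
    "level": "chunk",
    "chunk_boundary": True,
    "priority": 0,
}

request = urllib.request.Request(
    url=f"{BASE_URL}/api/metadata/{SOURCE_ID}/rules",
    data=json.dumps(rule).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {TOKEN}",
    },
    method="POST",
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request)
# An invalid regex is expected to be rejected with HTTP 422.
```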
Important Notes
After adding or changing rules, you must re-sync the source for changes to take effect on existing documents. New documents ingested after rule creation are processed automatically.
Metadata rules endpoints require a Pro plan or higher.
All metadata values are normalized to lowercase during extraction. This ensures consistent filtering regardless of casing in the source documents or user queries.
- Invalid regex patterns are validated on creation (HTTP 422) and silently skipped during extraction (logged, never crashes ingestion)
- The query classifier uses metadata fields automatically — no API parameter changes needed for end users
- Default chat mode is auto, which runs the classifier before every query