BOOST.DocumentProfessor
A stateless, message-driven Java 21 service that interprets extracted document text using configurable rules. It consumes interpretation requests from an AMQP queue, applies pattern matching and field extraction rules, and produces proposals for document classification, party matching, and field values.
Architecture
[AMQP: cmd.document.interpret]
        │
        ▼
MessageConsumerService (CLIENT_ACKNOWLEDGE)
        │
        ▼
InterpretationHandler
        │
        ├──► Parse rules and actions from message
        └──► RuleEngine.process()
                ├──► RuleMatcher (EXACT/CONTAINS/REGEX)
                ├──► ValueConverter (NUMBER/DATE/CURRENCY/BOOLEAN)
                └──► Generate proposals
                        │
                        ▼
EventPublisher ──► [AMQP: evt.document.interpreted]
               └──► [AMQP: evt.document.interpretation.failed] (on failure)
Pipeline Position
DocumentExtractor (bytes → text)
        │
        ▼
DocumentProfessor (text → proposals) ◄── You are here
        │
        ▼
DocumentManager (proposals → authoritative state)
Message Flow
Input Message
Send to queue: cmd.document.interpret
JMS Property: tenantUUID must be set for proper tenant routing.
{
  "documentUUID": "742dba26-...",
  "entityUUID": "5d16c711-...",
  "organizationId": "org-123",
  "clientFilename": "Invoice-12345.pdf",
  "contentType": "application/pdf",
  "extractedText": "Invoice\nInvoice number: INV-2025-001\nDate: January 25, 2025...",
  "currentPartyUUID": "party-acme-001",
  "currentDocumentTypeUUID": null,
  "currentState": "INBOX",
  "fromEmailDomain": "acme.com",
  "rules": [
    {
      "uuid": "rule-001",
      "priority": 1,
      "isEnabled": true,
      "matchMode": "CONTAINS",
      "keyword": "Invoice",
      "stopOnMatch": false,
      "actions": [
        {
          "uuid": "action-001",
          "actionType": "SET_TYPE",
          "targetUUID": "doctype-invoice-001",
          "isEnabled": true
        }
      ]
    },
    {
      "uuid": "rule-002",
      "priority": 2,
      "isEnabled": true,
      "partyUUID": "party-acme-001",
      "pattern": "(?im)^\\s*Invoice\\s+number[:\\s]*(\\S+)",
      "actions": [
        {
          "uuid": "action-002",
          "actionType": "EXTRACT_TO",
          "target": "invoiceNumber",
          "extractGroup": 1,
          "isEnabled": true
        }
      ]
    }
  ]
}
| Field | Description |
|---|---|
| documentUUID | Unique document identifier |
| entityUUID | Parent entity this document belongs to |
| clientFilename | Original filename (used for filename-based matching) |
| extractedText | Text content to analyze |
| currentPartyUUID | Current linked party (for precondition matching) |
| currentDocumentTypeUUID | Current document type (for precondition matching) |
| currentState | Current document state (for precondition matching) |
| fromEmailDomain | Sender email domain (for precondition matching) |
| rules | Array of match rules with their actions |
Output Event (Success)
Published to queue: evt.document.interpreted
JMS Property: tenantUUID is set to enable tenant-filtered consumption by DocumentManager.
{
  "documentUUID": "742dba26-...",
  "entityUUID": "5d16c711-...",
  "tenantUUID": "org-123",
  "success": true,
  "processingTimeMs": 45,
  "documentTypeProposal": {
    "documentTypeUUID": "doctype-invoice-001",
    "confidence": 0.8,
    "ruleUUID": "rule-001",
    "evidence": "Matched keyword 'Invoice' in extractedText"
  },
  "partyProposals": [
    {
      "partyUUID": "party-acme-001",
      "confidence": 0.8,
      "ruleUUID": "rule-005",
      "matchedKeyword": "Acme Corporation",
      "evidence": "Matched keyword 'Acme Corporation' in extractedText"
    }
  ],
  "fieldProposals": [
    {
      "fieldName": "invoiceNumber",
      "rawValue": "INV-2025-001",
      "confidence": 0.9,
      "ruleUUID": "rule-002",
      "evidence": "Matched pattern 'Invoice number[:\\s]*(\\S+)' in extractedText"
    },
    {
      "fieldName": "totalAmount",
      "rawValue": "1,234.56",
      "convertedValue": 1234.56,
      "convertType": "CURRENCY",
      "confidence": 0.9,
      "ruleUUID": "rule-004"
    }
  ],
  "matchedRuleUUIDs": ["rule-001", "rule-002", "rule-004", "rule-005"]
}
Error Event
Published to queue: evt.document.interpretation.failed
JMS Property: tenantUUID is set to enable tenant-filtered consumption.
{
  "documentUUID": "742dba26-...",
  "entityUUID": "5d16c711-...",
  "tenantUUID": "org-123",
  "success": false,
  "processingTimeMs": 12,
  "errorMessage": "No extracted text provided"
}
Preconditions
Rules can be scoped to specific contexts using preconditions. If a precondition is set, the rule only applies when the document's current context matches:
| Precondition | Description |
|---|---|
| partyUUID | Only apply rule when document is linked to this party |
| documentTypeUUID | Only apply rule when document has this type |
| state | Only apply rule when document is in this state |
| fromEmailDomain | Only apply rule when email sender domain matches |
Precondition Example
{
  "uuid": "rule-invoice-extract",
  "priority": 10,
  "isEnabled": true,
  "partyUUID": "party-acme-001",
  "documentTypeUUID": "doctype-invoice",
  "pattern": "(?im)^\\s*Invoice\\s+Date[:\\s]*(\\d{2}\\.\\d{2}\\.\\d{4})",
  "actions": [...]
}
This rule only runs when:
- Document is linked to party party-acme-001
- Document type is doctype-invoice
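The precondition check can be sketched as follows. This is a minimal illustration, not the service's actual code: the class and record names are hypothetical, and it assumes that an unset (null) precondition places no constraint and that email domains are compared case-insensitively.

```java
/** Hypothetical sketch of precondition evaluation; all names are illustrative. */
final class PreconditionSketch {

    /** Rule-side preconditions. A null field means "no constraint". */
    record Preconditions(String partyUUID, String documentTypeUUID,
                         String state, String fromEmailDomain) {}

    /** Context fields carried by the interpretation request. */
    record DocumentContext(String currentPartyUUID, String currentDocumentTypeUUID,
                           String currentState, String fromEmailDomain) {}

    /** True when every precondition that is set equals the matching context value. */
    static boolean applies(Preconditions p, DocumentContext c) {
        return matches(p.partyUUID(), c.currentPartyUUID())
            && matches(p.documentTypeUUID(), c.currentDocumentTypeUUID())
            && matches(p.state(), c.currentState())
            && matchesIgnoreCase(p.fromEmailDomain(), c.fromEmailDomain());
    }

    private static boolean matches(String required, String actual) {
        return required == null || required.equals(actual);
    }

    // Assumption: sender domains are compared without regard to case.
    private static boolean matchesIgnoreCase(String required, String actual) {
        return required == null || required.equalsIgnoreCase(actual);
    }
}
```

With the rule above, a document linked to party-acme-001 but without a document type would be skipped, because documentTypeUUID is set on the rule and does not match.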
The request must include context fields for precondition evaluation:
{
  "documentUUID": "742dba26-...",
  "extractedText": "...",
  "currentPartyUUID": "party-acme-001",
  "currentDocumentTypeUUID": "doctype-invoice",
  "currentState": "REGISTRATION",
  "fromEmailDomain": "acme.com",
  "rules": [...]
}
Match Modes
| Mode | Description | Example |
|---|---|---|
| EXACT | Full text must match exactly (case-insensitive) | keyword: "Invoice" matches only "Invoice" or "invoice" |
| CONTAINS | Text must contain the keyword | keyword: "Invoice" matches "Tax Invoice #123" |
| REGEX | Text must match the regular expression | pattern: "INV-\\d+" matches "INV-2025-001" |
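A minimal sketch of the three modes, using only java.util.regex. The class and method names are hypothetical, not the real RuleMatcher API. Two assumptions are made: CONTAINS is case-sensitive (the table only states case-insensitivity for EXACT), and REGEX searches for the pattern anywhere in the text, which is what the "INV-\\d+" example implies.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of the match modes; not the actual RuleMatcher. */
final class MatchModeSketch {

    enum MatchMode { EXACT, CONTAINS, REGEX }

    /**
     * expression is the keyword for EXACT and CONTAINS,
     * and the regular expression for REGEX.
     */
    static boolean matches(MatchMode mode, String expression, String text) {
        return switch (mode) {
            // Whole text equals the keyword, ignoring case.
            case EXACT -> text.equalsIgnoreCase(expression);
            // Assumption: substring check is case-sensitive.
            case CONTAINS -> text.contains(expression);
            // find() looks for the pattern anywhere in the text.
            case REGEX -> Pattern.compile(expression).matcher(text).find();
        };
    }
}
```

Using find() rather than a full-string match is a design choice here: "INV-\\d+" only covers "INV-2025" inside "INV-2025-001", so a full-string match would reject the documented example.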
Pattern Priority
If pattern is set, it takes precedence regardless of matchMode. This allows combining keyword detection with capture groups in a single regex:
{
  "pattern": "(?im)^\\s*Fakturanummer\\s*[:\\-]?\\s*(\\d{4,})\\s*$",
  "actions": [
    {
      "actionType": "EXTRACT_TO",
      "target": "invoiceNumber",
      "extractGroup": 1
    }
  ]
}
Embedded Regex Flags
Patterns can include embedded flags at the start:
| Flag | Description |
|---|---|
| (?i) | Case-insensitive matching |
| (?m) | Multi-line mode (^ and $ match line boundaries) |
| (?s) | Dot matches newlines |
| (?im) | Combined case-insensitive + multi-line |
Regex Flags Field
Alternatively, use the regexFlags field:
| Flag | Description |
|---|---|
| i | Case-insensitive matching |
| m | Multi-line mode (^ and $ match line boundaries) |
| s | Dot matches newlines |
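One way the regexFlags letters could map onto java.util.regex flag constants is sketched below. The class and method names are hypothetical, and rejecting unknown letters with an exception is an assumption; the service may handle them differently.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch: translating a regexFlags string into Pattern flags. */
final class RegexFlagsSketch {

    /** Converts a string such as "im" into the corresponding flag bits. */
    static int toPatternFlags(String regexFlags) {
        int flags = 0;
        if (regexFlags == null) {
            return flags;
        }
        for (char c : regexFlags.toCharArray()) {
            switch (c) {
                case 'i' -> flags |= Pattern.CASE_INSENSITIVE;
                case 'm' -> flags |= Pattern.MULTILINE;
                case 's' -> flags |= Pattern.DOTALL;
                // Assumption: unknown letters are treated as an error.
                default -> throw new IllegalArgumentException("Unsupported regex flag: " + c);
            }
        }
        return flags;
    }

    /** Compiles the pattern with the flags named in regexFlags. */
    static Pattern compile(String pattern, String regexFlags) {
        return Pattern.compile(pattern, toPatternFlags(regexFlags));
    }
}
```

Under this mapping, a pattern "^invoice$" with regexFlags "im" behaves the same as the embedded form "(?im)^invoice$".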
Action Types
| Action Type | Description | Required Fields |
|---|---|---|
| SET_TYPE | Propose a document type | targetUUID |
| SET_PARTY | Propose a party/customer | targetUUID |
| SET_FIELD | Set a field to a fixed value | targetField, targetValue |
| EXTRACT_FIELD | Extract field value from regex match | targetField, extractGroup |
| EXTRACT_TO | Extract field value from regex match, with optional custom date format | target, extractGroup; convertDate is optional |
EXTRACT_FIELD Example
For the regex Invoice number[:\s]*(\S+) with extractGroup: 1:
- Input: "Invoice number: INV-2025-001"
- Group 0 (full match): "Invoice number: INV-2025-001"
- Group 1: "INV-2025-001" ← This is extracted
EXTRACT_TO Example
EXTRACT_TO is an enhanced version of EXTRACT_FIELD that uses target instead of targetField and supports convertDate for custom date format parsing:
{
  "uuid": "rule-invoice-date",
  "pattern": "(?im)^\\s*Fakturadatum\\s*[:\\-]?\\s*(\\d{2}\\.\\d{2}\\.\\d{4})",
  "actions": [
    {
      "actionType": "EXTRACT_TO",
      "target": "invoiceDate",
      "extractGroup": 1,
      "convertDate": "dd.MM.yyyy"
    }
  ]
}
This extracts "25.01.2026" and converts it to ISO format "2026-01-25".
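The two steps behind this example, capture-group extraction followed by date conversion, can be sketched with the JDK alone. The class and method names are hypothetical. The sketch assumes that convertDate uses java.time.format.DateTimeFormatter pattern letters and that month names, if present, are English.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of EXTRACT_TO: extract a group, then convert the date. */
final class ExtractToSketch {

    /** Returns the requested capture group, or empty if the pattern or group is absent. */
    static Optional<String> extract(String pattern, String text, int extractGroup) {
        Matcher m = Pattern.compile(pattern).matcher(text);
        if (!m.find() || extractGroup < 0 || extractGroup > m.groupCount()) {
            return Optional.empty();
        }
        return Optional.ofNullable(m.group(extractGroup));
    }

    /** Parses rawValue using the convertDate pattern and renders it as an ISO date. */
    static String toIsoDate(String rawValue, String convertDate) {
        // Assumption: English month names when the pattern contains text months.
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern(convertDate, Locale.ENGLISH);
        return LocalDate.parse(rawValue.trim(), formatter).toString();
    }

    public static void main(String[] args) {
        String pattern = "(?im)^\\s*Fakturadatum\\s*[:\\-]?\\s*(\\d{2}\\.\\d{2}\\.\\d{4})";
        String text = "Faktura\nFakturadatum: 25.01.2026\n";
        extract(pattern, text, 1)
            .map(raw -> toIsoDate(raw, "dd.MM.yyyy"))
            .ifPresent(System.out::println); // prints 2026-01-25
    }
}
```

Group 0 is always the full match, so extractGroup: 1 selects the first parenthesized part of the pattern.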
Value Conversion
The convertType field enables automatic type conversion:
| Convert Type | Input Example | Output |
|---|---|---|
| NUMBER | "1,234.56" | 1234.56 |
| CURRENCY | "$1,234.56" | 1234.56 |
| DATE | "2025-01-25" | 2025-01-25 (ISO format) |
| BOOLEAN | "yes", "true", "1" | true |
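The conversions in the table could look roughly like the sketch below. It is not the actual ValueConverter: the names are hypothetical, and it assumes a comma is a thousands separator, a dot is the decimal separator, and any value other than "yes", "true", or "1" converts to false.

```java
import java.math.BigDecimal;
import java.time.LocalDate;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of the convertType conversions; not the real ValueConverter. */
final class ValueConverterSketch {

    private static final Set<String> TRUE_VALUES = Set.of("yes", "true", "1");

    /** NUMBER: drops grouping commas; assumes '.' is the decimal separator. */
    static BigDecimal toNumber(String raw) {
        return new BigDecimal(raw.trim().replace(",", ""));
    }

    /** CURRENCY: keeps only digits, '.', and '-' before parsing. */
    static BigDecimal toCurrency(String raw) {
        return new BigDecimal(raw.replaceAll("[^0-9.\\-]", ""));
    }

    /** DATE: parses an ISO date and returns it in ISO format. */
    static String toDate(String raw) {
        return LocalDate.parse(raw.trim()).toString();
    }

    /** BOOLEAN: true for "yes", "true", or "1" (case-insensitive), false otherwise. */
    static boolean toBoolean(String raw) {
        return TRUE_VALUES.contains(raw.trim().toLowerCase(Locale.ROOT));
    }
}
```

BigDecimal is used instead of double so that amounts such as 1234.56 keep their exact decimal value.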
Custom Date Format
Use convertDate with EXTRACT_TO to specify the input date format:
| Format Pattern | Input Example | Output |
|---|---|---|
| dd.MM.yyyy | "25.01.2026" | 2026-01-25 |
| MM/dd/yyyy | "01/25/2026" | 2026-01-25 |
| d MMM yyyy | "25 Jan 2026" | 2026-01-25 |
Confidence Calculation
Confidence scores are calculated based on match mode:
| Match Mode | Base Confidence |
|---|---|
| EXACT | 1.0 |
| REGEX | 0.9 |
| CONTAINS | 0.8 |
Bonus: +0.05 if matched text length > 10 characters (capped at 1.0)
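The scoring rule above translates into a few lines of Java. This is a sketch with hypothetical names; it assumes the bonus applies to the length of the matched text, as the rule is worded.

```java
/** Hypothetical sketch of the confidence calculation. */
final class ConfidenceSketch {

    enum MatchMode { EXACT, REGEX, CONTAINS }

    /** Base confidence by match mode, plus 0.05 for matches longer than 10 characters, capped at 1.0. */
    static double confidence(MatchMode mode, String matchedText) {
        double base = switch (mode) {
            case EXACT -> 1.0;
            case REGEX -> 0.9;
            case CONTAINS -> 0.8;
        };
        double bonus = (matchedText != null && matchedText.length() > 10) ? 0.05 : 0.0;
        return Math.min(1.0, base + bonus);
    }
}
```

For example, a CONTAINS match on "Invoice" (7 characters) scores 0.8, while an EXACT match stays at 1.0 regardless of length because of the cap.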
Configuration
Configuration is loaded from config.properties (external file or classpath).
# AMQP Broker
broker_url=amqp://172.16.200.32:5672
input_queue=cmd.document.interpret
output_queue=evt.document.interpreted
error_queue=evt.document.interpretation.failed
| Property | Description | Default |
|---|---|---|
| broker_url | AMQP broker connection URL | amqp://localhost:5672 |
| input_queue | Queue to consume interpretation requests | cmd.document.interpret |
| output_queue | Queue for successful interpretations | evt.document.interpreted |
| error_queue | Queue for failed interpretations | evt.document.interpretation.failed |
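Loading these properties with the documented defaults could be done with java.util.Properties, as in the sketch below. It is not the actual AppConfig: the class name is hypothetical, and the lookup order (external file first, then classpath, then defaults) is an assumption.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/** Hypothetical sketch of configuration loading; not the real AppConfig. */
final class ConfigSketch {

    /** Defaults taken from the property table. */
    static Properties defaults() {
        Properties d = new Properties();
        d.setProperty("broker_url", "amqp://localhost:5672");
        d.setProperty("input_queue", "cmd.document.interpret");
        d.setProperty("output_queue", "evt.document.interpreted");
        d.setProperty("error_queue", "evt.document.interpretation.failed");
        return d;
    }

    /** Assumed order: external file if readable, else classpath resource, else defaults only. */
    static Properties load(Path externalFile) throws IOException {
        Properties props = new Properties(defaults());
        if (externalFile != null && Files.isReadable(externalFile)) {
            try (InputStream in = Files.newInputStream(externalFile)) {
                props.load(in);
            }
            return props;
        }
        try (InputStream in = ConfigSketch.class.getResourceAsStream("/config.properties")) {
            if (in != null) {
                props.load(in);
            }
        }
        return props;
    }
}
```

Passing the defaults to the Properties constructor means getProperty falls back to them for any key the file does not set.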
Tenant Isolation
DocumentProfessor propagates tenant information through the processing pipeline:
- Receives tenantUUID (or organizationId) from incoming message JSON body
- Propagates the value through RuleEngine to the InterpretationResult
- Sets tenantUUID as a JMS message property on published events
This allows DocumentManager to filter events by tenant using JMS selectors.
// Parse tenantUUID from incoming message (with fallback)
String tenantUUID = json.getString("organizationId");
if (tenantUUID == null) {
    tenantUUID = json.getString("tenantUUID");
}
request.setTenantUUID(tenantUUID);
// Propagate through RuleEngine
result.setTenantUUID(request.getTenantUUID());
// Set JMS property on outgoing event
message.setStringProperty("tenantUUID", result.getTenantUUID());
For detailed information about tenant isolation, see Multi-Tenant Messaging.
Key Design Decisions
| Aspect | Choice | Rationale |
|---|---|---|
| Stateless | No database access | Rules and text provided in message; DocumentManager owns persistence |
| Proposals | Return suggestions, not decisions | DocumentManager applies business rules and thresholds |
| Rule priority | Lower number = higher priority | Rules processed in priority order |
| Stop on match | Optional per rule | Allows early termination for exclusive classifications |
| Evidence | Include match context | Audit trail for how proposals were derived |
| Tenant isolation | Propagate tenantUUID JMS property | Enables tenant-filtered message consumption |
Package Structure
src/main/java/com/luqon/boost/documentprofessor/
├── DocumentProfessorApp.java  # Entry point, wiring, shutdown hook
├── TestSender.java  # Test utility
├── config/
│   └── AppConfig.java  # Loads config.properties
├── model/
│   ├── MatchRule.java  # Rule definition
│   ├── RuleAction.java  # Action definition
│   ├── InterpretationRequest.java  # Input message
│   ├── InterpretationResult.java  # Output with toJson()
│   ├── FieldProposal.java  # Extracted field proposal
│   ├── PartyProposal.java  # Party/customer proposal
│   └── DocumentTypeProposal.java  # Document type proposal
├── engine/
│   ├── RuleMatcher.java  # EXACT/CONTAINS/REGEX matching
│   ├── ValueConverter.java  # Type conversions
│   └── RuleEngine.java  # Orchestrates rule processing
├── messaging/
│   ├── MessageHandler.java  # Interface
│   ├── MessageConsumerService.java  # AMQP consumer
│   └── EventPublisher.java  # Publishes results
└── handler/
    └── InterpretationHandler.java  # Parses request, invokes engine
Dependencies
| Dependency | Version | Purpose |
|---|---|---|
| Qpid JMS Client | 2.10.0 | AMQP messaging |
| com.luqon.json | 1h | JSON parsing and building |
| SLF4J + Logback | 2.0.16 / 1.5.16 | Logging |
| BoostMiddleware | 1k | Shared utilities |
Building
cd BOOST.DocumentProfessor
mvn clean package -DskipTests
This produces a fat JAR at target/BOOST.DocumentProfessor-0.0.1-SNAPSHOT.jar.
Running
java -jar target/BOOST.DocumentProfessor-0.0.1-SNAPSHOT.jar
The service will:
- Load configuration from config.properties
- Connect to the AMQP broker
- Start consuming messages from cmd.document.interpret
- Process interpretation requests and publish results
Graceful Shutdown
The service registers a shutdown hook. Send SIGTERM or press Ctrl+C to stop it gracefully.
Testing
Using TestSender Utility
java -cp target/BOOST.DocumentProfessor-0.0.1-SNAPSHOT.jar \
  com.luqon.boost.documentprofessor.TestSender [broker_url]
This sends a test message with sample rules and listens for results.
Sample Test Message
The TestSender includes rules for:
- Detecting "Invoice" keyword → SET_TYPE
- Extracting invoice number via regex → EXTRACT_FIELD
- Extracting organization number → EXTRACT_FIELD
- Extracting amount due with CURRENCY conversion → EXTRACT_FIELD
- Matching "Acme Corporation" → SET_PARTY
Rule Processing Flow
1. Sort rules by priority (ascending)
2. For each enabled rule:
   a. Try matching against extractedText
   b. If no match, try matching against clientFilename
   c. If matched:
      - Record matched rule UUID
      - Process each enabled action:
        - SET_PARTY → Add PartyProposal
        - SET_TYPE → Set DocumentTypeProposal
        - SET_FIELD → Add FieldProposal with fixed value
        - EXTRACT_FIELD → Add FieldProposal with extracted value
      - If stopOnMatch: break
3. Return InterpretationResult with all proposals
Error Handling
| Scenario | Behavior |
|---|---|
| No extracted text | Return error result, publish to error queue |
| No rules provided | Return success with empty proposals |
| Invalid regex pattern | Log warning, skip rule, continue processing |
| Parse error | Exception thrown, message NOT acknowledged |
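The processing flow and the first three error scenarios above can be sketched together. This is a simplified illustration, not the actual RuleEngine: the names are hypothetical, rules are reduced to regex patterns, action processing is left as a comment, and treating blank text as missing is an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

/** Hypothetical sketch of the rule processing loop; not the real RuleEngine. */
final class RuleFlowSketch {

    record Rule(String uuid, int priority, boolean isEnabled, String pattern, boolean stopOnMatch) {}

    record Result(boolean success, List<String> matchedRuleUUIDs, String errorMessage) {}

    static Result process(String extractedText, String clientFilename, List<Rule> rules) {
        // No extracted text: return an error result (assumption: blank counts as missing).
        if (extractedText == null || extractedText.isBlank()) {
            return new Result(false, List.of(), "No extracted text provided");
        }
        // No rules: success with empty proposals. Lower priority number runs first.
        List<Rule> ordered = new ArrayList<>(rules == null ? List.<Rule>of() : rules);
        ordered.sort(Comparator.comparingInt(Rule::priority));

        List<String> matched = new ArrayList<>();
        for (Rule rule : ordered) {
            if (!rule.isEnabled() || rule.pattern() == null) {
                continue;
            }
            Pattern compiled;
            try {
                compiled = Pattern.compile(rule.pattern());
            } catch (PatternSyntaxException e) {
                // Invalid regex: warn, skip this rule, keep processing the rest.
                System.err.println("Skipping rule " + rule.uuid() + ": invalid regex");
                continue;
            }
            // Try the extracted text first, then fall back to the filename.
            boolean hit = compiled.matcher(extractedText).find()
                    || (clientFilename != null && compiled.matcher(clientFilename).find());
            if (!hit) {
                continue;
            }
            matched.add(rule.uuid());
            // Enabled actions would be processed here to build proposals.
            if (rule.stopOnMatch()) {
                break;
            }
        }
        return new Result(true, matched, null);
    }
}
```

Sorting a copy of the list keeps the caller's rule order untouched, which matters if the same rule list is reused across messages.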
Message Resilience
DocumentProfessor uses CLIENT_ACKNOWLEDGE mode to ensure no messages are lost:
- Messages are only acknowledged after successful processing
- If an exception occurs, the message is NOT acknowledged
- The broker will automatically redeliver unacknowledged messages
- Parse errors and unexpected exceptions trigger redelivery
- Business logic errors (no text provided) publish error events and acknowledge
For detailed information about message resilience patterns, see Multi-Tenant Messaging.
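The acknowledgement rules above can be illustrated without a broker. The interfaces below are hypothetical stand-ins for the JMS message, the handler, and the event publisher, and signalling a business error with a dedicated exception type is an assumption made to keep the sketch short.

```java
/** Hypothetical sketch of CLIENT_ACKNOWLEDGE handling; stand-in types, no JMS dependency. */
final class AckSketch {

    /** Stand-in for a received message that must be acknowledged explicitly. */
    interface AckableMessage {
        String body();
        void acknowledge();
    }

    interface Handler {
        void handle(String body);
    }

    interface ErrorPublisher {
        void publishFailure(String errorMessage);
    }

    /** Assumed marker for business errors such as "No extracted text provided". */
    static final class BusinessException extends RuntimeException {
        BusinessException(String message) {
            super(message);
        }
    }

    /** Returns true if the message was acknowledged, false if it is left for redelivery. */
    static boolean onMessage(AckableMessage message, Handler handler, ErrorPublisher errors) {
        try {
            handler.handle(message.body());
        } catch (BusinessException e) {
            // Business error: publish an error event, then acknowledge below.
            errors.publishFailure(e.getMessage());
        } catch (RuntimeException e) {
            // Parse error or unexpected failure: do not acknowledge.
            return false;
        }
        message.acknowledge();
        return true;
    }
}
```

Acknowledging only after the handler returns is what lets the broker redeliver a message whose processing failed midway.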
Logging
Logs are written to:
- Console (stdout)
- logs/documentprofessor.log (rolling daily, 30 days retention)
Log levels can be adjusted in logback.xml.
Integration with DocumentManager
DocumentProfessor returns proposals, not decisions. DocumentManager:
- Receives evt.document.interpreted
- Applies confidence thresholds (e.g., auto-accept if > 0.85)
- Resolves conflicts (multiple party candidates)
- Persists accepted values to database
- Updates document status (READY / NEEDS_REVIEW / FAILED)