BOOST.DocumentProfessor

A stateless, message-driven Java 21 service that interprets extracted document text using configurable rules. It consumes interpretation requests from an AMQP queue, applies pattern matching and field extraction rules, and produces proposals for document classification, party matching, and field values.

Architecture

[AMQP: cmd.document.interpret]
        │
        ▼
  MessageConsumerService (CLIENT_ACKNOWLEDGE)
        │
        ▼
  InterpretationHandler
        │
        ├──► Parse rules and actions from message
        └──► RuleEngine.process()
                ├──► RuleMatcher (EXACT/CONTAINS/REGEX)
                ├──► ValueConverter (NUMBER/DATE/CURRENCY/BOOLEAN)
                └──► Generate proposals
        │
        ▼
  EventPublisher ──► [AMQP: evt.document.interpreted]
               └──► [AMQP: evt.document.interpretation.failed] (on failure)

Pipeline Position

DocumentExtractor (bytes → text)
        │
        ▼
DocumentProfessor (text → proposals)  ◄── You are here
        │
        ▼
DocumentManager (proposals → authoritative state)

Message Flow

Input Message

Send to queue: cmd.document.interpret

JMS Property: tenantUUID must be set for proper tenant routing.

json
{
  "documentUUID": "742dba26-...",
  "entityUUID": "5d16c711-...",
  "organizationId": "org-123",
  "clientFilename": "Invoice-12345.pdf",
  "contentType": "application/pdf",
  "extractedText": "Invoice\nInvoice number: INV-2025-001\nDate: January 25, 2025...",
  "currentPartyUUID": "party-acme-001",
  "currentDocumentTypeUUID": null,
  "currentState": "INBOX",
  "fromEmailDomain": "acme.com",
  "rules": [
    {
      "uuid": "rule-001",
      "priority": 1,
      "isEnabled": true,
      "matchMode": "CONTAINS",
      "keyword": "Invoice",
      "stopOnMatch": false,
      "actions": [
        {
          "uuid": "action-001",
          "actionType": "SET_TYPE",
          "targetUUID": "doctype-invoice-001",
          "isEnabled": true
        }
      ]
    },
    {
      "uuid": "rule-002",
      "priority": 2,
      "isEnabled": true,
      "partyUUID": "party-acme-001",
      "pattern": "(?im)^\\s*Invoice\\s+number[:\\s]*(\\S+)",
      "actions": [
        {
          "uuid": "action-002",
          "actionType": "EXTRACT_TO",
          "target": "invoiceNumber",
          "extractGroup": 1,
          "isEnabled": true
        }
      ]
    }
  ]
}

| Field | Description |
|---|---|
| documentUUID | Unique document identifier |
| entityUUID | Parent entity this document belongs to |
| clientFilename | Original filename (used for filename-based matching) |
| extractedText | Text content to analyze |
| currentPartyUUID | Current linked party (for precondition matching) |
| currentDocumentTypeUUID | Current document type (for precondition matching) |
| currentState | Current document state (for precondition matching) |
| fromEmailDomain | Sender email domain (for precondition matching) |
| rules | Array of match rules with their actions |

Output Event (Success)

Published to queue: evt.document.interpreted

JMS Property: tenantUUID is set to enable tenant-filtered consumption by DocumentManager.

json
{
  "documentUUID": "742dba26-...",
  "entityUUID": "5d16c711-...",
  "tenantUUID": "org-123",
  "success": true,
  "processingTimeMs": 45,
  "documentTypeProposal": {
    "documentTypeUUID": "doctype-invoice-001",
    "confidence": 0.8,
    "ruleUUID": "rule-001",
    "evidence": "Matched keyword 'Invoice' in extractedText"
  },
  "partyProposals": [
    {
      "partyUUID": "party-acme-001",
      "confidence": 0.8,
      "ruleUUID": "rule-005",
      "matchedKeyword": "Acme Corporation",
      "evidence": "Matched keyword 'Acme Corporation' in extractedText"
    }
  ],
  "fieldProposals": [
    {
      "fieldName": "invoiceNumber",
      "rawValue": "INV-2025-001",
      "confidence": 0.9,
      "ruleUUID": "rule-002",
      "evidence": "Matched pattern 'Invoice number[:\\s]*(\\S+)' in extractedText"
    },
    {
      "fieldName": "totalAmount",
      "rawValue": "1,234.56",
      "convertedValue": 1234.56,
      "convertType": "CURRENCY",
      "confidence": 0.9,
      "ruleUUID": "rule-004"
    }
  ],
  "matchedRuleUUIDs": ["rule-001", "rule-002", "rule-004", "rule-005"]
}

Error Event

Published to queue: evt.document.interpretation.failed

JMS Property: tenantUUID is set to enable tenant-filtered consumption.

json
{
  "documentUUID": "742dba26-...",
  "entityUUID": "5d16c711-...",
  "tenantUUID": "org-123",
  "success": false,
  "processingTimeMs": 12,
  "errorMessage": "No extracted text provided"
}

Preconditions

Rules can be scoped to specific contexts using preconditions. If a precondition is set, the rule only applies when the document's current context matches:

| Precondition | Description |
|---|---|
| partyUUID | Only apply rule when document is linked to this party |
| documentTypeUUID | Only apply rule when document has this type |
| state | Only apply rule when document is in this state |
| fromEmailDomain | Only apply rule when email sender domain matches |

Precondition Example

json
{
  "uuid": "rule-invoice-extract",
  "priority": 10,
  "isEnabled": true,
  "partyUUID": "party-acme-001",
  "documentTypeUUID": "doctype-invoice",
  "pattern": "(?im)^\\s*Invoice\\s+Date[:\\s]*(\\d{2}\\.\\d{2}\\.\\d{4})",
  "actions": [...]
}

This rule only runs when:

  1. Document is linked to party party-acme-001
  2. Document type is doctype-invoice

The request must include context fields for precondition evaluation:

json
{
  "documentUUID": "742dba26-...",
  "extractedText": "...",
  "currentPartyUUID": "party-acme-001",
  "currentDocumentTypeUUID": "doctype-invoice",
  "currentState": "REGISTRATION",
  "fromEmailDomain": "acme.com",
  "rules": [...]
}
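
The precondition check described above can be sketched as follows. This is a minimal illustration, not the service's actual code: the class and record names are hypothetical, and it assumes an unset (null) precondition places no restriction and that set preconditions are compared with exact string equality.

```java
// Hypothetical sketch of precondition evaluation; names are illustrative only.
final class PreconditionSketch {
    // The four optional preconditions a rule may carry (null = not set).
    record Preconditions(String partyUUID, String documentTypeUUID,
                         String state, String fromEmailDomain) {}

    // The context fields sent with the interpretation request.
    record Context(String currentPartyUUID, String currentDocumentTypeUUID,
                   String currentState, String fromEmailDomain) {}

    // A rule applies only if every precondition that is set matches the context.
    static boolean applies(Preconditions rule, Context doc) {
        return satisfied(rule.partyUUID(), doc.currentPartyUUID())
            && satisfied(rule.documentTypeUUID(), doc.currentDocumentTypeUUID())
            && satisfied(rule.state(), doc.currentState())
            && satisfied(rule.fromEmailDomain(), doc.fromEmailDomain());
    }

    // Assumption: an unset precondition never blocks a rule; a set one needs an exact match.
    private static boolean satisfied(String required, String actual) {
        return required == null || required.equals(actual);
    }
}
```

With the example rule above, a request whose currentDocumentTypeUUID is null would not satisfy the documentTypeUUID precondition, so the rule would be skipped.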

Match Modes

| Mode | Description | Example |
|---|---|---|
| EXACT | Full text must match exactly (case-insensitive) | keyword: "Invoice" matches only "Invoice" or "invoice" |
| CONTAINS | Text must contain the keyword | keyword: "Invoice" matches "Tax Invoice #123" |
| REGEX | Text must match the regular expression | pattern: "INV-\\d+" matches "INV-2025-001" |
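
The three modes can be sketched roughly as below. This is a hypothetical illustration rather than the real RuleMatcher: the class name is invented, CONTAINS is assumed to be case-insensitive (the table only states this for EXACT), and REGEX is assumed to succeed when the pattern is found anywhere in the text.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical sketch of the three match modes; not the service's RuleMatcher.
final class MatchModeSketch {
    enum MatchMode { EXACT, CONTAINS, REGEX }

    static boolean matches(MatchMode mode, String expression, String text) {
        if (expression == null || text == null) {
            return false;
        }
        return switch (mode) {
            // The whole text equals the keyword, ignoring case.
            case EXACT -> text.equalsIgnoreCase(expression);
            // The keyword occurs somewhere in the text (case-insensitivity is an assumption).
            case CONTAINS -> text.toLowerCase(Locale.ROOT)
                                 .contains(expression.toLowerCase(Locale.ROOT));
            // The pattern is found somewhere in the text (find(), not a full-text match).
            case REGEX -> Pattern.compile(expression).matcher(text).find();
        };
    }
}
```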

Pattern Priority

If pattern is set, it takes precedence regardless of matchMode. This allows combining keyword detection with capture groups in a single regex:

json
{
  "pattern": "(?im)^\\s*Fakturanummer\\s*[:\\-]?\\s*(\\d{4,})\\s*$",
  "actions": [
    {
      "actionType": "EXTRACT_TO",
      "target": "invoiceNumber",
      "extractGroup": 1
    }
  ]
}

Embedded Regex Flags

Patterns can include embedded flags at the start:

| Flag | Description |
|---|---|
| (?i) | Case-insensitive matching |
| (?m) | Multi-line mode (^ and $ match line boundaries) |
| (?s) | Dot matches newlines |
| (?im) | Combined case-insensitive + multi-line |

Regex Flags Field

Alternatively, use the regexFlags field:

| Flag | Description |
|---|---|
| i | Case-insensitive matching |
| m | Multi-line mode (^ and $ match line boundaries) |
| s | Dot matches newlines |
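
One plausible way to turn a regexFlags value such as "im" into java.util.regex flags is shown below. The class and method names are hypothetical, and rejecting unknown letters is an assumption; the service may ignore them instead.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: maps the letters of a regexFlags string to Pattern flag bits.
final class RegexFlagsSketch {
    static int toPatternFlags(String regexFlags) {
        int flags = 0;
        if (regexFlags == null) {
            return flags;
        }
        for (char c : regexFlags.toCharArray()) {
            switch (c) {
                case 'i' -> flags |= Pattern.CASE_INSENSITIVE;
                case 'm' -> flags |= Pattern.MULTILINE;
                case 's' -> flags |= Pattern.DOTALL;
                // Assumption: unknown letters are treated as a configuration error.
                default -> throw new IllegalArgumentException("Unsupported regex flag: " + c);
            }
        }
        return flags;
    }
}
```

The result can be passed to Pattern.compile(pattern, flags), which has the same effect as prefixing the pattern with the equivalent embedded flags.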

Action Types

| Action Type | Description | Required Fields |
|---|---|---|
| SET_TYPE | Propose a document type | targetUUID |
| SET_PARTY | Propose a party/customer | targetUUID |
| SET_FIELD | Set a field to a fixed value | targetField, targetValue |
| EXTRACT_FIELD | Extract field value from regex match | targetField, extractGroup |
| EXTRACT_TO | Extract field value from regex match, with optional custom date format | target, extractGroup (convertDate is optional) |

EXTRACT_FIELD Example

For the regex Invoice number[:\s]*(\S+) with extractGroup: 1:

  • Input: "Invoice number: INV-2025-001"
  • Group 0 (full match): "Invoice number: INV-2025-001"
  • Group 1: "INV-2025-001" ← This is extracted
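
The group extraction above maps directly onto java.util.regex. The sketch below is illustrative only: the class and method names are hypothetical, and returning an empty Optional when the pattern does not match or the group does not exist is an assumption about the desired behavior.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of capture-group extraction for EXTRACT_FIELD / EXTRACT_TO.
final class ExtractGroupSketch {
    static Optional<String> extract(String regex, int extractGroup, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        // No match, or the requested group index is outside the pattern's groups.
        if (!matcher.find() || extractGroup < 0 || extractGroup > matcher.groupCount()) {
            return Optional.empty();
        }
        // group(0) is the full match; group(n) is the n-th capture group (may be null).
        return Optional.ofNullable(matcher.group(extractGroup));
    }
}
```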

EXTRACT_TO Example

EXTRACT_TO is an enhanced version of EXTRACT_FIELD that uses target instead of targetField and supports convertDate for custom date format parsing:

json
{
  "uuid": "rule-invoice-date",
  "pattern": "(?im)^\\s*Fakturadatum\\s*[:\\-]?\\s*(\\d{2}\\.\\d{2}\\.\\d{4})",
  "actions": [
    {
      "actionType": "EXTRACT_TO",
      "target": "invoiceDate",
      "extractGroup": 1,
      "convertDate": "dd.MM.yyyy"
    }
  ]
}

This extracts "25.01.2026" and converts it to ISO format "2026-01-25".

Value Conversion

The convertType field enables automatic type conversion:

| Convert Type | Input Example | Output |
|---|---|---|
| NUMBER | "1,234.56" | 1234.56 |
| CURRENCY | "$1,234.56" | 1234.56 |
| DATE | "2025-01-25" | 2025-01-25 (ISO format) |
| BOOLEAN | "yes", "true", "1" | true |
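
A rough sketch of these conversions follows. It is not the actual ValueConverter: the names are hypothetical, and it assumes "," as the thousands separator, "." as the decimal point, and that any value other than "yes", "true", or "1" converts to false.

```java
import java.math.BigDecimal;
import java.time.LocalDate;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the four conversions; locale handling is deliberately simplistic.
final class ValueConversionSketch {
    private static final Set<String> TRUE_WORDS = Set.of("yes", "true", "1");

    // NUMBER: drop thousands separators, then parse as a decimal.
    static BigDecimal toNumber(String raw) {
        return new BigDecimal(raw.trim().replace(",", ""));
    }

    // CURRENCY: keep only digits, the decimal point, and a minus sign, then parse.
    static BigDecimal toCurrency(String raw) {
        return new BigDecimal(raw.replaceAll("[^0-9.\\-]", ""));
    }

    // DATE: input is assumed to already be ISO (yyyy-MM-dd); invalid input throws.
    static LocalDate toDate(String raw) {
        return LocalDate.parse(raw.trim());
    }

    // BOOLEAN: true only for the listed words, compared case-insensitively.
    static boolean toBoolean(String raw) {
        return TRUE_WORDS.contains(raw.trim().toLowerCase(Locale.ROOT));
    }
}
```

BigDecimal is used here instead of double so that monetary amounts are not subject to binary floating-point rounding; whether the service does the same is not stated in this document.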

Custom Date Format

Use convertDate with EXTRACT_TO to specify the input date format:

| Format Pattern | Input Example | Output |
|---|---|---|
| dd.MM.yyyy | "25.01.2026" | 2026-01-25 |
| MM/dd/yyyy | "01/25/2026" | 2026-01-25 |
| d MMM yyyy | "25 Jan 2026" | 2026-01-25 |
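
The format patterns in this table follow java.time.format.DateTimeFormatter syntax, so the conversion can be sketched as below. The class and method names are hypothetical, and using Locale.ENGLISH for month names such as "Jan" is an assumption.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical sketch of convertDate handling: parse with the given pattern, emit ISO-8601.
final class ConvertDateSketch {
    static String toIsoDate(String rawValue, String convertDate) {
        // Assumption: textual month names are English.
        DateTimeFormatter inputFormat = DateTimeFormatter.ofPattern(convertDate, Locale.ENGLISH);
        // LocalDate.toString() produces the ISO yyyy-MM-dd form.
        return LocalDate.parse(rawValue.trim(), inputFormat).toString();
    }
}
```

A value that does not fit the pattern raises DateTimeParseException; how the service reports that case is not covered here.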

Confidence Calculation

Confidence scores are calculated based on match mode:

| Match Mode | Base Confidence |
|---|---|
| EXACT | 1.0 |
| REGEX | 0.9 |
| CONTAINS | 0.8 |

Bonus: +0.05 if matched text length > 10 characters (capped at 1.0)
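
The scoring rule can be sketched as follows. The names are hypothetical, and which string counts as the "matched text" (the full match, the keyword, or a capture group) is an assumption left to the caller.

```java
// Hypothetical sketch of the confidence rule: base score by mode, small length bonus, cap at 1.0.
final class ConfidenceSketch {
    enum MatchMode { EXACT, REGEX, CONTAINS }

    static double confidence(MatchMode mode, String matchedText) {
        double base = switch (mode) {
            case EXACT -> 1.0;
            case REGEX -> 0.9;
            case CONTAINS -> 0.8;
        };
        // +0.05 when the matched text is longer than 10 characters.
        double bonus = (matchedText != null && matchedText.length() > 10) ? 0.05 : 0.0;
        // Never exceed 1.0, so EXACT matches stay at 1.0 even with the bonus.
        return Math.min(1.0, base + bonus);
    }
}
```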

Configuration

Configuration is loaded from config.properties (external file or classpath).

properties
# AMQP Broker
broker_url=amqp://172.16.200.32:5672
input_queue=cmd.document.interpret
output_queue=evt.document.interpreted
error_queue=evt.document.interpretation.failed

| Property | Description | Default |
|---|---|---|
| broker_url | AMQP broker connection URL | amqp://localhost:5672 |
| input_queue | Queue to consume interpretation requests | cmd.document.interpret |
| output_queue | Queue for successful interpretations | evt.document.interpreted |
| error_queue | Queue for failed interpretations | evt.document.interpretation.failed |
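
Defaults like these can be layered under the loaded file with java.util.Properties. The sketch below is an assumption about how that might look, not the contents of AppConfig; the class name is hypothetical and the lookup order between external file and classpath is left out.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical sketch: documented defaults that a loaded config.properties can override.
final class ConfigDefaultsSketch {
    static Properties defaults() {
        Properties defaults = new Properties();
        defaults.setProperty("broker_url", "amqp://localhost:5672");
        defaults.setProperty("input_queue", "cmd.document.interpret");
        defaults.setProperty("output_queue", "evt.document.interpreted");
        defaults.setProperty("error_queue", "evt.document.interpretation.failed");
        return defaults;
    }

    // Reads key=value pairs from the source; keys that are absent fall back to the defaults.
    static Properties load(Reader source) {
        Properties props = new Properties(defaults());
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }
}
```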

Tenant Isolation

DocumentProfessor propagates tenant information through the processing pipeline:

  1. Reads the tenant identifier from the incoming message's JSON body (organizationId, falling back to tenantUUID)
  2. Propagates the value through RuleEngine to the InterpretationResult
  3. Sets tenantUUID as a JMS message property on published events

This allows DocumentManager to filter events by tenant using JMS selectors.

java
// Read the tenant identifier: organizationId first, then fall back to tenantUUID
String tenantUUID = json.getString("organizationId");
if (tenantUUID == null) {
    tenantUUID = json.getString("tenantUUID");
}
request.setTenantUUID(tenantUUID);

// Propagate through RuleEngine
result.setTenantUUID(request.getTenantUUID());

// Set JMS property on outgoing event
message.setStringProperty("tenantUUID", result.getTenantUUID());

For detailed information about tenant isolation, see Multi-Tenant Messaging.
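
On the consuming side, a tenant filter is a JMS message selector over the tenantUUID property. The helper below is a hypothetical sketch of building such a selector string; it is not taken from DocumentManager. In selector syntax, string literals use single quotes and an embedded quote is written twice.

```java
// Hypothetical sketch: builds a JMS message selector that matches one tenant's events.
final class TenantSelectorSketch {
    static String forTenant(String tenantUUID) {
        // Double any single quote so the value stays a valid selector string literal.
        return "tenantUUID = '" + tenantUUID.replace("'", "''") + "'";
    }
}
```

The resulting string would typically be passed as the messageSelector argument of the JMS Session.createConsumer(destination, messageSelector) method.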

Key Design Decisions

| Aspect | Choice | Rationale |
|---|---|---|
| Stateless | No database access | Rules and text provided in message; DocumentManager owns persistence |
| Proposals | Return suggestions, not decisions | DocumentManager applies business rules and thresholds |
| Rule priority | Lower number = higher priority | Rules processed in priority order |
| Stop on match | Optional per rule | Allows early termination for exclusive classifications |
| Evidence | Include match context | Audit trail for how proposals were derived |
| Tenant isolation | Propagate tenantUUID JMS property | Enables tenant-filtered message consumption |

Package Structure

src/main/java/com/luqon/boost/documentprofessor/
├── DocumentProfessorApp.java        # Entry point, wiring, shutdown hook
├── TestSender.java                  # Test utility
├── config/
│   └── AppConfig.java               # Loads config.properties
├── model/
│   ├── MatchRule.java               # Rule definition
│   ├── RuleAction.java              # Action definition
│   ├── InterpretationRequest.java   # Input message
│   ├── InterpretationResult.java    # Output with toJson()
│   ├── FieldProposal.java           # Extracted field proposal
│   ├── PartyProposal.java           # Party/customer proposal
│   └── DocumentTypeProposal.java    # Document type proposal
├── engine/
│   ├── RuleMatcher.java             # EXACT/CONTAINS/REGEX matching
│   ├── ValueConverter.java          # Type conversions
│   └── RuleEngine.java              # Orchestrates rule processing
├── messaging/
│   ├── MessageHandler.java          # Interface
│   ├── MessageConsumerService.java  # AMQP consumer
│   └── EventPublisher.java          # Publishes results
└── handler/
    └── InterpretationHandler.java   # Parses request, invokes engine

Dependencies

| Dependency | Version | Purpose |
|---|---|---|
| Qpid JMS Client | 2.10.0 | AMQP messaging |
| com.luqon.json | 1h | JSON parsing and building |
| SLF4J + Logback | 2.0.16 / 1.5.16 | Logging |
| BoostMiddleware | 1k | Shared utilities |

Building

bash
cd BOOST.DocumentProfessor
mvn clean package -DskipTests

This produces a fat JAR at target/BOOST.DocumentProfessor-0.0.1-SNAPSHOT.jar.

Running

bash
java -jar target/BOOST.DocumentProfessor-0.0.1-SNAPSHOT.jar

The service will:

  1. Load configuration from config.properties
  2. Connect to the AMQP broker
  3. Start consuming messages from cmd.document.interpret
  4. Process interpretation requests and publish results

Graceful Shutdown

The service registers a JVM shutdown hook. Send SIGTERM or press Ctrl+C to stop it gracefully.
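
A shutdown hook of this kind is usually registered through Runtime.addShutdownHook. The sketch below shows the general shape only; the class name and the contents of stop() are hypothetical, and the real wiring lives in DocumentProfessorApp.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a shutdown hook that stops the service exactly once.
final class ShutdownSketch {
    private final AtomicBoolean running = new AtomicBoolean(true);

    // The JVM runs registered hooks on SIGTERM, Ctrl+C, or normal exit.
    void registerShutdownHook() {
        Runtime.getRuntime().addShutdownHook(new Thread(this::stop, "shutdown-hook"));
    }

    // Idempotent: only the first call flips the flag and performs cleanup.
    void stop() {
        if (running.compareAndSet(true, false)) {
            // Placeholder: close the consumer, publisher, and broker connection here.
        }
    }

    boolean isRunning() {
        return running.get();
    }
}
```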

Testing

Using TestSender Utility

bash
java -cp target/BOOST.DocumentProfessor-0.0.1-SNAPSHOT.jar \
  com.luqon.boost.documentprofessor.TestSender [broker_url]

This sends a test message with sample rules and listens for results.

Sample Test Message

The TestSender includes rules for:

  • Detecting "Invoice" keyword → SET_TYPE
  • Extracting invoice number via regex → EXTRACT_FIELD
  • Extracting organization number → EXTRACT_FIELD
  • Extracting amount due with CURRENCY conversion → EXTRACT_FIELD
  • Matching "Acme Corporation" → SET_PARTY

Rule Processing Flow

1. Sort rules by priority (ascending)
2. For each enabled rule:
   a. Try matching against extractedText
   b. If no match, try matching against clientFilename
   c. If matched:
      - Record matched rule UUID
      - Process each enabled action:
        - SET_PARTY → Add PartyProposal
        - SET_TYPE → Set DocumentTypeProposal
        - SET_FIELD → Add FieldProposal with fixed value
        - EXTRACT_FIELD → Add FieldProposal with extracted value
        - EXTRACT_TO → Add FieldProposal with extracted value (date-converted if convertDate is set)
      - If stopOnMatch: break
3. Return InterpretationResult with all proposals
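
The control flow above (priority sort, enabled filter, filename fallback, stopOnMatch) can be sketched in a few lines. This is a simplified illustration, not the actual RuleEngine: the record and method names are hypothetical, matching is reduced to a case-insensitive keyword check, and action processing is left as a comment.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of the rule loop; returns the UUIDs of matched rules in processing order.
final class RuleFlowSketch {
    record Rule(String uuid, int priority, boolean enabled, String keyword, boolean stopOnMatch) {}

    static List<String> matchedRuleUUIDs(List<Rule> rules, String extractedText, String clientFilename) {
        // 1. Sort by priority, ascending (lower number = higher priority).
        List<Rule> ordered = new ArrayList<>(rules);
        ordered.sort(Comparator.comparingInt(Rule::priority));

        List<String> matched = new ArrayList<>();
        for (Rule rule : ordered) {
            if (!rule.enabled()) {
                continue;
            }
            // 2a/2b. Try the extracted text first, then fall back to the filename.
            boolean hit = contains(extractedText, rule.keyword())
                       || contains(clientFilename, rule.keyword());
            if (!hit) {
                continue;
            }
            // 2c. Record the match; the rule's enabled actions would be processed here.
            matched.add(rule.uuid());
            if (rule.stopOnMatch()) {
                break;
            }
        }
        return matched;
    }

    private static boolean contains(String haystack, String keyword) {
        return haystack != null && keyword != null
            && haystack.toLowerCase(Locale.ROOT).contains(keyword.toLowerCase(Locale.ROOT));
    }
}
```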

Error Handling

| Scenario | Behavior |
|---|---|
| No extracted text | Return error result, publish to error queue |
| No rules provided | Return success with empty proposals |
| Invalid regex pattern | Log warning, skip rule, continue processing |
| Parse error | Exception thrown, message NOT acknowledged |
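
The "invalid regex pattern" row amounts to catching PatternSyntaxException at compile time. A minimal sketch, with hypothetical names and with System.err standing in for the service's SLF4J logger:

```java
import java.util.Optional;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical sketch: compile a rule's pattern, or warn and signal that the rule should be skipped.
final class SafePatternSketch {
    static Optional<Pattern> compileOrSkip(String ruleUUID, String pattern) {
        try {
            return Optional.of(Pattern.compile(pattern));
        } catch (PatternSyntaxException e) {
            // Stand-in for a logger warning; processing continues with the next rule.
            System.err.println("WARN skipping rule " + ruleUUID
                + ": invalid pattern (" + e.getDescription() + ")");
            return Optional.empty();
        }
    }
}
```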

Message Resilience

DocumentProfessor uses CLIENT_ACKNOWLEDGE mode to ensure no messages are lost:

  • Messages are only acknowledged after successful processing
  • If an exception occurs, the message is NOT acknowledged
  • The broker will automatically redeliver unacknowledged messages
  • Parse errors and unexpected exceptions trigger redelivery
  • Business logic errors (no text provided) publish error events and acknowledge
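
The acknowledgement rules in this list reduce to a small decision function. The sketch below is illustrative only: the interfaces and the BusinessException type are hypothetical stand-ins, and no JMS classes are used so that the example stays self-contained.

```java
// Hypothetical sketch of the ack decision under CLIENT_ACKNOWLEDGE.
final class AckDecisionSketch {
    // Stand-in for a business-level failure such as "No extracted text provided".
    static final class BusinessException extends RuntimeException {
        BusinessException(String message) {
            super(message);
        }
    }

    interface Handler {
        void handle(String body);
    }

    interface ErrorPublisher {
        void publishFailure(String body, String errorMessage);
    }

    // Returns true when the caller should acknowledge the message.
    static boolean shouldAcknowledge(String body, Handler handler, ErrorPublisher errors) {
        try {
            handler.handle(body);
            return true;  // processed successfully: acknowledge
        } catch (BusinessException e) {
            errors.publishFailure(body, e.getMessage());
            return true;  // error event published: acknowledge
        } catch (RuntimeException e) {
            return false; // parse or unexpected error: leave unacknowledged for redelivery
        }
    }
}
```

In the consumer, a true result would be followed by the JMS Message.acknowledge() call; a false result leaves the message unacknowledged.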

For detailed information about message resilience patterns, see Multi-Tenant Messaging.

Logging

Logs are written to:

  • Console (stdout)
  • logs/documentprofessor.log (rolling daily, 30 days retention)

Log levels can be adjusted in logback.xml.

Integration with DocumentManager

DocumentProfessor returns proposals, not decisions. DocumentManager:

  1. Receives evt.document.interpreted
  2. Applies confidence thresholds (e.g., auto-accept if > 0.85)
  3. Resolves conflicts (multiple party candidates)
  4. Persists accepted values to database
  5. Updates document status (READY / NEEDS_REVIEW / FAILED)