BOOST.DocumentExtractor

A stateless, message-driven Java 21 service that extracts text from PDF and image documents. It consumes commands from an AMQP queue, downloads files from S3-compatible storage, extracts text using PDFBox (for embedded text) or Tesseract OCR (as fallback), stores the result back to S3, and publishes a completion event.

Architecture

[AMQP: cmd.document.extractText]
        │
        ▼
  MessageConsumerService (CLIENT_ACKNOWLEDGE)
        │
        ▼
  DocumentExtractionHandler
        │
        ├──► S3StorageService.download(bucket, key)
        ├──► ExtractionOrchestrator
        │       ├──► PdfTextExtractor (PDFBox)
        │       └──► OcrTextExtractor (Tesseract/tess4j) [fallback]
        └──► S3StorageService.upload(extractedText)
        │
        ▼
  EventPublisher ──► [AMQP: evt.document.text.extracted]
               └──► [AMQP: evt.document.text.error] (on failure)

Message Flow

Input Message

Send to queue: cmd.document.extractText

JMS Property: tenantUUID must be set for proper tenant routing.

json

{
  "UUID": "742dba26-...",
  "entityUUID": "5d16c711-...",
  "organizationId": "org-123",
  "clientFilename": "InvoiceForm-459.pdf",
  "serverFilename": "742dba26-.../InvoiceForm-459.pdf",
  "contentType": "application/pdf",
  "location": "https://s3.se-01.westbahr.net/boost.app.{tenantId}",
  "metaData": { },
  "createdByUUID": "SYSTEM",
  "createdAt": "2026-01-24T14:12:33"
}

Field	Description
`UUID`	Unique document identifier
`entityUUID`	Parent entity this document belongs to
`organizationId`	Tenant identifier; propagated as `tenantUUID` on outgoing events
`clientFilename`	Original filename uploaded by user
`serverFilename`	S3 object key
`contentType`	MIME type (`application/pdf`, `image/png`, etc.)
`location`	S3 endpoint + bucket (e.g., `https://s3.example.com/boost.app.tenant1`)
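
Because `location` carries the S3 endpoint and the bucket in a single URL, it has to be split before the storage client can be called. The following is a minimal sketch of that split using only `java.net.URI`. The `S3Location` record and its methods are illustrative names, not the service's actual `TenantResolver`, and the `boost.app.` bucket prefix is an assumption inferred from the sample messages.

```java
import java.net.URI;

/** Hypothetical holder for the two parts of the "location" field. */
record S3Location(String endpoint, String bucket) {

    /** Assumed bucket naming scheme, inferred from the sample messages. */
    private static final String BUCKET_PREFIX = "boost.app.";

    /** Splits "scheme://host[:port]/bucket" into endpoint and bucket. */
    static S3Location parse(String location) {
        URI uri = URI.create(location);
        String path = uri.getPath() == null ? "" : uri.getPath();
        // Drop leading and trailing slashes around the bucket name.
        String bucket = path.replaceAll("^/+|/+$", "");
        if (uri.getScheme() == null || uri.getAuthority() == null
                || bucket.isEmpty() || bucket.contains("/")) {
            throw new IllegalArgumentException("Expected <endpoint>/<bucket>, got: " + location);
        }
        return new S3Location(uri.getScheme() + "://" + uri.getAuthority(), bucket);
    }

    /** Returns the tenant part of the bucket name, assuming the prefix above. */
    String tenantId() {
        if (!bucket.startsWith(BUCKET_PREFIX)) {
            throw new IllegalStateException("Bucket does not follow the assumed naming scheme: " + bucket);
        }
        return bucket.substring(BUCKET_PREFIX.length());
    }
}
```

For `https://s3.example.com/boost.app.tenant1` this yields the endpoint `https://s3.example.com` and the bucket `boost.app.tenant1`. The `{tenantId}` placeholder shown in the sample message must be replaced with a real value first, since braces are not valid in a URI.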

Output Event (Success)

Published to queue: evt.document.text.extracted

JMS Property: tenantUUID is set to enable tenant-filtered consumption by DocumentManager.

json

{
  "documentUUID": "742dba26-...",
  "entityUUID": "5d16c711-...",
  "tenantUUID": "org-123",
  "textRef": "742dba26-.../extracted-text/InvoiceForm-459.txt",
  "textBucket": "boost.app.{tenantId}",
  "engineUsed": "PDFBOX",
  "pageCount": 3,
  "confidence": 1.0,
  "detectedLanguage": "eng",
  "processingTimeMs": 1234,
  "success": true
}

Field	Description
`textRef`	S3 object key where extracted text is stored
`textBucket`	S3 bucket name
`engineUsed`	`PDFBOX` or `TESSERACT`
`pageCount`	Number of pages processed
`confidence`	Extraction confidence (1.0 for PDFBox, ~0.85 for OCR)
`processingTimeMs`	Time taken to process

Error Event

Published to queue: evt.document.text.error

JMS Property: tenantUUID is set to enable tenant-filtered consumption.

json

{
  "documentUUID": "742dba26-...",
  "entityUUID": "5d16c711-...",
  "tenantUUID": "org-123",
  "textBucket": "boost.app.{tenantId}",
  "processingTimeMs": 234,
  "success": false,
  "errorMessage": "Unsupported content type: application/zip"
}
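
The success and error events share most of their fields, so one payload type can cover both. The sketch below mirrors the field names from the JSON examples above. It is an illustration of the shape, not the service's actual `ExtractionEvent` class, and the use of `null` for fields absent from the error event is an assumption.

```java
/** Illustrative payload type; field names follow the JSON examples above. */
record ExtractionEvent(
        String documentUUID, String entityUUID, String tenantUUID,
        String textRef, String textBucket, String engineUsed,
        Integer pageCount, Double confidence, String detectedLanguage,
        long processingTimeMs, boolean success, String errorMessage) {

    /** Builds a success event; errorMessage stays null. */
    static ExtractionEvent success(String documentUUID, String entityUUID, String tenantUUID,
                                   String textRef, String textBucket, String engineUsed,
                                   int pageCount, double confidence, String detectedLanguage,
                                   long processingTimeMs) {
        return new ExtractionEvent(documentUUID, entityUUID, tenantUUID, textRef, textBucket,
                engineUsed, pageCount, confidence, detectedLanguage, processingTimeMs, true, null);
    }

    /** Builds an error event; extraction-specific fields stay null (assumption). */
    static ExtractionEvent error(String documentUUID, String entityUUID, String tenantUUID,
                                 String textBucket, long processingTimeMs, String errorMessage) {
        return new ExtractionEvent(documentUUID, entityUUID, tenantUUID, null, textBucket,
                null, null, null, null, processingTimeMs, false, errorMessage);
    }
}
```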

Supported File Types

Content Type	Extraction Method
`application/pdf`	PDFBox (embedded text), falls back to OCR if text < 50 chars
`image/png`	Tesseract OCR
`image/jpeg`	Tesseract OCR
`image/tiff`	Tesseract OCR
`image/bmp`	Tesseract OCR
`image/gif`	Tesseract OCR

Extraction Logic

1. Parse the content type from the incoming message.
2. If the document is a PDF:
   - Try PDFBox PDFTextStripper with layout preservation.
   - If the extracted text is shorter than the threshold (default 50 chars), fall back to Tesseract OCR.
3. If the document is an image:
   - Use Tesseract OCR directly.
4. If the format is unsupported:
   - Publish an error event and acknowledge the message.
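
The decision described above can be sketched in a few lines. This is not the actual `ExtractionOrchestrator`: the two extractors are passed in as plain `Supplier`s so the example stays self-contained, the class and method names are hypothetical, and ignoring surrounding whitespace when measuring the PDFBox output is an assumption.

```java
import java.util.Locale;
import java.util.Set;
import java.util.function.Supplier;

/** Illustrative engine selection; the real extractors are stubbed as suppliers. */
final class ExtractionDecision {

    enum Engine { PDFBOX, TESSERACT }

    record Result(Engine engine, String text) {}

    private static final Set<String> IMAGE_TYPES =
            Set.of("image/png", "image/jpeg", "image/tiff", "image/bmp", "image/gif");

    static Result extract(String contentType, int minTextLength,
                          Supplier<String> pdfBox, Supplier<String> ocr) {
        String type = contentType == null ? "" : contentType.trim().toLowerCase(Locale.ROOT);
        if (type.equals("application/pdf")) {
            String embedded = pdfBox.get();
            // Assumption: leading/trailing whitespace does not count toward the threshold.
            if (embedded != null && embedded.strip().length() >= minTextLength) {
                return new Result(Engine.PDFBOX, embedded);
            }
            // Too little embedded text, likely a scanned PDF: fall back to OCR.
            return new Result(Engine.TESSERACT, ocr.get());
        }
        if (IMAGE_TYPES.contains(type)) {
            return new Result(Engine.TESSERACT, ocr.get());
        }
        // The caller is expected to turn this into an error event and acknowledge.
        throw new IllegalArgumentException("Unsupported content type: " + contentType);
    }
}
```

OCR is only invoked when it is needed, which matches the rationale that most PDFs carry embedded text and OCR is the expensive path.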

Configuration

Configuration is loaded from config.properties (external file or classpath).

properties

# AMQP Broker
broker_url=amqp://172.16.200.32:5672
input_queue=cmd.document.extractText
output_queue=evt.document.text.extracted
error_queue=evt.document.text.error

# S3 Storage
s3_endpoint=https://s3.se-01.westbahr.net
s3_access_key=your-access-key
s3_secret_key=your-secret-key
s3_region=us-east-1

# Extraction Settings
min_text_length_threshold=50
tesseract_data_path=/usr/share/tesseract-ocr/4.00/tessdata
tesseract_language=eng
ocr_timeout_seconds=120

# Timeouts
download_timeout_seconds=60
upload_timeout_seconds=60
temp_dir=/tmp/boost-document-extractor

Property	Description	Default
`min_text_length_threshold`	Minimum chars from PDFBox before falling back to OCR	50
`tesseract_data_path`	Path to Tesseract language data files	`/usr/share/tesseract-ocr/4.00/tessdata`
`tesseract_language`	OCR language code	`eng`
`temp_dir`	Directory for temporary file downloads	`/tmp/boost-document-extractor`
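
To illustrate how the defaults in the table might be applied, here is a minimal sketch built on `java.util.Properties`. It covers only parsing and defaulting, not the external-file-or-classpath lookup, and the `ExtractorSettings` record is a hypothetical name rather than the service's `AppConfig`.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Illustrative subset of the configuration with the documented defaults. */
record ExtractorSettings(int minTextLengthThreshold, String tesseractDataPath,
                         String tesseractLanguage, String tempDir) {

    /** Reads properties from the given source, falling back to the documented defaults. */
    static ExtractorSettings load(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException("Could not read configuration", e);
        }
        return new ExtractorSettings(
                Integer.parseInt(props.getProperty("min_text_length_threshold", "50").trim()),
                props.getProperty("tesseract_data_path", "/usr/share/tesseract-ocr/4.00/tessdata").trim(),
                props.getProperty("tesseract_language", "eng").trim(),
                props.getProperty("temp_dir", "/tmp/boost-document-extractor").trim());
    }
}
```

A file that sets only `tesseract_language=deu` would then produce a threshold of 50 and the default data path and temp directory.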

Tenant Isolation

DocumentExtractor propagates tenant information through the processing pipeline:

Reads the tenant identifier from the incoming message JSON body (organizationId, falling back to tenantUUID)
Propagates the value to all outgoing events
Sets tenantUUID as a JMS message property on published events

This allows DocumentManager to filter events by tenant using JMS selectors.

java

// Resolve the tenant: prefer organizationId, fall back to tenantUUID
String tenantUUID = json.getString("organizationId");
if (tenantUUID == null) {
    tenantUUID = json.getString("tenantUUID");
}

// Set JMS property on outgoing event
message.setStringProperty("tenantUUID", event.getTenantUUID());

For detailed information about tenant isolation, see Multi-Tenant Messaging.
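
On the consuming side, the tenantUUID property can be matched with a JMS message selector. The helper below is a hedged sketch of how such a selector string could be built; `TenantSelector` is a hypothetical name, and how DocumentManager actually constructs its selectors is not described here. The resulting string is meant to be passed to the standard JMS call `Session.createConsumer(Destination, String)`.

```java
/** Illustrative helper that builds a JMS selector for one tenant. */
final class TenantSelector {

    private TenantSelector() {}

    /**
     * Returns a selector such as: tenantUUID = 'org-123'.
     * Single quotes are doubled, which is how JMS selector string literals escape them.
     */
    static String forTenant(String tenantUUID) {
        if (tenantUUID == null || tenantUUID.isBlank()) {
            throw new IllegalArgumentException("tenantUUID must not be blank");
        }
        return "tenantUUID = '" + tenantUUID.replace("'", "''") + "'";
    }
}
```

A consumer created with `session.createConsumer(queue, TenantSelector.forTenant("org-123"))` would then only receive events whose tenantUUID property equals `org-123`.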

Key Design Decisions

Aspect	Choice	Rationale
Ack mode	`CLIENT_ACKNOWLEDGE`	Don't lose messages on processing failure
Response	Event publishing	No HTTP request waiting; DocumentManager subscribes
Error reporting	Separate error queue	Clean separation of success/failure paths
PDF extraction	PDFBox first, OCR fallback	Most PDFs have embedded text; OCR is expensive
Idempotency	Check if text exists before processing	Skip re-extraction, still publish success event
Layout	Custom `LayoutPreservingTextStripper`	Preserves column alignment and table structures
Tenant isolation	Propagate `tenantUUID` JMS property	Enables tenant-filtered message consumption
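
The idempotency decision in the table can be illustrated with a small sketch. An in-memory map stands in for S3 here, and the class, counters, and method names are hypothetical. The point is only that a redelivered message skips extraction but still results in a success event.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Illustrative idempotent handler; the map is a stand-in for the S3 bucket. */
final class IdempotentExtraction {

    private final Map<String, String> store = new ConcurrentHashMap<>();
    private int extractions;
    private int successEvents;

    /** Extracts only if no text exists under textKey, then always reports success. */
    void process(String textKey, Supplier<String> extractor) {
        if (!store.containsKey(textKey)) {
            store.put(textKey, extractor.get()); // expensive path, runs at most once per key
            extractions++;
        }
        successEvents++; // stand-in for publishing the success event
    }

    int extractions() { return extractions; }

    int successEvents() { return successEvents; }
}
```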

Package Structure

src/main/java/com/luqon/boost/documentextractor/
├── DocumentExtractorApp.java           # Entry point, wiring, shutdown hook
├── config/
│   └── AppConfig.java                  # Loads config.properties, validates
├── model/
│   ├── FileDescriptor.java             # Parsed input file object
│   ├── ExtractionResult.java           # Extraction output
│   ├── ExtractionEvent.java            # Output event payload
│   └── ExtractionEngine.java           # Enum: PDFBOX, TESSERACT
├── messaging/
│   ├── MessageHandler.java             # Interface
│   ├── MessageConsumerService.java     # AMQP consumer, CLIENT_ACKNOWLEDGE
│   └── EventPublisher.java             # Publishes to output/error queues
├── extraction/
│   ├── TextExtractor.java              # Interface
│   ├── PdfTextExtractor.java           # Apache PDFBox
│   ├── OcrTextExtractor.java           # Tesseract via tess4j
│   ├── LayoutPreservingTextStripper.java  # Custom layout-aware extractor
│   ├── ExtractionOrchestrator.java     # Decides extractor, handles fallback
│   └── ExtractionException.java        # Custom exception
├── storage/
│   └── S3StorageService.java           # MinIO SDK: download, upload, exists
├── handler/
│   └── DocumentExtractionHandler.java  # Main processing logic
└── util/
    ├── TenantResolver.java             # Extracts tenantId from location URL
    └── ContentTypeResolver.java        # PDF vs image detection

Dependencies

Dependency	Version	Purpose
Apache PDFBox	3.0.6	PDF text extraction
tess4j	5.18.0	Tesseract OCR wrapper
MinIO SDK	8.6.0	S3-compatible storage
Qpid JMS Client	2.10.0	AMQP messaging
SLF4J + Logback	2.0.16 / 1.5.16	Logging

Building

bash

cd BOOST.DocumentExtractor
mvn clean package -DskipTests

This produces a fat JAR at target/BOOST.DocumentExtractor-0.0.1-SNAPSHOT.jar.

Running

bash

java -jar target/BOOST.DocumentExtractor-0.0.1-SNAPSHOT.jar

The service will:

Load configuration from config.properties
Connect to the AMQP broker
Start consuming messages from cmd.document.extractText
Process documents and publish events

Graceful Shutdown

The service registers a shutdown hook. Send SIGTERM or press Ctrl+C to stop it gracefully.
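
A shutdown hook of this kind is usually registered through `Runtime.getRuntime().addShutdownHook`. The sketch below shows the general pattern under the assumption that cleanup should run at most once; the class name is hypothetical and the actual cleanup steps (closing the consumer, session, and connection) are only indicated by a comment.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Illustrative shutdown handling; resource cleanup is left as a placeholder. */
final class GracefulShutdown {

    private final AtomicBoolean running = new AtomicBoolean(true);

    boolean isRunning() {
        return running.get();
    }

    /** Idempotent: only the first call performs the cleanup. */
    void stop() {
        if (running.compareAndSet(true, false)) {
            // Placeholder: close the message consumer, session, and connection here.
            System.out.println("Shutting down DocumentExtractor");
        }
    }

    /** Registers stop() to run when the JVM receives SIGTERM or Ctrl+C. */
    Thread install() {
        Thread hook = new Thread(this::stop, "extractor-shutdown");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }
}
```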

Testing

Using TestSender Utility

bash

java -cp target/BOOST.DocumentExtractor-0.0.1-SNAPSHOT.jar \
  com.luqon.boost.documentextractor.TestSender [broker_url]

This sends a test message and listens for results on both success and error queues.

Manual Testing

Upload a PDF/image to S3
Send a message to cmd.document.extractText with the file details
Check evt.document.text.extracted for the success event (or evt.document.text.error if processing failed)
Verify extracted text in S3 at {prefix}/extracted-text/{filename}.txt

Output Storage

Extracted text is stored in S3 under the following key pattern, where {original-filename} is the source filename without its extension:

{original-path}/extracted-text/{original-filename}.txt

Example:

Original: 742dba26-.../InvoiceForm-459.pdf
Extracted: 742dba26-.../extracted-text/InvoiceForm-459.txt
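
The mapping from the source key to the text key can be expressed as a small pure function. This is a sketch consistent with the example above, not the service's own code, and `TextKeys.textKeyFor` is a hypothetical name.

```java
/** Illustrative derivation of the extracted-text key from the source object key. */
final class TextKeys {

    private TextKeys() {}

    static String textKeyFor(String serverFilename) {
        int slash = serverFilename.lastIndexOf('/');
        // Keep the directory part, including its trailing slash (empty if there is none).
        String dir = slash >= 0 ? serverFilename.substring(0, slash + 1) : "";
        String name = serverFilename.substring(slash + 1);
        // Strip the extension from the file name only, never from the directory part.
        int dot = name.lastIndexOf('.');
        String base = dot > 0 ? name.substring(0, dot) : name;
        return dir + "extracted-text/" + base + ".txt";
    }
}
```

Searching for the last dot within the file name only matters for keys like the one in the example, where the directory part itself contains dots.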

Error Handling

Scenario	Behavior
Unsupported content type	Publish error event, acknowledge message
S3 download failure	Exception thrown, message NOT acknowledged (redelivered)
Extraction failure	Publish error event, exception thrown
S3 upload failure	Exception thrown, message NOT acknowledged
Text already exists	Skip extraction, publish success event

Message Resilience

DocumentExtractor uses CLIENT_ACKNOWLEDGE mode to ensure no messages are lost:

Messages are only acknowledged after successful processing
If an exception occurs, the message is NOT acknowledged
The broker will automatically redeliver unacknowledged messages
Temporary failures (S3 timeouts) trigger redelivery
Permanent failures (unsupported file type) publish error events and acknowledge

For detailed information about message resilience patterns, see Multi-Tenant Messaging.
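
The split between temporary and permanent failures can be sketched as a small acknowledgement policy. The class, enum, and exception names below are hypothetical, and the sketch covers only the two cases named in the list: a permanent failure publishes an error event and acknowledges, while any other exception leaves the message unacknowledged. In a JMS consumer, the ACKNOWLEDGE outcome would correspond to calling `Message.acknowledge()`.

```java
import java.util.function.Consumer;

/** Illustrative acknowledgement policy for CLIENT_ACKNOWLEDGE consumers. */
final class AckPolicy {

    enum Decision { ACKNOWLEDGE, LEAVE_FOR_REDELIVERY }

    /** Marks failures that a retry cannot fix, e.g. an unsupported content type. */
    static final class PermanentFailure extends RuntimeException {
        PermanentFailure(String message) {
            super(message);
        }
    }

    private AckPolicy() {}

    static Decision handle(Runnable processing, Consumer<String> errorEventPublisher) {
        try {
            processing.run();
            return Decision.ACKNOWLEDGE;
        } catch (PermanentFailure e) {
            // Retrying would fail again: report the error and take the message off the queue.
            errorEventPublisher.accept(e.getMessage());
            return Decision.ACKNOWLEDGE;
        } catch (RuntimeException e) {
            // Possibly transient (e.g. an S3 timeout): do not acknowledge, so it can be redelivered.
            return Decision.LEAVE_FOR_REDELIVERY;
        }
    }
}
```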

Logging

Logs are written to:

Console (stdout)
logs/document-extractor.log (rolling daily, 30 days retention)

Log levels can be adjusted in logback.xml.
