BOOST.DocumentExtractor

A stateless, message-driven Java 21 service that extracts text from PDF and image documents. It consumes commands from an AMQP queue, downloads files from S3-compatible storage, extracts text using PDFBox (for embedded text) or Tesseract OCR (as fallback), stores the result back to S3, and publishes a completion event.

Architecture

[AMQP: cmd.document.extractText]
        │
        ▼
  MessageConsumerService (CLIENT_ACKNOWLEDGE)
        │
        ▼
  DocumentExtractionHandler
        │
        ├──► S3StorageService.download(bucket, key)
        ├──► ExtractionOrchestrator
        │       ├──► PdfTextExtractor (PDFBox)
        │       └──► OcrTextExtractor (Tesseract/tess4j) [fallback]
        └──► S3StorageService.upload(extractedText)
        │
        ▼
  EventPublisher ──► [AMQP: evt.document.text.extracted]
               └──► [AMQP: evt.document.text.error] (on failure)

Message Flow

Input Message

Send to queue: cmd.document.extractText

JMS Property: tenantUUID must be set for proper tenant routing.

json
{
  "UUID": "742dba26-...",
  "entityUUID": "5d16c711-...",
  "organizationId": "org-123",
  "clientFilename": "InvoiceForm-459.pdf",
  "serverFilename": "742dba26-.../InvoiceForm-459.pdf",
  "contentType": "application/pdf",
  "location": "https://s3.se-01.westbahr.net/boost.app.{tenantId}",
  "metaData": { },
  "createdByUUID": "SYSTEM",
  "createdAt": "2026-01-24T14:12:33"
}

| Field | Description |
| --- | --- |
| UUID | Unique document identifier |
| entityUUID | Parent entity this document belongs to |
| clientFilename | Original filename uploaded by user |
| serverFilename | S3 object key |
| contentType | MIME type (application/pdf, image/png, etc.) |
| location | S3 endpoint + bucket (e.g., https://s3.example.com/boost.app.tenant1) |
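
As an illustration of how the location field can be split into its parts, here is a minimal sketch. LocationParser and S3Location are hypothetical names, and the "boost.app." bucket prefix is an assumption taken from the example bucket name; the service's own TenantResolver may behave differently.

```java
import java.net.URI;

/** Hypothetical helper, not part of the documented package structure. */
final class LocationParser {
    /** Endpoint, bucket, and (if derivable) tenant id parsed from a location URL. */
    record S3Location(String endpoint, String bucket, String tenantId) {}

    // Assumed from the example bucket name "boost.app.{tenantId}".
    private static final String BUCKET_PREFIX = "boost.app.";

    static S3Location parse(String location) {
        URI uri = URI.create(location);
        String path = uri.getPath();
        if (path == null || path.length() <= 1) {
            throw new IllegalArgumentException("No bucket in location: " + location);
        }
        // The first path segment is treated as the bucket name.
        String bucket = path.substring(1).split("/", 2)[0];
        String endpoint = uri.getScheme() + "://" + uri.getAuthority();
        // tenantId stays null when the bucket does not follow the assumed prefix.
        String tenantId = bucket.startsWith(BUCKET_PREFIX)
                ? bucket.substring(BUCKET_PREFIX.length())
                : null;
        return new S3Location(endpoint, bucket, tenantId);
    }
}
```

Calling LocationParser.parse("https://s3.example.com/boost.app.tenant1") yields the endpoint https://s3.example.com, the bucket boost.app.tenant1, and the tenant id tenant1.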

Output Event (Success)

Published to queue: evt.document.text.extracted

JMS Property: tenantUUID is set to enable tenant-filtered consumption by DocumentManager.

json
{
  "documentUUID": "742dba26-...",
  "entityUUID": "5d16c711-...",
  "tenantUUID": "org-123",
  "textRef": "742dba26-.../extracted-text/InvoiceForm-459.txt",
  "textBucket": "boost.app.{tenantId}",
  "engineUsed": "PDFBOX",
  "pageCount": 3,
  "confidence": 1.0,
  "detectedLanguage": "eng",
  "processingTimeMs": 1234,
  "success": true
}

| Field | Description |
| --- | --- |
| textRef | S3 object key where extracted text is stored |
| textBucket | S3 bucket name |
| engineUsed | PDFBOX or TESSERACT |
| pageCount | Number of pages processed |
| confidence | Extraction confidence (1.0 for PDFBox, ~0.85 for OCR) |
| processingTimeMs | Time taken to process, in milliseconds |

Error Event

Published to queue: evt.document.text.error

JMS Property: tenantUUID is set to enable tenant-filtered consumption.

json
{
  "documentUUID": "742dba26-...",
  "entityUUID": "5d16c711-...",
  "tenantUUID": "org-123",
  "textBucket": "boost.app.{tenantId}",
  "processingTimeMs": 234,
  "success": false,
  "errorMessage": "Unsupported content type: application/zip"
}

Supported File Types

| Content Type | Extraction Method |
| --- | --- |
| application/pdf | PDFBox (embedded text), falls back to OCR if text < 50 chars |
| image/png | Tesseract OCR |
| image/jpeg | Tesseract OCR |
| image/tiff | Tesseract OCR |
| image/bmp | Tesseract OCR |
| image/gif | Tesseract OCR |

Extraction Logic

  1. Parse content type from incoming message
  2. If PDF:
    • Try PDFBox PDFTextStripper with layout preservation
    • If extracted text < threshold (default 50 chars): fall back to Tesseract OCR
  3. If image:
    • Use Tesseract OCR directly
  4. If unsupported format:
    • Publish error event, acknowledge message
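
The steps above can be sketched as follows. This is a simplified illustration, not the actual ExtractionOrchestrator: ExtractionFlow and its nested types are hypothetical, the two Supplier arguments stand in for PdfTextExtractor and OcrTextExtractor, and stripping whitespace before comparing against the threshold is an assumption.

```java
import java.util.Locale;
import java.util.Set;
import java.util.function.Supplier;

/** Hypothetical sketch of the PDF-first, OCR-fallback decision. */
final class ExtractionFlow {
    enum Engine { PDFBOX, TESSERACT }

    record Result(String text, Engine engine) {}

    private static final Set<String> IMAGE_TYPES = Set.of(
            "image/png", "image/jpeg", "image/tiff", "image/bmp", "image/gif");

    static Result extract(String contentType, int minTextLength,
                          Supplier<String> pdfBox, Supplier<String> ocr) {
        String type = contentType == null ? "" : contentType.trim().toLowerCase(Locale.ROOT);
        if (type.equals("application/pdf")) {
            String text = pdfBox.get();
            // Assumption: surrounding whitespace does not count toward the threshold.
            if (text != null && text.strip().length() >= minTextLength) {
                return new Result(text, Engine.PDFBOX);
            }
            // Too little embedded text: treat the PDF as scanned and run OCR.
            return new Result(ocr.get(), Engine.TESSERACT);
        }
        if (IMAGE_TYPES.contains(type)) {
            return new Result(ocr.get(), Engine.TESSERACT);
        }
        // The caller is expected to turn this into an error event and acknowledge.
        throw new IllegalArgumentException("Unsupported content type: " + contentType);
    }
}
```

Passing the extractors as suppliers keeps the expensive OCR call lazy, so it only runs when the PDFBox result falls below the threshold or the input is an image.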

Configuration

Configuration is loaded from config.properties (external file or classpath).

properties
# AMQP Broker
broker_url=amqp://172.16.200.32:5672
input_queue=cmd.document.extractText
output_queue=evt.document.text.extracted
error_queue=evt.document.text.error

# S3 Storage
s3_endpoint=https://s3.se-01.westbahr.net
s3_access_key=your-access-key
s3_secret_key=your-secret-key
s3_region=us-east-1

# Extraction Settings
min_text_length_threshold=50
tesseract_data_path=/usr/share/tesseract-ocr/4.00/tessdata
tesseract_language=eng
ocr_timeout_seconds=120

# Timeouts
download_timeout_seconds=60
upload_timeout_seconds=60
temp_dir=/tmp/boost-document-extractor

| Property | Description | Default |
| --- | --- | --- |
| min_text_length_threshold | Minimum chars from PDFBox before falling back to OCR | 50 |
| tesseract_data_path | Path to Tesseract language data files | /usr/share/tesseract-ocr/4.00/tessdata |
| tesseract_language | OCR language code | eng |
| temp_dir | Directory for temporary file downloads | /tmp/boost-document-extractor |
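
To show how the documented defaults could be applied when a property is absent, here is a minimal sketch built on java.util.Properties. ConfigSketch is a hypothetical class, not the actual AppConfig, and which properties are treated as required is an assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical sketch of typed access to config.properties values. */
final class ConfigSketch {
    private final Properties props;

    private ConfigSketch(Properties props) {
        this.props = props;
    }

    /** Parses properties text; a real loader would read a file or classpath resource. */
    static ConfigSketch parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new ConfigSketch(p);
    }

    /** Fails fast for properties without a documented default (e.g. broker_url). */
    String required(String key) {
        String value = props.getProperty(key);
        if (value == null || value.isBlank()) {
            throw new IllegalStateException("Missing required property: " + key);
        }
        return value.trim();
    }

    int minTextLengthThreshold() {
        return Integer.parseInt(props.getProperty("min_text_length_threshold", "50").trim());
    }

    String tesseractDataPath() {
        return props.getProperty("tesseract_data_path", "/usr/share/tesseract-ocr/4.00/tessdata").trim();
    }

    String tesseractLanguage() {
        return props.getProperty("tesseract_language", "eng").trim();
    }

    String tempDir() {
        return props.getProperty("temp_dir", "/tmp/boost-document-extractor").trim();
    }
}
```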

Tenant Isolation

DocumentExtractor propagates tenant information through the processing pipeline:

  1. Reads the tenant identifier from the incoming message JSON body (organizationId, with tenantUUID as a fallback)
  2. Propagates the value to all outgoing events
  3. Sets tenantUUID as a JMS message property on published events

This allows DocumentManager to filter events by tenant using JMS selectors.

java
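// Parse the tenant identifier from the incoming message: organizationId first, tenantUUID as fallback.
// Assumption: json is an org.json JSONObject. Its getString() throws when the key is missing,
// so optString() with a null default is used to make the fallback reachable.
String tenantUUID = json.optString("organizationId", null);
if (tenantUUID == null) {
    tenantUUID = json.optString("tenantUUID", null);
}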

// Set JMS property on outgoing event
message.setStringProperty("tenantUUID", event.getTenantUUID());
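
On the consuming side, a JMS selector on this property might be built as in the sketch below. TenantSelector is a hypothetical helper. JMS selectors use SQL-92 style string literals, so a single quote inside the value is escaped by doubling it.

```java
/** Hypothetical helper that builds a JMS message selector for one tenant. */
final class TenantSelector {
    static String forTenant(String tenantUUID) {
        if (tenantUUID == null || tenantUUID.isBlank()) {
            throw new IllegalArgumentException("tenantUUID must not be blank");
        }
        // Double any single quote so the value stays one string literal.
        return "tenantUUID = '" + tenantUUID.replace("'", "''") + "'";
    }
}
```

The resulting string, for example tenantUUID = 'org-123', can be passed as the messageSelector argument of Session.createConsumer(destination, messageSelector).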

For detailed information about tenant isolation, see Multi-Tenant Messaging.

Key Design Decisions

| Aspect | Choice | Rationale |
| --- | --- | --- |
| Ack mode | CLIENT_ACKNOWLEDGE | Don't lose messages on processing failure |
| Response | Event publishing | No HTTP request waiting; DocumentManager subscribes |
| Error reporting | Separate error queue | Clean separation of success/failure paths |
| PDF extraction | PDFBox first, OCR fallback | Most PDFs have embedded text; OCR is expensive |
| Idempotency | Check if text exists before processing | Skip re-extraction, still publish success event |
| Layout | Custom LayoutPreservingTextStripper | Preserves column alignment and table structures |
| Tenant isolation | Propagate tenantUUID JMS property | Enables tenant-filtered message consumption |

Package Structure

src/main/java/com/luqon/boost/documentextractor/
├── DocumentExtractorApp.java           # Entry point, wiring, shutdown hook
├── config/
│   └── AppConfig.java                  # Loads config.properties, validates
├── model/
│   ├── FileDescriptor.java             # Parsed input file object
│   ├── ExtractionResult.java           # Extraction output
│   ├── ExtractionEvent.java            # Output event payload
│   └── ExtractionEngine.java           # Enum: PDFBOX, TESSERACT
├── messaging/
│   ├── MessageHandler.java             # Interface
│   ├── MessageConsumerService.java     # AMQP consumer, CLIENT_ACKNOWLEDGE
│   └── EventPublisher.java             # Publishes to output/error queues
├── extraction/
│   ├── TextExtractor.java              # Interface
│   ├── PdfTextExtractor.java           # Apache PDFBox
│   ├── OcrTextExtractor.java           # Tesseract via tess4j
│   ├── LayoutPreservingTextStripper.java  # Custom layout-aware extractor
│   ├── ExtractionOrchestrator.java     # Decides extractor, handles fallback
│   └── ExtractionException.java        # Custom exception
├── storage/
│   └── S3StorageService.java           # MinIO SDK: download, upload, exists
├── handler/
│   └── DocumentExtractionHandler.java  # Main processing logic
└── util/
    ├── TenantResolver.java             # Extracts tenantId from location URL
    └── ContentTypeResolver.java        # PDF vs image detection

Dependencies

| Dependency | Version | Purpose |
| --- | --- | --- |
| Apache PDFBox | 3.0.6 | PDF text extraction |
| tess4j | 5.18.0 | Tesseract OCR wrapper |
| MinIO SDK | 8.6.0 | S3-compatible storage |
| Qpid JMS Client | 2.10.0 | AMQP messaging |
| SLF4J + Logback | 2.0.16 / 1.5.16 | Logging |

Building

bash
cd BOOST.DocumentExtractor
mvn clean package -DskipTests

This produces a fat JAR at target/BOOST.DocumentExtractor-0.0.1-SNAPSHOT.jar.

Running

bash
java -jar target/BOOST.DocumentExtractor-0.0.1-SNAPSHOT.jar

The service will:

  1. Load configuration from config.properties
  2. Connect to the AMQP broker
  3. Start consuming messages from cmd.document.extractText
  4. Process documents and publish events

Graceful Shutdown

The service registers a JVM shutdown hook. Send SIGTERM or press Ctrl+C to stop it gracefully.
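
A shutdown hook of this kind might look like the following sketch. ShutdownSketch is a hypothetical class, and closing resources in reverse registration order is an assumption rather than a description of DocumentExtractorApp.

```java
/** Hypothetical sketch of registering a JVM shutdown hook. */
final class ShutdownSketch {
    /** Registers a hook that closes the given resources in reverse order. */
    static Thread register(AutoCloseable... resources) {
        Thread hook = new Thread(() -> {
            for (int i = resources.length - 1; i >= 0; i--) {
                try {
                    resources[i].close();
                } catch (Exception e) {
                    // Keep going so one failing resource does not block the rest.
                    System.err.println("Error during shutdown: " + e);
                }
            }
        }, "extractor-shutdown");
        // The JVM runs registered hooks on SIGTERM, Ctrl+C (SIGINT), and normal exit.
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }
}
```

Registering the connection first and the consumer second means the consumer stops before the connection is closed.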

Testing

Using TestSender Utility

bash
java -cp target/BOOST.DocumentExtractor-0.0.1-SNAPSHOT.jar \
  com.luqon.boost.documentextractor.TestSender [broker_url]

This sends a test message and listens for results on both success and error queues.

Manual Testing

  1. Upload a PDF/image to S3
  2. Send a message to cmd.document.extractText with the file details
  3. Check evt.document.text.extracted for the success event (or evt.document.text.error if processing failed)
  4. Verify the extracted text in S3 at {original-path}/extracted-text/{original-filename}.txt

Output Storage

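Extracted text is stored in the same S3 bucket under the following key pattern, where {original-path} is the directory part of the source object key and {original-filename} is the source filename without its extension: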

{original-path}/extracted-text/{original-filename}.txt

Example:

  • Original: 742dba26-.../InvoiceForm-459.pdf
  • Extracted: 742dba26-.../extracted-text/InvoiceForm-459.txt
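
The mapping from source key to text key can be sketched as below. TextKeys is a hypothetical helper that follows the pattern and example above; how the service handles keys without a directory or extension is an assumption.

```java
/** Hypothetical helper that derives the extracted-text key from a source object key. */
final class TextKeys {
    static String extractedTextKey(String serverFilename) {
        int slash = serverFilename.lastIndexOf('/');
        // Directory part including the trailing slash, or empty for a top-level key.
        String dir = slash >= 0 ? serverFilename.substring(0, slash + 1) : "";
        String name = serverFilename.substring(slash + 1);
        int dot = name.lastIndexOf('.');
        // Drop the extension; a name without one (or a dotfile) is kept whole.
        String base = dot > 0 ? name.substring(0, dot) : name;
        return dir + "extracted-text/" + base + ".txt";
    }
}
```

For a source key such as docs/InvoiceForm-459.pdf (a made-up key), this returns docs/extracted-text/InvoiceForm-459.txt.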

Error Handling

| Scenario | Behavior |
| --- | --- |
| Unsupported content type | Publish error event, acknowledge message |
| S3 download failure | Exception thrown, message NOT acknowledged (redelivered) |
| Extraction failure | Publish error event, exception thrown |
| S3 upload failure | Exception thrown, message NOT acknowledged |
| Text already exists | Skip extraction, publish success event |
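
The scenarios in the table can be sketched as one processing method plus an acknowledge decision. This is an illustration under stated assumptions, not the actual DocumentExtractionHandler: the Storage, Extractor, and Events interfaces are hypothetical stand-ins for S3StorageService, ExtractionOrchestrator, and EventPublisher, and the order of the checks is assumed.

```java
import java.util.Set;

/** Hypothetical sketch of how each error scenario maps to events and acknowledgement. */
final class AckFlowSketch {
    interface Storage {
        boolean exists(String key);
        byte[] download(String key);          // may throw on S3 failure
        void upload(String key, String text); // may throw on S3 failure
    }

    interface Extractor {
        String extract(String contentType, byte[] data);
    }

    interface Events {
        void publish(boolean success, String detail);
    }

    private static final Set<String> SUPPORTED = Set.of(
            "application/pdf", "image/png", "image/jpeg", "image/tiff", "image/bmp", "image/gif");

    /** Returns normally when the message should be acknowledged; throws otherwise. */
    static void process(String contentType, String sourceKey, String textKey,
                        Storage storage, Extractor extractor, Events events) {
        if (!SUPPORTED.contains(contentType)) {
            // Permanent failure: report it and let the caller acknowledge.
            events.publish(false, "Unsupported content type: " + contentType);
            return;
        }
        if (storage.exists(textKey)) {
            // Idempotency: skip re-extraction but still publish the success event.
            events.publish(true, textKey);
            return;
        }
        byte[] data = storage.download(sourceKey); // failure propagates, no event
        String text;
        try {
            text = extractor.extract(contentType, data);
        } catch (RuntimeException e) {
            events.publish(false, String.valueOf(e.getMessage()));
            throw e;
        }
        storage.upload(textKey, text); // failure propagates, no event
        events.publish(true, textKey);
    }

    /** Consumer-side wrapper: acknowledge only if processing finished without an exception. */
    static boolean shouldAcknowledge(Runnable processing) {
        try {
            processing.run();
            return true;
        } catch (RuntimeException e) {
            return false;
        }
    }
}
```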

Message Resilience

DocumentExtractor uses CLIENT_ACKNOWLEDGE mode to ensure no messages are lost:

  • Messages are only acknowledged after successful processing
  • If an exception occurs, the message is NOT acknowledged
  • The broker will automatically redeliver unacknowledged messages
  • Temporary failures (S3 timeouts) trigger redelivery
  • Permanent failures (unsupported file type) publish error events and acknowledge

For detailed information about message resilience patterns, see Multi-Tenant Messaging.

Logging

Logs are written to:

  • Console (stdout)
  • logs/document-extractor.log (rolled daily, 30-day retention)

Log levels can be adjusted in logback.xml.