BOOST.DocumentExtractor
A stateless, message-driven Java 21 service that extracts text from PDF and image documents. It consumes commands from an AMQP queue, downloads files from S3-compatible storage, extracts text using PDFBox (for embedded text) or Tesseract OCR (as fallback), stores the result back to S3, and publishes a completion event.
Architecture
[AMQP: cmd.document.extractText]
│
▼
MessageConsumerService (CLIENT_ACKNOWLEDGE)
│
▼
DocumentExtractionHandler
│
├──► S3StorageService.download(bucket, key)
├──► ExtractionOrchestrator
│ ├──► PdfTextExtractor (PDFBox)
│ └──► OcrTextExtractor (Tesseract/tess4j) [fallback]
└──► S3StorageService.upload(extractedText)
│
▼
EventPublisher ──► [AMQP: evt.document.text.extracted]
└──► [AMQP: evt.document.text.error] (on failure)
Message Flow
Input Message
Send to queue: cmd.document.extractText
JMS Property: tenantUUID must be set for proper tenant routing.
{
"UUID": "742dba26-...",
"entityUUID": "5d16c711-...",
"organizationId": "org-123",
"clientFilename": "InvoiceForm-459.pdf",
"serverFilename": "742dba26-.../InvoiceForm-459.pdf",
"contentType": "application/pdf",
"location": "https://s3.se-01.westbahr.net/boost.app.{tenantId}",
"metaData": { },
"createdByUUID": "SYSTEM",
"createdAt": "2026-01-24T14:12:33"
}
| Field | Description |
|---|---|
| UUID | Unique document identifier |
| entityUUID | Parent entity this document belongs to |
| clientFilename | Original filename uploaded by user |
| serverFilename | S3 object key |
| contentType | MIME type (application/pdf, image/png, etc.) |
| location | S3 endpoint + bucket (e.g., https://s3.example.com/boost.app.tenant1) |
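Since location combines the S3 endpoint and the bucket in one URL, the consumer has to split it before talking to storage. The following is a minimal sketch of one way to do that, not the service's actual TenantResolver. The record name S3Location and the rule that the tenant id is whatever follows a boost.app. bucket prefix are assumptions inferred from the example values above.

```java
import java.net.URI;

/** Hypothetical helper: splits a "location" value into endpoint, bucket, and tenant id. */
record S3Location(String endpoint, String bucket, String tenantId) {

    // Assumed bucket naming convention, inferred from the examples: boost.app.<tenantId>
    private static final String BUCKET_PREFIX = "boost.app.";

    static S3Location parse(String location) {
        URI uri = URI.create(location);
        String path = uri.getPath();
        if (uri.getScheme() == null || uri.getAuthority() == null
                || path == null || path.length() < 2) {
            throw new IllegalArgumentException("Expected <endpoint>/<bucket>, got: " + location);
        }
        // The first path segment is treated as the bucket name
        String bucket = path.substring(1).split("/", 2)[0];
        if (bucket.isEmpty()) {
            throw new IllegalArgumentException("Missing bucket in: " + location);
        }
        String endpoint = uri.getScheme() + "://" + uri.getAuthority();
        // tenantId stays null when the bucket does not follow the assumed convention
        String tenantId = bucket.startsWith(BUCKET_PREFIX)
                ? bucket.substring(BUCKET_PREFIX.length())
                : null;
        return new S3Location(endpoint, bucket, tenantId);
    }
}
```

For example, S3Location.parse("https://s3.example.com/boost.app.tenant1") yields the endpoint https://s3.example.com, the bucket boost.app.tenant1, and the tenant id tenant1.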
Output Event (Success)
Published to queue: evt.document.text.extracted
JMS Property: tenantUUID is set to enable tenant-filtered consumption by DocumentManager.
{
"documentUUID": "742dba26-...",
"entityUUID": "5d16c711-...",
"tenantUUID": "org-123",
"textRef": "742dba26-.../extracted-text/InvoiceForm-459.txt",
"textBucket": "boost.app.{tenantId}",
"engineUsed": "PDFBOX",
"pageCount": 3,
"confidence": 1.0,
"detectedLanguage": "eng",
"processingTimeMs": 1234,
"success": true
}
| Field | Description |
|---|---|
| textRef | S3 object key where extracted text is stored |
| textBucket | S3 bucket name |
| engineUsed | PDFBOX or TESSERACT |
| pageCount | Number of pages processed |
| confidence | Extraction confidence (1.0 for PDFBox, ~0.85 for OCR) |
| processingTimeMs | Time taken to process |
Error Event
Published to queue: evt.document.text.error
JMS Property: tenantUUID is set to enable tenant-filtered consumption.
{
"documentUUID": "742dba26-...",
"entityUUID": "5d16c711-...",
"tenantUUID": "org-123",
"textBucket": "boost.app.{tenantId}",
"processingTimeMs": 234,
"success": false,
"errorMessage": "Unsupported content type: application/zip"
}
Supported File Types
| Content Type | Extraction Method |
|---|---|
| application/pdf | PDFBox (embedded text), falls back to OCR if text < 50 chars |
| image/png | Tesseract OCR |
| image/jpeg | Tesseract OCR |
| image/tiff | Tesseract OCR |
| image/bmp | Tesseract OCR |
| image/gif | Tesseract OCR |
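The table above amounts to a three-way classification of the incoming contentType. A minimal sketch of that routing follows. The class name ContentTypeRouting and the handling of MIME parameters and letter case are assumptions, and the real ContentTypeResolver may behave differently.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of content-type routing; not the actual ContentTypeResolver. */
final class ContentTypeRouting {

    enum Kind { PDF, IMAGE, UNSUPPORTED }

    // Image types listed in the table above, all routed to Tesseract OCR
    private static final Set<String> OCR_IMAGE_TYPES = Set.of(
            "image/png", "image/jpeg", "image/tiff", "image/bmp", "image/gif");

    static Kind classify(String contentType) {
        if (contentType == null) {
            return Kind.UNSUPPORTED;
        }
        // Assumption: ignore parameters such as "; charset=..." and compare case-insensitively
        String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        if (mime.equals("application/pdf")) {
            return Kind.PDF;
        }
        return OCR_IMAGE_TYPES.contains(mime) ? Kind.IMAGE : Kind.UNSUPPORTED;
    }
}
```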
Extraction Logic
- Parse content type from incoming message
- If PDF:
  - Try PDFBox PDFTextStripper with layout preservation
  - If extracted text < threshold (default 50 chars): fall back to Tesseract OCR
- If image:
  - Use Tesseract OCR directly
- If unsupported format:
  - Publish error event, acknowledge message
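The PDF branch of the steps above can be sketched as a small fallback function. This is an illustration under stated assumptions, not the ExtractionOrchestrator itself: both extractors are passed in as plain suppliers, the names FallbackSketch, Engine, and Result are hypothetical, and whether the real service strips whitespace before comparing against the threshold is not stated in this document.

```java
import java.util.function.Supplier;

/** Hypothetical sketch of the PDFBox-first, OCR-fallback decision. */
final class FallbackSketch {

    enum Engine { PDFBOX, TESSERACT }

    record Result(Engine engine, String text) {}

    /**
     * Runs the embedded-text extractor first and only invokes OCR when the
     * stripped text is shorter than minTextLength characters.
     */
    static Result extractPdf(Supplier<String> pdfBox, Supplier<String> ocr, int minTextLength) {
        String embedded = pdfBox.get();
        // Assumption: surrounding whitespace does not count towards the threshold
        if (embedded != null && embedded.strip().length() >= minTextLength) {
            return new Result(Engine.PDFBOX, embedded);
        }
        // The OCR supplier is evaluated lazily, so the expensive path is skipped when not needed
        return new Result(Engine.TESSERACT, ocr.get());
    }
}
```

Passing the extractors as suppliers keeps the OCR call lazy, which matches the rationale that OCR is expensive and should run only when PDFBox yields too little text.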
Configuration
Configuration is loaded from config.properties (external file or classpath).
# AMQP Broker
broker_url=amqp://172.16.200.32:5672
input_queue=cmd.document.extractText
output_queue=evt.document.text.extracted
error_queue=evt.document.text.error
# S3 Storage
s3_endpoint=https://s3.se-01.westbahr.net
s3_access_key=your-access-key
s3_secret_key=your-secret-key
s3_region=us-east-1
# Extraction Settings
min_text_length_threshold=50
tesseract_data_path=/usr/share/tesseract-ocr/4.00/tessdata
tesseract_language=eng
ocr_timeout_seconds=120
# Timeouts
download_timeout_seconds=60
upload_timeout_seconds=60
temp_dir=/tmp/boost-document-extractor
| Property | Description | Default |
|---|---|---|
| min_text_length_threshold | Minimum chars from PDFBox before falling back to OCR | 50 |
| tesseract_data_path | Path to Tesseract language data files | /usr/share/tesseract-ocr/4.00/tessdata |
| tesseract_language | OCR language code | eng |
| temp_dir | Directory for temporary file downloads | /tmp/boost-document-extractor |
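As an illustration of how the defaults in the table could be applied, here is a minimal sketch built on java.util.Properties. It is not the actual AppConfig: the class name ConfigSketch is hypothetical, and treating broker_url as a required property is an assumption, since this document does not say which properties are validated.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical config loader applying the documented defaults. */
final class ConfigSketch {
    final String brokerUrl;
    final int minTextLengthThreshold;
    final String tesseractDataPath;
    final String tesseractLanguage;
    final String tempDir;

    private ConfigSketch(Properties p) {
        // Assumption: broker_url has no sensible default and must be present
        brokerUrl = required(p, "broker_url");
        minTextLengthThreshold = Integer.parseInt(
                p.getProperty("min_text_length_threshold", "50").trim());
        tesseractDataPath = p.getProperty(
                "tesseract_data_path", "/usr/share/tesseract-ocr/4.00/tessdata");
        tesseractLanguage = p.getProperty("tesseract_language", "eng");
        tempDir = p.getProperty("temp_dir", "/tmp/boost-document-extractor");
    }

    static ConfigSketch load(Reader reader) {
        Properties p = new Properties();
        try {
            p.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new ConfigSketch(p);
    }

    private static String required(Properties p, String key) {
        String value = p.getProperty(key);
        if (value == null || value.isBlank()) {
            throw new IllegalStateException("Missing required property: " + key);
        }
        return value.trim();
    }
}
```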
Tenant Isolation
DocumentExtractor propagates tenant information through the processing pipeline:
- Receives tenantUUID (or organizationId) from incoming message JSON body
- Propagates the value to all outgoing events
- Sets tenantUUID as a JMS message property on published events
This allows DocumentManager to filter events by tenant using JMS selectors.
// Resolve the tenant from the incoming message: organizationId first, tenantUUID as fallback
String tenantUUID = json.getString("organizationId");
if (tenantUUID == null) {
tenantUUID = json.getString("tenantUUID");
}
// Set JMS property on outgoing event
message.setStringProperty("tenantUUID", event.getTenantUUID());
For detailed information about tenant isolation, see Multi-Tenant Messaging.
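On the consuming side, a tenant filter is expressed as a JMS message selector over the tenantUUID property. The sketch below only builds the selector string. The class name TenantSelector is hypothetical, and how DocumentManager actually constructs its selector is not shown in this document.

```java
/** Hypothetical helper that builds a JMS selector for one tenant. */
final class TenantSelector {

    /** Returns a selector such as: tenantUUID = 'org-123' */
    static String forTenant(String tenantUUID) {
        if (tenantUUID == null || tenantUUID.isBlank()) {
            throw new IllegalArgumentException("tenantUUID must not be blank");
        }
        // Selector string literals use single quotes; an embedded quote is escaped by doubling it
        return "tenantUUID = '" + tenantUUID.replace("'", "''") + "'";
    }
}
```

The resulting string would be passed as the messageSelector argument of Session.createConsumer(destination, messageSelector).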
Key Design Decisions
| Aspect | Choice | Rationale |
|---|---|---|
| Ack mode | CLIENT_ACKNOWLEDGE | Don't lose messages on processing failure |
| Response | Event publishing | No HTTP request waiting; DocumentManager subscribes |
| Error reporting | Separate error queue | Clean separation of success/failure paths |
| PDF extraction | PDFBox first, OCR fallback | Most PDFs have embedded text; OCR is expensive |
| Idempotency | Check if text exists before processing | Skip re-extraction, still publish success event |
| Layout | Custom LayoutPreservingTextStripper | Preserves column alignment and table structures |
| Tenant isolation | Propagate tenantUUID JMS property | Enables tenant-filtered message consumption |
Package Structure
src/main/java/com/luqon/boost/documentextractor/
├── DocumentExtractorApp.java # Entry point, wiring, shutdown hook
├── config/
│ └── AppConfig.java # Loads config.properties, validates
├── model/
│ ├── FileDescriptor.java # Parsed input file object
│ ├── ExtractionResult.java # Extraction output
│ ├── ExtractionEvent.java # Output event payload
│ └── ExtractionEngine.java # Enum: PDFBOX, TESSERACT
├── messaging/
│ ├── MessageHandler.java # Interface
│ ├── MessageConsumerService.java # AMQP consumer, CLIENT_ACKNOWLEDGE
│ └── EventPublisher.java # Publishes to output/error queues
├── extraction/
│ ├── TextExtractor.java # Interface
│ ├── PdfTextExtractor.java # Apache PDFBox
│ ├── OcrTextExtractor.java # Tesseract via tess4j
│ ├── LayoutPreservingTextStripper.java # Custom layout-aware extractor
│ ├── ExtractionOrchestrator.java # Decides extractor, handles fallback
│ └── ExtractionException.java # Custom exception
├── storage/
│ └── S3StorageService.java # MinIO SDK: download, upload, exists
├── handler/
│ └── DocumentExtractionHandler.java # Main processing logic
└── util/
├── TenantResolver.java # Extracts tenantId from location URL
└── ContentTypeResolver.java # PDF vs image detection
Dependencies
| Dependency | Version | Purpose |
|---|---|---|
| Apache PDFBox | 3.0.6 | PDF text extraction |
| tess4j | 5.18.0 | Tesseract OCR wrapper |
| MinIO SDK | 8.6.0 | S3-compatible storage |
| Qpid JMS Client | 2.10.0 | AMQP messaging |
| SLF4J + Logback | 2.0.16 / 1.5.16 | Logging |
Building
cd BOOST.DocumentExtractor
mvn clean package -DskipTests
This produces a fat JAR at target/BOOST.DocumentExtractor-0.0.1-SNAPSHOT.jar.
Running
java -jar target/BOOST.DocumentExtractor-0.0.1-SNAPSHOT.jar
The service will:
- Load configuration from config.properties
- Connect to the AMQP broker
- Start consuming messages from cmd.document.extractText
- Process documents and publish events
Graceful Shutdown
The service registers a shutdown hook. Send SIGTERM or press Ctrl+C to stop it gracefully.
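A shutdown hook of this kind is typically registered through Runtime.addShutdownHook. The following is a minimal sketch, assuming the cleanup (closing the consumer, session, and connection) is supplied as a Runnable. The class name ShutdownSketch is hypothetical, and the actual wiring in DocumentExtractorApp is not shown in this document.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch of an idempotent stop routine wired to a JVM shutdown hook. */
final class ShutdownSketch {
    private final AtomicBoolean running = new AtomicBoolean(true);
    private final Runnable closeResources;

    ShutdownSketch(Runnable closeResources) {
        this.closeResources = closeResources;
    }

    boolean isRunning() {
        return running.get();
    }

    /** Safe to call more than once: the cleanup runs at most one time. */
    void stop() {
        if (running.compareAndSet(true, false)) {
            closeResources.run();
        }
    }

    /** Registers stop() to run when the JVM shuts down (SIGTERM, Ctrl+C, or normal exit). */
    void registerShutdownHook() {
        Runtime.getRuntime().addShutdownHook(new Thread(this::stop, "shutdown-hook"));
    }
}
```

Making stop() idempotent matters because the same routine may be invoked both by application code and by the hook.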
Testing
Using TestSender Utility
java -cp target/BOOST.DocumentExtractor-0.0.1-SNAPSHOT.jar \
com.luqon.boost.documentextractor.TestSender [broker_url]
This sends a test message and listens for results on both success and error queues.
Manual Testing
- Upload a PDF/image to S3
- Send a message to cmd.document.extractText with the file details
- Check evt.document.text.extracted for success event
- Verify extracted text in S3 at {prefix}/extracted-text/{filename}.txt
Output Storage
Extracted text is stored in S3 with the following path pattern:
{original-path}/extracted-text/{original-filename}.txt
Example:
- Original: 742dba26-.../InvoiceForm-459.pdf
- Extracted: 742dba26-.../extracted-text/InvoiceForm-459.txt
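The path pattern can be expressed as a small key-derivation function. This is a sketch that reproduces the example above. The class and method names are hypothetical, and the handling of keys without a directory or without a file extension is an assumption.

```java
/** Hypothetical helper deriving the extracted-text key from the original object key. */
final class TextKeySketch {

    static String extractedTextKey(String serverFilename) {
        int slash = serverFilename.lastIndexOf('/');
        // Directory part including the trailing slash, or empty for a top-level key
        String dir = slash >= 0 ? serverFilename.substring(0, slash + 1) : "";
        String name = serverFilename.substring(slash + 1);
        int dot = name.lastIndexOf('.');
        // Assumption: a name without an extension (or a dotfile) is used unchanged
        String base = dot > 0 ? name.substring(0, dot) : name;
        return dir + "extracted-text/" + base + ".txt";
    }
}
```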
Error Handling
| Scenario | Behavior |
|---|---|
| Unsupported content type | Publish error event, acknowledge message |
| S3 download failure | Exception thrown, message NOT acknowledged (redelivered) |
| Extraction failure | Publish error event, exception thrown |
| S3 upload failure | Exception thrown, message NOT acknowledged |
| Text already exists | Skip extraction, publish success event |
Message Resilience
DocumentExtractor uses CLIENT_ACKNOWLEDGE mode to ensure no messages are lost:
- Messages are only acknowledged after successful processing
- If an exception occurs, the message is NOT acknowledged
- The broker will automatically redeliver unacknowledged messages
- Temporary failures (S3 timeouts) trigger redelivery
- Permanent failures (unsupported file type) publish error events and acknowledge
For detailed information about message resilience patterns, see Multi-Tenant Messaging.
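The acknowledge rules above reduce to a single decision per message. The sketch below models that decision without any JMS dependency. The names AckPolicySketch, PermanentFailureException, Processor, and ErrorSink are hypothetical, and the way the real handler distinguishes permanent from temporary failures is an assumption.

```java
/** Hypothetical sketch of the acknowledge-or-redeliver decision. */
final class AckPolicySketch {

    /** Assumed marker for failures a retry cannot fix, e.g. an unsupported content type. */
    static final class PermanentFailureException extends RuntimeException {
        PermanentFailureException(String message) {
            super(message);
        }
    }

    interface Processor {
        void process(String body) throws Exception;
    }

    interface ErrorSink {
        void publishError(String errorMessage);
    }

    /** Returns true when the caller should acknowledge the message. */
    static boolean shouldAcknowledge(String body, Processor processor, ErrorSink errors) {
        try {
            processor.process(body);
            return true;   // success: acknowledge
        } catch (PermanentFailureException e) {
            errors.publishError(e.getMessage());
            return true;   // permanent failure: report it, then acknowledge to stop redelivery
        } catch (Exception e) {
            return false;  // temporary failure: leave unacknowledged so the message is redelivered
        }
    }
}
```

A consumer would call message.acknowledge() only when this returns true. Note that in CLIENT_ACKNOWLEDGE mode, acknowledging one message also acknowledges all messages the session has consumed before it.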
Logging
Logs are written to:
- Console (stdout)
- logs/document-extractor.log (rolling daily, 30 days retention)
Log levels can be adjusted in logback.xml.