Generating searchable PDFs from scanned documents automatically with Amazon Textract

Amazon Textract is a machine learning service that makes it easy to extract text and data from virtually any document. Textract goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables. This allows you to use Amazon Textract to instantly “read” virtually any type of document and accurately extract text and data without the need for any manual effort or custom code.

The blog post Automatically extract text and structured data from documents with Amazon Textract shows how to use Amazon Textract to automatically extract text and data from scanned documents without any machine learning (ML) experience. One of the use cases covered in the post is search and discovery. You can search through millions of documents by extracting text and structured data from documents with Amazon Textract and creating a smart index using Amazon ES.

This post demonstrates how to generate searchable PDF documents by extracting text from scanned documents using Amazon Textract. The solution allows you to download relevant documents, search within a document when it is stored offline, or select and copy text.

You can see an example of searchable PDF document that is generated using Amazon Textract from a scanned document. While text is locked in images in the scanned document, you can select, copy, and search text in the searchable PDF document.

To generate a searchable PDF, use Amazon Textract to extract text from documents and add the extracted text as a layer to the image in the PDF document. Amazon Textract detects and analyzes text input documents and returns information about detected items such as pages, words, lines, form data (key-value pairs), tables, and selection elements. It also provides bounding box information, which is an axis-aligned coarse representation of the location of the recognized item on the document page. You can use the detected text and its bounding box information to place text in the PDF page.

PDFDocument is a sample library in AWS Samples GitHub repo and provides the necessary logic to generate a searchable PDF document using Amazon Textract. It also uses open-source Java library Apache PDFBox to create PDF documents, but there are similar PDF processing libraries available in other programming languages.

The following code example shows how to use sample library to generate a searchable PDF document from an image:

...

//Extract text using Amazon Textract
List<TextLine> lines = extractText(imageBytes);

//Generate searchable PDF with image and text
PDFDocument doc = new PDFDocument();
doc.addPage(image, imageType, lines);

//Save PDF to local disk
try(OutputStream outputStream = new FileOutputStream(outputDocumentName)) {
    doc.save(outputStream);
}

...

Generating a searchable PDF from an image document

The following code shows how to take an image document and generate a corresponding searchable PDF document. Extract the text using Amazon Textract and create a searchable PDF by adding the text as a layer with the image.

public class DemoPdfFromLocalImage {

    public static void run(String documentName, String outputDocumentName) throws IOException {

        System.out.println("Generating searchable pdf from: " + documentName);

        ImageType imageType = ImageType.JPEG;
        if(documentName.toLowerCase().endsWith(".png"))
            imageType = ImageType.PNG;

        //Get image bytes
        ByteBuffer imageBytes = null;
        try(InputStream in = new FileInputStream(documentName)) {
            imageBytes = ByteBuffer.wrap(IOUtils.toByteArray(in));
        }

        //Extract text
        List<TextLine> lines = extractText(imageBytes);

        //Get Image
        BufferedImage image = getImage(documentName);

        //Create new pdf document
        PDFDocument pdfDocument = new PDFDocument();

        //Add page with text layer and image in the pdf document
        pdfDocument.addPage(image, imageType, lines);

        //Save PDF to local disk
        try(OutputStream outputStream = new FileOutputStream(outputDocumentName)) {
            pdfDocument.save(outputStream);
            pdfDocument.close();
        }

        System.out.println("Generated searchable pdf: " + outputDocumentName);
    }
    
    private static BufferedImage getImage(String documentName) throws IOException {

        BufferedImage image = null;

        try(InputStream in = new FileInputStream(documentName)) {
            image = ImageIO.read(in);
        }

        return image;
    }

    private static List<TextLine> extractText(ByteBuffer imageBytes) {

        AmazonTextract client = AmazonTextractClientBuilder.defaultClient();

        DetectDocumentTextRequest request = new DetectDocumentTextRequest()
                .withDocument(new Document()
                        .withBytes(imageBytes));

        DetectDocumentTextResult result = client.detectDocumentText(request);

        List<TextLine> lines = new ArrayList<TextLine>();
        List<Block> blocks = result.getBlocks();
        BoundingBox boundingBox = null;
        for (Block block : blocks) {
            if ((block.getBlockType()).equals("LINE")) {
                boundingBox = block.getGeometry().getBoundingBox();
                lines.add(new TextLine(boundingBox.getLeft(),
                        boundingBox.getTop(),
                        boundingBox.getWidth(),
                        boundingBox.getHeight(),
                        block.getText()));
            }
        }

        return lines;
    }
}

Generating a searchable PDF from a PDF document

The following code example takes an input PDF document from an Amazon S3 bucket and generates the corresponding searchable PDF document. You extract text from the PDF document using Amazon Textract, and create a searchable PDF by adding text as a layer with an image for each page.

public class DemoPdfFromS3Pdf {
    public static void run(String bucketName, String documentName, String outputDocumentName) throws IOException, InterruptedException {

        System.out.println("Generating searchable pdf from: " + bucketName + "/" + documentName);

        //Extract text using Amazon Textract
        List<ArrayList<TextLine>> linesInPages = extractText(bucketName, documentName);

        //Get input pdf document from Amazon S3
        InputStream inputPdf = getPdfFromS3(bucketName, documentName);

        //Create new PDF document
        PDFDocument pdfDocument = new PDFDocument();

        //For each page add text layer and image in the pdf document
        PDDocument inputDocument = PDDocument.load(inputPdf);
        PDFRenderer pdfRenderer = new PDFRenderer(inputDocument);
        BufferedImage image = null;
        for (int page = 0; page < inputDocument.getNumberOfPages(); ++page) {
            image = pdfRenderer.renderImageWithDPI(page, 300, org.apache.pdfbox.rendering.ImageType.RGB);

            pdfDocument.addPage(image, ImageType.JPEG, linesInPages.get(page));

            System.out.println("Processed page index: " + page);
        }

        //Save PDF to stream
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        pdfDocument.save(os);
        pdfDocument.close();
        inputDocument.close();

        //Upload PDF to S3
        UploadToS3(bucketName, outputDocumentName, "application/pdf", os.toByteArray());

        System.out.println("Generated searchable pdf: " + bucketName + "/" + outputDocumentName);
    }

    private static List<ArrayList<TextLine>> extractText(String bucketName, String documentName) throws InterruptedException {

        AmazonTextract client = AmazonTextractClientBuilder.defaultClient();

        StartDocumentTextDetectionRequest req = new StartDocumentTextDetectionRequest()
                .withDocumentLocation(new DocumentLocation()
                        .withS3Object(new S3Object()
                                .withBucket(bucketName)
                                .withName(documentName)))
                .withJobTag("DetectingText");

        StartDocumentTextDetectionResult startDocumentTextDetectionResult = client.startDocumentTextDetection(req);
        String startJobId = startDocumentTextDetectionResult.getJobId();

        System.out.println("Text detection job started with Id: " + startJobId);

        GetDocumentTextDetectionRequest documentTextDetectionRequest = null;
        GetDocumentTextDetectionResult response = null;

        String jobStatus = "IN_PROGRESS";

        while (jobStatus.equals("IN_PROGRESS")) {
            System.out.println("Waiting for job to complete...");
            TimeUnit.SECONDS.sleep(10);
            documentTextDetectionRequest = new GetDocumentTextDetectionRequest()
                    .withJobId(startJobId)
                    .withMaxResults(1);

            response = client.getDocumentTextDetection(documentTextDetectionRequest);
            jobStatus = response.getJobStatus();
        }

        int maxResults = 1000;
        String paginationToken = null;
        Boolean finished = false;

        List<ArrayList<TextLine>> pages = new ArrayList<ArrayList<TextLine>>();
        ArrayList<TextLine> page = null;
        BoundingBox boundingBox = null;

        while (finished == false) {
            documentTextDetectionRequest = new GetDocumentTextDetectionRequest()
                    .withJobId(startJobId)
                    .withMaxResults(maxResults)
                    .withNextToken(paginationToken);
            response = client.getDocumentTextDetection(documentTextDetectionRequest);

            //Show blocks information
            List<Block> blocks = response.getBlocks();
            for (Block block : blocks) {
                if (block.getBlockType().equals("PAGE")) {
                    page = new ArrayList<TextLine>();
                    pages.add(page);
                } else if (block.getBlockType().equals("LINE")) {
                    boundingBox = block.getGeometry().getBoundingBox();
                    page.add(new TextLine(boundingBox.getLeft(),
                            boundingBox.getTop(),
                            boundingBox.getWidth(),
                            boundingBox.getHeight(),
                            block.getText()));
                }
            }
            paginationToken = response.getNextToken();
            if (paginationToken == null)
                finished = true;
        }

        return pages;
    }

    private static InputStream getPdfFromS3(String bucketName, String documentName) throws IOException {

        AmazonS3 s3client = AmazonS3ClientBuilder.defaultClient();
        com.amazonaws.services.s3.model.S3Object fullObject = s3client.getObject(new GetObjectRequest(bucketName, documentName));
        InputStream in = fullObject.getObjectContent();
        return in;
    }

    private static void UploadToS3(String bucketName, String objectName, String contentType, byte[] bytes) {
        AmazonS3 s3client = AmazonS3ClientBuilder.defaultClient();
        ByteArrayInputStream baInputStream = new ByteArrayInputStream(bytes);
        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setContentLength(bytes.length);
        metadata.setContentType(contentType);
        PutObjectRequest putRequest = new PutObjectRequest(bucketName, objectName, baInputStream, metadata);
        s3client.putObject(putRequest);
    }
}

Running code on a local machine

To run the code on a local machine, complete the following steps. The code examples are available on the GitHub repo.

Set up your AWS Account and AWS CLI.

For more information, see Getting Started with Amazon Textract.

Download and unzip searchablepdf.zip from the GitHub repo.
Install Apache Maven if it is not already installed.
In the project directory, run mvn package.
Run java -cp target/searchable-pdf-1.0.jar Demo.

This runs the Java project with Demo as the main class.

By default, only the first example to create a searchable PDF from an image on a local drive is enabled. To run other examples, uncomment the relevant lines in Demo class.

Running code in Lambda

To run the code in Lambda, complete the following steps. The code examples are available on the GitHub repo.

Download and unzip searchablepdf.zip from the GitHub repo.
Install Apache Maven if it is not already installed.
In the project directory, run mvn package.

The build creates a .jar in project-dir/target/searchable-pdf1.0.jar, using information in the pom.xml to do the necessary transforms. This is a standalone .jar (.zip file) that includes all the dependencies. This is your deployment package that you can upload to Lambda to create a function. For more information, see AWS Lambda Deployment Package in Java. DemoLambda has all the necessary code to read S3 events and take action based on the type of input document.

Create a Lambda with Java 8 and IAM role that has read and write permissions to the S3 bucket you created earlier.
Configure the IAM role to also have permissions to call Amazon Textract.
Set handler to DemoLambda::handleRequest.
Increase timeout to 5 minutes.
Upload the .jar file you built earlier.
Create an S3 bucket.
In the S3 bucket, create a folder labeled documents.
Add a trigger in the Lambda function such that when an object uploads to the documents folder, the Lambda function executes.

Make sure that you set a trigger for the documents folder. If you add a trigger for the whole bucket, the function also triggers every time an output PDF document generates.

Upload an image (.jpeg or .png) or PDF document to the documents folder in your S3 bucket.

In a few seconds, you should see the searchable PDF document in your S3 bucket.

These steps show simple S3 and Lambda integration. For large-scale document processing, see the reference architecture at following GitHub repo.

Conclusion

This post showed how to use Amazon Textract to generate searchable PDF documents automatically. You can search across millions of documents to find the relevant file by creating a smart search index using Amazon ES. Searchable PDF documents then allows you to select and copy text and search within a document after downloading it for offline use.

To learn more about different text and data extraction features of Amazon Textract, see How Amazon Textract Works.

About the Authors

Kashif Imran is a Solutions Architect at Amazon Web Services. He works with some of the largest strategic AWS customers to provide technical guidance and design advice. His expertise spans application architecture, serverless, containers, NoSQL and machine learning.

Blog

Learn About Our Meetup

5000+ Members

MEETUPS

JOB POSTINGS

CONTACT