Author: torontoai

Generating searchable PDFs from scanned documents automatically with Amazon Textract

Amazon Textract is a machine learning service that makes it easy to extract text and data from virtually any document. Textract goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables. This allows you to use Amazon Textract to instantly “read” virtually any type of document and accurately extract text and data without the need for any manual effort or custom code.

The blog post Automatically extract text and structured data from documents with Amazon Textract shows how to use Amazon Textract to automatically extract text and data from scanned documents without any machine learning (ML) experience. One of the use cases covered in the post is search and discovery. You can search through millions of documents by extracting text and structured data from documents with Amazon Textract and creating a smart index using Amazon Elasticsearch Service (Amazon ES).

This post demonstrates how to generate searchable PDF documents by extracting text from scanned documents using Amazon Textract. The solution allows you to download relevant documents, search within a document when it is stored offline, or select and copy text.

You can see an example of a searchable PDF document that is generated using Amazon Textract from a scanned document. While the text in the scanned document is locked in images, you can select, copy, and search text in the searchable PDF document.

To generate a searchable PDF, use Amazon Textract to extract text from documents and add the extracted text as a layer to the image in the PDF document. Amazon Textract detects and analyzes text in input documents and returns information about detected items such as pages, words, lines, form data (key-value pairs), tables, and selection elements. It also provides bounding box information, which is an axis-aligned coarse representation of the location of the recognized item on the document page. You can use the detected text and its bounding box information to place text in the PDF page.
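To make the placement step concrete, here is a minimal sketch of the coordinate conversion involved. Amazon Textract reports each bounding box as ratios of the page width and height, measured from the top-left corner, while PDF user space is measured in points from the bottom-left corner. The class and method names below are hypothetical and are not part of the sample library; they only illustrate the arithmetic.

```java
/**
 * Sketch: map a Textract-style bounding box (ratios of page width and height,
 * origin at the top-left corner) to a PDF rectangle in points (origin at the
 * bottom-left corner). All names here are illustrative assumptions.
 */
final class BoundingBoxMapper {

    /** Rectangle in PDF user space; (x, y) is the lower-left corner. */
    static final class PdfRect {
        final float x;
        final float y;
        final float width;
        final float height;

        PdfRect(float x, float y, float width, float height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /**
     * @param left       distance from the left edge, as a ratio of page width
     * @param top        distance from the top edge, as a ratio of page height
     * @param width      box width, as a ratio of page width
     * @param height     box height, as a ratio of page height
     * @param pageWidth  page width in points
     * @param pageHeight page height in points
     */
    static PdfRect toPdfRect(float left, float top, float width, float height,
                             float pageWidth, float pageHeight) {
        float w = width * pageWidth;
        float h = height * pageHeight;
        float x = left * pageWidth;
        // Flip the vertical axis: the box is measured down from the top of the
        // page, but PDF coordinates grow upward from the bottom.
        float y = pageHeight - (top * pageHeight) - h;
        return new PdfRect(x, y, w, h);
    }
}
```

For a US Letter page (612 x 792 points), a box with left 0.1, top 0.2, width 0.5, and height 0.05 maps to roughly x = 61.2, y = 594, width = 306, and height = 39.6.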

PDFDocument is a sample library in the AWS Samples GitHub repo that provides the necessary logic to generate a searchable PDF document using Amazon Textract. It uses the open-source Java library Apache PDFBox to create PDF documents, but similar PDF processing libraries are available in other programming languages.

The following code example shows how to use the sample library to generate a searchable PDF document from an image:

...

//Extract text using Amazon Textract
List<TextLine> lines = extractText(imageBytes);

//Generate searchable PDF with image and text
PDFDocument doc = new PDFDocument();
doc.addPage(image, imageType, lines);

//Save PDF to local disk
try(OutputStream outputStream = new FileOutputStream(outputDocumentName)) {
    doc.save(outputStream);
}

...

Generating a searchable PDF from an image document

The following code shows how to take an image document and generate a corresponding searchable PDF document. Extract the text using Amazon Textract and create a searchable PDF by adding the text as a layer with the image.

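import java.awt.image.BufferedImage;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import javax.imageio.ImageIO;

import com.amazonaws.services.textract.AmazonTextract;
import com.amazonaws.services.textract.AmazonTextractClientBuilder;
import com.amazonaws.services.textract.model.Block;
import com.amazonaws.services.textract.model.BoundingBox;
import com.amazonaws.services.textract.model.DetectDocumentTextRequest;
import com.amazonaws.services.textract.model.DetectDocumentTextResult;
import com.amazonaws.services.textract.model.Document;
import com.amazonaws.util.IOUtils;

//PDFDocument, TextLine, and ImageType are classes from the sample library;
//import them from the package used in the GitHub repo.

public class DemoPdfFromLocalImage {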

    public static void run(String documentName, String outputDocumentName) throws IOException {

        System.out.println("Generating searchable pdf from: " + documentName);

        ImageType imageType = ImageType.JPEG;
        if(documentName.toLowerCase().endsWith(".png"))
            imageType = ImageType.PNG;

        //Get image bytes
        ByteBuffer imageBytes = null;
        try(InputStream in = new FileInputStream(documentName)) {
            imageBytes = ByteBuffer.wrap(IOUtils.toByteArray(in));
        }

        //Extract text
        List<TextLine> lines = extractText(imageBytes);

        //Get Image
        BufferedImage image = getImage(documentName);

        //Create new pdf document
        PDFDocument pdfDocument = new PDFDocument();

        //Add page with text layer and image in the pdf document
        pdfDocument.addPage(image, imageType, lines);

        //Save PDF to local disk
        try(OutputStream outputStream = new FileOutputStream(outputDocumentName)) {
            pdfDocument.save(outputStream);
            pdfDocument.close();
        }

        System.out.println("Generated searchable pdf: " + outputDocumentName);
    }
    
    private static BufferedImage getImage(String documentName) throws IOException {

        BufferedImage image = null;

        try(InputStream in = new FileInputStream(documentName)) {
            image = ImageIO.read(in);
        }

        return image;
    }

    private static List<TextLine> extractText(ByteBuffer imageBytes) {

        AmazonTextract client = AmazonTextractClientBuilder.defaultClient();

        DetectDocumentTextRequest request = new DetectDocumentTextRequest()
                .withDocument(new Document()
                        .withBytes(imageBytes));

        DetectDocumentTextResult result = client.detectDocumentText(request);

        List<TextLine> lines = new ArrayList<TextLine>();
        List<Block> blocks = result.getBlocks();
        BoundingBox boundingBox = null;
        for (Block block : blocks) {
            if ((block.getBlockType()).equals("LINE")) {
                boundingBox = block.getGeometry().getBoundingBox();
                lines.add(new TextLine(boundingBox.getLeft(),
                        boundingBox.getTop(),
                        boundingBox.getWidth(),
                        boundingBox.getHeight(),
                        block.getText()));
            }
        }

        return lines;
    }
}
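Note that the listing above treats every file that does not end in .png as a JPEG. If you prefer to fail fast on unsupported inputs, a stricter check could look like the following sketch. The enum here is a hypothetical stand-in for the sample library's ImageType, and the class name is an assumption for illustration.

```java
import java.util.Locale;

/**
 * Sketch: choose an image type from the file extension and reject anything
 * that is not a JPEG or PNG. SupportedImageType is a hypothetical stand-in
 * for the ImageType class used in the listing above.
 */
final class ImageTypeDetector {

    enum SupportedImageType { JPEG, PNG }

    static SupportedImageType fromFileName(String fileName) {
        // Locale.ROOT avoids locale-specific lowercasing surprises.
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".png")) {
            return SupportedImageType.PNG;
        }
        if (lower.endsWith(".jpg") || lower.endsWith(".jpeg")) {
            return SupportedImageType.JPEG;
        }
        throw new IllegalArgumentException("Unsupported image type: " + fileName);
    }
}
```

Failing early this way keeps an unsupported file from reaching the text extraction call at all.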

Generating a searchable PDF from a PDF document

The following code example takes an input PDF document from an Amazon S3 bucket and generates the corresponding searchable PDF document. You extract text from the PDF document using Amazon Textract, and create a searchable PDF by adding text as a layer with an image for each page.

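import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;
import com.amazonaws.services.textract.AmazonTextract;
import com.amazonaws.services.textract.AmazonTextractClientBuilder;
import com.amazonaws.services.textract.model.Block;
import com.amazonaws.services.textract.model.BoundingBox;
import com.amazonaws.services.textract.model.DocumentLocation;
import com.amazonaws.services.textract.model.GetDocumentTextDetectionRequest;
import com.amazonaws.services.textract.model.GetDocumentTextDetectionResult;
import com.amazonaws.services.textract.model.S3Object;
import com.amazonaws.services.textract.model.StartDocumentTextDetectionRequest;
import com.amazonaws.services.textract.model.StartDocumentTextDetectionResult;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

//PDFDocument, TextLine, and ImageType are classes from the sample library;
//import them from the package used in the GitHub repo.

public class DemoPdfFromS3Pdf {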
    public static void run(String bucketName, String documentName, String outputDocumentName) throws IOException, InterruptedException {

        System.out.println("Generating searchable pdf from: " + bucketName + "/" + documentName);

        //Extract text using Amazon Textract
        List<ArrayList<TextLine>> linesInPages = extractText(bucketName, documentName);

        //Get input pdf document from Amazon S3
        InputStream inputPdf = getPdfFromS3(bucketName, documentName);

        //Create new PDF document
        PDFDocument pdfDocument = new PDFDocument();

        //For each page add text layer and image in the pdf document
        PDDocument inputDocument = PDDocument.load(inputPdf);
        PDFRenderer pdfRenderer = new PDFRenderer(inputDocument);
        BufferedImage image = null;
        for (int page = 0; page < inputDocument.getNumberOfPages(); ++page) {
            image = pdfRenderer.renderImageWithDPI(page, 300, org.apache.pdfbox.rendering.ImageType.RGB);

            pdfDocument.addPage(image, ImageType.JPEG, linesInPages.get(page));

            System.out.println("Processed page index: " + page);
        }

        //Save PDF to stream
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        pdfDocument.save(os);
        pdfDocument.close();
        inputDocument.close();

        //Upload PDF to S3
        UploadToS3(bucketName, outputDocumentName, "application/pdf", os.toByteArray());

        System.out.println("Generated searchable pdf: " + bucketName + "/" + outputDocumentName);
    }

    private static List<ArrayList<TextLine>> extractText(String bucketName, String documentName) throws InterruptedException {

        AmazonTextract client = AmazonTextractClientBuilder.defaultClient();

        StartDocumentTextDetectionRequest req = new StartDocumentTextDetectionRequest()
                .withDocumentLocation(new DocumentLocation()
                        .withS3Object(new S3Object()
                                .withBucket(bucketName)
                                .withName(documentName)))
                .withJobTag("DetectingText");

        StartDocumentTextDetectionResult startDocumentTextDetectionResult = client.startDocumentTextDetection(req);
        String startJobId = startDocumentTextDetectionResult.getJobId();

        System.out.println("Text detection job started with Id: " + startJobId);

        GetDocumentTextDetectionRequest documentTextDetectionRequest = null;
        GetDocumentTextDetectionResult response = null;

        String jobStatus = "IN_PROGRESS";

        while (jobStatus.equals("IN_PROGRESS")) {
            System.out.println("Waiting for job to complete...");
            TimeUnit.SECONDS.sleep(10);
            documentTextDetectionRequest = new GetDocumentTextDetectionRequest()
                    .withJobId(startJobId)
                    .withMaxResults(1);

            response = client.getDocumentTextDetection(documentTextDetectionRequest);
            jobStatus = response.getJobStatus();
        }

        //Stop here if the job failed; there are no results to read
        if (jobStatus.equals("FAILED")) {
            throw new IllegalStateException("Text detection job failed: " + startJobId);
        }

        int maxResults = 1000;
        String paginationToken = null;
        boolean finished = false;

        List<ArrayList<TextLine>> pages = new ArrayList<ArrayList<TextLine>>();
        ArrayList<TextLine> page = null;
        BoundingBox boundingBox = null;

        while (!finished) {
            documentTextDetectionRequest = new GetDocumentTextDetectionRequest()
                    .withJobId(startJobId)
                    .withMaxResults(maxResults)
                    .withNextToken(paginationToken);
            response = client.getDocumentTextDetection(documentTextDetectionRequest);

            //Show blocks information
            List<Block> blocks = response.getBlocks();
            for (Block block : blocks) {
                if (block.getBlockType().equals("PAGE")) {
                    page = new ArrayList<TextLine>();
                    pages.add(page);
                } else if (block.getBlockType().equals("LINE")) {
                    boundingBox = block.getGeometry().getBoundingBox();
                    page.add(new TextLine(boundingBox.getLeft(),
                            boundingBox.getTop(),
                            boundingBox.getWidth(),
                            boundingBox.getHeight(),
                            block.getText()));
                }
            }
            paginationToken = response.getNextToken();
            if (paginationToken == null)
                finished = true;
        }

        return pages;
    }

    private static InputStream getPdfFromS3(String bucketName, String documentName) throws IOException {

        AmazonS3 s3client = AmazonS3ClientBuilder.defaultClient();
        com.amazonaws.services.s3.model.S3Object fullObject = s3client.getObject(new GetObjectRequest(bucketName, documentName));
        InputStream in = fullObject.getObjectContent();
        return in;
    }

    private static void UploadToS3(String bucketName, String objectName, String contentType, byte[] bytes) {
        AmazonS3 s3client = AmazonS3ClientBuilder.defaultClient();
        ByteArrayInputStream baInputStream = new ByteArrayInputStream(bytes);
        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setContentLength(bytes.length);
        metadata.setContentType(contentType);
        PutObjectRequest putRequest = new PutObjectRequest(bucketName, objectName, baInputStream, metadata);
        s3client.putObject(putRequest);
    }
}
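The polling loop in the listing above waits for as long as the job reports IN_PROGRESS. As a sketch of one way to bound that wait, the helper below polls a status source a limited number of times and returns the first status that is not IN_PROGRESS (for Amazon Textract, that is SUCCEEDED, FAILED, or PARTIAL_SUCCESS). The class name, method name, and parameters are assumptions for illustration; the status source is passed in as a Supplier so that the sketch does not depend on the AWS SDK.

```java
import java.util.function.Supplier;

/**
 * Sketch: poll a job status with an upper bound on the number of checks.
 * JobPoller and its parameters are hypothetical, not part of the sample code.
 */
final class JobPoller {

    /**
     * @param statusSupplier     returns the current job status on each call
     * @param pollIntervalMillis pause between status checks
     * @param maxAttempts        maximum number of status checks
     * @return the first status that is not IN_PROGRESS
     */
    static String waitForCompletion(Supplier<String> statusSupplier,
                                    long pollIntervalMillis,
                                    int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String status = statusSupplier.get();
            if (!"IN_PROGRESS".equals(status)) {
                return status; // terminal state; the caller decides what to do
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(pollIntervalMillis);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and stop waiting.
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting for job", e);
                }
            }
        }
        throw new IllegalStateException(
                "Job still IN_PROGRESS after " + maxAttempts + " status checks");
    }
}
```

In the listing above, the supplier would wrap the getDocumentTextDetection call and return response.getJobStatus(). The caller would then read the result pages only when the returned status indicates that results are available.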

Running code on a local machine

To run the code on a local machine, complete the following steps. The code examples are available on the GitHub repo.

  1. Set up your AWS Account and AWS CLI.

For more information, see Getting Started with Amazon Textract.

  2. Download and unzip searchablepdf.zip from the GitHub repo.
  3. Install Apache Maven if it is not already installed.
  4. In the project directory, run mvn package.
  5. Run java -cp target/searchable-pdf-1.0.jar Demo.

This runs the Java project with Demo as the main class.

By default, only the first example, which creates a searchable PDF from an image on a local drive, is enabled. To run the other examples, uncomment the relevant lines in the Demo class.

Running code in Lambda

To run the code in Lambda, complete the following steps. The code examples are available on the GitHub repo.

  1. Download and unzip searchablepdf.zip from the GitHub repo.
  2. Install Apache Maven if it is not already installed.
  3. In the project directory, run mvn package.

The build creates a .jar in project-dir/target/searchable-pdf-1.0.jar, using information in the pom.xml to do the necessary transforms. This is a standalone .jar (.zip file) that includes all the dependencies. It is the deployment package that you upload to Lambda to create a function. For more information, see AWS Lambda Deployment Package in Java. DemoLambda has all the necessary code to read S3 events and take action based on the type of input document.

  4. Create a Lambda function with the Java 8 runtime and an IAM role that has read and write permissions to the S3 bucket for this walkthrough (you create the bucket in step 9).
  5. Configure the IAM role to also have permissions to call Amazon Textract.
  6. Set the handler to DemoLambda::handleRequest.
  7. Increase the timeout to 5 minutes.
  8. Upload the .jar file you built earlier.
  9. Create an S3 bucket.
  10. In the S3 bucket, create a folder labeled documents.
  11. Add a trigger to the Lambda function so that the function executes when an object is uploaded to the documents folder.

Make sure that you set the trigger for the documents folder only. If you add a trigger for the whole bucket, the function is also triggered every time an output PDF document is generated.

  12. Upload an image (.jpeg or .png) or PDF document to the documents folder in your S3 bucket.

In a few seconds, you should see the searchable PDF document in your S3 bucket.
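As an extra safeguard against the retriggering problem described in the note about the trigger, the function itself can ignore objects that are outside the documents folder. The following is a minimal sketch of such a check on the object key. The class name, the method name, and the exact set of accepted extensions are assumptions for illustration and are not taken from DemoLambda.

```java
import java.util.Locale;

/**
 * Sketch: decide whether an S3 object key should be processed.
 * DocumentKeyFilter is a hypothetical helper, not part of the sample code.
 */
final class DocumentKeyFilter {

    // S3 "folders" are key prefixes, so the documents folder is this prefix.
    static final String INPUT_PREFIX = "documents/";

    static boolean shouldProcess(String key) {
        // Skip keys outside the input folder and the folder placeholder itself.
        if (key == null || !key.startsWith(INPUT_PREFIX) || key.endsWith("/")) {
            return false;
        }
        String lower = key.toLowerCase(Locale.ROOT);
        return lower.endsWith(".pdf")
                || lower.endsWith(".png")
                || lower.endsWith(".jpg")
                || lower.endsWith(".jpeg");
    }
}
```

With a check like this, an output PDF written to a different prefix is ignored even if the trigger is accidentally configured for the whole bucket.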

These steps show a simple S3 and Lambda integration. For large-scale document processing, see the reference architecture in the following GitHub repo.

Conclusion

This post showed how to use Amazon Textract to generate searchable PDF documents automatically. You can search across millions of documents to find the relevant file by creating a smart search index using Amazon ES. Searchable PDF documents then allow you to select and copy text and to search within a document after downloading it for offline use.

To learn more about different text and data extraction features of Amazon Textract, see How Amazon Textract Works.


About the Authors

Kashif Imran is a Solutions Architect at Amazon Web Services. He works with some of the largest strategic AWS customers to provide technical guidance and design advice. His expertise spans application architecture, serverless, containers, NoSQL and machine learning.

[D] Machine Learning as a career change

Hi all,

I just wanted to get a straw poll opinion on this.

I’m currently a C-suite level employee of a company and I make over $100k/year. My background is backend and front end development and now my job is largely project management with some coding.

I’ve been taking the Fast AI course and wondered whether career transitions to machine learning are quite do-able?

I currently work from home (remote) so would be looking to do the same with machine learning.

I suppose I’m wondering whether salary + remote work would remain as reasonable expectations in a career switch.

Any info or insight would be great 👍

submitted by /u/ceilingbeetle

[R] MimickNet, Matching Clinical Ultrasound Post-Processing via CycleGANs Code Release

Not sure how many Ultrasound or medical imaging folks are in here, but thought this might be useful to this group. I’m part of an ultrasound research lab at Duke University, and we’ve recently open-sourced work on ultrasound image post-processing which allows one to mimic proprietary post-processing black-boxes found on commercial ultrasound scanners. Here is the: Paper, Github, Colab notebook.

https://arxiv.org/abs/1908.05782

When creating an ultrasound image from scratch, it is common to have speckle noise, Gaussian noise, clutter, reverberation, and other undesirable forms of image degradation. While raw ultrasound images are very familiar to researchers, medical providers will typically only look at heavily post-processed images in the clinic. Unfortunately, commercial post-processing is generally proprietary and kept secret. The inaccessibility makes apples-to-apples comparisons of novel methods to current clinical practice difficult. It also makes the translation of novel methods into the clinic difficult. Ideally, the post-processing is not secret, and everyone can always have lovely images to look at as a baseline. We find that it is possible to mimic the post-processing found on commercial scanners through CycleGANs by just using images acquired via regular use. CycleGANs do not require any image registration or image pairing to train, which is very convenient. We are releasing the fully trained models so that any researcher has access to clinical-grade post-processing. We refer to our trained models as MimickNet.

TLDR: Clinical Ultrasound Post-Processing is kept proprietary and secret. However, by using data collected just via intended ultrasound scanner use, it is possible to mimic the post-processing algorithm found on some of the best ultrasound scanners. We are making these models available to any researcher, so we all have access to clinical-grade post-processing.

submitted by /u/ououwen

GauGAN Rocket Man: Conceptual Artist Uses AI Tools for Sci-Fi Modeling

Have you ever wondered what it takes to produce the complex imagery in films like Star Wars or Transformers? The man behind the magic, Colie Wertz, is here to explain.

Wertz is a conceptual artist and modeler who works on film, television and video games. He sat down with AI Podcast host Noah Kravitz to explain his specialty in hard modeling, in which he produces digital models of objects with hard surfaces like vehicles, robots and computers.

To make these images, Wertz has taken to using AI art tools such as GauGAN, a real-time painting web app that allows users to create realistic landscapes using generative adversarial networks.

Rather than use GauGAN in the traditional manner, Wertz makes the tools “trick themselves” by putting a mountain in the sky, or snow falling at the bottom of the page, to create a unique image. Then he incorporates his signature spaceships into the scene.

Artist Colie Wertz uses the GauGAN landscape to inspire some of his ship designs.

Wertz appreciates how easily GauGAN builds a background. He says, “Coming from the hard surface world, that’s the kind of stuff that’s kind of always been a curveball for me, like matte painting and background composition.” Now, Wertz is able to focus on the ship and how to “integrate it into a background.”

For some of his creations, Wertz uses the GauGAN landscape to inspire his ship designs. He views AI art as a “creative partner” rather than a replacement for more traditional forms of art.

Wertz’s artistic career kickstarted after he left an architectural design firm in South Carolina and moved to Los Angeles to develop his digital art skills. There, he entered one of his spaceship models created with Photoshop into a contest put on by visual effects production company Electric Image.

Colie Wertz views AI art as a “creative partner” rather than a replacement for more traditional forms of art.

The judges were impressed, and Wertz ended up with a job at Industrial Light & Magic, a visual effects company founded by George Lucas. Wertz’s first job was working on the rerelease of Return of the Jedi, building digital models for matte painters.

Listeners curious about Wertz’s current work can look at his portfolio, visit his website or follow him on Instagram.

Help Make the AI Podcast Better

Have a few minutes to spare? Fill out this short listener survey. Your answers will help us make a better podcast.

How to Tune in to the AI Podcast

Get the AI Podcast through iTunes, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, Podkicker, Soundcloud, Stitcher and TuneIn. Your favorite not listed here? Email us at aipodcast [at] nvidia [dot] com.

Image credit: Colie Wertz

The post GauGAN Rocket Man: Conceptual Artist Uses AI Tools for Sci-Fi Modeling appeared first on The Official NVIDIA Blog.