
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

python/cross_service/textract_comprehend_notebook/TextractAndComprehendNotebook.ipynb


This cross-service notebook walks you through the process of using Amazon Textract's DetectDocumentText API to extract text from a JPG/JPEG/PNG file containing text, and then using Amazon Comprehend's DetectEntities API to find entities in the extracted text.

You can run this notebook using Amazon SageMaker.

To run this notebook using SageMaker

  1. Navigate to Amazon SageMaker from the AWS Management Console. From the SageMaker dashboard, select Notebook and then choose "Notebook instances".
  2. Select "Create notebook instance".
  3. Give the notebook instance a name, expand the optional "Git repositories" section, and choose the option to "Clone a public Git repository to this notebook instance only".
  4. In the "Git repository URL" box, paste in the URL for this repository (https://github.com/awsdocs/aws-doc-sdk-examples).
  5. Select "Create notebook instance".
  6. Add policies to your Amazon SageMaker role so that it can access Amazon Simple Storage Service (Amazon S3), Amazon Textract, and Amazon Comprehend. Attach the following managed policies to the role: AmazonTextractFullAccess, ComprehendFullAccess, and AmazonS3ReadOnlyAccess.
  7. After the notebook instance has been created, select it from the list of notebooks.
  8. Choose "Open Jupyter". After the Jupyter notebook has started, navigate to the directory containing this notebook and select this notebook to run it.
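
As an alternative to attaching the policies in the console, step 6 can be scripted. The sketch below builds the three managed policy ARNs and, when `dry_run` is `False`, attaches them with the IAM AttachRolePolicy API. The role name `"MySageMakerRole"` is a placeholder you must replace with your actual SageMaker execution role:

```python
POLICY_NAMES = [
    "AmazonTextractFullAccess",
    "ComprehendFullAccess",
    "AmazonS3ReadOnlyAccess",
]

def attach_notebook_policies(role_name, dry_run=True):
    # AWS managed policies all live under this fixed ARN prefix
    arns = ["arn:aws:iam::aws:policy/" + name for name in POLICY_NAMES]
    if not dry_run:
        import boto3

        iam = boto3.client("iam")
        for arn in arns:
            # Attach each managed policy to the SageMaker execution role
            iam.attach_role_policy(RoleName=role_name, PolicyArn=arn)
    return arns

# "MySageMakerRole" is a placeholder; pass dry_run=False to actually attach
print(attach_notebook_policies("MySageMakerRole"))
```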

To use the AWS SDK for Python (Boto3) in this notebook, you must configure your AWS credentials. Python's getpass module lets you enter your AWS credentials without displaying them on screen. When you run the code cell below, enter the value of your AWS access key in the first text box and the value of your AWS secret key in the second text box.

```python
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: Apache-2.0

import getpass

access_key = getpass.getpass()
secret_key = getpass.getpass()
```

After setting your security credentials, import any other libraries you need. Then set the name of the Amazon S3 bucket that contains your image and the name of the image itself. In the code below, replace the value of "bucket" with the name of your bucket, the value of "document" with the name of the image file you want to analyze, and the value of "region_name" with the Region you are operating in.

```python
import boto3
import io
from PIL import Image
from IPython.display import display
import json
import pandas as pd
import os

bucket = "amzn-s3-demo-bucket"
document = "Name of your document"
region_name = "Name of your region"
```

Create a function that connects to both Amazon S3 and Amazon Textract via Boto3. The function presented in the following code starts by connecting to the Amazon S3 resource and retrieving the image you specified from the bucket you specified. The function then connects to Amazon Textract and calls the DetectDocumentText API to extract the text in the image. The lines of text found in the document are stored in a list and returned.

```python
def process_text_detection(bucket, document, access_code, secret_code, region):
    # Connect to Amazon S3 with the credentials entered earlier
    s3_connection = boto3.resource(
        "s3",
        region_name=region,
        aws_access_key_id=access_code,
        aws_secret_access_key=secret_code,
    )

    # Connect to Amazon Textract to detect text in the document
    client = boto3.client(
        "textract",
        region_name=region,
        aws_access_key_id=access_code,
        aws_secret_access_key=secret_code,
    )

    # Get the image object from Amazon S3
    s3_object = s3_connection.Object(bucket, document)
    s3_response = s3_object.get()

    # Open a binary stream using an in-memory bytes buffer
    stream = io.BytesIO(s3_response["Body"].read())

    # Load the stream into an image and display it
    image = Image.open(stream)
    display(image)

    # Detect the text in the document stored in Amazon S3
    response = client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": document}}
    )

    # Keep only the LINE blocks from the detected text
    line_list = []
    for block in response["Blocks"]:
        if block["BlockType"] == "LINE":
            line_list.append(block["Text"])

    return line_list
```

```python
lines = process_text_detection(bucket, document, access_key, secret_key, region_name)
print("Text found: " + str(lines))
```
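
To see what the LINE filter is doing, here is a hand-written fragment shaped like a DetectDocumentText response (illustrative values only, not real API output) with the same filter applied:

```python
# Hand-written sample shaped like a DetectDocumentText response (not real output)
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Employment Application"},
        {"BlockType": "WORD", "Text": "Employment"},
        {"BlockType": "LINE", "Text": "Full Name: Jane Doe"},
    ]
}

# Keep only the LINE blocks, as process_text_detection does
line_list = [
    block["Text"]
    for block in sample_response["Blocks"]
    if block["BlockType"] == "LINE"
]
print(line_list)  # ['Employment Application', 'Full Name: Jane Doe']
```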

You can now send the lines you extracted from the image to Amazon Comprehend and use the DetectEntities API to find all entities within those lines. You'll need a function that iterates through the list of lines returned by the "process_text_detection" function you wrote earlier and calls the DetectEntities operation on every line.

```python
def entity_detection(lines, access_code, secret_code, region):
    # Create a list to hold the entities found for every line
    response_entities = []

    # Connect to Amazon Comprehend
    comprehend = boto3.client(
        service_name="comprehend",
        region_name=region,
        aws_access_key_id=access_code,
        aws_secret_access_key=secret_code,
    )

    # Iterate through the lines in the list of lines
    for line in lines:
        # List to hold the entities found in this line
        entities_list = []

        # Call the DetectEntities operation and pass it a line from lines
        found_entities = comprehend.detect_entities(Text=line, LanguageCode="en")

        # Collect the text of each entity found in the line
        for entity in found_entities["Entities"]:
            entities_list.append(entity["Text"])
            print("Entity found: " + entity["Text"])

        # Add this line's entities to the list of all entities found
        response_entities.append(entities_list)

    return response_entities
```

```python
print("Calling DetectEntities:")
print("------")
response_ents = entity_detection(lines, access_key, secret_key, region_name)
```
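
For reference, each DetectEntities response contains an "Entities" list whose items carry the matched Text along with a Type and a confidence Score. The fragment below is hand-written to show the shape (illustrative values, not real API output); entity_detection collects the Text field of each item:

```python
# Hand-written sample shaped like a DetectEntities response (not real output)
found_entities = {
    "Entities": [
        {"Text": "Jane Doe", "Type": "PERSON", "Score": 0.99},
        {"Text": "Seattle", "Type": "LOCATION", "Score": 0.97},
    ]
}

# Pull out the Text field of each detected entity
entities_list = [entity["Text"] for entity in found_entities["Entities"]]
print(entities_list)  # ['Jane Doe', 'Seattle']
```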

Now that you have the lines extracted by Amazon Textract and the entities found in those lines, you can create a DataFrame that shows both. In the code below, a pandas DataFrame is constructed that displays each line found in the input image alongside its associated entities.

```python
entities_dict = {"Lines": lines, "Entities": response_ents}
df = pd.DataFrame(entities_dict, columns=["Lines", "Entities"])
print(df)
```
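
If you would rather see one row per entity than a list per line, pandas `DataFrame.explode` flattens the "Entities" column. The sketch below uses sample data in place of the real Textract and Comprehend results:

```python
import pandas as pd

# Sample data standing in for the Textract lines and Comprehend entities
lines = ["Jane Doe lives in Seattle", "No entities here"]
response_ents = [["Jane Doe", "Seattle"], []]

df = pd.DataFrame({"Lines": lines, "Entities": response_ents})

# One row per (line, entity) pair; lines with no entities become NaN
per_entity = df.explode("Entities")
print(per_entity)
```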