python/cross_service/textract_comprehend_notebook/TextractAndComprehendNotebook.ipynb
This cross-service notebook walks you through the process of using Amazon Textract's DetectDocumentText API to extract text from a JPG/JPEG/PNG file containing text, and then using Amazon Comprehend's DetectEntities API to find entities in the extracted text.
You can run this notebook using Amazon SageMaker.
Prerequisites: this notebook assumes the credentials you use have the following AWS managed policies attached: AmazonTextractFullAccess, ComprehendFullAccess, AmazonS3ReadOnlyAccess.
To use the AWS SDK for Python (Boto3) in this notebook, you need to configure your AWS credentials. The getpass module lets you enter your credentials without echoing them to the screen. When you run the code cell below, enter your AWS access key ID at the first prompt and your AWS secret access key at the second prompt.
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: Apache-2.0
import getpass
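# Label each prompt so it is clear which value is being requested
access_key = getpass.getpass("AWS access key ID: ")
secret_key = getpass.getpass("AWS secret access key: ")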
After entering your credentials, import the other libraries the notebook needs, then set the name of the Amazon S3 bucket that holds your image and the name of the image file itself. In the code below, replace the value of "bucket" with the name of your bucket, the value of "document" with the name of the image file you want to analyze, and the value of "region_name" with the name of the AWS Region you are operating in.
import boto3
import io
from PIL import Image
from IPython.display import display
import json
import pandas as pd
import os
bucket = "amzn-s3-demo-bucket"
document = "Name of your document"
region_name = "Name of your region"
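Because this notebook works with JPG/JPEG/PNG images, it can help to check the file name before making any API calls. The following is a minimal sketch, assuming the object key ends in its file extension; `SUPPORTED_EXTENSIONS` and `is_supported_image` are hypothetical names introduced for illustration and are not part of the original notebook.

```python
import os

# Extensions this notebook expects (assumption based on the introduction)
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def is_supported_image(document_name):
    """Return True if the object key ends in one of the expected extensions."""
    _, extension = os.path.splitext(document_name)
    return extension.lower() in SUPPORTED_EXTENSIONS


# Illustrative file names only
print(is_supported_image("receipt.PNG"))
print(is_supported_image("report.docx"))
```

A check like this only inspects the name, not the file contents, so it catches obvious mistakes early but does not guarantee that the image can be opened.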
Create a function that connects to both Amazon S3 and Amazon Textract via Boto3. The function presented in the following code starts by connecting to the Amazon S3 resource and retrieving the image you specified from the bucket you specified. The function then connects to Amazon Textract and calls the DetectDocumentText API to extract the text in the image. The lines of text found in the document are stored in a list and returned.
def process_text_detection(bucket, document, access_code, secret_code, region):
    # Connect to Amazon S3
    s3_connection = boto3.resource("s3")
    # Connect to Amazon Textract to detect text in the document
    client = boto3.client(
        "textract",
        region_name=region,
        aws_access_key_id=access_code,
        aws_secret_access_key=secret_code,
    )
    # Get the document from Amazon S3
    s3_object = s3_connection.Object(bucket, document)
    s3_response = s3_object.get()
    # Open a binary stream using an in-memory bytes buffer
    stream = io.BytesIO(s3_response["Body"].read())
    # Load the stream into an image
    image = Image.open(stream)
    # Display the image
    display(image)
    # Have Amazon Textract read the document directly from Amazon S3
    response = client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": document}}
    )
    # Get the text blocks
    blocks = response["Blocks"]
    # List to store the lines of text found in the document
    line_list = []
    # Keep only the LINE blocks and collect their text
    for block in blocks:
        if block["BlockType"] == "LINE":
            line_list.append(block["Text"])
    return line_list
lines = process_text_detection(bucket, document, access_key, secret_key, region_name)
print("Text found: " + str(lines))
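To make the filtering step concrete without calling the service, the sketch below applies the same LINE filter to a hand-written fragment shaped like a DetectDocumentText response. The sample values are illustrative, not real Amazon Textract output, and `sample_textract_response`, `extract_lines`, and the `min_confidence` parameter are hypothetical additions that are not part of the original notebook.

```python
# Hand-written fragment shaped like a DetectDocumentText response.
# The values are illustrative, not real Amazon Textract output,
# and most fields (such as Geometry) are omitted.
sample_textract_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "page-1"},
        {"BlockType": "LINE", "Text": "Invoice 1234", "Confidence": 99.1},
        {"BlockType": "WORD", "Text": "Invoice", "Confidence": 99.3},
        {"BlockType": "WORD", "Text": "1234", "Confidence": 98.9},
        {"BlockType": "LINE", "Text": "Smudged text", "Confidence": 41.0},
    ]
}


def extract_lines(response, min_confidence=0.0):
    """Collect the text of LINE blocks at or above a confidence threshold (0-100)."""
    return [
        block["Text"]
        for block in response["Blocks"]
        if block["BlockType"] == "LINE"
        and block.get("Confidence", 0.0) >= min_confidence
    ]


print(extract_lines(sample_textract_response))
print(extract_lines(sample_textract_response, min_confidence=90.0))
```

Filtering on confidence is optional. It is shown here only as one way to drop lines that the service was unsure about before sending them on for entity detection.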
You can now send the lines you extracted from the image to Amazon Comprehend and use the DetectEntities API to find all entities within those lines. You'll need a function that iterates through the list of lines returned by the "process_text_detection" function you wrote earlier and calls the DetectEntities operation on every line.
def entity_detection(lines, access_code, secret_code, region):
    # Create a list to hold the entities found for every line
    response_entities = []
    # Connect to Amazon Comprehend
    comprehend = boto3.client(
        service_name="comprehend",
        region_name=region,
        aws_access_key_id=access_code,
        aws_secret_access_key=secret_code,
    )
    # Iterate through the lines in the list of lines
    for line in lines:
        # Call the DetectEntities operation and pass it a line from lines
        found_entities = comprehend.detect_entities(Text=line, LanguageCode="en")
        # Collect the text of every entity found in this line
        entities_list = [entity["Text"] for entity in found_entities["Entities"]]
        for entity_text in entities_list:
            print("Entities found:")
            print(entity_text)
        # Add all found entities for this line to the list of all entities found
        response_entities.append(entities_list)
    return response_entities
print("Calling DetectEntities:")
print("------")
response_ents = entity_detection(lines, access_key, secret_key, region_name)
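Each entity in a DetectEntities response carries more than its text, including a type and a score between 0 and 1. The sketch below reads those fields from a hand-written fragment shaped like such a response. The sample values are illustrative, not real Amazon Comprehend output, and `sample_comprehend_response`, `summarize_entities`, and the `min_score` parameter are hypothetical names that are not part of the original notebook.

```python
# Hand-written fragment shaped like a DetectEntities response.
# The values are illustrative, not real Amazon Comprehend output,
# and some fields (such as the character offsets) are omitted.
sample_comprehend_response = {
    "Entities": [
        {"Text": "Jane Doe", "Type": "PERSON", "Score": 0.99},
        {"Text": "Seattle", "Type": "LOCATION", "Score": 0.97},
        {"Text": "soon", "Type": "DATE", "Score": 0.42},
    ],
    "ResponseMetadata": {"HTTPStatusCode": 200},
}


def summarize_entities(response, min_score=0.0):
    """Return (text, type) pairs for entities at or above a score threshold (0-1)."""
    return [
        (entity["Text"], entity["Type"])
        for entity in response["Entities"]
        if entity["Score"] >= min_score
    ]


print(summarize_entities(sample_comprehend_response))
print(summarize_entities(sample_comprehend_response, min_score=0.9))
```

Keeping the entity type alongside the text makes it possible to tell a person from a location later on, which the text alone does not show.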
Now that you have a list of the lines extracted by Amazon Textract and the entities found in those lines, you can create a DataFrame that lets you see both. The code below constructs a pandas DataFrame that shows each line found in the input image next to its associated entities.
entities_dict = {"Lines": lines, "Entities": response_ents}
df = pd.DataFrame(entities_dict, columns=["Lines", "Entities"])
print(df)
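The table above holds a list of entities in each row. If one row per entity is easier to read, the pairs can be flattened first. The following is a minimal sketch in plain Python using made-up sample data; `one_row_per_entity`, `sample_lines`, and `sample_entities` are hypothetical names that are not part of the original notebook.

```python
def one_row_per_entity(lines, entities_per_line):
    """Pair each line with each of its entities, keeping lines that have none."""
    rows = []
    for line, entities in zip(lines, entities_per_line):
        if not entities:
            # Keep the line visible even when no entity was found in it
            rows.append({"Line": line, "Entity": None})
        for entity in entities:
            rows.append({"Line": line, "Entity": entity})
    return rows


# Illustrative data only, shaped like "lines" and "response_ents" above
sample_lines = ["Jane Doe moved to Seattle", "Thank you"]
sample_entities = [["Jane Doe", "Seattle"], []]

for row in one_row_per_entity(sample_lines, sample_entities):
    print(row)
```

The resulting list of dictionaries can be passed to `pd.DataFrame` in the same way as the dictionary used above.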