Important
You need a license for the DevExpress Office File API Subscription or DevExpress Universal Subscription to use these examples in production code.
The following code snippet uses the PdfDocumentProcessor.Text property to extract the text of a PDF file at runtime.
using DevExpress.Pdf;

string ExtractTextFromPDF(string filePath)
{
    string documentText = "";
    try
    {
        using (PdfDocumentProcessor documentProcessor = new PdfDocumentProcessor())
        {
            documentProcessor.LoadDocument(filePath);
            // The Text property returns the text of the entire document.
            documentText = documentProcessor.Text;
        }
    }
    catch
    {
        // Loading or extraction failed; fall through and return an empty string.
    }
    return documentText;
}
Imports DevExpress.Pdf

Private Function ExtractTextFromPDF(ByVal filePath As String) As String
    Dim documentText As String = ""
    Try
        Using documentProcessor As New PdfDocumentProcessor()
            documentProcessor.LoadDocument(filePath)
            ' The Text property returns the text of the entire document.
            documentText = documentProcessor.Text
        End Using
    Catch
        ' Loading or extraction failed; fall through and return an empty string.
    End Try
    Return documentText
End Function
Note
The PdfDocumentProcessor.Text property retrieves the content clipped to the crop box. Use the PdfDocumentProcessor.GetText method to get text without clipping. Set the PdfTextExtractionOptions.ClipToCropBox property to false and pass the PdfTextExtractionOptions object as a method parameter.
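The approach described in this note can be sketched as follows. This is a minimal illustration rather than an official sample: the helper name ExtractUnclippedText is hypothetical, and the sketch assumes the GetText overload that accepts only a PdfTextExtractionOptions object.

```csharp
using DevExpress.Pdf;

// Hypothetical helper: returns the text of the whole document,
// including content that lies outside the crop box.
string ExtractUnclippedText(string filePath)
{
    using (PdfDocumentProcessor processor = new PdfDocumentProcessor())
    {
        processor.LoadDocument(filePath);

        // Disable clipping so that text outside the crop box is included.
        PdfTextExtractionOptions options = new PdfTextExtractionOptions { ClipToCropBox = false };

        // Assumed overload: GetText(PdfTextExtractionOptions).
        return processor.GetText(options);
    }
}
```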
Use the PdfDocumentProcessor.GetPageText method to retrieve text from a specific page. The method returns the page text as a single string in which lines are separated by "\r\n". If the document does not contain the specified page, GetPageText returns an empty string.
The following code snippet extracts text from the first page without clipping:
using DevExpress.Pdf;

using (PdfDocumentProcessor pdfDocumentProcessor = new PdfDocumentProcessor())
{
    pdfDocumentProcessor.LoadDocument("PDF32000_2008.pdf");
    string firstPageText =
        pdfDocumentProcessor.GetPageText(1, new PdfTextExtractionOptions { ClipToCropBox = false });
}
Imports DevExpress.Pdf

' The Using block disposes of the processor, as in the C# version.
Using pdfDocumentProcessor As New PdfDocumentProcessor()
    pdfDocumentProcessor.LoadDocument("PDF32000_2008.pdf")
    Dim firstPageText As String =
        pdfDocumentProcessor.GetPageText(1, New PdfTextExtractionOptions With {
            .ClipToCropBox = False
        })
End Using
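Because GetPageText returns an empty string for a missing page and separates lines with "\r\n", the same call can be used to walk a document page by page. The sketch below is illustrative only. It assumes that page numbers start at 1 (as in the snippet above) and that the page count is available through the processor's Document.Pages collection.

```csharp
using System;
using DevExpress.Pdf;

using (PdfDocumentProcessor processor = new PdfDocumentProcessor())
{
    processor.LoadDocument("PDF32000_2008.pdf");

    // Assumption: Document.Pages exposes the pages of the loaded document.
    int pageCount = processor.Document.Pages.Count;

    for (int pageNumber = 1; pageNumber <= pageCount; pageNumber++)
    {
        string pageText = processor.GetPageText(pageNumber);

        // Split the page text on the "\r\n" separator and skip blank lines.
        string[] lines = pageText.Split(new[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries);

        Console.WriteLine($"Page {pageNumber}: {lines.Length} non-empty line(s)");
    }
}
```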
The PdfDocumentProcessor.GetText method can also retrieve text from a specific document area. To define the area, pass either two PdfDocumentPosition objects (the start and end positions) or a PdfDocumentArea instance.
The GetText method uses the page coordinate system. Refer to the following help topic for more details: Coordinate Systems.
The following code snippet extracts text between two positions on the first page:
using System;
using DevExpress.Pdf;

using (PdfDocumentProcessor processor = new PdfDocumentProcessor())
{
    processor.LoadDocument("TextExtraction.pdf");

    // Both positions are on page 1 and use page coordinates.
    PdfDocumentPosition startPosition = new PdfDocumentPosition(1, new PdfPoint(0, 0));
    PdfDocumentPosition endPosition = new PdfDocumentPosition(1, new PdfPoint(500, 500));

    string pageText =
        processor.GetText(startPosition, endPosition, new PdfTextExtractionOptions { ClipToCropBox = false });
    Console.WriteLine(pageText);
}
Imports System
Imports DevExpress.Pdf

Using processor As New PdfDocumentProcessor()
    processor.LoadDocument("TextExtraction.pdf")

    ' Both positions are on page 1 and use page coordinates.
    Dim startPosition As New PdfDocumentPosition(1, New PdfPoint(0, 0))
    Dim endPosition As New PdfDocumentPosition(1, New PdfPoint(500, 500))

    Dim pageText As String =
        processor.GetText(startPosition, endPosition, New PdfTextExtractionOptions With {.ClipToCropBox = False})
    Console.WriteLine(pageText)
End Using
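The PdfDocumentArea alternative mentioned earlier might look like the sketch below. Treat it as an assumption-laden illustration: it presumes a PdfDocumentArea constructor that takes a page number and a PdfRectangle, a PdfRectangle constructor with the argument order (left, bottom, right, top), and a GetText overload that accepts the area together with extraction options. Check these signatures against the API reference before you rely on them.

```csharp
using System;
using DevExpress.Pdf;

using (PdfDocumentProcessor processor = new PdfDocumentProcessor())
{
    processor.LoadDocument("TextExtraction.pdf");

    // Assumed argument order: PdfRectangle(left, bottom, right, top), in page coordinates.
    PdfRectangle rectangle = new PdfRectangle(0, 0, 500, 500);

    // Assumed constructor: PdfDocumentArea(pageNumber, rectangle); the area is on page 1.
    PdfDocumentArea area = new PdfDocumentArea(1, rectangle);

    string areaText =
        processor.GetText(area, new PdfTextExtractionOptions { ClipToCropBox = false });
    Console.WriteLine(areaText);
}
```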
See Also
How to: Extract Images from a Document with DevExpress PDF Document API