Extract Text From PDF in Java

In this article, we'll explain how to extract text from a PDF file in Java.

An overview of content:

Read/Extract All Text from a PDF
Read/Extract Text from a Specific Rectangle Area in a PDF Page
Read/Extract Text using SimpleTextExtractionStrategy

The PDF library we need:

Spire.PDF for Java

The example PDF file used in all of the samples below is Additional.pdf.

Sample Code

Imported Packages

import com.spire.pdf.*;
import com.spire.pdf.exporting.text.SimpleTextExtractionStrategy;
import java.awt.geom.Rectangle2D;
import java.io.*;

Read/Extract All Text from a PDF

//Instantiate a PdfDocument object
PdfDocument pdf = new PdfDocument();

//Load the PDF file
pdf.loadFromFile("Additional.pdf");

//Extract text from every page of the PDF
StringBuilder sb = new StringBuilder();
for (PdfPageBase page : (Iterable<PdfPageBase>) pdf.getPages()) {
    sb.append(page.extractText(true));
}

//Write the text into a .txt file; try-with-resources closes the writer
try (FileWriter writer = new FileWriter("ExtractText.txt")) {
    writer.write(sb.toString());
} catch (IOException e) {
    e.printStackTrace();
}

//Close the PdfDocument object
pdf.close();

Output: the text of every page is saved to ExtractText.txt.
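The loop above appends every page's text to one StringBuilder, so page boundaries are lost in the result. If you want to keep them, one option is to collect the text per page and join it with a marker line. The following is only a sketch: the class name PageTextJoiner, the method joinPages, and the marker format are hypothetical and not part of Spire.PDF, and the input strings stand in for the values returned by page.extractText(true).

```java
import java.util.Arrays;
import java.util.List;

public class PageTextJoiner {

    // Joins per-page text into one string, prefixing each page with a marker line.
    // The marker format is an illustrative choice, not something the PDF library produces.
    public static String joinPages(List<String> pageTexts) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < pageTexts.size(); i++) {
            sb.append("--- Page ").append(i + 1).append(" ---\n");
            // Trim leading/trailing whitespace so each page starts right after its marker
            sb.append(pageTexts.get(i).trim()).append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in strings; in the sample above these would come from page.extractText(true)
        List<String> pages = Arrays.asList("First page text", "Second page text");
        System.out.print(joinPages(pages));
    }
}
```

In the sample above, you would add each page's text to a list inside the for loop and pass that list to joinPages before writing the file.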

Read/Extract Text from a Specific Rectangle Area in a PDF Page

//Instantiate a PdfDocument object
PdfDocument pdf = new PdfDocument();

//Load the PDF file
pdf.loadFromFile("Additional.pdf");

//Get the first page of the PDF
PdfPageBase page = pdf.getPages().get(0);

//Instantiate a Rectangle2D object
Rectangle2D rect = new Rectangle2D.Float();

//Set location and size: x, y, width, height
rect.setFrame(50, 50, 500, 100);

//Extract text from the given rectangle area in the first page
StringBuilder sb = new StringBuilder();
sb.append(page.extractText(rect));

//Write the text into a .txt file; try-with-resources closes the writer
try (FileWriter writer = new FileWriter("ExtractText.txt")) {
    writer.write(sb.toString());
} catch (IOException e) {
    e.printStackTrace();
}

//Close the PdfDocument object
pdf.close();

Output: only the text inside the rectangle is saved to ExtractText.txt.
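The four arguments to setFrame are x, y, width, and height, so the rectangle above spans from x = 50 to x = 550 and from y = 50 to y = 150. Before extracting, it can help to check that the region actually lies on the page. The sketch below uses only java.awt.geom and assumes an A4 page of roughly 595 x 842 points; the class RegionCheck and the method fitsOnPage are hypothetical names. It checks the arithmetic only: whether y is measured from the top or the bottom of the page depends on the library and is not stated in this article.

```java
import java.awt.geom.Rectangle2D;

public class RegionCheck {

    // Approximate A4 page size in PDF points (1 point = 1/72 inch); an assumption for illustration
    static final double A4_WIDTH = 595.0;
    static final double A4_HEIGHT = 842.0;

    // Returns true when the region lies entirely inside a page of the given size
    public static boolean fitsOnPage(Rectangle2D region, double pageWidth, double pageHeight) {
        Rectangle2D page = new Rectangle2D.Double(0, 0, pageWidth, pageHeight);
        return page.contains(region);
    }

    public static void main(String[] args) {
        // Same rectangle as in the sample above: x, y, width, height
        Rectangle2D rect = new Rectangle2D.Float();
        rect.setFrame(50, 50, 500, 100);

        System.out.println("Right edge:  " + rect.getMaxX());
        System.out.println("Bottom edge: " + rect.getMaxY());
        System.out.println("Fits on A4:  " + fitsOnPage(rect, A4_WIDTH, A4_HEIGHT));
    }
}
```

For a real document, replace the assumed A4 dimensions with the actual size of the page you are reading.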

Read/Extract Text using SimpleTextExtractionStrategy

//Instantiate a PdfDocument object
PdfDocument pdf = new PdfDocument();

//Load the PDF file
pdf.loadFromFile("Additional.pdf");

//Get the first page of the PDF
PdfPageBase page = pdf.getPages().get(0);

//Extract text from the first page using SimpleTextExtractionStrategy
SimpleTextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
StringBuilder sb = new StringBuilder();
sb.append(page.extractText(strategy));

//Write the text into a .txt file; try-with-resources closes the writer
try (FileWriter writer = new FileWriter("ExtractText.txt")) {
    writer.write(sb.toString());
} catch (IOException e) {
    e.printStackTrace();
}

//Close the PdfDocument object
pdf.close();

Output: the text of the first page is saved to ExtractText.txt.
