Extract Text and Images from Word Document in Java


In this post, I’ll introduce an easy way to extract text and images from a Word document in a Java application.

Required Library

Free Spire.Doc for Java

Before using the code below, download Free Spire.Doc for Java and import the Spire.Doc.jar file into your project. For a Maven project, you can refer to this online tutorial to install Free Spire.Doc for Java from the Maven repository.

Extract Text

Free Spire.Doc for Java provides a getText method in the Document class, which returns the text of a Word document as a string.

import com.spire.doc.Document;

import java.io.FileWriter;
import java.io.IOException;

public class ReadText {
    public static void main(String[] args) throws IOException {
        // Load the Word document
        Document document = new Document();
        document.loadFromFile("C:\\Users\\Administrator\\Desktop\\sample.docx");

        // Get the text of the whole document as a single string
        String text = document.getText();

        // Write the string to a .txt file
        writeStringToTxt(text, "ExtractedText.txt");
    }

    public static void writeStringToTxt(String content, String txtFileName) throws IOException {
        // try-with-resources flushes and closes the writer, even if write() fails;
        // the file is overwritten on each run instead of being appended to
        try (FileWriter fWriter = new FileWriter(txtFileName)) {
            fWriter.write(content);
        }
    }
}
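
One thing to watch when saving the extracted text: FileWriter uses the platform default charset on Java versions before 18, so accented or non-Latin characters may not come out the same on every machine. Here is a minimal sketch of writing the string as UTF-8 with the standard java.nio.file API instead. The class name TextFileSketch and the method name writeUtf8 are my own, and the sample string only stands in for the value returned by document.getText().

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TextFileSketch {

    // Writes content to the given path as UTF-8, replacing any existing file.
    public static void writeUtf8(String content, Path target) throws IOException {
        Files.write(target, content.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the string returned by document.getText()
        String text = "First paragraph\r\nSecond paragraph: caf\u00e9";

        Path target = Paths.get("ExtractedText.txt");
        writeUtf8(text, target);

        // Read the file back with the same charset to check the round trip
        String readBack = new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
        System.out.println(readBack.equals(text)); // prints true
    }
}

In the program above, you could call writeUtf8(text, Paths.get("ExtractedText.txt")) in place of writeStringToTxt.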

Extract Images

Extracting images is a little more complicated than extracting text. We need to loop through the objects in the document, find the image objects and then extract them.
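
The loop itself is an ordinary breadth-first walk over a tree, driven by a queue. To show the pattern on its own, here is a small sketch that runs without the library. The Node class and the collect method are hypothetical stand-ins that I made up for illustration; they are not Spire.Doc types.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

public class TraversalSketch {

    // Hypothetical stand-in for a document object; not a Spire.Doc type.
    public static class Node {
        final String kind;
        final List<Node> children;

        public Node(String kind, Node... children) {
            this.kind = kind;
            this.children = Arrays.asList(children);
        }
    }

    // Breadth-first walk that collects every descendant of the given kind.
    public static List<Node> collect(Node root, String kind) {
        List<Node> found = new ArrayList<>();
        Queue<Node> pending = new ArrayDeque<>();
        pending.add(root);

        while (!pending.isEmpty()) {
            Node node = pending.poll();
            for (Node child : node.children) {
                if (child.kind.equals(kind)) {
                    found.add(child);
                }
                // Queue the child so its own children are visited later
                pending.add(child);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // A tiny tree shaped like a document: one picture in a paragraph,
        // another one nested inside a table
        Node doc = new Node("document",
                new Node("section",
                        new Node("paragraph", new Node("picture"), new Node("text")),
                        new Node("table",
                                new Node("paragraph", new Node("picture")))));

        System.out.println(collect(doc, "picture").size()); // prints 2
    }
}

The full program below follows the same shape, with ICompositeObject as the node that has children and DocPicture as the object we collect.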

import com.spire.doc.Document;
import com.spire.doc.documents.DocumentObjectType;
import com.spire.doc.fields.DocPicture;
import com.spire.doc.interfaces.ICompositeObject;
import com.spire.doc.interfaces.IDocumentObject;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;

public class ReadTextAndImages {
    public static void main(String[] args) throws IOException {
        // Load the Word document
        Document document = new Document();
        document.loadFromFile("C:\\Users\\Administrator\\Desktop\\sample.docx");

        // Queue of composite objects whose children still need to be visited
        Queue<ICompositeObject> nodes = new LinkedList<>();
        nodes.add(document);

        // Images collected during the traversal
        List<BufferedImage> images = new ArrayList<>();

        // Visit the document tree level by level
        while (!nodes.isEmpty()) {
            ICompositeObject node = nodes.poll();

            for (int i = 0; i < node.getChildObjects().getCount(); i++) {
                IDocumentObject child = node.getChildObjects().get(i);
                if (child instanceof ICompositeObject) {
                    nodes.add((ICompositeObject) child);

                    // Collect the image when the child is a picture
                    if (child.getDocumentObjectType() == DocumentObjectType.Picture) {
                        DocPicture picture = (DocPicture) child;
                        images.add(picture.getImage());
                    }
                }
            }
        }

        // Make sure the output folder exists, then save each image as a PNG file
        new File("output").mkdirs();
        for (int i = 0; i < images.size(); i++) {
            File file = new File(String.format("output/extractImageAndText-%d.png", i));
            ImageIO.write(images.get(i), "PNG", file);
        }
    }
}

