9/28/2017 Code to extract plain text from a PDF file - CodeProject

Code to extract plain text from a PDF file


NeWi, 21 Jun 2004

4.87 (72 votes)

Source code that shows how to decompress and extract text from PDF documents.

Download source files - 3.18 Kb


Download setup - 381 Kb

Introduction
PDF documents are commonly used and their content is usually compressed. This article presents simple C code that extracts plain text from a PDF file.

Why?
Adobe does allow you to submit PDF files and will extract the text or HTML and mail it back to you. But there are times when you need to extract the text yourself or do it inside an application. You may also want to apply special formatting (e.g., add tabs) so that the text can be easily imported into Excel, for example when your PDF document mostly contains tables that you need to port to Excel (which is how this code came to be developed).

There are several projects on "The Code Project" that show how to create PDF documents, but none that provide free code that shows how to extract text without using a commercial library.
In the reader comments, a need was expressed for code just like what is being supplied here.

There are several libraries out there that read or create PDF files, but you have to register them for commercial use or sign various agreements. The code supplied here is very simple and basic, but it is entirely free. It only uses the ZLIB library, which is also free.

Basics
You can download documents such as PDFReference15_v5.pdf that explain some of the inner workings of PDF files. In short, each PDF file contains a number of objects. Each object may require one or more filters to decompress it and may also provide a stream of data. Text streams are usually compressed using the FlateDecode filter and may be uncompressed using code from the ZLIB (http://www.zlib.org/) library.

The data for each object can be found between the "stream" and "endstream" keywords. Once inflated, the data needs to be processed to extract the text. The data usually contains one or more text objects (starting with BT and ending with ET) with formatting instructions inside. You can learn a lot about the structure of a PDF file by stepping through this application.

About Code
This single source file contains very simple, very basic C code. It first reads the entire PDF file into one buffer and then repeatedly scans for "stream" and "endstream" sections. It does not check which filter should be applied and always assumes FlateDecode. (If it gets it wrong, usually no output is generated for that section of the file, so it is not a big issue.) Once the data stream is inflated (uncompressed), it is processed. During processing, the code searches for the BT and ET tokens that mark text objects. The contents of each text object are processed to extract the text, and a guess is made as to whether tabs or newline characters are needed.

The code is far from complete and is not any sort of general utility class, but it does demonstrate how you can extract the text yourself. It is enough to show you how and get you going.

The code is, however, fully functional: when it is applied to a PDF document, it generally does a fair job of extracting the text. It has been tested on several PDF files.

This code is supplied as is, no warranties. Use at your own risk.

Using The Code


The download contains one C file. To use it, create a simple Win32 Console project and add the pdf.c file to the project. You also need to download the free "zlib compiled DLL" zip file (bless its providers!). Extract zdll.lib to your project directory and add it as a project dependency (link against it). Also put zlib1.dll, zconf.h and zlib.h in your project directory, and add the two header files to the project.

Now, step through the application and note that the input PDF and output text file names are hard-coded at the start of the main function.

Future Enhancements
If there is enough interest, the author may consider uploading a release version with a Windows interface. The code is quite good at extracting data from tables in a form that can be readily imported into Excel, with the columns preserved (because of the tabs that get added).

Code Snippets
https://www.codeproject.com/Articles/7056/Code-to-extract-plain-text-from-a-PDF-file
Stream sections are initially located using:

size_t streamstart = FindStringInBuffer (buffer, "stream", filelen);
size_t streamend = FindStringInBuffer (buffer, "endstream", filelen);

Once the data portion is identified, it is inflated as follows:

z_stream zstrm;
ZeroMemory(&zstrm, sizeof(zstrm));
zstrm.avail_in = streamend - streamstart + 1;
zstrm.avail_out = outsize;
zstrm.next_in = (Bytef*)(buffer + streamstart);
zstrm.next_out = (Bytef*)output;
int rst1 = inflateInit(&zstrm);
if (rst1 == Z_OK)
{
    int rst2 = inflate(&zstrm, Z_FINISH);
    if (rst2 >= 0)
    {
        //Ok, got something, extract the text:
        size_t totout = zstrm.total_out;
        ProcessOutput(fileo, output, totout);
    }
    //Release the state allocated by inflateInit:
    inflateEnd(&zstrm);
}

The main work gets done in the ProcessOutput function, which processes the uncompressed stream to extract the text portion of each text object. It looks as follows:

void ProcessOutput(FILE* file, char* output, size_t len)
{
    //Are we currently inside a text object?
    bool intextobject = false;
    //Is the next character literal
    //(e.g. \\ to get a \ character or \( to get ( ):
    bool nextliteral = false;

    //() Bracket nesting level. Text appears inside ()
    int rbdepth = 0;

    //Keep previous chars to extract numbers etc.:
    char oc[oldchar];
    int j=0;
    for (j=0; j<oldchar; j++) oc[j]=' ';

    for (size_t i=0; i<len; i++)
    {
        char c = output[i];
        if (intextobject)
        {
            if (rbdepth==0 && seen2("TD", oc))
            {
                //Positioning.
                //See if a new line has to start or just a tab:
                float num = ExtractNumber(oc,oldchar-5);
                if (num>1.0)
                {
                    fputc(0x0d, file);
                    fputc(0x0a, file);
                }
                if (num<1.0)
                {
                    fputc('\t', file);
                }
            }
            if (rbdepth==0 && seen2("ET", oc))
            {
                //End of a text object, also go to a new line.
                intextobject = false;
                fputc(0x0d, file);
                fputc(0x0a, file);
            }
            else if (c=='(' && rbdepth==0 && !nextliteral)
            {
                //Start outputting text!
                rbdepth=1;
                //See if a space or tab (>1000) is called for by looking
                //at the number in front of (
                float num = ExtractNumber(oc,oldchar-1);
                if (num>0)
                {
                    if (num>1000.0)
                    {
                        fputc('\t', file);
                    }
                    else if (num>100.0)
                    {
                        fputc(' ', file);
                    }
                }
            }
            else if (c==')' && rbdepth==1 && !nextliteral)
            {
                //Stop outputting text
                rbdepth=0;
            }
            else if (rbdepth==1)
            {
                //Just a normal text character:
                if (c=='\\' && !nextliteral)
                {
                    //Only print out next character
                    //no matter what. Do not interpret.
                    nextliteral = true;
                }
                else
                {
                    nextliteral = false;
                    //Compare as unsigned so that bytes 128..254 are
                    //not seen as negative where char is signed:
                    unsigned char uc = (unsigned char)c;
                    if ( ((uc>=' ') && (uc<='~')) || ((uc>=128) && (uc<255)) )
                    {
                        fputc(c, file);
                    }
                }
            }
        }
        //Store the recent characters for
        //when we have to go back for a number:
        for (j=0; j<oldchar-1; j++) oc[j]=oc[j+1];
        oc[oldchar-1]=c;
        if (!intextobject)
        {
            if (seen2("BT", oc))
            {
                //Start of a text object:
                intextobject = true;
            }
        }
    }
}

License
This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board
below.


About the Author


NeWi No Biography provided
Web Developer
Canada


Article Copyright 2004 by NeWi. Last updated 22 Jun 2004. Everything else Copyright CodeProject, 1999-2017.