PDF Parsing

gal kahana edited this page May 12, 2023 · 10 revisions

As a by-product of introducing PDF page appending and embedding (see PDF Embedding for details), a PDF parser was created. Since it may be of interest in its own right, I'm describing it here.

Overview

In the library, parsing a PDF occurs in the process of embedding its pages. The class used for parsing a PDF file is named PDFParser. You can also use it completely independently of the rest of the library.

A rather modest parser, it initially reads the object table (xref) into memory, along with the trailer dictionary and a list of the page object IDs. This keeps its memory footprint very low. From this point on, you use the parser by querying it for objects by their PDF object IDs.

Each time, the parser allocates and returns a parsed object in which any indirect object references are left uninterpreted. If you want the object that a reference points to, you must ask for it specifically. This is a bit cumbersome, but it gives tight control over what exactly is kept in memory at any given point, so in general only a limited part of the PDF is actually loaded. Parsing really large files is therefore not a problem.
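The on-demand model described above can be pictured with a small sketch. It does not use the library: SketchObject, SketchTable and resolve are hypothetical stand-ins, and the only point is that a reference stays a reference until the caller explicitly asks for its target.

```cpp
#include <map>
#include <string>

// Hypothetical stand-ins, not the library's classes.
struct SketchObject
{
    bool isReference;        // true for an uninterpreted indirect reference
    unsigned long targetID;  // meaningful only when isReference is true
    std::string text;        // payload of a direct object
};

typedef std::map<unsigned long, SketchObject> SketchTable; // object ID -> object

// Follow a reference only when asked to. Direct objects are returned as-is,
// and an unknown object ID yields nullptr.
const SketchObject* resolve(const SketchTable& table, const SketchObject& object)
{
    if (!object.isReference)
        return &object;
    SketchTable::const_iterator it = table.find(object.targetID);
    return it == table.end() ? nullptr : &it->second;
}
```

In the real parser the lookup step corresponds to calling ParseNewObject with the ID taken from the reference object.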

For parsed objects, the parser returns an object that is always derived from PDFObject. A PDFObject always has a type member, which tells you which PDF object it actually is. By checking this type you can then cast it to the actual type. For simplicity, there's a smart pointer implementation that automatically converts an object to a specific type.
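As an illustration of the "check the type, then cast" idea, here is a minimal sketch. All names in it (SketchBase, SketchInteger, SketchName, castIfType) are hypothetical and only mirror the pattern. The library's own classes and its casting smart pointer are the real thing to use.

```cpp
#include <string>

// Hypothetical stand-ins that mirror the idea, not the library's classes.
enum SketchType { eSketchInteger, eSketchName };

struct SketchBase
{
    explicit SketchBase(SketchType inType) : type(inType) {}
    virtual ~SketchBase() {}
    SketchType type; // tells the caller which concrete class this is
};

struct SketchInteger : SketchBase
{
    static const SketchType kType = eSketchInteger;
    explicit SketchInteger(long long inValue) : SketchBase(kType), value(inValue) {}
    long long value;
};

struct SketchName : SketchBase
{
    static const SketchType kType = eSketchName;
    explicit SketchName(const std::string& inValue) : SketchBase(kType), value(inValue) {}
    std::string value;
};

// Check the type tag first and cast only on a match; nullptr otherwise.
template <typename T>
T* castIfType(SketchBase* inObject)
{
    if (inObject == nullptr || inObject->type != T::kType)
        return nullptr;
    return static_cast<T*>(inObject);
}
```

A null result covers both failure cases at once: no object at all, or an object of another type.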

Sample Usage

A usage example can be found in PDF Parser Test. We'll review some of it here.

To initialize the parser with a file do the following:

InputFile pdfFile;
PDFParser parser;

pdfFile.OpenFile("C:\\Test.PDF");
parser.StartPDFParsing(pdfFile.GetInputStream());

The parser is initialized with an object of type IByteReaderWithPosition, which is a stream implementation (you can read more about streams in the library in IO). For files, just use the InputFile class to open the file, and then get its input stream with GetInputStream, as the example shows. At this point the parser will parse the xref as well as the file trailer object (for PDFs with incremental changes, as well as linearized PDFs, it will read all of them). This should not take long, and afterwards the parser is ready to be queried.

After parsing, you can use the PDFParser methods to access information about the PDF. For example, parser.GetPDFLevel() returns the PDF level stated in the file header (!). The page count may be retrieved with parser.GetPagesCount(), and you can get an individual page's dictionary object by calling parser.ParsePage(pageIndex). parser.GetTrailer() gets you the trailer dictionary, which naturally leads to the Catalog (root) object of the PDF. From there you can iterate the whole file using the very important method parser.ParseNewObject(anObjectID).

That's pretty much it for the usage.

Parsed objects

The returned objects from any of the Parsing methods may be one of the following classes (each representing a PDF Object):

  1. PDFArray - represents an array. The class enables querying for objects by index and iterating over all array objects.
  2. PDFBoolean - represents a boolean object. You can retrieve its boolean value with the GetValue function.
  3. PDFDictionary - represents a dictionary. You can query for objects by name, check whether they exist, and iterate over the keys/values.
  4. PDFHexString - represents a hexadecimal string. Its value is an array (represented as a string) of the bytes encoded by the pairs of hexadecimal digits.
  5. PDFIndirectObjectReference - represents an indirect object reference. Yes, it's an object in its own right, and its existence is the core point of not having objects interpreted all the way through the PDF. You can use its mObjectID member to get the pointed object ID, probably for use with PDFParser::ParseNewObject(objectID). It also has mVersion, which holds the version of the object, but with the current parser philosophy (giving the most up-to-date object for the given ID at all times), it's probably irrelevant.
  6. PDFInteger - represents an integer value. Note that an integer value is any number that does not have a fraction...even if it potentially could have one. For instance, when in the context of a matrix you see "1", then although a matrix normally holds reals, this particular value will be an integer.
  7. PDFLiteralString - represents a literal string. Note that all originally escaped characters will be interpreted. For example, what's written in the string as '\n' will simply be available as a byte of value 10 (0xA).
  8. PDFName - represents a name. The value will have the initial slash removed, and all escaped characters (via usage of the # character) interpreted.
  9. PDFNull - represents a null object, declared with the null keyword.
  10. PDFReal - represents a real number. Any number that has a fraction.
  11. PDFStreamInput - (NOT to be confused with PDFStream) represents a stream. You can use GetStreamContentStart to get the position where the actual stream content starts, or QueryStreamDictionary to retrieve the stream dictionary.
  12. PDFSymbol - anything that is not one of the above. Normally PDF keywords (like obj) are recognized as such objects, unless they are handled as something else (like null or R).
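To make the PDFName description concrete, here is a minimal sketch of the kind of interpretation a name token goes through: the leading slash is dropped and each # followed by two hexadecimal digits becomes a single byte. decodeNameToken is a hypothetical helper written for this page. It is not the library's code and makes no claim about how the parser implements this internally.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the library.
// Turns a raw name token such as "/A#20B" into its value "A B".
std::string decodeNameToken(const std::string& inToken)
{
    std::string result;
    // skip the initial slash, if present
    std::size_t i = (!inToken.empty() && inToken[0] == '/') ? 1 : 0;
    while (i < inToken.size())
    {
        if (inToken[i] == '#' && i + 2 < inToken.size()
            && std::isxdigit(static_cast<unsigned char>(inToken[i + 1]))
            && std::isxdigit(static_cast<unsigned char>(inToken[i + 2])))
        {
            // two hex digits after '#' encode one byte
            result += static_cast<char>(std::stoi(inToken.substr(i + 1, 2), nullptr, 16));
            i += 3;
        }
        else
        {
            // ordinary character (or an incomplete escape): copy as-is
            result += inToken[i];
            ++i;
        }
    }
    return result;
}
```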

Most objects here that have a single value return it via the GetValue function. For the array, stream and dictionary, consult the headers (or the usage sample). They should be simple to use. All objects derive from PDFObject, which provides a GetType function returning an EPDFObjectType enumerator that tells you the type of the object.

See below on Parsing Helpers for some classes that can assist in retrieving values from the parsed objects.

PDFObjectCastPtr

When using the parser methods to retrieve objects while expecting a specific type (as in the case of a page, when you know you should be getting a dictionary), consider using a useful smart pointer called PDFObjectCastPtr. For example:

PDFObjectCastPtr<PDFDictionary> myObjectSmartPtr(parser.ParseNewObject(13));

if(!myObjectSmartPtr)
{
 // fail! expecting a dictionary
}

// use the object through the pointer, to get to the pointed objects members.
bool rootExists = myObjectSmartPtr->Exists("Root");

In this example the PDFObjectCastPtr template argument is PDFDictionary. When initialized with a PDFObject retrieved from ParseNewObject, it checks whether this is actually a dictionary, and if so holds a reference to it, cast to that type. You can now use the pointer operator to get to the object's members. Note that if the returned object is null (for example, because the object ID does not exist in this PDF) or if it's of the wrong type, the smart pointer will be null. To get the actual pointer, use the smart pointer's GetPtr method.

Ownership, Reference Counting and RefCountPtr

While most of the library is very clear on ownership (for instance, the user has to delete page objects), with the parser it is a bit more complicated. We will see later that, using the parser, you can get either an already existing indirect object reference or a newly created pointed object. In such a case it's unclear whether you should free the object after use, because it's not clear whether a new object was created or not.

For this purpose I sidestepped the ownership problem by using simple reference counting. The PDFObject class derives from RefCountObject, which has AddRef and Release functions and holds a reference counter. No new tricks here: when the reference count hits zero, the object frees itself. When created, the reference count is initialized to 1.

Now, when should you use AddRef and Release? Simple. If you got the object through a method whose name starts with Get, you should call AddRef when you start using it, and Release when you finish. If you got it from a function with another prefix (such as ParseNewObject, or anything that starts with Query), you can assume that AddRef was already called, and you should only call Release when done using the object.

RefCountPtr

How very unexpected: I wrote a smart pointer implementation to spare you all this nuisance. But you have to pay attention to the rules, so read carefully. The smart pointer is implemented in RefCountPtr. Note that PDFObjectCastPtr derives from it, so it will also take care of reference counting.

Let's get back to the earlier sample, using PDFObjectCastPtr:

PDFObjectCastPtr<PDFDictionary> myObjectSmartPtr(parser.ParseNewObject(13));

In a similar manner we could have used:

RefCountPtr<PDFObject> myObjectSmartPtr(parser.ParseNewObject(13));

(Note that here PDFObject is used as the template argument, because that's what the ParseNewObject method returns.) When initialized in that manner, the destructor of the smart pointer will call Release. It will not call AddRef initially, because this method is NOT a "Get" function.

In fact, to use the smart pointers with a "Get" function, you must actually use the assignment operator, and not initialization, as here:

RefCountPtr<PDFObject> theTrailerSmartPtr;

theTrailerSmartPtr = parser.GetTrailer();

When used in this way (not initialization, but rather assignment) the smart pointer will call AddRef when assigned, and Release when destroyed.

If you pay attention to these two ways of using the pointer...you should be fine.

Very important! This is an initialization, not an assignment:

RefCountPtr<PDFObject> theTrailerSmartPtr = parser.GetTrailer(); // NOT GOOD!!!!

When the equals sign is used in the declaration statement, the constructor is called, not the assignment operator. Be careful. You can use the smart pointer this way, but then you'll have to call theTrailerSmartPtr->AddRef() yourself...so what's the point.
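The difference between the two forms can be seen in a small model of the rules. This is only a sketch: SketchCounted and SketchPtr are hypothetical stand-ins, not RefCountObject and RefCountPtr, and Release here only decrements the counter (the real object frees itself at zero) so that the count can be inspected afterwards.

```cpp
// Hypothetical stand-ins that model the rules; not the library's classes.
struct SketchCounted
{
    SketchCounted() : refCount(1) {}  // created with one reference
    void AddRef() { ++refCount; }
    void Release() { --refCount; }    // sketch only: no self-deletion at zero
    int refCount;
};

class SketchPtr
{
public:
    SketchPtr() : mValue(nullptr) {}

    // Construction adopts the caller's reference: no AddRef here.
    SketchPtr(SketchCounted* inValue) : mValue(inValue) {}

    // Copying a smart pointer shares the object, so it takes a reference.
    SketchPtr(const SketchPtr& inOther) : mValue(inOther.mValue)
    {
        if (mValue) mValue->AddRef();
    }

    // Assignment from a raw pointer takes a new reference: AddRef here.
    SketchPtr& operator=(SketchCounted* inValue)
    {
        if (inValue) inValue->AddRef();
        if (mValue) mValue->Release();
        mValue = inValue;
        return *this;
    }

    SketchPtr& operator=(const SketchPtr& inOther) { return *this = inOther.mValue; }

    // Destruction always gives one reference back.
    ~SketchPtr() { if (mValue) mValue->Release(); }

private:
    SketchCounted* mValue;
};
```

With this model, constructing from an object that arrived with its own reference (the Parse/Query case) balances out on destruction, and assigning an object owned elsewhere (the Get case) also balances out. Constructing from an object owned elsewhere releases one reference more than was taken, which is exactly the mistake the warning above is about.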

Reference for PDFParser functions

List of PDFParser functions (those that you should use ;)). For each one, it is specified whether you should call AddRef or not. Prefer using the smart pointers to do this job for you:

  1. EStatusCode StartPDFParsing(IByteReaderWithPosition* inSourceStream) The first method to use. Initializes parsing for the PDF represented by the input stream. Returns a status code.
  2. double GetPDFLevel() Returns the version defined in the PDF file header. Note that from version 1.4 of PDF the Version entry of the catalog (reached through the trailer) should be consulted as well.
  3. PDFDictionary* GetTrailer() Returns the trailer dictionary of the PDF. You should call AddRef.
  4. PDFObject* ParseNewObject(ObjectIDType inObjectId) Creates and parses a new object for the input object ID. In case of a compound object note that all contained direct objects will be available, but indirect references will remain uninterpreted, represented by a PDFIndirectObjectReference object. Returned value does not require calling AddRef.
  5. ObjectIDType GetObjectsCount() Returns the total objects count for the PDF.
  6. PDFObject* QueryDictionaryObject(PDFDictionary* inDictionary,const string& inName) A very useful function for using dictionaries. Returns the object pointed to by the inName parameter. If it's an indirect object reference, it will go on to parse the pointed object and return it. Very good when you don't want the hassle of having to figure out whether the contained object is direct or not. I normally much prefer it to the matching PDFDictionary method, which brings just the direct object or the indirect reference object. No need to call AddRef.
  7. PDFObject* QueryArrayObject(PDFArray* inArray,unsigned long inIndex) Similar to QueryDictionaryObject but for arrays. Returns the object at the inIndex position. If the object is indirect, it will parse it and return it. No need to call AddRef.
  8. unsigned long GetPagesCount() Get the total pages count.
  9. PDFDictionary* ParsePage(unsigned long inPageIndex) Parse and return the Page dictionary (no need for casting :) ) for the given page index. No need to call AddRef.
  10. ObjectIDType GetPageObjectID(unsigned long inPageIndex) Get a Page dictionary object ID. This is good if you want to match a page index with a page object ID, normally for mapping page indexes to references to this page. This happens. Really. For instance, to get the index of a page referenced by a PDF/VT file DPart structure.
  11. IByteReader* CreateInputStreamReader(PDFStreamInput* inStream) A good function to use when reading the content of streams. It builds an IByteReader object with filtered reading, so you can just read the bytes until the stream ends, and don't have to think about decoding or about limiting the read. Note that the library supports Flate (with all its @#$@ predictors), Ascii85 and DCT decoding for streams used like this. For other filters, you're on your own (you can extend the library to support more, see below). When done with the reader, delete it - your code owns it. Important! In order to read the stream you need both to use this method to create a filtered reader AND to move the base stream (the file stream or memory stream) to the position where the stream content begins, as provided by the GetStreamContentStart method of PDFStreamInput. When using the parser in the context of a PDFDocumentCopyingContext, you can get this base stream via its GetSourceDocumentStream method. Positioning is left to you because you may not necessarily want to read the stream right away. If you do, prefer using the next method, StartReadingFromStream.
  12. IByteReader* StartReadingFromStream(PDFStreamInput* inStream) This method is similar to CreateInputStreamReader, however it also positions the base stream for reading. This is very useful when you want to create a stream reader and start reading right after it.
  13. void ResetParser() Explicitly reset the parser. Good if you want to reuse for parsing another file.
  14. void SetParserExtender(IPDFParserExtender* inParserExtender) Set an object that implements extensions for the class, see later section for details. (There are two other methods...GetObjectParser and StartStateFileParsing. They are internal).
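The difference between CreateInputStreamReader and StartReadingFromStream, as described in the list above, comes down to who positions the base stream. The following sketch models just that point with hypothetical stand-ins (SketchSource, SketchStreamInfo, SketchReader, createReader, startReading). It leaves out filtering and error handling entirely and is not the library's stream code.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-ins, not the library's stream classes.
struct SketchSource            // plays the role of the base file or memory stream
{
    std::string bytes;
    std::size_t position;
};

struct SketchStreamInfo        // plays the role of the parsed stream object
{
    std::size_t contentStart;  // where the stream body begins in the source
    std::size_t length;        // number of body bytes
};

// Reads at most 'remaining' bytes, starting at the current source position.
struct SketchReader
{
    SketchSource* source;
    std::size_t remaining;

    std::string readAll()
    {
        std::string out = source->bytes.substr(source->position, remaining);
        source->position += out.size();
        remaining -= out.size();
        return out;
    }
};

// Models the first method: builds the reader only, the caller must seek.
SketchReader createReader(SketchSource& source, const SketchStreamInfo& info)
{
    SketchReader reader = { &source, info.length };
    return reader;
}

// Models the second method: also positions the source at the body start.
SketchReader startReading(SketchSource& source, const SketchStreamInfo& info)
{
    source.position = info.contentStart;
    return createReader(source, info);
}
```

Reading through a reader from createReader without seeking first returns whatever bytes happen to sit at the current position, which is why the positioning step matters.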

Parsing Helpers

There are two helper classes to assist in retrieving values from parsed objects. One for retrieving page related values, and the other for retrieving primitives (number, string) values from the parsed objects.

Getting Primitive Values

Sometimes you don't want to dig into a parsed object's type in order to get its numeric value, casting when necessary according to its actual type. You can use the ParsedPrimitiveHelper class to avoid this.

For example:

PDFObject* myPDFObject; 

double aDoubleValue = ParsedPrimitiveHelper(myPDFObject).GetAsDouble();
long long aLongLongValue= ParsedPrimitiveHelper(myPDFObject).GetAsInteger();

The code shows how to use ParsedPrimitiveHelper class in order to quickly retrieve Double or Integer values using GetAsDouble and GetAsInteger methods. You can also use it to determine whether the object represents a number by using the IsNumber method, or get the text of the object as string, by calling ToString.

Getting Page values

To easily retrieve the box values of a page you can use the higher level PDFPageInput class. You can use this class directly on the return value of the ParsePage method of the PDFParser class.

For instance consider the following:

PDFPageInput pageInput(myParser,myParser->ParsePage(0));

PDFRectangle mediaBox =  pageInput.GetMediaBox();

The example creates a PDFPageInput wrapper for the first page (index 0). Then it gets its media box.

PDFPageInput has these methods:

  1. GetMediaBox - retrieves the page media box.
  2. GetCropBox - retrieves the page crop box. Default is media box, in case one is not defined.
  3. GetTrimBox - retrieves the page trim box. Default is crop box, in case one is not defined.
  4. GetBleedBox - retrieves the page bleed box. Default is crop box, in case one is not defined.
  5. GetArtBox - retrieves the page art box. Default is crop box, in case one is not defined.

All methods look for inherited values from the PDF page tree in case there's no direct definition on the page object, so you should always get the actual box for the page.
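The default chain and the inheritance lookup described above can be sketched as follows. The types and functions here (SketchBox, SketchPageNode, findInherited and the three getters) are hypothetical and exist only to show the order of fallbacks. PDFPageInput is what you would actually call.

```cpp
// Hypothetical stand-ins, not the library's classes.
struct SketchBox { double left, bottom, right, top; };

struct SketchPageNode
{
    const SketchPageNode* parent;  // nullptr at the root of the page tree
    const SketchBox* mediaBox;     // nullptr when this node does not define it
    const SketchBox* cropBox;
    const SketchBox* trimBox;
};

// Walk up the page tree until some node defines the requested box.
const SketchBox* findInherited(const SketchPageNode* node,
                               const SketchBox* SketchPageNode::*member)
{
    for (; node != nullptr; node = node->parent)
        if (node->*member != nullptr)
            return node->*member;
    return nullptr;
}

const SketchBox* getMediaBox(const SketchPageNode& page)
{
    return findInherited(&page, &SketchPageNode::mediaBox);
}

// The crop box falls back to the media box.
const SketchBox* getCropBox(const SketchPageNode& page)
{
    const SketchBox* box = findInherited(&page, &SketchPageNode::cropBox);
    return box ? box : getMediaBox(page);
}

// The trim box falls back to the crop box (bleed and art boxes would do the same).
const SketchBox* getTrimBox(const SketchPageNode& page)
{
    const SketchBox* box = findInherited(&page, &SketchPageNode::trimBox);
    return box ? box : getCropBox(page);
}
```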

Extending the parser

The IPDFParserExtender class allows users to extend the parser capabilities in two areas:

  1. Support for more filters. The library currently supports Ascii85, DCT and Flate filters for stream decoding.
  2. Support for encrypted PDFs.

To add capabilities in these areas, implement the interface and set it on the parser using its SetParserExtender method. Note that parsers may also be used in the context of embedding PDFs. To set an extender for all copying activities, use the SetParserExtender method of the DocumentContext object of the PDFWriter instance that you use.

OK, so how to extend:

More filters

Out of the box, the parser supports Flate decoding, Ascii85 decoding and DCT decoding.
To support more filters, implement this method of IPDFParserExtender:

virtual IByteReader* CreateFilterForStream(IByteReader* inStream,
                                           PDFName* inFilterName,
                                           PDFDictionary* inDecodeParams);

The method should return an implementation of IByteReader that provides the decoding for the specific filter. This method accepts 3 parameters:

  1. inStream - the stream to apply the filter on. You should be using this stream as a source for the new filter stream created with this method.
  2. inFilterName - a PDF name object with the filter name
  3. inDecodeParams - a PDF dictionary object with the parameters for the filter

If you support the filter, return a new IByteReader object that implements this filter, with inStream as the source (owned!) stream for the filter stream. If you don't support the filter, simply return inStream.

Important note! There's a slight inconvenience if you don't also support encryption: you have to implement the following methods as well:

virtual bool DoesSupportEncryption();

If you don't support encryption, implement it to return false.

virtual IByteReader* CreateDecryptionFilterForStream(IByteReader* inStream);

If you don't support encryption, implement it to return inStream.

virtual string DecryptString(string inStringToDecrypt);

If you don't support encryption, implement it to return inStringToDecrypt.
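Put together, a filter-only extender has roughly the following shape. The sketch is self-contained, so every type in it is a hypothetical stand-in: SketchByteReader, SketchFilterName, SketchDecodeParams and SketchParserExtender replace the library's IByteReader, PDFName, PDFDictionary and IPDFParserExtender, and the stand-in interface lists only the four methods discussed in this section (the real interface may declare more). RunLengthDecode is used purely as an example of a filter name that the text above does not list as supported, and the reader class does no actual decoding.

```cpp
#include <string>

// Hypothetical stand-ins so the sketch compiles on its own.
struct SketchByteReader { virtual ~SketchByteReader() {} };
struct SketchFilterName { std::string value; };
struct SketchDecodeParams {};

struct SketchParserExtender
{
    virtual ~SketchParserExtender() {}
    virtual SketchByteReader* CreateFilterForStream(SketchByteReader* inStream,
                                                    SketchFilterName* inFilterName,
                                                    SketchDecodeParams* inDecodeParams) = 0;
    virtual bool DoesSupportEncryption() = 0;
    virtual SketchByteReader* CreateDecryptionFilterForStream(SketchByteReader* inStream) = 0;
    virtual std::string DecryptString(std::string inStringToDecrypt) = 0;
};

// Wraps a source reader; a real filter would decode bytes as they are read.
struct SketchRunLengthReader : SketchByteReader
{
    explicit SketchRunLengthReader(SketchByteReader* inSource) : source(inSource) {}
    ~SketchRunLengthReader() { delete source; } // the filter owns its source
    SketchByteReader* source;
};

struct FilterOnlyExtender : SketchParserExtender
{
    SketchByteReader* CreateFilterForStream(SketchByteReader* inStream,
                                            SketchFilterName* inFilterName,
                                            SketchDecodeParams*)
    {
        if (inFilterName->value == "RunLengthDecode")
            return new SketchRunLengthReader(inStream);
        return inStream; // unsupported filter: hand the stream back untouched
    }

    // No encryption support: answer false and pass everything through.
    bool DoesSupportEncryption() { return false; }
    SketchByteReader* CreateDecryptionFilterForStream(SketchByteReader* inStream) { return inStream; }
    std::string DecryptString(std::string inStringToDecrypt) { return inStringToDecrypt; }
};
```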

Encryption extension

For those of you who really must support encrypted PDFs, either for parsing or embedding, IPDFParserExtender provides the ability to do so. The following methods should be implemented:

virtual bool DoesSupportEncryption();

DoesSupportEncryption should return true if the extender supports it.

virtual IByteReader* CreateDecryptionFilterForStream(IByteReader* inStream);

CreateDecryptionFilterForStream should return a new instance of a reading stream object for decrypting a stream. If you don't support stream decryption, return the input inStream.

virtual string DecryptString(string inStringToDecrypt);

DecryptString should decrypt an input string, either literal or hex, and return the decrypted string. If you don't support string decryption just return the input inStringToDecrypt.

virtual void OnObjectStart(long long inObjectID, long long inGenerationNumber);
virtual void OnObjectEnd(PDFObject* inObject);

The OnObjectStart and OnObjectEnd events are provided so that you can tell which indirect object is currently being parsed. You normally need the object ID and generation number, provided with OnObjectStart, for the decryption of objects, so you should tap into these events. Keep a stack object where you add an entry for each OnObjectStart and remove it in OnObjectEnd. From the nature of parsing, it is guaranteed that the calls are nested. Note that you should take care not to try to decrypt the content of the Encrypt dictionary.

A slight inconvenience here: even if you don't implement new filters for streams, you still have to implement this method:
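The stack suggested above can be as simple as the following sketch. SketchObjectTracker is a hypothetical helper, not part of the library. Its OnObjectEnd takes no argument because the sketch does not need the parsed object, whereas the real event receives a PDFObject pointer.

```cpp
#include <utility>
#include <vector>

// Hypothetical helper, not part of the library: remembers which indirect
// object is currently being parsed, including nested ones.
class SketchObjectTracker
{
public:
    // Call from the extender's OnObjectStart.
    void OnObjectStart(long long inObjectID, long long inGenerationNumber)
    {
        mStack.push_back(std::make_pair(inObjectID, inGenerationNumber));
    }

    // Call from the extender's OnObjectEnd.
    void OnObjectEnd()
    {
        if (!mStack.empty())
            mStack.pop_back();
    }

    bool IsInsideObject() const { return !mStack.empty(); }

    // Only valid while IsInsideObject() is true.
    long long CurrentObjectID() const { return mStack.back().first; }
    long long CurrentGeneration() const { return mStack.back().second; }

private:
    std::vector<std::pair<long long, long long> > mStack;
};
```

DecryptString and CreateDecryptionFilterForStream implementations would then read the current ID and generation from the tracker when they need them.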

virtual IByteReader* CreateFilterForStream(IByteReader* inStream,
                                           PDFName* inFilterName,
                                           PDFDictionary* inDecodeParams);

If you don't add support for new filters, just return the input stream, inStream.

A Word on PDFs with incremental changes

A main idea in the philosophy of the parser is to refer to the most up-to-date object. Hence the methods always take an object ID, and no version. This is largely driven by the only current usage of the parser: copying pages and objects from a PDF to a new one, where no one needs the old leftovers. Be aware of this.
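The visible effect of this philosophy is "the latest entry wins". The sketch below shows that effect on a toy object table. SketchXref and mergeXrefSections are hypothetical names, and the code does not claim to match how the parser reads or stores xref sections internally.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical stand-ins, not the library's xref structures.
typedef std::map<unsigned long, long long> SketchXref; // object ID -> byte offset

// Sections are given oldest first; a later section overrides earlier entries,
// so each object ID ends up pointing at its most recent definition.
SketchXref mergeXrefSections(const std::vector<SketchXref>& inSectionsOldestFirst)
{
    SketchXref merged;
    for (std::size_t i = 0; i < inSectionsOldestFirst.size(); ++i)
    {
        const SketchXref& section = inSectionsOldestFirst[i];
        for (SketchXref::const_iterator it = section.begin(); it != section.end(); ++it)
            merged[it->first] = it->second;
    }
    return merged;
}
```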
