API Structure and Design

The API is divided into three modules:

  1. Common

  2. Cos

  3. PD

The Common module provides general system access, file access, and parsing APIs. The ParserState type was originally taken from JuliaIO/JSON.jl without its file headers. The author therefore acknowledges the efforts of that package's developers and expects anyone developing derivative work to honor the same. Note: ParserState is no longer in use. The parser has been moved to the BufferedStreams interfaces, and some minor helper methods have been ported to the new interface.

The Cos module implements the low-level file format of PDF. COS stands for Carousel Object Structure; Carousel was the original name used inside Adobe for what later became Acrobat. The Cos layer holds the object structure, the object definitions, and the cross-references used to access them.

The PD module is the higher-level document access layer. Accessing PDF pages, extracting their content, and understanding how a document is rendered using fonts or image objects typically happen in this layer. Please note that many objects in the PD layer actually refer to the Cos structure. You can think of the PD layer as the business logic and the Cos layer as its database.

Common

    CDTextString

The PDF file format provides two primary string types: the hexadecimal string CosXString and the literal string CosLiteralString. However, these are mere binary representations of strings with no encoding attached to give them meaning. The encoding is mostly determined by the associated fonts and character maps in the content stream. Strings are also used in descriptions and other attributes of a PDF file where no font or mapping information is provided. CDTextString represents the string type in such situations. Typically, strings in PDFs are of three types.

  1. Text string

       a. PDFDocEncoded string - similar to ISO_8859-1

       b. UTF-16BE string

  2. ASCII string

  3. Byte string - pure binary data with no interpretation

Types 1 and 2 can be represented by CDTextString. convert methods are provided to translate a CosString to a CDTextString.
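
As a rough illustration of the text string case, the sketch below decodes raw string bytes in plain Julia. The helper name decode_text_string is hypothetical and is not part of PDFIO. It assumes that a leading 0xFE 0xFF byte order mark signals UTF-16BE, and that PDFDocEncoded bytes can be approximated by ISO_8859-1, which does not hold for every code point.

```julia
# Hypothetical helper, not PDFIO's implementation.
function decode_text_string(bytes::Vector{UInt8})
    if length(bytes) >= 2 && bytes[1] == 0xFE && bytes[2] == 0xFF
        # UTF-16BE: combine the byte pairs after the byte order mark
        # into 16-bit code units, then convert them to a String.
        units = [(UInt16(bytes[i]) << 8) | UInt16(bytes[i + 1])
                 for i in 3:2:length(bytes) - 1]
        return transcode(String, units)
    end
    # Assumption: treat every byte as an ISO_8859-1 code point.
    return join(Char(b) for b in bytes)
end

utf16_text = decode_text_string(UInt8[0xFE, 0xFF, 0x00, 0x50, 0x00, 0x44, 0x00, 0x46])
latin_text = decode_text_string(UInt8[0x50, 0x44, 0x46])
```

A byte string (type 3) would not go through such a helper at all, since it carries no text interpretation.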

    CDDate

Internally represented as string objects, these are time-zone-aware date and time objects.

PDF files support the string format: (D:YYYYMMDDHHmmSSOHH'mm)

    CDDate(s::CDTextString)

Constructs a CDDate from a string in the PDF date format: (D:YYYYMMDDHHmmSSOHH'mm)
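
To make the format concrete, here is a minimal sketch that parses such a string with the Julia standard library only. The function parse_pdf_date is hypothetical and is not how CDDate is implemented. It assumes every field up to the seconds is present and that O is one of +, - or Z, followed by an optional HH'mm offset.

```julia
using Dates

# Hypothetical helper, not part of PDFIO: returns the local time and the
# offset from UTC in minutes for a string like D:YYYYMMDDHHmmSSOHH'mm.
function parse_pdf_date(s::AbstractString)
    re = r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})([+\-Z])(?:(\d{2})'(\d{2})'?)?$"
    m = match(re, s)
    m === nothing && throw(ArgumentError("not a full PDF date string: $s"))
    y, mo, d, h, mi, sec = (parse(Int, m.captures[i]) for i in 1:6)
    local_time = DateTime(y, mo, d, h, mi, sec)
    sign = m.captures[7] == "-" ? -1 : 1
    # No HH'mm part (for example after Z) means a zero offset.
    offset = m.captures[8] === nothing ? 0 :
        sign * (60 * parse(Int, m.captures[8]) + parse(Int, m.captures[9]))
    return local_time, offset
end

local_time, offset = parse_pdf_date("D:20170401123045+05'30")
utc_time = local_time - Minute(offset)   # shift the local time back to UTC
```

Keeping the offset separate from the local time mirrors the idea of a time-zone-aware value without depending on an external time zone package.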

    CDRect

A CosArray representation of a rectangle, given as its lower-left and upper-right points.
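
The four-number layout can be sketched in plain Julia as follows. ToyRect and its helpers are hypothetical names used for illustration only; they assume the usual PDF ordering of lower-left x, lower-left y, upper-right x, upper-right y.

```julia
# Hypothetical stand-in for a rectangle stored as a four-element array.
struct ToyRect
    llx::Float64
    lly::Float64
    urx::Float64
    ury::Float64
end

# Build the rectangle from an array in [llx, lly, urx, ury] order.
function ToyRect(a::AbstractVector{<:Real})
    length(a) == 4 || throw(ArgumentError("a rectangle needs exactly 4 numbers"))
    return ToyRect(a[1], a[2], a[3], a[4])
end

rect_width(r::ToyRect) = r.urx - r.llx
rect_height(r::ToyRect) = r.ury - r.lly

page_box = ToyRect([0, 0, 612, 792])   # a US Letter page in default units
```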


COS Objects

    CosObject

PDF is a structured document format with many internal data structures such as dictionaries, arrays, and trees. CosObject is the interface for accessing these objects and gathering detailed information from them. Although defined in the Cos layer, objects of this type are returned from almost all the APIs, so they matter whether or not you use the Cos layer directly. Below is the object hierarchy.

CosObject                           Abstract
    CosNull                         Value (CosNullType)
    CosString                       Abstract
    CosName                         Concrete
    CosNumeric                      Abstract
        CosInt                      Concrete
        CosFloat                    Concrete
    CosBoolean                      Concrete
        CosTrue                     Value (CosBoolean)
        CosFalse                    Value (CosBoolean)
    CosDict                         Concrete
    CosArray                        Concrete
    CosStream                       Concrete (always wrapped as an indirect object)
    CosIndirectObjectRef            Concrete (only useful when CosDoc is available)

PDFIO.Cos.CosNull - Constant.
    CosNull

PDF representation of a null object. Can be applied to CosObject of any type.

    CosString

Abstract type that represents a PDF string. In PDF, string objects are mere byte representations. They translate to actual text strings only through the application of fonts and their associated encodings.

    CosName

Name objects are symbols used in PDF documents.

    @cn_str(str) -> CosName

A string decorator for easier instantiation of a CosName
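
For readers unfamiliar with Julia string macros, the sketch below shows the general mechanism on a toy type. ToyName and the tn prefix are hypothetical and unrelated to PDFIO's own definitions; the point is only that a macro named tn_str lets you write tn"..." literals.

```julia
# Hypothetical stand-in for a name object.
struct ToyName
    val::Symbol
end

# A macro called `tn_str` enables the literal syntax tn"...".
macro tn_str(s)
    return :(ToyName(Symbol($s)))
end

page_key = tn"Page"   # same as ToyName(:Page)
```

By the same convention, a macro named cn_str is what makes a cn"..." literal available.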

    CosNumeric

Abstract type for numeric objects. The objects can be an integer CosInt or float CosFloat.

    CosInt

An integer in a PDF document.

    CosFloat

A numeric float data type.

    CosBoolean

A boolean object in PDF which is either a CosTrue or CosFalse

    CosDict

Name-value pairs of PDF objects. The object is very similar to the Dict object. Each key has to be of the CosName type.

    CosArray

An array in a PDF file. The objects can be any combination of CosObject.

    CosStream

A stream object in a PDF. Stream objects have an extent dictionary, followed by binary data.

    CosIndirectObjectRef

A parsed data structure that stores the reference information as an object. It has no meaning without an associated CosDoc. When a reference object is encountered, the referenced object should be looked up in the CosDoc and returned.


PD

PDFIO.PD.PDDoc - Type.
    PDDoc

An in-memory representation of a PDF document. Once created, this object is the handle through which the PDF document is accessed.

PDFIO.PD.pdDocOpen - Function.
    pdDocOpen(filepath::AbstractString) -> PDDoc

Opens a PDF document and provides the PDDoc document object for subsequent queries into the PDF file. filepath is the relative or absolute path to the PDF file. Remember to release the document with pdDocClose once you are done with the object.

PDFIO.PD.pdDocClose - Function.
    pdDocClose(doc::PDDoc, num::Int) -> PDDoc

Reclaims the resources associated with a PDDoc object. Once called, the PDDoc object cannot be used any further.

    pdDocGetCatalog(doc::PDDoc) -> CosObject

The catalog is considered the topmost-level object in a PDF document and is subsequently used to traverse and extract information from it. Use it to access PDF internal objects from the document structure when no direct API is available.

    pdDocGetNamesDict(doc::PDDoc) -> CosObject

Some information in a PDF is stored as name and value pairs that are not necessarily held in a dictionary. They are all aggregated and can be accessed from one names dictionary object in the document catalog. This method provides access to such values in a PDF file. Not all PDF documents have a names dictionary. In such cases, a CosNull object may be returned.

Please refer to the PDF specification for further details.

PDFIO.PD.pdDocGetInfo - Function.
    pdDocGetInfo(doc::PDDoc) -> Dict

Given a PDF document, provides the document information available in the DocumentInfo dictionary. The information typically includes the creation date, modification date, author, creator application, etc. However, none of these entries is mandatory, so not all of the information you need may be available in a given document.

Please refer to the PDF specification for further details.

    pdDocGetCosDoc(doc::PDDoc) -> CosDoc

The PDF document format is organized in two layers: logical PDF document information is represented over a physical file structure called COS. CosDoc is an access object to the physical file structure of the PDF document. Use it to access PDF internal objects from the document structure when no direct API is available.

One can access any aspect of a PDF using the COS-level APIs alone. However, they require you to know the PDF specification in detail, and they are not the most intuitive.

PDFIO.PD.pdDocGetPage - Function.
    pdDocGetPage(doc::PDDoc, num::Int) -> PDPage

Given an absolute page number within the document, provides the associated page object.

    pdDocGetPageCount(doc::PDDoc) -> Int

Returns the number of pages associated with the document.

    pdDocGetPageRange(doc::PDDoc, nums::Range{Int}) -> Vector{PDPage}
    pdDocGetPageRange(doc::PDDoc, label::AbstractString) -> Vector{PDPage}

Given a range of page numbers or a label, returns an array of the pages associated with it.

    pdPageGetContents(page::PDPage) -> CosObject

Page rendering objects are normally stored in a CosStream object in a PDF file. This method provides access to the stream object.

Please refer to the PDF specification for further details.

    pdPageIsEmpty(page::PDPage) -> Bool

Returns true when the page has no associated content object.

    pdPageGetCosObject(page::PDPage) -> CosObject

The PDF document format is organized in two layers: logical PDF document information is represented over a physical file structure called COS. This method provides the internal COS object associated with the page object.

    pdPageGetContentObjects(page::PDPage) -> CosObject

Page rendering objects are normally stored in a CosStream object in a PDF file. This method provides access to the stream object.

    pdPageExtractText(io::IO, page::PDPage) -> IO

Extracts the text from the page and writes it to io. The extraction works best for tagged PDF files. For untagged PDFs, some line and word breaks may not be extracted properly.


PDF Page objects

    PDPageObject

The content streams associated with PDF pages contain the objects that can be rendered. These objects are represented by PDPageObject. Content streams use postfix notation, so an operator is preceded by its operands, for example:

(Draw this text) Tj

As can be seen above, the string object is a CosString that serves as the operand of the operator Tj (draw text). This class of objects is represented by PDPageElement.

However, there are certain objects which only provide grouping information or begin and end markers for grouping information. For example, a text object:

BT
    /F1 11 Tf  %selectfont
    (Draw this text) Tj
ET

Objects of this kind are represented by PDPageObjectGroup. In this case, the PDPageObjectGroup contains four PDPageElement objects, represented by the operators BT, Tf, Tj, and ET.

PDPageElement and PDPageObjectGroup can be extended by composition. Hence, there are more specialized objects that can be seen as well.
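
The element and group distinction described above can be sketched with toy types in plain Julia. ToyElement, ToyGroup and group_tokens are hypothetical names, the input is assumed to be already split into tokens, and only the four operators from the example are recognized; this is not PDFIO's parser.

```julia
# An operator together with the operands that preceded it.
struct ToyElement
    operator::String
    operands::Vector{String}
end

# A run of elements enclosed by BT ... ET.
struct ToyGroup
    elements::Vector{ToyElement}
end

function group_tokens(tokens::Vector{String};
                      operators = Set(["BT", "ET", "Tf", "Tj"]))
    objects = Union{ToyElement,ToyGroup}[]
    operands = String[]
    group = nothing
    for t in tokens
        if !(t in operators)
            push!(operands, t)          # postfix: operands come first
            continue
        end
        el = ToyElement(t, operands)
        operands = String[]
        if t == "BT"
            group = ToyGroup([el])      # start a text object
        elseif group !== nothing
            push!(group.elements, el)
            if t == "ET"                # close the text object
                push!(objects, group)
                group = nothing
            end
        else
            push!(objects, el)          # element outside any group
        end
    end
    return objects
end

objs = group_tokens(["BT", "/F1", "11", "Tf", "(Draw this text)", "Tj", "ET"])
```

For the text object shown earlier, the result is a single group holding four elements, matching the BT, Tf, Tj, ET breakdown.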

    PDPageElement

A representation of a content object with operator and operand. See PDPageObject for more details.

    PDPageObjectGroup

A representation of a content object that encloses other content objects. See PDPageObject for more details.

    PDPageTextObject

A PDPageObjectGroup object that represents a block of text. See PDPageObject for more details.

    PDPageTextRun

In PDF, text may not be contiguous, as there may be a change of font, style, or graphics rendering parameters. PDPageTextRun is a unit of text which can be rendered without any change to the graphical parameters. There is no guarantee that a text run will represent a meaningful word or sentence.

PDPageTextRun is a composition implementation of PDPageElement.

    PDPageMarkedContent

A PDPageObjectGroup object that represents a set of objects logically grouped together in a structured PDF document.

    PDPageInlineImage

Most images in PDF documents are defined elsewhere in the document and referenced from the page content stream. PDPageInlineImage objects, by contrast, are defined directly in the page content stream.

    PDPage_BeginGroup

A PDPageElement that represents the beginning of a group object.

    PDPage_EndGroup

A PDPageElement that represents the end of a group object.


Cos

    CosDoc

The PDF document format is organized in two layers: logical PDF document information is represented over a physical file structure called COS. CosDoc is an access object to the physical file structure of the PDF document. Use it to access PDF internal objects from the document structure when no direct API is available.

One can access any aspect of a PDF using the COS-level APIs alone. However, they require you to know the PDF specification in detail, and they are not the most intuitive.

PDFIO.Cos.cosDocOpen - Function.
    cosDocOpen(filepath::AbstractString) -> CosDoc

Provides access to the physical file and file structure of the PDF document. Returns a CosDoc which can subsequently be used for all queries into the PDF file. Remember to release the document with cosDocClose once you are done with the object.

PDFIO.Cos.cosDocClose - Function.
    cosDocClose(doc::CosDoc)

Reclaims all system resources consumed by the CosDoc. The CosDoc should not be used after this method is called. cosDocClose only needs to be called explicitly if you opened the document with cosDocOpen. Documents opened with pdDocOpen do not need this method.

    cosDocGetRoot(doc::CosDoc) -> CosDoc

Returns the structural starting point of a PDF document, also known as the document root dictionary. It provides details of object locations and the document access methodology. It should not be confused with the catalog object of the PDF document.

    cosDocGetObject(doc::CosDoc, obj::CosObject) -> CosObject

PDF objects are distributed throughout the file and can be cross-referenced from one location to another. This is called indirect object referencing. However, to extract actual information one needs access to the complete object (the direct object). This method provides access to the direct object after searching for it in the document structure. If an indirect object reference is passed as the obj parameter, the complete indirect object (the reference as well as all content of the object) is returned. A direct object passed to the method is returned as is, without any translation. This ensures the user does not have to check the type of an object before accessing its contents.
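
The lookup behaviour described above maps naturally onto Julia's multiple dispatch. The sketch below uses toy types only (ToyObject, ToyInt, ToyRef, ToyDoc and toy_get_object are hypothetical names), and it assumes a simple table keyed by object and generation number. Unlike the real method, it returns just the stored content and uses nothing to stand in for a missing object.

```julia
abstract type ToyObject end

struct ToyInt <: ToyObject
    val::Int
end

# An indirect reference: object number plus generation number.
struct ToyRef <: ToyObject
    num::Int
    gen::Int
end

# A toy document: just a cross-reference table from (num, gen) to objects.
struct ToyDoc
    xref::Dict{Tuple{Int,Int},ToyObject}
end

# Direct objects are returned as is.
toy_get_object(doc::ToyDoc, obj::ToyObject) = obj

# References are looked up in the document's table.
toy_get_object(doc::ToyDoc, ref::ToyRef) =
    get(doc.xref, (ref.num, ref.gen), nothing)

doc = ToyDoc(Dict{Tuple{Int,Int},ToyObject}((12, 0) => ToyInt(42)))
direct = toy_get_object(doc, ToyInt(7))
resolved = toy_get_object(doc, ToyRef(12, 0))
```

Because both cases go through the same call, calling code does not need to branch on whether it holds a reference or a direct object.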

    cosDocGetObject(doc::CosDoc, dict::CosObject, key::CosName) -> CosObject

Returns the object referenced inside the dict dictionary. dict can be a PDF dictionary object reference or an indirect object or a direct CosDict object.

    cosDocGetPageNumbers(doc::CosDoc, catalog::CosObject, label::AbstractString) -> Range{Int}

PDF uses two pagination schemes: an internal global page number that is maintained serially as an integer, and a PageLabel that is shown by viewers. Given a label, this method returns a range of valid page numbers.
