API Structure and Design
The API is divided into three modules:
Common
Cos
PD
The Common module provides general system access, file access, and parsing APIs. The ParserState type was taken from JuliaIO/JSON.jl, and the file headers were not carried over. The author therefore acknowledges the efforts of that package's developers and expects anyone developing a derivative work to honor the same. Note: ParserState is no longer in use. The parser has been moved to the BufferedStreams interfaces, and some minor helper methods have been ported to the new interface.
The Cos module is the low-level file format layer for PDF. Carousel Object Structure is the term originally proposed inside Adobe; Carousel was the project that later became Acrobat. The Cos layer holds the object structure, the object definitions, and the cross references used to access them.
The PD module is the higher-level document access layer. Accessing PDF pages, extracting content from them, or understanding how a document renders using fonts or image objects typically happens in this layer. Note that many objects in the PD layer actually refer to Cos structures. You can think of the PD layer as the business logic and the Cos layer as its database.
Common
PDFIO.Common.CDTextString
— Type. CDTextString
The PDF file format provides two primary string types: the hexadecimal string CosXString and the literal string CosLiteralString. However, these are mere binary representations of strings, with no encoding associated for semantic interpretation. The encoding is determined mostly by the associated fonts and character maps in the content stream. Strings are also used in descriptions and other attributes of a PDF file where no font or mapping information is provided. CDTextString represents the string type in such situations. Typically, strings in PDFs are of 3 types:
1. Text string
   a. PDFDocEncoded string - similar to ISO_8859-1
   b. UTF-16BE string
2. ASCII string
3. Byte string - pure binary data, no interpretation
Types 1 and 2 can be represented by CDTextString. convert methods are provided to translate a CosString to a CDTextString.
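As a rough illustration of the two text string flavours listed above, the following sketch decodes raw string bytes without using PDFIO. It is not how the library's convert methods are implemented: decode_text_string is a hypothetical helper, it assumes a UTF-16BE string starts with the byte order mark FE FF, and it approximates the PDF document encoding by Latin-1.

```julia
# Hypothetical helper, not part of PDFIO: decode the raw bytes of a PDF text string.
function decode_text_string(bytes::Vector{UInt8})
    if length(bytes) >= 2 && bytes[1] == 0xFE && bytes[2] == 0xFF
        # UTF-16BE: skip the byte order mark and pair bytes into 16-bit code units.
        body = bytes[3:end]
        units = UInt16[(UInt16(body[i]) << 8) | UInt16(body[i + 1])
                       for i in 1:2:length(body) - 1]
        return transcode(String, units)
    end
    # Assumption: treat every other string as Latin-1, one character per byte.
    return join(Char.(bytes))
end

decode_text_string(UInt8[0xFE, 0xFF, 0x00, 0x48, 0x00, 0x69])  # "Hi"
```

A byte string (type 3) should not be passed through such a function at all, since it carries no text interpretation.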
PDFIO.Common.CDDate
— Type. CDDate
Timezone-aware date and time objects, internally represented as string objects.
PDF files support the string format: (D:YYYYMMDDHHmmSSOHH'mm)
PDFIO.Common.CDDate
— Method. CDDate(s::CDTextString)
PDF files support the string format: (D:YYYYMMDDHHmmSSOHH'mm)
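To make the string format concrete, here is a minimal sketch that parses it with the Dates standard library only. parse_pdf_date is a hypothetical helper and says nothing about how CDDate works internally; it assumes the full form shown above (every field present) and tolerates an optional trailing apostrophe.

```julia
using Dates

# Hypothetical helper, not part of PDFIO: parse "D:YYYYMMDDHHmmSSOHH'mm".
# Returns the local time as written and the same instant shifted to UTC.
function parse_pdf_date(s::AbstractString)
    m = match(r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})([-+Z])(\d{2})'(\d{2})'?$", s)
    m === nothing && throw(ArgumentError("not a full PDF date string: $s"))
    y, mo, d, h, mi, sec = map(i -> parse(Int, m[i]), 1:6)
    local_time = DateTime(y, mo, d, h, mi, sec)
    # O relates local time to UTC: "+" is ahead, "-" is behind, "Z" is equal.
    offset = m[7] == "Z" ? Minute(0) : Minute(60 * parse(Int, m[8]) + parse(Int, m[9]))
    utc = m[7] == "-" ? local_time + offset : local_time - offset
    return (local_time = local_time, utc = utc)
end

parse_pdf_date("D:20180502141500+05'30").utc  # 2018-05-02T08:45:00
```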
PDFIO.Common.CDRect
— Type. CDRect
A CosArray representation of a rectangle in the lower-left and upper-right point format.
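The lower-left and upper-right convention means a rectangle is four numbers, [llx lly urx ury]. The sketch below shows how width and height follow from that layout. ToyRect, width and height are hypothetical names for illustration and are not part of PDFIO.

```julia
# Toy stand-in for a rectangle stored as [llx, lly, urx, ury]; not the PDFIO type.
struct ToyRect
    llx::Float64
    lly::Float64
    urx::Float64
    ury::Float64
end

# Build a rectangle from a four-element array, as found in a PDF file.
ToyRect(v::AbstractVector{<:Real}) = ToyRect(v[1], v[2], v[3], v[4])

width(r::ToyRect) = r.urx - r.llx
height(r::ToyRect) = r.ury - r.lly

letter = ToyRect([0, 0, 612, 792])  # US Letter page in points
(width(letter), height(letter))     # (612.0, 792.0)
```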
COS Objects
PDFIO.Cos.CosObject
— Type. CosObject
PDF is a structured document format with many internal data structures such as dictionaries, arrays, and trees. CosObject is the interface for accessing these objects and gathering additional information about them. Although defined in the COS layer, objects of this type are returned from almost all the APIs. Hence, these objects matter whether or not you use the Cos layer directly. Below is the object hierarchy.
CosObject                       Abstract
    CosNull                     Value (CosNullType)
    CosString                   Abstract
    CosName                     Concrete
    CosNumeric                  Abstract
        CosInt                  Concrete
        CosFloat                Concrete
    CosBoolean                  Concrete
        CosTrue                 Value (CosBoolean)
        CosFalse                Value (CosBoolean)
    CosDict                     Concrete
    CosArray                    Concrete
    CosStream                   Concrete (always wrapped as an indirect object)
    CosIndirectObjectRef        Concrete (only useful when CosDoc is available)
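The split between abstract and concrete types matters because Julia dispatches on it: a method written for an abstract type covers all of its concrete subtypes. The sketch below mirrors a small part of the hierarchy with hypothetical Toy types; it does not use or reproduce the PDFIO definitions.

```julia
# Toy mirror of part of the hierarchy; all names here are illustrative only.
abstract type ToyObject end
abstract type ToyNumeric <: ToyObject end

struct ToyInt <: ToyNumeric
    val::Int
end

struct ToyFloat <: ToyNumeric
    val::Float64
end

struct ToyName <: ToyObject
    val::Symbol
end

struct ToyArray <: ToyObject
    val::Vector{ToyObject}
end

# One method on the abstract type serves both ToyInt and ToyFloat.
describe(o::ToyNumeric) = "number $(o.val)"
describe(o::ToyName) = "name /$(o.val)"
describe(o::ToyArray) = "array of $(length(o.val)) objects"

describe(ToyInt(3))  # "number 3"
```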
PDFIO.Cos.CosNull
— Constant. CosNull
PDF representation of a null object. Can be applied to a CosObject of any type.
PDFIO.Cos.CosString
— Type. CosString
Abstract type that represents a PDF string. In PDF, string objects are mere byte representations. They translate to actual text strings through the application of fonts and associated encodings.
PDFIO.Cos.CosName
— Type. CosName
Name objects are symbols used in PDF documents.
PDFIO.Cos.@cn_str
— Macro. @cn_str(str) -> CosName
A string decorator for easier instantiation of a CosName
PDFIO.Cos.CosNumeric
— Type. CosNumeric
Abstract type for numeric objects. The objects can be an integer CosInt or a float CosFloat.
PDFIO.Cos.CosInt
— Type. CosInt
An integer in a PDF document.
PDFIO.Cos.CosFloat
— Type. CosFloat
A numeric float data type.
PDFIO.Cos.CosBoolean
— Type. CosBoolean
A boolean object in PDF, which is either CosTrue or CosFalse.
PDFIO.Cos.CosDict
— Type. CosDict
Name-value pairs of PDF objects. The object is very similar to the Dict object. The key has to be of the CosName type.
PDFIO.Cos.CosArray
— Type. CosArray
An array in a PDF file. The elements can be any combination of CosObject.
PDFIO.Cos.CosStream
— Type. CosStream
A stream object in a PDF. Stream objects have an extends dictionary, followed by binary data.
PDFIO.Cos.CosIndirectObjectRef
— Type. CosIndirectObjectRef
A parsed data structure to ensure the object information is stored as an object. This has no meaning without an associated CosDoc. When a reference object is encountered, the object should be looked up in the CosDoc and returned.
PD
PDFIO.PD.PDDoc
— Type. PDDoc
An in-memory representation of a PDF document. Once created, this type has to be used to access the PDF document.
PDFIO.PD.pdDocOpen
— Function. pdDocOpen(filepath::AbstractString) -> PDDoc
Opens a PDF document and provides the PDDoc document object for subsequent queries into the PDF file. filepath is the path to the PDF file, in relative or absolute form. Remember to release the document with pdDocClose once you are done with the object.
PDFIO.PD.pdDocClose
— Function. pdDocClose(doc::PDDoc)
Reclaims the resources associated with a PDDoc object. Once called, the PDDoc object cannot be used further.
PDFIO.PD.pdDocGetCatalog
— Function. pdDocGetCatalog(doc::PDDoc) -> CosObject
Catalog is considered the topmost-level object in a PDF document; it is subsequently used to traverse and extract information from the document. Use it to access PDF internal objects from the document structure when no direct API is available.
PDFIO.PD.pdDocGetNamesDict
— Function. pdDocGetNamesDict(doc::PDDoc) -> CosObject
Some information in a PDF is stored as name and value pairs that are not necessarily in a dictionary. They are all aggregated and can be accessed from one names dictionary object in the document catalog. This method provides access to such values in a PDF file. Not all PDF documents have a names dictionary. In such cases, a CosNull object may be returned.
Please refer to the PDF specification for further details.
PDFIO.PD.pdDocGetInfo
— Function. pdDocGetInfo(doc::PDDoc) -> Dict
Given a PDF document, provides the document information available in the DocumentInfo dictionary. The information typically includes the creation date, modification date, author, creator application, etc. However, none of these entries is mandatory, so a given document may not contain all the information you need.
Please refer to the PDF specification for further details.
PDFIO.PD.pdDocGetCosDoc
— Function. pdDocGetCosDoc(doc::PDDoc) -> CosDoc
The PDF document format is developed in two layers. The logical PDF document information is represented over a physical file structure called COS. CosDoc is an access object to the physical file structure of the PDF document. Use it to access PDF internal objects from the document structure when no direct API is available.
One can access any aspect of a PDF using the COS level APIs alone. However, they require you to know the PDF specification in detail and are not the most intuitive.
PDFIO.PD.pdDocGetPage
— Function. pdDocGetPage(doc::PDDoc, num::Int) -> PDPage
Given an absolute page number in the document, provides the associated page object.
PDFIO.PD.pdDocGetPageCount
— Function. pdDocGetPageCount(doc::PDDoc) -> Int
Returns the number of pages associated with the document.
PDFIO.PD.pdDocGetPageRange
— Function. pdDocGetPageRange(doc::PDDoc, nums::Range{Int}) -> Vector{PDPage}
pdDocGetPageRange(doc::PDDoc, label::AbstractString) -> Vector{PDPage}
Given a range of page numbers or a label, returns an array of the pages associated with it.
PDFIO.PD.pdPageGetContents
— Function. pdPageGetContents(page::PDPage) -> CosObject
Page rendering objects are normally stored in a CosStream object in a PDF file. This method provides access to the stream object.
Please refer to the PDF specification for further details.
PDFIO.PD.pdPageIsEmpty
— Function. pdPageIsEmpty(page::PDPage) -> Bool
Returns true when the page has no associated content object.
PDFIO.PD.pdPageGetCosObject
— Function. pdPageGetCosObject(page::PDPage) -> CosObject
PDF document format is developed in two layers. A logical PDF document information is represented over a physical file structure called COS. This method provides the internal COS object associated with the page object.
PDFIO.PD.pdPageGetContentObjects
— Function. pdPageGetContentObjects(page::PDPage) -> CosObject
Page rendering objects are normally stored in a CosStream object in a PDF file. This method provides access to the stream object.
PDFIO.PD.pdPageExtractText
— Function. pdPageExtractText(io::IO, page::PDPage) -> IO
Extracts the text from the page. This extraction works best for tagged PDF files. For untagged PDFs, some line and word breaks will not be extracted properly.
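The document-level and page-level functions above are typically used together. The following is a minimal sketch of that flow, assuming PDFIO is installed; extract_text is a name chosen for this illustration, and calling pdDocClose with the document as its only argument is an assumption.

```julia
using PDFIO

# Extract the text of every page of `src` into the file `out`
# and return the document information dictionary.
function extract_text(src::AbstractString, out::AbstractString)
    doc = pdDocOpen(src)
    try
        info = pdDocGetInfo(doc)
        open(out, "w") do io
            for i in 1:pdDocGetPageCount(doc)
                page = pdDocGetPage(doc, i)
                pdPageExtractText(io, page)
            end
        end
        return info
    finally
        # Assumption: pdDocClose is called with the document alone.
        pdDocClose(doc)
    end
end
```

The try/finally block is a design choice of this sketch: it releases the document even if extraction fails on some page.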
PDF Page objects
PDFIO.PD.PDPageObject
— Type. PDPageObject
The content streams associated with PDF pages contain the objects that can be rendered. These objects are represented by PDPageObject. Such an object can be an operator in postfix notation, preceded by its operands, like:
(Draw this text) Tj
As can be seen above, the string object is a CosString, which is an operand of the operator Tj (draw text). This class of objects is represented by PDPageElement.
However, there are certain objects which only provide grouping information or begin and end markers for grouping information. For example, a text object:
BT
/F1 11 Tf %selectfont
(Draw this text) Tj
ET
Objects of this kind are represented by PDPageObjectGroup. In this case, the PDPageObjectGroup contains four PDPageElement objects, namely those for the operators BT, Tf, Tj, and ET.
PDPageElement and PDPageObjectGroup can be extended by composition. Hence, you may encounter more specialized objects as well.
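The postfix structure described above can be sketched with a toy tokenizer that pairs each operator with the operands preceding it. This is not PDFIO's parser: strip_comments, tokenize, is_operand and elements are hypothetical helpers, and the sketch ignores nested or escaped strings, arrays, dictionaries and inline images.

```julia
# Toy model, not PDFIO's parser. Comments run from % to the end of the line.
strip_comments(src) = replace(src, r"%[^\n]*" => "")

# Split the stream into tokens: a parenthesized string or a run of other characters.
tokenize(src) = String[m.match for m in eachmatch(r"\([^)]*\)|[^\s()]+", strip_comments(src))]

# Strings, names and numbers are operands; every other token is treated as an operator.
is_operand(tok) = startswith(tok, "(") || startswith(tok, "/") ||
                  tryparse(Float64, tok) !== nothing

# Postfix notation: collect operands until an operator arrives, then emit one element.
function elements(tokens)
    out = Tuple{String,Vector{String}}[]
    operands = String[]
    for tok in tokens
        if is_operand(tok)
            push!(operands, tok)
        else
            push!(out, (tok, operands))
            operands = String[]
        end
    end
    return out
end

stream = """
BT
/F1 11 Tf %selectfont
(Draw this text) Tj
ET
"""

els = elements(tokenize(stream))
[op for (op, _) in els]  # ["BT", "Tf", "Tj", "ET"]
```

The four resulting elements correspond to the four operators of the text object, with /F1 and 11 attached to Tf and the string attached to Tj.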
PDFIO.PD.PDPageElement
— Type. PDPageElement
A representation of a content object with operator and operand. See PDPageObject for more details.
PDFIO.PD.PDPageObjectGroup
— Type. PDPageObjectGroup
A representation of a content object that encloses other content objects. See PDPageObject for more details.
PDFIO.PD.PDPageTextObject
— Type. PDPageTextObject
A PDPageObjectGroup object that represents a block of text. See PDPageObject for more details.
PDFIO.PD.PDPageTextRun
— Type. PDPageTextRun
In PDF, text may not be contiguous, as there may be changes of font, style, or graphics rendering parameters. PDPageTextRun is a unit of text that can be rendered without any change to the graphical parameters. There is no guarantee that a text run will represent a meaningful word or sentence.
PDPageTextRun is a composition implementation of PDPageElement.
PDFIO.PD.PDPageMarkedContent
— Type. PDPageMarkedContent
A PDPageObjectGroup object that represents a group of objects that are logically grouped together in a structured PDF document.
PDFIO.PD.PDPageInlineImage
— Type. PDPageInlineImage
Most images in PDF documents are defined elsewhere in the PDF document and referenced from the page content stream. PDPageInlineImage objects are defined directly in the page content stream.
PDFIO.PD.PDPage_BeginGroup
— Type. PDPage_BeginGroup
A PDPageElement that represents the beginning of a group object.
PDFIO.PD.PDPage_EndGroup
— Type. PDPage_EndGroup
A PDPageElement that represents the end of a group object.
Cos
PDFIO.Cos.CosDoc
— Type. CosDoc
The PDF document format is developed in two layers. The logical PDF document information is represented over a physical file structure called COS. CosDoc is an access object to the physical file structure of the PDF document. Use it to access PDF internal objects from the document structure when no direct API is available.
One can access any aspect of a PDF using the COS level APIs alone. However, they require you to know the PDF specification in detail and are not the most intuitive.
PDFIO.Cos.cosDocOpen
— Function. cosDocOpen(filepath::AbstractString) -> CosDoc
Provides access to the physical file and file structure of the PDF document. Returns a CosDoc, which can subsequently be used for all queries into the PDF file. Remember to release the document with cosDocClose once you are done with the object.
PDFIO.Cos.cosDocClose
— Function. cosDocClose(doc::CosDoc)
Reclaims all system resources consumed by the CosDoc. The CosDoc should not be used after this method is called. cosDocClose only needs to be called explicitly if you opened the document with cosDocOpen. Documents opened with pdDocOpen do not need this method.
PDFIO.Cos.cosDocGetRoot
— Function. cosDocGetRoot(doc::CosDoc) -> CosObject
The structural starting point of a PDF document, also known as the document root dictionary. This provides details of object locations and document access methodology. This should not be confused with the catalog object of the PDF document.
PDFIO.Cos.cosDocGetObject
— Function. cosDocGetObject(doc::CosDoc, obj::CosObject) -> CosObject
PDF objects are distributed throughout the file and can be cross-referenced from one location to another. This is called indirect object referencing. However, to extract actual information one needs access to the complete object (the direct object). This method provides access to the direct object after searching for it in the document structure. If an indirect object reference is passed as the obj parameter, the complete indirect object (the reference as well as all the content of the object) is returned. A direct object passed to the method is returned as is, without any translation. This means the user does not have to check the type of an object before accessing its contents.
cosDocGetObject(doc::CosDoc, dict::CosObject, key::CosName) -> CosObject
Returns the object referenced inside the dict dictionary. dict can be a PDF dictionary object reference, an indirect object, or a direct CosDict object.
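The lookup behaviour described above (references are resolved, direct objects pass through unchanged) can be modelled with a toy object table. ToyRef, ToyDoc and resolve are hypothetical names for this sketch; the real CosDoc reads objects from the file's cross-reference structure rather than from an in-memory Dict, and returning nothing for a missing entry is a simplification of this sketch.

```julia
# Toy model of indirect object resolution; not the PDFIO implementation.
struct ToyRef
    num::Int  # object number
    gen::Int  # generation number
end

struct ToyDoc
    objects::Dict{Tuple{Int,Int},Any}
end

# A reference is looked up in the document; a missing entry resolves to `nothing`.
resolve(doc::ToyDoc, ref::ToyRef) = get(doc.objects, (ref.num, ref.gen), nothing)
# Anything else is already a direct object and is returned unchanged.
resolve(doc::ToyDoc, obj) = obj

# Dictionary lookup: resolve the dictionary first, then the value stored under `key`.
function resolve(doc::ToyDoc, dict, key::Symbol)
    d = resolve(doc, dict)
    d isa Dict || return nothing
    return resolve(doc, get(d, key, nothing))
end

doc = ToyDoc(Dict{Tuple{Int,Int},Any}(
    (1, 0) => Dict(:Pages => ToyRef(2, 0)),
    (2, 0) => Dict(:Count => 3),
))

resolve(doc, ToyRef(1, 0), :Pages)  # the dictionary stored as object 2 0
```

Because both methods share one name, calling code never has to check whether it holds a reference or a direct object, which is the convenience the docstring describes.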
PDFIO.Cos.cosDocGetPageNumbers
— Function. cosDocGetPageNumbers(doc::CosDoc, catalog::CosObject, label::AbstractString) -> Range{Int}
PDF uses two pagination schemes: an internal global page number, maintained serially as an integer, and a PageLabel, which is what viewers display. Given a label, this method returns a range of valid page numbers.