API Reference

The APIs are organized into three modules:

  1. Common
  2. COS
  3. PD

The Common module provides general system access, file access, and parsing APIs.

The COS module represents the low-level file format of PDF. COS stands for Carousel Object Structure; Carousel was the original name used inside Adobe for what later became Acrobat. The COS layer holds the object structure, the object definitions, and the cross references used to access them.

The PD module is the higher-level document access layer. Accessing PDF pages, extracting content from them, or understanding how a document is rendered using fonts or image objects is typically done in this layer.

A detailed explanation of these layers and their rationale is given in the Architecture and Design section.

Common

PDFIO.Common.CDTextStringType
    CDTextString

The PDF file format provides two primary string types: the hexadecimal string CosXString and the literal string CosLiteralString. However, these are mere binary representations of strings, with no encoding associated for semantic interpretation. The encoding is determined mostly by the associated fonts and character maps in the content stream. Strings are also used in descriptions and other attributes of a PDF file, where no font or mapping information is provided. CDTextString represents the string type in such situations. Typically, strings in PDFs are of 3 types.

  1. Text string
     a. PDFDocEncoded string - Similar to ISO_8859-1
     b. UTF-16BE strings
  2. ASCII string
  3. Byte string - Pure binary data, no interpretation

Types 1 and 2 can be represented by CDTextString. convert methods are provided to translate a CosString to a CDTextString.

Ref: PDF Specification Section 7.9.2

Note: Internally, CDTextString is a Julia String object.
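
As a rough illustration of the two text string encodings listed above, the following sketch decodes raw string bytes using only Julia's standard library. decode_text_string is a hypothetical helper, not a PDFIO function, and it approximates PDFDocEncoding by ISO 8859-1, which differs for some code points.

```julia
# Hypothetical helper, not part of PDFIO: decode the raw bytes of a PDF text string.
function decode_text_string(bytes::Vector{UInt8})
    if length(bytes) >= 2 && bytes[1] == 0xFE && bytes[2] == 0xFF
        # A leading byte order mark FE FF signals UTF-16BE:
        # combine each pair of bytes into one 16-bit code unit.
        body = bytes[3:end]
        units = [(UInt16(body[i]) << 8) | UInt16(body[i + 1])
                 for i in 1:2:(length(body) - 1)]
        return transcode(String, units)
    end
    # Assumption: approximate PDFDocEncoding by mapping each byte to the
    # Unicode code point of the same value (ISO 8859-1).
    return String([Char(b) for b in bytes])
end

utf16_text = decode_text_string(UInt8[0xFE, 0xFF, 0x00, 0x48, 0x00, 0x69])
latin_text = decode_text_string(UInt8[0x48, 0xE9])
```

A byte string (type 3) should not be passed through such a helper, since it carries no text interpretation.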

source
PDFIO.Common.CDDateType
    CDDate

Internally represented as string objects, these are timezone enabled date and time objects.

PDF files support the string format: (D:YYYYMMDDHHmmSSOHH'mm)

source
PDFIO.Common.CDDateMethod
    CDDate(s::CDTextString)

PDF files support the string format: (D:YYYYMMDDHHmmSSOHH'mm)

Example

julia> date = CDDate("D:20190425173659+05'30")
D:20190425173659+05'30

julia> date.d
2019-04-25T17:36:59

julia> date.tz
5 hours, 30 minutes

julia> date.ahead
true
source
PDFIO.Common.getUTCTimeFunction
    getUTCTime(d::CDDate) -> CDDate

Removes the timezone information and returns the CDDate at UTC.

Example

julia> getUTCTime(CDDate("D:20190425173659+05'30"))
D:20190425120659Z
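
The arithmetic behind the conversion can be sketched with the Dates standard library. The helpers parse_pdf_date and to_utc below are hypothetical and are not PDFIO's implementation; they assume the full 14-digit form of the date string shown above.

```julia
using Dates

# Hypothetical helper (not PDFIO's implementation):
# returns (local time, offset from UTC, whether the zone is ahead of UTC).
function parse_pdf_date(s::AbstractString)
    m = match(r"^D:(\d{14})(?:([+\-Z])(?:(\d{2})'(\d{2}))?)?", s)
    m === nothing && throw(ArgumentError("unsupported PDF date: $s"))
    stamp, zone, hh, mm = m.captures
    # Slice the fixed-width digits into year, month, day, hour, minute, second.
    fields = [parse(Int, stamp[r]) for r in (1:4, 5:6, 7:8, 9:10, 11:12, 13:14)]
    local_time = DateTime(fields...)
    offset = hh === nothing ? Minute(0) : Minute(60 * parse(Int, hh) + parse(Int, mm))
    return local_time, offset, zone != "-"
end

# UTC is the local time minus the offset when the zone is ahead of UTC,
# and the local time plus the offset otherwise.
function to_utc(s::AbstractString)
    local_time, offset, ahead = parse_pdf_date(s)
    return ahead ? local_time - offset : local_time + offset
end

utc = to_utc("D:20190425173659+05'30")
```

For the date in the example, 17:36:59 at +05'30 corresponds to 12:06:59 UTC, which agrees with the output of getUTCTime.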
source
PDFIO.Common.CDRectType
    CDRect

CosArray representation of a rectangle in the lower left and upper right point format

Note: CDRect maps to a Rect object in the Rectangle package.

Example

julia> CDRect(CosArray(CosObject[CosInt(0), CosInt(0), CosInt(840), CosFloat(640)]))
Rect:[0.0 0.0 840.0 640.0]
source

COS Objects

PDFIO.Cos.CosObjectType
    CosObject

PDF is a structured document format with many internal data structures such as dictionaries, arrays, and trees. CosObject is the interface for accessing these objects and gathering additional information from them. Although defined in the COS layer, objects of these types are returned from almost all the APIs. Hence, these objects matter whether or not you use the COS layer directly. Below is the object hierarchy.

CosObject                           Abstract
    CosNull                         Value (CosNullType)
CosString                           Abstract
CosName                             Concrete
CosNumeric                          Abstract
    CosInt                          Concrete
    CosFloat                        Concrete
CosBoolean                          Concrete
    CosTrue                         Value (CosBoolean)
    CosFalse                        Value (CosBoolean)
CosDict                             Concrete
CosArray                            Concrete
CosStream                           Concrete (always wrapped as an indirect object)
CosIndirectObjectRef                Concrete (only useful when CosDoc is available)

Note: Because this is a reader API, you may not need to instantiate any of the CosObject types. They are normally populated as a result of parsing a PDF file.

source
PDFIO.Cos.CosStringType
    CosString

Abstract type that represents a PDF string. In PDF, these objects are mere byte representations. They translate to actual text strings through the application of fonts and associated encodings.

source
PDFIO.Cos.CosDictType
    CosDict

Name-value pairs of PDF objects. The object is very similar to the Dict object. The key has to be of the CosName type.

source
PDFIO.Cos.set!Method
    set!(dict::CosDict, name::CosName, obj::CosObject) -> CosDict

    set!(stm::CosStream, name::CosName, obj::CosObject) -> CosStream

Sets the value on a dictionary object. Setting a CosNull object deletes the object from the dictionary.

In case of CosStream objects the data is added to the extent dictionary.

Example

julia> set!(catalog, cn"Version", cn"1.4")

<<
...
/Version /1.4
...
>>
source
Base.lengthMethod
    length(o::CosArray) -> Int

Number of elements in CosArray

Example

julia> a = CosArray(CosObject[CosInt(1), CosFloat(2f0),
                              CosInt(3), CosFloat(4f0)])
[1 2.0 3 4.0 ]

julia> length(a)
4
source
PDFIO.Cos.CosStreamType
    CosStream

A stream object in a PDF. Stream objects have an extent dictionary, followed by binary data.

source
PDFIO.Cos.CosIndirectObjectRefType
    CosIndirectObjectRef

A parsed data structure that stores an indirect object reference as an object. It has no meaning without an associated CosDoc. When a reference object is encountered, the object should be looked up in the CosDoc and returned.

source
Base.getFunction
    get(o::CosObject) -> val
    get(o::CosIndirectObjectRef) -> (objnum, gennum)
source
    get(o::CosArray, isNative=false) -> Vector{CosObject}

An array in a PDF file. The objects can be any combination of CosObject.

isNative = true returns the underlying native objects inside the CosArray by invoking the get method on each element.

Example

julia> a = CosArray(CosObject[CosInt(1), CosFloat(2f0), CosInt(3), CosFloat(4f0)])
[1 2.0 3 4.0 ]

julia> get(a)
4-element Array{CosObject,1}:
 1  
 2.0
 3  
 4.0

julia> get(a, true)
4-element Array{Real,1}:
 1    
 2.0f0
 3    
 4.0f0
source
    get(dict::CosDict, name::CosName, defval::T = CosNull) where T ->
                                                            Union{CosObject, T}
    get(stm::CosStream, name::CosName, defval::T = CosNull) where T ->
                                                            Union{CosObject, T}

Returns the value as a CosObject for the key name or the defval provided.

In case of CosStream objects the data is collected from the extent dictionary.

Example

julia> get(catalog, cn"Version")
null

julia> get(catalog, cn"Version", cn"1.4")
/1.4

julia> get(catalog, cn"Version", "1.4")
"1.4"
source
    get(stm::CosStream) -> IO

Decodes the stream and provides output as an IO.

Example

julia> stm

448 0 obj
<<
	/FFilter	/FlateDecode
	/F	(/tmp/tmpIyGPhL/tmp9hwwaG)
	/Length	437
>>
stream
...
endstream
endobj

julia> io = get(stm)
IOBuffer(data=UInt8[...], readable=true, writable=true, ...)
source

PD

PDFIO.PD.PDDocType
    PDDoc

An in-memory representation of a PDF document. Mostly used as an opaque handle to be passed on to other methods.

See pdDocOpen.

source
PDFIO.PD.pdDocOpenFunction
    pdDocOpen(filepath::AbstractString) -> PDDoc

Opens a PDF document and provides the PDDoc document object for subsequent queries into the PDF file. filepath is the path to the PDF file, in either relative or absolute form.

Remember to release the document with pdDocClose once the object is no longer required. Although doc has certain members, it should normally be considered an opaque handle.

Example

julia> doc = pdDocOpen("test/PDFTest-0.0.4/stillhq/3.pdf")

PDDoc ==>

CosDoc ==>
	filepath:		/home/sambit/.julia/dev/PDFIO/test/PDFTest-0.0.4/stillhq/3.pdf
	size:			817945
	hasNativeXRefStm:	 false
	Trailer dictionaries: 
	<<
	/Info	146 0 R
	/Prev	814755
	/Size	163
	/Root	154 0 R
	/ID	[<2ff783c9846ab546bd49f709cb7be307> <2ff783c9846ab546bd49f709cb7be307> ]
>>
	<<
	/Size	153
	/ID	[<2ff783c9846ab546bd49f709cb7be307> <2ff783c9846ab546bd49f709cb7be307> ]
>>

Catalog:
154 0 obj
<<
	/Type	/Catalog
	/Pages	152 0 R
>>
endobj

isTagged: none
source
PDFIO.PD.pdDocCloseFunction
    pdDocClose(doc::PDDoc) -> Nothing

Reclaims the resources associated with a PDDoc object. Once this method is called, the PDDoc object cannot be used any further.

Example

julia> pdDocClose(doc)
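
Since a document that is opened must also be closed, one possible pattern is to wrap the open and close calls in try/finally so that the close step runs even when an error occurs in between. The helper with_resource below is a hypothetical, generic sketch and not part of PDFIO; it is demonstrated with stand-in functions so that it runs without a PDF file.

```julia
# Generic, hypothetical helper: open a resource, run f on it, and always close it.
function with_resource(f, open_fn, close_fn, args...)
    res = open_fn(args...)
    try
        return f(res)
    finally
        close_fn(res)   # runs on normal return and on error alike
    end
end

# Stand-in open and close functions that only record what happened.
events = String[]
fake_open(path) = (push!(events, "open " * path); path)
fake_close(res) = push!(events, "close " * res)

n = with_resource(length, fake_open, fake_close, "abc")

# With PDFIO loaded, the same helper could presumably be called as
# with_resource(pdDocGetPageCount, pdDocOpen, pdDocClose, path),
# where path is a hypothetical variable holding the PDF file path.
```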
source
PDFIO.PD.pdDocGetCatalogFunction
    pdDocGetCatalog(doc::PDDoc) -> CosObject

The catalog is the topmost object in a PDF document; it is used to traverse the document and extract information from it. Use it to access PDF internal objects from the document structure when no direct API is available.

Example

julia> pdDocGetCatalog(doc)

154 0 obj
<<
	/Pages	152 0 R
	/Type	/Catalog
>>
endobj
source
PDFIO.PD.pdDocGetNamesDictFunction
    pdDocGetNamesDict(doc::PDDoc) -> CosObject

Some information in PDF is stored as name and value pairs that are not necessarily kept in a single dictionary. They are all aggregated and can be accessed from one names dictionary object in the document catalog. This method provides access to such values in a PDF file. Not every PDF document has a names dictionary. In such cases, a CosNull object may be returned.

Please refer to the PDF specification for further details.

Example

julia> pdDocGetNamesDict(doc)

220 0 obj
<<
	/IDS	123 0 R
	/Dests	119 0 R
	/URLS	124 0 R
>>
endobj
source
PDFIO.PD.pdDocGetInfoFunction
    pdDocGetInfo(doc::PDDoc) -> Dict

Given a PDF document, provides the document information available in the Document Info dictionary. The information typically includes the creation date, modification date, author, the creator application used, etc. However, none of these entries is mandatory, so a document may not contain all the information you need. If the document does not have an Info dictionary at all, this method returns nothing.

Please refer to the PDF specification for further details.

Example

julia> pdDocGetInfo(doc)
Dict{String,Union{CDDate, String, CosObject}} with 7 entries:
  "Subject"  => "AU-B Australian Documents"
  "Producer" => "HPA image bureau 1998-1999"
  "Author"   => "IP Australia"
  "ModDate"  => D:19990527113911Z
  "Keywords" => "Patents"
  "Creator"  => "HPA image bureau 1998-1999"
  "Title"    => "199479714D"
source
PDFIO.PD.pdDocGetCosDocFunction
    pdDocGetCosDoc(doc::PDDoc) -> CosDoc

The PDF document format is developed in two layers. The logical PDF document information is represented over a physical file structure called COS. CosDoc is an access object to the physical file structure of the PDF document. Use it to access PDF internal objects from the document structure when no direct API is available.

One can access any aspect of PDF using the COS level APIs alone. However, they may require you to know the PDF specification in detail, and they are not the most intuitive.

Example

julia> cosdoc = pdDocGetCosDoc(doc)

CosDoc ==>
	filepath:		/home/sambit/.julia/dev/PDFIO/test/PDFTest-0.0.4/stillhq/3.pdf
	size:			817945
	hasNativeXRefStm:	 false
	Trailer dictionaries: 
	<<
	/ID	[<2ff783c9846ab546bd49f709cb7be307> <2ff783c9846ab546bd49f709cb7be307> ]
	/Size	163
	/Prev	814755
	/Info	146 0 R
	/Root	154 0 R
>>
	<<
	/ID	[<2ff783c9846ab546bd49f709cb7be307> <2ff783c9846ab546bd49f709cb7be307> ]
	/Size	153
>>
source
PDFIO.PD.pdDocGetPageFunction
    pdDocGetPage(doc::PDDoc, num::Int) -> PDPage
    pdDocGetPage(doc::PDDoc, ref::CosIndirectObjectRef) -> PDPage

Given a document absolute page number or object reference, provides the associated page object.

Example

julia> page = pdDocGetPage(doc, 1)
PDFIO.PD.PDPageImpl(...)
julia> page = pdDocGetPage(doc, CosIndirectObjectRef(155, 0))
PDFIO.PD.PDPageImpl(...)
source
PDFIO.PD.pdDocGetPageCountFunction
    pdDocGetPageCount(doc::PDDoc) -> Int

Returns the number of pages associated with the document.

Example

julia> pdDocGetPageCount(doc)
30
source
PDFIO.PD.pdDocGetPageRangeFunction
    pdDocGetPageRange(doc::PDDoc, nums::AbstractRange{Int}) -> Vector{PDPage}
    pdDocGetPageRange(doc::PDDoc, label::AbstractString) -> Vector{PDPage}

Given a range of page numbers or a label returns an array of pages associated with it. For a detailed explanation on page labels, refer to the method pdDocHasPageLabels.

Example

julia> pages = pdDocGetPageRange(doc, 1:4);

julia> typeof(pages)
Array{PDFIO.PD.PDPageImpl,1}

julia> length(pages)
4
source
PDFIO.PD.pdDocHasPageLabelsFunction
    pdDocHasPageLabels(doc::PDDoc) -> Bool

Returns true if the document has page labels defined.

As per PDF Specification 1.7 Section 12.4.2, a document may optionally define page labels (PDF 1.3) to identify each page visually on the screen or in print. Page labels and page indices need not coincide: the indices shall be fixed, running consecutively through the document starting from 0 for the first page, but the labels may be specified in any way that is appropriate for the particular document.

Example

julia> PDFIO.PD.pdDocHasPageLabels(doc)
false
source
PDFIO.PD.pdDocGetPageLabelFunction
    pdDocGetPageLabel(doc::PDDoc, pageno::Int) -> String

Returns the page label if the page has a page label associated to it.

As per PDF Specification 1.7 Section 12.4.2, a document may optionally define page labels (PDF 1.3) to identify each page visually on the screen or in print. Page labels and page indices need not coincide: the indices shall be fixed, running consecutively through the document starting from 0 for the first page, but the labels may be specified in any way that is appropriate for the particular document.

Example

julia> pdDocGetPageLabel(doc, 3)
"ii"
source
PDFIO.PD.pdDocGetOutlineFunction
    pdDocGetOutline(doc::PDDoc) -> PDOutline

Given a PDF document, provides the document Outline (Table of Contents) available in the Document Catalog dictionary. If the document does not have an Outline, this method returns nothing.

A PDF document may contain a document outline that the conforming reader may display on the screen, allowing the user to navigate interactively from one part of the document to another. The outline consists of a tree-structured hierarchy of outline items (sometimes called bookmarks), which serve as a visual table of contents to display the document’s structure to the user. The user may interactively open and close individual items by clicking them with the mouse. When an item is open, its immediate children in the hierarchy shall become visible on the screen; each child may in turn be open or closed, selectively revealing or hiding further parts of the hierarchy. When an item is closed, all of its descendants in the hierarchy shall be hidden. Clicking the text of any visible item activates the item, causing the conforming reader to jump to a destination or trigger an action associated with the item. - Section 12.3.3 - Document management — Portable document format — Part 1: PDF 1.7

Example

julia> outline = pdDocGetOutline(doc)
555 0 R

julia> iob = IOBuffer();

julia> using AbstractTrees; print_tree(iob, outline)

julia> write(stdout, iob.data)
Contents
├─ Table of Contents
├─ 1. Introduction
├─ 2. Quick Steps - Kernel Compile
│  ├─ 2.1. Precautionary Preparations
│  ├─ 2.2. Minor Upgrading of Kernel
│  ├─ 2.3. For the Impatient
│  ├─ 2.4. Building New Kernel - Explanation of Steps
│  ├─ 2.5. Troubleshooting
...
source
PDFIO.PD.pdDocHasSignatureFunction
    pdDocHasSignature(doc::PDDoc) -> Bool

Returns true when the document has at least one signature field.

This does not mean there is an actual digital signature embedded in the document. A PDF document can be signed and content can be approved by one or more reviewers. Signature fields are placeholders for storing and rendering such information.

Example

julia> pdDocHasSignature(doc)
true
source
PDFIO.PD.pdDocValidateSignaturesFunction
    pdDocValidateSignatures(doc::PDDoc; export_certs=false) -> Vector{Dict{Symbol, Any}}

Input

param           Description
doc             The document for which all the signatures are to be validated.
export_certs    Optional keyword parameter. When set, exports all the certificates that are embedded in the PDF document. These certificates can be for end-entities or one or more certifying authorities. Certificates are exported to the file <PDF filename>.pem.

Output

Vector of dictionary objects representing one dictionary object for each signature. The dictionary objects map the symbols to output as per the following table.

Symbol            Description
:Name             The name of the person or authority signing the document.
:P                Object reference of the page in which the signature is found.
:M                The CDDate when the document was signed.
:certs            The certificates associated with every signature object.
:subfilter        The subfilter of PDF signature object.
:FQT              Fully qualified title of the signature form.
:chain            The certificate chain that validated the signature.
:passed           Validation status of the signature (true / false)
:error_message    Error message returned during the validation
:stacktrace       The stack dump of where the validation failure occurred

Notes

  1. Any additional certificates needed for validating a certificate trust chain have to be added manually, in the PEM format, to the Trust Store file at <Package Directory>/data/certs/cacerts.pem. Normally, certificate authorities (root as well as intermediate) are represented in the trust store.
  2. The presence of an end-entity certificate in the Trust Store means that chain validation does not have to be carried out for that certificate. However, this is not considered a good practice, as certificate validation is an important safeguard against security breaches in the chain. For self-signed certificates with no CA capabilities, this may be the only option.
  3. Validation of digital signatures is limited to approval signature validation as per section 12.8.1 of PDF Spec. 1.7. Signatures for permissions and usage rights are not validated by this method. This API only provides a validation report. It does not modify access to any part of the document based on the validation output. The consumer of the API needs to take appropriate action based on the validation report, as desired in their application.
  4. Revocation - When time is embedded in the signature as a signing-time attribute or a signed timestamp, or the PDF signature dictionary has an M attribute, that time is picked up for validation. However, revocation information is not used during validation.
  5. PDF 2.0 Support - The support is only experimental. While some subfilters like /ETSI.CAdES.detached are supported, Document Security Store (DSS) and Document Time Stamp (DTS) have not been implemented.

Example

julia> r = pdDocValidateSignatures(doc);

julia> r[1] # Failure case
Dict{Symbol,Any} with 8 entries:
  :Name          => "JAYANT KUMAR ARORA"
  :P             => 1 0 R
  :M             => D:20190425173659+05'30
  :error_message => "Error in Crypto Library:
                        140322274480320:error:02001002:system library:..."
  :subfilter     => /adbe.pkcs7.sha1
  :stacktrace    => ["error(::String) at error.jl:33",
                     "openssl_error(::Int32) at PDCrypt.jl:96",
                     "PDFIO.PD.PDCertStore() at PDCrypt.jl:148",
                     ...]
  :FQT           => "Signature1"
  :passed        => false

julia> r[1] # Passed case
Dict{Symbol,Any} with 8 entries:
  :Name      => "JAYANT KUMAR ARORA"
  :P         => 1 0 R
  :M         => D:20190425173659+05'30
  :certs     => Dict{Symbol,Any}[Certificate Parameters...]
  :subfilter => /adbe.pkcs7.sha1
  :FQT       => "Signature1"
  :chain     => Dict{Symbol,Any}[Certificate Parameters...]
  :passed    => true
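
Because the API only returns a report, the caller has to act on it. The following sketch shows one way to split such a report into passed and failed signatures. summarize_signatures is a hypothetical helper, and the reports vector is hand-written sample data shaped like the dictionaries above, not the output of a real validation run.

```julia
# Hypothetical helper: collect signer names that passed, and
# (name, error message) pairs for signatures that failed.
function summarize_signatures(reports::Vector{Dict{Symbol,Any}})
    passed = String[]
    failed = Tuple{String,String}[]
    for r in reports
        name = get(r, :Name, "<unknown>")
        if get(r, :passed, false)
            push!(passed, name)
        else
            push!(failed, (name, get(r, :error_message, "no error message")))
        end
    end
    return passed, failed
end

# Hand-written sample data, abbreviated from the shapes shown in the example.
reports = Dict{Symbol,Any}[
    Dict{Symbol,Any}(:Name => "JAYANT KUMAR ARORA", :passed => false,
                     :error_message => "Error in Crypto Library"),
    Dict{Symbol,Any}(:Name => "JAYANT KUMAR ARORA", :passed => true),
]

passed, failed = summarize_signatures(reports)
```

A missing :passed key is treated as a failure here, which is a conservative choice of this sketch rather than documented behavior.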
source
PDFIO.PD.pdPageGetContentsFunction
    pdPageGetContents(page::PDPage) -> CosObject

Page rendering objects are normally stored in a CosStream object in a PDF file. This method provides access to the stream object.

Please refer to the PDF specification for further details.

Example

julia> pdPageGetContents(page)

448 0 obj
<<
	/Length	437
	/FFilter	/FlateDecode
	/F	(/tmp/tmpZnGGFn/tmp5J60vr)
>>
stream
...
endstream
endobj
source
PDFIO.PD.pdPageIsEmptyFunction
    pdPageIsEmpty(page::PDPage) -> Bool

Returns true when the page has no associated content object.

Example

julia> pdPageIsEmpty(page)
false
source
PDFIO.PD.pdPageGetCosObjectFunction
    pdPageGetCosObject(page::PDPage) -> CosObject

PDF document format is developed in two layers. A logical PDF document information is represented over a physical file structure called COS. This method provides the internal COS object associated with the page object.

source
PDFIO.PD.pdPageGetContentObjectsFunction
    pdPageGetContentObjects(page::PDPage) -> CosObject

Page rendering objects are normally stored in a CosStream object in a PDF file. This method provides access to the stream object.

source
PDFIO.PD.pdPageGetMediaBoxFunction
    pdPageGetMediaBox(page::PDPage) -> CDRect{Float32}
    pdPageGetCropBox(page::PDPage) -> CDRect{Float32}

pdPageGetMediaBox returns the media box associated with the page, and pdPageGetCropBox returns the crop box. See 14.11.2 PDF 1.7 Spec.

The media box is typically the designated size of the paper for the page. When a crop box is not defined, it defaults to the media box.

Example

julia> pdPageGetMediaBox(page)
Rect:[0.0 0.0 595.0 792.0]

julia> pdPageGetCropBox(page)
Rect:[0.0 0.0 595.0 792.0]
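
To relate these numbers to a paper size, the sketch below converts a box given as [lower-left x, lower-left y, upper-right x, upper-right y] into inches. It assumes the PDF default user space unit of 1/72 inch and works on a plain vector rather than a CDRect; page_size_inches is a hypothetical helper, not a PDFIO function.

```julia
# Hypothetical helper: width and height in inches of a box given as
# [llx, lly, urx, ury] in default user space units (1/72 inch each).
page_size_inches(box) = ((box[3] - box[1]) / 72, (box[4] - box[2]) / 72)

# Values taken from the media box in the example above.
width, height = page_size_inches([0.0, 0.0, 595.0, 792.0])
```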
source
PDFIO.PD.pdPageGetFontsFunction
    pdPageGetFonts(page::PDPage) -> Dict{CosName, PDFont}

Returns a dictionary of fonts in the page.

Example

julia> pdPageGetFonts(page)
Dict{CosName,PDFIO.PD.PDFont} with 4 entries:
  /F0 => PDFont(…
  /F4 => PDFont(…
  /F8 => PDFont(…
  /F9 => PDFont(…
source
PDFIO.PD.pdPageExtractTextFunction
    pdPageExtractText(io::IO, page::PDPage) -> IO

Extracts the text from the page. This extraction works best for tagged PDF files. For PDFs not tagged, some line and word breaks will not be extracted properly.

Example

The following code extracts the text from a full PDF file.

function getPDFText(src, out)
    doc = pdDocOpen(src)
    docinfo = pdDocGetInfo(doc)
    open(out, "w") do io
        npage = pdDocGetPageCount(doc)
        for i=1:npage
            page = pdDocGetPage(doc, i)
            pdPageExtractText(io, page)
        end
    end
    pdDocClose(doc)
    return docinfo
end
source
PDFIO.PD.PDOutlineType
    PDOutline

Representation of PDF document Outline (Table of Contents).

Use the methods from AbstractTrees package to traverse the elements.

source
PDFIO.PD.PDDestinationType
    PDDestination

Used for a variety of purposes to locate a rectangular region in a PDF document, particularly in outlines, actions, etc.

The structure can also denote a location outside of a document, as in remote GoTo (GoToR) actions. In such cases, it is best used together with a filename. Moreover, page references have no meaning in remote file references. Hence, the pageno attribute has been set to Int, unlike in the PDF Spec 32000-2008 12.3.2.2.

  • pageno::Int - Page number location
  • layout::CosName - Various view layouts are possible. Please review the PDF spec for details.
  • values::Vector{Float32} - [left, bottom, right, top] sequence array. Not all values are used. The usage depends on the layout parameter.
  • zoom::Float32 - Zoom value for the view. Can be zero depending on layout, where it is intrinsic and hence redundant.

source
PDFIO.PD.pdOutlineItemGetAttrFunction
    pdOutlineItemGetAttr(item::PDOutlineItem) -> Dict{Symbol, Any}

Attributes stored with a PDOutlineItem object. The traversal parameters like Prev, Next, First, Last and Parent are stored with the structure.

The following keys are stored in the dictionary object returned:

  • :Title - The title assigned to the item (shows up in the table of contents)
  • :Count - A representation of the number of items open under the outline item. Please refer to the PDF Spec 32000-2008 section 12.3.2.2 for details. Mostly used for rendering on a user interface.
  • :Destination - (filepath, PDDestination) value. Filepath is an empty string if the destination refers to a location in the same PDF file. This parameter is a combination of the /Dest and /A attributes in the PDF specification. The action element is analyzed, and the data is extracted and stored with the PDDestination as the final referred location.
  • :C - The color of the outline in the DeviceRGB space.
  • :F - Flags for title text rendering: italic=1, bold=2

Example

julia> pdOutlineItemGetAttr(outlineitem)
Dict{Symbol,Any} with 5 entries:
  :F           => 0x00
  :Title       => "Table of Contents"
  :Count       => 0
  :Destination => ("", PDDestination(2, /XYZ, Float32[0.0, 0.0, 0.0, 756.0], 0.0))
  :C           => Float32[0.0, 0.0, 0.0]
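
The :F value is a set of bit flags. A minimal sketch of decoding it, assuming only the two values listed above (italic=1, bold=2), could look as follows. outline_style is a hypothetical helper and not part of PDFIO.

```julia
# Hypothetical helper: decode the :F flags into named booleans.
# Bit value 1 means italic and bit value 2 means bold, so 3 means both.
outline_style(flags::Integer) = (italic = (flags & 1) != 0, bold = (flags & 2) != 0)

plain = outline_style(0x00)        # the value shown in the example above
bold_italic = outline_style(0x03)
```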
source

PDF Page objects

PDFIO.PD.PDPageObjectType
    PDPageObject

The content streams associated with PDF pages contain the objects that can be rendered. These objects are represented by PDPageObject. These objects can contain a postfix notation based operator prefixed by its operands like:

(Draw this text) Tj

As can be seen above, the string object is a CosString, which is a parameter to the operator Tj (draw text). This class of objects is represented by PDPageElement.

However, there are certain objects which only provide grouping information or begin and end markers for grouping information. For example, a text object:

BT
    /F1 11 Tf  %selectfont
    (Draw this text) Tj
ET

This kind of object is represented by PDPageObjectGroup. In this case, the PDPageObjectGroup contains four PDPageElement objects, namely those for the operators BT, Tf, Tj, and ET.

PDPageElement and PDPageObjectGroup can be extended by composition. Hence, there are more specialized objects that can be seen as well.
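
The postfix arrangement, in which operands come first and the operator follows, can be illustrated with a toy tokenizer. This is only a sketch and not PDFIO's parser: group_operators is a hypothetical function that handles names, numbers, simple literal strings and operators, and ignores comments, arrays, dictionaries and inline images.

```julia
# Toy sketch, not PDFIO's parser: pair each operator with the operands before it.
function group_operators(content::AbstractString)
    # A token is either a parenthesized literal string or a run of other characters.
    tokens = [m.match for m in eachmatch(r"\((?:[^()\\]|\\.)*\)|[^\s()]+", content)]
    elements = Tuple{String,Vector{String}}[]
    operands = String[]
    for tok in tokens
        # Names start with '/', literal strings with '(', and numbers parse as Float64.
        is_operand = startswith(tok, "/") || startswith(tok, "(") ||
                     tryparse(Float64, tok) !== nothing
        if is_operand
            push!(operands, tok)
        else
            # Anything else is treated as an operator that consumes the pending operands.
            push!(elements, (String(tok), operands))
            operands = String[]
        end
    end
    return elements
end

elements = group_operators("BT /F1 11 Tf (Draw this text) Tj ET")
```

For the text object above, this yields four elements for BT, Tf, Tj and ET, with Tf carrying the operands /F1 and 11 and Tj carrying the string.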

source
PDFIO.PD.PDPageTextRunType
    PDPageTextRun

In PDF, text may not be contiguous, as there may be a change of font, style, or graphics rendering parameters. PDPageTextRun is a unit of text which can be rendered without any change to the graphical parameters. There is no guarantee that a text run will represent a meaningful word or sentence.

PDPageTextRun is a composition implementation of PDPageElement.

source

COS Methods

PDFIO.Cos.CosDocType
    CosDoc

The PDF file structure defines how objects are arranged in a PDF file. PDF is designed for random access. Some objects in a PDF, such as fonts, can be referred to from multiple page objects. To address these concerns, objects are given reference identifiers, and mappings to them are provided from various locations in the PDF file. Moreover, to reduce file size, objects can be put inside stream containers and compressed. Access to a specific object reference may need several lookups before the actual object can be traced. All this leads to a fairly complex arrangement of objects. CosDoc wraps all the object reference schemes and provides a simplified API called cosDocGetObject that simplifies object lookup. Thus any PDF object can be classified into the following forms based on how it is represented in a document:

  • Direct Objects: Direct objects are defined where they are referred to or used.
  • Indirect Objects: Indirect objects have reference identifiers; their location in a PDF document is described through an Object Reference identifier.

One can access any aspect of PDF using the COS level APIs alone. However, they may require you to know the PDF specification in detail, and they are not the most intuitive.

source
PDFIO.Cos.cosDocOpenFunction
    cosDocOpen(filepath::AbstractString) -> CosDoc

Provides access to the physical file and file structure of the PDF document. Returns a CosDoc which can subsequently be used for all queries into the PDF file. Remember to release the document with cosDocClose once you are done with the object.

source
PDFIO.Cos.cosDocCloseFunction
    cosDocClose(doc::CosDoc)

Reclaims all system resources consumed by the CosDoc. The CosDoc should not be used after this method is called. cosDocClose only needs to be explicitly called if you have opened the document by 'cosDocOpen'. Documents opened with pdDocOpen do not need to use this method.

source
PDFIO.Cos.cosDocGetRootFunction
    cosDocGetRoot(doc::CosDoc) -> CosObject

The structural starting point of a PDF document. Also known as document catalog dictionary.

source
PDFIO.Cos.cosDocGetObjectFunction
    cosDocGetObject(doc::CosDoc, obj::CosObject) -> CosObject

PDF objects are distributed in the file and can be cross referenced from one location to another. This is called indirect object referencing. However, to extract actual information one needs access to the complete object (direct object). This method provides access to the direct object after searching for the object in the document structure. If an indirect object reference is passed as the obj parameter, the complete indirect object (the reference as well as all content of the object) is returned. A direct object passed to the method is returned as is, without any translation. This ensures the user does not have to check the type of an object before accessing its contents.

Example

julia> cosDocGetObject(doc.cosDoc, CosIndirectObjectRef(555, 0))

555 0 obj
<<
	/Count	18
	/Last	629 0 R
	/First	556 0 R
>>
endobj

julia> cosDocGetObject(doc.cosDoc, cn"DirectObject")
/DirectObject
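
The lookup rule described above, where references are resolved and direct objects pass through unchanged, can be modeled with a small toy using multiple dispatch. ObjRef, resolve and the objects table are hypothetical stand-ins built on Base types; they do not reflect how CosDoc stores or locates objects.

```julia
# Stand-in for an indirect object reference: (object number, generation number).
struct ObjRef
    objnum::Int
    gennum::Int
end

# Direct objects are returned unchanged.
resolve(objects::Dict{ObjRef,Any}, obj) = obj
# References are looked up in the table; an unknown reference yields nothing
# (a choice made for this toy model).
resolve(objects::Dict{ObjRef,Any}, ref::ObjRef) = get(objects, ref, nothing)

# Toy object table with one entry, loosely modeled on the example above.
objects = Dict{ObjRef,Any}(ObjRef(555, 0) => Dict("Count" => 18))

outline = resolve(objects, ObjRef(555, 0))
direct = resolve(objects, "/DirectObject")
```

Because both cases go through the same function, calling code does not need to check whether it holds a reference or a direct object.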
source
    cosDocGetObject(doc::CosDoc, dict::CosObject, key::Union{CosName, CosNullType}) -> CosObject

Returns the object referenced inside the dict dictionary.

  • dict can be a PDF dictionary object reference or an indirect object or a direct CosDict object.
  • key can be CosNull as well. In such a case, a replicated CosDict with direct or indirect objects will be returned for all the input dict keys.

Example

julia> catalog

652 0 obj
<<
	/Outlines	555 0 R
	/PageLayout	/SinglePage
	/PageMode	/UseOutlines
	/Pages	446 0 R
	/Type	/Catalog
	/OpenAction	[447 0 R /XYZ null null 0 ]
>>
endobj

julia> pages = cosDocGetObject(doc.cosDoc, catalog, cn"Pages")

446 0 obj
<<
	/Kids	[447 0 R 449 0 R 451 0 R]
	/Count	3
	/Type	/Pages
>>
endobj

julia> cosDocGetObject(doc.cosDoc, catalog, cn"PageLayout")
/SinglePage
source