PDF (.pdf)

Background & Context

    • MIME type: application/pdf
    • Adobe Acrobat format.
    • Standard format for exchanging and archiving multi-page documents.
    • PDF is an acronym for Portable Document Format.
    • Binary file format.
    • Stores text, fonts, images, and 2D vector graphics in a device- and resolution-independent way.
    • Can also store embedded raster images.
    • Supports multiple lossy and lossless compression methods.

Import & Export

  • Import["file.pdf"] imports a PDF file, returning a list of rasterized images, one per page.
  • Import["file.pdf",elem] imports the specified element from a PDF file.
  • Import["file.pdf",{elem,suba,subb,…}] imports a subelement.
  • The import format can be specified with Import["file","PDF"] or Import["file",{"PDF",elem,…}].
  • Export["file.pdf",expr] creates a PDF file from an arbitrary expression, cell, or notebook object.
  • The Wolfram Language does not rasterize fonts or 2D vector graphics when exporting to PDF.
  • Export["file.pdf",expr,elem] creates a PDF file by treating expr as specifying element elem.
  • Export["file.pdf",{expr1,expr2,…},{{elem1,elem2,…}}] treats each expr_i as specifying the corresponding elem_i.
  • Export["file.pdf",expr,opt1->val1,…] exports expr with the specified option elements taken to have the specified values.
  • Export["file.pdf",{elem1->expr1,elem2->expr2,…},"Rules"] uses rules to specify the elements to be exported.
  • See the reference pages for full general information on Import and Export.
  • ImportString and ExportString support PDF.
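The calling patterns above can be illustrated with a minimal round trip. This is only a sketch: the file name pdf-roundtrip.pdf and the use of $TemporaryDirectory are assumptions chosen so that the example does not depend on an existing document.

```wolfram
(* Hypothetical file name in the system temporary directory *)
file = FileNameJoin[{$TemporaryDirectory, "pdf-roundtrip.pdf"}];

(* Export a vector graphic; Export returns the name of the file written *)
Export[file, Plot[Sin[x], {x, 0, 2 Pi}]];

(* Default import: a list with one rasterized image per page *)
pages = Import[file];

(* Name the format explicitly together with the element to import *)
text = Import[file, {"PDF", "Plaintext"}];

(* String-based variants: generate PDF data in memory and read it back *)
pdfString = ExportString[Graphics[Disk[]], "PDF"];
images = ImportString[pdfString, "PDF"];
```

Specifying "PDF" explicitly is mainly useful when the file extension does not identify the format.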

Notebook Interface

  • In the notebook front end, Insert Picture and the Open menu allow import of a PDF file into a cell.
  • Save As exports the active notebook as a PDF file.
  • Save Selection As exports the selected part of a notebook to PDF.

Import Elements

  • General Import elements:
    "Elements"    list of elements and options available in this file
    "Rules"       full list of rules for each element and option
    "Options"     list of rules for options, properties, and settings
  • Structure elements:
    "ContentsGraph"        graph of the table of contents from the document
    "ContentsStartPage"    list of rules giving table of contents name and page numbers
    "PageCount"            number of pages
    "Summary"              summary of the file
  • Data representation elements for the whole PDF document:
    "Plaintext"        a string giving the textual content of the whole document
    "FormattedText"    a sequence of formatted text for the whole document
  • Data representation elements given as a list representing each page of the document:
    "PageFormattedText"    a list of formatted text, each representing a page
    "PageGraphics"         a list of Graphics objects, each representing a page
    "PageImages"           a list of Image objects, each representing a page
    "PagePlaintext"        a list of strings, each representing the plaintext of a page
  • Import by default uses the "PageImages" element.
  • Metadata elements:
    "Author"              author of the document
    "CreationDate"        creation date of the document, given as a DateObject
    "Creator"             program that created the content
    "Keywords"            keywords from the document
    "ModificationDate"    modification date of the document, given as a DateObject
    "MetaInformation"     metadata given as strings and date objects
    "Producer"            program that converted the data to PDF
    "Subject"             the subject of the document
    "Title"               document title
    "Version"             version of the PDF specification for the file
  • Hyperlink, annotation, and form field elements:
    "FormFieldRules"     association of page numbers and lists of rules giving form field names and values
    "HighlightedText"    association of page numbers and list of strings for each highlighted section of text on each page
    "Hyperlinks"         association of page numbers and list of Hyperlink objects for each link on each page
    "TextAnnotations"    association of page numbers and text from annotations
    "URLs"               association of page numbers and list of URL objects for each link on each page
  • Embedded images elements:
    "EmbeddedImageCount"    association of page numbers and number of images
    "EmbeddedImages"        association of page numbers and embedded images for each page
  • Attachments elements:
    "AttachmentCount"      number of attachments
    "AttachmentList"       lists of processed attachments as expressions
    "AttachmentNames"      list of attachment names
    "AttachmentDetails"    lists of associations giving attachment content and metadata
    "RawAttachmentList"    attachments given as a list of byte arrays
    "AttachmentData"       list of associations giving raw attachment data and metadata
  • The element "AttachmentDetails" is a list giving an association for each attachment. Each association typically has the following keys:
    "Name"                name assigned to the attachment
    "Content"             imported content
    "CreationDate"        creation date recorded for the attachment
    "ModificationDate"    modification date recorded for the attachment
    "ByteCount"           number of bytes in the attachment
  • The element "AttachmentData" is a list giving an association for each attachment. Each association typically has the following keys:
    "Name"                name assigned to the attachment
    "RawContent"          raw content as a byte array
    "CreationDate"        creation date recorded for the attachment
    "ModificationDate"    modification date recorded for the attachment
    "ByteCount"           number of bytes in the attachment
  • For elements with multiple parts, use subelements for partial data import in either the {elem,page,index} or the {elem,index} form, where page and index can be any of the following:
    n             nth item
    -n            counts from the end
    n;;m          from n through m
    n;;m;;s       from n through m with steps of s
    {n1,n2,…}     specific items n_i
  • Use {"FormFieldRules",page,names} to import the values of the form fields specified by names.
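The subelement forms described above can be sketched as follows. The file name pdf-subelements.pdf is hypothetical, and the PDF is generated on the spot, so it has only a single page; with a longer document the same specifications would select the corresponding pages.

```wolfram
(* Hypothetical temporary file; a one-page PDF is generated so the sketch is self-contained *)
file = FileNameJoin[{$TemporaryDirectory, "pdf-subelements.pdf"}];
Export[file, Plot[Cos[x], {x, 0, 2 Pi}, PlotLabel -> "Cosine"]];

(* {elem, index}: the first page as a single Image rather than a list *)
firstPage = Import[file, {"PageImages", 1}];

(* Negative indices count from the end, so -1 refers to the last page *)
lastPageText = Import[file, {"PagePlaintext", -1}];

(* A span selects a range of pages and returns a list *)
pageRange = Import[file, {"PageImages", 1 ;; 1}];
```

Importing only the pages that are needed avoids rasterizing the whole document, which can matter for long files.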

Options

  • Import options:
    "Password"           None                document password given as a string
    "TextOutlines"       True                whether to import characters as outlines
    "Render"             All                 parts of the document to render in "PageImages"
    RasterSize           Automatic           raster size in pixels for rasterization
    ImageResolution      $ImageResolution    image resolution in dpi for rasterization
    "AttachmentRules"    <||>                rules to control how to import attachments
  • Possible settings for "Render":
    "Annotations"    annotations such as highlighting or additional text boxes
    "FormFields"     data from filled out form fields
    All              render all elements from the document
    None             render no additional elements from the document
  • Export options:
    ImageSize               Automatic    overall image size
    ImageResolution         72           image resolution for rasterization in dpi
    "AllowRasterization"    Automatic    whether to rasterize a graphic that requires advanced versions of PDF
  • Possible settings for "AllowRasterization":
    Automatic    rasterize a graphic that contains features such as transparency or gradients that require advanced versions of PDF to render
    True         always rasterize graphics
    False        always use vector graphics, deploying advanced PDF features where necessary for faithful rendering
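As a rough illustration of the export options listed above, the following sketch writes the same semi-transparent graphic with different settings. The file names are hypothetical, and the resulting file sizes depend on the system and version, so none are shown.

```wolfram
(* A graphic that uses transparency, a feature affected by "AllowRasterization" *)
g = Graphics[{Opacity[0.5], Red, Disk[{0, 0}], Blue, Disk[{1, 0}]}];
dir = $TemporaryDirectory;

(* Keep everything as vector graphics *)
vectorFile = Export[FileNameJoin[{dir, "pdf-vector.pdf"}], g,
   "AllowRasterization" -> False];

(* Always rasterize, at 150 dpi instead of the default resolution *)
rasterFile = Export[FileNameJoin[{dir, "pdf-raster.pdf"}], g,
   "AllowRasterization" -> True, ImageResolution -> 150];

(* Set the overall image size *)
sizedFile = Export[FileNameJoin[{dir, "pdf-sized.pdf"}], g, ImageSize -> 200];

(* Compare the sizes of the resulting files in bytes *)
FileByteCount /@ {vectorFile, rasterFile, sizedFile}
```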

Examples

Basic Examples  (4)

Import pages of a PDF file:

Import a PDF as plaintext:

Export an image to PDF:

Export a typeset mathematical formula to a resolution-independent PDF:
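The two export examples might look like the sketch below. The random test image, the particular integral, and the file names are assumptions made for illustration only.

```wolfram
dir = $TemporaryDirectory;

(* Export a raster image; it is stored in the PDF as an embedded image *)
img = Image[RandomReal[1, {64, 64}]];
imageFile = Export[FileNameJoin[{dir, "pdf-image.pdf"}], img];

(* Export a typeset formula; HoldForm keeps the integral unevaluated *)
formula = TraditionalForm[HoldForm[Integrate[1/(x^3 + 1), x]]];
formulaFile = Export[FileNameJoin[{dir, "pdf-formula.pdf"}], formula];

(* Read the formula page back as an image to check the result *)
Import[formulaFile, {"PageImages", 1}]
```

Because the formula is stored as text and vector outlines rather than pixels, it can be scaled in a PDF viewer without loss of quality.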

Scope  (3)

Import  (3)

Import the first page of a PDF file:

Import the first page of a PDF as plaintext:

Import some metadata:

Import Elements  (23)

Available Elements  (1)

List of available elements:
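A minimal sketch of querying the available elements, assuming a hypothetical file pdf-elements.pdf generated for the purpose. The exact list returned depends on the Wolfram Language version and on the document.

```wolfram
(* Hypothetical temporary file containing a single simple graphic *)
file = FileNameJoin[{$TemporaryDirectory, "pdf-elements.pdf"}];
Export[file, Graphics[Circle[]]];

(* List every element and option that can be imported from this file *)
elements = Import[file, "Elements"];

(* Check for a specific element before requesting it *)
MemberQ[elements, "PageCount"]
```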

Structure Elements  (3)

"ContentsGraph"  (1)

Import a graph of the table of contents for the file:

Get the names of the edges of the graph:

"ContentsStartPage"  (1)

Import the page every section starts on:

"PageCount"  (1)

Import the number of pages in the document:

Data Representation  (5)

"Plaintext"  (1)

Import the text from the whole document:

"FormattedText"  (1)

Import the formatted text from the whole document:

"PageGraphics"  (1)

Import a list of graphics for each page of the document:

"PageImages"  (1)

Import a list of images for each page of the document:

"PagePlaintext"  (1)

Import the text from each page of the document as a list:
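The data representation elements in this section can be compared side by side. The sketch below assumes a hypothetical one-page file, pdf-data.pdf, created from a labeled plot so that it contains both text and graphics.

```wolfram
(* Hypothetical temporary file with a plot label as its only text *)
file = FileNameJoin[{$TemporaryDirectory, "pdf-data.pdf"}];
Export[file, Plot[Sin[x], {x, 0, 2 Pi}, PlotLabel -> "Sine"]];

(* Whole-document element: a single string *)
wholeText = Import[file, "Plaintext"];

(* Per-page elements: lists with one entry per page *)
pageTexts = Import[file, "PagePlaintext"];
pageGraphics = Import[file, "PageGraphics"];
pageImages = Import[file, "PageImages"];

(* Each per-page list should have as many entries as the document has pages *)
Length /@ {pageTexts, pageGraphics, pageImages}
```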

Metadata  (7)

"Author"  (1)

Import the author of the document:

"CreationDate"  (1)

Import the creation date of the document:

"Creator"  (1)

Import the program that created the document:

"ModificationDate"  (1)

Import the modification date of the document:

"Producer"  (1)

Import the program that converted the document:

"Title"  (1)

Import the title of the document:

"Version"  (1)

Import the PDF version of the document:
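Several metadata elements can be collected in one pass, as in this sketch. The file name pdf-metadata.pdf is hypothetical, and a file exported this way may leave fields such as "Title" or "Author" empty, so only elements that an exporting program normally records are requested here.

```wolfram
(* Hypothetical temporary file generated for the example *)
file = FileNameJoin[{$TemporaryDirectory, "pdf-metadata.pdf"}];
Export[file, Graphics[Rectangle[]]];

(* Build an association mapping each element name to its imported value *)
meta = AssociationMap[
   Import[file, #] &,
   {"Creator", "Producer", "CreationDate", "ModificationDate", "Version"}];

(* Look up a single entry *)
meta["Version"]
```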

Annotations and Form Fields  (5)

"FormFieldRules"  (1)

Import the names and values from form fields in the document:

"HighlightedText"  (1)

Import the plaintext of text that is highlighted in the document:

"Hyperlinks"  (1)

Import the hyperlinks in the document:

"TextAnnotations"  (1)

Import the plaintext of text annotations in the document:

"URLs"  (1)

Import the URLs in the document:

Embedded Images  (2)

"EmbeddedImageCount"  (1)

Import the number of embedded images for each page of the document:

"EmbeddedImages"  (1)

Import the embedded images from each page of the document:
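The embedded image elements could be exercised as follows. This sketch assumes that exporting an Image object produces a PDF in which the image is stored as an embedded raster; the file name is hypothetical and the results are not shown because they may vary by version.

```wolfram
(* Export a small RGB test image to a hypothetical temporary file *)
img = Image[RandomReal[1, {32, 32, 3}]];
file = Export[FileNameJoin[{$TemporaryDirectory, "pdf-embedded.pdf"}], img];

(* Number of embedded images, keyed by page number *)
counts = Import[file, "EmbeddedImageCount"];

(* The embedded images themselves, keyed by page number *)
embedded = Import[file, "EmbeddedImages"];
```

Unlike "PageImages", which rasterizes whole pages, these elements give the raster data stored inside the document.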

Import Options  (4)

ImageResolution  (1)

Import the PDF with a resolution suitable for FHD screens:

Import the PDF with a resolution suitable for HiDPI screens:
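A sketch of the two imports, assuming 96 dpi and 192 dpi as stand-ins for FHD and HiDPI screens; these values and the file name are illustrative choices, not settings taken from the examples.

```wolfram
(* Hypothetical temporary file generated for the example *)
file = FileNameJoin[{$TemporaryDirectory, "pdf-resolution.pdf"}];
Export[file, Plot[Sin[x], {x, 0, 2 Pi}]];

(* Rasterize the first page at two different resolutions *)
lowDpi = Import[file, {"PageImages", 1}, ImageResolution -> 96];
highDpi = Import[file, {"PageImages", 1}, ImageResolution -> 192];

(* The higher resolution should yield proportionally more pixels *)
ImageDimensions /@ {lowDpi, highDpi}
```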

RasterSize  (1)

Render a very small image from the PDF:

Render a larger image from the PDF:
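The effect of RasterSize can be sketched in the same way. The sizes 50 and 400 and the file name are arbitrary illustrative choices.

```wolfram
(* Hypothetical temporary file generated for the example *)
file = FileNameJoin[{$TemporaryDirectory, "pdf-rastersize.pdf"}];
Export[file, Plot[Sin[x], {x, 0, 2 Pi}]];

(* Rasterize the first page at a small and at a larger pixel size *)
small = Import[file, {"PageImages", 1}, RasterSize -> 50];
large = Import[file, {"PageImages", 1}, RasterSize -> 400];

(* Compare the pixel dimensions of the two results *)
ImageDimensions /@ {small, large}
```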

"RenderedElements"  (1)

Import an image of the document without rendering annotations:

Compare to the document with rendered annotations:

"TextOutlines"  (1)

Import a document without the outlines of the characters:

Show the difference between two results:
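One way to set up the comparison is sketched below, assuming a hypothetical file that contains a single piece of large text. How the two results differ visually depends on the fonts available on the system.

```wolfram
(* Hypothetical temporary file containing only a short text *)
file = FileNameJoin[{$TemporaryDirectory, "pdf-outlines.pdf"}];
Export[file, Graphics[Text[Style["PDF", 48]]]];

(* Import the page graphics with and without character outlines *)
withOutlines = Import[file, {"PageGraphics", 1}, "TextOutlines" -> True];
withoutOutlines = Import[file, {"PageGraphics", 1}, "TextOutlines" -> False];

(* Display both results next to each other *)
{withOutlines, withoutOutlines}
```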