WARC (.warc)


    MIME type: application/warc.
    Web archive format.
    Used to archive full webpages.
    Revision of the Internet Archive's ARC File Format.
    Supports ISO 28500.

Import and Export

  • Import["file.warc",elements] imports a WARC file, returning a dataset.


  • General Import elements:
    "Elements"    list of elements and options available in this file
    "Rules"    full list of rules for each element and option
    "Options"    list of rules for options, properties and settings
  • Additional elements include:
    "Dataset"    dataset containing common interpreted WARC elements
    "RawDataset"    dataset containing all interpreted WARC elements
    "RawStringDataset"    dataset containing common unformatted WARC headers
    "RawData"    dataset containing all unformatted WARC headers
  • The "Dataset" and "RawDataset" elements interpret a date as a DateObject, and a payload as an HTTPRequest.
  • The "RawStringDataset" and "RawData" elements do not perform any interpretation.
  • The "Dataset" and "Headers" elements always return the following information for each WARC element:
    "URL"    URL of the element
    "ContentType"    MIME content type
    "Content"    the main content of the element
    "AccessDate"    when the resource was accessed
    "WARCType"    the type of WARC element
    "WARCVersion"    version for the WARC element
    "WARCRecordID"    unique element identifier
  • The "RawDataset" and "RawData" elements may return additional elements, such as "WARC-Block-Digest".
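
The elements above can be combined roughly as follows. This is a sketch rather than a documented recipe: "file.warc" is a placeholder path, and the code assumes that each row of the returned dataset is an association keyed by the field names listed above and that "WARCType" holds strings such as "response".

```wolfram
(* List the elements available for this file *)
Import["file.warc", "Elements"]

(* Import the common interpreted elements as a Dataset *)
ds = Import["file.warc", "Dataset"];

(* Keep only a few of the per-record fields listed above *)
ds[All, {"URL", "ContentType", "AccessDate"}]

(* Assumption: "WARCType" values are strings such as "response" *)
ds[Select[#WARCType === "response" &]]
```

Swapping "Dataset" for "RawDataset" in the second call should expose the additional fields, such as "WARC-Block-Digest", when the file provides them.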


Basic Examples  (1)

Import a WARC file:
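
A minimal sketch, assuming a WARC file at the placeholder path "file.warc" (the name used in the Import signature above, not a bundled sample file):

```wolfram
(* "file.warc" is a placeholder; substitute the path to an actual WARC file *)
Import["file.warc"]
```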


Import all headers:
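
One way to do this, sketched under the assumption that "RawData" is the element wanted here (according to the element list above, it holds all unformatted WARC headers):

```wolfram
(* Assumption: "RawData" is the element that returns every header, uninterpreted *)
(* "file.warc" is a placeholder path *)
Import["file.warc", "RawData"]
```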

Introduced in 2018