WARC (.warc)

Background & Context

    • MIME type: application/warc.
    • Web archive format.
    • Used to archive full webpages.
    • Revision of the Internet Archive's ARC File Format.
    • Standardized as ISO 28500.

Import

Import Elements

  • General Import elements:
    "Elements"            list of elements and options available in this file
    "Summary"             summary of the file
    "Rules"               list of rules for all available elements
  • Additional elements include:
    "Dataset"             dataset containing common interpreted WARC elements
    "RawDataset"          dataset containing all interpreted WARC elements
    "RawStringDataset"    dataset containing common unformatted WARC headers
    "RawData"             dataset containing all unformatted WARC headers
  • The "Dataset" and "RawDataset" elements interpret a date as a DateObject, and a payload as an HTTPRequest.
  • The "RawStringDataset" and "RawData" elements do not perform any interpretation.
  • The "Dataset" and "Headers" elements always return the following information for each WARC element:
    "URL"                 URL of the element
    "ContentType"         MIME content type
    "Content"             the main content of the element
    "AccessDate"          when the resource was accessed
    "WARCType"            the type of WARC element
    "WARCVersion"         version for the WARC element
    "WARCRecordID"        unique element identifier
  • The "RawDataset" and "RawData" elements may return additional elements, such as "WARC-Block-Digest".
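
The fields listed above can be picked out of the imported dataset with ordinary Dataset queries. The following is a minimal sketch, not a definitive recipe: the file name "example.warc" is a hypothetical placeholder, and it assumes the "Dataset" element returns one row per WARC element keyed by the field names listed above.

```wolfram
(* "example.warc" is a hypothetical local file used only for illustration *)
file = "example.warc";

(* import the interpreted dataset of common WARC elements *)
ds = Import[file, "Dataset"];

(* keep only a few of the documented columns; assumes each row is keyed by these names *)
ds[All, {"URL", "ContentType", "AccessDate"}]

(* list every element name that Import reports for this file *)
Import[file, "Elements"]
```

If the rows are not keyed as assumed, inspecting the output of Import[file, "Elements"] first should show which elements are actually available.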

Examples

Basic Examples

Import a WARC file:
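
A minimal sketch of such an import, assuming a hypothetical file "example.warc" in the current working directory (the path is a placeholder, not a file named by this page):

```wolfram
(* "example.warc" is a hypothetical path; substitute a real WARC file *)
Import["example.warc"]
```

Which element is returned when none is specified is not stated here; requesting one explicitly, for example Import["example.warc", "Dataset"], avoids relying on the default.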

Import all headers:
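
A minimal sketch, assuming that "all headers" corresponds to the "RawData" element described above (all unformatted WARC headers) and again using the hypothetical file "example.warc":

```wolfram
(* "example.warc" is a hypothetical path; "RawData" is assumed to be the element meant here *)
Import["example.warc", "RawData"]

(* the interpreted counterpart, with dates returned as DateObject expressions *)
Import["example.warc", "RawDataset"]
```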