Web Archiving Introduction

Introduction

Search engines depend on indexes, which are built by crawling and caching the content of the web. For a search engine this caching can be temporary (it only needs to last until the relevant data has been extracted into an index), but it overlaps significantly with web archiving. As a result, using open crawl data often involves working with web archiving formats and tooling.

This document focuses on the file formats used in modern web archiving.

Origins

The ARC format was created by the Internet Archive for use with its Wayback Machine. Its successor is WARC, released in a "finalized" form in 2009. This format (and its subsequent revisions) continues to be the mainstay of web archiving.

WARC Format

A WARC file contains a sequence of WARC records. Each record has one of eight types, six of which are actually in use currently:

  • warcinfo - Information about the WARC file itself and the records that follow it, "good provenance information" as Blumenthal puts it.
  • request - The HTTP request made by the archiving tool to the website that results in the response received.
  • response - The response received from the website (including file contents).
  • revisit - Records a visit to content that was previously archived and had not changed on the subsequent visit.
  • resource - A resource stored alongside the page; may include screenshots or videos of the page.
  • conversion - Conversion of older data into a current format (e.g. as an image standard is deprecated this might contain a replacement image in a new format).
  • continuation - Allows one to reference another WARC record that contains the remainder of the record.
  • metadata - Various metadata depending on the archiving source.

WARC files can be opened with a text editor, although many are quite large and may require an editor with large file support.
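Because a record is plain text headers followed by a content block, a single uncompressed record can be taken apart with the standard library alone. The following is a minimal sketch, not a full parser: it assumes the usual layout of a version line, `Name: value` header lines, a blank line, and then `Content-Length` bytes of content. The sample record and the `parse_warc_record` helper are hypothetical and hand-written for illustration; folded header lines and compression are not handled.

```python
from typing import Dict, Tuple


def parse_warc_record(data: bytes) -> Tuple[str, Dict[str, str], bytes]:
    """Parse one uncompressed WARC record found at the start of `data`.

    Returns (version_line, headers, content_block).
    """
    # The header section ends at the first blank line (CRLF CRLF).
    head, sep, rest = data.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no blank line after the WARC headers")

    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]  # e.g. "WARC/1.0"

    headers: Dict[str, str] = {}
    for line in lines[1:]:
        # Split on the first colon only, so URLs in values stay intact.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()

    # Content-Length is the size of the content block in bytes.
    length = int(headers["Content-Length"])
    return version, headers, rest[:length]


# Hand-written sample "request" record (illustrative values only).
http_request = b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"
sample = b"".join([
    b"WARC/1.0\r\n",
    b"WARC-Type: request\r\n",
    b"WARC-Target-URI: http://example.com/\r\n",
    b"Content-Length: " + str(len(http_request)).encode("ascii") + b"\r\n",
    b"\r\n",
    http_request,
    b"\r\n\r\n",  # a record is followed by two CRLFs
])

version, headers, block = parse_warc_record(sample)
print(version, headers["WARC-Type"], len(block))
```

For real archives a dedicated WARC library is likely the better choice; this sketch is only meant to show how the record types listed above appear in the `WARC-Type` header.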

WAT Format

A WAT file contains data extracted from a WARC file, stored as JSON. It includes the metadata, request, and response information, as well as the links extracted from each page.
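As a rough illustration of pulling links out of WAT data, the sketch below walks a JSON document down to a list of links. The key names (`Envelope`, `Payload-Metadata`, `HTTP-Response-Metadata`, `HTML-Metadata`, `Links`) are an assumption based on the layout commonly seen in Common Crawl WAT files, and the sample payload and the `extract_links` helper are hypothetical; check the structure of your own data before relying on it.

```python
import json
from typing import List

# Hand-written sample; the nesting below is an assumed WAT layout.
wat_payload = json.dumps({
    "Envelope": {
        "WARC-Header-Metadata": {
            "WARC-Type": "response",
            "WARC-Target-URI": "http://example.com/",
        },
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "HTML-Metadata": {
                    "Links": [
                        {"path": "A@/href", "url": "http://example.com/about"},
                        {"path": "IMG@/src", "url": "/logo.png"},
                    ]
                }
            }
        },
    }
})


def extract_links(wat_json: str) -> List[str]:
    """Return the link URLs from one WAT JSON payload, or [] if absent."""
    node = json.loads(wat_json)
    # Walk down the assumed nesting, tolerating missing levels.
    for key in ("Envelope", "Payload-Metadata",
                "HTTP-Response-Metadata", "HTML-Metadata"):
        node = node.get(key, {})
    return [link["url"] for link in node.get("Links", []) if "url" in link]


links = extract_links(wat_payload)
print(links)
```

Using `.get` with an empty default at each level means records without HTML metadata (for example, request records) simply yield no links instead of raising an error.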

WET Format

A WET file contains the plain text extracted from the records in a WARC file.
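The sketch below shows one way to read the extracted text back out. It assumes, as in Common Crawl's WET files, that the file uses the same record layout as WARC and stores each page's text in a `conversion` record. The `make_record` and `iter_records` helpers and the sample data are hypothetical, and the stream is assumed to be uncompressed.

```python
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def make_record(record_type: str, body: bytes) -> bytes:
    """Build a tiny WARC-style record for the demo (illustrative only)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {record_type}\r\n"
        "Content-Type: text/plain\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


def iter_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, content_block) for each record in the stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines between records
        # `line` is the version line here; the headers follow it.
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Read exactly Content-Length bytes of content.
        yield headers, stream.read(int(headers["Content-Length"]))


sample = (
    make_record("warcinfo", b"software: example-crawler")
    + make_record("conversion", "Hello, archived web!".encode("utf-8"))
)

# Keep only the plain text held in conversion records.
texts = [
    body.decode("utf-8")
    for headers, body in iter_records(io.BytesIO(sample))
    if headers.get("WARC-Type") == "conversion"
]
print(texts)
```

Published WET files are usually gzip-compressed; opening one with Python's `gzip.open(path, "rb")` should give a byte stream that can be passed to `iter_records` in place of the `io.BytesIO` object.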

Bibliography/Resources