Search engines depend on indexes, which are built by crawling and caching the content of the web. For search engines this caching can be temporary (lasting only until the relevant data is extracted into an index), but it overlaps significantly with web archiving, and using open crawl data often involves working with web archiving formats and tooling.
This document focuses on the file formats used in modern web archiving.
The ARC format was created by the Internet Archive for use with its Wayback Machine. Its successor is WARC, released in a "finalized" form in 2009; this format (and its subsequent revisions) continues to be the mainstay of web archiving.
- Official Standard Specifications for WARC 1.1
- Karl-Rainer Blumenthal. The stack: An introduction to the WARC file. Archive.org, 4/2021.
- A great introductory article on WARC, including its history, purpose, and implementation.
- Wikipedia on Web ARChive
A WARC file contains WARC records. The format defines eight record types, six of which are actually in use currently:
- warcinfo - Information about the crawl and the records that follow, "good provenance information" as Blumenthal puts it.
- request - The HTTP request made by the archiving tool to the website that results in the response received.
- response - The response received from the website (including file contents).
- revisit - Notes a repeat visit to content that was previously archived and has not changed since.
- resource - May include screenshots or videos of the page.
- conversion - Conversion of older data into a current format (e.g. as an image standard is deprecated this might contain a replacement image in a new format).
- continuation - Allows one to reference another WARC record that contains the remainder of the record.
- metadata - Various metadata depending on the archiving source.
WARC files can be opened with a text editor (although many WARC files are quite large and may require a special editor with large file support).
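Because a record is a block of text headers followed by a payload of known length, the layout can be illustrated with a few lines of code. The following is a minimal sketch, assuming an uncompressed WARC stream and ignoring details such as multi-line headers; `iter_warc_records` and `make_record` are hypothetical helper names, and the sample records are simplified (real records also carry mandatory headers such as `WARC-Record-ID` and `WARC-Date`). For real work an established WARC library is likely a better choice.

```python
import io
from collections import Counter


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # skip the blank lines that separate records
        version = line.strip().decode("utf-8")
        if not version.startswith("WARC/"):
            raise ValueError(f"expected a WARC version line, got {version!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The payload is read by byte count, so it may itself contain blank lines.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def make_record(rec_type, uri, body):
    """Build a simplified record for demonstration purposes only."""
    head = (
        "WARC/1.1\r\n"
        f"WARC-Type: {rec_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


sample = make_record(
    "request", "http://example.com/", b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"
) + make_record(
    "response", "http://example.com/", b"HTTP/1.1 200 OK\r\n\r\n<html></html>"
)

records = list(iter_warc_records(io.BytesIO(sample)))
type_counts = Counter(headers["WARC-Type"] for headers, _ in records)
print(type_counts)
```

Tallying `WARC-Type` values this way is a quick check of which of the record types listed above a given file actually contains.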
WAT files include data extracted from WARC records, stored as JSON. This covers the record metadata, the request, and the response, as well as the links extracted from the page.
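As a rough illustration of working with this JSON, the sketch below pulls the extracted links out of a single WAT entry. The nesting (`Envelope`, `Payload-Metadata`, `HTTP-Response-Metadata`, `HTML-Metadata`, `Links`) is an assumption modeled on the layout of Common Crawl's WAT files and should be checked against a real file; the sample entry and the `extract_links` helper are hypothetical.

```python
import json

# Hypothetical, heavily trimmed WAT entry; real entries carry many more fields.
wat_json = """
{
  "Envelope": {
    "WARC-Header-Metadata": {
      "WARC-Type": "response",
      "WARC-Target-URI": "http://example.com/"
    },
    "Payload-Metadata": {
      "HTTP-Response-Metadata": {
        "HTML-Metadata": {
          "Links": [
            {"path": "A@/href", "url": "http://example.com/about"},
            {"path": "IMG@/src", "url": "/logo.png"}
          ]
        }
      }
    }
  }
}
"""


def extract_links(wat_entry):
    """Return the link URLs recorded for one WAT entry, or [] if none are present."""
    node = wat_entry.get("Envelope", {})
    # Walk down the assumed nesting, tolerating entries that lack HTML metadata.
    for key in ("Payload-Metadata", "HTTP-Response-Metadata", "HTML-Metadata"):
        node = node.get(key, {})
    return [link["url"] for link in node.get("Links", []) if "url" in link]


entry = json.loads(wat_json)
target = entry["Envelope"]["WARC-Header-Metadata"]["WARC-Target-URI"]
links = extract_links(entry)
print(target, links)
```

Using `.get` with empty defaults keeps the helper from failing on entries for request or metadata records, which would not be expected to contain HTML link data.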
WET files include the plaintext extracted from a WARC.
- Archive.org
- Note that the Internet Archive maintains a general blog, but those interested in the more technical aspects of the Archive should see the Archive-It blog, which covers general Archive-It news along with some technical posts.
- A New Wayback: Improving Web Archive Replay. 9/2021.
- Karl-Rainer Blumenthal. The stack: A guide to A/V web archiving with youtube-dl. 1/2021.
- Karl-Rainer Blumenthal. The stack: High fidelity web collecting at scale with Brozzler. 11/2020.
- Molly Bragg, Kristine Hanna, et al. Web Archiving Lifecycle Model. 3/2013.
- CommonCrawl.org
- Stephen Merity. Navigating the WARC file format. 4/2014.
- A brief introduction to the WARC format which, perhaps more importantly (as the information seems less readily available around the web), also discusses the WET and WAT formats.