Heritrix

Updated on Apr 25, 2026

Written in: Java

License: Apache License

Type: Web crawler

Website: crawler.archive.org

Stable release: 3.2.0 / January 10, 2014

Operating system: Linux/Unix-like/Windows (unsupported)

Heritrix is a web crawler designed for web archiving. It was written by the Internet Archive, is available under a free software license, and is written in Java. The main interface is accessed through a web browser, and a command-line tool can optionally be used to initiate crawls.

Heritrix was developed jointly by the Internet Archive and the Nordic national libraries, based on specifications written in early 2003. The first official release was in January 2004, and it has been continually improved since then by employees of the Internet Archive and other interested parties.

For many years, Heritrix was not the main crawler used to gather content for the Internet Archive's web collection. The largest contributor to the collection during that period was Alexa Internet, which crawled the web for its own purposes using a crawler named ia_archiver and then donated the material to the Internet Archive. The Internet Archive did some of its own crawling using Heritrix, but only on a smaller scale.

Starting in 2008, the Internet Archive began making performance improvements so that it could do its own wide-scale crawling, and it now collects most of its content itself.

Projects using Heritrix

A number of organizations and national libraries are using Heritrix, among them:

Austrian National Library, Web Archiving

Bibliotheca Alexandrina's Internet Archive

Bibliothèque nationale de France

British Library

California Digital Library's Web Archiving Service

CiteSeerX

Documenting Internet2

Internet Memory Foundation

Library and Archives Canada

Library of Congress [1]

National and University Library of Iceland

National Library of Finland

National Library of New Zealand

National Library of the Netherlands (Koninklijke Bibliotheek)

Netarkivet.dk

Smithsonian Institution Archives

National Library of Israel

Arc files

Older versions of Heritrix stored the web resources they crawled in Arc files by default. This format, which the Internet Archive has used since 1996 to store its web archives, is wholly unrelated to the ARC archive file format of the same name. More recent versions save by default in the WARC file format, which is similar to Arc but more precisely specified and more flexible. Heritrix can also be configured to store files in a directory layout similar to that of the Wget crawler, using the URL to name the directory and filename of each resource.
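To illustrate the Wget-style layout, the following Python sketch maps a URL to a relative directory and filename. It is only an illustration under stated assumptions: the function name mirror_path, the index.html default, and the decision to ignore query strings are hypothetical choices made here, not Heritrix's or Wget's actual rules.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def mirror_path(url, default_name="index.html"):
    """Map a URL to a relative path of the form host/dir/.../filename.

    Hypothetical helper for illustration only; real mirror writers also
    handle query strings, ports, and unsafe characters.
    """
    parts = urlsplit(url)
    host = parts.hostname or "unknown-host"
    path = parts.path
    # A URL that names a directory (or has no path) needs a filename.
    if path == "" or path.endswith("/"):
        path += default_name
    # Drop empty and dot segments so the result stays below the host directory.
    segments = [s for s in path.split("/") if s not in ("", ".", "..")]
    return PurePosixPath(host, *segments)


if __name__ == "__main__":
    print(mirror_path("http://example.org/docs/hello.html"))
    print(mirror_path("http://example.org/docs/"))
```

Using the host name as the top-level directory keeps resources from different sites apart, which is the main point of naming files after their URLs.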

An Arc file stores multiple archived resources in a single file to avoid managing a large number of small files. The file consists of a sequence of URL records, each with a header containing metadata about how the resource was requested, followed by the HTTP header and the response. Arc files range from 100 to 600 MB in size.
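The record sequence described above can be sketched in Python. This is a minimal reader for an uncompressed file, assuming the version 1 layout in which each record header is a single line of five space-separated fields (URL, IP address, archive date, content type, length) and the length gives the number of bytes that follow. The names ArcRecordHeader, parse_header, and iter_records are hypothetical, and the sample record is made up for the demonstration.

```python
import io
from typing import NamedTuple


class ArcRecordHeader(NamedTuple):
    url: str
    ip_address: str
    archive_date: str
    content_type: str
    length: int


def parse_header(line):
    """Parse one URL-record header line (assumed version 1 layout)."""
    fields = line.strip().split(" ")
    if len(fields) != 5:
        raise ValueError(f"expected 5 header fields, got {len(fields)}")
    url, ip_address, archive_date, content_type, length = fields
    return ArcRecordHeader(url, ip_address, archive_date, content_type, int(length))


def iter_records(stream):
    """Yield (header, body) pairs from an uncompressed Arc byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator line between records
        header = parse_header(line.decode("utf-8", "replace"))
        yield header, stream.read(header.length)


if __name__ == "__main__":
    # Made-up sample record: HTTP header plus response, as described above.
    body = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hello</html>"
    head = f"http://example.org/hello.html 192.0.2.1 20060622190110 text/html {len(body)}\n"
    sample = head.encode("ascii") + body + b"\n"
    for header, payload in iter_records(io.BytesIO(sample)):
        print(header.url, header.length, len(payload))
```

Because each header states the length of its record, a reader can skip from record to record without inspecting the archived content itself.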

Example:

Tools for processing Arc files

Heritrix includes a command-line tool called arcreader which can be used to extract the contents of an Arc file. The following command lists all the URLs and metadata stored in the given Arc file (in CDX format):

arcreader IA-2006062.arc

The following command extracts hello.html from the above example, assuming the record starts at offset 140:

arcreader -o 140 -f dump IA-2006062.arc
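As a rough illustration of what offset-based extraction involves, the Python sketch below seeks to a byte offset in an uncompressed Arc file, reads the record header line, and returns the number of bytes given by its final length field. This is not arcreader's implementation: the function name read_record_at is hypothetical, and the sketch assumes an uncompressed file in which the offset points at the start of a header line.

```python
def read_record_at(path, offset):
    """Return (url, body) for the record starting at the given byte offset.

    Assumes an uncompressed Arc file whose header line ends with the
    record length in bytes; hypothetical helper for illustration.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        header_line = f.readline().decode("utf-8", "replace").strip()
        fields = header_line.split(" ")
        url = fields[0]
        length = int(fields[-1])  # last header field is the record length
        return url, f.read(length)


if __name__ == "__main__":
    import sys

    # Usage: python read_record.py FILE OFFSET
    url, body = read_record_at(sys.argv[1], int(sys.argv[2]))
    print(url)
    sys.stdout.buffer.write(body)
```

Reading by offset avoids scanning the whole file, which is why an index that lists each record's offset is useful alongside the archive.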

Other tools:

Arc processing tools

WERA (Web ARchive Access)

Command-line tools

Heritrix comes with several command-line tools:

htmlextractor - displays the links Heritrix would extract for a given URL

hoppath.pl - recreates the hop path (path of links) to the specified URL from a completed crawl

manifest_bundle.pl - bundles up all resources referenced by a crawl manifest file into an uncompressed or compressed tarball

cmdline-jmxclient - enables command-line control of Heritrix

arcreader - extracts contents of ARC files (see above)
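The idea behind recreating a hop path, as hoppath.pl does, can be sketched in Python: starting from the target URL, follow each URL's referring page back until a seed is reached. The sketch assumes the referrer of every crawled URL has already been collected into a dictionary. That dictionary, the function name hop_path, and the sample URLs are hypothetical, and the script's actual input handling is not shown.

```python
def hop_path(target, referrers):
    """Return the chain of URLs from a seed to the target.

    referrers maps each URL to the URL it was discovered from,
    or to None for a seed (hypothetical input structure).
    """
    path = [target]
    seen = {target}
    current = target
    while referrers.get(current) is not None:
        current = referrers[current]
        if current in seen:
            break  # guard against cycles in malformed input
        seen.add(current)
        path.append(current)
    path.reverse()  # collected target-first, so flip to seed-first order
    return path


if __name__ == "__main__":
    referrers = {
        "http://example.org/": None,
        "http://example.org/a.html": "http://example.org/",
        "http://example.org/b.html": "http://example.org/a.html",
    }
    for hop in hop_path("http://example.org/b.html", referrers):
        print(hop)
```

Walking backwards from the target is simpler than searching forward from the seeds, because each URL has only one recorded referrer.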

Further tools are available as part of the Internet Archive's warctools project.

References

Heritrix Wikipedia

(Text) CC BY-SA
