Girish Mahajan (Editor)

Web ARChive

Filename extension: .warc
Extended from: ARC
Open format?: Yes
Internet media type: application/warc
Standard: ISO 28500:2009
Website: archive-access.sourceforge.net/warc/

The Web ARChive (WARC) archive format specifies a method for combining multiple digital resources into an aggregate archive file together with related information. The WARC format is a revision of the Internet Archive's ARC File Format that has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web. The WARC format generalizes the older format to better support the harvesting, access, and exchange needs of archiving organizations. Besides the primary content currently recorded, the revision accommodates related secondary content, such as assigned metadata, abbreviated duplicate detection events, and later-date transformations.
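To make the idea of an aggregate archive concrete, the following is a minimal sketch, not a reference implementation, of how records can be concatenated into one file and read back. It assumes the WARC/1.0 record layout (a version line, named header fields, a blank line, the content block, then two CRLFs) and handles only uncompressed records. The function names `build_warc_record` and `read_warc_records`, the example URI, and the payloads are illustrative assumptions. The `metadata` record type shown is one of the secondary record types; `revisit` and `conversion` records correspond to the duplicate detection events and later-date transformations mentioned above.

```python
import io
import uuid
from datetime import datetime, timezone


def build_warc_record(record_type, target_uri, payload, content_type):
    """Serialize one record: version line, header fields, blank line, block, two CRLFs."""
    headers = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),  # length of the block in bytes
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    return head.encode("utf-8") + payload + b"\r\n\r\n"


def read_warc_records(stream):
    """Yield (headers, block) pairs from a binary stream of concatenated records."""
    while True:
        version = stream.readline()
        if not version:
            return  # end of archive
        if not version.startswith(b"WARC/"):
            raise ValueError(f"unexpected version line: {version!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b""):
                break  # blank line ends the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        block = stream.read(int(headers["Content-Length"]))
        stream.read(4)  # skip the two CRLFs that terminate the record
        yield headers, block


# Hypothetical example data: one primary resource plus a metadata record about it.
archive = io.BytesIO()
archive.write(build_warc_record(
    "resource", "http://example.com/hello.txt", b"hello, archive\n", "text/plain"))
archive.write(build_warc_record(
    "metadata", "http://example.com/hello.txt",
    b"description: sample note\r\n", "application/warc-fields"))

archive.seek(0)
records = list(read_warc_records(archive))
for headers, block in records:
    print(headers["WARC-Type"], headers["Content-Length"], block)
```

Because every record carries its own Content-Length, a reader can walk the file record by record without understanding the payloads, which is what allows primary content and secondary records to share one archive. Real-world WARC files are usually gzip-compressed per record, which this sketch does not handle.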

WARC is now recognised by most national library systems as the standard to follow for web archival.

Software

  • Crawlers
      • Heritrix web archiver in Java
      • wget (since version 1.14)
      • wpull (e.g. for ArchiveBot)
      • StormCrawler
      • Apache Nutch
  • WARC software library in Python
  • warc-explorer, a Java tool to browse WARC archives
  • ArchiveFS, a filesystem to mount WARC archives
  • WSDK, a set of simple, compact, and highly optimized Erlang modules to manipulate (create/read/write) WARC files