Canterbury corpus

Updated on Apr 25, 2026

The Canterbury corpus is a collection of files intended for use as a benchmark for testing lossless data compression algorithms. It was created in 1997 at the University of Canterbury, New Zealand, and was designed to replace the Calgary corpus. The files were selected for their ability to give representative performance results.

In its most commonly used form, the corpus consists of 11 files, each selected as an "average" document from one of 11 classes of documents, totaling 2,810,784 bytes.
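
As a rough illustration of how a corpus like this is used as a benchmark, the sketch below compresses every file in a directory and reports the result in bits per input byte. It is only a minimal example under stated assumptions: the directory path is hypothetical, the helper names `bits_per_byte` and `benchmark` are made up for this illustration, and zlib (DEFLATE) from the Python standard library stands in for whatever algorithm is actually under test.

```python
import zlib
from pathlib import Path


def bits_per_byte(data: bytes, level: int = 9) -> float:
    """Return the compressed size in bits per input byte, using zlib (DEFLATE)."""
    if not data:
        raise ValueError("cannot measure an empty input")
    compressed = zlib.compress(data, level)
    return 8 * len(compressed) / len(data)


def benchmark(corpus_dir: str) -> dict[str, float]:
    """Map each regular file in corpus_dir to its bits-per-byte figure.

    corpus_dir is a hypothetical local directory holding the corpus files.
    """
    results = {}
    for path in sorted(Path(corpus_dir).iterdir()):
        if path.is_file():
            results[path.name] = bits_per_byte(path.read_bytes())
    return results
```

Calling `benchmark("path/to/corpus")` on a local copy of the files would give one figure per file; lower values mean better compression, and a value near 8 means the file was essentially incompressible for the chosen algorithm.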

References

Canterbury corpus Wikipedia

(Text) CC BY-SA
