Rahul Sharma (Editor)

Carrot2

Developer(s): Carrot Search
Written in: Java
Development status: Active
Operating system: Cross-platform
Stable release: 3.15.0 / November 4, 2016
Type: Text mining and cluster analysis

Carrot² is an open source search results clustering engine. It can automatically cluster small collections of documents, e.g. search results or document abstracts, into thematic categories. Apart from two specialized search results clustering algorithms, Carrot² offers ready-to-use components for fetching search results from various sources. Carrot² is written in Java and distributed under the BSD license.

History

The initial version of Carrot² was implemented in 2001 by Dawid Weiss as part of his MSc thesis to validate the applicability of the STC clustering algorithm to clustering search results in Polish. In 2003, a number of other search results clustering algorithms were added, including Lingo, a novel text clustering algorithm designed specifically for clustering search results. Although the source code of Carrot² had been available since 2002, version 1.0 was not officially released until 2006. In the same year, version 2.0 was released with an improved user interface and an extended tool set. In 2009, version 3.0 brought significant improvements in clustering quality, a simplified API and a new GUI application, based on the Eclipse Rich Client Platform, for tuning clustering.

Architecture and components

The architecture of Carrot² is based on processing components arranged into pipelines. The two major groups of processing components in Carrot² are document sources and clustering algorithms.
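The pipeline idea described above can be sketched in a few lines of Java. This is only an illustration of the pattern, not the Carrot² API: the names `PipelineSketch`, `Doc`, `DocumentSource`, `ClusteringAlgorithm`, `fixedSource` and `byFirstTitleWord` are hypothetical, and the "algorithm" is a trivial stand-in that groups documents by the first word of their title.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: these types are hypothetical and are not the Carrot² API. */
public class PipelineSketch {

    /** A document as it travels through the pipeline. */
    static final class Doc {
        final String title;
        final String snippet;

        Doc(String title, String snippet) {
            this.title = title;
            this.snippet = snippet;
        }
    }

    /** First stage: a document source provides the data to be processed. */
    interface DocumentSource {
        List<Doc> fetch(String query);
    }

    /** Second stage: a clustering algorithm groups documents under labels. */
    interface ClusteringAlgorithm {
        Map<String, List<Doc>> cluster(List<Doc> docs);
    }

    /** Runs the two stages in order: fetch from the source, then cluster. */
    static Map<String, List<Doc>> run(DocumentSource source,
                                      ClusteringAlgorithm algorithm,
                                      String query) {
        return algorithm.cluster(source.fetch(query));
    }

    /** Stand-in source: returns a fixed in-memory list and ignores the query. */
    static DocumentSource fixedSource() {
        return query -> List.of(
            new Doc("Apache Lucene", "search library"),
            new Doc("Apache Solr", "search server"),
            new Doc("Carrot2", "clustering engine"));
    }

    /** Stand-in algorithm: uses the first word of each title as the cluster label. */
    static ClusteringAlgorithm byFirstTitleWord() {
        return docs -> {
            Map<String, List<Doc>> clusters = new LinkedHashMap<>();
            for (Doc d : docs) {
                String label = d.title.split("\\s+")[0];
                clusters.computeIfAbsent(label, k -> new ArrayList<>()).add(d);
            }
            return clusters;
        };
    }

    public static void main(String[] args) {
        Map<String, List<Doc>> clusters = run(fixedSource(), byFirstTitleWord(), "search");
        clusters.forEach((label, docs) -> System.out.println(label + ": " + docs.size()));
    }
}
```

Because each stage sits behind its own interface, a different source or algorithm could be swapped in without touching the other stage, which is the property a component pipeline is meant to provide.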

Document sources

Document sources provide data for further processing. Typically, they fetch search results from an external search engine or a Lucene / Solr index, or load text files from a local disk.

Currently, Carrot² has built-in support for the following document sources:

  • Bing Search API
  • Lucene index
  • OpenSearch
  • PubMed
  • Solr server
  • eTools metasearch engine
  • Generic XML files

Other document sources can be integrated based on the code examples provided with the Carrot² distribution.

Clustering algorithms

Carrot² offers two specialized document clustering algorithms that place emphasis on the quality of cluster labels:

  • Lingo: a clustering algorithm based on singular value decomposition
  • STC: Suffix Tree Clustering

Other algorithms can be easily added to Carrot².
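To give a feel for label-oriented clustering, the sketch below groups documents by the phrases they share and picks the best-scoring phrase as a cluster label. It is a deliberately simplified illustration of the idea behind phrase-based clustering such as STC, not the STC implementation in Carrot²: it enumerates word n-grams directly instead of building a suffix tree, it skips the step of merging overlapping clusters, and the score (document count times phrase length in words) is an assumption chosen for simplicity. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Simplified phrase-sharing sketch; not the STC code used by Carrot². */
public class PhraseClusterSketch {

    /** Splits text into lowercase words, dropping punctuation and empty tokens. */
    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    /** Maps every contiguous phrase of up to maxWords words to the documents containing it. */
    static Map<String, Set<Integer>> phraseIndex(List<String> docs, int maxWords) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (int d = 0; d < docs.size(); d++) {
            List<String> words = tokenize(docs.get(d));
            for (int start = 0; start < words.size(); start++) {
                StringBuilder phrase = new StringBuilder();
                for (int len = 1; len <= maxWords && start + len <= words.size(); len++) {
                    if (len > 1) {
                        phrase.append(' ');
                    }
                    phrase.append(words.get(start + len - 1));
                    index.computeIfAbsent(phrase.toString(), k -> new TreeSet<>()).add(d);
                }
            }
        }
        return index;
    }

    /** Returns the phrase shared by at least two documents with the highest score, or null. */
    static String bestLabel(Map<String, Set<Integer>> index) {
        String best = null;
        int bestScore = 0;
        for (Map.Entry<String, Set<Integer>> e : index.entrySet()) {
            int docCount = e.getValue().size();
            if (docCount < 2) {
                continue; // a phrase found in one document does not form a cluster
            }
            int wordCount = e.getKey().split(" ").length;
            int score = docCount * wordCount; // assumed scoring, for illustration only
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
            "Suffix tree clustering of search results",
            "Search results clustering with Lingo",
            "Clustering search results in Polish",
            "High performance primitive collections");
        Map<String, Set<Integer>> index = phraseIndex(docs, 3);
        String label = bestLabel(index);
        System.out.println(label + " -> " + index.get(label));
    }
}
```

For the four sample titles, the phrase "search results" occurs in the first three documents and becomes the label, while the fourth document shares no phrase with the others and stays unclustered.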

APIs

Carrot² clustering can be called through a number of APIs.

Java API

Because Carrot² is implemented in Java, it can be integrated with Java software through its native Java API.

C# / .NET API

Carrot² provides a native C# API for calling clustering from C# / .NET software without installing a Java runtime. The Carrot² C# API requires .NET Framework version 3.5 or later.

Other platforms

Other platforms can call Carrot² clustering through the REST service exposed by the Document Clustering Server. Example integration code is provided for PHP5, C#, Ruby and cURL.
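As a rough sketch of what calling such a REST service involves, the Java code below builds a form-encoded HTTP POST request using the standard `java.net.http` classes. The endpoint URL `http://localhost:8080/dcs/rest` and the parameter names `dcs.source`, `query` and `dcs.output.format` are assumptions made for illustration; the exact endpoint and parameters should be taken from the Document Clustering Server documentation for the version in use. The class name `DcsRequestSketch` is hypothetical.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Builds (but does not send) a clustering request; endpoint and parameter names are assumed. */
public class DcsRequestSketch {

    /** Encodes parameters as application/x-www-form-urlencoded, keeping insertion order. */
    static String formEncode(Map<String, String> params) {
        return params.entrySet().stream()
            .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
            .collect(Collectors.joining("&"));
    }

    /** Collects the request parameters; all three names are assumptions, not confirmed API. */
    static Map<String, String> parameters(String query) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("dcs.source", "etools");      // assumed: which document source to query
        params.put("query", query);              // assumed: the search query to cluster
        params.put("dcs.output.format", "JSON"); // assumed: requested response format
        return params;
    }

    /** Creates a POST request carrying the form-encoded parameters. */
    static HttpRequest buildRequest(String endpoint, String query) {
        return HttpRequest.newBuilder(URI.create(endpoint))
            .header("Content-Type", "application/x-www-form-urlencoded")
            .POST(HttpRequest.BodyPublishers.ofString(formEncode(parameters(query))))
            .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildRequest("http://localhost:8080/dcs/rest", "data mining");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(formEncode(parameters("data mining")));
    }
}
```

With a server running, the request could be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`; the sketch stops short of that so it can run without a network connection.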

Tools

Carrot² offers a number of supporting tools that can be used to quickly set up clustering on custom data, tune clustering results and expose Carrot² clustering as a remote service:

  • Carrot2 Document Clustering Workbench: a standalone GUI application for experimenting with Carrot² clustering on data from common search engines or custom data,
  • Carrot2 Document Clustering Server: exposes Carrot² clustering as a REST service,
  • Carrot2 Command Line Interface: applications that allow invoking Carrot² clustering from the command line,
  • Carrot2 Web Application: exposes Carrot² clustering as a web application for end users.

Carrot Search, a commercial spin-off of the Carrot² project, works on further development of Carrot², offers a real-time text clustering algorithm compliant with the Carrot² framework as well as text mining consulting services based on open source and proprietary software.

Carrot Search Labs

Carrot² gave rise to a number of independent open source projects released under the umbrella of Carrot Search Labs. Currently, the following projects are available:

  • Randomized Testing: a JUnit test runner with built-in utilities to make every test run slightly different (randomized). Also an ANT task for running JUnit tests on parallel JVMs, with load balancing and other bells and whistles.
  • High Performance Primitive Collections for Java: Lists, Sets, Maps and other collections of primitives for Java tuned for highest performance and memory efficiency.
  • jSuffixArrays: Several Java implementations of the Suffix Array data structure with different performance and memory characteristics.
  • JUnitBenchmarks: A set of extensions for turning JUnit4 tests into performance micro-benchmarks with GC monitoring, time variance measurement and simple graphical visualizations.
  • SmartSprites: fully automatic maintenance of CSS sprites; no tedious copying and pasting to the CSS when adding or changing sprited images.

References

Carrot2 Wikipedia