Rahul Sharma (Editor)

OpenRefine

Development status: Active
Written in: Java
Developer(s): Google, open source community
Initial release: November 10, 2010
Stable release: 2.5 (December 11, 2011)
Repository: github.com/OpenRefine/OpenRefine

OpenRefine, formerly called Google Refine, is a standalone open source desktop application for data cleanup and transformation to other formats, an activity known as data wrangling. It resembles a spreadsheet application (and can work with spreadsheet file formats), but it behaves more like a database.

OpenRefine operates on rows of data that have cells under columns, much like relational database tables. An OpenRefine project consists of one table. The user can filter the rows to display using facets that define filtering criteria (for example, showing only rows where a given column is not empty). Unlike in spreadsheets, most operations in OpenRefine are applied to all visible rows at once: transforming all cells in all rows under one column, creating a new column based on existing column data, and so on. All actions performed on a dataset are stored in the project and can be replayed on another dataset.
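The row, facet and replay model described above can be sketched in a few lines of Python. This is only an illustrative model, not OpenRefine's implementation or API: the names `facet_not_empty`, `transform_column` and `replay`, and the sample data, are assumptions made for the example.

```python
from typing import Callable

Row = dict[str, str]  # one row: column name -> cell value


def facet_not_empty(column: str) -> Callable[[Row], bool]:
    """Hypothetical facet: keep rows whose cell in `column` is not empty."""
    return lambda row: bool(row.get(column, "").strip())


def transform_column(rows, column, fn, visible):
    """Apply `fn` to `column` in every row selected by the `visible` facet."""
    return [
        {**row, column: fn(row[column])} if visible(row) else dict(row)
        for row in rows
    ]


def replay(rows, history):
    """Re-apply a recorded list of (column, function, facet) operations."""
    for column, fn, visible in history:
        rows = transform_column(rows, column, fn, visible)
    return rows


# The recorded history of one project, replayed on two datasets.
history = [("city", str.title, facet_not_empty("city"))]
project_a = [{"city": "new york"}, {"city": ""}, {"city": "PARIS"}]
project_b = [{"city": "berlin"}]

cleaned_a = replay(project_a, history)
cleaned_b = replay(project_b, history)
```

Because the history is plain data kept apart from the rows, the same cleanup steps can be applied to a second dataset without redefining them.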

Unlike in spreadsheets, no formulas are stored in the cells. Formulas are used to transform the data, and each transformation is applied only once. Transformation expressions can be written in Google Refine Expression Language (GREL), Jython (i.e. Python) or Clojure.
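The difference from a spreadsheet formula can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not OpenRefine code: the function `normalize` and the sample column are hypothetical, and the function body stands in for the kind of logic a transformation expression might contain.

```python
def normalize(value: str) -> str:
    """Hypothetical transformation: collapse whitespace and upper-case."""
    return " ".join(value.split()).upper()


# The cells of one column before the transformation.
column = ["  ab   c ", "x"]

# The expression runs once; only the resulting values are stored in the cells.
column = [normalize(value) for value in column]

# Editing a cell afterwards does not trigger any recalculation,
# because the cell holds a plain value rather than a formula.
column[1] = "  y "
```

After the last line the second cell simply contains the text that was typed in; the transformation would have to be run again to normalize it.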

The program has a web user interface. However, it is not hosted on the web as software as a service (SaaS); it is downloaded and run on the local machine. When OpenRefine starts, it launches a local web server and opens a browser to the web UI served by that server.

Possible uses of software

  • Cleaning messy data: for example, a text file containing semi-structured data can be edited using transformations, facets and clustering to make the data cleanly structured.
  • Transformation of data: converting values to other formats, normalizing and denormalizing.
  • Parsing data from web sites: OpenRefine has a URL fetch feature and the jsoup HTML parser and DOM engine.
  • Adding data to a dataset by fetching it from web services (for example, ones returning JSON). This can be used, for instance, for geocoding addresses to geographic coordinates.
  • Aligning to Wikidata (formerly Freebase): this involves reconciliation, that is, mapping string values in cells to entities in Wikidata.

Supported formats for import and export

Import is supported from the following formats:

  • TSV, CSV
  • Text file with custom separators or columns split by fixed width
  • XML
  • RDF triples (RDF/XML and Notation3 serialization formats)
  • JSON
  • Google Spreadsheets, Google Fusion Tables

If input data is in a non-standard text format, it can be imported as whole lines, without splitting into columns, and the columns can be extracted later with OpenRefine's tools. Archived and compressed files are supported (.zip, .tar.gz, .tgz, .tar.bz2, .gz, or .bz2), and OpenRefine can download input files from a URL. To use web pages as input, it is possible to import a list of URLs and then invoke a URL fetch function.
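The "import whole lines first, extract columns later" approach can be sketched in Python as follows. This is a minimal illustration, not how OpenRefine does it internally: the line layout, the regular expression `LINE` and the helper `split_line` are assumptions invented for the example.

```python
import re

# Made-up sample input: each line was imported as a single cell.
raw_lines = [
    "2024-01-05 INFO  import finished",
    "2024-01-06 WARN  3 rows skipped",
    "not a structured line",
]

# Hypothetical layout: date, level, free-text message.
LINE = re.compile(r"(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<level>\w+)\s+(?P<message>.*)")


def split_line(line: str) -> dict[str, str]:
    """Extract columns from one raw line; keep unmatched lines as-is."""
    match = LINE.match(line)
    if match:
        return match.groupdict()
    return {"date": "", "level": "", "message": line}


rows = [split_line(line) for line in raw_lines]
```

Lines that do not fit the pattern are kept whole in the `message` column rather than dropped, so they can still be inspected and cleaned afterwards.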

Export is supported in the following formats:

  • TSV
  • CSV
  • Microsoft Excel
  • HTML table
  • Templating exporter: a custom template can be defined for outputting data, for example as a MediaWiki table.

Whole OpenRefine projects in native format can be exported as a .tar.gz archive.
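To give a sense of what a template-based export to a MediaWiki table produces, here is a small Python sketch. It does not use OpenRefine's templating syntax; the function `to_mediawiki` and the sample rows are hypothetical, and only the MediaWiki table markup itself (`{|`, `!`, `|-`, `|`, `|}`) is standard.

```python
def to_mediawiki(columns: list[str], rows: list[dict[str, str]]) -> str:
    """Render rows as a MediaWiki table: a header row, then one row per record."""
    lines = ['{| class="wikitable"', "|-"]
    lines += [f"! {name}" for name in columns]  # header cells
    for row in rows:
        lines.append("|-")  # row separator
        lines += [f"| {row.get(name, '')}" for name in columns]  # data cells
    lines.append("|}")
    return "\n".join(lines)


# Made-up sample data for illustration.
table = to_mediawiki(
    ["name", "year"],
    [{"name": "Gridworks", "year": "2010"}],
)
print(table)
```

A template exporter generalizes this idea: a fixed prefix and suffix surround a per-row pattern into which the cell values are substituted.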

History

OpenRefine started life as Freebase Gridworks, developed by Metaweb, and has been available as open source since January 2010. On 16 July 2010, Google acquired Metaweb, the creators of Freebase, and on 10 November 2010 renamed their Freebase Gridworks software to Google Refine, releasing version 2.0. On 2 October 2012, original author David Huynh announced that Google would soon stop its active support of Google Refine. Since then, the codebase has been in transition to an open source project named OpenRefine.

Books

  • Verborgh, Ruben; De Wilde, Max. Using OpenRefine. Packt Publishing, September 2013. 114 pp. ISBN 9781783289080.