Girish Mahajan (Editor)

Apache Avro

Development status: Active
License: Apache License 2.0
Developer(s): Apache Software Foundation
Stable release: 1.8.1 / May 19, 2016
Repository: git-wip-us.apache.org/repos/asf/avro.git
Type: remote procedure call framework

Avro is a remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes and from client programs to the Hadoop services.


Avro is similar to Thrift and Protocol Buffers, but does not require running a code-generation program when a schema changes (unless desired for statically typed languages).

Apache Spark SQL can access Avro as a data source.

Avro Object Container File

An Avro Object Container File consists of:

  • A file header, followed by
  • one or more file data blocks.

A file header consists of:

  • Four bytes: ASCII 'O', 'b', 'j', followed by the byte value 1.
  • File metadata, including the schema definition.
  • The 16-byte, randomly generated sync marker for this file.

For data blocks, Avro specifies two serialization encodings: binary and JSON. Most applications will use the binary encoding, as it is smaller and faster. For debugging and web-based applications, the JSON encoding may sometimes be appropriate.
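
The header layout described above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the Avro library's API: the helper names (`encode_long`, `encode_bytes`, `build_header`) are hypothetical, and the metadata keys `avro.schema` and `avro.codec` are taken from the `users.avro` dump shown later in this article. It assumes the metadata is written as a single map block (a count, the key/value pairs, then a zero count) with lengths stored as zigzag varints.

```python
import json
import os

MAGIC = b"Obj\x01"  # ASCII 'O', 'b', 'j', followed by the byte value 1


def encode_long(n: int) -> bytes:
    """Zigzag-encode a signed 64-bit integer, then write it as a base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_bytes(data: bytes) -> bytes:
    """Length-prefixed byte string, used here for metadata keys and values."""
    return encode_long(len(data)) + data


def build_header(schema: dict, codec: str = "null"):
    """Return (header_bytes, sync_marker) for a container file (hypothetical helper)."""
    meta = {
        "avro.schema": json.dumps(schema).encode("utf-8"),
        "avro.codec": codec.encode("utf-8"),
    }
    out = bytearray(MAGIC)
    out += encode_long(len(meta))  # one map block holding every entry
    for key, value in meta.items():
        out += encode_bytes(key.encode("utf-8")) + encode_bytes(value)
    out += encode_long(0)  # a zero count terminates the map
    sync = os.urandom(16)  # 16-byte, randomly generated sync marker
    out += sync
    return bytes(out), sync


header, sync = build_header({"type": "string"})
print(header[:17])
```

The first bytes printed correspond to the `O b j 001 004 026 a v r o . s c h e m a` sequence at the start of the dump: the magic, a map count of 2, and the 11-character key `avro.schema` with its length prefix.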

Schema Definition

Avro schemas are defined using JSON. Schemas are composed of primitive types (null, boolean, int, long, float, double, bytes, and string) and complex types (record, enum, array, map, union, and fixed).

Simple schema example:
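
As a hedged reconstruction, the record schema embedded in the `users.avro` dump later in this article can be written out as follows. It is held in a Python string and parsed with the standard `json` module only to show that it is ordinary JSON; the variable names are illustrative, and whitespace may differ from the original schema file.

```python
import json

# Schema text recovered from the metadata of the users.avro dump.
USER_SCHEMA_JSON = """
{
  "type": "record",
  "namespace": "example.avro",
  "name": "User",
  "fields": [
    {"type": "string", "name": "name"},
    {"type": ["int", "null"], "name": "favorite_number"},
    {"type": ["string", "null"], "name": "favorite_color"}
  ]
}
"""

schema = json.loads(USER_SCHEMA_JSON)

# A record lists its fields in order; unions such as ["int", "null"]
# make a field optional.
field_names = [field["name"] for field in schema["fields"]]
print(schema["namespace"] + "." + schema["name"], field_names)
```

The record combines one primitive field (`name`) with two union fields, which is how Avro schemas usually express optional values.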

Serializing and Deserializing

Data in Avro can be stored together with its schema, so a serialized item can be read without knowing the schema ahead of time.

Example serialization and deserialization code in Python

Serialization:

The file "users.avro" will contain the schema in JSON and a compact binary representation of the data:

    $ od -c users.avro
    0000000 O b j 001 004 026 a v r o . s c h e m
    0000020 a 272 003 { " t y p e " : " r e c
    0000040 o r d " , " n a m e s p a c e
    0000060 " : " e x a m p l e . a v r o
    0000100 " , " n a m e " : " U s e r
    0000120 " , " f i e l d s " : [ { "
    0000140 t y p e " : " s t r i n g " ,
    0000160 " n a m e " : " n a m e " }
    0000200 , { " t y p e " : [ " i n t
    0000220 " , " n u l l " ] , " n a m
    0000240 e " : " f a v o r i t e _ n u
    0000260 m b e r " } , { " t y p e " :
    0000300 [ " s t r i n g " , " n u l
    0000320 l " ] , " n a m e " : " f a
    0000340 v o r i t e _ c o l o r " } ] }
    0000360 024 a v r o . c o d e c  n u l l
    0000400 0 211 266 / 030 334 ˪ ** P 314 341 267 234 310 5 213
    0000420 6 004 , A l y s s a 0 200 004 002 006 B
    0000440 e n 0 016 0 006 r e d 211 266 / 030 334 ˪ **
    0000460 P 314 341 267 234 310 5 213 6
    0000471
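
The data block near the end of the dump can be reproduced by hand, which shows how compact the binary encoding is. The sketch below uses only the standard library and is not the Avro Python package; the helper names are hypothetical, and the two records (Alyssa with 256, Ben with 7 and "red") are read off the dump. It assumes integers are written as zigzag varints, strings as a length followed by UTF-8 bytes, and each union as a branch index followed by the value.

```python
def encode_long(n: int) -> bytes:
    """Zigzag-encode a signed 64-bit integer, then write it as a base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(text: str) -> bytes:
    """Length in bytes as a long, followed by the UTF-8 data."""
    data = text.encode("utf-8")
    return encode_long(len(data)) + data


def encode_optional(value, encode_value) -> bytes:
    """Encode a ["T", "null"] union: index 0 plus the value, or index 1 for null."""
    if value is None:
        return encode_long(1)
    return encode_long(0) + encode_value(value)


def encode_user(user: dict) -> bytes:
    """Fields are written in schema order, with no names or tags in between."""
    return (
        encode_string(user["name"])
        + encode_optional(user.get("favorite_number"), encode_long)
        + encode_optional(user.get("favorite_color"), encode_string)
    )


users = [
    {"name": "Alyssa", "favorite_number": 256},
    {"name": "Ben", "favorite_number": 7, "favorite_color": "red"},
]
body = b"".join(encode_user(u) for u in users)

# A data block starts with the record count and the byte size of the records;
# in a real file the 16-byte sync marker follows the records.
block = encode_long(len(users)) + encode_long(len(body)) + body
print(len(body), block)
```

The octal bytes `200 004` that follow `A l y s s a` in the dump are the zigzag varint for 256, and the `004 ,` just before the name is the record count (2) followed by the block size (22 bytes).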

Deserialization:

This outputs:
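
Decoding the data block of `users.avro` by hand yields two records: Alyssa (favorite number 256, no favorite color) and Ben (favorite number 7, favorite color red). The following is a minimal standard-library sketch of that decoding, not the Avro Python package's reader; the function names are hypothetical, the record bytes are reconstructed from the dump above, and the field order is hard-coded from the embedded schema rather than parsed from the file.

```python
import io


def decode_long(stream) -> int:
    """Read a base-128 varint and undo the zigzag mapping."""
    shift = 0
    z = 0
    while True:
        byte = stream.read(1)[0]
        z |= (byte & 0x7F) << shift
        if not byte & 0x80:  # continuation bit clear: last byte
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)


def decode_string(stream) -> str:
    """Read a length prefix, then that many bytes of UTF-8."""
    length = decode_long(stream)
    return stream.read(length).decode("utf-8")


def decode_optional(stream, decode_value):
    """Decode a ["T", "null"] union: index 0 carries a value, index 1 is null."""
    branch = decode_long(stream)
    return decode_value(stream) if branch == 0 else None


def decode_user(stream) -> dict:
    """Read the fields in the order the User schema declares them."""
    return {
        "name": decode_string(stream),
        "favorite_number": decode_optional(stream, decode_long),
        "favorite_color": decode_optional(stream, decode_string),
    }


# The 22 record bytes of the data block (count, size, and sync marker removed).
DATA = b"\x0cAlyssa\x00\x80\x04\x02\x06Ben\x00\x0e\x00\x06red"

stream = io.BytesIO(DATA)
decoded = [decode_user(stream) for _ in range(2)]
for user in decoded:
    print(user)
```

Because the binary data carries no field names, the reader depends entirely on the schema, which is why the container file stores the schema in its header.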

Languages with APIs

Though theoretically any language could use Avro, the following languages have APIs written for them:

  • C
  • C++
  • C#
  • Go
  • Haskell
  • Java
  • Perl
  • PHP
  • Python
  • Ruby
  • Scala

Avro IDL

In addition to supporting JSON for type and protocol definitions, Avro includes experimental support for an alternative interface description language (IDL) syntax known as Avro IDL. Previously known as GenAvro, this format is designed to ease adoption by users familiar with more traditional IDLs and programming languages, with a syntax similar to C/C++, Protocol Buffers and others.

References

Apache Avro Wikipedia