Girish Mahajan (Editor)

DocumentDB

Azure DocumentDB is Microsoft’s multi-tenant distributed database service for managing JSON documents at Internet scale. It is schema-free and generally classified as a NoSQL database. It could arguably be labeled a NewSQL database because it uses a SQL-like query language and supports ACID-compliant transactions; however, it lacks the relational data model that is generally expected of NewSQL databases.

Dynamically tunable Throughput, Space, and Consistency

With the currently recommended "partitioned collection" type, DocumentDB is dynamically tunable along three dimensions:

  1. Throughput. Developers reserve throughput according to the application's varying load. Behind the scenes, DocumentDB scales up resources (memory, processor, partitions, replicas, etc.) to achieve the requested throughput while keeping the 99.99th percentile of latency under 10 ms for reads and under 15 ms for writes. Throughput is specified in request units (RUs) per second. The number of RUs consumed by a particular operation varies with a number of factors, but fetching a single 1 KB document by id costs roughly 1 RU. Delete, update, and insert operations consume roughly 5 RUs each, assuming 1 KB documents. Large queries and stored procedure executions can consume hundreds or thousands of RUs, depending on the complexity of the operations needed.
  2. Space. Similarly, developers can specify how much storage they will need. Both space and throughput directly affect how much the user is charged, but either can be tuned up dynamically to handle peak load and down to save costs when the system is more lightly loaded.
  3. Consistency. DocumentDB provides four consistency levels: strong, bounded-staleness, session, and eventual. The earlier a level appears in this list, the stronger the consistency but the higher the RU cost, which effectively lowers the available throughput for the same RU setting. Session consistency is the default. Even at a lower consistency level, any arbitrary set of operations can be executed in an ACID-compliant transaction by performing those operations from within a stored procedure. The consistency level can also be changed for each request using the x-ms-consistency-level request header or the equivalent option in your SDK.
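
The throughput figures above lend themselves to a back-of-the-envelope estimate. The following is a minimal sketch in Python that uses only the approximate costs quoted in this article (about 1 RU per 1 KB read by id, about 5 RUs per 1 KB write). The function name, the `headroom` margin, and the rounding up to a multiple of 100 are illustrative assumptions, not DocumentDB settings, and real RU charges depend on document size, indexing, and consistency level.

```python
import math

# Approximate per-operation costs quoted in the text for 1 KB documents.
READ_RU = 1.0   # fetch a single document by id
WRITE_RU = 5.0  # insert, update, or delete


def estimate_rus_per_second(reads_per_sec: float, writes_per_sec: float,
                            headroom: float = 0.2) -> int:
    """Return a rough RU/s figure to reserve, rounded up to a multiple of 100.

    `headroom` is a hypothetical safety margin, not a DocumentDB parameter.
    """
    raw = reads_per_sec * READ_RU + writes_per_sec * WRITE_RU
    padded = raw * (1.0 + headroom)
    return int(math.ceil(padded / 100.0) * 100)


if __name__ == "__main__":
    # 500 point reads and 100 writes per second, with the default margin.
    print(estimate_rus_per_second(500, 100))
```

An estimate like this is only a starting point; the actual charge reported by the service for each request is the figure to tune against.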

Partitioning

DocumentDB added automatic partitioning capability in 2016 with the introduction of partitioned collections. Behind the scenes, the collection will span multiple physical partitions with documents distributed by a caller-supplied partition key. DocumentDB automatically decides how many partitions to spread your data across depending upon the size and throughput needs. When DocumentDB decides to add (or remove) partitions, your data remains available while it is rebalanced across the new (or remaining) partitions.
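
The idea of distributing documents by a caller-supplied partition key can be illustrated with a small Python sketch. This is a conceptual model only: the article does not describe how DocumentDB actually hashes keys or rebalances partitions, so the hash function, the modulo scheme, and the names `partition_for` and `distribute` are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def partition_for(key: str, partition_count: int) -> int:
    """Map a partition key value to a partition index via a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


def distribute(documents, key_field, partition_count):
    """Group documents by the partition their key value hashes to."""
    partitions = defaultdict(list)
    for doc in documents:
        index = partition_for(str(doc[key_field]), partition_count)
        partitions[index].append(doc)
    return dict(partitions)


if __name__ == "__main__":
    docs = [{"id": str(i), "customerId": f"customer-{i % 7}"} for i in range(20)]
    for index, members in sorted(distribute(docs, "customerId", 4).items()):
        print(index, len(members))
```

All documents that share a key value land in the same partition, which is why the choice of partition key matters. A plain modulo scheme like this one would move most documents whenever the partition count changes, so a production system would be expected to use something more careful.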

Before partitioned collections were available, it was common to write your own code to partition your data, and some of the DocumentDB SDKs explicitly supported several different partitioning schemes. That mode is still available but is now recommended only when your needs will not exceed the capacity of one collection or when the built-in partitioning capability does not otherwise meet your needs.

Automatic Indexing

By default, every field in each document is automatically indexed, which generally provides good performance without tuning to specific query patterns. These defaults can be modified by setting an indexing policy, which can vary per field.
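
What "every field is indexed" means for schema-free JSON can be sketched conceptually in Python: each document is flattened into path and value pairs, and each pair points back to the documents that contain it. This is not DocumentDB's index structure or its indexing policy format; the path notation and the `excluded_prefixes` parameter are hypothetical stand-ins for a per-field policy.

```python
from collections import defaultdict


def flatten(value, path=""):
    """Yield (path, scalar) pairs for every leaf in a JSON-like value."""
    if isinstance(value, dict):
        for name, child in value.items():
            yield from flatten(child, f"{path}/{name}")
    elif isinstance(value, list):
        for child in value:
            yield from flatten(child, f"{path}/[]")
    else:
        yield path, value


def build_index(documents, excluded_prefixes=()):
    """Build {(path, value): {document ids}}, skipping excluded path prefixes."""
    index = defaultdict(set)
    for doc in documents:
        for path, leaf in flatten(doc):
            if any(path.startswith(prefix) for prefix in excluded_prefixes):
                continue
            index[(path, leaf)].add(doc["id"])
    return index


if __name__ == "__main__":
    docs = [
        {"id": "1", "name": "a", "tags": ["x", "y"]},
        {"id": "2", "name": "b", "address": {"city": "Oslo"}},
    ]
    index = build_index(docs, excluded_prefixes=("/address",))
    print(sorted(index[("/tags/[]", "x")]))
```

Because no schema is declared up front, documents with different shapes simply contribute different paths to the same index.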

Stored Procedures, Triggers, and User Defined Functions (UDF) written in JavaScript

A JavaScript engine is embedded in DocumentDB. This is a natural fit for JSON documents, and it also enables additional functionality:

  • Stored Procedures. Functions that bundle an arbitrarily complex set of operations and logic into an ACID-compliant transaction. They are isolated from changes made while the stored procedure is executing, and either all write operations succeed or they all fail, leaving the database in a consistent state. Stored procedures are executed in a single partition, so the caller must provide a partition key when calling into a partitioned collection. Stored procedures can be used to make up for the lack of certain functionality. For instance, the lack of aggregation capability is made up for by the implementation of an OLAP cube as a stored procedure in the open-source documentdb-lumenize project.
  • Triggers. Functions that are executed before or after specific operations (for example, on document insertion) and that can either alter the operation or cancel it.
  • User Defined Functions (UDF). Functions that can be called from the SQL query language and augment it, making up for limited SQL support.
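
The all-or-nothing behavior described for stored procedures can be modeled in a few lines of Python. This is a conceptual sketch of the semantics only, under the assumption of a toy in-memory store; it says nothing about how DocumentDB implements transactions, and real stored procedures are written in JavaScript and run inside the database. The names `run_transaction`, `debit`, and `fail` are hypothetical.

```python
import copy


def run_transaction(store: dict, operations) -> bool:
    """Apply every operation to a scratch copy; commit only if all succeed."""
    scratch = copy.deepcopy(store)
    try:
        for operation in operations:
            operation(scratch)
    except Exception:
        return False  # nothing was written to `store`
    store.clear()
    store.update(scratch)
    return True


if __name__ == "__main__":
    accounts = {"a": {"id": "a", "balance": 10}}

    def debit(state):
        state["a"]["balance"] -= 5

    def fail(state):
        raise ValueError("simulated failure")

    print(run_transaction(accounts, [debit, fail]), accounts["a"]["balance"])
    print(run_transaction(accounts, [debit]), accounts["a"]["balance"])
```

In the first call the debit is discarded because a later step fails; in the second call the debit is committed.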

Supported environments

In the following environments, all features (except Direct Mode, which is currently supported only for .NET) are explicitly supported with dedicated SDKs:

  • .NET
  • .NET Core
  • node.js (JavaScript)
  • Java
  • Python

Additionally, DocumentDB can be accessed with the following:

  • REST API. All features except Direct Mode are supported. You can call this REST API from any language or platform. In fact, the node.js, Java, and Python SDKs are essentially thin wrappers calling this REST API.
  • MongoDB driver-level protocol support. Most features are implemented with two notable exceptions: 1) the low-level (undocumented?) API that allows applications like Meteor to install themselves as a replica and receive all changes as an event stream, and 2) aggregations.

Querying DocumentDB repositories

Several mechanisms for querying are provided:

  1. SQL-like query language with adjustments to match JSON data types.
  2. LINQ language integrated queries.
  3. JavaScript language integrated queries. This is only available from the server-side SDK exposed to stored procedures, triggers, and user defined functions. It is modeled after the Underscore.js API.
  4. MongoDB query language (JSON) via the MongoDB driver-level protocol support.
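
As an illustration of the SQL-like query language, the Python sketch below builds a parameterized query as plain JSON. The `query` and `parameters` shape with `@`-prefixed names is an assumption about the request body and should be checked against the REST API reference; the alias `c`, the fields `type` and `total`, and the helper `build_query` are hypothetical. No network call is made.

```python
import json


def build_query(sql: str, **params) -> dict:
    """Return a query spec with @-prefixed named parameters."""
    return {
        "query": sql,
        "parameters": [
            {"name": f"@{name}", "value": value}
            for name, value in sorted(params.items())
        ],
    }


# Property paths such as c.total follow the JSON structure of the documents.
spec = build_query(
    "SELECT c.id, c.total FROM c WHERE c.type = @type AND c.total > @minimum",
    type="order",
    minimum=100,
)
body = json.dumps(spec)

if __name__ == "__main__":
    print(body)
```

Passing values as parameters rather than concatenating them into the query text avoids quoting mistakes and injection problems.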

Other features

Additionally, DocumentDB has support for:

  • Global distribution. Global distribution was added to DocumentDB in 2016. This feature lets you scale your DocumentDB instance across different regions around the world and define what type of consistency you expect between the regions, from strong to eventual. It is even possible to configure automatic and transparent failover for a given region.
  • Blob storage via a behind-the-scenes integration with Azure Blob Storage. If an Azure Blob Storage instance doesn’t exist, one is automatically provisioned when the first write to blob storage is issued.
  • GeoJSON support for storing and querying geographical information.
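
For readers unfamiliar with GeoJSON, the Python sketch below builds a document that carries a GeoJSON Point. The GeoJSON format orders coordinates as longitude first, then latitude, which is a common source of mistakes. The helper `make_point` and the document fields `id`, `name`, and `location` are hypothetical, and the coordinates are arbitrary sample values.

```python
def make_point(longitude: float, latitude: float) -> dict:
    """Return a GeoJSON Point; GeoJSON orders coordinates as [lon, lat]."""
    if not -180.0 <= longitude <= 180.0:
        raise ValueError("longitude out of range")
    if not -90.0 <= latitude <= 90.0:
        raise ValueError("latitude out of range")
    return {"type": "Point", "coordinates": [longitude, latitude]}


office = {
    "id": "office-1",
    "name": "Head office",
    "location": make_point(-122.12, 47.66),
}

if __name__ == "__main__":
    print(office)
```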

Acclaim

Gartner Research positioned Microsoft as the leader in the Magic Quadrant for Operational Database Management Systems in 2016 and explicitly called out the unique capabilities of DocumentDB in its write-up.

Real-world use cases

  • Social network architectures.
  • Integrations with identity providers like Auth0.

Training

  • Introduction to Azure DocumentDB (Pluralsight) provides an overview and extensive demos on querying, building client applications, programming the server, and more.

Criticism and cautions

  • Triggers must be explicitly specified for each operation on which you wish to use them, which renders them ineffective as a mechanism for maintaining business logic consistency unless you can be certain that all the correct triggers are specified for every operation.
  • .NET LINQ language integrated queries are not fully supported. More LINQ support has been added over time, but developers are often confused when LINQ code that they use on other systems fails to work as expected on DocumentDB, as evidenced by the large number of Stack Overflow questions containing both tags.
  • Lack of a fully functioning local version. However, a local emulator that runs on Microsoft Windows for developer desktop use was added in the fall of 2016.
  • Aggregation capability in SQL is limited to the COUNT, SUM, MIN, MAX, and AVG functions. There is no support for GROUP BY or other aggregation functionality found in other database systems. However, stored procedures can be used to implement in-the-database aggregation capability.
  • "Collection" means something different in DocumentDB. It is simply a bucket of documents. There is a tendency to equate collections with tables, where each collection would hold only a single type of document, which is not recommended with DocumentDB. Rather, developers are encouraged to distinguish document types with a "type" field or by adding an "isTypeA = true" field to all documents of TypeA, "isTypeB = true" to all documents of TypeB, etc. This is especially confusing to developers who are coming from MongoDB, which has a "collection" entity that is intended to be used in a very different way.
  • The lack of query plan visibility (e.g. "EXPLAIN" keyword in SQL).
  • Support only for pure JSON data types. Most notably, DocumentDB lacks a date-time data type, so such data must be stored using the available data types, for instance as an ISO-8601 string or an epoch integer. MongoDB, the database to which DocumentDB is most often compared, extended JSON in its BSON binary serialization specification to cover date-time data as well as traditional number types, regular expressions, and Undefined. However, many argue that DocumentDB's choice of pure JSON is actually an advantage, as it is a better fit for JSON-based REST APIs and the JavaScript engine built into the database.
  • Vendor lock-in. Since DocumentDB is only available as a PaaS offering from Microsoft Azure and there is currently no API-compatible alternative, once you build a system on DocumentDB, you will not be able to easily get away from paying Azure for your usage of it.
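
Two of the cautions above, the "type" field convention and the absence of a date-time type, can be illustrated together in Python. The sketch stores the same instant both as an ISO-8601 string and as an epoch integer and shows that each survives a JSON round trip. The field names `createdIso` and `createdEpoch` and the sample document are hypothetical; an application would normally pick one representation.

```python
import json
from datetime import datetime, timezone

created = datetime(2016, 11, 15, 8, 30, 0, tzinfo=timezone.utc)

document = {
    "id": "order-1",
    "type": "order",  # discriminator field for documents sharing a collection
    "createdIso": created.isoformat(),         # ISO-8601 string
    "createdEpoch": int(created.timestamp()),  # seconds since 1970-01-01 UTC
}

# Pure JSON round trip: only strings and numbers are involved.
payload = json.dumps(document)
restored = json.loads(payload)

from_iso = datetime.fromisoformat(restored["createdIso"])
from_epoch = datetime.fromtimestamp(restored["createdEpoch"], tz=timezone.utc)

if __name__ == "__main__":
    print(payload)
    print(from_iso == created, from_epoch == created)
```

ISO-8601 strings written consistently in UTC sort in chronological order as plain strings, while epoch integers are compact and easy to compare numerically.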

References

DocumentDB, Wikipedia