Cassandra - to BATCH or not to BATCH

This post is about Cassandra’s batch statements and which kind of batch statements are ok and which not. Often, when batch statements are discussed, it’s not clear if a particular statement refers to single- or multi-partition batches or to both - which is the most important question IMO (you should know why after you’ve read this post).

Instead, it is often differentiated between logged and unlogged batches ([1], [2], [3]). For one thing, logged/unlogged may cause misunderstandings (more on this below). For another thing, unlogged batches (for multiple partitions) are deprecated since Cassandra 2.1.6 ([1], [4], [5]). Therefore I think that the distinction between logged/unlogged is not that (no longer) useful.

In the following I’m describing multi- and single-partition batches: what they provide, what they cost and when it’s ok to use them.

TL;DR: Multi partition batches should only be used to achieve atomicity for a few writes on different tables. Apart from this they should be avoided because they’re too expensive. Single partition batches can be used to get atomicity + isolation, they’re not much more expensive than normal writes.

Definitions

At first, what are single partition and what are multi partition batches?

A single partition BATCH only contains statements with a single partition key. If, for example, we have a table user_tweets with PRIMARY KEY(user_id, tweet_id), then a batch with inserts with user_id=1, tweet_id=1 and user_id=1,tweet_id=2 would target the same partition (on a single node).

A multi partition batch is a BATCH that contains statements with different partition keys. If, for example, we have a table users with PRIMARY KEY(user_id), then a batch with inserts for user_id=1 and user_id=2 would target 2 partitions (which may live on different nodes). The same holds true if a batch contains writes for different tables or even keyspaces: obviously such writes also target different partitions (maybe on different nodes).

Multi Partition Batches

Let’s see what the characteristics of multi partition batches are…

What does a multi partition batch provide?

Atomicity. Nothing more, notably no isolation.

That means that from the writers point of view all or nothing gets written (atomicity). But other clients can see the first write (insert/update/delete) before other writes from the batch are applied (no isolation).

What does a multi partition batch cost?

Let me just quote Christopher Batey, because he has summarized this very well in his post “Cassandra anti-pattern: Logged batches” [3]:

Cassandra [is first] writing all the statements to a batch log. That batch log is replicated to two other nodes in case the coordinator fails. If the coordinator fails then another replica for the batch log will take over. [..] The coordinator has to do a lot more work than any other node in the cluster.

Again, in bullets what has to be done:

  • serialize the batch statements
  • write the serialized batch to the batch log system table
  • replicate of this serialized batch to 2 nodes
  • coordinate writes to nodes holding the different partitions
  • on success remove the serialized batch from the batch log (also on the 2 replicas)

Here’s an image (from Christopher Batey’s blog post) that visualizes this for a batch with 8 insert statements for different partitions (equally distributed across nodes) in a 8 node cluster:

And this does even not show arrows for responses and batch log cleanup (between the coordinator and batch log replicas).

Hopefully it’s clear that such a multi partition batch is very expensive!

What are the limitations of multi partition batches?

Cassandra by default logs warnings for batches > 5kb and fails/refuses batches > 50kb, according to the configuration of batch_size_{warn|fail}_threshold_in_kb ([7], [8]).

When should you use logged / multi partition batches?

Usually - just don’t!

When you’re writing to multiple partitions you should prefer multiple async writes, as described in “Cassandra: Batch loading without the Batch keyword” [9].

Multi partition batches can be used when atomicity / consistency across tables is needed - in such cases very small “logged” batches are fine.

But imagine a situation where you’re writing to two tables users (PK: id) and users_by_username (PK: username) in the context of a “batch” job (e.g. data import/replication, housekeeping, whatever). Then maybe tens of thousands of records might be updated in a very short time. In such a case I’d also prefer two async writes for each user (plus appropriate failure handling) because I’d consider batches too expensive in this case.

Single Partition Batches

Now let’s see what the characteristics of single partition batches are…

What does a single partition batch provide?

Atomicity + isolation.

That means that all or nothing gets written. Other clients don’t see partial updates, but only all of them after they’ve been applied.

Single partition batches (compared to multiple async statements) save network round-trips between the client and the server and therefore may improve throughput.

What does a single partition batch cost?

There’s no batch log written for single partition batches. The coordinator doesn’t have any extra work (as for multi partition writes) because everything goes into a single partition. Single partition batches are optimized: they are applied with a single RowMutation [10].

In a few words: single partition batches don’t put much more load on the server than normal writes.

Sorry that I have no image for this, but a single arrow between client and server would be just too boring ;-)

What are the limitations of single partition batches?

Today (01/2016) Cassandra’s batch_size_{warn|fail}_threshold_in_kb config is also applied to single partition batches. Therefore also for a batch targeting a single partition you’ll get warnings. This should be resolved soon [11] so that single partition batches then are free to use.

As it’s the case for other writes single partition batches should not be too large, because otherwise they would cause heap pressure and in consequence lead to long gc pauses [12].

Because single partition batches are often used in combination with wide rows you should make sure that a row (partition) doesn’t become too large. The recommendation of Datastax is that rows should not become much bigger than 100mb (you can check this with nodetool). While this is not directly a limitation of single partition batches you should be aware of this fact to keep your C* cluster happy - partition your partitions.

When should you use single partition batches?

Single partition batches should be used when atomicity and isolation is required.

Even if you only need atomicity (and no isolation) you should model your data so that you can use single partition instead of multi partition batches.

Single partition batches may also be used to increase the throughput compared to multiple un-batched statements. Of course you must benchmark your workload with your own setup/infrastructure to verify this assumption. If you don’t want to do this you shouldn’t use single partition batches if you don’t need atomicity/isolation.

Logged vs. Unlogged

In this paragraph I want to explain why I find the differentiation in the first place between logged and unlogged not that useful.

I think a major issue is that “logged” or “unlogged” may refer to different things:

  • server side: “logged” means that C* writes a batch log (plus all the ceremony around this), “unlogged” means that no batch log is written.
  • client side: “logged” refers to BEGIN BATCH (respectively the driver’s logged batch), “unlogged” refers to BEGIN UNLOGGED BATCH. E.g. the java driver allows to create a new BatchStatement(BatchStatement.Type.UNLOGGED), the default type being Type.LOGGED.

Even when the client is asking for a logged batch (with BEGIN BATCH or the driver’s equivalent) this does not necessarily mean that the server (C*) will really write a batch log (and do all these costly things). Let’s have a look at the possible combinations of the client’s instruction for LOGGED/UNLOGGED and the partition property (single/multi). For each combination we see if C* is writing a batch log, if the batch is applied atomically and if it’s applied in isolatation.

↓ client: (un)logged // single/multi partition → single partition multi partition
logged (BEGIN BATCH) no batch log, atomic, isolated batch log, atomic
unlogged (BEGIN UNLOGGED BATCH) no batch log, atomic, isolated no batch log

Remember that unlogged batches for multiple partitions are deprecated since Cassandra 2.1.6 ([1], [4], [5])? With the table above it should be clear that one can just forget this “unlogged” thing. Hopefully at some point it will be completely removed from CQL and drivers.

Rant: Improve the docs!

An issue today (01/2016) is that blog posts and documentation often only talk about multi partition batches (without making this explicit).

E.g. the CQL commands reference for BATCH [14] explains all the bad things around BATCHes and talks about logged and unlogged. Unfortunately it does not make clear how important it is if the batch targets a single or multiple partitions. Today it only says ”However, transactional row updates within a partition key are isolated: clients cannot read a partial update.”. Nothing more about single partition batches.

The same holds true for the “Using and misusing batches” section [1] in the CQL docs. There you can read

Batches place a burden on the coordinator for both logged and unlogged batches.

True for multi partition batches, wrong for single partition batches.

Another example is the javadoc for the java driver’s BatchStatement.Type [15]. There LOGGED is documented with

A logged batch: Cassandra will first write the batch to its distributed batch log to ensure the atomicity of the batch (atomicity meaning that if any statement in the batch succeeds, all will eventually succeed).

And for UNLOGGED you read

A batch that doesn’t use Cassandra’s distributed batch log. Such batch are not guaranteed to be atomic.

This should also mention the relevance of the partition keys involved in the batch statement.

C* docs team, please mention that for single partition batches there’s no batch log and no ceremony involved and they’re atomic and isolated!

Summary

  • When you hear/read s.th. about C* batch statements, ask (yourself) if it’s about single- or multi-partition batches.
  • The common mantra that batch statements should be avoided most often refers to multi-partition batches.
  • Multi partition batches should only be used to achieve atomicity for a few writes on different tables. Apart from this they should be avoided because they’re too expensive.
  • Single partition batches can be used to achieve atomicity and isolation. They’re not much more expensive than normal writes.

References

  1. CQL for Cassandra 2.2+: Using and misusing batches (latest version)
  2. Christopher Batey’s Blog: Cassandra anti-pattern: Misuse of unlogged batches
  3. Christopher Batey’s Blog: Cassandra anti-pattern: Logged batches
  4. CASSANDRA-9282: Warn on unlogged batches
  5. DataStax Help Center: New messages about the use of batches in Cassandra 2.1
  6. CQL Glossary: partition key
  7. CASSANDRA-6487: Log WARN on large batch sizes
  8. CASSANDRA-8011: Fail on large batch sizes
  9. Cassandra: Batch loading without the Batch keyword
  10. CASSANDRA-6737: A batch statements on a single partition should not create a new CF object for each update
  11. CASSANDRA-10876: Alter behavior of batch WARN and fail on single partition batches
  12. DataStax Help Center: Common Causes of GC pauses
  13. Cassandra 2.1 tools: nodetool cfhistograms (for C* 3.+ tablehistograms)
  14. CQL for Cassandra 2.2+: CQL commands / BATCH
  15. Javadoc for BatchStatement.Type

Comments