Map(Reduce) Analytics on Google AppEngine

AppEngine Analytics Google AppEngine is a great tool for building simple web applications which are automatically scalable. All of the basic building blocks are readily available and accessible from both python and java. This includes a database, a caching layer, and support for background tasks.

What about the big data analytics and informatics that made Google famous? Does AppEngine help us there as well? The answer is yes; although with some serious limitations.

AppEngine MapReduce

Recently Mike Aizatsky and others at Google started working on a great add-on to AppEngine called MapReduce. This is, of course, conceptually the same MapReduce originally published by Jeffrey Dean and Sanjay Ghemawat here and the same MapReduce upon which Hadoop is based.

Unfortunately, for various reasons, Google decided not to make their internal MapReduce infrastructure available. Instead, they have decided to develop an entirely new system within the sandboxed world of AppEngine. This allows them to provide some additional batch processing support, while protecting their infrastructure.

For AppEngine developers, this means that a limited set of analytics functionality is currently available out-of-the-box. The new functionality amounts to support for “map-only” jobs at medium scale (64 shards). Reduce functionality, which is critical for most analytics, is not yet available. If your application requires reduce-like aggregates, your options include: waiting for Google to release more features, switching to Amazon EMR or EC2, or working around Google’s limitations.

Assuming you chose option number three, there are several ways to implement aggregates on top of AppEngine’s map jobs. As an example, lets look at a simple word count.

Built-in Counters

To count how often a few words (up to about a thousand or so) appear in a large set of documents, we can use the platform’s built in counters. The code looks something like this or this:

from mapreduce import operation as op

def process(entity):
    word1 = entity.content[0]
    word2 = entity.content[1]
    yield op.counters.Increment(word1)
    yield op.counters.Increment(word2)

Unfortunately, for large aggregates this won’t work. Assume now that you want to count all of the words in the documents. There will be too many counters to use the built-in facility. Since AppEngine does not support reduce operations, the usual approach of allowing a reducer to sum up counts from each mapper is also unavailable. What remains is to implement an efficient, scalable counter.

Sharded Counters

Because of the parallel nature of many mappers counting at the same time, one resonable approach is to implement sharded counters. The goal is to avoid database contention by splitting a single counter into several shards. These shards can then be incremented independently and summed up to get a total counter value. This technique can be quite effective, but still requires a database read and write for every counter increment. In practice I found that this severely limited the throughput of my MapReduce jobs on AppEngine.

In-Memory Counters

To avoid blocking on database reads and writes, a logical approach is to consider using the low latency memcache layer AppEngine provides to implement completely in-memory counters. This works great for short jobs with a small to medium number of counters. Unfortunately, when running a long job with many counters, it is very likely that some (sometimes many) counters will be evicted from the cache before the job completes. Since AppEngine only allows cache eviction policy hints and shares one big cache among all apps, there is no way to prevent this.

Write-Behind Counters

For long running jobs with many counters, we need a combination of low latency (in memory) counters and durability to survive cache evictions. One elegant solution is to use AppEngine’s task queues to write counter values to the datastore periodically. This allows counter increments to continue at high throughput without blocking on datastore interactions. Instead, updates to the datastore are batched and written back less frequently. This works well because the memcache is unlikely to evict recently incremented counters before the background task has a chance to store the counter’s value.

Note: With this approach there is a small chance that a counter value will be lost. However, for many applications this is acceptable and well worth the increased mapper throughput.