12 Oct 2014 Comment

I’ve recently started working with CouchDB and am finding it to be a really elegant way of storing and retrieving data. In my adventures around the Internet in search of related resources, I have found that - in spite of the maturity of CouchDB and similar document databases - there still seems to be a lot of confusion about reduce functions and, in particular, their mysterious rereduce parameter.

The documentation around reduce functions can be a bit obscure so, in this post, I’m going to talk through the process of building a CouchDB reduce function from the ground up. I’ll build a simple view to report on the numbers of books in a multilingual “digital library” application and then discuss the mysterious rereduce parameter, what it is and its importance.

The brief

Let’s say we’re building an online library. Users can log in, browse for books in multiple different languages, check them out and have them delivered. In order to encourage people to sign up, we want to display some impressive statistics on the homepage; specifically, we want to tell the user how many books in our library are available in their language.

The implementation

At their simplest, each of our book documents might contain the title of the book and an array containing the languages in which it’s available:

{
    "_id": "d983bb8b8fb38b1c7f9f9bdde3000e59",
    "_rev": "2-dab366d27386bf1a7f4af41311322f3b",
    "title": "The Hobbit",
    "languages": [
        "en",
        "fr",
        "de"
    ]
}

Using this document structure, let’s begin with a simple map function. If you’re not familiar with the basics of map functions, the official guide is a good place to start.

Our view function will look like this:

function(doc) {
    if (doc.languages) {
        for(var language in doc.languages) {
          emit(doc.languages[language], 1);
        }
    }
}

For each book, this map function will produce a key/value pair for every language in which it’s available. The key for each key/value pair is the language code, and the value is always 1. To illustrate, the document for The Hobbit shown above will produce the following rows:

[
    {"id":"d983bb8b8fb38b1c7f9f9bdde3000e59","key":"de","value":1},
    {"id":"d983bb8b8fb38b1c7f9f9bdde3000e59","key":"en","value":1},
    {"id":"d983bb8b8fb38b1c7f9f9bdde3000e59","key":"fr","value":1}
]

This may not seem very useful until we consider what it gives us. We now have a view, which, when given a language as the key, will give us back a row for every book, with each row having a value of 1. So, in order to find out how many books are available in, for example, German, we simply need to sum all of the values associated with the key "de". We can write a reduce function to do this as follows:

function(keys, values) {
    return sum(values);
}

This function has two parameters; keys and values. keys is an array in which each of the elements contains a key from the map function and the id of the document that produced it. In our case, each of the elements of keys will look something like this:

["de", "d983bb8b8fb38b1c7f9f9bdde3000e59"]

values is an array of the values produced by the map function. At the moment, each of our values is 1.

Reduce functions are run (at index-time) for groups of keys, so when we pass key="de", the result we receive will be the result of the reduce function run for every row of the view with a key of "de". If we have 27 books that are available in German, our result will look like this:

{
    "rows":[{ "key": "de", "value": 27 }]
}

This gives us everything we need to satisfy our original requirement. This particular reduce function is so often useful that CouchDB actually has it built-in; you can replace the entire function with _sum and it will behave in exactly the same way.

You may have noticed the absence of a rereduce parameter in our reduce function so far. The reason for this will hopefully become clear shortly.

The re-reduce

Suppose we now want to extend our view so it will tell us not only how many books are available in a particular language, but the total word count of all of those books (because big numbers are impressive). The first thing we’ll need to do is adjust the shape of our documents:

{
    "_id": "d983bb8b8fb38b1c7f9f9bdde3000e59",
    "_rev": "2-dab366d27386bf1a7f4af41311322f3b",
    "title": "The Hobbit",
    "languages": {
        "en": { "words": 95022 },
        "fr": { "words": 98254 },
        "de": { "words": 93254 }
    }
}

Now we can rework our map function so that each document will emit its word count for each language:

function(doc) {
    if (doc.languages) {
        for(var lang in doc.languages) {
            var words = doc.languages[lang].words;
            emit(lang, words);
        }
    }
}

This gives us a word count for each row in the view, instead of just 1. The Hobbit will now produce the following results instead:

[
    {"id":"d983bb8b8fb38b1c7f9f9bdde3000e59","key":"de","value":93254},
    {"id":"d983bb8b8fb38b1c7f9f9bdde3000e59","key":"en","value":95022},
    {"id":"d983bb8b8fb38b1c7f9f9bdde3000e59","key":"fr","value":98254}
]

We still want to know how many books there are for each language, but we can do this by counting the number of rows for a particular key. Let’s update our reduce function so it gives us both the number of rows (the number of books) and the sum of the values (the total word count):

function(keys, values) {
    var stats = { books: 0, words: 0 };

    // The number of books is the length of values
    stats.books = values.length;

    // The total word count is the sum of values
    stats.words = sum(values);

    return stats;
}

This will give us a result like the following:

{
    "rows":[{
        "key": "de",
        "value":{ "books":27, "words":2176004 }
    }]
}

As long as we only have a few documents, this will work perfectly. However, as our database grows, we will start to notice some strange results:

{
    "rows":[{
        "key":"de",
        "value": {
            books: 3,
            words: "0[object Object][object Object][object Object]"
        }
    }]
}

The reason for this is that - when a view contains a large number of rows - CouchDB uses a divide and conquer strategy to calculate reduce results more efficiently. It does this by breaking up the key/value pairs into smaller sets and running the reduce function on each of these smaller sets separately. Once this is done, it bundles all of the results into an array and runs this array through the reduce function again. This process can happen several times before the final result is produced. When the reduce function is run for the results of a previous reduce, the rereduce parameter is set to true so you can handle it properly.

The output of our reduce function is the stats object, which means when CouchDB runs a rereduce for our view, the values array looks like this:

[{ "books": 28, "words": 2326084 }, { "books": 23, "words": 1927543 }, ... ]

We wrote our reduce function to accept an array of integers, so this clearly won’t work. What we need to do is amend our reduce function so, in a rereduce scenario, it will take an array of our stats objects, sum the values of books and words and give us back a single stats object with the totals:

function(keys, values, rereduce) {
    var stats = { books: 0, words: 0 };

    if (rereduce) {
        for(var i=0; i < values.length; i++) {
            stats.books += values[i].books;
            stats.words += values[i].words;
        }

        return stats;
    }

    stats.books = values.length;
    stats.words = sum(values);

    return stats;
}

We didn’t need to worry about the rereduce parameter for our earlier reduce function because each reduce took an array of integers and produced a single integer, meaning that the rereduce array was also an array of integers.

There’s plenty more to explore in the world of reduce functions but this post will hopefully have outlined the basics well enough to get you started. Reduce functions can be a little unintuitive to begin with, but once you’re comfortable with the rereduce concept, they become a very powerful tool for efficient aggregation of your view results.