What is Re-reduce in MongoDB Map-Reduce?

In my previous post on Map-Reduce, we had a look at MongoDB Map-Reduce functionality using a simple sample. In this post, I’m going to explain what is Re-reduce and why it is important to know about Re-reduce when you write your reduce function. I recommend you to go through the previous post before reading this one because I’m going to use the same sample to explain some concepts here.

In Map-Reduce, the map function produces a set of key-value pairs with redundant keys. For example, if we consider the word count sample in the previous post, if we have 25 occurrences of the word “from”, there will be 25 key-value pairs like ({word:from}, {count:1}) emitted from the map phase. After all map tasks are completed, the Map-Reduce framework has to shuffle and sort the key-value pairs produced by all map tasks. This will group all values under a particular key and produce an array of values. According to normal Map-Reduce standard, the reducer will receive a particular key with an array of values which contains all values emitted by the map phase. In other words, reducer will be called only once for a particular key.

// array containing 25 values
reduce("from", [{count:1}, {count:1}, {count:1}, ...])

If we can guarantee that,  we can simply write the reduce function given in the previous post as follows.

function reduce(key, counts) {
    // we assume all values under this key are contained in counts
    return { count:counts.length };
}

However, in MongoDB Map-Reduce we can’t guarantee the above condition. When a particular key contains a large number of values, it will call the reduce function for the same key several times by splitting the set of values into parts. This is called Re-reduce.

Let’s consider the same example of word “from” again in a Re-reduce. Let’s assume MonogoDB executes the reduce function 3 times by selecting a subset of values out of the set of values available before each reduce step. Assume, first the reduce function will be called for 10 {count:1} values and that will return {count:10}.

// array containing 10 values
reduce("from", [{count:1}, {count:1}, {count:1}, ...])

Now for the key “from”, we have 15 {count:1} values and 1 {count:10} value. Then the second reduction will be called on a subset of these 16 values. Assume it’s called for 8 {count:1} values. That will return {count:8}.

// array containing 8 values
reduce("from", [{count:1}, {count:1}, {count:1}, ...])

Finally the third reduction will get an array like [{count:10}, {count:8}, {count:1}, {count:1}, …] which contains the results of previous 2 deductions and the remaining 7 {count:1} values.

// array containing 9 values
reduce("from", [{count:10}, {count:8}, {count:1}, {count:1}, {count:1}, ...])

So the output of the third reduction will be {count:25}. Note that in the above example output of the first reduction has gone into the third reduction. But that is not a must. It might have gone into the second reduction step as well.

Now you can understand the reason why we can’t implement the reduce function as given above (using “counts.length”) if we are using MongoDB Map-Reduce. We always have to keep in mind about the Re-reduce when implementing the reduce function to avoid errors.

Advertisements

About isurues
Age : 24 Date of Birth : 05.11.1984 Country : Sri Lanka

One Response to What is Re-reduce in MongoDB Map-Reduce?

  1. Emily says:

    Yeah, the thing about reduce function is clear. Thanks for the samples, it’s very useful.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: