In my previous post we discussed how to write a very simple Java client to read/write data from/to a MongoDB database. In this post, we are going to see how to use inbuilt Map-Reduce functionality in MonogoDB to perform a word counting task. If you are new to Map-Reduce, please go through the Google Map-Reduce paper to see how it works. In order to understand this post properly please go through the previous post as well because we are going to use the same collection created in that post to apply Map-Reduce.
In our previous example, we created a “book” collection inside our “sample” database in MonogoDB. Then we inserted 3 pages into the “book” collection as 3 separate documents. Now we are going to apply Map-Reduce on the “book” collection to get a count of each individual word contained in all three pages. In MongoDB Map-Reduce, we can write our map and reduce functions using javascript. Following is the map function we are going to use for our purpose.
function map() {
// get content of the current document
var cnt = this.content;
// split the content into an array of words using a regular expression
var words = cnt.match(/\w+/g);
// if there are no words, return
if (words == null) {
return;
}
// for each word, output {word, count} pair
for (var i = 0; i < words.length; i++) {
emit({ word:words[i] }, { count:1 });
}
}
MongoDB will apply this map function on top of each and every document in the given collection. Format of the documents contained in our “book” collection is as follows.
{ "_id" : ObjectId("519f6c1f44ae9aea2881672a"), "pageId" : "page1", "content" : "your page1 content" }
In the above map function, “this” keyword always refers to the document on which the function is applied. Therefore, “this.content” will return the content of the page. Then we split the content into words and emit a count of 1 for each word found in the page. For example, if the word “from” appeared 10 times in the current page, there will be 10 ({word:from}, {count:1}) key-value pairs emitted. Likewise the map function will be applied into all 3 documents in our “book” collection before starting the reduce phase.
Following is the reduce function we are going to use.
function reduce(key, counts) {
var cnt = 0;
// loop through call count values
for (var i = 0; i < counts.length; i++) {
// add current count to total
cnt = cnt + counts[i].count;
}
// return total count
return { count:cnt };
}
In Map-Reduce, the reduce function will get all values for a particular key as an array. For example, if we found the word “from” 25 times in all 3 pages, reduce function will be called with the key “from” and value “[{count:1}, {count:1}, {count:1}, …]”. This array will contain 25 elements. So in the reduce function, we have to get the total of counts to calculate the total number of occurrences of a particular word.
Now we have our map and reduce functions. Let’s see how to write a simple Java code to execute Map-Reduce on our “book” collection.
package sample.mongo;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.MapReduceCommand;
import com.mongodb.MongoClient;
import java.io.IOException;
import java.io.InputStream;
public class WordCount {
public static void main(String[] args) {
try {
// create a MongoClient by connecting to the MongoDB instance in localhost
MongoClient mongoClient = new MongoClient("localhost", 27017);
// access the db named "sample"
DB db = mongoClient.getDB("sample");
// access the input collection
DBCollection collection = db.getCollection("book");
// read Map file
String map = readFile("wc_map.js");
// read Reduce file
String reduce = readFile("wc_reduce.js");
// execute MapReduce on the input collection and direct the result to "wordcounts" collection
collection.mapReduce(map, reduce, "wordcounts", MapReduceCommand.OutputType.REPLACE, null);
} catch (Exception e) {
e.printStackTrace();
}
}
/**
* Reads the specified file from classpath
*/
private static String readFile(String fileName) throws IOException {
// get the input stream
InputStream fileStream = WordCount.class.getResourceAsStream("/" + fileName);
// create a buffer with some default size
byte[] buffer = new byte[8192];
// read the stream into the buffer
int size = fileStream.read(buffer);
// create a string for the needed size and return
return new String(buffer, 0, size);
}
}
In order the execute the above code, make sure you have “wc_map.js” and “wc_reduce.js” files on your project classpath containing above map and reduce functions. In the above Java code, first we connect to our MongoDB database and get a reference to the “book” collection. Then we read our map and reduce functions as a String from the classpath. Finally we execute the “mapReduce()” method on our input collection. This will apply our map and reduce functions on the “book” collection and store the output in a new collection called “wordcounts”. If there’s an already existing “wordcounts” collection, it will be replaced by the new one. If you need more details on “mapReduce()” method, please have a look at the documentation and java doc.
Finally let’s log into our MongoDB console and output collection “wordcounts”.
isuru@isuru-w520:~$ mongo
MongoDB shell version: 2.0.4
connecting to: test
> use sample
switched to db sample
>
>
> db.wordcounts.find()
{ "_id" : { "word" : "1930s" }, "value" : { "count" : 1 } }
{ "_id" : { "word" : "A" }, "value" : { "count" : 5 } }
{ "_id" : { "word" : "After" }, "value" : { "count" : 3 } }
...
>
That’s it. Here we had a look at a very basic Map-Reduce sample using MongoDB. Map-Reduce can be used to perform more complex tasks efficiently. If you are interested, you can have a look at some more MongoDB Map-Reduce samples here.