My thanks go out to John Heintz for helping me to create this.
Here's a video of the whole thing:
Unit Testing HadoopMapReduce
Before we go into the details of everything involved in HadoopMapReduce, let’s go through an overview of how we’ve been testing our MapReduce jobs. In this example, we use the common demonstration of a word count MapReduce job. The idea is that sentences come in, the map job splits them into words, and the reduce job counts the words up and summarizes them.
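To make that flow concrete before any Hadoop classes show up, here is a minimal plain-Java sketch of the same word count idea. It is only an illustration: the class name WordCountSketch and its methods are hypothetical, it does not use Hadoop or HadoopApprovals, and the "shuffle" step is a simplified stand-in for the grouping Hadoop does between map and reduce.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the word count flow; no Hadoop classes are involved.
class WordCountSketch {

    // Map step: split a sentence into words and emit a (word, 1) pair for each.
    static List<Map.Entry<String, Integer>> map(String sentence) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : sentence.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle step: group the emitted values by key, sorted by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: sum the values collected for one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs map, shuffle, and reduce in order and returns the count per word.
    static Map<String, Integer> run(String sentence) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(map(sentence)).entrySet()) {
            counts.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run("cat cat dog")); // prints {cat=2, dog=1}
    }
}
```

The three steps are kept as separate methods on purpose, since that mirrors the three things you end up wanting to test: the mapper alone, the reducer alone, and the two together.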
Visualizing the System
Unit testing provides many different advantages to a programmer. Each of them matters to a different degree to different people, but one of these advantages is specification. In particular, with MapReduce, being able to understand the flow and transformation of the data can be more enlightening than the other aspects of unit testing.
One of the things that we strive to do with ApprovalTests is create output that is meaningful and insightful. To demonstrate this, I’m going to start with the output of my word count unit test.
| [cat cat dog] |
-> maps via WordCountMapper to ->
-> reduces via WordCountReducer to ->
Here is the entire line to create that output:
HadoopApprovals.verifyMapReduce(new WordCountMapper(), new WordCountReducer(), 0, "cat cat dog");
Easy, right? Let’s look into the parts.
MR Unit
HadoopApprovals sits on top of MR Unit. You will need to grab it and include the jars. The Javadocs for HadoopApprovals mention all of the jars needed to run it. They are:
HadoopApprovals
HadoopApprovals has three main functions: the ability to test the mapper, the reducer, and the map reduce combination. To make this easier, we are going to use extra generic type information through classes called SmartMappers and SmartReducers. We will talk about those in the SmartMappers section below.
Testing a Mapper
Here’s the method to test the mapper:
verifyMapping(SmartMapper mapper, Object key, Object input)
To test a mapper, you need the mapper you’re going to test and the key-value pair going in. Many times, mappers do not actually use the key that comes in, but you will need to supply one anyway. Do not make it null.
Testing a Reducer
Here’s the method to test the reducer:
verifyReducer(SmartReducer reducer, Object key, Object... values)
When testing a reducer, you will need to pass in the key plus a set of values for the key. You will notice you can pass in regular strings and regular integers instead of Text and LongWritable. Again, this is because of the SmartMappers and SmartReducers we will talk about in the SmartMappers section below.
Testing a full MapReduce job
Here’s the method to test the MapReduce:
verifyMapReduce(SmartMapper mapper, SmartReducer reducer, Object key, Object input)
As you might expect, this is almost identical to testing a mapper, except for the addition of a reducer.
SmartMappers
I wrote a blog post about getting generic type information at runtime from Java. You can read it here:
The long and short of it is, if you add an extra interface to your mapping and reducing jobs, you can avoid a lot of boilerplate code needed to state what the runtime types of the input and output are. You can do this by simply having your MapReduce jobs extend SmartMapper instead of Mapper.
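For readers who have not seen the trick of reading generic type information at runtime, here is a rough sketch of the general technique using standard Java reflection. This is not the SmartMapper source: TypedMapperSketch, WordCountTypes, and typeArgument are hypothetical names made up for illustration, and the type parameters are plain Java types rather than Hadoop writables.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;

class GenericTypeDemo {

    // Hypothetical stand-in for a mapper base class; not the real SmartMapper.
    static abstract class TypedMapperSketch<KeyIn, ValueIn, KeyOut, ValueOut> {

        // Returns the class bound to the type parameter at the given position (0 to 3),
        // or null when it is not bound to a concrete class.
        Class<?> typeArgument(int index) {
            // Walk up the hierarchy until we find the class that extends the base directly.
            for (Class<?> c = getClass(); c != null && c != Object.class; c = c.getSuperclass()) {
                Type parent = c.getGenericSuperclass();
                if (parent instanceof ParameterizedType
                        && ((ParameterizedType) parent).getRawType() == TypedMapperSketch.class) {
                    Type arg = ((ParameterizedType) parent).getActualTypeArguments()[index];
                    return arg instanceof Class ? (Class<?>) arg : null;
                }
            }
            return null;
        }
    }

    // A concrete subclass fixes the type parameters, which keeps them visible to reflection.
    static class WordCountTypes extends TypedMapperSketch<Long, String, String, Integer> {
    }

    public static void main(String[] args) {
        WordCountTypes mapper = new WordCountTypes();
        System.out.println(mapper.typeArgument(1).getSimpleName()); // prints String
        System.out.println(mapper.typeArgument(3).getSimpleName()); // prints Integer
    }
}
```

The key point is that type arguments survive erasure when they are written into a class declaration, so a test helper can ask the subclass what its input and output types are instead of making you repeat them.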
“What? I don’t want to change my stuff!”
Fair point. If you want to test against an existing mapper or reducer, and you don’t want to change it to use the SmartMapper extensions, you can always wrap it using the MapperWrapper (bonus: fun to say out loud) or ReducerWrapper. Here’s a sample of how to do that:
SmartMapper mapper = new MapperWrapper(new WordCountMap(),
Once you do this, you will no longer need to wrap all your inputs in the writable classes. HadoopApprovals will wrap primitives into the appropriate writable types for you.
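To give a feel for what that kind of wrapping could look like, here is a small sketch under some loud assumptions: TextBox and LongBox are hypothetical stand-ins for Hadoop's Text and LongWritable so the example runs without any Hadoop jars, and the wrap method is my own illustration, not how HadoopApprovals is actually implemented.

```java
class WrapSketch {

    // Stand-in for Hadoop's Text (assumption: used only so the sketch needs no Hadoop jars).
    static final class TextBox {
        final String value;
        TextBox(String value) { this.value = value; }
        @Override public String toString() { return value; }
    }

    // Stand-in for Hadoop's LongWritable.
    static final class LongBox {
        final long value;
        LongBox(long value) { this.value = value; }
        @Override public String toString() { return Long.toString(value); }
    }

    // Wraps a plain Java value into the requested box type.
    // Values that already have the target type pass through unchanged.
    static Object wrap(Object value, Class<?> target) {
        if (value == null) {
            throw new IllegalArgumentException("value must not be null");
        }
        if (target.isInstance(value)) {
            return value;
        }
        if (target == TextBox.class) {
            return new TextBox(String.valueOf(value));
        }
        if (target == LongBox.class && value instanceof Number) {
            return new LongBox(((Number) value).longValue());
        }
        throw new IllegalArgumentException(
            "Cannot wrap " + value.getClass().getSimpleName() + " as " + target.getSimpleName());
    }

    public static void main(String[] args) {
        Object key = wrap(0, LongBox.class);
        Object input = wrap("cat cat dog", TextBox.class);
        System.out.println(key.getClass().getSimpleName() + " " + key);     // prints LongBox 0
        System.out.println(input.getClass().getSimpleName() + " " + input); // prints TextBox cat cat dog
    }
}
```

Once the expected input types are known at runtime, a conversion like this is all it takes for a test to accept 0 and "cat cat dog" directly.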