As we know, applications based on distributed computing models like MapReduce are tricky to debug. There are a lot of cases in which our code can throw an exception or behave in an unexpected way, whether because of a small bug in our code or because of the input data received. Whatever the reason, it's better to detect and fix these bugs in the early stages of development.
For testing MapReduce jobs there is a well-known framework called MRUnit; in fact, a few years ago it graduated from the Apache Incubator to become a TLP (Top Level Project). It helps us test and debug our code in isolation and in an easy way: you can write unit tests for your mappers and reducers, and integration tests as well.
There are plenty of examples out there showing how to use this library, so I prefer to focus on a mix of concepts that lets us test our code using MRUnit, Mockito and the Java Reflection API. The idea is as follows:
- Unit testing of mappers and reducers via MRUnit.
- Mocking classes with Mockito to emulate behaviour of some variables in the mapper and/or the reducer.
- Java Reflection API to modify some private variables defined in these classes.
These are the Maven dependencies we need (note the hadoop2 classifier on MRUnit, which selects the artifact built against the Hadoop 2 API):
<dependencies>
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.mrunit</groupId>
    <artifactId>mrunit</artifactId>
    <version>1.1.0</version>
    <classifier>hadoop2</classifier>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>org.mockito</groupId>
    <artifactId>mockito-all</artifactId>
    <version>1.10.19</version>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-auth</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>redis.clients</groupId>
    <artifactId>jedis</artifactId>
    <version>2.6.2</version>
  </dependency>
</dependencies>
You can see an unexpected dependency: Jedis, a Java client to interact with Redis (a key-value cache and store with good performance). Why is this? The reason is pretty simple: Redis will work as a cache engine that provides the mapper with the values it needs for its operations (in this case we'll use it as a dictionary, that is, for joining two datasets).
This is the driver class with its inner classes (mapper and reducer):
import java.io.IOException;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import redis.clients.jedis.Jedis;
public class Driver extends Configured implements Tool {
public static class ExampleMapper extends
Mapper<LongWritable, Text, Text, NullWritable> {
Jedis jedisClient = new Jedis("localhost");
@Override
public void setup(Context context) {
jedisClient.select(0);
}
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String[] columns = value.toString().split(";");
if (columns.length < 14) {
return;
}
String mappedValue = jedisClient.get(columns[10]);
if (mappedValue != null && !mappedValue.equals("")) {
columns[10] = mappedValue;
}
value.set(StringUtils.join(columns, ";"));
context.write(value, NullWritable.get());
}
}
@Override
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.err.printf("Usage: %s [generic options] <input> <output>\n",
getClass().getSimpleName());
ToolRunner.printGenericCommandUsage(System.err);
return -1;
}
Job job = Job.getInstance(getConf());
job.setJarByClass(getClass());
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(ExampleMapper.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
job.setNumReduceTasks(0);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
System.exit(ToolRunner.run(new Driver(), args));
}
}
This job holds no mystery at all: it's a map-only job that consists of replacing a value in one column with its mapped value fetched from Redis (a kind of map-side join), just for the sake of illustration. The datasets merged contain projects funded under the Seventh Framework Programme for Research and Technological Development (FP7) from 2007 to 2013, from the European Union Open Data Portal. The job maps activity codes (cached in Redis) onto the projects extract.
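The column-replacement logic in map is easy to reason about in isolation. Here is a minimal, self-contained sketch of the same swap, with a plain HashMap standing in for the Redis dictionary (the class and method names, ColumnJoin and replaceColumn, are mine for illustration, not part of the job):

```java
import java.util.HashMap;
import java.util.Map;

public class ColumnJoin {
    // Mirrors the mapper: split on ';', swap column 10 for its mapped
    // value when the dictionary has a non-empty entry, then rejoin.
    static String replaceColumn(String line, Map<String, String> dict) {
        String[] columns = line.split(";");
        if (columns.length < 14) {
            return null; // the mapper silently drops short records
        }
        String mapped = dict.get(columns[10]);
        if (mapped != null && !mapped.isEmpty()) {
            columns[10] = mapped;
        }
        return String.join(";", columns);
    }

    public static void main(String[] args) {
        Map<String, String> dict = new HashMap<>();
        dict.put("ERANET", "Dummy value");
        String in = "a;b;c;d;e;f;g;h;i;j;ERANET;l;m;n";
        System.out.println(replaceColumn(in, dict));
        // → a;b;c;d;e;f;g;h;i;j;Dummy value;l;m;n
    }
}
```

A key not present in the dictionary (or mapped to an empty string) leaves the column untouched, exactly as in the mapper.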
On the other side, the test cases:
import java.io.IOException;
import java.lang.reflect.Field;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import redis.clients.jedis.Jedis;
import org.junit.Before;
import org.junit.Test;
import static org.mockito.Mockito.*;
public class TestDriver {
MapDriver<LongWritable, Text, Text, NullWritable> mapDriver;
@Before
public void setUp() throws NoSuchFieldException, SecurityException,
IllegalArgumentException, IllegalAccessException {
Driver.ExampleMapper mapper = new Driver.ExampleMapper();
mapDriver = MapDriver.newMapDriver(mapper);
Jedis mockedJedis = mock(Jedis.class);
when(mockedJedis.get(any(String.class))).thenReturn("Dummy value");
Field privateField = mapper.getClass().getDeclaredField("jedisClient");
privateField.setAccessible(true);
privateField.set(mapper, mockedJedis);
}
@Test
public void testMapper() throws IOException, NoSuchFieldException,
SecurityException, IllegalArgumentException, IllegalAccessException {
mapDriver.withInput(new LongWritable(),
new Text(
"86250;217246;BONUS+;Multilateral call for research projects within the Joint Baltic Sea Research Programme BONUS+;BONUS EEIG - representing altogether 10 RTD organisations in the Baltic Sea states ;2007-05-10;2012-05-09;22512219;7266762;FP7-GA;ERANET;;FP7-2007-ERANET-4.2.;relatedContact:Dr Kaisa KONONEN"));
mapDriver.withOutput(
new Text(
"86250;217246;BONUS+;Multilateral call for research projects within the Joint Baltic Sea Research Programme BONUS+;BONUS EEIG - representing altogether 10 RTD organisations in the Baltic Sea states ;2007-05-10;2012-05-09;22512219;7266762;FP7-GA;Dummy value;;FP7-2007-ERANET-4.2.;relatedContact:Dr Kaisa KONONEN"),
NullWritable.get());
mapDriver.runTest();
}
}
We don't want to connect to a real Redis instance for testing purposes (or maybe we couldn't reach one); we must emulate it. Because of that, in the first highlighted lines we create a mock object and set the return value of its get method. But that's not enough: the mapper has a private variable, jedisClient, and we have to replace it with the mock. To do this, we need the Reflection API to modify this variable at runtime, setting it to the mocked object. And that's it! MRUnit takes care of its part, and we can test whatever we like in our MapReduce jobs! :-P
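The same injection trick works for any private collaborator, not just a Jedis client. A self-contained sketch of the pattern used in setUp(), with an illustrative class of my own (Lookup and its store field are stand-ins, not names from the job):

```java
import java.lang.reflect.Field;
import java.util.Map;

public class InjectionDemo {
    // A class holding a private collaborator, like the mapper's jedisClient.
    static class Lookup {
        private Map<String, String> store; // normally a remote client

        String get(String key) {
            return store.get(key);
        }
    }

    public static void main(String[] args) throws Exception {
        Lookup lookup = new Lookup();
        // Swap the private field for a fake at runtime, exactly as the
        // test's setUp() does with the mocked Jedis instance.
        Field field = Lookup.class.getDeclaredField("store");
        field.setAccessible(true);
        field.set(lookup, Map.of("FP7-GA", "Dummy value"));
        System.out.println(lookup.get("FP7-GA"));
        // → Dummy value
    }
}
```

setAccessible(true) is what lets us bypass the private modifier; without it, field.set would throw an IllegalAccessException.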