As we know, applications based on distributed computing models like MapReduce are tricky to debug. There are a lot of cases in which our code can throw an exception or behave in an unexpected way, whether because of a small bug in our code or because of the input data received. Whatever the reason, it's better to detect and fix these bugs in the early stages of development.
For testing MapReduce jobs there is a well-known framework called MRUnit; in fact, a few years ago it graduated from the Apache Incubator to become a TLP (Top Level Project). It helps us test and debug our code in isolation and in an easy way: you can write unit tests for your mappers and reducers, and integration tests as well.
There are plenty of examples out there showing how to use this library, so I prefer to focus on a mix of concepts that lets us test our code using MRUnit, Mockito and the Java Reflection API. The idea is as follows:
- Unit testing of mappers and reducers via MRUnit.
- Mocking classes with Mockito to emulate behaviour of some variables in the mapper and/or the reducer.
- Java Reflection API to modify some private variables defined in these classes.
These are the Maven dependencies we need (note the hadoop2 classifier on MRUnit, which selects the artifact built against the Hadoop 2 API):
<dependencies>
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.mrunit</groupId>
    <artifactId>mrunit</artifactId>
    <version>1.1.0</version>
    <classifier>hadoop2</classifier>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>org.mockito</groupId>
    <artifactId>mockito-all</artifactId>
    <version>1.10.19</version>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-auth</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>redis.clients</groupId>
    <artifactId>jedis</artifactId>
    <version>2.6.2</version>
  </dependency>
</dependencies>
You can see an unexpected dependency: Jedis, a Java client to interact with Redis (a key-value cache and store with good performance). Why is this? The reason is pretty simple: Redis will work as a cache engine that provides the mapper with the values it needs for its operations (in this case we'll use it as a dictionary, that is, for joining two datasets).
This is the driver class with its inner classes (mapper and reducer):
import java.io.IOException;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import redis.clients.jedis.Jedis;
public class Driver extends Configured implements Tool {
public static class ExampleMapper extends
Mapper<LongWritable, Text, Text, NullWritable> {
Jedis jedisClient = new Jedis("localhost");
@Override
public void setup(Context context) {
jedisClient.select(0);
}
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String[] columns = value.toString().split(";");
if (columns.length < 14) {
return;
}
String mappedValue = jedisClient.get(columns[10]);
if (mappedValue != null && !mappedValue.equals("")) {
columns[10] = mappedValue;
}
value.set(StringUtils.join(columns, ";"));
context.write(value, NullWritable.get());
}
}
@Override
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.err.printf("Usage: %s [generic options] <input> <output>\n",
getClass().getSimpleName());
ToolRunner.printGenericCommandUsage(System.err);
return -1;
}
Job job = Job.getInstance(getConf());
job.setJarByClass(getClass());
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(ExampleMapper.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
job.setNumReduceTasks(0);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
System.exit(ToolRunner.run(new Driver(), args));
}
}
This job holds no mystery at all: it's a map-only job that consists of replacing a value in one column with its mapped value fetched from Redis (a kind of map-side join), just for the sake of illustration. The datasets merged contain projects funded under the Seventh Framework Programme for Research and Technological Development (FP7) from 2007 to 2013, from the European Union Open Data Portal. The job maps activity codes (cached in Redis) onto the projects extract.
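The column-replacement logic in map is easy to reason about in isolation. Here is a minimal, self-contained sketch of the same swap, with a plain HashMap standing in for the Redis dictionary (the class and method names, ColumnJoin and replaceColumn, are mine for illustration, not part of the job):

```java
import java.util.HashMap;
import java.util.Map;

public class ColumnJoin {
    // Mirrors the mapper: split on ';', swap column 10 for its mapped
    // value when the dictionary has a non-empty entry, then rejoin.
    static String replaceColumn(String line, Map<String, String> dict) {
        String[] columns = line.split(";");
        if (columns.length < 14) {
            return null; // the mapper silently drops short records
        }
        String mapped = dict.get(columns[10]);
        if (mapped != null && !mapped.isEmpty()) {
            columns[10] = mapped;
        }
        return String.join(";", columns);
    }

    public static void main(String[] args) {
        Map<String, String> dict = new HashMap<>();
        dict.put("ERANET", "Dummy value");
        String in = "a;b;c;d;e;f;g;h;i;j;ERANET;l;m;n";
        System.out.println(replaceColumn(in, dict));
        // → a;b;c;d;e;f;g;h;i;j;Dummy value;l;m;n
    }
}
```

A key not present in the dictionary (or mapped to an empty string) leaves the column untouched, exactly as in the mapper.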
On the other side, the test cases:
import java.io.IOException;
import java.lang.reflect.Field;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import redis.clients.jedis.Jedis;
import org.junit.Before;
import org.junit.Test;
import static org.mockito.Mockito.*;
public class TestDriver {
MapDriver<LongWritable, Text, Text, NullWritable> mapDriver;
@Before
public void setUp() throws NoSuchFieldException, SecurityException,
IllegalArgumentException, IllegalAccessException {
Driver.ExampleMapper mapper = new Driver.ExampleMapper();
mapDriver = MapDriver.newMapDriver(mapper);
Jedis mockedJedis = mock(Jedis.class);
when(mockedJedis.get(any(String.class))).thenReturn("Dummy value");
Field privateField = mapper.getClass().getDeclaredField("jedisClient");
privateField.setAccessible(true);
privateField.set(mapper, mockedJedis);
}
@Test
public void testMapper() throws IOException, NoSuchFieldException,
SecurityException, IllegalArgumentException, IllegalAccessException {
mapDriver.withInput(new LongWritable(),
new Text(
"86250;217246;BONUS+;Multilateral call for research projects within the Joint Baltic Sea Research Programme BONUS+;BONUS EEIG - representing altogether 10 RTD organisations in the Baltic Sea states ;2007-05-10;2012-05-09;22512219;7266762;FP7-GA;ERANET;;FP7-2007-ERANET-4.2.;relatedContact:Dr Kaisa KONONEN"));
mapDriver.withOutput(
new Text(
"86250;217246;BONUS+;Multilateral call for research projects within the Joint Baltic Sea Research Programme BONUS+;BONUS EEIG - representing altogether 10 RTD organisations in the Baltic Sea states ;2007-05-10;2012-05-09;22512219;7266762;FP7-GA;Dummy value;;FP7-2007-ERANET-4.2.;relatedContact:Dr Kaisa KONONEN"),
NullWritable.get());
mapDriver.runTest();
}
}
We don't want to connect to a real Redis instance for testing purposes (or maybe we couldn't reach one); we must emulate it. Because of that, in the first highlighted lines we create a mock object and set the return value of its get method. But that's not enough: the mapper has a private variable, jedisClient, and we have to replace it with the mock. To do this, we need the Reflection API to modify this variable at runtime, setting it to the mocked object. And that's it! MRUnit takes care of its part, and we can test whatever we like in our MapReduce jobs! :-P
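The same injection trick works for any private collaborator, not just a Jedis client. A self-contained sketch of the pattern used in setUp(), with an illustrative class of my own (Lookup and its store field are stand-ins, not names from the job):

```java
import java.lang.reflect.Field;
import java.util.Map;

public class InjectionDemo {
    // A class holding a private collaborator, like the mapper's jedisClient.
    static class Lookup {
        private Map<String, String> store; // normally a remote client

        String get(String key) {
            return store.get(key);
        }
    }

    public static void main(String[] args) throws Exception {
        Lookup lookup = new Lookup();
        // Swap the private field for a fake at runtime, exactly as the
        // test's setUp() does with the mocked Jedis instance.
        Field field = Lookup.class.getDeclaredField("store");
        field.setAccessible(true);
        field.set(lookup, Map.of("FP7-GA", "Dummy value"));
        System.out.println(lookup.get("FP7-GA"));
        // → Dummy value
    }
}
```

setAccessible(true) is what lets us bypass the private modifier; without it, field.set would throw an IllegalAccessException.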