It's tempting to get hands-on too quickly without knowing the basics, and that's not a good thing. Although what I'm going to include here doesn't cover those basics, I think some practical tips can be useful when you're about to implement a Hadoop job.
Extend Configured and implement Tool in your driver
You can run a job without doing this and it'll work fine. However, if you later want to change some configuration parameters (Hadoop is highly configurable), you'll have to modify your code or even hardcode the values.
A useful utility class is "org.apache.hadoop.util.GenericOptionsParser", which parses and sets the generic Hadoop arguments for us. Using this class directly is not considered good style; instead, we should use the "org.apache.hadoop.util.ToolRunner" class, which uses it internally. ToolRunner needs an "org.apache.hadoop.util.Tool" object on which to call its run() method. So, in the driver class, we extend "org.apache.hadoop.conf.Configured" (which gives us the "org.apache.hadoop.conf.Configurable" behaviour) and implement the "org.apache.hadoop.util.Tool" interface, which requires us to implement run(). Finally, with run() implemented to configure our job, we just have to call ToolRunner.run().
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = this.getConf();
        Job job = Job.getInstance(conf, "My job");
        // ... set mapper, reducer, input/output formats and paths here
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int status = ToolRunner.run(new MyDriver(), args);
        System.exit(status);
    }
}
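With this in place, the generic options understood by GenericOptionsParser (for example "-D property=value" to override a configuration property, or "-libjars" to ship extra jars, which comes up again below) can be passed on the command line when launching the job, without recompiling anything.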
Set the jar file to use in the job
Hadoop needs to know where the jar containing the map and reduce tasks is, so that it can ship it to the nodes that will run them. You can do this with the setJarByClass() method, which locates the jar from the specified class, or with the setJar() method, which sets the jar file location explicitly.
// Locate the jar from a class it contains:
job.setJarByClass(getClass());
// Or point to the jar file directly:
job.setJar("/path/to/my/jar");
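Of the two, setJarByClass() is usually the more convenient choice: Hadoop works out which jar contains the given class by itself, so you don't hardcode a path that may change between environments.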
Number of reducer tasks to use
If your job is a map-only job, you must set it to zero. Otherwise, it'll depend on the number of reducer slots available in your cluster (if you're working locally, just 1 is fine) and on the files the job is processing too. A common guideline is to multiply a factor (with values between 0.95 and 1.75) by the number of reducer slots available, as sketched below.
job.setNumReduceTasks(N);
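As a rough sketch of that guideline (the slot count below is a made-up value; in a real cluster you'd take it from your cluster's configuration):

int reduceSlots = 10;                 // hypothetical number of reduce slots in the cluster
double factor = 0.95;                 // pick a value between 0.95 and 1.75
job.setNumReduceTasks((int) (factor * reduceSlots));

// Map-only job: no reducers at all
// job.setNumReduceTasks(0);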
Adding third-party libraries to a job
If your job has dependencies, that is, third-party libraries, you'll need those libraries to be available on every node of your cluster.
Using the CLI (Command Line Interface), you can include these libraries with the "-libjars" parameter and it'll work. But what if you're not using the CLI? One possible option would be to build a "fat jar" that bundles your classes together with the third-party classes, but the resulting file could get too big.
The distributed cache can help us here: upload the third-party libraries to HDFS and then add those files to the distributed cache. Versions prior to Hadoop 2.2.0 had to use the "org.apache.hadoop.filecache.DistributedCache" class; in later versions this functionality is included in the "org.apache.hadoop.mapreduce.Job" class.
// Replace the whole set of cached files at once (files is a URI[]):
job.setCacheFiles(files);
// Or add them one by one:
job.addCacheFile(new URI("/path/to/file"));
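As a sketch, assuming the library has already been uploaded to HDFS (the class name, helper name and path below are made up for illustration), you could register it like this. Job also offers addFileToClassPath(), which besides distributing the jar puts it on the tasks' classpath, which is what a third-party library needs; check the exact behaviour on your Hadoop version:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class JobLibs {
    // Hypothetical helper: makes a jar already stored in HDFS available to the job's tasks.
    public static void addThirdPartyJar(Job job, String hdfsJarPath) throws Exception {
        // Ships the jar via the distributed cache and adds it to the tasks' classpath
        job.addFileToClassPath(new Path(hdfsJarPath));
    }
}

You'd call it from the driver, for example addThirdPartyJar(job, "/libs/some-dependency.jar"), before submitting the job.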
Hadoop is not always a good option
Before thinking about "how to solve this problem with Hadoop", you should ask yourself what the requirements are: maybe the files you have to process aren't that huge and another tool would do, maybe you need quick results (real-time or near real-time), maybe the time to develop and deploy the application matters, etc.
Hadoop is quite interesting and you can do a lot of things with the tools in its ecosystem, but that has nothing to do with whether it fits the problem you actually want to solve.