Friday, 20 February 2015

Improving performance on Pig and Hive: Apache Tez

Big Data in general, and Hadoop in particular, are becoming more and more important in organizations, which are incorporating Big Data technologies into their IT infrastructure and taking advantage of their benefits. In the Information Era, managing data and information (and transforming it into knowledge) is key to improving decision-making processes, discovering business opportunities, and monitoring or forecasting events, among other things; there is a lot of narrative out there about this.

Whereas Hadoop itself can be understood as a batch processing framework, many technologies in its ecosystem that are not batch-oriented are emerging. These technologies help bring Hadoop closer to a (near) real-time processing system and produce better and/or faster results from analytics techniques applied to Big Data.

In this sense, there is a project called Apache Tez, which graduated from the incubator a few months ago and allows us to execute our MapReduce jobs in a different way. Tez is a processing framework on top of YARN that lets applications express their processing as a DAG (Directed Acyclic Graph). It transforms MapReduce jobs into dataflows made of Inputs, Processors and Outputs through its API, which defines three main components: DAG (defines the overall job), Vertex (defines user logic and resources) and Edge (defines connections between vertices). Tez can reduce job execution time significantly by avoiding some intermediate data and steps, such as writing intermediate results to HDFS between chained MapReduce jobs. Projects such as Cascading (from version 3.0), Hive (from version 0.13.0) and, recently, Pig (from version 0.14.0) support it.
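To make the DAG, Vertex and Edge roles more concrete, here is a minimal toy model in Python. It is not the Tez API (which is Java); the `DAG` class, its methods and the vertex names are all hypothetical and only illustrate how a job described as vertices and edges can be ordered for execution as a single graph.

```python
from collections import defaultdict, deque


class DAG:
    """Toy stand-in for a Tez-style DAG: named vertices plus directed edges."""

    def __init__(self, name):
        self.name = name
        self.vertices = {}              # vertex name -> description of its logic
        self.edges = defaultdict(list)  # source vertex -> downstream vertices

    def add_vertex(self, name, processor):
        self.vertices[name] = processor
        return self

    def add_edge(self, source, target):
        if source not in self.vertices or target not in self.vertices:
            raise ValueError("both vertices must be added before the edge")
        self.edges[source].append(target)
        return self

    def execution_order(self):
        """Return vertices in an order that respects every edge (Kahn's algorithm)."""
        indegree = {v: 0 for v in self.vertices}
        for targets in self.edges.values():
            for t in targets:
                indegree[t] += 1
        ready = deque(v for v, d in indegree.items() if d == 0)
        order = []
        while ready:
            v = ready.popleft()
            order.append(v)
            for t in self.edges[v]:
                indegree[t] -= 1
                if indegree[t] == 0:
                    ready.append(t)
        if len(order) != len(self.vertices):
            raise ValueError("graph has a cycle, so it is not a DAG")
        return order


# Hypothetical plan for a filter -> group/count -> sort job.
dag = (DAG("starred-repos")
       .add_vertex("load_and_filter", "keep WatchEvent records")
       .add_vertex("group_and_count", "count stars per repository")
       .add_vertex("sort", "order by count, descending")
       .add_edge("load_and_filter", "group_and_count")
       .add_edge("group_and_count", "sort"))

print(dag.execution_order())
# ['load_and_filter', 'group_and_count', 'sort']
```

The point of the sketch is only that the whole job is one graph whose stages hand data to each other directly, instead of a chain of independent jobs.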

In this post, I'll show the results of executing a Pig and a Hive script with and without Tez.
The dataset comes from GitHub Archive, and I'll process the events generated this week when a user stars a project ("WatchEvent").
Versions used: Hadoop 2.6.0, Tez 0.6.0, Pig 0.14.0 and Hive 0.14.0.

In Pig Latin, the script is as follows:

-- Exec without Tez: pig starred_repos.pig
-- Exec with Tez: pig -x tez starred_repos.pig
SET job.name 'Starred Repos on Github';

register json-simple-1.1.jar
register elephant-bird-pig-4.6.jar
register elephant-bird-hadoop-compat-4.6.jar

data = LOAD '/user/hadoop/data/githubarchive/*.json'  USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true') AS (json:map[]);

json_data = FOREACH data GENERATE
 json#'id' AS id:long, json#'type' AS type:chararray,
 json#'public' AS public:boolean, 
 json#'created_at' AS created_at:datetime,
 json#'actor'#'id' AS actor__id:long, json#'actor'#'login' AS actor__login:chararray, json#'actor'#'url' AS actor__url:chararray,
 json#'repo'#'id' AS repo__id:long, FLATTEN(STRSPLIT (json#'repo'#'name', '/')) AS (repo__name:chararray, repo__project:chararray), json#'repo'#'url' AS repo__url:chararray;

watch_events = FILTER json_data BY type == 'WatchEvent';
grouped_events = GROUP watch_events BY (repo__name, repo__project);
counted_events = FOREACH grouped_events GENERATE COUNT($1), group;
sorted_events = ORDER counted_events BY $0 DESC, $1 ASC;
DUMP sorted_events;

And the script in Hive Query Language:

-- Exec: hive -f starred_repos.hql
-- Comment this line to execute without Tez
SET hive.execution.engine = tez;
SET mapred.job.name = 'Starred Repos on Github';

ADD JAR hive-serde-cdh-twitter-1.0.jar;

CREATE EXTERNAL TABLE IF NOT EXISTS githubarchive(id STRING, 
    type STRING,
    actor STRUCT<
        id:BIGINT,
        login:STRING,
        gravatar_id:STRING,
        url:STRING,
        avatar_url:STRING>,
    repo STRUCT<
        id:BIGINT,
        name:STRING,
        url:STRING>,
    org STRUCT<
        id:BIGINT,
        login:STRING,
        gravatar_id:STRING,
        url:STRING,
        avatar_url:STRING>,
    public BOOLEAN,
    created_at STRING)
  ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
  LOCATION '/user/hadoop/data/githubarchive';

SELECT COUNT(*) AS total_count, repos.repo__name, repos.repo__project
    FROM (
        SELECT split(repo.name, '/')[0] AS repo__name, split(repo.name, '/')[1] AS repo__project
        FROM githubarchive
        WHERE type == 'WatchEvent'
    ) repos
    GROUP BY repos.repo__name, repos.repo__project
    ORDER BY total_count DESC, repos.repo__name ASC;
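As a rough illustration of what both scripts compute, the same logic can be sketched in plain Python on a handful of records. The sample lines below are made up for the example and contain only the fields the aggregation needs; `count_stars` is a hypothetical helper, not part of Pig or Hive.

```python
import json
from collections import Counter

# Made-up sample records shaped like the fields the scripts read.
sample_lines = [
    '{"type": "WatchEvent", "repo": {"name": "alice/tool"}}',
    '{"type": "PushEvent",  "repo": {"name": "alice/tool"}}',
    '{"type": "WatchEvent", "repo": {"name": "bob/lib"}}',
    '{"type": "WatchEvent", "repo": {"name": "alice/tool"}}',
    '{"type": "WatchEvent", "repo": {"name": "aaron/app"}}',
]


def count_stars(lines):
    """Count WatchEvent records per (repo__name, repo__project) pair."""
    counts = Counter()
    for line in lines:
        event = json.loads(line)
        if event.get("type") != "WatchEvent":
            continue  # same role as the FILTER / WHERE clause
        name, project = event["repo"]["name"].split("/", 1)
        counts[(name, project)] += 1  # same role as GROUP BY + COUNT
    # Count descending, then the (name, project) pair ascending,
    # which mirrors the ORDER clause of the Pig script.
    return sorted(((n, key) for key, n in counts.items()),
                  key=lambda row: (-row[0], row[1]))


for total, (name, project) in count_stars(sample_lines):
    print(total, name, project)
# 2 alice tool
# 1 aaron app
# 1 bob lib
```

The Hive query breaks ties on `repo__name` only, so rows with equal counts and the same `repo__name` may come out in a different order there.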

Although the difference between executing these scripts with and without Tez is not very significant in this case, it could be much bigger depending on your datasets and the operations in your scripts.


By the way, the top 10 most-starred projects on GitHub this week (repo/project) are:
  1. IonicaBizau/git-stats (2013 stars).
  2. getify/You-Dont-Know-JS (1550 stars).
  3. phanan/htaccess (1234 stars).
  4. joewalnes/websocketd (1085 stars).
  5. facebook/stetho (876 stars).
  6. sdelements/lets-chat (849 stars).
  7. pixle/subway (750 stars).
  8. ejci/favico.js (567 stars).
  9. babel/babel (552 stars).
  10. paulirish/memory-stats.js (493 stars).
