Whereas Hadoop itself can be understood as a batch processing framework, a lot of technologies of its ecosystem, that are not, are coming out. These technologies help to bring Hadoop closer to a (near) real-time processing system and make better and/or faster results from analytics techniques applied to Big Data.
In this sense, there is a project called Apache Tez, gratuated from the incubator few months ago, which allows us to execute our MapReduce jobs in a different way. Tez is a processing framework on top of YARN which makes DAG (Directed Acyclic Graph) topologies for processing applications. It transforms MapReduce jobs into dataflows made of Inputs, Processors and Outputs through its API in which defines three main components: DAG (defines the overall job), Vertex (defines user logic and resources) and Edge (defines connections between vertices). Tez reduces job execution time significantly avoiding some intermediate data and steps. Projects such as Cascading (from version 3.0), Hive (from version 0.13.0) and recently Pig (from version 0.14.0) are supporting it.
In this post, I'll show the results of executing a Pig and a Hive script with and without Tez.
The dataset used is obtained from GitHub Archive and I'll process events arised during this week related when a user stars a project ("WatchEvent").
Used versions: Hadoop 2.6.0, Tez 0.6.0, Pig 0,14.0 and Hive 0.14.0.
In PigLatin, the script is as follows:
-- Exec without Tez: pig starred_repos.pig
-- Exec with Tez: pig -x tez starred_repos.pig
SET job.name 'Starred Repos on Github';
register json-simple-1.1.jar
register elephant-bird-pig-4.6.jar
register elephant-bird-hadoop-compat-4.6.jar
data = LOAD '/user/hadoop/data/githubarchive/*.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true') AS (json:map[]);
json_data = FOREACH data GENERATE
json#'id' AS id:long, json#'type' AS type:chararray,
json#'public' AS public:boolean,
json#'created_at' AS created_at:datetime,
json#'actor'#'id' AS actor__id:long, json#'actor'#'login' AS actor__login:chararray, json#'actor'#'url' AS actor__url:chararray,
json#'repo'#'id' AS repo__id:long, FLATTEN(STRSPLIT (json#'repo'#'name', '/')) AS (repo__name:chararray, repo__project:chararray), json#'repo'#'url' AS repo__url:chararray;
watch_events = FILTER json_data BY type == 'WatchEvent';
grouped_events = GROUP watch_events BY (repo__name, repo__project);
counted_events = FOREACH grouped_events GENERATE COUNT($1), group;
sorted_events = ORDER counted_events BY $0 DESC, $1 ASC;
DUMP sorted_events;
And the script in Hive Query Language:
-- Exec: hive -f starred_repos.hql
-- Comment this line to execute without Tez
SET hive.execution.engine = tez;
SET mapred.job.name = 'Starred Repos on Github';
ADD JAR hive-serde-cdh-twitter-1.0.jar;
CREATE EXTERNAL TABLE IF NOT EXISTS githubarchive(id STRING,
type STRING,
actor STRUCT<
id:BIGINT,
login:STRING,
gravatar_id:STRING,
url:STRING,
avatar_url:STRING>,
repo STRUCT<
id:BIGINT,
name:STRING,
url:STRING>,
org STRUCT<
id:BIGINT,
login:STRING,
gravatar_id:STRING,
url:STRING,
avatar_url:STRING>,
public BOOLEAN,
created_at STRING)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/hadoop/data/githubarchive';
SELECT COUNT(*) AS total_count, repos.repo__name, repos.repo__project
FROM (
SELECT split(repo.name, '/')[0] AS repo__name, split(repo.name, '/')[1] AS repo__project
FROM githubarchive
WHERE type == 'WatchEvent'
) repos
GROUP BY repos.repo__name, repos.repo__project
ORDER BY total_count DESC, repos.repo__name ASC;
Although the difference between executing the scripts with or without Tez is not too significant, depending on your datasets and operations in your scripts, this difference could be much bigger:
By the way, the top 10 starred projects on GitHub this week are (repo/project):
- IonicaBizau/git-stats (2013 stars).
- getify/You-Dont-Know-JS (1550 stars).
- phanan/htaccess (1234 stars).
- joewalnes/websocketd (1085 stars).
- facebook/stetho (876 stars).
- sdelements/lets-chat (849 stars).
- pixle/subway (750 stars).
- ejci/favico.js (567 stars).
- babel/babel (552 stars).
- paulirish/memory-stats.js (493 stars).
No comments:
Post a Comment