Статьи

Большие данные по запросу с Apache Whirr

Эта статья Себастьяна Гоасгуена выходит через Марка Хинкля.

Этот пост немного более формален, чем обычно, поскольку я написал его для учебника о том, как запускать hadoop в облаках, но я подумал, что это было очень полезно, поэтому я публикую его здесь для всеобщей пользы (надеюсь).

Когда CloudStack окончил Apache Incubator в марте 2013 года, он присоединился к Hadoop в качестве проекта верхнего уровня (TLP) в рамках Apache Software Foundation (ASF). Это сделало ASF единственным Open Source Foundation, который содержит облачную платформу и решение для больших данных. Более того, более внимательный взгляд на проекты по созданию всего ASF показывает, что примерно 30% инкубатора Apache и 10% TLP связаны с «большими данными». Такие проекты, как Hbase, Hive, Pig и Mahout, являются подпроектами Hadoop TLP. Ambari, Kafka, Falcon и Mesos являются частью инкубатора и основаны на Hadoop.

В дополнение к CloudStack, API-оболочки, такие как Libcloud, deltacloud и jclouds, также являются частью ASF. Для соединения CloudStack и Hadoop в ASF также есть два интересных проекта: Apache Whirr a TLP и Provisionr, который в настоящее время находится в стадии разработки. И Whirr, и Provisionr стремились предоставить уровень абстракции для определения инфраструктуры больших данных на основе Hadoop и создания экземпляров этой инфраструктуры в облаках, включая облака на основе Apache CloudStack. Такое сосуществование CloudStack и всей экосистемы Hadoop в рамках одного и того же Open Source Foundation означает, что к обоим проектам применимы одинаковые принципы управления, процессы и разработки, что обеспечивает большую синергию и обещает еще большую взаимодополняемость.

В этом руководстве мы представляем Apache Whirr, приложение, которое можно использовать для определения, предоставления и настройки решений для больших данных в облаках на основе CloudStack. Whirr автоматически запускает экземпляры в облаке и запускает их в Boostrapps. Он также может добавлять пакеты, такие как Hive, Hbase и Yarn для сокращений карты.

Whirr [1] — это «набор библиотек для запуска облачных сервисов» и, в частности, сервисов больших данных. Whirr основан на jclouds [2]. Jclouds — это уровень абстракции на основе Java, который обеспечивает общий интерфейс для большого набора облачных сервисов и провайдеров, таких как Amazon EC2, серверы Rackspace и CloudStack. Таким образом, все облачные провайдеры, поддерживаемые в Jclouds, поддерживаются в Whirr. Основные участники Whirr включают четырех разработчиков из Cloudera, известного дистрибутива Hadoop. Whirr также можно использовать в качестве инструмента командной строки, что упрощает пользователям определение и предоставление кластеров Hadoop в облаке.

As an Apache project, Whirr comes as a source tarball and can be downloaded from one of the Apache mirrors [3]. Similarly to CloudStack, Whirr community members can host packages. Cloudera is hosting whirr packages to ease the installation. For instance on Ubuntu and Debian based systems you can add the Cloudera repository by creating/etc/apt/sources.list.d/cloudera.list and putting the following contents in it:

deb  http://archive.cloudera.com/cdh4/-cdh4 contrib 
deb-src http://archive.cloudera.com/cdh4/-cdh4 contrib

With this repository in place, one can install whirr with:

$sudo apt-get install whirr

The whirr command will now be available. Developers can use the latest version of Whirr by cloning the software repository, writing new code and submitting patches the same way that they would submit patches to CloudStack. To clone the git repository of Whirr do:

$git clone git://git.apache.org/whirr.git

They can then build their own version of whirr using maven:

$mvn install

The whirr binary will be located under the /bin directory. Adding it to one’s path with:

$export PATH=$PATH:/path/to/whirr/bin

Will make the whirr command available in the user’s environment. Successfull installation can be checked by simply entering:

$whirr --help

With whirr installed, one now needs to specify the credentials of the Cloud that will be used to create the Hadoop infrastructure. A ~/.whirr/credentials has been created during the installation phase. The type of provider (e.g cloudstack), the endpoint of the cloud and the access and secret keys need to be entered in this credentials file like so:

PROVIDER=cloudstack
IDENTITY=
CREDENTIAL=
ENDPOINT=

For instance on Exoscale [4] a CloudStack based cloud in Switzerland, the endpoint would be https://api.exoscale.ch/compute

Now that the CloudStack cloud endpoint and keys have been configured, the hadoop cluster that we want to instantiate needs to be defined. This is done in a properties file using a set of Whirr specific configuration variables [5]. Below is the content of the file with explanations in-line:

---------------------------------------
# Set the name of your hadoop cluster
whirr.cluster-name=hadoop
# Change the name of cluster admin user
whirr.cluster-user=${sys:user.name}
# Change the number of machines in the cluster here
# Below we define one hadoop namenode and 3 hadoop datanode
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,3 hadoop-datanode+hadoop-tasktracker
# Specify which distribution of hadoop you want to use
# Here we choose to use the Cloudera distribution
whirr.env.repo=cdh4
whirr.hadoop.install-function=install_cdh_hadoop
whirr.hadoop.configure-function=configure_cdh_hadoop
# Use a specific instance type.
# Specify the uuid of the CloudStack service offering to use for the instances of your hadoop cluster
whirr.hardware-id=b6cd1ff5-3a2f-4e9d-a4d1-8988c1191fe8
# If you use ssh key pairs to access instances in the cloud
# Specify them like so
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa_exoscale
whirr.public-key-file=${whirr.private-key-file}.pub
# Specify the template to use for the instances
# This is the uuid of the CloudStack template
whirr.image-id=1d16c78d-268f-47d0-be0c-b80d31e765d2
------------------------------------------------------

To launch this Hadoop cluster use the whirr command line:

$whirr launch-cluster --config hadoop.properties

The following example output shows the instances being started and boostrapped. At the end of the provisioning, whirr returns the ssh command that shall be used to access the hadoop instances.

-------------------
Running on provider cloudstack using identity mnH5EbKcKeJd456456345634563456345654634563456345
Bootstrapping cluster
Configuring template for bootstrap-hadoop-datanode_hadoop-tasktracker
Configuring template for bootstrap-hadoop-namenode_hadoop-jobtracker
Starting 3 node(s) with roles [hadoop-datanode, hadoop-tasktracker]
Starting 1 node(s) with roles [hadoop-namenode, hadoop-jobtracker]
>> running InitScript{INSTANCE_NAME=bootstrap-hadoop-datanode_hadoop-tasktracker} on node(b9457a87-5890-4b6f-9cf3-1ebd1581f725)
>> running InitScript{INSTANCE_NAME=bootstrap-hadoop-datanode_hadoop-tasktracker} on node(9d5c46f8-003d-4368-aabf-9402af7f8321)
>> running InitScript{INSTANCE_NAME=bootstrap-hadoop-datanode_hadoop-tasktracker} on node(6727950e-ea43-488d-8d5a-6f3ef3018b0f)
>> running InitScript{INSTANCE_NAME=bootstrap-hadoop-namenode_hadoop-jobtracker} on node(6a643851-2034-4e82-b735-2de3f125c437)
<< success executing InitScript{INSTANCE_NAME=bootstrap-hadoop-datanode_hadoop-tasktracker} on node(b9457a87-5890-4b6f-9cf3-1ebd1581f725): {output=This function does nothing. It just needs to exist so Statements.call("retry_helpers") doesn't call something which doesn't exist
Get:1 http://security.ubuntu.com precise-security Release.gpg [198 B]
Get:2 http://security.ubuntu.com precise-security Release [49.6 kB]
Hit http://ch.archive.ubuntu.com precise Release.gpg
Get:3 http://ch.archive.ubuntu.com precise-updates Release.gpg [198 B]
Get:4 http://ch.archive.ubuntu.com precise-backports Release.gpg [198 B]
Hit http://ch.archive.ubuntu.com precise Release
..../snip/.....
You can log into instances using the following ssh commands:
[hadoop-datanode+hadoop-tasktracker]: ssh -i /Users/sebastiengoasguen/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no 
 [a href="mailto:[email protected]" style="text-decoration: none; color: rgb(101, 173, 197); cursor: pointer;"][email protected]
[hadoop-datanode+hadoop-tasktracker]: ssh -i /Users/sebastiengoasguen/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no 
 [email protected]
[hadoop-datanode+hadoop-tasktracker]: ssh -i /Users/sebastiengoasguen/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no 
 [email protected]
[hadoop-namenode+hadoop-jobtracker]: ssh -i /Users/sebastiengoasguen/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no 
 [email protected]
-----------

To destroy the cluster from your client do:

$whirr destroy-cluster --config hadoop.properties.

Whirr gives you the ssh command to connect to the instances of your hadoop cluster, login to the namenode and browse the hadoop file system that was created:

$ hadoop fs -ls /
Found 5 items
drwxrwxrwx   - hdfs supergroup          0 2013-06-21 20:11 /hadoop
drwxrwxrwx   - hdfs supergroup          0 2013-06-21 20:10 /hbase
drwxrwxrwx   - hdfs supergroup          0 2013-06-21 20:10 /mnt
drwxrwxrwx   - hdfs supergroup          0 2013-06-21 20:11 /tmp
drwxrwxrwx   - hdfs supergroup          0 2013-06-21 20:11 /user

Create a directory to put your input data.

$ hadoop fs -mkdir input
$ hadoop fs -ls /user/sebastiengoasguen
Found 1 items
drwxr-xr-x   - sebastiengoasguen supergroup          0 2013-06-21 20:15 /user/sebastiengoasguen/input

Create a test input file and put in the hadoop file system:

$ cat foobar 
this is a test to count the words
$ hadoop fs -put ./foobar input
$ hadoop fs -ls /user/sebastiengoasguen/input
Found 1 items
-rw-r--r--   3 sebastiengoasguen supergroup         34 2013-06-21 20:17 /user/sebastiengoasguen/input/foobar

Define the map-reduce environment. Note that this default Cloudera distribution installation uses MRv1. To use Yarn one would have to edit the hadoop.properties file.

$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce

Start the map-reduce job:

$ hadoop jar $HADOOP_MAPRED_HOME/hadoop-examples.jar wordcount input output
13/06/21 20:19:59 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/06/21 20:20:00 INFO input.FileInputFormat: Total input paths to process : 1
13/06/21 20:20:00 INFO mapred.JobClient: Running job: job_201306212011_0001
13/06/21 20:20:01 INFO mapred.JobClient:  map 0% reduce 0%
13/06/21 20:20:11 INFO mapred.JobClient:  map 100% reduce 0%
13/06/21 20:20:17 INFO mapred.JobClient:  map 100% reduce 33%
13/06/21 20:20:18 INFO mapred.JobClient:  map 100% reduce 100%
13/06/21 20:20:21 INFO mapred.JobClient: Job complete: job_201306212011_0001
13/06/21 20:20:22 INFO mapred.JobClient: Counters: 32
13/06/21 20:20:22 INFO mapred.JobClient:   File System Counters
13/06/21 20:20:22 INFO mapred.JobClient:     FILE: Number of bytes read=133
13/06/21 20:20:22 INFO mapred.JobClient:     FILE: Number of bytes written=766347
13/06/21 20:20:22 INFO mapred.JobClient:     FILE: Number of read operations=0
13/06/21 20:20:22 INFO mapred.JobClient:     FILE: Number of large read operations=0
13/06/21 20:20:22 INFO mapred.JobClient:     FILE: Number of write operations=0
13/06/21 20:20:22 INFO mapred.JobClient:     HDFS: Number of bytes read=157
13/06/21 20:20:22 INFO mapred.JobClient:     HDFS: Number of bytes written=50
13/06/21 20:20:22 INFO mapred.JobClient:     HDFS: Number of read operations=2
13/06/21 20:20:22 INFO mapred.JobClient:     HDFS: Number of large read operations=0
13/06/21 20:20:22 INFO mapred.JobClient:     HDFS: Number of write operations=3
13/06/21 20:20:22 INFO mapred.JobClient:   Job Counters 
13/06/21 20:20:22 INFO mapred.JobClient:     Launched map tasks=1
13/06/21 20:20:22 INFO mapred.JobClient:     Launched reduce tasks=3
13/06/21 20:20:22 INFO mapred.JobClient:     Data-local map tasks=1
13/06/21 20:20:22 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=10956
13/06/21 20:20:22 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=15446
13/06/21 20:20:22 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/06/21 20:20:22 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/06/21 20:20:22 INFO mapred.JobClient:   Map-Reduce Framework
13/06/21 20:20:22 INFO mapred.JobClient:     Map input records=1
13/06/21 20:20:22 INFO mapred.JobClient:     Map output records=8
13/06/21 20:20:22 INFO mapred.JobClient:     Map output bytes=66
13/06/21 20:20:22 INFO mapred.JobClient:     Input split bytes=123
13/06/21 20:20:22 INFO mapred.JobClient:     Combine input records=8
13/06/21 20:20:22 INFO mapred.JobClient:     Combine output records=8
13/06/21 20:20:22 INFO mapred.JobClient:     Reduce input groups=8
13/06/21 20:20:22 INFO mapred.JobClient:     Reduce shuffle bytes=109
13/06/21 20:20:22 INFO mapred.JobClient:     Reduce input records=8
13/06/21 20:20:22 INFO mapred.JobClient:     Reduce output records=8
13/06/21 20:20:22 INFO mapred.JobClient:     Spilled Records=16
13/06/21 20:20:22 INFO mapred.JobClient:     CPU time spent (ms)=1880
13/06/21 20:20:22 INFO mapred.JobClient:     Physical memory (bytes) snapshot=469413888
13/06/21 20:20:22 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=5744541696
13/06/21 20:20:22 INFO mapred.JobClient:     Total committed heap usage (bytes)=207687680

And you can finally check the output:

$ hadoop fs -cat output/part-* | head
this 1
to  1
the  1
a  1
count 1
is  1
test 1
words 1

Of course this is a silly example of map-reduce job and you will want to do much more critical tasks. In order to benchmark your cluster Hadoop comes with examples jar.

To benchmark your hadoop cluster you can use the TeraSort tools available in the hadoop distribution. Generate some 100 MB of input data with TeraGen (100 byte rows):

$hadoop jar $HADOOP_MAPRED_HOME/hadoop-examples.jar teragen 1000000 output3

Sort it with TeraSort:

$ hadoop jar $HADOOP_MAPRED_HOME/hadoop-examples.jar terasort output3 output4

And then validate the results with TeraValidate:

$hadoop jar $HADOOP_MAPRED_HOME/hadoop-examples.jar teravalidate output4 outvalidate

Performance of map-reduce jobs run in Cloud based hadoop clusters will be highly dependent on the hadoop configuration, the template and the service offering being used and of course on the underlying hardware of the Cloud. Hadoop was not designed to run in the Cloud and therefore some assumptions were made that do not fit the Cloud model, see [6] for more information. Deploying Hadoop in the Cloud however is a viable solution for on-demand map-reduce applications. Development work is currently under way within the Google Summer of Code program to provide CloudStack with a compatible Amazon Elastic Map-Reduce (EMR) service. This service will be based on Whirr or a new Amazon CloudFormation compatible interface called StackMate [7].

[1] http://whirr.apache.org [2] http://jclouds.incubator.apache.org [3] http://www.apache.org/dyn/closer.cgi/whirr/ [4] http://exoscale.ch [5] http://whirr.apache.org/docs/0.8.2/configuration-guide.html [6] http://wiki.apache.org/hadoop/Virtual%20Hadoop [7] https://github.com/chiradeep/stackmate