This article by Sebastien Goasguen comes by way of Mark Hinkle.
This post is a bit more formal than usual since I wrote it as a tutorial on how to run Hadoop in the clouds, but I thought it was quite useful, so I am publishing it here for everyone's benefit (hopefully).
When CloudStack graduated from the Apache Incubator in March 2013, it joined Hadoop as a Top-Level Project (TLP) within the Apache Software Foundation (ASF). This made the ASF the only open source foundation that hosts both a cloud platform and a big data solution. Moreover, a closer look at the projects that make up the ASF shows that roughly 30% of the Apache Incubator and 10% of the TLPs are "big data" related. Projects such as HBase, Hive, Pig and Mahout are sub-projects of the Hadoop TLP. Ambari, Kafka, Falcon and Mesos are part of the Incubator and are based on Hadoop.
In addition to CloudStack, API wrappers such as Libcloud, deltacloud and jclouds are also part of the ASF. To bridge CloudStack and Hadoop, the ASF also hosts two interesting projects: Apache Whirr, a TLP, and Provisionr, currently in incubation. Both Whirr and Provisionr aim to provide an abstraction layer to define Hadoop-based big data infrastructure and to instantiate that infrastructure on clouds, including Apache CloudStack based clouds. This co-existence of CloudStack and the entire Hadoop ecosystem under the same open source foundation means that the same governance, processes and development principles apply to both projects, bringing greater synergy and promising even greater complementarity.
In this tutorial we introduce Apache Whirr, an application that can be used to define, provision and configure big data solutions on CloudStack based clouds. Whirr automatically starts instances in the cloud and bootstraps Hadoop on them. It can also add packages such as Hive, HBase and YARN for map-reduce jobs.
Whirr [1] is "a set of libraries for running cloud services" and specifically big data services. Whirr is based on jclouds [2], a Java-based abstraction layer that provides a common interface to a large set of cloud services and providers such as Amazon EC2, Rackspace servers and CloudStack. All cloud providers supported by jclouds are therefore supported by Whirr. The core contributors to Whirr include four developers from Cloudera, the well-known Hadoop distribution. Whirr can also be used as a command-line tool, making it straightforward for users to define and provision Hadoop clusters in the cloud.
As an Apache project, Whirr comes as a source tarball and can be downloaded from one of the Apache mirrors [3]. As with CloudStack, Whirr community members can host their own packages; Cloudera hosts Whirr packages to ease installation. For instance, on Ubuntu and Debian based systems you can add the Cloudera repository by creating /etc/apt/sources.list.d/cloudera.list and putting the following contents in it (substituting your OS release and architecture for the placeholders):
deb http://archive.cloudera.com/cdh4/<OS-release-arch> <RELEASE>-cdh4 contrib
deb-src http://archive.cloudera.com/cdh4/<OS-release-arch> <RELEASE>-cdh4 contrib
With this repository in place, one can install whirr with:
$sudo apt-get install whirr
The whirr command will now be available. Developers can use the latest version of Whirr by cloning the software repository, writing new code, and submitting patches the same way they would for CloudStack. To clone the git repository of Whirr do:
$git clone git://git.apache.org/whirr.git
They can then build their own version of whirr using maven:
$mvn install
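Building also runs the unit tests of each Whirr module; if you only need the binaries, the standard Maven option to skip the tests can be used:

$ mvn install -DskipTests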
The whirr binary will be located under the bin/ directory of the source tree. Adding it to one's path with:
$export PATH=$PATH:/path/to/whirr/bin
Will make the whirr command available in the user's environment. Successful installation can be checked by simply entering:
$whirr --help
With whirr installed, one now needs to specify the credentials of the cloud that will be used to create the Hadoop infrastructure. A ~/.whirr/credentials file has been created during the installation phase. The type of provider (e.g. cloudstack), the endpoint of the cloud, and the access and secret keys need to be entered in this credentials file like so:
PROVIDER=cloudstack
IDENTITY=
CREDENTIAL=
ENDPOINT=
For instance, on Exoscale [4], a CloudStack based cloud in Switzerland, the endpoint would be https://api.exoscale.ch/compute.
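A filled-in ~/.whirr/credentials file for such a cloud would therefore look like the following sketch (the identity and credential values are placeholders for your own API and secret keys):

---------------------------------------
PROVIDER=cloudstack
IDENTITY=<your CloudStack API key>
CREDENTIAL=<your CloudStack secret key>
ENDPOINT=https://api.exoscale.ch/compute
---------------------------------------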
Now that the CloudStack cloud endpoint and keys have been configured, the hadoop cluster that we want to instantiate needs to be defined. This is done in a properties file using a set of Whirr specific configuration variables [5]. Below is the content of the file with explanations in-line:
---------------------------------------
# Set the name of your hadoop cluster
whirr.cluster-name=hadoop
# Change the name of cluster admin user
whirr.cluster-user=${sys:user.name}
# Change the number of machines in the cluster here
# Below we define one hadoop namenode and 3 hadoop datanodes
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,3 hadoop-datanode+hadoop-tasktracker
# Specify which distribution of hadoop you want to use
# Here we choose to use the Cloudera distribution
whirr.env.repo=cdh4
whirr.hadoop.install-function=install_cdh_hadoop
whirr.hadoop.configure-function=configure_cdh_hadoop
# Use a specific instance type.
# Specify the uuid of the CloudStack service offering to use for the instances of your hadoop cluster
whirr.hardware-id=b6cd1ff5-3a2f-4e9d-a4d1-8988c1191fe8
# If you use ssh key pairs to access instances in the cloud
# specify them like so
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa_exoscale
whirr.public-key-file=${whirr.private-key-file}.pub
# Specify the template to use for the instances
# This is the uuid of the CloudStack template
whirr.image-id=1d16c78d-268f-47d0-be0c-b80d31e765d2
------------------------------------------------------
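The whirr.hardware-id and whirr.image-id values above are specific to the cloud being used: they are the UUIDs of the CloudStack service offering and template chosen for the cluster instances. One way to look up the UUIDs valid for your account (a sketch, assuming the CloudStack cloudmonkey CLI is installed and configured against the same endpoint) is:

-------------------------------------------------------
# List the service offerings (instance types) and their UUIDs
$ cloudmonkey list serviceofferings
# List the templates you are allowed to deploy, with their UUIDs
$ cloudmonkey list templates templatefilter=executable
-------------------------------------------------------

The same information is also available in the CloudStack web UI.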
To launch this Hadoop cluster use the whirr command line:
$whirr launch-cluster --config hadoop.properties
The following example output shows the instances being started and bootstrapped. At the end of the provisioning, whirr returns the ssh commands that can be used to access the hadoop instances.
-------------------
Running on provider cloudstack using identity mnH5EbKcKeJd456456345634563456345654634563456345
Bootstrapping cluster
Configuring template for bootstrap-hadoop-datanode_hadoop-tasktracker
Configuring template for bootstrap-hadoop-namenode_hadoop-jobtracker
Starting 3 node(s) with roles [hadoop-datanode, hadoop-tasktracker]
Starting 1 node(s) with roles [hadoop-namenode, hadoop-jobtracker]
>> running InitScript{INSTANCE_NAME=bootstrap-hadoop-datanode_hadoop-tasktracker} on node(b9457a87-5890-4b6f-9cf3-1ebd1581f725)
>> running InitScript{INSTANCE_NAME=bootstrap-hadoop-datanode_hadoop-tasktracker} on node(9d5c46f8-003d-4368-aabf-9402af7f8321)
>> running InitScript{INSTANCE_NAME=bootstrap-hadoop-datanode_hadoop-tasktracker} on node(6727950e-ea43-488d-8d5a-6f3ef3018b0f)
>> running InitScript{INSTANCE_NAME=bootstrap-hadoop-namenode_hadoop-jobtracker} on node(6a643851-2034-4e82-b735-2de3f125c437)
<< success executing InitScript{INSTANCE_NAME=bootstrap-hadoop-datanode_hadoop-tasktracker} on node(b9457a87-5890-4b6f-9cf3-1ebd1581f725): {output=This function does nothing. It just needs to exist so Statements.call("retry_helpers") doesn't call something which doesn't exist
Get:1 http://security.ubuntu.com precise-security Release.gpg [198 B]
Get:2 http://security.ubuntu.com precise-security Release [49.6 kB]
Hit http://ch.archive.ubuntu.com precise Release.gpg
Get:3 http://ch.archive.ubuntu.com precise-updates Release.gpg [198 B]
Get:4 http://ch.archive.ubuntu.com precise-backports Release.gpg [198 B]
Hit http://ch.archive.ubuntu.com precise Release
..../snip/.....
You can log into instances using the following ssh commands:
[hadoop-datanode+hadoop-tasktracker]: ssh -i /Users/sebastiengoasguen/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no [email protected]
[hadoop-datanode+hadoop-tasktracker]: ssh -i /Users/sebastiengoasguen/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no [email protected]
[hadoop-datanode+hadoop-tasktracker]: ssh -i /Users/sebastiengoasguen/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no [email protected]
[hadoop-namenode+hadoop-jobtracker]: ssh -i /Users/sebastiengoasguen/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no [email protected]
-----------
To destroy the cluster from your client do:
$whirr destroy-cluster --config hadoop.properties
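To double-check that the cloud resources have actually been released, you can list the remaining instances directly against the CloudStack API, for example with the cloudmonkey CLI used earlier to look up UUIDs (any CloudStack client will do):

$ cloudmonkey list virtualmachines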
Whirr gives you the ssh commands to connect to the instances of your hadoop cluster. Log in to the namenode and browse the hadoop file system that was created:
$ hadoop fs -ls /
Found 5 items
drwxrwxrwx   - hdfs supergroup          0 2013-06-21 20:11 /hadoop
drwxrwxrwx   - hdfs supergroup          0 2013-06-21 20:10 /hbase
drwxrwxrwx   - hdfs supergroup          0 2013-06-21 20:10 /mnt
drwxrwxrwx   - hdfs supergroup          0 2013-06-21 20:11 /tmp
drwxrwxrwx   - hdfs supergroup          0 2013-06-21 20:11 /user
Create a directory to put your input data in:
$ hadoop fs -mkdir input
$ hadoop fs -ls /user/sebastiengoasguen
Found 1 items
drwxr-xr-x   - sebastiengoasguen supergroup          0 2013-06-21 20:15 /user/sebastiengoasguen/input
Create a test input file and put it in the hadoop file system:
$ cat foobar
this is a test to count the words
$ hadoop fs -put ./foobar input
$ hadoop fs -ls /user/sebastiengoasguen/input
Found 1 items
-rw-r--r--   3 sebastiengoasguen supergroup         34 2013-06-21 20:17 /user/sebastiengoasguen/input/foobar
Define the map-reduce environment. Note that this default Cloudera distribution installation uses MRv1. To use Yarn one would have to edit the hadoop.properties file.
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce
Start the map-reduce job:
$ hadoop jar $HADOOP_MAPRED_HOME/hadoop-examples.jar wordcount input output
13/06/21 20:19:59 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/06/21 20:20:00 INFO input.FileInputFormat: Total input paths to process : 1
13/06/21 20:20:00 INFO mapred.JobClient: Running job: job_201306212011_0001
13/06/21 20:20:01 INFO mapred.JobClient:  map 0% reduce 0%
13/06/21 20:20:11 INFO mapred.JobClient:  map 100% reduce 0%
13/06/21 20:20:17 INFO mapred.JobClient:  map 100% reduce 33%
13/06/21 20:20:18 INFO mapred.JobClient:  map 100% reduce 100%
13/06/21 20:20:21 INFO mapred.JobClient: Job complete: job_201306212011_0001
13/06/21 20:20:22 INFO mapred.JobClient: Counters: 32
13/06/21 20:20:22 INFO mapred.JobClient:   File System Counters
13/06/21 20:20:22 INFO mapred.JobClient:     FILE: Number of bytes read=133
13/06/21 20:20:22 INFO mapred.JobClient:     FILE: Number of bytes written=766347
13/06/21 20:20:22 INFO mapred.JobClient:     FILE: Number of read operations=0
13/06/21 20:20:22 INFO mapred.JobClient:     FILE: Number of large read operations=0
13/06/21 20:20:22 INFO mapred.JobClient:     FILE: Number of write operations=0
13/06/21 20:20:22 INFO mapred.JobClient:     HDFS: Number of bytes read=157
13/06/21 20:20:22 INFO mapred.JobClient:     HDFS: Number of bytes written=50
13/06/21 20:20:22 INFO mapred.JobClient:     HDFS: Number of read operations=2
13/06/21 20:20:22 INFO mapred.JobClient:     HDFS: Number of large read operations=0
13/06/21 20:20:22 INFO mapred.JobClient:     HDFS: Number of write operations=3
13/06/21 20:20:22 INFO mapred.JobClient:   Job Counters
13/06/21 20:20:22 INFO mapred.JobClient:     Launched map tasks=1
13/06/21 20:20:22 INFO mapred.JobClient:     Launched reduce tasks=3
13/06/21 20:20:22 INFO mapred.JobClient:     Data-local map tasks=1
13/06/21 20:20:22 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=10956
13/06/21 20:20:22 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=15446
13/06/21 20:20:22 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/06/21 20:20:22 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/06/21 20:20:22 INFO mapred.JobClient:   Map-Reduce Framework
13/06/21 20:20:22 INFO mapred.JobClient:     Map input records=1
13/06/21 20:20:22 INFO mapred.JobClient:     Map output records=8
13/06/21 20:20:22 INFO mapred.JobClient:     Map output bytes=66
13/06/21 20:20:22 INFO mapred.JobClient:     Input split bytes=123
13/06/21 20:20:22 INFO mapred.JobClient:     Combine input records=8
13/06/21 20:20:22 INFO mapred.JobClient:     Combine output records=8
13/06/21 20:20:22 INFO mapred.JobClient:     Reduce input groups=8
13/06/21 20:20:22 INFO mapred.JobClient:     Reduce shuffle bytes=109
13/06/21 20:20:22 INFO mapred.JobClient:     Reduce input records=8
13/06/21 20:20:22 INFO mapred.JobClient:     Reduce output records=8
13/06/21 20:20:22 INFO mapred.JobClient:     Spilled Records=16
13/06/21 20:20:22 INFO mapred.JobClient:     CPU time spent (ms)=1880
13/06/21 20:20:22 INFO mapred.JobClient:     Physical memory (bytes) snapshot=469413888
13/06/21 20:20:22 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=5744541696
13/06/21 20:20:22 INFO mapred.JobClient:     Total committed heap usage (bytes)=207687680
And you can finally check the output:
$ hadoop fs -cat output/part-* | head
this 1
to 1
the 1
a 1
count 1
is 1
test 1
words 1
Of course this is a trivial example of a map-reduce job, and you will want to run much more substantial workloads. To benchmark your cluster you can use the TeraSort tools that ship in the Hadoop examples jar. Generate roughly 100 MB of input data with TeraGen (1,000,000 rows of 100 bytes each):
$hadoop jar $HADOOP_MAPRED_HOME/hadoop-examples.jar teragen 1000000 output3
Sort it with TeraSort:
$ hadoop jar $HADOOP_MAPRED_HOME/hadoop-examples.jar terasort output3 output4
And then validate the results with TeraValidate:
$hadoop jar $HADOOP_MAPRED_HOME/hadoop-examples.jar teravalidate output4 outvalidate
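TeraValidate is itself a map-reduce job and writes its report to the output directory given as its last argument; assuming the default part-file naming, the report can be inspected with:

$ hadoop fs -cat outvalidate/part-*

If the data is correctly sorted the report should contain only a checksum entry; out-of-order keys, if any, would be flagged there instead.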
Performance of map-reduce jobs run in Cloud based hadoop clusters will be highly dependent on the hadoop configuration, on the template and service offering being used, and of course on the underlying hardware of the Cloud. Hadoop was not designed to run in the Cloud, and some of its assumptions therefore do not fit the Cloud model; see [6] for more information. Deploying Hadoop in the Cloud is nevertheless a viable solution for on-demand map-reduce applications. Development work is currently under way within the Google Summer of Code program to provide CloudStack with an Amazon Elastic MapReduce (EMR) compatible service. This service will be based on Whirr or on a new Amazon CloudFormation compatible interface called StackMate [7].
[1] http://whirr.apache.org
[2] http://jclouds.incubator.apache.org
[3] http://www.apache.org/dyn/closer.cgi/whirr/
[4] http://exoscale.ch
[5] http://whirr.apache.org/docs/0.8.2/configuration-guide.html
[6] http://wiki.apache.org/hadoop/Virtual%20Hadoop
[7] https://github.com/chiradeep/stackmate