Example use cases¶
A couple of example use cases are described below.
We assume that the hod command is readily available in the environment. If it is not available by default, you probably need to load a module first: check which hod or hanythingondemand modules are available via module avail, and load one of them using module load.
To check, run hod without arguments, which should produce basic usage information (see hod command).
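As a minimal sketch, the availability check can be wrapped in a small shell function. The name check_hod is hypothetical and not part of HOD; the function only tests whether a hod command can be found, and otherwise points at the module commands mentioned above.

```shell
# Hypothetical helper (not part of HOD): succeed if a 'hod' command can be found.
check_hod() {
    if command -v hod >/dev/null 2>&1; then
        echo "hod found: $(command -v hod)"
    else
        echo "hod not found; try 'module avail' and 'module load' first" >&2
        return 1
    fi
}
```

Running check_hod before any of the examples below gives an early, readable failure instead of a "command not found" error halfway through a session.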
Common aspects¶
Configuring HOD¶
You should configure HOD by defining the HOD work directory and specifying which module should be loaded in the HOD job being submitted (see also Configuration options for hod create).
To configure hod batch, you can set the following environment variables:
$ export HOD_BATCH_HOD_MODULE=hanythingondemand/3.0.0-intel-2015b-Python-2.7.10
$ export HOD_BATCH_WORKDIR=$VSC_SCRATCH/hod
Likewise, for hod create:
$ export HOD_CREATE_HOD_MODULE=hanythingondemand/3.0.0-intel-2015b-Python-2.7.10
$ export HOD_CREATE_WORKDIR=$VSC_SCRATCH/hod
If HOD is being provided via an environment module, it is likely that the module provides decent default values for these already.
The examples below will assume that this configuration is in place already.
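Since hod batch and hod create take the same two settings, one way to keep them in sync is a small function that sets all four variables at once. This is only a sketch: the function name configure_hod is hypothetical, and it assumes you want the same module and work directory for both subcommands.

```shell
# Hypothetical helper (not part of HOD): set the module and work directory
# for both 'hod batch' and 'hod create' in one go.
#   $1: name of the HOD module to load in the job
#   $2: HOD work directory
configure_hod() {
    local sub
    for sub in BATCH CREATE; do
        export "HOD_${sub}_HOD_MODULE=$1"
        export "HOD_${sub}_WORKDIR=$2"
    done
}
```

With the values used above, the call would be configure_hod hanythingondemand/3.0.0-intel-2015b-Python-2.7.10 "$VSC_SCRATCH/hod".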
Available distributions¶
To get an overview of the readily available HOD distributions, from which you can pick a value to pass to --dist, use hod dists (slightly trimmed output):
$ hod dists
* HBase-1.0.2
modules: HBase/1.0.2, Hadoop/2.6.0-cdh5.4.5-native
...
* Hadoop-2.6.0-cdh5.4.5-native
modules: Hadoop/2.6.0-cdh5.4.5-native
...
* Jupyter-notebook-5.1.0
modules: Hadoop/2.6.0-cdh5.8.0-native, Spark/2.0.0, IPython/5.1.0-intel-2016b-Python-2.7.12, matplotlib/1.5.1-intel-2016b-Python-2.7.12
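If only the distribution names are needed, for example to paste one into --dist, the listing can be filtered. The sketch below assumes that every distribution is printed on a line starting with "* ", as in the trimmed output above; the function name list_dist_names is hypothetical, and the real output format may differ between HOD versions.

```shell
# Hypothetical helper (not part of HOD): read 'hod dists' output on stdin and
# print only the distribution names, i.e. the lines that start with '* '.
list_dist_names() {
    sed -n 's/^[[:space:]]*\*[[:space:]]*//p'
}
```

It is meant to be used in a pipeline: hod dists | list_dist_names.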
Interactively using a Hadoop cluster¶
To interactively use an HOD cluster, you should:
- create an HOD cluster, using hod create
- connect to it once it is up and running, using hod connect
- execute your commands
See the example below for more details; basic usage information for hod create is available at hod create.
Using screen¶
To interactively start commands that may require some time to finish, we strongly recommend starting a so-called screen session after connecting to the HOD cluster.
Basic usage:
- to start a screen session, simply run the screen command; to specify a name for the session, use screen -S <name>
- to get an overview of running screen sessions, use screen -ls
- to detach from a screen session, with the option to later reattach to it, use the Ctrl-A-D key combination
- to end a screen session, simply type exit (no reattaching possible later!)
- to reconnect to a screen session, use screen -r <name>; or simply use screen -r if there's only one running screen session
More information about screen is available at http://www.gnu.org/software/screen/manual/screen.html.
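For a long-running command that needs no interaction, screen can also start the session already detached. The sketch below wraps this in a hypothetical function run_in_screen (not part of HOD or screen); it relies on screen's -d -m options, which create a new session without attaching to it, combined with -S to name the session.

```shell
# Hypothetical helper: run a command in a new, detached, named screen session.
#   $1: session name; remaining arguments: the command to run and its arguments
run_in_screen() {
    local name=$1
    shift
    screen -dmS "$name" "$@"
}
```

For example, run_in_screen wordcount hadoop jar WordCount.jar WordCount tale-of-two-cities.txt wordcount.out starts the job in a session named wordcount, which can be inspected later with screen -r wordcount. Note that the session ends as soon as the command finishes.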
Example: Hadoop WordCount¶
In the example below, we create a Hadoop HOD cluster, connect to it, and run the standard WordCount example Hadoop job.
- create a Hadoop HOD cluster labelled hod_hadoop:
  $ hod create --dist Hadoop-2.5.0-cdh5.3.1-native --label hod_hadoop
  Submitting HOD cluster with label 'hod_hadoop'...
  Job submitted: Jobid 12345.master15.delcatty.gent.vsc state Q ehosts
- check the status of the HOD cluster ('Q' for queued, 'R' for running):
  $ hod list
  Cluster label Job ID State Hosts
  hod_hadoop 12345.master15.delcatty.gent.vsc Q
  ...
  $ hod list
  Cluster label Job ID State Hosts
  hod_hadoop 12345.master15.delcatty.gent.vsc R node2001.delcatty.gent.vsc
- connect to the running HOD cluster:
  $ hod connect hod_hadoop
  Connecting to HOD cluster with label 'hod_hadoop'...
  Job ID found: 12345.master15.delcatty.gent.vsc
  HOD cluster 'hod_hadoop' @ job ID 12345.master15.delcatty.gent.vsc appears to be running...
  Setting up SSH connection to node2001.delcatty.gent.vsc...
  Welcome to your hanythingondemand cluster (label: hod_hadoop)
  Relevant environment variables:
  HADOOP_CONF_DIR=/user/scratch/gent/vsc400/vsc40000/hod/hod/12345.master15.delcatty.gent.vsc/vsc40000.node2001.delcatty.os.26323/conf
  HADOOP_HOME=/apps/gent/CO7/haswell-ib/software/Hadoop/2.5.0-cdh5.3.1-native/share/hadoop/mapreduce
  HOD_LOCALWORKDIR=/user/scratch/gent/vsc400/vsc40000/hod/hod/12345.master15.delcatty.gent.vsc/vsc40000.node2001.delcatty.os.26323
  List of loaded modules:
  Currently Loaded Modulefiles:
  1) cluster/delcatty(default) 2) Java/1.7.0_76 3) Hadoop/2.5.0-cdh5.3.1-native
- run the Hadoop WordCount example:
  - change to the local work directory of this cluster:
    $ cd $HOD_LOCALWORKDIR
  - download an example input file for WordCount:
    $ curl http://www.gutenberg.org/files/98/98.txt -o tale-of-two-cities.txt
  - build WordCount.jar (note: assumes that $HOME/WordCount.java is available):
    $ cp $HOME/WordCount.java .
    $ javac -classpath $(hadoop classpath) WordCount.java
    $ jar cf WordCount.jar WordCount*.class
  - run the WordCount Hadoop example:
    $ hadoop jar WordCount.jar WordCount tale-of-two-cities.txt wordcount.out
    # (output omitted)
  - query the result:
    $ grep ^city wordcount.out/part-r-00000
    city 20
    city, 9
    city. 5
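The waiting between hod create and hod connect can be scripted rather than checked by hand. The sketch below is not part of HOD: the function names are hypothetical, and it assumes that hod list prints one line per cluster with the label, job ID and state as the first three whitespace-separated fields, as in the example output above. It also assumes the job eventually reaches state R; if the job fails or is removed, the loop keeps polling until interrupted.

```shell
# Hypothetical helpers (not part of HOD).

# Read 'hod list' output on stdin and print the state of the cluster labelled $1.
cluster_state() {
    awk -v label="$1" '$1 == label { print $3 }'
}

# Poll 'hod list' until the cluster labelled $1 is running, then connect to it.
#   $2: polling interval in seconds (default: 30, an arbitrary choice)
connect_when_running() {
    local label=$1 interval=${2:-30}
    until [ "$(hod list | cluster_state "$label")" = "R" ]; do
        sleep "$interval"
    done
    hod connect "$label"
}
```

After submitting the cluster with hod create, running connect_when_running hod_hadoop would replace the repeated hod list calls shown above.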
Running a batch script on a Hadoop cluster¶
Since running a pre-defined set of commands is a common pattern, HOD also supports an alternative to creating an HOD cluster and using it interactively.
Via hod batch
, a script can be provided that should be executed on an HOD cluster. In this mode, HOD will:
- start an HOD cluster with the specified configuration (working directory, HOD distribution, etc.)
- execute the provided script
- automatically destroy the cluster once the script has finished running
This removes the need to wait until the cluster actually starts running and then enter the commands interactively.
See also the example below; basic usage information for hod batch is available at hod batch --script=<script-name>.
Example: Hadoop WordCount¶
The classic Hadoop WordCount can be run using the following script (wordcount.sh) on an HOD cluster:
#!/bin/bash
# create and move to a working directory for this job
export WORKDIR=$VSC_SCRATCH/$PBS_JOBID
mkdir -p $WORKDIR
cd $WORKDIR
# download example input file for wordcount
curl http://www.gutenberg.org/files/98/98.txt -o tale-of-two-cities.txt
# build WordCount.jar (note: assumes that $HOME/WordCount.java is available)
cp $HOME/WordCount.java .
javac -classpath $(hadoop classpath) WordCount.java
jar cf WordCount.jar WordCount*.class
# run WordCount Hadoop example
hadoop jar WordCount.jar WordCount tale-of-two-cities.txt wordcount.out
# copy results
cp -a wordcount.out $HOME/$PBS_JOBNAME.$PBS_JOBID
Note
No modules need to be loaded in order to make sure the required software is available (i.e., Java, Hadoop). Setting up the working environment in which the job will be run is done right after starting the HOD cluster.
To check which modules are/will be available, you can use module list in the script you supply to hod batch, or check the details of the HOD distribution you use via hod clone <dist-name> <destination>.
To run this script on a Hadoop cluster, we can submit it via hod batch:
$ hod batch --dist Hadoop-2.5.0-cdh5.3.1-native --script $PWD/wordcount.sh --label wordcount
Submitting HOD cluster with label 'wordcount'...
Job submitted: Jobid 12345.master15.delcatty.gent.vsc state Q ehosts
$ hod list
Cluster label Job ID State Hosts
wordcount 12345.master15.delcatty.gent.vsc R node2001.delcatty.gent.vsc
Once the script is finished, the HOD cluster will destroy itself, and the job running it will end:
$ hod list
Cluster label Job ID State Hosts
wordcount 12345.master15.delcatty.gent.vsc <job-not-found> <none>
Hence, the results should be available (see the cp at the end of the submitted script):
$ ls -l $HOME/HOD_wordcount.12345.master15.delcatty.gent.vsc
total 416
-rw-r--r-- 1 example example 210041 Oct 22 13:34 part-r-00000
-rw-r--r-- 1 example example 0 Oct 22 13:34 _SUCCESS
$ grep ^city $HOME/HOD_wordcount.12345.master15.delcatty.gent.vsc/part-r-00000
city 20
city, 9
city. 5
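Beyond grepping for a single word, the result file can be sorted by count. The sketch below uses a hypothetical function top_words (not part of HOD or Hadoop) and assumes that each line of part-r-00000 holds a word and its count separated by a tab, which is what Hadoop's default text output format writes.

```shell
# Hypothetical helper: print the most frequent words from a WordCount result file.
#   $1: result file with 'word<TAB>count' lines
#   $2: number of lines to print (default: 10)
top_words() {
    # sort on the second tab-separated field, numerically, highest count first
    sort -t "$(printf '\t')" -k2,2nr "$1" | head -n "${2:-10}"
}
```

For the job above, this would be called as top_words $HOME/HOD_wordcount.12345.master15.delcatty.gent.vsc/part-r-00000 20.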
Note
To get an email when the HOD cluster is started/stopped, use the -m option, see --job-mail/-m <string>.
Connecting to an IPython notebook running on an HOD cluster¶
Running an IPython notebook on an HOD cluster is as simple as creating an HOD cluster using the appropriate distribution, and then connecting to the IPython notebook over an SSH tunnel.
For example:
- create an HOD cluster using an IPython HOD distribution:
  $ hod create --dist IPython-notebook-3.2.1 --label ipython_example
  Submitting HOD cluster with label 'ipython_example'...
  Job submitted: Jobid 12345.master15.delcatty.gent.vsc state Q ehosts
- determine the head node of the HOD cluster:
  $ hod list
  Cluster label Job ID State Hosts
  ipython_example 12345.master15.delcatty.gent.vsc R node2001.delcatty.gent.vsc
- connect to the IPython notebook by pointing your web browser to http://localhost:8888, using a SOCKS proxy over an SSH tunnel to the head node node2001.delcatty.gent.vsc (see Connecting to web user interfaces for detailed information)
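As a rough sketch of the SOCKS proxy step, OpenSSH can open a local SOCKS port with its -D option. The function name start_socks_proxy and the default port 10000 are assumptions made for illustration; which host you can reach directly (the head node itself, or a login node in front of it) and how the browser must be configured depend on your site, so treat Connecting to web user interfaces as the authoritative procedure.

```shell
# Hypothetical helper: open a SOCKS proxy on a local port through SSH.
#   $1: host to connect to (e.g. the head node reported by 'hod list')
#   $2: local port for the SOCKS proxy (default: 10000, an arbitrary choice)
# -N: do not run a remote command; -D: dynamic (SOCKS) port forwarding.
start_socks_proxy() {
    ssh -N -D "${2:-10000}" "$1"
}
```

The command keeps running in the foreground while the proxy is in use; stop it with Ctrl-C once you are done with the notebook.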