Example use cases

A couple of example use cases are described below.

We assume that the hod command is readily available in the environment. If it is not available by default, you may need to load a module first: check which hod or hanythingondemand modules are available via module avail, and load one of them using module load.

To check, just run hod without arguments, which should produce basic usage information (see hod command).
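
The same check can be scripted. The following is a minimal sketch, assuming a POSIX shell; the helper name have_cmd is hypothetical and not part of HOD:

```shell
# Return success if the given command can be found on the PATH.
# (have_cmd is a hypothetical helper, not part of HOD.)
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd hod; then
    echo "hod is available"
else
    echo "hod not found: check 'module avail' and load a hod or hanythingondemand module"
fi
```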

Common aspects

Configuring HOD

You should configure HOD by defining the HOD work directory and specifying which module should be loaded in the HOD job being submitted (see also Configuration options for hod create).

To configure hod batch, you can set the following environment variables:

$ export HOD_BATCH_HOD_MODULE=hanythingondemand/3.0.0-intel-2015b-Python-2.7.10
$ export HOD_BATCH_WORKDIR=$VSC_SCRATCH/hod

Likewise, for hod create:

$ export HOD_CREATE_HOD_MODULE=hanythingondemand/3.0.0-intel-2015b-Python-2.7.10
$ export HOD_CREATE_WORKDIR=$VSC_SCRATCH/hod

If HOD is provided via an environment module, it is likely that the module already sets sensible default values for these variables.

The examples below will assume that this configuration is in place already.
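
To verify that the four variables mentioned above are set before submitting anything, a small check along the following lines may help. This is only a sketch; the function name missing_hod_vars is hypothetical and not part of HOD:

```shell
# Print the names of the HOD configuration variables that are unset or empty.
# (missing_hod_vars is a hypothetical helper, not part of HOD.)
missing_hod_vars() {
    for var in HOD_BATCH_HOD_MODULE HOD_BATCH_WORKDIR \
               HOD_CREATE_HOD_MODULE HOD_CREATE_WORKDIR; do
        # look up the value of the variable whose name is stored in $var
        eval "value=\${$var:-}"
        [ -n "$value" ] || echo "$var"
    done
}

missing=$(missing_hod_vars)
if [ -n "$missing" ]; then
    echo "Unset HOD configuration variables:"
    echo "$missing"
else
    echo "HOD configuration variables are all set"
fi
```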

Available distributions

To get an overview of the readily available HOD distributions, from which you can pick a value for --dist, use hod dists (slightly trimmed output):

$ hod dists
* HBase-1.0.2
    modules: HBase/1.0.2, Hadoop/2.6.0-cdh5.4.5-native
...
* Hadoop-2.6.0-cdh5.4.5-native
    modules: Hadoop/2.6.0-cdh5.4.5-native
...
* Jupyter-notebook-5.1.0
    modules: Hadoop/2.6.0-cdh5.8.0-native, Spark/2.0.0, IPython/5.1.0-intel-2016b-Python-2.7.12, matplotlib/1.5.1-intel-2016b-Python-2.7.12
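
If only the distribution names are needed, the output can be filtered. The sketch below assumes that each distribution is introduced by a line starting with "* ", as in the output above; dist_names is a hypothetical helper, and the sample input is copied from that output so the snippet runs without hod:

```shell
# Print only the distribution names from `hod dists` output read on stdin.
# Assumes each distribution is introduced by a line starting with "* ".
dist_names() {
    sed -n 's/^\* //p'
}

# Sample input copied from the trimmed output above.
# On a real system: hod dists | dist_names
dist_names <<'EOF'
* HBase-1.0.2
    modules: HBase/1.0.2, Hadoop/2.6.0-cdh5.4.5-native
* Hadoop-2.6.0-cdh5.4.5-native
    modules: Hadoop/2.6.0-cdh5.4.5-native
EOF
```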

Interactively using a Hadoop cluster

To interactively use an HOD cluster, you should:

  1. create an HOD cluster, using hod create
  2. connect to it once it is up and running, using hod connect
  3. execute your commands

See the example below for more details; basic usage information for hod create is available at hod create.
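
Waiting for the cluster to start can also be scripted. The sketch below assumes that hod list prints the columns shown in the WordCount example (label, job ID, state, hosts); cluster_state is a hypothetical helper, and the sample text stands in for real hod list output:

```shell
# Print the state column for the cluster with the given label,
# reading `hod list` output on stdin.
# (cluster_state is a hypothetical helper, not part of HOD.)
cluster_state() {
    awk -v label="$1" '$1 == label { print $3 }'
}

# Sample `hod list` output, copied from the WordCount example
sample='Cluster label       Job ID                              State       Hosts
hod_hadoop          12345.master15.delcatty.gent.vsc        R       node2001.delcatty.gent.vsc'

printf '%s\n' "$sample" | cluster_state hod_hadoop

# On a real system, one could poll until the cluster runs, then connect:
#   while [ "$(hod list | cluster_state hod_hadoop)" != "R" ]; do sleep 30; done
#   hod connect hod_hadoop
```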

Using screen

To interactively start commands that may require some time to finish, we strongly recommend starting a so-called screen session after connecting to the HOD cluster.

Basic usage:

  • to start a screen session, simply run the screen command; to specify a name for the session, use screen -S <name>
  • to get an overview of running screen sessions, use screen -ls
  • to detach from a screen session, with the option to later reattach to it, use the Ctrl-A-D key combination
  • to end a screen session, simply type exit (no reattaching possible later!)
  • to reconnect to a screen session, use screen -r <name>; or simply use screen -r if there’s only one running screen session

More information about screen is available at http://www.gnu.org/software/screen/manual/screen.html.

Example: Hadoop WordCount

In the example below, we create a Hadoop HOD cluster, connect to it, and run the standard WordCount example Hadoop job.

  • create a Hadoop HOD cluster labelled hod_hadoop:

    $ hod create --dist Hadoop-2.5.0-cdh5.3.1-native --label hod_hadoop
    
    Submitting HOD cluster with label 'hod_hadoop'...
    Job submitted: Jobid 12345.master15.delcatty.gent.vsc state Q ehosts
    
  • check the status of the HOD cluster ('Q' for queued, 'R' for running):

    $ hod list
    
    Cluster label       Job ID                              State       Hosts
    hod_hadoop          12345.master15.delcatty.gent.vsc        Q
    
    ...
    
    $ hod list
    
    Cluster label       Job ID                              State       Hosts
    hod_hadoop          12345.master15.delcatty.gent.vsc        R       node2001.delcatty.gent.vsc
    
  • connect to the running HOD cluster:

    $ hod connect hod_hadoop
    
    Connecting to HOD cluster with label 'hod_hadoop'...
    Job ID found: 12345.master15.delcatty.gent.vsc
    HOD cluster 'hod_hadoop' @ job ID 12345.master15.delcatty.gent.vsc appears to be running...
    Setting up SSH connection to node2001.delcatty.gent.vsc...
    Welcome to your hanythingondemand cluster (label: hod_hadoop)
    
    Relevant environment variables:
    HADOOP_CONF_DIR=/user/scratch/gent/vsc400/vsc40000/hod/hod/12345.master15.delcatty.gent.vsc/vsc40000.node2001.delcatty.os.26323/conf
    HADOOP_HOME=/apps/gent/CO7/haswell-ib/software/Hadoop/2.5.0-cdh5.3.1-native/share/hadoop/mapreduce
    HOD_LOCALWORKDIR=/user/scratch/gent/vsc400/vsc40000/hod/hod/12345.master15.delcatty.gent.vsc/vsc40000.node2001.delcatty.os.26323
    
    List of loaded modules:
    Currently Loaded Modulefiles:
      1) cluster/delcatty(default)        2) Java/1.7.0_76                  3) Hadoop/2.5.0-cdh5.3.1-native
    
  • run Hadoop WordCount example

    • change to local work directory of this cluster:

      $ cd $HOD_LOCALWORKDIR
      
    • download example input file for wordcount:

      $ curl http://www.gutenberg.org/files/98/98.txt -o tale-of-two-cities.txt
      
    • build WordCount.jar (note: assumes that $HOME/WordCount.java is available):

      $ cp $HOME/WordCount.java .
      $ javac -classpath $(hadoop classpath) WordCount.java
      $ jar cf WordCount.jar WordCount*.class
      
    • run WordCount Hadoop example:

      $ hadoop jar WordCount.jar WordCount tale-of-two-cities.txt wordcount.out
      # (output omitted)
      
    • query result:

      $ grep ^city wordcount.out/part-r-00000
      city    20
      city,   9
      city.   5
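
Beyond grep, the result file can be sorted to find the most frequent words. The sketch below assumes one tab-separated word/count pair per line in the part-r-00000 file; top_words is a hypothetical helper, and the demo runs on a three-line sample taken from the query result above:

```shell
# Print the N (default 10) most frequent words from a WordCount output file.
# Assumes one "word<TAB>count" pair per line.
# (top_words is a hypothetical helper.)
top_words() {
    sort -t "$(printf '\t')" -k2,2nr "$1" | head -n "${2:-10}"
}

# Demo on a small sample taken from the query result above
sample=$(mktemp)
printf 'city.\t5\ncity\t20\ncity,\t9\n' > "$sample"
top_words "$sample" 2
rm -f "$sample"

# On the cluster: top_words wordcount.out/part-r-00000
```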
      

Running a batch script on a Hadoop cluster

Since running a pre-defined set of commands is a common pattern, HOD also supports an alternative to creating an HOD cluster and using it interactively.

Via hod batch, a script can be provided that should be executed on an HOD cluster. In this mode, HOD will:

  • start an HOD cluster with the specified configuration (working directory, HOD distribution, etc.)
  • execute the provided script
  • automatically destroy the cluster once the script has finished running

This removes the need to wait until the cluster actually starts running and then enter the commands interactively.

See also the example below; basic usage information for hod batch is available at hod batch --script=<script-name>.

Example: Hadoop WordCount

The classic Hadoop WordCount can be run using the following script (wordcount.sh) on an HOD cluster:

#!/bin/bash

# move to the local working directory of the HOD cluster on which this script is run
export WORKDIR=$VSC_SCRATCH/$PBS_JOBID
mkdir -p $WORKDIR
cd $WORKDIR

# download example input file for wordcount
curl http://www.gutenberg.org/files/98/98.txt -o tale-of-two-cities.txt

# build WordCount.jar (note: assumes that $HOME/WordCount.java is available)
cp $HOME/WordCount.java .
javac -classpath $(hadoop classpath) WordCount.java
jar cf WordCount.jar WordCount*.class

# run WordCount Hadoop example
hadoop jar WordCount.jar WordCount tale-of-two-cities.txt wordcount.out

# copy results
cp -a wordcount.out $HOME/$PBS_JOBNAME.$PBS_JOBID

Note

No modules need to be loaded in order to make sure the required software is available (i.e., Java, Hadoop). Setting up the working environment in which the job will be run is done right after starting the HOD cluster.

To check which modules are/will be available, you can use module list in the script you supply to hod batch or check the details of the HOD distribution you use via hod clone <dist-name> <destination>.

To run this script on a Hadoop cluster, we can submit it via hod batch:

$ hod batch --dist Hadoop-2.5.0-cdh5.3.1-native --script $PWD/wordcount.sh --label wordcount
Submitting HOD cluster with label 'wordcount'...
Job submitted: Jobid 12345.master15.delcatty.gent.vsc state Q ehosts

$ hod list
Cluster label       Job ID                              State       Hosts
wordcount           12345.master15.delcatty.gent.vsc        R       node2001.delcatty.gent.vsc

Once the script is finished, the HOD cluster will destroy itself, and the job running it will end:

$ hod list
Cluster label       Job ID                              State               Hosts
wordcount           12345.master15.delcatty.gent.vsc        <job-not-found> <none>

Hence, the results should be available (see the cp at the end of the submitted script):

$ ls -l $HOME/HOD_wordcount.12345.master15.delcatty.gent.vsc
total 416
-rw-r--r-- 1 example  example  210041 Oct 22 13:34 part-r-00000
-rw-r--r-- 1 example  example       0 Oct 22 13:34 _SUCCESS

$ grep ^city $HOME/HOD_wordcount.12345.master15.delcatty.gent.vsc/part-r-00000
city        20
city,       9
city.       5
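
Since the cluster is gone by the time the results are inspected, it can be useful to check for the _SUCCESS marker file shown in the listing above before relying on the output. A minimal sketch, in which job_succeeded is a hypothetical helper and the directory name is the one from this example:

```shell
# Return success if the given Hadoop output directory contains a _SUCCESS marker.
# (job_succeeded is a hypothetical helper.)
job_succeeded() {
    [ -f "$1/_SUCCESS" ]
}

# Directory name taken from the example above
results="$HOME/HOD_wordcount.12345.master15.delcatty.gent.vsc"
if job_succeeded "$results"; then
    echo "WordCount finished successfully"
else
    echo "no _SUCCESS marker in $results"
fi
```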

Note

To get an email when the HOD cluster is started/stopped, use the -m option; see --job-mail/-m <string>.

Connecting to an IPython notebook running on an HOD cluster

Running an IPython notebook on an HOD cluster is as simple as creating an HOD cluster using the appropriate distribution, and then connecting to the IPython notebook over an SSH tunnel.

For example:

  • create HOD cluster using an IPython HOD distribution:

    $ hod create --dist IPython-notebook-3.2.1 --label ipython_example
    Submitting HOD cluster with label 'ipython_example'...
    Job submitted: Jobid 12345.master15.delcatty.gent.vsc state Q ehosts
    
  • determine head node of HOD cluster:

    $ hod list
    Cluster label       Job ID                              State       Hosts
    ipython_example 12345.master15.delcatty.gent.vsc    R       node2001.delcatty.gent.vsc
    
  • connect to IPython notebook by pointing your web browser to http://localhost:8888, using a SOCKS proxy over an SSH tunnel to the head node node2001.delcatty.gent.vsc (see Connecting to web user interfaces for detailed information)
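
As a rough illustration of the SSH side of this step, the sketch below only prints an ssh command that would open a SOCKS proxy on a local port (-D) without running a remote command (-N). The helper name socks_proxy_cmd, the host login.example.org and the port 10000 are hypothetical placeholders; the settings that actually apply are described in Connecting to web user interfaces.

```shell
# Print an ssh command that opens a SOCKS proxy on a local port.
# Arguments: user name, login host, local port (all placeholders).
# (socks_proxy_cmd is a hypothetical helper, not part of HOD.)
socks_proxy_cmd() {
    echo "ssh -N -D $3 $1@$2"
}

# Hypothetical values; replace with your own account, login node and port
socks_proxy_cmd vsc40000 login.example.org 10000
```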