If your job didn’t work as expected, you’ll need to check the logs.

It’s important to realise that both hanythingondemand itself and the services it is running (e.g. Hadoop) produce logs.

Which logs you should be diving into depends on the information you are looking for or the kind of problems you run into.

hanythingondemand logs

For hanythingondemand itself, there are three places to consider:

  1. When submitting your job to start the cluster, hanythingondemand logs to your terminal session. Typical errors at this stage are:
    • PBS isn’t running or isn’t accessible. If so, contact your administrators.
    • Your environment is broken, e.g. you are using a Python installation built for a particular cluster that doesn’t work on the login node.
  2. If PBS is accessible and tries to run the job, but the job fails to start properly (e.g. due to a problem with MPI), you will see errors in Hanythingondemand.e${PBS_JOBID}. This file is written to the directory from which you submitted the job.
  3. Once PBS starts your job, hanythingondemand logs to hod.output.$(hostname).$(pid). If your service configuration files have problems (e.g. typos in the commands, bad paths, etc.), the errors will appear here. For example, if a service failed to start, you will see a message in the logs saying: Problem occured with cmd.
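
A quick way to locate the second and third kinds of file after a failed run is to scan the submission directory for them. This is only a sketch: the glob patterns below follow the file names described above, and the job id, hostname and PID will differ per job.

```shell
# Scan the current (submission) directory for HOD's own log files,
# matching the naming patterns described above:
#   Hanythingondemand.e<jobid>  and  hod.output.<hostname>.<pid>
for f in Hanythingondemand.e* hod.output.*; do
  if [ -e "$f" ]; then
    echo "found log: $f"
  fi
done
echo "done scanning for HOD logs"
```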

Service logs

Hadoop logs

By default, the log files for a Hadoop cluster are in $HOD_LOCALWORKDIR/log, where $HOD_LOCALWORKDIR is an environment variable set by hod connect.

Expanded, this is in the workdir of the HOD cluster as follows: $workdir/$PBS_JOBID/${USER}.${HOSTNAME}.${PID}/log
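
As an illustration, the expanded path can be reconstructed from its components like this. The workdir and job id below are placeholders, and this shell's own PID stands in for the HOD process PID:

```shell
# Reconstruct the HOD log directory from its components:
#   $workdir/$PBS_JOBID/${USER}.${HOSTNAME}.${PID}/log
# workdir and jobid are placeholder values for illustration only.
workdir=/tmp/hod-example
jobid=12345.master
logdir="$workdir/$jobid/$(id -un).$(hostname).$$/log"
echo "$logdir"
```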

One of the advantages of having the log files on a parallel file system is that you no longer need special log-aggregation tools (Flume, Logstash, Loggly, etc.), since the logs for all the nodes are in a single directory structure.

Hadoop logs have two components:

  1. Service logs: These are in $HOD_LOCALWORKDIR/log. Examples are: yarn-username-resourcemanager-node.domain.out, yarn-username-nodemanager-node.domain.out.

  2. Container logs: Each piece of Hadoop work takes place in a container, and output from your program will appear in these files. They are organized per application and per container, with a stdout and stderr file for each, e.g. application_<application-id>/container_<container-id>/stdout (the IDs here are placeholders).
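
Because all of these logs sit in one directory tree, hunting for failures across every container at once is a single recursive grep. A sketch, assuming $HOD_LOCALWORKDIR is set as described above:

```shell
# List all service and container log files mentioning "exception",
# in one pass over the whole log tree.
# Assumes $HOD_LOCALWORKDIR was set by `hod connect`.
grep -ril "exception" "$HOD_LOCALWORKDIR/log" 2>/dev/null
echo "log scan finished"
```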


IPython logs

IPython logs to stdout and stderr. These are sent by hanythingondemand to $HOD_LOCALWORKDIR/log/pyspark.stdout and $HOD_LOCALWORKDIR/log/pyspark.stderr.
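
While connected to the cluster you can inspect these files directly, for instance (again assuming $HOD_LOCALWORKDIR is set by hod connect; the fallback message is just a convenience for when the notebook has not started yet):

```shell
# Show the most recent IPython/PySpark error output, or a short
# message if the log file does not exist yet.
tail -n 50 "$HOD_LOCALWORKDIR/log/pyspark.stderr" 2>/dev/null \
  || echo "pyspark.stderr not found"
```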

hod batch logs

Logs for your script running under hod batch are found in your $PBS_O_WORKDIR in: <script-name>.o<$PBS_JOBID> and <script-name>.e<$PBS_JOBID>.

If you want to watch the progress of your job while it’s running, write your script so that it pipes its output through the tee command, which both prints the output and copies it to a file you can follow.
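
A minimal sketch of such a batch script (the log file name is a placeholder; everything still ends up in <script-name>.o<$PBS_JOBID> as well):

```shell
#!/bin/sh
# Duplicate this script's output with tee: PBS still captures it in
# <script-name>.o<$PBS_JOBID>, but it is also appended to a log file
# you can follow with `tail -f` while the job runs.
# $PBS_O_WORKDIR is set by PBS; fall back to the current directory
# when running outside a job.
LOGFILE="${PBS_O_WORKDIR:-.}/progress.log"
{
  echo "job started on $(hostname)"
  # ... your actual work goes here ...
  echo "job finished"
} 2>&1 | tee -a "$LOGFILE"
```

You can then run tail -f progress.log from a login node to watch the job live.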