Queueing more cleverly

In scientific analyses, you will often have a large number of differently named output files and want to analyze them with one file per job. So far, we have only used a basic Queue command, which queues multiple jobs that differ only in their process ID. HTCondor can also handle the common case in which jobs need to be queued per file name or per more complex set of configuration parameters.

In our simple example, we will render some 3D images: We have two rather complex scenes whose files live in separate directories.

For the rendering, we will use the open source renderer POV-Ray (Persistence of Vision Raytracer). Since 3D scenes are better made by people with more artistic sense than the average HTCondor user, I have used existing artwork available under the Creative Commons Attribution-Share Alike 3.0 Unported CC-BY-3.0 license.

Hence, we start with an attribution to the artists:

The artwork available as “dice” in this course was made available under Creative Commons Attribution-Share Alike 3.0 Unported CC-BY-3.0. The artwork is called “PNG transparency demonstration” and has been shared by user ed_g2s on Wikimedia at https://commons.wikimedia.org/wiki/File:PNG_transparency_demonstration_1.png.

For the course, I have added an additional file dice_movie.pov and a render_movie.ini which chooses some more light settings for creation of a movie. In addition, a render.ini file has been added to render a simple frame.

The artwork available as “mini_demo” in this course, including the scene and all materials, was made available under Creative Commons Attribution-Share Alike 3.0 Unported CC-BY-3.0. The artwork is called “Mini Cooper and Building” and has been shared by © 2004 Gilles Tran http://www.oyonale.com.

For the course, I have added an additional file demo_mini_movie.pov and a render_movie.ini which chooses some more light settings for creation of a movie. In addition, a render.ini file has been added to render a simple frame.

Again, we need a file to describe our job, and an actual job payload. We will use a flexible job payload (a shell script taking parameters) and use a single job description file for all scenes.

Save the following into a file of your choosing or use the file Debian12_render_scenes.jdl from the repository.

JobBatchName = Debian12_render_scenes
+ContainerOS = "Debian12"
+CephFS_IO   = "none"
+MaxRuntimeHours = 1

Scene = $Fdb(ScenePath)

Executable=render_pov_single.sh
Arguments = $(Scene)

Transfer_input_files = povray/$(Scene)
Transfer_output_files = $(Scene).png

Error                   = logs/err.$(ClusterId).$(Process)
Output                  = logs/out.$(ClusterId).$(Process)
Log                     = logs/log.$(ClusterId).$(Process)

Request_cpus = 4
Request_memory = 1000 MB
Request_disk = 100 MB

Queue ScenePath matching dirs (povray/*)

Save the following into a file of your choosing or use the file render_pov_single.sh from the repository.

#!/bin/bash

source /etc/profile
set -e
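# Name of the scene directory, passed as the first argument
SCENE="$1"

# Render inside the scene directory, then move the image up for output transfer
cd "${SCENE}"
povray +V render.ini
mv "${SCENE}.png" ..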

Please check that the shell script is executable; if it is not, run chmod +x render_pov_single.sh.

First, take a look at the job description file. Can you understand how it works? Some helpful pointers follow.

In general, if the syntax is unclear, you may want to check out the HTCondor documentation. You can check the HTCondor version used on your submission machine with condor_q -version.

For example, to find out what the strange magic line Scene = $Fdb(ScenePath) does, it is best to start from the HTCondor web page, since links to the HTCondor documentation are sadly not stable yet¹. As you might guess, $Something() is the syntax of a built-in function. You will find it explained in chapter 3.3.10.

Can you find out what it does, and why we might need it? To answer this question, you should also understand the Queue command. If in doubt, this is the right point in time to ask!
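If you want to check your answer, the following is a rough bash emulation of the two lines in question. It does not involve HTCondor at all; the function names fdb and preview_jobs are made up for this sketch, and it assumes that "matching dirs" hands over each directory with a trailing slash (for example povray/dice/), which is my reading of the documentation rather than something shown in this course.

```shell
#!/bin/bash
# Rough emulation only; fdb and preview_jobs are hypothetical helper names.

# Emulates $Fdb(path): "d" selects the last component of the directory part
# of the path, "b" drops the trailing slash from it.
fdb() {
    case "$1" in
        */*) ;;                  # path has a directory part, continue below
        *) echo ""; return ;;    # no directory part at all
    esac
    local dir="${1%/*}"          # directory part, without its trailing slash
    printf '%s\n' "${dir##*/}"   # last component of that directory part
}

# Emulates "Queue ScenePath matching dirs (povray/*)": one line per directory.
preview_jobs() {
    local base="$1" dir
    for dir in "$base"/*/; do
        [ -d "$dir" ] || continue    # the glob did not match anything
        printf 'ScenePath=%s Scene=%s\n' "$dir" "$(fdb "$dir")"
    done
}

# Usage, from the directory that contains povray/:
#   preview_jobs povray
```

Note that the trailing slash matters in this emulation: without it, the last directory component of povray/dice would be povray, not dice.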

As soon as everything is understood and you know what to expect, it is time to submit the jobs:

$ condor_submit Debian12_render_scenes.jdl
Submitting job(s)..
2 job(s) submitted to cluster 98.

These jobs may run for a little while, so let’s take the time to check on them!² POV-Ray produces some progress output on STDERR. You can access that live from your submit machine using (with 98.0 being the first job’s id):

$ condor_tail -no-stdout -stderr -f 98.0
Rendered 105472 of 480000 pixels (21%)

You can also ask for more output with:

$ condor_tail -no-stdout -stderr -maxbytes 100000 -f 98.0
...
Rendered 105472 of 480000 pixels (21%)

You can also check the log file of the job, and use condor_q to check resource usage:

$ condor_q -af:hj Cmd ResidentSetSize_RAW RequestMemory RequestCPUs DiskUsage_RAW RequestDisk Owner RemoteHost
 ID      Cmd                                                                    ResidentSetSize_RAW RequestMemory RequestCPUs DiskUsage_RAW RequestDisk Owner     RemoteHost             
  98.0   /home/student00/htcondor-bonn/files/render_pov_single.sh               undefined           1000          4           7             102400      student00 slot1_1@htcondor-t-wn-0
  98.1   /home/student00/htcondor-bonn/files/render_pov_single.sh               undefined           1000          4           53064         102400      student00 undefined

Check the status and resource consumption of those jobs. Do they match the requests made in the job description? What about the units?
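As a hint on the units: to my knowledge, HTCondor reports RequestMemory in MiB, while RequestDisk, DiskUsage_RAW and ResidentSetSize_RAW are given in KiB. Under that assumption, a small conversion helper (kib_to_mib is a made-up name) makes the numbers from the output above comparable to the requests in the job description:

```shell
#!/bin/bash
# Convert a value in KiB to MiB (integer division, so the result is rounded down).
kib_to_mib() {
    echo $(( $1 / 1024 ))
}

kib_to_mib 102400   # RequestDisk from the output above -> 100
kib_to_mib 53064    # DiskUsage_RAW of job 98.1         -> 51
```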

If a job does not start, you may also want to check out (for job id 98.0):

$ condor_q -better-analyze 98.0

Check out your results

As soon as the jobs have finished, you should find two new image files in your submit directory. The best way to look at them is to copy them to your local machine: on Linux or macOS, use scp or rsync; on Windows, either use the same commands in the Windows Subsystem for Linux (WSL) or a tool such as WinSCP. Once they have arrived, open them with a normal image viewer.

Queueing with a complex set of parameters

Finally, you may encounter very complex analysis tools in your scientific career that need many configuration parameters. We do not provide a hands-on example here, since the possibilities are endless. Instead, we present an example snippet of a JDL file and an accompanying input file that queue a complex set of jobs. The point is to remember which possibilities HTCondor offers; an actual implementation will always be specific to the analysis tool you are using.

Consider the following lines from a JDL:

Executable = myWrapperForAComplexAnalysisTool.sh
Arguments  = $(Process) $(INPUT_FOLDER) $(DATASETS) $(OUTPUT_FOLDER) $(MIN_CONFIDENCE)

if $(Debugging)
  slice = [1:]
  Arguments = -v $(Arguments)
endif

# Submit jobs as defined in input file
Queue INPUT_FOLDER DATASETS OUTPUT_FOLDER MIN_CONFIDENCE from $(slice) list_of_tasks.txt

and the following list_of_tasks.txt accompanying it:

/clusterfs/user/myself/input/HIGGS/ AOD.07709524._000062.pool.root.1;AOD.07709524._000063.pool.root.1;AOD.07709524._000064.pool.root.1 /cephfs/user/freyermu/output/HIGGS/ 0.23
/clusterfs/user/myself/input/HIGGS/ AOD.07709524._000023.pool.root.1;AOD.07709524._000024.pool.root.1;AOD.07709524._000025.pool.root.1 /cephfs/user/freyermu/output/HIGGS/ 0.13
# ...

The “columns” (separated by spaces) are assigned, in order, to the variable names passed to the Queue command. Note that the DATASETS column contains a list of datasets separated by ; which can, for example, be parsed by the wrapper script or the analysis software. Bundling several datasets into one job helps if the job runtime would otherwise be very short and the setup / teardown phase would take long compared to it. Examples of necessary but heavy setup / teardown are:

HTCondor file transfer of a large number of input files
Extraction of the actual software (for example, it might be stored as a tarball on a cluster file system and be extracted to scratch space for actual use, since cluster file systems scale badly with many small files)
Cleanup of the job scratch directory (this also takes time!)
Necessary cache filling, software startup time etc.
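To make the idea of paying for setup and teardown only once per job more concrete, here is a minimal sketch of what a wrapper such as myWrapperForAComplexAnalysisTool.sh might do with its arguments. The function name run_task and the echoed "analyse" command line are hypothetical placeholders, since the real analysis tool is not part of this course; only the argument order is taken from the Arguments line of the JDL above.

```shell
#!/bin/bash
# Hypothetical wrapper sketch; the analysis command is only echoed.
run_task() {
    local verbose=0
    if [ "$1" = "-v" ]; then    # prepended by the JDL in debugging mode
        verbose=1
        shift
    fi
    local process="$1" input_folder="$2" datasets="$3"
    local output_folder="$4" min_confidence="$5"

    # Split the ;-separated DATASETS column into an array.
    local -a files
    IFS=';' read -r -a files <<< "$datasets"

    # Heavy setup (unpacking software, filling caches) would go here, once.
    local f
    for f in "${files[@]}"; do
        if [ "$verbose" -eq 1 ]; then
            echo "job $process: processing $f" >&2
        fi
        echo "analyse --min-confidence $min_confidence ${input_folder}${f} -> ${output_folder}"
    done
    # Heavy teardown would go here, once.
}

# Only run when arguments are given, as HTCondor would do for a real job.
if [ "$#" -gt 0 ]; then
    run_task "$@"
fi
```

The design choice is that the loop over datasets sits between a single setup and a single teardown, so a job with three datasets pays that overhead once instead of three times.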

Can you follow the example and understand all parts of it? For example, what would happen if you named the full JDL file analysis.jdl and submitted it as follows?

condor_submit 'Debugging=true' analysis.jdl

Do you have an example use case in mind? Feel free to ask questions!
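One use case that comes up frequently is generating list_of_tasks.txt with a script instead of writing it by hand. The following is a sketch under several assumptions: the function name make_task_list is made up, input files are matched by the pattern *.root*, paths contain no spaces (the columns are separated by whitespace), and every task uses the same MIN_CONFIDENCE value, unlike the hand-written example above.

```shell
#!/bin/bash
# Print one task line per batch of input files, in the column order
#   INPUT_FOLDER DATASETS OUTPUT_FOLDER MIN_CONFIDENCE
make_task_list() {
    local input_folder="$1" output_folder="$2" min_confidence="$3" batch_size="$4"
    local -a batch=()
    local path joined
    for path in "$input_folder"/*.root*; do
        [ -f "$path" ] || continue          # the glob did not match anything
        batch+=( "${path##*/}" )            # keep the file name only
        if [ "${#batch[@]}" -eq "$batch_size" ]; then
            joined="$(IFS=';'; echo "${batch[*]}")"   # join with ;
            echo "${input_folder}/ ${joined} ${output_folder}/ ${min_confidence}"
            batch=()
        fi
    done
    # Emit the last, possibly smaller, batch.
    if [ "${#batch[@]}" -gt 0 ]; then
        joined="$(IFS=';'; echo "${batch[*]}")"
        echo "${input_folder}/ ${joined} ${output_folder}/ ${min_confidence}"
    fi
}

# Usage:
#   make_task_list /path/to/input /path/to/output 0.23 3 > list_of_tasks.txt
```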

Another useful feature of the job description file is the ability to remap the names of input and output files. Imagine that the program you use expects a specific input file name and produces a hardcoded output file name. A simple workaround would be to write a job wrapper script which renames the files accordingly. However, you can also do this:

transfer_output_remaps = "output.root=output/histograms_$(Process).root"

The job is expected to produce the file output.root in its working directory on the execute machine. When the job has finished, this setting transfers the file into the subdirectory output/ on the submit machine and gives it a unique name based on the process ID. The same is possible for input files.
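For comparison, the wrapper-script workaround mentioned above could look roughly like the following sketch. The function name rename_output and the program name my_analysis are hypothetical; the file names follow the example in the text.

```shell
#!/bin/bash
# Wrapper-side workaround: rename the hardcoded output after the program ran.
rename_output() {
    local process="$1"
    local fixed_name="output.root"
    local unique_name="histograms_${process}.root"
    if [ ! -f "$fixed_name" ]; then
        echo "expected output $fixed_name is missing" >&2
        return 1
    fi
    mv -- "$fixed_name" "$unique_name"
}

# In the wrapper, with the process ID passed as first argument:
#   ./my_analysis && rename_output "$1"
```

With this approach the job description would have to list the renamed file as output, and the extra script has to be maintained, which is why the remap setting is usually the simpler choice.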

Related to this, the initialdir setting effectively changes into the given directory on the submit machine before each individual job is submitted. This allows you to prepare multiple sets of input files in different subdirectories on the submit machine and to collect the logs and outputs in those subdirectories.
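As a sketch of how such a layout could be prepared, the following creates one subdirectory per job on the submit machine. The directory names run_0, run_1, ..., the file name input.dat and its content are invented for illustration, as is the function name prepare_run_dirs.

```shell
#!/bin/bash
# Prepare one subdirectory per job, each with its own input file.
prepare_run_dirs() {
    local count="$1" i
    for (( i = 0; i < count; i++ )); do
        mkdir -p "run_${i}"
        printf 'seed=%d\n' "$i" > "run_${i}/input.dat"
    done
}

# A matching job description could then contain, for example:
#   initialdir           = run_$(Process)
#   transfer_input_files = input.dat
#   Queue 3
```

Relative paths for input files, logs and outputs are then resolved inside the respective run_N directory, so every job finds its own input.dat and leaves its results next to it.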

Do you have example use cases in mind? Again, feel free to ask questions!

¹ A very much improved online documentation is part of the HTCondor 8.8 series.
² If they finish too fast, you will also find a Debian12_render_scenes_hq.jdl in the repository. Note that this requires significantly more resources, so please only use that if the normal jobs are too short for investigating their behaviour.
