Running a DAG
HTCondor’s DAGMAN functionality lets you express complex dependencies between jobs. In our simple example, we will render a video: first, many jobs each render a single frame; then a final job creates an output video from the frames.
For this, we will again use the open source renderer POV-Ray (Persistence of Vision Raytracer), and also make use of the two scenes already used in the previous exercise (the attribution to the artists can be found there and with the artwork).
Again, each job needs a file describing it and an actual payload. However, we are now running two different kinds of jobs: the first kind does the image rendering, the second kind takes the produced images and creates a movie file from them.
The first kind of job: Image rendering
This job is very similar to the one we used in the previous exercise. The only additions are different quality settings to speed up the rendering, and the animation itself.
Save the following into a file of your choosing, or use the file `Debian12_render_movie_frames.jdl` from the repository.
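The repository file remains the authoritative version. As a rough orientation only, a submit description for per-frame rendering might look like the sketch below. Everything in it apart from the payload script name is an assumption made for illustration: the argument layout, the output file names, the resource requests and the frame count of 50 are hypothetical, and any container or file transfer settings the real file may need are left out. The sketch writes to a separate file, `sketch_render_frames.jdl`, so nothing from the repository is overwritten.

```shell
# Hypothetical sketch, NOT the repository's Debian12_render_movie_frames.jdl.
cat > sketch_render_frames.jdl <<'EOF'
# Payload script that renders one frame (name taken from this exercise)
executable     = render_pov_movie.sh
# Assumed argument: the frame index, one per queued job
arguments      = $(Process)
output         = render_$(Cluster).$(Process).out
error          = render_$(Cluster).$(Process).err
log            = render_$(Cluster).log
# Assumed resource requests
request_cpus   = 1
request_memory = 1024M
# Assumed frame count: queue 50 jobs, $(Process) runs from 0 to 49
queue 50
EOF
```

A single `queue 50` statement is one way to get "a lot of jobs" from one description; compare this with what the repository file actually does.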
Save the following into a file of your choosing, or use the file `render_pov_movie.sh` from the repository.
Please check that the shell script is executable. If not, run `chmod +x render_pov_movie.sh`.
The second kind of job: Creating the movie
Save the following into a file of your choosing, or use the file `Debian12_create_movie.jdl` from the repository.
Save the following into a file of your choosing, or use the file `create_pov_movie.sh` from the repository.
Please check that the shell script is executable. If not, run `chmod +x create_pov_movie.sh`.
The DAG file
Now, we need the final ingredient: a DAG file which describes the interdependencies between these kinds of jobs. In the end, only this file is submitted; DAGMAN then takes care of running the JDL files described above.
To reduce the computational effort, everybody should render only one movie at first (if there is time, feel free to submit the second!). For this reason, two alternative DAG files are prepared:
The first file has the following content and is available from the repository under the name `Debian12_render_movie_dice.dag`.
The second file has the following content and is available from the repository under the name `Debian12_render_movie_mini_demo.dag`.
Please choose one of the two files. Can you explain the differences between the two? If it is not clear to you how the files interact, now is the right time to ask!
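To illustrate how a DAG file ties the two JDL files together, here is a minimal sketch. The node name `render_frames` and the file name `sketch_render_movie.dag` are chosen for illustration only; the prepared DAG files in the repository may use different node names and additional commands.

```shell
# Hypothetical sketch of a two-node DAG; the repository's DAG files may differ.
cat > sketch_render_movie.dag <<'EOF'
# Each JOB line maps a node name to an HTCondor submit description
JOB render_frames Debian12_render_movie_frames.jdl
JOB make_video    Debian12_create_movie.jdl
# make_video may only start after the render_frames node has succeeded
PARENT render_frames CHILD make_video
EOF
```

The `PARENT ... CHILD ...` line is what encodes the dependency: without it, DAGMAN would be free to start both nodes at the same time.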
There are other interesting DAGMAN features you may want to check out. For example, you can use `RETRY` to automatically retry a failed node a given number of times, or `PRIORITY` to give a job a higher priority than other jobs in the same DAG. Another helpful feature is `-maxidle 50` as a parameter to `condor_submit_dag` to limit the number of idle jobs in the queue (i.e. DAGMAN submits jobs slowly, making sure the idle queue is never too full). You can also limit the total number of jobs in any state at a time.
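As a sketch of how these features could be written down, the snippet below creates a small DAG file that uses `RETRY` and `PRIORITY`. The node name `render_frames`, the file name `sketch_retry_priority.dag` and the numeric values are assumptions for illustration, not taken from the repository. The throttling options are shown as a comment because they only work on a machine with HTCondor installed.

```shell
# Hypothetical sketch: a two-node DAG extended with RETRY and PRIORITY.
cat > sketch_retry_priority.dag <<'EOF'
JOB render_frames Debian12_render_movie_frames.jdl
JOB make_video    Debian12_create_movie.jdl
PARENT render_frames CHILD make_video
# Re-run the render node up to 3 times if it fails
RETRY render_frames 3
# Nodes with larger values are preferred when several nodes are ready
PRIORITY make_video 10
EOF

# On a submit node, throttling options go on the command line, e.g.:
#   condor_submit_dag -maxidle 50 -maxjobs 100 sketch_retry_priority.dag
```

Here `-maxjobs` is the option that caps how many node jobs DAGMAN keeps submitted at once, regardless of their state; the value 100 is arbitrary.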
Submit the job as follows and check what happens:
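Assuming you chose the dice scene, the submission would look like the following; use the other DAG file name if you picked the `mini_demo` scene.

```shell
# Run on the submit machine, in the directory containing the DAG and JDL files
condor_submit_dag Debian12_render_movie_dice.dag
```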
What is happening? Where do you expect to find files on your submit machine?
Check the progress of your jobs; they will run for a while. The following commands and the log files may be useful (feel free to try them all!):
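A possible selection of monitoring commands is sketched below. The log file name assumes that the dice DAG was submitted; adapt it to the DAG file you actually used.

```shell
# Summary of your jobs, grouped into batches
condor_q
# One line per job instead of one line per batch
condor_q -nobatch
# Show node jobs grouped under the DAGMAN job they belong to
condor_q -dag
# Continuously updating overview (stop with Ctrl-C)
condor_watch_q
# State of the execute slots in the pool
condor_status
# Follow DAGMAN's own output; the name is derived from the DAG file name
tail -f Debian12_render_movie_dice.dag.dagman.out
```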
Also check out the log file produced by DAGMAN!
Note that `condor_watch_q` will prove quite useful, especially for checking the status of complicated DAGs. It will present output such as:
Please note that `condor_watch_q` may require a restart to become aware of new job clusters.
Check out the man pages or the HTCondor documentation — can you find more interesting parameters of jobs?
You may want to play with priorities and resource requests for jobs which are still waiting in the queue (you can only rank your own jobs against each other!). Helpful commands could be (for a job ID `72.0` and cluster ID `72`):
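The following commands are one way to do this; the numeric values are arbitrary examples, and whether changed resource requests can be matched depends on the slots available in your pool.

```shell
# Raise the priority of job 72.0 relative to your own other jobs
condor_prio -p 10 72.0
# Change the memory request (in MB) of the still-idle job 72.0
condor_qedit 72.0 RequestMemory 2048
# Change the CPU request for all jobs in cluster 72
condor_qedit 72 RequestCpus 2
# Inspect the resulting attributes
condor_q 72.0 -af JobPrio RequestMemory RequestCpus
```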
If you have been really attentive, you may have noticed that sometimes the jobs do not run in the expected order. For example, you may see fewer render jobs running than there are cores available, or movie creation jobs hanging in the queue even though there are free resources. Do you have an explanation why?
Hint 1
Check the resource requests of the two different job types carefully. Is there a difference?
Hint 2
You may remember HTCondor's partitionable slots. How could they impact efficiency here?
Hint 3
The test cluster uses the setting `CLAIM_WORKLIFE = 300`. Check the HTCondor documentation on the effects of this setting!
Check out your results
If all went well, you should find `render_results` and a final `.mp4` movie file. Copy them to your local machine to watch them.
How does the quality compare to the still images we rendered before?
If you are the first to arrive here, you may want to play a bit more with DAGs. You could, for example, render the other movie, but first intentionally “break” the `make_video` job, e.g. by editing the shell script and making it return a bad exit status (`exit 1`) before doing anything. How does DAGMAN react to this (check the logs)? Can you continue the DAG from where it left off after “fixing” the shell script again by reverting it to the original state?
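One way to try this out is sketched below on a stand-in script, so the snippet runs anywhere; the file names `sketch_create_movie.sh` and `sketch_broken.sh` and the script content are hypothetical. For the real exercise you would edit `create_pov_movie.sh` in the same way. The resubmission step is shown as a comment, since it needs an HTCondor submit node, and it assumes the default rescue DAG behaviour.

```shell
# Stand-in for the movie script (hypothetical content)
printf '#!/bin/sh\necho "creating movie"\n' > sketch_create_movie.sh

# "Break" the script: put an early failure right after the shebang line
{ head -n 1 sketch_create_movie.sh; echo 'exit 1'; tail -n +2 sketch_create_movie.sh; } > sketch_broken.sh

# The broken script now fails before doing any work
status=0
sh sketch_broken.sh || status=$?
echo "exit status: $status"

# After the DAG has failed and the original script is restored, resubmitting the
# same DAG file lets DAGMAN use the rescue file it wrote (e.g. <dagfile>.rescue001)
# and skip nodes that already finished:
#   condor_submit_dag Debian12_render_movie_mini_demo.dag
```

Check the directory for a rescue file after the failure before resubmitting; if none was written, the DAG would start again from the beginning.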
Still there? You may want to check if there are free cluster resources and, if so, improve the quality of the rendering and re-render. You can check and adapt the render settings in the corresponding `.ini` file. Note that quality setting `9` takes significantly longer than any lower setting. For the `mini_demo` scene, you will also find that two different POV-Ray files are used. Check out the differences using `diff` and choose some settings in between! During re-rendering, remember to also check `condor_userprio` to see how priorities are evolving. Note that there are also `hq` versions of the scripts and `.jdl` files, which are not referenced by the DAG files. These produce much better quality but use significant resources even on a large cluster, so please be aware of that before running them. Remember that jobs also affect your user priority!
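Comparing two variants with `diff` could look like the sketch below. It uses two stand-in `.ini` files with hypothetical names and values, not the files from the repository; the same approach applies to the two POV-Ray scene files and to the `hq` variants.

```shell
# Two stand-in ini files (hypothetical names and values)
cat > sketch_fast.ini <<'EOF'
Width=320
Height=240
Quality=4
Antialias=Off
EOF
cat > sketch_better.ini <<'EOF'
Width=640
Height=480
Quality=9
Antialias=On
EOF

# diff exits with status 1 when the files differ, so do not treat that as an error
diff -u sketch_fast.ini sketch_better.ini || true
```

Picking "settings in between" then means editing a copy of the faster file and raising one value at a time, so you can see which setting costs the most render time.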