The Structure of the Artifex Cluster


The cluster consists of a central server, and a set of distributed 'nodes' that connect into it. The nodes differ in OS, compiler version, number of CPUs, amount of RAM, connection speed, availability times etc.

The idea of the cluster is that different tests ("runs") can be performed, with the individual jobs that make up these runs being farmed out around all the cluster nodes that happen to be available at the moment.

Cluster runs can either be triggered automatically on a commit, at specific times/dates (for nightly/weekly tests) or directly by users.

Different runs can test different things; some runs test gs (or gpcl6, or gxps etc), some mupdf (and mujstest). Another can generate a bitmap comparison based on the results of a previous run.

The cluster master.

Casper, an AWS node, is our clustermaster. This is responsible for keeping the queued jobs and farming them out around the nodes.

We actually run 3 parallel clusters, priority 0, 1 and 2. This enables us to run user jobs (level 0, highest priority), git jobs (ones caused by commits, level 1, middle priority) and auto jobs (nightly/weekly jobs and release tests, level 2, low priority).

At the moment priority 0 and 1 are still running together in priority 0 while the code is tested. I will change this in a few days, if no problems are found. The description below assumes I have changed this.

Something (I have yet to figure out what), runs This checks to see if another instance of is already running - if it is it just exits. If not it goes into an infinite loop, kicking off, and every 10 seconds. and are just symbolic links to, so all 3 processes are identical. The first thing they do is to look at the name they have been invoked under and use that as their priority. For (most of) the rest of this description (except where I explicitly say otherwise), I'll just describe the level 0 clustermaster process. The others are pretty much identical, but use suffixed filenames; so whereas the job list for level 0 ( goes into jobs, the jobs list for level 1 ( goes into jobs.1. is the core of the server. It's operation is split into 3 phases.

Phase 1 runs every time it is invoked, and consists of spotting new jobs that need to be added into the queue of jobs.

Phase 2 has checking to see if there is another instance of itself running. If there is, and it is too old, it kills the old version and exits. If there is, and it's not too old, it will exit. If there isn't, it proceeds to phase 3. This dance simply ensures that a) we don't get a clustermaster stuck in an infinite loop, and b) ensures that we have just one of them running at a time.

Phase 3 involves actually running the job, collating the results, and mailing out the results.

Then the clustermaster exits. Every invocation lasts only long enough to do a single run.

Phase 1 in detail

The cluster system contains various different copies of the source. /home/regression/cluster/{ghostpdl.mupdf} are git clones of the sources. In addition /home/regression/cluster/user/XXXX/{ghostpdl,mupdf} contain per-user copies (not git clones).

At priority 1, the clustermaster checks for new commits having been made to the ghostpdl/mupdf git repos. If found, it queues one job per each new commit (by adding a line to queue.lst).

At priority 0, the clustermaster checks for user jobs, by looking for various droppings that leaves in the users source directories. These include,, If found the cluster removes these and creates jobs in queue.1.lst.

No checking is done for priority 2 auto jobs, because these are placed directly into the queue.2.lst file by (typically called from a crontab).

Phase 2 in detail

Any clustermaster that gets past phase 2 puts its process id into This means that when the clustermaster starts up, it can check to see if there is a clustermaster process supposedly running by checking that file. If it exists, it checks the time. No run should take more than 2 hours, so if the time stamp of the pid is older than this, we decide that it's probably dead. To be sure, we remove the file, and wait for 5 minutes. This prevents any other clustermasters from being spawned, and gives any old clustermaster time to shut down neatly.

If the pid no longer refers to a clustermaster process, we do similarly.

If we pass these tests, we put our own PID into the file.

Phase 3 in detail

During this entire phase, we periodically check to see if the PID in is ours. If it isn't (or the file has been deleted), we take this as a sign that we have been killed off, and exit.

First, we check to see if we have any queued runs. If we do, we take the first one, and execute it.

We create a job.start file (or job.1.start or job.2.start according to priority) which contains details of the job to run.

Then we call to get the job list; this goes into jobs. This is a list of lines, one per test to run. The lines are in the format:

   <test> <logfile> true ; <command string>

<<>test<>> is a string that describes the test to be used; a combination of test file name (with / replaced by double _), and the devices/resolution/banding settings. e.g. tests_private__pdf__PDF_1.7_FTS__fts_17_1705.pdf.pdf.pkmraw.300.0

<<>logfile<>> is a string that gives a name to be used for the logfile of the test. e.g. __temp__/tests_private__pdf__PDF_1.7_FTS__fts_17_1705.pdf.pdf.pkmraw.300.0..gs_pdf

<<>command string<>> is a ';' separated list of commands to be executed to run the test. These commands and their outputs are captured to the logfile.

Typically the command string will involve at least one call to ghostscript (or another executable) to produce output.

Simple ghostscript tests will call ghostscript, and pipe its output through md5sum. This md5sum is then echoed to the output for processing when the job results are returned to the clustermaster.

Tests using pdfwrite/xpswrite/ps2write etc will call ghostscript once to output an intermediate file, then again to render that intermediate file. md5sums of both intermediate and final results are sent back in the logs.

Tests using bmpcmp will call the version of ghostscript under test, and then the reference version of ghostscript to generate 2 different bitmaps. These will then be fed into bmpcmp for comparison. In the case of pdfwrite etc, there will be 4 invocations. rsync is used as part of the command string to upload the output of bmpcmp.

Each of these lines will be handed out to nodes during the course of the job.

Then we actually run the job.

The clustermaster starts a service up listening on a port (9000+priority), and goes into a loop.

When a node connects to the service, it sends a bunch of headers, and then either "OK" or "ABORT".

  • One of the headers is "id", a value parrotted back to the server from where the node has read it from job.start. If this doesn't match the id we expect, the node is still trying to run an old job. We tell it to abort.

  • One of the headers is "node", the node name. This must always be present, or we'll abort the node.

  • One of the headers is "caps", a string that describes the capabilities of the node. On the first time a node connects, we check to see that the node is suitable for the job; If it is unsuitable, we add it to the 'cull' list and abort the node. On subsequent connections we check to see if the node is in the cull list; if it is, we abort it again.

  • One of the headers is "failure". This will be set if the node has encountered a failure during a task we have given it (such as building the source, or uploading the results). If it fails during building, we abort the job. If it fails during upload, we tell it to try again.

  • One of the headers is "busy". This will be set if the node is still busy doing a task we gave it (such as building the source, or uploading the results). It connects in anyway so we know that it is still working on the problem, and hasn't died (so we won't time it out).

  • The first thing a new node is told is to "BUILD". This involves updating the source and test respositories and then building. This can take a while, so the node continues to connect in periodically to tell us it is "busy" working on it.

  • One of the headers is "jobs", followed by 2 numbers. The first is the number of jobs pending (the number of jobs that has been given to the node that have not yet been run). The second is the number of jobs running at the moment. When a node connects with no pending jobs, and we have jobs still to h

  • Each node connects to this service in turn. I to request jobs. They are either given jobs to do, or told to 'standby'.

  • If we don't hear a heartbeat from a node every 3 minutes, we assume that that node has died. We therefore put the jobs that have been sent to that node back into the joblist, and carry on without that node.

  • If any of the nodes reports a compileFail, this is taken as a hard failure.

  • When we've sent every job out, and then every node has reported in to ask for more jobs, we know that every job has been successfully completed. This job is complete.

  • If we drop too many nodes, we try again.

Next we mark all the machines as idle, unpack the build logs, create, populate and file the job report (including emailing it out).

We remove the job we have just exited from the queue, delete the file, and then exit.

This script does the usual "if I am not the only one, then exit" dance.

If it is the only one, then it simply runs in a loop listening for connections on port 9001. For each connection it gets, it receives the machine name, a file suffix and some file contents. If 'machine.start' exists (i.e. if there is a new job to run), then it returns 1, otherwise 0.

The remote nodes use this to send a heartbeat (and status) to the server, and to be informed when they have a new job to begin.

This file knows how to generate the "job strings" to pass to the clients for execution.

This is where the knowledge of how to pass parameters to gs/mupdf etc is, along with what files to test with what extensions etc.

A Cluster Node

A cluster node has a git clone of tests and tests_private. It also has local copies of the ghostpdl, mupdf, users/XXXX/ghostpdl and users/XXXX/mupdf directories from the server. These are kept up to date using rsync.

Periodically (typically by a cronjob every 10 minutes), we run

This file contains the actual name of the cluster node - this is the only bit of node to node customisation that should be required. Accordingly this is the only file that cannot trivially be updated from the central server.

When run this checks to see if there are existing clusterpull processes running, checking for timeouts etc.

Once it is satisfied that it is the only game it town, it starts the real work.

Firstly it pulls down a new version of the script.

Next it pulls down a new version of the script (not all nodes have this version of yet), and runs it. This detects things like what OS is running, what version of gcc is in use, what wordsize etc, and reports these 'capabilities' back to the central cluster master. This enables the clustermaster to decide what jobs can be safely sent to this machine.

Then it goes into a loop calling once every 10 seconds to tell the server that it lives.

If returns with a non-zero exit code, it knows that there is a job to start.

It fetches the $machine.start file from the clustermaster, deletes it on the clustermaster, fetches from the clustermaster and executes contains the guts of the client logic. This knows how to build each of the different products by looking in the $machine.start file obtained from the clustermaster by

It then repeatedly connects to the server to request jobs and run them.

-- Robin Watts - 2017-02-04

See also:

ClusterNodes, ClusterWork, ClusterHowTo


Edit | Attach | Watch | Print version | History: r4 < r3 < r2 < r1 | Backlinks | Raw View | Raw edit | More topic actions...
Topic revision: r3 - 2019-05-10 - RobinWatts
  • Edit
  • Attach
This site is powered by the TWiki collaboration platform Powered by PerlCopyright 2014 Artifex Software Inc