Cluster work

This page lists work that needs to be done on the cluster, along with notes on work underway and on completed work.

Windows cluster nodes

Windows nodes can now join the cluster. They must be running Windows 10, because we rely on the "bash on windows 10" feature to run the main script.

Sadly this cannot call windows binaries at the moment, so we need to use outbash (from in order to make it work.

We call into git bash to work there. This involves lots of hairy rewriting of the job strings with regexps.

I can rewrite the pipe for md5sum to use a fifo, but something is going wrong with it, so I've resorted to just using a simple file now.

I have successfully run simple gs jobs through the system. I need to check pcl/xps/mupdf and pdfwrite jobs next.

Set up Windows Cluster Nodes as VMs.

Currently I'm just using my own windows machine as a node. The plan is to run virtual nodes on the cluster machines using libkvm and virsh. I've installed libkvm on every machine in the office so far, but not done more than that.

Weekly/Nightly/Release tests

I've set up a system for doing non-user, not git-triggered jobs. I've called these 'auto' jobs.

Essentially, I have an 'auto' directory (casper:/home/regression/cluster/auto) in which I can have other directories. By convention I put nightly tests in 'nightly' and weekly in 'weekly'. Each test goes in its own directory inside here, so for instance: auto/weekly/debug-Z@.
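As a rough illustration of that layout (not the cluster's actual code), the following Python sketch walks an 'auto' tree and lists each test directory with the schedule directory it sits in. The function name list_auto_tests is made up for this example, and it assumes exactly two levels: auto/<schedule>/<test>.

```python
from pathlib import Path


def list_auto_tests(auto_root):
    """Return (schedule, test_name) pairs for every auto/<schedule>/<test> directory.

    Assumes the two-level layout described above; plain files are ignored.
    """
    root = Path(auto_root)
    tests = []
    for schedule_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for test_dir in sorted(p for p in schedule_dir.iterdir() if p.is_dir()):
            tests.append((schedule_dir.name, test_dir.name))
    return tests
```

For a tree holding nightly/gs and weekly/debug-Z@, this would return those two pairs, sorted by schedule name.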

Each such directory has a jobdef.txt file that contains information about which jobs to run. At its simplest, this contains details of a single job to run. For instance:

  • product options <extras=-Z@> make

The possible entries on this line are:

the products to build
the options (to the cluster system) to use
the make target to use
any command to run before make
any command to run after make
any filters to use
the git SHA/branch/tag to test (e.g. origin/ghostpdl-9.20)

Leaving any of these blank will just use a sensible default.

For options:

Use 32bit wordsize build
Only run on machines in the office (where the bandwidth between them is presumably fast and suitable for copying large bitmaps)
Allow for a larger timeout on such jobs
Run the extended set of tests (i.e. all the tests/devices that Marcos used to run overnight, etc.)
Run with ufst enabled
Run with luratech enabled

To start an auto job, use the '' script, for example: ./ auto/nightly/gs

The regression user has a set of crontab entries to call appropriately to do the nightly/weekly tests.

Test between 2 given points.

In order to release test we need to compare a given SHA/tag/branch (the proposed release) with another one (the previous release).

The cluster is normally set up to test each run against a 'previous' result. The md5 results from the new job are typically then kept as the baseline for the next job of this type.
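The normal "compare, then keep as the new baseline" cycle can be sketched like this in Python. This is only an illustration: the real cluster's storage format is not described here, so the sketch assumes md5 results are held as a simple mapping from test name to md5 string, and both function names are hypothetical.

```python
def changed_tests(baseline, new_results):
    """Return the sorted names of tests whose md5 differs from the baseline.

    A test that has no baseline entry yet also counts as changed.
    """
    return sorted(
        name for name, md5 in new_results.items() if baseline.get(name) != md5
    )


def run_normal_job(baseline, new_results):
    """Compare against the previous result, then keep the new md5s as the baseline."""
    diffs = changed_tests(baseline, new_results)
    next_baseline = dict(new_results)  # the next job of this type compares against this
    return diffs, next_baseline
```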

For release runs that's no good, so we introduce a new 'ref' entry to the jobdef.txt file. For example, we might have the following 2 lines in a jobdef.txt file.

  • product <gs> ref <ghostpdl-9.19> options <extended>
  • product <gs> ref <ghostpdl-9.19> rev <ghostpdl-9.20> options <extended>

The first line causes a job to run that builds the 'ref' revision, and runs the tests to generate the baseline md5 sums - this is a 'reference generating' run.

The second line causes a job to run that generates the 'rev' revision, and compares it against the former. In this run, because ref and rev are both specified, we do NOT update the stored baselines. Thus we can repeatedly run this second line until we're happy without needing to rerun the first one.
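To make the ref/rev rule concrete, here is a minimal Python sketch that parses a jobdef.txt line of the "key <value>" form shown above and decides which kind of run it describes. The function names and the "normal"/"reference"/"compare" labels are invented for this example, and the parsing is an assumption based only on the two sample lines; the real jobdef.txt syntax may allow more than this.

```python
import re

# Matches entries of the form: key <value>
_ENTRY = re.compile(r"(\w+)\s*<([^>]*)>")


def parse_jobdef_line(line):
    """Return a dict of the 'key <value>' entries found on one jobdef.txt line."""
    return {key: value for key, value in _ENTRY.findall(line)}


def job_kind(job):
    """Classify a parsed line as a 'normal', 'reference' or 'compare' run."""
    if "ref" in job and "rev" in job:
        return "compare"    # build rev and compare it against the ref baseline
    if "ref" in job:
        return "reference"  # build ref and generate the baseline md5 sums
    return "normal"


def updates_baseline(job):
    """Only a run with both ref and rev leaves the stored baselines untouched."""
    return job_kind(job) != "compare"
```

With this, the first sample line is classified as a 'reference' run and the second as a 'compare' run, so only the first one would update the stored baselines.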

Both these jobs only generate md5sums, not bitmaps. The hope is that from release to release most of the files should be unchanged, so the number of files that need to be examined as bitmaps should be small.

Consider pushing scan build onto the cluster

For every cluster push, or weekly?

Consider pushing coverage onto the cluster

Done, but dies with full jobs due to timeouts. Consider revamping the cluster core.

Rework cluster central loop


  • Make it time out less.
  • Make it less sensitive to node disconnections (if a node disconnects, be lazy about redistributing its jobs).
  • Simplify the current rat's nest of code (remove heartbeat if possible).
  • Allow for nodes joining part way during a run (to allow for AWS nodes spinning up).
  • Avoid nodes getting out of sync w.r.t what job is running.


Currently, jobs are started by the cluster making a series of $m.start files (one for each machine $m). The nodes periodically download $m.start and use that to trigger a new job.

Instead, we'll create a single 'job.start' file that contains the same stuff, but also a job id (probably the time of the job start, like 20190408115600 or something). Nodes will download this file rather than $m.start (and they won't delete it after they download it, obviously). This file will remain present for the entire duration of a job.
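A job id of that shape is just the start time formatted as year, month, day, hour, minute, second. A small Python sketch (make_job_id is a hypothetical name; the final id format has not been decided):

```python
from datetime import datetime


def make_job_id(start_time):
    """Format a job start time as a 14-digit id, e.g. 20190408115600."""
    return start_time.strftime("%Y%m%d%H%M%S")
```

Ids built this way sort in the same order as the start times, so comparing two ids as strings also tells you which job is newer.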

Currently, the cluster master starts just the nodes that it thinks should be capable of running the job (as it only creates $m.start if it thinks that $m is capable). Now we start every node.

Currently, the cluster sends back a capability string to be stored as $m.caps that says what a node can do or not. We no longer send that file separately but as part of the main cluster loop.

When a node connects, it will send a series of lines, terminated by a line that says OK or ABORT. Each of these lines will be of the form:

<var> <value>

For example:

   node <clustername>
   id <jobid>

This enables us to update the protocol later.
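A minimal Python sketch of reading such a message on the master side, assuming one "<var> <value>" pair per line and a final line of OK or ABORT. The function name is made up for this example. Unknown variables are kept rather than rejected, which is what lets new ones be added later.

```python
def parse_node_message(lines):
    """Parse '<var> <value>' lines up to the terminating OK or ABORT line.

    Returns (fields, terminator). A variable with no value maps to ''.
    """
    fields = {}
    for raw in lines:
        line = raw.strip()
        if line in ("OK", "ABORT"):
            return fields, line
        if not line:
            continue  # ignore blank lines
        var, _, value = line.partition(" ")
        fields[var] = value
    raise ValueError("message ended without OK or ABORT")
```

For example, the lines "node nodeA", "id 20190408115600", "OK" would give the fields node and id, and the terminator OK.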

node <clustername>
Inform the clustermaster which node is calling in.

id <jobid>
Inform the clustermaster which job the node thinks it is running. If this mismatches, the clustermaster will tell the node to abort.

caps <capability string>
The node tells the clustermaster what its capabilities are.

status <status string>
A status string for the node. Can be anything.

jobs <num pending>
The number of jobs the node has been given, and has not queued yet. When this drops to zero, the clustermaster should consider giving the node more jobs, as capacity is sitting idle.

busy
Inform the clustermaster that the node is currently busy doing previously allocated work (building, running jobs, or uploading files).

failure <reason>
Inform the clustermaster that the node failed, and that failure logs have been uploaded. Negative reasons are considered fatal and should stop the job on all nodes. Values for reason include:
  • -1: Failed because of a build problem.
  • 1: The node failed because of a network problem.
  • 2: Failed because of too many timeouts.
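The "negative means fatal" rule could be handled on the master with something like the following Python sketch. The table only holds the three values listed above, and the names are invented for this example.

```python
# Reason codes listed above; other values may be added later.
FAILURE_REASONS = {
    -1: "build problem",
    1: "network problem",
    2: "too many timeouts",
}


def is_fatal(reason):
    """Negative reasons are fatal and should stop the job on all nodes."""
    return reason < 0


def describe_failure(reason):
    """Return a short human-readable description of a failure reason."""
    text = FAILURE_REASONS.get(reason, "unknown reason")
    kind = "fatal" if is_fatal(reason) else "non-fatal"
    return f"{reason}: {text} ({kind})"
```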

If 'ABORT' is received, the node will disconnect and clean up. Note that the node may then start itself up again from scratch and try to join in the job again, so the cluster master should be able to cope with this.

Once 'OK' is received by the master, the master will send a response, then disconnect.

ABORT
The node should abort instantly. This will be used if the capabilities string for the node is unsuitable, if the jobid is wrong, the job has been aborted, etc.

BUILD
The node should build. This will be the first command sent by the master to a node. This will trigger updates of test files, source files, and the build of the source files. The node will report 'busy' while it is doing this.

UPLOAD
The node should upload any log files. The node will report 'busy' while it is doing this.

The node should go away and continue with its existing work. Come back in a bit to see if the situation has changed.

JOBS\n<num jobs>\n<list of job strings>\n
The master passes a list of job strings to the node.
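A sketch of building and reading that response in Python, assuming one job string per line (so job strings must not contain newlines themselves). The function names are hypothetical.

```python
def encode_jobs(job_strings):
    """Build a 'JOBS\\n<num jobs>\\n<job>\\n<job>\\n...' response."""
    for job in job_strings:
        if "\n" in job:
            raise ValueError("job strings must not contain newlines")
    body = "".join(job + "\n" for job in job_strings)
    return f"JOBS\n{len(job_strings)}\n{body}"


def decode_jobs(payload):
    """Parse a JOBS response back into a list of job strings."""
    if not payload.endswith("\n"):
        raise ValueError("truncated JOBS response")
    lines = payload[:-1].split("\n")
    if len(lines) < 2 or lines[0] != "JOBS":
        raise ValueError("not a JOBS response")
    count = int(lines[1])
    jobs = lines[2:2 + count]
    if len(jobs) != count:
        raise ValueError("fewer job strings than announced")
    return jobs
```

Sending the count first lets the node notice a truncated response instead of silently running fewer jobs.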

Normal course of events:

master: creates job.start
node:   downloads job.start
node:   <- Connects, and sends capabilities/job id
master: If jobid is bad -> ABORT (but be prepared to accept node again in future)
master: If capabilities are bad -> ABORT (and remember node as no good for this job)
master: -> BUILD
node:   Update test files/source files
node:   If updates fail <- ABORT (with appropriate reason)
node:   Build code
node:   If builds fail, upload <- ABORT (with appropriate reason)
node:   <- Connect in without busy
master: Feed jobs ->
node:   <- Connect in with status updates every n seconds
master: When node reports empty:
master:   if we have more jobs -> feed jobs
master:   else if we have delinquent nodes -> reschedule jobs from delinquent node
master:   else if node not uploaded -> upload
master: when all nodes complete && uploaded -> abort
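The master's side of the "node reports empty" step might look roughly like the following Python sketch. It is an illustration under several assumptions, not the planned implementation: MasterState, respond_to_idle_node and batch_size are invented names, "UPLOAD" and "CONTINUE" are assumed names for the upload and come-back-later responses, and a node is marked as uploaded as soon as it is told to upload (a real master would wait for the upload to finish).

```python
from dataclasses import dataclass, field


@dataclass
class MasterState:
    job_id: str
    nodes: set                                      # every node taking part in this job
    pending: list = field(default_factory=list)     # job strings not yet handed out
    delinquent: dict = field(default_factory=dict)  # node name -> jobs it still holds
    uploaded: set = field(default_factory=set)      # nodes already told to upload
    batch_size: int = 2                             # hypothetical jobs-per-response limit


def respond_to_idle_node(state, node, node_job_id):
    """Choose the response for a node that connects with no pending jobs."""
    if node_job_id != state.job_id:
        return ("ABORT", [])  # the node thinks it is running a different job
    if state.pending:
        batch = state.pending[:state.batch_size]
        del state.pending[:state.batch_size]
        return ("JOBS", batch)
    for other, jobs in state.delinquent.items():
        if other != node and jobs:
            batch = jobs[:state.batch_size]
            del jobs[:state.batch_size]  # reschedule jobs from the delinquent node
            return ("JOBS", batch)
    if node not in state.uploaded:
        state.uploaded.add(node)
        return ("UPLOAD", [])
    if state.uploaded >= state.nodes:
        return ("ABORT", [])  # all nodes complete and uploaded: the job is over
    return ("CONTINUE", [])   # nothing to do yet; come back in a bit
```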

-- Robin Watts - 2017-02-15

See also:

ClusterNodes, ClusterStructure, ClusterHowTo


bmpcmp output - next/back links on each page of diffs

-- Chris Liddell - 2017-03-10

Topic revision: r7 - 2019-04-08 - RobinWatts
Copyright 2014 Artifex Software Inc