
How to do stuff with the cluster

How to remove jobs from the queue

The current queue of jobs is kept on casper in 5 files, one for each 'priority' (where priority is a mixture of priority and architecture).

  • Priority 0 jobs (highest, user jobs) are held in: /home/regression/cluster/queue.lst
  • Priority 1 jobs (middle, git commit triggered jobs) are held in: /home/regression/cluster/queue.1.lst
  • Priority 2 jobs (lowest, auto jobs) are held in: /home/regression/cluster/queue.2.lst
  • Priority 3 jobs (highest, user jobs, arm machines) are held in: /home/regression/cluster/queue.3.lst
  • Priority 4 jobs (lowest, auto jobs, arm machines) are held in: /home/regression/cluster/queue.4.lst

It's a dead simple 1 line per job format.

To remove a job from the queue, load the relevant queue file into an editor, remove the job's line, and save the file back. Do this fast: the cluster may update the file while you have it open, so there is a race condition.

Removing a job from the queue will NOT stop the current run.
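The editor round trip above can also be done in one go from the shell. What follows is only a sketch: remove_job is a hypothetical helper, not part of the cluster scripts, and it assumes the job can be picked out by a fixed piece of text on its line. It shortens the race window but does not remove it.

```shell
# Hypothetical helper: drop every line containing JOB_TEXT from a queue file.
# Usage: remove_job QUEUE_FILE JOB_TEXT
remove_job() {
    queue_file=$1
    job_text=$2
    tmp_file=$(mktemp) || return 1
    # grep -v exits with 1 when it prints no lines (e.g. the queue ends up
    # empty); that is fine here, so only exit codes above 1 are errors.
    grep -vF -- "$job_text" "$queue_file" > "$tmp_file"
    status=$?
    if [ "$status" -gt 1 ]; then
        rm -f "$tmp_file"
        return 1
    fi
    # Write back in place so the queue file keeps its owner and permissions.
    # Anything the cluster wrote to the file in between is lost, exactly as
    # it would be when saving from an editor.
    cat "$tmp_file" > "$queue_file"
    rm -f "$tmp_file"
}
```

For example, remove_job /home/regression/cluster/queue.2.lst "some text from the job line" would edit the lowest priority queue. Check the file afterwards, since any line containing the text is removed, not just the first.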

How to stop the current run

The nicest way is to 'touch abort.job' (or abort.1.job or abort.2.job etc as appropriate) in /home/regression/cluster, but this relies on the clustermaster not actually being jammed, or in an infinite loop etc.

A slightly nastier way (that avoids the 5 minute timeout described below) is to: kill `cat clustermaster.pid` && rm clustermaster.pid (or clustermaster.1.pid or clustermaster.2.pid etc as appropriate).

The nastiest way is to directly kill the clustermaster.pl process. A new one will start up ~20 seconds later, spot that the old one is in trouble, and commence a 5 minute timeout and restart process. Sometimes you may have to resort to this. You'll need to clear out the .pid file too.

Note that none of these things will remove the current job from the queue, so the new cluster master will start up and restart the old job. If you want to kill a job and restart, then edit the queue first, then stop the current run.
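As a rough sketch of the 'nicest way' above, the two functions below work out the abort file name for a priority and touch it. Both abort_file and request_abort are hypothetical helpers, and the mapping of priority 0 to the unnumbered abort.job is an assumption based on how the queue files are named.

```shell
# Directory named in the text above; override CLUSTER_DIR to try this elsewhere.
CLUSTER_DIR=${CLUSTER_DIR:-/home/regression/cluster}

# Print the abort file name for a priority.
# Assumption: priority 0 uses abort.job, priority N uses abort.N.job.
abort_file() {
    if [ "$1" -eq 0 ]; then
        echo "abort.job"
    else
        echo "abort.$1.job"
    fi
}

# Ask the clustermaster for the given priority to abort its current run.
# This only helps if that clustermaster is still alive and polling.
request_abort() {
    touch "$CLUSTER_DIR/$(abort_file "$1")"
}
```

So request_abort 1 would touch /home/regression/cluster/abort.1.job. Remember to edit the queue first if the job should not be restarted.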

How to reset the cluster after a force push

Suppose you have commits A, then B, then C. C fails disastrously when cluster tested, so we want to force push the golden repo back to B. Do so.

This leaves the cluster in a confused state, for 2 reasons.

Firstly the cluster bases its 'difference to previous job' reports on the most recent entry in its database of tabs. We can solve that simply by doing:

rm -rf {,mupdf-,mujstest-}archive/<sha-for-C>*

The second problem is that the cluster checks every 20 seconds or so for master having changed. This check goes wrong if master has not moved 'forwards'. To fix this:

For ghostpdl:

cd /home/regression/cluster/ghostpdl && git checkout master && git reset --hard <sha-for-A>

For gpdf:

cd /home/regression/cluster/ghostpdl && git checkout pdfi && git reset --hard <sha-for-A> && git checkout master

For mupdf:

cd /home/regression/cluster/mupdf

git checkout incoming_master && git reset --hard <sha-for-A>

git checkout master && git reset --hard <sha-for-A>

(Note A, not B! We want to make the cluster rerun B, so we tell it that the last job it knows about is the one before B.)
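Since git reset --hard is unforgiving, it can help to print the commands with the real SHA filled in and read them over before pasting. The function below is a sketch that does only that: print_reset_commands is a hypothetical name, and it runs nothing itself.

```shell
# Hypothetical helper: print (never run) the reset commands for one product.
# Usage: print_reset_commands PRODUCT SHA_FOR_A
print_reset_commands() {
    product=$1
    sha=$2
    case $product in
        ghostpdl)
            echo "cd /home/regression/cluster/ghostpdl && git checkout master && git reset --hard $sha"
            ;;
        gpdf)
            echo "cd /home/regression/cluster/ghostpdl && git checkout pdfi && git reset --hard $sha && git checkout master"
            ;;
        mupdf)
            echo "cd /home/regression/cluster/mupdf"
            echo "git checkout incoming_master && git reset --hard $sha"
            echo "git checkout master && git reset --hard $sha"
            ;;
        *)
            echo "unknown product: $product" >&2
            return 1
            ;;
    esac
}
```

Call it as print_reset_commands mupdf <sha-for-A>, check that the SHA really is A and not B, then run the printed lines by hand.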

How to set up a cluster release test

Log into casper as regression.

Change into /home/regression/cluster.

Look into auto/release for an appropriate directory. Either copy an existing one, or make a new one. In this example, we'll update the gs release test, so we'll reuse the existing 'gs' directory.

Inside that directory there should be a jobdef.txt file that says what to test. Lines beginning with # are comments. All other lines describe a job to run.

Typically we run 3 jobs. The first job generates the reference. For example:

product <gs> ref <ghostpdl-9.20-regression-test> options <extended>

Test 'gs' as a reference run on tag (or SHA) ghostpdl-9.20-regression-test, using the 'extended' set of tests.

The second job runs the target revision, and compares back to the reference we just generated.

product <gs> ref <ghostpdl-9.20-regression-test> rev <83b54c5> options <extended>

Test 'gs' on tag (or SHA) 83b54c5, against the given reference (ghostpdl-9.20-regression-test) using the 'extended' set of tests.

Finally we generate the bmpcmp for those results:

product <bmpcmp> ref <ghostpdl-9.20-regression-test> rev <83b54c5> options <extended cull -w 3 -t 32>

Run a 'bmpcmp' on the results between those 2 commits. "cull" some of the results to avoid generating too many (i.e. if the ppmraw shows a difference, don't bother generating the pgm or the pbm, as they will just show the same difference). Use the "extended" tests. Allow for a slight window and threshold for bmpcmp differences.
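Putting the three jobs together, a complete jobdef.txt might look like the file written below. This is a sketch: it writes to a scratch path (an assumption, chosen to stay clear of the live auto/release/gs/jobdef.txt), and it assumes comment lines may sit between job lines. The final grep counts the lines that would be treated as jobs.

```shell
# Scratch location for the example; not the live jobdef.txt.
jobdef=${TMPDIR:-/tmp}/jobdef.example.txt

# Quoted heredoc, so the angle brackets and # characters are written literally.
cat > "$jobdef" <<'EOF'
# Job 1: generate the reference
product <gs> ref <ghostpdl-9.20-regression-test> options <extended>
# Job 2: run the target revision against that reference
product <gs> ref <ghostpdl-9.20-regression-test> rev <83b54c5> options <extended>
# Job 3: bmpcmp between the two
product <bmpcmp> ref <ghostpdl-9.20-regression-test> rev <83b54c5> options <extended cull -w 3 -t 32>
EOF

# Every line that does not begin with # describes a job, so count those.
grep -vc '^#' "$jobdef"
```

The count should match the number of jobs you meant to queue. Avoid stray blank lines, since by the rule above any non-comment line is read as a job.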

Once you have edited the file so you're happy:

./enqueueAuto.pl auto/release/gs

The results will then be mailed out, and can be viewed as:


and the bmpcmp at:


Note, the / in "release/gs" has been replaced by a double underscore in the above bmpcmp link.

As a trick to reduce the number of needless differences between release X and X+1, it is worth checking out back to X, cherry picking the commit that changes the release number, and committing that with a tag of X-regression-test to golden. It may also be worth pulling other commits into this branch as required. Note that currently regressions and references can only be tags or SHAs, not branch names.

-- Robin Watts - 2017-02-27

See also:

ClusterNodes, ClusterStructure, ClusterWork


Topic revision: r5 - 2019-07-23 - RobinWatts