The Structure of the Artifex Cluster

Overview

The cluster consists of a central server, and a set of distributed 'nodes' that connect into it. The nodes differ in OS, compiler version, number of CPUs, amount of RAM, connection speed, availability times etc.

The idea of the cluster is that different tests ("runs") can be performed, with the individual jobs that make up these runs being farmed out around all the cluster nodes that happen to be available at the moment.

Cluster runs can either be triggered automatically on a commit, at specific times/dates (for nightly/weekly tests) or directly by users.

Different runs can test different things; some runs test gs (or gpcl6, or gxps etc), some mupdf (and mujstest). Another can generate a bitmap comparison based on the results of a previous run.

The cluster master

Casper, an AWS node, is our clustermaster. This is responsible for keeping the queued jobs and farming them out around the nodes.

We actually run 3 parallel clusters, priority 0, 1 and 2. This enables us to run user jobs (level 0, highest priority), git jobs (ones caused by commits, level 1, middle priority) and auto jobs (nightly/weekly jobs and release tests, level 2, low priority).

Something (I have yet to figure out what) runs runClustermaster.sh. This checks to see if another instance of runClustermaster.sh is already running; if so, it just exits. If not, it goes into an infinite loop, kicking off clustermaster.pl, clustermaster1.pl and clustermaster2.pl every 10 seconds.

clustermaster1.pl and clustermaster2.pl are just symbolic links to clustermaster.pl, so all 3 processes are identical. The first thing they do is to look at the name they have been invoked under and use that as their priority. For (most of) the rest of this description (except where I explicitly say otherwise), I'll just describe the level 0 clustermaster process. The others are pretty much identical, but use suffixed filenames; so whereas the job list for level 0 (clustermaster.pl) goes into jobs, the jobs list for level 1 (clustermaster1.pl) goes into jobs.1.
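
As a rough illustration of that name-based dispatch, the logic amounts to something like the following sketch (the variable names are mine, not the script's; the port calculation comes from the service description later on):

   use strict;
   use warnings;
   use File::Basename;

   my $name = basename($0);                          # e.g. "clustermaster1.pl"
   my ($priority) = $name =~ /^clustermaster(\d*)\.pl$/;
   $priority = 0 unless defined($priority) && length($priority);

   my $suffix    = $priority ? ".$priority" : "";    # "", ".1" or ".2"
   my $jobs_file = "jobs$suffix";                    # jobs, jobs.1, jobs.2
   my $pid_file  = "clustermaster$suffix.pid";       # clustermaster.pid, clustermaster.1.pid, ...
   my $port      = 9000 + $priority;                 # the service port used in phase 3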

clustermaster.pl is the core of the server. Its operation is split into 3 phases.

Phase 1 runs every time it is invoked, and consists of spotting new jobs that need to be added into the queue of jobs.

Phase 2 has clustermaster.pl checking to see if there is another instance of itself running. If there is, and it is too old, it kills the old version and exits. If there is, and it's not too old, it will exit. If there isn't, it proceeds to phase 3. This dance simply ensures that a) we don't get a clustermaster stuck in an infinite loop, and b) we have just one of them running at a time.

Phase 3 involves actually running the job, collating the results, and mailing out the results.

Then the clustermaster exits. Every clustermaster.pl invocation lasts only long enough to do a single run.

Phase 1 in detail

The cluster system contains various different copies of the source. /home/regression/cluster/{ghostpdl,mupdf} are git clones of the sources. In addition /home/regression/cluster/user/XXXX/{ghostpdl,mupdf} contain per-user copies (not git clones).

At priority 1, the clustermaster checks for new commits having been made to the ghostpdl/mupdf git repos. If found, it queues one job per new commit (by adding a line to queue.1.lst).

At priority 0, the clustermaster checks for user jobs, by looking for various droppings that clusterpush.sh leaves in the users' source directories. These include gs.run, ghostscript.run and cluster_command.run. If found, the cluster removes these and creates jobs in queue.lst.

No checking is done for priority 2 auto jobs, because these are placed directly into the queue.2.lst file by enqueueAuto.pl (typically called from a crontab).
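
A hedged sketch of the priority 0 check follows. It assumes the marker files sit directly under the per-user directories and treats the queue line as opaque; neither the exact layout nor the real queue.lst format is spelled out on this page.

   use strict;
   use warnings;

   # Priority 0: look for the droppings clusterpush.sh leaves in the user
   # source directories, queue a job for each, and remove the marker so the
   # same request is not queued twice. The queue line written here is a
   # placeholder; the real format is whatever the clustermaster expects.
   sub queue_user_jobs {
       my ($userdir, $queuefile) = @_;               # e.g. .../cluster/user, queue.lst
       for my $marker (glob("$userdir/*/{gs,ghostscript,cluster_command}.run")) {
           open(my $in, '<', $marker) or next;
           my $request = do { local $/; <$in> } // '';   # whatever clusterpush.sh wrote there
           close($in);
           chomp($request);
           unlink($marker);
           open(my $out, '>>', $queuefile) or die "cannot append to $queuefile: $!";
           print $out "$marker $request\n";          # placeholder queue line
           close($out);
       }
   }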

Phase 2 in detail

Any clustermaster that gets past phase 2 puts its process id into clustermaster.pid (or clustermaster.1.pid etc). This means that when the clustermaster starts up, it can check that file to see if there is supposedly another clustermaster process running. If the file exists, it checks the timestamp. If the timestamp is sufficiently old, we decide that the old clustermaster is probably dead. "Sufficiently old" depends on the priority level; priority 2 jobs are given much longer than higher priority ones, to allow stuff like coverage to run. If we decide the old clustermaster is dead, we take over the clustermaster.pid file and wait for 5 minutes. This prevents any other clustermasters from being spawned, and gives any old clustermaster time to spot the change of pid and shut down neatly.

If the pid no longer refers to a clustermaster process, we do the same.

If we pass these tests, we put our own PID into the clustermaster.pid file. We are now the active clustermaster, and are expected to run the next job.
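
In code, the dance amounts to something like the sketch below. The helper names (try_to_become_master, write_pid, read_pid) are made up, and the staleness threshold is a parameter because only its relative size per priority is described above; the real script also checks that the pid actually belongs to a clustermaster process, which is only hinted at in a comment here.

   use strict;
   use warnings;

   sub try_to_become_master {
       my ($pidfile, $max_age) = @_;
       if (-e $pidfile) {
           my $age = time() - (stat($pidfile))[9];       # mtime of the pid file
           my $pid = read_pid($pidfile);
           # The real script also checks that $pid refers to a clustermaster
           # process; kill(0, ...) only checks that some process exists.
           my $alive = ($pid =~ /^\d+$/) && kill(0, $pid);
           return 0 if $alive && $age < $max_age;        # a live, recent master: give up
           write_pid($pidfile, $$);                      # take over the pid file
           sleep(5 * 60);                                # let the old master notice and exit neatly
           return read_pid($pidfile) == $$ ? 1 : 0;      # did anyone usurp us in the meantime?
       }
       write_pid($pidfile, $$);
       return 1;                                         # we are now the active clustermaster
   }

   sub write_pid { my ($f, $p) = @_; open(my $fh, '>', $f) or die "$f: $!"; print $fh "$p\n"; close($fh); }
   sub read_pid  { my ($f) = @_; open(my $fh, '<', $f) or return -1; chomp(my $p = <$fh> // ''); close($fh); return $p || -1; }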

Phase 3 in detail

During this entire phase, we periodically check to see if the PID in clustermaster.pid is ours. If it isn't (or the file has been deleted), we take this as a sign that we have been killed off, and exit.

We check to see if we have any queued runs; if not, we can release the clustermaster.pid and exit.

Otherwise, we read the first one from the file, and execute it.
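
As a small sketch of that housekeeping (read_pid() is the helper from the phase 2 sketch above, and the contents of a queue line are treated as opaque here):

   use strict;
   use warnings;

   # Are we still the registered clustermaster, and is there anything queued?
   sub still_the_master {
       my ($pidfile) = @_;
       return -e $pidfile && read_pid($pidfile) == $$;
   }

   sub next_queued_run {
       my ($queuefile) = @_;
       open(my $fh, '<', $queuefile) or return undef;    # no queue file: nothing to run
       my $first = <$fh>;
       close($fh);
       chomp($first) if defined $first;
       return $first;                                    # undef if the queue is empty
   }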

We create a job.start file (or job.1.start or job.2.start according to priority) which contains details of the job to run.

Then we call build.pl to get the job list; this goes into the jobs file (or jobs.1, jobs.2 according to priority). This is a list of lines, one per test to run. The lines (barring some initial comment lines) are in the format:

   <test> <logfile> true ; <command string>

<test> is a string that describes the test to be used; a combination of test file name (with / replaced by double _), and the devices/resolution/banding settings. e.g. tests_private__pdf__PDF_1.7_FTS__fts_17_1705.pdf.pdf.pkmraw.300.0

<logfile> is a string that gives a name to be used for the logfile of the test. e.g. __temp__/tests_private__pdf__PDF_1.7_FTS__fts_17_1705.pdf.pdf.pkmraw.300.0..gs_pdf

<command string> is a ';' separated list of commands to be executed to run the test. These commands and their outputs are captured to the logfile.

Typically the command string will involve at least one call to ghostscript (or another executable) to produce output.

Simple ghostscript tests will call ghostscript, and pipe its output through md5sum. This md5sum is then echoed to the output for processing when the job results are returned to the clustermaster.

Tests using pdfwrite/xpswrite/ps2write etc will call ghostscript once to output an intermediate file, then again to render that intermediate file. md5sums of both intermediate and final results are sent back in the logs.

Tests using bmpcmp will call the version of ghostscript under test, and then the reference version of ghostscript to generate 2 different bitmaps. These will then be fed into bmpcmp for comparison. In the case of pdfwrite etc, there will be 4 invocations. rsync is used as part of the command string to upload the output of bmpcmp.

Each of these lines will be handed out to nodes during the course of the job.
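
As a concrete but entirely illustrative example, here is how one such line breaks down. The sample line (and in particular its command string) is made up to show the "pipe gs through md5sum" shape described above; it is not output copied from build.pl.

   use strict;
   use warnings;

   # A made-up jobs line in the <test> <logfile> true ; <command string> shape.
   my $line = 'tests_private__pdf__PDF_1.7_FTS__fts_17_1705.pdf.pdf.pkmraw.300.0 '
            . '__temp__/tests_private__pdf__PDF_1.7_FTS__fts_17_1705.pdf.pdf.pkmraw.300.0..gs_pdf '
            . 'true ; gs -q -sDEVICE=pkmraw -r300 -o - fts_17_1705.pdf | md5sum';

   # Split the fixed prefix from the command string at the first ';' (the
   # command string itself may contain further ';'-separated commands).
   my ($head, $commands) = split(/\s*;\s*/, $line, 2);
   my ($test, $logfile)  = split(/\s+/, $head);

   print "test:     $test\n";
   print "logfile:  $logfile\n";
   print "commands: $commands\n";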

Then we actually run the job.

The clustermaster starts a service up listening on a port (9000+priority), and goes into a loop.

Firstly in the loop, we check to see if someone has usurped our .pid file, and exit neatly if so.

Next, we check to see if a higher priority job is being run (by checking for the existence of job.start or job.1.start etc). If it is, we "pause" ourselves. This involves us updating the timestamp on our .pid files (and our expected timeout times for nodes), so that no matter how long the other job takes, we won't be pushed into a timeout.
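
A minimal sketch of that pause, assuming node timeouts are tracked as a hash of deadlines (the real bookkeeping is not described here):

   use strict;
   use warnings;

   # Refresh our own .pid timestamp and push out each node's expected timeout
   # so that time spent waiting behind a higher priority job counts against
   # neither us nor the nodes.
   sub pause_while_higher_priority_runs {
       my ($pidfile, $deadlines, $grace) = @_;     # $deadlines: node name => epoch deadline
       my $now = time();
       utime($now, $now, $pidfile);                # this is the timestamp phase 2 looks at
       $deadlines->{$_} = $now + $grace for keys %$deadlines;
   }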

When a node connects to the service, it sends a bunch of headers, and then either "OK" or "ABORT". (A sketch of how the server handles these connections is given after the list below.)

  • One of the headers is "id", a value parrotted back to the server from where the node has read it from job.start. If this doesn't match the id we expect, the node is still trying to run an old job. We tell it to abort.

  • One of the headers is "node", the node name. This must always be present, or we'll abort the node. This should never happen in a well-behaved system.

  • One of the headers is "myid", the process id of the nodes cluster process. If this ever changes during a job, it's a sign that the cluster node has gone very confused (or is running 2 competing clusterpull processes). We therefore can't trust that node, and disable it. This should never happen in a well-behaved system.

  • One of the headers is "caps", a string that describes the capabilities of the node. On the first time a node connects, we check to see that the node is suitable for the job; If it is unsuitable, we add it to the 'cull' list and abort the node. On subsequent connections we check to see if the node is in the cull list; if it is, we abort it again.

  • One of the headers is "failure". This will be set if the node has encountered a failure during a task we have given it (such as building the source, or uploading the results). If it fails during building, we abort the job. If it fails during upload, we tell it to try again.

  • One of the headers is "busy". This will be set if the node is still busy doing a task we gave it (such as building the source, or uploading the results). It connects in anyway so we know that it is still working on the problem, and hasn't died (so we won't time it out).

  • The first thing a new node is told is to "BUILD". This involves updating the source and test repositories and then building. This can take a while, so the node continues to connect in periodically to tell us it is "busy" working on it.

  • One of the headers is "jobs", followed by 2 numbers. The first is the number of jobs pending (the number of jobs that has been given to the node that have not yet been run). The second is the number of jobs running at the moment. When a node connects with no pending jobs, and we have jobs still to h

  • Each node connects to this service in turn to request jobs. They are either given jobs to do, or told to 'standby'.

  • If a file ".down" exists on the server, then we take that to mean that that node should be ignored. We abort it, and push any jobs it's had already back onto the list.

  • If we don't hear a heartbeat from a node every 3 minutes, we assume that that node has died. We therefore put the jobs that have been sent to that node back into the joblist, and carry on without that node.

  • The first time a node connects, after checking its capabilities, we tell it to build. It will start this as a background task, and keep connecting in periodically while it is doing it. Each time it connects in it says it's "busy", and we'll just tell it to "CONTINUE".

  • When a node finishes building, it will connect to us again; we'll spot that it is no longer busy, and check the "failure" header. If it's failed, we abort the job entirely. If not, it is eligible to be fed jobs.

  • When a node connects, it tells us how many jobs it has queued, and how many are running at a time. We hand out jobs in batches to each node. When each node finishes reading the job list, it sends an acknowledgement back. If we don't get that acknowledgement, we assume that the connection dropped, and we will resend that job batch again later.

  • If we have no more jobs to hand out, we check to see whether any nodes are delinquent (i.e. whether it has been more than 3 minutes since that node last checked in). If so, we assume that node has died, we put that node's jobs back into the joblist to be handed out again, and we mark that node as delinquent. This stops us requiring it to upload the results.

  • We keep track of which nodes have results to upload. Whenever a node connects with all its jobs having run, and we have no more jobs to hand out, if it still has data to upload, we tell it to "UPLOAD". The node does this in the background, and continues to connect while it is doing so, just telling us that it is "busy". When the node connects having finished the upload, we can clear our "has results to upload" flag.

  • When all the jobs have been handed out, and all the nodes have finished running all their jobs, and all the nodes have no more results to upload, the run is finished.
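
The following is the much-simplified sketch promised above. Only the header names (id, node, myid, caps, failure, busy, jobs) and the idea of BUILD/CONTINUE/UPLOAD/abort/standby replies come from this page; the wire format (one "name value" header per line, terminated by OK or ABORT), the batch size, the exact spelling of the reply tokens, and the bookkeeping helpers are all assumptions. Heartbeats, .down files, delinquent-node requeueing and batch acknowledgements are omitted entirely.

   use strict;
   use warnings;
   use IO::Socket::INET;

   my $priority    = 0;
   my $expected_id = 'id-read-from-job.start';             # placeholder
   my @jobs        = read_jobs_file("jobs");               # one entry per line of the jobs file
   my (%culled, %building, %uploading, %has_results);

   my $listener = IO::Socket::INET->new(
       LocalPort => 9000 + $priority, Listen => 5, ReuseAddr => 1)
       or die "cannot listen: $!";

   while (my $conn = $listener->accept()) {
       my %hdr;
       while (defined(my $line = <$conn>)) {
           chomp($line);
           last if $line eq 'OK' || $line eq 'ABORT';      # end of the headers
           next unless length($line);
           my ($key, $value) = split(/\s+/, $line, 2);
           $hdr{$key} = defined($value) ? $value : 1;
       }
       my $node = $hdr{node};
       my $reply;
       if (!defined($node) || ($hdr{id} // '') ne $expected_id) {
           $reply = "ABORT";                               # no node name, or still on an old job
       } elsif ($culled{$node} ||= !caps_ok($hdr{caps})) {
           $reply = "ABORT";                               # unsuitable for this job
       } elsif ($hdr{busy}) {
           $reply = "CONTINUE";                            # still building or uploading
       } elsif (!$building{$node}) {
           $building{$node} = 1;
           $reply = "BUILD";                               # first contact: update source and build
       } elsif ($hdr{failure}) {
           die "node $node failed to build; abort the job\n";  # crude stand-in; uploads are retried instead
       } elsif (@jobs) {
           $has_results{$node} = 1;                        # it will have results to send back
           $reply = join("\n", splice(@jobs, 0, 10));      # hand out a batch (size is made up)
       } elsif ($has_results{$node}) {
           if ($uploading{$node}) { delete $has_results{$node}; $reply = "STANDBY"; }
           else                   { $uploading{$node} = 1;  $reply = "UPLOAD"; }
       } else {
           $reply = "STANDBY";                             # nothing left for this node to do
       }
       print $conn "$reply\n";
       close($conn);
   }

   sub caps_ok        { 1 }                                # placeholder suitability check
   sub read_jobs_file { my ($f) = @_; open(my $fh, '<', $f) or return (); my @l = <$fh>; chomp(@l); @l }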

Next we mark all the machines as idle, unpack the build logs, create, populate and file the job report (including emailing it out).

We remove the job we have just executed from the queue, delete the clustermaster.pid file, and then exit.

monitorPort.pl

Previous versions of the cluster code have used a second port to handle nodes checking in to say they are still alive. This is no longer required since the rewrite of the server/node communication.

build.pl

This file knows how to generate the "job strings" to pass to the clients for execution.

This is where the knowledge of how to pass parameters to gs/mupdf etc lives, along with which files to test, with which extensions, etc.

A Cluster Node

A cluster node has a git clone of tests and tests_private. It also has local copies of the ghostpdl, mupdf, gsview, luratech, ufst, users/XXXX/ghostpdl and users/XXXX/mupdf directories from the server. The user ones are kept up to date using rsync, the others by git.

Periodically (typically by a cronjob every 10 minutes), we run clusterpull.pl.

Unless it is explicitly given a name by the caller, this file now gets the machine name from "hostname".

When run, this checks to see if there are existing clusterpull processes running, checking for timeouts etc.

Once it is satisfied that it is the only game in town, it starts the real work.

It enters a loop (one pass of which is sketched after the list below):

  • First it checks to see that no one has usurped its .pid file, and exits if they have.

  • It updates the timestamp on the .pid file so future checks won't time us out.

  • For each priority level of the cluster:

    • If run.pid exists, then we are already processing a job. Nothing more to do.

    • We delete any existing job.start, and try to pull down a new one from the clustermaster.

    • If no job.start arrives, then we have nothing to do.

    • We pull down an updated run.pl file from the clustermaster, and fork a process to run it.

    • We wait for either the run.pl to exit, or for it to create a run.pid file.

That's all.
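
As promised above, here is a rough sketch of one pass of the per-priority part of that loop. fetch_from_master() is a placeholder for however job.start and run.pl really get pulled down, and the per-level file names simply follow the suffix convention used elsewhere on this page.

   use strict;
   use warnings;
   use POSIX qw(WNOHANG);

   for my $priority (0, 1, 2) {
       my $suffix    = $priority ? ".$priority" : "";
       my $run_pid   = "run$suffix.pid";
       my $job_start = "job$suffix.start";

       next if -e $run_pid;                            # already processing a job at this level

       unlink($job_start);                             # throw away any stale job.start
       fetch_from_master($job_start, $priority);       # try to pull down a new one
       next unless -e $job_start;                      # nothing queued for us

       fetch_from_master("run.pl", $priority);         # always run the latest run.pl
       my $pid = fork();
       die "fork failed: $!" unless defined $pid;
       if ($pid == 0) {
           exec($^X, "run.pl") or die "exec failed: $!";
       }
       # Parent: wait for run.pl to either exit or create its run.pid file.
       while (!-e $run_pid && waitpid($pid, WNOHANG) == 0) {
           sleep(1);
       }
   }

   sub fetch_from_master { }                           # placeholder for the real transfer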

run.pl

run.pl contains the guts of the client logic. This knows how to build each of the different products by looking in the job.start file obtained from the clustermaster by clusterpull.pl.

It then repeatedly connects to the server to request jobs and run them.
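
A hypothetical sketch of the node's side of one such connection: send the headers described earlier, finish with "OK", and act on whatever comes back. The wire format matches the server sketch above and is equally an assumption; the state hash is invented for illustration.

   use strict;
   use warnings;
   use IO::Socket::INET;
   use Sys::Hostname;

   sub ask_the_master {
       my ($host, $priority, $job_id, $state) = @_;
       my $conn = IO::Socket::INET->new(
           PeerAddr => $host, PeerPort => 9000 + $priority) or return;
       print $conn "id $job_id\n";
       print $conn "node ", hostname(), "\n";
       print $conn "myid $$\n";
       print $conn "caps $state->{caps}\n";
       print $conn "jobs $state->{pending} $state->{running}\n";
       print $conn "busy 1\n"    if $state->{busy};
       print $conn "failure 1\n" if $state->{failure};
       print $conn "OK\n";
       my @reply = <$conn>;                            # BUILD, CONTINUE, UPLOAD, a job batch, ...
       close($conn);
       chomp(@reply);
       return @reply;
   }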

-- Robin Watts - 2017-02-04

See also:

ClusterNodes, ClusterWork, ClusterHowTo
