The Structure of the Artifex Cluster

Overview

The cluster consists of a central server and a set of distributed 'nodes' that connect to it. The nodes differ in OS, compiler version, number of CPUs, amount of RAM, connection speed, availability times, etc.

The idea of the cluster is that different tests ("runs") can be performed, with the individual jobs that make up these runs being farmed out to whichever cluster nodes happen to be available at the time.

Cluster runs can be triggered automatically on a commit, at specific times/dates (for nightly/weekly tests), or directly by users.

Different runs can test different things: some runs test gs (or gpcl6, or gxps, etc.), some test mupdf (and mujstest). Another kind can generate a bitmap comparison based on the results of a previous run.

The cluster master

Casper, an AWS node, is our clustermaster. It keeps the queue of runs; for each run in turn it farms jobs out to the available resources, then gathers the results back in at the end.

Something (I have yet to figure out what) runs runClustermaster.sh. This checks to see if another instance of runClustermaster.sh is already running - if it is, it just exits. If not, it goes into an infinite loop, kicking off clustermaster.pl every 10 seconds.
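
As a rough sketch of that wrapper logic (the real script is a shell script; this Perl rendering uses a hypothetical lock file rather than whatever process check runClustermaster.sh actually performs):

  #!/usr/bin/perl
  use strict;
  use warnings;
  use Fcntl qw(:flock);

  # If another wrapper already holds the lock, just exit
  # (the lock file name here is purely illustrative).
  open(my $lock, '>', '/tmp/runClustermaster.lock') or die "lock: $!";
  exit 0 unless flock($lock, LOCK_EX | LOCK_NB);

  # Otherwise loop forever, kicking off clustermaster.pl every 10 seconds.
  while (1) {
      system('perl', 'clustermaster.pl');
      sleep 10;
  }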

clustermaster.pl is the core of the server. Its operation is split into 3 phases.

Phase 1 runs every time it is invoked, and consists of spotting new jobs that need to be added to the job queue.

Phase 2 has clustermaster.pl checking to see if there is another instance of itself running. If there is and it is too old, it kills the old instance and exits. If there is and it is not too old, it simply exits. If there isn't, it proceeds to phase 3.

Phase 3 involves actually running the job, collating the results, and mailing them out.

Then the clustermaster exits. Every clustermaster.pl invocation lasts only long enough to do a single run.

Phase 1 in detail

The cluster system contains several different copies of the source. /home/regression/cluster/{ghostpdl,mupdf} are git clones of the sources. In addition, /home/regression/cluster/user/XXXX/{ghostpdl,mupdf} contain per-user copies (not git clones).

First, the clustermaster checks whether new commits have been made to the ghostpdl/mupdf git repos. If any are found, it queues one job for each new commit.
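
A minimal sketch of that polling step, using ghostpdl's master branch as the example and a hypothetical queue_job() helper (how the last-seen commit is actually remembered is not shown):

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Placeholder for whatever actually adds an entry to the job queue.
  sub queue_job { my ($repo, $rev) = @_; print "queue $repo $rev\n" }

  # Hash of the last commit queued on a previous poll (normally this would
  # be read from saved state; hard-coded here purely for illustration).
  my $last_seen = 'deadbeef';

  # Refresh the local clone, then list any commits we haven't seen yet.
  chdir('/home/regression/cluster/ghostpdl') or die "chdir: $!";
  system('git', 'fetch', 'origin') == 0 or die "git fetch failed";
  my @new = `git rev-list --reverse $last_seen..origin/master`;
  chomp @new;

  # Queue one job per new commit, oldest first.
  queue_job('ghostpdl', $_) for @new;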

Next, the clustermaster checks for user jobs by looking for various droppings that clusterpush.sh leaves in the users' source directories. These include gs.run, ghostscript.run and cluster_command.run. If any are found, the clustermaster removes them and creates jobs in the job queue.
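
Sketching that scan (the exact location of the flag files and the queue_user_job() helper are assumptions):

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Placeholder for whatever actually adds the user's job to the queue.
  sub queue_user_job { my ($user, $flag) = @_; print "queue $user $flag\n" }

  # Look in each user's area for the droppings left by clusterpush.sh.
  for my $dir (glob '/home/regression/cluster/user/*') {
      my ($user) = $dir =~ m{/([^/]+)$};
      for my $flag (qw(gs.run ghostscript.run cluster_command.run)) {
          my $path = "$dir/$flag";
          next unless -e $path;
          unlink $path;                   # remove the dropping...
          queue_user_job($user, $flag);   # ...and queue the corresponding job
      }
  }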

Phase 2 in detail

Any clustermaster that gets past phase 2 puts its process id into clustermaster.pid. This means that when a clustermaster starts up, it can check whether another clustermaster process is supposedly running by checking that file. If the file exists, it checks the time. No run should take more than 2 hours, so if the timestamp of the pid file is older than this, we decide that the old process is probably dead. To be sure, we remove the clustermaster.pid file and wait for 5 minutes. This prevents any other clustermasters from being spawned, and gives any old clustermaster time to shut down neatly.

If the pid no longer refers to a running clustermaster process, we do the same.

If we pass these tests, we put our own PID into the clustermaster.pid file.
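
A minimal sketch of that locking scheme, assuming the age check uses the pid file's mtime and that liveness is tested with kill(0, ...):

  #!/usr/bin/perl
  use strict;
  use warnings;

  my $pidfile = 'clustermaster.pid';

  if (-e $pidfile) {
      open(my $fh, '<', $pidfile) or die "open: $!";
      chomp(my $old_pid = <$fh>);
      my $age = time() - (stat($pidfile))[9];    # seconds since last touched

      if ($age > 2 * 60 * 60 || !kill(0, $old_pid)) {
          # Looks dead: kill it to be sure, remove the pid file, then wait
          # 5 minutes so any straggler has time to shut down neatly.
          kill('TERM', $old_pid);
          unlink $pidfile;
          sleep(5 * 60);
      }
      exit 0;    # either way, this invocation goes no further
  }

  # No pid file: claim it with our own PID and proceed to phase 3.
  open(my $out, '>', $pidfile) or die "open: $!";
  print $out "$$\n";
  close $out;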

Phase 3 in detail

First, we ensure that we have a monitorPort.pl task running. If one isn't running, we start it. This allows the nodes to show their 'heartbeat' to the clustermonitor. See below.

During this entire phase, we periodically check to see if the PID in clustermaster.pid is ours. If it isn't (or the file has been deleted), we take this as a sign that we have been killed off, and exit.

Next, we check to see if we have any queued runs. If we do, we take the first one, and execute it.

We detect the available machines. We cull this list according to job type (not all machines can run all jobs - Windows machines can't run Linux jobs, for example).
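
As an illustration, the culling might look something like this (the capability data and the job's requirements are invented; in reality the capabilities come from what sendCaps.pl reports, described below):

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Capabilities reported by each node (contents invented for illustration).
  my %caps = (
      node1 => { os => 'linux',   bits => 64 },
      node2 => { os => 'darwin',  bits => 64 },
      node3 => { os => 'windows', bits => 64 },
  );

  # Keep only the nodes whose OS matches what this job type needs.
  my $wanted_os = 'linux';
  my @usable = grep { $caps{$_}{os} eq $wanted_os } sort keys %caps;

  print "usable nodes: @usable\n";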

Then we call build.pl to get the job list. This goes into job.txt and is a list of commands to be fed out to the client nodes, one line per job.

Then we actually run the job. To allow for network outages etc., we try the following set of steps up to 5 times, until we get either success or a non-network-related failure (see the sketch after the list):

  • The status files for the nodes are cleared.

  • A 'start' file is created for each node in use to trigger it.

  • The clustermaster starts a service up listening on a port (9000).

  • Each node connects to this service in turn to request jobs. It is either given jobs to do, or told to 'standby'.

  • If we don't hear a heartbeat from a node for 3 minutes, we assume that that node has died. We therefore put the jobs that had been sent to that node back into the job list, and carry on without it.

  • If any of the nodes reports a compileFail, this is taken as a hard failure.

  • When we've sent every job out, and every node has then reported in to ask for more jobs, we know that every job has been successfully completed; the run is done.

  • If we drop too many nodes, we try again.
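
The retry wrapper around those steps is roughly as follows (distribute_jobs() is a stand-in for the whole sequence above, and the way failures are classified is an assumption):

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Placeholder for the distribution sequence described in the list above.
  sub distribute_jobs { return 'success' }

  # Try up to 5 times; stop early on success, or on any failure
  # (e.g. a compileFail) that retrying will not fix.
  my $ok = 0;
  for my $attempt (1 .. 5) {
      my $result = distribute_jobs();
      if ($result eq 'success')      { $ok = 1; last }
      if ($result ne 'network_fail') { last }
      # network-related failure: go round and try again
  }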

Next we mark all the machines as idle, unpack the build logs, and create, populate and file the job report (including emailing it out).

We remove the job we have just executed from the queue, delete the clustermaster.pid file, and then exit.

monitorPort.pl

This script does the usual "if I am not the only one, then exit" dance.

If it is the only one, then it simply runs in a loop listening for connections on port 9001. For each connection it gets, it receives the machine name, a file suffix and some file contents. If '$machine.start' exists (i.e. if there is a new job for that node to run), then it returns 1, otherwise 0.

The remote nodes use this to send a heartbeat (and status) to the server, and to be informed when they have a new job to begin.
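
A minimal sketch of that listener, using IO::Socket::INET and assuming a simple line-based exchange (the real wire format may differ, and what the script does with the received contents is not shown):

  #!/usr/bin/perl
  use strict;
  use warnings;
  use IO::Socket::INET;

  # Listen for node heartbeats on port 9001.
  my $server = IO::Socket::INET->new(
      LocalPort => 9001,
      Listen    => 5,
      Reuse     => 1,
  ) or die "listen: $!";

  while (my $conn = $server->accept) {
      # Assumed exchange: machine name, file suffix, then the file contents
      # (read here until the node shuts down its write side).
      chomp(my $machine = <$conn>);
      chomp(my $suffix  = <$conn>);
      my $contents = do { local $/; <$conn> };

      # Reply 1 if a new job is waiting for this node, 0 otherwise.
      my $reply = (-e "$machine.start") ? "1\n" : "0\n";
      print $conn $reply;
      close $conn;
  }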

build.pl

This file knows how to generate the "job strings" to pass to the clients for execution.

This is where the knowledge of how to pass parameters to gs/mupdf etc. lives, along with which files to test, with which extensions, and so on.

A Cluster Node

A cluster node has a git clone of tests and tests_private. It also has local copies of the ghostpdl, mupdf, users/XXXX/ghostpdl and users/XXXX/mupdf directories from the server. These are kept up to date using rsync.
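
The mirroring amounts to something like the following, shown here for the ghostpdl tree (the paths, the flags, and the use of rsync over ssh pulling from casper are assumptions):

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Mirror the master's copy of ghostpdl into this node's local tree;
  # --delete keeps the local copy an exact mirror.
  system('rsync', '-az', '--delete',
         'casper:/home/regression/cluster/ghostpdl/',
         '/home/regression/cluster/ghostpdl/') == 0
      or warn "rsync failed: $?";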

Periodically (typically by a cronjob every 10 minutes), we run clusterpull.sh.

This file contains the actual name of the cluster node - this is the only bit of node-to-node customisation that should be required. Accordingly, this is the only file that cannot trivially be updated from the central server.

When run, this checks to see if there are existing clusterpull processes running, checking for timeouts etc.

Once it is satisfied that it is the only game in town, it starts the real work.

Firstly it pulls down a new version of the heartbeat.pl script.

Next it pulls down a new version of the sendCaps.pl script (not all nodes have this version of clusterpull.sh yet) and runs it. This detects things like what OS is running, what version of gcc is in use, what wordsize is in use, etc., and reports these 'capabilities' back to the central clustermaster. This enables the clustermaster to decide what jobs can safely be sent to this machine.

Then it goes into a loop calling heartbeat.pl once every 10 seconds to tell the server that it lives.

If heartbeat.pl returns with a non-zero exit code, it knows that there is a job to start.

It fetches the $machine.start file from the clustermaster, deletes it on the clustermaster, fetches run.pl from the clustermaster, and executes it.
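
Putting those last few steps together, the node-side loop looks roughly like this (the real logic lives in clusterpull.sh; the helper subs and the node name here are placeholders):

  #!/usr/bin/perl
  use strict;
  use warnings;

  my $machine = 'node1';    # in reality this comes from clusterpull.sh itself

  # Placeholders for however the node actually talks to the master
  # (running heartbeat.pl, fetching and deleting files on the master).
  sub send_heartbeat    { return system('perl', 'heartbeat.pl') >> 8 }
  sub fetch_from_master { my ($f) = @_; print "fetch $f\n" }
  sub delete_on_master  { my ($f) = @_; print "delete $f\n" }

  while (1) {
      # A non-zero exit code from heartbeat.pl means a job is waiting.
      if (send_heartbeat() != 0) {
          fetch_from_master("$machine.start");
          delete_on_master("$machine.start");
          fetch_from_master('run.pl');
          system('perl', 'run.pl');
      }
      sleep 10;
  }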

run.pl

run.pl contains the guts of the client logic. This knows how to build each of the different products by looking in the $machine.start file obtained from the clustermaster by clusterpull.sh.

It then repeatedly connects to the server to request jobs and run them.
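
A minimal sketch of that request loop, assuming a line-based exchange on port 9000 and an assumed 'done' marker for the end of the run (only 'standby' is documented; the real protocol and job-string format are not shown here):

  #!/usr/bin/perl
  use strict;
  use warnings;
  use IO::Socket::INET;

  while (1) {
      # Connect back to the clustermaster's job service on port 9000.
      my $conn = IO::Socket::INET->new(
          PeerAddr => 'casper',
          PeerPort => 9000,
          Proto    => 'tcp',
      ) or die "connect: $!";

      chomp(my $job = <$conn> // '');
      close $conn;

      last if $job eq 'done';       # assumed end-of-run marker
      next if $job eq 'standby';    # nothing for us right now; ask again later

      system($job);                 # execute the job command line we were given
  }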

-- Robin Watts - 2017-02-04

See also:

ClusterNodes, ClusterWork, ClusterHowTo
