
Revision 6 - 2019-04-08 - RobinWatts


Cluster work


Consider pushing coverage onto the cluster

Find out how to run the coverage tests.
Done, but dies with full jobs due to timeouts. Consider revamping the cluster core.

Rework cluster central loop


  • Make it time out less.
  • Make it less sensitive to node disconnections (if a node disconnects, be lazy about redistributing its jobs).
  • Simplify the current rat's nest of code (remove the heartbeat if possible).
  • Allow for nodes joining part way through a run (to allow for AWS nodes spinning up).
  • Avoid nodes getting out of sync w.r.t. which job is running.


Currently, jobs are started by the cluster making a series of $m.start files (one for each machine $m). The nodes periodically download $m.start and use that to trigger a new job.

Instead, we'll create a single 'job.start' file that contains the same stuff, but also a job id (probably the time of the job start, like 20190408115600 or something). Nodes will download this file rather than $m.start (and they won't delete it after they download it, obviously). This file will remain present for the entire duration of a job.
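If the job id really is the job start time, generating it is just a timestamp format. A minimal sketch, assuming the YYYYMMDDhhmmss form suggested above (the exact format is still undecided):

```python
from datetime import datetime, timezone

def make_job_id(now=None):
    """Build a job id from the job start time, e.g. 20190408115600.

    Hypothetical helper: the notes above only suggest a time-based id;
    this assumes YYYYMMDDhhmmss in UTC unless a datetime is supplied.
    """
    now = now or datetime.now(timezone.utc)
    return now.strftime("%Y%m%d%H%M%S")
```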

Currently, the cluster master starts just the nodes that it thinks should be capable of running the job (as it only creates $m.start if it thinks that $m is capable). Now we start every node.

Currently, each node sends back a capability string, stored as $m.caps, that says what the node can and cannot do. We will no longer send this as a separate file, but as part of the main cluster loop.

When a node connects, it will send a series of lines, terminated by a line that says OK or ABORT. Each of these lines will be of the form:

<var> <value>

For example:

   node <clustername>
   id <jobid>
   jobs <num jobs pending>
   timeouts <num> <max>

This enables us to update the protocol later.
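The connect handshake above could be parsed roughly as follows. This is a sketch only: a real node would stream lines off a socket rather than take a list, and the function names here are hypothetical:

```python
def parse_node_header(lines):
    """Parse the '<var> <value>' lines a node sends when it connects.

    Returns (fields, terminator), where terminator is 'OK' or 'ABORT'.
    Unknown vars are kept as-is, which is what lets the protocol be
    extended later without breaking older masters.
    """
    fields = {}
    for raw in lines:
        line = raw.strip()
        if line in ("OK", "ABORT"):
            return fields, line
        # Split on the first space: var name, then the rest as the value.
        var, _, value = line.partition(" ")
        fields[var] = value
    raise ValueError("header not terminated by OK or ABORT")
```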

node <clustername>
Inform the clustermaster which node is calling in.

id <jobid>
Inform the clustermaster which job the node thinks it is running. If this mismatches, the clustermaster will tell the node to abort.

jobs <num jobs pending>
This tells the clustermaster how many jobs the node has been given to run that have not currently completed. When this number drops to 0, the clustermaster can consider giving the node another batch of jobs.

caps <capability string>
The node tells the clustermaster what its capabilities are.

status <status string>
A status string for the node. Can be anything.

busy
Inform the clustermaster that the node is currently busy doing previously allocated work (building, or uploading files).

failure <reason>
Inform the clustermaster that the node failed, and that failure logs have been uploaded. Negative reasons are considered fatal and should stop the job on all nodes. Values for reason include:
  • -1: Failed because of a build problem.
  • 1: The node failed because of a network problem.
  • 2: Failed because of too many timeouts.
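As a minimal sketch of the convention above (negative reasons are fatal to the whole job, positive ones only affect this node):

```python
# Known failure reasons from the notes above; this table is illustrative,
# more reasons may well be added.
FAILURE_REASONS = {
    -1: "build problem",
    1: "network problem",
    2: "too many timeouts",
}

def failure_is_fatal(reason):
    """Negative reasons should stop the job on all nodes, not just
    take this node out of the run."""
    return reason < 0
```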

If 'ABORT' is received, the node will disconnect and clean up. Note that the node may then start itself up again from scratch and try to join in the job again, so the cluster master should be able to cope with this.

Once 'OK' is received by the master, the master will send a response, then disconnect. The responses are:

ABORT
The node should abort instantly. This will be used if the capabilities string for the node is unsuitable, if the jobid is wrong, if the job has been aborted, etc.

BUILD
The node should build. This will be the first command sent by the master to a node. It triggers updates of the test files and source files, and the build of the source files. The node will report 'busy' while it is doing this.

UPLOAD
The node should upload any log files. The node will report 'busy' while it is doing this.

JOBS\n<num jobs>\n<list of job strings>\n
The master passes a list of job strings to the node.

If none of these apply, the master tells the node to go away and continue with its existing work, and to come back in a bit to see if the situation has changed.
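A sketch of the JOBS framing described above, assuming job strings contain no newlines (the helper names are hypothetical):

```python
def format_jobs_message(jobs):
    """Serialise a batch of job strings as JOBS\n<num jobs>\n<jobs...>\n."""
    return "JOBS\n%d\n%s\n" % (len(jobs), "\n".join(jobs))

def parse_jobs_message(msg):
    """Inverse of format_jobs_message: recover the list of job strings."""
    lines = msg.split("\n")
    if lines[0] != "JOBS":
        raise ValueError("not a JOBS message")
    count = int(lines[1])
    return lines[2:2 + count]
```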

Normal course of events (-> = master to node, <- = node to master):

    master: creates job.start
    node:   downloads job.start
    node:   <- connects, sends capabilities/job id
    master: if the jobid is bad -> ABORT (but be prepared to accept the node again in future)
    master: if the capabilities are bad -> ABORT (and remember the node as no good for this job)
    master: -> BUILD
    node:   updates test files/source files
    node:   if the updates fail, <- ABORT (with appropriate reason)
    node:   builds code
    node:   if the build fails, uploads logs, <- ABORT (with appropriate reason)
    node:   <- connects in without 'busy'
    master: feeds jobs ->
    node:   <- connects in with status updates every n seconds
    master: when a node reports empty:
    master:   if we have more jobs -> feed jobs
    master:   else if we have delinquent nodes -> reschedule jobs from the delinquent node
    master:   else if the node has not uploaded -> upload
    master: when all nodes complete && uploaded -> abort
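The master's empty-queue branches above can be sketched as a single decision function. This is only a sketch: 'WAIT' is a placeholder name for the "go away and come back" case, not an established protocol command:

```python
def next_action(more_jobs, delinquent_jobs, node_uploaded, all_done):
    """What the master does when a node reports 0 pending jobs.

    Mirrors the 'when node reports empty' branches in the notes above.
    """
    if more_jobs:
        return "JOBS"      # feed the node the next batch
    if delinquent_jobs:
        return "JOBS"      # reschedule jobs from a delinquent node
    if not node_uploaded:
        return "UPLOAD"    # collect the node's log files
    if all_done:
        return "ABORT"     # all nodes complete and uploaded: wind down
    return "WAIT"          # placeholder: come back later and ask again
```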
  -- Robin Watts - 2017-02-15