Revamp job avoidance and accounting

Aaron Graubert requested to merge jobavoid into master

Created by: julianhess

  • Avoidance

    • Avoided shards are now implemented as Slurm no-ops: the batch is started paused, and any no-op'd shards are then cancelled
    • delocalization.py computes SHA1 checksums for every output
    • delocalization.py saves output patterns to the job manifest, so that Orchestrator.job_avoid() can match them
    • Orchestrator.job_avoid() has been rewritten to read from individual shard manifests, rather than the entire job dataframe
  • Accounting

    • Accounting information for each shard is saved to disk and can be reloaded in the exact format returned by Backend.sacct(). This lets us keep track of accounting info across avoided jobs
    • Simplified accounting in Orchestrator.wait_for_jobs_to_finish
  • Job exit, localization, and teardown exit codes are all saved to disk by entrypoint.sh

  • Add hashing features to utils.py

  • The Docker backend can connect to a preexisting controller container, in which case the backend won't attempt to stop the container after it exits

  • Bump version to 0.10
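
To illustrate the avoidance flow described above (SHA1 checksums per output, output patterns saved to a per-shard manifest, and a check that reads that manifest), here is a minimal sketch. It is not the code in this merge request: the function names `sha1_file`, `write_manifest`, and `can_avoid`, the JSON manifest layout, and the use of `fnmatch` for pattern matching are all assumptions made for illustration.

```python
import fnmatch
import hashlib
import json
import os


def sha1_file(path, chunk_size=1 << 20):
    """Stream a file through SHA1 so large outputs are never fully in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(manifest_path, patterns, output_dir):
    """Record each output pattern, the files it matched, and their checksums.

    Hypothetical layout (an assumption, not the real manifest format):
    {output name: {"pattern": str, "files": {file name: sha1 hex digest}}}
    """
    names = sorted(os.listdir(output_dir))
    manifest = {}
    for name, pattern in patterns.items():
        matched = fnmatch.filter(names, pattern)
        manifest[name] = {
            "pattern": pattern,
            "files": {
                f: sha1_file(os.path.join(output_dir, f)) for f in matched
            },
        }
    with open(manifest_path, "w") as handle:
        json.dump(manifest, handle, indent=2)
    return manifest


def can_avoid(manifest_path, patterns, output_dir):
    """Return True only if the saved manifest matches the requested patterns
    and every recorded output still exists with the same checksum."""
    if not os.path.isfile(manifest_path):
        return False
    with open(manifest_path) as handle:
        manifest = json.load(handle)
    saved_patterns = {name: entry["pattern"] for name, entry in manifest.items()}
    if saved_patterns != patterns:
        return False
    for entry in manifest.values():
        if not entry["files"]:
            # A pattern that matched nothing is treated as "must rerun".
            return False
        for file_name, checksum in entry["files"].items():
            path = os.path.join(output_dir, file_name)
            if not os.path.isfile(path) or sha1_file(path) != checksum:
                return False
    return True
```

In this sketch a shard is avoidable only when its own manifest agrees with the requested patterns and the files on disk, which mirrors the idea of checking individual shard manifests rather than one job-wide table. Treating an empty match as "must rerun" is a conservative choice made here, not something the description specifies.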
