Miscellaneous rare bug fixes
Created by: julianhess
- `load_acct_from_disk` overrides whatever job status was written if the job exit code was 0. When loading .sacct from disk, we check whether the shard was avoided (based on `.*exit_code` and `self.job_spec[job] = None`). If it was avoided, we override the status loaded from disk with "COMPLETED", since wolF checks that all statuses == "COMPLETED" to infer whether the whole task finished successfully.
- Fix a crash in `wait_for_jobs_to_finish`. If the job has already finished, the `squeue` command probing its runtime will fail, which raises an exception and crashes the whole task.
- Fix a job avoidance bug: if `job_avoid` fails, it may not roll back any changes to `self.job_spec`. This is problematic, since failure totally wipes the output directory, but an updated `self.job_spec` could still indicate that jobs should be avoided, which would cause an incomplete batch to be submitted.
- Update Docker backend to catch known recoverable Docker errors in `backend.invoke()` and retry the command with exponential backoff.
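
For illustration, the status override for avoided shards could look roughly like the sketch below. It uses plain dicts rather than the real accounting data, and the function name `override_avoided_statuses` and its arguments are hypothetical, not taken from the codebase.

```python
def override_avoided_statuses(statuses, exit_codes, job_spec):
    """Return a copy of `statuses` with avoided shards marked COMPLETED.

    Assumption: a shard counts as avoided when its entry in `job_spec`
    is None and its recorded exit code is 0.
    """
    fixed = dict(statuses)
    for job, spec in job_spec.items():
        if spec is None and exit_codes.get(job) == 0:
            # Ignore whatever status was loaded from disk for this shard.
            fixed[job] = "COMPLETED"
    return fixed


# Hypothetical example data: shard "0" was avoided, shard "1" was run.
statuses = {"0": "FAILED", "1": "COMPLETED"}
exit_codes = {"0": 0, "1": 0}
job_spec = {"0": None, "1": {"input": "b.txt"}}
result = override_avoided_statuses(statuses, exit_codes, job_spec)
```

With every avoided shard reporting "COMPLETED", a check that all statuses equal "COMPLETED" no longer trips over a stale status left on disk.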
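
The rollback for job avoidance can be sketched as a snapshot-and-restore around the mutating call. This is a minimal sketch under assumptions: `job_avoid_with_rollback` and `avoid_fn` are hypothetical names, and returning `False` on failure (rather than re-raising) is an illustrative choice, not necessarily what the fix does.

```python
import copy


def job_avoid_with_rollback(job_spec, avoid_fn):
    """Run `avoid_fn(job_spec)`; restore `job_spec` if it raises.

    `avoid_fn` is a hypothetical callable that may mutate `job_spec`
    in place (for example, setting avoided shards to None).
    """
    snapshot = copy.deepcopy(job_spec)
    try:
        avoid_fn(job_spec)
        return True
    except Exception:
        # Restore in place so other references to job_spec see the rollback.
        job_spec.clear()
        job_spec.update(snapshot)
        return False


def failing_avoid(spec):
    spec["0"] = None  # partial update before the failure
    raise RuntimeError("avoidance failed midway")


spec = {"0": {"input": "a.txt"}, "1": {"input": "b.txt"}}
succeeded = job_avoid_with_rollback(spec, failing_avoid)
```

After the failed call, `spec` again lists both shards, so the full batch would be submitted instead of an incomplete one.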
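
Retrying with exponential backoff, as in the last item, can be sketched generically as follows. `RecoverableError` is a hypothetical stand-in for whichever Docker errors the backend treats as recoverable, and the helper name, attempt limit, and delays are assumptions chosen for illustration.

```python
import time


class RecoverableError(Exception):
    """Hypothetical stand-in for a known recoverable Docker error."""


def retry_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `fn` until it succeeds, doubling the wait after each failure.

    Only RecoverableError triggers a retry; any other exception
    propagates immediately. The last failure is re-raised.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except RecoverableError:
            if attempt == max_attempts - 1:
                raise
            # Waits base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** attempt)


# Usage: a command that fails twice, then succeeds. Delays are recorded
# instead of slept so the example runs instantly.
calls = {"n": 0}
delays = []


def flaky_invoke():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RecoverableError("transient failure")
    return "ok"


outcome = retry_with_backoff(flaky_invoke, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the helper testable; in real use the default `time.sleep` applies.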