Miscellaneous rare bug fixes

Aaron Graubert requested to merge rare_bugs into master

Created by: julianhess

  • load_acct_from_disk now overrides whatever job status was written if the job's exit code was 0. When loading .sacct from disk, we check whether the shard was avoided (based on .*exit_code, and self.job_spec[job] = None). If it was, we override the status loaded from disk with "COMPLETED", since wolF checks that all statuses == "COMPLETED" to infer whether the whole task finished successfully.
  • Fix a crash in wait_for_jobs_to_finish. If the job has already finished, the squeue command probing its runtime fails, which raises an exception and crashes the whole task.
  • Fix a job avoidance bug: if job_avoid fails, it may not roll back its changes to self.job_spec. This is a problem because a failure wipes the output directory entirely, while the updated self.job_spec could still indicate that some jobs should be avoided, which would cause an incomplete batch to be submitted.
  • Update Docker backend to catch known recoverable Docker errors in backend.invoke() and retry the command with exponential backoff.
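
For reviewers unfamiliar with the retry pattern in the last item, here is a minimal sketch of retrying with exponential backoff. It is not the code in this merge request: `RecoverableError`, `retry_with_backoff`, and all parameter names and defaults are hypothetical, and how backend.invoke() actually recognizes a recoverable Docker error is not shown here.

```python
import time


class RecoverableError(Exception):
    """Hypothetical stand-in for a known, transient Docker failure."""


def retry_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call fn(), retrying on RecoverableError with exponentially growing waits.

    The wait before retry k (0-based) is min(max_delay, base_delay * 2**k).
    Any other exception type propagates immediately without a retry.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except RecoverableError:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            sleep(min(max_delay, base_delay * 2 ** attempt))
```

The `sleep` parameter is injectable only so the delays can be observed in tests without actually waiting. A production version would likely add random jitter to the delay so that many shards do not retry in lockstep.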