Handle jobs that fail to delocalize on the cluster

Aaron Graubert requested to merge github/fork/julianhess/missing_outputs into master

Created by: julianhess

Currently, orchestrator.make_output_DF() assumes that the keys of self.job_spec are identical to those of the outputs parameter. The outputs parameter is almost always generated by running localizer.delocalize(), so assuming that self.job_spec.keys() == outputs.keys() implicitly assumes that every single job properly delocalized.

This is not necessarily a safe assumption. For example, a long-running task that runs out of preemption attempts will never get a chance to run delocalization.py.

This PR will do two things:

  1. Attempt to run delocalization.py on the controller, so that we can at least recover stdout/stderr
  2. For jobs that were lost entirely, set the value of any key missing from outputs to {}
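
A minimal sketch of the second step, for illustration only. The function name fill_missing_outputs and the example job IDs and paths are hypothetical and are not taken from the code in this PR; the sketch assumes that job_spec and outputs are both plain dicts keyed by job ID.

```python
def fill_missing_outputs(job_spec, outputs):
    """Return a copy of outputs with {} for every job ID missing from it.

    Hypothetical helper: assumes job_spec and outputs are dicts keyed by job ID.
    """
    filled = dict(outputs)  # shallow copy, so the caller's dict is not mutated
    for job_id in job_spec:
        # Jobs that never delocalized have no entry; give them an empty one.
        filled.setdefault(job_id, {})
    return filled


# Example: job "1" was lost (e.g. ran out of preemption attempts).
job_spec = {"0": {"input": "a.txt"}, "1": {"input": "b.txt"}, "2": {"input": "c.txt"}}
outputs = {"0": {"stdout": ["0/stdout"]}, "2": {"stdout": ["2/stdout"]}}

filled = fill_missing_outputs(job_spec, outputs)
```

After this, filled has the same keys as job_spec, so code that relies on the two key sets matching no longer breaks on jobs that never delocalized.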
