Handle jobs that fail to delocalize on the cluster
Created by: julianhess
Currently, orchestrator.make_output_DF() assumes that the keys of self.job_spec are identical to those of the outputs parameter. The outputs parameter is almost always generated by running localizer.delocalize(), so assuming that self.job_spec.keys() == outputs.keys() implicitly assumes that every single job properly delocalized.
This is not necessarily a safe assumption; for example, a long-running task that runs out of preemption attempts will never get a chance to run delocalization.py.
This PR will do two things:
- Attempt to run delocalization.py on the controller, so that we can at least recover stdout/stderr
- For jobs that were totally lost, set the value of any key missing from outputs to {}
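
The second item can be sketched roughly as follows. This is only an illustration, not the code in this PR: fill_missing_outputs is a hypothetical helper, and it assumes that both job_spec and outputs are plain dicts keyed by job ID.

```python
def fill_missing_outputs(job_spec, outputs):
    """Return a copy of outputs with {} for every job ID that is missing.

    Hypothetical helper: assumes job_spec and outputs are dicts keyed by job ID.
    """
    filled = dict(outputs)  # shallow copy, so the caller's dict is not mutated
    for job_id in job_spec:
        # Jobs that never delocalized have no entry; give them an empty one.
        filled.setdefault(job_id, {})
    return filled


# Made-up example data: job "1" was lost and never delocalized.
job_spec = {"0": {"sample": "A"}, "1": {"sample": "B"}, "2": {"sample": "C"}}
outputs = {"0": {"stdout": "0/stdout"}, "2": {"stdout": "2/stdout"}}

filled = fill_missing_outputs(job_spec, outputs)
```

After this step the key sets of the job spec and the outputs match again, so code that relies on self.job_spec.keys() == outputs.keys() would see an empty entry for the lost job rather than a missing key.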