-
Notifications
You must be signed in to change notification settings - Fork 52
job-exec: make job signaling more configurable and fix documentation #6788
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Problem: The job-exec module doesn't currently support sending SIGTERM to tasks multiple times, but this feature has been requested. Currently, the job-exec module only sends SIGTERM once in the case of a fatal exception, or never in the case of a timeout exception (only SIGALRM is sent before moving to SIGKILL) Add a `max-term-count` configuration option with a default of 2 that determines how many times SIGTERM is sent to the job tasks before transitioning to SIGKILL. Adjust one test that expected tasks to be terminated by SIGKILL instead of SIGTERM.
Problem: The delay that the job execution system uses to transition from killing the job tasks to the job shell is hardcoded to 5*kill-timeout and can not be configured. Add a `kill-shell-delay` configuration option and module parameter that can optionally be used to set the delay from the time the first term signal is sent to when the job-exec module transitions to killing the job shell. The default is 5 * kill-timeout. Expose the currently configured `kill-shell-delay` in the job-exec module stats. For jobs, also expose the remaining time until the kill shell timer elapses.
Problem: No tests in the testsuite ensure that kill-shell-delay is set to a multiple of kill-timeout. Add a test to t2402-job-exec-dummy.t.
Problem: The documentation of [exec] table config is out of date. Also, the description of the job termination process is not correct. Update `flux-config-exec(5)` with new configuration keys, as well as a corrected description in the job termination section.
Problem: The job-exec module configuration tests do not cover some of the newer exec config options. Add tests to t2403-job-exec-conf.t.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #6788 +/- ##
==========================================
- Coverage 83.84% 83.83% -0.01%
==========================================
Files 535 535
Lines 89322 89342 +20
==========================================
+ Hits 74889 74899 +10
- Misses 14433 14443 +10
🚀 New features to boost your workflow:
|
- This is repeated with ``term-signal`` at an interval of | ||
``kill-timeout`` until ``kill-shell-delay`` has elapsed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was there consideration renaming kill-timeout
to term-timeout
? It just seems a tad confusing. Just beginning to skim the code for review, I got confused quickly.
If it's a backwards compatibility thing, could rename this and variables, but have a backwards compatible check for the config?
Edit: unfortunately we use the term kill
for a lot of our signaling, which I suspect is how this originated. Alternate (possibly bad) idea, use sigkill
name instead of kill
a lot of places could be better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I considered that, but it is a kill(2)
timeout, and also the job execution system sends SIGTERM
only max-term-count
times, then switches to SIGKILL
, so term-timeout
would be a misnomer as well. I agree this is pretty confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this PR is a bit too much. We could consider not including the kill-shell-delay
and just fix the documentation to report that it is always 5*kill_timeout
. We do need to keep max-term-count
because it was requested, but adding kill-shell-delay
may not be required (not sure if anyone would adjust it)
This PR adds a couple tunables to the job-exec module for how it signals jobs and then fixes the documentation to actually describe how jobs are terminated. I'm not super happy with the result, but it is an improvement over what's there now because it fixes a couple glaring issues:
SIGTERM
to a job at most once (none in the case of a timeout exception).This PR adds two new tunables
kill-shell-delay
(default: 5 *kill-timeout
) andmax-term-count
(default: 2) to fix the above issues.Additionally, the documentation in the
flux-config-exec(5)
JOB TERMINATION
section was incorrect. This is now fixed to correctly describe the new termination process given the changes above.Fixes #6784