Skip to content

job-exec: make job signaling more configurable and fix documentation #6788

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

grondo
Copy link
Contributor

@grondo grondo commented Apr 30, 2025

This PR adds a couple tunables to the job-exec module for how it signals jobs and then fixes the documentation to actually describe how jobs are terminated. I'm not super happy with the result, but it is an improvement over what's there now because it fixes a couple glaring issues:

  • the time between when the execution system starts signaling a job and when it transitions to killing the job shell is not configurable
  • the execution system sent SIGTERM to a job at most once (none in the case of a timeout exception).

This PR adds two new tunables kill-shell-delay (default: 5 * kill-timeout) and max-term-count (default: 2) to fix the above issues.

Additionally, the documentation in the flux-config-exec(5) JOB TERMINATION section was incorrect. This is now fixed to correctly describe the new termination process given the changes above.

Fixes #6784

grondo added 5 commits April 28, 2025 10:41
Problem: The job-exec module doesn't currently support sending
SIGTERM to tasks multiple times, but this feature has been
requested. Currently, the job-exec module only sends SIGTERM once
in the case of a fatal exception, or never in the case of a timeout
exception (only SIGALRM is sent before moving to SIGKILL)

Add a `max-term-count` configuration option with a default of 2 that
determines how many times SIGTERM is sent to the job tasks before
transitioning to SIGKILL.

Adjust one test that expected tasks to be terminated by SIGKILL
instead of SIGTERM.
Problem: The delay that the job execution system uses to transition
from killing the job tasks to the job shell is hardcoded to
5*kill-timeout and can not be configured.

Add a `kill-shell-delay` configuration option and module parameter
that can optionally be used to set the delay from the time the first
term signal is sent to when the job-exec module transitions to killing
the job shell. The default is 5 * kill-timeout.

Expose the currently configured `kill-shell-delay` in the job-exec
module stats. For jobs, also expose the remaining time until the kill
shell timer elapses.
Problem: No tests in the testsuite ensure that kill-shell-delay
is set to a multiple of kill-timeout.

Add a test to t2402-job-exec-dummy.t.
Problem: The documentation of [exec] table config is out of date.
Also, the description of the job termination process is not correct.

Update `flux-config-exec(5)` with new configuration keys, as well as
a corrected description in the job termination section.
Problem: The job-exec module configuration tests do not cover some
of the newer exec config options.

Add tests to t2403-job-exec-conf.t.
Copy link

codecov bot commented Apr 30, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.83%. Comparing base (7e7f0fa) to head (2d048e2).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6788      +/-   ##
==========================================
- Coverage   83.84%   83.83%   -0.01%     
==========================================
  Files         535      535              
  Lines       89322    89342      +20     
==========================================
+ Hits        74889    74899      +10     
- Misses      14433    14443      +10     
Files with missing lines Coverage Δ
src/modules/job-exec/job-exec.c 77.64% <100.00%> (+0.62%) ⬆️

... and 11 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment on lines +146 to +147
- This is repeated with ``term-signal`` at an interval of
``kill-timeout`` until ``kill-shell-delay`` has elapsed.
Copy link
Member

@chu11 chu11 Apr 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was there consideration renaming kill-timeout to term-timeout? It just seems a tad confusing. Just beginning to skim the code for review, I got confused quickly.

If it's a backwards compatibility thing, could rename this and variables, but have a backwards compatible check for the config?

Edit: unfortunately we use the term kill for a lot of our signaling, which I suspect is how this originated. Alternate (possibly bad) idea, use sigkill name instead of kill a lot of places could be better?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered that, but it is a kill(2) timeout, and also the job execution system sends SIGTERM only max-term-count times, then switches to SIGKILL, so term-timeout would be a misnomer as well. I agree this is pretty confusing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this PR is a bit too much. We could consider not including the kill-shell-delay and just fix the documentation to report that it is always 5*kill_timeout. We do need to keep max-term-count because it was requested, but adding kill-shell-delay may not be required (not sure if anyone would adjust it)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

job-exec: signal behavior on timeout and cancel doesn't match docs
2 participants