
fix: subscriber events deadlock #314


Open · wants to merge 3 commits into main from fix/job-events

Conversation

@michaeladler (Member) commented Apr 14, 2025

Description

Rewrite the subscriber events implementation, since the upstream library appears to be abandoned and buggy.

Reported-by: Felix Moessbauer [email protected]
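For background on the deadlock class this rewrite targets: if a publisher blocks while sending to a slow subscriber's channel and is holding a lock at that moment, every other publisher and subscriber waiting on that lock stalls with it. The following is a minimal, hypothetical sketch of a non-blocking broadcast pattern in Go; it is illustrative only, not the actual wfx implementation, and all names are invented.

```go
package events

import "sync"

// Broker fans job events out to subscribers without ever blocking the
// publisher: a slow subscriber has events dropped instead of stalling
// everyone else behind the mutex.
type Broker struct {
	mu   sync.RWMutex
	subs map[chan string]struct{}
}

func NewBroker() *Broker {
	return &Broker{subs: make(map[chan string]struct{})}
}

// Subscribe returns a buffered event channel and an unsubscribe function.
func (b *Broker) Subscribe() (<-chan string, func()) {
	ch := make(chan string, 16)
	b.mu.Lock()
	b.subs[ch] = struct{}{}
	b.mu.Unlock()
	return ch, func() {
		b.mu.Lock()
		delete(b.subs, ch)
		b.mu.Unlock()
		// the channel is left open; receivers stop via their own context
	}
}

// Publish never blocks: if a subscriber's buffer is full, the event is
// dropped for that subscriber rather than deadlocking the sender.
func (b *Broker) Publish(ev string) {
	b.mu.RLock()
	defer b.mu.RUnlock()
	for ch := range b.subs {
		select {
		case ch <- ev:
		default: // subscriber too slow, drop
		}
	}
}
```

Dropping events for slow subscribers trades completeness for liveness, which is one common way to rule out this kind of deadlock.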

Issues Addressed

List and link all the issues addressed by this PR.

Change Type

Please select the relevant options:

  • Bug fix (non-breaking change that resolves an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Checklist

  • I have read the CONTRIBUTING document.
  • My changes adhere to the established code style, patterns, and best practices.
  • I have added tests that demonstrate the effectiveness of my changes.
  • I have updated the documentation accordingly (if applicable).
  • I have added an entry in the CHANGELOG to document my changes (if applicable).


codecov bot commented Apr 14, 2025

Codecov Report

Attention: Patch coverage is 88.61386% with 23 lines in your changes missing coverage. Please review.

Project coverage is 79.83%. Comparing base (e8519f6) to head (b010e89).
Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/handler/job/events/events.go | 88.46% | 7 Missing and 2 partials ⚠️ |
| middleware/sse/responder.go | 57.14% | 7 Missing and 2 partials ⚠️ |
| api/wfx.go | 50.00% | 3 Missing ⚠️ |
| cmd/wfxctl/cmd/job/events/events.go | 90.00% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #314   +/-   ##
=======================================
  Coverage   79.83%   79.83%           
=======================================
  Files          94       94           
  Lines        4413     4469   +56     
=======================================
+ Hits         3523     3568   +45     
- Misses        645      655   +10     
- Partials      245      246    +1     

☔ View full report in Codecov by Sentry.

@michaeladler force-pushed the fix/job-events branch 5 times, most recently from 9258597 to 074f63e on April 15, 2025 at 11:19
@michaeladler marked this pull request as ready for review on April 15, 2025 at 11:20
@michaeladler requested a review from stormc as a code owner on April 15, 2025 at 11:20
- Addressed an issue where reverse proxy servers drop HTTP connections
  after a short timeout. To mitigate this, added periodic comment lines
  every X seconds to keep the connection alive.
- Rewrote the subscriber events implementation due to upstream being
  abandoned and buggy, which caused issues like deadlocks/hang-ups.

Fixes: 976875f
Reported-by: Felix Moessbauer <[email protected]>
Signed-off-by: Michael Adler <[email protected]>
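To illustrate the keep-alive mechanism described in the commit message: an SSE responder can run a ticker alongside the event channel and emit an SSE comment line (a line starting with `:`, which clients must ignore) whenever the ticker fires. The sketch below is a simplified, self-contained example, not wfx's actual `middleware/sse/responder.go`; the handler, port, and event channel are placeholders.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func sseHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Cache-Control", "no-cache")

	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}

	// hypothetical event source; in wfx this would be the job event subscription
	events := make(chan string)

	ping := time.NewTicker(30 * time.Second) // cf. DefaultPingIntervalSSE
	defer ping.Stop()

	for {
		select {
		case <-r.Context().Done():
			return
		case <-ping.C:
			// SSE comment line: ignored by clients, but it keeps proxies
			// and the kernel from treating the connection as idle
			fmt.Fprint(w, ": keep-alive\n\n")
			flusher.Flush()
		case ev := <-events:
			fmt.Fprintf(w, "data: %s\n\n", ev)
			flusher.Flush()
		}
	}
}

func main() {
	http.HandleFunc("/events", sseHandler)
	_ = http.ListenAndServe(":8080", nil)
}
```

Because the comment line carries no event data, it refreshes idle timers along the path without surfacing anything to consumers.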
@stormc (Collaborator) commented Apr 17, 2025

The new implementation regularly sends a ping/keep-alive over the connection because the kernel side of things cannot be controlled, at least not on clients, and it may close connections on inactivity. Hence, control is exercised at the application level, in the form of regularly sent data, i.e., a ping or keep-alive. This should be documented.

Signed-off-by: Michael Adler <[email protected]>
@@ -66,6 +66,10 @@ Below is a high-level overview of how the communication flow operates:
2. Upon receipt of the request, `wfx` sets the `Content-Type` header to `text/event-stream`.
3. The server then initiates a stream of job events in the response body, allowing clients to receive instant updates.

**Note**: To prevent the connection from being closed due to inactivity (e.g., when no job events occur), periodic keep-alive events are sent during such idle periods.
This ensures the connection remains open, avoiding closure by, e.g., proxies or the kernel.
@stormc (Collaborator) commented May 2, 2025
This is needed because one may not be able to control all involved parties (clients, wfx, and intermediaries). If one could, keep-alive messages would not be needed. Is it worth adding this reasoning?
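For completeness, this is what the keep-alive looks like from the client's side: an SSE reader simply skips comment lines, so the pings never show up as events. A rough Go sketch, not `wfxctl`'s implementation; the URL is a placeholder.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// placeholder URL; the real endpoint is wfx's job events API
	resp, err := http.Get("http://localhost:8080/events")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		switch {
		case strings.HasPrefix(line, ":"):
			// keep-alive comment, ignore
		case strings.HasPrefix(line, "data: "):
			fmt.Println("event:", strings.TrimPrefix(line, "data: "))
		}
	}
}
```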

@@ -13,11 +13,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Added (existing but undocumented) `/health` and `/version` endpoint to OpenAPI spec
- OpenAPI v3 spec is served at `/api/wfx/v1/openapi.json`
- Add build tags to `--version` output
- `wfxctl`: added `--auto-reconnect` flag to `job events` subcommand to automatically reconnect on connection loss.
**Note**: This may result in missed events during the reconnection period if `wfxctl` logs are not carefully monitored.
@stormc (Collaborator) commented May 2, 2025

Hm, shouldn't that be the logs of wfx, by whatever appropriate means? wfxctl loses the connection and hence cannot know what it missed, let alone log it. On the other hand, you get message IDs from which you can infer that there must have been missed log entries...

This may result in a large number of events, though. For a more targeted approach, filter parameters may be used.
**Note**: The `--auto-reconnect` flag should be used with caution, as it may result in missed events after a connection loss.
When this flag is used, `wfxctl` does not terminate upon losing the connection, so its logs should be monitored to detect such occurrences.
After a connection loss, fetching the job's current status and comparing it with the received events can help identify any missed events.
A collaborator commented:

Can help, or does help? I.e., is this the mechanism to detect missed log entries? Also, what exactly should be compared: the job state, progress, ...? What about the SSE IDs, can't they be used?
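Regarding the SSE IDs raised above: the standard SSE pattern, if the server attaches `id:` fields to events, is for the client to remember the last seen ID and send it back in the `Last-Event-ID` request header when reconnecting, so a replay-capable server can resend whatever was missed. Whether wfx implements such replay is not settled in this thread; the sketch below only shows the client-side bookkeeping and is purely illustrative (URL and backoff are placeholders, not `wfxctl` behavior).

```go
package main

import (
	"bufio"
	"log"
	"net/http"
	"strings"
	"time"
)

// stream reads one SSE connection and returns the last event ID it saw.
func stream(url, lastID string) (string, error) {
	req, _ := http.NewRequest(http.MethodGet, url, nil)
	if lastID != "" {
		// offer the last seen ID so a replay-capable server can resume
		req.Header.Set("Last-Event-ID", lastID)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return lastID, err
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "id: ") {
			lastID = strings.TrimPrefix(line, "id: ")
		}
		// data lines would be handled here
	}
	return lastID, scanner.Err()
}

func main() {
	lastID := ""
	for {
		var err error
		lastID, err = stream("http://localhost:8080/events", lastID)
		log.Printf("connection lost (last id %q): %v; reconnecting", lastID, err)
		time.Sleep(2 * time.Second) // simple fixed backoff for the sketch
	}
}
```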

@@ -64,6 +65,8 @@ const (
const (
preferedStorage = "sqlite"
sqliteDefaultOpts = "file:wfx.db?_fk=1&_journal=WAL"

DefaultPingIntervalSSE = 30 * time.Second // should be "short enough", i.e. shorter than the default read timeout of most reverse proxies
A collaborator commented:
... since this time period is shorter than the kernel's default timeout for closing connections due to inactivity.

@@ -74,6 +77,8 @@ func NewFlagset() *pflag.FlagSet {
f.StringSlice(SchemeFlag, []string{"http"}, "the listeners to enable, this can be repeated and defaults to the schemes in the swagger spec")
f.Duration(CleanupTimeoutFlag, 10*time.Second, "grace period for which to wait before killing idle connections")
f.Duration(GracefulTimeoutFlag, 15*time.Second, "grace period for which to wait before shutting down the server")
f.Duration(PingIntervalSSEFlag, DefaultPingIntervalSSE, "interval to send periodic keep-alive messages to prevent server-sent events connections from being closed")
A collaborator commented:
... due to inactivity
