
fix: subscriber events deadlock #314


Open · wants to merge 3 commits into main from fix/job-events

Conversation

@michaeladler (Member) commented Apr 14, 2025

Description

Rewrite the subscriber events implementation, since the upstream library appears to be abandoned and buggy.

Reported-by: Felix Moessbauer [email protected]
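For background on the deadlock class this rewrite targets: if a publisher blocks while sending to a slow subscriber's channel and is holding a lock at that moment, every other publisher and subscriber waiting on that lock stalls with it. The following is a minimal, hypothetical sketch of a non-blocking broadcast pattern in Go; it is illustrative only, not the actual wfx implementation, and all names are invented.

```go
package events

import "sync"

// Broker fans job events out to subscribers without ever blocking the
// publisher: a slow subscriber has events dropped instead of stalling
// everyone else behind the mutex.
type Broker struct {
	mu   sync.RWMutex
	subs map[chan string]struct{}
}

func NewBroker() *Broker {
	return &Broker{subs: make(map[chan string]struct{})}
}

// Subscribe returns a buffered event channel and an unsubscribe function.
func (b *Broker) Subscribe() (<-chan string, func()) {
	ch := make(chan string, 16)
	b.mu.Lock()
	b.subs[ch] = struct{}{}
	b.mu.Unlock()
	return ch, func() {
		b.mu.Lock()
		delete(b.subs, ch)
		b.mu.Unlock()
		// the channel is left open; receivers stop via their own context
	}
}

// Publish never blocks: if a subscriber's buffer is full, the event is
// dropped for that subscriber rather than deadlocking the sender.
func (b *Broker) Publish(ev string) {
	b.mu.RLock()
	defer b.mu.RUnlock()
	for ch := range b.subs {
		select {
		case ch <- ev:
		default: // subscriber too slow, drop
		}
	}
}
```

Dropping events for slow subscribers trades completeness for liveness, which is one common way to rule out this kind of deadlock.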

Issues Addressed

List and link all the issues addressed by this PR.

Change Type

Please select the relevant options:

  • Bug fix (non-breaking change that resolves an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Checklist

  • I have read the CONTRIBUTING document.
  • My changes adhere to the established code style, patterns, and best practices.
  • I have added tests that demonstrate the effectiveness of my changes.
  • I have updated the documentation accordingly (if applicable).
  • I have added an entry in the CHANGELOG to document my changes (if applicable).


codecov bot commented Apr 14, 2025

Codecov Report

Attention: Patch coverage is 88.61386% with 23 lines in your changes missing coverage. Please review.

Project coverage is 79.83%. Comparing base (e8519f6) to head (b010e89).
Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/handler/job/events/events.go | 88.46% | 7 Missing and 2 partials ⚠️ |
| middleware/sse/responder.go | 57.14% | 7 Missing and 2 partials ⚠️ |
| api/wfx.go | 50.00% | 3 Missing ⚠️ |
| cmd/wfxctl/cmd/job/events/events.go | 90.00% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #314   +/-   ##
=======================================
  Coverage   79.83%   79.83%           
=======================================
  Files          94       94           
  Lines        4413     4469   +56     
=======================================
+ Hits         3523     3568   +45     
- Misses        645      655   +10     
- Partials      245      246    +1     

☔ View full report in Codecov by Sentry.

@michaeladler force-pushed the fix/job-events branch 5 times, most recently from 9258597 to 074f63e on April 15, 2025 at 11:19
@michaeladler marked this pull request as ready for review on April 15, 2025 at 11:20
@michaeladler requested a review from stormc as a code owner on April 15, 2025 at 11:20
- Addressed an issue where reverse proxy servers drop HTTP connections
  after a short timeout. To mitigate this, added periodic comment lines
  every X seconds to keep the connection alive.
- Rewrote the subscriber events implementation due to upstream being
  abandoned and buggy, which caused issues like deadlocks/hang-ups.

Fixes: 976875f
Reported-by: Felix Moessbauer <[email protected]>
Signed-off-by: Michael Adler <[email protected]>
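To illustrate the keep-alive mechanism described in the commit message: an SSE responder can run a ticker alongside the event channel and emit an SSE comment line (a line starting with `:`, which clients must ignore) whenever the ticker fires. The sketch below is a simplified, self-contained example, not wfx's actual `middleware/sse/responder.go`; the handler, port, and event channel are placeholders.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func sseHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Cache-Control", "no-cache")

	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}

	// hypothetical event source; in wfx this would be the job event subscription
	events := make(chan string)

	ping := time.NewTicker(30 * time.Second) // cf. DefaultPingIntervalSSE
	defer ping.Stop()

	for {
		select {
		case <-r.Context().Done():
			return
		case <-ping.C:
			// SSE comment line: ignored by clients, but it keeps proxies
			// and the kernel from treating the connection as idle
			fmt.Fprint(w, ": keep-alive\n\n")
			flusher.Flush()
		case ev := <-events:
			fmt.Fprintf(w, "data: %s\n\n", ev)
			flusher.Flush()
		}
	}
}

func main() {
	http.HandleFunc("/events", sseHandler)
	_ = http.ListenAndServe(":8080", nil)
}
```

Because the comment line carries no event data, it refreshes idle timers along the path without surfacing anything to consumers.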
@stormc (Collaborator) commented Apr 17, 2025

The new implementation regularly sends a ping/keep-alive over the connection because the kernel side of things cannot be controlled, at least not on clients, and it may close connections on inactivity. Hence, control is exercised at the application level, in the form of regularly sent data, i.e., a ping or keep-alive. This should be documented.

Signed-off-by: Michael Adler <[email protected]>
@@ -66,6 +66,10 @@ Below is a high-level overview of how the communication flow operates:
2. Upon receipt of the request, `wfx` sets the `Content-Type` header to `text/event-stream`.
3. The server then initiates a stream of job events in the response body, allowing clients to receive instant updates.

**Note**: To prevent the connection from being closed due to inactivity (e.g., when no job events occur), periodic keep-alive events are sent during such idle periods.
This ensures the connection remains open, avoiding closure by, e.g., proxies or the kernel.
@stormc (Collaborator) commented May 2, 2025
This is needed because one may not be able to control all involved parties (clients, wfx, and intermediaries). If one could, keep-alive messages would not be needed. Is it worth adding this reasoning?
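For completeness, this is what the keep-alive looks like from the client's side: an SSE reader simply skips comment lines, so the pings never show up as events. A rough Go sketch, not `wfxctl`'s implementation; the URL is a placeholder.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// placeholder URL; the real endpoint is wfx's job events API
	resp, err := http.Get("http://localhost:8080/events")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		switch {
		case strings.HasPrefix(line, ":"):
			// keep-alive comment, ignore
		case strings.HasPrefix(line, "data: "):
			fmt.Println("event:", strings.TrimPrefix(line, "data: "))
		}
	}
}
```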

@@ -13,11 +13,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Added (existing but undocumented) `/health` and `/version` endpoint to OpenAPI spec
- OpenAPI v3 spec is served at `/api/wfx/v1/openapi.json`
- Add build tags to `--version` output
- `wfxctl`: added `--auto-reconnect` flag to `job events` subcommand to automatically reconnect on connection loss.
**Note**: This may result in missed events during the reconnection period if `wfxctl` logs are not carefully monitored.
@stormc (Collaborator) commented May 2, 2025

Hm, shouldn't that be the logs of wfx, by whatever appropriate means? wfxctl loses the connection and hence cannot know what it missed, let alone log it. On the other hand, you get message IDs from which you can infer that there must have been missed log entries...

This may result in a large number of events, though. For a more targeted approach, filter parameters may be used.
**Note**: The `--auto-reconnect` flag should be used with caution, as it may result in missed events after a connection loss.
When this flag is used, `wfxctl` does not terminate upon losing the connection, so its logs should be monitored to detect such occurrences.
After a connection loss, fetching the job's current status and comparing it with the received events can help identify any missed events.
A collaborator commented:

Can help, or does help? I.e., is this the mechanism to detect missed log entries? Also, what exactly should be compared: the job state, progress, ...? What about the SSE IDs, can't they be used?
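Regarding the SSE IDs raised above: the standard SSE pattern, if the server attaches `id:` fields to events, is for the client to remember the last seen ID and send it back in the `Last-Event-ID` request header when reconnecting, so a replay-capable server can resend whatever was missed. Whether wfx implements such replay is not settled in this thread; the sketch below only shows the client-side bookkeeping and is purely illustrative (URL and backoff are placeholders, not `wfxctl` behavior).

```go
package main

import (
	"bufio"
	"log"
	"net/http"
	"strings"
	"time"
)

// stream reads one SSE connection and returns the last event ID it saw.
func stream(url, lastID string) (string, error) {
	req, _ := http.NewRequest(http.MethodGet, url, nil)
	if lastID != "" {
		// offer the last seen ID so a replay-capable server can resume
		req.Header.Set("Last-Event-ID", lastID)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return lastID, err
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "id: ") {
			lastID = strings.TrimPrefix(line, "id: ")
		}
		// data lines would be handled here
	}
	return lastID, scanner.Err()
}

func main() {
	lastID := ""
	for {
		var err error
		lastID, err = stream("http://localhost:8080/events", lastID)
		log.Printf("connection lost (last id %q): %v; reconnecting", lastID, err)
		time.Sleep(2 * time.Second) // simple fixed backoff for the sketch
	}
}
```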

@@ -64,6 +65,8 @@ const (
const (
preferedStorage = "sqlite"
sqliteDefaultOpts = "file:wfx.db?_fk=1&_journal=WAL"

DefaultPingIntervalSSE = 30 * time.Second // should be "short enough", i.e. shorter than the default read timeout of most reverse proxies
A collaborator commented:
... since this time period is shorter than the kernel's default timeout for closing connections due to inactivity.

@@ -74,6 +77,8 @@ func NewFlagset() *pflag.FlagSet {
f.StringSlice(SchemeFlag, []string{"http"}, "the listeners to enable, this can be repeated and defaults to the schemes in the swagger spec")
f.Duration(CleanupTimeoutFlag, 10*time.Second, "grace period for which to wait before killing idle connections")
f.Duration(GracefulTimeoutFlag, 15*time.Second, "grace period for which to wait before shutting down the server")
f.Duration(PingIntervalSSEFlag, DefaultPingIntervalSSE, "interval to send periodic keep-alive messages to prevent server-sent events connections from being closed")
A collaborator commented:
... due to inactivity
