-
Notifications
You must be signed in to change notification settings - Fork 7
fix: subscriber events deadlock #314
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #314 +/- ##
=======================================
Coverage 79.83% 79.83%
=======================================
Files 94 94
Lines 4413 4469 +56
=======================================
+ Hits 3523 3568 +45
- Misses 645 655 +10
- Partials 245 246 +1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
9258597
to
074f63e
Compare
074f63e
to
8f16e4a
Compare
- Addressed an issue where reverse proxy servers drop HTTP connections after a short timeout. To mitigate this, added periodic comment lines every X seconds to keep the connection alive. - Rewrote the subscriber events implementation due to upstream being abandoned and buggy, which caused issues like deadlocks/hang-ups. Fixes: 976875f Reported-by: Felix Moessbauer <[email protected]> Signed-off-by: Michael Adler <[email protected]>
Signed-off-by: Michael Adler <[email protected]>
The new implementation regularly sends ping/keep-alive over the connection as you cannot control the kernel side of things, at least not on clients, that may close channels on inactivity. Hence, control is exercised at the application level, namely in form of regular data sent, i.e., ping or keep-alive. This should be documented. |
Signed-off-by: Michael Adler <[email protected]>
@@ -66,6 +66,10 @@ Below is a high-level overview of how the communication flow operates: | |||
2. Upon receipt of the request, `wfx` sets the `Content-Type` header to `text/event-stream`. | |||
3. The server then initiates a stream of job events in the response body, allowing clients to receive instant updates. | |||
|
|||
**Note**: To prevent the connection from being closed due to inactivity (e.g., when no job events occur), periodic keep-alive events are sent during such idle periods. | |||
This ensures the connection remains open, avoiding closure by, e.g., proxies or the kernel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is due since one may not be able to control all involved parties (clients, wfx, and intermediates). If this was the case, then keep alive messages would be not needed. Is it worth it to add this reasoning?
@@ -13,11 +13,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 | |||
- Added (existing but undocumented) `/health` and `/version` endpoint to OpenAPI spec | |||
- OpenAPI v3 spec is served at `/api/wfx/v1/openapi.json` | |||
- Add build tags to `--version` output | |||
- `wfxctl`: added `--auto-reconnect` flag to `job events` subcommand to automatically reconnect on connection loss. | |||
**Note**: This may result in missed events during the reconnection period if `wfxctl` logs are not carefully monitored. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, shouldn't that be the logs of wfx
by whatever appropriate means? wfxctl
looses connection and hence cannot know what it missed, in particular not log that? On the other hand, you get message IDs by which you can infer that there must've been missed log entries...
This may result in a large number of events, though. For a more targeted approach, filter parameters may be used. | ||
**Note**: The `--auto-reconnect` flag should be used with caution, as it may result in missed events after a connection loss. | ||
When this flag is used, `wfxctl` does not terminate upon losing the connection, so its logs should be monitored to detect such occurrences. | ||
After a connection loss, fetching the job's current status and comparing it with the received events can help identify any missed events. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can help or does help, i.e., this is the mechanism to detect missed log entries? Also, what should be compared? The job state, progress, ...? What about the SSE IDs? Can't they be used?
@@ -64,6 +65,8 @@ const ( | |||
const ( | |||
preferedStorage = "sqlite" | |||
sqliteDefaultOpts = "file:wfx.db?_fk=1&_journal=WAL" | |||
|
|||
DefaultPingIntervalSSE = 30 * time.Second // should be "short enough", i.e. shorter than the default read timeout of most reverse proxies |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... since this time period is shorter than the one for closing connections due to inactivity by the kernel in its default setting.
@@ -74,6 +77,8 @@ func NewFlagset() *pflag.FlagSet { | |||
f.StringSlice(SchemeFlag, []string{"http"}, "the listeners to enable, this can be repeated and defaults to the schemes in the swagger spec") | |||
f.Duration(CleanupTimeoutFlag, 10*time.Second, "grace period for which to wait before killing idle connections") | |||
f.Duration(GracefulTimeoutFlag, 15*time.Second, "grace period for which to wait before shutting down the server") | |||
f.Duration(PingIntervalSSEFlag, DefaultPingIntervalSSE, "interval to send periodic keep-alive messages to prevent server-sent events connections from being closed") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... due to inactivity
Description
Rewrite subscriber events implementation since upstream seems to be abandoned and buggy.
Reported-by: Felix Moessbauer [email protected]
Issues Addressed
List and link all the issues addressed by this PR.
Change Type
Please select the relevant options:
Checklist