Replies: 4 comments 10 replies
-
6.5 introduced #4675 - just like the go client handling, we forcibly kill the watches after they've been running for between 5 and 10 minutes. Apparently it's still possible for the api server to stop sending events on an active watch without actually closing it. We had one instance of this fixed upstream, but since it seems to happen more often in the wild, we needed to be more defensive.
This particular exception handling is new. We are trying to distinguish between connection-related exceptions and exceptions that are due to other problems, such as protocol violations. The behavior you are seeing here does not match what I saw locally with the JDK client - for example, killing the api server pod would result in onClose being called instead - so there could be some variation by JRE. So unfortunately we'll need to correct the handling in JdkWebSocketImpl for this, and probably will need to be more tolerant of the possibility that the informer is simply crash looping - #5047 (comment)
Yes, you may call the `isRunning()` method, or call `stopped().whenComplete(...)` to install a callback that will include the terminating exception.
I need to correct that - the exceptionHandler does have all terminating exceptions forwarded to it. By default, if it's a non-HTTP-gone WatchException and the informer had already been started, it will terminate.
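A minimal sketch of observing the informer state along those lines, assuming a `SharedIndexInformer` obtained from `runnableInformer(...)`; the class name and logging here are illustrative, not part of any existing code:

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.informers.SharedIndexInformer;

public class InformerStateCheck {
  public static void main(String[] args) {
    try (KubernetesClient client = new KubernetesClientBuilder().build()) {
      SharedIndexInformer<Pod> informer = client.pods().runnableInformer(300_000L);

      // Callback fires once the informer terminates; a non-null throwable is the
      // terminating exception mentioned above.
      informer.stopped().whenComplete((ignored, t) -> {
        if (t != null) {
          System.err.println("Informer stopped exceptionally: " + t);
        } else {
          System.out.println("Informer stopped normally");
        }
      });

      informer.start();

      // Point-in-time check, usable e.g. from a health probe.
      System.out.println("Informer running: " + informer.isRunning());
    }
  }
}
```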
-
It will be important to differentiate the exact case. With 6.5+ a single watch cannot be up for more than 10 minutes. After that time, if you are not getting events, it could be one of:
-
@scholzj can you confirm what is happening when you override the exception handler and see the message "watch failed for v1/namespaces/strimzi/pods, will retry" - does the informer resume?
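For reference, a sketch of that kind of override: a non-default handler that logs every forwarded exception and always retries, so it becomes observable whether the watch resumes. This assumes `ExceptionHandler`'s single method receives a started flag plus the throwable and returns whether to keep retrying:

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.informers.SharedIndexInformer;

public class RetryingHandlerExample {
  public static void main(String[] args) {
    KubernetesClient client = new KubernetesClientBuilder().build();
    SharedIndexInformer<Pod> informer = client.pods()
        .runnableInformer(300_000L)
        // Log the exception forwarded to the handler and ask the informer to keep
        // retrying instead of terminating.
        .exceptionHandler((isStarted, t) -> {
          System.err.println("Informer exception (started=" + isStarted + "): " + t);
          return true;
        });
    informer.start();
  }
}
```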
-
@shawkins I got some logs from another user with 6.6.1. It is attached in https://github.com/orgs/strimzi/discussions/8343#discussioncomment-5975573 (the file
But never for pods - always only for our custom resources. So not sure how much it helps.
-
In Strimzi, we use informers and indexers to watch various Kubernetes resources - such as Pods, Secrets, or custom resources. Based on reports from our users, it seems that for some of them the informers get stuck. When that happens, the informer does not get any new events and the indexer is not updated anymore. While the re-sync events are still raised by the informer, they are not really useful when the informer gives you old information. We seem to be experiencing this with 6.4 and 6.6.1.
The way we use the informer is that we use the `runnableInformer(...)` method, e.g. with `client.pods().runnableInformer(300_000L)`, where `client` is the Kubernetes client instance. We start the informer and wait for it to sync. I also tried to add an exception handler with the `.exceptionHandler()` method to log any possible errors. At least in one environment it seems to log something like this (the first message is from the `.exceptionHandler(...)` method, the second is from Fabric8 itself). But I'm not sure how this is related, as this should be recoverable.

What might also be notable: we use the JDK Java client.
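For context, a simplified, self-contained version of that usage pattern (not the actual Strimzi code; the event handler and the one-minute sync timeout are illustrative):

```java
import java.util.concurrent.TimeUnit;

import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.informers.ResourceEventHandler;
import io.fabric8.kubernetes.client.informers.SharedIndexInformer;

public class PodInformerUsage {
  public static void main(String[] args) throws Exception {
    KubernetesClient client = new KubernetesClientBuilder().build();

    // Runnable informer with a 5-minute resync period, as described above.
    SharedIndexInformer<Pod> informer = client.pods()
        .runnableInformer(300_000L)
        .addEventHandler(new ResourceEventHandler<Pod>() {
          @Override
          public void onAdd(Pod pod) {
            System.out.println("ADD " + pod.getMetadata().getName());
          }

          @Override
          public void onUpdate(Pod oldPod, Pod newPod) {
            System.out.println("UPDATE " + newPod.getMetadata().getName());
          }

          @Override
          public void onDelete(Pod pod, boolean deletedFinalStateUnknown) {
            System.out.println("DELETE " + pod.getMetadata().getName());
          }
        });

    // Start and block until the initial list/watch has synced.
    informer.start().toCompletableFuture().get(1, TimeUnit.MINUTES);

    // The indexer is the local cache that stops being updated when the informer gets stuck.
    System.out.println("Synced, cached pods: " + informer.getIndexer().list().size());
  }
}
```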
I'm not really sure what is causing this - whether it is environment-related, something in our code, etc. And I never managed to see it in any of my environments. Is there something you can recommend to debug this?
Thanks a lot for any possible suggestions.