Replies: 4 comments 10 replies
-
6.5 introduced #4675 - just like the go client handling, we forcibly kill the watches after they've been running for between 5 and 10 minutes. Apparently it's still possible for the api server to stop sending events on an active watch without actually closing it. We had one instance of this fixed upstream, but since it seems to happen more often in the wild, we needed to be more defensive.
This particular exception handling is new. We are trying to distinguish between connection-related exceptions and exceptions that are due to other problems, such as protocol violations. The behavior you are seeing here does not match what I saw locally with the JDK client - for example, killing the api server pod would result in onClose being called instead - so there could be some variation by JRE. So unfortunately we'll need to correct the handling in JdkWebSocketImpl for this, and probably will need to be more tolerant of the possibility that the informer is simply crash looping - #5047 (comment)
Yes, you may call the `isRunning()` method, or call `stopped().whenComplete(...)` to install a callback that will include the terminating exception.
I need to correct that - the exceptionHandler does have all terminating exceptions forwarded to it. By default, if it's a non-HTTP-gone WatchException and the informer had already been started, it will terminate.
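A minimal sketch of observing the informer state along those lines, assuming a `SharedIndexInformer` obtained from `runnableInformer(...)`; the class name and logging here are illustrative, not part of any existing code:

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.informers.SharedIndexInformer;

public class InformerStateCheck {
  public static void main(String[] args) {
    try (KubernetesClient client = new KubernetesClientBuilder().build()) {
      SharedIndexInformer<Pod> informer = client.pods().runnableInformer(300_000L);

      // Callback fires once the informer terminates; a non-null throwable is the
      // terminating exception mentioned above.
      informer.stopped().whenComplete((ignored, t) -> {
        if (t != null) {
          System.err.println("Informer stopped exceptionally: " + t);
        } else {
          System.out.println("Informer stopped normally");
        }
      });

      informer.start();

      // Point-in-time check, usable e.g. from a health probe.
      System.out.println("Informer running: " + informer.isRunning());
    }
  }
}
```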
-
It will be important to differentiate the exact case. With 6.5+ a single watch cannot be up for more than 10 minutes. After that time, if you are not getting events, it could be one of:
-
@scholzj can you confirm what is happening when you override the exception handler and see the message "watch failed for v1/namespaces/strimzi/pods, will retry" - does the informer resume?
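For reference, a sketch of that kind of override: a non-default handler that logs every forwarded exception and always retries, so it becomes observable whether the watch resumes. This assumes `ExceptionHandler`'s single method receives a started flag plus the throwable and returns whether to keep retrying:

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.informers.SharedIndexInformer;

public class RetryingHandlerExample {
  public static void main(String[] args) {
    KubernetesClient client = new KubernetesClientBuilder().build();
    SharedIndexInformer<Pod> informer = client.pods()
        .runnableInformer(300_000L)
        // Log the exception forwarded to the handler and ask the informer to keep
        // retrying instead of terminating.
        .exceptionHandler((isStarted, t) -> {
          System.err.println("Informer exception (started=" + isStarted + "): " + t);
          return true;
        });
    informer.start();
  }
}
```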
-
@shawkins I got some logs from another user with 6.6.1. It is attached in https://github.com/orgs/strimzi/discussions/8343#discussioncomment-5975573 (the file
But never for pods - always only for our custom resources. So not sure how much it helps.
-
In Strimzi, we use informers and indexers to watch various Kubernetes resources - such as Pods, Secrets, or custom resources. Based on reports from our users, it seems that for some of them the informers get stuck. When that happens, the informer does not get any new events and the indexer is not updated anymore. While the re-sync events are still raised by the informer, they are not really useful when the informer gives you old information. We seem to be experiencing this with 6.4 and 6.6.1.
The way we use the informer is that we use the `runnableInformer(...)` method, e.g. with `client.pods().runnableInformer(300_000L)`, where `client` is the Kubernetes client instance. We start the informer and wait for it to sync. I also tried to add an exception handler with the `.exceptionHandler()` method to log any possible errors. At least in one environment it seems to log something like this (the first message is from the `.exceptionHandler(...)` method, the second is from Fabric8 itself). But I'm not sure how this is related, as this should be recoverable.

What might also be notable: we use the JDK Java client.
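For context, a simplified, self-contained version of that usage pattern (not the actual Strimzi code; the event handler and the one-minute sync timeout are illustrative):

```java
import java.util.concurrent.TimeUnit;

import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.informers.ResourceEventHandler;
import io.fabric8.kubernetes.client.informers.SharedIndexInformer;

public class PodInformerUsage {
  public static void main(String[] args) throws Exception {
    KubernetesClient client = new KubernetesClientBuilder().build();

    // Runnable informer with a 5-minute resync period, as described above.
    SharedIndexInformer<Pod> informer = client.pods()
        .runnableInformer(300_000L)
        .addEventHandler(new ResourceEventHandler<Pod>() {
          @Override
          public void onAdd(Pod pod) {
            System.out.println("ADD " + pod.getMetadata().getName());
          }

          @Override
          public void onUpdate(Pod oldPod, Pod newPod) {
            System.out.println("UPDATE " + newPod.getMetadata().getName());
          }

          @Override
          public void onDelete(Pod pod, boolean deletedFinalStateUnknown) {
            System.out.println("DELETE " + pod.getMetadata().getName());
          }
        });

    // Start and block until the initial list/watch has synced.
    informer.start().toCompletableFuture().get(1, TimeUnit.MINUTES);

    // The indexer is the local cache that stops being updated when the informer gets stuck.
    System.out.println("Synced, cached pods: " + informer.getIndexer().list().size());
  }
}
```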
I'm not really sure what is causing this - whether it is environment-related, something in our code, etc. And I never managed to see it in any of my environments. Is there something you can recommend to debug this?
Thanks a lot for any possible suggestions.