Look into AMQP/RabbitMQ heartbeat #1411


Open

reidsunderland opened this issue Mar 29, 2025 · 4 comments
Labels
efficiency (improve performance and/or reduce overhead), enhancement (New feature or request), ReliabilityRecovery (improve behaviour in failure situations)

Comments

@reidsunderland (Member)

It seems like the AMQP library we are using defaults to setting the heartbeat value to 0.

This means the RabbitMQ broker doesn't detect when a connection has been broken. The sarra client reconnects over a new connection, but the old one seems to get left behind on the server.

https://www.rabbitmq.com/docs/heartbeats

https://docs.celeryq.dev/projects/amqp/en/latest/#:~:text=Support%20for%20heartbeats,Connection.send_heartbeat.
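For reference, a minimal sketch (not sarra code) of where this would be controlled with the py-amqp package; the broker address and credentials below are placeholders:

```python
# Minimal sketch (not sarra code): py-amqp disables heartbeats when
# heartbeat=0 (its default); passing a non-zero value offers it to the
# broker during connection negotiation.
import amqp  # the py-amqp package

conn = amqp.Connection(
    host='broker.example.com:5672',    # placeholder broker
    userid='guest', password='guest',  # placeholder credentials
    heartbeat=300,                     # seconds; 0 = no heartbeats (current behaviour)
)
conn.connect()
print(conn.heartbeat)  # heartbeat interval actually negotiated with the broker
```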


reidsunderland added the efficiency, enhancement, and ReliabilityRecovery labels on Mar 29, 2025
@reidsunderland (Member Author) commented Mar 29, 2025

If we set heartbeat to a low value, it could cause problems during slow downloads. The py-amqp library documentation recommends calling heartbeat_tick "on the order of once per second" https://docs.celeryq.dev/projects/amqp/en/latest/reference/amqp.connection.html#:~:text=heartbeat_tick(rate,timeout
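A rough sketch of the pattern the py-amqp docs seem to describe (assumed names, not our flow code): heartbeat_tick() is called often, but it only sends a frame when one is actually due.

```python
# Rough sketch, assuming py-amqp: call heartbeat_tick() about once per
# second; it is a cheap no-op unless a heartbeat frame is actually due,
# and it raises ConnectionForced when the peer has been silent too long.
import socket
import amqp

conn = amqp.Connection(host='broker.example.com:5672', heartbeat=60)
conn.connect()
channel = conn.channel()
# a real consumer would declare/bind a queue and start consuming here

while True:
    try:
        conn.drain_events(timeout=1)  # wait up to ~1s for broker traffic
    except socket.timeout:
        pass                          # idle second; fall through to the tick
    conn.heartbeat_tick()             # sends a heartbeat only when one is due
```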

@petersilva (Contributor)

I think you already know this, but just clarifying... there are two settings:

heartbeat = None
Final heartbeat interval value (in float seconds) after negotiation

heartbeat_tick(rate=2)[source]
Send heartbeat packets if necessary.

I dunno if we even need a setting. I guess we could try to figure out a value that works for all cases and just hard-code it. On the other hand, if a connection is idle it's still going to be exchanging packets relatively often; I don't know if that's a concern (a thousand connections would mean 1000 packets/sec), which seems like a DoS attack on the broker. I wonder why the recommendation is so frequent... my gut tells me 10 or 100 seconds is a better choice, but often these recommendations come from experience.

@reidsunderland (Member Author)

That's a good point, and I didn't understand why they recommended calling heartbeat_tick every ~1 second.

But I just re-read the documentation and looked at the code and I think it makes sense now. https://docs.celeryq.dev/projects/amqp/en/latest/_modules/amqp/connection.html#Connection.heartbeat_tick

If we want to manage timing of the heartbeat packets, we can use the send_heartbeat() method. And if we want to pass off most of the timing work to the AMQP library, then we call heartbeat_tick() frequently, and it figures out when to send the heartbeat packet based on the negotiated Connection.heartbeat() value (i.e. it doesn't send a heartbeat packet every time we call it).
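As a contrast, a hedged sketch of the manual route mentioned above (the housekeeping hook name is made up): send_heartbeat() emits a frame immediately, so the caller owns the schedule.

```python
# Hedged sketch: the manual alternative. send_heartbeat() writes one
# heartbeat frame right away, so our own scheduler (e.g. housekeeping)
# would have to call it comfortably more often than the negotiated
# Connection.heartbeat interval.
import amqp

conn = amqp.Connection(host='broker.example.com:5672', heartbeat=60)
conn.connect()

def on_housekeeping():
    # hypothetical hook called from our own timer
    conn.send_heartbeat()
```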

RabbitMQ uses a default value of 60 seconds, and RabbitMQ maintained AMQP libraries (Java, .NET and Erlang) negotiate a heartbeat value based on the server's suggestion and the client's preference: https://www.rabbitmq.com/docs/heartbeats#heartbeats-timeout

But it's not super uncommon for file transfer to take >60 seconds, so I think that would be too low as a default unless we re-work the transfer code to allow housekeeping (or another mechanism to trigger the heartbeat) to run in the middle of a download.

Maybe a good start would be calculating the heartbeat interval from the existing timeout setting?
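Something like this, as a hedged sketch (the function and option names are hypothetical, not existing sr3 settings):

```python
# Hedged sketch of deriving the heartbeat from the existing timeout
# setting; names and numbers here are placeholders, not current sr3
# options.
def heartbeat_from_timeout(timeout, minimum=60, factor=2):
    """Offer a heartbeat comfortably larger than the transfer timeout."""
    if not timeout or timeout <= 0:
        return 300                        # fallback when no timeout is set
    return max(minimum, int(timeout * factor))

# e.g. timeout=150 s -> offer heartbeat=300 s to the broker
print(heartbeat_from_timeout(150))        # 300
```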

@petersilva (Contributor)

So, currently the algorithm processes messages a batch at a time. Maybe this means we should put the ack inside the action phase, so that traffic on the message bus occurs between each transfer instead of only after all the transfers in the batch have been attempted.
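A hedged pseudocode sketch of that idea (this is not the actual sr3 flow code; all names are made up):

```python
def process_batch(batch, transfer, consumer, connection):
    """Hedged pseudocode, not the real sr3 flow: ack each message right
    after its transfer instead of after the whole batch, so the AMQP
    connection carries traffic (and heartbeat_tick() can run) between
    transfers rather than only at batch boundaries."""
    for msg in batch:
        ok = transfer(msg)              # hypothetical per-message transfer step
        if ok:
            consumer.ack(msg)           # ack per message, not per batch
        connection.heartbeat_tick()     # keep the heartbeat timer serviced
```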
