Look into AMQP/RabbitMQ heartbeat #1411


Open

reidsunderland opened this issue Mar 29, 2025 · 4 comments
Labels
efficiency (improve performance and/or reduce overhead), enhancement (New feature or request), ReliabilityRecovery (improve behaviour in failure situations)

Comments

@reidsunderland (Member)

It seems like the AMQP library we are using defaults to setting the heartbeat value to 0.

This means the RabbitMQ broker doesn't detect when a connection has been broken. The sarra client reconnects over a new connection, but the old one seems to get left behind on the server.

https://www.rabbitmq.com/docs/heartbeats

https://docs.celeryq.dev/projects/amqp/en/latest/#:~:text=Support%20for%20heartbeats,Connection.send_heartbeat.
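For reference, a minimal sketch (not sarra code) of where this would be controlled with the py-amqp package; the broker address and credentials below are placeholders:

```python
# Minimal sketch (not sarra code): py-amqp disables heartbeats when
# heartbeat=0 (its default); passing a non-zero value offers it to the
# broker during connection negotiation.
import amqp  # the py-amqp package

conn = amqp.Connection(
    host='broker.example.com:5672',    # placeholder broker
    userid='guest', password='guest',  # placeholder credentials
    heartbeat=300,                     # seconds; 0 = no heartbeats (current behaviour)
)
conn.connect()
print(conn.heartbeat)  # heartbeat interval actually negotiated with the broker
```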


reidsunderland added the efficiency, enhancement, and ReliabilityRecovery labels on Mar 29, 2025
@reidsunderland (Member Author) commented Mar 29, 2025

If we set heartbeat to a low value, it could cause problems during slow downloads. The py-amqp library documentation recommends calling heartbeat_tick "on the order of once per second" https://docs.celeryq.dev/projects/amqp/en/latest/reference/amqp.connection.html#:~:text=heartbeat_tick(rate,timeout
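A rough sketch of the pattern the py-amqp docs seem to describe (assumed names, not our flow code): heartbeat_tick() is called often, but it only sends a frame when one is actually due.

```python
# Rough sketch, assuming py-amqp: call heartbeat_tick() about once per
# second; it is a cheap no-op unless a heartbeat frame is actually due,
# and it raises ConnectionForced when the peer has been silent too long.
import socket
import amqp

conn = amqp.Connection(host='broker.example.com:5672', heartbeat=60)
conn.connect()
channel = conn.channel()
# a real consumer would declare/bind a queue and start consuming here

while True:
    try:
        conn.drain_events(timeout=1)  # wait up to ~1s for broker traffic
    except socket.timeout:
        pass                          # idle second; fall through to the tick
    conn.heartbeat_tick()             # sends a heartbeat only when one is due
```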

@petersilva (Contributor)

I think you already know this, but just clarifying... there are two settings:

heartbeat = None
Final heartbeat interval value (in float seconds) after negotiation

heartbeat_tick(rate=2)[source]
Send heartbeat packets if necessary.

I dunno if we even need a setting. I guess we could try to figure out a value that works for all cases and just hard-code it. On the other hand, if a connection is idle it's still going to be exchanging packets relatively often; I don't know if that's a concern (a thousand connections would mean 1000 packets/sec), which seems like a DoS attack on the broker. I wonder why the recommendation is so frequent... my gut tells me 10 or 100 seconds is a better choice, but often these recommendations come from experience.

@reidsunderland (Member Author)

That's a good point, and I didn't understand why they recommended calling heartbeat_tick every ~1 second.

But I just re-read the documentation and looked at the code and I think it makes sense now. https://docs.celeryq.dev/projects/amqp/en/latest/_modules/amqp/connection.html#Connection.heartbeat_tick

If we want to manage timing of the heartbeat packets, we can use the send_heartbeat() method. And if we want to pass off most of the timing work to the AMQP library, then we call heartbeat_tick() frequently, and it figures out when to send the heartbeat packet based on the negotiated Connection.heartbeat() value (i.e. it doesn't send a heartbeat packet every time we call it).
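As a contrast, a hedged sketch of the manual route mentioned above (the housekeeping hook name is made up): send_heartbeat() emits a frame immediately, so the caller owns the schedule.

```python
# Hedged sketch: the manual alternative. send_heartbeat() writes one
# heartbeat frame right away, so our own scheduler (e.g. housekeeping)
# would have to call it comfortably more often than the negotiated
# Connection.heartbeat interval.
import amqp

conn = amqp.Connection(host='broker.example.com:5672', heartbeat=60)
conn.connect()

def on_housekeeping():
    # hypothetical hook called from our own timer
    conn.send_heartbeat()
```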

RabbitMQ uses a default value of 60 seconds, and RabbitMQ maintained AMQP libraries (Java, .NET and Erlang) negotiate a heartbeat value based on the server's suggestion and the client's preference: https://www.rabbitmq.com/docs/heartbeats#heartbeats-timeout

But it's not super uncommon for file transfer to take >60 seconds, so I think that would be too low as a default unless we re-work the transfer code to allow housekeeping (or another mechanism to trigger the heartbeat) to run in the middle of a download.

Maybe a good start would be calculating the heartbeat interval from the existing timeout setting?
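Something like this, as a hedged sketch (the function and option names are hypothetical, not existing sr3 settings):

```python
# Hedged sketch of deriving the heartbeat from the existing timeout
# setting; names and numbers here are placeholders, not current sr3
# options.
def heartbeat_from_timeout(timeout, minimum=60, factor=2):
    """Offer a heartbeat comfortably larger than the transfer timeout."""
    if not timeout or timeout <= 0:
        return 300                        # fallback when no timeout is set
    return max(minimum, int(timeout * factor))

# e.g. timeout=150 s -> offer heartbeat=300 s to the broker
print(heartbeat_from_timeout(150))        # 300
```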

@petersilva (Contributor)

So, currently the algorithm processes messages a batch at a time. Maybe this means we should put the ack inside the action phase, so that traffic on the message bus occurs between each transfer instead of only after all the transfers in the batch have been attempted.
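A hedged pseudocode sketch of that idea (this is not the actual sr3 flow code; all names are made up):

```python
def process_batch(batch, transfer, consumer, connection):
    """Hedged pseudocode, not the real sr3 flow: ack each message right
    after its transfer instead of after the whole batch, so the AMQP
    connection carries traffic (and heartbeat_tick() can run) between
    transfers rather than only at batch boundaries."""
    for msg in batch:
        ok = transfer(msg)              # hypothetical per-message transfer step
        if ok:
            consumer.ack(msg)           # ack per message, not per batch
        connection.heartbeat_tick()     # keep the heartbeat timer serviced
```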
