On Mon, 2023-06-19 at 12:03 -0300, Roberto Bartzen Acosta wrote:
Hello Neutron folks,
In the Operators feedback session we discussed the OVN heartbeat and the use of "infinity" values for large-scale deployments, because we see a significant infrastructure impact when a short 'agent_down_time' is configured.

agent_down_time is intended to specify how long the heartbeat can be missed before the agent is considered down. It was not intended to control the interval at which the heartbeat is sent. https://opendev.org/openstack/neutron/commit/628442aed7400251f12809a45605bd7... introduced a correlation between the two, but it resulted in the agent incorrectly being considered down, causing port binding failures, if agent_down_time was set too large.
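To illustrate the intended meaning of agent_down_time, here is a minimal sketch of a liveness check. It is not Neutron's actual code: is_agent_down and its arguments are hypothetical names chosen for this example, and the values 75, 30 and 120 are arbitrary.

```python
from datetime import datetime, timedelta, timezone


def is_agent_down(last_heartbeat, agent_down_time, now=None):
    """Hypothetical check: the agent is down only once the last heartbeat
    is older than agent_down_time seconds."""
    now = now or datetime.now(timezone.utc)
    return now - last_heartbeat > timedelta(seconds=agent_down_time)


# Fixed reference time so the example is deterministic.
now = datetime(2023, 6, 19, 15, 3, tzinfo=timezone.utc)
recent = now - timedelta(seconds=30)
stale = now - timedelta(seconds=120)

print(is_agent_down(recent, agent_down_time=75, now=now))
print(is_agent_down(stale, agent_down_time=75, now=now))
```

The point of the sketch is that agent_down_time only sets a tolerance for missed heartbeats; nothing in it decides how often heartbeats are sent.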
The merged patch [1] limited the maximum delay to 10 seconds. I understand the requirement to use random values to avoid load spikes, but why does this fix limit the heartbeat to 10 seconds? What is the goal of the agent_down_time parameter in this case? How will it work for someone who has hundreds of compute nodes / metadata agents?
The change in [1] should just change the delay before _update_chassis is invoked; that at least was the intent. I'm expecting the interval between heartbeats to be rate-limited via the mechanism that was used before https://opendev.org/openstack/neutron/commit/628442aed7400251f12809a45605bd717f494c4e?style=split&whitespace=show-all was implemented, i.e. when a SbGlobalUpdateEvent is generated we are now clamping the max wait to 10 seconds instead of cfg.CONF.agent_down_time // 2, which was causing port binding failures.

The timer object will run the passed-in function after the timer interval has expired (https://docs.python.org/3/library/threading.html#timer-objects), but it will not re-run multiple times, and the function we are invoking does not loop internally, so only one update will happen per invocation of run.

I believe the actual heartbeat/reporting interval is controlled by cfg.CONF.AGENT.report_interval https://github.com/openstack/neutron/blob/cbb89fdb1414a1b3a8e8b3a9a4154ef627... so I think if you want to reduce the interval in a large environment you can do that by setting

[AGENT]
report_interval=<your value>

I'm not that familiar with this code, but that was my original understanding. The sleep before it is re-run is calculated in oslo.service:
https://github.com/openstack/oslo.service/blob/1.38.0/oslo_service/loopingca...
https://github.com/openstack/oslo.service/blob/1.38.0/oslo_service/loopingca...

The neutron core team can correct me if that is incorrect, but I would not expect this to negatively impact large clouds.
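As a rough sketch of the one-shot behaviour described above (not the actual Neutron handler): pick_delay, schedule_update and delay_scale are hypothetical names made up for this example, and the exact bounds of the random delay in the merged patch may differ from what is assumed here.

```python
import random
import threading

MAX_DELAY_CAP = 10  # seconds; the cap described for [1]


def pick_delay(agent_down_time, cap=MAX_DELAY_CAP):
    """Hypothetical helper: random delay in [0, min(agent_down_time // 2, cap)]."""
    return random.randint(0, min(agent_down_time // 2, cap))


def schedule_update(update_fn, agent_down_time, delay_scale=1.0):
    """Run update_fn once after a random, clamped delay.

    delay_scale exists only so the demo below finishes quickly.
    """
    delay = pick_delay(agent_down_time) * delay_scale
    timer = threading.Timer(delay, update_fn)
    timer.start()
    return timer


calls = []
timer = schedule_update(lambda: calls.append("updated"),
                        agent_down_time=3600, delay_scale=0.001)
timer.join()  # Timer is a Thread; wait for the single run to finish
print(calls)
```

threading.Timer fires exactly once, so each event leads to a single delayed update, and with the cap even a very large agent_down_time (3600 here) cannot push that delay past 10 seconds.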
Regards, Roberto
[1] - https://review.opendev.org/c/openstack/neutron/+/883687