[neutron] - OVN heartbeat - short agent_down_time

smooney at redhat.com smooney at redhat.com
Mon Jun 19 17:01:44 UTC 2023


On Mon, 2023-06-19 at 12:03 -0300, Roberto Bartzen Acosta wrote:
> Hello Neutron folks,
> 
> We discussed in the Operators feedback session about OVN heartbeat and the
> use of "infinity" values for large-scale deployments because we have a
> significant infrastructure impact when a short 'agent_down_time' is
> configured.
agent_down_time is intended to specify how long heartbeats can be missed before
the agent is considered down. It was not intended to control the interval at which the heartbeat
is sent.

https://opendev.org/openstack/neutron/commit/628442aed7400251f12809a45605bd717f494c4e
introduced a correlation between the two, but it resulted in the agent incorrectly being considered down
and causing port binding failures if agent_down_time was set too large.
> 
> The merged patch [1] limited the maximum delay to 10 seconds. I understand
> the requirement to use random values to avoid load spikes, but why does
> this fix limit the heartbeat to 10 seconds? What is the goal of the
> agent_down_time parameter in this case? How will it work for someone who
> has hundreds of compute nodes / metadata agents?
The change in [1] should only change the delay before _update_chassis is invoked;
that at least was the intent. I expect the interval between heartbeats to be rate-limited
via the mechanism that was used before
https://opendev.org/openstack/neutron/commit/628442aed7400251f12809a45605bd717f494c4e?style=split&whitespace=show-all
was implemented.

i.e. when an SbGlobalUpdateEvent is generated, we now clamp the max wait to 10 seconds instead of
cfg.CONF.agent_down_time // 2, which was causing port binding failures.
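
A minimal sketch of that clamping, not the actual neutron code. The function name, the constant, and the use of min() to apply the cap are assumptions made for illustration; only the 10 second cap and the agent_down_time // 2 bound come from the discussion above.

```python
import random

# Assumption: the cap from [1] expressed as a constant, in seconds.
MAX_DELAY_CAP = 10


def pick_update_delay(agent_down_time, cap=MAX_DELAY_CAP):
    """Pick a random delay (seconds) before a single chassis update.

    Hypothetical helper: the upper bound is agent_down_time // 2,
    clamped so that it never exceeds `cap`.
    """
    upper = min(agent_down_time // 2, cap)
    # randint is inclusive on both ends, so the result is in [0, upper].
    return random.randint(0, upper)


if __name__ == "__main__":
    # With a very large agent_down_time the delay still stays within the cap.
    print(pick_update_delay(3600))
```

With this shape, raising agent_down_time no longer stretches the delay before the update, which is the failure mode described above.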

The Timer object runs the passed-in function once the timer interval has expired.

https://docs.python.org/3/library/threading.html#timer-objects

but it does not re-run, and the function we are invoking does not loop internally,
so only one update happens per invocation of run.
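
The one-shot behaviour of threading.Timer can be checked with a small standalone sketch. The callback below is a hypothetical stand-in for the real update function, and the delays are shortened so the example finishes quickly.

```python
import threading
import time


def count_timer_runs(delay):
    """Start a threading.Timer and report how many times its callback ran."""
    calls = []

    def update_chassis():
        # Hypothetical stand-in for the real callback: one update, no loop.
        calls.append(time.monotonic())

    timer = threading.Timer(delay, update_chassis)
    timer.start()
    timer.join()           # Timer is a Thread subclass; wait for its single run
    time.sleep(delay * 3)  # leave room for a second run, which never comes
    return len(calls)


if __name__ == "__main__":
    print(count_timer_runs(0.05))
```

Because the callback neither loops nor re-arms the timer, any repetition has to come from whatever schedules the next event.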

I believe the actual heartbeat/reporting interval is controlled by cfg.CONF.AGENT.report_interval

https://github.com/openstack/neutron/blob/cbb89fdb1414a1b3a8e8b3a9a4154ef627bb9d1a/neutron/agent/metadata/agent.py#L313-L317

So I think if you want to change the reporting interval in a large environment, you can do that by setting

[AGENT]
report_interval=<your value>
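
When picking a value, it may help to check it against agent_down_time, since the neutron option help says agent_down_time should be at least twice report_interval. The sketch below is only illustrative: it uses Python's configparser rather than oslo.config, the numbers are made up, and it assumes both options sit in one file (in practice the server and agent configs may be separate).

```python
import configparser

# Illustrative values only, not recommendations.
SAMPLE = """
[DEFAULT]
agent_down_time = 120

[AGENT]
report_interval = 60
"""


def check_intervals(ini_text):
    """Return True if agent_down_time is at least twice report_interval."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    agent_down_time = parser.getint("DEFAULT", "agent_down_time")
    report_interval = parser.getint("AGENT", "report_interval")
    return agent_down_time >= 2 * report_interval


if __name__ == "__main__":
    print(check_intervals(SAMPLE))
```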

I'm not that familiar with this code, but that was my original understanding.
The sleep before it is re-run is calculated in oslo.service:
https://github.com/openstack/oslo.service/blob/1.38.0/oslo_service/loopingcall.py#L184-L194
https://github.com/openstack/oslo.service/blob/1.38.0/oslo_service/loopingcall.py#L154-L159
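
As a rough illustration of how a fixed-interval loop typically computes that sleep (a simplified sketch, not the oslo.service code itself; the function name and signature here are assumptions):

```python
def idle_for(interval, elapsed):
    """Seconds to sleep before the next run of a fixed-interval loop.

    If the last run took less than `interval`, sleep for the remainder;
    if it overran, start the next run immediately.
    """
    return max(interval - elapsed, 0.0)


if __name__ == "__main__":
    # A report that took 2.5s with a 30s interval leaves 27.5s of sleep.
    print(idle_for(30, 2.5))
    # A run that outlasted the interval gets no extra sleep.
    print(idle_for(30, 45))
```

Under this model the time between reports stays close to report_interval regardless of agent_down_time.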

The neutron core team can correct me if that is incorrect, but I would not expect this to negatively impact large clouds.

> 
> Regards,
> Roberto
> 
> [1] - https://review.opendev.org/c/openstack/neutron/+/883687
> 



