[neutron] - OVN heartbeat - short agent_down_time

smooney at redhat.com
Mon Jun 19 18:45:31 UTC 2023


On Mon, 2023-06-19 at 14:58 -0300, Roberto Bartzen Acosta wrote:
> Thanks for your feedback Sean.
> 
> On Mon, Jun 19, 2023 at 14:01, <smooney at redhat.com> wrote:
> 
> > On Mon, 2023-06-19 at 12:03 -0300, Roberto Bartzen Acosta wrote:
> > > Hello Neutron folks,
> > > 
> > > We discussed, in the Operators feedback session, the OVN heartbeat and the
> > > use of "infinity" values for large-scale deployments, because we see a
> > > significant infrastructure impact when a short 'agent_down_time' is
> > > configured.
> > agent_down_time is intended to specify how long the heartbeat can be
> > missed before the agent is considered down. It was not intended to
> > control the interval at which the heartbeat was sent.
> > 
> > 
> > https://opendev.org/openstack/neutron/commit/628442aed7400251f12809a45605bd717f494c4e
> > introduced a correlation between the two, but it resulted in the agent
> > incorrectly being considered down and caused port binding failures if
> > agent_down_time was set too large.
> > > 
> > > The merged patch [1] limited the maximum delay to 10 seconds. I understand
> > > the requirement to use random values to avoid load spikes, but why does
> > > this fix limit the heartbeat to 10 seconds? What is the goal of the
> > > agent_down_time parameter in this case? How will it work for someone who
> > > has hundreds of compute nodes / metadata agents?
> > The change in [1] should only change the delay before _update_chassis is
> > invoked; that at least was the intent. I'm expecting the interval between
> > heartbeats to be rate limited via the mechanism that was used before
> > 
> > https://opendev.org/openstack/neutron/commit/628442aed7400251f12809a45605bd717f494c4e?style=split&whitespace=show-all
> > was implemented.
> > 
> > i.e. when a SbGlobalUpdateEvent is generated, we now clamp the max wait
> > to 10 seconds instead of cfg.CONF.agent_down_time // 2, which was
> > causing port binding failures.
> > 
> > The Timer object will run the passed-in function after the timer interval
> > has expired.
> > 
> > https://docs.python.org/3/library/threading.html#timer-objects
> > 
> > but it will not re-run multiple times, and the function we are invoking
> > does not loop internally, so only one update will happen per invocation
> > of run.
> > 
> > I believe the actual heartbeat/reporting interval is controlled by
> > cfg.CONF.AGENT.report_interval:
> > 
> > 
> > https://github.com/openstack/neutron/blob/cbb89fdb1414a1b3a8e8b3a9a4154ef627bb9d1a/neutron/agent/metadata/agent.py#L313-L317
> > 
> > So I think if you want to adjust the interval in a large environment you
> > can do that by setting
> > 
> > [AGENT]
> > report_interval=<your value>
> > 
> 
> I agree that the mechanism for sending heartbeats is controlled by
> report_interval; however, from what I understand, the original idea was
> to configure complementary values: report_interval and agent_down_time
> would be associated with the status of network agents.
> 
> https://docs.openstack.org/neutron/2023.1/configuration/neutron.html
> report_interval: "Seconds between nodes reporting state to server; should
> be less than agent_down_time, best if it is half or less than
> agent_down_time."
> agent_down_time: "Seconds to regard the agent is down; should be at least
> twice report_interval, to be sure the agent is down for good."
So I think this was overly aggressive, or simply incorrect, advice.

With that advice, if report_interval was 30 then agent_down_time should be at least
60, but I don't think that is actually conservative enough; a 3:1 ratio,
i.e. report_interval: 30 and agent_down_time: 90, would have been more reasonable.

That is actually what I originally started with: just changing the delay from
randint(0, cfg.CONF.agent_down_time // 2) to randint(0, cfg.CONF.agent_down_time // 3)

but when discussing it on IRC we didn't think this needed to be configurable at all,
so the suggestion was to change it to randint(0, 10).

I decided to blend both approaches and do
max_delay = max(min(cfg.CONF.agent_down_time // 3, 10), 3)
delay = randint(0, max_delay)

The code that was modified controls the jitter we apply to the nodes, not the rate at which the updates are sent.
report_interval and agent_down_time should still be set to complementary values.


cfg.CONF.agent_down_time, however, is not really a good input into that jitter calculation;
if we wanted this to be tweakable it really should be its own config value.
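
To make the clamping concrete, here is a small illustrative sketch (not the actual
neutron code) of what the merged expression works out to with the default
agent_down_time of 75 linked further down:

    from random import randint

    # illustrative value only; in neutron this comes from cfg.CONF
    agent_down_time = 75

    # merged behaviour: 75 // 3 = 25, min(25, 10) = 10, max(10, 3) = 10
    max_delay = max(min(agent_down_time // 3, 10), 3)
    delay = randint(0, max_delay)   # 0-10 seconds of jitter

    # previous behaviour for comparison: randint(0, agent_down_time // 2) -> 0-37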


> 
> 
> > 
> > I'm not that familiar with this code, but that was my original understanding.
> > The sleep before it is re-run is calculated in oslo.service:
> > 
> > https://github.com/openstack/oslo.service/blob/1.38.0/oslo_service/loopingcall.py#L184-L194
> > 
> > https://github.com/openstack/oslo.service/blob/1.38.0/oslo_service/loopingcall.py#L154-L159
> > 
> > The neutron core team can correct me if that is incorrect, but I would not
> > expect this to negatively impact large clouds.
> > 
> 
> Note 1: My point is that SbGlobalUpdateEvent seems to be using
> agent_down_time disassociated from its original purpose (the double/half
> relation).
> 
> Note 2: I'm curious to know the behavior of this modification with more
> than 200 chassis and thousands of OVN routers. In that case many
> configurations are applied at the same time (a lot of events in SB_Global),
> and the agent running on each chassis has to honor the report_interval
> while it is applying those configs (probably millions of OpenFlow flow
> entries). Is 10 seconds enough?
Is up to 10 seconds of jitter enough? I think it's a more reasonable value than using
agent_down_time divided by any fixed value.


report_interval defaults to 30:
https://github.com/openstack/neutron/blob/cbb89fdb1414a1b3a8e8b3a9a4154ef627bb9d1a/neutron/conf/agent/common.py#L112
agent_down_time defaults to 75:
https://github.com/openstack/neutron/blob/cbb89fdb1414a1b3a8e8b3a9a4154ef627bb9d1a/neutron/conf/agent/database/agents_db.py#L19-L22

So previously this would have been a range of 0-37 (75 // 2).
The jitter really should not exceed report_interval.

I could see replacing 10 with cfg.CONF.report_interval.

So replace
            max_delay = max(min(cfg.CONF.agent_down_time // 3, 10), 3)
            delay = randint(0, max_delay)
with
            max_delay = max(min(cfg.CONF.agent_down_time // 3, cfg.CONF.report_interval), 3)
            delay = randint(0, max_delay)
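
With the defaults above, that proposed clamp would work out roughly like this
(illustrative only):

    # illustrative values; in neutron these come from cfg.CONF
    agent_down_time = 75
    report_interval = 30

    # 75 // 3 = 25, min(25, 30) = 25, max(25, 3) = 25 -> jitter of 0-25 seconds
    max_delay = max(min(agent_down_time // 3, report_interval), 3)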

But really, if we do need this to be configurable, then
we should just add a report_interval_jitter config option, and then
we could simplify it to

            delay = randint(0, cfg.CONF.report_interval_jitter)
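
If we went that route, the new option would need to be registered in the agent
config; a rough sketch (the option name and default are purely hypothetical,
neutron does not define this today):

    from random import randint
    from oslo_config import cfg

    # hypothetical option; not part of neutron today
    OPTS = [
        cfg.IntOpt('report_interval_jitter',
                   default=10,
                   help='Maximum random delay, in seconds, applied before '
                        'reporting state after an SB_Global update event.'),
    ]
    cfg.CONF.register_opts(OPTS)

    # the event handler delay calculation would then collapse to
    delay = randint(0, cfg.CONF.report_interval_jitter)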

Looking at the code, we don't actually need to calculate a random jitter on each event either;
we could just do it once when the heartbeat is created, by passing the delay as initial_delay:
https://github.com/openstack/neutron/blob/cbb89fdb1414a1b3a8e8b3a9a4154ef627bb9d1a/neutron/agent/metadata/agent.py#L313-L317

That way the updates will happen at a deterministic interval (cfg.CONF.report_interval) with a fixed random offset
determined when the agent starts.
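
For reference, a minimal sketch of that idea, assuming the heartbeat loop is driven
by oslo.service's FixedIntervalLoopingCall as in the linked agent code (the function
name and values here are illustrative):

    from random import randint
    from oslo_service import loopingcall

    def _report_state():
        # placeholder for the agent's actual state-report/heartbeat function
        pass

    report_interval = 30            # cfg.CONF.AGENT.report_interval
    initial_delay = randint(0, 10)  # random offset chosen once at agent start

    heartbeat = loopingcall.FixedIntervalLoopingCall(_report_state)
    heartbeat.start(interval=report_interval, initial_delay=initial_delay)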

I'm not currently planning on either making this run only once when the agent starts or introducing a dedicated config
option, but I think either would be fine.

Prior to https://review.opendev.org/c/openstack/neutron/+/883687 we were seeing CI failures due to the change in
https://opendev.org/openstack/neutron/commit/628442aed7400251f12809a45605bd717f494c4e, so instead of reverting it and
reintroducing the bug it was trying to fix, I limited the max delay, but we don't know how it will affect large
deployments.

I would suggest starting a patch if you believe the current behavior will be problematic, but keep in mind that
adding too much jitter/delay can cause VM boots/migrations to randomly fail, leaving the instance in an error state.
That is what our Tempest CI results were detecting, and that was preventing us from merging patches.

In production that would have resulted in operators having to manually fix things in the DB and/or rerun the
migrations. End users would have either seen a "no valid host" error when booting VMs, or the boots would have taken
longer as we would have had to retry alternative hosts and hope we don't hit the "dead agent" issue while the agent is
actually running fine and the heartbeat is just being delayed excessively long.

> 
> 
> 
> > 
> > > 
> > > Regards,
> > > Roberto
> > > 
> > > [1] - https://review.opendev.org/c/openstack/neutron/+/883687
> > > 
> > 
> > 
> 



