[neutron] - OVN heartbeat - short agent_down_time

Roberto Bartzen Acosta roberto.acosta at luizalabs.com
Thu Jun 22 14:58:50 UTC 2023


I understand Nova's requirements, but OVN heartbeat has a significant
impact on the Southbound Database.

We have a related topic on etherpad about this (Vancouver PTG):

   - "(labedz) OVN heartbeat mechanism - big mechanism with significant
   infrastructure impact for ? Why we need to be on OVN southbound with
   Neutron?"


Sean mentioned some reasons to use the metadata heartbeat mechanism:

   - "i would suggest startign a patch if you belive the current behavior
   will be probelmatic but keep in mind that addign too much jitter/delay can
   cause vm boots/migtrations to randomly fail leavign the instance in an
   error state."


Maybe we shouldn't try to get the network agent status without considering
the impact on the OVN backend. OVN can take a long time processing messages
from the ovs-vswitchd daemon on the chassis (OVSDB transactions). In that
case, ovn-controller is still blocked on the unix socket between
ovn-controller <-> ovs-vswitchd, and during this sync ovn-controller cannot
process any "heartbeat" because it is still busy with the last cfg. In other
words, the time needed to bump the heartbeat cfg depends heavily on the
number of resources in use (scaling).
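
To make that dependency explicit, this is roughly the flow as I understand
it (a minimal sketch with assumed helper names, not the actual Neutron
code): the server bumps nb_cfg and then waits for the chassis to echo the
number back, which a busy ovn-controller only does after it has finished
applying the current configuration, no matter how small agent_down_time is:

    import time

    def wait_for_heartbeat(nb_idl, sb_idl, chassis_name, agent_down_time):
        # Bump the global sequence number (sketch; names are assumptions).
        expected = nb_idl.nb_global.nb_cfg + 1
        nb_idl.db_set('NB_Global', '.', ('nb_cfg', expected)).execute()

        deadline = time.time() + agent_down_time
        while time.time() < deadline:
            chassis = sb_idl.lookup('Chassis_Private', chassis_name)
            if chassis.nb_cfg >= expected:
                return True   # ovn-controller caught up and bumped its cfg
            time.sleep(1)     # still busy with the previous configuration
        return False          # reported as dead after agent_down_time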

This specific patch is related to the "OVN Metadata agent" heartbeat and
uses the neutron:ovn-metadata-sb-cfg key to bump the nb_cfg config number:

            table = ('Chassis_Private' if self.agent.has_chassis_private
                     else 'Chassis')
            self.agent.sb_idl.db_set(
                table, self.agent.chassis, ('external_ids', {
                    ovn_const.OVN_AGENT_METADATA_SB_CFG_KEY:
                        str(row.nb_cfg)})).execute()
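
On the server side, my understanding is that the agent is considered alive
if the nb_cfg value it echoed back is recent enough. A rough sketch of that
comparison (not the actual Neutron code; the function name is mine, only
the key comes from the snippet above):

    def metadata_agent_alive(sb_idl, chassis_name, expected_nb_cfg):
        # The metadata agent wrote the nb_cfg it last processed into the
        # chassis external_ids (see the snippet above).
        chassis = sb_idl.lookup('Chassis_Private', chassis_name)
        reported = int(chassis.external_ids.get(
            'neutron:ovn-metadata-sb-cfg', 0))
        # Alive if it has seen the expected sequence number (or newer);
        # otherwise it is lagging or down.
        return reported >= expected_nb_cfg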

As I understand it, this is very similar to the "OVN Controller agent"
heartbeat, but in the ovn-controller case we are talking about the
"neutron:liveness_check_at" key used to bump the cfg on the NB_Global table:
        last_ping = self.nb_ovn.nb_global.external_ids.get(
            ovn_const.OVN_LIVENESS_CHECK_EXT_ID_KEY)
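
My reading is that the server periodically stamps this key (presumably
alongside the nb_cfg bump described earlier) so it knows when it last
"pinged" the chassis. A minimal sketch of that write, with an assumed
function name; only the key name comes from the snippet above:

    from datetime import datetime, timezone

    OVN_LIVENESS_CHECK_EXT_ID_KEY = 'neutron:liveness_check_at'

    def stamp_liveness_check(nb_idl):
        # Record when the last liveness "ping" was issued; the answer from
        # the chassis is the cfg bump on its side.
        now = datetime.now(timezone.utc).isoformat()
        nb_idl.db_set(
            'NB_Global', '.',
            ('external_ids', {OVN_LIVENESS_CHECK_EXT_ID_KEY: now})).execute()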

In both cases, moving the cfg numbers forward requires ovn-controller to be
available... I suppose that being able to customize this value is better
for large-scale cases.
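
For reference, the kind of capped random delay introduced by the merged
patch discussed below (my reading of it, not the exact code) looks roughly
like the sketch here; at large scale the hard 10-second cap is what limits
how far the heartbeat load can be spread out:

    import random

    def heartbeat_delay(agent_down_time, max_delay=10):
        # Spread heartbeats over at most half of agent_down_time, but
        # never more than max_delay seconds.
        return random.uniform(0, min(agent_down_time // 2, max_delay))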

It seems to me that's what we talked about in Vancouver, Rodolfo
(scalability vs. reliability). OVN needs to evolve with I-P (incremental
processing) to respond faster to configuration changes, but until that
happens, we'll have to live with bigger timeouts...



On Wed, Jun 21, 2023 at 18:00, Ihar Hrachyshka <ihrachys at redhat.com>
wrote:

> On Mon, Jun 19, 2023 at 11:04 AM Roberto Bartzen Acosta <
> roberto.acosta at luizalabs.com> wrote:
>
>> Hello Neutron folks,
>>
>> In the Operators feedback session we discussed the OVN heartbeat and the
>> use of "infinity" values for large-scale deployments, because there is a
>> significant infrastructure impact when a short 'agent_down_time' is
>> configured.
>>
>
> This is tangentially related, but note that using "infinity" values for
> agent_down_time is unsafe:
> https://bugzilla.redhat.com/show_bug.cgi?id=2215407 (depending on whether
> your "infinity" value is larger than ~15 days, assuming 32 bit ints used on
> your platform).
>
>
>>
>>
>> The merged patch [1] limited the maximum delay to 10 seconds. I
>> understand the requirement to use random values to avoid load spikes, but
>> why does this fix limit the heartbeat to 10 seconds? What is the goal of
>> the agent_down_time parameter in this case? How will it work for someone
>> who has hundreds of compute nodes / metadata agents?
>>
>> Regards,
>> Roberto
>>
>> [1] - https://review.opendev.org/c/openstack/neutron/+/883687
>>
>

