[neutron][nova] [kolla] vif plugged timeout

Sean Mooney smooney at redhat.com
Wed Nov 24 13:56:41 UTC 2021


On Wed, 2021-11-24 at 00:21 +0000, Tony Liu wrote:
> I hit the same problem, from time to time, not consistently. I am using OVN.
> Typically, it takes no more than a few seconds for neutron to confirm the port is up.
> The default timeout in my setup is 600s. Even though the port shows up in both the OVN SB
> and NB databases, nova-compute still didn't get confirmation from neutron. Either neutron
> didn't pick it up or the message was lost and never reached nova-compute.
> Hoping someone could share more thoughts.

There are some known bugs in this area.
Basically every neutron backend behaves slightly differently with regard to how and when it sends the
network-vif-plugged event, and this depends on many factors and changes from release to release.

For example, I'm pretty sure that in the past ml2/ovs used to send network-vif-plugged events for ports that are administratively
disabled; since nova/os-vif still plugs those into the OVS bridge, we would expect them to be sent. However, that apparently
changed at some point, leading to https://bugs.launchpad.net/nova/+bug/1951623

ml2/ovn never sends network-vif-plugged events when the port is plugged; it cheats and sends them when the port is bound, but the exact
rules for that have also changed over the last few releases.

Nova has no way to discover this behavior from neutron, so we have to do our best to guess based on some attributes of the port.

For example, as noted below, the firewall driver used with ml2/ovs makes a difference:
if you use iptables_hybrid, we use the hybrid_plug mechanism.

That means the VM tap device is added to a Linux bridge, which is then connected to OVS with a veth pair.
For a move operation like live migration, the Linux bridge and veth pair are created on the destination in pre_live_migration, and nova waits for the event.
Since we can't detect which security group driver is used from the port, we have to guess based on whether hybrid_plug=true in the port's binding details (binding:vif_details).

For iptables_hybrid, hybrid_plug is true; for the noop and openvswitch security group drivers, hybrid_plug is set to false.
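
For illustration, a minimal standalone sketch of that guess (in nova itself this is VIF.is_hybrid_plug_enabled() in nova/network/model.py, keyed on the ovs_hybrid_plug flag neutron puts in binding:vif_details):

    # standalone sketch of the guess nova has to make from the port alone
    def is_hybrid_plug_enabled(vif_details: dict) -> bool:
        # ml2/ovs sets ovs_hybrid_plug=True when the iptables_hybrid
        # firewall driver is in use; noop and openvswitch leave it
        # False or absent
        return vif_details.get('ovs_hybrid_plug', False)

    # e.g. {'ovs_hybrid_plug': True} -> True; {} -> False
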
https://review.opendev.org/c/openstack/nova/+/767368

attempted to account for the fact that network-vif-plugged would not be sent in the latter case in pre_live_migration, since at the time the
VM interface was only plugged into OVS by libvirt during the migration:
https://review.opendev.org/c/openstack/nova/+/767368/1/nova/network/model.py#547

    def get_live_migration_plug_time_events(self):
        """Returns a list of external events for any VIFs that have
        "plug-time" events during live migration.
        """
        return [('network-vif-plugged', vif['id'])
                for vif in self if vif.has_live_migration_plug_time_event]
https://review.opendev.org/c/openstack/nova/+/767368/1/nova/network/model.py#472
    @property
    def has_live_migration_plug_time_event(self):
        """Returns whether this VIF's network-vif-plugged external event will
        be sent by Neutron at "plugtime" - in other words, as soon as neutron
        completes configuring the network backend.
        """
        return self.is_hybrid_plug_enabled()

What that code does is skip waiting for the network-vif-plugged event during live migration for all interfaces where hybrid_plug is false,
which includes ml2/ovs with the noop or openvswitch security group driver, and ml2/ovn since it never sends them at the correct time.

It turns out that to fix https://bugs.launchpad.net/nova/+bug/1951623 we should also skip waiting if the admin state on the port is
disabled, by adding "and vif['active']" to the list comprehension.
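
A minimal sketch of that amended comprehension (assuming the VIF model's boolean 'active' field, which mirrors the neutron port status):

    def get_live_migration_plug_time_events(self):
        # sketch of the proposed fix: also require the VIF to be active so
        # we never wait for an event neutron will not send for an
        # admin-disabled port
        return [('network-vif-plugged', vif['id'])
                for vif in self
                if vif.has_live_migration_plug_time_event and vif['active']]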

The code should also have additional knowledge of the network backend to make the right decisions; however,
the bound_drivers field introduced by https://specs.openstack.org/openstack/neutron-specs/specs/train/port-binding-extended-information.html
was never actually implemented in neutron, so neutron does not currently tell nova whether it is ml2/ovs, ml2/ovn or ml2/odl.

All of the above have vif_type OVS, so we can't re-enable waiting for the network-vif-plugged event when hybrid_plug is false and ml2/ovs is used:
while that would be correct for ml2/ovs, it would break ml2/ovn, so we are forced to support the least capable network backend in any situation.
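
As an aside, you can see exactly what nova has to work with by inspecting the port's binding attributes, for example with openstacksdk (a sketch; the cloud name and port UUID are placeholders):

    import openstack

    # 'mycloud' is a placeholder for an entry in your clouds.yaml
    conn = openstack.connect(cloud='mycloud')

    port = conn.network.get_port('PORT_UUID')  # placeholder port id
    # ml2/ovs, ml2/ovn and ml2/odl all report vif_type 'ovs' here, which
    # is why nova cannot tell the backends apart from the port alone
    print(port.binding_vif_type)     # binding:vif_type
    print(port.binding_vif_details)  # binding:vif_details, e.g. ovs_hybrid_plug
    print(port.binding_profile)      # binding:profile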

Until this is fixed in nova and neutron, it's unlikely you will be able to address this in kolla in a meaningful way.
Every time we skip waiting for a network-vif-plugged event in nova when there ideally would be one as part of a move operation,
we introduce a race between the VM starting on the destination host and the network backend completing its configuration, so
simply setting [DEFAULT]/vif_plugging_is_fatal=False or [compute]/live_migration_wait_for_vif_plug=false risks the VM not having
working networking when it starts.

https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.vif_plugging_is_fatal
https://docs.openstack.org/nova/latest/configuration/config.html#compute.live_migration_wait_for_vif_plug

They do provide ways for operators to work around some bugs, as does the recently added [workarounds]/wait_for_vif_plugged_event_during_hard_reboot option:
https://docs.openstack.org/nova/latest/configuration/config.html#workarounds.wait_for_vif_plugged_event_during_hard_reboot
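
For reference, these are the relevant nova.conf knobs (values shown are the defaults as I understand them; the workarounds option takes a list of vif types to wait for on hard reboot):

    [DEFAULT]
    # if true (the default), fail the operation when the event never arrives
    vif_plugging_is_fatal = true
    # how long to wait for the event before giving up, in seconds
    vif_plugging_timeout = 300

    [compute]
    # whether pre_live_migration waits for the event at all
    live_migration_wait_for_vif_plug = true

    [workarounds]
    # list of vif types to wait for on hard reboot; empty (the default)
    # preserves the old behaviour of not waiting
    wait_for_vif_plugged_event_during_hard_reboot =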

However, this should not be complexity that the operator has to understand and configure via kolla.

We should fix the contract between nova and neutron, including requiring out-of-tree network vendors like Cisco ACI and other core plugins to actually
conform to the interface, but after 5 years of trying to get this fixed it still isn't, and we just have to play whack-a-mole every time someone reports
another edge case.

In this specific case I don't know why you are not getting the event, but for ml2/ovs both the L2 agent and the DHCP agent need to notify the neutron server that provisioning
is complete, and apparently the port now also needs to be admin state active/up before the network-vif-plugged event is sent.
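
For background, neutron tracks that with its provisioning-blocks mechanism; conceptually it works like the sketch below, which uses neutron's real provisioning_blocks module but would need a neutron server context to actually run:

    # conceptual sketch of ml2/ovs port provisioning; the real calls live in
    # neutron/plugins/ml2/plugin.py and neutron/db/provisioning_blocks.py
    from neutron.db import provisioning_blocks
    from neutron_lib.callbacks import resources

    def on_port_bound(context, port_id):
        # when the port is bound, a block is registered per interested entity
        for entity in (provisioning_blocks.L2_AGENT_ENTITY,
                       provisioning_blocks.DHCP_ENTITY):
            provisioning_blocks.add_provisioning_component(
                context, port_id, resources.PORT, entity)

    def on_entity_provisioned(context, port_id, entity):
        # the port only transitions to ACTIVE (and network-vif-plugged is
        # only sent to nova) once *every* registered entity has reported in
        provisioning_blocks.provisioning_complete(
            context, port_id, resources.PORT, entity)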

In the case where it fails, I would check the DHCP agent log, the L2 agent log and the neutron server log, and try to see if one or both of the L2/DHCP agents failed to provision the port.
I would guess it's the DHCP agent, given it works on the retry to the next host.

regards
sean

> Thanks!
> Tony
> ________________________________________
> From: Laurent Dumont <laurentfdumont at gmail.com>
> Sent: November 22, 2021 02:05 PM
> To: Michal Arbet
> Cc: openstack-discuss
> Subject: Re: [neutron][nova] [kolla] vif plugged timeout
> 
> How high did you have to raise it? If it does appear after X amount of time, then the VIF plug is not lost?
> 
> On Sat, Nov 20, 2021 at 7:23 AM Michal Arbet <michal.arbet at ultimum.io> wrote:
> + if I raise vif_plugged_timeout (hope I remember it correctly) in nova to some high number, the problem disappears... but it's only a workaround.
> 
> On Sat, 20 Nov 2021 at 12:05, Michal Arbet <michal.arbet at ultimum.io> wrote:
> Hi,
> 
> Has anyone seen issue which I am currently facing ?
> 
> When launching a heat stack (but it's the same if I launch several instances), vif plugging times out and I don't know why; sometimes it is OK, sometimes it fails.
> 
> Sometimes neutron reports vif plugged in < 10 sec (test env), sometimes it's 100 or more seconds; it seems there is some race condition but I can't find where the problem is. But in the end every instance is spawned OK (the retry mechanism worked).
> 
> Another finding is that it has something to do with the security group; if the noop driver is used, everything works fine.
> 
> The firewall driver is openvswitch.
> 
> The test env is Wallaby.
> 
> I will attach some logs when I am near a PC.
> 
> Thank you,
> Michal Arbet (Kevko)
> 



