Hi,
This is related to bug https://bugs.launchpad.net/nova/+bug/1751923. I can’t tell whether this was fixed in more recent versions, as we are running Rocky, but according to the code reviews linked to the bug report, this was never committed to OpenStack master. I apologize in advance if this was already fixed elsewhere (it’s marked as fixed in Stein, but the reviews say the code was never committed?).
On Tue, 2020-10-06 at 14:35 -0400, Jean-Philippe Méthot wrote:

This was committed in https://review.opendev.org/#/c/591607/ and was first released in Stein. It was not backported upstream because https://review.opendev.org/#/c/614167/20 has a bug. But we backported just https://review.opendev.org/#/c/591607/ downstream in Red Hat OSP all the way back to Newton, and it works fine. So for Red Hat OSP at least this is fixed, but we did not backport the online DB migration in https://review.opendev.org/#/c/614167/20, which tries to populate the virtual interface table, just the force refresh.
Essentially, we’re running into a production issue where sometimes, after being shut down for a while, our VMs’ ports just straight up disappear from Nova. Obviously, since this is production, we have to scramble to link the port back to the VM to bring the VM back up. As a result, we have not yet identified the exact source of our issue. However, we have tested Mohammed Naser’s patch linked to this issue, and it has at the very least offered us a band-aid, since the VMs appear to be keeping their ports now.
Would it be possible to review and commit this patch or Matt Riedemann’s patch to master and backport it?
We did not backport it due to the DB migration bug, but it’s fixed from Stein on upstream. Given we have not had issues backporting https://review.opendev.org/#/c/591607/ without https://review.opendev.org/#/c/614167/20 downstream, I think it would be reasonable to do upstream.
Couldn’t it just have a configuration option to enable it? While I’m not convinced it can fix the root cause of our problem, it could at least contribute to the stability of our and other people’s OpenStack clusters.

So this is a subtle thing. It’s not really a nova bug. It’s an issue where invalid data is returned by neutron, and that corrupts the nova database. The force refresh will heal nova if and only if the neutron issue that caused the problem in the first place is resolved. If the neutron issue is not fixed, then the force refresh will continue to force update the nova networking info cache with incomplete data.
So if you never have a neutron issue that returns invalid data, then you will never need this patch. If you do, say because you broke the neutron policy file, then this backport will fix the nova database only once the policy issue is corrected. We have had several large customers that have had issues with neutron, due to misconfiguring the policy file or due to a third-party SDN controller that maintained port information in an external DB separate from neutron. In the case of the policy file customer, this self-healing worked once they corrected the issue. In the case of the SDN controller customer, it did not until the SDN vendor fixed the SDN controller’s DB. Once it returned correct data again, the periodic task healed nova.
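
To make the "if and only if" point concrete, here is a toy Python model of the behaviour described above. It is only a sketch under stated assumptions: the function and variable names are hypothetical, it does not call or reproduce nova's actual code, and the non-forced branch is a simplification of how the cache behaved before the force refresh patch.

```python
# Toy model only. heal_info_cache and its arguments are hypothetical names,
# not nova's API; they illustrate the reasoning in this thread.

def heal_info_cache(cached_ports, neutron_ports, force_refresh):
    """Return the new cached port list for one instance.

    cached_ports:  port IDs nova currently has cached for the instance.
    neutron_ports: port IDs neutron reports for the instance right now.
    force_refresh: if True, rebuild the cache from neutron's answer;
                   if False (simplified assumption), only keep cached
                   ports that neutron still reports.
    """
    if force_refresh:
        # The cache becomes whatever neutron says, right or wrong.
        return list(neutron_ports)
    # Without a forced refresh, a port lost from the cache never comes back.
    return [p for p in cached_ports if p in neutron_ports]


# Nova's cache has already lost the port.
corrupted_cache = []

# Case 1: neutron is still broken (e.g. bad policy file) and returns nothing.
still_broken = heal_info_cache(corrupted_cache, [], force_refresh=True)

# Case 2: neutron is fixed and reports the port again.
healed = heal_info_cache(corrupted_cache, ["port-a"], force_refresh=True)

# Case 3: neutron is fixed, but there is no force refresh.
not_healed = heal_info_cache(corrupted_cache, ["port-a"], force_refresh=False)

print(still_broken, healed, not_healed)
```

In this model, only case 2 restores the port: the force refresh is needed, but it can only copy back what neutron returns.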
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
4414-4416 Louis B Mayer
Laval, QC, H7P 0G1, Canada
TEL : +1.514.802.1644 - Poste : 2644
FAX : +1.514.612.0678
CA/US : 1.855.774.4678
FR : 01 76 60 41 43
UK : 0808 189 0423