Hello,

We also backported this patch as far back as Newton and it works fine most of the time.

The catch is that the heal operation heals instances one at a time, and the default interval between heals is 60 seconds. So depending on the number of instances you have on a host, you may have to wait a long time before a given instance is actually healed.

You can of course reduce the interval between heals, but that puts more load on your neutron server. If you have a lot of computes, that can be an issue.

We chose another way in my company by implementing this:
https://review.opendev.org/#/c/702394/

It is not perfect, as commented by Sean and others, but it gives you a quick and easy way to refresh the info cache of one instance using:
nova refresh-network <instance_uuid>

Cheers,

-- 
Arnaud Morin

On 06.10.20 - 16:15, Jean-Philippe Méthot wrote:
We did not backport it due to the DB migration bug, but it's fixed from Stein onward upstream. Given we have not had issues backporting https://review.opendev.org/#/c/591607/ without https://review.opendev.org/#/c/614167/20 downstream, I think it would be reasonable to do upstream.
If it could be backported to Rocky and maybe even Queens, for those who still run Queens, I’m sure it would be strongly appreciated (at least we would since we wouldn’t have to patch manually when we update packages)
Couldn’t it just have a configuration option to enable it? While I’m not convinced it can fix the root cause of our problem, it could at least contribute to the stability of our and other people’s Openstack cluster.

So this is a subtle thing. It's not really a nova bug. It's an issue where invalid data is returned by neutron and that corrupts the nova database. The force refresh will heal nova if and only if the neutron issue that caused the problem in the first place is resolved. If the neutron issue is not fixed, then the force refresh will continue to force-update the nova networking info cache with incomplete data.

So if you never have a neutron issue that returns invalid data, then you will never need this patch. If you do, say because you broke the neutron policy file, then this backport will fix the nova database only once the policy issue is corrected. We have had several large customers that have had issues with neutron, either due to misconfiguring the policy file or due to a third-party SDN controller that maintained port information in an external DB separate from neutron. In the case of the policy file customer, this self-healing worked once they corrected the issue. In the case of the SDN controller customer, it did not until the SDN vendor fixed the SDN controller's DB. Once it returned correct data again, the periodic task healed nova.
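To illustrate the behaviour described above, here is a minimal toy model in Python. It is only a sketch and not nova's actual code: `FakeNeutron`, `FakeCompute`, `heal_one` and `worst_case_heal_seconds` are hypothetical names made up for this example, and the model assumes the periodic task walks the instances on a host in a fixed order, one per run. It shows two things from this thread: a forced refresh simply overwrites the info cache with whatever neutron returns, and with one instance healed per interval the wait grows with the number of instances on the host.

```python
from dataclasses import dataclass


@dataclass
class FakeNeutron:
    """Hypothetical stand-in for neutron: instance UUID -> list of port IDs."""
    ports: dict
    healthy: bool = True

    def list_ports(self, instance_uuid):
        # While "unhealthy" (e.g. a broken policy file), ports are omitted.
        if not self.healthy:
            return []
        return list(self.ports.get(instance_uuid, []))


@dataclass
class FakeCompute:
    """Hypothetical model of the periodic heal task on one compute host."""
    neutron: FakeNeutron
    info_cache: dict
    _next: int = 0

    def heal_one(self):
        """One periodic run: force-refresh the cache of a single instance."""
        uuids = sorted(self.info_cache)
        uuid = uuids[self._next % len(uuids)]
        self._next += 1
        # The forced refresh trusts neutron's answer, complete or not.
        self.info_cache[uuid] = self.neutron.list_ports(uuid)
        return uuid


def worst_case_heal_seconds(instances_on_host, interval_seconds=60):
    """Rough upper bound on the wait before a given instance is healed."""
    return instances_on_host * interval_seconds


if __name__ == "__main__":
    neutron = FakeNeutron(
        ports={"vm-a": ["port-1"], "vm-b": ["port-2"]}, healthy=False
    )
    compute = FakeCompute(neutron=neutron, info_cache={"vm-a": [], "vm-b": []})

    compute.heal_one()
    compute.heal_one()
    print("neutron still broken:", compute.info_cache)

    neutron.healthy = True  # the neutron-side issue gets fixed
    compute.heal_one()
    compute.heal_one()
    print("neutron fixed:", compute.info_cache)

    print("worst case for 50 instances:", worst_case_heal_seconds(50), "seconds")
```

In this model the cache only becomes correct on the first run that happens after neutron starts returning complete data, which matches the two customer cases described above.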
That’s interesting because we run a very basic neutron + openvswitch setup with default policies. Additionally, we have tested the nova patch I mentioned earlier for a long while and it seemed to at least prevent the instances from losing their port. Doesn’t that imply that neutron has consistently returned correct data in our setup in particular? So our issue could be elsewhere? I could be wrong and it’s not a hill I’m willing to die on, I’m just pointing out my own observations.
Jean-Philippe Méthot
Senior Openstack system administrator
PlanetHoster inc.
4414-4416 Louis B Mayer
Laval, QC, H7P 0G1, Canada
TEL : +1.514.802.1644 - Poste : 2644
FAX : +1.514.612.0678
CA/US : 1.855.774.4678
FR : 01 76 60 41 43
UK : 0808 189 0423