On 27/11/2025 13:26, Nell Jerram wrote:
> Could enable_qemu_monitor_announce_self blocking be responsible for 12 _minutes_ of delay? That sounds huge!

I don't see any other way that config option could have an effect and be responsible for the reported issue.
It does not make sense that changing that value would actually affect this at all. If that was a blocking call and it did not return, that might explain the delay; otherwise my opinion is that this is a coincidence.
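To illustrate the blocking-call theory, here is a rough sketch only, not nova's code. blocking_announce_self is a made-up stand-in for a monitor call that does not yield, and post_live_migration is a hypothetical, heavily simplified cleanup routine. It just shows that calling inline holds up everything behind it, while handing the call to a worker thread lets the rest of the cleanup carry on.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_announce_self(delay):
    """Hypothetical stand-in for a monitor call that blocks its caller."""
    time.sleep(delay)
    return "announced"


def post_live_migration(delay, background):
    """Hypothetical, simplified cleanup.

    Returns (seconds until the cleanup step was reached, announce result).
    """
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=1) as pool:
        if background:
            # Kick the call off in a worker thread and move on.
            future = pool.submit(blocking_announce_self, delay)
            result = None
        else:
            # Inline call: nothing else runs until it returns.
            future = None
            result = blocking_announce_self(delay)
        # The rest of the cleanup would run here.
        reached_cleanup = time.monotonic() - start
        if future is not None:
            # Collect the result once the remaining work is done.
            result = future.result()
    return reached_cleanup, result


if __name__ == "__main__":
    print(post_live_migration(0.3, background=False))
    print(post_live_migration(0.3, background=True))
```

nova runs under eventlet, so the closer equivalent there would be something like eventlet.tpool.execute; plain threads are used here only to keep the sketch self-contained.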
> Also, can I ask if this is _only_ a problem with the OpenStack status reporting (i.e. "openstack server migration list")? Or does it also affect the actual liveness of the migrated instance?

If it's related to enable_qemu_monitor_announce_self, it can't affect the liveness of the instance and it would only be a reporting issue.
I think this is much more likely to be related to this feature request:
https://bugs.launchpad.net/nova/+bug/2128665
https://blueprints.launchpad.net/nova/+spec/refine-network-setup-procedure-i...
and the comment thread we discussed:
https://review.opendev.org/c/openstack/nova/+/966106/1/nova/virt/libvirt/hos...
The TL;DR is that there is a kernel bug
https://lore.kernel.org/all/20240626191830.3819324-1-yang@os.amperecomputing...
that is only fixed in 6.13, which can cause the source VM to take minutes to stop as it waits for the kernel to deallocate the memory.
We do not actually mark the live migration as complete until after that has finished, so I think that's why it's taking minutes for the status to go to completed.
> (Coincidentally, I am also currently investigating live migration. I'm seeing a problem where data transfer on an existing connection to the instance is held up for about 12 seconds after the migration has completed.)

I'm not sure, but maybe that is related to the kernel bug?
libvirt does have to do more than just transfer the data before it can complete the migration or unpause the VM on the destination, but I don't know the details well enough to say what that entails.
On Thu, Nov 27, 2025 at 12:15 PM Sean Mooney <smooney@redhat.com> wrote:
On 27/11/2025 11:57, Sean Mooney wrote:
>
> On 27/11/2025 02:46, Nguyễn Hữu Khôi wrote:
>> Hello.
>>
>> I just select live migrate from horizon without destination. Instance
>> is on shared storage, I don't use force-complete. It looks like
>> enable_qemu_monitor_announce_self = true causes this problem. It is ok
>> if I change it to false. I use this option on openstack Xena because
>> without it after live migration I cannot ping or access instances.
>> This cloud uses OVS. I test with my current cloud 2025.1 which uses
>> OVN, it seems we don't need this option anymore. Pls correct me if I
>> am wrong.

I should have also said that this option was needed for OVN in the past as well, but as of Caracal or so OVN and Neutron now support multiple logical switch ports.
Before that, OVN also suffered from network downtime in a similar way to OVS, especially on VLAN networks, because it would not set up any egress flows until we activated the port binding in post-live-migrate, instead of setting it up in pre-live-migrate like it should have.
This change landed in Zed to add the Neutron support: https://review.opendev.org/c/openstack/neutron/+/828455
I don't recall the specific OVN version required for that optimisation to be present. I vaguely have OVN 23 in my mind, but that could be way off.
OVN also has other issues with how it works today under load: https://bugs.launchpad.net/neutron/+bug/2069718
There is a good YouTube presentation from the summit on how OVN can have connectivity issues in larger deployments, or where the reconciliation time for OVN is long: https://www.youtube.com/watch?v=POjSOxKyrE0
> It's a workaround option, so it was never required.
> It just helps mitigate some downtime that could happen in older releases
> of OpenStack.
> It should not be required anymore.
>
> But this is an unexpected side effect, although it's benign overall.
>
> So what's happening is that when we invoke QEMU to send those RARP packets, we
> must be waiting for that
> to complete before completing the migration.
>
> I think I see the problem:
>
> https://github.com/openstack/nova/blob/master/nova/virt/libvirt/guest.py#L66...
>
> We are directly calling the QEMU monitor via the libvirtmon module,
> which is a C module.
> We are expecting that to yield due to eventlet, but it's possible that
> that call is blocking.
>
> On one hand, we probably should be kicking that off in a background
> thread and just moving on with the rest
> of post live migration; on the other hand, this should not really be
> needed after Antelope or Bobcat-ish,
> so I'm not sure how useful it is to make that change now.
>
> Anyway, yes, turning off the
> enable_qemu_monitor_announce_self workaround in 2025.1 should be safe.
>>
>> Nguyen Huu Khoi
>>
>> On Wed, Nov 26, 2025 at 6:23 PM Sean Mooney <smooney@redhat.com> wrote:
>>> How did you "complete the migration"? Was it via the force-complete
>>> action?
>>> That can take several minutes to complete at the QEMU level, as it needs
>>> to transfer any outstanding memory or block data.
>>>
>>> Also, what are you defining as the migration end time? When the VM on the
>>> dest starts? (If post-copy live migration is used, the migration is
>>> still in progress when that happens.)
>>> When the VM on the source is stopped? The VM has not been cleaned up, so
>>> the migration is not complete in nova at this point, as we need to
>>> remove any
>>> volume attachments, delete local files and update neutron in post-live
>>> migrate before the migration is actually completed.
>>>
>>> It will not be updated to complete until all cleanup on the source host
>>> is also complete.
>>> What you are reporting is not obviously indicative of a bug or error.
>>>
>>> Without more information we can't really help you understand if this is
>>> normal or not.
>>>
>>> On 26/11/2025 01:18, Nguyễn Hữu Khôi wrote:
>>>> Hello.
>>>>
>>>> I am working on instance migration. After I complete the migration,
>>>> when I run "openstack server migration list" to check its status, it
>>>> still shows as running. It takes around 12 minutes before it
>>>> updates to completed status.
>>>>
>>>> My OPS: 2025.1
>>>>
>>>> Thank you.
>>>>
>>>> Nguyen Huu Khoi