[nova][train] live migration issue
Hello guys,
On CentOS 7 with Train I am facing a live migration issue, but only for some instances (not all). The error reported is:

2021-05-19 08:45:57.096 142537 ERROR nova.compute.manager [-] [instance: b18450e8-b3db-4886-a737-c161d99c6a46] Live migration failed.: libvirtError: Unable to read from monitor: Connection reset by peer

The instance remains paused on both the source and the destination host.
Any help, please?
Ignazio
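For readers following the triage, the failed migration can also be confirmed from the API side. This is only a sketch, not from the thread: the `run` dry-run wrapper is mine, the UUID is the one quoted in the error above, and it assumes the stein-era python-openstackclient and python-novaclient:

```shell
# Sketch: confirm what Nova thinks happened after the failed migration.
# DRY_RUN=1 (the default here) prints the commands instead of executing them.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi; }

check_instance() {
  local server="$1"
  # Current status as Nova sees it (e.g. ACTIVE, PAUSED, ERROR)
  run openstack server show "$server" -f value -c status
  # Migration records for this instance (novaclient)
  run nova migration-list --instance-uuid "$server"
}

check_instance b18450e8-b3db-4886-a737-c161d99c6a46
```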
I am sorry, the OpenStack version is Stein.

On Wed, 19 May 2021 at 10:48, Ignazio Cassano <ignaziocassano@gmail.com> wrote:
(Hi, we've talked on #openstack-nova; updating on list too.) On Wed, May 19, 2021 at 10:48:11AM +0200, Ignazio Cassano wrote:
Summarizing the issue for those who are following along this conversation:

The debugging chat trail from #openstack-nova starts here:
http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2...

Version
-------

- libvirt: 4.5.0, package: 36.el7_9.5
- QEMU: 2.12.0 (qemu-kvm-ev-2.12.0-44.1.el7_8.1)
- kernel: 3.10.0-1160.25.1.el7.x86_64

Problem
-------

Some guests (on NFS) seem to crash during live migration, with the below errors in the QEMU guest log:

[...]
2021-05-19T08:12:30.396878Z qemu-kvm: Failed to load virtqueue_state:vring.used
2021-05-19T08:12:30.397555Z qemu-kvm: Failed to load virtio/virtqueues:vq
2021-05-19T08:12:30.397581Z qemu-kvm: Failed to load virtio-blk:virtio
2021-05-19T08:12:30.397606Z qemu-kvm: error while loading state for instance 0x0 of device '0000:00:08.0/virtio-blk'
2021-05-19T08:12:30.399542Z qemu-kvm: load of migration failed: Input/output error
2021-05-19 08:12:31.022+0000: shutting down, reason=crashed
[...]

And this error from libvirt (as obtained via `journalctl -u libvirtd -l --since=yesterday -p err`):

error : qemuDomainObjBeginJobInternal:6825 : Timed out during operation: cannot acquire state change lock (held by monitor=remo

Diagnosis
---------

Further, this "cannot acquire state change lock" error from libvirt is notoriously hard to debug without a reliable reproducer, as it could be due to QEMU getting hung, which in turn could be caused by stuck I/O.

See also the discussion (but no conclusion) on this related QEMU bug[1], particularly comment #11.

In short, without a solid reproducer, these virtio issues are hard to track down, I'm afraid.

[1] https://bugs.launchpad.net/nova/+bug/1761798 -- live migration intermittently fails in CI with "VQ 0 size 0x80 Guest index 0x12c inconsistent with Host index 0x134: delta 0xfff8"

--
/kashyap
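The QEMU guest-log failure signature quoted above is greppable, which makes it easy to check which instances on a compute node hit the same crash. A sketch only (the patterns come from the log excerpt in this thread; the log path is the CentOS 7 libvirt default and is an assumption):

```shell
# Count virtio load-failure signatures in a QEMU guest log.
# Patterns are taken from the "Failed to load ..." excerpt above.
scan_qemu_log() {
  grep -cE 'Failed to load|error while loading state|load of migration failed' "$1"
}

# Typical use on a compute node (commented out; needs root):
# for f in /var/log/libvirt/qemu/instance-*.log; do
#   n=$(scan_qemu_log "$f") && [ "$n" -gt 0 ] && echo "$f: $n hits"
# done
```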
Hello, some news... I wonder if it can help. I am testing with some virtual machines again. If I follow these steps it works (but I lose the network connection):

1) Detach the network interface from the instance
2) Attach the network interface to the instance
3) Migrate the instance
4) Log into the instance using the console and restart networking

while if I restart networking before the live migration, it does not work.

So, when someone mentioned:

########################
we get this "guest index inconsistent" error when the migrated RAM is inconsistent with the migrated 'virtio' device state. And a common case is where a 'virtio' device does an operation after the vCPU is stopped and after RAM has been transmitted.
########################

could the network traffic be the problem?

Ignazio

On Wed, 19 May 2021 at 16:35, Kashyap Chamarthy <kchamart@redhat.com> wrote:
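Steps 1-3 of the workaround above can be scripted with the stein-era clients. A sketch only, not from the thread: the `run` dry-run wrapper and the port lookup are mine, and step 4 still has to happen on the guest console:

```shell
# Sketch of the detach / re-attach / migrate workaround described above.
# DRY_RUN=1 (the default here) prints the commands instead of executing them.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi; }

workaround() {
  local server="$1" port="$2"  # port UUID: openstack port list --server "$server"
  run openstack server remove port "$server" "$port"  # 1) detach the NIC
  run openstack server add port "$server" "$port"     # 2) re-attach the NIC
  run nova live-migration "$server"                   # 3) live-migrate (scheduler picks the host)
  # 4) log in via the console and restart networking inside the guest
}
```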
participants (2)
- Ignazio Cassano
- Kashyap Chamarthy