[nova][train] live migration issue

Kashyap Chamarthy kchamart at redhat.com
Wed May 19 14:35:02 UTC 2021


(Hi, we've talked on #openstack-nova; updating on list too.)

On Wed, May 19, 2021 at 10:48:11AM +0200, Ignazio Cassano wrote:
> Hello Guys,
> on train centos7 I am facing live migration issue only for some instances
> (not all).
> The error reported is:
> 2021-05-19 08:45:57.096 142537 ERROR nova.compute.manager [-] [instance:
> b18450e8-b3db-4886-a737-c161d99c6a46] Live migration failed.: libvirtError:
> Unable to read from monitor: Connection reset by peer
> 
> The instance remains in pause on both source and destination host.
> 
> Any help,please ?

Summarizing the issue for those who are following this conversation:

The debugging chat trail from #openstack-nova starts here:
http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2021-05-19.log.html#t2021-05-19T08:50:11

Version
-------

- libvirt: 4.5.0, package: 36.el7_9.5
- QEMU: 2.12.0, package: qemu-kvm-ev-2.12.0-44.1.el7_8.1
- kernel: 3.10.0-1160.25.1.el7.x86_64

Problem
-------

Some guests (on NFS) seem to crash during live migration, with the
errors below in the QEMU guest log:

    [...]
    2021-05-19T08:12:30.396878Z qemu-kvm: Failed to load virtqueue_state:vring.used
    2021-05-19T08:12:30.397555Z qemu-kvm: Failed to load virtio/virtqueues:vq
    2021-05-19T08:12:30.397581Z qemu-kvm: Failed to load virtio-blk:virtio
    2021-05-19T08:12:30.397606Z qemu-kvm: error while loading state for instance 0x0 of device '0000:00:08.0/virtio-blk'
    2021-05-19T08:12:30.399542Z qemu-kvm: load of migration failed: Input/output error
    2021-05-19 08:12:31.022+0000: shutting down, reason=crashed
    [...]

And this error from libvirt (as obtained via `journalctl -u libvirtd -l
--since=yesterday -p err`):

    error : qemuDomainObjBeginJobInternal:6825 : Timed out during
    operation: cannot acquire state change lock (held by monitor=remo
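
As a rough aid for spotting the same pattern on other hosts, here is a
minimal Python sketch that scans a QEMU guest log for the failure
signatures quoted above.  The function names are mine (hypothetical), and
the only strings it matches are the ones visible in the excerpt; it is
not a Nova or libvirt tool.

```python
import re

# Signatures copied from the QEMU guest log excerpt in this mail.
SIGNATURES = (
    "Failed to load",
    "error while loading state",
    "load of migration failed",
    "shutting down, reason=crashed",
)

# Matches e.g.: error while loading state for instance 0x0 of device '...'
DEVICE_RE = re.compile(
    r"error while loading state for instance \S+ of device '([^']+)'"
)


def find_migration_failures(log_text):
    """Return (line_number, line) pairs containing a known signature."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if any(sig in line for sig in SIGNATURES):
            hits.append((number, line.strip()))
    return hits


def failing_devices(log_text):
    """Return the device names QEMU reported as failing to load state."""
    return DEVICE_RE.findall(log_text)
```

To use it, read the guest log of the affected instance (libvirt writes
these under /var/log/libvirt/qemu/ by default, but check your deployment)
and pass the text to `failing_devices()`; for the excerpt above it would
report the virtio-blk device at 0000:00:08.0.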

Diagnosis
---------

Further, this "cannot acquire state change lock" error from libvirt is
notoriously hard to debug without a reliable reproducer, as it could be
due to QEMU getting hung, which in turn could be caused by stuck I/O.

See also the discussion (without a conclusion) on this related QEMU
bug[1], particularly comment #11.
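
For what it's worth, the numbers in the error message quoted in [1] are
consistent with a 16-bit wrap-around: the delta looks like the guest
index minus the host index, truncated to 16 bits.  The sketch below just
redoes that arithmetic.  That the delta is computed this way, and that it
is rejected for exceeding the queue size, is my reading of the message
and an assumption, not something confirmed in this thread; the function
names are hypothetical.

```python
def vq_delta(guest_index, host_index):
    """Difference of two virtqueue indices, wrapped to 16 bits."""
    return (guest_index - host_index) & 0xFFFF


def looks_consistent(guest_index, host_index, queue_size):
    """Assumed check: the wrapped delta must not exceed the queue size."""
    return vq_delta(guest_index, host_index) <= queue_size


# Values from the message in [1]: size 0x80, guest 0x12c, host 0x134.
delta = vq_delta(0x12C, 0x134)
print(hex(delta))                             # 0xfff8
print(looks_consistent(0x12C, 0x134, 0x80))   # False
```

In other words, the guest-side index is 8 entries *behind* the host-side
one, which wraps to 0xfff8 and is far larger than the 0x80-entry queue.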

In short, without a solid reproducer, these virtio issues are hard to
track down, I'm afraid.


[1] https://bugs.launchpad.net/nova/+bug/1761798 -- live migration
    intermittently fails in CI with "VQ 0 size 0x80 Guest index 0x12c
    inconsistent with Host index 0x134: delta 0xfff8"

-- 
/kashyap



