[nova] limiting number of attached devices
Hi all,

we've recently faced a peculiar issue with one of our users, namely:

- a fat VM with 100 cores
- virtio-net multiqueue set up for the VM (`hw:vif_multiqueue_enabled`)
- quite a number of NICs attached

On an attempt to attach yet another NIC (~20th), the instance crashes with a specific error message in the qemu log:

```
qemu-system-x86_64: ../../accel/kvm/kvm-all.c:1754: kvm_irqchip_commit_routes: Assertion `ret == 0' failed.
```

After debugging, we believe that we are hitting an intrinsic limit on the number of IRQs in a KVM guest (`KVM_MAX_IRQ_ROUTES 4096` here https://github.com/torvalds/linux/blob/master/include/linux/kvm_host.h#L2175 ). With multiqueue and this number of vCPUs, each new NIC adds ~200 interrupts in the guest, so with some other system ones and some more devices already present, at some point adding one more NIC overflows this limit.

In this particular case, the problem is exacerbated by the fact that the new NIC attachment is requested by some sort of CNI-like automation inside the guest, reacting to external events (like another k8s node going down and pods being migrated).

This got me thinking: should nova be able to somehow hard-limit the number of devices attached to a VM?

I do not think nova should be "counting the IRQs" per se, as there are things that nova does not necessarily know (like what and how many actual devices a given qemu machine type creates inside the guest). But having some knob on the user side, for example a special metadata key set on the instance, or at least an extra spec property on the flavor, could possibly suffice in this case.

I'm eager to hear your thoughts on this, please chime in with any ideas on how to prevent such self-destructing instances :-)

Best regards,
--
Dr. Pavlo Shchelokovskyy
Principal Software Engineer
Mirantis Inc
www.mirantis.com
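For reference, the IRQ arithmetic in the message above can be roughly sketched as follows. This is only a back-of-the-envelope estimate, not something nova or qemu computes: the helper names are hypothetical, the per-NIC formula assumes one queue pair per vCPU with one MSI-X vector per RX and per TX queue plus two extra vectors (config and control), and the `reserved_routes` value standing in for "other system ones and some more devices" is a made-up placeholder.

```python
# Limit cited in the message (include/linux/kvm_host.h).
KVM_MAX_IRQ_ROUTES = 4096


def vectors_per_nic(vcpus: int) -> int:
    """Estimate interrupt vectors used by one multiqueue virtio-net NIC.

    Assumption: one RX/TX queue pair per vCPU, one vector per queue,
    plus two vectors for config changes and the control queue.
    """
    return 2 * vcpus + 2


def max_nics(vcpus: int, reserved_routes: int) -> int:
    """Estimate how many such NICs fit under KVM_MAX_IRQ_ROUTES.

    reserved_routes is a hypothetical allowance for routes already used
    by other devices in the guest; the real number depends on the
    machine type and the attached devices.
    """
    available = KVM_MAX_IRQ_ROUTES - reserved_routes
    return max(available, 0) // vectors_per_nic(vcpus)


if __name__ == "__main__":
    vcpus = 100
    print("vectors per NIC:", vectors_per_nic(vcpus))
    for reserved in (0, 100):
        print(f"reserved={reserved}: at most {max_nics(vcpus, reserved)} NICs")
```

With 100 vCPUs this estimate gives roughly 200 vectors per NIC, which matches the observation that the budget runs out at around the 20th NIC. A per-instance or per-flavor device cap like the one proposed above would be a plain limit chosen by the operator; it would not need nova to perform this calculation itself.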