Forcing restart of a worker node with running guest
So one of my worker nodes is, to put it mildly, rather unhappy:

Message from syslogd@openstack-w1 at Jul 25 15:57:49 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [vnc_worker:9141]

Message from syslogd@openstack-w1 at Jul 25 15:57:57 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [migration/10:60]

Message from syslogd@openstack-w1 at Jul 25 15:58:05 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#8 stuck for 22s! [migration/8:49]

I found out when it was taking 30 min to delete a guest. So, what I can do in a forceful way?

1. How to kill the guest? Can I kill it through virsh or openstack compute service will get sad?
2. What would happen if I stop the compute service?
3. What would happen if I get really annoyed and tell worker node to reboot?
On 7/25/2019 11:04 AM, Mauricio Tavares wrote:
I found out when it was taking 30 min to delete a guest. So, what I can do in a forceful way?
1. How to kill the guest? Can I kill it through virsh or openstack compute service will get sad?
I would try to avoid this if possible, but you might need to kill the guest in the hypervisor if doing it through nova won't get the job done. What happens in nova-compute is undefined, but you'd probably see some errors as expected if you're doing anything with that server at the hypervisor layer, like trying to get the guest power state.

What nova is tracking and what is in the hypervisor are different things, and if you delete the guest out of band from nova, you'll need to delete the server to sync the nova database. If the delete is stuck in the compute API, thinking it's already deleting (I think we have an old bug for that and force delete, and I hit something similar today), you could try resetting the server status to ERROR [1] and then try deleting it in the API again.
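The two paths described above can be sketched as a pair of shell functions. This is a minimal sketch, not a tested procedure: `OPENSTACK` and `VIRSH` are hypothetical override variables (neither CLI defines them) added so the commands can be dry-run with `echo`, and the function names and the libvirt domain name in the usage note are made up for illustration. The libvirt domain name usually differs from the nova server name; `virsh list --all` on the worker node lists the domains.

```shell
#!/bin/sh
# Sketch only. OPENSTACK and VIRSH are hypothetical override variables,
# not part of either CLI; set them to "echo" to print instead of run.
OPENSTACK="${OPENSTACK:-openstack}"
VIRSH="${VIRSH:-virsh}"

# Nova-side path: reset a server whose delete is stuck to ERROR [1],
# then ask the compute API to delete it again.
reset_and_delete() {
    server="$1"
    if [ -z "$server" ]; then
        echo "usage: reset_and_delete <server>" >&2
        return 2
    fi
    "$OPENSTACK" server set --state error "$server" || return 1
    "$OPENSTACK" server delete "$server"
}

# Last resort: hard-stop the guest in the hypervisor (run on the worker
# node), then delete the server in nova so its database is synced.
kill_guest_then_sync() {
    domain="$1"   # libvirt domain name, as shown by "virsh list --all"
    server="$2"   # nova server name or ID
    if [ -z "$domain" ] || [ -z "$server" ]; then
        echo "usage: kill_guest_then_sync <libvirt-domain> <server>" >&2
        return 2
    fi
    "$VIRSH" destroy "$domain" || return 1
    "$OPENSTACK" server delete "$server"
}
```

For a dry run, set `OPENSTACK=echo VIRSH=echo` and call, for example, `kill_guest_then_sync instance-00000001 desktop1` (the domain name here is hypothetical); each command is then printed rather than executed.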
2. What would happen if I stop the compute service?
This won't really do anything to the guest in the hypervisor unless [2] tries to change the guest state on restart. In my experience that option has not been very reliable / predictable.
3. What would happen if I get really annoyed and tell worker node to reboot?
Pretty much the same as #2 from a nova perspective I think. Depending on how libvirt and/or the guest domain is configured, the libvirt-guests service might try to resume the guest.

[1] openstack server set --state error <server>
[2] https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.res...

--
Thanks,
Matt
On Thu, Jul 25, 2019 at 3:44 PM Matt Riedemann <mriedemos@gmail.com> wrote:
Pretty much the same as #2 from a nova perspective I think. Depending on how libvirt and/or the guest domain is configured, the libvirt-guests service might try to resume the guest.
Does that mean it is using the standard libvirt config files?
[1] openstack server set --state error <server>
[2] https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.res...
Thanks for the info. It turned out the issue is hardware related, so shutting the worker node down is way past the realm of possibility into the realm of it will happen today.
On Thu, Jul 25, 2019 at 4:24 PM Mauricio Tavares <raubvogel@gmail.com> wrote:
Update: after I dealt with the hardware, I now was able to tell the instance to go to silicon heaven:

[raub@openstack-hn ~(keystone_admin)]$ openstack server list
+--------------------------------------+----------+--------+--------------------------------------------+--------+-------------+
| ID                                   | Name     | Status | Networks                                   | Image  | Flavor      |
+--------------------------------------+----------+--------+--------------------------------------------+--------+-------------+
| 1f76ca35-9d7f-4403-ae72-bcbfa1cc9b99 | desktop1 | ERROR  | physnet1=10.20.20.66, 192.168.20.66        | centos | netro.small |
+--------------------------------------+----------+--------+--------------------------------------------+--------+-------------+
[raub@openstack-hn ~(keystone_admin)]$ openstack server delete desktop1
[raub@openstack-hn ~(keystone_admin)]$ openstack server list

[raub@openstack-hn ~(keystone_admin)]$

Thank you for providing all the different options to account for the possible increasing degrees of things going bad! I will save this message for next time...
participants (2)
- Matt Riedemann
- Mauricio Tavares