OpenStack Ansible Service troubleshooting
We've started deploying new Xena clusters with openstack-ansible. We keep running into problems with some parts of OpenStack not working. A service will fail or need to be restarted, but it's not clear which one or why.
Recently, one of our test clusters (2 hosts) stopped working. I could log in to horizon, but I could not create instances.
At first it told me that a message wasn't answered quickly enough. I assumed the problem was rabbitmq and restarted the container, but this didn't help. I eventually restarted every container and the nova-compute and haproxy services on the host. But this didn't help either. I eventually rebooted both hosts, but this made things worse (I think I broke the galera cluster doing this).
After bootstrapping the galera cluster, I can log back into horizon, but I still cannot create instances. It tells me
"Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance [UUID]"
If I look at the journal for nova-compute, I see this error:
"libvirt.libvirtError: Failed to activate service 'org.freedesktop.machine1': timed out "
Looking at systemd-machined, it won't start due to "systemd-machined.service: Job systemd-machined.service/start failed with result 'dependency'."
I'm not sure what "dependency" it's referring to. On the cluster that does work, this service is running; on both hosts of the cluster that does not, it is not.
What should I be looking at here to fix?
--
John Ratliff
Systems Automation Engineer
GlobalNOC @ Indiana University
On Tue, 2022-10-04 at 18:21 +0200, Dmitriy Rabotyagov wrote:
Hi John.
Well, it seems you've performed a number of operations that weren't required in the first place. However, I believe that in the end you've identified the problem correctly: the systemd-machined service should be active and running on nova-compute hosts using the kvm driver. I'd suggest looking deeper into why systemd-machined can't be started. What does journalctl say about that?
It's not very chatty, though I think your next question might answer the why.

$ sudo journalctl -u systemd-machined
-- Logs begin at Tue 2022-10-04 17:45:02 UTC, end at Tue 2022-10-04 18:43:45 UTC. --
Oct 04 18:43:37 os-comp1 systemd[1]: Dependency failed for Virtual Machine and Container Registration Service.
Oct 04 18:43:37 os-comp1 systemd[1]: systemd-machined.service: Job systemd-machined.service/start failed with result 'dependency'.
As one of its dependencies, systemd-machined requires /var/lib/machines to be present. I have two assumptions here:
1. Was systemd-tmpfiles-setup.service activated? We have sometimes seen that, due to a race condition, it was not activated on node boot, which resulted in all kinds of weirdness.
It appears to be. The output looks very similar between the broken and working clusters.

$ sudo systemctl status systemd-tmpfiles-setup
● systemd-tmpfiles-setup.service - Create Volatile Files and Directories
     Loaded: loaded (/lib/systemd/system/systemd-tmpfiles-setup.service; static; vendor preset: enabled)
     Active: active (exited) since Mon 2022-10-03 18:23:53 UTC; 24h ago
       Docs: man:tmpfiles.d(5)
             man:systemd-tmpfiles(8)
   Main PID: 1460 (code=exited, status=0/SUCCESS)
      Tasks: 0 (limit: 8192)
     Memory: 0B
     CGroup: /system.slice/systemd-tmpfiles-setup.service

Warning: journal has been rotated since unit was started, output may be incomplete.

However, /var/lib/machines does not appear to be correct. On the working cluster, it is mounted as an ext4 filesystem and has a lost+found directory along with a directory for a defined instance. There is no such mount listed on the broken cluster, and the directory is empty.
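The working-vs-broken mount comparison above can be scripted for a quick side-by-side check. A minimal sketch using `findmnt` (standard util-linux tooling; the commands are generic, not taken from the thread):

```shell
# Sketch: report whether /var/lib/machines is an active mount point
# and, if so, its filesystem type -- run on each host to compare.
if findmnt -n /var/lib/machines >/dev/null 2>&1; then
    echo "mounted: $(findmnt -n -o FSTYPE /var/lib/machines)"
else
    echo "not mounted"
fi
```

On the working cluster this should report `mounted: ext4`; on the broken hosts, `not mounted`.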
2. Do you happen to run nova-compute on the same set of hosts where the LXC containers are placed? In an AIO setup, for example, we manage the /var/lib/machines mount with the systemd unit var-lib-machines.mount. So if you run nova-compute on a controller host or an AIO, this is another thing to check.
$ sudo journalctl -u var-lib-machines.mount
-- Logs begin at Tue 2022-10-04 18:01:46 UTC, end at Tue 2022-10-04 18:52:53 UTC. --
Oct 04 18:43:37 os-comp1 systemd[1]: Mounting Virtual Machine and Container Storage (Compatibility)...
Oct 04 18:43:37 os-comp1 mount[1272300]: mount: /var/lib/machines: wrong fs type, bad option, bad superblock on /dev/loop0, missing codepage or helper program, or other error.
Oct 04 18:43:37 os-comp1 systemd[1]: var-lib-machines.mount: Mount process exited, code=exited, status=32/n/a
Oct 04 18:43:37 os-comp1 systemd[1]: var-lib-machines.mount: Failed with result 'exit-code'.
Oct 04 18:43:37 os-comp1 systemd[1]: Failed to mount Virtual Machine and Container Storage (Compatibility).

This appears to be the problem. It looks like /dev/loop0 is probably supposed to reference /var/lib/machines.raw. I tried running fsck on /dev/loop0, but it doesn't think there is a valid extX filesystem on any of the superblocks. Maybe /dev/loop0 is not really pointing to /var/lib/machines.raw? I'm not sure how to tell if that's the case.

Maybe I should try to loop-mount this myself, or create a blank filesystem image.

--
John Ratliff
Systems Automation Engineer
GlobalNOC @ Indiana University
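One way to answer "is /dev/loop0 really backed by machines.raw, and does the image carry an ext filesystem at all?" without mounting anything is to look for the ext superblock magic directly. A sketch: the demo file below stands in for /var/lib/machines.raw so the commands are safe to run anywhere; the real-host commands are in the trailing comments.

```shell
# The ext2/3/4 superblock magic is the two bytes 0x53 0xEF at byte
# offset 1080 of the image. An all-zero demo image has no magic.
truncate -s 4K demo.raw
magic=$(dd if=demo.raw bs=1 skip=1080 count=2 2>/dev/null | od -An -tx1 | tr -d ' ')
if [ "$magic" = "53ef" ]; then
    echo "ext filesystem signature found"
else
    echo "no ext signature"      # the all-zero demo lands here
fi
rm -f demo.raw
# On the affected host (standard util-linux commands, not from the thread):
#   losetup -j /var/lib/machines.raw    # which loop device, if any, backs the file
#   losetup -a                          # what /dev/loop0 actually points at
#   dd if=/var/lib/machines.raw bs=1 skip=1080 count=2 | od -An -tx1
```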
On Tue, 2022-10-04 at 14:56 -0400, John Ratliff wrote:
Okay, I'm not sure what happened here. The systemd mount unit file for var-lib-machines is different on the broken cluster than on the working cluster. It refers to a btrfs filesystem, but the /var/lib/machines.raw file is an ext4 filesystem, like the one on the working cluster.

I copied the unit file from the working cluster to the broken cluster, and now I can mount /var/lib/machines, get systemd-machined working, and create machines.

I have no idea what happened. I feel like there must have been a system update that changed (reverted from openstack-ansible?) something, but I'm just not sure.

In any event, you helped me figure it out. Thanks.

--
John Ratliff
Systems Automation Engineer
GlobalNOC @ Indiana University
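For reference, the copied unit plausibly looks something like the sketch below. This is an illustration assembled from the facts in the thread (an ext4 image at /var/lib/machines.raw, loop-mounted at /var/lib/machines), not a copy of either cluster's actual file; the stock systemd unit it replaces declares Type=btrfs, which matches the mismatch described above.

```ini
# /etc/systemd/system/var-lib-machines.mount (illustrative sketch)
[Unit]
Description=Virtual Machine and Container Storage (Compatibility)

[Mount]
What=/var/lib/machines.raw
Where=/var/lib/machines
Type=ext4
Options=loop

[Install]
WantedBy=machines.target
```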
Oh, well, I do recall now that a package update could break the systemd mount: in prior releases we placed our own systemd unit file, and now we just leverage systemd's override functionality [1]. I think what you can do is find out which package provides this mount file and mark it for hold, or cherry-pick and apply the mentioned change.

[1] https://review.opendev.org/c/openstack/openstack-ansible-lxc_hosts/+/834183
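The "find the owning package and hold it" step might look like the sketch below. The dpkg output line is inlined for illustration, and the package name systemd-container is an assumption; check your own `dpkg -S` output before holding anything.

```shell
# Example dpkg -S output line (assumed, for illustration only):
line='systemd-container: /lib/systemd/system/var-lib-machines.mount'
pkg=${line%%:*}      # strip ': /path' to get the package name
echo "$pkg"
# On the host:
#   dpkg -S /lib/systemd/system/var-lib-machines.mount
#   sudo apt-mark hold "$pkg"   # stop updates from replacing the unit
```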
participants (2)
- Dmitriy Rabotyagov
- John Ratliff