Re: OpenStack Ansible Service troubleshooting

4 Oct 2022

On Tue, 2022-10-04 at 14:56 -0400, John Ratliff wrote:
...
On Tue, 2022-10-04 at 18:21 +0200, Dmitriy Rabotyagov wrote:
...
Hi John.
Well, it seems you performed a number of operations that were not
required in the first place. However, I believe that in the end you
identified the problem correctly: the systemd-machined service should
be active and running on nova-compute hosts that use the kvm driver.
I'd suggest looking deeper at why systemd-machined can't be started.
What does journalctl say about that?
It's not very chatty, though I think your next question might answer
the why.
$ sudo journalctl -u systemd-machined
-- Logs begin at Tue 2022-10-04 17:45:02 UTC, end at Tue 2022-10-04 18:43:45 UTC. --
Oct 04 18:43:37 os-comp1 systemd[1]: Dependency failed for Virtual Machine and Container Registration Service.
Oct 04 18:43:37 os-comp1 systemd[1]: systemd-machined.service: Job systemd-machined.service/start failed with result 'dependency'.
...
As one of its dependencies, systemd-machined requires
/var/lib/machines to be present. I have 2 assumptions there:
1. Was systemd-tmpfiles-setup.service activated? We have sometimes
seen that, due to a race condition at node boot, it was not, which
resulted in all kinds of weirdness.
It appears to be. The output looks very similar between the broken and
working clusters.
$ sudo systemctl status systemd-tmpfiles-setup
● systemd-tmpfiles-setup.service - Create Volatile Files and Directories
     Loaded: loaded (/lib/systemd/system/systemd-tmpfiles-setup.service; static; vendor preset: enabled)
     Active: active (exited) since Mon 2022-10-03 18:23:53 UTC; 24h ago
       Docs: man:tmpfiles.d(5)
             man:systemd-tmpfiles(8)
   Main PID: 1460 (code=exited, status=0/SUCCESS)
      Tasks: 0 (limit: 8192)
     Memory: 0B
     CGroup: /system.slice/systemd-tmpfiles-setup.service
Warning: journal has been rotated since unit was started, output may be incomplete.
However, /var/lib/machines does not appear to be correct. On the
working cluster, this is mounted as an ext4 filesystem and has a
lost+found directory along with a directory for a defined instance.
There is no mount listed on the broken cluster, and the directory is
empty.
...
2. Do you happen to run nova-compute on the same set of hosts where
LXC containers are placed? For example, in an AIO setup we manage the
/var/lib/machines/ mount with the systemd unit var-lib-machines.mount.
So if you run nova-compute on a controller host or an AIO, this is
another thing to check.
$ sudo journalctl -u var-lib-machines.mount
-- Logs begin at Tue 2022-10-04 18:01:46 UTC, end at Tue 2022-10-04 18:52:53 UTC. --
Oct 04 18:43:37 os-comp1 systemd[1]: Mounting Virtual Machine and Container Storage (Compatibility)...
Oct 04 18:43:37 os-comp1 mount[1272300]: mount: /var/lib/machines: wrong fs type, bad option, bad superblock on /dev/loop0, missing codepage or helper program, or other error.
Oct 04 18:43:37 os-comp1 systemd[1]: var-lib-machines.mount: Mount process exited, code=exited, status=32/n/a
Oct 04 18:43:37 os-comp1 systemd[1]: var-lib-machines.mount: Failed with result 'exit-code'.
Oct 04 18:43:37 os-comp1 systemd[1]: Failed to mount Virtual Machine and Container Storage (Compatibility).
This appears to be the problem. It looks like /dev/loop0 is probably
supposed to reference /var/lib/machines.raw. I tried running fsck on
/dev/loop0, but it doesn't think there is a valid extX filesystem on
any of the superblocks. Maybe /dev/loop0 is not really pointing to
/var/lib/machines.raw? Not sure how to tell if that's the case.
Maybe I should try to loopback this, or create a blank filesystem
image.
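
On the question of how to tell what /dev/loop0 really points at, here
is a minimal sketch, assuming a Linux host with util-linux installed.
The kernel records the backing file of each attached loop device in
sysfs, and losetup and blkid report related information from
userspace. The function name loop_backing_file and the SYSFS_ROOT
override are made up for this illustration and are not part of any
tool.

```shell
#!/bin/sh
# Print the file backing a loop device, as recorded by the kernel in sysfs.
# SYSFS_ROOT is a hypothetical override, useful only for trying this off-host.
loop_backing_file() {
    dev=$(basename "$1")                      # accept "loop0" or "/dev/loop0"
    path="${SYSFS_ROOT:-/sys}/block/$dev/loop/backing_file"
    if [ -r "$path" ]; then
        cat "$path"
    else
        echo "no backing file recorded for $dev" >&2
        return 1
    fi
}

# Example (run on the compute host):
#   loop_backing_file /dev/loop0
#   sudo losetup --list                                  # all loop devices and their files
#   sudo losetup -j /var/lib/machines.raw                # loop devices attached to this image
#   sudo blkid -o value -s TYPE /var/lib/machines.raw    # filesystem type inside the image
```

If the backing file is /var/lib/machines.raw and blkid reports a
filesystem type for it, the image itself is probably intact and the
mount options are the next thing to look at.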
Okay, I'm not sure what happened here.

The systemd mount unit file for var-lib-machines on the broken cluster
differs from the one on the working cluster. It refers to a btrfs
filesystem, but the /var/lib/machines.raw file is an ext4 filesystem,
like the one on the working cluster.

I copied the unit file from the working cluster to the broken cluster,
and I could mount /var/lib/machines, get systemd-machined working, and
create machines now.
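
For anyone who hits the same thing, a mismatch like this can be spotted
by comparing the filesystem type declared in the mount unit with what
is actually inside the image. A rough sketch follows. The helper name
unit_fs_type is hypothetical, and it assumes the unit declares Type= in
its [Mount] section.

```shell
#!/bin/sh
# Print the Type= value from the [Mount] section of a systemd mount unit file.
unit_fs_type() {
    awk -F= '
        /^\[/ { in_mount = ($0 == "[Mount]") }   # track the current section
        in_mount && $1 == "Type" { print $2; exit }
    ' "$1"
}

# Example (run on the compute host; the temporary path is arbitrary):
#   systemctl cat var-lib-machines.mount > /tmp/unit.txt   # unit file plus any drop-ins
#   unit_fs_type /tmp/unit.txt                             # what the unit expects
#   sudo blkid -o value -s TYPE /var/lib/machines.raw      # what the image contains
```

If the two values disagree (for example btrfs in the unit and ext4 in
the image), the mount would be expected to fail with a "wrong fs type"
error like the one in the journal output above.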

I have no idea what happened. I feel like there must have been a system
update that changed (reverted from openstack-ansible?) something, but
I'm just not sure.

In any event, you helped me figure it out. Thanks.

-- 
John Ratliff
Systems Automation Engineer 
GlobalNOC @ Indiana University