Open Stack

Tue Oct 4 20:22:58 UTC 2022

On Tue, 2022-10-04 at 14:56 -0400, John Ratliff wrote:
> On Tue, 2022-10-04 at 18:21 +0200, Dmitriy Rabotyagov wrote:
> > Hi John.
> > 
> > Well, it seems you performed a number of operations that were not
> > required in the first place. However, I believe that in the end you
> > identified the problem correctly. The systemd-machined service should
> > be active and running on nova-compute hosts that use the KVM driver.
> > I'd suggest looking deeper at why systemd-machined can't be started.
> > What does journalctl say about that?
> 
> It's not very chatty, though I think your next question might answer
> the why.
> 
> $ sudo journalctl -u systemd-machined
> -- Logs begin at Tue 2022-10-04 17:45:02 UTC, end at Tue 2022-10-04 18:43:45 UTC. --
> Oct 04 18:43:37 os-comp1 systemd[1]: Dependency failed for Virtual Machine and Container Registration Service.
> Oct 04 18:43:37 os-comp1 systemd[1]: systemd-machined.service: Job systemd-machined.service/start failed with result 'dependency'.
> 
> > 
> > One of systemd-machined's dependencies is /var/lib/machines, and I
> > have two assumptions there:
> > 1. Was systemd-tmpfiles-setup.service activated? We have sometimes
> > seen that, due to a race condition at node boot, it was not, which
> > resulted in all kinds of weirdness.
> 
> It appears to be. The output looks very similar between the broken
> and working clusters.
> 
> $ sudo systemctl status systemd-tmpfiles-setup
> ● systemd-tmpfiles-setup.service - Create Volatile Files and Directories
>      Loaded: loaded (/lib/systemd/system/systemd-tmpfiles-setup.service; static; vendor preset: enabled)
>      Active: active (exited) since Mon 2022-10-03 18:23:53 UTC; 24h ago
>        Docs: man:tmpfiles.d(5)
>              man:systemd-tmpfiles(8)
>    Main PID: 1460 (code=exited, status=0/SUCCESS)
>       Tasks: 0 (limit: 8192)
>      Memory: 0B
>      CGroup: /system.slice/systemd-tmpfiles-setup.service
> 
> Warning: journal has been rotated since unit was started, output may be incomplete.
> 
> However, /var/lib/machines does not appear to be correct. On the
> working cluster, this is mounted as an ext4 filesystem and has a
> lost+found directory along with a directory for a defined instance.
> 
> There is no mount listed on the broken cluster, and the directory is
> empty.
> 
> > 2. Are you running nova-compute on the same hosts where the LXC
> > containers are placed? For example, in an AIO setup we manage the
> > /var/lib/machines/ mount with the systemd unit var-lib-machines.mount.
> > So if you run nova-compute on a controller host or an AIO, this is
> > another thing to check.
> 
> $ sudo journalctl -u var-lib-machines.mount
> -- Logs begin at Tue 2022-10-04 18:01:46 UTC, end at Tue 2022-10-04 18:52:53 UTC. --
> Oct 04 18:43:37 os-comp1 systemd[1]: Mounting Virtual Machine and Container Storage (Compatibility)...
> Oct 04 18:43:37 os-comp1 mount[1272300]: mount: /var/lib/machines: wrong fs type, bad option, bad superblock on /dev/loop0, missing codepage or helper program, or other error.
> Oct 04 18:43:37 os-comp1 systemd[1]: var-lib-machines.mount: Mount process exited, code=exited, status=32/n/a
> Oct 04 18:43:37 os-comp1 systemd[1]: var-lib-machines.mount: Failed with result 'exit-code'.
> Oct 04 18:43:37 os-comp1 systemd[1]: Failed to mount Virtual Machine and Container Storage (Compatibility).
> 
> This appears to be the problem. It looks like /dev/loop0 is probably
> supposed to reference /var/lib/machines.raw. I tried running fsck on
> /dev/loop0, but it doesn't think there is a valid extX filesystem on
> any of the superblocks. Maybe /dev/loop0 is not really pointing to
> /var/lib/machines.raw? Not sure how to tell if that's the case.
> 
> Maybe I should try to loopback this, or create a blank filesystem
> image.
> 
> 
> 
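
On the question of how to tell whether /dev/loop0 really points at
/var/lib/machines.raw: a minimal sketch, assuming a util-linux losetup
that supports the --output option. The function names backing_file and
loop_matches are made up for illustration and are not part of any tool.

```shell
#!/bin/sh
# Print the file backing a loop device, e.g. /dev/loop0.
# Surrounding whitespace in the losetup column output is trimmed.
backing_file() {
    losetup --list --noheadings --output BACK-FILE "$1" \
        | sed 's/^[[:space:]]*//; s/[[:space:]]*$//'
}

# Succeed (exit status 0) only if loop device $1 is backed by file $2.
loop_matches() {
    [ "$(backing_file "$1")" = "$2" ]
}
```

After sourcing this as root, something like
"loop_matches /dev/loop0 /var/lib/machines.raw && echo same" would
confirm the mapping. Running "losetup -j /var/lib/machines.raw" goes the
other way and lists any loop devices attached to that file.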

Okay, I'm not sure what happened here.

The systemd mount unit file for var-lib-machines on the broken cluster
differs from the one on the working cluster. It references a btrfs
filesystem, but the /var/lib/machines.raw file is an ext4 filesystem,
like the one on the working cluster.

I copied the unit file from the working cluster to the broken cluster,
and I could mount /var/lib/machines, get systemd-machined working, and
create machines now.
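
For anyone hitting the same mismatch, here is a rough sketch of how the
filesystem type declared in the mount unit could be compared with what
is actually in the image. It assumes the unit has a plain "Type=" line
and that blkid can probe the image file directly. The function names
unit_fstype and check_mount_unit are hypothetical helpers, not existing
commands.

```shell
#!/bin/sh
# Print the Type= value declared in a systemd mount unit file.
unit_fstype() {
    sed -n 's/^Type=//p' "$1" | head -n 1
}

# Compare the declared type with the type blkid detects in the image.
# Prints both values; exit status 0 means they agree.
check_mount_unit() {
    unit_file=$1
    image=$2
    declared=$(unit_fstype "$unit_file")
    actual=$(blkid -o value -s TYPE "$image")
    echo "declared=$declared actual=$actual"
    [ "$declared" = "$actual" ]
}
```

The path of the unit file can be looked up with
"systemctl show -p FragmentPath var-lib-machines.mount". Passing that
path and /var/lib/machines.raw to check_mount_unit as root should print
both types, and a non-zero exit status would point to the kind of
btrfs/ext4 mismatch described above.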

I have no idea what happened. I feel like there must have been a system
update that changed (reverted from openstack-ansible?) something, but
I'm just not sure.

In any event, you helped me figure it out. Thanks.

-- 
John Ratliff
Systems Automation Engineer 
GlobalNOC @ Indiana University

OpenStack Ansible Service troubleshooting