Open Stack

Tue Oct 4 20:45:01 UTC 2022

Oh, well, I do recall now that a package update could break the systemd
mount: in prior releases we put our own systemd unit file in place, whereas
now we just leverage the systemd overrides functionality [1].
I think what you can do is find out which package provides this mount
file and mark it for hold. Or cherry-pick and apply the mentioned change.
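
A minimal sketch of the first option, assuming a Debian/Ubuntu host (dpkg
and apt-mark available) and assuming the packaged unit lives at
/lib/systemd/system/var-lib-machines.mount. The function names are made up
for illustration and the path may differ on your system:

```shell
# Print the name of the package that owns a file (Debian/Ubuntu only).
owning_package() {
    # dpkg -S prints "package[:arch]: /path"; keep only the package name.
    dpkg -S "$1" | head -n 1 | cut -d: -f1
}

# Hold the package that owns the given unit file so upgrades skip it.
hold_unit_package() {
    local pkg
    pkg="$(owning_package "$1")"
    # Stop if dpkg did not report an owner for this path.
    [ -n "$pkg" ] || return 1
    sudo apt-mark hold "$pkg"
}
```

Usage would be something like
hold_unit_package /lib/systemd/system/var-lib-machines.mount
and apt-mark unhold reverses it later. If the path is different,
systemctl show -p FragmentPath var-lib-machines.mount should print the file
systemd actually loaded.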

[1]
https://review.opendev.org/c/openstack/openstack-ansible-lxc_hosts/+/834183

Tue, 4 Oct 2022, 22:23 John Ratliff <jdratlif at globalnoc.iu.edu>:

> On Tue, 2022-10-04 at 14:56 -0400, John Ratliff wrote:
> > On Tue, 2022-10-04 at 18:21 +0200, Dmitriy Rabotyagov wrote:
> > > Hi John.
> > >
> > > Well, it seems you've made a bunch of operations that were not
> > > required in the first place. However, I believe that in the end
> > > you've identified the problem correctly. The systemd-machined
> > > service should be active and running on nova-compute hosts with
> > > the kvm driver.
> > > I'd suggest looking deeper at why the systemd-machined service
> > > can't be started. What does journalctl say about that?
> >
> > It's not very chatty, though I think your next question might answer
> > the why.
> >
> > $ sudo journalctl -u systemd-machined
> > -- Logs begin at Tue 2022-10-04 17:45:02 UTC, end at Tue 2022-10-04
> > 18:43:45 UTC. --
> > Oct 04 18:43:37 os-comp1 systemd[1]: Dependency failed for Virtual
> > Machine and Container Registration Service.
> > Oct 04 18:43:37 os-comp1 systemd[1]: systemd-machined.service: Job
> > systemd-machined.service/start failed with result 'dependency'.
> >
> > >
> > > As one of its dependencies, systemd-machined requires
> > > /var/lib/machines. And I have 2 assumptions there:
> > > 1. Was systemd-tmpfiles-setup.service activated? We have sometimes
> > > seen that, due to some race condition upon node boot, it was not,
> > > which resulted in all kinds of weirdness.
> >
> > It appears to be. The output looks very similar between the broken
> > and working clusters.
> >
> > $ sudo systemctl status systemd-tmpfiles-setup
> > ● systemd-tmpfiles-setup.service - Create Volatile Files and Directories
> >      Loaded: loaded (/lib/systemd/system/systemd-tmpfiles-setup.service; static; vendor preset: enabled)
> >      Active: active (exited) since Mon 2022-10-03 18:23:53 UTC; 24h ago
> >        Docs: man:tmpfiles.d(5)
> >              man:systemd-tmpfiles(8)
> >    Main PID: 1460 (code=exited, status=0/SUCCESS)
> >       Tasks: 0 (limit: 8192)
> >      Memory: 0B
> >      CGroup: /system.slice/systemd-tmpfiles-setup.service
> >
> > Warning: journal has been rotated since unit was started, output may
> > be
> > incomplete.
> >
> > However, /var/lib/machines does not appear to be correct. On the
> > working cluster, this is mounted as an ext4 filesystem and has a
> > lost+found directory along with a directory for a defined instance.
> >
> > There is no mount listed on the broken cluster, and the directory is
> > empty.
> >
> > > 2. Do you happen to run nova-compute on the same set of hosts where
> > > LXC containers are placed? For example, in an AIO setup we manage
> > > the /var/lib/machines/ mount with the systemd unit
> > > var-lib-machines.mount. So if you happen to run nova-compute on a
> > > controller host or AIO, this is another thing to check.
> >
> > $ sudo journalctl -u var-lib-machines.mount
> > -- Logs begin at Tue 2022-10-04 18:01:46 UTC, end at Tue 2022-10-04
> > 18:52:53 UTC. --
> > Oct 04 18:43:37 os-comp1 systemd[1]: Mounting Virtual Machine and
> > Container Storage (Compatibility)...
> > Oct 04 18:43:37 os-comp1 mount[1272300]: mount: /var/lib/machines:
> > wrong fs type, bad option, bad superblock on /dev/loop0, missing
> > codepage or helper program, or other error.
> > Oct 04 18:43:37 os-comp1 systemd[1]: var-lib-machines.mount: Mount
> > process exited, code=exited, status=32/n/a
> > Oct 04 18:43:37 os-comp1 systemd[1]: var-lib-machines.mount: Failed
> > with result 'exit-code'.
> > Oct 04 18:43:37 os-comp1 systemd[1]: Failed to mount Virtual Machine
> > and Container Storage (Compatibility).
> >
> > This appears to be the problem. It looks like /dev/loop0 is probably
> > supposed to reference /var/lib/machines.raw. I tried running fsck on
> > /dev/loop0, but it doesn't think there is a valid extX filesystem on
> > any of the superblocks. Maybe /dev/loop0 is not really pointing to
> > /var/lib/machines.raw? Not sure how to tell if that's the case.
> >
> > Maybe I should try to loopback this, or create a blank filesystem
> > image.
> >
> >
> >
>
> Okay, I'm not sure what happened here.
>
> The systemd mount unit file for var-lib-machines on the broken cluster
> differs from the one on the working cluster. It refers to a btrfs
> filesystem, but the /var/lib/machines.raw file is an ext4 filesystem,
> like the one on the working cluster.
>
> I copied the unit file from the working cluster to the broken cluster,
> and I could mount /var/lib/machines, get systemd-machined working, and
> create machines now.
>
> I have no idea what happened. I feel like there must have been a system
> update that changed (reverted from openstack-ansible?) something, but
> I'm just not sure.
>
> In any event, you helped me figure it out. Thanks.
>
> --
> John Ratliff
> Systems Automation Engineer
> GlobalNOC @ Indiana University
>
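
On telling whether the image and the unit file agree: a rough sketch,
assuming util-linux (losetup, blkid) and a systemd recent enough to support
systemctl show --value. The function names are hypothetical. It compares the
filesystem type found in the image with the Type= systemd has loaded for the
mount unit, which is the kind of mismatch (btrfs vs ext4) described above:

```shell
# List loop devices currently backed by the given image file.
loop_devices_for() {
    sudo losetup -j "$1"
}

# Print the filesystem type detected inside an image file or block device.
image_fstype() {
    # -o value -s TYPE makes blkid print only the detected type.
    sudo blkid -o value -s TYPE "$1"
}

# Print the filesystem type systemd has loaded for a mount unit.
unit_fstype() {
    systemctl show -p Type --value "$1"
}

# Compare the two and report whether they match.
check_machines_mount() {
    local image="$1" unit="$2" have want
    have="$(image_fstype "$image")"
    want="$(unit_fstype "$unit")"
    if [ "$have" = "$want" ]; then
        echo "ok: $unit mounts $image as $have"
    else
        echo "mismatch: $image is $have but $unit expects $want"
        return 1
    fi
}
```

For instance, check_machines_mount /var/lib/machines.raw var-lib-machines.mount
on the affected host, and loop_devices_for /var/lib/machines.raw to see
whether /dev/loop0 is really the device backed by that file.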

OpenStack Ansible Service troubleshooting
