[ironic] [infra] Making Glean work with IPA for static IP assignment
Hi Ian,

Given our timezone difference, I've decided to send you an email, adding openstack-discuss in CC for greater exposure.

We're trying to make IPA (our agent ramdisk) work without relying on DHCP for cases like Edge deployments. We've settled on the network_data.json format for the API side and wanted to use Glean on the ramdisk to apply it. You can read more details in [1].

The problem is, I cannot make Glean work with any ramdisk I build. The crux of the problem seems to be that NetworkManager (used by default in RHEL, CentOS, Fedora and Debian at least) starts very early, creates the default connection and ignores whatever files Glean happens to write afterwards. On Debian, running `systemctl restart networking` actually helped to pick up the new configuration, but I'm not sure we want to do that in Glean. I haven't been able to make NetworkManager pick up the changes on RH systems so far.

I build ramdisks using IPA-builder [2] by adding the simple-init element. I've tried removing dhcp-all-interfaces (which we depend on by default), to no effect. I've tried disabling the DHCP server and ended up with no IP connectivity at all. I haven't tried to shut down and restart a connection as recommended in [3], since it's not trivial to do via SSH.

Do you maybe have any hints on how to proceed? I'd be curious to know how static IP assignment works in the infra setup. Do you have images with NetworkManager there? Do you use the simple-init element?

Any help is very appreciated.

Dmitry

[1] https://specs.openstack.org/openstack/ironic-specs/specs/not-implemented/L3-...
[2] https://opendev.org/openstack/ironic-python-agent-builder
[3] https://mail.gnome.org/archives/networkmanager-list/2014-January/msg00032.ht...

--
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn, Commercial register: Amtsgericht Muenchen, HRB 153243, Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill
On Tue, Nov 24, 2020 at 11:54:55AM +0100, Dmitry Tantsur wrote:
The problem is, I cannot make Glean work with any ramdisk I build. The crux of the problem seems to be that NetworkManager (used by default in RHEL, CentOS, Fedora and Debian at least) starts very early, creates the default connection and ignores whatever files Glean happens to write afterwards. On Debian running `systemctl restart networking` actually helped to pick the new configuration, but I'm not sure we want to do that in Glean. I haven't been able to make NetworkManager pick up the changes on RH systems so far.
So we do use NetworkManager in the OpenDev images, and we do not see NetworkManager starting before glean. The way it should work is that simple-init in dib installs glean to the image. That runs the glean install script (use --use-nm argument if DIB_SIMPLE_INIT_NETWORKMANAGER, which is default on centos/fedora) which installs two things; udev rules and a systemd handler. The udev is pretty simple [1] and should add a "Wants" for each net device; e.g. eth1 would match and create a Wants glean@eth1.service, which then matches [2] which should write out the ifcfg config file. After this, NetworkManager should start, notice the config file for the interface and bring it up.
Do you maybe have any hints how to proceed? I'd be curious to know how static IP assignment works in the infra setup. Do you have images with NetworkManager there? Do you use the simple-init element?
As noted, yes, we use this. Really only in two contexts; it's Rackspace that doesn't have DHCP, so we have to set up the interface statically from the configdrive data. Other clouds all provide DHCP, which is used when there's no configdrive data.

Here is a systemd-analyze from one of our CentOS nodes if it helps:

graphical.target @18.403s
└─multi-user.target @18.403s
  └─unbound.service @5.467s +12.934s
    └─network.target @5.454s
      └─NetworkManager.service @5.339s +112ms
        └─network-pre.target @5.334s
          └─glean@ens3.service @4.227s +1.102s
            └─basic.target @4.167s
              └─sockets.target @4.166s
                └─iscsiuio.socket @4.165s
                  └─sysinit.target @4.153s
                    └─systemd-udev-settle.service @1.905s +2.245s
                      └─systemd-udev-trigger.service @1.242s +659ms
                        └─systemd-udevd-control.socket @1.239s
                          └─system.slice

At a guess, I feel like the udev bits are probably not happening correctly in your case? That's important to get the glean@<interface> service in the chain to pre-create the config file.

-i

[1] https://opendev.org/opendev/glean/src/branch/master/glean/init/glean-udev.ru...
[2] https://opendev.org/opendev/glean/src/branch/master/glean/init/glean-nm@.ser...
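To compare chains like the one above without reading them by eye, the output can be parsed into (unit, active-at, start-duration) entries. This is only a minimal sketch: the helper names (`parse_span`, `parse_chain`, `active_before`) are made up for illustration, the sample is a shortened copy of the chain quoted in this message, and it assumes the usual critical-chain layout where `@` is the time a unit became active and `+` is how long it took to start.

```python
import re

# Shortened copy of the critical-chain output quoted in the thread.
CHAIN = """\
graphical.target @18.403s
└─multi-user.target @18.403s
  └─unbound.service @5.467s +12.934s
    └─network.target @5.454s
      └─NetworkManager.service @5.339s +112ms
        └─network-pre.target @5.334s
          └─glean@ens3.service @4.227s +1.102s
            └─basic.target @4.167s
"""

SECONDS_PER_UNIT = {"min": 60.0, "s": 1.0, "ms": 0.001}


def parse_span(text):
    """Convert a time span such as '1min 32.273s' or '112ms' to seconds."""
    parts = re.findall(r"([\d.]+)(min|ms|s)", text)
    return sum(float(value) * SECONDS_PER_UNIT[unit] for value, unit in parts)


def parse_chain(text):
    """Return a list of (unit, active_at, took) tuples; missing fields are None."""
    entries = []
    for line in text.splitlines():
        body = line.lstrip(" └─")  # drop the tree-drawing prefix
        if not body:
            continue
        unit, _, rest = body.partition(" ")
        active_at = took = None
        for token in re.findall(r"[@+][^@+]+", rest):
            if token[0] == "@":
                active_at = parse_span(token[1:])
            else:
                took = parse_span(token[1:])
        entries.append((unit, active_at, took))
    return entries


def active_before(entries, first, second):
    """True if `first` became active earlier than `second`."""
    times = {unit: at for unit, at, _ in entries if at is not None}
    return times[first] < times[second]


if __name__ == "__main__":
    chain = parse_chain(CHAIN)
    print(active_before(chain, "glean@ens3.service", "NetworkManager.service"))
```

On the healthy node quoted above, the glean unit is active before NetworkManager.service, which is the ordering the thread is trying to reproduce in the IPA ramdisk.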
Hi,

Thank you for your input!

On Wed, Nov 25, 2020 at 3:09 AM Ian Wienand <iwienand@redhat.com> wrote:
So we do use NetworkManager in the OpenDev images, and we do not see NetworkManager starting before glean.
Okay, thanks for confirming. Maybe it's related to how IPA is built? It's not exactly a normal image after all, although it's pretty close to one.
The way it should work is that simple-init in dib installs glean to the image. That runs the glean install script (use --use-nm argument if DIB_SIMPLE_INIT_NETWORKMANAGER, which is default on centos/fedora) which installs two things; udev rules and a systemd handler.
I have checked that these are installed, but I don't know how to verify a udev rule.
The udev is pretty simple [1] and should add a "Wants" for each net device; e.g. eth1 would match and create a Wants glean@eth1.service, which then matches [2] which should write out the ifcfg config file. After this, NetworkManager should start, notice the config file for the interface and bring it up.
Yeah, I definitely see logging from NetworkManager DHCP before this service is run (i.e. before the output from Glean).
Do you maybe have any hints how to proceed? I'd be curious to know how static IP assignment works in the infra setup. Do you have images with NetworkManager there? Do you use the simple-init element?
As noted yes we use this. Really only in two contexts; it's Rackspace that doesn't have DHCP so we have to setup the interface statically from the configdrive data. Other clouds all provide DHCP, which is used when there's no configdrive data.
Here is a systemd-analyze from one of our Centos nodes if it helps:
graphical.target @18.403s
└─multi-user.target @18.403s
  └─unbound.service @5.467s +12.934s
    └─network.target @5.454s
      └─NetworkManager.service @5.339s +112ms
        └─network-pre.target @5.334s
          └─glean@ens3.service @4.227s +1.102s
            └─basic.target @4.167s
              └─sockets.target @4.166s
                └─iscsiuio.socket @4.165s
                  └─sysinit.target @4.153s
                    └─systemd-udev-settle.service @1.905s +2.245s
                      └─systemd-udev-trigger.service @1.242s +659ms
                        └─systemd-udevd-control.socket @1.239s
                          └─system.slice
# systemd-analyze critical-chain
multi-user.target @2min 6.301s
└─tuned.service @1min 32.273s +34.024s
  └─network.target @1min 31.590s
    └─network-pre.target @1min 31.579s
      └─glean@enp1s0.service @36.594s +54.952s
        └─system-glean.slice @36.493s
          └─system.slice @4.083s
            └─-.slice @4.080s

# systemd-analyze critical-chain NetworkManager.service
NetworkManager.service +9.287s
└─network-pre.target @1min 31.579s
  └─glean@enp1s0.service @36.594s +54.952s
    └─system-glean.slice @36.493s
      └─system.slice @4.083s
        └─-.slice @4.080s

# cat /etc/sysconfig/network-scripts/ifcfg-enp1s0
# Automatically generated, do not edit
DEVICE=enp1s0
BOOTPROTO=static
HWADDR=52:54:00:1f:79:7e
IPADDR=192.168.122.42
NETMASK=255.255.255.0
ONBOOT=yes
NM_CONTROLLED=yes

# ip addr
...
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:1f:79:7e brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.77/24 brd 192.168.122.255 scope global dynamic noprefixroute enp1s0
       valid_lft 42957sec preferred_lft 42957sec
    inet6 fe80::f182:7fb4:7a39:eb7b/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
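The mismatch in the output above (ifcfg says 192.168.122.42, the interface holds 192.168.122.77 flagged `dynamic`) is the kind of thing a small check can catch automatically. The following is a rough sketch, not part of Glean or IPA; the function names are hypothetical and the sample text is copied from the message above.

```python
import ipaddress
import re

IFCFG = """\
# Automatically generated, do not edit
DEVICE=enp1s0
BOOTPROTO=static
HWADDR=52:54:00:1f:79:7e
IPADDR=192.168.122.42
NETMASK=255.255.255.0
ONBOOT=yes
NM_CONTROLLED=yes
"""

IP_ADDR = """\
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:1f:79:7e brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.77/24 brd 192.168.122.255 scope global dynamic noprefixroute enp1s0
    inet6 fe80::f182:7fb4:7a39:eb7b/64 scope link noprefixroute
"""


def parse_ifcfg(text):
    """Parse KEY=value lines, skipping comments and blank lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key] = value.strip('"')
    return settings


def live_ipv4(text):
    """Return (address, is_dynamic) pairs for IPv4 lines in `ip addr` output."""
    found = []
    for line in text.splitlines():
        match = re.match(r"\s*inet (\S+)", line)  # "inet6" does not match
        if match:
            found.append((ipaddress.ip_interface(match.group(1)), " dynamic" in line))
    return found


def check(ifcfg_text, ip_addr_text):
    """List the ways the live IPv4 state disagrees with the ifcfg file."""
    cfg = parse_ifcfg(ifcfg_text)
    expected = ipaddress.ip_interface(f"{cfg['IPADDR']}/{cfg['NETMASK']}")
    problems = []
    for addr, dynamic in live_ipv4(ip_addr_text):
        if addr != expected:
            problems.append(f"live {addr} differs from configured {expected}")
        if dynamic and cfg.get("BOOTPROTO") == "static":
            problems.append(f"{addr} is flagged dynamic although BOOTPROTO=static")
    return problems


if __name__ == "__main__":
    for problem in check(IFCFG, IP_ADDR):
        print(problem)
```

Both findings point the same way as the rest of the thread: the address came from DHCP, not from the file Glean wrote.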
At a guess; I feel like the udev bits are probably not happening correctly in your case? That's important to get the glean@<interface> service in the chain to pre-create the config file
It seems that the ordering is correct and the interface service is executed, but the IP address is nonetheless wrong. Can it be related to how long glean takes to run in my case (54 seconds vs 1 second in your case)?

Dmitry
On Wed, Nov 25, 2020 at 11:54:13AM +0100, Dmitry Tantsur wrote:
It seems that the ordering is correct and the interface service is executed, but the IP address is nonetheless wrong.
I agree, this seems to say to me that NetworkManager should run after network-pre.target, and glean@enp1s0 should be running before it.

The glean@enp1s0.service is set as oneshot [1], which should prevent network-pre.target being reached until it exits:

  oneshot ... [the] service manager will consider the unit up after the main process exits. It will then start follow-up units.

To the best of my knowledge the dependencies are correct; but if you go through the "git log" of the project you can find some history of us thinking ordering was correct and finding issues.
Can it be related to how long glean takes to run in my case (54 seconds vs 1 second in your case)?
The glean script doesn't run asynchronously in any way (at least not on purpose!). I can't see any way it could exit before the ifcfg file is written out.
# cat /etc/sysconfig/network-scripts/ifcfg-enp1s0 ...
The way NM support works is writing out this file which is read by the NM ifcfg-rh plugin [2]. AFAIK that's built-in to NM so would not be missing, and I think you'd have to go to effort to manually edit /etc/NetworkManager/conf.d/99-main-plugins.conf to have it ignored.

I'm afraid that's overall not much help. Are you sure there isn't an errant dhclient running somehow that grabs a different address? Does it get the correct address on reboot; implying the ifcfg- file is read correctly but somehow isn't in place before NetworkManager starts?

-i

[1] https://opendev.org/opendev/glean/src/branch/master/glean/init/glean-nm@.ser...
[2] https://developer.gnome.org/NetworkManager/stable/nm-settings-ifcfg-rh.html
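For readers unfamiliar with the data flow discussed here (network_data.json in, ifcfg file out, read later by the ifcfg-rh plugin), here is a small illustration. It is not Glean's implementation: the `render_ifcfg` helper, the link and network ids, and the sample document are assumptions, and the output keys simply mirror the ifcfg file quoted earlier in the thread.

```python
import json

# Hypothetical network_data.json content; the address values reuse the
# ones shown earlier in the thread, the ids are invented.
NETWORK_DATA = json.loads("""
{
  "links": [
    {"id": "port0", "type": "phy", "ethernet_mac_address": "52:54:00:1f:79:7e"}
  ],
  "networks": [
    {"id": "net0", "type": "ipv4", "link": "port0",
     "ip_address": "192.168.122.42", "netmask": "255.255.255.0"}
  ]
}
""")


def render_ifcfg(network_data, device, mac):
    """Render an ifcfg-style file for the static IPv4 network bound to `mac`.

    Returns None when no static IPv4 network references that MAC address.
    """
    links = {link["id"]: link for link in network_data.get("links", [])}
    for net in network_data.get("networks", []):
        link = links.get(net.get("link"), {})
        link_mac = link.get("ethernet_mac_address", "")
        if net.get("type") == "ipv4" and link_mac.lower() == mac.lower():
            lines = [
                "# Automatically generated, do not edit",
                f"DEVICE={device}",
                "BOOTPROTO=static",
                f"HWADDR={mac}",
                f"IPADDR={net['ip_address']}",
                f"NETMASK={net['netmask']}",
                "ONBOOT=yes",
                "NM_CONTROLLED=yes",
            ]
            return "\n".join(lines) + "\n"
    return None


if __name__ == "__main__":
    print(render_ifcfg(NETWORK_DATA, "enp1s0", "52:54:00:1f:79:7e"))
```

Whatever produces the file, the point made above still holds: it only helps if the file exists before NetworkManager reads its configuration.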
Hi Ian,

We were trying the same thing, and the deploy fails when we use a CentOS 8 or Ubuntu ramdisk. Glean is able to modify the network scripts, but it looks like Networking/NetworkManager is not restarted after that, and the IP is not assigned to the interface. I manually ran "systemctl restart NetworkManager" on the CentOS 8 system, and after that the deploy succeeded.

Is there a bug for this, and is there any plan to fix the issue? If no bug exists for the issue, I am planning to raise one.

The image is built using the following command:

disk-image-create -o centos_deploy_image ironic-python-agent-ramdisk centos simple-init devuser selinux-permissive

As a side note, the CentOS 7 image created using the above works fine for us, and the DHCP-less deploy works end-to-end using ironic.

Regards
Nisha
Hi,

Getting back to this, sorry for the delay. Yes, I'm pretty sure it's NetworkManager, not something else. Here are the relevant parts of the boot logs from a recent run:

[ 63.613821] NetworkManager[244]: <info> [1615995259.7778] NetworkManager (version 1.26.0-12.el8_3) is starting... (for the first time)
[ 71.637264] systemd[1]: Starting Glean for interface enp1s0 with NetworkManager...
Starting Glean for interface enp1s0 with NetworkManager...
[ 77.622901] glean.sh[327]: mount: /mnt/config: /dev/sr0 already mounted on /mnt/config.

!!! As you see, Glean starts quite early, but then... !!!

[ 92.699494] NetworkManager[244]: <info> [1615995288.9848] manager: (lo): new Generic device (/org/freedesktop/NetworkManager/Devices/1)
[ 93.040232] NetworkManager[244]: <info> [1615995289.3256] manager: (enp1s0): new Ethernet device (/org/freedesktop/NetworkManager/Devices/2)
[ 94.434450] NetworkManager[244]: <info> [1615995290.7198] device (enp1s0): state change: unmanaged -> unavailable (reason 'managed', sys-iface-state: 'external')
[ 94.713545] NetworkManager[244]: <info> [1615995290.9986] device (enp1s0): carrier: link connected
[ 96.487825] NetworkManager[244]: <info> [1615995292.7699] device (enp1s0): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'managed')
[ 96.712608] NetworkManager[244]: <info> [1615995292.9979] policy: auto-activating connection 'Wired connection 1' (cabef811-9cf9-3d92-9391-95712a3d3481)

!!! This auto-activation triggers DHCP !!!

[ 97.789768] NetworkManager[244]: <info> [1615995294.0750] device (enp1s0): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
[ 98.084735] NetworkManager[244]: <info> [1615995294.3699] dhcp4 (enp1s0): activation: beginning transaction (timeout in 30 seconds)
[ 98.303574] NetworkManager[244]: <info> [1615995294.5883] dhcp4 (enp1s0): dhclient started with pid 382
[ 108.882870] NetworkManager[244]: <info> [1615995305.0369] dhcp4 (enp1s0): address 192.168.122.105

!!! 10 seconds later we have the IP address configured !!!

[ 126.636082] glean.sh[326]: DEBUG:glean:Starting glean
[ 127.885587] glean.sh[326]: DEBUG:glean:Only considering interface enp1s0 from arguments
[ 127.908001] glean.sh[326]: DEBUG:glean:Interface matched: enp1s0 (52:54:00:9e:b1:16)
[ 127.920045] glean.sh[326]: DEBUG:glean:52:54:00:9e:b1:16 configured via config-drive
[ 128.635484] systemd[1]: Started Glean for interface enp1s0 with NetworkManager.

!!! 20 seconds later (it's a nested VM, everything is slow) glean actually kicks in !!!

[ 130.752564] systemd[1]: Reached target Network is Online.

At this point the IP address is from DHCP, not from Glean.

Any ideas?

Dmitry
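The ordering problem in the boot log from this message can be pulled out mechanically. The sketch below is only an illustration: the milestone names and the `timeline` helper are invented, and the sample keeps just five of the log lines quoted in the message.

```python
import re

# A few lines copied from the boot log quoted in the thread.
BOOT_LOG = """\
[   63.613821] NetworkManager[244]: <info> [1615995259.7778] NetworkManager (version 1.26.0-12.el8_3) is starting... (for the first time)
[   71.637264] systemd[1]: Starting Glean for interface enp1s0 with NetworkManager...
[   98.303574] NetworkManager[244]: <info> [1615995294.5883] dhcp4 (enp1s0): dhclient started with pid 382
[  108.882870] NetworkManager[244]: <info> [1615995305.0369] dhcp4 (enp1s0): address 192.168.122.105
[  128.635484] systemd[1]: Started Glean for interface enp1s0 with NetworkManager.
"""

# Substrings that mark the events of interest (hypothetical labels).
MILESTONES = {
    "nm_start": "NetworkManager (version",
    "glean_queued": "Starting Glean for interface",
    "dhcp_address": "dhcp4 (enp1s0): address",
    "glean_done": "Started Glean for interface",
}


def timeline(log_text):
    """Map each milestone to the first boot timestamp (seconds) where it appears."""
    seen = {}
    for line in log_text.splitlines():
        match = re.match(r"\[\s*(\d+\.\d+)\]\s+(.*)", line)
        if not match:
            continue  # console echo lines carry no timestamp
        stamp, message = float(match.group(1)), match.group(2)
        for name, needle in MILESTONES.items():
            if needle in message and name not in seen:
                seen[name] = stamp
    return seen


if __name__ == "__main__":
    events = timeline(BOOT_LOG)
    for name, stamp in sorted(events.items(), key=lambda item: item[1]):
        print(f"{stamp:10.3f}  {name}")
    if events["dhcp_address"] < events["glean_done"]:
        print("DHCP lease was applied before Glean finished")
```

Sorted by timestamp, the events read NetworkManager start, Glean queued, DHCP address, Glean done, which is the sequence the message annotates by hand.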
Does adding a Before=NetworkManager.service into the service file for glean-nm.service help with the ordering, perhaps?

-Jay Faulkner
On Wed, Mar 17, 2021 at 04:52:10PM +0100, Dmitry Tantsur wrote:
[ 63.613821] NetworkManager[244]: <info> [1615995259.7778] NetworkManager (version 1.26.0-12.el8_3) is starting... (for the first time) [ 71.637264] systemd[1]: Starting Glean for interface enp1s0 with
Any ideas?
That seems to say that the NetworkManager daemon is starting before glean.sh.

My NetworkManager /usr/lib/systemd/system/NetworkManager.service has

  [Unit]
  Description=Network Manager
  Documentation=man:NetworkManager(8)
  Wants=network.target
  After=network-pre.target dbus.service
  Before=network.target network.service

The glean service

  https://opendev.org/opendev/glean/src/branch/master/glean/init/glean@.servic...

has

  [Unit]
  Description=Glean for interface %I
  DefaultDependencies=no
  Before=network-pre.target
  Wants=network-pre.target
  ...
  [Service]
  Type=oneshot

It feels like we're really doing our best to tell NetworkManager to start after network-pre.target and glean to start before it.

The service is "oneshot", doesn't exit until it is finished, and has no timeout, so I don't see how network-pre can become active before glean@.service finishes?

Can you run with "debug" on the kernel command-line, to maybe see why it chose to start NM? Can you dump "systemd-analyze" plot maybe? I know we looked at the dependency chain previously and it seemed OK ...

As you've seen with

  https://review.opendev.org/c/opendev/glean/+/781133
  https://review.opendev.org/c/opendev/glean/+/781174

there are certainly ways we can optimise glean more. But I really would have thought these would just slow down the boot, not cause ordering issues...

-i
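The two unit fragments quoted in this message can be compared programmatically to see which one actually pulls in network-pre.target and which only orders itself against it. This is a rough sketch using Python's standard `configparser`; the `unit_deps` helper is a made-up name, and real unit files may repeat keys or use drop-ins, which this simple parser does not merge.

```python
import configparser

# Fragments copied from the thread; the "..." elision in the glean unit is
# left out because it is not valid INI syntax.
NM_UNIT = """\
[Unit]
Description=Network Manager
Documentation=man:NetworkManager(8)
Wants=network.target
After=network-pre.target dbus.service
Before=network.target network.service
"""

GLEAN_UNIT = """\
[Unit]
Description=Glean for interface %I
DefaultDependencies=no
Before=network-pre.target
Wants=network-pre.target

[Service]
Type=oneshot
"""


def unit_deps(text):
    """Return the Wants/After/Before lists from the [Unit] section."""
    parser = configparser.ConfigParser(
        interpolation=None,   # "%I" is a systemd specifier, not interpolation
        strict=False,         # tolerate repeated keys (the last one wins)
        delimiters=("=",),
    )
    parser.optionxform = str  # keep key case as written
    parser.read_string(text)
    unit = parser["Unit"]
    return {key: unit.get(key, "").split() for key in ("Wants", "After", "Before")}


if __name__ == "__main__":
    nm, glean = unit_deps(NM_UNIT), unit_deps(GLEAN_UNIT)
    print("NM orders after network-pre:", "network-pre.target" in nm["After"])
    print("NM wants network-pre:", "network-pre.target" in nm["Wants"])
    print("glean wants network-pre:", "network-pre.target" in glean["Wants"])
```

The output makes the asymmetry visible: NetworkManager only orders itself after network-pre.target, while the glean unit is the one that requests it.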
Ian, Jay, On Thu, Mar 18, 2021 at 6:12 AM Ian Wienand <iwienand@redhat.com> wrote:
On Wed, Mar 17, 2021 at 04:52:10PM +0100, Dmitry Tantsur wrote:
[ 63.613821] NetworkManager[244]: <info> [1615995259.7778] NetworkManager (version 1.26.0-12.el8_3) is starting... (for the first time) [ 71.637264] systemd[1]: Starting Glean for interface enp1s0 with
Any ideas?
That seems to say that the NetworkManager daemon is starting before glean.sh.
My NetworkManager /usr/lib/systemd/system/NetworkManager.service has
[Unit]
Description=Network Manager
Documentation=man:NetworkManager(8)
Wants=network.target
After=network-pre.target dbus.service

I have this too.

Before=network.target network.service

The glean service

https://opendev.org/opendev/glean/src/branch/master/glean/init/glean@.servic... has

[Unit]
Description=Glean for interface %I
DefaultDependencies=no
Before=network-pre.target
Wants=network-pre.target
...
[Service]
Type=oneshot

It feels like we're really doing our best to tell NetworkManager to start after network-pre.target and glean to start before it.
The service is "oneshot", doesn't exit until it is finished, and has no timeout, so I don't see how network-pre can become active before glean@.service finishes?
Can you run with "debug" on the kernel command-line, to maybe see why it chose to start NM? Can you dump "systemd-analyze" plot maybe? I know we looked at the dependency chain previously and it seemed OK ...
I think systemd ordering is of no use here. What I suspect is happening is that NetworkManager begins starting before udev inserts the glean-nm@ services.

The issue with network-pre is similar. It does not finish before glean-nm@ starts, but it does finish long after NetworkManager. The explanation I can come up with is the following: network-pre is a passive target, so it does not fire until something requests it. glean-nm@ requests it with Wants=network-pre, but at this point NetworkManager is already starting, so its After=network-pre (without Wants, as intended) does not have an effect.

These are pure speculations at this point, but that's all I have.

What I'm considering now to fix Glean is an additional systemd service that will start glean without arguments (i.e. for all interfaces that are already up) very early, maybe explicitly Before=NetworkManager. Since it will be a normal service, not one inserted by udev, the ordering will work correctly.
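The hypothesis above can be illustrated with a toy model. This is explicitly not how systemd's job engine is implemented; it only encodes the one rule the explanation relies on, namely that After=/Before= constrain two units only when both are part of the same start transaction, while Wants= is what pulls a unit into it. The unit name `glean-early.service` is hypothetical and stands in for the extra early service being considered.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Toy unit definitions; glean-early.service is a hypothetical name.
UNITS = {
    "NetworkManager.service": {"after": ["network-pre.target"]},
    "network-pre.target": {},
    "glean-nm@enp1s0.service": {
        "wants": ["network-pre.target"],
        "before": ["network-pre.target"],
    },
    "glean-early.service": {
        "wants": ["network-pre.target"],
        "before": ["network-pre.target", "NetworkManager.service"],
    },
}


def transaction(units, roots):
    """Collect the roots plus everything they pull in through Wants=."""
    pending, members = list(roots), set()
    while pending:
        name = pending.pop()
        if name not in members:
            members.add(name)
            pending.extend(units[name].get("wants", []))
    return members


def start_order(units, roots):
    """Order the transaction, honouring After=/Before= only between members."""
    members = transaction(units, roots)
    sorter = TopologicalSorter({name: set() for name in members})
    for name in members:
        for dep in units[name].get("after", []):
            if dep in members:
                sorter.add(name, dep)   # name starts after dep
        for dep in units[name].get("before", []):
            if dep in members:
                sorter.add(dep, name)   # dep starts after name
    return list(sorter.static_order())


if __name__ == "__main__":
    # The udev-inserted glean unit is not part of the initial transaction,
    # so nothing holds NetworkManager back.
    print(start_order(UNITS, ["NetworkManager.service"]))
    # With an ordinary early service in the same transaction, ordering applies.
    print(start_order(UNITS, ["NetworkManager.service", "glean-early.service"]))
```

In the first case NetworkManager.service is the only member, so its After=network-pre.target has nothing to wait for. In the second, the early service, network-pre.target and NetworkManager.service form a chain in that order.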
As you've seen with
https://review.opendev.org/c/opendev/glean/+/781133 https://review.opendev.org/c/opendev/glean/+/781174
there are certainly ways we can optimise glean more. But I really would have thought these would just slow down the boot, not cause ordering issues...
Oh, and another thing: Glean has a lock that is interface-agnostic (i.e. global). Which means that while it's processing the loopback interface, it cannot be processing real interfaces. This forced serialization may contribute to the slowness.

In the end, we may go down a different path in ironic-python-agent since we may not really want Glean by default, only when configdrive is present. But fixing Glean would be nice anyway.

Dmitry
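The cost of a single global lock can be shown with a back-of-the-envelope model. The durations below are invented numbers, not measurements from the thread, and `finish_times` is a hypothetical helper that assumes all per-interface jobs are queued at the same moment.

```python
def finish_times(durations, global_lock):
    """Return when each interface is done, in seconds after all jobs are queued.

    With a global lock the jobs run one after another in queue order; with
    per-interface locks they overlap and each finishes after its own duration.
    """
    finished, clock = {}, 0.0
    for iface, duration in durations.items():
        if global_lock:
            clock += duration           # wait for every earlier job first
            finished[iface] = clock
        else:
            finished[iface] = duration  # nothing to wait for
    return finished


if __name__ == "__main__":
    # Hypothetical per-interface processing times in seconds.
    jobs = {"lo": 2.0, "enp1s0": 3.0}
    print(finish_times(jobs, global_lock=True))
    print(finish_times(jobs, global_lock=False))
```

Under this model the real interface pays for the loopback run in full whenever loopback happens to be queued first, which matches the serialization concern raised above.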
On Thu, Mar 18, 2021 at 12:18 PM Dmitry Tantsur <dtantsur@redhat.com> wrote:
What I'm considering now to fix Glean is an additional systemd service that will start glean without arguments (i.e. for all interfaces that are already up) very early, maybe explicitly Before=NetworkManager. Since it will be a normal service, not one inserted by udev, the ordering will work correctly.
This approach has worked! The first change is https://review.opendev.org/c/opendev/glean/+/781460 that allows an optional early service. The second is https://review.opendev.org/c/openstack/diskimage-builder/+/781491 that lets DIB pass extra install arguments.

I've also added Clark's and your patches to the picture. They provide an improvement but alone don't seem to be enough to fix the problem.

Dmitry
participants (4)
- Dmitry Tantsur
- Ian Wienand
- Jay Faulkner
- Nisha Agarwal