Ian, Jay,


On Thu, Mar 18, 2021 at 6:12 AM Ian Wienand <iwienand@redhat.com> wrote:
On Wed, Mar 17, 2021 at 04:52:10PM +0100, Dmitry Tantsur wrote:
> [   63.613821] NetworkManager[244]: <info>  [1615995259.7778]  NetworkManager (version 1.26.0-12.el8_3) is starting... (for the first time)
> [   71.637264] systemd[1]: Starting Glean for interface enp1s0 with

> Any ideas?

That seems to say that the NetworkManager daemon is starting before
glean.sh.

My NetworkManager /usr/lib/systemd/system/NetworkManager.service has

  [Unit]
  Description=Network Manager
  Documentation=man:NetworkManager(8)
  Wants=network.target
  After=network-pre.target dbus.service

I have this too.
 
  Before=network.target network.service

The glean service
 https://opendev.org/opendev/glean/src/branch/master/glean/init/glean@.service
has

 [Unit]
 Description=Glean for interface %I
 DefaultDependencies=no
 Before=network-pre.target
 Wants=network-pre.target
 ...
 [Service]
 Type=oneshot

It feels like we're really doing our best to tell NetworkManager to
start after network-pre.target and glean to start before it.

The service is "oneshot", doesn't exit until it is finished, and has
no timeout, so I don't see how network-pre can become active before
glean@.service finishes?

Can you run with "debug" on the kernel command-line, to maybe see why
it chose to start NM?  Can you maybe dump a "systemd-analyze plot"?  I
know we looked at the dependency chain previously and it seemed OK ...

I think systemd ordering is of no use here. What I suspect is happening is that NetworkManager begins starting before udev inserts the glean-nm@ services.

The issue with network-pre is similar. It is not reached before glean-nm@ starts, but it is only reached long after NetworkManager has started. The explanation I can come up with is the following: network-pre is a passive target, so it is not activated until something requests it. glean-nm@ requests it with Wants=network-pre.target, but at this point NetworkManager is already starting, so its After=network-pre.target (without Wants, as intended) has no effect.

This is pure speculation at this point, but that's all I have.

What I'm considering now to fix Glean is an additional systemd service that will start glean without arguments (i.e. for all interfaces that are already up) very early, maybe explicitly Before=NetworkManager. Since it will be a normal service, not one inserted by udev, the ordering will work correctly.
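
To make the idea concrete, such a unit could look roughly like the sketch below. This is untested and only illustrates the ordering: the unit name (glean-early.service) and the glean.sh path are placeholders I made up for this email, not something that exists in Glean today.

 # glean-early.service (hypothetical name)
 [Unit]
 Description=Glean for all interfaces that are already up
 DefaultDependencies=no
 # Same ordering as glean@.service, plus an explicit ordering against NM
 Before=network-pre.target NetworkManager.service
 Wants=network-pre.target

 [Service]
 Type=oneshot
 RemainAfterExit=yes
 # Placeholder path; no interface argument, so glean handles every
 # interface that is present at this point
 ExecStart=/usr/local/bin/glean.sh

 [Install]
 WantedBy=multi-user.target

Because this unit would be enabled statically, it is part of the initial boot transaction together with NetworkManager.service, so Before= has something to order against. The udev-triggered glean-nm@ instances would still be needed for interfaces that show up later.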
 

As you've seen with

 https://review.opendev.org/c/opendev/glean/+/781133
 https://review.opendev.org/c/opendev/glean/+/781174

there are certainly ways we can optimise glean more.  But I really
would have thought these would just slow down the boot, not cause
ordering issues...

Oh, and another thing: Glean has a lock that is interface-agnostic (i.e. global). Which means that while it's processing the loopback interface, it cannot be processing real interfaces. This forced serialization may contribute to the slowness.

In the end, we may go down a different path in ironic-python-agent since we may not really want Glean by default, only when configdrive is present. But fixing Glean would be nice anyway.

Dmitry
 

-i


