Many thanks, Eugen, for your interest and your reply on this issue. I did manage to 'resolve' this sufficiently by improving the relevant test code. In case it is of any use to you, here is my commit message about that:
==============
The `test_detach_interface` test:
- creates a VM attached to one network
- finds the VM's port and calls `interface_detach` for it
- checks that the VM is now unpingable
- calls `interface_attach` with the relevant `network_id`
- hard reboots the VM
- checks that the VM is now pingable again.
Failure symptoms:
1. `test_detach_interface` fails in its last step with
AssertionError: ['ana33z cannot ping 616fuw (10.28.0.131)', 'gmwmc8 cannot ping 616fuw (10.28.0.131)'] is not false : Some failures: ['ana33z cannot ping 616fuw (10.28.0.131)', 'gmwmc8 cannot ping 616fuw (10.28.0.131)']
where 616fuw is the VM that had an interface detached and reattached and then got rebooted.
2. Following that, `test_ifdown_interface` fails with the same assertion, on its last line of code where it is supposed to have full connectivity again.
3. `nova-compute.log` for about the time of the `test_detach_interface` test shows a `DeviceRemovedFailed` event.
4. `ip l` and `ip a` output from the beginning of `test_ifdown_interface` shows that the VM has two NICs: eth0 with no IPs and eth1 with an IP address. At this point the VM is expected to have only one NIC, eth0, with an IP address.
The minimal fixes, both in `test_detach_interface`, are:
1. To make sure that the detached port really has gone, before re-attaching the VM to the network. Not doing this seems to allow the detach and attach operations to overlap with each other, resulting in the VM having two NICs instead of just one. That directly messes up `test_ifdown_interface`, because that test assumes that eth0 is the active NIC.
2. To make sure that the VM has become active again after its reboot, before testing for connectivity.
(Weirdly, an alternative to (1) seems to be running `watch neutron port-list` in parallel with the test. This confused me for quite a while, because I wouldn't expect `neutron port-list` to modify any Neutron state! But it's highly reproducible that with only (2), plus running `watch neutron port-list` in parallel, the tests reliably pass; and that with only (2), and without any `port-list`, `test_ifdown_interface` reliably fails.)
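For illustration, the two waits in (1) and (2) could be sketched roughly as below. This is not the actual test code: `wait_until`, `port_is_gone`, `server_is_active`, and the `list_port_ids` / `get_server_status` callables are hypothetical names standing in for whatever the test framework uses to query Neutron and Nova. The only thing assumed about the real APIs is that a running Nova server reports the status `ACTIVE`.

```python
import time


def wait_until(predicate, timeout=60.0, interval=1.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll predicate() until it returns a truthy value or the timeout expires.

    Returns True if the predicate succeeded, False if the timeout was reached.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def port_is_gone(list_port_ids, port_id):
    """True once port_id no longer appears in the listing.

    list_port_ids is a hypothetical callable returning the IDs of the
    ports currently known for the VM.
    """
    return port_id not in list_port_ids()


def server_is_active(get_server_status):
    """True once the server reports ACTIVE.

    get_server_status is a hypothetical callable returning the server's
    current status string.
    """
    return get_server_status() == "ACTIVE"


# Intended use inside the test (names are placeholders):
#
#   interface_detach(port_id)
#   assert wait_until(lambda: port_is_gone(list_port_ids, port_id))   # fix (1)
#   interface_attach(network_id)
#   hard_reboot()
#   assert wait_until(lambda: server_is_active(get_server_status))    # fix (2)
#   ... then check connectivity
```

Asserting on the return value of `wait_until` means that a detach that never completes shows up as a clear timeout at that step, rather than as a confusing connectivity failure in a later test.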
In summary, in my case I don't think it was related to agent health, but rather to the detach operation running a bit slower than it did in previous releases, and hence being more likely to overlap with a subsequent attach operation, given that my test code was not careful enough about this.
Best wishes - Nell