On Tue, 2022-11-15 at 09:02 -0800, Clark Boylan wrote:
On Tue, Nov 15, 2022, at 6:14 AM, Tobias McNulty wrote:
As an update, I tried the non-HWE kernel with the same result. Could it be a hardware/driver issue with the 10G NICs? It's so repeatable. I'll look into finding some other hardware to test with.
Has anyone else experienced such a complete failure with cloud-init and/or security groups, and do you have any advice on how I might continue to debug this?
I'm not sure this will be helpful since you seem to have narrowed down the issue to VM networking, but here are some of the things that I do when debugging boot time VM setup failures:
* Use config drive instead of metadata service. The metadata service hasn't always been reliable. * Bake information like DHCP config for interfaces and user ssh keys into an image and boot that. This way you don't need to rely on actions taken at boot time. * Use a different boot time configurator tool. Glean is the one the OpenDev team uses for test nodes. When I debug things there I tend to test with cloud-init to compare glean behavior. But you can do this in reverse.
Again, I'm not sure this is helpful in this specific instance. But thought I'd send it out anyway to help those who may land here through Google search in the future.
one thing that you shoudl check in addtion to considering ^ is make sure that the nova api is configured to use memcache. cloud init only retries request until the first request succceds. once the first request works it assumes that the rest will. if you are using a loadbalance and multipel nova-metadtaa-api process without memcache, and it take more then 10-30 seconds(cant recall how long cloud-init waits) to build the metadatta respocnce then cloud init can fail. basically if the second request need to rebuild everythign again because its not in a shared cache( memcache) then teh request can time out and cloud init wont try again.
Many thanks, Tobias