On Tue, Nov 15, 2022, at 6:14 AM, Tobias McNulty wrote:
As an update, I tried the non-HWE kernel with the same result. Could it be a hardware/driver issue with the 10G NICs? It's so repeatable. I'll look into finding some other hardware to test with.
Has anyone else experienced such a complete failure with cloud-init and/or security groups, and do you have any advice on how I might continue to debug this?
I'm not sure this will be helpful since you seem to have narrowed down the issue to VM networking, but here are some of the things that I do when debugging boot time VM setup failures: * Use config drive instead of metadata service. The metadata service hasn't always been reliable. * Bake information like DHCP config for interfaces and user ssh keys into an image and boot that. This way you don't need to rely on actions taken at boot time. * Use a different boot time configurator tool. Glean is the one the OpenDev team uses for test nodes. When I debug things there I tend to test with cloud-init to compare glean behavior. But you can do this in reverse. Again, I'm not sure this is helpful in this specific instance. But thought I'd send it out anyway to help those who may land here through Google search in the future.
Many thanks, Tobias