Hello team,
I am facing a similar problem: cloud-init is not able to push the key pair into the image, so I cannot SSH into the VM.
Inside the VM, curl http://169.254.169.254/openstack fails with:
Failed to connect to 169.254.169.254 port 80 connection refused
A VM with a direct external IP cannot be reached over SSH, but a VM on a tenant network with a floating IP can. This is a very strange scenario.
In the neutron logs I also see this error:
Unexpected number of DHCP interface for metadata proxy expected 1, got 2
Please provide assistance with this.
Thanks,
Arihant Jain
On Fri, 18 Nov, 2022, 7:24 am Tobias McNulty, <tobias@caktusgroup.com> wrote:
Thank you all for the helpful responses and suggestions. I tried these steps, but I am afraid the problem was user error.
I thought I had adequately tested the internal network previously, but that was not the case. cloud-init and security groups now appear to work seamlessly on an internal subnet. Furthermore, floating IPs from the external subnet are properly allocated and are reachable from the LAN.
I believe the issue was that I accidentally left DHCP disabled on the internal subnet previously. When I disable DHCP on the internal subnet now, a new instance will hang for ~400-500 seconds at this point in the boot process:
         Starting Load AppArmor pro…managed internally by snapd...
         Starting Initial cloud-init job (pre-networking)...
         Mounting Arbitrary Executable File Formats File System...
[  OK  ] Mounted Arbitrary Executable File Formats File System.
[    7.673299] cloud-init[508]: Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'init-local' at Fri, 18 Nov 2022 01:18:29 +0000. Up 7.61 seconds.
[  OK  ] Finished Load AppArmor pro…s managed internally by snapd.
Eventually the instance finishes booting and displays the timeout attempting to reach 169.254.169.254:
[  430.150383] cloud-init[551]: Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'init' at Fri, 18 Nov 2022 01:25:31 +0000. Up 430.12 seconds.
<snip>
[  430.210288] cloud-init[551]: 2022-11-18 01:25:31,748 - url_helper.py[ERROR]: Timed out, no response from urls: ['http://169.254.169.254/openstack']
[  430.217100] cloud-init[551]: 2022-11-18 01:25:31,749 - util.py[WARNING]: No active metadata service found
In summary, I believe that:
* cloud-init will time out if DHCP is disabled (presumably because it has no IP with which to make a request?)
* Security groups may not work as expected for instances created in an external subnet. The proper configuration is to create instances in a virtual subnet and assign floating IPs from the external subnet.
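For anyone who wants to check the first point from inside an instance, here is a minimal Python sketch that probes the same URL cloud-init reported in the log above. The function name `probe_metadata`, the 5-second timeout, and the `opener` parameter are my own choices for illustration; this is not how cloud-init itself performs the check.

```python
import urllib.error
import urllib.request

# URL taken from the cloud-init log above.
METADATA_URL = "http://169.254.169.254/openstack"


def probe_metadata(url=METADATA_URL, timeout=5.0, opener=urllib.request.urlopen):
    """Return (ok, detail) describing whether the metadata URL answered.

    `opener` is a hypothetical hook so the function can be exercised
    without a network; by default it is urllib's urlopen.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return True, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The service answered, but with an error status.
        return False, f"HTTP {exc.code}"
    except OSError as exc:
        # Covers URLError, refused connections, and timeouts,
        # e.g. when the instance has no IP address.
        return False, f"unreachable: {exc}"
```

Calling `probe_metadata()` inside the instance and printing the result should show fairly quickly whether the request is refused, times out, or succeeds.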
Hopefully this message is helpful to someone in the future, and thank you all for your patience and support!
Tobias
On Tue, Nov 15, 2022 at 12:27 PM Sean Mooney <smooney@redhat.com> wrote:
On Tue, 2022-11-15 at 09:02 -0800, Clark Boylan wrote:
On Tue, Nov 15, 2022, at 6:14 AM, Tobias McNulty wrote:
As an update, I tried the non-HWE kernel with the same result. Could be a hardware/driver issue with the 10G NICs? It's so repeatable. I'll look into finding some other hardware to test with.
Has anyone else experienced such a complete failure with cloud-init and/or security groups, and do you have any advice on how I might continue to debug this?
I'm not sure this will be helpful since you seem to have narrowed down the issue to VM networking, but here are some of the things that I do when debugging boot time VM setup failures:
* Use config drive instead of metadata service. The metadata service hasn't always been reliable.
* Bake information like DHCP config for interfaces and user ssh keys into an image and boot that. This way you don't need to rely on actions taken at boot time.
* Use a different boot time configurator tool. Glean is the one the OpenDev team uses for test nodes. When I debug things there I tend to test with cloud-init to compare glean behavior. But you can do this in reverse.
Again, I'm not sure this is helpful in this specific instance. But thought I'd send it out anyway to help those who may land here through Google search in the future.
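To illustrate the config drive suggestion: when an instance is booted with a config drive, the metadata is available as files on an attached volume rather than over HTTP. A rough Python sketch of reading the SSH keys from it follows. It assumes the drive is already mounted at a path you pass in (the mount point is yours to choose) and that the file layout is `openstack/latest/meta_data.json` with a `public_keys` mapping, so verify that against your own instance. The helper name is hypothetical.

```python
import json
from pathlib import Path


def read_config_drive_keys(mount_point):
    """Return the public_keys mapping from a mounted config drive.

    Assumes the usual layout openstack/latest/meta_data.json under
    `mount_point`; returns an empty dict if no keys are listed.
    """
    meta_path = Path(mount_point) / "openstack" / "latest" / "meta_data.json"
    with meta_path.open() as fh:
        meta = json.load(fh)
    return meta.get("public_keys", {})
```

If this returns the expected key while the metadata service is unreachable, the problem is likely in the network path to 169.254.169.254 rather than in the image or in cloud-init.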
One thing that you should check in addition to considering ^ is to make sure that the nova API is configured to use memcache.
cloud-init only retries requests until the first request succeeds. Once the first request works it assumes that the rest will. If you are using a load balancer and multiple nova-metadata-api processes without memcache, and it takes more than 10-30 seconds (can't recall how long cloud-init waits) to build the metadata response, then cloud-init can fail. Basically, if the second request needs to rebuild everything again because it's not in a shared cache (memcache), then the request can time out and cloud-init won't try again.
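A quick way to check the memcache point is to look at the `[cache]` section of nova.conf on the hosts running the metadata API. The Python sketch below does that. It assumes the oslo.cache option names `enabled` and `backend`, and that any backend whose name contains "memcache" counts. Please confirm both against the documentation for your release. The function name is made up for this example.

```python
import configparser


def memcache_configured(conf_text):
    """Return True if the [cache] section enables a memcache backend.

    `conf_text` is the content of nova.conf as a string. The option
    names (`enabled`, `backend`) are assumed from oslo.cache.
    """
    # interpolation=None: config values may contain literal '%' characters.
    # strict=False: tolerate options that are repeated in the file.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    if not parser.has_section("cache"):
        return False
    enabled_raw = parser.get("cache", "enabled", fallback="false")
    enabled = enabled_raw.strip().lower() in ("true", "1", "yes", "on")
    backend = parser.get("cache", "backend", fallback="")
    return enabled and "memcache" in backend
```

Running it as `memcache_configured(open("/etc/nova/nova.conf").read())` on each API host (the path may differ on your deployment) gives a yes/no answer. It does not verify that the memcached servers are actually reachable.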
Many thanks, Tobias