Hello team,

I am facing a similar problem: cloud-init is not able to push the key pair into the image, so I am not able to SSH into the VM.

Inside the VM, curl http://169.254.169.254/openstack fails with:
Failed to connect to 169.254.169.254 port 80 connection refused


A VM with a direct external IP cannot be reached over SSH, but a VM on a
tenant network with a floating IP can. This is a very strange scenario.

In the neutron logs I am also getting this error:
Unexpected number of DHCP interface for metadata proxy expected 1, got 2


Any assistance with this would be appreciated.


Thanks 
Arihant Jain

On Fri, 18 Nov, 2022, 7:24 am Tobias McNulty, <tobias@caktusgroup.com> wrote:
Thank you all for the helpful responses and suggestions. I tried these
steps, but I am afraid the problem was user error.

I thought I had adequately tested the internal network previously, but
that was not the case. cloud-init and security groups now appear to
work seamlessly on an internal subnet. Furthermore, floating IPs from
the external subnet are properly allocated and are reachable from the
LAN.

I believe the issue was that I accidentally left DHCP disabled on the
internal subnet previously. When I disable DHCP on the internal subnet
now, a new instance will hang for ~400-500 seconds at this point in
the boot process:

         Starting Load AppArmor pro…managed internally by snapd...
         Starting Initial cloud-init job (pre-networking)...
         Mounting Arbitrary Executable File Formats File System...
[  OK  ] Mounted Arbitrary Executable File Formats File System.
[    7.673299] cloud-init[508]: Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'init-local' at Fri, 18 Nov 2022 01:18:29 +0000. Up 7.61 seconds.
[  OK  ] Finished Load AppArmor pro…s managed internally by snapd.

Eventually the instance finishes booting and displays the timeout
attempting to reach 169.254.169.254:

[  430.150383] cloud-init[551]: Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'init' at Fri, 18 Nov 2022 01:25:31 +0000. Up 430.12 seconds.
<snip>
[  430.210288] cloud-init[551]: 2022-11-18 01:25:31,748 - url_helper.py[ERROR]: Timed out, no response from urls: ['http://169.254.169.254/openstack']
[  430.217100] cloud-init[551]: 2022-11-18 01:25:31,749 - util.py[WARNING]: No active metadata service found
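
For anyone debugging the same symptom, here is a minimal sketch of a
probe that can be run inside the instance to tell a timeout (nothing
answers, as in the log above) from a refused connection (the address is
reachable but nothing listens on port 80). The function names and the
5-second timeout are my own assumptions and are not part of cloud-init:

```python
import socket
import urllib.error
import urllib.request

# Metadata URL taken from the cloud-init log above.
METADATA_URL = "http://169.254.169.254/openstack"


def classify(exc):
    """Map an exception raised by urlopen() to a short diagnosis."""
    # URLError wraps the underlying socket error in .reason; a timeout
    # while reading the response is raised directly, without a wrapper.
    reason = getattr(exc, "reason", exc)
    if isinstance(reason, (socket.timeout, TimeoutError)):
        return "timeout"   # nothing answered within the timeout
    if isinstance(reason, ConnectionRefusedError):
        return "refused"   # address reachable, but the port is closed
    return "other"


def probe(url=METADATA_URL, timeout=5.0):
    """Return 'ok', 'timeout', 'refused', or 'other' for one request."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return "ok" if resp.status == 200 else "other"
    except OSError as exc:  # URLError and socket.timeout are both OSErrors
        return classify(exc)
```

Calling probe() with no arguments makes a single request to the
metadata URL and returns one of the four strings. It does not retry,
so it only reports the state at the moment it is run.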

In summary, I believe that:

* cloud-init will time out if DHCP is disabled (presumably because the
instance has no IP address with which to make a request?)
* Security groups may not work as expected for instances created in an
external subnet. The proper configuration is to create instances in a
virtual subnet and assign floating IPs from the external subnet.
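
Since a disabled DHCP setting was easy for me to overlook, here is a
rough sketch of a check for it. It assumes the subnets are available as
a list of dicts shaped like Neutron API subnet objects, where
enable_dhcp is the relevant field; the function name and the sample
data are hypothetical:

```python
def subnets_without_dhcp(subnets):
    """Return the names of subnets whose enable_dhcp flag is false."""
    # A missing enable_dhcp key is treated as enabled here, which is an
    # assumption of this sketch and may need adjusting.
    return [s["name"] for s in subnets if not s.get("enable_dhcp", True)]


# Hypothetical sample data, not taken from a real deployment.
example = [
    {"name": "internal", "cidr": "10.0.0.0/24", "enable_dhcp": False},
    {"name": "external", "cidr": "192.168.1.0/24", "enable_dhcp": True},
]
flagged = subnets_without_dhcp(example)
```

Any subnet name that ends up in flagged is worth a second look before
booting instances that rely on the metadata service.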

Hopefully this message is helpful to someone in the future, and thank
you all for your patience and support!

Tobias

On Tue, Nov 15, 2022 at 12:27 PM Sean Mooney <smooney@redhat.com> wrote:
>
> On Tue, 2022-11-15 at 09:02 -0800, Clark Boylan wrote:
> > On Tue, Nov 15, 2022, at 6:14 AM, Tobias McNulty wrote:
> > > As an update, I tried the non-HWE kernel with the same result. Could it
> > > be a hardware/driver issue with the 10G NICs? It's so repeatable. I'll
> > > look into finding some other hardware to test with.
> > >
> > > Has anyone else experienced such a complete failure with cloud-init
> > > and/or security groups, and do you have any advice on how I might
> > > continue to debug this?
> >
> > I'm not sure this will be helpful since you seem to have narrowed down the issue to VM networking, but here are some of the things that I do when debugging boot time VM setup failures:
> >
> > * Use config drive instead of metadata service. The metadata service hasn't always been reliable.
> > * Bake information like DHCP config for interfaces and user ssh keys into an image and boot that. This way you don't need to rely on actions taken at boot time.
> > * Use a different boot time configurator tool. Glean is the one the OpenDev team uses for test nodes. When I debug things there I tend to test with cloud-init to compare glean behavior. But you can do this in reverse.
> >
> > Again, I'm not sure this is helpful in this specific instance. But thought I'd send it out anyway to help those who may land here through Google search in the future.
>
> One thing that you should check, in addition to considering ^,
> is that the nova api is configured to use memcache.
>
> cloud-init only retries requests until the first request succeeds.
> Once the first request works, it assumes that the rest will. If you are using a load balancer and multiple nova-metadata-api processes
> without memcache, and it takes more than 10-30 seconds (can't recall how long cloud-init waits) to build the metadata response, then
> cloud-init can fail. Basically, if the second request needs to rebuild everything again because it's not in a shared cache (memcache),
> then the request can time out and cloud-init won't try again.
>
> >
> > >
> > > Many thanks,
> > > Tobias
> >
>