If it happens intermittently, there might be an underlying issue. This reminds me of a customer issue we had to deal with two years ago. At some point the network stack suddenly became unreliable, many VMs lost their SSH connections and their customers complained. We did a lot of debugging and plenty of daemon restarts, but in the end it appeared to be neutron being overloaded; I vaguely remember the neutron-server logs indicating that. At that time they were still waiting for the third control node to become ready, so they only had two of them up and running. For some reason, restarting neutron services in a random order didn't resolve anything. What I found did help was to stop all neutron daemons and make sure all processes were actually dead, then resume operation by starting neutron-server first, wait a minute or two, and only then start the rest of the agents. Since then, this has become my standard procedure when dealing with neutron issues. And since the third control node joined, they haven't faced this issue again (yet).

So if this happens to you every now and then, I suggest monitoring the load and looking out for neutron logs that might point to the underlying issue.

Regards,
Eugen

Quoting AJ_ sunny <jains8550@gmail.com>:
Hello Eugen,
IP configuration on the VMs is done via neutron DHCP. But yesterday we enabled config drive metadata on the Ubuntu guest image, and after that we faced the same issue. We then restarted memcached, the neutron metadata agent, the neutron DHCP agent and the neutron L3 agent, and added an ingress security group rule for TCP with remote IP 169.254.169.254. I launched a VM immediately afterwards and it didn't work. But when I tried again after 8 hours, i.e. this morning, with Debian, CentOS and Ubuntu guest images, everything is working fine. This is very strange behaviour: for the last two days it was not working, but now it has started working.
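For reference, the rule was something along these lines (the exact security group name and the fact that no port range was set are assumptions here):

  openstack security group rule create --ingress --protocol tcp \
      --remote-ip 169.254.169.254/32 default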
Can you please share your thoughts or an RCA for this?
Thanks Arihant Jain
On Wed, 9 Oct, 2024, 2:48 pm Eugen Block, <eblock@nde.ag> wrote:
Hi,
for external networks you will need to inject metadata via config-drive. Does your VM have the IP configured which neutron assigned to it?
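For example, something along these lines when launching the instance (the image, flavor, network and key names here are just placeholders):

  openstack server create --image ubuntu-22.04 --flavor m1.small \
      --network external --key-name mykey --config-drive True myvm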
Quoting AJ_ sunny <jains8550@gmail.com>:
Hello team,
I am also facing a similar kind of problem, in which cloud-init is not able to push the key pair into the instance, due to which I am not able to SSH into the VM.
Inside the VM, curl http://169.254.169.254/openstack fails with: Failed to connect to 169.254.169.254 port 80: Connection refused
A VM with a direct external IP cannot be reached via SSH, but a VM on a tenant network with a floating IP can. This is a very strange scenario.
In the neutron logs I am also getting this error: Unexpected number of DHCP interfaces for metadata proxy, expected 1, got 2
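One way to check whether the network really has two DHCP ports (the network name is a placeholder):

  openstack port list --network <your-network> --device-owner network:dhcp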
Please provide assistance on this.
Thanks Arihant Jain
On Fri, 18 Nov, 2022, 7:24 am Tobias McNulty, <tobias@caktusgroup.com> wrote:
Thank you all for the helpful responses and suggestions. I tried these steps, but I am afraid the problem was user error.
I thought I had adequately tested the internal network previously, but that was not the case. cloud-init and security groups now appear to work seamlessly on an internal subnet. Furthermore, floating IPs from the external subnet are properly allocated and are reachable from the LAN.
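For reference, the floating IP flow looks roughly like this (the external network name, server name and address are placeholders):

  openstack floating ip create external
  openstack server add floating ip myserver 203.0.113.42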
I believe the issue was that I accidentally left DHCP disabled on the internal subnet previously. When I disable DHCP on the internal subnet now, a new instance will hang for ~400-500 seconds at this point in the boot process:
         Starting Load AppArmor profiles managed internally by snapd...
         Starting Initial cloud-init job (pre-networking)...
         Mounting Arbitrary Executable File Formats File System...
[  OK  ] Mounted Arbitrary Executable File Formats File System.
[    7.673299] cloud-init[508]: Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'init-local' at Fri, 18 Nov 2022 01:18:29 +0000. Up 7.61 seconds.
[  OK  ] Finished Load AppArmor profiles managed internally by snapd.
Eventually the instance finishes booting and displays the timeout attempting to reach 169.254.169.254:
[  430.150383] cloud-init[551]: Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'init' at Fri, 18 Nov 2022 01:25:31 +0000. Up 430.12 seconds.
<snip>
[  430.210288] cloud-init[551]: 2022-11-18 01:25:31,748 - url_helper.py[ERROR]: Timed out, no response from urls: ['http://169.254.169.254/openstack']
[  430.217100] cloud-init[551]: 2022-11-18 01:25:31,749 - util.py[WARNING]: No active metadata service found
In summary, I believe that:
* cloud-init will time out if DHCP is disabled (presumably because it has no IP with which to make a request?); a quick way to check the subnet is shown below.
* Security groups may not work as expected for instances created in an external subnet. The proper configuration is to create instances in a virtual subnet and assign floating IPs from the external subnet.
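For reference, a quick way to verify and re-enable DHCP on the subnet (the subnet name is a placeholder):

  openstack subnet show internal-subnet -c enable_dhcp
  openstack subnet set --dhcp internal-subnet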
Hopefully this message is helpful to someone in the future, and thank you all for your patience and support!
Tobias
On Tue, Nov 15, 2022 at 12:27 PM Sean Mooney <smooney@redhat.com> wrote:
On Tue, 2022-11-15 at 09:02 -0800, Clark Boylan wrote:
On Tue, Nov 15, 2022, at 6:14 AM, Tobias McNulty wrote:
> As an update, I tried the non-HWE kernel with the same result. Could it
> be a hardware/driver issue with the 10G NICs? It's so repeatable. I'll
> look into finding some other hardware to test with.
>
> Has anyone else experienced such a complete failure with cloud-init
> and/or security groups, and do you have any advice on how I might
> continue to debug this?

I'm not sure this will be helpful since you seem to have narrowed down the issue to VM networking, but here are some of the things that I do when debugging boot time VM setup failures:

* Use config drive instead of metadata service. The metadata service hasn't always been reliable.
* Bake information like DHCP config for interfaces and user ssh keys into an image and boot that. This way you don't need to rely on actions taken at boot time (a rough sketch follows below).
* Use a different boot time configurator tool. Glean is the one the OpenDev team uses for test nodes. When I debug things there I tend to test with cloud-init to compare glean behavior. But you can do this in reverse.

Again, I'm not sure this is helpful in this specific instance. But thought I'd send it out anyway to help those who may land here through Google search in the future.
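As a rough sketch of the "bake keys into the image" suggestion, assuming libguestfs' virt-customize is available (the image path, user name and key path are placeholders):

  virt-customize -a ubuntu-22.04.qcow2 \
      --ssh-inject ubuntu:file:/home/me/.ssh/id_rsa.pub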
One thing that you should check, in addition to considering the suggestions above, is to make sure that the nova metadata API is configured to use memcache.

cloud-init only retries requests until the first request succeeds; once the first request works, it assumes that the rest will. If you are using a load balancer and multiple nova-metadata-api processes without memcache, and it takes more than 10-30 seconds (can't recall how long cloud-init waits) to build the metadata response, then cloud-init can fail. Basically, if the second request needs to rebuild everything again because it is not in a shared cache (memcache), then the request can time out and cloud-init won't try again.
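For reference, the relevant section in nova.conf on the API nodes would look roughly like this (the server names and the exact backend are assumptions, adjust for your deployment):

  [cache]
  enabled = true
  backend = dogpile.cache.memcached
  memcache_servers = controller1:11211,controller2:11211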
> > Many thanks,
> > Tobias