Kolla Ansible on Ubuntu 20.04 - cloud-init & other network issues
Hi,

I'm attempting to use Kolla Ansible 14.6.0 to deploy OpenStack Yoga on a small 3-node Ubuntu 20.04 cluster. The nodes have 128 GB RAM each, dual Xeon processors, and dual 10G Intel NICs. The NICs are connected to access ports on a 10G switch with separate VLANs for the local and external networks.

All the playbooks run cleanly, but cloud-init is failing in the Ubuntu 20.04 and 22.04 VMs I attempt to boot. The VM images are unmodified from https://cloud-images.ubuntu.com/, and cloud-init works fine if I mount a second volume with user-data. The error is a timeout attempting to reach 169.254.169.254. This occurs both when booting a VM in an internal routed network and directly in an external network.

I tried various neutron plugin agents (ovn, linuxbridge, and openvswitch both with and without firewall_driver = openvswitch <https://docs.openstack.org/kolla-ansible/latest/reference/networking/neutron.html#openvswitch-ml2-ovs>), each time starting from a clean install of the entire OS, all with the same result. Running tcpdump looking for 169.254.169.254 shows nothing. As a possible clue, the virtual NICs are unable to pass any traffic (e.g., to reach an external DHCP server) unless I completely disable port security on the interface, even if the associated security group is wide open. But disabling port security does not fix cloud-init (and I don't really want to disable port security anyway).

Are there any additional requirements for deploying OpenStack with Kolla on Ubuntu 20.04?

This is a fairly vanilla configuration using the multinode inventory as a starting point. I tried to follow the Quick Start <https://docs.openstack.org/kolla-ansible/yoga/user/quickstart.html> as closely as possible; the only material difference I see is that I'm using the same 3 nodes for control + compute. I am using MAAS, so it's easy to get a clean OS install on all three nodes ahead of each attempt.

I plan to try again with the standard (non-HWE) kernel just in case, but otherwise I am running out of ideas. In case of any additional clues, here are my globals.yml and inventory file, along with the playbook I'm using to configure the network, images, VMs, etc., after bootstrapping the cluster: https://gist.github.com/tobiasmcnulty/7dbbdbc67abc08cbb013bf5983852ed6

Thank you in advance for any advice!

Cheers,
Tobias
As an update, I tried the non-HWE kernel with the same result. Could it be a hardware/driver issue with the 10G NICs? It's so repeatable. I'll look into finding some other hardware to test with.

Has anyone else experienced such a complete failure with cloud-init and/or security groups, and do you have any advice on how I might continue to debug this?

Many thanks,
Tobias

On Sat, Nov 12, 2022 at 12:12 PM Tobias McNulty <tobias@caktusgroup.com> wrote:
Hi,

just one more thing to check: whenever I had trouble with the metadata service, it was usually AppArmor blocking access. For testing purposes (or if you're behind a firewall anyway) you could try disabling all the security-related daemons and see if that helps. If you don't have AppArmor enabled, do you see any errors in the neutron logs?

Quoting Tobias McNulty <tobias@caktusgroup.com>:
On Tue, Nov 15, 2022, at 6:14 AM, Tobias McNulty wrote:
I'm not sure this will be helpful since you seem to have narrowed down the issue to VM networking, but here are some of the things that I do when debugging boot-time VM setup failures:

* Use a config drive instead of the metadata service. The metadata service hasn't always been reliable.
* Bake information like DHCP config for interfaces and user ssh keys into an image and boot that. This way you don't need to rely on actions taken at boot time.
* Use a different boot-time configuration tool. Glean is the one the OpenDev team uses for test nodes. When I debug things there I tend to test with cloud-init to compare glean behavior. But you can do this in reverse.

Again, I'm not sure this is helpful in this specific instance, but I thought I'd send it out anyway to help those who may land here through a Google search in the future.
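The first suggestion above (a config drive instead of the metadata service) can be tried directly from the CLI. A minimal sketch; the image, flavor, network, and key names here are placeholders, not taken from this thread:

```shell
# Boot a test instance that reads its metadata from an attached config
# drive rather than http://169.254.169.254, taking the metadata service
# out of the picture entirely.
openstack server create \
  --image ubuntu-22.04 \
  --flavor m1.small \
  --network internal-net \
  --key-name mykey \
  --config-drive true \
  cloudinit-test
```

If cloud-init succeeds with the config drive but not without it, that points the debugging at the metadata path (neutron metadata agent/proxy, nova-metadata-api) rather than at cloud-init or the image.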
On Tue, 2022-11-15 at 09:02 -0800, Clark Boylan wrote:
One thing that you should check, in addition to considering the above, is to make sure that the nova metadata API is configured to use memcached.

cloud-init only retries requests until the first request succeeds; once the first request works, it assumes the rest will too. If you are using a load balancer and multiple nova-metadata-api processes without memcached, and it takes more than 10-30 seconds (I can't recall exactly how long cloud-init waits) to build the metadata response, then cloud-init can fail. Basically, if the second request needs to rebuild everything again because it's not in a shared cache (memcached), the request can time out and cloud-init won't try again.
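The memcached advice above corresponds to an oslo.cache `[cache]` section in nova.conf. A minimal sketch as a Kolla override file; the memcached addresses are placeholders for the three controllers, and recent Kolla releases may already wire this up depending on globals.yml:

```ini
# /etc/kolla/config/nova.conf -- merged into the containers' nova.conf
# by kolla-ansible. Addresses below are placeholders.
[cache]
enabled = True
backend = oslo_cache.memcache_pool
memcache_servers = 10.0.0.11:11211,10.0.0.12:11211,10.0.0.13:11211
```

After adding it, something like `kolla-ansible -i <inventory> reconfigure --tags nova` should redeploy the nova containers with the new setting, so that all metadata API workers share one cache.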
Thank you all for the helpful responses and suggestions. I tried these steps, but I am afraid the problem was user error. I thought I had adequately tested the internal network previously, but that was not the case. cloud-init and security groups now appear to work seamlessly on an internal subnet. Furthermore, floating IPs from the external subnet are properly allocated and are reachable from the LAN.

I believe the issue was that I accidentally left DHCP disabled on the internal subnet previously. When I disable DHCP on the internal subnet now, a new instance will hang for ~400-500 seconds at this point in the boot process:

    Starting Load AppArmor profiles managed internally by snapd...
    Starting Initial cloud-init job (pre-networking)...
    Mounting Arbitrary Executable File Formats File System...
    [  OK  ] Mounted Arbitrary Executable File Formats File System.
    [    7.673299] cloud-init[508]: Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'init-local' at Fri, 18 Nov 2022 01:18:29 +0000. Up 7.61 seconds.
    [  OK  ] Finished Load AppArmor profiles managed internally by snapd.

Eventually the instance finishes booting and displays the timeout attempting to reach 169.254.169.254:

    [  430.150383] cloud-init[551]: Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'init' at Fri, 18 Nov 2022 01:25:31 +0000. Up 430.12 seconds.
    <snip>
    [  430.210288] cloud-init[551]: 2022-11-18 01:25:31,748 - url_helper.py[ERROR]: Timed out, no response from urls: ['http://169.254.169.254/openstack']
    [  430.217100] cloud-init[551]: 2022-11-18 01:25:31,749 - util.py[WARNING]: No active metadata service found

In summary, I believe that:

* cloud-init will time out if DHCP is disabled (presumably because it has no IP with which to make a request?)
* Security groups may not work as expected for instances created in an external subnet. The proper configuration is to create instances in a virtual subnet and assign floating IPs from the external subnet.
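For anyone landing here later, the working configuration described above corresponds roughly to this CLI sequence; the subnet, network, image, flavor, and key names are placeholders, not taken from this thread:

```shell
# Make sure DHCP is enabled on the internal subnet, so the guest gets an
# address before cloud-init queries the metadata service.
openstack subnet set --dhcp internal-subnet

# Boot on the internal network rather than directly on the external one...
openstack server create --image ubuntu-22.04 --flavor m1.small \
  --network internal-net --key-name mykey demo-vm

# ...and reach it via a floating IP allocated from the external network.
openstack floating ip create external-net
openstack server add floating ip demo-vm 203.0.113.50  # use the address the previous command allocated
```

With DHCP enabled, the instance can reach 169.254.169.254 during boot and cloud-init no longer hangs in the init-local stage.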
Hopefully this message is helpful to someone in the future, and thank you all for your patience and support!

Tobias

On Tue, Nov 15, 2022 at 12:27 PM Sean Mooney <smooney@redhat.com> wrote:
Hello team,

I am also facing a similar kind of problem, in which cloud-init is not able to push the key pair into the image, so I am not able to ssh into the VM.

Inside the VM, running curl http://169.254.169.254/openstack fails with "Failed to connect to 169.254.169.254 port 80: Connection refused".

I am not able to ssh into a VM with a direct external IP, but a VM on a tenant network with a floating IP can be reached over ssh; this is a very strange scenario.

In the neutron logs I am also getting this error: "Unexpected number of DHCP interfaces for metadata proxy: expected 1, got 2".

Please provide assistance on this.

Thanks,
Arihant Jain

On Fri, 18 Nov, 2022, 7:24 am Tobias McNulty, <tobias@caktusgroup.com> wrote:
Hi,

for external networks you will need to inject metadata via a config drive. Does your VM have the IP configured that neutron assigned to it?

Quoting AJ_ sunny <jains8550@gmail.com>:
Hello Eugen,

IP configuration on the VM is happening via neutron DHCP. But yesterday we enabled config-drive metadata on the Ubuntu guest image, after which we faced the same issue. We then restarted memcached, the neutron metadata agent, the neutron DHCP agent, and the neutron L3 agent, added an ingress security group rule for TCP with remote IP 169.254.169.254, and immediately launched a VM; it didn't work. But when I tried again after 8 hours, i.e. this morning, with Debian, CentOS, and Ubuntu guest images, everything worked fine. This is very strange behaviour: for the last two days it was not working, but now it has started working.

Can you please share your thoughts or an RCA for this?

Thanks,
Arihant Jain

On Wed, 9 Oct, 2024, 2:48 pm Eugen Block, <eblock@nde.ag> wrote:
If it happens intermittently, there might be an underlying issue. This reminds me of a customer issue we had to deal with two years ago. At some point, the network stack suddenly became unreliable, many VMs lost their ssh connections, and their customers complained. We did a lot of debugging and lots of daemon restarts, but in the end it just seemed to be neutron overloading; I vaguely remember neutron-server logs indicating that. At that time, they were still waiting for the third control node to become ready, so they only had two of them up and running.

For some reason, restarting neutron services in a random order didn't resolve anything. I found that it helped to stop all neutron daemons and ensure all processes were actually dead, then resume operation by starting neutron-server first, waiting a minute or two, and then continuing with the rest of the agents. Since then, this has become my standard procedure when dealing with neutron issues. And since the third control node joined, they haven't faced this issue again (yet).

So if this happens to you every now and then, I suggest monitoring the load and looking out for neutron logs that might point to the underlying issue.

Regards,
Eugen

Quoting AJ_ sunny <jains8550@gmail.com>:
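The stop-everything-then-staged-start procedure above, sketched as commands. The systemd unit names are the typical ones and will differ per distribution; on a Kolla deployment you would stop/start the corresponding containers instead:

```shell
# Stop all neutron daemons and verify nothing is left running.
systemctl stop neutron-server neutron-dhcp-agent \
    neutron-l3-agent neutron-metadata-agent
pgrep -af neutron    # should print nothing; kill any stragglers

# Start neutron-server first and give it a minute or two to settle...
systemctl start neutron-server
sleep 120

# ...then bring the agents back.
systemctl start neutron-dhcp-agent neutron-l3-agent neutron-metadata-agent
```

The point of the ordering is that the agents re-register against a neutron-server that is already healthy, instead of hammering one that is still starting up.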
Hello Eugen,
IP configuration on the VM is happening via neutron DHCP. But yesterday we enabled config-drive metadata on the Ubuntu guest image, and after that we faced the same issue. We then restarted memcached, the neutron metadata agent, the neutron DHCP agent, and the neutron L3 agent, and added an ingress security group rule for the TCP protocol with remote IP 169.254.169.254. I immediately launched a VM, and it didn't work. But I tried again after 8 hours, i.e. this morning, with Debian, CentOS, and Ubuntu guest images, and everything is working fine. This is very strange behaviour: for the last two days it was not working, but now it has started working.
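For reference, the workaround described above could be expressed roughly as follows. This is a sketch that mirrors the poster's steps rather than a recommended fix; the service names assume a systemd-based, non-containerized deployment, and "default" is a placeholder security group name (note that metadata traffic is normally egress from the VM, so the value of the ingress rule here is unclear):

```shell
# Restart the services mentioned above (names assume a systemd-based,
# non-containerized deployment).
sudo systemctl restart memcached neutron-metadata-agent \
    neutron-dhcp-agent neutron-l3-agent

# Add the ingress TCP rule for the metadata address, as described
# above ("default" is a placeholder security group name).
openstack security group rule create default \
    --ingress --protocol tcp --remote-ip 169.254.169.254/32
```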
Can you please share your thoughts or a root cause analysis (RCA) for this?
Thanks Arihant Jain
On Wed, 9 Oct, 2024, 2:48 pm Eugen Block, <eblock@nde.ag> wrote:
Hi,
for external networks you will need to inject metadata via config-drive. Does your VM have the IP configured which neutron assigned to it?
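As a minimal sketch of booting with a config drive (the image, flavor, network, and server names are placeholders):

```shell
# Force a config drive so cloud-init does not depend on the
# 169.254.169.254 metadata service (all names are placeholders).
openstack server create \
    --image ubuntu-22.04 \
    --flavor m1.small \
    --network external-net \
    --config-drive true \
    test-vm
```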
Quoting AJ_ sunny <jains8550@gmail.com>:
Hello team,
I am also facing a similar kind of problem, in which cloud-init is not able to push the key pair inside the image, due to which I am not able to ssh to the VM.

Inside the VM, "curl http://169.254.169.254/openstack" fails with: Failed to connect to 169.254.169.254 port 80: Connection refused.

A VM with a direct external IP is not able to ssh, but a VM on a tenant network with a floating IP is able to ssh. This is a very strange scenario.

In the neutron logs I am also getting the error: "Unexpected number of DHCP interfaces for metadata proxy: expected 1, got 2".

Please provide assistance on this.
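The "Unexpected number of DHCP interfaces" error usually means a network has ended up with more than one DHCP port, which confuses the metadata proxy. One way to check, as a sketch ("my-net" is a placeholder network name):

```shell
# List DHCP ports on the network; more than one per DHCP agent can
# confuse the metadata proxy ("my-net" is a placeholder name).
openstack port list --network my-net --device-owner network:dhcp

# On the network node, list the DHCP namespaces neutron created.
ip netns list | grep qdhcp
```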
Thanks Arihant Jain
On Fri, 18 Nov, 2022, 7:24 am Tobias McNulty, <tobias@caktusgroup.com> wrote:
Thank you all for the helpful responses and suggestions. I tried these steps, but I am afraid the problem was user error.
I thought I had adequately tested the internal network previously, but that was not the case. cloud-init and security groups now appear to work seamlessly on an internal subnet. Furthermore, floating IPs from the external subnet are properly allocated and are reachable from the LAN.
I believe the issue was that I accidentally left DHCP disabled on the internal subnet previously. When I disable DHCP on the internal subnet now, a new instance will hang for ~400-500 seconds at this point in the boot process:
Starting Load AppArmor profiles managed internally by snapd...
Starting Initial cloud-init job (pre-networking)...
Mounting Arbitrary Executable File Formats File System...
[  OK  ] Mounted Arbitrary Executable File Formats File System.
[    7.673299] cloud-init[508]: Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'init-local' at Fri, 18 Nov 2022 01:18:29 +0000. Up 7.61 seconds.
[  OK  ] Finished Load AppArmor profiles managed internally by snapd.
Eventually the instance finishes booting and displays the timeout attempting to reach 169.254.169.254:
[  430.150383] cloud-init[551]: Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'init' at Fri, 18 Nov 2022 01:25:31 +0000. Up 430.12 seconds.
<snip>
[  430.210288] cloud-init[551]: 2022-11-18 01:25:31,748 - url_helper.py[ERROR]: Timed out, no response from urls: ['http://169.254.169.254/openstack']
[  430.217100] cloud-init[551]: 2022-11-18 01:25:31,749 - util.py[WARNING]: No active metadata service found
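When a boot ends like this, it can save time to re-test the metadata path from inside the guest rather than rebooting and waiting out cloud-init's retries. A sketch, assuming curl and cloud-init are present in the image:

```shell
# Retry the metadata endpoint manually with a short timeout.
curl -m 5 http://169.254.169.254/openstack

# Show cloud-init's own view of what failed during boot.
cloud-init status --long
```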
In summary, I believe that:
* cloud-init will time out if DHCP is disabled (presumably because it has no IP with which to make a request?)
* Security groups may not work as expected for instances created in an external subnet. The proper configuration is to create instances in a virtual subnet and assign floating IPs from the external subnet.
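Assuming a subnet named "internal-subnet" (a placeholder), the DHCP fix from the first point above is a one-liner:

```shell
# Re-enable DHCP on the subnet so the instance gets an address
# cloud-init can use ("internal-subnet" is a placeholder name).
openstack subnet set --dhcp internal-subnet
```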
Hopefully this message is helpful to someone in the future, and thank you all for your patience and support!
Tobias
On Tue, Nov 15, 2022 at 12:27 PM Sean Mooney <smooney@redhat.com> wrote:
On Tue, 2022-11-15 at 09:02 -0800, Clark Boylan wrote:
On Tue, Nov 15, 2022, at 6:14 AM, Tobias McNulty wrote:
> As an update, I tried the non-HWE kernel with the same result. Could it
> be a hardware/driver issue with the 10G NICs? It's so repeatable. I'll
> look into finding some other hardware to test with.
>
> Has anyone else experienced such a complete failure with cloud-init
> and/or security groups, and do you have any advice on how I might
> continue to debug this?

I'm not sure this will be helpful since you seem to have narrowed down the issue to VM networking, but here are some of the things that I do when debugging boot time VM setup failures:

* Use config drive instead of metadata service. The metadata service hasn't always been reliable.
* Bake information like DHCP config for interfaces and user ssh keys into an image and boot that. This way you don't need to rely on actions taken at boot time.
* Use a different boot time configurator tool. Glean is the one the OpenDev team uses for test nodes. When I debug things there I tend to test with cloud-init to compare glean behavior. But you can do this in reverse.

Again, I'm not sure this is helpful in this specific instance. But thought I'd send it out anyway to help those who may land here through Google search in the future.
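As a sketch of the config-drive suggestion: inside the guest, a config drive shows up as a small volume labeled "config-2" that can be inspected directly, which is handy for confirming the metadata actually reached the instance:

```shell
# Mount the config drive and read the metadata it carries
# (the "config-2" label is standard for OpenStack config drives).
sudo mkdir -p /mnt/config
sudo mount /dev/disk/by-label/config-2 /mnt/config
cat /mnt/config/openstack/latest/meta_data.json
```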
One thing that you should check, in addition to considering ^, is to make sure that the nova API is configured to use memcache. cloud-init only retries requests until the first request succeeds; once the first request works, it assumes that the rest will. If you are using a load balancer and multiple nova-metadata-api processes without memcache, and it takes more than 10-30 seconds (can't recall how long cloud-init waits) to build the metadata response, then cloud-init can fail. Basically, if the second request needs to rebuild everything again because it's not in a shared cache (memcache), then the request can time out and cloud-init won't try again.
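As a rough sketch of that check: the [cache] section of nova.conf should point at memcached on every host running the metadata API. The addresses and path below are placeholders, and with Kolla Ansible this is normally templated for you when memcached is enabled in globals.yml:

```shell
# Inspect/set the oslo.cache options in nova.conf using crudini
# (addresses are placeholders; adjust the path for your deployment).
crudini --set /etc/nova/nova.conf cache backend oslo_cache.memcache_pool
crudini --set /etc/nova/nova.conf cache enabled true
crudini --set /etc/nova/nova.conf cache memcache_servers \
    192.0.2.11:11211,192.0.2.12:11211,192.0.2.13:11211
```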
> Many thanks,
> Tobias
participants (5)
- AJ_ sunny
- Clark Boylan
- Eugen Block
- Sean Mooney
- Tobias McNulty