[Kolla][Kolla-Ansible] Ironic Node Cleaning Failed
Hi Team,

We have deployed a 2-node Kolla-Ansible *12.0.0* setup in order to deploy the OpenStack *Wallaby* release. We have also enabled Ironic in order to provision bare metal nodes.

On each server we have 3 NICs:

- *eno1* - OAM, for external connectivity and the endpoints' publicURL
- *eno2* - Mgmt, for internal communication between the various OpenStack services
- *ens2f0* - Data interface

Corresponding to this we have defined the following fields in globals.yml:

- kolla_base_distro: "centos"
- kolla_install_type: "source"
- openstack_release: "wallaby"
- network_interface: "eno2"                    # MGMT interface
- kolla_external_vip_interface: "eno1"         # OAM interface
- kolla_internal_vip_address: "192.168.10.3"   # free IP on the MGMT subnet
- kolla_external_vip_address: "10.0.1.136"     # free IP on the OAM subnet
- neutron_external_interface: "ens2f0"         # Data interface
- enable_neutron_provider_networks: "yes"

Note: Only the relevant fields are shown in this query.

Also, for Ironic the following fields have been defined in globals.yml:

- enable_ironic: "yes"
- enable_ironic_neutron_agent: "{{ enable_neutron | bool and enable_ironic | bool }}"
- enable_horizon_ironic: "{{ enable_ironic | bool }}"
- ironic_dnsmasq_interface: "*ens2f0*"         # Data interface
- ironic_dnsmasq_dhcp_range: "20.20.20.10,20.20.20.100"
- ironic_dnsmasq_boot_file: "pxelinux.0"
- ironic_cleaning_network: "public1"
- ironic_dnsmasq_default_gateway: "20.20.20.1"

After successful deployment, a flat provider network named public1 is created in OpenStack using the commands below:

- openstack network create public1 --provider-network-type flat --provider-physical-network physnet1
- openstack subnet create subnet1 --network public1 --subnet-range 20.20.20.0/24 --allocation-pool start=20.20.20.10,end=20.20.20.100 --ip-version=4 --gateway=20.20.20.1 --dhcp

Issues/Queries:

- Is the configuration done in globals.yml correct, or is there anything else that needs to be done in order to separate control and data plane traffic?
- I have also set automated_cleaning to "true" in the ironic-conductor container settings. After creating the bare metal node, we run the "node manage" command, which completes successfully. Running the "*openstack baremetal node provide <node id>*" command powers on the machine and sets the boot mode to Network Boot, but no DHCP request for that particular MAC is seen on the controller. Is there anything I am missing that needs to be done in order to make Ironic work?

Note: I have also verified that the NIC is PXE-enabled in the system configuration settings.

Regards
Anirudh Gupta
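P.S. For reference, the rough enrollment sequence we use is below. The IPMI address, credentials and node name are placeholders, not our real values; the MAC is that of the PXE NIC:

    openstack baremetal node create --driver ipmi \
        --driver-info ipmi_address=192.168.10.21 \
        --driver-info ipmi_username=admin \
        --driver-info ipmi_password=secret \
        --name bm-node-1
    openstack baremetal port create 98:f2:b3:3f:72:d8 --node <node id>
    openstack baremetal node manage <node id>
    openstack baremetal node provide <node id>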
Hi Team,

In continuation of my earlier email, I have some updated information:

Earlier, the allocation range mentioned in "*ironic_dnsmasq_dhcp_range*" in globals.yml overlapped with the cleaning network's range, because of which there was an issue in receiving the DHCP request.

After creating a cleaning network with a separate allocation range, an IP is now successfully allocated to my bare metal node:

- openstack subnet create subnet1 --network public1 --subnet-range 20.20.20.0/24 --allocation-pool start=20.20.20.150,end=20.20.20.200 --ip-version=4 --gateway=20.20.20.1 --dhcp

[image: image.png]

After getting the IP, there is no further action on the node. From "*clean_wait*", it goes into the "*clean_failed*" state after around half an hour.

On verifying the logs, I could see the below error messages:

- In */var/log/kolla/ironic/ironic-conductor.log*, we observed the following error:

ERROR ironic.conductor.utils [-] Cleaning for node 3a56748e-a8ca-4dec-a332-ace18e6d494e failed. *Timeout reached while cleaning the node. Please check if the ramdisk responsible for the cleaning is running on the node. Failed on step {}.*

Note: For cleaning the node, we have used the below images:

https://tarballs.openstack.org/ironic-python-agent/dib/files/ipa-centos8-mas...
https://tarballs.openstack.org/ironic-python-agent/dib/files/ipa-centos8-mas...

- In /var/log/kolla/nova/nova-compute-ironic.log, we observed the error:

ERROR nova.compute.manager [req-810ffedf-3343-471c-94db-85411984e6cc - - - - -] No compute node record for host controller-ironic: nova.exception_Remote.ComputeHostNotFound_Remote: Compute host controller-ironic could not be found.

Can someone please help in this regard?

Regards
Anirudh Gupta
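P.S. For completeness, the cleaning ramdisk images were uploaded to Glance and attached to the node roughly as follows. The file names and image names are placeholders for the kernel and initramfs downloaded from the links above:

    openstack image create --disk-format aki --container-format aki \
        --public --file <downloaded IPA kernel> deploy-kernel
    openstack image create --disk-format ari --container-format ari \
        --public --file <downloaded IPA initramfs> deploy-ramdisk
    openstack baremetal node set <node id> \
        --driver-info deploy_kernel=<deploy-kernel image uuid> \
        --driver-info deploy_ramdisk=<deploy-ramdisk image uuid>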
On Thu, Jul 29, 2021 at 6:05 PM Anirudh Gupta <anyrude10@gmail.com> wrote:

> After getting the IP, there is no further action on the node. From "*clean_wait*", it goes into the "*clean_failed*" state after around half an hour.
The IP address is not from the cleaning range; it may come from inspection. You probably need to investigate your network topology, maybe using tcpdump. Unfortunately, I'm not fluent enough in Kolla to say whether this is a bug.

Dmitry
--
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill
Hi Dmitry,

Thanks for your time.

My system is getting IP 20.20.20.10, which is in the range defined in the ironic_dnsmasq_dhcp_range field in globals.yml:

ironic_dnsmasq_dhcp_range: "20.20.20.10,20.20.20.100"

And in the cleaning network (public1), the range defined is 20.20.20.150-20.20.20.200.

As per my understanding, these 2 ranges should be mutually exclusive. Please suggest if my understanding is not correct.

Any suggestions on what I should do to resolve this issue?

Regards
Anirudh Gupta
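P.S. To double-check both sides, I verified the two ranges roughly like this (the dnsmasq config path is where Kolla places it on my controller and may differ on other deployments):

    openstack subnet show subnet1 -c allocation_pools
    sudo grep dhcp-range /etc/kolla/ironic-dnsmasq/dnsmasq.conf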
Hi Radosław,

I read somewhere on the openstack-discuss forums that you are also using the Ironic service with Kolla-Ansible. Can you please go through my query chain and suggest some pointers to resolve the issue? I am unable to find any error logs except the ones I shared.

Looking forward to hearing from you.

Regards
Anirudh Gupta
Hi,

You need to check the dnsmasq logs (there are two dnsmasqs: one from neutron and one from ironic-inspector). tcpdump may also help to determine where the packets are lost.

Dmitry
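P.S. Something along these lines on the provisioning interface should show both the DHCP exchange and any TFTP requests (the interface name is taken from your configuration):

    tcpdump -i ens2f0 -nn 'port 67 or port 68 or port 69'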
Hi Dmitry,

I might be wrong, but as per my understanding, if there were an issue in dnsmasq, then IP 20.20.20.10 would not have been assigned to the machine.

TCPDUMP logs are as below:

20:16:58.938089 IP controller.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 312
20:17:02.765291 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 98:f2:b3:3f:72:e5 (oui Unknown), length 359
20:17:02.766303 IP controller.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 312
20:17:26.944378 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 98:f2:b3:3f:72:e5 (oui Unknown), length 347
20:17:26.944756 IP controller.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 312
20:17:30.763627 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 98:f2:b3:3f:72:e5 (oui Unknown), length 359
20:17:30.764620 IP controller.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 312
20:17:54.938791 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 98:f2:b3:3f:72:e5 (oui Unknown), length 347

Also, the neutron dnsmasq logs and the ironic-inspector logs are attached to this mail.

Regards
Anirudh Gupta
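P.S. Since the DHCP exchange itself looks fine, my next step is to capture only TFTP to see whether the node ever requests a boot file. The initial read requests should go to port 69:

    tcpdump -i ens2f0 -nn port 69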
Hi Team,

On further debugging, I found an error in the neutron-server logs:

Failed to bind port 476d8175-ffc2-49ba-bb12-0a77c1f07e5f on host f4a43fa5-9c41-488e-a34d-714ae5a9d300 for vnic_type baremetal using segments [{'id': '1a5bbe96-2488-4971-925f-7c9346ba3ef5', 'network_type': 'flat', 'physical_network': 'physnet1', 'segmentation_id': None, 'network_id': '5b6cccec-ad86-4ed9-8d3c-72a31ec3a0d4'}]
2021-08-05 16:33:06.979 23 INFO neutron.plugins.ml2.plugin [req-54d11d51-7319-43ea-b70c-fe39d8aafe8a 21d6a238438e4294912746bcdc895e31 3eca725754e1405eb178cc39bd0da3aa - default default] Attempt 9 to bind port 476d8175-ffc2-49ba-bb12-0a77c1f07e5f

where 476d8175-ffc2-49ba-bb12-0a77c1f07e5f is the UUID of the Neutron port created for the bare metal node.

However, the port is created in OpenStack, but its state is DOWN:

[ansible@localhost ~]$ openstack port list
+--------------------------------------+------+-------------------+---------------------------------------------------------------------------+--------+
| ID                                   | Name | MAC Address       | Fixed IP Addresses                                                        | Status |
+--------------------------------------+------+-------------------+---------------------------------------------------------------------------+--------+
| 07d6b83d-d83c-498f-8ba8-b4f21bef7249 |      | fa:16:3e:38:05:9d | ip_address='10.0.1.200', subnet_id='7b72c158-2146-4bd6-893b-bd76b4a3e869' | ACTIVE |
| 476d8175-ffc2-49ba-bb12-0a77c1f07e5f |      | *98:f2:b3:3f:72:d8* | ip_address='10.0.1.202', subnet_id='7b72c158-2146-4bd6-893b-bd76b4a3e869' | *DOWN* |
+--------------------------------------+------+-------------------+---------------------------------------------------------------------------+--------+

*98:f2:b3:3f:72:d8* is the MAC address of my bare metal node on which PXE is enabled.

Can someone please help in resolving this issue?

*Issue: Node goes into clean_failed from clean_wait.*

Regards
Anirudh Gupta
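P.S. From what I have read, binding ports with vnic_type=baremetal needs the networking-baremetal ML2 mechanism driver and a running ironic-neutron-agent, so I am checking along these lines (the config path inside the container is my assumption of the usual layout):

    docker exec neutron_server grep mechanism_drivers /etc/neutron/plugins/ml2/ml2_conf.ini
    # expecting something like: mechanism_drivers = openvswitch,baremetal
    openstack network agent list | grep -i baremetal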
Hi Dmitry,

I took a TCPDUMP while the bare metal node was booting up, filtered for the tftp protocol, and found some "*File Not Found*" traces for bootx64.efi.

[image: image.png]

Then I found a related post on openstack-discuss which suggested enabling iPXE:

http://lists.openstack.org/pipermail/openstack-discuss/2019-October/010329.h...

After re-deploying the setup with iPXE enabled, I found similar traces, now for the *ipxe.efi* file.

[image: image.png]

Can you please suggest what could be missing in the configuration and the steps to resolve it?

For your reference, I am attaching the complete tcpdump logs of both scenarios.

Looking forward to hearing from you.

Regards
Anirudh Gupta
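P.S. To see whether the boot file is actually present where TFTP serves it from, I am also checking the PXE container's TFTP root, roughly like this (the container name and path are my assumptions about the Kolla layout; please correct me if they differ):

    docker exec ironic_pxe ls -l /tftpboot
    # looking for ipxe.efi here (and pxelinux.0 for legacy BIOS boot)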
Hi!

It might be a Kolla issue; please ping the Kolla devs.

Dmitry
Hi Dmitry,
I tried taking TCPDUMP while the Baremetal Node was booting up and looked for tftp protocols and found there was some "*File Not Found" *traces for bootx64.efi
[image: image.png]
Then, I found a related post on openstack Discuss which suggested to enable IPXE
http://lists.openstack.org/pipermail/openstack-discuss/2019-October/010329.h...
After re-deploying the setup with IPXE enabled, i found similar traces now for *ipxe.efi file*
[image: image.png]
Can you please now suggest what possibly could be a miss in configuration and steps to resolve it.
For your reference, I am attaching the complete tcpdump logs of both the Scenarios
Looking forward to hearing from you.
Regards Anirudh Gupta
On Thu, Aug 5, 2021 at 4:56 PM Anirudh Gupta <anyrude10@gmail.com> wrote:
Hi Team,
On further debugging, I found an error in neutron-server logs
Failed to bind port 476d8175-ffc2-49ba-bb12-0a77c1f07e5f on host f4a43fa5-9c41-488e-a34d-714ae5a9d300 for vnic_type baremetal using segments [{'id': '1a5bbe96-2488-4971-925f-7c9346ba3ef5', 'network_type': 'flat', 'physical_network': 'physnet1', 'segmentation_id': None, 'network_id': '5b6cccec-ad86-4ed9-8d3c-72a31ec3a0d4'}] 2021-08-05 16:33:06.979 23 INFO neutron.plugins.ml2.plugin [req-54d11d51-7319-43ea-b70c-fe39d8aafe8a 21d6a238438e4294912746bcdc895e31 3eca725754e1405eb178cc39bd0da3aa - default default] Attempt 9 to bind port 476d8175-ffc2-49ba-bb12-0a77c1f07e5f
where 476d8175-ffc2-49ba-bb12-0a77c1f07e5f is the uuid of Baremetal Node
However the port is created in openstack, but its state is down
[ansible@localhost ~]$ openstack port list
+--------------------------------------+------+-------------------+---------------------------------------------------------------------------+--------+
| ID                                   | Name | MAC Address       | Fixed IP Addresses                                                        | Status |
+--------------------------------------+------+-------------------+---------------------------------------------------------------------------+--------+
| 07d6b83d-d83c-498f-8ba8-b4f21bef7249 |      | fa:16:3e:38:05:9d | ip_address='10.0.1.200', subnet_id='7b72c158-2146-4bd6-893b-bd76b4a3e869' | ACTIVE |
| 476d8175-ffc2-49ba-bb12-0a77c1f07e5f |      | 98:f2:b3:3f:72:d8 | ip_address='10.0.1.202', subnet_id='7b72c158-2146-4bd6-893b-bd76b4a3e869' | DOWN   |
+--------------------------------------+------+-------------------+---------------------------------------------------------------------------+--------+
*98:f2:b3:3f:72:d8* is the MAC address of my Baremetal Node on which PXE is enabled.
Can someone please help in resolving this issue?

*Issue: Node goes into clean_failed from clean_wait.*
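(Two checks that may narrow this down; sketched with typical kolla container names and standard client fields, not verified against this deployment. A vnic_type=baremetal port can only bind if an ML2 mechanism driver such as networking-baremetal is loaded and its ironic-neutron-agent is alive, and Ironic records the cleaning failure reason on the node itself:)

[ansible@localhost ~]$ openstack network agent list                        # look for an alive ironic-neutron-agent entry
[ansible@localhost ~]$ docker exec neutron_server grep mechanism_drivers /etc/neutron/plugins/ml2/ml2_conf.ini
[ansible@localhost ~]$ openstack baremetal node show <node id> -f value -c provision_state -c last_error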
Regards Anirudh Gupta
On Tue, Aug 3, 2021 at 8:32 PM Anirudh Gupta <anyrude10@gmail.com> wrote:
Hi Dmitry,
I might be wrong, but as per my understanding, if there were an issue in dnsmasq, then the IP 20.20.20.10 would not have been assigned to the machine.
TCPDUMP logs are as below:
20:16:58.938089 IP controller.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 312
20:17:02.765291 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 98:f2:b3:3f:72:e5 (oui Unknown), length 359
20:17:02.766303 IP controller.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 312
20:17:26.944378 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 98:f2:b3:3f:72:e5 (oui Unknown), length 347
20:17:26.944756 IP controller.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 312
20:17:30.763627 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 98:f2:b3:3f:72:e5 (oui Unknown), length 359
20:17:30.764620 IP controller.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 312
20:17:54.938791 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 98:f2:b3:3f:72:e5 (oui Unknown), length 347
Also the neutron dnsmasq logs and ironic inspector logs are attached in the mail.
Regards Anirudh Gupta
On Tue, Aug 3, 2021 at 7:29 PM Dmitry Tantsur <dtantsur@redhat.com> wrote:
Hi,
You need to check the dnsmasq logs (there are two dnsmasqs: one from neutron and one from ironic-inspector). tcpdump may also help to determine where the packets are lost.
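(In a kolla-ansible deployment that translates roughly to the following; a sketch using the usual kolla container names, not verified here:)

[ansible@localhost ~]$ docker logs ironic_dnsmasq                              # ironic/inspector-side dnsmasq, configured by the ironic_dnsmasq_* settings
[ansible@localhost ~]$ docker exec neutron_dhcp_agent ps aux | grep dnsmasq    # neutron-side dnsmasq processes live in this container
[ansible@localhost ~]$ tcpdump -ni ens2f0 port 67 or port 68 or port 69        # watch DHCP and TFTP on the data interface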
Dmitry
On Fri, Jul 30, 2021 at 10:29 PM Anirudh Gupta <anyrude10@gmail.com> wrote:
Hi Dmitry
Thanks for your time.
My system is getting the IP 20.20.20.10, which is in the range defined in the ironic_dnsmasq_dhcp_range field in the globals.yml file.
ironic_dnsmasq_dhcp_range: "20.20.20.10,20.20.20.100"
And in the cleaning network (public1), the range defined is 20.20.20.150-20.20.20.200
As per my understanding, these 2 ranges should be mutually exclusive.
Please suggest if my understanding is not correct.
Any suggestions on what I should do to resolve this issue?
Regards Anirudh Gupta
On Sat, 31 Jul, 2021, 12:06 am Dmitry Tantsur, <dtantsur@redhat.com> wrote:
The IP address is not from the cleaning range; it may come from inspection. You probably need to investigate your network topology, maybe using tcpdump.
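(One way to confirm where 20.20.20.10 came from; --fixed-ip is a standard openstack port list filter:)

[ansible@localhost ~]$ openstack port list --fixed-ip ip-address=20.20.20.10

If no neutron port owns that address, the lease came from the ironic/inspector dnsmasq range rather than from a port on the cleaning network.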
Unfortunately, I'm not fluent enough in Kolla to say whether it can be a bug or not.
Dmitry
> > On verifying the logs, I could see the below error messages > > > - In */var/log/kolla/ironic/ironic-conductor.log*, we observed > the following error: > > ERROR ironic.conductor.utils [-] Cleaning for node > 3a56748e-a8ca-4dec-a332-ace18e6d494e failed. *Timeout reached while > cleaning the node. Please check if the ramdisk responsible for the cleaning > is running on the node. Failed on step {}.* > > > Note : For Cleaning the node, we have used the below images > > > > https://tarballs.openstack.org/ironic-python-agent/dib/files/ipa-centos8-mas... > > > https://tarballs.openstack.org/ironic-python-agent/dib/files/ipa-centos8-mas... > > > - In /var/log/kolla/nova/nova-compute-ironic.log, we observed > the error > > ERROR nova.compute.manager [req-810ffedf-3343-471c-94db-85411984e6cc > - - - - -] No compute node record for host controller-ironic: > nova.exception_Remote.ComputeHostNotFound_Remote: Compute host > controller-ironic could not be found. > > > Can someone please help in this regard? > > Regards > Anirudh Gupta > > > On Tue, Jul 27, 2021 at 12:52 PM Anirudh Gupta <anyrude10@gmail.com> > wrote: > >> Hi Team, >> >> We have deployed 2 node kolla ansible *12.0.0* in order to deploy >> openstack *wallaby* release. We have also enabled ironic in order >> to provision the bare metal nodes. >> >> On each server we have 3 nics >> >> - *eno1* - OAM for external connectivity and endpoint's >> publicURL >> - *eno2* - Mgmt for internal communication between various >> openstack services. >> - *ens2f0* - Data Interface >> >> >> Corresponding to this we have defined the following fields in >> globals.yml >> >> >> - kolla_base_distro: "centos" >> - kolla_install_type: "source" >> - openstack_release: "wallaby" >> - network_interface: "eno2" # >> MGMT interface >> - kolla_external_vip_interface: "eno1" # OAM >> Interface >> - kolla_internal_vip_address: "192.168.10.3" # MGMT Subnet >> free ip >> - kolla_external_vip_address: "10.0.1.136" # OAM subnet >> free IP >> - neutron_external_interface: "ens2f0" # Data >> Interface >> - enable_neutron_provider_networks: "yes" >> >> Note: Only relevant fields are being shown in this query >> >> Also, for ironic following fields have been defined in globals.yml >> >> - enable_ironic: "yes" >> - enable_ironic_neutron_agent: "{{ enable_neutron | bool and >> enable_ironic | bool }}" >> - enable_horizon_ironic: "{{ enable_ironic | bool }}" >> - ironic_dnsmasq_interface: "*ens2f0*" # >> Data interface >> - ironic_dnsmasq_dhcp_range: "20.20.20.10,20.20.20.100" >> - ironic_dnsmasq_boot_file: "pxelinux.0" >> - ironic_cleaning_network: "public1" >> - ironic_dnsmasq_default_gateway: "20.20.20.1" >> >> >> After successful deployment, a flat provider network with the name >> public1 is being created in openstack using the below commands: >> >> >> - openstack network create public1 --provider-network-type flat >> --provider-physical-network physnet1 >> - openstack subnet create subnet1 --network public1 >> --subnet-range 20.20.20.0/24 --allocation-pool >> start=20.20.20.10,end=20.20.20.100 --ip-version=4 --gateway=20.20.20.1 >> --dhcp >> >> >> Issue/Queries: >> >> >> - Is the configuration done in globals.yml correct or is there >> anything else that needs to be done in order to separate control and data >> plane traffic? >> >> >> - Also I have set automated_cleaning as "true" in >> ironic-conductor conatiner settings.But after creating the baremetal node, >> we run "node manage" command which runs successfully. 
Running "*openstack >> baremetal node provide <node id>"* command powers on the >> machine, sets the boot mode on Network Boot but no DHCP request for that >> particular mac is obtained on the controller. Is there anything I am >> missing that needs to be done in order to make ironic work? >> >> Note: I have also verified that the nic is PXE enabled in system >> configuration setting >> >> Regards >> Anirudh Gupta >> >> >>
-- Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn, Commercial register: Amtsgericht Muenchen, HRB 153243, Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill
-- Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn, Commercial register: Amtsgericht Muenchen, HRB 153243, Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill
-- Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn, Commercial register: Amtsgericht Muenchen, HRB 153243, Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill
On Fri, 6 Aug 2021 at 13:49, Anirudh Gupta <anyrude10@gmail.com> wrote:
Can you please suggest what could be missing in the configuration and the steps to resolve it?
Hi Anirudh,

I'd suggest installing a tftp client on your machine and making some requests. The TFTP daemon runs in the ironic_pxe container, and TFTP files are served from /tftpboot in that container.

Mark
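(A sketch of such a test, assuming the tftp-hpa client; the server address placeholder is illustrative, use the controller's IP on the data interface:)

[ansible@localhost ~]$ docker exec ironic_pxe ls /tftpboot                   # confirm boot files such as ipxe.efi are actually present
[ansible@localhost ~]$ tftp <controller-ip-on-ens2f0> -c get ipxe.efi        # a successful get rules out the "File Not Found" seen in the capture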
Hi Mark,

Earlier I was passing the boot_mode as uefi while creating the baremetal node. On the Kolla-Ansible Launchpad, I found some issues related to UEFI mode, so I didn't pass the parameter.

With iPXE and without passing the UEFI boot mode parameter, my node started cleaning and connected with the TFTP server. But for the last 2 hours, the state is still *clean_wait* only.

The ramdisk and kernel images I used were the ones mentioned in the links below:

- https://tarballs.openstack.org/ironic-python-agent/dib/files/ipa-centos8-mas...
- https://tarballs.openstack.org/ironic-python-agent/dib/files/ipa-centos8-mas...

For this I followed the latest kolla ansible document:

- https://docs.openstack.org/kolla-ansible/latest/reference/bare-metal/ironic-...

All I can see in the *ironic-conductor* logs is:

2021-08-09 13:49:51.159 7 DEBUG ironic.drivers.modules.agent_base [-] Heartbeat from node 8b1ec553-fbc9-4912-bd33-88afc41b8f81 heartbeat /var/lib/kolla/venv/lib/python3.6/site-packages/ironic/drivers/modules/agent_base.py:641

2021-08-09 13:49:51.178 7 DEBUG ironic.drivers.modules.agent_client [-] Fetching status of agent commands for node 8b1ec553-fbc9-4912-bd33-88afc41b8f81 get_commands_status /var/lib/kolla/venv/lib/python3.6/site-packages/ironic/drivers/modules/agent_client.py:310

2021-08-09 13:49:51.186 7 DEBUG ironic.drivers.modules.agent_client [-] Status of agent commands for node 8b1ec553-fbc9-4912-bd33-88afc41b8f81: get_clean_steps: result "{'clean_steps': {'GenericHardwareManager': [{'step': 'erase_devices', 'priority': 10, 'interface': 'deploy', 'reboot_requested': False, 'abortable': True}, {'step': 'erase_devices_metadata', 'priority': 99, 'interface': 'deploy', 'reboot_requested': False, 'abortable': True}, {'step': 'erase_pstore', 'priority': 0, 'interface': 'deploy', 'reboot_requested': False, 'abortable': True}, {'step': 'delete_configuration', 'priority': 0, 'interface': 'raid', 'reboot_requested': False, 'abortable': True}, {'step': 'create_configuration', 'priority': 0, 'interface': 'raid', 'reboot_requested': False, 'abortable': True}, {'step': 'burnin_cpu', 'priority': 0, 'interface': 'deploy', 'reboot_requested': False, 'abortable': True}, {'step': 'burnin_disk', 'priority': 0, 'interface': 'deploy', 'reboot_requested': False, 'abortable': True}, {'step': 'burnin_memory', 'priority': 0, 'interface': 'deploy', 'reboot_requested': False, 'abortable': True}, {'step': 'burnin_network', 'priority': 0, 'interface': 'deploy', 'reboot_requested': False, 'abortable': True}]}, 'hardware_manager_version': {'MellanoxDeviceHardwareManager': '1', 'generic_hardware_manager': '1.1'}}", error "None"; execute_clean_step: result "{'clean_result': None, 'clean_step': {'step': 'erase_devices_metadata', 'priority': 99, 'interface': 'deploy', 'reboot_requested': False, 'abortable': True, 'requires_ramdisk': True}}", error "None"; execute_clean_step: result "None", error "None" get_commands_status /var/lib/kolla/venv/lib/python3.6/site-packages/ironic/drivers/modules/agent_client.py:342

2021-08-09 13:49:51.186 7 DEBUG ironic.drivers.modules.agent_base [-] *Clean step still running for node 8b1ec553-fbc9-4912-bd33-88afc41b8f81:* None _get_completed_command /var/lib/kolla/venv/lib/python3.6/site-packages/ironic/drivers/modules/agent_base.py:267

It would be a great help if you could suggest some pointers.

Regards
Anirudh Gupta
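(Reading the log above: erase_devices_metadata has already returned a result, while a later execute_clean_step is still pending, which by step priority would be the full erase_devices shred; on a large spinning disk that single step can run for many hours. Progress can be polled with standard node fields, e.g.:)

[ansible@localhost ~]$ openstack baremetal node show 8b1ec553-fbc9-4912-bd33-88afc41b8f81 -f value -c provision_state -c clean_step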
Hi All,

I had a 900 GB hard disk on my Baremetal Node, and it took approx *15 hours* for the baremetal node to come into the *available* state from the *clean_wait* state.

Once the baremetal node became available, I was able to create a server and provision it with a user image.

Is taking 15 hours to erase_device in clean_wait normal for a 900 GB hard disk in Ironic?

Regards
Anirudh Gupta
Hi,

On Fri, Aug 13, 2021 at 8:56 AM Anirudh Gupta <anyrude10@gmail.com> wrote:
Is taking 15 hours to erase_device in clean_wait normal for a 900 GB hard disk in Ironic?
Unfortunately, yes. If your hardware does not support ATA secure erase, the only way we can remove data is to shred the disk (essentially, write all 900 GB several times).

If you don't care about residual data on the disks, you can switch to metadata cleaning (only partition tables). This is fast but insecure.

I don't know how the options are called in Kolla, but in Ironic you do something like this:

[deploy]
erase_devices_priority = 0
erase_devices_metadata_priority = 10

Dmitry
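(The timing is plausible on rotating media: 900 GB at a sequential write rate of roughly 50 MB/s is about 5 hours per pass, and shred-based erasure writes the disk more than once. In Kolla-Ansible the usual way to carry such options is a service config override; a sketch assuming the default /etc/kolla/config override directory, not verified against this deployment:)

# /etc/kolla/config/ironic.conf -- merged into the Ironic services' ironic.conf on the next reconfigure
[deploy]
erase_devices_priority = 0
erase_devices_metadata_priority = 10

[ansible@localhost ~]$ kolla-ansible -i <inventory> reconfigure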
On Fri, 20 Aug 2021 at 15:08, Dmitry Tantsur <dtantsur@redhat.com> wrote:
Hi,
On Fri, Aug 13, 2021 at 8:56 AM Anirudh Gupta <anyrude10@gmail.com> wrote:
Hi All,
I had a 900 GB hard disk on my Baremetal Node, and it took approximately *15 hours* for the node to move from the *clean_wait* state to the *available* state.
Once the baremetal node became available, I was able to create a server and provision it with a user image.
Is taking 15 hours to run erase_devices in clean_wait normal for a 900 GB hard disk in Ironic?
Unfortunately, yes. If your hardware does not support ATA secure erase, the only way we can remove data is to shred the disk (essentially, write all 900 GB several times).
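As a rough sanity check (the pass count and throughput here are assumptions, not measurements): two full overwrite passes over a 900 GB disk amount to about 1.8 TB of writes, and at a sustained ~33 MB/s that is roughly 1,800,000 MB / 33 MB/s ≈ 54,500 s, i.e. about 15 hours. The observed duration is therefore plausible for plain disk shredding.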
If you don't care about residual data on the disks, you can switch to metadata cleaning (only partition tables). This is fast but insecure. I don't know what the options are called in Kolla, but in Ironic you do something like this:
[deploy]
erase_devices_priority = 0
erase_devices_metadata_priority = 10
In Kolla we do not try to own all options; simply create a config override file at /etc/kolla/config/ironic.conf including the above. See https://docs.openstack.org/kolla-ansible/latest/admin/advanced-configuration...
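A minimal sketch of that workflow, assuming a standard Kolla-Ansible install (the inventory path is a placeholder):

  # /etc/kolla/config/ironic.conf
  [deploy]
  erase_devices_priority = 0
  erase_devices_metadata_priority = 10

Then re-render the Ironic configuration and restart the affected containers:

  kolla-ansible -i /path/to/inventory reconfigure --tags ironic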
Dmitry
Regards Anirudh Gupta
On Mon, Aug 9, 2021 at 2:01 PM Anirudh Gupta <anyrude10@gmail.com> wrote:
Hi Mark,
Earlier I was passing boot_mode as uefi while creating the baremetal node. On the Kolla-Ansible Launchpad I found some issues related to UEFI mode, so I stopped passing that parameter.
With iPXE enabled and without passing the UEFI boot mode parameter, my node started cleaning and connected to the TFTP server.
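For reference, a minimal sketch of how iPXE is typically switched on in globals.yml; the variable name below is my recollection of the Wallaby-era Kolla-Ansible option and should be verified against your release's documentation:

  enable_ironic_ipxe: "yes"

With this set, Kolla-Ansible serves iPXE binaries (such as ipxe.efi for UEFI machines) instead of the plain PXE boot file.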
But for the last 2 hours, the node has remained in the *clean_wait* state.
The ramdisk and kernel images I used were the ones mentioned in the links below:
- https://tarballs.openstack.org/ironic-python-agent/dib/files/ipa-centos8-mas...
- https://tarballs.openstack.org/ironic-python-agent/dib/files/ipa-centos8-mas...
For this, I followed the latest Kolla-Ansible document:
- https://docs.openstack.org/kolla-ansible/latest/reference/bare-metal/ironic-...
All I can see in *ironic-conductor* logs is:
2021-08-09 13:49:51.159 7 DEBUG ironic.drivers.modules.agent_base [-] Heartbeat from node 8b1ec553-fbc9-4912-bd33-88afc41b8f81 heartbeat /var/lib/kolla/venv/lib/python3.6/site-packages/ironic/drivers/modules/agent_base.py:641
2021-08-09 13:49:51.178 7 DEBUG ironic.drivers.modules.agent_client [-] Fetching status of agent commands for node 8b1ec553-fbc9-4912-bd33-88afc41b8f81 get_commands_status /var/lib/kolla/venv/lib/python3.6/site-packages/ironic/drivers/modules/agent_client.py:310
2021-08-09 13:49:51.186 7 DEBUG ironic.drivers.modules.agent_client [-] Status of agent commands for node 8b1ec553-fbc9-4912-bd33-88afc41b8f81: get_clean_steps: result "{'clean_steps': {'GenericHardwareManager': [{'step': 'erase_devices', 'priority': 10, 'interface': 'deploy', 'reboot_requested': False, 'abortable': True}, {'step': 'erase_devices_metadata', 'priority': 99, 'interface': 'deploy', 'reboot_requested': False, 'abortable': True}, {'step': 'erase_pstore', 'priority': 0, 'interface': 'deploy', 'reboot_requested': False, 'abortable': True}, {'step': 'delete_configuration', 'priority': 0, 'interface': 'raid', 'reboot_requested': False, 'abortable': True}, {'step': 'create_configuration', 'priority': 0, 'interface': 'raid', 'reboot_requested': False, 'abortable': True}, {'step': 'burnin_cpu', 'priority': 0, 'interface': 'deploy', 'reboot_requested': False, 'abortable': True}, {'step': 'burnin_disk', 'priority': 0, 'interface': 'deploy', 'reboot_requested': False, 'abortable': True}, {'step': 'burnin_memory', 'priority': 0, 'interface': 'deploy', 'reboot_requested': False, 'abortable': True}, {'step': 'burnin_network', 'priority': 0, 'interface': 'deploy', 'reboot_requested': False, 'abortable': True}]}, 'hardware_manager_version': {'MellanoxDeviceHardwareManager': '1', 'generic_hardware_manager': '1.1'}}", error "None"; execute_clean_step: result "{'clean_result': None, 'clean_step': {'step': 'erase_devices_metadata', 'priority': 99, 'interface': 'deploy', 'reboot_requested': False, 'abortable': True, 'requires_ramdisk': True}}", error "None"; execute_clean_step: result "None", error "None" get_commands_status /var/lib/kolla/venv/lib/python3.6/site-packages/ironic/drivers/modules/agent_client.py:342
2021-08-09 13:49:51.186 7 DEBUG ironic.drivers.modules.agent_base [-] *Clean step still running for node 8b1ec553-fbc9-4912-bd33-88afc41b8f81:* None _get_completed_command /var/lib/kolla/venv/lib/python3.6/site-packages/ironic/drivers/modules/agent_base.py:267
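To watch the clean step from the API side while this is happening, the standard baremetal CLI can be used; the node UUID below is taken from the log above:

  openstack baremetal node show 8b1ec553-fbc9-4912-bd33-88afc41b8f81 -f value -c provision_state -c clean_step -c last_error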
It would be a great help if you could suggest some pointers.
Regards Anirudh Gupta
On Mon, Aug 9, 2021 at 1:43 PM Mark Goddard <mark@stackhpc.com> wrote:
On Fri, 6 Aug 2021 at 13:49, Anirudh Gupta <anyrude10@gmail.com> wrote:
Hi Dmitry,
I tried taking a tcpdump while the Baremetal Node was booting up, looked for TFTP traffic, and found some "*File Not Found*" traces for bootx64.efi
[image: image.png]
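For anyone reproducing this, a capture along the following lines should show both the DHCP exchange and the requested TFTP filenames; the interface name is the data interface from the globals.yml above:

  tcpdump -i ens2f0 -n port 67 or port 68 or port 69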
Then I found a related post on openstack-discuss which suggested enabling iPXE:
http://lists.openstack.org/pipermail/openstack-discuss/2019-October/010329.h...
After re-deploying the setup with iPXE enabled, I found similar traces, now for the *ipxe.efi* file
[image: image.png]
Can you please suggest what could be missing in the configuration, and the steps to resolve it?
Hi Anirudh,
I'd suggest installing a tftp client on your machine and making some requests. The TFTP daemon runs in the ironic_pxe container, and TFTP files are served from /tftpboot in that container.
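A minimal sketch of such a request, assuming the tftp-hpa client; replace the IP with the controller's address on the provisioning network:

  tftp 20.20.20.x -c get ipxe.efi

The served files can also be listed directly inside the container:

  docker exec ironic_pxe ls -l /tftpboot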
Mark
For your reference, I am attaching the complete tcpdump logs of both scenarios.
Looking forward to hearing from you.
Regards Anirudh Gupta
On Thu, Aug 5, 2021 at 4:56 PM Anirudh Gupta <anyrude10@gmail.com> wrote:
Hi Team,
On further debugging, I found an error in the neutron-server logs:
Failed to bind port 476d8175-ffc2-49ba-bb12-0a77c1f07e5f on host f4a43fa5-9c41-488e-a34d-714ae5a9d300 for vnic_type baremetal using segments [{'id': '1a5bbe96-2488-4971-925f-7c9346ba3ef5', 'network_type': 'flat', 'physical_network': 'physnet1', 'segmentation_id': None, 'network_id': '5b6cccec-ad86-4ed9-8d3c-72a31ec3a0d4'}]
2021-08-05 16:33:06.979 23 INFO neutron.plugins.ml2.plugin [req-54d11d51-7319-43ea-b70c-fe39d8aafe8a 21d6a238438e4294912746bcdc895e31 3eca725754e1405eb178cc39bd0da3aa - default default] Attempt 9 to bind port 476d8175-ffc2-49ba-bb12-0a77c1f07e5f
where 476d8175-ffc2-49ba-bb12-0a77c1f07e5f is the UUID of the Neutron port created for my Baremetal Node (see the port list below).
The port does get created in OpenStack, but its state is DOWN:
[ansible@localhost ~]$ openstack port list
+--------------------------------------+------+-------------------+----------------------------------------------------------------------------+--------+
| ID                                   | Name | MAC Address       | Fixed IP Addresses                                                         | Status |
+--------------------------------------+------+-------------------+----------------------------------------------------------------------------+--------+
| 07d6b83d-d83c-498f-8ba8-b4f21bef7249 |      | fa:16:3e:38:05:9d | ip_address='10.0.1.200', subnet_id='7b72c158-2146-4bd6-893b-bd76b4a3e869'  | ACTIVE |
| 476d8175-ffc2-49ba-bb12-0a77c1f07e5f |      | 98:f2:b3:3f:72:d8 | ip_address='10.0.1.202', subnet_id='7b72c158-2146-4bd6-893b-bd76b4a3e869'  | DOWN   |
+--------------------------------------+------+-------------------+----------------------------------------------------------------------------+--------+
*98:f2:b3:3f:72:d8* is the MAC address of my Baremetal Node, on which PXE is enabled.
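As a cross-check, the MAC address registered with Ironic should match this Neutron port, and the port's binding fields often hint at why binding failed. Both commands below are standard CLI; the node UUID is a placeholder:

  openstack baremetal port list --node <node-uuid> --fields uuid address
  openstack port show 476d8175-ffc2-49ba-bb12-0a77c1f07e5f -c status -c binding_vif_type -c binding_host_id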
Can someone please help in resolving this issue?
*Issue:* *Node goes into clean_failed from clean_wait.*
Regards Anirudh Gupta
Hi,

I tried creating a config override file at /etc/kolla/config/ironic.conf and specified the items below:

[deploy]
erase_devices_priority = 0
erase_devices_metadata_priority = 10

With this, the cleaning process took approximately 15 minutes to complete.

I then tried provisioning a Baremetal Node that has 2 hard disks with the above configuration. Below are my observations:

On one of the hard disks, the OS was installed. We tried mounting the other drive to a particular mount point and initially got an *"unknown fs type"* error, so we formatted the sdb drive as XFS using the mkfs.xfs command (we also tried ext4), after which it mounted successfully. We then put some data onto that disk.

Now, when the baremetal server is recreated (openstack server delete and then openstack server create), according to my understanding the data on the 2nd hard drive (sdb) should remain intact; only the OS disk should be wiped. But when I tried mounting the 2nd drive again, it gave the same *"unknown fs type"* error.

That means that without formatting sdb I am not able to mount it, which means I cannot access the data stored on sdb, which ideally should not be the case.

Is there any additional setting that we need to do in Ironic in order to make this work?

Regards
Anirudh Gupta
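A rough sketch of the second-disk steps described above, for reproducibility; the device name, filesystem, and mount point follow this report and are otherwise placeholders:

  mkfs.xfs /dev/sdb             # ext4 was also tried
  mkdir -p /mnt/data            # hypothetical mount point
  mount /dev/sdb /mnt/data
  cp -r <test-data> /mnt/data   # placeholder for the data written before re-provisioning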
participants (3)
- Anirudh Gupta
- Dmitry Tantsur
- Mark Goddard