[Kolla][Kolla-Ansible] Ironic Node Cleaning Failed

Anirudh Gupta anyrude10 at gmail.com
Mon Aug 9 08:31:24 UTC 2021


Hi Mark,

Earlier I was passing boot_mode as uefi when creating the baremetal
node.
On the Kolla-Ansible Launchpad, I found some issues related to UEFI mode, so I
stopped passing that parameter.

With iPXE and without the UEFI boot mode parameter, my node started
cleaning. It connected to the TFTP server.

But for the last 2 hours, the node has remained in *clean_wait*.

The ramdisk and kernel images I used were the ones mentioned in the link
below


   - https://tarballs.openstack.org/ironic-python-agent/dib/files/ipa-centos8-master.kernel
   - https://tarballs.openstack.org/ironic-python-agent/dib/files/ipa-centos8-master.initramfs


For this I followed the latest Kolla-Ansible documentation:

   - https://docs.openstack.org/kolla-ansible/latest/reference/bare-metal/ironic-guide.html


All I can see in *ironic-conductor* logs is:

2021-08-09 13:49:51.159 7 DEBUG ironic.drivers.modules.agent_base [-]
Heartbeat from node 8b1ec553-fbc9-4912-bd33-88afc41b8f81 heartbeat
/var/lib/kolla/venv/lib/python3.6/site-packages/ironic/drivers/modules/agent_base.py:641
2021-08-09 13:49:51.178 7 DEBUG ironic.drivers.modules.agent_client [-]
Fetching status of agent commands for node
8b1ec553-fbc9-4912-bd33-88afc41b8f81 get_commands_status
/var/lib/kolla/venv/lib/python3.6/site-packages/ironic/drivers/modules/agent_client.py:310
2021-08-09 13:49:51.186 7 DEBUG ironic.drivers.modules.agent_client [-]
Status of agent commands for node 8b1ec553-fbc9-4912-bd33-88afc41b8f81:
get_clean_steps: result "{'clean_steps': {'GenericHardwareManager':
[{'step': 'erase_devices', 'priority': 10, 'interface': 'deploy',
'reboot_requested': False, 'abortable': True}, {'step':
'erase_devices_metadata', 'priority': 99, 'interface': 'deploy',
'reboot_requested': False, 'abortable': True}, {'step': 'erase_pstore',
'priority': 0, 'interface': 'deploy', 'reboot_requested': False,
'abortable': True}, {'step': 'delete_configuration', 'priority': 0,
'interface': 'raid', 'reboot_requested': False, 'abortable': True},
{'step': 'create_configuration', 'priority': 0, 'interface': 'raid',
'reboot_requested': False, 'abortable': True}, {'step': 'burnin_cpu',
'priority': 0, 'interface': 'deploy', 'reboot_requested': False,
'abortable': True}, {'step': 'burnin_disk', 'priority': 0, 'interface':
'deploy', 'reboot_requested': False, 'abortable': True}, {'step':
'burnin_memory', 'priority': 0, 'interface': 'deploy', 'reboot_requested':
False, 'abortable': True}, {'step': 'burnin_network', 'priority': 0,
'interface': 'deploy', 'reboot_requested': False, 'abortable': True}]},
'hardware_manager_version': {'MellanoxDeviceHardwareManager': '1',
'generic_hardware_manager': '1.1'}}", error "None"; execute_clean_step:
result "{'clean_result': None, 'clean_step': {'step':
'erase_devices_metadata', 'priority': 99, 'interface': 'deploy',
'reboot_requested': False, 'abortable': True, 'requires_ramdisk': True}}",
error "None"; execute_clean_step: result "None", error "None"
get_commands_status
/var/lib/kolla/venv/lib/python3.6/site-packages/ironic/drivers/modules/agent_client.py:342
2021-08-09 13:49:51.186 7 DEBUG ironic.drivers.modules.agent_base [-] *Clean
step still running for node 8b1ec553-fbc9-4912-bd33-88afc41b8f81:* None
_get_completed_command
/var/lib/kolla/venv/lib/python3.6/site-packages/ironic/drivers/modules/agent_base.py:267
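
One way to read the log above: Ironic runs automated clean steps from the
highest priority to the lowest and skips steps whose priority is 0. The
following is a minimal sketch of that ordering, not Ironic's own code. The
helper name automated_clean_order is hypothetical, and the dictionary is a
trimmed copy of the get_clean_steps result shown in the log.

```python
import ast

# Trimmed copy of the get_clean_steps result from the ironic-conductor log.
# Only the keys needed for ordering are kept; values are unchanged.
CLEAN_STEPS_RESULT = """{'clean_steps': {'GenericHardwareManager': [
    {'step': 'erase_devices', 'priority': 10, 'interface': 'deploy'},
    {'step': 'erase_devices_metadata', 'priority': 99, 'interface': 'deploy'},
    {'step': 'erase_pstore', 'priority': 0, 'interface': 'deploy'},
    {'step': 'delete_configuration', 'priority': 0, 'interface': 'raid'},
    {'step': 'create_configuration', 'priority': 0, 'interface': 'raid'},
    {'step': 'burnin_cpu', 'priority': 0, 'interface': 'deploy'},
    {'step': 'burnin_disk', 'priority': 0, 'interface': 'deploy'},
    {'step': 'burnin_memory', 'priority': 0, 'interface': 'deploy'},
    {'step': 'burnin_network', 'priority': 0, 'interface': 'deploy'},
]}}"""


def automated_clean_order(result_text):
    """Return enabled step names, highest priority first (hypothetical helper)."""
    result = ast.literal_eval(result_text)
    # Flatten the steps reported by every hardware manager.
    steps = [s for mgr_steps in result['clean_steps'].values() for s in mgr_steps]
    # Priority 0 means the step is not part of automated cleaning.
    enabled = [s for s in steps if s['priority'] > 0]
    ordered = sorted(enabled, key=lambda s: s['priority'], reverse=True)
    return [s['step'] for s in ordered]


if __name__ == "__main__":
    print(automated_clean_order(CLEAN_STEPS_RESULT))
    # prints ['erase_devices_metadata', 'erase_devices']
```

Under that reading, erase_devices_metadata has already returned with error
"None", so the step still running may be erase_devices, which can take a long
time on large disks. This is an inference from the log, not something the log
confirms.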

It would be a great help if you could suggest some pointers.

Regards
Anirudh Gupta
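
The quoted thread below discusses whether ironic_dnsmasq_dhcp_range and the
cleaning subnet's allocation pool overlap, and which of the two the address
20.20.20.10 came from. A minimal sketch of that check, using only the ranges
quoted in the thread; the helper names ranges_overlap and in_range are
hypothetical.

```python
import ipaddress


def ranges_overlap(a_start, a_end, b_start, b_end):
    """True if the inclusive ranges [a_start, a_end] and [b_start, b_end] intersect."""
    a0, a1 = ipaddress.ip_address(a_start), ipaddress.ip_address(a_end)
    b0, b1 = ipaddress.ip_address(b_start), ipaddress.ip_address(b_end)
    return a0 <= b1 and b0 <= a1


def in_range(ip, start, end):
    """True if ip lies inside the inclusive range [start, end]."""
    return ipaddress.ip_address(start) <= ipaddress.ip_address(ip) <= ipaddress.ip_address(end)


# Ranges taken from the thread.
DNSMASQ_RANGE = ("20.20.20.10", "20.20.20.100")    # ironic_dnsmasq_dhcp_range
OLD_CLEANING_POOL = ("20.20.20.10", "20.20.20.100")  # first subnet1 allocation pool
NEW_CLEANING_POOL = ("20.20.20.150", "20.20.20.200")  # recreated subnet1 allocation pool

if __name__ == "__main__":
    print(ranges_overlap(*DNSMASQ_RANGE, *OLD_CLEANING_POOL))  # True
    print(ranges_overlap(*DNSMASQ_RANGE, *NEW_CLEANING_POOL))  # False
    print(in_range("20.20.20.10", *DNSMASQ_RANGE))             # True
    print(in_range("20.20.20.10", *NEW_CLEANING_POOL))         # False
```

With the recreated pool the two ranges no longer overlap, and 20.20.20.10 falls
in the dnsmasq range rather than the cleaning pool, which matches the
observation in the thread that the address is not from the cleaning range.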





On Mon, Aug 9, 2021 at 1:43 PM Mark Goddard <mark at stackhpc.com> wrote:

>
>
> On Fri, 6 Aug 2021 at 13:49, Anirudh Gupta <anyrude10 at gmail.com> wrote:
>
>> Hi Dmitry,
>>
>> I took a tcpdump while the baremetal node was booting, filtered for TFTP
>> traffic, and found "File Not Found" errors for bootx64.efi.
>>
>> [image: image.png]
>>
>> Then I found a related post on openstack-discuss which suggested
>> enabling iPXE:
>>
>> http://lists.openstack.org/pipermail/openstack-discuss/2019-October/010329.html
>>
>> After redeploying the setup with iPXE enabled, I found similar errors,
>> now for the *ipxe.efi* file.
>>
>> [image: image.png]
>>
>> Can you please suggest what might be missing in the configuration
>> and the steps to resolve it?
>>
>
> Hi Anirudh,
>
> I'd suggest installing a tftp client on your machine and making some
> requests. The TFTP daemon runs in the ironic_pxe container, and TFTP files
> are served from /tftpboot in that container.
>
> Mark
>
>>
>> For your reference, I am attaching the complete tcpdump logs of both
>> scenarios.
>>
>> Looking forward to hearing from you.
>>
>> Regards
>> Anirudh Gupta
>>
>>
>>
>>
>>
>> On Thu, Aug 5, 2021 at 4:56 PM Anirudh Gupta <anyrude10 at gmail.com> wrote:
>>
>>> Hi Team,
>>>
>>> On further debugging, I found an error in neutron-server logs
>>>
>>>
>>> Failed to bind port 476d8175-ffc2-49ba-bb12-0a77c1f07e5f on host
>>> f4a43fa5-9c41-488e-a34d-714ae5a9d300 for vnic_type baremetal using segments
>>> [{'id': '1a5bbe96-2488-4971-925f-7c9346ba3ef5', 'network_type': 'flat',
>>> 'physical_network': 'physnet1', 'segmentation_id': None, 'network_id':
>>> '5b6cccec-ad86-4ed9-8d3c-72a31ec3a0d4'}]
>>> 2021-08-05 16:33:06.979 23 INFO neutron.plugins.ml2.plugin
>>> [req-54d11d51-7319-43ea-b70c-fe39d8aafe8a 21d6a238438e4294912746bcdc895e31
>>> 3eca725754e1405eb178cc39bd0da3aa - default default] Attempt 9 to bind port
>>> 476d8175-ffc2-49ba-bb12-0a77c1f07e5f
>>>
>>> where 476d8175-ffc2-49ba-bb12-0a77c1f07e5f is the UUID of the port
>>> created for the baremetal node.
>>>
>>> The port is created in OpenStack, but its status is DOWN:
>>>
>>> [ansible at localhost ~]$ openstack port list
>>>
>>> +--------------------------------------+------+-------------------+---------------------------------------------------------------------------+--------+
>>> | ID                                   | Name | MAC Address       | Fixed IP Addresses                                                        | Status |
>>> +--------------------------------------+------+-------------------+---------------------------------------------------------------------------+--------+
>>> | 07d6b83d-d83c-498f-8ba8-b4f21bef7249 |      | fa:16:3e:38:05:9d | ip_address='10.0.1.200', subnet_id='7b72c158-2146-4bd6-893b-bd76b4a3e869' | ACTIVE |
>>> | 476d8175-ffc2-49ba-bb12-0a77c1f07e5f |      | 98:f2:b3:3f:72:d8 | ip_address='10.0.1.202', subnet_id='7b72c158-2146-4bd6-893b-bd76b4a3e869' | DOWN   |
>>> +--------------------------------------+------+-------------------+---------------------------------------------------------------------------+--------+
>>>
>>> *98:f2:b3:3f:72:d8* is the MAC address of the NIC on my baremetal node
>>> on which PXE is enabled.
>>>
>>> Can someone please help in resolving this issue?
>>>
>>> *Issue:*
>>> *Node goes in clean_failed from clean_wait.*
>>>
>>> Regards
>>> Anirudh Gupta
>>>
>>> On Tue, Aug 3, 2021 at 8:32 PM Anirudh Gupta <anyrude10 at gmail.com>
>>> wrote:
>>>
>>>> Hi Dmitry,
>>>>
>>>> I might be wrong, but as per my understanding, if there were an
>>>> issue with dnsmasq, the IP 20.20.20.10 would not have been assigned to the
>>>> machine.
>>>>
>>>> TCPDUMP logs are as below:
>>>>
>>>> 20:16:58.938089 IP controller.bootps > 255.255.255.255.bootpc:
>>>> BOOTP/DHCP, Reply, length 312
>>>> 20:17:02.765291 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP,
>>>> Request from 98:f2:b3:3f:72:e5 (oui Unknown), length 359
>>>> 20:17:02.766303 IP controller.bootps > 255.255.255.255.bootpc:
>>>> BOOTP/DHCP, Reply, length 312
>>>> 20:17:26.944378 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP,
>>>> Request from 98:f2:b3:3f:72:e5 (oui Unknown), length 347
>>>> 20:17:26.944756 IP controller.bootps > 255.255.255.255.bootpc:
>>>> BOOTP/DHCP, Reply, length 312
>>>> 20:17:30.763627 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP,
>>>> Request from 98:f2:b3:3f:72:e5 (oui Unknown), length 359
>>>> 20:17:30.764620 IP controller.bootps > 255.255.255.255.bootpc:
>>>> BOOTP/DHCP, Reply, length 312
>>>> 20:17:54.938791 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP,
>>>> Request from 98:f2:b3:3f:72:e5 (oui Unknown), length 347
>>>>
>>>> Also the neutron dnsmasq logs and ironic inspector logs are attached in
>>>> the mail.
>>>>
>>>> Regards
>>>> Anirudh Gupta
>>>>
>>>>
>>>> On Tue, Aug 3, 2021 at 7:29 PM Dmitry Tantsur <dtantsur at redhat.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> You need to check the dnsmasq logs (there are two dnsmasqs: from
>>>>> neutron and from ironic-inspector). tcpdump may also help to determine
>>>>> where the packets are lost.
>>>>>
>>>>> Dmitry
>>>>>
>>>>> On Fri, Jul 30, 2021 at 10:29 PM Anirudh Gupta <anyrude10 at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Dmitry
>>>>>>
>>>>>> Thanks for your time.
>>>>>>
>>>>>> My system is getting IP 20.20.20.10 which is in the range defined in
>>>>>> ironic_dnsmasq_dhcp_range field under globals.yml file.
>>>>>>
>>>>>> ironic_dnsmasq_dhcp_range: "20.20.20.10,20.20.20.100"
>>>>>>
>>>>>> And in the cleaning network (public1), the range defined is
>>>>>> 20.20.20.150-20.20.20.200
>>>>>>
>>>>>> As per my understanding, these 2 ranges should be mutually exclusive.
>>>>>>
>>>>>> Please suggest if my understanding is not correct.
>>>>>>
>>>>>> Any suggestions on what I should do to resolve this issue?
>>>>>>
>>>>>> Regards
>>>>>> Anirudh Gupta
>>>>>>
>>>>>>
>>>>>> On Sat, 31 Jul, 2021, 12:06 am Dmitry Tantsur, <dtantsur at redhat.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jul 29, 2021 at 6:05 PM Anirudh Gupta <anyrude10 at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Team,
>>>>>>>>
>>>>>>>> In addition to the email below, I have some updated information:
>>>>>>>>
>>>>>>>> Earlier, the allocation range set in "*ironic_dnsmasq_dhcp_range*"
>>>>>>>> in globals.yml overlapped with the allocation range of the cleaning
>>>>>>>> network, which caused an issue in receiving the DHCP request.
>>>>>>>>
>>>>>>>> After creating a cleaning network with a separate allocation range,
>>>>>>>> I am successfully getting IP allocated to my Baremetal Node
>>>>>>>>
>>>>>>>>    - openstack subnet create subnet1 --network public1
>>>>>>>>    --subnet-range 20.20.20.0/24 --allocation-pool
>>>>>>>>    start=20.20.20.150,end=20.20.20.200 --ip-version=4  --gateway=20.20.20.1
>>>>>>>>    --dhcp
>>>>>>>>
>>>>>>>>
>>>>>>>> [image: image.png]
>>>>>>>>
>>>>>>>> After getting the IP, there is no further action on the node. From "
>>>>>>>> *clean_wait*", it goes into "*clean_failed*" state after around
>>>>>>>> half an hour.
>>>>>>>>
>>>>>>>
>>>>>>> The IP address is not from the cleaning range, it may come from
>>>>>>> inspection. You probably need to investigate your network topology, maybe
>>>>>>> use tcpdump.
>>>>>>>
>>>>>>> Unfortunately, I'm not fluent enough in Kolla to say whether it can be
>>>>>>> a bug or not.
>>>>>>>
>>>>>>> Dmitry
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> On verifying the logs, I could see the below error messages
>>>>>>>>
>>>>>>>>
>>>>>>>>    - In */var/log/kolla/ironic/ironic-conductor.log*, we observed
>>>>>>>>    the following error:
>>>>>>>>
>>>>>>>> ERROR ironic.conductor.utils [-] Cleaning for node
>>>>>>>> 3a56748e-a8ca-4dec-a332-ace18e6d494e failed. *Timeout reached
>>>>>>>> while cleaning the node. Please check if the ramdisk responsible for the
>>>>>>>> cleaning is running on the node. Failed on step {}.*
>>>>>>>>
>>>>>>>>
>>>>>>>> Note : For Cleaning the node, we have used the below images
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> https://tarballs.openstack.org/ironic-python-agent/dib/files/ipa-centos8-master.kernel
>>>>>>>>
>>>>>>>>
>>>>>>>> https://tarballs.openstack.org/ironic-python-agent/dib/files/ipa-centos8-master.initramfs
>>>>>>>>
>>>>>>>>
>>>>>>>>    - In /var/log/kolla/nova/nova-compute-ironic.log, we observed
>>>>>>>>    the error
>>>>>>>>
>>>>>>>> ERROR nova.compute.manager
>>>>>>>> [req-810ffedf-3343-471c-94db-85411984e6cc - - - - -] No compute node record
>>>>>>>> for host controller-ironic:
>>>>>>>> nova.exception_Remote.ComputeHostNotFound_Remote: Compute host
>>>>>>>> controller-ironic could not be found.
>>>>>>>>
>>>>>>>>
>>>>>>>> Can someone please help in this regard?
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Anirudh Gupta
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jul 27, 2021 at 12:52 PM Anirudh Gupta <anyrude10 at gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Team,
>>>>>>>>>
>>>>>>>>> We have deployed a 2-node setup with Kolla-Ansible *12.0.0* in order
>>>>>>>>> to deploy the OpenStack *wallaby* release. We have also enabled ironic
>>>>>>>>> in order to provision the bare metal nodes.
>>>>>>>>>
>>>>>>>>> On each server we have 3 nics
>>>>>>>>>
>>>>>>>>>    - *eno1* - OAM for external connectivity and endpoint's
>>>>>>>>>    publicURL
>>>>>>>>>    - *eno2* - Mgmt for internal communication between various
>>>>>>>>>    openstack services.
>>>>>>>>>    - *ens2f0* - Data Interface
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Corresponding to this we have defined the following fields in
>>>>>>>>> globals.yml
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - kolla_base_distro: "centos"
>>>>>>>>>    - kolla_install_type: "source"
>>>>>>>>>    - openstack_release: "wallaby"
>>>>>>>>>    - network_interface: "eno2"  # MGMT interface
>>>>>>>>>    - kolla_external_vip_interface: "eno1"  # OAM interface
>>>>>>>>>    - kolla_internal_vip_address: "192.168.10.3"  # MGMT subnet free IP
>>>>>>>>>    - kolla_external_vip_address: "10.0.1.136"  # OAM subnet free IP
>>>>>>>>>    - neutron_external_interface: "ens2f0"  # Data interface
>>>>>>>>>    - enable_neutron_provider_networks: "yes"
>>>>>>>>>
>>>>>>>>> Note: Only relevant fields are being shown in this query
>>>>>>>>>
>>>>>>>>> Also, for ironic following fields have been defined in globals.yml
>>>>>>>>>
>>>>>>>>>    - enable_ironic: "yes"
>>>>>>>>>    - enable_ironic_neutron_agent: "{{ enable_neutron | bool and
>>>>>>>>>    enable_ironic | bool }}"
>>>>>>>>>    - enable_horizon_ironic: "{{ enable_ironic | bool }}"
>>>>>>>>>    - ironic_dnsmasq_interface: "ens2f0"  # Data interface
>>>>>>>>>    - ironic_dnsmasq_dhcp_range: "20.20.20.10,20.20.20.100"
>>>>>>>>>    - ironic_dnsmasq_boot_file: "pxelinux.0"
>>>>>>>>>    - ironic_cleaning_network: "public1"
>>>>>>>>>    - ironic_dnsmasq_default_gateway: "20.20.20.1"
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> After successful deployment, a flat provider network with the name
>>>>>>>>> public1 is being created in openstack using the below commands:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - openstack network create public1 --provider-network-type
>>>>>>>>>    flat --provider-physical-network physnet1
>>>>>>>>>    - openstack subnet create subnet1 --network public1
>>>>>>>>>    --subnet-range 20.20.20.0/24 --allocation-pool
>>>>>>>>>    start=20.20.20.10,end=20.20.20.100 --ip-version=4  --gateway=20.20.20.1
>>>>>>>>>    --dhcp
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Issue/Queries:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - Is the configuration done in globals.yml correct or is there
>>>>>>>>>    anything else that needs to be done in order to separate control and data
>>>>>>>>>    plane traffic?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - Also, I have set automated_cleaning to "true" in the
>>>>>>>>>    ironic-conductor container settings. After creating the baremetal node,
>>>>>>>>>    we run the "node manage" command, which completes successfully. Running
>>>>>>>>>    the "*openstack baremetal node provide <node id>*" command powers on the
>>>>>>>>>    machine and sets the boot mode to Network Boot, but no DHCP request for
>>>>>>>>>    that particular MAC is seen on the controller. Is there anything I am
>>>>>>>>>    missing that needs to be done in order to make ironic work?
>>>>>>>>>
>>>>>>>>> Note: I have also verified that the nic is PXE enabled in system
>>>>>>>>> configuration setting
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Anirudh Gupta
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
>>>>>>> Commercial register: Amtsgericht Muenchen, HRB 153243,
>>>>>>> Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs,
>>>>>>> Michael O'Neill
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>