[Kolla][Kolla-Ansible] Ironic Node Cleaning Failed

Anirudh Gupta anyrude10 at gmail.com
Fri Aug 13 06:56:36 UTC 2021


Hi All,

I had a 900 GB hard disk on my Baremetal Node, and it took approx *15
hours* for the node to move from the *clean_wait* state to the
*available* state.

Once the baremetal node became available, I was able to create a server
and provision it with a user image.

Is it normal for the erase_devices clean step (during clean_wait) to take
15 hours on a 900 GB hard disk in Ironic?
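
If the full-disk shred is the bottleneck, the erasure behaviour can
usually be tuned. Below is a minimal ironic.conf sketch that disables the
slow full-disk erase in favour of the much faster metadata erase; the
[deploy] options are standard Ironic settings, but verify them (and
Kolla-Ansible's /etc/kolla/config merge path) against your release before
relying on this:

    # /etc/kolla/config/ironic.conf (picked up by the Ironic containers
    # on the next "kolla-ansible reconfigure")
    [deploy]
    # Priority 0 disables the full-disk shred step...
    erase_devices_priority = 0
    # ...while keeping the quick partition-table/metadata erase.
    erase_devices_metadata_priority = 10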

Regards
Anirudh Gupta


On Mon, Aug 9, 2021 at 2:01 PM Anirudh Gupta <anyrude10 at gmail.com> wrote:

> Hi Mark,
>
> Earlier I was passing boot_mode as uefi while creating the baremetal
> node.
> On the Kolla-Ansible Launchpad, I found some issues related to UEFI mode,
> so I stopped passing that parameter.
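>
> For reference, the boot mode was being set via the usual capability
> property, roughly like this (a sketch; the node name is a placeholder):
>
>    openstack baremetal node set <node> \
>      --property capabilities='boot_mode:uefi'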
>
> With iPXE enabled and without passing the UEFI boot mode parameter, my
> node started cleaning and connected to the TFTP server.
>
> But for the last 2 hours, the state has remained in *clean_wait* only.
>
> The ramdisk and kernel images I used were the ones mentioned in the
> links below:
>
>    - https://tarballs.openstack.org/ironic-python-agent/dib/files/ipa-centos8-master.kernel
>    - https://tarballs.openstack.org/ironic-python-agent/dib/files/ipa-centos8-master.initramfs
>
>
> For this I followed the latest Kolla-Ansible document:
>
>    - https://docs.openstack.org/kolla-ansible/latest/reference/bare-metal/ironic-guide.html
>
>
> All I can see in *ironic-conductor* logs is:
>
> 2021-08-09 13:49:51.159 7 DEBUG ironic.drivers.modules.agent_base [-]
> Heartbeat from node 8b1ec553-fbc9-4912-bd33-88afc41b8f81 heartbeat
> /var/lib/kolla/venv/lib/python3.6/site-packages/ironic/drivers/modules/agent_base.py:641
> 2021-08-09 13:49:51.178 7 DEBUG ironic.drivers.modules.agent_client [-]
> Fetching status of agent commands for node
> 8b1ec553-fbc9-4912-bd33-88afc41b8f81 get_commands_status
> /var/lib/kolla/venv/lib/python3.6/site-packages/ironic/drivers/modules/agent_client.py:310
> 2021-08-09 13:49:51.186 7 DEBUG ironic.drivers.modules.agent_client [-]
> Status of agent commands for node 8b1ec553-fbc9-4912-bd33-88afc41b8f81:
> get_clean_steps: result "{'clean_steps': {'GenericHardwareManager':
> [{'step': 'erase_devices', 'priority': 10, 'interface': 'deploy',
> 'reboot_requested': False, 'abortable': True}, {'step':
> 'erase_devices_metadata', 'priority': 99, 'interface': 'deploy',
> 'reboot_requested': False, 'abortable': True}, {'step': 'erase_pstore',
> 'priority': 0, 'interface': 'deploy', 'reboot_requested': False,
> 'abortable': True}, {'step': 'delete_configuration', 'priority': 0,
> 'interface': 'raid', 'reboot_requested': False, 'abortable': True},
> {'step': 'create_configuration', 'priority': 0, 'interface': 'raid',
> 'reboot_requested': False, 'abortable': True}, {'step': 'burnin_cpu',
> 'priority': 0, 'interface': 'deploy', 'reboot_requested': False,
> 'abortable': True}, {'step': 'burnin_disk', 'priority': 0, 'interface':
> 'deploy', 'reboot_requested': False, 'abortable': True}, {'step':
> 'burnin_memory', 'priority': 0, 'interface': 'deploy', 'reboot_requested':
> False, 'abortable': True}, {'step': 'burnin_network', 'priority': 0,
> 'interface': 'deploy', 'reboot_requested': False, 'abortable': True}]},
> 'hardware_manager_version': {'MellanoxDeviceHardwareManager': '1',
> 'generic_hardware_manager': '1.1'}}", error "None"; execute_clean_step:
> result "{'clean_result': None, 'clean_step': {'step':
> 'erase_devices_metadata', 'priority': 99, 'interface': 'deploy',
> 'reboot_requested': False, 'abortable': True, 'requires_ramdisk': True}}",
> error "None"; execute_clean_step: result "None", error "None"
> get_commands_status
> /var/lib/kolla/venv/lib/python3.6/site-packages/ironic/drivers/modules/agent_client.py:342
> 2021-08-09 13:49:51.186 7 DEBUG ironic.drivers.modules.agent_base [-] *Clean
> step still running for node 8b1ec553-fbc9-4912-bd33-88afc41b8f81:* None
> _get_completed_command
> /var/lib/kolla/venv/lib/python3.6/site-packages/ironic/drivers/modules/agent_base.py:267
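>
> (For anyone following along: while a step runs, the node's progress can
> be polled with the baremetal CLI; a sketch using its --fields option:)
>
>    openstack baremetal node show 8b1ec553-fbc9-4912-bd33-88afc41b8f81 \
>      --fields provision_state clean_step last_error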
>
> It would be a great help if you could suggest some pointers.
>
> Regards
> Anirudh Gupta
>
>
>
>
>
> On Mon, Aug 9, 2021 at 1:43 PM Mark Goddard <mark at stackhpc.com> wrote:
>
>>
>>
>> On Fri, 6 Aug 2021 at 13:49, Anirudh Gupta <anyrude10 at gmail.com> wrote:
>>
>>> Hi Dmitry,
>>>
>>> I took a TCPDUMP while the Baremetal Node was booting up and looked at
>>> the TFTP traffic, and found some "*File Not Found*" traces for
>>> bootx64.efi.
>>>
>>> [image: image.png]
>>>
>>> Then I found a related post on openstack-discuss which suggested
>>> enabling iPXE:
>>>
>>> http://lists.openstack.org/pipermail/openstack-discuss/2019-October/010329.html
>>>
>>> After re-deploying the setup with iPXE enabled, I found similar traces,
>>> now for the *ipxe.efi* file.
>>>
>>> [image: image.png]
>>>
>>> Could you please suggest what might be missing in the configuration and
>>> the steps to resolve it?
>>>
>>
>> Hi Anirudh,
>>
>> I'd suggest installing a tftp client on your machine and making some
>> requests. The TFTP daemon runs in the ironic_pxe container, and TFTP files
>> are served from /tftpboot in that container.
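>>
>> For example (a sketch; the server address is a placeholder, and tftp
>> client syntax varies between implementations; this is tftp-hpa style):
>>
>>    # See what the daemon is actually serving
>>    docker exec ironic_pxe ls /tftpboot
>>    # Try fetching the boot file a node would request
>>    tftp <tftp-server-ip> -c get ipxe.efi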
>>
>> Mark
>>
>>>
>>> For your reference, I am attaching the complete tcpdump logs of both
>>> scenarios.
>>>
>>> Looking forward to hearing from you.
>>>
>>> Regards
>>> Anirudh Gupta
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Aug 5, 2021 at 4:56 PM Anirudh Gupta <anyrude10 at gmail.com>
>>> wrote:
>>>
>>>> Hi Team,
>>>>
>>>> On further debugging, I found an error in the neutron-server logs:
>>>>
>>>>
>>>> Failed to bind port 476d8175-ffc2-49ba-bb12-0a77c1f07e5f on host
>>>> f4a43fa5-9c41-488e-a34d-714ae5a9d300 for vnic_type baremetal using segments
>>>> [{'id': '1a5bbe96-2488-4971-925f-7c9346ba3ef5', 'network_type': 'flat',
>>>> 'physical_network': 'physnet1', 'segmentation_id': None, 'network_id':
>>>> '5b6cccec-ad86-4ed9-8d3c-72a31ec3a0d4'}]
>>>> 2021-08-05 16:33:06.979 23 INFO neutron.plugins.ml2.plugin
>>>> [req-54d11d51-7319-43ea-b70c-fe39d8aafe8a 21d6a238438e4294912746bcdc895e31
>>>> 3eca725754e1405eb178cc39bd0da3aa - default default] Attempt 9 to bind port
>>>> 476d8175-ffc2-49ba-bb12-0a77c1f07e5f
>>>>
>>>> where 476d8175-ffc2-49ba-bb12-0a77c1f07e5f is the UUID of the port
>>>> created for the Baremetal Node.
>>>>
>>>> The port is created in OpenStack, but its state is DOWN:
>>>>
>>>> [ansible at localhost ~]$ openstack port list
>>>> +--------------------------------------+------+-------------------+---------------------------------------------------------------------------+--------+
>>>> | ID                                   | Name | MAC Address       | Fixed IP Addresses                                                        | Status |
>>>> +--------------------------------------+------+-------------------+---------------------------------------------------------------------------+--------+
>>>> | 07d6b83d-d83c-498f-8ba8-b4f21bef7249 |      | fa:16:3e:38:05:9d | ip_address='10.0.1.200', subnet_id='7b72c158-2146-4bd6-893b-bd76b4a3e869' | ACTIVE |
>>>> | 476d8175-ffc2-49ba-bb12-0a77c1f07e5f |      | 98:f2:b3:3f:72:d8 | ip_address='10.0.1.202', subnet_id='7b72c158-2146-4bd6-893b-bd76b4a3e869' | DOWN   |
>>>> +--------------------------------------+------+-------------------+---------------------------------------------------------------------------+--------+
>>>>
>>>> *98:f2:b3:3f:72:d8* is the MAC address of my Baremetal Node, on which
>>>> PXE is enabled.
>>>>
>>>> Can someone please help in resolving this issue?
>>>>
>>>> *Issue:*
>>>> *Node goes into clean_failed from clean_wait.*
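>>>>
>>>> For completeness, binding failures for vnic_type baremetal are
>>>> commonly narrowed down with checks along these lines (a sketch; the
>>>> container name and config path follow Kolla conventions but may vary
>>>> by release):
>>>>
>>>>    # Is the ironic-neutron-agent registered and alive?
>>>>    openstack network agent list | grep -i baremetal
>>>>    # Is the "baremetal" ML2 mechanism driver enabled?
>>>>    docker exec neutron_server \
>>>>      grep mechanism_drivers /etc/neutron/plugins/ml2/ml2_conf.ini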
>>>>
>>>> Regards
>>>> Anirudh Gupta
>>>>
>>>> On Tue, Aug 3, 2021 at 8:32 PM Anirudh Gupta <anyrude10 at gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Dmitry,
>>>>>
>>>>> I might be wrong, but as per my understanding, if there were an issue
>>>>> in dnsmasq, then the IP 20.20.20.10 would not have been assigned to
>>>>> the machine.
>>>>>
>>>>> TCPDUMP logs are as below:
>>>>>
>>>>> 20:16:58.938089 IP controller.bootps > 255.255.255.255.bootpc:
>>>>> BOOTP/DHCP, Reply, length 312
>>>>> 20:17:02.765291 IP 0.0.0.0.bootpc > 255.255.255.255.bootps:
>>>>> BOOTP/DHCP, Request from 98:f2:b3:3f:72:e5 (oui Unknown), length 359
>>>>> 20:17:02.766303 IP controller.bootps > 255.255.255.255.bootpc:
>>>>> BOOTP/DHCP, Reply, length 312
>>>>> 20:17:26.944378 IP 0.0.0.0.bootpc > 255.255.255.255.bootps:
>>>>> BOOTP/DHCP, Request from 98:f2:b3:3f:72:e5 (oui Unknown), length 347
>>>>> 20:17:26.944756 IP controller.bootps > 255.255.255.255.bootpc:
>>>>> BOOTP/DHCP, Reply, length 312
>>>>> 20:17:30.763627 IP 0.0.0.0.bootpc > 255.255.255.255.bootps:
>>>>> BOOTP/DHCP, Request from 98:f2:b3:3f:72:e5 (oui Unknown), length 359
>>>>> 20:17:30.764620 IP controller.bootps > 255.255.255.255.bootpc:
>>>>> BOOTP/DHCP, Reply, length 312
>>>>> 20:17:54.938791 IP 0.0.0.0.bootpc > 255.255.255.255.bootps:
>>>>> BOOTP/DHCP, Request from 98:f2:b3:3f:72:e5 (oui Unknown), length 347
>>>>>
>>>>> Also, the neutron dnsmasq logs and ironic-inspector logs are attached
>>>>> to the mail.
>>>>>
>>>>> Regards
>>>>> Anirudh Gupta
>>>>>
>>>>>
>>>>> On Tue, Aug 3, 2021 at 7:29 PM Dmitry Tantsur <dtantsur at redhat.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> You need to check the dnsmasq logs (there are two dnsmasqs: one from
>>>>>> neutron and one from ironic-inspector). tcpdump may also help to
>>>>>> determine where the packets are lost.
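>>>>>>
>>>>>> For example, something along these lines (a sketch; replace the
>>>>>> interface with the one facing the node):
>>>>>>
>>>>>>    # DHCP and TFTP traffic on the provisioning interface
>>>>>>    tcpdump -ni ens2f0 port 67 or port 68 or port 69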
>>>>>>
>>>>>> Dmitry
>>>>>>
>>>>>> On Fri, Jul 30, 2021 at 10:29 PM Anirudh Gupta <anyrude10 at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Dmitry
>>>>>>>
>>>>>>> Thanks for your time.
>>>>>>>
>>>>>>> My system is getting the IP 20.20.20.10, which is in the range
>>>>>>> defined in the ironic_dnsmasq_dhcp_range field in globals.yml.
>>>>>>>
>>>>>>> ironic_dnsmasq_dhcp_range: "20.20.20.10,20.20.20.100"
>>>>>>>
>>>>>>> And in the cleaning network (public1), the range defined is
>>>>>>> 20.20.20.150-20.20.20.200
>>>>>>>
>>>>>>> As per my understanding, these 2 ranges should be mutually exclusive.
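>>>>>>>
>>>>>>> (The Neutron side of this can be double-checked with something
>>>>>>> like:)
>>>>>>>
>>>>>>>    openstack subnet show subnet1 -c allocation_pools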
>>>>>>>
>>>>>>> Please suggest if my understanding is not correct.
>>>>>>>
>>>>>>> Any suggestions on what I should do to resolve this issue?
>>>>>>>
>>>>>>> Regards
>>>>>>> Anirudh Gupta
>>>>>>>
>>>>>>>
>>>>>>> On Sat, 31 Jul, 2021, 12:06 am Dmitry Tantsur, <dtantsur at redhat.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jul 29, 2021 at 6:05 PM Anirudh Gupta <anyrude10 at gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Team,
>>>>>>>>>
>>>>>>>>> In addition to the email below, I have some updated information:
>>>>>>>>>
>>>>>>>>> Earlier, the allocation range mentioned in
>>>>>>>>> "*ironic_dnsmasq_dhcp_range*" in globals.yml overlapped with the
>>>>>>>>> cleaning network's range, due to which there was an issue in
>>>>>>>>> receiving the DHCP request.
>>>>>>>>>
>>>>>>>>> After creating a cleaning network with a separate allocation
>>>>>>>>> range, I am successfully getting an IP allocated to my Baremetal
>>>>>>>>> Node:
>>>>>>>>>
>>>>>>>>>    openstack subnet create subnet1 --network public1 \
>>>>>>>>>      --subnet-range 20.20.20.0/24 \
>>>>>>>>>      --allocation-pool start=20.20.20.150,end=20.20.20.200 \
>>>>>>>>>      --ip-version=4 --gateway=20.20.20.1 --dhcp
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> [image: image.png]
>>>>>>>>>
>>>>>>>>> After getting the IP, there is no further action on the node. From
>>>>>>>>> "*clean_wait*", it goes into the "*clean_failed*" state after
>>>>>>>>> around half an hour.
>>>>>>>>>
>>>>>>>>
>>>>>>>> The IP address is not from the cleaning range; it may come from
>>>>>>>> inspection. You probably need to investigate your network topology,
>>>>>>>> maybe using tcpdump.
>>>>>>>>
>>>>>>>> Unfortunately, I'm not fluent enough in Kolla to say whether this
>>>>>>>> is a bug or not.
>>>>>>>>
>>>>>>>> Dmitry
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On verifying the logs, I could see the following error messages:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - In */var/log/kolla/ironic/ironic-conductor.log*, we observed
>>>>>>>>>    the following error:
>>>>>>>>>
>>>>>>>>> ERROR ironic.conductor.utils [-] Cleaning for node
>>>>>>>>> 3a56748e-a8ca-4dec-a332-ace18e6d494e failed. *Timeout reached
>>>>>>>>> while cleaning the node. Please check if the ramdisk responsible for the
>>>>>>>>> cleaning is running on the node. Failed on step {}.*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Note: For cleaning the node, we used the images below (a sketch
>>>>>>>>> for extending the cleaning timeout follows the list of errors):
>>>>>>>>>
>>>>>>>>>    - https://tarballs.openstack.org/ironic-python-agent/dib/files/ipa-centos8-master.kernel
>>>>>>>>>    - https://tarballs.openstack.org/ironic-python-agent/dib/files/ipa-centos8-master.initramfs
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - In /var/log/kolla/nova/nova-compute-ironic.log, we observed
>>>>>>>>>    the following error:
>>>>>>>>>
>>>>>>>>> ERROR nova.compute.manager
>>>>>>>>> [req-810ffedf-3343-471c-94db-85411984e6cc - - - - -] No compute node record
>>>>>>>>> for host controller-ironic:
>>>>>>>>> nova.exception_Remote.ComputeHostNotFound_Remote: Compute host
>>>>>>>>> controller-ironic could not be found.
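>>>>>>>>>
>>>>>>>>> If the ramdisk is simply slow to come up, the cleaning wait can be
>>>>>>>>> extended; a minimal sketch, assuming the standard [conductor]
>>>>>>>>> option and Kolla's /etc/kolla/config/ironic.conf merge path:
>>>>>>>>>
>>>>>>>>>    # /etc/kolla/config/ironic.conf
>>>>>>>>>    [conductor]
>>>>>>>>>    # Seconds to wait for the cleaning ramdisk to call back
>>>>>>>>>    # (upstream default is 1800)
>>>>>>>>>    clean_callback_timeout = 3600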
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Can someone please help in this regard?
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Anirudh Gupta
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jul 27, 2021 at 12:52 PM Anirudh Gupta <
>>>>>>>>> anyrude10 at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Team,
>>>>>>>>>>
>>>>>>>>>> We have deployed a 2-node Kolla-Ansible *12.0.0* setup in order
>>>>>>>>>> to deploy the OpenStack *Wallaby* release. We have also enabled
>>>>>>>>>> Ironic in order to provision bare metal nodes.
>>>>>>>>>>
>>>>>>>>>> On each server we have 3 NICs:
>>>>>>>>>>
>>>>>>>>>>    - *eno1* - OAM, for external connectivity and the endpoints' publicURL
>>>>>>>>>>    - *eno2* - Mgmt, for internal communication between the OpenStack services
>>>>>>>>>>    - *ens2f0* - Data interface
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Corresponding to this, we have defined the following fields in
>>>>>>>>>> globals.yml:
>>>>>>>>>>
>>>>>>>>>>    - kolla_base_distro: "centos"
>>>>>>>>>>    - kolla_install_type: "source"
>>>>>>>>>>    - openstack_release: "wallaby"
>>>>>>>>>>    - network_interface: "eno2"                    # MGMT interface
>>>>>>>>>>    - kolla_external_vip_interface: "eno1"         # OAM interface
>>>>>>>>>>    - kolla_internal_vip_address: "192.168.10.3"   # free IP on MGMT subnet
>>>>>>>>>>    - kolla_external_vip_address: "10.0.1.136"     # free IP on OAM subnet
>>>>>>>>>>    - neutron_external_interface: "ens2f0"         # Data interface
>>>>>>>>>>    - enable_neutron_provider_networks: "yes"
>>>>>>>>>> Note: Only the relevant fields are shown in this query.
>>>>>>>>>>
>>>>>>>>>> Also, for Ironic, the following fields have been defined in
>>>>>>>>>> globals.yml:
>>>>>>>>>>
>>>>>>>>>>    - enable_ironic: "yes"
>>>>>>>>>>    - enable_ironic_neutron_agent: "{{ enable_neutron | bool and enable_ironic | bool }}"
>>>>>>>>>>    - enable_horizon_ironic: "{{ enable_ironic | bool }}"
>>>>>>>>>>    - ironic_dnsmasq_interface: "*ens2f0*"         # Data interface
>>>>>>>>>>    - ironic_dnsmasq_dhcp_range: "20.20.20.10,20.20.20.100"
>>>>>>>>>>    - ironic_dnsmasq_boot_file: "pxelinux.0"
>>>>>>>>>>    - ironic_cleaning_network: "public1"
>>>>>>>>>>    - ironic_dnsmasq_default_gateway: "20.20.20.1"
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> After successful deployment, a flat provider network named
>>>>>>>>>> public1 is created in OpenStack using the commands below:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    openstack network create public1 --provider-network-type flat \
>>>>>>>>>>      --provider-physical-network physnet1
>>>>>>>>>>
>>>>>>>>>>    openstack subnet create subnet1 --network public1 \
>>>>>>>>>>      --subnet-range 20.20.20.0/24 \
>>>>>>>>>>      --allocation-pool start=20.20.20.10,end=20.20.20.100 \
>>>>>>>>>>      --ip-version=4 --gateway=20.20.20.1 --dhcp
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Issue/Queries:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    - Is the configuration done in globals.yml correct or is
>>>>>>>>>>    there anything else that needs to be done in order to separate control and
>>>>>>>>>>    data plane traffic?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    - Also, I have set automated_cleaning to "true" in the
>>>>>>>>>>    ironic-conductor container settings. After creating the
>>>>>>>>>>    baremetal node, we run the "node manage" command, which runs
>>>>>>>>>>    successfully. Running "*openstack baremetal node provide <node
>>>>>>>>>>    id>*" powers on the machine and sets the boot mode to Network
>>>>>>>>>>    Boot, but no DHCP request for that particular MAC is seen on
>>>>>>>>>>    the controller. Is there anything I am missing that needs to
>>>>>>>>>>    be done in order to make Ironic work?
>>>>>>>>>>
>>>>>>>>>> Note: I have also verified that the NIC is PXE-enabled in the
>>>>>>>>>> system configuration settings.
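>>>>>>>>>>
>>>>>>>>>> (To check for the request on the controller, commands along these
>>>>>>>>>> lines can be used; a sketch, with the interface and MAC to be
>>>>>>>>>> adjusted. The container name ironic_dnsmasq matches what Kolla
>>>>>>>>>> deploys:)
>>>>>>>>>>
>>>>>>>>>>    # Watch for the node's DHCP traffic on the data interface
>>>>>>>>>>    tcpdump -ni ens2f0 ether host <node-mac>
>>>>>>>>>>    # Confirm the PXE dnsmasq container is up and inspect its logs
>>>>>>>>>>    docker logs ironic_dnsmasq | tail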
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>> Anirudh Gupta
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
>>>>>>>> Commercial register: Amtsgericht Muenchen, HRB 153243,
>>>>>>>> Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs,
>>>>>>>> Michael O'Neill
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
>>>>>> Commercial register: Amtsgericht Muenchen, HRB 153243,
>>>>>> Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs,
>>>>>> Michael O'Neill
>>>>>>
>>>>>