[ironic][xena] problems updating redfish_password for existing node

Wade Albright the.wade.albright at gmail.com
Mon Aug 15 18:47:36 UTC 2022


1) There are sockets open briefly when the conductor is trying to connect.
After three tries the node is set to maintenance mode and there are no more
sockets open.
2) My (extremely simple) test code was not using 'Connection: close'. I was
just running
requests.get("https://10.12.104.174/redfish/v1/Systems/System.Embedded.1",
verify=False, auth=('xxxx', 'xxxx')) in a loop. I just tried it with
headers={'Connection': 'close'} and it doesn't seem to make any
difference; it works fine either way.

I was able to confirm that the problem only happens when using session
auth. With basic auth it doesn't happen.
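For reference, the auth mode can be pinned per node via driver_info
(redfish_auth_type, which takes 'basic', 'session', or 'auto'; 'auto' is
the usual default). A sketch, with <nodename> as a placeholder:

```shell
# Force basic auth for one node while debugging (avoids the session-auth path):
openstack baremetal node set <nodename> --driver-info redfish_auth_type=basic

# Switch back to session auth once the problem is resolved:
openstack baremetal node set <nodename> --driver-info redfish_auth_type=session
```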

Versions I'm using here are ironic 18.2.1 and sushy 3.12.2.

Here are some fresh logs from the node having the problem:

2022-08-15 10:34:21.726 208875 INFO ironic.conductor.task_manager
[req-a66441d0-45a6-4e27-96d9-f26bf9c83725 fcba8c03bfd649cfbb39a907426b3338
b679510ddb6540ca9454e26841f65c89 - default default] Node
0c304cea-8ae2-4a12-b658-dec05c190f88 moved to provision state "deploying"
from state "active"; target provision state is "active"
2022-08-15 10:34:22.553 208875 INFO ironic.conductor.utils
[req-a66441d0-45a6-4e27-96d9-f26bf9c83725 fcba8c03bfd649cfbb39a907426b3338
b679510ddb6540ca9454e26841f65c89 - default default] Node
0c304cea-8ae2-4a12-b658-dec05c190f88 current power state is 'power on',
requested state is 'power off'.
2022-08-15 10:34:35.185 208875 INFO ironic.conductor.utils
[req-a66441d0-45a6-4e27-96d9-f26bf9c83725 fcba8c03bfd649cfbb39a907426b3338
b679510ddb6540ca9454e26841f65c89 - default default] Successfully set node
0c304cea-8ae2-4a12-b658-dec05c190f88 power state to power off by power off.
2022-08-15 10:34:35.200 208875 WARNING ironic.common.pxe_utils
[req-a66441d0-45a6-4e27-96d9-f26bf9c83725 fcba8c03bfd649cfbb39a907426b3338
b679510ddb6540ca9454e26841f65c89 - default default] IPv6 is enabled and the
DHCP driver appears set to a plugin aside from "neutron". Node
0c304cea-8ae2-4a12-b658-dec05c190f88 may not receive proper DHCPv6 provided
boot parameters.
2022-08-15 10:34:38.246 208875 INFO ironic.conductor.deployments
[req-a66441d0-45a6-4e27-96d9-f26bf9c83725 fcba8c03bfd649cfbb39a907426b3338
b679510ddb6540ca9454e26841f65c89 - default default] Deploying on node
0c304cea-8ae2-4a12-b658-dec05c190f88, remaining steps: [{'step':
'pre_deploy', 'priority': 200, 'argsinfo': None, 'interface': 'deploy'},
{'step': 'deploy', 'priority': 100, 'argsinfo': None, 'interface':
'deploy'}, {'step': 'write_image', 'priority': 80, 'argsinfo': None,
'interface': 'deploy'}, {'step': 'tear_down_agent', 'priority': 40,
'argsinfo': None, 'interface': 'deploy'}, {'step':
'switch_to_tenant_network', 'priority': 30, 'argsinfo': None, 'interface':
'deploy'}, {'step': 'boot_instance', 'priority': 20, 'argsinfo': None,
'interface': 'deploy'}]
2022-08-15 10:34:38.255 208875 INFO ironic.conductor.deployments
[req-a66441d0-45a6-4e27-96d9-f26bf9c83725 fcba8c03bfd649cfbb39a907426b3338
b679510ddb6540ca9454e26841f65c89 - default default] Executing {'step':
'pre_deploy', 'priority': 200, 'argsinfo': None, 'interface': 'deploy'} on
node 0c304cea-8ae2-4a12-b658-dec05c190f88
2022-08-15 10:35:27.158 208875 INFO ironic.drivers.modules.ansible.deploy
[req-a66441d0-45a6-4e27-96d9-f26bf9c83725 fcba8c03bfd649cfbb39a907426b3338
b679510ddb6540ca9454e26841f65c89 - default default] Ansible pre-deploy step
complete on node 0c304cea-8ae2-4a12-b658-dec05c190f88
2022-08-15 10:35:27.159 208875 INFO ironic.conductor.deployments
[req-a66441d0-45a6-4e27-96d9-f26bf9c83725 fcba8c03bfd649cfbb39a907426b3338
b679510ddb6540ca9454e26841f65c89 - default default] Node
0c304cea-8ae2-4a12-b658-dec05c190f88 finished deploy step {'step':
'pre_deploy', 'priority': 200, 'argsinfo': None, 'interface': 'deploy'}
2022-08-15 10:35:27.160 208875 INFO ironic.conductor.deployments
[req-a66441d0-45a6-4e27-96d9-f26bf9c83725 fcba8c03bfd649cfbb39a907426b3338
b679510ddb6540ca9454e26841f65c89 - default default] Deploying on node
0c304cea-8ae2-4a12-b658-dec05c190f88, remaining steps: [{'step': 'deploy',
'priority': 100, 'argsinfo': None, 'interface': 'deploy'}, {'step':
'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'},
{'step': 'tear_down_agent', 'priority': 40, 'argsinfo': None, 'interface':
'deploy'}, {'step': 'switch_to_tenant_network', 'priority': 30, 'argsinfo':
None, 'interface': 'deploy'}, {'step': 'boot_instance', 'priority': 20,
'argsinfo': None, 'interface': 'deploy'}]
2022-08-15 10:35:27.176 208875 INFO ironic.conductor.deployments
[req-a66441d0-45a6-4e27-96d9-f26bf9c83725 fcba8c03bfd649cfbb39a907426b3338
b679510ddb6540ca9454e26841f65c89 - default default] Executing {'step':
'deploy', 'priority': 100, 'argsinfo': None, 'interface': 'deploy'} on node
0c304cea-8ae2-4a12-b658-dec05c190f88
2022-08-15 10:35:32.037 208875 INFO ironic.conductor.utils
[req-a66441d0-45a6-4e27-96d9-f26bf9c83725 fcba8c03bfd649cfbb39a907426b3338
b679510ddb6540ca9454e26841f65c89 - default default] Successfully set node
0c304cea-8ae2-4a12-b658-dec05c190f88 power state to power on by rebooting.
2022-08-15 10:35:32.037 208875 INFO ironic.conductor.deployments
[req-a66441d0-45a6-4e27-96d9-f26bf9c83725 fcba8c03bfd649cfbb39a907426b3338
b679510ddb6540ca9454e26841f65c89 - default default] Deploy step {'step':
'deploy', 'priority': 100, 'argsinfo': None, 'interface': 'deploy'} on node
0c304cea-8ae2-4a12-b658-dec05c190f88 being executed asynchronously, waiting
for driver.
2022-08-15 10:35:32.051 208875 INFO ironic.conductor.task_manager
[req-a66441d0-45a6-4e27-96d9-f26bf9c83725 fcba8c03bfd649cfbb39a907426b3338
b679510ddb6540ca9454e26841f65c89 - default default] Node
0c304cea-8ae2-4a12-b658-dec05c190f88 moved to provision state "wait
call-back" from state "deploying"; target provision state is "active"
2022-08-15 10:39:54.726 208875 INFO ironic.conductor.task_manager
[req-fdd7319b-85b5-4adb-9773-ff64fd1423b2 - - - - -] Node
0c304cea-8ae2-4a12-b658-dec05c190f88 moved to provision state "deploying"
from state "wait call-back"; target provision state is "active"
2022-08-15 10:39:54.741 208875 INFO ironic.conductor.deployments
[req-fdd7319b-85b5-4adb-9773-ff64fd1423b2 - - - - -] Deploying on node
0c304cea-8ae2-4a12-b658-dec05c190f88, remaining steps: [{'step':
'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'},
{'step': 'tear_down_agent', 'priority': 40, 'argsinfo': None, 'interface':
'deploy'}, {'step': 'switch_to_tenant_network', 'priority': 30, 'argsinfo':
None, 'interface': 'deploy'}, {'step': 'boot_instance', 'priority': 20,
'argsinfo': None, 'interface': 'deploy'}]
2022-08-15 10:39:54.748 208875 INFO ironic.conductor.deployments
[req-fdd7319b-85b5-4adb-9773-ff64fd1423b2 - - - - -] Executing {'step':
'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'} on
node 0c304cea-8ae2-4a12-b658-dec05c190f88
2022-08-15 10:42:24.738 208875 WARNING ironic.drivers.modules.agent_base
[-] Node 0c304cea-8ae2-4a12-b658-dec05c190f88 is currently locked, skipping
heartbeat processing (will retry on the next heartbeat):
ironic.common.exception.NodeLocked: Node
0c304cea-8ae2-4a12-b658-dec05c190f88 is locked by host
sjc06-c01-irn01.ops.ringcentral.com, please retry after the current
operation is completed.
2022-08-15 10:44:29.788 208875 WARNING ironic.drivers.modules.agent_base
[-] Node 0c304cea-8ae2-4a12-b658-dec05c190f88 is currently locked, skipping
heartbeat processing (will retry on the next heartbeat):
ironic.common.exception.NodeLocked: Node
0c304cea-8ae2-4a12-b658-dec05c190f88 is locked by host
sjc06-c01-irn01.ops.ringcentral.com, please retry after the current
operation is completed.
2022-08-15 10:47:24.830 208875 WARNING ironic.drivers.modules.agent_base
[-] Node 0c304cea-8ae2-4a12-b658-dec05c190f88 is currently locked, skipping
heartbeat processing (will retry on the next heartbeat):
ironic.common.exception.NodeLocked: Node
0c304cea-8ae2-4a12-b658-dec05c190f88 is locked by host
sjc06-c01-irn01.ops.ringcentral.com, please retry after the current
operation is completed.
2022-08-15 11:05:59.544 208875 INFO ironic.drivers.modules.ansible.deploy
[req-fdd7319b-85b5-4adb-9773-ff64fd1423b2 - - - - -] Ansible complete
deploy on node 0c304cea-8ae2-4a12-b658-dec05c190f88
2022-08-15 11:06:00.141 208875 ERROR ironic.conductor.utils
[req-fdd7319b-85b5-4adb-9773-ff64fd1423b2 - - - - -] Node
0c304cea-8ae2-4a12-b658-dec05c190f88 failed deploy step {'step':
'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'}
with unexpected error: ("Connection broken: InvalidChunkLength(got length
b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read)):
requests.exceptions.ChunkedEncodingError: ("Connection broken:
InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got
length b'', 0 bytes read))
2022-08-15 11:06:00.218 208875 ERROR ironic.conductor.task_manager
[req-fdd7319b-85b5-4adb-9773-ff64fd1423b2 - - - - -] Node
0c304cea-8ae2-4a12-b658-dec05c190f88 moved to provision state "deploy
failed" from state "deploying"; target provision state is "active":
requests.exceptions.ChunkedEncodingError: ("Connection broken:
InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got
length b'', 0 bytes read))
2022-08-15 11:06:28.774 208875 WARNING ironic.conductor.manager
[req-dd0abde9-3364-4828-8907-b42fe4cc9c66 - - - - -] During
sync_power_state, could not get power state for node
0c304cea-8ae2-4a12-b658-dec05c190f88, attempt 1 of 3. Error: ("Connection
broken: InvalidChunkLength(got length b'', 0 bytes read)",
InvalidChunkLength(got length b'', 0 bytes read)).:
requests.exceptions.ChunkedEncodingError: ("Connection broken:
InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got
length b'', 0 bytes read))
2022-08-15 11:06:53.710 208875 WARNING ironic.conductor.manager
[req-dd0abde9-3364-4828-8907-b42fe4cc9c66 - - - - -] During
sync_power_state, could not get power state for node
0c304cea-8ae2-4a12-b658-dec05c190f88, attempt 2 of 3. Error: ("Connection
broken: InvalidChunkLength(got length b'', 0 bytes read)",
InvalidChunkLength(got length b'', 0 bytes read)).:
requests.exceptions.ChunkedEncodingError: ("Connection broken:
InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got
length b'', 0 bytes read))
2022-08-15 11:07:53.727 208875 WARNING ironic.conductor.manager
[req-dd0abde9-3364-4828-8907-b42fe4cc9c66 - - - - -] During
sync_power_state, could not get power state for node
0c304cea-8ae2-4a12-b658-dec05c190f88, attempt 3 of 3. Error: ("Connection
broken: InvalidChunkLength(got length b'', 0 bytes read)",
InvalidChunkLength(got length b'', 0 bytes read)).:
requests.exceptions.ChunkedEncodingError: ("Connection broken:
InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got
length b'', 0 bytes read))
2022-08-15 11:08:53.704 208875 ERROR ironic.conductor.manager
[req-dd0abde9-3364-4828-8907-b42fe4cc9c66 - - - - -] During
sync_power_state, max retries exceeded for node
0c304cea-8ae2-4a12-b658-dec05c190f88, node state None does not match
expected state 'power on'. Updating DB state to 'None' Switching node to
maintenance mode. Error: ("Connection broken: InvalidChunkLength(got length
b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read)):
requests.exceptions.ChunkedEncodingError: ("Connection broken:
InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got
length b'', 0 bytes read))
2022-08-15 11:13:53.750 208875 ERROR ironic.conductor.manager
[req-dd0abde9-3364-4828-8907-b42fe4cc9c66 - - - - -] During
sync_power_state, max retries exceeded for node
0c304cea-8ae2-4a12-b658-dec05c190f88, node state None does not match
expected state 'None'. Updating DB state to 'None' Switching node to
maintenance mode. Error: ("Connection broken: InvalidChunkLength(got length
b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read)):
requests.exceptions.ChunkedEncodingError: ("Connection broken:
InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got
length b'', 0 bytes read))


On Fri, Aug 12, 2022 at 9:06 PM Julia Kreger <juliaashleykreger at gmail.com>
wrote:

> Two questions:
>
> 1)  do you see open sockets to the BMCs in netstat output?
> 2) is your code using ‘connection: close’? Or are you using sushy?
>
> Honestly, this seems *really* weird with current sushy versions, and is
> kind of reminiscent of a cached session which is using kept alive sockets.
>
> If you could grep out req-b6dd74da-1cc7-4c63-b58e-b7ded37007e9 to see
> what the prior couple of conductor actions were, that would give us better
> context as to what is going on.
>
> -Julia
>
> On Fri, Aug 12, 2022 at 3:11 PM Wade Albright <the.wade.albright at gmail.com>
> wrote:
>
>> Sorry for the spam. The openssl issue may have been a red herring. I am
>> not able to reproduce the issue directly with my own python code. I was
>> trying to fetch something that required authentication. After I added the
>> correct auth info it works fine. I am not able to cause the same error as
>> is happening in the Ironic logs.
>>
>> Anyway I'll do some more testing and report back.
>>
>> On Fri, Aug 12, 2022 at 2:14 PM Wade Albright <
>> the.wade.albright at gmail.com> wrote:
>>
>>> I'm not sure why this problem only now started showing up, but it
>>> appears to be unrelated to Ironic. I was able to reproduce it directly,
>>> outside of Ironic, with a simple Python program that uses urllib to fetch
>>> URLs from the BMC/Redfish interface. It seems to be some combination of a
>>> buggy server-side SSL implementation and the newer OpenSSL 1.1.1;
>>> apparently it doesn't happen with OpenSSL 1.0.
>>>
>>> I've found some information about possible workarounds but haven't
>>> figured it out yet. If I do I'll update this thread just in case anyone
>>> else runs into it.
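For reference, the workaround direction I've been exploring looks roughly
like this in a standalone test. This is just a sketch, not anything ironic
or sushy exposes directly, and capping the client at TLS 1.2 is only one
guess at sidestepping whatever code path the firmware mishandles:

```python
import ssl
import urllib.request

# Sketch: a relaxed client-side SSL context for talking to a BMC with a
# buggy TLS stack. check_hostname must be disabled before verify_mode,
# since most BMCs present self-signed certs. Capping the protocol at
# TLS 1.2 avoids TLS 1.3 negotiation under OpenSSL 1.1.1.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
ctx.maximum_version = ssl.TLSVersion.TLSv1_2

# Usage (BMC address is a placeholder):
# urllib.request.urlopen("https://10.12.104.174/redfish/v1/", context=ctx)
```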
>>>
>>> On Fri, Aug 12, 2022 at 8:13 AM Wade Albright <
>>> the.wade.albright at gmail.com> wrote:
>>>
>>>> So I seem to have run into a new issue after upgrading to the newer
>>>> versions to fix the password change issue.
>>>>
>>>> Now I am randomly getting errors like the below. Once I hit this error
>>>> for a given node, no operations work on the node. I thought maybe it was an
>>>> issue with the node itself, but it doesn't seem like it. The BMC seems to
>>>> be working fine.
>>>>
>>>> After a conductor restart, things start working again. Has anyone seen
>>>> something like this?
>>>>
>>>> Log example:
>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils [- - - -
>>>> -] Node ef5a2502-680b-4933-a0ee-6737e57ce1c5 failed deploy step {'step':
>>>> 'write_image', 'priority':
>>>> 80, 'argsinfo': None, 'interface': 'deploy'} with unexpected error:
>>>> ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)",
>>>> InvalidChunkLength(got length b'', 0 bytes read)): requests.exceptions.
>>>> ChunkedEncodingError: ("Connection broken: InvalidChunkLength(got
>>>> length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes
>>>> read))
>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils Traceback
>>>> (most recent call last):
>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils   File
>>>> "/usr/local/lib/python3.6/site-packages/urllib3/response.py", line 697, in
>>>> _update_chunk_length
>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils
>>>> self.chunk_left = int(line, 16)
>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils
>>>> ValueError: invalid literal for int() with base 16: b''
>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils
>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils During
>>>> handling of the above exception, another exception occurred:
>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils
>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils Traceback
>>>> (most recent call last):
>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils   File
>>>> "/usr/local/lib/python3.6/site-packages/urllib3/response.py", line 438, in
>>>> _error_catcher
>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils     yield
>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils   File
>>>> "/usr/local/lib/python3.6/site-packages/urllib3/response.py", line 764, in
>>>> read_chunked
>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils
>>>> self._update_chunk_length()
>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils   File
>>>> "/usr/local/lib/python3.6/site-packages/urllib3/response.py", line 701, in
>>>> _update_chunk_length
>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils     raise
>>>> InvalidChunkLength(self, line)
>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils
>>>> urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0
>>>> bytes read)
>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils
>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils During
>>>> handling of the above exception, another exception occurred:
>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils
>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils Traceback
>>>> (most recent call last):
>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils   File
>>>> "/usr/local/lib/python3.6/site-packages/requests/models.py", line 760, in
>>>> generate
>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils     for
>>>> chunk in self.raw.stream(chunk_size, decode_content=True):
>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils   File
>>>> "/usr/local/lib/python3.6/site-packages/urllib3/response.py", line 572, in
>>>> stream
>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils     for
>>>> line in self.read_chunked(amt, decode_content=decode_content):
>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils   File
>>>> "/usr/local/lib/python3.6/site-packages/urllib3/response.py", line 793, in
>>>> read_chunked
>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils
>>>> self._original_response.close()
>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils   File
>>>> "/usr/lib64/python3.6/contextlib.py", line 99, in __exit__
>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils
>>>> self.gen.throw(type, value, traceback)
>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils   File
>>>> "/usr/local/lib/python3.6/site-packages/urllib3/response.py", line 455, in
>>>> _error_catcher
>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils     raise
>>>> ProtocolError("Connection broken: %r" % e, e)
>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils
>>>> urllib3.exceptions.ProtocolError: ("Connection broken:
>>>> InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got
>>>> length b'', 0 bytes r
>>>> ead))
>>>>
>>>> On Wed, Jul 20, 2022 at 2:04 PM Wade Albright <
>>>> the.wade.albright at gmail.com> wrote:
>>>>
>>>>> I forgot to mention, that using session auth solved the problem after
>>>>> upgrading to the newer versions that include the two mentioned patches.
>>>>>
>>>>> On Wed, Jul 20, 2022 at 7:36 AM Wade Albright <
>>>>> the.wade.albright at gmail.com> wrote:
>>>>>
>>>>>> Switching to session auth solved the problem, and it seems like the
>>>>>> better way to go anyway for equipment that supports it. Thanks again for
>>>>>> all your help!
>>>>>>
>>>>>> Wade
>>>>>>
>>>>>> On Tue, Jul 19, 2022 at 5:37 PM Julia Kreger <
>>>>>> juliaashleykreger at gmail.com> wrote:
>>>>>>
>>>>>>> Just to provide a brief update for the mailing list. It looks like
>>>>>>> this is a case of use of Basic Auth with the BMC, where we were not
>>>>>>> catching the error properly... and thus not reporting the
>>>>>>> authentication failure to ironic so it would catch, and initiate a
>>>>>>> new
>>>>>>> client with the most up to date password. The default, typically used
>>>>>>> path is Session based authentication as BMCs generally handle
>>>>>>> internal
>>>>>>> session/user login tracking in a far better fashion. But not every
>>>>>>> BMC
>>>>>>> supports sessions.
>>>>>>>
>>>>>>> Fix in review[0] :)
>>>>>>>
>>>>>>> -Julia
>>>>>>> [0] https://review.opendev.org/c/openstack/sushy/+/850425
>>>>>>>
>>>>>>> On Mon, Jul 18, 2022 at 4:15 PM Julia Kreger
>>>>>>> <juliaashleykreger at gmail.com> wrote:
>>>>>>> >
>>>>>>> > Excellent, hopefully I'll be able to figure out why Sushy is not
>>>>>>> doing
>>>>>>> > the needful... Or if it is and Ironic is not picking up on it.
>>>>>>> >
>>>>>>> > Anyway, I've posted
>>>>>>> > https://review.opendev.org/c/openstack/ironic/+/850259 which might
>>>>>>> > handle this issue. Obviously a work in progress, but it represents
>>>>>>> > what I think is happening inside of ironic itself leading into
>>>>>>> sushy
>>>>>>> > when cache access occurs.
>>>>>>> >
>>>>>>> > On Mon, Jul 18, 2022 at 4:04 PM Wade Albright
>>>>>>> > <the.wade.albright at gmail.com> wrote:
>>>>>>> > >
>>>>>>> > > Sounds good, I will do that tomorrow. Thanks Julia.
>>>>>>> > >
>>>>>>> > > On Mon, Jul 18, 2022 at 3:27 PM Julia Kreger <
>>>>>>> juliaashleykreger at gmail.com> wrote:
>>>>>>> > >>
>>>>>>> > >> Debug would be best. I think I have an idea what is going on,
>>>>>>> and this
>>>>>>> > >> is a similar variation. If you want, you can email them
>>>>>>> directly to
>>>>>>> > >> me. Specifically only need entries reported by the sushy
>>>>>>> library and
>>>>>>> > >> ironic.drivers.modules.redfish.utils.
>>>>>>> > >>
>>>>>>> > >> On Mon, Jul 18, 2022 at 3:20 PM Wade Albright
>>>>>>> > >> <the.wade.albright at gmail.com> wrote:
>>>>>>> > >> >
>>>>>>> > >> > I'm happy to supply some logs. What verbosity level should I
>>>>>>> use? And should I just embed the logs in an email to the list or
>>>>>>> upload them somewhere?
>>>>>>> > >> >
>>>>>>> > >> > On Mon, Jul 18, 2022 at 3:14 PM Julia Kreger <
>>>>>>> juliaashleykreger at gmail.com> wrote:
>>>>>>> > >> >>
>>>>>>> > >> >> If you could supply some conductor logs, that would be
>>>>>>> helpful. It
>>>>>>> > >> >> should be re-authenticating, but obviously we have a larger
>>>>>>> bug there
>>>>>>> > >> >> we need to find the root issue behind.
>>>>>>> > >> >>
>>>>>>> > >> >> On Mon, Jul 18, 2022 at 3:06 PM Wade Albright
>>>>>>> > >> >> <the.wade.albright at gmail.com> wrote:
>>>>>>> > >> >> >
>>>>>>> > >> >> > I was able to use the patches to update the code, but
>>>>>>> unfortunately the problem is still there for me.
>>>>>>> > >> >> >
>>>>>>> > >> >> > I also tried an RPM upgrade to the versions Julia
>>>>>>> mentioned had the fixes, namely Sushy 3.12.1 - Released May 2022 and Ironic
>>>>>>> 18.2.1 - Released in January 2022. But it did not fix the problem.
>>>>>>> > >> >> >
>>>>>>> > >> >> > I am able to consistently reproduce the error.
>>>>>>> > >> >> >  - step 1: change BMC password directly on the node itself
>>>>>>> > >> >> >  - step 2: update BMC password (redfish_password) in
>>>>>>> ironic with 'openstack baremetal node set <nodename> --driver-info
>>>>>>> redfish_password='newpass'
>>>>>>> > >> >> >
>>>>>>> > >> >> > After step 1 there are errors in the logs entries like
>>>>>>> "Session authentication appears to have been lost at some point in time"
>>>>>>> and eventually it puts the node into maintenance mode and marks the power
>>>>>>> state as "none."
>>>>>>> > >> >> > After step 2 and taking the host back out of maintenance
>>>>>>> mode, it goes through a similar set of log entries puts the node into MM
>>>>>>> again.
>>>>>>> > >> >> >
>>>>>>> > >> >> > After the above steps, a conductor restart fixes the
>>>>>>> problem and operations work normally again. Given this it seems like there
>>>>>>> is still some kind of caching issue.
>>>>>>> > >> >> >
>>>>>>> > >> >> > On Sat, Jul 16, 2022 at 6:01 PM Wade Albright <
>>>>>>> the.wade.albright at gmail.com> wrote:
>>>>>>> > >> >> >>
>>>>>>> > >> >> >> Hi Julia,
>>>>>>> > >> >> >>
>>>>>>> > >> >> >> Thank you so much for the reply! Hopefully this is the
>>>>>>> issue. I'll try out the patches next week and report back. I'll also email
>>>>>>> you on Monday about the versions, that would be very helpful to know.
>>>>>>> > >> >> >>
>>>>>>> > >> >> >> Thanks again, really appreciate it.
>>>>>>> > >> >> >>
>>>>>>> > >> >> >> Wade
>>>>>>> > >> >> >>
>>>>>>> > >> >> >>
>>>>>>> > >> >> >>
>>>>>>> > >> >> >> On Sat, Jul 16, 2022 at 4:36 PM Julia Kreger <
>>>>>>> juliaashleykreger at gmail.com> wrote:
>>>>>>> > >> >> >>>
>>>>>>> > >> >> >>> Greetings!
>>>>>>> > >> >> >>>
>>>>>>> > >> >> >>> I believe you need two patches, one in ironic and one in
>>>>>>> sushy.
>>>>>>> > >> >> >>>
>>>>>>> > >> >> >>> Sushy:
>>>>>>> > >> >> >>> https://review.opendev.org/c/openstack/sushy/+/832860
>>>>>>> > >> >> >>>
>>>>>>> > >> >> >>> Ironic:
>>>>>>> > >> >> >>> https://review.opendev.org/c/openstack/ironic/+/820588
>>>>>>> > >> >> >>>
>>>>>>> > >> >> >>> I think it is variation, and the comment about working
>>>>>>> after you restart the conductor is the big signal to me. I’m on a phone on
>>>>>>> a bad data connection, if you email me on Monday I can see what versions
>>>>>>> the fixes would be in.
>>>>>>> > >> >> >>>
>>>>>>> > >> >> >>> For the record, it is a session cache issue, the bug was
>>>>>>> that the service didn’t quite know what to do when auth fails.
>>>>>>> > >> >> >>>
>>>>>>> > >> >> >>> -Julia
>>>>>>> > >> >> >>>
>>>>>>> > >> >> >>>
>>>>>>> > >> >> >>> On Fri, Jul 15, 2022 at 2:55 PM Wade Albright <
>>>>>>> the.wade.albright at gmail.com> wrote:
>>>>>>> > >> >> >>>>
>>>>>>> > >> >> >>>> Hi,
>>>>>>> > >> >> >>>>
>>>>>>> > >> >> >>>> I'm hitting a problem when trying to update the
>>>>>>> redfish_password for an existing node. I'm curious to know if anyone else
>>>>>>> has encountered this problem. I'm not sure if I'm just doing something
>>>>>>> wrong or if there is a bug. Or if the problem is unique to my setup.
>>>>>>> > >> >> >>>>
>>>>>>> > >> >> >>>> I have a node already added into ironic with all the
>>>>>>> driver details set, and things are working fine. I am able to run
>>>>>>> deployments.
>>>>>>> > >> >> >>>>
>>>>>>> > >> >> >>>> Now I need to change the redfish password on the host.
>>>>>>> So I update the password for redfish access on the host, then use an
>>>>>>> 'openstack baremetal node set <node> --driver-info
>>>>>>> redfish_password=<newpass>' command to set the new redfish_password.
>>>>>>> > >> >> >>>>
>>>>>>> > >> >> >>>> Once this has been done, deployment no longer works. I
>>>>>>> see redfish authentication errors in the logs and the operation fails. I
>>>>>>> waited a bit to see if there might just be a delay in updating the
>>>>>>> password, but after a while it still didn't work.
>>>>>>> > >> >> >>>>
>>>>>>> > >> >> >>>> I restarted the conductor, and after that things work
>>>>>>> fine again. So it seems like the password is cached or something. Is there
>>>>>>> a way to force the password to update? I even tried removing the redfish
>>>>>>> credentials and re-adding them, but that didn't work either. Only a
>>>>>>> conductor restart seems to make the new password work.
>>>>>>> > >> >> >>>>
>>>>>>> > >> >> >>>> We are running Xena, using rpm installation on Oracle
>>>>>>> Linux 8.5.
>>>>>>> > >> >> >>>>
>>>>>>> > >> >> >>>> Thanks in advance for any help with this issue.
>>>>>>>
>>>>>>


More information about the openstack-discuss mailing list