Thanks Julia! When I get a chance I will test that out and report back. I may not be able to get to it right away, as the system where I could reproduce this issue reliably is in use by another team now.

On Mon, Aug 15, 2022 at 5:27 PM Julia Kreger <juliaashleykreger@gmail.com> wrote:

Well, that is weird. If I grok it as-is, it almost looks like the BMC
returned an empty response.... It is not a failure mode we've seen or
had reported (afaik) up to this point.
That being said, we should be able to invalidate the session and
launch a new client...
I suspect https://review.opendev.org/c/openstack/sushy/+/853209
should fix things up. We will likely look at just invalidating the
session upon any error.
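For illustration, the invalidate-on-error idea could look roughly like the sketch below. This is only a sketch of the general pattern, not the actual sushy patch; the class and method names are made up:

```python
# Sketch of "invalidate the cached session on any error": wrap an operation,
# and on failure drop the cached client so the next call authenticates fresh.
# Hypothetical names; not the real sushy/ironic code.
class SessionCache:
    def __init__(self, factory):
        self._factory = factory  # callable that builds a fresh authenticated client
        self._client = None

    def client(self):
        # Lazily create (or re-create) the authenticated client.
        if self._client is None:
            self._client = self._factory()
        return self._client

    def invalidate(self):
        # Forget the cached session entirely.
        self._client = None

    def call(self, op):
        try:
            return op(self.client())
        except Exception:
            # Any error: throw away the cached session and retry once
            # with a newly launched client.
            self.invalidate()
            return op(self.client())
```

The point of the retry inside `call()` is that a stale session (e.g. after a BMC-side password change or an empty response) fails once, gets discarded, and the second attempt runs against a brand-new authenticated client.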
On Mon, Aug 15, 2022 at 11:47 AM Wade Albright
<the.wade.albright@gmail.com> wrote:
>
> 1) There are sockets open briefly when the conductor is trying to connect. After three tries the node is set to maintenance mode and there are no more sockets open.
> 2) My (extremely simple) code was not using connection: close. I was just running "requests.get("https://10.12.104.174/redfish/v1/Systems/System.Embedded.1", verify=False, auth=('xxxx', 'xxxx'))" in a loop. I just tried it with headers={'Connection':'close'} and it doesn't seem to make any difference. Works fine either way.
>
> I was able to confirm that the problem only happens when using session auth. With basic auth it doesn't happen.
>
> Versions I'm using here are ironic 18.2.1 and sushy 3.12.2.
>
> Here are some fresh logs from the node having the problem:
>
> 2022-08-15 10:34:21.726 208875 INFO ironic.conductor.task_manager [req-a66441d0-45a6-4e27-96d9-f26bf9c83725 fcba8c03bfd649cfbb39a907426b3338 b679510ddb6540ca9454e26841f65c89 - default default] Node 0c304cea-8ae2-4a12-b658-dec05c190f88 moved to provision state "deploying" from state "active"; target provision state is "active"
> 2022-08-15 10:34:22.553 208875 INFO ironic.conductor.utils [req-a66441d0-45a6-4e27-96d9-f26bf9c83725 fcba8c03bfd649cfbb39a907426b3338 b679510ddb6540ca9454e26841f65c89 - default default] Node 0c304cea-8ae2-4a12-b658-dec05c190f88 current power state is 'power on', requested state is 'power off'.
> 2022-08-15 10:34:35.185 208875 INFO ironic.conductor.utils [req-a66441d0-45a6-4e27-96d9-f26bf9c83725 fcba8c03bfd649cfbb39a907426b3338 b679510ddb6540ca9454e26841f65c89 - default default] Successfully set node 0c304cea-8ae2-4a12-b658-dec05c190f88 power state to power off by power off.
> 2022-08-15 10:34:35.200 208875 WARNING ironic.common.pxe_utils [req-a66441d0-45a6-4e27-96d9-f26bf9c83725 fcba8c03bfd649cfbb39a907426b3338 b679510ddb6540ca9454e26841f65c89 - default default] IPv6 is enabled and the DHCP driver appears set to a plugin aside from "neutron". Node 0c304cea-8ae2-4a12-b658-dec05c190f88 may not receive proper DHCPv6 provided boot parameters.
> 2022-08-15 10:34:38.246 208875 INFO ironic.conductor.deployments [req-a66441d0-45a6-4e27-96d9-f26bf9c83725 fcba8c03bfd649cfbb39a907426b3338 b679510ddb6540ca9454e26841f65c89 - default default] Deploying on node 0c304cea-8ae2-4a12-b658-dec05c190f88, remaining steps: [{'step': 'pre_deploy', 'priority': 200, 'argsinfo': None, 'interface': 'deploy'}, {'step': 'deploy', 'priority': 100, 'argsinfo': None, 'interface': 'deploy'}, {'step': 'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'}, {'step': 'tear_down_agent', 'priority': 40, 'argsinfo': None, 'interface': 'deploy'}, {'step': 'switch_to_tenant_network', 'priority': 30, 'argsinfo': None, 'interface': 'deploy'}, {'step': 'boot_instance', 'priority': 20, 'argsinfo': None, 'interface': 'deploy'}]
> 2022-08-15 10:34:38.255 208875 INFO ironic.conductor.deployments [req-a66441d0-45a6-4e27-96d9-f26bf9c83725 fcba8c03bfd649cfbb39a907426b3338 b679510ddb6540ca9454e26841f65c89 - default default] Executing {'step': 'pre_deploy', 'priority': 200, 'argsinfo': None, 'interface': 'deploy'} on node 0c304cea-8ae2-4a12-b658-dec05c190f88
> 2022-08-15 10:35:27.158 208875 INFO ironic.drivers.modules.ansible.deploy [req-a66441d0-45a6-4e27-96d9-f26bf9c83725 fcba8c03bfd649cfbb39a907426b3338 b679510ddb6540ca9454e26841f65c89 - default default] Ansible pre-deploy step complete on node 0c304cea-8ae2-4a12-b658-dec05c190f88
> 2022-08-15 10:35:27.159 208875 INFO ironic.conductor.deployments [req-a66441d0-45a6-4e27-96d9-f26bf9c83725 fcba8c03bfd649cfbb39a907426b3338 b679510ddb6540ca9454e26841f65c89 - default default] Node 0c304cea-8ae2-4a12-b658-dec05c190f88 finished deploy step {'step': 'pre_deploy', 'priority': 200, 'argsinfo': None, 'interface': 'deploy'}
> 2022-08-15 10:35:27.160 208875 INFO ironic.conductor.deployments [req-a66441d0-45a6-4e27-96d9-f26bf9c83725 fcba8c03bfd649cfbb39a907426b3338 b679510ddb6540ca9454e26841f65c89 - default default] Deploying on node 0c304cea-8ae2-4a12-b658-dec05c190f88, remaining steps: [{'step': 'deploy', 'priority': 100, 'argsinfo': None, 'interface': 'deploy'}, {'step': 'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'}, {'step': 'tear_down_agent', 'priority': 40, 'argsinfo': None, 'interface': 'deploy'}, {'step': 'switch_to_tenant_network', 'priority': 30, 'argsinfo': None, 'interface': 'deploy'}, {'step': 'boot_instance', 'priority': 20, 'argsinfo': None, 'interface': 'deploy'}]
> 2022-08-15 10:35:27.176 208875 INFO ironic.conductor.deployments [req-a66441d0-45a6-4e27-96d9-f26bf9c83725 fcba8c03bfd649cfbb39a907426b3338 b679510ddb6540ca9454e26841f65c89 - default default] Executing {'step': 'deploy', 'priority': 100, 'argsinfo': None, 'interface': 'deploy'} on node 0c304cea-8ae2-4a12-b658-dec05c190f88
> 2022-08-15 10:35:32.037 208875 INFO ironic.conductor.utils [req-a66441d0-45a6-4e27-96d9-f26bf9c83725 fcba8c03bfd649cfbb39a907426b3338 b679510ddb6540ca9454e26841f65c89 - default default] Successfully set node 0c304cea-8ae2-4a12-b658-dec05c190f88 power state to power on by rebooting.
> 2022-08-15 10:35:32.037 208875 INFO ironic.conductor.deployments [req-a66441d0-45a6-4e27-96d9-f26bf9c83725 fcba8c03bfd649cfbb39a907426b3338 b679510ddb6540ca9454e26841f65c89 - default default] Deploy step {'step': 'deploy', 'priority': 100, 'argsinfo': None, 'interface': 'deploy'} on node 0c304cea-8ae2-4a12-b658-dec05c190f88 being executed asynchronously, waiting for driver.
> 2022-08-15 10:35:32.051 208875 INFO ironic.conductor.task_manager [req-a66441d0-45a6-4e27-96d9-f26bf9c83725 fcba8c03bfd649cfbb39a907426b3338 b679510ddb6540ca9454e26841f65c89 - default default] Node 0c304cea-8ae2-4a12-b658-dec05c190f88 moved to provision state "wait call-back" from state "deploying"; target provision state is "active"
> 2022-08-15 10:39:54.726 208875 INFO ironic.conductor.task_manager [req-fdd7319b-85b5-4adb-9773-ff64fd1423b2 - - - - -] Node 0c304cea-8ae2-4a12-b658-dec05c190f88 moved to provision state "deploying" from state "wait call-back"; target provision state is "active"
> 2022-08-15 10:39:54.741 208875 INFO ironic.conductor.deployments [req-fdd7319b-85b5-4adb-9773-ff64fd1423b2 - - - - -] Deploying on node 0c304cea-8ae2-4a12-b658-dec05c190f88, remaining steps: [{'step': 'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'}, {'step': 'tear_down_agent', 'priority': 40, 'argsinfo': None, 'interface': 'deploy'}, {'step': 'switch_to_tenant_network', 'priority': 30, 'argsinfo': None, 'interface': 'deploy'}, {'step': 'boot_instance', 'priority': 20, 'argsinfo': None, 'interface': 'deploy'}]
> 2022-08-15 10:39:54.748 208875 INFO ironic.conductor.deployments [req-fdd7319b-85b5-4adb-9773-ff64fd1423b2 - - - - -] Executing {'step': 'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'} on node 0c304cea-8ae2-4a12-b658-dec05c190f88
> 2022-08-15 10:42:24.738 208875 WARNING ironic.drivers.modules.agent_base [-] Node 0c304cea-8ae2-4a12-b658-dec05c190f88 is currently locked, skipping heartbeat processing (will retry on the next heartbeat): ironic.common.exception.NodeLocked: Node 0c304cea-8ae2-4a12-b658-dec05c190f88 is locked by host sjc06-c01-irn01.ops.ringcentral.com, please retry after the current operation is completed.
> 2022-08-15 10:44:29.788 208875 WARNING ironic.drivers.modules.agent_base [-] Node 0c304cea-8ae2-4a12-b658-dec05c190f88 is currently locked, skipping heartbeat processing (will retry on the next heartbeat): ironic.common.exception.NodeLocked: Node 0c304cea-8ae2-4a12-b658-dec05c190f88 is locked by host sjc06-c01-irn01.ops.ringcentral.com, please retry after the current operation is completed.
> 2022-08-15 10:47:24.830 208875 WARNING ironic.drivers.modules.agent_base [-] Node 0c304cea-8ae2-4a12-b658-dec05c190f88 is currently locked, skipping heartbeat processing (will retry on the next heartbeat): ironic.common.exception.NodeLocked: Node 0c304cea-8ae2-4a12-b658-dec05c190f88 is locked by host sjc06-c01-irn01.ops.ringcentral.com, please retry after the current operation is completed.
> 2022-08-15 11:05:59.544 208875 INFO ironic.drivers.modules.ansible.deploy [req-fdd7319b-85b5-4adb-9773-ff64fd1423b2 - - - - -] Ansible complete deploy on node 0c304cea-8ae2-4a12-b658-dec05c190f88
> 2022-08-15 11:06:00.141 208875 ERROR ironic.conductor.utils [req-fdd7319b-85b5-4adb-9773-ff64fd1423b2 - - - - -] Node 0c304cea-8ae2-4a12-b658-dec05c190f88 failed deploy step {'step': 'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'} with unexpected error: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read)): requests.exceptions.ChunkedEncodingError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
> 2022-08-15 11:06:00.218 208875 ERROR ironic.conductor.task_manager [req-fdd7319b-85b5-4adb-9773-ff64fd1423b2 - - - - -] Node 0c304cea-8ae2-4a12-b658-dec05c190f88 moved to provision state "deploy failed" from state "deploying"; target provision state is "active": requests.exceptions.ChunkedEncodingError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
> 2022-08-15 11:06:28.774 208875 WARNING ironic.conductor.manager [req-dd0abde9-3364-4828-8907-b42fe4cc9c66 - - - - -] During sync_power_state, could not get power state for node 0c304cea-8ae2-4a12-b658-dec05c190f88, attempt 1 of 3. Error: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read)).: requests.exceptions.ChunkedEncodingError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
> 2022-08-15 11:06:53.710 208875 WARNING ironic.conductor.manager [req-dd0abde9-3364-4828-8907-b42fe4cc9c66 - - - - -] During sync_power_state, could not get power state for node 0c304cea-8ae2-4a12-b658-dec05c190f88, attempt 2 of 3. Error: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read)).: requests.exceptions.ChunkedEncodingError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
> 2022-08-15 11:07:53.727 208875 WARNING ironic.conductor.manager [req-dd0abde9-3364-4828-8907-b42fe4cc9c66 - - - - -] During sync_power_state, could not get power state for node 0c304cea-8ae2-4a12-b658-dec05c190f88, attempt 3 of 3. Error: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read)).: requests.exceptions.ChunkedEncodingError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
> 2022-08-15 11:08:53.704 208875 ERROR ironic.conductor.manager [req-dd0abde9-3364-4828-8907-b42fe4cc9c66 - - - - -] During sync_power_state, max retries exceeded for node 0c304cea-8ae2-4a12-b658-dec05c190f88, node state None does not match expected state 'power on'. Updating DB state to 'None' Switching node to maintenance mode. Error: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read)): requests.exceptions.ChunkedEncodingError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
> 2022-08-15 11:13:53.750 208875 ERROR ironic.conductor.manager [req-dd0abde9-3364-4828-8907-b42fe4cc9c66 - - - - -] During sync_power_state, max retries exceeded for node 0c304cea-8ae2-4a12-b658-dec05c190f88, node state None does not match expected state 'None'. Updating DB state to 'None' Switching node to maintenance mode. Error: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read)): requests.exceptions.ChunkedEncodingError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
>
>
> On Fri, Aug 12, 2022 at 9:06 PM Julia Kreger <juliaashleykreger@gmail.com> wrote:
>>
>> Two questions:
>>
>> 1) do you see open sockets to the BMCs in netstat output?
>> 2) is your code using ‘connection: close’? Or are you using sushy?
>>
>> Honestly, this seems *really* weird with current sushy versions, and is kind of reminiscent of a cached session which is using kept alive sockets.
>>
>> If you could grep out req-b6dd74da-1cc7-4c63-b58e-b7ded37007e9 to see what the prior couple of conductor actions were, that would give us better context as to what is going on.
>>
>> -Julia
>>
>> On Fri, Aug 12, 2022 at 3:11 PM Wade Albright <the.wade.albright@gmail.com> wrote:
>>>
>>> Sorry for the spam. The openssl issue may have been a red herring. I am not able to reproduce the issue directly with my own python code. I was trying to fetch something that required authentication. After I added the correct auth info it works fine. I am not able to cause the same error as is happening in the Ironic logs.
>>>
>>> Anyway I'll do some more testing and report back.
>>>
>>> On Fri, Aug 12, 2022 at 2:14 PM Wade Albright <the.wade.albright@gmail.com> wrote:
>>>>
>>>> I'm not sure why this problem only now started showing up, but it appears to be unrelated to Ironic. I was able to reproduce it directly outside of Ironic with a simple Python program that uses urllib to fetch URLs from the BMC/redfish interface. It seems to be some combination of a buggy server-side SSL implementation and the newer OpenSSL 1.1.1; apparently it doesn't happen with OpenSSL 1.0.
>>>>
>>>> I've found some information about possible workarounds but haven't figured it out yet. If I do I'll update this thread just in case anyone else runs into it.
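A standalone reproduction along the lines Wade describes might look like the following, using only the standard library. The BMC URL is a placeholder, and certificate verification is disabled only to mirror the original test, not as a recommendation:

```python
# Sketch of a standalone urllib reproduction against a Redfish BMC.
# The URL is hypothetical; verify is disabled only to mirror the test setup.
import ssl
import urllib.request

def insecure_context() -> ssl.SSLContext:
    """Build a TLS context that skips certificate verification."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx

def fetch(url: str) -> bytes:
    """Fetch a Redfish URL. With a buggy server-side TLS stack and
    OpenSSL 1.1.1, this is the kind of call where the chunked-read
    errors reportedly surfaced."""
    with urllib.request.urlopen(url, context=insecure_context()) as resp:
        return resp.read()

if __name__ == "__main__":
    print(fetch("https://bmc.example.com/redfish/v1/"))
```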
>>>>
>>>> On Fri, Aug 12, 2022 at 8:13 AM Wade Albright <the.wade.albright@gmail.com> wrote:
>>>>>
>>>>> So I seem to have run into a new issue after upgrading to the newer versions to fix the password change issue.
>>>>>
>>>>> Now I am randomly getting errors like the below. Once I hit this error for a given node, no operations work on the node. I thought maybe it was an issue with the node itself, but it doesn't seem like it. The BMC seems to be working fine.
>>>>>
>>>>> After a conductor restart, things start working again. Has anyone seen something like this?
>>>>>
>>>>> Log example:
>>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils [- - - - -] Node ef5a2502-680b-4933-a0ee-6737e57ce1c5 failed deploy step {'step': 'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'} with unexpected error: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read)): requests.exceptions.ChunkedEncodingError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
>>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils Traceback (most recent call last):
>>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils File "/usr/local/lib/python3.6/site-packages/urllib3/response.py", line 697, in _update_chunk_length
>>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils self.chunk_left = int(line, 16)
>>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils ValueError: invalid literal for int() with base 16: b''
>>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils
>>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils During handling of the above exception, another exception occurred:
>>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils
>>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils Traceback (most recent call last):
>>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils File "/usr/local/lib/python3.6/site-packages/urllib3/response.py", line 438, in _error_catcher
>>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils yield
>>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils File "/usr/local/lib/python3.6/site-packages/urllib3/response.py", line 764, in read_chunked
>>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils self._update_chunk_length()
>>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils File "/usr/local/lib/python3.6/site-packages/urllib3/response.py", line 701, in _update_chunk_length
>>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils raise InvalidChunkLength(self, line)
>>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)
>>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils
>>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils During handling of the above exception, another exception occurred:
>>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils
>>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils Traceback (most recent call last):
>>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils File "/usr/local/lib/python3.6/site-packages/requests/models.py", line 760, in generate
>>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils for chunk in self.raw.stream(chunk_size, decode_content=True):
>>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils File "/usr/local/lib/python3.6/site-packages/urllib3/response.py", line 572, in stream
>>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils for line in self.read_chunked(amt, decode_content=decode_content):
>>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils File "/usr/local/lib/python3.6/site-packages/urllib3/response.py", line 793, in read_chunked
>>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils self._original_response.close()
>>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils File "/usr/lib64/python3.6/contextlib.py", line 99, in __exit__
>>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils self.gen.throw(type, value, traceback)
>>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils File "/usr/local/lib/python3.6/site-packages/urllib3/response.py", line 455, in _error_catcher
>>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils raise ProtocolError("Connection broken: %r" % e, e)
>>>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
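For what it's worth, the failure at the bottom of that trace is mechanical: a chunked HTTP/1.1 body prefixes each chunk with its size as a hex line, and urllib3's _update_chunk_length() is essentially int(line, 16). An empty line where a size is expected produces exactly the ValueError shown at the top of the trace. A simplified sketch (not urllib3's actual code):

```python
# Simplified sketch of urllib3's chunk-size parsing (not the real code).
# Each chunk in a chunked HTTP body starts with "<hex-size>[;extensions]\r\n".
def parse_chunk_length(line: bytes) -> int:
    size = line.split(b";", 1)[0].strip()  # drop chunk extensions and CRLF
    # An empty size line here raises:
    # ValueError: invalid literal for int() with base 16: b''
    return int(size, 16)

print(parse_chunk_length(b"1a\r\n"))  # → 26, i.e. a 26-byte chunk follows
```

So "InvalidChunkLength(got length b'', 0 bytes read)" means the server (or a broken TLS layer underneath) handed urllib3 an empty line where a chunk size belonged, which fits the empty-BMC-response theory above.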
>>>>>
>>>>> On Wed, Jul 20, 2022 at 2:04 PM Wade Albright <the.wade.albright@gmail.com> wrote:
>>>>>>
>>>>>> I forgot to mention that using session auth solved the problem after upgrading to the newer versions that include the two mentioned patches.
>>>>>>
>>>>>> On Wed, Jul 20, 2022 at 7:36 AM Wade Albright <the.wade.albright@gmail.com> wrote:
>>>>>>>
>>>>>>> Switching to session auth solved the problem, and it seems like the better way to go anyway for equipment that supports it. Thanks again for all your help!
>>>>>>>
>>>>>>> Wade
>>>>>>>
>>>>>>> On Tue, Jul 19, 2022 at 5:37 PM Julia Kreger <juliaashleykreger@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Just to provide a brief update for the mailing list. It looks like
>>>>>>>> this is a case of using Basic Auth with the BMC, where we were not
>>>>>>>> catching the error properly... and thus not reporting the
>>>>>>>> authentication failure to ironic so it could catch it and initiate a
>>>>>>>> new client with the most up-to-date password. The default, typically
>>>>>>>> used path is session-based authentication, as BMCs generally handle
>>>>>>>> internal session/user login tracking in a far better fashion. But not
>>>>>>>> every BMC supports sessions.
>>>>>>>>
>>>>>>>> Fix in review[0] :)
>>>>>>>>
>>>>>>>> -Julia
>>>>>>>> [0] https://review.opendev.org/c/openstack/sushy/+/850425
>>>>>>>>
>>>>>>>> On Mon, Jul 18, 2022 at 4:15 PM Julia Kreger
>>>>>>>> <juliaashleykreger@gmail.com> wrote:
>>>>>>>> >
>>>>>>>> > Excellent, hopefully I'll be able to figure out why Sushy is not doing
>>>>>>>> > the needful... Or if it is and Ironic is not picking up on it.
>>>>>>>> >
>>>>>>>> > Anyway, I've posted
>>>>>>>> > https://review.opendev.org/c/openstack/ironic/+/850259 which might
>>>>>>>> > handle this issue. Obviously a work in progress, but it represents
>>>>>>>> > what I think is happening inside of ironic itself leading into sushy
>>>>>>>> > when cache access occurs.
>>>>>>>> >
>>>>>>>> > On Mon, Jul 18, 2022 at 4:04 PM Wade Albright
>>>>>>>> > <the.wade.albright@gmail.com> wrote:
>>>>>>>> > >
>>>>>>>> > > Sounds good, I will do that tomorrow. Thanks Julia.
>>>>>>>> > >
>>>>>>>> > > On Mon, Jul 18, 2022 at 3:27 PM Julia Kreger <juliaashleykreger@gmail.com> wrote:
>>>>>>>> > >>
>>>>>>>> > >> Debug would be best. I think I have an idea what is going on, and this
>>>>>>>> > >> is a similar variation. If you want, you can email them directly to
>>>>>>>> > >> me. Specifically only need entries reported by the sushy library and
>>>>>>>> > >> ironic.drivers.modules.redfish.utils.
>>>>>>>> > >>
>>>>>>>> > >> On Mon, Jul 18, 2022 at 3:20 PM Wade Albright
>>>>>>>> > >> <the.wade.albright@gmail.com> wrote:
>>>>>>>> > >> >
>>>>>>>> > >> > I'm happy to supply some logs, what verbosity level should i use? And should I just embed the logs in email to the list or upload somewhere?
>>>>>>>> > >> >
>>>>>>>> > >> > On Mon, Jul 18, 2022 at 3:14 PM Julia Kreger <juliaashleykreger@gmail.com> wrote:
>>>>>>>> > >> >>
>>>>>>>> > >> >> If you could supply some conductor logs, that would be helpful. It
>>>>>>>> > >> >> should be re-authenticating, but obviously there is a larger bug here
>>>>>>>> > >> >> whose root cause we need to find.
>>>>>>>> > >> >>
>>>>>>>> > >> >> On Mon, Jul 18, 2022 at 3:06 PM Wade Albright
>>>>>>>> > >> >> <the.wade.albright@gmail.com> wrote:
>>>>>>>> > >> >> >
>>>>>>>> > >> >> > I was able to use the patches to update the code, but unfortunately the problem is still there for me.
>>>>>>>> > >> >> >
>>>>>>>> > >> >> > I also tried an RPM upgrade to the versions Julia mentioned had the fixes, namely sushy 3.12.1 (released May 2022) and ironic 18.2.1 (released January 2022). But it did not fix the problem.
>>>>>>>> > >> >> >
>>>>>>>> > >> >> > I am able to consistently reproduce the error.
>>>>>>>> > >> >> > - step 1: change BMC password directly on the node itself
>>>>>>>> > >> >> > - step 2: update the BMC password (redfish_password) in ironic with 'openstack baremetal node set <nodename> --driver-info redfish_password=<newpass>'
>>>>>>>> > >> >> >
>>>>>>>> > >> >> > After step 1 there are errors in the logs entries like "Session authentication appears to have been lost at some point in time" and eventually it puts the node into maintenance mode and marks the power state as "none."
>>>>>>>> > >> >> > After step 2 and taking the host back out of maintenance mode, it goes through a similar set of log entries and puts the node into maintenance mode again.
>>>>>>>> > >> >> >
>>>>>>>> > >> >> > After the above steps, a conductor restart fixes the problem and operations work normally again. Given this it seems like there is still some kind of caching issue.
>>>>>>>> > >> >> >
>>>>>>>> > >> >> > On Sat, Jul 16, 2022 at 6:01 PM Wade Albright <the.wade.albright@gmail.com> wrote:
>>>>>>>> > >> >> >>
>>>>>>>> > >> >> >> Hi Julia,
>>>>>>>> > >> >> >>
>>>>>>>> > >> >> >> Thank you so much for the reply! Hopefully this is the issue. I'll try out the patches next week and report back. I'll also email you on Monday about the versions, that would be very helpful to know.
>>>>>>>> > >> >> >>
>>>>>>>> > >> >> >> Thanks again, really appreciate it.
>>>>>>>> > >> >> >>
>>>>>>>> > >> >> >> Wade
>>>>>>>> > >> >> >>
>>>>>>>> > >> >> >>
>>>>>>>> > >> >> >>
>>>>>>>> > >> >> >> On Sat, Jul 16, 2022 at 4:36 PM Julia Kreger <juliaashleykreger@gmail.com> wrote:
>>>>>>>> > >> >> >>>
>>>>>>>> > >> >> >>> Greetings!
>>>>>>>> > >> >> >>>
>>>>>>>> > >> >> >>> I believe you need two patches, one in ironic and one in sushy.
>>>>>>>> > >> >> >>>
>>>>>>>> > >> >> >>> Sushy:
>>>>>>>> > >> >> >>> https://review.opendev.org/c/openstack/sushy/+/832860
>>>>>>>> > >> >> >>>
>>>>>>>> > >> >> >>> Ironic:
>>>>>>>> > >> >> >>> https://review.opendev.org/c/openstack/ironic/+/820588
>>>>>>>> > >> >> >>>
>>>>>>>> > >> >> >>> I think it is a variation, and the comment about things working after you restart the conductor is the big signal to me. I’m on a phone on a bad data connection; if you email me on Monday I can see what versions the fixes would be in.
>>>>>>>> > >> >> >>>
>>>>>>>> > >> >> >>> For the record, it is a session cache issue, the bug was that the service didn’t quite know what to do when auth fails.
>>>>>>>> > >> >> >>>
>>>>>>>> > >> >> >>> -Julia
>>>>>>>> > >> >> >>>
>>>>>>>> > >> >> >>>
>>>>>>>> > >> >> >>> On Fri, Jul 15, 2022 at 2:55 PM Wade Albright <the.wade.albright@gmail.com> wrote:
>>>>>>>> > >> >> >>>>
>>>>>>>> > >> >> >>>> Hi,
>>>>>>>> > >> >> >>>>
>>>>>>>> > >> >> >>>> I'm hitting a problem when trying to update the redfish_password for an existing node. I'm curious to know if anyone else has encountered this problem. I'm not sure if I'm just doing something wrong or if there is a bug. Or if the problem is unique to my setup.
>>>>>>>> > >> >> >>>>
>>>>>>>> > >> >> >>>> I have a node already added into ironic with all the driver details set, and things are working fine. I am able to run deployments.
>>>>>>>> > >> >> >>>>
>>>>>>>> > >> >> >>>> Now I need to change the redfish password on the host. So I update the password for redfish access on the host, then use an 'openstack baremetal node set <node> --driver-info redfish_password=<newpass>' command to set the new redfish_password.
>>>>>>>> > >> >> >>>>
>>>>>>>> > >> >> >>>> Once this has been done, deployment no longer works. I see redfish authentication errors in the logs and the operation fails. I waited a bit to see if there might just be a delay in updating the password, but after a while it still didn't work.
>>>>>>>> > >> >> >>>>
>>>>>>>> > >> >> >>>> I restarted the conductor, and after that things work fine again. So it seems like the password is cached or something. Is there a way to force the password to update? I even tried removing the redfish credentials and re-adding them, but that didn't work either. Only a conductor restart seems to make the new password work.
>>>>>>>> > >> >> >>>>
>>>>>>>> > >> >> >>>> We are running Xena, using rpm installation on Oracle Linux 8.5.
>>>>>>>> > >> >> >>>>
>>>>>>>> > >> >> >>>> Thanks in advance for any help with this issue.