[ironic][xena] problems updating redfish_password for existing node

Julia Kreger juliaashleykreger at gmail.com
Sat Aug 13 04:05:52 UTC 2022


Two questions:

1)  do you see open sockets to the BMCs in netstat output?
2) is your code using ‘connection: close’? Or are you using sushy?

Honestly, this seems *really* weird with current sushy versions, and is
kind of reminiscent of a cached session which is using kept alive sockets.

If you could grep out req-b6dd74da-1cc7-4c63-b58e-b7ded37007e9 to see what
the prior couple of conductor actions were, that would give us better
context as to what is going on.

-Julia

On Fri, Aug 12, 2022 at 3:11 PM Wade Albright <the.wade.albright at gmail.com>
wrote:

> Sorry for the spam. The openssl issue may have been a red herring. I am
> not able to reproduce the issue directly with my own python code. I was
> trying to fetch something that required authentication. After I added the
> correct auth info it works fine. I am not able to cause the same error as
> is happening in the Ironic logs.
>
> Anyway I'll do some more testing and report back.
>
> On Fri, Aug 12, 2022 at 2:14 PM Wade Albright <the.wade.albright at gmail.com>
> wrote:
>
>> I'm not sure why this problem only now started showing up, but it appears
>> to be unrelated to Ironic. I was able to reproduce it directly outside of
>> Ironic using a simple python program using urllib to get URLs from the
>> BMC/redfish interface. Seems to be some combination of a buggy server SSL
>> implementation and newer openssl 1.1.1. Apparently it doesn't happen using
>> openssl 1.0.
>>
>> I've found some information about possible workarounds but haven't
>> figured it out yet. If I do I'll update this thread just in case anyone
>> else runs into it.
>>
>> On Fri, Aug 12, 2022 at 8:13 AM Wade Albright <
>> the.wade.albright at gmail.com> wrote:
>>
>>> So I seem to have run into a new issue after upgrading to the newer
>>> versions to fix the password change issue.
>>>
>>> Now I am randomly getting errors like the below. Once I hit this error
>>> for a given node, no operations work on the node. I thought maybe it was an
>>> issue with the node itself, but it doesn't seem like it. The BMC seems to
>>> be working fine.
>>>
>>> After a conductor restart, things start working again. Has anyone seen
>>> something like this?
>>>
>>> Log example:
>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils [- - - - -]
>>> Node ef5a2502-680b-4933-a0ee-6737e57ce1c5 failed deploy step {'step':
>>> 'write_image', 'priority':
>>> 80, 'argsinfo': None, 'interface': 'deploy'} with unexpected error:
>>> ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)",
>>> InvalidChunkLength(got length b'', 0 bytes read)): requests.exceptions.
>>> ChunkedEncodingError: ("Connection broken: InvalidChunkLength(got length
>>> b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils Traceback
>>> (most recent call last):
>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils   File
>>> "/usr/local/lib/python3.6/site-packages/urllib3/response.py", line 697, in
>>> _update_chunk_length
>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils
>>> self.chunk_left = int(line, 16)
>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils ValueError:
>>> invalid literal for int() with base 16: b''
>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils
>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils During
>>> handling of the above exception, another exception occurred:
>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils
>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils Traceback
>>> (most recent call last):
>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils   File
>>> "/usr/local/lib/python3.6/site-packages/urllib3/response.py", line 438, in
>>> _error_catcher
>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils     yield
>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils   File
>>> "/usr/local/lib/python3.6/site-packages/urllib3/response.py", line 764, in
>>> read_chunked
>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils
>>> self._update_chunk_length()
>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils   File
>>> "/usr/local/lib/python3.6/site-packages/urllib3/response.py", line 701, in
>>> _update_chunk_length
>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils     raise
>>> InvalidChunkLength(self, line)
>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils
>>> urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0
>>> bytes read)
>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils
>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils During
>>> handling of the above exception, another exception occurred:
>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils
>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils Traceback
>>> (most recent call last):
>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils   File
>>> "/usr/local/lib/python3.6/site-packages/requests/models.py", line 760, in
>>> generate
>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils     for
>>> chunk in self.raw.stream(chunk_size, decode_content=True):
>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils   File
>>> "/usr/local/lib/python3.6/site-packages/urllib3/response.py", line 572, in
>>> stream
>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils     for
>>> line in self.read_chunked(amt, decode_content=decode_content):
>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils   File
>>> "/usr/local/lib/python3.6/site-packages/urllib3/response.py", line 793, in
>>> read_chunked
>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils
>>> self._original_response.close()
>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils   File
>>> "/usr/lib64/python3.6/contextlib.py", line 99, in __exit__
>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils
>>> self.gen.throw(type, value, traceback)
>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils   File
>>> "/usr/local/lib/python3.6/site-packages/urllib3/response.py", line 455, in
>>> _error_catcher
>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils     raise
>>> ProtocolError("Connection broken: %r" % e, e)
>>> 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils
>>> urllib3.exceptions.ProtocolError: ("Connection broken:
>>> InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got
>>> length b'', 0 bytes r
>>> ead))
>>>
>>> On Wed, Jul 20, 2022 at 2:04 PM Wade Albright <
>>> the.wade.albright at gmail.com> wrote:
>>>
>>>> I forgot to mention, that using session auth solved the problem after
>>>> upgrading to the newer versions that include the two mentioned patches.
>>>>
>>>> On Wed, Jul 20, 2022 at 7:36 AM Wade Albright <
>>>> the.wade.albright at gmail.com> wrote:
>>>>
>>>>> Switching to session auth solved the problem, and it seems like the
>>>>> better way to go anyway for equipment that supports it. Thanks again for
>>>>> all your help!
>>>>>
>>>>> Wade
>>>>>
>>>>> On Tue, Jul 19, 2022 at 5:37 PM Julia Kreger <
>>>>> juliaashleykreger at gmail.com> wrote:
>>>>>
>>>>>> Just to provide a brief update for the mailing list. It looks like
>>>>>> this is a case of use of Basic Auth with the BMC, where we were not
>>>>>> catching the error properly... and thus not reporting the
>>>>>> authentication failure to ironic so it would catch, and initiate a new
>>>>>> client with the most up to date password. The default, typically used
>>>>>> path is Session based authentication as BMCs generally handle internal
>>>>>> session/user login tracking in a far better fashion. But not every BMC
>>>>>> supports sessions.
>>>>>>
>>>>>> Fix in review[0] :)
>>>>>>
>>>>>> -Julia
>>>>>> [0] https://review.opendev.org/c/openstack/sushy/+/850425
>>>>>>
>>>>>> On Mon, Jul 18, 2022 at 4:15 PM Julia Kreger
>>>>>> <juliaashleykreger at gmail.com> wrote:
>>>>>> >
>>>>>> > Excellent, hopefully I'll be able to figure out why Sushy is not
>>>>>> doing
>>>>>> > the needful... Or if it is and Ironic is not picking up on it.
>>>>>> >
>>>>>> > Anyway, I've posted
>>>>>> > https://review.opendev.org/c/openstack/ironic/+/850259 which might
>>>>>> > handle this issue. Obviously a work in progress, but it represents
>>>>>> > what I think is happening inside of ironic itself leading into sushy
>>>>>> > when cache access occurs.
>>>>>> >
>>>>>> > On Mon, Jul 18, 2022 at 4:04 PM Wade Albright
>>>>>> > <the.wade.albright at gmail.com> wrote:
>>>>>> > >
>>>>>> > > Sounds good, I will do that tomorrow. Thanks Julia.
>>>>>> > >
>>>>>> > > On Mon, Jul 18, 2022 at 3:27 PM Julia Kreger <
>>>>>> juliaashleykreger at gmail.com> wrote:
>>>>>> > >>
>>>>>> > >> Debug would be best. I think I have an idea what is going on,
>>>>>> and this
>>>>>> > >> is a similar variation. If you want, you can email them directly
>>>>>> to
>>>>>> > >> me. Specifically only need entries reported by the sushy library
>>>>>> and
>>>>>> > >> ironic.drivers.modules.redfish.utils.
>>>>>> > >>
>>>>>> > >> On Mon, Jul 18, 2022 at 3:20 PM Wade Albright
>>>>>> > >> <the.wade.albright at gmail.com> wrote:
>>>>>> > >> >
>>>>>> > >> > I'm happy to supply some logs, what verbosity level should i
>>>>>> use? And should I just embed the logs in email to the list or upload
>>>>>> somewhere?
>>>>>> > >> >
>>>>>> > >> > On Mon, Jul 18, 2022 at 3:14 PM Julia Kreger <
>>>>>> juliaashleykreger at gmail.com> wrote:
>>>>>> > >> >>
>>>>>> > >> >> If you could supply some conductor logs, that would be
>>>>>> helpful. It
>>>>>> > >> >> should be re-authenticating, but obviously we have a larger
>>>>>> bug there
>>>>>> > >> >> we need to find the root issue behind.
>>>>>> > >> >>
>>>>>> > >> >> On Mon, Jul 18, 2022 at 3:06 PM Wade Albright
>>>>>> > >> >> <the.wade.albright at gmail.com> wrote:
>>>>>> > >> >> >
>>>>>> > >> >> > I was able to use the patches to update the code, but
>>>>>> unfortunately the problem is still there for me.
>>>>>> > >> >> >
>>>>>> > >> >> > I also tried an RPM upgrade to the versions Julia mentioned
>>>>>> had the fixes, namely Sushy 3.12.1 - Released May 2022 and Ironic 18.2.1 -
>>>>>> Released in January 2022. But it did not fix the problem.
>>>>>> > >> >> >
>>>>>> > >> >> > I am able to consistently reproduce the error.
>>>>>> > >> >> >  - step 1: change BMC password directly on the node itself
>>>>>> > >> >> >  - step 2: update BMC password (redfish_password) in ironic
>>>>>> with 'openstack baremetal node set <nodename> --driver-info
>>>>>> redfish_password='newpass'
>>>>>> > >> >> >
>>>>>> > >> >> > After step 1 there are errors in the logs entries like
>>>>>> "Session authentication appears to have been lost at some point in time"
>>>>>> and eventually it puts the node into maintenance mode and marks the power
>>>>>> state as "none."
>>>>>> > >> >> > After step 2 and taking the host back out of maintenance
>>>>>> mode, it goes through a similar set of log entries puts the node into MM
>>>>>> again.
>>>>>> > >> >> >
>>>>>> > >> >> > After the above steps, a conductor restart fixes the
>>>>>> problem and operations work normally again. Given this it seems like there
>>>>>> is still some kind of caching issue.
>>>>>> > >> >> >
>>>>>> > >> >> > On Sat, Jul 16, 2022 at 6:01 PM Wade Albright <
>>>>>> the.wade.albright at gmail.com> wrote:
>>>>>> > >> >> >>
>>>>>> > >> >> >> Hi Julia,
>>>>>> > >> >> >>
>>>>>> > >> >> >> Thank you so much for the reply! Hopefully this is the
>>>>>> issue. I'll try out the patches next week and report back. I'll also email
>>>>>> you on Monday about the versions, that would be very helpful to know.
>>>>>> > >> >> >>
>>>>>> > >> >> >> Thanks again, really appreciate it.
>>>>>> > >> >> >>
>>>>>> > >> >> >> Wade
>>>>>> > >> >> >>
>>>>>> > >> >> >>
>>>>>> > >> >> >>
>>>>>> > >> >> >> On Sat, Jul 16, 2022 at 4:36 PM Julia Kreger <
>>>>>> juliaashleykreger at gmail.com> wrote:
>>>>>> > >> >> >>>
>>>>>> > >> >> >>> Greetings!
>>>>>> > >> >> >>>
>>>>>> > >> >> >>> I believe you need two patches, one in ironic and one in
>>>>>> sushy.
>>>>>> > >> >> >>>
>>>>>> > >> >> >>> Sushy:
>>>>>> > >> >> >>> https://review.opendev.org/c/openstack/sushy/+/832860
>>>>>> > >> >> >>>
>>>>>> > >> >> >>> Ironic:
>>>>>> > >> >> >>> https://review.opendev.org/c/openstack/ironic/+/820588
>>>>>> > >> >> >>>
>>>>>> > >> >> >>> I think it is variation, and the comment about working
>>>>>> after you restart the conductor is the big signal to me. I’m on a phone on
>>>>>> a bad data connection, if you email me on Monday I can see what versions
>>>>>> the fixes would be in.
>>>>>> > >> >> >>>
>>>>>> > >> >> >>> For the record, it is a session cache issue, the bug was
>>>>>> that the service didn’t quite know what to do when auth fails.
>>>>>> > >> >> >>>
>>>>>> > >> >> >>> -Julia
>>>>>> > >> >> >>>
>>>>>> > >> >> >>>
>>>>>> > >> >> >>> On Fri, Jul 15, 2022 at 2:55 PM Wade Albright <
>>>>>> the.wade.albright at gmail.com> wrote:
>>>>>> > >> >> >>>>
>>>>>> > >> >> >>>> Hi,
>>>>>> > >> >> >>>>
>>>>>> > >> >> >>>> I'm hitting a problem when trying to update the
>>>>>> redfish_password for an existing node. I'm curious to know if anyone else
>>>>>> has encountered this problem. I'm not sure if I'm just doing something
>>>>>> wrong or if there is a bug. Or if the problem is unique to my setup.
>>>>>> > >> >> >>>>
>>>>>> > >> >> >>>> I have a node already added into ironic with all the
>>>>>> driver details set, and things are working fine. I am able to run
>>>>>> deployments.
>>>>>> > >> >> >>>>
>>>>>> > >> >> >>>> Now I need to change the redfish password on the host.
>>>>>> So I update the password for redfish access on the host, then use an
>>>>>> 'openstack baremetal node set <node> --driver-info
>>>>>> redfish_password=<newpass>' command to set the new redfish_password.
>>>>>> > >> >> >>>>
>>>>>> > >> >> >>>> Once this has been done, deployment no longer works. I
>>>>>> see redfish authentication errors in the logs and the operation fails. I
>>>>>> waited a bit to see if there might just be a delay in updating the
>>>>>> password, but after awhile it still didn't work.
>>>>>> > >> >> >>>>
>>>>>> > >> >> >>>> I restarted the conductor, and after that things work
>>>>>> fine again. So it seems like the password is cached or something. Is there
>>>>>> a way to force the password to update? I even tried removing the redfish
>>>>>> credentials and re-adding them, but that didn't work either. Only a
>>>>>> conductor restart seems to make the new password work.
>>>>>> > >> >> >>>>
>>>>>> > >> >> >>>> We are running Xena, using rpm installation on Oracle
>>>>>> Linux 8.5.
>>>>>> > >> >> >>>>
>>>>>> > >> >> >>>> Thanks in advance for any help with this issue.
>>>>>>
>>>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.openstack.org/pipermail/openstack-discuss/attachments/20220812/0a7f034f/attachment-0001.htm>


More information about the openstack-discuss mailing list