I'm not sure why this problem only now started showing up, but it appears to be unrelated to Ironic. I was able to reproduce it directly outside of Ironic using a simple python program using urllib to get URLs from the BMC/redfish interface. Seems to be some combination of a buggy server SSL implementation and newer openssl 1.1.1. Apparently it doesn't happen using openssl 1.0. I've found some information about possible workarounds but haven't figured it out yet. If I do I'll update this thread just in case anyone else runs into it. On Fri, Aug 12, 2022 at 8:13 AM Wade Albright <the.wade.albright@gmail.com> wrote:
So I seem to have run into a new issue after upgrading to the newer versions to fix the password change issue.
Now I am randomly getting errors like the below. Once I hit this error for a given node, no operations work on the node. I thought maybe it was an issue with the node itself, but it doesn't seem like it. The BMC seems to be working fine.
After a conductor restart, things start working again. Has anyone seen something like this?
Log example: 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils [req-b6dd74da-1cc7-4c63-b58e-b7ded37007e9 - - - - -] Node ef5a2502-680b-4933-a0ee-6737e57ce1c5 failed deploy step {'step': 'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'} with unexpected error: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read)): requests.exceptions. ChunkedEncodingError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read)) 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils Traceback (most recent call last): 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils File "/usr/local/lib/python3.6/site-packages/urllib3/response.py", line 697, in _update_chunk_length 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils self.chunk_left = int(line, 16) 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils ValueError: invalid literal for int() with base 16: b'' 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils During handling of the above exception, another exception occurred: 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils Traceback (most recent call last): 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils File "/usr/local/lib/python3.6/site-packages/urllib3/response.py", line 438, in _error_catcher 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils yield 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils File "/usr/local/lib/python3.6/site-packages/urllib3/response.py", line 764, in read_chunked 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils self._update_chunk_length() 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils File "/usr/local/lib/python3.6/site-packages/urllib3/response.py", line 701, in _update_chunk_length 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils raise InvalidChunkLength(self, line) 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read) 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils During handling of the above exception, another exception occurred: 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils Traceback (most recent call last): 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils File "/usr/local/lib/python3.6/site-packages/requests/models.py", line 760, in generate 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils for chunk in self.raw.stream(chunk_size, decode_content=True): 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils File "/usr/local/lib/python3.6/site-packages/urllib3/response.py", line 572, in stream 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils for line in self.read_chunked(amt, decode_content=decode_content): 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils File "/usr/local/lib/python3.6/site-packages/urllib3/response.py", line 793, in read_chunked 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils self._original_response.close() 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils File "/usr/lib64/python3.6/contextlib.py", line 99, in __exit__ 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils self.gen.throw(type, value, traceback) 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils File "/usr/local/lib/python3.6/site-packages/urllib3/response.py", line 455, in _error_catcher 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils raise ProtocolError("Connection broken: %r" % e, e) 2022-08-12 07:45:33.227 1563371 ERROR ironic.conductor.utils urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes r ead))
On Wed, Jul 20, 2022 at 2:04 PM Wade Albright <the.wade.albright@gmail.com> wrote:
I forgot to mention, that using session auth solved the problem after upgrading to the newer versions that include the two mentioned patches.
On Wed, Jul 20, 2022 at 7:36 AM Wade Albright < the.wade.albright@gmail.com> wrote:
Switching to session auth solved the problem, and it seems like the better way to go anyway for equipment that supports it. Thanks again for all your help!
Wade
On Tue, Jul 19, 2022 at 5:37 PM Julia Kreger < juliaashleykreger@gmail.com> wrote:
Just to provide a brief update for the mailing list. It looks like this is a case of use of Basic Auth with the BMC, where we were not catching the error properly... and thus not reporting the authentication failure to ironic so it would catch, and initiate a new client with the most up to date password. The default, typically used path is Session based authentication as BMCs generally handle internal session/user login tracking in a far better fashion. But not every BMC supports sessions.
Fix in review[0] :)
-Julia [0] https://review.opendev.org/c/openstack/sushy/+/850425
On Mon, Jul 18, 2022 at 4:15 PM Julia Kreger <juliaashleykreger@gmail.com> wrote:
Excellent, hopefully I'll be able to figure out why Sushy is not doing the needful... Or if it is and Ironic is not picking up on it.
Anyway, I've posted https://review.opendev.org/c/openstack/ironic/+/850259 which might handle this issue. Obviously a work in progress, but it represents what I think is happening inside of ironic itself leading into sushy when cache access occurs.
On Mon, Jul 18, 2022 at 4:04 PM Wade Albright <the.wade.albright@gmail.com> wrote:
Sounds good, I will do that tomorrow. Thanks Julia.
On Mon, Jul 18, 2022 at 3:27 PM Julia Kreger <
> > Debug would be best. I think I have an idea what is going on, and
> is a similar variation. If you want, you can email them directly to > me. Specifically only need entries reported by the sushy library and > ironic.drivers.modules.redfish.utils. > > On Mon, Jul 18, 2022 at 3:20 PM Wade Albright > <the.wade.albright@gmail.com> wrote: > > > > I'm happy to supply some logs, what verbosity level should i use? And should I just embed the logs in email to the list or upload somewhere? > > > > On Mon, Jul 18, 2022 at 3:14 PM Julia Kreger < juliaashleykreger@gmail.com> wrote: > >> > >> If you could supply some conductor logs, that would be helpful. It > >> should be re-authenticating, but obviously we have a larger bug
> >> we need to find the root issue behind. > >> > >> On Mon, Jul 18, 2022 at 3:06 PM Wade Albright > >> <the.wade.albright@gmail.com> wrote: > >> > > >> > I was able to use the patches to update the code, but unfortunately the problem is still there for me. > >> > > >> > I also tried an RPM upgrade to the versions Julia mentioned had the fixes, namely Sushy 3.12.1 - Released May 2022 and Ironic 18.2.1 - Released in January 2022. But it did not fix the problem. > >> > > >> > I am able to consistently reproduce the error. > >> > - step 1: change BMC password directly on the node itself > >> > - step 2: update BMC password (redfish_password) in ironic with 'openstack baremetal node set <nodename> --driver-info redfish_password='newpass' > >> > > >> > After step 1 there are errors in the logs entries like "Session authentication appears to have been lost at some point in time" and eventually it puts the node into maintenance mode and marks the power state as "none." > >> > After step 2 and taking the host back out of maintenance mode, it goes through a similar set of log entries puts the node into MM again. > >> > > >> > After the above steps, a conductor restart fixes the problem and operations work normally again. Given this it seems like there is still some kind of caching issue. > >> > > >> > On Sat, Jul 16, 2022 at 6:01 PM Wade Albright <
> >> >> > >> >> Hi Julia, > >> >> > >> >> Thank you so much for the reply! Hopefully this is the issue. I'll try out the patches next week and report back. I'll also email you on Monday about the versions, that would be very helpful to know. > >> >> > >> >> Thanks again, really appreciate it. > >> >> > >> >> Wade > >> >> > >> >> > >> >> > >> >> On Sat, Jul 16, 2022 at 4:36 PM Julia Kreger < juliaashleykreger@gmail.com> wrote: > >> >>> > >> >>> Greetings! > >> >>> > >> >>> I believe you need two patches, one in ironic and one in sushy. > >> >>> > >> >>> Sushy: > >> >>> https://review.opendev.org/c/openstack/sushy/+/832860 > >> >>> > >> >>> Ironic: > >> >>> https://review.opendev.org/c/openstack/ironic/+/820588 > >> >>> > >> >>> I think it is variation, and the comment about working after you restart the conductor is the big signal to me. I’m on a phone on a bad data connection, if you email me on Monday I can see what versions
> >> >>> > >> >>> For the record, it is a session cache issue, the bug was
> >> >>> > >> >>> -Julia > >> >>> > >> >>> > >> >>> On Fri, Jul 15, 2022 at 2:55 PM Wade Albright <
juliaashleykreger@gmail.com> wrote: this there the.wade.albright@gmail.com> wrote: the fixes would be in. that the service didn’t quite know what to do when auth fails. the.wade.albright@gmail.com> wrote:
> >> >>>> > >> >>>> Hi, > >> >>>> > >> >>>> I'm hitting a problem when trying to update the redfish_password for an existing node. I'm curious to know if anyone else has encountered this problem. I'm not sure if I'm just doing something wrong or if there is a bug. Or if the problem is unique to my setup. > >> >>>> > >> >>>> I have a node already added into ironic with all the driver details set, and things are working fine. I am able to run deployments. > >> >>>> > >> >>>> Now I need to change the redfish password on the host. So I update the password for redfish access on the host, then use an 'openstack baremetal node set <node> --driver-info redfish_password=<newpass>' command to set the new redfish_password. > >> >>>> > >> >>>> Once this has been done, deployment no longer works. I see redfish authentication errors in the logs and the operation fails. I waited a bit to see if there might just be a delay in updating the password, but after awhile it still didn't work. > >> >>>> > >> >>>> I restarted the conductor, and after that things work fine again. So it seems like the password is cached or something. Is there a way to force the password to update? I even tried removing the redfish credentials and re-adding them, but that didn't work either. Only a conductor restart seems to make the new password work. > >> >>>> > >> >>>> We are running Xena, using rpm installation on Oracle Linux 8.5. > >> >>>> > >> >>>> Thanks in advance for any help with this issue.