Another workaround is:

1. Power off the instance relying on Cinder local storage.
2. Upgrade the compute nodes.
3. Start the instance.

On Fri, Aug 16, 2024 at 3:37 PM Vahideh Alinouri <vahideh.alinouri@gmail.com> wrote:
Hi Everyone,
I'm upgrading an OpenStack deployment that was set up using Kolla-Ansible. I'm currently on Kolla-Ansible 14.8 (Yoga) and am trying to upgrade to 14.11 (the latest Yoga release), with the goal of eventually upgrading to the Zed release.
During the upgrade process, I encountered an issue on the compute node that uses the Cinder LVM backend, which the instances rely on. Restarting the tgtd container caused the connection between the iscsid container and tgtd to be lost. In the nova-compute logs, I found the following error:

  Stderr: 'blockdev: cannot open /dev/sdb: No such device or address\n'
Because of this, the instance transitions to an error state.
I also observed these errors in the tgtd container logs:

  login failed ISCSI_CONN_STATE_IN_LOGIN/R_STAGE_SESSION_REOPEN
  Kernel reported iSCSI connection 6:0 error (1019 - ISCSI_ERR_XMIT_FAILED: Transmission of iSCSI packet failed) state (2)
When I check the iSCSI session with the following command, it shows that the session is established:

  iscsiadm --mode session
However, when I run the following command on the compute host (which handles the Cinder LVM backend via iSCSI), no output is returned:

  tgtadm --lld iscsi --op show --mode target
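As far as I understand, tgtd does not keep its targets across a restart on its own: with target_helper = tgtadm, Cinder persists per-target definition files under its volumes_dir (/var/lib/cinder/volumes by default), and tgt is supposed to pick them up via an include in its targets.conf. Assuming that default layout and Kolla's container name tgtd, re-applying those persisted definitions might repopulate the target list:

  # re-read and apply all target definitions from the tgt config includes
  docker exec tgtd tgt-admin --update ALL -v
  # then check whether the targets are back
  tgtadm --lld iscsi --op show --mode target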
I tried logging out and logging back into the session, but the logout operation failed!
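(The logout and login were roughly of this standard form; the IQN and portal here are placeholders:)

  iscsiadm --mode node --targetname <target-iqn> --portal <portal-ip>:3260 --logout
  iscsiadm --mode node --targetname <target-iqn> --portal <portal-ip>:3260 --login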
Here is my cinder.conf file on the compute host:

  [os-compXXXX-nvme]
  volume_group = vgnvme
  volume_driver = cinder.volume.drivers.lvm.LVMVolumeDriver
  volume_backend_name = os-compXXXX-nvme
  target_helper = tgtadm
  target_protocol = iscsi
  lvm_mirrors = 0
  lvm_max_over_subscription_ratio = 1.0
The operating system on the compute node is Ubuntu 20.04.
After further investigation, I found a temporary workaround to bring the instance back to an active, running state (a rough command sketch follows the list):

1. Power off the compute node.
2. Evacuate the node (it gets stuck in rebuild, but that's okay).
3. Power on the compute node.
4. Reset the instance state to "active".
5. Check the volume status and update it if it's not in the "in-use" state.
6. Perform a hard reboot of the instance.
7. Restart the tgtd container.
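Roughly, in commands (the IDs and host name are placeholders, steps 1 and 3 are done out of band via IPMI/console, and the container name tgtd comes from Kolla):

  # 2. evacuate the node (admin credentials)
  nova host-evacuate <compute-host>
  # 4. reset the instance state to active
  openstack server set --state active <instance-uuid>
  # 5. reset the volume status if it is not already "in-use"
  openstack volume set --state in-use <volume-uuid>
  # 6. hard reboot the instance
  openstack server reboot --hard <instance-uuid>
  # 7. restart the tgtd container on the compute node
  docker restart tgtd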
However, this is only a temporary solution. When I attempt to upgrade to the next version, I encounter the same issue again.
Is there anything specific I should consider? Could this be a bug?
I would greatly appreciate your help.