[kolla] [2023.1] [masakari] Orphaned instances on node shutdown
Folks,

I'm running into an issue with instances appearing to be running, even though they shouldn't be.

The environment in this case is a virtualized OpenStack Kolla (2023.1) cluster, with three compute nodes and three control plane nodes. I am trying to see what happens when I create and terminate instances, in an attempt to figure out failover behavior. The three computes are in a single segment.

I created four instances:

+--------------------+--------+---------------------------+
| Name               | Status | Host                      |
+--------------------+--------+---------------------------+
| rj_testinstance_04 | ACTIVE | devtest-libvirt1-muon1005 |
| rj_testinstance_03 | ACTIVE | devtest-libvirt1-muon1004 |
| rj_testinstance_02 | ACTIVE | devtest-libvirt1-muon1006 |
| rj_testinstance    | ACTIVE | devtest-libvirt1-muon1005 |
+--------------------+--------+---------------------------+

I decided to see what would happen if I shut down devtest-libvirt1-muon1005 (which was hosting two of the instances). So I ssh'd into the machine and, as root, entered `echo c > /proc/sysrq-trigger` to force a crash, which duly occurred.

I then made sure the host was seen as down:

+---------------------------+-------+
| Hypervisor Hostname       | State |
+---------------------------+-------+
| devtest-libvirt1-muon1005 | down  |
| devtest-libvirt1-muon1004 | up    |
| devtest-libvirt1-muon1006 | up    |
+---------------------------+-------+

But even after this, the instances still appear to be up, including the two on the now-crashed machine:

+--------------------+--------+---------------------------+
| Name               | Status | Host                      |
+--------------------+--------+---------------------------+
| rj_testinstance_04 | ACTIVE | devtest-libvirt1-muon1005 |
| rj_testinstance_03 | ACTIVE | devtest-libvirt1-muon1004 |
| rj_testinstance_02 | ACTIVE | devtest-libvirt1-muon1006 |
| rj_testinstance    | ACTIVE | devtest-libvirt1-muon1005 |
+--------------------+--------+---------------------------+

I would have expected the instances to have been evacuated to other compute hosts, or at least to see some sort of error. The nodes are in the same segment, so Masakari should have at least attempted to move the instances to other computes in that segment. But checking the instance details suggests that nothing has happened. The instances are just... lost.

The Masakari host monitor log file does reflect reality:

2023-12-11 19:21:49.953 8 INFO masakarimonitors.hostmonitor.host_handler.handle_host [-] 'devtest-libvirt1-muon1005' is 'offline' (current: 'offline').

Needless to say, I'm very confused here and am not sure what's going on. Any advice would help.

Thanks,
Rob
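P.S. For anyone reproducing this: the tables above are the kind of output you get from commands like the following (the exact column selection I used may have differed slightly):

    # Hypervisor state as seen by Nova (the State column shows up/down).
    openstack hypervisor list -c "Hypervisor Hostname" -c State

    # Instance status and the compute host each instance is scheduled on.
    openstack server list --long -c Name -c Status -c Host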
As an addendum, a few things I forgot to include in the previous email:

1. I did use the HA_Enabled=True property on all the instances I created (roughly as sketched in the P.S. below).

2. I got the same result whether or not the host was set to maintenance mode on the segment.

3. When I restarted the host node, I saw the status of the instances change to SHUTOFF. Masakari made no move, AFAICT, to recover those instances.

Thanks,
Rob
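P.S. To make point 1 concrete, HA_Enabled=True was set as instance metadata at creation time, roughly like this. The <image>, <flavor>, <network> and <segment> names below are placeholders, not the actual ones from my environment:

    # Create a test instance with the metadata key Masakari's host-failure
    # workflow can filter on (this only matters when evacuate_all_instances
    # is set to False in masakari's [host_failure] config).
    openstack server create \
        --image <image> --flavor <flavor> --network <network> \
        --property HA_Enabled=True \
        rj_testinstance

    # Confirm the metadata actually landed on the instance.
    openstack server show rj_testinstance -c properties

    # For point 2: list the segment's hosts and their on_maintenance /
    # reserved flags (needs python-masakariclient installed).
    openstack segment host list <segment>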