As an addendum, a few things I forgot to include in the previous email:

1. I did set the HA_Enabled=True property on all the instances I created.
2. I got the same result whether or not the host was set to maintenance mode on the segment.
3. When I restarted the host node, I saw the status of the instances change to SHUTOFF. Masakari made no move, AFAICT, to regenerate those instances.

Thanks,
Rob

On Mon, Dec 11, 2023 at 3:20 PM Rob Jefferson <techstep@gmail.com> wrote:
Folks,
I'm running into an issue where instances appear to be running even though they shouldn't be.
The environment in this case is a virtualized OpenStack Kolla (2023.1) cluster, with three compute nodes and three control plane nodes. I am creating and terminating instances to figure out the failover behavior. The three computes are in a single Masakari segment.
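For context, the segment and hosts were registered along these lines (the segment name and the 'auto' recovery method below are stand-ins for my actual values, and I'm writing the argument order from memory, so treat this as approximate):

    openstack segment create test-segment auto COMPUTE
    openstack segment host create devtest-libvirt1-muon1004 COMPUTE SSH test-segment
    openstack segment host create devtest-libvirt1-muon1005 COMPUTE SSH test-segment
    openstack segment host create devtest-libvirt1-muon1006 COMPUTE SSH test-segment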
I created four instances:
+--------------------+--------+---------------------------+
| Name               | Status | Host                      |
+--------------------+--------+---------------------------+
| rj_testinstance_04 | ACTIVE | devtest-libvirt1-muon1005 |
| rj_testinstance_03 | ACTIVE | devtest-libvirt1-muon1004 |
| rj_testinstance_02 | ACTIVE | devtest-libvirt1-muon1006 |
| rj_testinstance    | ACTIVE | devtest-libvirt1-muon1005 |
+--------------------+--------+---------------------------+
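(The server listings in this email are trimmed from something like `openstack server list --long -c Name -c Status -c Host`; the full --long output has many more columns.)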
I decided to see what would happen if I went in and shut down devtest-libvirt1-muon1005 (which had two instances running). So I ssh'd into the machine and entered (as root): `echo c > /proc/sysrq-trigger` to force a crash, which duly occurred.
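(Side note for anyone reproducing this: on distros that restrict sysrq by default, the crash trigger needs to be enabled first, e.g. `echo 1 > /proc/sys/kernel/sysrq` before the `echo c`.)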
I then made sure the host was seen as down:
+---------------------------+-------+
| Hypervisor Hostname       | State |
+---------------------------+-------+
| devtest-libvirt1-muon1005 | down  |
| devtest-libvirt1-muon1004 | up    |
| devtest-libvirt1-muon1006 | up    |
+---------------------------+-------+
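(That table is `openstack hypervisor list`, trimmed to the hostname and state columns with -c.)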
But even after this, Nova still lists all four instances as ACTIVE, including the two on the now-crashed machine:
+--------------------+--------+---------------------------+
| Name               | Status | Host                      |
+--------------------+--------+---------------------------+
| rj_testinstance_04 | ACTIVE | devtest-libvirt1-muon1005 |
| rj_testinstance_03 | ACTIVE | devtest-libvirt1-muon1004 |
| rj_testinstance_02 | ACTIVE | devtest-libvirt1-muon1006 |
| rj_testinstance    | ACTIVE | devtest-libvirt1-muon1005 |
+--------------------+--------+---------------------------+
I would have expected the instances to be evacuated to other compute hosts, or failing that, some sort of error. The hosts are in the same segment, so Masakari should at least have attempted to fail the instances over to the other computes in the segment. But checking the instance details suggests that nothing has happened. The instances are just... lost.
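By "checking the instance details" I mean, for example:

    openstack server show rj_testinstance -c status -c OS-EXT-SRV-ATTR:host

which still reports the instance as ACTIVE on devtest-libvirt1-muon1005.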
The Masakari host monitor log file reflects reality:
2023-12-11 19:21:49.953 8 INFO masakarimonitors.hostmonitor.host_handler.handle_host [-] 'devtest-libvirt1-muon1005' is 'offline' (current: 'offline').
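If I understand the pipeline correctly, at this point the host monitor should be sending a COMPUTE_HOST notification to the masakari API (the kind of thing `openstack notification list` would show), and the masakari engine should then run the recovery workflow to evacuate the instances. I can't tell where along that chain things are stopping.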
Needless to say, I'm very confused here and am not sure what's going on. Any advice would help.
Thanks,
Rob