[kolla] [2023.1] [masakari] Orphaned instances on node shutdown
Folks,

I'm running into an issue with instances appearing to be running, even though they shouldn't be.

The environment in this case is a virtualized OpenStack Kolla (2023.1) cluster, with three compute nodes and three control plane nodes. I am trying to see what happens when I create and terminate instances, in an attempt to figure out failover behavior. The three computes are in a single segment.

I created four instances:

+--------------------+--------+---------------------------+
| Name               | Status | Host                      |
+--------------------+--------+---------------------------+
| rj_testinstance_04 | ACTIVE | devtest-libvirt1-muon1005 |
| rj_testinstance_03 | ACTIVE | devtest-libvirt1-muon1004 |
| rj_testinstance_02 | ACTIVE | devtest-libvirt1-muon1006 |
| rj_testinstance    | ACTIVE | devtest-libvirt1-muon1005 |
+--------------------+--------+---------------------------+

I decided to see what would happen if I shut down devtest-libvirt1-muon1005 (which was hosting two of the instances). So I ssh'd into the machine and, as root, entered `echo c > /proc/sysrq-trigger` to force a crash, which duly occurred.

I then made sure the host was seen as down:

+---------------------------+-------+
| Hypervisor Hostname       | State |
+---------------------------+-------+
| devtest-libvirt1-muon1005 | down  |
| devtest-libvirt1-muon1004 | up    |
| devtest-libvirt1-muon1006 | up    |
+---------------------------+-------+

But even after this, the instances still appear to be up, including the two on the now-crashed machine:

+--------------------+--------+---------------------------+
| Name               | Status | Host                      |
+--------------------+--------+---------------------------+
| rj_testinstance_04 | ACTIVE | devtest-libvirt1-muon1005 |
| rj_testinstance_03 | ACTIVE | devtest-libvirt1-muon1004 |
| rj_testinstance_02 | ACTIVE | devtest-libvirt1-muon1006 |
| rj_testinstance    | ACTIVE | devtest-libvirt1-muon1005 |
+--------------------+--------+---------------------------+

I would have expected the instances to have been evacuated to other compute hosts, or at least to see some sort of error. The nodes are in the same segment, so Masakari should have at least attempted to move the instances to other computes in that segment. But checking the instance details suggests that nothing has happened. The instances are just... lost.

The Masakari host monitor log file does reflect reality:

2023-12-11 19:21:49.953 8 INFO masakarimonitors.hostmonitor.host_handler.handle_host [-] 'devtest-libvirt1-muon1005' is 'offline' (current: 'offline').

Needless to say, I'm very confused here and am not sure what's going on. Any advice would help.

Thanks,
Rob
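P.S. For anyone reproducing this: the tables above are the kind of output you get from commands like the following (the exact column selection I used may have differed slightly):

    # Hypervisor state as seen by Nova (the State column shows up/down).
    openstack hypervisor list -c "Hypervisor Hostname" -c State

    # Instance status and the compute host each instance is scheduled on.
    openstack server list --long -c Name -c Status -c Host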
As an addendum, a few things I forgot to include in the previous email:

1. I did use the HA_Enabled=True property on all the instances I created (roughly as sketched in the P.S. below).

2. I got the same result whether or not the host was set to maintenance mode on the segment.

3. When I restarted the host node, I saw the status of the instances change to SHUTOFF. Masakari made no move, AFAICT, to recover those instances.

Thanks,
Rob
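P.S. To make point 1 concrete, HA_Enabled=True was set as instance metadata at creation time, roughly like this. The <image>, <flavor>, <network> and <segment> names below are placeholders, not the actual ones from my environment:

    # Create a test instance with the metadata key Masakari's host-failure
    # workflow can filter on (this only matters when evacuate_all_instances
    # is set to False in masakari's [host_failure] config).
    openstack server create \
        --image <image> --flavor <flavor> --network <network> \
        --property HA_Enabled=True \
        rj_testinstance

    # Confirm the metadata actually landed on the instance.
    openstack server show rj_testinstance -c properties

    # For point 2: list the segment's hosts and their on_maintenance /
    # reserved flags (needs python-masakariclient installed).
    openstack segment host list <segment>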