I have a setup where each VM is assigned two vDisks: an encrypted boot volume and a separate storage volume.
The storage backend is NetApp (tripleo-netapp), with two controllers on the NetApp side working in active/active mode.
My test case goes as follows (a rough command sketch follows the list):
- I stop one of the active NetApp controllers.
- I stop one of my VMs using openstack server stop.
- I then start the VM again using openstack server start.
- The VM fails to start.
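For reference, the OpenStack side of the test is roughly the following (a sketch; the NetApp controller itself is stopped from the array's own management interface, which is not shown here):

openstack server stop vel1bgw01-MCM2
openstack server show vel1bgw01-MCM2 -c status -c OS-EXT-STS:task_state
openstack server start vel1bgw01-MCM2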
Here are my findings; I hope someone can help explain the behaviour seen below.
My VM is vel1bgw01-MCM2, running on compute node overcloud-sriovperformancecompute-3.localdomain:
[root@overcloud-controller-0 (vel1asbc01) cbis-admin]# openstack server show vel1bgw01-MCM2
+--------------------------------------+------------------------------------------------------------------------------------------------------------+
| Field | Value |
+--------------------------------------+------------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | zone1 |
| OS-EXT-SRV-ATTR:host | overcloud-sriovperformancecompute-3.localdomain |
| OS-EXT-SRV-ATTR:hypervisor_hostname | overcloud-sriovperformancecompute-3.localdomain |
| OS-EXT-SRV-ATTR:instance_name | instance-00000e93 |
| OS-EXT-STS:power_state | Running |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | active |
| OS-SRV-USG:launched_at | 2019-12-18T15:49:37.000000 |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | SBC01_MGW01_TIPC=192.168.48.22; SBC01_MGW01_DATAPATH_MATE=192.168.16.11; SBC01_MGW01_DATAPATH=192.168.32.8 |
| config_drive | True |
| created | 2019-12-18T15:49:16Z |
| flavor | SBC_MCM (asbc_mcm) |
| hostId | 7886df0f7a3d4e131304a8eb860e6a704c5fda2a7ed751b544ff2bf5 |
| id | 5c70a984-89a9-44ce-876d-9e2e568eb819 |
| image | |
| key_name | CBAM-b5fd59a066e8450ca9f104a69da5a043-Keypair |
| name | vel1bgw01-MCM2 |
| os-extended-volumes:volumes_attached | [{u'id': u'717e5744-4786-42dc-9e3e-3c5e6994c482'}, {u'id': u'd6cf0cf9-36d1-4b62-86b4-faa4a6642166'}] |
| progress | 0 |
| project_id | 41777c6f1e7b4f8d8fd76b5e0f67e5e8 |
| properties | |
| security_groups | [{u'name': u'vel1bgw01-TIPC-Security-Group'}] |
| status | ACTIVE |
| updated | 2020-01-07T17:18:32Z |
| user_id | be13deba85794016a00fec9d18c5d7cf |
+--------------------------------------+------------------------------------------------------------------------------------------------------------+
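The two IDs under os-extended-volumes:volumes_attached are the boot and storage volumes; they can be cross-checked against Cinder with something like the following (a sketch, using the same credentials as the server show above):

openstack volume show 717e5744-4786-42dc-9e3e-3c5e6994c482 -c status -c bootable -c attachments
openstack volume show d6cf0cf9-36d1-4b62-86b4-faa4a6642166 -c status -c bootable -c attachments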
It is mapped to the following vDisks (seen in the virsh dumpxml of the domain on compute-3):
<disk type='block' device='disk'>
<driver name='qemu' type='raw' cache='none' io='native'/>
<source dev='/dev/disk/by-id/dm-uuid-mpath-3600a098000d9818b0000185c5dfa0714'/>
<backingStore/>
<target dev='vda' bus='virtio'/>
<serial>717e5744-4786-42dc-9e3e-3c5e6994c482</serial>
<alias name='virtio-disk0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
</disk>
<disk type='block' device='disk'>
<driver name='qemu' type='raw' cache='none' io='native'/>
<source dev='/dev/disk/by-id/dm-uuid-mpath-3600a098000d9818b000018565dfa069e'/>
<backingStore/>
<target dev='vdb' bus='virtio'/>
<serial>d6cf0cf9-36d1-4b62-86b4-faa4a6642166</serial>
<alias name='virtio-disk1'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
</disk>
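The device-mapper entries behind those two devices are shown next. The listing matches dmsetup info output, presumably gathered on compute-3 with something like the following (the exact invocation is my assumption):

dmsetup info crypt-dm-uuid-mpath-3600a098000d9818b0000185c5dfa0714
dmsetup info mpathpy
dmsetup info crypt-dm-uuid-mpath-3600a098000d9818b000018565dfa069e
dmsetup info mpathpz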
Name: crypt-dm-uuid-mpath-3600a098000d9818b0000185c5dfa0714
State: ACTIVE
Read Ahead: 256
Tables present: LIVE
Open count: 1
Event number: 0
Major, minor: 253, 5
Number of targets: 1
UUID: CRYPT-LUKS1-769cc20bc5af469c8c9075a2a6fc4aa0-crypt-dm-uuid-mpath-3600a098000d9818b0000185c5dfa0714
Name: mpathpy
State: ACTIVE
Read Ahead: 256
Tables present: LIVE
Open count: 1
Event number: 32
Major, minor: 253, 4
Number of targets: 1
UUID: mpath-3600a098000d9818b0000185c5dfa0714
Name: crypt-dm-uuid-mpath-3600a098000d9818b000018565dfa069e
State: ACTIVE
Read Ahead: 256
Tables present: LIVE
Open count: 1
Event number: 0
Major, minor: 253, 7
Number of targets: 1
UUID: CRYPT-LUKS1-4015c585a0df4074821ca312c4caacca-crypt-dm-uuid-mpath-3600a098000d9818b000018565dfa069e
Name: mpathpz
State: ACTIVE
Read Ahead: 256
Tables present: LIVE
Open count: 1
Event number: 28
Major, minor: 253, 6
Number of targets: 1
UUID: mpath-3600a098000d9818b000018565dfa069e
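The stacking (LUKS crypt target on top of the multipath map) can be confirmed from the device-mapper dependency tree, for example (a sketch; device names are taken from the output above):

dmsetup ls --tree
lsblk /dev/dm-4 /dev/dm-6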
This means the boot volume is backed by dm-4 (mpathpy) while the storage volume is backed by dm-6 (mpathpz).
Dumping the multipath daemon state on the compute node shows that, at a steady running state, both DMs are accounted for (see below).
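The multipathd> prompts below come from the daemon's interactive shell; assuming a standard multipath-tools installation, it is entered with:

multipathd -k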
multipathd> show maps
name sysfs uuid
mpathpy dm-4 3600a098000d9818b0000185c5dfa0714
mpathpz dm-6 3600a098000d9818b000018565dfa069e
mpathqi dm-12 3600a098000d9818b000018df5dfafd40
mpathqj dm-13 3600a098000d9818b000018de5dfafd10
mpathpw dm-0 3600a098000d9818b000018425dfa059f
mpathpx dm-1 3600a098000d9818b0000184c5dfa05fc
mpathqk dm-16 3600a098000d9818b000018eb5dfafe80
mpathql dm-17 3600a098000d9818b000018e95dfafe26
mpathqh dm-9 3600a098000d9818b000018c65dfafa91
These vDisks are mapped to the following multipath devices:
multipathd> show topology
mpathpy (3600a098000d9818b0000185c5dfa0714) dm-4 NETAPP ,INF-01-00
size=21G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 rdac' wp=rw
|-+- policy='service-time 0' prio=14 status=active
| |- 30:0:0:82 sdm 8:192 active ready running
| `- 32:0:0:82 sdk 8:160 active ready running
`-+- policy='service-time 0' prio=0 status=enabled
|- 33:0:0:82 sdn 8:208 failed faulty running
`- 31:0:0:82 sdl 8:176 failed faulty running
mpathpz (3600a098000d9818b000018565dfa069e) dm-6 NETAPP ,INF-01-00
size=10G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 rdac' wp=rw
|-+- policy='service-time 0' prio=14 status=active
| |- 30:0:0:229 sdr 65:16 active ready running
| `- 32:0:0:229 sdp 8:240 active ready running
`-+- policy='service-time 0' prio=0 status=enabled
|- 31:0:0:229 sdo 8:224 failed faulty running
`- 33:0:0:229 sdq 65:0 failed faulty running
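To confirm which NetApp controller each of the four paths per map actually traverses, the SCSI transport address of every sd device can be listed, for example (a sketch; lsscsi comes from the lsscsi package and may need to be installed):

lsscsi -t
# and, inside the multipathd interactive shell, per-path state and priority:
# multipathd> show paths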
Now it starts getting very interesting: if I shut down controller A on the NetApp side, dm-4 disappears entirely, but dm-6 keeps running and fails over correctly, with the active path group now going through controller B while the standby path group through controller A is displayed as failed.
multipathd> show topology
mpathpz (3600a098000d9818b000018565dfa069e) dm-6 NETAPP ,INF-01-00
size=10G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 rdac' wp=rw
|-+- policy='service-time 0' prio=0 status=enabled
| |- 30:0:0:229 sdr 65:16 failed faulty running
| `- 32:0:0:229 sdp 8:240 failed faulty running
`-+- policy='service-time 0' prio=11 status=active
|- 31:0:0:229 sdo 8:224 active ready running
`- 33:0:0:229 sdq 65:0 active ready running
multipathd> show maps
name sysfs uuid
mpathpz dm-6 3600a098000d9818b000018565dfa069e
mpathqi dm-12 3600a098000d9818b000018df5dfafd40
mpathqj dm-13 3600a098000d9818b000018de5dfafd10
mpathpw dm-0 3600a098000d9818b000018425dfa059f
mpathpx dm-1 3600a098000d9818b0000184c5dfa05fc
mpathqk dm-16 3600a098000d9818b000018eb5dfafe80
mpathql dm-17 3600a098000d9818b000018e95dfafe26
mpathqg dm-8 3600a098000d9818b000018c75dfafac0
mpathqh dm-9 3600a098000d9818b000018c65dfafa91
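When dm-4 disappears like this, it is worth checking whether only the map was flushed while the WWID is still known to the host (a sketch; commands run on the compute node, WWID taken from the earlier output):

multipath -ll 3600a098000d9818b0000185c5dfa0714
ls -l /dev/disk/by-id/ | grep 3600a098000d9818b0000185c5dfa0714
journalctl -u multipathd | grep -i -e mpathpy -e 0714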
If instead I restore controller A into service on the NetApp side and then fail only the paths towards controller A from within multipathd, everything works fine: dm-4 is still present and the VM can be put back into service.
multipathd> fail path sdk
ok
multipathd>
multipathd> fail path sdm
ok
mpathpy (3600a098000d9818b0000185c5dfa0714) dm-4 NETAPP ,INF-01-00
size=21G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 rdac' wp=rw
|-+- policy='service-time 0' prio=0 status=enabled
| |- 32:0:0:82 sdk 8:160 failed faulty running
| `- 30:0:0:82 sdm 8:192 failed faulty running
`-+- policy='service-time 0' prio=9 status=active
|- 31:0:0:82 sdl 8:176 active ready running
`- 33:0:0:82 sdn 8:208 active ready running
multipathd> reinstate path sdk
ok
multipathd>
multipathd> reinstate path sdm
ok
mpathpy (3600a098000d9818b0000185c5dfa0714) dm-4 NETAPP ,INF-01-00
size=21G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 rdac' wp=rw
|-+- policy='service-time 0' prio=14 status=active
| |- 32:0:0:82 sdk 8:160 active ready running
| `- 30:0:0:82 sdm 8:192 active ready running
`-+- policy='service-time 0' prio=9 status=enabled
|- 31:0:0:82 sdl 8:176 active ready running
`- 33:0:0:82 sdn 8:208 active ready running
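For completeness, after controller A is back in service I would expect the vanished map to be re-creatable by rescanning the SCSI hosts and reloading the multipath maps, roughly as follows (a sketch only, not verified on this setup; host numbers 30-33 are taken from the topology output):

for h in 30 31 32 33; do echo "- - -" > /sys/class/scsi_host/host${h}/scan; done
multipath -r
multipath -ll 3600a098000d9818b0000185c5dfa0714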
In the working case it is observed that, when the VM is stopped, the storage volume disappears (which seems normal), the instance vanishes completely from virsh list, and no trace of it can be found at the KVM level with ps -ef | grep <instance_ID>. However, the boot volume is always still present in the multipathd records when we stop the VM under normal conditions, i.e. without stopping a NetApp controller.
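For the failing start itself, the concrete error should be visible in the nova-compute log on compute-3 (the log path below is an assumption; it depends on whether nova-compute runs containerized):

grep -i -e 5c70a984-89a9-44ce-876d-9e2e568eb819 -e multipath /var/log/containers/nova/nova-compute.log
# on non-containerized nodes the log is typically /var/log/nova/nova-compute.log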
Any ideas?