DCN compute service goes down when an instance is scheduled to launch | wallaby | tripleo
Hi,
I currently have 3 compute nodes on edge site1 where I am trying to launch VMs. When a VM is in the spawning state, the node goes down (openstack compute service list); the node comes back up when I restart the nova-compute service, but the launch of the VM then fails.

nova-compute.log:

2023-02-26 08:15:51.808 7 INFO nova.compute.manager [req-bc0f5f2e-53fc-4dae-b1da-82f1f972d617 - - - - -] Running instance usage audit for host dcn01-hci-0.bdxworld.com from 2023-02-26 07:00:00 to 2023-02-26 08:00:00. 0 instances.
2023-02-26 08:49:52.813 7 INFO nova.compute.claims [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] [instance: 0c62c1ef-9010-417d-a05f-4db77e901600] Claim successful on node dcn01-hci-0.bdxworld.com
2023-02-26 08:49:54.225 7 INFO nova.virt.libvirt.driver [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] [instance: 0c62c1ef-9010-417d-a05f-4db77e901600] Ignoring supplied device name: /dev/vda. Libvirt can't honour user-supplied dev names
2023-02-26 08:49:54.398 7 INFO nova.virt.block_device [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] [instance: 0c62c1ef-9010-417d-a05f-4db77e901600] Booting with volume c4bd7885-5973-4860-bbe6-7a2f726baeee at /dev/vda
2023-02-26 08:49:55.216 7 WARNING nova.cache_utils [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Cache enabled with backend dogpile.cache.null.
2023-02-26 08:49:55.283 7 INFO oslo.privsep.daemon [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Running privsep helper: ['sudo', 'nova-rootwrap', '/etc/nova/rootwrap.conf', 'privsep-helper', '--config-file', '/etc/nova/nova.conf', '--config-file', '/etc/nova/nova-compute.conf', '--privsep_context', 'os_brick.privileged.default', '--privsep_sock_path', '/tmp/tmpin40tah6/privsep.sock']
2023-02-26 08:49:55.791 7 INFO oslo.privsep.daemon [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Spawned new privsep daemon via rootwrap
2023-02-26 08:49:55.717 2647 INFO oslo.privsep.daemon [-] privsep daemon starting
2023-02-26 08:49:55.722 2647 INFO oslo.privsep.daemon [-] privsep process running with uid/gid: 0/0
2023-02-26 08:49:55.726 2647 INFO oslo.privsep.daemon [-] privsep process running with capabilities (eff/prm/inh): CAP_SYS_ADMIN/CAP_SYS_ADMIN/none
2023-02-26 08:49:55.726 2647 INFO oslo.privsep.daemon [-] privsep daemon running as pid 2647
2023-02-26 08:49:55.956 7 WARNING os_brick.initiator.connectors.nvmeof [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Process execution error in _get_host_uuid: Unexpected error while running command. Command: blkid overlay -s UUID -o value Exit code: 2 Stdout: '' Stderr: '': oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
2023-02-26 08:49:58.247 7 INFO nova.virt.libvirt.driver [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] [instance: 0c62c1ef-9010-417d-a05f-4db77e901600] Creating image

Is there a way to solve this issue?

With regards,
Swogat Pradhan
Hi,
Please find the nova-conductor as well as the nova-api log.

nova-conductor:

2023-02-26 08:45:01.108 31 WARNING oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to 16152921c1eb45c2b1f562087140168b
2023-02-26 08:45:02.144 26 WARNING oslo_messaging._drivers.amqpdriver [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] reply_276049ec36a84486a8a406911d9802f4 doesn't exist, drop reply to 83dbe5f567a940b698acfe986f6194fa
2023-02-26 08:45:02.314 32 WARNING oslo_messaging._drivers.amqpdriver [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] reply_276049ec36a84486a8a406911d9802f4 doesn't exist, drop reply to f3bfd7f65bd542b18d84cea3033abb43: oslo_messaging.exceptions.MessageUndeliverable
2023-02-26 08:45:02.316 32 ERROR oslo_messaging._drivers.amqpdriver [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] The reply f3bfd7f65bd542b18d84cea3033abb43 failed to send after 60 seconds due to a missing queue (reply_276049ec36a84486a8a406911d9802f4). Abandoning...: oslo_messaging.exceptions.MessageUndeliverable
2023-02-26 08:48:01.282 35 WARNING oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to d4b9180f91a94f9a82c3c9c4b7595566: oslo_messaging.exceptions.MessageUndeliverable
2023-02-26 08:48:01.284 35 ERROR oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply d4b9180f91a94f9a82c3c9c4b7595566 failed to send after 60 seconds due to a missing queue (reply_349bcb075f8c49329435a0f884b33066). Abandoning...: oslo_messaging.exceptions.MessageUndeliverable
2023-02-26 08:49:01.303 33 WARNING oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to 897911a234a445d8a0d8af02ece40f6f: oslo_messaging.exceptions.MessageUndeliverable
2023-02-26 08:49:01.304 33 ERROR oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply 897911a234a445d8a0d8af02ece40f6f failed to send after 60 seconds due to a missing queue (reply_349bcb075f8c49329435a0f884b33066). Abandoning...: oslo_messaging.exceptions.MessageUndeliverable
2023-02-26 08:49:52.254 31 WARNING nova.cache_utils [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Cache enabled with backend dogpile.cache.null.
2023-02-26 08:50:01.264 27 WARNING oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to 8f723ceb10c3472db9a9f324861df2bb: oslo_messaging.exceptions.MessageUndeliverable
2023-02-26 08:50:01.266 27 ERROR oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply 8f723ceb10c3472db9a9f324861df2bb failed to send after 60 seconds due to a missing queue (reply_349bcb075f8c49329435a0f884b33066). Abandoning...: oslo_messaging.exceptions.MessageUndeliverable

With regards,
Swogat Pradhan
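For context: the "missing queue" in these MessageUndeliverable errors is the temporary RPC reply queue that the calling service (likely nova-compute on the edge node in this case) declares on the broker it is connected to; by the time the conductor tries to answer, 60 seconds later, that queue is gone. A quick way to see whether the reply queue exists at all while the problem is happening might be the sketch below, assuming the default "/" vhost and the usual TripleO container name, both of which may differ in your deployment:

# run on a controller; the queue name is the one from the error above
sudo podman exec rabbitmq-bundle-podman-0 \
    rabbitmqctl list_queues -p / name consumers | grep reply_349bcb075f8c49329435a0f884b33066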
Hi,
This seems to be a RabbitMQ issue. Can you add some rabbit logs as well? Do you have a RabbitMQ cluster consisting of multiple hosts (HA) or a single control node?
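For example, something like the following on one of the control nodes should show the cluster membership and the broker logs (a sketch assuming the usual TripleO containerized layout; the container name and log path may differ in your deployment):

# cluster membership, partitions and alarms
sudo podman exec rabbitmq-bundle-podman-0 rabbitmqctl cluster_status

# broker logs as written to the host
sudo ls /var/log/containers/rabbitmq/
sudo tail -n 200 /var/log/containers/rabbitmq/<node-name>.log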
Hi Eugen,
Thanks for your response. I actually have a 4-controller setup, so here are the details:

*PCS Status:*

* Container bundle set: rabbitmq-bundle [172.25.201.68:8787/tripleomaster/openstack-rabbitmq:pcmklatest]:
  * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-no-ceph-3
  * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-2
  * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-1
  * rabbitmq-bundle-3 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-0

I have tried restarting the bundle multiple times, but the issue is still present.

*Cluster status:*

[root@overcloud-controller-0 /]# rabbitmqctl cluster_status
Cluster status of node rabbit@overcloud-controller-0.internalapi.bdxworld.com ...

Basics

Cluster name: rabbit@overcloud-controller-no-ceph-3.bdxworld.com

Disk Nodes

rabbit@overcloud-controller-0.internalapi.bdxworld.com
rabbit@overcloud-controller-1.internalapi.bdxworld.com
rabbit@overcloud-controller-2.internalapi.bdxworld.com
rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com

Running Nodes

rabbit@overcloud-controller-0.internalapi.bdxworld.com
rabbit@overcloud-controller-1.internalapi.bdxworld.com
rabbit@overcloud-controller-2.internalapi.bdxworld.com
rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com

Versions

rabbit@overcloud-controller-0.internalapi.bdxworld.com: RabbitMQ 3.8.3 on Erlang 22.3.4.1
rabbit@overcloud-controller-1.internalapi.bdxworld.com: RabbitMQ 3.8.3 on Erlang 22.3.4.1
rabbit@overcloud-controller-2.internalapi.bdxworld.com: RabbitMQ 3.8.3 on Erlang 22.3.4.1
rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com: RabbitMQ 3.8.3 on Erlang 22.3.4.1

Alarms

(none)

Network Partitions

(none)

Listeners

Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, interface: 172.25.201.212, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, interface: 172.25.201.205, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, interface: 172.25.201.201, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com, interface: 172.25.201.209, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com, interface: [::], port: 15672, protocol: http, purpose: HTTP API

Feature flags

Flag: drop_unroutable_metric, state: enabled
Flag: empty_basic_get_metric, state: enabled
Flag: implicit_default_bindings, state: enabled
Flag: quorum_queue, state: enabled
Flag: virtual_host_metadata, state: enabled

*Logs:* *(Attached)*

With regards,
Swogat Pradhan
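Since the edge computes at dcn01 have to reach these AMQP listeners on the central internalapi network, it may also be worth confirming basic reachability of port 5672 from one of the edge nodes while a VM is spawning. A minimal sketch (substitute the listener addresses from the output above if yours differ):

# run on dcn01-hci-0: can we open the AMQP port on every broker?
for ip in 172.25.201.212 172.25.201.205 172.25.201.201 172.25.201.209; do
    timeout 5 bash -c "</dev/tcp/$ip/5672" && echo "$ip:5672 reachable" || echo "$ip:5672 NOT reachable"
done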
The logs are not attached, actually. Can you use something like https://paste.openstack.org/ or pastebin to upload your logs and then paste the link here? Since they can be quite verbose, please make sure that you only upload logs from the time of the failure.
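Something like the following can trim a log to the failure window before uploading (just a sketch; adjust the timestamps to your failure, and note the path shown is the usual TripleO location for containerized services, which may differ):

sudo sed -n '/^2023-02-26 08:45/,/^2023-02-26 08:55/p' \
    /var/log/containers/nova/nova-compute.log > nova-compute-snippet.log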
Hi Eugen,
For some reason I am not getting your emails directly; I am checking the email digest and there I am able to find your replies. Here is the log for download: https://we.tl/t-L8FEkGZFSq
Yes, these logs are from the time when the issue occurred.

*Note: I am able to create VMs and perform other activities in the central site; I am only facing this issue in the edge site.*

With regards,
Swogat Pradhan
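Since only the edge site is affected, it might also be worth confirming on the edge node which transport_url nova-compute actually uses (it should point at the central RabbitMQ cluster) and what the round-trip time to those brokers looks like. A sketch, assuming the usual TripleO path for the containerized nova-compute config, which may differ:

sudo grep ^transport_url /var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf
ping -c 10 172.25.201.212    # repeat for the other internalapi addresses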
I'm not familiar with TripleO, so I'm not sure how much help I can be here; maybe someone else can chime in. I would look for network and rabbit issues.

Are the control nodes heavily loaded? Do you see the compute services from the edge site up all the time? If you run a 'watch -n 20 openstack compute service list', do they "flap" all the time or only when you launch instances?

Maybe rabbitmq needs some tweaking? Can you show your policies?

rabbitmqctl list_policies -p <VHOST>

What network connection do they have, is the network saturated? Is it different on the edge site compared to the central site?
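Put together, something like the following might show whether the edge service really flaps and how the brokers are configured and loaded while an instance spawns (a sketch; the "/" vhost and the container name are assumptions and may differ in your deployment):

# from a node with admin credentials: does the edge nova-compute flap?
watch -n 20 'openstack compute service list --service nova-compute'

# on a controller: current policies (HA / expiry settings)
sudo podman exec rabbitmq-bundle-podman-0 rabbitmqctl list_policies -p /

# on a controller: queue backlog and consumer counts during the spawn
sudo podman exec rabbitmq-bundle-podman-0 rabbitmqctl list_queues -p / name messages consumers | sort -k2 -rn | head -20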
Hi Eugen, For some reason i am not getting your email to me directly, i am checking the email digest and there i am able to find your reply. Here is the log for download: https://we.tl/t-L8FEkGZFSq Yes, these logs are from the time when the issue occurred.
*Note: i am able to create vm's and perform other activities in the central site, only facing this issue in the edge site.*
With regards, Swogat Pradhan
On Mon, Feb 27, 2023 at 5:12 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi Eugen, Thanks for your response. I have actually a 4 controller setup so here are the details:
*PCS Status:* * Container bundle set: rabbitmq-bundle [ 172.25.201.68:8787/tripleomaster/openstack-rabbitmq:pcmklatest]: * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-no-ceph-3 * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-2 * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-1 * rabbitmq-bundle-3 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-0
I have tried restarting the bundle multiple times but the issue is still present.
*Cluster status:* [root@overcloud-controller-0 /]# rabbitmqctl cluster_status Cluster status of node rabbit@overcloud-controller-0.internalapi.bdxworld.com ... Basics
Cluster name: rabbit@overcloud-controller-no-ceph-3.bdxworld.com
Disk Nodes
rabbit@overcloud-controller-0.internalapi.bdxworld.com rabbit@overcloud-controller-1.internalapi.bdxworld.com rabbit@overcloud-controller-2.internalapi.bdxworld.com rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com
Running Nodes
rabbit@overcloud-controller-0.internalapi.bdxworld.com rabbit@overcloud-controller-1.internalapi.bdxworld.com rabbit@overcloud-controller-2.internalapi.bdxworld.com rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com
Versions
rabbit@overcloud-controller-0.internalapi.bdxworld.com: RabbitMQ 3.8.3 on Erlang 22.3.4.1 rabbit@overcloud-controller-1.internalapi.bdxworld.com: RabbitMQ 3.8.3 on Erlang 22.3.4.1 rabbit@overcloud-controller-2.internalapi.bdxworld.com: RabbitMQ 3.8.3 on Erlang 22.3.4.1 rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com: RabbitMQ 3.8.3 on Erlang 22.3.4.1
Alarms
(none)
Network Partitions
(none)
Listeners
Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, interface: 172.25.201.212, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0 Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, interface: [::], port: 15672, protocol: http, purpose: HTTP API Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, interface: 172.25.201.205, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0 Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, interface: [::], port: 15672, protocol: http, purpose: HTTP API Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, interface: 172.25.201.201, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0 Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, interface: [::], port: 15672, protocol: http, purpose: HTTP API Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com, interface: 172.25.201.209, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0 Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Feature flags
Flag: drop_unroutable_metric, state: enabled Flag: empty_basic_get_metric, state: enabled Flag: implicit_default_bindings, state: enabled Flag: quorum_queue, state: enabled Flag: virtual_host_metadata, state: enabled
*Logs:* *(Attached)*
With regards, Swogat Pradhan
On Sun, Feb 26, 2023 at 2:34 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi, Please find the nova conductor as well as nova api log.
nova-conuctor:
2023-02-26 08:45:01.108 31 WARNING oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to 16152921c1eb45c2b1f562087140168b 2023-02-26 08:45:02.144 26 WARNING oslo_messaging._drivers.amqpdriver [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] reply_276049ec36a84486a8a406911d9802f4 doesn't exist, drop reply to 83dbe5f567a940b698acfe986f6194fa 2023-02-26 08:45:02.314 32 WARNING oslo_messaging._drivers.amqpdriver [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] reply_276049ec36a84486a8a406911d9802f4 doesn't exist, drop reply to f3bfd7f65bd542b18d84cea3033abb43: oslo_messaging.exceptions.MessageUndeliverable 2023-02-26 08:45:02.316 32 ERROR oslo_messaging._drivers.amqpdriver [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] The reply f3bfd7f65bd542b18d84cea3033abb43 failed to send after 60 seconds due to a missing queue (reply_276049ec36a84486a8a406911d9802f4). Abandoning...: oslo_messaging.exceptions.MessageUndeliverable 2023-02-26 08:48:01.282 35 WARNING oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to d4b9180f91a94f9a82c3c9c4b7595566: oslo_messaging.exceptions.MessageUndeliverable 2023-02-26 08:48:01.284 35 ERROR oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply d4b9180f91a94f9a82c3c9c4b7595566 failed to send after 60 seconds due to a missing queue (reply_349bcb075f8c49329435a0f884b33066). Abandoning...: oslo_messaging.exceptions.MessageUndeliverable 2023-02-26 08:49:01.303 33 WARNING oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to 897911a234a445d8a0d8af02ece40f6f: oslo_messaging.exceptions.MessageUndeliverable 2023-02-26 08:49:01.304 33 ERROR oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply 897911a234a445d8a0d8af02ece40f6f failed to send after 60 seconds due to a missing queue (reply_349bcb075f8c49329435a0f884b33066). Abandoning...: oslo_messaging.exceptions.MessageUndeliverable 2023-02-26 08:49:52.254 31 WARNING nova.cache_utils [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Cache enabled with backend dogpile.cache.null. 2023-02-26 08:50:01.264 27 WARNING oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to 8f723ceb10c3472db9a9f324861df2bb: oslo_messaging.exceptions.MessageUndeliverable 2023-02-26 08:50:01.266 27 ERROR oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply 8f723ceb10c3472db9a9f324861df2bb failed to send after 60 seconds due to a missing queue (reply_349bcb075f8c49329435a0f884b33066). Abandoning...: oslo_messaging.exceptions.MessageUndeliverable
With regards, Swogat Pradhan
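For reference: the MessageUndeliverable errors above mean nova-conductor tried to deliver an RPC reply to a reply_* queue that no longer exists on the broker. A minimal check (run inside the rabbitmq container on any controller; the queue name is copied from the log above, and grep returning nothing means the queue really is gone):

[root@overcloud-controller-0 /]# rabbitmqctl list_queues -p / name messages consumers | grep reply_349bcb075f8c49329435a0f884b33066
[root@overcloud-controller-0 /]# rabbitmqctl list_bindings -p / | grep reply_349bcb075f8c49329435a0f884b33066

These reply queues are created by the RPC caller (here most likely nova-compute on the edge node), so a queue disappearing usually points at that caller's AMQP connection dropping over the WAN rather than at a problem on the conductor side.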
Hi Eugen,
Please add my email to either 'To' or 'Cc', as I am not getting your emails.
Coming to the issue:

[root@overcloud-controller-no-ceph-3 /]# rabbitmqctl list_policies -p /
Listing policies for vhost "/" ...
vhost name pattern apply-to definition priority
/ ha-all ^(?!amq\.).* queues {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0

I have the edge site compute nodes up; a node only goes down when I try to launch an instance, and the instance reaches the spawning state and then gets stuck.
I have a tunnel set up between the central and the edge sites.

With regards,
Swogat Pradhan

On Tue, Feb 28, 2023 at 9:11 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi Eugen,
For some reason I am not getting your emails directly; I am checking the email digest and found your reply there.
Here is the log for download: https://we.tl/t-L8FEkGZFSq
Yes, these logs are from the time when the issue occurred.
*Note: I am able to create VMs and perform other activities in the central site; I am only facing this issue in the edge site.*
With regards, Swogat Pradhan
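Given the tunnel between the central and edge sites mentioned above, it is worth confirming that the edge compute node can reach (and keep reaching) the central RabbitMQ brokers. A minimal sketch, assuming the controller internal API addresses from the cluster status are routable from the edge node and that nc is installed there:

[root@dcn01-hci-0 ~]# for ip in 172.25.201.212 172.25.201.205 172.25.201.201 172.25.201.209; do nc -zv -w 5 "$ip" 5672; done

It can also help to double-check which transport_url the edge cell actually points at, e.g. with nova-manage cell_v2 list_cells --verbose from a nova_api or nova_conductor container on the central controllers (--verbose prints the unmasked transport URL).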
One more thing coming to mind is MTU size. Is it identical between the central and edge sites? Do you see packet loss through the tunnel?

Quoting Swogat Pradhan <swogatpradhan22@gmail.com>:
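For completeness, a minimal way to test both from the edge compute node towards one of the controller internal API addresses (172.25.201.212 is taken from the listener list earlier in the thread; the options are standard Linux ping flags):

# sustained ping to look for loss or latency spikes through the tunnel
[root@dcn01-hci-0 ~]# ping -c 200 -i 0.2 172.25.201.212
# 1472 bytes of payload + 28 bytes of ICMP/IP headers = 1500; with fragmentation prohibited this fails if the path MTU through the tunnel is smaller
[root@dcn01-hci-0 ~]# ping -c 3 -M do -s 1472 172.25.201.212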
Hi,
Yes, the MTU is the same as the default, 1500. Generally I haven't seen any packet loss, but I never checked while launching an instance; I will check that and come back. But every time I launch an instance it gets stuck in the spawning state and the hypervisor goes down, so I am not sure whether packet loss is the cause.

With regards,
Swogat Pradhan

On Wed, Mar 1, 2023 at 3:30 PM Eugen Block <eblock@nde.ag> wrote:
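Since the hypervisor is only marked down while an instance is spawning, it may help to watch the compute node's AMQP connections at exactly that moment; if the connections to port 5672 drop, the missed heartbeats would explain nova-compute being reported down. A rough sketch (first command on the edge compute node while launching the VM, second from any host with the OpenStack CLI):

[root@dcn01-hci-0 ~]# watch -n 5 "ss -tnp | grep ':5672'"
(overcloud) $ openstack compute service list --service nova-compute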
Hi,
I don't see any major packet loss. The problem seems to be somewhere in RabbitMQ, but not due to packet loss.

With regards,
Swogat Pradhan

On Wed, Mar 1, 2023 at 3:34 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
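If the link turns out to be slow rather than lossy, one thing that is sometimes raised for high-latency DCN links is the oslo.messaging timeouts in nova.conf on the edge compute node. This is only a sketch with illustrative values (both options default to 60 seconds), not a confirmed fix for this issue, and in a TripleO deployment they would normally be set through the deployment templates rather than edited in place:

[DEFAULT]
rpc_response_timeout = 180

[oslo_messaging_rabbit]
heartbeat_timeout_threshold = 120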
Hi,
Can someone please help me out with this issue?

With regards,
Swogat Pradhan

On Thu, Mar 2, 2023 at 1:24 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi I don't see any major packet loss. It seems the problem is somewhere in rabbitmq maybe but not due to packet loss.
with regards, Swogat Pradhan
On Wed, Mar 1, 2023 at 3:34 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi, Yes the MTU is the same as the default '1500'. Generally I haven't seen any packet loss, but never checked when launching the instance. I will check that and come back. But everytime i launch an instance the instance gets stuck at spawning state and there the hypervisor becomes down, so not sure if packet loss causes this.
With regards, Swogat pradhan
On Wed, Mar 1, 2023 at 3:30 PM Eugen Block <eblock@nde.ag> wrote:
One more thing coming to mind is MTU size. Are they identical between central and edge site? Do you see packet loss through the tunnel?
Zitat von Swogat Pradhan <swogatpradhan22@gmail.com>:
Hi Eugen, Request you to please add my email either on 'to' or 'cc' as i am not getting email's from you. Coming to the issue:
[root@overcloud-controller-no-ceph-3 /]# rabbitmqctl list_policies -p / Listing policies for vhost "/" ... vhost name pattern apply-to definition priority / ha-all ^(?!amq\.).* queues
{"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0
I have the edge site compute nodes up, it only goes down when i am
to launch an instance and the instance comes to a spawning state and
gets stuck.
I have a tunnel setup between the central and the edge sites.
With regards, Swogat Pradhan
On Tue, Feb 28, 2023 at 9:11 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
Hi Eugen, For some reason i am not getting your email to me directly, i am checking the email digest and there i am able to find your reply. Here is the log for download: https://we.tl/t-L8FEkGZFSq Yes, these logs are from the time when the issue occurred.
*Note: i am able to create vm's and perform other activities in the central site, only facing this issue in the edge site.*
With regards, Swogat Pradhan
On Mon, Feb 27, 2023 at 5:12 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
Hi Eugen, Thanks for your response. I have actually a 4 controller setup so here are the details:
*PCS Status:* * Container bundle set: rabbitmq-bundle [ 172.25.201.68:8787/tripleomaster/openstack-rabbitmq:pcmklatest]: * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-no-ceph-3 * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-2 * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-1 * rabbitmq-bundle-3 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-0
I have tried restarting the bundle multiple times but the issue is still present.
*Cluster status:* [root@overcloud-controller-0 /]# rabbitmqctl cluster_status Cluster status of node rabbit@overcloud-controller-0.internalapi.bdxworld.com ... Basics
Cluster name: rabbit@overcloud-controller-no-ceph-3.bdxworld.com
Disk Nodes
rabbit@overcloud-controller-0.internalapi.bdxworld.com rabbit@overcloud-controller-1.internalapi.bdxworld.com rabbit@overcloud-controller-2.internalapi.bdxworld.com rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com
Running Nodes
rabbit@overcloud-controller-0.internalapi.bdxworld.com rabbit@overcloud-controller-1.internalapi.bdxworld.com rabbit@overcloud-controller-2.internalapi.bdxworld.com rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com
Versions
rabbit@overcloud-controller-0.internalapi.bdxworld.com: RabbitMQ 3.8.3 on Erlang 22.3.4.1 rabbit@overcloud-controller-1.internalapi.bdxworld.com: RabbitMQ 3.8.3 on Erlang 22.3.4.1 rabbit@overcloud-controller-2.internalapi.bdxworld.com: RabbitMQ 3.8.3 on Erlang 22.3.4.1 rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com: RabbitMQ 3.8.3 on Erlang 22.3.4.1
Alarms
(none)
Network Partitions
(none)
Listeners
Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, interface: 172.25.201.212, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, interface: 172.25.201.205, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, interface: 172.25.201.201, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com, interface: 172.25.201.209, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Feature flags
Flag: drop_unroutable_metric, state: enabled
Flag: empty_basic_get_metric, state: enabled
Flag: implicit_default_bindings, state: enabled
Flag: quorum_queue, state: enabled
Flag: virtual_host_metadata, state: enabled
*Logs:* *(Attached)*
With regards, Swogat Pradhan
On Sun, Feb 26, 2023 at 2:34 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
Hi,
Please find the nova conductor as well as nova api log.

nova-conductor:

2023-02-26 08:45:01.108 31 WARNING oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to 16152921c1eb45c2b1f562087140168b
2023-02-26 08:45:02.144 26 WARNING oslo_messaging._drivers.amqpdriver [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] reply_276049ec36a84486a8a406911d9802f4 doesn't exist, drop reply to 83dbe5f567a940b698acfe986f6194fa
2023-02-26 08:45:02.314 32 WARNING oslo_messaging._drivers.amqpdriver [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] reply_276049ec36a84486a8a406911d9802f4 doesn't exist, drop reply to f3bfd7f65bd542b18d84cea3033abb43: oslo_messaging.exceptions.MessageUndeliverable
2023-02-26 08:45:02.316 32 ERROR oslo_messaging._drivers.amqpdriver [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] The reply f3bfd7f65bd542b18d84cea3033abb43 failed to send after 60 seconds due to a missing queue (reply_276049ec36a84486a8a406911d9802f4). Abandoning...: oslo_messaging.exceptions.MessageUndeliverable
2023-02-26 08:48:01.282 35 WARNING oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to d4b9180f91a94f9a82c3c9c4b7595566: oslo_messaging.exceptions.MessageUndeliverable
2023-02-26 08:48:01.284 35 ERROR oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply d4b9180f91a94f9a82c3c9c4b7595566 failed to send after 60 seconds due to a missing queue (reply_349bcb075f8c49329435a0f884b33066). Abandoning...: oslo_messaging.exceptions.MessageUndeliverable
2023-02-26 08:49:01.303 33 WARNING oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to 897911a234a445d8a0d8af02ece40f6f: oslo_messaging.exceptions.MessageUndeliverable
2023-02-26 08:49:01.304 33 ERROR oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply 897911a234a445d8a0d8af02ece40f6f failed to send after 60 seconds due to a missing queue (reply_349bcb075f8c49329435a0f884b33066). Abandoning...: oslo_messaging.exceptions.MessageUndeliverable
2023-02-26 08:49:52.254 31 WARNING nova.cache_utils [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Cache enabled with backend dogpile.cache.null.
2023-02-26 08:50:01.264 27 WARNING oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to 8f723ceb10c3472db9a9f324861df2bb: oslo_messaging.exceptions.MessageUndeliverable
2023-02-26 08:50:01.266 27 ERROR oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply 8f723ceb10c3472db9a9f324861df2bb failed to send after 60 seconds due to a missing queue (reply_349bcb075f8c49329435a0f884b33066). Abandoning...: oslo_messaging.exceptions.MessageUndeliverable

With regards,
Swogat Pradhan
Hi,

I tried to help someone with a similar issue some time ago in this thread: https://serverfault.com/questions/1116771/openstack-oslo-messaging-exception...

But apparently a neutron reinstallation fixed it for that user; I'm not sure if that could apply here. Is it possible that your nova and neutron versions differ between the central and edge sites? Have you restarted the nova and neutron services on the compute nodes after installation? Do you have debug logs of nova-conductor and maybe nova-compute? Maybe they can help narrow down the issue.

If there isn't any additional information in the debug logs, I would probably start "tearing down" rabbitmq. I haven't had to do that in a production system yet, so be careful. I can think of two routes:

- Either remove queues, exchanges etc. while rabbit is running; this will most likely impact client IO depending on your load. Check out the rabbitmqctl commands.
- Or stop the rabbitmq cluster, remove the mnesia tables from all nodes and restart rabbitmq so the exchanges, queues etc. are rebuilt.

I can imagine that the failed reply "survives" while being replicated across the rabbit nodes. But I don't really know the rabbit internals too well, so maybe someone else can chime in here and give better advice.

Regards,
Eugen

Zitat von Swogat Pradhan <swogatpradhan22@gmail.com>:
Hi, Can someone please help me out on this issue?
With regards, Swogat Pradhan
On Thu, Mar 2, 2023 at 1:24 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi, I don't see any major packet loss. It seems the problem is somewhere in rabbitmq, but not due to packet loss.
With regards, Swogat Pradhan
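For reference, a rough sketch of the two routes Eugen describes above on a TripleO pacemaker-managed rabbitmq cluster; the bundle name matches the PCS output earlier in the thread, but the paths and the optional rabbitmqadmin step are assumptions:

# route 1: inspect the reply queues while rabbit is running; the queue named in the
# conductor errors (reply_349bcb...) should exist while the caller is still waiting
rabbitmqctl list_queues -p / name messages consumers | grep reply_
# individual stale queues can be removed with the management plugin, if it is enabled:
#   rabbitmqadmin delete queue name=<queue-name>

# route 2: rebuild the cluster state (disruptive; plan a maintenance window)
pcs resource disable rabbitmq-bundle
# on every controller, clear the mnesia database so exchanges/queues are recreated on start
rm -rf /var/lib/rabbitmq/mnesia/*
pcs resource enable rabbitmq-bundle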
Does your environment use different network interfaces for each of the networks? Or does it have a bond with everything on it?

One issue I have seen before is that when launching instances, there is a lot of network traffic between nodes because the hypervisor needs to download the image from Glance. Along with the normal traffic from various other services, it can be enough to cause issues if everything is running over a single 1GbE interface.

I have in fact seen the same situation when using a single active/backup bond on 1GbE NICs. It's worth checking the network traffic while you try to spawn the instance to see if you're dropping packets. In the situation I described, there were dropped packets which resulted in a loss of communication between nova_compute and RMQ, so the node appeared offline. You should also confirm in the nova_compute logs that nova_compute is being disconnected, by tailing them on the hypervisor while spawning the instance.

In my case, changing from active/backup to LACP helped. So, based on that experience, from my perspective, it certainly sounds like some kind of network issue.

Regards,

Brendan Shephard
Senior Software Engineer
Red Hat Australia
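One way to watch for the drops Brendan describes while an instance is spawning; the bond name and log path below are typical TripleO defaults and should be adjusted to the deployment:

# on the hypervisor, watch error/drop counters on the bond carrying internal API traffic
watch -n 1 "ip -s link show bond0"
# in parallel, tail nova-compute and look for AMQP/heartbeat disconnects at the same moment
tail -f /var/log/containers/nova/nova-compute.log | grep -iE "amqp|heartbeat|error"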
Hi Brendan,
Thank you for your response. The edge1 site was just for testing, so I used an active-backup bond on a 1GbE interface. We are in the process of adding another edge site where we are using 2 Linux bond VLAN templates. I will test launching a VM in the 2nd edge site and confirm whether I am facing the same issue or no issue at all.

With regards,
Swogat Pradhan
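Once the new site is deployed, the negotiated LACP state of a Linux bond can be verified directly on the node; the bond and member names below are placeholders:

# confirm 802.3ad mode and per-slave aggregator/partner state
cat /proc/net/bonding/bond1
# link speed and duplex of an individual member NIC
ethtool <member-nic>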
Hi Brendan,
Now I have deployed another site where I have used 2 Linux bond network templates for both the 3 compute nodes and the 3 ceph nodes. The bonding option is set to mode=802.3ad (lacp=active).
I used a cirros image to launch an instance but the instance timed out, so I waited for the volume to be created. Once the volume was created I tried launching the instance from the volume, and the instance is still stuck in the spawning state.

Here is the nova-compute log:

2023-03-15 17:35:47.739 185437 INFO oslo.privsep.daemon [-] privsep daemon starting
2023-03-15 17:35:47.744 185437 INFO oslo.privsep.daemon [-] privsep process running with uid/gid: 0/0
2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon [-] privsep process running with capabilities (eff/prm/inh): CAP_SYS_ADMIN/CAP_SYS_ADMIN/none
2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon [-] privsep daemon running as pid 185437
2023-03-15 17:35:47.974 8 WARNING os_brick.initiator.connectors.nvmeof [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Process execution error in _get_host_uuid: Unexpected error while running command.
Command: blkid overlay -s UUID -o value
Exit code: 2
Stdout: ''
Stderr: '': oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
2023-03-15 17:35:51.616 8 INFO nova.virt.libvirt.driver [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] [instance: 450b749c-a10a-4308-80a9-3b8020fee758] Creating image

It is stuck at "Creating image". Do I need to run the template mentioned here?: https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep...
The volume is already created, and I do not understand why the instance is stuck in the spawning state.

With regards,
Swogat Pradhan
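A couple of checks that may help narrow down where the boot-from-volume flow is stuck at the new edge site; the store and pool names below are assumptions for a typical DCN deployment, and the IDs are placeholders:

# does the image exist in the edge site's glance store? the 'stores' property should
# list the edge store, not only the central one; if it only exists centrally, it needs
# to be imported/copied to the edge store first (see the tripleo-docs link above)
openstack image show <image-id> -c properties
# is the boot volume actually available?
openstack volume show <volume-id> -c status
# on an edge ceph node, confirm the volume exists in the edge volumes pool
rbd -p volumes ls | grep <volume-id>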
Does your environment use different network interfaces for each of the networks? Or does it have a bond with everything on it?
One issue I have seen before is that when launching instances, there is a lot of network traffic between nodes as the hypervisor needs to download the image from Glance. Along with various other services sending normal network traffic, it can be enough to cause issues if everything is running over a single 1Gbe interface.
I have seen the same situation in fact when using a single active/backup bond on 1Gbe nics. It’s worth checking the network traffic while you try to spawn the instance to see if you’re dropping packets. In the situation I described, there were dropped packets which resulted in a loss of communication between nova_compute and RMQ, so the node appeared offline. You should also confirm that nova_compute is being disconnected in the nova_compute logs if you tail them on the Hypervisor while spawning the instance.
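For example, one way to watch for interface drops and for nova_compute losing its RabbitMQ connection while the instance spawns could be the following (the bond name and container name are assumptions for a TripleO compute node; adjust them to your environment):
$ watch -n 1 'ip -s link show bond1'                                          # watch the RX/TX "dropped" and "errors" counters (bond1 assumed)
$ sudo podman logs -f nova_compute 2>&1 | grep -iE 'rabbit|amqp|heartbeat'    # look for connection resets or missed heartbeats during the spawn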
In my case, changing from active/backup to LACP helped. So, based on that experience, from my perspective, it certainly sounds like some kind of network issue.
Regards,
Brendan Shephard Senior Software Engineer Red Hat Australia
On 5 Mar 2023, at 6:47 am, Eugen Block <eblock@nde.ag> wrote:
Hi,
I tried to help someone with a similar issue some time ago in this thread:
https://serverfault.com/questions/1116771/openstack-oslo-messaging-exception...
But apparently a neutron reinstallation fixed it for that user; not sure if that could apply here. But is it possible that your nova and neutron versions are different between the central and edge sites? Have you restarted the nova and neutron services on the compute nodes after installation? Do you have debug logs of nova-conductor and maybe nova-compute? Maybe they can help narrow down the issue. If there isn't any additional information in the debug logs I would probably start "tearing down" rabbitmq. I haven't had to do that in a production system yet, so be careful. I can think of two routes:
- Either remove queues, exchanges etc. while rabbit is running, this will most likely impact client IO depending on your load. Check out the rabbitmqctl commands. - Or stop the rabbitmq cluster, remove the mnesia tables from all nodes and restart rabbitmq so the exchanges, queues etc. rebuild.
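A rough sketch of those two routes on this kind of TripleO / pacemaker-managed cluster might look like the following; the queue name is taken from the conductor log further down in this thread, and everything else (data path, bundle name, availability of delete_queue on this RabbitMQ version) should be verified before running anything in production:
# Route 1: inspect and drop the stale reply queue while RabbitMQ is running (run inside the rabbitmq container, as with list_policies below):
rabbitmqctl list_queues -p / name messages consumers | grep '^reply_'
rabbitmqctl delete_queue reply_349bcb075f8c49329435a0f884b33066     # delete_queue should exist on 3.8; otherwise remove the queue via the management UI/API
# Route 2: stop the whole cluster, wipe mnesia, and let exchanges/queues rebuild:
pcs resource disable rabbitmq-bundle            # on one controller
rm -rf /var/lib/rabbitmq/mnesia/*               # on every controller; path assumed for the bind-mounted rabbit data directory
pcs resource enable rabbitmq-bundle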
I can imagine that the failed reply "survives" while being replicated across the rabbit nodes. But I don't really know the rabbit internals too well, so maybe someone else can chime in here and give better advice.
Regards, Eugen
Zitat von Swogat Pradhan <swogatpradhan22@gmail.com>:
Hi, Can someone please help me out on this issue?
With regards, Swogat Pradhan
On Thu, Mar 2, 2023 at 1:24 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi I don't see any major packet loss. It seems the problem is somewhere in rabbitmq maybe but not due to packet loss.
with regards, Swogat Pradhan
On Wed, Mar 1, 2023 at 3:34 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi, Yes, the MTU is the same as the default '1500'. Generally I haven't seen any packet loss, but I never checked while launching an instance; I will check that and come back. But every time I launch an instance it gets stuck in the spawning state and then the hypervisor goes down, so I am not sure if packet loss causes this.
With regards, Swogat pradhan
On Wed, Mar 1, 2023 at 3:30 PM Eugen Block <eblock@nde.ag> wrote:
One more thing coming to mind is MTU size. Are they identical between central and edge site? Do you see packet loss through the tunnel?
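For instance, a quick way to check both from an edge compute node (using controller-0's internal API address from the rabbitmq listener output elsewhere in this thread as an example) could be:
$ ping -c 200 -i 0.2 172.25.201.212          # check the packet-loss percentage in the summary
$ ping -c 5 -M do -s 1472 172.25.201.212     # 1472 + 28 bytes of headers = 1500; failures here point to an MTU problem on the path or tunnel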
Zitat von Swogat Pradhan <swogatpradhan22@gmail.com>:
Hi Eugen, Please add my email to either 'To' or 'Cc', as I am not getting emails from you. Coming to the issue:
[root@overcloud-controller-no-ceph-3 /]# rabbitmqctl list_policies -p /
Listing policies for vhost "/" ...
vhost  name    pattern        apply-to  definition                                                              priority
/      ha-all  ^(?!amq\.).*   queues    {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"}  0
I have the edge site compute nodes up; it only goes down when I am trying to launch an instance, and the instance comes to a spawning state and gets stuck.
I have a tunnel setup between the central and the edge sites.
With regards, Swogat Pradhan
On Tue, Feb 28, 2023 at 9:11 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
Hi Eugen, For some reason i am not getting your email to me directly, i am checking the email digest and there i am able to find your reply. Here is the log for download: https://we.tl/t-L8FEkGZFSq Yes, these logs are from the time when the issue occurred.
*Note: i am able to create vm's and perform other activities in the central site, only facing this issue in the edge site.*
With regards, Swogat Pradhan
On Mon, Feb 27, 2023 at 5:12 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
Hi Eugen, Thanks for your response. I have actually a 4 controller setup so here are the details:
*PCS Status:*
* Container bundle set: rabbitmq-bundle [172.25.201.68:8787/tripleomaster/openstack-rabbitmq:pcmklatest]:
  * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-no-ceph-3
  * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-2
  * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-1
  * rabbitmq-bundle-3 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-0
I have tried restarting the bundle multiple times but the issue is still present.
*Cluster status:* [root@overcloud-controller-0 /]# rabbitmqctl cluster_status Cluster status of node rabbit@overcloud-controller-0.internalapi.bdxworld.com ... Basics
Cluster name: rabbit@overcloud-controller-no-ceph-3.bdxworld.com
Disk Nodes
rabbit@overcloud-controller-0.internalapi.bdxworld.com rabbit@overcloud-controller-1.internalapi.bdxworld.com rabbit@overcloud-controller-2.internalapi.bdxworld.com rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com
Running Nodes
rabbit@overcloud-controller-0.internalapi.bdxworld.com rabbit@overcloud-controller-1.internalapi.bdxworld.com rabbit@overcloud-controller-2.internalapi.bdxworld.com rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com
Versions
rabbit@overcloud-controller-0.internalapi.bdxworld.com: RabbitMQ 3.8.3 on Erlang 22.3.4.1 rabbit@overcloud-controller-1.internalapi.bdxworld.com: RabbitMQ 3.8.3 on Erlang 22.3.4.1 rabbit@overcloud-controller-2.internalapi.bdxworld.com: RabbitMQ 3.8.3 on Erlang 22.3.4.1 rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com: RabbitMQ 3.8.3 on Erlang 22.3.4.1
Alarms
(none)
Network Partitions
(none)
Listeners
Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, interface: 172.25.201.212, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, interface: 172.25.201.205, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, interface: 172.25.201.201, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com, interface: 172.25.201.209, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Feature flags
Flag: drop_unroutable_metric, state: enabled
Flag: empty_basic_get_metric, state: enabled
Flag: implicit_default_bindings, state: enabled
Flag: quorum_queue, state: enabled
Flag: virtual_host_metadata, state: enabled
*Logs:* *(Attached)*
With regards, Swogat Pradhan
On Sun, Feb 26, 2023 at 2:34 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
Hi, Please find the nova conductor as well as nova api log.
nova-conductor:
2023-02-26 08:45:01.108 31 WARNING oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to 16152921c1eb45c2b1f562087140168b 2023-02-26 08:45:02.144 26 WARNING oslo_messaging._drivers.amqpdriver [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] reply_276049ec36a84486a8a406911d9802f4 doesn't exist, drop reply to 83dbe5f567a940b698acfe986f6194fa 2023-02-26 08:45:02.314 32 WARNING oslo_messaging._drivers.amqpdriver [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] reply_276049ec36a84486a8a406911d9802f4 doesn't exist, drop reply to f3bfd7f65bd542b18d84cea3033abb43: oslo_messaging.exceptions.MessageUndeliverable 2023-02-26 08:45:02.316 32 ERROR oslo_messaging._drivers.amqpdriver [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] The reply f3bfd7f65bd542b18d84cea3033abb43 failed to send after 60 seconds due to a missing queue (reply_276049ec36a84486a8a406911d9802f4). Abandoning...: oslo_messaging.exceptions.MessageUndeliverable 2023-02-26 08:48:01.282 35 WARNING oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to d4b9180f91a94f9a82c3c9c4b7595566: oslo_messaging.exceptions.MessageUndeliverable 2023-02-26 08:48:01.284 35 ERROR oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply d4b9180f91a94f9a82c3c9c4b7595566 failed to send after 60 seconds due to a missing queue (reply_349bcb075f8c49329435a0f884b33066). Abandoning...: oslo_messaging.exceptions.MessageUndeliverable 2023-02-26 08:49:01.303 33 WARNING oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to 897911a234a445d8a0d8af02ece40f6f: oslo_messaging.exceptions.MessageUndeliverable 2023-02-26 08:49:01.304 33 ERROR oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply 897911a234a445d8a0d8af02ece40f6f failed to send after 60 seconds due to a missing queue (reply_349bcb075f8c49329435a0f884b33066). Abandoning...: oslo_messaging.exceptions.MessageUndeliverable 2023-02-26 08:49:52.254 31 WARNING nova.cache_utils [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Cache enabled with backend dogpile.cache.null. 2023-02-26 08:50:01.264 27 WARNING oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to 8f723ceb10c3472db9a9f324861df2bb: oslo_messaging.exceptions.MessageUndeliverable 2023-02-26 08:50:01.266 27 ERROR oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply 8f723ceb10c3472db9a9f324861df2bb failed to send after 60 seconds due to a missing queue (reply_349bcb075f8c49329435a0f884b33066). Abandoning...: oslo_messaging.exceptions.MessageUndeliverable
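These errors suggest the conductor's replies have nowhere to go: the reply_* queue is declared by the RPC caller (here most likely nova-compute on the edge node), so if that caller's RabbitMQ connection is lost or the queue is otherwise gone, every queued reply is abandoned, which fits the compute service flapping to 'down'. One way to confirm this while a launch is stuck (run inside the rabbitmq container, as with the other rabbitmqctl commands in this thread) might be:
rabbitmqctl list_queues -p / name consumers | grep reply_349bcb075f8c49329435a0f884b33066    # no output => the reply queue named in the error really is gone
rabbitmqctl list_connections user peer_host state                                            # check whether the edge compute node still holds an AMQP connection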
With regards, Swogat Pradhan
On Sun, Feb 26, 2023 at 2:26 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
Hi, I currently have 3 compute nodes on edge site1 where i am trying to launch vm's. When the VM is in spawning state the node goes down (openstack compute service list), the node comes backup when i restart the nova compute service but then the launch of the vm fails.
nova-compute.log
2023-02-26 08:15:51.808 7 INFO nova.compute.manager [req-bc0f5f2e-53fc-4dae-b1da-82f1f972d617 - - - - -] Running instance usage audit for host dcn01-hci-0.bdxworld.com from 2023-02-26 07:00:00 to 2023-02-26 08:00:00. 0 instances. 2023-02-26 08:49:52.813 7 INFO nova.compute.claims [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] [instance: 0c62c1ef-9010-417d-a05f-4db77e901600] Claim successful on node dcn01-hci-0.bdxworld.com 2023-02-26 08:49:54.225 7 INFO nova.virt.libvirt.driver [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] [instance: 0c62c1ef-9010-417d-a05f-4db77e901600] Ignoring supplied device name: /dev/vda. Libvirt can't honour user-supplied dev names 2023-02-26 08:49:54.398 7 INFO nova.virt.block_device [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] [instance: 0c62c1ef-9010-417d-a05f-4db77e901600] Booting with volume c4bd7885-5973-4860-bbe6-7a2f726baeee at /dev/vda 2023-02-26 08:49:55.216 7 WARNING nova.cache_utils [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Cache enabled with backend dogpile.cache.null. 2023-02-26 08:49:55.283 7 INFO oslo.privsep.daemon [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Running privsep helper: ['sudo', 'nova-rootwrap', '/etc/nova/rootwrap.conf', 'privsep-helper', '--config-file', '/etc/nova/nova.conf', '--config-file', '/etc/nova/nova-compute.conf', '--privsep_context', 'os_brick.privileged.default', '--privsep_sock_path', '/tmp/tmpin40tah6/privsep.sock'] 2023-02-26 08:49:55.791 7 INFO oslo.privsep.daemon [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Spawned new
privsep
daemon via rootwrap 2023-02-26 08:49:55.717 2647 INFO oslo.privsep.daemon [-] privsep daemon starting 2023-02-26 08:49:55.722 2647 INFO oslo.privsep.daemon [-] privsep process running with uid/gid: 0/0 2023-02-26 08:49:55.726 2647 INFO oslo.privsep.daemon [-] privsep process running with capabilities (eff/prm/inh): CAP_SYS_ADMIN/CAP_SYS_ADMIN/none 2023-02-26 08:49:55.726 2647 INFO oslo.privsep.daemon [-] privsep daemon running as pid 2647 2023-02-26 08:49:55.956 7 WARNING os_brick.initiator.connectors.nvmeof [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Process execution error in _get_host_uuid: Unexpected error while running command. Command: blkid overlay -s UUID -o value Exit code: 2 Stdout: '' Stderr: '': oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command. 2023-02-26 08:49:58.247 7 INFO nova.virt.libvirt.driver [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] [instance: 0c62c1ef-9010-417d-a05f-4db77e901600] Creating image
Is there a way to solve this issue?
With regards,
Swogat Pradhan
Update: In the hypervisor list the compute node state is showing down. On Wed, Mar 15, 2023 at 11:11 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi Brendan, Now i have deployed another site where i have used 2 linux bonds network template for both 3 compute nodes and 3 ceph nodes. The bonding options is set to mode=802.3ad (lacp=active). I used a cirros image to launch instance but the instance timed out so i waited for the volume to be created. Once the volume was created i tried launching the instance from the volume and still the instance is stuck in spawning state.
Here is the nova-compute log:
2023-03-15 17:35:47.739 185437 INFO oslo.privsep.daemon [-] privsep daemon starting 2023-03-15 17:35:47.744 185437 INFO oslo.privsep.daemon [-] privsep process running with uid/gid: 0/0 2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon [-] privsep process running with capabilities (eff/prm/inh): CAP_SYS_ADMIN/CAP_SYS_ADMIN/none 2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon [-] privsep daemon running as pid 185437 2023-03-15 17:35:47.974 8 WARNING os_brick.initiator.connectors.nvmeof [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Process execution error in _get_host_uuid: Unexpected error while running command. Command: blkid overlay -s UUID -o value Exit code: 2 Stdout: '' Stderr: '': oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command. 2023-03-15 17:35:51.616 8 INFO nova.virt.libvirt.driver [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] [instance: 450b749c-a10a-4308-80a9-3b8020fee758] Creating image
It is stuck in creating image, do i need to run the template mentioned here ?: https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep...
The volume is already created and i do not understand why the instance is stuck in spawning state.
With regards, Swogat Pradhan
On Sun, Mar 5, 2023 at 4:02 PM Brendan Shephard <bshephar@redhat.com> wrote:
Does your environment use different network interfaces for each of the networks? Or does it have a bond with everything on it?
One issue I have seen before is that when launching instances, there is a lot of network traffic between nodes as the hypervisor needs to download the image from Glance. Along with various other services sending normal network traffic, it can be enough to cause issues if everything is running over a single 1Gbe interface.
I have seen the same situation in fact when using a single active/backup bond on 1Gbe nics. It’s worth checking the network traffic while you try to spawn the instance to see if you’re dropping packets. In the situation I described, there were dropped packets which resulted in a loss of communication between nova_compute and RMQ, so the node appeared offline. You should also confirm that nova_compute is being disconnected in the nova_compute logs if you tail them on the Hypervisor while spawning the instance.
In my case, changing from active/backup to LACP helped. So, based on that experience, from my perspective, is certainly sounds like some kind of network issue.
Regards,
Brendan Shephard Senior Software Engineer Red Hat Australia
On 5 Mar 2023, at 6:47 am, Eugen Block <eblock@nde.ag> wrote:
Hi,
I tried to help someone with a similar issue some time ago in this thread:
https://serverfault.com/questions/1116771/openstack-oslo-messaging-exception...
But apparently a neutron reinstallation fixed it for that user, not sure if that could apply here. But is it possible that your nova and neutron versions are different between central and edge site? Have you restarted nova and neutron services on the compute nodes after installation? Have you debug logs of nova-conductor and maybe nova-compute? Maybe they can help narrow down the issue. If there isn't any additional information in the debug logs I probably would start "tearing down" rabbitmq. I didn't have to do that in a production system yet so be careful. I can think of two routes:
- Either remove queues, exchanges etc. while rabbit is running, this will most likely impact client IO depending on your load. Check out the rabbitmqctl commands. - Or stop the rabbitmq cluster, remove the mnesia tables from all nodes and restart rabbitmq so the exchanges, queues etc. rebuild.
I can imagine that the failed reply "survives" while being replicated across the rabbit nodes. But I don't really know the rabbit internals too well, so maybe someone else can chime in here and give a better advice.
Regards, Eugen
Zitat von Swogat Pradhan <swogatpradhan22@gmail.com>:
Hi, Can someone please help me out on this issue?
With regards, Swogat Pradhan
On Thu, Mar 2, 2023 at 1:24 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi I don't see any major packet loss. It seems the problem is somewhere in rabbitmq maybe but not due to packet loss.
with regards, Swogat Pradhan
On Wed, Mar 1, 2023 at 3:34 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi, Yes the MTU is the same as the default '1500'. Generally I haven't seen any packet loss, but never checked when launching the instance. I will check that and come back. But everytime i launch an instance the instance gets stuck at spawning state and there the hypervisor becomes down, so not sure if packet loss causes this.
With regards, Swogat pradhan
On Wed, Mar 1, 2023 at 3:30 PM Eugen Block <eblock@nde.ag> wrote:
One more thing coming to mind is MTU size. Are they identical between central and edge site? Do you see packet loss through the tunnel?
Zitat von Swogat Pradhan <swogatpradhan22@gmail.com>:
Hi Eugen, Request you to please add my email either on 'to' or 'cc' as i am not getting email's from you. Coming to the issue:
[root@overcloud-controller-no-ceph-3 /]# rabbitmqctl list_policies -p / Listing policies for vhost "/" ... vhost name pattern apply-to definition priority / ha-all ^(?!amq\.).* queues
{"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0
I have the edge site compute nodes up; it only goes down when I am trying to launch an instance, and the instance comes to a spawning state and gets stuck.
I have a tunnel setup between the central and the edge sites.
With regards, Swogat Pradhan
On Tue, Feb 28, 2023 at 9:11 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
Hi Eugen, For some reason i am not getting your email to me directly, i am checking the email digest and there i am able to find your reply. Here is the log for download: https://we.tl/t-L8FEkGZFSq Yes, these logs are from the time when the issue occurred.
*Note: i am able to create vm's and perform other activities in the central site, only facing this issue in the edge site.*
With regards, Swogat Pradhan
On Mon, Feb 27, 2023 at 5:12 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
Hi Eugen, Thanks for your response. I have actually a 4 controller setup so here are the details:
*PCS Status:* * Container bundle set: rabbitmq-bundle [ 172.25.201.68:8787/tripleomaster/openstack-rabbitmq:pcmklatest]: * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-no-ceph-3 * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-2 * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-1 * rabbitmq-bundle-3 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-0
I have tried restarting the bundle multiple times but the issue is still present.
*Cluster status:* [root@overcloud-controller-0 /]# rabbitmqctl cluster_status Cluster status of node rabbit@overcloud-controller-0.internalapi.bdxworld.com ... Basics
Cluster name: rabbit@overcloud-controller-no-ceph-3.bdxworld.com
Disk Nodes
rabbit@overcloud-controller-0.internalapi.bdxworld.com rabbit@overcloud-controller-1.internalapi.bdxworld.com rabbit@overcloud-controller-2.internalapi.bdxworld.com rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com
Running Nodes
rabbit@overcloud-controller-0.internalapi.bdxworld.com rabbit@overcloud-controller-1.internalapi.bdxworld.com rabbit@overcloud-controller-2.internalapi.bdxworld.com rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com
Versions
rabbit@overcloud-controller-0.internalapi.bdxworld.com: RabbitMQ 3.8.3 on Erlang 22.3.4.1 rabbit@overcloud-controller-1.internalapi.bdxworld.com: RabbitMQ 3.8.3 on Erlang 22.3.4.1 rabbit@overcloud-controller-2.internalapi.bdxworld.com: RabbitMQ 3.8.3 on Erlang 22.3.4.1 rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com: RabbitMQ 3.8.3 on Erlang 22.3.4.1
Alarms
(none)
Network Partitions
(none)
Listeners
Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, interface: 172.25.201.212, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0 Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, interface: [::], port: 15672, protocol: http, purpose: HTTP API Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, interface: 172.25.201.205, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0 Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, interface: [::], port: 15672, protocol: http, purpose: HTTP API Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, interface: 172.25.201.201, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0 Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, interface: [::], port: 15672, protocol: http, purpose: HTTP API Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com , interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com , interface: 172.25.201.209, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0 Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com , interface: [::], port: 15672, protocol: http, purpose: HTTP API
Feature flags
Flag: drop_unroutable_metric, state: enabled Flag: empty_basic_get_metric, state: enabled Flag: implicit_default_bindings, state: enabled Flag: quorum_queue, state: enabled Flag: virtual_host_metadata, state: enabled
*Logs:* *(Attached)*
With regards, Swogat Pradhan
On Sun, Feb 26, 2023 at 2:34 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
Hi, Please find the nova conductor as well as nova api log.
nova-conductor:
2023-02-26 08:45:01.108 31 WARNING oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to 16152921c1eb45c2b1f562087140168b 2023-02-26 08:45:02.144 26 WARNING oslo_messaging._drivers.amqpdriver [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] reply_276049ec36a84486a8a406911d9802f4 doesn't exist, drop reply to 83dbe5f567a940b698acfe986f6194fa 2023-02-26 08:45:02.314 32 WARNING oslo_messaging._drivers.amqpdriver [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] reply_276049ec36a84486a8a406911d9802f4 doesn't exist, drop reply to f3bfd7f65bd542b18d84cea3033abb43: oslo_messaging.exceptions.MessageUndeliverable 2023-02-26 08:45:02.316 32 ERROR oslo_messaging._drivers.amqpdriver [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] The reply f3bfd7f65bd542b18d84cea3033abb43 failed to send after 60 seconds due to a missing queue (reply_276049ec36a84486a8a406911d9802f4). Abandoning...: oslo_messaging.exceptions.MessageUndeliverable 2023-02-26 08:48:01.282 35 WARNING oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to d4b9180f91a94f9a82c3c9c4b7595566: oslo_messaging.exceptions.MessageUndeliverable 2023-02-26 08:48:01.284 35 ERROR oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply d4b9180f91a94f9a82c3c9c4b7595566 failed to send after 60 seconds due to a missing queue (reply_349bcb075f8c49329435a0f884b33066). Abandoning...: oslo_messaging.exceptions.MessageUndeliverable 2023-02-26 08:49:01.303 33 WARNING oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to 897911a234a445d8a0d8af02ece40f6f: oslo_messaging.exceptions.MessageUndeliverable 2023-02-26 08:49:01.304 33 ERROR oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply 897911a234a445d8a0d8af02ece40f6f failed to send after 60 seconds due to a missing queue (reply_349bcb075f8c49329435a0f884b33066). Abandoning...: oslo_messaging.exceptions.MessageUndeliverable 2023-02-26 08:49:52.254 31 WARNING nova.cache_utils [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Cache enabled with backend dogpile.cache.null. 2023-02-26 08:50:01.264 27 WARNING oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to 8f723ceb10c3472db9a9f324861df2bb: oslo_messaging.exceptions.MessageUndeliverable 2023-02-26 08:50:01.266 27 ERROR oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply 8f723ceb10c3472db9a9f324861df2bb failed to send after 60 seconds due to a missing queue (reply_349bcb075f8c49329435a0f884b33066). Abandoning...: oslo_messaging.exceptions.MessageUndeliverable
With regards, Swogat Pradhan
On Sun, Feb 26, 2023 at 2:26 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
> Hi, > I currently have 3 compute nodes on edge site1 where i am trying to > launch vm's. > When the VM is in spawning state the node goes down (openstack compute > service list), the node comes backup when i restart the nova compute > service but then the launch of the vm fails. > > nova-compute.log > > 2023-02-26 08:15:51.808 7 INFO nova.compute.manager > [req-bc0f5f2e-53fc-4dae-b1da-82f1f972d617 - - - - -] Running > instance usage > audit for host dcn01-hci-0.bdxworld.com from 2023-02-26 07:00:00 to > 2023-02-26 08:00:00. 0 instances. > 2023-02-26 08:49:52.813 7 INFO nova.compute.claims > [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 > b240e3e89d99489284cd731e75f2a5db > 4160ce999a31485fa643aed0936dfef0 - default default] [instance: > 0c62c1ef-9010-417d-a05f-4db77e901600] Claim successful on node > dcn01-hci-0.bdxworld.com > 2023-02-26 08:49:54.225 7 INFO nova.virt.libvirt.driver > [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 > b240e3e89d99489284cd731e75f2a5db > 4160ce999a31485fa643aed0936dfef0 - default default] [instance: > 0c62c1ef-9010-417d-a05f-4db77e901600] Ignoring supplied device name: > /dev/vda. Libvirt can't honour user-supplied dev names > 2023-02-26 08:49:54.398 7 INFO nova.virt.block_device > [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 > b240e3e89d99489284cd731e75f2a5db > 4160ce999a31485fa643aed0936dfef0 - default default] [instance: > 0c62c1ef-9010-417d-a05f-4db77e901600] Booting with volume > c4bd7885-5973-4860-bbe6-7a2f726baeee at /dev/vda > 2023-02-26 08:49:55.216 7 WARNING nova.cache_utils > [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 > b240e3e89d99489284cd731e75f2a5db > 4160ce999a31485fa643aed0936dfef0 - default default] Cache enabled with > backend dogpile.cache.null. > 2023-02-26 08:49:55.283 7 INFO oslo.privsep.daemon > [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 > b240e3e89d99489284cd731e75f2a5db > 4160ce999a31485fa643aed0936dfef0 - default default] Running > privsep helper: > ['sudo', 'nova-rootwrap', '/etc/nova/rootwrap.conf', 'privsep-helper', > '--config-file', '/etc/nova/nova.conf', '--config-file', > '/etc/nova/nova-compute.conf', '--privsep_context', > 'os_brick.privileged.default', '--privsep_sock_path', > '/tmp/tmpin40tah6/privsep.sock'] > 2023-02-26 08:49:55.791 7 INFO oslo.privsep.daemon > [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 > b240e3e89d99489284cd731e75f2a5db > 4160ce999a31485fa643aed0936dfef0 - default default] Spawned new
> privsep
> daemon via rootwrap > 2023-02-26 08:49:55.717 2647 INFO oslo.privsep.daemon [-] privsep > daemon starting > 2023-02-26 08:49:55.722 2647 INFO oslo.privsep.daemon [-] privsep > process running with uid/gid: 0/0 > 2023-02-26 08:49:55.726 2647 INFO oslo.privsep.daemon [-] privsep > process running with capabilities (eff/prm/inh): > CAP_SYS_ADMIN/CAP_SYS_ADMIN/none > 2023-02-26 08:49:55.726 2647 INFO oslo.privsep.daemon [-] privsep > daemon running as pid 2647 > 2023-02-26 08:49:55.956 7 WARNING os_brick.initiator.connectors.nvmeof > [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 > b240e3e89d99489284cd731e75f2a5db > 4160ce999a31485fa643aed0936dfef0 - default default] Process > execution error > in _get_host_uuid: Unexpected error while running command. > Command: blkid overlay -s UUID -o value > Exit code: 2 > Stdout: '' > Stderr: '': oslo_concurrency.processutils.ProcessExecutionError: > Unexpected error while running command. > 2023-02-26 08:49:58.247 7 INFO nova.virt.libvirt.driver > [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 > b240e3e89d99489284cd731e75f2a5db > 4160ce999a31485fa643aed0936dfef0 - default default] [instance: > 0c62c1ef-9010-417d-a05f-4db77e901600] Creating image > > Is there a way to solve this issue? > > > With regards, > > Swogat Pradhan >
Update: After restarting the nova services on the controller and running the deploy script on the edge site, I was able to launch the VM from a volume.
Right now the instance creation is failing because the block device creation is stuck in the creating state; it is taking more than 10 minutes for the volume to be created, even though the image has already been imported to the edge Glance.
I will try to create a fresh image and test again, then update.
With regards,
Swogat Pradhan
On Wed, Mar 15, 2023 at 11:13 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Update: In the hypervisor list the compute node state is showing down.
On Wed, Mar 15, 2023 at 11:11 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi Brendan, Now i have deployed another site where i have used 2 linux bonds network template for both 3 compute nodes and 3 ceph nodes. The bonding options is set to mode=802.3ad (lacp=active). I used a cirros image to launch instance but the instance timed out so i waited for the volume to be created. Once the volume was created i tried launching the instance from the volume and still the instance is stuck in spawning state.
Here is the nova-compute log:
2023-03-15 17:35:47.739 185437 INFO oslo.privsep.daemon [-] privsep daemon starting 2023-03-15 17:35:47.744 185437 INFO oslo.privsep.daemon [-] privsep process running with uid/gid: 0/0 2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon [-] privsep process running with capabilities (eff/prm/inh): CAP_SYS_ADMIN/CAP_SYS_ADMIN/none 2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon [-] privsep daemon running as pid 185437 2023-03-15 17:35:47.974 8 WARNING os_brick.initiator.connectors.nvmeof [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Process execution error in _get_host_uuid: Unexpected error while running command. Command: blkid overlay -s UUID -o value Exit code: 2 Stdout: '' Stderr: '': oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command. 2023-03-15 17:35:51.616 8 INFO nova.virt.libvirt.driver [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] [instance: 450b749c-a10a-4308-80a9-3b8020fee758] Creating image
It is stuck in creating image, do i need to run the template mentioned here ?: https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep...
The volume is already created and i do not understand why the instance is stuck in spawning state.
With regards, Swogat Pradhan
On Sun, Mar 5, 2023 at 4:02 PM Brendan Shephard <bshephar@redhat.com> wrote:
Does your environment use different network interfaces for each of the networks? Or does it have a bond with everything on it?
One issue I have seen before is that when launching instances, there is a lot of network traffic between nodes as the hypervisor needs to download the image from Glance. Along with various other services sending normal network traffic, it can be enough to cause issues if everything is running over a single 1Gbe interface.
I have seen the same situation in fact when using a single active/backup bond on 1Gbe nics. It’s worth checking the network traffic while you try to spawn the instance to see if you’re dropping packets. In the situation I described, there were dropped packets which resulted in a loss of communication between nova_compute and RMQ, so the node appeared offline. You should also confirm that nova_compute is being disconnected in the nova_compute logs if you tail them on the Hypervisor while spawning the instance.
In my case, changing from active/backup to LACP helped. So, based on that experience, from my perspective, is certainly sounds like some kind of network issue.
Regards,
Brendan Shephard Senior Software Engineer Red Hat Australia
On 5 Mar 2023, at 6:47 am, Eugen Block <eblock@nde.ag> wrote:
Hi,
I tried to help someone with a similar issue some time ago in this thread:
https://serverfault.com/questions/1116771/openstack-oslo-messaging-exception...
But apparently a neutron reinstallation fixed it for that user, not sure if that could apply here. But is it possible that your nova and neutron versions are different between central and edge site? Have you restarted nova and neutron services on the compute nodes after installation? Have you debug logs of nova-conductor and maybe nova-compute? Maybe they can help narrow down the issue. If there isn't any additional information in the debug logs I probably would start "tearing down" rabbitmq. I didn't have to do that in a production system yet so be careful. I can think of two routes:
- Either remove queues, exchanges etc. while rabbit is running, this will most likely impact client IO depending on your load. Check out the rabbitmqctl commands. - Or stop the rabbitmq cluster, remove the mnesia tables from all nodes and restart rabbitmq so the exchanges, queues etc. rebuild.
I can imagine that the failed reply "survives" while being replicated across the rabbit nodes. But I don't really know the rabbit internals too well, so maybe someone else can chime in here and give a better advice.
Regards, Eugen
Zitat von Swogat Pradhan <swogatpradhan22@gmail.com>:
Hi, Can someone please help me out on this issue?
With regards, Swogat Pradhan
On Thu, Mar 2, 2023 at 1:24 PM Swogat Pradhan <swogatpradhan22@gmail.com
wrote:
Hi I don't see any major packet loss. It seems the problem is somewhere in rabbitmq maybe but not due to packet loss.
with regards, Swogat Pradhan
On Wed, Mar 1, 2023 at 3:34 PM Swogat Pradhan <swogatpradhan22@gmail.com
wrote:
Hi, Yes the MTU is the same as the default '1500'. Generally I haven't seen any packet loss, but never checked when launching the instance. I will check that and come back. But everytime i launch an instance the instance gets stuck at spawning state and there the hypervisor becomes down, so not sure if packet loss causes this.
With regards, Swogat pradhan
On Wed, Mar 1, 2023 at 3:30 PM Eugen Block <eblock@nde.ag> wrote:
One more thing coming to mind is MTU size. Are they identical between central and edge site? Do you see packet loss through the tunnel?
Zitat von Swogat Pradhan <swogatpradhan22@gmail.com>:
Hi Eugen, Request you to please add my email either on 'to' or 'cc' as i am not getting email's from you. Coming to the issue:
[root@overcloud-controller-no-ceph-3 /]# rabbitmqctl list_policies -p / Listing policies for vhost "/" ... vhost name pattern apply-to definition priority / ha-all ^(?!amq\.).* queues
{"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0
I have the edge site compute nodes up; it only goes down when I am trying to launch an instance, and the instance comes to a spawning state and gets stuck.
I have a tunnel setup between the central and the edge sites.
With regards, Swogat Pradhan
On Tue, Feb 28, 2023 at 9:11 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
Hi Eugen, For some reason i am not getting your email to me directly, i am checking the email digest and there i am able to find your reply. Here is the log for download: https://we.tl/t-L8FEkGZFSq Yes, these logs are from the time when the issue occurred.
*Note: i am able to create vm's and perform other activities in the central site, only facing this issue in the edge site.*
With regards, Swogat Pradhan
On Mon, Feb 27, 2023 at 5:12 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
Hi Eugen, Thanks for your response. I have actually a 4 controller setup so here are the details:
*PCS Status:* * Container bundle set: rabbitmq-bundle [ 172.25.201.68:8787/tripleomaster/openstack-rabbitmq:pcmklatest]: * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-no-ceph-3 * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-2 * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-1 * rabbitmq-bundle-3 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-0
I have tried restarting the bundle multiple times but the issue is still present.
*Cluster status:* [root@overcloud-controller-0 /]# rabbitmqctl cluster_status Cluster status of node rabbit@overcloud-controller-0.internalapi.bdxworld.com ... Basics
Cluster name: rabbit@overcloud-controller-no-ceph-3.bdxworld.com
Disk Nodes
rabbit@overcloud-controller-0.internalapi.bdxworld.com rabbit@overcloud-controller-1.internalapi.bdxworld.com rabbit@overcloud-controller-2.internalapi.bdxworld.com rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com
Running Nodes
rabbit@overcloud-controller-0.internalapi.bdxworld.com rabbit@overcloud-controller-1.internalapi.bdxworld.com rabbit@overcloud-controller-2.internalapi.bdxworld.com rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com
Versions
rabbit@overcloud-controller-0.internalapi.bdxworld.com: RabbitMQ 3.8.3 on Erlang 22.3.4.1 rabbit@overcloud-controller-1.internalapi.bdxworld.com: RabbitMQ 3.8.3 on Erlang 22.3.4.1 rabbit@overcloud-controller-2.internalapi.bdxworld.com: RabbitMQ 3.8.3 on Erlang 22.3.4.1 rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com: RabbitMQ 3.8.3 on Erlang 22.3.4.1
Alarms
(none)
Network Partitions
(none)
Listeners
Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, interface: 172.25.201.212, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0 Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, interface: [::], port: 15672, protocol: http, purpose: HTTP API Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, interface: 172.25.201.205, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0 Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, interface: [::], port: 15672, protocol: http, purpose: HTTP API Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, interface: 172.25.201.201, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0 Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, interface: [::], port: 15672, protocol: http, purpose: HTTP API Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com , interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com , interface: 172.25.201.209, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0 Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com , interface: [::], port: 15672, protocol: http, purpose: HTTP API
Feature flags
Flag: drop_unroutable_metric, state: enabled Flag: empty_basic_get_metric, state: enabled Flag: implicit_default_bindings, state: enabled Flag: quorum_queue, state: enabled Flag: virtual_host_metadata, state: enabled
*Logs:* *(Attached)*
With regards, Swogat Pradhan
On Sun, Feb 26, 2023 at 2:34 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
> Hi, > Please find the nova conductor as well as nova api log. > > nova-conuctor: > > 2023-02-26 08:45:01.108 31 WARNING oslo_messaging._drivers.amqpdriver > [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] > reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to > 16152921c1eb45c2b1f562087140168b > 2023-02-26 08:45:02.144 26 WARNING oslo_messaging._drivers.amqpdriver > [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] > reply_276049ec36a84486a8a406911d9802f4 doesn't exist, drop reply to > 83dbe5f567a940b698acfe986f6194fa > 2023-02-26 08:45:02.314 32 WARNING oslo_messaging._drivers.amqpdriver > [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] > reply_276049ec36a84486a8a406911d9802f4 doesn't exist, drop reply to > f3bfd7f65bd542b18d84cea3033abb43: > oslo_messaging.exceptions.MessageUndeliverable > 2023-02-26 08:45:02.316 32 ERROR oslo_messaging._drivers.amqpdriver > [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] The reply > f3bfd7f65bd542b18d84cea3033abb43 failed to send after 60 seconds due to a > missing queue (reply_276049ec36a84486a8a406911d9802f4). Abandoning...: > oslo_messaging.exceptions.MessageUndeliverable > 2023-02-26 08:48:01.282 35 WARNING oslo_messaging._drivers.amqpdriver > [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] > reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to > d4b9180f91a94f9a82c3c9c4b7595566: > oslo_messaging.exceptions.MessageUndeliverable > 2023-02-26 08:48:01.284 35 ERROR oslo_messaging._drivers.amqpdriver > [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply > d4b9180f91a94f9a82c3c9c4b7595566 failed to send after 60 seconds due to a > missing queue (reply_349bcb075f8c49329435a0f884b33066). Abandoning...: > oslo_messaging.exceptions.MessageUndeliverable > 2023-02-26 08:49:01.303 33 WARNING oslo_messaging._drivers.amqpdriver > [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] > reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to > 897911a234a445d8a0d8af02ece40f6f: > oslo_messaging.exceptions.MessageUndeliverable > 2023-02-26 08:49:01.304 33 ERROR oslo_messaging._drivers.amqpdriver > [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply > 897911a234a445d8a0d8af02ece40f6f failed to send after 60 seconds due to a > missing queue (reply_349bcb075f8c49329435a0f884b33066). Abandoning...: > oslo_messaging.exceptions.MessageUndeliverable > 2023-02-26 08:49:52.254 31 WARNING nova.cache_utils > [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db > 4160ce999a31485fa643aed0936dfef0 - default default] Cache enabled with > backend dogpile.cache.null. > 2023-02-26 08:50:01.264 27 WARNING oslo_messaging._drivers.amqpdriver > [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] > reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to > 8f723ceb10c3472db9a9f324861df2bb: > oslo_messaging.exceptions.MessageUndeliverable > 2023-02-26 08:50:01.266 27 ERROR oslo_messaging._drivers.amqpdriver > [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply > 8f723ceb10c3472db9a9f324861df2bb failed to send after 60 seconds due to a > missing queue (reply_349bcb075f8c49329435a0f884b33066). Abandoning...: > oslo_messaging.exceptions.MessageUndeliverable > > With regards, > Swogat Pradhan > > On Sun, Feb 26, 2023 at 2:26 PM Swogat Pradhan < > swogatpradhan22@gmail.com> wrote: > >> Hi, >> I currently have 3 compute nodes on edge site1 where i am trying to >> launch vm's. 
>> When the VM is in spawning state the node goes down (openstack compute >> service list), the node comes backup when i restart the nova compute >> service but then the launch of the vm fails. >> >> nova-compute.log >> >> 2023-02-26 08:15:51.808 7 INFO nova.compute.manager >> [req-bc0f5f2e-53fc-4dae-b1da-82f1f972d617 - - - - -] Running >> instance usage >> audit for host dcn01-hci-0.bdxworld.com from 2023-02-26 07:00:00 to >> 2023-02-26 08:00:00. 0 instances. >> 2023-02-26 08:49:52.813 7 INFO nova.compute.claims >> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >> b240e3e89d99489284cd731e75f2a5db >> 4160ce999a31485fa643aed0936dfef0 - default default] [instance: >> 0c62c1ef-9010-417d-a05f-4db77e901600] Claim successful on node >> dcn01-hci-0.bdxworld.com >> 2023-02-26 08:49:54.225 7 INFO nova.virt.libvirt.driver >> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >> b240e3e89d99489284cd731e75f2a5db >> 4160ce999a31485fa643aed0936dfef0 - default default] [instance: >> 0c62c1ef-9010-417d-a05f-4db77e901600] Ignoring supplied device name: >> /dev/vda. Libvirt can't honour user-supplied dev names >> 2023-02-26 08:49:54.398 7 INFO nova.virt.block_device >> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >> b240e3e89d99489284cd731e75f2a5db >> 4160ce999a31485fa643aed0936dfef0 - default default] [instance: >> 0c62c1ef-9010-417d-a05f-4db77e901600] Booting with volume >> c4bd7885-5973-4860-bbe6-7a2f726baeee at /dev/vda >> 2023-02-26 08:49:55.216 7 WARNING nova.cache_utils >> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >> b240e3e89d99489284cd731e75f2a5db >> 4160ce999a31485fa643aed0936dfef0 - default default] Cache enabled with >> backend dogpile.cache.null. >> 2023-02-26 08:49:55.283 7 INFO oslo.privsep.daemon >> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >> b240e3e89d99489284cd731e75f2a5db >> 4160ce999a31485fa643aed0936dfef0 - default default] Running >> privsep helper: >> ['sudo', 'nova-rootwrap', '/etc/nova/rootwrap.conf', 'privsep-helper', >> '--config-file', '/etc/nova/nova.conf', '--config-file', >> '/etc/nova/nova-compute.conf', '--privsep_context', >> 'os_brick.privileged.default', '--privsep_sock_path', >> '/tmp/tmpin40tah6/privsep.sock'] >> 2023-02-26 08:49:55.791 7 INFO oslo.privsep.daemon >> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >> b240e3e89d99489284cd731e75f2a5db >> 4160ce999a31485fa643aed0936dfef0 - default default] Spawned new
>> privsep
>> daemon via rootwrap >> 2023-02-26 08:49:55.717 2647 INFO oslo.privsep.daemon [-] privsep >> daemon starting >> 2023-02-26 08:49:55.722 2647 INFO oslo.privsep.daemon [-] privsep >> process running with uid/gid: 0/0 >> 2023-02-26 08:49:55.726 2647 INFO oslo.privsep.daemon [-] privsep >> process running with capabilities (eff/prm/inh): >> CAP_SYS_ADMIN/CAP_SYS_ADMIN/none >> 2023-02-26 08:49:55.726 2647 INFO oslo.privsep.daemon [-] privsep >> daemon running as pid 2647 >> 2023-02-26 08:49:55.956 7 WARNING os_brick.initiator.connectors.nvmeof >> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >> b240e3e89d99489284cd731e75f2a5db >> 4160ce999a31485fa643aed0936dfef0 - default default] Process >> execution error >> in _get_host_uuid: Unexpected error while running command. >> Command: blkid overlay -s UUID -o value >> Exit code: 2 >> Stdout: '' >> Stderr: '': oslo_concurrency.processutils.ProcessExecutionError: >> Unexpected error while running command. >> 2023-02-26 08:49:58.247 7 INFO nova.virt.libvirt.driver >> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >> b240e3e89d99489284cd731e75f2a5db >> 4160ce999a31485fa643aed0936dfef0 - default default] [instance: >> 0c62c1ef-9010-417d-a05f-4db77e901600] Creating image >> >> Is there a way to solve this issue? >> >> >> With regards, >> >> Swogat Pradhan >> >
On Wed, Mar 15, 2023 at 8:05 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Update: After restarting the nova services on the controller and running the deploy script on the edge site, I was able to launch the VM from volume.
Right now the instance creation is failing as the block device creation is stuck in creating state, it is taking more than 10 mins for the volume to be created, whereas the image has already been imported to the edge glance.
Try following this document and making the same observations in your environment for AZs and their local ceph cluster.
https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features...
On a DCN site, if you run commands like these:
$ sudo cephadm shell --config /etc/ceph/dcn0.conf --keyring /etc/ceph/dcn0.client.admin.keyring
$ rbd --cluster dcn0 -p volumes ls -l
NAME                                          SIZE   PARENT                                             FMT  PROT  LOCK
volume-28c6fc32-047b-4306-ad2d-de2be02716b7   8 GiB  images/8083c7e7-32d8-4f7a-b1da-0ed7884f1076@snap   2          excl
$
then you should see that the parent of the volume is the image, which is on the same local ceph cluster. I wonder if something is misconfigured and thus you're encountering the streaming behavior described here:
Ideally all images should reside in the central Glance and be copied to DCN sites before instances of those images are booted on DCN sites. If an image is not copied to a DCN site before it is booted, then the image will be streamed to the DCN site and then the image will boot as an instance. This happens because Glance at the DCN site has access to the image store at the central ceph cluster. Though the booting of the image will take time because it has not been copied in advance, this is still preferable to failing to boot the image.
You can also exec into the cinder container at the DCN site and confirm it's using its local ceph cluster.
John
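As a concrete (but assumed) example of that last check on one of the DCN HCI nodes, using the usual TripleO container name and RBD driver options:
$ sudo podman exec cinder_volume grep -E 'rbd_ceph_conf|rbd_cluster_name|rbd_pool|rbd_user' /etc/cinder/cinder.conf
# the backend section (e.g. [tripleo_ceph]) should point at the DCN site's own cluster, e.g. /etc/ceph/dcn0.conf, not the central one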
I will try and create a new fresh image and test again then update.
With regards, Swogat Pradhan
On Wed, Mar 15, 2023 at 11:13 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Update: In the hypervisor list the compute node state is showing down.
On Wed, Mar 15, 2023 at 11:11 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi Brendan, Now i have deployed another site where i have used 2 linux bonds network template for both 3 compute nodes and 3 ceph nodes. The bonding options is set to mode=802.3ad (lacp=active). I used a cirros image to launch instance but the instance timed out so i waited for the volume to be created. Once the volume was created i tried launching the instance from the volume and still the instance is stuck in spawning state.
Here is the nova-compute log:
2023-03-15 17:35:47.739 185437 INFO oslo.privsep.daemon [-] privsep daemon starting 2023-03-15 17:35:47.744 185437 INFO oslo.privsep.daemon [-] privsep process running with uid/gid: 0/0 2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon [-] privsep process running with capabilities (eff/prm/inh): CAP_SYS_ADMIN/CAP_SYS_ADMIN/none 2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon [-] privsep daemon running as pid 185437 2023-03-15 17:35:47.974 8 WARNING os_brick.initiator.connectors.nvmeof [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Process execution error in _get_host_uuid: Unexpected error while running command. Command: blkid overlay -s UUID -o value Exit code: 2 Stdout: '' Stderr: '': oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command. 2023-03-15 17:35:51.616 8 INFO nova.virt.libvirt.driver [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] [instance: 450b749c-a10a-4308-80a9-3b8020fee758] Creating image
It is stuck in creating image, do i need to run the template mentioned here ?: https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep...
The volume is already created and i do not understand why the instance is stuck in spawning state.
With regards, Swogat Pradhan
On Sun, Mar 5, 2023 at 4:02 PM Brendan Shephard <bshephar@redhat.com> wrote:
Does your environment use different network interfaces for each of the networks? Or does it have a bond with everything on it?
One issue I have seen before is that when launching instances, there is a lot of network traffic between nodes as the hypervisor needs to download the image from Glance. Along with various other services sending normal network traffic, it can be enough to cause issues if everything is running over a single 1Gbe interface.
I have seen the same situation in fact when using a single active/backup bond on 1Gbe nics. It’s worth checking the network traffic while you try to spawn the instance to see if you’re dropping packets. In the situation I described, there were dropped packets which resulted in a loss of communication between nova_compute and RMQ, so the node appeared offline. You should also confirm that nova_compute is being disconnected in the nova_compute logs if you tail them on the Hypervisor while spawning the instance.
In my case, changing from active/backup to LACP helped. So, based on that experience, from my perspective, is certainly sounds like some kind of network issue.
Regards,
Brendan Shephard Senior Software Engineer Red Hat Australia
On 5 Mar 2023, at 6:47 am, Eugen Block <eblock@nde.ag> wrote:
Hi,
I tried to help someone with a similar issue some time ago in this thread: https://serverfault.com/questions/1116771/openstack-oslo-messaging-exception...
But apparently a neutron reinstallation fixed it for that user, not sure if that could apply here. But is it possible that your nova and neutron versions are different between the central and the edge site? Have you restarted the nova and neutron services on the compute nodes after installation? Do you have debug logs of nova-conductor and maybe nova-compute? Maybe they can help narrow down the issue. If there isn't any additional information in the debug logs I would probably start "tearing down" rabbitmq. I haven't had to do that in a production system yet, so be careful. I can think of two routes:
- Either remove queues, exchanges etc. while rabbit is running; this will most likely impact client IO depending on your load. Check out the rabbitmqctl commands.
- Or stop the rabbitmq cluster, remove the mnesia tables from all nodes and restart rabbitmq so the exchanges, queues etc. rebuild.
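A very rough sketch of the two routes (example commands only; double-check them against your RabbitMQ and pacemaker versions before touching a production system):

# Route 1: while rabbit is running, drop the stale reply queue named in the conductor errors
$ rabbitmqctl delete_queue reply_349bcb075f8c49329435a0f884b33066   # requires a rabbitmqctl version that ships delete_queue
# Route 2: rebuild from scratch (pacemaker-managed bundle; the mnesia path is bind-mounted from the host and may differ)
$ pcs resource disable rabbitmq-bundle
$ rm -rf /var/lib/rabbitmq/mnesia/*    # on every controller
$ pcs resource enable rabbitmq-bundle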
I can imagine that the failed reply "survives" while being replicated across the rabbit nodes. But I don't really know the rabbit internals too well, so maybe someone else can chime in here and give better advice.
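If you want to see whether such a reply queue is actually mirrored under the ha-all policy, and to which nodes, something like this should show it:

$ rabbitmqctl list_queues -p / name policy slave_pids synchronised_slave_pids | grep reply_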
Regards, Eugen
Zitat von Swogat Pradhan <swogatpradhan22@gmail.com>:
Hi, Can someone please help me out on this issue?
With regards, Swogat Pradhan
On Thu, Mar 2, 2023 at 1:24 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi, I don't see any major packet loss. It seems the problem is somewhere in rabbitmq, maybe, but not due to packet loss.
with regards, Swogat Pradhan
On Wed, Mar 1, 2023 at 3:34 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi, Yes, the MTU is the same as the default '1500'. Generally I haven't seen any packet loss, but I never checked when launching an instance. I will check that and come back. But every time I launch an instance, the instance gets stuck in the spawning state and then the hypervisor goes down, so I am not sure if packet loss causes this.
With regards, Swogat pradhan
On Wed, Mar 1, 2023 at 3:30 PM Eugen Block <eblock@nde.ag> wrote:
One more thing coming to mind is MTU size. Are they identical between central and edge site? Do you see packet loss through the tunnel?
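A quick way to test both from an edge compute node towards the central controllers (the host name is a placeholder):

$ ping -M do -s 1472 -c 20 overcloud-controller-0.internalapi.bdxworld.com    # 1472 bytes payload + 28 bytes of headers = 1500; failures mean the path MTU is smaller
$ ping -c 500 -i 0.2 overcloud-controller-0.internalapi.bdxworld.com | tail -2   # check the packet loss summary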
Zitat von Swogat Pradhan <swogatpradhan22@gmail.com>:
Hi Eugen, Request you to please add my email either on 'to' or 'cc' as I am not getting emails from you. Coming to the issue:
[root@overcloud-controller-no-ceph-3 /]# rabbitmqctl list_policies -p /
Listing policies for vhost "/" ...
vhost  name    pattern       apply-to  definition                                                              priority
/      ha-all  ^(?!amq\.).*  queues    {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"}  0
I have the edge site compute nodes up; it only goes down when I am trying to launch an instance and the instance comes to a spawning state and then gets stuck.
I have a tunnel setup between the central and the edge sites.
With regards, Swogat Pradhan
On Tue, Feb 28, 2023 at 9:11 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
Hi Eugen, For some reason i am not getting your email to me directly, i am checking the email digest and there i am able to find your reply. Here is the log for download: https://we.tl/t-L8FEkGZFSq Yes, these logs are from the time when the issue occurred.
*Note: i am able to create vm's and perform other activities in the central site, only facing this issue in the edge site.*
With regards, Swogat Pradhan
On Mon, Feb 27, 2023 at 5:12 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
> Hi Eugen, > Thanks for your response. > I have actually a 4 controller setup so here are the details: > > *PCS Status:* > * Container bundle set: rabbitmq-bundle [ > 172.25.201.68:8787/tripleomaster/openstack-rabbitmq:pcmklatest]: > * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started > overcloud-controller-no-ceph-3 > * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started > overcloud-controller-2 > * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started > overcloud-controller-1 > * rabbitmq-bundle-3 (ocf::heartbeat:rabbitmq-cluster): Started > overcloud-controller-0 > > I have tried restarting the bundle multiple times but the issue is still > present. > > *Cluster status:* > [root@overcloud-controller-0 /]# rabbitmqctl cluster_status > Cluster status of node > rabbit@overcloud-controller-0.internalapi.bdxworld.com ... > Basics > > Cluster name: rabbit@overcloud-controller-no-ceph-3.bdxworld.com > > Disk Nodes > > rabbit@overcloud-controller-0.internalapi.bdxworld.com > rabbit@overcloud-controller-1.internalapi.bdxworld.com > rabbit@overcloud-controller-2.internalapi.bdxworld.com > rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com > > Running Nodes > > rabbit@overcloud-controller-0.internalapi.bdxworld.com > rabbit@overcloud-controller-1.internalapi.bdxworld.com > rabbit@overcloud-controller-2.internalapi.bdxworld.com > rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com > > Versions > > rabbit@overcloud-controller-0.internalapi.bdxworld.com: RabbitMQ 3.8.3 > on Erlang 22.3.4.1 > rabbit@overcloud-controller-1.internalapi.bdxworld.com: RabbitMQ 3.8.3 > on Erlang 22.3.4.1 > rabbit@overcloud-controller-2.internalapi.bdxworld.com: RabbitMQ 3.8.3 > on Erlang 22.3.4.1 > rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com: RabbitMQ > 3.8.3 on Erlang 22.3.4.1 > > Alarms > > (none) > > Network Partitions > > (none) > > Listeners > > Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, interface: > [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool > communication > Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, interface: > 172.25.201.212, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 > and AMQP 1.0 > Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, interface: > [::], port: 15672, protocol: http, purpose: HTTP API > Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, interface: > [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool > communication > Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, interface: > 172.25.201.205, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 > and AMQP 1.0 > Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, interface: > [::], port: 15672, protocol: http, purpose: HTTP API > Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, interface: > [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool > communication > Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, interface: > 172.25.201.201, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 > and AMQP 1.0 > Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, interface: > [::], port: 15672, protocol: http, purpose: HTTP API > Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com , > interface: [::], port: 25672, protocol: clustering, purpose: inter-node and > CLI tool communication > Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com , > interface: 172.25.201.209, 
port: 5672, protocol: amqp, purpose: AMQP 0-9-1 > and AMQP 1.0 > Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com , > interface: [::], port: 15672, protocol: http, purpose: HTTP API > > Feature flags > > Flag: drop_unroutable_metric, state: enabled > Flag: empty_basic_get_metric, state: enabled > Flag: implicit_default_bindings, state: enabled > Flag: quorum_queue, state: enabled > Flag: virtual_host_metadata, state: enabled > > *Logs:* > *(Attached)* > > With regards, > Swogat Pradhan > > On Sun, Feb 26, 2023 at 2:34 PM Swogat Pradhan < swogatpradhan22@gmail.com> > wrote: > >> Hi, >> Please find the nova conductor as well as nova api log. >> >> nova-conuctor: >> >> 2023-02-26 08:45:01.108 31 WARNING oslo_messaging._drivers.amqpdriver >> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] >> reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to >> 16152921c1eb45c2b1f562087140168b >> 2023-02-26 08:45:02.144 26 WARNING oslo_messaging._drivers.amqpdriver >> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] >> reply_276049ec36a84486a8a406911d9802f4 doesn't exist, drop reply to >> 83dbe5f567a940b698acfe986f6194fa >> 2023-02-26 08:45:02.314 32 WARNING oslo_messaging._drivers.amqpdriver >> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] >> reply_276049ec36a84486a8a406911d9802f4 doesn't exist, drop reply to >> f3bfd7f65bd542b18d84cea3033abb43: >> oslo_messaging.exceptions.MessageUndeliverable >> 2023-02-26 08:45:02.316 32 ERROR oslo_messaging._drivers.amqpdriver >> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] The reply >> f3bfd7f65bd542b18d84cea3033abb43 failed to send after 60 seconds due to a >> missing queue (reply_276049ec36a84486a8a406911d9802f4). Abandoning...: >> oslo_messaging.exceptions.MessageUndeliverable >> 2023-02-26 08:48:01.282 35 WARNING oslo_messaging._drivers.amqpdriver >> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] >> reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to >> d4b9180f91a94f9a82c3c9c4b7595566: >> oslo_messaging.exceptions.MessageUndeliverable >> 2023-02-26 08:48:01.284 35 ERROR oslo_messaging._drivers.amqpdriver >> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply >> d4b9180f91a94f9a82c3c9c4b7595566 failed to send after 60 seconds due to a >> missing queue (reply_349bcb075f8c49329435a0f884b33066). Abandoning...: >> oslo_messaging.exceptions.MessageUndeliverable >> 2023-02-26 08:49:01.303 33 WARNING oslo_messaging._drivers.amqpdriver >> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] >> reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to >> 897911a234a445d8a0d8af02ece40f6f: >> oslo_messaging.exceptions.MessageUndeliverable >> 2023-02-26 08:49:01.304 33 ERROR oslo_messaging._drivers.amqpdriver >> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply >> 897911a234a445d8a0d8af02ece40f6f failed to send after 60 seconds due to a >> missing queue (reply_349bcb075f8c49329435a0f884b33066). Abandoning...: >> oslo_messaging.exceptions.MessageUndeliverable >> 2023-02-26 08:49:52.254 31 WARNING nova.cache_utils >> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db >> 4160ce999a31485fa643aed0936dfef0 - default default] Cache enabled with >> backend dogpile.cache.null. 
>> 2023-02-26 08:50:01.264 27 WARNING oslo_messaging._drivers.amqpdriver >> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] >> reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to >> 8f723ceb10c3472db9a9f324861df2bb: >> oslo_messaging.exceptions.MessageUndeliverable >> 2023-02-26 08:50:01.266 27 ERROR oslo_messaging._drivers.amqpdriver >> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply >> 8f723ceb10c3472db9a9f324861df2bb failed to send after 60 seconds due to a >> missing queue (reply_349bcb075f8c49329435a0f884b33066). Abandoning...: >> oslo_messaging.exceptions.MessageUndeliverable >> >> With regards, >> Swogat Pradhan >> >> On Sun, Feb 26, 2023 at 2:26 PM Swogat Pradhan < >> swogatpradhan22@gmail.com> wrote: >> >>> Hi, >>> I currently have 3 compute nodes on edge site1 where i am trying to >>> launch vm's. >>> When the VM is in spawning state the node goes down (openstack compute >>> service list), the node comes backup when i restart the nova compute >>> service but then the launch of the vm fails. >>> >>> nova-compute.log >>> >>> 2023-02-26 08:15:51.808 7 INFO nova.compute.manager >>> [req-bc0f5f2e-53fc-4dae-b1da-82f1f972d617 - - - - -] Running >>> instance usage >>> audit for host dcn01-hci-0.bdxworld.com from 2023-02-26 07:00:00 to >>> 2023-02-26 08:00:00. 0 instances. >>> 2023-02-26 08:49:52.813 7 INFO nova.compute.claims >>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>> b240e3e89d99489284cd731e75f2a5db >>> 4160ce999a31485fa643aed0936dfef0 - default default] [instance: >>> 0c62c1ef-9010-417d-a05f-4db77e901600] Claim successful on node >>> dcn01-hci-0.bdxworld.com >>> 2023-02-26 08:49:54.225 7 INFO nova.virt.libvirt.driver >>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>> b240e3e89d99489284cd731e75f2a5db >>> 4160ce999a31485fa643aed0936dfef0 - default default] [instance: >>> 0c62c1ef-9010-417d-a05f-4db77e901600] Ignoring supplied device name: >>> /dev/vda. Libvirt can't honour user-supplied dev names >>> 2023-02-26 08:49:54.398 7 INFO nova.virt.block_device >>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>> b240e3e89d99489284cd731e75f2a5db >>> 4160ce999a31485fa643aed0936dfef0 - default default] [instance: >>> 0c62c1ef-9010-417d-a05f-4db77e901600] Booting with volume >>> c4bd7885-5973-4860-bbe6-7a2f726baeee at /dev/vda >>> 2023-02-26 08:49:55.216 7 WARNING nova.cache_utils >>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>> b240e3e89d99489284cd731e75f2a5db >>> 4160ce999a31485fa643aed0936dfef0 - default default] Cache enabled with >>> backend dogpile.cache.null. >>> 2023-02-26 08:49:55.283 7 INFO oslo.privsep.daemon >>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>> b240e3e89d99489284cd731e75f2a5db >>> 4160ce999a31485fa643aed0936dfef0 - default default] Running >>> privsep helper: >>> ['sudo', 'nova-rootwrap', '/etc/nova/rootwrap.conf', 'privsep-helper', >>> '--config-file', '/etc/nova/nova.conf', '--config-file', >>> '/etc/nova/nova-compute.conf', '--privsep_context', >>> 'os_brick.privileged.default', '--privsep_sock_path', >>> '/tmp/tmpin40tah6/privsep.sock'] >>> 2023-02-26 08:49:55.791 7 INFO oslo.privsep.daemon >>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>> b240e3e89d99489284cd731e75f2a5db >>> 4160ce999a31485fa643aed0936dfef0 - default default] Spawned new
>>> privsep
>>> daemon via rootwrap >>> 2023-02-26 08:49:55.717 2647 INFO oslo.privsep.daemon [-] privsep >>> daemon starting >>> 2023-02-26 08:49:55.722 2647 INFO oslo.privsep.daemon [-] privsep >>> process running with uid/gid: 0/0 >>> 2023-02-26 08:49:55.726 2647 INFO oslo.privsep.daemon [-] privsep >>> process running with capabilities (eff/prm/inh): >>> CAP_SYS_ADMIN/CAP_SYS_ADMIN/none >>> 2023-02-26 08:49:55.726 2647 INFO oslo.privsep.daemon [-] privsep >>> daemon running as pid 2647 >>> 2023-02-26 08:49:55.956 7 WARNING os_brick.initiator.connectors.nvmeof >>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>> b240e3e89d99489284cd731e75f2a5db >>> 4160ce999a31485fa643aed0936dfef0 - default default] Process >>> execution error >>> in _get_host_uuid: Unexpected error while running command. >>> Command: blkid overlay -s UUID -o value >>> Exit code: 2 >>> Stdout: '' >>> Stderr: '': oslo_concurrency.processutils.ProcessExecutionError: >>> Unexpected error while running command. >>> 2023-02-26 08:49:58.247 7 INFO nova.virt.libvirt.driver >>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>> b240e3e89d99489284cd731e75f2a5db >>> 4160ce999a31485fa643aed0936dfef0 - default default] [instance: >>> 0c62c1ef-9010-417d-a05f-4db77e901600] Creating image >>> >>> Is there a way to solve this issue? >>> >>> >>> With regards, >>> >>> Swogat Pradhan >>> >>
Hi John,
I checked the ceph on dcn02; I can see the images created after importing from the central site. But launching an instance normally fails, as it takes a long time for the volume to get created.
When launching an instance from a volume, the instance is created properly without any errors.
I tried to cache images in nova using https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep... but I am getting a checksum failed error.
With regards, Swogat Pradhan
On Thu, Mar 16, 2023 at 5:24 PM John Fulton <johfulto@redhat.com> wrote:
On Wed, Mar 15, 2023 at 8:05 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Update: After restarting the nova services on the controller and running
the deploy script on the edge site, I was able to launch the VM from volume.
Right now the instance creation is failing as the block device creation
is stuck in creating state, it is taking more than 10 mins for the volume to be created, whereas the image has already been imported to the edge glance.
Try following this document and making the same observations in your environment for AZs and their local ceph cluster.
https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features...
On a DCN site if you run a command like this:
$ sudo cephadm shell --config /etc/ceph/dcn0.conf --keyring /etc/ceph/dcn0.client.admin.keyring $ rbd --cluster dcn0 -p volumes ls -l NAME SIZE PARENT FMT PROT LOCK volume-28c6fc32-047b-4306-ad2d-de2be02716b7 8 GiB images/8083c7e7-32d8-4f7a-b1da-0ed7884f1076@snap 2 excl $
Then, you should see the parent of the volume is the image which is on the same local ceph cluster.
I wonder if something is misconfigured and thus you're encountering the streaming behavior described here:
Ideally all images should reside in the central Glance and be copied to DCN sites before instances of those images are booted on DCN sites. If an image is not copied to a DCN site before it is booted, then the image will be streamed to the DCN site and then the image will boot as an instance. This happens because Glance at the DCN site has access to the images store at the Central ceph cluster. Though the booting of the image will take time because it has not been copied in advance, this is still preferable to failing to boot the image.
You can also exec into the cinder container at the DCN site and confirm it's using it's local ceph cluster.
John
Update: I uploaded an image directly to the dcn02 store, and it takes around 10-15 minutes to create a volume from that image in dcn02. The image size is 389 MB.
On Mon, Mar 20, 2023 at 10:26 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi Jhon, I checked in the ceph od dcn02, I can see the images created after importing from the central site. But launching an instance normally fails as it takes a long time for the volume to get created.
When launching an instance from volume the instance is getting created properly without any errors.
I tried to cache images in nova using https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep... but getting checksum failed error.
With regards, Swogat Pradhan
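Two quick checks that would narrow this down (the image and volume IDs are placeholders, and the stores field assumes a multi-store aware glance client):

$ glance image-show <image-id> | grep stores                               # should list dcn02, not only the central store
$ openstack volume service list                                            # the dcn02 cinder-volume backend should be up
$ openstack volume show <volume-id> -c status -c os-vol-host-attr:host     # shows which backend is building the volume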
On Thu, Mar 16, 2023 at 5:24 PM John Fulton <johfulto@redhat.com> wrote:
On Wed, Mar 15, 2023 at 8:05 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Update: After restarting the nova services on the controller and
running the deploy script on the edge site, I was able to launch the VM from volume.
Right now the instance creation is failing as the block device creation
is stuck in creating state, it is taking more than 10 mins for the volume to be created, whereas the image has already been imported to the edge glance.
Try following this document and making the same observations in your environment for AZs and their local ceph cluster.
https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features...
On a DCN site if you run a command like this:
$ sudo cephadm shell --config /etc/ceph/dcn0.conf --keyring /etc/ceph/dcn0.client.admin.keyring $ rbd --cluster dcn0 -p volumes ls -l NAME SIZE PARENT FMT PROT LOCK volume-28c6fc32-047b-4306-ad2d-de2be02716b7 8 GiB images/8083c7e7-32d8-4f7a-b1da-0ed7884f1076@snap 2 excl $
Then, you should see the parent of the volume is the image which is on the same local ceph cluster.
I wonder if something is misconfigured and thus you're encountering the streaming behavior described here:
Ideally all images should reside in the central Glance and be copied to DCN sites before instances of those images are booted on DCN sites. If an image is not copied to a DCN site before it is booted, then the image will be streamed to the DCN site and then the image will boot as an instance. This happens because Glance at the DCN site has access to the images store at the Central ceph cluster. Though the booting of the image will take time because it has not been copied in advance, this is still preferable to failing to boot the image.
You can also exec into the cinder container at the DCN site and confirm it's using it's local ceph cluster.
John
in my last message under the line "On a DCN site if you run a command like this:" I suggested some steps you could try to confirm the image is a COW from the local glance as well as how to look at your cinder config. On Tue, Mar 21, 2023, 12:06 AM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Update: I uploaded an image directly to the dcn02 store, and it takes around 10-15 minutes to create a volume from the image in dcn02. The image size is 389 MB.
On Mon, Mar 20, 2023 at 10:26 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi John, I checked the ceph on dcn02, and I can see the images created after importing from the central site. But launching an instance normally fails, as it takes a long time for the volume to get created.
When launching an instance from volume the instance is getting created properly without any errors.
I tried to cache images in nova using https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep... but I am getting a checksum failed error.
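One quick way to narrow down a checksum error like that (a sketch only; the image ID and the temporary file path are placeholders, and newer images may expose os_hash_algo/os_hash_value instead of the md5 checksum) is to compare what Glance has recorded against a hash computed locally:

# Checksum Glance recorded for the image
openstack image show <image-id> -c checksum -c size

# Re-download the image and hash it locally for comparison
openstack image save --file /tmp/check.qcow2 <image-id>
md5sum /tmp/check.qcow2

If the two values differ, the copy in the dcn02 store is probably corrupt or was modified in transit, which could also explain the pre-caching failure.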
With regards, Swogat Pradhan
On Thu, Mar 16, 2023 at 5:24 PM John Fulton <johfulto@redhat.com> wrote:
On Wed, Mar 15, 2023 at 8:05 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Update: After restarting the nova services on the controller and
running the deploy script on the edge site, I was able to launch the VM from volume.
Right now the instance creation is failing because the block device creation is stuck in the creating state; it is taking more than 10 minutes for the volume to be created, even though the image has already been imported to the edge glance.
Try following this document and making the same observations in your environment for AZs and their local ceph cluster.
https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features...
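If it helps, one way to make those observations from the CLI (a sketch; "dcn02" is assumed as the edge availability zone name, adjust to whatever your deployment actually uses) is:

# Volume and compute availability zones the edge site should expose
openstack availability zone list --volume
openstack availability zone list --compute

# Cinder backends and the AZ each one serves
openstack volume service list

# Create a test volume from the image, pinned to the edge AZ
openstack volume create --availability-zone dcn02 --image <image-id> --size 10 test-dcn02-vol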
On a DCN site if you run a command like this:
$ sudo cephadm shell --config /etc/ceph/dcn0.conf --keyring /etc/ceph/dcn0.client.admin.keyring
$ rbd --cluster dcn0 -p volumes ls -l
NAME                                          SIZE   PARENT                                             FMT  PROT  LOCK
volume-28c6fc32-047b-4306-ad2d-de2be02716b7   8 GiB  images/8083c7e7-32d8-4f7a-b1da-0ed7884f1076@snap   2          excl
$
Then, you should see the parent of the volume is the image which is on the same local ceph cluster.
I wonder if something is misconfigured and thus you're encountering the streaming behavior described here:
Ideally all images should reside in the central Glance and be copied to DCN sites before instances of those images are booted on DCN sites. If an image is not copied to a DCN site before it is booted, then the image will be streamed to the DCN site and then the image will boot as an instance. This happens because Glance at the DCN site has access to the images store at the Central ceph cluster. Though the booting of the image will take time because it has not been copied in advance, this is still preferable to failing to boot the image.
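To rule that out, the image can be copied to the dcn02 store ahead of time using the glance multi-store import (a sketch; <image-id> is a placeholder and the store names depend on how your deployment defined them):

# Which stores are available at this site
glance stores-info

# Which stores already hold the image (look at the 'stores' field)
glance image-show <image-id>

# Copy an existing image from the central store to the dcn02 store
glance image-import <image-id> --stores dcn02 --import-method copy-image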
You can also exec into the cinder container at the DCN site and confirm it's using its local ceph cluster.
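For example (a sketch; the container name and the DCN ceph conf file name on your nodes may differ):

# On a dcn02 node running cinder-volume
sudo podman ps --format '{{.Names}}' | grep -i cinder
sudo podman exec -it cinder_volume grep -A 8 '^\[tripleo_ceph\]' /etc/cinder/cinder.conf

# rbd_ceph_conf should point at the local dcn02 cluster's conf file, not the central ceph.conf,
# and rbd_secret_uuid should match that local cluster's FSID.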
John
I will try and create a new fresh image and test again then update.
With regards, Swogat Pradhan
On Wed, Mar 15, 2023 at 11:13 PM Swogat Pradhan <
Update: In the hypervisor list the compute node state is showing down.
On Wed, Mar 15, 2023 at 11:11 PM Swogat Pradhan <
swogatpradhan22@gmail.com> wrote:
Hi Brendan, I have now deployed another site where I used a network template with 2 Linux bonds for both the 3 compute nodes and the 3 ceph nodes.
The bonding option is set to mode=802.3ad (lacp=active). I used a cirros image to launch an instance, but the instance timed out, so I waited for the volume to be created. Once the volume was created I tried launching the instance from the volume, and the instance is still stuck in the spawning state.
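To confirm the 802.3ad bond actually negotiated LACP with the switch and is not dropping traffic while the instance spawns, something like the following on the compute node can help (a sketch; the bond name "bond1" is an assumption, use whatever your nic template created):

# Verify "Bonding Mode: IEEE 802.3ad Dynamic link aggregation", matching aggregator IDs,
# and a non-zero Partner Mac Address for each member
cat /proc/net/bonding/bond1

# Watch for RX/TX errors or drops on the bond while the instance is spawning
ip -s link show bond1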
Here is the nova-compute log:
2023-03-15 17:35:47.739 185437 INFO oslo.privsep.daemon [-] privsep daemon starting
2023-03-15 17:35:47.744 185437 INFO oslo.privsep.daemon [-] privsep process running with uid/gid: 0/0
2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon [-] privsep process running with capabilities (eff/prm/inh): CAP_SYS_ADMIN/CAP_SYS_ADMIN/none
2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon [-] privsep daemon running as pid 185437
2023-03-15 17:35:47.974 8 WARNING os_brick.initiator.connectors.nvmeof [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Process execution error in _get_host_uuid: Unexpected error while running command.
Command: blkid overlay -s UUID -o value
Exit code: 2
Stdout: ''
Stderr: '': oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
2023-03-15 17:35:51.616 8 INFO nova.virt.libvirt.driver [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] [instance: 450b749c-a10a-4308-80a9-3b8020fee758] Creating image
It is stuck at "Creating image"; do I need to run the template mentioned here?: https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep...
The volume is already created and I do not understand why the instance is stuck in the spawning state.
With regards, Swogat Pradhan
On Sun, Mar 5, 2023 at 4:02 PM Brendan Shephard <bshephar@redhat.com> wrote: > > Does your environment use different network interfaces for each of
> > One issue I have seen before is that when launching instances,
> > I have seen the same situation in fact when using a single active/backup bond on 1Gbe nics. It’s worth checking the network traffic while you try to spawn the instance to see if you’re dropping packets. In
> > In my case, changing from active/backup to LACP helped. So, based on that experience, from my perspective, is certainly sounds like some kind of network issue. > > Regards, > > Brendan Shephard > Senior Software Engineer > Red Hat Australia > > > > On 5 Mar 2023, at 6:47 am, Eugen Block <eblock@nde.ag> wrote: > > Hi, > > I tried to help someone with a similar issue some time ago in this
> https://serverfault.com/questions/1116771/openstack-oslo-messaging-exception... > > But apparently a neutron reinstallation fixed it for that user, not sure if that could apply here. But is it possible that your nova and neutron versions are different between central and edge site? Have you restarted nova and neutron services on the compute nodes after installation? Have you debug logs of nova-conductor and maybe nova-compute? Maybe they can help narrow down the issue. > If there isn't any additional information in the debug logs I
> > - Either remove queues, exchanges etc. while rabbit is running,
> - Or stop the rabbitmq cluster, remove the mnesia tables from all nodes and restart rabbitmq so the exchanges, queues etc. rebuild. > > I can imagine that the failed reply "survives" while being replicated across the rabbit nodes. But I don't really know the rabbit internals too well, so maybe someone else can chime in here and give a better advice. > > Regards, > Eugen > > Zitat von Swogat Pradhan <swogatpradhan22@gmail.com>: > > Hi, > Can someone please help me out on this issue? > > With regards, > Swogat Pradhan > > On Thu, Mar 2, 2023 at 1:24 PM Swogat Pradhan < swogatpradhan22@gmail.com> > wrote: > > Hi > I don't see any major packet loss. > It seems the problem is somewhere in rabbitmq maybe but not due to
> loss. > > with regards, > Swogat Pradhan > > On Wed, Mar 1, 2023 at 3:34 PM Swogat Pradhan < swogatpradhan22@gmail.com> > wrote: > > Hi, > Yes the MTU is the same as the default '1500'. > Generally I haven't seen any packet loss, but never checked when > launching the instance. > I will check that and come back. > But everytime i launch an instance the instance gets stuck at spawning > state and there the hypervisor becomes down, so not sure if packet loss > causes this. > > With regards, > Swogat pradhan > > On Wed, Mar 1, 2023 at 3:30 PM Eugen Block <eblock@nde.ag> wrote: > > One more thing coming to mind is MTU size. Are they identical between > central and edge site? Do you see packet loss through the tunnel? > > Zitat von Swogat Pradhan <swogatpradhan22@gmail.com>: > > > Hi Eugen, > > Request you to please add my email either on 'to' or 'cc' as i am not > > getting email's from you. > > Coming to the issue: > > > > [root@overcloud-controller-no-ceph-3 /]# rabbitmqctl
> / > > Listing policies for vhost "/" ... > > vhost name pattern apply-to definition priority > > / ha-all ^(?!amq\.).* queues > > > {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0 > > > > I have the edge site compute nodes up, it only goes down when i am > trying > > to launch an instance and the instance comes to a spawning state and > then > > gets stuck. > > > > I have a tunnel setup between the central and the edge sites. > > > > With regards, > > Swogat Pradhan > > > > On Tue, Feb 28, 2023 at 9:11 PM Swogat Pradhan < > swogatpradhan22@gmail.com> > > wrote: > > > >> Hi Eugen, > >> For some reason i am not getting your email to me directly, i am > checking > >> the email digest and there i am able to find your reply. > >> Here is the log for download: https://we.tl/t-L8FEkGZFSq > >> Yes, these logs are from the time when the issue occurred. > >> > >> *Note: i am able to create vm's and perform other activities in
> >> central site, only facing this issue in the edge site.* > >> > >> With regards, > >> Swogat Pradhan > >> > >> On Mon, Feb 27, 2023 at 5:12 PM Swogat Pradhan < > swogatpradhan22@gmail.com> > >> wrote: > >> > >>> Hi Eugen, > >>> Thanks for your response. > >>> I have actually a 4 controller setup so here are the details: > >>> > >>> *PCS Status:* > >>> * Container bundle set: rabbitmq-bundle [ > >>> 172.25.201.68:8787/tripleomaster/openstack-rabbitmq:pcmklatest ]: > >>> * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): > Started > >>> overcloud-controller-no-ceph-3 > >>> * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): > Started > >>> overcloud-controller-2 > >>> * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): > Started > >>> overcloud-controller-1 > >>> * rabbitmq-bundle-3 (ocf::heartbeat:rabbitmq-cluster): > Started > >>> overcloud-controller-0 > >>> > >>> I have tried restarting the bundle multiple times but the issue is > still > >>> present. > >>> > >>> *Cluster status:* > >>> [root@overcloud-controller-0 /]# rabbitmqctl cluster_status > >>> Cluster status of node > >>> rabbit@overcloud-controller-0.internalapi.bdxworld.com ... > >>> Basics > >>> > >>> Cluster name: rabbit@overcloud-controller-no-ceph-3.bdxworld.com > >>> > >>> Disk Nodes > >>> > >>> rabbit@overcloud-controller-0.internalapi.bdxworld.com > >>> rabbit@overcloud-controller-1.internalapi.bdxworld.com > >>> rabbit@overcloud-controller-2.internalapi.bdxworld.com > >>> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com > >>> > >>> Running Nodes > >>> > >>> rabbit@overcloud-controller-0.internalapi.bdxworld.com > >>> rabbit@overcloud-controller-1.internalapi.bdxworld.com > >>> rabbit@overcloud-controller-2.internalapi.bdxworld.com > >>> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com > >>> > >>> Versions > >>> > >>> rabbit@overcloud-controller-0.internalapi.bdxworld.com: RabbitMQ > 3.8.3 > >>> on Erlang 22.3.4.1 > >>> rabbit@overcloud-controller-1.internalapi.bdxworld.com: RabbitMQ > 3.8.3 > >>> on Erlang 22.3.4.1 > >>> rabbit@overcloud-controller-2.internalapi.bdxworld.com: RabbitMQ > 3.8.3 > >>> on Erlang 22.3.4.1 > >>> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com: > RabbitMQ > >>> 3.8.3 on Erlang 22.3.4.1 > >>> > >>> Alarms > >>> > >>> (none) > >>> > >>> Network Partitions > >>> > >>> (none) > >>> > >>> Listeners > >>> > >>> Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, > interface: > >>> [::], port: 25672, protocol: clustering, purpose: inter-node and CLI > tool > >>> communication > >>> Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, > interface: > >>> 172.25.201.212, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 > >>> and AMQP 1.0 > >>> Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, > interface: > >>> [::], port: 15672, protocol: http, purpose: HTTP API > >>> Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, > interface: > >>> [::], port: 25672, protocol: clustering, purpose: inter-node and CLI > tool > >>> communication > >>> Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, > interface: > >>> 172.25.201.205, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 > >>> and AMQP 1.0 > >>> Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, > interface: > >>> [::], port: 15672, protocol: http, purpose: HTTP API > >>> Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, > interface: > >>> [::], port: 25672, protocol: clustering, purpose: inter-node and CLI > tool > >>> 
communication > >>> Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, > interface: > >>> 172.25.201.201, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 > >>> and AMQP 1.0 > >>> Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, > interface: > >>> [::], port: 15672, protocol: http, purpose: HTTP API > >>> Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com > , > >>> interface: [::], port: 25672, protocol: clustering, purpose: > inter-node and > >>> CLI tool communication > >>> Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com > , > >>> interface: 172.25.201.209, port: 5672, protocol: amqp, purpose: AMQP > 0-9-1 > >>> and AMQP 1.0 > >>> Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com > , > >>> interface: [::], port: 15672, protocol: http, purpose: HTTP API > >>> > >>> Feature flags > >>> > >>> Flag: drop_unroutable_metric, state: enabled > >>> Flag: empty_basic_get_metric, state: enabled > >>> Flag: implicit_default_bindings, state: enabled > >>> Flag: quorum_queue, state: enabled > >>> Flag: virtual_host_metadata, state: enabled > >>> > >>> *Logs:* > >>> *(Attached)* > >>> > >>> With regards, > >>> Swogat Pradhan > >>> > >>> On Sun, Feb 26, 2023 at 2:34 PM Swogat Pradhan < > swogatpradhan22@gmail.com> > >>> wrote: > >>> > >>>> Hi, > >>>> Please find the nova conductor as well as nova api log. > >>>> > >>>> nova-conuctor: > >>>> > >>>> 2023-02-26 08:45:01.108 31 WARNING > oslo_messaging._drivers.amqpdriver > >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] > >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to > >>>> 16152921c1eb45c2b1f562087140168b > >>>> 2023-02-26 08:45:02.144 26 WARNING > oslo_messaging._drivers.amqpdriver > >>>> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] > >>>> reply_276049ec36a84486a8a406911d9802f4 doesn't exist, drop reply to > >>>> 83dbe5f567a940b698acfe986f6194fa > >>>> 2023-02-26 08:45:02.314 32 WARNING > oslo_messaging._drivers.amqpdriver > >>>> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] > >>>> reply_276049ec36a84486a8a406911d9802f4 doesn't exist, drop reply to > >>>> f3bfd7f65bd542b18d84cea3033abb43: > >>>> oslo_messaging.exceptions.MessageUndeliverable > >>>> 2023-02-26 08:45:02.316 32 ERROR oslo_messaging._drivers.amqpdriver > >>>> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] The reply > >>>> f3bfd7f65bd542b18d84cea3033abb43 failed to send after 60 seconds > due to a > >>>> missing queue (reply_276049ec36a84486a8a406911d9802f4). > Abandoning...: > >>>> oslo_messaging.exceptions.MessageUndeliverable > >>>> 2023-02-26 08:48:01.282 35 WARNING > oslo_messaging._drivers.amqpdriver > >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] > >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to > >>>> d4b9180f91a94f9a82c3c9c4b7595566: > >>>> oslo_messaging.exceptions.MessageUndeliverable > >>>> 2023-02-26 08:48:01.284 35 ERROR oslo_messaging._drivers.amqpdriver > >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply > >>>> d4b9180f91a94f9a82c3c9c4b7595566 failed to send after 60 seconds > due to a > >>>> missing queue (reply_349bcb075f8c49329435a0f884b33066). 
> Abandoning...: > >>>> oslo_messaging.exceptions.MessageUndeliverable > >>>> 2023-02-26 08:49:01.303 33 WARNING > oslo_messaging._drivers.amqpdriver > >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] > >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to > >>>> 897911a234a445d8a0d8af02ece40f6f: > >>>> oslo_messaging.exceptions.MessageUndeliverable > >>>> 2023-02-26 08:49:01.304 33 ERROR oslo_messaging._drivers.amqpdriver > >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply > >>>> 897911a234a445d8a0d8af02ece40f6f failed to send after 60 seconds > due to a > >>>> missing queue (reply_349bcb075f8c49329435a0f884b33066). > Abandoning...: > >>>> oslo_messaging.exceptions.MessageUndeliverable > >>>> 2023-02-26 08:49:52.254 31 WARNING nova.cache_utils > >>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 > b240e3e89d99489284cd731e75f2a5db > >>>> 4160ce999a31485fa643aed0936dfef0 - default default] Cache enabled > with > >>>> backend dogpile.cache.null. > >>>> 2023-02-26 08:50:01.264 27 WARNING > oslo_messaging._drivers.amqpdriver > >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] > >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to > >>>> 8f723ceb10c3472db9a9f324861df2bb: > >>>> oslo_messaging.exceptions.MessageUndeliverable > >>>> 2023-02-26 08:50:01.266 27 ERROR oslo_messaging._drivers.amqpdriver > >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply > >>>> 8f723ceb10c3472db9a9f324861df2bb failed to send after 60 seconds > due to a > >>>> missing queue (reply_349bcb075f8c49329435a0f884b33066). > Abandoning...: > >>>> oslo_messaging.exceptions.MessageUndeliverable > >>>> > >>>> With regards, > >>>> Swogat Pradhan > >>>> > >>>> On Sun, Feb 26, 2023 at 2:26 PM Swogat Pradhan < > >>>> swogatpradhan22@gmail.com> wrote: > >>>> > >>>>> Hi, > >>>>> I currently have 3 compute nodes on edge site1 where i am
> >>>>> launch vm's. > >>>>> When the VM is in spawning state the node goes down (openstack > compute > >>>>> service list), the node comes backup when i restart the nova > compute > >>>>> service but then the launch of the vm fails. > >>>>> > >>>>> nova-compute.log > >>>>> > >>>>> 2023-02-26 08:15:51.808 7 INFO nova.compute.manager > >>>>> [req-bc0f5f2e-53fc-4dae-b1da-82f1f972d617 - - - - -] Running > >>>>> instance usage > >>>>> audit for host dcn01-hci-0.bdxworld.com from 2023-02-26 07:00:00 > to > >>>>> 2023-02-26 08:00:00. 0 instances. > >>>>> 2023-02-26 08:49:52.813 7 INFO nova.compute.claims > >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 > >>>>> b240e3e89d99489284cd731e75f2a5db > >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] [instance: > >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Claim successful on node > >>>>> dcn01-hci-0.bdxworld.com > >>>>> 2023-02-26 08:49:54.225 7 INFO nova.virt.libvirt.driver > >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 > >>>>> b240e3e89d99489284cd731e75f2a5db > >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] [instance: > >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Ignoring supplied device > name: > >>>>> /dev/vda. Libvirt can't honour user-supplied dev names > >>>>> 2023-02-26 08:49:54.398 7 INFO nova.virt.block_device > >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 > >>>>> b240e3e89d99489284cd731e75f2a5db > >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] [instance: > >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Booting with volume > >>>>> c4bd7885-5973-4860-bbe6-7a2f726baeee at /dev/vda > >>>>> 2023-02-26 08:49:55.216 7 WARNING nova.cache_utils > >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 > >>>>> b240e3e89d99489284cd731e75f2a5db > >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] Cache enabled > with > >>>>> backend dogpile.cache.null. > >>>>> 2023-02-26 08:49:55.283 7 INFO oslo.privsep.daemon > >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 > >>>>> b240e3e89d99489284cd731e75f2a5db > >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] Running > >>>>> privsep helper: > >>>>> ['sudo', 'nova-rootwrap', '/etc/nova/rootwrap.conf', > 'privsep-helper', > >>>>> '--config-file', '/etc/nova/nova.conf', '--config-file', > >>>>> '/etc/nova/nova-compute.conf', '--privsep_context', > >>>>> 'os_brick.privileged.default', '--privsep_sock_path', > >>>>> '/tmp/tmpin40tah6/privsep.sock'] > >>>>> 2023-02-26 08:49:55.791 7 INFO oslo.privsep.daemon > >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 > >>>>> b240e3e89d99489284cd731e75f2a5db > >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] Spawned new > privsep > >>>>> daemon via rootwrap > >>>>> 2023-02-26 08:49:55.717 2647 INFO oslo.privsep.daemon [-]
> >>>>> daemon starting > >>>>> 2023-02-26 08:49:55.722 2647 INFO oslo.privsep.daemon [-]
> >>>>> process running with uid/gid: 0/0 > >>>>> 2023-02-26 08:49:55.726 2647 INFO oslo.privsep.daemon [-]
> >>>>> process running with capabilities (eff/prm/inh): > >>>>> CAP_SYS_ADMIN/CAP_SYS_ADMIN/none > >>>>> 2023-02-26 08:49:55.726 2647 INFO oslo.privsep.daemon [-]
> >>>>> daemon running as pid 2647 > >>>>> 2023-02-26 08:49:55.956 7 WARNING > os_brick.initiator.connectors.nvmeof > >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 > >>>>> b240e3e89d99489284cd731e75f2a5db > >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] Process > >>>>> execution error > >>>>> in _get_host_uuid: Unexpected error while running command. > >>>>> Command: blkid overlay -s UUID -o value > >>>>> Exit code: 2 > >>>>> Stdout: '' > >>>>> Stderr: '': oslo_concurrency.processutils.ProcessExecutionError: > >>>>> Unexpected error while running command. > >>>>> 2023-02-26 08:49:58.247 7 INFO nova.virt.libvirt.driver > >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 > >>>>> b240e3e89d99489284cd731e75f2a5db > >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] [instance: > >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Creating image > >>>>> > >>>>> Is there a way to solve this issue? > >>>>> > >>>>> > >>>>> With regards, > >>>>> > >>>>> Swogat Pradhan > >>>>> > >>>> > > > > > > > > >
Hi, Seems like cinder is not using the local ceph.

Ceph Output:

[ceph: root@dcn02-ceph-all-0 /]# rbd -p images ls -l
NAME SIZE PARENT FMT PROT LOCK
2abfafaa-eff4-4c2e-a538-dc2e1249ab65 8 MiB 2 excl
55f40c8a-8f79-48c5-a52a-9b679b762f19 16 MiB 2
55f40c8a-8f79-48c5-a52a-9b679b762f19@snap 16 MiB 2 yes
59f6a9cd-721c-45b5-a15f-fd021b08160d 321 MiB 2
59f6a9cd-721c-45b5-a15f-fd021b08160d@snap 321 MiB 2 yes
5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0 386 MiB 2
5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0@snap 386 MiB 2 yes
9b27248e-a8cf-4f00-a039-d3e3066cd26a 15 GiB 2
9b27248e-a8cf-4f00-a039-d3e3066cd26a@snap 15 GiB 2 yes
b7356adc-bb47-4c05-968b-6d3c9ca0079b 15 GiB 2
b7356adc-bb47-4c05-968b-6d3c9ca0079b@snap 15 GiB 2 yes
e77e78ad-d369-4a1d-b758-8113621269a3 15 GiB 2
e77e78ad-d369-4a1d-b758-8113621269a3@snap 15 GiB 2 yes

[ceph: root@dcn02-ceph-all-0 /]# rbd -p volumes ls -l
NAME SIZE PARENT FMT PROT LOCK
volume-c644086f-d3cf-406d-b0f1-7691bde5981d 100 GiB 2
volume-f0969935-a742-4744-9375-80bf323e4d63 10 GiB 2
[ceph: root@dcn02-ceph-all-0 /]#

Attached the cinder config. Please let me know how I can solve this issue.

With regards, Swogat Pradhan

On Tue, Mar 21, 2023 at 3:53 PM John Fulton <johfulto@redhat.com> wrote:
in my last message under the line "On a DCN site if you run a command like this:" I suggested some steps you could try to confirm the image is a COW from the local glance as well as how to look at your cinder config.
On Tue, Mar 21, 2023, 12:06 AM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Update: I uploaded an image directly to the dcn02 store, and it takes around 10,15 minutes to create a volume with image in dcn02. The image size is 389 MB.
On Mon, Mar 20, 2023 at 10:26 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
Hi Jhon, I checked in the ceph od dcn02, I can see the images created after importing from the central site. But launching an instance normally fails as it takes a long time for the volume to get created.
When launching an instance from volume the instance is getting created properly without any errors.
I tried to cache images in nova using https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep... but getting checksum failed error.
With regards, Swogat Pradhan
On Thu, Mar 16, 2023 at 5:24 PM John Fulton <johfulto@redhat.com> wrote:
On Wed, Mar 15, 2023 at 8:05 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Update: After restarting the nova services on the controller and
running the deploy script on the edge site, I was able to launch the VM from volume.
Right now the instance creation is failing as the block device
creation is stuck in creating state, it is taking more than 10 mins for the volume to be created, whereas the image has already been imported to the edge glance.
Try following this document and making the same observations in your environment for AZs and their local ceph cluster.
https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features...
On a DCN site if you run a command like this:
$ sudo cephadm shell --config /etc/ceph/dcn0.conf --keyring /etc/ceph/dcn0.client.admin.keyring $ rbd --cluster dcn0 -p volumes ls -l NAME SIZE PARENT FMT PROT LOCK volume-28c6fc32-047b-4306-ad2d-de2be02716b7 8 GiB images/8083c7e7-32d8-4f7a-b1da-0ed7884f1076@snap 2 excl $
Then, you should see the parent of the volume is the image which is on the same local ceph cluster.
I wonder if something is misconfigured and thus you're encountering the streaming behavior described here:
Ideally all images should reside in the central Glance and be copied to DCN sites before instances of those images are booted on DCN sites. If an image is not copied to a DCN site before it is booted, then the image will be streamed to the DCN site and then the image will boot as an instance. This happens because Glance at the DCN site has access to the images store at the Central ceph cluster. Though the booting of the image will take time because it has not been copied in advance, this is still preferable to failing to boot the image.
You can also exec into the cinder container at the DCN site and confirm it's using it's local ceph cluster.
John
I will try and create a new fresh image and test again then update.
With regards, Swogat Pradhan
On Wed, Mar 15, 2023 at 11:13 PM Swogat Pradhan <
Update: In the hypervisor list the compute node state is showing down.
On Wed, Mar 15, 2023 at 11:11 PM Swogat Pradhan <
swogatpradhan22@gmail.com> wrote:
> > Hi Brendan, > Now i have deployed another site where i have used 2 linux bonds network template for both 3 compute nodes and 3 ceph nodes. > The bonding options is set to mode=802.3ad (lacp=active). > I used a cirros image to launch instance but the instance timed out so i waited for the volume to be created. > Once the volume was created i tried launching the instance from the volume and still the instance is stuck in spawning state. > > Here is the nova-compute log: > > 2023-03-15 17:35:47.739 185437 INFO oslo.privsep.daemon [-] privsep daemon starting > 2023-03-15 17:35:47.744 185437 INFO oslo.privsep.daemon [-] privsep
> 2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon [-] privsep
> 2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon [-] privsep daemon running as pid 185437 > 2023-03-15 17:35:47.974 8 WARNING os_brick.initiator.connectors.nvmeof [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Process execution error in _get_host_uuid: Unexpected error while running command. > Command: blkid overlay -s UUID -o value > Exit code: 2 > Stdout: '' > Stderr: '': oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command. > 2023-03-15 17:35:51.616 8 INFO nova.virt.libvirt.driver [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] [instance: 450b749c-a10a-4308-80a9-3b8020fee758] Creating image > > It is stuck in creating image, do i need to run the template mentioned here ?: https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep... > > The volume is already created and i do not understand why the instance is stuck in spawning state. > > With regards, > Swogat Pradhan > > > On Sun, Mar 5, 2023 at 4:02 PM Brendan Shephard < bshephar@redhat.com> wrote: >> >> Does your environment use different network interfaces for each of
>> >> One issue I have seen before is that when launching instances,
>> >> I have seen the same situation in fact when using a single active/backup bond on 1Gbe nics. It’s worth checking the network traffic while you try to spawn the instance to see if you’re dropping packets. In
>> >> In my case, changing from active/backup to LACP helped. So, based on that experience, from my perspective, is certainly sounds like some kind of network issue. >> >> Regards, >> >> Brendan Shephard >> Senior Software Engineer >> Red Hat Australia >> >> >> >> On 5 Mar 2023, at 6:47 am, Eugen Block <eblock@nde.ag> wrote: >> >> Hi, >> >> I tried to help someone with a similar issue some time ago in this
>> https://serverfault.com/questions/1116771/openstack-oslo-messaging-exception... >> >> But apparently a neutron reinstallation fixed it for that user, not sure if that could apply here. But is it possible that your nova and neutron versions are different between central and edge site? Have you restarted nova and neutron services on the compute nodes after installation? Have you debug logs of nova-conductor and maybe nova-compute? Maybe they can help narrow down the issue. >> If there isn't any additional information in the debug logs I
>> >> - Either remove queues, exchanges etc. while rabbit is running,
>> - Or stop the rabbitmq cluster, remove the mnesia tables from all nodes and restart rabbitmq so the exchanges, queues etc. rebuild. >> >> I can imagine that the failed reply "survives" while being replicated across the rabbit nodes. But I don't really know the rabbit internals too well, so maybe someone else can chime in here and give a better advice. >> >> Regards, >> Eugen >> >> Zitat von Swogat Pradhan <swogatpradhan22@gmail.com>: >> >> Hi, >> Can someone please help me out on this issue? >> >> With regards, >> Swogat Pradhan >> >> On Thu, Mar 2, 2023 at 1:24 PM Swogat Pradhan < swogatpradhan22@gmail.com> >> wrote: >> >> Hi >> I don't see any major packet loss. >> It seems the problem is somewhere in rabbitmq maybe but not due to
>> loss. >> >> with regards, >> Swogat Pradhan >> >> On Wed, Mar 1, 2023 at 3:34 PM Swogat Pradhan < swogatpradhan22@gmail.com> >> wrote: >> >> Hi, >> Yes the MTU is the same as the default '1500'. >> Generally I haven't seen any packet loss, but never checked when >> launching the instance. >> I will check that and come back. >> But everytime i launch an instance the instance gets stuck at spawning >> state and there the hypervisor becomes down, so not sure if packet loss >> causes this. >> >> With regards, >> Swogat pradhan >> >> On Wed, Mar 1, 2023 at 3:30 PM Eugen Block <eblock@nde.ag> wrote: >> >> One more thing coming to mind is MTU size. Are they identical between >> central and edge site? Do you see packet loss through the tunnel? >> >> Zitat von Swogat Pradhan <swogatpradhan22@gmail.com>: >> >> > Hi Eugen, >> > Request you to please add my email either on 'to' or 'cc' as i am not >> > getting email's from you. >> > Coming to the issue: >> > >> > [root@overcloud-controller-no-ceph-3 /]# rabbitmqctl
>> / >> > Listing policies for vhost "/" ... >> > vhost name pattern apply-to definition priority >> > / ha-all ^(?!amq\.).* queues >> > >> {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0 >> > >> > I have the edge site compute nodes up, it only goes down when i am >> trying >> > to launch an instance and the instance comes to a spawning state and >> then >> > gets stuck. >> > >> > I have a tunnel setup between the central and the edge sites. >> > >> > With regards, >> > Swogat Pradhan >> > >> > On Tue, Feb 28, 2023 at 9:11 PM Swogat Pradhan < >> swogatpradhan22@gmail.com> >> > wrote: >> > >> >> Hi Eugen, >> >> For some reason i am not getting your email to me directly, i am >> checking >> >> the email digest and there i am able to find your reply. >> >> Here is the log for download: https://we.tl/t-L8FEkGZFSq >> >> Yes, these logs are from the time when the issue occurred. >> >> >> >> *Note: i am able to create vm's and perform other activities in
>> >> central site, only facing this issue in the edge site.* >> >> >> >> With regards, >> >> Swogat Pradhan >> >> >> >> On Mon, Feb 27, 2023 at 5:12 PM Swogat Pradhan < >> swogatpradhan22@gmail.com> >> >> wrote: >> >> >> >>> Hi Eugen, >> >>> Thanks for your response. >> >>> I have actually a 4 controller setup so here are the details: >> >>> >> >>> *PCS Status:* >> >>> * Container bundle set: rabbitmq-bundle [ >> >>> 172.25.201.68:8787/tripleomaster/openstack-rabbitmq:pcmklatest ]: >> >>> * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): >> Started >> >>> overcloud-controller-no-ceph-3 >> >>> * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): >> Started >> >>> overcloud-controller-2 >> >>> * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): >> Started >> >>> overcloud-controller-1 >> >>> * rabbitmq-bundle-3 (ocf::heartbeat:rabbitmq-cluster): >> Started >> >>> overcloud-controller-0 >> >>> >> >>> I have tried restarting the bundle multiple times but the issue is >> still >> >>> present. >> >>> >> >>> *Cluster status:* >> >>> [root@overcloud-controller-0 /]# rabbitmqctl cluster_status >> >>> Cluster status of node >> >>> rabbit@overcloud-controller-0.internalapi.bdxworld.com ... >> >>> Basics >> >>> >> >>> Cluster name: rabbit@overcloud-controller-no-ceph-3.bdxworld.com >> >>> >> >>> Disk Nodes >> >>> >> >>> rabbit@overcloud-controller-0.internalapi.bdxworld.com >> >>> rabbit@overcloud-controller-1.internalapi.bdxworld.com >> >>> rabbit@overcloud-controller-2.internalapi.bdxworld.com >> >>> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >> >>> >> >>> Running Nodes >> >>> >> >>> rabbit@overcloud-controller-0.internalapi.bdxworld.com >> >>> rabbit@overcloud-controller-1.internalapi.bdxworld.com >> >>> rabbit@overcloud-controller-2.internalapi.bdxworld.com >> >>> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >> >>> >> >>> Versions >> >>> >> >>> rabbit@overcloud-controller-0.internalapi.bdxworld.com: RabbitMQ >> 3.8.3 >> >>> on Erlang 22.3.4.1 >> >>> rabbit@overcloud-controller-1.internalapi.bdxworld.com: RabbitMQ >> 3.8.3 >> >>> on Erlang 22.3.4.1 >> >>> rabbit@overcloud-controller-2.internalapi.bdxworld.com: RabbitMQ >> 3.8.3 >> >>> on Erlang 22.3.4.1 >> >>> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com : >> RabbitMQ >> >>> 3.8.3 on Erlang 22.3.4.1 >> >>> >> >>> Alarms >> >>> >> >>> (none) >> >>> >> >>> Network Partitions >> >>> >> >>> (none) >> >>> >> >>> Listeners >> >>> >> >>> Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, >> interface: >> >>> [::], port: 25672, protocol: clustering, purpose: inter-node and CLI >> tool >> >>> communication >> >>> Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, >> interface: >> >>> 172.25.201.212, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 >> >>> and AMQP 1.0 >> >>> Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, >> interface: >> >>> [::], port: 15672, protocol: http, purpose: HTTP API >> >>> Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, >> interface: >> >>> [::], port: 25672, protocol: clustering, purpose: inter-node and CLI >> tool >> >>> communication >> >>> Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, >> interface: >> >>> 172.25.201.205, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 >> >>> and AMQP 1.0 >> >>> Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, >> interface: >> >>> [::], port: 15672, protocol: http, purpose: HTTP API >> >>> Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, 
>> interface: >> >>> [::], port: 25672, protocol: clustering, purpose: inter-node and CLI >> tool >> >>> communication >> >>> Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, >> interface: >> >>> 172.25.201.201, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 >> >>> and AMQP 1.0 >> >>> Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, >> interface: >> >>> [::], port: 15672, protocol: http, purpose: HTTP API >> >>> Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >> , >> >>> interface: [::], port: 25672, protocol: clustering, purpose: >> inter-node and >> >>> CLI tool communication >> >>> Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >> , >> >>> interface: 172.25.201.209, port: 5672, protocol: amqp,
>> 0-9-1 >> >>> and AMQP 1.0 >> >>> Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >> , >> >>> interface: [::], port: 15672, protocol: http, purpose: HTTP API >> >>> >> >>> Feature flags >> >>> >> >>> Flag: drop_unroutable_metric, state: enabled >> >>> Flag: empty_basic_get_metric, state: enabled >> >>> Flag: implicit_default_bindings, state: enabled >> >>> Flag: quorum_queue, state: enabled >> >>> Flag: virtual_host_metadata, state: enabled >> >>> >> >>> *Logs:* >> >>> *(Attached)* >> >>> >> >>> With regards, >> >>> Swogat Pradhan >> >>> >> >>> On Sun, Feb 26, 2023 at 2:34 PM Swogat Pradhan < >> swogatpradhan22@gmail.com> >> >>> wrote: >> >>> >> >>>> Hi, >> >>>> Please find the nova conductor as well as nova api log. >> >>>> >> >>>> nova-conuctor: >> >>>> >> >>>> 2023-02-26 08:45:01.108 31 WARNING >> oslo_messaging._drivers.amqpdriver >> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] >> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to >> >>>> 16152921c1eb45c2b1f562087140168b >> >>>> 2023-02-26 08:45:02.144 26 WARNING >> oslo_messaging._drivers.amqpdriver >> >>>> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] >> >>>> reply_276049ec36a84486a8a406911d9802f4 doesn't exist, drop reply to >> >>>> 83dbe5f567a940b698acfe986f6194fa >> >>>> 2023-02-26 08:45:02.314 32 WARNING >> oslo_messaging._drivers.amqpdriver >> >>>> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] >> >>>> reply_276049ec36a84486a8a406911d9802f4 doesn't exist, drop reply to >> >>>> f3bfd7f65bd542b18d84cea3033abb43: >> >>>> oslo_messaging.exceptions.MessageUndeliverable >> >>>> 2023-02-26 08:45:02.316 32 ERROR oslo_messaging._drivers.amqpdriver >> >>>> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] The reply >> >>>> f3bfd7f65bd542b18d84cea3033abb43 failed to send after 60 seconds >> due to a >> >>>> missing queue (reply_276049ec36a84486a8a406911d9802f4). >> Abandoning...: >> >>>> oslo_messaging.exceptions.MessageUndeliverable >> >>>> 2023-02-26 08:48:01.282 35 WARNING >> oslo_messaging._drivers.amqpdriver >> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] >> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to >> >>>> d4b9180f91a94f9a82c3c9c4b7595566: >> >>>> oslo_messaging.exceptions.MessageUndeliverable >> >>>> 2023-02-26 08:48:01.284 35 ERROR oslo_messaging._drivers.amqpdriver >> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply >> >>>> d4b9180f91a94f9a82c3c9c4b7595566 failed to send after 60 seconds >> due to a >> >>>> missing queue (reply_349bcb075f8c49329435a0f884b33066). >> Abandoning...: >> >>>> oslo_messaging.exceptions.MessageUndeliverable >> >>>> 2023-02-26 08:49:01.303 33 WARNING >> oslo_messaging._drivers.amqpdriver >> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] >> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to >> >>>> 897911a234a445d8a0d8af02ece40f6f: >> >>>> oslo_messaging.exceptions.MessageUndeliverable >> >>>> 2023-02-26 08:49:01.304 33 ERROR oslo_messaging._drivers.amqpdriver >> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply >> >>>> 897911a234a445d8a0d8af02ece40f6f failed to send after 60 seconds >> due to a >> >>>> missing queue (reply_349bcb075f8c49329435a0f884b33066). 
>> Abandoning...: >> >>>> oslo_messaging.exceptions.MessageUndeliverable >> >>>> 2023-02-26 08:49:52.254 31 WARNING nova.cache_utils >> >>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >> b240e3e89d99489284cd731e75f2a5db >> >>>> 4160ce999a31485fa643aed0936dfef0 - default default] Cache enabled >> with >> >>>> backend dogpile.cache.null. >> >>>> 2023-02-26 08:50:01.264 27 WARNING >> oslo_messaging._drivers.amqpdriver >> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] >> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to >> >>>> 8f723ceb10c3472db9a9f324861df2bb: >> >>>> oslo_messaging.exceptions.MessageUndeliverable >> >>>> 2023-02-26 08:50:01.266 27 ERROR oslo_messaging._drivers.amqpdriver >> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply >> >>>> 8f723ceb10c3472db9a9f324861df2bb failed to send after 60 seconds >> due to a >> >>>> missing queue (reply_349bcb075f8c49329435a0f884b33066). >> Abandoning...: >> >>>> oslo_messaging.exceptions.MessageUndeliverable >> >>>> >> >>>> With regards, >> >>>> Swogat Pradhan >> >>>> >> >>>> On Sun, Feb 26, 2023 at 2:26 PM Swogat Pradhan < >> >>>> swogatpradhan22@gmail.com> wrote: >> >>>> >> >>>>> Hi, >> >>>>> I currently have 3 compute nodes on edge site1 where i am
>> >>>>> launch vm's. >> >>>>> When the VM is in spawning state the node goes down (openstack >> compute >> >>>>> service list), the node comes backup when i restart the nova >> compute >> >>>>> service but then the launch of the vm fails. >> >>>>> >> >>>>> nova-compute.log >> >>>>> >> >>>>> 2023-02-26 08:15:51.808 7 INFO nova.compute.manager >> >>>>> [req-bc0f5f2e-53fc-4dae-b1da-82f1f972d617 - - - - -] Running >> >>>>> instance usage >> >>>>> audit for host dcn01-hci-0.bdxworld.com from 2023-02-26 07:00:00 >> to >> >>>>> 2023-02-26 08:00:00. 0 instances. >> >>>>> 2023-02-26 08:49:52.813 7 INFO nova.compute.claims >> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >> >>>>> b240e3e89d99489284cd731e75f2a5db >> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] [instance: >> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Claim successful on node >> >>>>> dcn01-hci-0.bdxworld.com >> >>>>> 2023-02-26 08:49:54.225 7 INFO nova.virt.libvirt.driver >> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >> >>>>> b240e3e89d99489284cd731e75f2a5db >> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] [instance: >> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Ignoring supplied device >> name: >> >>>>> /dev/vda. Libvirt can't honour user-supplied dev names >> >>>>> 2023-02-26 08:49:54.398 7 INFO nova.virt.block_device >> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >> >>>>> b240e3e89d99489284cd731e75f2a5db >> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] [instance: >> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Booting with volume >> >>>>> c4bd7885-5973-4860-bbe6-7a2f726baeee at /dev/vda >> >>>>> 2023-02-26 08:49:55.216 7 WARNING nova.cache_utils >> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >> >>>>> b240e3e89d99489284cd731e75f2a5db >> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] Cache enabled >> with >> >>>>> backend dogpile.cache.null. >> >>>>> 2023-02-26 08:49:55.283 7 INFO oslo.privsep.daemon >> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >> >>>>> b240e3e89d99489284cd731e75f2a5db >> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] Running >> >>>>> privsep helper: >> >>>>> ['sudo', 'nova-rootwrap', '/etc/nova/rootwrap.conf', >> 'privsep-helper', >> >>>>> '--config-file', '/etc/nova/nova.conf', '--config-file', >> >>>>> '/etc/nova/nova-compute.conf', '--privsep_context', >> >>>>> 'os_brick.privileged.default', '--privsep_sock_path', >> >>>>> '/tmp/tmpin40tah6/privsep.sock'] >> >>>>> 2023-02-26 08:49:55.791 7 INFO oslo.privsep.daemon >> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >> >>>>> b240e3e89d99489284cd731e75f2a5db >> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] Spawned new >> privsep >> >>>>> daemon via rootwrap >> >>>>> 2023-02-26 08:49:55.717 2647 INFO oslo.privsep.daemon [-]
>> >>>>> daemon starting >> >>>>> 2023-02-26 08:49:55.722 2647 INFO oslo.privsep.daemon [-]
>> >>>>> process running with uid/gid: 0/0 >> >>>>> 2023-02-26 08:49:55.726 2647 INFO oslo.privsep.daemon [-]
>> >>>>> process running with capabilities (eff/prm/inh): >> >>>>> CAP_SYS_ADMIN/CAP_SYS_ADMIN/none >> >>>>> 2023-02-26 08:49:55.726 2647 INFO oslo.privsep.daemon [-]
>> >>>>> daemon running as pid 2647 >> >>>>> 2023-02-26 08:49:55.956 7 WARNING >> os_brick.initiator.connectors.nvmeof >> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >> >>>>> b240e3e89d99489284cd731e75f2a5db >> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] Process >> >>>>> execution error >> >>>>> in _get_host_uuid: Unexpected error while running command. >> >>>>> Command: blkid overlay -s UUID -o value >> >>>>> Exit code: 2 >> >>>>> Stdout: '' >> >>>>> Stderr: '': oslo_concurrency.processutils.ProcessExecutionError: >> >>>>> Unexpected error while running command. >> >>>>> 2023-02-26 08:49:58.247 7 INFO nova.virt.libvirt.driver >> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >> >>>>> b240e3e89d99489284cd731e75f2a5db >> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] [instance: >> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Creating image >> >>>>> >> >>>>> Is there a way to solve this issue? >> >>>>> >> >>>>> >> >>>>> With regards, >> >>>>> >> >>>>> Swogat Pradhan >> >>>>> >> >>>> >> >> >> >> >> >> >> >> >>
On Tue, Mar 21, 2023 at 8:03 AM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi, Seems like cinder is not using the local ceph.
That explains the issue. It's a misconfiguration. I hope this is not a production system since the mailing list now has the cinder.conf which contains passwords.

The section that looks like this:

[tripleo_ceph]
volume_backend_name=tripleo_ceph
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_ceph_conf=/etc/ceph/ceph.conf
rbd_user=openstack
rbd_pool=volumes
rbd_flatten_volume_from_snapshot=False
rbd_secret_uuid=<redacted>
report_discard_supported=True

Should be updated to refer to the local DCN ceph cluster and not the central one. Use the ceph conf file for that cluster and ensure the rbd_secret_uuid corresponds to that one.

TripleO's convention is to set the rbd_secret_uuid to the FSID of the Ceph cluster. The FSID should be in the ceph.conf file. The tripleo_nova_libvirt role will use virsh secret-* commands so that libvirt can retrieve the cephx secret using the FSID as a key. This can be confirmed with `podman exec nova_virtsecretd virsh secret-get-value $FSID`.

The documentation describes how to configure the central and DCN sites correctly but an error seems to have occurred while you were following it.

https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features...

John
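For reference, a corrected section might look roughly like this (a sketch only; the /etc/ceph/dcn02.conf path and the FSID value are placeholders for whatever the dcn02 deployment actually generated):

[tripleo_ceph]
volume_backend_name=tripleo_ceph
volume_driver=cinder.volume.drivers.rbd.RBDDriver
# conf file of the local dcn02 ceph cluster, not the central one
rbd_ceph_conf=/etc/ceph/dcn02.conf
rbd_user=openstack
rbd_pool=volumes
rbd_flatten_volume_from_snapshot=False
# FSID of the dcn02 cluster, so libvirt can look up the matching cephx secret
rbd_secret_uuid=<dcn02 ceph FSID>
report_discard_supported=True

The FSID can be read from the local ceph conf and cross-checked against libvirt, e.g. `grep fsid /etc/ceph/dcn02.conf` and then `podman exec nova_virtsecretd virsh secret-get-value <FSID>` as described above.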
Ceph Output:

[ceph: root@dcn02-ceph-all-0 /]# rbd -p images ls -l
NAME SIZE PARENT FMT PROT LOCK
2abfafaa-eff4-4c2e-a538-dc2e1249ab65 8 MiB 2 excl
55f40c8a-8f79-48c5-a52a-9b679b762f19 16 MiB 2
55f40c8a-8f79-48c5-a52a-9b679b762f19@snap 16 MiB 2 yes
59f6a9cd-721c-45b5-a15f-fd021b08160d 321 MiB 2
59f6a9cd-721c-45b5-a15f-fd021b08160d@snap 321 MiB 2 yes
5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0 386 MiB 2
5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0@snap 386 MiB 2 yes
9b27248e-a8cf-4f00-a039-d3e3066cd26a 15 GiB 2
9b27248e-a8cf-4f00-a039-d3e3066cd26a@snap 15 GiB 2 yes
b7356adc-bb47-4c05-968b-6d3c9ca0079b 15 GiB 2
b7356adc-bb47-4c05-968b-6d3c9ca0079b@snap 15 GiB 2 yes
e77e78ad-d369-4a1d-b758-8113621269a3 15 GiB 2
e77e78ad-d369-4a1d-b758-8113621269a3@snap 15 GiB 2 yes

[ceph: root@dcn02-ceph-all-0 /]# rbd -p volumes ls -l
NAME SIZE PARENT FMT PROT LOCK
volume-c644086f-d3cf-406d-b0f1-7691bde5981d 100 GiB 2
volume-f0969935-a742-4744-9375-80bf323e4d63 10 GiB 2
[ceph: root@dcn02-ceph-all-0 /]#
Attached the cinder config. Please let me know how I can solve this issue.
With regards, Swogat Pradhan
On Tue, Mar 21, 2023 at 3:53 PM John Fulton <johfulto@redhat.com> wrote:
in my last message under the line "On a DCN site if you run a command like this:" I suggested some steps you could try to confirm the image is a COW from the local glance as well as how to look at your cinder config.
On Tue, Mar 21, 2023, 12:06 AM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Update: I uploaded an image directly to the dcn02 store, and it takes around 10,15 minutes to create a volume with image in dcn02. The image size is 389 MB.
On Mon, Mar 20, 2023 at 10:26 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi Jhon, I checked in the ceph od dcn02, I can see the images created after importing from the central site. But launching an instance normally fails as it takes a long time for the volume to get created.
When launching an instance from volume the instance is getting created properly without any errors.
I tried to cache images in nova using https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep... but getting checksum failed error.
With regards, Swogat Pradhan
On Thu, Mar 16, 2023 at 5:24 PM John Fulton <johfulto@redhat.com> wrote:
On Wed, Mar 15, 2023 at 8:05 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Update: After restarting the nova services on the controller and running the deploy script on the edge site, I was able to launch the VM from volume.
Right now the instance creation is failing as the block device creation is stuck in creating state, it is taking more than 10 mins for the volume to be created, whereas the image has already been imported to the edge glance.
Try following this document and making the same observations in your environment for AZs and their local ceph cluster.
https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features...
On a DCN site if you run a command like this:
$ sudo cephadm shell --config /etc/ceph/dcn0.conf --keyring /etc/ceph/dcn0.client.admin.keyring $ rbd --cluster dcn0 -p volumes ls -l NAME SIZE PARENT FMT PROT LOCK volume-28c6fc32-047b-4306-ad2d-de2be02716b7 8 GiB images/8083c7e7-32d8-4f7a-b1da-0ed7884f1076@snap 2 excl $
Then, you should see the parent of the volume is the image which is on the same local ceph cluster.
I wonder if something is misconfigured and thus you're encountering the streaming behavior described here:
Ideally all images should reside in the central Glance and be copied to DCN sites before instances of those images are booted on DCN sites. If an image is not copied to a DCN site before it is booted, then the image will be streamed to the DCN site and then the image will boot as an instance. This happens because Glance at the DCN site has access to the images store at the Central ceph cluster. Though the booting of the image will take time because it has not been copied in advance, this is still preferable to failing to boot the image.
You can also exec into the cinder container at the DCN site and confirm it's using it's local ceph cluster.
John
I will try and create a new fresh image and test again then update.
With regards, Swogat Pradhan
Hi John,
This seems to be an issue. When I deployed the DCN ceph in both dcn01 and dcn02, the --cluster parameter was set to the respective cluster names, but the config files were still created as ceph.conf and the keyring as ceph.client.openstack.keyring. This created issues in glance as well, since the naming convention of the files didn't match the cluster names, so I had to manually rename the central ceph conf file as follows:

[root@dcn02-compute-0 ~]# cd /var/lib/tripleo-config/ceph/
[root@dcn02-compute-0 ceph]# ll
total 16
-rw-------. 1 root root 257 Mar 13 13:56 ceph_central.client.openstack.keyring
-rw-r--r--. 1 root root 428 Mar 13 13:56 ceph_central.conf
-rw-------. 1 root root 205 Mar 15 18:45 ceph.client.openstack.keyring
-rw-r--r--. 1 root root 362 Mar 15 18:45 ceph.conf
[root@dcn02-compute-0 ceph]#

ceph.conf and ceph.client.openstack.keyring contain the fsid of the respective clusters in both dcn01 and dcn02. In the above CLI output, ceph.conf and ceph.client... are the files used to access the dcn02 ceph cluster, and the ceph_central* files are used for accessing the central ceph cluster.

glance multistore config:
[dcn02]
rbd_store_ceph_conf=/etc/ceph/ceph.conf
rbd_store_user=openstack
rbd_store_pool=images
rbd_thin_provisioning=False
store_description=dcn02 rbd glance store

[ceph_central]
rbd_store_ceph_conf=/etc/ceph/ceph_central.conf
rbd_store_user=openstack
rbd_store_pool=images
rbd_thin_provisioning=False
store_description=Default glance store backend.

With regards, Swogat Pradhan

On Tue, Mar 21, 2023 at 5:52 PM John Fulton <johfulto@redhat.com> wrote:
On Tue, Mar 21, 2023 at 8:03 AM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi, Seems like cinder is not using the local ceph.
That explains the issue. It's a misconfiguration.
I hope this is not a production system since the mailing list now has the cinder.conf which contains passwords.
The section that looks like this:
[tripleo_ceph]
volume_backend_name=tripleo_ceph
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_ceph_conf=/etc/ceph/ceph.conf
rbd_user=openstack
rbd_pool=volumes
rbd_flatten_volume_from_snapshot=False
rbd_secret_uuid=<redacted>
report_discard_supported=True
Should be updated to refer to the local DCN ceph cluster and not the central one. Use the ceph conf file for that cluster and ensure the rbd_secret_uuid corresponds to that one.
TripleO’s convention is to set the rbd_secret_uuid to the FSID of the Ceph cluster. The FSID should be in the ceph.conf file. The tripleo_nova_libvirt role will use virsh secret-* commands so that libvirt can retrieve the cephx secret using the FSID as a key. This can be confirmed with `podman exec nova_virtsecretd virsh secret-get-value $FSID`.
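For illustration only (not a drop-in; substitute the conf file name and FSID that actually belong to the local dcn02 cluster in your deployment), the two lines that change in the [tripleo_ceph] section would look something like:

rbd_ceph_conf=/etc/ceph/<local dcn02 cluster>.conf
rbd_secret_uuid=<FSID of the local dcn02 cluster>

And a quick cross-check on a DCN compute node (a sketch; nova_virtsecretd is the container named above, and virsh secret-list simply enumerates the secrets libvirt knows about):

$ sudo podman exec nova_virtsecretd virsh secret-list
$ sudo podman exec nova_virtsecretd virsh secret-get-value <FSID of the local dcn02 cluster>
# The FSID used as rbd_secret_uuid should be listed and should return the cephx key of the local cluster.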
The documentation describes how to configure the central and DCN sites correctly but an error seems to have occurred while you were following it.
https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features...
John
Ceph Output:
[ceph: root@dcn02-ceph-all-0 /]# rbd -p images ls -l
NAME                                       SIZE     PARENT  FMT  PROT  LOCK
2abfafaa-eff4-4c2e-a538-dc2e1249ab65       8 MiB            2          excl
55f40c8a-8f79-48c5-a52a-9b679b762f19       16 MiB           2
55f40c8a-8f79-48c5-a52a-9b679b762f19@snap  16 MiB           2    yes
59f6a9cd-721c-45b5-a15f-fd021b08160d       321 MiB          2
59f6a9cd-721c-45b5-a15f-fd021b08160d@snap  321 MiB          2    yes
5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0       386 MiB          2
5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0@snap  386 MiB          2    yes
9b27248e-a8cf-4f00-a039-d3e3066cd26a       15 GiB           2
9b27248e-a8cf-4f00-a039-d3e3066cd26a@snap  15 GiB           2    yes
b7356adc-bb47-4c05-968b-6d3c9ca0079b       15 GiB           2
b7356adc-bb47-4c05-968b-6d3c9ca0079b@snap  15 GiB           2    yes
e77e78ad-d369-4a1d-b758-8113621269a3       15 GiB           2
e77e78ad-d369-4a1d-b758-8113621269a3@snap  15 GiB           2    yes
[ceph: root@dcn02-ceph-all-0 /]# rbd -p volumes ls -l
NAME                                         SIZE     PARENT  FMT  PROT  LOCK
volume-c644086f-d3cf-406d-b0f1-7691bde5981d  100 GiB          2
volume-f0969935-a742-4744-9375-80bf323e4d63  10 GiB           2
[ceph: root@dcn02-ceph-all-0 /]#
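(For reference, a per-volume check along the same lines would be something like the following, run in the same cephadm shell; the volume name is taken from the listing above:)

[ceph: root@dcn02-ceph-all-0 /]# rbd info volumes/volume-f0969935-a742-4744-9375-80bf323e4d63 | grep parent
# A COW clone prints a "parent:" line pointing at images/<image-id>@snap on the same cluster.
# No parent line means the volume was filled by downloading and converting the image instead of cloning it.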
Attached the cinder config. Please let me know how I can solve this issue.
With regards, Swogat Pradhan
On Tue, Mar 21, 2023 at 3:53 PM John Fulton <johfulto@redhat.com> wrote:
In my last message, under the line "On a DCN site if you run a command like this:", I suggested some steps you could try to confirm the image is a COW clone from the local glance, as well as how to look at your cinder config.
On Tue, Mar 21, 2023, 12:06 AM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Update: I uploaded an image directly to the dcn02 store, and it takes around 10-15 minutes to create a volume from that image in dcn02. The image size is 389 MB.
On Mon, Mar 20, 2023 at 10:26 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi John, I checked the ceph on dcn02 and I can see the images created after importing from the central site.
But launching an instance normally fails, as it takes a long time for the volume to get created.
When launching an instance from a volume, the instance is getting created properly without any errors.
I tried to cache images in nova using https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep... but I am getting a checksum failed error.
With regards, Swogat Pradhan
Update: Here is the log when creating a volume using the cirros image:

2023-03-22 11:04:38.449 109 INFO cinder.volume.flows.manager.create_volume [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Volume bf341343-6609-4b8c-b9e0-93e2a89c8c8f: being created as image with specification: {'status': 'creating', 'volume_name': 'volume-bf341343-6609-4b8c-b9e0-93e2a89c8c8f', 'volume_size': 4, 'image_id': '736d8779-07cd-4510-bab2-adcb653cc538', 'image_location': ('rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'dcn02'}}]), 'image_meta': {'name': 'cirros', 'disk_format': 'qcow2', 'container_format': 'bare', 'visibility': 'public', 'size': 16338944, 'virtual_size': 117440512, 'status': 'active', 'checksum': '1d3062cd89af34e419f7100277f38b2b', 'protected': False, 'min_ram': 0, 'min_disk': 0, 'owner': '4160ce999a31485fa643aed0936dfef0', 'os_hidden': False, 'os_hash_algo': 'sha512', 'os_hash_value': '553d220ed58cfee7dafe003c446a9f197ab5edf8ffc09396c74187cf83873c877e7ae041cb80f3b91489acf687183adcd689b53b38e3ddd22e627e7f98a09c46', 'id': '736d8779-07cd-4510-bab2-adcb653cc538', 'created_at': datetime.datetime(2023, 3, 22, 10, 44, 12, tzinfo=datetime.timezone.utc), 'updated_at': datetime.datetime(2023, 3, 22, 10, 54, 1, tzinfo=datetime.timezone.utc), 'locations': [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'dcn02'}}], 'direct_url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'tags': [], 'file': '/v2/images/736d8779-07cd-4510-bab2-adcb653cc538/file', 'stores': 'ceph,dcn02', 'properties': {'os_glance_failed_import': '', 'os_glance_importing_to_stores': '', 'owner_specified.openstack.md5': '', 'owner_specified.openstack.object': 'images/cirros', 'owner_specified.openstack.sha256': ''}}, 'image_service': <cinder.image.glance.GlanceImageService object at 0x7f449ded1198>}
2023-03-22 11:06:16.570 109 INFO cinder.image.image_utils [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Image download 15.58 MB at 0.16 MB/s
2023-03-22 11:07:54.023 109 WARNING py.warnings [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] /usr/lib/python3.6/site-packages/oslo_utils/imageutils.py:75: FutureWarning: The human format is deprecated and the format parameter will be removed. Use explicitly json instead in version 'xena' category=FutureWarning)
2023-03-22 11:11:12.161 109 WARNING py.warnings [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] /usr/lib/python3.6/site-packages/oslo_utils/imageutils.py:75: FutureWarning: The human format is deprecated and the format parameter will be removed. Use explicitly json instead in version 'xena' category=FutureWarning)
2023-03-22 11:11:12.163 109 INFO cinder.image.image_utils [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Converted 112.00 MB image at 112.00 MB/s
2023-03-22 11:11:14.998 109 INFO cinder.volume.flows.manager.create_volume [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Volume volume-bf341343-6609-4b8c-b9e0-93e2a89c8c8f (bf341343-6609-4b8c-b9e0-93e2a89c8c8f): created successfully
2023-03-22 11:11:15.195 109 INFO cinder.volume.manager [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Created volume successfully.

The image is present in the dcn02 store, but cinder still downloaded the image at 0.16 MB/s and then created the volume from it.

With regards, Swogat Pradhan
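A quick cross-check of why the clone path was skipped (a sketch; the commands reuse the dcn02 cephadm shell and compute-node paths shown earlier in this thread): the 'dcn02' location in the log above is rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/736d8779-07cd-4510-bab2-adcb653cc538/snap, so the cluster that cinder's [tripleo_ceph] backend talks to must have that same FSID for the RBD driver to COW-clone the image; otherwise it falls back to download-and-convert, which matches the 0.16 MB/s download seen here.

[ceph: root@dcn02-ceph-all-0 /]# ceph fsid
[root@dcn02-compute-0 ~]# grep fsid /var/lib/tripleo-config/ceph/ceph.conf
# Both should print a8d5f1f5-48e7-5ede-89ab-8aca59b6397b if cinder is really pointed at the local dcn02 cluster.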
Hi Jhon, This seems to be an issue. When i deployed the dcn ceph in both dcn01 and dcn02 the --cluster parameter was specified to the respective cluster names but the config files were created in the name of ceph.conf and keyring was ceph.client.openstack.keyring.
Which created issues in glance as well as the naming convention of the files didn't match the cluster names, so i had to manually rename the central ceph conf file as such:
[root@dcn02-compute-0 ~]# cd /var/lib/tripleo-config/ceph/ [root@dcn02-compute-0 ceph]# ll total 16 -rw-------. 1 root root 257 Mar 13 13:56 ceph_central.client.openstack.keyring -rw-r--r--. 1 root root 428 Mar 13 13:56 ceph_central.conf -rw-------. 1 root root 205 Mar 15 18:45 ceph.client.openstack.keyring -rw-r--r--. 1 root root 362 Mar 15 18:45 ceph.conf [root@dcn02-compute-0 ceph]#
ceph.conf and ceph.client.openstack.keyring contain the fsid of the respective clusters in both dcn01 and dcn02. In the above cli output, the ceph.conf and ceph.client... are the files used to access dcn02 ceph cluster and ceph_central* files are used in for accessing central ceph cluster.
glance multistore config: [dcn02] rbd_store_ceph_conf=/etc/ceph/ceph.conf rbd_store_user=openstack rbd_store_pool=images rbd_thin_provisioning=False store_description=dcn02 rbd glance store
[ceph_central] rbd_store_ceph_conf=/etc/ceph/ceph_central.conf rbd_store_user=openstack rbd_store_pool=images rbd_thin_provisioning=False store_description=Default glance store backend.
With regards, Swogat Pradhan
On Tue, Mar 21, 2023 at 5:52 PM John Fulton <johfulto@redhat.com> wrote:
On Tue, Mar 21, 2023 at 8:03 AM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi, Seems like cinder is not using the local ceph.
That explains the issue. It's a misconfiguration.
I hope this is not a production system since the mailing list now has the cinder.conf which contains passwords.
The section that looks like this:
[tripleo_ceph] volume_backend_name=tripleo_ceph volume_driver=cinder.volume.drivers.rbd.RBDDriver rbd_ceph_conf=/etc/ceph/ceph.conf rbd_user=openstack rbd_pool=volumes rbd_flatten_volume_from_snapshot=False rbd_secret_uuid=<redacted> report_discard_supported=True
Should be updated to refer to the local DCN ceph cluster and not the central one. Use the ceph conf file for that cluster and ensure the rbd_secret_uuid corresponds to that one.
TripleO’s convention is to set the rbd_secret_uuid to the FSID of the Ceph cluster. The FSID should be in the ceph.conf file. The tripleo_nova_libvirt role will use virsh secret-* commands so that libvirt can retrieve the cephx secret using the FSID as a key. This can be confirmed with `podman exec nova_virtsecretd virsh secret-get-value $FSID`.
The documentation describes how to configure the central and DCN sites correctly but an error seems to have occurred while you were following it.
https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features...
John
Ceph Output: [ceph: root@dcn02-ceph-all-0 /]# rbd -p images ls -l NAME SIZE PARENT FMT PROT
2abfafaa-eff4-4c2e-a538-dc2e1249ab65 8 MiB 2 excl 55f40c8a-8f79-48c5-a52a-9b679b762f19 16 MiB 2 55f40c8a-8f79-48c5-a52a-9b679b762f19@snap 16 MiB 2 yes 59f6a9cd-721c-45b5-a15f-fd021b08160d 321 MiB 2 59f6a9cd-721c-45b5-a15f-fd021b08160d@snap 321 MiB 2 yes 5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0 386 MiB 2 5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0@snap 386 MiB 2 yes 9b27248e-a8cf-4f00-a039-d3e3066cd26a 15 GiB 2 9b27248e-a8cf-4f00-a039-d3e3066cd26a@snap 15 GiB 2 yes b7356adc-bb47-4c05-968b-6d3c9ca0079b 15 GiB 2 b7356adc-bb47-4c05-968b-6d3c9ca0079b@snap 15 GiB 2 yes e77e78ad-d369-4a1d-b758-8113621269a3 15 GiB 2 e77e78ad-d369-4a1d-b758-8113621269a3@snap 15 GiB 2 yes
[ceph: root@dcn02-ceph-all-0 /]# rbd -p volumes ls -l NAME SIZE PARENT FMT PROT LOCK volume-c644086f-d3cf-406d-b0f1-7691bde5981d 100 GiB 2 volume-f0969935-a742-4744-9375-80bf323e4d63 10 GiB 2 [ceph: root@dcn02-ceph-all-0 /]#
Attached the cinder config. Please let me know how I can solve this issue.
With regards, Swogat Pradhan
On Tue, Mar 21, 2023 at 3:53 PM John Fulton <johfulto@redhat.com> wrote:
in my last message under the line "On a DCN site if you run a command
On Tue, Mar 21, 2023, 12:06 AM Swogat Pradhan <
swogatpradhan22@gmail.com> wrote:
Update: I uploaded an image directly to the dcn02 store, and it takes around
10,15 minutes to create a volume with image in dcn02.
The image size is 389 MB.
On Mon, Mar 20, 2023 at 10:26 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
Hi Jhon, I checked in the ceph od dcn02, I can see the images created after
importing from the central site.
But launching an instance normally fails as it takes a long time for
When launching an instance from volume the instance is getting
created properly without any errors.
I tried to cache images in nova using
https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep... but getting checksum failed error.
With regards, Swogat Pradhan
On Thu, Mar 16, 2023 at 5:24 PM John Fulton <johfulto@redhat.com>
wrote:
> > On Wed, Mar 15, 2023 at 8:05 PM Swogat Pradhan > <swogatpradhan22@gmail.com> wrote: > > > > Update: After restarting the nova services on the controller and running the deploy script on the edge site, I was able to launch the VM from volume. > > > > Right now the instance creation is failing as the block device creation is stuck in creating state, it is taking more than 10 mins for the volume to be created, whereas the image has already been imported to the edge glance. > > Try following this document and making the same observations in your > environment for AZs and their local ceph cluster. > > https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features... > > On a DCN site if you run a command like this: > > $ sudo cephadm shell --config /etc/ceph/dcn0.conf --keyring > /etc/ceph/dcn0.client.admin.keyring > $ rbd --cluster dcn0 -p volumes ls -l > NAME SIZE PARENT > FMT PROT LOCK > volume-28c6fc32-047b-4306-ad2d-de2be02716b7 8 GiB > images/8083c7e7-32d8-4f7a-b1da-0ed7884f1076@snap 2 excl > $ > > Then, you should see the parent of the volume is the image which is on > the same local ceph cluster. > > I wonder if something is misconfigured and thus you're encountering > the streaming behavior described here: > > Ideally all images should reside in the central Glance and be copied > to DCN sites before instances of those images are booted on DCN sites. > If an image is not copied to a DCN site before it is booted, then
> image will be streamed to the DCN site and then the image will boot as > an instance. This happens because Glance at the DCN site has access to > the images store at the Central ceph cluster. Though the booting of > the image will take time because it has not been copied in advance, > this is still preferable to failing to boot the image. > > You can also exec into the cinder container at the DCN site and > confirm it's using it's local ceph cluster. > > John > > > > > I will try and create a new fresh image and test again then update. > > > > With regards, > > Swogat Pradhan > > > > On Wed, Mar 15, 2023 at 11:13 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote: > >> > >> Update: > >> In the hypervisor list the compute node state is showing down. > >> > >> > >> On Wed, Mar 15, 2023 at 11:11 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote: > >>> > >>> Hi Brendan, > >>> Now i have deployed another site where i have used 2 linux bonds network template for both 3 compute nodes and 3 ceph nodes. > >>> The bonding options is set to mode=802.3ad (lacp=active). > >>> I used a cirros image to launch instance but the instance timed out so i waited for the volume to be created. > >>> Once the volume was created i tried launching the instance from
Hi John,
After some changes I feel like cinder is now trying to pull the image from the local glance, as I am getting the following error in the cinder-volume log:

2023-03-22 13:32:29.786 108 ERROR oslo_messaging.rpc.server cinder.exception.GlanceConnectionFailed: Connection to glance failed: Error finding address for http://172.25.228.253:9292/v2/images/736d8779-07cd-4510-bab2-adcb653cc538: Unable to establish connection to http://172.25.228.253:9292/v2/images/736d8779-07cd-4510-bab2-adcb653cc538: HTTPConnectionPool(host='172.25.228.253', port=9292): Max retries exceeded with url: /v2/images/736d8779-07cd-4510-bab2-adcb653cc538 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7682d2cd30>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))

The endpoint it is trying to reach is the dcn02 IP address.
But when I check the ports, I don't find port 9292 listening:

[root@dcn02-compute-2 ceph]# netstat -nultp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address   State    PID/Program name
tcp        0      0 0.0.0.0:2022            0.0.0.0:*         LISTEN   656800/sshd
tcp        0      0 127.0.0.1:199           0.0.0.0:*         LISTEN   4878/snmpd
tcp        0      0 172.25.228.253:2379     0.0.0.0:*         LISTEN   6232/etcd
tcp        0      0 172.25.228.253:2380     0.0.0.0:*         LISTEN   6232/etcd
tcp        0      0 0.0.0.0:111             0.0.0.0:*         LISTEN   1/systemd
tcp        0      0 127.0.0.1:6640          0.0.0.0:*         LISTEN   2779/ovsdb-server
tcp        0      0 0.0.0.0:22              0.0.0.0:*         LISTEN   4918/sshd
tcp6       0      0 :::2022                 :::*              LISTEN   656800/sshd
tcp6       0      0 :::111                  :::*              LISTEN   1/systemd
tcp6       0      0 :::22                   :::*              LISTEN   4918/sshd
udp        0      0 0.0.0.0:111             0.0.0.0:*                  1/systemd
udp        0      0 0.0.0.0:161             0.0.0.0:*                  4878/snmpd
udp        0      0 127.0.0.1:323           0.0.0.0:*                  2609/chronyd
udp        0      0 0.0.0.0:6081            0.0.0.0:*                  -
udp6       0      0 :::111                  :::*                       1/systemd
udp6       0      0 ::1:161                 :::*                       4878/snmpd
udp6       0      0 ::1:323                 :::*                       2609/chronyd
udp6       0      0 :::6081                 :::*                       -

I see in glance-api.conf that the bind port parameter is set to 9292, but the port is not listed in the netstat output. Can you please guide me in getting this port up and running, as I feel like this would solve the issue I am facing right now.

With regards,
Swogat Pradhan

On Wed, Mar 22, 2023 at 4:55 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Update: Here is the log when creating a volume using cirros image:
2023-03-22 11:04:38.449 109 INFO cinder.volume.flows.manager.create_volume [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Volume bf341343-6609-4b8c-b9e0-93e2a89c8c8f: being created as image with specification: {'status': 'creating', 'volume_name': 'volume-bf341343-6609-4b8c-b9e0-93e2a89c8c8f', 'volume_size': 4, 'image_id': '736d8779-07cd-4510-bab2-adcb653cc538', 'image_location': ('rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'dcn02'}}]), 'image_meta': {'name': 'cirros', 'disk_format': 'qcow2', 'container_format': 'bare', 'visibility': 'public', 'size': 16338944, 'virtual_size': 117440512, 'status': 'active', 'checksum': '1d3062cd89af34e419f7100277f38b2b', 'protected': False, 'min_ram': 0, 'min_disk': 0, 'owner': '4160ce999a31485fa643aed0936dfef0', 'os_hidden': False, 'os_hash_algo': 'sha512', 'os_hash_value': '553d220ed58cfee7dafe003c446a9f197ab5edf8ffc09396c74187cf83873c877e7ae041cb80f3b91489acf687183adcd689b53b38e3ddd22e627e7f98a09c46', 'id': '736d8779-07cd-4510-bab2-adcb653cc538', 'created_at': datetime.datetime(2023, 3, 22, 10, 44, 12, tzinfo=datetime.timezone.utc), 'updated_at': datetime.datetime(2023, 3, 22, 10, 54, 1, tzinfo=datetime.timezone.utc), 'locations': [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'dcn02'}}], 'direct_url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'tags': [], 'file': '/v2/images/736d8779-07cd-4510-bab2-adcb653cc538/file', 'stores': 'ceph,dcn02', 'properties': {'os_glance_failed_import': '', 'os_glance_importing_to_stores': '', 'owner_specified.openstack.md5': '', 'owner_specified.openstack.object': 'images/cirros', 'owner_specified.openstack.sha256': ''}}, 'image_service': <cinder.image.glance.GlanceImageService object at 0x7f449ded1198>} 2023-03-22 11:06:16.570 109 INFO cinder.image.image_utils [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Image download 15.58 MB at 0.16 MB/s 2023-03-22 11:07:54.023 109 WARNING py.warnings [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] /usr/lib/python3.6/site-packages/oslo_utils/imageutils.py:75: FutureWarning: The human format is deprecated and the format parameter will be removed. Use explicitly json instead in version 'xena' category=FutureWarning)
2023-03-22 11:11:12.161 109 WARNING py.warnings [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] /usr/lib/python3.6/site-packages/oslo_utils/imageutils.py:75: FutureWarning: The human format is deprecated and the format parameter will be removed. Use explicitly json instead in version 'xena' category=FutureWarning)
2023-03-22 11:11:12.163 109 INFO cinder.image.image_utils [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Converted 112.00 MB image at 112.00 MB/s 2023-03-22 11:11:14.998 109 INFO cinder.volume.flows.manager.create_volume [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Volume volume-bf341343-6609-4b8c-b9e0-93e2a89c8c8f (bf341343-6609-4b8c-b9e0-93e2a89c8c8f): created successfully 2023-03-22 11:11:15.195 109 INFO cinder.volume.manager [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Created volume successfully.
The image is present in the dcn02 store, but it still downloaded the image at 0.16 MB/s and then created the volume.
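One way to cross-check where the image really lives and how cinder fetched it (a rough sketch; the image ID is the one from the log above, and the stores field may be displayed differently depending on the client version):

$ openstack image show 736d8779-07cd-4510-bab2-adcb653cc538 -c stores
$ rbd -p images ls -l | grep 736d8779    # run from the dcn02 cephadm shell

A sustained download at 0.16 MB/s suggests the data was pulled over the Glance HTTP API (i.e. streamed) rather than cloned from the local images pool.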
With regards, Swogat Pradhan
On Tue, Mar 21, 2023 at 6:10 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi John, This seems to be an issue. When I deployed the dcn ceph in both dcn01 and dcn02, the --cluster parameter was set to the respective cluster names, but the config files were still created as ceph.conf and the keyring as ceph.client.openstack.keyring.
This created issues in glance as well, since the naming convention of the files didn't match the cluster names, so I had to manually rename the central ceph conf file as follows:
[root@dcn02-compute-0 ~]# cd /var/lib/tripleo-config/ceph/
[root@dcn02-compute-0 ceph]# ll
total 16
-rw-------. 1 root root 257 Mar 13 13:56 ceph_central.client.openstack.keyring
-rw-r--r--. 1 root root 428 Mar 13 13:56 ceph_central.conf
-rw-------. 1 root root 205 Mar 15 18:45 ceph.client.openstack.keyring
-rw-r--r--. 1 root root 362 Mar 15 18:45 ceph.conf
[root@dcn02-compute-0 ceph]#
ceph.conf and ceph.client.openstack.keyring contain the fsid of the respective clusters in both dcn01 and dcn02. In the above CLI output, ceph.conf and ceph.client.openstack.keyring are the files used to access the dcn02 ceph cluster, and the ceph_central* files are used for accessing the central ceph cluster.
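A quick way to double-check that mapping, assuming the files shown above are the ones mounted into the containers (a sketch, not output from this environment):

$ grep fsid /var/lib/tripleo-config/ceph/ceph.conf
$ grep fsid /var/lib/tripleo-config/ceph/ceph_central.conf

The two fsids should line up with the rbd:// locations labelled 'dcn02' and 'ceph' in the volume-creation log earlier in this thread.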
glance multistore config:

[dcn02]
rbd_store_ceph_conf=/etc/ceph/ceph.conf
rbd_store_user=openstack
rbd_store_pool=images
rbd_thin_provisioning=False
store_description=dcn02 rbd glance store

[ceph_central]
rbd_store_ceph_conf=/etc/ceph/ceph_central.conf
rbd_store_user=openstack
rbd_store_pool=images
rbd_thin_provisioning=False
store_description=Default glance store backend.
With regards, Swogat Pradhan
On Tue, Mar 21, 2023 at 5:52 PM John Fulton <johfulto@redhat.com> wrote:
On Tue, Mar 21, 2023 at 8:03 AM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi, Seems like cinder is not using the local ceph.
That explains the issue. It's a misconfiguration.
I hope this is not a production system since the mailing list now has the cinder.conf which contains passwords.
The section that looks like this:
[tripleo_ceph]
volume_backend_name=tripleo_ceph
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_ceph_conf=/etc/ceph/ceph.conf
rbd_user=openstack
rbd_pool=volumes
rbd_flatten_volume_from_snapshot=False
rbd_secret_uuid=<redacted>
report_discard_supported=True
Should be updated to refer to the local DCN ceph cluster and not the central one. Use the ceph conf file for that cluster and ensure the rbd_secret_uuid corresponds to that one.
TripleO’s convention is to set the rbd_secret_uuid to the FSID of the Ceph cluster. The FSID should be in the ceph.conf file. The tripleo_nova_libvirt role will use virsh secret-* commands so that libvirt can retrieve the cephx secret using the FSID as a key. This can be confirmed with `podman exec nova_virtsecretd virsh secret-get-value $FSID`.
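A rough way to verify all three ends agree, assuming the usual TripleO container names (cinder_volume is an assumption here; nova_virtsecretd is the one mentioned above):

$ sudo grep fsid /etc/ceph/ceph.conf
$ sudo podman exec cinder_volume grep rbd_secret_uuid /etc/cinder/cinder.conf
$ sudo podman exec nova_virtsecretd virsh secret-get-value <fsid from ceph.conf>

The FSID printed by the first command should match rbd_secret_uuid, and the third command should return the cephx key stored under that secret.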
The documentation describes how to configure the central and DCN sites correctly but an error seems to have occurred while you were following it.
https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features...
John
Ceph Output:

[ceph: root@dcn02-ceph-all-0 /]# rbd -p images ls -l
NAME                                        SIZE     PARENT  FMT  PROT  LOCK
2abfafaa-eff4-4c2e-a538-dc2e1249ab65        8 MiB             2         excl
55f40c8a-8f79-48c5-a52a-9b679b762f19        16 MiB            2
55f40c8a-8f79-48c5-a52a-9b679b762f19@snap   16 MiB            2    yes
59f6a9cd-721c-45b5-a15f-fd021b08160d        321 MiB           2
59f6a9cd-721c-45b5-a15f-fd021b08160d@snap   321 MiB           2    yes
5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0        386 MiB           2
5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0@snap   386 MiB           2    yes
9b27248e-a8cf-4f00-a039-d3e3066cd26a        15 GiB            2
9b27248e-a8cf-4f00-a039-d3e3066cd26a@snap   15 GiB            2    yes
b7356adc-bb47-4c05-968b-6d3c9ca0079b        15 GiB            2
b7356adc-bb47-4c05-968b-6d3c9ca0079b@snap   15 GiB            2    yes
e77e78ad-d369-4a1d-b758-8113621269a3        15 GiB            2
e77e78ad-d369-4a1d-b758-8113621269a3@snap   15 GiB            2    yes

[ceph: root@dcn02-ceph-all-0 /]# rbd -p volumes ls -l
NAME                                          SIZE     PARENT  FMT  PROT  LOCK
volume-c644086f-d3cf-406d-b0f1-7691bde5981d   100 GiB           2
volume-f0969935-a742-4744-9375-80bf323e4d63   10 GiB            2
[ceph: root@dcn02-ceph-all-0 /]#
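To confirm whether those volumes are COW clones of a local image, the parent can be checked from the same cephadm shell (a sketch using one of the volume names listed above):

$ rbd info volumes/volume-c644086f-d3cf-406d-b0f1-7691bde5981d | grep parent

For a clone this prints a parent: images/<image-id>@snap line; no output here matches the empty PARENT column above and means the volume was fully copied rather than cloned.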
Attached the cinder config. Please let me know how I can solve this issue.
With regards, Swogat Pradhan
On Tue, Mar 21, 2023 at 3:53 PM John Fulton <johfulto@redhat.com> wrote:
in my last message under the line "On a DCN site if you run a command like this:" I suggested some steps you could try to confirm the image is a COW from the local glance as well as how to look at your cinder config.
On Wed, Mar 22, 2023 at 9:42 AM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi John,
After some changes I believe Cinder is now trying to pull the image from the local Glance, as I am getting the following error in the cinder-volume log:
2023-03-22 13:32:29.786 108 ERROR oslo_messaging.rpc.server cinder.exception.GlanceConnectionFailed: Connection to glance failed: Error finding address for http://172.25.228.253:9292/v2/images/736d8779-07cd-4510-bab2-adcb653cc538: Unable to establish connection to http://172.25.228.253:9292/v2/images/736d8779-07cd-4510-bab2-adcb653cc538: HTTPConnectionPool(host='172.25.228.253', port=9292): Max retries exceeded with url: /v2/images/736d8779-07cd-4510-bab2-adcb653cc538 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7682d2cd30>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))
The endpoint it is trying to reach is the dcn02 IP address.
But when I check the ports, I don't find port 9292 listening:

[root@dcn02-compute-2 ceph]# netstat -nultp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:2022            0.0.0.0:*               LISTEN      656800/sshd
tcp        0      0 127.0.0.1:199           0.0.0.0:*               LISTEN      4878/snmpd
tcp        0      0 172.25.228.253:2379     0.0.0.0:*               LISTEN      6232/etcd
tcp        0      0 172.25.228.253:2380     0.0.0.0:*               LISTEN      6232/etcd
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      1/systemd
tcp        0      0 127.0.0.1:6640          0.0.0.0:*               LISTEN      2779/ovsdb-server
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      4918/sshd
tcp6       0      0 :::2022                 :::*                    LISTEN      656800/sshd
tcp6       0      0 :::111                  :::*                    LISTEN      1/systemd
tcp6       0      0 :::22                   :::*                    LISTEN      4918/sshd
udp        0      0 0.0.0.0:111             0.0.0.0:*                           1/systemd
udp        0      0 0.0.0.0:161             0.0.0.0:*                           4878/snmpd
udp        0      0 127.0.0.1:323           0.0.0.0:*                           2609/chronyd
udp        0      0 0.0.0.0:6081            0.0.0.0:*                           -
udp6       0      0 :::111                  :::*                                1/systemd
udp6       0      0 ::1:161                 :::*                                4878/snmpd
udp6       0      0 ::1:323                 :::*                                2609/chronyd
udp6       0      0 :::6081                 :::*                                -
I see in glance-api.conf that the bind_port parameter is set to 9292, but the port is not listed in the netstat output. Can you please guide me in getting this port up and running, as I believe this would solve the issue I am facing right now.
Looks like your glance container stopped running. Ask podman to show you all containers (including stopped ones) and investigate why the glance container stopped.
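For anyone following along, a quick way to do that with the TripleO defaults (container name glance_api, managed by a tripleo_glance_api systemd unit; both are assumptions to adjust if your naming differs) is something like:

sudo podman ps -a | grep glance             # list glance containers, including stopped/exited ones
sudo podman logs glance_api                 # look for a crash or traceback in the container log
sudo systemctl status tripleo_glance_api    # the systemd unit that is supposed to keep the container running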
With regards, Swogat Pradhan
On Wed, Mar 22, 2023 at 4:55 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Update: Here is the log from creating a volume using the cirros image:
2023-03-22 11:04:38.449 109 INFO cinder.volume.flows.manager.create_volume [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Volume bf341343-6609-4b8c-b9e0-93e2a89c8c8f: being created as image with specification: {'status': 'creating', 'volume_name': 'volume-bf341343-6609-4b8c-b9e0-93e2a89c8c8f', 'volume_size': 4, 'image_id': '736d8779-07cd-4510-bab2-adcb653cc538', 'image_location': ('rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'dcn02'}}]), 'image_meta': {'name': 'cirros', 'disk_format': 'qcow2', 'container_format': 'bare', 'visibility': 'public', 'size': 16338944, 'virtual_size': 117440512, 'status': 'active', 'checksum': '1d3062cd89af34e419f7100277f38b2b', 'protected': False, 'min_ram': 0, 'min_disk': 0, 'owner': '4160ce999a31485fa643aed0936dfef0', 'os_hidden': False, 'os_hash_algo': 'sha512', 'os_hash_value': '553d220ed58cfee7dafe003c446a9f197ab5edf8ffc09396c74187cf83873c877e7ae041cb80f3b91489acf687183adcd689b53b38e3ddd22e627e7f98a09c46', 'id': '736d8779-07cd-4510-bab2-adcb653cc538', 'created_at': datetime.datetime(2023, 3, 22, 10, 44, 12, tzinfo=datetime.timezone.utc), 'updated_at': datetime.datetime(2023, 3, 22, 10, 54, 1, tzinfo=datetime.timezone.utc), 'locations': [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'dcn02'}}], 'direct_url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'tags': [], 'file': '/v2/images/736d8779-07cd-4510-bab2-adcb653cc538/file', 'stores': 'ceph,dcn02', 'properties': {'os_glance_failed_import': '', 'os_glance_importing_to_stores': '', 'owner_specified.openstack.md5': '', 'owner_specified.openstack.object': 'images/cirros', 'owner_specified.openstack.sha256': ''}}, 'image_service': <cinder.image.glance.GlanceImageService object at 0x7f449ded1198>} 2023-03-22 11:06:16.570 109 INFO cinder.image.image_utils [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Image download 15.58 MB at 0.16 MB/s 2023-03-22 11:07:54.023 109 WARNING py.warnings [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] /usr/lib/python3.6/site-packages/oslo_utils/imageutils.py:75: FutureWarning: The human format is deprecated and the format parameter will be removed. Use explicitly json instead in version 'xena' category=FutureWarning)
2023-03-22 11:11:12.161 109 WARNING py.warnings [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] /usr/lib/python3.6/site-packages/oslo_utils/imageutils.py:75: FutureWarning: The human format is deprecated and the format parameter will be removed. Use explicitly json instead in version 'xena' category=FutureWarning)
2023-03-22 11:11:12.163 109 INFO cinder.image.image_utils [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Converted 112.00 MB image at 112.00 MB/s 2023-03-22 11:11:14.998 109 INFO cinder.volume.flows.manager.create_volume [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Volume volume-bf341343-6609-4b8c-b9e0-93e2a89c8c8f (bf341343-6609-4b8c-b9e0-93e2a89c8c8f): created successfully 2023-03-22 11:11:15.195 109 INFO cinder.volume.manager [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Created volume successfully.
The image is present in the dcn02 store, but it still downloaded the image at 0.16 MB/s and then created the volume.
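One rough way to tell whether this volume was COW-cloned from the local image or fully streamed is to look at its RBD parent on the dcn02 cluster (run wherever the dcn02 ceph.conf and an admin/openstack keyring are available, e.g. inside a cephadm shell on a dcn02 ceph node; the volume and image IDs below come from the log above):

rbd -p volumes ls -l
rbd -p volumes info volume-bf341343-6609-4b8c-b9e0-93e2a89c8c8f | grep -i parent

If it was cloned from the local store, rbd info should show a parent like images/736d8779-07cd-4510-bab2-adcb653cc538@snap; no parent line means the image was downloaded and converted instead, which matches the 0.16 MB/s download seen here.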
With regards, Swogat Pradhan
On Tue, Mar 21, 2023 at 6:10 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi John,
This seems to be an issue. When I deployed the DCN Ceph clusters in both dcn01 and dcn02, the --cluster parameter was set to the respective cluster names, but the config files were still created as ceph.conf and the keyring as ceph.client.openstack.keyring.
This created issues in Glance as well, since the naming convention of the files didn't match the cluster names, so I had to manually rename the central Ceph conf file as follows:
[root@dcn02-compute-0 ~]# cd /var/lib/tripleo-config/ceph/
[root@dcn02-compute-0 ceph]# ll
total 16
-rw-------. 1 root root 257 Mar 13 13:56 ceph_central.client.openstack.keyring
-rw-r--r--. 1 root root 428 Mar 13 13:56 ceph_central.conf
-rw-------. 1 root root 205 Mar 15 18:45 ceph.client.openstack.keyring
-rw-r--r--. 1 root root 362 Mar 15 18:45 ceph.conf
[root@dcn02-compute-0 ceph]#
ceph.conf and ceph.client.openstack.keyring contain the FSID of the respective clusters in both dcn01 and dcn02. In the CLI output above, ceph.conf and ceph.client.openstack.keyring are the files used to access the dcn02 Ceph cluster, and the ceph_central* files are used for accessing the central Ceph cluster.
glance multistore config:

[dcn02]
rbd_store_ceph_conf=/etc/ceph/ceph.conf
rbd_store_user=openstack
rbd_store_pool=images
rbd_thin_provisioning=False
store_description=dcn02 rbd glance store
[ceph_central]
rbd_store_ceph_conf=/etc/ceph/ceph_central.conf
rbd_store_user=openstack
rbd_store_pool=images
rbd_thin_provisioning=False
store_description=Default glance store backend.
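One rough sanity check that each store section points at the cluster you expect is to list the images pool through each conf file from inside the glance_api container (this assumes the confs and openstack keyrings are mounted under /etc/ceph in the container, as the rbd_store_ceph_conf paths above suggest, and that the rbd CLI is present in the image; otherwise run the same commands from the host or a cephadm shell using those files):

sudo podman exec -it glance_api rbd --conf /etc/ceph/ceph.conf --keyring /etc/ceph/ceph.client.openstack.keyring --id openstack -p images ls | head
sudo podman exec -it glance_api rbd --conf /etc/ceph/ceph_central.conf --keyring /etc/ceph/ceph_central.client.openstack.keyring --id openstack -p images ls | head

The first listing should show the images replicated to dcn02 and the second the images in the central cluster.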
With regards, Swogat Pradhan
On Tue, Mar 21, 2023 at 5:52 PM John Fulton <johfulto@redhat.com> wrote:
On Tue, Mar 21, 2023 at 8:03 AM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi, Seems like cinder is not using the local ceph.
That explains the issue. It's a misconfiguration.
I hope this is not a production system since the mailing list now has the cinder.conf which contains passwords.
The section that looks like this:
[tripleo_ceph]
volume_backend_name=tripleo_ceph
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_ceph_conf=/etc/ceph/ceph.conf
rbd_user=openstack
rbd_pool=volumes
rbd_flatten_volume_from_snapshot=False
rbd_secret_uuid=<redacted>
report_discard_supported=True
Should be updated to refer to the local DCN ceph cluster and not the central one. Use the ceph conf file for that cluster and ensure the rbd_secret_uuid corresponds to that one.
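To make the intent concrete, a sketch of the corrected section could look like the following; the conf file name and the FSID placeholder are assumptions that depend on how the dcn02 cluster's files are actually named on that node:

[tripleo_ceph]
volume_backend_name=tripleo_ceph
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_ceph_conf=/etc/ceph/<dcn02 cluster>.conf
rbd_user=openstack
rbd_pool=volumes
rbd_flatten_volume_from_snapshot=False
rbd_secret_uuid=<FSID of the local dcn02 ceph cluster>
report_discard_supported=True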
TripleO’s convention is to set the rbd_secret_uuid to the FSID of the Ceph cluster. The FSID should be in the ceph.conf file. The tripleo_nova_libvirt role will use virsh secret-* commands so that libvirt can retrieve the cephx secret using the FSID as a key. This can be confirmed with `podman exec nova_virtsecretd virsh secret-get-value $FSID`.
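A rough way to cross-check this on the DCN node (the cinder_volume container name is the TripleO default, and the awk line assumes the local cluster's conf is the /etc/ceph/ceph.conf shown earlier; adjust both if your layout differs):

FSID=$(sudo awk '/^fsid/ {print $NF}' /etc/ceph/ceph.conf)                      # FSID of the local dcn02 cluster
sudo podman exec cinder_volume grep rbd_secret_uuid /etc/cinder/cinder.conf    # should print the same value as $FSID
sudo podman exec nova_virtsecretd virsh secret-get-value $FSID                 # libvirt should return the cephx key for that secret

If the grep shows the central cluster's FSID instead, that confirms the misconfiguration described above.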
The documentation describes how to configure the central and DCN sites correctly but an error seems to have occurred while you were following it.
https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features...
John
Ceph Output:

[ceph: root@dcn02-ceph-all-0 /]# rbd -p images ls -l
NAME                                        SIZE     PARENT  FMT  PROT  LOCK
2abfafaa-eff4-4c2e-a538-dc2e1249ab65        8 MiB            2          excl
55f40c8a-8f79-48c5-a52a-9b679b762f19        16 MiB           2
55f40c8a-8f79-48c5-a52a-9b679b762f19@snap   16 MiB           2    yes
59f6a9cd-721c-45b5-a15f-fd021b08160d        321 MiB          2
59f6a9cd-721c-45b5-a15f-fd021b08160d@snap   321 MiB          2    yes
5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0        386 MiB          2
5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0@snap   386 MiB          2    yes
9b27248e-a8cf-4f00-a039-d3e3066cd26a        15 GiB           2
9b27248e-a8cf-4f00-a039-d3e3066cd26a@snap   15 GiB           2    yes
b7356adc-bb47-4c05-968b-6d3c9ca0079b        15 GiB           2
b7356adc-bb47-4c05-968b-6d3c9ca0079b@snap   15 GiB           2    yes
e77e78ad-d369-4a1d-b758-8113621269a3        15 GiB           2
e77e78ad-d369-4a1d-b758-8113621269a3@snap   15 GiB           2    yes
[ceph: root@dcn02-ceph-all-0 /]# rbd -p volumes ls -l
NAME                                          SIZE     PARENT  FMT  PROT  LOCK
volume-c644086f-d3cf-406d-b0f1-7691bde5981d   100 GiB          2
volume-f0969935-a742-4744-9375-80bf323e4d63   10 GiB           2
[ceph: root@dcn02-ceph-all-0 /]#
Attached the cinder config. Please let me know how I can solve this issue.
With regards, Swogat Pradhan
On Tue, Mar 21, 2023 at 3:53 PM John Fulton <johfulto@redhat.com> wrote:
In my last message, under the line "On a DCN site if you run a command like this:", I suggested some steps you could try to confirm the image is a COW clone from the local Glance, as well as how to look at your cinder config.
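For reference, the steps from that earlier message amount to opening a shell against the DCN site's Ceph cluster and checking the PARENT column of the volumes pool (dcn0 is the example name used there; substitute the actual site, e.g. dcn02):

sudo cephadm shell --config /etc/ceph/dcn0.conf --keyring /etc/ceph/dcn0.client.admin.keyring
rbd --cluster dcn0 -p volumes ls -l

A correctly configured site shows each volume with a parent such as images/<glance image id>@snap on the same local cluster; an empty PARENT column means the volume was created by streaming the image rather than cloning it.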
My glance container is running but is in an unhealthy state. I don't see any errors in podman logs glance_api or anywhere.

[root@dcn02-compute-0 ~]# podman ps --all | grep glance
03a07452704a  172.25.201.68:8787/tripleomaster/openstack-glance-api:current-tripleo                        9 days ago      Exited (0) 41 minutes ago                container-puppet-glance_api
b61e96e9f504  172.25.201.68:8787/tripleomaster/openstack-glance-api:current-tripleo  /bin/bash -c chow...  9 days ago      Exited (0) 36 minutes ago                glance_init_logs
ec1734dfb072  172.25.201.68:8787/tripleomaster/openstack-glance-api:current-tripleo  /usr/bin/bootstra...  34 minutes ago  Exited (0) 34 minutes ago                glance_api_db_sync
a8eb5d18b8d6  172.25.201.68:8787/tripleomaster/openstack-glance-api:current-tripleo  kolla_start           31 minutes ago  Up 32 minutes ago (healthy)              glance_api_cron
74a92f45a4a2  172.25.201.68:8787/tripleomaster/openstack-glance-api:current-tripleo  kolla_start           31 minutes ago  Up 32 minutes ago (unhealthy)            glance_api

With regards,
Swogat Pradhan
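Since the container is flagged unhealthy rather than crashed, the next thing to look at is the health check itself. A rough way to do that (the /openstack/healthcheck path is the usual TripleO convention inside these containers; treat it as an assumption for your environment):

sudo podman healthcheck run glance_api ; echo $?        # re-run the health check and show its exit code
sudo podman exec glance_api /openstack/healthcheck      # run the check by hand to see what it actually tests
sudo podman inspect glance_api | grep -i -A 20 health   # recent health check results recorded by podman

In TripleO the glance_api health check typically curls the API port, so an unhealthy container here usually lines up with the refused connections on 9292 seen earlier in the thread.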
On Tue, Mar 21, 2023 at 3:53 PM John Fulton <johfulto@redhat.com> wrote: > > in my last message under the line "On a DCN site if you run a command like this:" I suggested some steps you could try to confirm the image is a COW from the local glance as well as how to look at your cinder config. > > On Tue, Mar 21, 2023, 12:06 AM Swogat Pradhan < swogatpradhan22@gmail.com> wrote: >> >> Update: >> I uploaded an image directly to the dcn02 store, and it takes around 10,15 minutes to create a volume with image in dcn02. >> The image size is 389 MB. >> >> On Mon, Mar 20, 2023 at 10:26 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote: >>> >>> Hi Jhon, >>> I checked in the ceph od dcn02, I can see the images created after importing from the central site. >>> But launching an instance normally fails as it takes a long time for the volume to get created. >>> >>> When launching an instance from volume the instance is getting created properly without any errors. >>> >>> I tried to cache images in nova using https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep... but getting checksum failed error. >>> >>> With regards, >>> Swogat Pradhan >>> >>> On Thu, Mar 16, 2023 at 5:24 PM John Fulton <johfulto@redhat.com> wrote: >>>> >>>> On Wed, Mar 15, 2023 at 8:05 PM Swogat Pradhan >>>> <swogatpradhan22@gmail.com> wrote: >>>> > >>>> > Update: After restarting the nova services on the controller and running the deploy script on the edge site, I was able to launch the VM from volume. >>>> > >>>> > Right now the instance creation is failing as the block device creation is stuck in creating state, it is taking more than 10 mins for the volume to be created, whereas the image has already been imported to the edge glance. >>>> >>>> Try following this document and making the same observations in your >>>> environment for AZs and their local ceph cluster. >>>> >>>> https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features... >>>> >>>> On a DCN site if you run a command like this: >>>> >>>> $ sudo cephadm shell --config /etc/ceph/dcn0.conf --keyring >>>> /etc/ceph/dcn0.client.admin.keyring >>>> $ rbd --cluster dcn0 -p volumes ls -l >>>> NAME SIZE PARENT >>>> FMT PROT LOCK >>>> volume-28c6fc32-047b-4306-ad2d-de2be02716b7 8 GiB >>>> images/8083c7e7-32d8-4f7a-b1da-0ed7884f1076@snap 2 excl >>>> $ >>>> >>>> Then, you should see the parent of the volume is the image which is on >>>> the same local ceph cluster. >>>> >>>> I wonder if something is misconfigured and thus you're encountering >>>> the streaming behavior described here: >>>> >>>> Ideally all images should reside in the central Glance and be copied >>>> to DCN sites before instances of those images are booted on DCN sites. >>>> If an image is not copied to a DCN site before it is booted,
>>>> image will be streamed to the DCN site and then the image will boot as >>>> an instance. This happens because Glance at the DCN site has access to >>>> the images store at the Central ceph cluster. Though the booting of >>>> the image will take time because it has not been copied in advance, >>>> this is still preferable to failing to boot the image. >>>> >>>> You can also exec into the cinder container at the DCN site and >>>> confirm it's using it's local ceph cluster. >>>> >>>> John >>>> >>>> > >>>> > I will try and create a new fresh image and test again then update. >>>> > >>>> > With regards, >>>> > Swogat Pradhan >>>> > >>>> > On Wed, Mar 15, 2023 at 11:13 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote: >>>> >> >>>> >> Update: >>>> >> In the hypervisor list the compute node state is showing down. >>>> >> >>>> >> >>>> >> On Wed, Mar 15, 2023 at 11:11 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote: >>>> >>> >>>> >>> Hi Brendan, >>>> >>> Now i have deployed another site where i have used 2 linux bonds network template for both 3 compute nodes and 3 ceph nodes. >>>> >>> The bonding options is set to mode=802.3ad (lacp=active). >>>> >>> I used a cirros image to launch instance but the instance timed out so i waited for the volume to be created. >>>> >>> Once the volume was created i tried launching the instance from the volume and still the instance is stuck in spawning state. >>>> >>> >>>> >>> Here is the nova-compute log: >>>> >>> >>>> >>> 2023-03-15 17:35:47.739 185437 INFO oslo.privsep.daemon [-]
>>>> >>> 2023-03-15 17:35:47.744 185437 INFO oslo.privsep.daemon [-]
>>>> >>> 2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon [-]
>>>> >>> 2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon [-]
>>>> >>> 2023-03-15 17:35:47.974 8 WARNING os_brick.initiator.connectors.nvmeof [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Process execution error in _get_host_uuid: Unexpected error while running command. >>>> >>> Command: blkid overlay -s UUID -o value >>>> >>> Exit code: 2 >>>> >>> Stdout: '' >>>> >>> Stderr: '': oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command. >>>> >>> 2023-03-15 17:35:51.616 8 INFO nova.virt.libvirt.driver [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] [instance: 450b749c-a10a-4308-80a9-3b8020fee758] Creating image >>>> >>> >>>> >>> It is stuck in creating image, do i need to run the template mentioned here ?: https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep... >>>> >>> >>>> >>> The volume is already created and i do not understand why
On Wed, Mar 22, 2023 at 6:37 AM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Update: Here is the log when creating a volume using cirros image:
2023-03-22 11:04:38.449 109 INFO cinder.volume.flows.manager.create_volume [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Volume bf341343-6609-4b8c-b9e0-93e2a89c8c8f: being created as image with specification: {'status': 'creating', 'volume_name': 'volume-bf341343-6609-4b8c-b9e0-93e2a89c8c8f', 'volume_size': 4, 'image_id': '736d8779-07cd-4510-bab2-adcb653cc538', 'image_location': ('rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'dcn02'}}]), 'image_meta': {'name': 'cirros', 'disk_format': 'qcow2', 'container_format': 'bare', 'visibility': 'public', 'size': 16338944, 'virtual_size': 117440512, 'status': 'active', 'checksum': '1d3062cd89af34e419f7100277f38b2b', 'protected': False, 'min_ram': 0, 'min_disk': 0, 'owner': '4160ce999a31485fa643aed0936dfef0', 'os_hidden': False, 'os_hash_algo': 'sha512', 'os_hash_value': '553d220ed58cfee7dafe003c446a9f197ab5edf8ffc09396c74187cf83873c877e7ae041cb80f3b91489acf687183adcd689b53b38e3ddd22e627e7f98a09c46', 'id': '736d8779-07cd-4510-bab2-adcb653cc538', 'created_at': datetime.datetime(2023, 3, 22, 10, 44, 12, tzinfo=datetime.timezone.utc), 'updated_at': datetime.datetime(2023, 3, 22, 10, 54, 1, tzinfo=datetime.timezone.utc), 'locations': [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'dcn02'}}], 'direct_url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'tags': [], 'file': '/v2/images/736d8779-07cd-4510-bab2-adcb653cc538/file', 'stores': 'ceph,dcn02', 'properties': {'os_glance_failed_import': '', 'os_glance_importing_to_stores': '', 'owner_specified.openstack.md5': '', 'owner_specified.openstack.object': 'images/cirros', 'owner_specified.openstack.sha256': ''}}, 'image_service': <cinder.image.glance.GlanceImageService object at 0x7f449ded1198>} 2023-03-22 11:06:16.570 109 INFO cinder.image.image_utils [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Image download 15.58 MB at 0.16 MB/s
As Adam Savage would say, well there's your problem ^^ (Image download 15.58 MB at 0.16 MB/s). Downloading the image takes too long, and 0.16 MB/s suggests you have a network issue.

John Fulton previously stated your cinder-volume service at the edge site is not using the local ceph image store. Assuming you are deploying GlanceApiEdge service [1], then the cinder-volume service should be configured to use the local glance service [2]. You should check cinder's glance_api_servers to confirm it's the edge site's glance service.

[1] https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/envi...
[2] https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/depl...

Alan
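For example, something along these lines should show which glance endpoint cinder-volume is actually using (this is only a sketch: it assumes the cinder-volume container is named cinder_volume as in a typical tripleo deployment, and the URL shown is just a placeholder for whatever the edge site's internal API VIP is):

$ sudo podman exec cinder_volume grep glance_api_servers /etc/cinder/cinder.conf
glance_api_servers=http://<edge-internal-api-vip>:9292

The same file is usually also visible on the host under /var/lib/config-data/puppet-generated/cinder/etc/cinder/cinder.conf. If the value points at the central site's glance rather than the edge glance, that would line up with the slow, non-local image download.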
2023-03-22 11:07:54.023 109 WARNING py.warnings [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] /usr/lib/python3.6/site-packages/oslo_utils/imageutils.py:75: FutureWarning: The human format is deprecated and the format parameter will be removed. Use explicitly json instead in version 'xena' category=FutureWarning)
2023-03-22 11:11:12.161 109 WARNING py.warnings [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] /usr/lib/python3.6/site-packages/oslo_utils/imageutils.py:75: FutureWarning: The human format is deprecated and the format parameter will be removed. Use explicitly json instead in version 'xena' category=FutureWarning)
2023-03-22 11:11:12.163 109 INFO cinder.image.image_utils [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Converted 112.00 MB image at 112.00 MB/s
2023-03-22 11:11:14.998 109 INFO cinder.volume.flows.manager.create_volume [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Volume volume-bf341343-6609-4b8c-b9e0-93e2a89c8c8f (bf341343-6609-4b8c-b9e0-93e2a89c8c8f): created successfully
2023-03-22 11:11:15.195 109 INFO cinder.volume.manager [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Created volume successfully.
The image is present in the dcn02 store, but it still downloaded the image at 0.16 MB/s and then created the volume.
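One quick way to confirm which stores actually hold the image (assuming the glance client is available and the multistore "stores" field is populated; the image ID is the one from the log above) is something like:

$ glance image-show 736d8779-07cd-4510-bab2-adcb653cc538 | grep -i stores

which should list both ceph and dcn02 if the copy to the edge store really succeeded.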
With regards, Swogat Pradhan
On Tue, Mar 21, 2023 at 6:10 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi John,
This seems to be an issue. When I deployed the DCN ceph in both dcn01 and dcn02, the --cluster parameter was specified with the respective cluster names, but the config files were created as ceph.conf and the keyring as ceph.client.openstack.keyring.
This created issues in glance as well, since the naming convention of the files didn't match the cluster names, so I had to manually rename the central ceph conf file as follows:
[root@dcn02-compute-0 ~]# cd /var/lib/tripleo-config/ceph/
[root@dcn02-compute-0 ceph]# ll
total 16
-rw-------. 1 root root 257 Mar 13 13:56 ceph_central.client.openstack.keyring
-rw-r--r--. 1 root root 428 Mar 13 13:56 ceph_central.conf
-rw-------. 1 root root 205 Mar 15 18:45 ceph.client.openstack.keyring
-rw-r--r--. 1 root root 362 Mar 15 18:45 ceph.conf
[root@dcn02-compute-0 ceph]#
ceph.conf and ceph.client.openstack.keyring contain the fsid of the respective clusters in both dcn01 and dcn02. In the above cli output, ceph.conf and ceph.client... are the files used to access the dcn02 ceph cluster, and the ceph_central* files are used for accessing the central ceph cluster.
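A quick sanity check of that mapping (the two cluster IDs are the ones visible in the image locations in the cinder log above: a5ae877c... for the central cluster and a8d5f1f5... for dcn02) is to grep the fsid out of each file:

[root@dcn02-compute-0 ceph]# grep fsid ceph.conf ceph_central.conf

ceph.conf should carry the dcn02 fsid and ceph_central.conf the central one; if they are swapped, glance and cinder will quietly talk to the wrong cluster.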
glance multistore config:

[dcn02]
rbd_store_ceph_conf=/etc/ceph/ceph.conf
rbd_store_user=openstack
rbd_store_pool=images
rbd_thin_provisioning=False
store_description=dcn02 rbd glance store

[ceph_central]
rbd_store_ceph_conf=/etc/ceph/ceph_central.conf
rbd_store_user=openstack
rbd_store_pool=images
rbd_thin_provisioning=False
store_description=Default glance store backend.
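For what it's worth, those backend sections are normally accompanied by the standard glance multistore switches, roughly like this (a sketch only, values illustrative):

[DEFAULT]
enabled_backends = dcn02:rbd, ceph_central:rbd

[glance_store]
default_backend = dcn02

With default_backend pointing at the local dcn02 store, anything glance writes at the edge site lands on the local ceph cluster unless another store is requested explicitly.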
With regards, Swogat Pradhan
On Tue, Mar 21, 2023 at 5:52 PM John Fulton <johfulto@redhat.com> wrote:
On Tue, Mar 21, 2023 at 8:03 AM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi, Seems like cinder is not using the local ceph.
That explains the issue. It's a misconfiguration.
I hope this is not a production system since the mailing list now has the cinder.conf which contains passwords.
The section that looks like this:
[tripleo_ceph]
volume_backend_name=tripleo_ceph
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_ceph_conf=/etc/ceph/ceph.conf
rbd_user=openstack
rbd_pool=volumes
rbd_flatten_volume_from_snapshot=False
rbd_secret_uuid=<redacted>
report_discard_supported=True
Should be updated to refer to the local DCN ceph cluster and not the central one. Use the ceph conf file for that cluster and ensure the rbd_secret_uuid corresponds to that one.
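As a sketch of what that would mean here (placeholders only; the conf file name must be whichever one actually points at the dcn02 cluster on those nodes, and <dcn02 fsid> is that cluster's FSID):

[tripleo_ceph]
volume_backend_name=tripleo_ceph
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_ceph_conf=/etc/ceph/<dcn02 cluster>.conf
rbd_user=openstack
rbd_pool=volumes
rbd_flatten_volume_from_snapshot=False
rbd_secret_uuid=<dcn02 fsid>
report_discard_supported=True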
TripleO’s convention is to set the rbd_secret_uuid to the FSID of the Ceph cluster. The FSID should be in the ceph.conf file. The tripleo_nova_libvirt role will use virsh secret-* commands so that libvirt can retrieve the cephx secret using the FSID as a key. This can be confirmed with `podman exec nova_virtsecretd virsh secret-get-value $FSID`.
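A quick way to cross-check on one of the edge nodes (container name as in the command above) is:

$ sudo grep fsid /var/lib/tripleo-config/ceph/ceph.conf
$ sudo podman exec nova_virtsecretd virsh secret-list
$ sudo podman exec nova_virtsecretd virsh secret-get-value <fsid printed by the first command>

If there is no libvirt secret for the FSID that cinder's rbd_secret_uuid points at, volume attachments from that cluster will fail.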
The documentation describes how to configure the central and DCN sites correctly but an error seems to have occurred while you were following it.
https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features...
John
Ceph Output:

[ceph: root@dcn02-ceph-all-0 /]# rbd -p images ls -l
NAME SIZE PARENT FMT PROT LOCK
2abfafaa-eff4-4c2e-a538-dc2e1249ab65 8 MiB 2 excl
55f40c8a-8f79-48c5-a52a-9b679b762f19 16 MiB 2
55f40c8a-8f79-48c5-a52a-9b679b762f19@snap 16 MiB 2 yes
59f6a9cd-721c-45b5-a15f-fd021b08160d 321 MiB 2
59f6a9cd-721c-45b5-a15f-fd021b08160d@snap 321 MiB 2 yes
5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0 386 MiB 2
5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0@snap 386 MiB 2 yes
9b27248e-a8cf-4f00-a039-d3e3066cd26a 15 GiB 2
9b27248e-a8cf-4f00-a039-d3e3066cd26a@snap 15 GiB 2 yes
b7356adc-bb47-4c05-968b-6d3c9ca0079b 15 GiB 2
b7356adc-bb47-4c05-968b-6d3c9ca0079b@snap 15 GiB 2 yes
e77e78ad-d369-4a1d-b758-8113621269a3 15 GiB 2
e77e78ad-d369-4a1d-b758-8113621269a3@snap 15 GiB 2 yes

[ceph: root@dcn02-ceph-all-0 /]# rbd -p volumes ls -l
NAME SIZE PARENT FMT PROT LOCK
volume-c644086f-d3cf-406d-b0f1-7691bde5981d 100 GiB 2
volume-f0969935-a742-4744-9375-80bf323e4d63 10 GiB 2
[ceph: root@dcn02-ceph-all-0 /]#
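If those volumes had been COW-cloned from the local image store, the PARENT column above would not be empty; the same thing can be checked per volume with a plain rbd command, e.g.:

[ceph: root@dcn02-ceph-all-0 /]# rbd -p volumes info volume-f0969935-a742-4744-9375-80bf323e4d63 | grep parent

A clone shows a line like "parent: images/<image id>@snap"; no parent line at all is consistent with cinder downloading the image and writing a full copy instead of cloning it.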
Attached the cinder config. Please let me know how I can solve this issue.
With regards, Swogat Pradhan
On Tue, Mar 21, 2023 at 3:53 PM John Fulton <johfulto@redhat.com> wrote:
in my last message under the line "On a DCN site if you run a command like this:" I suggested some steps you could try to confirm the image is a COW from the local glance as well as how to look at your cinder config.
On Tue, Mar 21, 2023, 12:06 AM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Update: I uploaded an image directly to the dcn02 store, and it takes around 10-15 minutes to create a volume from that image in dcn02.
The image size is 389 MB.
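For reference, the usual way to get an image into the edge store ahead of time is the copy-image workflow from the DCN documentation, roughly (command shown from memory, so double-check the exact flags against the doc; the ID is illustrative):

$ glance image-import 736d8779-07cd-4510-bab2-adcb653cc538 --stores dcn02 --import-method copy-image

Even with the image local, though, volume creation is only fast if cinder clones it from the local ceph rather than pulling it through glance.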
On Mon, Mar 20, 2023 at 10:26 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote: > > Hi Jhon, > I checked in the ceph od dcn02, I can see the images created after importing from the central site. > But launching an instance normally fails as it takes a long time for the volume to get created. > > When launching an instance from volume the instance is getting created properly without any errors. > > I tried to cache images in nova using https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep... but getting checksum failed error. > > With regards, > Swogat Pradhan > > On Thu, Mar 16, 2023 at 5:24 PM John Fulton <johfulto@redhat.com> wrote: >> >> On Wed, Mar 15, 2023 at 8:05 PM Swogat Pradhan >> <swogatpradhan22@gmail.com> wrote: >> > >> > Update: After restarting the nova services on the controller and running the deploy script on the edge site, I was able to launch the VM from volume. >> > >> > Right now the instance creation is failing as the block device creation is stuck in creating state, it is taking more than 10 mins for the volume to be created, whereas the image has already been imported to the edge glance. >> >> Try following this document and making the same observations in your >> environment for AZs and their local ceph cluster. >> >> https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features... >> >> On a DCN site if you run a command like this: >> >> $ sudo cephadm shell --config /etc/ceph/dcn0.conf --keyring >> /etc/ceph/dcn0.client.admin.keyring >> $ rbd --cluster dcn0 -p volumes ls -l >> NAME SIZE PARENT >> FMT PROT LOCK >> volume-28c6fc32-047b-4306-ad2d-de2be02716b7 8 GiB >> images/8083c7e7-32d8-4f7a-b1da-0ed7884f1076@snap 2 excl >> $ >> >> Then, you should see the parent of the volume is the image which is on >> the same local ceph cluster. >> >> I wonder if something is misconfigured and thus you're encountering >> the streaming behavior described here: >> >> Ideally all images should reside in the central Glance and be copied >> to DCN sites before instances of those images are booted on DCN sites. >> If an image is not copied to a DCN site before it is booted, then
>> image will be streamed to the DCN site and then the image will boot as >> an instance. This happens because Glance at the DCN site has access to >> the images store at the Central ceph cluster. Though the booting of >> the image will take time because it has not been copied in advance, >> this is still preferable to failing to boot the image. >> >> You can also exec into the cinder container at the DCN site and >> confirm it's using it's local ceph cluster. >> >> John >> >> > >> > I will try and create a new fresh image and test again then update. >> > >> > With regards, >> > Swogat Pradhan >> > >> > On Wed, Mar 15, 2023 at 11:13 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote: >> >> >> >> Update: >> >> In the hypervisor list the compute node state is showing down. >> >> >> >> >> >> On Wed, Mar 15, 2023 at 11:11 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote: >> >>> >> >>> Hi Brendan, >> >>> Now i have deployed another site where i have used 2 linux bonds network template for both 3 compute nodes and 3 ceph nodes. >> >>> The bonding options is set to mode=802.3ad (lacp=active). >> >>> I used a cirros image to launch instance but the instance timed out so i waited for the volume to be created. >> >>> Once the volume was created i tried launching the instance from the volume and still the instance is stuck in spawning state. >> >>> >> >>> Here is the nova-compute log: >> >>> >> >>> 2023-03-15 17:35:47.739 185437 INFO oslo.privsep.daemon [-]
>> >>> 2023-03-15 17:35:47.744 185437 INFO oslo.privsep.daemon [-]
>> >>> 2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon [-]
>> >>> 2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon [-]
>> >>> 2023-03-15 17:35:47.974 8 WARNING os_brick.initiator.connectors.nvmeof [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Process execution error in _get_host_uuid: Unexpected error while running command. >> >>> Command: blkid overlay -s UUID -o value >> >>> Exit code: 2 >> >>> Stdout: '' >> >>> Stderr: '': oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command. >> >>> 2023-03-15 17:35:51.616 8 INFO nova.virt.libvirt.driver [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] [instance: 450b749c-a10a-4308-80a9-3b8020fee758] Creating image >> >>> >> >>> It is stuck in creating image, do i need to run the template mentioned here ?: https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep... >> >>> >> >>> The volume is already created and i do not understand why the instance is stuck in spawning state. >> >>> >> >>> With regards, >> >>> Swogat Pradhan >> >>> >> >>> >> >>> On Sun, Mar 5, 2023 at 4:02 PM Brendan Shephard < bshephar@redhat.com> wrote: >> >>>> >> >>>> Does your environment use different network interfaces for each of the networks? Or does it have a bond with everything on it? >> >>>> >> >>>> One issue I have seen before is that when launching instances, there is a lot of network traffic between nodes as the hypervisor needs to download the image from Glance. Along with various other services sending normal network traffic, it can be enough to cause issues if everything is running over a single 1Gbe interface. >> >>>> >> >>>> I have seen the same situation in fact when using a single active/backup bond on 1Gbe nics. It’s worth checking the network traffic while you try to spawn the instance to see if you’re dropping packets. In
Hi Adam,
The systems are on the same LAN. In this case it seemed like the image was being pulled from the central site, which was caused by a misconfiguration in the ceph.conf file in the /var/lib/tripleo-config/ceph/ directory; that seems to have been resolved after the changes I made to fix it.

Right now the glance api podman container is running in an unhealthy state, the podman logs don't show any error whatsoever, and when I issue the command netstat -nultp I do not see any entry for the glance port, i.e. 9292, in the dcn site, which is why cinder is throwing an error stating:

2023-03-22 13:32:29.786 108 ERROR oslo_messaging.rpc.server cinder.exception.GlanceConnectionFailed: Connection to glance failed: Error finding address for http://172.25.228.253:9292/v2/images/736d8779-07cd-4510-bab2-adcb653cc538: Unable to establish connection to http://172.25.228.253:9292/v2/images/736d8779-07cd-4510-bab2-adcb653cc538: HTTPConnectionPool(host='172.25.228.253', port=9292): Max retries exceeded with url: /v2/images/736d8779-07cd-4510-bab2-adcb653cc538 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7682d2cd30>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))

Now I need to find out why the port is not listening even though the glance service is running, which I am not sure how to do.

With regards,
Swogat Pradhan
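A few checks that usually narrow this kind of thing down (container names and log paths are the typical tripleo ones, so adjust if yours differ):

$ sudo podman ps -a --filter name=glance
$ sudo podman logs --tail 50 glance_api
$ sudo tail -n 50 /var/log/containers/glance/api.log
$ sudo ss -ntlp | grep 9292

If glance-api is failing during startup (for example because one of the rbd backends can't reach its cluster with the renamed conf/keyring files), the container can show as running but unhealthy while nothing is actually bound on port 9292.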
On Wed, Mar 22, 2023 at 6:37 AM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Update: Here is the log when creating a volume using cirros image:
2023-03-22 11:04:38.449 109 INFO cinder.volume.flows.manager.create_volume [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Volume bf341343-6609-4b8c-b9e0-93e2a89c8c8f: being created as image with specification: {'status': 'creating', 'volume_name': 'volume-bf341343-6609-4b8c-b9e0-93e2a89c8c8f', 'volume_size': 4, 'image_id': '736d8779-07cd-4510-bab2-adcb653cc538', 'image_location': ('rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'dcn02'}}]), 'image_meta': {'name': 'cirros', 'disk_format': 'qcow2', 'container_format': 'bare', 'visibility': 'public', 'size': 16338944, 'virtual_size': 117440512, 'status': 'active', 'checksum': '1d3062cd89af34e419f7100277f38b2b', 'protected': False, 'min_ram': 0, 'min_disk': 0, 'owner': '4160ce999a31485fa643aed0936dfef0', 'os_hidden': False, 'os_hash_algo': 'sha512', 'os_hash_value': '553d220ed58cfee7dafe003c446a9f197ab5edf8ffc09396c74187cf83873c877e7ae041cb80f3b91489acf687183adcd689b53b38e3ddd22e627e7f98a09c46', 'id': '736d8779-07cd-4510-bab2-adcb653cc538', 'created_at': datetime.datetime(2023, 3, 22, 10, 44, 12, tzinfo=datetime.timezone.utc), 'updated_at': datetime.datetime(2023, 3, 22, 10, 54, 1, tzinfo=datetime.timezone.utc), 'locations': [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'dcn02'}}], 'direct_url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'tags': [], 'file': '/v2/images/736d8779-07cd-4510-bab2-adcb653cc538/file', 'stores': 'ceph,dcn02', 'properties': {'os_glance_failed_import': '', 'os_glance_importing_to_stores': '', 'owner_specified.openstack.md5': '', 'owner_specified.openstack.object': 'images/cirros', 'owner_specified.openstack.sha256': ''}}, 'image_service': <cinder.image.glance.GlanceImageService object at 0x7f449ded1198>} 2023-03-22 11:06:16.570 109 INFO cinder.image.image_utils [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Image download 15.58 MB at 0.16 MB/s
As Adam Savage would say, well there's your problem ^^ (Image download 15.58 MB at 0.16 MB/s). Downloading the image takes too long, and 0.16 MB/s suggests you have a network issue.
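As a rough sanity check of the path between the sites (a sketch; replace the address with a central-site endpoint you actually use, and iperf3 only applies if it is installed on both ends):

$ ping -c 20 -M do -s 1472 <central-site-address>   # full-size, non-fragmenting pings; loss or "message too long" here points at MTU/tunnel problems
$ iperf3 -c <central-site-host>                     # measure the usable bandwidth across the tunnel, if iperf3 is available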
John Fulton previously stated your cinder-volume service at the edge site is not using the local ceph image store. Assuming you are deploying GlanceApiEdge service [1], then the cinder-volume service should be configured to use the local glance service [2]. You should check cinder's glance_api_servers to confirm it's the edge site's glance service.
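One quick way to check this (a sketch; cinder_volume is the usual TripleO container name and the paths may differ in your deployment):

$ sudo podman exec cinder_volume grep -E '^glance_api_servers' /etc/cinder/cinder.conf
# or, on the host:
$ sudo grep -E '^glance_api_servers' /var/lib/config-data/puppet-generated/cinder/etc/cinder/cinder.conf

The value should point at the edge site's glance endpoint rather than the central VIP.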
[1] https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/envi... [2] https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/depl...
Alan
2023-03-22 11:07:54.023 109 WARNING py.warnings [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] /usr/lib/python3.6/site-packages/oslo_utils/imageutils.py:75: FutureWarning: The human format is deprecated and the format parameter will be removed. Use explicitly json instead in version 'xena' category=FutureWarning)
2023-03-22 11:11:12.161 109 WARNING py.warnings [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] /usr/lib/python3.6/site-packages/oslo_utils/imageutils.py:75: FutureWarning: The human format is deprecated and the format parameter will be removed. Use explicitly json instead in version 'xena' category=FutureWarning)
2023-03-22 11:11:12.163 109 INFO cinder.image.image_utils [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Converted 112.00 MB image at 112.00 MB/s 2023-03-22 11:11:14.998 109 INFO cinder.volume.flows.manager.create_volume [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Volume volume-bf341343-6609-4b8c-b9e0-93e2a89c8c8f (bf341343-6609-4b8c-b9e0-93e2a89c8c8f): created successfully 2023-03-22 11:11:15.195 109 INFO cinder.volume.manager [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Created volume successfully.
The image is present in the dcn02 store, but it still downloaded the image at 0.16 MB/s and then created the volume.
With regards, Swogat Pradhan
On Tue, Mar 21, 2023 at 6:10 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi John, This seems to be an issue. When I deployed the dcn ceph in both dcn01 and dcn02, the --cluster parameter was set to the respective cluster names, but the config files were still created as ceph.conf and the keyring as ceph.client.openstack.keyring.
This created issues in glance as well, since the naming convention of the files didn't match the cluster names, so I had to manually rename the central ceph conf files as follows:
[root@dcn02-compute-0 ~]# cd /var/lib/tripleo-config/ceph/ [root@dcn02-compute-0 ceph]# ll total 16 -rw-------. 1 root root 257 Mar 13 13:56 ceph_central.client.openstack.keyring -rw-r--r--. 1 root root 428 Mar 13 13:56 ceph_central.conf -rw-------. 1 root root 205 Mar 15 18:45 ceph.client.openstack.keyring -rw-r--r--. 1 root root 362 Mar 15 18:45 ceph.conf [root@dcn02-compute-0 ceph]#
ceph.conf and ceph.client.openstack.keyring contain the fsid of the respective clusters in both dcn01 and dcn02. In the above cli output, ceph.conf and ceph.client.openstack.keyring are the files used to access the dcn02 ceph cluster, and the ceph_central* files are used for accessing the central ceph cluster.
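One way to double-check that each file really points at the cluster you expect is to compare the fsids (a sketch, using the paths from the listing above; glance_api as the container name is an assumption based on TripleO defaults):

$ grep fsid /var/lib/tripleo-config/ceph/ceph.conf /var/lib/tripleo-config/ceph/ceph_central.conf
$ sudo podman exec glance_api grep fsid /etc/ceph/ceph.conf /etc/ceph/ceph_central.conf

The first fsid should match the local dcn02 cluster and the second the central cluster.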
glance multistore config: [dcn02] rbd_store_ceph_conf=/etc/ceph/ceph.conf rbd_store_user=openstack rbd_store_pool=images rbd_thin_provisioning=False store_description=dcn02 rbd glance store
[ceph_central] rbd_store_ceph_conf=/etc/ceph/ceph_central.conf rbd_store_user=openstack rbd_store_pool=images rbd_thin_provisioning=False store_description=Default glance store backend.
With regards, Swogat Pradhan
On Tue, Mar 21, 2023 at 5:52 PM John Fulton <johfulto@redhat.com> wrote:
On Tue, Mar 21, 2023 at 8:03 AM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi, Seems like cinder is not using the local ceph.
That explains the issue. It's a misconfiguration.
I hope this is not a production system since the mailing list now has the cinder.conf which contains passwords.
The section that looks like this:
[tripleo_ceph] volume_backend_name=tripleo_ceph volume_driver=cinder.volume.drivers.rbd.RBDDriver rbd_ceph_conf=/etc/ceph/ceph.conf rbd_user=openstack rbd_pool=volumes rbd_flatten_volume_from_snapshot=False rbd_secret_uuid=<redacted> report_discard_supported=True
Should be updated to refer to the local DCN ceph cluster and not the central one. Use the ceph conf file for that cluster and ensure the rbd_secret_uuid corresponds to that one.
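For example, the corrected section might look roughly like this (a sketch, not your exact values; rbd_ceph_conf must name whichever conf file on that node points at the local DCN cluster, and rbd_secret_uuid must be that cluster's FSID):

[tripleo_ceph]
volume_backend_name=tripleo_ceph
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_ceph_conf=/etc/ceph/<local-dcn-cluster>.conf
rbd_user=openstack
rbd_pool=volumes
rbd_flatten_volume_from_snapshot=False
rbd_secret_uuid=<fsid of the local DCN ceph cluster>
report_discard_supported=True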
TripleO’s convention is to set the rbd_secret_uuid to the FSID of the Ceph cluster. The FSID should be in the ceph.conf file. The tripleo_nova_libvirt role will use virsh secret-* commands so that libvirt can retrieve the cephx secret using the FSID as a key. This can be confirmed with `podman exec nova_virtsecretd virsh secret-get-value $FSID`.
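A sketch of how to confirm that on a DCN node (the conf file name is whatever your local cluster actually uses):

$ sudo grep fsid /etc/ceph/<local-dcn-cluster>.conf
$ sudo podman exec nova_virtsecretd virsh secret-list                        # a secret with that UUID should be listed
$ sudo podman exec nova_virtsecretd virsh secret-get-value <fsid-from-above>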
The documentation describes how to configure the central and DCN sites correctly but an error seems to have occurred while you were following it.
https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features...
John
Ceph Output: [ceph: root@dcn02-ceph-all-0 /]# rbd -p images ls -l NAME SIZE PARENT FMT PROT LOCK
2abfafaa-eff4-4c2e-a538-dc2e1249ab65 8 MiB 2 excl 55f40c8a-8f79-48c5-a52a-9b679b762f19 16 MiB 2 55f40c8a-8f79-48c5-a52a-9b679b762f19@snap 16 MiB 2 yes 59f6a9cd-721c-45b5-a15f-fd021b08160d 321 MiB 2 59f6a9cd-721c-45b5-a15f-fd021b08160d@snap 321 MiB 2 yes 5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0 386 MiB 2 5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0@snap 386 MiB 2 yes 9b27248e-a8cf-4f00-a039-d3e3066cd26a 15 GiB 2 9b27248e-a8cf-4f00-a039-d3e3066cd26a@snap 15 GiB 2 yes b7356adc-bb47-4c05-968b-6d3c9ca0079b 15 GiB 2 b7356adc-bb47-4c05-968b-6d3c9ca0079b@snap 15 GiB 2 yes e77e78ad-d369-4a1d-b758-8113621269a3 15 GiB 2 e77e78ad-d369-4a1d-b758-8113621269a3@snap 15 GiB 2 yes
[ceph: root@dcn02-ceph-all-0 /]# rbd -p volumes ls -l NAME SIZE PARENT FMT PROT LOCK volume-c644086f-d3cf-406d-b0f1-7691bde5981d 100 GiB 2 volume-f0969935-a742-4744-9375-80bf323e4d63 10 GiB 2 [ceph: root@dcn02-ceph-all-0 /]#
Attached the cinder config. Please let me know how I can solve this issue.
With regards, Swogat Pradhan
On Tue, Mar 21, 2023 at 3:53 PM John Fulton <johfulto@redhat.com> wrote:
in my last message under the line "On a DCN site if you run a
command like this:" I suggested some steps you could try to confirm the image is a COW from the local glance as well as how to look at your cinder config.
On Tue, Mar 21, 2023, 12:06 AM Swogat Pradhan <
swogatpradhan22@gmail.com> wrote:
> > Update: > I uploaded an image directly to the dcn02 store, and it takes around 10,15 minutes to create a volume with image in dcn02. > The image size is 389 MB. > > On Mon, Mar 20, 2023 at 10:26 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote: >> >> Hi Jhon, >> I checked in the ceph od dcn02, I can see the images created after importing from the central site. >> But launching an instance normally fails as it takes a long time for the volume to get created. >> >> When launching an instance from volume the instance is getting created properly without any errors. >> >> I tried to cache images in nova using https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep... but getting checksum failed error. >> >> With regards, >> Swogat Pradhan >> >> On Thu, Mar 16, 2023 at 5:24 PM John Fulton <johfulto@redhat.com> wrote: >>> >>> On Wed, Mar 15, 2023 at 8:05 PM Swogat Pradhan >>> <swogatpradhan22@gmail.com> wrote: >>> > >>> > Update: After restarting the nova services on the controller and running the deploy script on the edge site, I was able to launch the VM from volume. >>> > >>> > Right now the instance creation is failing as the block device creation is stuck in creating state, it is taking more than 10 mins for the volume to be created, whereas the image has already been imported to the edge glance. >>> >>> Try following this document and making the same observations in your >>> environment for AZs and their local ceph cluster. >>> >>> https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features... >>> >>> On a DCN site if you run a command like this: >>> >>> $ sudo cephadm shell --config /etc/ceph/dcn0.conf --keyring >>> /etc/ceph/dcn0.client.admin.keyring >>> $ rbd --cluster dcn0 -p volumes ls -l >>> NAME SIZE PARENT >>> FMT PROT LOCK >>> volume-28c6fc32-047b-4306-ad2d-de2be02716b7 8 GiB >>> images/8083c7e7-32d8-4f7a-b1da-0ed7884f1076@snap 2 excl >>> $ >>> >>> Then, you should see the parent of the volume is the image which is on >>> the same local ceph cluster. >>> >>> I wonder if something is misconfigured and thus you're encountering >>> the streaming behavior described here: >>> >>> Ideally all images should reside in the central Glance and be copied >>> to DCN sites before instances of those images are booted on DCN sites. >>> If an image is not copied to a DCN site before it is booted, then
>>> image will be streamed to the DCN site and then the image will boot as >>> an instance. This happens because Glance at the DCN site has access to >>> the images store at the Central ceph cluster. Though the booting of >>> the image will take time because it has not been copied in advance, >>> this is still preferable to failing to boot the image. >>> >>> You can also exec into the cinder container at the DCN site and >>> confirm it's using it's local ceph cluster. >>> >>> John >>> >>> > >>> > I will try and create a new fresh image and test again then update. >>> > >>> > With regards, >>> > Swogat Pradhan >>> > >>> > On Wed, Mar 15, 2023 at 11:13 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote: >>> >> >>> >> Update: >>> >> In the hypervisor list the compute node state is showing down. >>> >> >>> >> >>> >> On Wed, Mar 15, 2023 at 11:11 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote: >>> >>> >>> >>> Hi Brendan, >>> >>> Now i have deployed another site where i have used 2 linux bonds network template for both 3 compute nodes and 3 ceph nodes. >>> >>> The bonding options is set to mode=802.3ad (lacp=active). >>> >>> I used a cirros image to launch instance but the instance timed out so i waited for the volume to be created. >>> >>> Once the volume was created i tried launching the instance from the volume and still the instance is stuck in spawning state. >>> >>> >>> >>> Here is the nova-compute log: >>> >>> >>> >>> 2023-03-15 17:35:47.739 185437 INFO oslo.privsep.daemon [-]
privsep daemon starting
>>> >>> 2023-03-15 17:35:47.744 185437 INFO oslo.privsep.daemon [-] privsep process running with uid/gid: 0/0
>>> >>> 2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon [-] privsep process running with capabilities (eff/prm/inh): CAP_SYS_ADMIN/CAP_SYS_ADMIN/none
>>> >>> 2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon [-] privsep daemon running as pid 185437
>>> >>> 2023-03-15 17:35:47.974 8 WARNING os_brick.initiator.connectors.nvmeof [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Process execution error in _get_host_uuid: Unexpected error while running command. >>> >>> Command: blkid overlay -s UUID -o value >>> >>> Exit code: 2 >>> >>> Stdout: '' >>> >>> Stderr: '': oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command. >>> >>> 2023-03-15 17:35:51.616 8 INFO nova.virt.libvirt.driver [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] [instance: 450b749c-a10a-4308-80a9-3b8020fee758] Creating image >>> >>> >>> >>> It is stuck in creating image, do i need to run the template mentioned here ?: https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep... >>> >>> >>> >>> The volume is already created and i do not understand why the instance is stuck in spawning state. >>> >>> >>> >>> With regards, >>> >>> Swogat Pradhan >>> >>> >>> >>> >>> >>> On Sun, Mar 5, 2023 at 4:02 PM Brendan Shephard < bshephar@redhat.com> wrote: >>> >>>> >>> >>>> Does your environment use different network interfaces for each of the networks? Or does it have a bond with everything on it? >>> >>>> >>> >>>> One issue I have seen before is that when launching instances, there is a lot of network traffic between nodes as the hypervisor needs to download the image from Glance. Along with various other services sending normal network traffic, it can be enough to cause issues if everything is running over a single 1Gbe interface. >>> >>>> >>> >>>> I have seen the same situation in fact when using a single active/backup bond on 1Gbe nics. It’s worth checking the network traffic while you try to spawn the instance to see if you’re dropping packets. In
the situation I described, there were dropped packets which resulted in a loss of communication between nova_compute and RMQ, so the node appeared offline. You should also confirm that nova_compute is being disconnected in the nova_compute logs if you tail them on the Hypervisor while spawning the instance.
>>> >>>> >>> >>>> In my case, changing from active/backup to LACP helped. So, based on that experience, from my perspective, it certainly sounds like some kind of network issue.
>>> >>>> >>> >>>> Regards, >>> >>>> >>> >>>> Brendan Shephard >>> >>>> Senior Software Engineer >>> >>>> Red Hat Australia >>> >>>> >>> >>>> >>> >>>> >>> >>>> On 5 Mar 2023, at 6:47 am, Eugen Block <eblock@nde.ag> wrote: >>> >>>> >>> >>>> Hi, >>> >>>> >>> >>>> I tried to help someone with a similar issue some time ago in this thread: >>> >>>> https://serverfault.com/questions/1116771/openstack-oslo-messaging-exception... >>> >>>> >>> >>>> But apparently a neutron reinstallation fixed it for that user, not sure if that could apply here. But is it possible that your nova and neutron versions are different between central and edge site? Have you restarted nova and neutron services on the compute nodes after installation? Have you debug logs of nova-conductor and maybe nova-compute? Maybe they can help narrow down the issue. >>> >>>> If there isn't any additional information in the debug logs I probably would start "tearing down" rabbitmq. I didn't have to do that in a production system yet so be careful. I can think of two routes: >>> >>>> >>> >>>> - Either remove queues, exchanges etc. while rabbit is running, this will most likely impact client IO depending on your load. Check out the rabbitmqctl commands. >>> >>>> - Or stop the rabbitmq cluster, remove the mnesia tables from all nodes and restart rabbitmq so the exchanges, queues etc. rebuild. >>> >>>> >>> >>>> I can imagine that the failed reply "survives" while being replicated across the rabbit nodes. But I don't really know the rabbit internals too well, so maybe someone else can chime in here and give a better advice. >>> >>>> >>> >>>> Regards, >>> >>>> Eugen >>> >>>> >>> >>>> Zitat von Swogat Pradhan <swogatpradhan22@gmail.com>: >>> >>>> >>> >>>> Hi, >>> >>>> Can someone please help me out on this issue? >>> >>>> >>> >>>> With regards, >>> >>>> Swogat Pradhan >>> >>>> >>> >>>> On Thu, Mar 2, 2023 at 1:24 PM Swogat Pradhan < swogatpradhan22@gmail.com> >>> >>>> wrote: >>> >>>> >>> >>>> Hi >>> >>>> I don't see any major packet loss. >>> >>>> It seems the problem is somewhere in rabbitmq maybe but not due to packet >>> >>>> loss. >>> >>>> >>> >>>> with regards, >>> >>>> Swogat Pradhan >>> >>>> >>> >>>> On Wed, Mar 1, 2023 at 3:34 PM Swogat Pradhan < swogatpradhan22@gmail.com> >>> >>>> wrote: >>> >>>> >>> >>>> Hi, >>> >>>> Yes the MTU is the same as the default '1500'. >>> >>>> Generally I haven't seen any packet loss, but never checked when >>> >>>> launching the instance. >>> >>>> I will check that and come back. >>> >>>> But everytime i launch an instance the instance gets stuck at spawning >>> >>>> state and there the hypervisor becomes down, so not sure if
packet loss
>>> >>>> causes this. >>> >>>> >>> >>>> With regards, >>> >>>> Swogat pradhan >>> >>>> >>> >>>> On Wed, Mar 1, 2023 at 3:30 PM Eugen Block <eblock@nde.ag> wrote: >>> >>>> >>> >>>> One more thing coming to mind is MTU size. Are they identical between >>> >>>> central and edge site? Do you see packet loss through the tunnel? >>> >>>> >>> >>>> Zitat von Swogat Pradhan <swogatpradhan22@gmail.com>: >>> >>>> >>> >>>> > Hi Eugen, >>> >>>> > Request you to please add my email either on 'to' or 'cc' as i am not >>> >>>> > getting email's from you. >>> >>>> > Coming to the issue: >>> >>>> > >>> >>>> > [root@overcloud-controller-no-ceph-3 /]# rabbitmqctl list_policies -p
>>> >>>> / >>> >>>> > Listing policies for vhost "/" ... >>> >>>> > vhost name pattern apply-to definition priority
>>> >>>> > / ha-all ^(?!amq\.).* queues >>> >>>> > >>> >>>> {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0 >>> >>>> > >>> >>>> > I have the edge site compute nodes up, it only goes down when i am >>> >>>> trying >>> >>>> > to launch an instance and the instance comes to a spawning state and >>> >>>> then >>> >>>> > gets stuck. >>> >>>> > >>> >>>> > I have a tunnel setup between the central and the edge sites. >>> >>>> > >>> >>>> > With regards, >>> >>>> > Swogat Pradhan >>> >>>> > >>> >>>> > On Tue, Feb 28, 2023 at 9:11 PM Swogat Pradhan < >>> >>>> swogatpradhan22@gmail.com> >>> >>>> > wrote: >>> >>>> > >>> >>>> >> Hi Eugen, >>> >>>> >> For some reason i am not getting your email to me
directly, i am
>>> >>>> checking >>> >>>> >> the email digest and there i am able to find your reply. >>> >>>> >> Here is the log for download: https://we.tl/t-L8FEkGZFSq >>> >>>> >> Yes, these logs are from the time when the issue occurred. >>> >>>> >> >>> >>>> >> *Note: i am able to create vm's and perform other activities in the >>> >>>> >> central site, only facing this issue in the edge site.* >>> >>>> >> >>> >>>> >> With regards, >>> >>>> >> Swogat Pradhan >>> >>>> >> >>> >>>> >> On Mon, Feb 27, 2023 at 5:12 PM Swogat Pradhan < >>> >>>> swogatpradhan22@gmail.com> >>> >>>> >> wrote: >>> >>>> >> >>> >>>> >>> Hi Eugen, >>> >>>> >>> Thanks for your response. >>> >>>> >>> I have actually a 4 controller setup so here are the details:
>>> >>>> >>> >>> >>>> >>> *PCS Status:* >>> >>>> >>> * Container bundle set: rabbitmq-bundle [ >>> >>>> >>> 172.25.201.68:8787/tripleomaster/openstack-rabbitmq:pcmklatest]: >>> >>>> >>> * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): >>> >>>> Started >>> >>>> >>> overcloud-controller-no-ceph-3 >>> >>>> >>> * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): >>> >>>> Started >>> >>>> >>> overcloud-controller-2 >>> >>>> >>> * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): >>> >>>> Started >>> >>>> >>> overcloud-controller-1 >>> >>>> >>> * rabbitmq-bundle-3 (ocf::heartbeat:rabbitmq-cluster): >>> >>>> Started >>> >>>> >>> overcloud-controller-0 >>> >>>> >>> >>> >>>> >>> I have tried restarting the bundle multiple times but the issue is
>>> >>>> still >>> >>>> >>> present. >>> >>>> >>> >>> >>>> >>> *Cluster status:* >>> >>>> >>> [root@overcloud-controller-0 /]# rabbitmqctl cluster_status >>> >>>> >>> Cluster status of node >>> >>>> >>> rabbit@overcloud-controller-0.internalapi.bdxworld.com ... >>> >>>> >>> Basics >>> >>>> >>> >>> >>>> >>> Cluster name: rabbit@overcloud-controller-no-ceph-3.bdxworld.com >>> >>>> >>> >>> >>>> >>> Disk Nodes >>> >>>> >>> >>> >>>> >>> rabbit@overcloud-controller-0.internalapi.bdxworld.com >>> >>>> >>> rabbit@overcloud-controller-1.internalapi.bdxworld.com >>> >>>> >>> rabbit@overcloud-controller-2.internalapi.bdxworld.com >>> >>>> >>> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >>> >>>> >>> >>> >>>> >>> Running Nodes >>> >>>> >>> >>> >>>> >>> rabbit@overcloud-controller-0.internalapi.bdxworld.com >>> >>>> >>> rabbit@overcloud-controller-1.internalapi.bdxworld.com >>> >>>> >>> rabbit@overcloud-controller-2.internalapi.bdxworld.com >>> >>>> >>> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >>> >>>> >>> >>> >>>> >>> Versions >>> >>>> >>> >>> >>>> >>> rabbit@overcloud-controller-0.internalapi.bdxworld.com: RabbitMQ >>> >>>> 3.8.3 >>> >>>> >>> on Erlang 22.3.4.1 >>> >>>> >>> rabbit@overcloud-controller-1.internalapi.bdxworld.com: RabbitMQ >>> >>>> 3.8.3 >>> >>>> >>> on Erlang 22.3.4.1 >>> >>>> >>> rabbit@overcloud-controller-2.internalapi.bdxworld.com: RabbitMQ >>> >>>> 3.8.3 >>> >>>> >>> on Erlang 22.3.4.1 >>> >>>> >>> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com: >>> >>>> RabbitMQ >>> >>>> >>> 3.8.3 on Erlang 22.3.4.1 >>> >>>> >>> >>> >>>> >>> Alarms >>> >>>> >>> >>> >>>> >>> (none) >>> >>>> >>> >>> >>>> >>> Network Partitions >>> >>>> >>> >>> >>>> >>> (none) >>> >>>> >>> >>> >>>> >>> Listeners >>> >>>> >>> >>> >>>> >>> Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, >>> >>>> interface: >>> >>>> >>> [::], port: 25672, protocol: clustering, purpose: inter-node and CLI >>> >>>> tool >>> >>>> >>> communication >>> >>>> >>> Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, >>> >>>> interface: >>> >>>> >>> 172.25.201.212, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 >>> >>>> >>> and AMQP 1.0 >>> >>>> >>> Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, >>> >>>> interface: >>> >>>> >>> [::], port: 15672, protocol: http, purpose: HTTP API >>> >>>> >>> Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, >>> >>>> interface: >>> >>>> >>> [::], port: 25672, protocol: clustering, purpose: inter-node and CLI >>> >>>> tool >>> >>>> >>> communication >>> >>>> >>> Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, >>> >>>> interface: >>> >>>> >>> 172.25.201.205, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 >>> >>>> >>> and AMQP 1.0 >>> >>>> >>> Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, >>> >>>> interface: >>> >>>> >>> [::], port: 15672, protocol: http, purpose: HTTP API >>> >>>> >>> Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, >>> >>>> interface: >>> >>>> >>> [::], port: 25672, protocol: clustering, purpose: inter-node and CLI >>> >>>> tool >>> >>>> >>> communication >>> >>>> >>> Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, >>> >>>> interface: >>> >>>> >>> 172.25.201.201, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 >>> >>>> >>> and AMQP 1.0 >>> >>>> >>> Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, >>> >>>> interface: >>> >>>> >>> [::], port: 15672, protocol: http, purpose: HTTP API >>> >>>> >>> Node: 
rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >>> >>>> , >>> >>>> >>> interface: [::], port: 25672, protocol: clustering, purpose:
>>> >>>> inter-node and >>> >>>> >>> CLI tool communication >>> >>>> >>> Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >>> >>>> , >>> >>>> >>> interface: 172.25.201.209, port: 5672, protocol: amqp, purpose: AMQP
>>> >>>> 0-9-1 >>> >>>> >>> and AMQP 1.0 >>> >>>> >>> Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >>> >>>> , >>> >>>> >>> interface: [::], port: 15672, protocol: http, purpose: HTTP API >>> >>>> >>> >>> >>>> >>> Feature flags >>> >>>> >>> >>> >>>> >>> Flag: drop_unroutable_metric, state: enabled >>> >>>> >>> Flag: empty_basic_get_metric, state: enabled >>> >>>> >>> Flag: implicit_default_bindings, state: enabled >>> >>>> >>> Flag: quorum_queue, state: enabled >>> >>>> >>> Flag: virtual_host_metadata, state: enabled >>> >>>> >>> >>> >>>> >>> *Logs:* >>> >>>> >>> *(Attached)* >>> >>>> >>> >>> >>>> >>> With regards, >>> >>>> >>> Swogat Pradhan >>> >>>> >>> >>> >>>> >>> On Sun, Feb 26, 2023 at 2:34 PM Swogat Pradhan < >>> >>>> swogatpradhan22@gmail.com> >>> >>>> >>> wrote: >>> >>>> >>> >>> >>>> >>>> Hi, >>> >>>> >>>> Please find the nova conductor as well as nova api log. >>> >>>> >>>> >>> >>>> >>>> nova-conuctor: >>> >>>> >>>> >>> >>>> >>>> 2023-02-26 08:45:01.108 31 WARNING >>> >>>> oslo_messaging._drivers.amqpdriver >>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] >>> >>>> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to >>> >>>> >>>> 16152921c1eb45c2b1f562087140168b >>> >>>> >>>> 2023-02-26 08:45:02.144 26 WARNING >>> >>>> oslo_messaging._drivers.amqpdriver >>> >>>> >>>> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] >>> >>>> >>>> reply_276049ec36a84486a8a406911d9802f4 doesn't exist, drop reply to >>> >>>> >>>> 83dbe5f567a940b698acfe986f6194fa >>> >>>> >>>> 2023-02-26 08:45:02.314 32 WARNING >>> >>>> oslo_messaging._drivers.amqpdriver >>> >>>> >>>> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] >>> >>>> >>>> reply_276049ec36a84486a8a406911d9802f4 doesn't exist, drop reply to >>> >>>> >>>> f3bfd7f65bd542b18d84cea3033abb43: >>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >>> >>>> >>>> 2023-02-26 08:45:02.316 32 ERROR oslo_messaging._drivers.amqpdriver >>> >>>> >>>> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] The reply >>> >>>> >>>> f3bfd7f65bd542b18d84cea3033abb43 failed to send after 60 seconds >>> >>>> due to a >>> >>>> >>>> missing queue (reply_276049ec36a84486a8a406911d9802f4). >>> >>>> Abandoning...: >>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >>> >>>> >>>> 2023-02-26 08:48:01.282 35 WARNING >>> >>>> oslo_messaging._drivers.amqpdriver >>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] >>> >>>> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to >>> >>>> >>>> d4b9180f91a94f9a82c3c9c4b7595566: >>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >>> >>>> >>>> 2023-02-26 08:48:01.284 35 ERROR oslo_messaging._drivers.amqpdriver >>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply >>> >>>> >>>> d4b9180f91a94f9a82c3c9c4b7595566 failed to send after 60 seconds >>> >>>> due to a >>> >>>> >>>> missing queue (reply_349bcb075f8c49329435a0f884b33066). 
>>> >>>> Abandoning...: >>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >>> >>>> >>>> 2023-02-26 08:49:01.303 33 WARNING >>> >>>> oslo_messaging._drivers.amqpdriver >>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] >>> >>>> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to >>> >>>> >>>> 897911a234a445d8a0d8af02ece40f6f: >>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >>> >>>> >>>> 2023-02-26 08:49:01.304 33 ERROR oslo_messaging._drivers.amqpdriver >>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply >>> >>>> >>>> 897911a234a445d8a0d8af02ece40f6f failed to send after 60 seconds >>> >>>> due to a >>> >>>> >>>> missing queue (reply_349bcb075f8c49329435a0f884b33066). >>> >>>> Abandoning...: >>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >>> >>>> >>>> 2023-02-26 08:49:52.254 31 WARNING nova.cache_utils >>> >>>> >>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>> >>>> b240e3e89d99489284cd731e75f2a5db >>> >>>> >>>> 4160ce999a31485fa643aed0936dfef0 - default default] Cache enabled >>> >>>> with >>> >>>> >>>> backend dogpile.cache.null. >>> >>>> >>>> 2023-02-26 08:50:01.264 27 WARNING >>> >>>> oslo_messaging._drivers.amqpdriver >>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] >>> >>>> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to >>> >>>> >>>> 8f723ceb10c3472db9a9f324861df2bb: >>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >>> >>>> >>>> 2023-02-26 08:50:01.266 27 ERROR oslo_messaging._drivers.amqpdriver >>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply >>> >>>> >>>> 8f723ceb10c3472db9a9f324861df2bb failed to send after 60 seconds >>> >>>> due to a >>> >>>> >>>> missing queue (reply_349bcb075f8c49329435a0f884b33066). >>> >>>> Abandoning...: >>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >>> >>>> >>>> >>> >>>> >>>> With regards, >>> >>>> >>>> Swogat Pradhan >>> >>>> >>>> >>> >>>> >>>> On Sun, Feb 26, 2023 at 2:26 PM Swogat Pradhan < >>> >>>> >>>> swogatpradhan22@gmail.com> wrote: >>> >>>> >>>> >>> >>>> >>>>> Hi, >>> >>>> >>>>> I currently have 3 compute nodes on edge site1 where i am trying to >>> >>>> >>>>> launch vm's. >>> >>>> >>>>> When the VM is in spawning state the node goes down (openstack >>> >>>> compute >>> >>>> >>>>> service list), the node comes backup when i restart
the nova
>>> >>>> compute >>> >>>> >>>>> service but then the launch of the vm fails. >>> >>>> >>>>> >>> >>>> >>>>> nova-compute.log >>> >>>> >>>>> >>> >>>> >>>>> 2023-02-26 08:15:51.808 7 INFO nova.compute.manager >>> >>>> >>>>> [req-bc0f5f2e-53fc-4dae-b1da-82f1f972d617 - - - - -] Running >>> >>>> >>>>> instance usage >>> >>>> >>>>> audit for host dcn01-hci-0.bdxworld.com from 2023-02-26 07:00:00 >>> >>>> to >>> >>>> >>>>> 2023-02-26 08:00:00. 0 instances. >>> >>>> >>>>> 2023-02-26 08:49:52.813 7 INFO nova.compute.claims >>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] [instance: >>> >>>> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Claim successful on node >>> >>>> >>>>> dcn01-hci-0.bdxworld.com >>> >>>> >>>>> 2023-02-26 08:49:54.225 7 INFO nova.virt.libvirt.driver >>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] [instance: >>> >>>> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Ignoring supplied device >>> >>>> name: >>> >>>> >>>>> /dev/vda. Libvirt can't honour user-supplied dev names >>> >>>> >>>>> 2023-02-26 08:49:54.398 7 INFO nova.virt.block_device >>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] [instance: >>> >>>> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Booting with volume >>> >>>> >>>>> c4bd7885-5973-4860-bbe6-7a2f726baeee at /dev/vda >>> >>>> >>>>> 2023-02-26 08:49:55.216 7 WARNING nova.cache_utils >>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] Cache enabled >>> >>>> with >>> >>>> >>>>> backend dogpile.cache.null. 
>>> >>>> >>>>> 2023-02-26 08:49:55.283 7 INFO oslo.privsep.daemon >>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] Running >>> >>>> >>>>> privsep helper: >>> >>>> >>>>> ['sudo', 'nova-rootwrap', '/etc/nova/rootwrap.conf', >>> >>>> 'privsep-helper', >>> >>>> >>>>> '--config-file', '/etc/nova/nova.conf', '--config-file', >>> >>>> >>>>> '/etc/nova/nova-compute.conf', '--privsep_context', >>> >>>> >>>>> 'os_brick.privileged.default', '--privsep_sock_path', >>> >>>> >>>>> '/tmp/tmpin40tah6/privsep.sock'] >>> >>>> >>>>> 2023-02-26 08:49:55.791 7 INFO oslo.privsep.daemon >>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] Spawned new >>> >>>> privsep >>> >>>> >>>>> daemon via rootwrap >>> >>>> >>>>> 2023-02-26 08:49:55.717 2647 INFO oslo.privsep.daemon [-] privsep >>> >>>> >>>>> daemon starting >>> >>>> >>>>> 2023-02-26 08:49:55.722 2647 INFO oslo.privsep.daemon [-] privsep >>> >>>> >>>>> process running with uid/gid: 0/0 >>> >>>> >>>>> 2023-02-26 08:49:55.726 2647 INFO oslo.privsep.daemon [-] privsep >>> >>>> >>>>> process running with capabilities (eff/prm/inh): >>> >>>> >>>>> CAP_SYS_ADMIN/CAP_SYS_ADMIN/none >>> >>>> >>>>> 2023-02-26 08:49:55.726 2647 INFO oslo.privsep.daemon [-] privsep >>> >>>> >>>>> daemon running as pid 2647 >>> >>>> >>>>> 2023-02-26 08:49:55.956 7 WARNING >>> >>>> os_brick.initiator.connectors.nvmeof >>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] Process >>> >>>> >>>>> execution error >>> >>>> >>>>> in _get_host_uuid: Unexpected error while running command. >>> >>>> >>>>> Command: blkid overlay -s UUID -o value >>> >>>> >>>>> Exit code: 2 >>> >>>> >>>>> Stdout: '' >>> >>>> >>>>> Stderr: '': oslo_concurrency.processutils.ProcessExecutionError: >>> >>>> >>>>> Unexpected error while running command. >>> >>>> >>>>> 2023-02-26 08:49:58.247 7 INFO nova.virt.libvirt.driver >>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] [instance: >>> >>>> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Creating image >>> >>>> >>>>> >>> >>>> >>>>> Is there a way to solve this issue? >>> >>>> >>>>> >>> >>>> >>>>> >>> >>>> >>>>> With regards, >>> >>>> >>>>> >>> >>>> >>>>> Swogat Pradhan >>> >>>> >>>>> >>> >>>> >>>> >>> >>>> >>> >>>> >>> >>>> >>> >>>> >>> >>>> >>> >>>> >>> >>>> >>> >>>> >>> >>>> >>>
On Wed, Mar 22, 2023 at 8:38 AM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi Alan, The systems are in the same LAN. In this case it seemed like the image was getting pulled from the central site, which was caused by a misconfiguration in the ceph.conf file in the /var/lib/tripleo-config/ceph/ directory; that seems to have been resolved after the changes I made to fix it.
Right now the glance api podman container is running in an unhealthy state, the podman logs don't show any error whatsoever, and when I issue the command netstat -nultp I do not see any entry for the glance port (9292) at the dcn site, which is why cinder is throwing an error stating:
2023-03-22 13:32:29.786 108 ERROR oslo_messaging.rpc.server cinder.exception.GlanceConnectionFailed: Connection to glance failed: Error finding address for http://172.25.228.253:9292/v2/images/736d8779-07cd-4510-bab2-adcb653cc538: Unable to establish connection to http://172.25.228.253:9292/v2/images/736d8779-07cd-4510-bab2-adcb653cc538: HTTPConnectionPool(host='172.25.228.253', port=9292): Max retries exceeded with url: /v2/images/736d8779-07cd-4510-bab2-adcb653cc538 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7682d2cd30>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))
Now I need to find out why the port is not listed even though the glance service is running, and I am not sure how to find that out.
One other thing to investigate is whether your deployment includes this patch [1]. If it does, then bear in mind the glance-api service running at the edge site will be an "internal" (non public facing) instance that uses port 9293 instead of 9292. You should familiarize yourself with the release note [2]. [1] https://opendev.org/openstack/tripleo-heat-templates/commit/3605d45e417a77a1... [2] https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/walla... Alan
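In that case it may be worth checking which port the edge glance is actually bound to and which endpoint cinder is using, e.g. (a sketch; container names are the TripleO defaults and may differ in your deployment):

$ sudo podman ps --filter name=glance --format '{{.Names}} {{.Status}}'
$ sudo podman healthcheck run glance_api; echo $?        # re-run the container healthcheck and show its exit code
$ sudo ss -tlnp | grep -E ':9292|:9293'                  # an internal-only edge glance would show up on 9293
$ sudo grep -E '^glance_api_servers' /var/lib/config-data/puppet-generated/cinder/etc/cinder/cinder.conf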
With regards, Swogat Pradhan
On Wed, Mar 22, 2023 at 8:11 PM Alan Bishop <abishop@redhat.com> wrote:
On Wed, Mar 22, 2023 at 6:37 AM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Update: Here is the log when creating a volume using cirros image:
2023-03-22 11:04:38.449 109 INFO cinder.volume.flows.manager.create_volume [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Volume bf341343-6609-4b8c-b9e0-93e2a89c8c8f: being created as image with specification: {'status': 'creating', 'volume_name': 'volume-bf341343-6609-4b8c-b9e0-93e2a89c8c8f', 'volume_size': 4, 'image_id': '736d8779-07cd-4510-bab2-adcb653cc538', 'image_location': ('rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'dcn02'}}]), 'image_meta': {'name': 'cirros', 'disk_format': 'qcow2', 'container_format': 'bare', 'visibility': 'public', 'size': 16338944, 'virtual_size': 117440512, 'status': 'active', 'checksum': '1d3062cd89af34e419f7100277f38b2b', 'protected': False, 'min_ram': 0, 'min_disk': 0, 'owner': '4160ce999a31485fa643aed0936dfef0', 'os_hidden': False, 'os_hash_algo': 'sha512', 'os_hash_value': '553d220ed58cfee7dafe003c446a9f197ab5edf8ffc09396c74187cf83873c877e7ae041cb80f3b91489acf687183adcd689b53b38e3ddd22e627e7f98a09c46', 'id': '736d8779-07cd-4510-bab2-adcb653cc538', 'created_at': datetime.datetime(2023, 3, 22, 10, 44, 12, tzinfo=datetime.timezone.utc), 'updated_at': datetime.datetime(2023, 3, 22, 10, 54, 1, tzinfo=datetime.timezone.utc), 'locations': [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'dcn02'}}], 'direct_url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'tags': [], 'file': '/v2/images/736d8779-07cd-4510-bab2-adcb653cc538/file', 'stores': 'ceph,dcn02', 'properties': {'os_glance_failed_import': '', 'os_glance_importing_to_stores': '', 'owner_specified.openstack.md5': '', 'owner_specified.openstack.object': 'images/cirros', 'owner_specified.openstack.sha256': ''}}, 'image_service': <cinder.image.glance.GlanceImageService object at 0x7f449ded1198>} 2023-03-22 11:06:16.570 109 INFO cinder.image.image_utils [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Image download 15.58 MB at 0.16 MB/s
As Adam Savage would say, well there's your problem ^^ (Image download 15.58 MB at 0.16 MB/s). Downloading the image takes too long, and 0.16 MB/s suggests you have a network issue.
John Fulton previously stated your cinder-volume service at the edge site is not using the local ceph image store. Assuming you are deploying GlanceApiEdge service [1], then the cinder-volume service should be configured to use the local glance service [2]. You should check cinder's glance_api_servers to confirm it's the edge site's glance service.
[1] https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/envi... [2] https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/depl...
Alan
2023-03-22 11:07:54.023 109 WARNING py.warnings [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] /usr/lib/python3.6/site-packages/oslo_utils/imageutils.py:75: FutureWarning: The human format is deprecated and the format parameter will be removed. Use explicitly json instead in version 'xena' category=FutureWarning)
2023-03-22 11:11:12.161 109 WARNING py.warnings [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] /usr/lib/python3.6/site-packages/oslo_utils/imageutils.py:75: FutureWarning: The human format is deprecated and the format parameter will be removed. Use explicitly json instead in version 'xena' category=FutureWarning)
2023-03-22 11:11:12.163 109 INFO cinder.image.image_utils [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Converted 112.00 MB image at 112.00 MB/s 2023-03-22 11:11:14.998 109 INFO cinder.volume.flows.manager.create_volume [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Volume volume-bf341343-6609-4b8c-b9e0-93e2a89c8c8f (bf341343-6609-4b8c-b9e0-93e2a89c8c8f): created successfully 2023-03-22 11:11:15.195 109 INFO cinder.volume.manager [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Created volume successfully.
The image is present in dcn02 store but still it downloaded the image in 0.16 MB/s and then created the volume.
With regards, Swogat Pradhan
On Tue, Mar 21, 2023 at 6:10 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
Hi John, This seems to be an issue. When I deployed the dcn ceph in both dcn01 and dcn02, the --cluster parameter was set to the respective cluster names, but the config files were still created as ceph.conf and the keyring as ceph.client.openstack.keyring.
Which created issues in glance as well as the naming convention of the files didn't match the cluster names, so i had to manually rename the central ceph conf file as such:
[root@dcn02-compute-0 ~]# cd /var/lib/tripleo-config/ceph/ [root@dcn02-compute-0 ceph]# ll total 16 -rw-------. 1 root root 257 Mar 13 13:56 ceph_central.client.openstack.keyring -rw-r--r--. 1 root root 428 Mar 13 13:56 ceph_central.conf -rw-------. 1 root root 205 Mar 15 18:45 ceph.client.openstack.keyring -rw-r--r--. 1 root root 362 Mar 15 18:45 ceph.conf [root@dcn02-compute-0 ceph]#
ceph.conf and ceph.client.openstack.keyring contain the fsid of the respective clusters in both dcn01 and dcn02. In the above cli output, the ceph.conf and ceph.client... are the files used to access dcn02 ceph cluster and ceph_central* files are used in for accessing central ceph cluster.
glance multistore config: [dcn02] rbd_store_ceph_conf=/etc/ceph/ceph.conf rbd_store_user=openstack rbd_store_pool=images rbd_thin_provisioning=False store_description=dcn02 rbd glance store
[ceph_central] rbd_store_ceph_conf=/etc/ceph/ceph_central.conf rbd_store_user=openstack rbd_store_pool=images rbd_thin_provisioning=False store_description=Default glance store backend.
With regards, Swogat Pradhan
On Tue, Mar 21, 2023 at 5:52 PM John Fulton <johfulto@redhat.com> wrote:
On Tue, Mar 21, 2023 at 8:03 AM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi, Seems like cinder is not using the local ceph.
That explains the issue. It's a misconfiguration.
I hope this is not a production system since the mailing list now has the cinder.conf which contains passwords.
The section that looks like this:
[tripleo_ceph] volume_backend_name=tripleo_ceph volume_driver=cinder.volume.drivers.rbd.RBDDriver rbd_ceph_conf=/etc/ceph/ceph.conf rbd_user=openstack rbd_pool=volumes rbd_flatten_volume_from_snapshot=False rbd_secret_uuid=<redacted> report_discard_supported=True
Should be updated to refer to the local DCN ceph cluster and not the central one. Use the ceph conf file for that cluster and ensure the rbd_secret_uuid corresponds to that one.
TripleO’s convention is to set the rbd_secret_uuid to the FSID of the Ceph cluster. The FSID should be in the ceph.conf file. The tripleo_nova_libvirt role will use virsh secret-* commands so that libvirt can retrieve the cephx secret using the FSID as a key. This can be confirmed with `podman exec nova_virtsecretd virsh secret-get-value $FSID`.
The documentation describes how to configure the central and DCN sites correctly but an error seems to have occurred while you were following it.
https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features...
John
Ceph Output: [ceph: root@dcn02-ceph-all-0 /]# rbd -p images ls -l NAME SIZE PARENT FMT PROT LOCK
2abfafaa-eff4-4c2e-a538-dc2e1249ab65 8 MiB 2 excl 55f40c8a-8f79-48c5-a52a-9b679b762f19 16 MiB 2 55f40c8a-8f79-48c5-a52a-9b679b762f19@snap 16 MiB 2 yes 59f6a9cd-721c-45b5-a15f-fd021b08160d 321 MiB 2 59f6a9cd-721c-45b5-a15f-fd021b08160d@snap 321 MiB 2 yes 5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0 386 MiB 2 5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0@snap 386 MiB 2 yes 9b27248e-a8cf-4f00-a039-d3e3066cd26a 15 GiB 2 9b27248e-a8cf-4f00-a039-d3e3066cd26a@snap 15 GiB 2 yes b7356adc-bb47-4c05-968b-6d3c9ca0079b 15 GiB 2 b7356adc-bb47-4c05-968b-6d3c9ca0079b@snap 15 GiB 2 yes e77e78ad-d369-4a1d-b758-8113621269a3 15 GiB 2 e77e78ad-d369-4a1d-b758-8113621269a3@snap 15 GiB 2 yes
[ceph: root@dcn02-ceph-all-0 /]# rbd -p volumes ls -l NAME SIZE PARENT FMT PROT LOCK volume-c644086f-d3cf-406d-b0f1-7691bde5981d 100 GiB 2 volume-f0969935-a742-4744-9375-80bf323e4d63 10 GiB 2 [ceph: root@dcn02-ceph-all-0 /]#
Attached the cinder config. Please let me know how I can solve this issue.
With regards, Swogat Pradhan
On Tue, Mar 21, 2023 at 3:53 PM John Fulton <johfulto@redhat.com> wrote: > > in my last message under the line "On a DCN site if you run a command like this:" I suggested some steps you could try to confirm the image is a COW from the local glance as well as how to look at your cinder config. > > On Tue, Mar 21, 2023, 12:06 AM Swogat Pradhan < swogatpradhan22@gmail.com> wrote: >> >> Update: >> I uploaded an image directly to the dcn02 store, and it takes around 10,15 minutes to create a volume with image in dcn02. >> The image size is 389 MB. >> >> On Mon, Mar 20, 2023 at 10:26 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote: >>> >>> Hi Jhon, >>> I checked in the ceph od dcn02, I can see the images created after importing from the central site. >>> But launching an instance normally fails as it takes a long time for the volume to get created. >>> >>> When launching an instance from volume the instance is getting created properly without any errors. >>> >>> I tried to cache images in nova using https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep... but getting checksum failed error. >>> >>> With regards, >>> Swogat Pradhan >>> >>> On Thu, Mar 16, 2023 at 5:24 PM John Fulton <johfulto@redhat.com> wrote: >>>> >>>> On Wed, Mar 15, 2023 at 8:05 PM Swogat Pradhan >>>> <swogatpradhan22@gmail.com> wrote: >>>> > >>>> > Update: After restarting the nova services on the controller and running the deploy script on the edge site, I was able to launch the VM from volume. >>>> > >>>> > Right now the instance creation is failing as the block device creation is stuck in creating state, it is taking more than 10 mins for the volume to be created, whereas the image has already been imported to the edge glance. >>>> >>>> Try following this document and making the same observations in your >>>> environment for AZs and their local ceph cluster. >>>> >>>> https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features... >>>> >>>> On a DCN site if you run a command like this: >>>> >>>> $ sudo cephadm shell --config /etc/ceph/dcn0.conf --keyring >>>> /etc/ceph/dcn0.client.admin.keyring >>>> $ rbd --cluster dcn0 -p volumes ls -l >>>> NAME SIZE PARENT >>>> FMT PROT LOCK >>>> volume-28c6fc32-047b-4306-ad2d-de2be02716b7 8 GiB >>>> images/8083c7e7-32d8-4f7a-b1da-0ed7884f1076@snap 2 excl >>>> $ >>>> >>>> Then, you should see the parent of the volume is the image which is on >>>> the same local ceph cluster. >>>> >>>> I wonder if something is misconfigured and thus you're encountering >>>> the streaming behavior described here: >>>> >>>> Ideally all images should reside in the central Glance and be copied >>>> to DCN sites before instances of those images are booted on DCN sites. >>>> If an image is not copied to a DCN site before it is booted,
>>>> image will be streamed to the DCN site and then the image will boot as >>>> an instance. This happens because Glance at the DCN site has access to >>>> the images store at the Central ceph cluster. Though the booting of >>>> the image will take time because it has not been copied in advance, >>>> this is still preferable to failing to boot the image. >>>> >>>> You can also exec into the cinder container at the DCN site and >>>> confirm it's using it's local ceph cluster. >>>> >>>> John >>>> >>>> > >>>> > I will try and create a new fresh image and test again then update. >>>> > >>>> > With regards, >>>> > Swogat Pradhan >>>> > >>>> > On Wed, Mar 15, 2023 at 11:13 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote: >>>> >> >>>> >> Update: >>>> >> In the hypervisor list the compute node state is showing down. >>>> >> >>>> >> >>>> >> On Wed, Mar 15, 2023 at 11:11 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote: >>>> >>> >>>> >>> Hi Brendan, >>>> >>> Now i have deployed another site where i have used 2 linux bonds network template for both 3 compute nodes and 3 ceph nodes. >>>> >>> The bonding options is set to mode=802.3ad (lacp=active). >>>> >>> I used a cirros image to launch instance but the instance timed out so i waited for the volume to be created. >>>> >>> Once the volume was created i tried launching the instance from the volume and still the instance is stuck in spawning state. >>>> >>> >>>> >>> Here is the nova-compute log: >>>> >>> >>>> >>> 2023-03-15 17:35:47.739 185437 INFO oslo.privsep.daemon [-]
privsep daemon starting
>>>> >>> 2023-03-15 17:35:47.744 185437 INFO oslo.privsep.daemon [-] privsep process running with uid/gid: 0/0
>>>> >>> 2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon [-] privsep process running with capabilities (eff/prm/inh): CAP_SYS_ADMIN/CAP_SYS_ADMIN/none
>>>> >>> 2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon [-] privsep daemon running as pid 185437
>>>> >>> 2023-03-15 17:35:47.974 8 WARNING os_brick.initiator.connectors.nvmeof [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Process execution error in _get_host_uuid: Unexpected error while running command. >>>> >>> Command: blkid overlay -s UUID -o value >>>> >>> Exit code: 2 >>>> >>> Stdout: '' >>>> >>> Stderr: '': oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command. >>>> >>> 2023-03-15 17:35:51.616 8 INFO nova.virt.libvirt.driver [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] [instance: 450b749c-a10a-4308-80a9-3b8020fee758] Creating image >>>> >>> >>>> >>> It is stuck in creating image, do i need to run the template mentioned here ?: https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep... >>>> >>> >>>> >>> The volume is already created and i do not understand why
>>>> >>> >>>> >>> With regards, >>>> >>> Swogat Pradhan >>>> >>> >>>> >>> >>>> >>> On Sun, Mar 5, 2023 at 4:02 PM Brendan Shephard < bshephar@redhat.com> wrote: >>>> >>>> >>>> >>>> Does your environment use different network interfaces for each of the networks? Or does it have a bond with everything on it? >>>> >>>> >>>> >>>> One issue I have seen before is that when launching instances, there is a lot of network traffic between nodes as the hypervisor needs to download the image from Glance. Along with various other services sending normal network traffic, it can be enough to cause issues if everything is running over a single 1Gbe interface. >>>> >>>> >>>> >>>> I have seen the same situation in fact when using a single active/backup bond on 1Gbe nics. It’s worth checking the network traffic while you try to spawn the instance to see if you’re dropping packets. In
>>>> >>>> >>>> >>>> In my case, changing from active/backup to LACP helped. So,
the situation I described, there were dropped packets which resulted in a loss of communication between nova_compute and RMQ, so the node appeared offline. You should also confirm that nova_compute is being disconnected in the nova_compute logs if you tail them on the Hypervisor while spawning the instance.
>>>> >>>> >>>> >>>> In my case, changing from active/backup to LACP helped. So, based on that experience, from my perspective, it certainly sounds like some kind of network issue.
>>>> >>>> causes this. >>>> >>>> >>>> >>>> With regards, >>>> >>>> Swogat pradhan >>>> >>>> >>>> >>>> On Wed, Mar 1, 2023 at 3:30 PM Eugen Block <eblock@nde.ag> wrote: >>>> >>>> >>>> >>>> One more thing coming to mind is MTU size. Are they identical between >>>> >>>> central and edge site? Do you see packet loss through the tunnel? >>>> >>>> >>>> >>>> Zitat von Swogat Pradhan <swogatpradhan22@gmail.com>: >>>> >>>> >>>> >>>> > Hi Eugen, >>>> >>>> > Request you to please add my email either on 'to' or 'cc' as i am not >>>> >>>> > getting email's from you. >>>> >>>> > Coming to the issue: >>>> >>>> > >>>> >>>> > [root@overcloud-controller-no-ceph-3 /]# rabbitmqctl
>>>> >>>> / >>>> >>>> > Listing policies for vhost "/" ... >>>> >>>> > vhost name pattern apply-to definition
>>>> >>>> > / ha-all ^(?!amq\.).* queues >>>> >>>> > >>>> >>>> {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0 >>>> >>>> > >>>> >>>> > I have the edge site compute nodes up, it only goes down when i am >>>> >>>> trying >>>> >>>> > to launch an instance and the instance comes to a spawning state and >>>> >>>> then >>>> >>>> > gets stuck. >>>> >>>> > >>>> >>>> > I have a tunnel setup between the central and the edge sites. >>>> >>>> > >>>> >>>> > With regards, >>>> >>>> > Swogat Pradhan >>>> >>>> > >>>> >>>> > On Tue, Feb 28, 2023 at 9:11 PM Swogat Pradhan < >>>> >>>> swogatpradhan22@gmail.com> >>>> >>>> > wrote: >>>> >>>> > >>>> >>>> >> Hi Eugen, >>>> >>>> >> For some reason i am not getting your email to me
>>>> >>>> checking >>>> >>>> >> the email digest and there i am able to find your reply. >>>> >>>> >> Here is the log for download: https://we.tl/t-L8FEkGZFSq >>>> >>>> >> Yes, these logs are from the time when the issue occurred. >>>> >>>> >> >>>> >>>> >> *Note: i am able to create vm's and perform other activities in the >>>> >>>> >> central site, only facing this issue in the edge site.* >>>> >>>> >> >>>> >>>> >> With regards, >>>> >>>> >> Swogat Pradhan >>>> >>>> >> >>>> >>>> >> On Mon, Feb 27, 2023 at 5:12 PM Swogat Pradhan < >>>> >>>> swogatpradhan22@gmail.com> >>>> >>>> >> wrote: >>>> >>>> >> >>>> >>>> >>> Hi Eugen, >>>> >>>> >>> Thanks for your response. >>>> >>>> >>> I have actually a 4 controller setup so here are the
>>>> >>>> >>> >>>> >>>> >>> *PCS Status:* >>>> >>>> >>> * Container bundle set: rabbitmq-bundle [ >>>> >>>> >>> 172.25.201.68:8787/tripleomaster/openstack-rabbitmq:pcmklatest]: >>>> >>>> >>> * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): >>>> >>>> Started >>>> >>>> >>> overcloud-controller-no-ceph-3 >>>> >>>> >>> * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): >>>> >>>> Started >>>> >>>> >>> overcloud-controller-2 >>>> >>>> >>> * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): >>>> >>>> Started >>>> >>>> >>> overcloud-controller-1 >>>> >>>> >>> * rabbitmq-bundle-3 (ocf::heartbeat:rabbitmq-cluster): >>>> >>>> Started >>>> >>>> >>> overcloud-controller-0 >>>> >>>> >>> >>>> >>>> >>> I have tried restarting the bundle multiple times but
[snip: earlier quoted messages]
I still have the same issue, I'm not sure what's left to try. All the pods are now in a healthy state, I am getting log entries 3 mins after I hit the create volume button in cinder-volume when I try to create a volume with an image. And the volumes are just stuck in creating state for more than 20 mins now. Cinder logs: 2023-03-22 20:32:44.010 108 INFO cinder.rpc [req-0d2093a0-efbd-45a5-bd7d-cce25ddc200e b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Automatically selected cinder-volume RPC version 3.17 as minimum service version. 2023-03-22 20:34:59.166 108 INFO cinder.volume.flows.manager.create_volume [req-0d2093a0-efbd-45a5-bd7d-cce25ddc200e b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Volume 5743a879-090d-46db-bc7c-1c0b0669a112: being created as image with specification: {'status': 'creating', 'volume_name': 'volume-5743a879-090d-46db-bc7c-1c0b0669a112', 'volume_size': 2, 'image_id': 'acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b', 'image_location': ('rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'metadata': {'store': 'dcn02'}}]), 'image_meta': {'name': 'cirros', 'disk_format': 'qcow2', 'container_format': 'bare', 'visibility': 'public', 'size': 16338944, 'virtual_size': 117440512, 'status': 'active', 'checksum': '1d3062cd89af34e419f7100277f38b2b', 'protected': False, 'min_ram': 0, 'min_disk': 0, 'owner': '4160ce999a31485fa643aed0936dfef0', 'os_hidden': False, 'os_hash_algo': 'sha512', 'os_hash_value': '553d220ed58cfee7dafe003c446a9f197ab5edf8ffc09396c74187cf83873c877e7ae041cb80f3b91489acf687183adcd689b53b38e3ddd22e627e7f98a09c46', 'id': 'acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b', 'created_at': datetime.datetime(2023, 3, 22, 18, 50, 5, tzinfo=datetime.timezone.utc), 'updated_at': datetime.datetime(2023, 3, 22, 20, 3, 54, tzinfo=datetime.timezone.utc), 'locations': [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'metadata': {'store': 'dcn02'}}], 'direct_url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'tags': [], 'file': '/v2/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/file', 'stores': 'ceph,dcn02', 'properties': {'os_glance_failed_import': '', 'os_glance_importing_to_stores': '', 'owner_specified.openstack.md5': '', 'owner_specified.openstack.object': 'images/cirros', 'owner_specified.openstack.sha256': ''}}, 'image_service': <cinder.image.glance.GlanceImageService object at 0x7f8147973438>} With regards, Swogat Pradhan On Wed, Mar 22, 2023 at 9:19 PM Alan Bishop <abishop@redhat.com> wrote:
On Wed, Mar 22, 2023 at 8:38 AM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi Adam, The systems are in the same LAN. In this case it seemed like the image was getting pulled from the central site, caused by a misconfiguration in the ceph.conf file in the /var/lib/tripleo-config/ceph/ directory, which seems to have been resolved after the changes I made to fix it.
Right now the glance api podman container is running in an unhealthy state and the podman logs don't show any error whatsoever. When I issue the command netstat -nultp I do not see any entry for the glance port, i.e. 9292, on the dcn site, which is why cinder is throwing an error stating:
2023-03-22 13:32:29.786 108 ERROR oslo_messaging.rpc.server cinder.exception.GlanceConnectionFailed: Connection to glance failed: Error finding address for http://172.25.228.253:9292/v2/images/736d8779-07cd-4510-bab2-adcb653cc538: Unable to establish connection to http://172.25.228.253:9292/v2/images/736d8779-07cd-4510-bab2-adcb653cc538: HTTPConnectionPool(host='172.25.228.253', port=9292): Max retries exceeded with url: /v2/images/736d8779-07cd-4510-bab2-adcb653cc538 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7682d2cd30>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))
Now I need to find out why the port is not listed even though the glance service is running, and I am not sure how to find that out.
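For reference, one way to narrow that down (the container name, ports and log path below are assumptions based on TripleO defaults, not something confirmed in this thread):

# on the DCN node that should be running glance-api
sudo podman ps -a --filter name=glance_api --format '{{.Names}} {{.Status}}'
sudo podman healthcheck run glance_api ; echo "healthcheck rc=$?"
# is anything bound to the glance ports at all?
sudo ss -lntp | grep -E ':(9292|9293)'
# the service log usually says more than 'podman logs' for TripleO containers
sudo tail -n 100 /var/log/containers/glance/api.log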
One other thing to investigate is whether your deployment includes this patch [1]. If it does, then bear in mind the glance-api service running at the edge site will be an "internal" (non public facing) instance that uses port 9293 instead of 9292. You should familiarize yourself with the release note [2].
[1] https://opendev.org/openstack/tripleo-heat-templates/commit/3605d45e417a77a1... [2] https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/walla...
Alan
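A rough way to check whether that internal-glance change is in play (the template path and config location below are assumptions based on a default Wallaby TripleO install):

# on the undercloud: does the deployed tripleo-heat-templates tree reference port 9293?
grep -rn '9293' /usr/share/openstack-tripleo-heat-templates/deployment/glance/ | head
# on the edge node: which port is the local glance-api actually configured to bind?
sudo grep -n 'bind_port' /var/lib/config-data/puppet-generated/glance_api/etc/glance/glance-api.conf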
With regards, Swogat Pradhan
On Wed, Mar 22, 2023 at 8:11 PM Alan Bishop <abishop@redhat.com> wrote:
On Wed, Mar 22, 2023 at 6:37 AM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
Update: Here is the log when creating a volume using cirros image:
2023-03-22 11:04:38.449 109 INFO cinder.volume.flows.manager.create_volume [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Volume bf341343-6609-4b8c-b9e0-93e2a89c8c8f: being created as image with specification: {'status': 'creating', 'volume_name': 'volume-bf341343-6609-4b8c-b9e0-93e2a89c8c8f', 'volume_size': 4, 'image_id': '736d8779-07cd-4510-bab2-adcb653cc538', 'image_location': ('rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'dcn02'}}]), 'image_meta': {'name': 'cirros', 'disk_format': 'qcow2', 'container_format': 'bare', 'visibility': 'public', 'size': 16338944, 'virtual_size': 117440512, 'status': 'active', 'checksum': '1d3062cd89af34e419f7100277f38b2b', 'protected': False, 'min_ram': 0, 'min_disk': 0, 'owner': '4160ce999a31485fa643aed0936dfef0', 'os_hidden': False, 'os_hash_algo': 'sha512', 'os_hash_value': '553d220ed58cfee7dafe003c446a9f197ab5edf8ffc09396c74187cf83873c877e7ae041cb80f3b91489acf687183adcd689b53b38e3ddd22e627e7f98a09c46', 'id': '736d8779-07cd-4510-bab2-adcb653cc538', 'created_at': datetime.datetime(2023, 3, 22, 10, 44, 12, tzinfo=datetime.timezone.utc), 'updated_at': datetime.datetime(2023, 3, 22, 10, 54, 1, tzinfo=datetime.timezone.utc), 'locations': [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'dcn02'}}], 'direct_url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'tags': [], 'file': '/v2/images/736d8779-07cd-4510-bab2-adcb653cc538/file', 'stores': 'ceph,dcn02', 'properties': {'os_glance_failed_import': '', 'os_glance_importing_to_stores': '', 'owner_specified.openstack.md5': '', 'owner_specified.openstack.object': 'images/cirros', 'owner_specified.openstack.sha256': ''}}, 'image_service': <cinder.image.glance.GlanceImageService object at 0x7f449ded1198>} 2023-03-22 11:06:16.570 109 INFO cinder.image.image_utils [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Image download 15.58 MB at 0.16 MB/s
As Adam Savage would say, well there's your problem ^^ (Image download 15.58 MB at 0.16 MB/s). Downloading the image takes too long, and 0.16 MB/s suggests you have a network issue.
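(One rough way to quantify that, if it helps; the commands below are only a sketch, and the VIP and image ID are placeholders to be replaced with real values.)

# baseline latency/loss from the DCN node toward the central internal API VIP
ping -c 20 <central-internal-api-vip>
# time an actual image fetch through glance, using admin credentials
time openstack image save --file /tmp/throughput-test.img <image-id>
# or measure raw TCP bandwidth if iperf3 can be run on both ends
iperf3 -c <central-node-ip> -t 10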
John Fulton previously stated your cinder-volume service at the edge site is not using the local ceph image store. Assuming you are deploying GlanceApiEdge service [1], then the cinder-volume service should be configured to use the local glance service [2]. You should check cinder's glance_api_servers to confirm it's the edge site's glance service.
[1] https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/envi... [2] https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/depl...
Alan
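A minimal check for that, assuming the standard TripleO container name and config path for cinder-volume at the edge site:

# inside the cinder-volume container on the DCN node
sudo podman exec cinder_volume grep -n 'glance_api_servers' /etc/cinder/cinder.conf
# or from the host copy of the same file
sudo grep -n 'glance_api_servers' /var/lib/config-data/puppet-generated/cinder/etc/cinder/cinder.conf
# it should point at the edge site's glance (GlanceApiEdge), not the central VIP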
2023-03-22 11:07:54.023 109 WARNING py.warnings [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] /usr/lib/python3.6/site-packages/oslo_utils/imageutils.py:75: FutureWarning: The human format is deprecated and the format parameter will be removed. Use explicitly json instead in version 'xena' category=FutureWarning)
2023-03-22 11:11:12.161 109 WARNING py.warnings [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] /usr/lib/python3.6/site-packages/oslo_utils/imageutils.py:75: FutureWarning: The human format is deprecated and the format parameter will be removed. Use explicitly json instead in version 'xena' category=FutureWarning)
2023-03-22 11:11:12.163 109 INFO cinder.image.image_utils [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Converted 112.00 MB image at 112.00 MB/s 2023-03-22 11:11:14.998 109 INFO cinder.volume.flows.manager.create_volume [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Volume volume-bf341343-6609-4b8c-b9e0-93e2a89c8c8f (bf341343-6609-4b8c-b9e0-93e2a89c8c8f): created successfully 2023-03-22 11:11:15.195 109 INFO cinder.volume.manager [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Created volume successfully.
The image is present in the dcn02 store, but it still downloaded the image at 0.16 MB/s and then created the volume.
With regards, Swogat Pradhan
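Following John's earlier suggestion, the quickest way to tell whether that volume was COW-cloned from the local image or fully copied is to look at its parent in rbd (a sketch only; the volume name is taken from the log above and the pool names from this thread):

# inside the cephadm shell on the dcn02 cluster
rbd -p volumes ls -l
rbd info volumes/volume-bf341343-6609-4b8c-b9e0-93e2a89c8c8f | grep parent
# a COW clone shows a parent like images/<image-id>@snap on the local cluster;
# no parent, together with the "Image download ... / Converted ..." lines in
# cinder-volume.log, means the image was streamed from glance and converted instead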
On Tue, Mar 21, 2023 at 6:10 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
Hi John, This seems to be an issue. When I deployed the dcn ceph in both dcn01 and dcn02, the --cluster parameter was set to the respective cluster names, but the config files were still created as ceph.conf and the keyring as ceph.client.openstack.keyring.
This created issues in glance as well, since the naming convention of the files didn't match the cluster names, so I had to manually rename the central ceph conf file as follows:
[root@dcn02-compute-0 ~]# cd /var/lib/tripleo-config/ceph/
[root@dcn02-compute-0 ceph]# ll
total 16
-rw-------. 1 root root 257 Mar 13 13:56 ceph_central.client.openstack.keyring
-rw-r--r--. 1 root root 428 Mar 13 13:56 ceph_central.conf
-rw-------. 1 root root 205 Mar 15 18:45 ceph.client.openstack.keyring
-rw-r--r--. 1 root root 362 Mar 15 18:45 ceph.conf
[root@dcn02-compute-0 ceph]#
ceph.conf and ceph.client.openstack.keyring contain the fsid of the respective clusters in both dcn01 and dcn02. In the above cli output, the ceph.conf and ceph.client... files are the ones used to access the dcn02 ceph cluster, and the ceph_central* files are used for accessing the central ceph cluster.
glance multistore config:

[dcn02]
rbd_store_ceph_conf=/etc/ceph/ceph.conf
rbd_store_user=openstack
rbd_store_pool=images
rbd_thin_provisioning=False
store_description=dcn02 rbd glance store

[ceph_central]
rbd_store_ceph_conf=/etc/ceph/ceph_central.conf
rbd_store_user=openstack
rbd_store_pool=images
rbd_thin_provisioning=False
store_description=Default glance store backend.
With regards, Swogat Pradhan
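A quick sanity check on the layout above (the paths and the glance_api container name are assumptions; the two fsids should differ between the central and dcn02 clusters):

# on the dcn02 node, compare the fsids carried by each conf file
sudo grep -H fsid /var/lib/tripleo-config/ceph/ceph.conf /var/lib/tripleo-config/ceph/ceph_central.conf
# inside the glance container, confirm each store resolves to the intended file
sudo podman exec glance_api grep -A5 '\[dcn02\]' /etc/glance/glance-api.conf
sudo podman exec glance_api ls -l /etc/ceph/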
On Tue, Mar 21, 2023 at 5:52 PM John Fulton <johfulto@redhat.com> wrote:
On Tue, Mar 21, 2023 at 8:03 AM Swogat Pradhan <swogatpradhan22@gmail.com> wrote: > > Hi, > Seems like cinder is not using the local ceph.
That explains the issue. It's a misconfiguration.
I hope this is not a production system since the mailing list now has the cinder.conf which contains passwords.
The section that looks like this:
[tripleo_ceph]
volume_backend_name=tripleo_ceph
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_ceph_conf=/etc/ceph/ceph.conf
rbd_user=openstack
rbd_pool=volumes
rbd_flatten_volume_from_snapshot=False
rbd_secret_uuid=<redacted>
report_discard_supported=True
Should be updated to refer to the local DCN ceph cluster and not the central one. Use the ceph conf file for that cluster and ensure the rbd_secret_uuid corresponds to that one.
TripleO’s convention is to set the rbd_secret_uuid to the FSID of the Ceph cluster. The FSID should be in the ceph.conf file. The tripleo_nova_libvirt role will use virsh secret-* commands so that libvirt can retrieve the cephx secret using the FSID as a key. This can be confirmed with `podman exec nova_virtsecretd virsh secret-get-value $FSID`.
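A sketch of that check on a DCN compute/HCI node (container names follow TripleO defaults; adjust if your roles differ):

# which ceph.conf does cinder's rbd backend point at, and what fsid does it carry?
CONF=$(sudo podman exec cinder_volume awk -F' *= *' '/^rbd_ceph_conf/ {print $2; exit}' /etc/cinder/cinder.conf)
sudo podman exec cinder_volume grep fsid "$CONF"
sudo podman exec cinder_volume grep rbd_secret_uuid /etc/cinder/cinder.conf
# the fsid and rbd_secret_uuid should match, and libvirt should hold a secret for it
FSID=$(sudo podman exec cinder_volume awk -F' *= *' '/fsid/ {print $2; exit}' "$CONF")
sudo podman exec nova_virtsecretd virsh secret-get-value "$FSID"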
The documentation describes how to configure the central and DCN sites correctly but an error seems to have occurred while you were following it.
https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features...
John
> > Ceph Output: > [ceph: root@dcn02-ceph-all-0 /]# rbd -p images ls -l > NAME SIZE PARENT FMT PROT LOCK > 2abfafaa-eff4-4c2e-a538-dc2e1249ab65 8 MiB 2 excl > 55f40c8a-8f79-48c5-a52a-9b679b762f19 16 MiB 2 > 55f40c8a-8f79-48c5-a52a-9b679b762f19@snap 16 MiB 2 yes > 59f6a9cd-721c-45b5-a15f-fd021b08160d 321 MiB 2 > 59f6a9cd-721c-45b5-a15f-fd021b08160d@snap 321 MiB 2 yes > 5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0 386 MiB 2 > 5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0@snap 386 MiB 2 yes > 9b27248e-a8cf-4f00-a039-d3e3066cd26a 15 GiB 2 > 9b27248e-a8cf-4f00-a039-d3e3066cd26a@snap 15 GiB 2 yes > b7356adc-bb47-4c05-968b-6d3c9ca0079b 15 GiB 2 > b7356adc-bb47-4c05-968b-6d3c9ca0079b@snap 15 GiB 2 yes > e77e78ad-d369-4a1d-b758-8113621269a3 15 GiB 2 > e77e78ad-d369-4a1d-b758-8113621269a3@snap 15 GiB 2 yes > > [ceph: root@dcn02-ceph-all-0 /]# rbd -p volumes ls -l > NAME SIZE PARENT FMT PROT LOCK > volume-c644086f-d3cf-406d-b0f1-7691bde5981d 100 GiB 2 > volume-f0969935-a742-4744-9375-80bf323e4d63 10 GiB 2 > [ceph: root@dcn02-ceph-all-0 /]# > > Attached the cinder config. > Please let me know how I can solve this issue. > > With regards, > Swogat Pradhan > > On Tue, Mar 21, 2023 at 3:53 PM John Fulton <johfulto@redhat.com> wrote: >> >> in my last message under the line "On a DCN site if you run a command like this:" I suggested some steps you could try to confirm the image is a COW from the local glance as well as how to look at your cinder config. >> >> On Tue, Mar 21, 2023, 12:06 AM Swogat Pradhan < swogatpradhan22@gmail.com> wrote: >>> >>> Update: >>> I uploaded an image directly to the dcn02 store, and it takes around 10,15 minutes to create a volume with image in dcn02. >>> The image size is 389 MB. >>> >>> On Mon, Mar 20, 2023 at 10:26 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote: >>>> >>>> Hi Jhon, >>>> I checked in the ceph od dcn02, I can see the images created after importing from the central site. >>>> But launching an instance normally fails as it takes a long time for the volume to get created. >>>> >>>> When launching an instance from volume the instance is getting created properly without any errors. >>>> >>>> I tried to cache images in nova using https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep... but getting checksum failed error. >>>> >>>> With regards, >>>> Swogat Pradhan >>>> >>>> On Thu, Mar 16, 2023 at 5:24 PM John Fulton <johfulto@redhat.com> wrote: >>>>> >>>>> On Wed, Mar 15, 2023 at 8:05 PM Swogat Pradhan >>>>> <swogatpradhan22@gmail.com> wrote: >>>>> > >>>>> > Update: After restarting the nova services on the controller and running the deploy script on the edge site, I was able to launch the VM from volume. >>>>> > >>>>> > Right now the instance creation is failing as the block device creation is stuck in creating state, it is taking more than 10 mins for the volume to be created, whereas the image has already been imported to the edge glance. >>>>> >>>>> Try following this document and making the same observations in your >>>>> environment for AZs and their local ceph cluster. >>>>> >>>>> https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features... 
>>>>> >>>>> On a DCN site if you run a command like this: >>>>> >>>>> $ sudo cephadm shell --config /etc/ceph/dcn0.conf --keyring >>>>> /etc/ceph/dcn0.client.admin.keyring >>>>> $ rbd --cluster dcn0 -p volumes ls -l >>>>> NAME SIZE PARENT >>>>> FMT PROT LOCK >>>>> volume-28c6fc32-047b-4306-ad2d-de2be02716b7 8 GiB >>>>> images/8083c7e7-32d8-4f7a-b1da-0ed7884f1076@snap 2 excl >>>>> $ >>>>> >>>>> Then, you should see the parent of the volume is the image which is on >>>>> the same local ceph cluster. >>>>> >>>>> I wonder if something is misconfigured and thus you're encountering >>>>> the streaming behavior described here: >>>>> >>>>> Ideally all images should reside in the central Glance and be copied >>>>> to DCN sites before instances of those images are booted on DCN sites. >>>>> If an image is not copied to a DCN site before it is booted, then the >>>>> image will be streamed to the DCN site and then the image will boot as >>>>> an instance. This happens because Glance at the DCN site has access to >>>>> the images store at the Central ceph cluster. Though the booting of >>>>> the image will take time because it has not been copied in advance, >>>>> this is still preferable to failing to boot the image. >>>>> >>>>> You can also exec into the cinder container at the DCN site and >>>>> confirm it's using it's local ceph cluster. >>>>> >>>>> John >>>>> >>>>> > >>>>> > I will try and create a new fresh image and test again then update. >>>>> > >>>>> > With regards, >>>>> > Swogat Pradhan >>>>> > >>>>> > On Wed, Mar 15, 2023 at 11:13 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote: >>>>> >> >>>>> >> Update: >>>>> >> In the hypervisor list the compute node state is showing down. >>>>> >> >>>>> >> >>>>> >> On Wed, Mar 15, 2023 at 11:11 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote: >>>>> >>> >>>>> >>> Hi Brendan, >>>>> >>> Now i have deployed another site where i have used 2 linux bonds network template for both 3 compute nodes and 3 ceph nodes. >>>>> >>> The bonding options is set to mode=802.3ad (lacp=active). >>>>> >>> I used a cirros image to launch instance but the instance timed out so i waited for the volume to be created. >>>>> >>> Once the volume was created i tried launching the instance from the volume and still the instance is stuck in spawning state. >>>>> >>> >>>>> >>> Here is the nova-compute log: >>>>> >>> >>>>> >>> 2023-03-15 17:35:47.739 185437 INFO oslo.privsep.daemon [-] privsep daemon starting >>>>> >>> 2023-03-15 17:35:47.744 185437 INFO oslo.privsep.daemon [-] privsep process running with uid/gid: 0/0 >>>>> >>> 2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon [-] privsep process running with capabilities (eff/prm/inh): CAP_SYS_ADMIN/CAP_SYS_ADMIN/none >>>>> >>> 2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon [-] privsep daemon running as pid 185437 >>>>> >>> 2023-03-15 17:35:47.974 8 WARNING os_brick.initiator.connectors.nvmeof [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Process execution error in _get_host_uuid: Unexpected error while running command. >>>>> >>> Command: blkid overlay -s UUID -o value >>>>> >>> Exit code: 2 >>>>> >>> Stdout: '' >>>>> >>> Stderr: '': oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command. 
>>>>> >>> 2023-03-15 17:35:51.616 8 INFO nova.virt.libvirt.driver [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] [instance: 450b749c-a10a-4308-80a9-3b8020fee758] Creating image >>>>> >>> >>>>> >>> It is stuck in creating image, do i need to run the template mentioned here ?: https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep... >>>>> >>> >>>>> >>> The volume is already created and i do not understand why the instance is stuck in spawning state. >>>>> >>> >>>>> >>> With regards, >>>>> >>> Swogat Pradhan >>>>> >>> >>>>> >>> >>>>> >>> On Sun, Mar 5, 2023 at 4:02 PM Brendan Shephard < bshephar@redhat.com> wrote: >>>>> >>>> >>>>> >>>> Does your environment use different network interfaces for each of the networks? Or does it have a bond with everything on it? >>>>> >>>> >>>>> >>>> One issue I have seen before is that when launching instances, there is a lot of network traffic between nodes as the hypervisor needs to download the image from Glance. Along with various other services sending normal network traffic, it can be enough to cause issues if everything is running over a single 1Gbe interface. >>>>> >>>> >>>>> >>>> I have seen the same situation in fact when using a single active/backup bond on 1Gbe nics. It’s worth checking the network traffic while you try to spawn the instance to see if you’re dropping packets. In the situation I described, there were dropped packets which resulted in a loss of communication between nova_compute and RMQ, so the node appeared offline. You should also confirm that nova_compute is being disconnected in the nova_compute logs if you tail them on the Hypervisor while spawning the instance. >>>>> >>>> >>>>> >>>> In my case, changing from active/backup to LACP helped. So, based on that experience, from my perspective, is certainly sounds like some kind of network issue. >>>>> >>>> >>>>> >>>> Regards, >>>>> >>>> >>>>> >>>> Brendan Shephard >>>>> >>>> Senior Software Engineer >>>>> >>>> Red Hat Australia >>>>> >>>> >>>>> >>>> >>>>> >>>> >>>>> >>>> On 5 Mar 2023, at 6:47 am, Eugen Block <eblock@nde.ag> wrote: >>>>> >>>> >>>>> >>>> Hi, >>>>> >>>> >>>>> >>>> I tried to help someone with a similar issue some time ago in this thread: >>>>> >>>> https://serverfault.com/questions/1116771/openstack-oslo-messaging-exception... >>>>> >>>> >>>>> >>>> But apparently a neutron reinstallation fixed it for that user, not sure if that could apply here. But is it possible that your nova and neutron versions are different between central and edge site? Have you restarted nova and neutron services on the compute nodes after installation? Have you debug logs of nova-conductor and maybe nova-compute? Maybe they can help narrow down the issue. >>>>> >>>> If there isn't any additional information in the debug logs I probably would start "tearing down" rabbitmq. I didn't have to do that in a production system yet so be careful. I can think of two routes: >>>>> >>>> >>>>> >>>> - Either remove queues, exchanges etc. while rabbit is running, this will most likely impact client IO depending on your load. Check out the rabbitmqctl commands. >>>>> >>>> - Or stop the rabbitmq cluster, remove the mnesia tables from all nodes and restart rabbitmq so the exchanges, queues etc. rebuild. >>>>> >>>> >>>>> >>>> I can imagine that the failed reply "survives" while being replicated across the rabbit nodes. 
But I don't really know the rabbit internals too well, so maybe someone else can chime in here and give a better advice. >>>>> >>>> >>>>> >>>> Regards, >>>>> >>>> Eugen >>>>> >>>> >>>>> >>>> Zitat von Swogat Pradhan <swogatpradhan22@gmail.com>: >>>>> >>>> >>>>> >>>> Hi, >>>>> >>>> Can someone please help me out on this issue? >>>>> >>>> >>>>> >>>> With regards, >>>>> >>>> Swogat Pradhan >>>>> >>>> >>>>> >>>> On Thu, Mar 2, 2023 at 1:24 PM Swogat Pradhan < swogatpradhan22@gmail.com> >>>>> >>>> wrote: >>>>> >>>> >>>>> >>>> Hi >>>>> >>>> I don't see any major packet loss. >>>>> >>>> It seems the problem is somewhere in rabbitmq maybe but not due to packet >>>>> >>>> loss. >>>>> >>>> >>>>> >>>> with regards, >>>>> >>>> Swogat Pradhan >>>>> >>>> >>>>> >>>> On Wed, Mar 1, 2023 at 3:34 PM Swogat Pradhan < swogatpradhan22@gmail.com> >>>>> >>>> wrote: >>>>> >>>> >>>>> >>>> Hi, >>>>> >>>> Yes the MTU is the same as the default '1500'. >>>>> >>>> Generally I haven't seen any packet loss, but never checked when >>>>> >>>> launching the instance. >>>>> >>>> I will check that and come back. >>>>> >>>> But everytime i launch an instance the instance gets stuck at spawning >>>>> >>>> state and there the hypervisor becomes down, so not sure if packet loss >>>>> >>>> causes this. >>>>> >>>> >>>>> >>>> With regards, >>>>> >>>> Swogat pradhan >>>>> >>>> >>>>> >>>> On Wed, Mar 1, 2023 at 3:30 PM Eugen Block <eblock@nde.ag> wrote: >>>>> >>>> >>>>> >>>> One more thing coming to mind is MTU size. Are they identical between >>>>> >>>> central and edge site? Do you see packet loss through the tunnel? >>>>> >>>> >>>>> >>>> Zitat von Swogat Pradhan <swogatpradhan22@gmail.com>: >>>>> >>>> >>>>> >>>> > Hi Eugen, >>>>> >>>> > Request you to please add my email either on 'to' or 'cc' as i am not >>>>> >>>> > getting email's from you. >>>>> >>>> > Coming to the issue: >>>>> >>>> > >>>>> >>>> > [root@overcloud-controller-no-ceph-3 /]# rabbitmqctl list_policies -p >>>>> >>>> / >>>>> >>>> > Listing policies for vhost "/" ... >>>>> >>>> > vhost name pattern apply-to definition priority >>>>> >>>> > / ha-all ^(?!amq\.).* queues >>>>> >>>> > >>>>> >>>> {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0 >>>>> >>>> > >>>>> >>>> > I have the edge site compute nodes up, it only goes down when i am >>>>> >>>> trying >>>>> >>>> > to launch an instance and the instance comes to a spawning state and >>>>> >>>> then >>>>> >>>> > gets stuck. >>>>> >>>> > >>>>> >>>> > I have a tunnel setup between the central and the edge sites. >>>>> >>>> > >>>>> >>>> > With regards, >>>>> >>>> > Swogat Pradhan >>>>> >>>> > >>>>> >>>> > On Tue, Feb 28, 2023 at 9:11 PM Swogat Pradhan < >>>>> >>>> swogatpradhan22@gmail.com> >>>>> >>>> > wrote: >>>>> >>>> > >>>>> >>>> >> Hi Eugen, >>>>> >>>> >> For some reason i am not getting your email to me directly, i am >>>>> >>>> checking >>>>> >>>> >> the email digest and there i am able to find your reply. >>>>> >>>> >> Here is the log for download: https://we.tl/t-L8FEkGZFSq >>>>> >>>> >> Yes, these logs are from the time when the issue occurred. >>>>> >>>> >> >>>>> >>>> >> *Note: i am able to create vm's and perform other activities in the >>>>> >>>> >> central site, only facing this issue in the edge site.* >>>>> >>>> >> >>>>> >>>> >> With regards, >>>>> >>>> >> Swogat Pradhan >>>>> >>>> >> >>>>> >>>> >> On Mon, Feb 27, 2023 at 5:12 PM Swogat Pradhan < >>>>> >>>> swogatpradhan22@gmail.com> >>>>> >>>> >> wrote: >>>>> >>>> >> >>>>> >>>> >>> Hi Eugen, >>>>> >>>> >>> Thanks for your response. 
>>>>> >>>> >>> I have actually a 4 controller setup so here are the details: >>>>> >>>> >>> >>>>> >>>> >>> *PCS Status:* >>>>> >>>> >>> * Container bundle set: rabbitmq-bundle [ >>>>> >>>> >>> 172.25.201.68:8787/tripleomaster/openstack-rabbitmq:pcmklatest]: >>>>> >>>> >>> * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): >>>>> >>>> Started >>>>> >>>> >>> overcloud-controller-no-ceph-3 >>>>> >>>> >>> * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): >>>>> >>>> Started >>>>> >>>> >>> overcloud-controller-2 >>>>> >>>> >>> * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): >>>>> >>>> Started >>>>> >>>> >>> overcloud-controller-1 >>>>> >>>> >>> * rabbitmq-bundle-3 (ocf::heartbeat:rabbitmq-cluster): >>>>> >>>> Started >>>>> >>>> >>> overcloud-controller-0 >>>>> >>>> >>> >>>>> >>>> >>> I have tried restarting the bundle multiple times but the issue is >>>>> >>>> still >>>>> >>>> >>> present. >>>>> >>>> >>> >>>>> >>>> >>> *Cluster status:* >>>>> >>>> >>> [root@overcloud-controller-0 /]# rabbitmqctl cluster_status >>>>> >>>> >>> Cluster status of node >>>>> >>>> >>> rabbit@overcloud-controller-0.internalapi.bdxworld.com ... >>>>> >>>> >>> Basics >>>>> >>>> >>> >>>>> >>>> >>> Cluster name: rabbit@overcloud-controller-no-ceph-3.bdxworld.com >>>>> >>>> >>> >>>>> >>>> >>> Disk Nodes >>>>> >>>> >>> >>>>> >>>> >>> rabbit@overcloud-controller-0.internalapi.bdxworld.com >>>>> >>>> >>> rabbit@overcloud-controller-1.internalapi.bdxworld.com >>>>> >>>> >>> rabbit@overcloud-controller-2.internalapi.bdxworld.com >>>>> >>>> >>> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >>>>> >>>> >>> >>>>> >>>> >>> Running Nodes >>>>> >>>> >>> >>>>> >>>> >>> rabbit@overcloud-controller-0.internalapi.bdxworld.com >>>>> >>>> >>> rabbit@overcloud-controller-1.internalapi.bdxworld.com >>>>> >>>> >>> rabbit@overcloud-controller-2.internalapi.bdxworld.com >>>>> >>>> >>> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >>>>> >>>> >>> >>>>> >>>> >>> Versions >>>>> >>>> >>> >>>>> >>>> >>> rabbit@overcloud-controller-0.internalapi.bdxworld.com: RabbitMQ >>>>> >>>> 3.8.3 >>>>> >>>> >>> on Erlang 22.3.4.1 >>>>> >>>> >>> rabbit@overcloud-controller-1.internalapi.bdxworld.com: RabbitMQ >>>>> >>>> 3.8.3 >>>>> >>>> >>> on Erlang 22.3.4.1 >>>>> >>>> >>> rabbit@overcloud-controller-2.internalapi.bdxworld.com: RabbitMQ >>>>> >>>> 3.8.3 >>>>> >>>> >>> on Erlang 22.3.4.1 >>>>> >>>> >>> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com: >>>>> >>>> RabbitMQ >>>>> >>>> >>> 3.8.3 on Erlang 22.3.4.1 >>>>> >>>> >>> >>>>> >>>> >>> Alarms >>>>> >>>> >>> >>>>> >>>> >>> (none) >>>>> >>>> >>> >>>>> >>>> >>> Network Partitions >>>>> >>>> >>> >>>>> >>>> >>> (none) >>>>> >>>> >>> >>>>> >>>> >>> Listeners >>>>> >>>> >>> >>>>> >>>> >>> Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, >>>>> >>>> interface: >>>>> >>>> >>> [::], port: 25672, protocol: clustering, purpose: inter-node and CLI >>>>> >>>> tool >>>>> >>>> >>> communication >>>>> >>>> >>> Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, >>>>> >>>> interface: >>>>> >>>> >>> 172.25.201.212, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 >>>>> >>>> >>> and AMQP 1.0 >>>>> >>>> >>> Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, >>>>> >>>> interface: >>>>> >>>> >>> [::], port: 15672, protocol: http, purpose: HTTP API >>>>> >>>> >>> Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, >>>>> >>>> interface: >>>>> >>>> >>> [::], port: 25672, protocol: clustering, purpose: inter-node and CLI >>>>> >>>> 
tool >>>>> >>>> >>> communication >>>>> >>>> >>> Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, >>>>> >>>> interface: >>>>> >>>> >>> 172.25.201.205, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 >>>>> >>>> >>> and AMQP 1.0 >>>>> >>>> >>> Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, >>>>> >>>> interface: >>>>> >>>> >>> [::], port: 15672, protocol: http, purpose: HTTP API >>>>> >>>> >>> Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, >>>>> >>>> interface: >>>>> >>>> >>> [::], port: 25672, protocol: clustering, purpose: inter-node and CLI >>>>> >>>> tool >>>>> >>>> >>> communication >>>>> >>>> >>> Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, >>>>> >>>> interface: >>>>> >>>> >>> 172.25.201.201, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 >>>>> >>>> >>> and AMQP 1.0 >>>>> >>>> >>> Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, >>>>> >>>> interface: >>>>> >>>> >>> [::], port: 15672, protocol: http, purpose: HTTP API >>>>> >>>> >>> Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >>>>> >>>> , >>>>> >>>> >>> interface: [::], port: 25672, protocol: clustering, purpose: >>>>> >>>> inter-node and >>>>> >>>> >>> CLI tool communication >>>>> >>>> >>> Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >>>>> >>>> , >>>>> >>>> >>> interface: 172.25.201.209, port: 5672, protocol: amqp, purpose: AMQP >>>>> >>>> 0-9-1 >>>>> >>>> >>> and AMQP 1.0 >>>>> >>>> >>> Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >>>>> >>>> , >>>>> >>>> >>> interface: [::], port: 15672, protocol: http, purpose: HTTP API >>>>> >>>> >>> >>>>> >>>> >>> Feature flags >>>>> >>>> >>> >>>>> >>>> >>> Flag: drop_unroutable_metric, state: enabled >>>>> >>>> >>> Flag: empty_basic_get_metric, state: enabled >>>>> >>>> >>> Flag: implicit_default_bindings, state: enabled >>>>> >>>> >>> Flag: quorum_queue, state: enabled >>>>> >>>> >>> Flag: virtual_host_metadata, state: enabled >>>>> >>>> >>> >>>>> >>>> >>> *Logs:* >>>>> >>>> >>> *(Attached)* >>>>> >>>> >>> >>>>> >>>> >>> With regards, >>>>> >>>> >>> Swogat Pradhan >>>>> >>>> >>> >>>>> >>>> >>> On Sun, Feb 26, 2023 at 2:34 PM Swogat Pradhan < >>>>> >>>> swogatpradhan22@gmail.com> >>>>> >>>> >>> wrote: >>>>> >>>> >>> >>>>> >>>> >>>> Hi, >>>>> >>>> >>>> Please find the nova conductor as well as nova api log. 
>>>>> >>>> >>>> >>>>> >>>> >>>> nova-conuctor: >>>>> >>>> >>>> >>>>> >>>> >>>> 2023-02-26 08:45:01.108 31 WARNING >>>>> >>>> oslo_messaging._drivers.amqpdriver >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] >>>>> >>>> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to >>>>> >>>> >>>> 16152921c1eb45c2b1f562087140168b >>>>> >>>> >>>> 2023-02-26 08:45:02.144 26 WARNING >>>>> >>>> oslo_messaging._drivers.amqpdriver >>>>> >>>> >>>> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] >>>>> >>>> >>>> reply_276049ec36a84486a8a406911d9802f4 doesn't exist, drop reply to >>>>> >>>> >>>> 83dbe5f567a940b698acfe986f6194fa >>>>> >>>> >>>> 2023-02-26 08:45:02.314 32 WARNING >>>>> >>>> oslo_messaging._drivers.amqpdriver >>>>> >>>> >>>> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] >>>>> >>>> >>>> reply_276049ec36a84486a8a406911d9802f4 doesn't exist, drop reply to >>>>> >>>> >>>> f3bfd7f65bd542b18d84cea3033abb43: >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >>>>> >>>> >>>> 2023-02-26 08:45:02.316 32 ERROR oslo_messaging._drivers.amqpdriver >>>>> >>>> >>>> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] The reply >>>>> >>>> >>>> f3bfd7f65bd542b18d84cea3033abb43 failed to send after 60 seconds >>>>> >>>> due to a >>>>> >>>> >>>> missing queue (reply_276049ec36a84486a8a406911d9802f4). >>>>> >>>> Abandoning...: >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >>>>> >>>> >>>> 2023-02-26 08:48:01.282 35 WARNING >>>>> >>>> oslo_messaging._drivers.amqpdriver >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] >>>>> >>>> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to >>>>> >>>> >>>> d4b9180f91a94f9a82c3c9c4b7595566: >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >>>>> >>>> >>>> 2023-02-26 08:48:01.284 35 ERROR oslo_messaging._drivers.amqpdriver >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply >>>>> >>>> >>>> d4b9180f91a94f9a82c3c9c4b7595566 failed to send after 60 seconds >>>>> >>>> due to a >>>>> >>>> >>>> missing queue (reply_349bcb075f8c49329435a0f884b33066). >>>>> >>>> Abandoning...: >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >>>>> >>>> >>>> 2023-02-26 08:49:01.303 33 WARNING >>>>> >>>> oslo_messaging._drivers.amqpdriver >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] >>>>> >>>> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to >>>>> >>>> >>>> 897911a234a445d8a0d8af02ece40f6f: >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >>>>> >>>> >>>> 2023-02-26 08:49:01.304 33 ERROR oslo_messaging._drivers.amqpdriver >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply >>>>> >>>> >>>> 897911a234a445d8a0d8af02ece40f6f failed to send after 60 seconds >>>>> >>>> due to a >>>>> >>>> >>>> missing queue (reply_349bcb075f8c49329435a0f884b33066). >>>>> >>>> Abandoning...: >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >>>>> >>>> >>>> 2023-02-26 08:49:52.254 31 WARNING nova.cache_utils >>>>> >>>> >>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>>>> >>>> b240e3e89d99489284cd731e75f2a5db >>>>> >>>> >>>> 4160ce999a31485fa643aed0936dfef0 - default default] Cache enabled >>>>> >>>> with >>>>> >>>> >>>> backend dogpile.cache.null. 
>>>>> >>>> >>>> 2023-02-26 08:50:01.264 27 WARNING >>>>> >>>> oslo_messaging._drivers.amqpdriver >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] >>>>> >>>> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to >>>>> >>>> >>>> 8f723ceb10c3472db9a9f324861df2bb: >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >>>>> >>>> >>>> 2023-02-26 08:50:01.266 27 ERROR oslo_messaging._drivers.amqpdriver >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply >>>>> >>>> >>>> 8f723ceb10c3472db9a9f324861df2bb failed to send after 60 seconds >>>>> >>>> due to a >>>>> >>>> >>>> missing queue (reply_349bcb075f8c49329435a0f884b33066). >>>>> >>>> Abandoning...: >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >>>>> >>>> >>>> >>>>> >>>> >>>> With regards, >>>>> >>>> >>>> Swogat Pradhan >>>>> >>>> >>>> >>>>> >>>> >>>> On Sun, Feb 26, 2023 at 2:26 PM Swogat Pradhan < >>>>> >>>> >>>> swogatpradhan22@gmail.com> wrote: >>>>> >>>> >>>> >>>>> >>>> >>>>> Hi, >>>>> >>>> >>>>> I currently have 3 compute nodes on edge site1 where i am trying to >>>>> >>>> >>>>> launch vm's. >>>>> >>>> >>>>> When the VM is in spawning state the node goes down (openstack >>>>> >>>> compute >>>>> >>>> >>>>> service list), the node comes backup when i restart the nova >>>>> >>>> compute >>>>> >>>> >>>>> service but then the launch of the vm fails. >>>>> >>>> >>>>> >>>>> >>>> >>>>> nova-compute.log >>>>> >>>> >>>>> >>>>> >>>> >>>>> 2023-02-26 08:15:51.808 7 INFO nova.compute.manager >>>>> >>>> >>>>> [req-bc0f5f2e-53fc-4dae-b1da-82f1f972d617 - - - - -] Running >>>>> >>>> >>>>> instance usage >>>>> >>>> >>>>> audit for host dcn01-hci-0.bdxworld.com from 2023-02-26 07:00:00 >>>>> >>>> to >>>>> >>>> >>>>> 2023-02-26 08:00:00. 0 instances. >>>>> >>>> >>>>> 2023-02-26 08:49:52.813 7 INFO nova.compute.claims >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] [instance: >>>>> >>>> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Claim successful on node >>>>> >>>> >>>>> dcn01-hci-0.bdxworld.com >>>>> >>>> >>>>> 2023-02-26 08:49:54.225 7 INFO nova.virt.libvirt.driver >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] [instance: >>>>> >>>> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Ignoring supplied device >>>>> >>>> name: >>>>> >>>> >>>>> /dev/vda. Libvirt can't honour user-supplied dev names >>>>> >>>> >>>>> 2023-02-26 08:49:54.398 7 INFO nova.virt.block_device >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] [instance: >>>>> >>>> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Booting with volume >>>>> >>>> >>>>> c4bd7885-5973-4860-bbe6-7a2f726baeee at /dev/vda >>>>> >>>> >>>>> 2023-02-26 08:49:55.216 7 WARNING nova.cache_utils >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] Cache enabled >>>>> >>>> with >>>>> >>>> >>>>> backend dogpile.cache.null. 
>>>>> >>>> >>>>> 2023-02-26 08:49:55.283 7 INFO oslo.privsep.daemon >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] Running >>>>> >>>> >>>>> privsep helper: >>>>> >>>> >>>>> ['sudo', 'nova-rootwrap', '/etc/nova/rootwrap.conf', >>>>> >>>> 'privsep-helper', >>>>> >>>> >>>>> '--config-file', '/etc/nova/nova.conf', '--config-file', >>>>> >>>> >>>>> '/etc/nova/nova-compute.conf', '--privsep_context', >>>>> >>>> >>>>> 'os_brick.privileged.default', '--privsep_sock_path', >>>>> >>>> >>>>> '/tmp/tmpin40tah6/privsep.sock'] >>>>> >>>> >>>>> 2023-02-26 08:49:55.791 7 INFO oslo.privsep.daemon >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] Spawned new >>>>> >>>> privsep >>>>> >>>> >>>>> daemon via rootwrap >>>>> >>>> >>>>> 2023-02-26 08:49:55.717 2647 INFO oslo.privsep.daemon [-] privsep >>>>> >>>> >>>>> daemon starting >>>>> >>>> >>>>> 2023-02-26 08:49:55.722 2647 INFO oslo.privsep.daemon [-] privsep >>>>> >>>> >>>>> process running with uid/gid: 0/0 >>>>> >>>> >>>>> 2023-02-26 08:49:55.726 2647 INFO oslo.privsep.daemon [-] privsep >>>>> >>>> >>>>> process running with capabilities (eff/prm/inh): >>>>> >>>> >>>>> CAP_SYS_ADMIN/CAP_SYS_ADMIN/none >>>>> >>>> >>>>> 2023-02-26 08:49:55.726 2647 INFO oslo.privsep.daemon [-] privsep >>>>> >>>> >>>>> daemon running as pid 2647 >>>>> >>>> >>>>> 2023-02-26 08:49:55.956 7 WARNING >>>>> >>>> os_brick.initiator.connectors.nvmeof >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] Process >>>>> >>>> >>>>> execution error >>>>> >>>> >>>>> in _get_host_uuid: Unexpected error while running command. >>>>> >>>> >>>>> Command: blkid overlay -s UUID -o value >>>>> >>>> >>>>> Exit code: 2 >>>>> >>>> >>>>> Stdout: '' >>>>> >>>> >>>>> Stderr: '': oslo_concurrency.processutils.ProcessExecutionError: >>>>> >>>> >>>>> Unexpected error while running command. >>>>> >>>> >>>>> 2023-02-26 08:49:58.247 7 INFO nova.virt.libvirt.driver >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] [instance: >>>>> >>>> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Creating image >>>>> >>>> >>>>> >>>>> >>>> >>>>> Is there a way to solve this issue? >>>>> >>>> >>>>> >>>>> >>>> >>>>> >>>>> >>>> >>>>> With regards, >>>>> >>>> >>>>> >>>>> >>>> >>>>> Swogat Pradhan >>>>> >>>> >>>>> >>>>> >>>> >>>> >>>>> >>>> >>>>> >>>> >>>>> >>>> >>>>> >>>> >>>>> >>>> >>>>> >>>> >>>>> >>>> >>>>> >>>> >>>>> >>>> >>>>>
Cinder volume config:

[tripleo_ceph]
volume_backend_name=tripleo_ceph
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_user=openstack
rbd_pool=volumes
rbd_flatten_volume_from_snapshot=False
rbd_secret_uuid=a8d5f1f5-48e7-5ede-89ab-8aca59b6397b
report_discard_supported=True
rbd_ceph_conf=/etc/ceph/dcn02.conf
rbd_cluster_name=dcn02

Glance api config:

[dcn02]
rbd_store_ceph_conf=/etc/ceph/dcn02.conf
rbd_store_user=openstack
rbd_store_pool=images
rbd_thin_provisioning=False
store_description=dcn02 rbd glance store

[ceph]
rbd_store_ceph_conf=/etc/ceph/ceph.conf
rbd_store_user=openstack
rbd_store_pool=images
rbd_thin_provisioning=False
store_description=Default glance store backend.

On Thu, Mar 23, 2023 at 2:29 AM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
I still have the same issue, I'm not sure what's left to try. All the pods are now in a healthy state, I am getting log entries 3 mins after I hit the create volume button in cinder-volume when I try to create a volume with an image. And the volumes are just stuck in creating state for more than 20 mins now.
Cinder logs: 2023-03-22 20:32:44.010 108 INFO cinder.rpc [req-0d2093a0-efbd-45a5-bd7d-cce25ddc200e b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Automatically selected cinder-volume RPC version 3.17 as minimum service version. 2023-03-22 20:34:59.166 108 INFO cinder.volume.flows.manager.create_volume [req-0d2093a0-efbd-45a5-bd7d-cce25ddc200e b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Volume 5743a879-090d-46db-bc7c-1c0b0669a112: being created as image with specification: {'status': 'creating', 'volume_name': 'volume-5743a879-090d-46db-bc7c-1c0b0669a112', 'volume_size': 2, 'image_id': 'acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b', 'image_location': ('rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'metadata': {'store': 'dcn02'}}]), 'image_meta': {'name': 'cirros', 'disk_format': 'qcow2', 'container_format': 'bare', 'visibility': 'public', 'size': 16338944, 'virtual_size': 117440512, 'status': 'active', 'checksum': '1d3062cd89af34e419f7100277f38b2b', 'protected': False, 'min_ram': 0, 'min_disk': 0, 'owner': '4160ce999a31485fa643aed0936dfef0', 'os_hidden': False, 'os_hash_algo': 'sha512', 'os_hash_value': '553d220ed58cfee7dafe003c446a9f197ab5edf8ffc09396c74187cf83873c877e7ae041cb80f3b91489acf687183adcd689b53b38e3ddd22e627e7f98a09c46', 'id': 'acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b', 'created_at': datetime.datetime(2023, 3, 22, 18, 50, 5, tzinfo=datetime.timezone.utc), 'updated_at': datetime.datetime(2023, 3, 22, 20, 3, 54, tzinfo=datetime.timezone.utc), 'locations': [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'metadata': {'store': 'dcn02'}}], 'direct_url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'tags': [], 'file': '/v2/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/file', 'stores': 'ceph,dcn02', 'properties': {'os_glance_failed_import': '', 'os_glance_importing_to_stores': '', 'owner_specified.openstack.md5': '', 'owner_specified.openstack.object': 'images/cirros', 'owner_specified.openstack.sha256': ''}}, 'image_service': <cinder.image.glance.GlanceImageService object at 0x7f8147973438>}
With regards, Swogat Pradhan
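A quick way to narrow this down is to check whether the cinder-volume backend serving the edge AZ is reported as up, and which host the stuck volume was scheduled to. This is only a sketch and assumes admin credentials; the volume ID is a placeholder:

$ openstack volume service list
$ openstack volume show <volume-id> -c status -c availability_zone -c os-vol-host-attr:host

If the cinder-volume service for the dcn02 backend shows as down, or the volume never gets a host assigned, the problem is in scheduling/RPC rather than in the image download itself.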
On Wed, Mar 22, 2023 at 9:19 PM Alan Bishop <abishop@redhat.com> wrote:
On Wed, Mar 22, 2023 at 8:38 AM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi Alan, The systems are in the same LAN. In this case it seemed like the image was being pulled from the central site, which was caused by a misconfiguration in the ceph.conf file in the /var/lib/tripleo-config/ceph/ directory; that seems to have been resolved by the changes I made to fix it.
Right now the glance api podman container is running in an unhealthy state and the podman logs don't show any error whatsoever. When I issue the command netstat -nultp I do not see any entry for the glance port (9292) at the dcn site, which is why cinder is throwing an error stating:
2023-03-22 13:32:29.786 108 ERROR oslo_messaging.rpc.server cinder.exception.GlanceConnectionFailed: Connection to glance failed: Error finding address for http://172.25.228.253:9292/v2/images/736d8779-07cd-4510-bab2-adcb653cc538: Unable to establish connection to http://172.25.228.253:9292/v2/images/736d8779-07cd-4510-bab2-adcb653cc538: HTTPConnectionPool(host='172.25.228.253', port=9292): Max retries exceeded with url: /v2/images/736d8779-07cd-4510-bab2-adcb653cc538 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7682d2cd30>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))
Now I need to find out why the port is not listening even though the glance service is running, and I am not sure how to do that.
One other thing to investigate is whether your deployment includes this patch [1]. If it does, then bear in mind the glance-api service running at the edge site will be an "internal" (non public facing) instance that uses port 9293 instead of 9292. You should familiarize yourself with the release note [2].
[1] https://opendev.org/openstack/tripleo-heat-templates/commit/3605d45e417a77a1... [2] https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/walla...
Alan
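A quick way to check both points on the edge node running glance (a sketch; the container name glance_api and the internal API IP are assumptions, adjust to the actual deployment):

$ sudo podman ps -a --filter name=glance --format '{{.Names}} {{.Status}}'
$ sudo podman healthcheck run glance_api; echo $?
$ sudo podman logs --tail 100 glance_api
$ sudo ss -nltp | grep -E ':(9292|9293)'
$ curl -s http://<internal-api-ip>:9293/ | head -c 200

If the patch Alan mentions is included, only 9293 should be listening at the edge, and cinder would need to use that internal endpoint instead of 9292.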
With regards, Swogat Pradhan
On Wed, Mar 22, 2023 at 8:11 PM Alan Bishop <abishop@redhat.com> wrote:
On Wed, Mar 22, 2023 at 6:37 AM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
Update: Here is the log when creating a volume using a cirros image:
2023-03-22 11:04:38.449 109 INFO cinder.volume.flows.manager.create_volume [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Volume bf341343-6609-4b8c-b9e0-93e2a89c8c8f: being created as image with specification: {'status': 'creating', 'volume_name': 'volume-bf341343-6609-4b8c-b9e0-93e2a89c8c8f', 'volume_size': 4, 'image_id': '736d8779-07cd-4510-bab2-adcb653cc538', 'image_location': ('rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'dcn02'}}]), 'image_meta': {'name': 'cirros', 'disk_format': 'qcow2', 'container_format': 'bare', 'visibility': 'public', 'size': 16338944, 'virtual_size': 117440512, 'status': 'active', 'checksum': '1d3062cd89af34e419f7100277f38b2b', 'protected': False, 'min_ram': 0, 'min_disk': 0, 'owner': '4160ce999a31485fa643aed0936dfef0', 'os_hidden': False, 'os_hash_algo': 'sha512', 'os_hash_value': '553d220ed58cfee7dafe003c446a9f197ab5edf8ffc09396c74187cf83873c877e7ae041cb80f3b91489acf687183adcd689b53b38e3ddd22e627e7f98a09c46', 'id': '736d8779-07cd-4510-bab2-adcb653cc538', 'created_at': datetime.datetime(2023, 3, 22, 10, 44, 12, tzinfo=datetime.timezone.utc), 'updated_at': datetime.datetime(2023, 3, 22, 10, 54, 1, tzinfo=datetime.timezone.utc), 'locations': [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'dcn02'}}], 'direct_url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'tags': [], 'file': '/v2/images/736d8779-07cd-4510-bab2-adcb653cc538/file', 'stores': 'ceph,dcn02', 'properties': {'os_glance_failed_import': '', 'os_glance_importing_to_stores': '', 'owner_specified.openstack.md5': '', 'owner_specified.openstack.object': 'images/cirros', 'owner_specified.openstack.sha256': ''}}, 'image_service': <cinder.image.glance.GlanceImageService object at 0x7f449ded1198>} 2023-03-22 11:06:16.570 109 INFO cinder.image.image_utils [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Image download 15.58 MB at 0.16 MB/s
As Adam Savage would say, well there's your problem ^^ (Image download 15.58 MB at 0.16 MB/s). Downloading the image takes too long, and 0.16 MB/s suggests you have a network issue.
John Fulton previously stated your cinder-volume service at the edge site is not using the local ceph image store. Assuming you are deploying GlanceApiEdge service [1], then the cinder-volume service should be configured to use the local glance service [2]. You should check cinder's glance_api_servers to confirm it's the edge site's glance service.
[1] https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/envi... [2] https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/depl...
Alan
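One way to confirm which glance endpoint the edge cinder-volume is using (a sketch; cinder_volume is the usual kolla container name on the node running the edge cinder-volume service and may differ, e.g. when the service is pacemaker-managed):

$ sudo podman exec cinder_volume grep -n glance_api_servers /etc/cinder/cinder.conf

The value should point at the edge site's glance endpoint rather than the central one.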
2023-03-22 11:07:54.023 109 WARNING py.warnings [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] /usr/lib/python3.6/site-packages/oslo_utils/imageutils.py:75: FutureWarning: The human format is deprecated and the format parameter will be removed. Use explicitly json instead in version 'xena' category=FutureWarning)
2023-03-22 11:11:12.161 109 WARNING py.warnings [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] /usr/lib/python3.6/site-packages/oslo_utils/imageutils.py:75: FutureWarning: The human format is deprecated and the format parameter will be removed. Use explicitly json instead in version 'xena' category=FutureWarning)
2023-03-22 11:11:12.163 109 INFO cinder.image.image_utils [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Converted 112.00 MB image at 112.00 MB/s 2023-03-22 11:11:14.998 109 INFO cinder.volume.flows.manager.create_volume [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Volume volume-bf341343-6609-4b8c-b9e0-93e2a89c8c8f (bf341343-6609-4b8c-b9e0-93e2a89c8c8f): created successfully 2023-03-22 11:11:15.195 109 INFO cinder.volume.manager [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Created volume successfully.
The image is present in the dcn02 store, but it still downloaded the image at 0.16 MB/s and then created the volume.
With regards, Swogat Pradhan
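One thing worth comparing is the fsid in the ceph conf that cinder's RBD backend uses against the cluster id in the dcn02 rbd:// location shown in the log above; only when they match can the RBD driver clone the image instead of streaming it through glance. A sketch, assuming the container names below:

$ sudo podman exec cinder_volume grep fsid /etc/ceph/dcn02.conf
# should print a8d5f1f5-48e7-5ede-89ab-8aca59b6397b, the cluster id in
# rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/<image-id>/snap

$ sudo podman exec nova_virtsecretd virsh secret-get-value a8d5f1f5-48e7-5ede-89ab-8aca59b6397b
# on a compute node, confirms libvirt holds the cephx key for that cluster, as John Fulton mentioned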
On Tue, Mar 21, 2023 at 6:10 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
Hi John, This seems to be an issue. When I deployed the dcn ceph in both dcn01 and dcn02, the --cluster parameter was set to the respective cluster names, but the config files were still created as ceph.conf and the keyring as ceph.client.openstack.keyring.
This created issues in glance as well, since the naming convention of the files didn't match the cluster names, so I had to manually rename the central ceph conf file as follows:
[root@dcn02-compute-0 ~]# cd /var/lib/tripleo-config/ceph/
[root@dcn02-compute-0 ceph]# ll
total 16
-rw-------. 1 root root 257 Mar 13 13:56 ceph_central.client.openstack.keyring
-rw-r--r--. 1 root root 428 Mar 13 13:56 ceph_central.conf
-rw-------. 1 root root 205 Mar 15 18:45 ceph.client.openstack.keyring
-rw-r--r--. 1 root root 362 Mar 15 18:45 ceph.conf
[root@dcn02-compute-0 ceph]#
ceph.conf and ceph.client.openstack.keyring contain the fsid of the respective clusters in both dcn01 and dcn02. In the above CLI output, ceph.conf and ceph.client.openstack.keyring are the files used to access the dcn02 ceph cluster, and the ceph_central* files are used for accessing the central ceph cluster.
glance multistore config:

[dcn02]
rbd_store_ceph_conf=/etc/ceph/ceph.conf
rbd_store_user=openstack
rbd_store_pool=images
rbd_thin_provisioning=False
store_description=dcn02 rbd glance store
[ceph_central]
rbd_store_ceph_conf=/etc/ceph/ceph_central.conf
rbd_store_user=openstack
rbd_store_pool=images
rbd_thin_provisioning=False
store_description=Default glance store backend.
With regards, Swogat Pradhan
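To double-check that each glance store really points at the intended cluster after the renaming, the fsid in each conf file can be compared on the host and inside the container (a sketch; glance_api is an assumed container name):

$ sudo grep fsid /var/lib/tripleo-config/ceph/ceph.conf /var/lib/tripleo-config/ceph/ceph_central.conf
$ sudo podman exec glance_api grep fsid /etc/ceph/ceph.conf /etc/ceph/ceph_central.conf
# ceph.conf should carry the dcn02 fsid and ceph_central.conf the central cluster's fsid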
On Tue, Mar 21, 2023 at 5:52 PM John Fulton <johfulto@redhat.com> wrote:
> On Tue, Mar 21, 2023 at 8:03 AM Swogat Pradhan > <swogatpradhan22@gmail.com> wrote: > > > > Hi, > > Seems like cinder is not using the local ceph. > > That explains the issue. It's a misconfiguration. > > I hope this is not a production system since the mailing list now has > the cinder.conf which contains passwords. > > The section that looks like this: > > [tripleo_ceph] > volume_backend_name=tripleo_ceph > volume_driver=cinder.volume.drivers.rbd.RBDDriver > rbd_ceph_conf=/etc/ceph/ceph.conf > rbd_user=openstack > rbd_pool=volumes > rbd_flatten_volume_from_snapshot=False > rbd_secret_uuid=<redacted> > report_discard_supported=True > > Should be updated to refer to the local DCN ceph cluster and not the > central one. Use the ceph conf file for that cluster and ensure the > rbd_secret_uuid corresponds to that one. > > TripleO’s convention is to set the rbd_secret_uuid to the FSID of the > Ceph cluster. The FSID should be in the ceph.conf file. The > tripleo_nova_libvirt role will use virsh secret-* commands so that > libvirt can retrieve the cephx secret using the FSID as a key. This > can be confirmed with `podman exec nova_virtsecretd virsh > secret-get-value $FSID`. > > The documentation describes how to configure the central and DCN > sites > correctly but an error seems to have occurred while you were > following > it. > > > https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features... > > John > > > > > Ceph Output: > > [ceph: root@dcn02-ceph-all-0 /]# rbd -p images ls -l > > NAME SIZE PARENT FMT > PROT LOCK > > 2abfafaa-eff4-4c2e-a538-dc2e1249ab65 8 MiB 2 > excl > > 55f40c8a-8f79-48c5-a52a-9b679b762f19 16 MiB 2 > > 55f40c8a-8f79-48c5-a52a-9b679b762f19@snap 16 MiB 2 > yes > > 59f6a9cd-721c-45b5-a15f-fd021b08160d 321 MiB 2 > > 59f6a9cd-721c-45b5-a15f-fd021b08160d@snap 321 MiB 2 > yes > > 5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0 386 MiB 2 > > 5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0@snap 386 MiB 2 > yes > > 9b27248e-a8cf-4f00-a039-d3e3066cd26a 15 GiB 2 > > 9b27248e-a8cf-4f00-a039-d3e3066cd26a@snap 15 GiB 2 > yes > > b7356adc-bb47-4c05-968b-6d3c9ca0079b 15 GiB 2 > > b7356adc-bb47-4c05-968b-6d3c9ca0079b@snap 15 GiB 2 > yes > > e77e78ad-d369-4a1d-b758-8113621269a3 15 GiB 2 > > e77e78ad-d369-4a1d-b758-8113621269a3@snap 15 GiB 2 > yes > > > > [ceph: root@dcn02-ceph-all-0 /]# rbd -p volumes ls -l > > NAME SIZE PARENT FMT > PROT LOCK > > volume-c644086f-d3cf-406d-b0f1-7691bde5981d 100 GiB 2 > > volume-f0969935-a742-4744-9375-80bf323e4d63 10 GiB 2 > > [ceph: root@dcn02-ceph-all-0 /]# > > > > Attached the cinder config. > > Please let me know how I can solve this issue. > > > > With regards, > > Swogat Pradhan > > > > On Tue, Mar 21, 2023 at 3:53 PM John Fulton <johfulto@redhat.com> > wrote: > >> > >> in my last message under the line "On a DCN site if you run a > command like this:" I suggested some steps you could try to confirm the > image is a COW from the local glance as well as how to look at your cinder > config. > >> > >> On Tue, Mar 21, 2023, 12:06 AM Swogat Pradhan < > swogatpradhan22@gmail.com> wrote: > >>> > >>> Update: > >>> I uploaded an image directly to the dcn02 store, and it takes > around 10,15 minutes to create a volume with image in dcn02. > >>> The image size is 389 MB. > >>> > >>> On Mon, Mar 20, 2023 at 10:26 PM Swogat Pradhan < > swogatpradhan22@gmail.com> wrote: > >>>> > >>>> Hi Jhon, > >>>> I checked in the ceph od dcn02, I can see the images created > after importing from the central site. 
> >>>> But launching an instance normally fails as it takes a long > time for the volume to get created. > >>>> > >>>> When launching an instance from volume the instance is getting > created properly without any errors. > >>>> > >>>> I tried to cache images in nova using > https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep... > but getting checksum failed error. > >>>> > >>>> With regards, > >>>> Swogat Pradhan > >>>> > >>>> On Thu, Mar 16, 2023 at 5:24 PM John Fulton < > johfulto@redhat.com> wrote: > >>>>> > >>>>> On Wed, Mar 15, 2023 at 8:05 PM Swogat Pradhan > >>>>> <swogatpradhan22@gmail.com> wrote: > >>>>> > > >>>>> > Update: After restarting the nova services on the controller > and running the deploy script on the edge site, I was able to launch the VM > from volume. > >>>>> > > >>>>> > Right now the instance creation is failing as the block > device creation is stuck in creating state, it is taking more than 10 mins > for the volume to be created, whereas the image has already been imported > to the edge glance. > >>>>> > >>>>> Try following this document and making the same observations > in your > >>>>> environment for AZs and their local ceph cluster. > >>>>> > >>>>> > https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features... > >>>>> > >>>>> On a DCN site if you run a command like this: > >>>>> > >>>>> $ sudo cephadm shell --config /etc/ceph/dcn0.conf --keyring > >>>>> /etc/ceph/dcn0.client.admin.keyring > >>>>> $ rbd --cluster dcn0 -p volumes ls -l > >>>>> NAME SIZE PARENT > >>>>> FMT PROT LOCK > >>>>> volume-28c6fc32-047b-4306-ad2d-de2be02716b7 8 GiB > >>>>> images/8083c7e7-32d8-4f7a-b1da-0ed7884f1076@snap 2 excl > >>>>> $ > >>>>> > >>>>> Then, you should see the parent of the volume is the image > which is on > >>>>> the same local ceph cluster. > >>>>> > >>>>> I wonder if something is misconfigured and thus you're > encountering > >>>>> the streaming behavior described here: > >>>>> > >>>>> Ideally all images should reside in the central Glance and be > copied > >>>>> to DCN sites before instances of those images are booted on > DCN sites. > >>>>> If an image is not copied to a DCN site before it is booted, > then the > >>>>> image will be streamed to the DCN site and then the image will > boot as > >>>>> an instance. This happens because Glance at the DCN site has > access to > >>>>> the images store at the Central ceph cluster. Though the > booting of > >>>>> the image will take time because it has not been copied in > advance, > >>>>> this is still preferable to failing to boot the image. > >>>>> > >>>>> You can also exec into the cinder container at the DCN site and > >>>>> confirm it's using it's local ceph cluster. > >>>>> > >>>>> John > >>>>> > >>>>> > > >>>>> > I will try and create a new fresh image and test again then > update. > >>>>> > > >>>>> > With regards, > >>>>> > Swogat Pradhan > >>>>> > > >>>>> > On Wed, Mar 15, 2023 at 11:13 PM Swogat Pradhan < > swogatpradhan22@gmail.com> wrote: > >>>>> >> > >>>>> >> Update: > >>>>> >> In the hypervisor list the compute node state is showing > down. > >>>>> >> > >>>>> >> > >>>>> >> On Wed, Mar 15, 2023 at 11:11 PM Swogat Pradhan < > swogatpradhan22@gmail.com> wrote: > >>>>> >>> > >>>>> >>> Hi Brendan, > >>>>> >>> Now i have deployed another site where i have used 2 linux > bonds network template for both 3 compute nodes and 3 ceph nodes. > >>>>> >>> The bonding options is set to mode=802.3ad (lacp=active). 
> >>>>> >>> I used a cirros image to launch instance but the instance > timed out so i waited for the volume to be created. > >>>>> >>> Once the volume was created i tried launching the instance > from the volume and still the instance is stuck in spawning state. > >>>>> >>> > >>>>> >>> Here is the nova-compute log: > >>>>> >>> > >>>>> >>> 2023-03-15 17:35:47.739 185437 INFO oslo.privsep.daemon > [-] privsep daemon starting > >>>>> >>> 2023-03-15 17:35:47.744 185437 INFO oslo.privsep.daemon > [-] privsep process running with uid/gid: 0/0 > >>>>> >>> 2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon > [-] privsep process running with capabilities (eff/prm/inh): > CAP_SYS_ADMIN/CAP_SYS_ADMIN/none > >>>>> >>> 2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon > [-] privsep daemon running as pid 185437 > >>>>> >>> 2023-03-15 17:35:47.974 8 WARNING > os_brick.initiator.connectors.nvmeof > [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db > 4160ce999a31485fa643aed0936dfef0 - default default] Process execution error > in _get_host_uuid: Unexpected error while running command. > >>>>> >>> Command: blkid overlay -s UUID -o value > >>>>> >>> Exit code: 2 > >>>>> >>> Stdout: '' > >>>>> >>> Stderr: '': > oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while > running command. > >>>>> >>> 2023-03-15 17:35:51.616 8 INFO nova.virt.libvirt.driver > [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db > 4160ce999a31485fa643aed0936dfef0 - default default] [instance: > 450b749c-a10a-4308-80a9-3b8020fee758] Creating image > >>>>> >>> > >>>>> >>> It is stuck in creating image, do i need to run the > template mentioned here ?: > https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep... > >>>>> >>> > >>>>> >>> The volume is already created and i do not understand why > the instance is stuck in spawning state. > >>>>> >>> > >>>>> >>> With regards, > >>>>> >>> Swogat Pradhan > >>>>> >>> > >>>>> >>> > >>>>> >>> On Sun, Mar 5, 2023 at 4:02 PM Brendan Shephard < > bshephar@redhat.com> wrote: > >>>>> >>>> > >>>>> >>>> Does your environment use different network interfaces > for each of the networks? Or does it have a bond with everything on it? > >>>>> >>>> > >>>>> >>>> One issue I have seen before is that when launching > instances, there is a lot of network traffic between nodes as the > hypervisor needs to download the image from Glance. Along with various > other services sending normal network traffic, it can be enough to cause > issues if everything is running over a single 1Gbe interface. > >>>>> >>>> > >>>>> >>>> I have seen the same situation in fact when using a > single active/backup bond on 1Gbe nics. It’s worth checking the network > traffic while you try to spawn the instance to see if you’re dropping > packets. In the situation I described, there were dropped packets which > resulted in a loss of communication between nova_compute and RMQ, so the > node appeared offline. You should also confirm that nova_compute is being > disconnected in the nova_compute logs if you tail them on the Hypervisor > while spawning the instance. > >>>>> >>>> > >>>>> >>>> In my case, changing from active/backup to LACP helped. > So, based on that experience, from my perspective, is certainly sounds like > some kind of network issue. 
> >>>>> >>>> > >>>>> >>>> Regards, > >>>>> >>>> > >>>>> >>>> Brendan Shephard > >>>>> >>>> Senior Software Engineer > >>>>> >>>> Red Hat Australia > >>>>> >>>> > >>>>> >>>> > >>>>> >>>> > >>>>> >>>> On 5 Mar 2023, at 6:47 am, Eugen Block <eblock@nde.ag> > wrote: > >>>>> >>>> > >>>>> >>>> Hi, > >>>>> >>>> > >>>>> >>>> I tried to help someone with a similar issue some time > ago in this thread: > >>>>> >>>> > https://serverfault.com/questions/1116771/openstack-oslo-messaging-exception... > >>>>> >>>> > >>>>> >>>> But apparently a neutron reinstallation fixed it for that > user, not sure if that could apply here. But is it possible that your nova > and neutron versions are different between central and edge site? Have you > restarted nova and neutron services on the compute nodes after > installation? Have you debug logs of nova-conductor and maybe nova-compute? > Maybe they can help narrow down the issue. > >>>>> >>>> If there isn't any additional information in the debug > logs I probably would start "tearing down" rabbitmq. I didn't have to do > that in a production system yet so be careful. I can think of two routes: > >>>>> >>>> > >>>>> >>>> - Either remove queues, exchanges etc. while rabbit is > running, this will most likely impact client IO depending on your load. > Check out the rabbitmqctl commands. > >>>>> >>>> - Or stop the rabbitmq cluster, remove the mnesia tables > from all nodes and restart rabbitmq so the exchanges, queues etc. rebuild. > >>>>> >>>> > >>>>> >>>> I can imagine that the failed reply "survives" while > being replicated across the rabbit nodes. But I don't really know the > rabbit internals too well, so maybe someone else can chime in here and give > a better advice. > >>>>> >>>> > >>>>> >>>> Regards, > >>>>> >>>> Eugen > >>>>> >>>> > >>>>> >>>> Zitat von Swogat Pradhan <swogatpradhan22@gmail.com>: > >>>>> >>>> > >>>>> >>>> Hi, > >>>>> >>>> Can someone please help me out on this issue? > >>>>> >>>> > >>>>> >>>> With regards, > >>>>> >>>> Swogat Pradhan > >>>>> >>>> > >>>>> >>>> On Thu, Mar 2, 2023 at 1:24 PM Swogat Pradhan < > swogatpradhan22@gmail.com> > >>>>> >>>> wrote: > >>>>> >>>> > >>>>> >>>> Hi > >>>>> >>>> I don't see any major packet loss. > >>>>> >>>> It seems the problem is somewhere in rabbitmq maybe but > not due to packet > >>>>> >>>> loss. > >>>>> >>>> > >>>>> >>>> with regards, > >>>>> >>>> Swogat Pradhan > >>>>> >>>> > >>>>> >>>> On Wed, Mar 1, 2023 at 3:34 PM Swogat Pradhan < > swogatpradhan22@gmail.com> > >>>>> >>>> wrote: > >>>>> >>>> > >>>>> >>>> Hi, > >>>>> >>>> Yes the MTU is the same as the default '1500'. > >>>>> >>>> Generally I haven't seen any packet loss, but never > checked when > >>>>> >>>> launching the instance. > >>>>> >>>> I will check that and come back. > >>>>> >>>> But everytime i launch an instance the instance gets > stuck at spawning > >>>>> >>>> state and there the hypervisor becomes down, so not sure > if packet loss > >>>>> >>>> causes this. > >>>>> >>>> > >>>>> >>>> With regards, > >>>>> >>>> Swogat pradhan > >>>>> >>>> > >>>>> >>>> On Wed, Mar 1, 2023 at 3:30 PM Eugen Block <eblock@nde.ag> > wrote: > >>>>> >>>> > >>>>> >>>> One more thing coming to mind is MTU size. Are they > identical between > >>>>> >>>> central and edge site? Do you see packet loss through the > tunnel? > >>>>> >>>> > >>>>> >>>> Zitat von Swogat Pradhan <swogatpradhan22@gmail.com>: > >>>>> >>>> > >>>>> >>>> > Hi Eugen, > >>>>> >>>> > Request you to please add my email either on 'to' or > 'cc' as i am not > >>>>> >>>> > getting email's from you. 
> >>>>> >>>> > Coming to the issue: > >>>>> >>>> > > >>>>> >>>> > [root@overcloud-controller-no-ceph-3 /]# rabbitmqctl > list_policies -p > >>>>> >>>> / > >>>>> >>>> > Listing policies for vhost "/" ... > >>>>> >>>> > vhost name pattern apply-to definition > priority > >>>>> >>>> > / ha-all ^(?!amq\.).* queues > >>>>> >>>> > > >>>>> >>>> > {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0 > >>>>> >>>> > > >>>>> >>>> > I have the edge site compute nodes up, it only goes > down when i am > >>>>> >>>> trying > >>>>> >>>> > to launch an instance and the instance comes to a > spawning state and > >>>>> >>>> then > >>>>> >>>> > gets stuck. > >>>>> >>>> > > >>>>> >>>> > I have a tunnel setup between the central and the edge > sites. > >>>>> >>>> > > >>>>> >>>> > With regards, > >>>>> >>>> > Swogat Pradhan > >>>>> >>>> > > >>>>> >>>> > On Tue, Feb 28, 2023 at 9:11 PM Swogat Pradhan < > >>>>> >>>> swogatpradhan22@gmail.com> > >>>>> >>>> > wrote: > >>>>> >>>> > > >>>>> >>>> >> Hi Eugen, > >>>>> >>>> >> For some reason i am not getting your email to me > directly, i am > >>>>> >>>> checking > >>>>> >>>> >> the email digest and there i am able to find your > reply. > >>>>> >>>> >> Here is the log for download: > https://we.tl/t-L8FEkGZFSq > >>>>> >>>> >> Yes, these logs are from the time when the issue > occurred. > >>>>> >>>> >> > >>>>> >>>> >> *Note: i am able to create vm's and perform other > activities in the > >>>>> >>>> >> central site, only facing this issue in the edge site.* > >>>>> >>>> >> > >>>>> >>>> >> With regards, > >>>>> >>>> >> Swogat Pradhan > >>>>> >>>> >> > >>>>> >>>> >> On Mon, Feb 27, 2023 at 5:12 PM Swogat Pradhan < > >>>>> >>>> swogatpradhan22@gmail.com> > >>>>> >>>> >> wrote: > >>>>> >>>> >> > >>>>> >>>> >>> Hi Eugen, > >>>>> >>>> >>> Thanks for your response. > >>>>> >>>> >>> I have actually a 4 controller setup so here are the > details: > >>>>> >>>> >>> > >>>>> >>>> >>> *PCS Status:* > >>>>> >>>> >>> * Container bundle set: rabbitmq-bundle [ > >>>>> >>>> >>> > 172.25.201.68:8787/tripleomaster/openstack-rabbitmq:pcmklatest]: > >>>>> >>>> >>> * rabbitmq-bundle-0 > (ocf::heartbeat:rabbitmq-cluster): > >>>>> >>>> Started > >>>>> >>>> >>> overcloud-controller-no-ceph-3 > >>>>> >>>> >>> * rabbitmq-bundle-1 > (ocf::heartbeat:rabbitmq-cluster): > >>>>> >>>> Started > >>>>> >>>> >>> overcloud-controller-2 > >>>>> >>>> >>> * rabbitmq-bundle-2 > (ocf::heartbeat:rabbitmq-cluster): > >>>>> >>>> Started > >>>>> >>>> >>> overcloud-controller-1 > >>>>> >>>> >>> * rabbitmq-bundle-3 > (ocf::heartbeat:rabbitmq-cluster): > >>>>> >>>> Started > >>>>> >>>> >>> overcloud-controller-0 > >>>>> >>>> >>> > >>>>> >>>> >>> I have tried restarting the bundle multiple times but > the issue is > >>>>> >>>> still > >>>>> >>>> >>> present. > >>>>> >>>> >>> > >>>>> >>>> >>> *Cluster status:* > >>>>> >>>> >>> [root@overcloud-controller-0 /]# rabbitmqctl > cluster_status > >>>>> >>>> >>> Cluster status of node > >>>>> >>>> >>> > rabbit@overcloud-controller-0.internalapi.bdxworld.com ... 
> >>>>> >>>> >>> Basics > >>>>> >>>> >>> > >>>>> >>>> >>> Cluster name: > rabbit@overcloud-controller-no-ceph-3.bdxworld.com > >>>>> >>>> >>> > >>>>> >>>> >>> Disk Nodes > >>>>> >>>> >>> > >>>>> >>>> >>> > rabbit@overcloud-controller-0.internalapi.bdxworld.com > >>>>> >>>> >>> > rabbit@overcloud-controller-1.internalapi.bdxworld.com > >>>>> >>>> >>> > rabbit@overcloud-controller-2.internalapi.bdxworld.com > >>>>> >>>> >>> > rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com > >>>>> >>>> >>> > >>>>> >>>> >>> Running Nodes > >>>>> >>>> >>> > >>>>> >>>> >>> > rabbit@overcloud-controller-0.internalapi.bdxworld.com > >>>>> >>>> >>> > rabbit@overcloud-controller-1.internalapi.bdxworld.com > >>>>> >>>> >>> > rabbit@overcloud-controller-2.internalapi.bdxworld.com > >>>>> >>>> >>> > rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com > >>>>> >>>> >>> > >>>>> >>>> >>> Versions > >>>>> >>>> >>> > >>>>> >>>> >>> > rabbit@overcloud-controller-0.internalapi.bdxworld.com: RabbitMQ > >>>>> >>>> 3.8.3 > >>>>> >>>> >>> on Erlang 22.3.4.1 > >>>>> >>>> >>> > rabbit@overcloud-controller-1.internalapi.bdxworld.com: RabbitMQ > >>>>> >>>> 3.8.3 > >>>>> >>>> >>> on Erlang 22.3.4.1 > >>>>> >>>> >>> > rabbit@overcloud-controller-2.internalapi.bdxworld.com: RabbitMQ > >>>>> >>>> 3.8.3 > >>>>> >>>> >>> on Erlang 22.3.4.1 > >>>>> >>>> >>> > rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com: > >>>>> >>>> RabbitMQ > >>>>> >>>> >>> 3.8.3 on Erlang 22.3.4.1 > >>>>> >>>> >>> > >>>>> >>>> >>> Alarms > >>>>> >>>> >>> > >>>>> >>>> >>> (none) > >>>>> >>>> >>> > >>>>> >>>> >>> Network Partitions > >>>>> >>>> >>> > >>>>> >>>> >>> (none) > >>>>> >>>> >>> > >>>>> >>>> >>> Listeners > >>>>> >>>> >>> > >>>>> >>>> >>> Node: > rabbit@overcloud-controller-0.internalapi.bdxworld.com, > >>>>> >>>> interface: > >>>>> >>>> >>> [::], port: 25672, protocol: clustering, purpose: > inter-node and CLI > >>>>> >>>> tool > >>>>> >>>> >>> communication > >>>>> >>>> >>> Node: > rabbit@overcloud-controller-0.internalapi.bdxworld.com, > >>>>> >>>> interface: > >>>>> >>>> >>> 172.25.201.212, port: 5672, protocol: amqp, purpose: > AMQP 0-9-1 > >>>>> >>>> >>> and AMQP 1.0 > >>>>> >>>> >>> Node: > rabbit@overcloud-controller-0.internalapi.bdxworld.com, > >>>>> >>>> interface: > >>>>> >>>> >>> [::], port: 15672, protocol: http, purpose: HTTP API > >>>>> >>>> >>> Node: > rabbit@overcloud-controller-1.internalapi.bdxworld.com, > >>>>> >>>> interface: > >>>>> >>>> >>> [::], port: 25672, protocol: clustering, purpose: > inter-node and CLI > >>>>> >>>> tool > >>>>> >>>> >>> communication > >>>>> >>>> >>> Node: > rabbit@overcloud-controller-1.internalapi.bdxworld.com, > >>>>> >>>> interface: > >>>>> >>>> >>> 172.25.201.205, port: 5672, protocol: amqp, purpose: > AMQP 0-9-1 > >>>>> >>>> >>> and AMQP 1.0 > >>>>> >>>> >>> Node: > rabbit@overcloud-controller-1.internalapi.bdxworld.com, > >>>>> >>>> interface: > >>>>> >>>> >>> [::], port: 15672, protocol: http, purpose: HTTP API > >>>>> >>>> >>> Node: > rabbit@overcloud-controller-2.internalapi.bdxworld.com, > >>>>> >>>> interface: > >>>>> >>>> >>> [::], port: 25672, protocol: clustering, purpose: > inter-node and CLI > >>>>> >>>> tool > >>>>> >>>> >>> communication > >>>>> >>>> >>> Node: > rabbit@overcloud-controller-2.internalapi.bdxworld.com, > >>>>> >>>> interface: > >>>>> >>>> >>> 172.25.201.201, port: 5672, protocol: amqp, purpose: > AMQP 0-9-1 > >>>>> >>>> >>> and AMQP 1.0 > >>>>> >>>> >>> Node: > rabbit@overcloud-controller-2.internalapi.bdxworld.com, > >>>>> >>>> 
interface: > >>>>> >>>> >>> [::], port: 15672, protocol: http, purpose: HTTP API > >>>>> >>>> >>> Node: > rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com > >>>>> >>>> , > >>>>> >>>> >>> interface: [::], port: 25672, protocol: clustering, > purpose: > >>>>> >>>> inter-node and > >>>>> >>>> >>> CLI tool communication > >>>>> >>>> >>> Node: > rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com > >>>>> >>>> , > >>>>> >>>> >>> interface: 172.25.201.209, port: 5672, protocol: > amqp, purpose: AMQP > >>>>> >>>> 0-9-1 > >>>>> >>>> >>> and AMQP 1.0 > >>>>> >>>> >>> Node: > rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com > >>>>> >>>> , > >>>>> >>>> >>> interface: [::], port: 15672, protocol: http, > purpose: HTTP API > >>>>> >>>> >>> > >>>>> >>>> >>> Feature flags > >>>>> >>>> >>> > >>>>> >>>> >>> Flag: drop_unroutable_metric, state: enabled > >>>>> >>>> >>> Flag: empty_basic_get_metric, state: enabled > >>>>> >>>> >>> Flag: implicit_default_bindings, state: enabled > >>>>> >>>> >>> Flag: quorum_queue, state: enabled > >>>>> >>>> >>> Flag: virtual_host_metadata, state: enabled > >>>>> >>>> >>> > >>>>> >>>> >>> *Logs:* > >>>>> >>>> >>> *(Attached)* > >>>>> >>>> >>> > >>>>> >>>> >>> With regards, > >>>>> >>>> >>> Swogat Pradhan > >>>>> >>>> >>> > >>>>> >>>> >>> On Sun, Feb 26, 2023 at 2:34 PM Swogat Pradhan < > >>>>> >>>> swogatpradhan22@gmail.com> > >>>>> >>>> >>> wrote: > >>>>> >>>> >>> > >>>>> >>>> >>>> Hi, > >>>>> >>>> >>>> Please find the nova conductor as well as nova api > log. > >>>>> >>>> >>>> > >>>>> >>>> >>>> nova-conuctor: > >>>>> >>>> >>>> > >>>>> >>>> >>>> 2023-02-26 08:45:01.108 31 WARNING > >>>>> >>>> oslo_messaging._drivers.amqpdriver > >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] > >>>>> >>>> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't > exist, drop reply to > >>>>> >>>> >>>> 16152921c1eb45c2b1f562087140168b > >>>>> >>>> >>>> 2023-02-26 08:45:02.144 26 WARNING > >>>>> >>>> oslo_messaging._drivers.amqpdriver > >>>>> >>>> >>>> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] > >>>>> >>>> >>>> reply_276049ec36a84486a8a406911d9802f4 doesn't > exist, drop reply to > >>>>> >>>> >>>> 83dbe5f567a940b698acfe986f6194fa > >>>>> >>>> >>>> 2023-02-26 08:45:02.314 32 WARNING > >>>>> >>>> oslo_messaging._drivers.amqpdriver > >>>>> >>>> >>>> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] > >>>>> >>>> >>>> reply_276049ec36a84486a8a406911d9802f4 doesn't > exist, drop reply to > >>>>> >>>> >>>> f3bfd7f65bd542b18d84cea3033abb43: > >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable > >>>>> >>>> >>>> 2023-02-26 08:45:02.316 32 ERROR > oslo_messaging._drivers.amqpdriver > >>>>> >>>> >>>> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] > The reply > >>>>> >>>> >>>> f3bfd7f65bd542b18d84cea3033abb43 failed to send > after 60 seconds > >>>>> >>>> due to a > >>>>> >>>> >>>> missing queue > (reply_276049ec36a84486a8a406911d9802f4). 
> >>>>> >>>> Abandoning...: > >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable > >>>>> >>>> >>>> 2023-02-26 08:48:01.282 35 WARNING > >>>>> >>>> oslo_messaging._drivers.amqpdriver > >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] > >>>>> >>>> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't > exist, drop reply to > >>>>> >>>> >>>> d4b9180f91a94f9a82c3c9c4b7595566: > >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable > >>>>> >>>> >>>> 2023-02-26 08:48:01.284 35 ERROR > oslo_messaging._drivers.amqpdriver > >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] > The reply > >>>>> >>>> >>>> d4b9180f91a94f9a82c3c9c4b7595566 failed to send > after 60 seconds > >>>>> >>>> due to a > >>>>> >>>> >>>> missing queue > (reply_349bcb075f8c49329435a0f884b33066). > >>>>> >>>> Abandoning...: > >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable > >>>>> >>>> >>>> 2023-02-26 08:49:01.303 33 WARNING > >>>>> >>>> oslo_messaging._drivers.amqpdriver > >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] > >>>>> >>>> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't > exist, drop reply to > >>>>> >>>> >>>> 897911a234a445d8a0d8af02ece40f6f: > >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable > >>>>> >>>> >>>> 2023-02-26 08:49:01.304 33 ERROR > oslo_messaging._drivers.amqpdriver > >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] > The reply > >>>>> >>>> >>>> 897911a234a445d8a0d8af02ece40f6f failed to send > after 60 seconds > >>>>> >>>> due to a > >>>>> >>>> >>>> missing queue > (reply_349bcb075f8c49329435a0f884b33066). > >>>>> >>>> Abandoning...: > >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable > >>>>> >>>> >>>> 2023-02-26 08:49:52.254 31 WARNING nova.cache_utils > >>>>> >>>> >>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 > >>>>> >>>> b240e3e89d99489284cd731e75f2a5db > >>>>> >>>> >>>> 4160ce999a31485fa643aed0936dfef0 - default default] > Cache enabled > >>>>> >>>> with > >>>>> >>>> >>>> backend dogpile.cache.null. > >>>>> >>>> >>>> 2023-02-26 08:50:01.264 27 WARNING > >>>>> >>>> oslo_messaging._drivers.amqpdriver > >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] > >>>>> >>>> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't > exist, drop reply to > >>>>> >>>> >>>> 8f723ceb10c3472db9a9f324861df2bb: > >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable > >>>>> >>>> >>>> 2023-02-26 08:50:01.266 27 ERROR > oslo_messaging._drivers.amqpdriver > >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] > The reply > >>>>> >>>> >>>> 8f723ceb10c3472db9a9f324861df2bb failed to send > after 60 seconds > >>>>> >>>> due to a > >>>>> >>>> >>>> missing queue > (reply_349bcb075f8c49329435a0f884b33066). > >>>>> >>>> Abandoning...: > >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable > >>>>> >>>> >>>> > >>>>> >>>> >>>> With regards, > >>>>> >>>> >>>> Swogat Pradhan > >>>>> >>>> >>>> > >>>>> >>>> >>>> On Sun, Feb 26, 2023 at 2:26 PM Swogat Pradhan < > >>>>> >>>> >>>> swogatpradhan22@gmail.com> wrote: > >>>>> >>>> >>>> > >>>>> >>>> >>>>> Hi, > >>>>> >>>> >>>>> I currently have 3 compute nodes on edge site1 > where i am trying to > >>>>> >>>> >>>>> launch vm's. > >>>>> >>>> >>>>> When the VM is in spawning state the node goes down > (openstack > >>>>> >>>> compute > >>>>> >>>> >>>>> service list), the node comes backup when i restart > the nova > >>>>> >>>> compute > >>>>> >>>> >>>>> service but then the launch of the vm fails. 
> >>>>> >>>> >>>>> > >>>>> >>>> >>>>> nova-compute.log > >>>>> >>>> >>>>> > >>>>> >>>> >>>>> 2023-02-26 08:15:51.808 7 INFO nova.compute.manager > >>>>> >>>> >>>>> [req-bc0f5f2e-53fc-4dae-b1da-82f1f972d617 - - - - > -] Running > >>>>> >>>> >>>>> instance usage > >>>>> >>>> >>>>> audit for host dcn01-hci-0.bdxworld.com from > 2023-02-26 07:00:00 > >>>>> >>>> to > >>>>> >>>> >>>>> 2023-02-26 08:00:00. 0 instances. > >>>>> >>>> >>>>> 2023-02-26 08:49:52.813 7 INFO nova.compute.claims > >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 > >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db > >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] > [instance: > >>>>> >>>> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Claim > successful on node > >>>>> >>>> >>>>> dcn01-hci-0.bdxworld.com > >>>>> >>>> >>>>> 2023-02-26 08:49:54.225 7 INFO > nova.virt.libvirt.driver > >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 > >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db > >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] > [instance: > >>>>> >>>> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Ignoring > supplied device > >>>>> >>>> name: > >>>>> >>>> >>>>> /dev/vda. Libvirt can't honour user-supplied dev > names > >>>>> >>>> >>>>> 2023-02-26 08:49:54.398 7 INFO > nova.virt.block_device > >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 > >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db > >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] > [instance: > >>>>> >>>> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Booting with > volume > >>>>> >>>> >>>>> c4bd7885-5973-4860-bbe6-7a2f726baeee at /dev/vda > >>>>> >>>> >>>>> 2023-02-26 08:49:55.216 7 WARNING nova.cache_utils > >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 > >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db > >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] > Cache enabled > >>>>> >>>> with > >>>>> >>>> >>>>> backend dogpile.cache.null. 
> >>>>> >>>> >>>>> 2023-02-26 08:49:55.283 7 INFO oslo.privsep.daemon > >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 > >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db > >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] > Running > >>>>> >>>> >>>>> privsep helper: > >>>>> >>>> >>>>> ['sudo', 'nova-rootwrap', '/etc/nova/rootwrap.conf', > >>>>> >>>> 'privsep-helper', > >>>>> >>>> >>>>> '--config-file', '/etc/nova/nova.conf', > '--config-file', > >>>>> >>>> >>>>> '/etc/nova/nova-compute.conf', '--privsep_context', > >>>>> >>>> >>>>> 'os_brick.privileged.default', > '--privsep_sock_path', > >>>>> >>>> >>>>> '/tmp/tmpin40tah6/privsep.sock'] > >>>>> >>>> >>>>> 2023-02-26 08:49:55.791 7 INFO oslo.privsep.daemon > >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 > >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db > >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] > Spawned new > >>>>> >>>> privsep > >>>>> >>>> >>>>> daemon via rootwrap > >>>>> >>>> >>>>> 2023-02-26 08:49:55.717 2647 INFO > oslo.privsep.daemon [-] privsep > >>>>> >>>> >>>>> daemon starting > >>>>> >>>> >>>>> 2023-02-26 08:49:55.722 2647 INFO > oslo.privsep.daemon [-] privsep > >>>>> >>>> >>>>> process running with uid/gid: 0/0 > >>>>> >>>> >>>>> 2023-02-26 08:49:55.726 2647 INFO > oslo.privsep.daemon [-] privsep > >>>>> >>>> >>>>> process running with capabilities (eff/prm/inh): > >>>>> >>>> >>>>> CAP_SYS_ADMIN/CAP_SYS_ADMIN/none > >>>>> >>>> >>>>> 2023-02-26 08:49:55.726 2647 INFO > oslo.privsep.daemon [-] privsep > >>>>> >>>> >>>>> daemon running as pid 2647 > >>>>> >>>> >>>>> 2023-02-26 08:49:55.956 7 WARNING > >>>>> >>>> os_brick.initiator.connectors.nvmeof > >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 > >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db > >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] > Process > >>>>> >>>> >>>>> execution error > >>>>> >>>> >>>>> in _get_host_uuid: Unexpected error while running > command. > >>>>> >>>> >>>>> Command: blkid overlay -s UUID -o value > >>>>> >>>> >>>>> Exit code: 2 > >>>>> >>>> >>>>> Stdout: '' > >>>>> >>>> >>>>> Stderr: '': > oslo_concurrency.processutils.ProcessExecutionError: > >>>>> >>>> >>>>> Unexpected error while running command. > >>>>> >>>> >>>>> 2023-02-26 08:49:58.247 7 INFO > nova.virt.libvirt.driver > >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 > >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db > >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] > [instance: > >>>>> >>>> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Creating image > >>>>> >>>> >>>>> > >>>>> >>>> >>>>> Is there a way to solve this issue? > >>>>> >>>> >>>>> > >>>>> >>>> >>>>> > >>>>> >>>> >>>>> With regards, > >>>>> >>>> >>>>> > >>>>> >>>> >>>>> Swogat Pradhan > >>>>> >>>> >>>>> > >>>>> >>>> >>>> > >>>>> >>>> > >>>>> >>>> > >>>>> >>>> > >>>>> >>>> > >>>>> >>>> > >>>>> >>>> > >>>>> >>>> > >>>>> >>>> > >>>>> >>>> > >>>>> > >
Hi, Is this bind mount not required for the cinder_scheduler container? "/var/lib/tripleo-config/ceph:/var/lib/kolla/config_files/src-ceph:ro,rprivate,rbind" I do not see this particular bind mount on the cinder_scheduler containers on my controller nodes. With regards, Swogat Pradhan On Thu, Mar 23, 2023 at 2:46 AM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Cinder volume config:
[tripleo_ceph]
volume_backend_name=tripleo_ceph
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_user=openstack
rbd_pool=volumes
rbd_flatten_volume_from_snapshot=False
rbd_secret_uuid=a8d5f1f5-48e7-5ede-89ab-8aca59b6397b
report_discard_supported=True
rbd_ceph_conf=/etc/ceph/dcn02.conf
rbd_cluster_name=dcn02
Glance api config:
[dcn02]
rbd_store_ceph_conf=/etc/ceph/dcn02.conf
rbd_store_user=openstack
rbd_store_pool=images
rbd_thin_provisioning=False
store_description=dcn02 rbd glance store

[ceph]
rbd_store_ceph_conf=/etc/ceph/ceph.conf
rbd_store_user=openstack
rbd_store_pool=images
rbd_thin_provisioning=False
store_description=Default glance store backend.
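Regarding the cinder_scheduler bind mount question above: the scheduler only weighs backends from the capabilities reported over RPC and does not open RBD connections itself, so the src-ceph bind should normally only be needed by the cinder_volume (and cinder_backup) containers. A quick way to see what is actually mounted (a sketch; container names are the usual kolla ones and may differ):

$ sudo podman ps --format '{{.Names}}' | grep cinder
$ sudo podman inspect cinder_scheduler --format '{{range .Mounts}}{{.Source}} -> {{.Destination}}{{"\n"}}{{end}}' | grep -i ceph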
On Thu, Mar 23, 2023 at 2:29 AM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
I still have the same issue and I'm not sure what's left to try. All the pods are now in a healthy state, but when I try to create a volume from an image, log entries only show up in cinder-volume about 3 minutes after I hit the create volume button, and the volumes are stuck in the creating state for more than 20 minutes now.
Cinder logs: 2023-03-22 20:32:44.010 108 INFO cinder.rpc [req-0d2093a0-efbd-45a5-bd7d-cce25ddc200e b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Automatically selected cinder-volume RPC version 3.17 as minimum service version. 2023-03-22 20:34:59.166 108 INFO cinder.volume.flows.manager.create_volume [req-0d2093a0-efbd-45a5-bd7d-cce25ddc200e b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Volume 5743a879-090d-46db-bc7c-1c0b0669a112: being created as image with specification: {'status': 'creating', 'volume_name': 'volume-5743a879-090d-46db-bc7c-1c0b0669a112', 'volume_size': 2, 'image_id': 'acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b', 'image_location': ('rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'metadata': {'store': 'dcn02'}}]), 'image_meta': {'name': 'cirros', 'disk_format': 'qcow2', 'container_format': 'bare', 'visibility': 'public', 'size': 16338944, 'virtual_size': 117440512, 'status': 'active', 'checksum': '1d3062cd89af34e419f7100277f38b2b', 'protected': False, 'min_ram': 0, 'min_disk': 0, 'owner': '4160ce999a31485fa643aed0936dfef0', 'os_hidden': False, 'os_hash_algo': 'sha512', 'os_hash_value': '553d220ed58cfee7dafe003c446a9f197ab5edf8ffc09396c74187cf83873c877e7ae041cb80f3b91489acf687183adcd689b53b38e3ddd22e627e7f98a09c46', 'id': 'acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b', 'created_at': datetime.datetime(2023, 3, 22, 18, 50, 5, tzinfo=datetime.timezone.utc), 'updated_at': datetime.datetime(2023, 3, 22, 20, 3, 54, tzinfo=datetime.timezone.utc), 'locations': [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'metadata': {'store': 'dcn02'}}], 'direct_url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'tags': [], 'file': '/v2/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/file', 'stores': 'ceph,dcn02', 'properties': {'os_glance_failed_import': '', 'os_glance_importing_to_stores': '', 'owner_specified.openstack.md5': '', 'owner_specified.openstack.object': 'images/cirros', 'owner_specified.openstack.sha256': ''}}, 'image_service': <cinder.image.glance.GlanceImageService object at 0x7f8147973438>}
With regards, Swogat Pradhan
On Wed, Mar 22, 2023 at 9:19 PM Alan Bishop <abishop@redhat.com> wrote:
On Wed, Mar 22, 2023 at 8:38 AM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
Hi Alan, The systems are in the same LAN. In this case it seemed like the image was being pulled from the central site, which was caused by a misconfiguration in the ceph.conf file in the /var/lib/tripleo-config/ceph/ directory; that seems to have been resolved by the changes I made to fix it.
Right now the glance api podman container is running in an unhealthy state and the podman logs don't show any error whatsoever. When I issue the command netstat -nultp I do not see any entry for the glance port (9292) at the dcn site, which is why cinder is throwing an error stating:
2023-03-22 13:32:29.786 108 ERROR oslo_messaging.rpc.server cinder.exception.GlanceConnectionFailed: Connection to glance failed: Error finding address for http://172.25.228.253:9292/v2/images/736d8779-07cd-4510-bab2-adcb653cc538: Unable to establish connection to http://172.25.228.253:9292/v2/images/736d8779-07cd-4510-bab2-adcb653cc538: HTTPConnectionPool(host='172.25.228.253', port=9292): Max retries exceeded with url: /v2/images/736d8779-07cd-4510-bab2-adcb653cc538 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7682d2cd30>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))
Now I need to find out why the port is not listening even though the glance service is running, and I am not sure how to do that.
One other thing to investigate is whether your deployment includes this patch [1]. If it does, then bear in mind the glance-api service running at the edge site will be an "internal" (non public facing) instance that uses port 9293 instead of 9292. You should familiarize yourself with the release note [2].
[1] https://opendev.org/openstack/tripleo-heat-templates/commit/3605d45e417a77a1... [2] https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/walla...
Alan
With regards, Swogat Pradhan
On Wed, Mar 22, 2023 at 8:11 PM Alan Bishop <abishop@redhat.com> wrote:
On Wed, Mar 22, 2023 at 6:37 AM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
Update: Here is the log when creating a volume using a cirros image:
2023-03-22 11:04:38.449 109 INFO cinder.volume.flows.manager.create_volume [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Volume bf341343-6609-4b8c-b9e0-93e2a89c8c8f: being created as image with specification: {'status': 'creating', 'volume_name': 'volume-bf341343-6609-4b8c-b9e0-93e2a89c8c8f', 'volume_size': 4, 'image_id': '736d8779-07cd-4510-bab2-adcb653cc538', 'image_location': ('rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'dcn02'}}]), 'image_meta': {'name': 'cirros', 'disk_format': 'qcow2', 'container_format': 'bare', 'visibility': 'public', 'size': 16338944, 'virtual_size': 117440512, 'status': 'active', 'checksum': '1d3062cd89af34e419f7100277f38b2b', 'protected': False, 'min_ram': 0, 'min_disk': 0, 'owner': '4160ce999a31485fa643aed0936dfef0', 'os_hidden': False, 'os_hash_algo': 'sha512', 'os_hash_value': '553d220ed58cfee7dafe003c446a9f197ab5edf8ffc09396c74187cf83873c877e7ae041cb80f3b91489acf687183adcd689b53b38e3ddd22e627e7f98a09c46', 'id': '736d8779-07cd-4510-bab2-adcb653cc538', 'created_at': datetime.datetime(2023, 3, 22, 10, 44, 12, tzinfo=datetime.timezone.utc), 'updated_at': datetime.datetime(2023, 3, 22, 10, 54, 1, tzinfo=datetime.timezone.utc), 'locations': [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'dcn02'}}], 'direct_url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'tags': [], 'file': '/v2/images/736d8779-07cd-4510-bab2-adcb653cc538/file', 'stores': 'ceph,dcn02', 'properties': {'os_glance_failed_import': '', 'os_glance_importing_to_stores': '', 'owner_specified.openstack.md5': '', 'owner_specified.openstack.object': 'images/cirros', 'owner_specified.openstack.sha256': ''}}, 'image_service': <cinder.image.glance.GlanceImageService object at 0x7f449ded1198>} 2023-03-22 11:06:16.570 109 INFO cinder.image.image_utils [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Image download 15.58 MB at 0.16 MB/s
As Adam Savage would say, well there's your problem ^^ (Image download 15.58 MB at 0.16 MB/s). Downloading the image takes too long, and 0.16 MB/s suggests you have a network issue.
John Fulton previously stated your cinder-volume service at the edge site is not using the local ceph image store. Assuming you are deploying GlanceApiEdge service [1], then the cinder-volume service should be configured to use the local glance service [2]. You should check cinder's glance_api_servers to confirm it's the edge site's glance service.
[1] https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/envi... [2] https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/depl...
Alan
2023-03-22 11:07:54.023 109 WARNING py.warnings [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] /usr/lib/python3.6/site-packages/oslo_utils/imageutils.py:75: FutureWarning: The human format is deprecated and the format parameter will be removed. Use explicitly json instead in version 'xena' category=FutureWarning)
2023-03-22 11:11:12.161 109 WARNING py.warnings [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] /usr/lib/python3.6/site-packages/oslo_utils/imageutils.py:75: FutureWarning: The human format is deprecated and the format parameter will be removed. Use explicitly json instead in version 'xena' category=FutureWarning)
2023-03-22 11:11:12.163 109 INFO cinder.image.image_utils [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Converted 112.00 MB image at 112.00 MB/s 2023-03-22 11:11:14.998 109 INFO cinder.volume.flows.manager.create_volume [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Volume volume-bf341343-6609-4b8c-b9e0-93e2a89c8c8f (bf341343-6609-4b8c-b9e0-93e2a89c8c8f): created successfully 2023-03-22 11:11:15.195 109 INFO cinder.volume.manager [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Created volume successfully.
The image is present in the dcn02 store, but it still downloaded the image at 0.16 MB/s and then created the volume.
With regards, Swogat Pradhan
On Tue, Mar 21, 2023 at 6:10 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
> Hi Jhon, > This seems to be an issue. > When i deployed the dcn ceph in both dcn01 and dcn02 the --cluster > parameter was specified to the respective cluster names but the config > files were created in the name of ceph.conf and keyring was > ceph.client.openstack.keyring. > > Which created issues in glance as well as the naming convention of > the files didn't match the cluster names, so i had to manually rename the > central ceph conf file as such: > > [root@dcn02-compute-0 ~]# cd /var/lib/tripleo-config/ceph/ > [root@dcn02-compute-0 ceph]# ll > total 16 > -rw-------. 1 root root 257 Mar 13 13:56 > ceph_central.client.openstack.keyring > -rw-r--r--. 1 root root 428 Mar 13 13:56 ceph_central.conf > -rw-------. 1 root root 205 Mar 15 18:45 > ceph.client.openstack.keyring > -rw-r--r--. 1 root root 362 Mar 15 18:45 ceph.conf > [root@dcn02-compute-0 ceph]# > > ceph.conf and ceph.client.openstack.keyring contain the fsid of the > respective clusters in both dcn01 and dcn02. > In the above cli output, the ceph.conf and ceph.client... are the > files used to access dcn02 ceph cluster and ceph_central* files are used in > for accessing central ceph cluster. > > glance multistore config: > [dcn02] > rbd_store_ceph_conf=/etc/ceph/ceph.conf > rbd_store_user=openstack > rbd_store_pool=images > rbd_thin_provisioning=False > store_description=dcn02 rbd glance store > > [ceph_central] > rbd_store_ceph_conf=/etc/ceph/ceph_central.conf > rbd_store_user=openstack > rbd_store_pool=images > rbd_thin_provisioning=False > store_description=Default glance store backend. > > > With regards, > Swogat Pradhan > > On Tue, Mar 21, 2023 at 5:52 PM John Fulton <johfulto@redhat.com> > wrote: > >> On Tue, Mar 21, 2023 at 8:03 AM Swogat Pradhan >> <swogatpradhan22@gmail.com> wrote: >> > >> > Hi, >> > Seems like cinder is not using the local ceph. >> >> That explains the issue. It's a misconfiguration. >> >> I hope this is not a production system since the mailing list now >> has >> the cinder.conf which contains passwords. >> >> The section that looks like this: >> >> [tripleo_ceph] >> volume_backend_name=tripleo_ceph >> volume_driver=cinder.volume.drivers.rbd.RBDDriver >> rbd_ceph_conf=/etc/ceph/ceph.conf >> rbd_user=openstack >> rbd_pool=volumes >> rbd_flatten_volume_from_snapshot=False >> rbd_secret_uuid=<redacted> >> report_discard_supported=True >> >> Should be updated to refer to the local DCN ceph cluster and not the >> central one. Use the ceph conf file for that cluster and ensure the >> rbd_secret_uuid corresponds to that one. >> >> TripleO’s convention is to set the rbd_secret_uuid to the FSID of >> the >> Ceph cluster. The FSID should be in the ceph.conf file. The >> tripleo_nova_libvirt role will use virsh secret-* commands so that >> libvirt can retrieve the cephx secret using the FSID as a key. This >> can be confirmed with `podman exec nova_virtsecretd virsh >> secret-get-value $FSID`. >> >> The documentation describes how to configure the central and DCN >> sites >> correctly but an error seems to have occurred while you were >> following >> it. >> >> >> https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features... 
>> >> John >> >> > >> > Ceph Output: >> > [ceph: root@dcn02-ceph-all-0 /]# rbd -p images ls -l >> > NAME SIZE PARENT FMT >> PROT LOCK >> > 2abfafaa-eff4-4c2e-a538-dc2e1249ab65 8 MiB 2 >> excl >> > 55f40c8a-8f79-48c5-a52a-9b679b762f19 16 MiB 2 >> > 55f40c8a-8f79-48c5-a52a-9b679b762f19@snap 16 MiB 2 >> yes >> > 59f6a9cd-721c-45b5-a15f-fd021b08160d 321 MiB 2 >> > 59f6a9cd-721c-45b5-a15f-fd021b08160d@snap 321 MiB 2 >> yes >> > 5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0 386 MiB 2 >> > 5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0@snap 386 MiB 2 >> yes >> > 9b27248e-a8cf-4f00-a039-d3e3066cd26a 15 GiB 2 >> > 9b27248e-a8cf-4f00-a039-d3e3066cd26a@snap 15 GiB 2 >> yes >> > b7356adc-bb47-4c05-968b-6d3c9ca0079b 15 GiB 2 >> > b7356adc-bb47-4c05-968b-6d3c9ca0079b@snap 15 GiB 2 >> yes >> > e77e78ad-d369-4a1d-b758-8113621269a3 15 GiB 2 >> > e77e78ad-d369-4a1d-b758-8113621269a3@snap 15 GiB 2 >> yes >> > >> > [ceph: root@dcn02-ceph-all-0 /]# rbd -p volumes ls -l >> > NAME SIZE PARENT >> FMT PROT LOCK >> > volume-c644086f-d3cf-406d-b0f1-7691bde5981d 100 GiB 2 >> > volume-f0969935-a742-4744-9375-80bf323e4d63 10 GiB 2 >> > [ceph: root@dcn02-ceph-all-0 /]# >> > >> > Attached the cinder config. >> > Please let me know how I can solve this issue. >> > >> > With regards, >> > Swogat Pradhan >> > >> > On Tue, Mar 21, 2023 at 3:53 PM John Fulton <johfulto@redhat.com> >> wrote: >> >> >> >> in my last message under the line "On a DCN site if you run a >> command like this:" I suggested some steps you could try to confirm the >> image is a COW from the local glance as well as how to look at your cinder >> config. >> >> >> >> On Tue, Mar 21, 2023, 12:06 AM Swogat Pradhan < >> swogatpradhan22@gmail.com> wrote: >> >>> >> >>> Update: >> >>> I uploaded an image directly to the dcn02 store, and it takes >> around 10,15 minutes to create a volume with image in dcn02. >> >>> The image size is 389 MB. >> >>> >> >>> On Mon, Mar 20, 2023 at 10:26 PM Swogat Pradhan < >> swogatpradhan22@gmail.com> wrote: >> >>>> >> >>>> Hi Jhon, >> >>>> I checked in the ceph od dcn02, I can see the images created >> after importing from the central site. >> >>>> But launching an instance normally fails as it takes a long >> time for the volume to get created. >> >>>> >> >>>> When launching an instance from volume the instance is getting >> created properly without any errors. >> >>>> >> >>>> I tried to cache images in nova using >> https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep... >> but getting checksum failed error. >> >>>> >> >>>> With regards, >> >>>> Swogat Pradhan >> >>>> >> >>>> On Thu, Mar 16, 2023 at 5:24 PM John Fulton < >> johfulto@redhat.com> wrote: >> >>>>> >> >>>>> On Wed, Mar 15, 2023 at 8:05 PM Swogat Pradhan >> >>>>> <swogatpradhan22@gmail.com> wrote: >> >>>>> > >> >>>>> > Update: After restarting the nova services on the >> controller and running the deploy script on the edge site, I was able to >> launch the VM from volume. >> >>>>> > >> >>>>> > Right now the instance creation is failing as the block >> device creation is stuck in creating state, it is taking more than 10 mins >> for the volume to be created, whereas the image has already been imported >> to the edge glance. >> >>>>> >> >>>>> Try following this document and making the same observations >> in your >> >>>>> environment for AZs and their local ceph cluster. >> >>>>> >> >>>>> >> https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features... 
>> >>>>> >> >>>>> On a DCN site if you run a command like this: >> >>>>> >> >>>>> $ sudo cephadm shell --config /etc/ceph/dcn0.conf --keyring >> >>>>> /etc/ceph/dcn0.client.admin.keyring >> >>>>> $ rbd --cluster dcn0 -p volumes ls -l >> >>>>> NAME SIZE PARENT >> >>>>> FMT PROT LOCK >> >>>>> volume-28c6fc32-047b-4306-ad2d-de2be02716b7 8 GiB >> >>>>> images/8083c7e7-32d8-4f7a-b1da-0ed7884f1076@snap 2 >> excl >> >>>>> $ >> >>>>> >> >>>>> Then, you should see the parent of the volume is the image >> which is on >> >>>>> the same local ceph cluster. >> >>>>> >> >>>>> I wonder if something is misconfigured and thus you're >> encountering >> >>>>> the streaming behavior described here: >> >>>>> >> >>>>> Ideally all images should reside in the central Glance and be >> copied >> >>>>> to DCN sites before instances of those images are booted on >> DCN sites. >> >>>>> If an image is not copied to a DCN site before it is booted, >> then the >> >>>>> image will be streamed to the DCN site and then the image >> will boot as >> >>>>> an instance. This happens because Glance at the DCN site has >> access to >> >>>>> the images store at the Central ceph cluster. Though the >> booting of >> >>>>> the image will take time because it has not been copied in >> advance, >> >>>>> this is still preferable to failing to boot the image. >> >>>>> >> >>>>> You can also exec into the cinder container at the DCN site >> and >> >>>>> confirm it's using it's local ceph cluster. >> >>>>> >> >>>>> John >> >>>>> >> >>>>> > >> >>>>> > I will try and create a new fresh image and test again then >> update. >> >>>>> > >> >>>>> > With regards, >> >>>>> > Swogat Pradhan >> >>>>> > >> >>>>> > On Wed, Mar 15, 2023 at 11:13 PM Swogat Pradhan < >> swogatpradhan22@gmail.com> wrote: >> >>>>> >> >> >>>>> >> Update: >> >>>>> >> In the hypervisor list the compute node state is showing >> down. >> >>>>> >> >> >>>>> >> >> >>>>> >> On Wed, Mar 15, 2023 at 11:11 PM Swogat Pradhan < >> swogatpradhan22@gmail.com> wrote: >> >>>>> >>> >> >>>>> >>> Hi Brendan, >> >>>>> >>> Now i have deployed another site where i have used 2 >> linux bonds network template for both 3 compute nodes and 3 ceph nodes. >> >>>>> >>> The bonding options is set to mode=802.3ad (lacp=active). >> >>>>> >>> I used a cirros image to launch instance but the instance >> timed out so i waited for the volume to be created. >> >>>>> >>> Once the volume was created i tried launching the >> instance from the volume and still the instance is stuck in spawning state. >> >>>>> >>> >> >>>>> >>> Here is the nova-compute log: >> >>>>> >>> >> >>>>> >>> 2023-03-15 17:35:47.739 185437 INFO oslo.privsep.daemon >> [-] privsep daemon starting >> >>>>> >>> 2023-03-15 17:35:47.744 185437 INFO oslo.privsep.daemon >> [-] privsep process running with uid/gid: 0/0 >> >>>>> >>> 2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon >> [-] privsep process running with capabilities (eff/prm/inh): >> CAP_SYS_ADMIN/CAP_SYS_ADMIN/none >> >>>>> >>> 2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon >> [-] privsep daemon running as pid 185437 >> >>>>> >>> 2023-03-15 17:35:47.974 8 WARNING >> os_brick.initiator.connectors.nvmeof >> [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db >> 4160ce999a31485fa643aed0936dfef0 - default default] Process execution error >> in _get_host_uuid: Unexpected error while running command. 
>> >>>>> >>> Command: blkid overlay -s UUID -o value >> >>>>> >>> Exit code: 2 >> >>>>> >>> Stdout: '' >> >>>>> >>> Stderr: '': >> oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while >> running command. >> >>>>> >>> 2023-03-15 17:35:51.616 8 INFO nova.virt.libvirt.driver >> [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db >> 4160ce999a31485fa643aed0936dfef0 - default default] [instance: >> 450b749c-a10a-4308-80a9-3b8020fee758] Creating image >> >>>>> >>> >> >>>>> >>> It is stuck in creating image, do i need to run the >> template mentioned here ?: >> https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep... >> >>>>> >>> >> >>>>> >>> The volume is already created and i do not understand why >> the instance is stuck in spawning state. >> >>>>> >>> >> >>>>> >>> With regards, >> >>>>> >>> Swogat Pradhan >> >>>>> >>> >> >>>>> >>> >> >>>>> >>> On Sun, Mar 5, 2023 at 4:02 PM Brendan Shephard < >> bshephar@redhat.com> wrote: >> >>>>> >>>> >> >>>>> >>>> Does your environment use different network interfaces >> for each of the networks? Or does it have a bond with everything on it? >> >>>>> >>>> >> >>>>> >>>> One issue I have seen before is that when launching >> instances, there is a lot of network traffic between nodes as the >> hypervisor needs to download the image from Glance. Along with various >> other services sending normal network traffic, it can be enough to cause >> issues if everything is running over a single 1Gbe interface. >> >>>>> >>>> >> >>>>> >>>> I have seen the same situation in fact when using a >> single active/backup bond on 1Gbe nics. It’s worth checking the network >> traffic while you try to spawn the instance to see if you’re dropping >> packets. In the situation I described, there were dropped packets which >> resulted in a loss of communication between nova_compute and RMQ, so the >> node appeared offline. You should also confirm that nova_compute is being >> disconnected in the nova_compute logs if you tail them on the Hypervisor >> while spawning the instance. >> >>>>> >>>> >> >>>>> >>>> In my case, changing from active/backup to LACP helped. >> So, based on that experience, from my perspective, is certainly sounds like >> some kind of network issue. >> >>>>> >>>> >> >>>>> >>>> Regards, >> >>>>> >>>> >> >>>>> >>>> Brendan Shephard >> >>>>> >>>> Senior Software Engineer >> >>>>> >>>> Red Hat Australia >> >>>>> >>>> >> >>>>> >>>> >> >>>>> >>>> >> >>>>> >>>> On 5 Mar 2023, at 6:47 am, Eugen Block <eblock@nde.ag> >> wrote: >> >>>>> >>>> >> >>>>> >>>> Hi, >> >>>>> >>>> >> >>>>> >>>> I tried to help someone with a similar issue some time >> ago in this thread: >> >>>>> >>>> >> https://serverfault.com/questions/1116771/openstack-oslo-messaging-exception... >> >>>>> >>>> >> >>>>> >>>> But apparently a neutron reinstallation fixed it for >> that user, not sure if that could apply here. But is it possible that your >> nova and neutron versions are different between central and edge site? Have >> you restarted nova and neutron services on the compute nodes after >> installation? Have you debug logs of nova-conductor and maybe nova-compute? >> Maybe they can help narrow down the issue. >> >>>>> >>>> If there isn't any additional information in the debug >> logs I probably would start "tearing down" rabbitmq. I didn't have to do >> that in a production system yet so be careful. I can think of two routes: >> >>>>> >>>> >> >>>>> >>>> - Either remove queues, exchanges etc. 
while rabbit is >> running, this will most likely impact client IO depending on your load. >> Check out the rabbitmqctl commands. >> >>>>> >>>> - Or stop the rabbitmq cluster, remove the mnesia tables >> from all nodes and restart rabbitmq so the exchanges, queues etc. rebuild. >> >>>>> >>>> >> >>>>> >>>> I can imagine that the failed reply "survives" while >> being replicated across the rabbit nodes. But I don't really know the >> rabbit internals too well, so maybe someone else can chime in here and give >> a better advice. >> >>>>> >>>> >> >>>>> >>>> Regards, >> >>>>> >>>> Eugen >> >>>>> >>>> >> >>>>> >>>> Zitat von Swogat Pradhan <swogatpradhan22@gmail.com>: >> >>>>> >>>> >> >>>>> >>>> Hi, >> >>>>> >>>> Can someone please help me out on this issue? >> >>>>> >>>> >> >>>>> >>>> With regards, >> >>>>> >>>> Swogat Pradhan >> >>>>> >>>> >> >>>>> >>>> On Thu, Mar 2, 2023 at 1:24 PM Swogat Pradhan < >> swogatpradhan22@gmail.com> >> >>>>> >>>> wrote: >> >>>>> >>>> >> >>>>> >>>> Hi >> >>>>> >>>> I don't see any major packet loss. >> >>>>> >>>> It seems the problem is somewhere in rabbitmq maybe but >> not due to packet >> >>>>> >>>> loss. >> >>>>> >>>> >> >>>>> >>>> with regards, >> >>>>> >>>> Swogat Pradhan >> >>>>> >>>> >> >>>>> >>>> On Wed, Mar 1, 2023 at 3:34 PM Swogat Pradhan < >> swogatpradhan22@gmail.com> >> >>>>> >>>> wrote: >> >>>>> >>>> >> >>>>> >>>> Hi, >> >>>>> >>>> Yes the MTU is the same as the default '1500'. >> >>>>> >>>> Generally I haven't seen any packet loss, but never >> checked when >> >>>>> >>>> launching the instance. >> >>>>> >>>> I will check that and come back. >> >>>>> >>>> But everytime i launch an instance the instance gets >> stuck at spawning >> >>>>> >>>> state and there the hypervisor becomes down, so not sure >> if packet loss >> >>>>> >>>> causes this. >> >>>>> >>>> >> >>>>> >>>> With regards, >> >>>>> >>>> Swogat pradhan >> >>>>> >>>> >> >>>>> >>>> On Wed, Mar 1, 2023 at 3:30 PM Eugen Block < >> eblock@nde.ag> wrote: >> >>>>> >>>> >> >>>>> >>>> One more thing coming to mind is MTU size. Are they >> identical between >> >>>>> >>>> central and edge site? Do you see packet loss through >> the tunnel? >> >>>>> >>>> >> >>>>> >>>> Zitat von Swogat Pradhan <swogatpradhan22@gmail.com>: >> >>>>> >>>> >> >>>>> >>>> > Hi Eugen, >> >>>>> >>>> > Request you to please add my email either on 'to' or >> 'cc' as i am not >> >>>>> >>>> > getting email's from you. >> >>>>> >>>> > Coming to the issue: >> >>>>> >>>> > >> >>>>> >>>> > [root@overcloud-controller-no-ceph-3 /]# rabbitmqctl >> list_policies -p >> >>>>> >>>> / >> >>>>> >>>> > Listing policies for vhost "/" ... >> >>>>> >>>> > vhost name pattern apply-to definition >> priority >> >>>>> >>>> > / ha-all ^(?!amq\.).* queues >> >>>>> >>>> > >> >>>>> >>>> >> {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0 >> >>>>> >>>> > >> >>>>> >>>> > I have the edge site compute nodes up, it only goes >> down when i am >> >>>>> >>>> trying >> >>>>> >>>> > to launch an instance and the instance comes to a >> spawning state and >> >>>>> >>>> then >> >>>>> >>>> > gets stuck. >> >>>>> >>>> > >> >>>>> >>>> > I have a tunnel setup between the central and the edge >> sites. 
>> >>>>> >>>> > >> >>>>> >>>> > With regards, >> >>>>> >>>> > Swogat Pradhan >> >>>>> >>>> > >> >>>>> >>>> > On Tue, Feb 28, 2023 at 9:11 PM Swogat Pradhan < >> >>>>> >>>> swogatpradhan22@gmail.com> >> >>>>> >>>> > wrote: >> >>>>> >>>> > >> >>>>> >>>> >> Hi Eugen, >> >>>>> >>>> >> For some reason i am not getting your email to me >> directly, i am >> >>>>> >>>> checking >> >>>>> >>>> >> the email digest and there i am able to find your >> reply. >> >>>>> >>>> >> Here is the log for download: >> https://we.tl/t-L8FEkGZFSq >> >>>>> >>>> >> Yes, these logs are from the time when the issue >> occurred. >> >>>>> >>>> >> >> >>>>> >>>> >> *Note: i am able to create vm's and perform other >> activities in the >> >>>>> >>>> >> central site, only facing this issue in the edge >> site.* >> >>>>> >>>> >> >> >>>>> >>>> >> With regards, >> >>>>> >>>> >> Swogat Pradhan >> >>>>> >>>> >> >> >>>>> >>>> >> On Mon, Feb 27, 2023 at 5:12 PM Swogat Pradhan < >> >>>>> >>>> swogatpradhan22@gmail.com> >> >>>>> >>>> >> wrote: >> >>>>> >>>> >> >> >>>>> >>>> >>> Hi Eugen, >> >>>>> >>>> >>> Thanks for your response. >> >>>>> >>>> >>> I have actually a 4 controller setup so here are the >> details: >> >>>>> >>>> >>> >> >>>>> >>>> >>> *PCS Status:* >> >>>>> >>>> >>> * Container bundle set: rabbitmq-bundle [ >> >>>>> >>>> >>> >> 172.25.201.68:8787/tripleomaster/openstack-rabbitmq:pcmklatest]: >> >>>>> >>>> >>> * rabbitmq-bundle-0 >> (ocf::heartbeat:rabbitmq-cluster): >> >>>>> >>>> Started >> >>>>> >>>> >>> overcloud-controller-no-ceph-3 >> >>>>> >>>> >>> * rabbitmq-bundle-1 >> (ocf::heartbeat:rabbitmq-cluster): >> >>>>> >>>> Started >> >>>>> >>>> >>> overcloud-controller-2 >> >>>>> >>>> >>> * rabbitmq-bundle-2 >> (ocf::heartbeat:rabbitmq-cluster): >> >>>>> >>>> Started >> >>>>> >>>> >>> overcloud-controller-1 >> >>>>> >>>> >>> * rabbitmq-bundle-3 >> (ocf::heartbeat:rabbitmq-cluster): >> >>>>> >>>> Started >> >>>>> >>>> >>> overcloud-controller-0 >> >>>>> >>>> >>> >> >>>>> >>>> >>> I have tried restarting the bundle multiple times >> but the issue is >> >>>>> >>>> still >> >>>>> >>>> >>> present. >> >>>>> >>>> >>> >> >>>>> >>>> >>> *Cluster status:* >> >>>>> >>>> >>> [root@overcloud-controller-0 /]# rabbitmqctl >> cluster_status >> >>>>> >>>> >>> Cluster status of node >> >>>>> >>>> >>> >> rabbit@overcloud-controller-0.internalapi.bdxworld.com ... 
>> >>>>> >>>> >>> Basics >> >>>>> >>>> >>> >> >>>>> >>>> >>> Cluster name: >> rabbit@overcloud-controller-no-ceph-3.bdxworld.com >> >>>>> >>>> >>> >> >>>>> >>>> >>> Disk Nodes >> >>>>> >>>> >>> >> >>>>> >>>> >>> >> rabbit@overcloud-controller-0.internalapi.bdxworld.com >> >>>>> >>>> >>> >> rabbit@overcloud-controller-1.internalapi.bdxworld.com >> >>>>> >>>> >>> >> rabbit@overcloud-controller-2.internalapi.bdxworld.com >> >>>>> >>>> >>> >> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >> >>>>> >>>> >>> >> >>>>> >>>> >>> Running Nodes >> >>>>> >>>> >>> >> >>>>> >>>> >>> >> rabbit@overcloud-controller-0.internalapi.bdxworld.com >> >>>>> >>>> >>> >> rabbit@overcloud-controller-1.internalapi.bdxworld.com >> >>>>> >>>> >>> >> rabbit@overcloud-controller-2.internalapi.bdxworld.com >> >>>>> >>>> >>> >> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >> >>>>> >>>> >>> >> >>>>> >>>> >>> Versions >> >>>>> >>>> >>> >> >>>>> >>>> >>> >> rabbit@overcloud-controller-0.internalapi.bdxworld.com: RabbitMQ >> >>>>> >>>> 3.8.3 >> >>>>> >>>> >>> on Erlang 22.3.4.1 >> >>>>> >>>> >>> >> rabbit@overcloud-controller-1.internalapi.bdxworld.com: RabbitMQ >> >>>>> >>>> 3.8.3 >> >>>>> >>>> >>> on Erlang 22.3.4.1 >> >>>>> >>>> >>> >> rabbit@overcloud-controller-2.internalapi.bdxworld.com: RabbitMQ >> >>>>> >>>> 3.8.3 >> >>>>> >>>> >>> on Erlang 22.3.4.1 >> >>>>> >>>> >>> >> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com: >> >>>>> >>>> RabbitMQ >> >>>>> >>>> >>> 3.8.3 on Erlang 22.3.4.1 >> >>>>> >>>> >>> >> >>>>> >>>> >>> Alarms >> >>>>> >>>> >>> >> >>>>> >>>> >>> (none) >> >>>>> >>>> >>> >> >>>>> >>>> >>> Network Partitions >> >>>>> >>>> >>> >> >>>>> >>>> >>> (none) >> >>>>> >>>> >>> >> >>>>> >>>> >>> Listeners >> >>>>> >>>> >>> >> >>>>> >>>> >>> Node: >> rabbit@overcloud-controller-0.internalapi.bdxworld.com, >> >>>>> >>>> interface: >> >>>>> >>>> >>> [::], port: 25672, protocol: clustering, purpose: >> inter-node and CLI >> >>>>> >>>> tool >> >>>>> >>>> >>> communication >> >>>>> >>>> >>> Node: >> rabbit@overcloud-controller-0.internalapi.bdxworld.com, >> >>>>> >>>> interface: >> >>>>> >>>> >>> 172.25.201.212, port: 5672, protocol: amqp, purpose: >> AMQP 0-9-1 >> >>>>> >>>> >>> and AMQP 1.0 >> >>>>> >>>> >>> Node: >> rabbit@overcloud-controller-0.internalapi.bdxworld.com, >> >>>>> >>>> interface: >> >>>>> >>>> >>> [::], port: 15672, protocol: http, purpose: HTTP API >> >>>>> >>>> >>> Node: >> rabbit@overcloud-controller-1.internalapi.bdxworld.com, >> >>>>> >>>> interface: >> >>>>> >>>> >>> [::], port: 25672, protocol: clustering, purpose: >> inter-node and CLI >> >>>>> >>>> tool >> >>>>> >>>> >>> communication >> >>>>> >>>> >>> Node: >> rabbit@overcloud-controller-1.internalapi.bdxworld.com, >> >>>>> >>>> interface: >> >>>>> >>>> >>> 172.25.201.205, port: 5672, protocol: amqp, purpose: >> AMQP 0-9-1 >> >>>>> >>>> >>> and AMQP 1.0 >> >>>>> >>>> >>> Node: >> rabbit@overcloud-controller-1.internalapi.bdxworld.com, >> >>>>> >>>> interface: >> >>>>> >>>> >>> [::], port: 15672, protocol: http, purpose: HTTP API >> >>>>> >>>> >>> Node: >> rabbit@overcloud-controller-2.internalapi.bdxworld.com, >> >>>>> >>>> interface: >> >>>>> >>>> >>> [::], port: 25672, protocol: clustering, purpose: >> inter-node and CLI >> >>>>> >>>> tool >> >>>>> >>>> >>> communication >> >>>>> >>>> >>> Node: >> rabbit@overcloud-controller-2.internalapi.bdxworld.com, >> >>>>> >>>> interface: >> >>>>> >>>> >>> 172.25.201.201, port: 5672, protocol: amqp, purpose: >> AMQP 0-9-1 >> >>>>> >>>> >>> and AMQP 1.0 
>> >>>>> >>>> >>> Node: >> rabbit@overcloud-controller-2.internalapi.bdxworld.com, >> >>>>> >>>> interface: >> >>>>> >>>> >>> [::], port: 15672, protocol: http, purpose: HTTP API >> >>>>> >>>> >>> Node: >> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >> >>>>> >>>> , >> >>>>> >>>> >>> interface: [::], port: 25672, protocol: clustering, >> purpose: >> >>>>> >>>> inter-node and >> >>>>> >>>> >>> CLI tool communication >> >>>>> >>>> >>> Node: >> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >> >>>>> >>>> , >> >>>>> >>>> >>> interface: 172.25.201.209, port: 5672, protocol: >> amqp, purpose: AMQP >> >>>>> >>>> 0-9-1 >> >>>>> >>>> >>> and AMQP 1.0 >> >>>>> >>>> >>> Node: >> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >> >>>>> >>>> , >> >>>>> >>>> >>> interface: [::], port: 15672, protocol: http, >> purpose: HTTP API >> >>>>> >>>> >>> >> >>>>> >>>> >>> Feature flags >> >>>>> >>>> >>> >> >>>>> >>>> >>> Flag: drop_unroutable_metric, state: enabled >> >>>>> >>>> >>> Flag: empty_basic_get_metric, state: enabled >> >>>>> >>>> >>> Flag: implicit_default_bindings, state: enabled >> >>>>> >>>> >>> Flag: quorum_queue, state: enabled >> >>>>> >>>> >>> Flag: virtual_host_metadata, state: enabled >> >>>>> >>>> >>> >> >>>>> >>>> >>> *Logs:* >> >>>>> >>>> >>> *(Attached)* >> >>>>> >>>> >>> >> >>>>> >>>> >>> With regards, >> >>>>> >>>> >>> Swogat Pradhan >> >>>>> >>>> >>> >> >>>>> >>>> >>> On Sun, Feb 26, 2023 at 2:34 PM Swogat Pradhan < >> >>>>> >>>> swogatpradhan22@gmail.com> >> >>>>> >>>> >>> wrote: >> >>>>> >>>> >>> >> >>>>> >>>> >>>> Hi, >> >>>>> >>>> >>>> Please find the nova conductor as well as nova api >> log. >> >>>>> >>>> >>>> >> >>>>> >>>> >>>> nova-conuctor: >> >>>>> >>>> >>>> >> >>>>> >>>> >>>> 2023-02-26 08:45:01.108 31 WARNING >> >>>>> >>>> oslo_messaging._drivers.amqpdriver >> >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] >> >>>>> >>>> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't >> exist, drop reply to >> >>>>> >>>> >>>> 16152921c1eb45c2b1f562087140168b >> >>>>> >>>> >>>> 2023-02-26 08:45:02.144 26 WARNING >> >>>>> >>>> oslo_messaging._drivers.amqpdriver >> >>>>> >>>> >>>> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] >> >>>>> >>>> >>>> reply_276049ec36a84486a8a406911d9802f4 doesn't >> exist, drop reply to >> >>>>> >>>> >>>> 83dbe5f567a940b698acfe986f6194fa >> >>>>> >>>> >>>> 2023-02-26 08:45:02.314 32 WARNING >> >>>>> >>>> oslo_messaging._drivers.amqpdriver >> >>>>> >>>> >>>> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] >> >>>>> >>>> >>>> reply_276049ec36a84486a8a406911d9802f4 doesn't >> exist, drop reply to >> >>>>> >>>> >>>> f3bfd7f65bd542b18d84cea3033abb43: >> >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >> >>>>> >>>> >>>> 2023-02-26 08:45:02.316 32 ERROR >> oslo_messaging._drivers.amqpdriver >> >>>>> >>>> >>>> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - >> -] The reply >> >>>>> >>>> >>>> f3bfd7f65bd542b18d84cea3033abb43 failed to send >> after 60 seconds >> >>>>> >>>> due to a >> >>>>> >>>> >>>> missing queue >> (reply_276049ec36a84486a8a406911d9802f4). 
>> >>>>> >>>> Abandoning...: >> >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >> >>>>> >>>> >>>> 2023-02-26 08:48:01.282 35 WARNING >> >>>>> >>>> oslo_messaging._drivers.amqpdriver >> >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] >> >>>>> >>>> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't >> exist, drop reply to >> >>>>> >>>> >>>> d4b9180f91a94f9a82c3c9c4b7595566: >> >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >> >>>>> >>>> >>>> 2023-02-26 08:48:01.284 35 ERROR >> oslo_messaging._drivers.amqpdriver >> >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - >> -] The reply >> >>>>> >>>> >>>> d4b9180f91a94f9a82c3c9c4b7595566 failed to send >> after 60 seconds >> >>>>> >>>> due to a >> >>>>> >>>> >>>> missing queue >> (reply_349bcb075f8c49329435a0f884b33066). >> >>>>> >>>> Abandoning...: >> >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >> >>>>> >>>> >>>> 2023-02-26 08:49:01.303 33 WARNING >> >>>>> >>>> oslo_messaging._drivers.amqpdriver >> >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] >> >>>>> >>>> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't >> exist, drop reply to >> >>>>> >>>> >>>> 897911a234a445d8a0d8af02ece40f6f: >> >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >> >>>>> >>>> >>>> 2023-02-26 08:49:01.304 33 ERROR >> oslo_messaging._drivers.amqpdriver >> >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - >> -] The reply >> >>>>> >>>> >>>> 897911a234a445d8a0d8af02ece40f6f failed to send >> after 60 seconds >> >>>>> >>>> due to a >> >>>>> >>>> >>>> missing queue >> (reply_349bcb075f8c49329435a0f884b33066). >> >>>>> >>>> Abandoning...: >> >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >> >>>>> >>>> >>>> 2023-02-26 08:49:52.254 31 WARNING nova.cache_utils >> >>>>> >>>> >>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >> >>>>> >>>> b240e3e89d99489284cd731e75f2a5db >> >>>>> >>>> >>>> 4160ce999a31485fa643aed0936dfef0 - default default] >> Cache enabled >> >>>>> >>>> with >> >>>>> >>>> >>>> backend dogpile.cache.null. >> >>>>> >>>> >>>> 2023-02-26 08:50:01.264 27 WARNING >> >>>>> >>>> oslo_messaging._drivers.amqpdriver >> >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] >> >>>>> >>>> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't >> exist, drop reply to >> >>>>> >>>> >>>> 8f723ceb10c3472db9a9f324861df2bb: >> >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >> >>>>> >>>> >>>> 2023-02-26 08:50:01.266 27 ERROR >> oslo_messaging._drivers.amqpdriver >> >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - >> -] The reply >> >>>>> >>>> >>>> 8f723ceb10c3472db9a9f324861df2bb failed to send >> after 60 seconds >> >>>>> >>>> due to a >> >>>>> >>>> >>>> missing queue >> (reply_349bcb075f8c49329435a0f884b33066). >> >>>>> >>>> Abandoning...: >> >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >> >>>>> >>>> >>>> >> >>>>> >>>> >>>> With regards, >> >>>>> >>>> >>>> Swogat Pradhan >> >>>>> >>>> >>>> >> >>>>> >>>> >>>> On Sun, Feb 26, 2023 at 2:26 PM Swogat Pradhan < >> >>>>> >>>> >>>> swogatpradhan22@gmail.com> wrote: >> >>>>> >>>> >>>> >> >>>>> >>>> >>>>> Hi, >> >>>>> >>>> >>>>> I currently have 3 compute nodes on edge site1 >> where i am trying to >> >>>>> >>>> >>>>> launch vm's. 
>> >>>>> >>>> >>>>> When the VM is in spawning state the node goes >> down (openstack >> >>>>> >>>> compute >> >>>>> >>>> >>>>> service list), the node comes backup when i >> restart the nova >> >>>>> >>>> compute >> >>>>> >>>> >>>>> service but then the launch of the vm fails. >> >>>>> >>>> >>>>> >> >>>>> >>>> >>>>> nova-compute.log >> >>>>> >>>> >>>>> >> >>>>> >>>> >>>>> 2023-02-26 08:15:51.808 7 INFO nova.compute.manager >> >>>>> >>>> >>>>> [req-bc0f5f2e-53fc-4dae-b1da-82f1f972d617 - - - - >> -] Running >> >>>>> >>>> >>>>> instance usage >> >>>>> >>>> >>>>> audit for host dcn01-hci-0.bdxworld.com from >> 2023-02-26 07:00:00 >> >>>>> >>>> to >> >>>>> >>>> >>>>> 2023-02-26 08:00:00. 0 instances. >> >>>>> >>>> >>>>> 2023-02-26 08:49:52.813 7 INFO nova.compute.claims >> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default >> default] [instance: >> >>>>> >>>> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Claim >> successful on node >> >>>>> >>>> >>>>> dcn01-hci-0.bdxworld.com >> >>>>> >>>> >>>>> 2023-02-26 08:49:54.225 7 INFO >> nova.virt.libvirt.driver >> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default >> default] [instance: >> >>>>> >>>> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Ignoring >> supplied device >> >>>>> >>>> name: >> >>>>> >>>> >>>>> /dev/vda. Libvirt can't honour user-supplied dev >> names >> >>>>> >>>> >>>>> 2023-02-26 08:49:54.398 7 INFO >> nova.virt.block_device >> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default >> default] [instance: >> >>>>> >>>> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Booting with >> volume >> >>>>> >>>> >>>>> c4bd7885-5973-4860-bbe6-7a2f726baeee at /dev/vda >> >>>>> >>>> >>>>> 2023-02-26 08:49:55.216 7 WARNING nova.cache_utils >> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default >> default] Cache enabled >> >>>>> >>>> with >> >>>>> >>>> >>>>> backend dogpile.cache.null. 
>> >>>>> >>>> >>>>> 2023-02-26 08:49:55.283 7 INFO oslo.privsep.daemon >> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default >> default] Running >> >>>>> >>>> >>>>> privsep helper: >> >>>>> >>>> >>>>> ['sudo', 'nova-rootwrap', >> '/etc/nova/rootwrap.conf', >> >>>>> >>>> 'privsep-helper', >> >>>>> >>>> >>>>> '--config-file', '/etc/nova/nova.conf', >> '--config-file', >> >>>>> >>>> >>>>> '/etc/nova/nova-compute.conf', '--privsep_context', >> >>>>> >>>> >>>>> 'os_brick.privileged.default', >> '--privsep_sock_path', >> >>>>> >>>> >>>>> '/tmp/tmpin40tah6/privsep.sock'] >> >>>>> >>>> >>>>> 2023-02-26 08:49:55.791 7 INFO oslo.privsep.daemon >> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default >> default] Spawned new >> >>>>> >>>> privsep >> >>>>> >>>> >>>>> daemon via rootwrap >> >>>>> >>>> >>>>> 2023-02-26 08:49:55.717 2647 INFO >> oslo.privsep.daemon [-] privsep >> >>>>> >>>> >>>>> daemon starting >> >>>>> >>>> >>>>> 2023-02-26 08:49:55.722 2647 INFO >> oslo.privsep.daemon [-] privsep >> >>>>> >>>> >>>>> process running with uid/gid: 0/0 >> >>>>> >>>> >>>>> 2023-02-26 08:49:55.726 2647 INFO >> oslo.privsep.daemon [-] privsep >> >>>>> >>>> >>>>> process running with capabilities (eff/prm/inh): >> >>>>> >>>> >>>>> CAP_SYS_ADMIN/CAP_SYS_ADMIN/none >> >>>>> >>>> >>>>> 2023-02-26 08:49:55.726 2647 INFO >> oslo.privsep.daemon [-] privsep >> >>>>> >>>> >>>>> daemon running as pid 2647 >> >>>>> >>>> >>>>> 2023-02-26 08:49:55.956 7 WARNING >> >>>>> >>>> os_brick.initiator.connectors.nvmeof >> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default >> default] Process >> >>>>> >>>> >>>>> execution error >> >>>>> >>>> >>>>> in _get_host_uuid: Unexpected error while running >> command. >> >>>>> >>>> >>>>> Command: blkid overlay -s UUID -o value >> >>>>> >>>> >>>>> Exit code: 2 >> >>>>> >>>> >>>>> Stdout: '' >> >>>>> >>>> >>>>> Stderr: '': >> oslo_concurrency.processutils.ProcessExecutionError: >> >>>>> >>>> >>>>> Unexpected error while running command. >> >>>>> >>>> >>>>> 2023-02-26 08:49:58.247 7 INFO >> nova.virt.libvirt.driver >> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default >> default] [instance: >> >>>>> >>>> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Creating >> image >> >>>>> >>>> >>>>> >> >>>>> >>>> >>>>> Is there a way to solve this issue? >> >>>>> >>>> >>>>> >> >>>>> >>>> >>>>> >> >>>>> >>>> >>>>> With regards, >> >>>>> >>>> >>>>> >> >>>>> >>>> >>>>> Swogat Pradhan >> >>>>> >>>> >>>>> >> >>>>> >>>> >>>> >> >>>>> >>>> >> >>>>> >>>> >> >>>>> >>>> >> >>>>> >>>> >> >>>>> >>>> >> >>>>> >>>> >> >>>>> >>>> >> >>>>> >>>> >> >>>>> >>>> >> >>>>> >> >>
On Thu, Mar 23, 2023 at 5:20 AM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi, Is this bind not required for the cinder_scheduler container?
"/var/lib/tripleo-config/ceph:/var/lib/kolla/config_files/src-ceph:ro,rprivate,rbind", I do not see this particular bind on the cinder scheduler containers on my controller nodes.
That is correct, because the scheduler does not access the ceph cluster. Alan
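For reference, one way to confirm which containers actually carry the ceph bind is to compare their mounts with podman; this is only a sketch, and it assumes the standard TripleO container names cinder_volume and cinder_scheduler:

$ sudo podman inspect cinder_volume --format '{{range .Mounts}}{{.Source}} -> {{.Destination}}{{"\n"}}{{end}}' | grep ceph
$ sudo podman inspect cinder_scheduler --format '{{range .Mounts}}{{.Source}} -> {{.Destination}}{{"\n"}}{{end}}' | grep ceph

Only cinder_volume needs the /var/lib/tripleo-config/ceph -> /var/lib/kolla/config_files/src-ceph mount, which matches Alan's point above.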
With regards, Swogat Pradhan
On Thu, Mar 23, 2023 at 2:46 AM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Cinder volume config:
[tripleo_ceph]
volume_backend_name=tripleo_ceph
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_user=openstack
rbd_pool=volumes
rbd_flatten_volume_from_snapshot=False
rbd_secret_uuid=a8d5f1f5-48e7-5ede-89ab-8aca59b6397b
report_discard_supported=True
rbd_ceph_conf=/etc/ceph/dcn02.conf
rbd_cluster_name=dcn02
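Note that the rbd_secret_uuid above (a8d5f1f5-...) matches the dcn02 FSID that appears in the glance image locations later in the thread, which is the TripleO convention John described. A quick cross-check on the DCN node, reusing his virsh command (the container name and file paths are taken from this thread, so treat them as assumptions and adjust if your host-side copy under /var/lib/tripleo-config/ceph/ is named differently):

$ grep fsid /var/lib/tripleo-config/ceph/dcn02.conf
$ sudo podman exec nova_virtsecretd virsh secret-get-value a8d5f1f5-48e7-5ede-89ab-8aca59b6397b

The second command should print the cephx secret libvirt uses for the dcn02 cluster; if it errors out, the libvirt secret and the local FSID do not line up.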
Glance api config:
[dcn02]
rbd_store_ceph_conf=/etc/ceph/dcn02.conf
rbd_store_user=openstack
rbd_store_pool=images
rbd_thin_provisioning=False
store_description=dcn02 rbd glance store

[ceph]
rbd_store_ceph_conf=/etc/ceph/ceph.conf
rbd_store_user=openstack
rbd_store_pool=images
rbd_thin_provisioning=False
store_description=Default glance store backend.
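Before creating volumes it is also worth confirming the image really landed in the local dcn02 images pool, using the rbd check John posted earlier in the thread (the admin keyring name here is assumed by analogy with his dcn0 example):

$ sudo cephadm shell --config /etc/ceph/dcn02.conf --keyring /etc/ceph/dcn02.client.admin.keyring
$ rbd --cluster dcn02 -p images ls -l

The image UUID and its @snap should both be listed, as in the dcn02 output posted earlier in the thread.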
On Thu, Mar 23, 2023 at 2:29 AM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
I still have the same issue and I'm not sure what's left to try. All the pods are now in a healthy state. When I try to create a volume from an image, the first cinder-volume log entries only appear about 3 minutes after I hit the create volume button, and the volumes are then stuck in the creating state for more than 20 minutes now.
Cinder logs: 2023-03-22 20:32:44.010 108 INFO cinder.rpc [req-0d2093a0-efbd-45a5-bd7d-cce25ddc200e b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Automatically selected cinder-volume RPC version 3.17 as minimum service version. 2023-03-22 20:34:59.166 108 INFO cinder.volume.flows.manager.create_volume [req-0d2093a0-efbd-45a5-bd7d-cce25ddc200e b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Volume 5743a879-090d-46db-bc7c-1c0b0669a112: being created as image with specification: {'status': 'creating', 'volume_name': 'volume-5743a879-090d-46db-bc7c-1c0b0669a112', 'volume_size': 2, 'image_id': 'acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b', 'image_location': ('rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'metadata': {'store': 'dcn02'}}]), 'image_meta': {'name': 'cirros', 'disk_format': 'qcow2', 'container_format': 'bare', 'visibility': 'public', 'size': 16338944, 'virtual_size': 117440512, 'status': 'active', 'checksum': '1d3062cd89af34e419f7100277f38b2b', 'protected': False, 'min_ram': 0, 'min_disk': 0, 'owner': '4160ce999a31485fa643aed0936dfef0', 'os_hidden': False, 'os_hash_algo': 'sha512', 'os_hash_value': '553d220ed58cfee7dafe003c446a9f197ab5edf8ffc09396c74187cf83873c877e7ae041cb80f3b91489acf687183adcd689b53b38e3ddd22e627e7f98a09c46', 'id': 'acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b', 'created_at': datetime.datetime(2023, 3, 22, 18, 50, 5, tzinfo=datetime.timezone.utc), 'updated_at': datetime.datetime(2023, 3, 22, 20, 3, 54, tzinfo=datetime.timezone.utc), 'locations': [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'metadata': {'store': 'dcn02'}}], 'direct_url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'tags': [], 'file': '/v2/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/file', 'stores': 'ceph,dcn02', 'properties': {'os_glance_failed_import': '', 'os_glance_importing_to_stores': '', 'owner_specified.openstack.md5': '', 'owner_specified.openstack.object': 'images/cirros', 'owner_specified.openstack.sha256': ''}}, 'image_service': <cinder.image.glance.GlanceImageService object at 0x7f8147973438>}
With regards, Swogat Pradhan
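Once one of these volumes does reach the available state, the same rbd check John suggested can show whether it was COW-cloned from the local store or streamed and converted (the volume and image UUIDs are taken from the log above; the keyring path is again an assumption):

$ sudo cephadm shell --config /etc/ceph/dcn02.conf --keyring /etc/ceph/dcn02.client.admin.keyring
$ rbd --cluster dcn02 -p volumes info volume-5743a879-090d-46db-bc7c-1c0b0669a112 | grep parent

A parent of images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b@snap on the local cluster means the fast clone path was used; no parent line means the image was downloaded and converted instead, which matches the slow behaviour described here.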
On Wed, Mar 22, 2023 at 9:19 PM Alan Bishop <abishop@redhat.com> wrote:
On Wed, Mar 22, 2023 at 8:38 AM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
Hi Adam, The systems are in the same LAN. In this case it seemed like the image was getting pulled from the central site, which was caused by a misconfiguration in the ceph.conf file in the /var/lib/tripleo-config/ceph/ directory; that seems to have been resolved after the changes I made to fix it.
Right now the glance api podman container is running in an unhealthy state and the podman logs don't show any error whatsoever. When I issue the command netstat -nultp, I do not see any entry for the glance port (9292) on the dcn site, which is why cinder is throwing an error stating:
2023-03-22 13:32:29.786 108 ERROR oslo_messaging.rpc.server cinder.exception.GlanceConnectionFailed: Connection to glance failed: Error finding address for http://172.25.228.253:9292/v2/images/736d8779-07cd-4510-bab2-adcb653cc538: Unable to establish connection to http://172.25.228.253:9292/v2/images/736d8779-07cd-4510-bab2-adcb653cc538: HTTPConnectionPool(host='172.25.228.253', port=9292): Max retries exceeded with url: /v2/images/736d8779-07cd-4510-bab2-adcb653cc538 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7682d2cd30>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))
Now I need to find out why the port is not listed even though the glance service is running, and I am not sure how to find that out.
One other thing to investigate is whether your deployment includes this patch [1]. If it does, then bear in mind the glance-api service running at the edge site will be an "internal" (non public facing) instance that uses port 9293 instead of 9292. You should familiarize yourself with the release note [2].
[1] https://opendev.org/openstack/tripleo-heat-templates/commit/3605d45e417a77a1... [2] https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/walla...
Alan
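If that patch is present, the edge glance-api listens on 9293 rather than 9292, which would also explain the empty netstat output above. A quick check on the edge node could look like this (the IP is simply the one from the error message above, so adjust as needed):

$ sudo ss -ltnp | grep -E ':(9292|9293)'
$ curl -s http://172.25.228.253:9293/versions

A glance version document returned from port 9293 would indicate the internal instance is up and reachable.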
With regards, Swogat Pradhan
On Wed, Mar 22, 2023 at 8:11 PM Alan Bishop <abishop@redhat.com> wrote:
On Wed, Mar 22, 2023 at 6:37 AM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
> Update: > Here is the log when creating a volume using cirros image: > > 2023-03-22 11:04:38.449 109 INFO > cinder.volume.flows.manager.create_volume > [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db > 4160ce999a31485fa643aed0936dfef0 - - -] Volume > bf341343-6609-4b8c-b9e0-93e2a89c8c8f: being created as image with > specification: {'status': 'creating', 'volume_name': > 'volume-bf341343-6609-4b8c-b9e0-93e2a89c8c8f', 'volume_size': 4, > 'image_id': '736d8779-07cd-4510-bab2-adcb653cc538', 'image_location': > ('rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', > [{'url': > 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', > 'metadata': {'store': 'ceph'}}, {'url': > 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', > 'metadata': {'store': 'dcn02'}}]), 'image_meta': {'name': 'cirros', > 'disk_format': 'qcow2', 'container_format': 'bare', 'visibility': 'public', > 'size': 16338944, 'virtual_size': 117440512, 'status': 'active', > 'checksum': '1d3062cd89af34e419f7100277f38b2b', 'protected': False, > 'min_ram': 0, 'min_disk': 0, 'owner': '4160ce999a31485fa643aed0936dfef0', > 'os_hidden': False, 'os_hash_algo': 'sha512', 'os_hash_value': > '553d220ed58cfee7dafe003c446a9f197ab5edf8ffc09396c74187cf83873c877e7ae041cb80f3b91489acf687183adcd689b53b38e3ddd22e627e7f98a09c46', > 'id': '736d8779-07cd-4510-bab2-adcb653cc538', 'created_at': > datetime.datetime(2023, 3, 22, 10, 44, 12, tzinfo=datetime.timezone.utc), > 'updated_at': datetime.datetime(2023, 3, 22, 10, 54, 1, > tzinfo=datetime.timezone.utc), 'locations': [{'url': > 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', > 'metadata': {'store': 'ceph'}}, {'url': > 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', > 'metadata': {'store': 'dcn02'}}], 'direct_url': > 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', > 'tags': [], 'file': '/v2/images/736d8779-07cd-4510-bab2-adcb653cc538/file', > 'stores': 'ceph,dcn02', 'properties': {'os_glance_failed_import': '', > 'os_glance_importing_to_stores': '', 'owner_specified.openstack.md5': '', > 'owner_specified.openstack.object': 'images/cirros', > 'owner_specified.openstack.sha256': ''}}, 'image_service': > <cinder.image.glance.GlanceImageService object at 0x7f449ded1198>} > 2023-03-22 11:06:16.570 109 INFO cinder.image.image_utils > [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db > 4160ce999a31485fa643aed0936dfef0 - - -] Image download 15.58 MB at 0.16 MB/s >
As Adam Savage would say, well there's your problem ^^ (Image download 15.58 MB at 0.16 MB/s). Downloading the image takes too long, and 0.16 MB/s suggests you have a network issue.
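If the slow download is reproducible, one way to measure the raw throughput between the central and edge sites, assuming iperf3 can be installed on a node at each end, is:

$ iperf3 -s                     # on a central node
$ iperf3 -c <central-node-ip>   # on the edge node

A result anywhere near 0.16 MB/s here would point at the WAN link or tunnel rather than at glance or cinder.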
John Fulton previously stated your cinder-volume service at the edge site is not using the local ceph image store. Assuming you are deploying GlanceApiEdge service [1], then the cinder-volume service should be configured to use the local glance service [2]. You should check cinder's glance_api_servers to confirm it's the edge site's glance service.
[1] https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/envi... [2] https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/depl...
Alan
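To confirm what the edge cinder-volume is actually pointed at, something along these lines on the DCN node should work (standard TripleO container name and config path assumed):

$ sudo podman exec cinder_volume grep -E '^glance_api_servers' /etc/cinder/cinder.conf

It should reference the edge site's glance endpoint (port 9293 if the internal glance instance mentioned above is deployed), not the central VIP.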
> 2023-03-22 11:07:54.023 109 WARNING py.warnings > [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db > 4160ce999a31485fa643aed0936dfef0 - - -] > /usr/lib/python3.6/site-packages/oslo_utils/imageutils.py:75: > FutureWarning: The human format is deprecated and the format parameter will > be removed. Use explicitly json instead in version 'xena' > category=FutureWarning) > > 2023-03-22 11:11:12.161 109 WARNING py.warnings > [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db > 4160ce999a31485fa643aed0936dfef0 - - -] > /usr/lib/python3.6/site-packages/oslo_utils/imageutils.py:75: > FutureWarning: The human format is deprecated and the format parameter will > be removed. Use explicitly json instead in version 'xena' > category=FutureWarning) > > 2023-03-22 11:11:12.163 109 INFO cinder.image.image_utils > [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db > 4160ce999a31485fa643aed0936dfef0 - - -] Converted 112.00 MB image at 112.00 > MB/s > 2023-03-22 11:11:14.998 109 INFO > cinder.volume.flows.manager.create_volume > [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db > 4160ce999a31485fa643aed0936dfef0 - - -] Volume > volume-bf341343-6609-4b8c-b9e0-93e2a89c8c8f > (bf341343-6609-4b8c-b9e0-93e2a89c8c8f): created successfully > 2023-03-22 11:11:15.195 109 INFO cinder.volume.manager > [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db > 4160ce999a31485fa643aed0936dfef0 - - -] Created volume successfully. > > The image is present in dcn02 store but still it downloaded the > image in 0.16 MB/s and then created the volume. > > With regards, > Swogat Pradhan > > On Tue, Mar 21, 2023 at 6:10 PM Swogat Pradhan < > swogatpradhan22@gmail.com> wrote: > >> Hi Jhon, >> This seems to be an issue. >> When i deployed the dcn ceph in both dcn01 and dcn02 the --cluster >> parameter was specified to the respective cluster names but the config >> files were created in the name of ceph.conf and keyring was >> ceph.client.openstack.keyring. >> >> Which created issues in glance as well as the naming convention of >> the files didn't match the cluster names, so i had to manually rename the >> central ceph conf file as such: >> >> [root@dcn02-compute-0 ~]# cd /var/lib/tripleo-config/ceph/ >> [root@dcn02-compute-0 ceph]# ll >> total 16 >> -rw-------. 1 root root 257 Mar 13 13:56 >> ceph_central.client.openstack.keyring >> -rw-r--r--. 1 root root 428 Mar 13 13:56 ceph_central.conf >> -rw-------. 1 root root 205 Mar 15 18:45 >> ceph.client.openstack.keyring >> -rw-r--r--. 1 root root 362 Mar 15 18:45 ceph.conf >> [root@dcn02-compute-0 ceph]# >> >> ceph.conf and ceph.client.openstack.keyring contain the fsid of the >> respective clusters in both dcn01 and dcn02. >> In the above cli output, the ceph.conf and ceph.client... are the >> files used to access dcn02 ceph cluster and ceph_central* files are used in >> for accessing central ceph cluster. >> >> glance multistore config: >> [dcn02] >> rbd_store_ceph_conf=/etc/ceph/ceph.conf >> rbd_store_user=openstack >> rbd_store_pool=images >> rbd_thin_provisioning=False >> store_description=dcn02 rbd glance store >> >> [ceph_central] >> rbd_store_ceph_conf=/etc/ceph/ceph_central.conf >> rbd_store_user=openstack >> rbd_store_pool=images >> rbd_thin_provisioning=False >> store_description=Default glance store backend. 
>> >> >> With regards, >> Swogat Pradhan >> >> On Tue, Mar 21, 2023 at 5:52 PM John Fulton <johfulto@redhat.com> >> wrote: >> >>> On Tue, Mar 21, 2023 at 8:03 AM Swogat Pradhan >>> <swogatpradhan22@gmail.com> wrote: >>> > >>> > Hi, >>> > Seems like cinder is not using the local ceph. >>> >>> That explains the issue. It's a misconfiguration. >>> >>> I hope this is not a production system since the mailing list now >>> has >>> the cinder.conf which contains passwords. >>> >>> The section that looks like this: >>> >>> [tripleo_ceph] >>> volume_backend_name=tripleo_ceph >>> volume_driver=cinder.volume.drivers.rbd.RBDDriver >>> rbd_ceph_conf=/etc/ceph/ceph.conf >>> rbd_user=openstack >>> rbd_pool=volumes >>> rbd_flatten_volume_from_snapshot=False >>> rbd_secret_uuid=<redacted> >>> report_discard_supported=True >>> >>> Should be updated to refer to the local DCN ceph cluster and not >>> the >>> central one. Use the ceph conf file for that cluster and ensure the >>> rbd_secret_uuid corresponds to that one. >>> >>> TripleO’s convention is to set the rbd_secret_uuid to the FSID of >>> the >>> Ceph cluster. The FSID should be in the ceph.conf file. The >>> tripleo_nova_libvirt role will use virsh secret-* commands so that >>> libvirt can retrieve the cephx secret using the FSID as a key. This >>> can be confirmed with `podman exec nova_virtsecretd virsh >>> secret-get-value $FSID`. >>> >>> The documentation describes how to configure the central and DCN >>> sites >>> correctly but an error seems to have occurred while you were >>> following >>> it. >>> >>> >>> https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features... >>> >>> John >>> >>> > >>> > Ceph Output: >>> > [ceph: root@dcn02-ceph-all-0 /]# rbd -p images ls -l >>> > NAME SIZE PARENT FMT >>> PROT LOCK >>> > 2abfafaa-eff4-4c2e-a538-dc2e1249ab65 8 MiB 2 >>> excl >>> > 55f40c8a-8f79-48c5-a52a-9b679b762f19 16 MiB 2 >>> > 55f40c8a-8f79-48c5-a52a-9b679b762f19@snap 16 MiB >>> 2 yes >>> > 59f6a9cd-721c-45b5-a15f-fd021b08160d 321 MiB 2 >>> > 59f6a9cd-721c-45b5-a15f-fd021b08160d@snap 321 MiB >>> 2 yes >>> > 5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0 386 MiB 2 >>> > 5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0@snap 386 MiB >>> 2 yes >>> > 9b27248e-a8cf-4f00-a039-d3e3066cd26a 15 GiB 2 >>> > 9b27248e-a8cf-4f00-a039-d3e3066cd26a@snap 15 GiB >>> 2 yes >>> > b7356adc-bb47-4c05-968b-6d3c9ca0079b 15 GiB 2 >>> > b7356adc-bb47-4c05-968b-6d3c9ca0079b@snap 15 GiB >>> 2 yes >>> > e77e78ad-d369-4a1d-b758-8113621269a3 15 GiB 2 >>> > e77e78ad-d369-4a1d-b758-8113621269a3@snap 15 GiB >>> 2 yes >>> > >>> > [ceph: root@dcn02-ceph-all-0 /]# rbd -p volumes ls -l >>> > NAME SIZE PARENT >>> FMT PROT LOCK >>> > volume-c644086f-d3cf-406d-b0f1-7691bde5981d 100 GiB 2 >>> > volume-f0969935-a742-4744-9375-80bf323e4d63 10 GiB 2 >>> > [ceph: root@dcn02-ceph-all-0 /]# >>> > >>> > Attached the cinder config. >>> > Please let me know how I can solve this issue. >>> > >>> > With regards, >>> > Swogat Pradhan >>> > >>> > On Tue, Mar 21, 2023 at 3:53 PM John Fulton <johfulto@redhat.com> >>> wrote: >>> >> >>> >> in my last message under the line "On a DCN site if you run a >>> command like this:" I suggested some steps you could try to confirm the >>> image is a COW from the local glance as well as how to look at your cinder >>> config. 
>>> >> >>> >> On Tue, Mar 21, 2023, 12:06 AM Swogat Pradhan < >>> swogatpradhan22@gmail.com> wrote: >>> >>> >>> >>> Update: >>> >>> I uploaded an image directly to the dcn02 store, and it takes >>> around 10,15 minutes to create a volume with image in dcn02. >>> >>> The image size is 389 MB. >>> >>> >>> >>> On Mon, Mar 20, 2023 at 10:26 PM Swogat Pradhan < >>> swogatpradhan22@gmail.com> wrote: >>> >>>> >>> >>>> Hi Jhon, >>> >>>> I checked in the ceph od dcn02, I can see the images created >>> after importing from the central site. >>> >>>> But launching an instance normally fails as it takes a long >>> time for the volume to get created. >>> >>>> >>> >>>> When launching an instance from volume the instance is >>> getting created properly without any errors. >>> >>>> >>> >>>> I tried to cache images in nova using >>> https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep... >>> but getting checksum failed error. >>> >>>> >>> >>>> With regards, >>> >>>> Swogat Pradhan >>> >>>> >>> >>>> On Thu, Mar 16, 2023 at 5:24 PM John Fulton < >>> johfulto@redhat.com> wrote: >>> >>>>> >>> >>>>> On Wed, Mar 15, 2023 at 8:05 PM Swogat Pradhan >>> >>>>> <swogatpradhan22@gmail.com> wrote: >>> >>>>> > >>> >>>>> > Update: After restarting the nova services on the >>> controller and running the deploy script on the edge site, I was able to >>> launch the VM from volume. >>> >>>>> > >>> >>>>> > Right now the instance creation is failing as the block >>> device creation is stuck in creating state, it is taking more than 10 mins >>> for the volume to be created, whereas the image has already been imported >>> to the edge glance. >>> >>>>> >>> >>>>> Try following this document and making the same observations >>> in your >>> >>>>> environment for AZs and their local ceph cluster. >>> >>>>> >>> >>>>> >>> https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features... >>> >>>>> >>> >>>>> On a DCN site if you run a command like this: >>> >>>>> >>> >>>>> $ sudo cephadm shell --config /etc/ceph/dcn0.conf --keyring >>> >>>>> /etc/ceph/dcn0.client.admin.keyring >>> >>>>> $ rbd --cluster dcn0 -p volumes ls -l >>> >>>>> NAME SIZE PARENT >>> >>>>> FMT PROT LOCK >>> >>>>> volume-28c6fc32-047b-4306-ad2d-de2be02716b7 8 GiB >>> >>>>> images/8083c7e7-32d8-4f7a-b1da-0ed7884f1076@snap 2 >>> excl >>> >>>>> $ >>> >>>>> >>> >>>>> Then, you should see the parent of the volume is the image >>> which is on >>> >>>>> the same local ceph cluster. >>> >>>>> >>> >>>>> I wonder if something is misconfigured and thus you're >>> encountering >>> >>>>> the streaming behavior described here: >>> >>>>> >>> >>>>> Ideally all images should reside in the central Glance and >>> be copied >>> >>>>> to DCN sites before instances of those images are booted on >>> DCN sites. >>> >>>>> If an image is not copied to a DCN site before it is booted, >>> then the >>> >>>>> image will be streamed to the DCN site and then the image >>> will boot as >>> >>>>> an instance. This happens because Glance at the DCN site has >>> access to >>> >>>>> the images store at the Central ceph cluster. Though the >>> booting of >>> >>>>> the image will take time because it has not been copied in >>> advance, >>> >>>>> this is still preferable to failing to boot the image. >>> >>>>> >>> >>>>> You can also exec into the cinder container at the DCN site >>> and >>> >>>>> confirm it's using it's local ceph cluster. >>> >>>>> >>> >>>>> John >>> >>>>> >>> >>>>> > >>> >>>>> > I will try and create a new fresh image and test again >>> then update. 
>>> >>>>> > >>> >>>>> > With regards, >>> >>>>> > Swogat Pradhan >>> >>>>> > >>> >>>>> > On Wed, Mar 15, 2023 at 11:13 PM Swogat Pradhan < >>> swogatpradhan22@gmail.com> wrote: >>> >>>>> >> >>> >>>>> >> Update: >>> >>>>> >> In the hypervisor list the compute node state is showing >>> down. >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> On Wed, Mar 15, 2023 at 11:11 PM Swogat Pradhan < >>> swogatpradhan22@gmail.com> wrote: >>> >>>>> >>> >>> >>>>> >>> Hi Brendan, >>> >>>>> >>> Now i have deployed another site where i have used 2 >>> linux bonds network template for both 3 compute nodes and 3 ceph nodes. >>> >>>>> >>> The bonding options is set to mode=802.3ad (lacp=active). >>> >>>>> >>> I used a cirros image to launch instance but the >>> instance timed out so i waited for the volume to be created. >>> >>>>> >>> Once the volume was created i tried launching the >>> instance from the volume and still the instance is stuck in spawning state. >>> >>>>> >>> >>> >>>>> >>> Here is the nova-compute log: >>> >>>>> >>> >>> >>>>> >>> 2023-03-15 17:35:47.739 185437 INFO oslo.privsep.daemon >>> [-] privsep daemon starting >>> >>>>> >>> 2023-03-15 17:35:47.744 185437 INFO oslo.privsep.daemon >>> [-] privsep process running with uid/gid: 0/0 >>> >>>>> >>> 2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon >>> [-] privsep process running with capabilities (eff/prm/inh): >>> CAP_SYS_ADMIN/CAP_SYS_ADMIN/none >>> >>>>> >>> 2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon >>> [-] privsep daemon running as pid 185437 >>> >>>>> >>> 2023-03-15 17:35:47.974 8 WARNING >>> os_brick.initiator.connectors.nvmeof >>> [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db >>> 4160ce999a31485fa643aed0936dfef0 - default default] Process execution error >>> in _get_host_uuid: Unexpected error while running command. >>> >>>>> >>> Command: blkid overlay -s UUID -o value >>> >>>>> >>> Exit code: 2 >>> >>>>> >>> Stdout: '' >>> >>>>> >>> Stderr: '': >>> oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while >>> running command. >>> >>>>> >>> 2023-03-15 17:35:51.616 8 INFO nova.virt.libvirt.driver >>> [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db >>> 4160ce999a31485fa643aed0936dfef0 - default default] [instance: >>> 450b749c-a10a-4308-80a9-3b8020fee758] Creating image >>> >>>>> >>> >>> >>>>> >>> It is stuck in creating image, do i need to run the >>> template mentioned here ?: >>> https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep... >>> >>>>> >>> >>> >>>>> >>> The volume is already created and i do not understand >>> why the instance is stuck in spawning state. >>> >>>>> >>> >>> >>>>> >>> With regards, >>> >>>>> >>> Swogat Pradhan >>> >>>>> >>> >>> >>>>> >>> >>> >>>>> >>> On Sun, Mar 5, 2023 at 4:02 PM Brendan Shephard < >>> bshephar@redhat.com> wrote: >>> >>>>> >>>> >>> >>>>> >>>> Does your environment use different network interfaces >>> for each of the networks? Or does it have a bond with everything on it? >>> >>>>> >>>> >>> >>>>> >>>> One issue I have seen before is that when launching >>> instances, there is a lot of network traffic between nodes as the >>> hypervisor needs to download the image from Glance. Along with various >>> other services sending normal network traffic, it can be enough to cause >>> issues if everything is running over a single 1Gbe interface. >>> >>>>> >>>> >>> >>>>> >>>> I have seen the same situation in fact when using a >>> single active/backup bond on 1Gbe nics. 
Hi John,
Thank you for clarifying that.
Right now the cinder volume gets stuck in the *creating* state when an image is used as the volume source, but when creating an empty volume the volumes are created successfully without any errors.
We do see the volume creation request in cinder-volume.log:

2023-03-23 12:34:40.152 108 INFO cinder.volume.flows.manager.create_volume [req-18556796-a61c-4097-8fa8-b136ce9814f7 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Volume 872a2ae6-c75b-4fc0-8172-17a29d07a66c: being created as image with specification: {'status': 'creating', 'volume_name': 'volume-872a2ae6-c75b-4fc0-8172-17a29d07a66c', 'volume_size': 1, 'image_id': '131ed4e0-0474-45be-b74a-43b599a7d6c5', 'image_location': ('rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/snap', [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/snap', 'metadata': {'store': 'dcn02'}}]), 'image_meta': {'name': 'cirros', 'disk_format': 'qcow2', 'container_format': 'bare', 'visibility': 'public', 'size': 16338944, 'virtual_size': 117440512, 'status': 'active', 'checksum': '1d3062cd89af34e419f7100277f38b2b', 'protected': False, 'min_ram': 0, 'min_disk': 0, 'owner': '4160ce999a31485fa643aed0936dfef0', 'os_hidden': False, 'os_hash_algo': 'sha512', 'os_hash_value': '553d220ed58cfee7dafe003c446a9f197ab5edf8ffc09396c74187cf83873c877e7ae041cb80f3b91489acf687183adcd689b53b38e3ddd22e627e7f98a09c46', 'id': '131ed4e0-0474-45be-b74a-43b599a7d6c5', 'created_at': datetime.datetime(2023, 3, 23, 11, 41, 51, tzinfo=datetime.timezone.utc), 'updated_at': datetime.datetime(2023, 3, 23, 11, 46, 37, tzinfo=datetime.timezone.utc), 'locations': [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/snap', 'metadata': {'store': 'dcn02'}}], 'direct_url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/snap', 'tags': [], 'file': '/v2/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/file', 'stores': 'ceph,dcn02', 'properties': {'os_glance_failed_import': '', 'os_glance_importing_to_stores': '', 'owner_specified.openstack.md5': '', 'owner_specified.openstack.object': 'images/cirros', 'owner_specified.openstack.sha256': ''}}, 'image_service': <cinder.image.glance.GlanceImageService object at 0x7f98d869ed68>}

But there is nothing after that; the volume doesn't even time out, it just stays stuck in the creating state. Can you advise what the issue might be here?
All the containers are in a healthy state now.

With regards,
Swogat Pradhan

On Thu, Mar 23, 2023 at 6:06 PM Alan Bishop <abishop@redhat.com> wrote:
On Thu, Mar 23, 2023 at 5:20 AM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi, Is this bind not required for the cinder_scheduler container?
"/var/lib/tripleo-config/ceph:/var/lib/kolla/config_files/src-ceph:ro,rprivate,rbind", I do not see this particular bind on the cinder scheduler containers on my controller nodes.
That is correct, because the scheduler does not access the ceph cluster.
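For what it's worth, a quick way to confirm that is to compare the mounts of a cinder_volume container (wherever it runs) with a cinder_scheduler container on a controller. This is just a rough check and assumes those are the container names on your nodes:

$ sudo podman inspect cinder_volume | grep -c src-ceph      # non-zero: the ceph config is bind-mounted in
$ sudo podman inspect cinder_scheduler | grep -c src-ceph   # 0 is expected: the scheduler never talks to ceph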
Alan
With regards, Swogat Pradhan
On Thu, Mar 23, 2023 at 2:46 AM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Cinder volume config:
[tripleo_ceph] volume_backend_name=tripleo_ceph volume_driver=cinder.volume.drivers.rbd.RBDDriver rbd_user=openstack rbd_pool=volumes rbd_flatten_volume_from_snapshot=False rbd_secret_uuid=a8d5f1f5-48e7-5ede-89ab-8aca59b6397b report_discard_supported=True rbd_ceph_conf=/etc/ceph/dcn02.conf rbd_cluster_name=dcn02
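One sanity check worth doing against the section above (a sketch only; it assumes the dcn02 FSID is a8d5f1f5-48e7-5ede-89ab-8aca59b6397b, as the rbd:// URLs in the logs suggest, and uses the nova_virtsecretd container name mentioned later in this thread): rbd_secret_uuid should match the fsid in /etc/ceph/dcn02.conf, and libvirt on the edge compute/HCI nodes should hold a secret for that same UUID.

$ sudo grep fsid /etc/ceph/dcn02.conf
# on an edge compute/HCI node:
$ sudo podman exec nova_virtsecretd virsh secret-list
$ sudo podman exec nova_virtsecretd virsh secret-get-value a8d5f1f5-48e7-5ede-89ab-8aca59b6397b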
Glance api config:
[dcn02] rbd_store_ceph_conf=/etc/ceph/dcn02.conf rbd_store_user=openstack rbd_store_pool=images rbd_thin_provisioning=False store_description=dcn02 rbd glance store [ceph] rbd_store_ceph_conf=/etc/ceph/ceph.conf rbd_store_user=openstack rbd_store_pool=images rbd_thin_provisioning=False store_description=Default glance store backend.
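It may also be worth confirming that the image being used actually has a copy, with a protected snapshot, in the local dcn02 images pool, since that is what cinder clones from. A rough check, assuming an admin keyring is available at the path below (adjust to wherever your dcn02 admin keyring actually lives):

$ sudo cephadm shell --config /etc/ceph/dcn02.conf --keyring /etc/ceph/dcn02.client.admin.keyring
$ rbd -p images ls -l | grep 131ed4e0-0474-45be-b74a-43b599a7d6c5

The image should show up together with a 131ed4e0-...@snap entry marked as protected; if it only exists on the central cluster, the volume create falls back to streaming the image over the WAN.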
On Thu, Mar 23, 2023 at 2:29 AM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
I still have the same issue and I'm not sure what's left to try. All the pods are now in a healthy state. When I try to create a volume from an image, log entries only appear in cinder-volume about 3 minutes after I hit the create volume button, and the volumes are then stuck in the creating state for more than 20 minutes now.
Cinder logs: 2023-03-22 20:32:44.010 108 INFO cinder.rpc [req-0d2093a0-efbd-45a5-bd7d-cce25ddc200e b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Automatically selected cinder-volume RPC version 3.17 as minimum service version. 2023-03-22 20:34:59.166 108 INFO cinder.volume.flows.manager.create_volume [req-0d2093a0-efbd-45a5-bd7d-cce25ddc200e b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Volume 5743a879-090d-46db-bc7c-1c0b0669a112: being created as image with specification: {'status': 'creating', 'volume_name': 'volume-5743a879-090d-46db-bc7c-1c0b0669a112', 'volume_size': 2, 'image_id': 'acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b', 'image_location': ('rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'metadata': {'store': 'dcn02'}}]), 'image_meta': {'name': 'cirros', 'disk_format': 'qcow2', 'container_format': 'bare', 'visibility': 'public', 'size': 16338944, 'virtual_size': 117440512, 'status': 'active', 'checksum': '1d3062cd89af34e419f7100277f38b2b', 'protected': False, 'min_ram': 0, 'min_disk': 0, 'owner': '4160ce999a31485fa643aed0936dfef0', 'os_hidden': False, 'os_hash_algo': 'sha512', 'os_hash_value': '553d220ed58cfee7dafe003c446a9f197ab5edf8ffc09396c74187cf83873c877e7ae041cb80f3b91489acf687183adcd689b53b38e3ddd22e627e7f98a09c46', 'id': 'acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b', 'created_at': datetime.datetime(2023, 3, 22, 18, 50, 5, tzinfo=datetime.timezone.utc), 'updated_at': datetime.datetime(2023, 3, 22, 20, 3, 54, tzinfo=datetime.timezone.utc), 'locations': [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'metadata': {'store': 'dcn02'}}], 'direct_url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'tags': [], 'file': '/v2/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/file', 'stores': 'ceph,dcn02', 'properties': {'os_glance_failed_import': '', 'os_glance_importing_to_stores': '', 'owner_specified.openstack.md5': '', 'owner_specified.openstack.object': 'images/cirros', 'owner_specified.openstack.sha256': ''}}, 'image_service': <cinder.image.glance.GlanceImageService object at 0x7f8147973438>}
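One way to tell whether a create like the one above is doing a quick RBD clone from the local images pool or silently downloading and converting the image (a rough check; it assumes the default image_conversion_dir under /var/lib/cinder and the cinder_volume container name at the edge site, and the same keyring path assumption as before):

# while the volume is stuck in "creating": a growing temp file here means the image
# is being streamed and converted rather than cloned
$ sudo podman exec cinder_volume ls -lh /var/lib/cinder/conversion

# once/if a volume does appear, its PARENT should point at images/<image-id>@snap on the local dcn02 cluster
$ sudo cephadm shell --config /etc/ceph/dcn02.conf --keyring /etc/ceph/dcn02.client.admin.keyring
$ rbd -p volumes ls -l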
With regards, Swogat Pradhan
On Wed, Mar 22, 2023 at 9:19 PM Alan Bishop <abishop@redhat.com> wrote:
On Wed, Mar 22, 2023 at 8:38 AM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
Hi Adam, The systems are on the same LAN. In this case it seemed like the image was being pulled from the central site, which was caused by a misconfiguration in the ceph.conf file in the /var/lib/tripleo-config/ceph/ directory; that seems to have been resolved after the changes I made to fix it.
Right now the glance api podman container is running in an unhealthy state and the podman logs don't show any error whatsoever. When I issue the command netstat -nultp I do not see any entry for the glance port (9292) at the dcn site, which is why cinder is throwing an error stating:
2023-03-22 13:32:29.786 108 ERROR oslo_messaging.rpc.server cinder.exception.GlanceConnectionFailed: Connection to glance failed: Error finding address for http://172.25.228.253:9292/v2/images/736d8779-07cd-4510-bab2-adcb653cc538: Unable to establish connection to http://172.25.228.253:9292/v2/images/736d8779-07cd-4510-bab2-adcb653cc538: HTTPConnectionPool(host='172.25.228.253', port=9292): Max retries exceeded with url: /v2/images/736d8779-07cd-4510-bab2-adcb653cc538 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7682d2cd30>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))
Now I need to find out why the port is not listening even though the glance service is running, but I am not sure how to go about that.
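A few things that might help narrow that down from the edge node itself (a sketch only, assuming the container is named glance_api and reusing the 172.25.228.253 address from the error above):

$ sudo podman ps -a | grep glance                    # is the container up, and what does its healthcheck report
$ sudo ss -tlnp | grep -E ':(9292|9293)'             # is anything listening on either glance port
$ curl -sv http://172.25.228.253:9292/healthcheck    # only meaningful if the healthcheck middleware is enabled, as TripleO normally does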
One other thing to investigate is whether your deployment includes this patch [1]. If it does, then bear in mind the glance-api service running at the edge site will be an "internal" (non public facing) instance that uses port 9293 instead of 9292. You should familiarize yourself with the release note [2].
[1] https://opendev.org/openstack/tripleo-heat-templates/commit/3605d45e417a77a1... [2] https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/walla...
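If that patch is in your deployment, two quick checks might confirm it (a sketch; the address is the one from the earlier connection error, and the config path assumes the standard cinder_volume container layout):

$ sudo ss -tlnp | grep 9293                                        # the "internal" non-public glance instance
$ sudo podman exec cinder_volume grep -r glance_api_servers /etc/cinder/
# cinder at the edge should point at the edge glance (port 9293 with that patch), not the central one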
Alan
With regards, Swogat Pradhan
On Wed, Mar 22, 2023 at 8:11 PM Alan Bishop <abishop@redhat.com> wrote:
> > > On Wed, Mar 22, 2023 at 6:37 AM Swogat Pradhan < > swogatpradhan22@gmail.com> wrote: > >> Update: >> Here is the log when creating a volume using cirros image: >> >> 2023-03-22 11:04:38.449 109 INFO >> cinder.volume.flows.manager.create_volume >> [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db >> 4160ce999a31485fa643aed0936dfef0 - - -] Volume >> bf341343-6609-4b8c-b9e0-93e2a89c8c8f: being created as image with >> specification: {'status': 'creating', 'volume_name': >> 'volume-bf341343-6609-4b8c-b9e0-93e2a89c8c8f', 'volume_size': 4, >> 'image_id': '736d8779-07cd-4510-bab2-adcb653cc538', 'image_location': >> ('rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', >> [{'url': >> 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', >> 'metadata': {'store': 'ceph'}}, {'url': >> 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', >> 'metadata': {'store': 'dcn02'}}]), 'image_meta': {'name': 'cirros', >> 'disk_format': 'qcow2', 'container_format': 'bare', 'visibility': 'public', >> 'size': 16338944, 'virtual_size': 117440512, 'status': 'active', >> 'checksum': '1d3062cd89af34e419f7100277f38b2b', 'protected': False, >> 'min_ram': 0, 'min_disk': 0, 'owner': '4160ce999a31485fa643aed0936dfef0', >> 'os_hidden': False, 'os_hash_algo': 'sha512', 'os_hash_value': >> '553d220ed58cfee7dafe003c446a9f197ab5edf8ffc09396c74187cf83873c877e7ae041cb80f3b91489acf687183adcd689b53b38e3ddd22e627e7f98a09c46', >> 'id': '736d8779-07cd-4510-bab2-adcb653cc538', 'created_at': >> datetime.datetime(2023, 3, 22, 10, 44, 12, tzinfo=datetime.timezone.utc), >> 'updated_at': datetime.datetime(2023, 3, 22, 10, 54, 1, >> tzinfo=datetime.timezone.utc), 'locations': [{'url': >> 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', >> 'metadata': {'store': 'ceph'}}, {'url': >> 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', >> 'metadata': {'store': 'dcn02'}}], 'direct_url': >> 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', >> 'tags': [], 'file': '/v2/images/736d8779-07cd-4510-bab2-adcb653cc538/file', >> 'stores': 'ceph,dcn02', 'properties': {'os_glance_failed_import': '', >> 'os_glance_importing_to_stores': '', 'owner_specified.openstack.md5': '', >> 'owner_specified.openstack.object': 'images/cirros', >> 'owner_specified.openstack.sha256': ''}}, 'image_service': >> <cinder.image.glance.GlanceImageService object at 0x7f449ded1198>} >> 2023-03-22 11:06:16.570 109 INFO cinder.image.image_utils >> [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db >> 4160ce999a31485fa643aed0936dfef0 - - -] Image download 15.58 MB at 0.16 MB/s >> > > As Adam Savage would say, well there's your problem ^^ (Image > download 15.58 MB at 0.16 MB/s). Downloading the image takes too long, and > 0.16 MB/s suggests you have a network issue. > > John Fulton previously stated your cinder-volume service at the edge > site is not using the local ceph image store. Assuming you are deploying > GlanceApiEdge service [1], then the cinder-volume service should be > configured to use the local glance service [2]. You should check cinder's > glance_api_servers to confirm it's the edge site's glance service. > > [1] > https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/envi... 
> [2] > https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/depl... > > Alan > > >> 2023-03-22 11:07:54.023 109 WARNING py.warnings >> [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db >> 4160ce999a31485fa643aed0936dfef0 - - -] >> /usr/lib/python3.6/site-packages/oslo_utils/imageutils.py:75: >> FutureWarning: The human format is deprecated and the format parameter will >> be removed. Use explicitly json instead in version 'xena' >> category=FutureWarning) >> >> 2023-03-22 11:11:12.161 109 WARNING py.warnings >> [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db >> 4160ce999a31485fa643aed0936dfef0 - - -] >> /usr/lib/python3.6/site-packages/oslo_utils/imageutils.py:75: >> FutureWarning: The human format is deprecated and the format parameter will >> be removed. Use explicitly json instead in version 'xena' >> category=FutureWarning) >> >> 2023-03-22 11:11:12.163 109 INFO cinder.image.image_utils >> [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db >> 4160ce999a31485fa643aed0936dfef0 - - -] Converted 112.00 MB image at 112.00 >> MB/s >> 2023-03-22 11:11:14.998 109 INFO >> cinder.volume.flows.manager.create_volume >> [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db >> 4160ce999a31485fa643aed0936dfef0 - - -] Volume >> volume-bf341343-6609-4b8c-b9e0-93e2a89c8c8f >> (bf341343-6609-4b8c-b9e0-93e2a89c8c8f): created successfully >> 2023-03-22 11:11:15.195 109 INFO cinder.volume.manager >> [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db >> 4160ce999a31485fa643aed0936dfef0 - - -] Created volume successfully. >> >> The image is present in dcn02 store but still it downloaded the >> image in 0.16 MB/s and then created the volume. >> >> With regards, >> Swogat Pradhan >> >> On Tue, Mar 21, 2023 at 6:10 PM Swogat Pradhan < >> swogatpradhan22@gmail.com> wrote: >> >>> Hi Jhon, >>> This seems to be an issue. >>> When i deployed the dcn ceph in both dcn01 and dcn02 the --cluster >>> parameter was specified to the respective cluster names but the config >>> files were created in the name of ceph.conf and keyring was >>> ceph.client.openstack.keyring. >>> >>> Which created issues in glance as well as the naming convention of >>> the files didn't match the cluster names, so i had to manually rename the >>> central ceph conf file as such: >>> >>> [root@dcn02-compute-0 ~]# cd /var/lib/tripleo-config/ceph/ >>> [root@dcn02-compute-0 ceph]# ll >>> total 16 >>> -rw-------. 1 root root 257 Mar 13 13:56 >>> ceph_central.client.openstack.keyring >>> -rw-r--r--. 1 root root 428 Mar 13 13:56 ceph_central.conf >>> -rw-------. 1 root root 205 Mar 15 18:45 >>> ceph.client.openstack.keyring >>> -rw-r--r--. 1 root root 362 Mar 15 18:45 ceph.conf >>> [root@dcn02-compute-0 ceph]# >>> >>> ceph.conf and ceph.client.openstack.keyring contain the fsid of >>> the respective clusters in both dcn01 and dcn02. >>> In the above cli output, the ceph.conf and ceph.client... are the >>> files used to access dcn02 ceph cluster and ceph_central* files are used in >>> for accessing central ceph cluster. 
>>> >>> glance multistore config: >>> [dcn02] >>> rbd_store_ceph_conf=/etc/ceph/ceph.conf >>> rbd_store_user=openstack >>> rbd_store_pool=images >>> rbd_thin_provisioning=False >>> store_description=dcn02 rbd glance store >>> >>> [ceph_central] >>> rbd_store_ceph_conf=/etc/ceph/ceph_central.conf >>> rbd_store_user=openstack >>> rbd_store_pool=images >>> rbd_thin_provisioning=False >>> store_description=Default glance store backend. >>> >>> >>> With regards, >>> Swogat Pradhan >>> >>> On Tue, Mar 21, 2023 at 5:52 PM John Fulton <johfulto@redhat.com> >>> wrote: >>> >>>> On Tue, Mar 21, 2023 at 8:03 AM Swogat Pradhan >>>> <swogatpradhan22@gmail.com> wrote: >>>> > >>>> > Hi, >>>> > Seems like cinder is not using the local ceph. >>>> >>>> That explains the issue. It's a misconfiguration. >>>> >>>> I hope this is not a production system since the mailing list now >>>> has >>>> the cinder.conf which contains passwords. >>>> >>>> The section that looks like this: >>>> >>>> [tripleo_ceph] >>>> volume_backend_name=tripleo_ceph >>>> volume_driver=cinder.volume.drivers.rbd.RBDDriver >>>> rbd_ceph_conf=/etc/ceph/ceph.conf >>>> rbd_user=openstack >>>> rbd_pool=volumes >>>> rbd_flatten_volume_from_snapshot=False >>>> rbd_secret_uuid=<redacted> >>>> report_discard_supported=True >>>> >>>> Should be updated to refer to the local DCN ceph cluster and not >>>> the >>>> central one. Use the ceph conf file for that cluster and ensure >>>> the >>>> rbd_secret_uuid corresponds to that one. >>>> >>>> TripleO’s convention is to set the rbd_secret_uuid to the FSID of >>>> the >>>> Ceph cluster. The FSID should be in the ceph.conf file. The >>>> tripleo_nova_libvirt role will use virsh secret-* commands so that >>>> libvirt can retrieve the cephx secret using the FSID as a key. >>>> This >>>> can be confirmed with `podman exec nova_virtsecretd virsh >>>> secret-get-value $FSID`. >>>> >>>> The documentation describes how to configure the central and DCN >>>> sites >>>> correctly but an error seems to have occurred while you were >>>> following >>>> it. >>>> >>>> >>>> https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features... >>>> >>>> John >>>> >>>> > >>>> > Ceph Output: >>>> > [ceph: root@dcn02-ceph-all-0 /]# rbd -p images ls -l >>>> > NAME SIZE PARENT >>>> FMT PROT LOCK >>>> > 2abfafaa-eff4-4c2e-a538-dc2e1249ab65 8 MiB >>>> 2 excl >>>> > 55f40c8a-8f79-48c5-a52a-9b679b762f19 16 MiB 2 >>>> > 55f40c8a-8f79-48c5-a52a-9b679b762f19@snap 16 MiB >>>> 2 yes >>>> > 59f6a9cd-721c-45b5-a15f-fd021b08160d 321 MiB 2 >>>> > 59f6a9cd-721c-45b5-a15f-fd021b08160d@snap 321 MiB >>>> 2 yes >>>> > 5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0 386 MiB 2 >>>> > 5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0@snap 386 MiB >>>> 2 yes >>>> > 9b27248e-a8cf-4f00-a039-d3e3066cd26a 15 GiB 2 >>>> > 9b27248e-a8cf-4f00-a039-d3e3066cd26a@snap 15 GiB >>>> 2 yes >>>> > b7356adc-bb47-4c05-968b-6d3c9ca0079b 15 GiB 2 >>>> > b7356adc-bb47-4c05-968b-6d3c9ca0079b@snap 15 GiB >>>> 2 yes >>>> > e77e78ad-d369-4a1d-b758-8113621269a3 15 GiB 2 >>>> > e77e78ad-d369-4a1d-b758-8113621269a3@snap 15 GiB >>>> 2 yes >>>> > >>>> > [ceph: root@dcn02-ceph-all-0 /]# rbd -p volumes ls -l >>>> > NAME SIZE PARENT >>>> FMT PROT LOCK >>>> > volume-c644086f-d3cf-406d-b0f1-7691bde5981d 100 GiB >>>> 2 >>>> > volume-f0969935-a742-4744-9375-80bf323e4d63 10 GiB >>>> 2 >>>> > [ceph: root@dcn02-ceph-all-0 /]# >>>> > >>>> > Attached the cinder config. >>>> > Please let me know how I can solve this issue. 
>>>> > >>>> > With regards, >>>> > Swogat Pradhan >>>> > >>>> > On Tue, Mar 21, 2023 at 3:53 PM John Fulton < >>>> johfulto@redhat.com> wrote: >>>> >> >>>> >> in my last message under the line "On a DCN site if you run a >>>> command like this:" I suggested some steps you could try to confirm the >>>> image is a COW from the local glance as well as how to look at your cinder >>>> config. >>>> >> >>>> >> On Tue, Mar 21, 2023, 12:06 AM Swogat Pradhan < >>>> swogatpradhan22@gmail.com> wrote: >>>> >>> >>>> >>> Update: >>>> >>> I uploaded an image directly to the dcn02 store, and it takes >>>> around 10,15 minutes to create a volume with image in dcn02. >>>> >>> The image size is 389 MB. >>>> >>> >>>> >>> On Mon, Mar 20, 2023 at 10:26 PM Swogat Pradhan < >>>> swogatpradhan22@gmail.com> wrote: >>>> >>>> >>>> >>>> Hi Jhon, >>>> >>>> I checked in the ceph od dcn02, I can see the images created >>>> after importing from the central site. >>>> >>>> But launching an instance normally fails as it takes a long >>>> time for the volume to get created. >>>> >>>> >>>> >>>> When launching an instance from volume the instance is >>>> getting created properly without any errors. >>>> >>>> >>>> >>>> I tried to cache images in nova using >>>> https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep... >>>> but getting checksum failed error. >>>> >>>> >>>> >>>> With regards, >>>> >>>> Swogat Pradhan >>>> >>>> >>>> >>>> On Thu, Mar 16, 2023 at 5:24 PM John Fulton < >>>> johfulto@redhat.com> wrote: >>>> >>>>> >>>> >>>>> On Wed, Mar 15, 2023 at 8:05 PM Swogat Pradhan >>>> >>>>> <swogatpradhan22@gmail.com> wrote: >>>> >>>>> > >>>> >>>>> > Update: After restarting the nova services on the >>>> controller and running the deploy script on the edge site, I was able to >>>> launch the VM from volume. >>>> >>>>> > >>>> >>>>> > Right now the instance creation is failing as the block >>>> device creation is stuck in creating state, it is taking more than 10 mins >>>> for the volume to be created, whereas the image has already been imported >>>> to the edge glance. >>>> >>>>> >>>> >>>>> Try following this document and making the same >>>> observations in your >>>> >>>>> environment for AZs and their local ceph cluster. >>>> >>>>> >>>> >>>>> >>>> https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features... >>>> >>>>> >>>> >>>>> On a DCN site if you run a command like this: >>>> >>>>> >>>> >>>>> $ sudo cephadm shell --config /etc/ceph/dcn0.conf --keyring >>>> >>>>> /etc/ceph/dcn0.client.admin.keyring >>>> >>>>> $ rbd --cluster dcn0 -p volumes ls -l >>>> >>>>> NAME SIZE PARENT >>>> >>>>> FMT PROT LOCK >>>> >>>>> volume-28c6fc32-047b-4306-ad2d-de2be02716b7 8 GiB >>>> >>>>> images/8083c7e7-32d8-4f7a-b1da-0ed7884f1076@snap 2 >>>> excl >>>> >>>>> $ >>>> >>>>> >>>> >>>>> Then, you should see the parent of the volume is the image >>>> which is on >>>> >>>>> the same local ceph cluster. >>>> >>>>> >>>> >>>>> I wonder if something is misconfigured and thus you're >>>> encountering >>>> >>>>> the streaming behavior described here: >>>> >>>>> >>>> >>>>> Ideally all images should reside in the central Glance and >>>> be copied >>>> >>>>> to DCN sites before instances of those images are booted on >>>> DCN sites. >>>> >>>>> If an image is not copied to a DCN site before it is >>>> booted, then the >>>> >>>>> image will be streamed to the DCN site and then the image >>>> will boot as >>>> >>>>> an instance. 
This happens because Glance at the DCN site >>>> has access to >>>> >>>>> the images store at the Central ceph cluster. Though the >>>> booting of >>>> >>>>> the image will take time because it has not been copied in >>>> advance, >>>> >>>>> this is still preferable to failing to boot the image. >>>> >>>>> >>>> >>>>> You can also exec into the cinder container at the DCN site >>>> and >>>> >>>>> confirm it's using it's local ceph cluster. >>>> >>>>> >>>> >>>>> John >>>> >>>>> >>>> >>>>> > >>>> >>>>> > I will try and create a new fresh image and test again >>>> then update. >>>> >>>>> > >>>> >>>>> > With regards, >>>> >>>>> > Swogat Pradhan >>>> >>>>> > >>>> >>>>> > On Wed, Mar 15, 2023 at 11:13 PM Swogat Pradhan < >>>> swogatpradhan22@gmail.com> wrote: >>>> >>>>> >> >>>> >>>>> >> Update: >>>> >>>>> >> In the hypervisor list the compute node state is showing >>>> down. >>>> >>>>> >> >>>> >>>>> >> >>>> >>>>> >> On Wed, Mar 15, 2023 at 11:11 PM Swogat Pradhan < >>>> swogatpradhan22@gmail.com> wrote: >>>> >>>>> >>> >>>> >>>>> >>> Hi Brendan, >>>> >>>>> >>> Now i have deployed another site where i have used 2 >>>> linux bonds network template for both 3 compute nodes and 3 ceph nodes. >>>> >>>>> >>> The bonding options is set to mode=802.3ad >>>> (lacp=active). >>>> >>>>> >>> I used a cirros image to launch instance but the >>>> instance timed out so i waited for the volume to be created. >>>> >>>>> >>> Once the volume was created i tried launching the >>>> instance from the volume and still the instance is stuck in spawning state. >>>> >>>>> >>> >>>> >>>>> >>> Here is the nova-compute log: >>>> >>>>> >>> >>>> >>>>> >>> 2023-03-15 17:35:47.739 185437 INFO oslo.privsep.daemon >>>> [-] privsep daemon starting >>>> >>>>> >>> 2023-03-15 17:35:47.744 185437 INFO oslo.privsep.daemon >>>> [-] privsep process running with uid/gid: 0/0 >>>> >>>>> >>> 2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon >>>> [-] privsep process running with capabilities (eff/prm/inh): >>>> CAP_SYS_ADMIN/CAP_SYS_ADMIN/none >>>> >>>>> >>> 2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon >>>> [-] privsep daemon running as pid 185437 >>>> >>>>> >>> 2023-03-15 17:35:47.974 8 WARNING >>>> os_brick.initiator.connectors.nvmeof >>>> [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db >>>> 4160ce999a31485fa643aed0936dfef0 - default default] Process execution error >>>> in _get_host_uuid: Unexpected error while running command. >>>> >>>>> >>> Command: blkid overlay -s UUID -o value >>>> >>>>> >>> Exit code: 2 >>>> >>>>> >>> Stdout: '' >>>> >>>>> >>> Stderr: '': >>>> oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while >>>> running command. >>>> >>>>> >>> 2023-03-15 17:35:51.616 8 INFO nova.virt.libvirt.driver >>>> [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db >>>> 4160ce999a31485fa643aed0936dfef0 - default default] [instance: >>>> 450b749c-a10a-4308-80a9-3b8020fee758] Creating image >>>> >>>>> >>> >>>> >>>>> >>> It is stuck in creating image, do i need to run the >>>> template mentioned here ?: >>>> https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep... >>>> >>>>> >>> >>>> >>>>> >>> The volume is already created and i do not understand >>>> why the instance is stuck in spawning state. 
>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:55.283 7 INFO >>>> oslo.privsep.daemon >>>> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>>> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>>> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default >>>> default] Running >>>> >>>>> >>>> >>>>> privsep helper: >>>> >>>>> >>>> >>>>> ['sudo', 'nova-rootwrap', >>>> '/etc/nova/rootwrap.conf', >>>> >>>>> >>>> 'privsep-helper', >>>> >>>>> >>>> >>>>> '--config-file', '/etc/nova/nova.conf', >>>> '--config-file', >>>> >>>>> >>>> >>>>> '/etc/nova/nova-compute.conf', >>>> '--privsep_context', >>>> >>>>> >>>> >>>>> 'os_brick.privileged.default', >>>> '--privsep_sock_path', >>>> >>>>> >>>> >>>>> '/tmp/tmpin40tah6/privsep.sock'] >>>> >>>>> >>>> >>>>> 2023-02-26 08:49:55.791 7 INFO >>>> oslo.privsep.daemon >>>> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>>> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>>> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default >>>> default] Spawned new >>>> >>>>> >>>> privsep >>>> >>>>> >>>> >>>>> daemon via rootwrap >>>> >>>>> >>>> >>>>> 2023-02-26 08:49:55.717 2647 INFO >>>> oslo.privsep.daemon [-] privsep >>>> >>>>> >>>> >>>>> daemon starting >>>> >>>>> >>>> >>>>> 2023-02-26 08:49:55.722 2647 INFO >>>> oslo.privsep.daemon [-] privsep >>>> >>>>> >>>> >>>>> process running with uid/gid: 0/0 >>>> >>>>> >>>> >>>>> 2023-02-26 08:49:55.726 2647 INFO >>>> oslo.privsep.daemon [-] privsep >>>> >>>>> >>>> >>>>> process running with capabilities (eff/prm/inh): >>>> >>>>> >>>> >>>>> CAP_SYS_ADMIN/CAP_SYS_ADMIN/none >>>> >>>>> >>>> >>>>> 2023-02-26 08:49:55.726 2647 INFO >>>> oslo.privsep.daemon [-] privsep >>>> >>>>> >>>> >>>>> daemon running as pid 2647 >>>> >>>>> >>>> >>>>> 2023-02-26 08:49:55.956 7 WARNING >>>> >>>>> >>>> os_brick.initiator.connectors.nvmeof >>>> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>>> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>>> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default >>>> default] Process >>>> >>>>> >>>> >>>>> execution error >>>> >>>>> >>>> >>>>> in _get_host_uuid: Unexpected error while >>>> running command. >>>> >>>>> >>>> >>>>> Command: blkid overlay -s UUID -o value >>>> >>>>> >>>> >>>>> Exit code: 2 >>>> >>>>> >>>> >>>>> Stdout: '' >>>> >>>>> >>>> >>>>> Stderr: '': >>>> oslo_concurrency.processutils.ProcessExecutionError: >>>> >>>>> >>>> >>>>> Unexpected error while running command. >>>> >>>>> >>>> >>>>> 2023-02-26 08:49:58.247 7 INFO >>>> nova.virt.libvirt.driver >>>> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>>> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>>> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default >>>> default] [instance: >>>> >>>>> >>>> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Creating >>>> image >>>> >>>>> >>>> >>>>> >>>> >>>>> >>>> >>>>> Is there a way to solve this issue? >>>> >>>>> >>>> >>>>> >>>> >>>>> >>>> >>>>> >>>> >>>>> >>>> >>>>> With regards, >>>> >>>>> >>>> >>>>> >>>> >>>>> >>>> >>>>> Swogat Pradhan >>>> >>>>> >>>> >>>>> >>>> >>>>> >>>> >>>> >>>> >>>>> >>>> >>>> >>>>> >>>> >>>> >>>>> >>>> >>>> >>>>> >>>> >>>> >>>>> >>>> >>>> >>>>> >>>> >>>> >>>>> >>>> >>>> >>>>> >>>> >>>> >>>>> >>>> >>>> >>>>> >>>> >>>>
Hi Alan, My bad, I didn't see that it was you who replied. Thanks for clarifying my doubt. On Thu, Mar 23, 2023 at 6:12 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi John, Thank you for clarifying that. Right now the cinder volume gets stuck in *creating* state when an image is used as the volume source. But when creating an empty volume, the volumes are created successfully without any errors.
We are getting volume creation request in cinder-volume.log as such: 2023-03-23 12:34:40.152 108 INFO cinder.volume.flows.manager.create_volume [req-18556796-a61c-4097-8fa8-b136ce9814f7 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Volume 872a2ae6-c75b-4fc0-8172-17a29d07a66c: being created as image with specification: {'status': 'creating', 'volume_name': 'volume-872a2ae6-c75b-4fc0-8172-17a29d07a66c', 'volume_size': 1, 'image_id': '131ed4e0-0474-45be-b74a-43b599a7d6c5', 'image_location': ('rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/snap', [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/snap', 'metadata': {'store': 'dcn02'}}]), 'image_meta': {'name': 'cirros', 'disk_format': 'qcow2', 'container_format': 'bare', 'visibility': 'public', 'size': 16338944, 'virtual_size': 117440512, 'status': 'active', 'checksum': '1d3062cd89af34e419f7100277f38b2b', 'protected': False, 'min_ram': 0, 'min_disk': 0, 'owner': '4160ce999a31485fa643aed0936dfef0', 'os_hidden': False, 'os_hash_algo': 'sha512', 'os_hash_value': '553d220ed58cfee7dafe003c446a9f197ab5edf8ffc09396c74187cf83873c877e7ae041cb80f3b91489acf687183adcd689b53b38e3ddd22e627e7f98a09c46', 'id': '131ed4e0-0474-45be-b74a-43b599a7d6c5', 'created_at': datetime.datetime(2023, 3, 23, 11, 41, 51, tzinfo=datetime.timezone.utc), 'updated_at': datetime.datetime(2023, 3, 23, 11, 46, 37, tzinfo=datetime.timezone.utc), 'locations': [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/snap', 'metadata': {'store': 'dcn02'}}], 'direct_url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/snap', 'tags': [], 'file': '/v2/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/file', 'stores': 'ceph,dcn02', 'properties': {'os_glance_failed_import': '', 'os_glance_importing_to_stores': '', 'owner_specified.openstack.md5': '', 'owner_specified.openstack.object': 'images/cirros', 'owner_specified.openstack.sha256': ''}}, 'image_service': <cinder.image.glance.GlanceImageService object at 0x7f98d869ed68>}
But there is nothing else after that, and the volume doesn't even time out; it just stays stuck in the creating state. Can you advise what the issue might be here? All the containers are in a healthy state now.
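For reference, here is a rough sketch of the checks suggested earlier in this thread that could be repeated while the volume sits in creating state (the dcn02 names come from the configs quoted below; the log path is the usual TripleO location and may differ):

# watch cinder-volume for anything logged after the create_volume entry above
tail -f /var/log/containers/cinder/cinder-volume.log

# from a dcn02 ceph node, check whether the volume RBD image has appeared, and whether its
# PARENT column points at the local image (COW clone) or is empty (image streamed/converted)
sudo cephadm shell
rbd -p volumes ls -l
rbd -p images ls -l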
With regards, Swogat Pradhan
On Thu, Mar 23, 2023 at 6:06 PM Alan Bishop <abishop@redhat.com> wrote:
On Thu, Mar 23, 2023 at 5:20 AM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi, Is this bind mount not required for the cinder_scheduler container?
"/var/lib/tripleo-config/ceph:/var/lib/kolla/config_files/src-ceph:ro,rprivate,rbind", but I do not see this particular bind on the cinder_scheduler containers on my controller nodes.
That is correct, because the scheduler does not access the ceph cluster.
Alan
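A quick way to verify this on a node running both services is to compare the bind mounts of the two containers; a sketch, assuming the standard container names (the volume service may be named differently, for example when it runs as a pacemaker-managed bundle on the controllers):

sudo podman inspect cinder_volume --format '{{ json .HostConfig.Binds }}' | grep src-ceph
sudo podman inspect cinder_scheduler --format '{{ json .HostConfig.Binds }}' | grep src-ceph

The first command should show the /var/lib/tripleo-config/ceph bind; the second should print nothing.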
With regards, Swogat Pradhan
On Thu, Mar 23, 2023 at 2:46 AM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
Cinder volume config:
[tripleo_ceph]
volume_backend_name=tripleo_ceph
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_user=openstack
rbd_pool=volumes
rbd_flatten_volume_from_snapshot=False
rbd_secret_uuid=a8d5f1f5-48e7-5ede-89ab-8aca59b6397b
report_discard_supported=True
rbd_ceph_conf=/etc/ceph/dcn02.conf
rbd_cluster_name=dcn02
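As John notes further down this thread, TripleO's convention is for rbd_secret_uuid to match the FSID of the local ceph cluster, so a quick sanity check is possible; a sketch, with the container names assumed from a standard deployment:

# the fsid in the dcn02 ceph.conf used by cinder should equal rbd_secret_uuid above
sudo podman exec cinder_volume grep fsid /etc/ceph/dcn02.conf

# libvirt on the HCI node should hold a secret keyed by that same FSID
sudo podman exec nova_virtsecretd virsh secret-get-value a8d5f1f5-48e7-5ede-89ab-8aca59b6397b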
Glance api config:
[dcn02]
rbd_store_ceph_conf=/etc/ceph/dcn02.conf
rbd_store_user=openstack
rbd_store_pool=images
rbd_thin_provisioning=False
store_description=dcn02 rbd glance store

[ceph]
rbd_store_ceph_conf=/etc/ceph/ceph.conf
rbd_store_user=openstack
rbd_store_pool=images
rbd_thin_provisioning=False
store_description=Default glance store backend.
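With the two stores defined above, one way to confirm an image actually has a copy in the dcn02 store before creating a volume from it is to look at its stores / direct_url fields; a sketch, using the cirros image id from the log earlier in this thread:

openstack image show 131ed4e0-0474-45be-b74a-43b599a7d6c5 | grep -E 'stores|direct_url'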
On Thu, Mar 23, 2023 at 2:29 AM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
I still have the same issue and I'm not sure what's left to try. All the pods are now in a healthy state. When I try to create a volume from an image, log entries only appear in cinder-volume about 3 minutes after I hit the create volume button, and the volumes have now been stuck in the creating state for more than 20 minutes.
Cinder logs: 2023-03-22 20:32:44.010 108 INFO cinder.rpc [req-0d2093a0-efbd-45a5-bd7d-cce25ddc200e b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Automatically selected cinder-volume RPC version 3.17 as minimum service version. 2023-03-22 20:34:59.166 108 INFO cinder.volume.flows.manager.create_volume [req-0d2093a0-efbd-45a5-bd7d-cce25ddc200e b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Volume 5743a879-090d-46db-bc7c-1c0b0669a112: being created as image with specification: {'status': 'creating', 'volume_name': 'volume-5743a879-090d-46db-bc7c-1c0b0669a112', 'volume_size': 2, 'image_id': 'acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b', 'image_location': ('rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'metadata': {'store': 'dcn02'}}]), 'image_meta': {'name': 'cirros', 'disk_format': 'qcow2', 'container_format': 'bare', 'visibility': 'public', 'size': 16338944, 'virtual_size': 117440512, 'status': 'active', 'checksum': '1d3062cd89af34e419f7100277f38b2b', 'protected': False, 'min_ram': 0, 'min_disk': 0, 'owner': '4160ce999a31485fa643aed0936dfef0', 'os_hidden': False, 'os_hash_algo': 'sha512', 'os_hash_value': '553d220ed58cfee7dafe003c446a9f197ab5edf8ffc09396c74187cf83873c877e7ae041cb80f3b91489acf687183adcd689b53b38e3ddd22e627e7f98a09c46', 'id': 'acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b', 'created_at': datetime.datetime(2023, 3, 22, 18, 50, 5, tzinfo=datetime.timezone.utc), 'updated_at': datetime.datetime(2023, 3, 22, 20, 3, 54, tzinfo=datetime.timezone.utc), 'locations': [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'metadata': {'store': 'dcn02'}}], 'direct_url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'tags': [], 'file': '/v2/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/file', 'stores': 'ceph,dcn02', 'properties': {'os_glance_failed_import': '', 'os_glance_importing_to_stores': '', 'owner_specified.openstack.md5': '', 'owner_specified.openstack.object': 'images/cirros', 'owner_specified.openstack.sha256': ''}}, 'image_service': <cinder.image.glance.GlanceImageService object at 0x7f8147973438>}
With regards, Swogat Pradhan
On Wed, Mar 22, 2023 at 9:19 PM Alan Bishop <abishop@redhat.com> wrote:
On Wed, Mar 22, 2023 at 8:38 AM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
> Hi Adam, > The systems are in the same LAN. In this case it seemed like the image > was being pulled from the central site, which was caused by a > misconfiguration in the ceph.conf file in the /var/lib/tripleo-config/ceph/ > directory; this seems to have been resolved after the changes I made to > fix it. > > Right now the glance api podman container is running in an unhealthy state, > the podman logs don't show any errors whatsoever, and when I issue the command > netstat -nultp I do not see any entry for the glance port, i.e. 9292, at the dcn > site, which is why cinder is throwing an error stating: > > 2023-03-22 13:32:29.786 108 ERROR oslo_messaging.rpc.server > cinder.exception.GlanceConnectionFailed: Connection to glance failed: Error > finding address for > http://172.25.228.253:9292/v2/images/736d8779-07cd-4510-bab2-adcb653cc538: > Unable to establish connection to > http://172.25.228.253:9292/v2/images/736d8779-07cd-4510-bab2-adcb653cc538: > HTTPConnectionPool(host='172.25.228.253', port=9292): Max retries exceeded > with url: /v2/images/736d8779-07cd-4510-bab2-adcb653cc538 (Caused by > NewConnectionError('<urllib3.connection.HTTPConnection object at > 0x7f7682d2cd30>: Failed to establish a new connection: [Errno 111] > ECONNREFUSED',)) > > Now I need to find out why the port is not listed even though the glance > service is running, and I am not sure how to do that. >
One other thing to investigate is whether your deployment includes this patch [1]. If it does, then bear in mind the glance-api service running at the edge site will be an "internal" (non public facing) instance that uses port 9293 instead of 9292. You should familiarize yourself with the release note [2].
[1] https://opendev.org/openstack/tripleo-heat-templates/commit/3605d45e417a77a1... [2] https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/walla...
Alan
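A sketch of a quick check from the dcn02 side, reusing the address from the error above (the container name and the cinder.conf path inside it are assumptions):

# is anything listening on the glance ports at the edge?
sudo netstat -nltp | grep -E '9292|9293'

# if the internal glance instance from the patch is present, port 9293 should answer
curl -s http://172.25.228.253:9293/ | head -c 200

# confirm which glance endpoint cinder is configured to use
sudo podman exec cinder_volume grep glance_api_servers /etc/cinder/cinder.conf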
> With regards, > Swogat Pradhan > > On Wed, Mar 22, 2023 at 8:11 PM Alan Bishop <abishop@redhat.com> > wrote: > >> >> >> On Wed, Mar 22, 2023 at 6:37 AM Swogat Pradhan < >> swogatpradhan22@gmail.com> wrote: >> >>> Update: >>> Here is the log when creating a volume using cirros image: >>> >>> 2023-03-22 11:04:38.449 109 INFO >>> cinder.volume.flows.manager.create_volume >>> [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db >>> 4160ce999a31485fa643aed0936dfef0 - - -] Volume >>> bf341343-6609-4b8c-b9e0-93e2a89c8c8f: being created as image with >>> specification: {'status': 'creating', 'volume_name': >>> 'volume-bf341343-6609-4b8c-b9e0-93e2a89c8c8f', 'volume_size': 4, >>> 'image_id': '736d8779-07cd-4510-bab2-adcb653cc538', 'image_location': >>> ('rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', >>> [{'url': >>> 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', >>> 'metadata': {'store': 'ceph'}}, {'url': >>> 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', >>> 'metadata': {'store': 'dcn02'}}]), 'image_meta': {'name': 'cirros', >>> 'disk_format': 'qcow2', 'container_format': 'bare', 'visibility': 'public', >>> 'size': 16338944, 'virtual_size': 117440512, 'status': 'active', >>> 'checksum': '1d3062cd89af34e419f7100277f38b2b', 'protected': False, >>> 'min_ram': 0, 'min_disk': 0, 'owner': '4160ce999a31485fa643aed0936dfef0', >>> 'os_hidden': False, 'os_hash_algo': 'sha512', 'os_hash_value': >>> '553d220ed58cfee7dafe003c446a9f197ab5edf8ffc09396c74187cf83873c877e7ae041cb80f3b91489acf687183adcd689b53b38e3ddd22e627e7f98a09c46', >>> 'id': '736d8779-07cd-4510-bab2-adcb653cc538', 'created_at': >>> datetime.datetime(2023, 3, 22, 10, 44, 12, tzinfo=datetime.timezone.utc), >>> 'updated_at': datetime.datetime(2023, 3, 22, 10, 54, 1, >>> tzinfo=datetime.timezone.utc), 'locations': [{'url': >>> 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', >>> 'metadata': {'store': 'ceph'}}, {'url': >>> 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', >>> 'metadata': {'store': 'dcn02'}}], 'direct_url': >>> 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', >>> 'tags': [], 'file': '/v2/images/736d8779-07cd-4510-bab2-adcb653cc538/file', >>> 'stores': 'ceph,dcn02', 'properties': {'os_glance_failed_import': '', >>> 'os_glance_importing_to_stores': '', 'owner_specified.openstack.md5': '', >>> 'owner_specified.openstack.object': 'images/cirros', >>> 'owner_specified.openstack.sha256': ''}}, 'image_service': >>> <cinder.image.glance.GlanceImageService object at 0x7f449ded1198>} >>> 2023-03-22 11:06:16.570 109 INFO cinder.image.image_utils >>> [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db >>> 4160ce999a31485fa643aed0936dfef0 - - -] Image download 15.58 MB at 0.16 MB/s >>> >> >> As Adam Savage would say, well there's your problem ^^ (Image >> download 15.58 MB at 0.16 MB/s). Downloading the image takes too long, and >> 0.16 MB/s suggests you have a network issue. >> >> John Fulton previously stated your cinder-volume service at the >> edge site is not using the local ceph image store. Assuming you are >> deploying GlanceApiEdge service [1], then the cinder-volume service should >> be configured to use the local glance service [2]. 
You should check >> cinder's glance_api_servers to confirm it's the edge site's glance service. >> >> [1] >> https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/envi... >> [2] >> https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/depl... >> >> Alan >> >> >>> 2023-03-22 11:07:54.023 109 WARNING py.warnings >>> [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db >>> 4160ce999a31485fa643aed0936dfef0 - - -] >>> /usr/lib/python3.6/site-packages/oslo_utils/imageutils.py:75: >>> FutureWarning: The human format is deprecated and the format parameter will >>> be removed. Use explicitly json instead in version 'xena' >>> category=FutureWarning) >>> >>> 2023-03-22 11:11:12.161 109 WARNING py.warnings >>> [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db >>> 4160ce999a31485fa643aed0936dfef0 - - -] >>> /usr/lib/python3.6/site-packages/oslo_utils/imageutils.py:75: >>> FutureWarning: The human format is deprecated and the format parameter will >>> be removed. Use explicitly json instead in version 'xena' >>> category=FutureWarning) >>> >>> 2023-03-22 11:11:12.163 109 INFO cinder.image.image_utils >>> [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db >>> 4160ce999a31485fa643aed0936dfef0 - - -] Converted 112.00 MB image at 112.00 >>> MB/s >>> 2023-03-22 11:11:14.998 109 INFO >>> cinder.volume.flows.manager.create_volume >>> [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db >>> 4160ce999a31485fa643aed0936dfef0 - - -] Volume >>> volume-bf341343-6609-4b8c-b9e0-93e2a89c8c8f >>> (bf341343-6609-4b8c-b9e0-93e2a89c8c8f): created successfully >>> 2023-03-22 11:11:15.195 109 INFO cinder.volume.manager >>> [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db >>> 4160ce999a31485fa643aed0936dfef0 - - -] Created volume successfully. >>> >>> The image is present in dcn02 store but still it downloaded the >>> image in 0.16 MB/s and then created the volume. >>> >>> With regards, >>> Swogat Pradhan >>> >>> On Tue, Mar 21, 2023 at 6:10 PM Swogat Pradhan < >>> swogatpradhan22@gmail.com> wrote: >>> >>>> Hi Jhon, >>>> This seems to be an issue. >>>> When i deployed the dcn ceph in both dcn01 and dcn02 the >>>> --cluster parameter was specified to the respective cluster names but the >>>> config files were created in the name of ceph.conf and keyring was >>>> ceph.client.openstack.keyring. >>>> >>>> Which created issues in glance as well as the naming convention >>>> of the files didn't match the cluster names, so i had to manually rename >>>> the central ceph conf file as such: >>>> >>>> [root@dcn02-compute-0 ~]# cd /var/lib/tripleo-config/ceph/ >>>> [root@dcn02-compute-0 ceph]# ll >>>> total 16 >>>> -rw-------. 1 root root 257 Mar 13 13:56 >>>> ceph_central.client.openstack.keyring >>>> -rw-r--r--. 1 root root 428 Mar 13 13:56 ceph_central.conf >>>> -rw-------. 1 root root 205 Mar 15 18:45 >>>> ceph.client.openstack.keyring >>>> -rw-r--r--. 1 root root 362 Mar 15 18:45 ceph.conf >>>> [root@dcn02-compute-0 ceph]# >>>> >>>> ceph.conf and ceph.client.openstack.keyring contain the fsid of >>>> the respective clusters in both dcn01 and dcn02. >>>> In the above cli output, the ceph.conf and ceph.client... are the >>>> files used to access dcn02 ceph cluster and ceph_central* files are used in >>>> for accessing central ceph cluster. 
>>>> >>>> glance multistore config: >>>> [dcn02] >>>> rbd_store_ceph_conf=/etc/ceph/ceph.conf >>>> rbd_store_user=openstack >>>> rbd_store_pool=images >>>> rbd_thin_provisioning=False >>>> store_description=dcn02 rbd glance store >>>> >>>> [ceph_central] >>>> rbd_store_ceph_conf=/etc/ceph/ceph_central.conf >>>> rbd_store_user=openstack >>>> rbd_store_pool=images >>>> rbd_thin_provisioning=False >>>> store_description=Default glance store backend. >>>> >>>> >>>> With regards, >>>> Swogat Pradhan >>>> >>>> On Tue, Mar 21, 2023 at 5:52 PM John Fulton <johfulto@redhat.com> >>>> wrote: >>>> >>>>> On Tue, Mar 21, 2023 at 8:03 AM Swogat Pradhan >>>>> <swogatpradhan22@gmail.com> wrote: >>>>> > >>>>> > Hi, >>>>> > Seems like cinder is not using the local ceph. >>>>> >>>>> That explains the issue. It's a misconfiguration. >>>>> >>>>> I hope this is not a production system since the mailing list >>>>> now has >>>>> the cinder.conf which contains passwords. >>>>> >>>>> The section that looks like this: >>>>> >>>>> [tripleo_ceph] >>>>> volume_backend_name=tripleo_ceph >>>>> volume_driver=cinder.volume.drivers.rbd.RBDDriver >>>>> rbd_ceph_conf=/etc/ceph/ceph.conf >>>>> rbd_user=openstack >>>>> rbd_pool=volumes >>>>> rbd_flatten_volume_from_snapshot=False >>>>> rbd_secret_uuid=<redacted> >>>>> report_discard_supported=True >>>>> >>>>> Should be updated to refer to the local DCN ceph cluster and not >>>>> the >>>>> central one. Use the ceph conf file for that cluster and ensure >>>>> the >>>>> rbd_secret_uuid corresponds to that one. >>>>> >>>>> TripleO’s convention is to set the rbd_secret_uuid to the FSID >>>>> of the >>>>> Ceph cluster. The FSID should be in the ceph.conf file. The >>>>> tripleo_nova_libvirt role will use virsh secret-* commands so >>>>> that >>>>> libvirt can retrieve the cephx secret using the FSID as a key. >>>>> This >>>>> can be confirmed with `podman exec nova_virtsecretd virsh >>>>> secret-get-value $FSID`. >>>>> >>>>> The documentation describes how to configure the central and DCN >>>>> sites >>>>> correctly but an error seems to have occurred while you were >>>>> following >>>>> it. >>>>> >>>>> >>>>> https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features... >>>>> >>>>> John >>>>> >>>>> > >>>>> > Ceph Output: >>>>> > [ceph: root@dcn02-ceph-all-0 /]# rbd -p images ls -l >>>>> > NAME SIZE PARENT >>>>> FMT PROT LOCK >>>>> > 2abfafaa-eff4-4c2e-a538-dc2e1249ab65 8 MiB >>>>> 2 excl >>>>> > 55f40c8a-8f79-48c5-a52a-9b679b762f19 16 MiB 2 >>>>> > 55f40c8a-8f79-48c5-a52a-9b679b762f19@snap 16 MiB >>>>> 2 yes >>>>> > 59f6a9cd-721c-45b5-a15f-fd021b08160d 321 MiB 2 >>>>> > 59f6a9cd-721c-45b5-a15f-fd021b08160d@snap 321 MiB >>>>> 2 yes >>>>> > 5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0 386 MiB 2 >>>>> > 5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0@snap 386 MiB >>>>> 2 yes >>>>> > 9b27248e-a8cf-4f00-a039-d3e3066cd26a 15 GiB 2 >>>>> > 9b27248e-a8cf-4f00-a039-d3e3066cd26a@snap 15 GiB >>>>> 2 yes >>>>> > b7356adc-bb47-4c05-968b-6d3c9ca0079b 15 GiB 2 >>>>> > b7356adc-bb47-4c05-968b-6d3c9ca0079b@snap 15 GiB >>>>> 2 yes >>>>> > e77e78ad-d369-4a1d-b758-8113621269a3 15 GiB 2 >>>>> > e77e78ad-d369-4a1d-b758-8113621269a3@snap 15 GiB >>>>> 2 yes >>>>> > >>>>> > [ceph: root@dcn02-ceph-all-0 /]# rbd -p volumes ls -l >>>>> > NAME SIZE PARENT >>>>> FMT PROT LOCK >>>>> > volume-c644086f-d3cf-406d-b0f1-7691bde5981d 100 GiB >>>>> 2 >>>>> > volume-f0969935-a742-4744-9375-80bf323e4d63 10 GiB >>>>> 2 >>>>> > [ceph: root@dcn02-ceph-all-0 /]# >>>>> > >>>>> > Attached the cinder config. 
>>>>> > Please let me know how I can solve this issue. >>>>> > >>>>> > With regards, >>>>> > Swogat Pradhan >>>>> > >>>>> > On Tue, Mar 21, 2023 at 3:53 PM John Fulton < >>>>> johfulto@redhat.com> wrote: >>>>> >> >>>>> >> in my last message under the line "On a DCN site if you run a >>>>> command like this:" I suggested some steps you could try to confirm the >>>>> image is a COW from the local glance as well as how to look at your cinder >>>>> config. >>>>> >> >>>>> >> On Tue, Mar 21, 2023, 12:06 AM Swogat Pradhan < >>>>> swogatpradhan22@gmail.com> wrote: >>>>> >>> >>>>> >>> Update: >>>>> >>> I uploaded an image directly to the dcn02 store, and it >>>>> takes around 10,15 minutes to create a volume with image in dcn02. >>>>> >>> The image size is 389 MB. >>>>> >>> >>>>> >>> On Mon, Mar 20, 2023 at 10:26 PM Swogat Pradhan < >>>>> swogatpradhan22@gmail.com> wrote: >>>>> >>>> >>>>> >>>> Hi Jhon, >>>>> >>>> I checked in the ceph od dcn02, I can see the images >>>>> created after importing from the central site. >>>>> >>>> But launching an instance normally fails as it takes a long >>>>> time for the volume to get created. >>>>> >>>> >>>>> >>>> When launching an instance from volume the instance is >>>>> getting created properly without any errors. >>>>> >>>> >>>>> >>>> I tried to cache images in nova using >>>>> https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep... >>>>> but getting checksum failed error. >>>>> >>>> >>>>> >>>> With regards, >>>>> >>>> Swogat Pradhan >>>>> >>>> >>>>> >>>> On Thu, Mar 16, 2023 at 5:24 PM John Fulton < >>>>> johfulto@redhat.com> wrote: >>>>> >>>>> >>>>> >>>>> On Wed, Mar 15, 2023 at 8:05 PM Swogat Pradhan >>>>> >>>>> <swogatpradhan22@gmail.com> wrote: >>>>> >>>>> > >>>>> >>>>> > Update: After restarting the nova services on the >>>>> controller and running the deploy script on the edge site, I was able to >>>>> launch the VM from volume. >>>>> >>>>> > >>>>> >>>>> > Right now the instance creation is failing as the block >>>>> device creation is stuck in creating state, it is taking more than 10 mins >>>>> for the volume to be created, whereas the image has already been imported >>>>> to the edge glance. >>>>> >>>>> >>>>> >>>>> Try following this document and making the same >>>>> observations in your >>>>> >>>>> environment for AZs and their local ceph cluster. >>>>> >>>>> >>>>> >>>>> >>>>> https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features... >>>>> >>>>> >>>>> >>>>> On a DCN site if you run a command like this: >>>>> >>>>> >>>>> >>>>> $ sudo cephadm shell --config /etc/ceph/dcn0.conf --keyring >>>>> >>>>> /etc/ceph/dcn0.client.admin.keyring >>>>> >>>>> $ rbd --cluster dcn0 -p volumes ls -l >>>>> >>>>> NAME SIZE PARENT >>>>> >>>>> FMT PROT LOCK >>>>> >>>>> volume-28c6fc32-047b-4306-ad2d-de2be02716b7 8 GiB >>>>> >>>>> images/8083c7e7-32d8-4f7a-b1da-0ed7884f1076@snap 2 >>>>> excl >>>>> >>>>> $ >>>>> >>>>> >>>>> >>>>> Then, you should see the parent of the volume is the image >>>>> which is on >>>>> >>>>> the same local ceph cluster. >>>>> >>>>> >>>>> >>>>> I wonder if something is misconfigured and thus you're >>>>> encountering >>>>> >>>>> the streaming behavior described here: >>>>> >>>>> >>>>> >>>>> Ideally all images should reside in the central Glance and >>>>> be copied >>>>> >>>>> to DCN sites before instances of those images are booted >>>>> on DCN sites. 
>>>>> >>>>> If an image is not copied to a DCN site before it is >>>>> booted, then the >>>>> >>>>> image will be streamed to the DCN site and then the image >>>>> will boot as >>>>> >>>>> an instance. This happens because Glance at the DCN site >>>>> has access to >>>>> >>>>> the images store at the Central ceph cluster. Though the >>>>> booting of >>>>> >>>>> the image will take time because it has not been copied in >>>>> advance, >>>>> >>>>> this is still preferable to failing to boot the image. >>>>> >>>>> >>>>> >>>>> You can also exec into the cinder container at the DCN >>>>> site and >>>>> >>>>> confirm it's using it's local ceph cluster. >>>>> >>>>> >>>>> >>>>> John >>>>> >>>>> >>>>> >>>>> > >>>>> >>>>> > I will try and create a new fresh image and test again >>>>> then update. >>>>> >>>>> > >>>>> >>>>> > With regards, >>>>> >>>>> > Swogat Pradhan >>>>> >>>>> > >>>>> >>>>> > On Wed, Mar 15, 2023 at 11:13 PM Swogat Pradhan < >>>>> swogatpradhan22@gmail.com> wrote: >>>>> >>>>> >> >>>>> >>>>> >> Update: >>>>> >>>>> >> In the hypervisor list the compute node state is >>>>> showing down. >>>>> >>>>> >> >>>>> >>>>> >> >>>>> >>>>> >> On Wed, Mar 15, 2023 at 11:11 PM Swogat Pradhan < >>>>> swogatpradhan22@gmail.com> wrote: >>>>> >>>>> >>> >>>>> >>>>> >>> Hi Brendan, >>>>> >>>>> >>> Now i have deployed another site where i have used 2 >>>>> linux bonds network template for both 3 compute nodes and 3 ceph nodes. >>>>> >>>>> >>> The bonding options is set to mode=802.3ad >>>>> (lacp=active). >>>>> >>>>> >>> I used a cirros image to launch instance but the >>>>> instance timed out so i waited for the volume to be created. >>>>> >>>>> >>> Once the volume was created i tried launching the >>>>> instance from the volume and still the instance is stuck in spawning state. >>>>> >>>>> >>> >>>>> >>>>> >>> Here is the nova-compute log: >>>>> >>>>> >>> >>>>> >>>>> >>> 2023-03-15 17:35:47.739 185437 INFO >>>>> oslo.privsep.daemon [-] privsep daemon starting >>>>> >>>>> >>> 2023-03-15 17:35:47.744 185437 INFO >>>>> oslo.privsep.daemon [-] privsep process running with uid/gid: 0/0 >>>>> >>>>> >>> 2023-03-15 17:35:47.749 185437 INFO >>>>> oslo.privsep.daemon [-] privsep process running with capabilities >>>>> (eff/prm/inh): CAP_SYS_ADMIN/CAP_SYS_ADMIN/none >>>>> >>>>> >>> 2023-03-15 17:35:47.749 185437 INFO >>>>> oslo.privsep.daemon [-] privsep daemon running as pid 185437 >>>>> >>>>> >>> 2023-03-15 17:35:47.974 8 WARNING >>>>> os_brick.initiator.connectors.nvmeof >>>>> [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] Process execution error >>>>> in _get_host_uuid: Unexpected error while running command. >>>>> >>>>> >>> Command: blkid overlay -s UUID -o value >>>>> >>>>> >>> Exit code: 2 >>>>> >>>>> >>> Stdout: '' >>>>> >>>>> >>> Stderr: '': >>>>> oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while >>>>> running command. >>>>> >>>>> >>> 2023-03-15 17:35:51.616 8 INFO >>>>> nova.virt.libvirt.driver [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 >>>>> b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default >>>>> default] [instance: 450b749c-a10a-4308-80a9-3b8020fee758] Creating image >>>>> >>>>> >>> >>>>> >>>>> >>> It is stuck in creating image, do i need to run the >>>>> template mentioned here ?: >>>>> https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep... 
>>>>> >>>>> >>> >>>>> >>>>> >>> The volume is already created and i do not understand >>>>> why the instance is stuck in spawning state. >>>>> >>>>> >>> >>>>> >>>>> >>> With regards, >>>>> >>>>> >>> Swogat Pradhan >>>>> >>>>> >>> >>>>> >>>>> >>> >>>>> >>>>> >>> On Sun, Mar 5, 2023 at 4:02 PM Brendan Shephard < >>>>> bshephar@redhat.com> wrote: >>>>> >>>>> >>>> >>>>> >>>>> >>>> Does your environment use different network >>>>> interfaces for each of the networks? Or does it have a bond with everything >>>>> on it? >>>>> >>>>> >>>> >>>>> >>>>> >>>> One issue I have seen before is that when launching >>>>> instances, there is a lot of network traffic between nodes as the >>>>> hypervisor needs to download the image from Glance. Along with various >>>>> other services sending normal network traffic, it can be enough to cause >>>>> issues if everything is running over a single 1Gbe interface. >>>>> >>>>> >>>> >>>>> >>>>> >>>> I have seen the same situation in fact when using a >>>>> single active/backup bond on 1Gbe nics. It’s worth checking the network >>>>> traffic while you try to spawn the instance to see if you’re dropping >>>>> packets. In the situation I described, there were dropped packets which >>>>> resulted in a loss of communication between nova_compute and RMQ, so the >>>>> node appeared offline. You should also confirm that nova_compute is being >>>>> disconnected in the nova_compute logs if you tail them on the Hypervisor >>>>> while spawning the instance. >>>>> >>>>> >>>> >>>>> >>>>> >>>> In my case, changing from active/backup to LACP >>>>> helped. So, based on that experience, from my perspective, is certainly >>>>> sounds like some kind of network issue. >>>>> >>>>> >>>> >>>>> >>>>> >>>> Regards, >>>>> >>>>> >>>> >>>>> >>>>> >>>> Brendan Shephard >>>>> >>>>> >>>> Senior Software Engineer >>>>> >>>>> >>>> Red Hat Australia >>>>> >>>>> >>>> >>>>> >>>>> >>>> >>>>> >>>>> >>>> >>>>> >>>>> >>>> On 5 Mar 2023, at 6:47 am, Eugen Block <eblock@nde.ag> >>>>> wrote: >>>>> >>>>> >>>> >>>>> >>>>> >>>> Hi, >>>>> >>>>> >>>> >>>>> >>>>> >>>> I tried to help someone with a similar issue some >>>>> time ago in this thread: >>>>> >>>>> >>>> >>>>> https://serverfault.com/questions/1116771/openstack-oslo-messaging-exception... >>>>> >>>>> >>>> >>>>> >>>>> >>>> But apparently a neutron reinstallation fixed it for >>>>> that user, not sure if that could apply here. But is it possible that your >>>>> nova and neutron versions are different between central and edge site? Have >>>>> you restarted nova and neutron services on the compute nodes after >>>>> installation? Have you debug logs of nova-conductor and maybe nova-compute? >>>>> Maybe they can help narrow down the issue. >>>>> >>>>> >>>> If there isn't any additional information in the >>>>> debug logs I probably would start "tearing down" rabbitmq. I didn't have to >>>>> do that in a production system yet so be careful. I can think of two routes: >>>>> >>>>> >>>> >>>>> >>>>> >>>> - Either remove queues, exchanges etc. while rabbit >>>>> is running, this will most likely impact client IO depending on your load. >>>>> Check out the rabbitmqctl commands. >>>>> >>>>> >>>> - Or stop the rabbitmq cluster, remove the mnesia >>>>> tables from all nodes and restart rabbitmq so the exchanges, queues etc. >>>>> rebuild. >>>>> >>>>> >>>> >>>>> >>>>> >>>> I can imagine that the failed reply "survives" while >>>>> being replicated across the rabbit nodes. 
But I don't really know the >>>>> rabbit internals too well, so maybe someone else can chime in here and give >>>>> a better advice. >>>>> >>>>> >>>> >>>>> >>>>> >>>> Regards, >>>>> >>>>> >>>> Eugen >>>>> >>>>> >>>> >>>>> >>>>> >>>> Zitat von Swogat Pradhan <swogatpradhan22@gmail.com>: >>>>> >>>>> >>>> >>>>> >>>>> >>>> Hi, >>>>> >>>>> >>>> Can someone please help me out on this issue? >>>>> >>>>> >>>> >>>>> >>>>> >>>> With regards, >>>>> >>>>> >>>> Swogat Pradhan >>>>> >>>>> >>>> >>>>> >>>>> >>>> On Thu, Mar 2, 2023 at 1:24 PM Swogat Pradhan < >>>>> swogatpradhan22@gmail.com> >>>>> >>>>> >>>> wrote: >>>>> >>>>> >>>> >>>>> >>>>> >>>> Hi >>>>> >>>>> >>>> I don't see any major packet loss. >>>>> >>>>> >>>> It seems the problem is somewhere in rabbitmq maybe >>>>> but not due to packet >>>>> >>>>> >>>> loss. >>>>> >>>>> >>>> >>>>> >>>>> >>>> with regards, >>>>> >>>>> >>>> Swogat Pradhan >>>>> >>>>> >>>> >>>>> >>>>> >>>> On Wed, Mar 1, 2023 at 3:34 PM Swogat Pradhan < >>>>> swogatpradhan22@gmail.com> >>>>> >>>>> >>>> wrote: >>>>> >>>>> >>>> >>>>> >>>>> >>>> Hi, >>>>> >>>>> >>>> Yes the MTU is the same as the default '1500'. >>>>> >>>>> >>>> Generally I haven't seen any packet loss, but never >>>>> checked when >>>>> >>>>> >>>> launching the instance. >>>>> >>>>> >>>> I will check that and come back. >>>>> >>>>> >>>> But everytime i launch an instance the instance gets >>>>> stuck at spawning >>>>> >>>>> >>>> state and there the hypervisor becomes down, so not >>>>> sure if packet loss >>>>> >>>>> >>>> causes this. >>>>> >>>>> >>>> >>>>> >>>>> >>>> With regards, >>>>> >>>>> >>>> Swogat pradhan >>>>> >>>>> >>>> >>>>> >>>>> >>>> On Wed, Mar 1, 2023 at 3:30 PM Eugen Block < >>>>> eblock@nde.ag> wrote: >>>>> >>>>> >>>> >>>>> >>>>> >>>> One more thing coming to mind is MTU size. Are they >>>>> identical between >>>>> >>>>> >>>> central and edge site? Do you see packet loss through >>>>> the tunnel? >>>>> >>>>> >>>> >>>>> >>>>> >>>> Zitat von Swogat Pradhan <swogatpradhan22@gmail.com>: >>>>> >>>>> >>>> >>>>> >>>>> >>>> > Hi Eugen, >>>>> >>>>> >>>> > Request you to please add my email either on 'to' >>>>> or 'cc' as i am not >>>>> >>>>> >>>> > getting email's from you. >>>>> >>>>> >>>> > Coming to the issue: >>>>> >>>>> >>>> > >>>>> >>>>> >>>> > [root@overcloud-controller-no-ceph-3 /]# >>>>> rabbitmqctl list_policies -p >>>>> >>>>> >>>> / >>>>> >>>>> >>>> > Listing policies for vhost "/" ... >>>>> >>>>> >>>> > vhost name pattern apply-to definition >>>>> priority >>>>> >>>>> >>>> > / ha-all ^(?!amq\.).* queues >>>>> >>>>> >>>> > >>>>> >>>>> >>>> >>>>> {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0 >>>>> >>>>> >>>> > >>>>> >>>>> >>>> > I have the edge site compute nodes up, it only goes >>>>> down when i am >>>>> >>>>> >>>> trying >>>>> >>>>> >>>> > to launch an instance and the instance comes to a >>>>> spawning state and >>>>> >>>>> >>>> then >>>>> >>>>> >>>> > gets stuck. >>>>> >>>>> >>>> > >>>>> >>>>> >>>> > I have a tunnel setup between the central and the >>>>> edge sites. >>>>> >>>>> >>>> > >>>>> >>>>> >>>> > With regards, >>>>> >>>>> >>>> > Swogat Pradhan >>>>> >>>>> >>>> > >>>>> >>>>> >>>> > On Tue, Feb 28, 2023 at 9:11 PM Swogat Pradhan < >>>>> >>>>> >>>> swogatpradhan22@gmail.com> >>>>> >>>>> >>>> > wrote: >>>>> >>>>> >>>> > >>>>> >>>>> >>>> >> Hi Eugen, >>>>> >>>>> >>>> >> For some reason i am not getting your email to me >>>>> directly, i am >>>>> >>>>> >>>> checking >>>>> >>>>> >>>> >> the email digest and there i am able to find your >>>>> reply. 
>>>>> >>>>> >>>> >> Here is the log for download: >>>>> https://we.tl/t-L8FEkGZFSq >>>>> >>>>> >>>> >> Yes, these logs are from the time when the issue >>>>> occurred. >>>>> >>>>> >>>> >> >>>>> >>>>> >>>> >> *Note: i am able to create vm's and perform other >>>>> activities in the >>>>> >>>>> >>>> >> central site, only facing this issue in the edge >>>>> site.* >>>>> >>>>> >>>> >> >>>>> >>>>> >>>> >> With regards, >>>>> >>>>> >>>> >> Swogat Pradhan >>>>> >>>>> >>>> >> >>>>> >>>>> >>>> >> On Mon, Feb 27, 2023 at 5:12 PM Swogat Pradhan < >>>>> >>>>> >>>> swogatpradhan22@gmail.com> >>>>> >>>>> >>>> >> wrote: >>>>> >>>>> >>>> >> >>>>> >>>>> >>>> >>> Hi Eugen, >>>>> >>>>> >>>> >>> Thanks for your response. >>>>> >>>>> >>>> >>> I have actually a 4 controller setup so here are >>>>> the details: >>>>> >>>>> >>>> >>> >>>>> >>>>> >>>> >>> *PCS Status:* >>>>> >>>>> >>>> >>> * Container bundle set: rabbitmq-bundle [ >>>>> >>>>> >>>> >>> >>>>> 172.25.201.68:8787/tripleomaster/openstack-rabbitmq:pcmklatest]: >>>>> >>>>> >>>> >>> * rabbitmq-bundle-0 >>>>> (ocf::heartbeat:rabbitmq-cluster): >>>>> >>>>> >>>> Started >>>>> >>>>> >>>> >>> overcloud-controller-no-ceph-3 >>>>> >>>>> >>>> >>> * rabbitmq-bundle-1 >>>>> (ocf::heartbeat:rabbitmq-cluster): >>>>> >>>>> >>>> Started >>>>> >>>>> >>>> >>> overcloud-controller-2 >>>>> >>>>> >>>> >>> * rabbitmq-bundle-2 >>>>> (ocf::heartbeat:rabbitmq-cluster): >>>>> >>>>> >>>> Started >>>>> >>>>> >>>> >>> overcloud-controller-1 >>>>> >>>>> >>>> >>> * rabbitmq-bundle-3 >>>>> (ocf::heartbeat:rabbitmq-cluster): >>>>> >>>>> >>>> Started >>>>> >>>>> >>>> >>> overcloud-controller-0 >>>>> >>>>> >>>> >>> >>>>> >>>>> >>>> >>> I have tried restarting the bundle multiple times >>>>> but the issue is >>>>> >>>>> >>>> still >>>>> >>>>> >>>> >>> present. >>>>> >>>>> >>>> >>> >>>>> >>>>> >>>> >>> *Cluster status:* >>>>> >>>>> >>>> >>> [root@overcloud-controller-0 /]# rabbitmqctl >>>>> cluster_status >>>>> >>>>> >>>> >>> Cluster status of node >>>>> >>>>> >>>> >>> >>>>> rabbit@overcloud-controller-0.internalapi.bdxworld.com ... 
>>>>> >>>>> >>>> >>> Basics >>>>> >>>>> >>>> >>> >>>>> >>>>> >>>> >>> Cluster name: >>>>> rabbit@overcloud-controller-no-ceph-3.bdxworld.com >>>>> >>>>> >>>> >>> >>>>> >>>>> >>>> >>> Disk Nodes >>>>> >>>>> >>>> >>> >>>>> >>>>> >>>> >>> >>>>> rabbit@overcloud-controller-0.internalapi.bdxworld.com >>>>> >>>>> >>>> >>> >>>>> rabbit@overcloud-controller-1.internalapi.bdxworld.com >>>>> >>>>> >>>> >>> >>>>> rabbit@overcloud-controller-2.internalapi.bdxworld.com >>>>> >>>>> >>>> >>> >>>>> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >>>>> >>>>> >>>> >>> >>>>> >>>>> >>>> >>> Running Nodes >>>>> >>>>> >>>> >>> >>>>> >>>>> >>>> >>> >>>>> rabbit@overcloud-controller-0.internalapi.bdxworld.com >>>>> >>>>> >>>> >>> >>>>> rabbit@overcloud-controller-1.internalapi.bdxworld.com >>>>> >>>>> >>>> >>> >>>>> rabbit@overcloud-controller-2.internalapi.bdxworld.com >>>>> >>>>> >>>> >>> >>>>> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >>>>> >>>>> >>>> >>> >>>>> >>>>> >>>> >>> Versions >>>>> >>>>> >>>> >>> >>>>> >>>>> >>>> >>> >>>>> rabbit@overcloud-controller-0.internalapi.bdxworld.com: RabbitMQ >>>>> >>>>> >>>> 3.8.3 >>>>> >>>>> >>>> >>> on Erlang 22.3.4.1 >>>>> >>>>> >>>> >>> >>>>> rabbit@overcloud-controller-1.internalapi.bdxworld.com: RabbitMQ >>>>> >>>>> >>>> 3.8.3 >>>>> >>>>> >>>> >>> on Erlang 22.3.4.1 >>>>> >>>>> >>>> >>> >>>>> rabbit@overcloud-controller-2.internalapi.bdxworld.com: RabbitMQ >>>>> >>>>> >>>> 3.8.3 >>>>> >>>>> >>>> >>> on Erlang 22.3.4.1 >>>>> >>>>> >>>> >>> >>>>> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com: >>>>> >>>>> >>>> RabbitMQ >>>>> >>>>> >>>> >>> 3.8.3 on Erlang 22.3.4.1 >>>>> >>>>> >>>> >>> >>>>> >>>>> >>>> >>> Alarms >>>>> >>>>> >>>> >>> >>>>> >>>>> >>>> >>> (none) >>>>> >>>>> >>>> >>> >>>>> >>>>> >>>> >>> Network Partitions >>>>> >>>>> >>>> >>> >>>>> >>>>> >>>> >>> (none) >>>>> >>>>> >>>> >>> >>>>> >>>>> >>>> >>> Listeners >>>>> >>>>> >>>> >>> >>>>> >>>>> >>>> >>> Node: >>>>> rabbit@overcloud-controller-0.internalapi.bdxworld.com, >>>>> >>>>> >>>> interface: >>>>> >>>>> >>>> >>> [::], port: 25672, protocol: clustering, purpose: >>>>> inter-node and CLI >>>>> >>>>> >>>> tool >>>>> >>>>> >>>> >>> communication >>>>> >>>>> >>>> >>> Node: >>>>> rabbit@overcloud-controller-0.internalapi.bdxworld.com, >>>>> >>>>> >>>> interface: >>>>> >>>>> >>>> >>> 172.25.201.212, port: 5672, protocol: amqp, >>>>> purpose: AMQP 0-9-1 >>>>> >>>>> >>>> >>> and AMQP 1.0 >>>>> >>>>> >>>> >>> Node: >>>>> rabbit@overcloud-controller-0.internalapi.bdxworld.com, >>>>> >>>>> >>>> interface: >>>>> >>>>> >>>> >>> [::], port: 15672, protocol: http, purpose: HTTP >>>>> API >>>>> >>>>> >>>> >>> Node: >>>>> rabbit@overcloud-controller-1.internalapi.bdxworld.com, >>>>> >>>>> >>>> interface: >>>>> >>>>> >>>> >>> [::], port: 25672, protocol: clustering, purpose: >>>>> inter-node and CLI >>>>> >>>>> >>>> tool >>>>> >>>>> >>>> >>> communication >>>>> >>>>> >>>> >>> Node: >>>>> rabbit@overcloud-controller-1.internalapi.bdxworld.com, >>>>> >>>>> >>>> interface: >>>>> >>>>> >>>> >>> 172.25.201.205, port: 5672, protocol: amqp, >>>>> purpose: AMQP 0-9-1 >>>>> >>>>> >>>> >>> and AMQP 1.0 >>>>> >>>>> >>>> >>> Node: >>>>> rabbit@overcloud-controller-1.internalapi.bdxworld.com, >>>>> >>>>> >>>> interface: >>>>> >>>>> >>>> >>> [::], port: 15672, protocol: http, purpose: HTTP >>>>> API >>>>> >>>>> >>>> >>> Node: >>>>> rabbit@overcloud-controller-2.internalapi.bdxworld.com, >>>>> >>>>> >>>> interface: >>>>> >>>>> >>>> >>> [::], port: 25672, protocol: clustering, purpose: 
>>>>> inter-node and CLI >>>>> >>>>> >>>> tool >>>>> >>>>> >>>> >>> communication >>>>> >>>>> >>>> >>> Node: >>>>> rabbit@overcloud-controller-2.internalapi.bdxworld.com, >>>>> >>>>> >>>> interface: >>>>> >>>>> >>>> >>> 172.25.201.201, port: 5672, protocol: amqp, >>>>> purpose: AMQP 0-9-1 >>>>> >>>>> >>>> >>> and AMQP 1.0 >>>>> >>>>> >>>> >>> Node: >>>>> rabbit@overcloud-controller-2.internalapi.bdxworld.com, >>>>> >>>>> >>>> interface: >>>>> >>>>> >>>> >>> [::], port: 15672, protocol: http, purpose: HTTP >>>>> API >>>>> >>>>> >>>> >>> Node: >>>>> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >>>>> >>>>> >>>> , >>>>> >>>>> >>>> >>> interface: [::], port: 25672, protocol: >>>>> clustering, purpose: >>>>> >>>>> >>>> inter-node and >>>>> >>>>> >>>> >>> CLI tool communication >>>>> >>>>> >>>> >>> Node: >>>>> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >>>>> >>>>> >>>> , >>>>> >>>>> >>>> >>> interface: 172.25.201.209, port: 5672, protocol: >>>>> amqp, purpose: AMQP >>>>> >>>>> >>>> 0-9-1 >>>>> >>>>> >>>> >>> and AMQP 1.0 >>>>> >>>>> >>>> >>> Node: >>>>> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >>>>> >>>>> >>>> , >>>>> >>>>> >>>> >>> interface: [::], port: 15672, protocol: http, >>>>> purpose: HTTP API >>>>> >>>>> >>>> >>> >>>>> >>>>> >>>> >>> Feature flags >>>>> >>>>> >>>> >>> >>>>> >>>>> >>>> >>> Flag: drop_unroutable_metric, state: enabled >>>>> >>>>> >>>> >>> Flag: empty_basic_get_metric, state: enabled >>>>> >>>>> >>>> >>> Flag: implicit_default_bindings, state: enabled >>>>> >>>>> >>>> >>> Flag: quorum_queue, state: enabled >>>>> >>>>> >>>> >>> Flag: virtual_host_metadata, state: enabled >>>>> >>>>> >>>> >>> >>>>> >>>>> >>>> >>> *Logs:* >>>>> >>>>> >>>> >>> *(Attached)* >>>>> >>>>> >>>> >>> >>>>> >>>>> >>>> >>> With regards, >>>>> >>>>> >>>> >>> Swogat Pradhan >>>>> >>>>> >>>> >>> >>>>> >>>>> >>>> >>> On Sun, Feb 26, 2023 at 2:34 PM Swogat Pradhan < >>>>> >>>>> >>>> swogatpradhan22@gmail.com> >>>>> >>>>> >>>> >>> wrote: >>>>> >>>>> >>>> >>> >>>>> >>>>> >>>> >>>> Hi, >>>>> >>>>> >>>> >>>> Please find the nova conductor as well as nova >>>>> api log. 
Hi, Can someone please help me identify the issue here? Latest cinder-volume logs from dcn02: (ATTACHED) The volume is stuck in creating state. With regards, Swogat Pradhan On Thu, Mar 23, 2023 at 6:12 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi John, Thank you for clarifying that. Right now the cinder volume is stuck in *creating* state when adding an image as the volume source. But when creating an empty volume, the volumes are created successfully without any errors.
We are getting volume creation request in cinder-volume.log as such: 2023-03-23 12:34:40.152 108 INFO cinder.volume.flows.manager.create_volume [req-18556796-a61c-4097-8fa8-b136ce9814f7 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Volume 872a2ae6-c75b-4fc0-8172-17a29d07a66c: being created as image with specification: {'status': 'creating', 'volume_name': 'volume-872a2ae6-c75b-4fc0-8172-17a29d07a66c', 'volume_size': 1, 'image_id': '131ed4e0-0474-45be-b74a-43b599a7d6c5', 'image_location': ('rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/snap', [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/snap', 'metadata': {'store': 'dcn02'}}]), 'image_meta': {'name': 'cirros', 'disk_format': 'qcow2', 'container_format': 'bare', 'visibility': 'public', 'size': 16338944, 'virtual_size': 117440512, 'status': 'active', 'checksum': '1d3062cd89af34e419f7100277f38b2b', 'protected': False, 'min_ram': 0, 'min_disk': 0, 'owner': '4160ce999a31485fa643aed0936dfef0', 'os_hidden': False, 'os_hash_algo': 'sha512', 'os_hash_value': '553d220ed58cfee7dafe003c446a9f197ab5edf8ffc09396c74187cf83873c877e7ae041cb80f3b91489acf687183adcd689b53b38e3ddd22e627e7f98a09c46', 'id': '131ed4e0-0474-45be-b74a-43b599a7d6c5', 'created_at': datetime.datetime(2023, 3, 23, 11, 41, 51, tzinfo=datetime.timezone.utc), 'updated_at': datetime.datetime(2023, 3, 23, 11, 46, 37, tzinfo=datetime.timezone.utc), 'locations': [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/snap', 'metadata': {'store': 'dcn02'}}], 'direct_url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/snap', 'tags': [], 'file': '/v2/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/file', 'stores': 'ceph,dcn02', 'properties': {'os_glance_failed_import': '', 'os_glance_importing_to_stores': '', 'owner_specified.openstack.md5': '', 'owner_specified.openstack.object': 'images/cirros', 'owner_specified.openstack.sha256': ''}}, 'image_service': <cinder.image.glance.GlanceImageService object at 0x7f98d869ed68>}
But there is nothing else after that and the volume doesn't even time out; it just stays stuck in the creating state. Can you advise what might be the issue here? All the containers are in a healthy state now.
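For what it's worth, before retrying I will also double-check that the dcn02 cinder-volume backend is reported as up, since a down backend would also leave volumes stuck in creating. Something along these lines (the volume ID is just a placeholder):

$ openstack volume service list
# the cinder-volume entry for the dcn02 backend/AZ should show state "up"
$ openstack volume show <volume-id> -c status -c os-vol-host-attr:host
# shows which backend host the stuck volume was scheduled to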
With regards, Swogat Pradhan
On Thu, Mar 23, 2023 at 6:06 PM Alan Bishop <abishop@redhat.com> wrote:
On Thu, Mar 23, 2023 at 5:20 AM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi, Is this bind not required for the cinder_scheduler container?
"/var/lib/tripleo-config/ceph:/var/lib/kolla/config_files/src-ceph:ro,rprivate,rbind", I do not see this particular bind on the cinder scheduler containers on my controller nodes.
That is correct, because the scheduler does not access the ceph cluster.
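The cinder_volume container is the one that needs the ceph config. If you want to confirm it actually got the bind, something like this on the node running cinder-volume should show it (adjust the container name if it differs in your deployment):

$ sudo podman inspect cinder_volume | grep src-ceph
$ sudo podman exec cinder_volume ls /etc/ceph
# the second command should list the dcn02 conf and keyring that cinder's rbd_ceph_conf points at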
Alan
With regards, Swogat Pradhan
On Thu, Mar 23, 2023 at 2:46 AM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
Cinder volume config:
[tripleo_ceph]
volume_backend_name=tripleo_ceph
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_user=openstack
rbd_pool=volumes
rbd_flatten_volume_from_snapshot=False
rbd_secret_uuid=a8d5f1f5-48e7-5ede-89ab-8aca59b6397b
report_discard_supported=True
rbd_ceph_conf=/etc/ceph/dcn02.conf
rbd_cluster_name=dcn02
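(For reference, the rbd_secret_uuid above is the fsid of the dcn02 cluster. Following John's earlier note it can be cross-checked on the HCI node roughly like this, adjusting container names if they differ:

$ sudo podman exec cinder_volume grep fsid /etc/ceph/dcn02.conf
$ sudo podman exec nova_virtsecretd virsh secret-get-value a8d5f1f5-48e7-5ede-89ab-8aca59b6397b
# the second command should print the cephx key libvirt holds for this cluster's secret)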
Glance api config:
[dcn02]
rbd_store_ceph_conf=/etc/ceph/dcn02.conf
rbd_store_user=openstack
rbd_store_pool=images
rbd_thin_provisioning=False
store_description=dcn02 rbd glance store

[ceph]
rbd_store_ceph_conf=/etc/ceph/ceph.conf
rbd_store_user=openstack
rbd_store_pool=images
rbd_thin_provisioning=False
store_description=Default glance store backend.
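(And per Alan's earlier suggestion, the glance endpoint cinder uses at the edge can be confirmed with something like:

$ sudo podman exec cinder_volume grep glance_api_servers /etc/cinder/cinder.conf
# this should point at the dcn02 glance service, not the central one)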
On Thu, Mar 23, 2023 at 2:29 AM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
I still have the same issue and I'm not sure what's left to try. All the pods are now in a healthy state. When I try to create a volume from an image, log entries only appear in cinder-volume about 3 minutes after I hit the create volume button, and the volumes are just stuck in the creating state for more than 20 minutes now.
Cinder logs: 2023-03-22 20:32:44.010 108 INFO cinder.rpc [req-0d2093a0-efbd-45a5-bd7d-cce25ddc200e b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Automatically selected cinder-volume RPC version 3.17 as minimum service version. 2023-03-22 20:34:59.166 108 INFO cinder.volume.flows.manager.create_volume [req-0d2093a0-efbd-45a5-bd7d-cce25ddc200e b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Volume 5743a879-090d-46db-bc7c-1c0b0669a112: being created as image with specification: {'status': 'creating', 'volume_name': 'volume-5743a879-090d-46db-bc7c-1c0b0669a112', 'volume_size': 2, 'image_id': 'acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b', 'image_location': ('rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'metadata': {'store': 'dcn02'}}]), 'image_meta': {'name': 'cirros', 'disk_format': 'qcow2', 'container_format': 'bare', 'visibility': 'public', 'size': 16338944, 'virtual_size': 117440512, 'status': 'active', 'checksum': '1d3062cd89af34e419f7100277f38b2b', 'protected': False, 'min_ram': 0, 'min_disk': 0, 'owner': '4160ce999a31485fa643aed0936dfef0', 'os_hidden': False, 'os_hash_algo': 'sha512', 'os_hash_value': '553d220ed58cfee7dafe003c446a9f197ab5edf8ffc09396c74187cf83873c877e7ae041cb80f3b91489acf687183adcd689b53b38e3ddd22e627e7f98a09c46', 'id': 'acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b', 'created_at': datetime.datetime(2023, 3, 22, 18, 50, 5, tzinfo=datetime.timezone.utc), 'updated_at': datetime.datetime(2023, 3, 22, 20, 3, 54, tzinfo=datetime.timezone.utc), 'locations': [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'metadata': {'store': 'dcn02'}}], 'direct_url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'tags': [], 'file': '/v2/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/file', 'stores': 'ceph,dcn02', 'properties': {'os_glance_failed_import': '', 'os_glance_importing_to_stores': '', 'owner_specified.openstack.md5': '', 'owner_specified.openstack.object': 'images/cirros', 'owner_specified.openstack.sha256': ''}}, 'image_service': <cinder.image.glance.GlanceImageService object at 0x7f8147973438>}
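I will also check directly on the dcn02 ceph cluster whether the stuck volume ever appears in the volumes pool and whether it is a COW clone of the local image, along the lines of what John suggested earlier (run on a dcn02 ceph node; a sketch, adjust for your cephadm setup):

$ sudo cephadm shell
$ rbd -p volumes ls -l
# an image-backed volume created the fast way should show a PARENT like images/<image-id>@snap from this same (dcn02) cluster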
With regards, Swogat Pradhan
On Wed, Mar 22, 2023 at 9:19 PM Alan Bishop <abishop@redhat.com> wrote:
On Wed, Mar 22, 2023 at 8:38 AM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
> Hi Adam, > The systems are in same LAN, in this case it seemed like the image > was getting pulled from the central site which was caused due to an > misconfiguration in ceph.conf file in /var/lib/tripleo-config/ceph/ > directory, which seems to have been resolved after the changes i made to > fix it. > > Right now the glance api podman is running in unhealthy state and > the podman logs don't show any error whatsoever and when issued the command > netstat -nultp i do not see any entry for glance port i.e. 9292 in the dcn > site, which is why cinder is throwing an error stating: > > 2023-03-22 13:32:29.786 108 ERROR oslo_messaging.rpc.server > cinder.exception.GlanceConnectionFailed: Connection to glance failed: Error > finding address for > http://172.25.228.253:9292/v2/images/736d8779-07cd-4510-bab2-adcb653cc538: > Unable to establish connection to > http://172.25.228.253:9292/v2/images/736d8779-07cd-4510-bab2-adcb653cc538: > HTTPConnectionPool(host='172.25.228.253', port=9292): Max retries exceeded > with url: /v2/images/736d8779-07cd-4510-bab2-adcb653cc538 (Caused by > NewConnectionError('<urllib3.connection.HTTPConnection object at > 0x7f7682d2cd30>: Failed to establish a new connection: [Errno 111] > ECONNREFUSED',)) > > Now i need to find out why the port is not listed as the glance > service is running, which i am not sure how to find out. >
One other thing to investigate is whether your deployment includes this patch [1]. If it does, then bear in mind the glance-api service running at the edge site will be an "internal" (non public facing) instance that uses port 9293 instead of 9292. You should familiarize yourself with the release note [2].
[1] https://opendev.org/openstack/tripleo-heat-templates/commit/3605d45e417a77a1... [2] https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/walla...
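If you still need to confirm what the edge glance is actually listening on, a quick generic check on the DCN node running glance-api would be something like this (the IP is a placeholder for your edge glance endpoint):

$ sudo ss -ntlp | grep -E ':(9292|9293)'
$ curl -s http://<edge-glance-ip>:9293/versions
# with the internal-only edge glance, 9293 should answer with the API version list while 9292 is not bound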
Alan
> With regards,
> Swogat Pradhan
>
> On Wed, Mar 22, 2023 at 8:11 PM Alan Bishop <abishop@redhat.com> wrote:
>>
>> As Adam Savage would say, well there's your problem ^^ (Image download 15.58 MB at 0.16 MB/s). Downloading the image takes too long, and 0.16 MB/s suggests you have a network issue.
>>
>> John Fulton previously stated your cinder-volume service at the edge site is not using the local ceph image store. Assuming you are deploying GlanceApiEdge service [1], then the cinder-volume service should be configured to use the local glance service [2]. You should check cinder's glance_api_servers to confirm it's the edge site's glance service.
>>
>> [1] https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/envi...
>> [2] https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/depl...
>>
>> Alan
>>
>> [...]
On Thu, Mar 23, 2023 at 9:01 AM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi, Can someone please help me identify the issue here? Latest cinder-volume logs from dcn02: (ATTACHED)
It's really not possible to analyze what's happening with just one or two log entries. Do you have debug logs enabled? One thing I noticed is the glance image's disk_format is qcow2. You should use "raw" images with ceph RBD. Alan
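If debug isn't enabled yet, one way to turn it on just for cinder-volume at the edge is roughly this (paths and the container name assume a typical TripleO wallaby layout; adjust as needed, or edit the file by hand if crudini isn't available):

$ sudo crudini --set /var/lib/config-data/puppet-generated/cinder/etc/cinder/cinder.conf DEFAULT debug True
$ sudo podman restart cinder_volume
$ sudo tail -f /var/log/containers/cinder/cinder-volume.log

And on the image format, converting the cirros image to raw before uploading it avoids the qcow2 handling on every volume create. A rough example (file and image names are placeholders):

$ qemu-img convert -f qcow2 -O raw cirros-0.5.2-x86_64-disk.img cirros-raw.img
$ openstack image create --disk-format raw --container-format bare --file cirros-raw.img cirros-raw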
The volume is stuck in creating state.
With regards, Swogat Pradhan
On Thu, Mar 23, 2023 at 6:12 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi John, Thank you for clarifying that. Right now the cinder volume is stuck in the *creating* state when an image is used as the volume source. But when creating an empty volume, the volumes are created successfully without any errors.
We are getting volume creation request in cinder-volume.log as such: 2023-03-23 12:34:40.152 108 INFO cinder.volume.flows.manager.create_volume [req-18556796-a61c-4097-8fa8-b136ce9814f7 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Volume 872a2ae6-c75b-4fc0-8172-17a29d07a66c: being created as image with specification: {'status': 'creating', 'volume_name': 'volume-872a2ae6-c75b-4fc0-8172-17a29d07a66c', 'volume_size': 1, 'image_id': '131ed4e0-0474-45be-b74a-43b599a7d6c5', 'image_location': ('rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/snap', [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/snap', 'metadata': {'store': 'dcn02'}}]), 'image_meta': {'name': 'cirros', 'disk_format': 'qcow2', 'container_format': 'bare', 'visibility': 'public', 'size': 16338944, 'virtual_size': 117440512, 'status': 'active', 'checksum': '1d3062cd89af34e419f7100277f38b2b', 'protected': False, 'min_ram': 0, 'min_disk': 0, 'owner': '4160ce999a31485fa643aed0936dfef0', 'os_hidden': False, 'os_hash_algo': 'sha512', 'os_hash_value': '553d220ed58cfee7dafe003c446a9f197ab5edf8ffc09396c74187cf83873c877e7ae041cb80f3b91489acf687183adcd689b53b38e3ddd22e627e7f98a09c46', 'id': '131ed4e0-0474-45be-b74a-43b599a7d6c5', 'created_at': datetime.datetime(2023, 3, 23, 11, 41, 51, tzinfo=datetime.timezone.utc), 'updated_at': datetime.datetime(2023, 3, 23, 11, 46, 37, tzinfo=datetime.timezone.utc), 'locations': [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/snap', 'metadata': {'store': 'dcn02'}}], 'direct_url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/snap', 'tags': [], 'file': '/v2/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/file', 'stores': 'ceph,dcn02', 'properties': {'os_glance_failed_import': '', 'os_glance_importing_to_stores': '', 'owner_specified.openstack.md5': '', 'owner_specified.openstack.object': 'images/cirros', 'owner_specified.openstack.sha256': ''}}, 'image_service': <cinder.image.glance.GlanceImageService object at 0x7f98d869ed68>}
But there is nothing else after that, and the volume doesn't even time out; it just stays stuck in the creating state. Can you advise what might be the issue here? All the containers are in a healthy state now.
With regards, Swogat Pradhan
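Since Alan asked earlier whether debug logs are enabled: one hedged way to get them temporarily is to set debug = True in the [DEFAULT] section of the cinder.conf used by the cinder_volume container (on TripleO nodes this is typically under /var/lib/config-data/puppet-generated/cinder/etc/cinder/cinder.conf) and restart the container; setting the Debug: true deployment parameter and redeploying is the persistent route.

[DEFAULT]
debug = True

$ sudo podman restart cinder_volume    # pick up the new logging level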
On Thu, Mar 23, 2023 at 6:06 PM Alan Bishop <abishop@redhat.com> wrote:
On Thu, Mar 23, 2023 at 5:20 AM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
Hi, Is this bind not required for the cinder_scheduler container?
"/var/lib/tripleo-config/ceph:/var/lib/kolla/config_files/src-ceph:ro,rprivate,rbind", I do not see this particular bind on the cinder scheduler containers on my controller nodes.
That is correct, because the scheduler does not access the ceph cluster.
Alan
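For what it's worth, a quick hedged way to compare the bind mounts on the two containers on the nodes where they run (container names follow the usual TripleO naming):

$ sudo podman inspect cinder_volume | grep src-ceph       # should show the ceph config bind
$ sudo podman inspect cinder_scheduler | grep src-ceph    # expected to show nothing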
With regards, Swogat Pradhan
On Thu, Mar 23, 2023 at 2:46 AM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
Cinder volume config:
[tripleo_ceph]
volume_backend_name=tripleo_ceph
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_user=openstack
rbd_pool=volumes
rbd_flatten_volume_from_snapshot=False
rbd_secret_uuid=a8d5f1f5-48e7-5ede-89ab-8aca59b6397b
report_discard_supported=True
rbd_ceph_conf=/etc/ceph/dcn02.conf
rbd_cluster_name=dcn02
Glance api config:
[dcn02]
rbd_store_ceph_conf=/etc/ceph/dcn02.conf
rbd_store_user=openstack
rbd_store_pool=images
rbd_thin_provisioning=False
store_description=dcn02 rbd glance store

[ceph]
rbd_store_ceph_conf=/etc/ceph/ceph.conf
rbd_store_user=openstack
rbd_store_pool=images
rbd_thin_provisioning=False
store_description=Default glance store backend.
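Two hedged sanity checks that follow from the configs above. First, per John's note quoted later in this thread, rbd_secret_uuid is expected to match the FSID of the local dcn02 cluster, and libvirt should hold the corresponding cephx secret; paths and container names assume the usual TripleO layout:

$ sudo podman exec cinder_volume grep fsid /etc/ceph/dcn02.conf                                    # should print a8d5f1f5-48e7-5ede-89ab-8aca59b6397b
$ sudo podman exec nova_virtsecretd virsh secret-get-value a8d5f1f5-48e7-5ede-89ab-8aca59b6397b    # should print the cephx key

Second, the DCN workflow is to copy images from the central store into the dcn02 store before booting or creating volumes from them. A sketch with the glance client, using the cirros image ID from the log above (which already lists both stores); exact flags may vary by glanceclient version:

$ glance image-import 131ed4e0-0474-45be-b74a-43b599a7d6c5 --import-method copy-image --stores dcn02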
On Thu, Mar 23, 2023 at 2:29 AM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
I still have the same issue and I'm not sure what's left to try. All the pods are now in a healthy state. When I try to create a volume from an image, log entries only show up in cinder-volume about 3 minutes after I hit the create volume button, and the volumes are just stuck in the creating state for more than 20 minutes now.
Cinder logs: 2023-03-22 20:32:44.010 108 INFO cinder.rpc [req-0d2093a0-efbd-45a5-bd7d-cce25ddc200e b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Automatically selected cinder-volume RPC version 3.17 as minimum service version. 2023-03-22 20:34:59.166 108 INFO cinder.volume.flows.manager.create_volume [req-0d2093a0-efbd-45a5-bd7d-cce25ddc200e b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Volume 5743a879-090d-46db-bc7c-1c0b0669a112: being created as image with specification: {'status': 'creating', 'volume_name': 'volume-5743a879-090d-46db-bc7c-1c0b0669a112', 'volume_size': 2, 'image_id': 'acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b', 'image_location': ('rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'metadata': {'store': 'dcn02'}}]), 'image_meta': {'name': 'cirros', 'disk_format': 'qcow2', 'container_format': 'bare', 'visibility': 'public', 'size': 16338944, 'virtual_size': 117440512, 'status': 'active', 'checksum': '1d3062cd89af34e419f7100277f38b2b', 'protected': False, 'min_ram': 0, 'min_disk': 0, 'owner': '4160ce999a31485fa643aed0936dfef0', 'os_hidden': False, 'os_hash_algo': 'sha512', 'os_hash_value': '553d220ed58cfee7dafe003c446a9f197ab5edf8ffc09396c74187cf83873c877e7ae041cb80f3b91489acf687183adcd689b53b38e3ddd22e627e7f98a09c46', 'id': 'acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b', 'created_at': datetime.datetime(2023, 3, 22, 18, 50, 5, tzinfo=datetime.timezone.utc), 'updated_at': datetime.datetime(2023, 3, 22, 20, 3, 54, tzinfo=datetime.timezone.utc), 'locations': [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'metadata': {'store': 'dcn02'}}], 'direct_url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', 'tags': [], 'file': '/v2/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/file', 'stores': 'ceph,dcn02', 'properties': {'os_glance_failed_import': '', 'os_glance_importing_to_stores': '', 'owner_specified.openstack.md5': '', 'owner_specified.openstack.object': 'images/cirros', 'owner_specified.openstack.sha256': ''}}, 'image_service': <cinder.image.glance.GlanceImageService object at 0x7f8147973438>}
With regards, Swogat Pradhan
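Also, following John's suggestion quoted further down in this thread, one hedged way to see whether a volume is actually being cloned from the local dcn02 image store (rather than streamed from the central site) is to check the volume's parent on the dcn02 cluster; the admin keyring path here is an assumption patterned on John's dcn0 example:

$ sudo cephadm shell --config /etc/ceph/dcn02.conf --keyring /etc/ceph/dcn02.client.admin.keyring
$ rbd --cluster dcn02 -p volumes ls -l    # a locally cloned volume lists images/<image-id>@snap as its PARENT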
On Wed, Mar 22, 2023 at 9:19 PM Alan Bishop <abishop@redhat.com> wrote:
> > > On Wed, Mar 22, 2023 at 8:38 AM Swogat Pradhan < > swogatpradhan22@gmail.com> wrote: > >> Hi Adam, >> The systems are in same LAN, in this case it seemed like the image >> was getting pulled from the central site which was caused due to an >> misconfiguration in ceph.conf file in /var/lib/tripleo-config/ceph/ >> directory, which seems to have been resolved after the changes i made to >> fix it. >> >> Right now the glance api podman is running in unhealthy state and >> the podman logs don't show any error whatsoever and when issued the command >> netstat -nultp i do not see any entry for glance port i.e. 9292 in the dcn >> site, which is why cinder is throwing an error stating: >> >> 2023-03-22 13:32:29.786 108 ERROR oslo_messaging.rpc.server >> cinder.exception.GlanceConnectionFailed: Connection to glance failed: Error >> finding address for >> http://172.25.228.253:9292/v2/images/736d8779-07cd-4510-bab2-adcb653cc538: >> Unable to establish connection to >> http://172.25.228.253:9292/v2/images/736d8779-07cd-4510-bab2-adcb653cc538: >> HTTPConnectionPool(host='172.25.228.253', port=9292): Max retries exceeded >> with url: /v2/images/736d8779-07cd-4510-bab2-adcb653cc538 (Caused by >> NewConnectionError('<urllib3.connection.HTTPConnection object at >> 0x7f7682d2cd30>: Failed to establish a new connection: [Errno 111] >> ECONNREFUSED',)) >> >> Now i need to find out why the port is not listed as the glance >> service is running, which i am not sure how to find out. >> > > One other thing to investigate is whether your deployment includes > this patch [1]. If it does, then bear in mind > the glance-api service running at the edge site will be an > "internal" (non public facing) instance that uses port 9293 > instead of 9292. You should familiarize yourself with the release > note [2]. > > [1] > https://opendev.org/openstack/tripleo-heat-templates/commit/3605d45e417a77a1... > [2] > https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/walla... 
> > Alan > > >> With regards, >> Swogat Pradhan >> >> On Wed, Mar 22, 2023 at 8:11 PM Alan Bishop <abishop@redhat.com> >> wrote: >> >>> >>> >>> On Wed, Mar 22, 2023 at 6:37 AM Swogat Pradhan < >>> swogatpradhan22@gmail.com> wrote: >>> >>>> Update: >>>> Here is the log when creating a volume using cirros image: >>>> >>>> 2023-03-22 11:04:38.449 109 INFO >>>> cinder.volume.flows.manager.create_volume >>>> [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db >>>> 4160ce999a31485fa643aed0936dfef0 - - -] Volume >>>> bf341343-6609-4b8c-b9e0-93e2a89c8c8f: being created as image with >>>> specification: {'status': 'creating', 'volume_name': >>>> 'volume-bf341343-6609-4b8c-b9e0-93e2a89c8c8f', 'volume_size': 4, >>>> 'image_id': '736d8779-07cd-4510-bab2-adcb653cc538', 'image_location': >>>> ('rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', >>>> [{'url': >>>> 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', >>>> 'metadata': {'store': 'ceph'}}, {'url': >>>> 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', >>>> 'metadata': {'store': 'dcn02'}}]), 'image_meta': {'name': 'cirros', >>>> 'disk_format': 'qcow2', 'container_format': 'bare', 'visibility': 'public', >>>> 'size': 16338944, 'virtual_size': 117440512, 'status': 'active', >>>> 'checksum': '1d3062cd89af34e419f7100277f38b2b', 'protected': False, >>>> 'min_ram': 0, 'min_disk': 0, 'owner': '4160ce999a31485fa643aed0936dfef0', >>>> 'os_hidden': False, 'os_hash_algo': 'sha512', 'os_hash_value': >>>> '553d220ed58cfee7dafe003c446a9f197ab5edf8ffc09396c74187cf83873c877e7ae041cb80f3b91489acf687183adcd689b53b38e3ddd22e627e7f98a09c46', >>>> 'id': '736d8779-07cd-4510-bab2-adcb653cc538', 'created_at': >>>> datetime.datetime(2023, 3, 22, 10, 44, 12, tzinfo=datetime.timezone.utc), >>>> 'updated_at': datetime.datetime(2023, 3, 22, 10, 54, 1, >>>> tzinfo=datetime.timezone.utc), 'locations': [{'url': >>>> 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', >>>> 'metadata': {'store': 'ceph'}}, {'url': >>>> 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', >>>> 'metadata': {'store': 'dcn02'}}], 'direct_url': >>>> 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', >>>> 'tags': [], 'file': '/v2/images/736d8779-07cd-4510-bab2-adcb653cc538/file', >>>> 'stores': 'ceph,dcn02', 'properties': {'os_glance_failed_import': '', >>>> 'os_glance_importing_to_stores': '', 'owner_specified.openstack.md5': '', >>>> 'owner_specified.openstack.object': 'images/cirros', >>>> 'owner_specified.openstack.sha256': ''}}, 'image_service': >>>> <cinder.image.glance.GlanceImageService object at 0x7f449ded1198>} >>>> 2023-03-22 11:06:16.570 109 INFO cinder.image.image_utils >>>> [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db >>>> 4160ce999a31485fa643aed0936dfef0 - - -] Image download 15.58 MB at 0.16 MB/s >>>> >>> >>> As Adam Savage would say, well there's your problem ^^ (Image >>> download 15.58 MB at 0.16 MB/s). Downloading the image takes too long, and >>> 0.16 MB/s suggests you have a network issue. >>> >>> John Fulton previously stated your cinder-volume service at the >>> edge site is not using the local ceph image store. Assuming you are >>> deploying GlanceApiEdge service [1], then the cinder-volume service should >>> be configured to use the local glance service [2]. 
You should check >>> cinder's glance_api_servers to confirm it's the edge site's glance service. >>> >>> [1] >>> https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/envi... >>> [2] >>> https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/depl... >>> >>> Alan >>> >>> >>>> 2023-03-22 11:07:54.023 109 WARNING py.warnings >>>> [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db >>>> 4160ce999a31485fa643aed0936dfef0 - - -] >>>> /usr/lib/python3.6/site-packages/oslo_utils/imageutils.py:75: >>>> FutureWarning: The human format is deprecated and the format parameter will >>>> be removed. Use explicitly json instead in version 'xena' >>>> category=FutureWarning) >>>> >>>> 2023-03-22 11:11:12.161 109 WARNING py.warnings >>>> [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db >>>> 4160ce999a31485fa643aed0936dfef0 - - -] >>>> /usr/lib/python3.6/site-packages/oslo_utils/imageutils.py:75: >>>> FutureWarning: The human format is deprecated and the format parameter will >>>> be removed. Use explicitly json instead in version 'xena' >>>> category=FutureWarning) >>>> >>>> 2023-03-22 11:11:12.163 109 INFO cinder.image.image_utils >>>> [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db >>>> 4160ce999a31485fa643aed0936dfef0 - - -] Converted 112.00 MB image at 112.00 >>>> MB/s >>>> 2023-03-22 11:11:14.998 109 INFO >>>> cinder.volume.flows.manager.create_volume >>>> [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db >>>> 4160ce999a31485fa643aed0936dfef0 - - -] Volume >>>> volume-bf341343-6609-4b8c-b9e0-93e2a89c8c8f >>>> (bf341343-6609-4b8c-b9e0-93e2a89c8c8f): created successfully >>>> 2023-03-22 11:11:15.195 109 INFO cinder.volume.manager >>>> [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db >>>> 4160ce999a31485fa643aed0936dfef0 - - -] Created volume successfully. >>>> >>>> The image is present in dcn02 store but still it downloaded the >>>> image in 0.16 MB/s and then created the volume. >>>> >>>> With regards, >>>> Swogat Pradhan >>>> >>>> On Tue, Mar 21, 2023 at 6:10 PM Swogat Pradhan < >>>> swogatpradhan22@gmail.com> wrote: >>>> >>>>> Hi Jhon, >>>>> This seems to be an issue. >>>>> When i deployed the dcn ceph in both dcn01 and dcn02 the >>>>> --cluster parameter was specified to the respective cluster names but the >>>>> config files were created in the name of ceph.conf and keyring was >>>>> ceph.client.openstack.keyring. >>>>> >>>>> Which created issues in glance as well as the naming convention >>>>> of the files didn't match the cluster names, so i had to manually rename >>>>> the central ceph conf file as such: >>>>> >>>>> [root@dcn02-compute-0 ~]# cd /var/lib/tripleo-config/ceph/ >>>>> [root@dcn02-compute-0 ceph]# ll >>>>> total 16 >>>>> -rw-------. 1 root root 257 Mar 13 13:56 >>>>> ceph_central.client.openstack.keyring >>>>> -rw-r--r--. 1 root root 428 Mar 13 13:56 ceph_central.conf >>>>> -rw-------. 1 root root 205 Mar 15 18:45 >>>>> ceph.client.openstack.keyring >>>>> -rw-r--r--. 1 root root 362 Mar 15 18:45 ceph.conf >>>>> [root@dcn02-compute-0 ceph]# >>>>> >>>>> ceph.conf and ceph.client.openstack.keyring contain the fsid of >>>>> the respective clusters in both dcn01 and dcn02. >>>>> In the above cli output, the ceph.conf and ceph.client... are >>>>> the files used to access dcn02 ceph cluster and ceph_central* files are >>>>> used in for accessing central ceph cluster. 
>>>>> >>>>> glance multistore config: >>>>> [dcn02] >>>>> rbd_store_ceph_conf=/etc/ceph/ceph.conf >>>>> rbd_store_user=openstack >>>>> rbd_store_pool=images >>>>> rbd_thin_provisioning=False >>>>> store_description=dcn02 rbd glance store >>>>> >>>>> [ceph_central] >>>>> rbd_store_ceph_conf=/etc/ceph/ceph_central.conf >>>>> rbd_store_user=openstack >>>>> rbd_store_pool=images >>>>> rbd_thin_provisioning=False >>>>> store_description=Default glance store backend. >>>>> >>>>> >>>>> With regards, >>>>> Swogat Pradhan >>>>> >>>>> On Tue, Mar 21, 2023 at 5:52 PM John Fulton <johfulto@redhat.com> >>>>> wrote: >>>>> >>>>>> On Tue, Mar 21, 2023 at 8:03 AM Swogat Pradhan >>>>>> <swogatpradhan22@gmail.com> wrote: >>>>>> > >>>>>> > Hi, >>>>>> > Seems like cinder is not using the local ceph. >>>>>> >>>>>> That explains the issue. It's a misconfiguration. >>>>>> >>>>>> I hope this is not a production system since the mailing list >>>>>> now has >>>>>> the cinder.conf which contains passwords. >>>>>> >>>>>> The section that looks like this: >>>>>> >>>>>> [tripleo_ceph] >>>>>> volume_backend_name=tripleo_ceph >>>>>> volume_driver=cinder.volume.drivers.rbd.RBDDriver >>>>>> rbd_ceph_conf=/etc/ceph/ceph.conf >>>>>> rbd_user=openstack >>>>>> rbd_pool=volumes >>>>>> rbd_flatten_volume_from_snapshot=False >>>>>> rbd_secret_uuid=<redacted> >>>>>> report_discard_supported=True >>>>>> >>>>>> Should be updated to refer to the local DCN ceph cluster and >>>>>> not the >>>>>> central one. Use the ceph conf file for that cluster and ensure >>>>>> the >>>>>> rbd_secret_uuid corresponds to that one. >>>>>> >>>>>> TripleO’s convention is to set the rbd_secret_uuid to the FSID >>>>>> of the >>>>>> Ceph cluster. The FSID should be in the ceph.conf file. The >>>>>> tripleo_nova_libvirt role will use virsh secret-* commands so >>>>>> that >>>>>> libvirt can retrieve the cephx secret using the FSID as a key. >>>>>> This >>>>>> can be confirmed with `podman exec nova_virtsecretd virsh >>>>>> secret-get-value $FSID`. >>>>>> >>>>>> The documentation describes how to configure the central and >>>>>> DCN sites >>>>>> correctly but an error seems to have occurred while you were >>>>>> following >>>>>> it. >>>>>> >>>>>> >>>>>> https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features... 
>>>>>> >>>>>> John >>>>>> >>>>>> > >>>>>> > Ceph Output: >>>>>> > [ceph: root@dcn02-ceph-all-0 /]# rbd -p images ls -l >>>>>> > NAME SIZE PARENT >>>>>> FMT PROT LOCK >>>>>> > 2abfafaa-eff4-4c2e-a538-dc2e1249ab65 8 MiB >>>>>> 2 excl >>>>>> > 55f40c8a-8f79-48c5-a52a-9b679b762f19 16 MiB >>>>>> 2 >>>>>> > 55f40c8a-8f79-48c5-a52a-9b679b762f19@snap 16 MiB >>>>>> 2 yes >>>>>> > 59f6a9cd-721c-45b5-a15f-fd021b08160d 321 MiB >>>>>> 2 >>>>>> > 59f6a9cd-721c-45b5-a15f-fd021b08160d@snap 321 MiB >>>>>> 2 yes >>>>>> > 5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0 386 MiB >>>>>> 2 >>>>>> > 5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0@snap 386 MiB >>>>>> 2 yes >>>>>> > 9b27248e-a8cf-4f00-a039-d3e3066cd26a 15 GiB >>>>>> 2 >>>>>> > 9b27248e-a8cf-4f00-a039-d3e3066cd26a@snap 15 GiB >>>>>> 2 yes >>>>>> > b7356adc-bb47-4c05-968b-6d3c9ca0079b 15 GiB >>>>>> 2 >>>>>> > b7356adc-bb47-4c05-968b-6d3c9ca0079b@snap 15 GiB >>>>>> 2 yes >>>>>> > e77e78ad-d369-4a1d-b758-8113621269a3 15 GiB >>>>>> 2 >>>>>> > e77e78ad-d369-4a1d-b758-8113621269a3@snap 15 GiB >>>>>> 2 yes >>>>>> > >>>>>> > [ceph: root@dcn02-ceph-all-0 /]# rbd -p volumes ls -l >>>>>> > NAME SIZE PARENT >>>>>> FMT PROT LOCK >>>>>> > volume-c644086f-d3cf-406d-b0f1-7691bde5981d 100 GiB >>>>>> 2 >>>>>> > volume-f0969935-a742-4744-9375-80bf323e4d63 10 GiB >>>>>> 2 >>>>>> > [ceph: root@dcn02-ceph-all-0 /]# >>>>>> > >>>>>> > Attached the cinder config. >>>>>> > Please let me know how I can solve this issue. >>>>>> > >>>>>> > With regards, >>>>>> > Swogat Pradhan >>>>>> > >>>>>> > On Tue, Mar 21, 2023 at 3:53 PM John Fulton < >>>>>> johfulto@redhat.com> wrote: >>>>>> >> >>>>>> >> in my last message under the line "On a DCN site if you run >>>>>> a command like this:" I suggested some steps you could try to confirm the >>>>>> image is a COW from the local glance as well as how to look at your cinder >>>>>> config. >>>>>> >> >>>>>> >> On Tue, Mar 21, 2023, 12:06 AM Swogat Pradhan < >>>>>> swogatpradhan22@gmail.com> wrote: >>>>>> >>> >>>>>> >>> Update: >>>>>> >>> I uploaded an image directly to the dcn02 store, and it >>>>>> takes around 10,15 minutes to create a volume with image in dcn02. >>>>>> >>> The image size is 389 MB. >>>>>> >>> >>>>>> >>> On Mon, Mar 20, 2023 at 10:26 PM Swogat Pradhan < >>>>>> swogatpradhan22@gmail.com> wrote: >>>>>> >>>> >>>>>> >>>> Hi Jhon, >>>>>> >>>> I checked in the ceph od dcn02, I can see the images >>>>>> created after importing from the central site. >>>>>> >>>> But launching an instance normally fails as it takes a >>>>>> long time for the volume to get created. >>>>>> >>>> >>>>>> >>>> When launching an instance from volume the instance is >>>>>> getting created properly without any errors. >>>>>> >>>> >>>>>> >>>> I tried to cache images in nova using >>>>>> https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep... >>>>>> but getting checksum failed error. >>>>>> >>>> >>>>>> >>>> With regards, >>>>>> >>>> Swogat Pradhan >>>>>> >>>> >>>>>> >>>> On Thu, Mar 16, 2023 at 5:24 PM John Fulton < >>>>>> johfulto@redhat.com> wrote: >>>>>> >>>>> >>>>>> >>>>> On Wed, Mar 15, 2023 at 8:05 PM Swogat Pradhan >>>>>> >>>>> <swogatpradhan22@gmail.com> wrote: >>>>>> >>>>> > >>>>>> >>>>> > Update: After restarting the nova services on the >>>>>> controller and running the deploy script on the edge site, I was able to >>>>>> launch the VM from volume. 
>>>>>> >>>>> > >>>>>> >>>>> > Right now the instance creation is failing as the block >>>>>> device creation is stuck in creating state, it is taking more than 10 mins >>>>>> for the volume to be created, whereas the image has already been imported >>>>>> to the edge glance. >>>>>> >>>>> >>>>>> >>>>> Try following this document and making the same >>>>>> observations in your >>>>>> >>>>> environment for AZs and their local ceph cluster. >>>>>> >>>>> >>>>>> >>>>> >>>>>> https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features... >>>>>> >>>>> >>>>>> >>>>> On a DCN site if you run a command like this: >>>>>> >>>>> >>>>>> >>>>> $ sudo cephadm shell --config /etc/ceph/dcn0.conf >>>>>> --keyring >>>>>> >>>>> /etc/ceph/dcn0.client.admin.keyring >>>>>> >>>>> $ rbd --cluster dcn0 -p volumes ls -l >>>>>> >>>>> NAME SIZE PARENT >>>>>> >>>>> FMT PROT LOCK >>>>>> >>>>> volume-28c6fc32-047b-4306-ad2d-de2be02716b7 8 GiB >>>>>> >>>>> images/8083c7e7-32d8-4f7a-b1da-0ed7884f1076@snap 2 >>>>>> excl >>>>>> >>>>> $ >>>>>> >>>>> >>>>>> >>>>> Then, you should see the parent of the volume is the >>>>>> image which is on >>>>>> >>>>> the same local ceph cluster. >>>>>> >>>>> >>>>>> >>>>> I wonder if something is misconfigured and thus you're >>>>>> encountering >>>>>> >>>>> the streaming behavior described here: >>>>>> >>>>> >>>>>> >>>>> Ideally all images should reside in the central Glance >>>>>> and be copied >>>>>> >>>>> to DCN sites before instances of those images are booted >>>>>> on DCN sites. >>>>>> >>>>> If an image is not copied to a DCN site before it is >>>>>> booted, then the >>>>>> >>>>> image will be streamed to the DCN site and then the image >>>>>> will boot as >>>>>> >>>>> an instance. This happens because Glance at the DCN site >>>>>> has access to >>>>>> >>>>> the images store at the Central ceph cluster. Though the >>>>>> booting of >>>>>> >>>>> the image will take time because it has not been copied >>>>>> in advance, >>>>>> >>>>> this is still preferable to failing to boot the image. >>>>>> >>>>> >>>>>> >>>>> You can also exec into the cinder container at the DCN >>>>>> site and >>>>>> >>>>> confirm it's using it's local ceph cluster. >>>>>> >>>>> >>>>>> >>>>> John >>>>>> >>>>> >>>>>> >>>>> > >>>>>> >>>>> > I will try and create a new fresh image and test again >>>>>> then update. >>>>>> >>>>> > >>>>>> >>>>> > With regards, >>>>>> >>>>> > Swogat Pradhan >>>>>> >>>>> > >>>>>> >>>>> > On Wed, Mar 15, 2023 at 11:13 PM Swogat Pradhan < >>>>>> swogatpradhan22@gmail.com> wrote: >>>>>> >>>>> >> >>>>>> >>>>> >> Update: >>>>>> >>>>> >> In the hypervisor list the compute node state is >>>>>> showing down. >>>>>> >>>>> >> >>>>>> >>>>> >> >>>>>> >>>>> >> On Wed, Mar 15, 2023 at 11:11 PM Swogat Pradhan < >>>>>> swogatpradhan22@gmail.com> wrote: >>>>>> >>>>> >>> >>>>>> >>>>> >>> Hi Brendan, >>>>>> >>>>> >>> Now i have deployed another site where i have used 2 >>>>>> linux bonds network template for both 3 compute nodes and 3 ceph nodes. >>>>>> >>>>> >>> The bonding options is set to mode=802.3ad >>>>>> (lacp=active). >>>>>> >>>>> >>> I used a cirros image to launch instance but the >>>>>> instance timed out so i waited for the volume to be created. >>>>>> >>>>> >>> Once the volume was created i tried launching the >>>>>> instance from the volume and still the instance is stuck in spawning state. 
>>>>>> >>>>> >>> >>>>>> >>>>> >>> Here is the nova-compute log: >>>>>> >>>>> >>> >>>>>> >>>>> >>> 2023-03-15 17:35:47.739 185437 INFO >>>>>> oslo.privsep.daemon [-] privsep daemon starting >>>>>> >>>>> >>> 2023-03-15 17:35:47.744 185437 INFO >>>>>> oslo.privsep.daemon [-] privsep process running with uid/gid: 0/0 >>>>>> >>>>> >>> 2023-03-15 17:35:47.749 185437 INFO >>>>>> oslo.privsep.daemon [-] privsep process running with capabilities >>>>>> (eff/prm/inh): CAP_SYS_ADMIN/CAP_SYS_ADMIN/none >>>>>> >>>>> >>> 2023-03-15 17:35:47.749 185437 INFO >>>>>> oslo.privsep.daemon [-] privsep daemon running as pid 185437 >>>>>> >>>>> >>> 2023-03-15 17:35:47.974 8 WARNING >>>>>> os_brick.initiator.connectors.nvmeof >>>>>> [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db >>>>>> 4160ce999a31485fa643aed0936dfef0 - default default] Process execution error >>>>>> in _get_host_uuid: Unexpected error while running command. >>>>>> >>>>> >>> Command: blkid overlay -s UUID -o value >>>>>> >>>>> >>> Exit code: 2 >>>>>> >>>>> >>> Stdout: '' >>>>>> >>>>> >>> Stderr: '': >>>>>> oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while >>>>>> running command. >>>>>> >>>>> >>> 2023-03-15 17:35:51.616 8 INFO >>>>>> nova.virt.libvirt.driver [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 >>>>>> b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default >>>>>> default] [instance: 450b749c-a10a-4308-80a9-3b8020fee758] Creating image >>>>>> >>>>> >>> >>>>>> >>>>> >>> It is stuck in creating image, do i need to run the >>>>>> template mentioned here ?: >>>>>> https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep... >>>>>> >>>>> >>> >>>>>> >>>>> >>> The volume is already created and i do not understand >>>>>> why the instance is stuck in spawning state. >>>>>> >>>>> >>> >>>>>> >>>>> >>> With regards, >>>>>> >>>>> >>> Swogat Pradhan >>>>>> >>>>> >>> >>>>>> >>>>> >>> >>>>>> >>>>> >>> On Sun, Mar 5, 2023 at 4:02 PM Brendan Shephard < >>>>>> bshephar@redhat.com> wrote: >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> Does your environment use different network >>>>>> interfaces for each of the networks? Or does it have a bond with everything >>>>>> on it? >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> One issue I have seen before is that when launching >>>>>> instances, there is a lot of network traffic between nodes as the >>>>>> hypervisor needs to download the image from Glance. Along with various >>>>>> other services sending normal network traffic, it can be enough to cause >>>>>> issues if everything is running over a single 1Gbe interface. >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> I have seen the same situation in fact when using a >>>>>> single active/backup bond on 1Gbe nics. It’s worth checking the network >>>>>> traffic while you try to spawn the instance to see if you’re dropping >>>>>> packets. In the situation I described, there were dropped packets which >>>>>> resulted in a loss of communication between nova_compute and RMQ, so the >>>>>> node appeared offline. You should also confirm that nova_compute is being >>>>>> disconnected in the nova_compute logs if you tail them on the Hypervisor >>>>>> while spawning the instance. >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> In my case, changing from active/backup to LACP >>>>>> helped. So, based on that experience, from my perspective, is certainly >>>>>> sounds like some kind of network issue. 
>>>>>> >>>>> >>>> >>>>>> >>>>> >>>> Regards, >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> Brendan Shephard >>>>>> >>>>> >>>> Senior Software Engineer >>>>>> >>>>> >>>> Red Hat Australia >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> On 5 Mar 2023, at 6:47 am, Eugen Block < >>>>>> eblock@nde.ag> wrote: >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> Hi, >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> I tried to help someone with a similar issue some >>>>>> time ago in this thread: >>>>>> >>>>> >>>> >>>>>> https://serverfault.com/questions/1116771/openstack-oslo-messaging-exception... >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> But apparently a neutron reinstallation fixed it for >>>>>> that user, not sure if that could apply here. But is it possible that your >>>>>> nova and neutron versions are different between central and edge site? Have >>>>>> you restarted nova and neutron services on the compute nodes after >>>>>> installation? Have you debug logs of nova-conductor and maybe nova-compute? >>>>>> Maybe they can help narrow down the issue. >>>>>> >>>>> >>>> If there isn't any additional information in the >>>>>> debug logs I probably would start "tearing down" rabbitmq. I didn't have to >>>>>> do that in a production system yet so be careful. I can think of two routes: >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> - Either remove queues, exchanges etc. while rabbit >>>>>> is running, this will most likely impact client IO depending on your load. >>>>>> Check out the rabbitmqctl commands. >>>>>> >>>>> >>>> - Or stop the rabbitmq cluster, remove the mnesia >>>>>> tables from all nodes and restart rabbitmq so the exchanges, queues etc. >>>>>> rebuild. >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> I can imagine that the failed reply "survives" while >>>>>> being replicated across the rabbit nodes. But I don't really know the >>>>>> rabbit internals too well, so maybe someone else can chime in here and give >>>>>> a better advice. >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> Regards, >>>>>> >>>>> >>>> Eugen >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> Zitat von Swogat Pradhan <swogatpradhan22@gmail.com >>>>>> >: >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> Hi, >>>>>> >>>>> >>>> Can someone please help me out on this issue? >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> With regards, >>>>>> >>>>> >>>> Swogat Pradhan >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> On Thu, Mar 2, 2023 at 1:24 PM Swogat Pradhan < >>>>>> swogatpradhan22@gmail.com> >>>>>> >>>>> >>>> wrote: >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> Hi >>>>>> >>>>> >>>> I don't see any major packet loss. >>>>>> >>>>> >>>> It seems the problem is somewhere in rabbitmq maybe >>>>>> but not due to packet >>>>>> >>>>> >>>> loss. >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> with regards, >>>>>> >>>>> >>>> Swogat Pradhan >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> On Wed, Mar 1, 2023 at 3:34 PM Swogat Pradhan < >>>>>> swogatpradhan22@gmail.com> >>>>>> >>>>> >>>> wrote: >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> Hi, >>>>>> >>>>> >>>> Yes the MTU is the same as the default '1500'. >>>>>> >>>>> >>>> Generally I haven't seen any packet loss, but never >>>>>> checked when >>>>>> >>>>> >>>> launching the instance. >>>>>> >>>>> >>>> I will check that and come back. >>>>>> >>>>> >>>> But everytime i launch an instance the instance gets >>>>>> stuck at spawning >>>>>> >>>>> >>>> state and there the hypervisor becomes down, so not >>>>>> sure if packet loss >>>>>> >>>>> >>>> causes this. 
>>>>>> >>>>> >>>> >>>>>> >>>>> >>>> With regards, >>>>>> >>>>> >>>> Swogat pradhan >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> On Wed, Mar 1, 2023 at 3:30 PM Eugen Block < >>>>>> eblock@nde.ag> wrote: >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> One more thing coming to mind is MTU size. Are they >>>>>> identical between >>>>>> >>>>> >>>> central and edge site? Do you see packet loss >>>>>> through the tunnel? >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> Zitat von Swogat Pradhan <swogatpradhan22@gmail.com >>>>>> >: >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> > Hi Eugen, >>>>>> >>>>> >>>> > Request you to please add my email either on 'to' >>>>>> or 'cc' as i am not >>>>>> >>>>> >>>> > getting email's from you. >>>>>> >>>>> >>>> > Coming to the issue: >>>>>> >>>>> >>>> > >>>>>> >>>>> >>>> > [root@overcloud-controller-no-ceph-3 /]# >>>>>> rabbitmqctl list_policies -p >>>>>> >>>>> >>>> / >>>>>> >>>>> >>>> > Listing policies for vhost "/" ... >>>>>> >>>>> >>>> > vhost name pattern apply-to >>>>>> definition priority >>>>>> >>>>> >>>> > / ha-all ^(?!amq\.).* queues >>>>>> >>>>> >>>> > >>>>>> >>>>> >>>> >>>>>> {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0 >>>>>> >>>>> >>>> > >>>>>> >>>>> >>>> > I have the edge site compute nodes up, it only >>>>>> goes down when i am >>>>>> >>>>> >>>> trying >>>>>> >>>>> >>>> > to launch an instance and the instance comes to a >>>>>> spawning state and >>>>>> >>>>> >>>> then >>>>>> >>>>> >>>> > gets stuck. >>>>>> >>>>> >>>> > >>>>>> >>>>> >>>> > I have a tunnel setup between the central and the >>>>>> edge sites. >>>>>> >>>>> >>>> > >>>>>> >>>>> >>>> > With regards, >>>>>> >>>>> >>>> > Swogat Pradhan >>>>>> >>>>> >>>> > >>>>>> >>>>> >>>> > On Tue, Feb 28, 2023 at 9:11 PM Swogat Pradhan < >>>>>> >>>>> >>>> swogatpradhan22@gmail.com> >>>>>> >>>>> >>>> > wrote: >>>>>> >>>>> >>>> > >>>>>> >>>>> >>>> >> Hi Eugen, >>>>>> >>>>> >>>> >> For some reason i am not getting your email to me >>>>>> directly, i am >>>>>> >>>>> >>>> checking >>>>>> >>>>> >>>> >> the email digest and there i am able to find your >>>>>> reply. >>>>>> >>>>> >>>> >> Here is the log for download: >>>>>> https://we.tl/t-L8FEkGZFSq >>>>>> >>>>> >>>> >> Yes, these logs are from the time when the issue >>>>>> occurred. >>>>>> >>>>> >>>> >> >>>>>> >>>>> >>>> >> *Note: i am able to create vm's and perform other >>>>>> activities in the >>>>>> >>>>> >>>> >> central site, only facing this issue in the edge >>>>>> site.* >>>>>> >>>>> >>>> >> >>>>>> >>>>> >>>> >> With regards, >>>>>> >>>>> >>>> >> Swogat Pradhan >>>>>> >>>>> >>>> >> >>>>>> >>>>> >>>> >> On Mon, Feb 27, 2023 at 5:12 PM Swogat Pradhan < >>>>>> >>>>> >>>> swogatpradhan22@gmail.com> >>>>>> >>>>> >>>> >> wrote: >>>>>> >>>>> >>>> >> >>>>>> >>>>> >>>> >>> Hi Eugen, >>>>>> >>>>> >>>> >>> Thanks for your response. 
>>>>>> >>>>> >>>> >>> I have actually a 4 controller setup so here are >>>>>> the details: >>>>>> >>>>> >>>> >>> >>>>>> >>>>> >>>> >>> *PCS Status:* >>>>>> >>>>> >>>> >>> * Container bundle set: rabbitmq-bundle [ >>>>>> >>>>> >>>> >>> >>>>>> 172.25.201.68:8787/tripleomaster/openstack-rabbitmq:pcmklatest >>>>>> ]: >>>>>> >>>>> >>>> >>> * rabbitmq-bundle-0 >>>>>> (ocf::heartbeat:rabbitmq-cluster): >>>>>> >>>>> >>>> Started >>>>>> >>>>> >>>> >>> overcloud-controller-no-ceph-3 >>>>>> >>>>> >>>> >>> * rabbitmq-bundle-1 >>>>>> (ocf::heartbeat:rabbitmq-cluster): >>>>>> >>>>> >>>> Started >>>>>> >>>>> >>>> >>> overcloud-controller-2 >>>>>> >>>>> >>>> >>> * rabbitmq-bundle-2 >>>>>> (ocf::heartbeat:rabbitmq-cluster): >>>>>> >>>>> >>>> Started >>>>>> >>>>> >>>> >>> overcloud-controller-1 >>>>>> >>>>> >>>> >>> * rabbitmq-bundle-3 >>>>>> (ocf::heartbeat:rabbitmq-cluster): >>>>>> >>>>> >>>> Started >>>>>> >>>>> >>>> >>> overcloud-controller-0 >>>>>> >>>>> >>>> >>> >>>>>> >>>>> >>>> >>> I have tried restarting the bundle multiple >>>>>> times but the issue is >>>>>> >>>>> >>>> still >>>>>> >>>>> >>>> >>> present. >>>>>> >>>>> >>>> >>> >>>>>> >>>>> >>>> >>> *Cluster status:* >>>>>> >>>>> >>>> >>> [root@overcloud-controller-0 /]# rabbitmqctl >>>>>> cluster_status >>>>>> >>>>> >>>> >>> Cluster status of node >>>>>> >>>>> >>>> >>> >>>>>> rabbit@overcloud-controller-0.internalapi.bdxworld.com ... >>>>>> >>>>> >>>> >>> Basics >>>>>> >>>>> >>>> >>> >>>>>> >>>>> >>>> >>> Cluster name: >>>>>> rabbit@overcloud-controller-no-ceph-3.bdxworld.com >>>>>> >>>>> >>>> >>> >>>>>> >>>>> >>>> >>> Disk Nodes >>>>>> >>>>> >>>> >>> >>>>>> >>>>> >>>> >>> >>>>>> rabbit@overcloud-controller-0.internalapi.bdxworld.com >>>>>> >>>>> >>>> >>> >>>>>> rabbit@overcloud-controller-1.internalapi.bdxworld.com >>>>>> >>>>> >>>> >>> >>>>>> rabbit@overcloud-controller-2.internalapi.bdxworld.com >>>>>> >>>>> >>>> >>> >>>>>> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >>>>>> >>>>> >>>> >>> >>>>>> >>>>> >>>> >>> Running Nodes >>>>>> >>>>> >>>> >>> >>>>>> >>>>> >>>> >>> >>>>>> rabbit@overcloud-controller-0.internalapi.bdxworld.com >>>>>> >>>>> >>>> >>> >>>>>> rabbit@overcloud-controller-1.internalapi.bdxworld.com >>>>>> >>>>> >>>> >>> >>>>>> rabbit@overcloud-controller-2.internalapi.bdxworld.com >>>>>> >>>>> >>>> >>> >>>>>> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >>>>>> >>>>> >>>> >>> >>>>>> >>>>> >>>> >>> Versions >>>>>> >>>>> >>>> >>> >>>>>> >>>>> >>>> >>> >>>>>> rabbit@overcloud-controller-0.internalapi.bdxworld.com: >>>>>> RabbitMQ >>>>>> >>>>> >>>> 3.8.3 >>>>>> >>>>> >>>> >>> on Erlang 22.3.4.1 >>>>>> >>>>> >>>> >>> >>>>>> rabbit@overcloud-controller-1.internalapi.bdxworld.com: >>>>>> RabbitMQ >>>>>> >>>>> >>>> 3.8.3 >>>>>> >>>>> >>>> >>> on Erlang 22.3.4.1 >>>>>> >>>>> >>>> >>> >>>>>> rabbit@overcloud-controller-2.internalapi.bdxworld.com: >>>>>> RabbitMQ >>>>>> >>>>> >>>> 3.8.3 >>>>>> >>>>> >>>> >>> on Erlang 22.3.4.1 >>>>>> >>>>> >>>> >>> >>>>>> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com: >>>>>> >>>>> >>>> RabbitMQ >>>>>> >>>>> >>>> >>> 3.8.3 on Erlang 22.3.4.1 >>>>>> >>>>> >>>> >>> >>>>>> >>>>> >>>> >>> Alarms >>>>>> >>>>> >>>> >>> >>>>>> >>>>> >>>> >>> (none) >>>>>> >>>>> >>>> >>> >>>>>> >>>>> >>>> >>> Network Partitions >>>>>> >>>>> >>>> >>> >>>>>> >>>>> >>>> >>> (none) >>>>>> >>>>> >>>> >>> >>>>>> >>>>> >>>> >>> Listeners >>>>>> >>>>> >>>> >>> >>>>>> >>>>> >>>> >>> Node: >>>>>> rabbit@overcloud-controller-0.internalapi.bdxworld.com, >>>>>> >>>>> >>>> interface: >>>>>> >>>>> 
>>>> >>> [::], port: 25672, protocol: clustering, >>>>>> purpose: inter-node and CLI >>>>>> >>>>> >>>> tool >>>>>> >>>>> >>>> >>> communication >>>>>> >>>>> >>>> >>> Node: >>>>>> rabbit@overcloud-controller-0.internalapi.bdxworld.com, >>>>>> >>>>> >>>> interface: >>>>>> >>>>> >>>> >>> 172.25.201.212, port: 5672, protocol: amqp, >>>>>> purpose: AMQP 0-9-1 >>>>>> >>>>> >>>> >>> and AMQP 1.0 >>>>>> >>>>> >>>> >>> Node: >>>>>> rabbit@overcloud-controller-0.internalapi.bdxworld.com, >>>>>> >>>>> >>>> interface: >>>>>> >>>>> >>>> >>> [::], port: 15672, protocol: http, purpose: HTTP >>>>>> API >>>>>> >>>>> >>>> >>> Node: >>>>>> rabbit@overcloud-controller-1.internalapi.bdxworld.com, >>>>>> >>>>> >>>> interface: >>>>>> >>>>> >>>> >>> [::], port: 25672, protocol: clustering, >>>>>> purpose: inter-node and CLI >>>>>> >>>>> >>>> tool >>>>>> >>>>> >>>> >>> communication >>>>>> >>>>> >>>> >>> Node: >>>>>> rabbit@overcloud-controller-1.internalapi.bdxworld.com, >>>>>> >>>>> >>>> interface: >>>>>> >>>>> >>>> >>> 172.25.201.205, port: 5672, protocol: amqp, >>>>>> purpose: AMQP 0-9-1 >>>>>> >>>>> >>>> >>> and AMQP 1.0 >>>>>> >>>>> >>>> >>> Node: >>>>>> rabbit@overcloud-controller-1.internalapi.bdxworld.com, >>>>>> >>>>> >>>> interface: >>>>>> >>>>> >>>> >>> [::], port: 15672, protocol: http, purpose: HTTP >>>>>> API >>>>>> >>>>> >>>> >>> Node: >>>>>> rabbit@overcloud-controller-2.internalapi.bdxworld.com, >>>>>> >>>>> >>>> interface: >>>>>> >>>>> >>>> >>> [::], port: 25672, protocol: clustering, >>>>>> purpose: inter-node and CLI >>>>>> >>>>> >>>> tool >>>>>> >>>>> >>>> >>> communication >>>>>> >>>>> >>>> >>> Node: >>>>>> rabbit@overcloud-controller-2.internalapi.bdxworld.com, >>>>>> >>>>> >>>> interface: >>>>>> >>>>> >>>> >>> 172.25.201.201, port: 5672, protocol: amqp, >>>>>> purpose: AMQP 0-9-1 >>>>>> >>>>> >>>> >>> and AMQP 1.0 >>>>>> >>>>> >>>> >>> Node: >>>>>> rabbit@overcloud-controller-2.internalapi.bdxworld.com, >>>>>> >>>>> >>>> interface: >>>>>> >>>>> >>>> >>> [::], port: 15672, protocol: http, purpose: HTTP >>>>>> API >>>>>> >>>>> >>>> >>> Node: >>>>>> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >>>>>> >>>>> >>>> , >>>>>> >>>>> >>>> >>> interface: [::], port: 25672, protocol: >>>>>> clustering, purpose: >>>>>> >>>>> >>>> inter-node and >>>>>> >>>>> >>>> >>> CLI tool communication >>>>>> >>>>> >>>> >>> Node: >>>>>> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >>>>>> >>>>> >>>> , >>>>>> >>>>> >>>> >>> interface: 172.25.201.209, port: 5672, protocol: >>>>>> amqp, purpose: AMQP >>>>>> >>>>> >>>> 0-9-1 >>>>>> >>>>> >>>> >>> and AMQP 1.0 >>>>>> >>>>> >>>> >>> Node: >>>>>> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >>>>>> >>>>> >>>> , >>>>>> >>>>> >>>> >>> interface: [::], port: 15672, protocol: http, >>>>>> purpose: HTTP API >>>>>> >>>>> >>>> >>> >>>>>> >>>>> >>>> >>> Feature flags >>>>>> >>>>> >>>> >>> >>>>>> >>>>> >>>> >>> Flag: drop_unroutable_metric, state: enabled >>>>>> >>>>> >>>> >>> Flag: empty_basic_get_metric, state: enabled >>>>>> >>>>> >>>> >>> Flag: implicit_default_bindings, state: enabled >>>>>> >>>>> >>>> >>> Flag: quorum_queue, state: enabled >>>>>> >>>>> >>>> >>> Flag: virtual_host_metadata, state: enabled >>>>>> >>>>> >>>> >>> >>>>>> >>>>> >>>> >>> *Logs:* >>>>>> >>>>> >>>> >>> *(Attached)* >>>>>> >>>>> >>>> >>> >>>>>> >>>>> >>>> >>> With regards, >>>>>> >>>>> >>>> >>> Swogat Pradhan >>>>>> >>>>> >>>> >>> >>>>>> >>>>> >>>> >>> On Sun, Feb 26, 2023 at 2:34 PM Swogat Pradhan < >>>>>> >>>>> >>>> swogatpradhan22@gmail.com> >>>>>> 
>>>>> >>>> >>> wrote: >>>>>> >>>>> >>>> >>> >>>>>> >>>>> >>>> >>>> Hi, >>>>>> >>>>> >>>> >>>> Please find the nova conductor as well as nova >>>>>> api log. >>>>>> >>>>> >>>> >>>> >>>>>> >>>>> >>>> >>>> nova-conuctor: >>>>>> >>>>> >>>> >>>> >>>>>> >>>>> >>>> >>>> 2023-02-26 08:45:01.108 31 WARNING >>>>>> >>>>> >>>> oslo_messaging._drivers.amqpdriver >>>>>> >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - >>>>>> - -] >>>>>> >>>>> >>>> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't >>>>>> exist, drop reply to >>>>>> >>>>> >>>> >>>> 16152921c1eb45c2b1f562087140168b >>>>>> >>>>> >>>> >>>> 2023-02-26 08:45:02.144 26 WARNING >>>>>> >>>>> >>>> oslo_messaging._drivers.amqpdriver >>>>>> >>>>> >>>> >>>> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - >>>>>> - -] >>>>>> >>>>> >>>> >>>> reply_276049ec36a84486a8a406911d9802f4 doesn't >>>>>> exist, drop reply to >>>>>> >>>>> >>>> >>>> 83dbe5f567a940b698acfe986f6194fa >>>>>> >>>>> >>>> >>>> 2023-02-26 08:45:02.314 32 WARNING >>>>>> >>>>> >>>> oslo_messaging._drivers.amqpdriver >>>>>> >>>>> >>>> >>>> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - >>>>>> - -] >>>>>> >>>>> >>>> >>>> reply_276049ec36a84486a8a406911d9802f4 doesn't >>>>>> exist, drop reply to >>>>>> >>>>> >>>> >>>> f3bfd7f65bd542b18d84cea3033abb43: >>>>>> >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >>>>>> >>>>> >>>> >>>> 2023-02-26 08:45:02.316 32 ERROR >>>>>> oslo_messaging._drivers.amqpdriver >>>>>> >>>>> >>>> >>>> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - >>>>>> - -] The reply >>>>>> >>>>> >>>> >>>> f3bfd7f65bd542b18d84cea3033abb43 failed to send >>>>>> after 60 seconds >>>>>> >>>>> >>>> due to a >>>>>> >>>>> >>>> >>>> missing queue >>>>>> (reply_276049ec36a84486a8a406911d9802f4). >>>>>> >>>>> >>>> Abandoning...: >>>>>> >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >>>>>> >>>>> >>>> >>>> 2023-02-26 08:48:01.282 35 WARNING >>>>>> >>>>> >>>> oslo_messaging._drivers.amqpdriver >>>>>> >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - >>>>>> - -] >>>>>> >>>>> >>>> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't >>>>>> exist, drop reply to >>>>>> >>>>> >>>> >>>> d4b9180f91a94f9a82c3c9c4b7595566: >>>>>> >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >>>>>> >>>>> >>>> >>>> 2023-02-26 08:48:01.284 35 ERROR >>>>>> oslo_messaging._drivers.amqpdriver >>>>>> >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - >>>>>> - -] The reply >>>>>> >>>>> >>>> >>>> d4b9180f91a94f9a82c3c9c4b7595566 failed to send >>>>>> after 60 seconds >>>>>> >>>>> >>>> due to a >>>>>> >>>>> >>>> >>>> missing queue >>>>>> (reply_349bcb075f8c49329435a0f884b33066). 
>>>>>> >>>>> >>>> Abandoning...: >>>>>> >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >>>>>> >>>>> >>>> >>>> 2023-02-26 08:49:01.303 33 WARNING >>>>>> >>>>> >>>> oslo_messaging._drivers.amqpdriver >>>>>> >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - >>>>>> - -] >>>>>> >>>>> >>>> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't >>>>>> exist, drop reply to >>>>>> >>>>> >>>> >>>> 897911a234a445d8a0d8af02ece40f6f: >>>>>> >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >>>>>> >>>>> >>>> >>>> 2023-02-26 08:49:01.304 33 ERROR >>>>>> oslo_messaging._drivers.amqpdriver >>>>>> >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - >>>>>> - -] The reply >>>>>> >>>>> >>>> >>>> 897911a234a445d8a0d8af02ece40f6f failed to send >>>>>> after 60 seconds >>>>>> >>>>> >>>> due to a >>>>>> >>>>> >>>> >>>> missing queue >>>>>> (reply_349bcb075f8c49329435a0f884b33066). >>>>>> >>>>> >>>> Abandoning...: >>>>>> >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >>>>>> >>>>> >>>> >>>> 2023-02-26 08:49:52.254 31 WARNING >>>>>> nova.cache_utils >>>>>> >>>>> >>>> >>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>>>>> >>>>> >>>> b240e3e89d99489284cd731e75f2a5db >>>>>> >>>>> >>>> >>>> 4160ce999a31485fa643aed0936dfef0 - default >>>>>> default] Cache enabled >>>>>> >>>>> >>>> with >>>>>> >>>>> >>>> >>>> backend dogpile.cache.null. >>>>>> >>>>> >>>> >>>> 2023-02-26 08:50:01.264 27 WARNING >>>>>> >>>>> >>>> oslo_messaging._drivers.amqpdriver >>>>>> >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - >>>>>> - -] >>>>>> >>>>> >>>> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't >>>>>> exist, drop reply to >>>>>> >>>>> >>>> >>>> 8f723ceb10c3472db9a9f324861df2bb: >>>>>> >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >>>>>> >>>>> >>>> >>>> 2023-02-26 08:50:01.266 27 ERROR >>>>>> oslo_messaging._drivers.amqpdriver >>>>>> >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - >>>>>> - -] The reply >>>>>> >>>>> >>>> >>>> 8f723ceb10c3472db9a9f324861df2bb failed to send >>>>>> after 60 seconds >>>>>> >>>>> >>>> due to a >>>>>> >>>>> >>>> >>>> missing queue >>>>>> (reply_349bcb075f8c49329435a0f884b33066). >>>>>> >>>>> >>>> Abandoning...: >>>>>> >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >>>>>> >>>>> >>>> >>>> >>>>>> >>>>> >>>> >>>> With regards, >>>>>> >>>>> >>>> >>>> Swogat Pradhan >>>>>> >>>>> >>>> >>>> >>>>>> >>>>> >>>> >>>> On Sun, Feb 26, 2023 at 2:26 PM Swogat Pradhan < >>>>>> >>>>> >>>> >>>> swogatpradhan22@gmail.com> wrote: >>>>>> >>>>> >>>> >>>> >>>>>> >>>>> >>>> >>>>> Hi, >>>>>> >>>>> >>>> >>>>> I currently have 3 compute nodes on edge site1 >>>>>> where i am trying to >>>>>> >>>>> >>>> >>>>> launch vm's. >>>>>> >>>>> >>>> >>>>> When the VM is in spawning state the node goes >>>>>> down (openstack >>>>>> >>>>> >>>> compute >>>>>> >>>>> >>>> >>>>> service list), the node comes backup when i >>>>>> restart the nova >>>>>> >>>>> >>>> compute >>>>>> >>>>> >>>> >>>>> service but then the launch of the vm fails. >>>>>> >>>>> >>>> >>>>> >>>>>> >>>>> >>>> >>>>> nova-compute.log >>>>>> >>>>> >>>> >>>>> >>>>>> >>>>> >>>> >>>>> 2023-02-26 08:15:51.808 7 INFO >>>>>> nova.compute.manager >>>>>> >>>>> >>>> >>>>> [req-bc0f5f2e-53fc-4dae-b1da-82f1f972d617 - - >>>>>> - - -] Running >>>>>> >>>>> >>>> >>>>> instance usage >>>>>> >>>>> >>>> >>>>> audit for host dcn01-hci-0.bdxworld.com from >>>>>> 2023-02-26 07:00:00 >>>>>> >>>>> >>>> to >>>>>> >>>>> >>>> >>>>> 2023-02-26 08:00:00. 0 instances. 
>>>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:52.813 7 INFO >>>>>> nova.compute.claims >>>>>> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>>>>> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>>>>> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default >>>>>> default] [instance: >>>>>> >>>>> >>>> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Claim >>>>>> successful on node >>>>>> >>>>> >>>> >>>>> dcn01-hci-0.bdxworld.com >>>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:54.225 7 INFO >>>>>> nova.virt.libvirt.driver >>>>>> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>>>>> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>>>>> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default >>>>>> default] [instance: >>>>>> >>>>> >>>> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Ignoring >>>>>> supplied device >>>>>> >>>>> >>>> name: >>>>>> >>>>> >>>> >>>>> /dev/vda. Libvirt can't honour user-supplied >>>>>> dev names >>>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:54.398 7 INFO >>>>>> nova.virt.block_device >>>>>> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>>>>> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>>>>> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default >>>>>> default] [instance: >>>>>> >>>>> >>>> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Booting >>>>>> with volume >>>>>> >>>>> >>>> >>>>> c4bd7885-5973-4860-bbe6-7a2f726baeee at >>>>>> /dev/vda >>>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:55.216 7 WARNING >>>>>> nova.cache_utils >>>>>> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>>>>> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>>>>> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default >>>>>> default] Cache enabled >>>>>> >>>>> >>>> with >>>>>> >>>>> >>>> >>>>> backend dogpile.cache.null. 
>>>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:55.283 7 INFO >>>>>> oslo.privsep.daemon >>>>>> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>>>>> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>>>>> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default >>>>>> default] Running >>>>>> >>>>> >>>> >>>>> privsep helper: >>>>>> >>>>> >>>> >>>>> ['sudo', 'nova-rootwrap', >>>>>> '/etc/nova/rootwrap.conf', >>>>>> >>>>> >>>> 'privsep-helper', >>>>>> >>>>> >>>> >>>>> '--config-file', '/etc/nova/nova.conf', >>>>>> '--config-file', >>>>>> >>>>> >>>> >>>>> '/etc/nova/nova-compute.conf', >>>>>> '--privsep_context', >>>>>> >>>>> >>>> >>>>> 'os_brick.privileged.default', >>>>>> '--privsep_sock_path', >>>>>> >>>>> >>>> >>>>> '/tmp/tmpin40tah6/privsep.sock'] >>>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:55.791 7 INFO >>>>>> oslo.privsep.daemon >>>>>> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>>>>> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>>>>> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default >>>>>> default] Spawned new >>>>>> >>>>> >>>> privsep >>>>>> >>>>> >>>> >>>>> daemon via rootwrap >>>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:55.717 2647 INFO >>>>>> oslo.privsep.daemon [-] privsep >>>>>> >>>>> >>>> >>>>> daemon starting >>>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:55.722 2647 INFO >>>>>> oslo.privsep.daemon [-] privsep >>>>>> >>>>> >>>> >>>>> process running with uid/gid: 0/0 >>>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:55.726 2647 INFO >>>>>> oslo.privsep.daemon [-] privsep >>>>>> >>>>> >>>> >>>>> process running with capabilities >>>>>> (eff/prm/inh): >>>>>> >>>>> >>>> >>>>> CAP_SYS_ADMIN/CAP_SYS_ADMIN/none >>>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:55.726 2647 INFO >>>>>> oslo.privsep.daemon [-] privsep >>>>>> >>>>> >>>> >>>>> daemon running as pid 2647 >>>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:55.956 7 WARNING >>>>>> >>>>> >>>> os_brick.initiator.connectors.nvmeof >>>>>> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>>>>> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>>>>> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default >>>>>> default] Process >>>>>> >>>>> >>>> >>>>> execution error >>>>>> >>>>> >>>> >>>>> in _get_host_uuid: Unexpected error while >>>>>> running command. >>>>>> >>>>> >>>> >>>>> Command: blkid overlay -s UUID -o value >>>>>> >>>>> >>>> >>>>> Exit code: 2 >>>>>> >>>>> >>>> >>>>> Stdout: '' >>>>>> >>>>> >>>> >>>>> Stderr: '': >>>>>> oslo_concurrency.processutils.ProcessExecutionError: >>>>>> >>>>> >>>> >>>>> Unexpected error while running command. >>>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:58.247 7 INFO >>>>>> nova.virt.libvirt.driver >>>>>> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>>>>> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>>>>> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default >>>>>> default] [instance: >>>>>> >>>>> >>>> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Creating >>>>>> image >>>>>> >>>>> >>>> >>>>> >>>>>> >>>>> >>>> >>>>> Is there a way to solve this issue? >>>>>> >>>>> >>>> >>>>> >>>>>> >>>>> >>>> >>>>> >>>>>> >>>>> >>>> >>>>> With regards, >>>>>> >>>>> >>>> >>>>> >>>>>> >>>>> >>>> >>>>> Swogat Pradhan >>>>>> >>>>> >>>> >>>>> >>>>>> >>>>> >>>> >>>> >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> >>>>>> >>>>> >>>> >>>>>> >>>>> >>>>>> >>>>>>
Hi Alan, Thank you so much. The issue was with the images: qcow2 images are not working. I used raw images instead, and it now takes about 2 minutes to spawn the instances without any issues. Thank you. With regards, Swogat Pradhan
On Thu, Mar 23, 2023 at 9:01 AM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi, Can someone please help me identify the issue here? Latest cinder-volume logs from dcn02: (ATTACHED)
It's really not possible to analyze what's happening with just one or two log entries. Do you have debug logs enabled? One thing I noticed is the glance image's disk_format is qcow2. You should use "raw" images with ceph RBD.
Alan
The volume is stuck in creating state.
With regards, Swogat Pradhan
On Thu, Mar 23, 2023 at 6:12 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi Jhon, Thank you for clarifying that. Right now the cinder volume is stuck in *creating *state when adding image as volume source. But when creating an empty volume the volumes are getting created successfully without any errors.
We are getting volume creation request in cinder-volume.log as such: 2023-03-23 12:34:40.152 108 INFO cinder.volume.flows.manager.create_volume [req-18556796-a61c-4097-8fa8-b136ce9814f7 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Volume 872a2ae6-c75b-4fc0-8172-17a29d07a66c: being created as image with specification: {'status': 'creating', 'volume_name': 'volume-872a2ae6-c75b-4fc0-8172-17a29d07a66c', 'volume_size': 1, 'image_id': '131ed4e0-0474-45be-b74a-43b599a7d6c5', 'image_location': ('rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/snap', [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/snap', 'metadata': {'store': 'dcn02'}}]), 'image_meta': {'name': 'cirros', 'disk_format': 'qcow2', 'container_format': 'bare', 'visibility': 'public', 'size': 16338944, 'virtual_size': 117440512, 'status': 'active', 'checksum': '1d3062cd89af34e419f7100277f38b2b', 'protected': False, 'min_ram': 0, 'min_disk': 0, 'owner': '4160ce999a31485fa643aed0936dfef0', 'os_hidden': False, 'os_hash_algo': 'sha512', 'os_hash_value': '553d220ed58cfee7dafe003c446a9f197ab5edf8ffc09396c74187cf83873c877e7ae041cb80f3b91489acf687183adcd689b53b38e3ddd22e627e7f98a09c46', 'id': '131ed4e0-0474-45be-b74a-43b599a7d6c5', 'created_at': datetime.datetime(2023, 3, 23, 11, 41, 51, tzinfo=datetime.timezone.utc), 'updated_at': datetime.datetime(2023, 3, 23, 11, 46, 37, tzinfo=datetime.timezone.utc), 'locations': [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/snap', 'metadata': {'store': 'dcn02'}}], 'direct_url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/snap', 'tags': [], 'file': '/v2/images/131ed4e0-0474-45be-b74a-43b599a7d6c5/file', 'stores': 'ceph,dcn02', 'properties': {'os_glance_failed_import': '', 'os_glance_importing_to_stores': '', 'owner_specified.openstack.md5': '', 'owner_specified.openstack.object': 'images/cirros', 'owner_specified.openstack.sha256': ''}}, 'image_service': <cinder.image.glance.GlanceImageService object at 0x7f98d869ed68>}
But there is nothing else after that, and the volume doesn't even time out; it just stays stuck in the creating state. Can you advise what the issue might be here? All the containers are in a healthy state now.
With regards, Swogat Pradhan
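One way to get more detail when the create flow goes quiet like this is to turn on debug logging for cinder-volume on the edge node and retry; a minimal sketch, assuming the usual TripleO config path and container name for this service (both are assumptions for this environment):

$ sudo crudini --set /var/lib/config-data/puppet-generated/cinder/etc/cinder/cinder.conf DEFAULT debug True
$ sudo podman restart cinder_volume
$ sudo tail -f /var/log/containers/cinder/cinder-volume.log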
On Thu, Mar 23, 2023 at 6:06 PM Alan Bishop <abishop@redhat.com> wrote:
On Thu, Mar 23, 2023 at 5:20 AM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
Hi, Is this bind not required for the cinder_scheduler container?
"/var/lib/tripleo-config/ceph:/var/lib/kolla/config_files/src-ceph:ro,rprivate,rbind", I do not see this particular bind on the cinder scheduler containers on my controller nodes.
That is correct, because the scheduler does not access the ceph cluster.
Alan
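If there is any doubt about which binds a given container actually has, they can be read from the running container; a quick sketch (container names are examples):

$ sudo podman inspect cinder_volume | grep src-ceph
$ sudo podman inspect cinder_scheduler | grep src-ceph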
With regards, Swogat Pradhan
On Thu, Mar 23, 2023 at 2:46 AM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
Cinder volume config:
[tripleo_ceph]
volume_backend_name=tripleo_ceph
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_user=openstack
rbd_pool=volumes
rbd_flatten_volume_from_snapshot=False
rbd_secret_uuid=a8d5f1f5-48e7-5ede-89ab-8aca59b6397b
report_discard_supported=True
rbd_ceph_conf=/etc/ceph/dcn02.conf
rbd_cluster_name=dcn02
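Following the earlier advice in this thread that rbd_secret_uuid should match the FSID of the local DCN ceph cluster, a quick cross-check might look like this (the conf path is the one referenced above, as seen from inside the containers; container names are assumptions):

$ sudo podman exec cinder_volume grep fsid /etc/ceph/dcn02.conf
$ sudo podman exec nova_virtsecretd virsh secret-get-value a8d5f1f5-48e7-5ede-89ab-8aca59b6397b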
Glance api config:
[dcn02]
rbd_store_ceph_conf=/etc/ceph/dcn02.conf
rbd_store_user=openstack
rbd_store_pool=images
rbd_thin_provisioning=False
store_description=dcn02 rbd glance store

[ceph]
rbd_store_ceph_conf=/etc/ceph/ceph.conf
rbd_store_user=openstack
rbd_store_pool=images
rbd_thin_provisioning=False
store_description=Default glance store backend.
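With the two stores defined above, it is worth confirming that an image really landed in the dcn02 store and that volumes cloned from it are COW children of the local copy; a rough sketch reusing the commands John showed earlier (cluster and keyring names are examples):

$ sudo cephadm shell --config /etc/ceph/dcn02.conf --keyring /etc/ceph/dcn02.client.admin.keyring
$ rbd -p images ls -l
$ rbd -p volumes ls -l

A volume created from the image should show the image snapshot in its PARENT column; if that column is empty and creation is slow, the data is being streamed from the central site instead.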
On Thu, Mar 23, 2023 at 2:29 AM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
> I still have the same issue, I'm not sure what's left to try. > All the pods are now in a healthy state, I am getting log entries 3 > mins after I hit the create volume button in cinder-volume when I try to > create a volume with an image. > And the volumes are just stuck in creating state for more than 20 > mins now. > > Cinder logs: > 2023-03-22 20:32:44.010 108 INFO cinder.rpc > [req-0d2093a0-efbd-45a5-bd7d-cce25ddc200e b240e3e89d99489284cd731e75f2a5db > 4160ce999a31485fa643aed0936dfef0 - - -] Automatically selected > cinder-volume RPC version 3.17 as minimum service version. > 2023-03-22 20:34:59.166 108 INFO > cinder.volume.flows.manager.create_volume > [req-0d2093a0-efbd-45a5-bd7d-cce25ddc200e b240e3e89d99489284cd731e75f2a5db > 4160ce999a31485fa643aed0936dfef0 - - -] Volume > 5743a879-090d-46db-bc7c-1c0b0669a112: being created as image with > specification: {'status': 'creating', 'volume_name': > 'volume-5743a879-090d-46db-bc7c-1c0b0669a112', 'volume_size': 2, > 'image_id': 'acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b', 'image_location': > ('rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', > [{'url': > 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', > 'metadata': {'store': 'ceph'}}, {'url': > 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', > 'metadata': {'store': 'dcn02'}}]), 'image_meta': {'name': 'cirros', > 'disk_format': 'qcow2', 'container_format': 'bare', 'visibility': 'public', > 'size': 16338944, 'virtual_size': 117440512, 'status': 'active', > 'checksum': '1d3062cd89af34e419f7100277f38b2b', 'protected': False, > 'min_ram': 0, 'min_disk': 0, 'owner': '4160ce999a31485fa643aed0936dfef0', > 'os_hidden': False, 'os_hash_algo': 'sha512', 'os_hash_value': > '553d220ed58cfee7dafe003c446a9f197ab5edf8ffc09396c74187cf83873c877e7ae041cb80f3b91489acf687183adcd689b53b38e3ddd22e627e7f98a09c46', > 'id': 'acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b', 'created_at': > datetime.datetime(2023, 3, 22, 18, 50, 5, tzinfo=datetime.timezone.utc), > 'updated_at': datetime.datetime(2023, 3, 22, 20, 3, 54, > tzinfo=datetime.timezone.utc), 'locations': [{'url': > 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', > 'metadata': {'store': 'ceph'}}, {'url': > 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', > 'metadata': {'store': 'dcn02'}}], 'direct_url': > 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/snap', > 'tags': [], 'file': '/v2/images/acfd0a14-69e0-44d6-a6a1-aa9dc83e9d5b/file', > 'stores': 'ceph,dcn02', 'properties': {'os_glance_failed_import': '', > 'os_glance_importing_to_stores': '', 'owner_specified.openstack.md5': '', > 'owner_specified.openstack.object': 'images/cirros', > 'owner_specified.openstack.sha256': ''}}, 'image_service': > <cinder.image.glance.GlanceImageService object at 0x7f8147973438>} > > With regards, > Swogat Pradhan > > On Wed, Mar 22, 2023 at 9:19 PM Alan Bishop <abishop@redhat.com> > wrote: > >> >> >> On Wed, Mar 22, 2023 at 8:38 AM Swogat Pradhan < >> swogatpradhan22@gmail.com> wrote: >> >>> Hi Adam, >>> The systems are in same LAN, in this case it seemed like the image >>> was getting pulled from the central site which was caused due to an >>> misconfiguration in ceph.conf file in /var/lib/tripleo-config/ceph/ >>> directory, which seems to have been resolved after the changes i made to >>> fix it. 
>>> >>> Right now the glance api podman is running in unhealthy state and >>> the podman logs don't show any error whatsoever and when issued the command >>> netstat -nultp i do not see any entry for glance port i.e. 9292 in the dcn >>> site, which is why cinder is throwing an error stating: >>> >>> 2023-03-22 13:32:29.786 108 ERROR oslo_messaging.rpc.server >>> cinder.exception.GlanceConnectionFailed: Connection to glance failed: Error >>> finding address for >>> http://172.25.228.253:9292/v2/images/736d8779-07cd-4510-bab2-adcb653cc538: >>> Unable to establish connection to >>> http://172.25.228.253:9292/v2/images/736d8779-07cd-4510-bab2-adcb653cc538: >>> HTTPConnectionPool(host='172.25.228.253', port=9292): Max retries exceeded >>> with url: /v2/images/736d8779-07cd-4510-bab2-adcb653cc538 (Caused by >>> NewConnectionError('<urllib3.connection.HTTPConnection object at >>> 0x7f7682d2cd30>: Failed to establish a new connection: [Errno 111] >>> ECONNREFUSED',)) >>> >>> Now i need to find out why the port is not listed as the glance >>> service is running, which i am not sure how to find out. >>> >> >> One other thing to investigate is whether your deployment includes >> this patch [1]. If it does, then bear in mind >> the glance-api service running at the edge site will be an >> "internal" (non public facing) instance that uses port 9293 >> instead of 9292. You should familiarize yourself with the release >> note [2]. >> >> [1] >> https://opendev.org/openstack/tripleo-heat-templates/commit/3605d45e417a77a1... >> [2] >> https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/walla... >> >> Alan >> >> >>> With regards, >>> Swogat Pradhan >>> >>> On Wed, Mar 22, 2023 at 8:11 PM Alan Bishop <abishop@redhat.com> >>> wrote: >>> >>>> >>>> >>>> On Wed, Mar 22, 2023 at 6:37 AM Swogat Pradhan < >>>> swogatpradhan22@gmail.com> wrote: >>>> >>>>> Update: >>>>> Here is the log when creating a volume using cirros image: >>>>> >>>>> 2023-03-22 11:04:38.449 109 INFO >>>>> cinder.volume.flows.manager.create_volume >>>>> [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db >>>>> 4160ce999a31485fa643aed0936dfef0 - - -] Volume >>>>> bf341343-6609-4b8c-b9e0-93e2a89c8c8f: being created as image with >>>>> specification: {'status': 'creating', 'volume_name': >>>>> 'volume-bf341343-6609-4b8c-b9e0-93e2a89c8c8f', 'volume_size': 4, >>>>> 'image_id': '736d8779-07cd-4510-bab2-adcb653cc538', 'image_location': >>>>> ('rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', >>>>> [{'url': >>>>> 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', >>>>> 'metadata': {'store': 'ceph'}}, {'url': >>>>> 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', >>>>> 'metadata': {'store': 'dcn02'}}]), 'image_meta': {'name': 'cirros', >>>>> 'disk_format': 'qcow2', 'container_format': 'bare', 'visibility': 'public', >>>>> 'size': 16338944, 'virtual_size': 117440512, 'status': 'active', >>>>> 'checksum': '1d3062cd89af34e419f7100277f38b2b', 'protected': False, >>>>> 'min_ram': 0, 'min_disk': 0, 'owner': '4160ce999a31485fa643aed0936dfef0', >>>>> 'os_hidden': False, 'os_hash_algo': 'sha512', 'os_hash_value': >>>>> '553d220ed58cfee7dafe003c446a9f197ab5edf8ffc09396c74187cf83873c877e7ae041cb80f3b91489acf687183adcd689b53b38e3ddd22e627e7f98a09c46', >>>>> 'id': '736d8779-07cd-4510-bab2-adcb653cc538', 'created_at': >>>>> datetime.datetime(2023, 3, 22, 10, 44, 12, 
tzinfo=datetime.timezone.utc), >>>>> 'updated_at': datetime.datetime(2023, 3, 22, 10, 54, 1, >>>>> tzinfo=datetime.timezone.utc), 'locations': [{'url': >>>>> 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', >>>>> 'metadata': {'store': 'ceph'}}, {'url': >>>>> 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', >>>>> 'metadata': {'store': 'dcn02'}}], 'direct_url': >>>>> 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', >>>>> 'tags': [], 'file': '/v2/images/736d8779-07cd-4510-bab2-adcb653cc538/file', >>>>> 'stores': 'ceph,dcn02', 'properties': {'os_glance_failed_import': '', >>>>> 'os_glance_importing_to_stores': '', 'owner_specified.openstack.md5': '', >>>>> 'owner_specified.openstack.object': 'images/cirros', >>>>> 'owner_specified.openstack.sha256': ''}}, 'image_service': >>>>> <cinder.image.glance.GlanceImageService object at 0x7f449ded1198>} >>>>> 2023-03-22 11:06:16.570 109 INFO cinder.image.image_utils >>>>> [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db >>>>> 4160ce999a31485fa643aed0936dfef0 - - -] Image download 15.58 MB at 0.16 MB/s >>>>> >>>> >>>> As Adam Savage would say, well there's your problem ^^ (Image >>>> download 15.58 MB at 0.16 MB/s). Downloading the image takes too long, and >>>> 0.16 MB/s suggests you have a network issue. >>>> >>>> John Fulton previously stated your cinder-volume service at the >>>> edge site is not using the local ceph image store. Assuming you are >>>> deploying GlanceApiEdge service [1], then the cinder-volume service should >>>> be configured to use the local glance service [2]. You should check >>>> cinder's glance_api_servers to confirm it's the edge site's glance service. >>>> >>>> [1] >>>> https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/envi... >>>> [2] >>>> https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/depl... >>>> >>>> Alan >>>> >>>> >>>>> 2023-03-22 11:07:54.023 109 WARNING py.warnings >>>>> [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db >>>>> 4160ce999a31485fa643aed0936dfef0 - - -] >>>>> /usr/lib/python3.6/site-packages/oslo_utils/imageutils.py:75: >>>>> FutureWarning: The human format is deprecated and the format parameter will >>>>> be removed. Use explicitly json instead in version 'xena' >>>>> category=FutureWarning) >>>>> >>>>> 2023-03-22 11:11:12.161 109 WARNING py.warnings >>>>> [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db >>>>> 4160ce999a31485fa643aed0936dfef0 - - -] >>>>> /usr/lib/python3.6/site-packages/oslo_utils/imageutils.py:75: >>>>> FutureWarning: The human format is deprecated and the format parameter will >>>>> be removed. 
Use explicitly json instead in version 'xena' >>>>> category=FutureWarning) >>>>> >>>>> 2023-03-22 11:11:12.163 109 INFO cinder.image.image_utils >>>>> [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db >>>>> 4160ce999a31485fa643aed0936dfef0 - - -] Converted 112.00 MB image at 112.00 >>>>> MB/s >>>>> 2023-03-22 11:11:14.998 109 INFO >>>>> cinder.volume.flows.manager.create_volume >>>>> [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db >>>>> 4160ce999a31485fa643aed0936dfef0 - - -] Volume >>>>> volume-bf341343-6609-4b8c-b9e0-93e2a89c8c8f >>>>> (bf341343-6609-4b8c-b9e0-93e2a89c8c8f): created successfully >>>>> 2023-03-22 11:11:15.195 109 INFO cinder.volume.manager >>>>> [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db >>>>> 4160ce999a31485fa643aed0936dfef0 - - -] Created volume successfully. >>>>> >>>>> The image is present in dcn02 store but still it downloaded the >>>>> image in 0.16 MB/s and then created the volume. >>>>> >>>>> With regards, >>>>> Swogat Pradhan >>>>> >>>>> On Tue, Mar 21, 2023 at 6:10 PM Swogat Pradhan < >>>>> swogatpradhan22@gmail.com> wrote: >>>>> >>>>>> Hi Jhon, >>>>>> This seems to be an issue. >>>>>> When i deployed the dcn ceph in both dcn01 and dcn02 the >>>>>> --cluster parameter was specified to the respective cluster names but the >>>>>> config files were created in the name of ceph.conf and keyring was >>>>>> ceph.client.openstack.keyring. >>>>>> >>>>>> Which created issues in glance as well as the naming convention >>>>>> of the files didn't match the cluster names, so i had to manually rename >>>>>> the central ceph conf file as such: >>>>>> >>>>>> [root@dcn02-compute-0 ~]# cd /var/lib/tripleo-config/ceph/ >>>>>> [root@dcn02-compute-0 ceph]# ll >>>>>> total 16 >>>>>> -rw-------. 1 root root 257 Mar 13 13:56 >>>>>> ceph_central.client.openstack.keyring >>>>>> -rw-r--r--. 1 root root 428 Mar 13 13:56 ceph_central.conf >>>>>> -rw-------. 1 root root 205 Mar 15 18:45 >>>>>> ceph.client.openstack.keyring >>>>>> -rw-r--r--. 1 root root 362 Mar 15 18:45 ceph.conf >>>>>> [root@dcn02-compute-0 ceph]# >>>>>> >>>>>> ceph.conf and ceph.client.openstack.keyring contain the fsid of >>>>>> the respective clusters in both dcn01 and dcn02. >>>>>> In the above cli output, the ceph.conf and ceph.client... are >>>>>> the files used to access dcn02 ceph cluster and ceph_central* files are >>>>>> used in for accessing central ceph cluster. >>>>>> >>>>>> glance multistore config: >>>>>> [dcn02] >>>>>> rbd_store_ceph_conf=/etc/ceph/ceph.conf >>>>>> rbd_store_user=openstack >>>>>> rbd_store_pool=images >>>>>> rbd_thin_provisioning=False >>>>>> store_description=dcn02 rbd glance store >>>>>> >>>>>> [ceph_central] >>>>>> rbd_store_ceph_conf=/etc/ceph/ceph_central.conf >>>>>> rbd_store_user=openstack >>>>>> rbd_store_pool=images >>>>>> rbd_thin_provisioning=False >>>>>> store_description=Default glance store backend. >>>>>> >>>>>> >>>>>> With regards, >>>>>> Swogat Pradhan >>>>>> >>>>>> On Tue, Mar 21, 2023 at 5:52 PM John Fulton < >>>>>> johfulto@redhat.com> wrote: >>>>>> >>>>>>> On Tue, Mar 21, 2023 at 8:03 AM Swogat Pradhan >>>>>>> <swogatpradhan22@gmail.com> wrote: >>>>>>> > >>>>>>> > Hi, >>>>>>> > Seems like cinder is not using the local ceph. >>>>>>> >>>>>>> That explains the issue. It's a misconfiguration. >>>>>>> >>>>>>> I hope this is not a production system since the mailing list >>>>>>> now has >>>>>>> the cinder.conf which contains passwords. 
>>>>>>> >>>>>>> The section that looks like this: >>>>>>> >>>>>>> [tripleo_ceph] >>>>>>> volume_backend_name=tripleo_ceph >>>>>>> volume_driver=cinder.volume.drivers.rbd.RBDDriver >>>>>>> rbd_ceph_conf=/etc/ceph/ceph.conf >>>>>>> rbd_user=openstack >>>>>>> rbd_pool=volumes >>>>>>> rbd_flatten_volume_from_snapshot=False >>>>>>> rbd_secret_uuid=<redacted> >>>>>>> report_discard_supported=True >>>>>>> >>>>>>> Should be updated to refer to the local DCN ceph cluster and >>>>>>> not the >>>>>>> central one. Use the ceph conf file for that cluster and >>>>>>> ensure the >>>>>>> rbd_secret_uuid corresponds to that one. >>>>>>> >>>>>>> TripleO’s convention is to set the rbd_secret_uuid to the FSID >>>>>>> of the >>>>>>> Ceph cluster. The FSID should be in the ceph.conf file. The >>>>>>> tripleo_nova_libvirt role will use virsh secret-* commands so >>>>>>> that >>>>>>> libvirt can retrieve the cephx secret using the FSID as a key. >>>>>>> This >>>>>>> can be confirmed with `podman exec nova_virtsecretd virsh >>>>>>> secret-get-value $FSID`. >>>>>>> >>>>>>> The documentation describes how to configure the central and >>>>>>> DCN sites >>>>>>> correctly but an error seems to have occurred while you were >>>>>>> following >>>>>>> it. >>>>>>> >>>>>>> >>>>>>> https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features... >>>>>>> >>>>>>> John >>>>>>> >>>>>>> > >>>>>>> > Ceph Output: >>>>>>> > [ceph: root@dcn02-ceph-all-0 /]# rbd -p images ls -l >>>>>>> > NAME SIZE PARENT >>>>>>> FMT PROT LOCK >>>>>>> > 2abfafaa-eff4-4c2e-a538-dc2e1249ab65 8 MiB >>>>>>> 2 excl >>>>>>> > 55f40c8a-8f79-48c5-a52a-9b679b762f19 16 MiB >>>>>>> 2 >>>>>>> > 55f40c8a-8f79-48c5-a52a-9b679b762f19@snap 16 MiB >>>>>>> 2 yes >>>>>>> > 59f6a9cd-721c-45b5-a15f-fd021b08160d 321 MiB >>>>>>> 2 >>>>>>> > 59f6a9cd-721c-45b5-a15f-fd021b08160d@snap 321 MiB >>>>>>> 2 yes >>>>>>> > 5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0 386 MiB >>>>>>> 2 >>>>>>> > 5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0@snap 386 MiB >>>>>>> 2 yes >>>>>>> > 9b27248e-a8cf-4f00-a039-d3e3066cd26a 15 GiB >>>>>>> 2 >>>>>>> > 9b27248e-a8cf-4f00-a039-d3e3066cd26a@snap 15 GiB >>>>>>> 2 yes >>>>>>> > b7356adc-bb47-4c05-968b-6d3c9ca0079b 15 GiB >>>>>>> 2 >>>>>>> > b7356adc-bb47-4c05-968b-6d3c9ca0079b@snap 15 GiB >>>>>>> 2 yes >>>>>>> > e77e78ad-d369-4a1d-b758-8113621269a3 15 GiB >>>>>>> 2 >>>>>>> > e77e78ad-d369-4a1d-b758-8113621269a3@snap 15 GiB >>>>>>> 2 yes >>>>>>> > >>>>>>> > [ceph: root@dcn02-ceph-all-0 /]# rbd -p volumes ls -l >>>>>>> > NAME SIZE >>>>>>> PARENT FMT PROT LOCK >>>>>>> > volume-c644086f-d3cf-406d-b0f1-7691bde5981d 100 GiB >>>>>>> 2 >>>>>>> > volume-f0969935-a742-4744-9375-80bf323e4d63 10 GiB >>>>>>> 2 >>>>>>> > [ceph: root@dcn02-ceph-all-0 /]# >>>>>>> > >>>>>>> > Attached the cinder config. >>>>>>> > Please let me know how I can solve this issue. >>>>>>> > >>>>>>> > With regards, >>>>>>> > Swogat Pradhan >>>>>>> > >>>>>>> > On Tue, Mar 21, 2023 at 3:53 PM John Fulton < >>>>>>> johfulto@redhat.com> wrote: >>>>>>> >> >>>>>>> >> in my last message under the line "On a DCN site if you run >>>>>>> a command like this:" I suggested some steps you could try to confirm the >>>>>>> image is a COW from the local glance as well as how to look at your cinder >>>>>>> config. >>>>>>> >> >>>>>>> >> On Tue, Mar 21, 2023, 12:06 AM Swogat Pradhan < >>>>>>> swogatpradhan22@gmail.com> wrote: >>>>>>> >>> >>>>>>> >>> Update: >>>>>>> >>> I uploaded an image directly to the dcn02 store, and it >>>>>>> takes around 10,15 minutes to create a volume with image in dcn02. 
>>>>>>> >>> The image size is 389 MB. >>>>>>> >>> >>>>>>> >>> On Mon, Mar 20, 2023 at 10:26 PM Swogat Pradhan < >>>>>>> swogatpradhan22@gmail.com> wrote: >>>>>>> >>>> >>>>>>> >>>> Hi Jhon, >>>>>>> >>>> I checked in the ceph od dcn02, I can see the images >>>>>>> created after importing from the central site. >>>>>>> >>>> But launching an instance normally fails as it takes a >>>>>>> long time for the volume to get created. >>>>>>> >>>> >>>>>>> >>>> When launching an instance from volume the instance is >>>>>>> getting created properly without any errors. >>>>>>> >>>> >>>>>>> >>>> I tried to cache images in nova using >>>>>>> https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep... >>>>>>> but getting checksum failed error. >>>>>>> >>>> >>>>>>> >>>> With regards, >>>>>>> >>>> Swogat Pradhan >>>>>>> >>>> >>>>>>> >>>> On Thu, Mar 16, 2023 at 5:24 PM John Fulton < >>>>>>> johfulto@redhat.com> wrote: >>>>>>> >>>>> >>>>>>> >>>>> On Wed, Mar 15, 2023 at 8:05 PM Swogat Pradhan >>>>>>> >>>>> <swogatpradhan22@gmail.com> wrote: >>>>>>> >>>>> > >>>>>>> >>>>> > Update: After restarting the nova services on the >>>>>>> controller and running the deploy script on the edge site, I was able to >>>>>>> launch the VM from volume. >>>>>>> >>>>> > >>>>>>> >>>>> > Right now the instance creation is failing as the >>>>>>> block device creation is stuck in creating state, it is taking more than 10 >>>>>>> mins for the volume to be created, whereas the image has already been >>>>>>> imported to the edge glance. >>>>>>> >>>>> >>>>>>> >>>>> Try following this document and making the same >>>>>>> observations in your >>>>>>> >>>>> environment for AZs and their local ceph cluster. >>>>>>> >>>>> >>>>>>> >>>>> >>>>>>> https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features... >>>>>>> >>>>> >>>>>>> >>>>> On a DCN site if you run a command like this: >>>>>>> >>>>> >>>>>>> >>>>> $ sudo cephadm shell --config /etc/ceph/dcn0.conf >>>>>>> --keyring >>>>>>> >>>>> /etc/ceph/dcn0.client.admin.keyring >>>>>>> >>>>> $ rbd --cluster dcn0 -p volumes ls -l >>>>>>> >>>>> NAME SIZE PARENT >>>>>>> >>>>> FMT PROT LOCK >>>>>>> >>>>> volume-28c6fc32-047b-4306-ad2d-de2be02716b7 8 GiB >>>>>>> >>>>> images/8083c7e7-32d8-4f7a-b1da-0ed7884f1076@snap 2 >>>>>>> excl >>>>>>> >>>>> $ >>>>>>> >>>>> >>>>>>> >>>>> Then, you should see the parent of the volume is the >>>>>>> image which is on >>>>>>> >>>>> the same local ceph cluster. >>>>>>> >>>>> >>>>>>> >>>>> I wonder if something is misconfigured and thus you're >>>>>>> encountering >>>>>>> >>>>> the streaming behavior described here: >>>>>>> >>>>> >>>>>>> >>>>> Ideally all images should reside in the central Glance >>>>>>> and be copied >>>>>>> >>>>> to DCN sites before instances of those images are booted >>>>>>> on DCN sites. >>>>>>> >>>>> If an image is not copied to a DCN site before it is >>>>>>> booted, then the >>>>>>> >>>>> image will be streamed to the DCN site and then the >>>>>>> image will boot as >>>>>>> >>>>> an instance. This happens because Glance at the DCN site >>>>>>> has access to >>>>>>> >>>>> the images store at the Central ceph cluster. Though the >>>>>>> booting of >>>>>>> >>>>> the image will take time because it has not been copied >>>>>>> in advance, >>>>>>> >>>>> this is still preferable to failing to boot the image. >>>>>>> >>>>> >>>>>>> >>>>> You can also exec into the cinder container at the DCN >>>>>>> site and >>>>>>> >>>>> confirm it's using it's local ceph cluster. 
>>>>>>> >>>>> >>>>>>> >>>>> John >>>>>>> >>>>> >>>>>>> >>>>> > >>>>>>> >>>>> > I will try and create a new fresh image and test again >>>>>>> then update. >>>>>>> >>>>> > >>>>>>> >>>>> > With regards, >>>>>>> >>>>> > Swogat Pradhan >>>>>>> >>>>> > >>>>>>> >>>>> > On Wed, Mar 15, 2023 at 11:13 PM Swogat Pradhan < >>>>>>> swogatpradhan22@gmail.com> wrote: >>>>>>> >>>>> >> >>>>>>> >>>>> >> Update: >>>>>>> >>>>> >> In the hypervisor list the compute node state is >>>>>>> showing down. >>>>>>> >>>>> >> >>>>>>> >>>>> >> >>>>>>> >>>>> >> On Wed, Mar 15, 2023 at 11:11 PM Swogat Pradhan < >>>>>>> swogatpradhan22@gmail.com> wrote: >>>>>>> >>>>> >>> >>>>>>> >>>>> >>> Hi Brendan, >>>>>>> >>>>> >>> Now i have deployed another site where i have used 2 >>>>>>> linux bonds network template for both 3 compute nodes and 3 ceph nodes. >>>>>>> >>>>> >>> The bonding options is set to mode=802.3ad >>>>>>> (lacp=active). >>>>>>> >>>>> >>> I used a cirros image to launch instance but the >>>>>>> instance timed out so i waited for the volume to be created. >>>>>>> >>>>> >>> Once the volume was created i tried launching the >>>>>>> instance from the volume and still the instance is stuck in spawning state. >>>>>>> >>>>> >>> >>>>>>> >>>>> >>> Here is the nova-compute log: >>>>>>> >>>>> >>> >>>>>>> >>>>> >>> 2023-03-15 17:35:47.739 185437 INFO >>>>>>> oslo.privsep.daemon [-] privsep daemon starting >>>>>>> >>>>> >>> 2023-03-15 17:35:47.744 185437 INFO >>>>>>> oslo.privsep.daemon [-] privsep process running with uid/gid: 0/0 >>>>>>> >>>>> >>> 2023-03-15 17:35:47.749 185437 INFO >>>>>>> oslo.privsep.daemon [-] privsep process running with capabilities >>>>>>> (eff/prm/inh): CAP_SYS_ADMIN/CAP_SYS_ADMIN/none >>>>>>> >>>>> >>> 2023-03-15 17:35:47.749 185437 INFO >>>>>>> oslo.privsep.daemon [-] privsep daemon running as pid 185437 >>>>>>> >>>>> >>> 2023-03-15 17:35:47.974 8 WARNING >>>>>>> os_brick.initiator.connectors.nvmeof >>>>>>> [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db >>>>>>> 4160ce999a31485fa643aed0936dfef0 - default default] Process execution error >>>>>>> in _get_host_uuid: Unexpected error while running command. >>>>>>> >>>>> >>> Command: blkid overlay -s UUID -o value >>>>>>> >>>>> >>> Exit code: 2 >>>>>>> >>>>> >>> Stdout: '' >>>>>>> >>>>> >>> Stderr: '': >>>>>>> oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while >>>>>>> running command. >>>>>>> >>>>> >>> 2023-03-15 17:35:51.616 8 INFO >>>>>>> nova.virt.libvirt.driver [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 >>>>>>> b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default >>>>>>> default] [instance: 450b749c-a10a-4308-80a9-3b8020fee758] Creating image >>>>>>> >>>>> >>> >>>>>>> >>>>> >>> It is stuck in creating image, do i need to run the >>>>>>> template mentioned here ?: >>>>>>> https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep... >>>>>>> >>>>> >>> >>>>>>> >>>>> >>> The volume is already created and i do not >>>>>>> understand why the instance is stuck in spawning state. >>>>>>> >>>>> >>> >>>>>>> >>>>> >>> With regards, >>>>>>> >>>>> >>> Swogat Pradhan >>>>>>> >>>>> >>> >>>>>>> >>>>> >>> >>>>>>> >>>>> >>> On Sun, Mar 5, 2023 at 4:02 PM Brendan Shephard < >>>>>>> bshephar@redhat.com> wrote: >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> Does your environment use different network >>>>>>> interfaces for each of the networks? Or does it have a bond with everything >>>>>>> on it? 
>>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> One issue I have seen before is that when launching >>>>>>> instances, there is a lot of network traffic between nodes as the >>>>>>> hypervisor needs to download the image from Glance. Along with various >>>>>>> other services sending normal network traffic, it can be enough to cause >>>>>>> issues if everything is running over a single 1Gbe interface. >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> I have seen the same situation in fact when using a >>>>>>> single active/backup bond on 1Gbe nics. It’s worth checking the network >>>>>>> traffic while you try to spawn the instance to see if you’re dropping >>>>>>> packets. In the situation I described, there were dropped packets which >>>>>>> resulted in a loss of communication between nova_compute and RMQ, so the >>>>>>> node appeared offline. You should also confirm that nova_compute is being >>>>>>> disconnected in the nova_compute logs if you tail them on the Hypervisor >>>>>>> while spawning the instance. >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> In my case, changing from active/backup to LACP >>>>>>> helped. So, based on that experience, from my perspective, is certainly >>>>>>> sounds like some kind of network issue. >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> Regards, >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> Brendan Shephard >>>>>>> >>>>> >>>> Senior Software Engineer >>>>>>> >>>>> >>>> Red Hat Australia >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> On 5 Mar 2023, at 6:47 am, Eugen Block < >>>>>>> eblock@nde.ag> wrote: >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> Hi, >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> I tried to help someone with a similar issue some >>>>>>> time ago in this thread: >>>>>>> >>>>> >>>> >>>>>>> https://serverfault.com/questions/1116771/openstack-oslo-messaging-exception... >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> But apparently a neutron reinstallation fixed it >>>>>>> for that user, not sure if that could apply here. But is it possible that >>>>>>> your nova and neutron versions are different between central and edge site? >>>>>>> Have you restarted nova and neutron services on the compute nodes after >>>>>>> installation? Have you debug logs of nova-conductor and maybe nova-compute? >>>>>>> Maybe they can help narrow down the issue. >>>>>>> >>>>> >>>> If there isn't any additional information in the >>>>>>> debug logs I probably would start "tearing down" rabbitmq. I didn't have to >>>>>>> do that in a production system yet so be careful. I can think of two routes: >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> - Either remove queues, exchanges etc. while rabbit >>>>>>> is running, this will most likely impact client IO depending on your load. >>>>>>> Check out the rabbitmqctl commands. >>>>>>> >>>>> >>>> - Or stop the rabbitmq cluster, remove the mnesia >>>>>>> tables from all nodes and restart rabbitmq so the exchanges, queues etc. >>>>>>> rebuild. >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> I can imagine that the failed reply "survives" >>>>>>> while being replicated across the rabbit nodes. But I don't really know the >>>>>>> rabbit internals too well, so maybe someone else can chime in here and give >>>>>>> a better advice. >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> Regards, >>>>>>> >>>>> >>>> Eugen >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> Zitat von Swogat Pradhan <swogatpradhan22@gmail.com >>>>>>> >: >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> Hi, >>>>>>> >>>>> >>>> Can someone please help me out on this issue? 
>>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> With regards, >>>>>>> >>>>> >>>> Swogat Pradhan >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> On Thu, Mar 2, 2023 at 1:24 PM Swogat Pradhan < >>>>>>> swogatpradhan22@gmail.com> >>>>>>> >>>>> >>>> wrote: >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> Hi >>>>>>> >>>>> >>>> I don't see any major packet loss. >>>>>>> >>>>> >>>> It seems the problem is somewhere in rabbitmq maybe >>>>>>> but not due to packet >>>>>>> >>>>> >>>> loss. >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> with regards, >>>>>>> >>>>> >>>> Swogat Pradhan >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> On Wed, Mar 1, 2023 at 3:34 PM Swogat Pradhan < >>>>>>> swogatpradhan22@gmail.com> >>>>>>> >>>>> >>>> wrote: >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> Hi, >>>>>>> >>>>> >>>> Yes the MTU is the same as the default '1500'. >>>>>>> >>>>> >>>> Generally I haven't seen any packet loss, but never >>>>>>> checked when >>>>>>> >>>>> >>>> launching the instance. >>>>>>> >>>>> >>>> I will check that and come back. >>>>>>> >>>>> >>>> But everytime i launch an instance the instance >>>>>>> gets stuck at spawning >>>>>>> >>>>> >>>> state and there the hypervisor becomes down, so not >>>>>>> sure if packet loss >>>>>>> >>>>> >>>> causes this. >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> With regards, >>>>>>> >>>>> >>>> Swogat pradhan >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> On Wed, Mar 1, 2023 at 3:30 PM Eugen Block < >>>>>>> eblock@nde.ag> wrote: >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> One more thing coming to mind is MTU size. Are they >>>>>>> identical between >>>>>>> >>>>> >>>> central and edge site? Do you see packet loss >>>>>>> through the tunnel? >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> Zitat von Swogat Pradhan <swogatpradhan22@gmail.com >>>>>>> >: >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> > Hi Eugen, >>>>>>> >>>>> >>>> > Request you to please add my email either on 'to' >>>>>>> or 'cc' as i am not >>>>>>> >>>>> >>>> > getting email's from you. >>>>>>> >>>>> >>>> > Coming to the issue: >>>>>>> >>>>> >>>> > >>>>>>> >>>>> >>>> > [root@overcloud-controller-no-ceph-3 /]# >>>>>>> rabbitmqctl list_policies -p >>>>>>> >>>>> >>>> / >>>>>>> >>>>> >>>> > Listing policies for vhost "/" ... >>>>>>> >>>>> >>>> > vhost name pattern apply-to >>>>>>> definition priority >>>>>>> >>>>> >>>> > / ha-all ^(?!amq\.).* queues >>>>>>> >>>>> >>>> > >>>>>>> >>>>> >>>> >>>>>>> {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"} 0 >>>>>>> >>>>> >>>> > >>>>>>> >>>>> >>>> > I have the edge site compute nodes up, it only >>>>>>> goes down when i am >>>>>>> >>>>> >>>> trying >>>>>>> >>>>> >>>> > to launch an instance and the instance comes to a >>>>>>> spawning state and >>>>>>> >>>>> >>>> then >>>>>>> >>>>> >>>> > gets stuck. >>>>>>> >>>>> >>>> > >>>>>>> >>>>> >>>> > I have a tunnel setup between the central and the >>>>>>> edge sites. >>>>>>> >>>>> >>>> > >>>>>>> >>>>> >>>> > With regards, >>>>>>> >>>>> >>>> > Swogat Pradhan >>>>>>> >>>>> >>>> > >>>>>>> >>>>> >>>> > On Tue, Feb 28, 2023 at 9:11 PM Swogat Pradhan < >>>>>>> >>>>> >>>> swogatpradhan22@gmail.com> >>>>>>> >>>>> >>>> > wrote: >>>>>>> >>>>> >>>> > >>>>>>> >>>>> >>>> >> Hi Eugen, >>>>>>> >>>>> >>>> >> For some reason i am not getting your email to >>>>>>> me directly, i am >>>>>>> >>>>> >>>> checking >>>>>>> >>>>> >>>> >> the email digest and there i am able to find >>>>>>> your reply. >>>>>>> >>>>> >>>> >> Here is the log for download: >>>>>>> https://we.tl/t-L8FEkGZFSq >>>>>>> >>>>> >>>> >> Yes, these logs are from the time when the issue >>>>>>> occurred. 
>>>>>>> >>>>> >>>> >> >>>>>>> >>>>> >>>> >> *Note: i am able to create vm's and perform >>>>>>> other activities in the >>>>>>> >>>>> >>>> >> central site, only facing this issue in the edge >>>>>>> site.* >>>>>>> >>>>> >>>> >> >>>>>>> >>>>> >>>> >> With regards, >>>>>>> >>>>> >>>> >> Swogat Pradhan >>>>>>> >>>>> >>>> >> >>>>>>> >>>>> >>>> >> On Mon, Feb 27, 2023 at 5:12 PM Swogat Pradhan < >>>>>>> >>>>> >>>> swogatpradhan22@gmail.com> >>>>>>> >>>>> >>>> >> wrote: >>>>>>> >>>>> >>>> >> >>>>>>> >>>>> >>>> >>> Hi Eugen, >>>>>>> >>>>> >>>> >>> Thanks for your response. >>>>>>> >>>>> >>>> >>> I have actually a 4 controller setup so here >>>>>>> are the details: >>>>>>> >>>>> >>>> >>> >>>>>>> >>>>> >>>> >>> *PCS Status:* >>>>>>> >>>>> >>>> >>> * Container bundle set: rabbitmq-bundle [ >>>>>>> >>>>> >>>> >>> >>>>>>> 172.25.201.68:8787/tripleomaster/openstack-rabbitmq:pcmklatest >>>>>>> ]: >>>>>>> >>>>> >>>> >>> * rabbitmq-bundle-0 >>>>>>> (ocf::heartbeat:rabbitmq-cluster): >>>>>>> >>>>> >>>> Started >>>>>>> >>>>> >>>> >>> overcloud-controller-no-ceph-3 >>>>>>> >>>>> >>>> >>> * rabbitmq-bundle-1 >>>>>>> (ocf::heartbeat:rabbitmq-cluster): >>>>>>> >>>>> >>>> Started >>>>>>> >>>>> >>>> >>> overcloud-controller-2 >>>>>>> >>>>> >>>> >>> * rabbitmq-bundle-2 >>>>>>> (ocf::heartbeat:rabbitmq-cluster): >>>>>>> >>>>> >>>> Started >>>>>>> >>>>> >>>> >>> overcloud-controller-1 >>>>>>> >>>>> >>>> >>> * rabbitmq-bundle-3 >>>>>>> (ocf::heartbeat:rabbitmq-cluster): >>>>>>> >>>>> >>>> Started >>>>>>> >>>>> >>>> >>> overcloud-controller-0 >>>>>>> >>>>> >>>> >>> >>>>>>> >>>>> >>>> >>> I have tried restarting the bundle multiple >>>>>>> times but the issue is >>>>>>> >>>>> >>>> still >>>>>>> >>>>> >>>> >>> present. >>>>>>> >>>>> >>>> >>> >>>>>>> >>>>> >>>> >>> *Cluster status:* >>>>>>> >>>>> >>>> >>> [root@overcloud-controller-0 /]# rabbitmqctl >>>>>>> cluster_status >>>>>>> >>>>> >>>> >>> Cluster status of node >>>>>>> >>>>> >>>> >>> >>>>>>> rabbit@overcloud-controller-0.internalapi.bdxworld.com ... 
>>>>>>> >>>>> >>>> >>> Basics >>>>>>> >>>>> >>>> >>> >>>>>>> >>>>> >>>> >>> Cluster name: >>>>>>> rabbit@overcloud-controller-no-ceph-3.bdxworld.com >>>>>>> >>>>> >>>> >>> >>>>>>> >>>>> >>>> >>> Disk Nodes >>>>>>> >>>>> >>>> >>> >>>>>>> >>>>> >>>> >>> >>>>>>> rabbit@overcloud-controller-0.internalapi.bdxworld.com >>>>>>> >>>>> >>>> >>> >>>>>>> rabbit@overcloud-controller-1.internalapi.bdxworld.com >>>>>>> >>>>> >>>> >>> >>>>>>> rabbit@overcloud-controller-2.internalapi.bdxworld.com >>>>>>> >>>>> >>>> >>> >>>>>>> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >>>>>>> >>>>> >>>> >>> >>>>>>> >>>>> >>>> >>> Running Nodes >>>>>>> >>>>> >>>> >>> >>>>>>> >>>>> >>>> >>> >>>>>>> rabbit@overcloud-controller-0.internalapi.bdxworld.com >>>>>>> >>>>> >>>> >>> >>>>>>> rabbit@overcloud-controller-1.internalapi.bdxworld.com >>>>>>> >>>>> >>>> >>> >>>>>>> rabbit@overcloud-controller-2.internalapi.bdxworld.com >>>>>>> >>>>> >>>> >>> >>>>>>> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >>>>>>> >>>>> >>>> >>> >>>>>>> >>>>> >>>> >>> Versions >>>>>>> >>>>> >>>> >>> >>>>>>> >>>>> >>>> >>> >>>>>>> rabbit@overcloud-controller-0.internalapi.bdxworld.com: >>>>>>> RabbitMQ >>>>>>> >>>>> >>>> 3.8.3 >>>>>>> >>>>> >>>> >>> on Erlang 22.3.4.1 >>>>>>> >>>>> >>>> >>> >>>>>>> rabbit@overcloud-controller-1.internalapi.bdxworld.com: >>>>>>> RabbitMQ >>>>>>> >>>>> >>>> 3.8.3 >>>>>>> >>>>> >>>> >>> on Erlang 22.3.4.1 >>>>>>> >>>>> >>>> >>> >>>>>>> rabbit@overcloud-controller-2.internalapi.bdxworld.com: >>>>>>> RabbitMQ >>>>>>> >>>>> >>>> 3.8.3 >>>>>>> >>>>> >>>> >>> on Erlang 22.3.4.1 >>>>>>> >>>>> >>>> >>> >>>>>>> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >>>>>>> : >>>>>>> >>>>> >>>> RabbitMQ >>>>>>> >>>>> >>>> >>> 3.8.3 on Erlang 22.3.4.1 >>>>>>> >>>>> >>>> >>> >>>>>>> >>>>> >>>> >>> Alarms >>>>>>> >>>>> >>>> >>> >>>>>>> >>>>> >>>> >>> (none) >>>>>>> >>>>> >>>> >>> >>>>>>> >>>>> >>>> >>> Network Partitions >>>>>>> >>>>> >>>> >>> >>>>>>> >>>>> >>>> >>> (none) >>>>>>> >>>>> >>>> >>> >>>>>>> >>>>> >>>> >>> Listeners >>>>>>> >>>>> >>>> >>> >>>>>>> >>>>> >>>> >>> Node: >>>>>>> rabbit@overcloud-controller-0.internalapi.bdxworld.com, >>>>>>> >>>>> >>>> interface: >>>>>>> >>>>> >>>> >>> [::], port: 25672, protocol: clustering, >>>>>>> purpose: inter-node and CLI >>>>>>> >>>>> >>>> tool >>>>>>> >>>>> >>>> >>> communication >>>>>>> >>>>> >>>> >>> Node: >>>>>>> rabbit@overcloud-controller-0.internalapi.bdxworld.com, >>>>>>> >>>>> >>>> interface: >>>>>>> >>>>> >>>> >>> 172.25.201.212, port: 5672, protocol: amqp, >>>>>>> purpose: AMQP 0-9-1 >>>>>>> >>>>> >>>> >>> and AMQP 1.0 >>>>>>> >>>>> >>>> >>> Node: >>>>>>> rabbit@overcloud-controller-0.internalapi.bdxworld.com, >>>>>>> >>>>> >>>> interface: >>>>>>> >>>>> >>>> >>> [::], port: 15672, protocol: http, purpose: >>>>>>> HTTP API >>>>>>> >>>>> >>>> >>> Node: >>>>>>> rabbit@overcloud-controller-1.internalapi.bdxworld.com, >>>>>>> >>>>> >>>> interface: >>>>>>> >>>>> >>>> >>> [::], port: 25672, protocol: clustering, >>>>>>> purpose: inter-node and CLI >>>>>>> >>>>> >>>> tool >>>>>>> >>>>> >>>> >>> communication >>>>>>> >>>>> >>>> >>> Node: >>>>>>> rabbit@overcloud-controller-1.internalapi.bdxworld.com, >>>>>>> >>>>> >>>> interface: >>>>>>> >>>>> >>>> >>> 172.25.201.205, port: 5672, protocol: amqp, >>>>>>> purpose: AMQP 0-9-1 >>>>>>> >>>>> >>>> >>> and AMQP 1.0 >>>>>>> >>>>> >>>> >>> Node: >>>>>>> rabbit@overcloud-controller-1.internalapi.bdxworld.com, >>>>>>> >>>>> >>>> interface: >>>>>>> >>>>> >>>> >>> [::], port: 15672, protocol: http, 
purpose: >>>>>>> HTTP API >>>>>>> >>>>> >>>> >>> Node: >>>>>>> rabbit@overcloud-controller-2.internalapi.bdxworld.com, >>>>>>> >>>>> >>>> interface: >>>>>>> >>>>> >>>> >>> [::], port: 25672, protocol: clustering, >>>>>>> purpose: inter-node and CLI >>>>>>> >>>>> >>>> tool >>>>>>> >>>>> >>>> >>> communication >>>>>>> >>>>> >>>> >>> Node: >>>>>>> rabbit@overcloud-controller-2.internalapi.bdxworld.com, >>>>>>> >>>>> >>>> interface: >>>>>>> >>>>> >>>> >>> 172.25.201.201, port: 5672, protocol: amqp, >>>>>>> purpose: AMQP 0-9-1 >>>>>>> >>>>> >>>> >>> and AMQP 1.0 >>>>>>> >>>>> >>>> >>> Node: >>>>>>> rabbit@overcloud-controller-2.internalapi.bdxworld.com, >>>>>>> >>>>> >>>> interface: >>>>>>> >>>>> >>>> >>> [::], port: 15672, protocol: http, purpose: >>>>>>> HTTP API >>>>>>> >>>>> >>>> >>> Node: >>>>>>> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >>>>>>> >>>>> >>>> , >>>>>>> >>>>> >>>> >>> interface: [::], port: 25672, protocol: >>>>>>> clustering, purpose: >>>>>>> >>>>> >>>> inter-node and >>>>>>> >>>>> >>>> >>> CLI tool communication >>>>>>> >>>>> >>>> >>> Node: >>>>>>> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >>>>>>> >>>>> >>>> , >>>>>>> >>>>> >>>> >>> interface: 172.25.201.209, port: 5672, >>>>>>> protocol: amqp, purpose: AMQP >>>>>>> >>>>> >>>> 0-9-1 >>>>>>> >>>>> >>>> >>> and AMQP 1.0 >>>>>>> >>>>> >>>> >>> Node: >>>>>>> rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com >>>>>>> >>>>> >>>> , >>>>>>> >>>>> >>>> >>> interface: [::], port: 15672, protocol: http, >>>>>>> purpose: HTTP API >>>>>>> >>>>> >>>> >>> >>>>>>> >>>>> >>>> >>> Feature flags >>>>>>> >>>>> >>>> >>> >>>>>>> >>>>> >>>> >>> Flag: drop_unroutable_metric, state: enabled >>>>>>> >>>>> >>>> >>> Flag: empty_basic_get_metric, state: enabled >>>>>>> >>>>> >>>> >>> Flag: implicit_default_bindings, state: enabled >>>>>>> >>>>> >>>> >>> Flag: quorum_queue, state: enabled >>>>>>> >>>>> >>>> >>> Flag: virtual_host_metadata, state: enabled >>>>>>> >>>>> >>>> >>> >>>>>>> >>>>> >>>> >>> *Logs:* >>>>>>> >>>>> >>>> >>> *(Attached)* >>>>>>> >>>>> >>>> >>> >>>>>>> >>>>> >>>> >>> With regards, >>>>>>> >>>>> >>>> >>> Swogat Pradhan >>>>>>> >>>>> >>>> >>> >>>>>>> >>>>> >>>> >>> On Sun, Feb 26, 2023 at 2:34 PM Swogat Pradhan < >>>>>>> >>>>> >>>> swogatpradhan22@gmail.com> >>>>>>> >>>>> >>>> >>> wrote: >>>>>>> >>>>> >>>> >>> >>>>>>> >>>>> >>>> >>>> Hi, >>>>>>> >>>>> >>>> >>>> Please find the nova conductor as well as nova >>>>>>> api log. 
>>>>>>> >>>>> >>>> >>>> >>>>>>> >>>>> >>>> >>>> nova-conuctor: >>>>>>> >>>>> >>>> >>>> >>>>>>> >>>>> >>>> >>>> 2023-02-26 08:45:01.108 31 WARNING >>>>>>> >>>>> >>>> oslo_messaging._drivers.amqpdriver >>>>>>> >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - >>>>>>> - - -] >>>>>>> >>>>> >>>> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't >>>>>>> exist, drop reply to >>>>>>> >>>>> >>>> >>>> 16152921c1eb45c2b1f562087140168b >>>>>>> >>>>> >>>> >>>> 2023-02-26 08:45:02.144 26 WARNING >>>>>>> >>>>> >>>> oslo_messaging._drivers.amqpdriver >>>>>>> >>>>> >>>> >>>> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - >>>>>>> - - -] >>>>>>> >>>>> >>>> >>>> reply_276049ec36a84486a8a406911d9802f4 doesn't >>>>>>> exist, drop reply to >>>>>>> >>>>> >>>> >>>> 83dbe5f567a940b698acfe986f6194fa >>>>>>> >>>>> >>>> >>>> 2023-02-26 08:45:02.314 32 WARNING >>>>>>> >>>>> >>>> oslo_messaging._drivers.amqpdriver >>>>>>> >>>>> >>>> >>>> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - >>>>>>> - - -] >>>>>>> >>>>> >>>> >>>> reply_276049ec36a84486a8a406911d9802f4 doesn't >>>>>>> exist, drop reply to >>>>>>> >>>>> >>>> >>>> f3bfd7f65bd542b18d84cea3033abb43: >>>>>>> >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >>>>>>> >>>>> >>>> >>>> 2023-02-26 08:45:02.316 32 ERROR >>>>>>> oslo_messaging._drivers.amqpdriver >>>>>>> >>>>> >>>> >>>> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - >>>>>>> - - -] The reply >>>>>>> >>>>> >>>> >>>> f3bfd7f65bd542b18d84cea3033abb43 failed to >>>>>>> send after 60 seconds >>>>>>> >>>>> >>>> due to a >>>>>>> >>>>> >>>> >>>> missing queue >>>>>>> (reply_276049ec36a84486a8a406911d9802f4). >>>>>>> >>>>> >>>> Abandoning...: >>>>>>> >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >>>>>>> >>>>> >>>> >>>> 2023-02-26 08:48:01.282 35 WARNING >>>>>>> >>>>> >>>> oslo_messaging._drivers.amqpdriver >>>>>>> >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - >>>>>>> - - -] >>>>>>> >>>>> >>>> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't >>>>>>> exist, drop reply to >>>>>>> >>>>> >>>> >>>> d4b9180f91a94f9a82c3c9c4b7595566: >>>>>>> >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >>>>>>> >>>>> >>>> >>>> 2023-02-26 08:48:01.284 35 ERROR >>>>>>> oslo_messaging._drivers.amqpdriver >>>>>>> >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - >>>>>>> - - -] The reply >>>>>>> >>>>> >>>> >>>> d4b9180f91a94f9a82c3c9c4b7595566 failed to >>>>>>> send after 60 seconds >>>>>>> >>>>> >>>> due to a >>>>>>> >>>>> >>>> >>>> missing queue >>>>>>> (reply_349bcb075f8c49329435a0f884b33066). >>>>>>> >>>>> >>>> Abandoning...: >>>>>>> >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >>>>>>> >>>>> >>>> >>>> 2023-02-26 08:49:01.303 33 WARNING >>>>>>> >>>>> >>>> oslo_messaging._drivers.amqpdriver >>>>>>> >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - >>>>>>> - - -] >>>>>>> >>>>> >>>> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't >>>>>>> exist, drop reply to >>>>>>> >>>>> >>>> >>>> 897911a234a445d8a0d8af02ece40f6f: >>>>>>> >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >>>>>>> >>>>> >>>> >>>> 2023-02-26 08:49:01.304 33 ERROR >>>>>>> oslo_messaging._drivers.amqpdriver >>>>>>> >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - >>>>>>> - - -] The reply >>>>>>> >>>>> >>>> >>>> 897911a234a445d8a0d8af02ece40f6f failed to >>>>>>> send after 60 seconds >>>>>>> >>>>> >>>> due to a >>>>>>> >>>>> >>>> >>>> missing queue >>>>>>> (reply_349bcb075f8c49329435a0f884b33066). 
>>>>>>> >>>>> >>>> Abandoning...: >>>>>>> >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >>>>>>> >>>>> >>>> >>>> 2023-02-26 08:49:52.254 31 WARNING >>>>>>> nova.cache_utils >>>>>>> >>>>> >>>> >>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>>>>>> >>>>> >>>> b240e3e89d99489284cd731e75f2a5db >>>>>>> >>>>> >>>> >>>> 4160ce999a31485fa643aed0936dfef0 - default >>>>>>> default] Cache enabled >>>>>>> >>>>> >>>> with >>>>>>> >>>>> >>>> >>>> backend dogpile.cache.null. >>>>>>> >>>>> >>>> >>>> 2023-02-26 08:50:01.264 27 WARNING >>>>>>> >>>>> >>>> oslo_messaging._drivers.amqpdriver >>>>>>> >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - >>>>>>> - - -] >>>>>>> >>>>> >>>> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't >>>>>>> exist, drop reply to >>>>>>> >>>>> >>>> >>>> 8f723ceb10c3472db9a9f324861df2bb: >>>>>>> >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >>>>>>> >>>>> >>>> >>>> 2023-02-26 08:50:01.266 27 ERROR >>>>>>> oslo_messaging._drivers.amqpdriver >>>>>>> >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - >>>>>>> - - -] The reply >>>>>>> >>>>> >>>> >>>> 8f723ceb10c3472db9a9f324861df2bb failed to >>>>>>> send after 60 seconds >>>>>>> >>>>> >>>> due to a >>>>>>> >>>>> >>>> >>>> missing queue >>>>>>> (reply_349bcb075f8c49329435a0f884b33066). >>>>>>> >>>>> >>>> Abandoning...: >>>>>>> >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable >>>>>>> >>>>> >>>> >>>> >>>>>>> >>>>> >>>> >>>> With regards, >>>>>>> >>>>> >>>> >>>> Swogat Pradhan >>>>>>> >>>>> >>>> >>>> >>>>>>> >>>>> >>>> >>>> On Sun, Feb 26, 2023 at 2:26 PM Swogat Pradhan >>>>>>> < >>>>>>> >>>>> >>>> >>>> swogatpradhan22@gmail.com> wrote: >>>>>>> >>>>> >>>> >>>> >>>>>>> >>>>> >>>> >>>>> Hi, >>>>>>> >>>>> >>>> >>>>> I currently have 3 compute nodes on edge >>>>>>> site1 where i am trying to >>>>>>> >>>>> >>>> >>>>> launch vm's. >>>>>>> >>>>> >>>> >>>>> When the VM is in spawning state the node >>>>>>> goes down (openstack >>>>>>> >>>>> >>>> compute >>>>>>> >>>>> >>>> >>>>> service list), the node comes backup when i >>>>>>> restart the nova >>>>>>> >>>>> >>>> compute >>>>>>> >>>>> >>>> >>>>> service but then the launch of the vm fails. >>>>>>> >>>>> >>>> >>>>> >>>>>>> >>>>> >>>> >>>>> nova-compute.log >>>>>>> >>>>> >>>> >>>>> >>>>>>> >>>>> >>>> >>>>> 2023-02-26 08:15:51.808 7 INFO >>>>>>> nova.compute.manager >>>>>>> >>>>> >>>> >>>>> [req-bc0f5f2e-53fc-4dae-b1da-82f1f972d617 - - >>>>>>> - - -] Running >>>>>>> >>>>> >>>> >>>>> instance usage >>>>>>> >>>>> >>>> >>>>> audit for host dcn01-hci-0.bdxworld.com from >>>>>>> 2023-02-26 07:00:00 >>>>>>> >>>>> >>>> to >>>>>>> >>>>> >>>> >>>>> 2023-02-26 08:00:00. 0 instances. 
>>>>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:52.813 7 INFO >>>>>>> nova.compute.claims >>>>>>> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>>>>>> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>>>>>> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default >>>>>>> default] [instance: >>>>>>> >>>>> >>>> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Claim >>>>>>> successful on node >>>>>>> >>>>> >>>> >>>>> dcn01-hci-0.bdxworld.com >>>>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:54.225 7 INFO >>>>>>> nova.virt.libvirt.driver >>>>>>> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>>>>>> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>>>>>> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default >>>>>>> default] [instance: >>>>>>> >>>>> >>>> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] >>>>>>> Ignoring supplied device >>>>>>> >>>>> >>>> name: >>>>>>> >>>>> >>>> >>>>> /dev/vda. Libvirt can't honour user-supplied >>>>>>> dev names >>>>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:54.398 7 INFO >>>>>>> nova.virt.block_device >>>>>>> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>>>>>> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>>>>>> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default >>>>>>> default] [instance: >>>>>>> >>>>> >>>> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Booting >>>>>>> with volume >>>>>>> >>>>> >>>> >>>>> c4bd7885-5973-4860-bbe6-7a2f726baeee at >>>>>>> /dev/vda >>>>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:55.216 7 WARNING >>>>>>> nova.cache_utils >>>>>>> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>>>>>> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>>>>>> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default >>>>>>> default] Cache enabled >>>>>>> >>>>> >>>> with >>>>>>> >>>>> >>>> >>>>> backend dogpile.cache.null. 
>>>>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:55.283 7 INFO >>>>>>> oslo.privsep.daemon >>>>>>> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>>>>>> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>>>>>> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default >>>>>>> default] Running >>>>>>> >>>>> >>>> >>>>> privsep helper: >>>>>>> >>>>> >>>> >>>>> ['sudo', 'nova-rootwrap', >>>>>>> '/etc/nova/rootwrap.conf', >>>>>>> >>>>> >>>> 'privsep-helper', >>>>>>> >>>>> >>>> >>>>> '--config-file', '/etc/nova/nova.conf', >>>>>>> '--config-file', >>>>>>> >>>>> >>>> >>>>> '/etc/nova/nova-compute.conf', >>>>>>> '--privsep_context', >>>>>>> >>>>> >>>> >>>>> 'os_brick.privileged.default', >>>>>>> '--privsep_sock_path', >>>>>>> >>>>> >>>> >>>>> '/tmp/tmpin40tah6/privsep.sock'] >>>>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:55.791 7 INFO >>>>>>> oslo.privsep.daemon >>>>>>> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>>>>>> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>>>>>> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default >>>>>>> default] Spawned new >>>>>>> >>>>> >>>> privsep >>>>>>> >>>>> >>>> >>>>> daemon via rootwrap >>>>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:55.717 2647 INFO >>>>>>> oslo.privsep.daemon [-] privsep >>>>>>> >>>>> >>>> >>>>> daemon starting >>>>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:55.722 2647 INFO >>>>>>> oslo.privsep.daemon [-] privsep >>>>>>> >>>>> >>>> >>>>> process running with uid/gid: 0/0 >>>>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:55.726 2647 INFO >>>>>>> oslo.privsep.daemon [-] privsep >>>>>>> >>>>> >>>> >>>>> process running with capabilities >>>>>>> (eff/prm/inh): >>>>>>> >>>>> >>>> >>>>> CAP_SYS_ADMIN/CAP_SYS_ADMIN/none >>>>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:55.726 2647 INFO >>>>>>> oslo.privsep.daemon [-] privsep >>>>>>> >>>>> >>>> >>>>> daemon running as pid 2647 >>>>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:55.956 7 WARNING >>>>>>> >>>>> >>>> os_brick.initiator.connectors.nvmeof >>>>>>> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>>>>>> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>>>>>> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default >>>>>>> default] Process >>>>>>> >>>>> >>>> >>>>> execution error >>>>>>> >>>>> >>>> >>>>> in _get_host_uuid: Unexpected error while >>>>>>> running command. >>>>>>> >>>>> >>>> >>>>> Command: blkid overlay -s UUID -o value >>>>>>> >>>>> >>>> >>>>> Exit code: 2 >>>>>>> >>>>> >>>> >>>>> Stdout: '' >>>>>>> >>>>> >>>> >>>>> Stderr: '': >>>>>>> oslo_concurrency.processutils.ProcessExecutionError: >>>>>>> >>>>> >>>> >>>>> Unexpected error while running command. >>>>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:58.247 7 INFO >>>>>>> nova.virt.libvirt.driver >>>>>>> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 >>>>>>> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db >>>>>>> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default >>>>>>> default] [instance: >>>>>>> >>>>> >>>> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] >>>>>>> Creating image >>>>>>> >>>>> >>>> >>>>> >>>>>>> >>>>> >>>> >>>>> Is there a way to solve this issue? >>>>>>> >>>>> >>>> >>>>> >>>>>>> >>>>> >>>> >>>>> >>>>>>> >>>>> >>>> >>>>> With regards, >>>>>>> >>>>> >>>> >>>>> >>>>>>> >>>>> >>>> >>>>> Swogat Pradhan >>>>>>> >>>>> >>>> >>>>> >>>>>>> >>>>> >>>> >>>> >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>> >>>>>>> >>>>> >>>>>>> >>>>>>>
participants (5)
- Alan Bishop
- Brendan Shephard
- Eugen Block
- John Fulton
- Swogat Pradhan