DCN compute service goes down when an instance is scheduled to launch | wallaby | tripleo

John Fulton johfulto at redhat.com
Wed Mar 22 13:46:05 UTC 2023


On Wed, Mar 22, 2023 at 9:42 AM Swogat Pradhan
<swogatpradhan22 at gmail.com> wrote:
>
> Hi John,
> After some changes I feel like cinder is now trying to pull the image from the local glance, as I am getting the following error in the cinder-volume log:
>
> 2023-03-22 13:32:29.786 108 ERROR oslo_messaging.rpc.server cinder.exception.GlanceConnectionFailed: Connection to glance failed: Error finding address for http://172.25.228.253:9292/v2/images/736d8779-07cd-4510-bab2-adcb653cc538: Unable to establish connection to http://172.25.228.253:9292/v2/images/736d8779-07cd-4510-bab2-adcb653cc538: HTTPConnectionPool(host='172.25.228.253', port=9292): Max retries exceeded with url: /v2/images/736d8779-07cd-4510-bab2-adcb653cc538 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7682d2cd30>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))
>
> The endpoint it is trying to reach is the dcn02 IP address.
>
> But when I check the ports, I don't find port 9292 listening:
> [root at dcn02-compute-2 ceph]# netstat -nultp
> Active Internet connections (only servers)
> Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
> tcp        0      0 0.0.0.0:2022            0.0.0.0:*               LISTEN      656800/sshd
> tcp        0      0 127.0.0.1:199           0.0.0.0:*               LISTEN      4878/snmpd
> tcp        0      0 172.25.228.253:2379     0.0.0.0:*               LISTEN      6232/etcd
> tcp        0      0 172.25.228.253:2380     0.0.0.0:*               LISTEN      6232/etcd
> tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      1/systemd
> tcp        0      0 127.0.0.1:6640          0.0.0.0:*               LISTEN      2779/ovsdb-server
> tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      4918/sshd
> tcp6       0      0 :::2022                 :::*                    LISTEN      656800/sshd
> tcp6       0      0 :::111                  :::*                    LISTEN      1/systemd
> tcp6       0      0 :::22                   :::*                    LISTEN      4918/sshd
> udp        0      0 0.0.0.0:111             0.0.0.0:*                           1/systemd
> udp        0      0 0.0.0.0:161             0.0.0.0:*                           4878/snmpd
> udp        0      0 127.0.0.1:323           0.0.0.0:*                           2609/chronyd
> udp        0      0 0.0.0.0:6081            0.0.0.0:*                           -
> udp6       0      0 :::111                  :::*                                1/systemd
> udp6       0      0 ::1:161                 :::*                                4878/snmpd
> udp6       0      0 ::1:323                 :::*                                2609/chronyd
> udp6       0      0 :::6081                 :::*                                -
>
> I see in glance-api.conf that the bind port parameter is set to 9292, but the port is not listed in the netstat output.
> Can you please guide me in getting this port up and running, as I feel this would solve the issue I am facing right now.

Looks like your glance container stopped running. Ask podman to show
you all containers (including stopped ones) and investigate why the
glance container stopped.
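
For example (the container and systemd unit names below are what a
TripleO Wallaby deployment typically uses; adjust them to whatever
podman actually reports on your node):

$ sudo podman ps -a --filter name=glance
$ sudo podman logs --tail 50 glance_api
$ sudo systemctl status tripleo_glance_api

The systemd unit is worth checking too, since TripleO manages the
containers through tripleo_* units, so its status and journal usually
show why and when the container exited.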

>
> With regards,
> Swogat Pradhan
>
> On Wed, Mar 22, 2023 at 4:55 PM Swogat Pradhan <swogatpradhan22 at gmail.com> wrote:
>>
>> Update:
>> Here is the log when creating a volume using a cirros image:
>>
>> 2023-03-22 11:04:38.449 109 INFO cinder.volume.flows.manager.create_volume [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Volume bf341343-6609-4b8c-b9e0-93e2a89c8c8f: being created as image with specification: {'status': 'creating', 'volume_name': 'volume-bf341343-6609-4b8c-b9e0-93e2a89c8c8f', 'volume_size': 4, 'image_id': '736d8779-07cd-4510-bab2-adcb653cc538', 'image_location': ('rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'dcn02'}}]), 'image_meta': {'name': 'cirros', 'disk_format': 'qcow2', 'container_format': 'bare', 'visibility': 'public', 'size': 16338944, 'virtual_size': 117440512, 'status': 'active', 'checksum': '1d3062cd89af34e419f7100277f38b2b', 'protected': False, 'min_ram': 0, 'min_disk': 0, 'owner': '4160ce999a31485fa643aed0936dfef0', 'os_hidden': False, 'os_hash_algo': 'sha512', 'os_hash_value': '553d220ed58cfee7dafe003c446a9f197ab5edf8ffc09396c74187cf83873c877e7ae041cb80f3b91489acf687183adcd689b53b38e3ddd22e627e7f98a09c46', 'id': '736d8779-07cd-4510-bab2-adcb653cc538', 'created_at': datetime.datetime(2023, 3, 22, 10, 44, 12, tzinfo=datetime.timezone.utc), 'updated_at': datetime.datetime(2023, 3, 22, 10, 54, 1, tzinfo=datetime.timezone.utc), 'locations': [{'url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'ceph'}}, {'url': 'rbd://a8d5f1f5-48e7-5ede-89ab-8aca59b6397b/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'metadata': {'store': 'dcn02'}}], 'direct_url': 'rbd://a5ae877c-bcba-53fe-8336-450e63014757/images/736d8779-07cd-4510-bab2-adcb653cc538/snap', 'tags': [], 'file': '/v2/images/736d8779-07cd-4510-bab2-adcb653cc538/file', 'stores': 'ceph,dcn02', 'properties': {'os_glance_failed_import': '', 'os_glance_importing_to_stores': '', 'owner_specified.openstack.md5': '', 'owner_specified.openstack.object': 'images/cirros', 'owner_specified.openstack.sha256': ''}}, 'image_service': <cinder.image.glance.GlanceImageService object at 0x7f449ded1198>}
>> 2023-03-22 11:06:16.570 109 INFO cinder.image.image_utils [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Image download 15.58 MB at 0.16 MB/s
>> 2023-03-22 11:07:54.023 109 WARNING py.warnings [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] /usr/lib/python3.6/site-packages/oslo_utils/imageutils.py:75: FutureWarning: The human format is deprecated and the format parameter will be removed. Use explicitly json instead in version 'xena'
>>   category=FutureWarning)
>>
>> 2023-03-22 11:11:12.161 109 WARNING py.warnings [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] /usr/lib/python3.6/site-packages/oslo_utils/imageutils.py:75: FutureWarning: The human format is deprecated and the format parameter will be removed. Use explicitly json instead in version 'xena'
>>   category=FutureWarning)
>>
>> 2023-03-22 11:11:12.163 109 INFO cinder.image.image_utils [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Converted 112.00 MB image at 112.00 MB/s
>> 2023-03-22 11:11:14.998 109 INFO cinder.volume.flows.manager.create_volume [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Volume volume-bf341343-6609-4b8c-b9e0-93e2a89c8c8f (bf341343-6609-4b8c-b9e0-93e2a89c8c8f): created successfully
>> 2023-03-22 11:11:15.195 109 INFO cinder.volume.manager [req-646b9ac8-a5a7-45ac-a96d-8dd6bb45da17 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - - -] Created volume successfully.
>>
>> The image is present in the dcn02 store, but it still downloaded the image at 0.16 MB/s and then created the volume.
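>>
>> For reference, which stores an image is associated with can be checked with something like the following (the exact field names can vary by client version):
>>
>> $ openstack image show 736d8779-07cd-4510-bab2-adcb653cc538 -c stores -c locations
>>
>> If the dcn02 RBD location is listed but the volume was still created via a slow download, cinder is going through the glance API instead of cloning the local RBD snapshot.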
>>
>> With regards,
>> Swogat Pradhan
>>
>> On Tue, Mar 21, 2023 at 6:10 PM Swogat Pradhan <swogatpradhan22 at gmail.com> wrote:
>>>
>>> Hi John,
>>> This seems to be an issue.
>>> When I deployed the DCN ceph in both dcn01 and dcn02, the --cluster parameter was set to the respective cluster names, but the config files were created as ceph.conf and the keyring as ceph.client.openstack.keyring.
>>>
>>> This created issues in glance as well, since the naming convention of the files didn't match the cluster names, so I had to manually rename the central ceph conf file as follows:
>>>
>>> [root at dcn02-compute-0 ~]# cd /var/lib/tripleo-config/ceph/
>>> [root at dcn02-compute-0 ceph]# ll
>>> total 16
>>> -rw-------. 1 root root 257 Mar 13 13:56 ceph_central.client.openstack.keyring
>>> -rw-r--r--. 1 root root 428 Mar 13 13:56 ceph_central.conf
>>> -rw-------. 1 root root 205 Mar 15 18:45 ceph.client.openstack.keyring
>>> -rw-r--r--. 1 root root 362 Mar 15 18:45 ceph.conf
>>> [root at dcn02-compute-0 ceph]#
>>>
>>> ceph.conf and ceph.client.openstack.keyring contain the fsid of the respective clusters in both dcn01 and dcn02.
>>> In the CLI output above, ceph.conf and ceph.client.openstack.keyring are the files used to access the dcn02 ceph cluster, and the ceph_central* files are used for accessing the central ceph cluster.
>>>
>>> glance multistore config:
>>> [dcn02]
>>> rbd_store_ceph_conf=/etc/ceph/ceph.conf
>>> rbd_store_user=openstack
>>> rbd_store_pool=images
>>> rbd_thin_provisioning=False
>>> store_description=dcn02 rbd glance store
>>>
>>> [ceph_central]
>>> rbd_store_ceph_conf=/etc/ceph/ceph_central.conf
>>> rbd_store_user=openstack
>>> rbd_store_pool=images
>>> rbd_thin_provisioning=False
>>> store_description=Default glance store backend.
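>>>
>>> A quick sanity check that each store points at the intended cluster is to compare the fsid in each conf file (assuming the glance_api container has them mounted under /etc/ceph, which is the usual TripleO layout):
>>>
>>> $ sudo podman exec glance_api grep fsid /etc/ceph/ceph.conf /etc/ceph/ceph_central.conf
>>>
>>> The fsid from ceph.conf should be the dcn02 cluster's and the one from ceph_central.conf should be the central cluster's.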
>>>
>>>
>>> With regards,
>>> Swogat Pradhan
>>>
>>> On Tue, Mar 21, 2023 at 5:52 PM John Fulton <johfulto at redhat.com> wrote:
>>>>
>>>> On Tue, Mar 21, 2023 at 8:03 AM Swogat Pradhan
>>>> <swogatpradhan22 at gmail.com> wrote:
>>>> >
>>>> > Hi,
>>>> > Seems like cinder is not using the local ceph.
>>>>
>>>> That explains the issue. It's a misconfiguration.
>>>>
>>>> I hope this is not a production system since the mailing list now has
>>>> the cinder.conf which contains passwords.
>>>>
>>>> The section that looks like this:
>>>>
>>>> [tripleo_ceph]
>>>> volume_backend_name=tripleo_ceph
>>>> volume_driver=cinder.volume.drivers.rbd.RBDDriver
>>>> rbd_ceph_conf=/etc/ceph/ceph.conf
>>>> rbd_user=openstack
>>>> rbd_pool=volumes
>>>> rbd_flatten_volume_from_snapshot=False
>>>> rbd_secret_uuid=<redacted>
>>>> report_discard_supported=True
>>>>
>>>> Should be updated to refer to the local DCN ceph cluster and not the
>>>> central one. Use the ceph conf file for that cluster and ensure the
>>>> rbd_secret_uuid corresponds to that one.
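>>>>
>>>> As a sketch, on a dcn02 node the section would end up looking roughly
>>>> like this (the conf path and rbd_secret_uuid are placeholders; use the
>>>> conf file and FSID that actually belong to the dcn02 ceph cluster on
>>>> your system):
>>>>
>>>> [tripleo_ceph]
>>>> volume_backend_name=tripleo_ceph
>>>> volume_driver=cinder.volume.drivers.rbd.RBDDriver
>>>> # must point at the dcn02 cluster's conf, not the central one
>>>> rbd_ceph_conf=/etc/ceph/dcn02.conf
>>>> rbd_user=openstack
>>>> rbd_pool=volumes
>>>> rbd_flatten_volume_from_snapshot=False
>>>> # FSID of the dcn02 ceph cluster
>>>> rbd_secret_uuid=<dcn02 FSID>
>>>> report_discard_supported=True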
>>>>
>>>> TripleO’s convention is to set the rbd_secret_uuid to the FSID of the
>>>> Ceph cluster. The FSID should be in the ceph.conf file. The
>>>> tripleo_nova_libvirt role will use virsh secret-* commands so that
>>>> libvirt can retrieve the cephx secret using the FSID as a key. This
>>>> can be confirmed with `podman exec nova_virtsecretd virsh
>>>> secret-get-value $FSID`.
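>>>>
>>>> A quick way to check this on the DCN node, assuming the local
>>>> cluster's conf is the one cinder points at:
>>>>
>>>> $ sudo grep fsid /etc/ceph/<dcn ceph conf>
>>>> $ sudo podman exec nova_virtsecretd virsh secret-get-value <that FSID>
>>>>
>>>> If virsh returns a cephx key for that FSID, libvirt can attach volumes
>>>> from that cluster; if it errors out, the secret was defined for a
>>>> different (e.g. the central) cluster.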
>>>>
>>>> The documentation describes how to configure the central and DCN sites
>>>> correctly but an error seems to have occurred while you were following
>>>> it.
>>>>
>>>>   https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features/distributed_multibackend_storage.html
>>>>
>>>>   John
>>>>
>>>> >
>>>> > Ceph Output:
>>>> > [ceph: root at dcn02-ceph-all-0 /]# rbd -p images ls -l
>>>> > NAME                                       SIZE     PARENT  FMT  PROT  LOCK
>>>> > 2abfafaa-eff4-4c2e-a538-dc2e1249ab65         8 MiB            2        excl
>>>> > 55f40c8a-8f79-48c5-a52a-9b679b762f19        16 MiB            2
>>>> > 55f40c8a-8f79-48c5-a52a-9b679b762f19 at snap   16 MiB            2  yes
>>>> > 59f6a9cd-721c-45b5-a15f-fd021b08160d       321 MiB            2
>>>> > 59f6a9cd-721c-45b5-a15f-fd021b08160d at snap  321 MiB            2  yes
>>>> > 5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0       386 MiB            2
>>>> > 5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0 at snap  386 MiB            2  yes
>>>> > 9b27248e-a8cf-4f00-a039-d3e3066cd26a        15 GiB            2
>>>> > 9b27248e-a8cf-4f00-a039-d3e3066cd26a at snap   15 GiB            2  yes
>>>> > b7356adc-bb47-4c05-968b-6d3c9ca0079b        15 GiB            2
>>>> > b7356adc-bb47-4c05-968b-6d3c9ca0079b at snap   15 GiB            2  yes
>>>> > e77e78ad-d369-4a1d-b758-8113621269a3        15 GiB            2
>>>> > e77e78ad-d369-4a1d-b758-8113621269a3 at snap   15 GiB            2  yes
>>>> >
>>>> > [ceph: root at dcn02-ceph-all-0 /]# rbd -p volumes ls -l
>>>> > NAME                                         SIZE     PARENT  FMT  PROT  LOCK
>>>> > volume-c644086f-d3cf-406d-b0f1-7691bde5981d  100 GiB            2
>>>> > volume-f0969935-a742-4744-9375-80bf323e4d63   10 GiB            2
>>>> > [ceph: root at dcn02-ceph-all-0 /]#
>>>> >
>>>> > Attached the cinder config.
>>>> > Please let me know how I can solve this issue.
>>>> >
>>>> > With regards,
>>>> > Swogat Pradhan
>>>> >
>>>> > On Tue, Mar 21, 2023 at 3:53 PM John Fulton <johfulto at redhat.com> wrote:
>>>> >>
>>>> >> In my last message, under the line "On a DCN site if you run a command like this:", I suggested some steps you could try to confirm the image is a COW from the local glance, as well as how to look at your cinder config.
>>>> >>
>>>> >> On Tue, Mar 21, 2023, 12:06 AM Swogat Pradhan <swogatpradhan22 at gmail.com> wrote:
>>>> >>>
>>>> >>> Update:
>>>> >>> I uploaded an image directly to the dcn02 store, and it takes around 10 to 15 minutes to create a volume from that image in dcn02.
>>>> >>> The image size is 389 MB.
>>>> >>>
>>>> >>> On Mon, Mar 20, 2023 at 10:26 PM Swogat Pradhan <swogatpradhan22 at gmail.com> wrote:
>>>> >>>>
>>>> >>>> Hi John,
>>>> >>>> I checked the ceph on dcn02, and I can see the images created after importing from the central site.
>>>> >>>> But launching an instance normally fails, as it takes a long time for the volume to get created.
>>>> >>>>
>>>> >>>> When launching an instance from volume the instance is getting created properly without any errors.
>>>> >>>>
>>>> >>>> I tried to cache images in nova using https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_deployment/pre_cache_images.html but I am getting a checksum failed error.
>>>> >>>>
>>>> >>>> With regards,
>>>> >>>> Swogat Pradhan
>>>> >>>>
>>>> >>>> On Thu, Mar 16, 2023 at 5:24 PM John Fulton <johfulto at redhat.com> wrote:
>>>> >>>>>
>>>> >>>>> On Wed, Mar 15, 2023 at 8:05 PM Swogat Pradhan
>>>> >>>>> <swogatpradhan22 at gmail.com> wrote:
>>>> >>>>> >
>>>> >>>>> > Update: After restarting the nova services on the controller and running the deploy script on the edge site, I was able to launch the VM from volume.
>>>> >>>>> >
>>>> >>>>> > Right now the instance creation is failing because the block device creation is stuck in the creating state; it is taking more than 10 minutes for the volume to be created, even though the image has already been imported to the edge glance.
>>>> >>>>>
>>>> >>>>> Try following this document and making the same observations in your
>>>> >>>>> environment for AZs and their local ceph cluster.
>>>> >>>>>
>>>> >>>>> https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features/distributed_multibackend_storage.html#confirm-images-may-be-copied-between-sites
>>>> >>>>>
>>>> >>>>> On a DCN site if you run a command like this:
>>>> >>>>>
>>>> >>>>> $ sudo cephadm shell --config /etc/ceph/dcn0.conf --keyring
>>>> >>>>> /etc/ceph/dcn0.client.admin.keyring
>>>> >>>>> $ rbd --cluster dcn0 -p volumes ls -l
>>>> >>>>> NAME                                      SIZE  PARENT
>>>> >>>>>                           FMT PROT LOCK
>>>> >>>>> volume-28c6fc32-047b-4306-ad2d-de2be02716b7 8 GiB
>>>> >>>>> images/8083c7e7-32d8-4f7a-b1da-0ed7884f1076 at snap   2      excl
>>>> >>>>> $
>>>> >>>>>
>>>> >>>>> Then, you should see the parent of the volume is the image which is on
>>>> >>>>> the same local ceph cluster.
>>>> >>>>>
>>>> >>>>> I wonder if something is misconfigured and thus you're encountering
>>>> >>>>> the streaming behavior described here:
>>>> >>>>>
>>>> >>>>> Ideally all images should reside in the central Glance and be copied
>>>> >>>>> to DCN sites before instances of those images are booted on DCN sites.
>>>> >>>>> If an image is not copied to a DCN site before it is booted, then the
>>>> >>>>> image will be streamed to the DCN site and then the image will boot as
>>>> >>>>> an instance. This happens because Glance at the DCN site has access to
>>>> >>>>> the images store at the Central ceph cluster. Though the booting of
>>>> >>>>> the image will take time because it has not been copied in advance,
>>>> >>>>> this is still preferable to failing to boot the image.
>>>> >>>>>
>>>> >>>>> You can also exec into the cinder container at the DCN site and
>>>> >>>>> confirm it's using its local ceph cluster.
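>>>> >>>>>
>>>> >>>>> Something like this, assuming the container is named cinder_volume as
>>>> >>>>> TripleO usually does:
>>>> >>>>>
>>>> >>>>> $ sudo podman exec cinder_volume grep -A8 '^\[tripleo_ceph\]' /etc/cinder/cinder.conf
>>>> >>>>> $ sudo podman exec cinder_volume grep fsid /etc/ceph/ceph.conf
>>>> >>>>>
>>>> >>>>> The rbd_ceph_conf and rbd_secret_uuid shown there should correspond to
>>>> >>>>> the DCN cluster, not the central one.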
>>>> >>>>>
>>>> >>>>>   John
>>>> >>>>>
>>>> >>>>> >
>>>> >>>>> > I will try and create a new fresh image and test again then update.
>>>> >>>>> >
>>>> >>>>> > With regards,
>>>> >>>>> > Swogat Pradhan
>>>> >>>>> >
>>>> >>>>> > On Wed, Mar 15, 2023 at 11:13 PM Swogat Pradhan <swogatpradhan22 at gmail.com> wrote:
>>>> >>>>> >>
>>>> >>>>> >> Update:
>>>> >>>>> >> In the hypervisor list the compute node state is showing down.
>>>> >>>>> >>
>>>> >>>>> >>
>>>> >>>>> >> On Wed, Mar 15, 2023 at 11:11 PM Swogat Pradhan <swogatpradhan22 at gmail.com> wrote:
>>>> >>>>> >>>
>>>> >>>>> >>> Hi Brendan,
>>>> >>>>> >>> Now I have deployed another site where I have used a network template with 2 Linux bonds for both the 3 compute nodes and the 3 ceph nodes.
>>>> >>>>> >>> The bonding option is set to mode=802.3ad (lacp=active).
>>>> >>>>> >>> I used a cirros image to launch an instance, but the instance timed out, so I waited for the volume to be created.
>>>> >>>>> >>> Once the volume was created I tried launching the instance from the volume, and the instance is still stuck in the spawning state.
>>>> >>>>> >>>
>>>> >>>>> >>> Here is the nova-compute log:
>>>> >>>>> >>>
>>>> >>>>> >>> 2023-03-15 17:35:47.739 185437 INFO oslo.privsep.daemon [-] privsep daemon starting
>>>> >>>>> >>> 2023-03-15 17:35:47.744 185437 INFO oslo.privsep.daemon [-] privsep process running with uid/gid: 0/0
>>>> >>>>> >>> 2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon [-] privsep process running with capabilities (eff/prm/inh): CAP_SYS_ADMIN/CAP_SYS_ADMIN/none
>>>> >>>>> >>> 2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon [-] privsep daemon running as pid 185437
>>>> >>>>> >>> 2023-03-15 17:35:47.974 8 WARNING os_brick.initiator.connectors.nvmeof [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Process execution error in _get_host_uuid: Unexpected error while running command.
>>>> >>>>> >>> Command: blkid overlay -s UUID -o value
>>>> >>>>> >>> Exit code: 2
>>>> >>>>> >>> Stdout: ''
>>>> >>>>> >>> Stderr: '': oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
>>>> >>>>> >>> 2023-03-15 17:35:51.616 8 INFO nova.virt.libvirt.driver [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] [instance: 450b749c-a10a-4308-80a9-3b8020fee758] Creating image
>>>> >>>>> >>>
>>>> >>>>> >>> It is stuck at 'Creating image'. Do I need to run the template mentioned here?: https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_deployment/pre_cache_images.html
>>>> >>>>> >>>
>>>> >>>>> >>> The volume is already created, and I do not understand why the instance is stuck in the spawning state.
>>>> >>>>> >>>
>>>> >>>>> >>> With regards,
>>>> >>>>> >>> Swogat Pradhan
>>>> >>>>> >>>
>>>> >>>>> >>>
>>>> >>>>> >>> On Sun, Mar 5, 2023 at 4:02 PM Brendan Shephard <bshephar at redhat.com> wrote:
>>>> >>>>> >>>>
>>>> >>>>> >>>> Does your environment use different network interfaces for each of the networks? Or does it have a bond with everything on it?
>>>> >>>>> >>>>
>>>> >>>>> >>>> One issue I have seen before is that when launching instances, there is a lot of network traffic between nodes as the hypervisor needs to download the image from Glance. Along with various other services sending normal network traffic, it can be enough to cause issues if everything is running over a single 1Gbe interface.
>>>> >>>>> >>>>
>>>> >>>>> >>>> I have seen the same situation in fact when using a single active/backup bond on 1Gbe nics. It’s worth checking the network traffic while you try to spawn the instance to see if you’re dropping packets. In the situation I described, there were dropped packets which resulted in a loss of communication between nova_compute and RMQ, so the node appeared offline. You should also confirm that nova_compute is being disconnected in the nova_compute logs if you tail them on the Hypervisor while spawning the instance.
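>>>> >>>>> >>>>
>>>> >>>>> >>>> One rough way to watch for this while a spawn is in progress (the interface name is just an example; use your bond or NIC):
>>>> >>>>> >>>>
>>>> >>>>> >>>> $ watch -n1 'ip -s link show bond0'
>>>> >>>>> >>>> $ sudo tail -f /var/log/containers/nova/nova-compute.log
>>>> >>>>> >>>>
>>>> >>>>> >>>> If the RX/TX dropped counters climb while nova-compute logs AMQP disconnects at the same time, that points at the network.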
>>>> >>>>> >>>>
>>>> >>>>> >>>> In my case, changing from active/backup to LACP helped. So, based on that experience, from my perspective, it certainly sounds like some kind of network issue.
>>>> >>>>> >>>>
>>>> >>>>> >>>> Regards,
>>>> >>>>> >>>>
>>>> >>>>> >>>> Brendan Shephard
>>>> >>>>> >>>> Senior Software Engineer
>>>> >>>>> >>>> Red Hat Australia
>>>> >>>>> >>>>
>>>> >>>>> >>>>
>>>> >>>>> >>>>
>>>> >>>>> >>>> On 5 Mar 2023, at 6:47 am, Eugen Block <eblock at nde.ag> wrote:
>>>> >>>>> >>>>
>>>> >>>>> >>>> Hi,
>>>> >>>>> >>>>
>>>> >>>>> >>>> I tried to help someone with a similar issue some time ago in this thread:
>>>> >>>>> >>>> https://serverfault.com/questions/1116771/openstack-oslo-messaging-exception-in-nova-conductor
>>>> >>>>> >>>>
>>>> >>>>> >>>> But apparently a neutron reinstallation fixed it for that user, not sure if that could apply here. But is it possible that your nova and neutron versions are different between the central and edge sites? Have you restarted the nova and neutron services on the compute nodes after installation? Do you have debug logs of nova-conductor and maybe nova-compute? Maybe they can help narrow down the issue.
>>>> >>>>> >>>> If there isn't any additional information in the debug logs, I would probably start "tearing down" rabbitmq. I haven't had to do that in a production system yet, so be careful. I can think of two routes:
>>>> >>>>> >>>>
>>>> >>>>> >>>> - Either remove queues, exchanges etc. while rabbit is running, this will most likely impact client IO depending on your load. Check out the rabbitmqctl commands.
>>>> >>>>> >>>> - Or stop the rabbitmq cluster, remove the mnesia tables from all nodes and restart rabbitmq so the exchanges, queues etc. rebuild.
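>>>> >>>>> >>>>
>>>> >>>>> >>>> For the first route, the general shape would be something like this (a sketch only; the container name follows the usual pacemaker podman-bundle pattern, the queue name is just one of the reply_ queues from your nova-conductor log, and if your rabbitmqctl doesn't have delete_queue, rabbitmqadmin from the management plugin can do the same):
>>>> >>>>> >>>>
>>>> >>>>> >>>> $ sudo podman exec -it rabbitmq-bundle-podman-0 bash
>>>> >>>>> >>>> $ rabbitmqctl list_queues -p / name messages consumers | grep reply_
>>>> >>>>> >>>> $ rabbitmqctl delete_queue reply_349bcb075f8c49329435a0f884b33066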
>>>> >>>>> >>>>
>>>> >>>>> >>>> I can imagine that the failed reply "survives" while being replicated across the rabbit nodes. But I don't really know the rabbit internals too well, so maybe someone else can chime in here and give better advice.
>>>> >>>>> >>>>
>>>> >>>>> >>>> Regards,
>>>> >>>>> >>>> Eugen
>>>> >>>>> >>>>
>>>> >>>>> >>>> Zitat von Swogat Pradhan <swogatpradhan22 at gmail.com>:
>>>> >>>>> >>>>
>>>> >>>>> >>>> Hi,
>>>> >>>>> >>>> Can someone please help me out on this issue?
>>>> >>>>> >>>>
>>>> >>>>> >>>> With regards,
>>>> >>>>> >>>> Swogat Pradhan
>>>> >>>>> >>>>
>>>> >>>>> >>>> On Thu, Mar 2, 2023 at 1:24 PM Swogat Pradhan <swogatpradhan22 at gmail.com>
>>>> >>>>> >>>> wrote:
>>>> >>>>> >>>>
>>>> >>>>> >>>> Hi
>>>> >>>>> >>>> I don't see any major packet loss.
>>>> >>>>> >>>> It seems the problem is maybe somewhere in rabbitmq, but not due to
>>>> >>>>> >>>> packet loss.
>>>> >>>>> >>>>
>>>> >>>>> >>>> with regards,
>>>> >>>>> >>>> Swogat Pradhan
>>>> >>>>> >>>>
>>>> >>>>> >>>> On Wed, Mar 1, 2023 at 3:34 PM Swogat Pradhan <swogatpradhan22 at gmail.com>
>>>> >>>>> >>>> wrote:
>>>> >>>>> >>>>
>>>> >>>>> >>>> Hi,
>>>> >>>>> >>>> Yes, the MTU is the same as the default '1500'.
>>>> >>>>> >>>> Generally I haven't seen any packet loss, but I never checked while
>>>> >>>>> >>>> launching an instance.
>>>> >>>>> >>>> I will check that and come back.
>>>> >>>>> >>>> But every time I launch an instance, the instance gets stuck in the
>>>> >>>>> >>>> spawning state and the hypervisor goes down, so I am not sure if
>>>> >>>>> >>>> packet loss causes this.
>>>> >>>>> >>>>
>>>> >>>>> >>>> With regards,
>>>> >>>>> >>>> Swogat pradhan
>>>> >>>>> >>>>
>>>> >>>>> >>>> On Wed, Mar 1, 2023 at 3:30 PM Eugen Block <eblock at nde.ag> wrote:
>>>> >>>>> >>>>
>>>> >>>>> >>>> One more thing coming to mind is MTU size. Are they identical between
>>>> >>>>> >>>> central and edge site? Do you see packet loss through the tunnel?
>>>> >>>>> >>>>
>>>> >>>>> >>>> Zitat von Swogat Pradhan <swogatpradhan22 at gmail.com>:
>>>> >>>>> >>>>
>>>> >>>>> >>>> > Hi Eugen,
>>>> >>>>> >>>> > I request you to please add my email either to 'to' or 'cc', as I am
>>>> >>>>> >>>> > not getting emails from you.
>>>> >>>>> >>>> > Coming to the issue:
>>>> >>>>> >>>> >
>>>> >>>>> >>>> > [root at overcloud-controller-no-ceph-3 /]# rabbitmqctl list_policies -p
>>>> >>>>> >>>> /
>>>> >>>>> >>>> > Listing policies for vhost "/" ...
>>>> >>>>> >>>> > vhost   name    pattern apply-to        definition      priority
>>>> >>>>> >>>> > /       ha-all  ^(?!amq\.).*    queues
>>>> >>>>> >>>> >
>>>> >>>>> >>>> {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"}   0
>>>> >>>>> >>>> >
>>>> >>>>> >>>> > I have the edge site compute nodes up; the service only goes down
>>>> >>>>> >>>> > when I am trying to launch an instance, and the instance reaches the
>>>> >>>>> >>>> > spawning state and then gets stuck.
>>>> >>>>> >>>> >
>>>> >>>>> >>>> > I have a tunnel setup between the central and the edge sites.
>>>> >>>>> >>>> >
>>>> >>>>> >>>> > With regards,
>>>> >>>>> >>>> > Swogat Pradhan
>>>> >>>>> >>>> >
>>>> >>>>> >>>> > On Tue, Feb 28, 2023 at 9:11 PM Swogat Pradhan <
>>>> >>>>> >>>> swogatpradhan22 at gmail.com>
>>>> >>>>> >>>> > wrote:
>>>> >>>>> >>>> >
>>>> >>>>> >>>> >> Hi Eugen,
>>>> >>>>> >>>> >> For some reason I am not getting your emails directly; I am checking
>>>> >>>>> >>>> >> the email digest and there I am able to find your reply.
>>>> >>>>> >>>> >> Here is the log for download: https://we.tl/t-L8FEkGZFSq
>>>> >>>>> >>>> >> Yes, these logs are from the time when the issue occurred.
>>>> >>>>> >>>> >>
>>>> >>>>> >>>> >> *Note: I am able to create VMs and perform other activities in the
>>>> >>>>> >>>> >> central site; I am only facing this issue in the edge site.*
>>>> >>>>> >>>> >>
>>>> >>>>> >>>> >> With regards,
>>>> >>>>> >>>> >> Swogat Pradhan
>>>> >>>>> >>>> >>
>>>> >>>>> >>>> >> On Mon, Feb 27, 2023 at 5:12 PM Swogat Pradhan <
>>>> >>>>> >>>> swogatpradhan22 at gmail.com>
>>>> >>>>> >>>> >> wrote:
>>>> >>>>> >>>> >>
>>>> >>>>> >>>> >>> Hi Eugen,
>>>> >>>>> >>>> >>> Thanks for your response.
>>>> >>>>> >>>> >>> I actually have a 4-controller setup, so here are the details:
>>>> >>>>> >>>> >>>
>>>> >>>>> >>>> >>> *PCS Status:*
>>>> >>>>> >>>> >>>   * Container bundle set: rabbitmq-bundle [
>>>> >>>>> >>>> >>> 172.25.201.68:8787/tripleomaster/openstack-rabbitmq:pcmklatest]:
>>>> >>>>> >>>> >>>     * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster):
>>>> >>>>> >>>> Started
>>>> >>>>> >>>> >>> overcloud-controller-no-ceph-3
>>>> >>>>> >>>> >>>     * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster):
>>>> >>>>> >>>> Started
>>>> >>>>> >>>> >>> overcloud-controller-2
>>>> >>>>> >>>> >>>     * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster):
>>>> >>>>> >>>> Started
>>>> >>>>> >>>> >>> overcloud-controller-1
>>>> >>>>> >>>> >>>     * rabbitmq-bundle-3 (ocf::heartbeat:rabbitmq-cluster):
>>>> >>>>> >>>> Started
>>>> >>>>> >>>> >>> overcloud-controller-0
>>>> >>>>> >>>> >>>
>>>> >>>>> >>>> >>> I have tried restarting the bundle multiple times, but the issue is
>>>> >>>>> >>>> >>> still present.
>>>> >>>>> >>>> >>>
>>>> >>>>> >>>> >>> *Cluster status:*
>>>> >>>>> >>>> >>> [root at overcloud-controller-0 /]# rabbitmqctl cluster_status
>>>> >>>>> >>>> >>> Cluster status of node
>>>> >>>>> >>>> >>> rabbit at overcloud-controller-0.internalapi.bdxworld.com ...
>>>> >>>>> >>>> >>> Basics
>>>> >>>>> >>>> >>>
>>>> >>>>> >>>> >>> Cluster name: rabbit at overcloud-controller-no-ceph-3.bdxworld.com
>>>> >>>>> >>>> >>>
>>>> >>>>> >>>> >>> Disk Nodes
>>>> >>>>> >>>> >>>
>>>> >>>>> >>>> >>> rabbit at overcloud-controller-0.internalapi.bdxworld.com
>>>> >>>>> >>>> >>> rabbit at overcloud-controller-1.internalapi.bdxworld.com
>>>> >>>>> >>>> >>> rabbit at overcloud-controller-2.internalapi.bdxworld.com
>>>> >>>>> >>>> >>> rabbit at overcloud-controller-no-ceph-3.internalapi.bdxworld.com
>>>> >>>>> >>>> >>>
>>>> >>>>> >>>> >>> Running Nodes
>>>> >>>>> >>>> >>>
>>>> >>>>> >>>> >>> rabbit at overcloud-controller-0.internalapi.bdxworld.com
>>>> >>>>> >>>> >>> rabbit at overcloud-controller-1.internalapi.bdxworld.com
>>>> >>>>> >>>> >>> rabbit at overcloud-controller-2.internalapi.bdxworld.com
>>>> >>>>> >>>> >>> rabbit at overcloud-controller-no-ceph-3.internalapi.bdxworld.com
>>>> >>>>> >>>> >>>
>>>> >>>>> >>>> >>> Versions
>>>> >>>>> >>>> >>>
>>>> >>>>> >>>> >>> rabbit at overcloud-controller-0.internalapi.bdxworld.com: RabbitMQ
>>>> >>>>> >>>> 3.8.3
>>>> >>>>> >>>> >>> on Erlang 22.3.4.1
>>>> >>>>> >>>> >>> rabbit at overcloud-controller-1.internalapi.bdxworld.com: RabbitMQ
>>>> >>>>> >>>> 3.8.3
>>>> >>>>> >>>> >>> on Erlang 22.3.4.1
>>>> >>>>> >>>> >>> rabbit at overcloud-controller-2.internalapi.bdxworld.com: RabbitMQ
>>>> >>>>> >>>> 3.8.3
>>>> >>>>> >>>> >>> on Erlang 22.3.4.1
>>>> >>>>> >>>> >>> rabbit at overcloud-controller-no-ceph-3.internalapi.bdxworld.com:
>>>> >>>>> >>>> RabbitMQ
>>>> >>>>> >>>> >>> 3.8.3 on Erlang 22.3.4.1
>>>> >>>>> >>>> >>>
>>>> >>>>> >>>> >>> Alarms
>>>> >>>>> >>>> >>>
>>>> >>>>> >>>> >>> (none)
>>>> >>>>> >>>> >>>
>>>> >>>>> >>>> >>> Network Partitions
>>>> >>>>> >>>> >>>
>>>> >>>>> >>>> >>> (none)
>>>> >>>>> >>>> >>>
>>>> >>>>> >>>> >>> Listeners
>>>> >>>>> >>>> >>>
>>>> >>>>> >>>> >>> Node: rabbit at overcloud-controller-0.internalapi.bdxworld.com,
>>>> >>>>> >>>> interface:
>>>> >>>>> >>>> >>> [::], port: 25672, protocol: clustering, purpose: inter-node and CLI
>>>> >>>>> >>>> tool
>>>> >>>>> >>>> >>> communication
>>>> >>>>> >>>> >>> Node: rabbit at overcloud-controller-0.internalapi.bdxworld.com,
>>>> >>>>> >>>> interface:
>>>> >>>>> >>>> >>> 172.25.201.212, port: 5672, protocol: amqp, purpose: AMQP 0-9-1
>>>> >>>>> >>>> >>> and AMQP 1.0
>>>> >>>>> >>>> >>> Node: rabbit at overcloud-controller-0.internalapi.bdxworld.com,
>>>> >>>>> >>>> interface:
>>>> >>>>> >>>> >>> [::], port: 15672, protocol: http, purpose: HTTP API
>>>> >>>>> >>>> >>> Node: rabbit at overcloud-controller-1.internalapi.bdxworld.com,
>>>> >>>>> >>>> interface:
>>>> >>>>> >>>> >>> [::], port: 25672, protocol: clustering, purpose: inter-node and CLI
>>>> >>>>> >>>> tool
>>>> >>>>> >>>> >>> communication
>>>> >>>>> >>>> >>> Node: rabbit at overcloud-controller-1.internalapi.bdxworld.com,
>>>> >>>>> >>>> interface:
>>>> >>>>> >>>> >>> 172.25.201.205, port: 5672, protocol: amqp, purpose: AMQP 0-9-1
>>>> >>>>> >>>> >>> and AMQP 1.0
>>>> >>>>> >>>> >>> Node: rabbit at overcloud-controller-1.internalapi.bdxworld.com,
>>>> >>>>> >>>> interface:
>>>> >>>>> >>>> >>> [::], port: 15672, protocol: http, purpose: HTTP API
>>>> >>>>> >>>> >>> Node: rabbit at overcloud-controller-2.internalapi.bdxworld.com,
>>>> >>>>> >>>> interface:
>>>> >>>>> >>>> >>> [::], port: 25672, protocol: clustering, purpose: inter-node and CLI
>>>> >>>>> >>>> tool
>>>> >>>>> >>>> >>> communication
>>>> >>>>> >>>> >>> Node: rabbit at overcloud-controller-2.internalapi.bdxworld.com,
>>>> >>>>> >>>> interface:
>>>> >>>>> >>>> >>> 172.25.201.201, port: 5672, protocol: amqp, purpose: AMQP 0-9-1
>>>> >>>>> >>>> >>> and AMQP 1.0
>>>> >>>>> >>>> >>> Node: rabbit at overcloud-controller-2.internalapi.bdxworld.com,
>>>> >>>>> >>>> interface:
>>>> >>>>> >>>> >>> [::], port: 15672, protocol: http, purpose: HTTP API
>>>> >>>>> >>>> >>> Node: rabbit at overcloud-controller-no-ceph-3.internalapi.bdxworld.com
>>>> >>>>> >>>> ,
>>>> >>>>> >>>> >>> interface: [::], port: 25672, protocol: clustering, purpose:
>>>> >>>>> >>>> inter-node and
>>>> >>>>> >>>> >>> CLI tool communication
>>>> >>>>> >>>> >>> Node: rabbit at overcloud-controller-no-ceph-3.internalapi.bdxworld.com
>>>> >>>>> >>>> ,
>>>> >>>>> >>>> >>> interface: 172.25.201.209, port: 5672, protocol: amqp, purpose: AMQP
>>>> >>>>> >>>> 0-9-1
>>>> >>>>> >>>> >>> and AMQP 1.0
>>>> >>>>> >>>> >>> Node: rabbit at overcloud-controller-no-ceph-3.internalapi.bdxworld.com
>>>> >>>>> >>>> ,
>>>> >>>>> >>>> >>> interface: [::], port: 15672, protocol: http, purpose: HTTP API
>>>> >>>>> >>>> >>>
>>>> >>>>> >>>> >>> Feature flags
>>>> >>>>> >>>> >>>
>>>> >>>>> >>>> >>> Flag: drop_unroutable_metric, state: enabled
>>>> >>>>> >>>> >>> Flag: empty_basic_get_metric, state: enabled
>>>> >>>>> >>>> >>> Flag: implicit_default_bindings, state: enabled
>>>> >>>>> >>>> >>> Flag: quorum_queue, state: enabled
>>>> >>>>> >>>> >>> Flag: virtual_host_metadata, state: enabled
>>>> >>>>> >>>> >>>
>>>> >>>>> >>>> >>> *Logs:*
>>>> >>>>> >>>> >>> *(Attached)*
>>>> >>>>> >>>> >>>
>>>> >>>>> >>>> >>> With regards,
>>>> >>>>> >>>> >>> Swogat Pradhan
>>>> >>>>> >>>> >>>
>>>> >>>>> >>>> >>> On Sun, Feb 26, 2023 at 2:34 PM Swogat Pradhan <
>>>> >>>>> >>>> swogatpradhan22 at gmail.com>
>>>> >>>>> >>>> >>> wrote:
>>>> >>>>> >>>> >>>
>>>> >>>>> >>>> >>>> Hi,
>>>> >>>>> >>>> >>>> Please find the nova conductor as well as nova api log.
>>>> >>>>> >>>> >>>>
>>>> >>>>> >>>> >>>> nova-conductor:
>>>> >>>>> >>>> >>>>
>>>> >>>>> >>>> >>>> 2023-02-26 08:45:01.108 31 WARNING
>>>> >>>>> >>>> oslo_messaging._drivers.amqpdriver
>>>> >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -]
>>>> >>>>> >>>> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to
>>>> >>>>> >>>> >>>> 16152921c1eb45c2b1f562087140168b
>>>> >>>>> >>>> >>>> 2023-02-26 08:45:02.144 26 WARNING
>>>> >>>>> >>>> oslo_messaging._drivers.amqpdriver
>>>> >>>>> >>>> >>>> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -]
>>>> >>>>> >>>> >>>> reply_276049ec36a84486a8a406911d9802f4 doesn't exist, drop reply to
>>>> >>>>> >>>> >>>> 83dbe5f567a940b698acfe986f6194fa
>>>> >>>>> >>>> >>>> 2023-02-26 08:45:02.314 32 WARNING
>>>> >>>>> >>>> oslo_messaging._drivers.amqpdriver
>>>> >>>>> >>>> >>>> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -]
>>>> >>>>> >>>> >>>> reply_276049ec36a84486a8a406911d9802f4 doesn't exist, drop reply to
>>>> >>>>> >>>> >>>> f3bfd7f65bd542b18d84cea3033abb43:
>>>> >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable
>>>> >>>>> >>>> >>>> 2023-02-26 08:45:02.316 32 ERROR oslo_messaging._drivers.amqpdriver
>>>> >>>>> >>>> >>>> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] The reply
>>>> >>>>> >>>> >>>> f3bfd7f65bd542b18d84cea3033abb43 failed to send after 60 seconds
>>>> >>>>> >>>> due to a
>>>> >>>>> >>>> >>>> missing queue (reply_276049ec36a84486a8a406911d9802f4).
>>>> >>>>> >>>> Abandoning...:
>>>> >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable
>>>> >>>>> >>>> >>>> 2023-02-26 08:48:01.282 35 WARNING
>>>> >>>>> >>>> oslo_messaging._drivers.amqpdriver
>>>> >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -]
>>>> >>>>> >>>> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to
>>>> >>>>> >>>> >>>> d4b9180f91a94f9a82c3c9c4b7595566:
>>>> >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable
>>>> >>>>> >>>> >>>> 2023-02-26 08:48:01.284 35 ERROR oslo_messaging._drivers.amqpdriver
>>>> >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply
>>>> >>>>> >>>> >>>> d4b9180f91a94f9a82c3c9c4b7595566 failed to send after 60 seconds
>>>> >>>>> >>>> due to a
>>>> >>>>> >>>> >>>> missing queue (reply_349bcb075f8c49329435a0f884b33066).
>>>> >>>>> >>>> Abandoning...:
>>>> >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable
>>>> >>>>> >>>> >>>> 2023-02-26 08:49:01.303 33 WARNING
>>>> >>>>> >>>> oslo_messaging._drivers.amqpdriver
>>>> >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -]
>>>> >>>>> >>>> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to
>>>> >>>>> >>>> >>>> 897911a234a445d8a0d8af02ece40f6f:
>>>> >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable
>>>> >>>>> >>>> >>>> 2023-02-26 08:49:01.304 33 ERROR oslo_messaging._drivers.amqpdriver
>>>> >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply
>>>> >>>>> >>>> >>>> 897911a234a445d8a0d8af02ece40f6f failed to send after 60 seconds
>>>> >>>>> >>>> due to a
>>>> >>>>> >>>> >>>> missing queue (reply_349bcb075f8c49329435a0f884b33066).
>>>> >>>>> >>>> Abandoning...:
>>>> >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable
>>>> >>>>> >>>> >>>> 2023-02-26 08:49:52.254 31 WARNING nova.cache_utils
>>>> >>>>> >>>> >>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45
>>>> >>>>> >>>> b240e3e89d99489284cd731e75f2a5db
>>>> >>>>> >>>> >>>> 4160ce999a31485fa643aed0936dfef0 - default default] Cache enabled
>>>> >>>>> >>>> with
>>>> >>>>> >>>> >>>> backend dogpile.cache.null.
>>>> >>>>> >>>> >>>> 2023-02-26 08:50:01.264 27 WARNING
>>>> >>>>> >>>> oslo_messaging._drivers.amqpdriver
>>>> >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -]
>>>> >>>>> >>>> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to
>>>> >>>>> >>>> >>>> 8f723ceb10c3472db9a9f324861df2bb:
>>>> >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable
>>>> >>>>> >>>> >>>> 2023-02-26 08:50:01.266 27 ERROR oslo_messaging._drivers.amqpdriver
>>>> >>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply
>>>> >>>>> >>>> >>>> 8f723ceb10c3472db9a9f324861df2bb failed to send after 60 seconds
>>>> >>>>> >>>> due to a
>>>> >>>>> >>>> >>>> missing queue (reply_349bcb075f8c49329435a0f884b33066).
>>>> >>>>> >>>> Abandoning...:
>>>> >>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable
>>>> >>>>> >>>> >>>>
>>>> >>>>> >>>> >>>> With regards,
>>>> >>>>> >>>> >>>> Swogat Pradhan
>>>> >>>>> >>>> >>>>
>>>> >>>>> >>>> >>>> On Sun, Feb 26, 2023 at 2:26 PM Swogat Pradhan <
>>>> >>>>> >>>> >>>> swogatpradhan22 at gmail.com> wrote:
>>>> >>>>> >>>> >>>>
>>>> >>>>> >>>> >>>>> Hi,
>>>> >>>>> >>>> >>>>> I currently have 3 compute nodes on edge site1 where I am trying to
>>>> >>>>> >>>> >>>>> launch VMs.
>>>> >>>>> >>>> >>>>> When the VM is in the spawning state the node goes down (in openstack
>>>> >>>>> >>>> >>>>> compute service list); the node comes back up when I restart the nova
>>>> >>>>> >>>> >>>>> compute service, but then the launch of the VM fails.
>>>> >>>>> >>>> >>>>>
>>>> >>>>> >>>> >>>>> nova-compute.log
>>>> >>>>> >>>> >>>>>
>>>> >>>>> >>>> >>>>> 2023-02-26 08:15:51.808 7 INFO nova.compute.manager
>>>> >>>>> >>>> >>>>> [req-bc0f5f2e-53fc-4dae-b1da-82f1f972d617 - - - - -] Running
>>>> >>>>> >>>> >>>>> instance usage
>>>> >>>>> >>>> >>>>> audit for host dcn01-hci-0.bdxworld.com from 2023-02-26 07:00:00
>>>> >>>>> >>>> to
>>>> >>>>> >>>> >>>>> 2023-02-26 08:00:00. 0 instances.
>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:52.813 7 INFO nova.compute.claims
>>>> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45
>>>> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db
>>>> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] [instance:
>>>> >>>>> >>>> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Claim successful on node
>>>> >>>>> >>>> >>>>> dcn01-hci-0.bdxworld.com
>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:54.225 7 INFO nova.virt.libvirt.driver
>>>> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45
>>>> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db
>>>> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] [instance:
>>>> >>>>> >>>> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Ignoring supplied device
>>>> >>>>> >>>> name:
>>>> >>>>> >>>> >>>>> /dev/vda. Libvirt can't honour user-supplied dev names
>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:54.398 7 INFO nova.virt.block_device
>>>> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45
>>>> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db
>>>> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] [instance:
>>>> >>>>> >>>> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Booting with volume
>>>> >>>>> >>>> >>>>> c4bd7885-5973-4860-bbe6-7a2f726baeee at /dev/vda
>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:55.216 7 WARNING nova.cache_utils
>>>> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45
>>>> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db
>>>> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] Cache enabled
>>>> >>>>> >>>> with
>>>> >>>>> >>>> >>>>> backend dogpile.cache.null.
>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:55.283 7 INFO oslo.privsep.daemon
>>>> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45
>>>> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db
>>>> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] Running
>>>> >>>>> >>>> >>>>> privsep helper:
>>>> >>>>> >>>> >>>>> ['sudo', 'nova-rootwrap', '/etc/nova/rootwrap.conf',
>>>> >>>>> >>>> 'privsep-helper',
>>>> >>>>> >>>> >>>>> '--config-file', '/etc/nova/nova.conf', '--config-file',
>>>> >>>>> >>>> >>>>> '/etc/nova/nova-compute.conf', '--privsep_context',
>>>> >>>>> >>>> >>>>> 'os_brick.privileged.default', '--privsep_sock_path',
>>>> >>>>> >>>> >>>>> '/tmp/tmpin40tah6/privsep.sock']
>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:55.791 7 INFO oslo.privsep.daemon
>>>> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45
>>>> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db
>>>> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] Spawned new
>>>> >>>>> >>>> privsep
>>>> >>>>> >>>> >>>>> daemon via rootwrap
>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:55.717 2647 INFO oslo.privsep.daemon [-] privsep
>>>> >>>>> >>>> >>>>> daemon starting
>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:55.722 2647 INFO oslo.privsep.daemon [-] privsep
>>>> >>>>> >>>> >>>>> process running with uid/gid: 0/0
>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:55.726 2647 INFO oslo.privsep.daemon [-] privsep
>>>> >>>>> >>>> >>>>> process running with capabilities (eff/prm/inh):
>>>> >>>>> >>>> >>>>> CAP_SYS_ADMIN/CAP_SYS_ADMIN/none
>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:55.726 2647 INFO oslo.privsep.daemon [-] privsep
>>>> >>>>> >>>> >>>>> daemon running as pid 2647
>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:55.956 7 WARNING
>>>> >>>>> >>>> os_brick.initiator.connectors.nvmeof
>>>> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45
>>>> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db
>>>> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] Process
>>>> >>>>> >>>> >>>>> execution error
>>>> >>>>> >>>> >>>>> in _get_host_uuid: Unexpected error while running command.
>>>> >>>>> >>>> >>>>> Command: blkid overlay -s UUID -o value
>>>> >>>>> >>>> >>>>> Exit code: 2
>>>> >>>>> >>>> >>>>> Stdout: ''
>>>> >>>>> >>>> >>>>> Stderr: '': oslo_concurrency.processutils.ProcessExecutionError:
>>>> >>>>> >>>> >>>>> Unexpected error while running command.
>>>> >>>>> >>>> >>>>> 2023-02-26 08:49:58.247 7 INFO nova.virt.libvirt.driver
>>>> >>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45
>>>> >>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db
>>>> >>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] [instance:
>>>> >>>>> >>>> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Creating image
>>>> >>>>> >>>> >>>>>
>>>> >>>>> >>>> >>>>> Is there a way to solve this issue?
>>>> >>>>> >>>> >>>>>
>>>> >>>>> >>>> >>>>>
>>>> >>>>> >>>> >>>>> With regards,
>>>> >>>>> >>>> >>>>>
>>>> >>>>> >>>> >>>>> Swogat Pradhan
>>>> >>>>> >>>> >>>>>
>>>> >>>>> >>>> >>>>
>>>> >>>>> >>>>
>>>> >>>>> >>>>
>>>> >>>>> >>>>
>>>> >>>>> >>>>
>>>> >>>>> >>>>
>>>> >>>>> >>>>
>>>> >>>>> >>>>
>>>> >>>>> >>>>
>>>> >>>>> >>>>
>>>> >>>>>
>>>>



