DCN compute service goes down when an instance is scheduled to launch | wallaby | tripleo

Swogat Pradhan swogatpradhan22 at gmail.com
Tue Mar 21 12:03:20 UTC 2023


Hi,
It seems that cinder is not using the local ceph cluster: the volumes listed below have no parent image.

Ceph Output:
[ceph: root@dcn02-ceph-all-0 /]# rbd -p images ls -l
NAME                                       SIZE     PARENT  FMT  PROT  LOCK
2abfafaa-eff4-4c2e-a538-dc2e1249ab65         8 MiB            2        excl
55f40c8a-8f79-48c5-a52a-9b679b762f19        16 MiB            2
55f40c8a-8f79-48c5-a52a-9b679b762f19@snap   16 MiB            2  yes
59f6a9cd-721c-45b5-a15f-fd021b08160d       321 MiB            2
59f6a9cd-721c-45b5-a15f-fd021b08160d@snap  321 MiB            2  yes
5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0       386 MiB            2
5f5ddd77-35f3-45e8-9dd3-8c1cbb1f39f0@snap  386 MiB            2  yes
9b27248e-a8cf-4f00-a039-d3e3066cd26a        15 GiB            2
9b27248e-a8cf-4f00-a039-d3e3066cd26a@snap   15 GiB            2  yes
b7356adc-bb47-4c05-968b-6d3c9ca0079b        15 GiB            2
b7356adc-bb47-4c05-968b-6d3c9ca0079b@snap   15 GiB            2  yes
e77e78ad-d369-4a1d-b758-8113621269a3        15 GiB            2
e77e78ad-d369-4a1d-b758-8113621269a3@snap   15 GiB            2  yes

[ceph: root@dcn02-ceph-all-0 /]# rbd -p volumes ls -l
NAME                                         SIZE     PARENT  FMT  PROT  LOCK
volume-c644086f-d3cf-406d-b0f1-7691bde5981d  100 GiB            2
volume-f0969935-a742-4744-9375-80bf323e4d63   10 GiB            2
[ceph: root@dcn02-ceph-all-0 /]#
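
To make the check repeatable, here is a minimal sketch that parses `rbd ls -l` text output and lists the volumes with an empty PARENT column, i.e. the ones that are not COW clones of an image. It is only an illustration: the function name `volumes_without_parent` is hypothetical, and it assumes the plain-text column layout shown in this thread, where a parent (when present) has the form `pool/image@snap`. The sample combines the two dcn02 volumes above with the dcn0 row from John's earlier example.

```python
def volumes_without_parent(rbd_ls_output):
    """Return names of RBD images whose PARENT column is empty.

    Assumes the plain-text layout of `rbd ls -l`:
    NAME  SIZE (number and unit)  [PARENT]  FMT  [PROT]  [LOCK]
    """
    orphans = []
    for line in rbd_ls_output.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and shell prompt lines.
        if len(fields) < 4 or fields[0] == "NAME" or line.startswith("["):
            continue
        name = fields[0]
        # fields[1:3] is the size, e.g. "100 GiB". A parent, if any,
        # comes next and looks like pool/image@snap.
        has_parent = "/" in fields[3] and "@" in fields[3]
        if not has_parent:
            orphans.append(name)
    return orphans


# Sample rows: two volumes from dcn02 plus the dcn0 example row with a parent.
sample = """\
NAME                                         SIZE     PARENT  FMT  PROT  LOCK
volume-c644086f-d3cf-406d-b0f1-7691bde5981d  100 GiB            2
volume-f0969935-a742-4744-9375-80bf323e4d63   10 GiB            2
volume-28c6fc32-047b-4306-ad2d-de2be02716b7    8 GiB  images/8083c7e7-32d8-4f7a-b1da-0ed7884f1076@snap    2        excl
"""
print(volumes_without_parent(sample))
```

Both dcn02 volumes are reported, which matches the observation that they were created by a full copy rather than as clones of a local image.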

I have attached the cinder config.
Please let me know how I can solve this issue.
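
For anyone comparing configs, the following is a rough sketch of how the ceph-related settings of each enabled backend could be pulled out of a cinder.conf. It assumes the standard cinder RBD driver options (`rbd_ceph_conf`, `rbd_cluster_name`, `rbd_pool`, `rbd_user`); the function name `rbd_backends`, the backend name `tripleo_ceph`, and all sample values are hypothetical and are not taken from the attached file.

```python
import configparser


def rbd_backends(cinder_conf_text):
    """Map each enabled cinder backend to its RBD-related settings."""
    # interpolation=None keeps '%' in values literal; strict=False tolerates
    # options that appear more than once in a section.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(cinder_conf_text)
    enabled = parser.get("DEFAULT", "enabled_backends", fallback="")
    keys = ("volume_driver", "rbd_ceph_conf", "rbd_cluster_name",
            "rbd_pool", "rbd_user")
    result = {}
    for backend in (b.strip() for b in enabled.split(",")):
        if not backend or not parser.has_section(backend):
            continue
        result[backend] = {
            key: parser.get(backend, key, fallback=None) for key in keys
        }
    return result


# Hypothetical sample; the real values live in the attached cinder.conf.
sample_conf = """\
[DEFAULT]
enabled_backends = tripleo_ceph

[tripleo_ceph]
volume_driver = cinder.volume.drivers.rbd.RBDDriver
rbd_ceph_conf = /etc/ceph/dcn02.conf
rbd_cluster_name = dcn02
rbd_pool = volumes
rbd_user = openstack
"""
for name, settings in rbd_backends(sample_conf).items():
    print(name, settings)
```

On a DCN site one would expect `rbd_ceph_conf` and `rbd_cluster_name` to name the local cluster (dcn02 here) rather than the central one.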

With regards,
Swogat Pradhan

On Tue, Mar 21, 2023 at 3:53 PM John Fulton <johfulto at redhat.com> wrote:

> In my last message, under the line "On a DCN site if you run a command like
> this:", I suggested some steps you could try to confirm that the volume is a
> COW clone of the image in the local glance, as well as how to look at your
> cinder config.
>
> On Tue, Mar 21, 2023, 12:06 AM Swogat Pradhan <swogatpradhan22 at gmail.com>
> wrote:
>
>> Update:
>> I uploaded an image directly to the dcn02 store, and it takes
>> around 10 to 15 minutes to create a volume from that image in dcn02.
>> The image size is 389 MB.
>>
>> On Mon, Mar 20, 2023 at 10:26 PM Swogat Pradhan <
>> swogatpradhan22 at gmail.com> wrote:
>>
>>> Hi John,
>>> I checked the ceph cluster of dcn02, and I can see the images created after
>>> importing from the central site.
>>> But launching an instance normally fails as it takes a long time for the
>>> volume to get created.
>>>
>>> When launching an instance from volume the instance is getting created
>>> properly without any errors.
>>>
>>> I tried to cache images in nova using
>>> https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_deployment/pre_cache_images.html
>>> but I am getting a checksum failed error.
>>>
>>> With regards,
>>> Swogat Pradhan
>>>
>>> On Thu, Mar 16, 2023 at 5:24 PM John Fulton <johfulto at redhat.com> wrote:
>>>
>>>> On Wed, Mar 15, 2023 at 8:05 PM Swogat Pradhan
>>>> <swogatpradhan22 at gmail.com> wrote:
>>>> >
>>>> > Update: After restarting the nova services on the controller and
>>>> running the deploy script on the edge site, I was able to launch the VM
>>>> from volume.
>>>> >
>>>> > Right now the instance creation is failing as the block device
>>>> creation is stuck in creating state, it is taking more than 10 mins for the
>>>> volume to be created, whereas the image has already been imported to the
>>>> edge glance.
>>>>
>>>> Try following this document and making the same observations in your
>>>> environment for AZs and their local ceph cluster.
>>>>
>>>>
>>>> https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features/distributed_multibackend_storage.html#confirm-images-may-be-copied-between-sites
>>>>
>>>> On a DCN site if you run a command like this:
>>>>
>>>> $ sudo cephadm shell --config /etc/ceph/dcn0.conf --keyring /etc/ceph/dcn0.client.admin.keyring
>>>> $ rbd --cluster dcn0 -p volumes ls -l
>>>> NAME                                         SIZE   PARENT                                            FMT  PROT  LOCK
>>>> volume-28c6fc32-047b-4306-ad2d-de2be02716b7  8 GiB  images/8083c7e7-32d8-4f7a-b1da-0ed7884f1076@snap    2        excl
>>>> $
>>>>
>>>> Then, you should see the parent of the volume is the image which is on
>>>> the same local ceph cluster.
>>>>
>>>> I wonder if something is misconfigured and thus you're encountering
>>>> the streaming behavior described here:
>>>>
>>>> Ideally all images should reside in the central Glance and be copied
>>>> to DCN sites before instances of those images are booted on DCN sites.
>>>> If an image is not copied to a DCN site before it is booted, then the
>>>> image will be streamed to the DCN site and then the image will boot as
>>>> an instance. This happens because Glance at the DCN site has access to
>>>> the images store at the Central ceph cluster. Though the booting of
>>>> the image will take time because it has not been copied in advance,
>>>> this is still preferable to failing to boot the image.
>>>>
>>>> You can also exec into the cinder container at the DCN site and
>>>> confirm it's using its local ceph cluster.
>>>>
>>>>   John
>>>>
>>>> >
>>>> > I will try and create a new fresh image and test again then update.
>>>> >
>>>> > With regards,
>>>> > Swogat Pradhan
>>>> >
>>>> > On Wed, Mar 15, 2023 at 11:13 PM Swogat Pradhan <
>>>> swogatpradhan22 at gmail.com> wrote:
>>>> >>
>>>> >> Update:
>>>> >> In the hypervisor list the compute node state is showing down.
>>>> >>
>>>> >>
>>>> >> On Wed, Mar 15, 2023 at 11:11 PM Swogat Pradhan <
>>>> swogatpradhan22 at gmail.com> wrote:
>>>> >>>
>>>> >>> Hi Brendan,
>>>> >>> I have now deployed another site where I used the 2 linux bonds
>>>> >>> network template for both the 3 compute nodes and the 3 ceph nodes.
>>>> >>> The bonding options are set to mode=802.3ad (lacp=active).
>>>> >>> I used a cirros image to launch an instance, but the instance timed
>>>> >>> out, so I waited for the volume to be created.
>>>> >>> Once the volume was created, I tried launching the instance from the
>>>> >>> volume, and the instance is still stuck in the spawning state.
>>>> >>>
>>>> >>> Here is the nova-compute log:
>>>> >>>
>>>> >>> 2023-03-15 17:35:47.739 185437 INFO oslo.privsep.daemon [-] privsep
>>>> daemon starting
>>>> >>> 2023-03-15 17:35:47.744 185437 INFO oslo.privsep.daemon [-] privsep
>>>> process running with uid/gid: 0/0
>>>> >>> 2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon [-] privsep
>>>> process running with capabilities (eff/prm/inh):
>>>> CAP_SYS_ADMIN/CAP_SYS_ADMIN/none
>>>> >>> 2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon [-] privsep
>>>> daemon running as pid 185437
>>>> >>> 2023-03-15 17:35:47.974 8 WARNING
>>>> os_brick.initiator.connectors.nvmeof
>>>> [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db
>>>> 4160ce999a31485fa643aed0936dfef0 - default default] Process execution error
>>>> in _get_host_uuid: Unexpected error while running command.
>>>> >>> Command: blkid overlay -s UUID -o value
>>>> >>> Exit code: 2
>>>> >>> Stdout: ''
>>>> >>> Stderr: '': oslo_concurrency.processutils.ProcessExecutionError:
>>>> Unexpected error while running command.
>>>> >>> 2023-03-15 17:35:51.616 8 INFO nova.virt.libvirt.driver
>>>> [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db
>>>> 4160ce999a31485fa643aed0936dfef0 - default default] [instance:
>>>> 450b749c-a10a-4308-80a9-3b8020fee758] Creating image
>>>> >>>
>>>> >>> It is stuck at "Creating image". Do I need to run the template
>>>> >>> mentioned here?
>>>> https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_deployment/pre_cache_images.html
>>>> >>>
>>>> >>> The volume is already created, and I do not understand why the
>>>> >>> instance is stuck in the spawning state.
>>>> >>>
>>>> >>> With regards,
>>>> >>> Swogat Pradhan
>>>> >>>
>>>> >>>
>>>> >>> On Sun, Mar 5, 2023 at 4:02 PM Brendan Shephard <
>>>> bshephar at redhat.com> wrote:
>>>> >>>>
>>>> >>>> Does your environment use different network interfaces for each of
>>>> the networks? Or does it have a bond with everything on it?
>>>> >>>>
>>>> >>>> One issue I have seen before is that when launching instances,
>>>> there is a lot of network traffic between nodes as the hypervisor needs to
>>>> download the image from Glance. Along with various other services sending
>>>> normal network traffic, it can be enough to cause issues if everything is
>>>> running over a single 1Gbe interface.
>>>> >>>>
>>>> >>>> I have seen the same situation in fact when using a single
>>>> active/backup bond on 1Gbe nics. It’s worth checking the network traffic
>>>> while you try to spawn the instance to see if you’re dropping packets. In
>>>> the situation I described, there were dropped packets which resulted in a
>>>> loss of communication between nova_compute and RMQ, so the node appeared
>>>> offline. You should also confirm that nova_compute is being disconnected in
>>>> the nova_compute logs if you tail them on the Hypervisor while spawning the
>>>> instance.
>>>> >>>>
>>>> >>>> In my case, changing from active/backup to LACP helped. So, based
>>>> >>>> on that experience, it certainly sounds like some kind of network
>>>> >>>> issue to me.
>>>> >>>>
>>>> >>>> Regards,
>>>> >>>>
>>>> >>>> Brendan Shephard
>>>> >>>> Senior Software Engineer
>>>> >>>> Red Hat Australia
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>> On 5 Mar 2023, at 6:47 am, Eugen Block <eblock at nde.ag> wrote:
>>>> >>>>
>>>> >>>> Hi,
>>>> >>>>
>>>> >>>> I tried to help someone with a similar issue some time ago in this
>>>> thread:
>>>> >>>>
>>>> https://serverfault.com/questions/1116771/openstack-oslo-messaging-exception-in-nova-conductor
>>>> >>>>
>>>> >>>> But apparently a neutron reinstallation fixed it for that user,
>>>> not sure if that could apply here. But is it possible that your nova and
>>>> neutron versions are different between central and edge site? Have you
>>>> restarted nova and neutron services on the compute nodes after
>>>> installation? Have you debug logs of nova-conductor and maybe nova-compute?
>>>> Maybe they can help narrow down the issue.
>>>> >>>> If there isn't any additional information in the debug logs I
>>>> probably would start "tearing down" rabbitmq. I didn't have to do that in a
>>>> production system yet so be careful. I can think of two routes:
>>>> >>>>
>>>> >>>> - Either remove queues, exchanges etc. while rabbit is running,
>>>> this will most likely impact client IO depending on your load. Check out
>>>> the rabbitmqctl commands.
>>>> >>>> - Or stop the rabbitmq cluster, remove the mnesia tables from all
>>>> nodes and restart rabbitmq so the exchanges, queues etc. rebuild.
>>>> >>>>
>>>> >>>> I can imagine that the failed reply "survives" while being
>>>> replicated across the rabbit nodes. But I don't really know the rabbit
>>>> internals too well, so maybe someone else can chime in here and give
>>>> better advice.
>>>> >>>>
>>>> >>>> Regards,
>>>> >>>> Eugen
>>>> >>>>
>>>> >>>> Quoting Swogat Pradhan <swogatpradhan22 at gmail.com>:
>>>> >>>>
>>>> >>>> Hi,
>>>> >>>> Can someone please help me out on this issue?
>>>> >>>>
>>>> >>>> With regards,
>>>> >>>> Swogat Pradhan
>>>> >>>>
>>>> >>>> On Thu, Mar 2, 2023 at 1:24 PM Swogat Pradhan <
>>>> swogatpradhan22 at gmail.com>
>>>> >>>> wrote:
>>>> >>>>
>>>> >>>> Hi
>>>> >>>> I don't see any major packet loss.
>>>> >>>> It seems the problem is somewhere in rabbitmq maybe but not due to
>>>> packet
>>>> >>>> loss.
>>>> >>>>
>>>> >>>> with regards,
>>>> >>>> Swogat Pradhan
>>>> >>>>
>>>> >>>> On Wed, Mar 1, 2023 at 3:34 PM Swogat Pradhan <
>>>> swogatpradhan22 at gmail.com>
>>>> >>>> wrote:
>>>> >>>>
>>>> >>>> Hi,
>>>> >>>> Yes the MTU is the same as the default '1500'.
>>>> >>>> Generally I haven't seen any packet loss, but never checked when
>>>> >>>> launching the instance.
>>>> >>>> I will check that and come back.
>>>> >>>> But every time I launch an instance, the instance gets stuck in the
>>>> >>>> spawning state and then the hypervisor goes down, so I am not sure
>>>> >>>> whether packet loss causes this.
>>>> >>>>
>>>> >>>> With regards,
>>>> >>>> Swogat pradhan
>>>> >>>>
>>>> >>>> On Wed, Mar 1, 2023 at 3:30 PM Eugen Block <eblock at nde.ag> wrote:
>>>> >>>>
>>>> >>>> One more thing coming to mind is MTU size. Are they identical
>>>> between
>>>> >>>> central and edge site? Do you see packet loss through the tunnel?
>>>> >>>>
>>>> >>>> Quoting Swogat Pradhan <swogatpradhan22 at gmail.com>:
>>>> >>>>
>>>> >>>> > Hi Eugen,
>>>> >>>> > Please add my email address to either 'To' or 'Cc', as I am not
>>>> >>>> > getting emails from you.
>>>> >>>> > Coming to the issue:
>>>> >>>> >
>>>> >>>> > [root@overcloud-controller-no-ceph-3 /]# rabbitmqctl list_policies -p /
>>>> >>>> > Listing policies for vhost "/" ...
>>>> >>>> > vhost   name    pattern apply-to        definition      priority
>>>> >>>> > /       ha-all  ^(?!amq\.).*    queues  {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"}   0
>>>> >>>> >
>>>> >>>> > The edge site compute nodes are up. A node only goes down when I
>>>> >>>> > try to launch an instance: the instance reaches the spawning state
>>>> >>>> > and then gets stuck.
>>>> >>>> >
>>>> >>>> > I have a tunnel setup between the central and the edge sites.
>>>> >>>> >
>>>> >>>> > With regards,
>>>> >>>> > Swogat Pradhan
>>>> >>>> >
>>>> >>>> > On Tue, Feb 28, 2023 at 9:11 PM Swogat Pradhan <
>>>> >>>> swogatpradhan22 at gmail.com>
>>>> >>>> > wrote:
>>>> >>>> >
>>>> >>>> >> Hi Eugen,
>>>> >>>> >> For some reason I am not getting your emails directly. I am
>>>> >>>> >> checking the email digest, and that is where I find your replies.
>>>> >>>> >> Here is the log for download: https://we.tl/t-L8FEkGZFSq
>>>> >>>> >> Yes, these logs are from the time when the issue occurred.
>>>> >>>> >>
>>>> >>>> >> *Note: I am able to create VMs and perform other activities in the
>>>> >>>> >> central site; I am only facing this issue in the edge site.*
>>>> >>>> >>
>>>> >>>> >> With regards,
>>>> >>>> >> Swogat Pradhan
>>>> >>>> >>
>>>> >>>> >> On Mon, Feb 27, 2023 at 5:12 PM Swogat Pradhan <
>>>> >>>> swogatpradhan22 at gmail.com>
>>>> >>>> >> wrote:
>>>> >>>> >>
>>>> >>>> >>> Hi Eugen,
>>>> >>>> >>> Thanks for your response.
>>>> >>>> >>> I actually have a 4 controller setup, so here are the details:
>>>> >>>> >>>
>>>> >>>> >>> *PCS Status:*
>>>> >>>> >>>   * Container bundle set: rabbitmq-bundle [172.25.201.68:8787/tripleomaster/openstack-rabbitmq:pcmklatest]:
>>>> >>>> >>>     * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-no-ceph-3
>>>> >>>> >>>     * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-2
>>>> >>>> >>>     * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-1
>>>> >>>> >>>     * rabbitmq-bundle-3 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-0
>>>> >>>> >>>
>>>> >>>> >>> I have tried restarting the bundle multiple times, but the issue
>>>> >>>> >>> is still present.
>>>> >>>> >>>
>>>> >>>> >>> *Cluster status:*
>>>> >>>> >>> [root at overcloud-controller-0 /]# rabbitmqctl cluster_status
>>>> >>>> >>> Cluster status of node
>>>> >>>> >>> rabbit at overcloud-controller-0.internalapi.bdxworld.com ...
>>>> >>>> >>> Basics
>>>> >>>> >>>
>>>> >>>> >>> Cluster name:
>>>> rabbit at overcloud-controller-no-ceph-3.bdxworld.com
>>>> >>>> >>>
>>>> >>>> >>> Disk Nodes
>>>> >>>> >>>
>>>> >>>> >>> rabbit at overcloud-controller-0.internalapi.bdxworld.com
>>>> >>>> >>> rabbit at overcloud-controller-1.internalapi.bdxworld.com
>>>> >>>> >>> rabbit at overcloud-controller-2.internalapi.bdxworld.com
>>>> >>>> >>> rabbit at overcloud-controller-no-ceph-3.internalapi.bdxworld.com
>>>> >>>> >>>
>>>> >>>> >>> Running Nodes
>>>> >>>> >>>
>>>> >>>> >>> rabbit at overcloud-controller-0.internalapi.bdxworld.com
>>>> >>>> >>> rabbit at overcloud-controller-1.internalapi.bdxworld.com
>>>> >>>> >>> rabbit at overcloud-controller-2.internalapi.bdxworld.com
>>>> >>>> >>> rabbit at overcloud-controller-no-ceph-3.internalapi.bdxworld.com
>>>> >>>> >>>
>>>> >>>> >>> Versions
>>>> >>>> >>>
>>>> >>>> >>> rabbit at overcloud-controller-0.internalapi.bdxworld.com:
>>>> RabbitMQ
>>>> >>>> 3.8.3
>>>> >>>> >>> on Erlang 22.3.4.1
>>>> >>>> >>> rabbit at overcloud-controller-1.internalapi.bdxworld.com:
>>>> RabbitMQ
>>>> >>>> 3.8.3
>>>> >>>> >>> on Erlang 22.3.4.1
>>>> >>>> >>> rabbit at overcloud-controller-2.internalapi.bdxworld.com:
>>>> RabbitMQ
>>>> >>>> 3.8.3
>>>> >>>> >>> on Erlang 22.3.4.1
>>>> >>>> >>> rabbit at overcloud-controller-no-ceph-3.internalapi.bdxworld.com
>>>> :
>>>> >>>> RabbitMQ
>>>> >>>> >>> 3.8.3 on Erlang 22.3.4.1
>>>> >>>> >>>
>>>> >>>> >>> Alarms
>>>> >>>> >>>
>>>> >>>> >>> (none)
>>>> >>>> >>>
>>>> >>>> >>> Network Partitions
>>>> >>>> >>>
>>>> >>>> >>> (none)
>>>> >>>> >>>
>>>> >>>> >>> Listeners
>>>> >>>> >>>
>>>> >>>> >>> Node: rabbit at overcloud-controller-0.internalapi.bdxworld.com,
>>>> >>>> interface:
>>>> >>>> >>> [::], port: 25672, protocol: clustering, purpose: inter-node
>>>> and CLI
>>>> >>>> tool
>>>> >>>> >>> communication
>>>> >>>> >>> Node: rabbit at overcloud-controller-0.internalapi.bdxworld.com,
>>>> >>>> interface:
>>>> >>>> >>> 172.25.201.212, port: 5672, protocol: amqp, purpose: AMQP 0-9-1
>>>> >>>> >>> and AMQP 1.0
>>>> >>>> >>> Node: rabbit at overcloud-controller-0.internalapi.bdxworld.com,
>>>> >>>> interface:
>>>> >>>> >>> [::], port: 15672, protocol: http, purpose: HTTP API
>>>> >>>> >>> Node: rabbit at overcloud-controller-1.internalapi.bdxworld.com,
>>>> >>>> interface:
>>>> >>>> >>> [::], port: 25672, protocol: clustering, purpose: inter-node
>>>> and CLI
>>>> >>>> tool
>>>> >>>> >>> communication
>>>> >>>> >>> Node: rabbit at overcloud-controller-1.internalapi.bdxworld.com,
>>>> >>>> interface:
>>>> >>>> >>> 172.25.201.205, port: 5672, protocol: amqp, purpose: AMQP 0-9-1
>>>> >>>> >>> and AMQP 1.0
>>>> >>>> >>> Node: rabbit at overcloud-controller-1.internalapi.bdxworld.com,
>>>> >>>> interface:
>>>> >>>> >>> [::], port: 15672, protocol: http, purpose: HTTP API
>>>> >>>> >>> Node: rabbit at overcloud-controller-2.internalapi.bdxworld.com,
>>>> >>>> interface:
>>>> >>>> >>> [::], port: 25672, protocol: clustering, purpose: inter-node
>>>> and CLI
>>>> >>>> tool
>>>> >>>> >>> communication
>>>> >>>> >>> Node: rabbit at overcloud-controller-2.internalapi.bdxworld.com,
>>>> >>>> interface:
>>>> >>>> >>> 172.25.201.201, port: 5672, protocol: amqp, purpose: AMQP 0-9-1
>>>> >>>> >>> and AMQP 1.0
>>>> >>>> >>> Node: rabbit at overcloud-controller-2.internalapi.bdxworld.com,
>>>> >>>> interface:
>>>> >>>> >>> [::], port: 15672, protocol: http, purpose: HTTP API
>>>> >>>> >>> Node:
>>>> rabbit at overcloud-controller-no-ceph-3.internalapi.bdxworld.com
>>>> >>>> ,
>>>> >>>> >>> interface: [::], port: 25672, protocol: clustering, purpose:
>>>> >>>> inter-node and
>>>> >>>> >>> CLI tool communication
>>>> >>>> >>> Node:
>>>> rabbit at overcloud-controller-no-ceph-3.internalapi.bdxworld.com
>>>> >>>> ,
>>>> >>>> >>> interface: 172.25.201.209, port: 5672, protocol: amqp,
>>>> purpose: AMQP
>>>> >>>> 0-9-1
>>>> >>>> >>> and AMQP 1.0
>>>> >>>> >>> Node:
>>>> rabbit at overcloud-controller-no-ceph-3.internalapi.bdxworld.com
>>>> >>>> ,
>>>> >>>> >>> interface: [::], port: 15672, protocol: http, purpose: HTTP API
>>>> >>>> >>>
>>>> >>>> >>> Feature flags
>>>> >>>> >>>
>>>> >>>> >>> Flag: drop_unroutable_metric, state: enabled
>>>> >>>> >>> Flag: empty_basic_get_metric, state: enabled
>>>> >>>> >>> Flag: implicit_default_bindings, state: enabled
>>>> >>>> >>> Flag: quorum_queue, state: enabled
>>>> >>>> >>> Flag: virtual_host_metadata, state: enabled
>>>> >>>> >>>
>>>> >>>> >>> *Logs:*
>>>> >>>> >>> *(Attached)*
>>>> >>>> >>>
>>>> >>>> >>> With regards,
>>>> >>>> >>> Swogat Pradhan
>>>> >>>> >>>
>>>> >>>> >>> On Sun, Feb 26, 2023 at 2:34 PM Swogat Pradhan <
>>>> >>>> swogatpradhan22 at gmail.com>
>>>> >>>> >>> wrote:
>>>> >>>> >>>
>>>> >>>> >>>> Hi,
>>>> >>>> >>>> Please find the nova conductor as well as nova api log.
>>>> >>>> >>>>
>>>> >>>> >>>> nova-conductor:
>>>> >>>> >>>>
>>>> >>>> >>>> 2023-02-26 08:45:01.108 31 WARNING
>>>> >>>> oslo_messaging._drivers.amqpdriver
>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -]
>>>> >>>> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop
>>>> reply to
>>>> >>>> >>>> 16152921c1eb45c2b1f562087140168b
>>>> >>>> >>>> 2023-02-26 08:45:02.144 26 WARNING
>>>> >>>> oslo_messaging._drivers.amqpdriver
>>>> >>>> >>>> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -]
>>>> >>>> >>>> reply_276049ec36a84486a8a406911d9802f4 doesn't exist, drop
>>>> reply to
>>>> >>>> >>>> 83dbe5f567a940b698acfe986f6194fa
>>>> >>>> >>>> 2023-02-26 08:45:02.314 32 WARNING
>>>> >>>> oslo_messaging._drivers.amqpdriver
>>>> >>>> >>>> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -]
>>>> >>>> >>>> reply_276049ec36a84486a8a406911d9802f4 doesn't exist, drop
>>>> reply to
>>>> >>>> >>>> f3bfd7f65bd542b18d84cea3033abb43:
>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable
>>>> >>>> >>>> 2023-02-26 08:45:02.316 32 ERROR
>>>> oslo_messaging._drivers.amqpdriver
>>>> >>>> >>>> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] The reply
>>>> >>>> >>>> f3bfd7f65bd542b18d84cea3033abb43 failed to send after 60
>>>> seconds
>>>> >>>> due to a
>>>> >>>> >>>> missing queue (reply_276049ec36a84486a8a406911d9802f4).
>>>> >>>> Abandoning...:
>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable
>>>> >>>> >>>> 2023-02-26 08:48:01.282 35 WARNING
>>>> >>>> oslo_messaging._drivers.amqpdriver
>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -]
>>>> >>>> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop
>>>> reply to
>>>> >>>> >>>> d4b9180f91a94f9a82c3c9c4b7595566:
>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable
>>>> >>>> >>>> 2023-02-26 08:48:01.284 35 ERROR
>>>> oslo_messaging._drivers.amqpdriver
>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply
>>>> >>>> >>>> d4b9180f91a94f9a82c3c9c4b7595566 failed to send after 60
>>>> seconds
>>>> >>>> due to a
>>>> >>>> >>>> missing queue (reply_349bcb075f8c49329435a0f884b33066).
>>>> >>>> Abandoning...:
>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable
>>>> >>>> >>>> 2023-02-26 08:49:01.303 33 WARNING
>>>> >>>> oslo_messaging._drivers.amqpdriver
>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -]
>>>> >>>> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop
>>>> reply to
>>>> >>>> >>>> 897911a234a445d8a0d8af02ece40f6f:
>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable
>>>> >>>> >>>> 2023-02-26 08:49:01.304 33 ERROR
>>>> oslo_messaging._drivers.amqpdriver
>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply
>>>> >>>> >>>> 897911a234a445d8a0d8af02ece40f6f failed to send after 60
>>>> seconds
>>>> >>>> due to a
>>>> >>>> >>>> missing queue (reply_349bcb075f8c49329435a0f884b33066).
>>>> >>>> Abandoning...:
>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable
>>>> >>>> >>>> 2023-02-26 08:49:52.254 31 WARNING nova.cache_utils
>>>> >>>> >>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45
>>>> >>>> b240e3e89d99489284cd731e75f2a5db
>>>> >>>> >>>> 4160ce999a31485fa643aed0936dfef0 - default default] Cache
>>>> enabled
>>>> >>>> with
>>>> >>>> >>>> backend dogpile.cache.null.
>>>> >>>> >>>> 2023-02-26 08:50:01.264 27 WARNING
>>>> >>>> oslo_messaging._drivers.amqpdriver
>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -]
>>>> >>>> >>>> reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop
>>>> reply to
>>>> >>>> >>>> 8f723ceb10c3472db9a9f324861df2bb:
>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable
>>>> >>>> >>>> 2023-02-26 08:50:01.266 27 ERROR
>>>> oslo_messaging._drivers.amqpdriver
>>>> >>>> >>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply
>>>> >>>> >>>> 8f723ceb10c3472db9a9f324861df2bb failed to send after 60
>>>> seconds
>>>> >>>> due to a
>>>> >>>> >>>> missing queue (reply_349bcb075f8c49329435a0f884b33066).
>>>> >>>> Abandoning...:
>>>> >>>> >>>> oslo_messaging.exceptions.MessageUndeliverable
>>>> >>>> >>>>
>>>> >>>> >>>> With regards,
>>>> >>>> >>>> Swogat Pradhan
>>>> >>>> >>>>
>>>> >>>> >>>> On Sun, Feb 26, 2023 at 2:26 PM Swogat Pradhan <
>>>> >>>> >>>> swogatpradhan22 at gmail.com> wrote:
>>>> >>>> >>>>
>>>> >>>> >>>>> Hi,
>>>> >>>> >>>>> I currently have 3 compute nodes on edge site1 where I am trying
>>>> >>>> >>>>> to launch VMs.
>>>> >>>> >>>>> When the VM is in the spawning state, the node goes down
>>>> >>>> >>>>> (openstack compute service list). The node comes back up when I
>>>> >>>> >>>>> restart the nova compute service, but then the launch of the VM
>>>> >>>> >>>>> fails.
>>>> >>>> >>>>>
>>>> >>>> >>>>> nova-compute.log
>>>> >>>> >>>>>
>>>> >>>> >>>>> 2023-02-26 08:15:51.808 7 INFO nova.compute.manager
>>>> >>>> >>>>> [req-bc0f5f2e-53fc-4dae-b1da-82f1f972d617 - - - - -] Running
>>>> >>>> >>>>> instance usage
>>>> >>>> >>>>> audit for host dcn01-hci-0.bdxworld.com from 2023-02-26
>>>> 07:00:00
>>>> >>>> to
>>>> >>>> >>>>> 2023-02-26 08:00:00. 0 instances.
>>>> >>>> >>>>> 2023-02-26 08:49:52.813 7 INFO nova.compute.claims
>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45
>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db
>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default]
>>>> [instance:
>>>> >>>> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Claim successful on
>>>> node
>>>> >>>> >>>>> dcn01-hci-0.bdxworld.com
>>>> >>>> >>>>> 2023-02-26 08:49:54.225 7 INFO nova.virt.libvirt.driver
>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45
>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db
>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default]
>>>> [instance:
>>>> >>>> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Ignoring supplied
>>>> device
>>>> >>>> name:
>>>> >>>> >>>>> /dev/vda. Libvirt can't honour user-supplied dev names
>>>> >>>> >>>>> 2023-02-26 08:49:54.398 7 INFO nova.virt.block_device
>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45
>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db
>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default]
>>>> [instance:
>>>> >>>> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Booting with volume
>>>> >>>> >>>>> c4bd7885-5973-4860-bbe6-7a2f726baeee at /dev/vda
>>>> >>>> >>>>> 2023-02-26 08:49:55.216 7 WARNING nova.cache_utils
>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45
>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db
>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] Cache
>>>> enabled
>>>> >>>> with
>>>> >>>> >>>>> backend dogpile.cache.null.
>>>> >>>> >>>>> 2023-02-26 08:49:55.283 7 INFO oslo.privsep.daemon
>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45
>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db
>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] Running
>>>> >>>> >>>>> privsep helper:
>>>> >>>> >>>>> ['sudo', 'nova-rootwrap', '/etc/nova/rootwrap.conf',
>>>> >>>> 'privsep-helper',
>>>> >>>> >>>>> '--config-file', '/etc/nova/nova.conf', '--config-file',
>>>> >>>> >>>>> '/etc/nova/nova-compute.conf', '--privsep_context',
>>>> >>>> >>>>> 'os_brick.privileged.default', '--privsep_sock_path',
>>>> >>>> >>>>> '/tmp/tmpin40tah6/privsep.sock']
>>>> >>>> >>>>> 2023-02-26 08:49:55.791 7 INFO oslo.privsep.daemon
>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45
>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db
>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] Spawned
>>>> new
>>>> >>>> privsep
>>>> >>>> >>>>> daemon via rootwrap
>>>> >>>> >>>>> 2023-02-26 08:49:55.717 2647 INFO oslo.privsep.daemon [-]
>>>> privsep
>>>> >>>> >>>>> daemon starting
>>>> >>>> >>>>> 2023-02-26 08:49:55.722 2647 INFO oslo.privsep.daemon [-]
>>>> privsep
>>>> >>>> >>>>> process running with uid/gid: 0/0
>>>> >>>> >>>>> 2023-02-26 08:49:55.726 2647 INFO oslo.privsep.daemon [-]
>>>> privsep
>>>> >>>> >>>>> process running with capabilities (eff/prm/inh):
>>>> >>>> >>>>> CAP_SYS_ADMIN/CAP_SYS_ADMIN/none
>>>> >>>> >>>>> 2023-02-26 08:49:55.726 2647 INFO oslo.privsep.daemon [-]
>>>> privsep
>>>> >>>> >>>>> daemon running as pid 2647
>>>> >>>> >>>>> 2023-02-26 08:49:55.956 7 WARNING
>>>> >>>> os_brick.initiator.connectors.nvmeof
>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45
>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db
>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default] Process
>>>> >>>> >>>>> execution error
>>>> >>>> >>>>> in _get_host_uuid: Unexpected error while running command.
>>>> >>>> >>>>> Command: blkid overlay -s UUID -o value
>>>> >>>> >>>>> Exit code: 2
>>>> >>>> >>>>> Stdout: ''
>>>> >>>> >>>>> Stderr: '':
>>>> oslo_concurrency.processutils.ProcessExecutionError:
>>>> >>>> >>>>> Unexpected error while running command.
>>>> >>>> >>>>> 2023-02-26 08:49:58.247 7 INFO nova.virt.libvirt.driver
>>>> >>>> >>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45
>>>> >>>> >>>>> b240e3e89d99489284cd731e75f2a5db
>>>> >>>> >>>>> 4160ce999a31485fa643aed0936dfef0 - default default]
>>>> [instance:
>>>> >>>> >>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Creating image
>>>> >>>> >>>>>
>>>> >>>> >>>>> Is there a way to solve this issue?
>>>> >>>> >>>>>
>>>> >>>> >>>>>
>>>> >>>> >>>>> With regards,
>>>> >>>> >>>>>
>>>> >>>> >>>>> Swogat Pradhan
>>>> >>>> >>>>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cinder.conf
Type: application/octet-stream
Size: 2768 bytes
Desc: not available
URL: <https://lists.openstack.org/pipermail/openstack-discuss/attachments/20230321/c623f061/attachment-0001.obj>