DCN compute service goes down when an instance is scheduled to launch | wallaby | tripleo

Eugen Block eblock at nde.ag
Sat Mar 4 20:47:45 UTC 2023


Hi,

I tried to help someone with a similar issue some time ago in this thread:
https://serverfault.com/questions/1116771/openstack-oslo-messaging-exception-in-nova-conductor

But apparently a neutron reinstallation fixed it for that user; not  
sure if that could apply here. Is it possible that your nova and  
neutron versions differ between the central and edge site? Have you  
restarted the nova and neutron services on the compute nodes after  
installation? Do you have debug logs of nova-conductor and maybe  
nova-compute? They might help narrow down the issue.
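For reference, a minimal way to turn on debug logging (this assumes the usual config location; for containerized TripleO deployments the file lives under /var/lib/config-data/puppet-generated/, so adjust the path to your deployment):

```ini
# /etc/nova/nova.conf on the affected conductor/compute nodes
[DEFAULT]
debug = True
```

Then restart the nova_conductor / nova_compute containers so the change takes effect.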
If there isn't any additional information in the debug logs, I would  
probably start "tearing down" rabbitmq. I haven't had to do that on a  
production system yet, so be careful. I can think of two routes:

- Either remove queues, exchanges etc. while rabbit is running; this  
will most likely impact client IO depending on your load. Check out  
the rabbitmqctl commands.
- Or stop the rabbitmq cluster, remove the mnesia tables from all  
nodes and restart rabbitmq so that the exchanges, queues etc. are  
rebuilt.
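As a rough sketch of both routes (the queue name is just an example taken from the conductor logs quoted below; this assumes RabbitMQ 3.8 and that you run the commands inside the rabbitmq bundle container on each controller):

```shell
# Route 1: surgical, while rabbit keeps running.
# List stale reply queues, then delete the one the conductor complains about.
rabbitmqctl list_queues -p / name | grep '^reply_'
rabbitmqctl delete_queue -p / reply_276049ec36a84486a8a406911d9802f4

# Route 2: full rebuild. Stop the app on ALL nodes first (with
# pacemaker-managed bundles, disable the bundle via pcs rather than
# stopping rabbitmq by hand), then reset and start again so the
# clients recreate their exchanges and queues.
rabbitmqctl stop_app
rabbitmqctl force_reset   # the supported way to wipe the node's mnesia state
rabbitmqctl start_app
```

Again, I haven't had to do this in production, so treat it as a starting point rather than a tested procedure.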

I can imagine that the failed reply queue "survives" while being  
replicated across the rabbit nodes. But I don't know the rabbit  
internals well enough, so maybe someone else can chime in here and  
give better advice.

Regards,
Eugen

Quoting Swogat Pradhan <swogatpradhan22 at gmail.com>:

> Hi,
> Can someone please help me out on this issue?
>
> With regards,
> Swogat Pradhan
>
> On Thu, Mar 2, 2023 at 1:24 PM Swogat Pradhan <swogatpradhan22 at gmail.com>
> wrote:
>
>> Hi,
>> I don't see any major packet loss.
>> The problem seems to be somewhere in rabbitmq, but not due to packet
>> loss.
>>
>> with regards,
>> Swogat Pradhan
>>
>> On Wed, Mar 1, 2023 at 3:34 PM Swogat Pradhan <swogatpradhan22 at gmail.com>
>> wrote:
>>
>>> Hi,
>>> Yes, the MTU is the default '1500'.
>>> Generally I haven't seen any packet loss, but I never checked while
>>> launching an instance.
>>> I will check that and come back.
>>> But every time I launch an instance, it gets stuck in the spawning
>>> state and the hypervisor goes down, so I am not sure whether packet
>>> loss causes this.
>>>
>>> With regards,
>>> Swogat pradhan
>>>
>>> On Wed, Mar 1, 2023 at 3:30 PM Eugen Block <eblock at nde.ag> wrote:
>>>
>>>> One more thing coming to mind is MTU size. Are they identical between
>>>> central and edge site? Do you see packet loss through the tunnel?
>>>>
>>>> Quoting Swogat Pradhan <swogatpradhan22 at gmail.com>:
>>>>
>>>> > Hi Eugen,
>>>> > Please add my email to 'To' or 'Cc', as I am not receiving your
>>>> > emails.
>>>> > Coming to the issue:
>>>> >
>>>> > [root at overcloud-controller-no-ceph-3 /]# rabbitmqctl list_policies -p /
>>>> > Listing policies for vhost "/" ...
>>>> > vhost   name    pattern apply-to        definition      priority
>>>> > /       ha-all  ^(?!amq\.).*    queues
>>>> > {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"}   0
>>>> >
>>>> > The edge site compute nodes are up; a node only goes down when I
>>>> > try to launch an instance, and the instance reaches the spawning
>>>> > state and then gets stuck.
>>>> >
>>>> > I have a tunnel setup between the central and the edge sites.
>>>> >
>>>> > With regards,
>>>> > Swogat Pradhan
>>>> >
>>>> > On Tue, Feb 28, 2023 at 9:11 PM Swogat Pradhan <
>>>> swogatpradhan22 at gmail.com>
>>>> > wrote:
>>>> >
>>>> >> Hi Eugen,
>>>> >> For some reason I am not receiving your emails directly; I am
>>>> >> checking the email digest and finding your replies there.
>>>> >> Here is the log for download: https://we.tl/t-L8FEkGZFSq
>>>> >> Yes, these logs are from the time when the issue occurred.
>>>> >>
>>>> >> *Note: I am able to create VMs and perform other activities in the
>>>> >> central site; I am only facing this issue in the edge site.*
>>>> >>
>>>> >> With regards,
>>>> >> Swogat Pradhan
>>>> >>
>>>> >> On Mon, Feb 27, 2023 at 5:12 PM Swogat Pradhan <
>>>> swogatpradhan22 at gmail.com>
>>>> >> wrote:
>>>> >>
>>>> >>> Hi Eugen,
>>>> >>> Thanks for your response.
>>>> >>> I actually have a 4-controller setup, so here are the details:
>>>> >>>
>>>> >>> *PCS Status:*
>>>> >>>   * Container bundle set: rabbitmq-bundle
>>>> >>> [172.25.201.68:8787/tripleomaster/openstack-rabbitmq:pcmklatest]:
>>>> >>>     * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-no-ceph-3
>>>> >>>     * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-2
>>>> >>>     * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-1
>>>> >>>     * rabbitmq-bundle-3 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-0
>>>> >>>
>>>> >>> I have tried restarting the bundle multiple times, but the issue
>>>> >>> is still present.
>>>> >>>
>>>> >>> *Cluster status:*
>>>> >>> [root at overcloud-controller-0 /]# rabbitmqctl cluster_status
>>>> >>> Cluster status of node
>>>> >>> rabbit at overcloud-controller-0.internalapi.bdxworld.com ...
>>>> >>> Basics
>>>> >>>
>>>> >>> Cluster name: rabbit at overcloud-controller-no-ceph-3.bdxworld.com
>>>> >>>
>>>> >>> Disk Nodes
>>>> >>>
>>>> >>> rabbit at overcloud-controller-0.internalapi.bdxworld.com
>>>> >>> rabbit at overcloud-controller-1.internalapi.bdxworld.com
>>>> >>> rabbit at overcloud-controller-2.internalapi.bdxworld.com
>>>> >>> rabbit at overcloud-controller-no-ceph-3.internalapi.bdxworld.com
>>>> >>>
>>>> >>> Running Nodes
>>>> >>>
>>>> >>> rabbit at overcloud-controller-0.internalapi.bdxworld.com
>>>> >>> rabbit at overcloud-controller-1.internalapi.bdxworld.com
>>>> >>> rabbit at overcloud-controller-2.internalapi.bdxworld.com
>>>> >>> rabbit at overcloud-controller-no-ceph-3.internalapi.bdxworld.com
>>>> >>>
>>>> >>> Versions
>>>> >>>
>>>> >>> rabbit at overcloud-controller-0.internalapi.bdxworld.com: RabbitMQ 3.8.3 on Erlang 22.3.4.1
>>>> >>> rabbit at overcloud-controller-1.internalapi.bdxworld.com: RabbitMQ 3.8.3 on Erlang 22.3.4.1
>>>> >>> rabbit at overcloud-controller-2.internalapi.bdxworld.com: RabbitMQ 3.8.3 on Erlang 22.3.4.1
>>>> >>> rabbit at overcloud-controller-no-ceph-3.internalapi.bdxworld.com: RabbitMQ 3.8.3 on Erlang 22.3.4.1
>>>> >>>
>>>> >>> Alarms
>>>> >>>
>>>> >>> (none)
>>>> >>>
>>>> >>> Network Partitions
>>>> >>>
>>>> >>> (none)
>>>> >>>
>>>> >>> Listeners
>>>> >>>
>>>> >>> Node: rabbit at overcloud-controller-0.internalapi.bdxworld.com, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
>>>> >>> Node: rabbit at overcloud-controller-0.internalapi.bdxworld.com, interface: 172.25.201.212, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
>>>> >>> Node: rabbit at overcloud-controller-0.internalapi.bdxworld.com, interface: [::], port: 15672, protocol: http, purpose: HTTP API
>>>> >>> Node: rabbit at overcloud-controller-1.internalapi.bdxworld.com, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
>>>> >>> Node: rabbit at overcloud-controller-1.internalapi.bdxworld.com, interface: 172.25.201.205, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
>>>> >>> Node: rabbit at overcloud-controller-1.internalapi.bdxworld.com, interface: [::], port: 15672, protocol: http, purpose: HTTP API
>>>> >>> Node: rabbit at overcloud-controller-2.internalapi.bdxworld.com, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
>>>> >>> Node: rabbit at overcloud-controller-2.internalapi.bdxworld.com, interface: 172.25.201.201, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
>>>> >>> Node: rabbit at overcloud-controller-2.internalapi.bdxworld.com, interface: [::], port: 15672, protocol: http, purpose: HTTP API
>>>> >>> Node: rabbit at overcloud-controller-no-ceph-3.internalapi.bdxworld.com, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
>>>> >>> Node: rabbit at overcloud-controller-no-ceph-3.internalapi.bdxworld.com, interface: 172.25.201.209, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
>>>> >>> Node: rabbit at overcloud-controller-no-ceph-3.internalapi.bdxworld.com, interface: [::], port: 15672, protocol: http, purpose: HTTP API
>>>> >>>
>>>> >>> Feature flags
>>>> >>>
>>>> >>> Flag: drop_unroutable_metric, state: enabled
>>>> >>> Flag: empty_basic_get_metric, state: enabled
>>>> >>> Flag: implicit_default_bindings, state: enabled
>>>> >>> Flag: quorum_queue, state: enabled
>>>> >>> Flag: virtual_host_metadata, state: enabled
>>>> >>>
>>>> >>> *Logs:*
>>>> >>> *(Attached)*
>>>> >>>
>>>> >>> With regards,
>>>> >>> Swogat Pradhan
>>>> >>>
>>>> >>> On Sun, Feb 26, 2023 at 2:34 PM Swogat Pradhan <
>>>> swogatpradhan22 at gmail.com>
>>>> >>> wrote:
>>>> >>>
>>>> >>>> Hi,
>>>> >>>> Please find the nova-conductor as well as the nova-api log.
>>>> >>>>
>>>> >>>> nova-conductor:
>>>> >>>>
>>>> >>>> 2023-02-26 08:45:01.108 31 WARNING oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to 16152921c1eb45c2b1f562087140168b
>>>> >>>> 2023-02-26 08:45:02.144 26 WARNING oslo_messaging._drivers.amqpdriver [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] reply_276049ec36a84486a8a406911d9802f4 doesn't exist, drop reply to 83dbe5f567a940b698acfe986f6194fa
>>>> >>>> 2023-02-26 08:45:02.314 32 WARNING oslo_messaging._drivers.amqpdriver [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] reply_276049ec36a84486a8a406911d9802f4 doesn't exist, drop reply to f3bfd7f65bd542b18d84cea3033abb43: oslo_messaging.exceptions.MessageUndeliverable
>>>> >>>> 2023-02-26 08:45:02.316 32 ERROR oslo_messaging._drivers.amqpdriver [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] The reply f3bfd7f65bd542b18d84cea3033abb43 failed to send after 60 seconds due to a missing queue (reply_276049ec36a84486a8a406911d9802f4). Abandoning...: oslo_messaging.exceptions.MessageUndeliverable
>>>> >>>> 2023-02-26 08:48:01.282 35 WARNING oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to d4b9180f91a94f9a82c3c9c4b7595566: oslo_messaging.exceptions.MessageUndeliverable
>>>> >>>> 2023-02-26 08:48:01.284 35 ERROR oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply d4b9180f91a94f9a82c3c9c4b7595566 failed to send after 60 seconds due to a missing queue (reply_349bcb075f8c49329435a0f884b33066). Abandoning...: oslo_messaging.exceptions.MessageUndeliverable
>>>> >>>> 2023-02-26 08:49:01.303 33 WARNING oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to 897911a234a445d8a0d8af02ece40f6f: oslo_messaging.exceptions.MessageUndeliverable
>>>> >>>> 2023-02-26 08:49:01.304 33 ERROR oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply 897911a234a445d8a0d8af02ece40f6f failed to send after 60 seconds due to a missing queue (reply_349bcb075f8c49329435a0f884b33066). Abandoning...: oslo_messaging.exceptions.MessageUndeliverable
>>>> >>>> 2023-02-26 08:49:52.254 31 WARNING nova.cache_utils [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Cache enabled with backend dogpile.cache.null.
>>>> >>>> 2023-02-26 08:50:01.264 27 WARNING oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to 8f723ceb10c3472db9a9f324861df2bb: oslo_messaging.exceptions.MessageUndeliverable
>>>> >>>> 2023-02-26 08:50:01.266 27 ERROR oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply 8f723ceb10c3472db9a9f324861df2bb failed to send after 60 seconds due to a missing queue (reply_349bcb075f8c49329435a0f884b33066). Abandoning...: oslo_messaging.exceptions.MessageUndeliverable
>>>> >>>>
>>>> >>>> With regards,
>>>> >>>> Swogat Pradhan
>>>> >>>>
>>>> >>>> On Sun, Feb 26, 2023 at 2:26 PM Swogat Pradhan <
>>>> >>>> swogatpradhan22 at gmail.com> wrote:
>>>> >>>>
>>>> >>>>> Hi,
>>>> >>>>> I currently have 3 compute nodes on edge site1 where I am trying
>>>> >>>>> to launch VMs.
>>>> >>>>> When a VM is in the spawning state, the node goes down (openstack
>>>> >>>>> compute service list); the node comes back up when I restart the
>>>> >>>>> nova-compute service, but then the launch of the VM fails.
>>>> >>>>>
>>>> >>>>> nova-compute.log
>>>> >>>>>
>>>> >>>>> 2023-02-26 08:15:51.808 7 INFO nova.compute.manager [req-bc0f5f2e-53fc-4dae-b1da-82f1f972d617 - - - - -] Running instance usage audit for host dcn01-hci-0.bdxworld.com from 2023-02-26 07:00:00 to 2023-02-26 08:00:00. 0 instances.
>>>> >>>>> 2023-02-26 08:49:52.813 7 INFO nova.compute.claims [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] [instance: 0c62c1ef-9010-417d-a05f-4db77e901600] Claim successful on node dcn01-hci-0.bdxworld.com
>>>> >>>>> 2023-02-26 08:49:54.225 7 INFO nova.virt.libvirt.driver [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] [instance: 0c62c1ef-9010-417d-a05f-4db77e901600] Ignoring supplied device name: /dev/vda. Libvirt can't honour user-supplied dev names
>>>> >>>>> 2023-02-26 08:49:54.398 7 INFO nova.virt.block_device [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] [instance: 0c62c1ef-9010-417d-a05f-4db77e901600] Booting with volume c4bd7885-5973-4860-bbe6-7a2f726baeee at /dev/vda
>>>> >>>>> 2023-02-26 08:49:55.216 7 WARNING nova.cache_utils [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Cache enabled with backend dogpile.cache.null.
>>>> >>>>> 2023-02-26 08:49:55.283 7 INFO oslo.privsep.daemon [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Running privsep helper: ['sudo', 'nova-rootwrap', '/etc/nova/rootwrap.conf', 'privsep-helper', '--config-file', '/etc/nova/nova.conf', '--config-file', '/etc/nova/nova-compute.conf', '--privsep_context', 'os_brick.privileged.default', '--privsep_sock_path', '/tmp/tmpin40tah6/privsep.sock']
>>>> >>>>> 2023-02-26 08:49:55.791 7 INFO oslo.privsep.daemon [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Spawned new privsep daemon via rootwrap
>>>> >>>>> 2023-02-26 08:49:55.717 2647 INFO oslo.privsep.daemon [-] privsep daemon starting
>>>> >>>>> 2023-02-26 08:49:55.722 2647 INFO oslo.privsep.daemon [-] privsep process running with uid/gid: 0/0
>>>> >>>>> 2023-02-26 08:49:55.726 2647 INFO oslo.privsep.daemon [-] privsep process running with capabilities (eff/prm/inh): CAP_SYS_ADMIN/CAP_SYS_ADMIN/none
>>>> >>>>> 2023-02-26 08:49:55.726 2647 INFO oslo.privsep.daemon [-] privsep daemon running as pid 2647
>>>> >>>>> 2023-02-26 08:49:55.956 7 WARNING os_brick.initiator.connectors.nvmeof [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Process execution error in _get_host_uuid: Unexpected error while running command.
>>>> >>>>> Command: blkid overlay -s UUID -o value
>>>> >>>>> Exit code: 2
>>>> >>>>> Stdout: ''
>>>> >>>>> Stderr: '': oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
>>>> >>>>> 2023-02-26 08:49:58.247 7 INFO nova.virt.libvirt.driver [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] [instance: 0c62c1ef-9010-417d-a05f-4db77e901600] Creating image
>>>> >>>>>
>>>> >>>>> Is there a way to solve this issue?
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> With regards,
>>>> >>>>>
>>>> >>>>> Swogat Pradhan
>>>> >>>>>
>>>> >>>>
More information about the openstack-discuss mailing list