DCN compute service goes down when an instance is scheduled to launch | wallaby | tripleo

Eugen Block eblock at nde.ag
Wed Mar 1 08:11:42 UTC 2023


I'm not familiar with TripleO, so I'm not sure how much help I can be
here; maybe someone else can chime in. I would look for network and
rabbit issues. Are the control nodes heavily loaded? Do you see the
compute services from the edge site up all the time? If you run
'watch -n 20 openstack compute service list', do they "flap" all the
time or only when you launch instances? Maybe rabbitmq needs some
tweaking? Can you show your policies?

rabbitmqctl list_policies -p <VHOST>
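
Something like this should work on one of the controllers (I'm
assuming a containerized TripleO deployment here, so the container
name is just a placeholder and the "/" vhost is a guess, check
list_vhosts first). Since nova-conductor complains about missing
reply_ queues in the log below, it might also be worth checking
whether those queues actually exist:

# find the rabbitmq container on a controller
podman ps --format '{{.Names}}' | grep rabbitmq
# list vhosts, policies and the reply queues
podman exec <rabbitmq_container> rabbitmqctl list_vhosts
podman exec <rabbitmq_container> rabbitmqctl list_policies -p /
podman exec <rabbitmq_container> rabbitmqctl list_queues -p / name messages consumers | grep reply_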

What kind of network connection do they have, and is the network
saturated? Is it different at the edge site compared to the central site?
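
To get a rough idea you could check latency, AMQP reachability and
throughput between an edge compute node and the controllers, for
example (the address is just a placeholder for your internal API
VIP or a controller IP, and iperf3 would need to be installed on
both ends):

ping -c 100 -i 0.2 <controller_internalapi_ip>    # look for packet loss and latency spikes
nc -zv <controller_internalapi_ip> 5672           # can the edge node reach the AMQP port?
iperf3 -s                                         # on the controller
iperf3 -c <controller_internalapi_ip> -t 30       # on the edge compute node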

Quoting Swogat Pradhan <swogatpradhan22 at gmail.com>:

> Hi Eugen,
> For some reason I am not getting your emails directly; I am checking
> the email digest and there I am able to find your reply.
> Here is the log for download: https://we.tl/t-L8FEkGZFSq
> Yes, these logs are from the time when the issue occurred.
>
> *Note: I am able to create VMs and perform other activities at the central
> site; I am only facing this issue at the edge site.*
>
> With regards,
> Swogat Pradhan
>
> On Mon, Feb 27, 2023 at 5:12 PM Swogat Pradhan <swogatpradhan22 at gmail.com>
> wrote:
>
>> Hi Eugen,
>> Thanks for your response.
>> I actually have a 4-controller setup, so here are the details:
>>
>> *PCS Status:*
>>   * Container bundle set: rabbitmq-bundle [
>> 172.25.201.68:8787/tripleomaster/openstack-rabbitmq:pcmklatest]:
>>     * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster):       Started
>> overcloud-controller-no-ceph-3
>>     * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster):       Started
>> overcloud-controller-2
>>     * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster):       Started
>> overcloud-controller-1
>>     * rabbitmq-bundle-3 (ocf::heartbeat:rabbitmq-cluster):       Started
>> overcloud-controller-0
>>
>> I have tried restarting the bundle multiple times, but the issue is still
>> present.
>>
>> *Cluster status:*
>> [root at overcloud-controller-0 /]# rabbitmqctl cluster_status
>> Cluster status of node
>> rabbit at overcloud-controller-0.internalapi.bdxworld.com ...
>> Basics
>>
>> Cluster name: rabbit at overcloud-controller-no-ceph-3.bdxworld.com
>>
>> Disk Nodes
>>
>> rabbit at overcloud-controller-0.internalapi.bdxworld.com
>> rabbit at overcloud-controller-1.internalapi.bdxworld.com
>> rabbit at overcloud-controller-2.internalapi.bdxworld.com
>> rabbit at overcloud-controller-no-ceph-3.internalapi.bdxworld.com
>>
>> Running Nodes
>>
>> rabbit at overcloud-controller-0.internalapi.bdxworld.com
>> rabbit at overcloud-controller-1.internalapi.bdxworld.com
>> rabbit at overcloud-controller-2.internalapi.bdxworld.com
>> rabbit at overcloud-controller-no-ceph-3.internalapi.bdxworld.com
>>
>> Versions
>>
>> rabbit at overcloud-controller-0.internalapi.bdxworld.com: RabbitMQ 3.8.3 on
>> Erlang 22.3.4.1
>> rabbit at overcloud-controller-1.internalapi.bdxworld.com: RabbitMQ 3.8.3 on
>> Erlang 22.3.4.1
>> rabbit at overcloud-controller-2.internalapi.bdxworld.com: RabbitMQ 3.8.3 on
>> Erlang 22.3.4.1
>> rabbit at overcloud-controller-no-ceph-3.internalapi.bdxworld.com: RabbitMQ
>> 3.8.3 on Erlang 22.3.4.1
>>
>> Alarms
>>
>> (none)
>>
>> Network Partitions
>>
>> (none)
>>
>> Listeners
>>
>> Node: rabbit at overcloud-controller-0.internalapi.bdxworld.com, interface:
>> [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool
>> communication
>> Node: rabbit at overcloud-controller-0.internalapi.bdxworld.com, interface:
>> 172.25.201.212, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
>> Node: rabbit at overcloud-controller-0.internalapi.bdxworld.com, interface:
>> [::], port: 15672, protocol: http, purpose: HTTP API
>> Node: rabbit at overcloud-controller-1.internalapi.bdxworld.com, interface:
>> [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool
>> communication
>> Node: rabbit at overcloud-controller-1.internalapi.bdxworld.com, interface:
>> 172.25.201.205, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
>> Node: rabbit at overcloud-controller-1.internalapi.bdxworld.com, interface:
>> [::], port: 15672, protocol: http, purpose: HTTP API
>> Node: rabbit at overcloud-controller-2.internalapi.bdxworld.com, interface:
>> [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool
>> communication
>> Node: rabbit at overcloud-controller-2.internalapi.bdxworld.com, interface:
>> 172.25.201.201, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
>> Node: rabbit at overcloud-controller-2.internalapi.bdxworld.com, interface:
>> [::], port: 15672, protocol: http, purpose: HTTP API
>> Node: rabbit at overcloud-controller-no-ceph-3.internalapi.bdxworld.com,
>> interface: [::], port: 25672, protocol: clustering, purpose: inter-node and
>> CLI tool communication
>> Node: rabbit at overcloud-controller-no-ceph-3.internalapi.bdxworld.com,
>> interface: 172.25.201.209, port: 5672, protocol: amqp, purpose: AMQP 0-9-1
>> and AMQP 1.0
>> Node: rabbit at overcloud-controller-no-ceph-3.internalapi.bdxworld.com,
>> interface: [::], port: 15672, protocol: http, purpose: HTTP API
>>
>> Feature flags
>>
>> Flag: drop_unroutable_metric, state: enabled
>> Flag: empty_basic_get_metric, state: enabled
>> Flag: implicit_default_bindings, state: enabled
>> Flag: quorum_queue, state: enabled
>> Flag: virtual_host_metadata, state: enabled
>>
>> *Logs:*
>> *(Attached)*
>>
>> With regards,
>> Swogat Pradhan
>>
>> On Sun, Feb 26, 2023 at 2:34 PM Swogat Pradhan <swogatpradhan22 at gmail.com>
>> wrote:
>>
>>> Hi,
>>> Please find the nova-conductor as well as the nova-api log.
>>>
>>> nova-conductor:
>>>
>>> 2023-02-26 08:45:01.108 31 WARNING oslo_messaging._drivers.amqpdriver
>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -]
>>> reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to
>>> 16152921c1eb45c2b1f562087140168b
>>> 2023-02-26 08:45:02.144 26 WARNING oslo_messaging._drivers.amqpdriver
>>> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -]
>>> reply_276049ec36a84486a8a406911d9802f4 doesn't exist, drop reply to
>>> 83dbe5f567a940b698acfe986f6194fa
>>> 2023-02-26 08:45:02.314 32 WARNING oslo_messaging._drivers.amqpdriver
>>> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -]
>>> reply_276049ec36a84486a8a406911d9802f4 doesn't exist, drop reply to
>>> f3bfd7f65bd542b18d84cea3033abb43:
>>> oslo_messaging.exceptions.MessageUndeliverable
>>> 2023-02-26 08:45:02.316 32 ERROR oslo_messaging._drivers.amqpdriver
>>> [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] The reply
>>> f3bfd7f65bd542b18d84cea3033abb43 failed to send after 60 seconds due to a
>>> missing queue (reply_276049ec36a84486a8a406911d9802f4). Abandoning...:
>>> oslo_messaging.exceptions.MessageUndeliverable
>>> 2023-02-26 08:48:01.282 35 WARNING oslo_messaging._drivers.amqpdriver
>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -]
>>> reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to
>>> d4b9180f91a94f9a82c3c9c4b7595566:
>>> oslo_messaging.exceptions.MessageUndeliverable
>>> 2023-02-26 08:48:01.284 35 ERROR oslo_messaging._drivers.amqpdriver
>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply
>>> d4b9180f91a94f9a82c3c9c4b7595566 failed to send after 60 seconds due to a
>>> missing queue (reply_349bcb075f8c49329435a0f884b33066). Abandoning...:
>>> oslo_messaging.exceptions.MessageUndeliverable
>>> 2023-02-26 08:49:01.303 33 WARNING oslo_messaging._drivers.amqpdriver
>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -]
>>> reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to
>>> 897911a234a445d8a0d8af02ece40f6f:
>>> oslo_messaging.exceptions.MessageUndeliverable
>>> 2023-02-26 08:49:01.304 33 ERROR oslo_messaging._drivers.amqpdriver
>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply
>>> 897911a234a445d8a0d8af02ece40f6f failed to send after 60 seconds due to a
>>> missing queue (reply_349bcb075f8c49329435a0f884b33066). Abandoning...:
>>> oslo_messaging.exceptions.MessageUndeliverable
>>> 2023-02-26 08:49:52.254 31 WARNING nova.cache_utils
>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db
>>> 4160ce999a31485fa643aed0936dfef0 - default default] Cache enabled with
>>> backend dogpile.cache.null.
>>> 2023-02-26 08:50:01.264 27 WARNING oslo_messaging._drivers.amqpdriver
>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -]
>>> reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to
>>> 8f723ceb10c3472db9a9f324861df2bb:
>>> oslo_messaging.exceptions.MessageUndeliverable
>>> 2023-02-26 08:50:01.266 27 ERROR oslo_messaging._drivers.amqpdriver
>>> [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply
>>> 8f723ceb10c3472db9a9f324861df2bb failed to send after 60 seconds due to a
>>> missing queue (reply_349bcb075f8c49329435a0f884b33066). Abandoning...:
>>> oslo_messaging.exceptions.MessageUndeliverable
>>>
>>> With regards,
>>> Swogat Pradhan
>>>
>>> On Sun, Feb 26, 2023 at 2:26 PM Swogat Pradhan <swogatpradhan22 at gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>> I currently have 3 compute nodes on edge site1 where I am trying to
>>>> launch VMs.
>>>> When the VM is in the spawning state the node goes down (openstack compute
>>>> service list shows it as down); the node comes back up when I restart the
>>>> nova-compute service, but then the launch of the VM fails.
>>>>
>>>> nova-compute.log
>>>>
>>>> 2023-02-26 08:15:51.808 7 INFO nova.compute.manager
>>>> [req-bc0f5f2e-53fc-4dae-b1da-82f1f972d617 - - - - -] Running  
>>>> instance usage
>>>> audit for host dcn01-hci-0.bdxworld.com from 2023-02-26 07:00:00 to
>>>> 2023-02-26 08:00:00. 0 instances.
>>>> 2023-02-26 08:49:52.813 7 INFO nova.compute.claims
>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db
>>>> 4160ce999a31485fa643aed0936dfef0 - default default] [instance:
>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Claim successful on node
>>>> dcn01-hci-0.bdxworld.com
>>>> 2023-02-26 08:49:54.225 7 INFO nova.virt.libvirt.driver
>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db
>>>> 4160ce999a31485fa643aed0936dfef0 - default default] [instance:
>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Ignoring supplied device name:
>>>> /dev/vda. Libvirt can't honour user-supplied dev names
>>>> 2023-02-26 08:49:54.398 7 INFO nova.virt.block_device
>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db
>>>> 4160ce999a31485fa643aed0936dfef0 - default default] [instance:
>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Booting with volume
>>>> c4bd7885-5973-4860-bbe6-7a2f726baeee at /dev/vda
>>>> 2023-02-26 08:49:55.216 7 WARNING nova.cache_utils
>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db
>>>> 4160ce999a31485fa643aed0936dfef0 - default default] Cache enabled with
>>>> backend dogpile.cache.null.
>>>> 2023-02-26 08:49:55.283 7 INFO oslo.privsep.daemon
>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db
>>>> 4160ce999a31485fa643aed0936dfef0 - default default] Running  
>>>> privsep helper:
>>>> ['sudo', 'nova-rootwrap', '/etc/nova/rootwrap.conf', 'privsep-helper',
>>>> '--config-file', '/etc/nova/nova.conf', '--config-file',
>>>> '/etc/nova/nova-compute.conf', '--privsep_context',
>>>> 'os_brick.privileged.default', '--privsep_sock_path',
>>>> '/tmp/tmpin40tah6/privsep.sock']
>>>> 2023-02-26 08:49:55.791 7 INFO oslo.privsep.daemon
>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db
>>>> 4160ce999a31485fa643aed0936dfef0 - default default] Spawned new privsep
>>>> daemon via rootwrap
>>>> 2023-02-26 08:49:55.717 2647 INFO oslo.privsep.daemon [-] privsep daemon
>>>> starting
>>>> 2023-02-26 08:49:55.722 2647 INFO oslo.privsep.daemon [-] privsep
>>>> process running with uid/gid: 0/0
>>>> 2023-02-26 08:49:55.726 2647 INFO oslo.privsep.daemon [-] privsep
>>>> process running with capabilities (eff/prm/inh):
>>>> CAP_SYS_ADMIN/CAP_SYS_ADMIN/none
>>>> 2023-02-26 08:49:55.726 2647 INFO oslo.privsep.daemon [-] privsep daemon
>>>> running as pid 2647
>>>> 2023-02-26 08:49:55.956 7 WARNING os_brick.initiator.connectors.nvmeof
>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db
>>>> 4160ce999a31485fa643aed0936dfef0 - default default] Process  
>>>> execution error
>>>> in _get_host_uuid: Unexpected error while running command.
>>>> Command: blkid overlay -s UUID -o value
>>>> Exit code: 2
>>>> Stdout: ''
>>>> Stderr: '': oslo_concurrency.processutils.ProcessExecutionError:
>>>> Unexpected error while running command.
>>>> 2023-02-26 08:49:58.247 7 INFO nova.virt.libvirt.driver
>>>> [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db
>>>> 4160ce999a31485fa643aed0936dfef0 - default default] [instance:
>>>> 0c62c1ef-9010-417d-a05f-4db77e901600] Creating image
>>>>
>>>> Is there a way to solve this issue?
>>>>
>>>>
>>>> With regards,
>>>>
>>>> Swogat Pradhan
>>>>
>>>





