Update: After restarting the nova services on the controller and running the deploy script on the edge site, I was able to launch the VM from a volume. Right now, instance creation is failing because the block device creation is stuck in the creating state; it is taking more than 10 minutes for the volume to be created, even though the image has already been imported to the edge Glance. I will create a fresh image, test again and then update.

With regards,
Swogat Pradhan

On Wed, Mar 15, 2023 at 11:13 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
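As a general check, where a volume stuck in "creating" is spending its time can usually be narrowed down with the following (the volume ID is a placeholder; the log path assumes the default containerized TripleO layout):

openstack volume list --long
openstack volume show <volume-id> -c status -c created_at
# on the edge node running cinder-volume:
sudo tail -f /var/log/containers/cinder/cinder-volume.log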
Update: In the hypervisor list, the compute node state is showing as down.
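A quick way to confirm which compute services are marked down and when they last checked in (standard openstack CLI):

openstack compute service list --service nova-compute
openstack hypervisor list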
On Wed, Mar 15, 2023 at 11:11 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi Brendan, I have now deployed another site where I used a network template with 2 Linux bonds for both the 3 compute nodes and the 3 Ceph nodes. The bonding options are set to mode=802.3ad (lacp=active). I used a cirros image to launch an instance, but the instance timed out, so I waited for the volume to be created. Once the volume was created I tried launching the instance from the volume, and the instance is still stuck in the spawning state.
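Side note: with Linux bonds in 802.3ad mode it is worth confirming that LACP actually negotiated with the switch. A minimal check on the compute/Ceph nodes (bond0 is a placeholder interface name) would be:

cat /proc/net/bonding/bond0
# look for "Bonding Mode: IEEE 802.3ad Dynamic link aggregation" and a sane partner/port state per slave
ip -s link show bond0
# RX/TX errors and dropped counters on the bond itself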
Here is the nova-compute log:
2023-03-15 17:35:47.739 185437 INFO oslo.privsep.daemon [-] privsep daemon starting
2023-03-15 17:35:47.744 185437 INFO oslo.privsep.daemon [-] privsep process running with uid/gid: 0/0
2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon [-] privsep process running with capabilities (eff/prm/inh): CAP_SYS_ADMIN/CAP_SYS_ADMIN/none
2023-03-15 17:35:47.749 185437 INFO oslo.privsep.daemon [-] privsep daemon running as pid 185437
2023-03-15 17:35:47.974 8 WARNING os_brick.initiator.connectors.nvmeof [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Process execution error in _get_host_uuid: Unexpected error while running command.
Command: blkid overlay -s UUID -o value
Exit code: 2
Stdout: ''
Stderr: '': oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
2023-03-15 17:35:51.616 8 INFO nova.virt.libvirt.driver [req-dbb11a9b-317e-4957-b141-f9e0bdf6a266 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] [instance: 450b749c-a10a-4308-80a9-3b8020fee758] Creating image
It is stuck at "Creating image". Do I need to run the template mentioned here?: https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/post_dep...
The volume is already created and I do not understand why the instance is stuck in the spawning state.
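One way to see exactly which step the instance is stuck on (the instance and volume IDs are placeholders):

openstack server show <instance-id> -c status -c OS-EXT-STS:task_state -c OS-EXT-STS:vm_state
openstack server event list <instance-id>
openstack volume show <volume-id> -c status

If the task_state sits in block_device_mapping or spawning while the hypervisor simultaneously drops out of "openstack compute service list", that would again suggest the compute node losing its RabbitMQ connection rather than a problem with the volume itself.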
With regards, Swogat Pradhan
On Sun, Mar 5, 2023 at 4:02 PM Brendan Shephard <bshephar@redhat.com> wrote:
Does your environment use different network interfaces for each of the networks? Or does it have a bond with everything on it?
One issue I have seen before is that when launching instances, there is a lot of network traffic between nodes as the hypervisor needs to download the image from Glance. Along with various other services sending normal network traffic, it can be enough to cause issues if everything is running over a single 1Gbe interface.
In fact, I have seen the same situation when using a single active/backup bond on 1GbE NICs. It’s worth checking the network traffic while you try to spawn the instance to see if you’re dropping packets. In the situation I described, there were dropped packets which resulted in a loss of communication between nova_compute and RMQ, so the node appeared offline. You should also confirm in the nova_compute logs that nova_compute is being disconnected, by tailing them on the hypervisor while spawning the instance.
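A simple way to watch for drops and for AMQP disconnects at the same time (the interface name, container log path assume a containerized TripleO deployment; adjust to your environment):

# on the compute node, in one terminal:
watch -n 1 "ip -s link show bond0"
# in another terminal:
sudo tail -f /var/log/containers/nova/nova-compute.log | grep -Ei 'amqp|rabbit|heartbeat'

If the dropped counters climb while the spawn is in progress and the log shows the AMQP connection being reset, the network path is the likely culprit.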
In my case, changing from active/backup to LACP helped. So, based on that experience, from my perspective, it certainly sounds like some kind of network issue.
Regards,
Brendan Shephard
Senior Software Engineer
Red Hat Australia
On 5 Mar 2023, at 6:47 am, Eugen Block <eblock@nde.ag> wrote:
Hi,
I tried to help someone with a similar issue some time ago in this thread:
https://serverfault.com/questions/1116771/openstack-oslo-messaging-exception...
But apparently a neutron reinstallation fixed it for that user; I'm not sure if that could apply here. Is it possible that your nova and neutron versions differ between the central and edge sites? Have you restarted the nova and neutron services on the compute nodes after installation? Do you have debug logs of nova-conductor and maybe nova-compute? Maybe they can help narrow down the issue. If there isn't any additional information in the debug logs I would probably start "tearing down" rabbitmq. I haven't had to do that in a production system yet, so be careful. I can think of two routes:
- Either remove queues, exchanges etc. while rabbit is running; this will most likely impact client IO depending on your load. Check out the rabbitmqctl commands.
- Or stop the rabbitmq cluster, remove the mnesia tables from all nodes and restart rabbitmq so the exchanges, queues etc. rebuild.
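A rough sketch of what those two routes could look like on a TripleO/pacemaker-managed cluster (queue name and data path are placeholders, and the mnesia wipe is destructive, so treat this as an outline rather than a recipe):

# Route 1: inspect and remove individual queues while rabbit is running
rabbitmqctl list_queues -p / name messages consumers | grep reply_
rabbitmqctl delete_queue -p / <queue-name>
# delete_queue should exist on 3.8.x; on older releases use the management API / rabbitmqadmin instead

# Route 2: full teardown (destroys all queues/exchanges; services re-create them on reconnect)
pcs resource disable rabbitmq-bundle
# on every controller, wipe the mnesia data (path assumes the default bind mount):
rm -rf /var/lib/rabbitmq/mnesia/*
pcs resource enable rabbitmq-bundle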
I can imagine that the failed reply "survives" while being replicated across the rabbit nodes. But I don't know the rabbit internals well enough, so maybe someone else can chime in here and give better advice.
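To help narrow that down, one can check whether the reply queue named in the conductor errors quoted below actually exists and where its mirrors live (slave_pids is the mirror column for classic mirrored queues):

rabbitmqctl list_queues -p / name policy pid slave_pids | grep reply_276049ec36a84486a8a406911d9802f4

If the queue is simply gone, the reply can never be delivered no matter which node handles it.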
Regards, Eugen
Zitat von Swogat Pradhan <swogatpradhan22@gmail.com>:
Hi, Can someone please help me out on this issue?
With regards, Swogat Pradhan
On Thu, Mar 2, 2023 at 1:24 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi, I don't see any major packet loss. It seems the problem may be somewhere in rabbitmq, but not due to packet loss.
with regards, Swogat Pradhan
On Wed, Mar 1, 2023 at 3:34 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
Hi, Yes, the MTU is the same as the default, 1500. Generally I haven't seen any packet loss, but I never checked while launching an instance; I will check that and come back. But every time I launch an instance, the instance gets stuck in the spawning state and then the hypervisor goes down, so I am not sure if packet loss causes this.
With regards, Swogat pradhan
On Wed, Mar 1, 2023 at 3:30 PM Eugen Block <eblock@nde.ag> wrote:
One more thing that comes to mind is the MTU size. Is it identical between the central and edge sites? Do you see packet loss through the tunnel?
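A quick way to compare MTUs and to test the effective path MTU through the tunnel (the remote IP is a placeholder; 1472 = 1500 minus 28 bytes of ICMP/IP overhead):

ip link show | grep -i mtu
ping -M do -s 1472 <edge-node-ip>
# fails with "message too long" if the path MTU is smaller than 1500

If the tunnel adds its own encapsulation overhead, a 1500-byte MTU inside the tunnel can silently fragment or drop large frames.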
Zitat von Swogat Pradhan <swogatpradhan22@gmail.com>:
Hi Eugen, Please add my email either on 'to' or 'cc', as I am not getting emails from you directly. Coming to the issue:
[root@overcloud-controller-no-ceph-3 /]# rabbitmqctl list_policies -p /
Listing policies for vhost "/" ...
vhost  name    pattern       apply-to  definition                                                              priority
/      ha-all  ^(?!amq\.).*  queues    {"ha-mode":"exactly","ha-params":2,"ha-promote-on-shutdown":"always"}  0
I have the edge site compute nodes up; a node only goes down when I am trying to launch an instance, and the instance comes to a spawning state and then gets stuck.
I have a tunnel setup between the central and the edge sites.
With regards, Swogat Pradhan
On Tue, Feb 28, 2023 at 9:11 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
Hi Eugen, For some reason I am not getting your emails directly; I am checking the email digest and there I am able to find your reply. Here is the log for download: https://we.tl/t-L8FEkGZFSq Yes, these logs are from the time when the issue occurred.
*Note: I am able to create VMs and perform other activities in the central site; I am only facing this issue in the edge site.*
With regards, Swogat Pradhan
On Mon, Feb 27, 2023 at 5:12 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
Hi Eugen, Thanks for your response. I actually have a 4-controller setup, so here are the details:
*PCS Status:*
* Container bundle set: rabbitmq-bundle [172.25.201.68:8787/tripleomaster/openstack-rabbitmq:pcmklatest]:
  * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-no-ceph-3
  * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-2
  * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-1
  * rabbitmq-bundle-3 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-0
I have tried restarting the bundle multiple times but the issue is still present.
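For reference, restarting a pacemaker-managed bundle is typically done via pcs rather than by restarting the containers directly (the resource name is taken from the pcs status above):

pcs resource restart rabbitmq-bundle
pcs status | grep -A4 rabbitmq-bundle
# confirm all four replicas come back as Started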
*Cluster status:*
[root@overcloud-controller-0 /]# rabbitmqctl cluster_status
Cluster status of node rabbit@overcloud-controller-0.internalapi.bdxworld.com ...
Basics
Cluster name: rabbit@overcloud-controller-no-ceph-3.bdxworld.com
Disk Nodes
rabbit@overcloud-controller-0.internalapi.bdxworld.com
rabbit@overcloud-controller-1.internalapi.bdxworld.com
rabbit@overcloud-controller-2.internalapi.bdxworld.com
rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com
Running Nodes
rabbit@overcloud-controller-0.internalapi.bdxworld.com
rabbit@overcloud-controller-1.internalapi.bdxworld.com
rabbit@overcloud-controller-2.internalapi.bdxworld.com
rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com
Versions
rabbit@overcloud-controller-0.internalapi.bdxworld.com: RabbitMQ 3.8.3 on Erlang 22.3.4.1
rabbit@overcloud-controller-1.internalapi.bdxworld.com: RabbitMQ 3.8.3 on Erlang 22.3.4.1
rabbit@overcloud-controller-2.internalapi.bdxworld.com: RabbitMQ 3.8.3 on Erlang 22.3.4.1
rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com: RabbitMQ 3.8.3 on Erlang 22.3.4.1
Alarms
(none)
Network Partitions
(none)
Listeners
Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, interface: 172.25.201.212, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@overcloud-controller-0.internalapi.bdxworld.com, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, interface: 172.25.201.205, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@overcloud-controller-1.internalapi.bdxworld.com, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, interface: 172.25.201.201, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@overcloud-controller-2.internalapi.bdxworld.com, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com, interface: 172.25.201.209, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@overcloud-controller-no-ceph-3.internalapi.bdxworld.com, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Feature flags
Flag: drop_unroutable_metric, state: enabled
Flag: empty_basic_get_metric, state: enabled
Flag: implicit_default_bindings, state: enabled
Flag: quorum_queue, state: enabled
Flag: virtual_host_metadata, state: enabled
*Logs:* *(Attached)*
With regards, Swogat Pradhan
On Sun, Feb 26, 2023 at 2:34 PM Swogat Pradhan < swogatpradhan22@gmail.com> wrote:
> Hi,
> Please find the nova conductor as well as nova api log.
>
> nova-conductor:
>
> 2023-02-26 08:45:01.108 31 WARNING oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to 16152921c1eb45c2b1f562087140168b
> 2023-02-26 08:45:02.144 26 WARNING oslo_messaging._drivers.amqpdriver [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] reply_276049ec36a84486a8a406911d9802f4 doesn't exist, drop reply to 83dbe5f567a940b698acfe986f6194fa
> 2023-02-26 08:45:02.314 32 WARNING oslo_messaging._drivers.amqpdriver [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] reply_276049ec36a84486a8a406911d9802f4 doesn't exist, drop reply to f3bfd7f65bd542b18d84cea3033abb43: oslo_messaging.exceptions.MessageUndeliverable
> 2023-02-26 08:45:02.316 32 ERROR oslo_messaging._drivers.amqpdriver [req-7b43c4e5-0475-4598-92c0-fcacb51d9813 - - - - -] The reply f3bfd7f65bd542b18d84cea3033abb43 failed to send after 60 seconds due to a missing queue (reply_276049ec36a84486a8a406911d9802f4). Abandoning...: oslo_messaging.exceptions.MessageUndeliverable
> 2023-02-26 08:48:01.282 35 WARNING oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to d4b9180f91a94f9a82c3c9c4b7595566: oslo_messaging.exceptions.MessageUndeliverable
> 2023-02-26 08:48:01.284 35 ERROR oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply d4b9180f91a94f9a82c3c9c4b7595566 failed to send after 60 seconds due to a missing queue (reply_349bcb075f8c49329435a0f884b33066). Abandoning...: oslo_messaging.exceptions.MessageUndeliverable
> 2023-02-26 08:49:01.303 33 WARNING oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to 897911a234a445d8a0d8af02ece40f6f: oslo_messaging.exceptions.MessageUndeliverable
> 2023-02-26 08:49:01.304 33 ERROR oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply 897911a234a445d8a0d8af02ece40f6f failed to send after 60 seconds due to a missing queue (reply_349bcb075f8c49329435a0f884b33066). Abandoning...: oslo_messaging.exceptions.MessageUndeliverable
> 2023-02-26 08:49:52.254 31 WARNING nova.cache_utils [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Cache enabled with backend dogpile.cache.null.
> 2023-02-26 08:50:01.264 27 WARNING oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] reply_349bcb075f8c49329435a0f884b33066 doesn't exist, drop reply to 8f723ceb10c3472db9a9f324861df2bb: oslo_messaging.exceptions.MessageUndeliverable
> 2023-02-26 08:50:01.266 27 ERROR oslo_messaging._drivers.amqpdriver [req-caefe26d-153a-4dfd-9ea6-bc5ca0d46679 - - - - -] The reply 8f723ceb10c3472db9a9f324861df2bb failed to send after 60 seconds due to a missing queue (reply_349bcb075f8c49329435a0f884b33066). Abandoning...: oslo_messaging.exceptions.MessageUndeliverable
>
> With regards,
> Swogat Pradhan
>
> On Sun, Feb 26, 2023 at 2:26 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
>
>> Hi,
>> I currently have 3 compute nodes on edge site1 where I am trying to launch VMs.
>> When the VM is in the spawning state the node goes down (openstack compute service list); the node comes back up when I restart the nova compute service, but then the launch of the VM fails.
>>
>> nova-compute.log
>>
>> 2023-02-26 08:15:51.808 7 INFO nova.compute.manager [req-bc0f5f2e-53fc-4dae-b1da-82f1f972d617 - - - - -] Running instance usage audit for host dcn01-hci-0.bdxworld.com from 2023-02-26 07:00:00 to 2023-02-26 08:00:00. 0 instances.
>> 2023-02-26 08:49:52.813 7 INFO nova.compute.claims [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] [instance: 0c62c1ef-9010-417d-a05f-4db77e901600] Claim successful on node dcn01-hci-0.bdxworld.com
>> 2023-02-26 08:49:54.225 7 INFO nova.virt.libvirt.driver [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] [instance: 0c62c1ef-9010-417d-a05f-4db77e901600] Ignoring supplied device name: /dev/vda. Libvirt can't honour user-supplied dev names
>> 2023-02-26 08:49:54.398 7 INFO nova.virt.block_device [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] [instance: 0c62c1ef-9010-417d-a05f-4db77e901600] Booting with volume c4bd7885-5973-4860-bbe6-7a2f726baeee at /dev/vda
>> 2023-02-26 08:49:55.216 7 WARNING nova.cache_utils [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Cache enabled with backend dogpile.cache.null.
>> 2023-02-26 08:49:55.283 7 INFO oslo.privsep.daemon [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Running privsep helper: ['sudo', 'nova-rootwrap', '/etc/nova/rootwrap.conf', 'privsep-helper', '--config-file', '/etc/nova/nova.conf', '--config-file', '/etc/nova/nova-compute.conf', '--privsep_context', 'os_brick.privileged.default', '--privsep_sock_path', '/tmp/tmpin40tah6/privsep.sock']
>> 2023-02-26 08:49:55.791 7 INFO oslo.privsep.daemon [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Spawned new privsep daemon via rootwrap
>> 2023-02-26 08:49:55.717 2647 INFO oslo.privsep.daemon [-] privsep daemon starting
>> 2023-02-26 08:49:55.722 2647 INFO oslo.privsep.daemon [-] privsep process running with uid/gid: 0/0
>> 2023-02-26 08:49:55.726 2647 INFO oslo.privsep.daemon [-] privsep process running with capabilities (eff/prm/inh): CAP_SYS_ADMIN/CAP_SYS_ADMIN/none
>> 2023-02-26 08:49:55.726 2647 INFO oslo.privsep.daemon [-] privsep daemon running as pid 2647
>> 2023-02-26 08:49:55.956 7 WARNING os_brick.initiator.connectors.nvmeof [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] Process execution error in _get_host_uuid: Unexpected error while running command.
>> Command: blkid overlay -s UUID -o value
>> Exit code: 2
>> Stdout: ''
>> Stderr: '': oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
>> 2023-02-26 08:49:58.247 7 INFO nova.virt.libvirt.driver [req-3a1547ea-326f-4dd0-9127-7f4a4bdf1e45 b240e3e89d99489284cd731e75f2a5db 4160ce999a31485fa643aed0936dfef0 - default default] [instance: 0c62c1ef-9010-417d-a05f-4db77e901600] Creating image
>>
>> Is there a way to solve this issue?
>>
>> With regards,
>> Swogat Pradhan