Do you have access to an instance which unexpectedly shut down? If so, can you see anything in its syslog right before the shutdown? Maybe I missed it, but did you also provide libvirt logs from a hypervisor? Anything in the syslog of the hypervisor(s)?

How healthy is your ceph cluster (ceph -s)? Do you see anything logged by the monitors while the shutdown happens? Are those instances backed by cinder volumes or ephemeral disks? If cinder is involved as well, check the cinder-volume logs. Basically, you need to look for clues on each involved system right before that shutdown. You could also enable debug logs for nova-compute, maybe that reveals some more information.

Is it possible that someone uses something like virt-manager to manage the instances? I remember we had something similar quite some time ago when we configured some VMs with virt-manager and also powered them off via virt-manager. Turning them on again resulted in nova shutting them down again because the power state in the database differed from the state on the hypervisor. Can you check that as well?

Zitat von AJ_ sunny <jains8550@gmail.com>:
Hi Eugen,
Please find the details below
FROM HYPERVISOR SYSLOG:-
Nov 29 07:07:46 kernel: [ 1171.392249] IPv6: ADDRCONF(NETDEV_CHANGE): qvoc178e343-d8: link becomes ready
Nov 29 07:07:46 kernel: [ 1171.392460] IPv6: ADDRCONF(NETDEV_CHANGE): qvbc178e343-d8: link becomes ready
Nov 29 07:07:46 kernel: [ 1171.397266] qbrc178e343-d8: port 1(qvbc178e343-d8) entered blocking state
Nov 29 07:07:46 kernel: [ 1171.397268] qbrc178e343-d8: port 1(qvbc178e343-d8) entered disabled state
Nov 29 07:07:46 kernel: [ 1171.397414] device qvbc178e343-d8 entered promiscuous mode
FROM DMESG LOG:-
[Wed Nov 29 07:07:45 2023] qbrc178e343-d8: port 1(qvbc178e343-d8) entered disabled state
[Wed Nov 29 07:07:45 2023] device qvbc178e343-d8 entered promiscuous mode
[Wed Nov 29 07:07:45 2023] qbrc178e343-d8: port 1(qvbc178e343-d8) entered blocking state
[Wed Nov 29 07:07:45 2023] qbrc178e343-d8: port 1(qvbc178e343-d8) entered forwarding state
[Wed Nov 29 07:07:45 2023] device qvoc178e343-d8 entered promiscuous mode
[Wed Nov 29 07:07:49 2023] qbrc178e343-d8: port 2(tapc178e343-d8) entered blocking state
[Wed Nov 29 07:07:49 2023] qbrc178e343-d8: port 2(tapc178e343-d8) entered disabled state
[Wed Nov 29 07:07:49 2023] device tapc178e343-d8 entered promiscuous mode
[Wed Nov 29 07:07:49 2023] qbrc178e343-d8: port 2(tapc178e343-d8) entered blocking state
[Wed Nov 29 07:07:49 2023] qbrc178e343-d8: port 2(tapc178e343-d8) entered forwarding state
NOVA-COMPUTE LOG:-
2023-11-29 00:38:31.027 7 INFO nova.compute.manager [-] [instance: 1dbd1562-44c1-44b7-9d1e-97ac61716db3] VM Stopped (Lifecycle Event)
2023-11-29 00:38:31.115 7 INFO nova.compute.manager [req-cda5058f-c026-4586-a2b3-50f9727f1220 - - - - -] [instance: 1dbd1562-44c1-44b7-9d1e-97ac61716db3] During _sync_instance_power_state the DB power_state (1) does not match the vm_power_state from the hypervisor (4). Updating power_state in the DB to match the hypervisor.
2023-11-29 00:38:34.045 7 INFO nova.compute.manager [-] [instance: 46b48b4e-3675-453c-8c87-f21f1b7fb86c] VM Stopped (Lifecycle Event)
2023-11-29 00:38:34.080 7 INFO nova.compute.manager [-] [instance: b3df30a6-de61-448e-8451-938309b20ab5] VM Stopped (Lifecycle Event)
2023-11-29 00:38:34.122 7 INFO nova.compute.manager [req-a6d0b47f-f50b-4bbc-a0d7-df49511ca4a7 - - - - -] [instance: 46b48b4e-3675-453c-8c87-f21f1b7fb86c] During _sync_instance_power_state the DB power_state (1) does not match the vm_power_state from the hypervisor (4). Updating power_state in the DB to match the hypervisor.
2023-11-29 00:38:34.164 7 INFO nova.compute.manager [req-7aab66e0-73f6-4ae1-960d-2f24dfba3131 - - - - -] [instance: b3df30a6-de61-448e-8451-938309b20ab5] During _sync_instance_power_state the DB power_state (1) does not match the vm_power_state from the hypervisor (4). Updating power_state in the DB to match the hypervisor.
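(For reference, nova's power_state 1 means RUNNING in the database and 4 means SHUTDOWN as reported by the hypervisor, so libvirt is telling nova that these guests are already powered off.)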
Multiple virtual machines went down on different hypervisors, and the OS disks reside on ceph storage. During this incident, if I restart a machine directly I get an I/O error on the console, so first I have to rebuild the os-disk volume on the ceph side and then restart the machine for the VM to function properly.
I also checked the instance console logs but found nothing suspicious.
Thanks & Regards Arihant Jain
On Mon, Nov 27, 2023 at 7:49 PM Eugen Block <eblock@nde.ag> wrote:
I don't see how ceph could be the issue here. Do you have libvirt logs and dmesg output from the hypervisor and something from the VM like the relevant syslog excerpts (before it gets shut down)? Is only one VM affected or all from the same hypervisor or several across different hypervisors? The nova-compute.log doesn't seem to be enough, but you could also enable debug logs to see if it reveals more.
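For example (paths and instance names below are just placeholders; in a containerized kolla deployment the log locations may differ), I would compare what nova and libvirt think the power state is and look at the per-instance libvirt log:

  virsh domstate instance-0000001a
  openstack server show <server-id> -c OS-EXT-STS:power_state
  less /var/log/libvirt/qemu/instance-0000001a.log

Debug logging for nova-compute can be enabled with debug = True in the [DEFAULT] section of nova.conf on the compute node, followed by a restart of the nova-compute service.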
Zitat von AJ_ sunny <jains8550@gmail.com>:
Hi team,
After making the changes above I am still seeing the issue: machines keep shutting down on their own.
In the nova-compute logs this is the only footprint I find:
Logs:-
2023-10-16 08:48:10.971 7 WARNING nova.compute.manager [req-c7b731db-2b61-400e-917f-8645c9984696 f226d81a45dd46488fb2e19515848 316d215042914de190f5f9e1c8466bf0 default default] [instance: 4b04d3f1-1fbd-4b63-b693-a0ef316ecff3] Received unexpected event network-vif-plugged-f191f6c8-dff5-4c1b-94b3-8d91aa6ff5ac for instance with vm_state active and task_state None.
2023-10-21 22:42:44.589 7 INFO nova.compute.manager [-] [instance: 4b04d3f1-1fbd-4b63-b693-a0ef316ecff3] VM Stopped (Lifecycle Event)
2023-10-21 22:42:44.683 7 INFO nova.compute.manager [req-1d99b87b-7ff7-462d-ab18-fbdec6bda71d -] [instance: 4b04d3f1-1fbd-4b63-b693-a0ef316ecff3] During _sync_instance_power_state the DB power_state (1) does not match the vm_power_state from the hypervisor (4). Updating power_state in the DB to match the hypervisor.
2023-10-21 22:42:44.811 7 WARNING nova.compute.manager [req-1d99b87b-7ff7-462d-ab18-fbdec6bda71d ----] [instance: 4b04d3f1-1fbd-4b63-b693-a0ef316ecff3] Instance shutdown by itself. Calling the stop API. Current vm_state: active, current task_state: None, original DB power_state: 1, current VM power_state: 4
2023-10-21 22:42:44.977 7 INFO nova.compute.manager [req-1d99b87b-7ff7-462d-ab18-fbdec6bda71d -] [instance: 4b04d3f1-1fbd-4b63-b693-a0ef316ecff3] Instance is already powered off in the hypervisor when stop is called.
In this architecture we are using ceph as the backend storage for nova, glance and cinder. When a machine goes down by itself and I try to start it, it goes into an error state, i.e. the VM console shows I/O ERROR during boot, so first I need to rebuild the volume on the ceph side and then start the machine:

  rbd object-map rebuild <volume-id>
  openstack server start <server-id>
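(The exact rbd invocation depends on the pool and image naming in the deployment; assuming the common cinder RBD layout with a "volumes" pool and images named volume-<id>, it would look roughly like:

  rbd object-map rebuild volumes/volume-<volume-id>

adjust the pool and image names to your setup.)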
So this issue shows two faces, one on the ceph side and one in the nova-compute log. Can someone please help me figure this out as soon as possible?
Thanks & Regards Arihant Jain
On Tue, 24 Oct, 2023, 4:56 pm, <smooney@redhat.com> wrote:

On Tue, 2023-10-24 at 10:11 +0530, AJ_ sunny wrote:
Hi team,
The VM is not being shut off by the owner from inside; it automatically goes into shutdown, i.e. a libvirt lifecycle stop event is triggering. In my nova.conf configuration I am using ram_allocation_ratio = 1.5, and previously I tried setting sync_power_state_interval = -1 in nova.conf but am still facing the same problem. OOM might be causing this issue. Can you please give me some idea how to fix this issue if OOM is the cause?

the general answer is swap.
nova should always be deployed with swap even if you do not have overcommit enabled. there are a few reasons for this, the first being that python allocates memory differently if any swap is available; even 1G is enough to stop it trying to commit all memory. so when swap is available the nova/neutron agents will use much less resident memory even without using any of the swap space.
we have some docs about this downstream
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/17....
if you are being ultra conservative we recommend allocating (ram * allocation ratio) in swap, so in your case allocate 1.5 times your ram as swap. we would expect the actual usage of swap to be a small fraction of that, however, so we also provide a formula:
overcommit_ratio = NovaRAMAllocationRatio - 1
Minimum swap size (MB) = (total_RAM * overcommit_ratio) + RHEL_min_swap
Recommended swap size (MB) = total_RAM * (overcommit_ratio + percentage_of_RAM_to_use_for_swap)
so say your host had 64G of ram with an allocation ratio of 1.5 and a minimum swap percentage of 25%, the conservative swap recommendation would be:

(64 * (0.5 + 0.25)) + distro_min_swap = (64 * 0.75) + 4G = 52G of recommended swap
if you are wondering why we add a minimum swap percentage and a distro minimum swap, it is basically to account for the qemu and host OS memory overhead as well as the memory used by the nova/neutron agents and libvirt/ovs.
if you are not using memory overcommit my general recommendation is: if you have less than 64G of ram allocate 16G, if you have more than 256G of ram allocate 64G, and you should be fine. when you do use memory overcommit you must have at least enough swap to account for the qemu overhead of all instances + the overcommitted memory.
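as a rough sketch (plain linux commands, nothing openstack specific; pick the size from the sizing above), adding a 16G swap file on a hypervisor looks like:

  fallocate -l 16G /swapfile
  chmod 600 /swapfile
  mkswap /swapfile
  swapon /swapfile
  # make it persistent across reboots
  echo '/swapfile none swap sw 0 0' >> /etc/fstab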
the other common cause of OOM errors is if you are using numa affinity and the guests don't request hw:mem_page_size=<something>. without a mem_page_size request we don't do numa-aware memory placement, and the kernel OOM system works on a per numa node basis. numa affinity does not support memory overcommit either, so that is likely not your issue; i just mention it to cover all bases.
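if you do go down that route, setting a page size is a one-line flavor change, e.g. (flavor name is just an example):

  openstack flavor set m1.numa --property hw:mem_page_size=small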
regards sean
Thanks & Regards Arihant Jain
On Mon, 23 Oct, 2023, 11:29 pm , <smooney@redhat.com> wrote:
On Mon, 2023-10-23 at 13:19 -0400, Jonathan Proulx wrote:
> I've seen similar log traces with overcommitted memory when the
> hypervisor runs out of physical memory and OOM killer gets the VM
> process.
>
> This is an unusual configuration (I think) but if the VM owner claims
> they didn't power down the VM internally you might look at the local
> hypervisor logs to see if the VM process crashed or was killed for some
> other reason.

yep, OOM events are one common cause of this.
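a quick way to confirm or rule that out on the affected hypervisors is to grep the kernel log (the timestamps should line up with the shutdowns in nova-compute.log), e.g.:

  dmesg -T | grep -iE 'out of memory|oom-killer|killed process'
  journalctl -k | grep -i oom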
nova is basically just saying "hey, you said this vm should be active but it's not, i'm going to update the db to reflect reality." you can turn that off with
https://docs.openstack.org/nova/latest/configuration/config.html#workarounds...
or
https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.syn...
i.e. either disable the sync by setting the interval to -1 or disable handling of the virt lifecycle events.
i would recommend the sync_power_state_interval approach, but again, if vms are stopping and you don't know why, you should probably discover why rather than just turning off the update of the nova db to reflect the actual state.
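for reference, the two options live in nova.conf roughly like this (shown only as an illustration; handle_virt_lifecycle_events defaults to true):

  [DEFAULT]
  sync_power_state_interval = -1

  [workarounds]
  handle_virt_lifecycle_events = false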
>
> -Jon
>
> On Mon, Oct 23, 2023 at 02:02:26PM +0100, smooney@redhat.com wrote:
> :On Mon, 2023-10-23 at 17:45 +0530, AJ_ sunny wrote:
> :> Hi team,
> :>
> :> I am using openstack kolla ansible on wallaby version and currently I am
> :> facing issue with virtual machine, vm is shutoff by itself and from log
> :> it seems libvirt lifecycle stop event triggering again and again
> :>
> :> Logs:-
> :> 2023-10-16 08:48:10.971 7 WARNING nova.compute.manager
> :> [req-c7b731db-2b61-400e-917f-8645c9984696 f226d81a45dd46488fb2e19515848
> :> 316d215042914de190f5f9e1c8466bf0 default default] [instance:
> :> 4b04d3f1-1fbd-4b63-b693-a0ef316ecff3] Received unexpected event
> :> network-vif-plugged-f191f6c8-dff5-4c1b-94b3-8d91aa6ff5ac for instance with
> :> vm_state active and task_state None.
> :> 2023-10-21 22:42:44.589 7 INFO nova.compute.manager [-]
> :> [instance: 4b04d3f1-1fbd-4b63-b693-a0ef316ecff3] VM Stopped (Lifecycle Event)
> :>
> :> 2023-10-21 22:42:44.683 7 INFO nova.compute.manager
> :> [req-1d99b87b-7ff7-462d-ab18-fbdec6bda71d -] [instance:
> :> 4b04d3f1-1fbd-4b63-b693-a0ef316ecff3] During _sync_instance_power_state the DB
> :> power_state (1) does not match the vm_power_state from the hypervisor (4).
> :> Updating power_state in the DB to match the hypervisor.
> :>
> :> 2023-10-21 22:42:44.811 7 WARNING nova.compute.manager
> :> [req-1d99b87b-7ff7-462d-ab18-fbdec6bda71d ----] [instance:
> :> 4b04d3f1-1fbd-4b63-b693-a0ef316ecff3] Instance shutdown by itself. Calling the
> :> stop API. Current vm_state: active, current task_state: None, original DB
> :> power_state: 1, current VM power_state: 4
> :> 2023-10-21 22:42:44.977 7 INFO nova.compute.manager
> :> [req-1d99b87b-7ff7-462d-ab18-fbdec6bda71d -] [instance:
> :> 4b04d3f1-1fbd-4b63-b693-a0ef316ecff3] Instance is already powered off in the
> :> hypervisor when stop is called.
> :
> :that sounds like the guest os shutdown the vm.
> :i.e. something in the guest ran sudo poweroff
> :then nova detected the vm was stopped by the user and updated its db to match
> :that.
> :
> :that is the expected behavior when you have the power sync enabled.
> :it is enabled by default.
> :>
> :>
> :> Thanks & Regards
> :> Arihant Jain
> :> +91 8299719369
> :
>