On Thu, Feb 16, 2023 at 10:44 AM Eugen Block <eblock@nde.ag> wrote:

In addition to Sean's response, this has been asked multiple times,
e.g. [1]. You could check if your hypervisors gave up the lock on the
RBDs or if they are still locked (rbd status <pool>/<image>), in that
case you might need to blacklist the clients and see if that resolves
anything. Do you have regular snapshots (or backups) to be able to
roll back in case of corruption?

[1] https://www.spinics.net/lists/ceph-users/msg45937.html

I usually look for the lock doing something like:

rbd lock ls vms/37d52c81-e78d-4237-b357-db62b820db04_disk

Then remove it doing something like:

rbd lock rm vms/37d52c81-e78d-4237-b357-db62b820db04_disk 'auto 94276942759680' client.56157074

If you have a very large number of VMs, you can gather a list of VM UUIDs with the OpenStack client, and then do some awk or similar voodoo to gather the lock info from Ceph and nuke the locks. After that you should be able to boot the instances normally.
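
For the bulk case, here is a minimal dry-run sketch in Python rather than awk. It is not a tested or Ceph-blessed procedure. It assumes that `rbd lock ls --format json <image>` returns a JSON list of objects with "id" and "locker" keys (the exact shape may differ between Ceph releases) and that images follow the `<pool>/<uuid>_disk` naming shown above. The names `disk_image` and `lock_rm_commands` are made up for illustration, and the address in the sample data is a placeholder.

```python
import json
import shlex


def disk_image(pool, server_id):
    # ASSUMPTION: images are named "<pool>/<uuid>_disk", as in the example above.
    return f"{pool}/{server_id}_disk"


def lock_rm_commands(image_spec, lock_ls_json):
    """Build one `rbd lock rm` argument list per lock reported for image_spec.

    ASSUMPTION: lock_ls_json is the text printed by
    `rbd lock ls --format json <image_spec>` and is a JSON list of objects
    with "id" and "locker" keys. Empty output is treated as "no locks".
    """
    locks = json.loads(lock_ls_json) if lock_ls_json.strip() else []
    return [
        ["rbd", "lock", "rm", image_spec, lock["id"], lock["locker"]]
        for lock in locks
    ]


if __name__ == "__main__":
    # Sample data only; nothing here talks to a real cluster.
    sample = (
        '[{"id": "auto 94276942759680", "locker": "client.56157074", '
        '"address": "192.0.2.10:0/1234"}]'
    )
    image = disk_image("vms", "37d52c81-e78d-4237-b357-db62b820db04")
    for argv in lock_rm_commands(image, sample):
        # Print the command instead of running it so it can be reviewed first.
        print(shlex.join(argv))
```

The UUID list could come from something like `openstack server list --all-projects -f value -c ID`, with the real `rbd lock ls` output fed in per image. The sketch only prints the commands, so you can check each lock before removing anything.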

Maybe there's some more graceful way that's blessed by Ceph to do this, but this has worked for me.

-Erik

Quoting Sean Mooney <smooney@redhat.com>:

> On Thu, 2023-02-16 at 09:56 -0500, Satish Patel wrote:
>> Folks,
>>
>> I am running a small 3 node compute/controller with 3 node ceph storage in
>> my lab. Yesterday, because of a power outage all my nodes went down. After
>> reboot of all nodes ceph seems to show good health and no error (in ceph
>> -s).
>>
>> When I started using the existing VM I noticed the following errors. Seems
>> like data loss. This is a lab machine and has zero activity on the VMs, but
>> it still lost data and the file system is corrupt. Is this normal?
> if the vm/cluster hard crashes due to the power cut, yes it can.
> personally i have hit this more often with XFS than ext4 but i have
> seen it with both.
>>
>> I am not using erasure coding, does that help in this matter?
>>
>> blk_update_request: I/O error, dev sda, sector 233000 op 0x1: (WRITE) flags
>> 0x800 phys_seg 8 prio class 0
>
> you will probably need to rescue the instance and repair the
> filesystem of each vm with fsck
> or similar. so boot with rescue image -> repair filesystem -> unrescue
> -> hard reboot/start vm if needed
>
> you might be able to mitigate this somewhat by disabling disk
> caching at the qemu level but
> that will reduce performance. ceph recommends that you use
> virtio-scsi for the device model and
> writeback cache mode. we generally recommend that too however you can
> use the disk_cachemodes option to
> change that.
> https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.disk_cachemodes
>
> [libvirt]
> disk_cachemodes=file=none,block=none,network=none
>
> this corruption may also have happened on the ceph cluster side.
> they have some options that can help prevent that via journaling writes
>
> if you can afford it i would get even a small UPS to allow a
> graceful shutdown if you have future power cuts
> to avoid data loss issues.