Ceph outage causes filesystem error on VM

Sean Mooney smooney at redhat.com
Thu Feb 16 15:28:32 UTC 2023


On Thu, 2023-02-16 at 09:56 -0500, Satish Patel wrote:
> Folks,
> 
> I am running a small 3-node compute/controller with 3-node Ceph storage in
> my lab. Yesterday, because of a power outage, all my nodes went down. After
> rebooting all nodes, Ceph seems to show good health and no errors (in ceph
> -s).
> 
> When I started using the existing VMs I noticed the following errors. It
> seems like data loss. This is a lab machine with zero activity on the VMs,
> but it still lost data and the file system is corrupt. Is this normal?
If the VM/cluster hard-crashes due to the power cut, yes, it can.
Personally, I have hit this more often with XFS than ext4, but I have seen it with both.
> 
> I am not using erasure coding; does that help in this matter?
> 
> blk_update_request: I/O error, dev sda, sector 233000 op 0x1: (WRITE) flags
> 0x800 phys_seg 8 prio class 0

You will probably need to rescue the instance and repair the filesystem of each VM
with fsck or similar: boot with a rescue image -> repair the filesystem -> unrescue
-> hard reboot/start the VM if needed. A sketch of that workflow follows.
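As a rough sketch (the image and server names are placeholders, not from the
thread), assuming the OpenStack CLI and a rescue image already uploaded to Glance:

  # boot the instance from a rescue image; the original disk is attached
  # as a secondary device
  openstack server rescue --image <rescue-image> <server>

  # inside the rescue VM, find the damaged disk with lsblk, then repair it
  fsck -y /dev/sdb1    # ext4; for XFS use xfs_repair (-L only as a last resort)

  # boot back from the original disk
  openstack server unrescue <server>
  openstack server reboot --hard <server>   # if it does not come up cleanly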

You might be able to mitigate this somewhat by disabling disk caching at the QEMU
level, but that will reduce performance. Ceph recommends that you use virtio-scsi
for the device model and the writeback cache mode; we generally recommend that too.
However, you can use the disk_cachemodes option to change it:
https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.disk_cachemodes

[libvirt]
disk_cachemodes=file=none,block=none,network=none
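One way to confirm the setting took effect (a hedged note, assuming access to the
compute host): after restarting nova-compute and hard rebooting the instance so its
domain XML is regenerated, the disk driver element should show the new cache mode.

  # on the compute node; the libvirt domain name is the instance's
  # OS-EXT-SRV-ATTR:instance_name, e.g. instance-0000002a
  virsh dumpxml instance-0000002a | grep "driver name"
  # expect something like:
  #   <driver name='qemu' type='raw' cache='none'/>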

This corruption may also have happened on the Ceph cluster side.
Ceph has some options that can help prevent that via journaling writes.
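For illustration only (the pool and image names are assumptions), RBD images can
have the journaling feature enabled, and the librbd client cache can be kept in
writethrough mode until the guest issues its first flush:

  # per-image journaling (writes are recorded in a journal before the image)
  rbd feature enable vms/<image-name> journaling

  # ceph.conf, [client] section
  rbd cache = true
  rbd cache writethrough until flush = true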

If you can afford it, I would get even a small UPS to allow a graceful shutdown
during future power cuts and avoid data-loss issues. A sketch of a graceful
shutdown order is below.
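A hedged sketch of the usual shutdown order (not from the thread; adapt the names
to your deployment):

  # stop the workload first
  openstack server stop <server>     # for each running instance

  # tell Ceph not to rebalance while OSDs go away
  ceph osd set noout
  ceph osd set norebalance

  # power off compute nodes, then OSD nodes, monitors last;
  # after power returns and the cluster is healthy again:
  ceph osd unset noout
  ceph osd unset norebalance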



