Ceph outage caused filesystem error on VM

Eugen Block eblock at nde.ag
Thu Feb 16 15:43:21 UTC 2023


In addition to Sean's response, this has been asked multiple times,
e.g. [1]. You could check whether your hypervisors gave up the locks on
the RBDs or whether they are still locked (rbd status <pool>/<image>);
in that case you might need to blacklist the clients and see if that
resolves anything. Do you have regular snapshots (or backups) so you
can roll back in case of corruption?

[1] https://www.spinics.net/lists/ceph-users/msg45937.html
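
For example, a minimal check (the pool/image name and client address
below are just placeholders; on Ceph releases older than Octopus the
command is "ceph osd blacklist add"):

  # see whether a hypervisor still holds a watcher/lock on the image
  rbd status vms/1234_disk
  rbd lock ls vms/1234_disk

  # if a stale client is listed, blocklist it so the lock can be broken
  ceph osd blocklist add 192.168.1.10:0/3214152889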


Quoting Sean Mooney <smooney at redhat.com>:

> On Thu, 2023-02-16 at 09:56 -0500, Satish Patel wrote:
>> Folks,
>>
>> I am running a small 3-node compute/controller with 3-node Ceph storage in
>> my lab. Yesterday, because of a power outage, all my nodes went down. After
>> rebooting all nodes, Ceph seems to show good health and no errors (in ceph
>> -s).
>>
>> When I started using the existing VM I noticed the following errors. It
>> seems like data loss. This is a lab machine with zero activity on the VMs,
>> but it still loses data and the file system is corrupt. Is this normal?
> If the VM/cluster hard crashes due to the power cut, yes it can.
> Personally I have hit this more often with XFS than ext4, but I have
> seen it with both.
>>
>> I am not using erasure coding; does that help in this matter?
>>
>> blk_update_request: I/O error, dev sda, sector 233000 op 0x1: (WRITE) flags
>> 0x800 phys_seg 8 prio class 0
>
> You will probably need to rescue the instance and repair the
> filesystem of each VM with fsck or similar: boot with a rescue image
> -> repair the filesystem -> unrescue -> hard reboot/start the VM if
> needed.
>
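A rough sketch of that workflow with the OpenStack CLI (the instance
name, rescue image and device names are placeholders; verify the actual
device layout inside the rescue VM first):

  # boot the instance from a known-good rescue image
  openstack server rescue --image rescue-image myvm

  # inside the rescue VM: find the original root disk and repair it
  lsblk
  fsck -y /dev/sdb1

  # leave rescue mode and restart the instance
  openstack server unrescue myvm
  openstack server reboot --hard myvm
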
> You might be able to mitigate this somewhat by disabling disk
> caching at the QEMU level, but that will reduce performance. Ceph
> recommends that you use virtio-scsi for the device model and
> writeback cache mode. We generally recommend that too; however, you
> can use the disk_cachemodes option to change that:
> https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.disk_cachemodes
>
> [libvirt]
> disk_cachemodes=file=none,block=none,network=none
>
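To verify what a running guest actually ended up with, the cache mode
is visible in the libvirt domain XML (the instance name below is a
placeholder):

  # inspect the disk driver settings of a running instance
  virsh dumpxml instance-00000003 | grep -i 'driver name'
  # e.g. <driver name='qemu' type='raw' cache='none' .../>
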
> This corruption may also have happened on the Ceph cluster side;
> Ceph has some options that can help prevent that via journaling
> writes.
>
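If you want to look for cluster-side inconsistencies, one option
(beyond what Sean mentions) is to run deep scrubs and check the health
output; the PG id below is a placeholder:

  # kick off deep scrubs and watch for inconsistent PGs
  ceph osd deep-scrub all
  ceph health detail

  # repair a PG that was reported inconsistent
  ceph pg repair 2.1f
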
> If you can afford it, I would get even a small UPS to allow a
> graceful shutdown during future power cuts and avoid data-loss
> issues.





