Do you think this is my issue? https://bugs.launchpad.net/ceph/+bug/1968369

On Thu, Feb 16, 2023 at 11:05 PM Satish Patel <satish.txt@gmail.com> wrote:
Hi Eugen,
This is what I did, let me know if I missed anything.
root@ceph1:~# ceph osd blacklist ls
192.168.3.12:0/0 2023-02-17T04:48:54.381763+0000
192.168.3.22:0/753370860 2023-02-17T04:47:08.185434+0000
192.168.3.22:0/2833179066 2023-02-17T04:47:08.185434+0000
192.168.3.22:0/1812968936 2023-02-17T04:47:08.185434+0000
192.168.3.22:6824/2057987683 2023-02-17T04:47:08.185434+0000
192.168.3.21:0/2756666482 2023-02-17T05:16:23.939511+0000
192.168.3.21:0/1646520197 2023-02-17T05:16:23.939511+0000
192.168.3.22:6825/2057987683 2023-02-17T04:47:08.185434+0000
192.168.3.21:0/526748613 2023-02-17T05:16:23.939511+0000
192.168.3.21:6815/2454821797 2023-02-17T05:16:23.939511+0000
192.168.3.22:0/288537807 2023-02-17T04:47:08.185434+0000
192.168.3.21:0/4161448504 2023-02-17T05:16:23.939511+0000
192.168.3.21:6824/2454821797 2023-02-17T05:16:23.939511+0000
listed 13 entries
root@ceph1:~# rbd lock list --image 55dbf40b-0a6a-4bab-b3a5-b4bb74e963af_disk -p vms
There is 1 exclusive lock on this image.
Locker         ID                    Address
client.268212  auto 139971105131968  192.168.3.12:0/1649312807
root@ceph1:~# ceph osd blacklist rm 192.168.3.12:0/1649312807
192.168.3.12:0/1649312807 isn't blocklisted
How do I clear a lock?
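For the archives: the usual workaround for a stale exclusive lock is not to touch the blacklist but to remove the lock itself so the client can reacquire it. A minimal sketch, printed as a dry run (pool and image are from the output above; the lock ID and locker must be copied verbatim from the `rbd lock list` output):

```shell
# Dry-run sketch: prints the commands to inspect and break a stale RBD lock.
# LOCK_ID and LOCKER are copied from the "rbd lock list" output above.
POOL=vms
IMAGE=55dbf40b-0a6a-4bab-b3a5-b4bb74e963af_disk
LOCKER=client.268212
LOCK_ID="auto 139971105131968"

echo "rbd lock ls ${POOL}/${IMAGE}"
echo "rbd lock rm ${POOL}/${IMAGE} '${LOCK_ID}' ${LOCKER}"
```

Remove the `echo`s to actually run the commands against the cluster.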
On Thu, Feb 16, 2023 at 10:45 AM Eugen Block <eblock@nde.ag> wrote:
In addition to Sean's response: this has been asked multiple times, e.g. [1]. You could check whether your hypervisors gave up the locks on the RBDs or whether the images are still locked (rbd status <pool>/<image>); in that case you might need to blacklist the clients and see if that resolves anything. Do you have regular snapshots (or backups) so you can roll back in case of a corruption?
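Eugen's checks, spelled out as commands in a dry-run sketch (pool, image and client address are taken from the thread above; note that newer Ceph releases use the "blocklist" spelling while older ones use "blacklist"):

```shell
# Dry-run sketch: prints the commands to check watchers and evict a stale client.
POOL=vms
IMAGE=55dbf40b-0a6a-4bab-b3a5-b4bb74e963af_disk
CLIENT_ADDR=192.168.3.12:0/1649312807    # from the "rbd lock list" output

echo "rbd status ${POOL}/${IMAGE}"              # still-attached watchers show up here
echo "ceph osd blocklist add ${CLIENT_ADDR}"    # "blacklist" on older releases
```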
[1] https://www.spinics.net/lists/ceph-users/msg45937.html
Zitat von Sean Mooney <smooney@redhat.com>:
On Thu, 2023-02-16 at 09:56 -0500, Satish Patel wrote:
> Folks,
>
> I am running a small 3-node compute/controller cluster with 3-node Ceph
> storage in my lab. Yesterday all my nodes went down because of a power
> outage. After rebooting all nodes, Ceph seems to show good health and
> no errors (in ceph -s).
>
> When I started using the existing VMs I noticed the following errors.
> It looks like data loss. This is a lab machine with essentially zero
> activity on the VMs, but it still lost data and the filesystems are
> corrupt. Is this normal?

If the VM/cluster hard-crashes due to the power cut, then yes, it can
happen. Personally I have hit this more often with XFS than ext4, but I
have seen it with both.

> I am not using erasure coding, does that help in this matter?
> blk_update_request: I/O error, dev sda, sector 233000 op 0x1: (WRITE)
> flags 0x800 phys_seg 8 prio class 0
You will probably need to rescue the instance and repair the filesystem
of each VM with fsck or similar: boot with a rescue image -> repair the
filesystem -> unrescue -> hard reboot/start the VM if needed.
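A sketch of that workflow with the OpenStack CLI, printed as a dry run (the server name and rescue image are placeholders; recent python-openstackclient versions accept `--image` on `server rescue`):

```shell
# Dry-run sketch: prints the rescue -> fsck -> unrescue -> reboot sequence.
SERVER=my-vm                # placeholder instance name
RESCUE_IMAGE=rescue-image   # placeholder; any image that ships fsck tools

echo "openstack server rescue --image ${RESCUE_IMAGE} ${SERVER}"
# ssh into the rescue instance; the original disk appears as a secondary
# device there (check lsblk first, the name below is just an example):
echo "fsck -fy /dev/sdb1"
echo "openstack server unrescue ${SERVER}"
echo "openstack server reboot --hard ${SERVER}"
```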
You might be able to mitigate this somewhat by disabling disk caching at
the QEMU level, but that will reduce performance. Ceph recommends that
you use the virtio-scsi device model and writeback cache mode, and we
generally recommend that too; however, you can use the disk_cachemodes
option to change it:
https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.dis...
[libvirt]
disk_cachemodes = file=none,block=none,network=none
This corruption may also have happened on the Ceph cluster side; there
are some options that can help prevent that via journaling writes.
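I'm not sure exactly which options Sean has in mind here, but one client-side knob in this area (an assumption on my part, not something the thread confirms) is keeping RBD caching in writethrough mode until the guest issues its first flush, e.g. in ceph.conf:

```ini
[client]
rbd cache = true
; stay in writethrough mode until the guest OS sends a flush, so guests
; that never send barriers don't lose acknowledged writes on a crash
rbd cache writethrough until flush = true
```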
If you can afford it, I would get even a small UPS to allow a graceful
shutdown during future power cuts and avoid data-loss issues.