Ceph outage cause filesystem error on VM
Satish Patel
satish.txt at gmail.com
Fri Feb 17 04:05:35 UTC 2023
Hi Eugen,
This is what I did, let me know if I missed anything.
root@ceph1:~# ceph osd blacklist ls
192.168.3.12:0/0 2023-02-17T04:48:54.381763+0000
192.168.3.22:0/753370860 2023-02-17T04:47:08.185434+0000
192.168.3.22:0/2833179066 2023-02-17T04:47:08.185434+0000
192.168.3.22:0/1812968936 2023-02-17T04:47:08.185434+0000
192.168.3.22:6824/2057987683 2023-02-17T04:47:08.185434+0000
192.168.3.21:0/2756666482 2023-02-17T05:16:23.939511+0000
192.168.3.21:0/1646520197 2023-02-17T05:16:23.939511+0000
192.168.3.22:6825/2057987683 2023-02-17T04:47:08.185434+0000
192.168.3.21:0/526748613 2023-02-17T05:16:23.939511+0000
192.168.3.21:6815/2454821797 2023-02-17T05:16:23.939511+0000
192.168.3.22:0/288537807 2023-02-17T04:47:08.185434+0000
192.168.3.21:0/4161448504 2023-02-17T05:16:23.939511+0000
192.168.3.21:6824/2454821797 2023-02-17T05:16:23.939511+0000
listed 13 entries
root@ceph1:~# rbd lock list --image 55dbf40b-0a6a-4bab-b3a5-b4bb74e963af_disk -p vms
There is 1 exclusive lock on this image.
Locker         ID                    Address
client.268212  auto 139971105131968  192.168.3.12:0/1649312807
root@ceph1:~# ceph osd blacklist rm 192.168.3.12:0/1649312807
192.168.3.12:0/1649312807 isn't blocklisted
How do I create a lock?
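
For reference, the `rm` above appears to fail because the lock holder's address (nonce 1649312807) is not in the list as an exact `ip:port/nonce` string; the only entry for that IP is `192.168.3.12:0/0`. Below is a minimal Python sketch of that comparison. The helper names (`parse_blocklist`, `split_addr`, `check_locker`) are made up for illustration and are not part of any Ceph tooling; the sample text is a trimmed copy of the output above.

```python
# Hypothetical helpers; they only parse text copied from the commands above.
SAMPLE = """\
192.168.3.12:0/0 2023-02-17T04:48:54.381763+0000
192.168.3.22:0/753370860 2023-02-17T04:47:08.185434+0000
192.168.3.21:6815/2454821797 2023-02-17T05:16:23.939511+0000
listed 13 entries
"""


def parse_blocklist(text):
    """Return the address column from `ceph osd blacklist ls` output."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        # Entry lines look like "<ip>:<port>/<nonce> <expiry>"; the
        # trailing "listed N entries" summary has three tokens and is skipped.
        if len(parts) == 2 and ":" in parts[0] and "/" in parts[0]:
            entries.append(parts[0])
    return entries


def split_addr(addr):
    """Split "ip:port/nonce" into (ip, port, nonce) strings."""
    host, nonce = addr.rsplit("/", 1)
    ip, port = host.rsplit(":", 1)
    return ip, port, nonce


def check_locker(locker_addr, entries):
    """Return (exact_match, entries_sharing_the_locker_ip)."""
    exact = locker_addr in entries
    ip = split_addr(locker_addr)[0]
    same_ip = [e for e in entries if split_addr(e)[0] == ip]
    return exact, same_ip


if __name__ == "__main__":
    locker = "192.168.3.12:0/1649312807"  # Address column of `rbd lock list`
    exact, same_ip = check_locker(locker, parse_blocklist(SAMPLE))
    print("exact entry present:", exact)
    print("entries for the same IP:", same_ip)
```

This only explains the "isn't blocklisted" message; it does not change any lock or blocklist state on the cluster.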
On Thu, Feb 16, 2023 at 10:45 AM Eugen Block <eblock at nde.ag> wrote:
> In addition to Sean's response, this has been asked multiple times,
> e.g. [1]. You could check if your hypervisors gave up the lock on the
> RBDs or if they are still locked (rbd status <pool>/<image>); in that
> case you might need to blacklist the clients and see if that resolves
> anything. Do you have regular snapshots (or backups) to be able to
> roll back in case of corruption?
>
> [1] https://www.spinics.net/lists/ceph-users/msg45937.html
>
>
> Zitat von Sean Mooney <smooney at redhat.com>:
>
> > On Thu, 2023-02-16 at 09:56 -0500, Satish Patel wrote:
> >> Folks,
> >>
> >> I am running a small 3 node compute/controller with 3 node ceph storage
> >> in my lab. Yesterday, because of a power outage all my nodes went down.
> >> After reboot of all nodes ceph seems to show good health and no error
> >> (in ceph -s).
> >>
> >> When I started using the existing VM I noticed the following errors.
> >> Seems like data loss. This is a lab machine and has zero activity on
> >> vms but it still loses data and the file system is corrupt. Is this
> >> normal?
> > if the vm/cluster hard crashes due to the power cut, yes it can.
> > personally i have hit this more often with XFS than ext4 but i have
> > seen it with both.
> >>
> >> I am not using erasure coding, does that help in this matter?
> >>
> >> blk_update_request: I/O error, dev sda, sector 233000 op 0x1: (WRITE) flags 0x800 phys_seg 8 prio class 0
> >
> > you will probably need to rescue the instance and repair the
> > filesystem of each vm with fsck or similar. so boot with rescue
> > image -> repair filesystem -> unrescue -> hard reboot/start vm
> > if needed
> >
> > you might be able to mitigate this somewhat by disabling disk
> > caching at the qemu level but that will reduce performance. ceph
> > recommends that you use virtio-scsi for the device model and
> > writeback cache mode. we generally recommend that too, however you
> > can use the disk_cachemodes option to change that.
> >
> > https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.disk_cachemodes
> >
> > [libvirt]
> > disk_cachemodes=file=none,block=none,network=none
> >
> > this corruption may also have happened on the ceph cluster side.
> > they have some options that can help prevent that via journaling writes
> >
> > if you can afford it i would get even a small UPS to allow a
> > graceful shutdown if you have future power cuts,
> > to avoid data loss issues.
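
As a rough illustration of the `disk_cachemodes` format quoted above (comma-separated `disk_type=cache_mode` pairs), here is a small Python sketch that splits the value and rejects unknown names. The function name `parse_disk_cachemodes` is hypothetical, and the two sets of allowed names are my reading of the nova configuration reference linked in the thread, so treat them as assumptions and check them against your release.

```python
# Assumed from the nova [libvirt]/disk_cachemodes documentation; verify locally.
VALID_DISK_TYPES = {"file", "block", "network", "mount"}
VALID_CACHE_MODES = {
    "default", "none", "writethrough", "writeback", "directsync", "unsafe",
}


def parse_disk_cachemodes(value):
    """Parse "file=none,block=none" into {"file": "none", "block": "none"}.

    Hypothetical helper for illustration; nova does its own parsing.
    """
    modes = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate stray commas
        disk_type, sep, cache_mode = item.partition("=")
        if not sep:
            raise ValueError(f"expected disk_type=cache_mode, got {item!r}")
        if disk_type not in VALID_DISK_TYPES:
            raise ValueError(f"unknown disk type {disk_type!r}")
        if cache_mode not in VALID_CACHE_MODES:
            raise ValueError(f"unknown cache mode {cache_mode!r}")
        modes[disk_type] = cache_mode
    return modes


if __name__ == "__main__":
    # The value from the [libvirt] snippet quoted above.
    print(parse_disk_cachemodes("file=none,block=none,network=none"))
```

For RBD-backed instances the relevant key is `network`; `network=none` disables caching for those disks, and `network=writeback` is the mode Sean mentions as the usual recommendation.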