[Openstack-operators] [ceph-users] After power outage, nearly all vm volumes corrupted and unmountable

Gary Molenkamp molenkam at uwo.ca
Fri Jul 6 13:17:20 UTC 2018


Thank you, Jason. Not sure how I missed that step.


On 2018-07-06 08:34 AM, Jason Dillaman wrote:
> There have been several similar reports on the mailing list about this 
> [1][2][3][4] that are always a result of skipping step 6 from the 
> Luminous upgrade guide [5]. The new (starting Luminous) 'profile 
> rbd'-style caps are designed to try to simplify caps going forward [6].
>
> TL;DR: your Openstack CephX users need to have permission to blacklist 
> dead clients that failed to properly release the exclusive lock.
>
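> For example, assuming the usual client.cinder and client.glance users and
> the default volumes, vms and images pools (adjust the names to match your
> deployment), moving an existing user to the new-style caps from [6] looks
> roughly like this:
>
>     # see what caps the user currently has
>     ceph auth get client.cinder
>
>     # switch to 'profile rbd' caps, which include blacklist permission
>     ceph auth caps client.cinder mon 'profile rbd' osd 'profile rbd pool=volumes, profile rbd pool=vms, profile rbd pool=images'
>     ceph auth caps client.glance mon 'profile rbd' osd 'profile rbd pool=images'
>
> Once the caps are updated, a restarted VM (or a re-attached volume) should
> be able to blacklist the dead lock owner and take over the exclusive lock
> instead of failing every write.
>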
> [1] 
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-November/022278.html
> [2] 
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-November/022694.html
> [3] 
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026496.html
> [4] https://www.spinics.net/lists/ceph-users/msg45665.html
> [5] 
> http://docs.ceph.com/docs/master/releases/luminous/#upgrade-from-jewel-or-kraken
> [6] 
> http://docs.ceph.com/docs/luminous/rbd/rbd-openstack/#setup-ceph-client-authentication
>
>
> On Fri, Jul 6, 2018 at 7:55 AM Gary Molenkamp <molenkam at uwo.ca> wrote:
>
>     Good morning all,
>
>     After losing all power to our DC last night due to a storm, nearly all
>     of the volumes in our Pike cluster are unmountable.  Of the 30 VMs in
>     use at the time, only one has been able to successfully mount and boot
>     from its rootfs.  We are using Ceph as the backend storage for cinder
>     and glance.  Any help or pointers to bring this back online would be
>     appreciated.
>
>       What most of the volumes are seeing is
>
>     [    2.622252] SGI XFS with ACLs, security attributes, no debug enabled
>     [    2.629285] XFS (sda1): Mounting V5 Filesystem
>     [    2.832223] sd 2:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
>     [    2.838412] sd 2:0:0:0: [sda] Sense Key : Aborted Command [current]
>     [    2.842383] sd 2:0:0:0: [sda] Add. Sense: I/O process terminated
>     [    2.846152] sd 2:0:0:0: [sda] CDB: Write(10) 2a 00 00 80 2c 19 00 04 00 00
>     [    2.850146] blk_update_request: I/O error, dev sda, sector 8399897
>
>     or
>
>     [    2.590178] EXT4-fs (vda1): INFO: recovery required on readonly filesystem
>     [    2.594319] EXT4-fs (vda1): write access will be enabled during recovery
>     [    2.957742] print_req_error: I/O error, dev vda, sector 227328
>     [    2.962468] Buffer I/O error on dev vda1, logical block 0, lost async page write
>     [    2.967933] Buffer I/O error on dev vda1, logical block 1, lost async page write
>     [    2.973076] print_req_error: I/O error, dev vda, sector 229384
>
>     As a test for one of the less critical VMs, I deleted the VM and
>     mounted its volume on the one VM I managed to start.  The results
>     were not promising:
>
>
>     # dmesg |tail
>     [    5.136862] type=1305 audit(1530847244.811:4): audit_pid=496 old=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:auditd_t:s0 res=1
>     [    7.726331] nf_conntrack version 0.5.0 (65536 buckets, 262144 max)
>     [29374.967315] scsi 2:0:0:1: Direct-Access     QEMU     QEMU HARDDISK    2.5+ PQ: 0 ANSI: 5
>     [29374.988104] sd 2:0:0:1: [sdb] 83886080 512-byte logical blocks: (42.9 GB/40.0 GiB)
>     [29374.991126] sd 2:0:0:1: Attached scsi generic sg1 type 0
>     [29374.995302] sd 2:0:0:1: [sdb] Write Protect is off
>     [29374.997109] sd 2:0:0:1: [sdb] Mode Sense: 63 00 00 08
>     [29374.997186] sd 2:0:0:1: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
>     [29375.005968]  sdb: sdb1
>     [29375.007746] sd 2:0:0:1: [sdb] Attached SCSI disk
>
>     # parted /dev/sdb
>     GNU Parted 3.1
>     Using /dev/sdb
>     Welcome to GNU Parted! Type 'help' to view a list of commands.
>     (parted) p
>     Model: QEMU QEMU HARDDISK (scsi)
>     Disk /dev/sdb: 42.9GB
>     Sector size (logical/physical): 512B/512B
>     Partition Table: msdos
>     Disk Flags:
>
>     Number  Start   End     Size    Type     File system  Flags
>       1      1049kB  42.9GB  42.9GB  primary  xfs          boot
>
>     # mount -t xfs /dev/sdb temp
>     mount: wrong fs type, bad option, bad superblock on /dev/sdb,
>             missing codepage or helper program, or other error
>
>             In some cases useful info is found in syslog - try
>             dmesg | tail or so.
>
>     # xfs_repair /dev/sdb
>     Phase 1 - find and verify superblock...
>     bad primary superblock - bad magic number !!!
>
>     attempting to find secondary superblock...
>
>
>
>     Which eventually fails.  The ceph cluster looks healthy, and I can
>     export the volumes from rbd.  I can find no other errors in ceph or
>     openstack indicating a fault in either system.
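>
>     (For reference, "export" here just means a plain rbd export, along
>     the lines of
>
>         rbd export volumes/volume-<uuid> /backup/volume-<uuid>.raw
>
>     with placeholder pool and image names.)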
>
>          - Is this recoverable?
>
>          - What happened to all of these volumes, and can this be prevented
>     from occurring again?  Note that any VM that was shut down at the time
>     of the outage appears to be fine.
>
>
>     Relevant versions:
>
>          Base OS:  all CentOS 7.5
>
>          Ceph:  Luminous 12.2.5-0
>
>          Openstack:  Latest Pike releases in centos-release-openstack-pike-1-1
>
>              nova 16.1.4-1
>
>              cinder  11.1.1-1
>
>
>
>
> -- 
> Jason

-- 
Gary Molenkamp			Computer Science/Science Technology Services
Systems Administrator		University of Western Ontario
molenkam at uwo.ca                 http://www.csd.uwo.ca
(519) 661-2111 x86882		(519) 661-3566
