[Openstack-operators] Ceph recovery going unusually slow

Saverio Proto zioproto at gmail.com
Fri Jun 2 11:55:40 UTC 2017


Usually 'ceph health detail' gives better info on what is making
everything stuck.
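
For example, something along these lines (from memory, double check the
exact syntax on your release) usually shows which PGs are stuck and what
they are waiting on:

    ceph health detail
    ceph pg dump_stuck inactive
    ceph pg <pgid> query     # look at the recovery_state section

On a stuck PG, the query output normally tells you whether peering is
blocked by a particular OSD.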

Saverio

2017-06-02 13:51 GMT+02:00 Grant Morley <grantmorley1985 at gmail.com>:
> Hi All,
>
> I wonder if anyone could help at all.
>
> We were doing some routine maintenance on our ceph cluster and, after running
> "service ceph-all restart" on one of our nodes, we noticed that something
> wasn't quite right. The cluster has gone into an error state, we have
> multiple stuck PGs, and the recovery of misplaced objects is taking an
> unusually long time. At first about 46% of objects were misplaced; we are now
> down to roughly 16%.
>
> However, the recovery has taken about 36 hours so far and, with a possible 16%
> still to go, we are looking at a fairly major issue. A lot of the cluster is
> now blocked for reads/writes, so customers cannot access their VMs.
>
> I think the main issue at the moment is that we have 210 PGs stuck inactive
> and nothing we do seems to get them to peer.
>
> Below is the output of "ceph status". Can anyone help, or have any ideas on
> how to speed up the recovery process? We have tried turning down logging on
> the OSDs, but some are going so slowly they won't let us injectargs into
> them.
>
> health HEALTH_ERR
>             210 pgs are stuck inactive for more than 300 seconds
>             298 pgs backfill_wait
>             3 pgs backfilling
>             1 pgs degraded
>             200 pgs peering
>             1 pgs recovery_wait
>             1 pgs stuck degraded
>             210 pgs stuck inactive
>             512 pgs stuck unclean
>             3310 requests are blocked > 32 sec
>             recovery 2/11094405 objects degraded (0.000%)
>             recovery 1785063/11094405 objects misplaced (16.090%)
>             nodown,noout,noscrub,nodeep-scrub flag(s) set
>
>             election epoch 16314, quorum 0,1,2,3,4,5,6,7,8
> storage-1,storage-2,storage-3,storage-4,storage-5,storage-6,storage-7,storage-8,storage-9
>      osdmap e213164: 54 osds: 54 up, 54 in; 329 remapped pgs
>             flags nodown,noout,noscrub,nodeep-scrub
>       pgmap v41030942: 2036 pgs, 14 pools, 14183 GB data, 3309 kobjects
>             43356 GB used, 47141 GB / 90498 GB avail
>             2/11094405 objects degraded (0.000%)
>             1785063/11094405 objects misplaced (16.090%)
>                 1524 active+clean
>                  298 active+remapped+wait_backfill
>                  153 peering
>                   47 remapped+peering
>                   10 inactive
>                    3 active+remapped+backfilling
>                    1 active+recovery_wait+degraded+remapped
>
> Many thanks,
>
> Grant
>
>
>
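
As for speeding up the recovery itself, the usual knobs are the backfill and
recovery settings, injected on the fly, roughly like this (again from memory,
and the values are only an example, so check the option names and pick values
that suit your hardware):

    ceph tell osd.* injectargs '--osd_max_backfills 4 --osd_recovery_max_active 8'
    ceph tell osd.* injectargs '--debug_osd 0 --debug_ms 0'

That only helps the PGs that are actually backfilling, though; the ones stuck
in peering will not move until whatever is blocking the peering is resolved.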


