[Openstack-operators] Ceph recovery going unusually slow

George Mihaiescu lmihaiescu at gmail.com
Fri Jun 2 12:05:48 UTC 2017


Having 9 ceph-mon servers doesn't help...

I would look at the stuck PGs in order to find the common OSDs and focus on them.
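For example, something along these lines will list the stuck PGs and the OSDs they map to (the pg id 3.1a7 is just a placeholder, use one from your own output; the exact query fields vary a little between releases):

    # list the stuck PGs together with their up/acting OSD sets
    ceph pg dump_stuck inactive
    ceph pg dump_stuck unclean

    # for one of the listed PGs, show its OSDs and anything blocking peering
    ceph pg 3.1a7 query | grep -E '"up"|"acting"|blocked'

If the same few OSDs keep appearing in the acting sets of the stuck PGs, those are the ones to focus on.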

Their logs will probably have details on where the problem is.
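On the hosts that own those OSDs the logs are in /var/log/ceph/ by default, so something like this (osd.12 is just an example id) is a quick way to spot slow requests or peering trouble:

    # look for slow requests / peering problems on a suspect OSD
    grep -iE 'slow request|peering|heartbeat' /var/log/ceph/ceph-osd.12.log | tail -50

    # if injectargs over the network isn't getting through, the local admin
    # socket on the OSD's host usually still responds
    ceph daemon osd.12 config set debug_osd 0/0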

> On Jun 2, 2017, at 07:51, Grant Morley <grantmorley1985 at gmail.com> wrote:
> 
> Hi All,
> 
> I wonder if anyone could help at all.
> 
> We were doing some routine maintenance on our Ceph cluster, and after running a "service ceph-all restart" on one of our nodes we noticed that something wasn't quite right. The cluster has gone into an error state, we have multiple stuck PGs, and the object replacement recovery is taking an unusually long time. At first about 46% of objects were misplaced; we are now down to roughly 16%.
> 
> However, it has taken about 36 hours to do the recovery so far, and with a possible 16% still to go we are looking at a fairly major issue. As a lot of the system is now blocked for reads/writes, customers cannot access their VMs.
> 
> I think the main issue at the moment is that we have 210 PGs stuck inactive and nothing we try seems to get them to peer.
> 
> Below is the output of ceph status. Can anyone help, or does anyone have any ideas on how to speed up the recovery process? We have tried turning down logging on the OSDs, but some are going so slowly that they won't let us injectargs into them.
> 
> health HEALTH_ERR
>             210 pgs are stuck inactive for more than 300 seconds
>             298 pgs backfill_wait
>             3 pgs backfilling
>             1 pgs degraded
>             200 pgs peering
>             1 pgs recovery_wait
>             1 pgs stuck degraded
>             210 pgs stuck inactive
>             512 pgs stuck unclean
>             3310 requests are blocked > 32 sec
>             recovery 2/11094405 objects degraded (0.000%)
>             recovery 1785063/11094405 objects misplaced (16.090%)
>             nodown,noout,noscrub,nodeep-scrub flag(s) set
> 
>             election epoch 16314, quorum 0,1,2,3,4,5,6,7,8 storage-1,storage-2,storage-3,storage-4,storage-5,storage-6,storage-7,storage-8,storage-9
>      osdmap e213164: 54 osds: 54 up, 54 in; 329 remapped pgs
>             flags nodown,noout,noscrub,nodeep-scrub
>       pgmap v41030942: 2036 pgs, 14 pools, 14183 GB data, 3309 kobjects
>             43356 GB used, 47141 GB / 90498 GB avail
>             2/11094405 objects degraded (0.000%)
>             1785063/11094405 objects misplaced (16.090%)
>                 1524 active+clean
>                  298 active+remapped+wait_backfill
>                  153 peering
>                   47 remapped+peering
>                   10 inactive
>                    3 active+remapped+backfilling
>                    1 active+recovery_wait+degraded+remapped
> 
> Many thanks,
> 
> Grant
> 
> 
> _______________________________________________
> OpenStack-operators mailing list
> OpenStack-operators at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


