[Openstack-operators] Ceph recovery going unusually slow
Grant Morley
grantmorley1985 at gmail.com
Fri Jun 2 11:51:28 UTC 2017
Hi All,
I wonder if anyone could help at all.
We were doing some routine maintenance on our Ceph cluster and, after
running a "service ceph-all restart" on one of our nodes, we noticed that
something wasn't quite right. The cluster has gone into an error state, we
have multiple stuck PGs, and the recovery of misplaced objects is taking an
unusually long time. At first about 46% of objects were misplaced; we are
now down to roughly 16%.
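For context, this is roughly how we have been watching the recovery tick
along (standard ceph CLI, nothing cluster-specific assumed):

    # overall cluster state, including degraded/misplaced counters
    ceph -s
    # per-problem breakdown of the health warnings and errors
    ceph health detail
    # follow the cluster log / recovery progress live
    ceph -w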
However, it has taken about 36 hours to get this far, and with a possible
16% still to go we are looking at a fairly major issue. Because a lot of the
cluster is now blocked for reads/writes, customers cannot access their VMs.
I think the main issue at the moment is that we have 210 PGs stuck inactive
and nothing we do seems to get them to peer.
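In case it is useful, this is more or less how we have been poking at the
stuck PGs (the PG ID below is just a made-up example, not one of ours):

    # list PGs stuck inactive and how long they have been stuck
    ceph pg dump_stuck inactive
    # the unclean ones are worth dumping as well
    ceph pg dump_stuck unclean
    # query an individual stuck PG to see where peering is hanging
    # (2.1a7 is a placeholder PG ID)
    ceph pg 2.1a7 query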
Below is the output of ceph status. Can anyone help, or does anyone have
ideas on how to speed up the recovery process? We have tried turning down
logging on the OSDs, but some are going so slowly that they won't let us
injectargs into them.
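For what it's worth, these are the sort of injectargs we have been
attempting; the OSD ID and values below are only illustrative examples, not
a recommendation:

    # cluster-wide over the network (this is what the slow OSDs won't accept)
    ceph tell osd.* injectargs '--debug_osd 0/0 --debug_ms 0/0'
    ceph tell osd.* injectargs '--osd_max_backfills 2 --osd_recovery_max_active 4'
    # fallback: talk to a stuck OSD via its local admin socket on the storage node
    # (osd.12 is a placeholder OSD ID)
    ceph daemon osd.12 config set debug_osd 0/0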
     health HEALTH_ERR
            210 pgs are stuck inactive for more than 300 seconds
            298 pgs backfill_wait
            3 pgs backfilling
            1 pgs degraded
            200 pgs peering
            1 pgs recovery_wait
            1 pgs stuck degraded
            210 pgs stuck inactive
            512 pgs stuck unclean
            3310 requests are blocked > 32 sec
            recovery 2/11094405 objects degraded (0.000%)
            recovery 1785063/11094405 objects misplaced (16.090%)
            nodown,noout,noscrub,nodeep-scrub flag(s) set
            election epoch 16314, quorum 0,1,2,3,4,5,6,7,8 storage-1,storage-2,storage-3,storage-4,storage-5,storage-6,storage-7,storage-8,storage-9
     osdmap e213164: 54 osds: 54 up, 54 in; 329 remapped pgs
            flags nodown,noout,noscrub,nodeep-scrub
      pgmap v41030942: 2036 pgs, 14 pools, 14183 GB data, 3309 kobjects
            43356 GB used, 47141 GB / 90498 GB avail
            2/11094405 objects degraded (0.000%)
            1785063/11094405 objects misplaced (16.090%)
                1524 active+clean
                 298 active+remapped+wait_backfill
                 153 peering
                  47 remapped+peering
                  10 inactive
                   3 active+remapped+backfilling
                   1 active+recovery_wait+degraded+remapped
Many thanks,
Grant