[Openstack-operators] Ceph recovery going unusually slow

Grant Morley grantmorley1985 at gmail.com
Fri Jun 2 11:51:28 UTC 2017


Hi All,

I wonder if anyone could help at all.

We were doing some routine maintenance on our Ceph cluster, and after
running a "service ceph-all restart" on one of our nodes we noticed that
something wasn't quite right. The cluster has gone into an error state,
we have multiple stuck PGs, and the object recovery/backfill is taking an
unusually long time. At first around 46% of objects were misplaced; we
are now down to roughly 16%.

However, it has taken about 36 hours to get this far, and with roughly
16% still to go we are looking at a fairly major issue. Because a lot of
the cluster is now blocked for reads/writes, customers cannot access
their VMs.
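
(Very roughly: we have cleared about 30 percentage points of misplaced
objects in 36 hours, so at the current rate the remaining ~16% would take
something like another 19-20 hours, assuming it doesn't slow down further.)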

I think the main issue at the moment is that we have 210 PGs stuck
inactive, and nothing we have tried so far will get them to peer.
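
For reference, we have been poking at the stuck PGs with the usual
commands, roughly along these lines (the PG ID below is just a
placeholder, not one of our real PGs):

    # list the PGs that are stuck inactive
    ceph pg dump_stuck inactive

    # query one of the stuck PGs to see what is blocking peering
    ceph pg 3.1f query

    # show blocked requests and which OSDs they are waiting on
    ceph health detail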

Below is the output of ceph status. Can anyone help, or does anyone have
ideas on how to speed up the recovery process? We have tried turning down
logging on the OSDs, but some of them are responding so slowly that they
won't accept injectargs at all.
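
For what it's worth, the injectargs calls we have been attempting look
roughly like this (the exact values below are illustrative, not a
recommendation):

    # turn down OSD logging to cut overhead during recovery
    ceph tell osd.* injectargs '--debug_osd 0/0 --debug_ms 0/0'

    # adjust per-OSD backfill/recovery limits (example values only)
    ceph tell osd.* injectargs '--osd_max_backfills 2 --osd_recovery_max_active 2'

As mentioned, on the slowest OSDs these commands are not getting through
at all.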

health HEALTH_ERR
            210 pgs are stuck inactive for more than 300 seconds
            298 pgs backfill_wait
            3 pgs backfilling
            1 pgs degraded
            200 pgs peering
            1 pgs recovery_wait
            1 pgs stuck degraded
            210 pgs stuck inactive
            512 pgs stuck unclean
            3310 requests are blocked > 32 sec
            recovery 2/11094405 objects degraded (0.000%)
            recovery 1785063/11094405 objects misplaced (16.090%)
            nodown,noout,noscrub,nodeep-scrub flag(s) set

            election epoch 16314, quorum 0,1,2,3,4,5,6,7,8
storage-1,storage-2,storage-3,storage-4,storage-5,storage-6,storage-7,storage-8,storage-9
     osdmap e213164: 54 osds: 54 up, 54 in; 329 remapped pgs
            flags nodown,noout,noscrub,nodeep-scrub
      pgmap v41030942: 2036 pgs, 14 pools, 14183 GB data, 3309 kobjects
            43356 GB used, 47141 GB / 90498 GB avail
            2/11094405 objects degraded (0.000%)
            1785063/11094405 objects misplaced (16.090%)
                1524 active+clean
                 298 active+remapped+wait_backfill
                 153 peering
                  47 remapped+peering
                  10 inactive
                   3 active+remapped+backfilling
                   1 active+recovery_wait+degraded+remapped

Many thanks,

Grant