[Openstack-operators] Ceph recovery going unusually slow

Nick Jones nick.jones at datacentred.co.uk
Fri Jun 2 14:27:49 UTC 2017


You definitely have my sympathies; we encountered a similar situation a
couple of years ago and it was a very hairy ordeal indeed.  We found most
of the suggestions in this mailing list post to be extremely beneficial in
coaxing our cluster back to life:

https://www.mail-archive.com/ceph-users@lists.ceph.com/msg14170.html

In our case we were caught out early on by under-spec'd OSD nodes that
didn't have enough RAM installed.  What had actually happened was that a
few nodes had lost a significant portion of their installed memory due to
faulty DIMM slots that we hadn't spotted.  When we found ourselves in a
situation similar to yours, i.e. what should have been a routine, BAU
recovery scenario, the weight of the rebalancing operation brought these
nodes to their knees and they consistently ran out of memory.  Since then
we've stuck to the (updated!) recommendation of at least 1GB of RAM per
1TB of storage per OSD.
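
As a rough sanity check (a sketch only; the sd[b-z] pattern below is just
an example, adjust it to your own disk layout), you can compare installed
RAM against the total OSD capacity on each node:

  # Installed RAM in GB
  free -g | awk '/^Mem:/ {print $2}'

  # Total size of the OSD data disks in TB (assumes sdb..sdz are the OSD
  # devices; change the pattern to match your hardware)
  lsblk -bdno NAME,SIZE | awk '/^sd[b-z]/ {sum += $2} END {printf "%.1f\n", sum / 1e12}'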

It would also be worth taking a step back and looking at the health of the
nodes in your cluster, and double-checking metrics for things like Ceph
journal device latency.  We've been bitten in the past by cheap SSDs used
as journal drives that don't fail outright, i.e. writes slow down to
multiple seconds and the whole cluster bogs down as a result.
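
A quick way to spot a suspect device is to look at per-OSD latency straight
from the cluster and then drill into the device on the host (sketch only;
sdX below is a placeholder for your journal SSD):

  # Per-OSD commit/apply latency as reported by Ceph; anything consistently
  # up in the hundreds of milliseconds is worth a closer look
  ceph osd perf

  # On the suspect OSD's host, watch await/%util for the journal device
  iostat -x /dev/sdX 5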

Good luck.

-- 

-Nick

On 2 June 2017 at 14:09, Mike Lowe <jomlowe at iu.edu> wrote:

> A couple of things here: you have nodown and noout set, which is
> understandable based on what you were doing, but now it’s probably time to
> let Ceph do its thing, since you believe all of the OSDs are back in
> service and should stay up and in.  You may be masking a problem by having
> these set.  Do any of the problem OSDs have messages in their logs like
> “marked me down wrongly” or anything about compaction?  And once you unset
> those flags, do all your OSDs stay up and in?
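>
> Something along these lines should tell you (the grep path assumes the
> default /var/log/ceph location; adjust if yours differs):
>
>   # Clear the flags so Ceph can mark OSDs down/out on its own again
>   ceph osd unset nodown
>   ceph osd unset noout
>
>   # On each OSD host, look for OSDs complaining they were wrongly marked down
>   grep -i "marked me down" /var/log/ceph/ceph-osd.*.log
>
>   # Then keep an eye on whether everything stays up and in
>   watch ceph osd stat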
>
> On Jun 2, 2017, at 8:00 AM, Grant Morley <grantmorley1985 at gmail.com>
> wrote:
>
> HEALTH_ERR 210 pgs are stuck inactive for more than 300 seconds; 296 pgs
> backfill_wait; 3 pgs backfilling; 1 pgs degraded; 202 pgs peering; 1 pgs
> recovery_wait; 1 pgs stuck degraded; 210 pgs stuck inactive; 510 pgs stuck
> unclean; 3308 requests are blocked > 32 sec; 41 osds have slow requests;
> recovery 2/11091408 objects degraded (0.000%); recovery 1778127/11091408
> objects misplaced (16.032%); nodown,noout,noscrub,nodeep-scrub flag(s) set
>
> pg 3.235 is stuck inactive for 138232.508429, current state peering, last
> acting [11,26,1]
> pg 1.237 is stuck inactive for 138260.482588, current state peering, last
> acting [8,41,34]
> pg 2.231 is stuck inactive for 138258.316031, current state peering, last
> acting [24,53,8]
> pg 2.22e is stuck inactive for 194033.321591, current state
> remapped+peering, last acting [0,29,1]
> pg 1.22c is stuck inactive for 102514.200154, current state peering, last
> acting [51,7,20]
> pg 2.228 is stuck inactive for 138258.317797, current state peering, last
> acting [53,4,34]
> pg 1.227 is stuck inactive for 138258.244681, current state
> remapped+peering, last acting [48,35,11]
> pg 2.220 is stuck inactive for 193940.066322, current state
> remapped+peering, last acting [9,39,8]
> pg 1.222 is stuck inactive for 101474.087688, current state peering, last
> acting [23,11,35]
> pg 3.130 is stuck inactive for 99735.451290, current state peering, last
> acting [27,37,17]
> pg 3.136 is stuck inactive for 138221.552865, current state peering, last
> acting [26,49,10]
> pg 3.13c is stuck inactive for 137563.906503, current state peering, last
> acting [51,53,7]
> pg 2.142 is stuck inactive for 99962.462932, current state peering, last
> acting [37,16,34]
> pg 1.141 is stuck inactive for 138257.572476, current state
> remapped+peering, last acting [5,17,49]
> pg 2.141 is stuck inactive for 102567.745720, current state peering, last
> acting [36,7,15]
> pg 3.144 is stuck inactive for 138218.289585, current state
> remapped+peering, last acting [18,28,16]
> pg 1.14d is stuck inactive for 138260.030530, current state peering, last
> acting [46,43,17]
> pg 3.155 is stuck inactive for 138227.368541, current state
> remapped+peering, last acting [33,20,52]
> pg 2.8d is stuck inactive for 100251.802576, current state peering, last
> acting [6,39,27]
> pg 2.15c is stuck inactive for 102567.512279, current state
> remapped+peering, last acting [7,35,49]
> pg 2.167 is stuck inactive for 138260.093367, current state peering, last
> acting [35,23,17]
> pg 3.9d is stuck inactive for 117050.294600, current state peering, last
> acting [12,51,23]
> pg 2.16e is stuck inactive for 99846.214239, current state peering, last
> acting [25,5,8]
> pg 2.17b is stuck inactive for 99733.504794, current state peering, last
> acting [49,27,14]
> pg 3.178 is stuck inactive for 99973.600671, current state peering, last
> acting [29,16,40]
> pg 3.240 is stuck inactive for 28768.488851, current state
> remapped+peering, last acting [33,8,32]
> pg 3.b6 is stuck inactive for 138222.461160, current state peering, last
> acting [26,29,34]
> pg 2.17e is stuck inactive for 159229.154401, current state peering, last
> acting [13,42,48]
> pg 2.17c is stuck inactive for 104921.767401, current state
> remapped+peering, last acting [23,12,24]
> pg 3.17d is stuck inactive for 137563.979966, current state
> remapped+peering, last acting [43,24,29]
> pg 1.24b is stuck inactive for 93144.933177, current state peering, last
> acting [43,20,37]
> pg 1.bd is stuck inactive for 102616.793475, current state peering, last
> acting [16,30,35]
> pg 3.1d6 is stuck inactive for 99974.485247, current state peering, last
> acting [16,38,29]
> pg 2.172 is stuck inactive for 193919.627310, current state inactive, last
> acting [39,21,10]
> pg 1.171 is stuck inactive for 104947.558748, current state peering, last
> acting [49,9,25]
> pg 1.243 is stuck inactive for 208452.393430, current state peering, last
> acting [45,32,24]
> pg 3.aa is stuck inactive for 104958.230601, current state
> remapped+peering, last acting [51,12,13]
>
> 41 osds have slow requests
> recovery 2/11091408 objects degraded (0.000%)
> recovery 1778127/11091408 objects misplaced (16.032%)
> nodown,noout,noscrub,nodeep-scrub flag(s) set
>
> That is what we seem to be getting a lot of. It appears the PGs are just
> stuck as inactive. I am not sure how to get around that.
>
> Thanks,
>
> On Fri, Jun 2, 2017 at 12:55 PM, Saverio Proto <zioproto at gmail.com> wrote:
>
>> Usually 'ceph health detail' gives better info on what is making
>> everything stuck.
>>
>> Saverio
>>
>> 2017-06-02 13:51 GMT+02:00 Grant Morley <grantmorley1985 at gmail.com>:
>> > Hi All,
>> >
>> > I wonder if anyone could help at all.
>> >
>> > We were doing some routine maintenance on our ceph cluster and after
>> > running a "service ceph-all restart" on one of our nodes we noticed
>> > that something wasn't quite right. The cluster has gone into an error
>> > mode and we have multiple stuck PGs, and the object replacement
>> > recovery is taking a strangely long time. At first there was about 46%
>> > objects misplaced and we now have roughly 16%.
>> >
>> > However, it has taken about 36 hours to do the recovery so far, and
>> > with a possible 16% to go we are looking at a fairly major issue. As a
>> > lot of the system is now blocked for reads/writes, customers cannot
>> > access their VMs.
>> >
>> > I think the main issue at the moment is that we have 210 PGs stuck
>> > inactive and nothing we seem to do can get them to peer.
>> >
>> > Below is an output of the ceph status. Can anyone help, or have any
>> > ideas on how to speed up the recovery process? We have tried turning
>> > down logging on the OSDs, but some are going so slow they won't allow
>> > us to injectargs into them.
>> >
>> > health HEALTH_ERR
>> >             210 pgs are stuck inactive for more than 300 seconds
>> >             298 pgs backfill_wait
>> >             3 pgs backfilling
>> >             1 pgs degraded
>> >             200 pgs peering
>> >             1 pgs recovery_wait
>> >             1 pgs stuck degraded
>> >             210 pgs stuck inactive
>> >             512 pgs stuck unclean
>> >             3310 requests are blocked > 32 sec
>> >             recovery 2/11094405 objects degraded (0.000%)
>> >             recovery 1785063/11094405 objects misplaced (16.090%)
>> >             nodown,noout,noscrub,nodeep-scrub flag(s) set
>> >
>> >             election epoch 16314, quorum 0,1,2,3,4,5,6,7,8
>> > storage-1,storage-2,storage-3,storage-4,storage-5,storage-6,storage-7,storage-8,storage-9
>> >      osdmap e213164: 54 osds: 54 up, 54 in; 329 remapped pgs
>> >             flags nodown,noout,noscrub,nodeep-scrub
>> >       pgmap v41030942: 2036 pgs, 14 pools, 14183 GB data, 3309 kobjects
>> >             43356 GB used, 47141 GB / 90498 GB avail
>> >             2/11094405 objects degraded (0.000%)
>> >             1785063/11094405 objects misplaced (16.090%)
>> >                 1524 active+clean
>> >                  298 active+remapped+wait_backfill
>> >                  153 peering
>> >                   47 remapped+peering
>> >                   10 inactive
>> >                    3 active+remapped+backfilling
>> >                    1 active+recovery_wait+degraded+remapped
>> >
>> > Many thanks,
>> >
>> > Grant
>>
>

-- 
DataCentred Limited registered in England and Wales no. 05611763