[Openstack-operators] Ceph recovery going unusually slow

Grant Morley grantmorley1985 at gmail.com
Fri Jun 2 12:16:34 UTC 2017


We are using Ceph Jewel (10.2.7) running on Ubuntu 14.04 LTS

"osd_recovery_max_active": "1"
"osd_max_backfills": "1"
"osd_recovery_op_priority": "3"
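
For reference, this is roughly how we are reading those values back from a
running OSD (a minimal sketch; osd.0 is just an example id, and the commands
have to be run on the host that carries that OSD):

# Ask the running OSD for the current values via its admin socket
ceph daemon osd.0 config get osd_recovery_max_active
ceph daemon osd.0 config get osd_max_backfills
ceph daemon osd.0 config get osd_recovery_op_priority

# Or grep them all out of the full running config in one go
ceph daemon osd.0 config show | egrep 'osd_max_backfills|osd_recovery_(max_active|op_priority)'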

Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             256369               256369               processes
Max open files            327680               327680               files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       256369               256369               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
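
That output was gathered with something like the following (a rough sketch
that simply loops over every ceph-osd process on one of the storage nodes):

# Dump the kernel resource limits for every running ceph-osd on this host
for pid in $(pgrep ceph-osd); do
    echo "=== ceph-osd pid $pid ==="
    cat /proc/$pid/limits
done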

We did try changing osd_recovery_max_active to "3", but that seemed to make
things run slower.
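
For completeness, the change (and the revert back to the values above) was
done at runtime roughly like this; a sketch only, and as noted further down
some OSDs are responding so slowly that injectargs does not always get
through to them:

# Raise recovery concurrency on all OSDs at runtime (this is the change
# that seemed to make things worse for us)
ceph tell osd.* injectargs '--osd-recovery-max-active 3'

# Drop it back to the conservative settings shown above
ceph tell osd.* injectargs '--osd-recovery-max-active 1 --osd-max-backfills 1'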

Thanks,

On Fri, Jun 2, 2017 at 1:08 PM, Saverio Proto <zioproto at gmail.com> wrote:

> To give you some help you need to tell us the ceph version you are
> using and from ceph.conf in the section [osd] what values you have for
> the following ?
>
> [osd]
> osd max backfills
> osd recovery max active
> osd recovery op priority
>
> these three settings can influence the recovery speed.
>
> Also, do you have big enough limits ?
>
> Check on any host the content of: /proc/`pid_of_the_osd`/limits
>
>
> Saverio
>
> 2017-06-02 14:00 GMT+02:00 Grant Morley <grantmorley1985 at gmail.com>:
> > HEALTH_ERR 210 pgs are stuck inactive for more than 300 seconds; 296 pgs
> > backfill_wait; 3 pgs backfilling; 1 pgs degraded; 202 pgs peering; 1 pgs
> > recovery_wait; 1 pgs stuck degraded; 210 pgs stuck inactive; 510 pgs stuck
> > unclean; 3308 requests are blocked > 32 sec; 41 osds have slow requests;
> > recovery 2/11091408 objects degraded (0.000%); recovery 1778127/11091408
> > objects misplaced (16.032%); nodown,noout,noscrub,nodeep-scrub flag(s) set
> >
> > pg 3.235 is stuck inactive for 138232.508429, current state peering, last
> > acting [11,26,1]
> > pg 1.237 is stuck inactive for 138260.482588, current state peering, last
> > acting [8,41,34]
> > pg 2.231 is stuck inactive for 138258.316031, current state peering, last
> > acting [24,53,8]
> > pg 2.22e is stuck inactive for 194033.321591, current state
> > remapped+peering, last acting [0,29,1]
> > pg 1.22c is stuck inactive for 102514.200154, current state peering, last
> > acting [51,7,20]
> > pg 2.228 is stuck inactive for 138258.317797, current state peering, last
> > acting [53,4,34]
> > pg 1.227 is stuck inactive for 138258.244681, current state
> > remapped+peering, last acting [48,35,11]
> > pg 2.220 is stuck inactive for 193940.066322, current state
> > remapped+peering, last acting [9,39,8]
> > pg 1.222 is stuck inactive for 101474.087688, current state peering, last
> > acting [23,11,35]
> > pg 3.130 is stuck inactive for 99735.451290, current state peering, last
> > acting [27,37,17]
> > pg 3.136 is stuck inactive for 138221.552865, current state peering, last
> > acting [26,49,10]
> > pg 3.13c is stuck inactive for 137563.906503, current state peering, last
> > acting [51,53,7]
> > pg 2.142 is stuck inactive for 99962.462932, current state peering, last
> > acting [37,16,34]
> > pg 1.141 is stuck inactive for 138257.572476, current state
> > remapped+peering, last acting [5,17,49]
> > pg 2.141 is stuck inactive for 102567.745720, current state peering, last
> > acting [36,7,15]
> > pg 3.144 is stuck inactive for 138218.289585, current state
> > remapped+peering, last acting [18,28,16]
> > pg 1.14d is stuck inactive for 138260.030530, current state peering, last
> > acting [46,43,17]
> > pg 3.155 is stuck inactive for 138227.368541, current state
> > remapped+peering, last acting [33,20,52]
> > pg 2.8d is stuck inactive for 100251.802576, current state peering, last
> > acting [6,39,27]
> > pg 2.15c is stuck inactive for 102567.512279, current state
> > remapped+peering, last acting [7,35,49]
> > pg 2.167 is stuck inactive for 138260.093367, current state peering, last
> > acting [35,23,17]
> > pg 3.9d is stuck inactive for 117050.294600, current state peering, last
> > acting [12,51,23]
> > pg 2.16e is stuck inactive for 99846.214239, current state peering, last
> > acting [25,5,8]
> > pg 2.17b is stuck inactive for 99733.504794, current state peering, last
> > acting [49,27,14]
> > pg 3.178 is stuck inactive for 99973.600671, current state peering, last
> > acting [29,16,40]
> > pg 3.240 is stuck inactive for 28768.488851, current state
> > remapped+peering, last acting [33,8,32]
> > pg 3.b6 is stuck inactive for 138222.461160, current state peering, last
> > acting [26,29,34]
> > pg 2.17e is stuck inactive for 159229.154401, current state peering, last
> > acting [13,42,48]
> > pg 2.17c is stuck inactive for 104921.767401, current state
> > remapped+peering, last acting [23,12,24]
> > pg 3.17d is stuck inactive for 137563.979966, current state
> > remapped+peering, last acting [43,24,29]
> > pg 1.24b is stuck inactive for 93144.933177, current state peering, last
> > acting [43,20,37]
> > pg 1.bd is stuck inactive for 102616.793475, current state peering, last
> > acting [16,30,35]
> > pg 3.1d6 is stuck inactive for 99974.485247, current state peering, last
> > acting [16,38,29]
> > pg 2.172 is stuck inactive for 193919.627310, current state inactive,
> > last acting [39,21,10]
> > pg 1.171 is stuck inactive for 104947.558748, current state peering, last
> > acting [49,9,25]
> > pg 1.243 is stuck inactive for 208452.393430, current state peering, last
> > acting [45,32,24]
> > pg 3.aa is stuck inactive for 104958.230601, current state
> > remapped+peering, last acting [51,12,13]
> >
> > 41 osds have slow requests
> > recovery 2/11091408 objects degraded (0.000%)
> > recovery 1778127/11091408 objects misplaced (16.032%)
> > nodown,noout,noscrub,nodeep-scrub flag(s) set
> >
> > That is what we seem to be getting a lot of. It appears the PGs are just
> > stuck as inactive. I am not sure how to get around that.
> >
> > Thanks,
> >
> > On Fri, Jun 2, 2017 at 12:55 PM, Saverio Proto <zioproto at gmail.com> wrote:
> >>
> >> Usually 'ceph health detail' gives better info on what is making
> >> everything stuck.
> >>
> >> Saverio
> >>
> >> 2017-06-02 13:51 GMT+02:00 Grant Morley <grantmorley1985 at gmail.com>:
> >> > Hi All,
> >> >
> >> > I wonder if anyone could help at all.
> >> >
> >> > We were doing some routine maintenance on our ceph cluster, and after
> >> > running a "service ceph-all restart" on one of our nodes we noticed
> >> > that something wasn't quite right. The cluster has gone into an error
> >> > state, we have multiple stuck PGs, and the object replacement recovery
> >> > is taking a strangely long time. At first about 46% of objects were
> >> > misplaced, and we are now down to roughly 16%.
> >> >
> >> > However, it has taken about 36 hours to do the recovery so far, and
> >> > with a possible 16% to go we are looking at a fairly major issue. As a
> >> > lot of the system is now blocked for reads / writes, customers cannot
> >> > access their VMs.
> >> >
> >> > I think the main issue at the moment is that we have 210 PGs stuck
> >> > inactive and nothing we seem to do can get them to peer.
> >> >
> >> > Below is an output of the ceph status. Can anyone help, or have any
> >> > ideas on how to speed up the recovery process? We have tried turning
> >> > down logging on the OSDs, but some are going so slow they won't allow
> >> > us to injectargs into them.
> >> >
> >> > health HEALTH_ERR
> >> >             210 pgs are stuck inactive for more than 300 seconds
> >> >             298 pgs backfill_wait
> >> >             3 pgs backfilling
> >> >             1 pgs degraded
> >> >             200 pgs peering
> >> >             1 pgs recovery_wait
> >> >             1 pgs stuck degraded
> >> >             210 pgs stuck inactive
> >> >             512 pgs stuck unclean
> >> >             3310 requests are blocked > 32 sec
> >> >             recovery 2/11094405 objects degraded (0.000%)
> >> >             recovery 1785063/11094405 objects misplaced (16.090%)
> >> >             nodown,noout,noscrub,nodeep-scrub flag(s) set
> >> >
> >> >             election epoch 16314, quorum 0,1,2,3,4,5,6,7,8
> >> >             storage-1,storage-2,storage-3,storage-4,storage-5,storage-6,storage-7,storage-8,storage-9
> >> >      osdmap e213164: 54 osds: 54 up, 54 in; 329 remapped pgs
> >> >             flags nodown,noout,noscrub,nodeep-scrub
> >> >       pgmap v41030942: 2036 pgs, 14 pools, 14183 GB data, 3309 kobjects
> >> >             43356 GB used, 47141 GB / 90498 GB avail
> >> >             2/11094405 objects degraded (0.000%)
> >> >             1785063/11094405 objects misplaced (16.090%)
> >> >                 1524 active+clean
> >> >                  298 active+remapped+wait_backfill
> >> >                  153 peering
> >> >                   47 remapped+peering
> >> >                   10 inactive
> >> >                    3 active+remapped+backfilling
> >> >                    1 active+recovery_wait+degraded+remapped
> >> >
> >> > Many thanks,
> >> >
> >> > Grant
> >> >
> >> >
> >> >
> >
> >
>