[Openstack-operators] Ceph recovery going unusually slow
Grant Morley
grantmorley1985 at gmail.com
Fri Jun 2 12:16:34 UTC 2017
We are using Ceph Jewel (10.2.7) running on Ubuntu 14.04 LTS. The [osd] values you asked about are:
"osd_recovery_max_active": "1"
"osd_max_backfills": "1"
"osd_recovery_op_priority": "3"
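
For reference, those values can be read back from a running OSD's admin socket on the host where that OSD lives, e.g. (osd.0 is just an example id):

    ceph daemon osd.0 config get osd_recovery_max_active
    ceph daemon osd.0 config get osd_max_backfills
    ceph daemon osd.0 config get osd_recovery_op_priority

The limits for one of the ceph-osd processes look like this: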
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             256369               256369               processes
Max open files            327680               327680               files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       256369               256369               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
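
Those limits were pulled from /proc/<pid>/limits as you suggested; something along these lines prints them for every OSD on a host:

    for p in $(pidof ceph-osd); do echo "== ceph-osd pid $p =="; cat /proc/$p/limits; done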
We did try changing osd_recovery_max_active to "3", but that seemed to
make things run slower.
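
For what it's worth, that kind of change can be applied (and reverted) at runtime with injectargs, e.g.:

    ceph tell osd.* injectargs '--osd_recovery_max_active 3'
    ceph tell osd.* injectargs '--osd_recovery_max_active 1'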
Thanks,
On Fri, Jun 2, 2017 at 1:08 PM, Saverio Proto <zioproto at gmail.com> wrote:
> To give you some help, you need to tell us the Ceph version you are
> using and, from the [osd] section of ceph.conf, what values you have
> for the following:
>
> [osd]
> osd max backfills
> osd recovery max active
> osd recovery op priority
>
> these three settings can influence the recovery speed.
>
> Also, do you have big enough limits?
>
> Check on any host the content of: /proc/`pid_of_the_osd`/limits
>
>
> Saverio
>
> 2017-06-02 14:00 GMT+02:00 Grant Morley <grantmorley1985 at gmail.com>:
> > HEALTH_ERR 210 pgs are stuck inactive for more than 300 seconds; 296 pgs
> > backfill_wait; 3 pgs backfilling; 1 pgs degraded; 202 pgs peering; 1 pgs
> > recovery_wait; 1 pgs stuck degraded; 210 pgs stuck inactive; 510 pgs stuck
> > unclean; 3308 requests are blocked > 32 sec; 41 osds have slow requests;
> > recovery 2/11091408 objects degraded (0.000%); recovery 1778127/11091408
> > objects misplaced (16.032%); nodown,noout,noscrub,nodeep-scrub flag(s) set
> >
> > pg 3.235 is stuck inactive for 138232.508429, current state peering, last
> > acting [11,26,1]
> > pg 1.237 is stuck inactive for 138260.482588, current state peering, last
> > acting [8,41,34]
> > pg 2.231 is stuck inactive for 138258.316031, current state peering, last
> > acting [24,53,8]
> > pg 2.22e is stuck inactive for 194033.321591, current state
> > remapped+peering, last acting [0,29,1]
> > pg 1.22c is stuck inactive for 102514.200154, current state peering, last
> > acting [51,7,20]
> > pg 2.228 is stuck inactive for 138258.317797, current state peering, last
> > acting [53,4,34]
> > pg 1.227 is stuck inactive for 138258.244681, current state
> > remapped+peering, last acting [48,35,11]
> > pg 2.220 is stuck inactive for 193940.066322, current state
> > remapped+peering, last acting [9,39,8]
> > pg 1.222 is stuck inactive for 101474.087688, current state peering, last
> > acting [23,11,35]
> > pg 3.130 is stuck inactive for 99735.451290, current state peering, last
> > acting [27,37,17]
> > pg 3.136 is stuck inactive for 138221.552865, current state peering, last
> > acting [26,49,10]
> > pg 3.13c is stuck inactive for 137563.906503, current state peering, last
> > acting [51,53,7]
> > pg 2.142 is stuck inactive for 99962.462932, current state peering, last
> > acting [37,16,34]
> > pg 1.141 is stuck inactive for 138257.572476, current state
> > remapped+peering, last acting [5,17,49]
> > pg 2.141 is stuck inactive for 102567.745720, current state peering, last
> > acting [36,7,15]
> > pg 3.144 is stuck inactive for 138218.289585, current state
> > remapped+peering, last acting [18,28,16]
> > pg 1.14d is stuck inactive for 138260.030530, current state peering, last
> > acting [46,43,17]
> > pg 3.155 is stuck inactive for 138227.368541, current state
> > remapped+peering, last acting [33,20,52]
> > pg 2.8d is stuck inactive for 100251.802576, current state peering, last
> > acting [6,39,27]
> > pg 2.15c is stuck inactive for 102567.512279, current state
> > remapped+peering, last acting [7,35,49]
> > pg 2.167 is stuck inactive for 138260.093367, current state peering, last
> > acting [35,23,17]
> > pg 3.9d is stuck inactive for 117050.294600, current state peering, last
> > acting [12,51,23]
> > pg 2.16e is stuck inactive for 99846.214239, current state peering, last
> > acting [25,5,8]
> > pg 2.17b is stuck inactive for 99733.504794, current state peering, last
> > acting [49,27,14]
> > pg 3.178 is stuck inactive for 99973.600671, current state peering, last
> > acting [29,16,40]
> > pg 3.240 is stuck inactive for 28768.488851, current state
> > remapped+peering, last acting [33,8,32]
> > pg 3.b6 is stuck inactive for 138222.461160, current state peering, last
> > acting [26,29,34]
> > pg 2.17e is stuck inactive for 159229.154401, current state peering, last
> > acting [13,42,48]
> > pg 2.17c is stuck inactive for 104921.767401, current state
> > remapped+peering, last acting [23,12,24]
> > pg 3.17d is stuck inactive for 137563.979966, current state
> > remapped+peering, last acting [43,24,29]
> > pg 1.24b is stuck inactive for 93144.933177, current state peering, last
> > acting [43,20,37]
> > pg 1.bd is stuck inactive for 102616.793475, current state peering, last
> > acting [16,30,35]
> > pg 3.1d6 is stuck inactive for 99974.485247, current state peering, last
> > acting [16,38,29]
> > pg 2.172 is stuck inactive for 193919.627310, current state inactive,
> > last acting [39,21,10]
> > pg 1.171 is stuck inactive for 104947.558748, current state peering, last
> > acting [49,9,25]
> > pg 1.243 is stuck inactive for 208452.393430, current state peering, last
> > acting [45,32,24]
> > pg 3.aa is stuck inactive for 104958.230601, current state
> > remapped+peering, last acting [51,12,13]
> >
> > 41 osds have slow requests
> > recovery 2/11091408 objects degraded (0.000%)
> > recovery 1778127/11091408 objects misplaced (16.032%)
> > nodown,noout,noscrub,nodeep-scrub flag(s) set
> >
> > That is what we seem to be getting a lot of. It appears the PGs are just
> > stuck as inactive. I am not sure how to get around that.
> >
> > Thanks,
> >
> > On Fri, Jun 2, 2017 at 12:55 PM, Saverio Proto <zioproto at gmail.com>
> > wrote:
> >>
> >> Usually 'ceph health detail' gives better info on what is making
> >> everything stuck.
> >>
> >> Saverio
> >>
> >> 2017-06-02 13:51 GMT+02:00 Grant Morley <grantmorley1985 at gmail.com>:
> >> > Hi All,
> >> >
> >> > I wonder if anyone could help at all.
> >> >
> >> > We were doing some routine maintenance on our ceph cluster and after
> >> > running
> >> > a "service ceph-all restart" on one of our nodes we noticed that
> >> > something
> >> > wasn't quite right. The cluster has gone into an error mode and we
> >> > have
> >> > multiple stuck PGs and the object replacement recovery is taking a
> >> > strangely
> >> > long time. At first there was about 46% objects misplaced and we now
> >> > have
> >> > roughly 16%.
> >> >
> >> > However it has taken about 36 hours to do the recovery so far and
> >> > with a
> >> > possible 16 to go we are looking at a fairly major issue. As a lot of
> >> > the
> >> > system is now blocked for read / writes, customers cannot access their
> >> > VMs.
> >> >
> >> > I think the main issue at the moment is that we have 210 PGs stuck
> >> > inactive
> >> > and nothing we seem to do can get them to peer.
> >> >
> >> > Below is an output of the ceph status. Can anyone help or have any
> >> > ideas
> >> > on
> >> > how to speed up the recovery process? We have tried turning down
> >> > logging
> >> > on
> >> > the OSDs but some are going so slow they won't allow us to injectargs
> >> > into
> >> > them.
> >> >
> >> > health HEALTH_ERR
> >> > 210 pgs are stuck inactive for more than 300 seconds
> >> > 298 pgs backfill_wait
> >> > 3 pgs backfilling
> >> > 1 pgs degraded
> >> > 200 pgs peering
> >> > 1 pgs recovery_wait
> >> > 1 pgs stuck degraded
> >> > 210 pgs stuck inactive
> >> > 512 pgs stuck unclean
> >> > 3310 requests are blocked > 32 sec
> >> > recovery 2/11094405 objects degraded (0.000%)
> >> > recovery 1785063/11094405 objects misplaced (16.090%)
> >> > nodown,noout,noscrub,nodeep-scrub flag(s) set
> >> >
> >> > election epoch 16314, quorum 0,1,2,3,4,5,6,7,8
> >> >
> >> > storage-1,storage-2,storage-3,storage-4,storage-5,storage-6,storage-7,storage-8,storage-9
> >> > osdmap e213164: 54 osds: 54 up, 54 in; 329 remapped pgs
> >> > flags nodown,noout,noscrub,nodeep-scrub
> >> > pgmap v41030942: 2036 pgs, 14 pools, 14183 GB data, 3309 kobjects
> >> > 43356 GB used, 47141 GB / 90498 GB avail
> >> > 2/11094405 objects degraded (0.000%)
> >> > 1785063/11094405 objects misplaced (16.090%)
> >> > 1524 active+clean
> >> > 298 active+remapped+wait_backfill
> >> > 153 peering
> >> > 47 remapped+peering
> >> > 10 inactive
> >> > 3 active+remapped+backfilling
> >> > 1 active+recovery_wait+degraded+remapped
> >> >
> >> > Many thanks,
> >> >
> >> > Grant
> >> >
> >> >
> >> >
> >
> >
>