<div dir="ltr"><div><div>We are using Ceph Jewel (10.2.7) running on Ubuntu 14.04LTS<br><br>osd_recovery_max_active": "1"<br>osd_max_backfills": "1"<br>osd_recovery_op_priority": "3"<br><br>Limit                     Soft Limit           Hard Limit           Units                                                                                                                                     <br>Max cpu time              unlimited            unlimited            seconds                                                                                                                                   <br>Max file size             unlimited            unlimited            bytes                                                                                                                                    <br>Max data size             unlimited            unlimited            bytes                                                                                                                                     <br>Max stack size            8388608              unlimited            bytes                                                                                                                                   <br>Max core file size        0                    unlimited            bytes                                                                                                                                    <br>Max resident set          unlimited            unlimited            bytes                                                                                                                                   <br>Max processes             256369               256369               processes                                                                                                                                <br>Max open files            327680               327680               files                                                                                                                                <br>Max locked memory         65536                65536                bytes                                                                                                                                    <br>Max address space         unlimited            unlimited            bytes                                                                                                                                     <br>Max file locks            unlimited            unlimited            locks                                                                                                                                     <br>Max pending signals       256369               256369               signals                                                                                                                                  <br>Max msgqueue size         819200               819200               bytes                                                                                                                                     <br>Max nice priority         0                    0                                                                                                                                                              <br>Max realtime priority     0                    0                                                                                                                                                              <br>Max realtime 
timeout      unlimited            unlimited            us<br><br></div>We did try changing the osd_recovery_max_active to "3" but that seemed tlo make things run slower<br><br></div><div>Thanks,<br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Jun 2, 2017 at 1:08 PM, Saverio Proto <span dir="ltr"><<a href="mailto:zioproto@gmail.com" target="_blank">zioproto@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">To give you some help you need to tell us the ceph version you are<br>
using and from ceph.conf in the section [osd] what values you have for<br>
the following ?<br>
<br>
[osd]<br>
osd max backfills<br>
osd recovery max active<br>
osd recovery op priority<br>
<br>
these three settings can influence the recovery speed.<br>
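For reference, a minimal sketch of how these are usually set, either in ceph.conf or injected at runtime. The values shown are simply the ones already reported in this thread, and the injectargs form only reaches OSDs that are still responsive:

[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 3

# runtime change, no OSD restart needed
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 3'

Raising these generally speeds up recovery and backfill at the cost of client I/O, and lowering them does the opposite; they only start to matter once the PGs can actually peer, though.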

Also, do you have big enough limits?

Check on any host the content of: /proc/`pid_of_the_osd`/limits
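A quick way to do that on an OSD host, assuming the daemons show up under the process name ceph-osd (just a shell sketch):

for pid in $(pidof ceph-osd); do
    echo "== ceph-osd pid $pid =="
    grep -E 'Max open files|Max processes' /proc/$pid/limits
done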
<span class="HOEnZb"><font color="#888888"><br>
<br>
Saverio<br>
</font></span><div class="HOEnZb"><div class="h5"><br>
2017-06-02 14:00 GMT+02:00 Grant Morley <<a href="mailto:grantmorley1985@gmail.com">grantmorley1985@gmail.com</a>>:<br>
> HEALTH_ERR 210 pgs are stuck inactive for more than 300 seconds; 296 pgs backfill_wait; 3 pgs backfilling; 1 pgs degraded; 202 pgs peering; 1 pgs recovery_wait; 1 pgs stuck degraded; 210 pgs stuck inactive; 510 pgs stuck unclean; 3308 requests are blocked > 32 sec; 41 osds have slow requests; recovery 2/11091408 objects degraded (0.000%); recovery 1778127/11091408 objects misplaced (16.032%); nodown,noout,noscrub,nodeep-scrub flag(s) set
>
> pg 3.235 is stuck inactive for 138232.508429, current state peering, last acting [11,26,1]
> pg 1.237 is stuck inactive for 138260.482588, current state peering, last acting [8,41,34]
> pg 2.231 is stuck inactive for 138258.316031, current state peering, last acting [24,53,8]
> pg 2.22e is stuck inactive for 194033.321591, current state remapped+peering, last acting [0,29,1]
> pg 1.22c is stuck inactive for 102514.200154, current state peering, last acting [51,7,20]
> pg 2.228 is stuck inactive for 138258.317797, current state peering, last acting [53,4,34]
> pg 1.227 is stuck inactive for 138258.244681, current state remapped+peering, last acting [48,35,11]
> pg 2.220 is stuck inactive for 193940.066322, current state remapped+peering, last acting [9,39,8]
> pg 1.222 is stuck inactive for 101474.087688, current state peering, last acting [23,11,35]
> pg 3.130 is stuck inactive for 99735.451290, current state peering, last acting [27,37,17]
> pg 3.136 is stuck inactive for 138221.552865, current state peering, last acting [26,49,10]
> pg 3.13c is stuck inactive for 137563.906503, current state peering, last acting [51,53,7]
> pg 2.142 is stuck inactive for 99962.462932, current state peering, last acting [37,16,34]
> pg 1.141 is stuck inactive for 138257.572476, current state remapped+peering, last acting [5,17,49]
> pg 2.141 is stuck inactive for 102567.745720, current state peering, last acting [36,7,15]
> pg 3.144 is stuck inactive for 138218.289585, current state remapped+peering, last acting [18,28,16]
> pg 1.14d is stuck inactive for 138260.030530, current state peering, last acting [46,43,17]
> pg 3.155 is stuck inactive for 138227.368541, current state remapped+peering, last acting [33,20,52]
> pg 2.8d is stuck inactive for 100251.802576, current state peering, last acting [6,39,27]
> pg 2.15c is stuck inactive for 102567.512279, current state remapped+peering, last acting [7,35,49]
> pg 2.167 is stuck inactive for 138260.093367, current state peering, last acting [35,23,17]
> pg 3.9d is stuck inactive for 117050.294600, current state peering, last acting [12,51,23]
> pg 2.16e is stuck inactive for 99846.214239, current state peering, last acting [25,5,8]
> pg 2.17b is stuck inactive for 99733.504794, current state peering, last acting [49,27,14]
> pg 3.178 is stuck inactive for 99973.600671, current state peering, last acting [29,16,40]
> pg 3.240 is stuck inactive for 28768.488851, current state remapped+peering, last acting [33,8,32]
> pg 3.b6 is stuck inactive for 138222.461160, current state peering, last acting [26,29,34]
> pg 2.17e is stuck inactive for 159229.154401, current state peering, last acting [13,42,48]
> pg 2.17c is stuck inactive for 104921.767401, current state remapped+peering, last acting [23,12,24]
> pg 3.17d is stuck inactive for 137563.979966, current state remapped+peering, last acting [43,24,29]
> pg 1.24b is stuck inactive for 93144.933177, current state peering, last acting [43,20,37]
> pg 1.bd is stuck inactive for 102616.793475, current state peering, last acting [16,30,35]
> pg 3.1d6 is stuck inactive for 99974.485247, current state peering, last acting [16,38,29]
> pg 2.172 is stuck inactive for 193919.627310, current state inactive, last acting [39,21,10]
> pg 1.171 is stuck inactive for 104947.558748, current state peering, last acting [49,9,25]
> pg 1.243 is stuck inactive for 208452.393430, current state peering, last acting [45,32,24]
> pg 3.aa is stuck inactive for 104958.230601, current state remapped+peering, last acting [51,12,13]
>
> 41 osds have slow requests
> recovery 2/11091408 objects degraded (0.000%)
> recovery 1778127/11091408 objects misplaced (16.032%)
> nodown,noout,noscrub,nodeep-scrub flag(s) set
>
> That is what we seem to be getting a lot of. It appears the PGs are just stuck as inactive. I am not sure how to get around that.
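When PGs sit in peering like this, a common next step is to list the stuck ones and query one of them directly; the recovery_state section of the query output usually indicates what peering is waiting for. A sketch, using a PG id taken from the listing above:

ceph pg dump_stuck inactive
ceph pg 3.235 query     # check the recovery_state section for what is blocking peering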
>
> Thanks,
>
> On Fri, Jun 2, 2017 at 12:55 PM, Saverio Proto <zioproto@gmail.com> wrote:
>>
>> Usually 'ceph health detail' gives better info on what is making everything stuck.
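For a cluster in this state the detail output gets very long, so a simple filter can help (plain shell, nothing Ceph-specific):

ceph health detail | grep -E 'stuck|peering|blocked'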
>>
>> Saverio
>>
>> 2017-06-02 13:51 GMT+02:00 Grant Morley <grantmorley1985@gmail.com>:
>> > Hi All,
>> >
>> > I wonder if anyone could help at all.
>> >
>> > We were doing some routine maintenance on our Ceph cluster and, after running a "service ceph-all restart" on one of our nodes, we noticed that something wasn't quite right. The cluster has gone into an error state, we have multiple stuck PGs, and the object replacement recovery is taking a strangely long time. At first there was about 46% of objects misplaced and we now have roughly 16%.
>> >
>> > However, it has taken about 36 hours to do the recovery so far, and with a possible 16% to go we are looking at a fairly major issue. As a lot of the system is now blocked for reads/writes, customers cannot access their VMs.
>> >
>> > I think the main issue at the moment is that we have 210 PGs stuck inactive and nothing we seem to do can get them to peer.
>> >
>> > Below is an output of the ceph status. Can anyone help, or have any ideas on how to speed up the recovery process? We have tried turning down logging on the OSDs, but some are going so slow they won't allow us to injectargs into them.
>> >
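A quick note on the logging point before the status output: turning OSD debug logging down at runtime is typically done roughly as below. This is only a sketch; osd.12 is a placeholder id, and the admin-socket form has to be run locally on the host of an OSD that is too slow to answer injectargs over the network.

ceph tell osd.* injectargs '--debug-osd 0/0 --debug-ms 0/0 --debug-filestore 0/0'
# per OSD, on its own host, via the admin socket:
ceph daemon osd.12 config set debug_osd 0/0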
>> > health HEALTH_ERR
>> >             210 pgs are stuck inactive for more than 300 seconds
>> >             298 pgs backfill_wait
>> >             3 pgs backfilling
>> >             1 pgs degraded
>> >             200 pgs peering
>> >             1 pgs recovery_wait
>> >             1 pgs stuck degraded
>> >             210 pgs stuck inactive
>> >             512 pgs stuck unclean
>> >             3310 requests are blocked > 32 sec
>> >             recovery 2/11094405 objects degraded (0.000%)
>> >             recovery 1785063/11094405 objects misplaced (16.090%)
>> >             nodown,noout,noscrub,nodeep-scrub flag(s) set
>> >
>> >             election epoch 16314, quorum 0,1,2,3,4,5,6,7,8 storage-1,storage-2,storage-3,storage-4,storage-5,storage-6,storage-7,storage-8,storage-9
>> >      osdmap e213164: 54 osds: 54 up, 54 in; 329 remapped pgs
>> >             flags nodown,noout,noscrub,nodeep-scrub
>> >       pgmap v41030942: 2036 pgs, 14 pools, 14183 GB data, 3309 kobjects
>> >             43356 GB used, 47141 GB / 90498 GB avail
>> >             2/11094405 objects degraded (0.000%)
>> >             1785063/11094405 objects misplaced (16.090%)
>> >                 1524 active+clean
>> >                  298 active+remapped+wait_backfill
>> >                  153 peering
>> >                   47 remapped+peering
>> >                   10 inactive
>> >                    3 active+remapped+backfilling
>> >                    1 active+recovery_wait+degraded+remapped
>> >
>> > Many thanks,
>> >
>> > Grant
>> >
>> > _______________________________________________
>> > OpenStack-operators mailing list
>> > OpenStack-operators@lists.openstack.org
>> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators