<div dir="ltr"><div><div><div><div><div><div><div>Hi All,<br><br></div>I wonder if anyone could help at all.<br><br></div>We


were doing some routine maintenance on our Ceph cluster, and after running "service ceph-all restart" on one of our nodes we noticed that something wasn't quite right. The cluster has gone into an error state, we have multiple stuck PGs, and recovery of the misplaced objects is taking an unusually long time. At first about 46% of objects were
misplaced and we now have roughly 16%.<br><br></div>However it has taken


about 36 hours of recovery to get this far, and with a possible 16 to go we are looking at a fairly major issue. As a lot of the system is now
blocked for reads and writes, customers cannot access their VMs.<br><br></div>I think the main issue at the moment is that we have 210 PGs stuck inactive, and nothing we do seems to get them to peer.<br><br></div>Below


is the output of ceph status. Can anyone help, or does anyone have ideas on how to speed up the recovery process? We have tried turning down logging on the OSDs, but some are so slow that they won't let us injectargs
 into them.<br><br>health HEALTH_ERR<br>            210 pgs are stuck inactive for more than 300 seconds<br>            298 pgs backfill_wait<br>            3 pgs backfilling<br>            1 pgs degraded<br>            200 pgs peering<br>            1 pgs recovery_wait<br>            1 pgs stuck degraded<br>            210 pgs stuck inactive<br>            512 pgs stuck unclean<br>            3310 requests are blocked > 32 sec<br>            recovery 2/11094405 objects degraded (0.000%)<br>            recovery 1785063/11094405 objects misplaced (16.090%)<br>            nodown,noout,noscrub,nodeep-scrub flag(s) set<br><br>           


election epoch 16314, quorum 0,1,2,3,4,5,6,7,8
storage-1,storage-2,storage-3,storage-4,storage-5,storage-6,storage-7,storage-8,storage-9<br>     osdmap e213164: 54 osds: 54 up, 54 in; 329 remapped pgs<br>            flags nodown,noout,noscrub,nodeep-scrub<br>      pgmap v41030942: 2036 pgs, 14 pools, 14183 GB data, 3309 kobjects<br>            43356 GB used, 47141 GB / 90498 GB avail<br>            2/11094405 objects degraded (0.000%)<br>            1785063/11094405 objects misplaced (16.090%)<br>                1524 active+clean<br>                 298 active+remapped+wait_backfill<br>                 153 peering<br>                  47 remapped+peering<br>                  10 inactive<br>                   3 active+remapped+backfilling<br>                   1 active+recovery_wait+degraded+remapped<br><br></div>Many thanks,<br><br></div>Grant<br><div><div><br><br></div></div></div>