<div dir="ltr"><div><div><div><div><div><div><div>Hi All,<br><br></div>I wonder if anyone could help at all.<br><br></div>We
were doing some routine maintenance on our ceph cluster and after
running a "service ceph-all restart" on one of our nodes we noticed that
something wasn't quite right. The cluster has gone into an error state,
we have multiple stuck PGs, and recovery of the misplaced objects is
taking an unusually long time. At first about 46% of objects were
misplaced; we are now at roughly 16%.<br><br></div>However it has taken
about 36 hours to do the recovery so far and with a possible 16 to go
we are looking at a fairly major issue. As much of the system is now
blocked for reads and writes, customers cannot access their VMs.<br><br></div>I think the main issue at the moment is that we have 210 PGs stuck inactive, and nothing we do seems to get them to peer.<br><br></div>Below
is the output of ceph status. Can anyone help, or does anyone have ideas on
how to speed up the recovery process? We have tried turning down logging
on the OSDs, but some are running so slowly that they won't let us injectargs
into them.<br><br>health HEALTH_ERR<br> 210 pgs are stuck inactive for more than 300 seconds<br> 298 pgs backfill_wait<br> 3 pgs backfilling<br> 1 pgs degraded<br> 200 pgs peering<br> 1 pgs recovery_wait<br> 1 pgs stuck degraded<br> 210 pgs stuck inactive<br> 512 pgs stuck unclean<br> 3310 requests are blocked > 32 sec<br> recovery 2/11094405 objects degraded (0.000%)<br> recovery 1785063/11094405 objects misplaced (16.090%)<br> nodown,noout,noscrub,nodeep-scrub flag(s) set<br><br>
election epoch 16314, quorum 0,1,2,3,4,5,6,7,8
storage-1,storage-2,storage-3,storage-4,storage-5,storage-6,storage-7,storage-8,storage-9<br> osdmap e213164: 54 osds: 54 up, 54 in; 329 remapped pgs<br> flags nodown,noout,noscrub,nodeep-scrub<br> pgmap v41030942: 2036 pgs, 14 pools, 14183 GB data, 3309 kobjects<br> 43356 GB used, 47141 GB / 90498 GB avail<br> 2/11094405 objects degraded (0.000%)<br> 1785063/11094405 objects misplaced (16.090%)<br> 1524 active+clean<br> 298 active+remapped+wait_backfill<br> 153 peering<br> 47 remapped+peering<br> 10 inactive<br> 3 active+remapped+backfilling<br> 1 active+recovery_wait+degraded+remapped<br><br></div>Many thanks,<br><br></div>Grant<br><div><div><br><br></div></div></div>