<p dir="ltr"><br>

On Jan 14, 2014 6:14 AM, "Sean Dague" &lt;<a href="mailto:sean@dague.net">sean@dague.net</a>&gt; wrote:<br>

><br>

> I'm doing some fundamental refactors on the ER bot to help us try to<br>

> figure out why we are often not tagging bugs that we should be, and have<br>

> found that we're no longer really indexing in real time (which may be a<br>

> huge part of this).<br>

><br>

> Basically we've got a more or less hard timeout of 13 minutes (it's up<br>

> to 20 attempts with a 40s wait between for random historical reasons)<br>

> from gerrit fail reporting to having the console log index in ES. (We<br>

> give it another 13 minutes after that to gather all the rest of the job<br>

> appropriate logs).<br>

><br>

> Because of the way we process events, timing out on one fail often means<br>

> the next one actually might work, because you'll get 13 minutes from the<br>

> time ER looked at your change, not since your change was posted (we're<br>

> single threaded in this part of the loop).<br>

><br>

> What I'm seeing right now is that starting up the bot locally it will<br>

> always timeout waiting for results of the first failure that it gets,<br>

> then if you get lucky, it might classify the 2nd fail.<br>

><br>

> Given that, we really need to be tracking and alerting on ES delays<br>

> somehow, otherwise we're going to lose a lot of the value on this.<br>

><br>

>         -Sean<br>

><br>

> --<br>

> Sean Dague<br>

> <a href="http://dague.net">http://dague.net</a><br>

><br>


></p>
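<p dir="ltr">For reference, the timeout Sean describes (up to 20 attempts with a 40s wait between them) can be sketched roughly as below. This is only an illustration, not the actual ER bot code: <code>wait_for_index</code> and <code>is_indexed</code> are hypothetical names, and whether the real loop sleeps after its final attempt is an assumption. Under that assumption the worst case is 20 * 40 = 800 seconds, a little over 13 minutes.</p>

```python
import time

MAX_ATTEMPTS = 20   # from the message: "up to 20 attempts"
WAIT_SECONDS = 40   # from the message: "a 40s wait between"


def wait_for_index(is_indexed, attempts=MAX_ATTEMPTS, wait=WAIT_SECONDS,
                   sleep=time.sleep):
    """Poll is_indexed() until it returns True or attempts run out.

    is_indexed is a hypothetical callable standing in for "is the console
    log in ES yet?". Returns (found, seconds_waited).
    """
    waited = 0
    for _ in range(attempts):
        if is_indexed():
            return True, waited
        # Assumption: we also sleep after the last failed check.
        sleep(wait)
        waited += wait
    return False, waited


if __name__ == "__main__":
    # Simulate an index that never shows up, without really sleeping.
    found, waited = wait_for_index(lambda: False, sleep=lambda s: None)
    print(found, waited, "seconds, about", waited // 60, "minutes")
```

<p dir="ltr">Because this loop blocks, the clock for the next failure only starts once the previous one has finished or timed out, which matches the single-threaded behavior described above.</p>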

<p dir="ltr">There are a couple of things we can do about this. First, we should re-enable the logstash 05-08 workers to double the worker count. Second, we should enable the new geard graphite statistics so that we can see queue length trends. I can work on this when I get back from Perth, but don't let that stop anyone from attacking it first.</p>


<p dir="ltr">Clark</p>