Open Stack

Mon Jan 13 22:12:40 UTC 2014

I'm doing some fundamental refactors on the ER bot to help us try to
figure out why we are often not tagging bugs that we should be, and have
found that we're no longer really indexing in real time (which may be a
huge part of this).

Basically we've got a more or less hard timeout of 13 minutes (it's up
to 20 attempts with a 40s wait between them, for random historical
reasons) from Gerrit reporting the failure to having the console log
indexed in ES. (We give it another 13 minutes after that to gather the
rest of the logs appropriate to the job.)
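
As a rough sketch of the polling described above (not the actual ER
code: the function name, the is_indexed callback and the injectable
sleep are assumptions made for illustration), the retry loop and its
worst-case wait look something like this:

```python
import time

MAX_ATTEMPTS = 20   # attempts, as quoted above
WAIT_SECONDS = 40   # wait between attempts, as quoted above


def wait_for_index(is_indexed, attempts=MAX_ATTEMPTS, wait=WAIT_SECONDS,
                   sleep=time.sleep):
    """Poll is_indexed() until it returns True or the attempts run out.

    is_indexed is a hypothetical callable that reports whether the
    console log is searchable yet. Returns the number of seconds spent
    waiting, or None if every attempt failed (a timeout).
    """
    waited = 0
    for _ in range(attempts):
        if is_indexed():
            return waited
        sleep(wait)
        waited += wait
    return None


# Worst case is 20 * 40s = 800s, a little over 13 minutes.
worst_case_minutes = MAX_ATTEMPTS * WAIT_SECONDS / 60.0
```

Passing sleep in as a parameter is only there so the loop can be
exercised without actually waiting 13 minutes.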

Because of the way we process events, timing out on one fail often means
the next one actually might work, because you'll get 13 minutes from the
time ER looked at your change, not since your change was posted (we're
single threaded in this part of the loop).
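
A toy model of that effect, assuming a fixed ES indexing lag and
ignoring the 40s polling granularity (the simulate function and its
parameters are hypothetical, not part of the bot):

```python
TIMEOUT = 20 * 40  # seconds, the bot's per-failure wait


def simulate(post_times, index_lag, timeout=TIMEOUT):
    """Model a single-threaded bot handling failures one at a time.

    post_times: seconds at which each failure was reported, ascending.
    index_lag:  assumed seconds between a report and its log being
                searchable in ES.
    Returns a list of (post_time, classified) tuples.
    """
    clock = 0  # when the bot is next free to look at a failure
    results = []
    for posted in post_times:
        start = max(clock, posted)    # the timeout starts here, not at post time
        ready = posted + index_lag    # when the log becomes searchable
        deadline = start + timeout
        if ready <= deadline:
            clock = max(start, ready)
            results.append((posted, True))
        else:
            clock = deadline          # gave up; move on to the next failure
            results.append((posted, False))
    return results


# With a 900s lag, the first failure times out at 800s, but the second
# (posted at 60s) is only picked up at 800s and is indexed by 960s.
outcome = simulate([0, 60], index_lag=900)
```

In this model the second failure is classified only because the bot was
busy timing out on the first one.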

What I'm seeing right now is that when starting up the bot locally, it
always times out waiting for results of the first failure it gets;
then, if you get lucky, it might classify the second failure.

Given that, we really need to be tracking and alerting on ES delays
somehow, otherwise we're going to lose a lot of the value of this.
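
One possible shape for that tracking, as a sketch only: record the time
of the failure report and the time the log became searchable, and flag
anything slower than the bot's timeout. The helper names, the threshold
choice and the sample job names below are all assumptions, and how the
two timestamps would be collected is left open.

```python
from datetime import datetime, timedelta

# Assumed threshold: the bot's own 20 * 40s wait.
ALERT_THRESHOLD = timedelta(seconds=20 * 40)


def indexing_lag(reported_at, indexed_at):
    """Time between the failure report and the log being searchable."""
    return indexed_at - reported_at


def lagging(samples, threshold=ALERT_THRESHOLD):
    """Return the (job, lag) samples whose lag exceeds the threshold."""
    return [(job, lag) for job, lag in samples if lag > threshold]


# Hypothetical samples: one job indexed in 5 minutes, one in 20.
t0 = datetime(2014, 1, 13, 22, 0, 0)
samples = [
    ("job-a", indexing_lag(t0, t0 + timedelta(minutes=5))),
    ("job-b", indexing_lag(t0, t0 + timedelta(minutes=20))),
]
too_slow = lagging(samples)  # only job-b exceeds the 800s threshold
```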

	-Sean

-- 
Sean Dague
http://dague.net

[OpenStack-Infra] elastic-search delay metrics?
