[openstack-dev] [ceilometer][qa] Punting ceilometer from whitelist

Sean Dague sean at dague.net
Tue Dec 3 14:51:34 UTC 2013


On 12/03/2013 09:30 AM, Eoghan Glynn wrote:
> 
> 
> ----- Original Message -----
>> On 12/02/2013 10:24 AM, Julien Danjou wrote:
>>> On Fri, Nov 29 2013, David Kranz wrote:
>>>
>>>> In preparing to fail builds with log errors I have been trying to make
>>>> things easier for projects by maintaining a whitelist. But these bugs in
>>>> ceilometer are coming in so fast that I can't keep up. So I am just
>>>> putting ".*" in the whitelist for any cases I find before gate failing
>>>> is turned on, hopefully early this week.
>>> Following the chat on IRC and the bug reports, it seems this might come
>>> from the tempest tests that are under review, as currently I don't
>>> think Ceilometer generates any errors as it's not tested.
>>>
>>> So I'm not sure we want to whitelist anything?
>> So I tested this with https://review.openstack.org/#/c/59443/. There are
>> flaky log errors coming from ceilometer. You
>> can see that the build at 12:27 passed, but the last build failed twice,
>> each with a different set of errors. So the whitelist needs to remain
>> and the ceilometer team should remove each entry when it is believed to
>> be unnecessary.
> 
> Hi David,
> 
> Just looking into this issue.
> 
> So when you say the build failed, do you mean that errors were detected
> in the ceilometer log files? (as opposed to a specific Tempest testcase
> having reported a failure)
> 
> If that interpretation of build failure is correct, I think there's a simple
> explanation for the compute agent ERRORs seen in the log file for the CI
> build related to your patch referenced above, specifically:
> 
>   ERROR ceilometer.compute.pollsters.disk [-] Requested operation is not valid: domain is not running
> 
> The problem, I suspect, is a side-effect of a nova test that suspends the
> instance in question, followed by a race between the ceilometer logic that
> discovers the local instances via the nova-api and the individual
> pollsters that call into the libvirt daemon to gather the disk stats etc.
> It appears that the libvirt virDomainBlockStats() call fails with "domain
> is not running" for suspended instances.
> 
> This would only occur intermittently as it requires the instance to
> remain in the suspended state across a polling interval boundary. 
> 
> So we need to tighten up our logic there to avoid spewing needless errors
> when a very normal event occurs (i.e. instance suspension).

Definitely need to tighten things up.
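
Something along these lines ought to do it on the ceilometer side. Treat
this as a rough sketch rather than the actual pollster code - the names
here (get_disk_stats, the device argument) are made up for illustration:

    import logging

    import libvirt

    LOG = logging.getLogger(__name__)


    def get_disk_stats(domain, device):
        # Sketch: the point is to treat "domain is not running" as a
        # normal event rather than an operator-visible ERROR.
        try:
            return domain.blockStats(device)
        except libvirt.libvirtError as e:
            if e.get_error_code() == libvirt.VIR_ERR_OPERATION_INVALID:
                # Suspended/stopped instances fail with "Requested
                # operation is not valid: domain is not running"; skip
                # the sample and log at DEBUG instead of ERROR.
                LOG.debug("Instance %s not running, skipping disk stats",
                          domain.name())
                return None
            raise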

As a developer, think about the fact that when you log something as
ERROR, you are expecting a cloud operator to be woken up in the middle
of the night by an email alert to go fix the cloud immediately. You are
intentionally ruining someone's weekend so they can fix this issue -
RIGHT NOW!

That's why we are going to start failing jobs that add new ERRORs. We
have a whitelist for the cases where an ERROR really is expected, but
assume that's not the normal path.
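
For reference, the log check itself is conceptually just a scan of the
service logs against a per-service whitelist of regexes. Roughly (a
simplified sketch of the idea, not the actual check script, and the
whitelist format here is invented for illustration):

    import re

    # Invented example: regexes for ERROR lines that are already known
    # and accepted for a given service.
    WHITELIST = {
        'ceilometer': [
            re.compile(r'Requested operation is not valid: '
                       r'domain is not running'),
        ],
    }


    def new_errors(service, log_lines):
        # Return ERROR lines that no whitelist entry covers; a job
        # fails if this is non-empty for any service, which is why a
        # ".*" entry effectively disables the check for that service.
        allowed = WHITELIST.get(service, [])
        return [line for line in log_lines
                if ' ERROR ' in line
                and not any(r.search(line) for r in allowed)]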

	-Sean

-- 
Sean Dague
http://dague.net


