[openstack-dev] [ceilometer][qa] Punting ceilometer from whitelist
eglynn at redhat.com
Tue Dec 3 14:30:14 UTC 2013
----- Original Message -----
> On 12/02/2013 10:24 AM, Julien Danjou wrote:
> > On Fri, Nov 29 2013, David Kranz wrote:
> >> In preparing to fail builds with log errors I have been trying to make
> >> things easier for projects by maintaining a whitelist. But these bugs in
> >> ceilometer are coming in so fast that I can't keep up. So I am just
> >> putting ".*" in the whitelist for any cases I find before gate failing
> >> is turned on, hopefully early this week.
> > Following the chat on IRC and the bug reports, it seems this might come
> > from the tempest tests that are under review, as currently I don't
> > think Ceilometer generates any errors since it's not tested.
> > So I'm not sure we want to whitelist anything?
> So I tested this with https://review.openstack.org/#/c/59443/. There are
> flaky log errors coming from ceilometer. You
> can see that the build at 12:27 passed, but the last build failed twice,
> each with a different set of errors. So the whitelist needs to remain
> and the ceilometer team should remove each entry when it is believed to
> be unnecessary.
Just looking into this issue.
So when you say the build failed, do you mean that errors were detected
in the ceilometer log files? (as opposed to a specific Tempest testcase
having reported a failure)
If that interpretation of build failure is correct, I think there's a simple
explanation for the compute agent ERRORs seen in the log file for the CI
build related to your patch referenced above, specifically:
ERROR ceilometer.compute.pollsters.disk [-] Requested operation is not valid: domain is not running
The problem, I suspect, is a side-effect of a nova test that suspends the
instance in question, creating a race between the ceilometer logic that
discovers the local instances via the nova-api and the individual
pollsters that call into the libvirt daemon to gather the disk stats etc.
It appears that the libvirt virDomainBlockStats() call fails with "domain
is not running" for suspended instances.
This would only occur intermittently as it requires the instance to
remain in the suspended state across a polling interval boundary.
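To illustrate, here's a standalone sketch against the libvirt python
bindings (instance name and disk device are just placeholders, not taken
from the actual failing run):

    import libvirt

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('instance-00000001')  # placeholder name

    # nova suspend does a managed save, leaving the domain not running
    dom.managedSave(0)

    try:
        # the same call the disk pollster depends on; 'vda' is a
        # placeholder device name
        stats = dom.blockStats('vda')
    except libvirt.libvirtError as e:
        # fails with "Requested operation is not valid: domain is not
        # running"
        print(e)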
So we need to tighten up our logic there to avoid spewing needless errors
when a perfectly normal event occurs (i.e. instance suspension).
I've filed a bug with some ideas for addressing the issue - this
will require a bit of discussion before agreeing on a way forward, but I'll
prioritize getting this knocked on the head asap.
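Roughly the kind of guard I have in mind (just a sketch, not the actual
patch - the real pollster code is structured differently):

    import libvirt

    def get_disk_stats(dom, device):
        # check the domain state up-front so that a suspended instance
        # results in a quiet skip, not an ERROR in the logs
        state = dom.info()[0]
        if state != libvirt.VIR_DOMAIN_RUNNING:
            return None
        return dom.blockStats(device)

There's still a narrow window between the state check and the stats call,
so we'd probably also want to catch libvirtError at that point and log at
debug rather than ERROR.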
> > The tricky part is going to be for us to fix Ceilometer on one side and
> > re-run Tempest reviews on the other side once a potential fix is merged.
> This is another use case for the promised
> dependent-patch-between-projects thing.