[openstack-dev] [ceilometer][qa] Punting ceilometer from whitelist
dkranz at redhat.com
Tue Dec 3 15:13:06 UTC 2013
On 12/03/2013 09:30 AM, Eoghan Glynn wrote:
> ----- Original Message -----
>> On 12/02/2013 10:24 AM, Julien Danjou wrote:
>>> On Fri, Nov 29 2013, David Kranz wrote:
>>>> In preparing to fail builds with log errors I have been trying to make
>>>> things easier for projects by maintaining a whitelist. But these bugs in
>>>> ceilometer are coming in so fast that I can't keep up. So I am just
>>>> putting ".*" in the whitelist for any cases I find before gate failing
>>>> is turned on, hopefully early this week.
>>> Following the chat on IRC and the bug reports, it seems this might come
>>> from the Tempest tests that are under review, as currently I don't
>>> think Ceilometer generates any errors since it's not tested.
>>> So I'm not sure we want to whitelist anything?
>> So I tested this with https://review.openstack.org/#/c/59443/. There are
>> flaky log errors coming from ceilometer. You
>> can see that the build at 12:27 passed, but the last build failed twice,
>> each with a different set of errors. So the whitelist needs to remain
>> and the ceilometer team should remove each entry when it is believed to
>> be unnecessary.
> Hi David,
> Just looking into this issue.
> So when you say the build failed, do you mean that errors were detected
> in the ceilometer log files? (as opposed to a specific Tempest testcase
> having reported a failure)
Yes, exactly. This patch removed the whitelist entries for ceilometer
and so those errors then "failed" the build.
> If that interpretation of build failure is correct, I think there's a simple
> explanation for the compute agent ERRORs seen in the log file for the CI
> build related to your patch referenced above, specifically:
> ERROR ceilometer.compute.pollsters.disk [-] Requested operation is not valid: domain is not running
> The problem I suspect is a side-effect of a nova test that suspends the
> instance in question, followed by a race between the ceilometer logic that
> discovers the local instances via the nova-api followed by the individual
> pollsters that call into the libvirt daemon to gather the disk stats etc.
> It appears that the libvirt virDomainBlockStats() call fails with "domain
> is not running" for suspended instances.
> This would only occur intermittently as it requires the instance to
> remain in the suspended state across a polling interval boundary.
> So we need to tighten up our logic there to avoid spewing needless errors
> when a very normal event occurs (i.e. instance suspension).
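[Editorial note: the fix direction described above can be sketched as treating "domain is not running" as an expected condition for suspended instances and logging it at debug rather than ERROR. The classes and functions below are stand-ins to illustrate the pattern, not actual ceilometer or libvirt code:

```python
# Hedged sketch: a pollster that skips suspended instances quietly
# instead of emitting an ERROR. DomainNotRunning stands in for the
# libvirt error raised when stats are requested for a stopped domain.
import logging

LOG = logging.getLogger("ceilometer.compute.pollsters.disk")

class DomainNotRunning(Exception):
    """Stand-in for the libvirt 'domain is not running' error."""

def get_disk_stats(domain):
    # Simulates the libvirt stats call failing for a suspended domain.
    if not domain["running"]:
        raise DomainNotRunning("Requested operation is not valid: "
                               "domain is not running")
    return domain["stats"]

def poll(domain):
    try:
        return get_disk_stats(domain)
    except DomainNotRunning:
        # A suspended instance across a polling-interval boundary is a
        # normal event; log at debug and move on rather than ERROR.
        LOG.debug("instance %s not running; skipping disk poll",
                  domain["id"])
        return None

print(poll({"id": "i-1", "running": False}))
print(poll({"id": "i-2", "running": True, "stats": {"rd_bytes": 1024}}))
```
]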
> I've filed a bug with some ideas for addressing the issue - this
> will require a bit of discussion before agreeing on a way forward, but I'll
> prioritize getting this knocked on the head asap.
Great! Thanks. The change I pushed yesterday should help prevent this
sort of thing from creeping in across all projects. But as Julien
observed, the process of removing whitelist entries that are no longer
needed due to bug fixes is neither easy nor automatic. I'm trying to put
together a script that will check the whitelist entries against the last
two weeks of builds using logstash, but that is not so simple because
general regexps cannot be used in logstash queries.
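[Editorial note: since full regexps can't be pushed into the logstash/elasticsearch query itself, one workable shape for such an audit script is to fetch recent ERROR messages with a broad query and apply the whitelist regexps client-side. A rough sketch under assumed endpoint, index, and document layout (none of these specifics come from the thread):

```python
# Sketch of a whitelist-audit script: pull ERROR log messages from the
# last two weeks via an elasticsearch query, then report whitelist
# regexps that matched nothing (candidates for removal). The endpoint
# URL, field names, and query shape are assumptions for illustration.
import json
import re
import urllib.request

def fetch_recent_errors(endpoint, days=14):
    """Broad server-side query; regexp filtering happens client-side."""
    query = {
        "query": {"bool": {"must": [
            {"match": {"loglevel": "ERROR"}},
            {"range": {"@timestamp": {"gte": "now-%dd" % days}}},
        ]}},
        "size": 10000,
    }
    req = urllib.request.Request(endpoint, json.dumps(query).encode(),
                                 {"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        hits = json.load(resp)["hits"]["hits"]
    return [h["_source"]["message"] for h in hits]

def unused_entries(whitelist, messages):
    """Whitelist regexps that matched no message in the sampled period."""
    return [p for p in whitelist
            if not any(re.search(p, m) for m in messages)]
```

An entry reported by unused_entries is only a candidate for removal; the bug it covered may simply not have fired in the sampled window.]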