[openstack-dev] [ceilometer] [qa] Ceilometer ERRORS in normal runs

David Kranz dkranz at redhat.com
Thu Oct 24 02:01:17 UTC 2013


On 10/23/2013 05:08 PM, Rochelle.Grober wrote:
>
> John Griffith wrote:
>
> On Wed, Oct 23, 2013 at 8:47 AM, Sean Dague <sean at dague.net> wrote:
>
> On 10/23/2013 10:40 AM, John Griffith wrote:
>
>     On Sun, Oct 20, 2013 at 7:38 AM, Sean Dague <sean at dague.net> wrote:
>
>         Dave Kranz has been building a system so that we can ensure that
>         during a Tempest run services don't spew ERRORs in the logs.
>         Eventually, we're going to gate on this, because there is nothing
>         that Tempest does to the system that should cause any OpenStack
>         service to ERROR or stack trace (Errors should actually be
>         exceptional events that something is wrong with the system, not
>         regular events).
>
>
>     So I have to disagree with the approach being taken here.
>     Particularly in the case of Cinder and the negative tests that are
>     in place. When I read this last week I assumed you actually meant
>     that "Exceptions" were exceptional and nothing in Tempest should
>     cause Exceptions. It turns out you apparently did mean Errors. I
>     completely disagree here, Errors happen, some are recovered, some
>     are expected by the tests etc. Having a policy and especially a
>     gate that says NO ERROR MESSAGE in logs makes absolutely no sense
>     to me.
>
>     Something like NO TRACE/EXCEPTION MESSAGE in logs I can agree with,
>     but this makes no sense to me. By the way, here's a perfect example:
>     https://bugs.launchpad.net/cinder/+bug/1243485
>
>     As long as we have Tempest tests that do things like "show
>     non-existent volume" you're going to get an Error message and I
>     think that you should quite frankly.
>
>
> Ok, I guess that's where we probably need to clarify what "Not Found" 
> is. Because "Not Found" to me seems like it should be a request at 
> INFO level, not ERROR.
>
>
>     ERROR from an admin perspective should really be something that
>     would be suitable for sending an alert to an administrator for
>     them to come and fix the cloud.
>
>     From my perspective as someone who has done Ops in the past, a
>     "Volume Not Found" can be either info or an error.  It all depends
>     on the context.  That said, we need to be able to test ERROR
>     conditions and ensure that they report properly as ERROR, else the
>     poor Ops folks will always be on the spot for not knowing that
>     there is a problem.  A volume that has gone missing is a problem. 
>     Ops would like an immediate report.  They would trigger on the
>     ERROR statement in the log.  On the other hand, if someone/thing
>     fatfingers an input and requests something that has never
>     existed, then that's just info.
>
It is not just a case of fatfingers. Some of the delete APIs are 
asynchronous, and the only way to know that a delete has finished is to 
check whether the object still exists. Tempest would do such checks to 
manage resource usage even if there were no negative tests. The logs are 
not full of ERRORs because almost all of our APIs, including nova's, do 
not log an ERROR when returning 404.
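
To make the asynchronous-delete case concrete, here is a minimal Python 
sketch of that kind of check. It is not Tempest's actual code: 
`NotFound`, `wait_for_deletion` and the `show` callable are hypothetical 
names standing in for whatever a given client exposes.

```python
import time


class NotFound(Exception):
    """Hypothetical stand-in for a client's "404 Not Found" exception."""


def wait_for_deletion(show, resource_id, timeout=60.0, interval=1.0,
                      sleep=time.sleep):
    """Poll show(resource_id) until it raises NotFound.

    Returns True if the resource disappeared before the timeout,
    False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            show(resource_id)
        except NotFound:
            # A 404 here is the expected, successful outcome of the delete.
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

Every successful wait ends with a 404, so an API that logged ERROR on 
"not found" would log one ERROR per deleted resource in a healthy run.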

I think John's point is that it can be hard or impossible to tell if an 
object is not found because it truly no longer exists (or never 
existed), or if there is something wrong with the system and the object 
really exists but is not being found. But I would argue that even if 
this is true we cannot alert the operator every time some user checks to 
see if an object is still there. So there has to be some "thing" that 
gets put in the log which says "there is a problem with the system, 
either a bug or ran out of disk or something". The appearance of that 
thing in the log is what an alert should be triggered on, and what 
should fail a gate job. That is pretty close to what ERROR is being used 
for now.
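
As a rough illustration of "the appearance of that thing in the log 
fails the job", a check could look like the Python sketch below. This is 
not the actual checker. The log layout (date, time, pid, level, logger, 
message) and the names `LEVEL_RE` and `find_errors` are assumptions made 
for the example.

```python
import re

# Assumed log layout: "<date> <time> <pid> <LEVEL> <logger> <message>".
LEVEL_RE = re.compile(r"^\S+ \S+ \d+ (?P<level>[A-Z]+) ")


def find_errors(lines, fail_levels=("ERROR", "CRITICAL")):
    """Return the log lines whose level field is one of fail_levels."""
    hits = []
    for line in lines:
        match = LEVEL_RE.match(line)
        if match and match.group("level") in fail_levels:
            hits.append(line.rstrip("\n"))
    return hits
```

A gate job would then fail whenever `find_errors` returns a non-empty 
list for any service log. Matching on the level field rather than on the 
bare word "ERROR" avoids tripping on messages that merely mention it.
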
>
>     We need to be able to test for correctness of errors and process
>     logs with errors in them as part of the test verification. 
>     Perhaps a switch in the test that indicates log needs post
>     processing, or a way to redirect the log during a specific error
>     test, or some such?  The question is, how do we keep test system
>     logs clean of ERRORs and still test system logs for intentionally
>     triggered ERRORs?
>

>     --Rocky
>
We might be able to do that in our test framework, but it would not help 
operators. IMO by far the least of the evils here is to log events 
associated with an API call that returns 4xx in a way that is 
distinguishable from how we log when we detect a system failure of some 
sort.
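
A minimal Python sketch of that distinction, using only the standard 
`logging` module. `ClientError`, `handle` and the logger name are 
hypothetical and not taken from any OpenStack service.

```python
import logging

LOG = logging.getLogger("api.sketch")  # hypothetical logger name


class ClientError(Exception):
    """Hypothetical request-level problem that maps to a 4xx response."""

    def __init__(self, status, message):
        super().__init__(message)
        self.status = status


def handle(request_fn):
    """Run request_fn and return (status, body), picking the log level by cause."""
    try:
        return 200, request_fn()
    except ClientError as exc:
        # The caller asked for something invalid or absent; the system is fine.
        LOG.info("request failed with %d: %s", exc.status, exc)
        return exc.status, str(exc)
    except Exception:
        # Anything unexpected is a system failure an operator should see.
        LOG.exception("unhandled failure while processing request")
        return 500, "internal error"
```

With this split, a 404 from a "does it still exist" check shows up at 
INFO, and only the second branch produces the ERROR that an alert or a 
gate job would trigger on.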

  -David
>
>
