<div dir="ltr"><div>I think harmonizing the log files is a great idea. When working on elastic-recheck, I spent a lot of time staring at log files and cursing at how bad and non-uniform they are. I can only imagine what cloud operators must think.</div>
<div><br></div>In addition to harmonizing the log levels and making sure we don't have scary-looking logs (stack traces, etc.) during a normal tempest run, I think we should:<div><br></div><div>* Make sure that all projects use the same logging format and use request-ids. I have already filed bugs for neutron and ceilometer on this (<a href="https://bugs.launchpad.net/neutron/+bug/1239923">https://bugs.launchpad.net/neutron/+bug/1239923</a> <a href="https://bugs.launchpad.net/ceilometer/+bug/1244182">https://bugs.launchpad.net/ceilometer/+bug/1244182</a>) and I have a hunch other projects may not use these either.</div>
<div>* Have better default log levels for dependencies. For example, when debug logging is enabled for nova, I don't think we really need debug-level logs on for amqp, although perhaps I am wrong.</div></div><div class="gmail_extra">
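<div>To make the two bullets above concrete, here is a minimal sketch using only the Python standard library <code>logging</code> module. It is not how nova or any other project configures logging today; the format string, the <code>DEPENDENCY_LEVELS</code> table, and the <code>RequestIdFilter</code> and <code>configure_logging</code> names are all assumptions made up for illustration.</div>

```python
import io
import logging

# Hypothetical defaults: dependency logger names and levels are illustrative only.
DEPENDENCY_LEVELS = {"amqp": logging.WARNING}

# Hypothetical shared format that always carries a request id.
SHARED_FORMAT = "%(levelname)s %(name)s [%(request_id)s] %(message)s"


class RequestIdFilter(logging.Filter):
    """Give every record a request_id so the shared format never fails."""

    def __init__(self, default="-"):
        super().__init__()
        self.default = default

    def filter(self, record):
        if not hasattr(record, "request_id"):
            record.request_id = self.default
        return True


def configure_logging(stream, debug=False):
    """Install one handler on the root logger and cap noisy dependencies."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(SHARED_FORMAT))
    handler.addFilter(RequestIdFilter())
    root = logging.getLogger()
    root.handlers = [handler]
    root.setLevel(logging.DEBUG if debug else logging.INFO)
    # Dependencies keep their own, quieter level even when debug is on.
    for name, level in DEPENDENCY_LEVELS.items():
        logging.getLogger(name).setLevel(level)


stream = io.StringIO()
configure_logging(stream, debug=True)
logging.getLogger("nova.compute").debug(
    "spawning instance", extra={"request_id": "req-123"})
logging.getLogger("amqp.channel").debug("frame received")
output = stream.getvalue()
```

<div>With debug enabled, the nova message is written with its request id, while the amqp debug message is dropped because the "amqp" logger stays at WARNING.</div>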
<br><br><div class="gmail_quote">On Wed, Oct 23, 2013 at 8:55 PM, Sean Dague <span dir="ltr"><<a href="mailto:sean@dague.net" target="_blank">sean@dague.net</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="HOEnZb"><div class="h5">On 10/23/2013 03:35 PM, Robert Collins wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
On 24 October 2013 08:28, John Griffith <<a href="mailto:john.griffith@solidfire.com" target="_blank">john.griffith@solidfire.com</a>> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
So I touched on this a bit in my earlier post but want to reiterate here and<br>
maybe clarify a bit. I agree that cleaning up and standardizing the logs is<br>
a good thing, and particularly removing unhandled exception messages would<br>
be good. What concerns me however is the approach being taken here of<br>
saying things like "Error level messages are banned from Tempest runs".<br>
<br>
The case I mentioned earlier of the negative test is a perfect example.<br>
There's no way for Cinder (or any other service) to know the difference<br>
between the end user specifying/requesting a non-existent volume and a valid<br>
volume being requested that for some reason can't be found. I'm not quite<br>
sure how you place a definitive rule like "no error messages in logs" unless<br>
you make your tests such that you never run negative tests?<br>
</blockquote>
<br>
Let me check that I understand: you want to check that when a user<br>
asks for a volume that doesn't exist, they don't get it, *and* that<br>
the reason they didn't get it was due to Cinder detecting it's<br>
missing, not due to e.g. cinder throwing an error and returning 500 ?<br>
<br>
If so, that seems pretty straightforward: a) check the error that is<br>
reported (it should be a 404 and contain an explanation which we can<br>
check) and b) check the logs to see that nothing was logged (because a<br>
server fault would be logged).<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
There are other cases in cinder as well that I'm concerned about. One<br>
example is iscsi target creation, there are a number of scenarios where this<br>
can fail under certain conditions. In most of these cases we now have retry<br>
mechanisms or alternate implementations to complete the task. The fact is<br>
however that a call somewhere in the system failed, this should be something<br>
in my opinion that stands out in the logs. Maybe this particular case would<br>
be well suited to being a warning rather than an error, and that's fine. My<br>
point, however, is that I think some thought needs to go into this<br>
before making blanket rules and especially gating criteria that say "no<br>
error messages in logs".<br>
</blockquote></blockquote>
<br></div></div>
Absolutely agreed. That's why I wanted to kick off this discussion and am thinking about how we get to agreement by Icehouse (giving this lots of time to bake and getting different perspectives in here).<br>
<br>
In the short term, on failing jobs in tempest because they've got errors in the logs: we've got a whole whitelist mechanism right now for "acceptable errors". Over time I'd love to shrink that to 0. But that's going to be a collaboration between the QA team and the specific core projects to make sure that's the right call in each case. Who knows, maybe there are generally agreed-upon ERROR conditions that we trigger, but we'll figure that out over time.<br>
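<br>
As a rough illustration of the whitelist idea (not the actual tempest mechanism), a check could look like the sketch below. The pattern list, the <code>unexpected_errors</code> helper, and the sample log lines are all made up for this example.<br>

```python
import re

# Hypothetical whitelist of "acceptable errors"; the pattern is illustrative only.
ALLOWED_ERRORS = [re.compile(r"Volume \S+ could not be found")]


def unexpected_errors(log_lines, allowed=ALLOWED_ERRORS):
    """Return ERROR/CRITICAL lines that no whitelist pattern matches."""
    flagged = []
    for line in log_lines:
        if " ERROR " not in line and " CRITICAL " not in line:
            continue  # only ERROR and above can fail the run
        if any(pattern.search(line) for pattern in allowed):
            continue  # known and accepted for now
        flagged.append(line)
    return flagged


# Made-up sample lines, not real service output.
sample = [
    "2013-10-23 20:55:01 INFO cinder.api request accepted",
    "2013-10-23 20:55:02 ERROR cinder.api Volume vol-1 could not be found",
    "2013-10-23 20:55:03 ERROR cinder.volume iSCSI target creation failed",
]
failures = unexpected_errors(sample)
```

Shrinking the whitelist to 0 then means deleting patterns one at a time, each after the owning project agrees the message should be fixed or logged at a lower level.<br>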
<br>
I think the iscsi example is a good case for WARNING, which is the same level we use when we fail to schedule a resource (compute / volume), especially because we try to recover now. If we fail to recover, ERROR is probably called for. But if we actually failed to allocate a volume, we'd end up failing the tests anyway, which means the ERROR in the log wouldn't be a problem in and of itself.<div class="im">
<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
I agree thought and care is needed. As a deployer my concern is that<br>
the only time ERROR is logged in the logs is when something is wrong<br>
with the infrastructure (rather than a user asking for something<br>
stupid). I think my concern and yours can both be handled at the same<br>
time.<br>
</blockquote>
<br></div>
Right, and I think this is the perspective that I'm coming from. Our logs (at INFO and up) are UX to our cloud admins.<br>
<br>
We should be pretty sure that we know something is a problem if we tag it as an ERROR or CRITICAL, because that's likely to be something that negatively impacts someone's day.<br>
<br>
If we aren't completely sure your cloud is on fire, but we're pretty sure something is odd, WARNING is appropriate.<br>
<br>
If it's no good, but we have no way to test if it's a problem, it's just INFO. I really think the "not found" case falls more into standard INFO.<br>
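<br>
Applied to a retry path like the iscsi one, the level guidance above could be sketched as follows. This is only an illustration using the standard <code>logging</code> module; <code>create_with_retry</code>, <code>ListHandler</code>, and <code>flaky</code> are hypothetical names, not cinder code.<br>

```python
import logging

LOG = logging.getLogger("example.volume")


def create_with_retry(create_target, attempts=3):
    """Log WARNING while we can still recover, ERROR only when giving up."""
    for attempt in range(1, attempts + 1):
        try:
            return create_target()
        except RuntimeError as exc:
            if attempt == attempts:
                LOG.error("target creation failed after %d attempts: %s",
                          attempts, exc)
                raise
            LOG.warning("target creation failed (attempt %d/%d), retrying: %s",
                        attempt, attempts, exc)


class ListHandler(logging.Handler):
    """Collect records in memory so the levels can be inspected."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


handler = ListHandler()
LOG.addHandler(handler)
LOG.setLevel(logging.DEBUG)

calls = {"count": 0}


def flaky():
    """Hypothetical operation that fails once, then succeeds."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("transient failure")
    return "target-1"


result = create_with_retry(flaky)
levels = [record.levelname for record in handler.records]
```

A failure that is recovered leaves only a WARNING behind, so a "no ERROR in the logs" gate would still pass for it.<br>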
<br>
Again, more concrete instances, like the iscsi case, are probably the most helpful. I think in the abstract this problem is too hard to solve, but with examples we can probably come to some consensus.<div class="im HOEnZb">
<br>
<br>
-Sean<br>
<br>
-- <br>
Sean Dague<br>
<a href="http://dague.net" target="_blank">http://dague.net</a><br>
<br></div><div class="HOEnZb"><div class="h5">
_______________________________________________<br>
OpenStack-dev mailing list<br>
<a href="mailto:OpenStack-dev@lists.openstack.org" target="_blank">OpenStack-dev@lists.openstack.org</a><br>
<a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev</a><br>
</div></div></blockquote></div><br></div>