<div dir="ltr"><div class="gmail_default" style="font-family:courier new,monospace"><br></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Wed, Oct 23, 2013 at 1:03 PM, Clark Boylan <span dir="ltr"><<a href="mailto:clark.boylan@gmail.com" target="_blank">clark.boylan@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">On Wed, Oct 23, 2013 at 11:20 AM, Sean Dague <<a href="mailto:sean@dague.net">sean@dague.net</a>> wrote:<br>
> One of the efforts that we're working on from the QA team is tooling that<br>
> ensures we aren't stack tracing into our test logs during normal tempest<br>
> runs. Random stack traces are scary to cloud admins consuming OpenStack<br>
> logs, and exceptions in the logs should really be exceptional events (and<br>
> indicative of a failing system), not something that we do by default. Our<br>
> intent is to gate code on clean logs (no stacktraces) eventually (i.e. if<br>
> you try to land a patch that causes stack traces in OpenStack, that becomes<br>
> a failing condition), and we've got an incremental whitelist-based approach<br>
> that should let us make forward progress on that. But on that thread -<br>
> <a href="http://lists.openstack.org/pipermail/openstack-dev/2013-October/017012.html" target="_blank">http://lists.openstack.org/pipermail/openstack-dev/2013-October/017012.html</a><br>
> we exposed another issue... across projects, OpenStack is very inconsistent<br>
> with logging.<br>
><br>
> First... baseline, these are the logging levels that we have in OpenStack<br>
> today (with numeric values, higher = worse):<br>
><br>
> CRITICAL = 50<br>
> FATAL = CRITICAL<br>
> ERROR = 40<br>
> WARNING = 30<br>
> WARN = WARNING<br>
> AUDIT = 21 # invented for oslo-logging<br>
> INFO = 20<br>
> DEBUG = 10<br>
> NOTSET = 0<br>
><br>
> We also have TRACE, which isn't a level per se; it's emitted at another<br>
> level. However, TRACE is typically an ERROR in the way we use it.<br>
><br>
><br>
> Some examples of oddities in the current system (all from a single<br>
> devstack/tempest run):<br>
><br>
> Example 1:<br>
> ==============<br>
><br>
> n-conductor log in tempest/devstack -<br>
> <a href="http://logs.openstack.org/70/52870/3/check/check-tempest-devstack-vm-full/f46b756/logs/screen-n-cond.txt.gz" target="_blank">http://logs.openstack.org/70/52870/3/check/check-tempest-devstack-vm-full/f46b756/logs/screen-n-cond.txt.gz</a><br>
><br>
> Total log lines: 84076<br>
> Total non DEBUG lines: 61<br>
><br>
> Question: do we need more than 1 level of DEBUG? A 3-orders-of-magnitude<br>
> jump in information between INFO -> DEBUG seems too steep a cliff.<br>
><br>
> Example 2:<br>
> ==============<br>
><br>
> ceilometer-collector -<br>
> <a href="http://logs.openstack.org/70/52870/3/check/check-tempest-devstack-vm-full/f46b756/logs/screen-ceilometer-collector.txt.gz" target="_blank">http://logs.openstack.org/70/52870/3/check/check-tempest-devstack-vm-full/f46b756/logs/screen-ceilometer-collector.txt.gz</a><br>
><br>
> AUDIT log level being used as DEBUG level (even though it's higher than<br>
> INFO).<br>
><br>
> 2013-10-23 12:24:20.093 26234 AUDIT ceilometer.pipeline [-] Flush pipeline<br>
> meter_pipeline<br>
> 2013-10-23 12:24:20.093 26234 AUDIT ceilometer.pipeline [-] Flush pipeline<br>
> cpu_pipeline<br>
> 2013-10-23 12:24:20.094 26234 AUDIT ceilometer.pipeline [-] Flush pipeline<br>
> meter_pipeline<br>
> 2013-10-23 12:24:20.094 26234 AUDIT ceilometer.pipeline [-] Flush pipeline<br>
> cpu_pipeline<br>
><br>
> (this is every second, for most seconds, for the entire run)<br>
><br>
> Example 3:<br>
> ===============<br>
><br>
> cinder-api -<br>
> <a href="http://logs.openstack.org/70/52870/3/check/check-tempest-devstack-vm-full/f46b756/logs/screen-c-api.txt.gz?level=ERROR" target="_blank">http://logs.openstack.org/70/52870/3/check/check-tempest-devstack-vm-full/f46b756/logs/screen-c-api.txt.gz?level=ERROR</a><br>
> ERROR level being used for 404s of volumes<br>
><br>
> Example 4:<br>
> ===============<br>
> glance-api -<br>
> <a href="http://logs.openstack.org/70/52870/3/check/check-tempest-devstack-vm-full/f46b756/logs/screen-g-api.txt.gz" target="_blank">http://logs.openstack.org/70/52870/3/check/check-tempest-devstack-vm-full/f46b756/logs/screen-g-api.txt.gz</a><br>
><br>
> 2013-10-23 12:23:27.436 23731 ERROR glance.store.sheepdog [-] Error in store<br>
> configuration: Unexpected error while running command.Command: collieExit<br>
> code: 127Stdout: ''Stderr: '/bin/sh: 1: collie: not found\n'<br>
> 2013-10-23 12:23:27.436 23731 WARNING glance.store.base [-] Failed to<br>
> configure store correctly: Store sheepdog could not be configured correctly.<br>
> Reason: Error in store configuration: Unexpected error while running<br>
> command.Command: collieExit code: 127Stdout: ''Stderr: '/bin/sh: 1: collie:<br>
> not found\n' Disabling add method.<br>
><br>
> part of every single Tempest / Devstack run, even though we aren't trying to<br>
> configure sheepdog in the gate<br>
><br>
><br>
> I think we can, and should do better, and started trying to brain dump into<br>
> this etherpad -<br>
> <a href="https://etherpad.openstack.org/p/icehouse-logging-harmonization" target="_blank">https://etherpad.openstack.org/p/icehouse-logging-harmonization</a> (examples<br>
> included).<br>
><br>
> This is one of those topics that I think our current 6 track summit model<br>
> doesn't make easy to address, as we really need general consensus across any<br>
> project that's using oslo-logging, so I believe the mailing list is the better<br>
> option, at least for now.<br>
><br>
><br>
> Goals - Short Term<br>
> ===============<br>
> As much feedback as possible from both core projects and openstack deployers<br>
> about the kinds of things that they believe we should be logging, and the<br>
> kinds of levels they think those things should land at.<br>
><br>
> Determining how crazy it is to try to harmonize this across services.<br>
><br>
> Figure out who else wants to help. Where help means:<br>
> * helping figure out what's already consensus in services<br>
> * helping figure out things that are really aberrant from that consensus<br>
> * helping build consensus with various core teams on a common approach<br>
> * helping with contributions to projects that are interested in<br>
> moving closer to that consensus<br>
><br>
> Determining if everyone just hates the idea, and I should give up now. :)<br>
> (That is a valid response to this RFC, feel free to put that out there).<br>
><br>
><br>
> Goals - Long Term<br>
> ===============<br>
> A set of guidelines on logging standards so that OpenStack as a whole feels<br>
> more unified when it comes to dealing with log data.<br>
><br>
> These are going to be guidelines, not rules. Some projects are always going<br>
> to have unique needs. But I suspect a lot of developers would be happy with<br>
> a cheat sheet that helps them figure out what log level different events<br>
> should be presented at.<br>
><br>
><br>
> Comments / rotten fruit are welcome. Also welcome is any additional POV<br>
> on this issue.<br>
><br>
> -Sean<br>
><br>
</div></div>This is great. In addition to all of this, I want to point out that<br>
the TRACE level is also an oslo-logging invention. In general I am not<br>
a fan of adding extra levels, since compatibility with python logging<br>
and associated tools would be a good thing, but I can see their<br>
usefulness if we need more granularity. Also, this isn't just for<br>
developers and QA; deployers of OpenStack would benefit tremendously<br>
from more consistent and useful logging. After throwing together<br>
logstash filters[1] for the OpenStack service logs produced by<br>
Devstack in the gate, I have found the logs to be quite diverse, which<br>
makes dealing with them in a centralized fashion more difficult. +1<br>
from me on guidelines that should drive more consistency.<br>
<br>
This got better during the Havana cycle, but we can definitely do more<br>
in Icehouse.<br>
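For reference, the stdlib already accommodates extra levels without breaking compatibility; a minimal sketch (illustrative only, not oslo's actual code) of registering AUDIT at the numeric value quoted above:<br>

```python
import logging

# Hypothetical sketch: register an AUDIT level at 21, matching the
# oslo-logging table quoted earlier. Standard handlers, filters, and
# level comparisons keep working because the record carries an
# ordinary numeric level.
AUDIT = 21
logging.addLevelName(AUDIT, "AUDIT")

logging.basicConfig(
    level=logging.INFO,
    format="%(levelname)s %(name)s [-] %(message)s",
)
logger = logging.getLogger("ceilometer.pipeline")

# Emits "AUDIT ceilometer.pipeline [-] Flush pipeline meter_pipeline"
logger.log(AUDIT, "Flush pipeline %s", "meter_pipeline")
```

Since AUDIT sits between INFO (20) and WARNING (30), any tool that filters by numeric level handles it transparently.<br>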
<br>
[1] <a href="https://git.openstack.org/cgit/openstack-infra/config/tree/modules/openstack_project/templates/logstash/indexer.conf.erb#n11" target="_blank">https://git.openstack.org/cgit/openstack-infra/config/tree/modules/openstack_project/templates/logstash/indexer.conf.erb#n11</a><br>
<span class="HOEnZb"><font color="#888888"><br>
Clark<br>
</font></span><div class="HOEnZb"><div class="h5"><br>
_______________________________________________<br>
OpenStack-dev mailing list<br>
<a href="mailto:OpenStack-dev@lists.openstack.org">OpenStack-dev@lists.openstack.org</a><br>
<a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev</a><br>
</div></div></blockquote></div><div class="gmail_default" style="font-family:'courier new',monospace"></div><div class="gmail_default" style="font-family:'courier new',monospace">So I touched on this a bit in my earlier post, but I want to reiterate and maybe clarify here. I agree that cleaning up and standardizing the logs is a good thing, and removing unhandled exception messages in particular would be good. What concerns me, however, is the approach being taken here of saying things like "Error level messages are banned from Tempest runs".</div>
<div class="gmail_default" style="font-family:'courier new',monospace"><br></div><div class="gmail_default" style="font-family:'courier new',monospace">The case I mentioned earlier of the negative test is a perfect example. There's no way for Cinder (or any other service) to know the difference between the end user requesting a non-existent volume and a valid volume that for some reason can't be found. I'm not sure how you enforce a definitive rule like "no error messages in logs" unless you structure your tests so that you never run negative tests.</div>
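To make the ambiguity concrete, here is a minimal sketch (with hypothetical names, not Cinder's actual code) of what a lookup like this has to do:

```python
import logging

LOG = logging.getLogger(__name__)


class VolumeNotFound(Exception):
    """Hypothetical stand-in for the service's NotFound exception."""


def get_volume(volume_id, db):
    # From inside the service, both failure modes look identical:
    # the id simply isn't in the database. Whether the user
    # fat-fingered the id (a negative test) or a real volume
    # vanished (a genuine failure) is unknowable here, so the
    # choice of log level is a policy decision, not a code fix.
    volume = db.get(volume_id)
    if volume is None:
        LOG.error("Volume %s could not be found.", volume_id)
        raise VolumeNotFound(volume_id)
    return volume
```

Any blanket "no ERROR lines" gate would flag this path on every negative test, even though the service behaved correctly.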
<div class="gmail_default" style="font-family:'courier new',monospace"><br></div><div class="gmail_default" style="font-family:'courier new',monospace">There are other cases in Cinder that I'm concerned about as well. One example is iSCSI target creation; there are a number of scenarios where this can fail under certain conditions. In most of these cases we now have retry mechanisms or alternate implementations to complete the task. The fact remains, however, that a call somewhere in the system failed, and in my opinion that should stand out in the logs. Maybe this particular case would be better suited to a warning rather than an error, and that's fine. My point, though, is that some thought needs to go into this before making blanket rules, and especially gating criteria, that say "no error messages in logs".</div>
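The pattern I'd argue for looks roughly like this (a sketch with hypothetical names, not our actual target driver code): transient failures surface at WARNING so they still stand out, and only exhausting all retries is an ERROR.

```python
import logging

LOG = logging.getLogger(__name__)


def create_target_with_retry(create, attempts=3):
    """Retry a flaky operation, logging each transient failure.

    `create` is a hypothetical callable that raises on failure and
    returns a target identifier on success.
    """
    for attempt in range(1, attempts + 1):
        try:
            return create()
        except Exception:
            if attempt == attempts:
                # Every fallback is exhausted; this genuinely failed.
                LOG.error("Target creation failed after %d attempts.",
                          attempts)
                raise
            # The call failed but we have a recovery path: note it
            # loudly without implying the operation itself failed.
            LOG.warning("Target creation attempt %d failed; retrying.",
                        attempt)
```

With this split, an ERROR line in the logs always means something actually broke, while the WARNINGs preserve the record that a call in the system failed along the way.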
<div class="gmail_default" style="font-family:'courier new',monospace"><br></div><div class="gmail_default" style="font-family:'courier new',monospace">John</div><br></div></div>