[openstack-dev] Can we get some sanity in the Neutron logs please?

Matt Riedemann mriedem at linux.vnet.ibm.com
Tue Nov 10 18:33:47 UTC 2015


Let me qualify by saying I'm not a Neutron person.

We know that gate-tempest-dsvm-neutron-full has been failing hard over 
the last 24 hours [1].

An error that's been showing up in tempest runs with neutron a lot is:

"AssertionError: 0 == 0 : No IPv4 addresses found in: []"

Checking logstash [2], it's hitting a lot. The hits only go back about a 
day because that failure message is new to Tempest, but there are a lot 
of them, so whatever the cause is, it's failing frequently.

So the next step is the usual one: dig into the service logs looking for 
errors. I check the q-svc log first. Not many errors, but a bazillion 
warnings about things not being found (networks and devices). [3]

For example:

2015-11-10 17:13:02.542 WARNING neutron.plugins.ml2.rpc 
[req-15a73753-1512-4689-9404-9658a0cd0c09 None None] Device 
aaa525be-14eb-44a5-beb0-ed722896be93 requested by agent 
ovs-agent-devstack-trusty-rax-iad-5785199 not found in database

2015-11-10 17:14:17.754 WARNING neutron.api.rpc.handlers.dhcp_rpc 
[req-3d7e9848-6151-4780-907f-43f11a2a8545 None None] Network 
b07ad9b2-e63e-4459-879d-3721074704e5 could not be found, it might have 
been deleted concurrently.

Are several hundred of these warnings useful to an operator trying to 
debug a problem? The point of the CI gate testing is to simulate a 
production cloud environment: when something goes wrong, you check the 
logs. With the amount of warning/error-level logging in the neutron 
logs, finding a real problem is like looking for a needle in a haystack. 
Since everything is async, 404s are expected when racing to delete a 
resource, and they should be handled gracefully.
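
To make that concrete, here is a minimal sketch (not actual Neutron 
code; the exception class and the plugin call are stand-ins) of what 
handling that expected race gracefully could look like on the server 
side:

    import logging

    LOG = logging.getLogger(__name__)


    class NetworkNotFound(Exception):
        """Stand-in for neutron's NetworkNotFound exception."""


    def get_network_for_agent(plugin, context, network_id):
        """Fetch a network for an agent, tolerating concurrent deletes."""
        try:
            return plugin.get_network(context, network_id)
        except NetworkNotFound:
            # The network was deleted while the RPC was in flight. In an
            # async system that's an expected race, not something an
            # operator needs to act on, so note it at debug level.
            LOG.debug("Network %s not found; it was probably deleted "
                      "concurrently.", network_id)
            return None

A 404 that the code can fully account for doesn't need to show up at 
warning level at all.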

Anyway, the server log isn't useful, so I go digging in the agent logs, 
where stack traces abound. [4]

Particularly this:

"Exception: Port tapcea51630-e1 is not ready, resync needed"

That's due to a new change that landed in the last 24 hours [5], but the 
trace has shown up over 16K times since then [6].

Checking the code, it's basically a loop processing events. When it hits 
an event it can't handle, it punts by raising an exception, which breaks 
out of the loop so the events after it never get processed (a bug in 
itself), and the code that eventually handles the exception just catches 
Exception and logs a traceback, assuming every failure is really bad.
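
For illustration, the shape of the pattern I'm describing is roughly 
this (a simplified sketch, not the actual agent code):

    import logging

    LOG = logging.getLogger(__name__)


    class PortNotReady(Exception):
        """Stand-in for the 'Port ... is not ready, resync needed' case."""


    def process_events(events, handle_event):
        for event in events:
            # One not-ready port raises here, which aborts the whole
            # batch; the events after it never get processed.
            handle_event(event)


    def daemon_loop(events, handle_event):
        try:
            process_events(events, handle_event)
        except Exception:
            # Everything, expected or not, ends up as a full traceback in
            # the log, even conditions that only mean "retry on the next
            # resync".
            LOG.exception("Error while processing events")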

At this point, as a non-Neutron person - i.e. not well versed in how 
Neutron operates or how to debug it in great detail - I assume something 
is bad here, but I don't really know, and the logs are so full of noise 
that I can't distinguish the real failures.

I don't mean to pick on this particular change, but it's a good, recent 
example of the problem.

I'd like to know if this is all known-issue or WIP-type stuff. I've 
complained about excessively noisy neutron logs in channel before, and 
I'm usually told either that they are necessary (for whatever reason) or 
that, rather than complain about the verbosity, I should fix the race 
that is causing it - which is not likely to happen, since I don't have 
the async-RPC-happy nature of everything in neutron in my head to debug 
it (I doubt many do).

Anyway, this is a plea for sanity in the logs. There are logging 
guidelines for OpenStack [7]; let's please abide by them. Let's keep 
operators in mind when we're looking at logs and be proactive about 
making them useful, which includes more granular error handling and 
fewer global try/except Exception: LOG.exception constructs.
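
As a hypothetical sketch of the kind of granularity I mean, applied to 
an event loop like the agent's:

    import logging

    LOG = logging.getLogger(__name__)


    class PortNotReady(Exception):
        """Stand-in for a recoverable, expected per-port condition."""


    def process_events(events, handle_event):
        needs_resync = []
        for event in events:
            try:
                handle_event(event)
            except PortNotReady as exc:
                # Expected and recoverable: note it quietly, remember the
                # event for the next resync, and keep going with the batch.
                LOG.debug("Deferring event, resync needed: %s", exc)
                needs_resync.append(event)
            except Exception:
                # Genuinely unexpected: this is where a traceback earns
                # its place in the log.
                LOG.exception("Unexpected error processing event %s", event)
        return needs_resync

One expected failure no longer aborts the rest of the batch, and a 
traceback in the log actually means something went wrong.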

[1] http://tinyurl.com/ne3ex4v
[2] http://logstash.openstack.org/#dashboard/file/logstash.json?query=message:%5C%22AssertionError:%200%20==%200%20:%20No%20IPv4%20addresses%20found%20in:%20%5B%5D%5C%22%20AND%20tags:%5C%22console%5C%22
[3] http://logs.openstack.org/85/239885/2/gate/gate-tempest-dsvm-neutron-full/602d864/logs/screen-q-svc.txt.gz?level=TRACE
[4] http://logs.openstack.org/85/239885/2/gate/gate-tempest-dsvm-neutron-full/602d864/logs/screen-q-agt.txt.gz?level=TRACE
[5] https://review.openstack.org/#/c/164880/
[6] http://logstash.openstack.org/#dashboard/file/logstash.json?query=message:%5C%22Exception:%20Port%5C%22%20AND%20message:%5C%22is%20not%20ready,%20resync%20needed%5C%22%20AND%20tags:%5C%22screen-q-agt.txt%5C%22
[7] http://specs.openstack.org/openstack/openstack-specs/specs/log-guidelines.html

-- 

Thanks,

Matt Riedemann



