[openstack-dev] Can we get some sanity in the Neutron logs please?

Armando M. armamig at gmail.com
Tue Nov 10 19:39:29 UTC 2015


On 10 November 2015 at 11:12, Matt Riedemann <mriedem at linux.vnet.ibm.com>
wrote:

>
>
> On 11/10/2015 1:10 PM, Matt Riedemann wrote:
>
>>
>>
>> On 11/10/2015 12:51 PM, Armando M. wrote:
>>
>>>
>>>
>>> On 10 November 2015 at 10:33, Matt Riedemann <mriedem at linux.vnet.ibm.com
>>> <mailto:mriedem at linux.vnet.ibm.com>> wrote:
>>>
>>>     Let me qualify by saying I'm not a Neutron person.
>>>
>>>     We know that gate-tempest-dsvm-neutron-full is failing hard as of
>>>     the last 24 hours [1].
>>>
>>>     An error that's been showing up in tempest runs with neutron a lot
>>> is:
>>>
>>>     "AssertionError: 0 == 0 : No IPv4 addresses found in: []"
>>>
>>>     Checking logstash [2], it's hitting a lot. The hits are only recent
>>>     because that failure message is new to Tempest in the last day or
>>>     so, but there are already a lot of them, so whatever the cause is,
>>>     it's failing frequently.
>>>
>>>     So the next step is usually digging into service logs looking for
>>>     errors. I check the q-svc logs first. Not many errors but a
>>>     bazillion warnings for things not found (networks and devices). [3]
>>>
>>>     For example:
>>>
>>>     2015-11-10 17:13:02.542 WARNING neutron.plugins.ml2.rpc
>>>     [req-15a73753-1512-4689-9404-9658a0cd0c09 None None] Device
>>>     aaa525be-14eb-44a5-beb0-ed722896be93 requested by agent
>>>     ovs-agent-devstack-trusty-rax-iad-5785199 not found in database
>>>
>>>     2015-11-10 17:14:17.754 WARNING neutron.api.rpc.handlers.dhcp_rpc
>>>     [req-3d7e9848-6151-4780-907f-43f11a2a8545 None None] Network
>>>     b07ad9b2-e63e-4459-879d-3721074704e5 could not be found, it might
>>>     have been deleted concurrently.
>>>
>>>     Are several hundred of these warnings useful to an operator trying
>>>     to debug a problem? The point of the CI gate testing is to try and
>>>     simulate a production cloud environment. When something goes wrong,
>>>     you check the logs. With the amount of warning/error level logging
>>>     that is in the neutron logs, finding a real problem is like looking
>>>     for a needle in a haystack. Since everything is async, 404s are
>>>     expected when racing to delete a resource and they should be handled
>>>     gracefully.
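>>>
>>>     As a rough sketch (made-up names, not actual Neutron code), handling
>>>     that expected race quietly might look something like:
>>>
>>>         import logging
>>>
>>>         LOG = logging.getLogger(__name__)
>>>
>>>         class NetworkNotFound(Exception):
>>>             """Stand-in for the plugin's not-found exception."""
>>>
>>>         def get_network_for_agent(plugin, context, network_id):
>>>             try:
>>>                 return plugin.get_network(context, network_id)
>>>             except NetworkNotFound:
>>>                 # Expected when an agent races with a concurrent
>>>                 # delete; nothing an operator can act on, so log at
>>>                 # debug level instead of warning.
>>>                 LOG.debug("Network %s not found, probably deleted "
>>>                           "concurrently", network_id)
>>>                 return None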
>>>
>>>     Anyway, the server log isn't useful, so I go digging in the agent
>>>     logs, where stacktraces are aplenty. [4]
>>>
>>>     Particularly this:
>>>
>>>     "Exception: Port tapcea51630-e1 is not ready, resync needed"
>>>
>>>     That's due to a new change landing in the last 24 hours [5]. But the
>>>     trace shows up over 16K times since it landed [6].
>>>
>>>     Checking the code, it's basically a loop processing events; when it
>>>     hits an event it can't handle, it punts, breaking the loop so the
>>>     events after it never get processed (which is a bug), and the code
>>>     that eventually handles the failure just catches all Exception and
>>>     traces it out, assuming it's really bad.
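>>>
>>>     A sketch of what handling each event on its own could look like
>>>     (made-up names, not the actual agent code, just the shape of it),
>>>     so one bad event doesn't sink the rest of the batch:
>>>
>>>         import logging
>>>
>>>         LOG = logging.getLogger(__name__)
>>>
>>>         def process_events(events, handlers):
>>>             needs_resync = []
>>>             for event in events:
>>>                 handler = handlers.get(event["type"])
>>>                 if handler is None:
>>>                     # Unknown event: remember it for a resync instead
>>>                     # of raising and dropping the remaining events.
>>>                     needs_resync.append(event)
>>>                     continue
>>>                 try:
>>>                     handler(event)
>>>                 except Exception:
>>>                     # Scoped to a single event; the rest of the batch
>>>                     # still gets processed.
>>>                     LOG.exception("Failed to process event %s",
>>>                                   event.get("id"))
>>>                     needs_resync.append(event)
>>>             return needs_resync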
>>>
>>>     At this point, as a non-neutron person, i.e. not well versed in the
>>>     operations of neutron or how to debug it in great detail, I assume
>>>     something is bad here but I don't really know - and the logs are so
>>>     full of noise that I can't distinguish real failures.
>>>
>>>     I don't mean to pick on this particular change, but it's a good
>>>     example of a recent thing.
>>>
>>>     I'd like to know if this is all a known issue or WIP-type stuff. I've
>>>     complained about excessively noisy neutron logs in channel before,
>>>     and I'm usually told that they are either necessary (for whatever
>>>     reason) or that rather than complain about the verbosity, I should
>>>     fix the race that is causing it - which is not likely to happen,
>>>     since I don't have the async rpc-happy nature of everything in
>>>     neutron in my head to debug it (I doubt many do).
>>>
>>>
>>> I am not sure that's a fair statement: we usually point out that just
>>> lowering log levels doesn't really solve the underlying issue
>>> (whatever it may be), and that comment really should apply to any
>>> project, not just Neutron. That said, we have had examples where we
>>> took your input and drove the right fix ourselves.
>>>
>>> We have a 'logging' tag for Neutron bugs that we use to identify this
>>> type of cleanup. We'd need your attention to detail to alert us to
>>> issues like these; we'll take care of the right fixes. Currently, the
>>> queue is pretty dry. If you can top it up, that'll be great. Going off
>>> on a log cleanup rampage doesn't seem like the best course of action;
>>> I'd rather knock issues off one by one as they come, like the one you
>>> just mentioned.
>>>
>>> [1] https://bugs.launchpad.net/neutron/+bugs?field.tag=logging
>>>
>>
>> Tagged:
>>
>> https://bugs.launchpad.net/nova/+bug/1514935
>>
>> Although it must not be an official tag since it didn't auto-complete
>> for me, that should be fixed.
>>
>
> Crap, never mind that, my bad - I had created the bug against nova (force
> of habit) rather than neutron. Confirmed that 'logging' is a neutron tag in
> LP.


Ah, gotcha!


>
>
>
>>
>>>
>>>     Anyway, this is a plea for sanity in the logs. There are logging
>>>     guidelines for openstack [7]. Let's please abide by them. Let's keep
>>>     operators in mind when we're looking at logs and be proactive about
>>>     making them useful (which includes more granular error handling and
>>>     less global try/except Exception: LOG.exception constructs).
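>>>
>>>     Concretely, more granular handling might look something like this
>>>     rough sketch (made-up names, not actual Neutron code): treat the
>>>     expected, recoverable case differently from the truly unexpected
>>>     one instead of using one blanket except.
>>>
>>>         import logging
>>>
>>>         LOG = logging.getLogger(__name__)
>>>
>>>         class PortNotReady(Exception):
>>>             """Stand-in for an expected, recoverable condition."""
>>>
>>>         def bind_port(driver, port_id):
>>>             try:
>>>                 driver.bind(port_id)
>>>             except PortNotReady:
>>>                 # Expected and recoverable: note it quietly and let
>>>                 # the resync retry, no traceback needed.
>>>                 LOG.debug("Port %s not ready yet, resync will retry",
>>>                           port_id)
>>>                 return False
>>>             except Exception:
>>>                 # Reserve LOG.exception for genuinely unexpected
>>>                 # failures.
>>>                 LOG.exception("Unexpected failure binding port %s",
>>>                               port_id)
>>>                 raise
>>>             return True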
>>>
>>> Your point is duly noted. We have documented this, and we are being more
>>> meticulous during reviews.
>>>
>>>
>>>     [1] http://tinyurl.com/ne3ex4v
>>>     [2]
>>>
>>>
>>> http://logstash.openstack.org/#dashboard/file/logstash.json?query=message:%5C%22AssertionError:%200%20==%200%20:%20No%20IPv4%20addresses%20found%20in:%20%5B%5D%5C%22%20AND%20tags:%5C%22console%5C%22
>>>
>>>     [3]
>>>
>>>
>>> http://logs.openstack.org/85/239885/2/gate/gate-tempest-dsvm-neutron-full/602d864/logs/screen-q-svc.txt.gz?level=TRACE
>>>
>>>     [4]
>>>
>>>
>>> http://logs.openstack.org/85/239885/2/gate/gate-tempest-dsvm-neutron-full/602d864/logs/screen-q-agt.txt.gz?level=TRACE
>>>
>>>     [5] https://review.openstack.org/#/c/164880/
>>>     [6]
>>>
>>>
>>> http://logstash.openstack.org/#dashboard/file/logstash.json?query=message:%5C%22Exception:%20Port%5C%22%20AND%20message:%5C%22is%20not%20ready,%20resync%20needed%5C%22%20AND%20tags:%5C%22screen-q-agt.txt%5C%22
>>>
>>>     [7]
>>>
>>>
>>> http://specs.openstack.org/openstack/openstack-specs/specs/log-guidelines.html
>>>
>>>
>>>     --
>>>
>>>     Thanks,
>>>
>>>     Matt Riedemann
>>>
>>
> --
>
> Thanks,
>
> Matt Riedemann
>
>