[openstack-dev] [all] Gate still backed up - need assistance with nova-network logging enhancements

Matt Riedemann mriedem at linux.vnet.ibm.com
Thu Jun 12 14:50:20 UTC 2014



On 6/10/2014 5:36 AM, Michael Still wrote:
> https://review.openstack.org/99002 adds more logging to
> nova/network/manager.py, but I think you're not going to love the
> debug log level. Was this the sort of thing you were looking for
> though?
>
> Michael
>
> On Mon, Jun 9, 2014 at 11:45 PM, Sean Dague <sean at dague.net> wrote:
>> Based on some back-of-the-envelope math, the gate is basically processing 2
>> changes an hour and failing one of them. So if you want to know how long
>> the gate will take, divide the queue length by 2 to get hours.
>>
>> Right now we're doing a lot of revert roulette, trying to revert things
>> that we think landed about the time things went bad. I call this
>> roulette because in many cases the actual issue isn't well understood. A
>> key reason for this is:
>>
>> *nova network is a blackhole*
>>
>> There is no work unit logging in nova-network, and no attempted
>> verification that the commands it ran actually did anything. Most of the
>> failures that we don't have a good understanding of are cases of the
>> network not working under nova-network.
>>
>> So we could *really* use a volunteer or two to prioritize getting that
>> into nova-network. Without it we might manage to turn down the failure
>> rate by reverting things (or we might not) but we won't really know why,
>> and we'll likely be here again soon.
>>
>>          -Sean
>>
>> --
>> Sean Dague
>> http://dague.net
>>

I mentioned this in the nova meeting today as well, but the associated bug
for the nova-network ssh timeout issue is bug 1298472 [1].

My latest theory on that one is that there could be a race/network leak
in the ec2 third-party tests in Tempest, or in the ec2 API in nova
itself, because I saw this [2] showing up in the n-net logs.  My
thinking is that the tests or the API are not tearing down cleanly, so
network resources eventually leak and we start hitting those timeouts.
It's just a theory at this point, but the ec2 third-party tests do run
concurrently with the scenario tests, so things could be colliding
there.  I haven't had time to dig into it yet, and I have very little
experience with those tests or with the ec2 API in nova.

[1] https://bugs.launchpad.net/tempest/+bug/1298472
[2] http://goo.gl/6f1dfw
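
On Sean's point about work unit logging, here's roughly the shape of
thing I had in mind for nova/network/manager.py.  This is just a rough
sketch -- the names (log_work_unit, run_and_verify) are made up and a
real patch would use nova's own logging and command wrappers rather
than the stdlib -- but it shows the two pieces we're missing: tagging
the work a manager method is doing, and checking that the commands it
runs actually took effect:

    # Rough sketch only -- made-up helper names, not actual nova code.
    import functools
    import logging
    import subprocess
    import uuid

    LOG = logging.getLogger(__name__)

    def log_work_unit(func):
        """Log entry/exit of a manager method so gate failures can be
        correlated with what nova-network was doing at the time."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            unit = uuid.uuid4().hex[:8]
            LOG.debug("work unit %s: start %s", unit, func.__name__)
            try:
                result = func(*args, **kwargs)
                LOG.debug("work unit %s: finished %s", unit, func.__name__)
                return result
            except Exception:
                LOG.exception("work unit %s: %s failed", unit, func.__name__)
                raise
        return wrapper

    def run_and_verify(cmd, verify=None):
        """Run a network command, log its output, and optionally check
        a post-condition so we know the command actually did something."""
        LOG.debug("running: %s", " ".join(cmd))
        output = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
        LOG.debug("output from %s: %r", cmd[0], output)
        if verify is not None and not verify():
            LOG.warning("command succeeded but post-condition check "
                        "failed: %s", " ".join(cmd))
        return output

The idea would be to decorate things like allocate_for_instance /
deallocate_for_instance and wrap the iptables/dnsmasq calls, so a
failed run at least tells us which work unit it belonged to and whether
the rule or lease actually showed up afterwards.  If someone picks this
up I'm happy to review whatever form it actually takes.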

-- 

Thanks,

Matt Riedemann



