[openstack-dev] [all] Gate still backed up - need assistance with nova-network logging enhancements

Davanum Srinivas davanum at gmail.com
Thu Jun 12 15:41:20 UTC 2014


Hey Matt,

There is a connection pool in
https://github.com/boto/boto/blob/develop/boto/connection.py which
could be causing issues...
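
To make the concern concrete, here is a stripped-down sketch of the
caching pattern in boto's connection.py (illustrative only -- this is
not boto's actual code, and SimplePool is a made-up name): a pooled
connection can go stale on the server side, and the next caller that
pulls it from the pool hangs until a socket timeout instead of
reconnecting cleanly.

# Simplified illustration of endpoint-keyed HTTP connection caching.
# Not boto's implementation; just the shape of the pattern.
import httplib  # Python 2, which is what boto runs on here
import time

class SimplePool(object):
    """Cache HTTP connections keyed by (host, port, is_secure)."""

    STALE_AFTER = 60.0  # seconds; anything older is closed, not reused

    def __init__(self):
        self._conns = {}

    def get(self, host, port, is_secure=False):
        key = (host, port, is_secure)
        entry = self._conns.pop(key, None)
        if entry is not None:
            conn, added = entry
            if time.time() - added < self.STALE_AFTER:
                # Reused connection: if the far end already dropped it,
                # the caller only finds out via a hang or socket timeout.
                return conn
            conn.close()
        cls = httplib.HTTPSConnection if is_secure else httplib.HTTPConnection
        return cls(host, port)

    def put(self, host, port, conn, is_secure=False):
        # Hand the connection back for the next request to the same endpoint.
        self._conns[(host, port, is_secure)] = (conn, time.time())

If the ec2 tests churn through a lot of short-lived requests,
connections parked in a pool like this are one place state could be
leaking between runs.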

-- dims

On Thu, Jun 12, 2014 at 10:50 AM, Matt Riedemann
<mriedem at linux.vnet.ibm.com> wrote:
>
>
> On 6/10/2014 5:36 AM, Michael Still wrote:
>>
>> https://review.openstack.org/99002 adds more logging to
>> nova/network/manager.py, but I think you're not going to love the
>> debug log level. Was this the sort of thing you were looking for
>> though?
>>
>> Michael
>>
>> On Mon, Jun 9, 2014 at 11:45 PM, Sean Dague <sean at dague.net> wrote:
>>>
>>> Based on some back-of-the-envelope math the gate is processing roughly 2
>>> changes an hour and failing one of them. So if you want to know how long
>>> the gate will take to drain, divide the queue length by 2 to get hours.
>>>
>>> Right now we're doing a lot of revert roulette, trying to revert things
>>> that we think landed about the time things went bad. I call this
>>> roulette because in many cases the actual issue isn't well understood. A
>>> key reason for this is:
>>>
>>> *nova network is a blackhole*
>>>
>>> There is no work-unit logging in nova-network, and no attempt to verify
>>> that the commands it runs actually did anything. Most of the failures we
>>> don't have a good understanding of come down to the network not working
>>> under nova-network.
>>>
>>> So we could *really* use a volunteer or two to prioritize getting that
>>> into nova-network. Without it we might manage to turn down the failure
>>> rate by reverting things (or we might not) but we won't really know why,
>>> and we'll likely be here again soon.
>>>
>>>          -Sean
>>>
>>> --
>>> Sean Dague
>>> http://dague.net
>>>
>>>
>>> _______________________________________________
>>> OpenStack-dev mailing list
>>> OpenStack-dev at lists.openstack.org
>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>>
>>
>>
>>
>
> I mentioned this in the nova meeting today as well, but the associated bug
> for the nova-network ssh timeout issue is bug 1298472 [1].
>
> My latest theory on that one is that there is a race/network leak in the
> ec2 third-party tests in Tempest, or in the ec2 API in nova itself,
> because I saw this [2] showing up in the n-net logs.  My thinking is that
> the tests or the API are not tearing down cleanly, so network resources
> eventually leak and we start hitting those timeouts.  It's just a theory
> at this point.  The ec2 third-party tests do run concurrently with the
> scenario tests, so things could be colliding there, but I haven't had time
> to dig into it, and I have very little experience with those tests or the
> ec2 API in nova.
>
> [1] https://bugs.launchpad.net/tempest/+bug/1298472
> [2] http://goo.gl/6f1dfw
>
> --
>
> Thanks,
>
> Matt Riedemann
>
>
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
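
On the logging side: a rough sketch of the kind of work-unit logging
and command verification Sean is asking for above (hypothetical helper
names, not the patch in https://review.openstack.org/99002):

import functools
import logging
import time

LOG = logging.getLogger(__name__)

def log_work_unit(func):
    """Log entry, exit, duration and failure of a network manager call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        LOG.debug("START %s args=%s kwargs=%s", func.__name__, args, kwargs)
        start = time.time()
        try:
            result = func(*args, **kwargs)
        except Exception:
            LOG.exception("FAILED %s after %.2fs",
                          func.__name__, time.time() - start)
            raise
        LOG.debug("END %s in %.2fs", func.__name__, time.time() - start)
        return result
    return wrapper

def run_and_verify(execute, *cmd, **kwargs):
    """Run a command and log what it returned, so the logs show whether
    it actually did anything.  `execute` is passed in to keep the sketch
    self-contained; in nova it would be the usual command runner
    (nova.utils.execute), which returns (stdout, stderr).
    """
    out, err = execute(*cmd, **kwargs)
    LOG.debug("ran %s stdout=%r stderr=%r",
              " ".join(str(c) for c in cmd), out, err)
    return out, err

Decorating the nova-network manager entry points (allocate_for_instance,
deallocate_for_instance, and friends) with something like log_work_unit
would at least give us a START/END trail to correlate against the ssh
timeouts, even if the log level ends up noisier than we'd like.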



-- 
Davanum Srinivas :: http://davanum.wordpress.com


