[openstack-dev] [all] Gate still backed up - need assistance with nova-network logging enhancements

Matt Riedemann mriedem at linux.vnet.ibm.com
Thu Jun 12 21:22:10 UTC 2014



On 6/12/2014 10:41 AM, Davanum Srinivas wrote:
> Hey Matt,
>
> There is a connection pool in
> https://github.com/boto/boto/blob/develop/boto/connection.py which
> could be causing issues...
>
> -- dims
>
> On Thu, Jun 12, 2014 at 10:50 AM, Matt Riedemann
> <mriedem at linux.vnet.ibm.com> wrote:
>>
>>
>> On 6/10/2014 5:36 AM, Michael Still wrote:
>>>
>>> https://review.openstack.org/99002 adds more logging to
>>> nova/network/manager.py, but I think you're not going to love the
>>> debug log level. Was this the sort of thing you were looking for
>>> though?
>>>
>>> Michael
>>>
>>> On Mon, Jun 9, 2014 at 11:45 PM, Sean Dague <sean at dague.net> wrote:
>>>>
>>>> Based on some back-of-the-envelope math, the gate is processing roughly
>>>> 2 changes an hour and failing one of them. So if you want to know how
>>>> long the gate will take, divide the queue length by 2 to get hours.
>>>>
>>>> Right now we're doing a lot of revert roulette, trying to revert things
>>>> that we think landed about the time things went bad. I call this
>>>> roulette because in many cases the actual issue isn't well understood. A
>>>> key reason for this is:
>>>>
>>>> *nova network is a blackhole*
>>>>
>>>> There is no work-unit logging in nova-network, and no verification that
>>>> the commands it runs actually did anything. Most of the failures we don't
>>>> have a good understanding of come down to the network not working under
>>>> nova-network.
>>>>
>>>> So we could *really* use a volunteer or two to prioritize getting that
>>>> into nova-network. Without it we might manage to turn down the failure
>>>> rate by reverting things (or we might not), but we won't really know why,
>>>> and we'll likely be here again soon.
>>>>
>>>>           -Sean
>>>>
>>>> --
>>>> Sean Dague
>>>> http://dague.net
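
For whoever picks this up, here is roughly the kind of work-unit logging and
verification I think Sean is describing above.  This is only a sketch; the
helper names below are made up and this is not what review 99002 does.

    # Hypothetical sketch only: wrap each nova-network "work unit" so we log
    # start/end with a correlation id, and verify the command actually had an
    # effect instead of assuming it worked.
    import contextlib
    import uuid

    from nova.openstack.common import log as logging
    from nova import utils

    LOG = logging.getLogger(__name__)


    @contextlib.contextmanager
    def work_unit(name, **kwargs):
        unit_id = uuid.uuid4().hex[:8]
        LOG.debug("work unit %(id)s start: %(name)s %(kwargs)s",
                  {'id': unit_id, 'name': name, 'kwargs': kwargs})
        try:
            yield unit_id
        except Exception:
            LOG.exception("work unit %(id)s failed: %(name)s",
                          {'id': unit_id, 'name': name})
            raise
        LOG.debug("work unit %(id)s done: %(name)s",
                  {'id': unit_id, 'name': name})


    def ensure_bridge_logged(bridge):
        # Example: run the brctl command inside a work unit, then check that
        # the bridge really exists before declaring success.
        with work_unit('ensure_bridge', bridge=bridge):
            utils.execute('brctl', 'addbr', bridge, run_as_root=True)
            out, _err = utils.execute('ip', 'link', 'show', bridge,
                                      check_exit_code=False)
            if bridge not in out:
                raise RuntimeError('bridge %s was not created' % bridge)

With something like that in place, a failed ssh check in the gate could at
least be matched against the work units that ran for that instance's network.
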
>>>>
>>>>
>>
>> I mentioned this in the nova meeting today as well, but the associated bug
>> for the nova-network ssh timeout issue is bug 1298472 [1].
>>
>> My latest theory on that one is that there is a race or a network resource
>> leak in the ec2 third-party tests in Tempest, or in the ec2 API in nova
>> itself, because I saw this [2] showing up in the n-net logs.  My thinking
>> is that the tests or the API are not tearing down cleanly, so network
>> resources eventually leak and we start hitting those ssh timeouts.  It's
>> just a theory at this point, but the ec2 third-party tests do run
>> concurrently with the scenario tests, so things could be colliding there.
>> I haven't had time to dig into it, and I have very little experience with
>> those tests or the ec2 API in nova.
>>
>> [1] https://bugs.launchpad.net/tempest/+bug/1298472
>> [2] http://goo.gl/6f1dfw
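
If someone wants to poke at that leak theory, something like the following
run against a devstack node during a tempest run would show whether floating
IPs pile up without ever being attached to anything.  Just a sketch; the
credentials and endpoint are placeholders for whatever the local devstack
uses.

    # Watch for floating IPs that stay allocated but never get associated
    # with an instance, which is what a teardown leak would look like.
    import time

    from novaclient.v1_1 import client

    nova = client.Client('admin', 'password', 'admin',
                         'http://127.0.0.1:5000/v2.0')

    while True:
        fips = nova.floating_ips.list()
        orphaned = [fip for fip in fips if fip.instance_id is None]
        print('%d floating ips allocated, %d not attached to an instance'
              % (len(fips), len(orphaned)))
        time.sleep(30)
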
>>
>> --
>>
>> Thanks,
>>
>> Matt Riedemann
>>
>>

mtreinish also pointed out that the nightly periodic job that runs tempest
with nova-network and without tenant isolation is failing because it is going
over quota on floating IPs [1].  That job is also hitting security group rule
failures [2], which are possibly related.

[1] 
http://logs.openstack.org/periodic-qa/periodic-tempest-dsvm-full-non-isolated-master/b92b844/console.html#_2014-06-12_08_02_55_875
[2] 
http://logs.openstack.org/periodic-qa/periodic-tempest-dsvm-full-non-isolated-master/b92b844/console.html#_2014-06-12_08_02_56_623
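
If anyone wants to confirm the over-quota theory on that job, comparing the
tenant's floating IP quota with what is actually allocated at the time of the
failure should tell us whether the tests are leaking addresses or the quota
is just too small for the non-isolated run.  Another novaclient sketch, with
placeholder credentials and tenant id:

    # Compare the floating ip quota for the shared tenant against what is
    # currently allocated.
    from novaclient.v1_1 import client

    nova = client.Client('demo', 'password', 'demo',
                         'http://127.0.0.1:5000/v2.0')

    tenant_id = '<shared tenant id>'  # placeholder
    quota = nova.quotas.get(tenant_id).floating_ips
    allocated = len(nova.floating_ips.list())
    print('floating ip quota: %s, allocated: %s' % (quota, allocated))
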

-- 

Thanks,

Matt Riedemann



