Open Stack

Tue Jan 21 22:37:05 UTC 2014

I have some hints which the people looking at neutron failures might find
useful.

# 1 - in [1] a weird thing happens with DHCP. A DHCPDISCOVER for
fa:16:3e:cc:d9:c7 is received pretty much simultaneously by two dnsmasq
instances, which are listening on ports belonging to two distinct networks.
Looking at the agent logs as well, this apparently happens because both
DHCP ports and the VM VIF port are plugged into br-int at the same time and
none of them has been wired by the ovs agent. It seems that VIF plugging,
as performed by both nova and the agents, does not disable those VIFs by
default, so this is a likely explanation. The resulting effect is probably
that one DHCP server cancels the offer sent by the other, thus resulting in
no IP being configured in the VM.
A corollary is that DHCP may so far have worked only by chance in several
cases, because the DISCOVER message was sent before the ports were wired.
So fixing this bug might spur a new set of timeout errors, especially if we
consider that neutron now does not create the dhcp port until a port is
created on the subnet, meaning that the wiring of the DHCP port is likely
to happen after that of the VIF.
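
To spot the symptom from # 1 in a syslog dump, something like the sketch
below could help. It groups DHCPDISCOVER entries by client MAC and reports
the MACs seen by more than one dnsmasq instance. The log line shape
("dnsmasq-dhcp[PID]: DHCPDISCOVER(IFACE) [IP] MAC") is an assumption about
how dnsmasq logs these messages, and the function names are made up for
illustration; adjust the regex to whatever the actual syslog contains.

```python
import re
from collections import defaultdict

# Assumed log shape: "dnsmasq-dhcp[PID]: DHCPDISCOVER(IFACE) [IP] MAC".
# The optional token between the interface and the MAC covers lines where
# dnsmasq also logs a requested IP address.
DISCOVER_RE = re.compile(
    r"dnsmasq-dhcp\[(?P<pid>\d+)\]: DHCPDISCOVER\((?P<iface>[^)]+)\)"
    r"(?: \S+)*? (?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2})",
    re.IGNORECASE,
)


def discovers_by_mac(lines):
    """Map each client MAC to the set of (pid, iface) pairs that logged it."""
    seen = defaultdict(set)
    for line in lines:
        match = DISCOVER_RE.search(line)
        if match:
            key = match.group("mac").lower()
            seen[key].add((match.group("pid"), match.group("iface")))
    return seen


def macs_seen_by_multiple_servers(lines):
    """Return only the MACs whose DISCOVER reached more than one dnsmasq."""
    return {
        mac: sorted(servers)
        for mac, servers in discovers_by_mac(lines).items()
        if len(servers) > 1
    }
```

Feeding it the decompressed lines of the syslog file should list
fa:16:3e:cc:d9:c7 together with the two dnsmasq instances, if the log
format matches the assumption above.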

# 2 - the latest tempest changes are leaving resources behind - look at the
bottom of [2]. Armax has added a check to our CI to verify this; for
upstream jobs, this probably means more load on the system, and a higher
chance of timeouts and other non-deterministic failures.

# 3 - Still in [2] you will notice that the VM has not yet configured
networking when the timeout expires. By correlating the timestamp at which
the VM acquires the clock with the time elapsed since boot at that instant,
it is possible to infer that the VM did nothing for about 34 seconds
after becoming active; this causes the job to always time out and Armax has
whitelisted this bug in mine sweeper. This problem however might be
exclusive to our CI (which in this instance uses the libvirt/kvm virt
driver).
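
The inference in # 3 is plain timestamp arithmetic, roughly as sketched
below. This is my reading of the reasoning rather than a tool anyone ran:
the function name and its three inputs (the time the instance became
ACTIVE, the wall-clock time at which the guest set its clock, and the guest
uptime in seconds at that instant) are hypothetical, and the real values
have to be read off the console log in [2].

```python
from datetime import datetime, timedelta


def guest_idle_gap(active_at, clock_set_at, uptime_at_clock_set):
    """Seconds between the instance going ACTIVE and the guest starting to boot.

    active_at: wall-clock time at which the instance was reported ACTIVE.
    clock_set_at: wall-clock time at which the guest acquired its clock.
    uptime_at_clock_set: guest seconds-since-boot at that same instant.
    """
    # The guest started booting this long before it set its clock.
    boot_started_at = clock_set_at - timedelta(seconds=uptime_at_clock_set)
    # Whatever remains between ACTIVE and boot start is time spent idle.
    return (boot_started_at - active_at).total_seconds()
```

A gap in the region of 34 seconds, as observed here, would already eat a
large share of the time the test allows for the VM to get its address.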

Regards,
Salvatore

[1]
http://logs.openstack.org/19/67919/2/check/check-tempest-dsvm-neutron-isolated/ddaf7c5/logs/syslog.txt.gz
[2] http://81.156.166.125/67719/1/686/console.txt.gz

On 21 January 2014 15:52, Russell Bryant <rbryant at redhat.com> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 01/21/2014 07:14 AM, Sean Dague wrote:
> > Brief update on where we stand on the gate (still not great) - gate
> > is currently 126 deep - top of queue entered 51hrs ago
> >
> > Bug 1270680 - v3 extensions api inherently racey wrt instances -
> > patch landed (seems to have helped though the exception is still
> > showing up quite a bit, so don't know if this is 100% fixed)
> >
> > - Thanks to Russell, Dan Smith, and Chris Yeoh for diving in here.
>
> The workaround we merged didn't catch all cases of this bug.  I have
> another patch to get the rest.  We should promote this to the front of
> the gate queue.
>
> https://review.openstack.org/68147
>
> - --
> Russell Bryant
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1
> Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
>
> iEYEARECAAYFAlLel84ACgkQFg9ft4s9SAbxJACdGTzWShYGdIOPNVg+UsR4eaS4
> PBIAnjoByv0u5irwhEPSmx5SF18aL2nF
> =2k2f
> -----END PGP SIGNATURE-----

[openstack-dev] Top Gate Reseting issues that need attention - Tuesday morning update
