<div dir="ltr">I think you're right, Darragh.<div><br></div><div>Montreal's snow and cold must have been freezing my brain: I investigated the same issue a while ago and tried to change cirrOS to send a DHCPDISCOVER every 10 seconds instead of every 60 seconds, but then I moved on to something else, as I wasn't even sure a new centos base image could be brought into the gate tests.</div>

<div><br></div><div>I think I also sent a related email to the mailing list, suggesting that we increase the timeouts to a value that ensures the VM sends at least a second DHCPDISCOVER. Anyway, we have a few patches which should make this failure mode less frequent. They're all -2 currently as they're always failing the gate (and I don't know why). However, judging from another email Sean sent recently, it seems to be a general Neutron issue.</div>
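<div><br></div><div>As a rough sketch of that timing argument, here is a small Python model built on the approximate figures from Darragh's timeline quoted below (first discover at t=40, a 60-second retry, three discovers in total, port tagged at t=45, tempest polling from t=20). The constant and function names are made up for illustration, and it assumes the instance gets a lease as soon as it sends a discover after the port has been tagged:</div>
<pre>
```python
# Sketch of the timing argument. All figures are approximations taken from
# the quoted timeline; the names below are hypothetical, not from tempest.

FIRST_DISCOVER = 40   # seconds after boot when the first DHCPDISCOVER goes out
RETRY_INTERVAL = 60   # seconds the client waits before the next DHCPDISCOVER
MAX_DISCOVERS = 3     # the cirros test showed three discovers in total
PORT_TAGGED = 45      # when the ovs-agent sets the port vlan
POLL_START = 20       # when tempest starts polling the floating ip


def discover_times():
    """Times (in seconds) at which the instance sends a DHCPDISCOVER."""
    return [FIRST_DISCOVER + i * RETRY_INTERVAL for i in range(MAX_DISCOVERS)]


def gets_lease(poll_timeout):
    """True if a discover is sent after the port is tagged and before tempest gives up."""
    deadline = POLL_START + poll_timeout
    usable = range(PORT_TAGGED, deadline + 1)
    return any(t in usable for t in discover_times())


for timeout in (60, 120):
    print(timeout, gets_lease(timeout))
```
</pre>
<div>With these figures a 60-second poll ends at t=80, before the second discover at t=100, so the loop prints 60 False; a 120-second poll ends at t=140 and covers it, so it prints 120 True.</div>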

<div><br></div><div>Salvatore</div><div><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On 20 January 2014 10:51, Darragh O'Reilly <span dir="ltr">&lt;<a href="mailto:dara2002-openstack@yahoo.com" target="_blank">dara2002-openstack@yahoo.com</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im"><br>

On Monday, 20 January 2014, 15:33, Jay Pipes &lt;<a href="mailto:jaypipes@gmail.com">jaypipes@gmail.com</a>&gt; wrote:<br>

<br>

>Sorry for top-posting -- using web mail client.<br>

</div>no worries - it doesn't bother me.<br>

<div class="im">><br>

>Is it possible to change the retry interval in Cirros (or cloud-init?) so that the backoff is less than 60 seconds?<br>

</div>I think the udhcpc command line parameters are baked into the image. It's part of BusyBox, and I'm not even sure if it's configurable from a script/text file.<br>

<div class="HOEnZb"><div class="h5">><br>

>Best,<br>

><br>

-jay<br>

><br>

><br>

><br>

><br>

>On Mon, Jan 20, 2014 at 10:23 AM, Darragh O'Reilly &lt;<a href="mailto:dara2002-openstack@yahoo.com">dara2002-openstack@yahoo.com</a>&gt; wrote:<br>

><br>

><br>

>>I did a test to see what the dhcp client on cirros does. I killed the dhcp agent and started an instance. The instance sent its first dhcp discover after about 35 sec, then another 60 sec later, and a final one after another 60 sec.<br>


>><br>

>><br>

>>So a revised theory for what happened is this: <br>

>><br>

>>t=0 tempest starts vm and starts polling for ACTIVE status<br>

>>t=20 instance-->ACTIVE and tempest starts polling the floating ip for 60 sec<br>

>>t=40 instance does a dhcp discover - no response - so sets a timer for 60 sec<br>

>>t=45 ovs-agent sets the port vlan<br>

>>t=80 tempest gives up and kills vm<br>

>>t=100 instance would have sent another dhcp discover now if it had been let live<br>

>><br>

>>I think it would be worth trying to change that test to poll for 120 seconds instead of 60.<br>

>><br>

>><br>

>><br>

>>On Monday, 20 January 2014, 11:23, Darragh O'Reilly &lt;<a href="mailto:dara2002-openstack@yahoo.com">dara2002-openstack@yahoo.com</a>&gt; wrote:<br>

>><br>

>>Hi Salvatore,<br>

>>><br>

>>><br>

>>>I presume it's this one? <br>

>>><a href="http://logs.openstack.org/38/65838/4/check/check-tempest-dsvm-neutron-isolated/d108e4a/logs/tempest.txt.gz?#_2014-01-19_20_50_14_604" target="_blank">http://logs.openstack.org/38/65838/4/check/check-tempest-dsvm-neutron-isolated/d108e4a/logs/tempest.txt.gz?#_2014-01-19_20_50_14_604</a><br>


>>><br>

>>><br>

>>>Is it true that the cirros image just fires off a few dhcp discovers and then gives up? If so, then maybe it did so before the tagging happened. Do we have the instance console log? It took about 45 seconds from when the port was created to when it was tagged.<br>


>>><br>

>>><br>

>>>2014-01-19 20:48:57.412 8142 DEBUG neutron.agent.linux.ovsdb_monitor [-] Output<br>

received from ovsdb monitor:<br>

{"data":[["3602a7b2-b559-4709-9bf0-53ae2af68d06","insert","tap496b808c-b5"]],"headings":["row","action","name"]}<br>

>>>&lt;snip&gt;<br>

>>>2014-01-19 20:49:41.925 8142 DEBUG neutron.agent.linux.utils [-]<br>

>>>Command:<br>

['sudo', '/usr/local/bin/neutron-rootwrap',<br>

'/etc/neutron/rootwrap.conf', 'ovs-vsctl', '--timeout=10', 'set',<br>

'Port', 'tap496b808c-b5', 'tag=64']<br>

>>>Exit code: 0<br>

>>><br>

>>><br>

>>>Darragh.<br>

>>><br>

>>><br>

>>><br>

>>>>For the past 2 days I have been seeing timeout failures on gate jobs which I<br>

>>>>am struggling to explain. An example is available in [1].<br>

>>>>These are the usual failures that we associate with bug 1253896, but this<br>

>>>>time I can verify that:<br>

>>>>- The floating IP is correctly wired (IP and NAT rules)<br>

>>>>- The DHCP port is correctly wired, as well as the VM port and the router<br>

>>>>port<br>

>>>>- The DHCP agent is correctly started for the network<br>

>>>><br>

>>>>However, no DHCP DISCOVER request is sent. Only the DHCP RELEASE message is<br>

>>>>seen.<br>

>>>>Any help at interpreting the logs will be appreciated.<br>

>>>><br>

>>>><br>

>>>>Salvatore<br>

>>>><br>

>>>>[1] <a href="http://logs.openstack.org/38/65838" target="_blank">http://logs.openstack.org/38/65838</a><br>

>>><br>

>>><br>

>>><br>

>>_______________________________________________<br>

>>OpenStack-dev mailing list<br>

>><a href="mailto:OpenStack-dev@lists.openstack.org">OpenStack-dev@lists.openstack.org</a><br>

>><a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev</a><br>

>><br>

>><br>

><br>

><br>

><br>

<br>


</div></div></blockquote></div><br></div>