<p dir="ltr">Thanks for the pointers. Did you see any exceptions on the neutron server during port deletion, though? If not, the notification should still have been sent and processed, even if only after a long delay. So unless your driver was raising exceptions post commit that prevented the notification, I'm not certain that could be the issue. </p>
<div class="gmail_quote">On Jun 9, 2015 4:41 AM, "Neil Jerram" <<a href="mailto:Neil.Jerram@metaswitch.com">Neil.Jerram@metaswitch.com</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 09/06/15 11:34, Neil Jerram wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
On 09/06/15 01:15, Kevin Benton wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
I'm having difficulty reproducing the issue. The fix for the bug that Neil<br>
referenced (<a href="https://bugs.launchpad.net/neutron/+bug/1192381" target="_blank">https://bugs.launchpad.net/neutron/+bug/1192381</a>) looks like<br>
it landed in Icehouse well before the 2014.1.3 release that Fuel 5.1.1<br>
appears to be using.<br>
</blockquote>
<br>
Just to be sure, I assume we're focussing here on the issue that Daniel<br>
reported (IP appears twice in Dnsmasq config), and for which I described<br>
a possible corollary (Dnsmasq config size keeps growing), and NOT on the<br>
"Another DHCP agent problem" that I mentioned below. :-)<br>
<br>
BTW, now that I've reviewed the history of when my team saw this, I can<br>
say that it was actually first reported to us with the 'IP appears twice<br>
in Dnsmasq config' symptom - i.e. exactly the same as Daniel's case. The<br>
fact of the Dnsmasq config increasing in size was noticed later.<br>
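<br>
For anyone wanting to check for the 'IP appears twice' symptom, here is a minimal sketch. It assumes only that the addn_hosts file is in /etc/hosts format (IP address first, then names), which is what Dnsmasq's --addn-hosts option expects. The function name and the sample hostnames are made up for illustration.<br>

```python
from collections import defaultdict


def find_duplicate_ips(hosts_text):
    """Return {ip: [first hostname, ...]} for IPs listed on more than one line."""
    seen = defaultdict(list)
    for line in hosts_text.splitlines():
        # Drop comments and surrounding whitespace, skip empty lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        seen[ip].append(names[0] if names else "")
    return {ip: names for ip, names in seen.items() if len(names) > 1}


# Hypothetical file contents; real input would be read from the
# addn_hosts file under /var/lib/neutron/dhcp/<network-id>/.
sample = """\
192.168.110.5 host-a.openstacklocal host-a
192.168.110.6 host-b.openstacklocal host-b
192.168.110.5 host-c.openstacklocal host-c
"""
duplicates = find_duplicate_ips(sample)
```

Running this periodically during a churn test would show whether stale entries are accumulating before the address pool actually runs out.<br>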
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
I tried setting the agent report interval to something higher than the<br>
downtime to make it seem like the agent is failing sporadically to the<br>
server, but it's not impacting the notifications.<br>
</blockquote>
<br>
Makes sense - that's the effect of the fix for 1192381.<br>
<br>
To be clear, though, what code are you trying to reproduce on?  Current<br>
master?<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Neil, does your testing where you saw something similar have a lot of<br>
concurrent creation/deletion?<br>
</blockquote>
<br>
It was a test of continuously deleting and creating VMs, with this<br>
pseudocode:<br>
<br>
thread_pool = new_thread_pool(size=30)<br>
for x in range(0,30):<br>
    thread_pool.submit(create_vm)<br>
thread_pool.wait_for_all_threads_to_complete()<br>
while True:<br>
    time.sleep(5)<br>
    for x in range(0,int(random.random()*5)):<br>
        thread_pool.submit(randomly_delete_a_vm_and_create_a_new_one)<br>
<br>
I'm not clear whether that would qualify as 'concurrent', in the sense<br>
that you have in mind.<br>
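<br>
For reference, a runnable approximation of that pseudocode might look like the sketch below. It uses Python's concurrent.futures in place of the unspecified thread pool, the two VM functions are empty placeholders (the real ones would call the Nova API), and the loop is bounded by a hypothetical rounds parameter so that it terminates.<br>

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, wait


def create_vm():
    """Placeholder: boot one VM."""


def randomly_delete_a_vm_and_create_a_new_one():
    """Placeholder: delete a random existing VM, then boot a replacement."""


def churn(rounds, pause_seconds=5, pool_size=30, initial_vms=30):
    """Boot the initial VMs, then replace 0 to 4 random VMs per round.

    Returns the number of replacement tasks submitted.
    """
    submitted = 0
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # Wait for the initial population before starting the churn.
        wait([pool.submit(create_vm) for _ in range(initial_vms)])
        for _ in range(rounds):
            time.sleep(pause_seconds)
            for _ in range(int(random.random() * 5)):
                pool.submit(randomly_delete_a_vm_and_create_a_new_one)
                submitted += 1
    return submitted
```

Note that replacements submitted in one round are not awaited before the next round starts, so several delete/create pairs can be in flight at once, as in the original test.<br>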
</blockquote>
<br>
Some further observations from when we were seeing this problem:<br>
<br>
- RabbitMQ diags did not indicate any message queue buildup.<br>
<br>
- If the churn test was paused for a while, the DHCP agent did not catch up.  I.e. it did not eventually rewrite the Dnsmasq config file so as to be smaller and without duplicate IPs.<br>
<br>
- If neutron-dhcp-agent was restarted, it _did_ rewrite the Dnsmasq config file so as to be correct for the standing set of VMs.<br>
<br>
- An effective workaround was to make the DHCP agent do its periodic resync on each expiry of resync_interval, even if needs_resync_reasons was empty.<br>
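<br>
The resync workaround in the last point can be illustrated with the following sketch. It is not the Neutron agent code: the class, the always_resync flag and the sync_state callback are hypothetical stand-ins, and only resync_interval and needs_resync_reasons are names taken from the description above.<br>

```python
class ResyncLoop:
    """Toy model of a periodic resync decision (not Neutron's implementation)."""

    def __init__(self, sync_state, always_resync=False):
        self.sync_state = sync_state        # hypothetical full-resync callback
        self.always_resync = always_resync  # hypothetical workaround switch
        self.needs_resync_reasons = []

    def schedule_resync(self, reason):
        self.needs_resync_reasons.append(reason)

    def on_interval_expiry(self):
        """Call once per resync_interval; returns True if a resync ran."""
        reasons, self.needs_resync_reasons = self.needs_resync_reasons, []
        if reasons or self.always_resync:
            self.sync_state()
            return True
        return False
```

With always_resync left at False the loop only resyncs when something has recorded a reason, so a port-delete that never reaches the agent is never corrected. With it set to True, every interval rebuilds state from the server, at the cost of extra load.<br>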
<br>
We then worked on two changes, after which the problem was no longer observed.<br>
<br>
1. The DHCP agent change that I've described below, to do batched 'reload_allocations' processing for multiple contiguous port-create and port-delete events.<br>
<br>
2. Substantial changes to our mechanism driver to allow it to work properly with Neutron HA - i.e. with multiple controllers and/or api_workers>1 on each controller.  The work here included the point that we are still discussing at <a href="http://lists.openstack.org/pipermail/openstack-dev/2015-June/065558.html" target="_blank">http://lists.openstack.org/pipermail/openstack-dev/2015-June/065558.html</a>.<br>
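<br>
As a rough illustration of the batching idea in change 1 (this is not the code from our commits), the sketch below drains whatever notifications are already queued and then reloads each affected network once. The event dictionaries, the function names and the reload_allocations callback are all assumptions made for the example, and the extra reader thread is left out.<br>

```python
import queue


def drain_batch(events):
    """Block for one event, then collect everything else already queued."""
    batch = [events.get()]
    while True:
        try:
            batch.append(events.get_nowait())
        except queue.Empty:
            return batch


def process_batch(batch, reload_allocations):
    """Call reload_allocations once per affected network, not once per event."""
    network_ids = []
    for event in batch:
        if event["network_id"] not in network_ids:
            network_ids.append(event["network_id"])
    for network_id in network_ids:
        reload_allocations(network_id)
    return network_ids


# Hypothetical burst of four notifications touching two networks.
events = queue.Queue()
for kind, net in [("port_create", "net-1"), ("port_delete", "net-1"),
                  ("port_create", "net-2"), ("port_delete", "net-1")]:
    events.put({"type": kind, "network_id": net})

reloads = []
process_batch(drain_batch(events), reloads.append)
```

Here four notifications result in two reloads. The saving only matters when notifications arrive faster than a single reload completes.<br>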
<br>
After reviewing all this now, my guess is that (2) was the real fix.  It seems quite likely that we had a bug in our mechanism driver that either blocked or delayed port-delete processing, when the non-HA-ready code was run in an HA environment.  Hence, I guess, lots of port-delete notifications were just not being sent at all to the DHCP agent.<br>
<br>
(1) was probably a red herring.  The fix that we made here may be of use in environments where the DHCP agent is being driven hard enough, but the observation of no RabbitMQ queue buildup means, I think, that that wasn't a factor in our own test setup.<br>
<br>
Also, in Kilo, time for each 'reload_allocations' processing may be reduced (compared with Juno and earlier) because of using a rootwrap daemon.  So it may be even harder now to see the problem that (1) addresses.<br>
<br>
(FYI, though, our changes for (1), for Juno-level code, may be seen at:<br>
<a href="https://github.com/Metaswitch/calico-neutron/commit/8a643ac975b3ae620c94d2c24286cd8e13ca13b1" target="_blank">https://github.com/Metaswitch/calico-neutron/commit/8a643ac975b3ae620c94d2c24286cd8e13ca13b1</a><br>
<a href="https://github.com/Metaswitch/calico-neutron/commit/1317ff0b3b9c856a7fff44a847aa46a7ca9dcc0f" target="_blank">https://github.com/Metaswitch/calico-neutron/commit/1317ff0b3b9c856a7fff44a847aa46a7ca9dcc0f</a><br>
<br>
The changes for (2) may be seen by looking at the calico/openstack/* file changes at <a href="https://github.com/Metaswitch/calico/commit/080e96d" target="_blank">https://github.com/Metaswitch/calico/commit/080e96d</a><br>
)<br>
<br>
I hope this information is useful.  In summary, then, I think the most likely explanation, at least in our case, is that it was a mechanism driver HA bug.<br>
<br>
Regards,<br>
        Neil<br>
<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Regards,<br>
     Neil<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
On Mon, Jun 8, 2015 at 12:21 PM, Andrew Woodward <<a href="mailto:awoodward@mirantis.com" target="_blank">awoodward@mirantis.com</a>> wrote:<br>
<br>
    Daniel,<br>
<br>
    This sounds familiar; see if it matches [1]. IIRC, there was<br>
    another issue like this that may already be addressed by the<br>
    updates in the Fuel 5.1.2 packages repo [2]. You can either update the<br>
    neutron packages from [2] or try one of the community builds for 5.1.2<br>
    [3]. If this doesn't resolve the issue, open a bug against MOS dev<br>
    [4].<br>
<br>
    [1] <a href="https://bugs.launchpad.net/bugs/1295715" target="_blank">https://bugs.launchpad.net/bugs/1295715</a><br>
    [2] <a href="http://fuel-repository.mirantis.com/fwm/5.1.2/ubuntu/pool/main/" target="_blank">http://fuel-repository.mirantis.com/fwm/5.1.2/ubuntu/pool/main/</a><br>
    [3] <a href="https://ci.fuel-infra.org/" target="_blank">https://ci.fuel-infra.org/</a><br>
    [4] <a href="https://bugs.launchpad.net/mos/+filebug" target="_blank">https://bugs.launchpad.net/mos/+filebug</a><br>
<br>
    On Mon, Jun 8, 2015 at 10:15 AM Neil Jerram<br>
    <<a href="mailto:Neil.Jerram@metaswitch.com" target="_blank">Neil.Jerram@metaswitch.com</a>> wrote:<br>
<br>
        Two further thoughts on this:<br>
<br>
        1. Another DHCP agent problem that my team noticed is that<br>
        call_driver('reload_allocations') takes a bit of time (to regenerate<br>
        the Dnsmasq config files, and to spawn a shell that sends a HUP<br>
        signal) - enough so that if there is a fast steady rate of<br>
        port-create and port-delete notifications coming from the Neutron<br>
        server, these can build up in DHCPAgent's RPC queue, and then they<br>
        still only get dispatched one at a time.  So the queue and the time<br>
        delay become longer and longer.<br>
<br>
        I have a fix pending for this, which uses an extra thread to read<br>
        those notifications off the RPC queue onto an internal queue, and<br>
        then batches the call_driver('reload_allocations') processing when<br>
        there is a contiguous sequence of such notifications - i.e. only<br>
        does the config regeneration and HUP once, instead of lots of times.<br>
<br>
        I don't think this is directly related to what you are seeing - but<br>
        perhaps there actually is some link that I am missing.<br>
<br>
        2. There is an interesting and vaguely similar thread currently<br>
        being discussed about the L3 agent (subject "L3 agent rescheduling<br>
        issue") - about possible RPC/threading issues between the agent and<br>
        the Neutron server.  You might like to review that thread and see if<br>
        it describes any problems analogous to your DHCP one.<br>
<br>
        Regards,<br>
                 Neil<br>
<br>
<br>
        On 08/06/15 17:53, Neil Jerram wrote:<br>
         > My team has seen a problem that could be related: in a churn test where<br>
         > VMs are created and terminated at a constant rate - but so that the<br>
         > number of active VMs should remain roughly constant - the size of the<br>
         > host and addn_hosts files keeps increasing.<br>
         ><br>
         > In other words, it appears that the config for VMs that have actually<br>
         > been terminated is not being removed from the config file.  Clearly, if<br>
         > you have a limited pool of IP addresses, this can eventually lead to the<br>
         > problem that you have described.<br>
         ><br>
         > For your case - i.e. with Icehouse - the problem might be<br>
         > <a href="https://bugs.launchpad.net/neutron/+bug/1192381" target="_blank">https://bugs.launchpad.net/neutron/+bug/1192381</a>.  I'm not sure if the<br>
         > fix for that problem - i.e. sending port-create and port-delete<br>
         > notifications to DHCP agents even when the server thinks they are down -<br>
         > was merged before the Icehouse release, or not.<br>
         ><br>
         > But there must be at least one other cause as well, because my team was<br>
         > seeing this with Juno-level code.<br>
         ><br>
         > Therefore I, too, would be interested in any other insights about this<br>
         > problem.<br>
         ><br>
         > Regards,<br>
         >      Neil<br>
         ><br>
         ><br>
         ><br>
         > On 08/06/15 16:26, Daniel Comnea wrote:<br>
         >> Any help, ideas please?<br>
         >><br>
         >> Thx,<br>
         >> Dani<br>
         >><br>
         >> On Mon, Jun 8, 2015 at 9:25 AM, Daniel Comnea<br>
         >> <<a href="mailto:comnea.dani@gmail.com" target="_blank">comnea.dani@gmail.com</a>> wrote:<br>
         >><br>
         >>     + Operators<br>
         >><br>
         >>     Much thanks in advance,<br>
         >>     Dani<br>
         >><br>
         >><br>
         >><br>
         >><br>
         >>     On Sun, Jun 7, 2015 at 6:31 PM, Daniel Comnea<br>
         >>     <<a href="mailto:comnea.dani@gmail.com" target="_blank">comnea.dani@gmail.com</a>> wrote:<br>
         >><br>
         >>         Hi all,<br>
         >><br>
         >>         I'm running Icehouse (built using Fuel 5.1.1) on Ubuntu, with<br>
         >>         dnsmasq version 2.59-4.<br>
         >>         I have a very basic network layout where I have a private net<br>
         >>         which has 2 subnets<br>
         >><br>
         >>         2fb7de9d-d6df-481f-acca-2f7860cffa60 | private-net | e79c3477-d3e5-471c-a728-8d881cf31bee 192.168.110.0/24 |<br>
         >>                                              |             | f48c3223-8507-455c-9c13-8b727ea5f441 192.168.111.0/24 |<br>
         >><br>
         >>         and I'm creating VMs via Heat.<br>
         >>         What is happening is that sometimes I get duplicated entries in<br>
         >>         [1] and because of that the VM which was spun up doesn't get<br>
         >>         an IP.<br>
         >>         The Dnsmasq processes are running okay [2] and I can't see<br>
         >>         anything special/wrong in it.<br>
         >><br>
         >>         Any idea why this is happening? Or are you aware of any bugs<br>
         >>         around this area? Do you see any problems with having 2 subnets<br>
         >>         mapped to 1 private-net?<br>
         >><br>
         >><br>
         >><br>
         >>         Thanks,<br>
         >>         Dani<br>
         >><br>
         >>         [1]<br>
         >><br>
         >>         /var/lib/neutron/dhcp/2fb7de9d-d6df-481f-acca-2f7860cffa60/addn_hosts<br>
         >><br>
         >>         [2]<br>
         >><br>
         >>         nobody    5664     1  0 Jun02 ?        00:00:08 dnsmasq<br>
         >>         --no-hosts --no-resolv --strict-order --bind-interfaces<br>
         >>         --interface=tapc9164734-0c --except-interface=lo<br>
         >>         --pid-file=/var/lib/neutron/dhcp/2fb7de9d-d6df-481f-acca-2f7860cffa60/pid<br>
         >>         --dhcp-hostsfile=/var/lib/neutron/dhcp/2fb7de9d-d6df-481f-acca-2f7860cffa60/host<br>
         >>         --addn-hosts=/var/lib/neutron/dhcp/2fb7de9d-d6df-481f-acca-2f7860cffa60/addn_hosts<br>
         >>         --dhcp-optsfile=/var/lib/neutron/dhcp/2fb7de9d-d6df-481f-acca-2f7860cffa60/opts<br>
         >>         --leasefile-ro --dhcp-authoritative<br>
         >>         --dhcp-range=set:tag0,192.168.110.0,static,86400s<br>
         >>         --dhcp-range=set:tag1,192.168.111.0,static,86400s<br>
         >>         --dhcp-lease-max=512 --conf-file= --server=10.0.0.31<br>
         >>         --server=10.0.0.32 --domain=openstacklocal<br>
         >><br>
         >><br>
         >><br>
         >><br>
         >><br>
         >> _______________________________________________<br>
         >> OpenStack-operators mailing list<br>
         >> <a href="mailto:OpenStack-operators@lists.openstack.org" target="_blank">OpenStack-operators@lists.openstack.org</a><br>
        <mailto:<a href="mailto:OpenStack-operators@lists.openstack.org" target="_blank">OpenStack-operators@lists.openstack.org</a>><br>
         >><br>
<br>
<a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators</a><br>
         >><br>
         ><br>
<br>
<br>
__________________________________________________________________________<br>
<br>
        OpenStack Development Mailing List (not for usage questions)<br>
        Unsubscribe:<br>
        <a href="http://OpenStack-dev-request@lists.openstack.org?subject:unsubscribe" target="_blank">OpenStack-dev-request@lists.openstack.org?subject:unsubscribe</a><br>
<br>
<<a href="http://OpenStack-dev-request@lists.openstack.org?subject:unsubscribe" target="_blank">http://OpenStack-dev-request@lists.openstack.org?subject:unsubscribe</a>><br>
        <a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev</a><br>
<br>
    --<br>
    --<br>
    Andrew Woodward<br>
    Mirantis<br>
    Fuel Community Ambassador<br>
    Ceph Community<br>
<br>
<br>
<br>
<br>
<br>
<br>
--<br>
Kevin Benton<br>
<br>
<br>
</blockquote>
<br>
</blockquote>
<br>
</blockquote></div>