<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Feb 14, 2018 at 5:40 AM, Brian Haley <span dir="ltr"><<a href="mailto:haleyb.dev@gmail.com" target="_blank">haleyb.dev@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 02/13/2018 05:08 PM, Armando M. wrote:<span class=""><br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
<br>
On 13 February 2018 at 14:02, Brent Eagles <<a href="mailto:beagles@redhat.com" target="_blank">beagles@redhat.com</a>> wrote:<br>
<br>
Hi,<br>
<br>
The neutron agents are implemented in such a way that key functionality is realized through haproxy, dnsmasq,<br>
keepalived and radvd configuration. The agents manage instances of these services but, by design, the agent is the top-most parent process (pid 1).<br>
<br>
On baremetal this has the advantage that, although control plane changes cannot be made while the agents are unavailable,<br>
the configuration that was in place when the agents were stopped keeps working (for example, VMs that are restarted can still request their IPs, etc.).<br>
In short, the dataplane is not affected by shutting down the agents.<br>
<br>
In the TripleO containerized version of these agents, the supporting processes (haproxy, dnsmasq, etc.) run within the agent's<br>
container, so when the container is stopped the supporting processes are stopped as well. That is, the behavior of the current containers<br>
is significantly different from baremetal, and stopping/restarting the containers effectively breaks the dataplane. At the moment this is<br>
considered a blocker, and unless we can find a resolution we may need to recommend running the L3, DHCP and metadata agents on<br>
baremetal.<br>
</blockquote>
<br></span>
I didn't think the neutron metadata agent was affected but just the ovn-metadata agent? Or is there a problem with the UNIX domain sockets the haproxy instances use to connect to it when the container is restarted?</blockquote><div><br></div>
<div>That's right. In the ovn-metadata-agent we spawn haproxy inside the q-ovnmeta namespace, and that is where we'll have a problem if the process goes away. As you said, the neutron metadata agent basically receives the proxied requests from the haproxy instances residing in either the q-router or q-dhcp namespaces on its UNIX socket and forwards them to Nova.</div><div> </div>
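<div>A minimal sketch of that pattern, just to make the parent/child relationship concrete. The namespace and config path below are made up for the example; this is not the actual ovn-metadata-agent code:</div>
<div><pre>
import subprocess


def spawn_metadata_haproxy(namespace, config_path):
    """Launch haproxy inside the given network namespace.

    When the calling agent is PID 1 of a container, stopping that
    container also kills the haproxy children spawned here, taking
    the metadata path down with it.
    """
    cmd = ["ip", "netns", "exec", namespace, "haproxy", "-f", config_path]
    return subprocess.Popen(cmd)


if __name__ == "__main__":
    # Hypothetical namespace and config file, for illustration only.
    proc = spawn_metadata_haproxy("qovnmeta-example", "/tmp/metadata-proxy.cfg")
    print("haproxy started with pid %d" % proc.pid)
</pre></div><div><br></div>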
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class=""><br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
There's quite a bit to unpack here: are you suggesting that running these services in an HA configuration doesn't help either with the data plane being gone after a stop/restart? Ultimately this boils down to where the state is persisted, and while certain agents rely on namespaces and processes whose ephemeral nature is hard to persist, enough could be done to allow for a non-disruptive bumping of the aforementioned services.<br>
</blockquote>
<br></span>
Armando - <a href="https://review.openstack.org/#/c/542858/" rel="noreferrer" target="_blank">https://review.openstack.org/#/c/542858/</a> (if accepted) should help with dataplane downtime, as sharing the namespaces lets them persist, which eases what the agent has to configure on the restart of a container (think of what the l3-agent needs to create for 1000 routers).<br>
<br>
But it doesn't address dnsmasq being unavailable when the dhcp-agent container is restarted, as happens today. Maybe one way around that is to run 2+ agents per network, but that still leaves a regression from how it works today. Even with l3-ha I'm not sure things are perfect; we might wind up with two masters sometimes.<br>
<br>
I've seen one suggestion of putting all these processes in their own container instead of the agent container so they continue to run; it just might be invasive to the neutron code. Maybe there is another option?</blockquote><div><br></div>
<div>I had an idea based on that one to reduce the impact on the neutron code and its dependency on containers. Basically, we would run dnsmasq, haproxy, keepalived, radvd, etc. in separate containers (which makes sense, as they have independent lifecycles) and drive them through the docker socket from the neutron agents. To reduce that dependency, I thought of having some sort of 'rootwrap-daemon-docker' which takes the commands and checks whether it has to spawn the process in a separate container (iptables, for example, wouldn't need one) and, if so, uses the docker socket to do it (a rough sketch follows below). We'd also have to monitor the PID files of those containers so we can respawn them in case they die.</div>
<div><br></div><div>IMHO, this is far from the containers philosophy since we're using host networking, privileged access, sharing namespaces, relying on 'sidecar' containers... but I can't think of a better way to do it.</div><div><br></div>
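<div>To make that a bit more concrete, here is a rough, purely illustrative Python sketch of what such a 'rootwrap-daemon-docker' dispatcher could look like. The image name, container naming and use of the docker CLI are assumptions for the example, not an existing neutron or TripleO interface:</div>
<div><pre>
import subprocess
import uuid

# Long-lived dataplane processes that would move into sidecar containers;
# everything else (iptables, ip, ...) keeps running directly as today.
SIDECAR_COMMANDS = {"dnsmasq", "haproxy", "keepalived", "radvd"}
SIDECAR_IMAGE = "neutron-sidecar:latest"  # hypothetical image name


def run_command(cmd):
    """Dispatch an argv list, e.g. ['dnsmasq', '--no-hosts', ...]."""
    executable = cmd[0]
    if executable not in SIDECAR_COMMANDS:
        # Short-lived commands run exactly as they do now.
        return subprocess.check_call(cmd)
    return _spawn_sidecar(executable, cmd)


def _spawn_sidecar(executable, cmd):
    """Start the process in its own container so it outlives the agent.

    Uses the docker CLI for brevity; a real implementation would more
    likely talk to the docker socket/API directly.
    """
    name = "neutron-%s-%s" % (executable, uuid.uuid4().hex[:8])
    docker_cmd = [
        "docker", "run", "-d", "--name", name,
        "--net=host", "--privileged",  # shares host namespaces, as noted above
        SIDECAR_IMAGE,
    ] + cmd
    subprocess.check_call(docker_cmd)
    return name
</pre></div>
<div><br></div><div>The respawn part would sit on top of this: something still has to watch the PID files (or the container state) and restart the sidecar if it dies.</div><div><br></div>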
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="HOEnZb"><font color="#888888"><br>
<br>
-Brian</font></span><div class="HOEnZb"><div class="h5"><br>
<br>
</div></div></blockquote></div><br></div></div>