<html>
<head>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<br>
<div class="moz-cite-prefix">On 2014/11/19 1:49, Eric Windisch
wrote:<br>
</div>
<blockquote
cite="mid:CAAZDpLe0cEQrje4P5Ow6DF+YtX8nh5jBMmta4L-X4sNEOq9tZA@mail.gmail.com"
type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">I
think for this cycle we really do need to focus on
consolidating and<br>
testing the existing driver design and fixing up the
biggest<br>
deficiency (1) before we consider moving forward with
lots of new</blockquote>
</div>
<div><br>
</div>
<div>+1</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">1)
Outbound messaging connection re-use - right now every
outbound<br>
messaging creates and consumes a tcp connection - this
approach scales<br>
badly when neutron does large fanout casts.<br>
</blockquote>
<div><br>
</div>
<div><br>
</div>
<div>I'm glad you are looking at this and by doing so, will
understand the system better. I hope the following will
give some insight into, at least, why I made the decisions
I made:</div>
<div> </div>
<div>This was an intentional design trade-off. I saw three
choices here: build a fully decentralized solution, build
a fully-connected network, or use centralized brokerage. I
wrote off centralized brokerage immediately. The problem
with a fully connected system is that active TCP
connections are required between all of the nodes. I
didn't think that would scale and would be brittle against
floods (intentional or otherwise).</div>
<div><br>
</div>
<div>IMHO, I always felt the right solution for large fanout
casts was to use multicast. When the driver was written,
Neutron didn't exist and there was no use-case for large
fanout casts, so I didn't implement multicast, but knew it
as an option if it became necessary. It isn't the right
solution for everyone, of course.</div>
<div><br>
</div>
</div>
</div>
</div>
</blockquote>
Using multicast adds complexity to the switch forwarding plane, which
has to enable and maintain multicast group membership. For a large
deployment I prefer to keep forwarding simple and easy to maintain.
IMO, running a set of fanout-router processes in the cluster can
achieve the same goal.<br>
The data path would be: openstack-daemon --------send the message (with
fanout=true) ---------> fanout-router -----read the
matchmaker------> send to the destinations<br>
In effect it just uses unicast to simulate multicast; a rough sketch of
such a router follows.<br>
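To illustrate what I mean, here is a rough sketch of such a
fanout-router in plain pyzmq. It is not oslo.messaging code; the
lookup_peers() matchmaker call and the addresses are made up for the
example.<br>
<pre>
# Sketch of a fanout-router: receive a fanout cast on a PULL socket and
# relay it over short-lived PUSH sockets to every peer the matchmaker
# returns for the topic. lookup_peers() is a hypothetical placeholder,
# not a real oslo.messaging API.
import zmq


def lookup_peers(topic):
    # Placeholder for a matchmaker query (e.g. a Redis-backed matchmaker);
    # should return a list of "tcp://host:port" endpoints for the topic.
    raise NotImplementedError


def fanout_router(bind_addr="tcp://*:9501"):
    ctx = zmq.Context.instance()
    inbound = ctx.socket(zmq.PULL)
    inbound.bind(bind_addr)
    while True:
        # The sending daemon is assumed to send [topic, payload] frames.
        topic, payload = inbound.recv_multipart()
        for endpoint in lookup_peers(topic.decode()):
            out = ctx.socket(zmq.PUSH)
            # Give the I/O thread up to 1s to flush the queued message
            # before the socket is reaped, instead of lingering forever
            # on a dead peer.
            out.setsockopt(zmq.LINGER, 1000)
            out.connect(endpoint)
            out.send_multipart([topic, payload])
            out.close()
</pre>
Routers like this are stateless, so several of them could be run in the
cluster for scale-out.<br>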
<blockquote
cite="mid:CAAZDpLe0cEQrje4P5Ow6DF+YtX8nh5jBMmta4L-X4sNEOq9tZA@mail.gmail.com"
type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div>For connection reuse, you could manage a pool of
connections and keep those connections around for a
configurable amount of time, after which they'd expire and
be re-opened. This would keep the most actively used
connections alive. One problem is that it would make the
service more brittle by making it far more susceptible to
running out of file descriptors by keeping connections
around significantly longer. However, this wouldn't be as
brittle as fully-connecting the nodes nor as poorly
scalable.</div>
<div><br>
</div>
</div>
</div>
</div>
</blockquote>
+1. Allowing a large number of fds is not a problem: because we use a
socket pool, we can bound the number of fds and keep it fixed. A simple
TTL-based pool is sketched below.<br>
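As a sketch of what I mean, assuming one PUSH socket per endpoint that
is reused until a configurable idle TTL expires (illustrative only, not
the current driver code):<br>
<pre>
# Illustrative TTL-based socket pool: reuse one PUSH socket per endpoint
# and close sockets that have been idle longer than the configured TTL,
# so the fd count stays bounded by the set of recently active endpoints.
import time

import zmq


class SocketPool(object):
    def __init__(self, ttl=60.0):
        self.ttl = ttl
        self.ctx = zmq.Context.instance()
        self._pool = {}  # endpoint -> (socket, last_used)

    def get(self, endpoint):
        sock, _ = self._pool.get(endpoint, (None, None))
        if sock is None:
            sock = self.ctx.socket(zmq.PUSH)
            sock.connect(endpoint)
        self._pool[endpoint] = (sock, time.time())
        return sock

    def expire(self):
        now = time.time()
        for endpoint, (sock, last_used) in list(self._pool.items()):
            if now - last_used > self.ttl:
                sock.close(linger=0)
                del self._pool[endpoint]
</pre>
Calling expire() periodically (for example from a housekeeping
greenthread) keeps the number of open connections limited to the
endpoints that were actually used recently.<br>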
<blockquote
cite="mid:CAAZDpLe0cEQrje4P5Ow6DF+YtX8nh5jBMmta4L-X4sNEOq9tZA@mail.gmail.com"
type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div>If OpenStack and oslo.messaging were designed
specifically around this message pattern, I might suggest
that the library and its applications be aware of
high-traffic topics and persist the connections for those
topics, while keeping others ephemeral. A good example for
Nova would be api->scheduler traffic would be
persistent, whereas scheduler->compute_node would be
ephemeral. Perhaps this is something that could still be
added to the library.</div>
<div><br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">2)
PUSH/PULL tcp sockets - Pieter suggested we look at
ROUTER/DEALER<br>
as an option once 1) is resolved - this socket type
pairing has some<br>
interesting features which would help with resilience and
availability<br>
including heartbeating. </blockquote>
<div><br>
</div>
<div>Using PUSH/PULL does not eliminate the possibility of
being fully connected, nor is it incompatible with
persistent connections. If you're not going to be
fully-connected, there isn't much advantage to long-lived
persistent connections and without those persistent
connections, you're not benefitting from features such as
heartbeating.</div>
<div><br>
</div>
</div>
</div>
</div>
</blockquote>
How about REQ/REP? I think it is well suited to long-lived persistent
connections, and the mandatory reply also provides a degree of
reliability. A minimal round trip is sketched below.<br>
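For reference, the lockstep pattern I have in mind looks roughly like
this in plain pyzmq (both ends shown in one snippet for brevity; this
is only an illustration, not proposed driver code):<br>
<pre>
# Minimal REQ/REP round trip: each request blocks until the matching
# reply arrives, which gives an implicit acknowledgement.
import zmq

ctx = zmq.Context.instance()

# Server side (e.g. a worker on a compute node).
rep = ctx.socket(zmq.REP)
rep.bind("tcp://*:9502")

# Client side (e.g. an API service) connecting to the same endpoint.
req = ctx.socket(zmq.REQ)
req.connect("tcp://localhost:9502")

req.send_json({"method": "ping"})   # client sends the request
print(rep.recv_json())              # server receives it...
rep.send_json({"result": "pong"})   # ...and replies
print(req.recv_json())              # client unblocks with the reply
</pre>
The strict send/receive lockstep is, of course, exactly what the
anecdote below warns about.<br>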
<blockquote
cite="mid:CAAZDpLe0cEQrje4P5Ow6DF+YtX8nh5jBMmta4L-X4sNEOq9tZA@mail.gmail.com"
type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div>I'm not saying ROUTER/DEALER cannot be used, but use
them with care. They're designed for long-lived channels
between hosts and not for the ephemeral-type connections
used in a peer-to-peer system. Dealing with how to manage
timeouts on the client and the server and the swelling
number of active file descriptors that you'll get by
using ROUTER/DEALER is not trivial, assuming you can get
past the management of all of those synchronous sockets
(hidden away by tons of eventlet greenthreads)...</div>
<div><br>
</div>
<div>Extra anecdote: During a conversation at the OpenStack
summit, someone told me about their experiences using
ZeroMQ and the pain of using REQ/REP sockets and how they
felt it was a mistake to have used them. We discussed a bit
about some other problems such as the fact it's impossible
to avoid TCP fragmentation unless you force all frames to
552 bytes or have a well-managed network where you know
the MTUs of all the devices you'll pass through.
Suggestions were made to make ZeroMQ better, until we
realized we had just described TCP-over-ZeroMQ-over-TCP,
finished our beers, and quickly changed topics.<br>
</div>
</div>
</div>
</div>
</blockquote>
Well, it seems I need to take my last question back. In our deployment
I always take advantage of jumbo frames to increase throughput. You
said that REQ/REP would introduce TCP fragmentation unless all ZeroMQ
frames are forced to 552 bytes? Could you please elaborate?<br>
<blockquote
cite="mid:CAAZDpLe0cEQrje4P5Ow6DF+YtX8nh5jBMmta4L-X4sNEOq9tZA@mail.gmail.com"
type="cite">
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
OpenStack-dev mailing list
<a class="moz-txt-link-abbreviated" href="mailto:OpenStack-dev@lists.openstack.org">OpenStack-dev@lists.openstack.org</a>
<a class="moz-txt-link-freetext" href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev</a>
</pre>
</blockquote>
<br>
</body>
</html>