<div dir="ltr">Sandy, <div><br></div><div>I see only one race condition. (in current solution we have the same situtaiton) </div><div>Between request to compute node and data is updated in DB, we could use wrong state of compute node.</div>
In any case, that is handled by the retry.

I don't see any new races produced by the new approach without the DB.
Could you point to the line or method that would produce races?
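To be concrete about the retry I mean, a minimal sketch of the
stale-view-plus-retry pattern (hypothetical names, not the actual nova
scheduler code):

    # The compute node's *real* free RAM -- in nova this lives on the node.
    real_free_ram = {"node1": 512, "node2": 4096}

    class ClaimFailed(Exception):
        """The compute node's real resources no longer fit the request."""

    def claim(host, ram_mb):
        # Stands in for the RPC to the compute node, which re-checks its
        # current resources before accepting the instance.
        if ram_mb > real_free_ram[host]:
            raise ClaimFailed(host)
        real_free_ram[host] -= ram_mb

    def schedule(ram_mb, stale_view):
        # stale_view: {host: free_ram_mb} as the scheduler last saw it.
        for host in sorted(stale_view, key=stale_view.get, reverse=True):
            try:
                claim(host, ram_mb)  # node rejects if the view was stale
                return host
            except ClaimFailed:
                continue             # stale data -> retry on the next host
        raise ClaimFailed("no valid host")

    # The scheduler still thinks node1 has 2048 MB free, but it doesn't;
    # the claim fails and the retry lands the instance on node2 instead.
    print(schedule(1024, {"node1": 2048, "node2": 1536}))   # -> node2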

Best regards,
Boris Pavlovic



On Sat, Jul 20, 2013 at 2:13 AM, Sandy Walsh <sandy.walsh@rackspace.com> wrote:
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im"><br>
<br>
On 07/19/2013 05:36 PM, Boris Pavlovic wrote:<br>
> Sandy,<br>
><br>
</div><div class="im">> I don't think that we have such problems here.<br>
> Because scheduler doesn't pool compute_nodes.<br>
> The situation is another compute_nodes notify scheduler about their<br>
> state. (instead of updating their state in DB)<br>
><br>
> So for example if scheduler send request to compute_node, compute_node<br>
> is able to run rpc call to schedulers immediately (not after 60sec).<br>
><br>
> So there is almost no races.<br>
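>
> Roughly this shape (hypothetical names, just to show the idea):
>
>     class ResourceTracker(object):
>         """Runs on the compute node and pushes state to the schedulers."""
>
>         def __init__(self, rpc_fanout_cast, hostname):
>             self.cast = rpc_fanout_cast      # fanout to all schedulers
>             self.hostname = hostname
>             self.state = {"free_ram_mb": 8192, "free_disk_gb": 100}
>
>         def update(self, **deltas):
>             # called whenever an instance is claimed or destroyed, so the
>             # schedulers hear about the change immediately, not after 60 sec
>             for key, delta in deltas.items():
>                 self.state[key] += delta
>             self.cast("scheduler", {"host": self.hostname,
>                                     "state": self.state})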

There are races that occur between the eventlet request threads. This is
why the scheduler has been switched to single-threaded and we can only
run one scheduler.

This problem may have been eliminated with the work that Chris Behrens
and Brian Elliott were doing, but I'm not sure.

But certainly, the old approach of having the compute node broadcast
status every N seconds is not suitable and was eliminated a long time ago.
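
To make the race concrete, a contrived sketch (not the actual scheduler
code) of two request greenthreads reading the same cached host state
before either one consumes from it:

    import eventlet

    # one shared, cached view of host resources, as the scheduler holds it
    host_state = {"host1": {"free_ram_mb": 2048}}

    def pick_host(ram_needed):
        # both greenthreads evaluate the filter against the same snapshot ...
        candidates = [h for h, s in host_state.items()
                      if s["free_ram_mb"] >= ram_needed]
        eventlet.sleep(0)       # yield to the hub, as any real I/O would
        # ... and both then consume from it, double-booking the host
        host = candidates[0]
        host_state[host]["free_ram_mb"] -= ram_needed
        return host

    pool = eventlet.GreenPool()
    print(list(pool.imap(pick_host, [1536, 1536])))  # ['host1', 'host1']
    print(host_state)                                # free_ram_mb is now -1024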
<div class="im"><br>
><br>
><br>
> Best regards,<br>
> Boris Pavlovic<br>
><br>
> Mirantis Inc.<br>
><br>
><br>
><br>
> On Sat, Jul 20, 2013 at 12:23 AM, Sandy Walsh <sandy.walsh@rackspace.com> wrote:
>
>
>
> On 07/19/2013 05:01 PM, Boris Pavlovic wrote:
> > Sandy,
> >
> > Hm, I don't know that algorithm. But our approach doesn't have an
> > exponential exchange.
> > I don't think that in a 10k-node cloud we will have a problem with 150
> > RPC calls/sec. Even at 100k we will have only 1.5k RPC calls/sec.
> > Moreover, compute nodes already update their state in the DB through the
> > conductor, which produces the same number of RPC calls.
> >
> > So I don't see any explosion here.
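> >
> > (Back-of-envelope behind those numbers, assuming the default 60 sec
> > periodic task interval -- just illustrative arithmetic:)
> >
> >     PERIODIC_INTERVAL = 60.0   # seconds between state updates per node
> >     for nodes in (10000, 100000):
> >         print("%d nodes -> ~%d state updates/sec"
> >               % (nodes, nodes / PERIODIC_INTERVAL))
> >     # 10000 nodes -> ~166 state updates/sec
> >     # 100000 nodes -> ~1666 state updates/sec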
> >
> Sorry, I was commenting on Soren's suggestion from way back (essentially
> listening on a separate exchange for each unique flavor ... so no
> scheduler was needed at all). It was a great idea, but fell apart rather
> quickly.
>
> The existing approach the scheduler takes is expensive (asking the db
> for state of all hosts) and polling the compute nodes might be do-able,
> but you're still going to have latency problems waiting for the
> responses (the states are invalid nearly immediately, especially if a
> fill-first scheduling algorithm is used). We ran into this problem
> before in an earlier scheduler implementation. The round-tripping kills.
>
> We have a lot of really great information on Host state in the form of
> notifications right now. I think having a service (or notification
> driver) listening for these and keeping the HostState incrementally
> updated (and reported back to all of the schedulers via the fanout
> queue) would be a better approach.
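>
> Very roughly the shape I have in mind (a hand-wavy sketch with hypothetical
> names, not a real nova/oslo API):
>
>     class HostStateTracker(object):
>         """Builds an in-memory view of host state from notifications and
>         periodically fans it out to all running schedulers."""
>
>         def __init__(self, fanout_publish):
>             self.hosts = {}                  # hostname -> resource dict
>             self.fanout_publish = fanout_publish   # e.g. an rpc fanout cast
>
>         def on_notification(self, event_type, payload):
>             # e.g. instance create/delete notifications tell us how the
>             # host's resources changed; apply the delta incrementally
>             state = self.hosts.setdefault(
>                 payload["host"], {"free_ram_mb": 0, "free_disk_gb": 0})
>             state["free_ram_mb"] += payload.get("ram_delta_mb", 0)
>             state["free_disk_gb"] += payload.get("disk_delta_gb", 0)
>
>         def broadcast(self):
>             # push the incremental view to every scheduler over the fanout
>             self.fanout_publish({"hosts": self.hosts})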
>
> -S
>
>
> >
> > Best regards,
> > Boris Pavlovic
> >
> > Mirantis Inc.
> >
> >
> > On Fri, Jul 19, 2013 at 11:47 PM, Sandy Walsh <sandy.walsh@rackspace.com> wrote:
> >
> >
> >
> > On 07/19/2013 04:25 PM, Brian Schott wrote:
> > > I think Soren suggested this way back in Cactus to use MQ for compute
> > > node state rather than database and it was a good idea then.
> >
> > The problem with that approach was the number of queues went exponential
> > as soon as you went beyond simple flavors. Add Capabilities or other
> > criteria and you get an explosion of exchanges to listen to.
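> >
> > (Illustrative arithmetic only: every extra scheduling dimension multiplies
> > the number of exchanges, and each boolean capability doubles it.)
> >
> >     flavors = 20
> >     cpu_archs = 2
> >     boolean_caps = 4          # each capability flag doubles the count
> >
> >     exchanges = flavors * cpu_archs * 2 ** boolean_caps
> >     print(exchanges)          # 640 exchanges to declare and listen on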
> >
> >
> >
> > > On Jul 19, 2013, at 10:52 AM, Boris Pavlovic <boris@pavlovic.me> wrote:
> > >
> > >> Hi all,
> > >>
> > >>
> > >> In Mirantis, Alexey Ovtchinnikov and I are working on nova scheduler
> > >> improvements.
> > >>
> > >> As far as we can see the problem, the scheduler currently has two
> > >> major issues:
> > >>
> > >> 1) Scalability. Factors that contribute to bad scalability are these:
> > >> *) Each compute node, every periodic task interval (60 sec by default),
> > >> updates its resource state in the DB.
> > >> *) On every boot request the scheduler has to fetch information about
> > >> all compute nodes from the DB.
> > >>
> > >> 2) Flexibility. Flexibility perishes due to problems with:
> > >> *) Adding new complex resources (such as big lists of complex objects,
> > >> e.g. required by PCI Passthrough,
> > >> https://review.openstack.org/#/c/34644/5/nova/db/sqlalchemy/models.py)
> > >> *) Using different sources of data in the scheduler, for example from
> > >> cinder or ceilometer
> > >> (as required by the Volume Affinity Filter,
> > >> https://review.openstack.org/#/c/29343/)
> > >>
> > >>
> > >> We found a simple way to mitigate these issues by avoiding DB usage
> > >> for host state storage.
> > >>
> > >> A more detailed discussion of the problem statement and one possible
> > >> solution can be found here:
> > >>
> > >> https://docs.google.com/document/d/1_DRv7it_mwalEZzLy5WO92TJcummpmWL4NWsWf0UWiQ/edit#
> > >>
> > >>
> > >> Best regards,
> > >> Boris Pavlovic
> > >>
> > >> Mirantis Inc.
> > >>

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev