[openstack-dev] [tricircle] multiple cascade services
joehuang
joehuang at huawei.com
Sat Aug 29 01:42:26 UTC 2015
Hi,
I think there may be some misunderstanding of the PoC design. (The proxy node only listens for the RPC sent to the compute-node/cinder-volume/L2/L3 agents…)
1) The cascading layer, including the proxy nodes, is assumed to run in VMs rather than on physical servers (although you can do that too). Even in the CJK (China, Japan, Korea) intercloud, the cascading layer, including the API, message bus, DB, and proxy nodes, runs in VMs.
2) For proxy nodes running in VMs, it's not strange for multiple proxy nodes to run on one physical server. And if the load of one proxy node increases, it's easy to move the VM from one physical server to another; this is quite mature technology and easy to monitor and manage. Most virtualization platforms also support hot scale-up of a single virtual machine.
3) In some scenarios ZooKeeper is already used to manage the proxy node roles and membership, and a backup node will take over the responsibility of a failed node.
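For illustration, a minimal sketch of that failover pattern with the kazoo ZooKeeper client (the hosts, path, and identifier below are assumptions, not the PoC code):

    # Sketch: proxy-node failover via ZooKeeper leader election (kazoo).
    # All candidate proxies for one bottom site contend on the same path;
    # when the active node's session dies, a backup is elected.
    import time
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
    zk.start()

    def serve_as_proxy():
        # Only the elected leader runs this; in the real proxy this is
        # where RPC traffic for the bottom OpenStack would be forwarded.
        while True:
            time.sleep(1)

    election = zk.Election("/tricircle/proxy/site-a", identifier="proxy-node-1")
    election.run(serve_as_proxy)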
So I do not see that the "fake node" mode brings extra benefit. On the other hand, the "fake node" adds additional complexity:
1) The complexity of the code in the cascade service, which has to implement both the RPC to the scheduler and the RPC to the compute node / cinder volume.
2) How to judge the load of a "fake node". If all "fake nodes" run flatly in the same process (no dedicated process or thread, just a symbol), then how can you judge the load of a "fake node"? By message count? But the message count does not imply the load. Load is usually measured through CPU utilization and memory occupancy, so how do you calculate the load of each "fake node" and then decide which nodes to move to another physical server? And how do you manage these "fake nodes" in ZooKeeper-like cluster ware? If instead you make each fake node run in its own process or thread space, then you also need to manage the "fake node" to process/thread relationship.
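To make the measurement problem concrete: the standard tools report load per process, so if every fake node is just a symbol inside one shared process, there is nothing per-fake-node to measure. A minimal sketch with psutil (purely illustrative):

    # Sketch: load is naturally measured per *process*, not per fake node.
    # If all fake nodes share this one process, the numbers below cannot
    # be attributed to any individual fake node.
    import os
    import psutil

    proc = psutil.Process(os.getpid())
    cpu = proc.cpu_percent(interval=1.0)   # CPU use of the whole process
    rss = proc.memory_info().rss           # resident memory of the whole process
    print("process cpu=%.1f%% rss=%d bytes" % (cpu, rss))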
I admit that proposal 3 is much more complex to make work for flexible load balancing. We have to record a relative stamp for each message in the queue, pick the message from the message bus, put the message into a per-site task queue in the DB, and then execute these tasks in order.
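A rough sketch of what that per-site ordered task queue implies, using sqlite for brevity (the schema and names are assumptions, not an agreed design):

    # Sketch of proposal 3: stamp each message per site, queue it in the
    # DB, and drain each site's queue strictly in order.
    import sqlite3

    db = sqlite3.connect("tasks.db")
    db.execute("""CREATE TABLE IF NOT EXISTS site_tasks (
                      site TEXT, seq INTEGER, payload TEXT,
                      done INTEGER DEFAULT 0,
                      PRIMARY KEY (site, seq))""")

    def enqueue(site, payload):
        # Stamp the message when it is picked off the message bus so
        # per-site ordering is preserved.
        seq = db.execute("SELECT COALESCE(MAX(seq), 0) + 1 "
                         "FROM site_tasks WHERE site = ?", (site,)).fetchone()[0]
        db.execute("INSERT INTO site_tasks (site, seq, payload) "
                   "VALUES (?, ?, ?)", (site, seq, payload))
        db.commit()

    def run_next(site, execute):
        # Execute the oldest unfinished task for this site, in order.
        row = db.execute("SELECT seq, payload FROM site_tasks WHERE site = ? "
                         "AND done = 0 ORDER BY seq LIMIT 1", (site,)).fetchone()
        if row:
            execute(row[1])
            db.execute("UPDATE site_tasks SET done = 1 "
                       "WHERE site = ? AND seq = ?", (site, row[0]))
            db.commit()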
As described above, proposal 2 does not bring extra benefit, so if we don't want to strive for the third direction, we'd better fall back to proposal 1.
Best Regards
Chaoyi Huang ( Joe Huang )
From: eran at gampel.co.il [mailto:eran at gampel.co.il] On Behalf Of Eran Gampel
Sent: Thursday, August 27, 2015 7:07 PM
To: joehuang; Irena Berezovsky; Eshed Gal-Or; Ayal Baron; OpenStack Development Mailing List (not for usage questions); caizhiyuan (A); Saggi Mizrahi; Orran Krieger; Gal Sagie; Zhipeng Huang
Subject: Re: [openstack-dev][tricircle] multiple cascade services
Hi,
Please see my comments inline
BR,
Eran
Hello,
As we discussed in yesterday's meeting, the point of contention is how to scale out the cascade services.
1) In the PoC, one proxy node only forwards to one bottom OpenStack. The proxy node is added to a corresponding AZ, and multiple proxy nodes for one bottom OpenStack are feasible by adding more proxy nodes into that AZ; the proxy nodes are then scheduled as usual.
Is this perfect? No. Because a VM's host attribute is bound to a specific proxy node, these multiple proxy nodes can't work in cluster mode, and each proxy node has to be backed up by one slave node.
[Eran] I agree with this point - in the PoC you had a limitation of a single active proxy per bottom site. In addition, each proxy could only support a single bottom site by design.
2) The fake node introduced in the cascade service.
Because a fanout RPC call is assumed for the Neutron API, multiple fake nodes for one bottom OpenStack are not allowed.
[Eran] In fact, this is not a limitation in the current design. We could have multiple "fake nodes" to handle the same bottom site, but only one that is active. If this active node becomes unavailable, one of the other "passive" nodes can take over with leader election or any other known design pattern (it's an implementation decision).
And because the traffic to one bottom OpenStack is unpredictable, and moving these fake nodes dynamically among cascade services is very complicated, we can't deploy multiple fake nodes in one cascade service.
[Eran] I'm not sure I follow you on this point... as we see it, there are 3 places where load is an issue (and a potential bottleneck):
1. API + message queue + database
2. Cascading Service itself (dependency builder, communication service, DAL)
3. Task execution
I think you were concerned about #2, which in our design must be single-active per bottom site (to maintain the order of task execution).
In our opinion, the heaviest part is actually #3 (task execution), which is delegated to a separate execution path (Mistral workflow or otherwise).
If one Cascading Service handles multiple Bottom sites and at some point in time we wish it to handle just one Bottom site and move the rest to a different Cascading Service instance, that is possible.
The way we see it, we have multiple Fake Nodes running in multiple Cascading Services, in Active-Passive mode. That way, when one Cascading Service instance becomes overloaded, it can give up its "leadership" of active fake nodes, and one of the other Cascading Services will take over (by leader election, or otherwise). This is a very common design pattern; we don't see anything special or complicated here.
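For concreteness, a minimal sketch of that hand-over with a kazoo-style election (the threshold, path, and task loop are hypothetical, not the proposed implementation):

    # Sketch: an overloaded Cascading Service relinquishes leadership of a
    # fake node; a passive peer contending on the same path is then elected.
    import time
    import psutil
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1:2181")
    zk.start()

    def lead_fake_node():
        # Serve the bottom site while this instance is the active fake node.
        while True:
            time.sleep(1)  # stand-in for handling one queued request
            if psutil.cpu_percent(interval=0.1) > 90.0:
                # Returning from the leader function releases leadership,
                # so one of the passive contenders takes over.
                return

    while True:
        election = zk.Election("/tricircle/fake-node/site-a", identifier="svc-2")
        election.run(lead_fake_node)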
In the end, we have to deploy one fake node per cascade service.
And one cascade service per bottom OpenStack limits how much burst traffic one bottom OpenStack can handle.
And you have to back up the cascade service.
[Eran] This is correct. In the worst case of traffic burst to a single bottom site, a single Cascading Service will handle a single Fake Node exclusively, and it is not possible to handle a single Bottom Site with more than a single Fake Node at any given time.
Having said that, we don't see a scenario where the Fake Node / Cascading Service will become a bottleneck. We think that #3 (task execution) and #1 (message queue, API and database) will choke first, probably because the OpenStack components in the Top and Bottom sites will not be able to handle the burst (which is a completely different story).
3) From the beginning, I have preferred to run multiple cascade services in parallel, all of them working in load-balanced cluster mode.
[Eran] I believe we already discussed this before - It is actually not possible.
If you did that, you would have race conditions and mis-ordering of actions, and an inconsistent state in the Bottom sites.
For example, if the Top user did:
#1 create security group "111"
#2 update security group "111" with "Allow *"
#3 update security group "111" with "Drop *"
If you have more than a single Cascading Service responsible for site "A", you don't know what the order of actions will be.
In the example I gave, you may end up with site "A" having security group "111" with "Allow *" or with "Drop *".
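The nondeterminism is easy to reproduce; here is a toy sketch with two competing workers (no real OpenStack pieces involved, purely illustrative):

    # Toy sketch of the race: two workers consume ordered updates from one
    # queue but apply them concurrently, so the final state is undefined.
    import queue
    import threading

    updates = queue.Queue()
    state = {}

    def worker():
        while True:
            try:
                rule = updates.get_nowait()
            except queue.Empty:
                return
            state["sg-111"] = rule  # no ordering guarantee across workers

    updates.put("Allow *")
    updates.put("Drop *")
    workers = [threading.Thread(target=worker) for _ in range(2)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    print(state)  # may end as {'sg-111': 'Allow *'} or {'sg-111': 'Drop *'}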
The APIs (Nova, Cinder, Neutron…) call the cascade service through RPC, and each RPC call is forwarded to only one of the cascade services (the RPC is put into a message bus queue; once one of the cascade services picks up the message, it is removed from the queue and will not be consumed by any other cascade service). When a cascade service receives a message, it starts a task to execute the request. If multiple bottom OpenStacks are involved, for example for networking, then the networking request is forwarded to each bottom OpenStack where the affected resources (VMs, floating IPs) reside.
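A minimal sketch of that competing-consumers pattern with oslo.messaging (the topic, server name, and endpoint are illustrative, not Tricircle code):

    # Sketch: several cascade service instances listen on one RPC topic;
    # a message sent to the topic is consumed by exactly one instance.
    from oslo_config import cfg
    import oslo_messaging

    class CascadeEndpoint(object):
        def handle_api_request(self, ctxt, request):
            # Start a task that forwards the request to the bottom
            # OpenStack site(s) holding the affected resources.
            pass

    transport = oslo_messaging.get_transport(cfg.CONF)
    # Every instance uses the same topic but its own server name; RPC
    # calls addressed to the topic alone are load-balanced across them.
    target = oslo_messaging.Target(topic="cascade_service", server="cascade-1")
    server = oslo_messaging.get_rpc_server(transport, target, [CascadeEndpoint()])
    server.start()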
To keep the correct order of operations, every task stores the necessary state in the DB so that an in-flight operation cannot be broken for a single site. (For example, while a VM is being created, reboot is not allowed; such guards are already implemented on the API side of Nova, Cinder, Neutron, …)
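A tiny sketch of that kind of state guard (the field names are illustrative; Nova's real task_state handling is more involved):

    # Sketch: reject an operation while a conflicting task is in flight,
    # in the spirit of Nova's task_state checks.
    def do_reboot(vm):
        pass  # stand-in for the real reboot call

    def reboot(vm):
        if vm.get("task_state") is not None:   # e.g. "creating"
            raise RuntimeError("VM is busy (%s), reboot not allowed"
                               % vm["task_state"])
        vm["task_state"] = "rebooting"
        try:
            do_reboot(vm)
        finally:
            vm["task_state"] = None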
[Eran] This will not enforce order - it will only keep state between non-racing actions. It will not guarantee consistency in the common scenario of multiple updates to a specific resource within a short period, as in the security group example I just gave.
Maybe it will work for a few predictable use cases, but there will always be something else that you did not plan for.
It is ultimately an unsafe design.
If you propose to make the database the coordinator of this process (which I don't see how), you will end up with an even worse bottleneck - in the database.
In this way, we can dynamically add cascade service nodes and balance the traffic dynamically.
Best Regards
Chaoyi Huang ( Joe Huang )