[openstack-dev] [oslo][messaging][zmq][performance]

Oleksii Zamiatin ozamiatin at mirantis.com
Fri Mar 25 12:28:35 UTC 2016


Hi everyone!

Here is the status update for zmq driver upstream development in M and plans for N

What was done:

1. Dedicated pattern usage was finally implemented: DEALER/ROUTER for CALL, PUSH/PULL for CAST, PUB/SUB for fanout.
	Previously everything worked over DEALER/ROUTER, which was not optimal.
2. Implemented support for Sentinel clustering in the Redis matchmaker (thanks to Alexey Yelistratov).
3. Smarter (retry-based) communication between Redis and services:
	* Dynamic updates
	* Records TTL
4. Transport URL was finally supported.
5. Added a full tempest gate with Neutron for zmq (thanks to Dmitry Ukhlov).
6. Performed successful multi-node deployment testing (thanks to Alexey Yelistratov):
	* devstack multiple nodes
	* Rally nova-boot 200 nodes + fuel deployment
7. Performed benchmark testing with the simulator (o.m/tools/simulator.py) on a 20-node deployment (thanks to Yulia Portnova):
	* CALL: ~29k msg/sec, compared to ~2k msg/sec for a RabbitMQ cluster
8. Finally removed the IPC proxy, which could cause problems in container-based deployments like Kolla.

And many other smaller bug-fixes.
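To illustrate the pattern choice in item 1, here is a minimal standalone sketch of a DEALER/ROUTER round-trip, the socket pair the driver now uses for CALL. This is not the driver's actual code; the endpoint and payloads are made up for the demo.

```python
import zmq

ctx = zmq.Context.instance()

# "Server" side: ROUTER keeps the identity of each peer, so it can
# route each reply back to the caller that sent the request.
router = ctx.socket(zmq.ROUTER)
router.bind("tcp://127.0.0.1:5570")

# "Client" side: DEALER is not locked into the strict send/recv
# lockstep of REQ, which is why it fits RPC-style CALL better.
dealer = ctx.socket(zmq.DEALER)
dealer.connect("tcp://127.0.0.1:5570")

dealer.send(b"ping")

# ROUTER prepends the sender's identity frame on receive; echo it
# back on send so the reply reaches the right DEALER.
identity, payload = router.recv_multipart()
router.send_multipart([identity, b"pong"])

reply = dealer.recv()
print(reply)  # b'pong'
```

The identity-frame envelope is what lets one ROUTER server interleave replies to many concurrent DEALER callers.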


So, we have gotten closer to using zmq in real environments, but more work is still needed to make this happen.
Below is the list of known issues we received as feedback from testing, along with other things we would like to fix in the driver.
(For the whole list of known bugs, please follow the link [1].)

The most important issues to fix in N:

1. The ZMQ driver eats too many TCP sockets [2].
	We ran into this problem with the current direct client-server connection architecture. The solution is to use
	stateless transparent remote proxies to reduce the number of connections; this is in progress [3].

2. Implement retries for unacknowledged messages and heartbeats [4], [5], [6],
	in order to have reliable messaging in the face of bad networks and proxy failures.

3. Fix interaction with the name service and make updates propagate properly on both sides (HA-related) [7].
	Properly reconnect restarted services.

4. Get Ceilometer working with the driver. [8]

5. Support PGM protocol for multicast as an option [9]

6. Support encryption for messages (libsodium etc.)
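A hedged sketch of the idea behind issue 1 and [3]: a stateless ROUTER/DEALER proxy lets N clients and M servers meet at a single hop, so each process holds one socket to the proxy instead of participating in a full N x M mesh. Endpoints and the REQ/REP endpoints are illustrative only, not the driver's actual code.

```python
import threading
import zmq

ctx = zmq.Context.instance()

# Frontend: clients connect here.  Backend: servers connect here.
frontend = ctx.socket(zmq.ROUTER)
frontend.bind("tcp://127.0.0.1:5571")
backend = ctx.socket(zmq.DEALER)
backend.bind("tcp://127.0.0.1:5572")

# zmq.proxy blocks while shuttling frames both ways, so run it in
# a background thread.
threading.Thread(target=zmq.proxy, args=(frontend, backend),
                 daemon=True).start()

# A client and a server that only ever talk to the proxy.
client = ctx.socket(zmq.REQ)
client.connect("tcp://127.0.0.1:5571")
server = ctx.socket(zmq.REP)
server.connect("tcp://127.0.0.1:5572")

client.send(b"call")
server.send(server.recv())  # echo the request back through the proxy
reply = client.recv()
print(reply)  # b'call'
```

REQ/REP is used here only to keep the demo short; REP preserves the routing envelope added by the proxy's ROUTER frontend, so the reply finds its way back without the proxy keeping any state.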

All other issues can be found at the link [1].
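Issue 2 above (retries for unacknowledged messages) boils down to polling for an ack with a timeout and resending. Below is a hedged, single-process sketch of that loop; the demo server, timeout values, and payloads are invented for illustration and are not the driver's implementation.

```python
import zmq

TIMEOUT_MS = 200   # how long to wait for an ack before resending
RETRIES = 3        # give up after this many attempts

ctx = zmq.Context.instance()

# Demo "server" that acks whatever it receives.
server = ctx.socket(zmq.ROUTER)
server.bind("tcp://127.0.0.1:5573")

client = ctx.socket(zmq.DEALER)
client.connect("tcp://127.0.0.1:5573")

poller = zmq.Poller()
poller.register(client, zmq.POLLIN)

reply = None
for attempt in range(RETRIES):
    client.send(b"msg-1")
    # Demo server: ack the first copy it sees within the timeout.
    if server.poll(TIMEOUT_MS):
        ident, payload = server.recv_multipart()
        server.send_multipart([ident, b"ack:" + payload])
    # Client: resend on the next iteration if no ack arrives in time.
    if poller.poll(TIMEOUT_MS):
        reply = client.recv()
        break

print(reply)  # b'ack:msg-1'
```

A real implementation also needs message IDs so the server can deduplicate the resent copies, which this sketch omits.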


What kinds of testing are planned:

1. HA testing:
	* Restarting/adding/removing nodes: test reconnects and proper messaging-layer recovery, send-retries, etc.
	* Bad-network emulation: also test send-retry correctness

2. Benchmark testing:
	* Increase the load and the number of nodes
	* Test different kinds of deployment configuration (different numbers of proxies, direct connections)

3. Try Rally with at least 500 nodes.
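The throughput measurement that o.m/tools/simulator.py performs can be reduced, very roughly, to counting messages over a socket per unit of time. The sketch below does this over a plain PUSH/PULL pair on loopback; it is only an illustration of the method, and single-process loopback numbers are not comparable to the 20-node results quoted above.

```python
import time
import zmq

N = 10000  # arbitrary demo message count
ctx = zmq.Context.instance()

pull = ctx.socket(zmq.PULL)
pull.bind("tcp://127.0.0.1:5574")
push = ctx.socket(zmq.PUSH)
push.connect("tcp://127.0.0.1:5574")

start = time.time()
for _ in range(N):
    push.send(b"x")
# Receive everything back and compute a crude msg/sec figure.
received = sum(1 for _ in range(N) if pull.recv())
rate = received / (time.time() - start)
print("received %d msgs, ~%.0f msg/sec" % (received, rate))
```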

Many thanks to Oslo and Performance teams for help with testing and reviews.

Thanks,
Oleksii

Links:
1 - https://bugs.launchpad.net/oslo.messaging/+bugs?field.tag=zmq
2 - https://bugs.launchpad.net/oslo.messaging/+bug/1555007
3 - https://review.openstack.org/#/c/287094/
4 - https://bugs.launchpad.net/oslo.messaging/+bug/1497306
5 - https://bugs.launchpad.net/oslo.messaging/+bug/1503295
6 - https://bugs.launchpad.net/oslo.messaging/+bug/1497302
7 - https://bugs.launchpad.net/oslo.messaging/+bug/1548836
8 - https://bugs.launchpad.net/oslo.messaging/+bug/1539047
9 - https://bugs.launchpad.net/oslo.messaging/+bug/1524100

