[openstack-dev] [zaqar] [marconi] Juno Performance Testing (Round 1)

Flavio Percoco flavio at redhat.com
Wed Aug 27 07:10:16 UTC 2014


On 08/26/2014 11:41 PM, Kurt Griffiths wrote:
> Hi folks,
> 
> I ran some rough benchmarks to get an idea of where Zaqar currently stands
> re latency and throughput for Juno. These results are by no means
> conclusive, but I wanted to publish what I had so far for the sake of
> discussion.
> 
> Note that these tests do not include results for our new Redis driver, but
> I hope to make those available soon.
> 
> As always, the usual disclaimers apply (i.e., benchmarks mostly amount to
> lies; these numbers are only intended to provide a ballpark reference; you
> should perform your own tests, simulating your specific scenarios and
> using your own hardware; etc.).
> 
> ## Setup ##
> 
> Rather than VMs, I provisioned some Rackspace OnMetal[8] servers to
> mitigate noisy-neighbor effects when running the performance tests:
> 
> * 1x Load Generator
>     * Hardware 
>         * 1x Intel Xeon E5-2680 v2 2.8GHz
>         * 32 GB RAM
>         * 10Gbps NIC
>         * 32GB SATADOM
>     * Software
>         * Debian Wheezy
>         * Python 2.7.3
>         * zaqar-bench from trunk with some extra patches[1]
> * 1x Web Head
>     * Hardware 
>         * 1x Intel Xeon E5-2680 v2 2.8GHz
>         * 32 GB RAM
>         * 10Gbps NIC
>         * 32GB SATADOM
>     * Software
>         * Debian Wheezy
>         * Python 2.7.3
>         * zaqar server from trunk @47e07cad
>             * storage=mongodb
>             * partitions=4
>             * MongoDB URI configured with w=majority
>         * uWSGI + gevent
>             * config: http://paste.openstack.org/show/100592/
>             * app.py: http://paste.openstack.org/show/100593/
> * 3x MongoDB Nodes
>     * Hardware 
>         * 2x Intel Xeon E5-2680 v2 2.8GHz
>         * 128 GB RAM
>         * 10Gbps NIC
>         * 2x LSI Nytro WarpDrive BLP4-1600[2]
>     * Software
>         * Debian Wheezy
>         * mongod 2.6.4
>             * Default config, except setting replSet and enabling periodic
>               logging of CPU and I/O
>             * Journaling enabled
>             * Profiling on message DBs enabled for requests over 10ms
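>
> As an aside, that last profiling knob can also be flipped at runtime.
> Here is a minimal sketch using pymongo (the host names, replica set
> name, and DB name are made up; note the w=majority matching the URI
> setting above):
>
>     import pymongo
>
>     # Hypothetical hosts and replica set name; w=majority mirrors
>     # the write concern used by the zaqar web head.
>     uri = 'mongodb://db1,db2,db3/?replicaSet=zaqar&w=majority'
>     mongo = pymongo.MongoClient(uri)
>
>     # Profile only operations slower than 10ms on one of the
>     # (hypothetical) message DBs.
>     db = mongo['zaqar_messages_p0']
>     db.set_profiling_level(pymongo.SLOW_ONLY, slow_ms=10)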
> 
> For generating the load, I used the zaqar-bench tool we created during
> Juno as a stepping stone toward integration with Rally. Although the tool
> is still fairly rough, I thought it good enough to provide some useful
> data[3]. The tool uses the python-zaqarclient library.
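>
> For anyone who hasn't tried the client yet, the benchmark workers
> essentially boil down to a handful of calls like these (a rough
> sketch against the v1 API; the endpoint and queue name are
> placeholders, and auth is disabled as it was for these tests):
>
>     from zaqarclient.queues.v1 import client
>
>     # Point the client at the web head.
>     cli = client.Client('http://web-head:8888', conf={})
>     queue = cli.queue('bench-queue-0')
>
>     # Producer: post one message per request.
>     queue.post({'body': {'event': 'sample'}, 'ttl': 300})
>
>     # Observer: list up to 5 messages per request (echo=True so a
>     # client also sees messages it posted itself).
>     for msg in queue.messages(limit=5, echo=True):
>         print(msg.body)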
> 
> Note that I didn’t push the servers particularly hard for these tests; web
> head CPUs averaged around 20%, while the mongod primary's CPU usage peaked
> at around 10%, with the DB lock percentage peaking at 5%.
> 
> Several different messaging patterns were tested, taking inspiration
> from: https://wiki.openstack.org/wiki/Use_Cases_(Zaqar)
> 
> Each test was executed three times and the best time recorded.
> 
> A ~1K sample message (1398 bytes) was used for all tests.
> 
> ## Results ##
> 
> ### Event Broadcasting (Read-Heavy) ###
> 
> OK, so let's say you have a somewhat low-volume source, but tons of event
> observers. In this case, the observers easily outpace the producer, making
> this a read-heavy workload.
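>
> Concretely, each observer process amounts to something like the
> following (a simplified sketch; the real zaqar-bench loop also
> records per-request latency):
>
>     import gevent
>     from gevent import monkey
>     monkey.patch_all()  # make the client's HTTP calls cooperative
>
>     from zaqarclient.queues.v1 import client
>
>     def observe(queue_name):
>         cli = client.Client('http://web-head:8888', conf={})
>         queue = cli.queue(queue_name)
>         while True:
>             # List up to 5 messages per request, per the options
>             # below.
>             for msg in queue.messages(limit=5, echo=True):
>                 pass  # a real observer would act on msg.body
>
>     # 25 gevent workers per process, spread across the 4 queues.
>     workers = [gevent.spawn(observe, 'bench-queue-%d' % (i % 4))
>                for i in range(25)]
>     gevent.joinall(workers)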
> 
> Options
>     * 1 producer process with 5 gevent workers
>         * 1 message posted per request
>     * 2 observer processes with 25 gevent workers each
>         * 5 messages listed per request by the observers
>     * Load distributed across 4[7] queues
>     * 10-second duration[4]
> 
> Results
>     * Producer: 2.2 ms/req,  454 req/sec
>     * Observer: 1.5 ms/req, 1224 req/sec
> 
> ### Event Broadcasting (Balanced) ###
> 
> This test uses the same number of producer and observer processes, but
> note that the observers are still listing (up to) 5 messages at a
> time[5], so they still outpace the producers, though not as quickly as
> before.
> 
> Options
>     * 2 producer processes with 10 gevent workers each
>         * 1 message posted per request
>     * 2 observer processes with 25 gevent workers each
>         * 5 messages listed per request by the observers
>     * Load distributed across 4 queues
>     * 10-second duration
> 
> Results
>     * Producer: 2.2 ms/req, 883 req/sec
>     * Observer: 2.8 ms/req, 348 req/sec
> 
> ### Point-to-Point Messaging ###
> 
> In this scenario I simulated one client sending messages directly to a
> different client. Only one queue is required in this case[6].
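>
> By the way, if the target client were to reply, the same queue could
> carry the response: in the v1 API a client sees its own messages only
> when it asks for them (echo=true), so each side receives just the
> other side's messages (see also [6]). A sketch, reusing the queue
> object from the earlier snippet (handle_reply is a hypothetical
> helper):
>
>     # Requester: post a request, then poll the same queue for the
>     # reply. With echo left false, the requester never sees its
>     # own post.
>     queue.post({'body': {'cmd': 'ping'}, 'ttl': 300})
>     for msg in queue.messages(limit=1):
>         handle_reply(msg.body)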
> 
> Note the higher latency. While running the test there were 1-2 message
> posts that skewed the average by taking much longer (~100ms) than the
> others to complete. Such outliers are probably present in the other tests
> as well, and further investigation is needed to discover the root cause.
> 
> Options
>     * 1 producer process with 1 gevent worker
>         * 1 message posted per request
>     * 1 observer process with 1 gevent worker
>         * 1 message listed per request
>     * All load sent to a single queue
>     * 10-second duration
> 
> Results
>     * Producer: 5.5 ms/req, 179 req/sec
>     * Observer: 3.5 ms/req, 278 req/sec
> 
> ### Task Distribution ###
> 
> This test uses several producers and consumers in order to simulate
> distributing tasks to a worker pool. In contrast to the observer worker
> type, consumers claim and delete messages in such a way that each message
> is processed once and only once.
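>
> In terms of the client API, each consumer worker's loop looks roughly
> like this (a sketch; the claim TTL and grace values are arbitrary):
>
>     from zaqarclient.queues.v1 import client
>
>     cli = client.Client('http://web-head:8888', conf={})
>     queue = cli.queue('bench-queue-0')
>
>     while True:
>         # Claim a batch of up to 5 messages.
>         claim = queue.claim(ttl=60, grace=60, limit=5)
>         for msg in claim:
>             # ... process msg.body here ...
>             # Deleting the message guarantees no other consumer
>             # will ever see it (once and only once).
>             msg.delete()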
> 
> Options
>     * 2 producer processes with 25 gevent workers each
>         * 1 message posted per request
>     * 2 consumer processes with 25 gevent workers each
>         * 5 messages claimed per request, then deleted one by one before
>           claiming the next batch of messages
>     * Load distributed across 4 queues
>     * 10-second duration
> 
> Results
>     * Producer: 2.5 ms/req, 798 req/sec
>     * Consumer
>         * Claim: 8.4 ms/req
>         * Delete: 2.5 ms/req
>         * 813 req/sec (overall)
> 
> ### Auditing / Diagnostics ###
> 
> This test is the same as the Task Distribution test, but adds a few
> observers to the mix:
> 
> Options
>     * 2 producer processes with 25 gevent workers each
>         * 1 message posted per request
>     * 2 consumer processes with 25 gevent workers each
>         * 5 messages claimed per request, then deleted one by one before
>           claiming the next batch of messages
>     * 1 observer process with 5 gevent workers
>         * 5 messages listed per request
>     * Load distributed across 4 queues
>     * 10-second duration
> 
> Results
>     * Producer: 2.2 ms/req, 878 req/sec
>     * Consumer
>         * Claim: 8.2 ms/req
>         * Delete: 2.3 ms/req
>         * 876 req/sec (overall)
>     * Observer: 7.4 ms/req, 133 req/sec
> 
> ## Conclusions ##
> 
> While more testing is needed to track performance against increasing
> load (spoiler: latency will increase), these initial results are
> encouraging; turning around requests in ~10 (or even ~20) ms is fast
> enough for a variety of use cases. I anticipate enabling the keystone
> middleware will add 1-2 ms (assuming tokens are cached).
> 
> Let’s keep digging and see what we can learn, and what needs to be
> improved. 

Kurt,

Thanks a lot for working on this. These results are indeed encouraging
from a performance point of view. I'm looking forward to seeing the
results of these tests with the new Redis driver.

I think the next round should focus on running the same tests with
Keystone enabled, since I'd expect most deployers to use Zaqar with it.

Flavio

> 
> @kgriffs
> 
> --------
> 
> [1]: https://review.openstack.org/#/c/116384/
> [2]: Yes, I know that's some crazy IOPS, but there is plenty of RAM to
> avoid paging, so you should be able to get similar results with some
> regular disks, assuming they are decent enough to support enabling
> journaling (if you need that level of durability).
> [3]: It would be interesting to verify the results presented here using
> Tsung and/or JMeter; zaqar-bench isn't particularly efficient, but it does
> provide the potential to do some interesting reporting, such as measuring
> the total end-to-end time of enqueuing and subsequently dequeuing each
> message (TODO). In any case, I'd love to see the team set up a
> benchmarking cluster that runs 2-3 tools regularly (or as part of every
> patch) and reports the results so we always know where we stand.
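>
> Measuring that end-to-end time could be as simple as embedding a
> timestamp in each message body. A sketch, reusing the queue object
> from the earlier snippet (valid here only because producers and
> observers share the load generator's clock):
>
>     import time
>
>     # Producer: record the enqueue time in the body.
>     queue.post({'body': {'ts': time.time()}, 'ttl': 300})
>
>     # Observer: compute enqueue-to-dequeue time on receipt.
>     for msg in queue.messages(limit=5, echo=True):
>         print(time.time() - msg.body['ts'])
>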
> [4]: Yes, I know this is a short duration; I'll try to do some longer
> tests in my next round of benchmarking.
> [5]: In a real app, messages will usually be requested in batches.
> [6]: In this test, the target client does not send a response message back
> to the sender. However, if it did, the test would still only require a
> single queue, since in Zaqar queues are duplex.
> [7]: Chosen somewhat arbitrarily.
> [8]: One might argue that the only thing these performance tests show
> is that *OnMetal* is fast. However, as I pointed out, there was plenty
> of headroom left on these servers during the tests, so similar results
> should be achievable using more modest hardware.
> 


-- 
@flaper87
Flavio Percoco


