[openstack-dev] [zaqar] [marconi] Juno Performance Testing (Round 1)

Kurt Griffiths kurt.griffiths at rackspace.com
Tue Aug 26 21:41:28 UTC 2014


Hi folks,

I ran some rough benchmarks to get an idea of where Zaqar currently stands
re latency and throughput for Juno. These results are by no means
conclusive, but I wanted to publish what I had so far for the sake of
discussion.

Note that these tests do not include results for our new Redis driver, but
I hope to make those available soon.

As always, the usual disclaimers apply (i.e., benchmarks mostly amount to
lies; these numbers are only intended to provide a ballpark reference; you
should perform your own tests, simulating your specific scenarios and
using your own hardware; etc.).

## Setup ##

Rather than VMs, I provisioned some Rackspace OnMetal[8] servers to
mitigate noisy-neighbor effects while running the performance tests:

* 1x Load Generator
    * Hardware 
        * 1x Intel Xeon E5-2680 v2 2.8 GHz
        * 32 GB RAM
        * 10Gbps NIC
        * 32GB SATADOM
    * Software
        * Debian Wheezy
        * Python 2.7.3
        * zaqar-bench from trunk with some extra patches[1]
* 1x Web Head
    * Hardware 
        * 1x Intel Xeon E5-2680 v2 2.8 GHz
        * 32 GB RAM
        * 10Gbps NIC
        * 32GB SATADOM
    * Software
        * Debian Wheezy
        * Python 2.7.3
        * zaqar server from trunk @47e07cad
            * storage=mongodb
            * partitions=4
            * MongoDB URI configured with w=majority (example form below)
        * uWSGI + gevent
            * config: http://paste.openstack.org/show/100592/
            * app.py: http://paste.openstack.org/show/100593/
* 3x MongoDB Nodes
    * Hardware 
        * 2x Intel Xeon E5-2680 v2 2.8 GHz
        * 128 GB RAM
        * 10Gbps NIC
        * 2x LSI Nytro WarpDrive BLP4-1600[2]
    * Software
        * Debian Wheezy
        * mongod 2.6.4
            * Default config, except setting replSet and enabling periodic
              logging of CPU and I/O
            * Journaling enabled
            * Profiling on message DBs enabled for requests over 10ms
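
For reference, the w=majority setting mentioned above is carried in the
MongoDB connection string handed to the storage driver; the general form
(with placeholder hostnames and replica set name, not the actual test
cluster) looks like:

    mongodb://mongo1:27017,mongo2:27017,mongo3:27017/?replicaSet=<rs-name>&w=majority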

For generating the load, I used the zaqar-bench tool we created during
Juno as a stepping stone toward integration with Rally. Although the tool
is still fairly rough, I thought it good enough to provide some useful
data[3]. The tool uses the python-zaqarclient library.
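
If you want to reproduce the producer/observer traffic outside of
zaqar-bench, the basic calls through python-zaqarclient look roughly like
the sketch below. Treat it as illustrative only: the endpoint and queue
name are placeholders, auth options are omitted, and the exact client
signatures may differ slightly from what zaqar-bench actually does.

    # Rough sketch of the post/list pattern used by the producer and
    # observer workers (python-zaqarclient v1 API).
    from zaqarclient.queues.v1 import client

    cli = client.Client('http://zaqar.example.com:8888', conf={})
    queue = cli.queue('bench-queue-0')  # placeholder queue name

    # Producer: post one ~1K message per request.
    queue.post([{'ttl': 300,
                 'body': {'event': 'example', 'padding': 'x' * 1024}}])

    # Observer: page through messages in batches of (up to) 5 per request.
    for message in queue.messages(limit=5, echo=True):
        print(message.body)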

Note that I didn’t push the servers particularly hard for these tests; web
head CPUs averaged around 20%, while the mongod primary’s CPU usage peaked
at around 10% with DB locking peaking at 5%.

Several different messaging patterns were tested, taking inspiration
from: https://wiki.openstack.org/wiki/Use_Cases_(Zaqar)

Each test was executed three times and the best time recorded.

A ~1K sample message (1398 bytes) was used for all tests.
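
(The payload itself is nothing special; think of something along the lines
of the snippet below, i.e. a small JSON document padded out to the target
size. This is purely illustrative and not the exact payload zaqar-bench
posts.)

    import json

    # Illustrative only: build a JSON body and pad it so the serialized
    # document lands on the target size used in these tests.
    body = {'id': 1234, 'event': 'backup.start', 'padding': ''}
    target_size = 1398  # bytes
    body['padding'] = 'x' * (target_size - len(json.dumps(body)))
    assert len(json.dumps(body)) == target_size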

## Results ##

### Event Broadcasting (Read-Heavy) ###

OK, so let's say you have a somewhat low-volume source, but tons of event
observers. In this case, the observers easily outpace the producer, making
this a read-heavy workload.

Options
    * 1 producer process with 5 gevent workers
        * 1 message posted per request
    * 2 observer processes with 25 gevent workers each
        * 5 messages listed per request by the observers
    * Load distributed across 4[7] queues
    * 10-second duration[4]

Results
    * Producer: 2.2 ms/req,  454 req/sec
    * Observer: 1.5 ms/req, 1224 req/sec

### Event Broadcasting (Balanced) ###

This test uses the same number of producer and observer processes, but
note that the observers are still listing (up to) 5 messages at a time[5],
so they can still keep ahead of the producers, just not by as wide a
margin as before.

Options
    * 2 producer processes with 10 gevent workers each
        * 1 message posted per request
    * 2 observer processes with 25 gevent workers each
        * 5 messages listed per request by the observers
    * Load distributed across 4 queues
    * 10-second duration

Results
    * Producer: 2.2 ms/req, 883 req/sec
    * Observer: 2.8 ms/req, 348 req/sec

### Point-to-Point Messaging ###

In this scenario I simulated one client sending messages directly to a
different client. Only one queue is required in this case[6].

Note the higher latency. While running the test, there were 1-2 message
posts that took much longer (~100 ms) than the others to complete, which
skewed the average. Such outliers are probably present in the other tests
as well, and further investigation is needed to discover the root cause.
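
A quick way to keep that kind of skew from hiding in the averages would be
to report a few percentiles next to the mean. Here's a generic sketch (not
necessarily how zaqar-bench reports things today), given the raw
per-request latencies in seconds:

    # Report mean plus p50/p95/p99 so a couple of slow requests don't
    # quietly inflate the average.
    def summarize(latencies):
        data = sorted(latencies)
        pick = lambda p: data[min(len(data) - 1, int(p / 100.0 * len(data)))]
        mean = sum(data) / len(data)
        return {'mean_ms': mean * 1000,
                'p50_ms': pick(50) * 1000,
                'p95_ms': pick(95) * 1000,
                'p99_ms': pick(99) * 1000}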

Options
    * 1 producer process with 1 gevent worker
        * 1 message posted per request
    * 1 observer process with 1 gevent worker
        * 1 message listed per request
    * All load sent to a single queue
    * 10-second duration

Results
    * Producer: 5.5 ms/req, 179 req/sec
    * Observer: 3.5 ms/req, 278 req/sec

### Task Distribution ###

This test uses several producers and consumers in order to simulate
distributing tasks to a worker pool. In contrast to the observer worker
type, consumers claim and delete messages in such a way that each message
is processed once and only once.
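
Each consumer worker's loop boils down to something like the sketch below
(same python-zaqarclient queue handle as in the earlier sketch; parameter
names are from memory, so treat this as an approximation rather than the
exact zaqar-bench code):

    # Consumer loop sketch: claim a batch of up to 5 messages, process
    # and delete each one, then go back for the next batch. The claim
    # gives this worker temporary ownership, so each message ends up
    # being processed once and only once.
    while not time_is_up():              # hypothetical stop condition
        claim = queue.claim(ttl=60, grace=60, limit=5)
        for message in claim:
            handle(message.body)         # hypothetical processing step
            message.delete()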

Options
    * 2 producer processes with 25 gevent workers each
        * 1 message posted per request
    * 2 consumer processes with 25 gevent workers each
        * 5 messages claimed per request, then deleted one by one before
          claiming the next batch of messages
    * Load distributed across 4 queues
    * 10-second duration

Results
    * Producer: 2.5 ms/req, 798 req/sec
    * Consumer
        * Claim: 8.4 ms/req
        * Delete: 2.5 ms/req
        * 813 req/sec (overall)

### Auditing / Diagnostics ###

This test is the same as the Task Distribution test above, but adds a few
observers to the mix:

Options
    * 2 producer processes with 25 gevent workers each
        * 1 message posted per request
    * 2 consumer processes with 25 gevent workers each
        * 5 messages claimed per request, then deleted one by one before
          claiming the next batch of messages
    * 1 observer process with 5 gevent workers
        * 5 messages listed per request
    * Load distributed across 4 queues
    * 10-second duration

Results
    * Producer: 2.2 ms/req, 878 req/sec
    * Consumer
        * Claim: 8.2 ms/req
        * Delete: 2.3 ms/req
        * 876 req/sec (overall)
    * Observer: 7.4 ms/req, 133 req/sec

## Conclusions ##

While more testing is needed to track performance against increasing
load (spoiler: latency will increase), these initial results are
encouraging; turning around requests in ~10 (or even ~20) ms is fast
enough for a variety of use cases. I anticipate that enabling the keystone
middleware will add another 1-2 ms per request (assuming tokens are cached).

Let’s keep digging and see what we can learn, and what needs to be
improved. 

@kgriffs

--------

[1]: https://review.openstack.org/#/c/116384/
[2]: Yes, I know that's some crazy IOPS, but there is plenty of RAM to
avoid paging, so you should be able to get similar results with some
regular disks, assuming they are decent enough to support enabling
journaling (if you need that level of durability).
[3]: It would be interesting to verify the results presented here using
Tsung and/or JMeter; zaqar-bench isn't particularly efficient, but it does
provide the potential to do some interesting reporting, such as measuring
the total end-to-end time of enqueuing and subsequently dequeuing each
message (TODO). In any case, I'd love to see the team set up a
benchmarking cluster that runs 2-3 tools regularly (or as part of every
patch) and reports the results so we always know where we stand.
[4]: Yes, I know this is a short duration; I'll try to do some longer
tests in my next round of benchmarking.
[5]: In a real app, messages will usually be requested in batches.
[6]: In this test, the target client does not send a response message back
to the sender. However, if it did, the test would still only require a
single queue, since in Zaqar queues are duplex.
[7]: Chosen somewhat arbitrarily.
[8]: One might argue that the only thing these performance tests show
is that *OnMetal* is fast. However, as I pointed out, there was plenty
of headroom left on these servers during the tests, so similar results
should be achievable using more modest hardware.


