Open Stack

Tue Mar 11 17:05:16 UTC 2014

Hi team!

Last week we were working on notification problem in ceilometer during
tempest tests creation. Tests for notification passed successfully on
Postgres but failed on MySQL. This made us start investigations and this
email contains some results.

As it turned out, tempest as it is is something like performance-testing
for Ceilometer. It contains 2057 tests. Almost in all test OpenStack
resources are being created and deleted: images, instances, volumes. E.g.
during instance creation nova sends 9 notifications. And all the tests are
running in parallel for about 40 minutes.
>From ceilometer-collector logs we may found very useful message:

2014-03-10 09:42:41.356
<http://logs.openstack.org/36/64136/20/check/check-tempest-dsvm-full/e361520/logs/screen-ceilometer-collector.txt.gz#_2014-03-10_09_42_41_356>
22845 DEBUG ceilometer.dispatcher.database
[req-16ea95c5-6454-407a-9c64-94d5ef900c9e - - - - -] metering data
storage.objects.outgoing.bytes for b7a490322e65422cb1129b13b49020e6 @
2014-03-10T09:34:31.090107:

So collector starts to process_metering_data in dispatcher only in 9:42 but
nova sent it in 9:34. To look at whole picture please take look at picture
[1]. It illustrates time difference based on this message in logs.
Besides, I decided to take a look on difference between the RPC-publisher
sends the message and the collector receives the message. To create this
plot I've parsed the lines like below from anotifications log:

2014-03-10 09:25:49.333
<http://logs.openstack.org/36/64136/20/check/check-tempest-dsvm-full/e361520/logs/screen-ceilometer-anotification.txt.gz#_2014-03-10_09_25_49_333>
22833 DEBUG ceilometer.openstack.common.rpc.amqp [-] UNIQUE_ID is
683dd3f130534b9fbb5606aef862b83d.

After that I found the corresponding id in collector log:

2014-03-10 09:25:49.352
<http://logs.openstack.org/36/64136/20/check/check-tempest-dsvm-full/e361520/logs/screen-ceilometer-collector.txt.gz#_2014-03-10_09_25_49_352>
22845 DEBUG ceilometer.openstack.common.rpc.amqp [-] received
{u'_context_domain': None, u'_context_request_id':
u'req-0a5fafe6-e097-4f90-a68a-a91da1cff22c',

u'args': {u'data': [...,
 u'message_id': u'f7ad63fc-a835-11e3-8223-bc764e205385',
u'counter_type': u'gauge'}]}, u'_context_read_only': False,
u'_unique_id': u'683dd3f130534b9fbb5606aef862b83d',

u'_context_user_identity': u'- - - - -', u'_context_instance_uuid':
None, u'_context_show_deleted': False, u'_context_tenant': None,
u'_context_auth_token': '<SANITIZED>',

....} _safe_log
/opt/stack/new/ceilometer/ceilometer/openstack/common/rpc/common.py:280

So in the example above we see time-difference only in 20 milliseconds. But
it grows very quickly :( To see it please take a look on picture [2].

To summarize pictures:
1. Picture 1: Axis Y: amount of seconds between nova creates notification
and the collector retrieves the message. Axis X: timestamp
2. Picture 2: Axis Y: amount of seconds between the publisher publishes the
message and the collector retrieves the message. Axis X: timestamp

These pictures are almost the same and it makes me think that collector
cannot manage with big amount of messages. What do you think about it? Do
you agree or you need more evidences, e.g. amount of messages in rabbit or
amth else?
Let's discuss that in [Ceilometer] topic first, I will create a new thread
about testing strategy in tempest later. Because in this circumstances we
forced to refuse from created notification tests and cannot reduce time for
polling because it will make everything even worst.

[1]: http://postimg.org/image/r4501bdyb/
[2]: http://postimg.org/image/yy5a1ste1/

Thanks for your attention,
Nadya
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openstack.org/pipermail/openstack-dev/attachments/20140311/d1df6ffa/attachment.html>

Open Stack

[openstack-dev] [Ceilometer]Collector's performance

OpenStack

Community

Documentation

Branding & Legal