[openstack-dev] [Openstack] [Ceilometer][Architecture] Transformers in Kilo vs Liberty(and Mitaka)

Nadya Shakhat nprivalova at mirantis.com
Thu Apr 14 09:28:55 UTC 2016


Hi Gordon,

I'd like to add some clarifications and comments.

> this is not entirely accurate. pre-polling change, the polling agents
> publish one message per sample. now the polling agents publish one
> message per interval (multiple samples).

Looks like there is some misunderstanding here. In the code, there is a
"batch_polled_samples" option. You can switch it off and get the result you
described, but it is True by default. See
https://github.com/openstack/ceilometer/blob/master/ceilometer/agent/manager.py#L205-L211.
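
For reference, here is a rough paraphrase of the logic behind that option
(this is not the actual Ceilometer code; publish() is just an illustrative
stub):

    # Rough paraphrase, not the actual Ceilometer code: "batch_polled_samples"
    # decides whether a polling cycle's samples go out as one message or as
    # one message each. publish() stands in for the real notifier call in
    # ceilometer/agent/manager.py (see the link above).

    def publish(message):
        print("publishing %d sample(s) in one message" % len(message))

    def flush_polled_samples(samples, batch_polled_samples=True):
        if batch_polled_samples:
            publish(samples)        # default: one message per polling cycle
        else:
            for sample in samples:
                publish([sample])   # opt-out: one message per sample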

You wrote:

> the polling change is not related to coordination work in notification.
> the coordination work was to handle HA / multiple notification agents.
> regardless of the polling change, this must exist.

and

> transformers are already optional. they can be removed from
> pipeline.yaml if not required (and thus coordination can be disabled).


So, coordination is needed only to support transformations. The polling
change does relate to this, because it moved additional transformation work
to the notification agent side. I suggest paying attention to the existing
use cases. In real life, people use transformers for polling-based metrics
only. The most important use case for transformation is Heat autoscaling,
which is usually based on cpu_util. Before Liberty, we were able to support
the autoscaling use case without coordination for the notification agent;
in Liberty we cannot support it without Redis. Now "transformers are
already optional", that's true. But I think it would be better to add a
restriction such as "we don't support transformations for notifications"
and keep transformers optional on the polling agent only, instead of
introducing such comprehensive coordination.
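
To make that concrete, here is a hypothetical helper (not existing
Ceilometer code; the dict layout just mirrors a parsed pipeline.yaml)
showing that the need for coordination can be derived from the pipeline
itself:

    # Hypothetical sketch, not existing Ceilometer code: workload partitioning
    # (IPC queues plus tooz) is only needed when at least one sink actually
    # defines transformers.

    def coordination_required(pipeline_cfg):
        """pipeline_cfg: the parsed pipeline.yaml as a dict."""
        return any(sink.get('transformers')
                   for sink in pipeline_cfg.get('sinks', []))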

> IPC is one of the
> standard use cases for message queues. the concept of using queues to
> pass around and distribute work is essentially what it's designed for.
> if rabbit or any message queue service can't provide this function, it
> does worry me.


I see your point here, but Ceilometer aims to take care of OpenStack and
monitor its state. Right now it is known as a "Rabbit killer". We cannot
ignore that if we want anybody to use Ceilometer.


Also, I'd like to copy-paste Chris's ideas from the previous message:

> Are the options the following?
> * Do what you suggest and pull transformers back into the pollsters.
>   Basically revert the change. I think this is the wrong long term
>   solution but might be the best option if there's nobody to do the
>   other options.
> * Implement a pollster.yaml for use by the pollsters and consider
>   pipeline.yaml as the canonical file for the notification agents as
>   there's where the actual _pipelines_ are. Somewhere in there kill
>   interval as a concept on pipeline side.
>   This of course doesn't address the messaging complexity. I admit
>   that I don't understand all the issues there but it often feels
>   like we are doing that aspect of things completely wrong, so I
>   would hope that before we change things there we consider all the
>   options.

I think that the two types of agents should have two different pipeline
descriptions, but I still think that a "pipeline" should be described and
fully applied on both types of agents. On the polling ones it should be the
same as it is now; on the notification ones, remove interval and drop
transformations entirely. Chris, I see your point about "long term", but
I'm afraid that "long term" may not happen...
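
As a sketch of what I mean (hypothetical, not a patch; names are
illustrative): the polling agents keep the source/interval part as today,
while the notification agents would load a description that simply has no
place for interval or transformers:

    # Hypothetical sketch of the proposed split: the notification-agent
    # pipeline description would reject 'interval' in sources and
    # 'transformers' in sinks; both stay on the polling side only.

    def validate_notification_pipeline(pipeline_cfg):
        for source in pipeline_cfg.get('sources', []):
            if 'interval' in source:
                raise ValueError("'interval' belongs to the polling-side "
                                 "description only")
        for sink in pipeline_cfg.get('sinks', []):
            if sink.get('transformers'):
                raise ValueError("transformers are not supported on the "
                                 "notification agent in this proposal")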


> What else?
> One probably crazy idea: What about figuring out the desired end-meters
> of common transformations and making them into dedicated pollsters?
> Encapsulating that transformation not at the level of the polling
> manager but at the individual pollster.


Your "crazy idea" may work at least for restoring autoscaling functionality
indeed.
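
For illustration only, here is a standalone sketch of such a dedicated
pollster (the real thing would sit behind Ceilometer's pollster plugin
interface; the input format here is hypothetical):

    # Standalone sketch of a dedicated cpu_util pollster: it remembers the
    # previous cumulative CPU time per instance and emits the utilisation
    # percentage directly, so no transformer (and thus no coordination) is
    # needed on the notification side. The input tuples are hypothetical:
    # (instance_id, cumulative cpu time in ns, timestamp in seconds, vcpus).

    class CPUUtilPollster(object):
        def __init__(self):
            self._previous = {}  # instance_id -> (cpu_time_ns, ts, vcpus)

        def get_samples(self, instances):
            for instance_id, cpu_time_ns, timestamp, vcpus in instances:
                prev = self._previous.get(instance_id)
                self._previous[instance_id] = (cpu_time_ns, timestamp, vcpus)
                if prev is None:
                    continue  # first poll: nothing to compare against yet
                prev_cpu, prev_ts, _ = prev
                elapsed_ns = (timestamp - prev_ts) * 1e9
                if elapsed_ns <= 0:
                    continue
                util = 100.0 * (cpu_time_ns - prev_cpu) / (elapsed_ns * vcpus)
                yield {'name': 'cpu_util', 'unit': '%',
                       'resource_id': instance_id, 'volume': util}

The per-instance state it keeps is naturally local to the compute node that
polls those instances, which is exactly why this path would not need any
coordination.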

Thanks,
Nadya

On Wed, Apr 13, 2016 at 9:25 PM, gordon chung <gord at live.ca> wrote:

> hi Nadya,
>
> copy/pasting full original message with comments inline to clarify some
> comments.
>
> i think a lot of the confusion is because we use pipeline.yaml across
> both polling and notification agents when really it only applies to the
> latter. just an fyi, we've had an open work item to create a
> polling.yaml file... just the issue of 'resources'.
>
> > Hello colleagues,
> >
> >     I'd like to discuss one question with you. Perhaps you remember that
> > in Liberty we decided to get rid of transformers on polling agents [1].
> > I'd like to describe several issues we are facing now because of this
> > decision.
> > 1. pipeline.yaml inconsistency.
> >     A Ceilometer pipeline consists of two basic things: a source and a
> > sink. In the source, we describe how to get the data; in the sink, how
> > to deal with the data. After the refactoring described in [1], on
> > polling agents we apply only the "source" definition, and on
> > notification agents only the "sink" one. This causes the problems
> > described in the mailing thread [2]: the "pipe" concept is actually
> > broken. To make it work more or less correctly, the user has to take
> > care that a polling agent doesn't send duplicated samples. In the
> > example below, we send the "cpu" Sample twice every 600 seconds from
> > the compute agents:
> >
> > sources:
> >     - name: meter_source
> >       interval: 600
> >       meters:
> >           - "*"
> >       sinks:
> >           - meter_sink
> >     - name: cpu_source
> >       interval: 60
> >       meters:
> >           - "cpu"
> >       sinks:
> >           - cpu_sink
> >           - cpu_delta_sink
> >
> > If we apply the same configuration on the notification agent, each
> > "cpu" Sample will be processed by all of the 3 sinks. Please refer to
> > the mailing thread [2] for more details.
> >     As I understood from the specification, the main reason for [1] is
> > making the pollster code more readable. That's why I call this change a
> > "refactoring". Please correct me if I'm missing anything here.
>
> i don't know about more readable. it was also to offload work from
> compute nodes and all the stuff cdent mentions.
>
> >
> > 2. Coordination stuff.
> >     TBH, coordination for notification agents is the most painful thing
> > for me, for several reasons:
> >
> > a. A stateless service has become stateful. Here I'd like to note that
> > tooz usage for central agents and alarm-evaluators may be called
> > "optional". If you want to have these services scalable, it is
> > recommended to use tooz, i.e. install Redis/Zookeeper. But you may keep
> > your puppets unchanged and everything continues to work with one
> > service (central agent or alarm-evaluator) per cloud. If we are talking
> > about the notification agent, that's not the case. You must change the
> > deployment: either rewrite the puppets for notification agent
> > deployment (to have only one notification agent per cloud) or make a
> > tooz installation with Redis/Zookeeper required. One more option:
> > remove transformations completely - that's what we've done in our
> > company's product by default.
>
> the polling change is not related to coordination work in notification.
> the coordination work was to handle HA / multiple notification agents.
> regardless of the polling change, this must exist.
>
> >
> > b. RabbitMQ high utilisation. As you know, tooz does only one part of
> > the coordination for a notification agent. In Ceilometer, we use the
> > IPC queue mechanism to be sure that samples for one metric and from one
> > resource are processed by exactly one notification agent (to make it
> > possible to use a local cache). I'd like to remind you that without
> > coordination (but with [1] applied) each compute agent polls each
> > instance and sends the result as one message to a notification agent.
>
> this is not entirely accurate. pre-polling change, the polling agents
> publish one message per sample. now the polling agents publish one
> message per interval (multiple samples).
>
> > The notification agent processes all the samples and sends as many
> > messages to a collector as there are sinks defined (2-4, not many). If
> > [1] is not applied, one "publishing" round is skipped. But with [1] and
> > coordination (the most recommended deployment), the amount of
> > publications increases dramatically, because we publish each Sample as
> > a separate message. Instead of 3-5 "publish" calls, we do
> > 1+2*instance_amount_on_compute publishings per compute. And it's by
> > design, i.e. it's not a bug but a feature.
>
> i don't think the maths is right but regardless, IPC is one of the
> standard use cases for message queues. the concept of using queues to
> pass around and distribute work is essentially what it's designed for.
> if rabbit or any message queue service can't provide this function, it
> does worry me.
>
> >
> > c. Sample ordering in the queues. It may be considered a corner case,
> > but anyway I'd like to describe it here too. We have a lot of
> > order-sensitive transformers (cpu.delta, cpu_util), but we can
> > guarantee message ordering only in the "main" polling queue, not in the
> > IPC queues. In the picture below (I hope it will be displayed) there
> > are 3 agents A1, A2 and A3 and 3 time-ordered messages in the MQ. Let's
> > assume that at the same time the 3 agents start to read messages from
> > the MQ. All the messages are related to only one resource, that's why
> > they will all go to only one IPC queue. Let it be the IPC queue for
> > agent A1. At this point, we cannot guarantee that the order will be
> > kept, i.e. we cannot do order-sensitive transformations without some
> > loss.
>
> we can do ordering with batch processing. this is my proposal:
> https://review.openstack.org/#/c/275741/. we can discuss whether it
> works, should be changed, etc...
> >
> >
> >   Now I'd like to remind you that we need this coordination _only_ to
> > support transformations. Take a look at these specs: [3], [4].
> > From [3]: The issue that arises is that if we want to implement a
> > pipeline to process events, we cannot guarantee what event each agent
> > worker will get and because of that, we cannot enable transformers
> > which aggregate/collate some relationship across similar events.
> >
> > We don't have event transformations. In the default pipeline.yaml we
> > don't even use transformations for notification-based samples (perhaps
> > we get cpu from instance.exist, but we can drop it without any impact).
> > The most common case is transformations only for polling-based metrics.
> > Please correct me if I'm wrong here.
> >
> > tl;dr
> > I suggest the following:
> > 1. Return transformations to the polling agent.
> > 2. Have a special format for pipeline.yaml on notification agents,
> > without "interval" and "transformations". Notification-based
> > transformations are better done "offline".
>
> transformers are already optional. they can be removed from
> pipeline.yaml if not required (and thus coordination can be disabled).
> also the interval value is not used by the notification agent although in
> theory it could be, thus resolving the original issue.
>
> >
> > [1] https://github.com/openstack/telemetry-specs/blob/master/specs/liberty/pollsters-no-transform.rst
> > [2] http://www.gossamer-threads.com/lists/openstack/dev/53983
> > [3] https://github.com/openstack/ceilometer-specs/blob/master/specs/kilo/notification-coordiation.rst
> > [4] https://github.com/openstack/ceilometer-specs/blob/master/specs/liberty/distributed-coordinated-notifications.rst
> >
> > Thanks for your attention,
> > Nadya
>
>
>
>
> --
> gord
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>