<div dir="ltr">Hi Gordon,<div><br></div><div>I'd like to add some clarifications and comments.</div><div><br></div><div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span style="font-size:13px">this is not entirely accurate. pre-polling change, the polling agents<br></span><span style="font-size:13px">publish one message per sample. now the polling agents publish one<br></span><span style="font-size:13px">message per interval (multiple samples).</span></blockquote><div>It looks like there is some misunderstanding here. In the code, there is a "batch_polled_samples" option. You can switch it off and get the result you described, but it is True by default. See <a href="https://github.com/openstack/ceilometer/blob/master/ceilometer/agent/manager.py#L205-L211" target="_blank">https://github.com/openstack/ceilometer/blob/master/ceilometer/agent/manager.py#L205-L211</a>.</div><div><br></div><div>You wrote:</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span style="font-size:13px">the polling change is not related to coordination work in notification.<br></span><span style="font-size:13px">the coordination work was to handle HA / multiple notification agents.<br></span><span style="font-size:13px">regardless of the polling change, this must exist.</span></blockquote><div>and</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span style="font-size:13px">transformers are already optional. they can be removed from<br></span><span style="font-size:13px">pipeline.yaml if not required (and thus coordination can be disabled).</span></blockquote><div><br></div><div>So, coordination is needed only to support transformations. 
The polling change does relate to this, because it moved additional transformations to the notification agent side. I suggest we pay attention to the existing use cases. In real life, people use transformers for polling-based metrics only. The most important use case for transformations is Heat autoscaling, which is usually based on cpu_util. Before Liberty, we could support the autoscaling use case without coordinating the notification agents. In Liberty, we cannot support it without Redis. It's true that "<span style="font-size:13px">transformers are already optional</span>". But I think it's better to add some restrictions, like "we don't support transformations for notifications", and keep transformers optional on the polling agent only, instead of introducing such complex coordination. </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span style="font-size:13px">IPC is one of the<br></span><span style="font-size:13px">standard use cases for message queues. the concept of using queues to<br></span><span style="font-size:13px">pass around and distribute work is essentially what it's designed for.<br></span><span style="font-size:13px">if rabbit or any message queue service can't provide this function, it<br></span><span style="font-size:13px">does worry me.</span></blockquote><div><br></div><div>I see your point here, but Ceilometer aims to take care of OpenStack and monitor its state. Right now it is known as a "Rabbit killer". We cannot ignore that if we want anybody to use Ceilometer. 
</div><div><br></div><div><br></div><div>Also, I'd like to copy-paste Chris's ideas from the previous message:</div><div><br></div><div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span style="font-size:13px">Are the options the following?</span><br style="font-size:13px"><span style="font-size:13px">* Do what you suggest and pull transformers back into the pollsters.</span></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span style="font-size:13px">  Basically revert the change. I think this is the wrong long term<br></span><span style="font-size:13px">  solution but might be the best option if there's nobody to do the<br></span><span style="font-size:13px">  other options.</span><br style="font-size:13px"><span style="font-size:13px">* Implement a pollster.yaml for use by the pollsters and consider<br></span><span style="font-size:13px">  pipeline.yaml as the canonical file for the notification agents as<br></span><span style="font-size:13px">  there's where the actual _pipelines_ are. Somewhere in there kill<br></span><span style="font-size:13px">  interval as a concept on pipeline side.</span><br style="font-size:13px"><span style="font-size:13px">  This of course doesn't address the messaging complexity. 
I admit<br></span><span style="font-size:13px">  that I don't understand all the issues there but it often feels<br></span><span style="font-size:13px">  like we are doing that aspect of things completely wrong, so I<br></span><span style="font-size:13px">  would hope that before we change things there we consider all the<br></span><span style="font-size:13px">  options.</span></blockquote><div>I think the two types of agents should have two different pipeline descriptions, but I still think that a "pipeline" should be described and fully applied on both types of agents. On polling agents it should stay as it is now; on notification agents, we should remove interval and drop transformations altogether. Chris, I see your point about the "long term", but I'm afraid that the "long term" may never happen... </div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span style="font-size:13px">What else?</span><br style="font-size:13px"><span style="font-size:13px">One probably crazy idea: What about figuring out the desired end-meters<br></span><span style="font-size:13px">of common transformations and making them into dedicated pollsters?<br></span><span style="font-size:13px">Encapsulating that transformation not at the level of the polling<br></span><span style="font-size:13px">manager but at the individual pollster.</span></blockquote><div> </div><div>Your "crazy idea" may indeed work, at least for restoring autoscaling functionality. </div><div> </div></div><div>Thanks,</div><div>Nadya</div></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Apr 13, 2016 at 9:25 PM, gordon chung <span dir="ltr">&lt;<a href="mailto:gord@live.ca" target="_blank">gord@live.ca</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">hi Nadya,<br>

<br>

copy/pasting full original message with comments inline to clarify some<br>

comments.<br>

<br>

i think a lot of the confusion is because we use pipeline.yaml across<br>
both polling and notification agents when really it only applies to<br>
the latter. just an fyi, we've had an open work item to create a<br>
polling.yaml file... it's just an issue of 'resources'.<br>

<br>

> Hello colleagues,<br>

><br>

>     I'd like to discuss one question with you. Perhaps you remember that<br>
> in Liberty we decided to get rid of transformers on polling agents [1]. I'd<br>
> like to describe several issues we are facing now because of this decision.<br>

> 1. pipeline.yaml inconsistency.<br>

>     A Ceilometer pipeline consists of two basic things: a source and a<br>
> sink. In the source, we describe how to get the data; in the sink, how to<br>
> process it. After the refactoring described in [1], we apply only the<br>
> "source" definition on polling agents and only the "sink" definition on<br>
> notification agents. This causes the problems described in the mailing<br>
> thread [2]: the "pipe" concept is actually broken. To make it work more or<br>
> less correctly, the user must make sure that a polling agent does not send<br>
> duplicated samples. In the example below, a compute agent sends the "cpu"<br>
> Sample twice every 600 seconds:<br>

><br>

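> sources:<br>
> &nbsp;&nbsp;- name: meter_source<br>
> &nbsp;&nbsp;&nbsp;&nbsp;interval: 600<br>
> &nbsp;&nbsp;&nbsp;&nbsp;meters:<br>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;- "*"<br>
> &nbsp;&nbsp;&nbsp;&nbsp;sinks:<br>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;- meter_sink<br>
> &nbsp;&nbsp;- name: cpu_source<br>
> &nbsp;&nbsp;&nbsp;&nbsp;interval: 60<br>
> &nbsp;&nbsp;&nbsp;&nbsp;meters:<br>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;- "cpu"<br>
> &nbsp;&nbsp;&nbsp;&nbsp;sinks:<br>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;- cpu_sink<br>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;- cpu_delta_sink<br>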

><br>

> If we apply the same configuration on the notification agent, each "cpu"<br>
> Sample will be processed by all 3 sinks. Please refer to the mailing thread<br>
> [2] for more details.<br>
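<div>To make the duplication concrete, here is a minimal sketch that applies the source definitions from the example above to a single incoming sample and lists the sinks it reaches when only the sink side is evaluated. The helper names (matching_sources, sinks_for) and the use of fnmatchcase for wildcard matching are illustrative assumptions, not Ceilometer's pipeline code.</div>

```python
from fnmatch import fnmatchcase

# Illustrative sketch only: helper names are hypothetical, not Ceilometer APIs.
# Source definitions from the example pipeline.yaml, expressed as plain dicts.
SOURCES = [
    {"name": "meter_source", "interval": 600, "meters": ["*"],
     "sinks": ["meter_sink"]},
    {"name": "cpu_source", "interval": 60, "meters": ["cpu"],
     "sinks": ["cpu_sink", "cpu_delta_sink"]},
]


def matching_sources(meter, sources):
    """Return every source whose meter patterns match the given meter name."""
    return [src for src in sources
            if any(fnmatchcase(meter, pattern) for pattern in src["meters"])]


def sinks_for(meter, sources):
    """Collect the sinks a sample reaches when only the sink side is applied.

    The receiving agent does not know which source produced the sample,
    so it fans the sample out to the sinks of every matching source.
    """
    sinks = []
    for src in matching_sources(meter, sources):
        for sink in src["sinks"]:
            if sink not in sinks:
                sinks.append(sink)
    return sinks


if __name__ == "__main__":
    print(sinks_for("cpu", SOURCES))              # ['meter_sink', 'cpu_sink', 'cpu_delta_sink']
    print(sinks_for("disk.read.bytes", SOURCES))  # ['meter_sink']
```

<div>A "cpu" sample polled by cpu_source therefore also lands in meter_sink, and one polled by meter_source also lands in cpu_sink and cpu_delta_sink.</div>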

>     As I understood from the specification, the main reason for [1] is<br>
> making the pollster code more readable. That's why I call this change a<br>
> "refactoring". Please correct me if I'm missing anything here.<br>

<br>

i don't know about more readable. it was also to offload work from<br>

compute nodes and all the stuff cdent mentions.<br>

<br>

><br>

> 2. Coordination stuff.<br>

>     TBH, coordination for notification agents is the most painful thing for<br>
> me, for several reasons:<br>

><br>

> a. A stateless service has become stateful. Here I'd like to note that tooz<br>
> usage for central agents and alarm-evaluators may be called "optional". If<br>
> you want these services to be scalable, it is recommended to use tooz,<br>
> i.e. to install Redis/Zookeeper. But you can leave your Puppet manifests<br>
> unchanged and everything continues to work with one service (central agent<br>
> or alarm-evaluator) per cloud. For the notification agent, this is not<br>
> the case. You must change the deployment: either rewrite the Puppet<br>
> manifests for notification agent deployment (to have only one notification<br>
> agent per cloud) or make a tooz installation with Redis/Zookeeper mandatory.<br>
> One more option: remove transformations completely; that's what we've done<br>
> by default in our company's product.<br>

<br>

the polling change is not related to coordination work in notification.<br>

the coordination work was to handle HA / multiple notification agents.<br>

regardless of the polling change, this must exist.<br>

<br>

><br>

> b. High RabbitMQ utilisation. As you know, tooz handles only one part of<br>
> coordination for a notification agent. In Ceilometer, we use an IPC queue<br>
> mechanism to make sure that samples of the same meter from the same<br>
> resource are processed by exactly one notification agent (so that a local<br>
> cache can be used). I'd like to remind you that without coordination (but<br>
> with [1] applied) each compute agent polls each instance and sends the<br>
> result as one message to a notification agent. The<br>

<br>

this is not entirely accurate. pre-polling change, the polling agents<br>
publish one message per sample. now the polling agents publish one<br>
message per interval (multiple samples).<br>
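<div>The two publishing modes discussed here can be sketched as a simple grouping step. batch_polled_samples is the option named in the reply above; the build_messages function is a hypothetical model of its effect on the number of messages, not the code in ceilometer/agent/manager.py.</div>

```python
from typing import List

# Hypothetical model of the batch_polled_samples switch; not Ceilometer code.


def build_messages(samples: List[dict], batch_polled_samples: bool) -> List[List[dict]]:
    """Group one polling cycle's samples into outgoing messages.

    With batching, all samples gathered in the cycle travel in a single
    message; without it, every sample becomes its own message.
    """
    if not samples:
        return []
    if batch_polled_samples:
        return [list(samples)]
    return [[sample] for sample in samples]


if __name__ == "__main__":
    # One polling cycle on a compute node with 3 instances and 2 meters each.
    cycle = [{"resource_id": f"vm-{i}", "meter": m}
             for i in range(3) for m in ("cpu", "memory.usage")]
    print(len(build_messages(cycle, batch_polled_samples=True)))   # 1
    print(len(build_messages(cycle, batch_polled_samples=False)))  # 6
```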

<br>

> notification agent processes all the samples and sends as many messages to<br>
> a collector as there are sinks defined (2-4, not many). If [1] is not<br>
> applied, one "publishing" round is skipped. But with [1] and coordination<br>
> (the most recommended deployment), the number of publications increases<br>
> dramatically, because we publish each Sample as a separate message. Instead<br>
> of 3-5 "publish" calls, we make 1 + 2*instance_amount_on_compute<br>
> publications per compute node. And this is by design, i.e. it's not a bug<br>
> but a feature.<br>
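<div>The estimate above can be tabulated with a few lines of arithmetic. This sketch only encodes the formulas as stated in this paragraph (one batched message plus one per sink without coordination, versus 1 + 2 * instances with it); the accuracy of that estimate is questioned in the reply that follows, so treat the numbers as illustrative. Function names are hypothetical.</div>

```python
# Encodes the publish-count estimate as stated in the message; hypothetical names.


def publishes_without_coordination(sinks: int) -> int:
    # One batched message from the compute agent to a notification agent,
    # then one message per sink to the collector (2-4 sinks -> 3-5 calls).
    return 1 + sinks


def publishes_with_coordination(instances: int) -> int:
    # Formula as given in the message: 1 + 2 * instance_amount_on_compute.
    return 1 + 2 * instances


if __name__ == "__main__":
    for instances in (10, 50, 100):
        print(instances,
              publishes_without_coordination(sinks=3),
              publishes_with_coordination(instances))
```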

<br>

i don't think the maths is right but regardless, IPC is one of the<br>

standard use cases for message queues. the concept of using queues to<br>

pass around and distribute work is essentially what it's designed for.<br>

if rabbit or any message queue service can't provide this function, it<br>

does worry me.<br>
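<div>The IPC routing that makes a local cache usable can be illustrated with a plain hash partitioner: each sample is keyed by (meter, resource), and the key decides which agent's queue receives it, so a single agent sees the whole series for that key. This is a simplified sketch assuming a fixed agent list and modulo hashing; in Ceilometer the agent membership is coordinated through tooz, and the function and queue names here are hypothetical.</div>

```python
import hashlib
from collections import defaultdict

# Hypothetical agent names; a real deployment discovers members dynamically.
AGENTS = ["agent-1", "agent-2", "agent-3"]


def ipc_queue_for(meter: str, resource_id: str, agents=AGENTS) -> str:
    """Pick an IPC queue for a sample from its (meter, resource) key.

    A stable digest (rather than Python's per-process randomized hash())
    makes every agent compute the same answer for the same key.
    """
    key = f"{meter}:{resource_id}".encode("utf-8")
    digest = int(hashlib.sha256(key).hexdigest(), 16)
    return f"{agents[digest % len(agents)]}-ipc"


if __name__ == "__main__":
    routed = defaultdict(list)
    for ts in range(3):
        for vm in ("vm-a", "vm-b"):
            routed[ipc_queue_for("cpu", vm)].append((vm, ts))
    for queue, items in sorted(routed.items()):
        print(queue, items)
```

<div>Plain modulo hashing moves most keys whenever an agent joins or leaves, which is why consistent hashing is usually preferred for this kind of partitioning.</div>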

<br>

><br>

> c. Sample ordering in the queues. This may be considered a corner case,<br>
> but I'd like to describe it here anyway. We have a lot of order-sensitive<br>
> transformers (cpu.delta, cpu_util), but we can guarantee message ordering<br>
> only in the "main" polling queue, not in the IPC queues. In the picture<br>
> below (hopefully it will be displayed) there are 3 agents, A1, A2 and A3,<br>
> and 3 time-ordered messages in the MQ. Let's assume that the 3 agents start<br>
> reading messages from the MQ at the same time. All the messages relate to<br>
> a single resource, which is why they will all go to the same IPC queue.<br>
> Let it be the IPC queue of agent A1. Since each agent republishes its<br>
> message independently, the order in which the messages reach A1's queue<br>
> depends on which agent is fastest. So we cannot guarantee that the order<br>
> will be kept, i.e. we cannot do order-sensitive transformations without<br>
> some loss.<br>
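<div>Why ordering matters for cpu_util can be shown with a small rate-of-change calculation. This is a toy sketch, not Ceilometer's transformer: it assumes cumulative "cpu" samples in nanoseconds, a one-vCPU instance, the cpu_util scaling 100.0 / (10**9 * cpu_number) used by the default pipeline.yaml of that period, and it simply drops any sample older than the cached one. The class and function names are hypothetical.</div>

```python
from typing import Dict, List, Optional, Tuple

NS_PER_S = 10**9


class CpuUtilSketch:
    """Toy rate-of-change transformer keyed by resource (hypothetical, not Ceilometer's)."""

    def __init__(self, cpu_number: int = 1):
        self.cpu_number = cpu_number
        # Local cache: resource_id -> (timestamp_s, cumulative_cpu_ns)
        self.cache: Dict[str, Tuple[float, int]] = {}

    def handle(self, resource_id: str, timestamp_s: float, cpu_ns: int) -> Optional[float]:
        prev = self.cache.get(resource_id)
        if prev is not None and timestamp_s <= prev[0]:
            # Out-of-order (or duplicate) sample: no valid rate can be computed.
            return None
        self.cache[resource_id] = (timestamp_s, cpu_ns)
        if prev is None:
            return None  # the first sample only primes the cache
        dt = timestamp_s - prev[0]
        dv = cpu_ns - prev[1]
        return 100.0 * dv / (dt * NS_PER_S * self.cpu_number)


def run(samples: List[Tuple[float, int]]) -> List[Optional[float]]:
    transformer = CpuUtilSketch()
    return [transformer.handle("vm-1", ts, ns) for ts, ns in samples]


if __name__ == "__main__":
    # Three time-ordered samples: 30 s, then 45 s of CPU time per 60 s window.
    ordered = [(0, 0), (60, 30 * NS_PER_S), (120, 75 * NS_PER_S)]
    print(run(ordered))                               # [None, 50.0, 75.0]
    # The same samples with the last two swapped in the IPC queue.
    print(run([ordered[0], ordered[2], ordered[1]]))  # [None, 62.5, None]
```

<div>With the swap, the two real readings (50% and 75%) collapse into one averaged 62.5% and the late sample is lost, which is the "some loss" described above.</div>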

<br>

we can do ordering with batch processing. this is my proposal:<br>

<a href="https://review.openstack.org/#/c/275741/" rel="noreferrer" target="_blank">https://review.openstack.org/#/c/275741/</a>. we can discuss whether it<br>

works, should be changed, etc...<br>
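<div>As a rough illustration of why batching can help, the sketch below cuts the incoming stream into batches and sorts each batch by resource and timestamp before computing deltas. This is an assumption about the general idea only, not a description of the change under review at the link above; it restores order only for samples that land in the same batch, and the function names are hypothetical.</div>

```python
from typing import Iterable, List, Tuple

# Hypothetical helpers illustrating per-batch reordering; not Ceilometer code.
Sample = Tuple[str, float, int]  # (resource_id, timestamp_s, cumulative value)


def ordered_batches(stream: Iterable[Sample], batch_size: int) -> List[List[Sample]]:
    """Cut a stream into fixed-size batches and sort each batch by time."""
    batches, current = [], []
    for sample in stream:
        current.append(sample)
        if len(current) == batch_size:
            batches.append(sorted(current, key=lambda s: (s[0], s[1])))
            current = []
    if current:
        batches.append(sorted(current, key=lambda s: (s[0], s[1])))
    return batches


def deltas(batches: List[List[Sample]]) -> List[int]:
    """Per-resource deltas computed batch by batch, skipping late samples."""
    last = {}
    out = []
    for batch in batches:
        for rid, ts, value in batch:
            prev = last.get(rid)
            if prev is not None and ts <= prev[0]:
                continue  # still out of order across a batch boundary
            if prev is not None:
                out.append(value - prev[1])
            last[rid] = (ts, value)
    return out


if __name__ == "__main__":
    arrived = [("vm-1", 0, 0), ("vm-1", 120, 75), ("vm-1", 60, 30)]
    print(deltas(ordered_batches(arrived, batch_size=3)))  # [30, 45]
    print(deltas(ordered_batches(arrived, batch_size=1)))  # [75]
```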

><br>

><br>

>   Now I'd like to remind you that we need this coordination _only_ to<br>
> support transformations. Take a look at these specs: [3], [4]<br>
> From [3]: The issue that arises is that if we want to implement a pipeline<br>

> to process events, we cannot guarantee what event each agent worker will<br>

> get and because of that, we cannot enable transformers which<br>

> aggregate/collate some relationship across similar events.<br>

><br>

> We don't have event transformations. In the default pipeline.yaml we don't<br>
> even use transformations for notification-based samples (perhaps we get<br>
> cpu from instance.exist, but we can drop it without any impact). The most<br>
> common case is transformations for polling-based metrics only. Please<br>
> correct me if I'm wrong here.<br>
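<div>The reason stateful transformers force coordination can be shown by splitting one resource's samples between two agents that each keep their own local cache. A minimal sketch with a hypothetical LocalDeltaCache class: keyed by resource, one agent sees the whole series; distributed round-robin (assumed here as a stand-in for uncoordinated consumers of a shared queue), each agent sees every other sample and its deltas span the wrong intervals.</div>

```python
from typing import Dict, List, Tuple


class LocalDeltaCache:
    """Per-agent cache computing deltas of a cumulative meter (hypothetical)."""

    def __init__(self) -> None:
        self.last: Dict[str, int] = {}

    def handle(self, resource_id: str, value: int) -> List[int]:
        prev = self.last.get(resource_id)
        self.last[resource_id] = value
        return [] if prev is None else [value - prev]


def distribute(samples: List[Tuple[str, int]], agents: int, by_resource: bool) -> List[List[int]]:
    """Feed samples to `agents` caches and return the deltas each one emits."""
    caches = [LocalDeltaCache() for _ in range(agents)]
    outputs: List[List[int]] = [[] for _ in range(agents)]
    for i, (rid, value) in enumerate(samples):
        # Coordinated: a stable toy hash of the resource picks the agent.
        # Uncoordinated (assumed round-robin): samples alternate between agents.
        idx = (sum(map(ord, rid)) % agents) if by_resource else i % agents
        outputs[idx].extend(caches[idx].handle(rid, value))
    return outputs


if __name__ == "__main__":
    series = [("vm-1", v) for v in (0, 10, 25, 45)]
    print(distribute(series, agents=2, by_resource=True))
    print(distribute(series, agents=2, by_resource=False))  # [[25], [35]]
```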

><br>

> tl;dr<br>

> I suggest the following:<br>

> 1. Return transformations to the polling agent.<br>
> 2. Have a special format for pipeline.yaml on notification agents, without<br>
> "interval" and "transformations". Notification-based transformations are<br>
> better done "offline".<br>
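<div>Point 2 could be enforced by a simple check on the notification-side definition. The sketch below assumes the same sources/sinks layout as pipeline.yaml, already parsed into dicts, and reports any "interval" on a source or non-empty "transformers" on a sink. The function name and messages are hypothetical; this illustrates the proposal, not existing Ceilometer code.</div>

```python
from typing import Dict, List

# Hypothetical validator for the proposed notification-agent pipeline format.


def notification_pipeline_errors(definition: Dict[str, list]) -> List[str]:
    """List violations of the proposed notification-agent pipeline format."""
    errors = []
    for source in definition.get("sources", []):
        if "interval" in source:
            errors.append(f"source {source.get('name')!r}: 'interval' is not allowed")
    for sink in definition.get("sinks", []):
        if sink.get("transformers"):
            errors.append(f"sink {sink.get('name')!r}: transformers are not allowed")
    return errors


if __name__ == "__main__":
    proposed = {
        "sources": [{"name": "notification_source", "meters": ["*"],
                     "sinks": ["meter_sink"]}],
        "sinks": [{"name": "meter_sink", "transformers": None,
                   "publishers": ["notifier://"]}],
    }
    legacy = {
        "sources": [{"name": "cpu_source", "interval": 60, "meters": ["cpu"],
                     "sinks": ["cpu_sink"]}],
        "sinks": [{"name": "cpu_sink",
                   "transformers": [{"name": "rate_of_change"}],
                   "publishers": ["notifier://"]}],
    }
    print(notification_pipeline_errors(proposed))  # []
    print(notification_pipeline_errors(legacy))
```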

<br>

transformers are already optional. they can be removed from<br>

pipeline.yaml if not required (and thus coordination can be disabled).<br>

also, the interval value is not used by the notification agent, although in<br>
theory it could be, which would resolve the original issue.<br>

<br>

><br>

> [1]<br>

> <a href="https://github.com/openstack/telemetry-specs/blob/master/specs/liberty/pollsters-no-transform.rst" rel="noreferrer" target="_blank">https://github.com/openstack/telemetry-specs/blob/master/specs/liberty/pollsters-no-transform.rst</a><br>

> [2] <a href="http://www.gossamer-threads.com/lists/openstack/dev/53983" rel="noreferrer" target="_blank">http://www.gossamer-threads.com/lists/openstack/dev/53983</a><br>

> [3]<br>

> <a href="https://github.com/openstack/ceilometer-specs/blob/master/specs/kilo/notification-coordiation.rst" rel="noreferrer" target="_blank">https://github.com/openstack/ceilometer-specs/blob/master/specs/kilo/notification-coordiation.rst</a><br>

> [4]<br>

> <a href="https://github.com/openstack/ceilometer-specs/blob/master/specs/liberty/distributed-coordinated-notifications.rst" rel="noreferrer" target="_blank">https://github.com/openstack/ceilometer-specs/blob/master/specs/liberty/distributed-coordinated-notifications.rst</a><br>

><br>

> Thanks for your attention,<br>

> Nadya<br>

<span class="HOEnZb"><font color="#888888"><br>

<br>

<br>

<br>

--<br>

gord<br>

__________________________________________________________________________<br>

OpenStack Development Mailing List (not for usage questions)<br>


<a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev" rel="noreferrer" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev</a><br>

</font></span></blockquote></div><br></div>