Open Stack

Thu Aug 21 08:57:54 UTC 2014

> One of the outcomes from Juno will be horizontal scalability in the
> central agent and alarm evaluator via partitioning[1]. The compute
> agent will get the same capability if you choose to use it, but it
> doesn't make quite as much sense.
> 
> I haven't investigated the alarm evaluator side closely yet, but one
> concern I have with the central agent partitioning is that, as far
> as I can tell, it will result in stored samples that give no
> indication of which (of potentially very many) central-agent it came
> from.
> 
> This strikes me as a debugging nightmare when something goes wrong
> with the content of a sample that makes it all the way to storage.
> We need some way, via the artifact itself, to narrow the scope of
> our investigation.
> 
> a) Am I right that no indicator is there?
> 
> b) Assuming there should be one:
> 
>     * Where should it go? Presumably it needs to be an attribute of
>       each sample because as agents leave and join the group, where
>       samples are published from can change.
> 
>     * How should it be named? The never-ending problem.
> 
> Thoughts?

Probably best to keep the bulk of this discussion on gerrit, but
FWIW here's my riff, just commented there ...

Cheers,
Eoghan

WRT marking each sample with an indication of the originating agent:

First, IIUC, true provenance would require that the full chain-of-
ownership could be reconstructed for the sample, so we'd need to
also record the individual collector that persisted each sample.
So let's assume that we're only talking here about associating the
originating agent with the sample.  For most classes of bugs/issues
that could impact on an agent, we'd expect an equivalent impact on
all agents. However, I guess there would be a subset of issues, e.g.
an agent being "left behind" after an upgrade, that could be localized.

So in the classic ceilometer approach to metadata, one could imagine
the agent identity being recorded in the sample itself. However this
would become a lot more problematic, I think, after a shift to pure
timeseries data. In that case, I don't think we'd necessarily want
to pollute the limited number of dimensions that can be efficiently
associated with a datapoint with additional information purely related
to the implementation/architecture of ceilometer.

So how about turning the issue on its head, and putting the onus on
the agent to record its allocated resources for each cycle? The
obvious way to do that would be via logging.
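
A minimal sketch of what such per-cycle logging might look like. Nothing
here is taken from the ceilometer code: the logger name, the
"polling assignment" prefix, the JSON layout and both function names are
assumptions made purely for illustration.

```python
import json
import logging
import time

# Hypothetical logger name, not an existing ceilometer logger.
LOG = logging.getLogger("agent.partitioning")


def format_assignment(agent_id, cycle_start, resource_ids):
    """Build one log line recording what an agent polls in one cycle."""
    record = {
        "agent_id": agent_id,
        "cycle_start": cycle_start,
        "resources": sorted(resource_ids),
    }
    # A fixed prefix plus JSON keeps the line easy to grep and to parse.
    return "polling assignment " + json.dumps(record, sort_keys=True)


def log_assignment(agent_id, resource_ids, cycle_start=None):
    """Log the resources allocated to this agent for the current cycle."""
    if cycle_start is None:
        cycle_start = time.time()
    line = format_assignment(agent_id, cycle_start, resource_ids)
    LOG.info(line)
    return line
```

An agent would call log_assignment() once at the start of each polling
cycle, after the partitioning step has decided its share of resources.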

Then in order to determine which agent was responsible for polling a
particular resource at a particular time, the problem would collapse
down to a distributed search over the agent log files for that period
(perhaps aided by whatever log retention scheme is in use, e.g. logstash).
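
As a rough sketch of that search, assuming each agent writes one line per
cycle of the form "polling assignment {json}" with agent_id, cycle_start
and resources fields (a hypothetical format, not something ceilometer
emits today), and that the lines have already been gathered from all
agents:

```python
import json

# Hypothetical marker preceding the JSON payload in each agent log line.
PREFIX = "polling assignment "


def find_polling_agents(log_lines, resource_id, when, interval):
    """Return the agents whose logged cycle covered `when` for a resource.

    `when` and each record's cycle_start are timestamps in seconds;
    `interval` is the polling interval in seconds.
    """
    agents = set()
    for line in log_lines:
        pos = line.find(PREFIX)
        if pos < 0:
            continue  # not an assignment record
        try:
            record = json.loads(line[pos + len(PREFIX):])
            start = record["cycle_start"]
            resources = record["resources"]
            agent_id = record["agent_id"]
        except (ValueError, KeyError, TypeError):
            continue  # skip truncated or malformed lines
        if start <= when < start + interval and resource_id in resources:
            agents.add(agent_id)
    return sorted(agents)
```

More than one agent in the result for the same resource and time would
itself be a useful signal, e.g. of an overlap during a rebalance.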

> [1] https://review.openstack.org/#/c/113549/
> [2] https://review.openstack.org/#/c/115237/

[openstack-dev] [ceilometer] indicating sample provenance