[openstack-dev] [gnocchi] new measures backlog scheduling

Julien Danjou julien at danjou.info
Tue Apr 18 12:44:29 UTC 2017


On Tue, Apr 18 2017, gordon chung wrote:

> the issue i see is not with how the sacks will be assigned to metricd 
> but with how metrics (not daemons) are assigned to sacks. i don't think 
> storing the value in a storage object solves the issue, because when 
> would we load/read it as the api and metricd processes start up? it seems 
> this would require: 1) all services to be shut down and 2) a completely 
> clean incoming storage path. if either of the two steps isn't done, you 
> have corrupt incoming storage. if this is a requirement and both of 
> these are done successfully, it means any kind of 'live upgrade' is 
> impossible in gnocchi.

Live upgrade has never been supported in Gnocchi, so I don't see how
that's a problem. It'd be cool to support it for sure, but we're far
from having been able to implement it at any point in the past.
So it's not a new issue or anything like that. I really don't see
a problem with loading the number of sacks at startup.
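
To make that concrete, roughly something like this – purely illustrative,
with a made-up file-based marker rather than the real incoming driver API:
gnocchi-upgrade writes the sack count once, and the API/metricd read it
back when they start.

# Illustrative sketch only (not the real incoming driver API): record the
# sack count once at upgrade time and read it back when the API/metricd
# start, so running services keep a consistent view without a config option.
import json
import os

SACK_MARKER = "num_sacks.json"  # hypothetical marker object name

def write_num_sacks(basepath, num_sacks):
    # done once by gnocchi-upgrade when the incoming storage is (re)built
    with open(os.path.join(basepath, SACK_MARKER), "w") as f:
        json.dump({"num_sacks": num_sacks}, f)

def read_num_sacks(basepath):
    # done by gnocchi-api and gnocchi-metricd at startup
    try:
        with open(os.path.join(basepath, SACK_MARKER)) as f:
            return json.load(f)["num_sacks"]
    except FileNotFoundError:
        raise RuntimeError("incoming storage not upgraded: "
                           "sack count marker missing")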

> i did a test w/ 2 replicas (see: google sheet) and it's still 
> non-uniform, but better than without replicas: ~4%-30% vs ~8%-45%. we 
> could also minimise the number of lock calls by dividing sacks across 
> workers per agent.
>
> going to play devil's advocate now: using hashring in our use case will 
> always hurt throughput (even with perfect distribution, since the sack 
> contents themselves are not uniform). returning to the original question, 
> is using hashring worth it? i don't think we're even leveraging the 
> re-balancing aspect of hashring.

I think it's worth it only if you use replicas – and I don't think 2 is
enough, I'd try 3 at least, and make it configurable. It'll reduce
lock contention a lot (e.g. by 17x in my previous example).
As far as I'm concerned, since the number of replicas is configurable,
you can add a knob that sets replicas=number_of_metricd_workers, which
would give you the behaviour you currently implemented – every worker
tries to grab every sack.
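
To illustrate what the replicas knob buys us, here's a toy consistent-hash
ring – a stand-in for the real hashring, not Gnocchi code: with replicas=3,
only three daemons ever compete for a given sack's lock, and setting
replicas equal to the number of workers degenerates back to "everybody
tries every sack".

# Toy consistent-hash ring (illustrative only) showing how the replicas
# knob bounds lock contention: each sack is owned by exactly `replicas`
# workers, so at most that many daemons ever compete for its lock.
import bisect
import hashlib

class ToyHashRing:
    def __init__(self, workers, points_per_worker=32):
        # Place several virtual nodes per worker on the ring.
        self._ring = sorted(
            (self._hash("%s-%d" % (w, i)), w)
            for w in workers
            for i in range(points_per_worker)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def get_nodes(self, key, replicas=1):
        # Walk clockwise from the key's position, collecting distinct workers.
        start = bisect.bisect(self._keys, self._hash(key))
        owners = []
        for offset in range(len(self._ring)):
            worker = self._ring[(start + offset) % len(self._ring)][1]
            if worker not in owners:
                owners.append(worker)
            if len(owners) == replicas:
                break
        return owners

workers = ["metricd-%d" % i for i in range(8)]
ring = ToyHashRing(workers)
for sack in range(4):
    print("sack-%d ->" % sack, ring.get_nodes("sack-%d" % sack, replicas=3))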

We're not leveraging the re-balancing aspect of hashring, that's true.
We could probably use a dumber system to spread sacks across workers;
we could stick to the good ol' "len(sacks) / len(workers in the group)".
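
That dumber scheme would be something like this modulo split over the
sorted group membership – just a sketch, names made up:

# Sketch of a static split: each worker takes every Nth sack based on its
# index in the sorted member list of the coordination group. No hashring.
def sacks_for_worker(num_sacks, members, my_id):
    members = sorted(members)
    index = members.index(my_id)
    return [s for s in range(num_sacks) if s % len(members) == index]

print(sacks_for_worker(32, ["metricd-a", "metricd-b", "metricd-c"],
                       "metricd-b"))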

But I think there are a couple of things down the road that may help us.
Using the hashring makes sure worker X does not jump from sacks [A, B,
C] to [W, X, Y, Z] but just to [A, B] or [A, B, C, X]. That should
minimize lock contention when bringing up/down workers. I admit it's
a very marginal win, but… it comes free with it.
Also, I envision a push-based approach in the future (to replace
metricd_processing_delay) which will require workers to register to
sacks. Making sure the rebalancing does not shake everything but is
rather smooth will also reduce workload around that. Again, it comes
free.
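
To show why the smoother rebalancing matters, here's a self-contained toy
comparison (again, not Gnocchi code) of how many sacks change owner when a
ninth metricd joins eight existing ones, under a plain modulo split versus
a consistent-hash ring:

# Toy comparison: count sack assignments that move when one worker joins.
import bisect
import hashlib

def _h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def ring_owner(sack, workers, points=64):
    # Toy consistent-hash lookup: place `points` virtual nodes per worker on
    # a ring and map the sack to the first virtual node at or after its hash.
    ring = sorted((_h("%s-%d" % (w, i)), w)
                  for w in workers for i in range(points))
    keys = [k for k, _ in ring]
    return ring[bisect.bisect(keys, _h("sack-%d" % sack)) % len(ring)][1]

NUM_SACKS = 256
old = ["metricd-%d" % i for i in range(8)]
new = old + ["metricd-8"]

moved_mod = sum(1 for s in range(NUM_SACKS)
                if old[s % len(old)] != new[s % len(new)])
moved_ring = sum(1 for s in range(NUM_SACKS)
                 if ring_owner(s, old) != ring_owner(s, new))
print("modulo split: %d/%d sacks changed owner" % (moved_mod, NUM_SACKS))
print("hash ring:    %d/%d sacks changed owner" % (moved_ring, NUM_SACKS))

With the ring only roughly a ninth of the sacks move, whereas the modulo
split reshuffles almost everything.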

-- 
Julien Danjou
# Free Software hacker
# https://julien.danjou.info

