[openstack-dev] [gnocchi] new measures backlog scheduling
gordon chung
gord at live.ca
Mon Apr 17 16:09:58 UTC 2017
hi,
i've started to implement multiple buckets[3] and the initial tests look
promising. here are some of the things i've done:
- dropped the scheduler process and let the processing workers figure
out their tasks themselves
- each sack is now handled fully (not counting anything added after the
processing worker picks it up)
- the number of sacks is static
with the above in place, i've been testing it and it works pretty well:
i'm able to process 40K metrics, 60 points each, in 8-10 mins with 54
workers, where it took significantly longer before.
the issues i've run into:
- dynamic sack count
making the number of sacks dynamic is a concern. previously, we said to
have the sack count in the conf file. the concern is that changing that
option incorrectly actually 'corrupts' the db to a state it cannot
recover from: measures already written to a sack under the old count
will hash to a different sack under the new count, so it will have
stray unprocessed measures constantly (rough sketch of why below). if
we change the db path incorrectly, we don't actually corrupt anything,
we just lose data. we've said we don't want sack mappings in the
indexer, so it seems to me the only safe solution is to make the sack
count static and only changeable by hacking?
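
to make the corruption scenario concrete, here's a toy sketch of a
metric-to-sack mapping. this is not the actual gnocchi code, just an
illustrative modulo-style hash, but any mapping of this shape breaks
the same way when the sack count changes:

    import hashlib
    import uuid

    def sack_for_metric(metric_id, num_sacks):
        # hash the metric id and mod by the configured sack count
        return int(hashlib.md5(str(metric_id).encode()).hexdigest(),
                   16) % num_sacks

    metric = uuid.uuid4()
    print(sack_for_metric(metric, 2048))  # sack under the original conf
    print(sack_for_metric(metric, 4096))  # usually a different sack after a
                                          # resize, so measures sitting in the
                                          # old sack are never scheduled again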
- sack distribution
to distribute sacks across workers, i initially implemented consistent
hashing. the issue i noticed is that because a hashring inherently has
a non-uniform distribution[1], i would have workers sitting idle
because they were given fewer sacks, while other workers were still
working.
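
a quick toy demonstration of that skew (a plain md5 ring with virtual
nodes and 8 workers for readability, not the tooz hashring, so the
exact numbers won't match [1]):

    import hashlib
    from bisect import bisect
    from collections import Counter

    def h(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    workers = ['worker-%d' % i for i in range(8)]
    # 32 virtual nodes per worker on the ring
    ring = sorted((h('%s-%d' % (w, r)), w) for w in workers for r in range(32))
    points = [p for p, _ in ring]

    def owner(sack):
        # first ring point clockwise from the sack's hash owns it
        return ring[bisect(points, h('sack-%d' % sack)) % len(ring)][1]

    print(Counter(owner(s) for s in range(2048)))
    # the per-worker counts come out noticeably uneven rather than 256 each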
i also tried implementing jump hash[2] (rough python port below), which
improved distribution and is, in theory, less memory intensive as it
does not maintain a hash table. while better at distribution, it is
still not completely uniform and, similarly, the fewer sacks per
worker, the worse the distribution.
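
for reference, the algorithm from [2] ports to python roughly like
this (here mapping a sack number to one of num_buckets workers):

    def jump_hash(key, num_buckets):
        # jump consistent hash (Lamping & Veach, [2]); no hash table kept
        b, j = -1, 0
        while j < num_buckets:
            b = j
            key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
            j = int((b + 1) * ((1 << 31) / ((key >> 33) + 1)))
        return b

    # e.g. which of 54 workers owns sack 42
    print(jump_hash(42, 54))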
lastly, i tried just simple locking where each worker is completely
unaware of any other worker and handles all sacks (sketch below). it
locks the sack it is working on, so if another worker tries to work on
that sack, it just skips it. this effectively puts an additional load
on the locking system (in my case redis), as each worker will make x
lock requests where x is the number of sacks. so if we have 50 workers
and 2048 sacks, that's ~102K requests per cycle, in addition to the n
lock requests per metric (10K-1M metrics?). this does guarantee that if
a worker is free and there is work to be done, it will do it.
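
a minimal sketch of that lock-and-skip loop, assuming a tooz
coordinator backed by redis and a hypothetical process_sack() helper
(this is not the actual code in [3]):

    from tooz import coordination

    coord = coordination.get_coordinator('redis://localhost:6379',
                                         b'metricd-worker-1')
    coord.start()

    def process_all_sacks(num_sacks):
        for sack in range(num_sacks):
            lock = coord.get_lock(('sack-%d' % sack).encode())
            # non-blocking acquire: if another worker already holds this
            # sack, just skip it and move on to the next one
            if lock.acquire(blocking=False):
                try:
                    process_sack(sack)  # hypothetical: drain the sack's new measures
                finally:
                    lock.release()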
i guess the question i have is: by using a non-uniform hash, it seems
we possibly gain less load (fewer sacks reshuffled when the worker set
changes) at the expense of efficiency/'speed'. the number of
sacks/tasks we have is stable, it won't really change. the number of
metricd workers may change, but not constantly. lastly, the number of
sacks per worker will always be relatively low (10:1, 100:1 assuming
the max number of sacks is 2048). given these conditions, do we need
consistent/jump hashing? is it better to just modulo the sacks across
workers (sketch below) to ensure 'uniform' distribution, and accept
that a 'larger' set of buckets gets reshuffled when workers are added?
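
for completeness, the straight modulo alternative would be something
like this, assuming each worker knows its own index and the total
worker count (e.g. from group membership):

    def sacks_for_worker(worker_index, num_workers, num_sacks=2048):
        # uniform split: worker i of n handles every sack s with s % n == i.
        # the cost is that many sacks move to a different worker whenever
        # the worker count changes.
        return [s for s in range(num_sacks) if s % num_workers == worker_index]

    print(len(sacks_for_worker(0, 54)))  # 38 sacks for this worker with 54 workers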
[1] https://docs.google.com/spreadsheets/d/1flXw1lqao2tIc0p1baxVeJIXgzhy1Ksw3uFoiwyZkXk/edit?usp=sharing
[2] https://arxiv.org/pdf/1406.2294.pdf
[3] https://review.openstack.org/#/q/topic:buckets+project:openstack/gnocchi
--
gord