[openstack-dev] [gnocchi] per-sack vs per-metric locking tradeoffs

gordon chung gord at live.ca
Fri Apr 28 12:38:14 UTC 2017



On 28/04/17 03:48 AM, Julien Danjou wrote:
>
> Yes, I wrote that in a review somewhere. We need to rework 1. so
> deletion happens at the same time we lock the sack to process metrics
> basically. We might want to merge the janitor into the worker I imagine.
> Currently a janitor can grab metrics and do dumb things like:
> - metric1 from sackA
> - metric2 from sackB
> - metric3 from sackA
>
> and do 3 different lock+delete -_-

so the tradeoff here is that we'd be making a lot more calls to the
indexer. additionally, we'd be pulling a lot more unused results from the db.
a single janitor currently just grabs all deleted metrics and attempts
to clean them up one at a time. if we merge, we will have n calls to the
indexer, where n is the number of workers, each pulling in all the
deleted metrics, checking whether each metric is in its sack, and moving
on if not. that's a lot of extra, wasted work. we could reduce that work
by adding sack information to the indexer ;) but that would still add
significantly more calls to the indexer (which we could reduce by not
triggering cleanup every job interval)
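
to put rough numbers on that, here's a minimal sketch. everything in it is an assumption made for illustration: the sack count, the uuid-modulo mapping in `sack_for_metric`, and the helper names are hypothetical and not gnocchi's actual code. it groups deleted metrics by sack (so each sack is locked once) and counts how many rows the workers pull from the indexer with and without sack information stored there.

```python
import uuid
from collections import defaultdict

NUM_SACKS = 8  # hypothetical sack count, chosen only for this example


def sack_for_metric(metric_id, num_sacks=NUM_SACKS):
    # assumption: a metric's sack is derived from its uuid;
    # the real mapping may differ.
    return metric_id.int % num_sacks


def group_by_sack(metric_ids, num_sacks=NUM_SACKS):
    # bucket deleted metrics so each sack needs one lock, not one per metric
    groups = defaultdict(list)
    for metric_id in metric_ids:
        groups[sack_for_metric(metric_id, num_sacks)].append(metric_id)
    return dict(groups)


def rows_pulled(num_workers, num_deleted, sack_in_indexer):
    # without sack info, every worker pulls every deleted metric and
    # filters locally. with sack info (and assuming each sack is handled
    # by one worker), each deleted metric is returned exactly once.
    return num_deleted if sack_in_indexer else num_workers * num_deleted


# the janitor example from the quoted mail: two metrics share a sack
deleted = [uuid.UUID(int=i) for i in (0, 1, 8)]
by_sack = group_by_sack(deleted)
locks_needed = len(by_sack)  # 2 sack locks instead of 3 lock+delete rounds
```

note that in both cases the number of indexer calls per cleanup round is still n, one per worker; sack information only cuts the rows returned, e.g. `rows_pulled(4, 1000, False)` is 4000 versus 1000 with it.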


>>
>> alternatively, this could be solved by keeping per-metric locks in
>> addition to per-sack locks. this would effectively double the number of
>> active locks we have so instead of each metricd worker having a single
>> per-sack lock, it will also have a per-metric lock for whatever metric
>> it may be publishing at the time.
>
> If we got a timeout set for scenario 3, I'm not that worried. I guess
> worst thing is that people would be unhappy with the API spending time
> doing computation anyway so we'd need to rework how refresh work or add
> an ability to disable it.
>

refresh is currently disabled by default so i think we're ok.

what's the timeout for? to time out the api's attempt to aggregate the
metric? i think it's a bad experience if we add any timeout, since i
assume it will still return what it can and then the results become
somewhat ambiguous.
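
a rough sketch of what i mean by ambiguous, assuming a timeout on lock acquisition. `threading.Lock` is only a stand-in for whatever lock metricd and the api would share, and `refresh_then_read` and its callbacks are hypothetical names, not gnocchi's api:

```python
import threading


def refresh_then_read(lock, process_backlog, read_aggregates, timeout):
    # try to take the lock so unprocessed measures can be folded in first
    refreshed = lock.acquire(timeout=timeout)
    if refreshed:
        try:
            process_backlog()
        finally:
            lock.release()
    # on timeout we still return data, just without the backlog applied
    return read_aggregates(), refreshed


store = {"processed": [1], "backlog": [2]}


def process_backlog():
    store["processed"].extend(store["backlog"])
    store["backlog"].clear()


def read_aggregates():
    return sum(store["processed"])


lock = threading.Lock()

lock.acquire()  # simulate metricd holding the lock
stale = refresh_then_read(lock, process_backlog, read_aggregates, 0.01)
lock.release()

fresh = refresh_then_read(lock, process_backlog, read_aggregates, 0.01)
```

unless something like the `refreshed` flag is surfaced to the caller, the stale and fresh responses look the same, which is the ambiguity i'm worried about.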

now that i think about it more, this issue still exists in the per-metric
scenario (but to a lesser extent). 'refresh' can still be blocked by
metricd, but the chance is significantly smaller and the window for
missing unprocessed measures is smaller.

-- 
gord


