[openstack-dev] [nova] nova cellsv2 and DBs / down cells / quotas
melwittt at gmail.com
Thu Oct 25 06:29:18 UTC 2018
On Thu, 25 Oct 2018 10:55:15 +1100, Sam Morrison wrote:
>> On 24 Oct 2018, at 4:01 pm, melanie witt <melwittt at gmail.com> wrote:
>> On Wed, 24 Oct 2018 10:54:31 +1100, Sam Morrison wrote:
>>> Hi nova devs,
>>> Have been having a good look into cellsv2 and how we migrate to them (we’re still on cellsv1 and about to upgrade to queens and still run cells v1 for now).
>>> One of the problems I have is that now all our nova cell database servers need to respond to API requests.
>>> With cellsv1 our architecture was to have a big powerful DB cluster (3 physical servers) at the API level to handle the API cell and then a smallish non HA DB server (usually just a VM) for each of the compute cells.
>>> This architecture won’t work with cells V2 and we’ll now need to have a lot of highly available and responsive DB servers for all the cells.
>>> It will also mean that our nova-apis which reside in Melbourne, Australia will now need to talk to database servers in Auckland, New Zealand.
>>> The biggest issue we have is when a cell is down. We sometimes have cells go down for an hour or so planned or unplanned and with cellsv1 this does not affect other cells.
>>> Looks like some good work going on here https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:master+topic:bp/handling-down-cell
>>> But what about quota? If a cell goes down then it would seem that a user all of a sudden would regain some quota from the instances that are in the down cell?
>>> Just wondering if anyone has thought about this?
>> Yes, we've discussed it quite a bit. The current plan is to offer a policy-driven behavior as part of the "down" cell handling which will control whether nova will:
>> a) Reject a server create request if the user owns instances in "down" cells
>> b) Go ahead and count quota usage "as-is" if the user owns instances in "down" cells and allow quota limit to be potentially exceeded
>> We would like to know if you think this plan will work for you.
>> Further down the road, if we're able to come to an agreement on a consumer type/owner or partitioning concept in placement (to be certain we are counting usage our instance of nova owns, as placement is a shared service), we could count quota usage from placement instead of querying cells.
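To make the two options quoted above concrete, here is a minimal sketch of how such a policy switch could behave. It is not nova code: `DownCellPolicy`, `Cell`, `QuotaError`, `check_instance_quota` and the `mapped_cells` argument are all hypothetical names, and the sketch assumes the API level already knows which cells hold a user's instances.

```python
from dataclasses import dataclass
from enum import Enum


class DownCellPolicy(Enum):
    REJECT = "reject"        # option (a): refuse the create request
    COUNT_AS_IS = "as_is"    # option (b): count only what is reachable


@dataclass
class Cell:
    name: str
    up: bool
    instances_by_user: dict  # user -> instance count held in this cell's DB


class QuotaError(Exception):
    pass


def check_instance_quota(user, requested, limit, cells, mapped_cells, policy):
    """Return the usage counted from reachable cells, or raise QuotaError.

    mapped_cells is the set of cell names where the user is known (at the
    API level, by assumption) to own instances.
    """
    owns_in_down_cell = any(
        not c.up and c.name in mapped_cells for c in cells
    )
    if policy is DownCellPolicy.REJECT and owns_in_down_cell:
        raise QuotaError("user owns instances in a down cell")

    # Down cells cannot be queried, so their instances are not counted.
    usage = sum(c.instances_by_user.get(user, 0) for c in cells if c.up)
    if usage + requested > limit:
        raise QuotaError("instance quota exceeded")
    return usage
```

With three instances in an "up" cell, four in a "down" cell and a limit of five, a request for two more is rejected under option (a) and accepted under option (b), which leaves the user with nine instances against a limit of five once the cell returns.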
> OK great, always good to know other people are thinking for you :-) , I don’t really like a or b but the idea about using placement sounds like a good one to me.
Your honesty is appreciated. :) We do want to get to where we can use
placement for quota usage. There is too much higher-priority
placement-related work in flight right now (getting nested resource
providers working end-to-end, for one) for it to receive adequate
attention at this moment. We've been discussing it on the spec over
the past few days, if you're interested.
> I guess our architecture is pretty unique in a way but I wonder if other people are also a little scared about the whole "all DB servers need to be up to serve API requests" thing?
You are not alone. At CERN, they are experiencing the same challenges.
They too have an architecture where they had deployed less powerful
database servers in cells and also have cell sites that are located
geographically far away. They have been driving the "handling of a down
cell" work.
> I’ve been thinking of some hybrid cellsv1/v2 thing where we’d still have the top level api cell DB but the API would only ever read from it. Nova-api would only write to the compute cell DBs.
> Then keep the nova-cells processes just doing instance_update_at_top to keep the nova-cell-api db up to date.
> We’d still have syncing issues but we have that with placement now and that is more frequent than nova-cells-v1 is for us.
I have had similar thoughts, but keep ending up at the syncing/racing
issues, like you said. I think it's something we'll need to discuss and
explore more, to see if we can come up with a reasonable way to address
the increased demand on cell databases as it's been a considerable pain
point for deployments like yours and CERN's.
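As a rough illustration of the syncing issue with the hybrid idea quoted above, here is a toy model in which writes go to a cell DB, reads come from the top-level DB, and a separate sync step copies updates upward. `HybridCells` and its methods are hypothetical and stand in for nova-api, the cell DBs and the instance_update_at_top path; nothing here is actual nova code.

```python
from collections import deque


class HybridCells:
    """Toy model: write to cell DBs, read from a top-level copy."""

    def __init__(self):
        self.top_db = {}        # instance uuid -> state, read by the API
        self.cell_dbs = {}      # cell name -> {instance uuid -> state}
        self.pending = deque()  # updates not yet pushed to the top level

    def write(self, cell, uuid, state):
        # The API writes only to the compute cell DB.
        self.cell_dbs.setdefault(cell, {})[uuid] = state
        self.pending.append((uuid, state))

    def read(self, uuid):
        # The API reads only from the top-level DB, which may be stale.
        return self.top_db.get(uuid)

    def sync_once(self):
        # Stand-in for one upward update; returns False if nothing to do.
        if not self.pending:
            return False
        uuid, state = self.pending.popleft()
        self.top_db[uuid] = state
        return True
```

A `write()` followed immediately by a `read()` returns the old value (or nothing) until `sync_once()` has run, which is the window where the racing shows up.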