Thanks all for your responses. Replies to Dan inline.

On Mon, 30 Sep 2019 at 18:27, Dan Smith <dms@danplanet.com> wrote:
1. Is there any benefit to not having a superconductor? Presumably it's a little more efficient in the single cell case? Also IIUC it only requires a single message queue so is a little simpler?
In a multi-cell case you need it, but you're asking about the case where there's only one (real) cell yeah?
If the deployment is really small, then the overhead of having one is probably measurable and undesirable. I dunno what to tell you about where that cut-off is, unfortunately. However, once you're over a certain number of nodes, that probably shakes out a bit. The superconductor does things that the cell-specific ones won't have to do, so there's about the same amount of total load, just a potentially larger memory footprint for running extra services, which would be measurable at small scales. For a tiny deployment there's also overhead just in the complexity, but one of the goals of v2 has always been to get everyone on the same architecture, so having a "small mode" and a "large mode" brings with it its own complexity.
Thanks for the explanation. We've built in a switch for single or super mode, and single mode keeps us compatible with existing deployments, so I guess we'll keep the switch.
2. Do console proxies need to live in the cells? This is what devstack does in superconductor mode. I did some digging through nova code, and it looks that way. Testing with novncproxy agrees. This suggests we need to expose a unique proxy endpoint for each cell, and configure all computes to use the right one via e.g. novncproxy_base_url, correct?
I'll punt this to Melanie, as she's the console expert at this point, but I imagine you're right.
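For illustration, here is a rough sketch of what "a unique proxy endpoint for each cell" could look like when templating the `[vnc]` section for the computes in a given cell. The cell names, hostnames and the `render_vnc_section` helper are made up for the example; only the `novncproxy_base_url` option comes from the discussion above.

```python
import configparser
import io

# Hypothetical mapping of cell name to the public host of that cell's noVNC proxy.
CELL_PROXY_HOSTS = {
    "cell1": "vnc-cell1.example.com",
    "cell2": "vnc-cell2.example.com",
}


def render_vnc_section(cell_name, port=6080):
    """Render the [vnc] fragment for computes that belong to cell_name."""
    host = CELL_PROXY_HOSTS[cell_name]
    config = configparser.ConfigParser()
    config["vnc"] = {
        # Each cell's computes point at that cell's own proxy endpoint.
        "novncproxy_base_url": f"http://{host}:{port}/vnc_auto.html",
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(render_vnc_section("cell1"))
```

Whether one endpoint per cell is really required is still subject to confirmation from someone who knows the console code better.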
3. Should I upgrade the superconductor or conductor service first?
Superconductor first, although they all kinda have to go around the same time. Superconductor, like the regular conductors, needs to look at the cell database directly, so if you were to upgrade superconductor before the cell database you'd likely have issues. I think probably the ideal would be to upgrade the db schema everywhere (which you can do without rolling code), then upgrade the top-level services (conductor, scheduler, api) and then you could probably get away with doing conductor in the cell along with computes, or whatever. If possible rolling the cell conductors with the top-level services would be ideal.
I should have included my strawman deploy and upgrade flow for context, but I'm still honing it. All DB schema changes will be done up front in both cases. In terms of ordering, the API-level services (superconductor, API, scheduler) are grouped together and will be rolled first - agreeing with what you've said. I think between Ansible's tags and limiting actions to specific hosts, the code can be written to support upgrading all cell conductors together, or at the same time as (well, immediately before) the cell's computes. The thinking behind upgrading one cell at a time is to limit the blast radius if something goes wrong. You suggest it would be better to roll all cell conductors at the same time though - do you think it's safer to run with the version disparity between conductor and computes rather than between super- and cell conductors?
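To make the two orderings concrete, here is a small sketch that prints the step list for each. It is only a model of the strawman flow described above, not the actual Ansible code; the `upgrade_plan` function and the step labels are invented for the example.

```python
def upgrade_plan(cells, conductors_with_top_level=True):
    """Return an ordered list of upgrade steps for the given cells.

    conductors_with_top_level=True rolls all cell conductors together with
    the API-level services; False rolls each cell's conductor immediately
    before that cell's computes.
    """
    # Schema changes are done up front in both variants.
    plan = ["db schema: API DB and all cell DBs"]

    top_level = ["superconductor", "api", "scheduler"]
    if conductors_with_top_level:
        top_level += [f"conductor ({cell})" for cell in cells]
    plan.append("roll: " + ", ".join(top_level))

    # Cells are handled one at a time to limit the blast radius.
    for cell in cells:
        if not conductors_with_top_level:
            plan.append(f"roll: conductor ({cell})")
        plan.append(f"roll: computes ({cell})")
    return plan


if __name__ == "__main__":
    for step in upgrade_plan(["cell1", "cell2"], conductors_with_top_level=False):
        print(step)
```

The first variant keeps super- and cell conductors on the same version; the second keeps each cell's conductor close to its computes.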
4. Does the cell conductor need access to the API DB?
Technically it should not be allowed to talk to the API DB for "separation of concerns" reasons. However, there are a couple of features that still rely on the cell conductor being able to upcall to the API database, such as the late affinity check. If you can only choose one, then I'd say configure the cell conductors to talk to the API DB, but if there's a knob for "isolate them" it'd be better.
Knobs are easy to make, and difficult to keep working in all positions :) It seems worthwhile in this case.
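As a sketch of what such a knob might do, the function below builds the database sections for a cell conductor's nova.conf and leaves out `[api_database]` when isolation is requested. The `isolate` flag and the function name are hypothetical; the connection URLs are placeholders.

```python
def cell_conductor_db_config(cell_db_url, api_db_url, isolate=False):
    """Return the DB-related nova.conf sections for a cell conductor.

    With isolate=False the conductor can still upcall to the API DB
    (needed for features such as the late affinity check). With
    isolate=True the API DB connection is omitted entirely.
    """
    config = {"database": {"connection": cell_db_url}}
    if not isolate:
        config["api_database"] = {"connection": api_db_url}
    return config


if __name__ == "__main__":
    print(cell_conductor_db_config(
        "mysql+pymysql://nova:secret@db-cell1/nova",
        "mysql+pymysql://nova:secret@db-api/nova_api",
        isolate=True,
    ))
```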
5. What DB configuration should be used in nova.conf when running online data migrations? I can see some migrations that seem to need the API DB, and others that need a cell DB. If I just give it the API DB, will it use the cell mappings to get to each cell DB, or do I need to run it once for each cell?
The API DB has its own set of migrations, so you obviously need API DB connection info to make that happen. There is no fanout to all the rest of the cells (currently), so you need to run it with a conf file pointing to the cell, for each cell you have. The latest attempt at making this fan out was abandoned in July with no explanation, so it dropped off my radar at least.
That makes sense. The rolling upgrade docs could be a little clearer for multi-cell deployments here.
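In other words, something like the loop below, run once per cell. This is only a sketch: it assumes one nova.conf per cell at made-up paths, and it builds the `nova-manage` command lines rather than executing them.

```python
def migration_commands(cell_conf_files):
    """Build one online data migration command per cell config file.

    There is no fan-out across cells, so each command points nova-manage
    at a config file whose [database] section targets that cell's DB.
    """
    return [
        ["nova-manage", "--config-file", path, "db", "online_data_migrations"]
        for path in cell_conf_files
    ]


if __name__ == "__main__":
    # Hypothetical per-cell config files.
    confs = ["/etc/nova/nova-cell1.conf", "/etc/nova/nova-cell2.conf"]
    for command in migration_commands(confs):
        print(" ".join(command))
```

Each list could be handed to `subprocess.run` as-is once the paths match a real deployment.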
6. After an upgrade, when can we restart services to unpin the compute RPC version? Looking at the compute RPC API, it looks like the super conductor will remain pinned until all computes have been upgraded. For a cell conductor, it looks like I could restart it to unpin after upgrading all computes in that cell, correct?
Yeah.
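To spell out the rule as I understand it: a cell conductor can be restarted once every compute in its cell is upgraded, and the superconductor only once every compute in every cell is. A minimal sketch, using invented integer versions and a made-up `can_unpin` helper rather than anything from nova itself:

```python
def can_unpin(compute_versions_by_cell, target_version):
    """Report which conductors could be restarted to unpin the compute RPC version.

    compute_versions_by_cell maps a cell name to the versions reported by
    the computes in that cell (plain integers here, purely illustrative).
    """
    ready = {
        cell: all(version >= target_version for version in versions)
        for cell, versions in compute_versions_by_cell.items()
    }
    # The superconductor sees every cell, so it waits for all of them.
    ready["superconductor"] = all(ready.values())
    return ready


if __name__ == "__main__":
    print(can_unpin({"cell1": [40, 40], "cell2": [40, 38]}, target_version=40))
```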
7. Which services require policy.{yml,json}? I can see policy referenced in API, conductor and compute.
That's a good question. I would have thought it was just API, so maybe someone else can chime in here, although it's not specific to cells.
Yeah, unrelated to cells, just something I wondered while digging through our nova Ansible role. Here is the line that made me think policies are required in conductors: https://opendev.org/openstack/nova/src/commit/6d5fdb4ef4dc3e5f40298e751d966c.... I guess this is only required for cell conductors though?
--Dan