On Wed, Nov 18, 2020 at 11:24, Dan Smith <dms@danplanet.com> wrote:
There was a long discussion on #openstack-nova today[3] around this topic. So you can find more detailed reasoning there[3].
I just had a long voice chat with Ollie about this, trying to understand some of the challenges that still exist with some of the things we've discussed and the proposed forward steps.
Thank you Dan for pushing this forward.
There are lots of things to clarify, I think, and we could honestly probably use a wider voice chat among the people that need to work on this. However, I'll try here.
First off, I want to address the "do it like devstack" recommendation, and the subsequent suggestion of standardizing on something like a "nova-db.conf" file arrangement to keep the db credentials in one place. As Ollie has pointed out, devstack doesn't support all the various deployment arrangements (such as "one metadata api per cell") and thus doesn't have a prescription for how to arrange the config files in those scenarios. Specifically, an all-in-one deployment that includes multiple API services with different configs would currently have to hack around the hardcoded nova.conf file with the environment variable that allows them to specify a directory other than /etc/nova.conf. Devstack doesn't have to do this because it doesn't support that arrangement of services, but if it did, it would have to.
So, I wanted to clarify something that came out of the IRC discussion, which is "do it like devstack". When we say that, or at least when *I* said it, we meant "have different config files for each service so that you can get the config elements appropriate for each service set correctly." That doesn't mean "nova-compute should point to /etc/nova/nova-cpu.conf because that's what devstack does". Especially in a containerized setup, I would expect everything to use /etc/nova/nova.conf, since those are namespaced per-service because of the containerization. Further, just because devstack doesn't run a metadata per cell (or any other such optional arrangement), obviously that doesn't mean you can't.
That leads me to the first action item I think we need:
ACTION 1: Make the wsgi app able to use something other than nova.conf
All of our other services support a --config-file flag, and I don't see why we wouldn't allow this if it's necessary for deployments. In the pure API arrangement, it shouldn't really be necessary to change this, but clearly you may need to have multiple API workers with different configs, as is the case with metadata-per-cell, or even potentially some admin-focused private API endpoint. If it makes it easier for deployment tools to start the API workers with a specific config file, we should let them do that. You can already work around that hard-coded filename by setting OS_NOVA_CONFIG_DIR to something other than /etc/nova, but we shouldn't require them to create directories just to have separate configs.
I will look into this change and propose patch a patch.
The next item is related:
ACTION 2: We need to document a minimal viable config for each service
Ollie points out that, obviously, handing the deployer a config reference with 1.21 bajillion config options does not clearly indicate which services need which thing, and especially, which things are _not_allowed_ to be set for a service (such as db credentials on the compute). We can't feasibly audit every config option we have, but it would be very helpful to have a reference that shows what a completely minimal configuration for each service looks like. We could do that by tagging services per config options (which I suggested earlier, and would still be good) but I think maybe a standalone document would be easier for people to follow.
I've opened a doc bug to track the minimal config documentation work: https://bugs.launchpad.net/nova/+bug/1904179 But I would like to get help from others to start proposing this doc.
Now, on to the question about the hard-fail patch for compute:
https://review.opendev.org/#/c/762176/
The Nova devs have wanted people to _not_ have db credentials in the config file that nova-compute can see for a "Very Long Time." We have even made some assumptions in the code that rely on these credentials being not set on the compute, which is at least partially why we're having this conversation (again) now.
Despite the fact that nobody *should* be setting these credentials in their config file, we know that at in least some circumstances, they are. TripleO is setting them (always I think?) because puppet-nova does that. OSA has said that they're setting them somewhat incidentally when they do all-in-one deployments. The latter is unlikely to affect any real deployments, but the former definitely _will_, and anyone else that isn't present in this discussion may be including credentials unnecessarily, which we will break when we merge that patch. Thus, I think that, even though it feels long overdue for devs, the most prudent and cautious approach will be to:
ACTION 3: Not merge that hard fail until X
We have the warning, we have the reno. We expect that the deployment tools will be working to eliminate the credentials for the compute config, but merging that will make it an emergency for them. We've waited years at this point, we can wait one more cycle for safety. As Ollie pointed out, we've not been super clear about messaging this, despite it being a well-worn assumption amongst the developers for a long time.
I'm OK not to merge the breaking patch in W.
To aid in smoking out any of the jobs or configs that deployment tools may still have which will break when we merge that patch, we should also consider:
ACTION 4: A workaround flag to opt-in to the strict checking
Basically, this would be a one or two-cycle workaround, which when enabled, would opt-in to the (above) behavior of causing compute to fail on startup if db credentials are present. This would allow the deployment tool maintainers, as well as people responsible for CI jobs to smoke test the new behavior early, before we merge it and cause an emergency for them. We can set this as on by default in our devstack jobs to signal that it is good with the new behavior, and also encourage the other deployment projects to set it as well once they're generating clean configs for their nova-computes. Once we merge the patch to actually fail all the time, we can remove this workaround config, according to the original intent of the workarounds group, that those flags are short-lived.
We could do this by altering gibi's patch to only fail if the workaround is enabled, and then follow up in X by removing the workaround check.
I will make the suggested changes in my patch.
So, I know that was a lot of words, but I think if we can agree to the above four items, we'll have a path forward that doesn't overly burden any one specific group while still allowing us to chart a path to getting where we want to be.
I think we need acks from a bunch of groups, probably at least:
- Nova (gibi, stephenfin) - TripleO (owalsh) - Debian (zigo) - Kolla (yoctozepto) - OSA (noonedeadpunk) - rdo-ci (jpena) - puppet-nova (tkajinam)
Are people okay with these action items?
I'm on OK with this direction. Cheers, gibi
--Dan