[openstack-dev] [TripleO] Removing global bootstrap_nodeid?
Jiří Stránský
jistr at redhat.com
Tue Sep 25 13:05:51 UTC 2018
Hi Steve,
On 25/09/2018 10:51, Steven Hardy wrote:
> Hi all,
>
> After some discussions with bandini at the PTG, I've been taking a
> look at this bug and how to solve it:
>
> https://bugs.launchpad.net/tripleo/+bug/1792613
> (Also more information in downstream bz1626140)
>
> The problem is that we always run container bootstrap tasks (as well
> as a bunch of update/upgrade tasks) on the bootstrap_nodeid, which by
> default is always the overcloud-controller-0 node (regardless of which
> services are running there).
>
> This breaks a pattern we established a while ago for Composable HA,
> where we work out the bootstrap node by
> $service_short_bootstrap_hostname, which means we always run on the
> first node that has the service enabled (even if it spans more than
> one Role).
>
> This presents two problems:
>
> 1. service_config_settings only puts the per-service hieradata on
> nodes where a service is enabled, hence data needed for the
> bootstrapping (e.g. keystone users etc.) can be missing if, say,
> keystone is running on some role that's not Controller (this, I think,
> is the root cause of the bug/bz linked above)
>
> 2. Even when we by luck have the data needed to complete the bootstrap
> tasks, we'll end up pulling service containers to nodes where the
> service isn't running, potentially wasting both time and space.
>
> I've been looking at solutions, and it seems like we either have to
> generate per-service bootstrap_nodeid's (I have a patch to do this
> https://review.openstack.org/605010), or perhaps we could just remove
> all the bootstrap node id's, and switch to using hostnames instead?
> Seems like that could be simpler, but wanted to check if there's
> anything I'm missing?
I think we should recheck the initial assumptions, because based on my
testing:
* the bootstrap_nodeid is in fact a hostname already, despite its
deceptive name,
* it's not global, it is per-role.
From my env:
[root@overcloud-controller-2 ~]# hiera -c /etc/puppet/hiera.yaml bootstrap_nodeid
overcloud-controller-0

[root@overcloud-novacompute-1 ~]# hiera -c /etc/puppet/hiera.yaml bootstrap_nodeid
overcloud-novacompute-0
This makes me think problems (1) and (2) as stated above shouldn't
be happening. The containers or tasks present in a service definition
should be executed on all nodes where a service is present, and if they
additionally filter for bootstrap_nodeid, it would only pick one node
per role. So, the service *should* be deployed on the selected bootstrap
node, which means the service-specific hiera should be present there and
needless container downloading should not be happening, AFAICT.
However, thinking about it this way, we probably have a different
problem still:
(3) The actions which use the bootstrap_nodeid check are not guaranteed
to execute once per service. If the service is present on multiple
roles, the bootstrap_nodeid check succeeds once per such role.
Using a per-service bootstrap host instead of a per-role bootstrap host
sounds like the right way to go, then.
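
To illustrate (3), here is a minimal Python sketch of the difference
between the two selection schemes. It is a toy model, not TripleO code:
the role-to-host and role-to-service mappings, the "shared_service"
name, and both helper functions are hypothetical, and it assumes the
bootstrap host is simply the first host in each list.

```python
# Toy model only: these mappings and names are made up for illustration.
ROLE_HOSTS = {
    "Controller": [
        "overcloud-controller-0",
        "overcloud-controller-1",
        "overcloud-controller-2",
    ],
    "Compute": [
        "overcloud-novacompute-0",
        "overcloud-novacompute-1",
    ],
}

# Hypothetical placement: "shared_service" spans both roles.
ROLE_SERVICES = {
    "Controller": {"keystone", "shared_service"},
    "Compute": {"shared_service"},
}


def per_role_bootstrap_hosts(service):
    """Hosts where a per-role bootstrap check passes for this service.

    Each role has its own bootstrap host (assumed: the first in the list),
    so the check passes once per role that runs the service.
    """
    return [
        hosts[0]
        for role, hosts in ROLE_HOSTS.items()
        if hosts and service in ROLE_SERVICES.get(role, set())
    ]


def per_service_bootstrap_host(service):
    """Single bootstrap host for the service across all roles.

    Assumed: the first host, in role order, that runs the service.
    """
    candidates = [
        host
        for role, hosts in ROLE_HOSTS.items()
        if service in ROLE_SERVICES.get(role, set())
        for host in hosts
    ]
    return candidates[0] if candidates else None


if __name__ == "__main__":
    for svc in ("keystone", "shared_service"):
        print(svc, per_role_bootstrap_hosts(svc), per_service_bootstrap_host(svc))
```

In this model a service confined to one role gets exactly one bootstrap
host either way, while a service spanning two roles passes the per-role
check on two hosts, which is the "more than once per service" case.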
However, none of the above provides a solid explanation of what's really
happening in the LP/BZ mentioned above. Hopefully this info is at least
a piece of the puzzle.
Jirka