[openstack-dev] [devstack] [ironic] [nova] Trying again on wait_for_compute in devstack
sean at dague.net
Wed Aug 2 11:17:46 UTC 2017
The 3 node scenarios in Neutron (which are still experimental, non-voting) are
typically failing to bring online the 3rd compute. In cells v2 you have
to explicitly add nodes to the cells. There is a nova-manage command,
"discover_hosts", that takes all the compute nodes which have checked in,
but aren't yet assigned to a cell, and puts them into a cell of your
choosing. We do this in devstack-gate in the gate.
However... subnodes don't take very long to set up (so few services). And
the nova-compute process takes about 30s before it has done all its
initialization and actually checks in to the cluster. It's a real
possibility that discover_hosts will run before subnode 3 checks in. We
see it in logs. This also really could come and bite us on any multinode
job, and I'm a bit concerned some of the multinode jobs aren't running
multinode sometimes because of it.
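
To make the race concrete, here is a toy model of it in Python. This is only
a sketch: the function and variable names are hypothetical stand-ins and do
not mirror Nova's actual code or data model. It only illustrates that a host
which checks in after discovery has run stays unmapped.

```python
# Toy model only: names here are hypothetical and do not mirror Nova's code.

def discover_hosts(checked_in, cell_mappings, cell):
    """Map every checked-in host that has no cell yet into `cell`.

    Returns the list of hosts that were newly mapped.
    """
    newly_mapped = [h for h in checked_in if h not in cell_mappings]
    for host in newly_mapped:
        cell_mappings[host] = cell
    return newly_mapped


checked_in = ["compute1", "compute2"]   # subnode 3 is still initializing
cell_mappings = {}

# Discovery runs while only two computes have checked in.
discover_hosts(checked_in, cell_mappings, "cell1")

# The third compute checks in only after discovery has already run.
checked_in.append("compute3")

unmapped = [h for h in checked_in if h not in cell_mappings]
print(unmapped)  # ['compute3']
```

In this model the job would carry on with two mapped computes and never
notice the third, which is the failure mode described above.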
One way to fix this, without putting more logic in devstack-gate, is to
ensure that by the time stack.sh finishes, the compute node is up. This
was tried previously, but it turned out that we totally missed that it
broke Ironic (the check happened too early, ironic was not yet running,
so we always failed), Cells v1 (munges hostnames :( ), and PowerVM
(their nova-compute was never starting correctly, and they were working
around it with a restart later).
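
The general shape of such a check is a poll with a timeout. The following is
a minimal sketch in Python, not the devstack implementation (devstack itself
is shell). The `is_up` callable, the function name, and the default timeout
and interval are all assumptions made for illustration; a real check would
have to ask Nova whether the host has registered.

```python
import time


def wait_for_compute(is_up, timeout=60.0, interval=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll `is_up()` until it returns True or `timeout` seconds elapse.

    `is_up` is a hypothetical callable supplied by the caller. Returns True
    if the compute came up in time, False if the deadline passed first.
    """
    deadline = clock() + timeout
    while True:
        if is_up():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Usage with a fake check that succeeds on the third poll.
attempts = {"count": 0}

def fake_is_up():
    attempts["count"] += 1
    return attempts["count"] >= 3

result = wait_for_compute(fake_is_up, timeout=5.0, interval=0.01)
print(result, attempts["count"])  # True 3
```

Where the check runs matters as much as the loop itself: run it before the
service it depends on has started and it can only time out, which is what
happened with Ironic.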
This patch https://review.openstack.org/#/c/488381/ tries again. The
check is moved very late, and Ironic seems to be running fine with it. Cells
v1 is just skipped; it's deprecated in Nova now, and we're not going
to use it in multinode scenarios. The PowerVM team fixed their
nova-compute start issues, so they should be good to go as well.
This is an FYI that we're going to land this again soon. If you think
this impacts your CI / jobs, please speak up. The CI runs on both the
main and experimental queues on devstack for this change look pretty
good, so I think we're safe to move forward this time. But we also
thought that the last time, and were wrong.