[openstack-dev] [TripleO][CI] Need more undercloud resources

James Slagle james.slagle at gmail.com
Wed Aug 24 18:11:32 UTC 2016


The latest recurring problem, which is failing many of the nonha SSL
jobs in tripleo-ci, is:

https://bugs.launchpad.net/tripleo/+bug/1616144
tripleo-ci: nonha jobs failing with Unable to establish connection to
https://192.0.2.2:13004/v1/a90407df1e7f4f80a38a1b1671ced2ff/stacks/overcloud/f9f6f712-8e89-4ea9-a34b-6084dc74b5c1

This error happens while polling for events from the overcloud stack
by tripleoclient.

I can reproduce this error very easily locally by deploying with an
SSL undercloud with 6GB RAM and 2 vcpus. If I don't enable swap,
something gets OOM-killed. If I do enable swap, some of it gets used
(< 1GB) and then I hit this error almost every time.
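
For anyone trying to reproduce this, the swap figure above can be read
from /proc/meminfo on the undercloud. The following is only a minimal
sketch: the function name swap_used_kb is made up for illustration, and
it assumes the standard Linux SwapTotal and SwapFree fields, which are
reported in kB.

```python
def swap_used_kb(meminfo_text):
    """Return swap in use, in kB, given the text of /proc/meminfo.

    Hypothetical helper for illustration; not part of tripleo-ci.
    """
    fields = {}
    for line in meminfo_text.splitlines():
        # Lines look like "SwapTotal:       2097148 kB".
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            fields[name.strip()] = int(parts[0])
    # Swap in use is the difference between total and free swap.
    return fields["SwapTotal"] - fields["SwapFree"]
```

Passing it the contents of /proc/meminfo during a deploy and dividing
the result by 1024 gives the swap in use in MiB.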

The stack keeps deploying, but the client has died, so the job fails.
So far, my investigation points only to swap usage delaying things
enough to cause the failure.
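
To illustrate the failure mode (this is not tripleoclient's actual
code), here is a rough sketch of an event polling loop. All names in it
(ConnectionFailure, poll_events, fetch, is_done, max_failures) are
hypothetical. With max_failures=0, a single failed connection ends the
polling even though the work on the server side carries on, which is
the behavior described above.

```python
import time


class ConnectionFailure(Exception):
    """Stand-in for the client's 'Unable to establish connection' error."""


def poll_events(fetch, is_done, max_failures=0, delay=0.0):
    """Call fetch() repeatedly until is_done(event) returns True.

    fetch and is_done are hypothetical callables supplied by the caller.
    max_failures is how many consecutive connection failures to
    tolerate; 0 means the first failure is fatal to the poller.
    """
    failures = 0
    events = []
    while True:
        try:
            event = fetch()
        except ConnectionFailure:
            failures += 1
            if failures > max_failures:
                # The poller gives up; the remote stack is unaffected.
                raise
            time.sleep(delay)
            continue
        failures = 0  # a successful poll resets the failure count
        events.append(event)
        if is_done(event):
            return events
        time.sleep(delay)
```

Whether tolerating a few failed polls would be an acceptable short term
workaround for this bug is an open question; the sketch only shows why
the client and the stack can disagree about the outcome.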

We do not see this error in the ha job even though it deploys more
nodes. As of now, my only suspicion is that the overhead of the
initial SSL connections is causing the error.

If I test with 6GB ram and 4 vcpus I can't reproduce the error,
although much more swap is used due to the increased number of default
workers for each API service.

However, I suggest we just raise the undercloud specs in our jobs to
8GB RAM and 4 vcpus. These seem reasonable to me because they are the
default specs used by infra in all of their devstack single-node and
multinode jobs spawned on all their other cloud providers. Our own
multinode job for the undercloud/overcloud and our undercloud-only job
already run on instances of this size.

Yes, this is just sidestepping the problem by throwing more resources
at it. The reality is that we do not prioritize working on optimizing
for speed/performance/resources. We prioritize feature work that
indirectly (or maybe it's directly?) makes everything slower,
especially at this point in the development cycle.

We should therefore expect to have to continue to provide more and
more resources to our CI jobs until we prioritize optimizing them to
run with less.

Let me know if there is any disagreement on making these changes. If
there isn't, I'll apply them in the next day or so. If there are any
other ideas on how to address this particular bug for some immediate
short term relief, please let me know.

-- 
-- James Slagle
--


