[openstack-dev] [TripleO][CI] Need more undercloud resources

Steve Baker sbaker at redhat.com
Wed Aug 24 23:51:24 UTC 2016


On 25/08/16 06:11, James Slagle wrote:
> The latest recurring problem that is failing a lot of the nonha ssl
> jobs in tripleo-ci is:
>
> https://bugs.launchpad.net/tripleo/+bug/1616144
> tripleo-ci: nonha jobs failing with Unable to establish connection to
> https://192.0.2.2:13004/v1/a90407df1e7f4f80a38a1b1671ced2ff/stacks/overcloud/f9f6f712-8e89-4ea9-a34b-6084dc74b5c1
>
> This error happens while polling for events from the overcloud stack
> by tripleoclient.
>
> I can reproduce this error very easily locally by deploying with an
> ssl undercloud with 6GB ram and 2 vcpus. If I don't enable swap,
> something gets OOM killed. If I do enable swap, swap gets used (< 1GB)
> and then I hit this error almost every time.
>
> The stack keeps deploying but the client has died, so the job fails.
> My investigation so far has only pointed out that it's the swap
> allocation that is delaying things enough to cause the failure.
>
> We do not see this error in the ha job even though it deploys more
> nodes. As of now, my only suspect is that it's the overhead of the
> initial SSL connections causing the error.
>
> If I test with 6GB ram and 4 vcpus I can't reproduce the error,
> although much more swap is used due to the increased number of default
> workers for each API service.
>
> However, I suggest we just raise the undercloud specs in our jobs to
> 8GB ram and 4 vcpus. These seem reasonable to me because those are the
> default specs used by infra in all of their devstack single and
> multinode jobs spawned on all their other cloud providers. Our own
> multinode job for the undercloud/overcloud and undercloud only job are
> running on instances of these sizes.
>
> Yes, this is just sidestepping the problem by throwing more resources
> at it. The reality is that we do not prioritize working on optimizing
> for speed/performance/resources. We prioritize feature work that
> indirectly (or maybe it's directly?) makes everything slower,
> especially at this point in the development cycle.
>
> We should therefore expect to have to continue to provide more and
> more resources to our CI jobs until we prioritize optimizing them to
> run with less.
>
> Let me know if there is any disagreement on making these changes. If
> there isn't, I'll apply them in the next day or so. If there are any
> other ideas on how to address this particular bug for some immediate
> short term relief, please let me know.
>
Heat now has efficient polling of nested events, but it doesn't look 
like tripleoclient is using that.

It's not clear whether the current polling is contributing to the above
issue, but I'd definitely recommend switching over.

This is the recommended approach:
http://git.openstack.org/cgit/openstack/python-heatclient/tree/heatclient/osc/v1/stack.py#n180

This is what tripleoclient does currently:

http://git.openstack.org/cgit/openstack/python-tripleoclient/tree/tripleoclient/utils.py#n272

The get_events call is low-overhead, but the get_stack call isn't, and
calling it on every iteration of the polling loop won't be helping.
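
To illustrate the difference, here is a minimal, self-contained sketch of
event-only polling. It is not the heatclient or tripleoclient code: FakeHeat,
list_events, Event and wait_for_stack are hypothetical names made up for this
example, and it assumes the stack reports its own terminal status as an event
whose resource name is the stack name. The loop advances a marker through the
event list and never fetches the full stack.

```python
import time
from collections import namedtuple

# Hypothetical event record; real Heat events carry more fields.
Event = namedtuple("Event", "id resource_name resource_status")


class FakeHeat:
    """Stand-in for a Heat client that counts each kind of API call."""

    def __init__(self, batches):
        self._batches = list(batches)
        self.event_calls = 0
        self.stack_calls = 0

    def list_events(self, stack_name, marker=None):
        # Cheap call: hands back the next batch of new events, if any.
        self.event_calls += 1
        return self._batches.pop(0) if self._batches else []

    def get_stack(self, stack_name):
        # Expensive call: counted so we can show the loop never uses it.
        self.stack_calls += 1
        return {"stack_name": stack_name}


def wait_for_stack(client, stack_name, action="CREATE", poll_period=0.0):
    """Poll events only and return the stack's terminal status."""
    terminal = ("%s_COMPLETE" % action, "%s_FAILED" % action)
    marker = None
    while True:
        for event in client.list_events(stack_name, marker=marker):
            marker = event.id  # only ask for newer events next time
            print("%s %s" % (event.resource_name, event.resource_status))
            # Assumption: the stack's own status arrives as an event
            # named after the stack itself.
            if (event.resource_name == stack_name
                    and event.resource_status in terminal):
                return event.resource_status
        time.sleep(poll_period)
```

With a client like this, the number of get_stack calls stays at zero however
long the deployment runs; only the cheap event listing is repeated.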

poll_for_events currently has no argument for specifying the
nested_depth of the events to log. I can add that to heatclient, unless
you can live with logging events for only the top-level resources.



