[openstack-dev] [TripleO][CI] Need more undercloud resources

Derek Higgins derekh at redhat.com
Thu Aug 25 09:40:37 UTC 2016


On 25 August 2016 at 02:56, Paul Belanger <pabelanger at redhat.com> wrote:
> On Wed, Aug 24, 2016 at 02:11:32PM -0400, James Slagle wrote:
>> The latest recurring problem that is failing a lot of the nonha ssl
>> jobs in tripleo-ci is:
>>
>> https://bugs.launchpad.net/tripleo/+bug/1616144
>> tripleo-ci: nonha jobs failing with Unable to establish connection to
>> https://192.0.2.2:13004/v1/a90407df1e7f4f80a38a1b1671ced2ff/stacks/overcloud/f9f6f712-8e89-4ea9-a34b-6084dc74b5c1
>>
>> This error happens while polling for events from the overcloud stack
>> by tripleoclient.
>>
>> I can reproduce this error very easily locally by deploying with an
>> ssl undercloud with 6GB ram and 2 vcpus. If I don't enable swap,
>> something gets OOM killed. If I do enable swap, swap gets used (< 1GB)
>> and then I hit this error almost every time.
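
For anyone wanting to reproduce this locally, the swap in question is
just a plain swap file; a minimal sketch (the 2G size and /swapfile
path below are only illustrative, not necessarily what the CI scripts
use):

    # create and enable a swap file on the undercloud VM
    sudo fallocate -l 2G /swapfile
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile
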
>>
>> The stack keeps deploying but the client has died, so the job fails.
>> My investigation so far has only pointed to the swap allocation
>> delaying things enough to cause the failure.
>>
>> We do not see this error in the ha job even though it deploys more
>> nodes. As of now, my only suspect is the overhead of the initial
>> SSL connections.
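
A quick way to sanity-check the SSL-overhead theory (purely
illustrative, using the heat endpoint from the bug above):

    # measure how many new TLS handshakes the undercloud can sustain
    openssl s_time -connect 192.0.2.2:13004 -new -time 10
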
>>
>> If I test with 6GB ram and 4 vcpus I can't reproduce the error,
>> although much more swap is used due to the increased number of default
>> workers for each API service.
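
Most of the API services default their worker counts to the number of
CPUs, which is why the extra vcpus translate into more memory (and
swap) use. If we ever want to claw that back, capping workers per
service would be one option; a rough sketch with crudini (the two
option names below are real for heat and nova, but this is an
illustration, not a vetted list):

    # illustrative: cap a couple of services at 2 workers each
    sudo crudini --set /etc/heat/heat.conf DEFAULT num_engine_workers 2
    sudo crudini --set /etc/nova/nova.conf DEFAULT osapi_compute_workers 2
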
>>
>> However, I suggest we just raise the undercloud specs in our jobs to
>> 8GB ram and 4 vcpus. These seem reasonable to me because those are the
>> default specs used by infra in all of their devstack single and
>> multinode jobs spawned on all their other cloud providers. Our own
>> multinode undercloud/overcloud job and the undercloud-only job are
>> already running on instances of this size.
>>
> Close, our current flavors are 8vCPU, 8GB RAM, 80GB HDD[1]. I'd recommend doing
> that for the undercloud just to be consistent.

The HDs on most of the compute nodes are 200GB, so we've been trying
really hard[1] to keep the disk usage of each instance down so that
we can fit as many instances onto each compute node as possible
without being restricted by the HDs. We've also allowed nova to
overcommit on storage by a factor of 3. The assumption is that all of
the instances are short lived and most of them never fully exhaust
the storage allocated to them. Even the ones that do (the undercloud
being the one that does) hit peak usage at different times, so
everything is tickety-boo.
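
For reference, the overcommit is just nova's disk allocation ratio,
roughly this on the compute nodes (the crudini invocation is a
sketch; the option name is the real one the DiskFilter reads):

    # let nova overcommit disk 3:1 when scheduling instances
    sudo crudini --set /etc/nova/nova.conf DEFAULT disk_allocation_ratio 3.0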

I'd strongly encourage against using a flavor with an 80GB HDD. If we
increase the disk space available to the undercloud to 80GB, then we
will eventually be using all of it in CI, and 3 underclouds on the
same compute node will end up filling the disk on that host.
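
If we do bump the undercloud to 8GB RAM and 4 vcpus, something along
these lines would keep the disk modest (the flavor name and exact
disk size here are illustrative):

    # more RAM/vCPUs without growing the disk footprint
    openstack flavor create --ram 8192 --vcpus 4 --disk 40 undercloud-8gb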

[1] http://git.openstack.org/cgit/openstack-infra/tripleo-ci/tree/toci_gate_test.sh#n26

>
> [1] http://docs.openstack.org/infra/system-config/contribute-cloud.html
>
>> Yes, this is just sidestepping the problem by throwing more resources
>> at it. The reality is that we do not prioritize working on optimizing
>> for speed/performance/resources. We prioritize feature work that
>> indirectly (or maybe it's directly?) makes everything slower,
>> especially at this point in the development cycle.
>>
>> We should therefore expect to have to continue to provide more and
>> more resources to our CI jobs until we prioritize optimizing them to
>> run with less.
>>
> I actually believe this problem highlights how large tripleo-ci has grown, and
> that it is in need of a refactor. While we won't solve this problem today, I do
> think tripleo-ci is too monolithic today. I believe there is some discussion on
> breaking jobs into different scenarios, but I haven't had a chance to read up on
> that.
>
> I'm hoping in Barcelona we can have a topic on CI pipelines and how better to
> optimize our runs.
>
>> Let me know if there is any disagreement on making these changes. If
>> there isn't, I'll apply them in the next day or so. If there are any
>> other ideas on how to address this particular bug for some immediate
>> short term relief, please let me know.
>>
>> --
>> -- James Slagle
>> --