[openstack-dev] [TripleO][CI] Need more undercloud resources

James Slagle james.slagle at gmail.com
Fri Aug 26 15:37:17 UTC 2016


On Thu, Aug 25, 2016 at 9:49 AM, James Slagle <james.slagle at gmail.com> wrote:
> On Thu, Aug 25, 2016 at 5:40 AM, Derek Higgins <derekh at redhat.com> wrote:
>> On 25 August 2016 at 02:56, Paul Belanger <pabelanger at redhat.com> wrote:
>>> On Wed, Aug 24, 2016 at 02:11:32PM -0400, James Slagle wrote:
>>>> The latest recurring problem that is failing a lot of the nonha ssl
>>>> jobs in tripleo-ci is:
>>>>
>>>> https://bugs.launchpad.net/tripleo/+bug/1616144
>>>> tripleo-ci: nonha jobs failing with Unable to establish connection to
>>>> https://192.0.2.2:13004/v1/a90407df1e7f4f80a38a1b1671ced2ff/stacks/overcloud/f9f6f712-8e89-4ea9-a34b-6084dc74b5c1
>>>>
>>>> This error happens while polling for events from the overcloud stack
>>>> by tripleoclient.
>>>>
>>>> I can reproduce this error very easily locally by deploying with an
>>>> ssl undercloud with 6GB ram and 2 vcpus. If I don't enable swap,
>>>> something gets OOM killed. If I do enable swap, swap gets used (< 1GB)
>>>> and then I hit this error almost every time.
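
(For anyone reproducing this locally: the original message doesn't say how
swap was enabled, but a plain file-backed swap on the undercloud is one
easy way to do it; rough sketch only:)

    # add ~4GB of file-backed swap on the undercloud VM
    sudo dd if=/dev/zero of=/swapfile bs=1M count=4096
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile
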
>>>>
>>>> The stack keeps deploying but the client has died, so the job fails.
>>>> My investigation so far only points to swap usage delaying things
>>>> enough to cause the failure.
>>>>
>>>> We do not see this error in the ha job even though it deploys more
>>>> nodes. As of now, my only suspect is that it's the overhead of the
>>>> initial SSL connections causing the error.
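
One way to sanity-check that suspicion would be to time the TLS handshakes
against the heat endpoint directly. A rough sketch (not from the thread
itself; the address and port are just the ones from the failing URL in the
bug above, and CI certs are self-signed so verification is disabled):

    import socket
    import ssl
    import time

    HOST = '192.0.2.2'   # undercloud public VIP from the bug report
    PORT = 13004         # heat-api SSL port from the failing URL

    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE  # self-signed CI certificates

    for i in range(10):
        start = time.time()
        sock = socket.create_connection((HOST, PORT), timeout=10)
        tls = context.wrap_socket(sock, server_hostname=HOST)
        elapsed = time.time() - start
        tls.close()
        print('handshake %d: %.3fs' % (i, elapsed))
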
>>>>
>>>> If I test with 6GB ram and 4 vcpus I can't reproduce the error,
>>>> although much more swap is used due to the increased number of default
>>>> workers for each API service.
>>>>
>>>> However, I suggest we just raise the undercloud specs in our jobs to
>>>> 8GB ram and 4 vcpus. These seem reasonable to me because those are the
>>>> default specs used by infra in all of their devstack single and
>>>> multinode jobs spawned on all their other cloud providers. Our own
>>>> multinode undercloud/overcloud job and our undercloud-only job are
>>>> already running on instances of this size.
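
For reference, applying that on the testenv cloud would look something like
the following with python-openstackclient. This is a sketch, not the exact
commands used on rh1: "undercloud" is a placeholder for whatever the flavor
is actually named there, and since RAM and vcpus can't be changed on an
existing flavor it has to be recreated, keeping the existing 40GB disk:

    openstack flavor delete undercloud
    openstack flavor create --ram 8192 --vcpus 4 --disk 40 undercloud
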
>>>>
>>> Close, our current flavors are 8vCPU, 8GB RAM, 80GB HDD. I'd recommend doing
>>> that for the undercloud just to be consistent.
>>
>> The HDs on most of the compute nodes are 200GB, so we've been trying
>> really hard[1] to keep the disk usage of each instance down so that
>> we can fit as many instances onto each compute node as possible
>> without being restricted by the HDs. We've also allowed nova to
>> overcommit on storage by a factor of 3. The assumption is that all of
>> the instances are short lived and most of them never fully exhaust
>> the storage allocated to them. Even the ones that do (the undercloud
>> being the one that does) hit peak usage at different times, so
>> everything is tickety boo.
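
(For context, that 3x overcommit corresponds to nova's disk allocation
ratio on the compute nodes; roughly the following in nova.conf, though the
exact rh1 configuration may differ:)

    # /etc/nova/nova.conf on the testenv compute nodes (illustrative)
    [DEFAULT]
    disk_allocation_ratio = 3.0
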
>>
>> I'd strongly encourage against using a flavor with an 80GB HDD. If we
>> increase the disk space available to the undercloud to 80GB, then we
>> will eventually be using all of it in CI, and 3 underclouds on the
>> same compute node (3 x 80GB = 240GB) would fill up the 200GB disk on
>> that host.
>
> I've gone ahead and made the changes to the undercloud flavor in rh1
> to use 8GB ram and 4 vcpus. I left the disk at 40. I'd like to see us
> use the same flavor specs as the default infra flavor, but going up to
> 8 vcpus would require configuring fewer workers per API service.
> That's something we can iterate towards, I think.
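
For the worker tuning, something along these lines should work via the
undercloud's hieradata override, though the exact puppet keys would need
double-checking against instack-undercloud (a sketch, not a tested config):

    # undercloud.conf
    [DEFAULT]
    hieradata_override = /home/stack/workers.yaml

    # /home/stack/workers.yaml
    heat::api::workers: 2
    nova::api::osapi_compute_workers: 2
    glance::api::workers: 2
    keystone::wsgi::apache::workers: 2
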

It looks like this has had the desired positive effect in the nonha jobs.

Most of the failures now are due to timeouts. When we feel like CI is
stable enough and no adverse effects from the additional resource
usage have been found, it would be worth considering moving forward
with:

https://review.openstack.org/#/c/359481/

to help with the timeouts.

-- 
-- James Slagle
--


