[openstack-dev] [tripleo] CI jobs failures
Ben Nemec
openstack at nemebean.com
Tue Mar 8 18:04:42 UTC 2016
On 03/08/2016 11:58 AM, Derek Higgins wrote:
> On 7 March 2016 at 18:22, Ben Nemec <openstack at nemebean.com> wrote:
>> On 03/07/2016 11:33 AM, Derek Higgins wrote:
>>> On 7 March 2016 at 15:24, Derek Higgins <derekh at redhat.com> wrote:
>>>> On 6 March 2016 at 16:58, James Slagle <james.slagle at gmail.com> wrote:
>>>>> On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi <emilien at redhat.com> wrote:
>>>>>> I'm kind of hijacking Dan's e-mail, but I would like to propose some
>>>>>> technical improvements to stop having so many CI failures.
>>>>>>
>>>>>>
>>>>>> 1/ Stop creating swap files. We don't have SSDs, and IMHO it is a
>>>>>> terrible mistake to swap to files because we don't have enough RAM. In
>>>>>> my experience, swapping on non-SSD disks is even worse than not having
>>>>>> enough RAM. We should stop doing that, I think.
>>>>>
>>>>> We have been relying on swap in tripleo-ci for a little while. While
>>>>> not ideal, it has been an effective way to at least be able to test
>>>>> what we've been testing given the amount of physical RAM that is
>>>>> available.
>>>>
>>>> Ok, so I have a few points here; in places where I'm making
>>>> assumptions I'll try to point it out.
>>>>
>>>> o Yes I agree using swap should be avoided if at all possible
>>>>
>>>> o We are currently looking into adding more RAM to our testenv hosts,
>>>> at which point we can afford to be a little more liberal with memory
>>>> and this problem should become less of an issue. Having said that:
>>>>
>>>> o Even though using swap is bad, if we have some processes with a
>>>> large memory footprint that don't require constant access to a portion
>>>> of that footprint, swapping it out for the duration of the CI test
>>>> isn't as expensive as it sounds (assuming it doesn't need to be swapped
>>>> back in and the kernel has selected good candidates to swap out).
>>>>
>>>> o The hosts that run the undercloud and overcloud nodes have 64G of
>>>> RAM each. They each host 4 testenvs, and each testenv running an HA
>>>> job can use up to 21G of RAM, so we have overcommitted there. This is
>>>> only a problem if a testenv host gets 4 HA jobs that are started
>>>> around the same time (and as a result each has 4 overcloud nodes
>>>> running at the same time); to allow this to happen without VMs being
>>>> killed by the OOM killer we've also enabled swap there. The majority
>>>> of the time this swap isn't in use; it only matters if all 4 testenvs
>>>> are being used simultaneously and they are all running the second half
>>>> of a CI test at the same time (the numbers are worked through in the
>>>> first sketch after this list).
>>>>
>>>> o The overcloud nodes are VMs running with the "unsafe" disk caching
>>>> mechanism. This causes sync requests from the guest to be ignored, and
>>>> as a result, if the instances hosted on these nodes go into swap, that
>>>> swap will be cached on the host as long as RAM is available; i.e. swap
>>>> being used in the undercloud or overcloud isn't synced to the disk on
>>>> the host unless it has to be (the second sketch after this list shows
>>>> one way to check the cache mode).
>>>>
>>>> o What I'd like us to avoid is simply bumping up the memory every time
>>>> we hit an OOM error without at least
>>>> 1. Explaining why we need more memory all of a sudden
>>>> 2. Looking into ways we might avoid simply bumping the RAM
>>>> (at peak times we are memory constrained)
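>>>>
>>>> To make the overcommit point above concrete, here's the arithmetic
>>>> from the numbers already mentioned, as a tiny Python sketch (nothing
>>>> in it beyond those numbers; treat it as back-of-the-envelope only):
>>>>
>>>>     # 1 undercloud (5G) + 4 overcloud nodes (4G each) per HA testenv
>>>>     per_env = 5 + 4 * 4   # 21G per HA testenv
>>>>     peak = 4 * per_env    # 84G if all 4 envs peak together
>>>>     print(peak - 64)      # ~20G that has to come from swap on a 64G host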
>>>>
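>>>> And since the "unsafe" caching point may not be obvious, here is a
>>>> rough way one could confirm the cache mode on a testenv host; the
>>>> domain name "baremetal_0" is just an example, adjust it to whatever
>>>> the testenv VMs are actually called:
>>>>
>>>>     import subprocess
>>>>     import xml.etree.ElementTree as ET
>>>>
>>>>     cmd = ["virsh", "dumpxml", "baremetal_0"]
>>>>     root = ET.fromstring(subprocess.check_output(cmd))
>>>>     for disk in root.findall("./devices/disk"):
>>>>         driver = disk.find("driver")
>>>>         if driver is not None:
>>>>             # cache="unsafe" means guest flush/sync is ignored, so
>>>>             # guest swap writes can sit in the host page cache
>>>>             print(disk.get("device"), driver.get("cache"))
>>>>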
>>>> As an example, let's take a look at the swap usage on the undercloud
>>>> of a recent CI nonha job[1][2]. These instances have 5G of RAM with 2G
>>>> of swap enabled via a swapfile.
>>>> The overcloud deploy started at 22:07:46 and finished at 22:28:06.
>>>>
>>>> In the graph you'll see a spike in memory being swapped out around
>>>> 22:09; this corresponds almost exactly to when the overcloud image is
>>>> being downloaded from swift[3]. Looking at the top output at the end
>>>> of the test, you'll see that swift-proxy is using over 500M of
>>>> memory[4].
>>>>
>>>> I'd much prefer we spend time looking into why the swift proxy is
>>>> using this much memory rather than blindly bumping the memory
>>>> allocated to the VM; perhaps we have something configured incorrectly,
>>>> or we've hit a bug in swift.
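>>>>
>>>> For anyone who wants to dig into this on their own undercloud, the
>>>> swap graph boils down to something along these lines (only standard
>>>> /proc files; the 15-second interval is arbitrary):
>>>>
>>>>     import os
>>>>     import time
>>>>
>>>>     def swap_used_kb():
>>>>         info = {}
>>>>         with open("/proc/meminfo") as f:
>>>>             for line in f:
>>>>                 key, value = line.split(":", 1)
>>>>                 info[key] = int(value.split()[0])
>>>>         return info["SwapTotal"] - info["SwapFree"]
>>>>
>>>>     def top_rss(n=5):
>>>>         # biggest resident-memory consumers (e.g. swift-proxy)
>>>>         procs = []
>>>>         for pid in filter(str.isdigit, os.listdir("/proc")):
>>>>             try:
>>>>                 with open("/proc/%s/status" % pid) as f:
>>>>                     fields = dict(l.split(":", 1) for l in f if ":" in l)
>>>>                 rss = int(fields.get("VmRSS", "0 kB").split()[0])
>>>>                 procs.append((rss, fields["Name"].strip()))
>>>>             except (IOError, KeyError, ValueError):
>>>>                 continue
>>>>         return sorted(procs, reverse=True)[:n]
>>>>
>>>>     while True:
>>>>         print(time.strftime("%H:%M:%S"), swap_used_kb(), top_rss())
>>>>         time.sleep(15)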
>>>>
>>>> Having said all that, we can bump the memory allocated to each node,
>>>> but we have to accept one of two possible consequences:
>>>> 1. We'll end up using the swap on the testenv hosts more than we
>>>> currently are, or
>>>> 2. We'll have to reduce the number of testenvs per host from 4 down
>>>> to 3, wiping out 25% of our capacity.
>>>
>>> Thinking about this a little more, we could do a radical experiment
>>> for a week and just do this, i.e. bump up the RAM on each env and
>>> accept that we lose 25% of our capacity. Maybe it doesn't matter: if
>>> our success rate goes up then we'd be running fewer rechecks anyway.
>>> The downside is that we'd probably hit fewer timing errors (assuming
>>> the tight resources are what's exposing them); I say downside because
>>> this just means downstream users might hit them more often if CI
>>> isn't. Anyway, maybe worth discussing at tomorrow's meeting.
>>
>> +1 to reducing the number of testenvs and allocating more memory to
>> each. The huge number of rechecks we're having to do is definitely
>> contributing to our CI load in a big way, so if we could cut those down
>> by 50% I bet it would offset the lost testenvs. And it would reduce
>> developer aggravation by about a million percent. :-)
>>
>> Also, on some level I'm not too concerned about the absolute minimum
>> memory use case. Nobody deploying OpenStack in the real world is doing
>> so on 4 GB nodes. I doubt even 1% of them are doing so on less than
>> 32 GB nodes. Until we have composable services, I don't know that we
>> can support the 4 GB use case anymore. We've just added too many
>> services to the overcloud.
>
> We discussed this at today's meeting but never really came to a
> conclusion, except to say that most people wanted to try it. The main
> objection brought up was that we shouldn't go dropping the nonha job;
> that isn't what I was proposing, so let me rephrase here and see if we
> can gather +/-1's.
>
> I'm proposing we redeploy our testenvs with more RAM allocated per
> env; specifically, we would go from
> 5G undercloud and 4G overcloud nodes to
> 6G undercloud and 5G overcloud nodes.
>
> In addition, to accommodate this we would reduce the number of envs
> available from 48 (the actual number varies from time to time) to 36
> (3 envs per host).
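>
> As a rough sanity check on the per-host footprint (using only the
> numbers above: 64G hosts, and an HA job being 1 undercloud plus 4
> overcloud nodes), treat this as back-of-the-envelope arithmetic, not a
> sizing guarantee:
>
>     current  = 4 * (5 + 4 * 4)   # 4 envs/host at 5G/4G -> 84G peak
>     proposed = 3 * (6 + 4 * 5)   # 3 envs/host at 6G/5G -> 78G peak
>     print(current - 64, proposed - 64)  # both rely on swap at peak,
>                                         # but the proposal less so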
>
> No changes would be happening on the jobs we actually run
>
> The assumption is that with the increased resources we would hit fewer
> false-negative test results and as a result recheck jobs less (so the
> 25% reduction in capacity wouldn't hit us as hard as it might seem).
> We also may not be able to easily undo this if it doesn't work out, as
> once we start merging things that use the extra RAM it will be hard to
> go back.
I think the problem is we already merged things that use the extra RAM,
but the RAM isn't actually there. :-)
So +1 from me.