[openstack-dev] [tripleo] CI jobs failures

Dan Prince dprince at redhat.com
Wed Mar 9 13:26:47 UTC 2016


On Tue, 2016-03-08 at 17:58 +0000, Derek Higgins wrote:
> On 7 March 2016 at 18:22, Ben Nemec <openstack at nemebean.com> wrote:
> > 
> > On 03/07/2016 11:33 AM, Derek Higgins wrote:
> > > 
> > > On 7 March 2016 at 15:24, Derek Higgins <derekh at redhat.com>
> > > wrote:
> > > > 
> > > > On 6 March 2016 at 16:58, James Slagle <james.slagle at gmail.com>
> > > > wrote:
> > > > > 
> > > > > > On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi
> > > > > > <emilien at redhat.com> wrote:
> > > > > > 
> > > > > > I'm kind of hijacking Dan's e-mail, but I would like to
> > > > > > propose some technical improvements to stop having so many
> > > > > > CI failures.
> > > > > > 
> > > > > > 
> > > > > > 1/ Stop creating swap files. We don't have SSDs, and IMHO
> > > > > > swapping to files to make up for a lack of RAM is a terrible
> > > > > > mistake. In my experience, swapping on non-SSD disks is even
> > > > > > worse than not having enough RAM. We should stop doing that,
> > > > > > I think.
> > > > > We have been relying on swap in tripleo-ci for a little while.
> > > > > While not ideal, it has been an effective way to at least be
> > > > > able to test what we've been testing given the amount of
> > > > > physical RAM that is available.
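> > > > > 
> > > > > (For anyone not familiar with the setup: the swap in question
> > > > > is just a plain file-backed swap device created on the nodes,
> > > > > roughly along these lines; the exact commands and sizes used
> > > > > in tripleo-ci may differ.)
> > > > > 
> > > > >   sudo dd if=/dev/zero of=/swapfile bs=1M count=2048  # 2G file
> > > > >   sudo chmod 600 /swapfile
> > > > >   sudo mkswap /swapfile
> > > > >   sudo swapon /swapfile
> > > > > 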
> > > > Ok, so I have a few points here; in places where I'm making
> > > > assumptions I'll try to point that out.
> > > > 
> > > > o Yes I agree using swap should be avoided if at all possible
> > > > 
> > > > o We are currently looking into adding more RAM to our testenv
> > > > hosts, at which point we can afford to be a little more liberal
> > > > with memory and this problem should become less of an issue.
> > > > Having said that:
> > > > 
> > > > o Even though using swap is bad, if we have some processes with
> > > > a large memory footprint that don't require constant access to
> > > > part of that footprint, swapping it out for the duration of the
> > > > CI test isn't as expensive as it might sound (assuming it
> > > > doesn't need to be swapped back in and the kernel has selected
> > > > good candidates to swap out).
> > > > 
> > > > o The testenv hosts that run the undercloud and overcloud nodes
> > > > have 64G of RAM each and each host 4 testenvs; a testenv running
> > > > a HA job can use up to 21G of RAM, so we have overcommitted
> > > > there (numbers below). This is only a problem if a testenv host
> > > > gets 4 HA jobs that are started around the same time (and as a
> > > > result each has 4 overcloud nodes running at the same time); to
> > > > allow this to happen without VMs being killed by the OOM killer
> > > > we've also enabled swap there. The majority of the time this
> > > > swap isn't in use; it only gets used if all 4 testenvs are in
> > > > use simultaneously and they are all running the second half of a
> > > > CI test at the same time.
> > > > 
> > > > o The overcloud nodes are VMs running with an "unsafe" disk
> > > > caching mechanism (more on this below); this causes sync
> > > > requests from the guest to be ignored, and as a result, if the
> > > > instances being hosted on these nodes are going into swap, that
> > > > swap will be cached on the host as long as RAM is available.
> > > > I.e. swap being used in the undercloud or overcloud isn't being
> > > > synced to the disk on the host unless it has to be.
> > > > 
> > > > o What I'd like us to avoid is simply bumping up the memory
> > > > every time we hit an OOM error without at least
> > > >   1. Explaining why we need more memory all of a sudden
> > > >   2. Looking into a way we may be able to avoid simply bumping
> > > >      the RAM (at peak times we are memory constrained)
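> > > > 
> > > > To put numbers on the overcommit point above (these are just the
> > > > figures already quoted, worst case with all 4 envs running HA
> > > > jobs at once):
> > > > 
> > > >   4 envs x 21G peak      = 84G
> > > >   physical RAM per host  = 64G
> > > >   worst-case shortfall   = 20G, which is what the swap on the
> > > >                            testenv host has to absorb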
> > > > 
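> > > > On the "unsafe" caching point, that is the qemu/libvirt disk
> > > > cache mode; it can be confirmed on a testenv host with something
> > > > like the following (illustrative only; domain names and disk
> > > > details will differ):
> > > > 
> > > >   # look for cache='unsafe' in the disk <driver> element
> > > >   virsh dumpxml <overcloud-node-domain> | grep -i cache
> > > > 
> > > > With cache=unsafe, qemu treats guest flush/sync requests as
> > > > no-ops and leaves dirty data in the host page cache, which is
> > > > what lets guest swap activity stay in host RAM.
> > > > 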
> > > > As an example, let's take a look at the swap usage on the
> > > > undercloud of a recent CI nonha job[1][2]. These instances have
> > > > 5G of RAM with 2G of swap enabled via a swapfile. The overcloud
> > > > deploy started at 22:07:46 and finished at 22:28:06.
> > > > 
> > > > In the graph you'll see a spike in memory being swapped out
> > > > around 22:09; this corresponds almost exactly to when the
> > > > overcloud image is being downloaded from swift[3]. Looking at
> > > > the top output at the end of the test you'll see that
> > > > swift-proxy is using over 500M of memory[4].
> > > > 
> > > > I'd much prefer we spend time looking into why the swift proxy
> > > > is using this much memory rather than blindly bumping the memory
> > > > allocated to the VM; perhaps we have something configured
> > > > incorrectly or we've hit a bug in swift.
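> > > > 
> > > > The sort of thing I have in mind is just basic observation on
> > > > the undercloud while a job runs, along these lines (illustrative
> > > > only; the swift proxy process name may differ slightly):
> > > > 
> > > >   # biggest memory consumers by resident set size
> > > >   ps aux --sort=-rss | head -n 15
> > > >   # ongoing swap-in/swap-out activity (the si/so columns)
> > > >   vmstat 5
> > > >   # how much of the swift proxy is currently swapped out
> > > >   for pid in $(pgrep -f swift-proxy); do
> > > >       grep VmSwap /proc/$pid/status
> > > >   done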
> > > > 
> > > > Having said all that, we can bump the memory allocated to each
> > > > node, but we have to accept one of two possible consequences:
> > > > 1. We'll end up using the swap on the testenv hosts more than we
> > > >    currently are, or
> > > > 2. We'll have to reduce the number of test envs per host from 4
> > > >    down to 3, wiping out 25% of our capacity.
> > > Thinking about this a little more, we could do a radical
> > > experiment for a week and just do this, i.e. bump up the RAM on
> > > each env and accept that we lose 25% of our capacity. Maybe it
> > > doesn't matter; if our success rate goes up then we'd be running
> > > fewer rechecks anyway. The downside is that we'd probably hit
> > > fewer timing errors (assuming the tight resources are what's
> > > exposing them); I say downside because it just means downstream
> > > users might hit them more often if CI isn't. Anyway, maybe worth
> > > discussing at tomorrow's meeting.
> > +1 to reducing the number of testenvs and allocating more memory to
> > each.  The huge number of rechecks we're having to do is definitely
> > contributing to our CI load in a big way, so if we could cut those
> > down by 50% I bet it would offset the lost testenvs.  And it would
> > reduce developer aggravation by about a million percent. :-)
> > 
> > Also, on some level I'm not too concerned about the absolute
> > minimum memory use case.  Nobody deploying OpenStack in the real
> > world is doing so on 4 GB nodes.  I doubt 99% of them are doing so
> > on less than 32 GB nodes.  Until we have composable services, I
> > don't know that we can support the 4 GB use case anymore.  We've
> > just added too many services to the overcloud.
> We discussed this at today's meeting but never really came to a
> conclusion, except to say most people wanted to try it. The main
> objection brought up was that we shouldn't go dropping the nonha job;
> that isn't what I was proposing, so let me rephrase here and see if
> we can gather +/-1's.
> 
> I'm proposing we redeploy our testenvs with more RAM allocated per
> env; specifically we would go from
> 5G undercloud and 4G overcloud nodes to
> 6G undercloud and 5G overcloud nodes
> 
> In addition, to accommodate this we would reduce the number of envs
> available from 48 (the actual number varies from time to time) to 36
> (3 envs per host).
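> 
> To give a rough idea of what that means per host (assuming an HA env
> is still 1 undercloud plus 4 overcloud nodes, as in the 21G figure
> earlier in the thread):
> 
>   current:  5G + 4 x 4G = 21G per env, 4 envs = 84G worst case
>   proposed: 6G + 4 x 5G = 26G per env, 3 envs = 78G worst case
> 
> so against 64G of physical RAM the worst-case overcommit would
> actually go down slightly.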

I support making these changes to obtain CI stability. So +1 for what
derekh has suggested doing above.

Dan

> 
> No changes would be happening to the jobs we actually run.
> 
> The assumption is that with the increased resources we would hit
> fewer false negative test results and as a result recheck jobs less
> (so the 25% reduction in capacity wouldn't hit us as hard as it might
> seem). We also may not be able to easily undo this if it doesn't work
> out, as once we start merging things that use the extra RAM it will
> be hard to go back.
> 
> > 
> > 
> > That said though, keeping service memory usage under control is
> > still valuable, and we should figure out why Swift is using so much
> > memory when it's not under much load at all.  That's actually on
> > the undercloud, so it's sort of tangential to this discussion.
> <snip/>
> 