Open Stack

Fri Mar 17 13:39:17 UTC 2017

On Fri, Mar 17, 2017 at 1:03 PM Sean Dague <sean at dague.net> wrote:

> On 03/17/2017 08:27 AM, Jordan Pittier wrote:
> > The patch that reduced the number of Tempest Scenarios we run in every
> > job and also reduce the test run concurrency [0] was merged 13 days ago.
> > Since, the situation (i.e the high number of false negative job results)
> > has not improved significantly. We need to keep looking collectively at
> > this.
>
> While the situation hasn't completely cleared out -
> http://tinyurl.com/mdmdxlk - since we've merged this we've not seen that
> job go over 25% failure rate in the gate, which it was regularly
> crossing in the prior 2 week period. That does feel like progress.

I agree the situation improved a bit, but there are still too many failures.
There is a peak of failures on Mar 12th in the graph. I remember looking at
it briefly, as it was on a Sunday, and by Monday it was back to normal. It's
not clear to me yet what caused or fixed that peak. The mysql revert was
merged on March 15th, which is too late to explain the change.
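
As a rough illustration of how a peak like this could be spotted against the
25% failure rate mentioned above, here is a minimal Python sketch. The
(day, passed) input shape, the function names and the sample data are all
assumptions made up for this example; they are not the format of the graph
linked above or of logstash.

```python
from collections import defaultdict


def daily_failure_rates(results):
    """Map each day label to the fraction of job runs that failed that day.

    `results` is an iterable of (day, passed) pairs. This shape is an
    assumption for illustration only.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for day, passed in results:
        totals[day] += 1
        if not passed:
            failures[day] += 1
    return {day: failures[day] / totals[day] for day in totals}


def days_over(rates, threshold=0.25):
    """Return, in sorted order, the days whose failure rate exceeds threshold."""
    return sorted(day for day, rate in rates.items() if rate > threshold)


# Made-up sample runs, only to exercise the functions above.
sample = [
    ("day-1", True), ("day-1", True), ("day-1", True), ("day-1", False),
    ("day-2", False), ("day-2", False), ("day-2", True), ("day-2", True),
]
rates = daily_failure_rates(sample)
flagged = days_over(rates)
print(flagged)
```

With this sample, day-1 sits exactly at the threshold and is not flagged,
while day-2 is.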

> In spot checking, we are also rarely failing in scenario tests now, but
> the fails tend to end up inside heavy API tests running in parallel.
>
>
An ssh failure in volume scenario tests is still on top of the recheck
queue, but looking at logstash I see it's mostly happening in
gate-tempest-dsvm-networking-odl-* jobs. The integration jobs seem to
be behaving for scenario tests.

> > There seems to be an agreement that we are hitting some memory limit.
> > Several of our most frequent failures are memory related [1]. So we
> > should either reduce our memory usage or ask for bigger VMs, with more
> > than 8GB of RAM.
> >
> > There have been several attempts to reduce our memory usage: reducing
> > the MySQL memory consumption ([2] but quickly reverted [3]), reducing
> > the number of Apache workers ([4], [5]), more apache2 tuning [6]. If you
> > have any crazy idea to help in this regard, please help. This is high
> > priority for the whole openstack project, because it's plaguing many
> > projects.
>

I think it's very important to work on both sides: make sure our testing does
not kill the SUT, but also keep the footprint of the SUT under control.
This may be a good topic of discussion for the forum in Boston.

On the testing side, I started working on using two jobs instead of one:
- one running all API tests, at a degree of parallelism that does not break
  the SUT
- one running scenario tests, perhaps on a two-node test environment

That would give us more space in terms of testing, but it would also mean more
test nodes and more test jobs to track.
The scenario test job is defined; one d-g patch is still missing to complete it:
https://review.openstack.org/#/c/442565/
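
To make the two-job idea concrete, here is a minimal Python sketch of how a
list of test IDs could be partitioned by prefix into an API set and a
scenario set. It assumes test IDs start with "tempest.api." or
"tempest.scenario."; the function name and the sample IDs are illustrative
only, and this is not how the actual jobs are defined.

```python
API_PREFIX = "tempest.api."
SCENARIO_PREFIX = "tempest.scenario."


def split_tests(test_ids):
    """Partition test IDs into (api, scenario, other) lists by prefix.

    Order within each list follows the input order. IDs matching neither
    prefix land in `other` so nothing is silently dropped.
    """
    api, scenario, other = [], [], []
    for test_id in test_ids:
        if test_id.startswith(API_PREFIX):
            api.append(test_id)
        elif test_id.startswith(SCENARIO_PREFIX):
            scenario.append(test_id)
        else:
            other.append(test_id)
    return api, scenario, other


# Illustrative IDs only; not taken from a real test listing.
sample_ids = [
    "tempest.api.compute.test_example.ExampleTest.test_list",
    "tempest.scenario.test_example.ExampleScenario.test_boot",
    "tempest.api.volume.test_example.ExampleTest.test_create",
    "some_plugin.tests.test_example.PluginTest.test_it",
]
api_tests, scenario_tests, other_tests = split_tests(sample_ids)
print(len(api_tests), len(scenario_tests), len(other_tests))
```

Each resulting list could then be fed to its own job, with the API job run at
whatever concurrency the SUT tolerates.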

> Interesting, I hadn't seen the revert. It is also curious that it was
> largely limited to the neutron-api test job. It's also notable that the
> sort buffers seem to have been set to the minimum allowed limit of mysql -
> https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_sort_buffer_size
> - which is over an order of magnitude decrease from the existing default.
>
> I wonder about redoing the change with everything except it and seeing
> how that impacts the neutron-api job.
>
>         -Sean
>
> --
> Sean Dague
> http://dague.net
>
>


[openstack-dev] [QA][gate][all] dsvm gate stability and scenario tests
