Open Stack

Wed May 18 14:32:09 UTC 2016

On Tue, 2016-05-17 at 20:03 +0300, Sagi Shnaidman wrote:
> Hi,
> raising again the question about tempest running on TripleO CI as it
> was discussed in the last TripleO meeting.
> 
> I'd like to draw your attention to the fact that these tests, which I
> ran just to ensure it works, uncovered bugs, and these weren't corner
> cases but real failures of the TripleO installation. Like this one
> for Sahara: https://review.openstack.org/#/c/309042/
> I'm sorry, I should have prepared these bugs for the meeting as
> proof of the value of testing.

Would it be reasonable for us to add optional Sahara coverage to our
ping test by using the Heat OS::Sahara resources? I feel like this
would give us lightweight Sahara coverage which, if enabled, costs us
next to nothing in extra CI wall time.

> 
> The second issue, which was a blocker before, is wall time. As we can
> see from the job lengths, after the HW upgrade of CI it is not an
> issue anymore. We can run tempest without fear of hitting the timeout,
> certainly on the "nonha" job, as it is the shortest of all.

It isn't just the 3 hour timeout that matters, I think. This was an
upstream constraint that, for good reason, was an upper limit on our job
times. It is about how long it takes to land code upstream, and the
length and consistency of the CI time is a big part of that. Thanks to
some recent optimization on both the hardware and software side we are
now running jobs maybe 20-30 minutes faster. So our HA and upgrades
jobs take around 2 hours or so. This is certainly a welcome improvement
but we've still got much work to do. I'm not willing to hand 10-15
minutes of time back just so we can run Tempest across the board.
Rather, I would like to see us further optimize the job times to get
them faster still. I don't think the point of investing in a major
hardware upgrade was so we could make some time to run Tempest; it was
more to help us get stability and consistency in our upstream CI test
results.

> 
> So I'd insist on running tempest specifically on the promoting job in
> order not to promote images with bugs, especially critical ones such
> as a whole service not being available at all. The pingtest is not
> enough for this purpose, as we can see from the bugs above: it checks
> very basic things and not all services are covered. I think we aren't
> interested just in seeing the jobs green, but in sticking to basic
> working functionality and quality of what we promote. Maybe it's the
> influence of my previous QA roles, but I don't see any value in
> promoting something with bugs.

Promoting and being able to consume new packages from TripleO is a very
important part of our pipeline. I think the feedback the core team
initially gave with regards to Tempest was that running it in a
periodic job was a good place for it. This was a few weeks back. It
would allow us to keep things generally working with Tempest but not
block our pipelines should a one-off regression occur.

What changed in the meantime was that we started using the existing
periodic jobs for package promotion directly. So if we add Tempest
there now it is my understanding we depend on it to be working in order
to promote packages and be able to consume the latest sources. While
this may seem ideal, I don't think that is what we want or need, and it
would slow us down. When we need to promote packages we often need
something fixed quickly... and having to chase down Tempest failures at
that time isn't ideal. So ideally if Tempest is going to be running on
our periodic job that blocks package promotion, we would also want it
running on all of our CI jobs so that we aren't surprised by Tempest
failures at the last minute (before package promotion). But this brings
us back to the time it takes to run Tempest across the board.

So my suggestion would be that we have a separate, independent periodic
job (or even a downstream job) that monitors and runs the CI testing of
TripleO with tempest. This job will not block package promotion. We can
monitor the Tempest results and fix failures accordingly. And if there
are any important services that we can test quickly in the ping test,
we add them there directly to get up-to-the-minute coverage for
TripleO.

Call it a tiered testing approach.
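
As a rough illustration of the tiering described above (this is only a
sketch, not anything from the TripleO CI code; the job names and the
JobResult type are hypothetical), promotion would consult only the
blocking tier, while failures in the independent Tempest job are
reported separately for follow-up:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobResult:
    name: str       # hypothetical job name, for illustration only
    passed: bool
    blocking: bool  # True if this job gates package promotion


def can_promote(results: List[JobResult]) -> bool:
    """Promote only if every blocking job passed; non-blocking jobs are ignored."""
    return all(r.passed for r in results if r.blocking)


def needs_followup(results: List[JobResult]) -> List[str]:
    """Names of non-blocking jobs that failed and should be investigated."""
    return [r.name for r in results if not r.blocking and not r.passed]


results = [
    # Tier 1: fast ping test, gates promotion.
    JobResult("periodic-nonha-pingtest", passed=True, blocking=True),
    # Tier 2: independent Tempest run, monitored but not gating.
    JobResult("periodic-tempest-smoke", passed=False, blocking=False),
]

print("promote:", can_promote(results))
print("follow up on:", needs_followup(results))
```

With this split a one-off Tempest regression shows up in the follow-up
list without holding back promotion.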

> 
> The point about CI stability: the latest issues that CI faces are not
> really connected to tempest tests or CI code at all. They are bugs in
> underlying projects, and whether tempest runs or not doesn't really
> matter in this case. These issues fail everything before any testing
> even starts. Detecting such issues before they leak into TripleO is a
> different topic and approach.
> 
> So my main points for running tempest tests on "nonha" periodic jobs
> are:
> Quality and guaranteed basic functionality of installed overcloud
> services. At least all of them are up and can accept connections.
> Avoiding, and discovering early, critical bugs that are not seen in
> pingtest.
> A reminder that we are going to run only the smoke tests, which don't
> take much time and check basic functionality only.
> 
> P.S. If there is interest, we can run the whole tempest set or
> specific sets in experimental or third-party jobs just for
> indication. And I mean not only tempest tests, but project scenario
> tests as well, for example Heat integration tests. Both for
> undercloud and overcloud.
> 
> P.P.S. Just ping me if anything is unclear or you would like to
> discuss it in a separate meeting; I'll provide all the required info.
> 
> Thanks
> -- 
> Best regards
> Sagi Shnaidman

[openstack-dev] [TripleO][CI] Tempest on periodic jobs