[openstack-dev] [TripleO] Moving tripleo-ci towards the gate

Joe Gordon joe.gordon0 at gmail.com
Mon Mar 24 22:58:39 UTC 2014


On Fri, Mar 21, 2014 at 6:29 AM, Derek Higgins <derekh at redhat.com> wrote:

> Hi All,
>    I'm trying to get a handle on what needs to happen before getting
> tripleo-ci (toci) into the gate. I realize this may take some time, but
> I'm trying to map out how to reach the end goal of putting multi-node
> tripleo-based deployments in the gate, which should cover a lot of use
> cases that devstack-gate doesn't. Here are some of the stages I think
> we need to get through before being in the gate, along with some
> questions that people may be able to fill in the blanks on.
>
> Stage 1: check - tripleo projects
>    This is what we currently have running: 5 separate jobs running
> non-voting checks against tripleo projects.
>
> Stage 2 (a). reliability
>    Obviously the reliability of both the results and the CI system
> itself is a must, and we should always aim towards 0% false test
> results. But is there, for example, a rate of false negatives that
> infra would consider acceptable? What are the numbers on the gate at
> the moment? Should we aim to match those at the very least (maybe we
> already have)? And for how long do we need to maintain those levels
> before considering the system proven?
>

I cannot come up with a specific number for this; perhaps someone else can.
I see result reliability and CI system reliability as two very different
things. The CI system should ideally never go down for very long (although
this is less critical while tripleo is non-voting check only, like all
other 3rd party systems). As for false negatives in the results, they
should be on par with devstack-gate jobs, especially once you start running
tempest.
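
To make the false-negative question concrete, here is a rough
back-of-the-envelope in Python (the per-job rate below is invented,
purely for illustration):

    # With N independent jobs per change, small per-job false-negative
    # rates compound quickly. The 2% figure is hypothetical.
    per_job_false_negative = 0.02  # spurious failure rate per job (made up)
    jobs_per_change = 10           # 5 jobs today, 10 once Fedora is added

    p_spurious = 1 - (1 - per_job_false_negative) ** jobs_per_change
    print("P(at least one false failure per change): %.1f%%"
          % (100 * p_spurious))
    # => ~18.3%, which is why per-job reliability has to be very high
    # before gating makes sense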


>
> Stage 2 (b). speedup
>    How long can the longest jobs take? We have plans in place to speed
> up our current jobs, but what should the target be?
>
>
Gate jobs currently take up to a little over an hour [0][1].

[0]
https://jenkins01.openstack.org/job/check-tempest-dsvm-postgres-full/buildTimeTrend
[1]
https://jenkins02.openstack.org/job/check-tempest-dsvm-postgres-full/buildTimeTrend
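
For reference, those durations can be pulled straight from the Jenkins
JSON API. A minimal sketch (assuming the job URL from [0] and that
anonymous read access is allowed):

    import requests

    # Fetch recent build durations via the Jenkins JSON API; the 'tree'
    # filter keeps the response small. Durations are in milliseconds.
    url = ("https://jenkins01.openstack.org/job/"
           "check-tempest-dsvm-postgres-full/api/json")
    resp = requests.get(url,
                        params={"tree": "builds[number,duration,result]"})
    builds = resp.json()["builds"]

    minutes = [b["duration"] / 60000.0 for b in builds if b["duration"]]
    print("last %d builds: avg %.1f min, max %.1f min"
          % (len(minutes), sum(minutes) / len(minutes), max(minutes)))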



> 3. More Capacity
>

If you wanted to run tripleo-check everywhere a 'check-tempest-dsvm-full'
job is run, that is over 600 jobs in a 24 hour period [3].

[3]
http://graphite.openstack.org/render/?from=00%3A00_20140203&fgcolor=000000&title=Check%20Hit%20Count&_t=0.2247244759928435&height=308&bgcolor=ffffff&width=586&hideGrid=false&until=23%3A59_20140324&showTarget=color(alias(hitcount(sum(stats.zuul.pipeline.gate.job.gate-tempest-dsvm-neutron-heat-slow.%7BSUCCESS%2CFAILURE%7D)%2C%275hours%27)%2C%20%27gate-tempest-dsvm-neutron-heat-slow%27)%2C%27green%27)&_salt=1395701365.817&lineMode=staircase&target=color(alias(hitcount(sum(stats.zuul.pipeline.check.job.check-tempest-dsvm-full.%7BSUCCESS%2CFAILURE%7D)%2C%2724hours%27)%2C%20%27check-tempest-dsvm-full%20hits%20over%2024%20hours%27)%2C%27orange%27)
    (graphite target:
    color(alias(hitcount(sum(stats.zuul.pipeline.check.job.check-tempest-dsvm-full.{SUCCESS,FAILURE}),'24hours'),
    'check-tempest-dsvm-full hits over 24 hours'),'orange'))
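
The same numbers can be pulled as data rather than a PNG by asking the
graphite render API for JSON. A quick sketch, reusing the 24-hour
hitcount target from [3]:

    import requests

    # Ask graphite for raw datapoints instead of a rendered image.
    target = ("hitcount(sum(stats.zuul.pipeline.check.job."
              "check-tempest-dsvm-full.{SUCCESS,FAILURE}),'24hours')")
    resp = requests.get("http://graphite.openstack.org/render/",
                        params={"target": target, "from": "-7days",
                                "format": "json"})
    # Each series is a list of [value, timestamp] pairs; values here
    # are job hits per 24-hour bucket.
    for value, ts in resp.json()[0]["datapoints"]:
        if value is not None:
            print("%s: %d" % (ts, value))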


>    I'm going to talk about RAM here as it's probably the resource where
> we will hit our infrastructure limits first.
>    Each time a suite of toci jobs is kicked off we currently kick off 5
> jobs (which will double once Fedora is added [1]).
>    In total these jobs spawn 15 VMs consuming 80G of RAM (it's actually
> 120G to work around a bug we should soon have fixed [2]); we also have
> plans that will reduce this 80G further, but let's stick with it for
> the moment.
>    Some of these jobs complete after about 30 minutes, but let's say
> our target is an overall average of 45 minutes.
>
>    With Fedora that means each run will tie up 160G for 45 minutes. Put
> another way, 160G can provide us with 32 runs (each including 10 jobs)
> per day.
>
>    So to kick off 500 (I made this number up) runs per day, we would need
>    (500 / 32.0) * 160G = 2500G of RAM
>
>    We then need to double this number to allow for redundancy, so that's
> 5000G of RAM (the arithmetic is worked through in the script below).
>
>    We probably have about 3/4 of this available to us at the moment, but
> it's not evenly balanced between the two clouds, so we're not covered
> from a redundancy point of view.
>
>    So we need more hardware (either by expanding the clouds we have or
> adding new clouds). I'd like us to start a separate effort to map out
> exactly what our medium term goals should be, including
>    o jobs we want to run
>    o how long we expect each of them to take
>    o how much ram each one would take
>    so that we can roughly put together an idea of what our HW
> requirements will be.
>
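
The arithmetic above checks out; here it is as a quick script (inputs
copied straight from the numbers you give), so the assumptions are easy
to tweak:

    # Capacity arithmetic from the figures above.
    jobs_per_run = 10            # 5 jobs today, doubled once Fedora lands
    ram_per_run_gb = 160.0       # 80G per 5-job set, x2 with Fedora
    avg_run_minutes = 45.0       # target overall average
    target_runs_per_day = 500    # admittedly a made-up demand figure

    # How many runs one 160G slice can complete in a day.
    runs_per_slice_per_day = (24 * 60) / avg_run_minutes  # = 32.0
    ram_gb = (target_runs_per_day / runs_per_slice_per_day) * ram_per_run_gb
    print("steady-state RAM: %dG" % ram_gb)               # 2500G
    print("with 2x redundancy: %dG" % (2 * ram_gb))       # 5000G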
> 4. check - all openstack projects
>    Once we're happy we have the required capacity, I think we can then
> move to check on all openstack projects.
>
> 5. voting check - all projects
>    Once everybody is happy with the reliability, I think we can move
> to a voting check.
>
> 6. gate on all openstack projects
>    And then finally, when everything else lines up, I think we can be
> added to the gate.
>
> A) Gating with Ironic
>   I bring this up because there was some confusion about ironic's status
> in the gate at a recent tripleo meeting [3]. When can tripleo's ironic
> jobs be part of the gate?
>
> Any thoughts? Am I way off with any of my assumptions? Is my maths correct?
>
> thanks,
> Derek.
>
> [1] https://review.openstack.org/#/q/status:open+topic:add-f20-jobs,n,z
> [2] https://bugs.launchpad.net/diskimage-builder/+bug/1289582
> [3]
>
> http://eavesdrop.openstack.org/meetings/tripleo/2014/tripleo.2014-03-11-19.01.log.html
>