[openstack-dev] [TripleO][CI] Bridging the production/CI workflow gap with large periodic CI jobs

Ben Nemec openstack at nemebean.com
Thu Apr 20 21:17:19 UTC 2017



On 04/19/2017 12:17 PM, Justin Kilpatrick wrote:
> More nodes is always better, but I don't think we need to push the host
> cloud to its absolute limits right away. I have a list of several
> pain points I expect to find with just 30-ish nodes that should keep us
> busy for a while.
>
> I think the optimizations are a good idea though, especially if we
> want to pave the way for the next level of this sort of effort: devs
> being able to ask for a 'scale ci' run on gerrit and schedule a
> decent-sized job for whenever it's convenient. The closer we can get
> devs to large environments on demand, the faster and easier these
> issues can be solved.
>
> But for now baby steps.

https://review.openstack.org/458651 should enable the heterogeneous
environments I mentioned.  We need to be a little careful with a patch
like that since it isn't actually tested in the gate, but something
along those lines should work.  It may also need a larger controller
flavor so the controller(s) don't OOM with that many computes, but if
we've got a custom env for this use case anyway, that should be doable too.
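
Just to illustrate the idea (a rough openstacksdk sketch, not what
458651 actually does, and the cloud/flavor/image/network names below
are invented), a heterogeneous environment basically means booting the
"baremetal" VMs with different flavors, i.e. one big controller node
plus a pile of small compute nodes:

    import openstack

    # Credentials come from clouds.yaml; 'rh1' is just a placeholder name.
    conn = openstack.connect(cloud='rh1')

    image = conn.compute.find_image('ipxe-boot')          # OVB-style boot image
    ctl_flavor = conn.compute.find_flavor('bmc-ctl-8gb')  # hypothetical flavor names
    cmp_flavor = conn.compute.find_flavor('bmc-cmp-3gb')
    provision = conn.network.find_network('provision')    # hypothetical network name

    # One beefy controller VM so the controller doesn't OOM while it
    # deploys and manages a large number of computes...
    conn.compute.create_server(
        name='baremetal-controller-0',
        image_id=image.id, flavor_id=ctl_flavor.id,
        networks=[{'uuid': provision.id}])

    # ...and ~30 small compute VMs, which will mostly sit idle while the
    # controller is being deployed.
    for i in range(30):
        conn.compute.create_server(
            name='baremetal-compute-%d' % i,
            image_id=image.id, flavor_id=cmp_flavor.id,
            networks=[{'uuid': provision.id}])

The real testenvs obviously come from the OVB heat templates rather than
direct server calls like this, but resource-wise that's the shape of it.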

>
> On Wed, Apr 19, 2017 at 12:30 PM, Ben Nemec <openstack at nemebean.com> wrote:
>> TLDR: We have the capacity to do this.  One scale job can be absorbed into
>> our existing test infrastructure with minimal impact.
>>
>>
>> On 04/19/2017 07:50 AM, Flavio Percoco wrote:
>>>
>>> On 18/04/17 14:28 -0400, Emilien Macchi wrote:
>>>>
>>>> On Mon, Apr 17, 2017 at 3:52 PM, Justin Kilpatrick
>>>> <jkilpatr at redhat.com> wrote:
>>>>>
>>>>> Because CI jobs tend to max out at about 5 nodes, there's a whole class
>>>>> of minor bugs that make it into releases.
>>>>>
>>>>> What happens is that they never show up in small clouds; then, when
>>>>> they do show up in larger testing clouds, the people deploying those
>>>>> simply work around the issue and get on with what they were supposed to
>>>>> be testing. These workarounds do get documented/BZ'd, but since they
>>>>> don't block anyone and only show up in large environments they become
>>>>> hard for developers to fix.
>>>>>
>>>>> So the issue gets stuck in limbo, with nowhere to test a patchset and
>>>>> no one owning the issue.
>>>>>
>>>>> These issues pile up, and pretty soon there is a significant difference
>>>>> between the default documented workflow and the 'scale' workflow, which
>>>>> is filled with workarounds that may or may not be documented
>>>>> upstream.
>>>>>
>>>>> I'd like to propose getting these issues more visibility by having a
>>>>> periodic upstream job that uses 20-30 ovb instances to do a larger
>>>>> deployment. Maybe at 3am on a Sunday, or some other time when there's
>>>>> idle execution capacity to exploit. The goal is to make these
>>>>> sorts of issues more visible and hopefully get better at fixing them.
>>>>
>>>>
>>>> Wait, no, I know some folks who use TripleO CI at 3am on a Saturday
>>>> night (ok, that was a joke).
>>>
>>>
>>> Jokes aside, it really depends on the TZ and when you schedule it. 3:00
>>> UTC on a Sunday is 13:00 on Sunday in Sydney :) Saturdays might work
>>> better, but remember that some countries work on Sundays.
>>
>>
>> With the exception of the brief period where the ovb jobs were running at
>> full capacity 24 hours a day, there has always been a lull in activity
>> during early morning UTC.  Yes, there are people working during that time,
>> but generally far fewer and the load on TripleO CI is at its lowest point.
>> Honestly I'd be okay running this scale job every night, not just on the
>> weekend.  A week of changes is a lot to sift through if a scaling issue
>> creeps into one of the many, many projects that affect such things in
>> TripleO.
>>
>> Also, I should note that we're not currently being constrained by absolute
>> hardware limits in rh1.  The reason I haven't scaled our concurrent jobs
>> higher is that there is already performance degradation when we have a full
>> 70 jobs running at once.  This type of scale job would require a lot of
>> resources on paper, but those 30 compute nodes are mostly going to be
>> sitting there idle while the controller(s) get deployed, so in reality their
>> impact on the infrastructure is going to be less than if we just added more
>> concurrent jobs that used 30 additional nodes.  And we do have the
>> memory/CPU/disk to spare in rh1 to spin up more VMs.
>>
>> We could also take advantage of heterogeneous OVB environments now so that
>> the compute nodes are only 3 GB VMs instead of the 8 GB they are now. That
>> would further reduce the impact of this sort of job.  It would require some
>> tweaks to how the testenvs are created, but that shouldn't be a problem.
>>
>>>
>>>>> To be honest I'm not sure this is the best solution, but I'm seeing
>>>>> this anti-pattern across several issues and I think we should try to
>>>>> come up with a solution.
>>>>>
>>>>
>>>> Yes, this proposal is really cool. An alternative would be to run this
>>>> periodic scenario outside TripleO CI and maybe send the results via
>>>> email. But that is something we need to discuss with the RDO Cloud
>>>> people to see whether we would have the resources to run it weekly.
>>>>
>>>> Thanks for bringing this up; it's crucial for us to have this kind of
>>>> feedback. Now let's take action.
>>>
>>>
>>> +1
>>>
>>> Flavio
>>>
>>
>
>


