[openstack-dev] [tripleo][infra] RH2 is up and running

Derek Higgins derekh at redhat.com
Fri Jul 1 11:58:02 UTC 2016


Hi All,
    Yesterday the final patch merged to run CI jobs on RH2, and last
night we merged the patch to tripleo-ci to support RH2 jobs. So we now
have a new job (gate-tripleo-ci-centos-7-ovb-ha) running on all
tripleo patch reviews. This job runs pacemaker HA with a 3-node
controller cluster and a single compute node. It's basically the same
as our current HA job, but without net-iso.

Looking at pass rates this morning
1. The jobs are failing on stable branches[1]
  o I've submitted a patch to the mitaka and liberty branches to fix
this (see the bug)
2. The pass rate does seem to be a little lower than the RH1 HA job's
  o I'll look into this today, but overall the pass rate should be
good enough for when RH1 is taken offline

The main difference between jobs running on RH2 and those on RH1 is
that the CI slave IS the undercloud (we've eliminated the need for an
extra undercloud node), which saves resources. We also no longer build
an instack qcow2 image, which saves us a little time.

To make this work, early in the CI process we make a call out to a
geard broker and pass it the instance ID of the undercloud. This
broker creates a heat stack (using OVB heat templates) with a number
of nodes on a provisioning network, then attaches an interface on this
provisioning network to the undercloud[2]. Ironic can then talk (over
IPMI) to a BMC node to power the nodes on and PXE boot them. At the
end of the job the stack is deleted.
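
For illustration, the submission side of that call looks roughly like
the sketch below, using the Python gear library. The broker hostname,
gearman function name, and payload format here are assumptions for the
sake of the example, not the exact values tripleo-ci uses:

    import json
    import time

    import gear  # OpenStack Infra's pure-Python gearman library

    # Hand the undercloud's instance ID to the te-broker; the broker
    # then creates the OVB heat stack and attaches the provisioning
    # network to this instance. Host and function name below are
    # placeholders.
    client = gear.Client()
    client.addServer('te-broker.example.com')
    client.waitForServer()

    payload = json.dumps({'undercloud_instance_id': 'UNDERCLOUD_UUID'})
    job = gear.Job(b'create_env', payload.encode('utf-8'))
    client.submitJob(job)

    # Block until the broker reports the environment is ready (or not).
    while not job.complete:
        time.sleep(5)
    if job.failure:
        raise RuntimeError('te-broker failed to create the testenv')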

What's next?
o Next Tuesday evening, RH1 will be taken offline, so I'll be
submitting a patch to remove all of the RH1 jobs; until we bring it
back up we will only have a single tripleo-ci job
o The RH1 rack will be available to us again on Thursday; we then have
a choice
 1. Bring RH1 back up as-is and return everything to the status quo
 2. Redeploy RH1 with OVB and move away from the legacy system permanently
 If the OVB-based jobs prove to be reliable, I think option 2 is worth
considering; it wasn't the original plan, but it would allow us to
move away from a legacy system that is getting harder to support as
time goes on.
o RH2 was loaned to us to allow this to happen, so once we pick one of
the options above and complete the deployment of RH1 we'll have to
give it back

The OVB-based cloud opens up a couple of interesting options that we
can explore if we stick with OVB
1. Periodic scale test
  o With OVB it's possible to select the number of nodes we place on
the provisioning network. For example, while testing RH2 I was able to
deploy an overcloud with 80 compute nodes (we could do up to 120 on
RH2, and even higher on RH1). Doing this nightly when CI load is low
would be an extremely valuable test to run and gather data from (see
the sketch after this list).
2. Dev quota to reproduce CI
  o On OVB it's now a lot easier to give somebody some quota to
reproduce exactly what CI is running in order to debug problems. This
was possible on RH1, but it required a cloud admin to manually take
testenvs away from CI (it was manual and messy, so we didn't do it
much)
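
As a rough sketch of what kicking off that scale test could look like
with python-heatclient; the template path, the 'node_count' parameter
name, and the use of OS_* environment variables are assumptions based
on the OVB templates, not verified values:

    import os

    from heatclient import client as heat_client
    from heatclient.common import template_utils
    from keystoneauth1 import loading, session

    # Authenticate using the usual OS_* environment variables.
    loader = loading.get_plugin_loader('password')
    auth = loader.load_from_options(
        auth_url=os.environ['OS_AUTH_URL'],
        username=os.environ['OS_USERNAME'],
        password=os.environ['OS_PASSWORD'],
        project_name=os.environ['OS_TENANT_NAME'])
    heat = heat_client.Client('1', session=session.Session(auth=auth))

    # Load the OVB template and create an environment with 80
    # "baremetal" nodes on the provisioning network; a nightly job
    # would then deploy an overcloud onto these and gather timings.
    files, template = template_utils.get_template_contents(
        'openstack-virtual-baremetal/templates/quintupleo.yaml')
    heat.stacks.create(
        stack_name='periodic-scale-test',
        template=template,
        files=files,
        parameters={'node_count': 80})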

The move doesn't come without its costs:

1. tripleo-quickstart
  o Part of the tripleo-quickstart install is to first download a
prebuilt undercloud image that we were building in our periodic job.
Because the undercloud is now the CI slave, we no longer build an
instack.qcow2 image. For the near future we can host the most recent
one on RH2 (the IP will change, so this needs to change in
tripleo-quickstart; better still, a DNS entry could be used so the
switchover would be smoother in future), but if we make the move to
jobs of this type permanent we'll no longer be generating this image
for quickstart, so we'll have to see if we can come up with an
alternative. We could generate one in the periodic job, but I'm not
sure how we could test it easily.

2. Moving the current-tripleo pin
  o I haven't yet put in place anything needed for our periodic job to
move the current-tripleo pin, so until we get this done (and decide
what to do about 1. above) we're stuck on whatever pin we happen to be
on on Tuesday when RH1 is taken offline. The pin moved last night to a
repository from 2016-06-29, so we are at least reasonably up to date.
If it looks like the RH1 deployment is going to take an excessive
amount of time, we'll need to make this a priority.

3. The ability to telnet to CI slaves to get the console for running
CI jobs doesn't work on RH2 jobs. This is because it uses the same
port number (8088) that we use in TripleO for Ironic to serve its
iPXE images over HTTP, so I've had to kill the console-serving process
until we solve this. If we want to fix this, we'll have to explore
changing the port number in either tripleo or infra.
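
To make the clash concrete: because the CI slave is now also the
undercloud, both services want to bind TCP port 8088 on the same host,
and whichever starts second fails. A minimal sketch of how that
surfaces:

    import socket

    # Whichever of the two services (infra's console log streamer or
    # ironic's iPXE HTTP server) starts second will fail to bind 8088.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(('0.0.0.0', 8088))
        s.listen(1)
        print('port 8088 is free')
    except OSError as exc:
        print('port 8088 already in use:', exc)
    finally:
        s.close()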

I was putting together a screencast of how RH2 was deployed (with RDO
Mitaka), but after several hours of editing the screencasts into
something usable, the software I was using (OpenShot) refused to
render what I had put together; in fact, it crashed a lot. So if
anybody has any good suggestions for software I could use, I'll try
again.

If I've missed anything please feel free to ask,

thanks,
Derek.

[1] - https://bugs.launchpad.net/tripleo/+bug/1598089
[2] - http://git.openstack.org/cgit/openstack-infra/tripleo-ci/tree/scripts/te-broker/create-env


