Jeremy Stanley
Tue Dec 6 18:56:53 UTC 2016

I'm Cc'ing this to the openstack-infra ML but setting MFT to direct
subsequent discussion to the openstack-dev ML so we can hopefully
avoid further cross-posting as much as possible. If you're replying
on a particular session topic, please update the Subject so that the
subthreads are easier to keep straight.

Apologies for the _extreme_ delay in getting this composed and sent



This was primarily a brainstorming/roadmap session on possible
future plans for the firehose.openstack.org service. We discussed
the potential to have Zuul (post-v3) both consume and emit events
over MQTT, as well as adding StoryBoard and Nodepool event streams
(StoryBoard should probably support an analog of the events handled
by lpmqtt at a minimum, which is probably easy to add given it
already has a RabbitMQ one). The gerritbot consumer PoC was
mentioned, but we determined that it would be better to shelve that
until planning for an Errbot-based gerritbot reimplementation is
fleshed out.

We talked about how the current logstash MQTT stream implementation,
while interesting, has significant scaling (especially bandwidth)
issues with the volume of logging we do in tests while only offering
limited benefit. We could potentially make use of it in concert with
a separate logstash for production server and Ansible logs, but its
efficacy for our job logs was called into question.

We also spent much of the timeslot talking about possible
integration with FedMesg (particularly that they're considering
pluggable backend support which could include an MQTT
implementation), which yields much opportunity for collaboration
between our projects.

One other topic which came up was how to do a future HA
implementation, either by having publishers send to multiple brokers
and configuring consumers to have a primary/fallback behavior or by
trying to implement a more integrated clustering solution with load
balancing proxies. We concluded that current use cases don't demand
anywhere near 100% message delivery and 100% uptime, so we can dig
deeper when there's an actual use case.
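
To illustrate the first of those two HA options, here is a minimal
sketch of the idea only, not anything we actually implemented or
agreed on. It assumes the caller supplies its own connect and publish
callables (for example, thin wrappers around whatever MQTT client
library is in use); the names connect_with_fallback, publish_to_all
and BrokerUnavailable are hypothetical and exist purely for this
example.

```python
class BrokerUnavailable(Exception):
    """Hypothetical error a connect/publish callable raises on failure."""


def connect_with_fallback(brokers, connect):
    """Consumer side: try brokers in priority order (primary first).

    Returns a (broker, connection) tuple for the first broker that
    accepts the connection, or raises BrokerUnavailable if none do.
    """
    failed = []
    for broker in brokers:
        try:
            return broker, connect(broker)
        except BrokerUnavailable:
            failed.append(broker)
    raise BrokerUnavailable("no broker reachable, tried: %s" % failed)


def publish_to_all(brokers, publish, topic, payload):
    """Publisher side: send the same message to every broker.

    Brokers that are down are skipped rather than treated as fatal,
    since the use cases discussed don't need 100% delivery. Returns
    the list of brokers that accepted the message.
    """
    delivered = []
    for broker in brokers:
        try:
            publish(broker, topic, payload)
            delivered.append(broker)
        except BrokerUnavailable:
            continue
    return delivered
```

Note that a consumer which falls back to the secondary broker may
see duplicates or miss messages around the switch, which is
tolerable only because of the relaxed delivery expectations above.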

Status update and plans for task tracking


As is traditional, we held a fishbowl on our ongoing task tracking
woes. We started with a brief introduction of stakeholders who
attended and the groups whose needs they were there to represent.
After that, there was a presentation of recent StoryBoard
development progress since Austin (including Gerrit integration,
private story support for embargoed security issues, improved event
timelines, better discoverability for boards and worklists, flexible
task prioritization), as well as the existing backlog of blockers.

We revisited the Infra spec on task tracking
for the benefit of those present, and Kendall Nelson (diablo_rojo)
agreed to pick up and continue with the excellent stakeholder
blocking issues outreach/coordination work begun by Anita Kuno

Next steps for infra-cloud


This was sort of a catch-all opportunity to hash out current plans
and upcoming needs for the infra-cloud. We determined that the
current heterogeneous hardware in the in-progress "chocolate" region
should be split into two homogeneous regions named "chocolate" and
"strawberry" (our "vanilla" region was already homogeneous). We also
talked about ongoing work to get a quote from OSUOSL for hosting the
hardware so that we can move it out of HPE data centers, and
attempting to find funding once we have some figures firmed up.

There were also some sideline discussions on possible monitoring and
high-availability options for the underlying services.
Containerization was, as always, brought up but the usual "not a fit
for this use case" answers abounded. It was further suggested that
using infra-cloud resources for things like special TripleO tests or
Docker registry hosting was somehow in scope, but there are other
solutions to these needs which should not be conflated with the
purpose of the infra-cloud effort.

Interactive infra-cloud debugging


The original intent for this session was to try to gather
leaders/representatives from the various projects that we're relying
on in the infra-cloud deployment and step through an interactive
session debugging the sorts of failures we see arise on the servers.
The idea was that this would be potentially educational for some
since this is a live bare metal "production" deployment of Nova,
Neutron, Keystone, Glance, et cetera with all the warts and rough
edges that operators handle on a daily basis but our developers may
not have directly experienced.

As well-intentioned as it was, the session suffered from several
issues. First and foremost we didn't realize the Friday morning
workroom we got was going to lack a projector (only so many people
can gather around one laptop, and if it's mine then fewer still!).
Trying to get people from lots of different projects to show up for
the same slot on a day that isn't for cross-project sessions is
pretty intractable. And then there's the fact that we were all
approaching burnout as it was the last day of the week and coffee
was all the way at the opposite end of the design summit space. :/

Instead the time was spent partly continuing the "future of
infra-cloud" discussion, and partly just talking about random things
like troubleshooting CI jobs (some people misunderstood the session
description and thought that's what we had planned) or general Infra
team wishlist items. Not a complete waste, but some lessons learned
if we ever want to try this idea again at a future summit.

Test environment expectations


After the morning break we managed to perk back up again and discuss
test platform expectations. This was a remarkably productive
brainstorming session where we assembled a rough list of
expectations developers can and, more importantly, can't have of
the systems on which our CI jobs run. The culmination of these
musings can now be found in a shiny new page of the Infra Manual:


Xenial jobs transition for stable/newton


Another constructive session right on the heels of the last...
planning the last mile of the cut-over from Ubuntu 14.04 to 16.04
testing. We confirmed that we would switch all jobs for
stable/newton as well as master (since the implementation started
early in the Newton cycle and we need to be consistent across
projects in a stable branch). We decided to set a date (which
incidentally is TODAY) to finalize the transition. The plan was
announced to the dev ML a month ago:


The (numerous) changes in flight today to switch the lingering jobs
are covered under a common review topic:


Unconference afternoon


At this stage things were starting to wind up and a lot of people
with early departures had already bowed out. Those of us who
remained were treated to our own room for the first time in many
summits (no offense to the Release and QA teams, but it was nice to
not need to share for a change). Since we were a little more at
liberty to set our own pace this time we treated it as a sort of
home base from which many of us set forth to pitch in on
Infra-related planning discussions in other teams' spaces, then
regroup and disseminate what we'd done (from translation platform
upgrades to release automation designs).

We also got in some good one-on-one time to work through topics
which weren't covered in scheduled sessions, such as Zuul v3 spec
additions or changes to the pep8 jobs to guard against missing sdist
build dependencies. As the afternoon progressed and the crowd
dwindled further we said our goodbyes and split up into smaller
groups to go out for one last meal, commiserate with those who found
themselves newly in search of employment and generally celebrate a
successful week in Barcelona.

That concludes my recollection of these sessions over the course of
the week--thanks for reading this far--feel free to follow up (on
the openstack-dev ML please) with any corrections/additions. Many
thanks to all who attended, and to those who could not: we missed
you. I hope to see lots of you again at the PTG in Atlanta, only a
couple months away now. Don't forget to register and book your
Jeremy Stanley
