[OpenStack-Infra] Gitea next steps
James E. Blair
corvus at inaugust.com
Mon Feb 4 19:04:34 UTC 2019
Hi,
At the last infra team meeting, we talked about whether and how to
proceed with Gitea. I'd like to summarize that quickly and make sure
we're all on board with it.
* We will continue to deploy our own Kubernetes using the
k8s-on-openstack Ansible playbook that Monty found. Since that's
developed by a third-party, we will use it by checking out the
upstream source from GitHub, but pinning to a known sha so that we
don't encounter surprises.
* We discussed deploying with a new version of rook which does not
require the flex driver, but it turns out I was a bit ahead of things
-- that hasn't landed yet. So we can probably keep our current
deployment.
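The pin-to-a-known-sha workflow from the first bullet can be sketched as
follows. This uses a throwaway local repository standing in for the
upstream GitHub repo, so the steps are reproducible; paths and commit
messages are hypothetical:

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

# Stand-in for the upstream repo (e.g. k8s-on-openstack on GitHub).
git init -q upstream
cd upstream
git -c user.email=a@b -c user.name=ci commit -q --allow-empty -m 'known-good state'
pin=$(git rev-parse HEAD)          # the sha we have vetted
git -c user.email=a@b -c user.name=ci commit -q --allow-empty -m 'surprise upstream change'
cd ..

# Deployment side: clone, then check out the pinned sha rather than the
# branch tip, so later upstream changes don't surprise us.
git clone -q upstream deploy
cd deploy
git checkout -q "$pin"
git log -1 --format=%s             # -> known-good state
```

Moving the pin forward is then an explicit, reviewable change rather
than an implicit one.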
Ian raised two new issues:
1) We should verify that the system still functions if our single-master
Kubernetes loses its master.
Monty and I tried this -- it doesn't. The main culprit here seems to be
DNS. The single master is responsible for intra-(and extra!)-cluster
DNS. This makes gitea unhappy for three reasons: a) if its SQL
connections have gone idle and been terminated, it cannot re-establish
them; b) it is unable to resolve remote hostnames for avatars, which can
greatly slow down page loads; and c) the replication receiver is not a
long-running process -- it's just run over SSH -- so it can't connect to
the database either, and therefore replication fails.
The obvious solution, a multi-master setup, apparently has issues if
k8s is deployed in a cloud with LoadBalancer objects (which we are
using).
Kubernetes does have support for scale-out DNS, though it's not clear
whether that still has a SPOF. Monty is experimenting with this.
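For what it's worth, the scale-out experiment could look roughly like
the fragment below -- a sketch assuming the cluster runs CoreDNS as a
Deployment in kube-system (the default in recent Kubernetes), with
anti-affinity so the replicas land on different nodes:

```yaml
# Hypothetical sketch: run more than one cluster-DNS pod, spread
# across nodes so a single node failure doesn't take DNS down.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns
  namespace: kube-system
spec:
  replicas: 2
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                k8s-app: kube-dns    # label CoreDNS pods carry by default
            topologyKey: kubernetes.io/hostname
```

Whether this removes the SPOF depends on where the DNS Service's
traffic is steered when the master is down, which is exactly what the
experiment needs to answer.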
If that doesn't improve things, we may still want to proceed, since the
system should still mostly work for browsing and git clones if the
master fails, and full operation will resume when the master comes back
online.
2) Rook is difficult to upgrade.
This appears to be the case. When it does come time to upgrade rook, we
may want to simply build a new Kubernetes cluster for the system.
Presumably by that point, it won't require the flexvolume driver, which
will be a good reason to make a new cluster anyway, and perhaps further
upgrades after that won't be as complicated.
Once we conclude the investigation into issue #1, I think these are the next
steps:
* Land the patches to manage the opendev k8s cluster with Ansible.
* Pin the k8s-on-openstack repo to the current sha.
* Add HTTPS termination to the cluster.
* Update opendev.org DNS to point to the cluster.
* Treat this as a soft-launch of the production service. Do not
publicise it or encourage people to switch to it yet, but continue to
observe it as we complete the rest of the tasks in [1].
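The DNS step above amounts to a record change along these lines -- a
hypothetical zone fragment, with placeholder addresses from the
documentation ranges standing in for the cluster's HTTPS-terminating
load balancer:

```
; Hypothetical: point opendev.org at the cluster's load balancer.
opendev.org.  300  IN  A     203.0.113.10
opendev.org.  300  IN  AAAA  2001:db8::10
```

A short TTL like this keeps the soft-launch reversible if we need to
point the name elsewhere quickly.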
[1] http://specs.openstack.org/openstack-infra/infra-specs/specs/opendev-gerrit.html#work-items
-Jim