On Tue, Jun 6, 2023 at 3:11 PM Vivian Rook <vrook@wikimedia.org> wrote:

After 31 days I lose kubectl access to magnum clusters. This has happened consistently for every cluster that I have deployed. The clusters run just fine, but after about 31 days of operation kubectl can no longer connect, and the web service shows the service as down (though the web service on the cluster still responds enough to say that nothing is working, so the cluster has not completely crashed).
Every kubectl command pauses for a long time (about 10 minutes) and then fails with errors like:
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.apps)
Unable to connect to the server: stream error: stream ID 11; INTERNAL_ERROR; received from peer
I have a little more information in
https://phabricator.wikimedia.org/T336586
It feels like a cert is expiring, as it always seems to happen right about 31 days after deployment. Does magnum have some kind of certificate like that? I checked the kubectl certs and they were set to be valid for years, so I don't think it is them, unless I didn't check them correctly (let's not discount that possibility, I totally could have read the wrong bit of the cert).
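
For anyone wanting to double-check the certificate dates, here is a minimal Python sketch of one way to do it. It assumes the kubeconfig embeds its certificates as base64 in `certificate-authority-data` and `client-certificate-data` (if the config points at PEM files instead, those files can be passed to `openssl x509 -noout -enddate` directly) and that the `openssl` binary is on the PATH. The helper names are made up for illustration and are not part of magnum or kubectl.

```python
import base64
import re
import ssl
import subprocess
import sys
import time

# Kubeconfig fields that may hold a base64-encoded PEM certificate.
CERT_FIELDS = ("certificate-authority-data", "client-certificate-data")


def extract_pem_certs(kubeconfig_text):
    """Return {field name: PEM text} for each embedded certificate found."""
    certs = {}
    for field in CERT_FIELDS:
        match = re.search(rf"{field}:\s*(\S+)", kubeconfig_text)
        if match:
            certs[field] = base64.b64decode(match.group(1)).decode("ascii")
    return certs


def openssl_enddate(pem_text):
    """Ask openssl for the notAfter time, e.g. 'Jun  6 15:11:00 2024 GMT'."""
    result = subprocess.run(
        ["openssl", "x509", "-noout", "-enddate"],
        input=pem_text, capture_output=True, text=True, check=True,
    )
    # openssl prints a line of the form 'notAfter=<time>'.
    return result.stdout.strip().split("=", 1)[1]


def days_until(not_after, now=None):
    """Days from `now` (epoch seconds, default: current time) to notAfter."""
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def report(path):
    """Print the expiry of every certificate embedded in the kubeconfig."""
    with open(path) as handle:
        certs = extract_pem_certs(handle.read())
    if not certs:
        print("no embedded certificates found; the config may reference files")
    for field, pem in certs.items():
        end = openssl_enddate(pem)
        print(f"{field}: expires {end} ({days_until(end):.1f} days from now)")


if __name__ == "__main__" and len(sys.argv) == 2:
    report(sys.argv[1])
```

This only covers the certificates in the kubeconfig itself. Certificates used inside the cluster (for example by the API server or etcd) would need to be checked on the cluster nodes.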

I can still generate a new kubectl config file with
openstack coe cluster config <cluster>

However, the resulting configuration has the same issue as the original config (long pause, then timeout errors). I have also tried to run:
openstack coe ca rotate <cluster>

This is accepted and seems to run fine, but if I then regenerate a kubeconfig file as above, I get a new error when running kubectl:
Unable to connect to the server: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "<cluster>")

If the key rotation is supposed to work and I'm just not doing it correctly, I would be delighted to hear how to run it correctly. Ideally, though, I would like to find out where the original key is failing and, if it is an expiration, how to set it to a longer time.

Thank you!
--
Vivian Rook (They/Them)
Site Reliability Engineer
Wikimedia Foundation