[magnum] kubectl loses access after 31 days

Vivian Rook vrook at wikimedia.org
Tue Jun 6 19:11:34 UTC 2023


After 31 days I lose kubectl access to magnum clusters. This has happened
consistently for every cluster that I have deployed. The clusters run just
fine at first, but after about 31 days of operation kubectl can no longer
connect, and the web service shows the service as down (though the web
service on the cluster still responds well enough to say that nothing is
working, so the cluster has not completely crashed).

Every kubectl command pauses for a long time (about 10 minutes) and then
fails with errors like:

Error from server (Timeout): the server was unable to return a response in
the time allotted, but may still be processing the request (get
deployments.apps)
Unable to connect to the server: stream error: stream ID 11;
INTERNAL_ERROR; received from peer

I have a little more information in
https://phabricator.wikimedia.org/T336586
It feels like a cert is expiring, as it always seems to happen right about
31 days after deployment. Does magnum have some kind of certificate with
that lifetime? I checked the kubectl certs and they are valid for years, so
I don't think it is them, unless I didn't check them correctly (let's not
discount that possibility, I totally could have read the wrong bit of the
cert).
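
For anyone wanting to repeat the expiry check, here is a minimal sketch
using openssl. It assumes the certificate is available as a PEM file; the
file names ca.pem and cert.pem are placeholders rather than names confirmed
in this thread, and the helper functions are hypothetical. The same check
can be pointed at any PEM certificate, not only the ones kubectl uses.

```shell
#!/bin/sh
# Sketch: report when a PEM certificate expires, and whether it will still
# be valid a given number of days from now.
# The file names in the usage notes (ca.pem, cert.pem) are assumptions.

# cert_expiry FILE
# Prints "notAfter=<date>" for the certificate stored in FILE.
cert_expiry() {
    openssl x509 -in "$1" -noout -enddate
}

# cert_valid_for FILE DAYS
# Exit status 0 if the certificate in FILE is still valid DAYS days from
# now, non-zero if it will have expired by then.
cert_valid_for() {
    openssl x509 -in "$1" -noout -checkend "$(( $2 * 86400 ))" >/dev/null
}

# Example usage (hypothetical file names):
#   cert_expiry ca.pem
#   cert_valid_for cert.pem 31 || echo "cert.pem expires within 31 days"
```

A certificate that fails cert_valid_for with a value near 31 shortly after
deployment would be a candidate for the behaviour described above.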

I can still generate a new kubectl config file with
openstack coe cluster config <cluster>

However, the resulting configuration has the same issue as the original
config (long pause, then timeout errors). I have also tried to run:
openstack coe ca rotate <cluster>

This is accepted and seems to run fine, but if I then regenerate a
kubeconfig file as above, I get a new error when running kubectl:
Unable to connect to the server: x509: certificate signed by unknown
authority (possibly because of "crypto/rsa: verification error" while
trying to verify candidate authority certificate "<cluster>")
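
An "unknown authority" error of this kind means the certificate presented
by the server does not verify against the CA in the kubeconfig. A minimal
sketch of checking that by hand with openssl follows. It assumes both
certificates are available as PEM files; HOST, PORT, ca.pem and server.pem
are placeholders, and chains_to_ca is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: check whether a certificate verifies against a given CA
# certificate. File names, HOST and PORT below are assumptions.

# chains_to_ca CA_FILE CERT_FILE
# Exit status 0 if CERT_FILE verifies against CA_FILE, non-zero otherwise.
chains_to_ca() {
    openssl verify -CAfile "$1" "$2" >/dev/null 2>&1
}

# Example usage (hypothetical host, port and file names):
#   # Save the certificate the API server currently presents:
#   openssl s_client -connect HOST:PORT </dev/null 2>/dev/null \
#       | openssl x509 > server.pem
#   # Compare it with the CA from the regenerated kubeconfig:
#   chains_to_ca ca.pem server.pem \
#       || echo "server certificate is not signed by ca.pem"
```

If the check fails only after the rotation, that would suggest the server
is still presenting a certificate signed by the previous CA, though this is
a guess rather than something confirmed here.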

If the rotation should work and I'm just not running it correctly, I would
be delighted to hear how to run it properly. Ideally, though, I would like
to find out where the original certificate is failing and, if it is an
expiration, how to set it to a longer time.

Thank you!
-- 

*Vivian Rook (They/Them)*
Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>