[magnum] kubectl loses access after 31 days

Vivian Rook vrook at wikimedia.org
Fri Jun 9 23:58:32 UTC 2023


It appears that this is caused by podman logs growing very large. I have one
now that is about 7G and still growing. I see that
https://opendev.org/openstack/magnum/commit/9d543960d2827ede5be4f851b1cb62c986981f32
was included a few years ago and should limit logs to 50M; perhaps it is no
longer working as expected? Or are there settings required for this log limit
that I might not have set?
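
For anyone wanting to check whether they have the same kind of growth, here is
a minimal sketch that lists the largest files under a directory. The function
names (largest_files, human) are made up for illustration, and the directory
you point it at (for example /var/lib/containers) is an assumption; where the
podman logs actually live on a magnum node may differ.

```python
import heapq
import os


def largest_files(root, top_n=10):
    """Return up to top_n (size_bytes, path) tuples under root, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat does not follow symlinks, so link targets are not counted
                sizes.append((os.lstat(path).st_size, path))
            except OSError:
                # The file vanished or is unreadable; skip it
                continue
    return heapq.nlargest(top_n, sizes)


def human(size_bytes):
    """Format a byte count with binary units, e.g. 7 * 1024**3 -> '7.0G'."""
    size = float(size_bytes)
    for unit in ("B", "K", "M", "G", "T"):
        if size < 1024 or unit == "T":
            return f"{size:.1f}{unit}"
        size /= 1024
```

Running something like
for size, path in largest_files("/var/lib/containers"): print(human(size), path)
as root on the affected node should show whether a single log file accounts
for most of the disk usage.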

Thank you!

On Tue, Jun 6, 2023 at 3:11 PM Vivian Rook <vrook at wikimedia.org> wrote:

> After 31 days I lose kubectl access to magnum clusters. This has happened
> consistently for every cluster that I have deployed. The clusters run just
> fine, but after around 31 days of operation kubectl cannot connect, and the
> web service shows the service as down (though the web service on the
> cluster still responds enough to say that nothing is working, so the cluster
> has not completely crashed).
>
> All kubectl commands pause for a long time (about 10 minutes) and then give
> errors like:
>
> Error from server (Timeout): the server was unable to return a response in
> the time allotted, but may still be processing the request (get
> deployments.apps)
> Unable to connect to the server: stream error: stream ID 11;
> INTERNAL_ERROR; received from peer
>
> I have a little more information in
> https://phabricator.wikimedia.org/T336586
> It feels like a cert is expiring, as it always seems to happen right about
> 31 days after deployment. Does magnum have some kind of certificate like
> that? I checked the kubectl certs and they are valid for years, so I
> don't think they are the cause, unless I didn't check them correctly (let's
> not discount that possibility, I totally could have read the wrong bit of
> the cert).
>
> I can still generate a new kubectl config file with
> openstack coe cluster config <cluster>
>
> Though the resulting configuration will have the same issue as the
> original config (long pause, then timeout errors). I have also tried to run:
> openstack coe ca rotate <cluster>
>
> Which is accepted and seems to run fine, but after that point if I
> regenerate a kubeconfig file as above I get new errors when running kubectl:
> Unable to connect to the server: x509: certificate signed by unknown
> authority (possibly because of "crypto/rsa: verification error" while
> trying to verify candidate authority certificate "<cluster>")
>
> If key rotation should work and I'm just not doing it correctly, I would be
> delighted to hear how to run it correctly. Ideally, though, I would like to
> find where the original key is failing and, if it is an expiration, how to
> set it to a longer lifetime.
>
> Thank you!
> --
>
> *Vivian Rook (They/Them)*
> Site Reliability Engineer
> Wikimedia Foundation <https://wikimediafoundation.org/>
>



