[Magnum][Kayobe] Magnum Kubernetes clusters failing to be created (bugs?)
Hi guys, I hope you are all keeping safe and well at the moment.

I am trying to launch Kubernetes clusters into OpenStack Train, which has been deployed via Kayobe (Kayobe, as I understand it, is a wrapper for kolla-ansible). There have been a few strange issues here and I've struggled to isolate them. To give some context, these issues started after a fresh OpenStack deployment some months ago (around February 2020). This OpenStack is not "live", as I've been trying to get to the bottom of the issues:

Issue 1. When trying to launch a cluster we get the error: "Resource Create Failed: Forbidden: Resources.Kube Masters.Resources[0].Resources.Kube-Master: Only Volume-Backed Servers Are Allowed For Flavors With Zero Disk."

Issue 2. After successfully creating a cluster of a smaller node size, "resize cluster" is failing (however, "update cluster" is working).

Some background on this specific environment: deployed via Kayobe, with these components: Cinder, Designate, iscsid, Magnum, Multipathd, and Neutron provider networks.

The Cinder component integrates with iSCSI SAN storage using the Nimble driver. This is the only storage. In order to prevent OpenStack from allocating compute node local HDD as instance storage, I have all flavours configured with root disk / ephemeral disk / swap disk = "0MB". This results in all instance data being stored on the backend Cinder storage appliance.

I was able to get a cluster deployed by first creating the template as needed; then, when launching the cluster, Horizon prompts you for items already in the template, such as number of nodes, node flavour, labels etc. I re-supplied all of the info (duplicating the template values) and then tried creating the cluster. After many, many attempts over the course of a few weeks to a few months, it was successful. I was then able to work around issue #2 above to get it increased in size.

When looking at the logs for issue #2, it looks like some content is missing in the API, but I am not certain. I will include a link to the pastebin below [1]. When trying to resize the cluster, Horizon gives the error: "Error: Unable to resize given cluster id: 99693dbf-160a-40e0-9ed4-93f3370367ee". I then searched the controller node /var/log directory for this ID and found: "horizon.log [:error] [pid 25] Not Found: /api/container_infra/clusters/99693dbf-160a-40e0-9ed4-93f3370367ee/resize". Going to the Horizon "update cluster" menu allows you to increase the number of nodes and then save/apply the config, which does indeed resize the cluster.

Regarding issue #1, we've been unable to deploy a cluster in a new project, and the error hints that it relates to the flavours having 0MB disk specified. However, this error is new, and we've previously been successful deploying clusters (albeit with the hit-and-miss experience described above) using the flavour with 0MB disk. Again, I searched the controller logs for the (stack) ID after the failure and obtained little more than the error already seen in Horizon [2].

I was able to create new flavours with root disk = 15GB and then successfully deploy a cluster on the next immediate try. Updating the cluster from 3 nodes to 6 nodes was also immediately successful. However, I see the compute nodes' "used" disk space increasing after increasing the cluster size, which is an issue as the compute node has very limited HDD capacity (32GB SD card).

At this point I also checked 1) the previously installed cluster using the 0MB disk flavour and 2) new instances using the 0MB disk flavour. I noticed that the previous cluster has host storage allocated, while the new instance does not. So the successful cluster create is using the flavour with disk = 0MB, yet the result is compute node HDD storage being consumed.

So with the above, may I please clarify the following? 1. It seems that 0MB disk flavours may not be supported with Magnum now? Could the experts confirm? :) Is there another way that I should be configuring this so that compute node disk is not consumed (because it is slow and has limited capacity)? 2. Issue #1 looks like a bug to me; is it known? If not, is this mail enough to get it logged?

Pastebin links as mentioned:

[1] http://paste.openstack.org/show/797316/

[2] http://paste.openstack.org/show/797318/

Many thanks,

Regards,

Tony Pearce
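The Nova-side behaviour behind issue #1 can be checked outside of Magnum. The following is a minimal sketch, assuming illustrative flavour, image, and network names (not taken from the environment above): a zero-disk flavour is only accepted by Nova when the server is volume-backed, which the OpenStack CLI can request with `--boot-from-volume`.

```shell
# A zero-disk flavour like the ones described above (names/sizes illustrative):
openstack flavor create --vcpus 2 --ram 4096 --disk 0 --ephemeral 0 --swap 0 \
    k8s.medium.0disk

# Nova rejects a plain boot with this flavour ("Only volume-backed servers
# are allowed for flavors with zero disk"), but accepts it when asked to
# build a Cinder boot volume from the image:
openstack server create --flavor k8s.medium.0disk \
    --image fedora-atomic \
    --boot-from-volume 15 \
    --network my-net \
    zero-disk-test
```

If this standalone boot works while the Magnum-driven one fails, that narrows the problem to how Magnum's Heat templates create the servers rather than to the flavour itself.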
Hi Tony,

My comments about your two issues:

1. I'm not sure it's a Magnum issue. Did you try to draft a simple Heat template that uses that flavor and the same image to create an instance? Does it work?

2. When you say "resize cluster" failed, what error did you get from the magnum conductor log?

On 1/09/20 9:22 pm, Tony Pearce wrote:
--
Cheers & Best regards,
Feilong Wang (王飞龙)
Senior Cloud Software Engineer
Tel: +64-48032246
Email: flwang@catalyst.net.nz
Catalyst IT Limited
Level 6, Catalyst House, 150 Willis Street, Wellington
Hi Feilong, thank you for replying to my message.

"1. I'm not sure it's a Magnum issue. Did you try to draft a simple Heat template to use that flavor and same image to create instance? Does it work?"

No, I didn't try it, and I don't think I know enough about Heat to try. I am using Magnum, which in turn drives Heat, but I've never used Heat directly. When I use the new flavour with root disk = 15GB, I don't have any issue launching the cluster, but I have a future issue of consuming all available disk space on the compute node.

"2. When you say "resize cluster" failed, what's the error you got from magnum conductor log?"

I did not see any error in the conductor log, only the Magnum API and Horizon logs mentioned earlier. It looks like Horizon was calling bad URLs, so maybe that is why there was nothing in the conductor log? Just to mention again, though: the "update cluster" option is working fine to increase the size of the cluster.

However, my main issue here is the flavour being used. Can you or anyone confirm the behaviour with root disk = 0MB? Or can you or anyone share any information about how to utilise Magnum/Kubernetes without consuming compute node HDD storage? I've been unable to achieve this, and the docs do not cover it specifically (unless of course I have missed it?). The documentation says I can use any flavour [1].

[1] https://docs.openstack.org/magnum/latest/user/

Regards,

Tony Pearce

On Wed, 2 Sep 2020 at 15:44, feilong <feilong@catalyst.net.nz> wrote:
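For the "don't consume compute node disk" requirement, Magnum has its own boot-from-volume support via cluster template labels, which avoids zero-disk flavour tricks entirely. A sketch, assuming the `boot_volume_size`/`boot_volume_type` labels are available in this Train deployment and using illustrative names:

```shell
# Template that boots cluster nodes from Cinder volumes instead of
# compute node local disk (names, sizes, and the "nimble" volume type
# are illustrative assumptions):
openstack coe cluster template create k8s-volume-backed \
    --coe kubernetes \
    --image fedora-atomic \
    --external-network public \
    --master-flavor m1.medium \
    --flavor m1.medium \
    --labels boot_volume_size=15,boot_volume_type=nimble
```

With this approach the flavour can keep a normal root disk size, since Nova never allocates it locally for volume-backed servers.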
Hi Tony,

Let me answer #2 first. Did you try to use the CLI? Please make sure you are using the latest python-magnumclient version; it should work. As for the dashboard issue, please try to use the latest version of magnum-ui. I encourage using resize, because node update is not recommended.

As for #1, I probably missed something. If the root disk = 0MB, where will the operating system be installed? It would be nice if you could share your original requirement to help me understand the issue, e.g. why are you concerned about the node disk being used?

On 2/09/20 8:12 pm, Tony Pearce wrote:
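The CLI resize suggested above would look roughly like this, using the cluster ID from the earlier Horizon error (the target node count is illustrative):

```shell
# Resize the cluster to 6 nodes via the magnum plugin for the
# OpenStack CLI, bypassing the Horizon /resize endpoint that 404'd:
openstack coe cluster resize 99693dbf-160a-40e0-9ed4-93f3370367ee 6

# Watch progress / final status:
openstack coe cluster show 99693dbf-160a-40e0-9ed4-93f3370367ee -c status
```

If the CLI resize succeeds while Horizon fails, that points at a stale or mismatched magnum-ui version rather than at Magnum itself.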
Hi Feilong,

"Let me answer #2 first. Did you try to use CLI? Please make sure using the latest python-magnumclient version. It should work. As for the dashboard issue, please try to use the latest version of magnum-ui. I encourage using resize because the node update is not recommended to use."

I did not attempt the resize via CLI, but I can try it. Thank you for your guidance on this :)

"As for #1, I probably missed something. If the root disk=0MB, where will the operating system be installed? It would be nice if you can share your original requirement to help me understand the issue. e.g why do you have concern the node disk being used?"

Sure, although I'd also like to understand why you have no concern that the node disk is being used :) I may be missing something here. In this environment I have this setup:

- controller node
- compute node
- network storage appliance, integrated with Cinder iSCSI

All VM/instance data needs to be on the network storage appliance, for these reasons:

- it's faster than node storage (flash-backed array of disks, provides write cache and read cache)
- resilience is built into the array
- it has much higher storage capacity
- it is designed for multi-access (i.e. many connections from hosts)

There are other reasons as well, such as deploying compute nodes as disposable services. For example, if a compute node dies, a new node is deployed; instances are not locked to any node and can be started again on other nodes.

Going back to 2016 when I deployed OpenStack Pike, I noticed when running post-deployment tests that node storage was being consumed even though I have this network storage array. I did some research online and came to the understanding that the reason was the flavors having a positive "root disk" (and swap) value rather than 0MB. So since 2016 I have been using all flavors with disk = 0MB to force the network storage to be used for instance disks and storage. This has worked since Pike, through Queens and Train, for launching instances (but not Magnum).

The requirement is to utilise network storage (not node storage). Is there some other way this is achieved today? I don't understand the point of shared storage options in OpenStack if node storage is consumed for instances. Could you help me understand whether this specific environment is simply not considered by the OpenStack devs, or whether there is some other reason unknown to me? For example, in my (limited) experience with other virtualisation systems (VMware and oVirt, for example), they avoid consuming compute storage for a number of reasons similar to mine.

So to summarise on this one: I'm not stating that "I am right" here, but I am politely asking for more information so I can better understand what I may be doing wrong with this deployment, or the other reasons involved.

Lastly, thank you again for taking the time to reply to me; I really appreciate it.

Regards,

Tony Pearce

On Wed, 2 Sep 2020 at 16:54, feilong <feilong@catalyst.net.nz> wrote:
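Feilong's earlier suggestion of testing the flavour with a bare Heat template can be sketched as follows. This is a minimal, illustrative fragment (the image, network, and flavour names are placeholders, not values from this deployment) that creates a Cinder boot volume and attaches it as the root disk, which is what a zero-disk flavour requires:

```yaml
heat_template_version: train

resources:
  boot_volume:
    type: OS::Cinder::Volume
    properties:
      image: fedora-atomic        # placeholder image name
      size: 15                    # GB, on the Nimble-backed Cinder storage

  test_server:
    type: OS::Nova::Server
    properties:
      flavor: k8s.medium.0disk    # the zero-disk flavour under test
      networks:
        - network: my-net         # placeholder network name
      block_device_mapping_v2:
        - volume_id: { get_resource: boot_volume }
          boot_index: 0
          delete_on_termination: true
```

If `openstack stack create -t test.yaml test-stack` succeeds with this template, the flavour itself is fine and the question moves to why Magnum's generated templates don't create the servers as volume-backed.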