[magnum] failed to launch Kubernetes cluster

Bharat Kunwar bharat at stackhpc.com
Mon Jul 13 08:35:39 UTC 2020


Hi Tony

I have not used Designate myself, so I am not sure about the exact details, but if you are using Kayobe/Kolla-Ansible, we recently proposed these backports to Train: https://review.opendev.org/#/c/738882/1/ansible/roles/magnum/templates/magnum.conf.j2. Magnum queries the Keystone catalog for the URL that instances can use to talk back to Keystone and to Magnum itself. Usually this is the public URL, but essentially you need to specify an endpoint interface which fits the bill. Please check /etc/kolla/magnum-conductor/magnum.conf on the control plane where Magnum is deployed and ensure it is configured to use the correct interface.
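
For illustration only, these are the kind of options involved; I am naming the [trust] and [magnum_client] sections from memory of Magnum's configuration options, so please verify the exact option names and values against the review above and your deployed file:

    [trust]
    # Keystone interface the cluster instances use to authenticate;
    # it must be an endpoint they can actually reach.
    trustee_keystone_interface = public

    [magnum_client]
    # Interface used when resolving Magnum's own endpoint from the catalog.
    endpoint_type = publicURL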


Cheers

Bharat

> On 13 Jul 2020, at 08:43, Tony Pearce <tonyppe at gmail.com> wrote:
> 
> Hi Bharat, many thanks for your super quick response last week. I really appreciate it, especially since I had been trying for so long with this issue. I wanted to try out your suggestion before coming back with a reply.
> 
> I tried your suggestion and at first I got the same experience (failure) when creating a cluster; it appeared to stop in the same place as I described in my previous mail. During the investigation I noticed some odd behaviour with the DNS integration (Designate), points [1] and [2] below. I decided to remove Designate from OpenStack and retest, and now I am able to deploy a Kubernetes cluster successfully! :)
> 
> Regarding those 2 points: 
> [1] - the configured Designate zone was project.cloud.company.com, so instance1 should have become instance1.project.cloud.company.com; however, the kube master instance hostname was getting master.cloud.company.com
> [2] - when doing a DNS lookup on master.project.cloud.company.com, the private IP was returned instead of the floating IP. This meant that from outside the project the instance could not be pinged by hostname (see the check below).
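> 
> A quick way to confirm which record is returned, using the hostname from [2] (the nameserver address is a placeholder for your Designate nameserver):
> 
>     dig +short master.project.cloud.company.com @<designate-ns-ip>
> 
> Getting the fixed IP back rather than the floating IP confirms the behaviour in [2].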
> 
> I removed both Magnum and Designate and then redeployed: first Magnum, confirming a successful Kubernetes cluster deployment using your fix, Bharat, and then Designate again. Issue [1] is still present, while issue [2] is resolved and no longer occurs. Kubernetes cluster deployment is still successful :)
> 
> Thank you once again and have a great week ahead! 
> 
> Kind regards,
> 
> Tony Pearce
> 
> 
> 
> On Fri, 10 Jul 2020 at 16:24, Bharat Kunwar <bharat at stackhpc.com> wrote:
> Hi Tony
> 
> That is a known issue and is due to the default version of the heat container agent baked into the Train release. Please use the label heat_container_agent_tag=train-stable-3 and you should be good to go.
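> 
> For example, passed at template creation time (template name taken from your command below; the other flags stay as you have them):
> 
>     openstack coe cluster template create k8s-cluster-template \
>         ... \
>         --labels heat_container_agent_tag=train-stable-3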
> 
> Cheers
> 
> Bharat
> 
>> On 10 Jul 2020, at 09:18, Tony Pearce <tonyppe at gmail.com> wrote:
>> 
>> Hi team, I hope you are all keeping safe and well at the moment. 
>> 
>> I am trying to use Magnum to launch a Kubernetes cluster. I have tried different images but am currently using Fedora-Atomic 27. The cluster deployment from the cluster template is failing, and I am writing to ask if you could please point me in the right direction, as I have become stuck and am uncertain how to troubleshoot this further. The cluster seems to fail a few minutes after the master node boots: after the logs shown in [1] and [2], I do not see any progress in terms of new (different) logs or load on the master. Then the 60-minute timeout is reached and the cluster is marked as failed.
>> 
>> I deployed this OpenStack environment using Kayobe (Kolla-Ansible) and it is the Train release, deployed on CentOS 7 within Docker containers. Kayobe manages the deployment through Ansible playbooks.
>> 
>> This was previously working some months back (around February 2020), although I think I may have used a CoreOS image at that time, and that is also not working today. I then deleted that deployment and redeployed; the only change was the hostname for the controller node, updated in the Kayobe inventory file.
>> Since then (a month or so back) I have been unable to successfully deploy a Kubernetes cluster. I have tried other Fedora-Atomic images as well as CoreOS without success. When using the CoreOS image, tagged with the coreos tag as per the Magnum docs, the instance fails to boot and drops to the rescue shell; however, if I launch the CoreOS image manually, it boots successfully and is configured via cloud-init. All of the deployment attempts stop at the same place when using the Fedora image, and I have a different experience if I disable TLS:
>> 
>> TLS enabled: master launched, no nodes. Fails when running /usr/lib/python2.7/site-packages/magnum/drivers/k8s_fedora_atomic_v1/templates/kubemaster.yaml
>> 
>> TLS disabled: master and nodes launched, but the deployment fails later. I didn't investigate this very much.
>> 
>> When looking for help around the web, I found the following, which looks to be the same issue I have at the moment (although the reporter deployed slightly differently, using CentOS 8, and mentions Magnum 10):
>> https://ask.openstack.org/en/question/128391/magnum-ussuri-container-not-booting-up/
>> 
>> I see the same log messages on the master node in the heat logs.
>> 
>> When going through the troubleshooting guide, I see that etcd is running with no errors; however, I do not see any flannel service at all. I also do not know whether the deployment simply failed before getting to flannel or whether flannel is the cause. As a test, I tried deploying with a cluster template that uses Calico instead, but the logs showed the same result. The checks I ran are below.
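>> 
>> These are the checks I ran on the master (the unit names are what I understand the Fedora Atomic driver to use, so treat them as assumptions rather than exact):
>> 
>>     sudo systemctl status etcd
>>     sudo systemctl list-units --all | grep -i flannel
>>     sudo journalctl -u heat-container-agent --no-pager | tail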
>> 
>> When looking at the failed stacks via the CLI, this is what I see: http://paste.openstack.org/show/795736/
>> 
>> I am using a master node flavour with 4 vCPUs and 4 GB memory, and a worker node flavour with 2 vCPUs and 2 GB memory.
>> Storage is only via Cinder, as I am using iSCSI storage with a Cinder driver; I do not have any other storage.
>> 
>> On the master, after the failure, the heat log repeats these lines:
>> 
>> ++ curl --silent http://127.0.0.1:8080/healthz
>> + '[' ok = ok ']'
>> + kubectl patch node k8s-cluster-onvaoh2zxotf-master-0 --patch '{"metadata": {"labels": {"node-role.kubernetes.io/master": ""}}}'
>> error: no configuration has been provided, try setting KUBERNETES_MASTER environment variable
>> Trying to label master node with node-role.kubernetes.io/master=""
>> + echo 'Trying to label master node with node-role.kubernetes.io/master=""'
>> + sleep 5s
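>> 
>> From that error, kubectl is running without any API server configuration. Purely as an illustration of what the patch call expects (not necessarily how Magnum's script is supposed to set it up):
>> 
>>     KUBERNETES_MASTER=http://127.0.0.1:8080 kubectl patch node k8s-cluster-onvaoh2zxotf-master-0 \
>>         --patch '{"metadata": {"labels": {"node-role.kubernetes.io/master": ""}}}'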
>> 
>> [1] Here's the cloud-init.log: http://paste.openstack.org/show/795737/
>> [2] and the cloud-init-output.log: http://paste.openstack.org/show/795738/
>> 
>> May I ask whether anyone with a recent Magnum deployment and a working Kubernetes cluster could share the relevant details, such as the image used, so that I can try to replicate it?
>> 
>> To create the cluster template I have been using: 
>> openstack coe cluster template create k8s-cluster-template \
>>                            --image Fedora-Atomic-27 \
>>                            --keypair testpair \
>>                            --external-network physnet2vlan20 \
>>                            --dns-nameserver 192.168.7.233 \
>>                            --flavor 2GB-2vCPU \
>>                            --docker-volume-size 15 \
>>                            --network-driver flannel \
>>                            --coe kubernetes
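>> 
>> For completeness, I then create the cluster from that template along these lines (the master flavour name here is my guess, matching the 4 vCPU / 4 GB flavour mentioned above):
>> 
>>     openstack coe cluster create k8s-cluster \
>>         --cluster-template k8s-cluster-template \
>>         --master-count 1 \
>>         --node-count 1 \
>>         --master-flavor 4GB-4vCPU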
>> 
>> 
>> If I have missed anything, I am happy to provide it. 
>> 
>> Many thanks in advance for any help or pointers on this. 
>> 
>> Regards,
>> 
>> Tony Pearce
>> 
> 
