Hi OpenStack Team,

When I tried to create an instance in the OpenStack dashboard, I encountered the error below:

Error: Failed to perform requested operation on instance "Test-Instance", the instance has an error status: Please try again later [Error: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance f8cb7bb1-64db-4242-a26a-65f3568ff4e9.].

Can anyone help me find the root cause of this error?

Regards,

-----Original Message-----
From: openstack-discuss-request@lists.openstack.org <openstack-discuss-request@lists.openstack.org>
Sent: Wednesday, November 1, 2023 7:26 AM
To: openstack-discuss@lists.openstack.org
Subject: openstack-discuss Digest, Vol 61, Issue 1

Today's Topics:

1. Re: [ops] Seeking Advice on Service Deployment and Layout for an 8-Node Cluster (Forrest Fuqua)

----------------------------------------------------------------------

Message: 1
Date: Wed, 1 Nov 2023 02:26:02 +0000
From: Forrest Fuqua <fffics@rit.edu>
Subject: Re: [ops] Seeking Advice on Service Deployment and Layout for an 8-Node Cluster
To: Karl Kloppenborg <kkloppenborg@resetdata.com.au>, "openstack-discuss@lists.openstack.org" <openstack-discuss@lists.openstack.org>
Message-ID: <CO6PR16MB420923931283EAF219118489D0A7A@CO6PR16MB4209.namprd16.prod.outlook.com>

I am currently doing COW images in Ceph, which seems to be the same thing as volume-backed images. I have about 40 images I use (I even published the set here: http://mirrors.rit.edu/oszoo/ ).

I'll take a look at OpenStack-Helm; I've never deployed Kerb before.

________________________________
From: Karl Kloppenborg <kkloppenborg@resetdata.com.au>
Sent: Tuesday, October 31, 2023 8:03 PM
To: Forrest Fuqua <fffics@rit.edu>; openstack-discuss@lists.openstack.org <openstack-discuss@lists.openstack.org>
Subject: Re: [ops] Seeking Advice on Service Deployment and Layout for an 8-Node Cluster

Hi Forrest,

One of the first things I would suggest is to utilise volume-backed images.
https://docs.openstack.org/cinder/latest/admin/volume-backed-image.html

If you have, say, a set of Linux and Windows images, you can use this instead of RBD imaging: Ceph will snapshot the RBD from the base image, and this can provision in seconds.

I would also suggest (and this is completely biased because I’m on the core team) looking at OpenStack-Helm; we run this at scale with many thousands of VMs.

Thanks,
Karl.

From: Forrest Fuqua <fffics@rit.edu>
Date: Wednesday, 1 November 2023 at 2:15 am
To: Karl Kloppenborg <kkloppenborg@resetdata.com.au>, openstack-discuss@lists.openstack.org <openstack-discuss@lists.openstack.org>
Subject: Re: [ops] Seeking Advice on Service Deployment and Layout for an 8-Node Cluster

In response to Karl's request, here is last year's major load I ran:
http://mirrors.rit.edu/cptc/2023/virtual-machines/1-README.txt

The TL;DR: it's a mix of Windows and Linux VMs at about 4GB of RAM and 4 vCPU each, at about 50 machines per team and up to 50 teams at a time. With all the extra support VMs at various workloads, we do about 2500 VMs just for the yearly competitions.

All of these VMs will be under high load, since they will be under a consistent cyberattack by students.

The biggest issue I've had is that Ceph starts to slow down on RBD creation of a COW image and Nova will time out the creation. If it gets too bad, entries like volumes and VMs can get stuck in weird error states such as error_deleting (I've had volumes stuck attached to VMs that don't exist!).

At peak, Netdata was showing about 500k IOPS across the whole cluster before it really started to go north and went into a death spiral. A single Windows VM had a 4-hour boot time, and Ceph started showing slow writes with up to 40s commit latency.

When I was running multi-node management, MariaDB and RabbitMQ would get super mad and start dropping messages and having database issues; having a single dedicated compute node has worked well so far. But I am thinking of moving away from Kolla and going to Ansible during my summer rebuild.

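As a rough cross-check of the figures in the message above, the aggregate demand of one competition can be sketched in Python. This is only an illustration: the function name `competition_demand` and its parameters are hypothetical, the defaults are the approximate values quoted in the message (about 50 machines per team, up to 50 teams, about 4GB of RAM and 4 vCPU per VM), and the extra support VMs are left out because the message gives no count for them.

```python
def competition_demand(teams=50, vms_per_team=50, ram_gb_per_vm=4, vcpus_per_vm=4):
    """Aggregate demand for one competition (hypothetical helper).

    Defaults are the approximate figures from the message; support VMs
    are not counted because no number is given for them.
    """
    vms = teams * vms_per_team
    return {
        "vms": vms,                      # total per-team VMs
        "ram_gb": vms * ram_gb_per_vm,   # total guest RAM requested
        "vcpus": vms * vcpus_per_vm,     # total guest vCPUs requested
    }


if __name__ == "__main__":
    for key, value in competition_demand().items():
        print(f"{key}: {value}")
```

Changing the defaults (for example, fewer teams or smaller flavors) shows how quickly the requested RAM and vCPU totals move relative to what the cluster can offer.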
________________________________
From: Karl Kloppenborg <kkloppenborg@resetdata.com.au>
Sent: Monday, October 30, 2023 6:44 PM
To: Forrest Fuqua <fffics@rit.edu>; openstack-discuss@lists.openstack.org <openstack-discuss@lists.openstack.org>
Subject: Re: [ops] Seeking Advice on Service Deployment and Layout for an 8-Node Cluster

Hi Forrest,

Based on the configuration specified, I will assume that 1.8TB per node is assignable to VM resources (roughly 180-200GB on each system for host and Ceph services), less the management node (though honestly, why not just share over the cluster? It doesn’t seem like it’s a production workload).

That gives (1.8TB*7)*1000 = 12600GB usable RAM, with no oversubscription.
Therefore 12600/3100 = ~4GB RAM per VM at peak load, no oversubscription.

Based on the CPU arch: 128*2 threads = 256 vCPU cores (which in this model I would count as vCPU cores; there’s a LOT of contention about what defines a vCPU, with many camps and religious arguments on this)…
However, we want some space for the host, so as a general thought I will put 250 vCPU available per node in this calculation.

250*7 = 1750 vCPU cores available.
There is a deficit of 3100-1750 = 1,350 vCPU cores.
As such, to provide 1 vCPU core per VM at peak workload, the oversubscription ratio would be 1.8:
1750*1.8 = 3150 vCPU available.

Honestly, depending on these workloads and your requirements, you can probably bust this subscription ratio up more. Again, I would instead share the management workloads over the cluster more and induct the first node as another compute resource.

I would like to know more about the workloads: what is the flavour of each VM? Is the workload disk-, CPU-, memory-, or I/O-intensive?

In terms of running the workload, I don’t see an obvious issue with your nova config. The allocation ratio is high, but that’s probably not going to be an issue here based on the numbers I mentioned.
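The capacity estimate above can be sketched in Python as follows. This is a minimal illustration, not anything taken from Nova or from the posted configs: the function name `capacity_summary` and its parameter names are assumptions made for this example. The defaults are the figures used in the estimate (7 compute nodes, 1.8TB of assignable RAM and 250 usable vCPU per node, 3100 VMs), and the ratio is rounded up to one decimal place, which reproduces the 1.8 quoted above.

```python
import math


def capacity_summary(compute_nodes=7, ram_gb_per_node=1800,
                     vcpus_per_node=250, target_vms=3100):
    """Rough capacity estimate for a 1-vCPU-per-VM peak (hypothetical helper)."""
    usable_ram_gb = ram_gb_per_node * compute_nodes        # no RAM oversubscription
    physical_vcpus = vcpus_per_node * compute_nodes
    # Smallest ratio, in tenths, that covers one vCPU per VM (rounded up).
    ratio_tenths = math.ceil(target_vms * 10 / physical_vcpus)
    return {
        "usable_ram_gb": usable_ram_gb,
        "ram_per_vm_gb": usable_ram_gb / target_vms,
        "physical_vcpus": physical_vcpus,
        "vcpu_deficit": max(target_vms - physical_vcpus, 0),
        "cpu_allocation_ratio": ratio_tenths / 10,
        "vcpus_at_ratio": physical_vcpus * ratio_tenths // 10,
    }


if __name__ == "__main__":
    for key, value in capacity_summary().items():
        print(f"{key}: {value}")
```

Calling `capacity_summary(compute_nodes=8)` gives a quick view of what adding the management node to the compute pool would change.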
Ensuring host memory reservation is good: I can see you have around 128GB in reservation, and ram_allocation_ratio at 1.0 will ensure no OOM scenarios.

Overall, I think your settings are sane. My only comment would be to add that management node into the compute mix if you can; otherwise, without knowing more about the workload, it seems fine.

Have you done any scale testing? Spun up a workload of 3100 VMs? (The easiest way to do this would likely be a Heat template…) Then run an image with a default start workload of something you want to test with.

--Karl.

From: Forrest Fuqua <fffics@rit.edu>
Date: Tuesday, 31 October 2023 at 5:40 am
To: openstack-discuss@lists.openstack.org <openstack-discuss@lists.openstack.org>
Subject: [ops] Seeking Advice on Service Deployment and Layout for an 8-Node Cluster

Hi there,

I'm currently exploring some strategies and seeking advice on the deployment and layout of services on an 8-node cluster. Each node is equipped with dual AMD EPYC 7702s, two 100G links, and 2TB of RAM. Additionally, I've set up two 13TB NVMe drives in a Ceph configuration for each node. At present, the setup involves one management node, while the others are dedicated to running Nova, Neutron, Glance, and Cinder.

The workload, while fairly basic, involves handling substantial numbers. It's primarily used for one-off research projects and a few classes, serving around 60 end users. The most demanding scenario occurs during cybersecurity competitions, when we need to deploy 3100 VMs at once. When this is spread out over a few days, it's manageable. However, there's a risk of overloading the system, leading to disk I/O issues and database inconsistencies.

Currently, I'm utilizing Kolla for deployment, and I've shared my configurations in this GitHub repository: https://github.com/RIT-GCI-CyberRange/openstack-configs
I'd greatly appreciate any insights or advice on optimizing this setup to handle the peak workload more efficiently and prevent potential performance issues. Thanks in advance for your help!