Hi Forrest,

 

One of my first recommendations would be to utilise volume-backed images.

https://docs.openstack.org/cinder/latest/admin/volume-backed-image.html

 

If you have, say, a set of Linux and Windows images, then instead of copying the image into a new RBD volume for each instance, Ceph will clone a snapshot of the base image's RBD, and provisioning can complete in seconds.
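For reference, the relevant configuration (a sketch based on the linked Cinder admin guide; verify the exact option names against the doc for your release) looks roughly like:

```ini
# cinder.conf -- on the cinder-volume hosts (sketch; option names
# per the volume-backed-image admin guide, check for your release)
[DEFAULT]
allowed_direct_url_schemes = cinder
image_upload_use_cinder_backend = True

# glance-api.conf -- expose image locations so Cinder can clone the
# backing RBD directly rather than downloading the image bytes
[DEFAULT]
show_multiple_locations = True
```

With these set, `openstack image create --volume <volume> <name>` produces a volume-backed image, and new volumes from that image become fast COW clones.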

 

I would also suggest (and this is completely biased because I’m on the core team) looking at OpenStack-Helm – we run this at scale with many thousands of VMs.

 

Thanks,
Karl.

 

From: Forrest Fuqua <fffics@rit.edu>
Date: Wednesday, 1 November 2023 at 2:15 am
To: Karl Kloppenborg <kkloppenborg@resetdata.com.au>, openstack-discuss@lists.openstack.org <openstack-discuss@lists.openstack.org>
Subject: Re: [ops] Seeking Advice on Service Deployment and Layout for an 8-Node Cluster

In response to Karl's request, here is last year's major load I ran: http://mirrors.rit.edu/cptc/2023/virtual-machines/1-README.txt

 

The TL;DR: it's a mix of Windows and Linux VMs at about 4GB of RAM and 4 vCPUs each, at about 50 machines per team, with up to 50 teams at a time. With all the extra support VMs at various workloads, we run about 2500 VMs just for the yearly competitions.

All of these VMs will be under high load, since they will be under constant cyberattack by students.

The biggest issue I've had is that Ceph starts to slow down on RBD creation of a COW image and Nova times out the creation. If it gets too bad, entries like volumes and VMs can get stuck in weird error states such as error_deleting (I've even had volumes stuck attached to VMs that don't exist!).

 

At peak, Netdata was showing about 500k IOPS across the whole cluster before things really degraded and went into a death spiral. A single Windows VM took 4 hours to boot, and Ceph started showing slow writes with up to 40s commit latency.

When I was running multi-node management, MariaDB and RabbitMQ would get very unhappy and start dropping messages and hitting database issues; having a single dedicated management node has worked well so far. But I am thinking of moving away from Kolla and going to Ansible during my summer rebuild.

 

 


From: Karl Kloppenborg <kkloppenborg@resetdata.com.au>
Sent: Monday, October 30, 2023 6:44 PM
To: Forrest Fuqua <fffics@rit.edu>; openstack-discuss@lists.openstack.org <openstack-discuss@lists.openstack.org>
Subject: Re: [ops] Seeking Advice on Service Deployment and Layout for an 8-Node Cluster

 

Hi Forrest,

 

Based on the configuration specified, I will assume that 1.8TB per node is assignable to VM resources, excluding the management node (though honestly, why not just spread management over the cluster? It doesn’t seem like a production workload).

That gives (1.8TB * 7) * 1000 = 12600GB of usable RAM with no oversubscription (roughly 180–200GB per system reserved for the host and Ceph services).

Therefore 12600GB / 3100 VMs = ~4GB of RAM per VM at peak load, with no oversubscription.
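To sanity-check that arithmetic, a quick Python sketch (the ~200GB per-host reservation and 7-node count are assumptions from this thread, not measured values):

```python
# RAM capacity sketch for the 8-node cluster in this thread.
# Assumptions: 2TB RAM per node, ~200GB reserved per host for the OS
# and Ceph services, 7 compute nodes, 3100 VMs at peak.
NODES = 7
RAM_PER_NODE_GB = 2000
HOST_RESERVED_GB = 200
PEAK_VMS = 3100

usable_per_node = RAM_PER_NODE_GB - HOST_RESERVED_GB  # 1800 GB
total_usable_gb = usable_per_node * NODES             # 12600 GB
ram_per_vm_gb = total_usable_gb / PEAK_VMS            # ~4.06 GB

print(total_usable_gb, round(ram_per_vm_gb, 2))
```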

Based on the CPU arch:

128 cores * 2 threads = 256 threads per node, which in this model I will count as vCPU cores (there’s a LOT of contention about what defines a vCPU; there are many camps and religious arguments on this)…

However, we want some headroom for the host, so as a general rule I will use 250 available vCPUs per node in this calculation.

250 * 7 = 1750 vCPU cores available.

That leaves a deficit of 3100 - 1750 = 1350 vCPU cores.

As such, to give each VM 1 vCPU core at peak workload, the oversubscription ratio would need to be about 1.8:

1750*1.8 = 3150 vCPU available.
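The same check in Python (again assuming ~6 threads of host headroom per node and 1 vCPU per VM at peak, as in the estimate above):

```python
# vCPU oversubscription sketch for this thread.
# Assumptions: dual EPYC 7702 = 128 cores / 256 threads per node,
# ~6 threads held back for the host, 7 compute nodes, 1 vCPU per VM.
NODES = 7
VCPUS_PER_NODE = 250          # 256 threads less host headroom
PEAK_VCPUS = 3100             # 3100 VMs x 1 vCPU

total_vcpus = VCPUS_PER_NODE * NODES   # 1750
deficit = PEAK_VCPUS - total_vcpus     # 1350
ratio = PEAK_VCPUS / total_vcpus       # ~1.77, round up to 1.8

print(total_vcpus, deficit, round(ratio, 2))
# a cpu_allocation_ratio of 1.8 then yields 1750 * 1.8 = 3150 vCPUs
```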

 

Honestly, depending on these workloads and your requirements, you can probably push this subscription ratio up further. Again, I would instead share the management workloads across the cluster and induct the first node as another compute resource.

 

I would like to know more about the workloads. What is the flavour of each VM? Is the workload disk-, CPU-, memory- or IO-intensive?

In terms of running the workload, I don’t see an obvious issue with your nova config; the allocation ratio is high, but that’s probably not going to be an issue here based on the numbers I mentioned.

Ensuring host memory reservation is good: I can see you have around 128GB in reservation, and a ram_allocation_ratio of 1.0 will ensure no OOM scenarios.
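For completeness, the corresponding nova.conf settings (a sketch using the numbers discussed in this thread; section placement of the allocation-ratio options varies by release, so check the Nova configuration reference for yours):

```ini
# nova.conf on each compute node (sketch based on this thread's numbers)
[DEFAULT]
reserved_host_memory_mb = 131072   # ~128GB held back for host + Ceph
ram_allocation_ratio = 1.0         # no RAM oversubscription -> no OOM
cpu_allocation_ratio = 1.8         # per the vCPU estimate above
```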

 

Overall, I think your settings are sane. My only comment would be to add that management node into the compute mix if you can; otherwise, without knowing more about the workload, it seems fine.

 

Have you done any scale testing? Spun up a workload of 3100 VMs? (The easiest way to do this would likely be a Heat template…) Then run an image with a default start workload of whatever you want to test with.
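As a sketch of that Heat approach, a minimal template using an OS::Heat::ResourceGroup (the image, flavour and network names here are placeholders, not values from this thread):

```yaml
heat_template_version: 2018-08-31

parameters:
  vm_count:
    type: number
    default: 100   # scale up in batches rather than 3100 at once

resources:
  scale_test:
    type: OS::Heat::ResourceGroup
    properties:
      count: { get_param: vm_count }
      resource_def:
        type: OS::Nova::Server
        properties:
          name: scale-test-%index%
          image: cirros          # placeholder image
          flavor: m1.small       # placeholder flavour
          networks:
            - network: private   # placeholder network
```

Launched with something like `openstack stack create -t scale-test.yaml --parameter vm_count=500 scale-test`, then ramped up while watching Ceph and the control plane.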

 

--Karl.

 

 

 

From: Forrest Fuqua <fffics@rit.edu>
Date: Tuesday, 31 October 2023 at 5:40 am
To: openstack-discuss@lists.openstack.org <openstack-discuss@lists.openstack.org>
Subject: [ops] Seeking Advice on Service Deployment and Layout for an 8-Node Cluster

Hi there,

I'm currently exploring some strategies and seeking advice on the deployment and layout of services on an 8-node cluster. Each node is equipped with dual AMD EPYC 7702s, two 100G links, and 2TB of RAM. Additionally, I've set up two 13TB NVMe drives in a Ceph configuration on each node.

 

At present, the setup involves one management node, while the others are dedicated to running Nova, Neutron, Glance, and Cinder. The workload, while fairly basic, involves handling substantial numbers. It's primarily used for one-off research projects and a few classes, serving around 60 end users.

 

The most demanding scenario occurs during cybersecurity competitions, when we need to deploy 3100 VMs at once. When this is spread out over a few days, it's manageable. However, there's a risk of overloading the system, leading to disk I/O issues and database inconsistencies.

 

Currently, I'm utilizing Kolla for deployment, and I've shared my configurations on this GitHub repository: Link to Configs.

 

I'd greatly appreciate any insights or advice on optimizing this setup to handle the peak workload more efficiently and prevent potential performance issues.

 

Thanks in advance for your help!