Diagnosing Sluggish Instance Boot Performance in OpenStack Cluster
Dear All,

Problem Statement: Instances remain stuck in the “Block Device Mapping” state for an extended period (~9 minutes) before eventually booting. This issue is observed consistently in one of my OpenStack clusters, while another cluster built on similar hardware and configuration spins up instances much faster.

Environment Details:

1. Cluster Configuration:
   * Deployed using Kolla-Ansible on similar hardware specifications.
   * Both clusters connect to a common Ceph SSD-backed storage cluster.
   * Connection: 20 Gbps bonded interfaces over Cisco Nexus switches.
   * The same pools are used for images, VMs, and backups.
   * Ceph keyrings and configurations are identical.

2. Performance Comparison:
   * “Good” cluster: Instances (even with large images >20 GB) boot to a login prompt in ~1-2 minutes.
   * “Bad” cluster: Instances with the same images and sizes take ~8-9 minutes.

3. Storage Backend Architecture:
   * “Good” cluster: The cinder-volume service is co-resident on the compute nodes.
   * “Bad” cluster: The cinder-volume service runs only on the controller nodes.

4. Instance Images:
   * The performance issue occurs with both RAW and QCOW2 images.
   * Example: A Windows 10 Pro RAW image boots in 1 min 40 sec in the "Good" cluster and takes 8 min 45 sec in the "Bad" cluster.

What I’ve Tried So Far:

* Configuration Matching: Reviewed and aligned as many configuration parameters as possible between the two clusters.
* Resource Monitoring: Verified CPU and memory usage on the nodes during instance launches; no significant resource contention observed.
* Ceph Performance: Checked Ceph cluster performance; it appears healthy, with no latency or throughput issues.
* Network Testing: Verified the 20 Gbps bond interfaces on both clusters; no bottlenecks observed.

Hypothesis: I suspect the cinder-volume architecture could be contributing to the issue. In the "Good" cluster, cinder-volume is co-located with the compute nodes, possibly optimizing data transfer paths. In contrast, the "Bad" cluster relies solely on controller-hosted cinder-volume services, which might introduce delays in data availability to the compute nodes.

Request for Community Input:

1. Does the difference in the placement of the cinder-volume service (co-resident vs. controller-only) align with the observed performance discrepancy?
2. Could there be other subtle configuration or architectural nuances that might explain the delays in the "Bad" cluster?
3. What specific logs or metrics would be most insightful in diagnosing this further? I’ve reviewed the Nova and Cinder logs but could dig deeper if pointed in the right direction.
4. Are there recommended best practices for tuning Cinder, Nova, or Ceph for environments with high storage performance requirements?

Thank you! I appreciate any insights, suggestions, or pointers that help narrow down the root cause of this frustrating performance discrepancy.

Regards,
Abhijit S Anand
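A useful first isolation step for a delay in the “Block Device Mapping” phase is to time the volume-from-image creation on its own, outside of Nova, on both clusters. Below is a minimal sketch using the standard openstack CLI; the image ID, volume size, and volume name are placeholders to adjust for your environment:

    # Create a bootable volume from the same image on each cluster
    $ openstack volume create --image <image-id> --size 50 bdm-timing-test

    # The create call returns immediately; what matters is how long the
    # volume stays in "creating"/"downloading" before becoming "available"
    $ watch -n 5 "openstack volume show bdm-timing-test -c status -c created_at"

    # Clean up afterwards
    $ openstack volume delete bdm-timing-test

If the Ceph backend can clone the image directly (raw image, correct client caps), the volume typically becomes available within seconds regardless of image size; several minutes spent in "downloading" suggests the image is being fetched over the network and converted instead, which would match the delay described above.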
Dear All,

Could I get some help here please? Thank you.

Regards,
Abhijit S Anand
Hi,

I don't think the colocation of cinder-volume and compute can have this impact if Ceph is the backend. If you were using local LVM storage on the compute nodes, that might make a difference (I've never used that setup), but I wouldn't expect it in your environment.

Despite your statement about the keyrings, I would still double-check the permissions. Just yesterday Tony posted that his volume creation took longer because the Glance image was downloaded first, due to missing auth caps for the cinder user on the glance pool.

Then I would enable debug logs for nova-compute (it should be sufficient to do that on one compute node in the "bad" cluster and launch an instance there); they should show you the exact steps and how long each one took.

Do you see new files in /var/lib/nova/_base? If your image is raw but its format is set to qcow2 within Glance, nova will still download it to local compute storage first and then upload it back to Ceph as a flattened image. In that case, though, you would see improved launch times on a second attempt on the same compute node, because the converted image would already be present and nova would only have to re-upload it to Ceph.

If you boot your instances from volume, do you see temporary files in /var/lib/cinder/conversion? That would also point to a download/convert/upload process during instance creation.

So I would watch those directories during instance creation and enable nova debug logs on a compute node to better understand where all that time goes.

Regards,
Eugen
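The checks above translate roughly into the following commands. This is a sketch that assumes the client and pool names commonly used in Kolla-Ansible/Ceph deployments (client.cinder, client.glance, an "images" pool) and uses <image-id> as a placeholder; the cache path may be /var/lib/nova/_base or /var/lib/nova/instances/_base depending on the instances_path setting, and in Kolla these directories live inside the containers' Docker volumes:

    # 1) Verify the caps actually granted to the Ceph clients; the cinder
    #    user needs read access to the images pool for volumes to be cloned
    #    from Glance images rather than downloaded.
    $ ceph auth get client.cinder
    $ ceph auth get client.glance

    # 2) Compare the image's declared format with its actual format.
    #    A raw image registered as qcow2 (or vice versa) forces the
    #    download-and-convert path described above.
    $ openstack image show <image-id> -f value -c disk_format
    $ rbd -p images info <image-id>

    # 3) Enable debug logging for nova-compute on one "bad" compute node,
    #    then restart the service/container:
    #      [DEFAULT]
    #      debug = True

    # 4) While launching a test instance, watch for temporary files in the
    #    image cache and conversion directories:
    $ watch -n 5 'ls -lh /var/lib/nova/instances/_base /var/lib/cinder/conversion 2>/dev/null'

Missing caps in step 1 or a format mismatch in step 2 would explain a download/convert/re-upload path; re-uploading the image as raw with the correct disk_format set usually restores the fast RBD clone behaviour.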
participants (2)
- Abhijit Singh Anand
- Eugen Block