On 10-Dec-2024, at 11:59 AM, Abhijit Singh Anand <contact@abhijitanand.com> wrote:
Dear All,
Problem Statement:
Instances remain stuck in the “Block Device Mapping” state for an extended period (~9 minutes) before eventually booting. This issue is observed consistently in one of my OpenStack clusters, while another cluster built on similar hardware and configuration spins up instances much faster.

Environment Details:
Cluster Configuration:
- Deployed using Kolla-Ansible on similar hardware specifications.
- Both clusters connect to a common Ceph SSD-backed storage cluster.
- Connection: 20 Gbps bonded interface over Cisco Nexus switches.
- The same Ceph pools are used for images, VMs, and backups.
- Ceph keyrings and configurations are identical.
Performance Comparison:
- “Good” cluster: Instances (even with large images >20 GB) boot to a login prompt in ~1-2 minutes.
- “Bad” cluster: Instances with the same images and size take ~8-9 minutes.
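For anyone who wants to reproduce the numbers, the timing can be captured with something like this (a minimal openstacksdk sketch, using a volume-backed boot so the block_device_mapping phase is included; the cloud name, UUIDs, and volume size are placeholders, not my real values):

    # Rough boot-timing sketch (openstacksdk). The cloud name, UUIDs and
    # volume size below are placeholders for my environment.
    import time
    import openstack

    conn = openstack.connect(cloud="bad-cluster")  # entry from clouds.yaml

    start = time.monotonic()
    server = conn.compute.create_server(
        name="boot-timing-test",
        flavor_id="FLAVOR_UUID",
        networks=[{"uuid": "NETWORK_UUID"}],
        # Volume-backed boot, which is where the block_device_mapping phase
        # shows up: Nova/Cinder first build a volume from the image.
        block_device_mapping=[{
            "boot_index": 0,
            "uuid": "IMAGE_UUID",          # e.g. the Windows 10 Pro RAW image
            "source_type": "image",
            "destination_type": "volume",
            "volume_size": 50,
            "delete_on_termination": True,
        }],
    )
    server = conn.compute.wait_for_server(server, status="ACTIVE", wait=1200)
    print(f"ACTIVE after {time.monotonic() - start:.0f} s")

This measures time to ACTIVE; the login-prompt figures above additionally include guest boot time.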
Storage Backend Architecture:
- “Good” cluster: Cinder-volume service is co-resident on compute nodes.
- “Bad” cluster: Cinder-volume service runs only on controller nodes.
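The placement can be confirmed on either cluster with something like this (a small sketch that shells out to the openstack CLI and assumes admin credentials are loaded in the environment):

    # Small sketch: list where cinder-volume is running on a cluster.
    import json
    import subprocess

    out = subprocess.run(
        ["openstack", "volume", "service", "list", "-f", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    for svc in json.loads(out):
        if svc.get("Binary") == "cinder-volume":
            print(svc["Host"], svc["Status"], svc["State"])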
Instance Images:
- Performance issue occurs with both RAW and QCOW2 images.
- Example: A Windows 10 Pro RAW image boots in 1 min 40 sec in the "Good" cluster and takes 8 min 45 sec in the "Bad" cluster.
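To help rule raw storage read speed in or out, a quick sequential-read test can be run from a compute node in each cluster against the shared Ceph cluster, along these lines (a rough sketch using the python-rados/python-rbd bindings; the pool name, image name, ceph.conf path, and client name are placeholders for whatever the compute node actually uses):

    # Rough sequential-read throughput check from a compute node against the
    # shared Ceph cluster. Pool/image names, the ceph.conf path and the client
    # name are placeholders for my environment.
    import time
    import rados
    import rbd

    CONF = "/etc/ceph/ceph.conf"   # whatever conf the compute node uses
    CLIENT = "client.admin"        # or the cinder/nova client, as appropriate
    POOL = "volumes"               # placeholder pool name
    IMAGE = "volume-xxxxxxxx"      # placeholder RBD image name

    cluster = rados.Rados(conffile=CONF, name=CLIENT)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(POOL)
        try:
            with rbd.Image(ioctx, IMAGE, read_only=True) as img:
                chunk = 4 * 1024 * 1024                  # 4 MiB per read
                total = min(img.size(), 2 * 1024 ** 3)   # read at most 2 GiB
                start = time.monotonic()
                offset = 0
                while offset < total:
                    img.read(offset, min(chunk, total - offset))
                    offset += chunk
                elapsed = time.monotonic() - start
            print(f"~{total / elapsed / 1024 ** 2:.0f} MiB/s sequential read")
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()

Run against the same test image from both clusters' compute nodes, this should show whether the Ceph data path itself differs between them.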
What I’ve Tried So Far:
- Configuration Matching: Reviewed and aligned as many configuration parameters as possible between the two clusters.
- Resource Monitoring: Verified CPU and memory usage on nodes during instance launches—no significant resource contention observed.
- Ceph Performance: Checked Ceph cluster performance; it appears healthy, with no obvious latency or throughput issues (the rough check is sketched after this list).
- Network Testing: Verified 20 Gbps bond interfaces on both clusters; no observed bottlenecks.
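For context, the kind of health check referred to above looks roughly like this with the python-rados bindings (the conf path and client name are assumptions for my setup; per-OSD latency is easiest to eyeball with the ceph osd perf CLI):

    # Minimal sketch of the Ceph health checks, via python-rados mon_command.
    # The conffile path and client name are assumptions for my setup.
    import json
    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf", name="client.admin")
    cluster.connect()
    try:
        for prefix in ("status", "health", "df"):
            ret, outbuf, errs = cluster.mon_command(
                json.dumps({"prefix": prefix, "format": "json"}), b""
            )
            if ret != 0:
                print(f"{prefix!r} failed: {errs}")
                continue
            # Dump a trimmed view of each report.
            print(prefix, "->", json.dumps(json.loads(outbuf), indent=2)[:400])
    finally:
        cluster.shutdown()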
Hypothesis:
I suspect the cinder-volume architecture could be contributing to the issue. In the "Good" cluster, cinder-volume is co-located with the compute nodes, possibly optimizing data transfer paths. In contrast, the "Bad" cluster relies solely on controller-hosted cinder-volume services, which might introduce delays in data availability to the compute nodes.

Request for Community Input:
- Does the difference in the placement of the cinder-volume service (co-resident vs. controller-only) align with the observed performance discrepancy?
- Could there be other subtle configuration or architectural nuances that might explain the delays in the "Bad" cluster?
- What specific logs or metrics would be most insightful in diagnosing this further? I have reviewed the Nova and Cinder logs but can dig deeper if pointed in the right direction (a sketch of the per-phase timing data I can pull follows this list).
- Are there recommended best practices for tuning Cinder, Nova, or Ceph for environments with high storage performance requirements?
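To make the deeper dive concrete, this is the kind of per-phase timing I can pull from Nova's instance actions and share, alongside the nova-compute and cinder-volume logs under /var/log/kolla/ on the relevant nodes (a sketch that shells out to the openstack CLI; the server UUID is a placeholder):

    # Sketch: dump per-phase event timings for a slow boot from Nova's
    # instance actions. Shells out to the openstack CLI; SERVER_ID is a
    # placeholder for the instance that took ~9 minutes.
    import json
    import subprocess

    SERVER_ID = "INSTANCE_UUID"  # placeholder

    def cli(*args):
        out = subprocess.run(
            ["openstack", *args, "-f", "json"],
            check=True, capture_output=True, text=True,
        ).stdout
        return json.loads(out)

    for action in cli("server", "event", "list", SERVER_ID):
        # Each action's detail lists events with start/finish timestamps,
        # which should show exactly where the 8-9 minutes are spent.
        detail = cli("server", "event", "show", SERVER_ID, action["Request ID"])
        print(json.dumps(detail, indent=2))

The request IDs in that output should also make it easy to grep the matching entries out of the Nova and Cinder logs.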
Thank You!
I appreciate any insights, suggestions, or pointers you can provide to help narrow down the root cause of this frustrating performance discrepancy.
Regards,
Abhijit S Anand