Diagnosing Sluggish Instance Boot Performance in OpenStack Cluster

9 Dec 2024

      Dear All,

Problem Statement:
Instances remain stuck in the “Block Device Mapping” state for an extended period (~9 minutes) before eventually booting. This issue is observed consistently in one of my OpenStack clusters, while another cluster built on similar hardware and configuration spins up instances much faster.

Environment Details:

  1.  Cluster Configuration:

     *   Deployed using Kolla-Ansible on similar hardware specifications.
     *   Both clusters connect to a common Ceph SSD-backed storage cluster.
        *   Connection: 20 Gbps bonded interface over Cisco Nexus switches.
        *   Same volumes used for images, VMs, backups.
        *   Ceph keyrings and configurations are identical.
  2.  Performance Comparison:

     *   “Good” cluster: Instances (even with large images >20 GB) boot to a login prompt in ~1-2 minutes.
     *   “Bad” cluster: Instances with the same images and size take ~8-9 minutes.
  3.  Storage Backend Architecture:

     *   “Good” cluster: Cinder-volume service is co-resident on compute nodes.
     *   “Bad” cluster: Cinder-volume service runs only on controller nodes.
  4.  Instance Images:

     *   Performance issue occurs with both RAW and QCOW2 images.
     *   Example: A Windows 10 Pro RAW image boots in 1 min 40 sec in the "Good" cluster and takes 8 min 45 sec in the "Bad" cluster.

What I’ve Tried So Far:

  *   Configuration Matching: Reviewed and aligned as many configuration parameters as possible between the two clusters.
  *   Resource Monitoring: Verified CPU and memory usage on nodes during instance launches—no significant resource contention observed.
  *   Ceph Performance: Checked Ceph cluster performance; it seems healthy with no latency or throughput issues.
  *   Network Testing: Verified 20 Gbps bond interfaces on both clusters; no observed bottlenecks.

Hypothesis:
I suspect the cinder-volume architecture could be contributing to the issue. In the "Good" cluster, cinder-volume is co-located with compute nodes, possibly optimizing data transfer paths. In contrast, the "Bad" cluster relies solely on controller-hosted cinder-volume services, which might introduce delays in data availability to compute nodes.

Request for Community Input:

  1.  Does the difference in the placement of the cinder-volume service (co-resident vs. controller-only) align with the observed performance discrepancy?
  2.  Could there be other subtle configuration or architectural nuances that might explain the delays in the "Bad" cluster?
  3.  What specific logs or metrics would be most insightful in diagnosing this issue further?
     *   I’ve reviewed Nova and Cinder logs but could dive deeper if pointed in the right direction.
  4.  Are there recommended best practices for tuning Cinder, Nova, or Ceph for environments with high storage performance requirements?

Thank You!
I appreciate any insights, suggestions, or pointers you can provide to help narrow down the root cause of this frustrating performance discrepancy.

Regards,

Abhijit S Anand

Abhijit Singh Anand

Abhijit Singh Anand

Eugen Block

tags

participants (2)