Dear All,

Problem Statement:
Instances remain stuck in the “Block Device Mapping” state for an extended period (~9 minutes) before eventually booting. This issue is observed consistently in one of my OpenStack clusters, while another cluster built on similar hardware and configuration spins up instances much faster.
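For reference, a rough sketch of how the per-task-state timing can be captured with openstacksdk is below (the cloud name and instance name are placeholders; it simply polls the booting instance and reports how long it sits in each Nova task_state, e.g. block_device_mapping vs. spawning):

import time
import openstack

conn = openstack.connect(cloud="bad-cluster")            # placeholder clouds.yaml entry
server = conn.compute.find_server("test-instance-01")    # placeholder instance name

last_state, last_change = None, time.monotonic()
while True:
    server = conn.compute.get_server(server.id)
    state = server.task_state  # e.g. "block_device_mapping", "spawning", None when finished
    if state != last_state:
        now = time.monotonic()
        if last_state:
            print(f"{last_state}: {now - last_change:.0f}s")
        last_state, last_change = state, now
    if server.status in ("ACTIVE", "ERROR"):
        break
    time.sleep(5)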

Environment Details:

  1. Cluster Configuration:

  2. Performance Comparison:

  3. Storage Backend Architecture:

  4. Instance Images:

What I’ve Tried So Far:

Hypothesis:
I suspect the cinder-volume architecture could be contributing to the issue. In the "Good" cluster, cinder-volume is co-located with compute nodes, possibly optimizing data transfer paths. In contrast, the "Bad" cluster relies solely on controller-hosted cinder-volume services, which might introduce delays in data availability to compute nodes.
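One way to test this independently of Nova would be to time a standalone image-to-volume conversion in each cluster: if the "Bad" cluster is slow here as well, the bottleneck sits on the Cinder/Ceph side rather than in Nova's block-device-mapping step itself. A rough sketch using openstacksdk (the cloud, image, and volume names are placeholders, and the 10 GB size is arbitrary):

import time
import openstack

conn = openstack.connect(cloud="bad-cluster")    # placeholder clouds.yaml entry
image = conn.image.find_image("cirros-test")     # placeholder test image

start = time.monotonic()
vol = conn.block_storage.create_volume(size=10, image_id=image.id, name="bdm-timing-test")
conn.block_storage.wait_for_status(vol, status="available",
                                   failures=["error"], interval=5, wait=1200)
print(f"Image-to-volume conversion took {time.monotonic() - start:.0f}s")

conn.block_storage.delete_volume(vol)            # clean up the test volume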

Request for Community Input:

  1. Could the difference in placement of the cinder-volume service (co-located with compute nodes vs. controller-only) explain the observed performance discrepancy?
  2. Could there be other subtle configuration or architectural nuances that might explain the delays in the "Bad" cluster?
  3. What specific logs or metrics would be most insightful in diagnosing this issue further?
  4. Are there recommended best practices for tuning Cinder, Nova, or Ceph for environments with high storage performance requirements?

Thank You!
I appreciate any insights, suggestions, or pointers you can provide to help narrow down the root cause of this frustrating performance discrepancy.


Regards,

Abhijit S Anand