[kolla] Persistent `cannot fork child process` when launching any VM in all-in-one deployment
Hi everyone,

I've been debugging an issue for several weeks and have exhausted the obvious possibilities. I'm hoping the collective expertise here can point me in the right direction.

## Environment

- **OpenStack**: Kolla-Ansible 2025.1 (Epoxy), all-in-one deployment
- **OS**: Ubuntu 24.04 on a VM (nested virtualization)
- **OKD target**: Trying to deploy OKD 4.18, but even a simple Cirros test VM fails

## The Problem

Any attempt to create a VM (even `--image cirros --flavor m1.tiny`) fails with:

```
libvirt.libvirtError: cannot fork child process: Resource temporarily unavailable
```

The VM goes straight to `ERROR` state with no useful details in `nova-compute.log` beyond the same error.

## What Works ✅

- OpenStack services are up (`nova-compute`, `neutron`, etc. all healthy)
- Can create networks, subnets, routers, keypairs via CLI
- Direct `qemu-system-x86_64 -enable-kvm` test succeeds (KVM itself works)
- `/dev/kvm` exists with correct permissions
- KVM modules loaded (`kvm_intel`), nested virt enabled (`Y`)

## What We've Checked (All Fine)

| Check | What We Found | Status |
|-------|---------------|--------|
| System PID limit (`pid_max`) | 4,194,304 | ✅ OK |
| Kernel threads max (`threads-max`) | 722,077 | ✅ OK |
| User process limit (`ulimit -u`) | 361,038 | ✅ OK |
| Host thread count (`ps -eLf \| wc -l`) | ~111,000 | ✅ OK |
| `nova_compute` container `pids.max` | 108,000 | ✅ OK |
| `nova_compute` PIDs in container | 27 | ✅ OK |
| `nova_libvirt` container `PidsLimit` | `<nil>` (unlimited) | ✅ OK |
| AppArmor blocking libvirt | No profiles loaded | ✅ OK |
| Host `libvirtd` running | Inactive | ✅ OK |
| Libvirt log volume size | 12KB | ✅ OK |
| RAM available | 88 GB total, plenty free | ✅ OK |
| vCPUs | 32 cores | ✅ OK |

## What We Can't Check (Missing Commands)

Inside the `nova_libvirt` container, the `ulimit` command is missing, so we cannot determine:

- `nproc` limit inside the container
- `nofile` (file descriptor) limit

## What We Haven't Checked

- Kernel parameters like `vm.max_map_count` (default is 65530, could this be an issue?)
- cgroup v2 limits on the `nova_libvirt` container (Ubuntu 24.04 uses cgroup v2)
- `libvirtd` logs inside the container (the log file may not exist or be empty)

## The Ask

Has anyone seen this in a Kolla-Ansible all-in-one deployment where all the obvious limits are large but libvirt still refuses to fork? Could it be:

1. A cgroup v2 limit we missed?
2. A kernel parameter that needs tuning (`vm.max_map_count`)?
3. Something else entirely?

I'm happy to run any additional diagnostics or provide more logs. Any guidance would be hugely appreciated.

Thanks,
Dennis
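The two gaps listed above (the `nproc`/`nofile` limits that `ulimit` would have shown, and the cgroup v2 limits on `nova_libvirt`) can probably be read from the host without any tooling inside the container. What follows is a minimal Python sketch under stated assumptions, not a verified diagnostic for this deployment. It assumes the cgroup v2 unified hierarchy is mounted at `/sys/fs/cgroup`, and that the container's main PID is obtained separately, for example with `docker inspect --format '{{.State.Pid}}' nova_libvirt` if Docker is the container engine. The helper names (`parse_limits`, `cgroup_v2_path`, `pids_limits`, `report`) are hypothetical. The sketch walks every ancestor cgroup because a `pids.max` set higher up the hierarchy also applies to the container.

```python
import re
from pathlib import Path


def parse_limits(text):
    """Parse the text of /proc/<pid>/limits into {limit name: (soft, hard)}."""
    limits = {}
    for line in text.splitlines()[1:]:  # first line is the column header
        cols = re.split(r"\s{2,}", line.strip())  # columns are padded with spaces
        if len(cols) >= 3:
            limits[cols[0]] = (cols[1], cols[2])
    return limits


def cgroup_v2_path(text):
    """Return the cgroup v2 path from /proc/<pid>/cgroup text, or None."""
    for line in text.splitlines():
        if line.startswith("0::"):  # the unified (v2) hierarchy entry
            return line[3:]
    return None


def pids_limits(cgroup_path, root=Path("/sys/fs/cgroup")):
    """List (directory, pids.current, pids.max) for a cgroup and its ancestors."""
    rows = []
    rel = Path(cgroup_path.lstrip("/"))
    for node in [rel, *rel.parents]:
        directory = root / node
        max_file = directory / "pids.max"
        if max_file.exists():  # the root cgroup has no pids.max
            cur_file = directory / "pids.current"
            current = cur_file.read_text().strip() if cur_file.exists() else "?"
            rows.append((str(directory), current, max_file.read_text().strip()))
    return rows


def report(pid):
    """Print the nproc/nofile limits and the pids limits that apply to a PID."""
    proc = Path("/proc") / str(pid)
    limits = parse_limits((proc / "limits").read_text())
    for name in ("Max processes", "Max open files"):
        print(name, limits.get(name))
    cgroup = cgroup_v2_path((proc / "cgroup").read_text())
    if cgroup is not None:
        for row in pids_limits(cgroup):
            print(*row)
```

Calling `report(<pid>)` on the host (likely as root) with the container's main PID prints the soft and hard values for `Max processes` and `Max open files`, followed by one line per cgroup level. A `pids.max` other than `max` at any level, or a `pids.current` close to it, would be worth a closer look, since fork(2) documents `EAGAIN` for the cgroup PIDs controller limit as well as for `RLIMIT_NPROC`, `threads-max`, and `pid_max`.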
I haven't seen this before, but have you tried looking at the qemu logs in the libvirt container?

Michael

Person on the internet. https://madebymikal.com. He / his / him.

On Fri, 13 Mar 2026, 9:01 am Dennis Martin, <dennismartinkariba@gmail.com> wrote:
participants (2)
- Dennis Martin
- Michael Still