[kolla] Persistent `cannot fork child process` when launching any VM in all-in-one deployment

12 Mar 2026

Hi everyone,

I've been debugging an issue for several weeks and have exhausted the
obvious possibilities. I'm hoping the collective expertise here can point
me in the right direction.

## Environment
- **OpenStack**: Kolla-Ansible 2025.1 (Epoxy), all-in-one deployment
- **OS**: Ubuntu 24.04 on a VM (nested virtualization)
- **OKD target**: Trying to deploy OKD 4.18, but even a simple CirrOS test VM fails

## The Problem
Any attempt to create a VM (even `--image cirros --flavor m1.tiny`) fails
with:
```
libvirt.libvirtError: cannot fork child process: Resource temporarily unavailable
```

The VM goes straight to `ERROR` state with no useful details in
`nova-compute.log` beyond the same error.
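
For reference, "Resource temporarily unavailable" is the standard message for `EAGAIN`. The `fork(2)` man page lists that error for process and thread limits: `RLIMIT_NPROC`, `threads-max`, `pid_max`, and the cgroup `pids.max`. A minimal sketch to confirm that a logged message is the `EAGAIN` text; the helper name `is_eagain_message` is made up for illustration:

```python
import errno
import os


def is_eagain_message(message: str) -> bool:
    """Return True if the log message contains the C library's EAGAIN text."""
    # Collapse whitespace so a message wrapped across lines still matches.
    flattened = " ".join(message.split())
    return os.strerror(errno.EAGAIN) in flattened


logged = (
    "libvirt.libvirtError: cannot fork child process: "
    "Resource temporarily unavailable"
)
print(errno.errorcode[errno.EAGAIN], is_eagain_message(logged))
```

If it matches, the failure is most likely one of the limits above being hit somewhere, rather than a KVM or image problem.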

## What Works ✅
- OpenStack services are up (`nova-compute`, `neutron`, etc. all healthy)
- Can create networks, subnets, routers, keypairs via CLI
- Direct `qemu-system-x86_64 -enable-kvm` test succeeds (KVM itself works)
- `/dev/kvm` exists with correct permissions
- KVM modules loaded (`kvm_intel`), nested virt enabled (`Y`)

## What We've Checked (All Fine)

| Check | What We Found | Status |
|-------|---------------|--------|
| System PID limit (`pid_max`) | 4,194,304 | ✅ OK |
| Kernel threads max (`threads-max`) | 722,077 | ✅ OK |
| User process limit (`ulimit -u`) | 361,038 | ✅ OK |
| Host thread count (`ps -eLf \| wc -l`) | ~111,000 | ✅ OK |
| `nova_compute` container `pids.max` | 108,000 | ✅ OK |
| `nova_compute` PIDs in container | 27 | ✅ OK |
| `nova_libvirt` container `PidsLimit` | `<nil>` (unlimited) | ✅ OK |
| AppArmor blocking libvirt | No profiles loaded | ✅ OK |
| Host `libvirtd` service state | Inactive | ✅ OK |
| Libvirt log volume size | 12 KB | ✅ OK |
| RAM available | 88 GB total, plenty free | ✅ OK |
| vCPUs | 32 cores | ✅ OK |

## What We Can't Check (Missing Commands)
Inside the `nova_libvirt` container, the `ulimit` command is missing, so we
cannot determine:
- `nproc` limit inside the container
- `nofile` (file descriptor) limit
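
A possible way around this: `ulimit` is a shell builtin rather than a standalone binary, but the kernel exposes the same values per process in `/proc/<pid>/limits`. Container processes are visible in the host's `/proc`, so the file can be read from the host as root. A minimal sketch of a parser follows; `parse_limits` and `read_limits` are illustrative names, and the PID of `libvirtd` has to be looked up separately (for example with `pgrep`).

```python
import re
from pathlib import Path


def parse_limits(text: str) -> dict[str, tuple[str, str]]:
    """Parse the contents of /proc/<pid>/limits into {name: (soft, hard)}."""
    limits = {}
    for line in text.splitlines()[1:]:  # skip the header row
        # Columns are padded with runs of spaces; names contain single spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) >= 3:
            limits[fields[0]] = (fields[1], fields[2])
    return limits


def read_limits(pid: int) -> dict[str, tuple[str, str]]:
    """Read and parse the limits of a running process (Linux only)."""
    return parse_limits(Path(f"/proc/{pid}/limits").read_text())
```

In the result, `"Max processes"` corresponds to `nproc` and `"Max open files"` to `nofile`. For example, `read_limits(pid)["Max processes"]` returns the soft and hard values as strings, either a number or `unlimited`.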

## What We Haven't Checked
- Kernel parameters like `vm.max_map_count` (default is 65530, could this
be an issue?)
- cgroup v2 limits on the `nova_libvirt` container (Ubuntu 24.04 uses
cgroup v2)
- `libvirtd` logs inside the container (the log file may not exist or be
empty)
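
For the cgroup v2 item, a minimal sketch of one way to look for a limit that was missed. With cgroup v2, a `pids.max` set on any ancestor cgroup also applies to the container, so the sketch walks from the process's cgroup up to the mount point and lists `pids.current` and `pids.max` at each level. It assumes the usual mount point `/sys/fs/cgroup` and is meant to be run on the host; the function names are made up for illustration.

```python
from pathlib import Path

CGROUP_ROOT = Path("/sys/fs/cgroup")  # assumed cgroup v2 mount point


def cgroup_dir_for_pid(pid: int, root: Path = CGROUP_ROOT) -> Path:
    """Map a PID to its cgroup v2 directory via /proc/<pid>/cgroup."""
    for line in Path(f"/proc/{pid}/cgroup").read_text().splitlines():
        # The cgroup v2 entry has the form "0::/<path>".
        if line.startswith("0::"):
            return root / line[3:].lstrip("/")
    raise RuntimeError(f"no cgroup v2 entry found for PID {pid}")


def pids_limits_up_the_tree(cgroup_dir: Path, root: Path = CGROUP_ROOT):
    """Return [(dir, pids.current, pids.max)] from cgroup_dir up to root."""
    rows = []
    current = cgroup_dir
    while True:
        max_file = current / "pids.max"
        cur_file = current / "pids.current"
        # The root cgroup has no pids.max, so skip levels without the files.
        if max_file.exists() and cur_file.exists():
            rows.append(
                (str(current), cur_file.read_text().strip(), max_file.read_text().strip())
            )
        if current == root or current == current.parent:
            break
        current = current.parent
    return rows
```

Calling `pids_limits_up_the_tree(cgroup_dir_for_pid(pid))` with the PID of `libvirtd` would show whether any level has `pids.current` close to a numeric `pids.max`. A value of `max` means no limit at that level.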

## The Ask
Has anyone seen this in a Kolla-Ansible all-in-one deployment where all the
obvious limits are large but libvirt still refuses to fork? Could it be:
1. A cgroup v2 limit we missed?
2. A kernel parameter that needs tuning (`vm.max_map_count`)?
3. Something else entirely?

I'm happy to run any additional diagnostics or provide more logs. Any
guidance would be hugely appreciated.

Thanks,
Dennis

Dennis Martin
