Hi everyone,
I've been debugging an issue for several weeks and have exhausted the obvious possibilities. I'm hoping the collective expertise here can point me in the right direction.
## Environment
- **OpenStack**: Kolla-Ansible 2025.1 (Epoxy), all-in-one deployment
- **OS**: Ubuntu 24.04 on a VM (nested virtualization)
- **OKD target**: The end goal is to deploy OKD 4.18, but even a simple CirrOS test VM fails
## The Problem
Any attempt to create a VM (even `--image cirros --flavor m1.tiny`) fails with:
```
libvirt.libvirtError: cannot fork child process: Resource temporarily unavailable
```
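The VM goes straight to `ERROR` state, and `nova-compute.log` shows nothing beyond this same error. "Resource temporarily unavailable" is the message for `EAGAIN`, which `fork()` returns when a process, thread, or PID limit is reached, so the checks below focus on those limits.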
## What Works ✅
- OpenStack services are up (`nova-compute`, `neutron`, etc. all healthy)
- Can create networks, subnets, routers, keypairs via CLI
- Direct `qemu-system-x86_64 -enable-kvm` test succeeds (KVM itself works)
- `/dev/kvm` exists with correct permissions
- KVM modules loaded (`kvm_intel`), nested virt enabled (`Y`)
## What We've Checked (All Fine)
| Check | What We Found | Status |
|-------|---------------|--------|
| System PID limit (`pid_max`) | 4,194,304 | ✅ OK |
| Kernel threads max (`threads-max`) | 722,077 | ✅ OK |
| User process limit (`ulimit -u`) | 361,038 | ✅ OK |
| Host thread count (`ps -eLf \| wc -l`) | ~111,000 | ✅ OK |
| `nova_compute` container `pids.max` | 108,000 | ✅ OK |
| `nova_compute` PIDs in container | 27 | ✅ OK |
| `nova_libvirt` container `PidsLimit` | `<nil>` (unlimited) | ✅ OK |
| AppArmor blocking libvirt | No profiles loaded | ✅ OK |
| Host `libvirtd` service | Inactive (so it cannot conflict with the containerized daemon) | ✅ OK |
| Libvirt log volume size | 12 KB | ✅ OK |
| RAM available | 88 GB total, plenty free | ✅ OK |
| vCPUs | 32 cores | ✅ OK |
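
One thing the table does not show is how the ~111,000 host threads are distributed across users. As far as I understand, `RLIMIT_NPROC` (the `ulimit -u` value) is counted per real UID rather than per container, so a per-UID breakdown may be more telling than the host-wide total. Below is a minimal Python sketch that sums the `Threads:` field of every `/proc/<pid>/status` by UID. The helper names `uid_and_threads` and `threads_per_uid` are my own, and it assumes it runs as root on the host so that all processes are visible.

```python
import os
from collections import Counter


def uid_and_threads(status_text):
    """Extract (real UID, thread count) from /proc/<pid>/status content."""
    uid = threads = None
    for line in status_text.splitlines():
        if line.startswith("Uid:"):
            uid = int(line.split()[1])  # first value after "Uid:" is the real UID
        elif line.startswith("Threads:"):
            threads = int(line.split()[1])
    return uid, threads


def threads_per_uid(proc_root="/proc"):
    """Sum thread counts per real UID over all numeric entries in proc_root."""
    totals = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "sys"
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                uid, threads = uid_and_threads(f.read())
        except OSError:
            continue  # the process exited while we were scanning
        if uid is not None and threads is not None:
            totals[uid] += threads
    return totals


if __name__ == "__main__" and os.path.isdir("/proc/self"):
    # Print the five UIDs that own the most threads.
    for uid, count in threads_per_uid().most_common(5):
        print(f"uid={uid} threads={count}")
```

If one UID turns out to own most of the threads, that count can be compared against the `Max processes` limit of the process that fails to fork.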
## What We Can't Check (Missing Commands)
Inside the `nova_libvirt` container, `ulimit` is not available as a command (it is a shell builtin, not a standalone binary), so we have not been able to determine:
- `nproc` limit inside the container
- `nofile` (file descriptor) limit
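
Both values can also be read without `ulimit`: the kernel exposes them in `/proc/<pid>/limits`, which is readable from the host for a containerized process. Here is a minimal Python sketch, assuming the PID of the libvirt daemon inside `nova_libvirt` has already been found on the host (for example with `pgrep`). `parse_limits` is a hypothetical helper name, not part of any library.

```python
import re
import sys

# One row of /proc/<pid>/limits: name, soft limit, hard limit, optional units.
ROW = re.compile(r"^(.*?)\s+(unlimited|\d+)\s+(unlimited|\d+)(?:\s+(\S+))?\s*$")


def parse_limits(text):
    """Map limit name -> (soft, hard, units) from /proc/<pid>/limits content."""
    limits = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:  # the header row has no numeric columns, so it is skipped
            name, soft, hard, units = match.groups()
            limits[name] = (soft, hard, units or "")
    return limits


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python3 limits.py <pid-of-libvirt-daemon-as-seen-on-the-host>
    with open(f"/proc/{sys.argv[1]}/limits") as f:
        parsed = parse_limits(f.read())
    for name in ("Max processes", "Max open files"):
        print(name, parsed.get(name))
```

`Max processes` corresponds to `nproc` and `Max open files` to `nofile`.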
## What We Haven't Checked
- Kernel parameters like `vm.max_map_count` (default is 65530, could this be an issue?)
- cgroup v2 limits on the `nova_libvirt` container (Ubuntu 24.04 uses cgroup v2)
- `libvirtd` logs inside the container (the log file may not exist or be empty)
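
For the cgroup v2 item, the check I have in mind is to walk from the container's cgroup up to the root and read `pids.max`, `pids.current`, and `pids.events` at each level, since a limit on any ancestor applies to everything below it. A minimal Python sketch, assuming the unified hierarchy is mounted at `/sys/fs/cgroup` and the PID of a process in `nova_libvirt` is known on the host; the function names are hypothetical.

```python
import os
import sys


def cgroup_v2_path(proc_cgroup_text):
    """Return the cgroup v2 path from /proc/<pid>/cgroup content (the '0::' line)."""
    for line in proc_cgroup_text.splitlines():
        if line.startswith("0::"):
            return line[3:].strip()
    return None


def read_value(path):
    """Return the stripped file content, or None if the file is absent or unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def pids_limits_up_the_tree(rel_path, root="/sys/fs/cgroup"):
    """Collect pids.* values for rel_path and every ancestor up to the root."""
    parts = [p for p in rel_path.split("/") if p]
    rows = []
    while True:
        directory = os.path.join(root, *parts)
        rows.append({
            "cgroup": "/" + "/".join(parts),
            "pids.max": read_value(os.path.join(directory, "pids.max")),
            "pids.current": read_value(os.path.join(directory, "pids.current")),
            "pids.events": read_value(os.path.join(directory, "pids.events")),
        })
        if not parts:
            return rows  # reached the cgroup root
        parts.pop()


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python3 cgroup_pids.py <pid-as-seen-on-the-host>
    with open(f"/proc/{sys.argv[1]}/cgroup") as f:
        path = cgroup_v2_path(f.read())
    for row in pids_limits_up_the_tree(path or "/"):
        print(row)
```

A nonzero `max` counter in `pids.events` at some level would suggest that forks have been rejected by that cgroup's `pids.max`.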
## The Ask
Has anyone seen this in a Kolla-Ansible all-in-one deployment where all the obvious limits are large but libvirt still refuses to fork? Could it be:
1. A cgroup v2 limit we missed?
2. A kernel parameter that needs tuning (`vm.max_map_count`)?
3. Something else entirely?
I'm happy to run any additional diagnostics or provide more logs. Any guidance would be hugely appreciated.
Thanks,
Dennis