<div dir="ltr">Hello,<div><br></div><div>I ran into a very odd issue today when setting up a new OpenStack cloud. Instances that were migrated to another compute node lost communication with the DHCP server once their lease was up.</div>
<div><br></div><div>The cloud runs nova-network with FlatDHCPManager in multi-host mode. Shared storage is not in use, so we were migrating with --block-migrate.</div><div><br></div><div>We narrowed the issue down to iptables: the per-instance rules are not being updated correctly after a migration.</div>
<div><br></div><div>On the source compute node (192.168.1.12), before migrating:</div><div><br></div><div><span style="font-family:arial,sans-serif;font-size:13px">:nova-compute-inst-49 - [0:0]</span><br style="font-family:arial,sans-serif;font-size:13px">
<span style="font-family:arial,sans-serif;font-size:13px">-A nova-compute-inst-49 -m state --state INVALID -j DROP</span><br style="font-family:arial,sans-serif;font-size:13px"><span style="font-family:arial,sans-serif;font-size:13px">-A nova-compute-inst-49 -m state --state RELATED,ESTABLISHED -j ACCEPT</span><br style="font-family:arial,sans-serif;font-size:13px">
<span style="font-family:arial,sans-serif;font-size:13px">-A nova-compute-inst-49 -j nova-compute-provider</span><br style="font-family:arial,sans-serif;font-size:13px"><span style="font-family:arial,sans-serif;font-size:13px">-A nova-compute-inst-49 -s 192.168.1.12/32 -p udp -m udp --sport 67 --dport 68 -j ACCEPT</span><br style="font-family:arial,sans-serif;font-size:13px">
<span style="font-family:arial,sans-serif;font-size:13px">-A nova-compute-inst-49 -p icmp -j ACCEPT</span><br style="font-family:arial,sans-serif;font-size:13px">
<span style="font-family:arial,sans-serif;font-size:13px">-A nova-compute-inst-49 -p tcp -m tcp --dport 22 -j ACCEPT</span><br style="font-family:arial,sans-serif;font-size:13px"><span style="font-family:arial,sans-serif;font-size:13px">-A nova-compute-inst-49 -j nova-compute-sg-fallback</span><br style="font-family:arial,sans-serif;font-size:13px">
</div><div><span style="font-family:arial,sans-serif;font-size:13px"><br></span></div><div><span style="font-family:arial,sans-serif;font-size:13px">On the destination compute node (192.168.1.11), after migrating:</span></div>
<div><span style="font-family:arial,sans-serif;font-size:13px"><br></span></div><div><span style="font-family:arial,sans-serif;font-size:13px"><div>:nova-compute-inst-49 - [0:0]</div><div>-A nova-compute-inst-49 -m state --state INVALID -j DROP</div>
<div>-A nova-compute-inst-49 -m state --state RELATED,ESTABLISHED -j ACCEPT</div><div>-A nova-compute-inst-49 -j nova-compute-provider</div><div>-A nova-compute-inst-49 -s 192.168.1.12/32 -p udp -m udp --sport 67 --dport 68 -j ACCEPT</div>
<div>-A nova-compute-inst-49 -p icmp -j ACCEPT</div><div>-A nova-compute-inst-49 -p tcp -m tcp --dport 22 -j ACCEPT</div><div>-A nova-compute-inst-49 -j nova-compute-sg-fallback</div><div><br>
</div><div>Note how the rule for 192.168.1.12 was copied over unchanged. It is now invalid: the old compute node no longer accepts the instance's lease request and responds with a DHCP NAK.</div><div><br></div><div>After 60 seconds, the instance loses its DHCP lease and becomes unreachable.</div>
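<div><br></div><div>To make the stale rule easier to spot, here is a minimal Python sketch that flags DHCP-reply rules whose source address is not the DHCP server the instance should be talking to. The function name stale_dhcp_rules and the list rules_after_migration are made up for illustration, and the sketch assumes each rule sits on a single line, as iptables-save prints it.</div>

```python
import re

# Matches the source address of a rule, e.g. "-s 192.168.1.12/32".
SOURCE_RE = re.compile(r"-s (\d+\.\d+\.\d+\.\d+)(?:/32)?\b")


def stale_dhcp_rules(rules, dhcp_server_ip):
    """Return DHCP-reply rules whose source is not dhcp_server_ip.

    Hypothetical helper for illustration; not part of nova.
    """
    stale = []
    for rule in rules:
        # DHCP replies travel from server port 67 to client port 68.
        if "--sport 67" not in rule or "--dport 68" not in rule:
            continue
        match = SOURCE_RE.search(rule)
        if match and match.group(1) != dhcp_server_ip:
            stale.append(rule)
    return stale


# Rules as seen on the destination node (192.168.1.11) after migrating.
rules_after_migration = [
    "-A nova-compute-inst-49 -m state --state INVALID -j DROP",
    "-A nova-compute-inst-49 -m state --state RELATED,ESTABLISHED -j ACCEPT",
    "-A nova-compute-inst-49 -j nova-compute-provider",
    "-A nova-compute-inst-49 -s 192.168.1.12/32 -p udp -m udp --sport 67 --dport 68 -j ACCEPT",
    "-A nova-compute-inst-49 -p icmp -j ACCEPT",
    "-A nova-compute-inst-49 -p tcp -m tcp --dport 22 -j ACCEPT",
    "-A nova-compute-inst-49 -j nova-compute-sg-fallback",
]

for rule in stale_dhcp_rules(rules_after_migration, "192.168.1.11"):
    print("stale:", rule)
```

<div>Run against the destination node's ruleset with 192.168.1.11 as the expected DHCP server, this reports the copied 192.168.1.12 rule and nothing else.</div>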
<div><br></div><div>On the destination compute node after hard rebooting the instance:</div><div><br></div></span></div><div><span style="font-family:arial,sans-serif;font-size:13px"><div>:nova-compute-inst-49 - [0:0]</div>
<div>-A nova-compute-inst-49 -m state --state INVALID -j DROP</div><div>-A nova-compute-inst-49 -m state --state RELATED,ESTABLISHED -j ACCEPT</div><div>-A nova-compute-inst-49 -j nova-compute-provider</div><div>-A nova-compute-inst-49 -s 192.168.1.12/32 -p udp -m udp --sport 67 --dport 68 -j ACCEPT</div>
<div>-A nova-compute-inst-49 -p icmp -j ACCEPT</div><div>-A nova-compute-inst-49 -p tcp -m tcp --dport 22 -j ACCEPT</div><div>-A nova-compute-inst-49 -j nova-compute-sg-fallback</div><div>-A nova-compute-inst-49 -s 192.168.1.11/32 -p udp -m udp --sport 67 --dport 68 -j ACCEPT</div>
<div><br></div><div>Note how a rule for 192.168.1.11 has been added to the ruleset, but it sits after the fallback jump. The fallback chain simply drops the packet, so the new rule is never reached.</div><div><br></div><div>So we were scratching our heads over what to do. The first thing we tried was deleting the fallback jump. That worked, but when we rebooted the node, the rule was, of course, reinjected.</div>
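<div><br></div><div>The ordering problem can be checked mechanically. The following Python sketch lists every rule that comes after the unconditional jump to the fallback chain. The function name rules_after_fallback and the list rules_after_reboot are made up for illustration, and the sketch assumes one rule per line and that the fallback chain drops everything, as we observed.</div>

```python
def rules_after_fallback(rules, fallback_chain="nova-compute-sg-fallback"):
    """Return the rules listed after the unconditional jump to fallback_chain.

    Hypothetical helper for illustration; not part of nova.
    """
    jump = "-j " + fallback_chain
    for index, rule in enumerate(rules):
        # An unconditional jump has no match options: "-A CHAIN -j TARGET".
        if len(rule.split()) == 4 and rule.endswith(jump):
            return rules[index + 1:]
    return []


# Rules as seen on the destination node after hard rebooting the instance.
rules_after_reboot = [
    "-A nova-compute-inst-49 -m state --state INVALID -j DROP",
    "-A nova-compute-inst-49 -m state --state RELATED,ESTABLISHED -j ACCEPT",
    "-A nova-compute-inst-49 -j nova-compute-provider",
    "-A nova-compute-inst-49 -s 192.168.1.12/32 -p udp -m udp --sport 67 --dport 68 -j ACCEPT",
    "-A nova-compute-inst-49 -p icmp -j ACCEPT",
    "-A nova-compute-inst-49 -p tcp -m tcp --dport 22 -j ACCEPT",
    "-A nova-compute-inst-49 -j nova-compute-sg-fallback",
    "-A nova-compute-inst-49 -s 192.168.1.11/32 -p udp -m udp --sport 67 --dport 68 -j ACCEPT",
]

for rule in rules_after_fallback(rules_after_reboot):
    print("never reached:", rule)
```

<div>For the ruleset above, the only rule reported is the new DHCP rule for 192.168.1.11, which matches what we saw.</div>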
<div><br></div><div>Our next thought was to add a security group rule allowing DHCP. We did that and saw that any edit to the security group fixed the whole issue! </div><div><br></div><div>Note the addition of a port 80 rule, and how the DHCP rule is for the right server as well as in the right location:</div>
<div><br></div><div><div>:nova-compute-inst-49 - [0:0]</div><div>-A nova-compute-inst-49 -m state --state INVALID -j DROP</div><div>-A nova-compute-inst-49 -m state --state RELATED,ESTABLISHED -j ACCEPT</div><div>-A nova-compute-inst-49 -j nova-compute-provider</div>
<div>-A nova-compute-inst-49 -s 192.168.1.11/32 -p udp -m udp --sport 67 --dport 68 -j ACCEPT</div><div>-A nova-compute-inst-49 -p icmp -j ACCEPT</div><div>-A nova-compute-inst-49 -p tcp -m tcp --dport 22 -j ACCEPT</div>
<div>-A nova-compute-inst-49 -p tcp -m tcp --dport 80 -j ACCEPT</div><div>-A nova-compute-inst-49 -j nova-compute-sg-fallback</div></div><div><br></div><div><br></div><div>Does anyone know what's going on here? </div>
<div><br></div><div>Thanks,</div><div>Joe</div><div><br></div></span></div></div>