[Openstack-operators] [Openstack] Recovering from full outage

George Mihaiescu lmihaiescu at gmail.com
Fri Jul 6 15:14:23 UTC 2018


Can you manually assign an IP address to a VM and, once inside, ping the
address of the dhcp server?
That would at least confirm whether there is connectivity.
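
As a rough sketch of that test: the helper below is hypothetical (it is not part of any OpenStack tooling), and the interface name and addresses in the example call are placeholders rather than values from this deployment. The address of the dhcp port can usually be read on the controller with `ip netns exec qdhcp-<network-id> ip addr`.

```shell
#!/bin/sh
# Hypothetical helper, not part of any OpenStack tooling.
# Run as root inside the VM (e.g. from the console).
# With DRY_RUN=1 it only prints the commands it would run.
check_dhcp_reachability() {
    iface="$1"      # guest interface, e.g. eth0 or ens3 (assumption)
    cidr="$2"       # unused address in the subnet, e.g. 10.0.0.50/24 (assumption)
    dhcp_ip="$3"    # address of the dhcp port for the network (assumption)

    for cmd in \
        "ip link set $iface up" \
        "ip addr add $cidr dev $iface" \
        "ping -c 3 $dhcp_ip"
    do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "$cmd"
        else
            # Stop at the first command that fails.
            $cmd || return 1
        fi
    done
}

# Example (placeholder values):
#   check_dhcp_reachability eth0 10.0.0.50/24 10.0.0.2
```

If the ping succeeds with a static address, layer 2 between the compute node and the dhcp namespace is probably fine and the problem is more likely in DHCP itself.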


Also, on the controller node that hosts the dhcp server for that network,
check the "/var/lib/neutron/dhcp/d85c2a00-a637-4109-83f0-7c2949be4cad/leases"
file and make sure there are entries corresponding to your instances.
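
One way to do that lookup, as a minimal sketch: it assumes the leases file uses the usual dnsmasq layout (expiry time, MAC, IP, hostname, client id, separated by whitespace), and `lease_for_mac` is a hypothetical helper name. The MAC address of an instance's port can be taken from `openstack port list --server <instance>`.

```shell
#!/bin/sh
# Hypothetical helper: print the IP recorded for a MAC address in a
# dnsmasq-style leases file and return non-zero if no entry exists.
# Assumed line layout: <expiry> <mac> <ip> <hostname> <client-id>
lease_for_mac() {
    leases_file="$1"
    mac="$2"
    awk -v mac="$mac" '
        # Compare case-insensitively; the MAC is the second field.
        tolower($2) == tolower(mac) { print $3; found = 1 }
        END { exit !found }
    ' "$leases_file"
}

# Example (the MAC is a placeholder):
#   lease_for_mac /var/lib/neutron/dhcp/d85c2a00-a637-4109-83f0-7c2949be4cad/leases fa:16:3e:aa:bb:cc
```

A missing entry for a running instance would suggest the dhcp agent is out of sync with the ports in the neutron database.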

In my experience, if neutron is broken after working fine (so excluding any
misconfiguration), then an agent is out of sync and a restart usually fixes
things.



On Fri, Jul 6, 2018 at 9:38 AM, Torin Woltjer <torin.woltjer at granddial.com>
wrote:

> I have done tcpdumps on both the controllers and on a compute node.
> Controller:
> `ip netns exec qdhcp-d85c2a00-a637-4109-83f0-7c2949be4cad tcpdump -vnes0 -i ns-83d68c76-b8 port 67`
> `tcpdump -vnes0 -i any port 67`
> Compute:
> `tcpdump -vnes0 -i brqd85c2a00-a6 port 68`
>
> For the first command on the controller, there are no packets captured at
> all. The second command on the controller captures packets, but they don't
> appear to be relevant to openstack. The dump from the compute node shows
> constant requests are getting sent by openstack instances.
>
> In summary, DHCP requests are being sent but are never received.
>
> *Torin Woltjer*
>
> *Grand Dial Communications - A ZK Tech Inc. Company*
>
> *616.776.1066 ext. 2006*
> *www.granddial.com <http://www.granddial.com>*
>
> ------------------------------
> *From*: George Mihaiescu <lmihaiescu at gmail.com>
> *Sent*: 7/5/18 4:50 PM
> *To*: torin.woltjer at granddial.com
> *Subject*: Re: [Openstack] Recovering from full outage
>
> By default, cloud-init requires network connectivity in order to reach
> the metadata server for the hostname, ssh-key, etc.
>
> You can configure cloud-init to use the config-drive, but the lack of
> network connectivity will make the instance useless anyway, even though it
> will have your ssh-key and hostname...
>
> Did you check the things I told you?
>
> On Jul 5, 2018, at 16:06, Torin Woltjer <torin.woltjer at granddial.com>
> wrote:
>
> Are IP addresses set by cloud-init on boot? I noticed that cloud-init
> isn't working on my VMs. I created a new instance from an Ubuntu 18.04
> image to test with; the hostname was not set to the name of the instance,
> and I could not log in as the users I had specified in the configuration.
>
> ------------------------------
> *From*: George Mihaiescu <lmihaiescu at gmail.com>
> *Sent*: 7/5/18 12:57 PM
> *To*: torin.woltjer at granddial.com
> *Cc*: "openstack at lists.openstack.org" <openstack at lists.openstack.org>,
> "openstack-operators at lists.openstack.org" <openstack-operators at lists.openstack.org>
> *Subject*: Re: [Openstack] Recovering from full outage
> You should tcpdump inside the qdhcp namespace to see if the requests make
> it there, and also check iptables rules on the compute nodes for the return
> traffic.
>
>
> On Thu, Jul 5, 2018 at 12:39 PM, Torin Woltjer <
> torin.woltjer at granddial.com> wrote:
>
>> Yes, I've done this. The VMs hang for a while waiting for DHCP and
>> eventually come up with no addresses. neutron-dhcp-agent has been restarted
>> on both controllers. The qdhcp namespaces were all present; I stopped the
>> service, removed the qdhcp namespaces, noted that the dhcp agents showed
>> offline in `neutron agent-list`, restarted all neutron services, and noted
>> that the qdhcp namespaces were recreated. I then restarted a VM again and
>> it still fails to pull an IP address.
>>
>> ------------------------------
>> *From*: George Mihaiescu <lmihaiescu at gmail.com>
>> *Sent*: 7/5/18 10:38 AM
>> *To*: torin.woltjer at granddial.com
>> *Subject*: Re: [Openstack] Recovering from full outage
>> Did you restart the neutron-dhcp-agent and reboot the VMs?
>>
>> On Thu, Jul 5, 2018 at 10:30 AM, Torin Woltjer <
>> torin.woltjer at granddial.com> wrote:
>>
>>> The qrouter netns appears once the lock_path is specified, and the neutron
>>> router is pingable as well. However, instances are not pingable. If I log
>>> in via console, the instances have not been given IP addresses; if I
>>> manually give them an address and route, they are pingable and seem to work.
>>> So the router is working correctly but dhcp is not working.
>>>
>>> No errors in any of the neutron or nova logs on controllers or compute
>>> nodes.
>>>
>>> ------------------------------
>>> *From*: "Torin Woltjer" <torin.woltjer at granddial.com>
>>> *Sent*: 7/5/18 8:53 AM
>>> *To*: <lmihaiescu at gmail.com>
>>> *Cc*: openstack-operators at lists.openstack.org,
>>> openstack at lists.openstack.org
>>> *Subject*: Re: [Openstack] Recovering from full outage
>>> There is no lock path set in my neutron configuration. Does it
>>> ultimately matter what it is set to as long as it is consistent? Does it
>>> need to be set on compute nodes as well as controllers?
>>>
>>> ------------------------------
>>> *From*: George Mihaiescu <lmihaiescu at gmail.com>
>>> *Sent*: 7/3/18 7:47 PM
>>> *To*: torin.woltjer at granddial.com
>>> *Cc*: openstack-operators at lists.openstack.org,
>>> openstack at lists.openstack.org
>>> *Subject*: Re: [Openstack] Recovering from full outage
>>>
>>> Did you set a lock_path in the neutron config?
>>>
>>> On Jul 3, 2018, at 17:34, Torin Woltjer <torin.woltjer at granddial.com>
>>> wrote:
>>>
>>> The following errors appear in the neutron-linuxbridge-agent.log on both
>>> controllers: http://paste.openstack.org/show/724930/
>>>
>>> No such errors are on the compute nodes themselves.
>>>
>>> ------------------------------
>>> *From*: "Torin Woltjer" <torin.woltjer at granddial.com>
>>> *Sent*: 7/3/18 5:14 PM
>>> *To*: <lmihaiescu at gmail.com>
>>> *Cc*: "openstack-operators at lists.openstack.org" <
>>> openstack-operators at lists.openstack.org>, "openstack at lists.openstack.org"
>>> <openstack at lists.openstack.org>
>>> *Subject*: Re: [Openstack] Recovering from full outage
>>> Running `openstack server reboot` on an instance just causes the
>>> instance to be stuck in a rebooting status. Most notable of the logs is
>>> neutron-server.log which shows the following:
>>> http://paste.openstack.org/show/724917/
>>>
>>> I realized that rabbitmq was in a failed state, so I bootstrapped it and
>>> rebooted the controllers; all of the agents now show online:
>>> http://paste.openstack.org/show/724921/
>>> All of the instances can be started properly; however, I cannot ping
>>> any of the instances' floating IPs or the neutron router. When logging
>>> into an instance with the console, there is no IP address on any interface.
>>>
>>> ------------------------------
>>> *From*: George Mihaiescu <lmihaiescu at gmail.com>
>>> *Sent*: 7/3/18 11:50 AM
>>> *To*: torin.woltjer at granddial.com
>>> *Subject*: Re: [Openstack] Recovering from full outage
>>> Try restarting them using "openstack server reboot" and also check the
>>> nova-compute.log and neutron agents logs on the compute nodes.
>>>
>>> On Tue, Jul 3, 2018 at 11:28 AM, Torin Woltjer <
>>> torin.woltjer at granddial.com> wrote:
>>>
>>>> We just suffered a power outage in our data center and I'm having
>>>> trouble recovering the OpenStack cluster. All of the nodes are back online
>>>> and every instance shows active, but `virsh list --all` on the compute
>>>> nodes shows that all of the VMs are actually shut down. Running `ip addr`
>>>> on any of the nodes shows that none of the bridges are present, and
>>>> `ip netns` shows that all of the network namespaces are missing as well.
>>>> So despite all of the neutron services running, none of the networking
>>>> appears to be active, which is concerning. How do I solve this without
>>>> recreating all of the networks?
>>>>
>>>> _______________________________________________
>>>> Mailing list: http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
>>>> Post to     : openstack at lists.openstack.org
>>>> Unsubscribe : http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack