Help Needed - Mirantis - All Services Down in 2 Controllers
Hello All,

We are using Mirantis OpenStack (Mitaka) on Ubuntu 14.04. Currently I'm facing this issue: almost all services are down on two of the controllers and are up on only one node. Please find the pcs status output below. Any help to recover from this is highly appreciated.

root@node-11:/var/log# pcs status
Cluster name:
WARNING: corosync and pacemaker node names do not match (IPs used in setup?)
Last updated: Sun Feb 17 02:12:59 2019
Last change: Sat Feb 16 08:56:52 2019 by root via crm_attribute on node-1.mydomain.com
Stack: corosync
Current DC: node-10.mydomain.com (version 1.1.14-70404b0) - partition with quorum
3 nodes and 46 resources configured

Online: [ node-1.mydomain.com node-10.mydomain.com node-11.mydomain.com ]

Full list of resources:

 Clone Set: clone_p_vrouter [p_vrouter]
     Started: [ node-1.mydomain.com ]
     Stopped: [ node-10.mydomain.com node-11.mydomain.com ]
 vip__management	(ocf::fuel:ns_IPaddr2):	Started node-1.mydomain.com
 vip__vrouter_pub	(ocf::fuel:ns_IPaddr2):	Started node-1.mydomain.com
 vip__vrouter	(ocf::fuel:ns_IPaddr2):	Started node-1.mydomain.com
 vip__public	(ocf::fuel:ns_IPaddr2):	Started node-1.mydomain.com
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-1.mydomain.com ]
     Stopped: [ node-10.mydomain.com node-11.mydomain.com ]
 Clone Set: clone_p_mysqld [p_mysqld]
     Started: [ node-1.mydomain.com ]
     Stopped: [ node-10.mydomain.com node-11.mydomain.com ]
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-1.mydomain.com ]
     Stopped: [ node-10.mydomain.com node-11.mydomain.com ]
 Clone Set: clone_neutron-openvswitch-agent [neutron-openvswitch-agent]
     Started: [ node-1.mydomain.com ]
     Stopped: [ node-10.mydomain.com node-11.mydomain.com ]
 Clone Set: clone_neutron-l3-agent [neutron-l3-agent]
     Started: [ node-1.mydomain.com ]
     Stopped: [ node-10.mydomain.com node-11.mydomain.com ]
 Clone Set: clone_neutron-metadata-agent [neutron-metadata-agent]
     Started: [ node-1.mydomain.com ]
     Stopped: [ node-10.mydomain.com node-11.mydomain.com ]
 Clone Set: clone_neutron-dhcp-agent [neutron-dhcp-agent]
     Started: [ node-1.mydomain.com ]
     Stopped: [ node-10.mydomain.com node-11.mydomain.com ]
 Clone Set: clone_p_heat-engine [p_heat-engine]
     Started: [ node-1.mydomain.com ]
     Stopped: [ node-10.mydomain.com node-11.mydomain.com ]
 sysinfo_node-1.mydomain.com	(ocf::pacemaker:SysInfo):	Started node-1.mydomain.com
 Clone Set: clone_p_dns [p_dns]
     Started: [ node-1.mydomain.com ]
     Stopped: [ node-10.mydomain.com node-11.mydomain.com ]
 Master/Slave Set: master_p_conntrackd [p_conntrackd]
     Masters: [ node-1.mydomain.com ]
     Stopped: [ node-10.mydomain.com node-11.mydomain.com ]
 Clone Set: clone_p_ntp [p_ntp]
     Started: [ node-1.mydomain.com ]
     Stopped: [ node-10.mydomain.com node-11.mydomain.com ]
 Clone Set: clone_ping_vip__public [ping_vip__public]
     Started: [ node-1.mydomain.com ]
     Stopped: [ node-10.mydomain.com node-11.mydomain.com ]
 sysinfo_node-10.mydomain.com	(ocf::pacemaker:SysInfo):	Stopped
 sysinfo_node-11.mydomain.com	(ocf::pacemaker:SysInfo):	Stopped

PCSD Status:
  node-1.mydomain.com member (172.17.6.24): Offline
  node-10.mydomain.com member (172.17.6.32): Offline
  node-11.mydomain.com member (172.17.6.33): Offline

Thanks!
Raja.
You should try to determine what caused this condition. In all likelihood you ran out of resources on these nodes (memory is a likely culprit). Restarting pacemaker on the nodes where services are no longer running should bring them back up, but you probably want to check that the nodes are back in a good state before you do so. You can also reboot the nodes, but keep in mind that if you're running ceph and your ceph mons live on your controllers, you can only have one ceph mon offline at a time: with fewer than two active monitors the mons lose quorum and your storage will go offline.
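Roughly, that recovery sequence might look like the following, run on each affected controller (node-10 and node-11) one at a time. The log locations and init-script names below are assumptions based on a stock Ubuntu 14.04 / Fuel deployment, so adjust them to your environment:

    # 1. Confirm the node itself is healthy again before restarting anything
    free -m                      # was memory exhausted?
    df -h                        # a full /var or /var/log also takes services down
    tail -n 100 /var/log/syslog  # look for OOM-killer messages or pacemaker errors

    # 2. Restart the cluster stack on ONE node at a time
    service pacemaker restart

    # 3. Watch the resources come back and clear old failure records
    crm_mon -1
    pcs resource cleanup

    # 4. If ceph mons live on the controllers, verify quorum before
    #    moving on to the next node
    ceph -s
    ceph quorum_status

Working one node at a time keeps at least two ceph monitors up, which is what preserves quorum in a three-monitor cluster.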
participants (2)
- John Petrini
- Raja T