While running a Mitaka OpenStack environment deployed with Fuel 9.0, we kept hitting the same problem: after several days of normal operation the controller node would stop working and we had to restart the whole environment.
Running brctl show revealed that the virtual bridge interfaces present under normal operation had all disappeared, as shown below:
root@node-5:/var/log# brctl show
bridge name     bridge id               STP enabled     interfaces
br-ex           8000.9457a5565678       no              eno1.104
                                                        p_ff798dba-0
br-fw-admin     8000.9457a5565678       no              eno1
                                                        p_eeee51a2-0
br-mgmt         8000.3215d3e4d700       no              eno1.101
                                                        mgmt-conntrd
br-storage      8000.9457a5565678       no              eno1.102
We went through the logs of neutron, upstart, nova, rabbitmq and other components without finding the cause. Finally, the pacemaker log revealed the following errors:
Dec 08 03:20:50 [7126] node-5.domain.tld lrmd: notice: operation_finished:p_rabbitmq-server_notify_0:79939:stderr [ Error: rabbit application is not running on node rabbit@messaging-node-5. ]
Dec 08 03:20:50 [7126] node-5.domain.tld lrmd: notice: operation_finished:p_rabbitmq-server_notify_0:79939:stderr [ * Suggestion: start it with "rabbitmqctl start_app" and try again ]
Dec 08 03:20:50 [7126] node-5.domain.tld lrmd: info: log_finished:finished - rsc:p_rabbitmq-server action:notify call_id:200 pid:79939 exit-code:0 exec-time:3738ms queue-time:0ms
Dec 08 03:20:50 [7129] node-5.domain.tld crmd: info: match_graph_event:Action p_rabbitmq-server_notify_0 (116) confirmed on node-5.domain.tld (rc=0)
Dec 08 03:20:50 [7129] node-5.domain.tld crmd: notice: process_lrm_event:Operation p_rabbitmq-server_notify_0: ok (node=node-5.domain.tld, call=200, rc=0, cib-update=0, confirmed=true)
Dec 08 03:20:50 [7129] node-5.domain.tld crmd: notice: run_graph:Transition 53849 (Complete=18, Pending=0, Fired=0, Skipped=1, Incomplete=8, Source=/var/lib/pacemaker/pengine/pe-input-2696.bz2): Stopped
Dec 08 03:20:50 [7129] node-5.domain.tld crmd: info: do_state_transition:State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: notice: unpack_config:On loss of CCM Quorum: Ignore
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: determine_online_status:Node node-5.domain.tld is online
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: determine_op_status:Operation monitor found resource p_vrouter:0 active on node-5.domain.tld
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: apply_system_health:Applying automated node health strategy: migrate-on-red
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: apply_system_health: Node node-5.domain.tld has an combined system health of -1000000
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: clone_print: Clone Set: clone_p_vrouter [p_vrouter]
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: short_print: Stopped: [ node-5.domain.tld ]
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_print:vip__management(ocf::fuel:ns_IPaddr2):Stopped
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_print:vip__vrouter_pub(ocf::fuel:ns_IPaddr2):Stopped
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_print:vip__vrouter(ocf::fuel:ns_IPaddr2):Stopped
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_print:vip__public(ocf::fuel:ns_IPaddr2):Stopped
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: clone_print: Clone Set: clone_p_haproxy [p_haproxy]
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: short_print: Stopped: [ node-5.domain.tld ]
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: clone_print: Clone Set: clone_p_mysqld [p_mysqld]
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: short_print: Stopped: [ node-5.domain.tld ]
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: clone_print: Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: short_print: Slaves: [ node-5.domain.tld ]
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_print:p_aodh-evaluator(ocf::fuel:aodh-evaluator):Stopped
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_print:p_ceilometer-agent-central(ocf::fuel:ceilometer-agent-central):Stopped
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: clone_print: Clone Set: clone_neutron-openvswitch-agent [neutron-openvswitch-agent]
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: short_print: Stopped: [ node-5.domain.tld ]
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: clone_print: Clone Set: clone_neutron-l3-agent [neutron-l3-agent]
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: short_print: Stopped: [ node-5.domain.tld ]
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: clone_print: Clone Set: clone_neutron-metadata-agent [neutron-metadata-agent]
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: short_print: Stopped: [ node-5.domain.tld ]
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: clone_print: Clone Set: clone_p_heat-engine [p_heat-engine]
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: short_print: Stopped: [ node-5.domain.tld ]
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: clone_print: Clone Set: clone_neutron-dhcp-agent [neutron-dhcp-agent]
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: short_print: Stopped: [ node-5.domain.tld ]
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_print:sysinfo_node-5.domain.tld(ocf::pacemaker:SysInfo):Stopped
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: clone_print: Clone Set: clone_p_dns [p_dns]
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: short_print: Stopped: [ node-5.domain.tld ]
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: clone_print: Master/Slave Set: master_p_conntrackd [p_conntrackd]
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: short_print: Stopped: [ node-5.domain.tld ]
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: clone_print: Clone Set: clone_p_ntp [p_ntp]
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: short_print: Stopped: [ node-5.domain.tld ]
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: clone_print: Clone Set: clone_ping_vip__public [ping_vip__public]
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: short_print: Stopped: [ node-5.domain.tld ]
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: rsc_merge_weights:clone_p_vrouter: Rolling back scores from clone_p_dns
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: rsc_merge_weights:clone_p_vrouter: Rolling back scores from clone_p_ntp
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_color:Resource p_vrouter:0 cannot run anywhere
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: rsc_merge_weights:clone_p_haproxy: Rolling back scores from vip__management
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: rsc_merge_weights:clone_p_haproxy: Rolling back scores from vip__public
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_color:Resource p_haproxy:0 cannot run anywhere
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_color:Resource vip__management cannot run anywhere
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: rsc_merge_weights:vip__vrouter_pub: Rolling back scores from master_p_conntrackd
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: rsc_merge_weights:vip__vrouter_pub: Rolling back scores from vip__vrouter
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_color:Resource vip__vrouter_pub cannot run anywhere
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_color:Resource vip__vrouter cannot run anywhere
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_color:Resource vip__public cannot run anywhere
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_color:Resource p_mysqld:0 cannot run anywhere
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_color:Resource p_rabbitmq-server:0 cannot run anywhere
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: master_color:master_p_rabbitmq-server: Promoted 0 instances of a possible 1 to master
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_color:Resource p_aodh-evaluator cannot run anywhere
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_color:Resource p_ceilometer-agent-central cannot run anywhere
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_color:Resource neutron-openvswitch-agent:0 cannot run anywhere
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_color:Resource neutron-l3-agent:0 cannot run anywhere
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_color:Resource neutron-metadata-agent:0 cannot run anywhere
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_color:Resource p_heat-engine:0 cannot run anywhere
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_color:Resource neutron-dhcp-agent:0 cannot run anywhere
Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_color:Resource sysinfo_node-5.domain.tld cannot run anywhere
From this we can see that under certain conditions pacemaker assigns the node a health score of negative infinity (-1000000 in the log above) and treats it as unusable; since no resource can then find a node to run on, everything gets stopped. The key line is:
Applying automated node health strategy: migrate-on-red
This points to the node health strategy. Searching for "pacemaker migrate-on-red" shows that with this strategy, a node whose health attributes report a problem (an operating-system or hardware health check turning red) has its score set to -INFINITY and can no longer host any resource, even though the server itself is still usable.
In our case a hardware or system health check most likely raised an alert; with only a single controller node there is nowhere to migrate the resources to, so the whole environment went down.
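Before changing the configuration, this theory can be checked by looking for node health attributes that have turned red. Pacemaker treats node attributes whose names start with "#health" as health attributes (for example the ones written by the ocf:pacemaker:SysInfo resource that appears in the log above). A minimal check with the standard pacemaker/crmsh tools, assuming they behave as on our Fuel 9.0 controllers:

crm_mon -1 -A                           # one-shot cluster status including node attributes; look for "#health..." entries with value "red"
crm configure show | grep node-health   # shows the node-health-strategy currently configured, if any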
How to fix it:
1. Log in to the controller node.
2. Run crm to enter the pacemaker shell.
3. Enter configure to switch to the configuration sub-shell.
4. Enter edit and set node-health-strategy in the cluster properties to none, as shown below (a non-interactive one-liner is given after the property block):
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=1.1.14-70404b0 \
cluster-infrastructure=corosync \
cluster-recheck-interval=190s \
no-quorum-policy=ignore \
stonith-enabled=false \
start-failure-is-fatal=false \
symmetric-cluster=false \
last-lrm-refresh=1477747972 \
node-health-strategy=none
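If you prefer not to go through the interactive edit session, the same property can be set with a single crmsh command and then verified; this is a sketch assuming crmsh is available on the controller, as in steps 2 and 3 above:

crm configure property node-health-strategy=none    # disable the automated node health strategy cluster-wide
crm configure show | grep node-health-strategy      # confirm the property now reads "none"

Setting the strategy to none only stops pacemaker from acting on red health attributes; the underlying warning reported by the SysInfo resource is still worth investigating, but on a single-controller environment this change keeps the services running instead of stopping everything.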