When we deployed the Mitaka release of OpenStack with Fuel 9.0, the system would run normally for a few days and then the controller node would always fail, forcing us to restart the whole environment.

Running brctl show revealed that the virtual bridge interfaces that are normally present had all disappeared, as follows:

    root@node-5:/var/log# brctl show
    bridge name     bridge id               STP enabled     interfaces
    br-ex           8000.9457a5565678       no              eno1.104
                                                            p_ff798dba-0
    br-fw-admin     8000.9457a5565678       no              eno1
                                                            p_eeee51a2-0
    br-mgmt         8000.3215d3e4d700       no              eno1.101
                                                            mgmt-conntrd
    br-storage      8000.9457a5565678       no              eno1.102

Checking the logs of neutron, upstart, nova, rabbitmq and other components did not resolve the problem. We finally found the following errors in the pacemaker log:

    Dec 08 03:20:50 [7126] node-5.domain.tld lrmd: notice: operation_finished:p_rabbitmq-server_notify_0:79939:stderr [ Error: rabbit application is not running on node rabbit@messaging-node-5. ]
    Dec 08 03:20:50 [7126] node-5.domain.tld lrmd: notice: operation_finished:p_rabbitmq-server_notify_0:79939:stderr [ * Suggestion: start it with "rabbitmqctl start_app" and try again ]
    Dec 08 03:20:50 [7126] node-5.domain.tld lrmd: info: log_finished:finished - rsc:p_rabbitmq-server action:notify call_id:200 pid:79939 exit-code:0 exec-time:3738ms queue-time:0ms
    Dec 08 03:20:50 [7129] node-5.domain.tld crmd: info: match_graph_event:Action p_rabbitmq-server_notify_0 (116) confirmed on node-5.domain.tld (rc=0)
    Dec 08 03:20:50 [7129] node-5.domain.tld crmd: notice: process_lrm_event:Operation p_rabbitmq-server_notify_0: ok (node=node-5.domain.tld, call=200, rc=0, cib-update=0, confirmed=true)
    Dec 08 03:20:50 [7129] node-5.domain.tld crmd: notice: run_graph:Transition 53849 (Complete=18, Pending=0, Fired=0, Skipped=1, Incomplete=8, Source=/var/lib/pacemaker/pengine/pe-input-2696.bz2): Stopped
    Dec 08 03:20:50 [7129] node-5.domain.tld crmd: info: do_state_transition:State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: notice: unpack_config:On loss of CCM Quorum: Ignore
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: determine_online_status:Node node-5.domain.tld is online
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: determine_op_status:Operation monitor found resource p_vrouter:0 active on node-5.domain.tld
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: apply_system_health:Applying automated node health strategy: migrate-on-red
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: apply_system_health: Node node-5.domain.tld has an combined system health of -1000000
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: clone_print: Clone Set: clone_p_vrouter [p_vrouter]
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: short_print: Stopped: [ node-5.domain.tld ]
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_print:vip__management(ocf::fuel:ns_IPaddr2):Stopped
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_print:vip__vrouter_pub(ocf::fuel:ns_IPaddr2):Stopped
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_print:vip__vrouter(ocf::fuel:ns_IPaddr2):Stopped
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_print:vip__public(ocf::fuel:ns_IPaddr2):Stopped
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: clone_print: Clone Set: clone_p_haproxy [p_haproxy]
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: short_print: Stopped: [ node-5.domain.tld ]
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: clone_print: Clone Set: clone_p_mysqld [p_mysqld]
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: short_print: Stopped: [ node-5.domain.tld ]
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: clone_print: Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: short_print: Slaves: [ node-5.domain.tld ]
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_print:p_aodh-evaluator(ocf::fuel:aodh-evaluator):Stopped
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_print:p_ceilometer-agent-central(ocf::fuel:ceilometer-agent-central):Stopped
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: clone_print: Clone Set: clone_neutron-openvswitch-agent [neutron-openvswitch-agent]
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: short_print: Stopped: [ node-5.domain.tld ]
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: clone_print: Clone Set: clone_neutron-l3-agent [neutron-l3-agent]
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: short_print: Stopped: [ node-5.domain.tld ]
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: clone_print: Clone Set: clone_neutron-metadata-agent [neutron-metadata-agent]
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: short_print: Stopped: [ node-5.domain.tld ]
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: clone_print: Clone Set: clone_p_heat-engine [p_heat-engine]
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: short_print: Stopped: [ node-5.domain.tld ]
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: clone_print: Clone Set: clone_neutron-dhcp-agent [neutron-dhcp-agent]
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: short_print: Stopped: [ node-5.domain.tld ]
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_print:sysinfo_node-5.domain.tld(ocf::pacemaker:SysInfo):Stopped
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: clone_print: Clone Set: clone_p_dns [p_dns]
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: short_print: Stopped: [ node-5.domain.tld ]
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: clone_print: Master/Slave Set: master_p_conntrackd [p_conntrackd]
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: short_print: Stopped: [ node-5.domain.tld ]
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: clone_print: Clone Set: clone_p_ntp [p_ntp]
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: short_print: Stopped: [ node-5.domain.tld ]
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: clone_print: Clone Set: clone_ping_vip__public [ping_vip__public]
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: short_print: Stopped: [ node-5.domain.tld ]
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: rsc_merge_weights:clone_p_vrouter: Rolling back scores from clone_p_dns
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: rsc_merge_weights:clone_p_vrouter: Rolling back scores from clone_p_ntp
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_color:Resource p_vrouter:0 cannot run anywhere
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: rsc_merge_weights:clone_p_haproxy: Rolling back scores from vip__management
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: rsc_merge_weights:clone_p_haproxy: Rolling back scores from vip__public
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_color:Resource p_haproxy:0 cannot run anywhere
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_color:Resource vip__management cannot run anywhere
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: rsc_merge_weights:vip__vrouter_pub: Rolling back scores from master_p_conntrackd
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: rsc_merge_weights:vip__vrouter_pub: Rolling back scores from vip__vrouter
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_color:Resource vip__vrouter_pub cannot run anywhere
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_color:Resource vip__vrouter cannot run anywhere
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_color:Resource vip__public cannot run anywhere
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_color:Resource p_mysqld:0 cannot run anywhere
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_color:Resource p_rabbitmq-server:0 cannot run anywhere
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: master_color:master_p_rabbitmq-server: Promoted 0 instances of a possible 1 to master
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_color:Resource p_aodh-evaluator cannot run anywhere
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_color:Resource p_ceilometer-agent-central cannot run anywhere
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_color:Resource neutron-openvswitch-agent:0 cannot run anywhere
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_color:Resource neutron-l3-agent:0 cannot run anywhere
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_color:Resource neutron-metadata-agent:0 cannot run anywhere
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_color:Resource p_heat-engine:0 cannot run anywhere
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_color:Resource neutron-dhcp-agent:0 cannot run anywhere
    Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: native_color:Resource sysinfo_node-5.domain.tld cannot run anywhere

The log shows that, under some conditions, pacemaker considers the node's health to be negative infinity (the -1000000 above) and therefore treats the node as unusable. No resource can find a node to run on, so every resource is stopped. The key message is:

    Applying automated node health strategy: migrate-on-red

This suggests the problem is probably related to the node health strategy. Searching Google for "pacemaker migrate-on-red" showed that, with this option set, problems reported for the operating system and similar components set the node's health to negative infinity, which makes the node unusable. In reality the server was still usable. Most likely the hardware health check raised some warnings, and because this is a single-node setup with no other node to fail over to, the whole environment went down.

How to fix it:

1. Log in to the controller node.
2. Run crm to enter the pacemaker console.
3. Enter configure to go to the configuration level.
4. Enter edit and set node-health-strategy in the property section to none, as follows:

    property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.14-70404b0 \
        cluster-infrastructure=corosync \
        cluster-recheck-interval=190s \
        no-quorum-policy=ignore \
        stonith-enabled=false \
        start-failure-is-fatal=false \
        symmetric-cluster=false \
        last-lrm-refresh=1477747972 \
        node-health-strategy=none
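
To check whether another controller shows the same symptom before changing its configuration, the pacemaker log can be scanned for the health line quoted above. The following is only a rough sketch: the function name unhealthy_nodes and the pattern HEALTH_RE are made up for illustration, and the pattern assumes the message wording matches the log excerpt in this post, which may differ in other pacemaker versions.

```python
import re

# Assumed message format, copied from the log excerpt in this post:
#   ... pengine: info: apply_system_health: Node <name> has an combined system health of <score>
HEALTH_RE = re.compile(
    r"apply_system_health:\s*Node (\S+) has an? combined system health of (-?\d+)"
)

# pacemaker represents -INFINITY as a score of -1000000.
MINUS_INFINITY = -1000000


def unhealthy_nodes(log_lines):
    """Return {node: lowest score seen} for nodes scored at or below -INFINITY."""
    worst = {}
    for line in log_lines:
        match = HEALTH_RE.search(line)
        if match is None:
            continue
        node, score = match.group(1), int(match.group(2))
        if score <= MINUS_INFINITY:
            worst[node] = min(score, worst.get(node, score))
    return worst


# Example with one line taken from the log above.
sample = [
    "Dec 08 03:20:50 [7128] node-5.domain.tld pengine: info: "
    "apply_system_health: Node node-5.domain.tld has an combined system health of -1000000",
]
print(unhealthy_nodes(sample))
```

An open log file can be passed directly as log_lines, since the function only iterates over lines. A non-empty result means the policy engine has marked that node as unusable because of its health score.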
陈亚峰