[Openstack] [senlin] Unable to get senlin health monitoring to work

Peter White Peter.White at metaswitch.com
Wed Jul 20 11:34:27 UTC 2016


I have been unable to get senlin health policies to work at all, and I'm confused about what I might be doing wrong, though I suspect it is related to credentials. Any help would be much appreciated.


What I have tried is the following (using a heat template, though I get the same error when running the senlin commands directly from the command line).

  *   Create a cluster with a single node in it (initial size), and a health policy attached to it using NODE_STATUS_POLLING.
  *   Verify that the cluster exists, that a node has been created with a VM, and that the health policy exists and has been linked to the cluster.
  *   Nuke the VM (nova delete) to try and trigger healing.

I would expect the senlin health policy to detect that the VM has gone and to heal the cluster. However, that does not happen. If I run "senlin node-check", the node state changes to ERROR and the cluster state changes to WARNING (so senlin can tell that the cluster is in a bad way), but the health policy never takes the recovery action I expect, which is to replace the Senlin node.
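To be concrete about the behaviour I was expecting, here is a rough sketch of one polling pass. It is not Senlin code: poll_once, server_exists and recreate are hypothetical names, and the node dictionaries are only a stand-in for Senlin's node records.

```python
def poll_once(nodes, server_exists, recreate):
    """Run one polling pass over the cluster's nodes.

    nodes         -- list of dicts with 'name', 'physical_id' and 'status'
    server_exists -- hypothetical callable: True if the VM id still exists
    recreate      -- hypothetical callable: builds a new VM, returns its id
    Returns the names of the nodes that were recovered in this pass.
    """
    recovered = []
    for node in nodes:
        if not server_exists(node["physical_id"]):
            # The VM behind this node has gone: mark it and apply RECREATE.
            node["status"] = "ERROR"
            node["physical_id"] = recreate(node)
            node["status"] = "ACTIVE"
            recovered.append(node["name"])
    return recovered
```

With the template below I would expect a pass like this to run every 60 seconds (the "interval" option) and to replace the node whose VM I deleted.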

I'm seeing some odd log output that makes me think the health policy does not have access to the right credentials to issue the polling requests: the health manager's periodic _poll_cluster call fails with "ValueError: Field `user' cannot be None". I have found http://docs.openstack.org/developer/senlin/developer/authorization.html but cannot quite see how it relates.

I'm using stable/mitaka and devstack on a single Ubuntu server. The heat template and the extract from the logs are below.

Can anybody suggest what I might be doing wrong or point me at some documentation that explains how healing / authentication in Senlin should / does work?

Thanks, Peter White


Heat template

heat_template_version: 2016-04-08

description: Simple template to test healing

resources:
  profile:
    type: OS::Senlin::Profile
    properties:
      type: os.nova.server-1.0
      properties:
        image: cirros-0.3.4-x86_64-uec
        flavor: m1.tiny

  cluster1:
    type: OS::Senlin::Cluster
    properties:
      name: cluster1
      profile: {get_resource: profile}
      desired_capacity: 1
      min_size: 1

  heal_policy:
    type: OS::Senlin::Policy
    properties:
      type: senlin.policy.health-1.0
      bindings:
        - cluster: {get_resource: cluster1}
      properties:
        detection:
          type: NODE_STATUS_POLLING
          options:
            interval: 60
        recovery:
          actions:
            - RECREATE
          #fencing: # Not sure what this does, but didn't seem to make any difference.
          #  - COMPUTE


Senlin log extract

2016-07-20 11:20:04.379 DEBUG oslo_messaging._drivers.amqpdriver [-] received reply msg_id: 3b7f3c9c28074c7eb14af8e50ba10a42 from (pid=21537) __call__ /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:302
2016-07-20 11:20:04.379 ERROR oslo.service.loopingcall [-] Fixed interval looping call 'senlin.engine.health_manager.HealthManager._poll_cluster' failed
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall Traceback (most recent call last):
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall   File "/usr/local/lib/python2.7/dist-packages/oslo_service/loopingcall.py", line 136, in _run_loop
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall     result = func(*self.args, **self.kw)
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall   File "/opt/stack/senlin/senlin/engine/health_manager.py", line 110, in _poll_cluster
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall     self.rpc_client.cluster_check(self.ctx, cluster_id)
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall   File "/opt/stack/senlin/senlin/rpc/client.py", line 217, in cluster_check
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall     params=params))
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall   File "/opt/stack/senlin/senlin/rpc/client.py", line 50, in call
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall     return client.call(ctxt, method, **kwargs)
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall   File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 413, in call
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall     return self.prepare().call(ctxt, method, **kwargs)
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall   File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 158, in call
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall     retry=self.retry)
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall   File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 90, in _send
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall     timeout=timeout, retry=retry)
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall   File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 470, in send
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall     retry=retry)
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall   File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 461, in _send
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall     raise result
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall ValueError: Field `user' cannot be None
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall Traceback (most recent call last):
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall   File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 138, in _dispatch_and_reply
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall     incoming.message))
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall   File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 185, in _dispatch
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall     return self._do_dispatch(endpoint, method, ctxt, args)
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall   File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 127, in _do_dispatch
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall     result = func(ctxt, **new_args)
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall   File "/opt/stack/senlin/senlin/engine/service.py", line 68, in wrapped
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall     return func(self, ctx, *args, **kwargs)
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall   File "/opt/stack/senlin/senlin/engine/service.py", line 1328, in cluster_check
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall     consts.CLUSTER_CHECK, **params)
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall   File "/opt/stack/senlin/senlin/engine/actions/base.py", line 282, in create
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall     return obj.store(context)
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall   File "/opt/stack/senlin/senlin/engine/actions/base.py", line 187, in store
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall     action = ao.Action.create(context, values)
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall   File "/opt/stack/senlin/senlin/objects/action.py", line 52, in create
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall     return cls._from_db_object(context, cls(context), obj)
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall   File "/opt/stack/senlin/senlin/objects/base.py", line 43, in _from_db_object
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall     obj[field] = db_obj[field]
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall   File "/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 727, in __setitem__
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall     setattr(self, name, value)
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall   File "/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 72, in setter
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall     field_value = field.coerce(self, name, value)
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall   File "/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/fields.py", line 190, in coerce
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall     return self._null(obj, attr)
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall   File "/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/fields.py", line 168, in _null
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall     raise ValueError(_("Field `%s' cannot be None") % attr)
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall ValueError: Field `user' cannot be None
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall
2016-07-20 11:20:04.379 TRACE oslo.service.loopingcall
2016-07-20 11:20:04.462 INFO senlin.engine.event [req-6ce9b961-acdf-4523-a04b-ef98d7752f85 None None] cluster1 [6f720478] CLUSTER_ATTACH_POLICY - SUCCEEDED: Policy attached.
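
My reading of the traceback above is that storing the CLUSTER_CHECK action fails because the action object has a non-nullable "user" field, and the context the health manager passes in presumably carries no user. The following is a simplified stand-in that reproduces the same error message. It is not the real oslo.versionedobjects or Senlin code: the Field and Action classes and create_check_action are made up purely for illustration.

```python
class Field(object):
    """Simplified stand-in for a versioned-object field (illustrative only)."""

    def __init__(self, nullable=False):
        self.nullable = nullable

    def coerce(self, attr, value):
        # Reject None for non-nullable fields, as the traceback shows.
        if value is None and not self.nullable:
            raise ValueError("Field `%s' cannot be None" % attr)
        return value


class Action(object):
    """Hypothetical action record whose user/project must be set."""

    # A list of pairs keeps the validation order fixed.
    fields = [("name", Field()), ("user", Field()), ("project", Field())]

    def __init__(self, **values):
        for attr, field in self.fields:
            setattr(self, attr, field.coerce(attr, values.get(attr)))


def create_check_action(ctx):
    """Build a CLUSTER_CHECK action from a request-context-like dict."""
    return Action(name="CLUSTER_CHECK",
                  user=ctx.get("user"),
                  project=ctx.get("project"))


if __name__ == "__main__":
    try:
        # A context with no user, like the "None None" in the log line above.
        create_check_action({"user": None, "project": None})
    except ValueError as exc:
        print(exc)
```

If that reading is right, the question becomes which user and project the health manager's context should be populated with when it polls on behalf of a cluster.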


