[Openstack] [Heat/Ceilometer/Havana]: Auto scaling no longer occurring after some time

Eoghan Glynn eglynn at redhat.com
Tue Feb 25 09:22:09 UTC 2014


Juha,

What does the actual cpu_util trend look like around the time
upscaling occurs?

In the original template you provided, the cooldown period was set
quite short (IIRC, 20s).

So if the artificial load on the first instance drives the cpu_util
above the high-water-mark alarm threshold, say to 91%, the newly
launched instance has little load to contend with, dragging the
average cpu_util of the instance group down to ~46%. The continual
scale-up/scale-down thrashing that you see is then just autoscaling
doing exactly what you've told it to do.
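
To make the arithmetic concrete: with one member at ~91% and a freshly
booted second member near 0%, the group average is (91 + 0) / 2 ≈ 46%,
which is below your 50% low-water threshold, so the scale-down alarm
fires almost immediately.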

To avoid this, you'll need to (see the sketch after this list):

* ensure that the "load" is spread across the current instance group
  members in a roughly fair distribution (in practice this is usually
  achieved with a load balancer that randomizes or round-robins)

* increase the cooldown period to allow the load distribution to
  "settle" after a scaling operation has taken place

* ensure that the low-water-mark alarm threshold is sufficiently
  distant from that of the high-water-mark alarm
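
For example, here's a minimal sketch against your template; the 60s
cooldown/period and the 15% low-water threshold are illustrative
starting points rather than tuned recommendations:

  "ScaleUpPolicy" : {
    "Type" : "AWS::AutoScaling::ScalingPolicy",
    "Properties" : {
      "AdjustmentType" : "ChangeInCapacity",
      "AutoScalingGroupName" : { "Ref" : "Group_A" },
      "Cooldown" : "60",
      "ScalingAdjustment" : "1"
    }
  },

  "CPUAlarmLow": {
    "Type": "OS::Ceilometer::Alarm",
    "Properties": {
      "description": "Scale-down if CPU is less than 15% for 60 seconds",
      "meter_name": "cpu_util",
      "statistic": "avg",
      "period": "60",
      "evaluation_periods": "1",
      "threshold": "15",
      "alarm_actions": [ {"Fn::GetAtt": ["ScaleDownPolicy", "AlarmUrl"]} ],
      "matching_metadata": {"metadata.user_metadata.server_group": "Group_A"},
      "comparison_operator": "lt",
      "repeat_actions": true
    }
  }

With that spread, a two-member group averaging ~46% sits safely between
the 15% and 90% thresholds instead of immediately re-triggering the
scale-down alarm.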

Cheers,
Eoghan

----- Original Message -----
> Hi,
> 
> Some update... Yesterday I added a "repeat_actions": true definition to the
> OS::Ceilometer::Alarm resources in the Heat template:
> 
> "CPUAlarmHigh": {
> "Type": "OS::Ceilometer::Alarm",
> "Properties": {
> "description": "Scale-up if CPU is greater than 90% for 30 seconds",
> "meter_name": "cpu_util",
> "statistic": "avg",
> "period": "30",
> "evaluation_periods": "1",
> "threshold": "90",
> "alarm_actions":
> [ {"Fn::GetAtt": ["ScaleUpPolicy", "AlarmUrl"]} ],
> "matching_metadata":
> {"metadata.user_metadata.server_group": "Group_A" },
> "comparison_operator": "gt",
> "repeat_actions" : true
> }
> },
> 
> "CPUAlarmLow": {
> "Type": "OS::Ceilometer::Alarm",
> "Properties": {
> "description": "Scale-down if CPU is less than 50% for 30 seconds",
> "meter_name": "cpu_util",
> "statistic": "avg",
> "period": "30",
> "evaluation_periods": "1",
> "threshold": "50",
> "alarm_actions":
> [ {"Fn::GetAtt": ["ScaleDownPolicy", "AlarmUrl"]} ],
> "matching_metadata":
> {"metadata.user_metadata.server_group": "Group_A" },
> "comparison_operator": "lt",
> "repeat_actions" : true
> }
> }
> 
> ...and everything seemed to work fine. But now I just created a stack again
> and generated some load inside the first VM that started. Scaling up
> occurred, but after that the system continuously scales the VMs up and down
> even though the load situation doesn't change. It seems the "repeat_actions"
> definitions didn't help after all...
> 
> Br,
> -Juha
> 
> 
> On 25 February 2014 00:27, Steven Dake <sdake at redhat.com> wrote:
> 
> 
> 
> Juha,
> 
> Copying Angus so he sees this. He wrote the large majority of the Ceilometer
> + Heat integration and might have a better idea of the details of the
> problem you face.
> 
> 
> On 02/24/2014 01:27 AM, Juha Tynninen wrote:
> 
> 
> 
> Hi,
> 
> I'm having some problems with the auto scaling feature.
> Any ideas?
> 
> At first, scaling up and down works just fine. But when tested later on,
> scaling down/up no longer works properly: scaling down may occur even
> though it shouldn't, or scaling up doesn't occur even though it should.
> When, in this situation, I remove all the received metric data from the
> DB, auto scaling starts to work again.
> 
> Ceilometer is configured to use Mongo and the auto scaling is based on the
> cpu_util metrics.
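> 
> Concretely, by "remove all the received metric data from the DB" I
> mean clearing the stored samples from Ceilometer's Mongo store, e.g.
> (assuming the default "meter" collection; adjust names to your
> deployment):
> 
>     db.meter.remove({ "counter_name": "cpu_util" })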
> 
> Related configurations:
> -----------------------
> /etc/ceilometer/pipeline.yaml on compute nodes:
> 
> name: cpu_pipeline
> interval: 15
> 
> /etc/ceilometer/ceilometer.conf on controller:
> evaluation_interval=15
> 
> Heat template used:
> -------------------
> "Resources" : {
> 
> "Group_A" : {
> "Type" : "AWS::AutoScaling::AutoScalingGroup",
> "Properties" : {
> "AvailabilityZones" : { "Fn::GetAZs" : ""},
> "LaunchConfigurationName" : { "Ref" : "Group_A_Config" },
> "MinSize" : "1",
> "MaxSize" : "3",
> "Tags" : [
> { "Key" : "metering.server_group", "Value" : "Group_A" },
> { "Key" : "custom_metadata", "Value" : "test" }
> ],
> "VPCZoneIdentifier" : [ { "Ref" : "PrivateSubnetId" } ]
> }
> },
> 
> "Group_A_Config" : {
> "Type" : "AWS::AutoScaling::LaunchConfiguration",
> "Properties": {
> "ImageId" : { "Ref" : "ImageId" },
> "InstanceType" : { "Ref" : "InstanceType" },
> "KeyName" : { "Ref" : "KeyName" }
> }
> },
> 
> "ScaleUpPolicy" : {
> "Type" : "AWS::AutoScaling::ScalingPolicy",
> "Properties" : {
> "AdjustmentType" : "ChangeInCapacity",
> "AutoScalingGroupName" : { "Ref" : "Group_A" },
> "Cooldown" : "20",
> "ScalingAdjustment" : "1"
> }
> },
> 
> "ScaleDownPolicy" : {
> "Type" : "AWS::AutoScaling::ScalingPolicy",
> "Properties" : {
> "AdjustmentType" : "ChangeInCapacity",
> "AutoScalingGroupName" : { "Ref" : "Group_A" },
> "Cooldown" : "20",
> "ScalingAdjustment" : "-1"
> }
> },
> 
> "CPUAlarmHigh": {
> "Type": "OS::Ceilometer::Alarm",
> "Properties": {
> "description": "Scale-up if CPU is greater than 90% for 20 seconds",
> "meter_name": "cpu_util",
> "statistic": "avg",
> "period": "20",
> "evaluation_periods": "1",
> "threshold": "90",
> "alarm_actions":
> [ {"Fn::GetAtt": ["ScaleUpPolicy", "AlarmUrl"]} ],
> "matching_metadata":
> {"metadata.user_metadata.server_group": "Group_A" },
> "comparison_operator": "gt"
> }
> },
> 
> "CPUAlarmLow": {
> "Type": "OS::Ceilometer::Alarm",
> "Properties": {
> "description": "Scale-down if CPU is less than 50% for 20 seconds",
> "meter_name": "cpu_util",
> "statistic": "avg",
> "period": "20",
> "evaluation_periods": "1",
> "threshold": "50",
> "alarm_actions":
> [ {"Fn::GetAtt": ["ScaleDownPolicy", "AlarmUrl"]} ],
> "matching_metadata":
> {"metadata.user_metadata.server_group": "Group_A" },
> "comparison_operator": "lt"
> }
> 
> In the ceilometer logs I can see the following kind of warning:
> 
> <44>Feb 24 08:41:08 node-16
> ceilometer-ceilometer.collector.dispatcher.database WARNING: message
> signature invalid, discarding message: {u'counter_name':
> u'instance.scheduled', u'user_id': None, u'message_signature':
> u'd1b49ddf004edc5b7a8dc9405b42a71f2ae975d04c25838c3dc0ea0e6f6e4edd',
> u'timestamp': u'2014-02-24 08:41:08.334580', u'resource_id':
> u'48c815ab-01c9-4ac8-9096-ac171976598c', u'message_id':
> u'67e611e4-9d2f-11e3-81f1-080027e519cb', u'source': u'openstack',
> u'counter_unit': u'instance', u'counter_volume': 1, u'project_id':
> u'efcca4ba425c4beda73eb31a54df931a', u'resource_metadata': {u'instance_id':
> u'48c815ab-01c9-4ac8-9096-ac171976598c', u'weighted_host': {u'host':
> u'node-18', u'weight': 3818.0}, u'host': u'scheduler.node-16',
> u'request_spec': {u'num_instances': 1, u'block_device_mapping':
> [{u'instance_uuid': u'48c815ab-01c9-4ac8-9096-ac171976598c',
> u'guest_format': None, u'boot_index': 0, u'delete_on_termination': True,
> u'no_device': None, u'connection_info': None, u'volume_id': None,
> u'device_name': None, u'disk_bus': None, u'image_id':
> u'11848cbf-a428-4dfb-8818-2f0a981f540b', u'source_type': u'image',
> u'device_type': u'disk', u'snapshot_id': None, u'destination_type':
> u'local', u'volume_size': None}], u'image': {u'status': u'active', u'name':
> u'cirrosImg', u'deleted': False, u'container_format': u'bare',
> u'created_at': u'2014-02-12T08:46:04.000000', u'disk_format': u'qcow2',
> u'updated_at': u'2014-02-12T08:46:04.000000', u'properties': {},
> u'min_disk': 0, u'min_ram': 0, u'checksum':
> u'50bdc35edb03a38d91b1b071afb20a3c', u'owner':
> u'efcca4ba425c4beda73eb31a54df931a', u'is_public': True, u'deleted_at':
> None, u'id': u'11848cbf-a428-4dfb-8818-2f0a981f540b', u'size': 9761280},
> u'instance_type': {u'root_gb': 1, u'name': u'm1.tiny', u'ephemeral_gb': 0,
> u'memory_mb': 512, u'vcpus': 1, u'extra_specs': {}, u'swap': 0,
> u'rxtx_factor': 1.0, u'flavorid': u'1', u'vcpu_weight': None, u'id': 2},
> u'instance_properties': {u'vm_state': u'building', u'availability_zone':
> None, u'terminated_at': None, u'ephemeral_gb': 0, u'instance_type_id': 2,
> u'user_data': u'Q29udGVudC1UeXBlOiBtdWx0aXBhcnQvbWl4ZWQ7IGJvdW5kYXJ5PSI9PT0
> ...
> , u'cleaned': False, u'vm_mode': None, u'deleted_at': None,
> u'reservation_id': u'r-l91mh33v', u'id': 274, u'security_groups':
> {u'objects': []}, u'disable_terminate': False, u'root_device_name': None,
> u'display_name': u'tyky-Group_A-55cklit7nvbq-Group_A-2-yis32na5m7ey',
> u'uuid': u'48c815ab-01c9-4ac8-9096-ac171976598c', u'default_swap_device':
> None, u'info_cache': {u'instance_uuid':
> u'48c815ab-01c9-4ac8-9096-ac171976598c', u'network_info': []}, u'hostname':
> u'tyky-group-a-55cklit7nvbq-group-a-2-yis32na5m7ey', u'launched_on': None,
> u'display_description': u'tyky-Group_A-55cklit7nvbq-Group_A-2-yis32na5m7ey',
> u'key_data': u'ssh-rsa
> AAAAB3NzaC1yc2EAAAADAQABAAABAQC39hmz8e40Xv/+QKkLyRA7j02RfIG61cr1j41RftnkOF3ZbwBzi7qibsOA3gC9Ln05YbB6z2/iUnQzxQsoOpmlnXuv2O296utY2ZCTKhdFSzn2Ot7l635zEXkivMc97wz4bITtaBTjX3nV6sXOfevdTIOJeC11SqxmfNRRzXcz9fRv6kLjz7IrA0tvRTp2xDVtFEj+vFLWaXc3TcUSygxiSLeAuNkH1rZ9jVuHXXvzb/e7navrGyJec2P86AQg2TUk77MhLjPcbyKiJJK0DhK6zOkZUWXtgIVQx7+gO/Xs2QgQHcw+VdzRzpJK+/EOzUOU8IDWNnyfaJEnQEoX2oMj
> Generated by Nova\n', u'deleted': False, u'config_drive': u'',
> u'power_state': 0, u'default_ephemeral_device': None, u'progress': 0,
> u'project_id': u'efcca4ba425c4beda73eb31a54df931a', u'launched_at': None,
> u'scheduled_at': None, u'node': None, u'ramdisk_id': u'', u'access_ip_v6':
> None, u'access_ip_v4': None, u'kernel_id': u'', u'key_name': u'heat_key',
> u'updated_at': None, u'host': None, u'user_id':
> u'ef4e983291ef4ad1b88eb1f776bd52b6', u'system_metadata':
> {u'instance_type_memory_mb': 512, u'instance_type_swap': 0,
> u'instance_type_vcpu_weight': None, u'instance_type_root_gb': 1,
> u'instance_type_name': u'm1.tiny', u'instance_type_id': 2,
> u'instance_type_ephemeral_gb': 0, u'instance_type_rxtx_factor': 1.0,
> u'image_disk_format': u'qcow2', u'instance_type_flavorid': u'1',
> u'instance_type_vcpus': 1, u'image_container_format': u'bare',
> u'image_min_ram': 0, u'image_min_disk': 1, u'image_base_image_ref':
> u'11848cbf-a428-4dfb-8818-2f0a981f540b'}, u'task_state': u'scheduling',
> u'shutdown_terminate': False, u'cell_name': None, u'root_gb': 1, u'locked':
> False, u'name': u'instance-00000112', u'created_at':
> u'2014-02-24T08:41:08.257534', u'locked_by': None, u'launch_index': 0,
> u'memory_mb': 512, u'vcpus': 1, u'image_ref':
> u'11848cbf-a428-4dfb-8818-2f0a981f540b', u'architecture': None,
> u'auto_disk_config': False, u'os_type': None, u'metadata':
> {u'metering.server_group': u'Group_A', u'AutoScalingGroupName':
> u'tyky-Group_A-55cklit7nvbq', u'custom_metadata': u'test'}},
> u'security_group': [u'default'], u'instance_uuids':
> [u'48c815ab-01c9-4ac8-9096-ac171976598c']}, u'event_type':
> u'scheduler.run_instance.scheduled'}, u'counter_type': u'delta'}
> 
> Also the following warnings/errors can be seen, but they seem to occur even
> when auto scaling is working properly, so as such they have no negative
> effects:
> 
> <44>Feb 24 08:43:08 node-16 ceilometer-ceilometer.transformer.conversions
> WARNING: dropping sample with no predecessor:
> <ceilometer.sample.Sample object at 0x3774fd0>
> <44>Feb 24 08:43:08 node-16 ceilometer-ceilometer.publisher.rpc AUDIT:
> Publishing 1 samples on metering
> <44>Feb 24 08:43:08 node-16 ceilometer-ceilometer.publisher.rpc AUDIT:
> Publishing 1 samples on metering
> <44>Feb 24 08:43:08 node-16 ceilometer-ceilometer.publisher.rpc AUDIT:
> Publishing 1 samples on metering
> <44>Feb 24 08:43:08 node-16 ceilometer-ceilometer.publisher.rpc AUDIT:
> Publishing 1 samples on metering
> <44>Feb 24 08:43:08 node-16 ceilometer-ceilometer.publisher.rpc AUDIT:
> Publishing 1 samples on metering
> <44>Feb 24 08:43:08 node-16 ceilometer-ceilometer.publisher.rpc AUDIT:
> Publishing 1 samples on metering
> <44>Feb 24 08:43:09 node-16 ceilometer-ceilometer.publisher.rpc AUDIT:
> Publishing 1 samples on metering
> <43>Feb 24 08:43:09 node-16 ceilometer-ceilometer.collector.dispatcher.database
> ERROR: Failed to record metering data: not okForStorage
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/dist-packages/ceilometer/collector/dispatcher/database.py", line 65, in record_metering_data
>     self.storage_conn.record_metering_data(meter)
>   File "/usr/lib/python2.7/dist-packages/ceilometer/storage/impl_mongodb.py", line 417, in record_metering_data
>     upsert=True,
>   File "/usr/lib/python2.7/dist-packages/pymongo/collection.py", line 487, in update
>     check_keys, self.__uuid_subtype), safe)
>   File "/usr/lib/python2.7/dist-packages/pymongo/mongo_client.py", line 969, in _send_message
>     rv = self.__check_response_to_last_error(response)
>   File "/usr/lib/python2.7/dist-packages/pymongo/mongo_client.py", line 911, in __check_response_to_last_error
>     raise OperationFailure(details["err"], details["code"])
> OperationFailure: not okForStorage
> 
> Br,
> -Juha
> 
> 
> 
> 



