On 4/15/2019 10:36 AM, Nicolas Ghirlanda wrote:
New VMs are currently always scheduled to the same compute node, even though manual live migration to other compute nodes works fine.
How are you doing the live migration? If you're using the openstack command line and defaulting to the 2.1 compute API microversion, you're forcing the server to another host by bypassing the scheduler, which may be why live migration is "working" but server create never uses the other computes.
We're not sure what the issue is, but perhaps someone can spot it from our config:
# nova.conf scheduler config
default_availability_zone = az1
How many computes are in az1? All 8?
...
[filter_scheduler]
available_filters = nova.scheduler.filters.all_filters
enabled_filters = RetryFilter, AvailabilityZoneFilter, ComputeCapabilitiesFilter, ImagePropertiesFilter, ServerGroupAntiAffinityFilter, ServerGroupAffinityFilter, AggregateInstanceExtraSpecsFilter, AggregateMultiTenancyIsolation, DifferentHostFilter, RamFilter, SameHostFilter, NUMATopologyFilter
Probably not related to this, but you can remove RamFilter since placement does the MEMORY_MB filtering, and the RamFilter was deprecated in Stein as a result.
It looks like you're getting the default host_subset_size value:
https://docs.openstack.org/nova/queens/configuration/config.html#filter_sche...
Which means your scheduler is "packing" by default. If you have multiple computes and you want to spread instances across them, you can adjust the host_subset_size value.
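
To illustrate what host_subset_size changes, here is a rough sketch of the selection step. It is not nova's actual code: pick_host and the host names are made up for the example, and it assumes the hosts have already been filtered and sorted best-first by the weighers.

```python
import random


def pick_host(weighed_hosts, host_subset_size=1, rng=random):
    """Pick one host at random from the best `host_subset_size` candidates.

    `weighed_hosts` is assumed to be sorted best-first already.
    """
    if not weighed_hosts:
        raise ValueError("no hosts passed the filters")
    # Clamp the subset size so it is at least 1 and at most the list length.
    size = max(1, min(host_subset_size, len(weighed_hosts)))
    return rng.choice(weighed_hosts[:size])


# Hypothetical host names, best-weighted first.
hosts = ["compute1", "compute2", "compute3", "compute4"]

# With a subset size of 1 the top-weighted host always wins.
always_first = pick_host(hosts, host_subset_size=1)

# With a larger subset the choice is spread over the top N hosts.
one_of_top_three = pick_host(hosts, host_subset_size=3)
```

With a subset size of 1, the same host keeps being chosen for as long as it stays at the top of the weighed list, which matches the behaviour described in the question.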
Database is an external Percona XtraDB Cluster (Version 5.7.24) with haproxy for read-write-splitting (currently only one write node).
We do see mysql errors in the nova-scheduler.log on the write DB node when an instance is created.
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db [-] Unexpected error while reporting service status: OperationalError: (pymysql.err.OperationalError) (1213, u'WSREP detected deadlock/conflict and aborted the transaction. Try restarting the transaction') (Background on this error at: http://sqlalche.me/e/e3q8)
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db Traceback (most recent call last):
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/servicegroup/drivers/db.py", line 91, in _report_state
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db     service.service_ref.save()
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 226, in wrapper
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db     return fn(self, *args, **kwargs)
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/objects/service.py", line 397, in save
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db     db_service = db.service_update(self._context, self.id, updates)
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/db/api.py", line 183, in service_update
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db     return IMPL.service_update(context, service_id, values)
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_db/api.py", line 154, in wrapper
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db     ectxt.value = e.inner_exc
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db     self.force_reraise()
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db     six.reraise(self.type_, self.value, self.tb)
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_db/api.py", line 142, in wrapper
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db     return f(*args, **kwargs)
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 227, in wrapped
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db     return f(context, *args, **kwargs)
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db   File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db     self.gen.next()
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 1043, in _transaction_scope
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db     yield resource
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db   File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db     self.gen.next()
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 653, in _session
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db     self.session.rollback()
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db     self.force_reraise()
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db     six.reraise(self.type_, self.value, self.tb)
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 650, in _session
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db     self._end_session_transaction(self.session)
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 678, in _end_session_transaction
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db     session.commit()
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 943, in commit
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db     self.transaction.commit()
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 471, in commit
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db     t[1].commit()
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1643, in commit
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db     self._do_commit()
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1674, in _do_commit
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db     self.connection._commit_impl()
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 726, in _commit_impl
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db     self._handle_dbapi_exception(e, None, None, None, None)
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1409, in _handle_dbapi_exception
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db     util.raise_from_cause(newraise, exc_info)
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 265, in raise_from_cause
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db     reraise(type(exception), exception, tb=exc_tb, cause=cause)
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 724, in _commit_impl
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db     self.engine.dialect.do_commit(self.connection)
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/dialects/mysql/base.py", line 1765, in do_commit
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db     dbapi_connection.commit()
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/pymysql/connections.py", line 422, in commit
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db     self._read_ok_packet()
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/pymysql/connections.py", line 396, in _read_ok_packet
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db     pkt = self._read_packet()
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/pymysql/connections.py", line 683, in _read_packet
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db     packet.check_error()
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/pymysql/protocol.py", line 220, in check_error
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db     err.raise_mysql_exception(self._data)
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/pymysql/err.py", line 109, in raise_mysql_exception
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db     raise errorclass(errno, errval)
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db OperationalError: (pymysql.err.OperationalError) (1213, u'WSREP detected deadlock/conflict and aborted the transaction. Try restarting the transaction') (Background on this error at: http://sqlalche.me/e/e3q8)
2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db
2019-04-15 16:52:20.020 24 INFO nova.servicegroup.drivers.db [-] Recovered from being unable to report status.
This is a service update operation, which could indicate that the other computes are reported as 'down' and that's why nothing is getting scheduled to them. Have you checked the "openstack compute service list" output to make sure those computes are all reporting as "up"?
https://docs.openstack.org/python-openstackclient/latest/cli/command-objects...
There is a retry_on_deadlock decorator on that service_update DB API though, so I'm kind of surprised to still see the deadlock errors, unless those just get logged while retrying?
https://github.com/openstack/nova/blob/stable/queens/nova/db/sqlalchemy/api....
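
For anyone unfamiliar with the pattern, a retry-on-deadlock decorator works roughly like the sketch below. This is a generic illustration, not the oslo.db or nova implementation: DeadlockError, retry_on_deadlock_sketch and flaky_service_update are names invented for the example. The point is that a deadlock only reaches the caller (and its traceback logging) once the retries are used up.

```python
import functools
import logging
import time

LOG = logging.getLogger(__name__)


class DeadlockError(Exception):
    """Stand-in for a database deadlock error such as MySQL error 1213."""


def retry_on_deadlock_sketch(max_retries=3, delay=0.0):
    """Retry the wrapped function when it raises DeadlockError."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except DeadlockError:
                    if attempt == max_retries:
                        # Retries exhausted: let the caller see the error.
                        raise
                    LOG.warning("deadlock on attempt %d, retrying", attempt)
                    time.sleep(delay)
        return wrapper
    return decorator


calls = {"n": 0}


@retry_on_deadlock_sketch(max_retries=3)
def flaky_service_update():
    """Hypothetical update that deadlocks twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise DeadlockError("simulated deadlock")
    return "updated"


result = flaky_service_update()
```

In this sketch the first two deadlocks are only logged as warnings and the third attempt succeeds. Whether the real decorator logs or swallows intermediate failures the same way would need to be checked against the code linked above.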
The deadlock message is quite strange, as we have haproxy configured so all write requests are handled by one node.
There are NO errors in the mysqld.log WHILE creating an instance, but we see from time to time aborted connections from nova.
2019-04-15T14:22:36.232108Z 30616972 [Note] Aborted connection 30616972 to db: 'nova' user: 'nova' host: '10.x.y.z' (Got an error reading communication packets)
As I said, all instances are allocated to the same compute node. nova-compute.log doesn't show an error while creating the instance.
Besides that, we also see messages like the following from nova.scheduler.host_manager on all other nodes (but those messages are _not_ triggered when an instance is spawned!):
2019-04-15 16:28:47.771 22 INFO nova.scheduler.host_manager [req-f92e340e-a88a-44a0-8cad-588390c25bc2 - - - - -] The instance sync for host 'xxx' did not match. Re-created its InstanceList.
Are there any instances on these other hosts? My guess is you're seeing that after the live migration to another host.
Don't know if that may be relevant, but somehow our (currently single) AZ is listed several times.
# openstack availability zone list
+------------+-------------+
| Zone Name  | Zone Status |
+------------+-------------+
| internal   | available   |
| az1        | available   |
| az1        | available   |
| az1        | available   |
| az1        | available   |
+------------+-------------+
Could that be related somehow?
I believe those are the AZs for other services as well (cinder/neutron). Specify the --compute option to filter that.
--
Another thing to check is placement - are there 8 compute node resource providers reporting into placement? You can check using the CLI:
https://docs.openstack.org/osc-placement/latest/cli/index.html#resource-prov...
In Queens, there should be one resource provider per working compute node in the cell database's compute_nodes table (the UUIDs should match as well).
--
Thanks,
Matt
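
The placement check above boils down to comparing two sets of UUIDs. A minimal sketch, assuming you have already collected the UUIDs from the compute_nodes table and from the placement resource provider listing yourself; compare_providers and the sample UUID strings are made up for the example.

```python
def compare_providers(compute_node_uuids, resource_provider_uuids):
    """Report UUIDs that appear on only one side of the comparison."""
    nodes = set(compute_node_uuids)
    providers = set(resource_provider_uuids)
    return {
        # Compute nodes that nova knows about but placement does not.
        "missing_in_placement": sorted(nodes - providers),
        # Resource providers with no matching compute node record.
        "unknown_to_nova": sorted(providers - nodes),
    }


# Placeholder values standing in for real UUIDs.
report = compare_providers(
    compute_node_uuids=["uuid-a", "uuid-b", "uuid-c"],
    resource_provider_uuids=["uuid-a", "uuid-c", "uuid-d"],
)
```

If both lists in the report are empty, every compute node has a matching resource provider.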