[nova][scheduler] scheduler spawns to the same compute node only
Matt Riedemann
mriedemos at gmail.com
Mon Apr 15 17:04:17 UTC 2019
On 4/15/2019 10:36 AM, Nicolas Ghirlanda wrote:
> New VMs are currently always scheduled to the same compute node, even
> though manual live migration to other compute nodes works fine.
How are you doing the live migration? If you're using the openstack
command line and defaulting to the 2.1 compute API microversion, you're
forcing the server to another host and bypassing the scheduler, which
may be why live migration is "working" but server create never uses the
other computes.
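For example (a sketch; the server and host names below are
placeholders):

    # omit the target host so the scheduler picks the destination
    nova live-migration <server-uuid>

    # or with OSC, opt into microversion 2.30 so a named target host
    # is still validated by the scheduler rather than forced
    openstack --os-compute-api-version 2.30 server migrate --live <host> <server-uuid>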
>
>
> We're not sure what the issue is, but perhaps someone can spot it from
> our config:
>
>
> # nova.conf scheduler config
>
> default_availability_zone = az1
How many computes are in az1? All 8?
>
> ...
>
> [filter_scheduler]
> available_filters = nova.scheduler.filters.all_filters
> enabled_filters = RetryFilter, AvailabilityZoneFilter,
> ComputeCapabilitiesFilter, ImagePropertiesFilter,
> ServerGroupAntiAffinityFilter, ServerGroupAffinityFilter,
> AggregateInstanceExtraSpecsFilter, AggregateMultiTenancyIsolation,
> DifferentHostFilter, RamFilter, SameHostFilter, NUMATopologyFilter
>
Probably not related to this, but you can remove RamFilter: placement
already does the MEMORY_MB filtering, and RamFilter was deprecated in
Stein as a result.
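That is, your enabled_filters would simply become:

    enabled_filters = RetryFilter, AvailabilityZoneFilter, ComputeCapabilitiesFilter, ImagePropertiesFilter, ServerGroupAntiAffinityFilter, ServerGroupAffinityFilter, AggregateInstanceExtraSpecsFilter, AggregateMultiTenancyIsolation, DifferentHostFilter, SameHostFilter, NUMATopologyFilter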
It looks like you're getting the default host_subset_size value:
https://docs.openstack.org/nova/queens/configuration/config.html#filter_scheduler.host_subset_size
The default is 1, meaning the scheduler always picks the single
top-weighed host, so it "packs" by default. If you have multiple
computes and you want to spread instances across them, you can increase
the host_subset_size value.
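For example (the value 4 here is just an illustration, tune it for your
environment; the scheduler then picks randomly among the best 4 weighed
hosts):

    [filter_scheduler]
    host_subset_size = 4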
>
>
> The database is an external Percona XtraDB Cluster (version 5.7.24) with
> haproxy for read/write splitting (currently only one write node).
>
> We do see MySQL errors in nova-scheduler.log on the write DB node
> when an instance is created.
>
>
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db [-] Unexpected error while reporting service status: OperationalError: (pymysql.err.OperationalError) (1213, u'WSREP detected deadlock/conflict and aborted the transaction. Try restarting the transaction') (Background on this error at: http://sqlalche.me/e/e3q8)
> Traceback (most recent call last):
>   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/servicegroup/drivers/db.py", line 91, in _report_state
>     service.service_ref.save()
>   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 226, in wrapper
>     return fn(self, *args, **kwargs)
>   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/objects/service.py", line 397, in save
>     db_service = db.service_update(self._context, self.id, updates)
>   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/db/api.py", line 183, in service_update
>     return IMPL.service_update(context, service_id, values)
>   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_db/api.py", line 154, in wrapper
>     ectxt.value = e.inner_exc
>   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
>     self.force_reraise()
>   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
>     six.reraise(self.type_, self.value, self.tb)
>   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_db/api.py", line 142, in wrapper
>     return f(*args, **kwargs)
>   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 227, in wrapped
>     return f(context, *args, **kwargs)
>   File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
>     self.gen.next()
>   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 1043, in _transaction_scope
>     yield resource
>   File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
>     self.gen.next()
>   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 653, in _session
>     self.session.rollback()
>   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
>     self.force_reraise()
>   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
>     six.reraise(self.type_, self.value, self.tb)
>   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 650, in _session
>     self._end_session_transaction(self.session)
>   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 678, in _end_session_transaction
>     session.commit()
>   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 943, in commit
>     self.transaction.commit()
>   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 471, in commit
>     t[1].commit()
>   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1643, in commit
>     self._do_commit()
>   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1674, in _do_commit
>     self.connection._commit_impl()
>   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 726, in _commit_impl
>     self._handle_dbapi_exception(e, None, None, None, None)
>   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1409, in _handle_dbapi_exception
>     util.raise_from_cause(newraise, exc_info)
>   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 265, in raise_from_cause
>     reraise(type(exception), exception, tb=exc_tb, cause=cause)
>   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 724, in _commit_impl
>     self.engine.dialect.do_commit(self.connection)
>   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/dialects/mysql/base.py", line 1765, in do_commit
>     dbapi_connection.commit()
>   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/pymysql/connections.py", line 422, in commit
>     self._read_ok_packet()
>   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/pymysql/connections.py", line 396, in _read_ok_packet
>     pkt = self._read_packet()
>   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/pymysql/connections.py", line 683, in _read_packet
>     packet.check_error()
>   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/pymysql/protocol.py", line 220, in check_error
>     err.raise_mysql_exception(self._data)
>   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/pymysql/err.py", line 109, in raise_mysql_exception
>     raise errorclass(errno, errval)
> OperationalError: (pymysql.err.OperationalError) (1213, u'WSREP detected deadlock/conflict and aborted the transaction. Try restarting the transaction') (Background on this error at: http://sqlalche.me/e/e3q8)
>
> 2019-04-15 16:52:20.020 24 INFO nova.servicegroup.drivers.db [-] Recovered from being unable to report status.
This is a service update operation, which could indicate that the other
computes are being reported as 'down', and that's why nothing is getting
scheduled to them. Have you checked the "openstack compute service list"
output to make sure those computes are all reporting as "up"?
https://docs.openstack.org/python-openstackclient/latest/cli/command-objects/compute-service.html#compute-service-list
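For example:

    openstack compute service list --service nova-compute

Any compute whose State column shows "down" gets filtered out by the
scheduler.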
There is a retry-on-deadlock decorator on that service_update DB API
though, so I'm kind of surprised to still see the deadlock errors;
maybe those just get logged while retrying?
https://github.com/openstack/nova/blob/stable/queens/nova/db/sqlalchemy/api.py#L566
>
>
> The deadlock message is quite strange, as we have haproxy configured so
> that all write requests are handled by one node.
>
>
> There are NO errors in mysqld.log WHILE creating an instance, but from
> time to time we see aborted connections from nova.
>
> 2019-04-15T14:22:36.232108Z 30616972 [Note] Aborted connection 30616972
> to db: 'nova' user: 'nova' host: '10.x.y.z' (Got an error reading
> communication packets)
>
>
>
> As I said, all instances are allocated to the same compute node.
> nova-compute.log doesn't show an error while creating the instance.
>
>
> Besides that, we also see messages from nova.scheduler.host_manager on
> all other nodes like the following (but those messages are _not_
> triggered when an instance is spawned!):
>
>
> 2019-04-15 16:28:47.771 22 INFO nova.scheduler.host_manager
> [req-f92e340e-a88a-44a0-8cad-588390c25bc2 - - - - -] The instance sync
> for host 'xxx' did not match. Re-created its InstanceList.
Are there any instances on these other hosts? My guess is you're seeing
that after the live migration to another host.
>
>
>
> We don't know if this is relevant, but somehow our (currently single)
> AZ is listed several times.
>
>
> # openstack availability zone list
> +------------+-------------+
> | Zone Name | Zone Status |
> +------------+-------------+
> | internal | available |
> | az1 | available |
> | az1 | available |
> | az1 | available |
> | az1 | available |
> +------------+-------------+
>
> May that be related somehow?
I believe those are the AZs for other services as well (cinder/neutron).
Specify the --compute option to filter the list to compute AZs only.
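i.e.:

    openstack availability zone list --compute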
--
Another thing to check is placement: are there 8 compute node resource
providers reporting into placement? You can check using the CLI:
https://docs.openstack.org/osc-placement/latest/cli/index.html#resource-provider-list
In Queens, there should be one resource provider in placement per
working compute node in the cell database's compute_nodes table (and
their UUIDs should match as well).
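Something along these lines (the SQL is just a sketch against the
Queens schema; run it on the cell database):

    openstack resource provider list

    mysql nova -e "select uuid, hypervisor_hostname from compute_nodes where deleted = 0;"

If a compute node has no matching resource provider, placement will
never return it as an allocation candidate.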
--
Thanks,
Matt