[nova][scheduler] scheduler spawns to the same compute node only

Mike Carden mike.carden at gmail.com
Mon Apr 15 21:46:14 UTC 2019


For what it's worth, we had a discussion about this in November last year:

http://lists.openstack.org/pipermail/openstack-discuss/2018-November/000209.html

I made a comment at the end of that thread about a 'workaround' we have
used. The issue still happens here on Queens, and the workaround doesn't
solve it permanently.

--
MC



On Tue, Apr 16, 2019 at 3:22 AM Matt Riedemann <mriedemos at gmail.com> wrote:

> On 4/15/2019 10:36 AM, Nicolas Ghirlanda wrote:
> > New VMs are currently always scheduled to the same compute node, even
> > though a manual live migration to other compute nodes works fine.
>
> How are you doing the live migration? If you're using the openstack
> command line and defaulting to the 2.1 compute API microversion, you're
> forcing the server onto another host and bypassing the scheduler, which
> may be why live migration is "working" while server create never uses
> the other computes.
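>
> For example (untested here, and with placeholder names), the nova CLI
> lets the scheduler pick the destination instead of forcing one:
>
>   # no target host given: the scheduler chooses the destination
>   nova live-migration <server-uuid>
>
>   # or request a host but still have the scheduler validate it
>   # (requires compute API microversion >= 2.30)
>   nova --os-compute-api-version 2.30 live-migration <server-uuid> <target-host>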
>
> >
> >
> > We're not sure what the issue is, but perhaps someone may spot it from
> > our config:
> >
> >
> > # nova.conf  scheduler config
> >
> > default_availability_zone = az1
>
> How many computes are in az1? All 8?
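>
> A quick way to check, assuming az1 is backed by a host aggregate (which
> is how nova models AZs; with only default_availability_zone set there
> may be no aggregate at all):
>
>   # aggregates and their availability zones
>   openstack aggregate list --long
>
>   # hosts that belong to a given aggregate
>   openstack aggregate show <aggregate-name>
>
> The Zone column in "openstack compute service list" shows the same
> mapping per service.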
>
> >
> > ...
> >
> > [filter_scheduler]
> > available_filters = nova.scheduler.filters.all_filters
> > enabled_filters = RetryFilter, AvailabilityZoneFilter, ComputeCapabilitiesFilter, ImagePropertiesFilter, ServerGroupAntiAffinityFilter, ServerGroupAffinityFilter, AggregateInstanceExtraSpecsFilter, AggregateMultiTenancyIsolation, DifferentHostFilter, RamFilter, SameHostFilter, NUMATopologyFilter
> >
>
> Probably not related to this, but you can remove RamFilter: placement
> already does the MEMORY_MB filtering, which is why RamFilter was
> deprecated in Stein.
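>
> For example, the same list with just that one filter dropped:
>
>   enabled_filters = RetryFilter, AvailabilityZoneFilter, ComputeCapabilitiesFilter, ImagePropertiesFilter, ServerGroupAntiAffinityFilter, ServerGroupAffinityFilter, AggregateInstanceExtraSpecsFilter, AggregateMultiTenancyIsolation, DifferentHostFilter, SameHostFilter, NUMATopologyFilter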
>
> It looks like you're getting the default host_subset_size value:
>
>
> https://docs.openstack.org/nova/queens/configuration/config.html#filter_scheduler.host_subset_size
>
> That means your scheduler "packs" by default: with the default of 1, it
> always picks the single best-weighed host. If you have multiple computes
> and want to spread instances across them, increase host_subset_size.
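>
> A minimal sketch for nova.conf on the scheduler hosts (the value 4 is
> just an example, size it to your host count):
>
>   [filter_scheduler]
>   # pick randomly among the 4 best-weighed hosts instead of always
>   # taking the single top host
>   host_subset_size = 4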
>
> >
> >
> > Database is an external Percona XtraDB Cluster (version 5.7.24) behind
> > haproxy for read/write splitting (currently only one write node).
> >
> > We do see MySQL errors in nova-scheduler.log, against the write DB
> > node, when an instance is created.
> >
> >
> > 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db [-]
> > Unexpected error while reporting service status: OperationalError:
> > (pymysql.err.OperationalError) (1213, u'WSREP detected deadlock/conflict
> > and aborted the transaction. Try restarting the transaction')
> > (Background on this error at: http://sqlalche.me/e/e3q8)
> > Traceback (most recent call last):
> >   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/servicegroup/drivers/db.py", line 91, in _report_state
> >     service.service_ref.save()
> >   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 226, in wrapper
> >     return fn(self, *args, **kwargs)
> >   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/objects/service.py", line 397, in save
> >     db_service = db.service_update(self._context, self.id, updates)
> >   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/db/api.py", line 183, in service_update
> >     return IMPL.service_update(context, service_id, values)
> >   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_db/api.py", line 154, in wrapper
> >     ectxt.value = e.inner_exc
> >   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
> >     self.force_reraise()
> >   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
> >     six.reraise(self.type_, self.value, self.tb)
> >   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_db/api.py", line 142, in wrapper
> >     return f(*args, **kwargs)
> >   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 227, in wrapped
> >     return f(context, *args, **kwargs)
> >   File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
> >     self.gen.next()
> >   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 1043, in _transaction_scope
> >     yield resource
> >   File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
> >     self.gen.next()
> >   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 653, in _session
> >     self.session.rollback()
> >   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
> >     self.force_reraise()
> >   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
> >     six.reraise(self.type_, self.value, self.tb)
> >   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 650, in _session
> >     self._end_session_transaction(self.session)
> >   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 678, in _end_session_transaction
> >     session.commit()
> >   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 943, in commit
> >     self.transaction.commit()
> >   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 471, in commit
> >     t[1].commit()
> >   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1643, in commit
> >     self._do_commit()
> >   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1674, in _do_commit
> >     self.connection._commit_impl()
> >   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 726, in _commit_impl
> >     self._handle_dbapi_exception(e, None, None, None, None)
> >   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1409, in _handle_dbapi_exception
> >     util.raise_from_cause(newraise, exc_info)
> >   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 265, in raise_from_cause
> >     reraise(type(exception), exception, tb=exc_tb, cause=cause)
> >   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 724, in _commit_impl
> >     self.engine.dialect.do_commit(self.connection)
> >   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/dialects/mysql/base.py", line 1765, in do_commit
> >     dbapi_connection.commit()
> >   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/pymysql/connections.py", line 422, in commit
> >     self._read_ok_packet()
> >   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/pymysql/connections.py", line 396, in _read_ok_packet
> >     pkt = self._read_packet()
> >   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/pymysql/connections.py", line 683, in _read_packet
> >     packet.check_error()
> >   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/pymysql/protocol.py", line 220, in check_error
> >     err.raise_mysql_exception(self._data)
> >   File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/pymysql/err.py", line 109, in raise_mysql_exception
> >     raise errorclass(errno, errval)
> > OperationalError: (pymysql.err.OperationalError) (1213, u'WSREP detected
> > deadlock/conflict and aborted the transaction. Try restarting the
> > transaction') (Background on this error at: http://sqlalche.me/e/e3q8)
> >
> > 2019-04-15 16:52:20.020 24 INFO nova.servicegroup.drivers.db [-]
> > Recovered from being unable to report status.
>
> This failure is in the service update operation, which could mean the
> other computes are being reported as 'down', and that would explain why
> nothing gets scheduled to them. Have you checked the "openstack compute
> service list" output to make sure those computes are all reporting as
> "up"?
>
>
> https://docs.openstack.org/python-openstackclient/latest/cli/command-objects/compute-service.html#compute-service-list
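>
> For example (hypothetical hosts and output, trimmed):
>
>   $ openstack compute service list --service nova-compute
>   +----+--------------+-------+------+---------+-------+
>   | ID | Binary       | Host  | Zone | Status  | State |
>   +----+--------------+-------+------+---------+-------+
>   |  7 | nova-compute | cmp01 | az1  | enabled | up    |
>   |  8 | nova-compute | cmp02 | az1  | enabled | down  |
>   +----+--------------+-------+------+---------+-------+
>
> Any compute stuck at "down" there would explain why it never receives
> new instances.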
>
> There is a retry_on_deadlock decorator on that service_update DB API
> method, though, so I'm surprised to still see the deadlock errors,
> unless they just get logged while retrying?
>
>
> https://github.com/openstack/nova/blob/stable/queens/nova/db/sqlalchemy/api.py#L566
>
> >
> >
> > The deadlock message is quite strange, as we have haproxy configured
> > so that all write requests are handled by a single node.
> >
> >
> > There are NO errors in mysqld.log WHILE creating an instance, but from
> > time to time we do see aborted connections from nova:
> >
> > 2019-04-15T14:22:36.232108Z 30616972 [Note] Aborted connection 30616972
> > to db: 'nova' user: 'nova' host: '10.x.y.z' (Got an error reading
> > communication packets)
> >
> >
> >
> > As I said, all instances are allocated to the same compute node.
> > nova-compute.log doesn't show an error while creating the instance.
> >
> >
> > Besides that, we also see messages like the following from
> > nova.scheduler.host_manager on all other nodes (but those messages are
> > _not_ triggered when an instance is spawned!):
> >
> >
> > 2019-04-15 16:28:47.771 22 INFO nova.scheduler.host_manager
> > [req-f92e340e-a88a-44a0-8cad-588390c25bc2 - - - - -] The instance sync
> > for host 'xxx' did not match. Re-created its InstanceList.
>
> Are there any instances on these other hosts? My guess is you're seeing
> that after the live migration to another host.
>
> >
> >
> >
> > We don't know whether that is relevant, but somehow our (currently
> > single) AZ is listed several times.
> >
> >
> > # openstack availability zone list
> > +-----------+-------------+
> > | Zone Name | Zone Status |
> > +-----------+-------------+
> > | internal  | available   |
> > | az1       | available   |
> > | az1       | available   |
> > | az1       | available   |
> > | az1       | available   |
> > +-----------+-------------+
> >
> > Could that be related somehow?
>
> I believe those are the AZs for other services as well (cinder/neutron).
> Specify the --compute option to filter that.
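>
>   # compute AZs only
>   openstack availability zone list --compute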
>
> --
>
> Another thing to check is placement - are there 8 compute node resource
> providers reporting into placement? You can check using the CLI:
>
>
> https://docs.openstack.org/osc-placement/latest/cli/index.html#resource-provider-list
>
> In Queens, there should be one resource provider per working compute
> node in the cell database's compute_nodes table (the UUIDs should match
> as well).
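>
> A rough sketch of that check (the resource provider commands come from
> the osc-placement plugin):
>
>   pip install osc-placement
>   openstack resource provider list
>
> There should be one provider per compute node, each with the same UUID
> as its row in the cell database's compute_nodes table.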
>
> --
>
> Thanks,
>
> Matt
>
>