[openstack-dev] [gate] [glance] concurrent workers issues
Sean Dague
sean at dague.net
Wed Jul 9 20:21:31 UTC 2014
Beyond the database connection issues, I think it's also worth noting that
Bug 1339735 - Deadlock updating image properties in glance-registry daemon
showed up at the same time. I think glance might just not be multiworker
safe at the moment, but because that was never tested before, we never saw
the races. This is something to keep an eye on.
In the glance case, I'm not sure what we should do here. Reverting
the multiworker setup will make this race go away, but that just
hides the issue: we let users set that option, and we know the resulting
errors will cause some operations to fail.
-Sean
On 07/09/2014 03:59 PM, Matt Riedemann wrote:
> Bug 1338841 [1] started showing up yesterday and I first noticed it on
> the change to set osapi_volume_workers equal to the number of CPUs
> available by default. Similar patches for trove (api/conductor workers)
> and glance (api/registry workers) have landed in the last week also, and
> nova has been running with multiple api/conductor workers by default
> since Icehouse.
>
> It looks like the cinder change tipped the default postgresql
> max_connections over and we started getting asynchronous connection
> failures in that job. [2]
>
> We can also note that the postgresql job is the only one that runs the
> nova api-metadata service, which has its own workers.
>
> The VMs the jobs are running on have 8 VCPUs, so that's at least 88
> workers between nova (3), cinder (1), glance (2), trove (2), neutron,
> heat and ceilometer.
>
> So osapi_volume_workers (8) + n-api-meta workers (8) seems to have
> tipped it over.
>
> The first attempt at a fix is to simply double the default
> max_connections value [3].
>
> While looking up the postgresql configuration docs, I also read a bit on
> synchronous_commit=off and fsync=off, which sound like we might want to
> also think about using one of those in devstack runs since they are
> supposed to be more performant if you don't care about disaster recovery
> (which we don't in gate runs on VMs).
>
> Anyway, bumping max connections might fix the gate. I'm just sending
> this out to see if there are any postgresql experts out there with
> additional tips or insights on things we can tweak or look for,
> including whether or not it might be worthwhile to set
> synchronous_commit=off or fsync=off for gate runs.
>
> [1] https://bugs.launchpad.net/nova/+bug/1338841
> [2] http://goo.gl/yRBDjQ
> [3] https://review.openstack.org/#/c/105854/
>
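As a rough sanity check on the worker arithmetic quoted above, here is a minimal sketch in Python. It assumes every worker type is sized to the 8 VCPUs, that neutron, heat and ceilometer contribute one worker type each (which is how the numbers reach 88), and that each worker holds a single database connection. The PostgreSQL default for max_connections is typically 100; the names total_workers and headroom are hypothetical helpers, not part of devstack or any OpenStack project.

```python
# Hypothetical tally of worker processes on a gate VM, assuming each
# worker type is sized to the number of VCPUs.
VCPUS = 8

# Worker types per service as listed in the quoted message. Counting
# neutron, heat and ceilometer once each is an assumption.
WORKER_TYPES = {
    "nova": 3,
    "cinder": 1,
    "glance": 2,
    "trove": 2,
    "neutron": 1,
    "heat": 1,
    "ceilometer": 1,
}


def total_workers(worker_types, vcpus):
    """Total worker processes if every worker type spawns one per VCPU."""
    return sum(worker_types.values()) * vcpus


def headroom(max_connections, workers, conns_per_worker=1):
    """Connections left over after every worker takes conns_per_worker."""
    return max_connections - workers * conns_per_worker


workers = total_workers(WORKER_TYPES, VCPUS)
print(workers)                 # 88
print(headroom(100, workers))  # 12 left with a typical default of 100
print(headroom(200, workers))  # 112 left after doubling the limit
```

With only one connection per worker the margin under a limit of 100 is already thin, and any worker that opens a second connection pushes the total past it, which would be consistent with the intermittent failures described.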
--
Sean Dague
http://dague.net