[openstack-dev] [all][python3] use of six.iteritems()
robertc at robertcollins.net
Mon Jun 15 09:06:37 UTC 2015
On 12 June 2015 at 05:39, Dolph Mathews <dolph.mathews at gmail.com> wrote:
> On Thu, Jun 11, 2015 at 12:34 AM, Robert Collins <robertc at robertcollins.net>
>> On 11 June 2015 at 17:16, Robert Collins <robertc at robertcollins.net>
>> > This test conflates setup and execution. Better like my example,
>> Just had it pointed out to me that I've let my inner asshole out again
>> - sorry. I'm going to step away from the thread for a bit; my personal
>> state (daughter just had a routine but painful operation) shouldn't be
>> taken out on other folk, however indirectly.
> Ha, no worries. You are completely correct about conflating setup and
> execution. As far as I can tell though, even if I isolate the dict setup
> from the benchmark, I get the same relative differences in results.
> iteritems() was introduced for a reason!
Absolutely: the key question is whether that reason is applicable to us.
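For anyone wanting to re-run the comparison with setup kept out of the
timed region, a minimal sketch follows. It assumes Python 3, where
d.items() is already a view, so list(d.items()) stands in for the
Python 2 copying behaviour. The dict size and repeat counts are
arbitrary choices, not figures from this thread.

```python
import timeit

# The dict is built once in setup, so only the iteration is timed.
# 10000 items is an arbitrary size chosen for illustration.
SETUP = "d = dict.fromkeys(range(10000), 0)"


def bench(stmt, number=100):
    """Return the best-of-3 time in seconds for `number` runs of stmt."""
    return min(timeit.repeat(stmt, setup=SETUP, number=number, repeat=3))


# View iteration (what py3 items() and py2 iteritems() give you).
view_time = bench("for k, v in d.items(): pass")

# Copy first, then iterate (what py2 items() does).
copy_time = bench("for k, v in list(d.items()): pass")

print("view: %.4fs  copy: %.4fs" % (view_time, copy_time))
```

The absolute numbers depend on interpreter and machine; only the ratio
between the two is of interest here.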
> If you don't need to go back to .items()'s copy behavior in py2, then
> six.iteritems() seems to be the best general purpose choice.
> I think Gordon said it best elsewhere in this thread:
>> again, i just want to reiterate, i'm not saying don't use items(), i just
>> think we should not blindly use items() just as we shouldn't blindly use
I'd like to recap and summarise a bit.
I think it's broadly agreed that:
The three view based methods -- iteritems, iterkeys, itervalues -- in
Python2 became unified with the list-form equivalents in Python3.
The view based methods are substantially faster and lower overhead
than the list form methods, approximately 3x.
We don't have any services today that expect to hold million item
dicts, or even 10K item dicts in a persistent fashion.
There's some cognitive overhead involved in reading six.iteritems(d).
We should use d.items() except where it matters.
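A small sketch of what that default looks like in practice. The
iteritems() helper below is a hypothetical stand-in written for this
example so that it runs without six installed; it is not the six
source. The config dict is likewise made up.

```python
import sys

PY2 = sys.version_info[0] == 2


def iteritems(d):
    """Hypothetical stand-in for six.iteritems(d), for illustration only."""
    if PY2:
        return d.iteritems()
    return iter(d.items())


# Made-up example data.
config = {"host": "localhost", "port": 5672}

# Default: plain items(), which reads the same on both interpreters.
plain = sorted(config.items())

# Exception: explicit lazy iteration, for a dict known to be large.
lazy = sorted(iteritems(config))

print(plain == lazy)
```

Both spellings yield the same pairs; the difference is only whether
Python 2 builds an intermediate list first.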
Where does it matter?
We have several process architectures in OpenStack:
- We have API servers that are eventlet (except keystone) WSGI
servers. They respond to requests on HTTP[S]; each request is
independent and loads all its state from the DB and/or memcache each
time. We don't expect large numbers of concurrent active requests per
process. (Where large would be e.g. 1000).
- We have MQ servers that are conceptually the same as WSGI, just a
different listening protocol. They do sometimes have background tasks,
and for some (e.g. neutron-l3-agent) may hold significant cached state
between requests. But that's still scoped to medium size datasets. We
expect moderate numbers of concurrent active requests, as these are
the actual backends doing things for users, but since these servers
are typically working with actual slow things (e.g. the hypervisor)
high concurrency typically goes badly :).
- We have CLIs that start up, process some data and exit. This
includes python-novaclient and nova-manage. They generally work with
very small datasets and have no concurrency at all.
There are two ways that iteritems vs items etc could matter. One, A), is
memory and CPU cost on a single use of a very large dict. The other, B),
is aggregate overhead on many concurrent uses of a single shared dict
(or, C), possibly N similar-sized dicts).
A) Doesn't apply to us in any case I can think of.
B) Doesn't apply to us either - our peak concurrency on any single
process is still low. (We may manage to make it higher now we're moving
on the PyMySQL thing, but that's still in progress.) And of course
there are tradeoffs with high concurrency depending on the ratio of
work-to-wait each request has. Very high concurrency depends on a very
low ratio: to have 1000 concurrent requests that aren't slowing each
other down requires that each request's wall clock time be 1000x the
time spent in-process actioning it, and that there be enough backend
capacity (whatever that is) to dispatch the work to without causing
queuing in that part of the system.
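The work-to-wait arithmetic above can be written out as a rough
sketch. The function name and the millisecond figures are assumptions
made for this example, and the model ignores backend capacity
entirely.

```python
def max_uncontended_concurrency(wall_clock_ms, in_process_ms):
    """Requests one process can overlap before they compete for its CPU.

    Simplified model: each request needs in_process_ms of this process's
    time and spends the rest of wall_clock_ms waiting on something else.
    """
    return wall_clock_ms / float(in_process_ms)


# Assumed figures: 2 ms of in-process work inside a 2000 ms request.
print(max_uncontended_concurrency(2000, 2))  # 1000.0
```

Put the other way round, sustaining 1000 such requests needs each one
to spend 1000x as long waiting as working.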
C) We can eliminate via both the argument on B, and on relative
overheads: if we had 10000 1000-item dicts in process at once, the
relative overhead of making items() from them all is approx the size
of the dicts: but it's almost certain we have much more state hanging
around in each of those 10000 threads than each dict: so the
incremental cost will not dominate the process overheads.
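One way to check the relative overhead on a given interpreter is to
measure it. The sketch below uses tracemalloc from the standard
library (Python 3) to compare a 1000-item dict with the list that a
copying items() would build. The key and value shapes are invented for
the example, and results will vary by Python version.

```python
import tracemalloc


def allocated_bytes(build):
    """Return (result, bytes still held) after calling build()."""
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    result = build()
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, after - before


# A 1000-item dict with made-up tuple keys and values.
d, dict_bytes = allocated_bytes(
    lambda: {("key", i): ("value", i) for i in range(1000)})

# The list of (key, value) pairs that py2's items() would materialise.
items, copy_bytes = allocated_bytes(lambda: list(d.items()))

print("dict: %d bytes, items() copy: %d bytes" % (dict_bytes, copy_bytes))
```

The copy is transient: it is freed once the loop over it finishes,
whereas the dict and the rest of the per-request state stay resident.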
I'm not saying, and haven't said, that iteritems() is never applicable
*in general*; rather, I don't believe it's ever applicable *to us*
today, and I'm arguing that we should default to items() and bring in
iteritems() if and when we need it.
Robert Collins <rbtcollins at hp.com>
HP Converged Cloud