[openstack-dev] [tc] supporting Go

Gregory Haynes greg at greghaynes.net
Wed May 11 17:23:56 UTC 2016

On Wed, May 11, 2016, at 05:09 AM, Hayes, Graham wrote:
> On 10/05/2016 23:28, Gregory Haynes wrote:
> >
> > OK, I'll bite.
> >
> > I had a look at the code and there's a *ton* of low hanging fruit. I
> > decided to hack in some fixes or emulation of fixes to see whether I
> > could get any major improvements. For each test I ran 4 workers using
> > SO_REUSEPORT and timed doing 1k axfr's with 4 in parallel at a time and
> > recorded 5 timings. I also added these changes on top of one another in
> > the order they follow.
> Thanks for the analysis - any suggestions about how we can improve the
> current design are more than welcome.
> For this test, was it a single static zone? What size was it?

This was a small single static zone - so the most time possible was
spent in python, as opposed to blocking on the network.

> >
> > Base timings: [9.223, 9.030, 8.942, 8.657, 9.190]
> >
> > Stop spawning a thread per request - there are a lot of ways to do this
> > better, but let's not even mess with that and just literally move the
> > thread spawning that happens per request because it's a silly idea here:
> > [8.579, 8.732, 8.217, 8.522, 8.214] (almost 10% increase).
> >
> > Stop instantiating oslo config object per request - this should be a no
> > brainer, we don't need to parse config inside of a request handler:
> > [8.544, 8.191, 8.318, 8.086] (a few more percent).
> >
> > Now, the slightly less low hanging fruit - there are 3 round trips to
> > the database *every request*. This is where the vast majority of request
> > time is spent (not in python). I didn't actually implement a full on
> > cache (I just hacked around the db queries), but this should be trivial
> > to do since designate does know when to invalidate the cache data. Some
> > numbers on how much a warm cache will help:
> >
> > Caching zone: [5.968, 5.942, 5.936, 5.797, 5.911]
> >
> > Caching records: [3.450, 3.357, 3.364, 3.459, 3.352].
> >
> > I would also expect real-world usage to be similar in that you should
> > only get 1 cache miss per worker per notify, and then all the other
> > public DNS servers would be getting cache hits. You could also remove
> > the cost of that 1 cache miss by pre-loading data in to the cache.
> I actually would expect the real world use of this to have most of the
> servers have a cache miss.
> We shuffle the order of the miniDNS servers sent out to the user facing
> DNS servers, so I would expect them to hit different minidns servers
> at nearly same time, and each of them try to generate the cache entry.
> For pre-loading - this could work, but I *really* don't like relying on
> a cache for one of the critical path components.

I am not sure what the issue with caching in general is, but it's not
far-fetched to pre-load an axfr into a cache before you send out any
notifies (since you know exactly when that will happen). For the herding
issue - that's just a matter of how you design your cache coherence
system. Usually you want to design that around your threading/worker
model, and since we actually get a speed increase by turning the current
threading off it might be worth fixing that first...

That being said - this doesn't need to be designed amazingly to reap the
benefits being argued for. I haven't heard any goals of 'make every
single request as low latency as possible' (which is when you would
worry about dealing with cold cache costs), but instead that there's a
need to scale up to a potentially large number of clients all requesting
the same axfr at once. In that scenario even the most simple caching
setup would make a huge difference.
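
As a rough illustration of the kind of simple cache described above,
here is a minimal sketch. It is not Designate code: the class name
AxfrCache, its methods, and the loader callable are all hypothetical.
It assumes one cache per worker, a loader that does the database round
trips, and an explicit invalidate call made whenever the zone changes.
A per-zone lock means concurrent misses for the same zone trigger only
one load, which is one way to handle the herding issue.

```python
import threading


class AxfrCache:
    """Hypothetical per-worker cache of AXFR responses, keyed by zone name."""

    def __init__(self, loader):
        # loader is an assumed callable: zone name -> response payload.
        # In a real service this is where the database queries would live.
        self._loader = loader
        self._data = {}
        self._zone_locks = {}
        self._guard = threading.Lock()

    def _lock_for(self, zone):
        # One lock per zone so misses on different zones do not serialize.
        with self._guard:
            return self._zone_locks.setdefault(zone, threading.Lock())

    def get(self, zone):
        # Fast path: a warm cache needs no lock and no database access.
        try:
            return self._data[zone]
        except KeyError:
            pass
        # Slow path: only the first caller loads; the rest wait, then hit.
        with self._lock_for(zone):
            if zone not in self._data:
                self._data[zone] = self._loader(zone)
            return self._data[zone]

    def preload(self, zone):
        # Call before sending notifies so the first real request is a hit.
        with self._lock_for(zone):
            self._data[zone] = self._loader(zone)

    def invalidate(self, zone):
        # Call whenever the zone's records change.
        with self._lock_for(zone):
            self._data.pop(zone, None)
```

With this shape, calling preload() for a zone just before its notifies
go out would remove even the single cold-cache miss per worker.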

> >
> > All said and done, I think that's almost a 3x speed increase with
> > minimal effort. So, can we stop saying that this has anything to do with
> > Python as a language and has everything to do with the algorithms being
> > used?
> As I have said before - for us, the time spent : performance
> improvement ratio is just much higher (for our dev team at least) with
> Go.
> We saw a 50x improvement for small SOA queries, and ~ 10x improvement
> for 2000 record AXFR (without caching). The majority of your
> improvement came from caching, so I would imagine that would speed up
> the Go implementation as well.

There has to be something very different between your python testing
setup and mine. In my testing there simply wasn't enough time spent in
Python to get even a 2x speed increase by removing all Python execution
time. I wonder if this is because the code originally spawns a thread
per request and therefore if you run it with a large number of parallel
requests you'll effectively thread bomb all the workers?
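
The arithmetic behind that 2x remark can be sketched as follows. The
timing lists are the ones quoted earlier in this thread; the function
max_speedup and the python_fraction values passed to it are
illustrative assumptions, not measurements. The bound is the usual one:
if a fraction p of request time is spent executing Python, removing all
of it gives at most a 1 / (1 - p) speedup.

```python
from statistics import mean

# Timings in seconds, copied from the measurements quoted in this thread.
base = [9.223, 9.030, 8.942, 8.657, 9.190]
cached_records = [3.450, 3.357, 3.364, 3.459, 3.352]


def max_speedup(python_fraction):
    """Upper bound on speedup if all Python execution time were removed.

    python_fraction is the assumed share of request time spent in Python.
    """
    if not 0.0 <= python_fraction < 1.0:
        raise ValueError("python_fraction must be in [0, 1)")
    return 1.0 / (1.0 - python_fraction)


# Speedup observed from the algorithmic changes alone (threads, config, cache).
observed = mean(base) / mean(cached_records)
print(f"observed speedup from caching and cleanup: {observed:.2f}x")

# A 2x gain from a faster language needs at least half the time in Python.
print(f"bound with half the time in Python: {max_speedup(0.5):.1f}x")
```

Under this bound, any share of time in Python below one half caps the
gain from eliminating Python entirely at less than 2x.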

The point I am trying to make here is that throwing out "we rewrote our
software in another language, now it's X times faster" does not mean that
the language is the issue. If there is a potential language issue it is
extremely useful for us to drill into it and determine the root cause,
since that affects all of us. Since there are also obvious costs to all
of us for supporting a new language, I think it's entirely reasonable to
require some kind of root cause analysis on the need to use a different
language before giving case-by-case approval.
