[User-committee] Feedback on Grizzly

Ryan Lane rlane at wikimedia.org
Fri Apr 5 18:59:57 UTC 2013


On Fri, Apr 5, 2013 at 7:01 AM, Matt Van Winkle <mvanwink at rackspace.com> wrote:

>  Hello again, folks!
>
>  When I reached out a couple weeks ago, I mentioned that I was hoping
> that, along with being a large developer of OpenStack, Rackspace could
> also contribute to the committee's work as one of its largest users via our
> public cloud.  We just found our first opportunity.  This week we deployed
> an early release of Grizzly code to one of our data centers.
>
>  Going in, we knew there were quite a few database migrations.  As we
> studied them, however, they presented some challenges in the manner that
> they were executed.  Using them as they were would have meant extended
> downtime for the databases given the size of our production data (row
> counts, etc).  That downtime is problematic since it translates to the
> Public APIs being unavailable – something we aim to impact as minimally as
> possible during code deploys. Ultimately, we had to rewrite them ourselves
> to achieve the same outcomes with less DB unavailability.  There is plenty
> of work the community can do, and the committee can help guide, around
> better ways to change database structure while maintaining as much uptime
> as possible.  If you need more details, I'm happy to bring the folks that
> worked on the rewrite into the conversation.  Both will actually be at the
> summit.
>
>  The bigger surprise - and full disclosure, we learned a lot about the
> things we aren't testing in our deployment pipeline - was the dramatic
> increase in network traffic following the deploy.  The new table
> structures, increased metadata and new queries in this version translated
> to about 10X the amount of data being returned for some queries.  Add to
> that the fact that compute nodes are regularly querying for certain
> information or often performing a "check in", and we saw a 3X (or more)
> increase in network traffic on the management network we have for this
> particular DC (and it's a smaller one as our various deployments go).  For
> now we have improved things slightly by turning off the following periodic
> tasks:
>
>  reboot_timeout
> rescue_timeout
> resize_confirm_window
>
>  Leaving these off has the potential to create some other issues (zombies
> and such), but that can be managed.
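>
>  For anyone wanting to try the same workaround, here is a minimal sketch.
> It assumes the three names above are the nova.conf options of the same
> name, set in [DEFAULT], and that a value of 0 disables each task, as the
> Nova option help text describes.  The exact settings used in this
> deployment were not stated, so check the behavior against your release.
>
> ```ini
> # Assumed nova.conf fragment; not the actual config from this deployment.
> [DEFAULT]
> # 0 = never automatically hard-reboot instances stuck in a rebooting state
> reboot_timeout = 0
> # 0 = never automatically unrescue instances
> rescue_timeout = 0
> # 0 = never automatically confirm resizes
> resize_confirm_window = 0
> ```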
>
>  It does look like the developers are already working on getting some of
> the queries updated:
>
>  https://review.openstack.org/#/c/26136/
> https://review.openstack.org/#/c/26109/
>
>  All in all, I wanted to reach back out to you to follow up from before,
> because I think this particular experience is an excellent highlight that
> there is often a disconnect between some of the changes that come through
> to trunk and use of the code at scale.  Almost everyone who has dealt with
> the above will be in Oregon the week after next, so I'm happy to drag any and
> all into the mix to discuss further.
>
>
Has this discussion been brought up with the developer community? I
definitely feel it's important for the user committee to push on issues
like this, but we should only push on topics that have already gone through
normal developer processes and aren't getting traction.

- Ryan