[Openstack-operators] Folsom to Grizzly Upgrade Nodes
Jonathan Proulx
jon at jonproulx.com
Fri Sep 20 18:14:54 UTC 2013
On Thu, Sep 19, 2013 at 11:21 PM, Lorin Hochstein
<lorin at nimbisservices.com> wrote:
>
>
> I'd be really interested to hear what these pain points were. Were these just due to quantum/neutron, or because you were migrating from nova-network to quantum?
>
A bit of both. My short opinion: if you have an existing deploy with nova-network, don't change yet; but if you're building fresh, go with neutron/quantum to save the inevitable disruptive transition, and be sure you test at scale before you go live.
I've been holding off on the public ranting because I'm still not
quite sure how much of the issue was me not being ready for the
transition and how much was quantum not being ready. I'll attempt to
tell the whole story here and let others judge.
Background / History:
I've been running a one rack 60 physical node OpenStack deployment
since July 2012 at MIT CSAIL (http://www.csail.mit.edu). Base
operating system is Ubuntu 12.04 LTS, using Puppet (the puppetlabs/openstack modules) for configuration management; we started with Essex and have been tracking the Cloud Archive for Folsom and Grizzly. Typically
the cloud is heavily utilized and very near resource capacity. 900 -
1000 instances is typical with some projects liking to start or stop
several hundred at a time. Networking was nova-network using
flat-dhcp and multihost with a typical setup using an rfc1918 private
network NAT'ed to the public network and available floating IPs.
Essex was essentially the "Alpha" period of our deployment. Since we
weren't promising continuous operation yet and were having issues with
some race conditions in the Essex scheduler, I took the Folsom upgrade as soon as the Cloud Archive packages were available, before the puppet
modules had been fully updated. Unsurprisingly I ran into some
upgrade issues both with adapting configs and some legitimate bugs
that required hand cleanup of some database tables. With all that, no running instances were harmed, and the whole upgrade and debugging took maybe 20-25 hours.
This started our "Beta" phase where we opened the cloud to everyone in
the lab, but with warning stickers all over it saying future
disruptive changes were likely.
I work in a group of 8, but am currently the only one doing OpenStack infrastructure work (a couple of others are users of the cloud and provide some operational support like creating new projects and quota updates), so any times mentioned are for work done by one person, and the longer the time, the greater the sleep deprivation and the lower the efficiency of that work.
The Plan:
This was planned to be the transition from "Beta" to "General Availability" and involved several fairly major reconfigurations to meet needs we'd identified in the first year of operations. Specifically: reconfiguring the networking to connect instances directly to existing lab VLANs, both to get NAT out of the picture and to allow easier migration of legacy applications ("pets") into the newer cloud resource; moving from the simple Cinder Linux LVM backend we'd used in testing to a vendor-specific SAN backend to leverage existing enterprise storage; and combining host aggregates and instance types to provide different scheduling zones for computationally intensive compute instances and more over-schedulable web app and testing instances.
I planned the reconfig in two main phases: an upgrade in place with nova-network, followed by the transition to quantum for networking. I scheduled a week of total downtime with all instances offline.
Phase One - straight upgrade:
This was about as uneventful for me as what Joe described. I did find that the introduction of nova-conductor was a serious bottleneck at first, but it was trivially solved by launching some nova-conductor instances within the cloud. To meet my aggregate scheduler needs I also grabbed core_filter.py from Havana because I personally needed the AggregateCoreFilter, which was a simple drop-in and it worked, and for my EqualLogic SAN I needed to backport eqlx.py from https://review.openstack.org/#/c/43944, which took a little more work, but both of those are highly site-specific needs.
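For anyone curious, the aggregate scheduling piece boils down to enabling the filter and tagging aggregates; roughly like this (filter list, aggregate names, hosts and ratios here are just illustrative of the idea, not a recipe):

# nova.conf on the scheduler: make sure AggregateCoreFilter is in the list
scheduler_default_filters = RetryFilter,AvailabilityZoneFilter,RamFilter,ComputeFilter,AggregateCoreFilter

# tag an aggregate with its own CPU overcommit and add hosts to it
nova aggregate-create webapps
nova aggregate-set-metadata <aggregate-id> cpu_allocation_ratio=8.0
nova aggregate-add-host <aggregate-id> compute-01

AggregateCoreFilter reads cpu_allocation_ratio from the aggregate metadata, so different aggregates can be over- or under-committed independently.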
Phase Two - nova-network -> quantum upgrade:
The plan here was to deploy quantum using the Open vSwitch plugin with VLAN-based provider networks and GRE-based project-private networks. The initial provider network was the same VLAN the old floating IPs had been on; additional provider VLANs were to be added later (and since have been) for legacy app migration. We didn't have a use plan for the user-created GRE networks, but they were easy to provide, and some projects are using them now.
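For reference, the plugin side of that is roughly the following in ovs_quantum_plugin.ini (the physnet name, bridge, tunnel range and IP are illustrative, not my real values):

[OVS]
tenant_network_type = gre
enable_tunneling = True
tunnel_id_ranges = 1:1000
local_ip = 10.1.2.3
# provider VLANs come in trunked on the physical network "physnet1"
network_vlan_ranges = physnet1
bridge_mappings = physnet1:br-vlan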
I had initially wanted to use my existing non-OpenStack DHCP infrastructure, and rather wish I could, since all of my ongoing troubles with quantum centre on the dhcp-agent; but obviously if OpenStack doesn't control the DHCP it can't do fixed address assignment or even tell what IP an instance is assigned.
I'd initially marked the provider network as external, since it is a publicly addressable network with a proper router outside OpenStack. I'm still not sure if it's strictly necessary, but I couldn't get DHCP to work until I made it an internal network. There was also a bit of confusion around my OVS bridges and which ports got attached to which bridge. I was getting rather sleep deprived at this point, so my notes about how things got from point A to point B are not so good. Once I tracked down all the missing bits to get things plumbed together properly, I went back to update the docs, and the steps I had missed for adding interfaces to bridges were in fact in the Network Administrators Guide I'd been using, hence my reluctance to complain too loudly.
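For what it's worth, the shape of what finally worked for the provider network was something like this (name, VLAN ID and addresses are made up for the example; note the network is shared but not flagged router:external):

quantum net-create lab-vlan --shared \
  --provider:network_type vlan \
  --provider:physical_network physnet1 \
  --provider:segmentation_id 2001
quantum subnet-create lab-vlan 203.0.113.0/24 --name lab-vlan-subnet \
  --gateway 203.0.113.1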
Problems with running quantum:
I'm a bit more sure of what didn't work after getting quantum set up. Most of my quantum issues are scaling issues of various sorts, which really surprised me since I don't think my current scale is very large. That may be the problem: in proof-of-concept and developer-size systems the scaling doesn't matter, and at really large scale horizontal scale-out is already required. At my size, my single controller node (dual-socket hex-core with 48G RAM) typically runs at about 10-20% capacity now (it was around 5% pre-quantum), with peaks under extreme load brushing 50%.
The first issue that became apparent was that some instances would be assigned multiple quantum ports (https://bugs.launchpad.net/ubuntu/+bug/1160442). The bug report shows this happens at 128 instances started concurrently but not at 64; I was seeing it starting around 10. The bug reporter has 8 compute hosts; my theory is it was worse for me because I had more quantum clients running in parallel. I applied the patch that closed that bug, which provides the following in the ovs-agent config (defaults in comments, my settings uncommented):
# Maximum number of SQL connections to keep open in a QueuePool in SQLAlchemy
# sqlalchemy_pool_size = 5
sqlalchemy_pool_size = 24
# sqlalchemy_max_overflow = 10
sqlalchemy_max_overflow = 48
# Example sqlalchemy_pool_timeout = 30
sqlalchemy_pool_timeout = 2
I believe this solved the multiple port problem, but then at a slightly higher scale, still under 50 concurrent starts, quantum port creations would time out and the aborted instances would go into an error state. This seemed to be caused by serialization in both keystone and quantum-server. In the upgrade keystone had been reconfigured to use PKI tokens stored in MySQL; moving back to UUID tokens in memcache helped a lot, but not enough.
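Roughly, the keystone change was the following in keystone.conf (section and backend names as I recall them for Grizzly, so double check against your release):

[signing]
token_format = UUID

[token]
driver = keystone.token.backends.memcache.Token

[memcache]
servers = localhost:11211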
Peter Feiner's blog post on parallel performance at http://blog.gridcentric.com/bid/318277/Boosting-OpenStack-s-Parallel-Performance got me most of the rest of the way, particularly the multi-worker patches for keystone and quantum-server, which I took from his links. A multi-worker patch for quantum-server is under review at https://review.openstack.org/#/c/37131 (I'm beginning to worry it won't make it into Havana), and there's also a review for the keystone-all piece at https://review.openstack.org/#/c/42967/ which I believe is being held for Icehouse.
At this point I could start hundreds of instances and they would all have the proper number of quantum ports assigned and end in "Active" state, but many (most) of them would never get their address via DHCP. After some investigation I saw that while quantum had the MAC and IP assignments for the ports, dnsmasq never got them. The incidence of this fault seemed to vary not only with the number of concurrent starts but also with the number of running instances. After much hair pulling I was able to mitigate this somewhat by increasing the default DHCP lease time from 2min to 30min (which is much more reasonable IMHO) and by increasing "agent_downtime" in quantum.conf from a default of 5sec to 60sec. Even with that we still occasionally see this crop up, but it's infrequent enough that I've been telling people to "try again and hope it's better in the next release".
Another small issue we're having is that if you assign a specific fixed IP to an instance and then tear it down, you need to wait until its DHCP lease expires before you can launch another instance with that IP.
Looking Forward:
I'm optimistic about Havana: it should have most (but certainly not all) of what I needed to club into Grizzly to make it go, and there seems to have been significant work around DHCP lease updates, which I hope makes things better there, at least in terms of immediately releasing IPs if nothing else.
Conclusion:
Blame Lorin, he asked :)
-Jon