[openstack-dev] [Neutron] DHCP Agent Reliability
marun at redhat.com
Wed Dec 4 03:01:13 UTC 2013
On Dec 4, 2013, at 1:47 AM, Stephen Gran <stephen.gran at theguardian.com> wrote:
> On 03/12/13 16:08, Maru Newby wrote:
>> I've been investigating a bug that is preventing VMs from receiving IP addresses when a Neutron service is under high load:
>> High load causes the DHCP agent's status updates to be delayed, causing the Neutron service to assume that the agent is down. This results in the Neutron service not sending notifications of port addition to the DHCP agent. At present, the notifications are simply dropped. A simple fix is to send notifications regardless of agent status. Does anybody have any objections to this stop-gap approach? I'm not clear on the implications of sending notifications to agents that are down, but I'm hoping for a simple fix that can be backported to both havana and grizzly (yes, this bug has been with us that long).
>> Fixing this problem for real, though, will likely be more involved. The proposal to replace the current wsgi framework with Pecan may increase the Neutron service's scalability, but should we continue to use a 'fire and forget' approach to notification? Being able to track the success or failure of a given action outside of the logs would seem pretty important, and allow for more effective coordination with Nova than is currently possible.
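To make the failure mode concrete, here is a minimal sketch of the behavior described above. It is not Neutron's actual code: the `Agent` class, the `DOWN_AFTER_SECONDS` threshold, the `"port_added"` event name, and the `require_alive` switch are all hypothetical names chosen for illustration. It only shows how a delayed heartbeat turns into a dropped notification, and how the stop-gap (send regardless of status) changes that.

```python
from dataclasses import dataclass, field

# Hypothetical threshold: an agent whose last status update is older than
# this is assumed to be down. The real value and option name may differ.
DOWN_AFTER_SECONDS = 30.0


@dataclass
class Agent:
    host: str
    last_heartbeat: float                      # time of last status update
    inbox: list = field(default_factory=list)  # stands in for the RPC queue


def is_alive(agent, now, down_after=DOWN_AFTER_SECONDS):
    """An agent counts as alive only if its heartbeat is recent enough."""
    return now - agent.last_heartbeat <= down_after


def notify_port_added(agent, port_id, now, require_alive=True):
    """Return True if the notification was sent, False if it was dropped.

    require_alive=True models the current behavior: a late heartbeat makes
    the agent look down, so the notification is silently dropped.
    require_alive=False models the proposed stop-gap: always send.
    """
    if require_alive and not is_alive(agent, now):
        return False
    agent.inbox.append(("port_added", port_id))
    return True


# A healthy agent whose heartbeat was delayed by server load.
agent = Agent(host="dhcp-1", last_heartbeat=0.0)
dropped = notify_port_added(agent, "port-1", now=100.0)
sent = notify_port_added(agent, "port-1", now=100.0, require_alive=False)
```

In this sketch the first call drops the notification even though the agent is running, which is the situation that leaves a VM without an address. The second call delivers it, at the cost of also queueing messages for agents that really are down.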
> It strikes me that we ask an awful lot of a single neutron-server instance - it has to take state updates from all the agents, it has to do scheduling, it has to respond to API requests, and it has to communicate about actual changes with the agents.
> Maybe breaking some of these out the way nova has a scheduler and a conductor and so on might be a good model (I know there are things people are unhappy about with nova-scheduler, but imagine how much worse it would be if it was built into the API).
> Doing all of those tasks, and doing them largely single-threaded, is just asking for overload.
I'm sorry if it wasn't clear in my original message, but my primary concern lies with the reliability rather than the scalability of the Neutron service. Carl's addition of multiple workers is a good stop-gap to minimize the impact of blocking IO calls in the current architecture, and we already have consensus on the need to separate RPC and WSGI functions as part of the Pecan rewrite. I am worried, though, that we are not being sufficiently diligent in how we manage state transitions through notifications. Managing transitions and their associated error states is needlessly complicated by the current ad-hoc approach, and I'd appreciate input from distributed systems experts as to how we could do better.
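As a rough illustration of what more disciplined state management might look like, here is a small sketch that tracks each notification through an explicit set of states instead of firing and forgetting. Everything in it is an assumption made for the example: the state names, the `ALLOWED` transition table, and the `NotificationTracker` class do not correspond to anything in Neutron today.

```python
from enum import Enum


class State(Enum):
    PENDING = "pending"  # recorded, not yet handed to the message bus
    SENT = "sent"        # handed to the bus, no reply from the agent yet
    ACKED = "acked"      # agent confirmed it applied the change
    FAILED = "failed"    # send failed or agent reported an error


# Hypothetical transition table: anything not listed here is rejected.
ALLOWED = {
    State.PENDING: {State.SENT, State.FAILED},
    State.SENT: {State.ACKED, State.FAILED},
    State.FAILED: {State.PENDING},  # retry
    State.ACKED: set(),             # terminal
}


class NotificationTracker:
    """Keeps the state of each notification queryable outside the logs."""

    def __init__(self):
        self._states = {}

    def create(self, notification_id):
        if notification_id in self._states:
            raise ValueError(f"duplicate notification: {notification_id}")
        self._states[notification_id] = State.PENDING

    def transition(self, notification_id, new_state):
        current = self._states[notification_id]
        if new_state not in ALLOWED[current]:
            raise ValueError(
                f"illegal transition {current.name} -> {new_state.name}"
            )
        self._states[notification_id] = new_state

    def state(self, notification_id):
        return self._states[notification_id]

    def unresolved(self):
        """IDs of notifications that have not been acknowledged."""
        return sorted(
            nid for nid, s in self._states.items() if s is not State.ACKED
        )


tracker = NotificationTracker()
tracker.create("port-1-added")
tracker.transition("port-1-added", State.SENT)
tracker.transition("port-1-added", State.ACKED)
```

The point of the table is that error states become first-class: a caller such as Nova could ask whether a given change was acknowledged or failed, and a retry is an explicit transition rather than something inferred from log output.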