[openstack-dev] [octavia] Sometimes amphoras are not re-created if they are not reached for more than heartbeat_timeout
Michael Johnson
johnsomor at gmail.com
Fri May 4 21:27:53 UTC 2018
I have commented on both of those stories. Thank you for submitting them.
As for the values, this is hard, as those settings depend on a lot of
factors. The default values are targeted towards developers and likely
need to be adjusted for production. We have not yet put together our
deployment guide where we would cover this type of tuning. Sigh, so
much to do and not enough team members.
Here are some comments I can give on those settings:
[health_manager]
failover_threads - This is the maximum number of parallel failovers
each instance (process) of the octavia-healthmanager can process at
the same time. Beyond this number they queue until a thread becomes
available. If your cloud is fairly stable and you have few health
managers, this can be a reasonably low number. Consider the maximum
number of amphorae you would have on a single compute host should it
fail. Also take into account the CPU power available on the health
manager host.
status_update_threads - This is the maximum number of health heartbeat
messages each instance (process) of the octavia-healthmanager can
process at the same time. The more octavia-healthmanagers you have,
the lower this can be. The upper limit on this is related to how fast
your database is processing the updates. Should this number be too
low, the health manager will start logging warnings that you need more
health managers.
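
To make the queueing behaviour of failover_threads concrete, here is a
small illustration. It is not Octavia code: run_failovers and
fake_failover are made-up names, and a plain thread pool stands in for
the health manager. It only shows that work beyond the thread limit
waits for a free thread.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

FAILOVER_THREADS = 2  # stand-in for [health_manager] failover_threads


def run_failovers(amphora_ids, failover_threads, work_seconds=0.05):
    """Run fake failovers on a bounded pool and report peak concurrency."""
    lock = threading.Lock()
    state = {"running": 0, "peak": 0}

    def fake_failover(amphora_id):
        # Count how many fake failovers are in flight at the same time.
        with lock:
            state["running"] += 1
            state["peak"] = max(state["peak"], state["running"])
        time.sleep(work_seconds)  # placeholder for the real failover work
        with lock:
            state["running"] -= 1
        return amphora_id

    # Submissions beyond max_workers queue until a thread becomes free.
    with ThreadPoolExecutor(max_workers=failover_threads) as pool:
        done = list(pool.map(fake_failover, amphora_ids))
    return done, state["peak"]


if __name__ == "__main__":
    ids = ["amp-%d" % i for i in range(8)]
    done, peak = run_failovers(ids, FAILOVER_THREADS)
    print("completed:", len(done), "peak parallel failovers:", peak)
```

With eight amphorae and two threads, all eight complete, but never more
than two run at once. The rest simply wait their turn, which is why a
compute host failure with many amphorae on it takes longer to recover
when this value is low.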
[haproxy_amphora]
build_rate_limit
build_active_retries
These two settings are only used if build rate limiting is enabled
(not by default). This would be set if your Nova infrastructure cannot
handle the rate of instance builds Octavia is asking of it. This will
prioritize instance builds based on the need and will limit the rate
of instance builds Octavia asks Nova for. The only impact to the
Octavia controllers is increased memory utilization if there are a
large number of builds being queued waiting for Nova.
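
As a rough picture of "prioritize and limit", the sketch below queues
build requests and hands out only a limited number at a time, most
urgent first. Everything in it is hypothetical (the BuildQueue class,
the FAILOVER and NORMAL priorities, and the per-batch limit); how
Octavia actually accounts for build_rate_limit is not described here.

```python
import heapq
import itertools

# Hypothetical priorities: the lower number is served first.
FAILOVER, NORMAL = 0, 1


class BuildQueue:
    """Toy priority queue for instance build requests."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # keeps FIFO order within a priority

    def request(self, priority, name):
        heapq.heappush(self._heap, (priority, next(self._order), name))

    def next_batch(self, limit):
        """Hand out at most `limit` queued builds, most urgent first."""
        batch = []
        while self._heap and len(batch) < limit:
            batch.append(heapq.heappop(self._heap)[2])
        return batch

    def __len__(self):
        return len(self._heap)


if __name__ == "__main__":
    queue = BuildQueue()
    queue.request(NORMAL, "lb-a")
    queue.request(FAILOVER, "amp-x")
    queue.request(NORMAL, "lb-b")
    print(queue.next_batch(2), "still queued:", len(queue))
```

The requests left in the queue are what costs the controller memory
while Nova catches up.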
You missed these two:
connection_max_retries
connection_retry_interval
These values are typically adjusted in production environments as the
defaults are tuned for exceedingly slow development systems (virtualbox,
etc.) where booting instances can take up to twenty minutes. They cover
the time between Nova declaring the instance "ACTIVE" and the kernel
finishing its boot inside the instance with the amphora agent running.
The default is to wait 25 minutes. In production you would expect to drop
this number significantly. On a typical cloud this should take less
than thirty seconds, but you should give it some buffer in case a host
is especially busy. Again this depends on the performance of your
cloud.
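
The total wait is simply retries times interval, so the arithmetic can
be sketched as below. The helper names and the safety factor are made
up for illustration, and the pair 300 retries at 5 seconds is an
assumption: it is one combination that gives the 25 minutes mentioned
above, so check the configuration reference for your release.

```python
import math


def boot_wait_budget(max_retries, retry_interval):
    """Worst-case seconds spent waiting for the amphora agent to answer."""
    return max_retries * retry_interval


def retries_for(expected_boot_seconds, retry_interval, safety_factor=4):
    """Retries needed to cover the expected boot time with some headroom.

    safety_factor is an arbitrary illustrative buffer, not a recommendation.
    """
    return math.ceil(expected_boot_seconds * safety_factor / retry_interval)


if __name__ == "__main__":
    # Assumed development-style values: 300 retries, 5 seconds apart.
    print(boot_wait_budget(300, 5) / 60, "minutes")
    # A cloud that boots an amphora in about 30 seconds, checked every 5 s.
    print(retries_for(30, 5), "retries")
```

The point is only that the two values multiply: pick the interval
first, then size the retry count from how long a busy host in your
cloud realistically needs.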
[controller_worker]
workers - This is the number of worker threads pulling user requests
from the oslo messaging queue for each instance of the octavia-worker
process. This number would be tuned depending on the number of worker
controllers you have in your cloud and the rate of user requests
(create, update, delete) that need to be serviced by a worker. GET
calls do not require a worker. This will also be limited by the
controller host CPU and RAM capacities.
amp_active_retries
amp_active_wait_sec
Both of these values depend on the performance of your Nova
environment. This is how many times and how often we check Nova to see
if a requested instance has become "ACTIVE". Unless your Nova
environment is unusually slow, you should not need to change these
values.
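
The "how many times and how often" pair describes a plain polling
loop, roughly like the sketch below. This is not the Octavia
implementation: wait_for_active is a made-up helper and get_status is
a hypothetical callable standing in for the Nova lookup.

```python
import time


def wait_for_active(get_status, retries, wait_sec, sleep=time.sleep):
    """Poll get_status() until it returns "ACTIVE" or the retries run out.

    Returns the number of checks made; raises TimeoutError if the
    instance never reports ACTIVE within `retries` checks.
    """
    for attempt in range(1, retries + 1):
        if get_status() == "ACTIVE":
            return attempt
        if attempt < retries:
            sleep(wait_sec)  # pause between checks, not after the last one
    raise TimeoutError(
        "instance not ACTIVE after %d checks (%d s apart)" % (retries, wait_sec)
    )


if __name__ == "__main__":
    statuses = iter(["BUILD", "BUILD", "ACTIVE"])
    checks = wait_for_active(lambda: next(statuses), retries=5, wait_sec=0)
    print("ACTIVE after", checks, "checks")
```

The worst-case wait is about retries times wait_sec, so only a Nova
that regularly needs longer than that to reach "ACTIVE" calls for
larger values.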
[task_flow]
max_workers - This value limits the parallelism inside the TaskFlow
flows used by the controllers. Currently there is little reason to
adjust this value as the degrees of parallelism in our flows are not
higher than this value. However, when we release Active-Active load
balancers this value will control the number of parallel amphora
builds up to the build limit above.
Michael
On Thu, May 3, 2018 at 1:51 AM, <mihaela.balas at orange.com> wrote:
> Hi Michael,
>
> I built a new amphora image with the latest patches and I reproduced two different bugs that I see in my environment. One of them is similar to the one initially described in this thread. I opened two stories as you advised:
>
> https://storyboard.openstack.org/#!/story/2001960
> https://storyboard.openstack.org/#!/story/2001955
>
> Meanwhile, can you provide some recommendation of values for the following parameters (maybe in relation with number of workers, cores, computes etc)?
>
> [health_manager]
> failover_threads
> status_update_threads
>
> [haproxy_amphora]
> build_rate_limit
> build_active_retries
>
> [controller_worker]
> workers
> amp_active_retries
> amp_active_wait_sec
>
> [task_flow]
> max_workers
>
> Thank you for your help,
> Mihaela Balas
>
> -----Original Message-----
> From: Michael Johnson [mailto:johnsomor at gmail.com]
> Sent: Friday, April 27, 2018 8:24 PM
> To: OpenStack Development Mailing List (not for usage questions)
> Subject: Re: [openstack-dev] [octavia] Sometimes amphoras are not re-created if they are not reached for more than heartbeat_timeout
>
> Hi Mihaela,
>
> I am sorry to hear you are having trouble with the queens release of Octavia. It is true that a lot of work has gone into the failover capability, specifically working around a python threading issue and making it more resistant to certain neutron failure situations (missing ports, etc.).
>
> I know of one open bug against the failover flows, https://storyboard.openstack.org/#!/story/2001481, "failover breaks in Active/Standby mode if both amphroae are down".
>
> Unfortunately the log snippet above does not give me enough information about the problem to help with this issue. From the snippet it looks like the failovers were initiated, but the controllers are unable to reach the amphora-agent on the replacement amphora. It will continue those retry attempts, but eventually will fail the amphora into ERROR if it doesn't succeed.
>
> One thought I have is if you created your amphora image in the last two weeks, you may have built an amphora using the master branch of octavia, which had a bug that impacted active/standby images. This was introduced working around the new pip 10 issues. That patch has been
> fixed: https://review.openstack.org/#/c/564371/
>
> If neither of these situations match your environment, please open a story (https://storyboard.openstack.org/#!/dashboard/stories) for us and include the health manager logs from the point you delete the amphora up until it starts these connection attempts. We will dig through those logs to see what the issue might be.
>
> Michael (johnsom)
>
> On Wed, Apr 25, 2018 at 4:07 AM, <mihaela.balas at orange.com> wrote:
>> Hello,
>>
>>
>>
>> I am testing Octavia Queens and I see that the failover behavior is
>> very much different than the one in Ocata (this is the version we are
>> currently running in production).
>>
>> One example of such behavior is:
>>
>>
>>
>> I create 4 load balancers and after the creation is successful, I shut
>> off all the 8 amphoras. Sometimes, even though the health-manager agent
>> does not reach the amphoras, they are not deleted and re-created. The logs
>> look like shown below even when the heartbeat timeout is long passed.
>> Sometimes the amphoras are deleted and re-created. Sometimes, they
>> are partially re-created: part of them remain shut off.
>>
>> Heartbeat_timeout is set to 60 seconds.
>>
>> [octavia-health-manager-3662231220-nxnt3] 2018-04-25 10:57:26.244 11
>> WARNING octavia.amphorae.drivers.haproxy.rest_api_driver
>> [req-339b54a7-ab0c-422a-832f-a444cd710497 -
>> a5f15235c0714365b98a50a11ec956e7
>> - - -] Could not connect to instance. Retrying.: ConnectionError:
>> HTTPSConnectionPool(host='192.168.0.15', port=9443): Max retries
>> exceeded with url:
>> /0.5/listeners/285ad342-5582-423e-b654-1f0b50d91fb2/certificates/octav
>> iasrv2.orange.com.pem (Caused by
>> NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection
>> object at 0x7f559862c710>: Failed to establish a new connection:
>> [Errno 113] No route to host',))
>>
>> [octavia-health-manager-3662231220-3lssd] 2018-04-25 10:57:26.464 13
>> WARNING octavia.amphorae.drivers.haproxy.rest_api_driver
>> [req-a63b795a-4b4f-4b90-a201-a4c9f49ac68b -
>> a5f15235c0714365b98a50a11ec956e7
>> - - -] Could not connect to instance. Retrying.: ConnectionError:
>> HTTPSConnectionPool(host='192.168.0.14', port=9443): Max retries
>> exceeded with url:
>> /0.5/listeners/a45bdef3-e7da-4a18-9f1f-53d5651efe0f/1615c1ec-249e-4fa8
>> -9d73-2397e281712c/haproxy (Caused by
>> NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection
>> object at 0x7f8a0de95e10>: Failed to establish a new connection:
>> [Errno 113] No route to host',))
>>
>> [octavia-health-manager-3662231220-nxnt3] 2018-04-25 10:57:27.772 11
>> WARNING octavia.amphorae.drivers.haproxy.rest_api_driver
>> [req-10febb10-85ea-4082-9df7-daa48894b004 -
>> a5f15235c0714365b98a50a11ec956e7
>> - - -] Could not connect to instance. Retrying.: ConnectionError:
>> HTTPSConnectionPool(host='192.168.0.19', port=9443): Max retries
>> exceeded with url:
>> /0.5/listeners/96ce5862-d944-46cb-8809-e1e328268a66/fc5b7940-3527-4e9b
>> -b93f-1da3957a5b71/haproxy (Caused by
>> NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection
>> object at 0x7f5598491c90>: Failed to establish a new connection:
>> [Errno 113] No route to host',))
>>
>> [octavia-health-manager-3662231220-nxnt3] 2018-04-25 10:57:34.252 11
>> WARNING octavia.amphorae.drivers.haproxy.rest_api_driver
>> [req-339b54a7-ab0c-422a-832f-a444cd710497 -
>> a5f15235c0714365b98a50a11ec956e7
>> - - -] Could not connect to instance. Retrying.: ConnectionError:
>> HTTPSConnectionPool(host='192.168.0.15', port=9443): Max retries
>> exceeded with url:
>> /0.5/listeners/285ad342-5582-423e-b654-1f0b50d91fb2/certificates/octav
>> iasrv2.orange.com.pem (Caused by
>> NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection
>> object at 0x7f5598520790>: Failed to establish a new connection:
>> [Errno 113] No route to host',))
>>
>> [octavia-health-manager-3662231220-3lssd] 2018-04-25 10:57:34.476 13
>> WARNING octavia.amphorae.drivers.haproxy.rest_api_driver
>> [req-a63b795a-4b4f-4b90-a201-a4c9f49ac68b -
>> a5f15235c0714365b98a50a11ec956e7
>> - - -] Could not connect to instance. Retrying.: ConnectionError:
>> HTTPSConnectionPool(host='192.168.0.14', port=9443): Max retries
>> exceeded with url:
>> /0.5/listeners/a45bdef3-e7da-4a18-9f1f-53d5651efe0f/1615c1ec-249e-4fa8
>> -9d73-2397e281712c/haproxy (Caused by
>> NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection
>> object at 0x7f8a0de953d0>: Failed to establish a new connection:
>> [Errno 113] No route to host',))
>>
>> [octavia-health-manager-3662231220-nxnt3] 2018-04-25 10:57:35.780 11
>> WARNING octavia.amphorae.drivers.haproxy.rest_api_driver
>> [req-10febb10-85ea-4082-9df7-daa48894b004 -
>> a5f15235c0714365b98a50a11ec956e7
>> - - -] Could not connect to instance. Retrying.: ConnectionError:
>> HTTPSConnectionPool(host='192.168.0.19', port=9443): Max retries
>> exceeded with url:
>> /0.5/listeners/96ce5862-d944-46cb-8809-e1e328268a66/fc5b7940-3527-4e9b
>> -b93f-1da3957a5b71/haproxy (Caused by
>> NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection
>> object at 0x7f55984e2050>: Failed to establish a new connection:
>> [Errno 113] No route to host',))
>>
>>
>>
>> Thank you,
>>
>> Mihaela Balas
>>
>> ______________________________________________________________________
>> ___________________________________________________
>>
>> Ce message et ses pieces jointes peuvent contenir des informations
>> confidentielles ou privilegiees et ne doivent donc pas etre diffuses,
>> exploites ou copies sans autorisation. Si vous avez recu ce message
>> par erreur, veuillez le signaler a l'expediteur et le detruire ainsi
>> que les pieces jointes. Les messages electroniques etant susceptibles
>> d'alteration, Orange decline toute responsabilite si ce message a ete
>> altere, deforme ou falsifie. Merci.
>>
>> This message and its attachments may contain confidential or
>> privileged information that may be protected by law; they should not
>> be distributed, used or copied without authorisation.
>> If you have received this email in error, please notify the sender and
>> delete this message and its attachments.
>> As emails may be altered, Orange is not liable for messages that have
>> been modified, changed or falsified.
>> Thank you.
>>
>>
>> ______________________________________________________________________
>> ____ OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe:
>> OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>