[Openstack-operators] [OCTAVIA][QUEENS][KOLLA] - Amphora to Health-manager invalid UDP heartbeat.

Gaël THEROND gael.therond at gmail.com
Wed Oct 24 12:06:06 UTC 2018


Hi Michael,

Thanks a lot for all those details regarding the transitions between the
different states. Indeed, as you said, my LB moved from PENDING_UPDATE to
ACTIVE, but it still showed an OFFLINE status this morning because the HM
was still dropping the UDP heartbeat packets it received.

When I mentioned the HealthManager reaching the amphora on port 9443, I
of course didn't mean that it uses the heartbeat key for that.


I just had a look at my amphora and Octavia CP (Control Plane) versions;
they seem a little out of sync, as my amphora agent reports *%prog
3.0.0.0b4.dev6* while my Octavia CP services report *%prog 2.0.1*.

I've just updated to stable/rocky this morning and so jumped to *%prog 3.0.1*.
I'll check whether I still encounter the issue, but for now it seems to have
vanished, as I now see the following messages:

2018-10-24 11:58:54.620 24 DEBUG futurist.periodics [-] Submitting periodic callback 'octavia.cmd.health_manager.periodic_health_check' _process_scheduled /usr/lib/python2.7/site-packages/futurist/periodics.py:639
2018-10-24 11:58:57.620 24 DEBUG futurist.periodics [-] Submitting periodic callback 'octavia.cmd.health_manager.periodic_health_check' _process_scheduled /usr/lib/python2.7/site-packages/futurist/periodics.py:639
2018-10-24 11:59:00.620 24 DEBUG futurist.periodics [-] Submitting periodic callback 'octavia.cmd.health_manager.periodic_health_check' _process_scheduled /usr/lib/python2.7/site-packages/futurist/periodics.py:639
2018-10-24 11:59:03.620 24 DEBUG futurist.periodics [-] Submitting periodic callback 'octavia.cmd.health_manager.periodic_health_check' _process_scheduled /usr/lib/python2.7/site-packages/futurist/periodics.py:639
2018-10-24 11:59:04.557 23 DEBUG octavia.amphorae.drivers.health.heartbeat_udp [-] Received packet from ('172.27.201.105', 48342) dorecv /usr/lib/python2.7/site-packages/octavia/amphorae/drivers/health/heartbeat_udp.py:187
2018-10-24 11:59:04.619 45 DEBUG octavia.controller.healthmanager.health_drivers.update_db [-] Health Update finished in: 0.0600640773773 seconds update_health /usr/lib/python2.7/site-packages/octavia/controller/healthmanager/health_drivers/update_db.py:93

I'll keep you posted on my further investigation, but so far the issue
seems to be resolved. I'll also tune the timeouts down a bit, as my LB
takes a very long time to create listeners/pools and reach an online
status.
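
For reference, the kind of override I have in mind looks like this. The
option names under [haproxy_amphora] are my assumption based on the
documented defaults (300 retries x 5 s matches the 25 minutes Michael
mentions below), so they should be double-checked against the deployed
release:

    # /etc/octavia/octavia.conf on the controllers; values are illustrative
    [haproxy_amphora]
    # defaults are 300 retries x 5 s interval, i.e. up to 25 minutes
    connection_max_retries = 120
    connection_retry_interval = 5
    # REST calls from the controller to the amphora agent
    rest_request_conn_timeout = 10
    rest_request_read_timeout = 60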

Thanks a lot!

On Tue, Oct 23, 2018 at 7:09 PM Michael Johnson <johnsomor at gmail.com>
wrote:

> Are the controller and the amphora using the same version of Octavia?
>
> We had a python3 issue where we had to change the HMAC digest used. If
> your controller is running an older version of Octavia than your
> amphora images, it may not have the compatibility code to support the
> new format.  The compatibility code is here:
>
> https://github.com/openstack/octavia/blob/master/octavia/amphorae/backends/health_daemon/status_message.py#L56
>
> There is also a release note about the issue here:
> https://docs.openstack.org/releasenotes/octavia/rocky.html#upgrade-notes
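>
> As a rough sketch of the idea (not the exact code from status_message.py;
> the tail sizes and layout here are my paraphrase), the fixed controller
> tries both digest formats before dropping the packet:
>
>     import hashlib
>     import hmac
>
>     def check_hmac(envelope, key):
>         """Return the data part of a heartbeat envelope or raise."""
>         # Older amphorae append the raw 32-byte SHA-256 digest; newer
>         # (python3-safe) ones append the 64-byte ASCII hexdigest.
>         for tail_len, hex_fmt in ((32, False), (64, True)):
>             data, tail = envelope[:-tail_len], envelope[-tail_len:]
>             calc = hmac.new(key, data, hashlib.sha256)
>             calc = calc.hexdigest().encode() if hex_fmt else calc.digest()
>             if hmac.compare_digest(calc, tail):
>                 return data
>         raise ValueError('calculated hmac not equal to msg hmac')
>
> Incidentally, the "msg hmac" bytes in your log (61376133...) unhexlify to
> the ASCII characters "a7a371d2...", i.e. a hex-encoded digest, which points
> at exactly this kind of format mismatch.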
>
> If that is not the issue, I would double check the heartbeat_key in
> the health manager configuration files and inside one of the amphora.
>
> Note, that this key is only used for health heartbeats and stats, it
> is not used for the controller to amphora communication on port 9443.
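>
> For reference, the key lives in the [health_manager] section on both
> sides; something like the following, with the usual file locations (they
> may differ in a Kolla deployment):
>
>     # /etc/octavia/octavia.conf on the controllers, and
>     # /etc/octavia/amphora-agent.conf inside the amphora
>     [health_manager]
>     heartbeat_key = <the same secret on both ends>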
>
> Also, load balancers cannot get "stuck" in PENDING_* states unless
> someone has killed the controller process that was actively working on
> that load balancer. By killed I mean a non-graceful shutdown of the
> process that was in the middle of working on the load balancer.
> Otherwise all code paths lead back to ACTIVE or ERROR status after it
> finishes the work or gives up retrying the requested action. Check
> your controller logs to make sure this load balancer is not still
> being worked on by one of the controllers. The default retry timeouts
> are very long (some add up to 25 minutes; it will keep trying to
> accomplish the requested action) to accommodate very slow hosts (e.g.
> VirtualBox) and the test gates. You will want to tune those down for a
> production deployment.
>
> Michael
>
> On Tue, Oct 23, 2018 at 7:09 AM Gaël THEROND <gael.therond at gmail.com>
> wrote:
> >
> > Hi guys,
> >
> > I'm finishing up my POC for Octavia, and after solving a few issues
> > with my configuration I'm close to getting a properly working setup.
> > However, I'm facing a small but annoying bug: the health-manager
> > receives the amphora heartbeat UDP packets, considers them invalid,
> > and so drops them.
> >
> > Here are the messages that can be found in the logs:
> >
> > 2018-10-23 13:53:21.844 25 WARNING octavia.amphorae.backends.health_daemon.status_message [-] calculated hmac: faf73e41a0f843b826ee581c3995b7f7e56b5e5a294fca0b84eda426766f8415 not equal to msg hmac: 6137613337316432636365393832376431343337306537353066626130653261 dropping packet
> >
> > This comes from this part of the HM code:
> >
> > https://docs.openstack.org/octavia/pike/_modules/octavia/amphorae/backends/health_daemon/status_message.html#get_payload
> >
> > The annoying thing is that I don't get why the UDP packet is considered
> > stale, nor how I can reproduce the payload that is sent to the
> > HealthManager.
> > I'm willing to write a simple Python program to simulate the heartbeat
> > payload, but I don't know exactly what the message contains and I think
> > I'm missing some information.
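> >
> > For the record, this is the kind of simulation I have in mind. It's only
> > a sketch based on my reading of status_message.py; my assumption is that
> > the envelope is zlib-compressed JSON followed by an HMAC-SHA256 tail, and
> > the field names, address and UDP port 5555 are placeholders:
> >
> >     import hashlib
> >     import hmac
> >     import json
> >     import socket
> >     import zlib
> >
> >     KEY = b'my-heartbeat-key'  # must match heartbeat_key on both sides
> >     HM_ADDR = ('<health-manager-ip>', 5555)  # HM UDP endpoint
> >
> >     # Minimal fake status message; a real amphora reports its id,
> >     # sequence number and per-listener statistics.
> >     msg = {'id': 'fake-amphora-id', 'seq': 1, 'listeners': {}}
> >
> >     data = zlib.compress(json.dumps(msg).encode('utf-8'))
> >     envelope = data + hmac.new(KEY, data, hashlib.sha256).digest()
> >
> >     sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
> >     sock.sendto(envelope, HM_ADDR)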
> >
> > Both the HealthManager and the amphora use the same heartbeat_key, and
> > they can reach each other over the network, as the initial Health-manager
> > to amphora connection on port 9443 is validated.
> >
> > As a side effect of this situation, my load balancer is stuck in the
> > PENDING_UPDATE state.
> >
> > Do you have any idea how I can handle this, or has anyone else already
> > seen something similar?
> >
> > Kind regards,
> > G.
>