[placement][sdk] How to debug HTTP 502 errors with placement in DevStack?
We recently merged support for placement traits in openstacksdk. Since then, we've seen an uptick in failures of various functional jobs [1]. The failure is always the same test:

    openstack.tests.functional.placement.v1.test_trait.TestTrait.test_resource_provider_inventory

That test simply creates a new, custom trait and then attempts to list all traits, show an individual trait, and finally delete the trait. The failure occurs during the first step, creation of the custom trait:

    openstack.exceptions.HttpException: HttpException: 502: Server Error for url: https://10.209.100.9/placement/traits/CUSTOM_A982E0BA1C2B4D08BFD6D2594C67831... , Bad Gateway: response from an upstream server.: The proxy server received an invalid: Apache/2.4.52 (Ubuntu) Server at 10.209.100.9 Port 80: Additionally, a 201 Created: 502 Bad Gateway: error was encountered while trying to use an ErrorDocument to handle the request.

I've looked through the various job artefacts and haven't found any smoking guns. I can see placement receive and reply to the request, so it would seem something is happening in between.

*Fortunately*, this is also reproducible locally against a standard devstack deployment by running the following in the openstacksdk repo:

    OS_TEST_TIMEOUT=60 tox -e functional-py310 -- \
        -n openstack/tests/functional/placement/v1/test_trait.py \
        --until-failure

Does anyone have any insight into what could be causing this issue, or suggestions for how we might go about debugging it? As things stand, I haven't a clue 😔

Cheers,
Stephen

[1] https://zuul.opendev.org/t/openstack/builds?job_name=openstacksdk-functional-devstack&project=openstack%2Fdevstack&branch=master&skip=0
On Wed, Aug 2, 2023, at 10:22 AM, Stephen Finucane wrote:
> I've looked through the various job artefacts and haven't found any smoking guns. I can see placement receive and reply to the request so it would seem something is happening in between.
Yes, this appears to be some problem in apache2 (possibly caused by the response, but as far as the backend server is concerned everything is ok). I would increase the log level of the apache server. There are two places to do this: 1) for the https frontend here [2], and 2) for the http wsgi backend here [3]. I think the first file comes from the apache2 package in Ubuntu, so I'm not sure what the best way to modify that is. The https proxy file is configured by devstack/lib/tls in a heredoc, which you can modify for that frontend.

You mention this is locally reproducible, so you may be able to simply edit those files on disk and restart apache2 without needing to modify devstack. Hopefully, extra logging will give a better indication of what is going on.
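For a local deployment, the "edit the files on disk and restart apache2" suggestion could look roughly like the sketch below. It is only an illustration under assumptions: the helper name `set_loglevel` is made up here, the config file name in the trailing comment is a placeholder rather than a path taken from this thread, and `sed -i -E` assumes GNU sed as shipped on Ubuntu. Apache 2.4 also accepts per-module levels in the same directive (for example `LogLevel info proxy:trace5`), which may be less noisy than a global `debug`.

```shell
#!/bin/sh
# Sketch: set (or add) an Apache LogLevel directive in a config file.
# Usage: set_loglevel FILE "LEVEL-SPEC"   (set_loglevel is a hypothetical helper)
set_loglevel() {
    conf="$1"
    level="$2"
    if grep -Eq '^[[:space:]]*LogLevel[[:space:]]' "$conf"; then
        # Rewrite the existing directive in place, keeping its indentation.
        sed -i -E "s|^([[:space:]]*)LogLevel[[:space:]].*|\1LogLevel ${level}|" "$conf"
    else
        # No directive yet: append one at the end of the file.
        printf 'LogLevel %s\n' "$level" >> "$conf"
    fi
}

# Dry run against a scratch file, so nothing under /etc is touched.
scratch="$(mktemp)"
printf '<VirtualHost *:80>\n    LogLevel warn\n</VirtualHost>\n' > "$scratch"
set_loglevel "$scratch" "debug"
cat "$scratch"

# On a real DevStack host (EXAMPLE.conf is a placeholder name), as root:
#   set_loglevel /etc/apache2/sites-enabled/EXAMPLE.conf debug
#   apachectl configtest && systemctl restart apache2
```

Running the config test before the restart avoids taking down every API endpoint behind apache2 if the edit introduced a syntax error.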
[2] https://zuul.opendev.org/t/openstack/build/b37d2aedd1514682b3672c4b732b2717/...
[3] https://zuul.opendev.org/t/openstack/build/b37d2aedd1514682b3672c4b732b2717/...
On Wed, 2023-08-02 at 16:50 -0700, Clark Boylan wrote:
> Yes, this appears to be some problem in apache2 (possibly caused by the response but as far as the backend server is concerned everything is ok). I would increase the log level of the apache server. There are two places to do this: 1) for the https frontend here [2] and 2) for the http wsgi backend here [3]. I think the first file comes from the apache2 package in Ubuntu so I'm not sure what the best way to modify that is. The https proxy file is configured by devstack/lib/tls in a heredoc which you can modify for that frontend.
>
> You mention this is locally reproducible so you may be able to simply edit those files on disk and restart apache2 without needing to modify devstack. Hopefully, extra logging will give a better indication of what is going on.
I gave this a shot today but didn't get anywhere, unfortunately. I've posted full logs from two PUT requests here [1]. Diffing them, they're effectively identical right up until the response is sent. As in the CI, I can't see anything wrong in the Placement logs themselves either.

I tried adding logging at multiple points and the only thing that seemed to "fix" things was adding large logs (dumping the entire request object) in 'placement.wsgi_wrapper', but I don't know if that was a fluke or what.

Back to the drawing board it would seem.

Stephen

[1] https://paste.opendev.org/show/bd6TFQXwj0zmHF0EHpqA/
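Comparing a passing and a failing request is easier once per-request noise is stripped out. The sketch below is one possible way to do that, assuming the captures are Apache 2.4 error-log excerpts in the default layout (`[timestamp] [module:level] [pid N:tid M] [client ADDR:PORT] message`). The function names are invented for illustration, and logs in another format (for example the Placement service log) would need different patterns.

```shell
#!/bin/sh
# normalize_apache_log FILE: print FILE without timestamps, pid/tid and
# client fields, so only the module, level and message remain.
# (Hypothetical helper; assumes the default Apache 2.4 error log layout.)
normalize_apache_log() {
    sed -E \
        -e 's/^\[[^]]*\] //' \
        -e 's/\[pid [0-9]+(:tid [0-9]+)?\] //' \
        -e 's/\[client [^]]*\] //' \
        "$1"
}

# diff_requests GOOD BAD: unified diff of two normalized captures.
# Exit status follows diff: 0 if identical after normalization, 1 if not.
diff_requests() {
    good="$(mktemp)"
    bad="$(mktemp)"
    normalize_apache_log "$1" > "$good"
    normalize_apache_log "$2" > "$bad"
    diff -u "$good" "$bad"
    status=$?
    rm -f "$good" "$bad"
    return "$status"
}

# Usage (file names are placeholders):
#   diff_requests put-201.log put-502.log
```

With the noise removed, whatever lines remain in the diff are the points where the two requests actually diverge.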
participants (2)
- Clark Boylan
- Stephen Finucane