[sdk] openstacksdk 0.99.0 breaks swift object header setting
Hello,

Last week the OpenDev team upgraded Zuul, which deployed zuul-executors with openstacksdk==0.99.0 installed. After this, test job logs uploaded to the Rackspace Swift location were no longer viewable in the Zuul dashboard. The reason is that these objects no longer had properly configured CORS headers. The objects uploaded to OVH appear to have been fine (potentially because the headers are not required for OVH).

Today we downgraded the version of openstacksdk on the zuul-executors to 0.61.0, and the Rackspace-uploaded job log objects have functional CORS headers again. For this reason we're fairly confident that something about the 0.99.0 release is affecting the setting of these headers.

The code that does these uploads with openstacksdk can be found here [0]. I suspect the problem is related to the header set on line 227 [1]. In particular, that header appears to be Rackspace specific, and this issue is Rackspace specific under 0.99.0. However, I wasn't able to find anything in openstacksdk that would have a problem with this. Does anyone else have ideas? I'm somewhat concerned that this represents a class of data loss (metadata going missing) for Swift object uploads.

Additionally, we rely on setting the x-delete-after header to time out, expire, and prune our logs in Swift [2]. If there is a general problem with 0.99.0 setting headers on objects, it is possible that all of the objects we created using 0.99.0 do not have this metadata set and will leak in their Swift containers.

[0] https://opendev.org/zuul/zuul-jobs/src/commit/e69d879caecb454c529a7d757b80ae...
[1] https://opendev.org/zuul/zuul-jobs/src/commit/e69d879caecb454c529a7d757b80ae...
[2] https://opendev.org/zuul/zuul-jobs/src/commit/e69d879caecb454c529a7d757b80ae...
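For reference, the failing operation boils down to something like the following minimal sketch of the cloud-layer create_object call the docs [3] describe. The container name, object name, header values, and the cloud name 'rax' are illustrative assumptions, not the exact values from the zuul-jobs role:

    # Hedged sketch: upload a log object and pass the headers discussed in
    # this thread through openstacksdk's cloud layer. Under 0.99.0 these
    # non-prefixed headers were apparently filtered out rather than sent.
    import openstack

    conn = openstack.connect(cloud='rax')  # cloud name is an assumption

    headers = {
        # CORS header the Zuul dashboard needs on Rackspace-hosted logs:
        'access-control-allow-origin': '*',
        # expire the object so logs are pruned automatically (30 days):
        'x-delete-after': '2592000',
    }
    conn.create_object(
        'logs',                  # container (illustrative)
        'job-output.txt',        # object name (illustrative)
        filename='job-output.txt',
        **headers,
    )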
Hey,

https://review.opendev.org/c/openstack/openstacksdk/+/844414 should address the issue. For R1 of the SDK we want to ensure all interfaces of the SDK use the same code path, not three different ones as before. So with the latest changes we now rely on the regular mechanism that knows how to handle each known parameter. All of the arbitrary parameters in Swift are managed as container/object metadata and should get X-Container-Meta- or X-Object-Meta- prefixes in order to be accepted and returned back by vanilla Swift. This is sadly not the case for Rax, as you rightfully mention.

So for now I have added the header required by Rax to the “known” exceptions and added an explicit unit test for it. It would be good to find a way to test this against Rax for real, to ensure the particular Zuul use case keeps working.

Generally we need to think about whether the SDK should continue the “tell the user they are wrong before reaching out to the server” approach or switch to the “believe the user and let the server reject” approach. This here is precisely one of those cases. Alternatively we would need to widen our concept of vendors and move all of the Rax-specific logic over there. We surely have a lot of OpenStack-based clouds which do not implement everything in a compatible way. So in order to make the SDK/CLI work with all of them, we either drop most of the validation logic in the SDK or really start implementing cloud-specific overrides as part of “extensions”.

Artem
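To illustrate the prefixing rule described above, here is a simplified sketch. This is an illustration of the behaviour, not the SDK's actual resource-layer code, and the exact set of known headers is an assumption:

    # Arbitrary keys become object metadata; only "known" headers pass
    # through untouched. Rax's CORS header has to be in the known set,
    # otherwise it is rewritten and the cloud never sees the real header.
    KNOWN_HEADERS = {
        'content-type',
        'x-delete-after',
        'x-delete-at',
        'access-control-allow-origin',  # the Rax exception added in 844414
    }

    def normalize_object_headers(headers):
        result = {}
        for key, value in headers.items():
            k = key.lower()
            if k in KNOWN_HEADERS or k.startswith('x-object-meta-'):
                result[k] = value
            else:
                # anything unknown is treated as object metadata
                result['x-object-meta-' + k] = value
        return result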
On Thu, Jun 2, 2022, at 3:44 AM, Artem Goncharov wrote:
Hey,
https://review.opendev.org/c/openstack/openstacksdk/+/844414 should address the issue. For R1 of the SDK we want to ensure all interfaces of the SDK use the same code path, not three different ones as before. So with the latest changes we now rely on the regular mechanism that knows how to handle each known parameter. All of the arbitrary parameters in Swift are managed as container/object metadata and should get X-Container-Meta- or X-Object-Meta- prefixes in order to be accepted and returned back by vanilla Swift. This is sadly not the case for Rax, as you rightfully mention.
Note that the docs [3] still say that headers are passed through, which isn't quite the case in 0.99.0: only headers the SDK thinks are valid are passed through. This change should probably update the docs too.
So for now I have added the header required by Rax to the “known” exceptions and added an explicit unit test for it. It would be good to find a way to test this against Rax for real, to ensure the particular Zuul use case keeps working.
Probably the simplest thing is to have a test that checks the right headers are sent. Then, if/when clouds change behavior, we can update the tests to follow.
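A self-contained sketch of that kind of check, asserting that the headers actually reach the wire. This uses requests_mock against plain requests for illustration; the URL is made up and this is not the SDK's real test suite:

    import requests
    import requests_mock

    def test_headers_reach_the_wire():
        url = 'https://swift.example.com/v1/AUTH_x/logs/job-output.txt'
        with requests_mock.Mocker() as m:
            m.put(url, status_code=201)
            requests.put(url, data=b'log data', headers={
                'access-control-allow-origin': '*',
                'x-delete-after': '2592000',
            })
            sent = m.last_request.headers
            # if the client filtered these out, the asserts would fail
            assert sent['access-control-allow-origin'] == '*'
            assert sent['x-delete-after'] == '2592000'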
Generally we need to think about whether the SDK should continue the “tell the user they are wrong before reaching out to the server” approach or switch to the “believe the user and let the server reject” approach. This here is precisely one of those cases. Alternatively we would need to widen our concept of vendors and move all of the Rax-specific logic over there. We surely have a lot of OpenStack-based clouds which do not implement everything in a compatible way. So in order to make the SDK/CLI work with all of them, we either drop most of the validation logic in the SDK or really start implementing cloud-specific overrides as part of “extensions”.
The problem with telling the user they are wrong before talking to the server is that it assumes clouds don't do weird things. History has shown us that clouds definitely do weird things, and this isn't related to any single cloud. Off the top of my head I can come up with a few examples, and if we look through the git history of nodepool and opendev system-config I'm sure we'll find many, many more occurrences. This isn't a specific cloud's problem either; they all do it one way or another.

It is my opinion that tools like openstacksdk should be forgiving in what they accept, to enable users to use OpenStack in its many forms. Shade is a great example of this. Eventually shade was rolled into openstacksdk and doesn't really exist anymore. It feels like the spirit of shade has been lost, though, and we're regressing back to 2014, when you couldn't talk to two different clouds without 4 different client tools.

Similarly, support for old versions of APIs has been ripped out of these tools. Unfortunately, some providers are still deploying old versions of these clouds, and ideally, as a user, I wouldn't need to think too hard about that. It isn't my fault this is the situation; I just want the software to work.
[3] https://docs.openstack.org/openstacksdk/latest/user/connection.html#openstack.connection.Connection.create_object
Hey Clark,

Please find comments below. I need to justify the product that I maintain, but I perfectly understand you.
On 2. Jun 2022, at 18:22, Clark Boylan <cboylan@sapwetik.org> wrote:
On Thu, Jun 2, 2022, at 3:44 AM, Artem Goncharov wrote:
Hey,
https://review.opendev.org/c/openstack/openstacksdk/+/844414 should address the issue. For R1 of the SDK we want to ensure all interfaces of the SDK use the same code path, not three different ones as before. So with the latest changes we now rely on the regular mechanism that knows how to handle each known parameter. All of the arbitrary parameters in Swift are managed as container/object metadata and should get X-Container-Meta- or X-Object-Meta- prefixes in order to be accepted and returned back by vanilla Swift. This is sadly not the case for Rax, as you rightfully mention.
Note that the docs [3] still say that headers are passed through, which isn't quite the case in 0.99.0: only headers the SDK thinks are valid are passed through. This change should probably update the docs too.
Partially yes, partially no. The documentation says it will pass them, and the software does that. It passes, however, only what it knows would be accepted. I see the point here and will try to address it, but the issue is that we generally have a mixture of plain headers and metadata properties (which are also implemented as headers). I will try to reimplement the current handling to better guess what the user means.
So for now I have added the header required by Rax to the “known” exceptions and added an explicit unit test for it. It would be good to find a way to test this against Rax for real, to ensure the particular Zuul use case keeps working.
Probably the simplest thing is to have a test that checks the right headers are sent. Then, if/when clouds change behavior, we can update the tests to follow.
This is how I implemented it, but it is only a unit test. Here I was rather wondering whether there is a way to have a functional test for this, to save all of us from another analysis session on what the heck is going on. The point is that currently we only find out that a cloud changed behaviour by noticing that Zuul fails (as in the current issue).
Generally we need to think about whether the SDK should continue the “tell the user they are wrong before reaching out to the server” approach or switch to the “believe the user and let the server reject” approach. This here is precisely one of those cases. Alternatively we would need to widen our concept of vendors and move all of the Rax-specific logic over there. We surely have a lot of OpenStack-based clouds which do not implement everything in a compatible way. So in order to make the SDK/CLI work with all of them, we either drop most of the validation logic in the SDK or really start implementing cloud-specific overrides as part of “extensions”.
The problem with telling the user they are wrong before talking to the server is that it assumes clouds don't do weird things. History has shown us that clouds definitely do weird things, and this isn't related to any single cloud. Off the top of my head I can come up with a few examples, and if we look through the git history of nodepool and opendev system-config I'm sure we'll find many, many more occurrences. This isn't a specific cloud's problem either; they all do it one way or another.
It is my opinion that tools like openstacksdk should be forgiving in what they accept, to enable users to use OpenStack in its many forms. Shade is a great example of this. Eventually shade was rolled into openstacksdk and doesn't really exist anymore. It feels like the spirit of shade has been lost, though, and we're regressing back to 2014, when you couldn't talk to two different clouds without 4 different client tools.
Similarly, support for old versions of APIs has been ripped out of these tools. Unfortunately, some providers are still deploying old versions of these clouds, and ideally, as a user, I wouldn't need to think too hard about that. It isn't my fault this is the situation; I just want the software to work.
I perfectly see your point. To some extent you are right, but let me describe the reasoning for all of the changes we have now:

- Zuul is using one interface of the SDK.
- Ansible is using another interface (moreover, it mixes 3 different interfaces).
- OSC is either not using the SDK or using one of the 3 mentioned interfaces.
- Those interfaces try to do the same thing, but end up doing it differently.
- Vanilla services evolve and require that we permanently add new features to all of those interfaces simultaneously.
- The amount of contributions is incredibly low, which makes it impossible to maintain multiple interfaces for doing the same job.
- So what I am trying to do now is to have all of the mentioned tools rely on a single code path, to be able to maintain it more easily and reduce the number of questions like “hmm, it works if I use OSC, but it doesn't when I use Ansible. What is wrong?”. The target is exactly to reduce the number of those “different tools for different tasks and/or clouds” you mention, and instead to have a single tool that works for every cloud. We do not take care of shade any more. We are trying to deprecate all of the individual clients precisely so that a single OSC is able to do everything, and it should just know all of those corner cases. We are trying to get rid of as much “reimplementation” in the Ansible modules as possible.
- Yes, maybe the idea of warning the user that a request will fail before we reach the server was wrong, but you can only judge from experience (when clouds behave differently). The idea as such is not bad, has its right to live, and generally aims to be more user friendly than a server response that just returns a 409 with “good luck figuring out what is wrong”. I have spent days trying to find the issue in my request with only a 409 “Bad Request” back. And when the target clouds differ, you as a user are left with even more problems. Generally the idea of the SDK is to save the user from even knowing about the differences between the clouds they want to talk to. Without that, the whole burden is on the user, and you need to have all of those cloud hacks directly in your code (is it cool to have all of those hacks inside Zuul and to reimplement them across the different use cases? And to repeat the same hacks in OSC and Ansible? And in every user-written application?).
- I know this work breaks some corner cases. And I perfectly see the pain my work is causing (even for myself). I have invested a huge amount of time in rewriting this and unifying code paths and tests. Trust me, this was not fun. But unless this is done, we either stay stuck with old code without the possibility to add new features (i.e. microversions), or we continue living in the mess of tons of different ways to do the same thing, with no maintainers.

So overall, while I understand all of the issues I am causing to projects right now, I kindly ask: be patient, report broken things to me, and help me fix this insanity. Believe me, I am not doing all of this with bad intent, but rather hoping to resolve the issues we have and unify things in the light of low contributions. Otherwise the SDK, the CLI, Ansible, and eventually Zuul will not be able to properly keep up with the OpenStack services themselves. I listen to complaints carefully, and that is the reason R1.0 does not exist as of now.

Artem
[3] https://docs.openstack.org/openstacksdk/latest/user/connection.html#openstack.connection.Connection.create_object
On 2022-06-02 21:12:33 +0200 (+0200), Artem Goncharov wrote: [...]
I listen to complaints carefully, and that is the reason R1.0 does not exist as of now. [...]
It's probably also worth pointing out, from a user experience perspective, that we anticipated backward-incompatible changes in 1.0.0 and so merged an openstacksdk<1 pin in nodepool's requirements.txt file. Unfortunately, the backward-incompatible changes were instead released as 0.99.0, so when the container images for our launchers updated we started getting another error back from all of the Rackspace regions we're using (traceback attached).

I'm happy to file a bug report on that if nobody has beaten me to it, but further diagnostics will probably have to wait until after summit travel, since we've rolled back to 0.61.0 for now.
--
Jeremy Stanley
Thanks Jeremy, will try to analyze the trace. This is not too extensive, but better than a silent pin without the possibility to even notice it.

I see that there are generally not enough deep tests for the OpenStack driver in Zuul, since most issues appear in prod. I will try to invest some time there.

Artem

---- typed from mobile, auto-correct typos assumed ----
On Thu, Jun 2, 2022, at 12:54 PM, Artem Goncharov wrote:
Thanks Jeremy, will try to analyze the trace.
This is not too extensive, but better than a silent pin without the possibility to even notice it.
I see that there are generally not enough deep tests for the OpenStack driver in Zuul, since most issues appear in prod. I will try to invest some time there.
We do functional testing against a devstack cloud. Unfortunately, as previously mentioned, every cloud is just sufficiently unique that there are always corner cases devstack can't catch. In general, what we are able to test is that we can upload an image to devstack, then boot that image and ssh into it. But different clouds support different image upload methods, networking, and even hypervisors. We do our best, but once you get to the variety of clouds in the wild there are many more opportunities to find bad assumptions.
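For context, that flow is roughly the following, sketched with openstacksdk's cloud layer. The cloud name, image file, and flavor are assumptions, and the real test harness is more involved:

    import openstack

    conn = openstack.connect(cloud='devstack')

    # Upload an image, boot a server from it, and report the address
    # that the ssh reachability check would use.
    image = conn.create_image('test-image', filename='test-image.qcow2',
                              wait=True)
    server = conn.create_server('test-server', image=image, flavor='m1.tiny',
                                auto_ip=True, wait=True)
    print(server.public_v4)

    # Clean up so the test cloud does not leak resources.
    conn.delete_server(server.id, wait=True)
    conn.delete_image(image.id, wait=True)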
On 2. Jun 2022, at 22:52, Clark Boylan <cboylan@sapwetik.org> wrote:
We do functional testing against a devstack cloud. Unfortunately, as previously mentioned, every cloud is just sufficiently unique that there are always corner cases devstack can't catch. In general, what we are able to test is that we can upload an image to devstack, then boot that image and ssh into it. But different clouds support different image upload methods, networking, and even hypervisors. We do our best, but once you get to the variety of clouds in the wild there are many more opportunities to find bad assumptions.
Right. I didn't want to say there is no testing. I mean that in Zuul you have knowledge of the corner cases you face, and they need to be included in tests to guarantee we do not break them. But actually, now I think it would be a real benefit to everybody to include knowledge of those cases, with corresponding tests, in the SDK itself, so that others can also benefit from it. Things that rely on the SDK should just work, no matter which particular cloud they are talking to.

This brings me slowly to another point: our dream of real interoperability has not taken off. What is the sense of any certification if it does not save us even in pretty simple scenarios: provisioning a host with a custom image, or uploading an object to the object store with CORS on? As a person responsible for passing this certification on our cloud, I say openly: at the moment it is as good as useless. There are tests, and yes, sometimes it is even a challenge to pass them. But if they are not even able to guarantee us interoperability here, they are not worth the effort. With this knowledge we can say that there is not a single OpenStack public cloud out there. They are all merely “based on OpenStack”.

Artem
Participants (3):
- Artem Goncharov
- Clark Boylan
- Jeremy Stanley