[ironic] RFC: deprecate the iSCSI deploy interface?
Hi all,
Side note for those lacking context: this proposal concerns deprecating one of the ironic deploy interfaces detailed in https://docs.openstack.org/ironic/latest/admin/interfaces/deploy.html. It does not affect the boot-from-iSCSI feature.
I would like to propose deprecating and removing the 'iscsi' deploy interface over the course of the next 2 cycles. The reasons are:
1) The iSCSI deploy is a source of occasional cryptic bugs when a target cannot be discovered or mounted properly.
2) Its security is questionable: I don't think we even use authentication.
3) Operator confusion: right now we default to the iSCSI deploy but pretty much direct everyone who cares about scalability or security to the 'direct' deploy.
4) Cost of maintenance: our feature set is growing, our team - not so much. iscsi_deploy.py is 800 lines of code that can be removed, and some dependencies can be dropped as well.
As far as I can remember, we've kept the iSCSI deploy for two reasons:
1) The direct deploy used to require Glance with a Swift backend. The recently added [agent]image_download_source option allows caching and serving images via ironic's HTTP server, eliminating this problem. I guess we'll have to switch to 'http' by default for this option to keep the out-of-the-box experience.
2) Memory footprint of the direct deploy. With raw image streaming we no longer have to cache the downloaded images in the agent memory, removing this problem as well (I'm not even sure how much of a problem it is in 2020, even my phone has 4GiB of RAM).
If this proposal is accepted, I suggest executing it as follows:
Victoria release:
1) Put an early deprecation warning in the release notes.
2) Announce the future change of the default value for [agent]image_download_source.
W release:
3) Change [agent]image_download_source to 'http' by default.
4) Remove iscsi from the default enabled_deploy_interfaces and move it to the back of the supported list (effectively making direct deploy the default).
X release:
5) Remove the iscsi deploy code from both ironic and IPA.
Thoughts, opinions, suggestions?
Dmitry
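As a rough illustration of the end state this plan aims for, the Python sketch below audits an ironic.conf-style file for settings that still depend on the iscsi deploy. The option names enabled_deploy_interfaces and [agent]image_download_source come from the proposal itself; default_deploy_interface is an existing ironic option that the thread does not mention. The sample configuration, the function name audit_deploy_config and the assumed 'swift' fallback for image_download_source are illustrative assumptions, not part of ironic.

```python
import configparser

# Hypothetical example of an operator's current settings (illustrative values only).
SAMPLE_CONF = """
[DEFAULT]
enabled_deploy_interfaces = iscsi,direct
default_deploy_interface = iscsi

[agent]
image_download_source = swift
"""


def audit_deploy_config(text):
    """Return the changes needed to stop relying on the iscsi deploy interface."""
    parser = configparser.ConfigParser()
    parser.read_string(text)

    raw_enabled = parser.get("DEFAULT", "enabled_deploy_interfaces", fallback="")
    enabled = [name.strip() for name in raw_enabled.split(",") if name.strip()]

    findings = []
    if "iscsi" in enabled:
        findings.append("remove 'iscsi' from enabled_deploy_interfaces")
    if "direct" not in enabled:
        findings.append("add 'direct' to enabled_deploy_interfaces")
    if parser.get("DEFAULT", "default_deploy_interface", fallback="") == "iscsi":
        findings.append("set default_deploy_interface to 'direct'")
    # Assumption: an unset option behaves like 'swift', the pre-change default.
    if parser.get("agent", "image_download_source", fallback="swift") != "http":
        findings.append("set [agent]image_download_source to 'http'")
    return findings


if __name__ == "__main__":
    for finding in audit_deploy_config(SAMPLE_CONF):
        print(finding)
```

A configuration that already enables only 'direct' and sets image_download_source to 'http' yields an empty list, which is roughly what the W release defaults would look like under this proposal.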
I'm having a sense of deja vu!
Because of the way the mechanics work, the iscsi deploy driver is in the unfortunate position of being harder to troubleshoot, and its failures harder to diagnose. This basically means we've not really been able to identify common failures and add logic to handle them appropriately, as we can with a TCP socket and file download. Based on this alone, I think there is a solid case for us to seriously consider deprecation.
Overall, I'm +1 for the proposal and I believe over two cycles is the right way to go.
I suspect we're going to get a lot of pushback from the TripleO community because there has been resistance to changing their default usage in the past. As such I'm adding them to the subject so hopefully they will at least be aware.
I guess my other worry is operators who already have a substantial operational infrastructure investment built around the iscsi deploy interface. I wonder why they didn't use direct, but maybe they have all migrated in the past ?5? years. This could just be a non-concern in reality, I'm just not sure.
Of course, if someone is willing to step up and make the iscsi deployment interface their primary focus, that also shifts the discussion to making direct the default interface?
-Julia
On Thu, Aug 20, 2020 at 1:57 AM Dmitry Tantsur <dtantsur@redhat.com> wrote:
> Hi all,
> Side note for those lacking context: this proposal concerns deprecating one of the ironic deploy interfaces detailed in https://docs.openstack.org/ironic/latest/admin/interfaces/deploy.html. It does not affect the boot-from-iSCSI feature.
> I would like to propose deprecating and removing the 'iscsi' deploy interface over the course of the next 2 cycles. The reasons are: 1) The iSCSI deploy is a source of occasional cryptic bugs when a target cannot be discovered or mounted properly. 2) Its security is questionable: I don't think we even use authentication. 3) Operator confusion: right now we default to the iSCSI deploy but pretty much direct everyone who cares about scalability or security to the 'direct' deploy. 4) Cost of maintenance: our feature set is growing, our team - not so much. iscsi_deploy.py is 800 lines of code that can be removed, and some dependencies can be dropped as well.
> As far as I can remember, we've kept the iSCSI deploy for two reasons: 1) The direct deploy used to require Glance with a Swift backend. The recently added [agent]image_download_source option allows caching and serving images via ironic's HTTP server, eliminating this problem. I guess we'll have to switch to 'http' by default for this option to keep the out-of-the-box experience. 2) Memory footprint of the direct deploy. With raw image streaming we no longer have to cache the downloaded images in the agent memory, removing this problem as well (I'm not even sure how much of a problem it is in 2020, even my phone has 4GiB of RAM).
> If this proposal is accepted, I suggest executing it as follows: Victoria release: 1) Put an early deprecation warning in the release notes. 2) Announce the future change of the default value for [agent]image_download_source. W release: 3) Change [agent]image_download_source to 'http' by default. 4) Remove iscsi from the default enabled_deploy_interfaces and move it to the back of the supported list (effectively making direct deploy the default). X release: 5) Remove the iscsi deploy code from both ironic and IPA.
> Thoughts, opinions, suggestions?
> Dmitry
Hi!
CERN's deployment has been using the iscsi deploy interface since we started with Ironic a couple of years ago (and we have installed around 5000 nodes with it by now). The reason we chose it at the time was simplicity: we did not (and still do not) have a Swift backend to Glance, and the iscsi interface provided a straightforward alternative.
While we have not seen obscure bugs/issues with it, I can certainly back the scalability issues mentioned by Dmitry: the tunneling of the images through the controllers can create issues when deploying hundreds of nodes at the same time. The security of the iscsi interface is less of a concern in our specific environment.
So, why did we not move to direct (yet)? In addition to the lack of Swift, mostly because iscsi works for us and the scalability issues were not that much of a burning problem ... so we focused on other things :)
Here are some thoughts/suggestions for this discussion:
How would 'direct' work with other Glance backends (like Ceph/RBD in our case)? If using direct requires duplicating images from Glance to Ironic (or somewhere else) to be served, I think this would be an argument against deprecating iscsi.
Equally, if this would require completely moving the Glance backend to something else, like from RBD to RadosGW, I would not expect happy operators. (Does anyone know if RadosGW could even replace Swift for this specific use case?)
Do we have numbers on how many deployments use iscsi vs direct? If many rely on iscsi, I would also suggest establishing a migration guide for operators on how to move from iscsi to direct, for the various configs. Recent versions of Glance support multiple backends, so a migration path may be to add a new (direct compatible) backend for new images.
Cheers, Arne
On 20.08.20 17:49, Julia Kreger wrote:
> I'm having a sense of deja vu!
> Because of the way the mechanics work, the iscsi deploy driver is in the unfortunate position of being harder to troubleshoot, and its failures harder to diagnose. This basically means we've not really been able to identify common failures and add logic to handle them appropriately, as we can with a TCP socket and file download. Based on this alone, I think there is a solid case for us to seriously consider deprecation.
> Overall, I'm +1 for the proposal and I believe over two cycles is the right way to go.
> I suspect we're going to get a lot of pushback from the TripleO community because there has been resistance to changing their default usage in the past. As such I'm adding them to the subject so hopefully they will at least be aware.
> I guess my other worry is operators who already have a substantial operational infrastructure investment built around the iscsi deploy interface. I wonder why they didn't use direct, but maybe they have all migrated in the past ?5? years. This could just be a non-concern in reality, I'm just not sure.
> Of course, if someone is willing to step up and make the iscsi deployment interface their primary focus, that also shifts the discussion to making direct the default interface?
> -Julia
Hi,
On Mon, Aug 24, 2020 at 10:24 AM Arne Wiebalck <arne.wiebalck@cern.ch> wrote:
> Hi!
> CERN's deployment has been using the iscsi deploy interface since we started with Ironic a couple of years ago (and we have installed around 5000 nodes with it by now). The reason we chose it at the time was simplicity: we did not (and still do not) have a Swift backend to Glance, and the iscsi interface provided a straightforward alternative.
> While we have not seen obscure bugs/issues with it, I can certainly back the scalability issues mentioned by Dmitry: the tunneling of the images through the controllers can create issues when deploying hundreds of nodes at the same time. The security of the iscsi interface is less of a concern in our specific environment.
> So, why did we not move to direct (yet)? In addition to the lack of Swift, mostly because iscsi works for us and the scalability issues were not that much of a burning problem ... so we focused on other things :)
> Here are some thoughts/suggestions for this discussion:
> How would 'direct' work with other Glance backends (like Ceph/RBD in our case)? If using direct requires duplicating images from Glance to Ironic (or somewhere else) to be served, I think this would be an argument against deprecating iscsi.
With image_download_source=http ironic will download the image to the conductor to be able to serve it to the node. Which is exactly what iscsi is doing, so not much of a change for you (except for s/iSCSI/HTTP/ as a means of serving the image).
Would it be an option for you to test direct deploy with image_download_source=http?
> Equally, if this would require completely moving the Glance backend to something else, like from RBD to RadosGW, I would not expect happy operators. (Does anyone know if RadosGW could even replace Swift for this specific use case?)
AFAIK ironic works with RadosGW, we have some support code for it.
> Do we have numbers on how many deployments use iscsi vs direct? If many rely on iscsi, I would also suggest establishing a migration guide for operators on how to move from iscsi to direct, for the various configs. Recent versions of Glance support multiple backends, so a migration path may be to add a new (direct compatible) backend for new images.
I don't have any numbers, but a migration guide is a must in any case.
I expect most TripleO consumers to use the iscsi deploy, but only because it's the default. Their Edge solution uses the direct deploy. I've polled a few operators I know, they all (except for you, obviously :) seem to use the direct deploy. Metal3 uses direct deploy.
Dmitry
> Cheers, Arne
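To make the comparison in this reply concrete, here is a small Python sketch that models where the node gets its image from for each combination discussed in the thread. It is a simplified mental model built only from the statements above (the conductor serves the image for both iscsi and direct with image_download_source=http; Swift or a compatible API such as RadosGW is only involved with the 'swift' source). The ImagePath type and image_path function are hypothetical names for illustration and do not exist in ironic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImagePath:
    served_by: str            # component the node receives the image from
    protocol: str             # transport used between that component and the node
    needs_object_store: bool  # True if Swift or a compatible API (e.g. RadosGW) is required


def image_path(deploy_interface, image_download_source="swift"):
    """Describe, per the thread, how the image reaches the node being deployed."""
    if deploy_interface == "iscsi":
        # The conductor downloads the image and transfers it over iSCSI.
        return ImagePath("conductor", "iSCSI", False)
    if deploy_interface == "direct" and image_download_source == "http":
        # The conductor caches the image and serves it over HTTP; any Glance backend works.
        return ImagePath("conductor", "HTTP", False)
    if deploy_interface == "direct" and image_download_source == "swift":
        # The node downloads from the object store, bypassing the conductor.
        return ImagePath("object store", "HTTP", True)
    raise ValueError(
        f"unsupported combination: {deploy_interface!r}, {image_download_source!r}"
    )


if __name__ == "__main__":
    for args in (("iscsi",), ("direct", "http"), ("direct", "swift")):
        print(args, image_path(*args))
```

In this model, moving from iscsi to direct with 'http' changes only the protocol, which matches the "s/iSCSI/HTTP/" remark, while the scaling benefit of bypassing the conductor is tied to the object-store path.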
Hi Dmitry,
On 24.08.20 10:32, Dmitry Tantsur wrote:
> Hi,
> On Mon, Aug 24, 2020 at 10:24 AM Arne Wiebalck <arne.wiebalck@cern.ch> wrote:
>> Hi!
>> CERN's deployment has been using the iscsi deploy interface since we started with Ironic a couple of years ago (and we have installed around 5000 nodes with it by now). The reason we chose it at the time was simplicity: we did not (and still do not) have a Swift backend to Glance, and the iscsi interface provided a straightforward alternative.
>> While we have not seen obscure bugs/issues with it, I can certainly back the scalability issues mentioned by Dmitry: the tunneling of the images through the controllers can create issues when deploying hundreds of nodes at the same time. The security of the iscsi interface is less of a concern in our specific environment.
>> So, why did we not move to direct (yet)? In addition to the lack of Swift, mostly because iscsi works for us and the scalability issues were not that much of a burning problem ... so we focused on other things :)
>> Here are some thoughts/suggestions for this discussion:
>> How would 'direct' work with other Glance backends (like Ceph/RBD in our case)? If using direct requires duplicating images from Glance to Ironic (or somewhere else) to be served, I think this would be an argument against deprecating iscsi.
> With image_download_source=http ironic will download the image to the conductor to be able to serve it to the node. Which is exactly what iscsi is doing, so not much of a change for you (except for s/iSCSI/HTTP/ as a means of serving the image).
> Would it be an option for you to test direct deploy with image_download_source=http?
Oh, absolutely! I was not aware that setting this option would make Ironic act as an image buffer (I thought this would expect some URL the admin had to provide) ... I will try this and let you know.
>> Equally, if this would require completely moving the Glance backend to something else, like from RBD to RadosGW, I would not expect happy operators. (Does anyone know if RadosGW could even replace Swift for this specific use case?)
> AFAIK ironic works with RadosGW, we have some support code for it.
I was mostly asking to see if RadosGW is a (longer term) option to fully benefit from direct's inherent scaling.
>> Do we have numbers on how many deployments use iscsi vs direct? If many rely on iscsi, I would also suggest establishing a migration guide for operators on how to move from iscsi to direct, for the various configs. Recent versions of Glance support multiple backends, so a migration path may be to add a new (direct compatible) backend for new images.
> I don't have any numbers, but a migration guide is a must in any case.
> I expect most TripleO consumers to use the iscsi deploy, but only because it's the default. Their Edge solution uses the direct deploy. I've polled a few operators I know, they all (except for you, obviously :) seem to use the direct deploy. Metal3 uses direct deploy.
Thanks! Arne
> Dmitry
>> Cheers, Arne
On Mon, 2020-08-24 at 10:32 +0200, Dmitry Tantsur wrote:
> Hi,
> On Mon, Aug 24, 2020 at 10:24 AM Arne Wiebalck <arne.wiebalck@cern.ch> wrote:
>> Hi!
>> CERN's deployment has been using the iscsi deploy interface since we started with Ironic a couple of years ago (and we have installed around 5000 nodes with it by now). The reason we chose it at the time was simplicity: we did not (and still do not) have a Swift backend to Glance, and the iscsi interface provided a straightforward alternative.
>> While we have not seen obscure bugs/issues with it, I can certainly back the scalability issues mentioned by Dmitry: the tunneling of the images through the controllers can create issues when deploying hundreds of nodes at the same time. The security of the iscsi interface is less of a concern in our specific environment.
>> So, why did we not move to direct (yet)? In addition to the lack of Swift, mostly because iscsi works for us and the scalability issues were not that much of a burning problem ... so we focused on other things :)
>> Here are some thoughts/suggestions for this discussion:
>> How would 'direct' work with other Glance backends (like Ceph/RBD in our case)? If using direct requires duplicating images from Glance to Ironic (or somewhere else) to be served, I think this would be an argument against deprecating iscsi.
> With image_download_source=http ironic will download the image to the conductor to be able to serve it to the node. Which is exactly what iscsi is doing, so not much of a change for you (except for s/iSCSI/HTTP/ as a means of serving the image).
> Would it be an option for you to test direct deploy with image_download_source=http?
I think if there is still an option that does not force deployments to alter any of their other services, this is likely OK, but I think the onus should be on the ironic and TripleO (ooo) teams to ensure there is an upgrade path for those users, without deploying Swift or a Swift-compatible API (e.g. RadosGW), before this deprecation becomes a removal.
Perhaps a CI job could be put in place, maybe using grenade, that starts with iscsi and moves to direct with http, to show that just setting that option will allow the conductor to download the image from Glance and serve it to IPA.
Unlike CERN, I just use ironic in a tiny home deployment where I have an all-in-one deployment + 4 additional nodes for ironic. I can't deploy Swift as all my disks are already in use for Cinder, so down the line, when I eventually upgrade to Victoria and Wallaby, I would either have to drop ironic or not upgrade it if there is no option to just pull the image from Glance, or from Glance via the conductor. Enhancing IPA to pull directly from Glance would also probably work for many who use iscsi today, but that would depend on your network topology, I guess.
>> Equally, if this would require completely moving the Glance backend to something else, like from RBD to RadosGW, I would not expect happy operators. (Does anyone know if RadosGW could even replace Swift for this specific use case?)
> AFAIK ironic works with RadosGW, we have some support code for it.
>> Do we have numbers on how many deployments use iscsi vs direct? If many rely on iscsi, I would also suggest establishing a migration guide for operators on how to move from iscsi to direct, for the various configs. Recent versions of Glance support multiple backends, so a migration path may be to add a new (direct compatible) backend for new images.
> I don't have any numbers, but a migration guide is a must in any case.
> I expect most TripleO consumers to use the iscsi deploy, but only because it's the default. Their Edge solution uses the direct deploy. I've polled a few operators I know, they all (except for you, obviously :) seem to use the direct deploy. Metal3 uses direct deploy.
> Dmitry
>> Cheers, Arne
On Mon, Aug 24, 2020 at 1:52 PM Sean Mooney <smooney@redhat.com> wrote:
Hi,
On Mon, Aug 24, 2020 at 10:24 AM Arne Wiebalck <arne.wiebalck@cern.ch> wrote:
Hi!
CERN's deployment is using the iscsi deploy interface since we started with Ironic a couple of years ago (and we installed around 5000 nodes with it by now). The reason we chose it at the time was simplicity: we did not (and still do not) have a Swift backend to Glance, and the iscsi interface provided a straightforward alternative.
While we have not seen obscure bugs/issues with it, I can certainly back the scalability issues mentioned by Dmitry: the tunneling of the images through the controllers can create issues when deploying hundreds of nodes at the same time. The security of the iscsi interface is less of a concern in our specific environment.
So, why did we not move to direct (yet)? In addition to the lack of Swift, mostly since iscsi works for us and the scalability issues were not that much of a burning problem ... so we focused on other things :)
Here are some thoughts/suggestions for this discussion:
How would 'direct' work with other Glance backends (like Ceph/RBD in our case)? If using direct requires to duplicate images from Glance to Ironic (or somewhere else) to be served, I think this would be an argument against deprecating iscsi.
With image_download_source=http ironic will download the image to the conductor to be able to serve it to the node. Which is exactly what the iscsi is doing, so not much of a change for you (except for s/iSCSI/HTTP/ as a means of serving the image).
On Mon, 2020-08-24 at 10:32 +0200, Dmitry Tantsur wrote:
Would it be an option for you to test direct deploy with image_download_source=http?
I think if there is still an option that does not force deployments to alter any of their other services, this is likely OK, but I think the onus should be on the ironic and ooo teams to ensure there is an upgrade path for those users before this deprecation becomes a removal, without deploying Swift or a Swift-compatible API, e.g. RadosGW.
Swift is NOT a requirement (nor is RadosGW) when image_download_source=http is used. Any glance backend (or no glance at all) will work.
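To make the point above concrete, here is a minimal sketch (not ironic code) of how an operator might check an ironic.conf for a remaining Swift dependency. The option names enabled_deploy_interfaces, default_deploy_interface and [agent]image_download_source are the real ones discussed in this thread; the sample file contents and the helper names deploy_settings and needs_swift are hypothetical. It also assumes that an unset image_download_source still means the Swift-based behaviour, i.e. the default before the change proposed here.

```python
import configparser

# Hypothetical ironic.conf excerpt: direct deploy with images served over HTTP.
SAMPLE_CONF = """
[DEFAULT]
enabled_deploy_interfaces = direct,iscsi
default_deploy_interface = direct

[agent]
image_download_source = http
"""


def deploy_settings(conf_text):
    """Return (enabled deploy interfaces, image download source) from conf text."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    raw = parser.get("DEFAULT", "enabled_deploy_interfaces", fallback="")
    enabled = [name.strip() for name in raw.split(",") if name.strip()]
    source = parser.get("agent", "image_download_source", fallback=None)
    return enabled, source


def needs_swift(conf_text):
    """True if direct deploy is enabled but images are not served over HTTP.

    Assumption: an unset option is treated like the pre-change (Swift) default.
    """
    enabled, source = deploy_settings(conf_text)
    return "direct" in enabled and source != "http"


enabled, source = deploy_settings(SAMPLE_CONF)
print(enabled, source, needs_swift(SAMPLE_CONF))
```

With the sample above the check reports no Swift dependency, whatever the Glance backend is.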
Perhaps a CI job could be put in place, maybe using grenade, that starts with iscsi and moves to direct with http, to show that just setting that allows the conductor to download the image from Glance and serve it to the IPA.
We already have CI jobs that do it, I'm not sure what grenade would win us? At this point, we keep grenade jobs barely working at all (actually, the multinode grenade job is not working), we cannot add anything there. Dmitry
Unlike CERN, I just use ironic in a tiny home deployment where I have an all-in-one deployment + 4 additional nodes for ironic. I can't deploy Swift as all my disks are already in use for Cinder, so down the line, when I eventually upgrade to Victoria and Wallaby, I would either have to drop ironic or not upgrade it if there is not an option to just pull the image from Glance, or from Glance via the conductor. Enhancing the IPA to pull directly from Glance would also probably work for many who use iscsi today, but that would depend on your network topology, I guess.
Equally, if this would require to completely move the Glance backend to something else, like from RBD to RadosGW, I would not expect happy operators. (Does anyone know if RadosGW could even replace Swift for this specific use case?)
AFAIK ironic works with RadosGW, we have some support code for it.
Do we have numbers on how many deployments use iscsi vs direct? If many rely on iscsi, I would also suggest to establish a migration guide for operators on how to move from iscsi to direct, for the various configs. Recent versions of Glance support multiple backends, so a migration may be to add a new (direct compatible) backend for new images.
I don't have any numbers, but a migration guide is a must in any case.
I expect most TripleO consumers to use the iscsi deploy, but only because it's the default. Their Edge solution uses the direct deploy. I've polled a few operators I know, they all (except for you, obviously :) seem to use the direct deploy. Metal3 uses direct deploy.
Dmitry
Cheers, Arne
On 20.08.20 17:49, Julia Kreger wrote:
I'm having a sense of deja vu!
Because of the way the mechanics work, the iscsi deploy driver is in an unfortunate position of being harder to troubleshoot and diagnose failures. Which basically means we've not been able to really identify common failures and add logic to handle them appropriately, like we are able to with a tcp socket and file download. Based on this alone, I think it makes a solid case for us to seriously consider deprecation.
Overall, I'm +1 for the proposal and I believe over two cycles is the right way to go.
I suspect we're going to have lots of push back from the TripleO community because there has been resistance to change their default usage in the past. As such I'm adding them to the subject so hopefully they will be at least aware.
I guess my other worry is operators who already have a substantial operational infrastructure investment built around the iscsi deploy interface. I wonder why they didn't use direct, but maybe they have all migrated in the past ?5? years. This could just be a non-concern in reality, I'm just not sure.
Of course, if someone is willing to step up and make the iscsi deployment interface their primary focus, that also shifts the discussion to making direct the default interface?
-Julia
On 25/08/20 12:05 am, Dmitry Tantsur wrote:
Swift is NOT a requirement (nor is RadosGW) when image_download_source=http is used. Any glance backend (or no glance at all) will work.
Even though the TripleO undercloud has swift, I'd be inclined to do image_download_source=http so that it can scale out to minions, and so we're not relying on a single-node swift for image serving
Hi Steve, On 24.08.20 23:55, Steve Baker wrote:
Even though the TripleO undercloud has swift, I'd be inclined to do image_download_source=http so that it can scale out to minions, and so we're not relying on a single-node swift for image serving
This makes it sound a little like 'direct' with image_download_source=http would be easily scalable ... but it is only if you can (and are willing to) scale the Ironic control plane through which the images are still tunneled (and Glance behind it ... not sure if there is any caching of images inside the Ironic controllers). Seems to be the case for you and TripleO, but it may not be the case in other setups; using conductor groups may complicate things, for instance.
So, from what I see, image_download_source=http is a good option to move deployments off the iscsi deploy interface, but it does not bring the same (scalability) advantages you would get from a setup where Glance is backed by a scalable Swift or RadosGW backend.
Cheers, Arne
On Mon, Aug 24, 2020 at 1:52 PM Sean Mooney <smooney@redhat.com> wrote:
Perhaps a CI job could be put in place, maybe using grenade, that starts with iscsi and moves to direct with http, to show that just setting that allows the conductor to download the image from Glance and serve it to the IPA.
This is the CI job with direct deploy in a low RAM environment with a large image (CentOS) without Swift: https://zuul.opendev.org/t/openstack/build/58f623d90435470f9095eb68202c25f8 The change is https://review.opendev.org/#/c/747413/ Dmitry
On 8/20/20 5:49 PM, Julia Kreger wrote:
I suspect we're going to have lots of push back from the TripleO community because there has been resistance to change their default usage in the past. As such I'm adding them to the subject so hopefully they will be at least aware.
Since TripleO already supports the direct interface, and it is recommended and tested by the TripleO group focusing on edge-type deployments, switching to direct by default might not be too much of a hassle for TripleO. We may want to change the disk-image format used by TripleO to raw as well, to benefit from the raw image streaming capabilities? Or would enabling image_download_source = http convert the images as they are cached on conductors? (see question inline below.)
When using image_download_source = http, does Ironic convert non-raw images when they are placed in each conductor's cache, so that deployments benefit from the raw image streaming?
On Tue, Aug 25, 2020 at 12:39 PM Harald Jensas <hjensas@redhat.com> wrote:
On 8/20/20 5:49 PM, Julia Kreger wrote:
I suspect we're going to have lots of push back from the TripleO community because there has been resistance to change their default usage in the past. As such I'm adding them to the subject so hopefully they will be at least aware.
Since TripleO already supports the direct interface, and it is recommended and tested by the TripleO group focusing on edge-type deployments, switching to direct by default might not be too much of a hassle for TripleO.
++
We may want to change the disk-image format used by TripleO to raw as well, to benefit from the raw image streaming capabilities? Or would enabling image_download_source = http convert the images as they are cached on conductors? (see question inline below.)
When using image_download_source = http, does Ironic convert non-raw images when they are placed in each conductor's cache, so that deployments benefit from the raw image streaming?
Yes, unless it's explicitly disabled. Although storing raw images from the beginning may make deployments a bit faster and save some disk space for this conversion. Dmitry
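For anyone who wants to store raw images from the beginning, as suggested above, one common way is to convert them with qemu-img before uploading. The following is only a sketch under assumptions: qemu-img is not mentioned in this thread, the file names are made up, and the helper qemu_img_convert_cmd is hypothetical. It only builds the command line and does not run it.

```python
import shlex


def qemu_img_convert_cmd(src, dst, src_format="qcow2"):
    """Build a qemu-img command that converts src into a raw image at dst."""
    return ["qemu-img", "convert", "-f", src_format, "-O", "raw", src, dst]


# Hypothetical file names, for illustration only.
cmd = qemu_img_convert_cmd("overcloud-full.qcow2", "overcloud-full.raw")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a host that has qemu-img installed; the raw file is then what gets uploaded to Glance.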
participants (6): Arne Wiebalck, Dmitry Tantsur, Harald Jensas, Julia Kreger, Sean Mooney, Steve Baker