From gmann at ghanshyammann.com Mon Apr 1 01:54:12 2019 From: gmann at ghanshyammann.com (Ghanshyam Mann) Date: Sun, 31 Mar 2019 20:54:12 -0500 Subject: [nova] "future" specs and blueprints In-Reply-To: References: Message-ID: <169d69aa3a1.1154d6bf526152.5215579436310684500@ghanshyammann.com> ---- On Wed, 27 Mar 2019 14:45:02 -0500 Eric Fried wrote ---- > All- > > It seems like we don't have a great way of looking at a blueprint and > saying, "this looks like a good idea, and yes we want to do it, but > we're not going to be able to prioritize it for this release." Sometimes > we just leave those blueprints "unspecified" and their specs stay open > in limbo. And some blueprints we'll approve for the current series, and > merge the spec under $current_cycle/approved, but not really intend to > get the work completed. And at the end of the release, those pieces > muddy our completion stats. It would be nice if those stats reflected > only work we got done that we *intended* to get done. > > I'm not bringing this up just for the sake of making numbers pretty; it > would just be nice to have a crisp picture of the work slated for the > current release. And also a way for contributors to propose specs they > *don't* intend for the current series, but still want to start > discussing (I happen to know of at least one example of this for Train). > And also a way to keep track of work we know we want to do eventually, > just not now. > > === TL;DR >>> > So I'd like to propose that we set up a "future" series in Launchpad, > and a corresponding subdirectory in the specs repo. > === TL;DR <<< > > The process would be pretty much as you would expect, including (but not > limited to): > > - If we decide (e.g. at the PTG) that we like a spec that's proposed > under $current/approved, but won't have time for it in the current > series, we'll ask the author to move it to future/ and make sure the > History section includes "$current: proposed and approved for 'future'". > - If we decide in mid-release that we want to defer a blueprint, we can > propose a patch to move it from $current/approved to future/ (including > a redirect). Thanks for bringing this up. What will be the conditions or criteria of the above two scenarios? - Review bandwidth? - Author (contributors) wish not to complete the code in the cycle when he/she is proposing the spec for? I mean do we have any scenario where spec author says "please approve my spec in this cycle but I will not start the code in this cycle so consider this a future spec" - Technical scope etc ? - Anything else ? IMO, the first two scenarios should not be the reason to ask people to propose your spec in future/ dir. For other scenarios, we should clearly say "NO: as of now due to xyz reason so rejected or suggestion for repropose" etc. future/ dir idea is good but main challenge it can face is "how many people going to review the future/ items' because I feel everyone is so busy with their current assignments. > - If a contributor wants to start work on a spec for a future release, > they can propose it directly to the future/ path. +1, this is a nice idea to start the early discussion/feedback etc. -gmann > - Every cycle when setting goals we can skim through the 'future' items > to see if any should be pulled in. In which case we propose a patch to > move the file from future/ to $current/approved (including a redirect) > which we can fast-approve. > - If we decide to flush a spec from 'future', we can move it to a > 'rejected' (or 'abandoned'? 
bikeshed away) folder - but we can sort this > part of the process out later. > > How do folks feel about this idea? > > -efried > . > > From mnaser at vexxhost.com Mon Apr 1 02:21:22 2019 From: mnaser at vexxhost.com (Mohammed Naser) Date: Sun, 31 Mar 2019 22:21:22 -0400 Subject: [nova] super long online_data_migrations Message-ID: Hi there, During upgrades, I've noticed that when running online_data_migrations with "infinite-until-done" mode, it loops over all of the migrations one by one. However, one of the online data migrations (instance_obj.populate_missing_availability_zones) makes a query that takes a really long time as it seems inefficient (which eventually results in 0, cause it already ran), which means as it loops in "blocks" of 50, there's almost a 2-3 to 8 minute wait in really large environments. The question ends up in specific: SELECT count(*) AS count_1 FROM (SELECT instance_extra.created_at AS instance_extra_created_at, instance_extra.updated_at AS instance_extra_updated_at, instance_extra.deleted_at AS instance_extra_deleted_at, instance_extra.deleted AS instance_extra_deleted, instance_extra.id AS instance_extra_id, instance_extra.instance_uuid AS instance_extra_instance_uuid FROM instance_extra WHERE instance_extra.keypairs IS NULL AND instance_extra.deleted = 0) AS anon_1 The explain for the DB query in this example: +------+-------------+----------------+------+---------------+------+---------+------+--------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +------+-------------+----------------+------+---------------+------+---------+------+--------+-------------+ | 1 | SIMPLE | instance_extra | ALL | NULL | NULL | NULL | NULL | 382473 | Using where | +------+-------------+----------------+------+---------------+------+---------+------+--------+-------------+ It's possible that it can be ever worse, as this number is from another very-long running environments. +------+-------------+----------------+------+---------------+------+---------+------+---------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +------+-------------+----------------+------+---------------+------+---------+------+---------+-------------+ | 1 | SIMPLE | instance_extra | ALL | NULL | NULL | NULL | NULL | 3008741 | Using where | +------+-------------+----------------+------+---------------+------+---------+------+---------+-------------+ I'm not the SQL expert, could we not optimize this? Alternatively, could we update the online data migrations code to "pop out" any of the migrations that return 0 for the next iteration, that way it only works on those online_data_migrations that *have* to be done, and ignore those it knows are done? Thanks, Mohammed -- Mohammed Naser — vexxhost ----------------------------------------------------- D. 514-316-8872 D. 800-910-1726 ext. 200 E. mnaser at vexxhost.com W. http://vexxhost.com From mnaser at vexxhost.com Mon Apr 1 02:26:47 2019 From: mnaser at vexxhost.com (Mohammed Naser) Date: Sun, 31 Mar 2019 22:26:47 -0400 Subject: [nova] super long online_data_migrations In-Reply-To: References: Message-ID: On Sun, Mar 31, 2019 at 10:21 PM Mohammed Naser wrote: > > Hi there, > > During upgrades, I've noticed that when running online_data_migrations > with "infinite-until-done" mode, it loops over all of the migrations > one by one. 
> > However, one of the online data migrations > (instance_obj.populate_missing_availability_zones) makes a query that > takes a really long time as it seems inefficient (which eventually > results in 0, cause it already ran), which means as it loops in > "blocks" of 50, there's almost a 2-3 to 8 minute wait in really large > environments. > > The question ends up in specific: > > SELECT count(*) AS count_1 > FROM (SELECT instance_extra.created_at AS instance_extra_created_at, > instance_extra.updated_at AS instance_extra_updated_at, > instance_extra.deleted_at AS instance_extra_deleted_at, > instance_extra.deleted AS instance_extra_deleted, instance_extra.id AS > instance_extra_id, instance_extra.instance_uuid AS > instance_extra_instance_uuid > FROM instance_extra > WHERE instance_extra.keypairs IS NULL AND instance_extra.deleted = 0) AS anon_1 > > The explain for the DB query in this example: > > +------+-------------+----------------+------+---------------+------+---------+------+--------+-------------+ > | id | select_type | table | type | possible_keys | key | > key_len | ref | rows | Extra | > +------+-------------+----------------+------+---------------+------+---------+------+--------+-------------+ > | 1 | SIMPLE | instance_extra | ALL | NULL | NULL | > NULL | NULL | 382473 | Using where | > +------+-------------+----------------+------+---------------+------+---------+------+--------+-------------+ > > It's possible that it can be ever worse, as this number is from > another very-long running environments. > > +------+-------------+----------------+------+---------------+------+---------+------+---------+-------------+ > | id | select_type | table | type | possible_keys | key | > key_len | ref | rows | Extra | > +------+-------------+----------------+------+---------------+------+---------+------+---------+-------------+ > | 1 | SIMPLE | instance_extra | ALL | NULL | NULL | > NULL | NULL | 3008741 | Using where | > +------+-------------+----------------+------+---------------+------+---------+------+---------+-------------+ > > I'm not the SQL expert, could we not optimize this? Alternatively, > could we update the online data migrations code to "pop out" any of > the migrations that return 0 for the next iteration, that way it only > works on those online_data_migrations that *have* to be done, and > ignore those it knows are done? and while we're at it, can we just bump the default rows-per-run to something more than 50 rows? it seems super .. small :) > Thanks, > Mohammed > > -- > Mohammed Naser — vexxhost > ----------------------------------------------------- > D. 514-316-8872 > D. 800-910-1726 ext. 200 > E. mnaser at vexxhost.com > W. http://vexxhost.com -- Mohammed Naser — vexxhost ----------------------------------------------------- D. 514-316-8872 D. 800-910-1726 ext. 200 E. mnaser at vexxhost.com W. http://vexxhost.com From me at not.mn Mon Apr 1 02:38:40 2019 From: me at not.mn (John Dickinson) Date: Sun, 31 Mar 2019 19:38:40 -0700 Subject: [nova] "future" specs and blueprints In-Reply-To: References: Message-ID: <594FD6B1-F46C-4B85-B7D0-4CB55D85BF8B@not.mn> Apologies for top-posting, etc etc I'd like to mention something the Swift team introduced back in 2016 and has been using quite successfully ever since. After trying both blueprints and specs, we have settled on our ideas wiki page. Basically, if you've got an idea, write it down and link to it. 
I'll let the original email do more of the explaining: http://lists.openstack.org/pipermail/openstack-dev/2016-May/094026.html Swift's ideas page is at https://wiki.openstack.org/wiki/Swift/ideas --John On 27 Mar 2019, at 12:45, Eric Fried wrote: > All- > > It seems like we don't have a great way of looking at a blueprint and > saying, "this looks like a good idea, and yes we want to do it, but > we're not going to be able to prioritize it for this release." Sometimes > we just leave those blueprints "unspecified" and their specs stay open > in limbo. And some blueprints we'll approve for the current series, and > merge the spec under $current_cycle/approved, but not really intend to > get the work completed. And at the end of the release, those pieces > muddy our completion stats. It would be nice if those stats reflected > only work we got done that we *intended* to get done. > > I'm not bringing this up just for the sake of making numbers pretty; it > would just be nice to have a crisp picture of the work slated for the > current release. And also a way for contributors to propose specs they > *don't* intend for the current series, but still want to start > discussing (I happen to know of at least one example of this for Train). > And also a way to keep track of work we know we want to do eventually, > just not now. > > === TL;DR >>> > So I'd like to propose that we set up a "future" series in Launchpad, > and a corresponding subdirectory in the specs repo. > === TL;DR <<< > > The process would be pretty much as you would expect, including (but not > limited to): > > - If we decide (e.g. at the PTG) that we like a spec that's proposed > under $current/approved, but won't have time for it in the current > series, we'll ask the author to move it to future/ and make sure the > History section includes "$current: proposed and approved for 'future'". > - If we decide in mid-release that we want to defer a blueprint, we can > propose a patch to move it from $current/approved to future/ (including > a redirect). > - If a contributor wants to start work on a spec for a future release, > they can propose it directly to the future/ path. > - Every cycle when setting goals we can skim through the 'future' items > to see if any should be pulled in. In which case we propose a patch to > move the file from future/ to $current/approved (including a redirect) > which we can fast-approve. > - If we decide to flush a spec from 'future', we can move it to a > 'rejected' (or 'abandoned'? bikeshed away) folder - but we can sort this > part of the process out later. > > How do folks feel about this idea? > > -efried > . -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 850 bytes Desc: OpenPGP digital signature URL: From anlin.kong at gmail.com Mon Apr 1 03:24:33 2019 From: anlin.kong at gmail.com (Lingxian Kong) Date: Mon, 1 Apr 2019 16:24:33 +1300 Subject: [trove] Trove for Train Message-ID: Hi, I'm Lingxian Kong, I'm going to serve as Trove PTL for the Train dev cycle. Since the master branch is open for contribution and review, for those who care about Trove, here are several things I'd like to bring to your attention and most importantly, need your feedback. - Deprecate nova-network. As I mentioned in the candidacy, The nova-network related code is spread in the repo, which makes it very difficult for new feature implementation and bugfix. 
Considering nova-network was deprecated in the OpenStack Newton release, I propose we also deprecate nova-network support in Trove and remove after several cycles according to the deprecation policy of the community. I'm not sure if there is still anyone using nova-network for Trove, especially in production. If yes, please reply to this email. - Create service VM in admin project by default Currently, Trove has configuration support to create the db instance in the admin project, which I think should be the default deployment model to reduce the security risk given all the db instances are communicating with RabbitMQ in the control plane. - Remove SecurityGroup API extension TBH, I don't know when and why that extension was added in Trove but since it's not included in Trove API document( https://developer.openstack.org/api-ref/database/), I assume there is on one relies on that in production, so should be safe to remove. - Remove SecurityGroup related database model I don't have the history development background in my mind, but IMHO, i don't think it's reasonable for Trove to maintain such information in db. - Security group management enhancement Removing the API extension and database model doesn't mean Trove shouldn't support security group for the db instance, on the contrary, security should always be the first thing we consider for new features. The two tasks above are actually prerequisites for this one. In order to make it easy to maintain and as more secure as possible, Trove is not going to allow the end user to manipulate the security group associated with db instance. Trove will try to provide as more information as possible to make the debugging and performance tuning easy. - Monitoring capability Currently, there is no monitoring capability support in Trove, and I think that's the only main part missing for Trove to be running in production. I don't have a full picture in mind now but will try to figure out how to achieve that. - Priorities of the previous dev cycles Of course, I shouldn't put the previous dev cycle priorities away from the track, e.g. the Stein dev cycle priorities are well documented here https://etherpad.openstack.org/p/trove-stein-priorities-and-specs-tracking As Trove project has been experiencing some up and downs in the past, but it's still very useful in some deployment use cases and has some advantages over the container deployment model. As you could guess, the reason I raised my hand to lead Trove is that we(Catalyst Cloud) have been deploying Trove in production, so all those things are aiming at making Trove production ready, not only for private cloud but also for the public. If you have any concerns related to what's mentioned above, please don't hesitate to reply. Alternately, I'm always in the #openstack-trove IRC channel and could answer any questions during the working hours of UTC+12. I really appreciate any feedback from the community. --- Cheers, Lingxian Kong Catalyst Cloud -------------- next part -------------- An HTML attachment was scrubbed... URL: From gouthampravi at gmail.com Mon Apr 1 04:15:51 2019 From: gouthampravi at gmail.com (Goutham Pacha Ravi) Date: Sun, 31 Mar 2019 21:15:51 -0700 Subject: [dev][stable][release][manila] Should we revert these stable branch backports? 
In-Reply-To: <20190330204021.leizflkinun32bwy@barron.net> References: <20190327125654.5bazj5hic3ilj27q@barron.net> <20190330204021.leizflkinun32bwy@barron.net> Message-ID: On Sat, Mar 30, 2019 at 1:40 PM Tom Barron wrote: > > On 27/03/19 11:26 -0700, Goutham Pacha Ravi wrote: > >On Wed, Mar 27, 2019 at 5:58 AM Tom Barron wrote: > >> > >> In manila we recently merged backports of a change [1] that aims > >> to fix up faulty configuration option deprecations in the > >> Dell-EMC VMAX driver. The question at hand is whether (a) we > >> should revert these backports on the grounds that the > >> deprecations were faulty and therefore are only effective from > >> Stein forwards, or whether (b) the deprecations actually worked > >> and took effect back when Ocata was the development branch but > >> had faults that should be corrected all the way back to Pike > >> before it goes EM. > >> > >> I think (b) is the correct answer but let's check. > >> > >> On Jan 10 2017 a review [2] merged that deprecated generic > >> Dell-EMC driver options in favor of model-specific options [2]. > >> There were two models at the time, Unity and VNX. This change > >> was in itself unproblematic. > >> > >> On Jan 24 2017 a review [3] merged that introduced a third > >> Dell-EMC model, VMAX. This new code introduced VMAX specific > >> options, consistent with the deprecation of the generic Dell-EMC > >> options. However it had two problems which review [1] corrects. > >> When it defined the new VMAX-specific options it failed to > >> indicate the corresponding old generic options via > >> 'deprecated_name' [4]. Worse, the code that consumed the options > >> actually looked for the old generic 'emc_interface_ports' option > >> instead of the new 'vmax_ethernet_ports' option. The only way to > >> set a value for this option was to use its deprecated > >> form. > >> > >> The change that we backported fixes both of these problems. > >> > >> I think it is a valid stable backport. > > > > > >I don't agree with some parts of the change. However, I didn't review > >this patch in time, so my opinion is late, and possibly annoying now. > >A couple of things with this change are weird, and we could decide to > >fix these: > > > >1) The EMC VMAX driver never had "emc_nas_server_container" or > >"emc_nas_pool_names" as config options, but the bug fix [1], adds > >these as deprecated names for valid options (vmax_server_container, > >vmax_share_data_pools), in a retrospective manner. > > True, they are added as synonyms for the corresponing model-specific > options, and are marked as deprecated. This aligns the VMAX config > with the other EMC models. > > >2) The EMC VMAX driver always used "emc_interface_ports", however, > >there was no such configuration option wrt that backend. How would > >users know to set it? Things passed because there was a "safe_get" > >operation, and perhaps the vendor had called this out in their docs? > >It's good they fixed this part in Stein. > > There is no question that the VMAX driver code was broken. > > > > > > >> If people set this "ports" option using the deprecated option it > >> will still work. The deprecated option has not been removed yet > >> and will never be removed from any of the stable branches. > >> > >> If they tried to set it via the proper option then they would have > >> had a functional problem, a bug, which this change now fixes. > > > >Agree with this point. 
> > > >> If on the other hand we say we cannot backport this then until > >> stable/stein there will be no way for VMAX users of earlier stable > >> branches to set this option except by magic -- by using an old > >> deprecated form of the option that without the backported fix is not > >> even visible in the customer-facing configuration for VMAX. > >> All this said, I could well be missing something so I welcome > >> analysis by stable/core folks. The backports in question have not > >> yet been released so it would not be a problem to revert them. > > > >I think it would be worth back porting the deprecation and association > >between "emc_interface_ports" and "vmax_ethernet_ports" in the > >interest of customers/users. However, I think [1] introduces another > >bug by exposing net new options for this driver > >("emc_nas_server_container" and "emc_nas_pool_names") as deprecated > >forms of existing options ("vmax_server_container", > >"vmax_share_data_pools"), and that is concerning. > > > > What is the new bug that this change introduces? Won't the net new > options work and isn't the configuration backwards compatible in the > sense that there are no options that used to work that will not work > now? The bug is introducing deprecated forms of the options, and back porting that to older releases. Perhaps there's some business/customer understanding reasoning I don't know of. It can be termed harmless, because the original and legitimate options will continue to work. > > >> Thanks! > >> > >> -- Tom Barron > >> > >> [1] https://review.openstack.org/#/c/608725/ > >> > >> [2] https://review.openstack.org/#/c/415079/ > >> > >> [3] https://review.openstack.org/#/c/404859/ > >> > >> [4] If this was the only issue then arguably it wasn't an error > >> since the VMAX driver never operated with the old options. > >> From skaplons at redhat.com Mon Apr 1 06:29:40 2019 From: skaplons at redhat.com (Slawomir Kaplonski) Date: Mon, 1 Apr 2019 08:29:40 +0200 Subject: [neutron] CI broken Message-ID: Hi, Just FYI, since few days we have broken neutron-tempest-plugin-designate-scenario job and it is failing 100% times. Bug is reported in [1]. If this job failed on Your patch, please don’t recheck as it will not solve the problem. [1] https://bugs.launchpad.net/neutron/+bug/1822453 — Slawek Kaplonski Senior software engineer Red Hat From witold.bedyk at suse.com Mon Apr 1 07:52:24 2019 From: witold.bedyk at suse.com (Witek Bedyk) Date: Mon, 1 Apr 2019 09:52:24 +0200 Subject: [monasca][ptg] PTG Planning In-Reply-To: <63b7c34a-1538-7597-6f3d-ac8073ef9366@suse.com> References: <63b7c34a-1538-7597-6f3d-ac8073ef9366@suse.com> Message-ID: Good morning, The correct URL is of course: https://etherpad.openstack.org/p/monasca-ptg-train Cheers Witek On 3/26/19 11:58 AM, Witek Bedyk wrote: > Hello everyone, > > the schedule for the next PTG in Denver has been published [1]. We have > a room reserved from Thursday afternoon until Saturday. I have started > an etherpad to collect topics for the discussion: > > https://etherpad.openstack.org/p/monasca-ptg-stein > > Please add your name and the times when you can attend (both F2F and > remotely). Please also put the topics you would like to discuss or > features you think we should add in the next release cycle. > > Thanks for your input and I'm looking forward to see/hear you there. 
> Witek > > [1] https://www.openstack.org/ptg#tab_schedule > > From frickler at offenerstapel.de Mon Apr 1 08:43:50 2019 From: frickler at offenerstapel.de (Jens Harbott) Date: Mon, 01 Apr 2019 08:43:50 +0000 Subject: [all][ops] Train goal: removing and simplifying the endpoint tripplets? In-Reply-To: References: Message-ID: <1554108230.4997.13.camel@offenerstapel.de> On Thu, 2019-03-28 at 16:49 +0100, Thomas Goirand wrote: > Hi, > > During the summit in Tokyo (if I remember well), Sean Dague lead a > discussion about removing the need for having 3 endpoints per > service. I > was very excited about the proposal, and it's IMO a shame it hasn't > been > implemented. Everyone in the room agreed. Here the content of the > discussion as I remember it: > > > 1/ The only service that needed the admin endpoint was Keystone. This > requirement is now gone. So we could get rid of the admin endpoint > all > together. > > 2/ The need for an interal vs public endpoint was only needed for > accounting (of for example bandwidth when uploading to Glance), but > this > could be work-around by operators by using intelligent routing. So we > wouldn't need the internal endpoint. > > This makes us only need the public endpoint, and that's it. > > Then, there are these %(tenant_id)s bits in the endpoints which are > also > very much annoying, and could be removed if the clients were smarter. > These are still needed, apparently, for: > - cinder > - swift > - heat > > > Is anyone planning to implement (at least some parts of) the above? For me as an operator, the distinction between internal and public endpoints is helpful, as it allows to easily set up extended filtering or rate limiting for public services without affecting internal API calls, which in most deployments cause the majority of requests. I'm not sure what "intelligent routing" is meant to be, but it sounds more complicated and unstable than the current solution. Big +1 on dropping the admin endpoint though, now that keystone doesn't need it anymore. Jens From noonedeadpunk at ya.ru Mon Apr 1 09:25:43 2019 From: noonedeadpunk at ya.ru (=?utf-8?B?0KDQsNCx0L7RgtGP0LPQvtCyINCU0LzQuNGC0YDQuNC5?=) Date: Mon, 01 Apr 2019 12:25:43 +0300 Subject: [all][ops] Train goal: removing and simplifying the endpoint tripplets? In-Reply-To: <1554108230.4997.13.camel@offenerstapel.de> References: <1554108230.4997.13.camel@offenerstapel.de> Message-ID: <23799501554110743@sas2-7b909973f402.qloud-c.yandex.net> +1 to Jens point. Internal endpoints seems to be pretty useful for me as well, as you may set internal networks to completely another physical interface (like internal infiniband connections), while leave public endpoints rate limited, and it's pretty easy to configure and maintain. And I guess it might be the case for a pretty big amount of public clouds. 01.04.2019, 11:48, "Jens Harbott" : > On Thu, 2019-03-28 at 16:49 +0100, Thomas Goirand wrote: >>  Hi, >> >>  During the summit in Tokyo (if I remember well), Sean Dague lead a >>  discussion about removing the need for having 3 endpoints per >>  service. I >>  was very excited about the proposal, and it's IMO a shame it hasn't >>  been >>  implemented. Everyone in the room agreed. Here the content of the >>  discussion as I remember it: >> >>   >>  1/ The only service that needed the admin endpoint was Keystone. This >>  requirement is now gone. So we could get rid of the admin endpoint >>  all >>  together. 
>> >>  2/ The need for an interal vs public endpoint was only needed for >>  accounting (of for example bandwidth when uploading to Glance), but >>  this >>  could be work-around by operators by using intelligent routing. So we >>  wouldn't need the internal endpoint. >> >>  This makes us only need the public endpoint, and that's it. >> >>  Then, there are these %(tenant_id)s bits in the endpoints which are >>  also >>  very much annoying, and could be removed if the clients were smarter. >>  These are still needed, apparently, for: >>  - cinder >>  - swift >>  - heat >>   >> >>  Is anyone planning to implement (at least some parts of) the above? > > For me as an operator, the distinction between internal and public > endpoints is helpful, as it allows to easily set up extended filtering > or rate limiting for public services without affecting internal API > calls, which in most deployments cause the majority of requests. > > I'm not sure what "intelligent routing" is meant to be, but it sounds > more complicated and unstable than the current solution. > > Big +1 on dropping the admin endpoint though, now that keystone doesn't > need it anymore. > > Jens --  Kind Regards, Dmitriy Rabotyagov From surya.seetharaman9 at gmail.com Mon Apr 1 09:30:27 2019 From: surya.seetharaman9 at gmail.com (Surya Seetharaman) Date: Mon, 1 Apr 2019 11:30:27 +0200 Subject: [nova] super long online_data_migrations In-Reply-To: References: Message-ID: Hi Mohammed, On Mon, Apr 1, 2019 at 4:29 AM Mohammed Naser wrote: > On Sun, Mar 31, 2019 at 10:21 PM Mohammed Naser > wrote: > > > > Hi there, > > > > During upgrades, I've noticed that when running online_data_migrations > > with "infinite-until-done" mode, it loops over all of the migrations > > one by one. > > > > However, one of the online data migrations > > (instance_obj.populate_missing_availability_zones) makes a query that > > takes a really long time as it seems inefficient (which eventually > > results in 0, cause it already ran), which means as it loops in > > "blocks" of 50, there's almost a 2-3 to 8 minute wait in really large > > environments. > Hmm, all we do in that migration is try to get instance records whose availability_zone is None [1] and if no records are found we just return all done. While I agree that once a migration is done, the next time we loop through all the migrations we again do the query at least once to ensure we get back zero records for most of the migrations (we don't always use persistent markers to see if the migration was completed in the previous run) which means we do run through the whole table. > > > > The question ends up in specific: > > > > SELECT count(*) AS count_1 > > FROM (SELECT instance_extra.created_at AS instance_extra_created_at, > > instance_extra.updated_at AS instance_extra_updated_at, > > instance_extra.deleted_at AS instance_extra_deleted_at, > > instance_extra.deleted AS instance_extra_deleted, instance_extra.id AS > > instance_extra_id, instance_extra.instance_uuid AS > > instance_extra_instance_uuid > > FROM instance_extra > > WHERE instance_extra.keypairs IS NULL AND instance_extra.deleted = 0) AS > anon_1 > > > This is the keypair_obj.migrate_keypairs_to_api_db migration that was added in Newton. Since we are just counting, we need not pull the whole record I guess (not sure how much improvement that would cause), I am myself not an SQL expert, maybe jaypipes can help here. 
> The explain for the DB query in this example: > > > > > +------+-------------+----------------+------+---------------+------+---------+------+--------+-------------+ > > | id | select_type | table | type | possible_keys | key | > > key_len | ref | rows | Extra | > > > +------+-------------+----------------+------+---------------+------+---------+------+--------+-------------+ > > | 1 | SIMPLE | instance_extra | ALL | NULL | NULL | > > NULL | NULL | 382473 | Using where | > > > +------+-------------+----------------+------+---------------+------+---------+------+--------+-------------+ > > > > It's possible that it can be ever worse, as this number is from > > another very-long running environments. > > > > > +------+-------------+----------------+------+---------------+------+---------+------+---------+-------------+ > > | id | select_type | table | type | possible_keys | key | > > key_len | ref | rows | Extra | > > > +------+-------------+----------------+------+---------------+------+---------+------+---------+-------------+ > > | 1 | SIMPLE | instance_extra | ALL | NULL | NULL | > > NULL | NULL | 3008741 | Using where | > > > +------+-------------+----------------+------+---------------+------+---------+------+---------+-------------+ > > > > I'm not the SQL expert, could we not optimize this? Alternatively, > > could we update the online data migrations code to "pop out" any of > > the migrations that return 0 for the next iteration, that way it only > > works on those online_data_migrations that *have* to be done, and > > ignore those it knows are done? > I don't know if there is a good way by which we can persistently store the state of finished migrations to ensure they are not executed ever again (as in not having to make the query) once done. It would also be nice to also be able to opt-in into specific migrations specially since these span over releases. > > and while we're at it, can we just bump the default rows-per-run to > something more than > 50 rows? it seems super .. small :) > > I agree the default 50 is a pretty small batch size specially for large deployments. [1] https://github.com/openstack/nova/blob/95a87bce9fa7575c172a7d06344fd3cd070db587/nova/objects/instance.py#L1302 Thanks for bringing this up, Regards, Surya. -------------- next part -------------- An HTML attachment was scrubbed... URL: From dmellado at redhat.com Mon Apr 1 09:40:31 2019 From: dmellado at redhat.com (Daniel Mellado) Date: Mon, 1 Apr 2019 11:40:31 +0200 Subject: [kuryr] Changing upstream meeting time Message-ID: <9dcf55bd-570c-a387-dffc-a0e2c2dcd8f2@redhat.com> Hi all! Since most of our regular meeting attendees are in CEST time zone, I'm shifting the time slot for our regular meetings to match DST, so the new meeting time would be Mondays at 14:00 UTC. I've already proposed a patch at [1]. Searching at http://eavesdrop.openstack.org/ shows no conflict, so we'll be meeting at that time at #openstack-meeting-4 (falling back to #openstack-kuryr should any issues arise). If you have any doubt or comment, feel free to ping me at the irc. Best! Daniel [1] https://review.openstack.org/#/c/648927/ -------------- next part -------------- A non-text attachment was scrubbed... Name: 0x13DDF774E05F5B85.asc Type: application/pgp-keys Size: 2208 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 488 bytes Desc: OpenPGP digital signature URL: From tpb at dyncloud.net Mon Apr 1 10:35:44 2019 From: tpb at dyncloud.net (Tom Barron) Date: Mon, 1 Apr 2019 06:35:44 -0400 Subject: [dev][stable][release][manila] Should we revert these stable branch backports? In-Reply-To: References: <20190327125654.5bazj5hic3ilj27q@barron.net> <20190330204021.leizflkinun32bwy@barron.net> Message-ID: <20190401103544.ew26pngy2xaba62h@barron.net> On 31/03/19 21:15 -0700, Goutham Pacha Ravi wrote: >On Sat, Mar 30, 2019 at 1:40 PM Tom Barron wrote: >> >> On 27/03/19 11:26 -0700, Goutham Pacha Ravi wrote: >> >On Wed, Mar 27, 2019 at 5:58 AM Tom Barron wrote: >> >> >> >> In manila we recently merged backports of a change [1] that aims >> >> to fix up faulty configuration option deprecations in the >> >> Dell-EMC VMAX driver. The question at hand is whether (a) we >> >> should revert these backports on the grounds that the >> >> deprecations were faulty and therefore are only effective from >> >> Stein forwards, or whether (b) the deprecations actually worked >> >> and took effect back when Ocata was the development branch but >> >> had faults that should be corrected all the way back to Pike >> >> before it goes EM. >> >> >> >> I think (b) is the correct answer but let's check. >> >> >> >> On Jan 10 2017 a review [2] merged that deprecated generic >> >> Dell-EMC driver options in favor of model-specific options [2]. >> >> There were two models at the time, Unity and VNX. This change >> >> was in itself unproblematic. >> >> >> >> On Jan 24 2017 a review [3] merged that introduced a third >> >> Dell-EMC model, VMAX. This new code introduced VMAX specific >> >> options, consistent with the deprecation of the generic Dell-EMC >> >> options. However it had two problems which review [1] corrects. >> >> When it defined the new VMAX-specific options it failed to >> >> indicate the corresponding old generic options via >> >> 'deprecated_name' [4]. Worse, the code that consumed the options >> >> actually looked for the old generic 'emc_interface_ports' option >> >> instead of the new 'vmax_ethernet_ports' option. The only way to >> >> set a value for this option was to use its deprecated >> >> form. >> >> >> >> The change that we backported fixes both of these problems. >> >> >> >> I think it is a valid stable backport. >> > >> > >> >I don't agree with some parts of the change. However, I didn't review >> >this patch in time, so my opinion is late, and possibly annoying now. >> >A couple of things with this change are weird, and we could decide to >> >fix these: >> > >> >1) The EMC VMAX driver never had "emc_nas_server_container" or >> >"emc_nas_pool_names" as config options, but the bug fix [1], adds >> >these as deprecated names for valid options (vmax_server_container, >> >vmax_share_data_pools), in a retrospective manner. >> >> True, they are added as synonyms for the corresponing model-specific >> options, and are marked as deprecated. This aligns the VMAX config >> with the other EMC models. >> >> >2) The EMC VMAX driver always used "emc_interface_ports", however, >> >there was no such configuration option wrt that backend. How would >> >users know to set it? Things passed because there was a "safe_get" >> >operation, and perhaps the vendor had called this out in their docs? >> >It's good they fixed this part in Stein. >> >> There is no question that the VMAX driver code was broken. 
>> >> > >> > >> >> If people set this "ports" option using the deprecated option it >> >> will still work. The deprecated option has not been removed yet >> >> and will never be removed from any of the stable branches. >> >> >> >> If they tried to set it via the proper option then they would have >> >> had a functional problem, a bug, which this change now fixes. >> > >> >Agree with this point. >> > >> >> If on the other hand we say we cannot backport this then until >> >> stable/stein there will be no way for VMAX users of earlier stable >> >> branches to set this option except by magic -- by using an old >> >> deprecated form of the option that without the backported fix is not >> >> even visible in the customer-facing configuration for VMAX. >> >> All this said, I could well be missing something so I welcome >> >> analysis by stable/core folks. The backports in question have not >> >> yet been released so it would not be a problem to revert them. >> > >> >I think it would be worth back porting the deprecation and association >> >between "emc_interface_ports" and "vmax_ethernet_ports" in the >> >interest of customers/users. However, I think [1] introduces another >> >bug by exposing net new options for this driver >> >("emc_nas_server_container" and "emc_nas_pool_names") as deprecated >> >forms of existing options ("vmax_server_container", >> >"vmax_share_data_pools"), and that is concerning. >> > >> >> What is the new bug that this change introduces? Won't the net new >> options work and isn't the configuration backwards compatible in the >> sense that there are no options that used to work that will not work >> now? > >The bug is introducing deprecated forms of the options, and back >porting that to older releases. Perhaps there's some business/customer >understanding reasoning I don't know of. It can be termed harmless, >because the original and legitimate options will continue to work. Hmm, that doesn't seem to me to be an error, flaw, failure or fault that causes the program or system to produce an incorrect or unexpected result, or to behave in unintended ways. The change in question gets this VMAX model to *work*, and to work in a way that is consistent with config options and user expectations for the other EMC models. Without introducing a bug a backport could still be a violation of stable/branch review guidelines [1] but that doesn't seem to be an issue either. This backport doesn't introduce a new feature, change HTTP APIs, change the AMQP API, change notification definitions, make DB schema changes, or make incompatible config file changes. At the risk of stating the obvious, but for the record, this backport doesn't change *when* these deprecations occurred for VMAX. They occurred in Stein development cycle, not during the Ocata development cycle, and the earliest the deprecated options could be removed is Train. [1] https://docs.openstack.org/project-team-guide/stable-branches.html > >> >> >> Thanks! >> >> >> >> -- Tom Barron >> >> >> >> [1] https://review.openstack.org/#/c/608725/ >> >> >> >> [2] https://review.openstack.org/#/c/415079/ >> >> >> >> [3] https://review.openstack.org/#/c/404859/ >> >> >> >> [4] If this was the only issue then arguably it wasn't an error >> >> since the VMAX driver never operated with the old options. >> >> From jaypipes at gmail.com Mon Apr 1 12:06:27 2019 From: jaypipes at gmail.com (Jay Pipes) Date: Mon, 1 Apr 2019 08:06:27 -0400 Subject: [all][ops] Train goal: removing and simplifying the endpoint tripplets? 
In-Reply-To: <1554108230.4997.13.camel@offenerstapel.de> References: <1554108230.4997.13.camel@offenerstapel.de> Message-ID: <83ea0995-c5be-4f5a-208f-25c81a29657f@gmail.com> On 04/01/2019 04:43 AM, Jens Harbott wrote: > On Thu, 2019-03-28 at 16:49 +0100, Thomas Goirand wrote: >> Hi, >> >> During the summit in Tokyo (if I remember well), Sean Dague lead a >> discussion about removing the need for having 3 endpoints per >> service. I >> was very excited about the proposal, and it's IMO a shame it hasn't >> been >> implemented. Everyone in the room agreed. Here the content of the >> discussion as I remember it: >> >> >> 1/ The only service that needed the admin endpoint was Keystone. This >> requirement is now gone. So we could get rid of the admin endpoint >> all >> together. >> >> 2/ The need for an interal vs public endpoint was only needed for >> accounting (of for example bandwidth when uploading to Glance), but >> this >> could be work-around by operators by using intelligent routing. So we >> wouldn't need the internal endpoint. >> >> This makes us only need the public endpoint, and that's it. >> >> Then, there are these %(tenant_id)s bits in the endpoints which are >> also >> very much annoying, and could be removed if the clients were smarter. >> These are still needed, apparently, for: >> - cinder >> - swift >> - heat >> >> >> Is anyone planning to implement (at least some parts of) the above? > > For me as an operator, the distinction between internal and public > endpoints is helpful, as it allows to easily set up extended filtering > or rate limiting for public services without affecting internal API > calls, which in most deployments cause the majority of requests. > > I'm not sure what "intelligent routing" is meant to be, but it sounds > more complicated and unstable than the current solution. Maybe Thomas was referring to having Keystone just return a single set of endpoints depending on the source CIDR. Or maybe he is referring to performing rate-limiting using a lower-level tool that was purpose-built for it -- something like iptables? i.e. ACCEPT all new connections from your private subnet/CIDR and jump all new connections not in your private subnet to a RATE-LIMIT chain that applies rate-limiting thresholds. In other words, use a single HTTP endpoint and do the rate-limiting in the Linux kernel instead of higher-level applications. Related: this is why having "quotas" for things like # of metadata items in Nova was always a terrible "feature" that was abusing the quota system as a terrible rate-limiting middleware when things like iptables or tc were a more appropriate solution. Best, -jay > Big +1 on dropping the admin endpoint though, now that keystone doesn't > need it anymore. > > Jens > From jaypipes at gmail.com Mon Apr 1 12:21:10 2019 From: jaypipes at gmail.com (Jay Pipes) Date: Mon, 1 Apr 2019 08:21:10 -0400 Subject: [nova] super long online_data_migrations In-Reply-To: References: Message-ID: On 03/31/2019 10:21 PM, Mohammed Naser wrote: > Hi there, > > During upgrades, I've noticed that when running online_data_migrations > with "infinite-until-done" mode, it loops over all of the migrations > one by one. > > However, one of the online data migrations > (instance_obj.populate_missing_availability_zones) makes a query that > takes a really long time as it seems inefficient (which eventually > results in 0, cause it already ran), which means as it loops in > "blocks" of 50, there's almost a 2-3 to 8 minute wait in really large > environments. 
> > The question ends up in specific: > > SELECT count(*) AS count_1 > FROM (SELECT instance_extra.created_at AS instance_extra_created_at, > instance_extra.updated_at AS instance_extra_updated_at, > instance_extra.deleted_at AS instance_extra_deleted_at, > instance_extra.deleted AS instance_extra_deleted, instance_extra.id AS > instance_extra_id, instance_extra.instance_uuid AS > instance_extra_instance_uuid > FROM instance_extra > WHERE instance_extra.keypairs IS NULL AND instance_extra.deleted = 0) AS anon_1 Ugh. :( The online data migration shouldn't be calling the above SQL statement at all. Instead, the migration should be doing something like this: SELECT ie.instance_uuid FROM instance_extra AS ie WHERE ie.keypairs IS NULL AND ie.deletd = 0 LIMIT 100 and then while getting any rows returned from the above, perform the work of transforming the problematic data in the table for each matched instance_uuid. I'm actually not sure what the above query has to do with availability zones, but I'll look into it later on this morning. Can you report a bug about this and we'll get on it ASAP? Best, -jay > The explain for the DB query in this example: > > +------+-------------+----------------+------+---------------+------+---------+------+--------+-------------+ > | id | select_type | table | type | possible_keys | key | > key_len | ref | rows | Extra | > +------+-------------+----------------+------+---------------+------+---------+------+--------+-------------+ > | 1 | SIMPLE | instance_extra | ALL | NULL | NULL | > NULL | NULL | 382473 | Using where | > +------+-------------+----------------+------+---------------+------+---------+------+--------+-------------+ > > It's possible that it can be ever worse, as this number is from > another very-long running environments. > > +------+-------------+----------------+------+---------------+------+---------+------+---------+-------------+ > | id | select_type | table | type | possible_keys | key | > key_len | ref | rows | Extra | > +------+-------------+----------------+------+---------------+------+---------+------+---------+-------------+ > | 1 | SIMPLE | instance_extra | ALL | NULL | NULL | > NULL | NULL | 3008741 | Using where | > +------+-------------+----------------+------+---------------+------+---------+------+---------+-------------+ > > I'm not the SQL expert, could we not optimize this? Alternatively, > could we update the online data migrations code to "pop out" any of > the migrations that return 0 for the next iteration, that way it only > works on those online_data_migrations that *have* to be done, and > ignore those it knows are done? > > Thanks, > Mohammed > From bence.romsics at gmail.com Mon Apr 1 12:31:11 2019 From: bence.romsics at gmail.com (Bence Romsics) Date: Mon, 1 Apr 2019 14:31:11 +0200 Subject: [neutron] bug deputy report for week of 2019-03-25 In-Reply-To: References: Message-ID: Hi, Please excuse my self-reply, but here comes the updated, more complete version of my bug deputy report. 
I will not be able to attend today's meeting, but if I may, let me call your attention to the top 3 bugs: Critical: * https://bugs.launchpad.net/neutron/+bug/1822453 neutron-tempest-plugin-designate-scenario is broken gate-failure, fix proposed: https://review.openstack.org/648951 High: * https://bugs.launchpad.net/neutron/+bug/1821912 intermittent ssh failures in various scenario tests gate failure, no fix proposed yet * https://bugs.launchpad.net/neutron/+bug/1822256 Ip segments lost when restart ovs-agent with openvswitch firewall packet loss at agent restart, no fix proposed yet Medium: * https://bugs.launchpad.net/neutron/+bug/1822100 Network Update if current provider net attribure key/value pair in request fix proposed: https://review.openstack.org/648522 * https://bugs.launchpad.net/neutron/+bug/1822105 Policy rules related to "sub parameters" doesn't work properly fix proposed: https://review.openstack.org/648532 * https://bugs.launchpad.net/neutron/+bug/1821948 Unstable unit test uses subnet broadcast address low frequency gate failure, fix merged: https://review.openstack.org/648172 Low: * https://bugs.launchpad.net/neutron/+bug/1822155 neutron-keepalived-state-change can not start on some python3 distro problem specific to (unsupported) python3.4, fix proposed: https://review.openstack.org/648459 * https://bugs.launchpad.net/neutron/+bug/1822199 neutron-vpn-netns-wrapper not invoked with --rootwrap_config parameter low hanging fruit, no fix proposed yet Needs further triaging: * https://bugs.launchpad.net/neutron/+bug/1821963 Rally test delete-subnets fails at higher concurrency could not reproduce yet, waiting to hear back from reporter Incomplete: * https://bugs.launchpad.net/neutron/+bug/1821567 network_segment_ranges could not load in tricirlce test likely not a neutron bug but a tricircle one * https://bugs.launchpad.net/neutron/+bug/1821925 Limit test coverage for Extended Maintenance stable branches please look at it if you're interested in stable / extended maintenance process Duplicate: * https://bugs.launchpad.net/neutron/+bug/1821357 VRRP vip on VM not reachable from other network on DVR setup duplicate of https://bugs.launchpad.net/bugs/1774459 Invalid: * https://bugs.launchpad.net/neutron/+bug/1822382 DBDeadlock for INSERT INTO resourcedeltas Cheers, Bence irc: rubasov From frickler at offenerstapel.de Mon Apr 1 13:45:16 2019 From: frickler at offenerstapel.de (Jens Harbott) Date: Mon, 01 Apr 2019 13:45:16 +0000 Subject: [dev][neutron] CI broken In-Reply-To: References: Message-ID: <1554126316.4997.15.camel@offenerstapel.de> On Mon, 2019-04-01 at 08:29 +0200, Slawomir Kaplonski wrote: > Hi, > > Just FYI, since few days we have broken neutron-tempest-plugin- > designate-scenario job and it is failing 100% times. Bug is reported > in [1]. > If this job failed on Your patch, please don’t recheck as it will not > solve the problem. > > [1] https://bugs.launchpad.net/neutron/+bug/1822453 Actually a patch in devstack caused this regression, affecting all jobs that use multiple tempest plugins. A fix for that is being reviewed, see the above bug report for details. Jens From navdeep.uniyal at bristol.ac.uk Mon Apr 1 13:54:02 2019 From: navdeep.uniyal at bristol.ac.uk (Navdeep Uniyal) Date: Mon, 1 Apr 2019 13:54:02 +0000 Subject: [Magnum] Cluster Create failure In-Reply-To: References: Message-ID: Dear All, My Kubernetes Cluster is timing out after 60 mins. 
Following is the update I am getting in magnum.log: {"stack": {"parent": null, "disable_rollback": true, "description": "This template will boot a Kubernetes cluster with one or more minions (as specified by the number_of_minions parameter, which defaults to 1).\n", "parameters": {"magnum_url": "http://10.68.48.4:9511/v1", "kube_tag": "v1.11.6", "http_proxy": "", "cgroup_driver": "cgroupfs", "registry_container": "container", "kubernetes_port": "6443", "calico_kube_controllers_tag": "v1.0.3", "octavia_enabled": "False", "etcd_volume_size": "0", "kube_dashboard_enabled": "True", "master_flavor": "medium", "etcd_tag": "v3.2.7", "kube_version": "v1.11.6", "k8s_keystone_auth_tag": "1.13.0", "kube_service_account_private_key": "******", "keystone_auth_enabled": "True", "cloud_provider_tag": "v0.2.0", "ca_key": "******", "tiller_enabled": "False", "registry_enabled": "False", "verify_ca": "True", "password": "******", "dns_service_ip": "10.254.0.10", "ssh_key_name": "magnum_key", "flannel_tag": "v0.10.0-amd64", "flannel_network_subnetlen": "24", "dns_nameserver": "8.8.8.8", "number_of_masters": "1", "wait_condition_timeout": "6000", "portal_network_cidr": "10.254.0.0/16", "admission_control_list": "NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,DefaultTolerationSeconds,MutatingAdmissionWebhook,ValidatingAdmissionWebhook,ResourceQuota", "pods_network_cidr": "10.100.0.0/16", "ingress_controller": "", "external_network": "751ae6e5-71af-4f78-b846-b0e1843093c8", "docker_volume_type": "", "registry_port": "5000", "tls_disabled": "False", "trust_id": "******", "swift_region": "", "influx_grafana_dashboard_enabled": "False", "volume_driver": "", "kubescheduler_options": "", "calico_tag": "v2.6.7", "loadbalancing_protocol": "TCP", "cloud_provider_enabled": "True", "OS::stack_id": "06c05715-ac05-4287-905c-38f1964f09fe", "flannel_cni_tag": "v0.3.0", "prometheus_monitoring": "False", "kubelet_options": "", "fixed_network": "", "kube_dashboard_version": "v1.8.3", "trustee_username": "d7ff417e-85b6-4b9a-94c3-211e7b830a51_4c6bc4445c764249921a0a6e40b192dd", "availability_zone": "", "server_image": "fedora-feduser-atomic", "flannel_network_cidr": "10.100.0.0/16", "cert_manager_api": "False", "minion_flavor": "medium", "kubeproxy_options": "", "calico_cni_tag": "v1.11.2", "cluster_uuid": "d7ff417e-85b6-4b9a-94c3-211e7b830a51", "grafana_admin_passwd": "******", "flannel_backend": "udp", "trustee_domain_id": "ac26210ad4f74217b3abf28a9b5cf56d", "fixed_subnet": "", "https_proxy": "", "username": "admin", "insecure_registry_url": "", "docker_volume_size": "0", "grafana_tag": "5.1.5", "kube_allow_priv": "true", "node_problem_detector_tag": "v0.6.2", "docker_storage_driver": "overlay2", "project_id": "4c6bc4445c764249921a0a6e40b192dd", "registry_chunksize": "5242880", "trustee_user_id": "d1983ea926c34536aabc8d50a85503e8", "container_infra_prefix": "", "number_of_minions": "1", "tiller_tag": "v2.12.3", "auth_url": "http://pluto:5000/v3", "registry_insecure": "True", "tiller_namespace": "magnum-tiller", "prometheus_tag": "v1.8.2", "OS::project_id": "4c6bc4445c764249921a0a6e40b192dd", "kubecontroller_options": "", "fixed_network_cidr": "10.0.0.0/24", "kube_service_account_key": "******", "ingress_controller_role": "ingress", "region_name": "RegionOne", "kubeapi_options": "", "openstack_ca": "******", "trustee_password": "******", "nodes_affinity_policy": "soft-anti-affinity", "minions_to_remove": "", "octavia_ingress_controller_tag": "1.13.2-alpha", "OS::stack_name": "kubernetes-cluster-wwmvqecjiznb", 
"system_pods_timeout": "5", "system_pods_initial_delay": "30", "dns_cluster_domain": "cluster.local", "calico_ipv4pool": "192.168.0.0/16", "network_driver": "flannel", "monitoring_enabled": "False", "heat_container_agent_tag": "stein-dev", "no_proxy": "", "discovery_url": "https://discovery.etcd.io/b8fe011e8b281615904de97ee05511a7"}, "deletion_time": null, "stack_name": "kubernetes-cluster-wwmvqecjiznb", "stack_user_project_id": "8204d11826fb4253ae7c9063306cb4e1", "tags": null, "creation_time": "2019-04-01T13:19:53Z", "links": [{"href": "http://pluto:8004/v1/4c6bc4445c764249921a0a6e40b192dd/stacks/kubernetes-cluster-wwmvqecjiznb/06c05715-ac05-4287-905c-38f1964f09fe", "rel": "self"}], "capabilities": [], "notification_topics": [], "timeout_mins": 60, "stack_status": "CREATE_IN_PROGRESS", "stack_owner": null, "updated_time": null, "id": "06c05715-ac05-4287-905c-38f1964f09fe", "stack_status_reason": "Stack CREATE started", "template_description": "This template will boot a Kubernetes cluster with one or more minions (as specified by the number_of_minions parameter, which defaults to 1).\n"}} I am not sure how to triage this issue as I cannot see any errors in heat.log as well. Even I can see both Master and Minion node running but the task errors out during OS::Heat::SoftwareDeployment in kube_cluster_deploy and OS::Heat::ResourceGroup in kube_minions I don't have much experience with Kubernetes clusters as well so please forgive me if I am raising any silly queries. Kind Regards, Navdeep -----Original Message----- From: Navdeep Uniyal Sent: 29 March 2019 12:16 To: Mohammed Naser ; Bharat Kunwar Cc: openstack at lists.openstack.org Subject: RE: [Magnum] Cluster Create failure Hi Guys, I am able to resolve the issue in nova. (it was a problem with the oslo.db version - Somehow I installed version 4.44 instead of 4.25 for my pike installation) However, moving forward, I started my kube cluster, I could see 2 instances running for Kube-master and kube-minion. 
But the deployment failed after that with the following error: {"message": "The resource was found at http://pluto:8004/v1/4c6bc4445c764249921a0a6e40b192dd/stacks/kubernetes-cluster-eovgkanhoa4x/384d8725-bca3-4fa4-a9fd-f18687aab8fb/resources?status=FAILED&nested_depth=2;\nyou should be redirected au tomatically.\n\n", "code": "302 Found", "title": "Found"} log_http_response /var/lib/magnum/env/local/lib/python2.7/site-packages/heatclient/common/http.py:157 2019-03-29 12:05:51.225 157681 DEBUG heatclient.common.http [req-76e55dec-9511-4aad-aa52-af9978b40eed - - - - -] curl -g -i -X GET -H 'User-Agent: python-heatclient' -H 'Content-Type: application/json' -H 'X-Aut h-Url: http://pluto:5000/v3' -H 'Accept: application/json' -H 'X-Auth-Token: {SHA1}f2c32656c7103ad0b89d83ff9f1b6cebc0a6eee7' http://pluto:8004/v1/4c6bc4445c764249921a0a6e40b192dd/stacks/kubernetes-cluster-eovgka nhoa4x/384d8725-bca3-4fa4-a9fd-f18687aab8fb/resources?status=FAILED&nested_depth=2 log_curl_request /var/lib/magnum/env/local/lib/python2.7/site-packages/heatclient/common/http.py:144 2019-03-29 12:05:51.379 157681 DEBUG heatclient.common.http [req-76e55dec-9511-4aad-aa52-af9978b40eed - - - - -] HTTP/1.1 200 OK Content-Type: application/json Content-Length: 4035 X-Openstack-Request-Id: req-942ef8fa-1bba-4573-9022-0d4e135772e0 Date: Fri, 29 Mar 2019 12:05:51 GMT Connection: keep-alive {"resources": [{"resource_name": "kube_cluster_deploy", "links": [{"href": "http://pluto:8004/v1/4c6bc4445c764249921a0a6e40b192dd/stacks/kubernetes-cluster-eovgkanhoa4x/384d8725-bca3-4fa4-a9fd-f18687aab8fb/resources/kube_cluster_deploy", "rel": "self"}, {"href": "http://pluto:8004/v1/4c6bc4445c764249921a0a6e40b192dd/stacks/kubernetes-cluster-eovgkanhoa4x/384d8725-bca3-4fa4-a9fd-f18687aab8fb", "rel": "stack"}], "logical_resource_id": "kube_cluster_deploy", "creation_time": "2019-03-29T10:40:00Z", "resource_status": "CREATE_FAILED", "updated_time": "2019-03-29T10:40:00Z", "required_by": [], "resource_status_reason": "CREATE aborted (Task create from SoftwareDeployment \"kube_cluster_deploy\" Stack \"kubernetes-cluster-eovgkanhoa4x\" [384d8725-bca3-4fa4-a9fd-f18687aab8fb] Timed out)", "physical_resource_id": "8d715a3f-6ec8-4772-ba4b-1056cd4ab7d3", "resource_type": "OS::Heat::SoftwareDeployment"}, {"resource_name": "kube_minions", "links": [{"href": "http://pluto:8004/v1/4c6bc4445c764249921a0a6e40b192dd/stacks/kubernetes-cluster-eovgkanhoa4x/384d8725-bca3-4fa4-a9fd-f18687aab8fb/resources/kube_minions", "rel": "self"}, {"href": "http://pluto:8004/v1/4c6bc4445c764249921a0a6e40b192dd/stacks/kubernetes-cluster-eovgkanhoa4x/384d8725-bca3-4fa4-a9fd-f18687aab8fb", "rel": "stack"}, {"href": "http://pluto:8004/v1/4c6bc4445c764249921a0a6e40b192dd/stacks/kubernetes-cluster-eovgkanhoa4x-kube_minions-otcpiw3oye46/33700819-0766-4d30-954b-29aace6048cc", "rel": "nested"}], "logical_resource_id": "kube_minions", "creation_time": "2019-03-29T10:40:00Z", "resource_status_reason": "CREATE aborted (Task create from ResourceGroup \"kube_minions\" Stack \"kubernetes-cluster-eovgkanhoa4x\" [384d8725-bca3-4fa4-a9fd-f18687aab8fb] Timed out)", "updated_time": "2019-03-29T10:40:00Z", "required_by": [], "resource_status": "CREATE_FAILED", "physical_resource_id": "33700819-0766-4d30-954b-29aace6048cc", "resource_type": "OS::Heat::ResourceGroup"}, {"parent_resource": "kube_minions", "resource_name": "0", "links": [{"href": 
"http://pluto:8004/v1/4c6bc4445c764249921a0a6e40b192dd/stacks/kubernetes-cluster-eovgkanhoa4x-kube_minions-otcpiw3oye46/33700819-0766-4d30-954b-29aace6048cc/resources/0", "rel": "self"}, {"href": "http://pluto:8004/v1/4c6bc4445c764249921a0a6e40b192dd/stacks/kubernetes-cluster-eovgkanhoa4x-kube_minions-otcpiw3oye46/33700819-0766-4d30-954b-29aace6048cc", "rel": "stack"}, {"href": "http://pluto:8004/v1/4c6bc4445c764249921a0a6e40b192dd/stacks/kubernetes-cluster-eovgkanhoa4x-kube_minions-otcpiw3oye46-0-ftjzf76onzqn/d1a8214c-c5b0-488c-83d6-f0a9cacbe844", "rel": "nested"}], "logical_resource_id": "0", "creation_time": "2019-03-29T10:40:59Z", "resource_status_reason": "resources[0]: Stack CREATE cancelled", "updated_time": "2019-03-29T10:40:59Z", "required_by": [], "resource_status": "CREATE_FAILED", "physical_resource_id": "d1a8214c-c5b0-488c-83d6-f0a9cacbe844", "resource_type": "file:///var/lib/magnum/env/lib/python2.7/site-packages/magnum/drivers/k8s_fedora_atomic_v1/templates/kubeminion.yaml"}, {"parent_resource": "0", "resource_name": "minion_wait_condition", "links": [{"href": "http://pluto:8004/v1/4c6bc4445c764249921a0a6e40b192dd/stacks/kubernetes-cluster-eovgkanhoa4x-kube_minions-otcpiw3oye46-0-ftjzf76onzqn/d1a8214c-c5b0-488c-83d6-f0a9cacbe844/resources/minion_wait_condition", "rel": "self"}, {"href": "http://pluto:8004/v1/4c6bc4445c764249921a0a6e40b192dd/stacks/kubernetes-cluster-eovgkanhoa4x-kube_minions-otcpiw3oye46-0-ftjzf76onzqn/d1a8214c-c5b0-488c-83d6-f0a9cacbe844", "rel": "stack"}], "logical_resource_id": "minion_wait_condition", "creation_time": "2019-03-29T10:41:01Z", "resource_status": "CREATE_FAILED", "updated_time": "2019-03-29T10:41:01Z", "required_by": [], "resource_status_reason": "CREATE aborted (Task create from HeatWaitCondition \"minion_wait_condition\" Stack \"kubernetes-cluster-eovgkanhoa4x-kube_minions-otcpiw3oye46-0-ftjzf76onzqn\" [d1a8214c-c5b0-488c-83d6-f0a9cacbe844] Timed out)", "physical_resource_id": "", "resource_type": "OS::Heat::WaitCondition"}]} I am not sure how to debug this. Please advise. Kind Regards, Navdeep -----Original Message----- From: Mohammed Naser Sent: 28 March 2019 13:27 To: Navdeep Uniyal Cc: Bharat Kunwar ; openstack at lists.openstack.org Subject: Re: [Magnum] Cluster Create failure your placement service seems to be broken :) On Thu, Mar 28, 2019 at 9:10 AM Navdeep Uniyal wrote: > > Yes, there seems to be some issue with the server creation now. > I will check and try resolving that. Thank you > > Regards, > Navdeep > > -----Original Message----- > From: Bharat Kunwar > Sent: 28 March 2019 12:40 > To: Navdeep Uniyal > Cc: openstack at lists.openstack.org > Subject: Re: [Magnum] Cluster Create failure > > Can you create a server normally? > -- Mohammed Naser — vexxhost ----------------------------------------------------- D. 514-316-8872 D. 800-910-1726 ext. 200 E. mnaser at vexxhost.com W. http://vexxhost.com From bharat at stackhpc.com Mon Apr 1 13:58:48 2019 From: bharat at stackhpc.com (Bharat Kunwar) Date: Mon, 1 Apr 2019 14:58:48 +0100 Subject: [Magnum] Cluster Create failure In-Reply-To: References: Message-ID: <29BAEF27-99CE-410D-B9EF-C3E5851C9D26@stackhpc.com> Hi Navdeep, Have you tried logging into the master/worker node and gripping for `fail` inside /var/log/cloud-init.log and /var/log/cloud-init-output.log? Also how did you deploy your OpenStack services? Bharat > On 1 Apr 2019, at 14:54, Navdeep Uniyal wrote: > > Dear All, > > My Kubernetes Cluster is timing out after 60 mins. 
> > Following is the update I am getting in magnum.log: > > {"stack": {"parent": null, "disable_rollback": true, "description": "This template will boot a Kubernetes cluster with one or more minions (as specified by the number_of_minions parameter, which defaults to 1).\n", "parameters": {"magnum_url": "http://10.68.48.4:9511/v1", "kube_tag": "v1.11.6", "http_proxy": "", "cgroup_driver": "cgroupfs", "registry_container": "container", "kubernetes_port": "6443", "calico_kube_controllers_tag": "v1.0.3", "octavia_enabled": "False", "etcd_volume_size": "0", "kube_dashboard_enabled": "True", "master_flavor": "medium", "etcd_tag": "v3.2.7", "kube_version": "v1.11.6", "k8s_keystone_auth_tag": "1.13.0", "kube_service_account_private_key": "******", "keystone_auth_enabled": "True", "cloud_provider_tag": "v0.2.0", "ca_key": "******", "tiller_enabled": "False", "registry_enabled": "False", "verify_ca": "True", "password": "******", "dns_service_ip": "10.254.0.10", "ssh_key_name": "magnum_key", "flannel_tag": "v0.10.0-amd64", "flannel_network_subnetlen": "24", "dns_nameserver": "8.8.8.8", "number_of_masters": "1", "wait_condition_timeout": "6000", "portal_network_cidr": "10.254.0.0/16", "admission_control_list": "NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,DefaultTolerationSeconds,MutatingAdmissionWebhook,ValidatingAdmissionWebhook,ResourceQuota", "pods_network_cidr": "10.100.0.0/16", "ingress_controller": "", "external_network": "751ae6e5-71af-4f78-b846-b0e1843093c8", "docker_volume_type": "", "registry_port": "5000", "tls_disabled": "False", "trust_id": "******", "swift_region": "", "influx_grafana_dashboard_enabled": "False", "volume_driver": "", "kubescheduler_options": "", "calico_tag": "v2.6.7", "loadbalancing_protocol": "TCP", "cloud_provider_enabled": "True", "OS::stack_id": "06c05715-ac05-4287-905c-38f1964f09fe", "flannel_cni_tag": "v0.3.0", "prometheus_monitoring": "False", "kubelet_options": "", "fixed_network": "", "kube_dashboard_version": "v1.8.3", "trustee_username": "d7ff417e-85b6-4b9a-94c3-211e7b830a51_4c6bc4445c764249921a0a6e40b192dd", "availability_zone": "", "server_image": "fedora-feduser-atomic", "flannel_network_cidr": "10.100.0.0/16", "cert_manager_api": "False", "minion_flavor": "medium", "kubeproxy_options": "", "calico_cni_tag": "v1.11.2", "cluster_uuid": "d7ff417e-85b6-4b9a-94c3-211e7b830a51", "grafana_admin_passwd": "******", "flannel_backend": "udp", "trustee_domain_id": "ac26210ad4f74217b3abf28a9b5cf56d", "fixed_subnet": "", "https_proxy": "", "username": "admin", "insecure_registry_url": "", "docker_volume_size": "0", "grafana_tag": "5.1.5", "kube_allow_priv": "true", "node_problem_detector_tag": "v0.6.2", "docker_storage_driver": "overlay2", "project_id": "4c6bc4445c764249921a0a6e40b192dd", "registry_chunksize": "5242880", "trustee_user_id": "d1983ea926c34536aabc8d50a85503e8", "container_infra_prefix": "", "number_of_minions": "1", "tiller_tag": "v2.12.3", "auth_url": "http://pluto:5000/v3", "registry_insecure": "True", "tiller_namespace": "magnum-tiller", "prometheus_tag": "v1.8.2", "OS::project_id": "4c6bc4445c764249921a0a6e40b192dd", "kubecontroller_options": "", "fixed_network_cidr": "10.0.0.0/24", "kube_service_account_key": "******", "ingress_controller_role": "ingress", "region_name": "RegionOne", "kubeapi_options": "", "openstack_ca": "******", "trustee_password": "******", "nodes_affinity_policy": "soft-anti-affinity", "minions_to_remove": "", "octavia_ingress_controller_tag": "1.13.2-alpha", "OS::stack_name": 
"kubernetes-cluster-wwmvqecjiznb", "system_pods_timeout": "5", "system_pods_initial_delay": "30", "dns_cluster_domain": "cluster.local", "calico_ipv4pool": "192.168.0.0/16", "network_driver": "flannel", "monitoring_enabled": "False", "heat_container_agent_tag": "stein-dev", "no_proxy": "", "discovery_url": "https://discovery.etcd.io/b8fe011e8b281615904de97ee05511a7"}, "deletion_time": null, "stack_name": "kubernetes-cluster-wwmvqecjiznb", "stack_user_project_id": "8204d11826fb4253ae7c9063306cb4e1", "tags": null, "creation_time": "2019-04-01T13:19:53Z", "links": [{"href": "http://pluto:8004/v1/4c6bc4445c764249921a0a6e40b192dd/stacks/kubernetes-cluster-wwmvqecjiznb/06c05715-ac05-4287-905c-38f1964f09fe", "rel": "self"}], "capabilities": [], "notification_topics": [], "timeout_mins": 60, "stack_status": "CREATE_IN_PROGRESS", "stack_owner": null, "updated_time": null, "id": "06c05715-ac05-4287-905c-38f1964f09fe", "stack_status_reason": "Stack CREATE started", "template_description": "This template will boot a Kubernetes cluster with one or more minions (as specified by the number_of_minions parameter, which defaults to 1).\n"}} > > I am not sure how to triage this issue as I cannot see any errors in heat.log as well. > Even I can see both Master and Minion node running but the task errors out during OS::Heat::SoftwareDeployment in kube_cluster_deploy and OS::Heat::ResourceGroup in kube_minions > > I don't have much experience with Kubernetes clusters as well so please forgive me if I am raising any silly queries. > > Kind Regards, > Navdeep > From mnaser at vexxhost.com Mon Apr 1 14:12:56 2019 From: mnaser at vexxhost.com (Mohammed Naser) Date: Mon, 1 Apr 2019 10:12:56 -0400 Subject: [nova] super long online_data_migrations In-Reply-To: References: Message-ID: On Mon, Apr 1, 2019 at 8:25 AM Jay Pipes wrote: > > On 03/31/2019 10:21 PM, Mohammed Naser wrote: > > Hi there, > > > > During upgrades, I've noticed that when running online_data_migrations > > with "infinite-until-done" mode, it loops over all of the migrations > > one by one. > > > > However, one of the online data migrations > > (instance_obj.populate_missing_availability_zones) makes a query that > > takes a really long time as it seems inefficient (which eventually > > results in 0, cause it already ran), which means as it loops in > > "blocks" of 50, there's almost a 2-3 to 8 minute wait in really large > > environments. > > > > The question ends up in specific: > > > > SELECT count(*) AS count_1 > > FROM (SELECT instance_extra.created_at AS instance_extra_created_at, > > instance_extra.updated_at AS instance_extra_updated_at, > > instance_extra.deleted_at AS instance_extra_deleted_at, > > instance_extra.deleted AS instance_extra_deleted, instance_extra.id AS > > instance_extra_id, instance_extra.instance_uuid AS > > instance_extra_instance_uuid > > FROM instance_extra > > WHERE instance_extra.keypairs IS NULL AND instance_extra.deleted = 0) AS anon_1 > > Ugh. :( > > The online data migration shouldn't be calling the above SQL statement > at all. > > Instead, the migration should be doing something like this: > > SELECT ie.instance_uuid FROM instance_extra AS ie > WHERE ie.keypairs IS NULL AND ie.deletd = 0 > LIMIT 100 > > and then while getting any rows returned from the above, perform the > work of transforming the problematic data in the table for each matched > instance_uuid. > > I'm actually not sure what the above query has to do with availability > zones, but I'll look into it later on this morning. 
> > Can you report a bug about this and we'll get on it ASAP? Thanks for looking into this. I've noticed some of the newer data migrations actually take into question the limit, but this one doesn't from the looks of it. https://bugs.launchpad.net/nova/+bug/1822613 I'd be happy to test it out against a big data set, if you'd like. > Best, > -jay > > > The explain for the DB query in this example: > > > > +------+-------------+----------------+------+---------------+------+---------+------+--------+-------------+ > > | id | select_type | table | type | possible_keys | key | > > key_len | ref | rows | Extra | > > +------+-------------+----------------+------+---------------+------+---------+------+--------+-------------+ > > | 1 | SIMPLE | instance_extra | ALL | NULL | NULL | > > NULL | NULL | 382473 | Using where | > > +------+-------------+----------------+------+---------------+------+---------+------+--------+-------------+ > > > > It's possible that it can be ever worse, as this number is from > > another very-long running environments. > > > > +------+-------------+----------------+------+---------------+------+---------+------+---------+-------------+ > > | id | select_type | table | type | possible_keys | key | > > key_len | ref | rows | Extra | > > +------+-------------+----------------+------+---------------+------+---------+------+---------+-------------+ > > | 1 | SIMPLE | instance_extra | ALL | NULL | NULL | > > NULL | NULL | 3008741 | Using where | > > +------+-------------+----------------+------+---------------+------+---------+------+---------+-------------+ > > > > I'm not the SQL expert, could we not optimize this? Alternatively, > > could we update the online data migrations code to "pop out" any of > > the migrations that return 0 for the next iteration, that way it only > > works on those online_data_migrations that *have* to be done, and > > ignore those it knows are done? > > > > Thanks, > > Mohammed > > > -- Mohammed Naser — vexxhost ----------------------------------------------------- D. 514-316-8872 D. 800-910-1726 ext. 200 E. mnaser at vexxhost.com W. http://vexxhost.com From openstack at fried.cc Mon Apr 1 14:15:49 2019 From: openstack at fried.cc (Eric Fried) Date: Mon, 1 Apr 2019 09:15:49 -0500 Subject: [nova][CI] nova-live-migration failing Message-ID: <9f43b784-2dfb-3d63-0212-51f758a36e32@fried.cc> > I've noticed the nova-live-migration job failing consistently over the > weekend on clearly unrelated patches. Haven't had a chance to look > into it at all. Please consider holding off blind rechecks for a bit. Matt identified bug [1], cause [2], and fix [3]. Please wait until the latter has merged (or rebase on top of it) before rechecking in nova. Thanks, efried [1] https://bugs.launchpad.net/nova/+bug/1822605 [2] https://review.openstack.org/#/c/601433/ [3] https://review.openstack.org/#/c/649036/ From navdeep.uniyal at bristol.ac.uk Mon Apr 1 14:19:03 2019 From: navdeep.uniyal at bristol.ac.uk (Navdeep Uniyal) Date: Mon, 1 Apr 2019 14:19:03 +0000 Subject: [Magnum] Cluster Create failure In-Reply-To: <29BAEF27-99CE-410D-B9EF-C3E5851C9D26@stackhpc.com> References: <29BAEF27-99CE-410D-B9EF-C3E5851C9D26@stackhpc.com> Message-ID: Hi Bharat, Thank you for your response. 
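The checks suggested boil down to something like the following on each node (illustrative commands only; paths as listed, and the curl target is the standard metadata endpoint):

    # on the master and minion nodes
    grep -i fail /var/log/cloud-init.log /var/log/cloud-init-output.log
    # quick check that the metadata service is reachable from inside the VM
    curl http://169.254.169.254/openstack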
I am getting following errors in my worker VM (Master VM has similar errors): [feduser at kubernetes-cluster-wwmvqecjiznb-minion-0 ~]$ less /var/log/cloud-init.log | grep fail 2019-03-29 16:20:37,018 - cc_growpart.py[DEBUG]: '/' SKIPPED: device_part_info(/dev/mapper/atomicos-root) failed: /dev/mapper/atomicos-root not a partition 2019-03-29 16:20:37,219 - main.py[DEBUG]: Ran 14 modules with 0 failures 2019-03-29 16:20:38,450 - main.py[DEBUG]: Ran 7 modules with 0 failures 2019-03-29 16:20:39,501 - main.py[DEBUG]: Ran 16 modules with 0 failures 2019-04-01 13:21:07,978 - util.py[WARNING]: failed stage init-local 2019-04-01 13:21:07,978 - util.py[DEBUG]: failed stage init-local 2019-04-01 13:21:09,250 - url_helper.py[DEBUG]: Calling 'http://169.254.169.254/openstack' failed [0/-1s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /openstack (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 101] Network is unreachable',))] 2019-04-01 13:21:09,252 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [0/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 101] Network is unreachable',))] 2019-04-01 13:21:10,255 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [1/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 101] Network is unreachable',))] 2019-04-01 13:21:11,259 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [2/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 101] Network is unreachable',))] 2019-04-01 13:21:12,264 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [3/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 101] Network is unreachable',))] 2019-04-01 13:21:13,268 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [4/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 101] Network is unreachable',))] 2019-04-01 13:21:14,272 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [5/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 101] Network is unreachable',))] 2019-04-01 13:21:16,278 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [7/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by 
NewConnectionError(': Failed to establish a new connection: [Errno 101] Network is unreachable',))] 2019-04-01 13:21:18,283 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [9/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 101] Network is unreachable',))] 2019-04-01 13:21:36,442 - cc_growpart.py[DEBUG]: '/' SKIPPED: device_part_info(/dev/mapper/atomicos-root) failed: /dev/mapper/atomicos-root not a partition 2019-04-01 13:21:36,609 - main.py[DEBUG]: Ran 14 modules with 0 failures 2019-04-01 13:21:37,847 - main.py[DEBUG]: Ran 7 modules with 0 failures 2019-04-01 13:24:19,548 - util.py[WARNING]: Running module scripts-user () failed 2019-04-01 13:24:19,548 - util.py[DEBUG]: Running module scripts-user () failed return self._runners.run(name, functor, args, freq, clear_on_fail) % (len(failed), len(attempted))) RuntimeError: Runparts: 2 failures in 11 attempted commands 2019-04-01 13:24:19,614 - main.py[DEBUG]: Ran 16 modules with 1 failures I cannot see any error in the metadata service logs and I can reach the server from my VM. I used the Openstack (pike) guide to deploy it manually without using any other system. In my setup, I have nova, neutron(Self-Service), Glance, Horizon, Keystone, Heat and Magnum running. Kind Regards, Navdeep -----Original Message----- From: Bharat Kunwar Sent: 01 April 2019 14:59 To: Navdeep Uniyal Cc: Mohammed Naser ; openstack at lists.openstack.org Subject: Re: [Magnum] Cluster Create failure Hi Navdeep, Have you tried logging into the master/worker node and gripping for `fail` inside /var/log/cloud-init.log and /var/log/cloud-init-output.log? Also how did you deploy your OpenStack services? Bharat > On 1 Apr 2019, at 14:54, Navdeep Uniyal wrote: > > Dear All, > > My Kubernetes Cluster is timing out after 60 mins. 
> > Following is the update I am getting in magnum.log: > > {"stack": {"parent": null, "disable_rollback": true, "description": "This template will boot a Kubernetes cluster with one or more minions (as specified by the number_of_minions parameter, which defaults to 1).\n", "parameters": {"magnum_url": "http://10.68.48.4:9511/v1", "kube_tag": "v1.11.6", "http_proxy": "", "cgroup_driver": "cgroupfs", "registry_container": "container", "kubernetes_port": "6443", "calico_kube_controllers_tag": "v1.0.3", "octavia_enabled": "False", "etcd_volume_size": "0", "kube_dashboard_enabled": "True", "master_flavor": "medium", "etcd_tag": "v3.2.7", "kube_version": "v1.11.6", "k8s_keystone_auth_tag": "1.13.0", "kube_service_account_private_key": "******", "keystone_auth_enabled": "True", "cloud_provider_tag": "v0.2.0", "ca_key": "******", "tiller_enabled": "False", "registry_enabled": "False", "verify_ca": "True", "password": "******", "dns_service_ip": "10.254.0.10", "ssh_key_name": "magnum_key", "flannel_tag": "v0.10.0-amd64", "flannel_network_subnetlen": "24", "dns_nameserver": "8.8.8.8", "number_of_masters": "1", "wait_condition_timeout": "6000", "portal_network_cidr": "10.254.0.0/16", "admission_control_list": "NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,DefaultTolerationSeconds,MutatingAdmissionWebhook,ValidatingAdmissionWebhook,ResourceQuota", "pods_network_cidr": "10.100.0.0/16", "ingress_controller": "", "external_network": "751ae6e5-71af-4f78-b846-b0e1843093c8", "docker_volume_type": "", "registry_port": "5000", "tls_disabled": "False", "trust_id": "******", "swift_region": "", "influx_grafana_dashboard_enabled": "False", "volume_driver": "", "kubescheduler_options": "", "calico_tag": "v2.6.7", "loadbalancing_protocol": "TCP", "cloud_provider_enabled": "True", "OS::stack_id": "06c05715-ac05-4287-905c-38f1964f09fe", "flannel_cni_tag": "v0.3.0", "prometheus_monitoring": "False", "kubelet_options": "", "fixed_network": "", "kube_dashboard_version": "v1.8.3", "trustee_username": "d7ff417e-85b6-4b9a-94c3-211e7b830a51_4c6bc4445c764249921a0a6e40b192dd", "availability_zone": "", "server_image": "fedora-feduser-atomic", "flannel_network_cidr": "10.100.0.0/16", "cert_manager_api": "False", "minion_flavor": "medium", "kubeproxy_options": "", "calico_cni_tag": "v1.11.2", "cluster_uuid": "d7ff417e-85b6-4b9a-94c3-211e7b830a51", "grafana_admin_passwd": "******", "flannel_backend": "udp", "trustee_domain_id": "ac26210ad4f74217b3abf28a9b5cf56d", "fixed_subnet": "", "https_proxy": "", "username": "admin", "insecure_registry_url": "", "docker_volume_size": "0", "grafana_tag": "5.1.5", "kube_allow_priv": "true", "node_problem_detector_tag": "v0.6.2", "docker_storage_driver": "overlay2", "project_id": "4c6bc4445c764249921a0a6e40b192dd", "registry_chunksize": "5242880", "trustee_user_id": "d1983ea926c34536aabc8d50a85503e8", "container_infra_prefix": "", "number_of_minions": "1", "tiller_tag": "v2.12.3", "auth_url": "http://pluto:5000/v3", "registry_insecure": "True", "tiller_namespace": "magnum-tiller", "prometheus_tag": "v1.8.2", "OS::project_id": "4c6bc4445c764249921a0a6e40b192dd", "kubecontroller_options": "", "fixed_network_cidr": "10.0.0.0/24", "kube_service_account_key": "******", "ingress_controller_role": "ingress", "region_name": "RegionOne", "kubeapi_options": "", "openstack_ca": "******", "trustee_password": "******", "nodes_affinity_policy": "soft-anti-affinity", "minions_to_remove": "", "octavia_ingress_controller_tag": "1.13.2-alpha", "OS::stack_name": 
"kubernetes-cluster-wwmvqecjiznb", "system_pods_timeout": "5", "system_pods_initial_delay": "30", "dns_cluster_domain": "cluster.local", "calico_ipv4pool": "192.168.0.0/16", "network_driver": "flannel", "monitoring_enabled": "False", "heat_container_agent_tag": "stein-dev", "no_proxy": "", "discovery_url": "https://discovery.etcd.io/b8fe011e8b281615904de97ee05511a7"}, "deletion_time": null, "stack_name": "kubernetes-cluster-wwmvqecjiznb", "stack_user_project_id": "8204d11826fb4253ae7c9063306cb4e1", "tags": null, "creation_time": "2019-04-01T13:19:53Z", "links": [{"href": "http://pluto:8004/v1/4c6bc4445c764249921a0a6e40b192dd/stacks/kubernetes-cluster-wwmvqecjiznb/06c05715-ac05-4287-905c-38f1964f09fe", "rel": "self"}], "capabilities": [], "notification_topics": [], "timeout_mins": 60, "stack_status": "CREATE_IN_PROGRESS", "stack_owner": null, "updated_time": null, "id": "06c05715-ac05-4287-905c-38f1964f09fe", "stack_status_reason": "Stack CREATE started", "template_description": "This template will boot a Kubernetes cluster with one or more minions (as specified by the number_of_minions parameter, which defaults to 1).\n"}} > > I am not sure how to triage this issue as I cannot see any errors in heat.log as well. > Even I can see both Master and Minion node running but the task errors out during OS::Heat::SoftwareDeployment in kube_cluster_deploy and OS::Heat::ResourceGroup in kube_minions > > I don't have much experience with Kubernetes clusters as well so please forgive me if I am raising any silly queries. > > Kind Regards, > Navdeep > From mbooth at redhat.com Mon Apr 1 14:23:03 2019 From: mbooth at redhat.com (Matthew Booth) Date: Mon, 1 Apr 2019 14:23:03 +0000 Subject: [nova] Privsep is not giving us any security In-Reply-To: References: <99f73d03-5c21-d346-4611-12f87c4ac124@openstack.org> Message-ID: On Sat, 30 Mar 2019 at 08:32, Michael Still wrote: > > On Sat., 30 Mar. 2019, 6:28 pm Thierry Carrez, wrote: >> >> Michael Still wrote: >> > The reality is that privsep was always going to be a process. It's taken >> > more than 80 patches to get close to removing rootwrap. >> > >> > There are other advantages to removing rootwrap, mainly around >> > performance, the integration of library code, and general >> > non-bonkersness (cat to tee to write to a file as root), etc. >> > >> > There is president in the code to mark calls as undesirable, and others >> > could be marked like that as well, but ultimately someone needs to do an >> > audit and fix things... That's more than one person can reasonably do. >> > >> > So, who wants to help try and improve this? Patches welcome. >> >> It's been on my priority-2 TODO list for a while to help with that... >> Now if people would stop adding to my priority-1 TODO list... >> >> Agree that's definitely more than a one-person job, but migrating a >> specific call is also a reasonably self-contained unit of work that (1) >> does not require a deep understanding of all the code around it, and (2) >> does not commit you for a lifelong feature maintenance duty... So maybe >> it would be a good thing to suggest newcomers / students to get a poke >> at? I'm happy to help with the reviewing if we can come up with a topic >> name that helps finding those. > > > One concern I have is that I am not sure it's always as simple as it looks. For example, we could enforce that device files are always in /dev, but is that always true on all architectures with all hypervisors? How do we know that? 
I think the answer here is that we don't have privsep functions like that *at all*. So instead of a generic privsep.data.discombobulate(device) you have privsep.libvirt.volume.foo.discombobulate(volume).

I mentioned this as a general principle in my original mail, which has since been snipped: we should probably *never* pass a reference to a system resource (the most obvious example being a path) to a privsep function. Instead, privsep has its own config and can work that out for itself. This does mean a proliferation of specific privsep functions over a few generic ones, but that's how the security model works. In general we should expect classes to have a shadow privsep class containing security-sensitive logic.

> We could add enforcement and at the same time add a workaround flag to turn it off, which is immediately deprecated so people have a release to notice a breakage. Do we do that per enforcement rule? That's a lot of flags!
>
> Finally, the initial forklift has been a series of relatively simple code swaps (which is really the root of Mr Booth's concern). Even with taking the easy path there haven't been heaps of volunteers helping and some of this code has been in review for quite a long time. Do we really think that volunteers are going to show up now?

Exactly. Most folks acknowledge the importance of this stuff, but not to the extent of prioritising it for review: it's boring, involves large code churn, and doesn't immediately add anything obvious. Worse, the code churn means that unless it lands quickly it's very soon going to be in merge conflict hell. Writing the code is likely the easy bit. Unless we can agree on this as a priority theme for a while I don't see us making any progress on it in practice.

Matt

--
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG
Phone: +442070094448 (UK)

From ed at leafe.com Mon Apr 1 14:51:12 2019
From: ed at leafe.com (Ed Leafe)
Date: Mon, 1 Apr 2019 09:51:12 -0500
Subject: [placement] update 19-12
In-Reply-To: 
References: 
Message-ID: 

On Mar 29, 2019, at 9:07 AM, Chris Dent wrote:
>
> * Some [image type traits](https://review.openstack.org/648147) have
> merged (to be used in a nova-side request pre-filter), but the
> change has exposed an issue we'll need to resolve: os-traits and
> os-resource-classes are under the cycle-with-intermediary style
> release which means that at this time in the cycle it is difficult
> to make a release which can delay work. We could switch to
> independent. This would make sense for libraries that are
> basically lists of strings.

It's hard to break that. Especially since strings cannot be removed. +1 to moving to independent releases.
-- Ed Leafe From cdent+os at anticdent.org Mon Apr 1 14:55:44 2019 From: cdent+os at anticdent.org (Chris Dent) Date: Mon, 1 Apr 2019 15:55:44 +0100 (BST) Subject: [placement] update 19-12 In-Reply-To: References: Message-ID: On Mon, 1 Apr 2019, Ed Leafe wrote: > On Mar 29, 2019, at 9:07 AM, Chris Dent wrote: >> We could also >> investigate making os-traits and os-resource-classes >> required-projects in job templates in zuul. This would allow them >> to be "tox siblings". Or we could wait. Please express an >> opinion if you have one. > > I googled ’tox siblings’ and got a lot of family relationship advice. So no, I have no opinion on this. https://docs.openstack.org/infra/manual/zuulv3.html#installation-of-sibling-requirements -- Chris Dent ٩◔̯◔۶ https://anticdent.org/ freenode: cdent tw: @anticdent From skaplons at redhat.com Mon Apr 1 14:56:45 2019 From: skaplons at redhat.com (Slawomir Kaplonski) Date: Mon, 1 Apr 2019 16:56:45 +0200 Subject: [dev][neutron] CI broken In-Reply-To: <1554126316.4997.15.camel@offenerstapel.de> References: <1554126316.4997.15.camel@offenerstapel.de> Message-ID: Hi, Thx Jens for pointing to the culprit of the issue. My patch for neutron-tempest-plugin is now merged so neutron jobs should be fine now and You can recheck Your patches now :) > Wiadomość napisana przez Jens Harbott w dniu 01.04.2019, o godz. 15:45: > > On Mon, 2019-04-01 at 08:29 +0200, Slawomir Kaplonski wrote: >> Hi, >> >> Just FYI, since few days we have broken neutron-tempest-plugin- >> designate-scenario job and it is failing 100% times. Bug is reported >> in [1]. >> If this job failed on Your patch, please don’t recheck as it will not >> solve the problem. >> >> [1] https://bugs.launchpad.net/neutron/+bug/1822453 > > Actually a patch in devstack caused this regression, affecting all jobs > that use multiple tempest plugins. A fix for that is being reviewed, > see the above bug report for details. > > Jens > — Slawek Kaplonski Senior software engineer Red Hat From mriedemos at gmail.com Mon Apr 1 14:57:23 2019 From: mriedemos at gmail.com (Matt Riedemann) Date: Mon, 1 Apr 2019 07:57:23 -0700 Subject: [nova][CI] nova-live-migration failing In-Reply-To: References: Message-ID: <64c8c2ec-3136-baae-31e6-703106a478df@gmail.com> On 3/31/2019 4:00 AM, Eric Fried wrote: > I've noticed the nova-live-migration job failing consistently over the weekend on clearly unrelated patches. Haven't had a chance to look into it at all. Please consider holding off blind rechecks for a bit. https://review.openstack.org/#/c/649036/ fixes it, so avoid rechecks until that has merged. -- Thanks, Matt From florian.engelmann at everyware.ch Mon Apr 1 15:40:57 2019 From: florian.engelmann at everyware.ch (Florian Engelmann) Date: Mon, 1 Apr 2019 17:40:57 +0200 Subject: [nova] retire a flavor In-Reply-To: References: Message-ID: <33137296-6ae3-3f6d-43bf-35b3d52e1a6a@everyware.ch> Hi, as far as I tested it is not possible to change a flavor from public to private. Managing access to flavors on a project basis might be an option for a private cloud but not for a public one. There are a lot of unit tests about: "alias": "OS-FLV-DISABLED", "description": "Support to show the disabled status of a flavor.", What's that feature for? How does it work? Is this ongoing? I think we need some method to retire images and flavors. All the best, Florian Am 3/31/19 um 11:57 AM schrieb Sean Mooney: > On Fri, 2019-03-29 at 11:25 -0500, Eric Fried wrote: >> Florian- >> >> You can definitely delete a flavor [1]. 
>>
>> Don't worry about it affecting existing instances that were created with
>> that flavor: nova stores a copy of the flavor with the instance itself
>> so the information is preserved.
>>
>> Thanks for the question!
> the other option if you want to keep the flavor but not make it usable by
> new instances in general is to use the os-flavor-access api.
> https://developer.openstack.org/api-ref/compute/?expanded=#flavors-access-flavors-os-flavor-access
> This allows you to make a flavor private and control what project can use it.
>
>
> unfortunately we seem to have almost no documentation for this, however that is what the --project argument
> to openstack flavor set is used to control.
> https://docs.openstack.org/python-openstackclient/pike/cli/command-objects/flavor.html#flavor-set
> e.g. openstack flavor set --project