[openstack-dev] [cinder][nova][os-brick] Testing for proposed iSCSI OS-Brick code
geguileo at redhat.com
Thu Jun 1 09:53:49 UTC 2017
On 31/05, Matt Riedemann wrote:
> On 5/31/2017 6:58 AM, Gorka Eguileor wrote:
> > Hi,
> > As some of you may know I've been working on improving iSCSI connections
> > on OpenStack to make them more robust and prevent them from leaving
> > leftovers on attach/detach operations.
> > There are a couple of posts going into more detail, but a good
> > summary would be that fixing this issue requires a considerable rework
> > of OS-Brick, changes in Open iSCSI, Cinder, and Nova, and specific tests.
> > Relevant changes for those projects are:
> > - Open iSCSI: iscsid behavior is not a perfect fit for the OpenStack use
> > case, so a new feature was added to disable automatic scans that added
> > unintended devices to the system. Done and merged; it will be
> > available on RHEL with iscsi-initiator-utils-18.104.22.1684-2.el7.
> > - OS-Brick: rework iSCSI to make it robust on unreliable networks, to
> > add a `force` detach option that prioritizes leaving a clean system
> > over possible data loss, and to support the new Open iSCSI feature.
> > Done and pending review 
> > - Cinder: Handle some attach/detach errors a little better and add
> > support for the force detach option for some operations where data loss
> > on error is acceptable, i.e. create volume from image, restore backup,
> > etc. Done and pending review 
> > - Nova: I haven't looked into the code here, but I'm sure there will be
> > cases where using the force detach operation will be useful.
> > - Tests: While we do have tempest tests that verify that attach/detach
> > operations work both in Nova and in cinder volume creation operations,
> > they are not meant to test the robustness of the system, so new tests
> > will be required to validate the code. Done 
> > Proposed tests are simplified versions of the ones I used to validate
> > the code; but hey, at least these are somewhat readable ;-)
> > Unfortunately they are not in line with the tempest mission since they
> > are not meant to be run in a production environment due to their
> > disruptive nature while injecting errors. They need to be run
> > sequentially and without any other operations running on the deployment.
> > They also run sudo commands via local bash or SSH for the verification
> > and error generation bits.
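As an illustration of that mechanism (my own sketch, not the tests' actual helper; host name and command are placeholders), the remote variant boils down to prefixing each verification or error-injection command with ssh/sudo:

```shell
# Sketch: compose the command line used to run a privileged check on a node.
# Assumes key-based SSH auth and passwordless sudo on the target host.
remote_sudo() {
  host=$1; shift
  # Printed here for illustration; the tests would execute it instead.
  echo "ssh $host sudo $*"
}
remote_sudo compute1 multipath -ll
```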
> > We are testing create volume from image and attaching a volume to an
> > instance under the following networking error scenarios:
> > - No errors
> > - All paths have 10% incoming packets dropped
> > - All paths have 20% incoming packets dropped
> > - All paths have 100% incoming packets dropped
> > - Half the paths have 20% incoming packets dropped
> > - The other half of the paths have 20% incoming packets dropped
> > - Half the paths have 100% incoming packets dropped
> > - The other half of the paths have 100% incoming packets dropped
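For anyone curious how such a drop rate can be injected, a rule along these lines (a hedged sketch, not the tests' exact command; the address and iSCSI port are placeholders for your backend portal) uses the kernel's "statistic" match:

```shell
# Sketch: build an iptables rule that randomly drops a fraction of inbound
# packets coming from one storage path, via the "statistic" match module.
drop_rule() {
  # $1 = backend portal IP, $2 = drop probability (0.0 - 1.0)
  echo "iptables -I INPUT -s $1 -p tcp --sport 3260 -m statistic --mode random --probability $2 -j DROP"
}
drop_rule 192.168.1.10 0.20   # run the printed command with sudo to apply it
```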
> > There are single execution versions as well as 10 consecutive operations
> > variants.
> > Since these are big changes I'm sure we would all feel a lot more
> > confident to merge them if storage vendors would run the new tests to
> > confirm that there are no issues with their backends.
> > Unfortunately, to fully test the solution you may need to build the
> > latest Open-iSCSI package and install it on the system; then you can
> > just use an all-in-one DevStack with a couple of changes in the local.conf:
> > enable_service tempest
> > CINDER_REPO=https://review.openstack.org/p/openstack/cinder
> > CINDER_BRANCH=refs/changes/45/469445/1
> > LIBS_FROM_GIT=os-brick
> > OS_BRICK_REPO=https://review.openstack.org/p/openstack/os-brick
> > OS_BRICK_BRANCH=refs/changes/94/455394/11
> > [[post-config|$CINDER_CONF]]
> > [multipath-backend]
> > use_multipath_for_image_xfer=true
> > [[post-config|$NOVA_CONF]]
> > [libvirt]
> > volume_use_multipath = True
> > [[post-config|$KEYSTONE_CONF]]
> > [token]
> > expiration = 14400
> > [[test-config|$TEMPEST_CONFIG]]
> > [volume-feature-enabled]
> > multipath = True
> > [volume]
> > build_interval = 10
> > multipath_type = $MULTIPATH_VOLUME_TYPE
> > backend_protocol_tcp_port = 3260
> > multipath_backend_addresses = $STORAGE_BACKEND_IP1,$STORAGE_BACKEND_IP2
> > Multinode configurations are also supported, using SSH with user/password or
> > private key to introduce the errors or check that the systems didn't leave any
> > leftovers; the tests can also run a cleanup command between tests, etc., but
> > that's beyond the scope of this email.
> > Then you can run them all from /opt/stack/tempest with:
> > $ cd /opt/stack/tempest
> > $ OS_TEST_TIMEOUT=7200 ostestr -r cinder.tests.tempest.scenario.test_multipath.*
> > But I would recommend first running the simplest one without errors and
> > manually checking that the multipath is being created.
> > $ ostestr -n cinder.tests.tempest.scenario.test_multipath.TestMultipath.test_create_volume_with_errors_1
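To eyeball that, `sudo multipath -ll` on the node should show the volume with one path per portal; a tiny helper like this (my own sketch, not part of the tests; the sample lines are canned output) can count the active paths:

```shell
# Sketch: count "active ready" paths in `multipath -ll` output read from stdin.
count_active_paths() {
  grep -c 'active ready'
}
# Demo with canned output; real use: sudo multipath -ll | count_active_paths
printf '|- 3:0:0:1 sdb 8:16 active ready running\n`- 4:0:0:1 sdc 8:32 active ready running\n' | count_active_paths
```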
> > Then doing the same with one with errors and verify the presence of the
> > filters in iptables and that the packet drop for those filters is non zero:
> > $ ostestr -n cinder.tests.tempest.scenario.test_multipath.TestMultipath.test_create_volume_with_errors_2
> > $ sudo iptables -nvL INPUT
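A quick way to check "non zero" without reading the whole table is to filter the DROP rules and look at the first (pkts) column. A sketch, not part of the tests:

```shell
# Sketch: succeed if any DROP rule in `iptables -nvL INPUT` output (read
# from stdin) has a non-zero packet counter in the first column.
has_matching_drops() {
  awk '/DROP/ && $1+0 > 0 { found = 1 } END { exit !found }'
}
# Demo with a canned line; real use: sudo iptables -nvL INPUT | has_matching_drops
printf ' 42 3360 DROP tcp -- * * 10.0.0.1 0.0.0.0/0\n' | has_matching_drops && echo "filters active"
```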
> > Then doing the same with a Nova test just to verify that it is correctly
> > configured to use multipathing:
> > $ ostestr -n cinder.tests.tempest.scenario.test_multipath.TestMultipath.test_attach_detach_once_with_errors_1
> > And if these work we can go ahead and run the 10 operations scenarios,
> > since the individual ones don't have any added value over those. I usually
> > run the tests like this:
> > $ OS_TEST_TIMEOUT=7200 ostestr -r 'cinder.tests.tempest.scenario.test_multipath.TestMultipath.test_(create_volumes|attach_detach_many)_with_errors_*' --serial -- -f
> > Friendly warning: some of the tests take forever, which is why we are increasing
> > the Keystone token expiration time and the test timeout. For example, with 2
> > paths some tests take around 40 minutes, so don't despair.
> > The only backends I've actually tried the tests on are QNAP and XtremIO,
> > so I'm really hoping someone else will have the inclination and the time
> > to run the tests on different backends, and maybe even do some
> > additional testing. :-)
> > Cheers,
> > Gorka
> > PS: For my tests I actually reduced the running time by lowering the
> > iscsid login retries, setting the configuration parameter
> > "node.session.initial_login_retry_max" to a value of 2.
> >  https://gorka.eguileor.com/iscsi-multipath/
> >  https://gorka.eguileor.com/revamping-iscsi-connections-in-openstack/
> >  https://github.com/open-iscsi/open-iscsi/commit/5e32aea95741a07d53153c658a0572588eae494d
> >  https://github.com/open-iscsi/open-iscsi/commit/d5483b0df96bd2a1cf86039cf4c6822ec7d7f609
> >  https://review.openstack.org/455392
> >  https://review.openstack.org/455393
> >  https://review.openstack.org/455394
> >  https://review.openstack.org/459453
> >  https://review.openstack.org/459454
> >  https://review.openstack.org/469445
> > __________________________________________________________________________
> > OpenStack Development Mailing List (not for usage questions)
> > Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> Gorka, this is really all about testing and making multipath support more
> robust, right? For those not using multipath does any of this matter?
These changes only affect iSCSI connections (single or multipathed), so
they will be irrelevant for any other connection type, with the small
exception of a cleanup fix that affected all connections.
> The reason I ask is I was thinking we were going to also fix some other long
> standing issues, like the bug where we don't terminate connections and
> remove exports properly when shelve-offloading an instance. I guess that's
> totally unrelated here.
Yes, that would be unrelated, and as far as I can tell you already have
all the required Cinder bits in place to fix it, right? You can just
call the os-terminate_connection REST API as you suggest in your patch.
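For reference, that is a volume action on the Cinder API; a hedged sketch of the request follows, where the token, endpoint, project/volume IDs, and the example IQN are all placeholders:

```shell
# Sketch: build the os-terminate_connection request body for a given initiator.
terminate_body() {
  # $1 = initiator IQN of the host whose export should be removed
  printf '{"os-terminate_connection": {"connector": {"initiator": "%s"}}}' "$1"
}
terminate_body iqn.1994-05.com.redhat:example
# Real call (placeholders):
#   curl -X POST -H "X-Auth-Token: $TOKEN" -H "Content-Type: application/json" \
#     -d "$(terminate_body <iqn>)" \
#     "$CINDER_URL/v3/$PROJECT_ID/volumes/$VOLUME_ID/action"
```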
> As for the testing concern in Tempest with serial tests, it is possible to
> run tests in Tempest with a LockFixture but you'd likely have to lock all
> tests that involve a volume from running at the same time. We have the same
> issue with needing to test the evacuate feature in Nova but evacuate
> requires that the nova-compute service is down on the host so we'd have to
> run it serially.
To avoid false failures I would need to prevent a great number of tests
from running concurrently -any test that attaches/detaches a volume
directly or indirectly: nova attach/detach, cinder create from image,
cinder migrate, backup create/restore, etc.- as well as make sure that
none of these operations were being performed on the deployment while
the tests were running, and that seems like a considerable effort.
> So do you plan on leaving those tests in Tempest or moving them into the
> Cinder repo and making them run under a separate tox serial environment?
After some consideration I decided to create tests directly in the
Cinder repository, as it made more sense.
Regarding running the tests, my idea was to leave them as one-off
manual tests, as I don't think it would be a good idea for our CI or
customers to run them automatically.
There are various reasons for this:
- Long running time: Tests can take a couple of hours (not including
environment deployment time), and our gates would take a huge hit.
- Security: We require the capability to run sudo commands (directly or
via SSH depending on the configuration) in order to generate
connection errors and to validate the results of the tests, which any
security assessment would consider an unnecessary risk.
- Disruption: When injecting errors we are disrupting communications to
the storage array, in some cases preventing all communication with it,
which would be a terrible idea on a customer deployment.
>  https://bugs.launchpad.net/nova/+bug/1547142
>  https://bugs.launchpad.net/cinder/+bug/1527278