[cinder][dev] Bug for deferred deletion in RBD

Gorka Eguileor geguileo at redhat.com
Wed Feb 13 09:37:24 UTC 2019


On 13/02, Jae Sang Lee wrote:
> As Gorka mentioned, the SQL connection is using pymysql.
>
> And I increased max_pool_size to 50 (I think Gorka wrote max_retries
> when he meant max_pool_size),

My bad, I meant "max_overflow", which was changed a while back to 50
(though I don't remember when).
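For reference, this is the kind of thing I mean in the "[database]"
section of cinder.conf (a sketch; the values shown are the usual
defaults, adjust to your deployment):

    [database]
    connection = mysql+pymysql://cinder:<pw>@<host>:<port>/cinder
    # size of the SQLAlchemy connection pool
    max_pool_size = 5
    # extra connections allowed above max_pool_size; this is the
    # option whose default was bumped to 50
    max_overflow = 50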



> but the result was the same: cinder-volume got stuck once 40~50
> volumes had been deleted.
>
> There seemed to be a problem with the cinder rbd volume driver, so I
> tested deleting 200 volumes continuously using only RBDClient and
> RBDProxy. There was no problem that time.

I assume you tested it using eventlet, right?
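
Something like this is what I'd expect such a test to look like for it
to be meaningful, i.e. driving librbd through eventlet's tpool the way
the driver's RBDProxy does (a rough, untested sketch; the conffile
path, pool name and volume names are made up):

    import eventlet
    eventlet.monkey_patch()
    from eventlet import tpool
    import rados
    import rbd

    def delete_volume(ioctx, name):
        # tpool.Proxy runs the blocking librbd call in a native
        # thread, similar to what RBDProxy does inside the driver
        tpool.Proxy(rbd.RBD()).remove(ioctx, name)

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('volumes')
    pool = eventlet.GreenPool(20)
    for i in range(200):
        pool.spawn_n(delete_volume, ioctx, 'volume-%d' % i)
    pool.waitall()
    ioctx.close()
    cluster.shutdown()

If the test used plain Python threads instead of eventlet, it would not
reproduce the thread starvation that c-vol can run into.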

Cheers,
Gorka.


>
> I think there is some code in cinder-volume that causes the hang, but
> it's hard to find right now.
>
> Thanks.
>
> On Tue, 12 Feb 2019 at 18:24, Gorka Eguileor <geguileo at redhat.com> wrote:
>
> > On 12/02, Arne Wiebalck wrote:
> > > Jae,
> > >
> > > One other setting that caused trouble when bulk deleting cinder
> > > volumes was the DB connection string: we had not configured a driver
> > > and hence used the default MySQL-python wrapper instead … essentially
> > > changing
> > >
> > > connection = mysql://cinder:<pw>@<host>:<port>/cinder
> > >
> > > to
> > >
> > > connection = mysql+pymysql://cinder:<pw>@<host>:<port>/cinder
> > >
> > > solved the parallel deletion issue for us.
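> > >
> > > (For this to work, PyMySQL must be installed; with a plain mysql://
> > > URL, SQLAlchemy defaults to the C-based MySQLdb driver, which is
> > > what caused the trouble.)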
> > >
> > > All details in the last paragraph of [1].
> > >
> > > HTH!
> > >  Arne
> > >
> > > [1] https://techblog.web.cern.ch/techblog/post/experiences-with-cinder-in-production/
> > >
> >
> > Good point, using a C mysql connection library will induce thread
> > starvation.  This was thoroughly discussed, and the default changed,
> > like 2 years ago...  So I assumed we all changed that.
> >
> > Something else that could be problematic when receiving many concurrent
> > requests on any Cinder service is the number of concurrent DB
> > connections, although we also changed this a while back to 50.  This is
> > set as sql_max_retries or max_retries (depending on the version) in the
> > "[database]" section.
> >
> > Cheers,
> > Gorka.
> >
> >
> > >
> > >
> > > > On 12 Feb 2019, at 01:07, Jae Sang Lee <hyangii at gmail.com> wrote:
> > > >
> > > > Hello,
> > > >
> > > > I tested today by increasing EVENTLET_THREADPOOL_SIZE to 100. I
> > > > hoped for good results, but this time I got no response after
> > > > removing 41 volumes. This environment variable did not fix the
> > > > cinder-volume hang.
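> > > >
> > > > (For concreteness, the variable has to be set in the service's
> > > > environment before cinder-volume starts, e.g. something like
> > > >
> > > >     export EVENTLET_THREADPOOL_SIZE=100
> > > >     cinder-volume --config-file /etc/cinder/cinder.conf
> > > >
> > > > in the unit/init script.)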
> > > >
> > > > Restarting the stopped cinder-volume deletes all volumes that are
> > > > in the deleting state while running the clean_up function. Only one
> > > > volume remained in the deleting state; I forced its state to
> > > > available and then deleted it, after which all volumes were gone.
> > > >
> > > > This result was the same 3 consecutive times: after removing
> > > > dozens of volumes, cinder-volume went down, and after restarting
> > > > the service, 199 volumes were deleted and one volume had to be
> > > > erased manually.
> > > >
> > > > If you have a different approach to solving this problem, please
> > > > let me know.
> > > >
> > > > Thanks.
> > > >
> > > > On Mon, 11 Feb 2019 at 21:40, Arne Wiebalck <Arne.Wiebalck at cern.ch> wrote:
> > > > Jae,
> > > >
> > > >> On 11 Feb 2019, at 11:39, Jae Sang Lee <hyangii at gmail.com> wrote:
> > > >>
> > > >> Arne,
> > > >>
> > > >> I saw messages like "moving volume to trash" in the cinder-volume
> > > >> logs, and the periodic task also reports messages like
> > > >> "Deleted <vol-uuid> from trash for backend '<backends-name>'"
> > > >>
> > > >> The patch worked well when clearing a small number of volumes. The
> > > >> hang happens only when I am deleting a large number of volumes.
> > > >
> > > > Hmm, from cinder’s point of view, the deletion should be more or
> > > > less instantaneous, so it should be able to “delete” many more
> > > > volumes before getting stuck.
> > > >
> > > > The periodic task, however, will go through the volumes one by one,
> > > > so if you delete many at the same time, volumes may pile up in the
> > > > trash (for some time) before the task gets round to deleting them.
> > > > This should not affect c-vol, though.
> > > >
> > > >> I will try to adjust the thread pool size via the environment
> > > >> variable, following your advice.
> > > >>
> > > >> Do you know why the cinder-volume hang does not occur when
> > > >> creating a volume, but only when deleting one?
> > > >
> > > > Deleting a volume ties up a thread for the duration of the deletion
> > > > (which is synchronous and can hence take very long for large
> > > > volumes). If you have too many deletions going on at the same time,
> > > > you run out of threads and c-vol will eventually time out. FWIU,
> > > > creation basically works the same way, but it is almost
> > > > instantaneous, hence the risk of using up all threads is simply
> > > > lower (Gorka may correct me here :-).
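> > > >
> > > > As a toy illustration (hypothetical code, not from Cinder):
> > > > eventlet hands blocking calls to a fixed-size native thread pool,
> > > > so enough long synchronous deletions starve everything else that
> > > > needs that pool:
> > > >
> > > >     import eventlet
> > > >     eventlet.monkey_patch()
> > > >     from eventlet import tpool
> > > >     import time
> > > >
> > > >     def slow_delete(i):
> > > >         # stands in for a long synchronous librbd image removal
> > > >         tpool.execute(time.sleep, 60)
> > > >         print('deleted', i)
> > > >
> > > >     pool = eventlet.GreenPool()
> > > >     for i in range(100):
> > > >         pool.spawn_n(slow_delete, i)
> > > >     # with the default EVENTLET_THREADPOOL_SIZE of 20, only 20
> > > >     # deletions make progress at a time; anything else needing
> > > >     # the native pool queues behind them
> > > >     pool.waitall()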
> > > >
> > > > Cheers,
> > > >  Arne
> > > >
> > > >>
> > > >>
> > > >> Thanks.
> > > >>
> > > >>
> > > >> On Mon, 11 Feb 2019 at 18:14, Arne Wiebalck <Arne.Wiebalck at cern.ch> wrote:
> > > >> Jae,
> > > >>
> > > >> To make sure deferred deletion is properly working: when you
> > > >> delete individual large volumes with data in them, do you see that
> > > >> - the volume is fully “deleted” within a few seconds, i.e. not
> > > >>   staying in ‘deleting’ for a long time?
> > > >> - the volume shows up in the trash (with “rbd trash ls”)?
> > > >> - the periodic task reports it is deleting volumes from the trash?
> > > >>
> > > >> Another option to look at is “backend_native_threads_pool_size”:
> > > >> this will increase the number of threads available to work on
> > > >> deleting volumes. It is independent of deferred deletion, but can
> > > >> also help in situations where Cinder has more work to do than it
> > > >> can cope with at the moment.
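> > > >>
> > > >> In the backend section of cinder.conf, something like (a sketch;
> > > >> the backend name is made up):
> > > >>
> > > >>     [rbd-backend]
> > > >>     volume_driver = cinder.volume.drivers.rbd.RBDDriver
> > > >>     backend_native_threads_pool_size = 100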
> > > >>
> > > >> Cheers,
> > > >>  Arne
> > > >>
> > > >>
> > > >>
> > > >>> On 11 Feb 2019, at 09:47, Jae Sang Lee <hyangii at gmail.com> wrote:
> > > >>>
> > > >>> Yes, I added your code to the Pike release manually.
> > > >>>
> > > >>>
> > > >>>
> > > >>> On Mon, 11 Feb 2019 at 16:39, Arne Wiebalck <Arne.Wiebalck at cern.ch> wrote:
> > > >>> Hi Jae,
> > > >>>
> > > >>> You backported the deferred deletion patch to Pike?
> > > >>>
> > > >>> Cheers,
> > > >>>  Arne
> > > >>>
> > > >>> > On 11 Feb 2019, at 07:54, Jae Sang Lee <hyangii at gmail.com> wrote:
> > > >>> >
> > > >>> > Hello,
> > > >>> >
> > > >>> > I recently ran a volume deletion test with deferred deletion
> > > >>> > enabled on the Pike release.
> > > >>> >
> > > >>> > We experienced a cinder-volume hang when deleting a large number
> > > >>> > of volumes to which data had actually been written (I created a
> > > >>> > 15GB file in every volume), and we thought deferred deletion
> > > >>> > would solve it.
> > > >>> >
> > > >>> > However, while deleting 200 volumes, cinder-volume went down
> > > >>> > after 50 volumes, as before. In my opinion, the trash_move API
> > > >>> > does not seem to work properly when removing multiple volumes,
> > > >>> > just like the remove API.
> > > >>> >
> > > >>> > If these test results are due to a mistake on my part, please
> > > >>> > let me know the correct test method.
> > > >>> >
> > > >>>
> > > >>> --
> > > >>> Arne Wiebalck
> > > >>> CERN IT
> > > >>>
> > > >>
> > > >> --
> > > >> Arne Wiebalck
> > > >> CERN IT
> > > >>
> > > >
> > > > --
> > > > Arne Wiebalck
> > > > CERN IT
> > > >
> > >
> >


