[openstack-dev] [Nova] Handling soft delete for instance rows in a new cells database

Jay Pipes jaypipes at gmail.com
Wed Nov 26 20:38:56 UTC 2014


On 11/25/2014 09:34 PM, Mike Bayer wrote:
>> On Nov 25, 2014, at 8:15 PM, Ahmed RAHAL <arahal at iweb.com> wrote:
>>
>> Hi,
>>
>> Le 2014-11-24 17:20, Michael Still a écrit :
>>> Heya,
>>>
>>> This is a new database, so its our big chance to get this right. So,
>>> ideas welcome...
>>>
>>> Some initial proposals:
>>>
>>>   - we do what we do in the current nova database -- we have a deleted
>>> column, and we set it to true when we delete the instance.
>>>
>>>   - we have shadow tables and we move delete rows to a shadow table.
>>>
>>>   - something else super clever I haven't thought of.
>>
>> Some random thoughts that came to mind ...
>>
>> 1/ as far as I remember, you rarely want to delete a row
>> - it's usually a heavy DB operation (well, was back then)
>> - it's destructive (but we may want that)
>> - it creates fragmentation (less of a problem depending on db engine)
>> - it can break foreign key relations if not done the proper way
>
> deleting records with foreign key dependencies is a known quantity.  Those items are all related and being able to delete everything related is a well-solved problem, both via ON DELETE cascades as well as standard ORM features.

++

>> 2/ updating a row to 'deleted=1'
>> - gives an opportunity to set a useful deletion time-stamp
>> I would even argue that setting the deleted_at field would suffice to declare a row 'deleted' (as in 'not NULL'). I know, "explicit is better than implicit" …
>
> the logic that’s used is that “deleted” is set to the primary key of the record, this is to allow UNIQUE constraints to be set up that serve on the non-deleted rows only (e.g. UNIQUE on “x” + “deleted” is possible when there are multiple “deleted” rows with “x”).

Indeed. Because people want to be able to name an instance one thing, 
delete it, and immediately name another instance the same thing. Ugh -- 
what an annoying use case, IMO. Better to just delete the row out of the 
database after archival/audit log of the operation.

>> - the update operation is not destructive
>> - an admin/DBA can decide when and how to purge/archive rows
>>
>> 3/ moving the row at deletion
>> - you want to avoid additional steps to complete an operation, thus avoid creating a new record while deleting one
>> - even if you wrap things into a transaction, not being able to create a row somewhere can make your delete transaction fail
>> - if I were to archive all deleted rows, at scale I'd probably move them to another db server altogether
>
> if you’re really “archiving”, I’d just dump out a log of what occurred to a textual log file, then you archive the files.  There’s no need for a pure “audit trail” to even be in the relational DB.

Precisely. Why is the RDBMS the thing that is used for archival/audit 
logging? Why not a NoSQL store or a centralized log facility? All that 
would be needed would be for us to standardize on the format of the 
archival record, standardize on the things to provide with the archival 
record (for instance system metadata, etc), and then write a simple 
module that would write an archival record to some backend data store.

Then we could rid ourselves of the awfulness of the shadow tables and 
all of the read_deleted=yes crap.

Best,
-jay

>> Now, I for one would keep the current mark-as-deleted model.
>>
>> I however perfectly get the problem of massive churn with instance creation/deletion.
>
> is there?   inserting and updating rows is a normal thing in relational DBs.
>
>
>> So, let's be crazy, why not have a config option 'on_delete=mark_delete', 'on_delete=purge' or 'on_delete=archive' and let the admin choose ? (is that feasible ?)
>
> I’m -1 on that.  The need for records to be soft-deleted or not, and if those soft-deletes need to be accessible in the application, should be decided up front.  Adding a multiplicity of options just makes the code that much more complicated and fragments its behaviors and test coverage.   The suggestion basically tries to avoid making a decision and I think more thought should be put into what is truly needed.
>
>
>> This would especially come handy if the admin decides the global cells database may not need to keep track of deleted instances, the cell-local nova database being the absolute reference for that.
>
> why would an admin decide that this is, or is not, needed?   if the deleted data isn’t needed by the live app, it should just be dumped to an archive.  admins can set how often that archive should be purged, but IMHO the “pipeline” of these records should be straight; there shouldn’t be junctions and switches that cause there to be multiple data paths.   It leads to too much complexity.
>
>
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>



More information about the OpenStack-dev mailing list