[ops] database archiving tool
Hi all, At OVH we needed to write our own tool that archive data from OpenStack databases to prevent some side effect related to huge tables (slower response time, changing MariaDB query plan) and to answer to some legal aspects. So we started to write a python tool which is called OSArchiver that I briefly presented at Denver few days ago in the "Optimizing OpenStack at large scale" talk. We think that this tool could be helpful to other and are ready to open source it, first we would like to get the opinion of the ops community about that tool. To sum-up OSArchiver is written to work regardless of Openstack project. The tool relies on the fact that soft deleted data are recognizable because of their 'deleted' column which is set to 1 or uuid and 'deleted_at' column which is set to the date of deletion. The points to have in mind about OSArchiver: * There is no knowledge of business objects * One table might be archived if it contains 'deleted' column * Children rows are archived before parents rows * A row can not be deleted if it fails to be archived Here are features already implemented: * Archive data in an other database and/or file (actually SQL and CSV formats are supported) to be easily imported * Delete data from Openstack databases * Customizable (retention, exclude DBs, exclude tables, bulk insert/delete) * Multiple archiving configuration * Dry-run mode * Easily extensible, you can add your own destination module (other file format, remote storage etc...) * Archive and/or delete only mode It also means that by design you can run osarchiver not only on OpenStack databases but also on archived OpenStack databases. Thanks in advance for your feedbacks. -- Pierre-Samuel Le Stang
Hi Pierre-Samuel, at this point most of the OpenStack projects have their own way to archive/delete soft deleted records. But one thing usually missing is the retention period of soft deleted records and then the archived data. I'm interested to learn more about what you are doing. Is there any link to access the code? Belmiro CERN On Thu, May 9, 2019 at 5:25 PM Pierre-Samuel LE STANG < pierre-samuel.le-stang@corp.ovh.com> wrote:
Hi all,
At OVH we needed to write our own tool that archive data from OpenStack databases to prevent some side effect related to huge tables (slower response time, changing MariaDB query plan) and to answer to some legal aspects.
So we started to write a python tool which is called OSArchiver that I briefly presented at Denver few days ago in the "Optimizing OpenStack at large scale" talk. We think that this tool could be helpful to other and are ready to open source it, first we would like to get the opinion of the ops community about that tool.
To sum-up OSArchiver is written to work regardless of Openstack project. The tool relies on the fact that soft deleted data are recognizable because of their 'deleted' column which is set to 1 or uuid and 'deleted_at' column which is set to the date of deletion.
The points to have in mind about OSArchiver: * There is no knowledge of business objects * One table might be archived if it contains 'deleted' column * Children rows are archived before parents rows * A row can not be deleted if it fails to be archived
Here are features already implemented: * Archive data in an other database and/or file (actually SQL and CSV formats are supported) to be easily imported * Delete data from Openstack databases * Customizable (retention, exclude DBs, exclude tables, bulk insert/delete) * Multiple archiving configuration * Dry-run mode * Easily extensible, you can add your own destination module (other file format, remote storage etc...) * Archive and/or delete only mode
It also means that by design you can run osarchiver not only on OpenStack databases but also on archived OpenStack databases.
Thanks in advance for your feedbacks.
-- Pierre-Samuel Le Stang
On 5/9/2019 10:43 AM, Belmiro Moreira wrote:
But one thing usually missing is the retention period of soft deleted records and then the archived data.
Something like this? https://review.opendev.org/#/c/556751/ -- Thanks, Matt
Hello Belmiro, I will put the code on OVH's github repository as soon as possible. I'll keep you informed. -- Pierre-Samuel Le Stang Belmiro Moreira <moreira.belmiro.email.lists@gmail.com> wrote on jeu. [2019-mai-09 17:43:49 +0200]:
Hi Pierre-Samuel, at this point most of the OpenStack projects have their own way to archive/ delete soft deleted records. But one thing usually missing is the retention period of soft deleted records and then the archived data.
I'm interested to learn more about what you are doing. Is there any link to access the code?
Belmiro CERN
On Thu, May 9, 2019 at 5:25 PM Pierre-Samuel LE STANG < pierre-samuel.le-stang@corp.ovh.com> wrote:
Hi all,
At OVH we needed to write our own tool that archive data from OpenStack databases to prevent some side effect related to huge tables (slower response time, changing MariaDB query plan) and to answer to some legal aspects.
So we started to write a python tool which is called OSArchiver that I briefly presented at Denver few days ago in the "Optimizing OpenStack at large scale" talk. We think that this tool could be helpful to other and are ready to open source it, first we would like to get the opinion of the ops community about that tool.
To sum-up OSArchiver is written to work regardless of Openstack project. The tool relies on the fact that soft deleted data are recognizable because of their 'deleted' column which is set to 1 or uuid and 'deleted_at' column which is set to the date of deletion.
The points to have in mind about OSArchiver: * There is no knowledge of business objects * One table might be archived if it contains 'deleted' column * Children rows are archived before parents rows * A row can not be deleted if it fails to be archived
Here are features already implemented: * Archive data in an other database and/or file (actually SQL and CSV formats are supported) to be easily imported * Delete data from Openstack databases * Customizable (retention, exclude DBs, exclude tables, bulk insert/delete) * Multiple archiving configuration * Dry-run mode * Easily extensible, you can add your own destination module (other file format, remote storage etc...) * Archive and/or delete only mode
It also means that by design you can run osarchiver not only on OpenStack databases but also on archived OpenStack databases.
Thanks in advance for your feedbacks.
-- Pierre-Samuel Le Stang
-- Pierre-Samuel Le Stang
On 5/9/19 11:14 AM, Pierre-Samuel LE STANG wrote:
Hi all,
At OVH we needed to write our own tool that archive data from OpenStack databases to prevent some side effect related to huge tables (slower response time, changing MariaDB query plan) and to answer to some legal aspects. [snip]
Please make sure your tool takes into account OSSN-0075 [0]. Please read [1] to understand how the glance-manage tool currently deals with this issue. [0] https://wiki.openstack.org/wiki/OSSN/OSSN-0075 [1] https://docs.openstack.org/glance/latest/admin/db.html#database-maintenance [snip]
Thanks in advance for your feedbacks.
Happy to help! brian
Hi Brian Thanks for pointing that out! The tool is able to exclude a table from a database so it's possible to exclude by default glance.images in the config file with an explicit message pointing to the OSSN-0075. A more robust solution may be to hard code the glance.images exclusion and add a boolean flag set to false by default that you may enable to delete the table. Regards, -- PS Brian Rosmaita <rosmaita.fossdev@gmail.com> wrote on mar. [2019-mai-21 08:42:11 -0400]:
On 5/9/19 11:14 AM, Pierre-Samuel LE STANG wrote:
Hi all,
At OVH we needed to write our own tool that archive data from OpenStack databases to prevent some side effect related to huge tables (slower response time, changing MariaDB query plan) and to answer to some legal aspects. [snip]
Please make sure your tool takes into account OSSN-0075 [0]. Please read [1] to understand how the glance-manage tool currently deals with this issue.
[0] https://wiki.openstack.org/wiki/OSSN/OSSN-0075 [1] https://docs.openstack.org/glance/latest/admin/db.html#database-maintenance
[snip]
Thanks in advance for your feedbacks.
Happy to help! brian
On 5/9/19 5:14 PM, Pierre-Samuel LE STANG wrote:
Hi all,
At OVH we needed to write our own tool that archive data from OpenStack databases to prevent some side effect related to huge tables (slower response time, changing MariaDB query plan) and to answer to some legal aspects.
So we started to write a python tool which is called OSArchiver that I briefly presented at Denver few days ago in the "Optimizing OpenStack at large scale" talk. We think that this tool could be helpful to other and are ready to open source it, first we would like to get the opinion of the ops community about that tool.
To sum-up OSArchiver is written to work regardless of Openstack project. The tool relies on the fact that soft deleted data are recognizable because of their 'deleted' column which is set to 1 or uuid and 'deleted_at' column which is set to the date of deletion.
The points to have in mind about OSArchiver: * There is no knowledge of business objects * One table might be archived if it contains 'deleted' column * Children rows are archived before parents rows * A row can not be deleted if it fails to be archived
Here are features already implemented: * Archive data in an other database and/or file (actually SQL and CSV formats are supported) to be easily imported * Delete data from Openstack databases * Customizable (retention, exclude DBs, exclude tables, bulk insert/delete) * Multiple archiving configuration * Dry-run mode * Easily extensible, you can add your own destination module (other file format, remote storage etc...) * Archive and/or delete only mode
It also means that by design you can run osarchiver not only on OpenStack databases but also on archived OpenStack databases.
Thanks in advance for your feedbacks.
Hi Pierre, That's really the kind of project that I would prefer not to have to exist. By this I mean, it'd be a lot nicer if this could be taken care of project by project, with something like what Nova does (ie: nova-manage db archive_deleted_rows). In such configuration, that's typically something that could be added as a cron job, automatically configured by packages. Now, a question for other OPS reading this thread: how long should be the retention? In Debian, we use to have the unsaid policy that we don't want too much retention, to project privacy. Though in operation, we may need at least a few weeks of history, so we can do support. If I was to configure a cron job for nova, for example, what parameter should I set to --before (when #556751 is merged)? My instinct would be: nova-manage db archive_deleted_rows \ --before $(date -d "-1 month" +%Y-%m-%d) Your thoughts everyone? Cheers, Thomas Goirand (zigo)
Thomas Goirand <zigo@debian.org> wrote on mer. [2019-mai-22 09:09:25 +0200]:
On 5/9/19 5:14 PM, Pierre-Samuel LE STANG wrote:
Hi all,
At OVH we needed to write our own tool that archive data from OpenStack databases to prevent some side effect related to huge tables (slower response time, changing MariaDB query plan) and to answer to some legal aspects.
So we started to write a python tool which is called OSArchiver that I briefly presented at Denver few days ago in the "Optimizing OpenStack at large scale" talk. We think that this tool could be helpful to other and are ready to open source it, first we would like to get the opinion of the ops community about that tool.
To sum-up OSArchiver is written to work regardless of Openstack project. The tool relies on the fact that soft deleted data are recognizable because of their 'deleted' column which is set to 1 or uuid and 'deleted_at' column which is set to the date of deletion.
The points to have in mind about OSArchiver: * There is no knowledge of business objects * One table might be archived if it contains 'deleted' column * Children rows are archived before parents rows * A row can not be deleted if it fails to be archived
Here are features already implemented: * Archive data in an other database and/or file (actually SQL and CSV formats are supported) to be easily imported * Delete data from Openstack databases * Customizable (retention, exclude DBs, exclude tables, bulk insert/delete) * Multiple archiving configuration * Dry-run mode * Easily extensible, you can add your own destination module (other file format, remote storage etc...) * Archive and/or delete only mode
It also means that by design you can run osarchiver not only on OpenStack databases but also on archived OpenStack databases.
Thanks in advance for your feedbacks.
Hi Pierre,
That's really the kind of project that I would prefer not to have to exist. By this I mean, it'd be a lot nicer if this could be taken care of project by project, with something like what Nova does (ie: nova-manage db archive_deleted_rows).
In such configuration, that's typically something that could be added as a cron job, automatically configured by packages.
Now, a question for other OPS reading this thread: how long should be the retention? In Debian, we use to have the unsaid policy that we don't want too much retention, to project privacy. Though in operation, we may need at least a few weeks of history, so we can do support. If I was to configure a cron job for nova, for example, what parameter should I set to --before (when #556751 is merged)? My instinct would be:
nova-manage db archive_deleted_rows \ --before $(date -d "-1 month" +%Y-%m-%d)
Your thoughts everyone?
Hi Thomas, Thanks for your feedback, I really appreciate it. The tool is designed to be customized and to fit your needs. It means that you can have one configuration per project or one configuration for all projects. So you might imagine having a configuration for glance which exclude images table and one configuration for nova with a higher or lower retention. -- PS
On 5/22/19 10:34 AM, Pierre-Samuel LE STANG wrote:
Thomas Goirand <zigo@debian.org> wrote on mer. [2019-mai-22 09:09:25 +0200]:
Hi Pierre,
That's really the kind of project that I would prefer not to have to exist. By this I mean, it'd be a lot nicer if this could be taken care of project by project, with something like what Nova does (ie: nova-manage db archive_deleted_rows).
In such configuration, that's typically something that could be added as a cron job, automatically configured by packages.
Now, a question for other OPS reading this thread: how long should be the retention? In Debian, we use to have the unsaid policy that we don't want too much retention, to project privacy. Though in operation, we may need at least a few weeks of history, so we can do support. If I was to configure a cron job for nova, for example, what parameter should I set to --before (when #556751 is merged)? My instinct would be:
nova-manage db archive_deleted_rows \ --before $(date -d "-1 month" +%Y-%m-%d)
Your thoughts everyone?
Hi Thomas,
Thanks for your feedback, I really appreciate it. The tool is designed to be customized and to fit your needs. It means that you can have one configuration per project or one configuration for all projects.
So you might imagine having a configuration for glance which exclude images table and one configuration for nova with a higher or lower retention.
-- PS
This looks super nice then! Will you provide a standard configuration for every OpenStack project? It'd be nice if your package had a conf.d folder where one could drop the config for every project. That way, every OpenStack project package could drop a configuration there. Cheers, Thomas Goirand (zigo)
Thomas Goirand <zigo@debian.org> wrote on mer. [2019-mai-22 13:24:19 +0200]:
On 5/22/19 10:34 AM, Pierre-Samuel LE STANG wrote:
Thomas Goirand <zigo@debian.org> wrote on mer. [2019-mai-22 09:09:25 +0200]:
Hi Pierre,
That's really the kind of project that I would prefer not to have to exist. By this I mean, it'd be a lot nicer if this could be taken care of project by project, with something like what Nova does (ie: nova-manage db archive_deleted_rows).
In such configuration, that's typically something that could be added as a cron job, automatically configured by packages.
Now, a question for other OPS reading this thread: how long should be the retention? In Debian, we use to have the unsaid policy that we don't want too much retention, to project privacy. Though in operation, we may need at least a few weeks of history, so we can do support. If I was to configure a cron job for nova, for example, what parameter should I set to --before (when #556751 is merged)? My instinct would be:
nova-manage db archive_deleted_rows \ --before $(date -d "-1 month" +%Y-%m-%d)
Your thoughts everyone?
Hi Thomas,
Thanks for your feedback, I really appreciate it. The tool is designed to be customized and to fit your needs. It means that you can have one configuration per project or one configuration for all projects.
So you might imagine having a configuration for glance which exclude images table and one configuration for nova with a higher or lower retention.
-- PS
This looks super nice then!
Will you provide a standard configuration for every OpenStack project? It'd be nice if your package had a conf.d folder where one could drop the config for every project. That way, every OpenStack project package could drop a configuration there.
Cheers,
Thomas Goirand (zigo)
I will provide a default configuration file that might be used as a template or reference. I did not consider adding the conf.d folder mechanism as all the different configurations can be written in the same config file but that sounds good to make the things clearer an easier. I'm still focus on making the code opensource as soon as it's done we will be able to make the tool evolve. -- PS
Hi all, We finally opensourced the tool on our github repository. You may get it here: https://github.com/ovh/osarchiver/ Thanks for your feedbacks. -- PS Pierre-Samuel LE STANG <pierre-samuel.le-stang@corp.ovh.com> wrote on jeu. [2019-mai-09 17:14:35 +0200]:
Hi all,
At OVH we needed to write our own tool that archive data from OpenStack databases to prevent some side effect related to huge tables (slower response time, changing MariaDB query plan) and to answer to some legal aspects.
So we started to write a python tool which is called OSArchiver that I briefly presented at Denver few days ago in the "Optimizing OpenStack at large scale" talk. We think that this tool could be helpful to other and are ready to open source it, first we would like to get the opinion of the ops community about that tool.
To sum-up OSArchiver is written to work regardless of Openstack project. The tool relies on the fact that soft deleted data are recognizable because of their 'deleted' column which is set to 1 or uuid and 'deleted_at' column which is set to the date of deletion.
The points to have in mind about OSArchiver: * There is no knowledge of business objects * One table might be archived if it contains 'deleted' column * Children rows are archived before parents rows * A row can not be deleted if it fails to be archived
Here are features already implemented: * Archive data in an other database and/or file (actually SQL and CSV formats are supported) to be easily imported * Delete data from Openstack databases * Customizable (retention, exclude DBs, exclude tables, bulk insert/delete) * Multiple archiving configuration * Dry-run mode * Easily extensible, you can add your own destination module (other file format, remote storage etc...) * Archive and/or delete only mode
It also means that by design you can run osarchiver not only on OpenStack databases but also on archived OpenStack databases.
Thanks in advance for your feedbacks.
-- Pierre-Samuel Le Stang
-- Pierre-Samuel Le Stang
participants (5)
-
Belmiro Moreira
-
Brian Rosmaita
-
Matt Riedemann
-
Pierre-Samuel LE STANG
-
Thomas Goirand