<br><br><div class="gmail_quote">On Tue, Mar 26, 2013 at 2:41 PM, Sandy Walsh <span dir="ltr"><<a href="mailto:sandy.walsh@rackspace.com" target="_blank">sandy.walsh@rackspace.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

In a previous thread we were discussing the possibility of using a<br>

"metadata" schema model to keep the complexity of the database down.<br>

<br>

<a href="http://lists.openstack.org/pipermail/openstack-dev/2013-January/005132.html" target="_blank">http://lists.openstack.org/pipermail/openstack-dev/2013-January/005132.html</a><br>

<br>

This is a very attractive approach because it's so simple, but would it<br>

be able to handle the volume of data we're contemplating?<br>

<br>

(Disclaimer: I'm not much of db guy so I could have missed something<br>

obvious here)<br>

<br>

So ... I created a simple set of tables like this:<br>

<br>

mysql> show tables;<br>

+-------------------+<br>

| Tables_in_cm_meta |<br>

+-------------------+<br>

| rawdata           |<br>

| traits            |<br>

+-------------------+<br>

2 rows in set (0.00 sec)<br>

<br>

<br>

mysql> show columns from traits;<br>

+------------+-------------+------+-----+---------+----------------+<br>

| Field      | Type        | Null | Key | Default | Extra          |<br>

+------------+-------------+------+-----+---------+----------------+<br>

| id         | int(11)     | NO   | PRI | NULL    | auto_increment |<br>

| key        | varchar(32) | YES  | MUL | NULL    |                |<br>

| value      | varchar(32) | YES  | MUL | NULL    |                |<br>

| rawdata_id | int(11)     | YES  | MUL | NULL    |                |<br>

+------------+-------------+------+-----+---------+----------------+<br>

4 rows in set (0.00 sec)<br>

<br>

<br>

mysql> show columns from rawdata;<br>

+-------+---------+------+-----+---------+----------------+<br>

| Field | Type    | Null | Key | Default | Extra          |<br>

+-------+---------+------+-----+---------+----------------+<br>

| id    | int(11) | NO   | PRI | NULL    | auto_increment |<br>

| raw   | text    | YES  |     | NULL    |                |<br>

+-------+---------+------+-----+---------+----------------+<br>

2 rows in set (0.00 sec)<br>

<br>

<br>

All the traits columns are indexed. Foreign Key to RawData.id primary key.<br>

<br>

Note: Yes, the traits table can be normalized further with the labels in<br>

another table and only the id's used for the joins, but this is a close<br>

enough test to the stacktach baseline. (Although if someone thinks this<br>

will result in a *radically* better performance time, please let me<br>

know. Include the time to check for duplicates for each unique trait key.)<br>

<br>

I populated the rawdata table with 10000 rows ... each row had 15 traits<br>

associated with it.<br>

<br>

So about 150k rows in the traits table for the 10k rawdata rows.<br>

<br>

Next I created a basic query to return all the UUID's for events within<br>

a small time range.<br>

<br>

mysql> SELECT DISTINCT traits.value AS traits_value<br>

       FROM traits<br>

       WHERE traits.`key` = 'instance' AND<br>

             traits.rawdata_id IN (<br>

                 SELECT DISTINCT<br>

                     traits.rawdata_id AS traits_rawdata_id<br>

                 FROM traits<br>

                 WHERE traits.`key` = 'when' AND<br>

                       traits.value >= '2013-03-13 13:51:22.761889' AND<br>

                       traits.value <= '2013-03-13 13:53:15.508622');<br>

<br>

+----------------------------------+<br>

| traits_value                     |<br>

+----------------------------------+<br>

| ins-f501e7ca-df38-4fb0-97a4-470e |<br>

               ...<br>

| ins-ca4b7acb-04b0-4d32-9a8f-9e0e |<br>

+----------------------------------+<br>

50 rows in set (4 min 27.78 sec)<br>

<br>

OVER FOUR MINUTES. Ugh.<br>

(mysql 5.1.67 on Ubuntu 10.04, 4G ram, single core)<br>

<br>

This is a pretty darn simple query.<br>

<br>

In StackTach we're already dealing with nearly 17M rows for about 5<br>

months of data (full event data in a fully denormalized table) and doing<br>

things a lot more complicated.<br>

<br>

Comparable queries on the same machine against the stacktach rawdata<br>

table are < 1s.<br>

<br>

show columns from stacktach_rawdata;<br>

+---------------+--------------+------+-----+---------+----------------+<br>

| Field         | Type         | Null | Key | Default | Extra          |<br>

+---------------+--------------+------+-----+---------+----------------+<br>

| id            | int(11)      | NO   | PRI | NULL    | auto_increment |<br>

| deployment_id | int(11)      | NO   | MUL | NULL    |                |<br>

| tenant        | varchar(50)  | YES  | MUL | NULL    |                |<br>

| json          | longtext     | NO   |     | NULL    |                |<br>

| routing_key   | varchar(50)  | YES  | MUL | NULL    |                |<br>

| state         | varchar(20)  | YES  | MUL | NULL    |                |<br>

| old_state     | varchar(20)  | YES  | MUL | NULL    |                |<br>

| old_task      | varchar(30)  | YES  | MUL | NULL    |                |<br>

| when          | datetime     | NO   | MUL | NULL    |                |<br>

| microseconds  | int(11)      | NO   | MUL | NULL    |                |<br>

| publisher     | varchar(100) | YES  | MUL | NULL    |                |<br>

| event         | varchar(50)  | YES  | MUL | NULL    |                |<br>

| service       | varchar(50)  | YES  | MUL | NULL    |                |<br>

| host          | varchar(100) | YES  | MUL | NULL    |                |<br>

| instance      | varchar(50)  | YES  | MUL | NULL    |                |<br>

| request_id    | varchar(50)  | YES  | MUL | NULL    |                |<br>

+---------------+--------------+------+-----+---------+----------------+<br>

16 rows in set (0.00 sec)<br>

<br>

I think we need to pass on this normalized metadata approach unless<br>

someone can spot something blatant I'm doing wrong here?<br>

<br>

We may have to build a system where we can have user-defined data types<br>

via plugins and each plugin will manage the tables it needs (with<br>

migration). More complicated, but certainly optimal.<br></blockquote><div><br></div><div>Is it necessary to make all of the metadata available for queries? It seems like, in practice, the useful values will be known somewhere early in the setup process. We could provide a default table definition, and a way to add let the user indicate additional values they want stored using a configuration file. We could then either update the table definition, or leave that up to the deployer to handle.</div>

<div><br></div><div>Doug</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

The source for the test is here:<br>

<a href="https://github.com/SandyWalsh/twobillion/blob/master/sql_schema.py" target="_blank">https://github.com/SandyWalsh/twobillion/blob/master/sql_schema.py</a><br>

and<br>

<a href="https://github.com/SandyWalsh/twobillion/blob/master/schema.py" target="_blank">https://github.com/SandyWalsh/twobillion/blob/master/schema.py</a><br>

<br>

(uncomment the create_data() call on the first run to get a dataset<br>

<a href="https://github.com/SandyWalsh/twobillion/blob/master/sql_schema.py#L169" target="_blank">https://github.com/SandyWalsh/twobillion/blob/master/sql_schema.py#L169</a> )<br>

<br>

Look forward to your feedback db-heads :)<br>

<br>

-S<br>

<br>

<br>

_______________________________________________<br>

OpenStack-dev mailing list<br>

<a href="mailto:OpenStack-dev@lists.openstack.org">OpenStack-dev@lists.openstack.org</a><br>

<a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev</a><br>

</blockquote></div><br>