[openstack-dev] [Ceilometer] Experiments with Metadata table for Meters ... initial results.
Sandy Walsh
sandy.walsh at rackspace.com
Tue Mar 26 18:41:24 UTC 2013
In a previous thread we were discussing the possibility of using a
"metadata" schema model to keep the complexity of the database down.
http://lists.openstack.org/pipermail/openstack-dev/2013-January/005132.html
This is a very attractive approach because it's so simple, but would it
be able to handle the volume of data we're contemplating?
(Disclaimer: I'm not much of a db guy, so I could have missed something
obvious here.)
So ... I created a simple set of tables like this:
mysql> show tables;
+-------------------+
| Tables_in_cm_meta |
+-------------------+
| rawdata |
| traits |
+-------------------+
2 rows in set (0.00 sec)
mysql> show columns from traits;
+------------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+-------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| key | varchar(32) | YES | MUL | NULL | |
| value | varchar(32) | YES | MUL | NULL | |
| rawdata_id | int(11) | YES | MUL | NULL | |
+------------+-------------+------+-----+---------+----------------+
4 rows in set (0.00 sec)
mysql> show columns from rawdata;
+-------+---------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------+---------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| raw | text | YES | | NULL | |
+-------+---------+------+-----+---------+----------------+
2 rows in set (0.00 sec)
All of the traits columns are indexed, and rawdata_id is a foreign key to
the rawdata.id primary key.
Note: Yes, the traits table could be normalized further, with the labels
in a separate table and only the ids used in the joins, but this is a
close enough approximation of the stacktach baseline for this test.
(Although if someone thinks that would give *radically* better
performance, please let me know. Remember to include the time needed to
check for duplicates of each unique trait key.)
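One variation I haven't benchmarked: composite indexes that cover the
whole lookup, rather than separate single-column indexes. Each half of
the query further down would want a slightly different column order,
something like this (index names made up):

CREATE INDEX ix_traits_key_value_raw ON traits (`key`, value, rawdata_id);
CREATE INDEX ix_traits_key_raw_value ON traits (`key`, rawdata_id, value);

The first could serve the ('when', time-range) scan straight from the
index; the second the ('instance', rawdata_id) lookup.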
I populated the rawdata table with 10000 rows ... each row had 15 traits
associated with it.
So about 150k rows in the traits table for the 10k rawdata rows.
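To make the shape of the data concrete, each rawdata row and its traits
look roughly like this (values illustrative, trait set abridged):

INSERT INTO rawdata (raw) VALUES ('{ ... full notification json ... }');

INSERT INTO traits (`key`, value, rawdata_id) VALUES
  ('instance', 'ins-f501e7ca-df38-4fb0-97a4-470e', 1),
  ('when', '2013-03-13 13:51:22.761889', 1),
  ('host', 'compute-1', 1);
-- ... and so on, 15 traits per rawdata row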
Next I created a basic query to return all the UUIDs for events within
a small time range.
mysql> SELECT DISTINCT traits.value AS traits_value
FROM traits
WHERE traits.`key` = 'instance' AND
traits.rawdata_id IN (
SELECT DISTINCT
traits.rawdata_id AS traits_rawdata_id
FROM traits
WHERE traits.`key` = 'when' AND
traits.value >= '2013-03-13 13:51:22.761889' AND
traits.value <= '2013-03-13 13:53:15.508622');
+----------------------------------+
| traits_value |
+----------------------------------+
| ins-f501e7ca-df38-4fb0-97a4-470e |
...
| ins-ca4b7acb-04b0-4d32-9a8f-9e0e |
+----------------------------------+
50 rows in set (4 min 27.78 sec)
OVER FOUR MINUTES. Ugh.
(mysql 5.1.67 on Ubuntu 10.04, 4G ram, single core)
This is a pretty darn simple query.
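One caveat before writing the approach off entirely: mysql 5.1 is
notorious for executing IN (subquery) as a dependent subquery that gets
re-evaluated once per outer row, which could account for a lot of those
four minutes. A self-join should return the same rows and might behave
very differently (a sketch, untested):

SELECT DISTINCT t_inst.value
FROM traits t_inst
JOIN traits t_when ON t_when.rawdata_id = t_inst.rawdata_id
WHERE t_inst.`key` = 'instance'
  AND t_when.`key` = 'when'
  AND t_when.value >= '2013-03-13 13:51:22.761889'
  AND t_when.value <= '2013-03-13 13:53:15.508622';

If that's still slow, the approach really is in trouble.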
In StackTach we're already dealing with nearly 17M rows for about 5
months of data (full event data in a fully denormalized table), and doing
far more complicated things.
Comparable queries on the same machine against the stacktach rawdata
table run in < 1s (a sketch of one follows the schema below).
show columns from stacktach_rawdata;
+---------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| deployment_id | int(11) | NO | MUL | NULL | |
| tenant | varchar(50) | YES | MUL | NULL | |
| json | longtext | NO | | NULL | |
| routing_key | varchar(50) | YES | MUL | NULL | |
| state | varchar(20) | YES | MUL | NULL | |
| old_state | varchar(20) | YES | MUL | NULL | |
| old_task | varchar(30) | YES | MUL | NULL | |
| when | datetime | NO | MUL | NULL | |
| microseconds | int(11) | NO | MUL | NULL | |
| publisher | varchar(100) | YES | MUL | NULL | |
| event | varchar(50) | YES | MUL | NULL | |
| service | varchar(50) | YES | MUL | NULL | |
| host | varchar(100) | YES | MUL | NULL | |
| instance | varchar(50) | YES | MUL | NULL | |
| request_id | varchar(50) | YES | MUL | NULL | |
+---------------+--------------+------+-----+---------+----------------+
16 rows in set (0.00 sec)
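For reference, the comparable lookup against the denormalized table is
just a range scan on the indexed `when` column, roughly along these
lines (not the exact query I ran):

SELECT DISTINCT instance
FROM stacktach_rawdata
WHERE `when` >= '2013-03-13 13:51:22'
  AND `when` <= '2013-03-13 13:53:15';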
I think we need to pass on this normalized metadata approach, unless
someone can spot something blatantly wrong with what I'm doing here.
We may have to build a system with user-defined data types via plugins,
where each plugin manages the tables it needs (including migrations).
More complicated, but certainly closer to optimal.
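To give a flavour of what I mean: a compute plugin could own a properly
typed table much like the stacktach one above, rather than squeezing
everything through a generic key/value schema. Purely illustrative:

CREATE TABLE compute_events (
  id INT AUTO_INCREMENT PRIMARY KEY,
  `when` DATETIME NOT NULL,
  event VARCHAR(50),
  instance VARCHAR(50),
  request_id VARCHAR(50),
  raw TEXT,
  KEY ix_when (`when`),
  KEY ix_instance (instance)
) ENGINE=InnoDB;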
The source for the test is here:
https://github.com/SandyWalsh/twobillion/blob/master/sql_schema.py
and
https://github.com/SandyWalsh/twobillion/blob/master/schema.py
(uncomment the create_data() call on the first run to get a dataset
https://github.com/SandyWalsh/twobillion/blob/master/sql_schema.py#L169 )
Look forward to your feedback, db-heads :)
-S