[openstack-dev] [Ceilometer] Experiments with Metadata table for Meters ... initial results.
Sandy Walsh
sandy.walsh at rackspace.com
Tue Mar 26 18:41:24 UTC 2013
In a previous thread we were discussing the possibility of using a
"metadata" schema model to keep the complexity of the database down.
http://lists.openstack.org/pipermail/openstack-dev/2013-January/005132.html
This is a very attractive approach because it's so simple, but would it
be able to handle the volume of data we're contemplating?
(Disclaimer: I'm not much of a db guy, so I could have missed something
obvious here.)
So ... I created a simple set of tables like this:
mysql> show tables;
+-------------------+
| Tables_in_cm_meta |
+-------------------+
| rawdata |
| traits |
+-------------------+
2 rows in set (0.00 sec)
mysql> show columns from traits;
+------------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+-------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| key | varchar(32) | YES | MUL | NULL | |
| value | varchar(32) | YES | MUL | NULL | |
| rawdata_id | int(11) | YES | MUL | NULL | |
+------------+-------------+------+-----+---------+----------------+
4 rows in set (0.00 sec)
mysql> show columns from rawdata;
+-------+---------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------+---------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| raw | text | YES | | NULL | |
+-------+---------+------+-----+---------+----------------+
2 rows in set (0.00 sec)
All of the traits columns are indexed, and rawdata_id is a foreign key to
the rawdata.id primary key.
Note: Yes, the traits table could be normalized further, with the labels
in a separate table and only the ids used in the joins, but this is a
close enough approximation of the stacktach baseline for this test.
(Although if someone thinks that would give *radically* better
performance, please let me know. Remember to include the time needed to
check for duplicates of each unique trait key.)
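One variation I haven't benchmarked: composite indexes that cover the
whole lookup, rather than separate single-column indexes. Each half of
the query further down would want a slightly different column order,
something like this (index names made up):

CREATE INDEX ix_traits_key_value_raw ON traits (`key`, value, rawdata_id);
CREATE INDEX ix_traits_key_raw_value ON traits (`key`, rawdata_id, value);

The first could serve the ('when', time-range) scan straight from the
index; the second the ('instance', rawdata_id) lookup.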
I populated the rawdata table with 10000 rows ... each row had 15 traits
associated with it.
So about 150k rows in the traits table for the 10k rawdata rows.
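To make the shape of the data concrete, each rawdata row and its traits
look roughly like this (values illustrative, trait set abridged):

INSERT INTO rawdata (raw) VALUES ('{ ... full notification json ... }');

INSERT INTO traits (`key`, value, rawdata_id) VALUES
  ('instance', 'ins-f501e7ca-df38-4fb0-97a4-470e', 1),
  ('when', '2013-03-13 13:51:22.761889', 1),
  ('host', 'compute-1', 1);
-- ... and so on, 15 traits per rawdata row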
Next I created a basic query to return all the UUIDs for events within
a small time range.
mysql> SELECT DISTINCT traits.value AS traits_value
FROM traits
WHERE traits.`key` = 'instance' AND
traits.rawdata_id IN (
SELECT DISTINCT
traits.rawdata_id AS traits_rawdata_id
FROM traits
WHERE traits.`key` = 'when' AND
traits.value >= '2013-03-13 13:51:22.761889' AND
traits.value <= '2013-03-13 13:53:15.508622');
+----------------------------------+
| traits_value |
+----------------------------------+
| ins-f501e7ca-df38-4fb0-97a4-470e |
...
| ins-ca4b7acb-04b0-4d32-9a8f-9e0e |
+----------------------------------+
50 rows in set (4 min 27.78 sec)
OVER FOUR MINUTES. Ugh.
(mysql 5.1.67 on Ubuntu 10.04, 4G ram, single core)
This is a pretty darn simple query.
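One caveat before writing the approach off entirely: mysql 5.1 is
notorious for executing IN (subquery) as a dependent subquery that gets
re-evaluated once per outer row, which could account for a lot of those
four minutes. A self-join should return the same rows and might behave
very differently (a sketch, untested):

SELECT DISTINCT t_inst.value
FROM traits t_inst
JOIN traits t_when ON t_when.rawdata_id = t_inst.rawdata_id
WHERE t_inst.`key` = 'instance'
  AND t_when.`key` = 'when'
  AND t_when.value >= '2013-03-13 13:51:22.761889'
  AND t_when.value <= '2013-03-13 13:53:15.508622';

If that's still slow, the approach really is in trouble.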
In StackTach we're already dealing with nearly 17M rows for about 5
months of data (full event data in a fully denormalized table), and doing
far more complicated things.
Comparable queries on the same machine against the stacktach rawdata
table run in < 1s (a sketch of one follows the schema below).
show columns from stacktach_rawdata;
+---------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| deployment_id | int(11) | NO | MUL | NULL | |
| tenant | varchar(50) | YES | MUL | NULL | |
| json | longtext | NO | | NULL | |
| routing_key | varchar(50) | YES | MUL | NULL | |
| state | varchar(20) | YES | MUL | NULL | |
| old_state | varchar(20) | YES | MUL | NULL | |
| old_task | varchar(30) | YES | MUL | NULL | |
| when | datetime | NO | MUL | NULL | |
| microseconds | int(11) | NO | MUL | NULL | |
| publisher | varchar(100) | YES | MUL | NULL | |
| event | varchar(50) | YES | MUL | NULL | |
| service | varchar(50) | YES | MUL | NULL | |
| host | varchar(100) | YES | MUL | NULL | |
| instance | varchar(50) | YES | MUL | NULL | |
| request_id | varchar(50) | YES | MUL | NULL | |
+---------------+--------------+------+-----+---------+----------------+
16 rows in set (0.00 sec)
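For reference, the comparable lookup against the denormalized table is
just a range scan on the indexed `when` column, roughly along these
lines (not the exact query I ran):

SELECT DISTINCT instance
FROM stacktach_rawdata
WHERE `when` >= '2013-03-13 13:51:22'
  AND `when` <= '2013-03-13 13:53:15';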
I think we need to pass on this normalized metadata approach, unless
someone can spot something blatantly wrong with what I'm doing here.
We may have to build a system with user-defined data types via plugins,
where each plugin manages the tables it needs (including migrations).
More complicated, but certainly closer to optimal.
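To give a flavour of what I mean: a compute plugin could own a properly
typed table much like the stacktach one above, rather than squeezing
everything through a generic key/value schema. Purely illustrative:

CREATE TABLE compute_events (
  id INT AUTO_INCREMENT PRIMARY KEY,
  `when` DATETIME NOT NULL,
  event VARCHAR(50),
  instance VARCHAR(50),
  request_id VARCHAR(50),
  raw TEXT,
  KEY ix_when (`when`),
  KEY ix_instance (instance)
) ENGINE=InnoDB;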
The source for the test is here:
https://github.com/SandyWalsh/twobillion/blob/master/sql_schema.py
and
https://github.com/SandyWalsh/twobillion/blob/master/schema.py
(uncomment the create_data() call on the first run to get a dataset
https://github.com/SandyWalsh/twobillion/blob/master/sql_schema.py#L169 )
Look forward to your feedback, db-heads :)
-S