[openstack-dev] [Ceilometer] Experiments with Metadata table for Meters ... initial results.

Doug Hellmann doug.hellmann at dreamhost.com
Tue Mar 26 20:05:34 UTC 2013


On Tue, Mar 26, 2013 at 2:41 PM, Sandy Walsh <sandy.walsh at rackspace.com> wrote:

> In a previous thread we were discussing the possibility of using a
> "metadata" schema model to keep the complexity of the database down.
>
> http://lists.openstack.org/pipermail/openstack-dev/2013-January/005132.html
>
> This is a very attractive approach because it's so simple, but would it
> be able to handle the volume of data we're contemplating?
>
> (Disclaimer: I'm not much of a db guy, so I could have missed something
> obvious here)
>
> So ... I created a simple set of tables like this:
>
> mysql> show tables;
> +-------------------+
> | Tables_in_cm_meta |
> +-------------------+
> | rawdata           |
> | traits            |
> +-------------------+
> 2 rows in set (0.00 sec)
>
>
> mysql> show columns from traits;
> +------------+-------------+------+-----+---------+----------------+
> | Field      | Type        | Null | Key | Default | Extra          |
> +------------+-------------+------+-----+---------+----------------+
> | id         | int(11)     | NO   | PRI | NULL    | auto_increment |
> | key        | varchar(32) | YES  | MUL | NULL    |                |
> | value      | varchar(32) | YES  | MUL | NULL    |                |
> | rawdata_id | int(11)     | YES  | MUL | NULL    |                |
> +------------+-------------+------+-----+---------+----------------+
> 4 rows in set (0.00 sec)
>
>
> mysql> show columns from rawdata;
> +-------+---------+------+-----+---------+----------------+
> | Field | Type    | Null | Key | Default | Extra          |
> +-------+---------+------+-----+---------+----------------+
> | id    | int(11) | NO   | PRI | NULL    | auto_increment |
> | raw   | text    | YES  |     | NULL    |                |
> +-------+---------+------+-----+---------+----------------+
> 2 rows in set (0.00 sec)
>
>
> All of the traits columns are indexed, with a foreign key from
> traits.rawdata_id to the rawdata.id primary key.
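>
> For reference, that works out to roughly this DDL (a sketch
> reconstructed from the column listings above; the actual creation
> script is in the source linked at the bottom):
>
> CREATE TABLE rawdata (
>     id  INT(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,
>     raw TEXT
> );
>
> CREATE TABLE traits (
>     id         INT(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,
>     `key`      VARCHAR(32),
>     value      VARCHAR(32),
>     rawdata_id INT(11),
>     INDEX (`key`),
>     INDEX (value),
>     INDEX (rawdata_id),
>     FOREIGN KEY (rawdata_id) REFERENCES rawdata (id)
> );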
>
> Note: Yes, the traits table could be normalized further, with the labels
> in a separate table and only their ids used in the joins, but this is a
> close enough test against the stacktach baseline. (Although if someone
> thinks this would result in *radically* better performance, please let
> me know. Include the time to check for duplicates for each unique trait
> key.)
>
> I populated the rawdata table with 10000 rows ... each row had 15 traits
> associated with it.
>
> So about 150k rows in the traits table for the 10k rawdata rows.
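>
> Each event therefore becomes one rawdata row plus one traits row per
> key/value pair, along these lines (values invented for illustration):
>
> INSERT INTO rawdata (raw) VALUES ('{ ... full event json ... }');
>
> INSERT INTO traits (`key`, value, rawdata_id) VALUES
>     ('instance', 'ins-f501e7ca-df38-4fb0-97a4-470e', 1),
>     ('when', '2013-03-13 13:51:22.761889', 1),
>     ('service', 'compute', 1);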
>
> Next I created a basic query to return all the UUIDs for events within
> a small time range.
>
> mysql> SELECT DISTINCT traits.value AS traits_value
>        FROM traits
>        WHERE traits.`key` = 'instance' AND
>              traits.rawdata_id IN (
>                  SELECT DISTINCT
>                      traits.rawdata_id AS traits_rawdata_id
>                  FROM traits
>                  WHERE traits.`key` = 'when' AND
>                        traits.value >= '2013-03-13 13:51:22.761889' AND
>                        traits.value <= '2013-03-13 13:53:15.508622');
>
> +----------------------------------+
> | traits_value                     |
> +----------------------------------+
> | ins-f501e7ca-df38-4fb0-97a4-470e |
>                ...
> | ins-ca4b7acb-04b0-4d32-9a8f-9e0e |
> +----------------------------------+
> 50 rows in set (4 min 27.78 sec)
>
> OVER FOUR MINUTES. Ugh.
> (mysql 5.1.67 on Ubuntu 10.04, 4G ram, single core)
>
> This is a pretty darn simple query.
>
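> (One thing I haven't ruled out: MySQL 5.1 tends to execute
> IN (SELECT ...) as a dependent subquery, re-evaluating it for every
> candidate row. The same question phrased as a self-join might fare
> better:
>
> SELECT DISTINCT t1.value
> FROM traits t1
> JOIN traits t2 ON t2.rawdata_id = t1.rawdata_id
> WHERE t1.`key` = 'instance'
>   AND t2.`key` = 'when'
>   AND t2.value >= '2013-03-13 13:51:22.761889'
>   AND t2.value <= '2013-03-13 13:53:15.508622';
>
> Even then, every date comparison is a string comparison against a
> VARCHAR(32), which is part of the price of the generic schema.)
>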
> In StackTach we're already dealing with nearly 17M rows for about 5
> months of data (full event data in a fully denormalized table), and
> we're doing things a lot more complicated than this.
>
> Comparable queries on the same machine against the stacktach rawdata
> table are < 1s.
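>
> The comparable query there is a single range scan over typed, indexed
> columns, something like:
>
> SELECT DISTINCT instance
> FROM stacktach_rawdata
> WHERE `when` >= '2013-03-13 13:51:22'
>   AND `when` <= '2013-03-13 13:53:15';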
>
> show columns from stacktach_rawdata;
> +---------------+--------------+------+-----+---------+----------------+
> | Field         | Type         | Null | Key | Default | Extra          |
> +---------------+--------------+------+-----+---------+----------------+
> | id            | int(11)      | NO   | PRI | NULL    | auto_increment |
> | deployment_id | int(11)      | NO   | MUL | NULL    |                |
> | tenant        | varchar(50)  | YES  | MUL | NULL    |                |
> | json          | longtext     | NO   |     | NULL    |                |
> | routing_key   | varchar(50)  | YES  | MUL | NULL    |                |
> | state         | varchar(20)  | YES  | MUL | NULL    |                |
> | old_state     | varchar(20)  | YES  | MUL | NULL    |                |
> | old_task      | varchar(30)  | YES  | MUL | NULL    |                |
> | when          | datetime     | NO   | MUL | NULL    |                |
> | microseconds  | int(11)      | NO   | MUL | NULL    |                |
> | publisher     | varchar(100) | YES  | MUL | NULL    |                |
> | event         | varchar(50)  | YES  | MUL | NULL    |                |
> | service       | varchar(50)  | YES  | MUL | NULL    |                |
> | host          | varchar(100) | YES  | MUL | NULL    |                |
> | instance      | varchar(50)  | YES  | MUL | NULL    |                |
> | request_id    | varchar(50)  | YES  | MUL | NULL    |                |
> +---------------+--------------+------+-----+---------+----------------+
> 16 rows in set (0.00 sec)
>
> I think we need to pass on this normalized metadata approach unless
> someone can spot something blatant I'm doing wrong here?
>
> We may have to build a system with user-defined data types via plugins,
> where each plugin manages the tables it needs (including migrations).
> More complicated, but certainly optimal.
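>
> (A hypothetical sketch of what a plugin-owned table might look like for
> compute events -- names invented for illustration:
>
> CREATE TABLE compute_events (
>     id         INT(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,
>     `when`     DATETIME NOT NULL,
>     instance   VARCHAR(50),
>     request_id VARCHAR(50),
>     raw        TEXT,
>     INDEX (`when`),
>     INDEX (instance)
> );
>
> That is, typed, indexed columns for exactly the traits the plugin knows
> it will query on, instead of generic key/value rows.)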
>

Is it necessary to make all of the metadata available for queries? It seems
like, in practice, the useful values will be known somewhere early in the
setup process. We could provide a default table definition, plus a way to
let the user indicate, via a configuration file, which additional values
they want stored. We could then either update the table definition
ourselves, or leave that up to the deployer to handle.
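
For example (a sketch; assuming the deployer names "instance" as a
queryable value in their configuration file):

    ALTER TABLE rawdata
        ADD COLUMN instance VARCHAR(50),
        ADD INDEX (instance);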

Doug


>
> The source for the test is here:
> https://github.com/SandyWalsh/twobillion/blob/master/sql_schema.py
> and
> https://github.com/SandyWalsh/twobillion/blob/master/schema.py
>
> (uncomment the create_data() call on the first run to get a dataset
> https://github.com/SandyWalsh/twobillion/blob/master/sql_schema.py#L169 )
>
> Looking forward to your feedback, db-heads :)
>
> -S
>