[openstack-dev] [Zaqar] Comments on the concerns arose during the TC meeting
Clint Byrum
clint at fewbar.com
Tue Sep 30 17:16:20 UTC 2014
Excerpts from Jay Pipes's message of 2014-09-30 09:41:29 -0700:
> On 09/30/2014 08:03 AM, Soren Hansen wrote:
> > 2014-09-12 1:05 GMT+02:00 Jay Pipes <jaypipes at gmail.com>:
> >> If Nova was to take Soren's advice and implement its data-access layer
> >> on top of Cassandra or Riak, we would just end up re-inventing SQL
> >> Joins in Python-land.
> >
> > I may very well be wrong(!), but this statement makes it sound like you've
> > never used e.g. Riak. Or, if you have, not done so in the way it's
> > supposed to be used.
> >
> > If you embrace an alternative way of storing your data, you wouldn't just
> > blindly create a container for each table in your RDBMS.
> >
> > For example: In Nova's SQL-based datastore we have a table for security
> > groups and another for security group rules. Rows in the security group
> > rules table have a foreign key referencing the security group to which
> > they belong. In a datastore like Riak, you could have a security group
> > container where each value contains not just the security group
> > information, but also all the security group rules. No joins in
> > Python-land necessary.
>
> OK, that's all fine for a simple one-to-many relation.
>
> How would I go about getting the associated fixed IPs for a network? The
> query to get associated fixed IPs for a network [1] in Nova looks like this:
>
> SELECT
> fip.address,
> fip.instance_uuid,
> fip.network_id,
> fip.virtual_interface_id,
> vif.address,
> i.hostname,
> i.updated_at,
> i.created_at,
> fip.allocated,
> fip.leased,
> vif2.id
> FROM fixed_ips fip
> LEFT JOIN virtual_interfaces vif
> ON vif.id = fip.virtual_interface_id
> AND vif.deleted = 0
> LEFT JOIN instances i
> ON fip.instance_uuid = i.uuid
> AND i.deleted = 0
> LEFT JOIN (
> SELECT MIN(vi.id) AS id, vi.instance_uuid
> FROM virtual_interfaces vi
> GROUP BY instance_uuid
> ) as vif2
> WHERE fip.deleted = 0
> AND fip.network_id = :network_id
> AND fip.virtual_interface_id IS NOT NULL
> AND fip.instance_uuid IS NOT NULL
> AND i.host = :host
>
You and I both know that this query is not something we want to be
running a lot. In the past when I've had systems where something like
the above needed to be run a lot, I created materialized views for it
and made access to this information asynchronous because relying on this
in real-time is _extremely expensive_.
This is where a massively scalable but somewhat dumb database shines
because you can materialize the view into it at a much faster pace and
with more scaled-out workers so that you now have fine grained resource
allocations for scaling the responsiveness of this view.
> would I have a Riak container for virtual_interfaces that would also
> have instance information, network information, fixed_ip information?
> How would I accomplish the query against a derived table that gets the
> minimum virtual interface ID for each instance UUID?
>
Yes you'd have a bunch of duplicated information everywhere. Asynchronous
denormalizing is the new join. The only worry I'd have is what
accidental contracts have we made that require bits of information to
appear all at once. :-P
> More than likely, I would end up having to put a bunch of indexes and
> relations into my Riak containers and structures just so I could do
> queries like the above. Failing that, I'd need to do multiple queries to
> multiple Riak containers and then join the resulting projection in
> memory, in Python. And that is why I say you will just end up
> implementing joins in Python.
>
My experience has been that you have a map/reduce cluster somewhere
churning through updates to pre-compute things for slow queries and you
do any realtime duplication at write time. The idea is to create documents
that are themselves useful without the need of the other documents. Where
this approach suffers is when your documents all have so many copies of
data that you end up writing 10x more than you read. That is a rare thing,
and most certainly would not happen with Nova's data, which mostly would
have duplicates of lots of UUID's and IP addresses.
More information about the OpenStack-dev
mailing list