[Openstack] Database stuff

Michael Pittaro mikeyp at lahondaresearch.org
Tue Nov 29 18:32:29 UTC 2011


I like e).

I know it's heresy in some circles, but this approach can also support
the use of SQLAlchemy Core without the SQLAlchemy ORM. It's more work
in some respects, but it does allow writing very efficient queries.
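
For example, a Core-level query keeps the joins explicit and hands back
plain rows, with no ORM session state to go stale. A rough, untested
sketch (the tables and columns below are invented for illustration, not
the real nova schema):

    from sqlalchemy import (MetaData, Table, Column, Integer, String,
                            ForeignKey, create_engine, select)

    metadata = MetaData()
    instances = Table('instances', metadata,
                      Column('id', Integer, primary_key=True),
                      Column('hostname', String(255)))
    virtual_interfaces = Table('virtual_interfaces', metadata,
                               Column('id', Integer, primary_key=True),
                               Column('instance_id', Integer,
                                      ForeignKey('instances.id')),
                               Column('address', String(255)))

    engine = create_engine('sqlite://')
    metadata.create_all(engine)

    conn = engine.connect()
    conn.execute(instances.insert().values(id=1, hostname='vm-1'))
    conn.execute(virtual_interfaces.insert().values(
        id=1, instance_id=1, address='de:ad:be:ef:00:01'))

    # One hand-written join, one round trip, plain tuples back.
    query = select([instances.c.hostname,
                    virtual_interfaces.c.address]).select_from(
                        instances.join(virtual_interfaces))
    rows = conn.execute(query).fetchall()
    conn.close()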

mike

On Tue, Nov 29, 2011 at 9:55 AM, Vishvananda Ishaya
<vishvananda at gmail.com> wrote:
> e) is the right solution imho.  The only reason joinedloads slipped in was efficiency.
>
> In an ideal world the solution would be:
>
> 1) (explicitness) Every object or list of related objects is retrieved with an explicit call:
>  instance = db.instance_get(id)
>  ifaces = db.interfaces_get_by_instance(id)
>  for iface in ifaces:
>     ip = db.fixed_ip_get_by_interface(iface['id'])
> 2) (efficiency) Queries are perfectly efficient and all joins that will be used are made at once.
>  So the above would be a single db query that joins instances, ifaces, and ips.
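>  For instance (a sketch only; the Instance model, its relationships, and an open
>  session are assumed here, not the exact nova models), the sqlalchemy layer could
>  do the whole thing in one go with a nested eager load:
>
>    from sqlalchemy.orm import joinedload_all
>
>    instance = session.query(Instance).\
>        options(joinedload_all('virtual_interfaces.fixed_ips')).\
>        filter_by(id=instance_id).\
>        first()
>    # One SELECT with two JOINs: instance.virtual_interfaces and each
>    # iface.fixed_ips come back already populated, no extra round trips.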
>
> Unless we're doing source code analysis to generate our queries, we're probably going
> to have to make some tradeoffs to get as much efficiency and explicitness as possible.
>
> Brainstorming: perhaps we could use a hinting/caching mechanism that the backend could support,
> something like db.interfaces_get_by_instance(id, hint='fixed_ip'), which states that you are about to make
> another db request to get the fixed ips, so the backend could prejoin and cache the results. Then the next
> request could be db.fixed_ip_get_by_interface(iface['id'], cached=True), or some such.
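>
> To make that concrete, here's one possible backend sketch (the _*_query
> helpers are placeholders for whatever queries the driver actually runs):
>
>  _prejoin_cache = {}
>
>  def interfaces_get_by_instance(instance_id, hint=None):
>      if hint == 'fixed_ip':
>          # Placeholder for the real joined query; stash the related row
>          # so the follow-up call can be answered from the cache.
>          ifaces = _interfaces_join_fixed_ip_query(instance_id)
>          for iface in ifaces:
>              _prejoin_cache[('fixed_ip', iface['id'])] = iface['fixed_ip']
>          return ifaces
>      return _interfaces_query(instance_id)
>
>  def fixed_ip_get_by_interface(interface_id, cached=False):
>      if cached:
>          try:
>              return _prejoin_cache.pop(('fixed_ip', interface_id))
>          except KeyError:
>              pass  # nothing was prejoined; fall through to a real query
>      return _fixed_ip_query(interface_id)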
>
> I would like to move towards 1), but I think we really have to solve 2) or we will be smashing the database with too many queries.
>
> Vish
>
> On Nov 29, 2011, at 7:20 AM, Soren Hansen wrote:
>
>> Hi, guys.
>>
>> Gosh, this turned out to be long. Sorry about that.
>>
>> I'm adding some tests for the DB api, and I've stumbled across something
>> that we should probably discuss.
>>
>> First of all, most (if not all) of the various *_create methods we have
>> return quite amputated objects. Any attempt to access related objects
>> will fail with the much too familiar:
>>
>>   DetachedInstanceError: Parent instance <Whatever at 0x4f5c8d0> is not
>>   bound to a Session; lazy load operation of attribute 'other_things'
>>   cannot proceed
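>>
>> The shape of the failure is basically this (the names are made up to
>> match the error above, and get_session/Whatever stand in for the real
>> code):
>>
>>    session = get_session()
>>    thing = session.query(Whatever).first()
>>    session.close()
>>    thing.other_things   # never loaded, and the session is gone, so
>>                         # SQLAlchemy raises DetachedInstanceError
>>                         # instead of lazy-loading the related rows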
>>
>> Also, with the SQLAlchemy driver, this test would pass:
>>
>>    network = db.network_create(ctxt, {})
>>    network = db.network_get(ctxt, network['id'])
>>
>>    instance = db.instance_create(ctxt, {})
>>    self.assertEquals(len(network['virtual_interfaces']), 0)
>>    db.virtual_interface_create(ctxt, {'network_id': network['id'],
>>                                       'instance_id': instance['id']})
>>
>>    self.assertEquals(len(network['virtual_interfaces']), 0)
>>    network = db.network_get(ctxt, network['id'])
>>    self.assertEquals(len(network['virtual_interfaces']), 1)
>>
>> I create a network, pull it out again (as per my comment above), verify
>> that it has no virtual_interfaces related to it, create a virtual
>> interface in this network, and then check the network's virtual_interfaces
>> key, finding that it still has length 0. Reloading the network then
>> reveals the new virtual interface.
>>
>> SQLAlchemy does support looking these things up on the fly. In fact,
>> AFAIK, this is its default behaviour. We just override it with
>> joinedload options, because we don't use scoped sessions.
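>>
>> In other words, with a plain lazy relationship and a live session, the
>> related rows get fetched at attribute access time, while the joinedload
>> option pins them to whatever the join returned at query time. Roughly
>> (the Network model, session and network_id here are made up, not the
>> real ones):
>>
>>    from sqlalchemy.orm import joinedload
>>
>>    # Default lazy loading: touching the attribute emits a fresh SELECT,
>>    # as long as the owning session is still alive.
>>    network = session.query(Network).get(network_id)
>>    network.virtual_interfaces
>>
>>    # What we do today: eager-load up front, so the attribute holds
>>    # whatever the join returned when the query ran.
>>    network = session.query(Network).\
>>        options(joinedload('virtual_interfaces')).\
>>        filter_by(id=network_id).\
>>        first()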
>>
>> My fake db driver looks stuff like this up on the fly (so the
>> assertEquals after the virtual_interface_create will fail with that db
>> driver).
>>
>> So my question is this: Should this be
>>
>> a) looked up on the fly,
>> b) looked up on first key access and then cached,
>> c) looked up when the parent object is loaded and then never again,
>> d) or up to the driver author?
>>
>> Or should we do away with this stuff altogether? I.e. no more looking up
>> related objects through __getitem__, and instead only allowing lookups
>> through db methods. So, instead of
>> network['virtual_interfaces'], you'd always do
>> db.virtual_interfaces_get_by_network(ctxt, network['id']).  Let's call
>> this option e).
>>
>> I'm pretty undecided myself. If we go with option e), it becomes clear to
>> consumers of the DB API when they're pulling out fresh stuff from the DB
>> and when they're reusing potentially old results.  Explicit is better
>> than implicit, but it'll take quite a bit of work to change this.
>>
>> If we go with one of options a) through d), my order of preference is
>> (from most to least preferred): a), d), c), b).
>>
>> There's value in having a right way and a wrong way to do this. If either
>> behaviour is acceptable, it makes testing (as in validation) more awkward.
>> I'd say it's always possible to do on-the-fly lookups. Overriding __getitem__
>> and fetching fresh results is pretty simple (see the sketch right after this
>> paragraph), and, as mentioned earlier,
>> I believe this is SQLAlchemy's default behaviour (somebody please
>> correct me if I'm wrong). Forcing an arbitrary ORM to replicate the
>> behaviour of b) and c) could be incredibly awkward, and c) is also
>> complicated because there might be reference loops involved. Also,
>> reviewing correct use of something where the need for reloads depends on
>> previous use of your db objects (which might itself be conditional or
>> happen in called methods) sounds like no fun at all.  With d) it's
>> pretty straightforward: "Do you want to be sure to have fresh responses?
>> Then reload the object.  Otherwise, behaviour is undefined." It's
>> straightforward to explain and review.  Option e) is also easy to
>> explain and do reviews for, btw.
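>>
>> For what it's worth, the __getitem__ trick amounts to something like
>> this (a rough sketch, not the actual fake driver code; the
>> related_getters mapping is just for illustration):
>>
>>    class FreshDict(dict):
>>        """Dict-like db object that re-resolves related keys on access."""
>>
>>        def __init__(self, data, related_getters=None):
>>            super(FreshDict, self).__init__(data)
>>            # e.g. {'virtual_interfaces':
>>            #       lambda: virtual_interfaces_get_by_network(ctxt, net_id)}
>>            self._related_getters = related_getters or {}
>>
>>        def __getitem__(self, key):
>>            if key in self._related_getters:
>>                # option a): always fetch a fresh result
>>                return self._related_getters[key]()
>>            return super(FreshDict, self).__getitem__(key)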
>>
>> DB objects with options a) through d) will have trouble making it
>> through a serializer. After dict serialisation, the object isn't going
>> to change, unresolved related objects will not be resolved, and
>> prepopulating everything prior to serialisation is out of the question
>> due to the potential for loops and very big sets of related objects (and
>> their related objects and their related objects' related objects,
>> etc.). I think it would be valuable to not have to think a whole lot
>> about whether a particular db-like object is a "real" db object or
>> whether it came in as a JSON object or over the message bus or whatever.
>> Only option e) will give us that, AFAICS.
>>
>> It seems I've talked myself into preferring option e). It's too much
>> work to do on my own, and it's going to be disruptive, so we need to
>> do it real soon. I think it'll be worth it, though.
>>
>> --
>> Soren Hansen        | http://linux2go.dk/
>> Ubuntu Developer    | http://www.ubuntu.com/
>> OpenStack Developer | http://www.openstack.org/
>>



