[openstack-dev] [MagnetoDB] Best practices for uploading large amounts of data
Dmitriy Ukhlov
dukhlov at mirantis.com
Fri Mar 28 12:21:12 UTC 2014
On 03/28/2014 11:29 AM, Serge Kovaleff wrote:
> Hi Iliia,
>
> I would take a look into BSON http://bsonspec.org/
>
> Cheers,
> Serge Kovaleff
>
> On Thu, Mar 27, 2014 at 8:23 PM, Illia Khudoshyn
> <ikhudoshyn at mirantis.com <mailto:ikhudoshyn at mirantis.com>> wrote:
>
> Hi, Openstackers,
>
> I'm currently working on adding bulk data load functionality to
> MagnetoDB. This functionality implies inserting huge amounts of
> data (billions of rows, gigabytes of data). The data being
> uploaded is a set of JSON documents (for now). The question I'm
> interested in is the way of transporting the data. For now I do a
> streaming HTTP POST request from the client side, with gevent.pywsgi
> on the server side.
>
> Could anybody suggest any (better?) approach for the
> transportation, please?
> What are the best practices for that?
>
> Thanks in advance.
>
> --
>
> Best regards,
>
> Illia Khudoshyn,
> Software Engineer, Mirantis, Inc.
>
> 38, Lenina ave. Kharkov, Ukraine
>
> www.mirantis.com <http://www.mirantis.ru/>
>
> www.mirantis.ru <http://www.mirantis.ru/>
>
> Skype: gluke_work
>
> ikhudoshyn at mirantis.com <mailto:ikhudoshyn at mirantis.com>
>
>
Hi Illia,
I guess if we are talking about Cassandra batch loading, the fastest
way is to generate SSTables locally and load them into Cassandra via
JMX or sstableloader:
http://www.datastax.com/dev/blog/bulk-loading
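For reference, once the SSTable files exist on disk, pushing them into
a running cluster is basically a single sstableloader invocation. A
rough sketch (the contact node and the keyspace/table directory are
just placeholders), wrapped in Python only to keep the examples in one
language:

    import subprocess

    # sstableloader streams SSTables from a local directory laid out as
    # <keyspace>/<column_family>/ to the nodes passed via -d.
    subprocess.check_call([
        "sstableloader",
        "-d", "127.0.0.1",                 # placeholder contact node
        "/tmp/bulk/magnetodb/user_table",  # placeholder keyspace/table dir
    ])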
If you want to implement bulk load via the MagnetoDB layer (not to
Cassandra directly), you could try to use a simple TCP socket and
implement your own binary protocol (using BSON, for example). HTTP is
a text protocol, so using a TCP socket helps you avoid the overhead of
base64 encoding. In my opinion, working with HTTP and BSON is a
doubtful solution, because you end up with two-phase encoding and
decoding: 1) object to BSON, 2) BSON to base64, 3) base64 to BSON,
4) BSON to object, instead of just 1) object to JSON, 2) JSON to
object in the case of HTTP + JSON.
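To make the binary-protocol idea concrete, here is a minimal sketch of
length-prefixed BSON frames over a plain TCP socket (the host, port and
the use of pymongo's bson module are my own assumptions, not anything
MagnetoDB provides today):

    import socket
    import struct

    import bson  # pymongo's bson module, used only as an example encoder

    def send_rows(rows, host="127.0.0.1", port=9999):  # placeholder endpoint
        """Send each row as a 4-byte length prefix plus its BSON body."""
        sock = socket.create_connection((host, port))
        try:
            for row in rows:
                payload = bson.BSON.encode(row)  # object -> BSON, one step
                sock.sendall(struct.pack("!I", len(payload)) + payload)
            sock.sendall(struct.pack("!I", 0))  # zero-length frame ends the stream
        finally:
            sock.close()

    # Raw binary frames avoid the ~33% size growth that base64 adds when
    # binary data has to travel inside a text protocol.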
HTTP streaming, as far as I know, is an asynchronous flavor of HTTP.
You can expect some performance gain from skipping the generation of
an HTTP response on the server side, and the wait for that response on
the client side, for each chunk. But you still need to send almost the
same amount of data, so if network throughput is your bottleneck, it
doesn't help; if the server side is your bottleneck, it doesn't help
either.
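For comparison, the chunked-HTTP upload looks roughly like this on the
client side (the URL and the use of python-requests are assumptions for
illustration); it keeps a single request open and feeds rows as chunks,
but every byte still crosses the same network path:

    import json

    import requests  # assumed HTTP client, for illustration only

    def stream_rows(rows, url="http://localhost:8480/v1/data/bulk_load"):
        """POST newline-delimited JSON rows with chunked transfer encoding.

        The URL above is a placeholder, not a real MagnetoDB endpoint.
        """
        def body():
            for row in rows:
                yield json.dumps(row) + "\n"  # one JSON document per chunk
        # Passing a generator makes requests send the body with
        # Transfer-Encoding: chunked, so the whole data set is never
        # buffered in client memory.
        return requests.post(url, data=body())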
Also note that, in any case, the MagnetoDB Cassandra storage backend
currently converts your data to a CQL query, which is also text. It
would be nice to implement the MagnetoDB BatchWriteItem operation via
Cassandra SSTable generation and loading through sstableloader, but
unfortunately, as far as I know, that functionality is implemented
only for the Java world.
--
Best regards,
Dmitriy Ukhlov
Mirantis Inc.