Monday, April 25, 2011

Multipart Parallel upload to Amazon Simple Storage Service using boto

The comment I received on uploading to S3 triggered me to do some research (or some googling).  Life is short and we always need something to save time.

Yeah, in Windows I tried one of the free uploaders.  After it asks for your id key and secret key, it works like a file explorer.  It's fast and error free.  I don't remember whether I had large files in there, and I would think retries are hidden from view.

What to look for is multipart (a relatively new S3 protocol) and parallel upload, and of course automatic retry.  Basically it's more like ftp, where you see the files split into chunks and transferred in many parallel threads.

Multipart upload allows behavior like "resume".  If you lose one file chunk you just need to retry that chunk.  The transfer speed depends on both ends of each connection, but the number of connections is not limited to one.
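The "retry only the lost chunk" idea can be sketched in plain Python.  This is only an illustration, not what any particular uploader does: send_part is a hypothetical callback standing in for the real network call, and max_tries is an assumed setting.

```python
def upload_parts(parts, send_part, max_tries=3):
    """Send each chunk in turn, retrying only the chunk that failed.

    parts: list of bytes objects (the file already split into chunks).
    send_part: hypothetical callable(part_number, data) that raises
               IOError when the transfer of that chunk fails.
    Returns a dict mapping part number to the attempts it needed.
    """
    attempts = {}
    for number, data in enumerate(parts, start=1):
        for attempt in range(1, max_tries + 1):
            attempts[number] = attempt
            try:
                send_part(number, data)
                break  # this chunk is done; finished chunks are never resent
            except IOError:
                if attempt == max_tries:
                    raise  # give up after max_tries failures on one chunk
    return attempts


if __name__ == "__main__":
    failed_once = set()

    def flaky_send(number, data):
        # Simulated network: chunk 2 fails the first time only.
        if number == 2 and number not in failed_once:
            failed_once.add(number)
            raise IOError("connection dropped")

    print(upload_parts([b"aaa", b"bbb", b"ccc"], flaky_send))
```

With the simulated failure, only chunk 2 is sent twice; chunks 1 and 3 go through once.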

Most Windows uploaders have both in the commercial versions.  You may or may not find them in the free versions.  The problem is, I haven't booted into Windows for weeks now, and I have no intention to unless necessary.

In Linux it's a different story.  Some good uploaders don't have Linux versions.  s3cmd is good and free, but it supports neither multipart nor parallelism.  You can see that it uploads one file at a time, with retry on errors.

After googling I came across the one and only boto package.  I also copied the one and only complete script that demonstrates multipart and parallel upload.

First, boto is written in Python, so you can also try it in Windows.  Of course Python is already on my Linux, and I don't know if it was me who installed it.  (I'm the only superuser.)  Also, boto was already installed, but I don't know if that was me either.

But the boto on my Linux is old.  You need the latest boto, 2.x.  You have to uninstall the current version using any one of the usual methods.  Download boto from the official site.  I unzipped the .gz file using the archiver via the GUI.  I don't even know the command to unzip it.

After some guessing, I figured out how to "install" the new version.  You change directory to where the setup file is.  Then use the command:

$ python setup.py install
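To confirm which boto version Python picks up after the install, a quick check like the following may help.  It is a minimal sketch: the "2.x or newer" threshold is taken from this post, and installed_boto_version is just a helper name made up for the example.

```python
def installed_boto_version():
    """Return boto's version string, or None if boto cannot be imported."""
    try:
        import boto
    except ImportError:
        return None
    return boto.__version__


version = installed_boto_version()
if version is None:
    print("boto is not installed for this Python")
elif int(version.split(".")[0]) < 2:
    # Assumption from the post: multipart upload needs boto 2.x.
    print("boto %s is older than 2.x" % version)
else:
    print("boto %s is 2.x or newer" % version)
```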

To try multipart upload, create myuniquebucket using the S3 control console, at the reduced redundancy rate, the cheaper rate.  I think the rate depends on your bucket, determined at creation.

Copy the "bio" script or any other script you want to try.  I removed the reduced redundancy option in the script because it got complaints from Python.

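To run boto, first you need to edit the ~/.boto file to declare the id key and secret key.  See the official boto site.  Download the upload script and modify it somewhere.  Then run it with the file and the bucket as arguments (upload_script.py here stands for whatever you named your copy):

$ python upload_script.py bigfilename myuniquebucket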
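For reference, the ~/.boto file is a small INI-style file.  A minimal version, as far as I can tell from the boto documentation, needs only a Credentials section; the values in angle brackets are placeholders for your own keys.

```ini
[Credentials]
aws_access_key_id = <your id key>
aws_secret_access_key = <your secret key>
```

Since the file holds your secret key, it is worth making it readable by your user only.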

Maybe I'll post my configuration file and the bio script later.  This script works as intended but is far from an end user product.

It is not obvious how you specify the size of the parts.  If the files are over 10 MB you may think that the script hangs.  I am getting an upload rate of about 100 KB/s.  When the parts are large, you need some signs of life like in s3cmd.
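One way to get both an explicit part size and some signs of life is to drive boto's multipart calls yourself.  The sketch below is not the script discussed in this post; it is a single-threaded illustration using the boto 2.x calls initiate_multipart_upload, upload_part_from_file, complete_upload and cancel_upload.  The function name and the part_size default are my own choices.  S3 requires every part except the last to be at least 5 MB.

```python
import io
import os


def multipart_upload(bucket, path, key_name, part_size=5 * 1024 * 1024):
    """Upload `path` in parts of `part_size` bytes, printing progress.

    `bucket` is expected to be a boto 2.x Bucket, for example
    boto.connect_s3().get_bucket("myuniquebucket").
    Returns the number of parts sent.
    """
    total = os.path.getsize(path)
    mp = bucket.initiate_multipart_upload(key_name)
    part_num = 0
    sent = 0
    try:
        with open(path, "rb") as fp:
            while True:
                chunk = fp.read(part_size)  # one part is held in memory
                if not chunk:
                    break
                part_num += 1
                mp.upload_part_from_file(io.BytesIO(chunk), part_num)
                sent += len(chunk)
                # Sign of life: one line per finished part.
                print("part %d done, %d of %d bytes" % (part_num, sent, total))
        mp.complete_upload()
    except Exception:
        mp.cancel_upload()  # do not leave a half-finished upload on S3
        raise
    return part_num
```

Smaller parts give more frequent progress lines and cheaper retries, at the cost of more requests.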

The script uses all your cores, or threads in a multithreaded CPU.  For me it's two.  I don't know whether more parallelism would max out my 10M cable modem connection.
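Since uploading is limited by the network rather than the CPU, the number of simultaneous connections does not have to match the number of cores.  The sketch below swaps in a thread pool with a configurable worker count, which is a different technique from the script's one-worker-per-core approach.  send_part is again a hypothetical callback for the real upload call, and whether more workers actually fill the link is something to measure, not something this code shows.

```python
from concurrent.futures import ThreadPoolExecutor


def upload_in_parallel(parts, send_part, workers=4):
    """Send chunks concurrently over `workers` threads.

    send_part: hypothetical callable(part_number, data) doing the upload.
    Returns a dict mapping part number to whatever send_part returned.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            number: pool.submit(send_part, number, data)
            for number, data in enumerate(parts, start=1)
        }
        # result() waits for the part and re-raises any worker exception.
        return {number: f.result() for number, f in futures.items()}
```

Trying workers=2, 4 and 8 on the same file would show whether the connection or the thread count is the limit.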

The script uploads one file, in whole or in parts.  You still need to call the script from other directory sync scripts.  I do not know if the script will upload to subdirectories.
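On the subdirectory question: S3 has no real subdirectories, only key names, and a key containing "/" is shown as a folder path by most tools.  So a directory sync wrapper mostly needs to turn local paths into key names.  A minimal sketch follows; keys_for_directory is a name made up for this example, and each pair it yields would be handed to whatever upload script you use.

```python
import os


def keys_for_directory(root):
    """Yield (local_path, key_name) for every file under `root`.

    The key name is the path relative to `root`, with "/" separators.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            local_path = os.path.join(dirpath, name)
            relative = os.path.relpath(local_path, root)
            yield local_path, relative.replace(os.sep, "/")
```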

1 comment:

kirti said...

It's really true what you are saying.
The multipart feature saves time, since a single upload is sent in parallel threads.  I use the Bucket Explorer tool to implement these services in my applications and transactions with Amazon S3.  It's really nice and easy to use.