Wednesday, April 27, 2011

Cloud storage solutions: s3cmd, boto, gsutil

s3cmd is good for Amazon S3.  It's written in Python, so you can also use it on Windows or wrap it in your own scripts.  Two things are critically missing: multipart upload and parallel upload.

There is a branch of s3cmd for "parallel" upload.  I downloaded it, only to find that it doesn't improve anything.  I think that if you are uploading to the same account from the same IP, the speed limit is the same however many connections you open.  That's probably why the patch wasn't merged into the main s3cmd.

Multipart upload is implemented in boto.  The problem is that boto is just a Python wrapper for the S3 (and all AWS) API, not an end-user tool.  I downloaded some user scripts that use boto.  They work, but only if you just want to upload a large file to the root of a bucket.
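
For what it's worth, here is a minimal sketch of how a multipart upload to a key below the bucket root might look with boto 2.  The names plan_parts and multipart_upload and the part size default are my own, made up for illustration.  The boto calls (initiate_multipart_upload, upload_part_from_file with a size argument, complete_upload, cancel_upload) are as I understand them from later boto 2 releases, so check them against your installed version.  It assumes a non-empty file and credentials in ~/.boto.

```python
import os

MIN_PART = 5 * 1024 * 1024  # S3 rejects parts under 5 MB, except the last one


def plan_parts(total_size, part_size=MIN_PART):
    """Return a list of (part_number, offset, length) covering total_size bytes."""
    if part_size < MIN_PART:
        raise ValueError("part_size must be at least 5 MB for S3")
    parts = []
    offset = 0
    number = 1  # S3 part numbers start at 1
    while offset < total_size:
        length = min(part_size, total_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts


def multipart_upload(path, bucket_name, key_name, part_size=MIN_PART):
    """Upload path to bucket_name/key_name; key_name may contain slashes."""
    import boto  # imported here so plan_parts works without boto installed

    bucket = boto.connect_s3().get_bucket(bucket_name)  # credentials from ~/.boto
    mp = bucket.initiate_multipart_upload(key_name)
    try:
        with open(path, "rb") as fp:
            for number, offset, length in plan_parts(os.path.getsize(path), part_size):
                fp.seek(offset)
                # size= limits the read to this part only (assumed boto 2 behaviour)
                mp.upload_part_from_file(fp, part_num=number, size=length)
        mp.complete_upload()
    except Exception:
        mp.cancel_upload()  # don't leave orphaned parts behind
        raise
```

A call such as multipart_upload("backup.img", "myuniquebucket", "mydir/backup.img") would then put the file under mydir/ rather than at the root; the file name here is only an example.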

While searching for a tool that embeds boto, I found to my surprise that it is gsutil, which is meant for Google Storage.  That made a few things clear to me.

Now that the giant is closing in, I wonder how much incentive is left to keep updating S3 tools.  Google Storage is slightly more expensive for storage and about the same for bandwidth, but the promotion is 100 GB free per month!  The problem is that I have no idea whether they will give me an account after I submitted an almost blank form with nothing to say, or how long I will have to wait.

Now I understand why Amazon has the 1 year free promotion, and why they recently added the multipart upload and resumable download (?) API.  Without those, everybody would switch to GS.

Go to the gsutil project on Google Code to download and install it.  They have good instructions for that.  But it wouldn't work for me, because the developers' Ubuntu isn't the same Ubuntu as yours and mine!

After installing gsutil in your home directory, cd into the gsutil directory, edit the gsutil file and search for the line:

def SanityCheckXmlParser(cmd):

Then, right below the line, add a line with two spaces and the word

  return

That will make it work, because the function now returns before running its XML parser check.  You also need to set up a ~/.boto file like this:


[Credentials]
aws_access_key_id = your id key
aws_secret_access_key = your secret key

[Boto]
debug = 0
num_retries = 10
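
If you want to check the file before blaming gsutil, a small Python 3 sketch like the following can tell you which credential entries are missing.  The function name missing_credentials is my own and the sample text just repeats the placeholder values above; this is not something boto or gsutil provides.

```python
import configparser

REQUIRED = ("aws_access_key_id", "aws_secret_access_key")


def missing_credentials(config_text):
    """Return the required [Credentials] options that are absent or empty."""
    # interpolation=None so a '%' in a key is not treated specially
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    if not parser.has_section("Credentials"):
        return list(REQUIRED)
    return [name for name in REQUIRED
            if not parser.get("Credentials", name, fallback="").strip()]


SAMPLE = """
[Credentials]
aws_access_key_id = your id key
aws_secret_access_key = your secret key

[Boto]
debug = 0
num_retries = 10
"""

print(missing_credentials(SAMPLE))
```

To run it against the real file, read the text from os.path.expanduser("~/.boto") and pass it in instead of SAMPLE.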

I think that if you don't have a ~/.boto file, gsutil will ask you for the details, but possibly only for GS.  I haven't verified this.

The documentation (or lack of it) only mentions gs://, but s3:// also works, as far as I can tell in exactly the same way.  The syntax is similar to s3cmd's but closer to ordinary Linux commands such as cp.

To copy the whole directory (and subs):

gsutil cp -R mydir s3://myuniquebucket

Large files are probably uploaded in 1 MB chunks, though I haven't confirmed it.  To copy subdirectories, the -R option is necessary for s3.

Unfortunately gsutil doesn't sync.  But since gsutil is rather reliable with multipart upload, you don't need to sync that often.  When you have added something or just want to check, you can use s3cmd:

s3cmd sync mydir/ s3://myuniquebucket/mydir/

Note the trailing slashes; without them s3cmd will check more than that one directory.  If s3cmd detects that a large file needs to be uploaded, you are better off uploading that file yourself with gsutil.
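
To find out ahead of time which files are worth handing to gsutil, a sketch along these lines can list everything above a size threshold.  The function name large_files and whatever threshold you pick are my own assumptions; neither s3cmd nor gsutil defines what counts as "large".

```python
import os


def large_files(root, threshold_bytes):
    """Yield (relative_path, size) for regular files under root of at least threshold_bytes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            # skip symlinks and anything that is not a regular file
            if os.path.islink(full) or not os.path.isfile(full):
                continue
            size = os.path.getsize(full)
            if size >= threshold_bytes:
                yield os.path.relpath(full, root), size
```

For example, looping over large_files("mydir", 100 * 1024 * 1024) prints or collects the files of 100 MB and up, which you can then copy one by one with gsutil cp before running the s3cmd sync.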

Now that's the complete solution.  Three packages to do something simple.  You may want to pay up for the commercial software.

Now I can back up my large encrypted containers.

5 comments:

Unknown said...

Got this installed and working on my system, but I am still not able to upload large files to S3. How large are the files you are uploading?

Unknown said...

I may have gotten this to work now actually. I had to also get this module and install it:

http://pypi.python.org/pypi/filechunkio/1.5

I'm now trying out s3multiput, which was in the boto/bin directory and which I installed into /usr/bin by adjusting the setup.py script. So s3multiput is only available with the gsutil package? I wonder why I can't get it working with a plain gsutil cp file s3://bucket/

Unknown said...

Just updating my last comment: it worked out for me using s3multiput rather than gsutil. Thanks for the original post; it set me on the right path!

The Player said...

From my update post: gsutil is supposed to upload in multiple parts, in 1 MB chunks. Boto supports that. It may be true for Google Storage, but I doubt it is true for S3 too. I also couldn't find an option to force multipart upload. A 2 GB file failed after more than 1.5 GB had been uploaded.

The Player said...

Cool, the Python update was uploaded on the same date as my post. But I wonder whether boto has been updated to use it.

There have been some recent updates to boto too, specifically to s3multiput.

Boto is not intended for end users, so I won't be trying it directly soon.

My experience is that gsutil uses boto underneath. I suppose gsutil uploads to Google Storage in multiple parts, but NOT to S3. I blamed gsutil, but I may be wrong. Should I be blaming boto instead? boto claims to use multipart upload, but doesn't say whether that applies only to Google Storage.