App Engine Datastore: How to Efficiently Export Your Data

Posted November 8th, 2012 in Big Data, Development by Greg Bayer

While Google App Engine has many strengths, as with all platforms, there are some challenges to be aware of. Over the last two years, one of our biggest challenges at Pulse has been how difficult it can be to export large amounts of data for migration, backup, and integration with other systems. While there are several options and tools, so far none have been feasible for large datasets (10GB+).

Since we have many TBs of data in Datastore, we’ve been actively looking for a solution to this for some time. I’m excited to share a very effective approach based on Google Cloud Storage and Datastore Backups, along with a method for converting the data to other formats!

Existing Options For Data Export

These options have been around for some time. They are often promoted as making it easy to access datastore data, but the reality can be very different when dealing with big data.

  1. Using the Remote API Bulk Loader. Although convenient, this official tool only works well for smaller datasets. Large datasets can easily take 24 hours to download and often fail without explanation. This tool has remained pretty much the same (without any further development) since App Engine’s early days. All official Google instructions point to this approach.
  2. Writing a map reduce job to push the data to another server. This approach can be painfully manual and often requires significant infrastructure elsewhere (e.g. on AWS).
  3. Using the Remote API directly, or writing a handler that accesses datastore entities one query at a time. You can then run a parallelizable script or map reduce job to pull the data to where you need it. Unfortunately, this has the same issues as #2.

A New Approach – Export Data via Google Cloud Storage

The recent introduction of Google Cloud Storage has finally made exporting large datasets out of Google App Engine’s datastore possible and fairly easy. The setup steps are annoying, but thankfully it’s mostly a one-time cost. Here’s how it works.

One-time setup

  • Create a new task queue in your App Engine app called ‘backups’ with the maximum 500/s rate limit (optional).
  • Sign up for a Google Cloud Storage account with billing enabled. Download and configure gsutil for your account.
  • Create a bucket for your data in Google Cloud Storage. You can use the online browser to do this. Note: There’s an unresolved bug that causes backups to buckets with underscores in their names to fail.
  • Use gsutil to set the ACL and default ACL for that bucket to include your app’s service account email address with WRITE and FULL_CONTROL respectively.

Steps to export data

  • Navigate to the datastore admin tab in the App Engine console for your app. Click the checkbox next to the Entity Kinds you want to export, and push the Backup button.
  • Select your ‘backups’ queue (optional) and Google Cloud Storage as the destination. Enter the bucket name as /gs/your_bucket_name/your_path.
  • A map reduce job with 256 shards will be run to copy your data. It should be quite fast (see below).

Steps to download data

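  • On the machine where you want the data, run the following command. Optionally, you can include the -m flag before cp to enable multi-threaded downloads. Note that gsutil addresses buckets with gs:// URLs rather than the /gs/ prefix used in the Datastore Admin form.
gsutil cp -R gs://your_bucket_name/your_path /local_target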
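
If you want to script the download step, here is a minimal Python sketch that assembles the same gsutil command. The function names and the example bucket, path, and target values are placeholders I made up for illustration; the only gsutil features it relies on are the cp command, the -R flag, and the top-level -m flag mentioned above.

```python
import subprocess


def build_gsutil_cp_command(bucket, path, local_target, parallel=True):
    """Return the argument list for a recursive gsutil download.

    bucket, path and local_target are placeholders supplied by the caller.
    """
    command = ["gsutil"]
    if parallel:
        # -m is a top-level gsutil flag, so it must come before "cp".
        command.append("-m")
    # gsutil addresses Cloud Storage objects with gs:// URLs.
    source = "gs://{}/{}".format(bucket, path.strip("/"))
    command += ["cp", "-R", source, local_target]
    return command


def download_backup(bucket, path, local_target, parallel=True):
    """Run the download; raises CalledProcessError if gsutil fails."""
    subprocess.check_call(
        build_gsutil_cp_command(bucket, path, local_target, parallel)
    )
```

Calling download_backup('your-bucket', 'your_path', '/local_target') assumes gsutil is installed, on your PATH, and already configured for your account.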

 

Reading Your Data

Unfortunately, even though you now have an efficient way to export data, this approach doesn’t include a built-in way to convert your data to common formats like CSV or JSON. If you stop here, you’re basically stuck using this data only to back up and restore App Engine. While that is useful, we have many other use cases for exported data at Pulse.

So how do we read the data? It turns out there’s an undocumented but relatively simple way of converting Google’s LevelDB-formatted backup files into simple Python dictionaries matching the structure of your original datastore entities. Here’s a Python snippet to get you started.

# Make sure the App Engine SDK is on the import path
#import sys
#sys.path.append('/usr/local/google_appengine')
from google.appengine.api.files import records
from google.appengine.datastore import entity_pb
from google.appengine.api import datastore

# Backup files are binary, so open them in binary mode
raw = open('path_to_a_datastore_output_file', 'rb')
reader = records.RecordsReader(raw)
for record in reader:
    # Each record is a serialized entity protocol buffer
    entity_proto = entity_pb.EntityProto(contents=record)
    entity = datastore.Entity.FromPb(entity_proto)
    # entity is now available as a dictionary!
raw.close()

Note: If you use this approach to read all files in an output directory, you may get a ProtocolBufferDecodeError exception for the first record. It should be safe to ignore that error and continue reading the rest of the records.
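
To go from decoded entities to JSON, something along these lines should work. This is only a sketch: write_json_lines and to_jsonable are names I made up, the decode argument stands in for the EntityProto / Entity.FromPb conversion described above, and converting unknown property types with str() is a simplification you will probably want to refine for keys, blobs, and similar values.

```python
import json
from datetime import date, datetime


def to_jsonable(value):
    """Fallback for property values the json module cannot serialize."""
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    # Assumption: a string form is good enough for any other type.
    return str(value)


def write_json_lines(records, decode, out, skip_errors=(Exception,)):
    """Decode each raw record and write it to out as one JSON object per line.

    records     -- iterable of raw records (e.g. a records reader)
    decode      -- callable turning one raw record into a dict-like entity
    out         -- writable text file object
    skip_errors -- exception types to skip instead of aborting the run
    Returns a (written, skipped) tuple of counts.
    """
    written = 0
    skipped = 0
    for record in records:
        try:
            entity = decode(record)
        except skip_errors:
            # e.g. the undecodable first record mentioned in the note above
            skipped += 1
            continue
        line = json.dumps(dict(entity), default=to_jsonable, sort_keys=True)
        out.write(line)
        out.write("\n")
        written += 1
    return written, skipped
```

With the SDK available, decode can wrap the two conversion calls from the snippet above, and skip_errors can be narrowed to the SDK’s ProtocolBufferDecodeError so that other failures still surface.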

Performance Comparison

Remote API Bulk Loader

  • 10GB / 10 hours ~ 291KB/s
  • 100GB – never finishes!

Backup to Google Cloud Storage + Download with gsutil

  • 10GB / 10 mins + 10 mins ~ 8.5MB/s
  • 100GB / 35 mins + 100 mins ~ 12.6MB/s
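
For anyone who wants to check the arithmetic, the rates above come from dividing the data size by the total elapsed time (backup plus download for the Cloud Storage approach). The short sketch below assumes 1 GB = 1024 MB and 1 MB = 1024 KB; the function name is just for illustration.

```python
def throughput_mb_per_s(gigabytes, minutes):
    """Average throughput in MB/s, assuming 1 GB = 1024 MB."""
    return gigabytes * 1024.0 / (minutes * 60.0)


# Remote API Bulk Loader: 10GB in 10 hours, expressed in KB/s
bulk_loader_10gb_kb = throughput_mb_per_s(10, 10 * 60) * 1024

# Backup to Cloud Storage plus gsutil download: total elapsed minutes
gcs_10gb_mb = throughput_mb_per_s(10, 10 + 10)
gcs_100gb_mb = throughput_mb_per_s(100, 35 + 100)

print("Bulk Loader, 10GB: %.0f KB/s" % bulk_loader_10gb_kb)
print("Cloud Storage, 10GB: %.1f MB/s" % gcs_10gb_mb)
print("Cloud Storage, 100GB: %.1f MB/s" % gcs_100gb_mb)
```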

App Engine Wish List – Updates From Google IO 2012

Posted June 28th, 2012 in Development by Greg Bayer

We’ve been using Google App Engine at Pulse since 2010, back when we had only one backend engineer. In that time, App Engine has served us very well. There are many things Google App Engine does very well; the most obvious advantage is saving us lots of Ops work and letting us stay focused on our application. Over the last two years, it has grown with us both in terms of scale (from 200k users, to 15M+) and in terms of features.

As I’m writing this post (from Google I/O 2012), I’m happy to report that App Engine continues to grow with us. This year, Google’s App Engine team has announced that they are fixing our number one wish list item! They have also started addressing several other important concerns. For some context, here is Pulse’s App Engine wish list as of about a month ago.

  1. SSL support for custom domains
  2. Faster bulk import & export of datastore data
  3. Faster datastore snapshotting
  4. Tunable memcache eviction policies & capacity
  5. Improved support for searching / browsing / downloading high volume application logs
  6. Faster (diff-based) deployment for large applications
  7. Support for naked domains (without www. in front)
  8. Unlimited developer accounts per application

Barb Darrow from GigaOm published part of this list earlier this week (before I/O started). Check out the article “Google App Engine: What developers want at Google I/O” to see more common wish list items from other developers.

As of yesterday (with the release of SDK version 1.7.0), SSL for custom domains is officially supported, either via SNI for $9/month or via a custom IP for $99/month. This means that you can now host a domain like www.pulse.me on App Engine and support https throughout your site. Previously it had only been possible to use http with your domain, and any secure transactions had to be routed to the less appealing xxxxx.appspot.com domain. This meant you had to break the user’s flow or use some complicated hacks to hide the domain switching. Now it is finally possible to present a seamless, secure experience without ever leaving your custom domain.

There were many other great features released with 1.7.0 (see the link above). As for the rest of our wish list, here’s how it stands now!

  1. SSL support for custom domains
    – Supported now!
  2. Faster bulk import & export of datastore data
    – Update 2: App Engine Datastore: How to Efficiently Export Your Data
  3. Faster datastore snapshotting
    – Update 3: The internal settings for map reduce-based snapshotting have been increased to use 256 shards. It’s actually pretty fast now! Still hoping for incremental backups in the future.
  4. Tunable memcache eviction policies & capacity
    – I hear that we will soon be able to segment applications and control capacity. Eviction policy controls are likely to take longer.
  5. Improved support for searching / browsing / downloading high volume application logs
    – It was announced that this is coming very soon!!
  6. Faster (diff-based) deployment for large applications
    – Update 4: This is supported and working for us now!
  7. Support for naked domains (without www. in front)
    – Pending. No ETA.
  8. Unlimited developer accounts per application
    – This is now supported for premier accounts!

Let me know in the comments if you have any questions about these or want to share some of your wish list items. I’m always happy to discuss App Engine issues with other developers.

Update: Just now, at the second Google I/O keynote, Urs Hölzle announced Google’s push into the IaaS space with Google Compute Engine. It should be interesting to see if this offers serious competition to Amazon’s EC2 for future Pulse systems and features. Seeing 771,886 cores available to the demo Genome app was pretty impressive! I’ll post here and/or at eng.pulse.me when we get a chance to try it out!

Scaling Pulse to 11M Users

Posted February 16th, 2012 in Pulse by Greg Bayer

As part of Pulse’s recent announcement of crossing the 11M user mark (up 10x since last year!), we’ve written a set of blog posts to share how we’ve scaled our backend infrastructure to keep up with our new users and support some powerful new features. Here’s a quick recap of our systems on both Amazon Web Services (AWS) and Google App Engine (GAE), along with links to the detailed posts describing each.

Continue Reading »

Scaling with the Kindle Fire

Posted December 1st, 2011 in Pulse by Greg Bayer

Earlier this week I wrote a guest post for the Google App Engine Blog on how Pulse has scaled up our backend infrastructure to prepare for the recent Kindle Fire launch.

The Kindle Fire includes Pulse as one of the only preloaded apps and is projected to sell over five million units this quarter alone. This meant we had to prepare for nearly doubling our user base in a very short period. We also needed to be ready for spikes in load due to press events and the holiday season.

Continue Reading »

Pulse Wins Apple Design Award and Raises $9 Million Series A

Posted June 16th, 2011 in Pulse by Greg Bayer

I’m very excited to share that Pulse has announced its Series A funding round! All of us are still fired up about last week’s Apple Design Award at WWDC and our recent 4 million user milestone, not to mention that today is our co-founder Ankit’s birthday. Thanks to the team for their tireless work and to everyone who has helped us get here!

Check out some of today’s press:

Pulse Blog – Announcing Our Series A Financing
TechCrunch – 4 Million Users Strong And Apple Design Award In Hand, Pulse Grabs $9 Million Series A
WSJ – Pulse Taps $9M To Win Battle For Mobile-News Consumers
Forbes – News Reader Pulse Raises $9 Million
Mashable – Pulse Passes 4 Million Users, Raises $9 Million for Visual News Reader

Working Hard With No Regrets

Posted June 2nd, 2011 in Observations by Greg Bayer

Working for a startup usually means putting in more hours than others. Recently, I spent two days on less than 3 hours of sleep in order to push out our new Pulse.me release. This doesn’t seem strange to me and didn’t make me unhappy. In fact, it was one of the most exciting and fun things I’ve done in a while. However, after mentioning it to some friends, I realized not everyone understands why it can be good to spend so much time “working” to build something you believe in.

Upon hearing about my sleep-deprived state, my friend sent me a link to the top 5 regrets people have on their deathbed along with the comment “you might need this.” I appreciated the link and enjoyed the reminder to live life to the fullest, especially with regard to keeping in touch with friends and loved ones. I also realized that my friend didn’t understand that, for me, the long hours I put in are all about fulfilling my dreams of creating new technology and impacting the world in a positive way. According to the article, not chasing after dreams is people’s #1 regret.

Continue Reading »

Pulse News is Hiring!

Posted December 10th, 2010 in Pulse by Greg Bayer

A few months ago I mentioned that I left the government/research world (Sandia Labs) and joined an exciting new startup. I’d like to share a bit more about my experience so far and announce that we are hiring!

Those who have worked at a large company and then moved to a startup can probably relate to my experience. First, without a doubt, the most motivating and fun part about working at Pulse is seeing the impact of my work. And I don’t mean just having someone say “Good Job” or receiving a strong performance review; I mean seeing thousands of people USE the results of your work and submit feedback about how it benefitted their lives. At Pulse, this experience is magnified by the fact that we release new product features every two weeks, not every quarter or every year!

Continue Reading »

Recently Joined Pulse!

Posted October 10th, 2010 in Pulse by Greg Bayer

After a year and a half of big data research for the government and quite a bit of fun with Hadoop, I’ve decided to join some good friends at an early-stage startup called Alphonso Labs.

Pulse is currently the #1 news reader on the iPad, iPhone, and Android app stores. I’ll be leading the development of our backend data platform and working with a great team.

As we start to build out Pulse’s backend, I’ll be continuing to experiment with Google App Engine. Stay tuned for more posts in that regard.

Pulse on the iPad

Continue Reading »