
App Engine Datastore: How to Efficiently Export Your Data

Posted November 8th, 2012 in Big Data, Development by Greg Bayer

While Google App Engine has many strengths, as with all platforms, there are some challenges to be aware of. Over the last two years, one of our biggest challenges at Pulse has been how difficult it can be to export large amounts of data for migration, backup, and integration with other systems. While there are several options and tools, so far none have been feasible for large datasets (10GB+).

Since we have many TBs of data in Datastore, we’ve been actively looking for a solution to this for some time. I’m excited to share a very effective approach based on Google Cloud Storage and Datastore Backups, along with a method for converting the data to other formats!

Existing Options For Data Export

These options have been around for some time. They are often promoted as making it easy to access datastore data, but the reality can be very different when dealing with big data.

  1. Using the Remote API Bulk Loader (a sample invocation appears after this list). Although convenient, this official tool only works well for smaller datasets. Large datasets can easily take 24 hours to download and often fail without explanation. The tool has remained essentially unchanged (with no further development) since App Engine’s early days, yet all official Google instructions point to this approach.
  2. Writing a map reduce job to push the data to another server. This approach can be painfully manual and often requires significant infrastructure elsewhere (e.g. on AWS).
  3. Using the Remote API directly, or writing a handler to access datastore entities one query at a time, you can run a parallelizable script or map reduce job to pull the data to where you need it. Unfortunately, this has the same issues as #2.
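
For reference, option #1 is normally driven by appcfg.py from the App Engine SDK. A minimal sketch of the invocation (the app id and filename here are placeholders, and exact flags vary by SDK version):

appcfg.py download_data --application=your-app-id --url=http://your-app-id.appspot.com/_ah/remote_api --filename=datastore_dump.dat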

A New Approach – Export Data via Google Cloud Storage

The recent introduction of Google Cloud Storage has finally made exporting large datasets out of Google App Engine’s datastore possible and fairly easy. The setup steps are annoying, but thankfully it’s mostly a one-time cost. Here’s how it works.

One-time setup

  • Create a new task queue in your App Engine app called ‘backups’ with the maximum 500/s rate limit (optional).
  • Sign up for a Google Cloud Storage account with billing enabled. Download and configure gsutil for your account.
  • Create a bucket for your data in Google Cloud Storage. You can use the online browser to do this. Note: There’s an unresolved bug that causes backups to buckets with underscores to fail.
  • Use gsutil to set the acl and default acl for that bucket to include your app’s service account email address with WRITE and FULL_CONTROL respectively (example commands below).
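
As a sketch of these one-time setup pieces: the ‘backups’ queue is an ordinary entry in queue.yaml, and the ACL changes can be made with gsutil. Exact gsutil flags vary by version (older releases used setacl/chacl instead), and your-app-id@appspot.gserviceaccount.com below stands in for your app’s actual service account address:

queue:
- name: backups
  rate: 500/s

gsutil acl ch -u your-app-id@appspot.gserviceaccount.com:WRITE gs://your_bucket_name
gsutil defacl ch -u your-app-id@appspot.gserviceaccount.com:OWNER gs://your_bucket_name

(OWNER in gsutil corresponds to FULL_CONTROL on newly created objects.)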

Steps to export data

  • Navigate to the datastore admin tab in the App Engine console for your app. Click the checkbox next to the Entity Kinds you want to export, and push the Backup button.
  • Select your ‘backups’ queue (optional) and Google Cloud Storage as the destination. Enter the bucket name as /gs/your_bucket_name/your_path.
  • A map reduce job with 256 shards will be run to copy your data. It should be quite fast (see below).

Steps to download data

  • On the machine where you want the data, run the following command. Optionally, you can include the -m flag before cp to enable multi-threaded downloads.
gsutil cp -R gs://your_bucket_name/your_path /local_target

Reading Your Data

Unfortunately, even though you now have an efficient way to export data, this approach doesn’t include a built-in way to convert your data to common formats like CSV or JSON. If you stop here, you’re basically stuck using this data only to back up and restore App Engine. While that is useful, there are many other use-cases we have for exporting data at Pulse. So how do we read the data? It turns out there’s an undocumented, but relatively simple way of converting Google’s LevelDB-formatted backup files into simple Python dictionaries matching the structure of your original datastore entities. Here’s a Python snippet to get you started.

# Make sure the App Engine SDK is available on your Python path
#import sys
#sys.path.append('/usr/local/google_appengine')
from google.appengine.api.files import records
from google.appengine.datastore import entity_pb
from google.appengine.api import datastore

# Open one of the backup '-output-X' files in binary mode
raw = open('path_to_a_datastore_output_file', 'rb')
reader = records.RecordsReader(raw)
for record in reader:
    # Each record is a serialized EntityProto
    entity_proto = entity_pb.EntityProto(contents=record)
    entity = datastore.Entity.FromPb(entity_proto)
    # The entity is now available as a dictionary!

Note: If you use this approach to read all files in an output directory, you may get a ProtocolBufferDecodeError exception for the first record. It should be safe to ignore that error and continue reading the rest of the records.
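
To process a whole download directory in one pass, here is a minimal sketch built on the snippet above (the helper function is just an illustration, not part of the SDK). It skips the .backup_info metadata files by name and ignores any record that fails to decode, which covers the error mentioned in the note.

# Sketch: read every backup output file in a local directory,
# skipping metadata files and undecodable records.
import os

from google.appengine.api.files import records
from google.appengine.api import datastore
from google.appengine.datastore import entity_pb
from google.net.proto import ProtocolBuffer

def iter_backup_entities(backup_dir):
    """Yield datastore.Entity objects from every output file in backup_dir."""
    for name in sorted(os.listdir(backup_dir)):
        if '.backup_info' in name:
            continue  # metadata files can be ignored
        with open(os.path.join(backup_dir, name), 'rb') as raw:
            for record in records.RecordsReader(raw):
                try:
                    entity_proto = entity_pb.EntityProto(contents=record)
                except ProtocolBuffer.ProtocolBufferDecodeError:
                    continue  # e.g. the stray first record mentioned above
                yield datastore.Entity.FromPb(entity_proto)

# Example usage: count the exported entities.
# print sum(1 for _ in iter_backup_entities('/local_target/your_path'))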

Performance Comparison

Remote API Bulk Loader

  • 10GB / 10 hours ~ 291KB/s
  • 100GB – never finishes!

Backup to Google Cloud Storage + Download with gsutil

  • 10GB: 10 min backup + 10 min download ~ 8.5MB/s
  • 100GB: 35 min backup + 100 min download ~ 12.6MB/s

Comments

  • http://twitter.com/villaindm Dmitry Lukashev

    Nice export tips, thanks!
    I have a problem with reading packed backup data.
    What is ‘path_to_datastore_export_file’? After export appengine creates 2 info files like ‘agZidXp6d3JyQQsSHF9BRV9EYXRhc3RvcmVBZG1pbl9PcGVyYXRpb24Y8a6dIAwLEhZfQUVfQmFja3VwX0luZm9ybWF0aW9uGAEM.backup_info’ and N ‘-output-X’ files.
    RecordsReader says ‘google.net.proto.ProtocolBuffer.ProtocolBufferDecodeError: truncated’ if I pass ‘info’ file. For other files it says nothing)

    btw. RecordReader -> RecordsReader

  • http://GBayer.com/ Greg Bayer

    You’ll want to read each one of the N ‘-output-X’ files with this code block. So to read all of them, just loop through all of the output files in the directory and replace ‘path_to_datastore_export_file’ with each output file you find. The info files can be ignored.

    Thanks for pointing out the typo! I’ll fix that and try to make the other part a bit clearer.

  • http://twitter.com/villaindm Dmitry Lukashev

    That’s strange – if I pass an ‘-output-X’ file to the reader, it doesn’t go inside the loop… without any error. It seems RecordsReader doesn’t recognize the input stream as correct

  • http://GBayer.com/ Greg Bayer

    Sorry for not explaining more clearly. We did get a ProtocolBufferDecodeError exception for the first record; however, after catching that error, we were able to continue reading the rest of the records in each file without any trouble. Let me know if you find a cleaner way to do it.

  • Robert Jerovsek

    Dmitry might be using Windows and might thus need to use 'rb' as the mode when opening the file. Apart from that, I used it like: with open(sys.argv[1], 'rb') as raw: … This worked for me.

  • Joseph

    I need to export about 14 GB of data, but I am wary of the costs. Is it possible for you to give an estimate of the cost of exporting 10 GB of data?

  • http://GBayer.com/ Greg Bayer

    The cost will depend more on the number of records. You should be able to estimate it based on the cost to read all of your records from the Datastore.

  • Siddhant

    Can you help with this? The data formatting is quite out of place:

    SQLite format 3
    CREATE TABLE bulkloader_database_signature (value TEXT not null)
    CREATE TABLE result (id BLOB primary key, value BLOB not null, sort_key BLOB)
    [the remainder of the paste is binary-encoded entity data from a bulk loader output file]

  • chrisklaiber

    Thanks for sharing such excellent information, Greg, it’s a great approach to data exporting for offline analytics.

    The ProtocolBufferDecodeError is thrown when reading the top-level .backup_info file because the first record in the file is not an entity in protobuf format, it’s a version string, i.e., “1”.

    Check out class BackupInfoWriter in the Python SDK source https://code.google.com/p/googleappengine/source/browse/trunk/python/google/appengine/ext/datastore_admin/backup_handler.py?r=361 to see how the top-level .backup_info file and the per-kind .backup_info files are written.

    Greg’s method works for the top-level .backup_info and the data files, but the per-kind .backup_info files are just a single protobuf-encoded entity and not wrapped in the leveldb format that RecordsReader reads. To read them:

    backup = google.appengine.ext.datastore_admin.backup_pb2.Backup()
    backup.ParseFromString(file_contents)

    So far, though, I haven’t needed these. All of the interesting information I’ve wanted is in the top-level .backup_info and some is duplicated in the per-kind .backup_info. YMMV.

  • http://GBayer.com/ Greg Bayer

    I’m glad you found the article useful, and thanks for the tip! We haven’t needed to dig much deeper yet, but if we do I’ll try out what you suggested.

  • Rick Burgess

    I have written a Google App Engine app that uses the code in the article to process CSV files. I have never used Python before, so it might be a bit rough, but it should work:

    https://github.com/bbhlondon/app-engine-export

    Pull requests are greatly appreciated

  • Luiz Gustavo Martins

    Hi, excellent article! I have a question about doing the opposite: inserting lots of data into App Engine using a file from Cloud Storage. Does anyone know if this is a good strategy? Today I'm using bulk insert from the command line on a daily basis and I don't think it's the best solution. Any ideas?

  • http://GBayer.com/ Greg Bayer

    Unfortunately, inserting lots of data has always been difficult with GAE. Reversing this process could work, but you’ll have to make sure to divide your data carefully so you stay within the request time limits for each mapper.

    We try to avoid large data imports to GAE. When they are necessary, we usually use custom write handlers and POST to them in parallel from AWS.

  • Luiz Gustavo Martins

    Thanks for the reply, I'm really glad to know that there's a problem and it's not only my fault!!! I guess I'll try using Compute Engine to do the bulk insert!

  • Christian

    Hi.
    I’m new to Python and I tried to get the code working.
    But the variable ‘reader’ is always empty.
    It never enters the for-loop.

    For debugging I tried this:

    for record in reader:
        self.response.write("Found a record ")

  • http://GBayer.com/ Greg Bayer

    Is your “path_to_a_datastore_output_file” correct?

  • nolabob

    I’m a Java developer, so I’m not a Python guy. However, I’ve been able to get all of my data. I just have no idea what to make of your python snippet.

    Help? Thanks

  • Christian

    Yes, it is.
    I used the git project created by Rick Burgess.
    So the code reads all the files in the “data” directory and tries to parse each one.

  • Christian

    Hi,
    I’m a Java developer too.
    I tried my luck with the Python script but had no success.
    Have you found a solution?

  • kamelzcs

    How do I import the CSV file into SQL, such as PostgreSQL?
    I don’t know how to deal with the relations between different entities. Is the only choice to read everything into memory and then find the related elements by searching element by element? I don’t know an efficient way to translate a CSV file into PostgreSQL.

    Thanks for sharing!

  • http://createtrips.com Jesse Luoto

    Thank you for this great article Greg!

    I extended your sample code to actually save the data to local Datastore:
    https://gist.github.com/jehna/3b258f5287fcc181aacf

  • orcaman

    This is awesome stuff man. Thanks for sharing!