
App Engine Datastore: How to Efficiently Export Your Data

Posted November 8th, 2012 in Big Data, Development by Greg Bayer

While Google App Engine has many strengths, as with all platforms, there are some challenges to be aware of. Over the last two years, one of our biggest challenges at Pulse has been how difficult it can be to export large amounts of data for migration, backup, and integration with other systems. While there are several options and tools, so far none have been feasible for large datasets (10GB+).

Since we have many TBs of data in Datastore, we’ve been actively looking for a solution to this for some time. I’m excited to share a very effective approach based on Google Cloud Storage and Datastore Backups, along with a method for converting the data to other formats!

Existing Options For Data Export

These options have been around for some time. They are often promoted as making it easy to access datastore data, but the reality can be very different when dealing with big data.

  1. Using the Remote API Bulk Loader. Although convenient, this official tool only works well for smaller datasets. Large datasets can easily take 24 hours to download and often fail without explanation. This tool has pretty much remained the same (without any further development) since App Engine’s early days. All official Google instructions point to this approach.
  2. Writing a map reduce job to push the data to another server. This approach can be painfully manual and often requires significant infrastructure elsewhere (e.g. on AWS).
  3. Using the Remote API directly, or writing a handler that accesses datastore entities one query at a time, and running a parallelizable script or map reduce job to pull the data to where you need it. Unfortunately, this has the same issues as #2.

A New Approach – Export Data via Google Cloud Storage

The recent introduction of Google Cloud Storage has finally made exporting large datasets out of Google App Engine’s datastore possible and fairly easy. The setup steps are annoying, but thankfully it’s mostly a one-time cost. Here’s how it works.

One-time setup

  • Create a new task queue in your App Engine app called ‘backups’ with the maximum 500/s rate limit (optional).
  • Sign up for a Google Cloud Storage account with billing enabled. Download and configure gsutil for your account.
  • Create a bucket for your data in Google Cloud Storage. You can use the online browser to do this. Note: There’s an unresolved bug that causes backups to buckets with underscores to fail.
  • Use gsutil to set the acl and default acl for that bucket to include your app’s service account email address with WRITE and FULL_CONTROL respectively (see the example commands below).
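
For reference, with a recent version of gsutil the ACL step looks something like the following. Treat this as a sketch rather than the exact incantation: the bucket name is a placeholder, the service account address is assumed to be the default your-app-id@appspot.gserviceaccount.com, and older gsutil releases used separate setacl/setdefacl commands instead. WRITER corresponds to the WRITE permission on the bucket, and OWNER to FULL_CONTROL on newly created objects.

gsutil acl ch -u your-app-id@appspot.gserviceaccount.com:WRITER gs://your_bucket_name
gsutil defacl ch -u your-app-id@appspot.gserviceaccount.com:OWNER gs://your_bucket_name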

Steps to export data

  • Navigate to the datastore admin tab in the App Engine console for your app. Click the checkbox next to the Entity Kinds you want to export, and push the Backup button.
  • Select your ‘backups’ queue (optional) and Google Cloud Storage as the destination. Enter the bucket name as /gs/your_bucket_name/your_path.
  • A map reduce job with 256 shards will be run to copy your data. It should be quite fast (see below).

Steps to download data

  • On the machine where you want the data, run the following command. Optionally you can include the -m flag before cp to enable multi-threaded downloads.
gsutil cp -R /gs/your_bucket_name/your_path /local_target
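
With multi-threaded downloads enabled, the same command becomes:

gsutil -m cp -R /gs/your_bucket_name/your_path /local_target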

 

Reading Your Data

Unfortunately, even though you now have an efficient way to export data, this approach doesn’t include a built-in way to convert your data to common formats like CSV or JSON. If you stop here, you’re basically stuck using this data only to back up and restore App Engine. While that is useful, we have many other use cases for exported data at Pulse. So how do we read the data? It turns out there’s an undocumented but relatively simple way of converting Google’s leveldb-formatted backup files into plain Python dictionaries matching the structure of your original datastore entities. Here’s a Python snippet to get you started.

# Make sure the App Engine SDK is available on your Python path, e.g.:
#import sys
#sys.path.append('/usr/local/google_appengine')
from google.appengine.api.files import records
from google.appengine.datastore import entity_pb
from google.appengine.api import datastore

# Open the backup output file in binary mode (important on Windows).
raw = open('path_to_a_datastore_output_file', 'rb')
reader = records.RecordsReader(raw)
for record in reader:
    entity_proto = entity_pb.EntityProto(contents=record)
    entity = datastore.Entity.FromPb(entity_proto)
    # entity is available as a dictionary!

Note: If you use this approach to read all of the files in an output directory, you may get a ProtocolBufferDecodeError exception for the first record (per the comments below, this comes from the .backup_info metadata files rather than the entity data itself). It should be safe to catch that error and continue reading the rest of the records, as in the sketch below.
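
Putting the pieces together, here is a minimal sketch that walks a downloaded backup directory, skips the .backup_info metadata files, tolerates the decode error described above, and yields each entity as a dictionary. The helper name iter_entities and the directory paths are placeholders (not part of any official API), so adapt them to your layout.

import os

# Assumes the App Engine SDK is on sys.path, as in the snippet above.
from google.appengine.api.files import records
from google.appengine.datastore import entity_pb
from google.appengine.api import datastore
from google.net.proto import ProtocolBuffer

def iter_entities(backup_dir):
    # Walk every '-output-N' data file produced by the backup.
    for name in sorted(os.listdir(backup_dir)):
        if name.endswith('.backup_info'):
            continue  # metadata files, not entity records
        with open(os.path.join(backup_dir, name), 'rb') as raw:
            for record in records.RecordsReader(raw):
                try:
                    entity_proto = entity_pb.EntityProto(contents=record)
                    yield datastore.Entity.FromPb(entity_proto)
                except ProtocolBuffer.ProtocolBufferDecodeError:
                    continue  # safe to skip and keep reading (see note above)

for entity in iter_entities('/local_target/your_path'):
    print dict(entity)  # or convert each entity to a CSV/JSON row here

From here, writing the dictionaries out as CSV or JSON is mostly a standard Python exercise (e.g. csv.DictWriter or json.dumps, with custom handling for property types like datetime or Key).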

Performance Comparison

Remote API Bulk Loader

  • 10GB / 10 hours ~ 291KB/s
  • 100GB – never finishes!

Backup to Google Cloud Storage + Download with gsutil

  • 10GB / 10 mins + 10 mins ~ 8.5MB/s
  • 100GB / 35 mins + 100 mins ~ 12.6MB/s
                                

36 Responses so far.

  1. Nice export tips, thanks!
    I have a problem with reading packed backup data.
    What is ‘path_to_datastore_export_file’? After export appengine creates 2 info files like ‘agZidXp6d3JyQQsSHF9BRV9EYXRhc3RvcmVBZG1pbl9PcGVyYXRpb24Y8a6dIAwLEhZfQUVfQmFja3VwX0luZm9ybWF0aW9uGAEM.backup_info’ and N ‘-output-X’ files.
    RecordsReader says ‘google.net.proto.ProtocolBuffer.ProtocolBufferDecodeError: truncated’ if I pass ‘info’ file. For other files it says nothing)

    btw. RecordReader -> RecordsReader

  2. Greg Bayer says:

    You’ll want to read each one of the N ‘-output-X’ files with this code block. So to read all of them, just loop through all of the output files in the directory and replace ‘path_to_datastore_export_file’ with each output file you find. The info files can be ignored.

    Thanks for pointing out the typo! I’ll fix that and try to make the other part a bit clearer.

  3. That’s strange – if I pass ‘-output-X’ file to the reader – it doesn’t go inside the loop… without any error. It seems RecordsReader doesn’t recognize input stream as correct

  4. Greg Bayer says:

    Sorry for not explaining more clearly. We did get a ProtocolBufferDecodeError exception for the first record; however, after catching that error, we were able to continue reading the rest of the records in each file without any trouble. Let me know if you find a cleaner way to do it.

  5. Robert Jerovsek says:

    Dmitry might be using Windows and might thus need to use ‘rb’ as the mode when opening the file. Apart from that, I used it like: with open(sys.argv[1], ‘rb’) as raw:… This worked for me.

  6. Joseph says:

    I need to export about 14 GB of data, but I am wary of the costs. Is it possible for you to give an estimate of the cost of exporting 10 GB of data?

  7. Greg Bayer says:

    The cost will depend more on the number of records. You should be able to estimate it based on the cost to read all of your records from the Datastore.

  8. Siddhant says:

    Can you help with this? The data formatting is quite out of place:

    SQLite format 3
    CREATE TABLE bulkloader_database_signature (value TEXT not null)
    CREATE TABLE result (id BLOB primary key, value BLOB not null, sort_key BLOB)
    [the rest of the paste is binary bulkloader output for entities of the kind ‘Admissions2013leadsdata’ and is unreadable as text]

  10. chrisklaiber says:

    Thanks for sharing such excellent information, Greg, it’s a great approach to data exporting for offline analytics.

    The ProtocolBufferDecodeError is thrown when reading the top-level .backup_info file because the first record in the file is not an entity in protobuf format, it’s a version string, i.e., “1”.

    Check out class BackupInfoWriter in the Python SDK source https://code.google.com/p/googleappengine/source/browse/trunk/python/google/appengine/ext/datastore_admin/backup_handler.py?r=361 to see how the top-level .backup_info file and the per-kind .backup_info files are written.

    Greg’s method works for the top-level .backup_info and the data files, but the per-kind .backup_info files are just a single protobuf-encoded entity and not wrapped in the leveldb format that RecordsReader reads. To read them:

    backup = google.appengine.ext.datastore_admin.backup_pb2.Backup()
    backup.ParseFromString(file_contents)

    So far, though, I haven’t needed these. All of the interesting information I’ve wanted is in the top-level .backup_info and some is duplicated in the per-kind .backup_info. YMMV.

  11. Greg Bayer says:

    I’m glad you found the article useful, and thanks for the tip! We haven’t needed to dig much deeper yet, but if we do I’ll try out what you suggested.

  12. Rick Burgess says:

    I have written a Google App Engine app that uses the code in the article to process CSV files. I have never used Python before, so it might be a bit rough, but it should work:

    https://github.com/bbhlondon/app-engine-export

    Pull requests are greatly appreciated

  13. Luiz Gustavo Martins says:

    Hi, excellent article! I have a question about doing the opposite: inserting lots of data into App Engine using a file from Cloud Storage. Does anyone know if this is a good strategy? Today I’m using bulk insert from the command line on a daily basis and I don’t think it’s the best solution. Any ideas?

  14. Greg Bayer says:

    Unfortunately, inserting lots of data has always been difficult with GAE. Reversing this process could work, but you’ll have to make sure to divide your data carefully so you stay within the request time limits for each mapper.

    We try to avoid large data imports to GAE. When they are necessary, we usually use custom write handlers and POST to them in parallel from AWS.

  15. Luiz Gustavo Martins says:

    Thanks for the reply, I’m really glad to know that there’s a problem and it’s not only my fault!!! I guess I’ll try using Compute Engine to do the bulk insert!

  16. Christian says:

    Hi.
    I’m new to Python and I tried to get the code working.
    But the variable ‘reader’ is always empty.
    It never enters the for loop.

    For debugging I tried this:

    for record in reader:
        self.response.write("Found a record ")

  17. Greg Bayer says:

    Is your “path_to_a_datastore_output_file” correct?

  18. nolabob says:

    I’m a Java developer, so I’m not a Python guy. However, I’ve been able to get all of my data. I just have no idea what to make of your python snippet.

    Help? Thanks

  19. Christian says:

    Yes it is.
    I used the git-project created by Rick Burgess.
    So the code reads all files in the “data” directory and tries to read them.

  20. Christian says:

    Hi.
    I’m a Java developer too.
    I tried my luck with the Python script but had no success.
    Have you found a solution?

  21. kamelzcs says:

    How do I import the CSV file into SQL, such as PostgreSQL?
    I don’t know how to deal with the relations between different entities. Is the only choice to read everything into memory and then find the related elements by searching element by element? I don’t know an efficient way to translate a CSV file into PostgreSQL.

    Thanks for sharing!

  22. Jesse Luoto says:

    Thank you for this great article Greg!

    I extended your sample code to actually save the data to local Datastore:
    https://gist.github.com/jehna/3b258f5287fcc181aacf

  23. orcaman says:

    This is awesome stuff man. Thanks for sharing!

  24. Saagar says:

    I get an error in Step 3. Can you help ??

    [ERROR ] Exception during authentication

    Traceback (most recent call last):
      File "C:\data\installs\google_appengine_1.9.18\google_appengine\google\appengine\tools\bulkloader.py", line 3466, in Run
        self.request_manager.Authenticate()
      File "C:\data\installs\google_appengine_1.9.18\google_appengine\google\appengine\tools\bulkloader.py", line 1329, in Authenticate
        remote_api_stub.MaybeInvokeAuthentication()
      File "C:\data\installs\google_appengine_1.9.18\google_appengine\google\appengine\ext\remote_api\remote_api_stub.py", line 889, in MaybeInvokeAuthentication
        datastore_stub._server.Send(datastore_stub._path, payload=None)
      File "C:\data\installs\google_appengine_1.9.18\google_appengine\google\appengine\tools\appengine_rpc.py", line 426, in Send
        f = self.opener.open(req)
      File "C:\Python27\lib\urllib2.py", line 431, in open
        response = self._open(req, data)
      File "C:\Python27\lib\urllib2.py", line 449, in _open
        '_open', req)
      File "C:\Python27\lib\urllib2.py", line 409, in _call_chain
        result = func(*args)
      File "C:\Python27\lib\urllib2.py", line 1227, in http_open
        return self.do_open(httplib.HTTPConnection, req)
      File "C:\Python27\lib\urllib2.py", line 1197, in do_open
        raise URLError(err)
    URLError:

    [ERROR ] Authentication Failed: Incorrect credentials or unsupported authentication type (e.g. OpenId).

  25. Valentin says:

    Great. You saved my day.

  26. Rakz says:

    I am getting an error: int() argument should be a string or a number, not NoneType. I believe UTF-8 is not the proper encoding for the datastore file.

    snippet:
    load --encoding=UTF-8 --source_format=DATASTORE_BACKUP --allow_jagged_rows=false --write_disposition=WRITE_TRUNCATE sample_red.testenge_1 gs://test.appspot.com/bucketname/ahFzfnZpcmdpbi1yZWQtdGVzdHJBCxIcX0FFX0RhdGFzdG9yZUFkbWluX09wZXJhdGlvbhiBwLgCDAsSFl9BRV9CYWNrdXBfSW5mb3JtYXRpb24YAQw.nge.backup_info

  27. mirko77 says:

    I gave it a try but I only see a “Hallo World” page?

  28. Oscar Reynaldo Callisaya Limac says:

    Good article. A question: is there a way to download only a specific model?

  29. Greg Bayer says:

    Unless something has changed, the export step should include an option to choose which Entity Kinds you want.

  30. Samantha Atkins says:

    While it is a good article, the fact that it is this complex to get your own app data out of the datastore is why I will never use it again once I leave this gig where I am obligated to. As databases go, it is too primitive and too locked down against doing so many things I have done with so many other kinds of databases over the last couple of decades. I consider it the biggest fail of App Engine. It feeds App Engine apps and does pretty much nothing a database should do beyond that. It can’t even support some very common query patterns. BAH.

  31. enver63 says:

    Having used App Engine for more than 4 years, I am afraid I have to agree with this comment.
    Even Google’s own procedures to back up and restore data from the datastore fail consistently if your database is larger than ~1GB. Documentation is incomplete and inaccurate. Support from Google is non-existent, and bug reports stay unanswered for years.
    My biggest problem now is that it is hard to migrate away from App Engine/datastore, because it is virtually impossible to get your data out of it.

  32. Danny Bud says:

    On Windows you must read as rb and not r!!!
    Otherwise you will read partial records and not get results.

  33. Paul Ceccato says:

    I’m inclined to agree with you. I’m using simpleauth as my application’s authentication, and I can’t even open the datastore admin console because I get an authentication error. I don’t have a huge amount of data, but I will have to write my own custom backup code to deal with this basic function!

  34. shixing19910105 says:

    nice solution

  35. Eben says:

    No one’s commented in 2 years or so, but I still found the instructions *very* helpful. I’ve got maybe 150GB in the App Engine datastore, and this allowed me to get it out in a reasonable amount of time. The backup still took a few days, and the download maybe an hour, but I got it and am able to convert it into other formats. Thank you!

  36. Greg Bayer says:

    Glad you found it helpful!
