
App Engine Datastore: How to Efficiently Export Your Data

Posted November 8th, 2012 in Big Data, Development by Greg Bayer

While Google App Engine has many strengths, as with all platforms, there are some challenges to be aware of. Over the last two years, one of our biggest challenges at Pulse has been how difficult it can be to export large amounts of data for migration, backup, and integration with other systems. While there are several options and tools, so far none have been feasible for large datasets (10GB+).

Since we have many TBs of data in Datastore, we’ve been actively looking for a solution to this for some time. I’m excited to share a very effective approach based on Google Cloud Storage and Datastore Backups, along with a method for converting the data to other formats!

Existing Options For Data Export

These options have been around for some time. They are often promoted as making it easy to access datastore data, but the reality can be very different when dealing with big data.

  1. Using the Remote API Bulk Loader (a sample invocation appears after this list). Although convenient, this official tool only works well for smaller datasets. Large datasets can easily take 24 hours to download and often fail without explanation. The tool has remained essentially unchanged (with no further development) since App Engine’s early days, yet all official Google instructions point to this approach.
  2. Writing a map reduce job to push the data to another server. This approach can be painfully manual and often requires significant infrastructure elsewhere (e.g. on AWS).
  3. Using the Remote API directly, or writing a handler to access datastore entities one query at a time, you can run a parallelizable script or map reduce job to pull the data to where you need it. Unfortunately, this has the same issues as #2.
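
For reference, option #1 is normally driven by appcfg.py from the App Engine SDK. A minimal sketch of the invocation (the app id and filename here are placeholders, and exact flags vary by SDK version):

appcfg.py download_data --application=your-app-id --url=http://your-app-id.appspot.com/_ah/remote_api --filename=datastore_dump.dat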

A New Approach – Export Data via Google Cloud Storage

The recent introduction of Google Cloud Storage has finally made exporting large datasets out of Google App Engine’s datastore possible and fairly easy. The setup steps are annoying, but thankfully it’s mostly a one-time cost. Here’s how it works.

One-time setup

  • Create a new task queue in your App Engine app called ‘backups’ with the maximum 500/s rate limit (optional).
  • Sign up for a Google Cloud Storage account with billing enabled. Download and configure gsutil for your account.
  • Create a bucket for your data in Google Cloud Storage. You can use the online browser to do this. Note: There’s an unresolved bug that causes backups to buckets with underscores to fail.
  • Use gsutil to set the acl and default acl for that bucket to include your app’s service account email address with WRITE and FULL_CONTROL respectively (example commands below).
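
As a sketch of these one-time setup pieces: the ‘backups’ queue is an ordinary entry in queue.yaml, and the ACL changes can be made with gsutil. Exact gsutil flags vary by version (older releases used setacl/chacl instead), and your-app-id@appspot.gserviceaccount.com below stands in for your app’s actual service account address:

queue:
- name: backups
  rate: 500/s

gsutil acl ch -u your-app-id@appspot.gserviceaccount.com:WRITE gs://your_bucket_name
gsutil defacl ch -u your-app-id@appspot.gserviceaccount.com:OWNER gs://your_bucket_name

(OWNER in gsutil corresponds to FULL_CONTROL on newly created objects.)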

Steps to export data

  • Navigate to the datastore admin tab in the App Engine console for your app. Click the checkbox next to the Entity Kinds you want to export, and push the Backup button.
  • Select your ‘backups’ queue (optional) and Google Cloud Storage as the destination. Enter the bucket name as /gs/your_bucket_name/your_path.
  • A map reduce job with 256 shards will be run to copy your data. It should be quite fast (see below).

Steps to download data

  • On the machine where you want the data, run the following command. Optionally, you can include the -m flag before cp to enable multi-threaded downloads.
gsutil cp -R gs://your_bucket_name/your_path /local_target

Reading Your Data

Unfortunately, even though you now have an efficient way to export data, this approach doesn’t include a built-in way to convert your data to common formats like CSV or JSON. If you stop here, you’re basically stuck using this data only to back up and restore App Engine. While that is useful, there are many other use-cases we have for exporting data at Pulse. So how do we read the data? It turns out there’s an undocumented, but relatively simple way of converting Google’s LevelDB-formatted backup files into simple Python dictionaries matching the structure of your original datastore entities. Here’s a Python snippet to get you started.

# Make sure the App Engine SDK is available on your Python path
#import sys
#sys.path.append('/usr/local/google_appengine')
from google.appengine.api.files import records
from google.appengine.datastore import entity_pb
from google.appengine.api import datastore

# Open one of the backup '-output-X' files in binary mode
raw = open('path_to_a_datastore_output_file', 'rb')
reader = records.RecordsReader(raw)
for record in reader:
    # Each record is a serialized EntityProto
    entity_proto = entity_pb.EntityProto(contents=record)
    entity = datastore.Entity.FromPb(entity_proto)
    # The entity is now available as a dictionary!

Note: If you use this approach to read all files in an output directory, you may get a ProtocolBufferDecodeError exception for the first record. It should be safe to ignore that error and continue reading the rest of the records.
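
To process a whole download directory in one pass, here is a minimal sketch built on the snippet above (the helper function is just an illustration, not part of the SDK). It skips the .backup_info metadata files by name and ignores any record that fails to decode, which covers the error mentioned in the note.

# Sketch: read every backup output file in a local directory,
# skipping metadata files and undecodable records.
import os

from google.appengine.api.files import records
from google.appengine.api import datastore
from google.appengine.datastore import entity_pb
from google.net.proto import ProtocolBuffer

def iter_backup_entities(backup_dir):
    """Yield datastore.Entity objects from every output file in backup_dir."""
    for name in sorted(os.listdir(backup_dir)):
        if '.backup_info' in name:
            continue  # metadata files can be ignored
        with open(os.path.join(backup_dir, name), 'rb') as raw:
            for record in records.RecordsReader(raw):
                try:
                    entity_proto = entity_pb.EntityProto(contents=record)
                except ProtocolBuffer.ProtocolBufferDecodeError:
                    continue  # e.g. the stray first record mentioned above
                yield datastore.Entity.FromPb(entity_proto)

# Example usage: count the exported entities.
# print sum(1 for _ in iter_backup_entities('/local_target/your_path'))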

Performance Comparison

Remote API Bulk Loader

  • 10GB / 10 hours ~ 291KB/s
  • 100GB – never finishes!

Backup to Google Cloud Storage + Download with gsutil

  • 10GB: 10 min backup + 10 min download ~ 8.5MB/s
  • 100GB: 35 min backup + 100 min download ~ 12.6MB/s

Comments

  • http://twitter.com/villaindm Dmitry Lukashev

    Nice export tips, thanks!
    I have a problem with reading packed backup data.
    What is ‘path_to_datastore_export_file’? After export appengine creates 2 info files like ‘agZidXp6d3JyQQsSHF9BRV9EYXRhc3RvcmVBZG1pbl9PcGVyYXRpb24Y8a6dIAwLEhZfQUVfQmFja3VwX0luZm9ybWF0aW9uGAEM.backup_info’ and N ‘-output-X’ files.
    RecordsReader says ‘google.net.proto.ProtocolBuffer.ProtocolBufferDecodeError: truncated’ if I pass ‘info’ file. For other files it says nothing)

    btw. RecordReader -> RecordsReader

  • http://GBayer.com/ Greg Bayer

    You’ll want to read each one of the N ‘-output-X’ files with this code block. So to read all of them, just loop through all of the output files in the directory and replace ‘path_to_datastore_export_file’ with each output file you find. The info files can be ignored.

    Thanks for pointing out the typo! I’ll fix that and try to make the other part a bit clearer.

  • http://twitter.com/villaindm Dmitry Lukashev

    That’s strange – if I pass an ‘-output-X’ file to the reader, it doesn’t go inside the loop… without any error. It seems RecordsReader doesn’t recognize the input stream as correct

  • http://GBayer.com/ Greg Bayer

    Sorry for not explaining more clearly. We did get a ProtocolBufferDecodeError exception for the first record; however, after catching that error, we were able to continue reading the rest of the records in each file without any trouble. Let me know if you find a cleaner way to do it.

  • Robert Jerovsek

    Dmitry might be using Windows and might thus need to use 'rb' as the mode when opening the file. Apart from that, I used it like: with open(sys.argv[1], 'rb') as raw: … This worked for me.

  • Joseph

    I need to export about 14 GB of data, but I am wary of the costs. Is it possible for you to give an estimate of the cost of exporting 10 GB of data?

  • http://GBayer.com/ Greg Bayer

    The cost will depend more on the number of records. You should be able to estimate it based on the cost to read all of your records from the Datastore.

  • Siddhant

    Can you help with this? The data formatting is quite out of place:

    SQLite format 3
    CREATE TABLE bulkloader_database_signature (value TEXT not null)
    CREATE TABLE result (id BLOB primary key, value BLOB not null, sort_key BLOB)
    [the remainder of the paste is binary-encoded entity data from a bulk loader output file]

  • chrisklaiber

    Thanks for sharing such excellent information, Greg, it’s a great approach to data exporting for offline analytics.

    The ProtocolBufferDecodeError is thrown when reading the top-level .backup_info file because the first record in the file is not an entity in protobuf format, it’s a version string, i.e., “1”.

    Check out class BackupInfoWriter in the Python SDK source https://code.google.com/p/googleappengine/source/browse/trunk/python/google/appengine/ext/datastore_admin/backup_handler.py?r=361 to see how the top-level .backup_info file and the per-kind .backup_info files are written.

    Greg’s method works for the top-level .backup_info and the data files, but the per-kind .backup_info files are just a single protobuf-encoded entity and not wrapped in the leveldb format that RecordsReader reads. To read them:

    backup = google.appengine.ext.datastore_admin.backup_pb2.Backup()
    backup.ParseFromString(file_contents)

    So far, though, I haven’t needed these. All of the interesting information I’ve wanted is in the top-level .backup_info and some is duplicated in the per-kind .backup_info. YMMV.

  • http://GBayer.com/ Greg Bayer

    I’m glad you found the article useful, and thanks for the tip! We haven’t needed to dig much deeper yet, but if we do I’ll try out what you suggested.

  • Rick Burgess

    I have written a Google App Engine app that uses the code in the article to process CSV files. I have never used Python before, so it might be a bit rough, but it should work:

    https://github.com/bbhlondon/app-engine-export

    Pull requests are greatly appreciated

  • Luiz Gustavo Martins

    Hi, excellent article! I have a question about doing the opposite: inserting lots of data into App Engine using a file from Cloud Storage. Does anyone know if this is a good strategy? Today I'm using bulk insert from the command line on a daily basis and I don't think it's the best solution. Any ideas?

  • http://GBayer.com/ Greg Bayer

    Unfortunately, inserting lots of data has always been difficult with GAE. Reversing this process could work, but you’ll have to make sure to divide your data carefully so you stay within the request time limits for each mapper.

    We try to avoid large data imports to GAE. When they are necessary, we usually use custom write handlers and POST to them in parallel from AWS.

  • Luiz Gustavo Martins

    Thanks for the reply, I'm really glad to know that there's a problem and it's not only my fault!!! I guess I'll try using Compute Engine to do the bulk insert!

  • Christian

    Hi.
    I’m new to Python and I tried to get the code working.
    But the variable ‘reader’ is always empty.
    It never enters the for-loop.

    For debugging I tried this:

    for record in reader:
        self.response.write("Found a record ")

  • http://GBayer.com/ Greg Bayer

    Is your “path_to_a_datastore_output_file” correct?

  • nolabob

    I’m a Java developer, so I’m not a Python guy. However, I’ve been able to get all of my data. I just have no idea what to make of your python snippet.

    Help? Thanks

  • Christian

    Yes, it is.
    I used the git project created by Rick Burgess.
    So the code reads all the files in the “data” directory and tries to parse each one.

  • Christian

    Hi,
    I’m a Java developer too.
    I tried my luck with the Python script but had no success.
    Have you found a solution?

  • kamelzcs

    How do I import the CSV file into SQL, such as PostgreSQL?
    I don’t know how to deal with the relations between different entities. Is the only choice to read everything into memory and then find the related elements by searching element by element? I don’t know an efficient way to translate a CSV file into PostgreSQL.

    Thanks for sharing!

  • http://createtrips.com Jesse Luoto

    Thank you for this great article Greg!

    I extended your sample code to actually save the data to local Datastore:
    https://gist.github.com/jehna/3b258f5287fcc181aacf

  • orcaman

    This is awesome stuff man. Thanks for sharing!