While Google App Engine has many strengths, as with all platforms, there are some challenges to be aware of. Over the last two years, one of our biggest challenges at Pulse has been how difficult it can be to export large amounts of data for migration, backup, and integration with other systems. While there are several options and tools, so far none have been feasible for large datasets (10GB+).
Since we have many TBs of data in Datastore, we’ve been actively looking for a solution to this for some time. I’m excited to share a very effective approach based on Google Cloud Storage and Datastore Backups, along with a method for converting the data to other formats!
Existing Options For Data Export
These options have been around for some time. They are often promoted as making it easy to access datastore data, but the reality can be very different when dealing with big data.
- Using the Remote API Bulk Loader (an example invocation is sketched just after this list). Although convenient, this official tool only works well for smaller datasets. Large datasets can easily take 24 hours to download and often fail without explanation. The tool has remained essentially unchanged, without any further development, since App Engine’s early days, yet all official Google instructions point to this approach.
- Writing a map reduce job to push the data to another server. This approach can be painfully manual and often requires significant infrastructure elsewhere (e.g. on AWS).
- Using the Remote API directly, or writing a handler that serves datastore entities one query at a time, and then running a parallelizable script or map reduce job to pull the data to where you need it. Unfortunately this has the same issues as the previous option.
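For reference, the bulk loader in the first option is typically invoked with something along these lines. The app id, kind, and filename below are placeholders, and your remote_api path may differ depending on how it is configured in your app:

appcfg.py download_data --application=your_app_id --url=http://your_app_id.appspot.com/_ah/remote_api --kind=YourKind --filename=your_kind_data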
A New Approach – Export Data via Google Cloud Storage
The recent introduction of Google Cloud Storage has finally made exporting large datasets out of Google App Engine’s datastore possible and fairly easy. The setup steps are annoying, but thankfully it’s mostly a one-time cost. Here’s how it works.
One-time setup
- Create a new task queue in your App Engine app called ‘backups’ with the maximum 500/s rate limit (optional; see the queue.yaml sketch after this list).
- Sign up for a Google Cloud Storage account with billing enabled. Download and configure gsutil for your account.
- Create a bucket for your data in Google Cloud Storage. You can use the online browser to do this. Note: there’s an unresolved bug that causes backups to buckets with underscores in their names to fail.
- Use gsutil to set the ACL and default ACL for that bucket to include your app’s service account email address with WRITE and FULL_CONTROL respectively (example commands after this list).
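For reference, the queue definition lives in your app’s queue.yaml and looks like this:

queue:
- name: backups
  rate: 500/s

The ACL grants can be done with commands roughly like the following. This is a sketch assuming a reasonably recent gsutil (older releases used setacl/setdefacl with XML files instead), OWNER is gsutil’s name for FULL_CONTROL, and your-app-id@appspot.gserviceaccount.com is the usual form of the service account shown under your app’s Application Settings, so substitute your own:

gsutil acl ch -u your-app-id@appspot.gserviceaccount.com:WRITE gs://your_bucket_name
gsutil defacl ch -u your-app-id@appspot.gserviceaccount.com:OWNER gs://your_bucket_name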
Steps to export data
- Navigate to the datastore admin tab in the App Engine console for your app. Click the checkbox next to the Entity Kinds you want to export, and push the Backup button.
- Select your ‘backups’ queue (optional) and Google Cloud Storage as the destination. Enter the bucket name as /gs/your_bucket_name/your_path.
- A map reduce job with 256 shards will be run to copy your data. It should be quite fast (see below).
Steps to download data
- On the machine where you want the data, run the following command. Optionally you can include the -m flag before cp to enable multi-threaded downloads.
gsutil cp -R /gs/your_bucket_name/your_path /local_target
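With the multi-threaded flag mentioned above, the same command becomes:

gsutil -m cp -R /gs/your_bucket_name/your_path /local_target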
Reading Your Data
Unfortunately, even though you now have an efficient way to export data, this approach doesn’t include a built-in way to convert your data to common formats like CSV or JSON. If you stop here, you’re basically stuck using this data only to back up and restore App Engine. While that is useful, we have many other use cases for exported data at Pulse. So how do we read the data? It turns out there’s an undocumented but relatively simple way of converting Google’s LevelDB-formatted backup files into plain Python dictionaries matching the structure of your original datastore entities. Here’s a Python snippet to get you started.
# Make sure the App Engine SDK is available
# import sys
# sys.path.append('/usr/local/google_appengine')

from google.appengine.api.files import records
from google.appengine.datastore import entity_pb
from google.appengine.api import datastore

raw = open('path_to_a_datastore_output_file', 'r')
reader = records.RecordsReader(raw)
for record in reader:
    entity_proto = entity_pb.EntityProto(contents=record)
    entity = datastore.Entity.FromPb(entity_proto)
    # Entity is available as a dictionary!
Note: If you use this approach to read all files in an output directory, you may get a ProtocolBufferDecodeError exception for the first record. It should be safe to ignore that error and continue reading the rest of the records.
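For example, a loop along the following lines works; the directory path and the ‘-output-’ filename pattern here are assumptions based on how the backup files are named, so adjust them to match your own download:

import glob

from google.appengine.api.files import records
from google.appengine.datastore import entity_pb
from google.appengine.api import datastore
from google.net.proto.ProtocolBuffer import ProtocolBufferDecodeError

entities = []
for path in glob.glob('/local_target/your_path/*output-*'):
    # Open in 'rb' so this also works on Windows.
    with open(path, 'rb') as raw:
        for record in records.RecordsReader(raw):
            try:
                entity_proto = entity_pb.EntityProto(contents=record)
            except ProtocolBufferDecodeError:
                # The first record of some files is not an entity; skip it.
                continue
            entities.append(datastore.Entity.FromPb(entity_proto))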
Performance Comparison
Remote API Bulk Loader
- 10GB / 10 hours ~ 291KB/s
- 100GB – never finishes!
Backup to Google Cloud Storage + Download with gsutil (backup time + download time)
- 10GB / 10 mins + 10 mins ~ 8.5MB/s
- 100GB / 35 mins + 100 mins ~ 12.6MB/s
Comments

Nice export tips, thanks!
I have a problem with reading packed backup data.
What is ‘path_to_datastore_export_file’? After export appengine creates 2 info files like ‘agZidXp6d3JyQQsSHF9BRV9EYXRhc3RvcmVBZG1pbl9PcGVyYXRpb24Y8a6dIAwLEhZfQUVfQmFja3VwX0luZm9ybWF0aW9uGAEM.backup_info’ and N ‘-output-X’ files.
RecordsReader says ‘google.net.proto.ProtocolBuffer.ProtocolBufferDecodeError: truncated’ if I pass an ‘info’ file. For the other files it says nothing.
btw. RecordReader -> RecordsReader
You’ll want to read each one of the N ‘-output-X’ files with this code block. So to read all of them, just loop through all of the output files in the directory and replace ‘path_to_datastore_export_file’ with each output file you find. The info files can be ignored.
Thanks for pointing out the typo! I’ll fix that and try to make the other part a bit clearer.
That’s strange – if I pass an ‘-output-X’ file to the reader, it doesn’t go inside the loop, without any error. It seems RecordsReader doesn’t recognize the input stream as correct.
Sorry for not explaining more clearly. We did get a ProtocolBufferDecodeError exception for the first record; however, after catching that error, we were able to continue reading the rest of the records in each file without any trouble. Let me know if you find a cleaner way to do it.
Dmitry might be using Windows and might thus need to use ‘rb’ as the mode when opening the file. Apart from that, I used it like: with open(sys.argv[1], ‘rb’) as raw: … This worked for me.
I need to export about 14 GB of data, but I am wary of the costs. Is it possible for you to give an estimate of the cost of exporting 10 GB of data?
The cost will depend more on the number of records. You should be able to estimate it based on the cost to read all of your records from the Datastore.
Can you help with this? The data formatting is quite out of place:
[Raw binary dump of a bulkloader SQLite result file omitted; it begins with a “SQLite format 3” header and contains the bulkloader_database_signature and result tables holding the exported entity data.]
Thanks for sharing such excellent information, Greg, it’s a great approach to data exporting for offline analytics.
The ProtocolBufferDecodeError is thrown when reading the top-level .backup_info file because the first record in the file is not an entity in protobuf format, it’s a version string, i.e., “1”.
Check out class BackupInfoWriter in the Python SDK source https://code.google.com/p/googleappengine/source/browse/trunk/python/google/appengine/ext/datastore_admin/backup_handler.py?r=361 to see how the top-level .backup_info file and the per-kind .backup_info files are written.
Greg’s method works for the top-level .backup_info and the data files, but the per-kind .backup_info files are just a single protobuf-encoded entity and not wrapped in the leveldb format that RecordsReader reads. To read them:
from google.appengine.ext.datastore_admin import backup_pb2

backup = backup_pb2.Backup()
backup.ParseFromString(file_contents)
So far, though, I haven’t needed these. All of the interesting information I’ve wanted is in the top-level .backup_info and some is duplicated in the per-kind .backup_info. YMMV.
I’m glad you found the article useful, and thanks for the tip! We haven’t needed to dig much deeper yet, but if we do I’ll try out what you suggested.
I have written a Google App Engine app that uses the code in the article to process CSV files. I have never used Python before, so it might be a bit rough, but it should work:
https://github.com/bbhlondon/app-engine-export
Pull requests are greatly appreciated
Hi, excellent article! I have a doubt about doing the opposite: inserting lots of data into App Engine using a file from Cloud Storage. Does anyone know if this is a good strategy? Today I’m using bulk insert from the command line on a daily basis and I don’t think it is the best solution. Any ideas?
Unfortunately, inserting lots of data has always been difficult with GAE. Reversing this process could work, but you’ll have to make sure to divide your data carefully so you stay within the request time limits for each mapper.
We try to avoid large data imports to GAE. When they are necessary, we usually use custom write handlers and POST to them in parallel from AWS.
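For anyone curious what such a handler can look like, here is a stripped-down sketch; the route, kind name, and JSON payload shape are purely illustrative rather than our production code:

import json

import webapp2
from google.appengine.ext import ndb


class BulkImportHandler(webapp2.RequestHandler):
    def post(self):
        # Each POST carries a small JSON batch like
        # [{"id": "row-1", "properties": {"name": "..."}}, ...]
        # so every request finishes comfortably within the deadline.
        rows = json.loads(self.request.body)
        entities = []
        for row in rows:
            # Expando lets us store arbitrary properties without a model class.
            entities.append(ndb.Expando(key=ndb.Key('ImportedRecord', row['id']),
                                        **row['properties']))
        ndb.put_multi(entities)
        self.response.write('stored %d' % len(entities))


app = webapp2.WSGIApplication([('/bulk_import', BulkImportHandler)])

The client side then just POSTs batches to that URL in parallel, keeping each batch small enough to stay within the request time limits.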
Thanks for the reply, I’m really glad to know that there’s a problem and it’s not only my fault!!! I guess I’ll try using Compute Engine to do the bulk insert!
Hi.
I’m new to Python and I tried to get the code working.
But the variable ‘reader’ is always empty.
It never enters the for loop.
For debugging I tried this:
for record in reader:
    self.response.write("Found a record")
Is your “path_to_a_datastore_output_file” correct?
I’m a Java developer, so I’m not a Python guy. However, I’ve been able to get all of my data. I just have no idea what to make of your Python snippet.
Help? Thanks
Yes it is.
I used the git project created by Rick Burgess.
So the code reads all files in the “data” directory and tries to read them.
Hi,
I’m a Java developer too.
I tried my luck with the Python script but had no success.
Have you found a solution?
How do I import the CSV file into SQL, such as PostgreSQL? I don’t know how to deal with the relations between different entities. Is the only choice to read everything into memory and then find the related elements by searching element by element? I don’t know an efficient way to translate a CSV file into PostgreSQL.
Thanks for sharing!
Thank you for this great article Greg!
I extended your sample code to actually save the data to local Datastore:
https://gist.github.com/jehna/3b258f5287fcc181aacf
This is awesome stuff man. Thanks for sharing!
I get an error in Step 3. Can you help?
[ERROR ] Exception during authentication
Traceback (most recent call last):
  File "C:\data\installs\google_appengine_1.9.18\google_appengine\google\appengine\tools\bulkloader.py", line 3466, in Run
    self.request_manager.Authenticate()
  File "C:\data\installs\google_appengine_1.9.18\google_appengine\google\appengine\tools\bulkloader.py", line 1329, in Authenticate
    remote_api_stub.MaybeInvokeAuthentication()
  File "C:\data\installs\google_appengine_1.9.18\google_appengine\google\appengine\ext\remote_api\remote_api_stub.py", line 889, in MaybeInvokeAuthentication
    datastore_stub._server.Send(datastore_stub._path, payload=None)
  File "C:\data\installs\google_appengine_1.9.18\google_appengine\google\appengine\tools\appengine_rpc.py", line 426, in Send
    f = self.opener.open(req)
  File "C:\Python27\lib\urllib2.py", line 431, in open
    response = self._open(req, data)
  File "C:\Python27\lib\urllib2.py", line 449, in _open
    '_open', req)
  File "C:\Python27\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Python27\lib\urllib2.py", line 1227, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "C:\Python27\lib\urllib2.py", line 1197, in do_open
    raise URLError(err)
URLError:
[ERROR ] Authentication Failed: Incorrect credentials or unsupported authentication type (e.g. OpenId).
Great. You saved my day.
I am getting an error: “int() argument should be a string or a number, not NoneType”. I believe UTF-8 doesn’t seem to be the proper encoding for the datastore backup file.
snippet:
load --encoding=UTF-8 --source_format=DATASTORE_BACKUP --allow_jagged_rows=false --write_disposition=WRITE_TRUNCATE sample_red.testenge_1 gs://test.appspot.com/bucketname/ahFzfnZpcmdpbi1yZWQtdGVzdHJBCxIcX0FFX0RhdGFzdG9yZUFkbWluX09wZXJhdGlvbhiBwLgCDAsSFl9BRV9CYWNrdXBfSW5mb3JtYXRpb24YAQw.nge.backup_info
I gave it a try but I only see a “Hallo World” page?
Good article! A question: is there a way to download only a specific model?
Unless something has changed, the export step should include an option to choose which Entity Kinds you want.
While it is a good article, the fact that it is this complex to get your own app data out of the datastore is why I will never use it again once I leave this gig where I am obligated to. As databases go it is too primitive and too locked down against doing so many things I have done with so many other kinds of databases over the last couple of decades. I consider it the biggest fail of App Engine. It feeds App Engine apps and does pretty much nothing a database should do beyond that. It can’t even support some very common query patterns. BAH.
Having used AppEngine for more than 4 years, I am afraid I have to agree with this comment.
Even Google’s own procedures to backup and restore data from datastore fail consistently if your database is larger than ~ 1GB. Documentation is incomplete and inaccurate. Support from Google is non-existent, and bug reports stay unanswered for years.
My biggest problem now is that it is hard to migrate away from Appengine/datastore, because it is virtually impossible to get your data out of it.
On Windows you must open the file as ‘rb’ and not ‘r’!!!
Otherwise you will read partial records and not get results.
I’m inclined to agree with you. I’m using simpleauth as my application’s authentication, and I can’t even open the datastore admin console because I get an authentication error. I don’t have a huge amount of data, but I will have to write my own custom backup code to deal with this basic function!
nice solution
No one’s commented in 2 years or so, but I still found the instructions *very* helpful. I’ve got maybe 150GB in the App Engine datastore, and this allowed me to get it out in a reasonable amount of time. The backup still took a few days, and the download maybe an hour, but I got it and am able to convert it into other formats. Thank you!
Glad you found it helpful!