How to read, modify, and re-upload S3 files without downloading
Carol Zhang, December 2019
Motivation
My team often receives files from third-party data providers. We have to ingest the data (usually from the providers' S3 buckets), process it to be consistent with our existing schemas, and re-upload it to our own S3 buckets so we can build Athena tables on top. Doing all of this within the AWS ecosystem avoids the time-consuming process of downloading potentially huge datasets locally or onto a server, editing them, and uploading them again.
While resources exist online for each of these tasks individually, there is no comprehensive guide on chaining them together. I've outlined my workflow in the hope of helping anyone else looking to do the same.
Steps
Getting the object from S3 is a fairly standard process. The important thing to note here is decoding the file from bytes to a string in order to do any useful processing.
import boto3
s3_client = boto3.client('s3')
s3_object = s3_client.get_object(Bucket=your_bucket, Key=key_of_obj)
data = s3_object['Body'].read().decode('utf-8')
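Since there's no real bucket to run against in a blog post, here is a minimal sketch of just the decode step, simulating the streaming Body with io.BytesIO (the sample bytes are hypothetical):

```python
import io

# get_object returns the payload as a binary stream; simulate one here
fake_body = io.BytesIO(b'id,name\n1,"Ada"\n2,"Grace"\n')

# .read() yields raw bytes; .decode('utf-8') turns them into a str
data = fake_body.read().decode('utf-8')

print(type(data).__name__)
print(data.splitlines()[0])
```

Everything downstream (the csv module, string buffers) expects str, which is why the decode happens up front.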
After editing the data (here we're removing quotes from CSVs with writer = csv.writer(writer_buffer, quoting=csv.QUOTE_NONE)), we write the results to a buffer. Since the file was decoded from bytes to a string earlier, we now re-encode it back to bytes so we can upload the buffer. One caveat: with QUOTE_NONE, the csv module raises an error if a field contains the delimiter and no escapechar is set.
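To see the quote-stripping step in isolation, here is a self-contained sketch using only the csv module on hypothetical in-memory data, with no S3 involved:

```python
import csv
import io

# hypothetical quoted CSV, standing in for the decoded S3 object
quoted_csv = 'id,name\n1,"Ada"\n2,"Grace"\n'

reader = csv.reader(io.StringIO(quoted_csv), delimiter=',', skipinitialspace=True)
writer_buffer = io.StringIO()
# QUOTE_NONE writes every field unquoted; it raises csv.Error if a
# field contains the delimiter and no escapechar is defined
writer = csv.writer(writer_buffer, quoting=csv.QUOTE_NONE)
writer.writerows(reader)

print(writer_buffer.getvalue())
```

Note that csv.writer terminates rows with '\r\n' by default; pass lineterminator='\n' if you want Unix line endings in the output.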
Here is what it looks like in full. The whole workflow fits in about a dozen lines!
import io
import csv
import boto3
s3_client = boto3.client('s3')
s3_object = s3_client.get_object(Bucket=your_bucket, Key=key_of_obj)
# read the file
data = s3_object['Body'].read().decode('utf-8')
writer_buffer = io.StringIO()
# editing, here we're removing quotes
writer = csv.writer(writer_buffer, quoting=csv.QUOTE_NONE)
reader = csv.reader(io.StringIO(data), delimiter=',', skipinitialspace=True)
# writing and re-uploading
writer.writerows(reader)
buffer_to_upload = io.BytesIO(writer_buffer.getvalue().encode('utf-8'))
s3_client.put_object(Body=buffer_to_upload, Bucket=your_bucket, Key='path/to-your/file.csv')
Questions? Comments? Reach out on Twitter.