How to read, modify, and re-upload S3 files without downloading

Motivation

My team often receives files from third-party data providers. This means we have to ingest the data (usually from their S3 buckets), process it to be consistent with our existing schemas, and re-upload it to our own S3 buckets so we can build Athena tables on top. Doing all of that within the AWS ecosystem lets us avoid the time-consuming process of downloading potentially huge datasets locally or onto a server, editing them, then uploading again.

While resources exist online for how to do each of these tasks individually, there is no comprehensive guide on how to chain them together. I've outlined my workflow here in the hope of helping anyone else looking to do the same.

Steps

Getting the object from S3 is a fairly standard process. The important thing to note here is decoding the file body from bytes to a string so we can do any useful processing on it.


    import boto3

    s3_client = boto3.client('s3')
    s3_object = s3_client.get_object(Bucket=your_bucket, Key=key_of_obj)
    data = s3_object['Body'].read().decode('utf-8')
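A quick aside: read() loads the whole object into memory at once, which is fine for most files. For very large objects, botocore's streaming body also exposes an iter_lines() iterator, so you can process the file line by line instead. A minimal sketch, where process() stands in for whatever per-line handling you need:

    # sketch: stream the body line by line to keep memory use bounded
    for line in s3_object['Body'].iter_lines():
        process(line.decode('utf-8'))  # process() is a hypothetical handler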

After editing (here we're removing quotes from CSVs using writer = csv.writer(writer_buffer, quoting=csv.QUOTE_NONE)), we write the results to a buffer. Since the file was decoded from bytes to a string earlier, we now re-encode it back to bytes so we can upload the buffer. One caveat with csv.QUOTE_NONE: if a field contains the delimiter itself, csv.writer raises an error unless an escapechar is also set.
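If your fields can themselves contain commas, a small tweak avoids that error (the backslash escape character here is just an assumption; use whatever your downstream parser expects):

    # sketch: escape embedded delimiters rather than raising an error
    writer = csv.writer(writer_buffer, quoting=csv.QUOTE_NONE, escapechar='\\')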

Here is what it looks like in full. This can be done in fewer than ten lines!


    import io
    import csv
    import boto3

    s3_client = boto3.client('s3')
    s3_object = s3_client.get_object(Bucket=your_bucket, Key=key_of_obj)

    # read the file
    data = s3_object['Body'].read().decode('utf-8')

    # editing, here we're removing quotes
    writer_buffer = io.StringIO()
    writer = csv.writer(writer_buffer, quoting=csv.QUOTE_NONE)
    reader = csv.reader(io.StringIO(data), delimiter=',', skipinitialspace=True)

    # writing and re-uploading
    writer.writerows(reader)

    buffer_to_upload = io.BytesIO(writer_buffer.getvalue().encode('utf-8'))
    s3_client.put_object(Body=buffer_to_upload, Bucket=your_bucket, Key='path/to-your/file.csv')

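If you have a whole prefix of files to clean up rather than a single object, the same steps drop straight into a loop. Here is a sketch along those lines; strip_quotes() wraps the editing steps from above, and source_bucket, source_prefix, and dest_bucket are placeholder names for your own setup:

    def strip_quotes(data):
        # the editing steps from above: rewrite the CSV without quotes
        writer_buffer = io.StringIO()
        writer = csv.writer(writer_buffer, quoting=csv.QUOTE_NONE)
        reader = csv.reader(io.StringIO(data), delimiter=',', skipinitialspace=True)
        writer.writerows(reader)
        return writer_buffer.getvalue()

    # paginate in case the prefix holds more than 1000 keys
    paginator = s3_client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=source_bucket, Prefix=source_prefix):
        for obj in page.get('Contents', []):
            body = s3_client.get_object(Bucket=source_bucket, Key=obj['Key'])['Body']
            cleaned = strip_quotes(body.read().decode('utf-8'))
            s3_client.put_object(Body=cleaned.encode('utf-8'),
                                 Bucket=dest_bucket, Key=obj['Key'])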
Questions? Comments? Reach out on Twitter.