PySpark: write a DataFrame to CSV files in S3 with a custom name

I am writing files to an S3 bucket with code such as the following:

df.write.format('csv').option('header','true').mode("append").save("s3://filepath")

This writes several files to the S3 bucket, as desired, but each part file has a long name such as:

part-00019-tid-5505901395380134908-d8fa632e-bae4-4c7b-9f29-c34e9a344680-236-1-c000.csv


Is there a way to write these with a custom file name, preferably from within the PySpark write call? Something like:

part-00019-my-output.csv

Solution:

You can’t do that with Spark alone. The long random suffixes guarantee uniqueness, so that the many executors writing files to the same location at the same time never overwrite each other’s output.

You’d have to use the AWS SDK (e.g. boto3) to rename those files after the write. Note that S3 has no rename operation; a "rename" is a copy to the new key followed by a delete of the old one.
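A minimal sketch of that post-write rename with boto3. The bucket and prefix names are placeholders, and the key-rewriting helper assumes the part-file naming shown above (`part-NNNNN-...`):

```python
def custom_key(key: str, suffix: str = "my-output.csv") -> str:
    """Map a Spark part-file key such as
    'output/part-00019-tid-...-c000.csv'
    to 'output/part-00019-my-output.csv'."""
    prefix, _, filename = key.rpartition("/")
    part_id = "-".join(filename.split("-")[:2])  # e.g. 'part-00019'
    new_name = f"{part_id}-{suffix}"
    return f"{prefix}/{new_name}" if prefix else new_name


def rename_part_files(s3, bucket: str, prefix: str) -> None:
    """Rename every part file under `prefix` via copy + delete.
    `s3` is a boto3 S3 client, e.g. boto3.client('s3')."""
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    for obj in resp.get("Contents", []):
        key = obj["Key"]
        name = key.rpartition("/")[2]
        if not (name.startswith("part-") and name.endswith(".csv")):
            continue  # skip _SUCCESS markers and other non-data objects
        s3.copy_object(Bucket=bucket,
                       CopySource={"Bucket": bucket, "Key": key},
                       Key=custom_key(key))
        s3.delete_object(Bucket=bucket, Key=key)
```

Call it after the Spark write finishes, e.g. `rename_part_files(boto3.client('s3'), 'my-bucket', 'output/')`. Note that `copy_object` only handles objects up to 5 GB; larger files would need a multipart copy.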

P.S.: If you want a single CSV file, you can use coalesce, but the file name is still not deterministic:

df.coalesce(1).write.format('csv')...
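With `coalesce(1)` there is exactly one part file, so the post-write rename reduces to finding that one key and copying it to a fixed name. A sketch, with the bucket and prefix as placeholder assumptions:

```python
def find_single_part_file(keys):
    """Given the object keys under the output prefix, return the single
    Spark part file; raise if there isn't exactly one."""
    parts = [k for k in keys
             if k.rpartition("/")[2].startswith("part-")
             and k.endswith(".csv")]
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")
    return parts[0]


# With a boto3 client (names below are illustrative only):
# keys = [o["Key"] for o in s3.list_objects_v2(
#             Bucket="my-bucket", Prefix="output/")["Contents"]]
# src = find_single_part_file(keys)
# s3.copy_object(Bucket="my-bucket",
#                CopySource={"Bucket": "my-bucket", "Key": src},
#                Key="output/my-output.csv")
# s3.delete_object(Bucket="my-bucket", Key=src)
```

Keep in mind that `coalesce(1)` funnels all the data through a single task, so it is only advisable when the output is small enough to be written by one executor.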