How can I write only one file from Spark to S3?

I have lots of CSV files. After using Spark SQL, I want to produce a single CSV file.

For example, I have news1.csv, news2.csv, news3.csv, etc. in S3. I load them from S3 into Spark SQL and create a DataFrame. After running the Spark SQL query, I want to upload a single CSV file back to S3.

At first I tried writing each CSV file with mode("append"):


df = spark.sql(...)

df.write \
  .option("header","true") \
  .option("encoding", "UTF-8") \
  .mode("append") \
  .option("extracopyoptions", "TIMEFORMAT 'auto'") \
  .csv("s3a://news/test1")

But in this case, append didn't work the way I expected: each CSV was saved in news/test1 as its own part-00000… file.

Second, I unioned the DataFrames:

df = spark.sql(...)
df_total = df_total.union(df)

df_total.write \
  .option("header","true") \
  .option("encoding", "UTF-8") \
  .mode("append") \
  .option("extracopyoptions", "TIMEFORMAT 'auto'") \
  .csv("s3a://news/test2")

But in this case, even though I made a single DataFrame, the output in news/test2 was still split across part-00000…, part-00001…, part-00002… files.

How can I save only one CSV file to S3?

I need your help.

>Solution :

You could call coalesce before writing the results; it reduces the number of partitions to just 1, so Spark emits a single part file.

df = spark.sql(...)

df.coalesce(1) \
  .write \
  ...

Note that .mode("append") won't empty the output folder; it appends the new results alongside the old ones. So if you need a single CSV covering both old and new results, use the union approach but drop the append mode (for example, use overwrite instead).
