I have lots of CSV files. After running Spark SQL on them, I want to produce a single CSV file.
For example, I have news1.csv, news2.csv, news3.csv, etc. in S3. I load them from S3 into Spark and create a DataFrame. After running Spark SQL, I want to upload only one CSV file back to S3.
At first I tried mode("append") for each CSV file:
df = spark.sql(...)
df.write \
    .option("header", "true") \
    .option("encoding", "UTF-8") \
    .mode("append") \
    .option("extracopyoptions", "TIMEFORMAT 'auto'") \
    .csv("s3a://news/test1")
But in this case append didn't do what I wanted: each CSV file ended up in news/test1 as its own part-00000…, part-00000…, part-00000… file.
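Each .csv() call writes the DataFrame's partitions as new part files into the directory, which is why a separate part-00000… file appears for every write. One way to avoid the per-file loop is to read all the inputs in a single pass; a minimal sketch, assuming the files share one schema (the glob pattern, view name, and query are illustrative):

df = spark.read \
    .option("header", "true") \
    .csv("s3a://news/news*.csv")  # glob picks up news1.csv, news2.csv, ...
df.createOrReplaceTempView("news")
df = spark.sql("SELECT * FROM news")  # stand-in for the real query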
Second, I unioned the DataFrames into one:
df = spark.sql(...)
df_total = df_total.union(df)  # repeated for each CSV file
df_total.write \
    .option("header", "true") \
    .option("encoding", "UTF-8") \
    .mode("append") \
    .option("extracopyoptions", "TIMEFORMAT 'auto'") \
    .csv("s3a://news/test2")
But in this case, even though I made one DataFrame, the output was still saved in news/test2 as multiple files: part-00000…, part-00001…, part-00002…
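The number of output files follows the number of partitions: Spark writes one part file per (non-empty) partition, so a union of three single-partition DataFrames still has three partitions. A quick check (illustrative):

print(df_total.rdd.getNumPartitions())  # e.g. 3 -> part-00000, part-00001, part-00002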
How can I save only one CSV file to S3? I need your help.
>Solution :
You could try coalesce before writing the results, so that the number of partitions is reduced to only 1:
df = spark.sql(...)
df.coalesce(1) \
    .write \
    ...
Note that using .mode("append") won't empty the output folder; it appends the new results to it. So if you need only one CSV covering both the old and the new results, consider the union option but remove the append mode.
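Putting the pieces together, a minimal end-to-end sketch under those assumptions (the glob pattern, view name, query, and output path are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("single-csv").getOrCreate()

# Read every input file into one DataFrame in a single pass
# (assumes the files share one schema).
df = spark.read.option("header", "true").csv("s3a://news/news*.csv")
df.createOrReplaceTempView("news")
result = spark.sql("SELECT * FROM news")  # placeholder query

# coalesce(1) collapses the result to a single partition, so Spark
# writes exactly one part-00000... file into the output directory.
result.coalesce(1) \
    .write \
    .option("header", "true") \
    .option("encoding", "UTF-8") \
    .mode("overwrite") \
    .csv("s3a://news/test3")

Keep in mind that Spark still writes a directory (here s3a://news/test3) containing one part-00000… file plus a _SUCCESS marker, not a file named test3.csv; giving the file a fixed name would take a separate rename step.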