How to perform an incremental load using AWS EMR (PySpark) the right way?

I have all my data available in the S3 location s3://sample/input_data.

I do my ETL by deploying an AWS EMR cluster and using PySpark.

The PySpark script is very simple:


  • Load s3://sample/input_data as a Spark DataFrame.
  • Partition it by one column.
  • Save the DataFrame as Parquet files, writing in ‘append’ mode, to the S3 location s3://sample/output_data.
  • Then copy all files from s3://sample/input_data to s3://sample/archive_data and delete everything in s3://sample/input_data.
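The steps above can be sketched roughly as follows. This is a minimal sketch, not the asker's actual script: the JSON input format, the partition column name `event_date`, and the flat `input_data/` key layout are assumptions.

```python
def archive_key(key: str,
                src_prefix: str = "input_data/",
                dst_prefix: str = "archive_data/") -> str:
    """Map an input object key to its archive location (step 4)."""
    if not key.startswith(src_prefix):
        raise ValueError(f"unexpected key: {key}")
    return dst_prefix + key[len(src_prefix):]


def run_etl(spark, partition_col="event_date"):
    """Steps 1-3: load, partition by one column, append as Parquet.

    `partition_col` and the JSON input format are assumptions.
    """
    df = spark.read.json("s3://sample/input_data/")
    (df.write
       .partitionBy(partition_col)
       .mode("append")
       .parquet("s3://sample/output_data/"))


def archive_input(bucket_name="sample"):
    """Step 4: copy every input object to archive_data/, then delete it."""
    import boto3  # available on EMR; imported lazily so the sketch loads without it
    bucket = boto3.resource("s3").Bucket(bucket_name)
    for obj in bucket.objects.filter(Prefix="input_data/"):
        # Server-side copy into the archive prefix, then remove the original.
        bucket.Object(archive_key(obj.key)).copy(
            {"Bucket": bucket_name, "Key": obj.key})
        obj.delete()
```

Note that the copy-then-delete step is not atomic: if the job dies between the Parquet write and the archive, the same input files are appended again on the next run.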

So when new data arrives in s3://sample/input_data, the job processes only the new files and saves them, partitioned, to s3://sample/output_data.

Is there any built-in mechanism AWS EMR provides that I should be aware of, which I could use instead of the last (archive-and-delete) step of my PySpark script?

>Solution :

You could either use Delta Lake for this purpose, or partition your input directory by a time interval, e.g. s3://sample/input_data/year=2021/month=11/day=11/, so that each run only processes data from that interval.
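A minimal helper for the time-partitioned layout suggested above; the zero-padding of month and day, and reading one day behind the current date, are assumptions on top of the answer:

```python
from datetime import date, timedelta


def input_path_for(d: date, base: str = "s3://sample/input_data") -> str:
    """Build the Hive-style daily partition path,
    e.g. s3://sample/input_data/year=2021/month=11/day=11/."""
    return f"{base}/year={d.year}/month={d.month:02d}/day={d.day:02d}/"


# With a SparkSession in scope, process only yesterday's partition:
# df = spark.read.json(input_path_for(date.today() - timedelta(days=1)))
```

With this layout, the archive-and-delete step becomes unnecessary: already-processed days are simply never read again.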
