How to perform an incremental load using AWS EMR (PySpark) the right way?

I have all my data available in the S3 location s3://sample/input_data.

I do my ETL by deploying an AWS EMR cluster and using PySpark.

The PySpark script is very simple:


  • Load s3://sample/input_data as a Spark DataFrame.
  • Partition it by one column.
  • Save the DataFrame as Parquet files, writing in ‘append’ mode, to the S3 location s3://sample/output_data.
  • Then copy all files from s3://sample/input_data to s3://sample/archive_data and delete everything in s3://sample/input_data.
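The steps above can be sketched roughly as follows. This is a minimal sketch, not the asker's actual script: the JSON input format, the partition column name `event_date`, and the flat `input_data/` key layout are assumptions.

```python
def archive_key(key: str,
                src_prefix: str = "input_data/",
                dst_prefix: str = "archive_data/") -> str:
    """Map an input object key to its archive location (step 4)."""
    if not key.startswith(src_prefix):
        raise ValueError(f"unexpected key: {key}")
    return dst_prefix + key[len(src_prefix):]


def run_etl(spark, partition_col="event_date"):
    """Steps 1-3: load, partition by one column, append as Parquet.

    `partition_col` and the JSON input format are assumptions.
    """
    df = spark.read.json("s3://sample/input_data/")
    (df.write
       .partitionBy(partition_col)
       .mode("append")
       .parquet("s3://sample/output_data/"))


def archive_input(bucket_name="sample"):
    """Step 4: copy every input object to archive_data/, then delete it."""
    import boto3  # available on EMR; imported lazily so the sketch loads without it
    bucket = boto3.resource("s3").Bucket(bucket_name)
    for obj in bucket.objects.filter(Prefix="input_data/"):
        # Server-side copy into the archive prefix, then remove the original.
        bucket.Object(archive_key(obj.key)).copy(
            {"Bucket": bucket_name, "Key": obj.key})
        obj.delete()
```

Note that the copy-then-delete step is not atomic: if the job dies between the Parquet write and the archive, the same input files are appended again on the next run.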

So when new data arrives in s3://sample/input_data, the job processes only the new files and saves them, partitioned, to s3://sample/output_data.

Is there any built-in mechanism AWS EMR provides that I should be aware of, which I could use instead of the last (archive-and-delete) step of my PySpark script?

>Solution :

You could either use Delta Lake for this purpose, or partition your input directory by a time interval, e.g. s3://sample/input_data/year=2021/month=11/day=11/, so that each run only processes data from that interval.
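A minimal helper for the time-partitioned layout suggested above; the zero-padding of month and day, and reading one day behind the current date, are assumptions on top of the answer:

```python
from datetime import date, timedelta


def input_path_for(d: date, base: str = "s3://sample/input_data") -> str:
    """Build the Hive-style daily partition path,
    e.g. s3://sample/input_data/year=2021/month=11/day=11/."""
    return f"{base}/year={d.year}/month={d.month:02d}/day={d.day:02d}/"


# With a SparkSession in scope, process only yesterday's partition:
# df = spark.read.json(input_path_for(date.today() - timedelta(days=1)))
```

With this layout, the archive-and-delete step becomes unnecessary: already-processed days are simply never read again.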
