Is it ok to use a delta table tracker based on parquet file name in Azure databricks?

Today at work I saw a Delta Lake tracker based on file names. By delta tracker, I mean a function that determines whether a parquet file has already been ingested or not.

The code would check which files (from the delta table) have not already been ingested, and the parquet files in the delta table would then be read using this: spark.createDataFrame(path, StringType())
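For illustration, such a tracker looks roughly like the sketch below (the table path, the already_ingested set and the helper function are placeholders, not the actual code):

    from pyspark.sql.types import StringType

    # Placeholder table path and previously ingested file set (assumptions).
    table_path = "/mnt/lake/events"
    already_ingested = load_ingested_file_names()   # hypothetical helper returning a set of paths

    # Parquet files backing the current version of the Delta table.
    current_files = set(spark.read.format("delta").load(table_path).inputFiles())
    new_files = sorted(current_files - already_ingested)

    # The pattern from the question: wrap the new file paths in a
    # single-column string DataFrame before processing them further.
    paths_df = spark.createDataFrame(new_files, StringType())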

Having worked with Delta tables, I don't think it is OK to use a delta tracker that way.


  • In case a record is deleted, what are the chances that the delta log would point to a new file, and that this deleted record would be read as a new one?

  • In case a record is updated, what would be the chance that the delta log would not point to a new file, and that this updated record would not be considered?

  • In case some maintenance is happening on the delta table, what are the chances that some new files are written out of nowhere, which may cause a record to be re-ingested?

Any observation or suggestion on whether it is OK to work that way would be great. Thank you.

>Solution :

In Delta Lake everything works at the file level, so there are no ‘in place’ updates or deletes. Say a single record gets deleted (or updated); then roughly the following happens:

  1. Read in the parquet file with the relevant record (+ the other records which happen to be in the file)
  2. Write out all records except for the deleted record into a new parquet file
  3. Update the transaction log with a new version, marking the old parquet file as removed and the new parquet file as added. Note the old parquet file doesn’t get physically deleted until you run the VACUUM command.

The process for an update is basically the same.
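You can see this for yourself by comparing the set of parquet files backing a small test table before and after a delete; the table path and predicate below are placeholders:

    from delta.tables import DeltaTable

    path = "/tmp/demo_delta_table"   # placeholder test table
    before = set(spark.read.format("delta").load(path).inputFiles())

    # Delete a single record; Delta rewrites the affected parquet file.
    DeltaTable.forPath(spark, path).delete("id = 42")

    after = set(spark.read.format("delta").load(path).inputFiles())
    print("no longer referenced:", before - after)   # the old file
    print("newly referenced:    ", after - before)   # the rewritten file without the deleted record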

To answer your questions more specifically:

In case a record is deleted, what are the chances that the delta log would point to a new file, and that this deleted record would be read as a new one?

The delta log will point to a new file, but the deleted record will not be in there. The new file will contain all the other records which happened to be in the original file.

In case a record is updated, what would be the chance that the delta log would not point to a new file, and that this updated record would not be considered?

Files are not updated in place, so this doesn’t happen. A new file is written containing the updated record (plus any other records in the original file). The transaction log is updated to ‘point’ to this new file.

In case some maintenance is happening on the delta table, what are the chances that some new files are written out of nowhere, which may cause a record to be re-ingested?

This is possible, although not ‘out of nowhere’. For example, if you run OPTIMIZE, existing parquet files get reshuffled/combined to improve performance. Basically this means a number of new parquet files will be written, and a new version in the transaction log will point to these parquet files. If you pick up all new files after this, you will re-ingest data.
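On a (hypothetical) table named events this is easy to reproduce: no rows change, yet a tracker keyed on file names would treat every compacted file as new.

    # Compact small files; the data is unchanged but new parquet files are written.
    spark.sql("OPTIMIZE events")

    # The latest version in the transaction log now references the compacted files.
    spark.sql("DESCRIBE HISTORY events").show(truncate=False)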

Some considerations: if your delta table is append-only, you could use Structured Streaming to read from it instead. If not, Databricks offers Change Data Feed, giving you record-level details of inserts, updates and deletes.
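Both options are straightforward in PySpark; the paths, checkpoint location and starting version below are placeholders:

    # Append-only table: let Structured Streaming track progress via its checkpoint.
    stream = (spark.readStream
              .format("delta")
              .load("/mnt/lake/events"))

    query = (stream.writeStream
             .format("delta")
             .option("checkpointLocation", "/mnt/lake/_checkpoints/events_ingest")
             .start("/mnt/lake/events_ingested"))

    # Table with updates/deletes: read the Change Data Feed instead
    # (requires the table property delta.enableChangeDataFeed = true).
    changes = (spark.read
               .format("delta")
               .option("readChangeFeed", "true")
               .option("startingVersion", 5)   # placeholder version
               .load("/mnt/lake/events"))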
