Read data from mount in Databricks (using Autoloader)

I am using Azure Blob Storage to store data and feeding this data to Autoloader through a mount. I am looking for a way to let Autoloader pick up new files from any mount. Let’s say I have these folders in my mount:

mnt/
├─ blob_container_1
├─ blob_container_2

When I use .load('/mnt/'), no new files are detected. But when I point at the folders individually, for example .load('/mnt/blob_container_1'), it works fine.

I want to load files from both mount paths using Autoloader (running continuously).

Solution:

You can put a glob pattern in the load path so that it matches several directories, for example:

# Auto Loader stream over every directory matched by the glob in the load path
df = spark.readStream.format("cloudFiles") \
  .option("cloudFiles.format", "<format>") \
  .schema(schema) \
  .load("<base_path>/*/files")
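Applied to the layout in the question, this could look like the sketch below. It assumes that both containers are mounted directly under /mnt and hold files of the same format. The helper names mount_glob and read_all_mounts are made up for illustration, and spark, the file format, and the schema are supplied by the caller.

```python
def mount_glob(base_path: str, pattern: str = "*") -> str:
    """Join a base path and a glob pattern, e.g. '/mnt/' + '*' -> '/mnt/*'."""
    return base_path.rstrip("/") + "/" + pattern


def read_all_mounts(spark, file_format: str, schema, base_path: str = "/mnt"):
    """Build an Auto Loader stream over every directory directly under base_path.

    Assumption: all matched directories contain files of the same format
    that fit the same schema.
    """
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", file_format)
        .schema(schema)
        .load(mount_glob(base_path))  # e.g. "/mnt/*" instead of "/mnt/"
    )
```

In a notebook this would be called as read_all_mounts(spark, "<format>", schema). If /mnt also holds unrelated mounts, a narrower pattern such as mount_glob("/mnt", "blob_container_*") may be the safer choice.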

To pick up only PNG files from a directory that contains files with different suffixes, add a file-name filter:

# pathGlobFilter keeps only files whose names match the pattern
df = spark.readStream.format("cloudFiles") \
  .option("cloudFiles.format", "binaryFile") \
  .option("pathGlobFilter", "*.png") \
  .load("<base_path>")

Reference: https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html#filtering-directories-or-files-using-glob-patterns
