Force Glue Crawler to create separate tables

I am continuously adding Parquet data sets to an S3 folder with a structure like this:

s3:::my-bucket/public/data/set1
s3:::my-bucket/public/data/set2
s3:::my-bucket/public/data/set3

At the beginning I only have set1, and my crawler is configured to run on the whole bucket s3:::my-bucket. This leads to the creation of a partitioned table named my-bucket with partitions named public, data and set1. What I actually want is a table named set1 without any partitions.
I see the reasons why this happens, as it is explained under How Does a Crawler Determine When to Create Partitions?. But when a new data set is uploaded (e.g. set2) I don’t want it to be another partition (because it is completely different data with a different schema).
How can I force the Glue crawler to NOT create partitions?
I know I could define the crawler path as s3:::my-bucket/public/data/ but unfortunately I don’t know where the new data sets will be created (e.g. could also be s3:::my-bucket/other/folder/set2).

Any ideas how to solve this?

Solution:

You can use the TableLevelConfiguration option in the crawler configuration to specify at which folder level the crawler should create tables.

More information on that is in the AWS Glue documentation on setting crawler configuration options.
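
As a rough illustration, the option could be applied with boto3 along these lines. This is a minimal sketch, not a tested setup: the helper names and the crawler name `my-crawler` are hypothetical, and the level value 4 assumes that the bucket itself counts as level 1 (so public is 2, data is 3 and set1 is 4), which is how I read the AWS documentation. Check that against your own layout before relying on it.

```python
import json


def build_table_level_configuration(level: int) -> str:
    """Build the crawler Configuration JSON that pins tables to one folder depth."""
    if level < 1:
        raise ValueError("level must be a positive integer")
    return json.dumps(
        {"Version": 1.0, "Grouping": {"TableLevelConfiguration": level}}
    )


def apply_to_crawler(crawler_name: str, level: int) -> None:
    """Update an existing crawler; needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the helper above works without boto3

    glue = boto3.client("glue")
    glue.update_crawler(
        Name=crawler_name,
        Configuration=build_table_level_configuration(level),
    )


# Assumption: bucket (1) / public (2) / data (3) / set1 (4).
print(build_table_level_configuration(4))

# Hypothetical crawler name; uncomment to apply against a real account.
# apply_to_crawler("my-crawler", 4)
```

With this setting, a data set uploaded to s3:::my-bucket/other/folder/set2 sits at the same depth, so it should also end up as its own table. Data sets placed at a different depth would not be covered by a single level value.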
