I am continuously add parquet data sets to an S3 folder with a structure like this:
s3:::my-bucket/public/data/set1
s3:::my-bucket/public/data/set2
s3:::my-bucket/public/data/set3
At the beginning I only have set1 and my crawler is configured to run on the whole bucket s3:::my-bucket. This leads to the creation of a partitioned tabled named my-bucket with partitions named public, data and set1. What I actually want is to have a table named set1 without any partitions.
I see the reasons why this happens, as it is explained under How Does a Crawler Determine When to Create Partitions?. But when a new data set is uploaded (e.g. set2) I don’t want it to be another partition (because it is completely different data with a different schema).
How can I force the Glue crawler to NOT create partitions?
I know I could define the crawler path as s3:::my-bucket/public/data/ but unfortunately I don’t know where the new data sets will be created (e.g. could also be s3:::my-bucket/other/folder/set2).
Any ideas how to solve this?
>Solution :
You can use the TableLevelConfiguration to specify in which folder level the crawler should look for tables.
More information on that here.