PySpark: Compare column strings, grouping them if their alphabetic characters are the same, but avoid merging similar words?
I’m working on a project where I have a PySpark DataFrame with two columns (word, word count) that are string and bigint respectively. The dataset is dirty: some words have a non-letter character attached to them (e.g. ‘date’, ‘[date’, ‘date]’ and ‘_date’ are all separate items but should all be just ‘date’). print(dirty_df.schema) output -> …
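To make the intended grouping rule concrete, here is a minimal plain-Python sketch (not PySpark, and not necessarily the approach the original poster ended up with). It assumes that "letters" means ASCII A-Z/a-z and that every non-letter character should be dropped before grouping; the names `clean_word`, `merge_counts`, and the sample rows are hypothetical. Because grouping is done on an exact match of the cleaned word, similar words such as ‘dated’ stay separate from ‘date’.

```python
import re
from collections import Counter

# Assumption: "letters" means ASCII A-Z/a-z; widen the class for other alphabets.
NON_LETTERS = re.compile(r"[^A-Za-z]")


def clean_word(word: str) -> str:
    """Drop every non-letter character, e.g. '[date' -> 'date'."""
    return NON_LETTERS.sub("", word)


def merge_counts(rows):
    """Sum the counts of rows whose cleaned words are identical."""
    totals = Counter()
    for word, count in rows:
        key = clean_word(word)
        if key:  # skip entries that contained no letters at all
            totals[key] += count
    return dict(totals)


# Hypothetical sample rows standing in for (word, word count) pairs.
sample = [("date", 5), ("[date", 2), ("date]", 1), ("_date", 3), ("dated", 4)]
merged = merge_counts(sample)
print(merged)
```

In PySpark, the same regular expression could presumably be applied to the word column with `pyspark.sql.functions.regexp_replace`, followed by a `groupBy` on the cleaned column and a sum over the count column; the column names would depend on the actual schema, which is truncated in this excerpt.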