pandas dataframe column string parsing


I have a DF in which one of the columns has strings of the form

0 word1|category1 word2|category2

1 word3|category3 word4|category4 word2|category2 ..

2 word1|category1 word4|category4 word3|category3 ..

where "word1|category1 word4|category4 word3|category3 .." is a string

I need an output dictionary that maps each unique word to its category.

I tried using series.apply(ast.literal_eval), but it throws an invalid syntax error.

>Solution :

If you need a dictionary for each row, use a list comprehension with a nested generator expression:

df['col'] = [dict(y.split('|') for y in x.split()) for x in df['col']]
print (df)
0       {'word1': 'category1', 'word2': 'category2'}
1  {'word3': 'category3', 'word4': 'category4', '...
2  {'word1': 'category1', 'word4': 'category4', '...
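For a version that runs on its own, here is a minimal sketch of the per-row approach. The sample frame and the helper name parse_row are assumptions made for illustration; the column name col follows the answer above.

```python
import pandas as pd

# Hypothetical sample data in the same 'word|category' format as the question
df = pd.DataFrame({
    "col": [
        "word1|category1 word2|category2",
        "word3|category3 word4|category4 word2|category2",
        "word1|category1 word4|category4 word3|category3",
    ]
})

def parse_row(text):
    # Split the string on whitespace into 'word|category' tokens,
    # then split each token on the first '|' into a (word, category) pair
    return dict(token.split("|", 1) for token in text.split())

# One dictionary per row, in row order
row_dicts = [parse_row(x) for x in df["col"]]
print(row_dicts[0])
```

Splitting with a limit of 1 is a small deviation from the answer's code: it keeps each token as exactly two parts even if a category happens to contain another '|'.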

Or, if you need one big dictionary built from all the values, use Series.str.split with Series.explode, split each token into a two-column DataFrame, and convert it to a dictionary:

d = df['col'].str.split().explode().str.split('|', expand=True).set_index(0)[1].to_dict()
print (d)
{'word1': 'category1', 'word2': 'category2', 'word3': 'category3', 'word4': 'category4'}
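The chained one-liner can be easier to follow when broken into steps. This is a sketch on assumed sample data; the intermediate names tokens and pairs are only for illustration.

```python
import pandas as pd

# Hypothetical sample data in the same 'word|category' format as the question
df = pd.DataFrame({
    "col": [
        "word1|category1 word2|category2",
        "word3|category3 word4|category4 word2|category2",
        "word1|category1 word4|category4 word3|category3",
    ]
})

# Step 1: split each string on whitespace and put every token on its own row
tokens = df["col"].str.split().explode()

# Step 2: split each token on '|' into two columns (0: word, 1: category)
pairs = tokens.str.split("|", expand=True)

# Step 3: use the word as the index and convert the category column to a dict
d = pairs.set_index(0)[1].to_dict()
print(d)
```

If the same word appears more than once, the resulting dictionary keeps the category from its last occurrence.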

Another alternative:

d = dict(df['col'].str.split().explode().str.split("|").to_numpy())
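Both single-dictionary approaches silently keep only one category per word. If a word might appear with different categories, a check along these lines could flag it first. This is a sketch: the sample data (with a deliberate conflict on word2) and the column names word and category are assumptions.

```python
import pandas as pd

# Hypothetical data: 'word2' appears with two different categories
df = pd.DataFrame({
    "col": [
        "word1|category1 word2|category2",
        "word2|categoryX word3|category3",
    ]
})

# One row per token, split into a word column and a category column
pairs = df["col"].str.split().explode().str.split("|", expand=True)
pairs.columns = ["word", "category"]

# Count the distinct categories seen for each word
counts = pairs.groupby("word")["category"].nunique()

# Words mapped to more than one category
conflicts = counts[counts > 1].index.tolist()
print(conflicts)
```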
