Arranging a dataframe into a list of dictionaries based on columns values

July 22, 2022

This is a follow-up to a question I previously asked regarding how to create dictionaries with specific columns however I realise now I need to be a bit more specifc in what I am asking. Take the below to be dataset.

     Name  Notes  Source            Number  Info
0    Bob    NaN     NaN              g:45   NaN
1  Billy    1.0    Home             B:+67   NaN
2  Billy    1.0    Work               B:3   NaN
3  Billy    NaN     NaN  hhtps://uishiufb   NaN
4  Billy    0.0  School             V9:67   NaN
5   Eric    0.0     NaN             R:+35   NaN
6   Eric    NaN    Home            f-g:35   NaN

I need to create a list of dictionaries based on the actual Numbers in the number column. The data is messy so take, for example, Billy. Rows 1 and 4 have the same numbers but different labels, ‘B:+’ and ‘V9’ respectively. I would like the code to ignore this and only focus on the numbers and if the numbers are the same create a unique dictionary based on that information as you can see below. Also, the data could contain something like a URL i.e completely different and id like it to store that as a separate dictionary. Note – it doesn’t need to necessarily be a URL it’s just an example because the URL could contain numbers which could confuse the issue. The dataset is extensive and so the code to do this would ideally apply to all the rows in the Name column.

It isn’t demonstrated here but some cells in the initial dataset go onto a newline so once I order the data frame into dictionaries and output it as an excel sheet, there are ‘\n’s in the data which I would like to replace with commas. Any ideas for that would be very helpful.

Output:

dict1 = {Name: Bob, Source:[Nan], Number: [g:45]}
dict2 = {Name: Billy, Source:[Home, School], Number: [B:+67,V9:67]}
dict3 = {Name: Billy, Source:[Work], Number: [B:3}
dict4 = {Name: Billy, Source:[Nan], Number: [hhtps://uishiufb]}
dict5 = {Name: Eric, Source:[Home], Number: [R:+35, f-g:35]}

Thank you very much in advance

>Solution :

How does this work for you?

num = df['Number'].str.extract(':\+?(\d+)').fillna(df['Number']).squeeze()
df.groupby(['Name', num])[['Source', 'Number']].agg(list)\
  .droplevel(1).reset_index().to_dict('records')

Output:

[{'Name': 'Billy', 'Source': ['Work'], 'Number': ['B:3']},
 {'Name': 'Billy', 'Source': ['Home', 'School'], 'Number': ['B:+67', 'V9:67']},
 {'Name': 'Billy', 'Source': [nan], 'Number': ['hhtps://uishiufb']},
 {'Name': 'Bob', 'Source': [nan], 'Number': ['g:45']},
 {'Name': 'Eric', 'Source': [nan, 'Home'], 'Number': ['R:+35', 'f-g:35']}]