Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Arranging a dataframe into a list of dictionaries based on columns values

This is a follow-up to a question I previously asked regarding how to create dictionaries with specific columns however I realise now I need to be a bit more specifc in what I am asking. Take the below to be dataset.

     Name  Notes  Source            Number  Info
0    Bob    NaN     NaN              g:45   NaN
1  Billy    1.0    Home             B:+67   NaN
2  Billy    1.0    Work               B:3   NaN
3  Billy    NaN     NaN  hhtps://uishiufb   NaN
4  Billy    0.0  School             V9:67   NaN
5   Eric    0.0     NaN             R:+35   NaN
6   Eric    NaN    Home            f-g:35   NaN

I need to create a list of dictionaries based on the actual Numbers in the number column. The data is messy so take, for example, Billy. Rows 1 and 4 have the same numbers but different labels, ‘B:+’ and ‘V9’ respectively. I would like the code to ignore this and only focus on the numbers and if the numbers are the same create a unique dictionary based on that information as you can see below. Also, the data could contain something like a URL i.e completely different and id like it to store that as a separate dictionary. Note – it doesn’t need to necessarily be a URL it’s just an example because the URL could contain numbers which could confuse the issue. The dataset is extensive and so the code to do this would ideally apply to all the rows in the Name column.

It isn’t demonstrated here but some cells in the initial dataset go onto a newline so once I order the data frame into dictionaries and output it as an excel sheet, there are ‘\n’s in the data which I would like to replace with commas. Any ideas for that would be very helpful.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Output:

dict1 = {Name: Bob, Source:[Nan], Number: [g:45]}
dict2 = {Name: Billy, Source:[Home, School], Number: [B:+67,V9:67]}
dict3 = {Name: Billy, Source:[Work], Number: [B:3}
dict4 = {Name: Billy, Source:[Nan], Number: [hhtps://uishiufb]}
dict5 = {Name: Eric, Source:[Home], Number: [R:+35, f-g:35]}

Thank you very much in advance

>Solution :

How does this work for you?

num = df['Number'].str.extract(':\+?(\d+)').fillna(df['Number']).squeeze()
df.groupby(['Name', num])[['Source', 'Number']].agg(list)\
  .droplevel(1).reset_index().to_dict('records')

Output:

[{'Name': 'Billy', 'Source': ['Work'], 'Number': ['B:3']},
 {'Name': 'Billy', 'Source': ['Home', 'School'], 'Number': ['B:+67', 'V9:67']},
 {'Name': 'Billy', 'Source': [nan], 'Number': ['hhtps://uishiufb']},
 {'Name': 'Bob', 'Source': [nan], 'Number': ['g:45']},
 {'Name': 'Eric', 'Source': [nan, 'Home'], 'Number': ['R:+35', 'f-g:35']}]
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading