Pandas: Concatenating strings, removing duplicates and blanks within strings, counting resulting elements, row by row

I have a very large dataframe that I’m trying to manipulate part of. Essentially I have a single column of data (called ‘REFIX_LIST’) where each cell contains a string of numbers separated by commas. I have a 3-part problem. I’m trying to

  • (1) concatenate these strings in a new column called
    ‘UPDATED_REFIX_LIST’ grouped by the columns ‘RECORDING_SESSION_LABEL’
    and ‘trial’ so I can then

  • (2) delete duplicates and blanks within the string (row by row) and
    then

  • (3) count the resulting elements in the string.

So far Step 1 is working for me, though I’m not sure I’m going about this in the best way – I’m still new here, but we’re getting there! Here is my code for doing that:

# Function to concatenate strings into a single string
def concat_strings(group):
    return ', '.join(group)

# Grouping and concatenating 'REFIX_LIST' into a single string for each group (Pandas series)
df1['UPDATED_REFIX_LIST'] = df1.groupby(['RECORDING_SESSION_LABEL', 'trial'])['REFIX_LIST'].transform(concat_strings)
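The key behavior here is that `transform` broadcasts the per-group result back onto every row of the group, which is why each row of a group ends up with the same concatenated string. A minimal sketch of the same pattern on toy data (the column names `g`, `v`, and `joined` are illustrative, not from the question):

```python
import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b'], 'v': ['1', '2', '3']})

# transform applies the function to each group's Series and broadcasts
# the single joined string back to every row of that group
df['joined'] = df.groupby('g')['v'].transform(', '.join)

print(df['joined'].tolist())  # ['1, 2', '1, 2', '3']
```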

Step 2 is where I’m stuck. I’ve tried numerous things (too many to list, and I’m honestly not sure how all of them are supposed to work) and I haven’t been successful in deleting duplicates and blanks from the new string(s). I think part of my problem is that the code I’ve tried looks for duplicate strings within the larger string, rather than duplicate numbers within the larger string as a whole. This could be a product of my code above and how I’m going about this, but I’m not sure.

Step 3 I haven’t been able to attempt because I can’t get Step 2 to work yet.

Below is some sample data/code and the expected output. While my real data comes from csv files, the code below accurately represents what I have up to this point.

import pandas as pd

# Define the data for DF1
data_df1 = {
    'RECORDING_SESSION_LABEL': [101, 101, 101, 101, 102, 102],
    'trial': [1, 1, 1, 2, 1, 2],
    'CURRENT_FIX_INDEX': [1, 2, 3, 4, 1, 2],
    'REFIX_LIST': ['', '7,8,10', '7,8', '6,4', '1,2', '1,3,4'],
    'UPDATED_REFIX_LIST': [',7,8,10,7,8', ',7,8,10,7,8', ',7,8,10,7,8', '6,4', '1,2,1,3,4', '1,2,1,3,4']
}

# Create DF1
df1 = pd.DataFrame(data_df1)

#print
print(df1)

This is what my data looks like after importing the csv file and running the first block of code above. After dropping duplicates and blanks (again, row by row), I need the data to look like this (I used DF2 here just for separation; in practice this would all still be DF1):

# Define the data for DF2
data_df2 = {
    'RECORDING_SESSION_LABEL': [101, 101, 101, 101, 102, 102],
    'trial': [1, 1, 1, 2, 1, 2],
    'CURRENT_FIX_INDEX': [1, 2, 3, 4, 1, 2],
    'REFIX_LIST': ['', '7,8,10', '7,8', '6,4', '1,2', '1,3,4'],
    'UPDATED_REFIX_LIST': ['7,8,10', '7,8,10', '7,8,10', '6,4', '1,2,3,4', '1,2,3,4']
}

# Create DF2
df2 = pd.DataFrame(data_df2)

#print
print(df2)
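One way to get from the first table to this one is to clean the already-concatenated column row by row. A sketch of the idea, assuming the plain-comma format shown above; the helper name `dedupe` is mine, and `dict.fromkeys` is used because it drops duplicates while preserving first-seen order:

```python
def dedupe(s):
    # split on commas, drop blank pieces, keep the first occurrence of each number
    return ','.join(dict.fromkeys(x for x in s.split(',') if x))

print(dedupe(',7,8,10,7,8'))  # 7,8,10
print(dedupe('1,2,1,3,4'))    # 1,2,3,4
```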

Lastly, I need to count the resulting elements in the ‘UPDATED_REFIX_LIST’ column and create a new column called ‘UPDATED_REFIX_COUNT’. The final output should look like this (still in DF1):

# Define the data for DF3
data_df3 = {
    'RECORDING_SESSION_LABEL': [101, 101, 101, 101, 102, 102],
    'trial': [1, 1, 1, 2, 1, 2],
    'CURRENT_FIX_INDEX': [1, 2, 3, 4, 1, 2],
    'REFIX_LIST': ['', '7,8,10', '7,8', '6,4', '1,2', '1,3,4'],
    'UPDATED_REFIX_LIST': ['7,8,10', '7,8,10', '7,8,10', '6,4', '1,2,3,4', '1,2,3,4'],
    'UPDATED_REFIX_COUNT': [3, 3, 3, 2, 4, 4]
}

# Create DF3
df3 = pd.DataFrame(data_df3)

print(df3)
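Once the string is clean, counting its elements reduces to splitting on commas and discarding blanks. A sketch with an illustrative helper (`count_elems` is my name, not from the question):

```python
def count_elems(s):
    # number of comma-separated elements; an empty string has zero
    return len([x for x in s.split(',') if x])

print(count_elems('7,8,10'))  # 3
print(count_elems('6,4'))     # 2
print(count_elems(''))        # 0
```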

Any help is greatly appreciated!! Thank you so much!

Solution:

Just change your initial function so that it drops blanks and duplicates while concatenating:

import re

# Function to concatenate strings into a single string
# without duplicates
def concat_strings(group):
    return ', '.join(dict.fromkeys(x for s in group
                                   for x in re.split(', *', s)
                                   if x))

# Grouping and concatenating 'REFIX_LIST' into a single string for each group
df1['UPDATED_REFIX_LIST'] = (df1.groupby(['RECORDING_SESSION_LABEL', 'trial'])
                                 ['REFIX_LIST']
                                .transform(concat_strings)
                            )

# hack to count without splitting
# count the number of "," and add 1 if not an empty string
df1['UPDATED_REFIX_COUNT'] = (df1['UPDATED_REFIX_LIST'].str.count(',')
                              .add(df1['UPDATED_REFIX_LIST'].ne(''))
                             )

Output:

   RECORDING_SESSION_LABEL  trial  CURRENT_FIX_INDEX REFIX_LIST UPDATED_REFIX_LIST  UPDATED_REFIX_COUNT
0                      101      1                  1                      7, 8, 10                    3
1                      101      1                  2     7,8,10           7, 8, 10                    3
2                      101      1                  3        7,8           7, 8, 10                    3
3                      101      2                  4        6,4               6, 4                    2
4                      102      1                  1        1,2               1, 2                    2
5                      102      2                  2      1,3,4            1, 3, 4                    3
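The counting trick above can be sanity-checked in isolation: for a comma-separated string, the element count is the number of commas plus one, except that an empty string has zero elements. The boolean `s != ''` supplies that plus-one, since Python booleans behave as 0/1 in arithmetic:

```python
for s in ['7, 8, 10', '6, 4', '']:
    count = s.count(',') + (s != '')  # commas + 1, or 0 for an empty string
    print(repr(s), '->', count)
# '7, 8, 10' -> 3
# '6, 4' -> 2
# '' -> 0
```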