Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to count first occurence of a value in a dataframe of lists in Python?

I have a dataframe which has one column of lists and one column with the number of objects in that list and is sorted by its descending order.
For example:

df=pd.DataFrame({'value':[['AB','BC','CD','DE','EF','FG','GH','HI'],
                          ['BC','CD','DE','IJ','JK','KL','LM'],
                          ['AB','CD','DE','MN'],
                          ['C', 'D', 'M'],
                          ['MN','NO'],
                          ['APQ']],
                 'no_of_values': [8,7,4,3,2,1]})

I would like to have a 3rd column with number of values that occur for the first time (from top to bottom). For example:

df_goal=pd.DataFrame({'value':[['AB','BC','CD','DE','EF','FG','GH','HI'],
                               ['BC','CD','DE','IJ','JK','KL','LM'],
                               ['AB','CD','DE','MN'],
                               ['C', 'D', 'M'],
                               ['MN','NO'],
                               ['APQ']],
                      'no_of_values': [8,7,4,3,2,1],
                      'no_of_1st_occurence': [8,4,1,3,1,1]})

My approach was to go through each row of the ‘value’ column and add each value, that is not already part of it, to a ‘non_redundant_list’ . If the letter is not already part of it, the column ‘no_of_1st_occurence’ should also go +1 in the respective row.
I tried it as follows:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

df['no_of_1st_occurence'] = 0
non_redundant_list = []

for index in df.index:
    for list in df['value'][index]:
        for value in list:
            if not value in non_redundant_list:
                non_redundant_list.append(value)
                df['no_of_1st_occurence'][index] += 1  

However this somehow only checks for first occurence of every single letter and not of the objects in the lists. How do I have to adapt my code to work or is there any more simple solution?

>Solution :

You can do it in a much more efficient way using explode:

df['no_of_1st_occurence'] = (~df['value'].explode().duplicated()).groupby(level=0).sum()

Output:

>>> df
                              value  no_of_values  no_of_1st_occurence
0  [AB, BC, CD, DE, EF, FG, GH, HI]             8                    8
1      [BC, CD, DE, IJ, JK, KL, LM]             7                    4
2                  [AB, CD, DE, MN]             4                    1
3                         [C, D, M]             3                    3
4                          [MN, NO]             2                    1
5                             [APQ]             1                    1
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading