pandas: add a string to certain values in a comma separated column if values exist in a list

I have a pandas dataframe as follows,

import pandas as pd
import numpy as np

df = pd.DataFrame({'text':['this is the good student','she wears a beautiful green dress','he is from a friendly family of four','the house is empty','the number four five is new'],
               'labels':['O,O,O,ADJ,O','O,O,O,ADJ,ADJ,O','O,O,O,O,ADJ,O,O,NUM','O,O,O,O','O,O,NUM,NUM,O,O']})

I would like to add a 'B-' prefix to an ADJ or NUM label when it starts a run (the label before it is different), and an 'I-' prefix when it repeats the label immediately before it. Here is my desired output,

output:

                                   text                   labels
0              this is the good student            O,O,O,B-ADJ,O
1     she wears a beautiful green dress      O,O,O,B-ADJ,I-ADJ,O
2  he is from a friendly family of four  O,O,O,O,B-ADJ,O,O,B-NUM
3                    the house is empty                  O,O,O,O
4           the number four five is new      O,O,B-NUM,I-NUM,O,O

So far I have created a list of the unique labels:

unique_labels = (np.unique(sum(df["labels"].str.split(',').dropna().to_numpy(), []))).tolist()
unique_labels.remove('O') # no changes required for O label

and tried to add the 'B-' prefix first, but got an error (ValueError: Must have equal len keys and value when setting with an iterable):

for x in unique_labels:
    df.loc[df["labels"].str.contains(x), "labels"]= ['B-' + x for x in df["labels"]]

Solution:

Try:

from itertools import groupby


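def fn(x):
    out = []
    # groupby yields one (label, run) pair per stretch of consecutive equal labels
    for k, g in groupby(map(str.strip, x.split(","))):
        if k == "O":
            # "O" labels are copied through unchanged
            out.extend(g)
        else:
            # the first label of a run gets "B-", the rest of the run gets "I-"
            out.append(f"B-{next(g)}")
            out.extend([f"I-{val}" for val in g])
    return ",".join(out)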


df["labels"] = df["labels"].apply(fn)
print(df)

Prints:

                                   text                   labels
0              this is the good student            O,O,O,B-ADJ,O
1     she wears a beautiful green dress      O,O,O,B-ADJ,I-ADJ,O
2  he is from a friendly family of four  O,O,O,O,B-ADJ,O,O,B-NUM
3                    the house is empty                  O,O,O,O
4           the number four five is new      O,O,B-NUM,I-NUM,O,O
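
The approach above relies on itertools.groupby grouping only consecutive equal items, which is what makes "first in a run" and "repeat" easy to tell apart. As a rough illustration, the sketch below prints the runs for two of the label strings from the question; label_runs is a hypothetical helper name used only here, not part of the answer.

```python
from itertools import groupby


def label_runs(labels):
    """Return (label, run_length) pairs for consecutive equal labels.

    Hypothetical helper for illustration only.
    """
    return [(key, len(list(run))) for key, run in groupby(labels.split(","))]


print(label_runs("O,O,NUM,NUM,O,O"))  # [('O', 2), ('NUM', 2), ('O', 2)]
print(label_runs("O,O,O,ADJ,ADJ,O"))  # [('O', 3), ('ADJ', 2), ('O', 1)]
```

Each non-"O" run of length n becomes one "B-" label followed by n - 1 "I-" labels. One consequence is that two adjacent labels of the same type are always treated as a single run, which matches the desired output in the question.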