Label any text with multiple topics in sequence of their occurrence

I have a DataFrame with an ID and Text like below:

df1

ID  Text
1   I have completed my order
2   I have made the payment. When can I expect the order to be delivered?
3   I am unable to make the payment.
4   I am done with registration and payment. I need the order number?
5   I am unable to complete registration. How will I even order?

I have certain topics to classify these texts:
classes = ["order", "payment", "registration"]

I am currently doing the following, which gets me the results:

import pandas as pd

classes = ["order", "payment", "registration"]
field = "Text"
for c in classes:
    # keep only the rows of df1 whose text mentions the current topic
    df2 = df1[df1[field].str.contains(c)]
    print(c)
    # write one CSV per topic, e.g. ./order.csv
    df2.to_csv("./" + c + ".csv", index=False)

This generates three CSV files, which I later join again:

import os
import pandas as pd

file_list = []
os.chdir('<file path>')

for file in os.listdir():
    if file.endswith('.csv'):
        df = pd.read_csv(file, sep=",", encoding='ISO-8859-1')
        df['filename'] = file  # remember which topic file each row came from
        file_list.append(df)

df_topic = pd.concat(file_list, ignore_index=True)
# the topic is the file name without its extension, e.g. "order.csv" -> "order"
df_topic['topic'] = df_topic['filename'].str.split('.').str[0]
df_topic = df_topic.drop(columns='filename')

The resultant DataFrame looks like this:

ID  Text                                                                   Topic
1   I have completed my order                                              order
2   I have made the payment. When can I expect the order to be delivered?  order
4   I am done with registration and payment. I need the order number?      order
5   I am unable to complete registration. How will I even order?           order
2   I have made the payment. When can I expect the order to be delivered?  payment
3   I am unable to make the payment.                                       payment
4   I am done with registration and payment. I need the order number?      payment
4   I am done with registration and payment. I need the order number?      registration
5   I am unable to complete registration. How will I even order?           registration
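
For comparison, here is a minimal sketch of how the same long-format frame could be built in memory, without the round trip through CSV files. It assumes the sample data from the question; the name `parts` is only illustrative. Note that ID 5 also lands under "order", because its text contains that word.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Text": [
        "I have completed my order",
        "I have made the payment. When can I expect the order to be delivered?",
        "I am unable to make the payment.",
        "I am done with registration and payment. I need the order number?",
        "I am unable to complete registration. How will I even order?",
    ],
})
classes = ["order", "payment", "registration"]

# One filtered copy of df1 per topic, tagged with the topic name.
# regex=False treats each topic as a plain substring.
parts = [
    df1[df1["Text"].str.contains(c, regex=False)].assign(Topic=c)
    for c in classes
]
df_topic = pd.concat(parts, ignore_index=True)
print(df_topic)
```

This still produces one row per (text, topic) pair, so it has the same duplication problem described next.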

The problem is that the same text may contain keywords for more than one class and so be tagged with several topics (the text for id=2, for example, contains both order and payment). I can only have one record per id, so I would prefer to label each text with a Primary and a Secondary topic, based on the order in which the keywords first occur from the beginning of the text. If a text has more than two topics, the first two take precedence, but in case the third (or nth) topic is needed in the future, I would like to store all of them as a list in the final field. (id = 4 illustrates this.)

ID  Text                                                                   Primary Topic  Secondary Topic  Identified Topics  Topics List
1   I have completed my order                                              order          null             1                  [order]
2   I have made the payment. When can I expect the order to be delivered?  payment        order            2                  [payment,order]
3   I am unable to make the payment.                                       payment        null             1                  [payment]
4   I am done with registration and payment. I need the order number?      registration   payment          3                  [registration,payment,order]
5   I am unable to complete registration. How will I even order?           registration   order            2                  [registration,order]

Is it possible to do it this way? If not, what is a good way to approach such labelling issues?

Solution:

IIUC, you could use str.extractall combined with GroupBy.agg:

lst = ["order", "payment", "registration"]
regex = f'({"|".join(lst)})'  # if lst contains special chars, wrap in re.escape
df2 = df.join(df['Text']
              .str.extractall(regex)[0]
              .groupby(level=0).agg(**{'Primary Topic': 'first',
                                       'Secondary Topic': lambda x: x.iloc[1] if len(x)>1 else 'null',
                                       'Identified Topics': 'nunique',
                                       'Topics List': list})
               )

output:

   ID                                                                   Text Primary Topic Secondary Topic  Identified Topics                     Topics List
0   1                                              I have completed my order         order            null                  1                         [order]
1   2  I have made the payment. When can I expect the order to be delivered?       payment           order                  2                [payment, order]
2   3                                       I am unable to make the payment.       payment            null                  1                       [payment]
3   4      I am done with registration and payment. I need the order number?  registration         payment                  3  [registration, payment, order]
4   5           I am unable to complete registration. How will I even order?  registration           order                  2           [registration, order]
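
One caveat, offered as a sketch rather than as part of the answer above: with str.extractall, a keyword that is repeated in a text is matched every time it appears, so the second match could be the same topic as the first. Assuming each topic should be counted once, in order of first occurrence, something like the following might work. The helper name `ordered_topics` and the first sample row are made up for illustration.

```python
import re
import pandas as pd

topics = ["order", "payment", "registration"]
# re.escape guards against topics that contain regex metacharacters.
pattern = re.compile("|".join(re.escape(t) for t in topics))

def ordered_topics(text):
    """Return each topic found in text once, in order of first occurrence."""
    # dict.fromkeys keeps insertion order and drops later repeats.
    return list(dict.fromkeys(pattern.findall(text)))

df = pd.DataFrame({
    "ID": [1, 2],
    "Text": [
        "I placed an order. The order needs a payment.",  # made-up sample row
        "I am unable to make the payment.",
    ],
})

found = df["Text"].apply(ordered_topics)
df["Primary Topic"] = found.str[0]
df["Secondary Topic"] = found.str[1]      # NaN when there is no second topic
df["Identified Topics"] = found.str.len()
df["Topics List"] = found
print(df)
```

Here the first row gets "order" as primary and "payment" as secondary, even though "order" occurs twice before "payment".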