I have a dataframe df:
>>> print(df)
Domain[FT]
0 DOMAIN 23..304; /note="Protein kinase"; /evidence="ECO:0000255|PROSITE-ProRule:PRU00159, ECO:0000305"; DOMAIN 324..406; /note="RRM"; /evidence="ECO:0000255|PROSITE-ProRule:PRU00176, ECO:0000305"
1 DOMAIN 30..127; /note="Ig-like C2-type 1"; DOMAIN 151..213; /note="Ig-like C2-type 2"; DOMAIN 219..326; /note="Ig-like C2-type 3"; DOMAIN 331..415; /note="Ig-like C2-type 4"; DOMAIN 422..552; /note="Ig-like C2-type 5"; DOMAIN 555..671; /note="Ig-like C2-type 6"; DOMAIN 678..764; /note="Ig-like C2-type 7"; DOMAIN 845..1173; /note="Protein kinase"; /evidence="ECO:0000255|PROSITE-ProRule:PRU00159"
2 DOMAIN 80..653; /note="Protein kinase"; /evidence="ECO:0000255|PROSITE-ProRule:PRU00159"
3 DOMAIN 32..327; /note="Protein kinase"; /evidence="ECO:0000255|PROSITE-ProRule:PRU00159"
4 DOMAIN 456..734; /note="Protein kinase"; /evidence="ECO:0000255|PROSITE-ProRule:PRU00159"
.. ...
540 DOMAIN 58..313; /note="Protein kinase"; /evidence="ECO:0000255|PROSITE-ProRule:PRU00159"
541 DOMAIN 18..109; /note="PB1"; /evidence="ECO:0000255|PROSITE-ProRule:PRU01081"; DOMAIN 166..409; /note="Protein kinase"; /evidence="ECO:0000255|PROSITE-ProRule:PRU00159"
542 DOMAIN 102..367; /note="Protein kinase"; /evidence="ECO:0000255|PROSITE-ProRule:PRU00159"
543 DOMAIN 77..343; /note="Protein kinase"; /evidence="ECO:0000255|PROSITE-ProRule:PRU00159, ECO:0000269|PubMed:9092543"; DOMAIN 344..414; /note="AGC-kinase C-terminal"; /evidence="ECO:0000255|PROSITE-ProRule:PRU00618"; DOMAIN 1082..1201; /note="PH"; /evidence="ECO:0000255|PROSITE-ProRule:PRU00145"; DOMAIN 1227..1499; /note="CNH"; /evidence="ECO:0000255|PROSITE-ProRule:PRU00795"; DOMAIN 1571..1584; /note="CRIB"; /evidence="ECO:0000255|PROSITE-ProRule:PRU00057"
544 DOMAIN 98..355; /note="Protein kinase"; /evidence="ECO:0000255|PROSITE-ProRule:PRU00159"; DOMAIN 356..431; /note="AGC-kinase C-terminal"; /evidence="ECO:0000255|PROSITE-ProRule:PRU00618"
[545 rows x 1 columns]
What I am trying to do is collect only the domain values (e.g. DOMAIN 77..343) that come immediately before /note="Protein kinase" in this Domain[FT] column.
Additionally, I want to store the domain start and domain end values in two separate columns in a different dataframe, df_domain.
I'm trying to get the end result to look something like this:
>>> print(df_domain)
domain_start domain_end
0 23 304
1 845 1173
2 80 653
3 32 327
4 456 734
.. ... ...
540 58 313
541 166 409
542 102 367
543 77 343
544 98 355
[545 rows x 2 columns]
I'm not sure how to go about doing something like this.
>Solution :
You can achieve this using str.extract (note that the dots are escaped so they match literal "." characters rather than any character):
>>> df["Domain[FT]"].str.extract(r'DOMAIN (\d+)\.\.(\d+); /note="Protein kinase"')
0 1
0 23 304
1 845 1173
2 80 653
3 32 327
4 456 734
Full example so you get your column names:
new_df = df["Domain[FT]"].str.extract(r'DOMAIN (\d+)\.\.(\d+); /note="Protein kinase"')
new_df.columns = ["domain_start", "domain_end"]
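As a variation, here is a minimal self-contained sketch that uses named capture groups so that str.extract produces the column names directly. The three sample rows are shortened stand-ins for the real data, and dropping rows without a "Protein kinase" domain is an assumption about what you want; str.extract returns strings (and NaN where nothing matches), so the values are converted to integers at the end.

```python
import pandas as pd

# Shortened sample rows in the same format as the Domain[FT] column.
df = pd.DataFrame({
    "Domain[FT]": [
        'DOMAIN 23..304; /note="Protein kinase"; DOMAIN 324..406; /note="RRM"',
        'DOMAIN 18..109; /note="PB1"; DOMAIN 166..409; /note="Protein kinase"',
        'DOMAIN 30..127; /note="Ig-like C2-type 1"',  # no kinase domain here
    ]
})

# Named groups become the column names; \.\. matches the literal "..".
pattern = (
    r'DOMAIN (?P<domain_start>\d+)\.\.(?P<domain_end>\d+); '
    r'/note="Protein kinase"'
)

# str.extract keeps only the first match per row; non-matching rows are NaN.
df_domain = df["Domain[FT]"].str.extract(pattern)

# Assumption: rows without a kinase domain are not wanted, so drop them
# before converting the extracted strings to integers.
df_domain = df_domain.dropna().astype(int)

print(df_domain)
```

The original row index is preserved, so df_domain can still be aligned with df. If a row can contain more than one "Protein kinase" domain, str.extractall with the same pattern would return every match instead of only the first.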