Extracting specific data from a column using regular expressions

I have a dataframe df:

>>> print(df)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 Domain[FT]
0                                                                                                                                                                                                                                                                                    DOMAIN 23..304;  /note="Protein kinase";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00159, ECO:0000305"; DOMAIN 324..406;  /note="RRM";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00176, ECO:0000305"
1                                                                    DOMAIN 30..127;  /note="Ig-like C2-type 1"; DOMAIN 151..213;  /note="Ig-like C2-type 2"; DOMAIN 219..326;  /note="Ig-like C2-type 3"; DOMAIN 331..415;  /note="Ig-like C2-type 4"; DOMAIN 422..552;  /note="Ig-like C2-type 5"; DOMAIN 555..671;  /note="Ig-like C2-type 6"; DOMAIN 678..764;  /note="Ig-like C2-type 7"; DOMAIN 845..1173;  /note="Protein kinase";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00159"
2                                                                                                                                                                                                                                                                                                                                                                                                DOMAIN 80..653;  /note="Protein kinase";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00159"
3                                                                                                                                                                                                                                                                                                                                                                                                DOMAIN 32..327;  /note="Protein kinase";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00159"
4                                                                                                                                                                                                                                                                                                                                                                                               DOMAIN 456..734;  /note="Protein kinase";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00159"
..                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ...
540                                                                                                                                                                                                                                                                                                                                                                                              DOMAIN 58..313;  /note="Protein kinase";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00159"
541                                                                                                                                                                                                                                                                                                            DOMAIN 18..109;  /note="PB1";  /evidence="ECO:0000255|PROSITE-ProRule:PRU01081"; DOMAIN 166..409;  /note="Protein kinase";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00159"
542                                                                                                                                                                                                                                                                                                                                                                                             DOMAIN 102..367;  /note="Protein kinase";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00159"
543  DOMAIN 77..343;  /note="Protein kinase";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00159, ECO:0000269|PubMed:9092543"; DOMAIN 344..414;  /note="AGC-kinase C-terminal";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00618"; DOMAIN 1082..1201;  /note="PH";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00145"; DOMAIN 1227..1499;  /note="CNH";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00795"; DOMAIN 1571..1584;  /note="CRIB";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00057"
544                                                                                                                                                                                                                                                                                          DOMAIN 98..355;  /note="Protein kinase";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00159"; DOMAIN 356..431;  /note="AGC-kinase C-terminal";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00618"

[545 rows x 1 columns]

What I am trying to do is collect only the domain ranges (e.g. DOMAIN 77..343) that come directly before /note="Protein kinase" in this Domain[FT] column.
Additionally, I want to store the start and end values of each range in two separate columns in a different dataframe, df_domain.

I'm trying to get the end result to look something like this:

>>> print(df_domain)
     domain_start  domain_end
0              23         304
1             845        1173
2              80         653
3              32         327
4             456         734
..            ...         ...
540            58         313
541           166         409
542           102         367
543            77         343
544            98         355

[545 rows x 2 columns]

I'm not sure how to go about doing something like this.

Solution:

You can achieve this using str.extract, which returns the capture groups of the first match in each row:

>>> df["Domain[FT]"].str.extract(r'DOMAIN (\d+)\.\.(\d+);  /note="Protein kinase"')
     0     1
0   23   304
1  845  1173
2   80   653
3   32   327
4  456   734
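
One detail worth keeping in mind: in a regular expression an unescaped `.` matches any character, so writing `..` between the two numbers accepts more than two literal dots. The sketch below uses Python's standard `re` module to compare a loose pattern with one that escapes the dots; the malformed input string is made up purely for illustration.

```python
import re

# Loose pattern: each "." matches ANY character.
loose = re.compile(r'DOMAIN (\d+)..(\d+);')
# Strict pattern: "\.\." matches exactly two literal dots.
strict = re.compile(r'DOMAIN (\d+)\.\.(\d+);')

well_formed = 'DOMAIN 23..304;'
malformed = 'DOMAIN 12345;'  # hypothetical input with no ".." separator

# Both patterns agree on well-formed input.
print(loose.search(well_formed).groups())   # ('23', '304')
print(strict.search(well_formed).groups())  # ('23', '304')

# The loose pattern backtracks and splits the single number.
print(loose.search(malformed).groups())     # ('12', '5')
# The strict pattern correctly finds nothing.
print(strict.search(malformed))             # None
```

On data shaped like the column in the question both patterns should give the same result, so escaping the dots is mainly a safeguard against unexpected input.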

Full example so you get your column names:

# Escape the dots so that only a literal ".." separates start and end.
new_df = df["Domain[FT]"].str.extract(r'DOMAIN (\d+)\.\.(\d+);  /note="Protein kinase"')
new_df.columns = ["domain_start", "domain_end"]
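
As a possible variation, named capture groups let str.extract produce the column names directly. The following is a minimal, self-contained sketch: the three sample rows are shortened from the data in the question, and dropping rows without a protein kinase domain and converting to integers are assumptions about what is wanted, not part of the original request.

```python
import pandas as pd

# Small sample mimicking the "Domain[FT]" column (rows shortened for illustration).
df = pd.DataFrame({
    "Domain[FT]": [
        'DOMAIN 23..304;  /note="Protein kinase";  /evidence="ECO:0000255"; '
        'DOMAIN 324..406;  /note="RRM"',
        'DOMAIN 30..127;  /note="Ig-like C2-type 1"; '
        'DOMAIN 845..1173;  /note="Protein kinase"',
        'DOMAIN 18..109;  /note="PB1"',  # no protein kinase domain
    ]
})

# Named groups become the column names of the extracted DataFrame.
pattern = (
    r'DOMAIN (?P<domain_start>\d+)\.\.(?P<domain_end>\d+);'
    r'  /note="Protein kinase"'
)

# Rows without a match contain NaN in both columns.
df_domain = df["Domain[FT]"].str.extract(pattern)

# Assumption: drop unmatched rows and store the positions as integers.
df_domain = df_domain.dropna().astype(int)
print(df_domain)
```

Note that str.extract only returns the first match per row. If a row could contain more than one protein kinase domain, str.extractall with the same pattern would return one row per match instead.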