trying to extract specific data from a column using regular expressions

January 18, 2022

I have a dataframe df

>>> print(df)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 Domain[FT]
0                                                                                                                                                                                                                                                                                    DOMAIN 23..304;  /note="Protein kinase";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00159, ECO:0000305"; DOMAIN 324..406;  /note="RRM";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00176, ECO:0000305"
1                                                                    DOMAIN 30..127;  /note="Ig-like C2-type 1"; DOMAIN 151..213;  /note="Ig-like C2-type 2"; DOMAIN 219..326;  /note="Ig-like C2-type 3"; DOMAIN 331..415;  /note="Ig-like C2-type 4"; DOMAIN 422..552;  /note="Ig-like C2-type 5"; DOMAIN 555..671;  /note="Ig-like C2-type 6"; DOMAIN 678..764;  /note="Ig-like C2-type 7"; DOMAIN 845..1173;  /note="Protein kinase";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00159"
2                                                                                                                                                                                                                                                                                                                                                                                                DOMAIN 80..653;  /note="Protein kinase";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00159"
3                                                                                                                                                                                                                                                                                                                                                                                                DOMAIN 32..327;  /note="Protein kinase";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00159"
4                                                                                                                                                                                                                                                                                                                                                                                               DOMAIN 456..734;  /note="Protein kinase";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00159"
..                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ...
540                                                                                                                                                                                                                                                                                                                                                                                              DOMAIN 58..313;  /note="Protein kinase";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00159"
541                                                                                                                                                                                                                                                                                                            DOMAIN 18..109;  /note="PB1";  /evidence="ECO:0000255|PROSITE-ProRule:PRU01081"; DOMAIN 166..409;  /note="Protein kinase";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00159"
542                                                                                                                                                                                                                                                                                                                                                                                             DOMAIN 102..367;  /note="Protein kinase";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00159"
543  DOMAIN 77..343;  /note="Protein kinase";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00159, ECO:0000269|PubMed:9092543"; DOMAIN 344..414;  /note="AGC-kinase C-terminal";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00618"; DOMAIN 1082..1201;  /note="PH";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00145"; DOMAIN 1227..1499;  /note="CNH";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00795"; DOMAIN 1571..1584;  /note="CRIB";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00057"
544                                                                                                                                                                                                                                                                                          DOMAIN 98..355;  /note="Protein kinase";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00159"; DOMAIN 356..431;  /note="AGC-kinase C-terminal";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00618"

[545 rows x 1 columns]

What I am trying to do is collect only the domain values (e.g. DOMAIN 77..343) that come immediately before /note="Protein kinase" in the Domain[FT] column.
Additionally, I want to store the domain start and end values in two separate columns of a different dataframe, df_domain.

I'm trying to get the end result to look something like this:

>>> print(df_domain)
     domain_start  domain_end
0              23         304
1             845        1173
2              80         653
3              32         327
4             456         734
..            ...         ...
540            58         313
541           166         409
542           102         367
543            77         343
544            98         355

[545 rows x 2 columns]

I'm not sure how to go about doing something like this.

>Solution :

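You can achieve this with str.extract, which returns the first match in each row and puts each capture group in its own column. Escape the two dots (\.\.) so they match literal periods rather than any two characters:

>>> df["Domain[FT]"].str.extract(r'DOMAIN (\d+)\.\.(\d+);  /note="Protein kinase"')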
     0     1
0   23   304
1  845  1173
2   80   653
3   32   327
4  456   734

Full example so you get your column names:

new_df = df["Domain[FT]"].str.extract(r'DOMAIN (\d+)\.\.(\d+);  /note="Protein kinase"')
new_df.columns = ["domain_start", "domain_end"]
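
For anyone who wants to try this without the original data, here is a minimal self-contained sketch. It assumes a small sample frame: two rows copied from the question plus one made-up row with no kinase domain. It uses named capture groups so that str.extract sets the column names directly. The \s* in place of the two literal spaces and the conversion to the nullable Int64 dtype are my own additions, not part of the answer above.

```python
import pandas as pd

# Sample data: the first two rows are copied from the question,
# the third is a hypothetical row with no "Protein kinase" domain.
df = pd.DataFrame({
    "Domain[FT]": [
        'DOMAIN 23..304;  /note="Protein kinase";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00159, ECO:0000305"; '
        'DOMAIN 324..406;  /note="RRM";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00176, ECO:0000305"',
        'DOMAIN 18..109;  /note="PB1";  /evidence="ECO:0000255|PROSITE-ProRule:PRU01081"; '
        'DOMAIN 166..409;  /note="Protein kinase";  /evidence="ECO:0000255|PROSITE-ProRule:PRU00159"',
        'DOMAIN 1..50;  /note="PH"',
    ]
})

# Named groups become the column names of the extracted frame.
# \.\. matches the two literal dots; \s* tolerates any amount of
# whitespace between the semicolon and /note.
pattern = (
    r'DOMAIN (?P<domain_start>\d+)\.\.(?P<domain_end>\d+);'
    r'\s*/note="Protein kinase"'
)

# extract() keeps only the first match per row; rows without a match get NaN.
df_domain = df["Domain[FT]"].str.extract(pattern)

# The captured values are strings, so convert them to numbers.
# Int64 (capital I) is the nullable integer dtype, which keeps missing rows as <NA>.
df_domain = df_domain.apply(pd.to_numeric).astype("Int64")

print(df_domain)
```

If a row can contain more than one protein kinase domain and you need all of them, str.extractall with the same pattern returns one row per match instead of only the first.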

byMR
