Using greedy behavior to match string after x times of occurrences

July 31, 2023

I have a data frame that looks like this:

(one row, shown as column: value)
id: 04XYNu_00699
ec_number_clean: 3.2.1.52
euclidean_clean: 0.0968
cluster: 49
label: non-singleton
is_akker: akkermansia
eggNOG_OGs: COG3525@1|root,COG3525@2|Bacteria,46UY2@74201|Verrucomicrobia,2IU6A@203494|Verrucomicrobiae
COG_category: G
Description: Glycoside hydrolase, family 20, catalytic core
GOs: -
EC: 3.2.1.52
CAZy: GH20

The column I'm interested in is eggNOG_OGs. Its format is not the same in every row. Here are some examples:

COG3525@1|root,COG3525@2|Bacteria,46UY2@74201|Verrucomicrobia,2IU6A@203494|Verrucomicrobiae

COG3525@1|root,COG3525@2|Bacteria

COG3525@1|root,KOG2499@2759|Eukaryota,38D1Y@33154|Opisthokonta,3NUJ9@4751|Fungi,3QMST@4890|Ascomycota,216QI@147550|Sordariomycetes,3TDHM@5125|Hypocreales,3G4R2@34397|Clavicipitaceae

COG3525@1|root,KOG2499@2759|Eukaryota,3ZBNG@5878|Ciliophora

As you can see, the delimiter to follow here is the "|" (pipe) in the string.
My code uses a regex to find the last occurrence of "|" and creates a new column with the string immediately after it.

Now I need to do something slightly different. Instead of the last occurrence, I need the string after the third occurrence of "|", up to the next comma. For the four lines above, the new column must contain the following on each row:

Verrucomicrobia
Bacteria
Opisthokonta
Ciliophora

There is one small detail: sometimes there is no third occurrence of "|". In that case, use the string after the last occurrence. That is why the second line gives Bacteria: it has no third "|".

Here is my code, which works perfectly for finding the string after the last occurrence of "|":

import re
import sys

import pandas as pd

# Input and output paths from the command line
input_file_1 = sys.argv[1]
output_file_1 = sys.argv[2]

# .*       greedy: consumes as much as possible, then backtracks to the last "|"
# \|       the literal "|" (the last one, because of the greedy .* before it)
# ([^|]+)$ captures one or more characters that are not "|", up to the end of the string
searching_root = r'.*\|([^|]+)$'

def searching_taxonomy(text):
    """
    :param text: string to search
    :return: the text after the last "|", or None if there is no match
    """
    match = re.search(searching_root, text)
    # Strip leading and trailing whitespace from the captured group
    return match.group(1).strip() if match else None


# Read the tab-separated input file
df_input = pd.read_csv(input_file_1, header=0, sep="\t")

# Create a new column by applying the function above to each row
df_input['eggnog_taxonomy'] = df_input['eggNOG_OGs'].apply(searching_taxonomy)
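
The greedy behavior described in the comments can be checked on one of the sample rows. This is only an illustrative sketch: the lazy pattern is added for contrast and is not part of the code above.

```python
import re

text = "COG3525@1|root,COG3525@2|Bacteria"

# Greedy: .* runs to the end of the string, then backtracks to the last "|"
greedy = re.search(r'.*\|([^|]+)$', text).group(1)

# Lazy (for contrast only): .*? stops at the first "|",
# and the capture ends at the next "," or "|"
lazy = re.search(r'.*?\|([^|,]+)', text).group(1)

print(greedy)  # Bacteria
print(lazy)    # root
```

Neither form on its own can stop at the third "|", which is the gap the rest of the question is about.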

I do not know whether the regex pattern I'm using has a particular name, but I know it relies on greedy behavior. My goal is more like a bounded greedy match: I need the string after the third occurrence of "|" and nothing more, and if there are fewer than three occurrences, the string after the last one.

Any idea how to do this by modifying only the pattern, maybe by combining some regex techniques?
I could add an if statement based on the number of occurrences, but first I want to check whether it is possible with the regex alone.

>Solution :

It can be achieved with a combination of split and replace (here df is the data frame, df_input in the question):

df['eggNOG_OGs'].str.split('|', n=3).str[-1].replace(r',.*', '', regex=True)

Out[367]: 
0    Verrucomicrobia
1           Bacteria
2       Opisthokonta
3         Ciliophora
Name: eggNOG_OGs, dtype: object
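
If you would rather change only the pattern, as the question asks, one option is a bounded greedy quantifier. This is a sketch, not part of the answer above: the name `searching_third` is made up here, and it assumes a taxon name never contains a comma or a pipe. `(?:[^|]*\|){1,3}` consumes up to three "|"-terminated chunks, and because the quantifier is greedy it takes three when they exist and falls back to fewer otherwise.

```python
import re

# Hypothetical pattern: skip up to three "|"-terminated chunks (greedy, so
# three are preferred), then capture up to the next "," or "|".
searching_third = re.compile(r'^(?:[^|]*\|){1,3}([^|,]+)')

def searching_taxonomy(text):
    """Return the name after the third "|", or after the last "|" if there are fewer."""
    match = searching_third.search(text)
    return match.group(1).strip() if match else None

# The four sample rows from the question
rows = [
    "COG3525@1|root,COG3525@2|Bacteria,46UY2@74201|Verrucomicrobia,2IU6A@203494|Verrucomicrobiae",
    "COG3525@1|root,COG3525@2|Bacteria",
    "COG3525@1|root,KOG2499@2759|Eukaryota,38D1Y@33154|Opisthokonta,3NUJ9@4751|Fungi,3QMST@4890|Ascomycota,216QI@147550|Sordariomycetes,3TDHM@5125|Hypocreales,3G4R2@34397|Clavicipitaceae",
    "COG3525@1|root,KOG2499@2759|Eukaryota,3ZBNG@5878|Ciliophora",
]

result = [searching_taxonomy(row) for row in rows]
print(result)
```

Since the function keeps the same name and signature as in the question, it should slot into the existing `.apply(searching_taxonomy)` call unchanged. A string with no "|" at all returns None.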

regex

byMR
