In Python, how do I write a loop to extract the text from character n up to a specific character (:) in the items of a list that match a condition?

I have a list like this:

test = ["Similar to Stxbp2: Syntaxin-binding protein 2 (Mus musculus)", "Protein of unknown function", "Similar to rab18b: Ras-related protein Rab-18-B (Danio rerio)", "Protein of unknown function", "Protein of unknown function"]

The real list is much longer than this; the above is just a simplified example.
My goal is to loop through test and edit it so that any value starting with "Similar to" is replaced by the gene name that directly follows (in this example, "Stxbp2" and "rab18b", respectively), which I presume would require starting at character 12 and stopping at the colon. When a value is "Protein of unknown function", I want it replaced with "Unknown". Thus, the output would be:

["Stxbp2", "Unknown", "rab18b", "Unknown", "Unknown"]

I know this probably requires a for loop with if statements to match each condition, but I’m pretty lost on how to proceed from there to get the result I’m looking for.

Solution:

Here is a variation without regex, in case you’d rather avoid it:

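def parse(x):
    # "Similar to <gene>: ..." -> keep the text before the first colon, then take its last word
    if x.startswith("Similar to"):
        return x.split(":")[0].split()[-1]
    if x.startswith("Protein of unknown function"):
        return "Unknown"
    # Fail loudly on anything that matches neither pattern
    raise ValueError(f"Unknown value: {x}")

print([parse(i) for i in test])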


['Stxbp2', 'Unknown', 'rab18b', 'Unknown', 'Unknown']
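
If you would rather follow the approach sketched in the question (an explicit for loop that starts right after "Similar to " and stops at the colon), something like the following should also work. This is only a sketch: the names PREFIX and gene_names are made up for illustration, and it assumes every "Similar to" entry contains a colon after the gene name.

```python
PREFIX = "Similar to "  # 11 characters, so the gene name starts at index 11

def gene_names(items):
    """Return the gene name for 'Similar to ...' entries and 'Unknown' otherwise."""
    result = []
    for item in items:
        if item.startswith(PREFIX):
            # Locate the first colon after the prefix
            end = item.find(":", len(PREFIX))
            if end == -1:
                raise ValueError(f"No colon found in: {item}")
            # Slice from the end of the prefix up to (not including) the colon
            result.append(item[len(PREFIX):end])
        elif item.startswith("Protein of unknown function"):
            result.append("Unknown")
        else:
            raise ValueError(f"Unknown value: {item}")
    return result

test = [
    "Similar to Stxbp2: Syntaxin-binding protein 2 (Mus musculus)",
    "Protein of unknown function",
    "Similar to rab18b: Ras-related protein Rab-18-B (Danio rerio)",
    "Protein of unknown function",
    "Protein of unknown function",
]
print(gene_names(test))
```

Using len(PREFIX) instead of a hard-coded 11 avoids off-by-one mistakes: "character 12" in 1-based counting is index 11 in Python.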
