filtered_df and str in Python

November 12, 2023

I am new to Python. I'm trying to filter a dataset. The filter seems to work well, or I think it does :)

valid_Cas = ["yut", "thj", "bnm","vfd"]
filtered_df = df[df['Cas ID'].str[-3:].isin(valid_Cas)]

but when I filter on more than three letters, it does not work, like:

valid_Cas = ["yut", "thj", "bnm","vfd","cdret"]
filtered_df = df[df['Cas ID'].str[-3:].isin(valid_Cas)]

what does it mean: str[-3:] ?

how can I filter more than 3 letters?

does the code filter "bnm5623" and "5623bnm" or does it leave it?

thank you,

>Solution :

what does it mean: str[-3:] ?

str[-3:] is a slicing operation which means "take the last 3 characters of the string". E.g. given a string like "abcde", "abcde"[-3:] results in "cde". df['Cas ID'].str[-3:] performs this slicing operation on each element of the column in the dataframe.
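As a quick illustration of the slice itself, here is a minimal plain-Python sketch. The sample strings are made up for the example; pandas' .str[-3:] applies the same slice to every element of a column.

```python
# Plain-Python sketch of negative slicing; the sample strings are made up.
text = "abcde"
last_three = text[-3:]   # start 3 characters from the end, run to the end

# A string shorter than 3 characters is returned whole, without an error
short = "ab"[-3:]

# The same slice applied to several IDs, as .str[-3:] does per row
ids = ["5623bnm", "bnm5623", "xy"]
endings = [s[-3:] for s in ids]

print(last_three, short, endings)
```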

how can I filter more than 3 letters?

To filter on a longer ending, change the slice to the length of the string you are looking for. E.g. if you want rows that end with 'cdret' you would use str[-5:], because 'cdret' has a length of 5. Note that a single slice length only matches list entries of exactly that length, so with str[-5:] the three-letter entries no longer match.

does the code filter "bnm5623" and "5623bnm" or does it leave it?

The code df['Cas ID'].str[-3:].isin(valid_Cas) only checks the last three characters of each entry in the 'Cas ID' column against your valid_Cas list. So it keeps '5623bnm', because its last three characters are 'bnm', which is in the list. It drops 'bnm5623', because its last three characters are '623', which is not in the list.
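To see what the last-three-characters check does row by row, here is a small sketch. The DataFrame contents are hypothetical sample data, not taken from the question.

```python
import pandas as pd

# Hypothetical sample data, not from the original question
df = pd.DataFrame({"Cas ID": ["bnm5623", "5623bnm", "0001thj", "77cdret"]})
valid_Cas = ["yut", "thj", "bnm", "vfd", "cdret"]

# Last three characters of every ID
last3 = df["Cas ID"].str[-3:]

# True where those three characters are one of the list entries
mask = last3.isin(valid_Cas)
filtered_df = df[mask]

# Show each ID next to its ending and whether it is kept
print(df.assign(last3=last3, keep=mask))
```

Here '77cdret' is dropped even though it ends with 'cdret', because only its last three characters ('ret') are compared with the list.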

To match list entries of different lengths, slicing to the longest entry does not help: str[-5:] returns five characters, which can never equal a three-letter entry such as 'bnm'. Instead, test whether each value ends with any entry in the list:

valid_Cas = ["yut", "thj", "bnm", "vfd", "cdret"]

# Keep rows whose 'Cas ID' ends with any entry in valid_Cas, whatever its length
# (passing a tuple to .str.endswith needs pandas 1.5 or later)
filtered_df = df[df['Cas ID'].str.endswith(tuple(valid_Cas))]
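The following sketch compares a fixed slice of max_length characters with an ends-with test on a few made-up IDs. It assumes pandas 1.5 or later, where .str.endswith accepts a tuple of suffixes; the sample DataFrame is hypothetical.

```python
import pandas as pd

# Hypothetical sample data, not from the original question
df = pd.DataFrame({"Cas ID": ["5623bnm", "12cdret", "bnm5623", "99xyz"]})
valid_Cas = ["yut", "thj", "bnm", "vfd", "cdret"]

# Fixed slice: compares the last 5 characters, so only 5-letter entries can match
max_length = max(len(s) for s in valid_Cas)
fixed_slice = df["Cas ID"].str[-max_length:].isin(valid_Cas)

# Ends-with test: matches any entry in the list regardless of its length
# (tuple argument assumed available, i.e. pandas >= 1.5)
ends_with = df["Cas ID"].str.endswith(tuple(valid_Cas))

filtered_df = df[ends_with]
print(df.assign(fixed_slice=fixed_slice, ends_with=ends_with))
```

With the fixed slice, '5623bnm' is compared as '23bnm' and is dropped; the ends-with test keeps it.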