Here is a sample of a dataset I have:
| ID | Project |
|---|---|
| 1 | 21st Townhouse 318 |
| 2 | The Residences 6 |
| 3 | Villanova Tower B |
| 4 | The Hills H |
| 5 | City Park |
I need to transform ‘Project’ column so that:
- if a row ends with numeric values, they should be dropped
- if a row ends with single letter, it should be dropped
- else, leave as it is
Here is how I want it to look like:
| ID | Project |
|---|---|
| 1 | 21st Townhouse |
| 2 | The Residences |
| 3 | Villanova Tower |
| 4 | The Hills |
| 5 | City Park |
I tried to search for some solution, and found this(for first condition with numeric values only):
df['Project']=df.Project[~((df.Project.astype(str).str.match("(.*\d)")) & (df.Project.astype(str).str.len() > 1))]
It worked, however, I tried to apply it for the second condition as well:
df['Project']=df.Project[~((df.Project.astype(str).str.match("(.*\w)")) & (df.Project.astype(str).str.len() == 1))]
But, It failed
Can you help me, please?
Thank you!
>Solution :
You can use
df['Project']=df['Project'].str.replace(r'\s+(?:\d+|[A-Za-z])$', '', regex=True)
See the regex demo.
Details:
\s+– one or more whitespaces(?:\d+|[A-Za-z])– a non-capturing group matching either\d+– one or more digits|– or[A-Za-z]– an ASCII letter (use[^\W\d_]to match any Unicode letter)
$– end of string.
If your strings have no whitespace before the last number/letter, you may use
df['Project'].str.replace(r'\b(?:\d+|[A-Za-z])$', '', regex=True).str.rstrip()
The \b matches a word boundary making sure there is no letter, digit or _ right before the final digits/letter, and .str.rstrip() will strip the trailing whitespace.