How to remove everything after the last occurrence of a delimiter?

May 10, 2023

I want to remove everything after the last occurrence of the _ delimiter in the HTAN Parent Biospecimen ID column.

import pandas as pd
df_2["HTAN Parent Biospecimen ID"] = df_2["HTAN Parent Biospecimen ID"].str.rsplit("_", 1).str.get(0)

Traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [41], in <cell line: 3>()
      1 # BulkRNA-seqLevel1
      2 df_2 = pd.read_csv("syn39282161.csv", sep=",")
----> 3 df_2["HTAN Parent Biospecimen ID"] = df_2["HTAN Parent Biospecimen ID"].str.rsplit("_", 1).str.get(0)
      4 df_2.head()

File ~/.local/lib/python3.9/site-packages/pandas/core/strings/accessor.py:129, in forbid_nonstring_types.<locals>._forbid_nonstring_types.<locals>.wrapper(self, *args, **kwargs)
    124     msg = (
    125         f"Cannot use .str.{func_name} with values of "
    126         f"inferred dtype '{self._inferred_dtype}'."
    127     )
    128     raise TypeError(msg)
--> 129 return func(self, *args, **kwargs)

TypeError: rsplit() takes from 1 to 2 positional arguments but 3 were given

Data:

pd.DataFrame({'Component': {0: 'BulkRNA-seqLevel1',
  1: 'BulkRNA-seqLevel1',
  2: 'BulkRNA-seqLevel1',
  3: 'BulkRNA-seqLevel1'},
 'Filename': {0: 'B001A001_1.fq.gz',
  1: 'B001A001_2.fq.gz',
  2: 'B001A006_1.fq.gz',
  3: 'B001A006_2.fq.gz'},
 'File Format': {0: 'fastq', 1: 'fastq', 2: 'fastq', 3: 'fastq'},
 'HTAN Parent Biospecimen ID': {0: 'HTA10_07_001',
  1: 'HTA10_07_001',
  2: 'HTA10_07_006',
  3: 'HTA10_07_006'}})

Expected output:

pd.DataFrame({'Component': {0: 'BulkRNA-seqLevel1',
  1: 'BulkRNA-seqLevel1',
  2: 'BulkRNA-seqLevel1',
  3: 'BulkRNA-seqLevel1'},
 'Filename': {0: 'B001A001_1.fq.gz',
  1: 'B001A001_2.fq.gz',
  2: 'B001A006_1.fq.gz',
  3: 'B001A006_2.fq.gz'},
 'File Format': {0: 'fastq', 1: 'fastq', 2: 'fastq', 3: 'fastq'},
 'HTAN Parent Biospecimen ID': {0: 'HTA10_07',
  1: 'HTA10_07',
  2: 'HTA10_07',
  3: 'HTA10_07'}})

Solution:

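You can use str.replace with a regex anchored to the end of the string (note the raw string, so that \d is not treated as a Python escape sequence):

>>> df['HTAN Parent Biospecimen ID'].str.replace(r'_\d+$', '', regex=True)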
0    HTA10_07
1    HTA10_07
2    HTA10_07
3    HTA10_07
Name: HTAN Parent Biospecimen ID, dtype: object
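
The rsplit approach from the question should also work once the split count is passed by keyword. The TypeError in the traceback is consistent with a pandas version in which only the pattern may be given positionally to Series.str.rsplit, so the 1 has to be written as n=1. A minimal sketch, assuming a small frame named df that holds only the ID column from the sample data:

```python
import pandas as pd

# Assumed sample frame: only the ID column from the question's data.
df = pd.DataFrame(
    {
        "HTAN Parent Biospecimen ID": [
            "HTA10_07_001",
            "HTA10_07_001",
            "HTA10_07_006",
            "HTA10_07_006",
        ]
    }
)

# Split once from the right on "_" and keep the left-hand part.
# n is passed by keyword; passing it positionally raises the TypeError above.
trimmed = df["HTAN Parent Biospecimen ID"].str.rsplit("_", n=1).str.get(0)

print(trimmed.tolist())  # ['HTA10_07', 'HTA10_07', 'HTA10_07', 'HTA10_07']
```

Values that contain no underscore come back unchanged, because the split then yields a single-element list.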

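Explanation of the regex: _ matches a literal underscore, \d+ matches one or more digits, and $ anchors the match to the end of the string, so only the final underscore and the digits after it are removed.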
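
The \d+ pattern used above removes only a trailing run of digits. If the part after the last underscore may contain other characters, a slightly more general pattern could be used instead. The sketch below is an assumption about what such a variant might look like, not part of the original answer; the Series named ids and its values are made up for illustration:

```python
import pandas as pd

# Hypothetical IDs: a numeric suffix, an alphabetic suffix, and no underscore.
ids = pd.Series(["HTA10_07_001", "HTA10_07_abc", "HTA10"])

# "_[^_]*$" matches the last underscore plus every non-underscore
# character after it, up to the end of the string.
trimmed = ids.str.replace(r"_[^_]*$", "", regex=True)

print(trimmed.tolist())  # ['HTA10_07', 'HTA10_07', 'HTA10']
```

Because [^_] cannot match an underscore, the match can only start at the last underscore in the string, which gives the same result as splitting once from the right.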