Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Use the Nth underscore from the right as the separator to split a column into two columns

Suppose I have a dataframe df as follows:

    id          value_type_and_model_name
0    1                       actual_value
1    2             fitted_value_RUT_ARIMA
2    3       fitted_lower_value_RUT_ARIMA
3    4       fitted_upper_value_RUT_ARIMA
4    5          predicted_value_RUT_ARIMA
5    6    predicted_lower_value_RUT_ARIMA
6    7    predicted_upper_value_RUT_ARIMA
7    8                fitted_value_RUT_ES
8    9          fitted_lower_value_RUT_ES
9   10          fitted_upper_value_RUT_ES
10  11             predicted_value_RUT_ES
11  12       predicted_lower_value_RUT_ES
12  13       predicted_upper_value_RUT_ES
13  14           fitted_value_RUT_SARIMAX
14  15     fitted_lower_value_RUT_SARIMAX
15  16     fitted_upper_value_RUT_SARIMAX
16  17        predicted_value_RUT_SARIMAX
17  18  predicted_lower_value_RUT_SARIMAX
18  19  predicted_upper_value_RUT_SARIMAX

I want to split this column into two columns (except 'actual_value') using the second underscore from the right as the delimiter.
The expected results are as follows:

    id             value_type   model_name
0    1           actual_value          NaN
1    2           fitted_value    RUT_ARIMA
2    3     fitted_lower_value    RUT_ARIMA
3    4     fitted_upper_value    RUT_ARIMA
4    5        predicted_value    RUT_ARIMA
5    6  predicted_lower_value    RUT_ARIMA
6    7  predicted_upper_value    RUT_ARIMA
7    8           fitted_value        UT_ES
8    9     fitted_lower_value        UT_ES
9   10     fitted_upper_value        UT_ES
10  11        predicted_value        UT_ES
11  12  predicted_lower_value        UT_ES
12  13  predicted_upper_value        UT_ES
13  14           fitted_value  RUT_SARIMAX
14  15     fitted_lower_value  RUT_SARIMAX
15  16     fitted_upper_value  RUT_SARIMAX
16  17        predicted_value  RUT_SARIMAX
17  18  predicted_lower_value  RUT_SARIMAX
18  19  predicted_upper_value  RUT_SARIMAX

How to achieve this? Thanks. I try with code: df['value_type_and_model_name'].str.rsplit('_', n=2, expand=True), but it’s not working out.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

>Solution :

Since you "only" have one underscore in the second chunk, the easiest is to craft a regex for that specific case:

out = (df['value_type_and_model_name']
       .str.extract(r'(?P<value_type>.*)_(?P<model_name>[^_]*_[^_]*)$')
       .fillna({'value_type': df['value_type_and_model_name']})
      )

regex demo (extract)

Output:

               value_type   model_name
0            actual_value          NaN
1            fitted_value    RUT_ARIMA
2      fitted_lower_value    RUT_ARIMA
3      fitted_upper_value    RUT_ARIMA
4         predicted_value    RUT_ARIMA
5   predicted_lower_value    RUT_ARIMA
6   predicted_upper_value    RUT_ARIMA
7            fitted_value       RUT_ES
8      fitted_lower_value       RUT_ES
9      fitted_upper_value       RUT_ES
10        predicted_value       RUT_ES
11  predicted_lower_value       RUT_ES
12  predicted_upper_value       RUT_ES
13           fitted_value  RUT_SARIMAX
14     fitted_lower_value  RUT_SARIMAX
15     fitted_upper_value  RUT_SARIMAX
16        predicted_value  RUT_SARIMAX
17  predicted_lower_value  RUT_SARIMAX
18  predicted_upper_value  RUT_SARIMAX

If you really want to split:

df['value_type_and_model_name'].str.split('_(?=[^_]*_[^_]*$)', expand=True)

regex demo (split)

If you want, you can assign to the original dataframe:

s = df.pop('value_type_and_model_name')
df[['value_type', 'model_name']] = (s.str.extract(r'(.*)_([^_]*_[^_]*)$')
                                     .fillna({0: s})
                                   )

Output:

    id             value_type        model
0    1           actual_value          NaN
1    2           fitted_value    RUT_ARIMA
2    3     fitted_lower_value    RUT_ARIMA
3    4     fitted_upper_value    RUT_ARIMA
4    5        predicted_value    RUT_ARIMA
5    6  predicted_lower_value    RUT_ARIMA
6    7  predicted_upper_value    RUT_ARIMA
7    8           fitted_value       RUT_ES
8    9     fitted_lower_value       RUT_ES
9   10     fitted_upper_value       RUT_ES
10  11        predicted_value       RUT_ES
11  12  predicted_lower_value       RUT_ES
12  13  predicted_upper_value       RUT_ES
13  14           fitted_value  RUT_SARIMAX
14  15     fitted_lower_value  RUT_SARIMAX
15  16     fitted_upper_value  RUT_SARIMAX
16  17        predicted_value  RUT_SARIMAX
17  18  predicted_lower_value  RUT_SARIMAX
18  19  predicted_upper_value  RUT_SARIMAX
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading