Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Separating column string values with varying delimiters

I have a column in a dataframe that I want to split into two columns. The values in the column are strings with a players’ name followed by their position. Because players have different numbers of names, this becomes a bigger issue.

For example:

  • 1 name: Jorginho Defensive Midfield
  • 2 names: Heung-min Son Left Winger
  • 3 names: Bilal El Khannouss Attacking Midfield

The desired output would be:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Player              Position
Jorginho            Defensive Midfield
Heung-min Son       Left Winger
Bilal El Khannouss  Attacking Midfield

I believe this can be done by listing the player positions, however I don’t know how to approach that problem. I tried separating using split() with a space character as the delimiter, but that doesn’t work unfortunately.

import pandas as pd
df = pd.DataFrame({'Player': ['Richarlison Centre-Forward',
                              'Heung-min Son Left Winger',
                              'Harry Wilson Right Winger',
                              'Bilal El Khannouss Attacking Midfield',
                              'Eduardo Camavinga Central Midfield',
                              'Jorginho Defensive Midfield',
                              'Lewis Patterson Centre-Back',
                              'Layvin Kurzawa Left-Back',
                              'Kyle Walker Right-Back',
                              'Jordan Pickford Goalkeeper']})

positions = ['Centre-Forward', 'Left Winger', 'Right Winger',
             'Attacking Midfield', 'Central Midfield', 'Defensive Midfield',
             'Centre-Back', 'Left-Back', 'Right-Back', 'Goalkeeper']

Is this possible to do?

>Solution :

You can craft a regex.

import re
regex = '|'.join(map(re.escape, positions))

df['Player'].str.extract(fr'(.*)\s*({regex})')

NB. changed 'Central Midfielder' to 'Central Midfield' in the list of positions.

Another approach that does not require any list, would be to extract the last 2 words (either separated by spaces, or a dash):

df['Player'].str.extract(r'(.*)\s(\w+(?:-|\s+)\w+)')

output:

                     0                   1
0         Richarlison       Centre-Forward
1       Heung-min Son          Left Winger
2        Harry Wilson         Right Winger
3  Bilal El Khannouss   Attacking Midfield
4   Eduardo Camavinga     Central Midfield
5            Jorginho   Defensive Midfield
6     Lewis Patterson          Centre-Back
7      Layvin Kurzawa            Left-Back
8         Kyle Walker           Right-Back
9     Jordan Pickford           Goalkeeper
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading