Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Pandas: Split and/or update columns, based on inconsistent data?

So I have a column that contains baseball team names, and I want to split it into the 2 new columns, that will contain separately city name and team name.

Team
New York Giants
Atlanta Braves
Chicago Cubs
Chicago White Sox

I would like to get something like this:

Team City Franchise
New York Giants New York Giants
Atlanta Braves Atlanta Braves
Chicago Cubs Chicago Cubs
Chicago White Sox Chicago White Sox

What I have tried so far?

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

  • using split and rsplit –> it gets the job done, but can’t unify it.
  • did the count df['cnt'] = df.asc.apply(lambda x: len(str(x).split(' '))) to get number of strings, so I know what kind of cases I have

There are 3 different cases:

  1. Standard one (e.g. Atlanta Braves)
  2. City with 2 strings (e.g. New York Giants)
  3. Franchise name with 2 strings (e.g. Chicago White Sox )

What I would like to do?

  • Split based on conditions (if cnt=2 then split on 1st occurence). Can’t find syntax for this, how this would go?
  • Update based on names (e.g. if ['Col_name'].str.contains("York" or "Angeles") then split on 2nd occurence . Also, can’t find working syntax, example for this?

What would be a good approach to solve this?

Thanks!

>Solution :

Use:

#part of cities with space
cities = ['York','Angeles']

#test rows
m = df['Team'].str.contains('|'.join(cities))

#first split by first space to 2 new columns
df[['City','Franchise']] = df['Team'].str.split(n=1, expand=True)
#split by second space only filtered rows
s = df.loc[m, 'Team'].str.split(n=2)
 
#update values
df.update(pd.concat([s.str[:2].str.join(' '), s.str[2]], axis=1, ignore_index=True).set_axis(['City','Franchise'], axis=1))
print (df)
                Team      City  Franchise
0    New York Giants  New York     Giants
1     Atlanta Braves   Atlanta     Braves
2       Chicago Cubs   Chicago       Cubs
3  Chicago White Sox   Chicago  White Sox
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading