How to solve Python Pandas assign error when creating new column

I have a df containing home descriptions:

description
0   Beautiful, spacious skylit studio in the heart...
1   Enjoy 500 s.f. top floor in 1899 brownstone, w...
2   The spaceHELLO EVERYONE AND THANKS FOR VISITIN...
3   We welcome you to stay in our lovely 2 br dupl...
4   Please don’t expect the luxury here just a bas...
5   Our best guests are seeking a safe, clean, spa...
6   Beautiful house, gorgeous garden, patio, cozy ...
7   Comfortable studio apartment with super comfor...
8   A charming month-to-month home away from home ...
9   Beautiful peaceful healthy homeThe spaceHome i...

I’m trying to count the number of sentences in each row (using sent_tokenize from nltk.tokenize) and append those values as a new column, sentence_count, to the df. Since this is part of a larger data pipeline, I’m using pandas assign so that I can chain operations.

I can’t seem to get it to work, though. I’ve tried:


df.assign(sentence_count=lambda x: len(sent_tokenize(x['description'])))

and

df.assign(sentence_count=len(sent_tokenize(df['description'])))

but both return the following:

TypeError: expected string or bytes-like object

I’ve confirmed that each row has a str dtype. Perhaps it’s because description has dtype('O')?

What am I doing wrong? Using pipe with a custom function works fine, but I’d prefer to use assign.

Solution:

In your first example, the x['description'] you pass to sent_tokenize is a pandas.Series, not a string: it is a Series (similar to a list) of strings. The same is true of df['description'] in your second example, which is why both attempts raise the TypeError. sent_tokenize must be applied to each string individually.
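The failure can be reproduced without nltk: any function from Python's re module expects a single string, so handing it a whole Series raises the same error. A minimal sketch (the regex and sample data are illustrative, not from the question):

```python
import re

import pandas as pd

df = pd.DataFrame({"description": ["First one. Second one.", "Only one."]})

# re.split, like sent_tokenize, expects one string, not a Series of strings
try:
    df.assign(sentence_count=lambda x: len(re.split(r"(?<=[.!?])\s+", x["description"])))
except TypeError as e:
    print(e)  # expected string or bytes-like object ...
```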

So instead, apply sent_tokenize to each element and take the length of the result, since you want a count rather than the sentences themselves:

df.assign(sentence_count=df['description'].apply(lambda d: len(sent_tokenize(d))))

Or, if you need to pass extra parameters to sent_tokenize:

df.assign(sentence_count=df['description'].apply(lambda d: len(sent_tokenize(d, language='english'))))
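A self-contained sketch of the whole pattern, using a naive regex splitter as a stand-in for nltk's sent_tokenize (which needs the punkt model downloaded); naive_sent_tokenize and the sample rows are assumptions for illustration:

```python
import re

import pandas as pd

def naive_sent_tokenize(text):
    # stand-in for nltk.tokenize.sent_tokenize:
    # split on ., !, or ? followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

df = pd.DataFrame({"description": [
    "Beautiful, spacious skylit studio. In the heart of the city!",
    "Enjoy the top floor. Quiet street. Great light.",
]})

# apply the tokenizer element-wise, then count, inside assign so it chains
result = df.assign(
    sentence_count=df["description"].apply(lambda d: len(naive_sent_tokenize(d)))
)
print(result["sentence_count"].tolist())  # [2, 3]
```

Because assign returns a new DataFrame, this step slots directly into a method chain with the rest of the pipeline.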