
nltk.word_tokenize raises TypeError on a large (n, 2) DataFrame

I have a basic dataset with one object column named 'comment' and one float column named 'toxicity'. My dataset's shape is (1999516, 2).


I'm trying to add a new column named 'tokenized' using NLTK's word_tokenize method to build a bag of words, like this:


import pandas as pd
import nltk

dataset = pd.read_csv('toxic_comment_classification_dataset.csv')
dataset['tokenized'] = dataset['comment'].apply(nltk.word_tokenize)


I don't get an error message until I wait around five minutes; after that, I get this error:

TypeError: expected string or bytes-like object

How can I add the tokenized comments to my DataFrame as a new column?

> Solution:

It depends on the data in your comment column: it looks like not all of it is of string type. You can tokenize only the string values and keep the other types as-is with

dataset['tokenized'] = dataset['comment'].apply(lambda x: nltk.word_tokenize(x) if isinstance(x, str) else x)
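Before tokenizing two million rows, it can help to confirm which Python types actually occur in the column. The sketch below runs the same guarded apply on toy data; str.split stands in for nltk.word_tokenize so the example needs no NLTK models, and the toy DataFrame is an assumption, not the asker's actual CSV.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the dataset: a NaN (a float) hides among the comments.
df = pd.DataFrame({
    "comment": ["This is fine", np.nan, "Another comment"],
    "toxicity": [0.0, 0.5, 0.9],
})

# Count the Python types present in the column -- a quick way to spot
# the non-string rows that make word_tokenize raise TypeError.
print(df["comment"].map(type).value_counts())

# The guarded apply, with str.split standing in for nltk.word_tokenize:
df["tokenized"] = df["comment"].apply(
    lambda x: x.split() if isinstance(x, str) else x
)
```

If you would rather tokenize every row, `dataset['comment'].fillna('')` before the apply converts the missing values to empty strings instead of skipping them.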

nltk.word_tokenize is a resource-intensive function. If you need to parallelize your Pandas code, there are dedicated libraries such as Dask; see "Make Pandas DataFrame apply() use all cores?".
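The pattern those libraries build on is simple: split the column into chunks, tokenize each chunk independently, and concatenate the results, so that each chunk can be handed to a separate worker. A minimal single-process sketch of that pattern, again with str.split standing in for nltk.word_tokenize (an assumption, to keep the example dependency-free):

```python
import pandas as pd

def tokenize_chunk(series):
    # str.split stands in for nltk.word_tokenize here; swap the real
    # tokenizer back in for actual use.
    return series.apply(lambda x: x.split() if isinstance(x, str) else x)

df = pd.DataFrame({"comment": ["a b", "c d e", None, "f g"]})

# Split the column into fixed-size chunks; Dask or a multiprocessing
# Pool would dispatch each chunk to its own worker.
size = 2
chunks = [df["comment"].iloc[i:i + size] for i in range(0, len(df), size)]

# Each chunk keeps its original index, so the concatenated result
# aligns back onto the DataFrame.
df["tokenized"] = pd.concat(tokenize_chunk(c) for c in chunks)
```

Because the per-chunk work is independent, replacing the generator with a process pool (or letting Dask partition the frame for you) parallelizes it without changing the logic.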
