Obtain the average length, in words, of sentences in a dataframe column

Context: I’m trying to obtain the average sentence length, counted in words, for a column in a dataframe.

Basically if we have these 3 sentences in a dataframe:

Sentence1 = "This is a sentence"
Sentence2 = "This is a larger sentence"
Sentence3 = "This is an even larger sentence"

The output should be their average length, counted in words. For Sentence1, `len(x.split(" "))` would be 4, for Sentence2 it would be 5, and for Sentence3 it would be 6, so the average would be 5.

How could I do this in a dataframe?

I was trying this

avg = df['strings'].apply(lambda x: np.mean([len(words.split(" ")) for words in x if isinstance(x,str)]))

This doesn’t really make sense: "x" is already the string, so "words" would loop through its characters, and that’s not what I want (splitting a single character says nothing about the number of words).
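To make the problem concrete, here is a small sketch in plain Python (the variable names `per_char` and `n_words` are only illustrative). It shows that the comprehension visits characters, while a single split of the whole string gives the word count:

```python
sentence = "This is a sentence"

# Iterating over a string yields one-character strings, not words,
# so this list has one entry per character of the sentence.
per_char = [len(ch.split(" ")) for ch in sentence]

# Splitting the whole string once gives the words.
n_words = len(sentence.split(" "))

print(len(per_char))  # one entry per character
print(n_words)        # number of space-separated words
```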

Also, it would be nice to filter out "strings" that contain only floats, NaN, or numbers (hence the isinstance(x,str)).

How could I get the length of x.split(" ") if and only if x is a string, and then average the word counts over all the sentences?

Thank you in advance

>Solution :

import pandas as pd

df = pd.DataFrame({'sentence':
                   ["This is a sentence",
                    "This is a larger sentence",
                    "This is an even larger sentence",
                    "",
                    1,
                    None]})
# df now looks like this:
#                           sentence
# 0               This is a sentence
# 1        This is a larger sentence
# 2  This is an even larger sentence
# 3
# 4                                1
# 5                             None
df['length'] = df['sentence'].apply(
    lambda row: min(len(row.split(" ")), len(row)) if isinstance(row, str) else None
)
# df['length'] now holds:
# 0    4.0
# 1    5.0
# 2    6.0
# 3    0.0
# 4    NaN
# 5    NaN
df['length'].mean()  # 3.75 (NaN rows are skipped)

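The min(...) is there because "".split(" ") returns [""], which has length 1, while len("") is 0, so the empty string counts as 0 words. If you want to assign the length 1 for "", use len(row.split(" ")) instead of min(len(row.split(" ")), len(row)).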
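The question also asks to skip strings that contain only a number, which the snippet above does not do. Below is a minimal sketch of one way to handle that. It is not part of the original answer: the helpers `is_number_like` and `word_count` are hypothetical names, the extra "3.14" row is added only for illustration, and "number-only" is assumed to mean "anything `float()` can parse" (which also catches strings such as "nan" or "inf").

```python
import pandas as pd


def is_number_like(text):
    """Return True if float() can parse the text (assumed meaning of 'number-only')."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def word_count(value):
    """Word count for real sentences; None for non-strings and number-only strings."""
    if not isinstance(value, str) or is_number_like(value):
        return None
    # split() with no argument splits on runs of whitespace, so "" gives 0 words.
    return len(value.split())


df = pd.DataFrame({'sentence':
                   ["This is a sentence",
                    "This is a larger sentence",
                    "This is an even larger sentence",
                    "",
                    "3.14",   # illustrative number-only string
                    1,
                    None]})

# Build a float column so that None becomes NaN and is ignored by mean().
counts = pd.Series([word_count(v) for v in df['sentence']],
                   index=df.index, dtype="float64")

avg_all = counts.mean()                    # empty string counted as 0 words
avg_non_empty = counts[counts > 0].mean()  # empty string left out as well
```

Whether the empty string should count as 0 words or be dropped depends on what the average is meant to describe, so both variants are shown.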
