Filter rows with two conditions in a DataFrame in pandas

July 4, 2022

I have a dataframe:

import pandas as pd
data = {'text': ['I ran home', 'I went home', 'I looked at the cat', 'The cat looked at me'],
        'word_count': [3, 3, 4, 5]}
        
df = pd.DataFrame(data)
df['len_text'] = df["text"].str.len()
    text             word_count len_text
0   I ran home              3   10
1   I went home             3   11
2   I looked at the cat     4   19
3   The cat looked at me    5   20

I want to filter rows on two conditions:
if consecutive rows have the same value in the word_count column, compare their values in the len_text column and keep only the row with the greater len_text.

So the result will be:

    text             word_count len_text
0   I went home             3   11
1   I looked at the cat     4   19
2   The cat looked at me    5   20

I tried to do this but it doesn’t work:

for i, row in df.iterrows():
    if (df['pub_count'][i] == df['pub_count'][i+1])&(df['len'][i] >= df['df'][i+1]):
        df = df.drop(i+1)

Solution:

You can label runs of consecutive equal values in word_count, get the index of the largest len_text in each run with DataFrameGroupBy.idxmax, and then select only those rows with DataFrame.loc:

g = df['word_count'].ne(df['word_count'].shift()).cumsum()
df = df.loc[df.groupby(g)['len_text'].idxmax()]
print (df)
                   text  word_count  len_text
1           I went home           3        11
2   I looked at the cat           4        19
3  The cat looked at me           5        20
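
To see why this works, it can help to print the intermediate values. The following is a small sketch on the same data; the variable names changed and keep are only illustrative and do not appear in the original answer.

```python
import pandas as pd

data = {'text': ['I ran home', 'I went home', 'I looked at the cat', 'The cat looked at me'],
        'word_count': [3, 3, 4, 5]}
df = pd.DataFrame(data)
df['len_text'] = df['text'].str.len()

# True wherever word_count differs from the previous row
# (the first row is always True because it is compared with NaN)
changed = df['word_count'].ne(df['word_count'].shift())

# The running sum gives every run of equal values its own group label
g = changed.cumsum()
print(g.tolist())     # [1, 1, 2, 3]

# Index label of the row with the largest len_text inside each run
keep = df.groupby(g)['len_text'].idxmax()
print(keep.tolist())  # [1, 2, 3]
```

Rows 0 and 1 share label 1, so only the longer of the two (index 1) is kept.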

Grouping by consecutive values means that if a value such as 3 appears again later, that later run is handled as a separate group:

data = {'text': ['I ran home', 'I went home', 'I looked at the cat',
                 'The cat looked at me', 'I ran home', 'I went homes'],
        'word_count': [3, 3, 4, 5, 3, 3]}
        
df = pd.DataFrame(data)
df['len_text'] = df["text"].str.len()
print (df)
                   text  word_count  len_text
0            I ran home           3        10
1           I went home           3        11
2   I looked at the cat           4        19
3  The cat looked at me           5        20
4            I ran home           3        10
5          I went homes           3        12

g = df['word_count'].ne(df['word_count'].shift()).cumsum()
df1 = df.loc[df.groupby(g)['len_text'].idxmax()]
print (df1)
                   text  word_count  len_text
1           I went home           3        11
2   I looked at the cat           4        19
3  The cat looked at me           5        20
5          I went homes           3        12

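Compare this with grouping by word_count directly, which puts all rows with the same value in one group regardless of position: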

df2 = df.loc[df.groupby('word_count')['len_text'].idxmax()]
print (df2)
                   text  word_count  len_text
5          I went homes           3        12
2   I looked at the cat           4        19
3  The cat looked at me           5        20
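
If the result should be renumbered from 0, as in the expected output in the question, the steps can be wrapped in a small helper. This is only a sketch: the function name keep_longest_per_run and its parameters are hypothetical, not part of pandas, and it assumes the dataframe has a unique index. Note that idxmax returns the first occurrence when values tie, so if two rows in a run have the same len_text, the earlier one is kept.

```python
import pandas as pd

def keep_longest_per_run(df, key='word_count', value='len_text'):
    """For each run of consecutive equal `key` values, keep the row with the largest `value`.

    Hypothetical helper; assumes df has a unique index.
    """
    # Label each run of consecutive equal values
    g = df[key].ne(df[key].shift()).cumsum()
    # Index label of the maximum within each run (first occurrence on ties)
    idx = df.groupby(g)[value].idxmax()
    # Select those rows and renumber the index from 0
    return df.loc[idx].reset_index(drop=True)

data = {'text': ['I ran home', 'I went home', 'I looked at the cat',
                 'The cat looked at me', 'I ran home', 'I went homes'],
        'word_count': [3, 3, 4, 5, 3, 3]}
df = pd.DataFrame(data)
df['len_text'] = df['text'].str.len()

result = keep_longest_per_run(df)
print(result)
```

Dropping reset_index(drop=True) keeps the original row labels, which matches the output printed in the answer above.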