Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Sorting rows by the number of list elements the row contains

Taking as example the following table:

index column_1 column_2
0 bli bli d e
1 bla bla a b c d e
2 ble ble a b c

If I give a token_list = ['c', 'e'] I want to order the table by the number of times the tokens each row contains in column number 2.

By ordering the table I should get the following:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

index column_1 column_2 score_tmp
1 bla bla a b c d e 2
0 bli bli d e 1
2 ble ble a b c 1

Currently, I have reached the following way of doing this, but it is taking a lot of time. How could I improve the time? Thank you in advance.

df['score_tmp'] = df[['column_2']].apply(
            lambda x: len([True for token in token_list if
            token in str(x['column_2'])]), axis=1)
results = df.sort_values('score_tmp', ascending=False).loc[df['score_tmp'] == len(token_list)].reset_index(inplace=False).to_dict('records')

>Solution :

You can split column_2 based on whitespaces, convert each row into a set and then use df.apply with set intersection with sort_values:

In [200]: df['matches'] = df.column_2.str.split().apply(lambda x: set(x) & set(token_list)).str.len()

In [204]: df.sort_values('matches', ascending=False).drop('matches', 1)
Out[204]: 
   index column_1   column_2
1      1  bla bla  a b c d e
0      0  bli bli        d e
2      2  ble ble      a b c

Timings:

In [208]: def f1():
     ...:     df['score_tmp'] = df[['column_2']].apply(lambda x: len([True for token in token_list if token in str(x['column_2'])]), axis=1)
     ...:     results = df.sort_values('score_tmp', ascending=False).loc[df['score_tmp'] == len(token_list)].reset_index(inplace=False).to_dict('records')
     ...: 

In [209]: def f2():
     ...:     df['matches'] = df.column_2.str.split().apply(lambda x: set(x) & set(token_list)).str.len()
     ...:     df.sort_values('matches', ascending=False).drop('matches', 1)
     ...: 

In [210]: %timeit f1() # solution provided in question
2.36 ms ± 55.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [211]: %timeit f2() # my solution
1.22 ms ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading