Efficient selection of values in numpy

November 23, 2021

I’m trying to find the rows of one DataFrame (df_other) whose values match a column in another DataFrame (df). In other words, for each row of df['a'], I’d like to know which rows of df_other['a'] hold the same value.

An example may make the expected result clearer:

>>> import pandas as pd
>>> import numpy as np
>>> 
>>> 
>>> df = pd.DataFrame({'a': ['x', 'y', 'z']})
>>> df
   a
0  x
1  y
2  z
>>> df_other = pd.DataFrame({'a': ['x', 'x', 'y', 'z', 'z2'], 'c': [1, 2, 3, 4, 5]})
>>> df_other
    a  c
0   x  1
1   x  2
2   y  3
3   z  4
4  z2  5
>>> 
>>> 
>>> u = df_other['c'].unique()
>>> u
array([1, 2, 3, 4, 5])
>>> bm = np.ones((len(df), len(u)), dtype=bool)
>>> bm
array([[ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]])

For this input, bm should end up holding the following bitmap:

[
 [1, 1, 0, 0, 0], # [1, 2] are df_other['c'] where df_other['a'] == df['a']
 [0, 0, 1, 0, 0], # [3] matches
 [0, 0, 0, 1, 0], # [4] matches
]

I’m looking for a fast numpy implementation that avoids iterating over every row of df, which is what my current solution does:

>>> df_other['a'] == df.loc[0, 'a']
0     True
1     True
2    False
3    False
4    False
Name: a, dtype: bool
>>> 
>>> 
>>> df_other['a'] == df.loc[1, 'a']
0    False
1    False
2     True
3    False
4    False
Name: a, dtype: bool
>>> df_other['a'] == df.loc[2, 'a']
0    False
1    False
2    False
3     True
4    False
Name: a, dtype: bool

Note: in the actual production code there are many more column conditions ((df['a'] == df_other['a']) & (df['b'] == df_other['b']) & ...), but there are generally fewer conditions than rows in df, so I wouldn’t mind a solution that loops over the conditions (and sets the corresponding values in bm to False as it goes).
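The loop over conditions described in the note can be sketched as follows. This is only an illustration under stated assumptions: the column 'b', its values, and the helper name match_bitmap are hypothetical and do not come from the original post. Each condition is evaluated for all row pairs at once and AND-ed into the bitmap, so the Python loop runs once per column rather than once per row.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column 'b' is invented to show a second condition.
df = pd.DataFrame({'a': ['x', 'y', 'z'], 'b': [1, 2, 3]})
df_other = pd.DataFrame({
    'a': ['x', 'x', 'y', 'z', 'z2'],
    'b': [1, 9, 2, 3, 3],
    'c': [1, 2, 3, 4, 5],
})


def match_bitmap(left, right, columns):
    """Return a (len(left), len(right)) boolean array that is True where
    every listed column is equal between a row of left and a row of right."""
    bm = np.ones((len(left), len(right)), dtype=bool)
    for col in columns:
        # (n_left, 1) compared with (1, n_right) broadcasts to (n_left, n_right)
        bm &= left[col].to_numpy()[:, None] == right[col].to_numpy()[None, :]
    return bm


bm = match_bitmap(df, df_other, ['a', 'b'])
print(bm.astype(int))
```

Note that missing values never compare equal to each other under ==, so rows containing NaN in a condition column will not match anything in this sketch.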

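Solution:

NumPy broadcasting does this without a loop. Comparing the first column of df_other (a 1-D array) against df.values (a 2-D array with a single column) compares every pair of rows at once: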

bm = df_other.values[:, 0] == df.values

Output:

>>> bm
array([[ True,  True, False, False, False],
       [False, False,  True, False, False],
       [False, False, False,  True, False]])
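To make the shapes behind that comparison explicit, here is a small sketch that selects the columns by name instead of by position. The variable names left, right and bm_answer are only illustrative, and to_numpy() is used in place of .values as a matter of preference; for this example both spellings should give the same array.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['x', 'y', 'z']})
df_other = pd.DataFrame({'a': ['x', 'x', 'y', 'z', 'z2'], 'c': [1, 2, 3, 4, 5]})

left = df['a'].to_numpy()[:, None]   # shape (3, 1): one row per row of df
right = df_other['a'].to_numpy()     # shape (5,): one entry per row of df_other
bm = left == right                   # broadcasts to shape (3, 5)

# The expression from the answer, kept for comparison.
bm_answer = df_other.values[:, 0] == df.values

print(bm.shape)
print(np.array_equal(bm, bm_answer))
```

Selecting by name keeps working if df later gains extra columns, whereas df.values would then no longer have shape (3, 1).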

If you need it as ints:

>>> bm.astype(int)
array([[1, 1, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0]])
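In the question, the bitmap columns are meant to correspond to the unique values of df_other['c']. In the example every value of 'c' is distinct, so the row-level bitmap already has that layout. If 'c' can repeat, one possible way to collapse the row-level result onto the unique values is sketched below. The repeated 'c' values are hypothetical data made up for illustration, and pd.factorize is used here instead of unique() only because it also returns the position of each row's value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['x', 'y', 'z']})
# Hypothetical data: 'c' contains duplicates, unlike the original example.
df_other = pd.DataFrame({'a': ['x', 'x', 'y', 'z', 'z2'], 'c': [1, 2, 2, 4, 1]})

# Row-level matches, shape (len(df), len(df_other)).
row_bm = df['a'].to_numpy()[:, None] == df_other['a'].to_numpy()

# codes[i] is the position of df_other['c'][i] within the unique values u,
# which are listed in order of first appearance.
codes, u = pd.factorize(df_other['c'])

# One-hot map from rows of df_other to unique values, shape (len(df_other), len(u)).
onehot = codes[:, None] == np.arange(len(u))

# A unique value is flagged for a df row if any matching df_other row carries it.
bm = (row_bm.astype(int) @ onehot.astype(int)) > 0

print(u.tolist())
print(bm.astype(int))
```

The matrix product counts matching rows per unique value, and the > 0 comparison turns those counts back into a boolean bitmap.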

numpy

by MR

Published November 23, 2021
