Filter dataframe rows based on 0/1 matrix with string contain

November 9, 2021

I have a big dataframe with rows containing an ID within a string column. I want to reduce the rows in the dataframe based on the 0/1 matrix containing IDs as column names. To illustrate, I created a dataframe df1:

import pandas as pd
df1 = pd.DataFrame({
    'Date':['2021-01-01', '2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02', '2021-01-03', '2021-01-03', '2021-01-04', '2021-01-05', '2021-01-05', '2021-01-05', '2021-01-05'],
    'ID':['R: ID1', 'ID2_P', 'L_ID2', 'ID1.I', 'RZ:ID3', 'ID1', 'P ID2', 'ID3 4K', 'ID1', 'U_ID1', 'ID2:PD', 'ID3=2D'], 
    'Ratings':[9.0, 8.0, 5.0, 3.0, 2, 3, 6, 5, 4, 10, 3, 6]})

and a matrix m1 containing 0 and 1:

m1 = pd.DataFrame({
    'Date':['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05'],
    'ID1':[0,0,1,1,1], 
    'ID2':[1,1,1,1,0],
    'ID3':[0,0,0,0,1]})

Conditioned on m1, the dataframe df1 needs to be reduced to df2:

df2 = pd.DataFrame({
    'Date':['2021-01-01', '2021-01-01', '2021-01-03', '2021-01-03', '2021-01-05', '2021-01-05', '2021-01-05'],
    'ID':['ID2_P', 'L_ID2', 'ID1.I', 'P ID2', 'ID1', 'U_ID1', 'ID3=2D'], 
    'Ratings':[8.0, 5.0, 3, 6, 4, 10, 6]})

So far, I tried to iterate through the rows of df1 with a for loop but I wasn’t able to get a result.

Thanks a lot for your suggestions!

>Solution :

Use DataFrame.melt with filter 1 rows first:

df2 = m1.melt('Date', var_name='ID1').query('value == 1')
print (df2)
          Date  ID1  value
2   2021-01-03  ID1      1
3   2021-01-04  ID1      1
4   2021-01-05  ID1      1
5   2021-01-01  ID2      1
6   2021-01-02  ID2      1
7   2021-01-03  ID2      1
8   2021-01-04  ID2      1
14  2021-01-05  ID3      1

Then get ID by df2["ID1"].unique() in Series.str.extract:

df1['ID1'] = df1['ID'].str.extract(f'({"|".join(df2["ID1"].unique())})', expand=False)

And last filter by Date and ID1 in inner default join by DataFrame.merge, remove helper ID1 column:

df = df1.merge(df2[['Date','ID1']]).drop('ID1', axis=1)
print (df)
         Date      ID  Ratings
0  2021-01-01   ID2_P      8.0
1  2021-01-01   L_ID2      5.0
2  2021-01-03     ID1      3.0
3  2021-01-03   P ID2      6.0
4  2021-01-05     ID1      4.0
5  2021-01-05   U_ID1     10.0
6  2021-01-05  ID3=2D      6.0