Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Filter dataframe rows based on 0/1 matrix with string contain

I have a big dataframe with rows containing an ID within a string column. I want to reduce the rows in the dataframe based on the 0/1 matrix containing IDs as column names. To illustrate, I created a dataframe df1:

import pandas as pd
df1 = pd.DataFrame({
    'Date':['2021-01-01', '2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02', '2021-01-03', '2021-01-03', '2021-01-04', '2021-01-05', '2021-01-05', '2021-01-05', '2021-01-05'],
    'ID':['R: ID1', 'ID2_P', 'L_ID2', 'ID1.I', 'RZ:ID3', 'ID1', 'P ID2', 'ID3 4K', 'ID1', 'U_ID1', 'ID2:PD', 'ID3=2D'], 
    'Ratings':[9.0, 8.0, 5.0, 3.0, 2, 3, 6, 5, 4, 10, 3, 6]})  

and a matrix m1 containing 0 and 1:

m1 = pd.DataFrame({
    'Date':['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05'],
    'ID1':[0,0,1,1,1], 
    'ID2':[1,1,1,1,0],
    'ID3':[0,0,0,0,1]}) 

Conditioned on m1, the dataframe df1 needs to be reduced to df2:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

df2 = pd.DataFrame({
    'Date':['2021-01-01', '2021-01-01', '2021-01-03', '2021-01-03', '2021-01-05', '2021-01-05', '2021-01-05'],
    'ID':['ID2_P', 'L_ID2', 'ID1.I', 'P ID2', 'ID1', 'U_ID1', 'ID3=2D'], 
    'Ratings':[8.0, 5.0, 3, 6, 4, 10, 6]})

So far, I tried to iterate through the rows of df1 with a for loop but I wasn’t able to get a result.

Thanks a lot for your suggestions!

>Solution :

Use DataFrame.melt with filter 1 rows first:

df2 = m1.melt('Date', var_name='ID1').query('value == 1')
print (df2)
          Date  ID1  value
2   2021-01-03  ID1      1
3   2021-01-04  ID1      1
4   2021-01-05  ID1      1
5   2021-01-01  ID2      1
6   2021-01-02  ID2      1
7   2021-01-03  ID2      1
8   2021-01-04  ID2      1
14  2021-01-05  ID3      1

Then get ID by df2["ID1"].unique() in Series.str.extract:

df1['ID1'] = df1['ID'].str.extract(f'({"|".join(df2["ID1"].unique())})', expand=False)

And last filter by Date and ID1 in inner default join by DataFrame.merge, remove helper ID1 column:

df = df1.merge(df2[['Date','ID1']]).drop('ID1', axis=1)
print (df)
         Date      ID  Ratings
0  2021-01-01   ID2_P      8.0
1  2021-01-01   L_ID2      5.0
2  2021-01-03     ID1      3.0
3  2021-01-03   P ID2      6.0
4  2021-01-05     ID1      4.0
5  2021-01-05   U_ID1     10.0
6  2021-01-05  ID3=2D      6.0
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading