Filtering out rows based on other rows using pandas

August 17, 2022

I have a dataframe that looks like this:

import pandas as pd

# Named `data` rather than `dict` so the built-in dict type is not shadowed
data = {'companyId': {0: 198236, 1: 198236, 2: 900814, 3: 153421, 4: 153421, 5: 337815},
 'region': {0: 'Europe', 1: 'Europe', 2: 'Asia-Pacific', 3: 'North America', 4: 'North America', 5:'Africa'},
 'value': {0: 560, 1: 771, 2: 964, 3: 217, 4: 433, 5: 680},
 'type': {0: 'actual', 1: 'forecast', 2: 'actual', 3: 'forecast', 4: 'actual', 5: 'forecast'}}

df = pd.DataFrame(data)

    companyId     region          value  type
0   198236        Europe          560    actual
1   198236        Europe          771    forecast
2   900814        Asia-Pacific    964    actual
3   153421        North America   217    forecast
4   153421        North America   433    actual
5   337815        Africa          680    forecast

I can’t seem to figure out a way to filter out certain rows based on the following condition:

If there are two entries under the same companyId, as is the case for 198236 and 153421, I want to keep only the entry where type is actual.

If there is only one entry under a companyId, as is the case for 337815 and 900814, I want to keep that row, irrespective of the value in column type.

Does anyone have an idea how to go about this?

Solution:

You can use groupby and transform to build a boolean mask, then index the dataframe with it:

# The condition: keep every row whose companyId occurs only once, and for a
# companyId that occurs more than once keep only the rows where type == 'actual'.
# Expressed as a lambda that receives the 'type' column of one group at a time:
to_filter = lambda x: (len(x) == 1) | ((len(x) > 1) & (x == 'actual'))

# Then build the boolean mask, with one True/False value per row of df
m = df.groupby('companyId')['type'].transform(to_filter)

# Finally, filter df with the mask m:
df[m]

   companyId         region  value      type
0     198236         Europe    560    actual
2     900814   Asia-Pacific    964    actual
4     153421  North America    433    actual
5     337815         Africa    680  forecast
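
For comparison, the same filter can probably be written without a lambda, which tends to be faster on large frames. The sketch below is an alternative, not the answer's original code: it rebuilds the question's dataframe, and the names group_size, mask and result are made up for illustration. It assumes each companyId with several rows has at least one 'actual' row; a company with several rows and no 'actual' would be dropped entirely, just as with the lambda version.

```python
import pandas as pd

# Same data as in the question, written as plain lists
data = {
    "companyId": [198236, 198236, 900814, 153421, 153421, 337815],
    "region": ["Europe", "Europe", "Asia-Pacific",
               "North America", "North America", "Africa"],
    "value": [560, 771, 964, 217, 433, 680],
    "type": ["actual", "forecast", "actual", "forecast", "actual", "forecast"],
}
df = pd.DataFrame(data)

# Size of each companyId group, broadcast back onto every row of df
group_size = df.groupby("companyId")["type"].transform("size")

# Keep a row if its company appears only once, or if the row is an 'actual'
mask = (group_size == 1) | (df["type"] == "actual")

result = df[mask]
print(result)
```

Because transform returns a Series aligned with df's index, the two conditions can be combined row by row with |, and the kept rows retain their original index labels.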