Home How to delete all rows from pandas dataframe1 that do NOT exist in pandas dataframe2

Questions

How to delete all rows from pandas dataframe1 that do NOT exist in pandas dataframe2

December 4, 2022

I have two pandas dataframes, data1 and data2. They each have album and artist columns along with other columns that are different attributes. For the sake of what I’m trying to do, I want to delete all of the rows in data2 that DO NOT exist in data1. So, essentially I want all of the album and artists in data2 to match data1. Does anyone know the right way to go about this in python? TIA!

So far I’ve tried:

data2 = data2[data2[‘album’, ‘artist’].isin(data1[‘album’, ‘artist’])]

but it doesn’t like the ‘,’ to get both attributes to match.

>Solution :

You can use the merge method in Pandas to join the two dataframes on the album and artist columns and keep only the rows that exist in both dataframes. Here is an example of how you could do this:

import pandas as pd

# Create some sample dataframes
data1 = pd.DataFrame({
    "album": ["Thriller", "Back in Black", "The Dark Side of the Moon"],
    "artist": ["Michael Jackson", "AC/DC", "Pink Floyd"],
    "year": [1982, 1980, 1973]
})

data2 = pd.DataFrame({
    "album": ["The Bodyguard", "Thriller", "The Dark Side of the Moon"],
    "artist": ["Whitney Houston", "Michael Jackson", "Pink Floyd"],
    "genre": ["Soundtrack", "Pop", "Rock"]
})

# Merge the dataframes on the album and artist columns, and keep only the rows that exist in both dataframes
merged_data = data1.merge(data2, on=["album", "artist"], how="inner")

# Print the result
print(merged_data)

This code will print the following dataframe:

            album           artist  year genre
0         Thriller  Michael Jackson  1982   Pop
1  The Dark Side of the Moon  Pink Floyd  1973  Rock

As you can see, this dataframe only contains the rows that exist in both data1 and data2. You can then use this dataframe instead of data2 to work with the rows that exist in both dataframes.

Note that the merge method will also join the columns from the two dataframes, so you may need to drop any unnecessary columns or rename columns with the same name to avoid conflicts. You can do this using the drop and rename methods in Pandas, respectively. For example:

# Drop the "genre" column from the merged dataframe
merged_data = merged_data.drop("genre", axis=1)

# Rename the "year" column in the merged dataframe
merged_data = merged_data.rename({"year": "release_year"}, axis=1)

# Print the result
print(merged_data)

This code will print the following dataframe: