I have a dataframe with multiple columns and I want to extract the rows that are unique in the manner of the SQL "select distinct" operation. So far whenever I look up forums on this I find comments about counting distinct (but I want the actual values) or (worse) values that are distinct in two columns just joined together as one set (using ravel). What I want is, for example for two columns, values that are distinct in pairs and the result as a dataframe.
I am considering now that the most effective method might be to write it myself – doing a stable sort on tuples and then scanning for duplicates. Any pandas expression that is no simpler than doing essentially that is not an answer to this question. I am looking for a basic or simple compound operation.
For those who do not know what a "distinct" in a query does …
Starting with
1 2
2 3
1 2
4 5
2 3
2 1
we get back
1 2
2 3
4 5
2 1
Note – the question was asked should (2,1) and (1,2) be considered the same. No, as tuples are ordered. Again – refer to the behaviour of SQL for the details.
>Solution :
To get the unique values of a given column, try pandas.Series.unique():
values = df['column_name'].unique()
To get unique combinations of given columns, try pandas.DataFrame.drop_duplicates():
df.drop_duplicates(subset=['colmun_name1', 'column_name2'])