I have a dataframe like this:
| col1 | col2 | col3 | col N |
|---|---|---|---|
| x | y | z | f |
| y | x | z | f |
| f | none | none | none |
| z | y | x | f |
I need to count the rows that contain the same values, regardless of their order.
In this case, the output should be something like this:
| col1 | col2 | col3 | col N | freq |
|---|---|---|---|---|
| x | y | z | f | 3 |
| f | none | none | none | 1 |
This is because, in the input dataset, three rows contain the same set of values (line 1, line 2, and line 4).
I tried to use the "value_counts" function; however, it counts only identical rows, so rows with the same values in a different order are counted separately.
N.B. The initial dataset contains over 200 columns.
Any solution?
Thanks
>Solution :
You can aggregate each row's columns into a single object, whose form depends on the exact logic you need (a frozenset, a sorted tuple, etc.), then count the values or perform a groupby:
I would use:
out = df.agg(frozenset, axis=1).value_counts()
# or, if NaN should be ignored
out = df.agg(lambda x: frozenset(x.dropna()), axis=1).value_counts()
NB. if a row can contain duplicate values and you want to count them, use a sorted tuple in place of frozenset (a frozenset collapses duplicates).
Output:
(x, y, f, z) 3
(f) 1
dtype: int64
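To illustrate the NB above, a minimal runnable sketch of the sorted-tuple variant; the sample frame is reconstructed from the question (the column name colN stands in for "col N"):

```python
import pandas as pd

# Sample frame reconstructed from the question ("col N" written as colN).
df = pd.DataFrame({
    "col1": ["x", "y", "f", "z"],
    "col2": ["y", "x", None, "y"],
    "col3": ["z", "z", None, "x"],
    "colN": ["f", "f", None, "f"],
})

# A sorted tuple keeps duplicate values within a row (a frozenset would
# collapse them); NaN/None values are dropped before sorting.
out = df.agg(lambda x: tuple(sorted(x.dropna())), axis=1).value_counts()
print(out)
```

Since `value_counts` sorts by frequency, the three order-insensitive duplicates come first.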
If you really want a DataFrame with one of the original rows:
g = df.groupby(df.agg(frozenset, axis=1), sort=False)
out = g.first().assign(freq=g.size()).reset_index(drop=True)
Output:
col1 col2 col3 colN freq
0 x y z f 3
1 f None None None 1
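End to end, the groupby variant can be sketched as follows; the sample frame is again reconstructed from the question, and the freq column comes from the group sizes:

```python
import pandas as pd

# Sample frame reconstructed from the question ("col N" written as colN).
df = pd.DataFrame({
    "col1": ["x", "y", "f", "z"],
    "col2": ["y", "x", None, "y"],
    "col3": ["z", "z", None, "x"],
    "colN": ["f", "f", None, "f"],
})

# Group rows by their set of values, keep one representative row per
# group, and attach the group size as a "freq" column. sort=False keeps
# the groups in order of first appearance.
key = df.agg(frozenset, axis=1)
g = df.groupby(key, sort=False)
out = g.first().assign(freq=g.size()).reset_index(drop=True)
print(out)
```

Because `g.first()` and `g.size()` share the same group-key index, the assignment aligns correctly before the index is reset.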