This is a follow-up to my previous question.
I have a table stored in a CSV file:
| A | B |
|---|---|
| 35480007 | 0695388 |
| 35480007 | 0695388 |
| 35407109 | 3324741 |
| 35407109 | 3324741 |
| 35250208 | 0695388 |
| 35250208 | 6104556 |
| 86730903 | 3360935 |
| 86730903 | 3360935 |
By applying the code for aggregation:
df.groupby("B")["A"].unique()
I get the result:
695388 [35480007, 35250208]
3324741 [35407109]
3360935 [86730903]
6104556 [35250208]
Could you please tell me how I can apply a filter so that only the entries that have a value greater than two are displayed, like this:
695388 [35480007, 35250208]
I would also like to know how to save the result to a file, for example a txt file.
I apologize in advance if my question is unclear. I am not very experienced with the pandas library.
Thank you very much!
>Solution :
It took me a second to realize that what you mean is not a value greater than two, but rather a length greater than one (or greater than or equal to two).
With that said, you can use the apply function on your Series to see which rows satisfy this property:
grouped = df.groupby("B")["A"].unique()
has_multiple_elements = grouped.apply(lambda x: len(x)>1)
This applies the function to each entry in your grouped Series and returns the following:
695388 True
3324741 False
3360935 False
6104556 False
Now all that’s left is to use these True/False boolean values to filter your series. Luckily, this is very simple.
result = grouped[has_multiple_elements]
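To try the steps above end to end, here is a minimal sketch. It assumes the CSV has already been loaded into a DataFrame; the data is typed in by hand from the table in the question, with column B stored as integers (which would explain why 0695388 is displayed as 695388).

```python
import pandas as pd

# Data copied by hand from the table in the question (an assumption:
# in practice df would come from pd.read_csv on the real file).
df = pd.DataFrame({
    "A": [35480007, 35480007, 35407109, 35407109,
          35250208, 35250208, 86730903, 86730903],
    "B": [695388, 695388, 3324741, 3324741,
          695388, 6104556, 3360935, 3360935],
})

# One array of distinct A values per value of B.
grouped = df.groupby("B")["A"].unique()

# Boolean mask: True where a group holds more than one distinct A value.
has_multiple_elements = grouped.apply(lambda x: len(x) > 1)

# Keep only the groups where the mask is True.
result = grouped[has_multiple_elements]
print(result)
```

With this data only the group for 695388 should remain, since it is the only B value associated with two different A values.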
As for the second part of your question, writing this to a file can be done using the to_csv method:
# I usually use tab-separated files in case any commas appear in the data itself
result.to_csv('output.tsv', sep='\t')
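Because each entry of the result is an array, the file will contain its printed form, brackets included. If plain values are preferred, one option is to join each group into a single string before writing. The sketch below assumes a small stand-in for the filtered result (built by hand here so the example runs on its own) and reuses the output.tsv name from above.

```python
import pandas as pd

# Hypothetical stand-in for the filtered result from the steps above.
result = pd.Series(
    [[35480007, 35250208]],
    index=pd.Index([695388], name="B"),
    name="A",
)

# Join each group's values into one comma-separated string so the file
# does not contain list or array formatting.
as_text = result.apply(lambda values: ",".join(str(v) for v in values))

# Tab-separated, so the commas inside each cell stay unambiguous.
as_text.to_csv("output.tsv", sep="\t")

# Read the file back to check what was written.
with open("output.tsv") as f:
    print(f.read())
```

The tab separator matters here: with the default comma separator, pandas would have to quote each joined cell.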