Refactor code in a pythonic way to get the most popular elements in pandas dataframe

This is the dataframe:

   image_file   objects
0  image_1.png  [car, car, car, car, car, car, car, bus, car]
1  image_2.png  [traffic light, car, car, car, car, car, car, car, car, car]
2  image_3.png  [car, traffic light, person, car, car, car, car]
3  image_4.png  [person, person, car, car, bicycle, car, car]
4  image_5.png  [car, car, car, car, car, person, car, car, car]

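Each value in the objects column is a list of the objects detected in that image, with one entry per occurrence, so an object's frequency is the number of times it appears in the list.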

With the code below, I can count the objects that appear in images containing three or fewer distinct objects:

from collections import Counter

result = []

# Iterate through rows of the dataframe
for i, row in df.iterrows():
    # Count the frequency of each object in the image
    frequencies = Counter(row['objects'])
    # Sort the frequencies from most to least common
    sorted_frequencies = sorted(frequencies.items(),
                                key=lambda x: x[1],
                                reverse=True)

    # Check if there are at most 3 different objects in the image
    if len(sorted_frequencies) <= 3:
        # If so, append all of the objects to the result list
        result.extend([obj for obj, _ in sorted_frequencies])

frequency_3_most_pop = dict(Counter(result))

My concern is that iterrows is not the best way to iterate over a dataframe, and I would like to refactor the code to avoid it.
Any help would be appreciated.

Solution:

Assuming you have lists in df['objects'], you can simplify your code:

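from collections import Counter

# For each row with at most 3 distinct objects, yield each distinct object once
frequency_3_most_pop = dict(Counter(x for l in df['objects']
                                    if len(c:=Counter(l))<=3 for x in c))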

NB. This requires Python 3.8+ because of the walrus operator (:=), introduced by PEP 572.
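
On an older interpreter, the same counts can be obtained without the walrus operator. The following is a minimal sketch, not part of the original answer: it assumes that only the distinct objects per image matter (which is how the one-liner above uses the Counter keys), so a plain set is enough. The names `objects` and `count_labels_in_sparse_images` are hypothetical, and `objects` stands in for `df['objects'].tolist()`.

```python
from collections import Counter

# Hypothetical stand-in for df['objects'].tolist(): one list of labels per image.
objects = [
    ['car', 'car', 'car', 'car', 'car', 'car', 'car', 'bus', 'car'],
    ['traffic light', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car'],
    ['car', 'traffic light', 'person', 'car', 'car', 'car', 'car'],
    ['person', 'person', 'car', 'car', 'bicycle', 'car', 'car'],
    ['car', 'car', 'car', 'car', 'car', 'person', 'car', 'car', 'car'],
]


def count_labels_in_sparse_images(rows, max_distinct=3):
    """For each label, count the images that contain it, considering
    only images with at most `max_distinct` distinct labels."""
    # Reduce every row to its distinct labels, lazily.
    distinct_per_image = (set(labels) for labels in rows)
    return dict(Counter(
        label
        for distinct in distinct_per_image
        if len(distinct) <= max_distinct
        for label in distinct
    ))


frequency_3_most_pop = count_labels_in_sparse_images(objects)
```

Because sets are unordered, the key order of the resulting dict may differ from the one-liner's, but the counts are the same. Each count is the number of qualifying images that contain the object, not the total number of occurrences.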

Output:

{'car': 5, 'bus': 1, 'traffic light': 2, 'person': 3, 'bicycle': 1}

Timing

Measured on 6k rows:

# original approach
346 ms ± 49.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Counter generator (this approach)
11.5 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Used input:

import pandas as pd

df = pd.DataFrame({'image_file': ['image_1.png', 'image_2.png', 'image_3.png', 'image_4.png', 'image_5.png'],
                   'objects': [['car', 'car', 'car', 'car', 'car', 'car', 'car', 'bus', 'car'],
                               ['traffic light', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car'],
                               ['car', 'traffic light', 'person', 'car', 'car', 'car', 'car'],
                               ['person', 'person', 'car', 'car', 'bicycle', 'car', 'car'],
                               ['car', 'car', 'car', 'car', 'car', 'person', 'car', 'car', 'car']],
                   })