Group data by a given Range

I'm looking for an algorithm that computes a separate mean for each group of nearby values.
Example:
I have the values 1.6, 1.7, 5.6, 5.7, 5.5
So the output should be 1.65 and 5.6

Solution:

If you know the "range" around each cluster mean

A possible simple solution: round every value down to a multiple of your "range" parameter, then group the values that were rounded to the same multiple.

To group, you can use a combination of sorted and itertools.groupby, or more simply, you can use a dict of lists.

from collections import defaultdict

def clusters(data, r):
    # Map each bin index (x // r) to the list of values that fall in that bin.
    groups = defaultdict(list)
    for x in data:
        groups[x // r].append(x)
    return groups

def means_of_clusters(data, r):
    # One mean per non-empty bin, in order of first appearance in data.
    return [sum(g) / len(g) for g in clusters(data, r).values()]

print(means_of_clusters([1.6, 1.7, 5.6, 5.7, 5.5], 0.4))
# [1.65, 5.55, 5.7]

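Note how 5.7 was separated from 5.5 and 5.6: 5.5 and 5.6 were rounded down to 13*0.4, whereas 5.7 was rounded down to 14*0.4. Fixed bins like this can split values that lie close together but on opposite sides of a bin boundary. Mathematically 5.6 equals 14*0.4, but in floating-point arithmetic 5.6 // 0.4 evaluates to 13.0, so a value sitting on a boundary can land on either side.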
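The sorted and itertools.groupby combination mentioned above can be sketched as follows. The function name means_of_clusters_groupby is made up for this illustration, the binning rule (x // r) is the same as in clusters, and the data in the usage line is invented so that the means come out exact.

```python
from itertools import groupby

def means_of_clusters_groupby(data, r):
    # Hypothetical helper, not part of the original answer.
    def bin_index(x):
        return x // r

    # groupby only merges adjacent items, so sort by the bin index first.
    ordered = sorted(data, key=bin_index)
    means = []
    for _, group in groupby(ordered, key=bin_index):
        values = list(group)
        means.append(sum(values) / len(values))
    return means

print(means_of_clusters_groupby([1, 2, 11, 12, 13, 25], 10))
# [1.5, 12.0, 25.0]
```

Unlike the dict version, this returns the means ordered by bin index rather than by first appearance in the input.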

If you know the number of clusters

You mentioned in the comments that there will always be 2 clusters. I suggest just looking for the greatest gap between two consecutive numbers in the sorted list, and splitting on that gap:

def split_in_2_clusters(data):
    # Requires at least two values; otherwise max() has nothing to choose from.
    seq = sorted(data)
    # Index of the right-hand element of the widest gap between neighbours.
    split_index = max(range(1, len(seq)), key=lambda i: seq[i] - seq[i-1])
    return seq[:split_index], seq[split_index:]

def means_of_2_clusters(data):
    return tuple(sum(g) / len(g) for g in split_in_2_clusters(data))

print(means_of_2_clusters([1.6, 1.7, 5.6, 5.7, 5.5]))
# (1.65, 5.6000000000000005)
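If the number of clusters is known but larger than 2, the same idea should extend naturally: cut the sorted list at the k-1 widest gaps. The following is a sketch under that assumption and is not part of the original answer; split_in_k_clusters and its parameter k are hypothetical names, and the sample data is invented.

```python
def split_in_k_clusters(data, k):
    # Hypothetical generalisation of the two-cluster split.
    seq = sorted(data)
    if not 1 <= k <= len(seq):
        raise ValueError("k must be between 1 and the number of values")
    # Rank the candidate cut positions by the size of the gap to their left
    # neighbour, keep the k-1 widest, then restore left-to-right order.
    by_gap = sorted(range(1, len(seq)),
                    key=lambda i: seq[i] - seq[i - 1],
                    reverse=True)
    cut_points = sorted(by_gap[:k - 1])
    bounds = [0] + cut_points + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]

print(split_in_k_clusters([1, 2, 10, 11, 30], 3))
# [[1, 2], [10, 11], [30]]
```

With k=2 this makes the same cut as split_in_2_clusters whenever the widest gap is unique.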

For more complex clustering problems

I strongly suggest taking a look at the clustering algorithms implemented in the scikit-learn library. Its clustering documentation page lists the algorithms in a table that shows which parameters each algorithm expects, so you can easily choose the one best suited to your situation.
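As one illustration, k-means from scikit-learn could be applied to the sample data from the question like this. It is a minimal sketch that assumes scikit-learn and NumPy are installed and that the number of clusters (2) is known in advance; k-means is only one of the listed algorithms and may not be the best fit for every data set.

```python
import numpy as np
from sklearn.cluster import KMeans

data = [1.6, 1.7, 5.6, 5.7, 5.5]
# scikit-learn expects a 2-D array of shape (n_samples, n_features).
X = np.array(data).reshape(-1, 1)

# n_clusters=2 is an assumption taken from the question's example.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# cluster_centers_ holds one row per cluster; flatten and sort for display.
centers = sorted(model.cluster_centers_.ravel())
print(centers)
```

The two centers should come out close to 1.65 and 5.6, matching the gap-based split above.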
