pandas cut – different bins for different labels

I have a data frame with two different labels, A and B, and an associated numeric value.
I want to add a column giving the label of the custom bin that the numeric value falls into, which can be achieved with pd.cut() as follows:

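import numpy as np
import pandas as pd

df = pd.DataFrame({"label": ['A','A','A','A','A','A','B','B','B','B'],
                   "num":   [ 1 , 2 , 4 , 5 , 10, 11, 1 , 3 , 4 , 5 ]})

# Bin edges are right-closed; include_lowest=True also puts a value of 0 in the first bin
df['Bin'] = pd.cut(df["num"],
                   [0, 4.5, 7.5, np.inf],
                   labels=['0-4', '5-8', '>8'],
                   include_lowest=True)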

giving:

  label  num  Bin
0     A    1  0-4
1     A    2  0-4
2     A    4  0-4
3     A    5  5-8
4     A   10   >8
5     A   11   >8
6     B    1  0-4
7     B    3  0-4
8     B    4  0-4
9     B    5  5-8

This works well for A, but most of the values of B fall into the bottom bin, so I’d like to increase the resolution by using different bins for A and B to produce the following:

  label  num  Bin
0     A    1  0-4
1     A    2  0-4
2     A    4  0-4
3     A    5  5-8
4     A   10   >8
5     A   11   >8
6     B    1  0-2
7     B    3  2-4
8     B    4  2-4
9     B    5   >4

It feels like this should be possible using a conditional such as df.where(), or maybe a groupby with a transform() or apply(), or a list comprehension with if, but I have been reading Stack Overflow and experimenting all day without managing to achieve anything.

I guess I could split the data into individual data frames based on label, apply a custom cut to each sub-dataframe, and then concatenate the results back together, but this doesn’t feel very pythonic, nor does it lend itself to generalisable code.

PS – This is a minimal example. My real data frame has more label values, and I want to keep it as a single data frame with different bins per label for further processing in my code, which is why I don’t want to split it into separate data frames based on label.

Solution:

Yes, groupby().apply() is a good choice. For example, to let pandas compute three equal-width bins separately for each label, you can do:

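# group_keys=False keeps the original index so the result aligns with df on assignment
df['Bin'] = df.groupby('label', group_keys=False)['num'].apply(pd.cut, bins=3)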

Output:

  label  num             Bin
0     A    1   (0.99, 4.333]
1     A    2   (0.99, 4.333]
2     A    4   (0.99, 4.333]
3     A    5  (4.333, 7.667]
4     A   10   (7.667, 11.0]
5     A   11   (7.667, 11.0]
6     B    1  (0.996, 2.333]
7     B    3  (2.333, 3.667]
8     B    4    (3.667, 5.0]
9     B    5    (3.667, 5.0]
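
To see why the intervals differ between A and B, it can help to look at the edges pd.cut() computes for each group. The following is a minimal sketch, assuming the same example data as in the question; the edges_by_label dictionary is a name introduced here purely for illustration. It uses the retbins=True option of pd.cut() to return the computed edges alongside the binned values.

```python
import pandas as pd

df = pd.DataFrame({"label": ['A','A','A','A','A','A','B','B','B','B'],
                   "num":   [1, 2, 4, 5, 10, 11, 1, 3, 4, 5]})

# Hypothetical helper dict: maps each label to the bin edges pd.cut chose for it
edges_by_label = {}

# Iterating a grouped Series yields (group key, Series of that group's values)
for label, values in df.groupby('label')['num']:
    # retbins=True makes pd.cut return the edges as a second value
    _, edges = pd.cut(values, bins=3, retbins=True)
    edges_by_label[label] = edges
    print(label, edges.round(3))
```

Each group gets edges spanning its own minimum to maximum, which is why B ends up with narrower bins than A.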

Or, if you have specific bins and bin labels for each label value, you can pass them in as dictionaries keyed by label:

bins = {'A': [0, 4.5, 7.5, np.inf], 'B': [0, 2.5, 4.5, np.inf]}
labels = {'A': ['0-4', '5-8', '>8'], 'B': ['0-2', '2-4', '>4']}

def my_cut(values, bins, labels):
    # groupby().apply() sets .name on each group to its group key ('A' or 'B')
    label = values.name
    return pd.cut(values, bins=bins[label], labels=labels[label])

# Selecting 'num' first avoids relying on the grouping column being passed to my_cut
df['Bin'] = df.groupby('label', group_keys=False)['num'].apply(my_cut, bins=bins, labels=labels)

Output:

  label  num  Bin
0     A    1  0-4
1     A    2  0-4
2     A    4  0-4
3     A    5  5-8
4     A   10   >8
5     A   11   >8
6     B    1  0-2
7     B    3  2-4
8     B    4  2-4
9     B    5   >4
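
For comparison, the split-and-concatenate idea from the question can also be written fairly compactly without apply(). This is only a sketch under the same assumptions as above (the bins and labels dictionaries keyed by label value); the pieces list is a name introduced here for illustration. It passes include_lowest=True, as the question's original call does, so that a value equal to the lowest edge would still land in the first bin.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"label": ['A','A','A','A','A','A','B','B','B','B'],
                   "num":   [1, 2, 4, 5, 10, 11, 1, 3, 4, 5]})

bins = {'A': [0, 4.5, 7.5, np.inf], 'B': [0, 2.5, 4.5, np.inf]}
labels = {'A': ['0-4', '5-8', '>8'], 'B': ['0-2', '2-4', '>4']}

# Cut each group's values with that group's own bins and labels.
# Each group has a different set of categories, so convert to plain
# objects before combining the pieces.
pieces = [
    pd.cut(values, bins=bins[label], labels=labels[label],
           include_lowest=True).astype(object)
    for label, values in df.groupby('label')['num']
]

# Every piece keeps its original row index, so the assignment aligns with df
df['Bin'] = pd.concat(pieces)
print(df)
```

Because the dictionaries drive the loop, adding another label value only requires adding an entry to bins and labels; the data frame itself is never split permanently.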