Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Creating a pattern in a sequence for groupby in pandas

I am having A dataset of Three columns ID , ‘sort_seqandlevel. basically i want to identify id wise level sequence sort by sort_seq. please suggest any optimal code other then for loop` because it is taking longer time with dictionary and appending in list.

Input Dataset

    import pandas as pd
import numpy as np
data = {'id': [1, 1, 1, 1,2, 2, 3, 3, 3, 3, 4, 5, 5, 6],
        'sort_seq': [89, 24, 56,  8,  5, 64, 93, 88, 61, 31, 50, 75,  1, 81],
        'level':['a', 'a',  'b', 'c', 'x', 'x', 'g', 'a', 'b', 'b', 'b', 'c', 'c','b']}
df = pd.DataFrame(data)

Expected Output

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

enter image description here

Tried Code

collect = []
for ij in df.id.unique():
  idict = {}
  x =  df[df['id'] == ij]
  x = x.sort_values(by='sort_seq',ascending=True)
  x = x.reset_index()
  idict[ij] =  x['level'].tolist()
  collect.append(idict)
collect

>Solution :

First sort by both columns by DataFrame.sort_values, then grouping by consecutive values in level with counter by Series.value_counts, join number of counts by values of level, so possible aggregate by join:

np.random.seed(123)
    
data = {'id': [1, 1, 1, 1,2, 2, 3, 3, 3, 3, 4, 5, 5, 6],
        'sort_seq': np.random.randint(0, 100, size=14),
        'level':['a', 'a',  'b', 'c', 'x', 'x', 'g', 'a', 'b', 'b', 'b', 'c', 'c','b']}
df = pd.DataFrame(data)

df1 = df.sort_values(['id','sort_seq'])

df1 = (df1.groupby(['id', df1['level'].ne(df1['level'].shift()).cumsum()])['level']
          .value_counts()
          .droplevel(1)
          .rename('col')
          .reset_index()
          .assign(level=lambda x: x['col'].astype(str) + x['level'])
          .groupby('id')['level'].agg(','.join)
          .reset_index(name='PATTERN')
           )
print(df1)

   id      PATTERN
0   1     1c,2a,1b
1   2           2x
2   3  1b,1g,1b,1a
3   4           1b
4   5           2c
5   6           1b

Or use itertools.groupby with Counter in comprehension:

import itertools
from collections import Counter


f = lambda x: ','.join(f'{b}{a}' 
                        for _, g in itertools.groupby(x) 
                        for a, b in Counter(g).items())
df1 = (df.sort_values(['id','sort_seq'])
         .groupby('id')['level']
         .agg(f)
         .reset_index(name='PATTERN'))
           
print(df1)
   id      PATTERN
0   1     1c,2a,1b
1   2           2x
2   3  1b,1g,1b,1a
3   4           1b
4   5           2c
5   6           1b
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading