Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to add a new column that has the occurrence number of the value in a column?

Given a Polars Dataframe:

df = pl.DataFrame({'Col_1':['A','B','C','D'],'Col_2':['a','b','c','d'],'Col_3':[1,2,3,4]})
print(df)
shape: (4, 3)
┌───────┬───────┬───────┐
│ Col_1 ┆ Col_2 ┆ Col_3 │
│ ---   ┆ ---   ┆ ---   │
│ str   ┆ str   ┆ i64   │
╞═══════╪═══════╪═══════╡
│ A     ┆ a     ┆ 1     │
│ B     ┆ b     ┆ 2     │
│ C     ┆ c     ┆ 3     │
│ D     ┆ d     ┆ 4     │
└───────┴───────┴───────┘

and a list:

display_list = ['A','B','B','B','C','D','D','A']

I want the output to be in this format:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

shape: (8, 4)
┌───────┬───────┬───────┬───────────────┐
│ Col_1 ┆ Col_2 ┆ Col_3 ┆ Occurrence_No │
│ ---   ┆ ---   ┆ ---   ┆ ---           │
│ str   ┆ str   ┆ i64   ┆ i64           │
╞═══════╪═══════╪═══════╪═══════════════╡
│ A     ┆ a     ┆ 1     ┆ 1             │
│ B     ┆ b     ┆ 2     ┆ 1             │
│ B     ┆ b     ┆ 2     ┆ 2             │
│ B     ┆ b     ┆ 2     ┆ 3             │
│ C     ┆ c     ┆ 3     ┆ 1             │
│ D     ┆ d     ┆ 4     ┆ 1             │
│ D     ┆ d     ┆ 4     ┆ 2             │
│ A     ┆ a     ┆ 1     ┆ 1             │
└───────┴───────┴───────┴───────────────┘

I want to the rows to be duplicated based on the number of times the first column appears in the list. Also, I need an Occurrence_No column that acts as the counter of the occurrence of the first column in the dataframe. The order of the rows in the DataFrame does not matter.

I am able to get the result except the Occurrence_No using this code:

df = df.with_columns(pl.col('Col_1').apply(lambda x: display_list.count(x)).alias('occur'))
df = df.select(pl.exclude('occur').repeat_by('occur').explode())

The above codes creates the number of rows based on the number of times it occurs in the list.

How do I add the Occurrence_No column?

>Solution :

You could just do a join of display_list (as a pl.Series -> pl.DataFrame). For Occurence_no, a cumcount over the rle_id.

pl.Series(name='Col_1', values=display_list).to_frame().join(
    df, on='Col_1'
).with_columns(
    Occurence_No=1 + pl.col('Col_1').cumcount().over(pl.col('Col_1').rle_id())
)
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading