Counting repetitions in PySpark

I’m currently working with a large DataFrame and have run into an issue.

I want to return the number of times (count) each value is repeated in a table.

For example:
the number 10 is repeated twice, so I want to get the number 2, and so on…


My code is:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType

right_table_23 = [
    ("ID1", 2),
    ("ID2", 3),
    ("ID3", 5),
    ("ID4", 6),
    ("ID6", 10),
    ("ID8", 15),
    ("ID9", 10),
    ("ID10", 5),
    ("ID2", 5),
    ("ID3", 8),
    ("ID4", 3),
    ("ID2", 2),
    ("ID3", 4),
    ("ID4", 3)
]

A schema for the table shown above:

schema = StructType([
    StructField("ID", StringType(), True),
    StructField("Count", IntegerType(), True)
])

Next I create my table with the following code:

df_right_table_23 = spark.createDataFrame(right_table_23, schema)

In order to count the number of repetitions I use the following code:

# Count how many rows contain the value 2.
# Note: df.count refers to the DataFrame's count() method, so the
# column has to be referenced with bracket notation instead.
df_right_table_23.where(df_right_table_23["Count"] == 2).count()

But if the range of values includes numbers from 2 up to 100, it is hard and time-consuming to rewrite the above code for each one.

Is it possible to somehow automate the process of counting repetitions?

>Solution:

There’s no need to stress: you can automate counting the repetitions of each value in your DataFrame with the good old groupBy and count functions in PySpark.

You were almost there; here is a code snippet to help you:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Create a Spark session
spark = SparkSession.builder.appName("CountRepetitions").getOrCreate()

# your schema
schema = StructType([
    StructField("ID", StringType(), True),
    StructField("Value", IntegerType(), True)  # renamed from 'Count' to avoid clashing with the count() method
])

# Build the DataFrame from the raw data
df_right_table_23 = spark.createDataFrame(right_table_23, schema)

# Group by both 'ID' and 'Value' and count the occurrences of each pair
result = df_right_table_23.groupBy("ID", "Value").count()

# Rename the 'count' column to 'Occurrences' for clarity
result = result.withColumnRenamed("count", "Occurrences")

# Display the result
result.show()

The output:

+----+-----+-----------+
|  ID|Value|Occurrences|
+----+-----+-----------+
| ID1|    2|          1|
| ID2|    3|          1|
| ID3|    5|          1|
| ID6|   10|          1|
| ID4|    6|          1|
| ID9|   10|          1|
| ID8|   15|          1|
|ID10|    5|          1|
| ID3|    8|          1|
| ID2|    5|          1|
| ID4|    3|          2|
| ID2|    2|          1|
| ID3|    4|          1|
+----+-----+-----------+
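The grouping above counts each (ID, Value) pair. The original question, though, asks how many times each value appears overall; in PySpark that only needs a groupBy on the value column alone, e.g. `df_right_table_23.groupBy("Value").count()`. The same aggregation can be sketched without Spark using Python’s `collections.Counter` (a plain-Python sketch of the counting logic, not PySpark code):

```python
from collections import Counter

# Same raw data as in the question: (ID, value) pairs
right_table_23 = [
    ("ID1", 2), ("ID2", 3), ("ID3", 5), ("ID4", 6), ("ID6", 10),
    ("ID8", 15), ("ID9", 10), ("ID10", 5), ("ID2", 5), ("ID3", 8),
    ("ID4", 3), ("ID2", 2), ("ID3", 4), ("ID4", 3),
]

# Count how often each value appears, regardless of ID --
# this mirrors df.groupBy("Value").count() in PySpark
value_counts = Counter(value for _, value in right_table_23)

print(value_counts[10])  # 2 -- the value 10 appears twice, as in the question
print(value_counts[5])   # 3 -- the value 5 appears three times
```

Because `Counter` is a dict subclass, `value_counts.most_common()` also gives the values sorted by frequency, which is handy when the range of values is large.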
