Counting repetitions in PySpark

I’m currently working with a large DataFrame and have run into an issue.

I want to return the number of times (count) each value is repeated in a table.

For example:
the number 10 is repeated twice, so I want to get the number 2, and so on…


My code is:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType

right_table_23 = [
    ("ID1", 2),
    ("ID2", 3),
    ("ID3", 5),
    ("ID4", 6),
    ("ID6", 10),
    ("ID8", 15),
    ("ID9", 10),
    ("ID10", 5),
    ("ID2", 5),
    ("ID3", 8),
    ("ID4", 3),
    ("ID2", 2),
    ("ID3", 4),
    ("ID4", 3)
]

A schema for the table shown above:

schema = StructType([
    StructField("ID", StringType(), True),
    StructField("Count", IntegerType(), True)
])

Next I create my table with the following code:

df_right_table_23 = spark.createDataFrame(right_table_23, schema)

In order to count the number of repetitions I use the following code:

# Count how many rows contain the value 2.
# Note: df.count refers to the DataFrame's count() method, so the
# column has to be referenced with bracket notation instead.
df_right_table_23.where(df_right_table_23["Count"] == 2).count()

But if the range of values includes numbers from 2 up to 100, it is hard and time-consuming to rewrite the above code for each one.

Is it possible to somehow automate the process of counting repetitions?

>Solution:

There’s no need to stress: you can automate counting the repetitions of each value in your DataFrame with the good old groupBy and count functions in PySpark.

You were almost there; here is a code snippet to help you:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Create a Spark session
spark = SparkSession.builder.appName("CountRepetitions").getOrCreate()

# your schema
schema = StructType([
    StructField("ID", StringType(), True),
    StructField("Value", IntegerType(), True)  # renamed from 'Count' to avoid clashing with the count() method
])

# Build the DataFrame from the raw data
df_right_table_23 = spark.createDataFrame(right_table_23, schema)

# Group by both 'ID' and 'Value' and count the occurrences of each pair
result = df_right_table_23.groupBy("ID", "Value").count()

# Rename the 'count' column to 'Occurrences' for clarity
result = result.withColumnRenamed("count", "Occurrences")

# Display the result
result.show()

The output:

+----+-----+-----------+
|  ID|Value|Occurrences|
+----+-----+-----------+
| ID1|    2|          1|
| ID2|    3|          1|
| ID3|    5|          1|
| ID6|   10|          1|
| ID4|    6|          1|
| ID9|   10|          1|
| ID8|   15|          1|
|ID10|    5|          1|
| ID3|    8|          1|
| ID2|    5|          1|
| ID4|    3|          2|
| ID2|    2|          1|
| ID3|    4|          1|
+----+-----+-----------+
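The grouping above counts each (ID, Value) pair. The original question, though, asks how many times each value appears overall; in PySpark that only needs a groupBy on the value column alone, e.g. `df_right_table_23.groupBy("Value").count()`. The same aggregation can be sketched without Spark using Python’s `collections.Counter` (a plain-Python sketch of the counting logic, not PySpark code):

```python
from collections import Counter

# Same raw data as in the question: (ID, value) pairs
right_table_23 = [
    ("ID1", 2), ("ID2", 3), ("ID3", 5), ("ID4", 6), ("ID6", 10),
    ("ID8", 15), ("ID9", 10), ("ID10", 5), ("ID2", 5), ("ID3", 8),
    ("ID4", 3), ("ID2", 2), ("ID3", 4), ("ID4", 3),
]

# Count how often each value appears, regardless of ID --
# this mirrors df.groupBy("Value").count() in PySpark
value_counts = Counter(value for _, value in right_table_23)

print(value_counts[10])  # 2 -- the value 10 appears twice, as in the question
print(value_counts[5])   # 3 -- the value 5 appears three times
```

Because `Counter` is a dict subclass, `value_counts.most_common()` also gives the values sorted by frequency, which is handy when the range of values is large.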
