PySpark DataFrame groupBy with aggregated unique values

I am looking for the PySpark equivalent of pandas df.groupby('upc')['store'].unique(), where df is any pandas DataFrame.
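For reference, here is what that pandas call returns on the same sample data used below (a minimal sketch; pdf is just an illustrative name):

import pandas as pd

pdf = pd.DataFrame({
    "upc": ["36636", "40288", "42114", "39192", "39192"],
    "store": ["M", "M", "M", "F", "F"],
})

# one array of distinct stores per upc
print(pdf.groupby("upc")["store"].unique())
# upc
# 36636    [M]
# 39192    [F]
# 40288    [M]
# 42114    [M]
# Name: store, dtype: object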

Please use this piece of code for DataFrame creation in PySpark:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

# sample rows: (upc, store, sale)
data2 = [("36636", "M", 3000),
         ("40288", "M", 4000),
         ("42114", "M", 3000),
         ("39192", "F", 4000),
         ("39192", "F", 2000)]

schema = StructType([
    StructField("upc", StringType(), True),
    StructField("store", StringType(), True),
    StructField("sale", IntegerType(), True)
])

df = spark.createDataFrame(data=data2, schema=schema)
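For reference, df.show() prints the sample frame (row order matches the input on a small local run, but is not guaranteed in general):

df.show()
# +-----+-----+----+
# |  upc|store|sale|
# +-----+-----+----+
# |36636|    M|3000|
# |40288|    M|4000|
# |42114|    M|3000|
# |39192|    F|4000|
# |39192|    F|2000|
# +-----+-----+----+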

I know how to get a distinct count per group in PySpark; what I need are the distinct values themselves.
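For context, the count version I mean is along these lines (a sketch using F.countDistinct; store_count is just an illustrative alias):

df.groupBy('upc').agg(F.countDistinct('store').alias('store_count')).show()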


Solution:

You can use collect_set to get the unique values per group:

from pyspark.sql import functions as F

# collect_set aggregates the distinct values of `store` for each upc
df_group = df.groupBy('upc').agg(F.collect_set('store').alias('stores'))
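On the sample data above, each upc maps to a single store, so the result looks like this (row order from show() is not guaranteed):

df_group.show()
# +-----+------+
# |  upc|stores|
# +-----+------+
# |36636|   [M]|
# |40288|   [M]|
# |42114|   [M]|
# |39192|   [F]|
# +-----+------+

Note that collect_set drops duplicates. If you want to keep every value, use F.collect_list instead, and wrap either in F.sort_array if you need a deterministic order inside the arrays.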
