Unable to perform row operations in pyspark dataframe

by MR

May 6, 2022

I have a dataset in this form:

Store_Name         Items                                      Ratings

Cartmax         Cosmetics, Clothing, Perfumes                  4.6/5
DollarSmart     Watches, Clothing                              NEW
Megaplex        Shoes, Cosmetics, Medicines, Sports            4.2/5

I want to create a new column that contains the number of items in each store. For example, the Items column in the first row lists 3 items, so the new column should have the value 3 for that row.
In the Ratings column, a few rows have 'NEW' or 'NULL' values. I want to remove all of those rows.

Solution:

You can achieve this with filter (to drop the unwanted rows) and split (to count the comma-separated items), as shown below:

Data Preparation

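from io import StringIO

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F  # used in the filter and split step

# `sql` is the Spark entry point; a SparkSession is assumed here
sql = SparkSession.builder.getOrCreate()

# Sample data; columns are separated by two or more spaces
s = StringIO("""
Store_Name   Items                                Ratings
Cartmax      Cosmetics, Clothing, Perfumes        4.6/5
DollarSmart  Watches, Clothing                    NEW
Megaplex     Shoes, Cosmetics, Medicines, Sports  4.2/5
""")

df = pd.read_csv(s, sep=r'\s{2,}', engine='python')

sparkDF = sql.createDataFrame(df)

sparkDF.show(truncate=False)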

+-----------+-----------------------------------+-------+
|Store_Name |Items                              |Ratings|
+-----------+-----------------------------------+-------+
|Cartmax    |Cosmetics, Clothing, Perfumes      |4.6/5  |
|DollarSmart|Watches, Clothing                  |NEW    |
|Megaplex   |Shoes, Cosmetics, Medicines, Sports|4.2/5  |
+-----------+-----------------------------------+-------+

Filter & Split

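# Keep rows whose rating is present AND is not the placeholder 'NEW' or 'NULL',
# then count the comma-separated entries in Items
sparkDF = sparkDF.filter(F.col('Ratings').isNotNull() & ~F.col('Ratings').isin(['NEW', 'NULL']))\
                 .withColumn('NumberOfItems', F.size(F.split(F.col('Items'), ',')))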


sparkDF.show(truncate=False)

+----------+-----------------------------------+-------+-------------+
|Store_Name|Items                              |Ratings|NumberOfItems|
+----------+-----------------------------------+-------+-------------+
|Cartmax   |Cosmetics, Clothing, Perfumes      |4.6/5  |3            |
|Megaplex  |Shoes, Cosmetics, Medicines, Sports|4.2/5  |4            |
+----------+-----------------------------------+-------+-------------+
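
To make the per-row logic explicit, here is a minimal sketch of the same two steps in plain Python rather than PySpark. The helper names keep_row and count_items are hypothetical and exist only for illustration. A row should survive only when its rating is present and is not one of the placeholder strings, so the two conditions have to be combined with "and"; combining them with "or" would let 'NEW' through, because 'NEW' is itself a non-null value.

```python
from typing import Optional

# Rows taken from the question; plain tuples stand in for DataFrame rows.
ROWS = [
    ("Cartmax", "Cosmetics, Clothing, Perfumes", "4.6/5"),
    ("DollarSmart", "Watches, Clothing", "NEW"),
    ("Megaplex", "Shoes, Cosmetics, Medicines, Sports", "4.2/5"),
]


def keep_row(rating: Optional[str]) -> bool:
    """Hypothetical helper: keep a row only if the rating is present
    AND is not one of the placeholder strings."""
    return rating is not None and rating not in ("NEW", "NULL")


def count_items(items: str) -> int:
    """Hypothetical helper mirroring size(split(Items, ',')):
    the number of comma-separated pieces."""
    return len(items.split(","))


# Filter first, then append the item count as a fourth field.
result = [
    (name, items, rating, count_items(items))
    for name, items, rating in ROWS
    if keep_row(rating)
]

for row in result:
    print(row)
```

With the three rows from the question, only Cartmax and Megaplex remain, with counts 3 and 4, which matches the table above. keep_row(None) also returns False, which corresponds to dropping rows with a missing rating.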
