I have a dataset in this form:
Store_Name Items Ratings
Cartmax Cosmetics, Clothing, Perfumes 4.6/5
DollarSmart Watches, Clothing NEW
Megaplex Shoes, Cosmetics, Medicines, Sports 4.2/5
-
I want to create a new column which contain the number of items in the store. For example in this first row, the item column has 3 items, so the column have value 3 for first row.
-
In the ratings column, few rows have ‘NEW’ and ‘NULL’ values. I want to remove all those rows.
>Solution :
You can achieve this with filter and split as below –
Data Preparation
s = StringIO("""
Store_Name Items Ratings
Cartmax Cosmetics, Clothing, Perfumes 4.6/5
DollarSmart Watches, Clothing NEW
Megaplex Shoes, Cosmetics, Medicines, Sports 4.2/5
""")
df = pd.read_csv(s,delimiter='\t')
sparkDF = sql.createDataFrame(df)
sparkDF.show(truncate=False)
+-----------+-----------------------------------+-------+
|Store_Name |Items |Ratings|
+-----------+-----------------------------------+-------+
|Cartmax |Cosmetics, Clothing, Perfumes |4.6/5 |
|DollarSmart|Watches, Clothing |NEW |
|Megaplex |Shoes, Cosmetics, Medicines, Sports|4.2/5 |
+-----------+-----------------------------------+-------+
Filter & Split
sparkDF = sparkDF.filter(~(F.col('Ratings').isin(['NEW','NULL'])) | F.col('Ratings').isNotNull())\
.withColumn('NumberOfItems',F.size(F.split(F.col('Items'),',')))
sparkDF.show(truncate=False)
+----------+-----------------------------------+-------+-------------+
|Store_Name|Items |Ratings|NumberOfItems|
+----------+-----------------------------------+-------+-------------+
|Cartmax |Cosmetics, Clothing, Perfumes |4.6/5 |3 |
|Megaplex |Shoes, Cosmetics, Medicines, Sports|4.2/5 |4 |
+----------+-----------------------------------+-------+-------------+